[Z3-zemantic] Re: big zemantic storage / some changes

Tarek Ziadé tziade at nuxeo.com
Fri Apr 15 17:43:19 MEST 2005


Michel Pelletier wrote:

>On Tuesday 05 April 2005 06:23 am, Tarek Ziadé wrote:
>
>Hey Tarek, I'm cc:ing the zemantic list so this doesn't get lost...
>
>  
>
>>Hi Michel
>>
>>how are you doing ?
>>
>>I have some feedback from a big zemantic repository (my real mailbox) :)
>>
>>Everything works pretty well except:
>>
>>+ the clear method: it takes ages
>>+ simple queries can be slow when they retrieve more than 10k results
>>
>>
>>I have made a few tests and came up with this:
>>
>>+ the clear method becomes nearly instant when done like this:
>>
>>
>>def clear(self, backend=None):
>>    """Clear zemantic."""
>>    # Re-run __init__ rather than removing the triples one by one.
>>    self.__init__(backend)
>>
>>(instead of removing each triple one by one)
>>    
>>
>
>Right, I removed them one by one because that's what rdflib did, but your 
>idea above is superior.
>
>  
>
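To make the difference concrete, here is a minimal sketch of the two ways
of clearing. TripleStore is a hypothetical stand-in, not the real Zemantic
class, and it assumes the backend is a plain set of triples. With a
persistent backend, re-running __init__ may only drop the reference to the
old data rather than delete it, so take this as an illustration only.

```python
class TripleStore:
    """Hypothetical stand-in for the Zemantic store (not the real class)."""

    def __init__(self, backend=None):
        # Assumption: the backend is a set-like container of triples.
        self.backend = backend if backend is not None else set()

    def add(self, triple):
        self.backend.add(triple)

    def __len__(self):
        return len(self.backend)

    def clear_one_by_one(self):
        # Slow path: one removal per stored triple.
        for triple in list(self.backend):
            self.backend.remove(triple)

    def clear(self, backend=None):
        # Fast path: re-run __init__, which installs a fresh empty backend.
        self.__init__(backend)
```

The fast path does a constant amount of work, while the slow path grows
with the number of triples.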
>>+ I have introduced a "max results" limit in the query, so the query
>>  stops as soon as the maximum number of results is reached.
>>
>>For example, searching my mails for the word "cps" retrieves billions
>>of entries, so I decided to cut the result off and show only the first
>>100 entries. (This is enough for the user anyway; I tell him to make a
>>more precise search.)
>>This makes it very fast
>>
>>(the first 100 entries for "cps" take 200 ms; the whole result set
>>takes ages)
>>
>>what's your opinion on these points ?
>>    
>>
>
>I have no problem with limiting search results, the question is, are the first 
>100 results the most relevant?  When I transition Zemantic to use an external 
>catalog instead of an internal text index, we'll need to look at the 
>relevance code (the "score") to make sure that if we limit search results, 
>the results that are returned are the most relevant.
>
>BTW, have you seen the recent changes to rdflib?  They're big changes and 
>Zemantic will correspondingly change quite a bit. You might want to stay 
>on the version we developed at the sprint, or upgrade (soon) to the 
>new version once I'm done writing it ;)
>  
>
Hi Michel,

About relevancy (I am not sure this word exists in English :) ):

I have assumed that if there are more than 100 mails in the results, then 
the search itself is not relevant, so the first 100 results are not 
likely to be the most relevant ones, but more like a "sample".
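
The "max results" cut-off could look roughly like the sketch below. It
assumes the query can be written as a lazy generator over
(subject, predicate, object) triples. The names search and limited_search
are made up for this mail and are not part of Zemantic or rdflib.

```python
from itertools import islice


def search(triples, word):
    """Lazily yield the triples whose object contains `word`.

    Hypothetical matcher: a real text index would not scan linearly.
    """
    for subject, predicate, obj in triples:
        if word in obj:
            yield (subject, predicate, obj)


def limited_search(triples, word, max_results=100):
    """Return at most `max_results` matches, in storage order."""
    # islice stops pulling from the generator once the limit is reached,
    # so the rest of the store is never visited.
    return list(islice(search(triples, word), max_results))
```

The results come back in storage order, not by score, which is why I call
them a "sample". Returning the 100 best-scored results would mean scoring
every match first, so the speed-up would be lost.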


Recent changes:
Cool, I can't wait for this new version then :)

Backends/ZODB thoughts:

I've come to the conclusion that it would be better for scalability to 
move some data out of the ZODB.

I was thinking about making some kind of "lightweight directory storage" 
to store the zemantic index data *and* the other mail parts I get from 
the IMAP server, to keep the Zope mail object as light as possible.

The idea is to get rid of the whole transaction mechanism for data that 
does not really need it, like the read-only parts of emails and the 
index data: once created or downloaded, they will never change anyway.
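
A very rough sketch of what I mean by a write-once directory storage, with
plain files and no transactions. The class name, the one-file-per-key
layout and the hashed file names are all assumptions for illustration.
Nothing like this exists in Zemantic today.

```python
import hashlib
import os


class WriteOnceDirectoryStorage:
    """Hypothetical write-once blob store kept outside the ZODB."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Hash the key so any string (e.g. an IMAP UID) gives a safe name.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.root, digest)

    def put(self, key, data):
        """Store `data` (bytes) under `key`; never overwrite."""
        path = self._path(key)
        if os.path.exists(path):
            return False  # already stored, and it never changes
        with open(path, "wb") as f:
            f.write(data)
        return True

    def get(self, key):
        with open(self._path(key), "rb") as f:
            return f.read()
```

Since nothing is ever rewritten, there is no undo or conflict handling to
pay for. The sketch ignores concurrent writers and partial writes.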

But if zemantic is going to be based on some kind of external catalog, 
maybe I could also use this catalog to store the raw data for each mail. 
The idea would be to link regular ZODB objects to properties whose 
values are stored outside Zope.

I'm pretty sure a bit of magic is possible to bind some Zope object 
attributes to an external storage and hide them from the ZODB.
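
One possible form of this "magic" is a Python descriptor. The sketch below
is untested against a real ZODB. ExternalAttribute, EXTERNAL and Mail are
made-up names, a plain dict stands in for the external storage, and I am
assuming that persistence only saves what sits in the instance __dict__.

```python
class ExternalAttribute:
    """Descriptor keeping a value in an external mapping, not on the object."""

    def __init__(self, storage, name):
        self.storage = storage  # assumption: any dict-like external store
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return self.storage[(obj.uid, self.name)]

    def __set__(self, obj, value):
        # Defining __set__ makes this a data descriptor, so the value
        # never lands in the instance __dict__.
        self.storage[(obj.uid, self.name)] = value


EXTERNAL = {}  # stand-in for the external catalog or directory storage


class Mail:
    """Hypothetical mail object; a real one would subclass Persistent."""

    body = ExternalAttribute(EXTERNAL, "body")

    def __init__(self, uid, subject):
        self.uid = uid
        self.subject = subject
```

With this, mail.body reads and writes go to the external store, while
mail.subject stays a normal attribute of the object.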

I have the feeling that this would speed things up a lot, but my 
knowledge of the ZODB is not good enough to be sure that what I am 
saying is not totally wrong.  :)


Tarek




>-Michel
>
>  
>
>>Cheers
>>
>>Tarek
>>    
>>
