[Z3-zemantic] thoughts about zodb backend storage

Michel Pelletier michel at dialnetwork.com
Tue Mar 29 21:40:01 MEST 2005


On Tuesday 29 March 2005 07:08 am, Tres Seaver wrote:
> Tarek Ziadé wrote:
> > I have indexed my mailbox today with zemantic (about 30 000 mails) in
> > the webmail and I had performance issues.
> >
> > Since the add code is not linear, it gets slower and slower as I index
> > all messages. At around 5000 mails indexed, the speed is about
> > 1 mail per second on my laptop. Around 10 000, it is closer to
> > 1 minute per mail, so I had to stop the process.
>
> Your problem sounds RAM / swap related;  I would suggest adding a
> "sub-commit" of your transaction ('commit(1)'), after every "batch" of
> mails (the number could be tuned, but try 100 to start).

I think Tres is right: your performance issue isn't caused by big triples, 
but by adding all of this RDF in one database transaction.  As he 
suggested, committing a sub-transaction per batch of mails will help keep 
the speed up and the RAM consumption down.

Mass-indexing of any kind into ZODB will have this problem.  Also try indexing 
in whole-transaction batches.  This is much faster than even 
sub-transactions, but has the problem of not being atomic if it gets 
interrupted.
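
A minimal sketch of the batching idea, in Python.  The names
index_in_batches, index_one and checkpoint are hypothetical, not Zemantic or
ZODB API.  checkpoint is whatever the caller wants to happen at the end of
each batch: the sub-transaction commit Tres mentions ('commit(1)'), or a
full commit for whole-transaction batches.

```python
def index_in_batches(messages, index_one, checkpoint, batch_size=100):
    """Index every message, calling checkpoint() after each batch.

    index_one and checkpoint are caller-supplied callables (hypothetical
    names).  checkpoint might do a sub-transaction commit or a full commit.
    """
    done = 0
    for message in messages:
        index_one(message)
        done += 1
        if done % batch_size == 0:
            checkpoint()
    if done % batch_size:
        checkpoint()  # flush the final, partial batch
    return done


# Usage with stand-ins: a list as the "index", a list append as the "commit".
indexed = []
commits = []
total = index_in_batches(range(250), indexed.append,
                         lambda: commits.append(len(indexed)))
```

With full commits as the checkpoint, an interrupted run leaves the earlier
batches committed, so a rerun has to know how to skip what is already
indexed.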

And third, there is a little-known trick in the Zope 2 "Find" machinery 
that deactivates objects from memory after you are "done" with them.  This 
trick could be used to deactivate messages and other gunk once you have 
indexed them.
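
A rough sketch of that kind of deactivation, assuming the objects follow the
ZODB persistence convention that _p_changed is None for a "ghost" (state not
loaded) and that _p_deactivate() releases the state again.  index_and_release
and FakePersistent are hypothetical names; the fake class only stands in for
a real persistent object so the sketch runs on its own.

```python
def index_and_release(obj, index_one):
    """Index obj, then release its state if it was a ghost beforehand.

    Objects that were already loaded, or that are not persistent at all,
    are left alone so we do not evict something another caller is using.
    """
    was_ghost = getattr(obj, "_p_changed", False) is None
    index_one(obj)
    if was_ghost:
        obj._p_deactivate()
    return was_ghost


class FakePersistent:
    """Stand-in for a persistent object, only for this demonstration."""

    def __init__(self):
        self._p_changed = None   # starts out as a ghost
        self.body = None

    def load(self):
        self._p_changed = False  # state loaded, not modified
        self.body = "hello"

    def _p_deactivate(self):
        self._p_changed = None   # back to ghost, state dropped
        self.body = None


msg = FakePersistent()
released = index_and_release(msg, lambda o: o.load())
```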

> > the reason is that, beside other triples I have this big triple :
> >
> > Message / body is / Message body
> >
> > this generates *big* triple statements, besides the text indexing
>
> The "body-is" triple doesn't "feel right" to me.  I would rather use a
> separate text index for the bodies (perhaps actually a "SearchableText"
> style aggregate) and then work out how to intersect the results of an
> RDF-based Zemantic query with results from that index.

Right, as I mentioned a couple of emails back, soon Zemantic (for 3.1) will 
delegate all _interpretive_ indexing to another catalog, so things like text 
searches and date ranges will be possible.  How these interfaces will look 
I'm not sure yet; Dan and I are still working on some of the guts.  But this 
draws a clear line: Zemantic is just responsible for holding the "shape" of 
the RDF graph, and Zemantic will _never_ interpret this shape!  
Interpretation is application specific.
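
Since the interfaces are not settled, the following is only an illustration
of the intersection Tres describes.  It assumes both the RDF query and the
text index can return some shared identifier per message; intersect_results
and the "msg-N" ids are made up for the example.

```python
def intersect_results(graph_hits, text_hits):
    """Return the ids found by both queries, sorted for stable output."""
    return sorted(set(graph_hits) & set(text_hits))


graph_hits = ["msg-1", "msg-2", "msg-5"]   # e.g. from an RDF graph query
text_hits = ["msg-2", "msg-3", "msg-5"]    # e.g. from a text index
both = intersect_results(graph_hits, text_hits)
```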

But I don't think "big triples" is really a huge problem.  There is the 
obvious waste if the body of the message is easily gotten from the message, 
but a "big triple" per se should not be a problem any more than any other 
kind of big object as a BTree value (within reason of course!  triples 
approaching the size of your virtual memory are going to be a problem!)

> > This is more likely to be a conceptual problem in the webmail program,
> > but thinking about how it could go faster can't be a bad thing, as the
> > problem might arise in big Zemantic storages in classical use cases.
> >
> > idea :
> >
> > I just had that feeling (this can be totally wrong as I don't know
> > anything about BTrees and I don't know what is actually stored in an
> > OIBTree (the full Literal is stored ?) ) :
>
> Yes.

Yep, a BTree is just a fancy, persistent dictionary.
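
To make the dictionary analogy concrete, here is a small sketch with plain
dicts standing in for the forward IOBTree and the reverse OIBTree.  The
layout is my reading of this thread, and TwoWayIndex and intern are
hypothetical names, not Zemantic code.  Note that the full literal ends up
as a value in one map and as a key in the other.

```python
class TwoWayIndex:
    """Forward (int -> value) and reverse (value -> int) maps kept in step."""

    def __init__(self):
        self.forward = {}   # stand-in for an IOBTree
        self.reverse = {}   # stand-in for an OIBTree
        self._next_id = 1

    def intern(self, value):
        """Return the int id for value, assigning one on first sight."""
        if value in self.reverse:
            return self.reverse[value]
        ident = self._next_id
        self._next_id += 1
        self.forward[ident] = value
        self.reverse[value] = ident
        return ident


index = TwoWayIndex()
body = "a long message body " * 1000
body_id = index.intern(body)
```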

>
> > the reverse OIBTree could be skipped if the ids that are generated for
> > the forward IOBTree were md5 hash keys calculated from subject,
> > predicate and object; then any search that is currently made with,
> > for example, "r.has_key(object)" could be replaced by
> > "f.has_key(object_md5_key)"
>
> That would be a "saner" predicate.

But it wouldn't be the same RDF graph, and that would be Zemantic playing a 
dirty trick (i.e., an "interpretation").  Zemantic cannot interpret data in 
any way; if you want to stick a bunch of big honking literals in Zemantic 
then it must let you do it.  MD5 hashing would be an "interpretation".  While 
it is true that big triples would be made much smaller with this trick, it 
collapses the literal value, which RDF defines as atomic.
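
For reference, a sketch of the hashing proposal under discussion, using
Python's hashlib.  triple_key is a hypothetical helper, and joining the three
parts with a NUL separator is an assumption made for the example, not
something from the original suggestion.

```python
import hashlib


def triple_key(subject, predicate, obj):
    """Fixed-size key for a triple (the NUL separator is an assumption)."""
    joined = "\x00".join((subject, predicate, obj))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


small = triple_key("msg:1", "body-is", "hi")
large = triple_key("msg:1", "body-is", "hi" * 100000)
```

Both keys are 32 hex characters no matter how big the body is, which is
where the size saving comes from.  The digest is one-way, though: the
literal cannot be read back out of the key, so the key alone no longer
carries the literal's value.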

In the near future, when 3.1 is released, this problem will go away: you'll 
put your text-indexable content in a catalog and your RDF graph in a 
Zemantic, and probably not stuff the entire message body into the RDF graph.

-Michel



