[Z3-zemantic] thoughts about zodb backend storage
Michel Pelletier
michel at dialnetwork.com
Tue Mar 29 21:40:01 MEST 2005
On Tuesday 29 March 2005 07:08 am, Tres Seaver wrote:
> Tarek Ziadé wrote:
> > I have indexed my mailbox today with zemantic (about 30 000 mails) in
> > the webmail and I had performance issues.
> >
> > Since the add code is not linear, it gets slower and slower as I index
> > all messages. At around 5000 mails indexed, its speed is
> > around 1 mail per second on my laptop. Around 10 000, the speed is more
> > like 1 minute per mail, so I had to stop the process.
>
> Your problem sounds RAM / swap related; I would suggest adding a
> "sub-commit" of your transaction ('commit(1)'), after every "batch" of
> mails (the number could be tuned, but try 100 to start).
I think Tres is right: your performance issue isn't caused by big triples,
but by adding all of this RDF in one database transaction. As he suggested,
committing a sub-transaction per batch of mails will help keep the speed up
and the RAM consumption down.
Mass-indexing of any kind into ZODB will have this problem. Also try indexing
in whole-transaction batches. This is much faster than even
sub-transactions, but the run as a whole is no longer atomic: if it gets
interrupted, the batches already committed stay in the database.
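As a rough sketch of the batching idea (not code from Zemantic or the
webmail), the loop below indexes messages in groups and calls a checkpoint
after each group. The names index_in_batches, index_one and checkpoint are
hypothetical. The caller would pass the sub-transaction commit Tres mentions
('commit(1)') or a full commit as the checkpoint, depending on the ZODB
version in use.

```python
from itertools import islice


def index_in_batches(messages, index_one, checkpoint, batch_size=100):
    """Index every message, calling checkpoint() after each batch.

    index_one and checkpoint are hypothetical callables supplied by the
    caller; checkpoint would wrap a sub-transaction or full commit.
    Returns the number of messages indexed.
    """
    iterator = iter(messages)
    total = 0
    while True:
        # Pull at most batch_size messages without loading the whole list.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for message in batch:
            index_one(message)
        total += len(batch)
        # Runs after every batch, including the final, possibly short one.
        checkpoint()
    return total
```

The batch size of 100 is only the starting value suggested above and would
need tuning against real data.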
And third, there is a little-known trick one can find in the Zope 2 "Find"
machinery that deactivates objects from memory after you are "done" with
them. This trick could be used to deactivate messages and other gunk once
you have indexed them.
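A minimal sketch of that kind of trick, assuming the objects are ZODB
persistent objects: _p_changed is None while an object is still a ghost, and
_p_deactivate() turns a loaded, unmodified object back into one. The helper
name index_and_release and the index_one callable are hypothetical, and this
is written in the spirit of the "Find" code rather than copied from it.

```python
def index_and_release(obj, index_one):
    """Index obj, then release it again if indexing is what loaded it.

    index_one is a hypothetical callable that does the actual indexing.
    Returns True if the object was released.
    """
    # A ghost (not yet loaded) persistent object has _p_changed set to None.
    # Non-persistent objects have no _p_changed, so they are never released.
    was_ghost = getattr(obj, "_p_changed", False) is None
    index_one(obj)
    # Only release objects that we loaded ourselves and that were not
    # modified in the meantime, so pending changes are never touched.
    if was_ghost and not obj._p_changed:
        obj._p_deactivate()
        return True
    return False
```

Objects that were already loaded before indexing are left alone, on the
assumption that some other code is still using them.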
> > the reason is that, beside other triples I have this big triple :
> >
> > Message / body is / Message body
> >
> > this generates *big* triple statements, besides the text indexing
>
> The "body-is" triple doesn't "feel right" to me. I would rather use a
> separate text index for the bodies (perhaps actually a "SearchableText"
> style aggregate) and then work out how to intersect the results of an
> RDF-based Zemantic query with results from that index.
Right, as I mentioned a couple of emails back, soon Zemantic (for 3.1) will
delegate all _interpretive_ indexing to another catalog, so things like text
searches and date ranges will be possible. How these interfaces will look I'm
not sure yet; Dan and I are still working on some of the guts. But this
draws a clear line: Zemantic is just responsible for holding the "shape" of
the RDF graph. Zemantic will _never_ interpret this shape! Interpretation is
application-specific.
But I don't think "big triples" are really a huge problem. There is the
obvious waste if the body of the message is easily gotten from the message,
but a "big triple" per se should not be a problem any more than any other
kind of big object stored as a BTree value (within reason, of course! Triples
approaching the size of your virtual memory are going to be a problem!)
> > This is more likely to be a conceptual problem in the webmail program,
> > but thinking about how it could go faster can't be a bad thing, as the
> > problem might arise in big zemantic storages in classical use cases.
> >
> > idea :
> >
> > I just had that feeling (this can be totally wrong as I don't know
> > anything about BTrees and I don't know what is actually stored in an
> > OIBTree (is the full Literal stored?) ) :
>
> Yes.
Yep, a BTree is just a fancy, persistent dictionary.
>
> > the reverse OIBTree could be skipped if the ids that are generated for
> > the forward IOBTree were md5 hash keys calculated from subject,
> > predicate and object; then any search that is currently made, for
> > example with "r.has_key(object)", could be replaced by
> > "f.has_key(object_md5_key)"
>
> That would be a "saner" predicate.
But it wouldn't be the same RDF graph, and that would be Zemantic playing a
dirty trick (i.e., an "interpretation"). Zemantic cannot interpret data in any
way; if you want to stick a bunch of big honking literals in Zemantic then it
must let you do it. MD5 hashing would be an "interpretation". While it is true
that big triples would be made much smaller with this trick, it collapses the
literal value, which RDF defines as atomic.
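To make the two lookup schemes under discussion concrete, here is a small
sketch in which plain dicts stand in for the IOBTree / OIBTree pair. The
layout, the triple_key helper and the sample data are assumptions made for
illustration only, not Zemantic's actual storage code. The `in` operator is
used where the thread writes has_key.

```python
import hashlib


def triple_key(subject, predicate, obj):
    """Hypothetical id: MD5 hex digest of the three terms, NUL-separated."""
    joined = "\0".join((subject, predicate, obj))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


body = "a long message body " * 1000   # stand-in for a big literal
triple = ("msg:1", "body-is", body)

# Two-index layout; plain dicts stand in for the IOBTree / OIBTree pair.
forward = {1: triple}   # integer id -> triple
reverse = {body: 1}     # literal -> integer id (holds the literal as a key)

# Proposed layout: one mapping keyed by the digest, no reverse index.
key = triple_key(*triple)
hashed_forward = {key: triple}

found_by_literal = body in reverse        # lookup by the literal itself
found_by_digest = key in hashed_forward   # lookup by the 32-character digest
```

Both membership tests succeed, but in the second scheme the key is a
fixed-size digest from which the literal cannot be read back.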
In the near future, when 3.1 is released, this problem will go away: you'll put
your text-indexable content in a catalog and your RDF graph in a Zemantic,
and probably not stuff the entire message body into the RDF graph.
-Michel