[Z3-zemantic] thoughts about ZODB backend storage

Tres Seaver tseaver at zope.com
Tue Mar 29 22:23:04 MEST 2005


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michel Pelletier wrote:

> I think Tres is right: your performance issue isn't caused by big triples, 
> but by adding all of this RDF in one database transaction.  As 
> he suggested, committing sub-transactions per batch of mails will help keep 
> the speed up and the RAM consumption down.

I'd be willing to bet that the whole thing would go faster even without
batching if Tarek just dropped the "body" statement.

> Mass-indexing of any kind into ZODB will have this problem.  Also try indexing 
> in whole-transaction batches.  This is much faster than even 
> sub-transactions, but has the problem of not being atomic if it gets 
> interrupted.
> 
> And third, there is a little-known trick one can find in the Zope 2 "Find" 
> machinery that de-activates objects from memory after you are "done" with 
> them.  This trick could be used to deactivate messages and other gunk once 
> you have indexed them.
> 
> 
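
The batching advice quoted above can be sketched roughly as follows. This is
only an illustration, not code from Zemantic or the webmail program: the names
index_in_batches, index_one and commit are hypothetical, and the commit
callback is a stand-in for whatever transaction or sub-transaction commit the
application actually uses with ZODB.

```python
from itertools import islice


def index_in_batches(messages, index_one, commit, batch_size=100):
    """Index every message, calling commit() once per batch.

    index_one and commit are caller-supplied callables (assumptions for
    this sketch); commit stands in for a real transaction commit.
    """
    iterator = iter(messages)
    total = 0
    while True:
        # Pull at most batch_size messages without loading the whole mailbox.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for message in batch:
            index_one(message)
        # One commit per batch keeps the set of modified objects small.
        commit()
        total += len(batch)
    return total
```

With 250 messages and a batch size of 100, commit is called three times: after
100, 200 and 250 messages. If the run is interrupted, the batches already
committed stay committed, which is the loss of atomicity mentioned above.
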
>>>the reason is that, beside other triples I have this big triple :
>>>
>>>Message / body is / Message body
>>>
>>>this generates *big* triple statements, besides the text indexing
>>
>>The "body-is" triple doesn't "feel right" to me.  I would rather use a
>>separate text index for the bodies (perhaps actually a "SearchableText"
>>style aggregate) and then work out how to intersect the results of an
>>RDF-based Zemantic query with results from that index.
> 
> 
> Right, as I mentioned a couple emails back, soon Zemantic (for 3.1) will 
> delegate all _interpretive_ indexing to another catalog.  So things like text 
> searches and date ranges will be possible.  How these interfaces will look I'm 
> not sure yet, Dan and I are still working on some of the guts.  But this 
> draws a clear line: Zemantic is just responsible for holding the "shape" of 
> the RDF graph, Zemantic will _never_ interpret this shape!  Interpretation is 
> application specific.
> 
> But I don't think "big triples" is really a huge problem.  There is the 
> obvious waste if the body of the message is easily gotten from the message, 
> but a "big triple" per se should not be a problem any more than any other 
> kind of big object as a BTree value (within reason of course!  triples 
> approaching the size of your virtual memory are going to be a problem!)
> 
> 
>>>This is more likely to be a conceptual problem in the webmail program,
>>>but thinking about how it could go faster can't be a bad thing, as the
>>>problem might arise in big Zemantic storages in classical use cases.
>>>
>>>idea :
>>>
>>>I just had this feeling (it could be totally wrong, as I don't know
>>>anything about BTrees and I don't know what is actually stored in an
>>>OIBTree (is the full Literal stored?)):
>>
>>Yes.
> 
> 
> Yep, a BTree is just a fancy, persistent dictionary.

C'mon, you have to admit that the "reverse" index for a 'body' predicate
is absurd.  Having Zemantic store the bodies as "keys" in a BTree is
ridiculous.

>>>the reverse OIBTree could be skipped if the ids that are generated for
>>>the forward IOBTree were MD5 hash keys calculated from subject,
>>>predicate and object; then any search that is currently made with, for
>>>example, "r.has_key(object)" could be replaced by
>>>"f.has_key(object_md5_key)"
>>
>>That would be a "saner" predicate.
> 
> 
> But it wouldn't be the same RDF graph, and that would be Zemantic playing a 
> dirty trick (ie, an "interpretation").  Zemantic cannot interpret data in any 
> way, if you want to stick a bunch of big honking literals in Zemantic then it 
> must let you do it.  MD5 hashing would be an "interpretation".  While it is 
> true that big triples would be made much smaller with this trick, it collapses 
> the literal value, which RDF defines as atomic.
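
For readers following the MD5 proposal, here is a minimal sketch of the idea
using a plain dict in place of the BTrees. Everything in it (triple_key,
add_triple, has_triple, the forward dict) is hypothetical and is not
Zemantic's API. It only shows the existence check for a complete triple; it
does not answer "which triples have this object?", which a reverse index
would.

```python
import hashlib

# Stand-in for the forward index: digest -> (subject, predicate, object).
forward = {}


def triple_key(subject, predicate, obj):
    """Return a hex MD5 digest computed from the three parts of a triple."""
    digest = hashlib.md5()
    for part in (subject, predicate, obj):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") differ.
        digest.update(str(len(data)).encode("ascii") + b":" + data)
    return digest.hexdigest()


def add_triple(subject, predicate, obj):
    """Store the triple under its digest and return that digest."""
    key = triple_key(subject, predicate, obj)
    forward[key] = (subject, predicate, obj)
    return key


def has_triple(subject, predicate, obj):
    """Existence check by digest instead of by the full literal."""
    return triple_key(subject, predicate, obj) in forward
```

In this sketch the key stays 32 hex characters long however large the body
is, while the full literal is still kept as a value in the forward mapping.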

I don't think that we *want* to store the body in Zemantic.

> In the near future when 3.1 is released, this problem will go away, you'll put 
> your text-indexable content in a catalog and your RDF graph in a Zemantic 
> and probably not stuff the entire message body into the RDF graph.

Then we will still need a way to join the results of a query against
Zemantic's own triples with a query against the external text index.
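
One simple form such a join could take, assuming both queries can be made to
return comparable subject identifiers, is a set intersection. The function
name and the identifiers below are made up for illustration; neither Zemantic
nor the catalog is known to expose results in this form.

```python
def join_results(rdf_subjects, text_hits):
    """Return the subjects found by both queries, in RDF-result order.

    rdf_subjects: iterable of subject ids from the triple query (assumed).
    text_hits: iterable of subject ids from the text index (assumed).
    """
    hits = set(text_hits)
    return [subject for subject in rdf_subjects if subject in hits]
```

For example, joining ["msg:1", "msg:2", "msg:3"] with text hits
{"msg:2", "msg:3", "msg:9"} keeps only "msg:2" and "msg:3".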

Tres.
- --
===============================================================
Tres Seaver                                tseaver at zope.com
Zope Corporation      "Zope Dealers"       http://www.zope.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFCSbkoGqWXf00rNCgRAg44AJ0UOvZpHe5LDGUnug/KIUIhdSIp4gCePbnF
bcYhGfyToD6bNUGI10u8RrU=
=nsPq
-----END PGP SIGNATURE-----

