[Z3-zemantic] thoughts about zodb backend storage
Tres Seaver
tseaver at zope.com
Tue Mar 29 22:23:04 MEST 2005
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Michel Pelletier wrote:
> I think Tres is right, your performance issue isn't related to big triples,
> but because you are adding all of this RDF in one database transaction. As
> he suggested, committing sub-transactions per batch of mails will help keep
> the speed up and the RAM consumption down.
I'd be willing to bet that the whole thing would go faster even without
batching if Tarek just dropped the "body" statement.
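
For reference, the per-batch commit idea quoted above can be sketched in a
few lines. This is only an illustration under assumptions: index_one and
commit are hypothetical callables supplied by the caller (in a real ZODB
application, commit would be whatever commits the transaction or
sub-transaction), and the batch size of 100 is arbitrary.

```python
def index_in_batches(items, index_one, commit, batch_size=100):
    """Index items, calling commit() after every batch_size items.

    index_one and commit are hypothetical callables supplied by the caller.
    Returns the number of commits performed.
    """
    commits = 0
    pending = 0
    for item in items:
        index_one(item)
        pending += 1
        if pending >= batch_size:
            commit()
            commits += 1
            pending = 0
    if pending:
        commit()  # flush the final, partial batch
        commits += 1
    return commits
```

Committing per batch bounds how much unsaved state piles up, at the cost of
losing all-or-nothing behaviour if the run is interrupted partway.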
> Mass-indexing of any kind into ZODB will have this problem. Also try indexing
> in whole-transaction batches. This is much faster than even
> sub-transactions, but has the problem of not being atomic if it gets
> interrupted.
>
> And third, there is a little-known trick one can find in the Zope 2 "Find"
> machinery that de-activates objects from memory after you are "done" with
> them. This trick could be used to deactivate messages and other gunk once
> you have indexed them.
>
>
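
A rough sketch of that deactivation trick, not the actual Find code: it
relies on the persistence attributes _p_changed (None while an object is
an unloaded "ghost") and _p_deactivate(). The visit callable and the
FakePersistent stand-in class are hypothetical, included only so the
example runs without ZODB.

```python
def visit_and_release(objects, visit):
    """Call visit(ob) on each object, then turn it back into a ghost
    if it was a ghost before we touched it. Returns how many were released."""
    released = 0
    for ob in objects:
        # In ZODB, _p_changed is None while an object is an unloaded ghost.
        was_ghost = getattr(ob, "_p_changed", False) is None
        visit(ob)
        if was_ghost and hasattr(ob, "_p_deactivate"):
            ob._p_deactivate()
            released += 1
    return released


class FakePersistent:
    """Hypothetical stand-in mimicking the two attributes used above."""

    def __init__(self):
        self._p_changed = None   # starts out as a ghost
        self.loaded = False

    def load(self):
        self.loaded = True
        self._p_changed = False  # loaded but unmodified

    def _p_deactivate(self):
        self.loaded = False
        self._p_changed = None
```

Only objects that were ghosts beforehand are released, so anything the
caller had already loaded for its own purposes stays in memory.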
>>>the reason is that, besides other triples, I have this big triple:
>>>
>>>Message / body is / Message body
>>>
>>>this generates *big* triple statements, besides the text indexing
>>
>>The "body-is" triple doesn't "feel right" to me. I would rather use a
>>separate text index for the bodies (perhaps actually a "SearchableText"
>>style aggregate) and then work out how to intersect the results of an
>>RDF-based Zemantic query with results from that index.
>
>
> Right, as I mentioned a couple emails back, soon Zemantic (for 3.1) will
> delegate all _interpretive_ indexing to another catalog. So things like text
> searches and date ranges will be possible. How these interfaces will look I'm
> not sure yet; Dan and I are still working on some of the guts. But this
> draws a clear line: Zemantic is just responsible for holding the "shape" of
> the RDF graph, Zemantic will _never_ interpret this shape! Interpretation is
> application specific.
>
> But I don't think "big triples" is really a huge problem. There is the
> obvious waste if the body of the message is easily gotten from the message,
> but a "big triple" per se should not be a problem any more than any other
> kind of big object as a BTree value (within reason of course! triples
> approaching the size of your virtual memory are going to be a problem!)
>
>
>>>This is more likely to be a conceptual problem in the webmail program,
>>>but thinking about how it could go faster can't be a bad thing, as the
>>>problem might arise in big Zemantic storages in classical use cases.
>>>
>>>idea :
>>>
>>>I just had that feeling (this can be totally wrong as I don't know
>>>anything about BTrees and I don't know what is actually stored in an
>>>OIBTree (is the full Literal stored?)):
>>
>>Yes.
>
>
> Yep, a BTree is just a fancy, persistent dictionary.
C'mon, you have to admit that the "reverse" index for a 'body' predicate
is absurd. Having Zemantic store the bodies as "keys" in a BTree is
ridiculous.
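
To make the complaint concrete, here is a toy version of a forward/reverse
literal index. It is a sketch under assumptions, not Zemantic's actual
storage code: plain dicts stand in for the IOBTree and OIBTree mentioned in
this thread, and TinyLiteralIndex and intern are made-up names.

```python
class TinyLiteralIndex:
    """Toy forward/reverse index; plain dicts stand in for IOBTree/OIBTree."""

    def __init__(self):
        self.forward = {}   # int id -> literal  (the IOBTree role)
        self.reverse = {}   # literal -> int id  (the OIBTree role)
        self._next_id = 0

    def intern(self, literal):
        """Return the id for literal, allocating one on first sight."""
        if literal in self.reverse:
            return self.reverse[literal]
        lid = self._next_id
        self._next_id += 1
        self.forward[lid] = literal
        self.reverse[literal] = lid   # the whole literal becomes a key
        return lid
```

With a message body as the literal, the full text ends up stored twice:
once as a value in the forward mapping and once as a key in the reverse one.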
>>>the reverse OIBTree could be skipped if the ids that are generated for
>>>the forward IOBTree were md5 hash keys calculated from subject,
>>>predicate and object; then any search that is currently made, for
>>>example with "r.has_key(object)", could be replaced by
>>>"f.has_key(object_md5_key)"
>>
>>That would be a "saner" predicate.
>
>
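
For readers following along, the hashed-key proposal amounts to something
like the sketch below. The function name statement_key and the choice of a
NUL separator are assumptions made for illustration; the proposal itself
only says "md5 hash keys calculated with subject, predicate and object".

```python
import hashlib


def statement_key(subject, predicate, obj):
    """Hypothetical key: MD5 hex digest of the three terms, NUL-separated."""
    data = "\0".join((subject, predicate, obj)).encode("utf-8")
    return hashlib.md5(data).hexdigest()
```

The key is always 32 hex characters no matter how large the object literal
is, and the digest is one-way: the literal cannot be recovered from the key.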
> But it wouldn't be the same RDF graph, and that would be Zemantic playing a
> dirty trick (ie, an "interpretation"). Zemantic cannot interpret data in any
> way, if you want to stick a bunch of big honking literals in Zemantic then it
> must let you do it. MD5 hashing would be an "interpretation". While it is true
> that big triples would be made much smaller with this trick, it collapses the
> literal value, which RDF defines as atomic.
I don't think that we *want* to store the body in Zemantic.
> In the near future when 3.1 is released, this problem will go away, you'll put
> your text-indexable content in a catalog and your RDF graph in a Zemantic
> and probably not stuff the entire message body into the RDF graph.
Then we will still need a way to join the results of a query against
Zemantic's own triples with a query against the external text index.
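
In its simplest form, such a join could be a plain intersection on a shared
message identifier. The sketch below assumes both queries return iterables
of the same kind of id; join_results and that shared-id convention are
hypothetical, since neither interface exists yet.

```python
def join_results(triple_hits, text_hits):
    """Intersect two result sets of message ids, keeping triple_hits order."""
    wanted = set(text_hits)
    return [mid for mid in triple_hits if mid in wanted]
```

For example, join_results(["m1", "m2", "m3"], ["m3", "m1", "m9"]) keeps only
"m1" and "m3", in the order the triple query returned them.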
Tres.
- --
===============================================================
Tres Seaver tseaver at zope.com
Zope Corporation "Zope Dealers" http://www.zope.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org
iD8DBQFCSbkoGqWXf00rNCgRAg44AJ0UOvZpHe5LDGUnug/KIUIhdSIp4gCePbnF
bcYhGfyToD6bNUGI10u8RrU=
=nsPq
-----END PGP SIGNATURE-----