[lxml-dev] Segmentation fault in lxml.html after pickling

Ian Bicking ianb at colorstudy.com
Tue Jul 1 02:48:30 CEST 2008


Stefan Behnel wrote:
> Martijn Faassen wrote:
>> I'd love it if I could somehow store lxml trees in the ZODB, and that'd
>> need pickle support. Whether it could be made to be efficient I don't
>> know - you'd not want the whole tree to be pickled as a whole in case of
>> large trees, but some form of partitioning scheme into separate pickles.
>> You're right that custom-element binding would be nice in this case, and
>> that means the pickle can't simply be the XML content unless it's
>> somehow annotated first.
>>
>> Anyway, this is a rather out there use case. I am just intrigued to
>> learn that objectify elements can be pickled.
> 
> It's just easier to do in objectify, as it has a pretty comprehensive
> setup for Element class mapping. If you want to be sure to get back
> exactly the same Element tree after pickling, you can just annotate() an
> objectify tree before pickling it.
> 
> Doing the same thing in lxml.etree would require storing some information
> about the current Element lookup, which may be a lot of information, e.g.
> for the namespace class setup. That's a parser-local setup, so we can't
> just use the setup of the default parser either but need a concrete
> context for the unpickling.
> 
> lxml.html might be considered having such a context in a similar way
> lxml.objectify has it, as it comes with its own classes and lookup scheme.

Just what would end up being pickled, do you think?  The entire document?

A first thought is that the document gets pickled, and then the element 
is an offset in that document.  Like, erm...

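class HtmlMixin:
    def __getstate__(self):
        # Pickle the whole document plus the path from the root to this element.
        return (self.getroottree(), self._indexes_to_self())

    def _indexes_to_self(self):
        # Child indexes leading from the root element down to self.
        result = []
        el = self
        parent = el.getparent()
        # Explicit None test: an lxml element's truth value reflects its
        # child count, not whether it exists.
        while parent is not None:
            result.insert(0, parent.index(el))
            el = parent
            parent = el.getparent()
        return result

    def __setstate__(self, state):
        # Dammit... this doesn't actually work:
        doc, indexes_to_self = state
        el = doc.getroot()
        for index in indexes_to_self:
            el = el[index]
        return el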

There is no return value for __setstate__, and no way to indicate a 
constructor method for creating instances.  That's dumb.  I don't like 
pickle.
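
One possible way around the __setstate__ limitation is pickle's
__reduce__ hook, which lets an object return a callable plus arguments
that pickle calls at load time to build the instance.  The sketch below
is only an illustration under stated assumptions: it uses the standard
library's xml.etree.ElementTree as a stand-in for lxml, and ElementRef
and _restore_ref are hypothetical names, not part of lxml or lxml.html.

```python
import pickle
import xml.etree.ElementTree as ET


def _restore_ref(xml_text, indexes):
    # Hypothetical module-level constructor; pickle calls it on load.
    return ElementRef(ET.fromstring(xml_text), indexes)


class ElementRef:
    """Hypothetical handle: a document root plus a child-index path."""

    def __init__(self, root, indexes):
        self.root = root
        self.indexes = tuple(indexes)

    @property
    def element(self):
        # Walk down from the root, one child index per level.
        el = self.root
        for index in self.indexes:
            el = el[index]
        return el

    def __reduce__(self):
        # Store the serialized document and the path, not the live tree.
        xml_text = ET.tostring(self.root, encoding="unicode")
        return (_restore_ref, (xml_text, self.indexes))


if __name__ == "__main__":
    root = ET.fromstring("<html><body><p>one</p><p>two</p></body></html>")
    ref = ElementRef(root, [0, 1])
    restored = pickle.loads(pickle.dumps(ref))
    print(restored.element.text)  # two
```

The callable has to be importable by name when unpickling, which is why
_restore_ref lives at module level rather than inside the class.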

For documents, if the pickle hooks worked reasonably I'd just store the 
serialization of the document (as a string) plus all the special 
attributes (doctype, url, etc).  Given that the hooks don't work 
reasonably, I'm not sure how to do it; maybe people with enough ZODB 
experience to have hit this problem have an idea?

From what I can tell there's no reason to store the document as 
anything but a string -- serializing and re-parsing the string is faster 
than any other means of storing a document (it all ends up as strings 
eventually anyway).
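
A minimal sketch of the string-plus-attributes idea, assuming a plain
wrapper object around the tree so that the ordinary __getstate__ and
__setstate__ hooks are enough.  It uses the standard library's
xml.etree.ElementTree as a stand-in for lxml; PicklableDocument and its
url and doctype attributes are hypothetical names for illustration only.

```python
import pickle
import xml.etree.ElementTree as ET


class PicklableDocument:
    """Hypothetical wrapper: a parsed tree plus extra document attributes."""

    def __init__(self, root, url=None, doctype=None):
        self.root = root
        self.url = url
        self.doctype = doctype

    def __getstate__(self):
        # Replace the live tree with its serialization; keep the rest as-is.
        state = self.__dict__.copy()
        state["root"] = ET.tostring(self.root, encoding="unicode")
        return state

    def __setstate__(self, state):
        # Re-parse the stored string to rebuild the tree on load.
        state = dict(state)
        state["root"] = ET.fromstring(state["root"])
        self.__dict__.update(state)


if __name__ == "__main__":
    doc = PicklableDocument(
        ET.fromstring("<html><body><p>hi</p></body></html>"),
        url="http://example.com/",
    )
    clone = pickle.loads(pickle.dumps(doc))
    print(clone.url, clone.root.find("body/p").text)
```

This works here only because the wrapper itself is the pickled object;
it does not address pickling an individual element inside the tree.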

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
