From d.rothe at semantics.de Mon Feb 1 16:58:08 2010 From: d.rothe at semantics.de (Dirk Rothe) Date: Mon, 01 Feb 2010 16:58:08 +0100 Subject: [lxml-dev] exslt functions in xpath expressions In-Reply-To: <4B6459F1.9090600@behnel.de> References: <4B62B463.4040701@behnel.de> <4B6459F1.9090600@behnel.de> Message-ID: done: https://bugs.launchpad.net/lxml/+bug/515553 On Sat, 30 Jan 2010 17:10:25 +0100, Stefan Behnel wrote: > > Dirk Rothe, 30.01.2010 16:14: >> In [9]: print tree.xpath("/a[@b=str:split('12 34')]", namespaces={'str': >> "http://exslt.org/strings"}) >> [...] >> XPathEvalError: Unregistered function > > You're right, they are currently only available to XSLT. It seems that at > least the date, math, sets and string functions can be enabled in plain > XPath, but only from libxslt 1.1.25 onwards. That version was released on > 2009-09-17, so it's fairly recent. > > http://xmlsoft.org/XSLT/EXSLT/html/libexslt-exslt.html > > Could you file a feature request for this in the bug tracker? I should be > able to add support in lxml 2.3. > > Stefan From stefan_ml at behnel.de Mon Feb 1 20:18:02 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 01 Feb 2010 20:18:02 +0100 Subject: [lxml-dev] ElementTree 1.3a xpath position broken? In-Reply-To: References: <4B6526B9.9060007@behnel.de> Message-ID: <4B6728EA.8010700@behnel.de> Richard Baron Penman, 31.01.2010 14:56: >>> I am after xpath support for an application running on Google App Engine, >>> which unfortunately rules out lxml. >> Yeah, I know. That's one of the reasons I never found a use for the GAE on >> my side. That also makes your e-mail somewhat misplaced on this list. ;) > > Hopefully the lxml feature request goes somewhere: > http://code.google.com/p/googleappengine/issues/detail?id=18 Interesting. Yes, let's see what that gives. Everyone's invited to vote for that bug, obviously. > Can you recommend an alternative for discussing ElementTree? 
> I tried emailing Fredrik earlier but didn't get a response and the > ElementTree repository hasn't been committed to since 2007. Fredrik is rather packed with other stuff these days. Note that ET 1.3 may make it into 3.2/2.7: http://bugs.python.org/issue1143 So c.l.py and the Python bug tracker are suitable places. For implementation specific questions, there is also python-dev and the stdlib sig mailing list. > http://sourceforge.net/projects/pdis-xpath/ >> I never tried it, but it's been recently updated, so it looks like it's >> still maintained. >> > > That project does look promising, however it doesn't yet support // or .. I believe that it lacks support for '..' (although that's trivial to implement even with ET), but '//'??? >>> However tag positions appear to be broken: >>> >>> print list(tree.findall('.//b[1]')) # should return b element >>> [] >> That shouldn't be hard to add. You just have to make sure it only counts >> elements within the same parent, so you may have to add the selector in >> more than one place. I guess that's why Fredrik didn't add it while he was >> at it. > > I found it was half implemented and finished it off. There is some elegant > code in ElementPath.py but it needs refactoring... Care to provide a patch? Stefan From richardbp+lxml at gmail.com Tue Feb 2 01:56:28 2010 From: richardbp+lxml at gmail.com (Richard Baron Penman) Date: Tue, 2 Feb 2010 11:56:28 +1100 Subject: [lxml-dev] ElementTree 1.3a xpath position broken? In-Reply-To: <4B6728EA.8010700@behnel.de> References: <4B6526B9.9060007@behnel.de> <4B6728EA.8010700@behnel.de> Message-ID: On Tue, Feb 2, 2010 at 6:18 AM, Stefan Behnel wrote: > > Fredrik is rather packed with other stuff these days. > > Note that ET 1.3 may make it into 3.2/2.7: > > http://bugs.python.org/issue1143 > hmm, not sure that is a good idea if ET was not maintained recently. 
> http://sourceforge.net/projects/pdis-xpath/ > >> I never tried it, but it's been recently updated, so it looks like it's > >> still maintained. > >> > > > > That project does look promising, however it doesn't yet support // or .. > > I believe that it lacks support for '..' (although that's trivial to > implement even with ET), but '//'??? > Yep - no // support. And I found out (from Ken Riley) that pdis is no longer maintained. > > I found it was half implemented and finished it off. There is some > elegant > > code in ElementPath.py but it needs refactoring... > > Care to provide a patch? > To who? Fredrik seems busy... So I plan to fork ElementPath.py with my own updates and then release as a standalone ElementTree xpath library like pdis. Also I aim to add support for absolute xpaths, for compatibility with my existing lxml dependent code. Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100202/0c93c146/attachment.htm From stefan_ml at behnel.de Tue Feb 2 09:20:53 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 02 Feb 2010 09:20:53 +0100 Subject: [lxml-dev] ElementTree 1.3a xpath position broken? In-Reply-To: References: <4B6526B9.9060007@behnel.de> <4B6728EA.8010700@behnel.de> Message-ID: <4B67E065.2000104@behnel.de> Richard Baron Penman, 02.02.2010 01:56: > On Tue, Feb 2, 2010 at 6:18 AM, Stefan Behnel wrote: >>> I found it was half implemented and finished it off. There is some >>> elegant code in ElementPath.py but it needs refactoring... >> >> Care to provide a patch? > > To who? Fredrik seems busy... To lxml - remember what list we have this discussion on? Also, to the Python bug tracker. I doubt that a new feature like this would be rejected. > So I plan to fork ElementPath.py with my own updates and then release as a > standalone ElementTree xpath library like pdis. 
Also I aim to add support > for absolute xpaths, for compatibility with my existing lxml dependent code. That's also an option, and a good one IMHO. That would make it easily available to both ET and lxml. (Advantage for lxml being the incremental search support which is not available in XPath). Please take a look at both implementations, the one in ET 1.3 and the one in lxml. http://codespeak.net/svn/lxml/trunk/src/lxml/_elementpath.py There are some minor differences that make the latter one run faster on top of lxml's advanced tree iterators. Stefan From jholg at gmx.de Thu Feb 4 00:33:09 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 04 Feb 2010 00:33:09 +0100 Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml Message-ID: <20100203233309.15250@gmx.net> > > This speaks for pulling the result accessor into the Schematron class, > probably as a class attribute that can be overridden on an instance level. > > > > The same might make sense for the iso-schematron implementation xsl > transformation steps. > > Sounds like a much better interface. Any interesting global options would > be better overridden by subtyping the validator class, so class attributes > make sense to me. Committed to trunk: https://codespeak.net/viewvc/?view=rev&revision=71090 This simply exposes the skeleton xslt steps and the validation result xpath as class attributes. I consider the iso-schematron works pretty much finished for now... Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! 
Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From dchandekstark at gmail.com Fri Feb 5 19:38:44 2010 From: dchandekstark at gmail.com (David Chandek-Stark) Date: Fri, 5 Feb 2010 13:38:44 -0500 Subject: [lxml-dev] Get the xml-stylesheet processing instruction Message-ID: <9e5a3df61002051038x6b12ef0fsbad90228e991b2d8@mail.gmail.com> Hi, I had a hard time tracking this info down (only figured out after reading the thread at http://codespeak.net/pipermail/lxml-dev/2006-September/001903.html), so posted a recipe: http://fragmentsofcode.wordpress.com/2010/02/05/get-the-xml-stylesheet-processing-instruction-with-lxml/ Feel free to use or copy for any purpose. Thanks, David -- David Chandek-Stark dchandekstark (at) gmail (dot) com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100205/dc521d6f/attachment.htm From manu3d at gmail.com Fri Feb 5 21:02:48 2010 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Fri, 5 Feb 2010 20:02:48 +0000 Subject: [lxml-dev] Get the xml-stylesheet processing instruction In-Reply-To: <9e5a3df61002051038x6b12ef0fsbad90228e991b2d8@mail.gmail.com> References: <9e5a3df61002051038x6b12ef0fsbad90228e991b2d8@mail.gmail.com> Message-ID: <915dc91d1002051202r36325815ofce07d33d8b0ba3f@mail.gmail.com> On 5 February 2010 18:38, David Chandek-Stark wrote: > I had a hard time tracking this info down (only figured out after reading > the thread at > http://codespeak.net/pipermail/lxml-dev/2006-September/001903.html), so > posted a recipe: > > > http://fragmentsofcode.wordpress.com/2010/02/05/get-the-xml-stylesheet-processing-instruction-with-lxml/ > True, obtaining PIs (and comments) is not exactly intuitive nor straightforward (in, fact for head-of-file PIs it's straightbackward!! =) ). I guess ideally there could be a list of them available via read-only properties on the tree object, i.e.: docTree.PIs and docTree.comments. 
Manu From jkrukoff at ltgc.com Fri Feb 5 21:58:38 2010 From: jkrukoff at ltgc.com (John Krukoff) Date: Fri, 05 Feb 2010 13:58:38 -0700 Subject: [lxml-dev] Get the xml-stylesheet processing instruction In-Reply-To: <915dc91d1002051202r36325815ofce07d33d8b0ba3f@mail.gmail.com> References: <9e5a3df61002051038x6b12ef0fsbad90228e991b2d8@mail.gmail.com> <915dc91d1002051202r36325815ofce07d33d8b0ba3f@mail.gmail.com> Message-ID: <1265403518.10548.1766.camel@localhost.localdomain> On Fri, 2010-02-05 at 20:02 +0000, Emanuele D'Arrigo wrote: > True, obtaining PIs (and comments) is not exactly intuitive nor > straightforward (in, fact for head-of-file PIs it's straightbackward!! > =) ). I guess ideally there could be a list of them available via > read-only properties on the tree object, i.e.: docTree.PIs and > docTree.comments. > > Manu I agree, although my version of ideally would be for the ElementTree class to do a better imitation of being a root node, so that something simple like this would work for looping over all root level nodes: >>> from lxml import etree >>> x = etree.XML( '' ) >>> t = x.getroottree() >>> list( t ) [, , ] Or if ElementTree even supported just getchildren() to retrieve the same data. Even more ideally ;) it'd support insert/append/replace/remove and company for editing such root level elements. Oh yeah, and if xpath( '/' ) returned said extended ElementTree as the root node, as long as I'm wishing for ponies and unicorns. I suppose the reason it's hard is because effbot's ElementTree hasn't ever dealt with the issue of non-element root level contents. 
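Until something like that exists, the root-level nodes can be collected by walking the siblings of the root element. The following is only a rough sketch, assuming that getprevious() and getnext() on the root element step through document-level comments and PIs; the helper name root_level_nodes and the sample document are made up for illustration.

```python
from lxml import etree

# Hypothetical sample document: the stylesheet name and comments are invented.
doc = etree.XML(
    '<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    '<!-- before root -->'
    '<root/>'
    '<!-- after root -->'
)


def root_level_nodes(root):
    """Collect the root element and its document-level siblings in document order.

    Hypothetical helper, not part of lxml.
    """
    # Rewind to the first node in front of the root element.
    first = root
    while first.getprevious() is not None:
        first = first.getprevious()
    # Walk forward over every top-level node, including the root itself.
    nodes = []
    node = first
    while node is not None:
        nodes.append(node)
        node = node.getnext()
    return nodes


nodes = root_level_nodes(doc)
# PIs and comments report the factory function as their tag.
pis = [n for n in nodes if n.tag is etree.ProcessingInstruction]
comments = [n for n in nodes if n.tag is etree.Comment]
```

This gives roughly what docTree.PIs and docTree.comments would return, but it is read-only and does not address editing the root-level content.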
-- John Krukoff Land Title Guarantee Company From optilude+lists at gmail.com Sat Feb 6 12:46:39 2010 From: optilude+lists at gmail.com (Martin Aspeli) Date: Sat, 06 Feb 2010 19:46:39 +0800 Subject: [lxml-dev] Copying children including text nodes Message-ID: Hi, I have two trees that were parsed with the HTML parser. The source tree is: Foo

Bar

Baz The target is:
Placeholder
Now, I want to replace the whole of
tag (so, the tag and its children) with the *contents* of the tag in the source tree. I obviously don't want the body tag itself. Performance is important. Also, I don't care about the source tree after I'm done, so if "moving" rather than copying makes things faster/easier, that's OK. What's the best way to do this? My naive approach was to do this: sourceBody = source.find('body') for sourceBodyChild in sourceBody: targetPlaceholder.addnext(sourceBodyChild) targetPlaceholder.getparent().remove(targetPlaceholder) However, this loses the text ("Foo"). I guess this is one case where dealing with text nodes explicitly would actually be better. :) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From akaihola at gmail.com Mon Feb 8 11:11:59 2010 From: akaihola at gmail.com (Antti Kaihola) Date: Mon, 8 Feb 2010 12:11:59 +0200 Subject: [lxml-dev] Typo in exception message Message-ID: <154405ff1002080211p331f0fdep4b4e35bce07dacaa@mail.gmail.com> Hi, I'm catching exceptions from cssselect and my code needs to make decisions based on not only exception classes but also the particular exception messages. The word "pseudo" is mistyped as "psuedo" in four different exceptions. Are these typos going to be kept unchanged for good, or should my code be prepared to match a corrected version as well? The same typo seems to occur on the lxml web pages as well, by the way. Regards, Antti Kaihola Espoo, Finland -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100208/0a0a3b3f/attachment.htm From stefan_ml at behnel.de Mon Feb 8 12:04:33 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Feb 2010 12:04:33 +0100 Subject: [lxml-dev] Typo in exception message In-Reply-To: <154405ff1002080211p331f0fdep4b4e35bce07dacaa@mail.gmail.com> References: <154405ff1002080211p331f0fdep4b4e35bce07dacaa@mail.gmail.com> Message-ID: <4B6FEFC1.2010203@behnel.de> Antti Kaihola, 08.02.2010 11:11: > I'm catching exceptions from cssselect and my code needs to make decisions > based on not only exception classes but also the particular exception > messages. Could you provide an insight into your use case here? I wouldn't mind adding some machine-readable information to the exceptions, in case that helps. (patches appreciated) > The word "pseudo" is mistyped as "psuedo" in four different exceptions. Are > these typos going to be kept unchanged for good, or should my code be > prepared to match a corrected version as well? > > The same typo seems to occur on the lxml web pages as well, by the way. Thanks for catching that. I fixed it in the code and the docs. At least lxml 2.3 will have this change. Stefan From stefan_ml at behnel.de Mon Feb 8 14:41:55 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Feb 2010 14:41:55 +0100 Subject: [lxml-dev] Copying children including text nodes In-Reply-To: References: Message-ID: <4B7014A3.7050001@behnel.de> Martin Aspeli, 06.02.2010 12:46: > I have two trees that were parsed with the HTML parser. The source tree is: > > > > > Foo >

Bar

> Baz > > > > The target is: > > > > >
Placeholder
> > > > Now, I want to replace the whole of
tag (so, the tag > and its children) with the *contents* of the tag in the source > tree. I obviously don't want the body tag itself. parent.replace() doesn't currently support sequence insertion, but I would expect this to work:

    prev = div_element.getprevious()
    if prev is None:
        target_body[:1] = source_body[:]
        target_body.text = source_body.text # take care of existing text?
    else:
        pos = target_body.index(div_element)
        target_body[pos:pos+1] = source_body[:]
        if prev.tail:
            prev.tail += source_body.text
        else:
            prev.tail = source_body.text

> Performance is important. Also, I don't care about the source tree after > I'm done, so if "moving" rather than copying makes things faster/easier, > that's OK. Moving is certainly faster than copying, as copying does at least the same amount of work, plus the memory allocations. If copying was required, you could always do a deepcopy of the source content before inserting it. I can't give any further comments on performance, though. You'll need to do your own benchmarks (although I'm always interested in the results :) Stefan From nath at nreynolds.me.uk Mon Feb 8 15:46:34 2010 From: nath at nreynolds.me.uk (Nathan Reynolds) Date: Mon, 8 Feb 2010 14:46:34 +0000 Subject: [lxml-dev] Custom element classes: lost my proxy Message-ID: <78a87de31002080646h1d62bbe6l4bfeff00c0228a6c@mail.gmail.com> Hi all, As soon as I insert my custom elements into a regular lxml.etree.Element, they revert to the standard Element interface. Is it possible to get my proxy back for these elements? Thanks, Nath -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100208/db9d5c11/attachment.htm From stefan_ml at behnel.de Mon Feb 8 15:50:57 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Feb 2010 15:50:57 +0100 Subject: [lxml-dev] Custom element classes: lost my proxy In-Reply-To: <78a87de31002080646h1d62bbe6l4bfeff00c0228a6c@mail.gmail.com> References: <78a87de31002080646h1d62bbe6l4bfeff00c0228a6c@mail.gmail.com> Message-ID: <4B7024D1.80204@behnel.de> Nathan Reynolds, 08.02.2010 15:46: > As soon as I insert my custom elements into a regular lxml.etree.Element, > they revert to the standard Element interface. > > Is it possible to get my proxy back for these elements? http://codespeak.net/lxml/element_classes.html#generating-xml-with-custom-classes HTH, Stefan From l at lrowe.co.uk Tue Feb 9 15:56:00 2010 From: l at lrowe.co.uk (Laurence Rowe) Date: Tue, 9 Feb 2010 14:56:00 +0000 Subject: [lxml-dev] Building an ESI tag with lxml In-Reply-To: References: <4B62928F.5050400@behnel.de> <4B62B3B8.60206@behnel.de> <4B62D1BB.1050408@behnel.de> <4B62F18A.2040105@behnel.de> Message-ID: This seems to be a limitation of the xml serializer when it detects xhtml :( $ xsltproc --version Using libxml 20703-SVN3827, libxslt 10124-SVN1494 and libexslt 813 xsltproc was compiled against libxml 20703, libxslt 10124 and libexslt 813 libxslt 10124 was compiled against libxml 20703 libexslt 813 was compiled against libxml 20703 $ cat in.html $ cat test.xsl $ xsltproc test.xsl in.html xsltproc is using the xml parser here. We need xhtml mode or you end up with elements like
(no space) which confuse some browsers. Laurence From jbb at scryent.com Tue Feb 16 21:44:59 2010 From: jbb at scryent.com (Jordan Baker) Date: Tue, 16 Feb 2010 15:44:59 -0500 Subject: [lxml-dev] lxml.html.tostring & CDATA Message-ID: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> Trying to embed a script that needs to be inside a CDATA and it seems to be escaping the CDATA itself: mydoc = """ """ >>> print tostring(document_fromstring(mydoc), method="xml") Is this is a bug? -- Jordan Baker Scryent Clearly Open Source Plone, Zope, Python, Linux & more +1 416 871-3810 www.scryent.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100216/50e1c2d7/attachment.htm From jbb at scryent.com Tue Feb 16 22:16:05 2010 From: jbb at scryent.com (Jordan Baker) Date: Tue, 16 Feb 2010 16:16:05 -0500 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> Message-ID: <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > Trying to embed a script that needs to be inside a CDATA and it seems to be > escaping the CDATA itself: > > mydoc = """ > > > > > > > > > > > """ > > >>> print tostring(document_fromstring(mydoc), method="xml") > > > > Is this is a bug? > > > Also tried now with lxml 2.2.4 / libxml2-2.7.6 / libxslt-1.1.26 - same problem. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100216/f30f966e/attachment-0001.htm From jbb at scryent.com Tue Feb 16 21:46:49 2010 From: jbb at scryent.com (Jordan Baker) Date: Tue, 16 Feb 2010 15:46:49 -0500 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> Message-ID: <26138711002161246r5041a07g5c22b382ba0471dc@mail.gmail.com> On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > Trying to embed a script that needs to be inside a CDATA and it seems to be > escaping the CDATA itself > Noticed that this is lxml 2.1.2 ... will try with a more recent version -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100216/a5aa10bc/attachment.htm From jkrukoff at ltgc.com Tue Feb 16 23:00:08 2010 From: jkrukoff at ltgc.com (John Krukoff) Date: Tue, 16 Feb 2010 15:00:08 -0700 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> Message-ID: <1266357608.3897.47.camel@localhost.localdomain> On Tue, 2010-02-16 at 16:16 -0500, Jordan Baker wrote: > On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > Trying to embed a script that needs to be inside a CDATA and > it seems to be escaping the CDATA itself: > > > mydoc = """ > > > > > > > > > > > """ > > > >>> print tostring(document_fromstring(mydoc), method="xml") > xmlns="http://www.w3.org/1999/xhtml"> > src="http://foo.jpg" /> > > > Is this is a bug? > > > > > Also tried now with lxml 2.2.4 / libxml2-2.7.6 / libxslt-1.1.26 - > same problem. 
> > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev This is just a guess, as I don't know enough about the HTML spec to say for sure, but my hunch is that HTML doesn't include Land Title Guarantee Company From cfbearden at gmail.com Tue Feb 16 22:59:39 2010 From: cfbearden at gmail.com (Chuck Bearden) Date: Tue, 16 Feb 2010 15:59:39 -0600 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <433ebc871002161358w6b5f710fyb80fc757f4b1f204@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161246r5041a07g5c22b382ba0471dc@mail.gmail.com> <433ebc871002161358w6b5f710fyb80fc757f4b1f204@mail.gmail.com> Message-ID: <433ebc871002161359ke32cf8et8a64897c5821764@mail.gmail.com> On Tue, Feb 16, 2010 at 2:46 PM, Jordan Baker wrote: > On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: >> >> Trying to embed a script that needs to be inside a CDATA and it seems to >> be escaping the CDATA itself > > Noticed that this is lxml 2.1.2 ... will try with a more recent version Does HTML 4.01 Transitional define CDATA sections? I can't find any reference to them in the specification at the W3C. I'm not a standards guru, though, and I may well have missed something. 
Chuck From jkrukoff at ltgc.com Tue Feb 16 23:34:29 2010 From: jkrukoff at ltgc.com (John Krukoff) Date: Tue, 16 Feb 2010 15:34:29 -0700 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <433ebc871002161359ke32cf8et8a64897c5821764@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161246r5041a07g5c22b382ba0471dc@mail.gmail.com> <433ebc871002161358w6b5f710fyb80fc757f4b1f204@mail.gmail.com> <433ebc871002161359ke32cf8et8a64897c5821764@mail.gmail.com> Message-ID: <1266359669.3897.53.camel@localhost.localdomain> On Tue, 2010-02-16 at 15:59 -0600, Chuck Bearden wrote: > On Tue, Feb 16, 2010 at 2:46 PM, Jordan Baker wrote: > > On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > >> > >> Trying to embed a script that needs to be inside a CDATA and it seems to > >> be escaping the CDATA itself > > > > Noticed that this is lxml 2.1.2 ... will try with a more recent version > > Does HTML 4.01 Transitional define CDATA sections? I can't find any > reference to them in the specification at the W3C. I'm not a standards > guru, though, and I may well have missed something. > > Chuck > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev I poked around, it looks like HTML 4 doesn't define CDATA sections, while it looks like HTML 5 does. libxml2, and as such, lxml, implement an HTML 4.0 parser for HTML parsing (ref: http://xmlsoft.org/html/libxml-HTMLparser.html ). So, if the crystal ball is working and the goal is to parse HTML 5, maybe the OP should give the lxml html5lib interface a try? 
http://codespeak.net/lxml/dev/html5parser.html -- John Krukoff Land Title Guarantee Company From jbb at scryent.com Tue Feb 16 23:38:39 2010 From: jbb at scryent.com (Jordan Baker) Date: Tue, 16 Feb 2010 17:38:39 -0500 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <1266357608.3897.47.camel@localhost.localdomain> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> <1266357608.3897.47.camel@localhost.localdomain> Message-ID: <26138711002161438n1e32ac59l1f95628bbc985b90@mail.gmail.com> On Tue, Feb 16, 2010 at 5:00 PM, John Krukoff wrote: > > On Tue, 2010-02-16 at 16:16 -0500, Jordan Baker wrote: > > On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > > Trying to embed a script that needs to be inside a CDATA and > > it seems to be escaping the CDATA itself: > > > > > > mydoc = """ > > > > > > > > > > > > > > > > > > > > > > """ > > > > > > >>> print tostring(document_fromstring(mydoc), method="xml") > > > xmlns="http://www.w3.org/1999/xhtml"> > > > src="http://foo.jpg" /> > > > > > > Is this is a bug? > > > > > > > > > > Also tried now with lxml 2.2.4 / libxml2-2.7.6 / libxslt-1.1.26 - > > same problem. > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > This is just a guess, as I don't know enough about the HTML spec to say > for sure, but my hunch is that HTML doesn't include special tag. 
If true, that would explain why when you use the HTML > parser it's simply treating the the HTML serializer ( "tostring( document_fromstring( mydoc ) )" )does > what you expect, while the XML serializer > ( "tostring( document_fromstring( mydoc, method = "xml" ) )" ) doesn't. > > Perhaps you could use XHTML, as the XML parser would likely do what you > want? > -- > John Krukoff > Land Title Guarantee Company > Oops, there was an error with my example. I tried it again with XHTML 1.0 transitional doctype and same thing... CDATA is for sure part of XHTML 1.0 http://www.w3.org/TR/xhtml1/#h-4.8 -jordan. From jkrukoff at ltgc.com Tue Feb 16 23:43:12 2010 From: jkrukoff at ltgc.com (John Krukoff) Date: Tue, 16 Feb 2010 15:43:12 -0700 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> Message-ID: <1266360192.3897.57.camel@localhost.localdomain> On Tue, 2010-02-16 at 16:16 -0500, Jordan Baker wrote: > On Tue, Feb 16, 2010 at 3:44 PM, Jordan Baker wrote: > > I just noticed that your DOCTYPE declaration doesn't make any sense. Are you trying to write HTML ( something like ) or XHTML ( something like )? I'm guessing you were going for XHTML, in which case you probably should fix your DOCTYPE declaration and use the XML parser ( lxml.etree.fromstring ) rather than the HTML parser ( lxml.html.document_fromstring ). 
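As a rough sketch of the XML-parser route: the markup below is a made-up stand-in, since the original example was scrubbed from the archive. It also assumes that the CDATA section should survive serialization, which as far as I can tell needs a parser created with strip_cdata=False; by default the XML parser turns CDATA sections into plain text.

```python
from lxml import etree

# Hypothetical XHTML snippet; the script body is invented for this sketch.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>CDATA test</title>'
    '<script type="text/javascript"><![CDATA[ if (1 < 2) { go(); } ]]></script>'
    '</head><body/></html>'
)

# Keep CDATA sections as CDATA nodes instead of converting them to text.
parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xhtml, parser)

# The script content is available unescaped as ordinary element text.
script = root.find(".//{http://www.w3.org/1999/xhtml}script")
script_text = script.text

# Serialize back to a string; the CDATA wrapper should be written out again.
serialized = etree.tostring(root, encoding="unicode")
```

With the default parser the same text would still round-trip, but the "<" would be written as an escaped entity rather than inside a CDATA section.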
-- John Krukoff Land Title Guarantee Company From jkrukoff at ltgc.com Wed Feb 17 00:28:11 2010 From: jkrukoff at ltgc.com (John Krukoff) Date: Tue, 16 Feb 2010 16:28:11 -0700 Subject: [lxml-dev] lxml.html.tostring & CDATA In-Reply-To: <26138711002161438n1e32ac59l1f95628bbc985b90@mail.gmail.com> References: <26138711002161244l10455778w7d7144f91eab2c45@mail.gmail.com> <26138711002161316v2c6486ffs6077a91e2fd90bc7@mail.gmail.com> <1266357608.3897.47.camel@localhost.localdomain> <26138711002161438n1e32ac59l1f95628bbc985b90@mail.gmail.com> Message-ID: <1266362891.3897.73.camel@localhost.localdomain> On Tue, 2010-02-16 at 17:38 -0500, Jordan Baker wrote: > Oops, there was an error with my example. I tried it again with XHTML > 1.0 transitional docstring and same thing... > > CDATA is for sure part of XHTML 1.0 > > http://www.w3.org/TR/xhtml1/#h-4.8 > > -jordan. One last note, if what you want to do is parse XHTML as XML (which you should, so that things like CDATA work), but get the lxml.html element interface, it looks like you can use the lxml.html.xhtml_parser like so: >>> from lxml import etree, html >>> mydoc = ''' ... ... ...

... ...

... ... ''' >>> e = etree.fromstring( mydoc, parser = html.xhtml_parser ) >>> type( e ) >>> etree.tostring( e ) '\n \n

\n Some text. \n

\n \n' I mention it, as it looks like xhtml_parser is only documented in the API reference (so I didn't know it existed until 5 minutes ago), with the only mention in the parsing section being that you ought to parse XHTML as XML: http://codespeak.net/lxml/parsing.html#parsers And no mention at all of how to parse XHTML in the lxml.html documentation: http://codespeak.net/lxml/lxmlhtml.html#parsing-html I have no idea if this is a good idea or not, but it looks like lxml.html put some effort into XHTML support, so it probably works. For my own uses I've always wanted to treat XHTML as plain XML, and haven't made any use of the lxml.html extensions. -- John Krukoff Land Title Guarantee Company From john.byrne at propylon.com Wed Feb 17 18:19:20 2010 From: john.byrne at propylon.com (John Byrne) Date: Wed, 17 Feb 2010 11:19:20 -0600 Subject: [lxml-dev] is it ok to move lxml build after running setup.py? Message-ID: <4B7C2518.7030404@propylon.com> Hi, Complete newbie to lxml here. I have just built the version 2.2.4 using the setup.py script and I was wondering if it actually installs/copies anything anywhere. The reason is, I built it in my home directory, but now I'd like to move it. I'm guessing that I just need to have the "build/lib.linux-i686-2.4/lxml" directory on my PYTHONPATH, but I want to be sure I don't mess anything up. Is it OK to move the above directory, and then include that new location on PYTHONPATH? Thanks (and sorry if this is an obvious one, I am not too familiar with setup.py etc.) -John From stefan_ml at behnel.de Thu Feb 18 11:15:59 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 18 Feb 2010 11:15:59 +0100 Subject: [lxml-dev] is it ok to move lxml build after running setup.py? In-Reply-To: <4B7C2518.7030404@propylon.com> References: <4B7C2518.7030404@propylon.com> Message-ID: <4B7D135F.8070006@behnel.de> John Byrne, 17.02.2010 18:19: > Complete newbie to lxml here. 
I have just built the version 2.2.4 using > the setup.py script and I was wondering if it actually installs/copies > anything anywhere. The reason is, I built it in my home directory, but > now I'd like to move it. I'm guessing that I just need to have the > "build/lib.linux-i686-2.4/lxml" directory on my PYTHONPATH, but I want > to be sure I don't mess anything up. > > Is it OK to move the above directory, and then include that new location > on PYTHONPATH? > > Thanks (and sorry if this is an obvious one, I am not too familiar with > setup.py etc.) It should be ok as long as you keep everything in the 'lxml' package. Note that there are various ways to install Python packages, including the 'install' target and eggs. Please take a look at the distutils documentation and read up on setuptools. Note that (besides comp.lang.python, obviously) there is also a mailing list targeted at distutils, where this kind of question fits much better. Stefan From ross at burtonini.com Fri Feb 19 18:47:27 2010 From: ross at burtonini.com (Ross Burton) Date: Fri, 19 Feb 2010 17:47:27 +0000 (UTC) Subject: [lxml-dev] Custom XSLT functions and text nodes Message-ID: Hi, I'm trying to perform some string manipulation on text nodes via a custom XSLT function. My XSLT has this: But no matter what I do the context node doesn't appear to have any text content. Using pdb: (Pdb) print input_node (Pdb) print input_node.tag text (Pdb) print input_node.tail None (Pdb) print input_node.text None I expect I'm doing something rather stupid, but how can I get the text so that I can manipulate it and then put it in the result tree? 
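One possible way to get at the text from Python, sketched under the assumption that an XPath extension function (called on the string value of an element) is acceptable instead of an extension element. The namespace URI, the function name shout and the sample documents are all made up for illustration.

```python
from lxml import etree

# Hypothetical namespace URI and function name, chosen only for this sketch.
EXT_NS = "http://example.com/text-ext"


def shout(context, text):
    """XPath extension function: gets the string value, returns the new text."""
    return text.upper()


stylesheet = etree.XML('''\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ext="http://example.com/text-ext"
    exclude-result-prefixes="ext">
  <xsl:template match="p">
    <p><xsl:value-of select="ext:shout(string(.))"/></p>
  </xsl:template>
</xsl:stylesheet>
''')

# Register the function for this stylesheet only, keyed by (namespace, name).
transform = etree.XSLT(stylesheet, extensions={(EXT_NS, "shout"): shout})
result = transform(etree.XML("<doc><p>hello</p></doc>"))
manipulated = result.getroot().text
```

The function receives a plain string because the stylesheet passes string(.), so the Python side never has to deal with text nodes directly.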
Cheers, Ross From stefan_ml at behnel.de Sat Feb 20 16:48:17 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 20 Feb 2010 16:48:17 +0100 Subject: [lxml-dev] Custom XSLT functions and text nodes In-Reply-To: References: Message-ID: <4B800441.9010703@behnel.de> Ross Burton, 19.02.2010 18:47: > I'm trying to perform some string manipulation on text nodes via a custom XSLT > function. > > My XSLT has this: > > > > Note that the expected wording here is "extension element", not function (which XPath uses). > But no matter what I do the context node doesn't appear to have any text content. > Using pdb: > > (Pdb) print input_node > > (Pdb) print input_node.tag > text > (Pdb) print input_node.tail > None > (Pdb) print input_node.text > None That's a bug that has been fixed for 2.3, which will raise an exception instead. The feature that you want is not currently implemented. > how can I get the text so that > I can manipulate it and then put it in the result tree? You can match on elements instead of text nodes, e.g. and then work on the .text/.tail data of that element. You can also use an XPath function and call that on the text content explicitly. Stefan From fantasai.lists at inkedblade.net Sat Feb 20 04:40:06 2010 From: fantasai.lists at inkedblade.net (fantasai) Date: Fri, 19 Feb 2010 19:40:06 -0800 Subject: [lxml-dev] Question about newlines Message-ID: Stefan Behnel behnel.de> writes: > > Noah Slater wrote: > > On Sun, Dec 09, 2007 at 08:48:17AM +0100, Stefan Behnel wrote: > >> Serialisation will never alter content. > > [snip] > >>> 1) When adding a PI via the element.addprevious method and PI has > >>> it's tail trimmed and so when serialising the PI runs into the > >>> root element. > > > > Well, this is well and good but lxml REMOVES the PI tail so I cannot > > insert a newline even if I want to. > > Ah, got it. Thanks for insisting. :) > > lxml.etree does this on purpose. 
If you allow character data around the > processing instructions that you add as siblings of the root node, you need to > make sure it's only whitespace (not 'real' data) to keep the in-memory tree > well-formed and to serialise well-formed XML. So the behaviour would be: strip > the tail, but keep it if it's whitespace. Sounds a bit ugly to me... > > I also noted that libxml2's parser drops whitespace at the root level, which > is perfectly fine, as it is the most definitely ignorable whitespace there is. > I personally prefer having lxml add a line break when serialising processing > instructions and comments at the root level, and cosistently dropping all tail > text of PIs and comments appended/prepended to a root node. So the behaviour > for the root level would be: drop all whitespace when parsing, and add line > breaks around PIs and comments on serialisation. > > There's also the document ending issue. The document serialiser of libxml2 > does append a newline, and one day, lxml may switch to using it. So I added > this behaviour now - and had to adapt tons of test cases that compare > serialised XML between ET and lxml. But I don't mind having white-space > differences in the serialisation as long as it's well-formed, equivalent XML. I have a related problem, more like this one: http://stackoverflow.com/questions/973079/how-can-i-make-lxmls-parser-preserve-whitespace-outside-of-the-root-element Maybe you want to make it a parse-time option, but I think white space outside the root should be preserved. And stripping the addition to the document root context only if it's not all whitespace seems to me the right thing to do. 
~fantasai From stefan_ml at behnel.de Sun Feb 21 12:04:52 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 21 Feb 2010 12:04:52 +0100 Subject: [lxml-dev] Question about newlines In-Reply-To: References: Message-ID: <4B811354.9010400@behnel.de> fantasai, 20.02.2010 04:40: > Maybe you want to make it a parse-time option, but I think white space outside > the root should be preserved. And stripping the addition to the document root > context only if it's not all whitespace seems to me the right thing to do. Why would you want to preserve that space? It doesn't have any meaning in XML. Stefan
From fantasai.lists at inkedblade.net Mon Feb 22 23:15:25 2010 From: fantasai.lists at inkedblade.net (fantasai) Date: Mon, 22 Feb 2010 14:15:25 -0800 Subject: [lxml-dev] Question about newlines In-Reply-To: <4B811354.9010400@behnel.de> References: <4B811354.9010400@behnel.de> Message-ID: On 02/21/2010 03:04 AM, Stefan Behnel wrote: > fantasai, 20.02.2010 04:40: >> Maybe you want to make it a parse-time option, but I think white space outside >> the root should be preserved. And stripping the addition to the document root >> context only if it's not all whitespace seems to me the right thing to do. > > Why would you want to preserve that space? It doesn't have any meaning in XML.
a) As with the original poster, to avoid unnecessary diff thrashing b) Because these files are edited by hand, and stripping whitespace makes them harder to read c) To test the effect of whitespace in the source file on CSS+XML and CSS+HTML rendering engines (since the files I'm working with happen to be CSS tests). c is obviously not a common case :) But a) and b) are relevant to others. ~fantasai From stefan_ml at behnel.de Tue Feb 23 10:35:00 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Feb 2010 10:35:00 +0100 Subject: [lxml-dev] Question about newlines In-Reply-To: References: <4B811354.9010400@behnel.de> Message-ID: <4B83A144.4010201@behnel.de> fantasai, 22.02.2010 23:15: > On 02/21/2010 03:04 AM, Stefan Behnel wrote: >> fantasai, 20.02.2010 04:40: >>> Maybe you want to make it a parse-time option, but I think white space outside >>> the root should be preserved. And stripping the addition to the document root >>> context only if it's not all whitespace seems to me the right thing to do. >> Why would you want to preserve that space? It doesn't have any meaning in XML. > > a) As with the original poster, to avoid unnecessary diff thrashing > b) Because these files are edited by hand, and stripping whitespace makes them > harder to read > c) To test the effect of whitespace in the source file on CSS+XML and CSS+HTML > rendering engines (since the files I'm working with happen to be CSS tests). > > c is obviously not a common case :) But a) and b) are relevant to others. I have to express my doubts here, but your mileage may obviously vary. I don't see how whitespace before the root tag makes things more readable. And to keep it out of diffs, all you have to do is keep it out of the file. That said, it's quite possible that the serialising code in lxml.etree is the culprit here. It only special cases PIs and comments around the root node, not text nodes. So please open a bug report on this for now. 
Stefan From Zsolt.Cserna at MorganStanley.com Tue Feb 23 12:25:18 2010 From: Zsolt.Cserna at MorganStanley.com (Cserna, Zsolt) Date: Tue, 23 Feb 2010 11:25:18 +0000 Subject: [lxml-dev] ElementTree.iterparse segfault Message-ID: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> Hi all, There's a test in selftest2.py which fails (it segfaults) on my 64-bit Linux host: def iterparse(): """ Test iterparse interface. >>> iterparse = ElementTree.iterparse >>> context = iterparse("samples/simple.xml") >>> for action, elem in context: ... print("%s %s" % (action, elem.tag)) end element end element end empty-element end root I've created a standalone script which contains: from lxml import etree as ElementTree iterparse = ElementTree.iterparse context = iterparse("samples/simple.xml") for i in context: print i ...and it also results in a segmentation fault, where samples/simple.xml is the XML file bundled with lxml: text texttail I've tested it with Python 2.6.4 and 2.5.4, and with lxml 2.2.2 and 2.2.4 (2.2.2 is linked with libxml2 2.7.3 and libxslt 1.1.24; 2.2.4 is linked with libxml2 2.7.6). It only happens on 64-bit Linux (RHEL 4); my 32-bit Linux systems (RHEL 3 and RHEL 4) are fine. Any advice on this bug? Could you check whether it also happens on your 64-bit Linux system? Thanks in advance, Zsolt -------------------------------------------------------------------------- NOTICE: If received in error, please destroy, and notify sender. Sender does not intend to waive confidentiality or privilege. Use of this email is prohibited when received in error. We may monitor and store emails to the extent permitted by applicable law.
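For readers who want to see what the test in the message above is expected to print, here is a minimal sketch of the same iterparse loop. It is an illustration under assumptions, not the lxml test itself: it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml.etree accepts the same iterparse() call), and the SAMPLE document and the end_events() helper are made up for this example. SAMPLE is not the samples/simple.xml file shipped with lxml; it only uses matching tag names.

```python
import io
import xml.etree.ElementTree as ElementTree  # stdlib stand-in for lxml.etree

# Hypothetical sample document (an assumption, not lxml's samples/simple.xml).
SAMPLE = (
    b"<root>"
    b"<element key='value'>text</element>"
    b"<element>text</element>tail"
    b"<empty-element/>"
    b"</root>"
)


def end_events(source):
    """Return (action, tag) pairs in the order iterparse() reports them.

    With no 'events' argument, iterparse() only reports "end" events,
    i.e. one event per element once its closing tag has been parsed.
    """
    return [(action, elem.tag) for action, elem in ElementTree.iterparse(source)]


events = end_events(io.BytesIO(SAMPLE))
for action, tag in events:
    print("%s %s" % (action, tag))
# end element
# end element
# end empty-element
# end root
```

Children are reported before their parent, which is why "end root" comes last.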
From stefan_ml at behnel.de Tue Feb 23 13:23:56 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Feb 2010 13:23:56 +0100 Subject: [lxml-dev] ElementTree.iterparse segfault In-Reply-To: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> References: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> Message-ID: <4B83C8DC.80102@behnel.de> Cserna, Zsolt, 23.02.2010 12:25: > There's a test in selftest2.py which fails (results segfault) on my > 64-bit linux host: > [...] > I've tested it with python 2.6.4, 2.5.4, and lxml 2.2.2 and 2.2.4 (2.2.2 > is linked with libxml2 2.7.3, libxslt 1.1.24, 2.2.4 is linked with > libxml2 2.7.6). > > It only happens on 64-bit linux (RHEL 4), my 32-bit linux systems (RHEL > 3 and RHEL 4) are fine. > > Any advise on this bug? Could you check that it also happens on you > 64-bit linux system? Just tested with the current trunk on 64bit - no problem there. The test suite in the current 2.2 branch also passes for me. Could you run the test through gdb to get a stack trace of the segfault? Stefan From Zsolt.Cserna at MorganStanley.com Tue Feb 23 14:34:21 2010 From: Zsolt.Cserna at MorganStanley.com (Cserna, Zsolt) Date: Tue, 23 Feb 2010 13:34:21 +0000 Subject: [lxml-dev] ElementTree.iterparse segfault In-Reply-To: <4B83C8DC.80102@behnel.de> References: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> <4B83C8DC.80102@behnel.de> Message-ID: <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> Hi, Thanks for the quick reply. The backtrace is the following: #0 __pyx_pf_4lxml_5etree_9iterparse___next__ (__pyx_v_self=0x2a96d12110) at src/lxml/lxml.etree.c:86174 #1 0x0000002a95725dee in PyEval_EvalFrameEx (f=0x573030, throwflag=Variable "throwflag" is not available. ) at Python/ceval.c:2195 #2 0x0000002a9572c5f5 in PyEval_EvalCodeEx (co=0x2a96040558, globals=Variable "globals" is not available. 
) at Python/ceval.c:2875 #3 0x0000002a9572c772 in PyEval_EvalCode (co=Variable "co" is not available. ) at Python/ceval.c:514 #4 0x0000002a9574e60c in PyRun_FileExFlags (fp=0x501010, filename=0x7fbfffed5e "/home/zsolt/devel/python/temp/lxml_test.py", start=Variable "start" is not available. ) at Python/pythonrun.c:1273 #5 0x0000002a9574f293 in PyRun_SimpleFileExFlags (fp=0x501010, filename=0x7fbfffed5e "/home/zsolt/devel/python/temp/lxml_test.py", closeit=1, flags=0x7fbfffe80c) at Python/pythonrun.c:879 #6 0x0000002a9575a548 in Py_Main (argc=Variable "argc" is not available. ) at Modules/main.c:532 I'm using the .c files bundled in the tar file (so I haven't re-built them by cython). Zsolt > -----Original Message----- > From: Stefan Behnel [mailto:stefan_ml at behnel.de] > Sent: Tuesday, February 23, 2010 13:24 > To: Cserna, Zsolt (IDEAS) > Cc: lxml-dev at codespeak.net > Subject: Re: [lxml-dev] ElementTree.iterparse segfault > > Cserna, Zsolt, 23.02.2010 12:25: > > There's a test in selftest2.py which fails (results segfault) on my > > 64-bit linux host: > > [...] > > I've tested it with python 2.6.4, 2.5.4, and lxml 2.2.2 and 2.2.4 > > (2.2.2 is linked with libxml2 2.7.3, libxslt 1.1.24, 2.2.4 > is linked > > with > > libxml2 2.7.6). > > > > It only happens on 64-bit linux (RHEL 4), my 32-bit linux systems > > (RHEL > > 3 and RHEL 4) are fine. > > > > Any advise on this bug? Could you check that it also happens on you > > 64-bit linux system? > > Just tested with the current trunk on 64bit - no problem > there. The test suite in the current 2.2 branch also passes for me. > > Could you run the test through gdb to get a stack trace of > the segfault? > > Stefan >
From Zsolt.Cserna at MorganStanley.com Tue Feb 23 14:47:30 2010 From: Zsolt.Cserna at MorganStanley.com (Cserna, Zsolt) Date: Tue, 23 Feb 2010 13:47:30 +0000 Subject: [lxml-dev] ElementTree.iterparse segfault In-Reply-To: <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> References: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com><4B83C8DC.80102@behnel.de> <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> Message-ID: <0FE1D5D2B5C6754898C9E32C1230EC656E38E974DE@LNWEXMBX0105.msad.ms.com> Hm, depending on the -O switch of gcc I have different results. With -O0, the test passes. With -O3 what I used previously it results segfault. Is it possible that this switch caused the problem? Zsolt > -----Original Message----- > From: lxml-dev-bounces at codespeak.net > [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Cserna, > Zsolt (IDEAS) > Sent: Tuesday, February 23, 2010 14:34 > To: Stefan Behnel > Cc: lxml-dev at codespeak.net > Subject: Re: [lxml-dev] ElementTree.iterparse segfault > > > Hi, > > Thanks for the quick reply. > > The backtrace is the following: > > #0 __pyx_pf_4lxml_5etree_9iterparse___next__ > (__pyx_v_self=0x2a96d12110) at src/lxml/lxml.etree.c:86174 > #1 0x0000002a95725dee in PyEval_EvalFrameEx (f=0x573030, > throwflag=Variable "throwflag" is not available. > ) at Python/ceval.c:2195 > #2 0x0000002a9572c5f5 in PyEval_EvalCodeEx (co=0x2a96040558, > globals=Variable "globals" is not available. > ) at Python/ceval.c:2875 > #3 0x0000002a9572c772 in PyEval_EvalCode (co=Variable "co" > is not available. > ) at Python/ceval.c:514 > #4 0x0000002a9574e60c in PyRun_FileExFlags (fp=0x501010, > filename=0x7fbfffed5e > "/home/zsolt/devel/python/temp/lxml_test.py", start=Variable > "start" is not available.
> ) at Python/pythonrun.c:1273 > #5 0x0000002a9574f293 in PyRun_SimpleFileExFlags > (fp=0x501010, filename=0x7fbfffed5e > "/home/zsolt/devel/python/temp/lxml_test.py", closeit=1, > flags=0x7fbfffe80c) at Python/pythonrun.c:879 > #6 0x0000002a9575a548 in Py_Main (argc=Variable "argc" is > not available. > ) at Modules/main.c:532 > > I'm using the .c files bundled in the tar file (so I haven't > re-built them by cython). > > Zsolt > > > > -----Original Message----- > > From: Stefan Behnel [mailto:stefan_ml at behnel.de] > > Sent: Tuesday, February 23, 2010 13:24 > > To: Cserna, Zsolt (IDEAS) > > Cc: lxml-dev at codespeak.net > > Subject: Re: [lxml-dev] ElementTree.iterparse segfault > > > > Cserna, Zsolt, 23.02.2010 12:25: > > > There's a test in selftest2.py which fails (results > segfault) on my > > > 64-bit linux host: > > > [...] > > > I've tested it with python 2.6.4, 2.5.4, and lxml 2.2.2 and 2.2.4 > > > (2.2.2 is linked with libxml2 2.7.3, libxslt 1.1.24, 2.2.4 > > is linked > > > with > > > libxml2 2.7.6). > > > > > > It only happens on 64-bit linux (RHEL 4), my 32-bit linux systems > > > (RHEL > > > 3 and RHEL 4) are fine. > > > > > > Any advise on this bug? Could you check that it also > happens on you > > > 64-bit linux system? > > > > Just tested with the current trunk on 64bit - no problem there. The > > test suite in the current 2.2 branch also passes for me. > > > > Could you run the test through gdb to get a stack trace of the > > segfault? > > > > Stefan
From stefan_ml at behnel.de Tue Feb 23 14:50:51 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 23 Feb 2010 14:50:51 +0100 Subject: [lxml-dev] ElementTree.iterparse segfault In-Reply-To: <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> References: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> <4B83C8DC.80102@behnel.de> <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> Message-ID: <4B83DD3B.7060908@behnel.de> Cserna, Zsolt, 23.02.2010 14:34: > Stefan Behnel wrote: >> Cserna, Zsolt, 23.02.2010 12:25: >>> There's a test in selftest2.py which fails (results segfault) on my >>> 64-bit linux host: >>> [...] >>> I've tested it with python 2.6.4, 2.5.4, and lxml 2.2.2 and 2.2.4 >>> (2.2.2 is linked with libxml2 2.7.3, libxslt 1.1.24, 2.2.4 >>> is linked with libxml2 2.7.6). >>> >>> It only happens on 64-bit linux (RHEL 4), my 32-bit linux systems >>> (RHEL >>> 3 and RHEL 4) are fine. >>> >>> Any advise on this bug? Could you check that it also happens on you >>> 64-bit linux system? >> Just tested with the current trunk on 64bit - no problem >> there. The test suite in the current 2.2 branch also passes for me. >> >> Could you run the test through gdb to get a stack trace of >> the segfault?
>> > The backtrace is the following: > > #0 __pyx_pf_4lxml_5etree_9iterparse___next__ (__pyx_v_self=0x2a96d12110) at src/lxml/lxml.etree.c:86174 > #1 0x0000002a95725dee in PyEval_EvalFrameEx (f=0x573030, throwflag=Variable "throwflag" is not available. > ) at Python/ceval.c:2195 > #2 0x0000002a9572c5f5 in PyEval_EvalCodeEx (co=0x2a96040558, globals=Variable "globals" is not available. > ) at Python/ceval.c:2875 > #3 0x0000002a9572c772 in PyEval_EvalCode (co=Variable "co" is not available. > ) at Python/ceval.c:514 > #4 0x0000002a9574e60c in PyRun_FileExFlags (fp=0x501010, filename=0x7fbfffed5e "/home/zsolt/devel/python/temp/lxml_test.py", start=Variable "start" is not available. > ) at Python/pythonrun.c:1273 > #5 0x0000002a9574f293 in PyRun_SimpleFileExFlags (fp=0x501010, filename=0x7fbfffed5e "/home/zsolt/devel/python/temp/lxml_test.py", closeit=1, flags=0x7fbfffe80c) at Python/pythonrun.c:879 > #6 0x0000002a9575a548 in Py_Main (argc=Variable "argc" is not available. > ) at Modules/main.c:532 > > I'm using the .c files bundled in the tar file (so I haven't re-built them by cython). You didn't write if this is with 2.2.2 or 2.2.4, but neither of the two release versions shows anything in line 86174 that could potentially segfault. You also omitted the error output of gdb, so I don't know what the actual problem is here. For now, I would suspect that it may even be a gcc problem (what version do you use?), doesn't look like anything is broken in lxml itself. 
Stefan From Zsolt.Cserna at MorganStanley.com Tue Feb 23 15:48:14 2010 From: Zsolt.Cserna at MorganStanley.com (Cserna, Zsolt) Date: Tue, 23 Feb 2010 14:48:14 +0000 Subject: [lxml-dev] ElementTree.iterparse segfault In-Reply-To: <4B83DD3B.7060908@behnel.de> References: <0FE1D5D2B5C6754898C9E32C1230EC656E38E9747B@LNWEXMBX0105.msad.ms.com> <4B83C8DC.80102@behnel.de> <0FE1D5D2B5C6754898C9E32C1230EC656E38E974D8@LNWEXMBX0105.msad.ms.com> <4B83DD3B.7060908@behnel.de> Message-ID: <0FE1D5D2B5C6754898C9E32C1230EC656E38E97500@LNWEXMBX0105.msad.ms.com> > You didn't write if this is with 2.2.2 or 2.2.4, but neither > of the two release versions shows anything in line 86174 that > could potentially segfault. You also omitted the error output > of gdb, so I don't know what the actual problem is here. I've tried 2.2.4. In line 86174 there's an if, and I agree with you that it should not cause any problems. Weird.. Maybe -O3 caused this wrong gcc backtrace. > For now, I would suspect that it may even be a gcc problem > (what version do you use?), doesn't look like anything is > broken in lxml itself. Yes, what I forgot to mention that I use different gcc on 32 bit (3.2.3) and 64-bit (3.4.5). So this also could have been the problem - now it works fine for me to compile with -O2 on 64-bit. Thanks, Zsolt
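Since the thread above ends with the optimisation level being the likely trigger, it can help to check which -O switches a given Python build hands to extension modules by default. The following is only a rough sketch under assumptions: optimisation_flags() is a helper name invented for this example, and the OPT and CFLAGS configuration variables are normally only populated on Unix-style builds (they may be None elsewhere). It does not inspect how an already installed lxml was actually compiled.

```python
import shlex
import sysconfig


def optimisation_flags(flag_string):
    """Return the -O<level> switches found in a compiler flag string, in order.

    gcc uses the last -O option it is given, so the final entry of the
    returned list is the level that takes effect.
    """
    return [tok for tok in shlex.split(flag_string or "") if tok.startswith("-O")]


# OPT and CFLAGS come from the Makefile the interpreter itself was built with.
for name in ("OPT", "CFLAGS"):
    value = sysconfig.get_config_var(name)  # None if the variable is not recorded
    print(name, optimisation_flags(value))
```

If the last switch listed is -O3 and a build misbehaves the way described above, rebuilding the extension with a lower level is a cheap experiment before suspecting the library code.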
From cthedot at gmail.com Tue Feb 23 19:26:20 2010 From: cthedot at gmail.com (Christof) Date: Tue, 23 Feb 2010 19:26:20 +0100 Subject: [lxml-dev] Question about newlines In-Reply-To: References: <4B811354.9010400@behnel.de> Message-ID: <4B841DCC.6020300@gmail.com> On 22.02.2010 23:15, fantasai wrote: > On 02/21/2010 03:04 AM, Stefan Behnel wrote: >> fantasai, 20.02.2010 04:40: >>> Maybe you want to make it a parse-time option, but I think white space outside >>> the root should be preserved. And stripping the addition to the document root >>> context only if it's not all whitespace seems to me the right thing to do. >> >> Why would you want to preserve that space? It doesn't have any meaning in XML. > > a) As with the original poster, to avoid unnecessary diff thrashing > b) Because these files are edited by hand, and stripping whitespace makes them > harder to read > c) To test the effect of whitespace in the source file on CSS+XML and CSS+HTML > rendering engines (since the files I'm working with happen to be CSS tests). > > c is obviously not a common case :) But a) and b) are relevant to others. (Followed the discussion only sporadically and hope I did not miss important parts but my 2 cents:) I think a) and b) are - as Stefan wrote in his answer - irrelevant if you keep out any WS outside the root element out of any XML file in the first place. Even when edited by hand WS outside the root is not necessary at all. IMHO an XML parser should only keep what is actually relevant spec-wise and the XML spec (in contrast to HTML or CSS ;) is quite strict. Re c): An XML parser should probably not be used here anyway. Guess the lxml HTML parser would be better. I guess HTML being a very special case (not even counting HTML5 ;) may be reason enough for a parser to do these extra tricks... Have not tried it but maybe there would be room and reason to change something. 
BTW, somehow related I recall lxml stripping any DOCTYPE out of an XHTML file as well when using lxml.html. Is this still the case or have I missed something? Christof From fantasai.lists at inkedblade.net Tue Feb 23 20:14:17 2010 From: fantasai.lists at inkedblade.net (fantasai) Date: Tue, 23 Feb 2010 11:14:17 -0800 Subject: [lxml-dev] Question about newlines In-Reply-To: <4B83A144.4010201@behnel.de> References: <4B811354.9010400@behnel.de> <4B83A144.4010201@behnel.de> Message-ID: On 02/23/2010 01:35 AM, Stefan Behnel wrote: > fantasai, 22.02.2010 23:15: >> On 02/21/2010 03:04 AM, Stefan Behnel wrote: >>> fantasai, 20.02.2010 04:40: >>>> Maybe you want to make it a parse-time option, but I think white space outside >>>> the root should be preserved. And stripping the addition to the document root >>>> context only if it's not all whitespace seems to me the right thing to do. >>> Why would you want to preserve that space? It doesn't have any meaning in XML. >> >> a) As with the original poster, to avoid unnecessary diff thrashing >> b) Because these files are edited by hand, and stripping whitespace makes them >> harder to read >> c) To test the effect of whitespace in the source file on CSS+XML and CSS+HTML >> rendering engines (since the files I'm working with happen to be CSS tests). >> >> c is obviously not a common case :) But a) and b) are relevant to others. > > I have to express my doubts here, but your mileage may obviously vary. I > don't see how whitespace before the root tag makes things more readable. > And to keep it out of diffs, all you have to do is keep it out of the file. Given that it seems to be stripping whitespace between the doctype and the root element's start tag, I have to disagree. (Also between comments outside the root element.) > That said, it's quite possible that the serialising code in lxml.etree is > the culprit here. It only special cases PIs and comments around the root > node, not text nodes. 
I've tried using both the lxml serializer and the html5lib serializer. It doesn't seem to be a serializer problem. > So please open a bug report on this for now. Ok. ~fantasai From mateusz-lists at ant.gliwice.pl Thu Feb 25 10:24:14 2010 From: mateusz-lists at ant.gliwice.pl (Mateusz Korniak) Date: Thu, 25 Feb 2010 10:24:14 +0100 Subject: [lxml-dev] Why Hi ! I am trying to perform simple task, "fix" any (X)HTML page to valid XML (XHTML) page. I am parsing source with lxml.html.HTMLParser(remove_comments=True,remove_pis=True) parser, than generating XML: lxml.etree.tostring(html_doc,pretty_print=True,with_tail=False,method="xml") To verify results I parse result but now using recovering_xml_parser = lxml.etree.XMLParser(recover=True) Whole code is attached. I works great except case when source has References: <201002251024.14299.mateusz-lists@ant.gliwice.pl> Message-ID: <433ebc871002251805h60bc4ee2ld7eb430dd1d07e6c@mail.gmail.com> On Thu, Feb 25, 2010 at 3:24 AM, Mateusz Korniak wrote: > Hi ! > I am trying to perform simple task, "fix" any (X)HTML page to valid XML > (XHTML) page. > I am parsing source with > lxml.html.HTMLParser(remove_comments=True,remove_pis=True) ?parser, > than generating XML: > lxml.etree.tostring(html_doc,pretty_print=True,with_tail=False,method="xml") > > To verify results I parse result but now using > recovering_xml_parser = lxml.etree.XMLParser(recover=True) > > Whole code is ?attached. > > I works great except case when source has > declaration, then instead of expected tags like 'body' I end up with > '{http://www.w3.org/1999/xhtml}body' ?:/ > > My questions are: > 1) why it happens ? > 2) how one should parse HTML of any kind to always end up with XML which after > parsing contains tags values without any namespace ? Hi Mateusz, You may have some wrong ideas about HTML, XML, namespaces, and lxml. 1. HTML that isn't XHTML doesn't have namespaces; HTML constructs that look like XML namespace declarations are actually HTML attributes. 
I suggest you invest some time in learning more about the HTML and XML standards. In particular, make sure you understand how XML namespaces work. 2. I strongly suggest you spend some time using the Python interactive prompt to examine the results of each step of your code; I don't think that what is actually happening is quite what you think is happening. Investing time in learning how to use Python interactive sessions this way will pay off. 3. The lxml.etree.cleanup_namespaces() function doesn't do what you seem to think it does. Re-read the API docs about it. I think you can figure out how to figure out what the problems are with your approach. Here's one hint: your code actually succeeds in removing the HTML 'xmlns' attribute, but a namespace declaration gets inserted later. When you understand where that happens and why, that will be a sign of progress. Best wishes, Chuck > Thanks in advance for any help, hints > > Regards, > -- > Mateusz Korniak > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > From roby.brunelli at gmail.com Fri Feb 26 09:08:01 2010 From: roby.brunelli at gmail.com (roby.brunelli at gmail.com) Date: Fri, 26 Feb 2010 08:08:01 +0000 Subject: [lxml-dev] Problem writing special HTML characters &#... Message-ID: <00163649939ff5f55004807c6566@google.com> I'm trying to write an RSS file (extracting information from an html page) using etree.ElementTree(..).write(..) When I create the description part of a news I insert text with special characters such as: ? and when I print (or write to file) the corresponding element, I get È which I do not want (I want the original special char): is there a way to prevent this kind of mapping?? Thanks a lot, -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100226/f6d35e28/attachment.htm From roby.brunelli at gmail.com Fri Feb 26 12:20:55 2010 From: roby.brunelli at gmail.com (Roberto Brunelli) Date: Fri, 26 Feb 2010 12:20:55 +0100 Subject: [lxml-dev] Problem writing special HTML characters &#... In-Reply-To: References: <00163649939ff5f55004807c6566@google.com> Message-ID: <45e891581002260320lb8240a4x477a944e5f91cab4@mail.gmail.com> Jens, thanks for the hints, but I still do not understand how to solve the problem I have. Just a couple of steps to better show it: RSSroot = etree.Element('rss') etree.SubElement(RSSroot, 'title').text = '& # 200;' # space between & # added here just to make sure the actual chars are shown print etree.tostring(RSSroot) and I get &#200; so the '&' turns out to be sanitized, while I wanted the special charcater È to go along ... Roberto On Fri, Feb 26, 2010 at 10:39 AM, Jens Quade wrote: > > On 26.02.2010, at 09:08, roby.brunelli at gmail.com wrote: > >> I'm trying to write an RSS file (extracting information from an html page) using >> >> etree.ElementTree(..).write(..) >> >> When I create the description part of a news I insert text with special characters such as: >> >> ? >> >> and when I print (or write to file) the corresponding element, I get >> >> È >> >> which I do not want (I want the original special char): is there a way to prevent this kind of mapping?? > >>>> from lxml import etree > >>>> x = etree.XML('?') >>>> etree.ElementTree(x).write(sys.stdout) > ü > >>>> etree.ElementTree(x).write(sys.stdout, encoding='utf-8') > ? > > also: > >>>> print etree.tostring(x,encoding='utf-8') > ? > > > default encoding is ascii. > > From stefan_ml at behnel.de Fri Feb 26 13:13:37 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Feb 2010 13:13:37 +0100 Subject: [lxml-dev] Problem writing special HTML characters &#... 
In-Reply-To: <45e891581002260320lb8240a4x477a944e5f91cab4@mail.gmail.com>
References: <00163649939ff5f55004807c6566@google.com> <45e891581002260320lb8240a4x477a944e5f91cab4@mail.gmail.com>
Message-ID: <4B87BAF1.5050003@behnel.de>

Hi,

please don't top-post.

Roberto Brunelli, 26.02.2010 12:20:
> On Fri, Feb 26, 2010 at 10:39 AM, Jens Quade wrote:
>> On 26.02.2010, at 09:08, roby.brunelli at gmail.com wrote:
>>
>>> I'm trying to write an RSS file (extracting information from an html
>>> page) using
>>>
>>> etree.ElementTree(..).write(..)
>>>
>>> When I create the description part of a news I insert text with special
>>> characters such as:
>>>
>>> È
>>>
>>> and when I print (or write to file) the corresponding element, I get
>>>
>>> &#200;
>>>
>>> which I do not want (I want the original special char): is there a way
>>> to prevent this kind of mapping??
>>>>> from lxml import etree
>>>>> x = etree.XML('?')
>>>>> etree.ElementTree(x).write(sys.stdout)
>> &#252;
>>
>>>>> etree.ElementTree(x).write(sys.stdout, encoding='utf-8')
>> ü
>>
>> also:
>>
>>>>> print etree.tostring(x,encoding='utf-8')
>> ü
>>
>> default encoding is ascii.
>>
> thanks for the hints, but I still do not understand how to solve the
> problem I have.
> Just a couple of steps to better show it:
>
> RSSroot = etree.Element('rss')
> # space between & # added here just to make sure the actual chars are shown
> etree.SubElement(RSSroot, 'title').text = '& # 200;'
> print etree.tostring(RSSroot)
>
> and I get
>
> &amp;#200;
>
> so the '&' turns out to be sanitized, while I wanted the special
> character È to go along ...

So, what is it that you want in the serialised XML: 'È' or '&#200;' ? Jens
showed you how to get to both.

Stefan

From manu3d at gmail.com Fri Feb 26 14:13:03 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Fri, 26 Feb 2010 13:13:03 +0000
Subject: [lxml-dev] Architecture/best practice question.
Message-ID: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com>

Hi everybody,

a bit of a general architecture/best practice question.

Say you want to keep in sync an ElementTree with a separate tree structure,
one that is parallel but does not have the exact same nodes and yet needs to
be informed and updated whenever a change in the ElementTree occurs.

ElementTree supports custom elements and I guess it wouldn't be too
difficult to override the standard methods of an element to do something
before or after any change. -However-, I understand that ElementProxies
cannot store instance-level data as the instances are not persistent and are
garbage collected more or less as soon as they are no longer referenced
somewhere.

So, what I'm wondering is, how do I tell a method of a custom element what
object in the parallel structure to inform whenever a change arises?

I guess one way would be to store at -class level- (or where else?) a
dictionary mapping custom ElementProxies instances to nodes of the parallel
structure. In so doing whenever a custom method is executed it can get hold
of the parallel structure. Is that a reasonable way to do it or are there
better ones?

Manu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100226/012bed7c/attachment.htm

From stefan_ml at behnel.de Sun Feb 28 11:15:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 28 Feb 2010 11:15:33 +0100
Subject: [lxml-dev] lxml 2.2.5 released
Message-ID: <4B8A4245.1040208@behnel.de>

Hi,

I just released lxml 2.2.5 to PyPI.

http://pypi.python.org/pypi/lxml/2.2.5

This is a bug fix release for the stable 2.2 series. It fixes three crash
bugs in XPath, XSLT and lxml.objectify that occurred on certain operations.
Updating is generally recommended, but not required if these did not affect
your code so far.
Stefan


2.2.5 (2010-02-28)

Features added

* Support for running XSLT extension elements on the input root node
  (e.g. in a template matching on "/").

Bugs fixed

* Crash in XPath evaluation when reading smart strings from a document
  other than the original context document.
* Support recent versions of html5lib by not requiring its XHTMLParser
  in htmlparser.py anymore.
* Manually instantiating the custom element classes in lxml.objectify
  could crash.
* Invalid XML text characters were not rejected by the API when they
  appeared in unicode strings directly after non-ASCII characters.
* lxml.html.open_http_urllib() did not work in Python 3.
* The functions strip_tags() and strip_elements() in lxml.etree did
  not remove all occurrences of a tag in all cases.
* Crash in XSLT extension elements when the XSLT context node is not
  an element.

From stefan_ml at behnel.de Sun Feb 28 12:18:46 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 28 Feb 2010 12:18:46 +0100
Subject: [lxml-dev] lxml 2.2.5 released
In-Reply-To: <4B8A4245.1040208@behnel.de>
References: <4B8A4245.1040208@behnel.de>
Message-ID: <4B8A5116.5080407@behnel.de>

Stefan Behnel, 28.02.2010 11:15:
> I just released lxml 2.2.5 to PyPI.
>
> http://pypi.python.org/pypi/lxml/2.2.5

Forgot to say that this release was built with Cython 0.12.1.

> This is a bug fix release for the stable 2.2 series. It fixes three crash
> bugs in XPath, XSLT and lxml.objectify that occurred on certain operations.
> Updating is generally recommended, but not required if these did not affect
> your code so far.
>
> Stefan
>
>
> 2.2.5 (2010-02-28)
>
> Features added
>
> * Support for running XSLT extension elements on the input root node
>   (e.g. in a template matching on "/").
>
> Bugs fixed
>
> * Crash in XPath evaluation when reading smart strings from a document
>   other than the original context document.
> * Support recent versions of html5lib by not requiring its XHTMLParser
>   in htmlparser.py anymore.
> * Manually instantiating the custom element classes in lxml.objectify
>   could crash.
> * Invalid XML text characters were not rejected by the API when they
>   appeared in unicode strings directly after non-ASCII characters.
> * lxml.html.open_http_urllib() did not work in Python 3.
> * The functions strip_tags() and strip_elements() in lxml.etree did
>   not remove all occurrences of a tag in all cases.
> * Crash in XSLT extension elements when the XSLT context node is not
>   an element.
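
Editorial sketch for the "Problem writing special HTML characters" thread
earlier in this digest. It is not taken from any of the messages. It only
illustrates the two points Jens and Stefan make there: serialisation defaults
to ASCII, and text that merely looks like a character reference is treated as
data. As an assumption, it uses the standard library's xml.etree.ElementTree
in Python 3 instead of lxml, so that it runs without lxml installed;
lxml.etree.tostring() takes an encoding argument as well, as Jens's reply
shows. The helper name build_feed is hypothetical, and the element names
follow Roberto's example.

```python
import xml.etree.ElementTree as ET


def build_feed(title_text):
    """Build a tiny <rss><title>...</title></rss> tree (names from the thread)."""
    root = ET.Element('rss')
    ET.SubElement(root, 'title').text = title_text
    return root


# Assign the character itself, not the text of a character reference.
feed = build_feed('\u00c8')  # U+00C8 is the capital E with grave accent

# Default serialisation is ASCII, so the character is written as a reference.
ascii_xml = ET.tostring(feed)

# A Unicode-capable encoding keeps the character as it is.
utf8_xml = ET.tostring(feed, encoding='utf-8')

# Text that only looks like a reference is plain data: its '&' gets escaped.
escaped_xml = ET.tostring(build_feed('&#200;'))

print(ascii_xml)
print(utf8_xml.decode('utf-8'))
print(escaped_xml)
```

The first result carries the reference &#200;, the second the literal
character, and the third &amp;#200;, which is the output Roberto reported
after assigning the reference as text.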