From martin at martinthomas.net Tue May 1 04:55:24 2007 From: martin at martinthomas.net (Martin Thomas) Date: Mon, 30 Apr 2007 21:55:24 -0500 Subject: [lxml-dev] Whoops, Internal Error In-Reply-To: <20070430111242.5nn8bxf4ragowck0@64.40.144.195> References: <20070430111242.5nn8bxf4ragowck0@64.40.144.195> Message-ID: <1177988124.7775.31.camel@tigger> I have attached the file to be validated and the schema that was causing a problem (there are 5 more schemas involved but I didn't think you needed them - either download them or ask me to email them) as well as a python script that I used to create the problem. The output is as follows: ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: xmlSchemaIDCRegisterMatchers, Could not find an augmented IDC item for an IDC definition. ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: xmlSchemaValidateElem, calling xmlSchemaValidateElemDecl(). ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: xmlSchemaDocWalk, calling xmlSchemaValidateElem(). The line number in ios.xml corresponds to a cpe-list that is defined in the attached schema. If I remove it from ios.xml, everything else passes. Cheers // Martin On Mon, 2007-04-30 at 11:12 -0500, martin at martinthomas.net wrote: > Using the lxml rpm for FC6 and Python 2.4, I get an internal error > when I try validating a document against a XMLschema document. The > xml document that I am trying to validate and the XMLschema which I am > validating against both came from NIST (contained in the 'Complete > 1.1.3 Schema Bundle .zip' at http://nvd.nist.gov/scap/xccdf/xccdf.cfm). > > The error message reads Internal error: xmlSchemaIDCRegisterMatchers, > Could not find an augmented IDC item for an IDC definition. > > I'll write this up properly tonight and send in an error log, along > with all the schema documents etc unless someone tells me otherwise. 
> > Cheers // Martin > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -------------- next part -------------- A non-text attachment was scrubbed... Name: cpe-1.0.xsd Type: application/xml Size: 7544 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070430/4016ba70/attachment-0002.xml -------------- next part -------------- A non-text attachment was scrubbed... Name: ios.xml Type: application/xml Size: 14944 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070430/4016ba70/attachment-0003.xml -------------- next part -------------- A non-text attachment was scrubbed... Name: xsd.py Type: text/x-python Size: 215 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070430/4016ba70/attachment-0001.py From stefan_ml at behnel.de Tue May 1 08:52:15 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 01 May 2007 08:52:15 +0200 Subject: [lxml-dev] Whoops, Internal Error In-Reply-To: <1177988124.7775.31.camel@tigger> References: <20070430111242.5nn8bxf4ragowck0@64.40.144.195> <1177988124.7775.31.camel@tigger> Message-ID: <4636E39F.1050900@behnel.de> Hi Martin, a quick test (after renaming cpe-1.0.xsd to xccdf-1.1.xsd) didn't show any problem with lxml trunk and libxml2 2.6.17. Since you didn't mention any of the versions of lxml or libxml2 you are using, I assume it's just a problem with an older libxml2 version. XML-Schema is still under development in libxml2, so any newer version is likely to provide better support and bug fixes. Please upgrade and retry. Regards, Stefan Martin Thomas wrote: > I have attached the file to be validated and the schema that was causing > a problem (there are 5 more schemas involved but I didn't think you > needed them - either download them or ask me to email them) as well as a > python script that I used to create the problem. 
> > The output is as follows: > ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: > xmlSchemaIDCRegisterMatchers, Could not find an augmented IDC item for > an IDC definition. > ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: > xmlSchemaValidateElem, calling xmlSchemaValidateElemDecl(). > ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: > xmlSchemaDocWalk, calling xmlSchemaValidateElem(). > > The line number in ios.xml corresponds to a cpe-list that is defined in > the attached schema. If I remove it from ios.xml, everything else > passes. > > Cheers // Martin > > > > On Mon, 2007-04-30 at 11:12 -0500, martin at martinthomas.net wrote: >> Using the lxml rpm for FC6 and Python 2.4, I get an internal error >> when I try validating a document against a XMLschema document. The >> xml document that I am trying to validate and the XMLschema which I am >> validating against both came from NIST (contained in the 'Complete >> 1.1.3 Schema Bundle .zip' at http://nvd.nist.gov/scap/xccdf/xccdf.cfm). >> >> The error message reads Internal error: xmlSchemaIDCRegisterMatchers, >> Could not find an augmented IDC item for an IDC definition. >> >> I'll write this up properly tonight and send in an error log, along >> with all the schema documents etc unless someone tells me otherwise. 
>> >> Cheers // Martin >> >> _______________________________________________ >> lxml-dev mailing list >> lxml-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/lxml-dev >> >> ------------------------------------------------------------------------ >> >> from lxml import etree >> >> xsd = etree.ElementTree(file='xccdf-1.1.xsd') >> >> doc = etree.ElementTree(file='ios.xml') >> >> xsv = etree.XMLSchema(xsd) >> try: >> xsv.validate(doc) >> except Exception, e: >> pass >> >> print e.error_log >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> lxml-dev mailing list >> lxml-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/lxml-dev From martin at martinthomas.net Tue May 1 16:06:34 2007 From: martin at martinthomas.net (martin at martinthomas.net) Date: Tue, 01 May 2007 09:06:34 -0500 Subject: [lxml-dev] Whoops, Internal Error Message-ID: <20070501090634.69ncpr4ilxukg0co@64.40.144.195> Stefan, I can reproduce this problem on Cygwin and Fedora Core 6 which have libxml2 2.6.26 and 2.6.26 respectively. Sorry if I have confused things by sending that schema doc. There are 6 different schema documents involved: xccdf-1.1.xsd xccdfp-1.1.xsd xml.xsd platform-0.2.3.xsd cpe-1.0.xsd simpledc20021212.xsd They are available from nist.gov in the zip file at the URL I gave earlier. I only attached the CPE schema because the element causing the internal error belongs in the CPE namespace and I didn't want to attach documents that are publicly available. As it turns out, this is an error in libxml2.. if I use xmllint, I get the same error message. I'll send them a bug report. Thanks // M Quoting Stefan Behnel : > Hi Martin, > > a quick test (after renaming cpe-1.0.xsd to xccdf-1.1.xsd) didn't show any > problem with lxml trunk and libxml2 2.6.17. 
Since you didn't mention any of > the versions of lxml or libxml2 you are using, I assume it's just a problem > with an older libxml2 version. XML-Schema is still under development in > libxml2, so any newer version is likely to provide better support and bug > fixes. Please upgrade and retry. > > Regards, > Stefan > > > Martin Thomas wrote: >> I have attached the file to be validated and the schema that was causing >> a problem (there are 5 more schemas involved but I didn't think you >> needed them - either download them or ask me to email them) as well as a >> python script that I used to create the problem. >> >> The output is as follows: >> ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: >> xmlSchemaIDCRegisterMatchers, Could not find an augmented IDC item for >> an IDC definition. >> ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: >> xmlSchemaValidateElem, calling xmlSchemaValidateElemDecl(). >> ios.xml:41:ERROR:SCHEMASV:SCHEMAV_INTERNAL: Internal error: >> xmlSchemaDocWalk, calling xmlSchemaValidateElem(). >> >> The line number in ios.xml corresponds to a cpe-list that is defined in >> the attached schema. If I remove it from ios.xml, everything else >> passes. >> >> Cheers // Martin >> >> >> >> On Mon, 2007-04-30 at 11:12 -0500, martin at martinthomas.net wrote: >>> Using the lxml rpm for FC6 and Python 2.4, I get an internal error >>> when I try validating a document against a XMLschema document. The >>> xml document that I am trying to validate and the XMLschema which I am >>> validating against both came from NIST (contained in the 'Complete >>> 1.1.3 Schema Bundle .zip' at http://nvd.nist.gov/scap/xccdf/xccdf.cfm). >>> >>> The error message reads Internal error: xmlSchemaIDCRegisterMatchers, >>> Could not find an augmented IDC item for an IDC definition. >>> >>> I'll write this up properly tonight and send in an error log, along >>> with all the schema documents etc unless someone tells me otherwise. 
>>> >>> Cheers // Martin >>> >>> _______________________________________________ >>> lxml-dev mailing list >>> lxml-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/lxml-dev >>> >>> ------------------------------------------------------------------------ >>> >>> from lxml import etree >>> >>> xsd = etree.ElementTree(file='xccdf-1.1.xsd') >>> >>> doc = etree.ElementTree(file='ios.xml') >>> >>> xsv = etree.XMLSchema(xsd) >>> try: >>> xsv.validate(doc) >>> except Exception, e: >>> pass >>> >>> print e.error_log >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> lxml-dev mailing list >>> lxml-dev at codespeak.net >>> http://codespeak.net/mailman/listinfo/lxml-dev > From jholg at gmx.de Thu May 3 11:42:59 2007 From: jholg at gmx.de (Holger Joukl) Date: Thu, 03 May 2007 11:42:59 +0200 Subject: [lxml-dev] etree.XMLSchema generic error "Document is not valid XML Schema" Message-ID: <20070503094259.131360@gmx.net> Hi, is there any chance to expose more detailed information on why a schema document is not a valid XML Schema? A brief look at libxml2 schema API did not really tell me anything. Regards, Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070503/c712bb2d/attachment.htm From tseaver at palladion.com Thu May 3 19:20:12 2007 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 03 May 2007 13:20:12 -0400 Subject: [lxml-dev] etree.XMLSchema generic error "Document is not valid XML Schema" In-Reply-To: <20070503094259.131360@gmx.net> References: <20070503094259.131360@gmx.net> Message-ID: <463A19CC.7020809@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Holger Joukl wrote: > Hi, > is there any chance to expose more detailed information on why a schema > document is not a valid XML Schema? > A brief look at libxml2 schema API did not really tell me anything. 
Can you try validating it with xmllint? If you get the same error message, then the bug / problem is in libxml2, rather than lxml. Tres. - -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGOhnM+gerLs4ltQ4RAufUAKCJXoSOljP09ufsvyS8O0jS+D+8XACg1n7W TigLDp78G/4O3wcdKqXmy7c= =209+ -----END PGP SIGNATURE----- From jholg at gmx.de Fri May 4 09:23:09 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 04 May 2007 09:23:09 +0200 Subject: [lxml-dev] etree.XMLSchema generic error "Document is not valid XML Schema" In-Reply-To: <463A19CC.7020809@palladion.com> References: <20070503094259.131360@gmx.net> <463A19CC.7020809@palladion.com> Message-ID: <20070504072309.223240@gmx.net> Hi, > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > Holger Joukl wrote: > > Hi, > > is there any chance to expose more detailed information on why a schema > > document is not a valid XML Schema? > > A brief look at libxml2 schema API did not really tell me anything. > > Can you try validating it with xmllint? If you get the same error > message, then the bug / problem is in libxml2, rather than lxml. This is not a bug in libxml2 nor lxml, it's a feature request. I know that the schema is not a valid schema, but I'd like to see the reason why. lxml does currently not present any errors from the libxml2 layer when instantiating a schema. If this is possible at all, I don't know. Of course, it is always possible to use another tool, like a good schema editor, but for quick small hacks it would be really convenient to see what you missed. Holger -- "Feel free" - 10 GB Mailbox, 100 FreeSMS/Monat ... 
Jetzt GMX TopMail testen: http://www.gmx.net/de/go/topmail From erik.swanson at gmail.com Fri May 4 18:23:01 2007 From: erik.swanson at gmail.com (Erik Swanson) Date: Fri, 4 May 2007 09:23:01 -0700 Subject: [lxml-dev] lxml.sax.saxify breaks on comments; `make test` failure on MacPython 2.5.1 Message-ID: <57993d730705040923h56da9c8fta43bebbb85556e0@mail.gmail.com> There appears to be a bug with lxml.sax's handling of comments, as the following code causes lxml.sax.saxify to fail: """ import lxml.etree, lxml.sax, xml.sax.handler from cStringIO import StringIO p = lxml.etree.HTMLParser(remove_blank_text=True) h = xml.sax.handler.ContentHandler() f = StringIO("

bar

") t = lxml.etree.parse(f, p) lxml.sax.saxify(t, h) """ """ Traceback (most recent call last): File "saxBug.py", line 11, in lxml.sax.saxify(t, h) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 178, in saxify return ElementTreeProducer(element_or_tree, content_handler).saxify() File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 130, in saxify self._recursive_saxify(self._element, {}) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 160, in _recursive_saxify self._recursive_saxify(child, prefixes) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 160, in _recursive_saxify self._recursive_saxify(child, prefixes) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 149, in _recursive_saxify ns_uri, local_name = _getNsTag(element.tag) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/lxml- 1.3beta-py2.5-macosx-10.4-i386.egg/lxml/sax.py", line 8, in _getNsTag if tag[0] == '{': TypeError: 'builtin_function_or_method' object is unsubscriptable """ I have been able to replicate the above error with both release and svn lxml, as well as with both Apple-supplied libxml2/libxslt and up-to-date libraries. 
Also, and I doubt this is related, but `make test` fails for me on OS X 10.4.9 with MacPython 2.5.1 (python.org binary): """ python test.py -p -v TESTED VERSION: Python: (2, 5, 1, 'final', 0) lxml.etree: (1, 3, -1, 42667) libxml used: (2, 6, 28) libxml compiled: (2, 6, 28) libxslt used: (1, 1, 20) libxslt compiled: (1, 1, 20) 733/733 (100.0%): Doctest: xpathxslt.txt ====================================================================== FAIL: test_module_HTML_unicode ( lxml.tests.test_htmlparser.HtmlParserTestCaseBase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 260, in run testMethod() File "/Users/erik/Projects/lxml/src/lxml/tests/test_htmlparser.py", line 33, in test_module_HTML_unicode self.uhtml_str) File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", line 334, in failUnlessEqual (msg or '%r != %r' % (first, second)) AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u'test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

' ---------------------------------------------------------------------- Ran 733 tests in 1.380s FAILED (failures=1) """ -- Erik Swanson -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070504/ccd2b4b6/attachment-0001.htm From stefan_ml at behnel.de Fri May 4 19:26:15 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 04 May 2007 19:26:15 +0200 Subject: [lxml-dev] lxml.sax.saxify breaks on comments; `make test` failure on MacPython 2.5.1 In-Reply-To: <57993d730705040923h56da9c8fta43bebbb85556e0@mail.gmail.com> References: <57993d730705040923h56da9c8fta43bebbb85556e0@mail.gmail.com> Message-ID: <463B6CB7.4050404@behnel.de> Hi, Erik Swanson wrote: > There appears to be a bug with lxml.sax's handling of comments, as the > following code causes lxml.sax.saxify to fail: > > """ > import lxml.etree , lxml.sax, xml.sax.handler > from cStringIO import StringIO > > p = lxml.etree.HTMLParser(remove_blank_text=True) > h = xml.sax.handler.ContentHandler() > f = StringIO("

bar

") > t = lxml.etree.parse(f, p) > lxml.sax.saxify(t, h) > """ ah, yes, thanks for the report. This is due to the way ElementTree handles Element.tag for comments and processing instructions. They actually return their factory functions and lxml.etree follows them for compatibility. But the real problem is obviously in lxml.sax. It should handle comments correctly. I'll fix it. > Also, and I doubt this is related, but `make test` fails for me on OS X > 10.4.9 with MacPython 2.5.1 (python.org binary): > > """ > python test.py -p -v > > TESTED VERSION: > Python: (2, 5, 1, 'final', 0) > lxml.etree : (1, 3, -1, 42667) > libxml used: (2, 6, 28) > libxml compiled: (2, 6, 28) > libxslt used: (1, 1, 20) > libxslt compiled: (1, 1, 20) > > 733/733 (100.0%): Doctest: xpathxslt.txt > > ====================================================================== > FAIL: test_module_HTML_unicode ( > lxml.tests.test_htmlparser.HtmlParserTestCaseBase) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File > "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", > line 260, in run > testMethod() > File "/Users/erik/Projects/lxml/src/lxml/tests/test_htmlparser.py", > line 33, in test_module_HTML_unicode > self.uhtml_str) > File > "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/unittest.py", > line 334, in failUnlessEqual > (msg or '%r != %r' % (first, second)) > AssertionError: u'test > \xc3\x83\xc2\xa1\xef\xa3\x92

page > \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != > u'test \xc3\xa1\uf8d2

page > \xc3\xa1\uf8d2 title

' > > ---------------------------------------------------------------------- > Ran 733 tests in 1.380s > > FAILED (failures=1) > """ Good to know. Not a big problem, but an annoying one, as it breaks the test suite. I'll look into that, too. Thanks for the reports, Stefan From stefan_ml at behnel.de Sat May 5 12:30:33 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 05 May 2007 12:30:33 +0200 Subject: [lxml-dev] prepending PIs and serialising them - finally Message-ID: <463C5CC9.1020805@behnel.de> Hi, since the problem came up again when fixing the SAX issue, I finally decided that it was time for a way to prepend processing instructions to a tree. Elements now have general methods 'el.addprevious(sibling)' and 'el.addnext(sibling)'. They move the new sibling either before or after the element. The methods also check that you can't create a second root node next to another (TypeError) and will discard the tail text if adding at the top level. So they will (try to) prevent you from creating broken XML. Only PIs and comments are allowed as siblings of a root node, but any element can be used inside the tree. While I was at it, I also fixed the issue with writing out comments and PIs that are siblings of a root node. Note that only root nodes are special cased here, so that you get the complete document if you serialise the root node. Elements in the tree will not write out their siblings if you serialise them. Another step towards a shining new lxml 1.3, I'd say. :) Have fun, Stefan From stefan_ml at behnel.de Sat May 5 19:10:06 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 05 May 2007 19:10:06 +0200 Subject: [lxml-dev] feedback on XPath docs? Message-ID: <463CBA6E.1050302@behnel.de> Hi all, I happened to take a deeper skip over the XPath documentation page of lxml, and since there was loads of stuff missing, I decided to give it half a rewrite. 
Now I'm interested in feedback to see if it is understandable or if any other important (or interesting) stuff is missing. http://codespeak.net/lxml/dev/xpathxslt.html Any comments? Stefan PS: feedback on the other doc pages will be appreciated, too. :) From stefan_ml at behnel.de Mon May 7 05:45:36 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 May 2007 05:45:36 +0200 Subject: [lxml-dev] etree.XMLSchema generic error "Document is not valid XML Schema" In-Reply-To: <20070503094259.131360@gmx.net> References: <20070503094259.131360@gmx.net> Message-ID: <463EA0E0.8020001@behnel.de> Hi, Holger Joukl wrote: > is there any chance to expose more detailed information on why a schema > document is not a valid XML Schema? > A brief look at libxml2 schema API did not really tell me anything. Have you looked at the error log of the exception?
>>> from lxml.etree import XML, XMLSchema
>>> try:
...     XMLSchema(XML(""))
... except Exception, e:
...     print e.error_log
:0:ERROR:SCHEMASP:SCHEMAP_NOT_SCHEMA: The XML document '(null)' is not a schema document.
I assume the reported errors here are more telling if you pass a real schema document, though. Stefan From jholg at gmx.de Mon May 7 11:29:40 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 07 May 2007 11:29:40 +0200 Subject: [lxml-dev] etree.XMLSchema generic error "Document is not valid XML Schema" In-Reply-To: <463EA0E0.8020001@behnel.de> References: <20070503094259.131360@gmx.net> <463EA0E0.8020001@behnel.de> Message-ID: <20070507092940.181680@gmx.net> Hi, > Have you looked at the error log of the exception? > > >>> from lxml.etree import XML, XMLSchema > >>> try: > ... XMLSchema(XML("")) > ... except Exception, e: > ... print e.error_log > :0:ERROR:SCHEMASP:SCHEMAP_NOT_SCHEMA: The XML document '(null)' > is > not a schema document. Oh, right. I wasn't aware of the error_log concept in lxml exceptions. > I assume the reported errors here are more telling if you pass a real > schema > document, though. 
They are, indeed. Thanks, Holger -- "Feel free" - 10 GB Mailbox, 100 FreeSMS/Monat ... Jetzt GMX TopMail testen: http://www.gmx.net/de/go/topmail From jholg at gmx.de Mon May 7 13:26:35 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 07 May 2007 13:26:35 +0200 Subject: [lxml-dev] feedback on XPath docs? In-Reply-To: <463CBA6E.1050302@behnel.de> References: <463CBA6E.1050302@behnel.de> Message-ID: <20070507112635.225700@gmx.net> Hi, > > http://codespeak.net/lxml/dev/xpathxslt.html > > Any comments? I think it's well understandable. I'd grant the getpath convenience its own paragraph + heading as it is a nice little goodie and does not seem especially related to XPath return values ("A related convenience method of ElementTree objects is getpath(element),[...]") Suggestion: "XPath expression generation for Elements" If interesting stuff is missing I wouldn't know as the documentation now covers more than what I've ever used of lxml XPath capabilities. Holger -- "Feel free" - 10 GB Mailbox, 100 FreeSMS/Monat ... Jetzt GMX TopMail testen: http://www.gmx.net/de/go/topmail From Curtis at DAYCOS.com Mon May 7 21:03:44 2007 From: Curtis at DAYCOS.com (Curtis Scheer) Date: Mon, 7 May 2007 14:03:44 -0500 Subject: [lxml-dev] relaxNQ errors Message-ID: <031936836C46D611BB1B00508BE7345D0511687D@gatekeeper.daycos.com> New to the lxml library so forgive me if I missed this in the documentation, I am trying to validate an xml file against RelaxNG. The function relaxng() appears to only return a 1 or 0 based on whether it is valid. Is there a way I can get the specific error as to why it failed? Thanks, Curtis -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070507/ce9f2bff/attachment.htm From stefan_ml at behnel.de Mon May 7 22:19:41 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 07 May 2007 22:19:41 +0200 Subject: [lxml-dev] relaxNQ errors In-Reply-To: <031936836C46D611BB1B00508BE7345D0511687D@gatekeeper.daycos.com> References: <031936836C46D611BB1B00508BE7345D0511687D@gatekeeper.daycos.com> Message-ID: <463F89DD.4050401@behnel.de> Hi, Curtis Scheer wrote: > New to the lxml library so forgive me if I missed this in the > documentation, I am trying to validate an xml file against RelaxNG. The > function relaxng() appears to only return a 1 or 0 based on whether > it is valid. Is there a way I can get the specific error as to why it > failed? Please refer to the in-development docs of lxml, they are much easier to read. http://codespeak.net/lxml/dev/validation.html Hope it helps, Stefan From Curtis at DAYCOS.com Mon May 7 22:49:05 2007 From: Curtis at DAYCOS.com (Curtis Scheer) Date: Mon, 7 May 2007 15:49:05 -0500 Subject: [lxml-dev] relaxNQ errors Message-ID: <031936836C46D611BB1B00508BE7345D051168FC@gatekeeper.daycos.com> Thanks for the help, so far I am quite impressed with this library. -----Original Message----- From: Stefan Behnel [mailto:stefan_ml at behnel.de] Sent: Monday, May 07, 2007 3:20 PM To: Curtis Scheer Cc: lxml-dev at codespeak.net Subject: Re: [lxml-dev] relaxNQ errors Hi, Curtis Scheer wrote: > New to the lxml library so forgive me if I missed this in the > documentation, I am trying to validate an xml file against RelaxNG. The > function relaxng() appears to only return a 1 or 0 based on whether > it is valid. Is there a way I can get the specific error as to why it > failed? Please refer to the in-development docs of lxml, they are much easier to read. 
http://codespeak.net/lxml/dev/validation.html Hope it helps, Stefan _______________________________________________ lxml-dev mailing list lxml-dev at codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev From stefan_ml at behnel.de Sat May 12 17:35:57 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 12 May 2007 17:35:57 +0200 Subject: [lxml-dev] XPath exceptions Message-ID: <4645DEDD.9090604@behnel.de> Hi all, I've just rewritten the XPath exception generation code to provide better error messages on failure. lxml 1.3 will have two main XPath exceptions that inherit from XPathError: XPathSyntaxError and XPathEvalError. However, the problem is that they are not entirely, well, consistent. When you create an XPath object, you will nicely get an XPathSyntaxError if something goes wrong in the instantiation (parsing) and an XPathEvalError if you call evaluate(). But when you use the other two evaluators (or the xpath() method), parsing and evaluating are really one step, so you will always get an eval error. Meaning, you may get different exceptions from XPath() and the xpath() method for the same XPath expression. It is easy to raise different errors from the xpath() method depending on what libxml2 tells us about the error source. However, libxml2 also uses a generic error code for both syntax and eval errors if the error is not more specific (e.g. for a completely unparsable expression), so there are cases where lxml cannot tell what kind of error it was, so it would have to default to an eval error for the xpath() method. This would still give you different exceptions from the XPath() class and the xpath() method for the same unparsable expression. Another problem is backward compatibility: if we introduce a new exception for the evaluation, these errors will no longer be caught by existing code that catches XPathSyntaxError (or even plain SyntaxError). To make such code run with older versions of lxml, you would have to catch XPathError instead. 
So, it's easy to do, but you'd have to change your code if it uses the original exception. BTW, this fixes the issue of missing namespace prefixes raising a syntax error. You will now get an eval error saying "Undefined namespace prefix". I would like to hear opinions about this before it becomes the official behaviour of lxml 1.3. The trunk currently implements the variant that raises different errors from xpath(). Stefan From stefan_ml at behnel.de Sun May 13 20:27:07 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 13 May 2007 20:27:07 +0200 Subject: [lxml-dev] XPath exceptions In-Reply-To: <4645DEDD.9090604@behnel.de> References: <4645DEDD.9090604@behnel.de> Message-ID: <4647587B.2070408@behnel.de> Hi again, Stefan Behnel wrote: > It is easy to raise different errors from the xpath() method depending on what > libxml2 tells us about the error source. However, libxml2 also uses a generic > error code for both syntax and eval errors if the error is not more specific > (e.g. for a completely unparsable expression), so there are cases where lxml > cannot tell what kind of error it was, so it would have to default to an eval > error for the xpath() method. This would still give you different exceptions > from the XPath() class and the xpath() method for the same unparsable expression. I now believe that always raising an eval error here is more consistent and easier to handle. The semantics of raising different errors are flawed anyway, so having a single evaluation error for an evaluation function is as consistent as we can get. Note that this still breaks backwards compatibility as the XPath evaluators and the xpath() method no longer raise a syntax error but an eval error. You can work around this to support older lxml versions by catching XPathError. 
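A minimal sketch of that workaround, assuming a current lxml in which both XPathSyntaxError and XPathEvalError derive from etree.XPathError. The helper name safe_xpath and the sample document are made up for illustration; they are not part of lxml.

```python
from lxml import etree


def safe_xpath(element, expression):
    """Evaluate an XPath expression; return (result, error_message).

    Catching the common base class XPathError means the caller does not
    need to know whether lxml reports a syntax error or an eval error.
    """
    try:
        return element.xpath(expression), None
    except etree.XPathError as exc:
        return None, str(exc)


# Hypothetical sample document, only for demonstration.
root = etree.XML("<root><a/><a/></root>")

# A well-formed expression evaluates normally.
count_result, count_error = safe_xpath(root, "count(//a)")

# An unparsable expression is reported through the same except clause.
broken_result, broken_error = safe_xpath(root, "//a[")

# So is an expression that uses a namespace prefix nobody declared.
prefix_result, prefix_error = safe_xpath(root, "//x:a")
```

Code that has to tell the two cases apart can still add more specific except clauses before the XPathError one.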
Stefan From eric at detede.com Mon May 14 13:10:48 2007 From: eric at detede.com (Eric Garin) Date: Mon, 14 May 2007 13:10:48 +0200 Subject: [lxml-dev] Details of XML Validation errors with lxml Message-ID: <288780425.20070514131048@detede.com> An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070514/73400253/attachment.htm From stefan_ml at behnel.de Mon May 14 13:27:38 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 14 May 2007 13:27:38 +0200 Subject: [lxml-dev] Details of XML Validation errors with lxml In-Reply-To: <288780425.20070514131048@detede.com> References: <288780425.20070514131048@detede.com> Message-ID: <464847AA.8060006@behnel.de> Hi, Eric Garin wrote: > I've found some information > here http://codespeak.net/lxml/dev/api.html#error-handling-on-exceptions and most likely also http://codespeak.net/lxml/dev/validation.html > Where can I find the API documentation for error_log? The section in api.html which you mentioned above is already the main source. You might also try 'help(etree)', which will tell you what fields there are in the error log entries, or read the source code, class _LogEntry in http://codespeak.net/svn/lxml/trunk/src/lxml/xmlerror.pxi > I've found a bit by chance how to render the line number error but I > would like for example to be able to give to the users the offset (or > position in line) where the error occurs. Sadly, this information is not provided by libxml2, so lxml can't provide it either. Note also that the document might be constructed by hand (don't know your application), in which case a line number would be meaningless already. This is obviously a problem in 1-line XML documents, but a workaround could be to pretty print the document, parse it back in and then run the validation a second time. This will give you a meaningful line number that you can then use to provide the user with more context such as the surrounding tags. 
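The pretty-print workaround described above could look roughly like the following sketch. The tiny schema and the one-line document are invented purely for illustration; only etree.XMLSchema, validate(), tostring(pretty_print=True), fromstring() and the error_log entries are lxml API.

```python
from lxml import etree

# Hypothetical schema: a <root> holding one or more integer <a> elements.
schema = etree.XMLSchema(etree.XML(b"""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="a" type="xs:integer" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""))

# A one-line document: every validation error would be reported on line 1.
one_line = etree.XML(b"<root><a>1</a><a>oops</a><a>3</a></root>")

# Step 1: pretty print, so that each element ends up on its own line.
pretty = etree.tostring(one_line, pretty_print=True)

# Step 2: parse the pretty-printed text back in to get fresh line numbers.
reparsed = etree.fromstring(pretty)

# Step 3: validate again and collect the lines the errors refer to.
is_valid = schema.validate(reparsed)
error_lines = [entry.line for entry in schema.error_log]

# The pretty-printed text can then be shown to the user as context.
source_lines = pretty.decode("utf-8").splitlines()
for entry in schema.error_log:
    print("line %d: %s" % (entry.line, entry.message))
    print("    " + source_lines[entry.line - 1].strip())
```

The reported line numbers refer to the pretty-printed text, not to the original input, so the pretty-printed form is what should be displayed next to the messages.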
Stefan From stefan_ml at behnel.de Mon May 14 14:37:06 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 14 May 2007 14:37:06 +0200 Subject: [lxml-dev] Details of XML Validation errors with lxml In-Reply-To: <464847AA.8060006@behnel.de> References: <288780425.20070514131048@detede.com> <464847AA.8060006@behnel.de> Message-ID: <464857F2.70209@behnel.de> Stefan Behnel wrote: > Eric Garin wrote: >> I've found a bit by chance how to render the line number error but I >> would like for example to be abble to give to the users the offset (or >> position in line) where the error occur. > > Sadly, this information is not provided by libxml2 Sorry, looks like I was mistaken here. This information *is* provided by libxml2 in some contexts, so I'll try to make it available at the lxml API level. Please check the SVN trunk (becoming lxml 1.3) to see how things advance. Stefan From stefan_ml at behnel.de Mon May 14 22:44:08 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 14 May 2007 22:44:08 +0200 Subject: [lxml-dev] [objectify] schema type registry: QNames for xsi:type? In-Reply-To: <20070423080131.114650@gmx.net> References: <20070416095901.169710@gmx.net> <462BAC61.8040109@behnel.de> <20070423080131.114650@gmx.net> Message-ID: <4648CA18.5000909@behnel.de> Hi Holger, jholg at gmx.de wrote: >>> Is it easily possible to use QNames in the xsi-type lookup system? >> I believe this would be the right thing to do, as lxml should be >> consistent. I gave it a preliminary implementation on the trunk, could you check if this works for you? Stefan From stefan_ml at behnel.de Tue May 15 16:56:52 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 May 2007 16:56:52 +0200 Subject: [lxml-dev] [objectify] schema type registry: QNames for xsi:type? 
In-Reply-To: <20070515143723.18920@gmx.net> References: <20070416095901.169710@gmx.net> <462BAC61.8040109@behnel.de> <20070423080131.114650@gmx.net> <4648CA18.5000909@behnel.de> <20070515143723.18920@gmx.net> Message-ID: <4649CA34.8070801@behnel.de> Hi Holger, jholg at gmx.de wrote: >> jholg at gmx.de wrote: >>>>> Is it easily possible to use QNames in the xsi-type lookup system? >>>> I believe this would be the right thing to do, as lxml should be >>>> consistent. >> I gave it a preliminary implementation on the trunk, could you check if >> this >> works for you? > > With this small fix: > > *** src/lxml/objectify.pyx.ORIG Tue May 15 16:24:31 2007 > --- src/lxml/objectify.pyx Tue May 15 16:23:14 2007 > *************** > *** 1768,1774 **** > name = _xsi > for p, ns in nsmap.items(): > if ns == XML_SCHEMA_NS: > ! _xsi = prefix + ':' + _xsi > break > else: > raise TypeError, "XSD types require the XSD namespace" > --- 1768,1775 ---- > name = _xsi > for p, ns in nsmap.items(): > if ns == XML_SCHEMA_NS: > ! if p: > ! _xsi = p + ':' + _xsi > break > else: > raise TypeError, "XSD types require the XSD namespace" > > > it mostly works for me. Sure, thanks. > However, I detected that the nice nsmap-unification currently does not > handle attributes: > >>>> msg = etree.fromstring("""""") >>>> s = DataElement("234837", _xsi="string", nsmap={'myXSD': 'http://www.w3.org/2001/XMLSchema'}) >>>> >>>> print etree.tostring(s, pretty_print=1) > 234837 >>>> print etree.tostring(msg, pretty_print=1) > >>>> msg.s = s >>>> print etree.tostring(msg, pretty_print=1) > > 234837 > > > Now that nsmap-unification has taken place, the myXSD-prefix has not been changed to FOOBAR. > I fear this is not a nice one to fix as all the attribute-value prefixes had to be checked. So basically I think this is not a problem of objectify. Definitely not. While we could potentially look for attributes that we created ourselves, I don't think it is worth having lxml handle this. 
Users should take care when they use their own prefixes. Maybe worth a remark in the docs... > Not really related to this, I detected what I think is a slight inconsistency regarding nsmap when using None vs '' as nsmap-keys (=prefixes): > >>>> s = DataElement("234837", _xsi="string", nsmap={'': 'http://www.w3.org/2001/XMLSchema'}) >>>> print etree.tostring(s, pretty_print=1) > 234837 >>>> s = DataElement("234837", _xsi="string", nsmap={None: 'http://www.w3.org/2001/XMLSchema'}) >>>> print etree.tostring(s, pretty_print=1) > 234837 > > Though I'm not sure if this really a bug, a mere inconvenience, or even valid (is xmlns:=... allowed?) Maybe this is rather a user error, but I don't see why lxml should not just convert this to None internally. So, two fixes applied, please check again. :) Stefan From jholg at gmx.de Wed May 16 08:56:38 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 16 May 2007 08:56:38 +0200 Subject: [lxml-dev] Fwd: Re: XPath exceptions Message-ID: <20070516065638.229930@gmx.net> Hi, a quick question on nsmaps in XPath: Is it intentional that this >>> >>> root = etree.fromstring("what?") >>> root.xpath("//a", {'':'my/foo/bar/URI'}) ['what?'] >>> root.xpath("//a") ['what?'] >>> but >>> root = etree.fromstring("what?") >>> root.xpath("//a", {'':'my/foo/bar/URI'}) [] does not? Of course you should do: >>> root.xpath("//foo:a", {'foo':'my/foo/bar/URI'}) ['what?'] >>> Seems like an empty string prefix in an XPath-nsmap does not have the desired effect, basically it is ignored as ns-prefix. I notice that None is explicitly disallowed in an XPath-nsmap argument: >>> root.xpath("//a", {None:'my/foo/bar/URI'}) Traceback (most recent call last): File "", line 1, in ? 
File "etree.pyx", line 1042, in etree._Element.xpath File "xpath.pxi", line 222, in etree.XPathElementEvaluator.__init__ File "xpath.pxi", line 102, in etree._XPathEvaluatorBase.__init__ File "xpath.pxi", line 54, in etree._XPathContext.__init__ File "extensions.pxi", line 73, in etree._BaseContext.__init__ TypeError: empty namespace prefix is not supported in XPath >>> Maybe '' should internally get converted to None and thus raise the same error when used in an xpath nsmap argument dictionary? Holger From stefan_ml at behnel.de Wed May 16 10:19:33 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 May 2007 10:19:33 +0200 Subject: [lxml-dev] XPath exceptions In-Reply-To: <20070516065535.229910@gmx.net> References: <4645DEDD.9090604@behnel.de> <20070516065535.229910@gmx.net> Message-ID: <464ABE95.6010700@behnel.de> jholg at gmx.de wrote: > a quick question on nsmaps in XPath: > Is it intentional that this > >>>> root = etree.fromstring("what?") >>>> root.xpath("//a", {'':'my/foo/bar/URI'}) > ['what?'] >>>> root.xpath("//a") > ['what?'] > > but > >>>> root = etree.fromstring("what?") >>>> root.xpath("//a", {'':'my/foo/bar/URI'}) > [] > > does not? No. Passing an empty prefix should raise the same exception as passing None. Fixed on the trunk. Stefan From eric at detede.com Wed May 16 11:07:08 2007 From: eric at detede.com (Eric Garin) Date: Wed, 16 May 2007 11:07:08 +0200 Subject: [lxml-dev] find Message-ID: <221418083.20070516110708@detede.com> An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070516/536cb907/attachment-0001.htm From stefan_ml at behnel.de Wed May 16 11:14:20 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 May 2007 11:14:20 +0200 Subject: [lxml-dev] find In-Reply-To: <221418083.20070516110708@detede.com> References: <221418083.20070516110708@detede.com> Message-ID: <464ACB6C.2060900@behnel.de> Hi, Eric Garin wrote: > Using the find method with a tag with namespace failed > > if node.find("ns:tag"): find*() uses "{qualified}tags". See http://effbot.org/zone/element.htm#xml-namespaces on this. Stefan From lkraider at gmail.com Wed May 16 21:22:32 2007 From: lkraider at gmail.com (Paul Eipper) Date: Wed, 16 May 2007 16:22:32 -0300 Subject: [lxml-dev] case-insensitive xpath search Message-ID: <2ee02670705161222v4ee082c6ie2d4abcbaa2721d2@mail.gmail.com> Hello, I wonder if this is the right place to ask, but I am trying to run a case-insensitive search on a XML using lxml XPath. This is the current search (not case-insensitive): keyword = "what to find" o.xpath( '//*[ contains( @*, "%s" ) ]' % keyword ) how could I make that be a case-insensitive search ? thanks, -- Paul Eipper Brasil From stefan_ml at behnel.de Wed May 16 21:42:34 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 May 2007 21:42:34 +0200 Subject: [lxml-dev] case-insensitive xpath search In-Reply-To: <2ee02670705161222v4ee082c6ie2d4abcbaa2721d2@mail.gmail.com> References: <2ee02670705161222v4ee082c6ie2d4abcbaa2721d2@mail.gmail.com> Message-ID: <464B5EAA.40400@behnel.de> Paul Eipper wrote: > I wonder if this is the right place to ask, but I am trying to run a > case-insensitive search on a XML using lxml XPath. > > This is the current search (not case-insensitive): > > keyword = "what to find" > o.xpath( '//*[ contains( @*, "%s" ) ]' % keyword ) > > how could I make that be a case-insensitive search ? It will be easy to do in lxml 1.3, as it will support regexps. 
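As a rough sketch of the regexp route, assuming an lxml version that ships the EXSLT regular expression functions: the helper name find_by_attribute_text and the sample document are invented for illustration. The predicate tests every attribute individually, and the keyword is passed as an XPath variable instead of being formatted into the expression.

```python
import re

from lxml import etree

EXSLT_RE_NS = "http://exslt.org/regular-expressions"


def find_by_attribute_text(root, keyword):
    """Return elements with any attribute containing keyword, ignoring case."""
    return root.xpath(
        "//*[@*[re:test(., $pattern, 'i')]]",
        namespaces={"re": EXSLT_RE_NS},
        # re.escape() makes the keyword match literally, not as a regexp.
        pattern=re.escape(keyword),
    )


# Hypothetical sample data, not the poster's real file.
doc = etree.fromstring(
    '<library>'
    '<track artist="Andy" album="SimCity 4"/>'
    '<track artist="Other"/>'
    '</library>'
)
hits = find_by_attribute_text(doc, "simcity")
```

Passing the keyword as a variable also avoids quoting problems when the search text itself contains quote characters.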
http://codespeak.net/lxml/dev/xpathxslt.html#the-xpath-class http://www.exslt.org/regexp/index.html If you don't want to wait: have you tried if the translate() function from XSLT is available? Stefan From stefan_ml at behnel.de Wed May 16 22:40:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 May 2007 22:40:05 +0200 Subject: [lxml-dev] first take on an lxml.etree tutorial Message-ID: <464B6C25.9030304@behnel.de> Hi everyone, I finally found some time to get started on a tutorial for lxml.etree. http://codespeak.net/lxml/dev/tutorial.html http://codespeak.net/svn/lxml/trunk/doc/tutorial.txt The intention is to give beginners (regarding etree, ElementTree and to a certain extent even XML) an idea about how lxml.etree works and what features can help them in getting their problems solved. I happily borrowed ideas from Fredrik's ET tutorial. Still, there's lots of stuff missing, so if someone feels ambitious and finds some time over the week-end, I'd be glad to add another name to the list of authors. :) Everything from fixes and remarks to readily written sections will be very much appreciated. The source for the HTML page is the ReST text file above. Hoping for some helpful hands, Stefan From jholg at gmx.de Thu May 17 18:34:01 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 17 May 2007 18:34:01 +0200 Subject: [lxml-dev] [objectify] schema type registry: QNames for xsi:type? Message-ID: <20070517163401.76590@gmx.net> Hi Stefan, couldn't respond earlier as I have no svn access at work currently. I've tested your changes and they work just perfect for me. Find attached a little patch that adds some information on this topic to the objectify docs, and a test method also. Thanks, Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070517/59d45dc7/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: text/x-patch Size: 5775 bytes Desc: attachment Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070517/59d45dc7/attachment.bin From lkraider at gmail.com Thu May 17 20:11:32 2007 From: lkraider at gmail.com (Paul Eipper) Date: Thu, 17 May 2007 15:11:32 -0300 Subject: [lxml-dev] Question o xpath Message-ID: <2ee02670705171111n3f44f5e3x74e346122472ae30@mail.gmail.com> Hello again :) I'm hitting an issue... hope someone can help me. Say I have this data: ***snip*** 6:1, 128 kbps#44100Hz, Joint stereo ***snip*** if I do a xpath search like this, I get a result: >>> o.xpath('//*[ contains( @*, "Andy" ) ]' ) [] but if I try to search for this string: >>> o.xpath('//*[ contains( @*, "SimCity" ) ]' ) [] ...I get no result. Is the problem on the xpath query ? What am I missing here ? I thought " @* " is supposed to look at all tag attributes ? thanks, -- Paul Eipper Brasil From etiffany at alum.mit.edu Fri May 18 16:16:10 2007 From: etiffany at alum.mit.edu (Eric Tiffany) Date: Fri, 18 May 2007 10:16:10 -0400 Subject: [lxml-dev] python crashes in xmlDictFree inside Zope Message-ID: I have been prototyping some XMLSchema parsing/validating using lxml 1.3beta. Everthing works great from python 2.4.4 started from the command line, or running from inside Eclipse. However, when I moved my code over to my Plone product, python crashes when Zope is initializing the product. I am creating my XMLSchema object there. THe code is essentially the same, running under the same python and with the same libs (afaik). Some earlier attempts (with python 2.4.3) gave me this error: python(11139) malloc: *** Deallocation of a pointer not malloced: 0x80; This could be a double free(), or free() called with the middle of an allocated block; Try setting environment variable MallocHelp to see tools to help debug I'm not seeing that now, as I have "upgraded" to python 2.4.4 which seems to be stripped or something. 
I don't have a test case immediately available, but here is the stack backtrace from the python crash log: OS Version: 10.4.9 (Build 8P2137) Report Version: 4 Command: Python Path: /opt/local/Library/Frameworks/Python.framework/Versions/2.4/Resources/Python .app/Contents/MacOS/Python Parent: bash [11055] Version: 2.4a0 (2.4alpha1) PID: 11135 Thread: 0 Exception: EXC_BAD_ACCESS (0x0001) Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000001 Thread 0 Crashed: 0 libxml2.2.dylib 0x033f6294 xmlDictFree + 45 1 libxml2.2.dylib 0x033f62e2 xmlDictFree + 123 2 libxml2.2.dylib 0x033f62e2 xmlDictFree + 123 3 etree.so 0x06406a08 __pyx_f_5etree_14_ParserContext_initThreadDictRef + 63 (etree.c:19725) 4 etree.so 0x06406a5b __pyx_f_5etree_14_ParserContext_initParserDict + 35 (etree.c:19740) 5 etree.so 0x06431bca __pyx_f_5etree_11_BaseParser__parseDocFromFile + 99 (etree.c:21340) 6 etree.so 0x0646a8f4 __pyx_f_5etree__parseDocument + 395 (etree.c:22486) 7 etree.so 0x0646cc50 __pyx_f_5etree_parse + 176 (etree.c:10300) 8 org.python.python 0x0027faca PyEval_EvalFrame + 22777 (ceval.c:3568) 9 org.python.python 0x00280665 PyEval_EvalCodeEx + 1774 (ceval.c:2741) 10 org.python.python 0x0027e49f PyEval_EvalFrame + 17102 (ceval.c:3661) 11 org.python.python 0x00280665 PyEval_EvalCodeEx + 1774 (ceval.c:2741) 12 org.python.python 0x0027e49f PyEval_EvalFrame + 17102 (ceval.c:3661) 13 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 14 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 15 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 16 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 17 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 18 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 19 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 20 org.python.python 0x0027ebaa PyEval_EvalFrame + 18905 (ceval.c:3651) 21 org.python.python 0x0027ebaa PyEval_EvalFrame + 
18905 (ceval.c:3651) 22 org.python.python 0x00280665 PyEval_EvalCodeEx + 1774 (ceval.c:2741) 23 org.python.python 0x002808a5 PyEval_EvalCode + 87 (ceval.c:490) 24 org.python.python 0x002a75fd PyRun_FileExFlags + 200 (pythonrun.c:1285) 25 org.python.python 0x002a78e0 PyRun_SimpleFileExFlags + 640 (pythonrun.c:869) 26 org.python.python 0x002b13ec Py_Main + 3199 (main.c:493) 27 org.python.python 0x00001fae 0x1000 + 4014 28 org.python.python 0x00001ed5 0x1000 + 3797 Thread 0 crashed with X86 Thread State (32-bit): eax: 0x00000000 ebx: 0x033f6275 ecx: 0x00000000 edx: 0x00000000 edi: 0x00000001 esi: 0x05856fd0 ebp: 0xbfffcfe8 esp: 0xbfffcfb0 ss: 0x0000001f efl: 0x00010286 eip: 0x033f6294 cs: 0x00000017 ds: 0x0000001f es: 0x0000001f fs: 0x00000000 gs: 0x00000037 Binary Images Description: 0x1000 - 0x2fff org.python.python 2.4a0 (2.4alpha1) /opt/local/Library/Frameworks/Python.framework/Versions/2.4/Resources/Python .app/Contents/MacOS/Python 0xa0000 - 0xa1fff icglue.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/icglue.so 0xb8000 - 0xb9fff time.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/time.so 0xc3000 - 0xc6fff strop.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/strop.so 0x205000 - 0x2d9fff org.python.python 2.4a0 (2.2) /opt/local/Library/Frameworks/Python.framework/Versions/2.4/Python 0x575000 - 0x576fff cStringIO.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/cStringIO.so 0x580000 - 0x581fff collections.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/collections.so 0x5cc000 - 0x5d2fff _socket.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_socket.so 0x5e7000 - 0x5e8fff _ssl.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_ssl.so 0x705000 - 0x737fff libssl.0.9.8.dylib 
/opt/local/lib/libssl.0.9.8.dylib 0x74b000 - 0x75cfff libz.1.dylib /opt/local/lib/libz.1.dylib 0x761000 - 0x763fff struct.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/struct.so 0x76f000 - 0x771fff binascii.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/binascii.so 0x7bc000 - 0x7bdfff math.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/math.so 0x7c5000 - 0x7c6fff _random.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_random.so 0x7cd000 - 0x7cefff fcntl.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/fcntl.so 0x7d6000 - 0x7d7fff md5.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/md5.so 0x7de000 - 0x7dffff sha.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/sha.so 0x7e6000 - 0x7e6fff _bisect.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_bisect.so 0x1008000 - 0x10f7fff libcrypto.0.9.8.dylib /opt/local/lib/libcrypto.0.9.8.dylib 0x11a0000 - 0x11abfff datetime.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/datetime.so 0x120e000 - 0x1211fff array.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/array.so 0x1222000 - 0x1222fff grp.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/grp.so 0x126e000 - 0x127bfff cPickle.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/cPickle.so 0x129a000 - 0x129dfff _Res.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_Res.so 0x12ac000 - 0x12b2fff _File.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_File.so 0x12cb000 - 0x12ccfff MacOS.so 
/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/MacOS.so 0x1348000 - 0x1374fff pyexpat.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/pyexpat.so 0x1427000 - 0x1427fff _weakref.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_weakref.so 0x142d000 - 0x142dfff _initgroups.so /Applications/Plone-2.5.2/lib/python/initgroups/_initgroups.so 0x1433000 - 0x1433fff crypt.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/crypt.so 0x1439000 - 0x143afff _heapq.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_heapq.so 0x1442000 - 0x1442fff _Missing.so /Applications/Plone-2.5.2/lib/python/Missing/_Missing.so 0x144e000 - 0x1450fff cPersistence.so /Applications/Plone-2.5.2/lib/python/persistent/cPersistence.so 0x145b000 - 0x145cfff TimeStamp.so /Applications/Plone-2.5.2/lib/python/persistent/TimeStamp.so 0x1463000 - 0x1464fff cPickleCache.so /Applications/Plone-2.5.2/lib/python/persistent/cPickleCache.so 0x146e000 - 0x146ffff _zope_interface_coptimizations.so /Applications/Plone-2.5.2/lib/python/zope/interface/_zope_interface_coptimiz ations.so 0x1477000 - 0x1477fff _MultiMapping.so /Applications/Plone-2.5.2/lib/python/MultiMapping/_MultiMapping.so 0x14fe000 - 0x1502fff cAccessControl.so /Applications/Plone-2.5.2/lib/python/AccessControl/cAccessControl.so 0x1510000 - 0x1512fff _ExtensionClass.so /Applications/Plone-2.5.2/lib/python/ExtensionClass/_ExtensionClass.so 0x151c000 - 0x1520fff _Acquisition.so /Applications/Plone-2.5.2/lib/python/Acquisition/_Acquisition.so 0x152e000 - 0x152ffff _Record.so /Applications/Plone-2.5.2/lib/python/Record/_Record.so 0x1537000 - 0x1539fff cDocumentTemplate.so /Applications/Plone-2.5.2/lib/python/DocumentTemplate/cDocumentTemplate.so 0x1585000 - 0x1591fff parser.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li 
b-dynload/parser.so 0x15e8000 - 0x15eafff zlib.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/zlib.so 0x1633000 - 0x1635fff operator.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/operator.so 0x167f000 - 0x1681fff _zope_proxy_proxy.so /Applications/Plone-2.5.2/lib/python/zope/proxy/_zope_proxy_proxy.so 0x168f000 - 0x1691fff itertools.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/itertools.so 0x16a0000 - 0x16a0fff _zope_i18nmessageid_message.so /Applications/Plone-2.5.2/lib/python/zope/i18nmessageid/_zope_i18nmessageid_ message.so 0x16e7000 - 0x16e7fff _zope_thread.so /Applications/Plone-2.5.2/lib/python/zope/thread/_zope_thread.so 0x16ee000 - 0x16f1fff _proxy.so /Applications/Plone-2.5.2/lib/python/zope/security/_proxy.so 0x1701000 - 0x1702fff _zope_security_checker.so /Applications/Plone-2.5.2/lib/python/zope/security/_zope_security_checker.so 0x174a000 - 0x174afff _zope_hookable.so /Applications/Plone-2.5.2/lib/python/zope/hookable/_zope_hookable.so 0x1790000 - 0x1790fff _ComputedAttribute.so /Applications/Plone-2.5.2/lib/python/ComputedAttribute/_ComputedAttribute.so 0x17d6000 - 0x17d9fff _zope_app_container_contained.so /Applications/Plone-2.5.2/lib/python/zope/app/container/_zope_app_container_ contained.so 0x17e8000 - 0x17e8fff _Persistence.so /Applications/Plone-2.5.2/lib/python/Persistence/_Persistence.so 0x17ee000 - 0x17eefff _MethodObject.so /Applications/Plone-2.5.2/lib/python/MethodObject/_MethodObject.so 0x17f4000 - 0x17f5fff select.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/select.so 0x20c8000 - 0x20d3fff _OOBTree.so /Applications/Plone-2.5.2/lib/python/BTrees/_OOBTree.so 0x20f0000 - 0x20f0fff stopper.so /Applications/Plone-2.5.2/lib/python/Products/ZCTextIndex/stopper.so 0x20f6000 - 0x20f6fff okascore.so /Applications/Plone-2.5.2/lib/python/Products/ZCTextIndex/okascore.so 
0x2245000 - 0x2250fff _OIBTree.so /Applications/Plone-2.5.2/lib/python/BTrees/_OIBTree.so 0x226d000 - 0x2278fff _IOBTree.so /Applications/Plone-2.5.2/lib/python/BTrees/_IOBTree.so 0x2297000 - 0x22a3fff _IIBTree.so /Applications/Plone-2.5.2/lib/python/BTrees/_IIBTree.so 0x2abd000 - 0x2ae7fff _imaging.so /opt/local/lib/python2.4/site-packages/PIL/_imaging.so 0x2b79000 - 0x2b94fff libjpeg.62.dylib /opt/local/lib/libjpeg.62.dylib 0x2be2000 - 0x2be4fff _csv.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/_csv.so 0x2bf0000 - 0x2bf0fff _ThreadLock.so /Applications/Plone-2.5.2/lib/python/ThreadLock/_ThreadLock.so 0x2f85000 - 0x2f87fff unicodedata.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/li b-dynload/unicodedata.so 0x3285000 - 0x32a8fff libxml2mod.so /opt/local/lib/python2.4/site-packages/libxml2mod.so 0x3337000 - 0x3427fff libxml2.2.dylib /opt/local/lib/libxml2.2.dylib 0x3456000 - 0x354bfff libiconv.2.dylib /opt/local/lib/libiconv.2.dylib 0x3c05000 - 0x3c10fff _fsBTree.so /Applications/Plone-2.5.2/lib/python/BTrees/_fsBTree.so 0x52eb000 - 0x52ecfff ZopeSplitter.so /Applications/Plone-2.5.2/lib/python/Products/PluginIndexes/TextIndex/Splitt er/ZopeSplitter/ZopeSplitter.so 0x56c5000 - 0x56d0fff libexslt.0.dylib /opt/local/lib/libexslt.0.dylib 0x5a05000 - 0x5a2efff libxslt.1.dylib /opt/local/lib/libxslt.1.dylib 0x6405000 - 0x6477fff etree.so /opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/si te-packages/lxml-1.3beta-py2.4-macosx-10.4-i386.egg/lxml/etree.so 0x8fe00000 - 0x8fe4afff dyld 46.12 /usr/lib/dyld 0x90000000 - 0x90170fff libSystem.B.dylib /usr/lib/libSystem.B.dylib 0x901c0000 - 0x901c2fff libmathCommon.A.dylib /usr/lib/system/libmathCommon.A.dylib 0x901c4000 - 0x90201fff com.apple.CoreText 1.1.2 (???) 
/System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/CoreText.framework/Versions/A/CoreText 0x90228000 - 0x902fefff ATS /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ATS.framework/Versions/A/ATS 0x9031e000 - 0x90773fff com.apple.CoreGraphics 1.258.61 (???) /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/CoreGraphics.framework/Versions/A/CoreGraphics 0x9080a000 - 0x908d2fff com.apple.CoreFoundation 6.4.7 (368.28) /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundatio n 0x90910000 - 0x90910fff com.apple.CoreServices 10.4 (???) /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices 0x90912000 - 0x90a05fff libicucore.A.dylib /usr/lib/libicucore.A.dylib 0x90a55000 - 0x90ad4fff libobjc.A.dylib /usr/lib/libobjc.A.dylib 0x90afd000 - 0x90b61fff libstdc++.6.dylib /usr/lib/libstdc++.6.dylib 0x90bd0000 - 0x90bd7fff libgcc_s.1.dylib /usr/lib/libgcc_s.1.dylib 0x90bdc000 - 0x90c4ffff com.apple.framework.IOKit 1.4.6 (???) 
/System/Library/Frameworks/IOKit.framework/Versions/A/IOKit 0x90c64000 - 0x90c76fff libauto.dylib /usr/lib/libauto.dylib 0x90c7c000 - 0x90f22fff com.apple.CoreServices.CarbonCore 682.21 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Carb onCore.framework/Versions/A/CarbonCore 0x90f65000 - 0x90fcdfff com.apple.CoreServices.OSServices 4.1 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/OSSe rvices.framework/Versions/A/OSServices 0x91006000 - 0x91044fff com.apple.CFNetwork 129.20 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/CFNe twork.framework/Versions/A/CFNetwork 0x91057000 - 0x91067fff com.apple.WebServices 1.1.3 (1.1.0) /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/WebS ervicesCore.framework/Versions/A/WebServicesCore 0x91072000 - 0x910f1fff com.apple.SearchKit 1.0.5 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Sear chKit.framework/Versions/A/SearchKit 0x9112b000 - 0x91149fff com.apple.Metadata 10.4.4 (121.36) /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/Meta data.framework/Versions/A/Metadata 0x91155000 - 0x91163fff libz.1.dylib /usr/lib/libz.1.dylib 0x91166000 - 0x91305fff com.apple.security 4.5.2 (29774) /System/Library/Frameworks/Security.framework/Versions/A/Security 0x91403000 - 0x9140bfff com.apple.DiskArbitration 2.1.1 /System/Library/Frameworks/DiskArbitration.framework/Versions/A/DiskArbitrat ion 0x91412000 - 0x91419fff libbsm.dylib /usr/lib/libbsm.dylib 0x9141d000 - 0x91443fff com.apple.SystemConfiguration 1.8.6 /System/Library/Frameworks/SystemConfiguration.framework/Versions/A/SystemCo nfiguration 0x91455000 - 0x914cbfff com.apple.audio.CoreAudio 3.0.4 /System/Library/Frameworks/CoreAudio.framework/Versions/A/CoreAudio 0x9151c000 - 0x9151cfff com.apple.ApplicationServices 10.4 (???) 
/System/Library/Frameworks/ApplicationServices.framework/Versions/A/Applicat ionServices 0x9151e000 - 0x9154afff com.apple.AE 314 (313) /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/AE.framework/Versions/A/AE 0x9155d000 - 0x91631fff com.apple.ColorSync 4.4.9 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ColorSync.framework/Versions/A/ColorSync 0x9166c000 - 0x916dffff com.apple.print.framework.PrintCore 4.6 (177.13) /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/PrintCore.framework/Versions/A/PrintCore 0x9170d000 - 0x917b6fff com.apple.QD 3.10.24 (???) /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/QD.framework/Versions/A/QD 0x917dc000 - 0x91827fff com.apple.HIServices 1.5.2 (???) /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/HIServices.framework/Versions/A/HIServices 0x91846000 - 0x9185cfff com.apple.LangAnalysis 1.6.3 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/LangAnalysis.framework/Versions/A/LangAnalysis 0x91868000 - 0x91883fff com.apple.FindByContent 1.5 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/FindByContent.framework/Versions/A/FindByContent 0x9188e000 - 0x918cbfff com.apple.LaunchServices 182 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/LaunchServices.framework/Versions/A/LaunchServices 0x918df000 - 0x918ebfff com.apple.speech.synthesis.framework 3.5 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/SpeechSynthesis.framework/Versions/A/SpeechSynthesis 0x918f2000 - 0x91931fff com.apple.ImageIO.framework 1.5.4 /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/ImageIO 0x91944000 - 0x919f6fff libcrypto.0.9.7.dylib /usr/lib/libcrypto.0.9.7.dylib 0x91a3c000 - 0x91a52fff libcups.2.dylib 
/usr/lib/libcups.2.dylib 0x91a57000 - 0x91a75fff libJPEG.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libJPEG.dylib 0x91a7a000 - 0x91ad9fff libJP2.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libJP2.dylib 0x91aeb000 - 0x91aeffff libGIF.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libGIF.dylib 0x91af1000 - 0x91b75fff libRaw.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libRaw.dylib 0x91b79000 - 0x91bb6fff libTIFF.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libTIFF.dylib 0x91bbc000 - 0x91bd6fff libPng.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libPng.dylib 0x91bdb000 - 0x91bddfff libRadiance.dylib /System/Library/Frameworks/ApplicationServices.framework/Versions/A/Framewor ks/ImageIO.framework/Versions/A/Resources/libRadiance.dylib 0x91bdf000 - 0x91cbdfff libxml2.2.dylib /usr/lib/libxml2.2.dylib 0x91cda000 - 0x91cdafff com.apple.Accelerate 1.3.1 (Accelerate 1.3.1) /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate 0x91cdc000 - 0x91d6afff com.apple.vImage 2.5 /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vImage .framework/Versions/A/vImage 0x91d71000 - 0x91d71fff com.apple.Accelerate.vecLib 3.3.1 (vecLib 3.3.1) /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib .framework/Versions/A/vecLib 0x91d73000 - 0x91dccfff libvMisc.dylib /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib .framework/Versions/A/libvMisc.dylib 0x91dd5000 - 0x91df9fff libvDSP.dylib 
/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib .framework/Versions/A/libvDSP.dylib 0x91e01000 - 0x9220afff libBLAS.dylib /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib .framework/Versions/A/libBLAS.dylib 0x92244000 - 0x925f8fff libLAPACK.dylib /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib .framework/Versions/A/libLAPACK.dylib 0x92625000 - 0x92712fff libiconv.2.dylib /usr/lib/libiconv.2.dylib 0x92714000 - 0x92791fff com.apple.DesktopServices 1.3.6 /System/Library/PrivateFrameworks/DesktopServicesPriv.framework/Versions/A/D esktopServicesPriv 0x927d2000 - 0x92a02fff com.apple.Foundation 6.4.8 (567.29) /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation 0x92baa000 - 0x92baafff com.apple.Carbon 10.4 (???) /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon 0x92bac000 - 0x92bbcfff com.apple.ImageCapture 3.0.4 /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/ImageCaptu re.framework/Versions/A/ImageCapture 0x92bcb000 - 0x92bd3fff com.apple.speech.recognition.framework 3.6 /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/SpeechReco gnition.framework/Versions/A/SpeechRecognition 0x92bd9000 - 0x92bdffff com.apple.securityhi 2.0.1 (24742) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/SecurityHI .framework/Versions/A/SecurityHI 0x92be5000 - 0x92c76fff com.apple.ink.framework 101.2.1 (71) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Ink.framew ork/Versions/A/Ink 0x92c8a000 - 0x92c8efff com.apple.help 1.0.3 (32.1) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Help.frame work/Versions/A/Help 0x92c91000 - 0x92caffff com.apple.openscripting 1.2.5 (???) 
/System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/OpenScript ing.framework/Versions/A/OpenScripting 0x92cc1000 - 0x92cc7fff com.apple.print.framework.Print 5.2 (192.4) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Print.fram ework/Versions/A/Print 0x92ccd000 - 0x92d30fff com.apple.htmlrendering 66.1 (1.1.3) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HTMLRender ing.framework/Versions/A/HTMLRendering 0x92d57000 - 0x92d98fff com.apple.NavigationServices 3.4.4 (3.4.3) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/Navigation Services.framework/Versions/A/NavigationServices 0x92dbf000 - 0x92dcdfff com.apple.audio.SoundManager 3.9.1 /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/CarbonSoun d.framework/Versions/A/CarbonSound 0x92dd4000 - 0x92dd9fff com.apple.CommonPanels 1.2.3 (73) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/CommonPane ls.framework/Versions/A/CommonPanels 0x92dde000 - 0x930d3fff com.apple.HIToolbox 1.4.9 (???) /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox. framework/Versions/A/HIToolbox -- From hongqn at gmail.com Fri May 18 18:32:18 2007 From: hongqn at gmail.com (Qiangning Hong) Date: Sat, 19 May 2007 00:32:18 +0800 Subject: [lxml-dev] etree.tostring generate invalid XML? Message-ID: >>> from lxml import etree >>> e = lxml.etree.Element('root') >>> e.text = u'\x08' >>> xml = etree.tostring(e, 'utf8') >>> xml '\x08' >>> etree.XML(xml) >>> etree.XML(xml) Traceback (most recent call last): File "", line 1, in File "etree.pyx", line 1749, in etree.XML File "parser.pxi", line 934, in etree._parseMemoryDocument File "parser.pxi", line 830, in etree._parseDoc File "parser.pxi", line 516, in etree._BaseParser._parseDoc File "parser.pxi", line 619, in etree._handleParseResult File "parser.pxi", line 590, in etree._raiseParseError etree.XMLSyntaxError: line 1: PCDATA invalid Char value 8 Shouldn't xml be '' ? 
Is it a bug of lxml?

--
Qiangning Hong
http://www.douban.com/people/hongqn/

From stefan_ml at behnel.de Sun May 20 09:11:57 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 20 May 2007 09:11:57 +0200
Subject: [lxml-dev] Question o xpath
In-Reply-To: <2ee02670705171111n3f44f5e3x74e346122472ae30@mail.gmail.com>
References: <2ee02670705171111n3f44f5e3x74e346122472ae30@mail.gmail.com>
Message-ID: <464FF4BD.70308@behnel.de>

Hi,

Paul Eipper wrote:
> ***snip***
> time="Dec 13 13:36 2006">
> 6:1, 128 kbps#44100Hz, Joint stereo
> album="SimCity 4 Rush Hour Soundtrack" year="2003">
>
> ***snip***
>
> if I do a xpath search like this, I get a result:
>
> >>> o.xpath('//*[ contains( @*, "Andy" ) ]' )
> []
>
> but if I try to search for this string:
> >>> o.xpath('//*[ contains( @*, "SimCity" ) ]' )
> []
>
> ...I get no result.
>
> Is the problem on the xpath query ? What am I missing here ? I thought
> " @* " is supposed to look at all tag attributes ?

The problem is that the contains() function requires a string as parameter,
not a node set (such as a set of attributes, as in your example). The way
XPath converts a node set to a string is: take the first entry and convert
that to a string. In your case, this happens to hit the "artist" attribute,
not the "album" attribute (no guarantee here!). Consider reading a good book
or some other documentation on XPath, to learn the full details.

This expression should do what you want:

    //*[ @*[contains(., $whatyouwant) ]]

then set "whatyouwant" to whatever you are looking for during an evaluation.

Stefan

From stefan_ml at behnel.de Sun May 20 09:25:42 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 20 May 2007 09:25:42 +0200
Subject: [lxml-dev] etree.tostring generate invalid XML?
In-Reply-To:
References:
Message-ID: <464FF7F6.3080004@behnel.de>

Hi,

without looking it up, I don't think this is a bug (and definitely not in
lxml). The XML spec simply forbids certain characters in serialised XML.
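
To make that restriction concrete: XML 1.0 only permits characters matching its "Char" production (tab, newline, carriage return, and the ranges U+0020-U+D7FF, U+E000-U+FFFD and U+10000-U+10FFFF), so U+0008 has no legal serialised form. The following is only a sketch in plain Python 3, assuming you want to check or clean text yourself before assigning it to an element; the helper names is_xml_safe, strip_invalid_xml_chars and to_xml_safe_base64 are made up for illustration and are not part of lxml.

```python
import base64
import re

# Matches every character that the XML 1.0 "Char" production does NOT allow.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_safe(text):
    """Return True if every character in text may appear in XML 1.0."""
    return _INVALID_XML_CHARS.search(text) is None


def strip_invalid_xml_chars(text):
    """Drop characters that XML 1.0 forbids (hypothetical cleanup policy)."""
    return _INVALID_XML_CHARS.sub("", text)


def to_xml_safe_base64(data):
    """Encode raw bytes as ASCII base64 text, which is always XML-safe."""
    return base64.b64encode(data).decode("ascii")


print(is_xml_safe("backspace: \x08"))             # False
print(strip_invalid_xml_chars("hello\x08world"))  # helloworld
print(to_xml_safe_base64(b"\x08"))                # CA==
```

Whether to reject, strip or encode is an application decision; stripping silently loses data, while base64 keeps it at the cost of readability.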
Qiangning Hong wrote: > >>> from lxml import etree > >>> e = lxml.etree.Element('root') > >>> e.text = u'\x08' > >>> xml = etree.tostring(e, 'utf8') > >>> xml > '\x08' Don't tell me you didn't expect that. :) >>>> etree.XML(xml) Interesting, no output here? >>>> etree.XML(xml) > Traceback (most recent call last): > File "", line 1, in > File "etree.pyx", line 1749, in etree.XML > File "parser.pxi", line 934, in etree._parseMemoryDocument > File "parser.pxi", line 830, in etree._parseDoc > File "parser.pxi", line 516, in etree._BaseParser._parseDoc > File "parser.pxi", line 619, in etree._handleParseResult > File "parser.pxi", line 590, in etree._raiseParseError > etree.XMLSyntaxError: line 1: PCDATA invalid Char value 8 > > Shouldn't xml be '' ? Is it a bug of lxml? When you're dealing with binary data in XML, you should always encode it in a way that makes it 'XML compatible', such as uuencode, base64 or what ever. If you want, you can ask on the libxml2 mailing list, but I doubt they'll tell you anything different. You might get an answer, though, that gives you a bit more of insight into what goes on. Stefan From stefan_ml at behnel.de Sun May 20 10:33:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 20 May 2007 10:33:19 +0200 Subject: [lxml-dev] python crashes in xmlDictFree inside Zope In-Reply-To: References: Message-ID: <465007CF.2090405@behnel.de> Hi, from the stack traces I can see that you are on MacOS-X. Could you check which libxml2 version you are using? Just run "test.py" from the source distribution or look at "lxml.etree.LIBXML*". There are known issues with older versions of libxml2, so especially when you are using XML-Schema, you should care for installing a recent version. 
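
For reference, a small sketch of the version check suggested above. LIBXML_VERSION and LIBXML_COMPILED_VERSION are the lxml.etree attributes referred to as "lxml.etree.LIBXML*"; the helper version_at_least and the threshold (2, 6, 17) are assumptions chosen purely for illustration, not an official minimum version.

```python
def version_at_least(found, required):
    """Compare version tuples, e.g. (2, 6, 28) >= (2, 6, 17)."""
    return tuple(found) >= tuple(required)


try:
    from lxml import etree
except ImportError:  # lxml is not installed, so there is nothing to report
    etree = None

if etree is not None:
    # Version of the libxml2 library loaded at runtime.
    print("libxml2 in use:       ", etree.LIBXML_VERSION)
    # Version of libxml2 that lxml was compiled against.
    print("libxml2 compiled with:", etree.LIBXML_COMPILED_VERSION)
    # Illustrative threshold only; pick whatever your schemas require.
    if not version_at_least(etree.LIBXML_VERSION, (2, 6, 17)):
        print("libxml2 looks old; consider upgrading")
```

If the two tuples differ, lxml is running against a different libxml2 than it was built with, which is worth mentioning in a bug report.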
Stefan From etiffany at alum.mit.edu Sun May 20 21:29:42 2007 From: etiffany at alum.mit.edu (Eric Tiffany) Date: Sun, 20 May 2007 15:29:42 -0400 Subject: [lxml-dev] python crashes in xmlDictFree inside Zope In-Reply-To: <303c45610705201224i4e2ba2dctecff6ae67adbe1e4@mail.gmail.com> References: <465007CF.2090405@behnel.de> <303c45610705201224i4e2ba2dctecff6ae67adbe1e4@mail.gmail.com> Message-ID: <303c45610705201229p43f3fba5n5cc24f45986f00ca@mail.gmail.com> Thanks, I am using the latest "darwinports" version of libxml2, which I just updated -- this problem occurs with the version below. Note that this problem doesn't happen with the same code run from the python shell -- only on when used inside Zope. [sorry for the repeat post -- sent from wrong acct first time] ET >>> lxml.etree.LIBXML_VERSION (2, 6, 28) >>> lxml.etree.LIBXML_COMPILED_VERSION (2, 6, 28) On 5/20/07, Stefan Behnel wrote: > Hi, > > from the stack traces I can see that you are on MacOS-X. Could you check which > libxml2 version you are using? Just run "test.py" from the source distribution > or look at "lxml.etree.LIBXML*". There are known issues with older versions of > libxml2, so especially when you are using XML-Schema, you should care for > installing a recent version. > > Stefan > From stefan_ml at behnel.de Mon May 21 06:47:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 May 2007 06:47:12 +0200 Subject: [lxml-dev] [objectify] schema type registry: QNames for xsi:type? In-Reply-To: <20070517163401.76590@gmx.net> References: <20070517163401.76590@gmx.net> Message-ID: <46512450.5070007@behnel.de> Hi Holger, jholg at gmx.de wrote: > couldn't respond earlier as I have no svn access at work currently. > I've tested your changes and they work just perfect for me. > > Find attached a little patch that adds some information on this topic > to the objectify docs, and a test method also. thanks for the doc patch. 
I had to beautify it in a couple of places (after all, it's more of a documentation thing than a test), and that brought me to a larger restructuring of the existing page (it now actually *has* a structure). Thanks again, Stefan From stefan_ml at behnel.de Mon May 21 08:21:11 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 May 2007 08:21:11 +0200 Subject: [lxml-dev] python crashes in xmlDictFree inside Zope In-Reply-To: References: Message-ID: <46513A57.2050405@behnel.de> Hi, Eric Tiffany wrote: > I have been prototyping some XMLSchema parsing/validating using lxml > 1.3beta. > > Everthing works great from python 2.4.4 started from the command line, or > running from inside Eclipse. > > However, when I moved my code over to my Plone product, python crashes when > Zope is initializing the product. I am creating my XMLSchema object there. > > Some earlier attempts (with python 2.4.3) gave me this error: > > python(11139) malloc: *** Deallocation of a pointer not malloced: 0x80; This > could be a double free(), or free() called with the middle of an allocated > block; Try setting environment variable MallocHelp to see tools to help > debug > > OS Version: 10.4.9 (Build 8P2137) > Report Version: 4 > > Version: 2.4a0 (2.4alpha1) Is this the Python version? I'm asking because you said it was 2.4.3. lxml.etree behaves different for Python <= 2.4.2, as there are known bugs with threading in earlier versions. If you're sure you can reproduce the *same* bug with a version older than 2.4.2 (and libxml 2.6.28, as you also mentioned), that would completely shift the focus of the bug hunt. Is there any way to detect MacOS-X at the C level? In that case, we could try to disable thread concurrency support completely for this platform - in case that's the source of the segfault. You can try to see if this would fix the problem by passing the option "--without-threading" to setup.py when building lxml. 
Could you please try that with your current setup and report back to the list? Another question: are you using a custom parser (i.e. passing a second argument to the parse() function) here or is it the default parser that crashes here? Stefan From sidnei at enfoldsystems.com Mon May 21 16:08:13 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 21 May 2007 11:08:13 -0300 Subject: [lxml-dev] Resolvers not passed on to sub-documents? Message-ID: This might or might not have been fixed recently. I am using lxml 1.2. I'm writing a self-contained test for this. If you have a XSLT that includes another XSLT, which in turn includes a third one, the resolvers doesn't seem to be passed on. theme.xsl: xsl:import sub.xsl sub.xsl: xsl:import common.xsl The custom resolver I passed when parsing theme.xsl is used to resolve 'sub.xsl' but not to resolve 'common.xsl'. I remember discussing a similar issue, maybe this is a problem with libxml2? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From stefan_ml at behnel.de Mon May 21 16:28:01 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 May 2007 16:28:01 +0200 Subject: [lxml-dev] Resolvers not passed on to sub-documents? In-Reply-To: References: Message-ID: <4651AC71.9080608@behnel.de> Hi, Sidnei da Silva wrote: > This might or might not have been fixed recently. I am using lxml 1.2. > I'm writing a self-contained test for this. > > If you have a XSLT that includes another XSLT, which in turn includes > a third one, the resolvers doesn't seem to be passed on. > > theme.xsl: > xsl:import sub.xsl > > sub.xsl: > xsl:import common.xsl > > The custom resolver I passed when parsing theme.xsl is used to resolve > 'sub.xsl' but not to resolve 'common.xsl'. That's quite possible as keeping track of parsed documents in XSLT isn't the most simple thing on earth. 
We had the same problem with XInclude (where nothing would help), but it
should be fixable for XSLT.

Could you check if the attached (and completely untested) patch fixes the
problem? It's against the trunk, but should apply to 1.2 also.

In case this patch helps, 'mind coming up with a test case for it? That could
easily convince me that I should include it in lxml 1.3. :)

Stefan

Index: src/lxml/xslt.pxi
===================================================================
--- src/lxml/xslt.pxi (Revision 43508)
+++ src/lxml/xslt.pxi (Arbeitskopie)
@@ -141,6 +141,8 @@
 c_doc = _xslt_resolve_stylesheet(c_uri, c_pcontext)
 if c_doc is not NULL:
     python.PyGILState_Release(gil_state)
+    if c_type == xslt.XSLT_LOAD_STYLESHEET:
+        c_doc._private = c_pcontext
     return c_doc

 c_doc = _xslt_resolve_from_python(c_uri, c_pcontext, parse_options, &error)
@@ -151,6 +153,8 @@
 _xslt_store_resolver_exception(c_uri, c_pcontext, c_type)

 python.PyGILState_Release(gil_state)
+if c_doc is not NULL and c_type == xslt.XSLT_LOAD_STYLESHEET:
+    c_doc._private = c_pcontext
 return c_doc

 cdef xslt.xsltDocLoaderFunc XSLT_DOC_DEFAULT_LOADER

From sidnei at enfoldsystems.com Mon May 21 17:41:00 2007
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Mon, 21 May 2007 12:41:00 -0300
Subject: [lxml-dev] Resolvers not passed on to sub-documents?
In-Reply-To: <4651AC71.9080608@behnel.de>
References: <4651AC71.9080608@behnel.de>
Message-ID:

Yup! That fixes it! I will come up with a test to add to the test suite.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From sidnei at enfoldsystems.com Tue May 22 06:42:47 2007
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Tue, 22 May 2007 01:42:47 -0300
Subject: [lxml-dev] Building LXML Trunk
Message-ID:

Hi,

I've tried to build lxml from trunk today, on Win32.
Got the following error:

src\lxml\etree.c(880) : error C2059: syntax error : ')'
src\lxml\etree.c(881) : error C2059: syntax error : ')'
src\lxml\etree.c(882) : error C2059: syntax error : ')'
src\lxml\etree.c(883) : error C2059: syntax error : ')'

I'm attaching etree.c.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
-------------- next part --------------
A non-text attachment was scrubbed...
Name: etree.c.bz2
Type: application/x-bzip2
Size: 133412 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070522/69e0a303/attachment-0001.bin

From sidnei at enfoldsystems.com Tue May 22 19:56:41 2007
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Tue, 22 May 2007 14:56:41 -0300
Subject: [lxml-dev] Building LXML Trunk
Message-ID:

Hi,

I've tried to build lxml from trunk today, on Win32. Got the following error:

src\lxml\etree.c(880) : error C2059: syntax error : ')'
src\lxml\etree.c(881) : error C2059: syntax error : ')'
src\lxml\etree.c(882) : error C2059: syntax error : ')'
src\lxml\etree.c(883) : error C2059: syntax error : ')'

Any clue? Smells like a Pyrex issue?
--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From cz at gocept.com Wed May 23 08:50:39 2007
From: cz at gocept.com (Christian Zagrodnick)
Date: Wed, 23 May 2007 08:50:39 +0200
Subject: [lxml-dev] Bug in objectify node[:].index
Message-ID:

Hi,

following little script fails at the last assert:

----------------------------
import lxml.objectify


tree = lxml.objectify.fromstring(
"""\

foo
bar
foo
bar

""")

trees = tree.findall('//a[@name="tree"]')
print trees

foo_tree = trees[0]
assert foo_tree.get('name') == 'tree'

parent = foo_tree.getparent()
assert parent.tag == 'root'


node_list = parent[foo_tree.tag]
import pdb; pdb.set_trace()
foo_index = node_list[:].index(foo_tree)
assert foo_index == 3, foo_index  # FAILS: foo_index is 0

----------------------------

So, foo_index == 0. Which is foo. Apparently the
.index only looks at the text or something?!

Anyway, all I *actually* want is to remove the nodes found by the
xpath. The way you'd think it would be 'normal' doesn't work
unfortunately:


(Pdb) p parent.index(foo_tree)
2
(Pdb) del parent[2]
*** TypeError: deleting items not supported by root element

This is obviously because of the sort of strange list/attribute
handling (i.e. parent is parent[0])

But then there is parent.remove(...) which works.

Apart from the bug above objectify feels kind of strange sometimes... :/
But then at other times it's really nice again :)

--
Christian Zagrodnick

gocept gmbh & co. kg · forsterstrasse 29 · 06112 halle/saale
www.gocept.com · fon. +49 345 12298894 · fax.
+49 345 12298891

From jholg at gmx.de Wed May 23 09:44:25 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 23 May 2007 09:44:25 +0200
Subject: [lxml-dev] Bug in objectify node[:].index
In-Reply-To:
References:
Message-ID: <20070523074425.103190@gmx.net>

> Hi,
>
> following little script fails at the last assert:
>
> ----------------------------
> import lxml.objectify
>
>
> tree = lxml.objectify.fromstring(
> """\
>
> foo
> bar
> foo
> bar
>
> """)
>
> trees = tree.findall('//a[@name="tree"]')
> print trees
>
> foo_tree = trees[0]
> assert foo_tree.get('name') == 'tree'
>
> parent = foo_tree.getparent()
> assert parent.tag == 'root'
>
>
> node_list = parent[foo_tree.tag]
> import pdb; pdb.set_trace()
> foo_index = node_list[:].index(foo_tree)
> assert foo_index == 3, foo_index  # FAILS: foo_index is 0
>
> ----------------------------
>
> So, foo_index == 0. Which is foo. Apparently the
> .index only looks at the text or something?!

Note that you use [].index, not ObjectifiedElement.index, with the slice you
apply:

>>> node_list[:].index
>>> parent.index
>>> node_list[:].index(foo_tree)
0
>>> parent.index(foo_tree)
2
>>>

Seems like [].index returns the first list item that compares equal to its
argument. As StringElements behave much like strings, this is what happens
here, as the first element in your list also has the element.text "foo":

>>> node_list[:][0] == foo_tree
True
>>> node_list[:][1] == foo_tree
False
>>> node_list[:][2] == foo_tree
True
>>>
>>> print foo_tree.text
foo
>>> print node_list[:][0].text
foo
>>>

So I'd rather not say this is a bug...

> Anyway, all I *actually* want is to remove the nodes found by the
> xpath. The way you'd think it would be 'normal' doesn't work
> unfortunately:
>
>
> (Pdb) p parent.index(foo_tree)
> 2
> (Pdb) del parent[2]
> *** TypeError: deleting items not supported by root element
>
> This is obviously because of the sort of strange list/attribute
> handling (i.e. parent is parent[0])

It is different from the lxml.etree (ElementTree) API but clearly stated in
the docs: Indexed access returns siblings (aka "neighbour" elements with the
same name) rather than children, and every ObjectifiedElement has list
behaviour; unindexed access is just a *shortcut* to retrieve the first
sibling.

Cheers,
Holger

From cz at gocept.com Wed May 23 15:27:19 2007
From: cz at gocept.com (Christian Zagrodnick)
Date: Wed, 23 May 2007 15:27:19 +0200
Subject: [lxml-dev] Bug in objectify node[:].index
References: <20070523074425.103190@gmx.net>
Message-ID:

On 2007-05-23 09:44:25 +0200, jholg at gmx.de said:

>> Hi,
>>
>> following little script fails at the last assert:
>>
>> ----------------------------
>> import lxml.objectify
>>
>>
>> tree = lxml.objectify.fromstring(
>> """\
>>
>> foo
>> bar
>> foo
>> bar
>>
>> """)
>>
>> trees = tree.findall('//a[@name="tree"]')
>> print trees
>>
>> foo_tree = trees[0]
>> assert foo_tree.get('name') == 'tree'
>>
>> parent = foo_tree.getparent()
>> assert parent.tag == 'root'
>>
>>
>> node_list = parent[foo_tree.tag]
>> import pdb; pdb.set_trace()
>> foo_index = node_list[:].index(foo_tree)
>> assert foo_index == 3, foo_index  # FAILS: foo_index is 0
>>
>> ----------------------------
>>
>> So, foo_index == 0. Which is foo. Apparently the
>> .index only looks at the text or something?!
>
> Note that you use [].index, not ObjectifiedElement.index, with the
> slice you apply:
>
> >>> node_list[:].index
> >>> parent.index
> >>> node_list[:].index(foo_tree)
> 0
> >>> parent.index(foo_tree)
> 2
> >>>
>
> Seems like [].index returns the first list item that compares equal to
> its argument.
As StringElements behave much like strings, this is what > happens here, as the first element in your list also has the element.text > "foo": >>>> node_list[:][0] == foo_tree > True >>>> node_list[:][1] == foo_tree > False >>>> node_list[:][2] == foo_tree > True >>>> >>>> print foo_tree.text > foo >>>> print node_list[:][0].text > foo >>>> > > > So I'd rather not say this is a bug... Oh I see. Actually I was trying to see if they [0] was equal to [2]. In my case there were not but I probably did something wrong. And as I indeed thought they should be wrong I haven't looked closer. > >> Anyway, all I *actually* want is to remove the nodes found by the > >> xpath. The way you'd think it would be 'normal' doesn't work > >> unfortunately: >> > >> > >> (Pdb) p parent.index(foo_tree) >> 2 >> (Pdb) del parent[2] >> *** TypeError: deleting items not supported by root element >> > >> This is obviously because of the sort fo strange list/attribute > >> handling (i.e. parent is parent[0]) > > It is different from the lxml.etree (ElementTree) API but clearly stated > in the docs: Indexed access returns siblings (aka "neighbour" elements > with the same name) rather than children, and every ObjectifiedElement > has list behaviour; unindexed access is just a *shortcut* to retrieve > the first sib ling. Yeah. In general I even know that. All I'm saying is that if feels strange. Thanks by the way :) -- Christian Zagrodnick gocept gmbh & co. kg ? forsterstrasse 29 ? 06112 halle/saale www.gocept.com ? fon. +49 345 12298894 ? fax. +49 345 12298891 From stefan_ml at behnel.de Wed May 23 21:37:13 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 23 May 2007 21:37:13 +0200 Subject: [lxml-dev] Building LXML Trunk In-Reply-To: References: Message-ID: <465497E9.9090808@behnel.de> Hi Sidnei, Sidnei da Silva wrote: > I've tried to build lxml from trunk today, on Win32. 
Got the following error: > > src\lxml\etree.c(880) : error C2059: syntax error : ')' > src\lxml\etree.c(881) : error C2059: syntax error : ')' > src\lxml\etree.c(882) : error C2059: syntax error : ')' > src\lxml\etree.c(883) : error C2059: syntax error : ')' > > Any clue? Smells like a Pyrex issue? Looks like it, yes. The problem lies in the following lines: void ((*registerGlobalFunctions)(struct __pyx_obj_5etree__BaseContext *,void (*),int ((void (*),PyObject *,PyObject *)))); void ((*registerLocalFunctions)(struct __pyx_obj_5etree__BaseContext *,void (*),int ((void (*),PyObject *,PyObject *)))); PyObject *((*unregisterAllFunctions)(struct __pyx_obj_5etree__BaseContext *,void (*),int ((void (*),PyObject *,PyObject *)))); PyObject *((*unregisterGlobalFunctions)(struct __pyx_obj_5etree__BaseContext *,void (*),int ((void (*),PyObject *,PyObject *)))); These are new in lxml 1.3. Looks like MS's "C" compiler can't handle that. Any idea how we could get this to work? I mean, without the obvious approach of switching to MinGW. :) Stefan From sidnei at enfoldsystems.com Wed May 23 21:53:34 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 23 May 2007 16:53:34 -0300 Subject: [lxml-dev] Building LXML Trunk In-Reply-To: <465497E9.9090808@behnel.de> References: <465497E9.9090808@behnel.de> Message-ID: I don't have an idea myself, but I can ask Mark to take a look at it. My shallow C experience is not enough to parse that. :) On 5/23/07, Stefan Behnel wrote: > Hi Sidnei, > > Sidnei da Silva wrote: > > I've tried to build lxml from trunk today, on Win32. Got the following error: > > > > src\lxml\etree.c(880) : error C2059: syntax error : ')' > > src\lxml\etree.c(881) : error C2059: syntax error : ')' > > src\lxml\etree.c(882) : error C2059: syntax error : ')' > > src\lxml\etree.c(883) : error C2059: syntax error : ')' > > > > Any clue? Smells like a Pyrex issue? > > Looks like it, yes. 
The problem lies in the following lines: > > void ((*registerGlobalFunctions)(struct __pyx_obj_5etree__BaseContext *,void > (*),int ((void (*),PyObject *,PyObject *)))); > void ((*registerLocalFunctions)(struct __pyx_obj_5etree__BaseContext *,void > (*),int ((void (*),PyObject *,PyObject *)))); > PyObject *((*unregisterAllFunctions)(struct __pyx_obj_5etree__BaseContext > *,void (*),int ((void (*),PyObject *,PyObject *)))); > PyObject *((*unregisterGlobalFunctions)(struct __pyx_obj_5etree__BaseContext > *,void (*),int ((void (*),PyObject *,PyObject *)))); > > These are new in lxml 1.3. Looks like MS's "C" compiler can't handle that. > > Any idea how we could get this to work? I mean, without the obvious approach > of switching to MinGW. :) > > Stefan > -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From ianb at colorstudy.com Thu May 24 17:57:08 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 May 2007 10:57:08 -0500 Subject: [lxml-dev] lhtml Message-ID: <4655B5D4.2080101@colorstudy.com> I really want to take all our HTML-related routines and put them into a proper package -- right now they are scattered all over the place. lhtml seems like a nice name for this. I thought it would be good to also place it on codespeak, but I'm not sure where. New top-level? In /lxml/lhtml/(trunk|branches|tags) ? 
-- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Thu May 24 18:42:30 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Thu, 24 May 2007 11:42:30 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <4655B5D4.2080101@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> Message-ID: <4655C076.7050901@colorstudy.com> Ian Bicking wrote: > I really want to take all our HTML-related routines and put them into a > proper package And maybe a bit of advice -- we could just do this as a set of functions (what we currently have), or potentially explore objectify and add the routines as methods. E.g., el.find_by_class('classname') This feels like a cleaner API, but I'm worried that it will mean problems when mixing non-objectify-HTML with other elements, and if there's problems with threads or memory overhead, or any other issues. I don't really mind functions, which is why I am unsure; OTOH, almost every function has a first argument of "el", which makes them seem like methods. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From stefan_ml at behnel.de Thu May 24 16:36:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 May 2007 16:36:44 +0200 Subject: [lxml-dev] Bug in objectify node[:].index In-Reply-To: References: Message-ID: <4655A2FC.1070204@behnel.de> Hi, I only noticed now that your paragraph below was a) well hidden and b) left unanswered. Christian Zagrodnick wrote: > Anyway, all I *actually* want is to remove the nodes found by the > xpath. The way you'd think it would be 'normal' doesn't work > unfortunately: > > (Pdb) p parent.index(foo_tree) > 2 > (Pdb) del parent[2] > *** TypeError: deleting items not supported by root element Sounds pretty inefficient to me, it even has to traverse the children twice. 
The following is usually much faster, clearer and also works in this case:

    parent = foo_tree.getparent()
    if parent is not None:
        parent.remove(foo_tree)

Stefan

From stefan_ml at behnel.de Fri May 25 08:10:27 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 25 May 2007 08:10:27 +0200
Subject: [lxml-dev] lhtml
In-Reply-To: <4655B5D4.2080101@colorstudy.com>
References: <4655B5D4.2080101@colorstudy.com>
Message-ID: <46567DD3.8090700@behnel.de>

Hi Ian,

Ian Bicking wrote:
> I really want to take all our HTML-related routines and put them into a
> proper package -- right now they are scattered all over the place.
> lhtml seems like a nice name for this. I thought it would be good to
> also place it on codespeak, but I'm not sure where. New top-level? In
> /lxml/lhtml/(trunk|branches|tags) ?

Are they based on lxml.etree or rather ET compatible? In the first case, I'd
make them a module of lxml (part of the same project), in the second case, it
might be worth a separate project.

Note that lxml already has quite a number of modules like lxml.sax and
lxml.objectify, so lxml.html would nicely fit in here.

Stefan

From stefan_ml at behnel.de Fri May 25 08:14:30 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 25 May 2007 08:14:30 +0200
Subject: [lxml-dev] lhtml
In-Reply-To: <4655C076.7050901@colorstudy.com>
References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com>
Message-ID: <46567EC6.2000201@behnel.de>

Hi Ian,

Ian Bicking wrote:
> Ian Bicking wrote:
>> I really want to take all our HTML-related routines and put them into a
>> proper package
>
> And maybe a bit of advice -- we could just do this as a set of functions
> (what we currently have), or potentially explore objectify and add the
> routines as methods. E.g., el.find_by_class('classname')

You're not using objectify as a base, are you? I mean, HTML is mainly about
text, so objectify will not help you much.
> This feels like a cleaner API, but I'm worried that it will mean > problems when mixing non-objectify-HTML with other elements, and if > there's problems with threads or memory overhead, or any other issues. > I don't really mind functions, which is why I am unsure; OTOH, almost > every function has a first argument of "el", which makes them seem like > methods. What about implementing the HTML namespace in a couple of Element subclasses and add the methods where they are appropriate? That sounds like a nice API to me. Any chance you could post your code somewhere so that I could take a look at what you're really contributing here? Stefan From ianb at colorstudy.com Fri May 25 17:34:11 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 10:34:11 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <46567DD3.8090700@behnel.de> References: <4655B5D4.2080101@colorstudy.com> <46567DD3.8090700@behnel.de> Message-ID: <465701F3.2060707@colorstudy.com> Stefan Behnel wrote: > Hi Ian, > > Ian Bicking wrote: >> I really want to take all our HTML-related routines and put them into a >> proper package -- right now they are scattered all over the place. >> lhtml seems like a nice name for this. I thought it would be good to >> also place it on codespeak, but I'm not sure where. New top-level? In >> /lxml/lhtml/(trunk|branches|tags) ? > > Are they based on lxml.etree or rather ET compatible? In the first case, I'd > make them a module of lxml (part of the same project), in the second case, it > might be worth a separate project. Many of them use xpath, getparent, or something lxml-specific. Also without lxml.etree.HTML the whole thing is rather academic. > Note that lxml already has quite a number of modules like lxml.sax and > lxml.objectify, so lxml.html would nicely fit in here. I had thought about that, but I don't know if it should have the same release schedule...? 
It's a somewhat random collection of functions that we've written here and there in other modules. I can clean them up, of course, but exactly what functionality is in there has been an on-demand sort of thing. Which is to say, it's young. OTOH, we could distribute it as a namespace package with its own release cycle but still in lxml.html. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Fri May 25 18:11:12 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 11:11:12 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <46567EC6.2000201@behnel.de> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> Message-ID: <46570AA0.7030705@colorstudy.com> Stefan Behnel wrote: > Hi Ian, > > Ian Bicking wrote: >> Ian Bicking wrote: >>> I really want to take all our HTML-related routines and put them into a >>> proper package >> And maybe a bit of advice -- we could just do this as a set of functions >> (what we currently have), or potentially explore objectify and add the >> routines as methods. E.g., el.find_by_class('classname') > > You're not using objectify as a base, are you? I mean, HTML is mainly about > text, so objectify will not help you much. I'm not using it now, no. But if I used objectify as a base, it would be to add methods like .html_serialize() to elements, or any number of other handy methods. At least "handy" for dealing with the mixed content that HTML has, which is relatively uncommon in other XML. >> This feels like a cleaner API, but I'm worried that it will mean >> problems when mixing non-objectify-HTML with other elements, and if >> there's problems with threads or memory overhead, or any other issues. >> I don't really mind functions, which is why I am unsure; OTOH, almost >> every function has a first argument of "el", which makes them seem like >> methods. 
> > What about implementing the HTML namespace in a couple of Element subclasses > and add the methods where they are appropriate? That sounds like a nice API to me. The HTML() parser doesn't actually use namespaces. Well, maybe it does if you give it XHTML, or maybe you really have to use XML() to get that. It's never come up because I don't deal with any XHTML sites (because there are almost no XHTML sites ;). I'm not entirely clear on how namespaces fit in. Most of the methods would apply to all HTML elements, but HTML 4 elements aren't easy to distinguish. > Any chance you could post your code somewhere so that I could take a look at > what you're really contributing here? Sure; I started collecting a few of the routines from various libraries yesterday. There's still stuff in Deliverance and htmldiff that I haven't integrated. I haven't copied over any tests and there may be broken imports in many of the modules, but it should give you a vague idea of scope. (I'm actually looking for a home for htmldiff, so it's possible it could also go in this library; it's at https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py and https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt) Anyway, it's not too big so I'll just attach the stuff I have collected. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers -------------- next part -------------- A non-text attachment was scrubbed... 
Name: lhtml.tar.gz Type: application/x-gzip Size: 5480 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070525/9e5fcd8c/attachment.bin From stefan_ml at behnel.de Fri May 25 20:58:48 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 25 May 2007 20:58:48 +0200 Subject: [lxml-dev] lhtml In-Reply-To: <46570AA0.7030705@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> Message-ID: <465731E8.40603@behnel.de> Hi Ian, Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> Ian Bicking wrote: >>>> I really want to take all our HTML-related routines and put them >>>> into a proper package >>> And maybe a bit of advice -- we could just do this as a set of >>> functions (what we currently have), or potentially explore objectify >>> and add the routines as methods. E.g., el.find_by_class('classname') >> >> You're not using objectify as a base, are you? I mean, HTML is mainly >> about text, so objectify will not help you much. > > I'm not using it now, no. But if I used objectify as a base, it would > be to add methods like .html_serialize() to elements, or any number of > other handy methods. I don't know what you mean here. Maybe I'm just missing something that's more obvious to you, or are talking about custom element classes in general rather than objectify? http://codespeak.net/lxml/dev/element_classes.html http://codespeak.net/lxml/dev/objectify.html http://codespeak.net/lxml/dev/FAQ.html#what-is-the-difference-between-lxml-etree-and-lxml-objectify >>> This feels like a cleaner API, but I'm worried that it will mean >>> problems when mixing non-objectify-HTML with other elements, and if >>> there's problems with threads or memory overhead, or any other >>> issues. 
I don't really mind functions, which is why I am unsure; >>> OTOH, almost every function has a first argument of "el", which makes >>> them seem like methods. >> >> What about implementing the HTML namespace in a couple of Element >> subclasses >> and add the methods where they are appropriate? That sounds like a >> nice API to me. > > The HTML() parser doesn't actually use namespaces. True, I forgot. Still, you can use something like: >>> class HtmlElement(etree._ElementBase): ... # your implementation here >>> # some more subclasses for different HTML tags, e.g. AnchorElement >>> HTML_CLASSES = { ... "a" : AnchorElement, ... # ... ... } >>> class HtmlLookup(etree.CustomElementClassLookup): ... def lookup(self, node_type, document, namespace, name): ... if node_type == "element": ... return HTML_CLASSES.get(name, HtmlElement) ... else: ... return None # delegate >>> html_parser = etree.HTMLParser() >>> html_parser.setElementClassLookup(HtmlLookup()) >>> def HTML(html): ... return etree.HTML(html, html_parser) That does almost the same as the Namespace classes would. > I'm not entirely clear on how namespaces fit in. Most of the methods > would apply to all HTML elements, but HTML 4 elements aren't easy to > distinguish. I would expect only HTML elements in an HTML4 document. That makes it rather easy. But if you like, you can add any kind of special casing into the lookup method above, such as: if the tag has a namespace that's not XHTML, return None (i.e. the default Element class). >> Any chance you could post your code somewhere so that I could take a >> look at what you're really contributing here? > > Sure; I started collecting a few of the routines from various libraries > yesterday. There's still stuff in Deliverance and htmldiff that I > haven't integrated. I haven't copied over any tests and there may be > broken imports in many of the modules, but it should give you a vague > idea of scope. 
I took a quick look at it and I totally like the doctestcompare module. I'd love to use it for lxml's own doctests first of all, so, sure, that's a perfect companion to lxml's other modules. You already have write access to lxml's SVN repository, so there's not much of a problem with release cycles or anything. If you want to add new stuff, that may even be a good reason for a new version of lxml. :) Questions: doctest module: - is there any reason why you require a call to "lxmldoctest.install()"? I'd rather execute that immediately when you import the module. That's less intrusive for doctests (which is the main use case after all). - I'd like to call that module "lxml.xmldoctest" or something like that, so that you can "import xmldoctest" in a doctest file, which is rather readable. serialise.py: - libxml2 actually has some internal support for serialising HTML, so maybe it's worth looking at that first, in case we ever decide to wrap it. parse, serialize and fixuplinks: - I'll have to take a closer look at that to see if this makes sense in general. __init__.py: - some of this can be rewritten using plain XPath, e.g. get_parent_with_class (there's now RegExp support in lxml 1.3) or get_text (basically what 'string()' does). contains_class_xpath is not really much better than an XPath expression with variables, dito for get_elements_by_class and get_rel_links, e.g. the latter is better written as: get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]") get_rel_links(el, rel="whatever") > (I'm actually looking for a home for htmldiff, so it's > possible it could also go in this library; it's at > https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py > and > https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt) I'll take a look at this when I have a bit more time. 
Stefan From ianb at colorstudy.com Fri May 25 21:44:52 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 14:44:52 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <465731E8.40603@behnel.de> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> <465731E8.40603@behnel.de> Message-ID: <46573CB4.60601@colorstudy.com> Stefan Behnel wrote: > Hi Ian, > > Ian Bicking wrote: >> Stefan Behnel wrote: >>> Ian Bicking wrote: >>>> Ian Bicking wrote: >>>>> I really want to take all our HTML-related routines and put them >>>>> into a proper package >>>> And maybe a bit of advice -- we could just do this as a set of >>>> functions (what we currently have), or potentially explore objectify >>>> and add the routines as methods. E.g., el.find_by_class('classname') >>> You're not using objectify as a base, are you? I mean, HTML is mainly >>> about text, so objectify will not help you much. >> I'm not using it now, no. But if I used objectify as a base, it would >> be to add methods like .html_serialize() to elements, or any number of >> other handy methods. > > I don't know what you mean here. Maybe I'm just missing something that's more > obvious to you, or are talking about custom element classes in general rather > than objectify? > > http://codespeak.net/lxml/dev/element_classes.html > http://codespeak.net/lxml/dev/objectify.html > http://codespeak.net/lxml/dev/FAQ.html#what-is-the-difference-between-lxml-etree-and-lxml-objectify I probably confused the terms/modules, since I haven't used any of them. I think you are right, I'm just thinking about a custom element class. >>>> This feels like a cleaner API, but I'm worried that it will mean >>>> problems when mixing non-objectify-HTML with other elements, and if >>>> there's problems with threads or memory overhead, or any other >>>> issues. 
I don't really mind functions, which is why I am unsure; >>>> OTOH, almost every function has a first argument of "el", which makes >>>> them seem like methods. >>> What about implementing the HTML namespace in a couple of Element >>> subclasses >>> and add the methods where they are appropriate? That sounds like a >>> nice API to me. >> The HTML() parser doesn't actually use namespaces. > > True, I forgot. Still, you can use something like: > > >>> class HtmlElement(etree._ElementBase): > ... # your implementation here > > >>> # some more subclasses for different HTML tags, e.g. AnchorElement > > >>> HTML_CLASSES = { > ... "a" : AnchorElement, > ... # ... > ... } > > >>> class HtmlLookup(etree.CustomElementClassLookup): > ... def lookup(self, node_type, document, namespace, name): > ... if node_type == "element": > ... return HTML_CLASSES.get(name, HtmlElement) > ... else: > ... return None # delegate > > >>> html_parser = etree.HTMLParser() > >>> html_parser.setElementClassLookup(HtmlLookup()) > > >>> def HTML(html): > ... return etree.HTML(html, html_parser) > > That does almost the same as the Namespace classes would. Yes, that's the sort of thing I was thinking about (but was fuzzy on the details because I haven't tried it). It relies on a different parser from lxml.etree.HTML, and I would guess that elements created with etree.Element wouldn't necessarily use the right class. I'm just worried it adds more confusion, because things act differently depending on how the element was created or how a document is parsed. Functions are fairly straight-forward in comparison -- they just do stuff. They are also somewhat easier to document and browse through as a new user. For instance, it would be amusing to have an AnchorElement.GET() method. But what exactly would it do? Which HTTP library would it use? I don't know; if it was a function then it wouldn't matter, you'd just implement however many functions were necessary to do what people wanted to do. 
And those functions may or may not be implemented in lxml.html -- someone else could distribute their own implementations using whatever library they liked. But not all methods are like GET(). find_by_class() is probably more obvious -- it gets all elements according to a class name, and multiple implementations aren't necessary. >> I'm not entirely clear on how namespaces fit in. Most of the methods >> would apply to all HTML elements, but HTML 4 elements aren't easy to >> distinguish. > > I would expect only HTML elements in an HTML4 document. That makes it rather > easy. But if you like, you can add any kind of special casing into the lookup > method above, such as: if the tag has a namespace that's not XHTML, return > None (i.e. the default Element class). > > >>> Any chance you could post your code somewhere so that I could take a >>> look at what you're really contributing here? >> Sure; I started collecting a few of the routines from various libraries >> yesterday. There's still stuff in Deliverance and htmldiff that I >> haven't integrated. I haven't copied over any tests and there may be >> broken imports in many of the modules, but it should give you a vague >> idea of scope. > > I took a quick look at it and I totally like the doctestcompare module. I'd > love to use it for lxml's own doctests first of all, so, sure, that's a > perfect companion to lxml's other modules. You already have write access to > lxml's SVN repository, so there's not much of a problem with release cycles or > anything. If you want to add new stuff, that may even be a good reason for a > new version of lxml. :) > > Questions: > > doctest module: > > - is there any reason why you require a call to "lxmldoctest.install()"? I'd > rather execute that immediately when you import the module. That's less > intrusive for doctests (which is the main use case after all). I dislike having modules do something to the system when you import them. 
OTOH, I dislike that I have to monkeypatch doctest to get the comparison function in, but it's not practical to do anything else. So maybe I just have to put up with it. > - I'd like to call that module "lxml.xmldoctest" or something like that, so > that you can "import xmldoctest" in a doctest file, which is rather readable. I'd be surprised if this would actually work -- I'd expect that it would be too late once you were running the doctest. But I haven't tried. > serialise.py: > > - libxml2 actually has some internal support for serialising HTML, so maybe > it's worth looking at that first, in case we ever decide to wrap it. Sure; this was just the most expedient thing we figured out. The XSLT serialization probably uses something else in libxml2, so maybe direct access to that is possible. There's nothing terribly wrong about it either, it's just a little roundabout (which isn't so bad if it is implemented in a reusable function, of course -- but reimplementing it each time you want to serialize HTML isn't so good). > parse, serialize and fixuplinks: > > - I'll have to take a closer look at that to see if this makes sense in general. The parse stuff is really just charset detection. I don't think lxml/libxml2 does this natively (checking the meta tag), but I'm not actually 100% sure. It should include parsing HTML fragments too, which is a little hard (HTML() interprets all text as complete documents, and adds in elements to make the document valid, which often isn't what you'd want). > __init__.py: > > - some of this can be rewritten using plain XPath, e.g. get_parent_with_class > (there's now RegExp support in lxml 1.3) or get_text (basically what > 'string()' does). contains_class_xpath is not really much better than an XPath > expression with variables, dito for get_elements_by_class and get_rel_links, > e.g.
the latter is better written as: > > get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]") > get_rel_links(el, rel="whatever") I tried doing class name matching with a regular expression, but never got it to work. It might have been a bug in my or lxml's code, I'm not sure -- whatever it was, I was in a mind to move on ;). General CSS selector support would be wonderful. But anyway, these are things that weren't obvious to me, so I think it's still useful to include the functions even if their implementation is fairly trivial. I'm sure there's other implementation details that could be improved. Most of those particular functions came from some microformat parsing, and most microformats are just built on a small number of queries. Deliverance and htmldiff had more stuff for modifying the structure, which is often quite awkward with the ElementTree model (doing something that seems easy, like removing a tag, is nontrivial). A lot of those things aren't that specific to HTML, except that HTML has lots of situations where tags and text are mixed together. >> (I'm actually looking for a home for htmldiff, so it's >> possible it could also go in this library; it's at >> https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py >> and >> > https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt) > > I'll take a look at this when I have a bit more time. Sure; note it's very much oriented towards human-readable diffs, not formal diffs. Which fits HTML fairly well (where the tags are more like annotations of the text), but not most other XML documents. 
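To illustrate why removing a tag is awkward in the ElementTree model: text lives in .text and .tail, so dropping a tag means re-attaching both by hand. What follows is only a minimal sketch, assuming the standard library's xml.etree.ElementTree rather than lxml (lxml elements additionally offer getparent(), which would replace the parent map). drop_tag is a hypothetical helper name, not something taken from the attached code.

```python
import xml.etree.ElementTree as ET


def drop_tag(root, el):
    """Remove the tag of `el` but keep its text, children and tail in place.

    `el` must be a descendant of `root`, not `root` itself.
    """
    # Stdlib elements have no parent pointer, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    parent = parents[el]
    index = list(parent).index(el)
    children = list(el)

    # The trailing text of `el` now follows its last child or, if it has
    # no children, directly follows its own leading text.
    if children:
        leading = el.text or ""
        children[-1].tail = (children[-1].tail or "") + (el.tail or "")
    else:
        leading = (el.text or "") + (el.tail or "")

    # Leading text is attached to the previous sibling's tail, or to the
    # parent's text when `el` is the first child.
    if leading:
        if index == 0:
            parent.text = (parent.text or "") + leading
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + leading

    # Splice the children into the parent where `el` used to be.
    parent.remove(el)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)


root = ET.fromstring("<p>Hello <b>big <i>wide</i> world</b>!</p>")
drop_tag(root, root.find("b"))
print(ET.tostring(root, encoding="unicode"))
# prints: <p>Hello big <i>wide</i> world!</p>
```

Three different places can receive the text (parent text, previous tail, last child's tail), which is the kind of bookkeeping a helper function would hide.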
-- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Fri May 25 22:34:26 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 15:34:26 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <46573CB4.60601@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> <465731E8.40603@behnel.de> <46573CB4.60601@colorstudy.com> Message-ID: <46574852.2000405@colorstudy.com> On the list-of-things-I-could-imagine-in-lxml.html, I've been wanting to reimplement formencode.htmlfill for a long time (http://formencode.org/htmlfill.html). It uses HTMLParser, and I hate HTMLParser. It fills in forms, and also fills in errors. I'd probably change the way errors are given, and separate it into two functions -- HTMLParser rewards doing everything in one pass, but with a document model it would both be much easier to write and using multiple passes is no problem. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From stefan_ml at behnel.de Fri May 25 22:37:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 25 May 2007 22:37:44 +0200 Subject: [lxml-dev] lhtml In-Reply-To: <46573CB4.60601@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> <465731E8.40603@behnel.de> <46573CB4.60601@colorstudy.com> Message-ID: <46574918.70900@behnel.de> Hi Ian, Ian Bicking wrote: > Stefan Behnel wrote: > It relies on a different parser from lxml.etree.HTML, and I would guess > that elements created with etree.Element wouldn't necessarily use the > right class. objectify replicates the XML() and Element() factories for exactly this purpose. lxml.html could do likewise. 
> Functions are fairly straight-forward in comparison > -- they just do stuff. They are also somewhat easier to document and > browse through as a new user. ET has functions like tostring(), but methods like write() or xpath(). Some are obvious decisions, others are a matter of taste. There's nothing bad to it. > For instance, it would be amusing to have an AnchorElement.GET() method. > But what exactly would it do? Which HTTP library would it use? Well, lxml already has a custom resolver API. We should be able to reuse that in some way. On the other hand, I'd give a clear vote for a function approach here rather than a method. I think a do-what-I-mean GET() method is a sign of overdesign. > But not all methods are like GET(). find_by_class() is probably more > obvious -- it gets all elements according to a class name, and multiple > implementations aren't necessary. Definitely. That's the kind of argument that works well for and against functions and methods. >> - I'd like to call that module "lxml.xmldoctest" or something like >> that, so >> that you can "import xmldoctest" in a doctest file, which is rather >> readable. > > I'd be surprised it this would actually work -- I'd expect that it would > be too late once you were running the doctest. But I haven't tried. Me neither. But *if* it works, *then* requiring a call to install() shouldn't be necessary. >> parse, serialize and fixuplinks: >> >> - I'll have to take a closer look at that to see if this makes sense >> in general. > > The parse stuff is really just charset detection. I don't think > lxml/libxml2 does this natively (checking the meta tag), but I'm not > actually 100% sure. It does actually. You will see that when you pass in a unicode string that contains a meta-tag with some byte encoding (say, UTF-8). This will break immediately. Note, however, that libxml2 requires a bit of structure to actually find the tag. 
Simply prepending a complete HTML document with such a tag (which I've seen in a couple of real-life broken HTML documents) will not work. > It should include parsing HTML fragments too, which > is a little hard (HTML() interprets all text as complete documents, and > adds in elements to make the document valid, which often isn't what > you'd want). Maybe a simple approach here would be to check if a string starts with a known inner HTML tag, then just prefix it with before parsing and return their child (or children) after parsing. >> __init__.py: >> >> - some of this can be rewritten using plain XPath, e.g. >> get_parent_with_class >> (there's now RegExp support in lxml 1.3) or get_text (basically what >> 'string()' does). contains_class_xpath is not really much better than >> an XPath >> expression with variables, dito for get_elements_by_class and >> get_rel_links, >> e.g. the latter is better written as: >> >> get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]") >> get_rel_links(el, rel="whatever") > > I tried doing class name matching with a regular expression, but never > got it to work. It might have been a bug in my or lxml's code, I'm not > sure -- whatever it was, I was in a mind to move on ;). I recently fixed a few problems with the regexp support, quite possible that it were those that stopped you. > General CSS > selector support would be wonderful. But anyway, these are things that > weren't obvious to me, so I think it's still useful to include the > functions even if their implementation is fairly trivial. Sure, little helpers keep you from (re-)writing them yourself, but more interestingly, they encourage a certain programming style that can make your life easier. > I'm sure there's other implementation details that could be improved. > Most of those particular functions came from some microformat parsing, > and most microformats are just built on a small number of queries. 
> Deliverance and htmldiff had more stuff for modifying the structure, > which is often quite awkward with the ElementTree model (doing something > that seems easy, like removing a tag, is nontrivial). A lot of those > things aren't that specific to HTML, except that HTML has lots of > situations where tags and text are mixed together. That's just another reason for a wrapper API on top of lxml.etree that makes working with HTML more intuitive. Fredrik wrote a nice factory class for generating (X|HT)ML a while ago, I felt free to add it as "lxml.htmlbuilder" (although I'm still waiting for his reply to see if it can stay there to become part of lxml 1.3). But the other API side of parsing and treating HTML document in a convenient way is much more ambitious. Stefan From faassen at startifact.com Fri May 25 22:41:16 2007 From: faassen at startifact.com (Martijn Faassen) Date: Fri, 25 May 2007 22:41:16 +0200 Subject: [lxml-dev] lhtml In-Reply-To: <465701F3.2060707@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> <46567DD3.8090700@behnel.de> <465701F3.2060707@colorstudy.com> Message-ID: Ian Bicking wrote: > Stefan Behnel wrote: [snip] >> Note that lxml already has quite a number of modules like lxml.sax and >> lxml.objectify, so lxml.html would nicely fit in here. > > I had thought about that, but I don't know if it should have the same > release schedule...? It's a somewhat random collection of functions > that we've written here and there in other modules. I can clean them > up, of course, but exactly what functionality is in there has been an > on-demand sort of thing. Which is to say, it's young. If you're willing to help getting releases out of the door, I could imagine we let the html functionality drive the release schedule for a while. It is not like we have a lot of other features lined up right now, so this would be a good way to actually drive a few new releases. What do you think, Stefan? 
> OTOH, we could distribute it as a namespace package with its own release > cycle but still in lxml.html. I think this is possible as 'lxml' itself is currently an empty namespace package (empty __init__.py). It does contain a lot of modules though. Does the whole namespace package installation machinery in setuptools still work in this case? Regards, Martijn From ianb at colorstudy.com Fri May 25 23:39:59 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 16:39:59 -0500 Subject: [lxml-dev] lhtml In-Reply-To: <46574918.70900@behnel.de> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> <465731E8.40603@behnel.de> <46573CB4.60601@colorstudy.com> <46574918.70900@behnel.de> Message-ID: <465757AF.6090700@colorstudy.com> Stefan Behnel wrote: > Ian Bicking wrote: >> Stefan Behnel wrote: >> It relies on a different parser from lxml.etree.HTML, and I would guess >> that elements created with etree.Element wouldn't necessarily use the >> right class. > > objectify replicates the XML() and Element() factories for exactly this > purpose. lxml.html could do likewise. Sure. Presumably at least a parser would be in there (HTML()). I suppose no reason Element can't be too. How does this interact with XSLT translations? When you translate a document, it keeps the parser and hence the custom classes? >>> - I'd like to call that module "lxml.xmldoctest" or something like >>> that, so >>> that you can "import xmldoctest" in a doctest file, which is rather >>> readable. >> I'd be surprised it this would actually work -- I'd expect that it would >> be too late once you were running the doctest. But I haven't tried. > > Me neither. But *if* it works, *then* requiring a call to install() shouldn't > be necessary. Probably we could *make* it work. It would make me more comfortable if at least it was a separate module. 
So there'd be an lxml.xmldoctest module, and an lxml.usexmldoctest or something. Then you wouldn't *have* to enable the checker if you just want to import the module (e.g., to make your own checker based on that checker). There's also some ambiguity between HTML and XML. When do you parse something as HTML, and when only as XML? It depends on the doctest. You can kind of tell by looking for <html>, but I actually spend more time looking at HTML snippets than documents when doing testing. With enough work it would probably be possible to use that import to selectively activate the checker only during the doctest it was imported into. That would be ideal to me. Then you could use that to indicate if you prefer HTML or XML parsing for your checking. I generally like doctests to be standalone, so being able to enable your preferred checker directly in the doctest would certainly be nice. >>> parse, serialize and fixuplinks: >>> >>> - I'll have to take a closer look at that to see if this makes sense >>> in general. >> The parse stuff is really just charset detection. I don't think >> lxml/libxml2 does this natively (checking the meta tag), but I'm not >> actually 100% sure. > > It does actually. You will see that when you pass in a unicode string that > contains a meta-tag with some byte encoding (say, UTF-8). This will break > immediately. > > Note, however, that libxml2 requires a bit of structure to actually find the > tag. Simply prepending a complete HTML document with such a tag (which > I've seen in a couple of real-life broken HTML documents) will not work. OK. I don't know quite why we had that code; maybe we never tested exactly how it worked. Including chardet as a fallback could be kind of interesting. I don't usually deal with the web when it is *quite* that wild and woolly that its character sets have to be guessed, but I'd like to handle that nicely anyway.
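The lookup order discussed here (declared charset first, a guesser only as fallback) could look roughly like the following. This is a sketch under assumptions, not the code from the attached tarball: sniff_charset, its guess parameter, the default and the 4096-byte search window are all made up for illustration, and as the quoted reply notes, libxml2 already reads the meta tag itself.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_charset(data, guess=None, default="utf-8"):
    """Return a best-effort encoding name for raw HTML bytes.

    `guess` is an optional callable taking the bytes and returning an
    encoding name or None (hypothetical hook for a statistical guesser).
    """
    # Only the start of the document is searched (assumed window size).
    match = _META_CHARSET.search(data[:4096])
    if match:
        name = match.group(1).decode("ascii")
        try:
            # codecs.lookup validates the label and normalises its spelling.
            return codecs.lookup(name).name
        except LookupError:
            pass  # unknown label: fall through to guessing
    if guess is not None:
        guessed = guess(data)
        if guessed:
            return guessed
    return default


print(sniff_charset(b'<html><head><meta charset="UTF-8"></head></html>'))
# prints: utf-8
```

A guesser such as chardet could presumably be plugged in as `guess=lambda d: chardet.detect(d)["encoding"]`; keeping it a parameter avoids a hard dependency.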
>> It should include parsing HTML fragments too, which >> is a little hard (HTML() interprets all text as complete documents, and >> adds in elements to make the document valid, which often isn't what >> you'd want). > > Maybe a simple approach here would be to check if a string starts with a known > inner HTML tag, then just prefix it with before parsing and > return their child (or children) after parsing. I'm comfortable (probably more comfortable) with different parsing functions. I imagine parse, parse_fragment, and parse_element. parse is like HTML(), parse_fragment returns a list of elements, parse_element only returns a single element (and an exception if you give it a document with multiple elements). Leading text for parse_fragment is a little awkward. In addition to returning the children, I'd like to break the reference to the artificial parent that was added in. You can get at the parent with many kinds of queries, which can be confusing. >>> __init__.py: >>> >>> - some of this can be rewritten using plain XPath, e.g. >>> get_parent_with_class >>> (there's now RegExp support in lxml 1.3) or get_text (basically what >>> 'string()' does). contains_class_xpath is not really much better than >>> an XPath >>> expression with variables, dito for get_elements_by_class and >>> get_rel_links, >>> e.g. the latter is better written as: >>> >>> get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]") >>> get_rel_links(el, rel="whatever") >> I tried doing class name matching with a regular expression, but never >> got it to work. It might have been a bug in my or lxml's code, I'm not >> sure -- whatever it was, I was in a mind to move on ;). > > I recently fixed a few problems with the regexp support, quite possible that > it were those that stopped you. Perhaps so. \b wasn't working right for me, if I remember. 
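One general pitfall with \b for class names, which may or may not be what went wrong here: "-" counts as a word boundary, so \bnote\b also matches inside "foot-note". A small sketch of the difference, using the standard library's xml.etree.ElementTree so it runs without lxml. has_class and find_by_class are illustrative names (the latter borrowed from the method idea earlier in this thread), not the functions from the attached code.

```python
import re
import xml.etree.ElementTree as ET


def has_class(el, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in el.get("class", "").split()


def find_by_class(root, name):
    """Return all elements at or below `root` that carry the class `name`."""
    return [el for el in root.iter() if has_class(el, name)]


doc = ET.fromstring(
    '<div><p class="note">a</p><p class="foot-note">b</p>'
    '<p class="big note">c</p></div>'
)

# \b sees a boundary between "-" and "n", so "foot-note" matches as well.
by_regex = [
    p.text for p in doc.iter("p") if re.search(r"\bnote\b", p.get("class", ""))
]
print(by_regex)
# prints: ['a', 'b', 'c']

# Comparing whole tokens only matches real class names.
print([p.text for p in find_by_class(doc, "note")])
# prints: ['a', 'c']
```

The usual pure-XPath 1.0 equivalent of the token test is contains(concat(' ', normalize-space(@class), ' '), ' note '), which fits the XPath-with-variables style suggested earlier.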
> Fredrik wrote a nice factory class for generating (X|HT)ML a while ago, I felt > free to add it as "lxml.htmlbuilder" (although I'm still waiting for his reply > to see if it can stay there to become part of lxml 1.3). But the other API > side of parsing and treating HTML document in a convenient way is much more > ambitious. How are attributes handled in his version? That's always the place where opinions vary on builders. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Fri May 25 23:46:30 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 25 May 2007 16:46:30 -0500 Subject: [lxml-dev] lhtml In-Reply-To: References: <4655B5D4.2080101@colorstudy.com> <46567DD3.8090700@behnel.de> <465701F3.2060707@colorstudy.com> Message-ID: <46575936.6000402@colorstudy.com> Martijn Faassen wrote: > Ian Bicking wrote: >> Stefan Behnel wrote: > [snip] >>> Note that lxml already has quite a number of modules like lxml.sax and >>> lxml.objectify, so lxml.html would nicely fit in here. >> I had thought about that, but I don't know if it should have the same >> release schedule...? It's a somewhat random collection of functions >> that we've written here and there in other modules. I can clean them >> up, of course, but exactly what functionality is in there has been an >> on-demand sort of thing. Which is to say, it's young. > > If you're willing to help getting releases out of the door, I could > imagine we let the html functionality drive the release schedule for a > while. It is not like we have a lot of other features lined up right > now, so this would be a good way to actually drive a few new releases. > What do you think, Stefan? Well, maybe I'm thinking of more common releases, but maybe also less common releases. In that, I don't know exactly what all should be in the module to start with, and what kind of design review should go into those modules. 
>> OTOH, we could distribute it as a namespace package with its own release >> cycle but still in lxml.html. > > I think this is possible as 'lxml' itself is currently an empty > namespace package (empty __init__.py). It does contain a lot of modules > though. Does the whole namespace package installation machinery in > setuptools still work in this case? Yes. I don't quite understand the details, but I think lxml.__path__ points to the directories for both packages, and it searches both in turn. So both can have an arbitrary number of modules under lxml and in each distribution. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From stefan_ml at behnel.de Sat May 26 07:50:50 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 26 May 2007 07:50:50 +0200 Subject: [lxml-dev] lhtml In-Reply-To: <465757AF.6090700@colorstudy.com> References: <4655B5D4.2080101@colorstudy.com> <4655C076.7050901@colorstudy.com> <46567EC6.2000201@behnel.de> <46570AA0.7030705@colorstudy.com> <465731E8.40603@behnel.de> <46573CB4.60601@colorstudy.com> <46574918.70900@behnel.de> <465757AF.6090700@colorstudy.com> Message-ID: <4657CABA.3010005@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> Stefan Behnel wrote: >>> It relies on a different parser from lxml.etree.HTML, and I would guess >>> that elements created with etree.Element wouldn't necessarily use the >>> right class. >> >> objectify replicates the XML() and Element() factories for exactly this >> purpose. lxml.html could do likewise. > > Sure. Presumably at least a parser would be in there (HTML()). I > suppose no reason Element can't be too. > > How does this interact with XSLT translations? When you translate a > document, it keeps the parser and hence the custom classes? Exactly. It Does What You'd Expect(TM). > It would make me more comfortable if at least it was a separate module. 
> So there'd be an lxml.xmldoctest module, and an lxml.usexmldoctest And since lxml.usexmldoctest and lxml.usehtmldoctest would be the ones you'd import, the xmldoctest would just be the implementation detail in the background. > There's also some ambiguity between HTML and XML. When do you parse > something as HTML, and when only as XML? It depends on the doctest. You > can kind of tell by looking for , but I actually spend more time > looking at HTML snippets than documents when doing testing. Doing that on trees is possible, but when comparing serialised HT/XML, you've already lost the information how it was parsed. So, no reason why we shouldn't have two modules that can be imported. > With enough work it would probably be possible to use that import to > selectively activate the checker only during the doctest it was imported > into. That would be ideal to me. Then you could use that to indicate > if you prefer HTML or XML parsing your checking. I generally like > doctests to be standalone, so being able to enable your preferred > checker directly in the doctest would certainly be nice. That would be great. Another point: how do we deal with doctests that mix XML and HTML? That's likely more rare, so we could still provide some kind of "use()" function in the two modules that allows to switch between the two if you really need to, but that would not have to be called if you just import one of them. >>> It should include parsing HTML fragments too, which >>> is a little hard (HTML() interprets all text as complete documents, and >>> adds in elements to make the document valid, which often isn't what >>> you'd want). >> >> Maybe a simple approach here would be to check if a string starts with >> a known >> inner HTML tag, then just prefix it with before parsing and >> return their child (or children) after parsing. > > I'm comfortable (probably more comfortable) with different parsing > functions. I imagine parse, parse_fragment, and parse_element. 
parse > is like HTML(), parse_fragment returns a list of elements, parse_element > only returns a single element (and an exception if you give it a > document with multiple elements). Leading text for parse_fragment is a > little awkward. Sure, sounds reasonable. > In addition to returning the children, I'd like to break the reference > to the artificial parent that was added in. You can get at the parent > with many kinds of queries, which can be confusing. That's harder, though. Once they are in the document, it's hard to change the root node from Python code. Maybe we can come up with a solution that allows us to hide that in the parse functions. >> Fredrik wrote a nice factory class for generating (X|HT)ML a while >> ago, I felt >> free to add it as "lxml.htmlbuilder" (although I'm still waiting for >> his reply >> to see if it can stay there to become part of lxml 1.3). But the other >> API >> side of parsing and treating HTML document in a convenient way is much >> more >> ambitious. > > How are attributes handled in his version? That's always the place > where opinions vary on builders. Keyword arguments. See http://online.effbot.org/2006_11_01_archive.htm#et-builder Stefan From eric at detede.com Sat May 26 09:03:56 2007 From: eric at detede.com (Eric Garin) Date: Sat, 26 May 2007 09:03:56 +0200 Subject: [lxml-dev] Xhtml and entities Message-ID: <683416396.20070526090356@detede.com> An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070526/6d9f3e56/attachment.htm From stefan_ml at behnel.de Sat May 26 09:59:06 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 26 May 2007 09:59:06 +0200 Subject: [lxml-dev] Xhtml and entities In-Reply-To: <683416396.20070526090356@detede.com> References: <683416396.20070526090356@detede.com> Message-ID: <4657E8CA.20200@behnel.de> Eric Garin wrote: > I've got some problem with xhtml validation and entities. 
> But when the document contains some entities for example rsquo which is > declared in the xhtml-special.ent entities file : > > > > > > I've got this error : > > > Traceback (most recent call last): > > File "xhtml.py", line 42, in ? > > print xv.validate(xhtmlDoc) > > File "xhtml.py", line 28, in validate > > docXhtml = etree.parse(fileXhtml) [...] > etree.XMLSyntaxError: line 8: Entity 'rsquo' not defined Please read http://codespeak.net/lxml/dev/parsing.html#parser-options lxml.etree does not load DTDs by default. Stefan From stefan_ml at behnel.de Sat May 26 12:54:56 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 26 May 2007 12:54:56 +0200 Subject: [lxml-dev] Xhtml and entities In-Reply-To: <1751304501.20070526123637@detede.com> References: <683416396.20070526090356@detede.com> <4657E8CA.20200@behnel.de> <1751304501.20070526123637@detede.com> Message-ID: <46581200.8030007@behnel.de> Eric Garin wrote: > I got something a bit better cause I've just declare a DOCTYPE in the XML instance. > > My document is : > > > > > > > Untitled Document > > > >

> This entity exists : &rsquo;
>
> This not exists : &rsquo1111;

> > > > > The thing is that now it's ALWAYS valid even if there is an non-declared XHTML entity in the XML > I use the parser to load the instance like that : > > parser = etree.XMLParser(load_dtd=True) > docXhtml = etree.parse(fileXhtml, parser) As I said, please read http://codespeak.net/lxml/dev/parsing.html#parser-options where it says: load_dtd - load and parse the DTD while parsing (no validation is performed) dtd_validation - validate while parsing (if a DTD was referenced) lxml.etree does not load DTDs by default and it does not do validation only because you tell it to *load* the DTD. If you want DTD validation in the parser, you have to tell it to do "dtd_validation". Stefan From stefan_ml at behnel.de Sun May 27 10:13:54 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 27 May 2007 10:13:54 +0200 Subject: [lxml-dev] lxml lets undeclared entities pass through silently (was: Xhtml and entities) In-Reply-To: <1363447391.20070526140259@detede.com> References: <683416396.20070526090356@detede.com> <4657E8CA.20200@behnel.de> <1751304501.20070526123637@detede.com> <46581200.8030007@behnel.de> <1363447391.20070526140259@detede.com> Message-ID: <46593DC2.2070807@behnel.de> Hi Eric, please reply also to the mailing list (reply all), not just to me. That way, you may also get comments by other people and the mails will be archived and others can search and read them. Eric Garin wrote: > Sorry Stefan but I've actually read this documentation (and even several times) Sorry if I sounded somewhat harsh, I do believe you. > So I did a test with something really simple : [parse XML containing an undeclared entity] > Result : parser says nothing even if &oneXXX; is not declared Not quite, it does say something, it just doesn't raise an exception. >>> from lxml import etree >>> parser = etree.XMLParser() >>> xml = etree.parse("entity.xml", parser) Ok, no exception here, so what happened? 
>>> print parser.error_log
entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined

So, libxml2 did find the missing entity and reported the error to lxml. I
looked into it and it seems that the parser continued parsing and returned a
document containing the entity reference, saying that it was well formed.
Therefore, lxml did not raise an exception.

When you serialise the document after parsing, you will see that the entity
reference is still in there, so this actually works. However, when you print
the ".text" of the element containing the entity reference, it is not printed,
so you can see that it is not passed on to the API level.

Given the normal lxml behaviour of resolving entities and not supporting them
at the API level at all, I would call this a bug. However, it is not obvious
how to deal with this. I mean, entity references currently pass in and out
rather nicely, they are just not visible. So raising an error here would likely
break some existing code that does not explicitly load DTDs to resolve them but
relies on lxml's current behaviour of passing them through.

On the other hand, there is no easy way to support them at the API level, as
they can occur anywhere in text content. I mean, how should lxml distinguish
between a user passing in the entity "&entity;" and someone who just passes in
the text "In XML, entities are written as &entity;", expecting that it gets
properly escaped like any other text in lxml.etree? So it is better to raise an
error here than to have users deal with entities.

Any opinions on this?
Stefan From stefan_ml at behnel.de Sun May 27 10:34:00 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 27 May 2007 10:34:00 +0200 Subject: [lxml-dev] lxml lets undeclared entities pass through silently In-Reply-To: <46593DC2.2070807@behnel.de> References: <683416396.20070526090356@detede.com> <4657E8CA.20200@behnel.de> <1751304501.20070526123637@detede.com> <46581200.8030007@behnel.de> <1363447391.20070526140259@detede.com> <46593DC2.2070807@behnel.de> Message-ID: <46594278.8070600@behnel.de> Stefan Behnel wrote: > Hi Eric, > > please reply also to the mailing list (reply all), not just to me. That way, > you may also get comments by other people and the mails will be archived and > others can search and read them. > > Eric Garin wrote: >> Sorry Stefan but I've actually read this documentation (and even several times) > > Sorry if I sounded somewhat harsh, I do believe you. > > >> So I did a test with something really simple : > [parse XML containing an undeclared entity] >> Result : parser says nothing even if &oneXXX; is not declared > > Not quite, it does say something, it just doesn't raise an exception. > > >>> from lxml import etree > >>> parser = etree.XMLParser() > >>> xml = etree.parse("entity.xml", parser) > > Ok, no exception here, so what happened? > > >>> print parser.error_log > entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined > > So, libxml2 did find the missing entity and reported the error to lxml. I > looked into it and it seems that the parser continued parsing and returned a > document containing the entity reference, saying that it was well formed. > Therefore, lxml did not raise an exception. Here's a trivial patch that raises an exception in this case. Still not sure this is the right solution, though. 
Stefan


Index: src/lxml/parser.pxi
===================================================================
--- src/lxml/parser.pxi (Revision 43690)
+++ src/lxml/parser.pxi (Arbeitskopie)
@@ -622,7 +622,8 @@
         ctxt.myDoc = NULL
 
     if result is not NULL:
-        if ctxt.wellFormed or recover:
+        if recover or (ctxt.wellFormed and \
+                       ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
             __GLOBAL_PARSER_CONTEXT.initDocDict(result)
         else:
             # free broken document

From eric at detede.com Sun May 27 12:08:53 2007
From: eric at detede.com (Eric Garin)
Date: Sun, 27 May 2007 12:08:53 +0200
Subject: [lxml-dev] lxml lets undeclared entities pass through silently (was: Xhtml and entities)
In-Reply-To: <46593DC2.2070807@behnel.de>
References: <683416396.20070526090356@detede.com> <4657E8CA.20200@behnel.de>
 <1751304501.20070526123637@detede.com> <46581200.8030007@behnel.de>
 <1363447391.20070526140259@detede.com> <46593DC2.2070807@behnel.de>
Message-ID: <1463988179.20070527120853@detede.com>

An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070527/e81145e2/attachment-0001.htm

From stefan_ml at behnel.de Sun May 27 15:33:00 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 27 May 2007 15:33:00 +0200
Subject: [lxml-dev] Entity handling in lxml
Message-ID: <4659888C.1080405@behnel.de>

Hi all,

let's make this a new thread to discuss the topic that was raised by Eric
Garin.

The parsers in lxml are currently configured to replace entity references
(&entity;) by their definition. This requires a DTD, either inside the
document, as an external URL reference, or from the system catalog. The parsers
do not currently load DTDs by default, nor do they do validation.
So, the current situation is:

1) If you use the default parser, all entities will pass through without
   exception, but put an error message in the error_log:

   entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined

   They will not be visible at the API level, they will cut off text that
   contains them ("my &entity; value" will result in a text property value
   "my "), but they will be serialised correctly. They may also break a lot of
   things internally, as the implementation is not prepared for dealing with
   stuff like entity reference nodes.

2) If you configure a parser to load the DTD, declared entities will be
   replaced and undeclared entities will behave as above.

3) If you configure a parser to validate against a DTD, it will still behave
   exactly as above.

This behaviour is definitely a bug. It would be cleaner to do this:

1) The default parser should replace internally defined entities and report
   all other entities as an error.

2) A parser that loads the DTD should report undeclared entities as an error
   (although it would not do any validation).

3) A validating parser should report undeclared entities as an error, just as
   any other structural or semantic deviation from the DTD.

The alternative would be to provide an API for entities and to rewrite the
internals to deal with them somehow. We could potentially make entity
references a sort of element that behaves more or less like a comment. Entities
would mainly have a name and a tail. We would then need an Entity() factory and
integrate entity reference nodes into the internal traversal code (basically:
let _isElement(c_entity_node) return 1).

When would they appear in the tree? We would additionally need a
"resolve_entities" keyword argument for the parsers, that would be the easiest
way to deal with this. If it is set, unresolvable entities will result in an
error as described above. Otherwise, entity references will not be replaced.

Any comments?
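
For illustration only, here is a minimal sketch of what the proposed default behaviour 1) looks like from the caller's side: internally declared entities get replaced, anything else is an error. It deliberately uses the standard library's xml.etree.ElementTree as a stand-in, not lxml, so it says nothing about how lxml itself does or will behave. The sample documents and the helper name parse_text are assumptions made up for this example (written for Python 3).

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring exactly one entity (made-up sample document).
DOCTYPE = '<!DOCTYPE p [<!ENTITY rsquo "&#8217;">]>'

DECLARED = DOCTYPE + "<p>it&rsquo;s</p>"
UNDECLARED = DOCTYPE + "<p>it&rsquo1111;s</p>"


def parse_text(xml):
    """Return the root element's text, or None if the parser rejects the input.

    parse_text is a hypothetical helper for this sketch, not an lxml function.
    """
    try:
        return ET.fromstring(xml).text
    except ET.ParseError:
        # The stdlib parser treats an undeclared entity as a hard error.
        return None
```

With these inputs, the declared entity ends up as an ordinary character in the text, while the undeclared one makes the whole parse fail instead of passing through silently.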
Stefan From stefan_ml at behnel.de Sun May 27 15:45:38 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 27 May 2007 15:45:38 +0200 Subject: [lxml-dev] lxml 1.3 - or 2.0? Message-ID: <46598B82.1020908@behnel.de> Hi all, I'm currently thinking about what should be changed or fixed for lxml 2.0, a preliminary list is at the end of TODO.txt. As fixing the entity handling would visibly change the way parsers behave (more exceptions, stricter parsing, etc.), maybe this would rather go into that list. This is what I have for now: * reformat error log lines, add column number * always use '' as URL when tree was parsed from string? (can libxml2 handle this?) * clean up (and remove?) duplicated API for extension functions * find a way to integrate Schematron (if it's available) * always use ns-prefixed type names in objectify's ``xsi:type`` attributes * remove ``findOrBuildNodeNs()`` from C-API (replaced by findOrBuildNodeNsPrefix) * lxml.html (Ian?) * clean support for entities (maybe an Entity element class?) * follow PEP 8 in API naming (avoidCamelCase in_favour_of_underscores) The list of trunk changes since 1.2.1 is already pretty long, and there were some major changes under the hood (even since 1.3beta), e.g. the namespace cleanup code, XSD type prefixing in objectify, exception messages, etc. Hence, I'm actually considering to skip the 1.3 release and to rather start working on 2.0 directly, so that we could have an alpha version soon. Maybe some more bug fixes could be backported into a 1.2.2 first, so that people can work with it, but a 1.3 would mean "new features". I would like to have the freedom to break a few things here in order to make their API cleaner. And that means 2.0, not 1.3. Comments? Stefan From ianb at colorstudy.com Tue May 29 01:46:46 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 28 May 2007 18:46:46 -0500 Subject: [lxml-dev] lxml 1.3 - or 2.0? 
In-Reply-To: <46598B82.1020908@behnel.de> References: <46598B82.1020908@behnel.de> Message-ID: <465B69E6.7010603@colorstudy.com> Stefan Behnel wrote: > Hi all, > > I'm currently thinking about what should be changed or fixed for lxml 2.0, a > preliminary list is at the end of TODO.txt. As fixing the entity handling > would visibly change the way parsers behave (more exceptions, stricter > parsing, etc.), maybe this would rather go into that list. This is what I have > for now: [...] > * lxml.html (Ian?) I wouldn't expect to change anything outside lxml.html, so there's no real compatibility issue. Should I start putting stuff in there? I'd like to start by putting together the routines as I extract them from current projects, then cleaning up overlap or implementation issues. I could do it in a branch, but that would really only be to keep people from using it prematurely. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Tue May 29 05:37:13 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 28 May 2007 22:37:13 -0500 Subject: [lxml-dev] unittest.main() Message-ID: <465B9FE9.2020108@colorstudy.com> I noticed many tests have: if __name__ == '__main__': unittest.main() But this seems to ignore test_suite(), so it's not really very useful. This seems more helpful: if __name__ == '__main__': print 'to test use test.py %s' % __file__ -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Tue May 29 05:52:51 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 28 May 2007 22:52:51 -0500 Subject: [lxml-dev] matches() Message-ID: <465BA393.2020302@colorstudy.com> Using lxml trunk: doc.xpath('descendant-or-self::*[starts-with(lower-case(@href), "javascript:")]') works, but: doc.xpath('descendant-or-self::*[matches(@href, "^javascript:", "i")]') Returns ["i"]. 
This does not seem right...? -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Tue May 29 06:00:42 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 28 May 2007 23:00:42 -0500 Subject: [lxml-dev] matches() In-Reply-To: <465BA393.2020302@colorstudy.com> References: <465BA393.2020302@colorstudy.com> Message-ID: <465BA56A.3060504@colorstudy.com> Ian Bicking wrote: > Using lxml trunk: > doc.xpath('descendant-or-self::*[starts-with(lower-case(@href), > "javascript:")]') Well, maybe this one doesn't work either (returns 1/0). Now I'm just confused. > works, but: > > doc.xpath('descendant-or-self::*[matches(@href, "^javascript:", "i")]') > > Returns ["i"]. This does not seem right...? -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From philipp at weitershausen.de Tue May 29 05:59:59 2007 From: philipp at weitershausen.de (Philipp von Weitershausen) Date: Tue, 29 May 2007 05:59:59 +0200 Subject: [lxml-dev] unittest.main() In-Reply-To: <465B9FE9.2020108@colorstudy.com> References: <465B9FE9.2020108@colorstudy.com> Message-ID: <465BA53F.4000806@weitershausen.de> Ian Bicking wrote: > I noticed many tests have: > > if __name__ == '__main__': > unittest.main() > > But this seems to ignore test_suite(), so it's not really very useful. You can do:: if __name__ == '__main__': unittest.main(defaultTest='test_suite') which seems to work. 
> This seems more helpful: > > if __name__ == '__main__': > print 'to test use test.py %s' % __file__ +1 Not-really-having-to-say-anything-in-lxml-development-ly Philipp -- http://worldcookery.com -- Professional Zope documentation and training From sidnei at enfoldsystems.com Tue May 29 06:02:27 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 29 May 2007 01:02:27 -0300 Subject: [lxml-dev] unittest.main() In-Reply-To: <465B9FE9.2020108@colorstudy.com> References: <465B9FE9.2020108@colorstudy.com> Message-ID: On 5/29/07, Ian Bicking wrote: > I noticed many tests have: > > if __name__ == '__main__': > unittest.main() > > But this seems to ignore test_suite(), so it's not really very useful. > This seems more helpful: > > if __name__ == '__main__': > print 'to test use test.py %s' % __file__ I think you can also do: unittest.main(defaultTest='test_suite') or something very similar to that. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From ianb at colorstudy.com Tue May 29 09:18:58 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 29 May 2007 02:18:58 -0500 Subject: [lxml-dev] matches() In-Reply-To: <465BA56A.3060504@colorstudy.com> References: <465BA393.2020302@colorstudy.com> <465BA56A.3060504@colorstudy.com> Message-ID: <465BD3E2.3090500@colorstudy.com> Ian Bicking wrote: > Ian Bicking wrote: >> Using lxml trunk: >> doc.xpath('descendant-or-self::*[starts-with(lower-case(@href), >> "javascript:")]') > > Well, maybe this one doesn't work either (returns 1/0). Now I'm just > confused. Adding to this, I'm trying to do the rel matching with: etree.XPath("descendant-or-self::a[fn:lower-case(@rel)=$rel]") I *have* to use fn:lower-case, not just lower-case, otherwise I get XPathEvalError: Unregistered function. And it doesn't matter if I use it or not, it doesn't effect the outcome at all. Similarly upper-case doesn't change anything. 
I also tried using XPath(r'...[translate(@class, "\n\t\r", " ")]') and
that didn't work. The \n etc doesn't seem to be interpreted; only if I
include the actual characters does it work. (I then noticed
normalize-space, which is better, but it still seems odd.)

How literals are supposed to work in XPath is rather unclear to me, I
guess \ isn't an escape character? The spec says use 'something'' if
you want to include a literal ' in a string. Which I assume in an XML
attribute you'd have to do as 'something&apos;', since it probably gets
double-unescaped? Bah.

--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
            | Write code, do good | http://topp.openplans.org/careers

From stefan_ml at behnel.de Tue May 29 10:59:50 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 29 May 2007 10:59:50 +0200
Subject: [lxml-dev] matches()
In-Reply-To: <465BA393.2020302@colorstudy.com>
References: <465BA393.2020302@colorstudy.com>
Message-ID: <465BEB86.7030506@behnel.de>

Hi,

Ian Bicking wrote:
> doc.xpath('descendant-or-self::*[matches(@href, "^javascript:", "i")]')
>
> Returns ["i"]. This does not seem right...?

You're not calling the right function. The EXSLT functions are in the EXSLT
namespaces, and the regular expression test there is called "test", so you
have to do something like

    xpath('regexp:test(., "^huhu", "i")',
          {'regexp':'http://exslt.org/regular-expressions'})

Stefan

From jholg at gmx.de Tue May 29 11:52:44 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 29 May 2007 11:52:44 +0200
Subject: [lxml-dev] lxml 1.3 - or 2.0?
In-Reply-To: <46598B82.1020908@behnel.de>
References: <46598B82.1020908@behnel.de>
Message-ID: <20070529095244.80180@gmx.net>

Hi,

I for one would love to have the latest & greatest features in
a release sooner rather than later; as usual for me that's anything
objectify-related :-)
But then again I do not really know if it means a lot of extra work to
get out a release, and if the wait for 2.0 would actually be long at
all.

Tendency-wise I'm pro 1.3.
Holger
--
The GMX SmartSurfer helps you save up to 70% of your online costs!
Ideal for modem and ISDN: http://www.gmx.net/de/go/smartsurfer

From cz at gocept.com Tue May 29 14:34:58 2007
From: cz at gocept.com (Christian Zagrodnick)
Date: Tue, 29 May 2007 14:34:58 +0200
Subject: [lxml-dev] lxml 1.3 - or 2.0?
References: <46598B82.1020908@behnel.de> <20070529095244.80180@gmx.net>
Message-ID:

On 2007-05-29 11:52:44 +0200, jholg at gmx.de said:

> Hi,
>
> I for one would love to have the latest & greatest features in
> a release sooner rather than later; as usual for me that's anything
> objectify-related :-)
> But then again I do not really know if it means a lot of extra-work to
> get out a release, and if the wait for 2.0 would actually be long at
> all.
>
> Tendency-wise I'm pro 1.3.

Yeah. Release a 1.3 final and start with 2.0. I don't see any problems
with the current 1.3 beta. So why not make it the 1.3 final then?

--
Christian Zagrodnick

gocept gmbh & co. kg · forsterstrasse 29 · 06112 halle/saale
www.gocept.com · fon. +49 345 12298894 · fax. +49 345 12298891

From ianb at colorstudy.com Tue May 29 17:37:04 2007
From: ianb at colorstudy.com (Ian Bicking)
Date: Tue, 29 May 2007 10:37:04 -0500
Subject: [lxml-dev] html branch
Message-ID: <465C48A0.8030808@colorstudy.com>

I've started a branch with lxml.html, in
http://codespeak.net/svn/lxml/branch/html

It currently includes:

lxml.doctestcompare: XML/HTML doctests
lxml.usedoctest: enable the doctest from within a doctest
lxml.html.usedoctest: enable the doctest, using the HTML parser
lxml.html:
  * lxml.html.HtmlMixin, defining on each element:
    - remove_element: element removes itself from a tree
    - remove_tag: element removes itself but not its children from a tree
    - find_rel_links: find
    - find_class: find <* class="?">
  * HTML: parser
  * parse_elements: parse fragment, return list of elements
  * parse_element: parse fragment, return single element
  * Element: apparently a highly broken element factory (segfaults?!)
* tostring: HTML serialization lxml.defs: lists of HTML tags (e.g., block_tags) lxml.clean: clean Javascript and other problem code from HTML lxml.rewritelinks: change the links in a document lxml.htmldiff: make human-readable diffs and blame reports The usedoctest modules are based on a really horrible hack. It seems to work, except for some reason lxml/html/tests/test_clean.txt is sometimes run without the doctest change. The other doctests aren't run like this, and when you explicitly run the test (e.g., python test.py test_clean) it runs fine. So something weird with the test runner, I guess. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From stefan_ml at behnel.de Tue May 29 21:46:41 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 29 May 2007 21:46:41 +0200 Subject: [lxml-dev] lxml 1.3 - or 2.0? In-Reply-To: References: <46598B82.1020908@behnel.de> <20070529095244.80180@gmx.net> Message-ID: <465C8321.3060206@behnel.de> Christian Zagrodnick wrote: > On 2007-05-29 11:52:44 +0200, jholg at gmx.de said: > >> Hi, >> >> I for one would love to have the latest & greatest features in >> a release sooner rather than later; as usual for me that's anything >> objectify-related :-) >> But then again I do not really know if it means a lot of extra-work to >> get out a release, and if the wait for 2.0 would actually be long at >> all. >> >> Tendency-wise I'm pro 1.3. > > Yeah. Release a 1.3 final and start with 2.0. I don't see any problems > with the current 1.3 beta. So why not making it the 1.3 final then. Ok, I'll see what I can come up with. There may still be a couple of bug fixes I'll have to merge. So don't expect 1.3 too soon, either. 
Stefan

From stefan_ml at behnel.de Tue May 29 22:47:42 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 29 May 2007 22:47:42 +0200
Subject: [lxml-dev] matches()
In-Reply-To: <465BD3E2.3090500@colorstudy.com>
References: <465BA393.2020302@colorstudy.com> <465BA56A.3060504@colorstudy.com>
 <465BD3E2.3090500@colorstudy.com>
Message-ID: <465C916E.1010406@behnel.de>

Hi Ian,

Ian Bicking wrote:
>>> doc.xpath('descendant-or-self::*[starts-with(lower-case(@href),
>>> "javascript:")]')
>> Well, maybe this one doesn't work either (returns 1/0). Now I'm just
>> confused.
>
> Adding to this, I'm trying to do the rel matching with:
>
> etree.XPath("descendant-or-self::a[fn:lower-case(@rel)=$rel]")

IIRC, "lower-case()" is XPath 2.0. libxml2 supports XPath 1.0 only, so there
just is no such function. It's easy to implement that in Python, though:

    from lxml import etree

    # extension function: first argument is the XPath evaluation context
    def make_lower_case(ctxt, s):
        return s.lower()

    etree.FunctionNamespace("myNs")["lower-case"] = make_lower_case

    find = etree.XPath(
        "descendant-or-self::a[fn:lower-case(string(@rel))=$rel]",
        namespaces={'fn':'myNs'})

(Note the call to "string(...)" to make sure we get a string value here, not a
node set.)

BTW, I get a reproducible crash with the above under libxml2 2.6.27, but it
works with 2.6.28. Sigh...

> I also tried using XPath(r'...[translate(@class, "\n\t\r", " ")]') and
> that didn't work. The \n etc doesn't seem to be interpreted; only if I
> include the actual characters does it work. (I then noticed
> normalize-space, which is better, but it still seems odd.)

Hmm, you didn't try without the r'', did you?

    XPath('...[translate(@class, "\n\t\r", " ")]')

That should work as it leaves it to Python to handle the char escapes.
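
To make the difference concrete without involving any XPath engine, here is a plain-Python sketch (Python 3). XPath 1.0 string literals have no backslash escapes, so whatever characters Python puts into the expression string are what the XPath parser sees. The names RAW, COOKED and whitespace_chars are made up for this illustration.

```python
# With the r prefix, Python leaves the backslashes alone, so the expression
# contains a literal backslash followed by the letter n, and so on.
RAW = r'translate(@class, "\n\t\r", " ")'

# Without it, Python replaces the escapes with real newline, tab and
# carriage-return characters before XPath ever sees the string.
COOKED = 'translate(@class, "\n\t\r", " ")'


def whitespace_chars(expr):
    """Return the set of real newline/tab/CR characters contained in expr."""
    return {ch for ch in expr if ch in "\n\t\r"}
```

Only COOKED actually carries the three whitespace characters into the expression; RAW carries backslashes and letters, which translate() would then try to map.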
Stefan

From stefan_ml at behnel.de Tue May 29 23:41:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 29 May 2007 23:41:22 +0200
Subject: [lxml-dev] html branch
In-Reply-To: <465C48A0.8030808@colorstudy.com>
References: <465C48A0.8030808@colorstudy.com>
Message-ID: <465C9E02.6030308@behnel.de>

Hi Ian,

Ian Bicking wrote:
> I've started a branch with lxml.html, in
> http://codespeak.net/svn/lxml/branch/html

Sure, cool.

> lxml.doctestcompare: XML/HTML doctests

As people would rarely import this, why not have it start with an underscore?

> lxml.usedoctest: enable the doctest from within a doctest
> lxml.html.usedoctest: enable the doctest, using the HTML parser

Good idea. That way it automatically gets the same 'interface'.

I'm not sure about the "use...", though. It needs to read well with "import":

    from lxml import usedoctest

Too many verbs IMHO (but as long as I can't come up with a better name, I'll
just leave it as is :)

> lxml.html:
>   * lxml.html.HtmlMixin, defining on each element:
>     - remove_element: element removes itself from a tree
>     - remove_tag: element removes itself but not its children from a tree

remove() already exists and removes the element you pass (not the element you
call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
would be to go through the parent:

    def cut_out_tree(self, element):
        if element.tail:
            previous = element.getprevious()
            previous.tail = (previous.tail or '') + element.tail
        self.remove(element)

    def cut_out_element(self, element):
        pos = self.index(element)
        if element.text:
            self.text = (self.text or '') + element.text
        self.cut_out_tree(element)
        self[pos:pos] = element[:]

>   * HTML: parser
>   * parse_elements: parse fragment, return list of elements
>   * parse_element: parse fragment, return single element

I'll look into those, but they look ok at first glance.

>   * Element: apparently a highly broken element factory (segfaults?!)

Yup, that won't work that way. Element classes cannot be instantiated on
their own.
Element classes cannot be instantiated on their own. But you can do Element = html_parser.makeelement > * tostring: HTML serialization Based on XSLT, as I've seen before. Sure, why not. > lxml.[html.]defs: lists of HTML tags (e.g., block_tags) Ok. > lxml.[html.]clean: clean Javascript and other problem code from HTML That rather looks like an HtmlElement method to me: "cleanup(...)", and the clean_html() function would fit right into the top-level of the lxml.html module. > lxml.[html.]rewritelinks: change the links in a document Maybe too special and too long for integration into the lxml.html and HtmlElement, not sure. Some of this might fit, though. > lxml.[html.]htmldiff: make human-readable diffs and blame reports > > The usedoctest modules are based on a really horrible hack. It seems to > work, except for some reason lxml/html/tests/test_clean.txt is sometimes > run without the doctest change. The other doctests aren't run like > this, and when you explicitly run the test (e.g., python test.py > test_clean) it runs fine. So something weird with the test runner, I guess. I'll take a look at these later. Stefan From ianb at colorstudy.com Wed May 30 00:10:13 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 29 May 2007 17:10:13 -0500 Subject: [lxml-dev] html branch In-Reply-To: <465C9E02.6030308@behnel.de> References: <465C48A0.8030808@colorstudy.com> <465C9E02.6030308@behnel.de> Message-ID: <465CA4C5.4050308@colorstudy.com> Stefan Behnel wrote: >> lxml.doctestcompare: XML/HTML doctests > > As people would rarely import this, why not have it start with an underscore? I guess... the usedoctest technique is a pretty egregious hack; I actually change doctest.OutputChecker.check_output.im_func.func_code because there's a local bound method that has to be changed. So the more conventional installation method still seems good, if there are interaction bugs. 
>> lxml.usedoctest: enable the doctest from within a doctest
>> lxml.html.usedoctest: enable the doctest, using the HTML parser
>
> Good idea. That way it automatically gets the same 'interface'.
>
> I'm not sure about the "use...", though. It needs to read well with "import":
>
>     from lxml import usedoctest
>
> Too many verbs IMHO (but as long as I can't come up with a better name, I'll
> just leave it as is :)

I feel like there needs to be a verb in the name, since the import does stuff.
The module itself is useless.

>> lxml.html:
>> * lxml.html.HtmlMixin, defining on each element:
>> - remove_element: element removes itself from a tree
>> - remove_tag: element removes itself but not its children from a tree
>
> remove() already exists and removes the element you pass (not the element you
> call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
> would be to go through the parent:
>
>     def cut_out_tree(self, element):
>         if element.tail:
>             previous = element.getprevious()
>             previous.tail = (previous.tail or '') + element.tail
>         self.remove(element)
>
>     def cut_out_element(self, element):
>         pos = self.index(element)
>         if element.text:
>             self.text = (self.text or '') + element.text
>         self.cut_out_tree(element)
>         self[pos:pos] = element[:]

I am a little reluctant to add self-delete methods in general in Python, but
with this technique I would *always* do el.getparent().cut_out_tree(el). I
pretty much always find an element then get rid of it. Doing it from the
parent is consistent but inconvenient.

I agree the remove names are ambiguous -- both how they relate to each other,
and that they seem similar to remove().

>> * Element: apparently a highly broken element factory (segfaults?!)
>
> Yup, that won't work that way. Element classes cannot be instantiated on their
> own. But you can do
>
>     Element = html_parser.makeelement

OK. What's the distinction between Element and SubElement?
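
The parent-based removal discussed above can be sketched roughly as follows.
This is only an illustration, not code from the html branch. It assumes the
standard library's xml.etree.ElementTree so that it runs without lxml (the
calls used here should carry over to lxml.etree, which implements the same
API), and _append_text is a hypothetical helper name. Unlike the quoted
snippet, it also covers an element that has no previous sibling, and it keeps
the removed tag's text and tail in document order around its children.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, pos, text):
    """Append text directly in front of child slot `pos` of `parent`.

    Hypothetical helper: the text lands in the tail of the child before
    that slot, or in parent.text when the slot is the first one.
    """
    if not text:
        return
    if pos > 0:
        previous = parent[pos - 1]
        previous.tail = (previous.tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def cut_out_tree(parent, element):
    """Remove `element` and its whole subtree, keeping its tail text."""
    pos = list(parent).index(element)
    parent.remove(element)
    _append_text(parent, pos, element.tail)


def cut_out_element(parent, element):
    """Remove only the tag: its text and children move up into `parent`."""
    pos = list(parent).index(element)
    children = list(element)
    parent.remove(element)
    _append_text(parent, pos, element.text)                  # text before children
    parent[pos:pos] = children                               # children take its place
    _append_text(parent, pos + len(children), element.tail)  # tail after children


root = ET.fromstring("<p>Hello <b>big <i>bad</i> wolf</b> world<br/>!</p>")
cut_out_element(root, root.find("b"))
print("".join(root.itertext()))  # Hello big bad wolf world!
```

Calling it still goes through the parent, so the inconvenience mentioned above
remains: the caller has to look up the parent first, which plain ElementTree
does not track for you.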
>> * tostring: HTML serialization
>
> Based on XSLT, as I've seen before. Sure, why not.

Yeah; it works. I hate the removal via a regex, but not removing it bugs the
hell out of me and there's no other way I see to get rid of it. If I was more
apt to dig in libxml2 code I'm sure there's a better technique, but I'm shy
around C code.

>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>
> That rather looks like an HtmlElement method to me: "cleanup(...)", and the
> clean_html() function would fit right into the top-level of the lxml.html module.

The long signature of the function made me reluctant to do this. Any function
with that many parameters feels non-authoritative to me. And I would encourage
people to actually write their own clean function with the parameter defaults
that are appropriate for their domain (e.g., clean_untrusted_comment,
clean_wysiwyg_submission, etc). I just guessed reasonable defaults for those
keyword arguments.

>> lxml.[html.]rewritelinks: change the links in a document
>
> Maybe too special and too long for integration into the lxml.html and
> HtmlElement, not sure. Some of this might fit, though.

This I feel a little more comfortable about than the cleanup. Especially
making all links absolute is really convenient when you are doing parsing.

I'd like to do some kind of query (returning all links in the document), but
I'm not sure what that would look like. Generally *just* the link is kind of
boring. Usually the link plus the element that has the link is more
interesting. But some kinds of links don't have elements; CSS particularly.
OTOH, a method that didn't cover that particular case (even though the
rewriting did) would still be useful. Maybe it would return
[(element_with_link, attribute_where_link_is), ...]. Or it could be
(element_with_link, attribute_where_link_is, link), and for CSS that'd be (