From amax at redsymbol.net Fri May 1 19:41:19 2009
From: amax at redsymbol.net (Aaron Maxwell)
Date: Fri, 1 May 2009 10:41:19 -0700
Subject: [lxml-dev] Ingore namespace when parsing
Message-ID: <200905011041.19432.amax@redsymbol.net>

Hi all,

When using python lxml to parse an XML document whose root element
defines a namespace, is there some way the library can let me avoid
explicitly naming that namespace in queries?

Consider an XML document with this content:

{{{
}}}

If I parse it like this:

{{{
from lxml import etree

def ignore_ns(path_to_file):
    x = etree.parse(open(path_to_file))
    for kid in x.getroot():
        print(kid.tag)
}}}

... where path_to_file points to the above xml document, then this
output is produced:

{{{
{http://redsymbol.net/SomeNamespace}Child1
{http://redsymbol.net/SomeNamespace}Child2
}}}

Alternatively, I can define a namespace-string stripping function
dynamically, and apply it as needed:

{{{
def strip_out_ns(path_to_file):
    x = etree.parse(open(path_to_file))
    # URI of the default (unprefixed) namespace declared on the root
    ns = x.getroot().nsmap[None]
    def no_ns(s):
        # drop the leading '{uri}' part of a qualified tag name
        return s.split('{' + ns + '}')[-1]
    for kid in x.getroot():
        print(no_ns(kid.tag))
}}}

The output of this is simpler:

{{{
Child1
Child2
}}}

More commonly, I will want to search for a child element of some root,
using a query like

{{{
rootElement.find('Child1')
}}}

(where rootElement is an Element object). In the namespaced xml
document above, this call to .find() will return None, but

{{{
# ns found from rootElement.nsmap as above
rootElement.find('{' + ns + '}' + 'Child1')
}}}

will correctly find the child element.

In this kind of situation, where I just want to parse the document and
really don't care about the namespace, is there some way to construct
a parser that will ignore it in a more automated way? Is there a
simpler, better approach, or some insight I'm missing?

Thanks everyone in advance.
Cheers,
Aaron

From jlovell at nwesd.org Fri May 1 20:00:29 2009
From: jlovell at nwesd.org (John Lovell)
Date: Fri, 1 May 2009 11:00:29 -0700
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To: <200905011041.19432.amax@redsymbol.net>
References: <200905011041.19432.amax@redsymbol.net>
Message-ID:

Aaron:

It sounds to me like you could use an xpath query.

rootElement.xpath("//*[local-name() = 'Child1']")

http://codespeak.net/lxml/xpathxslt.html

Good luck,

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.nwesd.org
Together We Can ...

From sergio at sergiomb.no-ip.org Fri May 1 22:31:35 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Fri, 01 May 2009 21:31:35 +0100
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To: <200905011041.19432.amax@redsymbol.net>
References: <200905011041.19432.amax@redsymbol.net>
Message-ID: <1241209895.15312.1.camel@segulix>

On Fri, 2009-05-01 at 10:41 -0700, Aaron Maxwell wrote:
> {{{
> # ns found from rootElement.nsmap as above
> rootElement.find('{' + ns + '}' + 'Child1')
> }}}
>
> will correctly find the child element.
>
> In this kind of situation, where I just want to parse the document and
> really don't care about the namespace, is there some way to construct
> a parser that will ignore it in a more automated way? Is there a
> simpler, better approach, or some insight I'm missing?
>

rootElement.xpath('//Child1')

--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090501/f67b3abd/attachment-0001.bin

From amax at redsymbol.net Sat May 2 00:30:06 2009
From: amax at redsymbol.net (Aaron Maxwell)
Date: Fri, 1 May 2009 15:30:06 -0700
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To:
References: <200905011041.19432.amax@redsymbol.net>
Message-ID: <200905011530.06605.amax@redsymbol.net>

On Friday 01 May 2009 11:00:29 am John Lovell wrote:
> Aaron:
>
> It sounds to me like you could use an xpath query.
> rootElement.xpath("//*[local-name() = 'Child1']")
> http://codespeak.net/lxml/xpathxslt.html

Thanks, that does work fine.

My actual problem is somewhat more complex than the simplistic example
I gave, however. The structure of the XML document is more like this
(lots of the actual document is excised):

{{{
0521545668
(snip)
7517
(snip)
}}}

This is from Amazon's Associate Web Service API, incidentally. What's
needed is to extract the prices for the offers. So I first obtain the
offer elements - the easiest way is to use exactly the xpath expression
you mentioned:

{{{
offers = tree.xpath('//*[local-name()="Offer"]')
}}}

Then for each offer in offers, I want to get the price information,
i.e. the content of that Amount tag. This works:

{{{
def price(offer):
    return offer.xpath('*[local-name()="OfferListing"]/*[local-name()="Price"]/*[local-name()="Amount"]')[0].text
}}}

But, in a word, "yikes". There has got to be a less verbose way! I
can't skip any of those intermediate elements (there are multiple leaf
elements named Amount, for example; only the specific one above is the
actual sale price.) So something like
{{{'*[local-name()="OfferListing"]//*[local-name()="Amount"]'}}} fails
by mixing in garbage with the correct result.

(This will probably improve once I learn xpath a little better - still
in the process of mastering it.)
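
One possible way to tame the price() lookup above while still ignoring
namespaces is to generate the local-name() steps from a list of tag
names. This is only a sketch: the helper name local_path, the wrapper
element Items and the sibling element Saved are made up for
illustration, and the sample document is a guess at the shape described
in the message, not the real Amazon response.

```python
from lxml import etree


def local_path(*names):
    """Build a relative XPath that matches each step by local name only."""
    return "/".join('*[local-name()="%s"]' % name for name in names)


# Hypothetical sample; only Offer/OfferListing/Price/Amount and the
# namespace URI come from the thread, the rest is invented.
SAMPLE = b"""<Items xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
  <Offer>
    <OfferListing>
      <Price><Amount>7517</Amount></Price>
      <Saved><Amount>100</Amount></Saved>
    </OfferListing>
  </Offer>
</Items>"""


def prices(tree):
    """Return the sale price text of every Offer, whatever its namespace."""
    offers = tree.xpath('//*[local-name()="Offer"]')
    path = local_path("OfferListing", "Price", "Amount")
    return [offer.xpath(path)[0].text for offer in offers]


root = etree.fromstring(SAMPLE)
print(prices(root))
```

Because every intermediate step is still spelled out, the other Amount
element is not picked up; only the repetition of local-name() is
hidden.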

Anyway, thanks for the xpath suggestion, John - it's probably better
than the ns()/no_ns() functions in my first post. It would still be
useful if there is a way to instruct lxml.etree to somehow strip out
the namespace prefix more automatically, if anyone can suggest that.

Cheers,
Aaron

--
Aaron Maxwell
http://redsymbol.net/

From sergio at sergiomb.no-ip.org Tue May 5 23:36:16 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Tue, 05 May 2009 22:36:16 +0100
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To: <200905011530.06605.amax@redsymbol.net>
References: <200905011041.19432.amax@redsymbol.net> <200905011530.06605.amax@redsymbol.net>
Message-ID: <1241559376.8873.6.camel@segulix>

From http://codespeak.net/lxml/xpathxslt.html, a simplified example:

f = StringIO(''' Text ''')
doc = etree.parse(f, parser=hparser)
r = doc.xpath('//b:bar', namespaces={'b': 'http://codespeak.net/ns/test2'})
print len(r)
print r[0].tag
print r[0].text

and extensions: http://codespeak.net/lxml/extensions.html

I'm trying to work with namespaces as well, but the documentation is
too convoluted for me. In your example, I don't see any etc., so it is
difficult to guess.

On Fri, 2009-05-01 at 15:30 -0700, Aaron Maxwell wrote:
> My actual problem is somewhat more complex than the simplistic example I gave,
> however. The structure of the XML document is more like this (lots of the
> actual document is excised):
> {{{
> xmlns="http://webservices.amazon.com/AWSECommerceService/2008-04-07">
> 0521545668
> (snip)
> 7517
> (snip)
> }}}

--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090505/7da5d528/attachment.bin

From l at lrowe.co.uk Wed May 6 00:04:55 2009
From: l at lrowe.co.uk (Laurence Rowe)
Date: Wed, 6 May 2009 00:04:55 +0200
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To: <200905011530.06605.amax@redsymbol.net>
References: <200905011041.19432.amax@redsymbol.net> <200905011530.06605.amax@redsymbol.net>
Message-ID:

2009/5/2 Aaron Maxwell:
> Anyway, thanks for the xpath suggestion, John - it's probably better than the
> ns()/no_ns() functions in my first post. Would still be useful if there is a
> way to instruct lxml.etree to somehow strip out the namespace prefix more
> automatically, if anyone can suggest that.

You can supply a namespaces argument to the xpath method:

{{{
offers = tree.xpath('//aws:Offer',
    namespaces=dict(aws="http://webservices.amazon.com/AWSECommerceService/2008-04-07"))
}}}

See http://codespeak.net/lxml/xpathxslt.html for the details.

Laurence

From sergio at sergiomb.no-ip.org Wed May 6 00:39:33 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Tue, 05 May 2009 23:39:33 +0100
Subject: [lxml-dev] Ingore namespace when parsing
In-Reply-To:
References: <200905011041.19432.amax@redsymbol.net> <200905011530.06605.amax@redsymbol.net>
Message-ID: <1241563173.8873.11.camel@segulix>

On Wed, 2009-05-06 at 00:04 +0200, Laurence Rowe wrote:
> You can supply a namespaces argument to the xpath method:
> {{{
> offers = tree.xpath('//aws:Offer',
> namespaces=dict(aws="http://webservices.amazon.com/AWSECommerceService/2008-04-07"))
> }}}
>
> See http://codespeak.net/lxml/xpathxslt.html for the details.

or define it globally (I think that is what you want). Using the last
example:

myns1 = etree.FunctionNamespace('http://webservices.amazon.com/AWSECommerceService/2008-04-07')
myns1.prefix = "aws"
offers = tree.xpath('//aws:Offer')

--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090505/e37c746a/attachment.bin

From lei at ipac.caltech.edu Thu May 7 02:27:10 2009
From: lei at ipac.caltech.edu (Mary Lei)
Date: Wed, 06 May 2009 17:27:10 -0700
Subject: [lxml-dev] how to get line,col position
Message-ID: <4A022ADE.2080002@ipac.caltech.edu>

How can I get dtd.validate to return the line and column number of the
xhtml in error?

Here is my code to validate an xhtml doc against the dtd using lxml:

from lxml import etree

# no need to write a temp html file
CoRotHomeFile = open('CoRoTHome.html', 'r')
contents = CoRotHomeFile.read()
CoRotHomeFile.close()

# (the ent files are present)
dtd1 = etree.DTD(file='xhtml1-transitional.dtd')
etree.clear_error_log()
root1 = etree.HTML(contents)
try:
    rc = dtd1.validate(root1)
except (etree.DTDValidateError, etree.DTDError) as e:
    print("e %s" % e)
print("dtd errors")
count = len(dtd1.error_log)   # do not shadow the builtin len()
error = dtd1.error_log[0]
print("line %s" % error.line)
print("column %s" % error.column)
print(dtd1.error_log)

With lxml, I get the dtd errors without any position info:

/home/lei/lxml-2.2/src:.:/home/lei/python-stuff/BeautifulSoup-3.1.0.1
dtd errors
line 0
column 0
:0:0:ERROR:VALID:DTD_STANDALONE_WHITE_SPACE: standalone: tr declared in the external subset contains white spaces nodes
:0:0:ERROR:VALID:DTD_STANDALONE_WHITE_SPACE: standalone: tr declared in the external subset contains white spaces nodes
:0:0:ERROR:VALID:DTD_STANDALONE_WHITE_SPACE: standalone: tr declared in the external subset contains white spaces nodes
:0:0:ERROR:VALID:DTD_STANDALONE_WHITE_SPACE: standalone: tr declared in the external subset contains white spaces nodes

But if I apply xmllint, it gives the same messages but with positional
info:

/home/lei/python-stuff/CoRoTHome.html:83: HTML parser error : htmlParseStartTag: invalid element name
dedicated to asteroseismology of bright stars (typically V<10mag) and
^
/home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error : standalone: tr declared in the external subset contains white spaces nodes
/home/lei/python-stuff/CoRoTHome.html:93: element tr: validity error : standalone: tr declared in the external subset contains white spaces nodes
/home/lei/python-stuff/CoRoTHome.html:99: element tr: validity error : standalone: tr declared in the external subset contains white spaces nodes
...
Document /home/lei/python-stuff/CoRoTHome.html does not validate against xhtml1-transitional.dtd

Thanks.

--
Mary Lei
Software Testing
IPAC-NExScl
Rm: KS-233 MS: 220-6
Phone: 395-1998

From stefan_ml at behnel.de Thu May 7 07:17:23 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 07 May 2009 07:17:23 +0200
Subject: [lxml-dev] how to get line,col position
In-Reply-To: <4A022ADE.2080002@ipac.caltech.edu>
References: <4A022ADE.2080002@ipac.caltech.edu>
Message-ID: <4A026EE3.2010806@behnel.de>

Hi,

Mary Lei wrote:
> How can I get dtd.validate to return the
> line, column number for the xhtml in error?

You can't if you use the HTML parser, that's a known bug in libxml2:

http://bugzilla.gnome.org/show_bug.cgi?id=580705

Note that this bug has a patch associated to it, which you can apply to
libxml2 to get what you want.

Otherwise, for parsing XHTML you should use the XML parser anyway, which
will track line numbers correctly.
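
As a rough illustration of the XML-parser route, the sketch below
parses a small document with etree.fromstring() and validates it
against an in-memory DTD. The three-element DTD and the sample table
are stand-ins invented for this example (the thread uses
xhtml1-transitional.dtd), so treat it as an assumption-laden sketch
rather than a drop-in fix.

```python
from io import StringIO

from lxml import etree

# Tiny stand-in DTD; an assumption, not the XHTML DTD from the thread.
dtd = etree.DTD(StringIO(
    "<!ELEMENT table (tr+)>\n"
    "<!ELEMENT tr (td+)>\n"
    "<!ELEMENT td (#PCDATA)>"
))

doc = """<table>
<tr><td>ok</td></tr>
<tr>stray text</tr>
</table>"""

# XML parser, not etree.HTML(): elements remember their source line.
root = etree.fromstring(doc)
bad_row = root[1]
print(bad_row.sourceline)

# validate() returns a boolean; details end up in the error log.
is_valid = dtd.validate(root)
if not is_valid:
    for error in dtd.error_log.filter_from_errors():
        print(error.line, error.column, error.message)
```

The sourceline attribute is available on elements regardless of
validation, which can help when an error message needs to be mapped
back to the input by hand.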

> But if I apply xmllint, it gives the same messages but with positional info:
> /home/lei/python-stuff/CoRoTHome.html:83: HTML parser error :
> htmlParseStartTag: invalid element name
> dedicated to asteroseismology of bright stars (typically V<10mag) and
> ^
> /home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error :
> standalone: tr declared in the external subset contains white spaces nodes
> ...
> Document /home/lei/python-stuff/CoRoTHome.html does not validate against
> xhtml1-transitional.dtd

You didn't say if you used the HTML parser or the XML parser in xmllint.
In any case, xmllint does the DTD validation at parse time, where the
line information is still available. It only gets lost when building
the tree, so that running a validator on the tree cannot report line
numbers anymore.

lxml.etree does not currently support parse-time validation against a
user-provided DTD (i.e. one that is not referenced by the document
itself). Might be worth a bug report.

Stefan

From lei at ipac.caltech.edu Thu May 7 19:28:59 2009
From: lei at ipac.caltech.edu (Mary Lei)
Date: Thu, 07 May 2009 10:28:59 -0700
Subject: [lxml-dev] how to get line,col position
In-Reply-To: <4A026EE3.2010806@behnel.de>
References: <4A022ADE.2080002@ipac.caltech.edu> <4A026EE3.2010806@behnel.de>
Message-ID: <4A031A5B.5070607@ipac.caltech.edu>

My responses are below:

Stefan Behnel wrote:
> You can't if you use the HTML parser, that's a known bug in libxml2:
>
> http://bugzilla.gnome.org/show_bug.cgi?id=580705
>
> Note that this bug has a patch associated to it, which you can apply to
> libxml2 to get what you want.

Where can I locate this patch?

> Otherwise, for parsing XHTML you should use the XML parser anyway, which
> will track line numbers correctly.

Using the XML parser results in an error when it tries to load the dtd
from the network:

lxml.etree.XMLSyntaxError: Attempt to load network entity
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

If I turn off the option, I don't get anything from the parser. I
don't really want to load the dtd each time, so I downloaded a copy
with the entities and decided to use etree.dtd.validate to validate
against it instead. But as mentioned, this does not give the line,col
info.

If I use the XMLParser, I have an issue with

lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 24, column 13

but my xhtml has a DTD specified. I have looked these issues up on the
web, but it is not clear to me how to fix this.

> You didn't say if you used the HTML parser or the XML parser in xmllint. In
> any case, xmllint does the DTD validation at parse time, where the line
> information is still available. It only gets lost when building the tree,
> so that running a validator on the tree cannot report line numbers anymore.
>
> lxml.etree does not currently support parse-time validation against a
> user-provided DTD (i.e. one that is not referenced by the document itself).
> Might be worth a bug report.

My xmllint command is

xmllint --dtdvalid xhtml1-transitional.dtd --noout /home/lei/python-stuff/CoRoTHome.html --recover --html

Thanks.

--
Mary Lei
Software Testing
IPAC-NExScl
Rm: KS-233 MS: 220-6
Phone: 395-1998

From stefan_ml at behnel.de Thu May 7 20:54:25 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 07 May 2009 20:54:25 +0200
Subject: [lxml-dev] how to get line,col position
In-Reply-To: <4A031A5B.5070607@ipac.caltech.edu>
References: <4A022ADE.2080002@ipac.caltech.edu> <4A026EE3.2010806@behnel.de> <4A031A5B.5070607@ipac.caltech.edu>
Message-ID: <4A032E61.1090207@behnel.de>

Hi,

Mary Lei wrote:
> Stefan Behnel wrote:
>> http://bugzilla.gnome.org/show_bug.cgi?id=580705
>>
>> Note that this bug has a patch associated to it, which you can apply to
>> libxml2 to get what you want.
> Where can I locate this patch?

You click on the link above, and on that page, click on the link where
it says "patch here".

>> Otherwise, for parsing XHTML you should use the XML parser anyway, which
>> will track line numbers correctly.
> Using the XML parser results in an error when it tries to load the dtd
> from the network:
> lxml.etree.XMLSyntaxError: Attempt to load network entity
> http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

You can enable network access with the "no_network=False" option.
However, it's better to use a catalog to let libxml2 look up the DTD
locally.

http://www.xmlsoft.org/catalog.html

> I don't really want to load the dtd each time, so I downloaded a copy
> with the entities and decided to use etree.dtd.validate to validate
> against it instead. But as mentioned, this does not give the line,col
> info.
>
> If I use the XMLParser, I have an issue with
> lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 24, column 13
>
> but my xhtml has a DTD specified.

Did you load the DTD when parsing the file? This is not done
automatically. See the XMLParser documentation, it has a "load_dtd"
option IIRC.
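
A minimal sketch of the load_dtd suggestion, assuming the DTD sits next
to the document on disk. The file names mini.dtd and page.xhtml and the
three-line DTD are invented stand-ins for a locally saved
xhtml1-transitional.dtd; load_dtd and no_network are the XMLParser
options mentioned above.

```python
import os
import tempfile

from lxml import etree

workdir = tempfile.mkdtemp()

# Stand-in for a local copy of the real DTD (assumption for this sketch).
with open(os.path.join(workdir, "mini.dtd"), "w") as f:
    f.write('<!ELEMENT html (p)>\n'
            '<!ELEMENT p (#PCDATA)>\n'
            '<!ENTITY nbsp "&#160;">\n')

# The DOCTYPE points at the local file, so no network access is needed.
page = os.path.join(workdir, "page.xhtml")
with open(page, "w") as f:
    f.write('<!DOCTYPE html SYSTEM "mini.dtd">\n'
            '<html><p>a&nbsp;b</p></html>\n')

# load_dtd=True makes the parser read the DTD and learn its entities;
# no_network=True keeps it from fetching anything remotely.
parser = etree.XMLParser(load_dtd=True, no_network=True)
tree = etree.parse(page, parser)
text = tree.getroot()[0].text
print(repr(text))
```

For a document whose DOCTYPE names the w3.org URL, the catalog
mechanism mentioned above would be the way to redirect that URL to a
local file; that part is not shown here.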

Stefan

From stefan_ml at behnel.de Sat May 9 11:30:08 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 09 May 2009 11:30:08 +0200
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <1241122046.5549.19.camel@atman.artefact.org.nz>
References: <1240994671.8989.9.camel@atman.artefact.org.nz> <1241065853.5570.4.camel@atman.artefact.org.nz> <1241122046.5549.19.camel@atman.artefact.org.nz>
Message-ID: <4A054D20.5000502@behnel.de>

Hi,

Jamie Norrish wrote:
> On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote:
>> It would be rarely used, I'd say. What sort of interesting XPath queries
>> could you possibly do on a node that doesn't have any children, nor
>> attributes, nor a tag name or namespace.
>
> Besides selecting other nodes and values relative to the text? Yes, it
> is possible to use text_result.getparent() and proceed from there - but
> this has the downside of requiring, for some XPath expressions, the code
> to modify the expression based on whether text_result was the text or
> tail of its parent, which is annoying.

Ok, I do see your use case, although I still don't know what your
selections look like in practice. If you want a more predictable XPath
result, maybe it would make sense to select the surrounding element
instead of the plain text content.

As I said, lxml.etree does not have a representation for text nodes. So
by adding an xpath() method to text results, you'd end up with a rather
fragile setup that might crash when you replace the text of a node, just
because an XPath text result is still holding a reference to a now-dead
text node, for example. So it's not just adding a method, it's more like
rethinking concepts inside lxml.etree. I'm pretty sure this use case is
not worth going there - especially since it's nothing that can't be done
today, but rather an inconvenience.

>> Also, XPath queries can return Elements and (special) strings, but
>> also plain numbers and boolean values.
>> So you'd still not have a common interface for all possible result types.
>
> Well, I'm not really asking for a common interface - only that XPath be
> enabled for the results of an XPath expression for text(). This would
> bring it into line with XSLT behaviour, for one.

Well, XSLT is a different language with a different tree model.

> About using iterwalk: this wouldn't seem (on a quick perusal of the
> documentation) to easily allow for me to get the preceding context of
> the text result, unless I picked some arbitrary earlier element as the
> starting point. What am I missing?

I guess I misjudged your use case when you first described it.
iterwalk() will not allow you to access the text context preceding an
element, only the text content of the element itself.

I still do not have a clear idea of what you actually consider "text
context". Does that take the tree structure into account (e.g. only
within a certain parent element), or is it just any text content that
precedes the XPath result in reverse document order, wherever it occurs
in the tree?

What about just stepping up parent by parent until the contained text
content is long enough? Or, if it's too long, split it by the substring
that XPath found, and strip the left and right part...

Stefan

From qhlonline at 163.com Tue May 12 05:56:23 2009
From: qhlonline at 163.com (qhlonline)
Date: Tue, 12 May 2009 11:56:23 +0800 (CST)
Subject: [lxml-dev] Ask for help about lxml usage
Message-ID: <24110436.658541242100583875.JavaMail.coremail@bj163app69.163.com>

Hi all,

I am an experimental lxml user. From http://codespeak.net/lxml/FAQ.html
I understand that lxml supports multi-threaded parsing without the GIL.
I tried writing a multi-threaded parsing program and ran it on an
eight-core machine, but the total CPU usage was only 180%, i.e. only
about 20% of each core was used. When I tried libxml2 directly, it ran
much faster, and more than 50% of each CPU core was used.

My goal is to parse an HTML file on disk to get specific HTML tags and
their related data, like attributes and text. I will not use DOM tree
creation, update, deletion, or XPath operations. How can I make my
HTML parsing program run faster? I have used the target (SAX-like)
parser to parse an HTML file, but the speed is not good enough.
iterparse can't parse the HTML file either (I have set the "html=True"
parameter): the parser said my HTML file had a misplaced DOCTYPE
declaration, but this web page was fetched from a popular website and
really does follow the HTML specification.

Now there are more HTML files to process, so I want to speed up the
parsing with multiple threads. My questions are: does lxml release the
GIL completely when parsing from memory and from disk files? How can
my multi-threaded program run faster on a multi-core machine? Can I
change the lxml source to skip some unwanted operations to speed up my
program?

Thanks a lot

Yours sincerely
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090512/c8912f8e/attachment.htm

From richardlewis at fastmail.co.uk Tue May 12 15:38:22 2009
From: richardlewis at fastmail.co.uk (Richard Lewis)
Date: Tue, 12 May 2009 14:38:22 +0100
Subject: [lxml-dev] ElementTree.write default_namespace argument
Message-ID: <87k54mbfw1.wl%richard.lewis@gold.ac.uk>

Hi there,

I have lxml version 2.1.5, Python 2.5 both installed from Debian
packages. I'd like to be able to use the default_namespace argument to
ElementTree's write method. However, I get the following error:

Traceback (most recent call last):
  File "./join-html-pages.py", line 62, in <module>
    main()
  File "./join-html-pages.py", line 59, in main
    document.write(file(new_filename, 'w'), encoding='utf-8', method='html', default_namespace='html')
  File "lxml.etree.pyx", line 1576, in lxml.etree._ElementTree.write (src/lxml/lxml.etree.c:16699)
TypeError: write() got an unexpected keyword argument 'default_namespace'

Is this expected behaviour?

My application is extracting nodes from numerous XHTML documents and
merging them into one HTML document. The bit that's going wrong is
writing the resultant HTML to a file. I'm not really interested in it
being correct XHTML or in using the correct namespace. So any
alternative solutions would also be of interest.

Cheers,
Richard

From stefan_ml at behnel.de Tue May 12 16:24:48 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 12 May 2009 16:24:48 +0200 (CEST)
Subject: [lxml-dev] ElementTree.write default_namespace argument
In-Reply-To: <87k54mbfw1.wl%richard.lewis@gold.ac.uk>
References: <87k54mbfw1.wl%richard.lewis@gold.ac.uk>
Message-ID: <2a1496319ff7b802a12157de5846a3f0.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

Richard Lewis wrote:
> I have lxml version 2.1.5, Python 2.5 both installed from Debian
> packages. I'd like to be able to use the default_namespace argument to
> ElementTree's write method.

Yep, that's not supported. It's not easy to add, and it would actually
make more sense to change the prefix in the tree rather than on
serialisation.

> document.write(file(new_filename, 'w'), encoding='utf-8',
> method='html', default_namespace='html')

Note that 'html' is not a valid namespace URI.

> My application is extracting nodes from numerous XHTML documents and
> merging them into one HTML document.

In which case there shouldn't be any namespace at all.

> The bit that's going wrong is
> writing the resultant HTML to a file.

Could you elaborate on the "going wrong" bit?
What is the result (of what kind of operation) and what did you expect instead? Note that lxml.html has an "xhtml_to_html()" function, maybe that helps already. Stefan From stefan_ml at behnel.de Tue May 12 10:14:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 12 May 2009 10:14:27 +0200 Subject: [lxml-dev] Ask for help about lxml usage In-Reply-To: <24110436.658541242100583875.JavaMail.coremail@bj163app69.163.com> References: <24110436.658541242100583875.JavaMail.coremail@bj163app69.163.com> Message-ID: <4A092FE3.8040400@behnel.de> Hi, qhlonline wrote: > Hi, all I am a lxml experimental user. from site > http://codespeak.net/lxml/FAQ.html I know that python can support > multithread parsing without GIL. I have tried to write multi-thread > parsing program to run on a eight-core CPU computer, but the total CPU > used was only 180%, only about 20% of each core had been used. I never tested lxml on a machine with more than two cores myself (simply because I don't have one). But it always depends on your code how much parallelism you get. The figures above let me assume that less than 20% of the work is spent in parsing. The comment below are rather generic based on the information you provided. If you want better suggestions, it would help to see a bit of your code to understand which parts of lxml's API you are using, and how. > but when I > tried the libxml2 directly, It run much faster, and more then 50% of > each CPU core were used. That's because libxml2 (I assume you are referring to the Python bindings?) does a lot less on each call, so if it frees the GIL, I doubt that there is any case where it has to reclaim it during the parsing. Depending on your code, there may be a reason lxml has to. Also, if you use lxml.html instead of lxml.etree, you add another Python (i.e. GIL locked) layer on top of that. > My goal is to parse a HTML file on a disk to > get special HTML tags and their relative data, like attributes and > texts. 
I will not use DOM tree creation, renew, delete, or XPath > operations. then how can my HTML-Parsing program run faster? I have used > the Target SAX parser to parse a HTML file ,but the speed is not good > enough. That's because freeing the GIL in the target parser (I don't remember if that's actually done or not) would induce too much overhead (one GIL acquire-release cycle per element or text node!) and actually hurt the performance. > the Iterparse can't parse HTML file eigher (I have set the > "html=True" parameter) Parsing HTML with iterparse() should work (so I'm interested in an example that fails). But don't expect too much parallelism there. iterparse() is not made for highly parallel parsing, as it calls into Python a lot. It does not even free the GIL during parsing, only during file access. > the parser said my HTML file had misplaced the > DOCTYPE declaration, but this web page is caught from a popular website > and is truly subject to the HTML protocal. I assume you validated it against the W3C validator? I'm not sure, but at a quick glance at the code, it may be possible that iterparse() doesn't set the "recover" option for HTML, so your results with broken HTML may not be as expected. > Now there are more HTML files > to process, so now I wan't to speed up the parsing by multi-thread > process. My question are whether the LXML had freed GIL completely on > memery and disk file prasing? It does that, yes. But note that you need to pass a filename or URL to get maximum parallelism, not a file(-like) object, as that is read in Python space. > how can my multi-thread program run faster on multi-core CPU computer? The usual answer in Python is: use multiple processes instead of multiple threads. If you have many files that are treated independently anyway, using a process pool of eight processes should really get you close to 100% parallel code. 
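The process-pool idea above can be sketched roughly as follows. This is only an illustration written for Python 3, not code from the thread: the helper names `count_tags` and `count_many` are made up, the tag list mirrors the img/style/script counters from the original test program, and each worker is handed a filename (not a file object) so that file reading stays outside Python space.

```python
from multiprocessing import Pool

from lxml import etree

TAGS = ("img", "style", "script")

# One precompiled XPath counter per tag; XPath count() returns a float.
TAG_COUNTERS = [(tag, etree.XPath("count(//%s)" % tag)) for tag in TAGS]


def count_tags(path):
    """Parse one HTML file, given by filename, and count the tags of interest."""
    doc = etree.parse(path, etree.HTMLParser())
    return {tag: int(count(doc)) for tag, count in TAG_COUNTERS}


def count_many(paths, processes=8):
    """Spread independent files over a pool of worker processes."""
    with Pool(processes) as pool:
        return pool.map(count_tags, paths)
```

When used from a script, `count_many(list_of_paths)` should be called under an `if __name__ == "__main__":` guard, since some platforms re-import the main module in each worker process.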
If the idea is to extract small amounts of data from HTML pages and aggregate them somehow, I doubt that there is a way to beat separate processes that write their output into a common pipe/queue/database/whatever. This is actually an important thing to understand: Threads are good for avoiding I/O latency when your problem is I/O bound. They are not good for parallel computations, i.e. CPU bound tasks. Stefan From stefan_ml at behnel.de Tue May 12 20:22:12 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 12 May 2009 20:22:12 +0200 Subject: [lxml-dev] xpath on text nodes In-Reply-To: <1241861247.5355.41.camel@atman.artefact.org.nz> References: <1240994671.8989.9.camel@atman.artefact.org.nz> <1241065853.5570.4.camel@atman.artefact.org.nz> <1241122046.5549.19.camel@atman.artefact.org.nz> <4A053F34.9040809@behnel.de> <1241861247.5355.41.camel@atman.artefact.org.nz> Message-ID: <4A09BE54.1030501@behnel.de> Hi, Jamie Norrish wrote: > I've included at the end of this message an example of the XML I'm > operating over, where the aim is to get a rough number of characters of > textual content preceding and following a name or rs element. Given the > highly multiform nature of the markup, I *think* that the simplest way > of going about this is to go from text node to text node, forward and > back, accumulating the text as it goes, and stopping once a certain > amount has been reached. > > The way I'm currently doing this is by simply selecting a certain number > of text nodes preceding and following the name or rs element > (name_node.xpath('following::text()[position()<15]'), for example), and > iterating through those and stopping when the right amount of text has > been accumulated. Obviously this has the problem that too many or too > few text nodes (in the XPath result sense) may be selected, which is > either inefficient or leads to too little context. 
> > Selecting an ancestor and then splitting the textual content of that > isn't, I think, a better option, given the nature of the XML I'm dealing > with. A name/rs element may be at almost any level of the tree, and its > textual content may well be repeated multiple times within any given > chunk. Ok, I now see where you are coming from. Something like the above XPath expression or the respective lxml.etree API code would have been my first attempt, too. I actually doubt that you can do much better in this case. It's actually a more general problem. Imagine you select a text node that has a certain length and contains the found text multiple times. How would you find a good context here? Is it the context of the first occurrence, which may include a lot of preceding text but not the last occurrence within the text node itself (if it is long enough) - or is it the last occurrence that is interesting here, with all the text that follows the matching text node? So the underlying problem is even independent of the API you use, it's more that substrings do not match nicely with the granularity of a text node. > I totally understand that it's problematic to change lxml to have a > different model for text, and I'm either going to continue with my > current method, or else use a modified form of my ideal solution, which > is to get the parent element of the text, and then use XPath again to > get the appropriate next text node in the sequence from that. This is a > little more cumbersome than I'd like, obviously, since the expression > changes not just by the direction of the context (preceding or > following) but also whether the current text is the text or tail of the > element. I'd have to run some tests to see whether the extra processing > slowed things down too much - this process is one that operates over > (often) thousands of name elements within each of over a thousand > documents. Maybe you should try the same thing without XPath, just using the API. 
XPath is fast when you are very selective or when you grab the aggregated text content of an element. It's less great when you do things iteratively. The API based algorithm may not even be that complex as you can use tree iteration and stuff. (Did I mention that readability counts? :) > (The point of getting this context is to give people some idea of who a > name element might be referring to, for when it is being keyed to an > entity in our authority control system. So the markup doesn't matter > particularly, but the textual content does.) > >Stefan Behnel wrote: >> I still do not have a clear idea of what you consider "text context" >> actually. Does that take the tree structure into account (e.g. only within >> a certain parent element), or is it just any text content that precedes the >> XPath result in reverse document order, wherever it occurs in the tree? > > Just any, though there are some cases where the markup could be used to > usefully limit the context (so, for example, the name may occur within a > bibliographic entry in a list of citations, and it's unlikely that any > textual content from before or after that entry will be relevant. That's > typically going to be the exception, however; even staying within a > paragraph element is not necessarily helpful (named things are often > introduced at the end of a paragraph and given more context in the > following paragraph, for example). This sounds like your algorithm is already more complex than a simple "any text node preceding the one that matches". That convinces me that an API based solution will be a lot more flexible than anything you could scratch out of XPath. It would allow you to special case certain tag types, for example, or to notice when you cross parent boundaries. > Here's the example of a small piece of a document, in case it helps. I'll leave it in the reply, just in case others have ideas, too. 
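For illustration, an API-based walk of this kind might look like the sketch below (Python 3). It collects the text that follows an element in document order and stops once enough characters have been gathered. The helper names `following_text` and `context_after` and the sample markup are made up for this example; the sketch relies only on `tail`, `itersiblings()`, `itertext()` and `getparent()` from lxml.etree, and it skips the content of comments and processing instructions.

```python
from lxml import etree


def following_text(el):
    """Yield the text chunks that come after el's end tag, in document order."""
    node = el
    while node is not None:
        if node.tail:
            yield node.tail
        for sib in node.itersiblings():
            # Comments and processing instructions have a non-string tag;
            # only their tail text is of interest here.
            if isinstance(sib.tag, str):
                for chunk in sib.itertext():
                    yield chunk
            if sib.tail:
                yield sib.tail
        # Continue with whatever follows the parent element.
        node = node.getparent()


def context_after(el, limit):
    """Return up to `limit` characters of text following el."""
    chunks = []
    total = 0
    for chunk in following_text(el):
        chunks.append(chunk)
        total += len(chunk)
        if total >= limit:
            break
    return "".join(chunks)[:limit]


doc = etree.fromstring(
    "<doc><p>Hello <name>Peter</name>, hoping <hi>he</hi> is fine.</p>"
    "<p>Next para</p></doc>"
)
name = doc.find(".//name")
print(context_after(name, 10))
```

The loop over siblings is the natural place to special-case certain tags or to stop when a parent boundary is crossed. The preceding direction would need a mirrored walk over `itersiblings(preceding=True)` with the chunks taken in reverse order.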
> But really, I'm happy enough with the way lxml works (it's great software - > thank you and everyone else who has made it what it is!). Not being > familiar with its inner workings I didn't know whether it would be > feasible or practical to add XPath to text results. Now I know, and I'll > continue on without complaint. :) Stefan > give my love to everybody including key="name-110011" type="person">Peter, hoping he is > finding his way around the house better now, & that > this > > > finds you as it leaves me, in the best of health & very > much in love with you. >

> > YrYour > affect.affectionate son > > > > J.C. Ulysses > Beaglehole > > P.S. You might tell yourself, key="name-110417" type="person">Auntie & key="name-034628" type="person">Christine, that > I have struck nobody yet with so swish a > > dressing-gowndressing-gown > as mine. > I had now better get on to some other letters > of thanks, greeting, business, etc. > type="person">J. > P.P.S. You might send me the date of > Auntie Sis' > birthday. I hope Auntie's had a fitting celebration. > J. > > > P.P.P.S. I have been writing all the > morning & it is now > ? to 1. If you pass the letter round it will save > much exhaustion to my dexter hand. > > YrsYours > finally penultimately, > > > type="person">J.C.B. From stefan_ml at behnel.de Wed May 13 08:32:56 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 13 May 2009 08:32:56 +0200 (CEST) Subject: [lxml-dev] Ask for help about lxml usage In-Reply-To: <31824399.258461242192887596.JavaMail.coremail@bj163app119.163.com> References: <4A092FE3.8040400@behnel.de> <24110436.658541242100583875.JavaMail.coremail@bj163app69.163.com> <31824399.258461242192887596.JavaMail.coremail@bj163app119.163.com> Message-ID: <99ea4463d3281bf47509fcef619eae83.squirrel@groupware.dvs.informatik.tu-darmstadt.de> [forwarding this to the list] qhlonline wrote: > Thank you very much for your kindly help. I have tried to trace the lxml > source code. I have attempted to find out whether the lxml is simply a > python-wrapped interface for HTML/XML parsing. if so, after my program > have reached the C/Python interface(eg. xmlDoc* htmlCtxtReadMemory in > htmlparser.pxd), the following job should have run parallelled on > different CPU cores, becausing it was using libxml2. I have also naively > thought that the elimination of DOM tree creation process would save time, > but it seems that this job is done by SAX event processers of libxml2 if > we are using the default parser. 
It's not naive, the thing is just that this is a lot more efficient for single-threaded programs than for multi-threaded ones. That's not an obvious difference, and it's also not advertised in the docs. Most users don't care too much about the exact high performance characteristics and it's not like there's an obvious bottleneck in lxml.etree that you have to warn people about. It's more the general "Python code is single-threaded" kind of thing and it all depends on how you use it, so you have to benchmark your own code anyway. > I don't know the Cython language, some > new types and declarations of variables and functions in .pxi files and > .pxd files makes me feel headache. You can ignore most of the little stuff that you don't understand. The main idea is just that you can switch freely between C and Python in the code, so depending on how firm you are in both, some code sections may be less obvious than others. > but my instructor said that since the > multi-thread parsing program with lxml can save time on two-core CPU > machine(It is true, it can run nearly 20% faseter with 2 thread on > two-core CPU machine), it should run better with more thread on a better > machine. now, with your help, I know maybe multi-process program will do > better. thank you again! Usually a lot faster, as you avoid any implicit concurrency issues. In multi-process programs, synchronisation only happens where you explicitly do it. Things are a lot more implicit and subtle in multi-threading. I'll comment on your program in your other mail. Please keep the list involved next time. 
Stefan From stefan_ml at behnel.de Wed May 13 09:18:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 13 May 2009 09:18:50 +0200 (CEST) Subject: [lxml-dev] Ask for help about lxml usage In-Reply-To: <11199132.276211242194686537.JavaMail.coremail@bj163app119.163.com> References: <4A092FE3.8040400@behnel.de> <24110436.658541242100583875.JavaMail.coremail@bj163app69.163.com> <11199132.276211242194686537.JavaMail.coremail@bj163app119.163.com> Message-ID: qhlonline wrote: > My code of multi-thread parsing test is as follows, May be there are some > problems, and I am very glad to get your suggestion. I have created eight > threads in this program and they collaborate to parse a single HTML file > 1000 times, the result I want is the time used. > from lxml import etree > import time > > import threading > class TargetParser(object): > # The Target Parser Def: > def __init__(self): > self.ImgTag = 0 > self.StyleTag = 0 > self.ScriptTag= 0 > def start(self, tag, attrib): > if tag == 'img': > self.ImgTag = self.ImgTag+1 > elif tag == 'style': > self.StyleTag = self.StyleTag+1 > elif tag == 'script': > self.ScriptTag = self.ScriptTag+1 > else: > pass > def end(self, tag): > pass > def close(self): > return self I'd strip this all together and go with separate parse/search phases. > class MultiThread: > def __init__(self): > self.circle = 0 > self.timeres = 0.0 > > def CircleParse(self, webpath, circles=1000): > self.webpath=webpath > self.circles=circles > self.lock=thread.allocate_lock() > self.lock2=thread.allocate_lock() > starttime = time.time() > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) > > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) > thread.start_new_thread(self.NewParse,()) I'd use a for-loop here. 
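The for-loop could be as simple as the following sketch (Python 3, using the `threading` module rather than the low-level `thread` module of the quoted code). The function name `run_in_threads` is made up, and `worker` stands in for a method such as `NewParse`.

```python
import threading


def run_in_threads(worker, thread_count=8):
    """Start thread_count threads running worker() and wait for all of them."""
    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        # join() blocks until the thread has finished, so no explicit
        # lock is needed just to wait for completion.
        thread.join()
```

Waiting with `join()` is one possible design; a queue of results, read once per submitted job, would work as well.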
> self.lock2.acquire() > self.lock2.acquire() This bothers me. Not only that it appears twice, but also that you need an explicit lock in your program. > timeres = time.time()-starttime # total time > print 'In MultiThread Parser, the time Consume is : ',timeres, ' > Seconds!\n' > > def NewParse(self): > > MyParser = TargetParser() > Parser=etree.HTMLParser(target=MyParser) #The parser used in this > thread A general note on naming in Python programs: you should skim through PEP 8. Your code is rather hard to read as it uses AllCamelCase names for class names, method names and some local variables, i.e. things that are semantically very different. It's common to use it for class names, but everything else should use lower-case separated by underscores. > while (self.circle < self.circles): > if(self.circle >= self.circles): > break > > res = etree.parse(self.webpath,Parser) > > if(self.lock.acquire()): > self.circle = self.circle + 1 > if(self.circle >= self.circles and self.lock2.locked()): > > self.lock2.release() > break > self.lock.release() Ok, my take would be this (although it's completely(!) 
untested):

    # Python 2 syntax, as used throughout this thread
    from Queue import Queue
    from threading import Thread
    from lxml import etree

    # one precompiled XPath counter per tag name
    tag_counters = [
        (tagname, etree.XPath('count(//%s)' % tagname))
        for tagname in ('img', 'style', 'script')
    ]

    def start_threads(func, thread_count, *args):
        for _ in range(thread_count):
            thread = Thread(target=func, args=args)
            thread.setDaemon(True)  # setDaemon() requires an argument
            thread.start()

    def handle_urls(url_queue, result_queue):
        parser = etree.HTMLParser()
        while True: # I'm a daemon, so I don't care
            try:
                url = url_queue.get()
                doc = etree.parse(url, parser)
                result = [ (tagname, count(doc))
                           for tagname, count in tag_counters ]
                result_queue.put(result)
                doc = None # free space while we wait
            except Exception, e:
                # catch-all to make sure we report all 'normal' exceptions
                result_queue.put(e)
                e = None # free space while we wait

    # run benchmark
    in_queue = Queue()
    out_queue = Queue()
    start_threads(handle_urls, 10, in_queue, out_queue)

    from time import time
    t = time()
    for _ in range(100):
        # a local file URL needs three slashes
        in_queue.put("file:///tmp/somefile.html")
    for _ in range(100):
        print out_queue.get()
    print time() - t

... minus some glitches, but I bet you can fix them and post a better version.

Have fun,
Stefan

From james.slagle at gmail.com Tue May 19 04:21:03 2009
From: james.slagle at gmail.com (James Slagle)
Date: Mon, 18 May 2009 22:21:03 -0400
Subject: [lxml-dev] lxml 2.2 validation question
Message-ID: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com>

Hello, I'm having some trouble getting lxml (v. 2.2) to validate an ElementTree object that I'm building and was hoping someone on the list could help and maybe tell me what I'm doing wrong. If I create an ElementTree object directly from XML and an associated schema, it will validate fine. If I then construct a similar ElementTree object by just instantiating ElementTree, it will not validate. The odd thing is that the resulting XML from etree.tostring for both objects is identical. I've attached a Python script that shows the problem I'm having.
The validation error is: *** DocumentInvalid: Element 'Foo': No matching global declaration available for the validation root. I can get the second ElementTree object (etree2) to validate if I put the long explicit namesplace in front of the tag value (Foo) when I create etree2 in the script. So, if I change line 25 in the script to: rootelem = etree.Element('{http://example.com}Foo', {}, nsmap) , it will validate. However, the 2 resulting xml outputs are no longer equal b/c the output from etree2 is output with explict namespaces. Any ideas? Thanks for any help. -- James Slagle -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090518/38045aa8/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: lxmltest.py Type: text/x-python Size: 912 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090518/38045aa8/attachment.py From jholg at gmx.de Tue May 19 11:06:43 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 19 May 2009 11:06:43 +0200 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> Message-ID: <20090519090643.288760@gmx.net> > I can get the second ElementTree object (etree2) to validate if I put the > long > explicit namesplace in front of the tag value (Foo) when I create etree2 > in > the > script. So, if I change line 25 in the script to: > rootelem = etree.Element('{http://example.com}Foo', {}, nsmap) > , it will validate. And this is the right way to create an element that lives in namespace http://example.com. Some comments: > nsmap={None: 'http://example.com', 'foo': 'http://example.com'} > rootElem = etree.Element('Foo', {}, nsmap) Note that this does not put Foo into the http://example.com NS. It creates an element Foo wit no namespace. 
The nsmap is rather a collection of known namespace prefixes in the context of an element. So you could do >>> rootElem = etree.Element('{http://example.com}Foo', {}, {None: 'http://example.com'}) rootElem.text = '\nContents\n' >>> rootElem.text = '\nContents\n' >>> print etree.tostring(rootElem, pretty_print=True) Contents >>> >>> print schemaObj.validate(rootElem) True >>> which puts Foo into the intended NS and uses this NS unprefixed in the output. But if you do this >>> rootElem = etree.Element('{http://example.com}Foo', {}, {None: 'http://example.com', 'foo': 'http://example.com'}) >>> rootElem.text = '\nContents\n' >>> print schemaObj.validate(rootElem) True >>> print etree.tostring(rootElem, pretty_print=True) Contents >>> you end up with the foo prefix, the reason for this probably being the order a prefix for the NS http://example.com is found in the given nsmap (dictionaries are unordered). > However, the 2 resulting xml outputs are no longer equal b/c the output > from > etree2 is output with explict namespaces. While textual equality is often dubious in XML :) you might cleanup the superfluous namespaces: >>> rootElem = etree.Element('{http://example.com}Foo', nsmap={None: 'http://example.com'}) >>> rootElem.text = '\nContents\n' >>> >>> print etree.tostring(rootElem, pretty_print=True) Contents >>> etree1 = etree.fromstring("""\ ... ... Contents ... ... """ ... 
) >>> etree.tostring(etree1, pretty_print=True) == etree.tostring(rootElem, pretty_print=True) False >>> etree.cleanup_namespaces(etree1) >>> >>> etree.tostring(etree1, pretty_print=True) == etree.tostring(rootElem, pretty_print=True) True >>> Or even consider canonicalization, see http://codespeak.net/lxml/api.html#write-c14n-on-elementtree Holger -- Neu: GMX FreeDSL Komplettanschluss mit DSL 6.000 Flatrate + Telefonanschluss f?r nur 17,95 Euro/mtl.!* http://dslspecial.gmx.de/freedsl-surfflat/?ac=OM.AD.PD003K11308T4569a From james.slagle at gmail.com Tue May 19 13:17:43 2009 From: james.slagle at gmail.com (James Slagle) Date: Tue, 19 May 2009 07:17:43 -0400 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <20090519090643.288760@gmx.net> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <20090519090643.288760@gmx.net> Message-ID: <88bf15530905190417t549d4a37r928ceeb6be2c8780@mail.gmail.com> On Tue, May 19, 2009 at 5:06 AM, wrote: > Some comments: > > > nsmap={None: 'http://example.com', 'foo': 'http://example.com'} > > rootElem = etree.Element('Foo', {}, nsmap) > > Note that this does not put Foo into the http://example.com NS. It creates an element Foo wit no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element. > Ok, that clears things up a bit and explains the validation error. I also see how I can use etree.cleanup_namespaces before comparing the xml output. I guess the thing that I find strange is that there seems to be no way to end up with the xml I started with in my example if I instead start by instantiating an ElementTree object first. Either you have the validation error, fully prefixed namespaces, or one of the namespace declarations removed (if you were to use cleanup_namespaces). Thanks for your help! 
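To make the distinction discussed in this exchange concrete, here is a small sketch (Python 3, not taken from the attached script) contrasting a plain tag name with a namespace-qualified one. The variable names are illustrative only; `http://example.com` is the namespace URI used in the thread.

```python
from lxml import etree

NS = "http://example.com"

# A plain tag name creates an element in no namespace; the nsmap only
# declares prefixes that are known in the context of the element.
plain = etree.Element("Foo", nsmap={None: NS})
print(plain.tag)
print(etree.QName(plain).namespace)

# The {uri}tag notation puts the element into the namespace, and the
# default-namespace entry in nsmap lets it serialise without a prefix.
qualified = etree.Element("{%s}Foo" % NS, nsmap={None: NS})
print(qualified.tag)
print(etree.tostring(qualified))
```

Only the second element lives in the namespace the schema declares, which is why only that form can match a global declaration during validation.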
-- James Slagle From stefan_ml at behnel.de Tue May 19 19:44:46 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 19 May 2009 19:44:46 +0200 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> Message-ID: <4A12F00E.6040009@behnel.de> James Slagle wrote: > I'm having some trouble getting lxml (v. 2.2) to validate an ElementTree > object that I'm building and was hoping someone on the list could help and > maybe tell me what I'm doing wrong. > > If I create an ElementTree object directly from xml and an associated > schema, > it will validate fine. If I then construct a similar ElementTree object by > just instantianting ElementTree, it will not validate. The odd thing is > that the resulting xml from etree.tostring for both objects is identical. > > I've attached a python script that shows the problem I'm having. The > validation error is: > *** DocumentInvalid: Element 'Foo': No matching global declaration available > for the validation root. > > I can get the second ElementTree object (etree2) to validate if I put the > long > explicit namesplace in front of the tag value (Foo) when I create etree2 in > the > script. So, if I change line 25 in the script to: > rootelem = etree.Element('{http://example.com}Foo', {}, nsmap) > , it will validate. > > However, the 2 resulting xml outputs are no longer equal b/c the output from > etree2 is output with explict namespaces. With "explicit", do you mean that it uses namespace prefixes instead of the default namespace? lxml.etree internally does some namespace cleanup on the fly and (re-)maps the namespaces of qualified tag names ("{abc}tag") to namespace prefixes depending on the place you insert an Element into a tree. Doing so, it will only use one namespace declaration for each mapping, even if you redeclare a namespace with more than one prefix. 
A side effect is that a namespace declaration may end up being unused if lxml finds a different declaration first. Anyway, a few things to note here: 1) namespace prefixes are highly overrated 2) the default namespace is highly overused, especially when mixed with other (prefixed) namespaces 3) it is rarely (not 'never', but 'rarely') useful to declare the same namespace more than once. 4) comparing textual representations of XML documents is futile most of the time, except for their C14N serialisation. Stefan From james.slagle at gmail.com Tue May 19 20:18:22 2009 From: james.slagle at gmail.com (James Slagle) Date: Tue, 19 May 2009 14:18:22 -0400 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <4A12F00E.6040009@behnel.de> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <4A12F00E.6040009@behnel.de> Message-ID: <88bf15530905191118u6bf40e09gf65869e69e82b16d@mail.gmail.com> On Tue, May 19, 2009 at 1:44 PM, Stefan Behnel wrote: > With "explicit", do you mean that it uses namespace prefixes instead of the > default namespace? Yes, exactly. So in my example, it is output as , instead of just . > lxml.etree internally does some namespace cleanup on the fly and (re-)maps > the namespaces of qualified tag names ("{abc}tag") to namespace prefixes > depending on the place you insert an Element into a tree. Doing so, it will > only use one namespace declaration for each mapping, even if you redeclare > a namespace with more than one prefix. A side effect is that a namespace > declaration may end up being unused if lxml finds a different declaration > first. > > Anyway, a few things to note here: > > 1) namespace prefixes are highly overrated > 2) the default namespace is highly overused, especially when mixed with > other (prefixed) namespaces > 3) it is rarely (not 'never', but 'rarely') useful to declare the same > namespace more than once. My main issue is with an external tool I'm passing my generated xml to. 
This tool expects no prefixes on the elements, but prefixes on the attributes, and thus needs the namespace declared with a prefix and as the default. Yes, I know this is broken, and the tool needs to be fixed to be more flexible :). I was mainly wanting to know if it was possible to use lxml to generate xml output in this manner. Thanks. -- James Slagle From piet at cs.uu.nl Tue May 19 21:36:33 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Tue, 19 May 2009 21:36:33 +0200 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <20090519090643.288760@gmx.net> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <20090519090643.288760@gmx.net> Message-ID: <18963.2625.405755.476698@cochabamba.local> >>>>> jholg at gmx.de (j) wrote: >j> Some comments: >>> nsmap={None: 'http://example.com', 'foo': 'http://example.com'} >>> rootElem = etree.Element('Foo', {}, nsmap) >j> Note that this does not put Foo into the http://example.com NS. It creates an element Foo wit no namespace. The nsmap is rather a collection of known namespace prefixes in the context of an element. But that means that the serialization: Contents that etree.tostring produces is wrong. -- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From lei at ipac.caltech.edu Tue May 19 23:40:13 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Tue, 19 May 2009 14:40:13 -0700 Subject: [lxml-dev] could not find doctest Message-ID: <4A13273D.6020100@ipac.caltech.edu> I copied the following from the lxml document on the web: #! 
/stage/irsa-sw-dev/cm/env_1.4_mnt/python/bin/python import urllib import urllib2 import urlparse import os import popen2 import StringIO import sys, getopt import re import string import doctest version = sys.version_info print "python version: ", version print os.environ['PYTHONPATH'] ## test with lxml import lxml.html from lxml.html import builder as E from lxml.html import usedoctest html = E.HTML( E.HEAD( E.LINK(rel="stylesheet", href="great.css", type="text/css"), E.TITLE("Best Page Ever") ), E.BODY( E.H1(E.CLASS("heading"), "Top News"), E.P("World News only on this page", style="font-size: 200%"), "Ah, and here's some more text, by the way.", lxml.html.fromstring("

<p>and this is a parsed fragment</p>

") ) ) print lxml.html.tostring(html) And got the following error: python version: (2, 5, 1, 'final', 0) .:/home/lei/python-stuff/BeautifulSoup-3.1.0.1:./lib:/home/lei/lxml-2.2/src Traceback (most recent call last): File "/home/lei/python-stuff/test_lxml_htmlout.py", line 22, in from lxml.html import usedoctest File "/home/lei/lxml-2.2/src/lxml/html/usedoctest.py", line 13, in doctestcompare.temp_install(html=True, del_module=__name__) File "/home/lei/lxml-2.2/src/lxml/doctestcompare.py", line 395, in temp_install frame = _find_doctest_frame() File "/home/lei/lxml-2.2/src/lxml/doctestcompare.py", line 486, in _find_doctest_frame "Could not find doctest (only use this function *inside* a doctest)") LookupError: Could not find doctest (only use this function *inside* a doctest) Any clues to this problem ? -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From lei at ipac.caltech.edu Tue May 19 23:44:19 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Tue, 19 May 2009 14:44:19 -0700 Subject: [lxml-dev] parse unable to load url Message-ID: <4A132833.8050009@ipac.caltech.edu> I copied sample code from lxml documentation #! 
/stage/irsa-sw-dev/cm/env_1.4_mnt/python/bin/python import urllib import urllib2 import urlparse import os import popen2 import StringIO import sys, getopt import re import string version = sys.version_info print "python version: ", version print os.environ['PYTHONPATH'] ## test with lxml from lxml import etree from lxml.html import fromstring, tostring, parse, submit_form #page = parse('http://tinyurl.com').getroot() page = parse('/home/lei/python-stuff/tmp_CoRoT_exo_index.html').getroot() print page page = parse('http://tinyurl.com').getroot() And got errors when parsing url, local file is ok dodo:lei > test_lxml.py python version: (2, 5, 1, 'final', 0) .:/home/lei/python-stuff/BeautifulSoup-3.1.0.1:./lib:/home/lei/lxml-2.2/src Traceback (most recent call last): File "test_lxml.py", line 27, in page = parse('http://tinyurl.com').getroot() File "/home/lei/lxml-2.2/src/lxml/html/__init__.py", line 661, in parse return etree.parse(filename_or_url, parser, base_url=base_url, **kw) File "lxml.etree.pyx", line 2693, in lxml.etree.parse (src/lxml/lxml.etree.c:52591) File "parser.pxi", line 1478, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:75665) File "parser.pxi", line 1507, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:75993) File "parser.pxi", line 1407, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:75002) File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:72023) File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:67830) File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:68877) File "parser.pxi", line 563, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:68093) IOError: Error reading file 'http://tinyurl.com': failed to load HTTP resource -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From sergio at sergiomb.no-ip.org Wed May 20 04:17:13 2009 From: sergio at 
sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Wed, 20 May 2009 03:17:13 +0100
Subject: [lxml-dev] parse unable to load url
In-Reply-To: <4A132833.8050009@ipac.caltech.edu>
References: <4A132833.8050009@ipac.caltech.edu>
Message-ID: <1242785833.2437.9.camel@monteirov>

Forgot to cc the list!

On Tue, 2009-05-19 at 14:44 -0700, Mary Lei wrote:
> page = parse('http://tinyurl.com').getroot()

http://tinyurl.com redirects to http://tinyurl.com/

page = parse('http://tinyurl.com/').getroot()

works.

Best regards,
--
Sérgio M.B.

From stefan_ml at behnel.de Wed May 20 09:35:03 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 20 May 2009 09:35:03 +0200 (CEST)
Subject: [lxml-dev] lxml 2.2 validation question
In-Reply-To: <18963.2625.405755.476698@cochabamba.local>
References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <20090519090643.288760@gmx.net> <18963.2625.405755.476698@cochabamba.local>
Message-ID: <4c23cbfec2ca32a0e50b2fb5ed508646.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Piet van Oostrum wrote:
>> jholg at gmx.de wrote:
> >>> nsmap={None: 'http://example.com', 'foo': 'http://example.com'}
> >>> rootElem = etree.Element('Foo', {}, nsmap)
>
>> Note that this does not put Foo into the http://example.com NS. It
>> creates an element Foo with no namespace. The nsmap is rather a collection
>> of known namespace prefixes in the context of an element.
>
> But that means that the serialization:
>
> Contents
>
> that etree.tostring produces is wrong.

Correct. This is actually an error case that we could catch at the API level: the new element has no namespace *and* the nsmap defines a default namespace, i.e.
this should fail: el = etree.Element('thetag', nsmap={None : 'uri:some-namespace'}) We'd then need to make sure that you can write el = etree.Element('{}thetag', nsmap={None : 'uri:some-namespace'}) which would result in something like Similarly, adding a namespace-less subelement to a tree that defines a default namespace will not work 'as expected' (whatever a user may expect in doing that). In this case, we'd have to cut the default namespace definition by inserting a xmlns="" on the new element, so this case would not be an error. The same applies in the general case where you create a tree without namespaces and one that uses a default namespace, and then insert the namespace-less tree into the other one at some place. Another sick case: el1 = etree.fromstring( '') el2 = etree.Element("{uri:tns}thetag", nsmap={None: "uri:otherns"}) el2.append(el1) So it looks like we'd have to integrate something similar into the namespace fixing mechanism... Ugly, ugly... Stefan From stefan_ml at behnel.de Wed May 20 09:47:23 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 20 May 2009 09:47:23 +0200 (CEST) Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: <88bf15530905191118u6bf40e09gf65869e69e82b16d@mail.gmail.com> References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <4A12F00E.6040009@behnel.de> <88bf15530905191118u6bf40e09gf65869e69e82b16d@mail.gmail.com> Message-ID: James Slagle wrote: > My main issue is with an external tool I'm passing my generated xml > to. This tool expects no prefixes on the elements, but prefixes on > the attributes, and thus needs the namespace declared with a prefix > and as the default. Yes, I know this is broken, and the tool needs to > be fixed to be more flexible :). ... "to support XML namespaces", you mean. ;-) > I was mainly wanting to know if it was possible to use lxml to > generate xml output in this manner. 
I recall adding a namespace setup rule that explicitly prefers the prefixed namespace over an equivalent default namespace when you define both on the same node. This is because otherwise you end up with a similar problem with unnamespaced attributes on namespaced elements. The output you get is a side effect of that fix, so, no, there isn't a way to define a namespace as both the default namespace and a prefixed namespace, and basically let lxml ignore the prefixed namespace in favour of the default. That said, was there actually a reason why you defined the namespace prefix in the first place? Why isn't the default namespace enough to do what you want? (Note that the prefix used in the schema is independent of the one used in the document, that's what I meant with prefixes being 'overrated'.) Stefan From stefan_ml at behnel.de Wed May 20 09:52:46 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 20 May 2009 09:52:46 +0200 (CEST) Subject: [lxml-dev] could not find doctest In-Reply-To: <4A13273D.6020100@ipac.caltech.edu> References: <4A13273D.6020100@ipac.caltech.edu> Message-ID: Mary Lei wrote: > I copied the following from the lxml document on the web: ... and you copied a bit too much. > from lxml.html import usedoctest > > Traceback (most recent call last): > File "/home/lei/lxml-2.2/src/lxml/doctestcompare.py", line 486, in > _find_doctest_frame > "Could not find doctest (only use this function *inside* a doctest)") > LookupError: Could not find doctest (only use this function *inside* a > doctest) As the error message tells you, importing "usedoctest" will only work from a doctest. Unless you are writing a doctest, there is no reason to use that module. 
Stefan From lei at ipac.caltech.edu Wed May 20 19:14:00 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Wed, 20 May 2009 10:14:00 -0700 Subject: [lxml-dev] parse unable to load url In-Reply-To: <1242785833.2437.9.camel@monteirov> References: <4A132833.8050009@ipac.caltech.edu> <1242785833.2437.9.camel@monteirov> Message-ID: <4A143A58.3020301@ipac.caltech.edu> This works great for me: tiny fd > ['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s'] Thanks so much. I am beginning to believe in lxml. Sorry, Stefan :-) Sergio Monteiro Basto wrote: > Forgot cc to list ! > > On Tue, 2009-05-19 at 14:44 -0700, Mary Lei wrote: >> page = parse('http://tinyurl.com').getroot() > http://tinyurl.com redirect to http://tinyurl.com/ > > page = parse('http://tinyurl.com/').getroot() > works > > best regards > -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From lei at ipac.caltech.edu Wed May 20 20:06:36 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Wed, 20 May 2009 11:06:36 -0700 Subject: [lxml-dev] could not find doctest In-Reply-To: References: <4A13273D.6020100@ipac.caltech.edu> Message-ID: <4A1446AC.9060406@ipac.caltech.edu> I discovered the following for my purpose: --------------------------------------- import lxml.html ## html output html = lxml.html.fromstring('''\ Creating HTML with the E-factory

Hi !

''') print lxml.html.tostring(html) #

Hi !

print lxml.html.tostring(html) from lxml.html import builder as E html = E.HTML( E.HEAD( # E.LINK(rel="stylesheet", href="great.css", type="text/css"), E.TITLE("Best Page Ever") ), E.BODY( E.H1(E.CLASS("heading"), "Top News"), E.P("World News only on this page", style="font-size: 200%"), "Ah, and here's some more text, by the way.", lxml.html.fromstring("

and this is a parsed fragment

") ) ) print lxml.html.tostring(html,pretty_print=True) <-- is what I want ------------------------------------------ but got confused by doctest yesterday, being new to python. Stefan Behnel wrote: > Mary Lei wrote: >> I copied the following from the lxml document on the web: > > ... and you copied a bit too much. > >> from lxml.html import usedoctest >> >> Traceback (most recent call last): >> File "/home/lei/lxml-2.2/src/lxml/doctestcompare.py", line 486, in >> _find_doctest_frame >> "Could not find doctest (only use this function *inside* a doctest)") >> LookupError: Could not find doctest (only use this function *inside* a >> doctest) > > As the error message tells you, importing "usedoctest" will only work from > a doctest. Unless you are writing a doctest, there is no reason to use > that module. > > Stefan -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From james.slagle at gmail.com Wed May 20 21:27:41 2009 From: james.slagle at gmail.com (James Slagle) Date: Wed, 20 May 2009 15:27:41 -0400 Subject: [lxml-dev] lxml 2.2 validation question In-Reply-To: References: <88bf15530905181921v334ad86dyb098871925983b0f@mail.gmail.com> <4A12F00E.6040009@behnel.de> <88bf15530905191118u6bf40e09gf65869e69e82b16d@mail.gmail.com> Message-ID: <88bf15530905201227t6a43e8a1na22880945d7af6d8@mail.gmail.com> On Wed, May 20, 2009 at 3:47 AM, Stefan Behnel wrote: > That said, was there actually a reason why you defined the namespace > prefix in the first place? Why isn't the default namespace enough to do > what you want? (Note that the prefix used in the schema is independent of > the one used in the document, that's what I meant with prefixes being > 'overrated'.) There was no good reason. I was merely trying to match the example xml that I have for working with this particular tool. Thanks for the helpful information, it has certainly helped clear things up for me. 
--
-- James Slagle --

From ovnicraft at gmail.com Mon May 25 17:54:17 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Mon, 25 May 2009 10:54:17 -0500
Subject: [lxml-dev] UTF-8 not supported
Message-ID:

Hi folks,

when I use encoding='iso-8859-1', the XML encoding is written OK, but when I try the same thing with 'UTF-8', it does not appear in my file. I have version 2.2.

How can I encode my file with utf-8?

Regards,
--
http://twitter.com/ovnicraft

From shigin at rambler-co.ru Mon May 25 18:01:00 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Mon, 25 May 2009 20:01:00 +0400
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To:
References:
Message-ID: <1243267260.6964.4414.camel@atlas>

On Mon, 25/05/2009 at 10:54 -0500, Ovnicraft wrote:
> Hi folks, when I use encoding='iso-8859-1', the XML encoding is written OK,
> but when I try the same thing with 'UTF-8', it does not appear in my file. I
> have version 2.2.
> How can I encode my file with utf-8?

Can you give your code snippet? UTF-8 works fine for me.

Here is an example:
In [6]: print etree.tostring(etree.Element(unicode('ыъъ', 'utf-8')), encoding='utf-8')

If you haven't got Cyrillic letters:
In [9]: etree.tostring(etree.Element(u'\u044b\u044a\u044a'), encoding='utf-8')
Out[9]: '<\xd1\x8b\xd1\x8a\xd1\x8a/>'

From sergio at sergiomb.no-ip.org Mon May 25 18:11:45 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Mon, 25 May 2009 17:11:45 +0100
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To: <1243267260.6964.4414.camel@atlas>
References: <1243267260.6964.4414.camel@atlas>
Message-ID: <1243267905.4887.2.camel@segulix>

Hi,

I usually set the encoding on the parser like this:

hparser = etree.HTMLParser(encoding='utf-8', remove_comments=True)
etree_document = etree.HTML(f, parser=hparser)

On Mon, 2009-05-25 at 20:01 +0400, Alexander Shigin wrote:
> On Mon, 25/05/2009 at 10:54 -0500, Ovnicraft wrote:
> > Hi folks, when I use encoding='iso-8859-1', the XML encoding is written OK,
> > but when I try the same thing with 'UTF-8', it does not appear in my file. I
> > have version 2.2.
> > How can I encode my file with utf-8?
>
> Can you give your code snippet? UTF-8 works fine for me.
>
> Here is an example:
> In [6]: print etree.tostring(etree.Element(unicode('ыъъ', 'utf-8')), encoding='utf-8')
>
> If you haven't got Cyrillic letters:
> In [9]: etree.tostring(etree.Element(u'\u044b\u044a\u044a'), encoding='utf-8')
> Out[9]: '<\xd1\x8b\xd1\x8a\xd1\x8a/>'

--
Sérgio M. B.

From ovnicraft at gmail.com Mon May 25 18:20:28 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Mon, 25 May 2009 11:20:28 -0500
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To: <1243267260.6964.4414.camel@atlas>
References: <1243267260.6964.4414.camel@atlas>
Message-ID:

2009/5/25 Alexander Shigin

> On Mon, 25/05/2009 at 10:54 -0500, Ovnicraft wrote:
> > Hi folks, when I use encoding='iso-8859-1', the XML encoding is written OK,
> > but when I try the same thing with 'UTF-8', it does not appear in my file. I
> > have version 2.2.
> > How can I encode my file with utf-8?
>
> Can you give your code snippet? UTF-8 works fine for me.
>
> Here is an example:
> In [6]: print etree.tostring(etree.Element(unicode('ыъъ', 'utf-8')),
> encoding='utf-8')
>
> If you haven't got Cyrillic letters:
> In [9]: etree.tostring(etree.Element(u'\u044b\u044a\u044a'),
> encoding='utf-8')
> Out[9]: '<\xd1\x8b\xd1\x8a\xd1\x8a/>'

I wrote something very simple: http://lxml.pastebin.com/m3f8c5b9c

--
Cristian Salamea
CEO GnuThink Software Labs
Software Libre / Open Source
(+593-8) 4-36-44-48

From ovnicraft at gmail.com Mon May 25 18:23:46 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Mon, 25 May 2009 11:23:46 -0500
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To:
References: <1243267260.6964.4414.camel@atlas>
Message-ID:

2009/5/25 Ovnicraft

> 2009/5/25 Alexander Shigin
>
>> On Mon, 25/05/2009 at 10:54 -0500, Ovnicraft wrote:
>> > Hi folks, when I use encoding='iso-8859-1', the XML encoding is written OK,
>> > but when I try the same thing with 'UTF-8', it does not appear in my file. I
>> > have version 2.2.
>> > How can I encode my file with utf-8?
>>
>> Can you give your code snippet? UTF-8 works fine for me.
>>
>> Here is an example:
>> In [6]: print etree.tostring(etree.Element(unicode('ыъъ', 'utf-8')),
>> encoding='utf-8')
>>
>> If you haven't got Cyrillic letters:
>> In [9]: etree.tostring(etree.Element(u'\u044b\u044a\u044a'),
>> encoding='utf-8')
>> Out[9]: '<\xd1\x8b\xd1\x8a\xd1\x8a/>'
>
> I wrote something very simple: http://lxml.pastebin.com/m3f8c5b9c

With utf-8: http://lxml.pastebin.com/m72a14d

> --
> Cristian Salamea
> CEO GnuThink Software Labs
> Software Libre / Open Source
> (+593-8) 4-36-44-48

--
Cristian Salamea
CEO GnuThink Software Labs
Software Libre / Open Source
(+593-8) 4-36-44-48

From shigin at rambler-co.ru Mon May 25 18:28:55 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Mon, 25 May 2009 20:28:55 +0400
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To:
References: <1243267260.6964.4414.camel@atlas>
Message-ID: <1243268936.6964.4425.camel@atlas>

On Mon, 25/05/2009 at 11:23 -0500, Ovnicraft wrote:
> I wrote something very simple: http://lxml.pastebin.com/m3f8c5b9c
>
> With utf-8: http://lxml.pastebin.com/m72a14d

In [33]: etree.tostring(openerp, pretty_print=True, encoding='utf-8') == \
   ....: etree.tostring(openerp, pretty_print=True, encoding='utf-8')
Out[33]: True

Is it possible that you pasted the wrong code snippet? I don't see any non-ASCII chars and the output is the same for me.

From ovnicraft at gmail.com Mon May 25 21:41:58 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Mon, 25 May 2009 14:41:58 -0500
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To: <1243268936.6964.4425.camel@atlas>
References: <1243267260.6964.4414.camel@atlas> <1243268936.6964.4425.camel@atlas>
Message-ID:

2009/5/25 Alexander Shigin

> On Mon, 25/05/2009 at 11:23 -0500, Ovnicraft wrote:
> > I wrote something very simple: http://lxml.pastebin.com/m3f8c5b9c
> >
> > With utf-8: http://lxml.pastebin.com/m72a14d
>
> In [33]: etree.tostring(openerp, pretty_print=True, encoding='utf-8') == \
>    ....: etree.tostring(openerp, pretty_print=True,
> encoding='utf-8')
> Out[33]: True
>
> Is it possible that you pasted the wrong code snippet? I don't see any
> non-ASCII chars and the output is the same for me.

http://lxml.pastebin.com/m5d4a419

In this one, the highlighted line doesn't appear in the output when I use utf-8.

--
Cristian Salamea
CEO GnuThink Software Labs
Software Libre / Open Source
(+593-8) 4-36-44-48

From limi at plone.org Tue May 26 00:03:35 2009
From: limi at plone.org (Alexander Limi)
Date: Mon, 25 May 2009 15:03:35 -0700
Subject: [lxml-dev] Binary egg for Mac OS X
Message-ID:

Hi,

I'm working on documenting Deliverance / xdv for use with Plone. It has lxml as a dependency, and I have run into a serious issue:

On Mac OS X, we can't assume that people have Xcode (i.e. gcc and friends) installed, thus we can't really compile lxml on those computers, not even using the staticlxml[1] recipe.

I see that there are binary eggs for Windows. Is there a special reason why there are no binary eggs for OS X, or is it just a matter of not having the infrastructure to make it available?

Happy to help find a solution if it's just a matter of locating a reliable way to get it compiled every time there is a new release.

[1] http://pypi.python.org/pypi/z3c.recipe.staticlxml/

--
Alexander Limi · http://limi.net

From shigin at rambler-co.ru Tue May 26 11:04:25 2009
From: shigin at rambler-co.ru (Alexander Shigin)
Date: Tue, 26 May 2009 13:04:25 +0400
Subject: [lxml-dev] UTF-8 not supported
In-Reply-To:
References: <1243267260.6964.4414.camel@atlas> <1243268936.6964.4425.camel@atlas>
Message-ID: <1243328665.6964.4441.camel@atlas>

On Mon, 25/05/2009 at 14:41 -0500, Ovnicraft wrote:
> http://lxml.pastebin.com/m5d4a419
> In this one, the highlighted line doesn't appear in the output when I use utf-8.

Oh, I get it. The XML specification says:

"""
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.
"""

i.e.
utf-8 is the default encoding for XML, and the XML library can omit the encoding declaration if the encoding is utf-8.

The tostring routine has an option, xml_declaration. The option forces lxml to write or omit the XML declaration.

In [37]: print etree.tostring(openerp, pretty_print=True, encoding='utf-8', xml_declaration=True)
field

From avleen at gmail.com Wed May 27 17:16:17 2009
From: avleen at gmail.com (Avleen Vig)
Date: Wed, 27 May 2009 16:16:17 +0100
Subject: [lxml-dev] Possible bug, libxml segfault?
Message-ID: <33c66c80905270816le007e19ua8844d5a5d8b8c28@mail.gmail.com>

Hi folks :)

I sent this message to the xml at gnome list, but that place seems very quiet lately. I'm hoping someone here might have some insight. Indeed this might be a bug in lxml, we're not sure.

Background:
We use libxml and libxslt in one of our applications (specifically, in Python via lxml).

Recently we've seen our application dying at strange times for no apparent reason.
We managed to get a core file out of one crash, and the results of some of our debugging are here:
http://xml.pastebin.com/m70c259d6
(I'd be happy to poke more in a particular direction on there, I'm a bit new to gdb :)

To me, it seems the parser is complaining while trying to parse the namespaces in the node in transforms/_base.xslt
The node for that opens like this:

I dug a little deeper and found a bunch of the "address out of bounds" errors and thought I should ask here as I'm drawing a blank on where to go next.
The problem happens intermittently, but usually several times a day. I could probably reproduce it.
I also see the 'exclude-result-prefixes' mentioned in the backtrace. Could that be involved here?

Any suggestions you have would be much appreciated!
From stefan_ml at behnel.de Sat May 30 13:04:14 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 30 May 2009 13:04:14 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: Message-ID: <4A2112AE.8040903@behnel.de> Hi, Alexander Limi wrote: > I'm working on documenting Deliverance / xdv for use with Plone. It has > lxml as a dependency, and I have run into a serious issue: > > On Mac OS X, we can't assume that people have Xcode (ie. gcc and friends) > installed, thus we can't really compile lxml on those computers, not even > using the staticlxml[1] recipe. > > I see that there are binary eggs for Windows, is there a special reason > why there are no binary eggs for OS X, or is it just a matter of not > having the infrastructure to make it available? The main problem is that many MacOS-X users have some kind of package distribution like macports installed, which usually has some distribution specific setup/dependencies/paths/whatever. OTOH, those users won't be the target for a binary distribution of lxml anyway. > Happy to help find a solution if it's just a matter of locating a > reliable way to get it compiled every time there is a new release. Yes, I'd be happy if we could get a static binary egg for each release. I don't have a Mac myself (and I'm definitely not a Mac user), so contributions are welcome. http://codespeak.net/lxml/build.html#building-lxml-on-macos-x Stefan From stefan_ml at behnel.de Sat May 30 13:25:34 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 30 May 2009 13:25:34 +0200 Subject: [lxml-dev] Possible bug, libxml segfault? In-Reply-To: <33c66c80905270816le007e19ua8844d5a5d8b8c28@mail.gmail.com> References: <33c66c80905270816le007e19ua8844d5a5d8b8c28@mail.gmail.com> Message-ID: <4A2117AE.7010908@behnel.de> Hi, Avleen Vig wrote: > Background: > We use libxml and libxslt in one of our applications (specifically, in > Python via lxml). 
>
> Recently we've seen our application dying at strange times for no
> apparent reason.
> We managed to get a core file out of one crash, and the results of
> some of our debugging are here:
> http://xml.pastebin.com/m70c259d6
> (I'd be happy to poke more in a particular direction on there, I'm a
> bit new to gdb :)

The information you give is pretty informative. It just lacks a hint as to which versions of lxml, libxml2 and libxslt you are using. Is this reproducible with the latest release versions of all three?

> To me, it seems the parser is complaining while trying to parse the
> namespaces in the node in transforms/_base.xslt
> The node for that opens like this:
> xmlns="http://www.w3.org/1999/xhtml";
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
> xmlns:tfxslt="http://tfnet.co.uk/ns/tfxslt";
> xmlns:fb="http://www.facebook.com/2008/fbml";
> extension-element-prefixes="tfxslt str exsl"
> exclude-result-prefixes="str exsl tfxslt fb"
> xmlns:error="http://www.woome.com/error/";>

The triggering problem is that you try to exclude the namespace prefixes "str" and "exsl", which are not defined in your stylesheet. This seems to induce a problem in lxml's error reporting mechanism in your specific setup.

Is your application threaded? I do remember a couple of problems with threaded error reporting in the past, although all of those were solved back then.

Stefan

From qhlonline at 163.com Sun May 31 04:29:42 2009
From: qhlonline at 163.com (qhlonline)
Date: Sun, 31 May 2009 10:29:42 +0800 (CST)
Subject: [lxml-dev] problem about lxml encoding
Message-ID: <28194717.92641243736982945.JavaMail.coremail@bj163app25.163.com>

Hi all,

I have a question about lxml encoding and HTML encoding. What is the relationship between the encoding that the lxml parser will use and the encoding of an HTML file? If the HTML file's actual encoding is not the same as the one declared in its <meta> tag, which will lxml's parser choose: the 'charset' encoding or the actual encoding of the HTML file?
From stefan_ml at behnel.de Sun May 31 08:10:16 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 31 May 2009 08:10:16 +0200
Subject: [lxml-dev] problem about lxml encoding
In-Reply-To: <28194717.92641243736982945.JavaMail.coremail@bj163app25.163.com>
References: <28194717.92641243736982945.JavaMail.coremail@bj163app25.163.com>
Message-ID: <4A221F48.7080503@behnel.de>

qhlonline wrote:
> Hi all, I have a question about lxml encoding and HTML encoding. What is
> the relationship between the encoding that the lxml parser will use
> and the encoding of an HTML file? If the HTML file's actual encoding is not
> the same as the one declared in its <meta> tag, which will lxml's
> parser choose?

The HTML parser will use the encoding specified by the <meta> tag if it's present; otherwise it will expect the document to be Latin-1. It will not magically guess the right encoding if the <meta> tag happens to be incorrect.

If you somehow know the encoding better, you can override the behaviour by passing the "encoding" option when instantiating the parser.

Does that help?

Stefan

From jamie at artefact.org.nz Sun May 31 08:40:33 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Sun, 31 May 2009 18:40:33 +1200
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <4A09BE54.1030501@behnel.de>
References: <1240994671.8989.9.camel@atman.artefact.org.nz> <1241065853.5570.4.camel@atman.artefact.org.nz> <1241122046.5549.19.camel@atman.artefact.org.nz> <4A053F34.9040809@behnel.de> <1241861247.5355.41.camel@atman.artefact.org.nz> <4A09BE54.1030501@behnel.de>
Message-ID: <1243752033.6621.21.camel@atman.artefact.org.nz>

Stefan Behnel wrote:
> This sounds like your algorithm is already more complex than a simple "any
> text node preceding the one that matches".
> That convinces me that an API
> based solution will be a lot more flexible than anything you could scratch
> out of XPath. It would allow you to special case certain tag types, for
> example, or to notice when you cross parent boundaries.

Well, the original plan didn't really call for much special casing of particular elements, but now that things are working I'll likely add in such as I think of them.

I've changed the approach completely, to use XSLT to transform the entire document into something that has the context handled appropriately (and using XPath on text nodes :). It takes two transformations (the second one to handle ordering issues with the preceding context, and to do a little cleanup of whitespace), but it is more than an order of magnitude faster than what I had before. I'm not sure why I didn't go down that route in the first place, but now that I have I'm very happy. And of course it's great that XSLT is so easy to use with lxml.

Oh, I also tried using .getparent() and some logic to get the equivalent of preceding::text()[1] and following::text()[1], but it turned out (not surprisingly, given the complexities of that approach) to be marginally slower than what I had.

Jamie

From stefan_ml at behnel.de Sun May 31 11:29:05 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 31 May 2009 11:29:05 +0200
Subject: [lxml-dev] problem about lxml encoding
In-Reply-To: <26911870.306181243760617643.JavaMail.coremail@bj163app60.163.com>
References: <4A221F48.7080503@behnel.de> <28194717.92641243736982945.JavaMail.coremail@bj163app25.163.com> <26911870.306181243760617643.JavaMail.coremail@bj163app60.163.com>
Message-ID: <4A224DE1.5000907@behnel.de>

Hi,

please keep the list involved. Thanks.

qhlonline wrote:
> Since I am processing Chinese Webs. There are instances that some Webs
> are not regular. When a page declares GB2312 in its <meta> tag and we can't decode
> the HTML string with a GB2312 decoder, we find that the content is
> in fact encoded with GBK or GB18030. But the lxml parser will process it
> according to the meta declaration. I have read the source of libxml2;
> there is no "encoding" string matching "GB2312", "GBK" or "GB18030", but
> there are some other encodings like ENCODING_2022_JP, which may be a superset of
> GB2312. I just don't know where in libxml2 or in lxml the declared
> GB2312 encoding is converted to some other encoding that is apparently
> supported by libxml2's xmlCharEncoding structure, and how?

Most encodings are handled by libiconv. libxml2 only handles the 'normal' (or most common) XML encodings in its own code.

> I think
> since "GB18030" is a superset of "GB2312", if we change the lxml source
> to let all strings be decoded with the GB18030
> codec, then there will be no error even if the HTML file is not regular.

No need to play with lxml's sources here. What should work is to read the HTML page into a byte string, decode it manually into a unicode string, and then let lxml parse that.
That way, you also have full control over the decoding and can handle any decoding errors yourself.

Stefan

From foolistbar at googlemail.com Sun May 31 14:28:31 2009
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 31 May 2009 13:28:31 +0100
Subject: [lxml-dev] problem about lxml encoding
In-Reply-To: <4A224DE1.5000907@behnel.de>
References: <4A221F48.7080503@behnel.de> <28194717.92641243736982945.JavaMail.coremail@bj163app25.163.com> <26911870.306181243760617643.JavaMail.coremail@bj163app60.163.com> <4A224DE1.5000907@behnel.de>
Message-ID: <4BF7C14A-DE65-492D-8C69-EB5B04F213F7@googlemail.com>

On 31 May 2009, at 10:29, Stefan Behnel wrote:
> qhlonline wrote:
>> Since I am processing Chinese Webs. There are instances that some
>> Webs
>> are not regular.

If you need support for deployed web content, you'll probably have better success with html5lib than you will with libxml2's HTML parser.

--
Geoffrey Sneddon
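
Stefan's suggestion earlier in this thread (read the page as bytes, decode it yourself, then hand the text to lxml) might look roughly like the sketch below, assuming Python 3 with lxml installed. The helper name parse_html_bytes, the GB18030 default and the errors="replace" policy are illustrative assumptions, not something prescribed in the thread.

```python
import lxml.html


def parse_html_bytes(raw, encoding="gb18030"):
    """Decode raw HTML bytes ourselves, then let lxml parse the text.

    Hypothetical helper: the name, the default codec and the error
    policy are assumptions made for this illustration.
    """
    # GB18030 is used because the thread reports pages that declare
    # GB2312 but are really GBK or GB18030. Bytes that cannot be decoded
    # become U+FFFD instead of raising, so error handling stays with us.
    text = raw.decode(encoding, errors="replace")
    # lxml now receives already-decoded text rather than raw bytes.
    return lxml.html.fromstring(text)


# Sample bytes for the demonstration, encoded as GB18030.
raw_page = "<p>\u6731\u9555\u57fa</p>".encode("gb18030")
paragraph = parse_html_bytes(raw_page)
print(paragraph.tag)
```

If a stricter policy is wanted, dropping errors="replace" makes bytes.decode raise UnicodeDecodeError, which the caller can catch and handle as it sees fit.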