From stefan_ml at behnel.de Tue Mar 2 08:31:31 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Mar 2010 08:31:31 +0100
Subject: [lxml-dev] [XML-SIG] Python working with XML
In-Reply-To: <710c80bb1003011222l13ea5b53h84dfbe8c6d8d1f9b@mail.gmail.com>
References: <710c80bb1002181242g5c0df17aid5710a1574761871@mail.gmail.com> <4B811583.4020604@behnel.de> <710c80bb1003011222l13ea5b53h84dfbe8c6d8d1f9b@mail.gmail.com>
Message-ID: <4B8CBED3.3030108@behnel.de>

Hi,

I hope you don't mind if I forward your mail to the lxml mailing list. lxml is an open source library, and help is commonly open, too.

James Johnston, 01.03.2010 21:22:
> I've been trying to use lxml mainly because of the speed and ease of use. I
> don't know enough to know what direction to pursue. I'm reading in multiple
> directions on xml and different xml packages and trying some sample code.

You'll want to read the lxml tutorial, and maybe also this one:

http://www.nmt.edu/tcc/help/pubs/pylxml/

You may also want to subscribe to the lxml mailing list.

> I've been asked to learn how to write some Python "scripts" as tools to deal
> with XML. I have very meager experience with Python and object
> programming.

Then you'll be happy to hear that Python is quick to learn and lxml is easy to use.

> *With that in mind, do you think lxml is still the way to go

Always. :)

> if I have relatively complex XML schema (stripped down mock up schema
> attached ) that I want to use as a template for finding reference issues in
> XML documents*.

The 'XML schema' you attached is more of a template than what is commonly referred to as 'XML-Schema', which is a schema language that puts structural/value constraints on an XML document.

> And I've just come across some of the discussion of dealing
> with CDATA which the XML I've been given uses.

Don't worry, CDATA is really easy to handle in lxml these days.
You won't even notice it on the way in, and you can wrap your text in the CDATA() factory function to create a CDATA section on the way out. That's all. It's also highly overrated, so you may even be better off not using it at all.

> The basic idea is a schema for a multivolume set of reference books.
>
> For example, if I want to find every footnote and validate the content
> against a file containing all the acceptable authors. I figure that I have
> to read the schema into something, read the XML and use the schema to find
> the "note(s)" and then check the contents against the file of acceptable
> references. I think I have the concept but not the know how. *Any advice
> would be appreciated*.

Finding a specific tag is a very common task, so there are a couple of very efficient ways to do that - mostly a trade-off between speed and expressiveness. Start with the simple and fast .iter() or .find() methods, and switch to the powerful XPath language when you need it.

I would advise against a totally generic approach based on reading and evaluating the template. Just write your code for exactly the XML format that your template describes. It'll be short enough to adapt it to other formats when the need arises.

Since you didn't mention what volume of data you have to handle here ('references' can be anything from a few bibtex entries in a personal list to a complete list of all books listed by Amazon), it's hard to give any advice on a suitable approach. You may get away with parsing the author file on the fly, or you may have to build an index from it. The different dbm databases in Python's standard library will help with the latter.

If you need more help, feel free to ask on the lxml mailing list or in the comp.lang.python newsgroup (for all sorts of Python related questions).
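To make the .iter()/.find() advice a bit more concrete, here is a minimal sketch of the footnote check. The element names (note, author), the sample document and the set of accepted authors are all invented for illustration, since the real format was not posted. The import falls back to the standard library's ElementTree, which offers the same basic calls, in case lxml is not installed.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same basic API

# Hypothetical document and author list, invented for illustration only.
XML = """<book>
  <chapter>
    <para>Some text.<note><author>Smith</author></note></para>
    <para>More text.<note><author>Unknown Person</author></note></para>
  </chapter>
</book>"""

ACCEPTED_AUTHORS = {"Smith", "Jones"}


def find_bad_notes(root, accepted):
    """Return the author names in <note> elements that are not accepted."""
    bad = []
    for note in root.iter("note"):        # every <note>, at any depth
        author = note.find("author")      # first direct <author> child, or None
        name = author.text if author is not None else None
        if name not in accepted:
            bad.append(name)
    return bad


root = etree.fromstring(XML)
print(find_bad_notes(root, ACCEPTED_AUTHORS))
```

For a large author file, the set could be replaced by a lookup in one of the dbm databases mentioned above; the loop itself would stay the same.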
Stefan

From vojta.rylko at seznam.cz Tue Mar 2 11:16:23 2010
From: vojta.rylko at seznam.cz (Vojtěch Rylko)
Date: Tue, 02 Mar 2010 11:16:23 +0100
Subject: [lxml-dev] namespaces
Message-ID: <4B8CE577.2000805@seznam.cz>

Hi!

Can I ask for a little help?

I have xml:
====================
Nouveau stage : Le micashift - Les créations de Nathalie
http://nathalie.creations.over-blog.com/article-nouveau.html
http://nathalie.creations.over-blog.com/article-nouveau.html
Thu, 25 Feb 2010 21:56:24 GMT
http://nathalie.creations.over-blog.com/
Les créations de Nathalie
====================

And code:
====================
from lxml import etree
tree = etree.parse("./data.xml")
print tree.xpath("dc:source")[0].text
====================

But: "lxml.etree.XPathEvalError: Undefined namespace prefix"

Why? How can I use namespace defined in ? I spent two days with it :(

rss.attrib == 'version="2.0"' - nothing about namespaces...

And in the documentation I didn't find anything. Just:
>langNS = etree.FunctionNamespace("http://purl.org/dc/elements/1.1/")
>langNS.prefix = "dc"
which is manual work...

Thanks for your answers!
Vojta

From stefan_ml at behnel.de Tue Mar 2 11:56:06 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Mar 2010 11:56:06 +0100
Subject: [lxml-dev] namespaces
In-Reply-To: <4B8CE577.2000805@seznam.cz>
References: <4B8CE577.2000805@seznam.cz>
Message-ID: <4B8CEEC6.3010101@behnel.de>

Vojtěch Rylko, 02.03.2010 11:16:
> I have xml:
> ====================
> xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:source="http://tailrank.com/ns/#post">
>
> Nouveau stage : Le micashift - Les créations de Nathalie
> http://nathalie.creations.over-blog.com/article-nouveau.html
> http://nathalie.creations.over-blog.com/article-nouveau.html
> Thu, 25 Feb 2010 21:56:24 GMT
> http://nathalie.creations.over-blog.com/
> Les créations de Nathalie
> ====================
>
> And code:
> ====================
> from lxml import etree
> tree = etree.parse("./data.xml")
> print tree.xpath("dc:source")[0].text
> ====================
>
> But: "lxml.etree.XPathEvalError: Undefined namespace prefix"
>
> Why? How can I use namespace defined in ? I spent two days with it :(
>
> rss.attrib == 'version="2.0"' - nothing about namespaces...
>
> And in the documentation I didn't find anything. Just:
> >langNS = etree.FunctionNamespace("http://purl.org/dc/elements/1.1/")
> >langNS.prefix = "dc"
> which is manual work...

Right, the docs aren't particularly clear here.
You have to pass a namespace prefix mapping to XPath, here is an example:

http://codespeak.net/lxml/xpathxslt.html#etxpath

Stefan

From manu3d at gmail.com Tue Mar 2 11:57:46 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Tue, 2 Mar 2010 10:57:46 +0000
Subject: [lxml-dev] namespaces
In-Reply-To: <4B8CE577.2000805@seznam.cz>
References: <4B8CE577.2000805@seznam.cz>
Message-ID: <915dc91d1003020257p32ab6012i6366fb5c6619df7a@mail.gmail.com>

On 2 March 2010 10:16, Vojtěch Rylko wrote:
> And code:
> ====================
> from lxml import etree
> tree = etree.parse("./data.xml")
> print tree.xpath("dc:source")[0].text
> ====================
>
> But: "lxml.etree.XPathEvalError: Undefined namespace prefix"
>
> Why? How can I use namespace defined in ? I spent two days with it :(

Try to use instead:

ns = {"dc": "http://purl.org/dc/elements/1.1/"}
print tree.xpath("dc:source", namespaces=ns)[0].text

Hope it helps!

Manu

From manu3d at gmail.com Tue Mar 2 12:09:56 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Tue, 2 Mar 2010 11:09:56 +0000
Subject: [lxml-dev] namespaces
In-Reply-To: <4B8CEEC6.3010101@behnel.de>
References: <4B8CE577.2000805@seznam.cz> <4B8CEEC6.3010101@behnel.de>
Message-ID: <915dc91d1003020309l2d3691e8ja8ac29ab705284bd@mail.gmail.com>

> Right, the docs aren't particularly clear here. You have to pass a
> namespace prefix mapping to XPath, here is an example:
>
> http://codespeak.net/lxml/xpathxslt.html#etxpath

Especially, I think it catches people by surprise that the namespaces/prefixes defined in the root element are not automatically taken into consideration when dealing with prefixed elements and attributes.
I understand the rationale that any element in an XML tree could completely redefine the set of namespaces/prefixes for its subtree, but couldn't lxml/libxml fetch this information automatically whenever there is no namespace definition on the element where it's needed? Or couldn't the mappings be propagated appropriately to all elements during parsing, maybe in a form that keeps them "under the hood" rather than making them explicit xmlns attributes?

Manu

From manu3d at gmail.com Tue Mar 2 12:11:56 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Tue, 2 Mar 2010 11:11:56 +0000
Subject: [lxml-dev] Fwd: Architecture/best practice question.
In-Reply-To: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com>
References: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com>
Message-ID: <915dc91d1003020311r14602571n85ff0722ef1e10b2@mail.gmail.com>

Sorry folks, I'm posting this message again just in case it slipped under everybody's radar. Please let me know if my question isn't clear enough or if there are problems with any of its premises.

---------- Forwarded message ----------
From: Emanuele D'Arrigo
Date: 26 February 2010 13:13
Subject: Architecture/best practice question.
To: lxml-dev at codespeak.net

Hi everybody,

a bit of a general architecture/best practice question. Say you want to keep in sync an ElementTree with a separate tree structure, one that is parallel but does not have the exact same nodes and yet needs to be informed and updated whenever a change in the ElementTree occurs. ElementTree supports custom elements and I guess it wouldn't be too difficult to override the standard methods of an element to do something before or after any change.
-However-, I understand that ElementProxies cannot store instance-level data as the instances are not persistent and are garbage collected more or less as soon as they are no longer referenced somewhere. So, what I'm wondering is, how do I tell a method of a custom element what object in the parallel structure to inform whenever a change arises? I guess one way would be to store at -class level- (or where else?) a dictionary mapping custom ElementProxies instances to nodes of the parallel structure. In so doing whenever a custom method is executed it can get hold of the parallel structure. Is that a reasonable way to do it or are there better ones?

Manu

From stefan_ml at behnel.de Tue Mar 2 13:26:36 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Mar 2010 13:26:36 +0100
Subject: [lxml-dev] namespaces
In-Reply-To: <915dc91d1003020309l2d3691e8ja8ac29ab705284bd@mail.gmail.com>
References: <4B8CE577.2000805@seznam.cz> <4B8CEEC6.3010101@behnel.de> <915dc91d1003020309l2d3691e8ja8ac29ab705284bd@mail.gmail.com>
Message-ID: <4B8D03FC.7080408@behnel.de>

Emanuele D'Arrigo, 02.03.2010 12:09:
>> Right, the docs aren't particularly clear here. You have to pass a
>> namespace prefix mapping to XPath, here is an example:
>>
>> http://codespeak.net/lxml/xpathxslt.html#etxpath
>>
> Especially, I think it catches people by surprise that the
> namespaces/prefixes defined in the root element are not automatically taken
> into consideration when dealing with prefixed elements and attributes.

That's because many people do not understand the difference between prefixes and namespaces and consequently assign semantics to prefixes that they do not have. Prefixes only exist within a closed context and do not have any meaning outside of their scope. Only namespace URIs are generally meaningful.
Namespace prefixes are a quirk that was introduced at the time to counter data size concerns and to keep the idea alive that XML is text and is as such meant to be read and edited by humans. Reality today is that compression solves the data size issue much better than namespace prefixes. And most XML that is meant to be textually edited by humans uses an estimated average of one namespace, often none at all, so prefixes actually aren't all that helpful for readability. However, as the recurrence of this thread shows, the confusion they produce is certainly a reality, and it's well stimulated by the notion of "well-known namespace prefixes".

> I understand the rationale that any element in an xml tree could completely
> redefine the set of namespaces/prefixes for its subtree, but couldn't
> lxml/libxml fetch this information automatically whenever there is no
> namespace definition on the element where it's needed?

Even the root element could define /any/ prefix for a namespace, including the empty prefix, so your code would break at the very first input document that uses a different prefix than what you anticipated. Reading definitions from the document would therefore encourage users to write extremely fragile code that fails to work for XML in general.

There's the ETXPath class if you don't want to care about prefixes. Anything more than that will not work.

> Or couldn't the
> mappings be propagated appropriately to all elements during parsing, maybe
> in a form that keeps them "under the hood" rather than making them explicit
> xmlns attributes?

They are, see the .prefix and .nsmap properties on Elements.
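A small sketch of why the prefix in the document should not matter: two made-up feeds bind the same Dublin Core namespace URI to different prefixes, and the same lookup code handles both. The feed contents and the helper name sources() are invented for illustration. The mapping passed as namespaces= is the same kind of mapping that xpath() accepts, and the "{uri}tag" form avoids prefixes altogether. The import falls back to the standard library's ElementTree if lxml is not available, since both calls used here exist there as well.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback

DC = "http://purl.org/dc/elements/1.1/"

# Two hypothetical feeds: same namespace URI, different prefixes.
FEED_A = '<rss xmlns:dc="%s"><channel><dc:source>A</dc:source></channel></rss>' % DC
FEED_B = '<rss xmlns:x="%s"><channel><x:source>B</x:source></channel></rss>' % DC

# Our own prefix choice; it does not have to match the documents.
ns = {"dc": DC}


def sources(xml_text):
    """Look up the 'source' elements once via a prefix mapping, once via the URI."""
    root = etree.fromstring(xml_text)
    by_prefix = [el.text for el in root.findall(".//dc:source", namespaces=ns)]
    by_uri = [el.text for el in root.iter("{%s}source" % DC)]
    return by_prefix, by_uri


print(sources(FEED_A))
print(sources(FEED_B))
```

Code that instead read the prefix from the first document it saw would work for one of these feeds and fail for the other.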
Stefan

From stefan_ml at behnel.de Tue Mar 2 14:04:07 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Mar 2010 14:04:07 +0100
Subject: [lxml-dev] namespaces
In-Reply-To: <4B8CEEC6.3010101@behnel.de>
References: <4B8CE577.2000805@seznam.cz> <4B8CEEC6.3010101@behnel.de>
Message-ID: <4B8D0CC7.8000308@behnel.de>

Stefan Behnel, 02.03.2010 11:56:
> Vojtěch Rylko, 02.03.2010 11:16:
>> print tree.xpath("dc:source")[0].text
>>
>> But: "lxml.etree.XPathEvalError: Undefined namespace prefix"
>>
>> Why? How can I use namespace defined in ? I spent two days with it :(
>>
>> rss.attrib == 'version="2.0"' - nothing about namespaces...
>>
>> And in the documentation I didn't find anything. Just:
>> >langNS = etree.FunctionNamespace("http://purl.org/dc/elements/1.1/")
>> >langNS.prefix = "dc"
>> which is manual work...
>
> Right, the docs aren't particularly clear here.

Could you check if this is clear enough now?

http://codespeak.net/lxml/xpathxslt.html#namespaces-and-prefixes

Stefan

From manu3d at gmail.com Tue Mar 2 14:50:49 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Tue, 2 Mar 2010 13:50:49 +0000
Subject: [lxml-dev] namespaces
In-Reply-To: <4B8D03FC.7080408@behnel.de>
References: <4B8CE577.2000805@seznam.cz> <4B8CEEC6.3010101@behnel.de> <915dc91d1003020309l2d3691e8ja8ac29ab705284bd@mail.gmail.com> <4B8D03FC.7080408@behnel.de>
Message-ID: <915dc91d1003020550t7a9779e0gffd4334025c3e8b4@mail.gmail.com>

On 2 March 2010 12:26, Stefan Behnel wrote:
> Even the root element could define /any/ prefix for a namespace, including
> the empty prefix, so your code would break at the very first input document
> that uses a different prefix than what you anticipated. Reading definitions
> from the document would therefore encourage users to write extremely
> fragile code that fails to work for XML in general.

I understand what you are saying.
> > Or couldn't the mappings be propagated appropriately to all elements
> > during parsing, maybe in a form that keeps them "under the hood" rather
> > than making them explicit xmlns attributes?
>
> They are, see the .prefix and .nsmap properties on Elements.

Ooops. You are right. I forgot about that. Thank you for taking some time further detailing the issue!

Manu

From stefan_ml at behnel.de Tue Mar 2 17:24:42 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Mar 2010 17:24:42 +0100
Subject: [lxml-dev] lxml 2.2.6 released
Message-ID: <4B8D3BCA.2060102@behnel.de>

Hi,

I just released lxml 2.2.6. The only change is a fix for lxml 2.2.5, which failed to import in Python 3. The problem was an incompatible change in Cython 0.12 which is not yet supported by the 2.2 series branch. This release was therefore built using Cython 0.11.3.

Sorry for the inconvenience.

Stefan

2.2.6 (2010-03-02)

Bugs fixed

* Fixed several Python 3 regressions by building with Cython 0.11.3.

From jholg at gmx.de Wed Mar 3 08:53:31 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 03 Mar 2010 08:53:31 +0100
Subject: [lxml-dev] Fwd: Architecture/best practice question.
In-Reply-To: <915dc91d1003020311r14602571n85ff0722ef1e10b2@mail.gmail.com>
References: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com> <915dc91d1003020311r14602571n85ff0722ef1e10b2@mail.gmail.com>
Message-ID: <20100303075331.308120@gmx.net>

Hi,

> a bit of a general architecture/best practice question. Say you want to
> keep in sync an ElementTree with a separate tree structure, one that is
> parallel but does not have the exact same nodes and yet needs to be
> informed and updated whenever a change in the ElementTree occurs.
> ElementTree supports custom elements and I guess it wouldn't be too
> difficult to override the standard methods of an element to do something
> before or after any change.
>
> -However-, I understand that ElementProxies cannot store instance-level
> data as the instances are not persistent and are garbage collected more or
> less as soon as they are no longer referenced somewhere. So, what I'm
> wondering is, how do I tell a method of a custom element what object in the
> parallel structure to inform whenever a change arises? I guess one way
> would be to store at -class level- (or where else?) a dictionary mapping
> custom ElementProxies instances to nodes of the parallel structure. In so
> doing whenever a custom method is executed it can get hold of the parallel
> structure. Is that a reasonable way to do it or are there better ones?

Just an idea: If it's ok for you to actually modify the referencing tree, what about storing a unique XPath to the referenced node of the other tree, e.g. in an XML attribute?

Holger

From stefan_ml at behnel.de Wed Mar 3 08:58:49 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 03 Mar 2010 08:58:49 +0100
Subject: [lxml-dev] Architecture/best practice question.
In-Reply-To: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com>
References: <915dc91d1002260513o52e92d2dwa7a4739fb77cbfbb@mail.gmail.com>
Message-ID: <4B8E16B9.7070301@behnel.de>

Emanuele D'Arrigo, 26.02.2010 14:13:
> a bit of a general architecture/best practice question. Say you want to keep
> in sync an ElementTree with a separate tree structure, one that is parallel
> but does not have the exact same nodes and yet needs to be informed and
> updated whenever a change in the ElementTree occurs.
> ElementTree supports custom elements and I guess it wouldn't be too
> difficult to override the standard methods of an element to do something
> before or after any change.
>
> -However-, I understand that ElementProxies cannot store instance-level data
> as the instances are not persistent and are garbage collected more or less
> as soon as they are no longer referenced somewhere. So, what I'm wondering
> is, how do I tell a method of a custom element what object in the parallel
> structure to inform whenever a change arises? I guess one way would be to
> store at -class level- (or where else?) a dictionary mapping custom
> ElementProxies instances to nodes of the parallel structure. In so doing
> whenever a custom method is executed it can get hold of the parallel
> structure. Is that a reasonable way to do it or are there better ones?

I don't feel I have enough information to give good advice. I'm not aware of any ready-made solution here, so the 'best practices' certainly lack statistical evidence, but you can try to use the generated XPath of the Element as a key, or keep the Element proxies alive, as you already suggested (but I expect that to be more cumbersome).

http://codespeak.net/lxml/xpathxslt.html#generating-xpath-expressions

Stefan

From jholg at gmx.de Wed Mar 3 09:15:24 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 03 Mar 2010 09:15:24 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml
In-Reply-To: <20100203233309.15250@gmx.net>
References: <20100203233309.15250@gmx.net>
Message-ID: <20100303081524.164310@gmx.net>

Hi,

> Committed to trunk:
> https://codespeak.net/viewvc/?view=rev&revision=71090
>
> This simply exposes the skeleton xslt steps and the validation result
> xpath as class attributes.
>
> I consider the iso-schematron works pretty much finished for now...
Nevertheless, I've been thinking about this a bit more: What about generalising the approach to an "XSLValidator" approach that basically
- uses a (series of) "validating" XSLT transformation(s)
- to create an output document (an output "report")
- based on which actual input validity gets determined by an XPath expression.

The isoschematron Schematron class would then be some specialization of this. Not sure how to support the separate steps with their own parameters to create the "validating XSLT" from the input schema, though. Maybe we'd need to separate this into some worker "XSLProvider" class.

Holger

From stefan_ml at behnel.de Wed Mar 3 09:34:21 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 03 Mar 2010 09:34:21 +0100
Subject: [lxml-dev] [Bug 488222] Feature request: add better schematron support to lxml
In-Reply-To: <20100303081524.164310@gmx.net>
References: <20100203233309.15250@gmx.net> <20100303081524.164310@gmx.net>
Message-ID: <4B8E1F0D.90802@behnel.de>

jholg at gmx.de, 03.03.2010 09:15:
>> I consider the iso-schematron works pretty much finished for now...
>
> Nevertheless, I've been thinking about this a bit more:
> What about generalising the approach to an "XSLValidator" approach that basically
> - uses a (series of) "validating" XSLT transformation(s)
> - to create an output document (an output "report")
> - based on which actual input validity gets determined by an XPath expression.
>
> The isoschematron Schematron class would then be some specialization of
> this. Not sure how to support the separate steps with their own
> parameters to create the "validating XSLT" from the input schema,
> though. Maybe we'd need to separate this into some worker "XSLProvider"
> class.
While this sounds interesting in general, I'm not sure if there is a) a use case for this (Schematron is likely as good as you can get anyway), and b) enough to generalise. The above things aren't hard to implement (XPath is two lines, XSLT maybe three), so before we have a second schema language / use case that shows what the truly overlapping parts are, I don't see how to extract a good abstraction from this.

That doesn't mean I'm against such an idea. Wouldn't be the first time that new abstractions lead to new ideas. I'm just suggesting that it's the kind of idea that needs an implementation with examples before evaluating the design.

Stefan

From vojta.rylko at seznam.cz Wed Mar 3 20:50:36 2010
From: vojta.rylko at seznam.cz (Vojtěch Rylko)
Date: Wed, 03 Mar 2010 20:50:36 +0100
Subject: [lxml-dev] namespaces
In-Reply-To: <915dc91d1003020550t7a9779e0gffd4334025c3e8b4@mail.gmail.com>
References: <4B8CE577.2000805@seznam.cz> <4B8CEEC6.3010101@behnel.de> <915dc91d1003020309l2d3691e8ja8ac29ab705284bd@mail.gmail.com> <4B8D03FC.7080408@behnel.de> <915dc91d1003020550t7a9779e0gffd4334025c3e8b4@mail.gmail.com>
Message-ID: <4B8EBD8C.1020107@seznam.cz>

Thanks, Emanuele and Stefan. My result is http://pastebin.com/zsCnznZG

Vojtěch Rylko

On 2.3.2010 14:50, Emanuele D'Arrigo wrote:
> On 2 March 2010 12:26, Stefan Behnel wrote:
>
> > Even the root element could define /any/ prefix for a namespace,
> > including the empty prefix, so your code would break at the very first
> > input document that uses a different prefix than what you anticipated.
> > Reading definitions from the document would therefore encourage users
> > to write extremely fragile code that fails to work for XML in general.
>
> I understand what you are saying.
>
> > > Or couldn't the mappings be propagated appropriately to all elements
> > > during parsing, maybe in a form that keeps them "under the hood"
> > > rather than making them explicit xmlns attributes?
> > They are, see the .prefix and .nsmap properties on Elements.
>
> Ooops. You are right. I forgot about that. Thank you for taking some
> time further detailing the issue!
>
> Manu

From james.johnston at lifeway.com Thu Mar 4 22:27:30 2010
From: james.johnston at lifeway.com (James Johnston)
Date: Thu, 4 Mar 2010 15:27:30 -0600
Subject: [lxml-dev] Converting HTML to XML
Message-ID: <710c80bb1003041327p6407379eud707eba5b0cedc30@mail.gmail.com>

I've been given an assignment to use HTML content as input to create XML. What are a few good resources? This is all pretty new to me so I apologize up front before I ask really obvious questions.

I have a general and fuzzy idea about what this might entail like importing lxml.html.

Conceptually, I want to read an HTML document, strip out or ignore parts of it, convert HTML tags to XML for certain references, and put the associated data into CDATA format and output an XML file.

Effectively this will be used to add cross-reference comments to an existing XML document on the same basic subject. For example, the existing XML may have a section on circular saws, and the HTML document has a short article on circular saws that we want to link to in the new XML format.

Thanks
--
James Johnston
james.johnston at lifeway.com

From manu3d at gmail.com Fri Mar 5 01:19:04 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Fri, 5 Mar 2010 00:19:04 +0000
Subject: [lxml-dev] Converting HTML to XML
In-Reply-To: <710c80bb1003041327p6407379eud707eba5b0cedc30@mail.gmail.com>
References: <710c80bb1003041327p6407379eud707eba5b0cedc30@mail.gmail.com>
Message-ID: <915dc91d1003041619v5af9030amfa550d551d6ec2fb@mail.gmail.com>

On 4 March 2010 21:27, James Johnston wrote:
> I've been given an assignment to use HTML content as input to create XML.
> What are a few good resources? This is all pretty new to me so I apologize
> up front before I ask really obvious questions.

It's not so much obvious but it is a little broad. ;)

One thing you could do to help us help you is to post a short sample of your input and output. But generally speaking there are at least a couple of things that you could do.

1) Create an appropriate XSLT file to transform your input HTML into your output XML. A good starting point is *this tutorial* on XSLT. This would limit your python/lxml scripting to a minimum but would concentrate complexity on the XSLT file.

2) You can use lxml to parse the input file and walk through the tree to generate the output you need. This way you can skip XSLT completely, but you'll need to get familiar with lxml, and *this tutorial* should be a good starting point.

Hope it helps!

Manu

From james.johnston at lifeway.com Fri Mar 5 16:27:32 2010
From: james.johnston at lifeway.com (James Johnston)
Date: Fri, 5 Mar 2010 09:27:32 -0600
Subject: [lxml-dev] Converting HTML to XML
Message-ID: <710c80bb1003050727u11a4d4sf1efc38d0c609650@mail.gmail.com>

Thanks Manu.
I've been trying to add to my meager Python knowledge and working with the lxml tutorial already. I'll look at the XSLT stuff you sent as well. I appreciate your help and guidance.

Jim.
--
James Johnston
(615) 251-2792
james.johnston at lifeway.com

From nicolas at nexedi.com Fri Mar 5 17:20:54 2010
From: nicolas at nexedi.com (Nicolas Delaby)
Date: Fri, 05 Mar 2010 17:20:54 +0100
Subject: [lxml-dev] Comments stripping issue during html parsing
Message-ID: <4B912F66.4030103@nexedi.com>

Hi,

I wonder why the code below does not strip the first comment (inside the style node). Is there something particular about style nodes that I missed?

from lxml import etree
from lxml.etree import HTMLParser

parser = HTMLParser(remove_comments=True)
html_string = ''
print etree.tostring(etree.HTML(html_string, parser=parser), pretty_print=True)

Best regards,
Nicolas
--
Nicolas Delaby
Nexedi: Consulting and Development of Libre / Open Source Software
http://www.nexedi.com/

From stefan_ml at behnel.de Fri Mar 5 20:58:07 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 05 Mar 2010 20:58:07 +0100
Subject: [lxml-dev] Comments stripping issue during html parsing
In-Reply-To: <4B912F66.4030103@nexedi.com>
References: <4B912F66.4030103@nexedi.com>
Message-ID: <4B91624F.6080408@behnel.de>

Nicolas Delaby, 05.03.2010 17:20:
> I wonder why the code below does not strip the first comment (inside the style node).
> Is there something particular about style nodes that I missed?
>
> from lxml import etree
> from lxml.etree import HTMLParser
>
> parser = HTMLParser(remove_comments=True)
> html_string = ''
> print etree.tostring(etree.HTML(html_string, parser=parser), pretty_print=True)

It is (or was, for quite a while) common to 'comment out' style information inside of style tags to hide it from old browsers that don't support the tag. That's even presented in the HTML4 spec.

http://www.w3.org/TR/html4/present/styles.html#h-14.5

It's therefore questionable if comment markup surrounding style sheet data is really meant to be a comment in the first place. In any case, the content of the style tag is application specific text, not markup.

http://www.w3.org/TR/REC-html40/present/styles.html#edef-STYLE

Since your example above does not even provide the required content type for the style content (spec: "Authors must supply a value for this attribute; there is no default value for this attribute."), parsing it as an HTML comment would be a rather arbitrary decision to make. It could potentially be anything, including base64 encoded binary data, for example.

Stefan

From sergio at sergiomb.no-ip.org Fri Mar 5 22:55:07 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Fri, 05 Mar 2010 21:55:07 +0000
Subject: [lxml-dev] copy a xpath from an element to an element without double copies
Message-ID: <1267826107.14419.18.camel@segulix>

Hi,

from lxml import html
f = open("teste.html").read()
html_document = html.fromstring(f)
elems=html_document.xpath('//h1|//div[@id="articleTitle"]')
text=""
for frags in elems:
    text += html.tostring(frags, method="html", encoding="utf-8")
elemfinal = None
for frags in elems:
    if elemfinal is None:
        elemfinal = frags
    else:
        elemfinal.append(frags)
text2 = html.tostring(elemfinal, method="html", encoding="utf-8")

For this sample teste.html:

head1
head2
for discard
---------------------------
print text
head1
head2
head2
doubles
head2
and print text2
head1
head2
put
head2
out of div. How I copy correctly html_document.xpath('//h1|//div[@id="articleTitle"]') to another element ? Thanks, -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100305/0a21e951/attachment.bin From marcello at perathoner.de Sat Mar 6 08:04:37 2010 From: marcello at perathoner.de (Marcello Perathoner) Date: Sat, 06 Mar 2010 08:04:37 +0100 Subject: [lxml-dev] Comments stripping issue during html parsing In-Reply-To: <4B912F66.4030103@nexedi.com> References: <4B912F66.4030103@nexedi.com> Message-ID: <4B91FE85.3090508@perathoner.de> Nicolas Delaby wrote: > Hi, > I'm wonder why the code above do not strip the first comment (inside style node). > Is there is something particular with style nodes that i missed ? Yes. Style elements contain CDATA according to the HTML 4.01 spec. http://www.w3.org/TR/html401/types.html#type-cdata > > from lxml import etree > from lxml.etree import HTMLParser > > parser = HTMLParser(remove_comments=True) > html_string = '' > print etree.tostring(etree.HTML(html_string, parser=parser), pretty_print=True) > > > > > > > Best regards, > Nicolas -- Marcello Perathoner webmaster at gutenberg.org From jameshfisher at gmail.com Sun Mar 7 01:16:45 2010 From: jameshfisher at gmail.com (James Fisher) Date: Sun, 7 Mar 2010 00:16:45 +0000 Subject: [lxml-dev] Request: etree.Element.set() returns self Message-ID: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> Hi, Would people appreciate the convenience of the Element.set() method returning itself, allowing strung-together method calls (jQuery style)? 
I ask because I want to do something like this: etree.SubElement(self.head, "meta").set("http-equiv", "Content-Type").set("content", "text/html; charset=utf-8") Where I can't use http-equiv as a kwarg due to characters, and nor can I put "content" as a kwarg as I want the order of the attributes preserved. Is there a technical reason why we couldn't have this? Does set() return anything currently? (I can't find its definition.) Best James -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100307/4303c20b/attachment.htm From stefan_ml at behnel.de Sun Mar 7 08:27:58 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 07 Mar 2010 08:27:58 +0100 Subject: [lxml-dev] Request: etree.Element.set() returns self In-Reply-To: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> References: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> Message-ID: <4B93557E.2090302@behnel.de> James Fisher, 07.03.2010 01:16: > Would people appreciate the convenience of the Element.set() method > returning itself, allowing strung-together method calls (jQuery style)? I > ask because I want to do something like this: > > etree.SubElement(self.head, "meta").set("http-equiv", > "Content-Type").set("content", "text/html; charset=utf-8") Alternatives are: 1) meta = etree.SubElement(self.head, "meta") meta.set("http-equiv", "Content-Type") meta.set("content", "text/html; charset=utf-8") 2) etree.SubElement(self.head, "meta", attrib={ "http-equiv" : "Content-Type", "content": "text/html; charset=utf-8"}) 3) etree.SubElement(self.head, "meta", content="text/html; charset=utf-8" ).set("http-equiv", "Content-Type") 4) etree.SubElement(self.head, "meta", attrib={ "http-equiv" : "Content-Type"} ).set("content", "text/html; charset=utf-8") Out of these, 1) and 2) look ok to me, and I find 1) the most readable. 
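For reference, alternatives 1) and 2) can be sketched as a small standalone script. This is only an illustration under assumptions: `head` is a stand-in for the poster's `self.head`, and the script relies on nothing beyond what the thread states (that `set()` returns None and modifies the element in place).

```python
from lxml import etree

# Hypothetical parent element standing in for self.head from the thread.
head = etree.Element("head")

# Alternative 1: create the element, then set attributes one by one.
meta = etree.SubElement(head, "meta")
result = meta.set("http-equiv", "Content-Type")  # modifies meta in place
meta.set("content", "text/html; charset=utf-8")

# Alternative 2: pass a dict, which also allows names that are not
# valid Python keyword arguments, such as "http-equiv".
meta2 = etree.SubElement(head, "meta", attrib={
    "http-equiv": "Content-Type",
    "content": "text/html; charset=utf-8",
})

print(result)                # the return value of set()
print(etree.tostring(meta))  # attributes in the order they were set
```

Since `set()` returns None, a chained call such as `meta.set(...).set(...)` fails with an AttributeError on the second call.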
> Where I can't use http-equiv as a kwarg due to characters, and nor can I put > "content" as a kwarg as I want the order of the attributes preserved. 1) and 4) give you that, 3 won't, 2) may or may not. > Is there a technical reason why we couldn't have this? Does set() return > anything currently? (I can't find its definition.) It returns None, as usual, and there is no /technical/ reason why it can't return anything else. However, returning an Element here would imply that the method does not modify the Element in place, but instead returns a new Element that has the modifications applied. This is clearly not the case here, so returning None is the right thing to do. Stefan From stefan_ml at behnel.de Sun Mar 7 11:27:35 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 07 Mar 2010 11:27:35 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1267826107.14419.18.camel@segulix> References: <1267826107.14419.18.camel@segulix> Message-ID: <4B937F97.6000005@behnel.de> Sergio Monteiro Basto, 05.03.2010 22:55: > from lxml import html > f = open("teste.html").read() > html_document = html.fromstring(f) > elems=html_document.xpath('//h1|//div[@id="articleTitle"]') > text="" > for frags in elems: > text += html.tostring(frags, method="html", encoding="utf-8") > > elemfinal = None > for frags in elems: > if elemfinal is None: > elemfinal = frags > else: > elemfinal.append(frags) > text2 = html.tostring(elemfinal, method="html", encoding="utf-8") This code looks seriously flawed to me. Could you explain why you are concatenating serialised HTML fragments here? What is it that you are trying to achieve? Note that the .append() method *moves* the element to the new position. If you want to copy it, use the copy module to create a deep copy of the element before moving it over. 
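A minimal sketch of the difference between moving and copying, using a made-up three-element document rather than the poster's teste.html:

```python
import copy
from lxml import etree

# Made-up sample document, only for illustration.
root = etree.fromstring("<root><a><h1>head1</h1></a><b/></root>")
a, b = root
h1 = a[0]

# append() moves the existing element: it leaves <a> and ends up in <b>.
b.append(h1)
after_move = etree.tostring(root)

# Appending a deep copy leaves the original where it is.
a.append(copy.deepcopy(h1))
after_copy = etree.tostring(root)

print(after_move)
print(after_copy)
```

In the first step `<a>` is left empty because the same element cannot live in two places; in the second step both `<a>` and `<b>` contain their own `<h1>`.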
Stefan From stefan_ml at behnel.de Mon Mar 8 07:45:33 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Mar 2010 07:45:33 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1267991224.5590.2.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> Message-ID: <4B949D0D.5080903@behnel.de> Sergio Monteiro Basto, 07.03.2010 20:47: > On Sun, 2010-03-07 at 11:27 +0100, Stefan Behnel wrote: >> Sergio Monteiro Basto, 05.03.2010 22:55: >>> from lxml import html >>> f = open("teste.html").read() >>> html_document = html.fromstring(f) >>> elems=html_document.xpath('//h1|//div[@id="articleTitle"]') >>> text="" >>> for frags in elems: >>> text += html.tostring(frags, method="html", encoding="utf-8") >>> >>> elemfinal = None >>> for frags in elems: >>> if elemfinal is None: >>> elemfinal = frags >>> else: >>> elemfinal.append(frags) >>> text2 = html.tostring(elemfinal, method="html", encoding="utf-8") >> >> This code looks seriously flawed to me. Could you explain why you are >> concatenating serialised HTML fragments here? >> >> What is it that you are trying to achieve? > > I want an tree that is a copy of tags

and
> of the original document. Ah, ok. So I guess you want this: import copy new_root = html.Element("div") new_root.extend( map(copy.deepcopy, html_document.xpath('//h1|//div[@id="articleTitle"]'))) Note that the XPath result will be in document order already. >> Note that the .append() method *moves* the element to the new position. If >> you want to copy it, use the copy module to create a deep copy of the >> element before moving it over. > > Copy module here is it ? Sorry, I can't parse that sentence, but I guess you are asking for "import copy". See the stdlib docs. Stefan From sergio at sergiomb.no-ip.org Sun Mar 7 20:49:52 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Sun, 07 Mar 2010 19:49:52 +0000 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <4B937F97.6000005@behnel.de> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> Message-ID: <1267991393.5590.5.camel@segulix> On Sun, 2010-03-07 at 11:27 +0100, Stefan Behnel wrote: > Sergio Monteiro Basto, 05.03.2010 22:55: > > from lxml import html > > f = open("teste.html").read() > > html_document = html.fromstring(f) > > elems=html_document.xpath('//h1|//div[@id="articleTitle"]') > > text="" > > for frags in elems: > > text += html.tostring(frags, method="html", encoding="utf-8") > > > > elemfinal = None > > for frags in elems: > > if elemfinal is None: > > elemfinal = frags > > else: > > elemfinal.append(frags) > > text2 = html.tostring(elemfinal, method="html", encoding="utf-8") > > This code looks seriously flawed to me. Could you explain why you are > concatenating serialised HTML fragments here? > > What is it that you are trying to achieve? Sorry try, explain better I want an tree that represent the xpath (in this example //h1|//div[@id="articleTitle"]) of the original document. > > Note that the .append() method *moves* the element to the new position. 
If > you want to copy it, use the copy module to create a deep copy of the > element before moving it over. > > Stefan -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3293 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100307/33958ffb/attachment-0001.bin From sergio at sergiomb.no-ip.org Sun Mar 7 20:47:04 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Sun, 07 Mar 2010 19:47:04 +0000 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <4B937F97.6000005@behnel.de> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> Message-ID: <1267991224.5590.2.camel@segulix> On Sun, 2010-03-07 at 11:27 +0100, Stefan Behnel wrote: > Sergio Monteiro Basto, 05.03.2010 22:55: > > from lxml import html > > f = open("teste.html").read() > > html_document = html.fromstring(f) > > elems=html_document.xpath('//h1|//div[@id="articleTitle"]') > > text="" > > for frags in elems: > > text += html.tostring(frags, method="html", encoding="utf-8") > > > > elemfinal = None > > for frags in elems: > > if elemfinal is None: > > elemfinal = frags > > else: > > elemfinal.append(frags) > > text2 = html.tostring(elemfinal, method="html", encoding="utf-8") > > This code looks seriously flawed to me. Could you explain why you are > concatenating serialised HTML fragments here? > > What is it that you are trying to achieve? I want an tree that is a copy of tags

and
of the original document. > > Note that the .append() method *moves* the element to the new position. If > you want to copy it, use the copy module to create a deep copy of the > element before moving it over. Copy module here is it ? > > Stefan -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3293 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100307/53ef8177/attachment.bin From jameshfisher at gmail.com Sun Mar 7 19:08:42 2010 From: jameshfisher at gmail.com (James Fisher) Date: Sun, 7 Mar 2010 18:08:42 +0000 Subject: [lxml-dev] Request: etree.Element.set() returns self In-Reply-To: <4B93557E.2090302@behnel.de> References: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> <4B93557E.2090302@behnel.de> Message-ID: <771da05a1003071008x4ff89a18n1d4db984009aad58@mail.gmail.com> Hi Stefan, Thanks for the reply. I didn't find attrib={} (ended up using **{} instead, but I'll switch). I appreciate your point about the implication that modification is not in-place. I guess that idiom doesn't apply in the Javascript world I've been in lately. More importantly, I've now discovered that lxml actually doesn't store element attributes in order (I noted earlier I needed this order preserved). Presumably a Python dictionary or some equivalent of it is being used. Is this a limitation or by design (does XML specify that attributes are unordered)? I note Python 3.1 has an ordered dictionary class; don't know if that's something for the future (again I'm quite oblivious to the Python-C relationship and where the code is). The reason that this non-orderedness is unfortunate for me is that it creates awkward constructions like this when I use etree.tostring in constructing HTML: Where I would much prefer that the attribute name came before the value. Is there a workaround for my situation? 
Best James On Sun, Mar 7, 2010 at 7:27 AM, Stefan Behnel wrote: > James Fisher, 07.03.2010 01:16: > > Would people appreciate the convenience of the Element.set() method >> returning itself, allowing strung-together method calls (jQuery style)? I >> ask because I want to do something like this: >> >> etree.SubElement(self.head, "meta").set("http-equiv", >> "Content-Type").set("content", "text/html; charset=utf-8") >> > > Alternatives are: > > 1) > > meta = etree.SubElement(self.head, "meta") > > meta.set("http-equiv", "Content-Type") > meta.set("content", "text/html; charset=utf-8") > > 2) > > etree.SubElement(self.head, "meta", attrib={ > "http-equiv" : "Content-Type", > > "content": "text/html; charset=utf-8"}) > > 3) > > etree.SubElement(self.head, "meta", > > content="text/html; charset=utf-8" > ).set("http-equiv", "Content-Type") > > 4) > > etree.SubElement(self.head, "meta", attrib={ > > "http-equiv" : "Content-Type"} > ).set("content", "text/html; charset=utf-8") > > Out of these, 1) and 2) look ok to me, and I find 1) the most readable. > > > > Where I can't use http-equiv as a kwarg due to characters, and nor can I >> put >> "content" as a kwarg as I want the order of the attributes preserved. >> > > 1) and 4) give you that, 3 won't, 2) may or may not. > > > > Is there a technical reason why we couldn't have this? Does set() return >> anything currently? (I can't find its definition.) >> > > It returns None, as usual, and there is no /technical/ reason why it can't > return anything else. > > However, returning an Element here would imply that the method does not > modify the Element in place, but instead returns a new Element that has the > modifications applied. This is clearly not the case here, so returning None > is the right thing to do. > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100307/b527a5fa/attachment.htm From sergio at sergiomb.no-ip.org Mon Mar 8 13:29:50 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Mon, 08 Mar 2010 12:29:50 +0000 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <4B949D0D.5080903@behnel.de> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> Message-ID: <1268051391.22828.1.camel@segulix> On Mon, 2010-03-08 at 07:45 +0100, Stefan Behnel wrote: > Sergio Monteiro Basto, 07.03.2010 20:47: > > On Sun, 2010-03-07 at 11:27 +0100, Stefan Behnel wrote: > >> Sergio Monteiro Basto, 05.03.2010 22:55: > >>> from lxml import html > >>> f = open("teste.html").read() > >>> html_document = html.fromstring(f) > >>> elems=html_document.xpath('//h1|//div[@id="articleTitle"]') > >>> text="" > >>> for frags in elems: > >>> text += html.tostring(frags, method="html", encoding="utf-8") > >>> > >>> elemfinal = None > >>> for frags in elems: > >>> if elemfinal is None: > >>> elemfinal = frags > >>> else: > >>> elemfinal.append(frags) > >>> text2 = html.tostring(elemfinal, method="html", encoding="utf-8") > >> > >> This code looks seriously flawed to me. Could you explain why you are > >> concatenating serialised HTML fragments here? > >> > >> What is it that you are trying to achieve? > > > > I want an tree that is a copy of tags

and
> > of the original document. > > Ah, ok. So I guess you want this: > > import copy > new_root = html.Element("div") > new_root.extend( map(copy.deepcopy, > html_document.xpath('//h1|//div[@id="articleTitle"]'))) > > Note that the XPath result will be in document order already. > > > >> Note that the .append() method *moves* the element to the new position. If > >> you want to copy it, use the copy module to create a deep copy of the > >> element before moving it over. > > > > Copy module where is it ? > > Sorry, I can't parse that sentence, but I guess you are asking for "import > copy". See the stdlib docs. yes, s/here/where/ Thanks very much , I will test it . > Stefan -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100308/355766c9/attachment.bin From sergio at sergiomb.no-ip.org Mon Mar 8 14:27:57 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Mon, 08 Mar 2010 13:27:57 +0000 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1268051391.22828.1.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> Message-ID: <1268054877.22828.31.camel@segulix> On Mon, 2010-03-08 at 12:29 +0000, Sergio Monteiro Basto wrote: > > import copy > > new_root = html.Element("div") > > new_root.extend( map(copy.deepcopy, > > html_document.xpath('//h1|//div[@id="articleTitle"]'))) No , it does exactly the same problem. I will show another example: from lxml import html html_document = html.fromstring ("""

something that is not a div

title

, 13:36
""") # div id="articleTitle" have divs inside. new_root = html.Element("body") new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) print html.tostring(new_root, method="html", encoding="utf-8") --------> print(html.tostring(new_root, method="html", encoding="utf-8"))

title

, 13:36
, 13:36
You see, the divs inside the div are repeated ... Well, I think I found one solution (I'll explain later), but something that returns the whole tree correctly, instead of fragments, would be nice. Thanks, -- Sérgio M. B. From jholg at gmx.de Mon Mar 8 15:13:58 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 08 Mar 2010 15:13:58 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1268054877.22828.31.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> Message-ID: <20100308141358.28990@gmx.net> Hi, > No , it does exactly the same problem. > > I will show another example: > > from lxml import html > html_document = html.fromstring (""" >

something that is not a div

>
> > >

title

>
, > 13:36
>
""") # div id="articleTitle" have divs inside. > > new_root = html.Element("body") > new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) Deviating from what has already been proposed your XPath selects every div element in the document... > print html.tostring(new_root, method="html", encoding="utf-8") > --------> print(html.tostring(new_root, method="html", > encoding="utf-8")) >
> > >

title

>
, > 13:36
>
> >
, > 13:36
> > > > you see, the divs inside the div, are repeated ... ... so this is to be expected. If you're just interested in
you need to use XPath predicates, like "//div[@id='articleTitle']" Holger From tseaver at palladion.com Mon Mar 8 16:03:36 2010 From: tseaver at palladion.com (Tres Seaver) Date: Mon, 08 Mar 2010 10:03:36 -0500 Subject: [lxml-dev] Request: etree.Element.set() returns self In-Reply-To: <771da05a1003071008x4ff89a18n1d4db984009aad58@mail.gmail.com> References: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> <4B93557E.2090302@behnel.de> <771da05a1003071008x4ff89a18n1d4db984009aad58@mail.gmail.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 James Fisher wrote: > Thanks for the reply. I didn't find attrib={} (ended up using **{} instead, > but I'll switch). I appreciate your point about the implication that > modification is not in-place. I guess that idiom doesn't apply in the > Javascript world I've been in lately. > > More importantly, I've now discovered that lxml actually doesn't store > element attributes in order (I noted earlier I needed this order > preserved). Presumably a Python dictionary or some equivalent of it is > being used. Is this a limitation or by design (does XML specify that > attributes are unordered)? I note Python 3.1 has an ordered dictionary > class; don't know if that's something for the future (again I'm quite > oblivious to the Python-C relationship and where the code is). > > The reason that this non-orderedness is unfortunate for me is that it > creates awkward constructions like this when I use etree.tostring in > constructing HTML: > > > > Where I would much prefer that the attribute name came before the value. > > Is there a workaround for my situation? The order of XML attributes is explicitly not part of the infoset, which means that re-serializing a chunk of parsed XML text is never guaranteed to produce "identical" text. Tres.
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkuVEcgACgkQ+gerLs4ltQ7c8wCdF3SsS9MDQBT60Bf4Wl080QIw wn8AoKPOsP/dSXkO0ZGzGo8q+rnzfbBR =CFXE -----END PGP SIGNATURE----- From sergio at sergiomb.no-ip.org Mon Mar 8 17:23:51 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Mon, 08 Mar 2010 16:23:51 +0000 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <20100308141358.28990@gmx.net> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> <20100308141358.28990@gmx.net> Message-ID: <1268065431.24776.5.camel@segulix> On Mon, 2010-03-08 at 15:13 +0100, jholg at gmx.de wrote: > Hi, > > > No , it does exactly the same problem. > > > > I will show another example: > > > > from lxml import html > > html_document = html.fromstring (""" > >

something that is not a div

> >
> > > > > >

title

> >
, > > 13:36
> >
""") # div id="articleTitle" have divs inside. > > > > new_root = html.Element("body") > > new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) > > Deviating from what has already been proposed your XPath selects every div element in the document... > > > print html.tostring(new_root, method="html", encoding="utf-8") > > --------> print(html.tostring(new_root, method="html", > > encoding="utf-8")) > >
> > > > > >

title

> >
, > > 13:36
> >
> > > >
, > > 13:36
> > > > > > > > you see, the divs inside the div, are repeated ... > > ... so this is to be expected. > > If you're just interested in
you need to use XPath predicates, like "//div[@id='articleTitle']" I know this was an example. In what I am doing, I can't predict, if xpath, match with elements and sub elements or not. I want calculate the "real" xpath, which "should" be without dups. Thanks, > > Holger -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100308/8e28cb56/attachment.bin From stefan_ml at behnel.de Mon Mar 8 17:45:54 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Mar 2010 17:45:54 +0100 Subject: [lxml-dev] Request: etree.Element.set() returns self In-Reply-To: <771da05a1003071008x4ff89a18n1d4db984009aad58@mail.gmail.com> References: <771da05a1003061616r4ef9b1cdic33bea796cdad1e0@mail.gmail.com> <4B93557E.2090302@behnel.de> <771da05a1003071008x4ff89a18n1d4db984009aad58@mail.gmail.com> Message-ID: <4B9529C2.40207@behnel.de> James Fisher, 07.03.2010 19:08: > More importantly, I've now discovered that lxml actually doesn't store > element attributes in order Yes, it does, see my previous post. (And please don't top-post in replies.) Stefan From stefan_ml at behnel.de Mon Mar 8 17:53:41 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Mar 2010 17:53:41 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1268065431.24776.5.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> <20100308141358.28990@gmx.net> <1268065431.24776.5.camel@segulix> Message-ID: <4B952B95.8080903@behnel.de> Sergio Monteiro Basto, 08.03.2010 17:23: >>> I will show another example: >>> >>> from lxml import html >>> html_document = html.fromstring (""" >>>

something that is not a div

>>>
>>> >>> >>>

title

>>>
, >>> 13:36
>>>
""") # div id="articleTitle" have divs inside. >>> >>> new_root = html.Element("body") >>> new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) >> >> Deviating from what has already been proposed your XPath selects every div element in the document... >> >>> print html.tostring(new_root, method="html", encoding="utf-8") >>> --------> print(html.tostring(new_root, method="html", >>> encoding="utf-8")) >>>
>>> >>> >>>

title

>>>
, >>> 13:36
>>>
>>> >>>
, >>> 13:36
>>> >>> >>> >>> you see, the divs inside the div, are repeated ... >> >> ... so this is to be expected. >> >> If you're just interested in
you need to use XPath predicates, like "//div[@id='articleTitle']" > > I know this was an example. > > In what I am doing, I can't predict, if xpath, match with elements and > sub elements or not. I want calculate the "real" xpath, which "should" > be without dups. Well, there are no duplicates, according to your XPath expression. Each node that was found was only copied over once, but including its children. Please clarify what you consider a duplicate, then we might be able to help you find a suitable XPath expression. If that's not possible, consider eliminating the unwanted nodes yourself by traversing the trees that were found and comparing them to the other nodes. Stefan From nicolas at nexedi.com Thu Mar 11 10:37:33 2010 From: nicolas at nexedi.com (Nicolas Delaby) Date: Thu, 11 Mar 2010 10:37:33 +0100 Subject: [lxml-dev] Comments stripping issue during html parsing In-Reply-To: <4B91624F.6080408@behnel.de> References: <4B912F66.4030103@nexedi.com> <4B91624F.6080408@behnel.de> Message-ID: <4B98B9DD.1010107@nexedi.com> Stefan Behnel a ?crit : > Nicolas Delaby, 05.03.2010 17:20: >> I'm wonder why the code above do not strip the first comment (inside >> style node). >> Is there is something particular with style nodes that i missed ? >> >> from lxml import etree >> from lxml.etree import HTMLParser >> >> parser = HTMLParser(remove_comments=True) >> html_string = '' >> print etree.tostring(etree.HTML(html_string, parser=parser), >> pretty_print=True) >> >> >> >> >> > > It is (or was, for quite a while) common to 'comment out' style > information inside of style tags to hide it from old browsers that don't > support the tag. That's even presented in the HTML4 spec. > > http://www.w3.org/TR/html4/present/styles.html#h-14.5 > > It's therefore questionable if comment markup surrounding style sheet > data is really meant to be a comment in the first place. > > In any case, the content of the style tag is application specific text, > not markup. 
> > http://www.w3.org/TR/REC-html40/present/styles.html#edef-STYLE > > Since your example above does not even provide the required content type > for the style content (spec: "Authors must supply a value for this > attribute; there is no default value for this attribute."), parsing it > as an HTML comment would be a rather arbitrary decision to make. It > could potentially be anything, including base64 encoded binary data, for > example. Hi Stefan, Thanks a lot for all your complete explanation. I wasn't aware of those two particularity things. My goal is to cleanup dirty HTML input in order to give it in a safe mode by stripping javascript, some tags and so on... For this particular use case I will update the Parser. Regards, Nicolas -- Nicolas Delaby Nexedi: Consulting and Development of Libre / Open Source Software http://www.nexedi.com/ From kessle10 at gmail.com Thu Mar 11 20:58:27 2010 From: kessle10 at gmail.com (James Kessler) Date: Thu, 11 Mar 2010 13:58:27 -0600 Subject: [lxml-dev] Add stylesheet PI to xml doc Message-ID: <2e70f9031003111158lf900f48r87de2177d32277cf@mail.gmail.com> Hello, Is there a simple way to prepend an xml-stylesheet PI to an document? Thanks, James -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100311/3fac8c9a/attachment.htm From sergio at sergiomb.no-ip.org Fri Mar 12 01:23:12 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Fri, 12 Mar 2010 00:23:12 +0000 Subject: [lxml-dev] bug in id function ? Message-ID: <1268353392.11635.9.camel@segulix> Hi, strxpath = 'id("tit")|id("det")' divs=etree_document.xpath(strxpath) elembody = html.Element("body") print divs[0] elembody.append(divs[0]) # move the id to other tree. divs=etree_document.xpath(strxpath) # calculate again print divs[0] # the element is the same, but was delete from etree_document #with strxpath = '//*[@id="titNot" or @id="detNot"]' #the things works like expect. 
divs=etree_document.xpath(strxpath) elembody = html.Element("body") print divs[0] elembody.append(divs[0]) # move the id to other tree. divs=etree_document.xpath(strxpath) # calculate again print divs[0] # is the other element Thanks, -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100312/994033ad/attachment-0001.bin From dakota at brokenpipe.ru Fri Mar 12 12:12:05 2010 From: dakota at brokenpipe.ru (Marat Dakota) Date: Fri, 12 Mar 2010 14:12:05 +0300 Subject: [lxml-dev] XSLT extension elements appending text to output_parent problem Message-ID: Hi, I've got a problem. If I have XSLT like: ... Test1
Test2
... And my XSLT extension element's code looks like: ... def execute(self, context, self_node, input_node, output_parent): # I just want to copy extension element's text to result output_parent.text += self_node.text ... The thing I want to get in result is: ... Test1
Test2 ... But I get: Test1Test2
And in theory of texts and tails it's pretty ok. The problem is when the things come to processing second extension element, output_parent contains
element already and I have to set it's tail to get desired result. But tail is read only. I see two possible solutions: 1) make tail of output_parent's children writable; 2) create some element class that will be turned to text during serialization to be able to do things like: ... def execute(self, context, self_node, input_node, output_parent): output_parent.append(etree.TextElement(self_node.text)) ... -- Marat -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100312/07a64d7d/attachment.htm From jrogers98 at gmail.com Tue Mar 16 23:46:28 2010 From: jrogers98 at gmail.com (Jennifer Rogers) Date: Tue, 16 Mar 2010 16:46:28 -0600 Subject: [lxml-dev] Namespace redundancy in new elements Message-ID: <34D386D2-216F-4FBB-B264-0D44AD3F414E@gmail.com> Hello, I'm using lxml to build Excel spreadsheets in the Office Open XML format. I've found that it is particularly picky about using specific namespace prefixes, even though prefixes aren't supposed to matter in XML so long as the URI is consistent. I've been able to get around this in most cases using namespace mappings, but I got tripped up in one particular instance. The root element of one of my xml documents looks like this: ------------------------- ------------------------- These namespace prefixes are inherited perfectly fine by the rest of the document in any instance that uses no prefixes or that has a tag that uses a prefix. However, I'm getting stuck when I try to create a subelement that has an attribute that uses a prefix. The new element should read like this (with some context): ------------------------- ------------------------- However, when I try to create this element, it turns out like this instead: ------------------------- ------------------------- I realize that it shouldn't matter. 
However, Excel is throwing a fit over my spreadsheet until I uncompress the .xlsx file, manually delete the redundant namespace binding, and recompress.

Here is the (simplified) code I'm using now to create this element, where mytree is the relevant ElementTree object:

-------------------------
NSMAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NSRELS = "http://schemas.openxmlformats.org/package/2006/relationships"
E = ElementMaker(nsmap={None: NSMAIN, 'r': NSRELS})
...
new_element = E.sheet(name='Sheet1', sheetId='1')
new_element.set('{%s}id' % NSRELS, 'rId1')
target_element = mytree.xpath('//ns:sheets', namespaces={'ns': NSMAIN})[0]
target_element.append(new_element)
-------------------------

Is there any way to define the r:id attribute without repeating the namespace binding within the element? I haven't used lxml before, and am new to Python in general, so if I'm overlooking something entirely obvious, I apologize.

Thanks,
Jen

From stefan_ml at behnel.de Wed Mar 17 08:56:04 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 17 Mar 2010 08:56:04 +0100
Subject: [lxml-dev] Namespace redundancy in new elements
In-Reply-To: <34D386D2-216F-4FBB-B264-0D44AD3F414E@gmail.com>
References: <34D386D2-216F-4FBB-B264-0D44AD3F414E@gmail.com>
Message-ID: <4BA08B14.8070102@behnel.de>

Jennifer Rogers, 16.03.2010 23:46:
>
>[...]
>

Is the different namespace URI for the 'r' prefix just a copy&paste accident here?
Stefan

From jholg at gmx.de Wed Mar 17 09:31:14 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 17 Mar 2010 09:31:14 +0100
Subject: [lxml-dev] Namespace redundancy in new elements
In-Reply-To: <34D386D2-216F-4FBB-B264-0D44AD3F414E@gmail.com>
References: <34D386D2-216F-4FBB-B264-0D44AD3F414E@gmail.com>
Message-ID: <20100317083114.297560@gmx.net>

Hi,

> -------------------------
> NSMAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
> NSRELS = "http://schemas.openxmlformats.org/package/2006/relationships"
> E = ElementMaker(nsmap={None: NSMAIN, 'r': NSRELS})
> ...
> new_element = E.sheet(name='Sheet1', sheetId='1')
> new_element.set('{%s}id' % NSRELS, 'rId1')
> target_element = mytree.xpath('//ns:sheets', namespaces={'ns': NSMAIN})[0]
> target_element.append(new_element)
> -------------------------
>
> Is there any way to define the r:id attribute without repeating the
> namespace binding within the element?

With the namespace parameter added (in addition to nsmap) when creating the ElementMaker instance, this works fine for me:

>>> from lxml import etree
>>> from lxml.builder import ElementMaker
>>> NSMAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
>>> NSRELS = "http://schemas.openxmlformats.org/package/2006/relationships"
>>> E = ElementMaker(namespace=NSMAIN, nsmap={None: NSMAIN, 'r': NSRELS})
>>> new_element = E.sheet(name='Sheet1', sheetId='1')
>>> new_element.set('{%s}id' % NSRELS, 'rId1')
>>> mytree = E.root(E.sheets()) # need to create a skeleton tree for this test
>>> target_element = mytree.xpath('//ns:sheets', namespaces={'ns': NSMAIN})[0]
>>> target_element.append(new_element)
>>> print etree.tostring(mytree)
>>>

Sidenote: I'm currently running these versions:

lxml version: 2.2.2
libxml2 version: (2, 6, 32)
libxslt version: (1, 1, 23)

> I'm using lxml to build Excel spreadsheets in the Office Open XML format.

Can you share?
I'm currently creating some spreadsheet documentation from an annotated XML Schema by extracting a CSV, loading it into Excel and then applying formatting using a macro, which is a) fragile and b) needs manual steps. I've wanted to improve this for quite some time but never got round to doing it, so any hints/code on how to do this properly would be much appreciated.

brgds,
Holger

From sergio at sergiomb.no-ip.org Wed Mar 17 22:30:08 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Wed, 17 Mar 2010 21:30:08 +0000
Subject: [lxml-dev] copy a xpath from an element to an element without double copies
In-Reply-To: <4B952B95.8080903@behnel.de>
References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> <20100308141358.28990@gmx.net> <1268065431.24776.5.camel@segulix> <4B952B95.8080903@behnel.de>
Message-ID: <1268861408.13040.28.camel@segulix>

On Mon, 2010-03-08 at 17:53 +0100, Stefan Behnel wrote:
> Sergio Monteiro Basto, 08.03.2010 17:23:
> >>> I will show another example:
> >>>
> >>> from lxml import html
> >>> html_document = html.fromstring ("""
> >>>

something that is not a div

> >>>
> >>> > >>> > >>>

title

> >>>
, > >>> 13:36
> >>>
""") # div id="articleTitle" have divs inside. > >>> > >>> new_root = html.Element("body") > >>> new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) > >> > >> Deviating from what has already been proposed your XPath selects every div element in the document... > >> > >>> print html.tostring(new_root, method="html", encoding="utf-8") > >>> --------> print(html.tostring(new_root, method="html", > >>> encoding="utf-8")) > >>>
> >>> > >>> > >>>

title

> >>>
, > >>> 13:36
> >>>
> >>> > >>>
, > >>> 13:36
> >>> > >>> > >>> > >>> you see, the divs inside the div, are repeated ... > >> > >> ... so this is to be expected. > >> > >> If you're just interested in
you need to use XPath predicates, like "//div[@id='articleTitle']" > > > > I know this was an example. > > > > In what I am doing, I can't predict, if xpath, match with elements and > > sub elements or not. I want calculate the "real" xpath, which "should" > > be without dups. > > Well, there are no duplicates, according to your XPath expression. Each > node that was found was only copied over once, but including its children. For me the dups are the comparative between original html and new html - the original html have 3 divs, the new have 5. Ok , I know lxml works as expect , but I expect don't return two times same tag. The problem is it with big (and bad) html. That things similar to the example might be happen. > Please clarify what you consider a duplicate, then we might be able to help > you find a suitable XPath expression. If that's not possible, consider > eliminating the unwanted nodes yourself by traversing the trees that were > found and comparing them to the other nodes. The problem is, I can not choose the suitable XPath. The xpath is given to me ( by an user and could be wrong). So here is the code : I will call it the realxpath, I accept suggestion for the name of function. Which calculate xpath on every iteration. from lxml import html f = open(options.file).read() strxpath="//div" html_document = html.fromstring(f) elembody = html.Element("body") while len (html_document.xpath(strxpath)) != 0 : frag = html_document.xpath(strxpath)[0] elembody.append(frag) print html.tostring(elembody, method="html", encoding="utf-8") it's print only 3 divs on the example of this thread. BTW: xpath id function seems buggy (like I wrote in my previous email) and this method don't work with id function. and loops forever. > Stefan Thanks, -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 3159 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100317/f97ecbb5/attachment-0001.bin From cmtaylor at ti.com Wed Mar 17 22:50:23 2010 From: cmtaylor at ti.com (Taylor, Martin) Date: Wed, 17 Mar 2010 16:50:23 -0500 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 Message-ID: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> Does anyone know of a .dmg installer for lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6? This web page: http://codespeak.net/lxml/installation.html#installation-in-activepython suggests doing this: pypm install lxml but that requires a special Business License from ActiveState at a cost of about $1000! The build-it-yourself instructions for Mac look REALLY HAIRY, so I'd very much appreciate getting a pre-built installer if anyone knows of one. Thanks, Martin
From stefan_ml at behnel.de Thu Mar 18 08:39:04 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 18 Mar 2010 08:39:04 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1268861408.13040.28.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> <20100308141358.28990@gmx.net> <1268065431.24776.5.camel@segulix> <4B952B95.8080903@behnel.de> <1268861408.13040.28.camel@segulix> Message-ID: <4BA1D898.9040009@behnel.de> Sergio Monteiro Basto, 17.03.2010 22:30: > On Mon, 2010-03-08 at 17:53 +0100, Stefan Behnel wrote: >> Sergio Monteiro Basto, 08.03.2010 17:23: >>>>> I will show another example: >>>>> >>>>> from lxml import html >>>>> html_document = html.fromstring (""" >>>>>

something that is not a div

>>>>>
>>>>> >>>>> >>>>>

title

>>>>>
, >>>>> 13:36
>>>>>
""") # div id="articleTitle" have divs inside. >>>>> >>>>> new_root = html.Element("body") >>>>> new_root.extend( map(copy.deepcopy,html_document.xpath('//div'))) >>>> >>>> Deviating from what has already been proposed your XPath selects every div element in the document... >>>> >>>>> print html.tostring(new_root, method="html", encoding="utf-8") >>>>> --------> print(html.tostring(new_root, method="html", >>>>> encoding="utf-8")) >>>>>
>>>>> >>>>> >>>>>

title

>>>>>
, >>>>> 13:36
>>>>>
>>>>> >>>>>
, >>>>> 13:36
>>>>> >>>>> >>>>> >>>>> you see, the divs inside the div, are repeated ... >>>> >>>> ... so this is to be expected. >>>> >>>> If you're just interested in
you need to use XPath predicates, like "//div[@id='articleTitle']" >>> >>> I know this was an example. >>> >>> In what I am doing, I can't predict, if xpath, match with elements and >>> sub elements or not. I want calculate the "real" xpath, which "should" >>> be without dups. >> >> Well, there are no duplicates, according to your XPath expression. Each >> node that was found was only copied over once, but including its children. > > For me the dups are the comparative between original html and new html - > the original html have 3 divs, the new have 5. > > Ok , I know lxml works as expect , but I expect don't return two times > same tag. > The problem is it with big (and bad) html. That things similar to the > example might be happen. > >> Please clarify what you consider a duplicate, then we might be able to help >> you find a suitable XPath expression. If that's not possible, consider >> eliminating the unwanted nodes yourself by traversing the trees that were >> found and comparing them to the other nodes. > > The problem is, I can not choose the suitable XPath. The xpath is given > to me ( by an user and could be wrong). Well, then the user provided you with an XPath expression that selects certain elements, and you construct a tree that contains everything that the user's expression selected. Sounds like a good response to the user's request. However, it seems to me that what you consider a duplicate is any element within a subtree that the XPath expression already selected. If that is the case, you will have to eliminate duplicates yourself. There is no way lxml can guess that intention from your "potentially wrong" user expression. A simple way to do that is to build a set of all parents of each Element in the result ("set(el.iterancestors())"), and then check each Element against all other sets. Drop all Elements for which the ancestor set matches with any other result Element. 
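
A minimal sketch of the ancestor check described above, not a definitive implementation. It assumes the XPath results are lxml Elements from a single tree; drop_nested and the sample document are hypothetical names invented for illustration, loosely modelled on the example in this thread:

```python
from lxml import html


def drop_nested(results):
    """Keep only result elements that have no ancestor among the results.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    selected = set(results)
    kept = []
    for el in results:
        # iterancestors() walks from the parent up to the root element.
        # An element is "nested" if any of its ancestors was also selected.
        if not selected.intersection(el.iterancestors()):
            kept.append(el)
    return kept


# Made-up sample document: one outer div that contains two inner divs.
doc = html.fromstring(
    '<html><body><p>not a div</p>'
    '<div id="outer"><div id="a">title</div><div id="b">13:36</div></div>'
    '</body></html>'
)
top_level = drop_nested(doc.xpath('//div'))
print([el.get('id') for el in top_level])
```

Copying only the kept elements into the new tree (for example with copy.deepcopy, as in the earlier example) then brings each inner div over once, as part of its parent's subtree.
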
It might be faster to do the same with a dict that maps the parents to a list of result Elements, but I guess you can figure out the details and do the benchmarking. Does that match your intention? Stefan From stefan_ml at behnel.de Thu Mar 18 08:42:00 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 18 Mar 2010 08:42:00 +0100 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> Message-ID: <4BA1D948.10204@behnel.de> Taylor, Martin, 17.03.2010 22:50: > Does anyone know of a .dmg installer for lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6? > > This web page: http://codespeak.net/lxml/installation.html#installation-in-activepython suggests doing this: > > pypm install lxml > > but that requires a special Business License from ActiveState at a cost of about $1000! Ok, guess I'll just remove that from the web site then. > The build-it-yourself instructions for Mac look REALLY HAIRY Did you actually *try* the one-liner that they present? STATIC_DEPS=true easy_install lxml Stefan From jholg at gmx.de Thu Mar 18 08:52:39 2010 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 18 Mar 2010 08:52:39 +0100 Subject: [lxml-dev] copy a xpath from an element to an element without double copies In-Reply-To: <1268861408.13040.28.camel@segulix> References: <1267826107.14419.18.camel@segulix> <4B937F97.6000005@behnel.de> <1267991224.5590.2.camel@segulix> <4B949D0D.5080903@behnel.de> <1268051391.22828.1.camel@segulix> <1268054877.22828.31.camel@segulix> <20100308141358.28990@gmx.net> <1268065431.24776.5.camel@segulix> <4B952B95.8080903@behnel.de> <1268861408.13040.28.camel@segulix> Message-ID: <20100318075239.66950@gmx.net> Hi, > > Well, there are no duplicates, according to your XPath expression. Each > > node that was found was only copied over once, but including its > children. 
> > For me the dups are the comparative between original html and new html - > the original html have 3 divs, the new have 5. Please realize that a node can have child nodes, so if you select some
<div>...</div>
that contains other divs you get *the whole subtree*. > The problem is, I can not choose the suitable XPath. The xpath is given > to me ( by an user and could be wrong). > > So here is the code : > I will call it the realxpath, I accept suggestion for the name of > function. Which calculate xpath on every iteration. > > > from lxml import html > > f = open(options.file).read() > strxpath="//div" > > html_document = html.fromstring(f) > elembody = html.Element("body") > > while len (html_document.xpath(strxpath)) != 0 : > frag = html_document.xpath(strxpath)[0] > elembody.append(frag) Note that you rip out frag from the original html_document here, so effectively you take the topmost
<div> (possibly containing other <div>
s and put it, with all descendants, to the result document. > print html.tostring(elembody, method="html", encoding="utf-8") > > it's print only 3 divs on the example of this thread. > > BTW: xpath id function seems buggy (like I wrote in my previous email) > and this method don't work with id function. and loops forever. No idea about this issue. When's a node selectable using id()? XPath rec says: "[...] NOTE: If a document does not have a DTD, then no element in the document will have a unique ID. [...]" (http://www.w3.org/TR/xpath/#unique-id) So I suspect this is not the same as elements simply having an id-attribute. Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From cmtaylor at ti.com Thu Mar 18 13:30:55 2010 From: cmtaylor at ti.com (Taylor, Martin) Date: Thu, 18 Mar 2010 07:30:55 -0500 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 In-Reply-To: <4BA1D948.10204@behnel.de> References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de> Message-ID: <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> I hadn't tried the Mac build yesterday because the instructions really looked like there could be many "gotchas". However, Stefan's comment challenged me to give it a try this morning. I downloaded and extracted the lxml-2.2.6.tar.gz source file, then ran the "easy_install" and it immediately didn't work: $which easy_install /Library/Frameworks/Python.framework/Versions/2.6/bin/easy_install $ STATIC_DEPS=true sudo easy_install lxml Searching for lxml Reading http://pypi.python.org/simple/lxml/ Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! Reading http://pypi.python.org/simple/lxml/ Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! Couldn't find index page for 'lxml' (maybe misspelled?) 
Scanning index of all packages (this may take a while) Reading http://pypi.python.org/simple/ Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! No local packages or download links found for lxml error: Could not find suitable distribution for Requirement.parse('lxml') I've never used easy_install before and I've never had ANY success with FTP from behind our company firewall. So I'm not surprised that I got these error messages about "servname". Does anyone have any suggestions as to how I might get this to work? Based on comments here: http://www.explain.com.au/oss/libxml2xslt.html it would seem that as of "Leopard" (OS X 10.5 and I'm using OS X 10.6), the libxml2 and libxslt that come with the Mac OS X are OK to use with lxml. So I tried a simple install: $python setup.py install Building lxml version 2.2.6. NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available. Using build configuration of libxslt 1.1.24 running install running bdist_egg running egg_info writing src/lxml.egg-info/PKG-INFO writing top-level names to src/lxml.egg-info/top_level.txt writing dependency_links to src/lxml.egg-info/dependency_links.txt reading manifest file 'src/lxml.egg-info/SOURCES.txt' reading manifest template 'MANIFEST.in' writing manifest file 'src/lxml.egg-info/SOURCES.txt' installing library code to build/bdist.macosx-10.3-fat/egg running install_lib running build_py creating build/lib.macosx-10.3-fat-2.6 creating build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/__init__.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/_elementpath.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/builder.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/cssselect.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/doctestcompare.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/ElementInclude.py -> build/lib.macosx-10.3-fat-2.6/lxml copying 
src/lxml/pyclasslookup.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/sax.py -> build/lib.macosx-10.3-fat-2.6/lxml copying src/lxml/usedoctest.py -> build/lib.macosx-10.3-fat-2.6/lxml creating build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/__init__.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/_dictmixin.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/builder.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/clean.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/defs.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/diff.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/formfill.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/html5parser.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/soupparser.py -> build/lib.macosx-10.3-fat-2.6/lxml/html copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.3-fat-2.6/lxml/html running build_ext building 'lxml.etree' extension creating build/temp.macosx-10.3-fat-2.6 creating build/temp.macosx-10.3-fat-2.6/src creating build/temp.macosx-10.3-fat-2.6/src/lxml gcc -arch ppc -arch i386 -fno-strict-aliasing -fPIC -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/include/libxml2 -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -w -flat_namespace Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk Please check your Xcode installation gcc -arch ppc -arch i386 -isysroot 
/Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so ld: library not found for -lbundle1.o collect2: ld returned 1 exit status ld: library not found for -lbundle1.o collect2: ld returned 1 exit status lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccjg1Ufi.out (No such file or directory) error: command 'gcc' failed with exit status 1 I think the real problem here is: Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk Of course I don't have a 10.4 SDK on my 10.6 Xcode. At TI we do support our products on 10.5 and 10.6 but 10.4 was declared "obsolete" last year. So I'm back to my original question: Has anyone build lxml for Mac OS X 10.6 and, if so, could you provide me a link to a binary installer? Thanks very much, Martin > -----Original Message----- > From: Stefan Behnel [mailto:stefan_ml at behnel.de] > Sent: Thursday, March 18, 2010 2:42 AM > To: Taylor, Martin > Cc: ML-Lxml-dev > Subject: Re: [lxml-dev] lxml pre-built for MacTel OS X 10.6 > and ActivePython 2.6 > > Taylor, Martin, 17.03.2010 22:50: > > Does anyone know of a .dmg installer for lxml pre-built for > MacTel OS X 10.6 and ActivePython 2.6? > > > > This web page: > http://codespeak.net/lxml/installation.html#installation-in-ac tivepython suggests doing this: > > > > pypm install lxml > > > > but that requires a special Business License from > ActiveState at a cost of about $1000! > > Ok, guess I'll just remove that from the web site then. > > > > The build-it-yourself instructions for Mac look REALLY HAIRY > > Did you actually *try* the one-liner that they present? 
> > STATIC_DEPS=true easy_install lxml > > Stefan > From stefan_ml at behnel.de Thu Mar 18 13:58:48 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 18 Mar 2010 13:58:48 +0100 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de> <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> Message-ID: <4BA22388.8070507@behnel.de> Taylor, Martin, 18.03.2010 13:30: > I hadn't tried the Mac build yesterday because the instructions really looked like there could be many "gotchas". However, Stefan's comment challenged me to give it a try this morning. I downloaded and extracted the lxml-2.2.6.tar.gz source file, then ran the "easy_install" and it immediately didn't work: > > $which easy_install > /Library/Frameworks/Python.framework/Versions/2.6/bin/easy_install > $ STATIC_DEPS=true sudo easy_install lxml > Searching for lxml > Reading http://pypi.python.org/simple/lxml/ > Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! > Reading http://pypi.python.org/simple/lxml/ > Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! > Couldn't find index page for 'lxml' (maybe misspelled?) > Scanning index of all packages (this may take a while) > Reading http://pypi.python.org/simple/ > Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found! > No local packages or download links found for lxml > error: Could not find suitable distribution for Requirement.parse('lxml') > > I've never used easy_install before and I've never had ANY success with FTP from behind our company firewall. So I'm not surprised that I got these error messages about "servname". Does anyone have any suggestions as to how I might get this to work? 
>
> Based on comments here: http://www.explain.com.au/oss/libxml2xslt.html it would seem that as of "Leopard" (OS X 10.5 and I'm using OS X 10.6), the libxml2 and libxslt that come with the Mac OS X are OK to use with lxml. So I tried a simple install:
>
> $python setup.py install

You forgot to say "STATIC_DEPS=true".

> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
> Please check your Xcode installation
> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so
> ld: library not found for -lbundle1.o
> collect2: ld returned 1 exit status
> ld: library not found for -lbundle1.o
> collect2: ld returned 1 exit status
> lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccjg1Ufi.out (No such file or directory)
> error: command 'gcc' failed with exit status 1
>
> I think the real problem here is:
> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk

There's at least some code in buildlibxml2.py that deals with this case:

major_version, minor_version = map(int, platform.mac_ver()[0].split('.')[:2])
if major_version > 7:
    env = os.environ.copy()
    if minor_version < 6:
        env.update({
            'CFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2",
            'LDFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk",
            'MACOSX_DEPLOYMENT_TARGET' : "10.3"
            })
    else:
        env.update({
            'CFLAGS' : "-arch ppc -arch i386 -arch x86_64 -O2",
            'LDFLAGS' : "-arch ppc -arch i386 -arch x86_64",
            'MACOSX_DEPLOYMENT_TARGET' : "10.6"
            })
    call_setup['env'] = env

Stefan

From wichert at wiggy.net Thu Mar 18 15:16:24 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Thu, 18 Mar 2010 15:16:24 +0100
Subject: [lxml-dev] Unicode behaviour of Element.text
Message-ID: <4BA235B8.9010201@wiggy.net>

I tried to figure out
the unicode-behaviour of Element.text. The lxml documentation does mention how parsing unicode data and serializing to unicode works, but I can not find any information on how Element.text returns strings. From what I can see it appears that Element.text returns either a str or a unicode instance, depending on the presence of non-ASCII text. That behaviour feels inconsistent, and for unicode using applications it means that every use of Element.text has to be written as unicode(node.text), which is not very pretty. Would it be possible to add an option to make the text attribute always return a unicode instance? Wichert. From wichert at wiggy.net Thu Mar 18 15:16:05 2010 From: wichert at wiggy.net (Wichert Akkerman) Date: Thu, 18 Mar 2010 15:16:05 +0100 Subject: [lxml-dev] adding a namespace Message-ID: <4BA235A5.7070507@wiggy.net> I am having some problems adding a new namespace to a parsed document. My goal is to take an input file like this:

first paragraph

second paragraph

and turn it into this:

first paragraph

second paragraph

the code is fairly simple, and looks like this (simplified from original):

NS="http://xml.zope.org/namespaces/i18n"
tree=lxml.etree.parse(input)
root=tree.getroot()
count=1
if "i18n" not in root.nsmap:
    root.nsmap["i18n"]=NS
for el in root.iter():
    if "{%s}translate" % NS in el.attrib:
        continue
    if hasText(el):
        el.attrib["{%s}translate" % NS]="string%d" % count
        count+=1
print lxml.etree.tostring(tree)

However the resulting output looks like this:

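<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p xmlns:ns0="http://xml.zope.org/namespaces/i18n"
       ns0:translate="string1">first paragraph</p>
    <p xmlns:ns1="http://xml.zope.org/namespaces/i18n"
       ns1:translate="string2">second paragraph</p>
  </body>
</html>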

while trying to debug this I noticed something odd: lxml allows you to modify the nsmap for an element, but ignores what you do:

>>> root.nsmap
{None: 'http://www.w3.org/1999/xhtml', 'py': 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
>>> root.nsmap["frop"]='http://frip'
>>> root.nsmap
{None: 'http://www.w3.org/1999/xhtml', 'py': 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}

I would expect that to either work, or raise an exception telling me I am trying to do something that is not allowed. The current behaviour feels a bit unpythonic.

It is possible to specify your own nsmap when creating elements, but I can not find an API to modify the nsmap for a parsed tree. Is that a missing feature, or is there another way to do this?

Wichert.

From stefan_ml at behnel.de Thu Mar 18 16:25:18 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 18 Mar 2010 16:25:18 +0100
Subject: [lxml-dev] Unicode behaviour of Element.text
In-Reply-To: <4BA235B8.9010201@wiggy.net>
References: <4BA235B8.9010201@wiggy.net>
Message-ID: <4BA245DE.8010901@behnel.de>

Wichert Akkerman, 18.03.2010 15:16:
> I tried to figure out the unicode-behaviour of Element.text. The lxml
> documentation does mention how parsing unicode data and serializing to
> unicode works, but I can not find any information on how Element.text
> returns strings. From what I can see it appears that Element.text
> returns either a str or a unicode instance, depending on the presence of
> non-ASCII text. That behaviour feels inconsistent, and for unicode using
> applications it means that every use of Element.text has to be written
> as unicode(node.text), which is not very pretty. Would it be possible to
> add an option to make the text attribute always return a unicode instance?

Since this has been asked a couple of times before, here's a short answer:

That's how ElementTree works in Py2 and lxml.etree is compatible with it.
It's also faster for plain ASCII data (which is common). In Python 3, lxml.etree always returns Unicode strings for .tag, .text and .tail. Stefan From cmtaylor at ti.com Thu Mar 18 18:31:09 2010 From: cmtaylor at ti.com (Taylor, Martin) Date: Thu, 18 Mar 2010 12:31:09 -0500 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 In-Reply-To: <4BA22388.8070507@behnel.de> References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de> <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> <4BA22388.8070507@behnel.de> Message-ID: <92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com> I'm making some progress but am stuck again on this Mac OS X 10.6 build of lxml. I downloaded the two dependent tarballs manually, since the FTP access through our firewall didn't work: ls libs libxml2-2.7.6.tar.gz libxslt-1.1.26.tar.gz Then I ran this build command: python setup.py build --static-deps --libxml2-version=2.7.6 --libxslt-version=1.1.26 I think it built the two libraries successfully 'cause I saw messages like this: ---------------------------------------------------------------------- Libraries have been installed in: .../lxml/build/tmp/libxml2/lib For both libraries. 
But when it got to this stage: building 'lxml.etree' extension I got these error messages: Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk Please check your Xcode installation gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -liconv /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libexslt.a /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libxml2.a /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libxslt.a -L/Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so ld: library not found for -lbundle1.o collect2: ld returned 1 exit status ld: library not found for -lbundle1.o collect2: ld returned 1 exit status lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccxV0tCt.out (No such file or directory) error: command 'gcc' failed with exit status 1 Which indicates to me that it is trying to build with the SDK for the wrong Mac OS version. 
I've searched the entire lxml code tree and can't find anywhere where it does this kind of logic for the building of lxml itself (buildlibxml2.py only builds that library, as far as I can tell): > There's at least some code in buildlibxml2.py that deals with > this case: > > > major_version, minor_version = map(int, platform.mac_ver()[0].split('.')[:2]) > if major_version > 7: > env = os.environ.copy() > if minor_version < 6: > env.update({ > 'CFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2", > 'LDFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk", > 'MACOSX_DEPLOYMENT_TARGET' : "10.3" > }) > else: > env.update({ > 'CFLAGS' : "-arch ppc -arch i386 -arch x86_64 -O2", > 'LDFLAGS' : "-arch ppc -arch i386 -arch x86_64", > 'MACOSX_DEPLOYMENT_TARGET' : "10.6" > }) > call_setup['env'] = env SO now the question is "What hidden magic is used to determine the Mac OS X version and the SDK that should be used for building lxml itself?" Thanks again, Martin From stefan_ml at behnel.de Thu Mar 18 20:11:23 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 18 Mar 2010 20:11:23 +0100 Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6 In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com> References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de> <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> <4BA22388.8070507@behnel.de> <92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com> Message-ID: <4BA27ADB.1020608@behnel.de> Taylor, Martin, 18.03.2010 18:31: > SO now the question is "What hidden magic is used to determine the Mac > OS X version and the SDK that should be used for building lxml itself?" This looks like a distutils related question (or maybe even ActivePython related). So unless someone else can answer it on this list, I'd suggest you ask on comp.lang.python or the distutils sig mailing list. 
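As a starting point for that question, here is a small sketch that prints the compiler settings recorded for the running Python interpreter. On Unix-like systems distutils reuses these settings when it builds extension modules. The sketch assumes the standard library `sysconfig` module (Python 2.7 and later; on 2.6 the comparable functions live in `distutils.sysconfig`), and the helper name `build_settings` is made up for illustration.

```python
import sysconfig

# Build variables that typically carry -arch and -isysroot flags on Mac OS X.
NAMES = ("CC", "CFLAGS", "LDFLAGS", "LDSHARED", "MACOSX_DEPLOYMENT_TARGET")


def build_settings():
    """Return the build variables recorded when this Python was compiled."""
    # Variables that are not defined on the current platform come back as None.
    return dict((name, sysconfig.get_config_var(name)) for name in NAMES)


if __name__ == "__main__":
    print("platform: %s" % sysconfig.get_platform())
    for name, value in sorted(build_settings().items()):
        print("%s = %s" % (name, value))
```

If `CFLAGS` or `LDSHARED` already mention /Developer/SDKs/MacOSX10.4u.sdk, that would suggest the SDK path comes from how the interpreter itself was built rather than from lxml's setup.py.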
BTW, have you succeeded in building any other binary extensions on your platform yet? That might tell you if it's a general problem with your installation or something that's specific to lxml.

Another thing you could try is to use the pre-built lxml 2.2.2 binaries on PyPI. Not totally up-to-date, but certainly usable.

http://pypi.python.org/pypi/lxml/2.2.2

Those were built by Stephan Eletzhofer, maybe he can upload a build of 2.2.6?

Stefan

From wichert at wiggy.net Tue Mar 23 08:33:29 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Tue, 23 Mar 2010 08:33:29 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA235A5.7070507@wiggy.net>
References: <4BA235A5.7070507@wiggy.net>
Message-ID: <4BA86EC9.2060807@wiggy.net>

I apologize if I'm being impatient, but I am wondering if the lack of response means that people are too busy to look at this, or if it means that this is, at least currently, not possible with lxml?

Regards,
Wichert.

On 3/18/10 15:16 , Wichert Akkerman wrote:
> I am having some problems adding a new namespace to a parsed document.
> My goal is to take an input file like this:
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>   <body>
>     <p>first paragraph</p>
>     <p>second paragraph</p>
>   </body>
> </html>
>
> and turn it into this:
>
> <html xmlns="http://www.w3.org/1999/xhtml"
>       xmlns:i18n="http://xml.zope.org/namespaces/i18n">
>   <body>
>     <p i18n:translate="string1">first paragraph</p>
>     <p i18n:translate="string2">second paragraph</p>
>   </body>
> </html>
>
> the code is fairly simple, and looks like this (simplified from original):
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=lxml.etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     root.nsmap["i18n"]=NS
> for el in root.iter():
>     if "{%s}translate" % NS in el.attrib:
>         continue
>     if hasText(el):
>         el.attrib["{%s}translate" % NS]="string%d" % count
>         count+=1
> print lxml.etree.tostring(tree)
>
> However the resulting output looks like this:
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>   <body>
>     <p xmlns:ns0="http://xml.zope.org/namespaces/i18n"
>        ns0:translate="string1">first paragraph</p>
>     <p xmlns:ns1="http://xml.zope.org/namespaces/i18n"
>        ns1:translate="string2">second paragraph</p>
>   </body>
> </html>
>
> while trying to debug this I noticed something odd: lxml allows you to
> modify the nsmap for an element, but ignores what you do:
>
> >>> root.nsmap
> {None: 'http://www.w3.org/1999/xhtml', 'py':
> 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
> >>> root.nsmap["frop"]='http://frip'
> >>> root.nsmap
> {None: 'http://www.w3.org/1999/xhtml', 'py':
> 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
>
> I would expect that to either work, or raise an exception telling me I
> am trying to do something that is not allowed. The current behaviour
> feels a bit unpythonic.
>
> It is possible to specify your own nsmap when creating elements, but I
> can not find an API to modify the nsmap for a parsed tree. Is that a
> missing feature, or is there another way to do this?
>
> Wichert.

From wichert at wiggy.net Tue Mar 23 10:02:14 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Tue, 23 Mar 2010 10:02:14 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <1269333717.10101.127.camel@ddbc-it-simon>
References: <4BA235A5.7070507@wiggy.net> <4BA86EC9.2060807@wiggy.net> <1269333717.10101.127.camel@ddbc-it-simon>
Message-ID: <4BA88396.9010607@wiggy.net>

On 3/23/10 09:41 , Simon Wiles 魏希明 wrote:
> On Tue, 2010-03-23 at 08:33 +0100, Wichert Akkerman wrote:
>> I apologize if I'm being impatient, but I am wondering if the lack of
>> response means that people are too busy to look at this, or if it means
>> that this is, at least currently, not possible with lxml?
>>
>> Regards,
>> Wichert.
>
>
> You could try something like this:
>
> ====================================
>
> from lxml import etree
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     new_root = etree.Element(root.tag, nsmap=dict(i18n=NS, **root.nsmap))
>     new_root[:] = root[:]
> for el in new_root.iter():
>     if "{%s}translate" % NS in el.attrib:
>         continue
>     if el.text is not None and el.text.strip() != '':
>         el.attrib["{%s}translate" % NS]="string%d" % count
>         count+=1
> print etree.tostring(new_root)
>
>
> ====================================
>
>
> Is that what you had in mind?

Almost! The problem with this approach is that you lose the doctype, since that is serialised as part of tree.docinfo, while you are now only outputting the root and its children. As a workaround I could manually output tree.docinfo.doctype I suppose.

Wichert.

From simonjwiles at gmail.com Tue Mar 23 09:41:57 2010
From: simonjwiles at gmail.com (Simon Wiles 魏希明)
Date: Tue, 23 Mar 2010 16:41:57 +0800
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA86EC9.2060807@wiggy.net>
References: <4BA235A5.7070507@wiggy.net> <4BA86EC9.2060807@wiggy.net>
Message-ID: <1269333717.10101.127.camel@ddbc-it-simon>

On Tue, 2010-03-23 at 08:33 +0100, Wichert Akkerman wrote:
> I apologize if I'm being impatient, but I am wondering if the lack of
> response means that people are too busy to look at this, or if it means
> that this is, at least currently, not possible with lxml?
>
> Regards,
> Wichert.
You could try something like this:

====================================

from lxml import etree

NS="http://xml.zope.org/namespaces/i18n"
tree=etree.parse(input)
root=tree.getroot()
count=1
if "i18n" not in root.nsmap:
    new_root = etree.Element(root.tag, nsmap=dict(i18n=NS, **root.nsmap))
    new_root[:] = root[:]
for el in new_root.iter():
    if "{%s}translate" % NS in el.attrib:
        continue
    if el.text is not None and el.text.strip() != '':
        el.attrib["{%s}translate" % NS]="string%d" % count
        count+=1
print etree.tostring(new_root)

====================================

Is that what you had in mind?

simon

From stefan_ml at behnel.de Tue Mar 23 20:09:29 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 23 Mar 2010 20:09:29 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA235A5.7070507@wiggy.net>
References: <4BA235A5.7070507@wiggy.net>
Message-ID: <4BA911E9.4030800@behnel.de>

Hi,

bumping this thread was a good idea, it seems. ;)

Wichert Akkerman, 18.03.2010 15:16:
> I am having some problems adding a new namespace to a parsed document.
> My goal is to take an input file like this:
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>   <body>
>     <p>first paragraph</p>
>     <p>second paragraph</p>
>   </body>
> </html>
>
> and turn it into this:
>
> <html xmlns="http://www.w3.org/1999/xhtml"
>       xmlns:i18n="http://xml.zope.org/namespaces/i18n">
>   <body>
>     <p i18n:translate="string1">first paragraph</p>
>     <p i18n:translate="string2">second paragraph</p>
>   </body>
> </html>
>
> the code is fairly simple, and looks like this (simplified from original):
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=lxml.etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     root.nsmap["i18n"]=NS

Ok, this won't work as the return value of the nsmap property is a newly created dict. The reason is that it returns a map of all prefixes that are defined in the context of the Element, including all live prefixes defined on its ancestors. I've added a short section to the tutorial that explains this (not on the website yet).

> I would expect that to either work, or raise an exception telling me I
> am trying to do something that is not allowed. The current behaviour
> feels a bit unpythonic.

You get a plain dict here, so an exception won't work. It would also be unfriendly to return a read-only dict (which would raise an exception on changes) as it's quite reasonable to use the dict in other places of your code.

> It is possible to specify your own nsmap when creating elements, but I
> can not find an API to modify the nsmap for a parsed tree. Is that a
> missing feature, or is there another way to do this?

Simon showed you a way, but apart from that, it's a missing feature. Changing namespace mappings is nothing that the ElementTree API needs to care about, and lxml clearly lacks a good way to do it.

Could you file a ticket on the bug tracker? This should be doable for 2.3.

Stefan

From nickle at gmail.com Thu Mar 25 11:10:36 2010
From: nickle at gmail.com (Nick Leaton)
Date: Thu, 25 Mar 2010 10:10:36 +0000
Subject: [lxml-dev] Namespaces
Message-ID: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>

I'm trying to generate the following header for an xml file

<messages
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="cmf.xsd">

However after reading the section on nsmap on this page http://codespeak.net/lxml/tutorial.html I'm none the wiser

Can anyone give a hand?

Thanks

--
Nick

From Tim.Arnold at sas.com Thu Mar 25 20:00:55 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Thu, 25 Mar 2010 15:00:55 -0400
Subject: [lxml-dev] help with a special attribute
Message-ID:

Hi,

I have some citation keys that contain colons in my source xml. I use lxml to manipulate that source into valid docbook. For example, a key might look like this "kdpm_c:78". I don't have any way to change the keys in the original source to get rid of the colon.

I'm currently manipulating the key into which is ok but not valid docbook. The attribute needs to be xml:id="kdpm_c78".

But if I create it that way, then lxml won't parse it since the key has the colon. On the other hand if I try to postprocess it by adding an xml:id attribute like this:

elem.set('xml:id', elem.get('id').replace(':', ''))

lxml says "Invalid attribute name u'xml:id'"

Is there any way to start with "kdpm_c:78" and end up with without plain-text-processing?

thanks,
--Tim

From stefan_ml at behnel.de Thu Mar 25 20:47:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Mar 2010 20:47:33 +0100
Subject: [lxml-dev] Namespaces
In-Reply-To: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
References: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
Message-ID: <4BABBDD5.5040301@behnel.de>

Nick Leaton, 25.03.2010 11:10:
> I'm trying to generate the following header for an xml file
>
> <messages
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:noNamespaceSchemaLocation="cmf.xsd">
>
> However after reading the section on nsmap on this page
> http://codespeak.net/lxml/tutorial.html I'm none the wiser
>
> Can anyone give a hand?
This should work:

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

messages = etree.Element('messages', nsmap = {'xsi' : XSI_NS})
messages.set("{%s}noNamespaceSchemaLocation" % XSI_NS, "cmf.xsd")

Stefan

From nickle at gmail.com Thu Mar 25 21:16:09 2010
From: nickle at gmail.com (Nick Leaton)
Date: Thu, 25 Mar 2010 20:16:09 +0000
Subject: [lxml-dev] Namespaces
In-Reply-To: <4BABBDD5.5040301@behnel.de>
References: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com> <4BABBDD5.5040301@behnel.de>
Message-ID: <8797930d1003251316u11b118f1y27d0c191ef6d808c@mail.gmail.com>

Thanks - I'll try it out tomorrow

Nick

On Thu, Mar 25, 2010 at 7:47 PM, Stefan Behnel wrote:
> Nick Leaton, 25.03.2010 11:10:
>
>  I'm trying to generate the following header for an xml file
>>
>> <messages
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:noNamespaceSchemaLocation="cmf.xsd">
>>
>> However after reading the section on nsmap on this page
>> http://codespeak.net/lxml/tutorial.html I'm none the wiser
>>
>> Can anyone give a hand?
>>
>
> This should work:
>
> XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
>
> messages = etree.Element('messages', nsmap = {'xsi' : XSI_NS})
> messages.set("{%s}noNamespaceSchemaLocation" % XSI_NS, "cmf.xsd")
>
> Stefan
>

--
Nick

From Tim.Arnold at sas.com Fri Mar 26 18:19:59 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:19:59 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
Message-ID:

Hi,

I apply several manipulations on an xml document and it *seems* like some of them are ignored.
For example, the fix_nested_optional method is called last in a sequence of manipulations:

-----------------------------------
xns = {'d':'http://docbook.org/ns/docbook'}
class DocBookProcessor(object):
    def __init__(self, trees):
        self.trees = trees

    def process(self):
        for _, tree in self.trees.items():
            ... many methods called .....
            self.fix_nested_optional()
        return self.trees

    def fix_nested_optional(self):
        for optional in self.tree.xpath('//d:optional/d:optional', namespaces=xns):
            optional.tag = 'phrase'
-----------------------------------

But when the tree is written out, I still have nested optional tags. In fact if I apply the same function to the newly written file, the nested optionals are taken care of.

How can this be? Is the lxml document changed immediately or is there some sort of wait before the changes take effect?

thanks,
--Tim

From Tim.Arnold at sas.com Fri Mar 26 18:39:00 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:39:00 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To:
References:
Message-ID:

> -----Original Message-----
> From: Jens Quade [mailto:jq at qdevelop.de]
> Sent: Friday, March 26, 2010 1:32 PM
> To: Tim Arnold
> Subject: Re: [lxml-dev] multiple manipulation of xml file, some are ignored
>
>
> On 26.03.2010, at 18:19, Tim Arnold wrote:
>
> > Hi,
> > I apply several manipulations on an xml document and it *seems* like some
> of them are ignored.
> > For example, the fix_nested_optional method is called last in a sequence
> of manipulations:
> > -----------------------------------
> > xns = {'d':'http://docbook.org/ns/docbook'}
> > class DocBookProcessor(object):
> >     def __init__(self, trees):
> >         self.trees = trees
> >
> >     def process(self):
> >         for _, tree in self.trees.items():
> >             ... many methods called .....
> >             self.fix_nested_optional()
> >         return self.trees
> >
> >     def fix_nested_optional(self):
> >         for optional in self.tree.xpath('//d:optional/d:optional',
> namespaces=xns):
> >             optional.tag = 'phrase'
> > -----------------------------------
> >
> > But when the tree is written out, I still have nested optional tags. In
> fact if I apply the same function to the newly written file, the nested
> optionals are taken care of.
>
> where does self.tree come from? Is it part of self.trees?
> wouldn't it be clearer if "tree" was a parameter to
> fix_nested_optional(self, tree)
>
>

Sorry, that's important isn't it. Each tree is an lxml document representing a chapter in a book. The loop called 'process' above sets 'self.tree' to tree and then calls the methods. I think you're right though, just sending tree as the argument to the methods would be cleaner. The current method looks like this:

def process(self):
    for _, tree in self.trees.items():
        self.tree = tree
        self.fix_options()
        self.fix_optionalias()
        self.create_outputs()
        self.clean_bibliography()
        self.drop_pdftext()
        self.fix_SAS_output()
        self.drop_empty_elem('para')
        self.drop_empty_elem('blockquote')
        self.drop_elem_with_inlineequation('indexterm')
        self.fix_nested_optional()
    return self.trees

Do you think this setup is causing the problem? I'll rewrite to send tree to the method as an argument and see if that changes anything.

thanks,
--Tim

From Tim.Arnold at sas.com Fri Mar 26 18:54:17 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:54:17 -0400
Subject: [lxml-dev] update: multiple manipulation, some ignored
Message-ID:

Hi,

It doesn't seem to matter whether the lxml object is passed as an argument to the method or not. I recoded with identical results.
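For reference, the argument-passing variant discussed above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml.etree offers the same ElementTree-style `findall()` call), the sample document is invented, and the renamed element is kept in the DocBook namespace, which is a guess at the intended output rather than what the code in this thread does.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"
XNS = {"d": DOCBOOK_NS}


class DocBookProcessor(object):
    def __init__(self, trees):
        self.trees = trees  # maps a file name to a root element

    def process(self):
        for tree in self.trees.values():
            # Each step receives the tree explicitly instead of reading self.tree.
            self.fix_nested_optional(tree)
        return self.trees

    def fix_nested_optional(self, tree):
        # findall() returns a list, so renaming tags inside the loop is safe.
        for optional in tree.findall(".//d:optional/d:optional", namespaces=XNS):
            # Assumption: the replacement tag stays in the DocBook namespace.
            optional.tag = "{%s}phrase" % DOCBOOK_NS


# Invented sample input: one optional element nested inside another.
SAMPLE = (
    '<para xmlns="http://docbook.org/ns/docbook">'
    "<optional>a<optional>b</optional></optional>"
    "</para>"
)

if __name__ == "__main__":
    trees = DocBookProcessor({"chapter.xml": ET.fromstring(SAMPLE)}).process()
    for element in trees["chapter.xml"].iter():
        print(element.tag)
```

Printing the tags after each step, as done at the end, is a cheap way to see whether an element has dropped out of the expected namespace.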
For completeness' sake, here is how the self.trees object is created:

for f in [x for x in os.listdir(path) if x.endswith('.xml')]:
    fname = os.path.join(path, f)
    fd = codecs.open(fname, 'rb', encoding='utf8')
    try:
        self.trees[fname] = etree.fromstring(fd.read())
    except etree.XMLSyntaxError as e:
        print 'ERROR: %s cannot be parsed: ' % (os.path.basename(fname))
        print '%s \n' % e
    finally:
        fd.close()

It is that dictionary object self.trees that contains the etrees that is passed to the DocBookProcessor class, as described in the first part of this thread: http://codespeak.net/pipermail/lxml-dev/2010-March/005341.html

again, thanks for any help. I'm stymied.
--Tim

From stefan_ml at behnel.de Fri Mar 26 19:56:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 19:56:19 +0100
Subject: [lxml-dev] update: multiple manipulation, some ignored
In-Reply-To:
References:
Message-ID: <4BAD0353.3020404@behnel.de>

Tim Arnold, 26.03.2010 18:54:
> fname = os.path.join(path, f)
> fd = codecs.open(fname, 'rb', encoding='utf8')
> try:
>     self.trees[fname] = etree.fromstring(fd.read())
> except etree.XMLSyntaxError as e:
>     print 'ERROR: %s cannot be parsed: ' % (os.path.basename(fname))
>     print '%s \n' % e
> finally:
>     fd.close()

Note that this code is extremely inefficient. It recodes characters multiple times (even using the rather slow codecs module), passes through various I/O layers and creates several unnecessary objects on the way.
It's likely several times faster to just write

self.trees[fname] = etree.parse(fname).getroot()

Stefan

From stefan_ml at behnel.de Fri Mar 26 20:10:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 20:10:55 +0100
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To:
References:
Message-ID: <4BAD06BF.8080405@behnel.de>

Tim Arnold, 26.03.2010 18:39:
> From: Jens Quade
>> On 26.03.2010, at 18:19, Tim Arnold wrote:
>>> I apply several manipulations on an xml document and it *seems* like some
>>> of them are ignored.
>>> For example, the fix_nested_optional method is called last in a sequence
>>> of manipulations:
>>> -----------------------------------
>>> xns = {'d':'http://docbook.org/ns/docbook'}
>>> class DocBookProcessor(object):
>>>     def __init__(self, trees):
>>>         self.trees = trees
>>>
>>>     def process(self):
>>>         for _, tree in self.trees.items():
>>>             ... many methods called .....
>>>             self.fix_nested_optional()
>>>         return self.trees
>>>
>>>     def fix_nested_optional(self):
>>>         for optional in self.tree.xpath('//d:optional/d:optional',
>> namespaces=xns):
>>>             optional.tag = 'phrase'
>>> -----------------------------------
>>>
>>> But when the tree is written out, I still have nested optional tags. In
>>> fact if I apply the same function to the newly written file, the nested
>>> optionals are taken care of.
>>
>> where does self.tree come from? Is it part of self.trees?
>> wouldn't it be clearer if "tree" was a parameter to
>> fix_nested_optional(self, tree)

I second that.

>>> How can this be? Is the lxml document changed immediately

Yes.

> Each tree is an lxml document representing a chapter in a book. The loop
> called 'process' above sets 'self.tree' to tree and then calls the
> methods. I think you're right though, just sending tree as the argument
> to the methods would be cleaner.
The current method looks like this:
>
> def process(self):
>     for _, tree in self.trees.items():
>         self.tree = tree
>         self.fix_options()
>         self.fix_optionalias()
>
>         self.create_outputs()
>         self.clean_bibliography()
>         self.drop_pdftext()
>         self.fix_SAS_output()
>         self.drop_empty_elem('para')
>         self.drop_empty_elem('blockquote')
>         self.drop_elem_with_inlineequation('indexterm')
>         self.fix_nested_optional()
>     return self.trees

Looking at your pipeline, it's quite possible that you messed up your namespaces somewhere along the path. You may have added elements to the tree that do not have a namespace (or maybe renamed their tags), which then can't be found by the namespaced XPath expression.

To debug, print the namespaced tag names between two pipeline steps:

for el in tree.iter():
    print el.tag

That being said, without a deeper look into your code it's impossible to figure out what's going wrong and where. Try to strip down the pipeline by eliminating steps that do not induce problems, and reduce your code to an easily testable example that reproduces the problem.

Stefan

From stefan_ml at behnel.de Fri Mar 26 20:25:53 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 20:25:53 +0100
Subject: [lxml-dev] help with a special attribute
In-Reply-To:
References:
Message-ID: <4BAD0A41.2030802@behnel.de>

Tim Arnold, 25.03.2010 20:00:
> Hi, I have some citation keys that contain colons in my source xml. I
> use lxml to manipulate that source into valid docbook. For example, a
> key might look like this "kdpm_c:78" I don't have any way to change the
> keys in the original source to get rid of the colon.
>
> I'm currently manipulating the key into which is
> ok but not valid docbook. The attribute needs to be xml:id="kdpm_c78".

'xml:id' is a qualified name consisting of a namespace prefix and a local name. By specification, the 'xml' prefix maps to the namespace URI http://www.w3.org/XML/1998/namespace.
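To make the prefix-to-URI mapping above concrete, here is a minimal sketch that resolves a prefixed name such as 'xml:id' into the `{uri}local` form that ElementTree-style APIs accept as an attribute name. The helper `to_clark` and the `PREFIXES` table are hypothetical names used only for illustration; they are not part of lxml.

```python
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The 'xml' prefix is reserved and always bound to XML_NS.
PREFIXES = {"xml": XML_NS}


def to_clark(qualified_name, prefixes=PREFIXES):
    """Turn 'prefix:local' into '{namespace-uri}local'."""
    if ":" not in qualified_name:
        return qualified_name  # no prefix, nothing to resolve
    prefix, local_name = qualified_name.split(":", 1)
    # An unknown prefix raises KeyError, which is deliberate in this sketch.
    return "{%s}%s" % (prefixes[prefix], local_name)


if __name__ == "__main__":
    print(to_clark("xml:id"))
```

The resulting string can then be passed wherever an attribute or tag name is expected, in place of the prefixed spelling.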
lxml.etree (and ElementTree) writes this in Clark notation: "{http://www.w3.org/XML/1998/namespace}id".

http://www.jclark.com/xml/xmlns.htm

> But if I create it that way, then lxml won't parse it since the key has
> the colon. On the other hand if I try to postprocess it by adding an
> xml:id attribute like this: elem.set('xml:id',
> elem.get('id').replace(':', ''))
>
> lxml says "Invalid attribute name u'xml:id'
>
> Is there any way to start with "kdpm_c:78" and end up with
> xml:id="kdpm_c78"> without plain-text-processing?

This should work:

for el in root.iter():
    id_text = el.get('id')
    if id_text:
        el.set("{http://www.w3.org/XML/1998/namespace}id", id_text.replace(':', ''))

Stefan

From jq at qdevelop.de Fri Mar 26 22:05:46 2010
From: jq at qdevelop.de (Jens Quade)
Date: Fri, 26 Mar 2010 22:05:46 +0100
Subject: [lxml-dev] update: multiple manipulation, some ignored
In-Reply-To:
References:
Message-ID: <0CC817C5-9A0B-4503-80BB-0A2FFFC2CBAD@qdevelop.de>

On 26.03.2010, at 18:54, Tim Arnold wrote:

> Hi,
> It doesn't seem to matter whether the lxml object is passed as an argument to the method or not. I recoded with identical results.

Can you provide a minimum document that shows the behavior?

Simple tests, like

>>> tree = XML('foobar')
>>> for b in tree.xpath('//b/b'):
...     b.tag = 'c'
...
>>> dump(tree)
foo bar
>>>

seem to work.

Can you dump and compare the document before and after the call to the tag rewriting function? Does that provide any clues when nested tags are not fixed?

If you reparse the tree before the tag rewriting function, like XML(tostring(tree)), does the function then work?

From sergio at sergiomb.no-ip.org Sun Mar 28 03:52:11 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Sun, 28 Mar 2010 02:52:11 +0100
Subject: [lxml-dev] copy a xpath from an element to an element without double copies #2
Message-ID: <1269741131.30391.24.camel@segulix>

Hello,

based on "Note that the .append() method *moves* the element to the new position. If you want to copy it, use the copy module to create a deep copy of the element before moving it over."

I made this simple realxpath function. It moves all elements matched by an XPath expression from one tree to another. If the root of a matched element is no longer the root of the original html_document, that element has already been moved, so we don't move it again.

from lxml import html
f = open("teste.html").read()
html_document = html.fromstring(f)
elems=html_document.xpath('//h1|//div[@id="articleTitle"]')
text=""
elembody = html.Element("body")
for frags in elems:
    if frags.getroottree().getroot() != html_document:
        continue
    elembody.append(frags)
text += html.tostring(elembody, method="html", encoding="utf-8")

My previous solution, iterating again each time we move an element, doesn't work with XPath expressions that use the position of the node, like '//table[1]'.

I hope the realxpath function could become a feature of lxml :)

Best regards and thanks for your help.
--
Sérgio M. B.

From mykingheaven at gmail.com Sun Mar 28 06:09:26 2010
From: mykingheaven at gmail.com (David Shieh)
Date: Sun, 28 Mar 2010 12:09:26 +0800
Subject: [lxml-dev] How to get HTML charset ?
Message-ID:

Hi all,

I have used lxml for a long time and it works fine for me.
But now, I am confused about the charset thing. When I want to get the original charset of a html file, I use the code below:

file_content = ''.join(
    [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
)
html = lxml.html.fromstring(file_content)
for i in html.xpath('head/meta'):
    print lxml.html.tostring(i)

Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. /> element. So, how can I know the original charset of this html?
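One way to look for a charset declared inside the page itself is sketched below. It is an illustration only, written for Python 3 with the standard library's html.parser rather than lxml so that it is self-contained; the names `MetaCharsetFinder` and `declared_charset` are invented here. A page may declare no charset at all, in which case the function returns None.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names before calling this.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip()
            return
        # The http-equiv value itself is compared case-insensitively.
        if (attrs.get("http-equiv") or "").lower() == "content-type":
            match = re.search(r"charset=([\w.:-]+)", attrs.get("content") or "", re.I)
            if match:
                self.charset = match.group(1)


def declared_charset(html_text):
    """Return the charset named in the markup, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


if __name__ == "__main__":
    page = '<html><head><meta http-equiv="content-type" content="text/html; charset=gb2312"></head></html>'
    print(declared_charset(page))
```

Comparing the http-equiv value in lower case sidesteps the question of whether a page spells it "Content-Type" or "content-type".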
BTW, I used urllib2 to get charset, using the codes below: req = urllib2.Request(url) try: response = urllib2.urlopen(req) except HTTPError, e: print e.code else: print response.headers.getheader('Content-Type') Not every sites return its charset, some sites don't return any charset information. What I gonna do if I really want to know the charset? Thanks, guys. Best wishes, David -- ---------------------------------------------- Attitude determines everything ! ---------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100328/44686db7/attachment.htm From sergio at sergiomb.no-ip.org Sun Mar 28 12:11:37 2010 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Sun, 28 Mar 2010 11:11:37 +0100 Subject: [lxml-dev] How to get HTML charset ? In-Reply-To: References: Message-ID: <1269771097.2155.7.camel@segulix> On Sun, 2010-03-28 at 12:09 +0800, David Shieh wrote: > Hi all, > > I use lxml for a long time and it works fine for me. > But now, I get confused about the charset thing. When I want to get > the original charset of a html file, I used codes below: > > file_content = ''.join( > [i.rstrip('\r\n ').lstrip() for i in > response.readlines()] > ) > html = lxml.html.fromstring(file_content) > for i in html.xpath('head/meta'): xpath('.//meta[@http-equiv="Content-Type"]/@content') I don't know if match with content-type (lower case) if not xpath('.//meta[re:test(@http-equiv, "^Content-Type$", "i")]', namespaces={"re": "http://exslt.org/regular-expressions"}) > print lxml.html.tostring(i) > > Surprisingly, there's no output of any http-equiv="Content-Type" .. /> element. So, how can I know the > original charset of this html? 
> BTW, I used urllib2 to get charset, using the codes below:
>
> req = urllib2.Request(url)
> try:
>     response = urllib2.urlopen(req)
> except HTTPError, e:
>     print e.code
> else:
>     print response.headers.getheader('Content-Type')
>
> Not every sites return its charset, some sites don't return any
> charset information.
> What I gonna do if I really want to know the charset?
>
> Thanks, guys.
>
> Best wishes,
> David

--
Sérgio M. B.

From jholg at gmx.de Mon Mar 29 16:49:10 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 29 Mar 2010 16:49:10 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
Message-ID: <20100329144910.108930@gmx.net>

Hi,

I just noticed that ObjectifiedDataElements can not be used as dict keys (I expected that) but ObjectifiedElements can (which I expected not):

>>> root = objectify.Element('root')
>>> {root: 1}
{: 1}
>>> root.s = "some string"
>>> {root.s: 1}
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unhashable type
>>>
>>> hash(root)
4445024
>>> hash(root.s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unhashable type
>>>

But:

>>> root.s.__hash__
>>> root.s.__hash__()
4444928
>>>

So I'm obviously missing something about the hashability rules. Any quick hint on that?

Holger
From Tim.Arnold at sas.com Mon Mar 29 19:29:18 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Mon, 29 Mar 2010 13:29:18 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To: <4BAD06BF.8080405@behnel.de>
References: <4BAD06BF.8080405@behnel.de>
Message-ID:

> -----Original Message-----
> From: Stefan Behnel [mailto:stefan_ml at behnel.de]
> Sent: Friday, March 26, 2010 3:11 PM
> To: Tim Arnold
> Cc: lxml-dev at codespeak.net
> Subject: Re: [lxml-dev] multiple manipulation of xml file, some are ignored
>
> Tim Arnold, 26.03.2010 18:39:
> > From: Jens Quade
> >> On 26.03.2010, at 18:19, Tim Arnold wrote:
> >>> I apply several manipulations on an xml document and it *seems* like some
> >>> of them are ignored.
> >>> For example, the fix_nested_optional method is called last in a sequence
> >>> of manipulations:
> >>> -----------------------------------
> >>> xns = {'d':'http://docbook.org/ns/docbook'}
> >>> class DocBookProcessor(object):
> >>>     def __init__(self, trees):
> >>>         self.trees = trees
> >>>
> >>>     def process(self):
> >>>         for _, tree in self.trees.items():
> >>>             ... many methods called .....
> >>>             self.fix_nested_optional()
> >>>         return self.trees
> >>>
> >>>     def fix_nested_optional(self):
> >>>         for optional in self.tree.xpath('//d:optional/d:optional', namespaces=xns):
> >>>             optional.tag = 'phrase'
> >>> -----------------------------------
> >>>
> >>> But when the tree is written out, I still have nested optional tags. In
> >>> fact if I apply the same function to the newly written file, the nested
> >>> optionals are taken care of.
> >>
> >> where does self.tree come from? Is it part of self.trees?
> >> wouldn't it be clearer if "tree" was a parameter to
> >> fix_nested_optional(self, tree)
>
> I second that.
>
> >>> How can this be? Is the lxml document changed immediately
>
> Yes.
>
> > Each tree is an lxml document representing a chapter in a book.
> > The loop called 'process' above sets 'self.tree' to tree and then calls the
> > methods. I think you're right though, just sending tree as the argument
> > to the methods would be cleaner. The current method looks like this:
> >
> >     def process(self):
> >         for _, tree in self.trees.items():
> >             self.tree = tree
> >             self.fix_options()
> >             self.fix_optionalias()
> >
> >             self.create_outputs()
> >             self.clean_bibliography()
> >             self.drop_pdftext()
> >             self.fix_SAS_output()
> >             self.drop_empty_elem('para')
> >             self.drop_empty_elem('blockquote')
> >             self.drop_elem_with_inlineequation('indexterm')
> >             self.fix_nested_optional()
> >         return self.trees
>
> Looking at your pipeline, it's quite possible that you messed up your
> namespaces somewhere along the path. You may have added elements to the
> tree that do not have a namespace (or maybe renamed their tags), which then
> can't be found by the namespaced XPath expression.
>
> To debug, print the namespaced tag names between two pipeline steps:
>
>     for el in tree.iter():
>         print el.tag
>
> That being said, without a deeper look into your code it's impossible to
> figure out what's going wrong and where. Try to strip down the pipeline by
> eliminating steps that do not induce problems, and reduce your code to an
> easily testable example that reproduces the problem.
>
> Stefan

Thanks for your input Stefan. You're right--I had messed up my namespaces by changing a tag name without prepending the appropriate namespace. Once I changed that, things started working.

Also, thank you for the comment about the inefficient code in reading in the XML. I now just use etree to parse the file in the first step without resorting to codecs.

The xml:id code-snippet you sent works well. I understood it as I read the code, but it wasn't something I would have figured out without your help.

In short, thanks to you and to Jens Quade, my workflow from LaTeX to DocBook5 is working now and I'm starting to clean things up better.
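The namespace mistake described above can be reproduced in a few lines. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, on the assumption that Clark-notation tag names behave the same way in both. The sample document and the rename_nested helper are hypothetical and not taken from the thread.

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"
xns = {"d": DOCBOOK_NS}

# Hypothetical sample document, not taken from the thread.
SAMPLE = (
    '<para xmlns="%s">'
    "<optional>a<optional>b</optional></optional>"
    "</para>" % DOCBOOK_NS
)


def rename_nested(new_tag):
    """Rename every optional nested directly in an optional; return the root."""
    root = ET.fromstring(SAMPLE)
    for optional in root.findall(".//d:optional/d:optional", xns):
        optional.tag = new_tag
    return root


# A bare tag name puts the renamed element in no namespace at all ...
bare = rename_nested("phrase")
bare_tags = [el.tag for el in bare.iter()]
bare_hits = bare.findall(".//d:phrase", xns)

# ... while Clark notation ({namespace}name) keeps it in the DocBook namespace,
# so later namespaced searches still find it.
qualified = rename_nested("{%s}phrase" % DOCBOOK_NS)
qualified_hits = qualified.findall(".//d:phrase", xns)
```

Printing el.tag for every element, as suggested in the quoted reply, makes the difference visible: the bare rename shows up as a tag without the {namespace} prefix.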
thanks very much for your help,
--Tim Arnold

From stefan_ml at behnel.de Mon Mar 29 21:56:07 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 29 Mar 2010 21:56:07 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <20100329144910.108930@gmx.net>
References: <20100329144910.108930@gmx.net>
Message-ID: <4BB105D7.5080604@behnel.de>

Hi Holger,

it's funny to see exactly the same question come up on the Cython mailing list and here within just a couple of days.

jholg at gmx.de, 29.03.2010 16:49:
> I just noticed that ObjectifiedDataElements can not be used as dict
> keys (I expected that) but ObjectifiedElements can (which I expected
> not):
>
>>>> root = objectify.Element('root')
>>>> {root: 1}
> {: 1}
>>>> root.s = "some string"
>>>> {root.s: 1}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: unhashable type
>>>>
>
>>>> hash(root)
> 4445024
>>>> hash(root.s)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: unhashable type
>>>>
>
> But:
>
>>>> root.s.__hash__
>
>>>> root.s.__hash__()
> 4444928
>>>>
>
> So I'm obviously missing something about the hashability rules.
> Any quick hint on that?

http://docs.python.org/c-api/typeobj.html#tp_compare
http://docs.python.org/reference/datamodel.html#object.__hash__

The reason is that ODE (ObjectifiedDataElement) overrides __richcmp__ but not __hash__, whereas the base class (OE, ObjectifiedElement) overrides neither of the two. The CPython runtime lets the OE type inherit both from its base class in this case, whereas it considers the ODE type unhashable and lets it inherit neither.

The currently proposed solution is to fix this in Cython by automatically setting up both if they are implemented within the type hierarchy. However, the quick fix is to add a __hash__ to ODE that returns the base type's hash value.
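The same rule can be observed with plain Python classes. The following is a rough analogy rather than lxml's actual code: it assumes Python 3, where a class that defines __eq__ without __hash__ has its __hash__ set to None. The class names Base, Data and QuickFix are made up for the illustration.

```python
class Base:
    """Stands in for the base type: defines neither __eq__ nor __hash__."""


class Data(Base):
    """Stands in for ODE: defines comparison but no hash."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Data) and self.value == other.value


class QuickFix(Data):
    """The quick fix: explicitly reuse the base type's (identity-based) hash."""

    __hash__ = Base.__hash__


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    "Base": is_hashable(Base()),
    "Data": is_hashable(Data("x")),
    "QuickFix": is_hashable(QuickFix("x")),
}
```

Note that reusing an identity-based hash means two objects that compare equal can still hash differently, so they would not reliably find each other as dict keys.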
Stefan

From dkuhlman at rexx.com Mon Mar 29 23:48:31 2010
From: dkuhlman at rexx.com (Dave Kuhlman)
Date: Mon, 29 Mar 2010 14:48:31 -0700
Subject: [lxml-dev] Tempory data attached to custom subclasses
Message-ID: <20100329214830.GA21855@cutter.rexx.com>

I've been using the custom subclasses capability of lxml. It's slick.

I do, however, miss the ability to attach temporary data to the ElementBase subclasses. (see the warnings under "Element initialization" at http://codespeak.net/lxml/element_classes.html)

I can, as suggested by the docs, add attributes or children to the underlying etree.Element, but that means that I'd have to strip that temporary data off when I want to serialize the tree.

(please stop me if you've already heard this request, or if there is another solution.)

I'd have a solution (see below) to this need if I could get a value, say an ID, (1) that is unique to each node and (2) that does not change during the existence of the ElementTree. Note that this "ID" does not have to be meaningful, and does not need to enable me to do anything with the underlying XML object (other than re-identify it).

If I could get this opaque ID (or whatever it might be called), then I could use a dictionary and something like the following to store and retrieve temporary data::

    Datadict1 = {}

    def get_temp_data(node, datadict):
        id = node.get_opaque_id()
        if id in datadict:
            return datadict[id]
        else:
            data = {}
            datadict[id] = data
            return data

    def test():
        doc = lxml.etree.parse('somedoc.xml')
        root = doc.getroot()
        node = root[0]
        data = get_temp_data(node, Datadict1)
        value1 = 'some temporary data'
        data['key1'] = value1
        o
        o
        o
        data = get_temp_data(node, Datadict1)
        print data['key1']

    test()

Looking at lxml-2.2.4/src/lxml/lxml.etree.pyx, it seems like that would be a trivial function to add. (see below)

What do you think? It's a pretty simple solution. Has it been tried or rejected already?

Here is a patch that seems to add the necessary function.
This function returns the C pointer to the libxml2 object that is underneath the lxml/etree object. Am I right that this value would be (1) unique and (2) persistent across the lifetime of the lxml/etree ElementTree?

Index: lxml.etree.pyx
===================================================================
--- lxml.etree.pyx (revision 71999)
+++ lxml.etree.pyx (working copy)
@@ -1185,6 +1185,21 @@
             return None
         return _elementFactory(self._doc, c_node)
 
+    def getopaqueid(self):
+        u"""getopaqueid(self)
+
+        Returns an opaque ID for the underlying XML C node. This
+        opaque ID is guaranteed (1) to be unique to each node
+        and (2) not to change during the existence of the
+        ElementTree.
+        """
+        cdef xmlNode* c_node
+        cdef int intnode
+        c_node = self._c_node
+        intnode = <int>c_node
+        opaqueid = intnode
+        return opaqueid
+
     def getnext(self):
         u"""getnext(self)

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman

From jholg at gmx.de Tue Mar 30 16:29:33 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 30 Mar 2010 16:29:33 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <4BB105D7.5080604@behnel.de>
References: <20100329144910.108930@gmx.net> <4BB105D7.5080604@behnel.de>
Message-ID: <20100330142933.271000@gmx.net>

Hi,

> > I just noticed that ObjectifiedDataElements can not be used as dict
> > keys (I expected that) but ObjectifiedElements can (which I expected
> > not):
> >
> > [...]
> http://docs.python.org/c-api/typeobj.html#tp_compare
> http://docs.python.org/reference/datamodel.html#object.__hash__
>
> The reason is that ODE overrides __richcmp__ but not __hash__, whereas the
> baseclass (OE) overrides none of the two. The CPython runtime lets the OE
> type inherit both from the baseclass in this case, whereas it considers the
> ODE type an unhashable type and inherits none.

Ah, thanks a lot for this explanation. I'd probably have had a hard time finding this out in detail.
> The currently proposed solution is to fix this in Cython by automatically
> setting up both if they are implemented within the type hierarchy. However,
> the quick fix is to add a __hash__ to ODE that returns the base type's hash
> value.

ODE should probably get a __hash__ that returns the underlying pyval hash results rather than the hash of its .text, anyway, then.

Holger

From Joe at skyscanner.net Tue Mar 30 16:53:52 2010
From: Joe at skyscanner.net (Joe Sarre)
Date: Tue, 30 Mar 2010 15:53:52 +0100
Subject: [lxml-dev] lxml iterparse generator not returning anything
Message-ID:

Hi everyone,

I'm finding that when using iterparse, the generator always throws StopIteration immediately, without returning any data. I must be doing something wrong, or I must have some kind of setup problem, but I'm struggling to work out what it is. If anybody has any ideas, then that would be greatly appreciated, or if this is a bug, I will raise it on the bug tracker.

My version details are:

>>> print etree.LXML_VERSION
(2, 2, 2, 0)
>>> print etree.LIBXML_VERSION
(2, 7, 6)
>>> print etree.LIBXML_COMPILED_VERSION
(2, 7, 3)
>>> print etree.LIBXSLT_VERSION
(1, 1, 26)
>>> print etree.LIBXSLT_COMPILED_VERSION
(1, 1, 24)

The most striking thing about this is that LIBXML_VERSION != LIBXML_COMPILED_VERSION, and LIBXSLT_VERSION != LIBXSLT_COMPILED_VERSION. If this version discrepancy is the real cause of the problem, then I think this issue is perhaps more appropriate for the Fedora mailing list, and you can ignore the rest of this mail.

An example in which I am seeing this (taken from http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk) is:

"""
>>> from lxml import etree
>>> from StringIO import StringIO
>>> xml = '''
... text
... texttail
...
...
'''
>>> print xml
text
texttail
>>> context = etree.iterparse(StringIO(xml))
>>> for action, elem in context:
...     print("%s: %s" % (action, elem.tag))
end: element
end: element
end: {http://testns/}empty-element
end: root
"""

if __name__ == '__main__':
    import doctest
    doctest.testmod()

The result of putting this in a file and running it is that python complains:

**********************************************************************
File "test.py", line 20, in __main__
Failed example:
    for action, elem in context:
        print("%s: %s" % (action, elem.tag))
Expected:
    end: element
    end: element
    end: {http://testns/}empty-element
    end: root
Got nothing
**********************************************************************
1 items had failures:
   1 of   6 in __main__
***Test Failed*** 1 failures.

Thanks in advance for any help,
Joe Sarre

From ethan.jucovy at gmail.com Tue Mar 30 16:57:38 2010
From: ethan.jucovy at gmail.com (Ethan Jucovy)
Date: Tue, 30 Mar 2010 10:57:38 -0400
Subject: [lxml-dev] How to get HTML charset ?
In-Reply-To:
References:
Message-ID:

On Sun, Mar 28, 2010 at 12:09 AM, David Shieh wrote:
> Hi all,
>
> I use lxml for a long time and it works fine for me.
> But now, I get confused about the charset thing. When I want to get the
> original charset of a html file, I used codes below:
>
> file_content = ''.join(
>     [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
> )
> html = lxml.html.fromstring(file_content)
> for i in html.xpath('head/meta'):
>     print lxml.html.tostring(i)
>
> Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. />
> element. So, how can I know the original charset of this html?

You need to pass the kwarg `include_meta_content_type=True` to `tostring`, or the Content-Type meta tag will always be stripped on the way out --

>>> from lxml.html import fromstring, tostring
>>> x=fromstring("""""")
>>> x.xpath("head/meta")
[]
>>> [tostring(u) for u in x.xpath("head/meta")]
['']
>>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
['']

From stefan_ml at behnel.de Wed Mar 31 10:34:34 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 31 Mar 2010 10:34:34 +0200
Subject: [lxml-dev] Tempory data attached to custom subclasses
In-Reply-To: <20100329214830.GA21855@cutter.rexx.com>
References: <20100329214830.GA21855@cutter.rexx.com>
Message-ID: <4BB3091A.5090708@behnel.de>

Dave Kuhlman, 29.03.2010 23:48:
> I've been using the custom subclasses capability of lxml. It's
> slick.
>
> I do, however, miss the ability to attach temporary data to the
> ElementBase subclasses. (see the warnings under "Element
> initialization" at http://codespeak.net/lxml/element_classes.html)
>
> I can, as suggested by the docs, add attributes or children to the
> underlying etree.Element, but that means that I'd have to strip
> that temporary data off when I want to serialize the tree.

As long as your tree doesn't change, the easiest solution is to keep a reference to all Elements ("list(root.iter())") and then just store the data in the proxy instances. They are guaranteed not to change as long as there is a live reference to them.

If your tree changes, you can still try to add new Elements to your keep-alive list to get the same behaviour, but you may need to take a little more care when you remove elements, so that you only remove them from the keep-alive list when you are sure they'll get discarded.
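A minimal sketch of the keep-alive approach might look as follows. It assumes lxml is installed and that the elements are instances of a Python subclass of ElementBase, which is what allows arbitrary attributes to be set on them. The class name Node, the attribute name temp and the sample document are made up for the illustration.

```python
from lxml import etree


class Node(etree.ElementBase):
    """Hypothetical custom element class; Python subclasses can hold attributes."""


# Make the parser create Node instances for all elements.
parser = etree.XMLParser()
parser.set_element_class_lookup(etree.ElementDefaultClassLookup(element=Node))
root = etree.fromstring("<root><a/><b/></root>", parser)

# Keep every proxy alive so that data stored on it is not lost.
keep_alive = list(root.iter())  # root, a, b

# Store temporary data directly on the proxy instance ...
root[0].temp = {"key1": "some temporary data"}

# ... and read it back later through a fresh lookup of the same element.
value = root[0].temp["key1"]

# The temporary data never reaches the serialized tree.
serialized = etree.tostring(root)
```

Without the keep_alive list, the proxy for an element can be garbage collected between the two accesses, and a newly created proxy would no longer carry the temp attribute.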
> I'd have a solution (see below) to this need if I could get a
> value, say an ID, (1) that is unique to each node and (2) that does
> not change during the existence of the ElementTree. Note that this
> "ID" does not have to be meaningful, and does not need to enable me
> to do anything with the underlying XML object (other than
> re-identify it).
>
> If I could get this opaque ID (or whatever it might be called),
> then I could use a dictionary and something like the following to
> store and retrieve temporary data:

I usually suggest using the generated XPath of the element:

http://codespeak.net/lxml/xpathxslt.html#generating-xpath-expressions

But that's certainly more expensive than just returning a Py_ssize_t value.

Stefan

From stefan_ml at behnel.de Wed Mar 31 12:55:42 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 31 Mar 2010 12:55:42 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <20100330142933.271000@gmx.net>
References: <20100329144910.108930@gmx.net> <4BB105D7.5080604@behnel.de> <20100330142933.271000@gmx.net>
Message-ID: <4BB32A2E.1000408@behnel.de>

jholg at gmx.de, 30.03.2010 16:29:
> Stefan Behnel:
>> The currently proposed solution is to fix this in Cython by automatically
>> setting up both if they are implemented within the type hierarchy.

... although this hasn't been decided yet. I guess we'll end up going with the Py3 semantics here (i.e. the current semantics in Cython anyway), and just emit a warning.

> ODE should probably get a __hash__ that returns the underlying pyval
> hash results rather than the hash of its .text, anyway, then.

Done:

https://codespeak.net/viewvc/?view=rev&revision=73205

Stefan

From mykingheaven at gmail.com Wed Mar 31 14:01:28 2010
From: mykingheaven at gmail.com (David Shieh)
Date: Wed, 31 Mar 2010 20:01:28 +0800
Subject: [lxml-dev] How to get HTML charset ?
In-Reply-To:
References:
Message-ID:

2010/3/30 Ethan Jucovy

> On Sun, Mar 28, 2010 at 12:09 AM, David Shieh wrote:
> > Hi all,
> >
> > I use lxml for a long time and it works fine for me.
> > But now, I get confused about the charset thing. When I want to get the
> > original charset of a html file, I used codes below:
> >
> > file_content = ''.join(
> >     [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
> > )
> > html = lxml.html.fromstring(file_content)
> > for i in html.xpath('head/meta'):
> >     print lxml.html.tostring(i)
> >
> > Surprisingly, there's no output of any />
> > element. So, how can I know the original charset of this html?
>
> You need to pass the kwarg `include_meta_content_type=True` to
> `tostring`, or the tag will
> always be stripped on the way out --

I did manage to get the charset using Sergio's way, but I think your method is also great. I will add it to be safe. Thanks!

> >>> from lxml.html import fromstring, tostring
> >>> x=fromstring(""" content="text/html; charset=ASCII">""")
> >>> x.xpath("head/meta")
> []
> >>> [tostring(u) for u in x.xpath("head/meta")]
> ['']
> >>> [tostring(u, include_meta_content_type=True) for u in
> x.xpath("head/meta")]
> ['']

--
----------------------------------------------
Attitude determines everything !
----------------------------------------------
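Both the HTTP header and the content attribute of the meta tag carry a value of the form "text/html; charset=...", so one remaining step is pulling the charset out of that string. The sketch below uses only the Python 3 standard library. The function names and the order of preference (HTTP header first, then meta values) are assumptions made for this example, not something stated in the thread.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns the charset in lower case, or None
    return msg.get_content_charset()


def pick_charset(header_value, meta_values, default=None):
    """Try the HTTP header first, then each meta content value in order."""
    for candidate in [header_value] + list(meta_values):
        charset = charset_from_content_type(candidate)
        if charset:
            return charset
    return default


# Header without a charset, meta tag with one: the meta value is used.
found = pick_charset("text/html", ["text/html; charset=ASCII"])

# Neither source names a charset: fall back to the caller's default.
fallback = pick_charset(None, [], default="latin-1")
```

Here header_value would be the string printed by the urllib2 snippet, and meta_values the list of strings returned by the '@content' XPath expression shown earlier in the thread.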