From robert at smithpierce.net Mon Jun 1 18:21:01 2009
From: robert at smithpierce.net (Robert Pierce)
Date: Mon, 1 Jun 2009 09:21:01 -0700
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
Message-ID: <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>

The nil node is not deannotated as I would expect in the following
snippet. I could not find a reference to this behaviour in the archives
or documentation. Is this a design feature for which there is a
workaround, or a bug? I'm using lxml-2.2-py2.5-linux-i686.

Thanks!

#### CODE ####
import lxml.etree
import lxml.objectify

x = lxml.objectify.fromstring('')
x.Foo = ''
x.Fubar = None
lxml.objectify.deannotate(x)
lxml.etree.cleanup_namespaces(x)
print lxml.etree.tostring(x)
#### END CODE ###

From velvetcrafter.subscriber at gmail.com Mon Jun 1 22:36:56 2009
From: velvetcrafter.subscriber at gmail.com (Alexis Georges)
Date: Mon, 1 Jun 2009 16:36:56 -0400
Subject: [lxml-dev] XML Documents & I18N (the way Cocoon does it)
In-Reply-To: <49F74416.7050209@behnel.de>
References: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com>
 <49F74416.7050209@behnel.de>
Message-ID: <716CECA7-6326-47DC-9E74-5993E86ED754@gmail.com>

Hi,

This is a bit late, but thanks for the response.

I am playing around with iterparse() and am following the advice you
gave. I have a question though: I could not find a way to consume an
element and replace it with just text. For example hello when found in
the middle of a paragraph will be replaced by text. The replace() method
requires the replacement to be an element.

Is this possible?

Thanks!
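A minimal sketch of one way this could be done, offered as an illustration rather than as the answer given on the list. ElementTree-style trees keep text in the .text and .tail properties instead of in separate nodes, so the replacement text can be appended to the previous sibling's tail (or to the parent's text) before the element is removed. The helper name replace_with_text and the i18n:text sample markup are assumptions made up for this example.

```python
from lxml import etree

I18N_NS = "http://apache.org/cocoon/i18n/2.1"


def replace_with_text(element, text):
    """Hypothetical helper: drop `element` and leave `text` in its place."""
    parent = element.getparent()
    if parent is None:
        raise ValueError("the root element cannot be replaced by text")
    # In lxml the tail text is removed together with its element,
    # so it is carried over by hand.
    new_text = text + (element.tail or "")
    previous = element.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or "") + new_text
    else:
        parent.text = (parent.text or "") + new_text
    parent.remove(element)


# Sample document (assumed markup, not taken from the thread).
root = etree.fromstring(
    '<p xmlns:i18n="%s">Say <i18n:text>hello</i18n:text> to everyone.</p>'
    % I18N_NS)

# Collect the matches first so the tree is not modified while iterating.
for node in list(root.iter("{%s}text" % I18N_NS)):
    replace_with_text(node, "bonjour")

print(root.text)
```

The same helper can be called from an iterparse() loop on "end" events, since the element is complete at that point.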
Alexis Georges

On 28-Apr-09, at 1:59 PM, Stefan Behnel wrote:

> Hi,
>
> Alexis Georges wrote:
>> I am maintaining a multilingual website which works with XML, XSLT to
>> generate XHTML.
>>
>> I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using
>> (among other things) their I18NTransformer. Basically I can use
>> elements in the I18N (http://apache.org/cocoon/i18n/2.1) namespace,
>> and then tell Cocoon to apply the I18NTransformer to the document;
>> this replaces the I18N elements with a localized value (e.g. a
>> formatted date/number, a translated label/attribute, etc.).
>>
>> I have been looking at lxml a little bit to see if I could move to a
>> Python-based framework for the website. I am not quite sure how to go
>> about the I18N part though.
>>
>> Using the Babel library (http://babel.edgewall.org/) along with
>> request headers to generate localized data, I have everything I need.
>> What is missing is the "parser" for the I18N elements. All I can
>> think of right now is to implement a SAX parser, the way Cocoon does
>> (in Java).
>
> There is a SAX-like interface in lxml.etree, called "target parser".
>
> However, if your documents fit into memory, using iterparse() is a lot
> simpler (and likely not even much slower).
>
> Something like this might work:
>
>     context = etree.iterparse(
>         "somefile.xml",
>         tag = "{http://apache.org/cocoon/i18n/2.1}*")
>
>     for event, i18n_element in context:
>         new_element = get_i18n_replacement_for(i18n_element)
>         i18n_element.getparent().replace(i18n_element, new_element)
>
>     context.getroottree().write("newfile.xml")
>
> See here for some documentation:
>
> http://codespeak.net/lxml/parsing.html
>
> You can also achieve the same thing in XSLT, or using XPath, or ...
>
> Stefan

From qhlonline at 163.com Tue Jun 2 04:25:56 2009
From: qhlonline at 163.com (qhlonline)
Date: Tue, 2 Jun 2009 10:25:56 +0800 (CST)
Subject: [lxml-dev] lxml deconding problem caused by tag specification
Message-ID: <17958119.626231243909556557.JavaMail.coremail@bj163app72.163.com>

Hi, all

There are cases where an HTML file has meta tags but the charset
declared in the meta tag is not right, because the HTML content that
follows uses a different encoding. lxml will still parse according to
what the meta tag says. In this situation it may report a decoding
error, but sometimes it parses anyway and generates a DOM that is not
complete. E.g. I have a web file with a meta charset declaration while
the following content is encoded with GBK (which is a superset of
GB2312). We got a result with only part of the HTML tags parsed out. I
want to know whether lxml reports any warning or error information for
this situation. What is it? And how can we deal with this kind of
fault? Is there any common method?

I have also seen some HTML files that have the tag attribute "lang". I
don't know whether this attribute is used in the HTML parsing process.
In a meta tag there can also be a language statement, but in the
htmlCheckMeta method of the libxml2 library source I didn't find any
processing of the http-equiv attribute value "Content-Language". Is it
because "Content-Language" is not standard? Does lxml support this
attribute? If so, how does it deal with a content="zh-cn" declaration
when the content is actually in a different language?

yours
From jholg at gmx.de Tue Jun 2 09:59:02 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 02 Jun 2009 09:59:02 +0200
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
 <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
Message-ID: <20090602075902.62950@gmx.net>

Hi,

> The nil node is not deannotated as I would expect in the following
> snippet. I could not find a reference to this behaviour in the archives
> or documentation. Is this a design feature for which there is a work
> around, or a bug? I'm using lxml-2.2-py2.5-linux-i686.

Design feature. Only py:pytype/xsi:type attributes get removed by
deannotate():

>>> print etree.__version__
2.1.5
>>> help(objectify.deannotate)
Help on built-in function deannotate in module lxml.objectify:

deannotate(...)
    deannotate(element_or_tree, pytype=True, xsi=True)

    Recursively de-annotate the elements of an XML tree by removing
    'pytype' and/or 'type' attributes.

    If the 'pytype' keyword argument is True (the default), 'pytype'
    attributes will be removed. If the 'xsi' keyword argument is True
    (the default), 'xsi:type' attributes will be removed.

IMHO the xsi:nil concept in XML Schema pretty much corresponds to NULL
values in databases, i.e. a typed element/column may (or may not) be
xsi:nil/NULL, but it does not so directly translate to the distinct
Python None object. OTOH I think mapping xsi:nil to None very much
captures the meaning of xsi:nil/NULL, because in most use cases you'd
test if a value has been set (!=None) or not (==None).
Of course, you can always easily get rid of xsi:nil if you wish so:

>>> for elt in root.iter():
...     elt.attrib.pop('{http://www.w3.org/2001/XMLSchema-instance}nil', None)

Holger

From qhlonline at 163.com Tue Jun 2 11:04:45 2009
From: qhlonline at 163.com (qhlonline)
Date: Tue, 2 Jun 2009 17:04:45 +0800 (CST)
Subject: [lxml-dev] lxml about Target Parser
Message-ID: <20629233.873781243933485270.JavaMail.coremail@bj163app61.163.com>

Hi, all

When I use lxml with a self-defined target parser, there is a function
that can be redefined: data.

    def data(self, data):

When can we use it? And what will it do when we simply write a single
line: "return"? Is there any encoding conversion?

From stefan_ml at behnel.de Tue Jun 2 20:12:54 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Jun 2009 20:12:54 +0200
Subject: [lxml-dev] lxml about Target Parser
In-Reply-To: <20629233.873781243933485270.JavaMail.coremail@bj163app61.163.com>
References: <20629233.873781243933485270.JavaMail.coremail@bj163app61.163.com>
Message-ID: <4A256BA6.5020605@behnel.de>

qhlonline wrote:
> When I used the lxml with self defined Target Parser, There is a
> function that can be redefined-- data . def data (self, data): When can
> we use it?

when you want to receive character content from the document you parse.

> and what it will do when we simply write a single line: "return " ?

nothing? actually, a "pass" will do in that case, as will not
implementing the method (IIRC).

> Is there any encoding conversion?

You will get either ASCII encoded byte strings or unicode strings, just
like everywhere else.

BTW, it's sometimes faster to try these things out than to ask a mailing
list.
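For trying this out, a small self-contained sketch of a parser target whose data() method collects character content might look like the following. The class name TextCollector and the sample document are assumptions for illustration only; under Python 3 the data() argument is always a str, whereas the reply above describes the Python 2 behaviour.

```python
from lxml import etree


class TextCollector(object):
    """Hypothetical parser target that only records character data."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # an element was opened; ignored in this sketch

    def end(self, tag):
        pass  # an element was closed; ignored in this sketch

    def data(self, data):
        # Receives character content; one run of text may arrive
        # in several calls, so the pieces are collected in a list.
        self.chunks.append(data)

    def close(self):
        # Whatever close() returns becomes the result of the parse call.
        return "".join(self.chunks)


parser = etree.XMLParser(target=TextCollector())
result = etree.fromstring("<root>Hello <b>world</b>!</root>", parser)
print(result)
```

Leaving out the data() method simply means the character content is never handed to the target.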
Stefan

From stefan_ml at behnel.de Tue Jun 2 20:29:57 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Jun 2009 20:29:57 +0200
Subject: [lxml-dev] lxml deconding problem caused by tag specification
In-Reply-To: <17958119.626231243909556557.JavaMail.coremail@bj163app72.163.com>
References: <17958119.626231243909556557.JavaMail.coremail@bj163app72.163.com>
Message-ID: <4A256FA5.7020602@behnel.de>

Hi,

qhlonline wrote:
> There are instances that when an HTML file has meta tags, the
> charset declared in tag is not right, because the HTML content next is
> using a different encoding. But lxml will parse according to what said.
> In this situation, it may report error information of error decoding,
> but some times it can parse, and generate a DOM that is not complete.

By default, the HTML parser will ignore errors and try to keep parsing
regardless. Pass "recover=False" if you want to get an exception instead.

Note that character decoding errors cannot always be detected, as they may
lead to valid (although unreadable) characters even when the wrong encoding
is assumed. Latin-1 is a good example, which uses a plain 8-bit encoding.
It will work perfectly well to read a UTF-8 encoded document with a Latin-1
decoder. It just won't give you readable output in most cases.

> eg. I have a WEB file has while the following content is encoded with
> GBK(which is a superset of GB2312). We have got a result with only
> part of the HTML tags parsed out. I want to know, if lxml have any
> warning or error information reported for this situation? What it is?

See the error_log property on the parser.

http://codespeak.net/lxml/parsing.html#error-log

> Is there any common
> method? I have also seen some HTML files have tag attributes "lang", I
> don't know whether this attribute is used in the HTML parsing process.

I don't think so.
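The Latin-1 remark above can be checked with plain Python, without any parser involved. This is only an illustrative sketch; the sample string is an assumption. Every byte value is a valid Latin-1 character, so decoding UTF-8 bytes as Latin-1 never fails, it just yields the wrong characters.

```python
# Sample text with two non-ASCII characters ("Gruesse" with u-umlaut
# and sharp s), written with escapes to keep the source ASCII-only.
original = u"Gr\u00fc\u00dfe"
utf8_bytes = original.encode("utf-8")

# Decoding with the wrong codec raises no error: each of the seven
# bytes maps to exactly one Latin-1 character.
misread = utf8_bytes.decode("latin-1")

print(len(original), len(misread))
print(repr(misread))

# The damage is reversible as long as nothing else touched the text.
repaired = misread.encode("latin-1").decode("utf-8")
```

This is why a wrong charset declaration can go unnoticed by the parser: nothing in the byte stream contradicts it.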
> In meta tag like , there are also language statement, But in the
> htmlCheckMeta method of libxml2 library source, I didn't find any
> processing with the http-equiv attribute value "Content-Language".

The "language" is not relevant to the parser. The charset is. Just think of
UTF-8, which can encode any written language that uses characters defined
in Unicode.

Stefan

From stefan_ml at behnel.de Tue Jun 2 21:24:01 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Jun 2009 21:24:01 +0200
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <20090602075902.62950@gmx.net>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
 <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
 <20090602075902.62950@gmx.net>
Message-ID: <4A257C51.2050203@behnel.de>

Hi,

Holger wrote:
>> The nil node is not deannotated as I would expect in the following
>> snippet. I could not find a reference to this behaviour in the archives
>> or documentation. Is this a design feature for which there is a work
>> around, or a bug? I'm using lxml-2.2-py2.5-linux-i686.
>
> Design feature.

I'd be a little more careful with such a big word. ;)

> Only py:pytype/xsi:type attributes get removed by deannotate():
>
> >>> print etree.__version__
> 2.1.5
> >>> help(objectify.deannotate)
>
> Help on built-in function deannotate in module lxml.objectify:
>
> deannotate(...)
>     deannotate(element_or_tree, pytype=True, xsi=True)
>
>     Recursively de-annotate the elements of an XML tree by removing
>     'pytype' and/or 'type' attributes.
>
>     If the 'pytype' keyword argument is True (the default), 'pytype'
>     attributes will be removed. If the 'xsi' keyword argument is True
>     (the default), 'xsi:type' attributes will be removed.

Yes, so it's even implicitly documented. :)

Anyway, I'm not sure it's always a good idea to leave this special case in
instead of cleaning everything up.
I think if you remove it, you'd get an empty string result, which may be
surprising - but more surprising than not getting it cleaned up? After
all, deannotate() means deannotate()...

Stefan

From stefan_ml at behnel.de Tue Jun 2 22:13:13 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 02 Jun 2009 22:13:13 +0200
Subject: [lxml-dev] lxml 2.2.1 released
Message-ID: <4A2587D9.5040903@behnel.de>

Hi all,

I just pushed lxml 2.2.1 to PyPI as a minor maintenance release.
Changelog follows below.

This release was built with Cython 0.11.2.

Have fun,

Stefan

2.2.1 (2009-06-02)

Features added

* Injecting default attributes into a document during XML Schema
  validation (also at parse time).

* Pass huge_tree parser option to disable parser security restrictions
  imposed by libxml2 2.7.

Bugs fixed

* The script for statically building libxml2 and libxslt didn't work in
  Py3.

* XMLSchema() also passes invalid schema documents on to libxml2 for
  parsing (which could lead to a crash before release 2.6.24).

From robert at smithpierce.net Wed Jun 3 02:43:38 2009
From: robert at smithpierce.net (Robert Pierce)
Date: Tue, 2 Jun 2009 17:43:38 -0700
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <4A257C51.2050203@behnel.de>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
 <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
 <20090602075902.62950@gmx.net>
 <4A257C51.2050203@behnel.de>
Message-ID: <8dbf5ac80906021743k147a78a0vae060dd5e8a8995f@mail.gmail.com>

Thanks! That answers my questions.

The apparent asymmetry of handling nodes was confusing, but the
distinction of pytypes vs xsi makes some sense.

I would naively agree that a seemingly general purpose function like
deannotate should remove everything. Otherwise, I have to walk the tree
twice: once with deannotate and once to unlink the remaining nil
attributes. Or recreate my own deannotate(). Not a big deal either way,
though.
On Tue, Jun 2, 2009 at 12:24 PM, Stefan Behnel wrote:

> Hi,
>
> Holger wrote:
> >> The nil node is not deannotated as I would expect in the following
> >> snippet. I could not find a reference to this behaviour in the
> >> archives or documentation. Is this a design feature for which there
> >> is a work around, or a bug? I'm using lxml-2.2-py2.5-linux-i686.
> >
> > Design feature.
>
> I'd be a little more careful with such a big word. ;)
>
> > Only py:pytype/xsi:type attributes get removed by deannotate():
> >
> > >>> print etree.__version__
> > 2.1.5
> > >>> help(objectify.deannotate)
> >
> > Help on built-in function deannotate in module lxml.objectify:
> >
> > deannotate(...)
> >     deannotate(element_or_tree, pytype=True, xsi=True)
> >
> >     Recursively de-annotate the elements of an XML tree by removing
> >     'pytype' and/or 'type' attributes.
> >
> >     If the 'pytype' keyword argument is True (the default), 'pytype'
> >     attributes will be removed. If the 'xsi' keyword argument is True
> >     (the default), 'xsi:type' attributes will be removed.
>
> Yes, so it's even implicitly documented. :)
>
> Anyway, I'm not sure it's always a good idea to leave this special case
> in instead of cleaning everything up. I think if you remove it, you'd
> get an empty string result, which may be surprising - but more
> surprising than not getting it cleaned up? After all, deannotate()
> means deannotate()...
>
> Stefan
From qhlonline at 163.com Wed Jun 3 05:41:45 2009
From: qhlonline at 163.com (qhlonline)
Date: Wed, 3 Jun 2009 11:41:45 +0800 (CST)
Subject: [lxml-dev] lxml about Target Parser
In-Reply-To: <4A256BA6.5020605@behnel.de>
References: <4A256BA6.5020605@behnel.de>
 <20629233.873781243933485270.JavaMail.coremail@bj163app61.163.com>
Message-ID: <13610593.167811244000505954.JavaMail.coremail@bj163app59.163.com>

On 2009-06-03, "Stefan Behnel" wrote:
>
> qhlonline wrote:
>> When I used the lxml with self defined Target Parser, There is a
>> function that can be redefined-- data . def data (self, data): When can
>> we use it?
>
> when you want to receive character content from the document you parse.
>
>> and what it will do when we simply write a single line: "return " ?
>
> nothing? actually, a "pass" will do in that case, as will not
> implementing the method (IIRC).
>
>> Is there any encoding conversion?
>
> You will get either ASCII encoded byte strings or unicode strings, just
> like everywhere else.
>
> BTW, it's sometimes faster to try these things out than to ask a
> mailing list.
>
> Stefan

Hi, Stefan

My last mail mixed up the meta tag problem and the target parser data
function problem. I have made some tests, and the results show that they
are separate problems. When I do not define the data function in my
target parser, the decoding error I get while parsing
http://www.jiayuan.com/ goes away. But that still does not solve the
problem of partial parsing caused by the encoding declaration: e.g.
http://www.sina.com/ can be parsed, but the result is incomplete. I have
dealt with this problem in two ways.

The first one is to change the content being parsed. After reading the
HTML string from http://www.sina.com/, I changed the content="charset **"
attribute value of every meta tag to content="" to avoid the encoding
switch in libxml2.
This method is somewhat dangerous, because most of the time the
declaration should be respected.

The second method is for Chinese web pages only. The largest Chinese
character set is currently GB18030, so I changed the libxml2 source code
to make GB18030 the fixed decoder. But this method can only solve the
declaration errors of Chinese web pages (a page declaring an encoding
different from that of its content), and I don't know whether web pages
in other languages have irregular declarations like the Chinese ones.

Although the http://www.jiayuan.com/ decoding error has been solved, I
just don't know why. I found the trick of leaving out the data function
of my target parser through many tests, and I'm still searching for the
reason. Could you give me some suggestions?

From qhlonline at 163.com Wed Jun 3 08:37:57 2009
From: qhlonline at 163.com (qhlonline)
Date: Wed, 3 Jun 2009 14:37:57 +0800 (CST)
Subject: [lxml-dev] lxml deconding problem caused by tag specification
In-Reply-To: <4A256FA5.7020602@behnel.de>
References: <4A256FA5.7020602@behnel.de>
 <17958119.626231243909556557.JavaMail.coremail@bj163app72.163.com>
Message-ID: <28169648.249391244011077738.JavaMail.coremail@bj163app59.163.com>

Hi,

On 2009-06-03, "Stefan Behnel" wrote:
> Hi,
>
> qhlonline wrote:
>> There are instances that when an HTML file has meta tags, the
>> charset declared in tag is not right, because the HTML content next is
>> using a different encoding. But lxml will parse according to what said.
>> In this situation, it may report error information of error decoding,
>> but some times it can parse, and generate a DOM that is not complete.
>
> By default, the HTML parser will ignore errors and try to keep parsing
> regardless. Pass "recover=False" if you want to get an exception instead.
>
> Note that character decoding errors cannot always be detected, as they may
> lead to valid (although unreadable) characters even when the wrong encoding
> is assumed. Latin-1 is a good example, which uses a plain 8-bit encoding.
> It will work perfectly well to read a UTF-8 encoded document with a Latin-1
> decoder. It just won't give you readable output in most cases.
>
>> eg. I have a WEB file has while the following content is encoded with
>> GBK(which is a superset of GB2312). We have got a result with only
>> part of the HTML tags parsed out. I want to know, if lxml have any
>> warning or error information reported for this situation? What it is?
>
> See the error_log property on the parser.
>
> http://codespeak.net/lxml/parsing.html#error-log
>

I have tried to get error information through parser.error_log. Most of
the log messages are like "Element script embeds close tag", and I know
these errors must have been recovered from, because by default the HTML
parser has "recover=True". But there is still some useful information.
For the charset-related incomplete parsing of http://www.sina.com/, the
corresponding log entry is: "input conversion failed due to input error,
bytes 0xAD 0x5A 0xB6 0xF9".

In another mail I described the failure when parsing
http://www.jiayuan.com/ with a target parser. The page of this site has a
problem: it is encoded with GB18030 but declares utf-8. If my target
parser has a data function defined, it reports:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbe in position 10: unexpected code byte.

Then the target parser aborts and no error_log is available. If I do not
define the data function in my target parser, the parser produces a good
result, but the error_log contains: "Input is not proper UTF-8, indicate
encoding !". Maybe this problem was recovered from by the lxml parser
itself, which is why I can get the result.
Only when I change the HTML content of http://www.jiayuan.com/ so that
its meta tag declares gb18030 (the correct encoding of this HTML file) is
the page parsed smoothly, with no encoding-related entries left in the
error_log.

>> Is there any common
>> method? I have also seen some HTML files have tag attributes "lang", I
>> don't know whether this attribute is used in the HTML parsing process.
>
> I don't think so.
>
>> In meta tag like , there are also language statement, But in the
>> htmlCheckMeta method of libxml2 library source, I didn't find any
>> processing with the http-equiv attribute value "Content-Language".
>
> The "language" is not relevant to the parser. The charset is. Just think of
> UTF-8, which can encode any written language that uses characters defined
> in Unicode.
>
> Stefan
>

From jholg at gmx.de Wed Jun 3 08:58:58 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 03 Jun 2009 08:58:58 +0200
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <4A257C51.2050203@behnel.de>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
 <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
 <20090602075902.62950@gmx.net>
 <4A257C51.2050203@behnel.de>
Message-ID: <20090603065858.115690@gmx.net>

Hi,

> >> The nil node is not deannotated as I would expect in the following
> >> snippet. I could not find a reference to this behaviour in the
> >> archives or documentation. Is this a design feature for which there
> >> is a work around, or a bug? I'm using lxml-2.2-py2.5-linux-i686.
> >
> > Design feature.
>
> I'd be a little more careful with such a big word. ;)

Well, it's definitely not a bug :)

> Yes, so it's even implicitly documented.
> :)
>
> Anyway, I'm not sure it's always a good idea to leave this special case in
> instead of cleaning everything up. I think if you remove it, you'd get an
> empty string result, which may be surprising - but more surprising than
> not getting it cleaned up? After all, deannotate() means deannotate()...

But deannotate() cares about type attributes and nil is not exactly a
type attribute. We annotate the tree to have help in mapping to proper
Python types, but xsi:nil can well show up in any non-annotated document.
Of course, we make *use* of it for the type lookup system, both by
interpreting it if it's there and by setting it for None assignment, but
that still does not make it a type annotation attribute IMHO.

Consider this use case:

>>> root = objectify.fromstring("""
... """)
>>> print etree.tostring(root)
>>> objectify.deannotate(root) # Should this *remove* xsi:nil?!
>>> print etree.tostring(root)
>>>

I wouldn't want deannotate() to remove xsi:nil here.

What's the use case for a deannotate() that removes xsi:nil? Why not just
assign '' instead of None and deannotate() afterwards?

A compromise may be to add another keyword arg "nil" to deannotate() to
allow for xsi:nil removal if needed (defaults to False, of course :)

Holger

From qhlonline at 163.com Wed Jun 3 14:35:39 2009
From: qhlonline at 163.com (qhlonline)
Date: Wed, 3 Jun 2009 20:35:39 +0800 (CST)
Subject: [lxml-dev] Target parser parsing error
Message-ID: <1733034.434611244032539801.JavaMail.coremail@bj163app25.163.com>

Hi, all

Here is more information about the parsing error I get when I use a
target parser to parse http://www.jiayuan.com/. The fatal error reported
is: Input is not proper UTF-8, indicate encoding ! To find the exact
place where this problem occurs, I tried to convert the encoding of the
HTML string with iconv directly.
This also reports an error, and the index of the offending character in
the string is exactly the same as in my lxml test. So it is now clear
that this parsing error is caused by iconv's encoding conversion from
utf-8 to utf-8 when there are illegal characters in the source. When I do
not define the data function in my target parser, it parses without
reporting an error. Does this mean that when I leave out the data
function, the UTF-8 to UTF-8 conversion is also skipped? Or has some
correct conversion been done before the call to the data function?

yours

From stefan_ml at behnel.de Wed Jun 3 15:34:24 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 3 Jun 2009 15:34:24 +0200 (CEST)
Subject: [lxml-dev] Target parser parsing error
In-Reply-To: <1733034.434611244032539801.JavaMail.coremail@bj163app25.163.com>
References: <1733034.434611244032539801.JavaMail.coremail@bj163app25.163.com>
Message-ID: <14f1f2265bcf5f788498ab3ccf4e7854.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

qhlonline wrote:
> There are more informations about my parsing error when I use target
> parser to parse http://www.jiayuan.com/ . The fatal error reported out
> is: Input is not proper UTF-8, indicate encoding ! To find the real
> place where this problem occured, I have tried to convert the HTML
> string encoding with iconv directly. This time it also report error,
> and the error character index in string is just the same with my lxml
> test. Now things are clear that this parsing error is caused by
> encoding conversion of iconv from utf-8 to utf-8 when there are illegal
> characters in the source.

What do you mean by "from utf-8 to utf-8" conversion?

> When I do not define the data function in my
> target parser, It will parse without error report.
> Is it means that
> when I escape the data function , the UTF-8 to UTF-8 conversion is
> also escaped ? Or some correct conversion has been done before the call
> to the data function ?

It just means that the parser has ignored your character content.

There are two levels here. The libxml2 parser will parse the byte stream
and try to convert it to UTF-8. If that fails but it is asked to
"recover" from it, it will just continue without raising an error. Not
sure what becomes of the data in this case, but apparently there is no
guarantee that the invalid bytes that were parsed up to this point get
stripped.

The second level is where lxml comes into the play. When you define a
"data()" method on your target parser, you ask lxml to pass you the
character data from the document. lxml's SAX handler will then try to
decode the UTF-8 data provided by the libxml2 parser to pass it into your
method. If the data returned by the parser is not valid UTF-8, this will
fail. I assume that this is where the exception that you see originates
from, as this is done through the Python Codec API.

Does this clear things up?

That said, I could imagine letting the character decoder work around
broken data if the "recover" option is enabled, simply by replacing
broken content with a replacement character. This would improve the
recovery capabilities in your case, without breaking the data any further
than it already is.
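The decoding step of the second level, and the replacement-character idea, can be illustrated with the Python codec machinery alone. This is a rough sketch under the assumption that the bytes handed over are GB18030 data mislabelled as UTF-8; it mirrors the behaviour described above but is not lxml's actual code, and the sample bytes are made up.

```python
# Two Chinese characters ("ni hao"), written with escapes, encoded as
# GB18030 and wrapped in ASCII markup.
raw = b"<p>" + u"\u4f60\u597d".encode("gb18030") + b"</p>"

# Strict decoding is what fails with a UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The "replace" error handler substitutes U+FFFD for undecodable bytes
# and keeps going, so the surrounding ASCII markup survives.
lenient = raw.decode("utf-8", "replace")

print(strict_failed)
print(repr(lenient))
```

The readable text is lost either way; the lenient variant only avoids the exception, which matches the remark that it would not break the data any further than it already is.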
Stefan

From robert at smithpierce.net Wed Jun 3 16:32:45 2009
From: robert at smithpierce.net (Robert Pierce)
Date: Wed, 3 Jun 2009 07:32:45 -0700
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <20090603065858.115690@gmx.net>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com>
 <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com>
 <20090602075902.62950@gmx.net>
 <4A257C51.2050203@behnel.de>
 <20090603065858.115690@gmx.net>
Message-ID: <8dbf5ac80906030732u5334a526sfb6720593cdb791@mail.gmail.com>

In case it isn't obvious, I'm not an XML guru and haven't been using
lxml for long, but truly IMHO:

I stipulate the importance of nil (or null) in schema definitions, as
well as in attaching types to the in memory representation of the tree.
But from the standpoint of text representation, doesn't seem to carry
any additional information over .

My use case is passing XML through SQS, which has an upper bound of
about 6kB (after http headers are accounted for). When lxml annotates
empty elements, it attaches BOTH schema and type to each node, which
increases the size of the text representation of the element by a factor
of 4 or more. So I really have to deannotate it "all the way".

On 6/2/09, jholg at gmx.de wrote:
> Consider this use case:
>
> >>> root = objectify.fromstring("""
> ... xsi:nil='true'/>""")
> >>> print etree.tostring(root)
> xsi:nil="true"/>
> >>> objectify.deannotate(root) # Should this *remove* xsi:nil?!
> >>> print etree.tostring(root)
> xsi:nil="true"/>
> >>>
>
> I wouldn't want deannotate() to remove xsi:nil here.

I think it is impossible to retain input intent once a tree is parsed
into memory. Really, in the absence of a schema I shouldn't be able to
tell the difference between your input and

root = objectify.fromstring('')

or

root = objectify.fromstring('')
root.x = None

You can only ask for consistency on output. Currently, the output of
deannotate is not consistent in this case.
In any event, type constraints are more properly defined in a schema,
aren't they? Just because you passed me doesn't constrain me from
passing you back unless there's a schema that says otherwise.

> What's the use case for a deannotate() that removes xsi:nil? Why not just
> assign '' instead of None and deannotate() afterwards?

As you suggest, I can set the element value to '', so it is a string
type and deannotate() removes the type. However, tostring() +
deannotate() then produces rather than ... better, but still not
efficient.

Of course, there is a valid argument to say that a space constrained API
shouldn't use a bloated data format like XML at all, but (for my API)
it's too late to make that argument.

> A compromise may be to add another keyword arg "nil" to deannotate() to
> allow for xsi:nil removal if needed (defaults to False, of course :)

Works for me!

From Olivier.Collioud at wipo.int Wed Jun 3 19:16:11 2009
From: Olivier.Collioud at wipo.int (Collioud, Olivier)
Date: Wed, 3 Jun 2009 19:16:11 +0200
Subject: [lxml-dev] Struggling with unicode again
In-Reply-To: <4A2587D9.5040903@behnel.de>
References: <4A2587D9.5040903@behnel.de>
Message-ID: <7EC22A059E684A48B7B9D4B1D60DFF96767C9F1A@ICCV101C.gms02.unicc.org>

Hi,

I'm trying to update an etree in a WSGI application with data coming
from a posted form.
The data is converted first using urllib.unquote_plus.

I know that the data (text) is then UTF-8 encoded.
LXML is giving: Traceback (most recent call last): File "D:/Applications/IPC_Definitions_Editor/defedit/defedit.py", line 130, in application elt.text = text File "lxml.etree.pyx", line 835, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:9595), File "apihelpers.pxi", line 409, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:28436) File "apihelpers.pxi", line 951, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32423) AssertionError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes What encoding do I need to convert 'text' to and how ? Thanks, Olivier. From stefan_ml at behnel.de Wed Jun 3 09:35:54 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 03 Jun 2009 09:35:54 +0200 Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes In-Reply-To: <20090603065858.115690@gmx.net> References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> Message-ID: <4A2627DA.5000104@behnel.de> Hi, I do see your point that xsi:nil is still a bit different from xsi:type. That's why I had my doubts in the first place. jholg at gmx.de wrote: > A compromise may be to add another keyword arg "nil" to deannotate() to > allow for xsi:nil removal if needed (defaults to False, of course :) I think that should be done, yes. A "nil=False" keyword would nicely solve this. And disabling it by default makes sense for two reasons: backwards compatibility and the fact that xsi:nil may be used in existing documents. Is a plain "nil" enough or should we use "xsi_nil"? 
Stefan From stefan_ml at behnel.de Wed Jun 3 19:55:20 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 03 Jun 2009 19:55:20 +0200 Subject: [lxml-dev] Struggling with unicode again In-Reply-To: <7EC22A059E684A48B7B9D4B1D60DFF96767C9F1A@ICCV101C.gms02.unicc.org> References: <4A2587D9.5040903@behnel.de> <7EC22A059E684A48B7B9D4B1D60DFF96767C9F1A@ICCV101C.gms02.unicc.org> Message-ID: <4A26B908.2000300@behnel.de> Collioud, Olivier wrote: > I'm trying to update an etree in a WSGI application with data coming from a posted form. > The data is converted first using urllib.unquote_plus. > > I know that the data (text) is then UTF-8 encoded. > > LXML is giving: > > Traceback (most recent call last): > File "D:/Applications/IPC_Definitions_Editor/defedit/defedit.py", line 130, in application > elt.text = text > File "lxml.etree.pyx", line 835, in lxml.etree._Element.text.__set__ (src/lxml/lxml.etree.c:9595), > File "apihelpers.pxi", line 409, in lxml.etree._setNodeText (src/lxml/lxml.etree.c:28436) > File "apihelpers.pxi", line 951, in lxml.etree._utf8 (src/lxml/lxml.etree.c:32423) > AssertionError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes > > What encoding do I need to convert 'text' to and how ? No encoding. You have to /de/code it from UTF-8 into Unicode and pass a Python unicode string, i.e. 
do elt.text = text.decode("utf-8") Stefan From stefan_ml at behnel.de Wed Jun 3 20:14:14 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 03 Jun 2009 20:14:14 +0200 Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes In-Reply-To: <8dbf5ac80906030732u5334a526sfb6720593cdb791@mail.gmail.com> References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <8dbf5ac80906030732u5334a526sfb6720593cdb791@mail.gmail.com> Message-ID: <4A26BD76.4040601@behnel.de> Robert Pierce wrote: > In case it isn't obvious, I'm not an XML guru and haven't been using > lxml for long, but truly IMHO: > > I stipulate the importance of nil (or null) in schema definitions, as > well as in attaching types to the in memory representation of the > tree. But from the standpoint of text representation, xsi:nil='true'/> doesn't seem to carry any additional information over > . I think it makes more sense to let an empty leaf element represent an empty string than to represent it as None. It's a matter of use cases, obviously. > My use case is passing XML through SQS "SQS" is an ambiguous abbreviation. > which has an upper bound of > about 6kB (after http headers are accounted for). That sounds like a rather odd restriction. Doesn't it at least support compression? > I think it is impossible to retain input intent once a tree is parsed > into memory. Really, in the absence of a schema I shouldn't be able > to tell the difference between your input and > > root = objectify.fromstring('') > > or > root = objectify.fromstring('') > root.x = None Well, it /is/ different, though. >>> root = objectify.fromstring('') >>> str(root.x) '' >>> root = objectify.fromstring('') >>> root.x = None >>> str(root.x) 'None' > You can only ask for consistency on output. No, lxml.objectify is a Python-object-like in-memory tree. 
Serialisation is only a way out, validation only a way to check what leaves the code that processed the tree. All the rest is about making it easy to use as a tree structure. That's what the annotations are there for. Whether or not you want to keep the necessary information during a serialise-parse cycle is up to you (or should be, so an option to remove everything is just fine).

Stefan

From ovnicraft at gmail.com Wed Jun 3 22:20:16 2009
From: ovnicraft at gmail.com (Ovnicraft)
Date: Wed, 3 Jun 2009 15:20:16 -0500
Subject: [lxml-dev] ODS file with lxml
Message-ID:

Hi folks,

I am trying to manipulate the content.xml file from an ODS document with lxml, but when I try doc = etree.parse(xmldata), it gives me the content of xmldata on STDOUT. I guess it can be parsed. I have attached the file that I am using.

Waiting for your suggestions, regards,

--
Cristian Salamea
http://twitter.com/ovnicraft
Software Libre / Open Source
(+593-8) 4-36-44-48
-------------- next part --------------
A non-text attachment was scrubbed...
Name: content.xml
Type: text/xml
Size: 49947 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090603/e6d65511/attachment-0001.bin

From qhlonline at 163.com Thu Jun 4 04:58:27 2009
From: qhlonline at 163.com (qhlonline)
Date: Thu, 4 Jun 2009 10:58:27 +0800 (CST)
Subject: [lxml-dev] Target parser parsing error
In-Reply-To: <14f1f2265bcf5f788498ab3ccf4e7854.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <14f1f2265bcf5f788498ab3ccf4e7854.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1733034.434611244032539801.JavaMail.coremail@bj163app25.163.com>
Message-ID: <27338673.633061244084307188.JavaMail.coremail@bj163app60.163.com>

Hi,

On 2009-06-03, "Stefan Behnel" wrote:
>qhlonline wrote:
>> There is more information about my parsing error when I use a target
>> parser to parse http://www.jiayuan.com/ . The fatal error reported
>> is: Input is not proper UTF-8, indicate encoding ! To find the real
>> place where this problem occurred, I have tried to convert the HTML
>> string encoding with iconv directly. This time it also reports an error,
>> and the error character index in the string is just the same as in my lxml
>> test. Now it is clear that this parsing error is caused by the
>> encoding conversion of iconv from utf-8 to utf-8 when there are illegal
>> characters in the source.
>
>What do you mean by "from utf-8 to utf-8" conversion?

I am not sure whether this conversion has taken place. But when I convert the HTML content string in this way, it reports an illegal character error at some place in the string, and that is exactly the place where my lxml target parser generates its error. Does that confirm that there are illegal characters in the HTML content?

>
>> When I do not define the data function in my
>> target parser, it will parse without an error report. Does it mean that
>> when I leave out the data function, the UTF-8 to UTF-8 conversion is
>> also skipped? Or has some correct conversion been done before the call
>> to the data function?
>
>It just means that the parser has ignored your character content. There
>are two levels here. The libxml2 parser will parse the byte stream and try
>to convert it to UTF-8. If that fails but it is asked to "recover" from
>it, it will just continue without raising an error. Not sure what becomes
>of the data in this case, but apparently there is no guarantee that the
>invalid bytes that were parsed up to this point get stripped.
>

I agree with you. I had wondered what libxml2 would do when it meets an illegal character. Your answer makes this point clear to me.

>The second level is where lxml comes into the play.
When you define a
>"data()" method on your target parser, you ask lxml to pass you the
>character data from the document. lxml's SAX handler will then try to
>decode the UTF-8 data provided by the libxml2 parser to pass it into your
>method. If the data returned by the parser is not valid UTF-8, this will
>fail. I assume that this is where the exception that you see originates
>from, as this is done through the Python Codec API.

Yes, that is the case. But the illegal character came from outside of lxml and outside of libxml2: the whole string was fetched from a URL using Python's urllib module. So I wonder whether there is some other method to get HTML content from a URL without illegal characters.

>
>Does this clear things up?

Thank you for your help. I think I have learned more about the lxml parsing process with your guidance. Thank you!

>
>That said, I could imagine letting the character decoder work around
>broken data if the "recover" option is enabled, simply by replacing broken
>content with a replacement character. This would improve the recovery
>capabilities in your case, without breaking the data any further than
>already is.
>
>Stefan
>

Happy days!
Yours

From piet at cs.uu.nl Thu Jun 4 08:36:15 2009
From: piet at cs.uu.nl (Piet van Oostrum)
Date: Thu, 4 Jun 2009 08:36:15 +0200
Subject: [lxml-dev] ODS file with lxml
In-Reply-To:
References:
Message-ID: <18983.27487.345490.871205@cochabamba.local>

>>>>> Ovnicraft (O) wrote:

>O> Hi folks, I am trying to manipulate the content.xml file from ODS in lxml
>O> but when I try: doc = etree.parse(xmldata), it gives me the content of xmldata
>O> on STDOUT, I guess it can be parsed.
>O> I attached the file that I am using.
>O> Waiting for your suggestions,

etree.parse should be given a file object or filename as parameter.
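
As a small sketch of the two entry points: parse() reads from a filename or file-like object, while data already held in memory goes through fromstring(). The snippet uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed (an assumption for illustration; lxml.etree offers parse() and fromstring() with the same basic calling convention). The sample document is made up and is not the attached content.xml.

```python
import io
import xml.etree.ElementTree as etree  # stand-in for lxml.etree in this sketch

xmldata = b"<doc><cell>42</cell></doc>"  # made-up sample data

# parse() expects a filename or a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(xmldata))
root_from_file = tree.getroot()

# fromstring() takes the XML data itself and returns the root Element.
root_from_string = etree.fromstring(xmldata)

print(root_from_file.tag, root_from_string.find("cell").text)
```

Passing the XML data itself to parse() makes it treat the data as a filename, which is probably why the content shows up in the error output.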
If you want to use an XML string you should use

  doc = etree.fromstring(xmldata)

See http://codespeak.net/lxml/parsing.html
--
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org

From avleen at gmail.com Thu Jun 4 11:57:41 2009
From: avleen at gmail.com (Avleen Vig)
Date: Thu, 4 Jun 2009 10:57:41 +0100
Subject: [lxml-dev] Possible bug, libxml segfault?
In-Reply-To: <4A2117AE.7010908@behnel.de>
References: <33c66c80905270816le007e19ua8844d5a5d8b8c28@mail.gmail.com> <4A2117AE.7010908@behnel.de>
Message-ID: <33c66c80906040257l3d6b4655o630700ba9bd90a3d@mail.gmail.com>

On Sat, May 30, 2009 at 12:25 PM, Stefan Behnel wrote:
> The information you give is pretty informative. It just lacks a hint which
> versions of lxml, libxml2 and libxslt you are using. Is this reproducible
> with the latest release versions of all three?

lxml 1.3.6
libxslt 1.1.22
libxml 2.6.31

> The triggering problem is that you try to exclude the namespace prefixes
> "str" and "exsl" which are not defined in your stylesheet. This seems to
> induce a problem in lxml's error reporting mechanism in your specific setup.
>
> Is your application threaded? I do remember a couple of problems with
> threaded error reporting in the past, although all of those were solved
> back then.

We're running it with the Spawning python module (which uses threading), otherwise it's just a regular Django app.

I haven't had a chance to try this with the latest lxml. I'll try to soon. Something changed and made the problem happen much less frequently, so not sure what's going on there :)

From stefan_ml at behnel.de Thu Jun 4 13:14:14 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 4 Jun 2009 13:14:14 +0200 (CEST)
Subject: [lxml-dev] Possible bug, libxml segfault?
In-Reply-To: <33c66c80906040257l3d6b4655o630700ba9bd90a3d@mail.gmail.com> References: <33c66c80905270816le007e19ua8844d5a5d8b8c28@mail.gmail.com> <4A2117AE.7010908@behnel.de> <33c66c80906040257l3d6b4655o630700ba9bd90a3d@mail.gmail.com> Message-ID: <9b21939bc6ebd1f5367977b66360daba.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Avleen Vig wrote: > On Sat, May 30, 2009 at 12:25 PM, Stefan Behnel wrote: >> The information you give is pretty informative. It just lacks a hint >> which versions of lxml, libxml2 and libxslt you are using. Is this >> reproducible with the latest release versions of all three? > > lxml 1.3.6 > [...] > I haven't had a change to try this with the latest lxml. I'll try to > soon. You should. Error reporting related bugs (and many, many others) have long been fixed since the 1.3 series. Also, many threading scenarios are not considered safe in 1.3. Stefan From jholg at gmx.de Thu Jun 4 14:42:27 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 04 Jun 2009 14:42:27 +0200 Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes In-Reply-To: <4A2627DA.5000104@behnel.de> References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> Message-ID: <20090604124227.28860@gmx.net> Hi, > jholg at gmx.de wrote: > > A compromise may be to add another keyword arg "nil" to deannotate() to > > allow for xsi:nil removal if needed (defaults to False, of course :) > > I think that should be done, yes. A "nil=False" keyword would nicely solve > this. And disabling it by default makes sense for two reasons: backwards > compatibility and the fact that xsi:nil may be used in existing documents. > > Is a plain "nil" enough or should we use "xsi_nil"? I think xsi_nil is clearer. 
What if we add a general deannotation function that lets you strip a tree of arbitrary attributes? Something like

  def remove_attributes(element_or_tree, *attrs):
      ...

which takes either ns-qualified strings or (ns, attrname) tuples and removes these attributes wherever found. objectify.deannotate() would then be a special case of this and share the implementation.

Then again maybe that's overkill...

Holger

From stefan_ml at behnel.de Thu Jun 4 15:34:18 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 4 Jun 2009 15:34:18 +0200 (CEST)
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <20090604124227.28860@gmx.net>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net>
Message-ID: <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

jholg at gmx.de wrote:
>> jholg at gmx.de wrote:
>> > A compromise may be to add another keyword arg "nil" to deannotate() to
>> > allow for xsi:nil removal if needed (defaults to False, of course :)
>>
>> I think that should be done, yes. A "nil=False" keyword would nicely solve
>> this. And disabling it by default makes sense for two reasons: backwards
>> compatibility and the fact that xsi:nil may be used in existing documents.
>>
>> Is a plain "nil" enough or should we use "xsi_nil"?
>
> I think xsi_nil is clearer.

Thought so, too.

> What if we add a general deannotation function that lets you strip a tree
> of arbitrary attributes? Something like
>
>   def remove_attributes(element_or_tree, *attrs):
>       ...
>
> which takes either ns-qualified strings or (ns, attrname) tuples and
> removes these attributes wherever found. objectify.deannotate() would then
> be a special case of this and share the implementation.

That sounds like functionality that belongs in lxml.etree, although it's partly available in lxml.html already. What about adding some more, then?

- strip_attributes(tree, *attribute_names)
  remove all named attributes from a tree

- strip_elements(tree, *element_names)
  remove all named elements from a tree, including their subtrees (alt: "strip_subtrees")

- strip_tags(tree, *element_names)
  remove all named elements from a tree, merging their children and text content into their parents

Since lxml.html provides a drop_tag() Element method, I considered drop_tags() for the last one, but thought that "strip_*" might be slightly better for consistency here. Alternatively, we might use "drop_*" for everything, but "strip" is a common thing in Python, while "drop" isn't. Plus, there are "drop_*()" /methods/ in lxml.html, which make sense on an Element and do not traverse into subtrees. "strip" makes no sense in that context.

I also vote for functions instead of methods here since they work on complete (sub-)trees rather than a single Element object. A function makes this clearer.

Comments?
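
To make the proposed semantics of strip_attributes() concrete, here is a rough sketch written against the standard library's xml.etree.ElementTree. It is not lxml's implementation, only an illustration of "remove all named attributes from a tree"; the sample document and the decision to accept either an ElementTree or an Element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def strip_attributes(tree, *attribute_names):
    """Remove the named attributes from every element in the (sub-)tree."""
    # Accept either an ElementTree (has getroot()) or a plain Element.
    root = tree.getroot() if hasattr(tree, "getroot") else tree
    for element in root.iter():
        for name in attribute_names:
            # pop() with a default ignores elements that lack the attribute.
            element.attrib.pop(name, None)

XSI = "http://www.w3.org/2001/XMLSchema-instance"
root = ET.fromstring(
    '<root xmlns:xsi="%s"><x xsi:nil="true" id="1"/><y id="2"/></root>' % XSI
)

# Namespaced attribute names use the {namespace}localname notation.
strip_attributes(root, "{%s}nil" % XSI)
print(ET.tostring(root))
```

Under this reading, a deannotate() that also drops xsi:nil would be one call with a fixed set of attribute names.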
Stefan

From jlovell at nwesd.org Thu Jun 4 17:30:12 2009
From: jlovell at nwesd.org (John Lovell)
Date: Thu, 4 Jun 2009 08:30:12 -0700
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net> <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

My comments would be: brilliant, useful, wonderful!

However, should the last one read...

strip_tags(tree, *tag_names)

John W. Lovell
Web Applications Engineer
Northwest Educational Service District
1601 R Avenue
Anacortes, WA 98221
(360) 299-4086
jlovell at nwesd.org
www.nwesd.org
Together We Can ...

From robert at smithpierce.net Thu Jun 4 19:23:02 2009
From: robert at smithpierce.net (Robert Pierce)
Date: Thu, 4 Jun 2009 10:23:02 -0700
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To:
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net> <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <8dbf5ac80906041023y3e9a3f69pe2b21311da14d7b2@mail.gmail.com>

I agree! Solves my problem and then some.

On Thu, Jun 4, 2009 at 8:30 AM, John Lovell wrote:
> My comments would be: brilliant, useful, wonderful!
>
> ...
>
> That sounds like functionality that belongs into lxml.etree, although it's
> partly available in lxml.html already. What about adding some more, then?
>
> - strip_attributes(tree, *attribute_names)
> remove all named attributes from a tree
>
> - strip_elements(tree, *element_names)
> remove all named elements from a tree, including their subtrees (alt:
> "strip_subtrees")
>
> - strip_tags(tree, *element_names)
> remove all named elements from a tree, merging their children and text
> content into their parents
>
> Since lxml.html provides a drop_tag() Element method, I considered
> drop_tags() for the last one, but thought that "strip_*" might be slightly
> better for consistency here. Alternatively, we might use "drop_*" for
> everything, but "strip" is a common thing in Python, while "drop" isn't.
> Plus, there are "drop_*()" /methods/ in lxml.html, which make sense on an
> Element and do not traverse into subtrees. "strip" makes no sense in that
> context.
>
> I also vote for functions instead of methods here since they work on
> complete (sub-)trees rather than a single Element object. A function makes
> this clearer.
>
> Comments?
>
> Stefan
>
>

From stefan_ml at behnel.de Thu Jun 4 08:55:11 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 04 Jun 2009 08:55:11 +0200
Subject: [lxml-dev] Target parser parsing error
In-Reply-To: <27338673.633061244084307188.JavaMail.coremail@bj163app60.163.com>
References: <14f1f2265bcf5f788498ab3ccf4e7854.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <1733034.434611244032539801.JavaMail.coremail@bj163app25.163.com> <27338673.633061244084307188.JavaMail.coremail@bj163app60.163.com>
Message-ID: <4A276FCF.10109@behnel.de>

Hi,

qhlonline wrote:
> 2009-06-03, "Stefan Behnel" wrote:
>> The libxml2 parser will parse the byte stream and try
>> to convert it to UTF-8. If that fails but it is asked to "recover" from
>> it, it will just continue without raising an error. Not sure what
>> becomes of the data in this case, but apparently there is no guarantee
>> that the invalid bytes that were parsed up to this point get stripped.
>
> I agree with you. I have thought about what libxml2 would do when an
> illegal character came. Your answer makes me clear at this point.

Then it's clearer to you than to me. I'm actually not convinced yet that this is the case. I was rather guessing based on my (limited) knowledge about the problem you observe, which I have never observed myself in the wild. The parser of libxml2 uses leveled buffers that copy the data during decoding. That may already be a sufficient barrier against such problems.

What about posting a self-contained Python module, stripped down to the minimum, that shows the unexpected behaviour?
Nothing that accesses the internet or something, just embed a sufficient part of a failing web page as a string (possibly base64 encoded). That way, others could try to reproduce the problem on their side and debug it. >> The second level is where lxml comes into the play. When you define a >> "data()" method on your target parser, you ask lxml to pass you the >> character data from the document. lxml's SAX handler will then try to >> decode the UTF-8 data provided by the libxml2 parser to pass it into your >> method. If the data returned by the parser is not valid UTF-8, this will >> fail. I assume that this is where the exception that you see originates >> from, as this is done through the Python Codec API. > > Yes, That is the case. But the illegal character came out side of lxml > and outside of libxml2, The whole string was got from an URL by using > urllib module in python. So, I wonder whether there were some other method > to get HTML content from URL without illegal characters. Well, as I said before: if the HTML is broken, there is no way to make sure the parser can read all data 'correctly' (whatever that means in this context). If the web page adheres to an encoding and just fails to declare it correctly, your best bet is to decode the page into a unicode string yourself, catch and handle any decoding errors in a suitable way, and pass that unicode string into the parser. 
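
A minimal sketch of that last suggestion, combined with the idea of embedding the failing bytes directly in the module. The byte string is made up (it contains one stray 0xFF, which is never valid in UTF-8), and the standard library's xml.etree.ElementTree stands in for the parser so the snippet runs without lxml; how to handle the decoding error is a choice, and replacing the broken bytes is only one option.

```python
import xml.etree.ElementTree as ET  # stand-in parser; the decoding step is the point

# Made-up fragment standing in for the downloaded page: valid UTF-8 for
# "café", followed by a stray 0xFF byte.
raw = b"<p>caf\xc3\xa9 \xff ok</p>"

try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Report where decoding failed, then fall back to replacing the
    # undecodable bytes with U+FFFD (the Unicode replacement character).
    print("invalid byte at offset %d" % exc.start)
    text = raw.decode("utf-8", "replace")

# The parser now receives an already decoded string, not raw bytes.
root = ET.fromstring(text)
print(repr(root.text))
```

Whether "replace", "ignore", or giving up on the page is the suitable reaction depends on what the data is used for afterwards.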
Stefan From jholg at gmx.de Fri Jun 5 13:58:57 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 05 Jun 2009 13:58:57 +0200 Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes In-Reply-To: <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net> <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <20090605115857.167800@gmx.net> Hi, > > def remove_attributes(element_or_tree, *attrs): > > ... > > > > which takes either ns-qualified strings or (ns, attrname) tuples and > > removes these attributes wherever found. objectify.deannotate() would > then > > be a special case of this and share the implementation. > > That sounds like functionality that belongs into lxml.etree, although it's > partly available in lxml.html already. What about adding some more, then? I suspected so but wasn't sure about the lxml.etree policy with regard to extending the elementtree API, apart from obvious libxml2/libxslt superpowers. > - strip_attributes(tree, *attribute_names) > remove all named attributes from a tree > > - strip_elements(tree, *element_names) > remove all named elements from a tree, including their subtrees (alt: > "strip_subtrees") > > - strip_tags(tree, *element_names) > remove all named elements from a tree, merging their children and text > content into their parents > > Since lxml.html provides a drop_tag() Element method, I considered > drop_tags() for the last one, but thought that "strip_*" might be slightly > better for consistency here. Alternatively, we might use "drop_*" for > everything, but "strip" is a common thing in Python, while "drop" isn't. 
> Plus, there are "drop_*()" /methods/ in lxml.html, which make sense on an
> Element and do not traverse into subtrees. "strip" makes no sense in that
> context.

+1 for strip_*.

> I also vote for functions instead of methods here since they work on
> complete (sub-)trees rather than a single Element object. A function makes
> this clearer.

+1 for functions.

Holger

From frank at chagford.com Fri Jun 5 14:06:57 2009
From: frank at chagford.com (Frank Millman)
Date: Fri, 5 Jun 2009 14:06:57 +0200
Subject: [lxml-dev] lxml 2.2.1 released
In-Reply-To: <4A2587D9.5040903@behnel.de>
Message-ID: <20090605120702.7F59013BFD@ctb-mesg-1-2.saix.net>

Stefan Behnel wrote:
>
> Hi all,
>
> I just pushed lxml 2.2.1 to PyPI as a minor maintenance release.
> [...]
>
> * Injecting default attributes into a document during XML Schema
> validation (also at parse time).

Thanks very much for adding this feature, Stefan.

I have given it a quick test and so far it works perfectly.

Much appreciated.
Frank Millman

From stefan_ml at behnel.de Sat Jun 6 10:49:24 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 06 Jun 2009 10:49:24 +0200
Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes
In-Reply-To: <8dbf5ac80906041023y3e9a3f69pe2b21311da14d7b2@mail.gmail.com>
References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net> <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <8dbf5ac80906041023y3e9a3f69pe2b21311da14d7b2@mail.gmail.com>
Message-ID: <4A2A2D94.20607@behnel.de>

Hi,

Robert Pierce wrote:
> On Thu, Jun 4, 2009 at 8:30 AM, John Lovell wrote:
>> Stefan Behnel wrote:
>>> - strip_attributes(tree, *attribute_names)
>>> remove all named attributes from a tree
>>>
>>> - strip_elements(tree, *element_names)
>>> remove all named elements from a tree, including their subtrees (alt:
>>> "strip_subtrees")
>>>
>>> - strip_tags(tree, *element_names)
>>> remove all named elements from a tree, merging their children and text
>>> content into their parents

Done:

https://codespeak.net/viewvc/?view=rev&revision=65612
https://codespeak.net/viewvc/lxml/trunk/src/lxml/cleanup.pxi?view=markup&pathrev=65612

>>> Comments?
>>
>> My comments would be: brilliant, useful, wonderful!
>
> I agree! Solves my problem and then some.

Since you two seem to be very happy about this feature, what about writing up some docs/doctests for it? A new section here sounds like the right place:

http://codespeak.net/svn/lxml/trunk/doc/api.txt
-> http://codespeak.net/lxml/api.html

Maybe the tutorial could also benefit from a short reference.

Holger, could you replace the current deannotate() implementation in lxml.objectify and add the xsi:nil cleanup option as we discussed?
I expect it to be a little slower than before due to the more general implementation. If you have some code at hand to benchmark it, please do.

Unless Ian (or someone else) beats me to it, I'll also look through lxml.html next week to check for places where this can be used. For example, clean.py looks like an obvious candidate.

Stefan

From optilude+lists at gmail.com Sun Jun 7 18:42:26 2009
From: optilude+lists at gmail.com (Martin Aspeli)
Date: Mon, 08 Jun 2009 00:42:26 +0800
Subject: [lxml-dev] Binary egg for Mac OS X
In-Reply-To: <4A2112AE.8040903@behnel.de>
References: <4A2112AE.8040903@behnel.de>
Message-ID:

Stefan Behnel wrote:
> Hi,
>
> Alexander Limi wrote:
>> I'm working on documenting Deliverance / xdv for use with Plone. It has
>> lxml as a dependency, and I have run into a serious issue:
>>
>> On Mac OS X, we can't assume that people have Xcode (ie. gcc and friends)
>> installed, thus we can't really compile lxml on those computers, not even
>> using the staticlxml[1] recipe.
>>
>> I see that there are binary eggs for Windows, is there a special reason
>> why there are no binary eggs for OS X, or is it just a matter of not
>> having the infrastructure to make it available?
>
> The main problem is that many MacOS-X users have some kind of package
> distribution like macports installed, which usually has some distribution
> specific setup/dependencies/paths/whatever. OTOH, those users won't be the
> target for a binary distribution of lxml anyway.

Well, I use MacPorts. And every time I have a package that depends on lxml, I need to jump through hoops to avoid random segfaults (and OS X makes those hard to debug!).

Luckily, those hoops aren't so big anymore: I tend to use zc.buildout and I can add this to my buildout:

  [lxml]
  recipe = z3c.recipe.staticlxml
  egg = lxml
  force = false

However, people have a *lot* of problems with this. Every Plone sprint it seems someone is spending half a day or more getting lxml to compile.
Which is a big shame, because lxml is so nice so people like me want to use it. :) Therefore, the Plone community, at least, is pretty interested in solving this problem. >> Happy to help find a solution if it's just a matter of locating a >> reliable way to get it compiled every time there is a new release. > > Yes, I'd be happy if we could get a static binary egg for each release. I > don't have a Mac myself (and I'm definitely not a Mac user), so > contributions are welcome. > > http://codespeak.net/lxml/build.html#building-lxml-on-macos-x So, if we followed those steps to build a statically compiled egg for each Python version we support (2.4 and 2.6 for Plone...), and uploaded that to PyPI, we'd be able to just depend on this version of lxml and no-one on OSX should ever get these annoying problems? That'd be really nice. ;) If this is the case, we'll find someone to do this. Are you able to give someone access to the PyPI page and any relevant support to make this happen? Finally - what about Linux? Is it rarely/never a problem, or should we be trying to make binary eggs there too? Is it possible to make binary eggs that work across the most common distributions? Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Sun Jun 7 20:12:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 07 Jun 2009 20:12:53 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> Message-ID: <4A2C0325.5040200@behnel.de> Martin Aspeli wrote: > Stefan Behnel wrote: >> Alexander Limi wrote: >>> I'm working on documenting Deliverance / xdv for use with Plone. It has >>> lxml as a dependency, and I have run into a serious issue: >>> >>> On Mac OS X, we can't assume that people have Xcode (ie. 
gcc and friends) >>> installed, thus we can't really compile lxml on those computers, not even >>> using the staticlxml[1] recipe. >>> >>> I see that there are binary eggs for Windows, is there a special reason >>> why there are no binary eggs for OS X, or is it just a matter of not >>> having the infrastructure to make it available? >> The main problem is that many MacOS-X users have some kind of package >> distribution like macports installed, which usually has some distribution >> specific setup/dependencies/paths/whatever. OTOH, those users won't be the >> target for a binary distribution of lxml anyway. > > Well, I use MacPorts. The problem is that macports is not the only package distribution. Trying to support them all from lxml's setup.py is futile. Hence the static builds that are independent of the installed libraries. >>> Happy to help find a solution if it's just a matter of locating a >>> reliable way to get it compiled every time there is a new release. >> Yes, I'd be happy if we could get a static binary egg for each release. I >> don't have a Mac myself (and I'm definitely not a Mac user), so >> contributions are welcome. >> >> http://codespeak.net/lxml/build.html#building-lxml-on-macos-x > > So, if we followed those steps to build a statically compiled egg for > each Python version we support (2.4 and 2.6 for Plone...), and uploaded > that to PyPI, we'd be able to just depend on this version of lxml and > no-one on OSX should ever get these annoying problems? That'd be really > nice. ;) > > If this is the case, we'll find someone to do this. Are you able to give > someone access to the PyPI page and any relevant support to make this > happen? Sure. Sidnei da Silva provides the Windows builds, and I'm happy to add another package maintainer for MacOS. > Finally - what about Linux? Is it rarely/never a problem, or should we > be trying to make binary eggs there too? Is it possible to make binary > eggs that work across the most common distributions? 
As long as you have a somewhat recent system, installing lxml is trivial here. Plus, updating the system installations of libxml2 and libxslt isn't hard either, so that's a totally different situation than for MacOS. Stefan From limi at plone.org Mon Jun 8 01:45:54 2009 From: limi at plone.org (Alexander Limi) Date: Sun, 07 Jun 2009 16:45:54 -0700 Subject: [lxml-dev] Binary egg for Mac OS X References: <4A2112AE.8040903@behnel.de> Message-ID: Hi, Stefan Eletzhofer (CC'ed) mentioned that he'd be willing to help out. I did contact the Snakebite people to see if we can automate the builds -- but they aren't quite up and running yet. So, for the time being, could we manually build a binary egg for lxml and upload it along with the others? Plone people would be very happy, and it would save a lot of grey hairs for people on Mac OS X. :) -- Alexander On Sat, 30 May 2009 04:04:14 -0700, Stefan Behnel wrote: > Hi, > > Alexander Limi wrote: >> I'm working on documenting Deliverance / xdv for use with Plone. It has >> lxml as a dependency, and I have run into a serious issue: >> >> On Mac OS X, we can't assume that people have Xcode (ie. gcc and >> friends) >> installed, thus we can't really compile lxml on those computers, not >> even >> using the staticlxml[1] recipe. >> >> I see that there are binary eggs for Windows, is there a special reason >> why there are no binary eggs for OS X, or is it just a matter of not >> having the infrastructure to make it available? > > The main problem is that many MacOS-X users have some kind of package > distribution like macports installed, which usually has some distribution > specific setup/dependencies/paths/whatever. OTOH, those users won't be > the > target for a binary distribution of lxml anyway. > > >> Happy to help find a solution if it's just a matter of locating a >> reliable way to get it compiled every time there is a new release. > > Yes, I'd be happy if we could get a static binary egg for each release.
I > don't have a Mac myself (and I'm definitely not a Mac user), so > contributions are welcome. > > http://codespeak.net/lxml/build.html#building-lxml-on-macos-x > > Stefan -- Alexander Limi ? http://limi.net From optilude+lists at gmail.com Mon Jun 8 02:48:41 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Mon, 08 Jun 2009 08:48:41 +0800 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A2C0325.5040200@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2C0325.5040200@behnel.de> Message-ID: Stefan Behnel wrote: > Martin Aspeli wrote: >> Stefan Behnel wrote: >>> Alexander Limi wrote: >>>> I'm working on documenting Deliverance / xdv for use with Plone. It has >>>> lxml as a dependency, and I have run into a serious issue: >>>> >>>> On Mac OS X, we can't assume that people have Xcode (ie. gcc and friends) >>>> installed, thus we can't really compile lxml on those computers, not even >>>> using the staticlxml[1] recipe. >>>> >>>> I see that there are binary eggs for Windows, is there a special reason >>>> why there are no binary eggs for OS X, or is it just a matter of not >>>> having the infrastructure to make it available? >>> The main problem is that many MacOS-X users have some kind of package >>> distribution like macports installed, which usually has some distribution >>> specific setup/dependencies/paths/whatever. OTOH, those users won't be the >>> target for a binary distribution of lxml anyway. >> Well, I use MacPorts. > > The problem is that macports is not the only package distribution. Trying > to support them all from lxml's setup.py is futile. Hence the static builds > that are independent of the installed libraries. If a static build works for everyone, I'll be ecstatic (pun intended). >>>> Happy to help find a solution if it's just a matter of locating a >>>> reliable way to get it compiled every time there is a new release. >>> Yes, I'd be happy if we could get a static binary egg for each release. 
I >>> don't have a Mac myself (and I'm definitely not a Mac user), so >>> contributions are welcome. >>> >>> http://codespeak.net/lxml/build.html#building-lxml-on-macos-x >> So, if we followed those steps to build a statically compiled egg for >> each Python version we support (2.4 and 2.6 for Plone...), and uploaded >> that to PyPI, we'd be able to just depend on this version of lxml and >> no-one on OSX should ever get these annoying problems? That'd be really >> nice. ;) >> >> If this is the case, we'll find someone to do this. Are you able to give >> someone access to the PyPI page and any relevant support to make this >> happen? > > Sure. Sidnei da Silva provides the Windows builds, and I'm happy to add > another package maintainer for MacOS. Great! I'll try to find a volunteer. >> Finally - what about Linux? Is it rarely/never a problem, or should we >> be trying to make binary eggs there too? Is it possible to make binary >> eggs that work across the most common distributions? > > As long as you have a somewhat recent system, installing lxml is trivial > here. Plus, updating the system installations of libxml2 and libxslt isn't > hard either, so that's a totally different situation than for MacOS. Right. We haven't had too many complaints yet, at least. It'd be nice to have some binary eggs, though, if they can be done reliably, for people who don't have libxml2/libxslt installed. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Mon Jun 8 07:54:54 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Jun 2009 07:54:54 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> Message-ID: <4A2CA7AE.8030508@behnel.de> Alexander Limi wrote: > Stefan Eletzhofer (CC'ed) mentioned that he'd be willing to help out. I > did contact the Snakebite people to see if we can automate the builds ? 
> but they aren't quite up and running yet. > > So, for the time being, could we manually build a binary egg for lxml and > upload it along with the others? Plone people would be very happy, and it > would save a lot of grey hairs for people on Mac OS X. :) Sure, it's not like there's an lxml release every week or so, so a manual build is quite doable. What I would like to see is a really fat egg, i.e. one that works on all hardware platforms supported by MacOS-X. I have no idea how to do that, so I can't provide any hints. But since there seem to be some MacOS users on the list by now, I hope we can get away with some web digging plus trial and error. Also, the build must not rely on any macports libraries, as not everyone uses macports. So we'd have to make sure it's only built against system libraries (plus the statically built libxml2/libxslt). What about the "system Python vs. macports Python" question? Would such an egg work on both? Or does it make more sense to support only the system Python installation, and to leave macports specific builds to the macports maintainers? Stefan From stefan_ml at behnel.de Mon Jun 8 08:07:07 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 08 Jun 2009 08:07:07 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2C0325.5040200@behnel.de> Message-ID: <4A2CAA8B.6040107@behnel.de> Martin Aspeli wrote: > Stefan Behnel wrote: >> Martin Aspeli wrote: >>> Finally - what about Linux? Is it rarely/never a problem, or should we >>> be trying to make binary eggs there too? Is it possible to make binary >>> eggs that work across the most common distributions? >> As long as you have a somewhat recent system, installing lxml is trivial >> here. Plus, updating the system installations of libxml2 and libxslt isn't >> hard either, so that's a totally different situation than for MacOS. > > Right. We haven't had too many complaints yet, at least. 
It'd be nice to > have some binary eggs, though, if they can be done reliably, for people > who don't have libxml2/libxslt installed. I provided binary eggs for AMD64-GNU/Linux at the beginning, and there were a couple of downloads back then (although no feedback on them). However, the various dialects of Linux form a platform that is very easy to target with source code and much harder to target with binary releases. For example, a static build doesn't make any sense here, since most Linux installations have very good package management by now. They automatically update their system libraries from time to time, even without notifying the user (especially for security updates). A static lxml would not benefit from that, whereas a dynamic build will certainly not run on all platforms that lxml supports as a source distribution. So a binary egg that easy_install fetches by default would actually break more than it could fix. The whole problem with MacOS is that it is *not* easy to update the system libraries. That's why a static binary build makes sense. Stefan From piet at cs.uu.nl Mon Jun 8 09:36:35 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Mon, 8 Jun 2009 09:36:35 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A2CAA8B.6040107@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2C0325.5040200@behnel.de> <4A2CAA8B.6040107@behnel.de> Message-ID: <18988.49027.571878.97882@cochabamba.local> >>>>> Stefan Behnel (SB) wrote: >SB> The whole problem with MacOS is that it is *not* easy to update the system >SB> libraries. That's why a static binary build makes sense. On Mac OS X you *shouldn't* update system libraries. With the next system software update they could easily disappear. 
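
A minimal sketch related to the static versus dynamic build question above: lxml exposes both the libxml2/libxslt versions it was compiled against and the versions it is actually running with. This only inspects an existing installation and assumes lxml is importable; it does not tell you whether a given egg was linked statically.

```python
from lxml import etree

# Version tuples of the C libraries lxml was built against ...
compiled_xml = etree.LIBXML_COMPILED_VERSION
compiled_xslt = etree.LIBXSLT_COMPILED_VERSION

# ... and of the libraries it is using at runtime.
runtime_xml = etree.LIBXML_VERSION
runtime_xslt = etree.LIBXSLT_VERSION

print("libxml2 compiled/runtime:", compiled_xml, runtime_xml)
print("libxslt compiled/runtime:", compiled_xslt, runtime_xslt)

# A mismatch hints that a dynamically linked build picked up a
# different system (or macports) library than the one it was built with.
if compiled_xml != runtime_xml:
    print("note: libxml2 differs between build time and runtime")
```

With a static build the two pairs would normally be expected to agree, since the libraries travel inside the extension module.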
-- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From tseaver at palladion.com Mon Jun 8 14:40:49 2009 From: tseaver at palladion.com (Tres Seaver) Date: Mon, 08 Jun 2009 08:40:49 -0400 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2C0325.5040200@behnel.de> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Martin Aspeli wrote: > Stefan Behnel wrote: >> Martin Aspeli wrote: >>> Stefan Behnel wrote: >>>> Alexander Limi wrote: >>>>> I'm working on documenting Deliverance / xdv for use with Plone. It has >>>>> lxml as a dependency, and I have run into a serious issue: >>>>> >>>>> On Mac OS X, we can't assume that people have Xcode (ie. gcc and friends) >>>>> installed, thus we can't really compile lxml on those computers, not even >>>>> using the staticlxml[1] recipe. >>>>> >>>>> I see that there are binary eggs for Windows, is there a special reason >>>>> why there are no binary eggs for OS X, or is it just a matter of not >>>>> having the infrastructure to make it available? >>>> The main problem is that many MacOS-X users have some kind of package >>>> distribution like macports installed, which usually has some distribution >>>> specific setup/dependencies/paths/whatever. OTOH, those users won't be the >>>> target for a binary distribution of lxml anyway. >>> Well, I use MacPorts. >> The problem is that macports is not the only package distribution. Trying >> to support them all from lxml's setup.py is futile. Hence the static builds >> that are independent of the installed libraries. > > If a static build works for everyone, I'll be ecstatic (pun intended). > >>>>> Happy to help find a solution if it's just a matter of locating a >>>>> reliable way to get it compiled every time there is a new release. >>>> Yes, I'd be happy if we could get a static binary egg for each release. 
I >>>> don't have a Mac myself (and I'm definitely not a Mac user), so >>>> contributions are welcome. >>>> >>>> http://codespeak.net/lxml/build.html#building-lxml-on-macos-x >>> So, if we followed those steps to build a statically compiled egg for >>> each Python version we support (2.4 and 2.6 for Plone...), and uploaded >>> that to PyPI, we'd be able to just depend on this version of lxml and >>> no-one on OSX should ever get these annoying problems? That'd be really >>> nice. ;) >>> >>> If this is the case, we'll find someone to do this. Are you able to give >>> someone access to the PyPI page and any relevant support to make this >>> happen? >> Sure. Sidnei da Silva provides the Windows builds, and I'm happy to add >> another package maintainer for MacOS. > > Great! I'll try to find a volunteer. > >>> Finally - what about Linux? Is it rarely/never a problem, The last time I had trouble with it (a CentOS 4 box), building the static version myself was an easy workaround. >>> or should we >>> be trying to make binary eggs there too? Is it possible to make binary >>> eggs that work across the most common distributions? No, nor would it be desirable to try. >> As long as you have a somewhat recent system, installing lxml is trivial >> here. Plus, updating the system installations of libxml2 and libxslt isn't >> hard either, so that's a totally different situation than for MacOS. > > Right. We haven't had too many complaints yet, at least. It'd be nice to > have some binary eggs, though, if they can be done reliably, for people > who don't have libxml2/libxslt installed. -1 to uploading any binary eggs for Linux: they will get preferred over the source distributions, which is not what I want at all. Tres.
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFKLQbR+gerLs4ltQ4RAhaUAJ49em5e/XLAzkiX01Q1yW+P7icI/QCgouPI lC7t0OcJWgiTeOGJollO7Ck= =mtPN -----END PGP SIGNATURE----- From stefan_ml at behnel.de Mon Jun 8 16:13:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 8 Jun 2009 16:13:53 +0200 (CEST) Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> Message-ID: Hi, Stefan Eletzhofer wrote: > On Mon, Jun 8, 2009 at 7:54 AM, Stefan Behnel wrote: >> What I would like to see is a really fat egg, i.e. one that works on all >> hardware platforms supported by MacOS-X. I have no idea how to do that, >> so I can't provide any hints. But since there seem to be some MacOS users >> on the list by now, I hope we can get away with some web digging plus >> trial and error. > > Well, I don't know if that's possible -- AFAIK there's only "binary > eggs" and that's it. That would be a compiler option, if I'm not mistaken. >> Also, the build must not rely on any macports libraries, as not everyone >> uses macports. So we'd have to make sure it's only built against system >> libraries (plus the statically built libxml2/libxslt). > > objdump says that this is true for libs built by the staticlxml recipe. Ok, although I'd prefer getting the eggs created using lxml's own static build. >> What about the "system Python vs. macports Python" question? Would such >> an >> egg work on both? Or does it make more sense to support only the system >> Python installation, and to leave macports specific builds to the >> macports maintainers? > > Well, we'd probly need to support all the python **versions** (2.4, > 2.5, ...). 
And > being statically built, it does per definition not interfer with other > libs, so we should be pretty safe here. IMO, it should only be statically built against libxml2 and libxslt, the rest would be taken from the system libs. Static building is evil, unless it *really* solves a problem. > So all in all, I think we should identify: > > - the LXML versions which are "important enough" to support (1.2.6 > anyone?) Since everything before 2.2 is out of maintenance (and the API is mostly compatible across versions anyway), I'd prefer having eggs for 2.2+ only. Providing older eggs will just let people run into ancient bugs. > - the python versions which we support (surely 2.4 and 2.5. what other > versions?) We currently have Windows eggs for 2.4/5/6 and 3.0, so having the same span for MacOS eggs would be nice. > I could perhaps set up a semi-automated system on my mac server to build > these. > I can also put them on a FTP / Web Server somewhere, and I could (given > access) also upload the static eggs to PyPi. Sure. Do you have a PyPI account? > Do we want to give the static versions a dedicated name? > "lxml-x.y.z-static" or smth like > that? Maybe not everyone wants static eggs (even not on OSX boxen)? I would want them to be the default download on MacOS (i.e. no special naming), but only after some testing by some MacOS users on differently configured systems, i.e. using different MacOS versions, different package distributions, different Python versions, etc. So maybe you could provide them separately first? Thanks for volunteering! 
Stefan From jholg at gmx.de Tue Jun 9 00:26:02 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 09 Jun 2009 00:26:02 +0200 Subject: [lxml-dev] lxml.objectify.deannotate refuses to clean nil nodes In-Reply-To: <4A2A2D94.20607@behnel.de> References: <8dbf5ac80906010911g13f04a6ek14bf0510c5db80da@mail.gmail.com> <8dbf5ac80906010921r41427cedt49b084011d856d9@mail.gmail.com> <20090602075902.62950@gmx.net> <4A257C51.2050203@behnel.de> <20090603065858.115690@gmx.net> <4A2627DA.5000104@behnel.de> <20090604124227.28860@gmx.net> <36741390b07e0cf6fc7f93834031fb3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <8dbf5ac80906041023y3e9a3f69pe2b21311da14d7b2@mail.gmail.com> <4A2A2D94.20607@behnel.de> Message-ID: <20090608222602.198580@gmx.net> Hi, > > Holger, could you replace the current deannotate() implementation in > lxml.objectify and add the xsi:nil cleanup option as we discussed? I > expect it to be a little slower than before due to the more general > implementation. If you have some code at your hands to benchmark it, > please do. Done: https://codespeak.net/viewvc/?view=rev&revision=65680 No benchmarking yet, though. Holger From dalist0 at gmail.com Tue Jun 9 21:14:54 2009 From: dalist0 at gmail.com (D) Date: Tue, 9 Jun 2009 15:14:54 -0400 Subject: [lxml-dev] Difference between xhtml etrees In-Reply-To: References: Message-ID: Hello, I have two xhtml documents which I would like to compare. They are available as etrees. Ideally I would like to have a resulting tree, where the appropriate changes are marked with ins and del tags. I don't need anything fancy like a detection of moves. I had a look at lxml.html.diff http://codespeak.net/lxml/lxmlhtml.html#html-diff but it operates on html strings only, and not on my parsed tree.
That solution would mean that I have to dump the xhtml to html, diff, reparse the string as html and transform it to xml. I like the way daisydiff operates, but again only on files, and the output is either html, or xml, which I would need to merge into my tree. http://code.google.com/p/daisydiff/wiki/Examples Is there any way to compare two trees directly and interpret the differences the way daisydiff does, i.e. look only at text? Daniel From stefan_ml at behnel.de Thu Jun 11 07:38:07 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 11 Jun 2009 07:38:07 +0200 Subject: [lxml-dev] Difference between xhtml etrees In-Reply-To: References: Message-ID: <4A30983F.3090205@behnel.de> D wrote: > I have two xhtml documents which I would like to compare. They are > available as etrees. > Ideally I would like to have a resulting tree, where the appropriate > changes are marked with ins and del tags. I don't need anything fancy > like a detection of moves. > > I had a look at lxml.html.diff > http://codespeak.net/lxml/lxmlhtml.html#html-diff > but it operates on html strings only, and not on my parsed tree. Did you try passing the root elements of the trees? Stefan From stefan_ml at behnel.de Sun Jun 14 13:21:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 14 Jun 2009 13:21:27 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> Message-ID: <4A34DD37.8000207@behnel.de> Hi, Stephan Eletzhofer wrote: > Am 08.06.2009 um 16:13 schrieb Stefan Behnel: >> Stefan Eletzhofer wrote: >>> On Mon, Jun 8, 2009 at 7:54 AM, Stefan Behnel wrote: >>>> What I would like to see is a really fat egg, i.e. one that works on >>>> all >>>> hardware platforms supported by MacOS-X. I have no idea how to do that, >>>> so I can't provide any hints. 
But since there seem to be some MacOS >>>> users >>>> on the list by now, I hope we can get away with some web digging plus >>>> trial and error. >>> >>> Well, I don't know if that's possible -- AFAIK there's only "binary >>> eggs" and that's it. > >> That would be a compiler option, if I'm not mistaken. > > Ah yes -- You mean a mixed PPC/Intel "fat binary" type thing? Exactly. >>> So all in all, I think we should identify: >>> >>> - the LXML versions which are "important enough" to support (1.2.6 >>> anyone?) > >> Since everything before 2.2 is out of maintenance (and the API is mostly >> compatible across versions anyway), I'd prefer having eggs for 2.2+ only. >> Providing older eggs will just let people run into ancient bugs. > >>> - the python versions which we support (surely 2.4 and 2.5. what other >>> versions?) > >> We currently have Windows eggs for 2.4/5/6 and 3.0, so having the same >> span for MacOS eggs would be nice. > > u-huh, well :) > > I don't know about 3.0 yet -- let's see if I can build it. At least lxml currently compiles on 3.0-Py3.1rc1, so if you get 3.x installed, it should 'just work'. >>> I could perhaps set up a semi-automated system on my mac server to build >>> these. >>> I can also put them on a FTP / Web Server somewhere, and I could (given >>> access) also upload the static eggs to PyPi. > >> Sure. Do you have a PyPI account? > > Yup: "seletz" Ok, you have a maintainer role for lxml now. >>> Do we want to give the static versions a dedicated name? >>> "lxml-x.y.z-static" or smth like >>> that? Maybe not everyone wants static eggs (even not on OSX boxen)? > >> I would want them to be the default download on MacOS (i.e. no special >> naming), but only after some testing by some MacOS users on differently >> configured systems, i.e. using different MacOS versions, different >> package >> distributions, different Python versions, etc. So maybe you could provide >> them separately first? 
> > Yes sure -- you mean "separately" as in "not on pypi"? Yep, just build some and advertise them on lxml-dev asking for feedback. If they seem to work for everyone, we can upload them to PyPI. Stefan From faassen at startifact.com Mon Jun 15 16:27:02 2009 From: faassen at startifact.com (Martijn Faassen) Date: Mon, 15 Jun 2009 16:27:02 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> Message-ID: Martin Aspeli wrote: > Finally - what about Linux? Is it rarely/never a problem, or should we > be trying to make binary eggs there too? Is it possible to make binary > eggs that work across the most common distributions? -1 to binary eggs on Linux. I don't think problems are common and binary eggs might lead to problems. Regards, Martijn From stefan_ml at behnel.de Mon Jun 15 17:46:58 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 15 Jun 2009 17:46:58 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> Message-ID: <4A366CF2.6080406@behnel.de> Martijn Faassen wrote: > Martin Aspeli wrote: > >> Finally - what about Linux? Is it rarely/never a problem, or should we >> be trying to make binary eggs there too? Is it possible to make binary >> eggs that work across the most common distributions? > > -1 to binary eggs on Linux. I don't think problems are common and binary > eggs might lead to problems. Right, now that you mention it: Linux distributions use different unicode width settings (16/32 bit), which isn't handled by setuptools' egg search algorithm. So providing binary eggs here is actually harmful because they will not work on all platforms. I assume that we do not have this problem on MacOS, as the only vendor that ships the system Python installation is Apple. 
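
A minimal sketch of how the unicode width mentioned above can be checked on a given interpreter, using only the standard library. The helper name `unicode_build` is made up for illustration. On the Python 2.x and 3.0 to 3.2 interpreters discussed in this thread the result depends on how Python was compiled; from Python 3.3 on, the check always reports a wide build.

```python
import sys


def unicode_build():
    """Return 'wide' for UCS-4 interpreters and 'narrow' for UCS-2 ones."""
    # Narrow builds cap sys.maxunicode at 0xFFFF; wide builds go up to 0x10FFFF.
    return "wide" if sys.maxunicode > 0xFFFF else "narrow"


print("unicode build:", unicode_build(), "(maxunicode = %#x)" % sys.maxunicode)
```

A binary extension module compiled for one width will not load on an interpreter built with the other, which is the incompatibility described for Linux eggs.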
Stefan From david.antliff at gmail.com Tue Jun 16 07:29:13 2009 From: david.antliff at gmail.com (David Antliff) Date: Tue, 16 Jun 2009 17:29:13 +1200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin Message-ID: Hello, I am trying to get the Windows lxml egg to work with Cygwin, Python 2.5.2 on WindowsXP, 32bit. I wasn't able to find a Cygwin package for lxml so I felt I needed to obtain or build a win32 version of lxml. I want to use the xpath functionality that lxml provides in python scripts, running in Cygwin. *** All commands below are issued in a Cygwin bash shell *** I downloaded lxml-2.2.1-py2.5-win32.egg from http://pypi.python.org/pypi/lxml/2.2.1 and placed it in a directory, setting PYTHONPATH to point here too. However pkg_resources.require() can't find it, so instead I unzipped it into a directory called 'lxml'. Then I do this:

$ python
Python 2.5.2 (r252:60911, Dec 2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml/etree.py", line 7, in <module>
    __bootstrap__()
  File "lxml/etree.py", line 6, in __bootstrap__
    imp.load_dynamic(__name__,__file__)
ImportError: Permission denied

Ok, I think I know why this is happening, so I do this:

$ chmod +x lxml/*.pyd
$ python
Python 2.5.2 (r252:60911, Dec 2 2008, 09:26:14)
[GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
Segmentation fault (core dumped)

Ok, so that didn't work. The next thing I tried was the instructions found here: http://www.scribd.com/doc/16158375/LXML-221-documentation in section 23.7 - "Static linking on Windows". I downloaded the latest versions of iconv, libxml2, libxslt, zlib and lxml-2.2.1.tar.gz.
I unpacked them as per the instructions and edited lxml-2.2.1/setup.py to fix the STATIC_INCLUDE_DIRS and STATIC_LIBRARY_DIRS to point back to those newly extracted libraries. Now I run:

$ cd lxml-2.2.1
$ python setup.py bdist_wininst --static
Building lxml version 2.2.1.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
ERROR: /bin/sh: xslt-config: command not found
** make sure the development packages of libxml2 and libxslt are installed **
Using build configuration of libxslt
Building against libxml2/libxslt in one of the following directories:
  ..\libxml2-2.7.3.win32\lib
  ..\libxslt-1.1.24.win32\lib
  ..\zlib-1.2.3.win32\lib
  ..\iconv-1.9.2.win32\lib
running bdist_wininst
error: distribution contains extensions and/or C libraries; must be compiled on a Windows 32 platform

So I figure this is because I'm trying to build inside Cygwin. I tried inside a cmd.exe window, and it gets a bit further but then complains about not having Visual C++ 2003 installed... I tried -c mingw32 but setup.py didn't understand that option. Seemed like a dead end. I think it might be picking up my ActiveState Python installation too, rather than the Cygwin one. I also tried the actual versions of libxml, libxslt etc that the document actually mentions, same result. Anyway, my understanding is that all this should be unnecessary - the .egg should contain everything I need, shouldn't it? But the segmentation fault at import time is bad. I'd really like to use lxml in Cygwin if possible, because all the other options for xpath processing seem very limited. I like the idea of the egg because I want to be able to deploy my XML-processing python scripts with the .egg, so that I don't need to install any additional software (such as Cygwin packages). Any help would be greatly appreciated, please. Regards, -- David. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090616/9f10b825/attachment.htm From stefan_ml at behnel.de Tue Jun 16 08:46:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 16 Jun 2009 08:46:01 +0200 (CEST) Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: References: Message-ID: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> David Antliff wrote: > I am trying to get the Windows lxml egg to work with Cygwin, Python 2.5.2 > on WindowsXP, 32bit. I wasn't able to find a Cygwin package for lxml so > I felt I needed to obtain or build a win32 version of lxml. I want to > use the xpath > functionality that lxml provides in python scripts, running in Cygwin. > > *** All commands below are issued in a Cygwin bash shell *** Any reason why you use the Cygwin Python installation instead of a normal Windows build? I never tried building lxml on Windows myself, but with Cygwin, you may get away with installing the developer packages of libxml2 and libxslt, or with simply passing "--static-deps" as described for the MacOS build. MinGW should work, but you need to configure it in your distutils.cfg: [build] compiler=mingw32 http://docs.python.org/install/index.html#location-and-names-of-config-files If you get it to work, please report anything you had to do to the list. Or even better: write a little section for this file: http://codespeak.net/svn/lxml/trunk/doc/build.txt Please ask back if you run into further problems. Stefan From stefan_ml at behnel.de Tue Jun 16 16:00:21 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 16 Jun 2009 16:00:21 +0200 (CEST) Subject: [lxml-dev] Difference between xhtml etrees In-Reply-To: References: <4A30983F.3090205@behnel.de> Message-ID: Hi, please CC the list on replies. D wrote: > 2009/6/11 Stefan Behnel: >> D wrote: >>> I have two xhtml documents which I would like to compare. They are >>> available as etrees. 
>>> Ideally I would like to have a resulting tree, where the appropriate
>>> changes are marked with ins and del tags. I don't need anything fancy
>>> like a detection of moves.
>>>
>>> I had a look at lxml.html.diff
>>> http://codespeak.net/lxml/lxmlhtml.html#html-diff
>>> but it operates on html strings only, and not on my parsed tree.
>>
>> Did you try passing the root elements of the trees?
>
> passing the root objects was a good idea, it generates the
> difference the way I want it. I just don't manage to get the data back
> to xhtml. Maybe you could have a look:
>
> Here is my code:
> def expandFiles(filename):
>     """open the file named filename, return an etree"""
>     document = "".join(open(filename).readlines())
>     px = lxml.etree.XMLParser(load_dtd=True, no_network=False)
>     px.feed(document)
>     rx=px.close()
>     docx=lxml.etree.ElementTree(rx)
>     return docx

Note that "load_dtd" does not imply validation, just that a DTD will be
loaded if referenced. Also, it is a *lot* more efficient to do this:

    parser = lxml.etree.XMLParser(load_dtd=True, no_network=False)

    def expandFiles(filename):
        """open the file named filename, return an etree"""
        return lxml.etree.parse(filename, parser)

... and I'd actually rename the function (or drop it completely).

> r1=expandFiles(r"1.xhtml")
> r2=expandFiles(r"2.xhtml")
> diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot())
> # diff is now an html fragment, parse it
> pdiff = lxml.html.document_fromstring(diff)
> lxml.html.html_to_xhtml(pdiff)
> pe = lxml.etree.ElementTree(pdiff)

So far, so good.

> # this gives me an xhtml file that is parsed without errors by
> firefox, but does not contain any markup
> # it looks like this in firefox: {http://www.w3.org/1999/xhtml}meta>
> Resist SPR3012 Preparation{http://www.w3.org

Not sure how this can happen. I'll give it a try later today.

> # in addition, all character entities appear in the form > and not
> like they should: #62;

Would you have a 'real' example here?
> I don't manage to transform pdiff to the same form r1 and r2 are in.
>
> I am sure this is due to a basic misunderstanding of lxml, maybe you
> directly see what I am doing wrong?

Not directly, no. Maybe others have an idea?

Stefan

From D.Hendriks at tue.nl Tue Jun 16 16:20:07 2009
From: D.Hendriks at tue.nl (D.Hendriks (Dennis))
Date: Tue, 16 Jun 2009 16:20:07 +0200
Subject: [lxml-dev] parser target exception recovery bug?
Message-ID: <4A37AA17.7060902@tue.nl>

Hello all,

Using lxml 2.2 with a custom parser target (tree builder), I've run into
a problem when the parser target raises an exception. In this case,
parsing continues, although only for 'data' (not for 'start' and 'end').
I used recover=False when creating the XMLParser.

Using the following code:

import sys
from lxml import etree

# Parser target without exceptions.
class MyTreeBuilder1(object):
    def close(self):
        print 'close'
    def start(self, tag, attrs):
        print 'start', tag, attrs
    def data(self, data):
        if len(data.strip()) > 0:
            print 'data: data=', repr(data)
    def end(self, tag):
        print 'end', tag

# Parser target with exceptions.
class MyTreeBuilder2(MyTreeBuilder1):
    def close(self):
        print 'close'
    def start(self, tag, attrs):
        print 'start', tag, attrs
    def data(self, data):
        if len(data.strip()) > 0:
            print 'data: data=', repr(data)
    def end(self, tag):
        print 'end', tag
        if tag=='b':
            print 'ERROR'
            raise ValueError('error')

xml_data='''
<a>
<b>test</b>
<d>test2</d>
<d>test2</d>
</a>
'''

# Successful parsing.
print '---'
builder = MyTreeBuilder1()
parser = etree.XMLParser(target=builder, recover=False)
rslt = etree.fromstring(xml_data, parser)

# Unsuccessful parsing.
print '---'
builder = MyTreeBuilder2()
parser = etree.XMLParser(target=builder, recover=False)
rslt = etree.fromstring(xml_data, parser)

I get this output:

---
start a {}
start b {}
data: data= u'test'
end b
start d {}
data: data= u'test2'
end d
start d {}
data: data= u'test2'
end d
end a
close
---
start a {}
start b {}
data: data= u'test'
end b
ERROR
data: data= u'test2'
data: data= u'test2'
Traceback (most recent call last):
  File "lxml_parser_target_bug.py", line 49, in ?
    rslt = etree.fromstring(xml_data, parser)
  File "lxml.etree.pyx", line 2534, in lxml.etree.fromstring (src/lxml/lxml.etree.c:51135)
  File "parser.pxi", line 1523, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:76176)
  File "parser.pxi", line 1402, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:74927)
  File "parser.pxi", line 928, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:71707)
  File "parsertarget.pxi", line 135, in lxml.etree._TargetParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:82586)
  File "lxml.etree.pyx", line 230, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:6813)
  File "saxparser.pxi", line 227, in lxml.etree._handleSaxEnd (src/lxml/lxml.etree.c:78230)
  File "parsertarget.pxi", line 78, in lxml.etree._PythonSaxParserTarget._handleSaxEnd (src/lxml/lxml.etree.c:81918)
  File "lxml_parser_target_bug.py", line 33, in end
    raise ValueError('error')
ValueError: error

The first output (between --- and ---) is ok, since it is for the
non-exception parser target. The second output (after the second ---) is
not ok for me. You can see 'ERROR' at the point where the exception is
raised. After that, two 'data' events are generated in the parser
target. Clearly, parsing continued. Also, the 'close' is never called.
After the entire input is parsed, the exception is finally re-raised.

Two questions:
- Is the continued parsing ('data' function calls) a bug?
- Is the not calling 'close' a bug?

Any replies would be greatly appreciated.
Dennis

From jholg at gmx.de Tue Jun 16 16:41:09 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 16 Jun 2009 16:41:09 +0200
Subject: [lxml-dev] objectify parses 't' and 'f' as BoolElement?
Message-ID: <20090616144109.25650@gmx.net>

Hi,

somewhere on the way between lxml 2.0 and 2.1.5 objectify changed to
accept 't' and 'f' as values recognized as BoolElements:

2.0.alpha4:
0 $ python2.4 -i -c 'from lxml import etree, objectify; objectify.enableRecursiveStr(); print etree.__version__; print etree.LIBXML_VERSION, etree.LIBXSLT_VERSION'
2.0.alpha4
(2, 6, 27) (1, 1, 20)
>>> root = objectify.fromstring('<root><bool>f</bool></root>')
>>> print root
root = None [ObjectifiedElement]
    bool = 'f' [StringElement]
>>> root.bool.text
'f'

2.1.5:
0 $ python2.4 -i -c 'from lxml import etree, objectify; objectify.enable_recursive_str(); print etree.__version__; print etree.LIBXML_VERSION, etree.LIBXSLT_VERSION'
2.1.5
(2, 6, 32) (1, 1, 23)
>>> root = objectify.fromstring('<root><bool>f</bool></root>')
>>> print root
root = None [ObjectifiedElement]
    bool = False [BoolElement]

I consider this a bug, XML Schema datatypes spec says
"
3.2.2 boolean

[Definition:] boolean has the ·value space· required to support the
mathematical concept of binary-valued logic: {true, false}.

3.2.2.1 Lexical representation

An instance of a datatype that is defined as ·boolean· can have the
following legal literals {true, false, 1, 0}.
...
"

Objections to restricting this to said literals, again?

Holger

From stefan_ml at behnel.de Tue Jun 16 20:52:10 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 16 Jun 2009 20:52:10 +0200
Subject: [lxml-dev] parser target exception recovery bug?
In-Reply-To: <4A37AA17.7060902@tue.nl> References: <4A37AA17.7060902@tue.nl> Message-ID: <4A37E9DA.8010809@behnel.de> Hi, D.Hendriks (Dennis) wrote: > Using lxml 2.2 with a custom parser target (tree builder), I've run into > a problem when the parser target raises an exception. In this case, > parsing continues, although only for 'data' (not for 'start' and 'end'). > > [nicely detailed example stripped] > > The first output (between --- and ---) is ok, since it is for the > non-exception parser target. The second output (after the second ---) is > not ok for me. You can see 'ERROR' at the point where the exception is > raised. After that, two 'data' events are generated in the parser > target. Clearly, parsing continued. Also, the 'close' is never called. > After the entire input is parsed, the exception is finally re-raised. > > Two questions: > - Is the continued parsing ('data' function calls) a bug? Yes. Should be fixed in SVN now: https://codespeak.net/viewvc/?view=rev&revision=65796 > - Is the not calling 'close' a bug? I don't know. ElementTree doesn't specify the behaviour in the error case. http://effbot.org/elementtree/elementtree-xmlparser.htm In my tests, ET 1.3 didn't call the .close() method either. I may have to look into this a bit closer, but so far, I don't see an obligation to call it in the case of an error. Stefan From stefan_ml at behnel.de Tue Jun 16 21:05:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 16 Jun 2009 21:05:05 +0200 Subject: [lxml-dev] objectify parses 't' and 'f' as BoolElement? 
In-Reply-To: <20090616144109.25650@gmx.net>
References: <20090616144109.25650@gmx.net>
Message-ID: <4A37ECE1.8020004@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> somewhere on the way between lxml 2.0 and 2.1.5 objectify changed to
> accept 't' and 'f' as values recognized as BoolElements:
>
> 2.0.alpha4:
> 0 $ python2.4 -i -c 'from lxml import etree, objectify; objectify.enableRecursiveStr(); print etree.__version__; print etree.LIBXML_VERSION, etree.LIBXSLT_VERSION'
> 2.0.alpha4
> (2, 6, 27) (1, 1, 20)
>>>> root = objectify.fromstring('<root><bool>f</bool></root>')
>>>> print root
> root = None [ObjectifiedElement]
>     bool = 'f' [StringElement]
>>>> root.bool.text
> 'f'
>
> 2.1.5:
> 0 $ python2.4 -i -c 'from lxml import etree, objectify; objectify.enable_recursive_str(); print etree.__version__; print etree.LIBXML_VERSION, etree.LIBXSLT_VERSION'
> 2.1.5
> (2, 6, 32) (1, 1, 23)
>>>> root = objectify.fromstring('<root><bool>f</bool></root>')
>>>> print root
> root = None [ObjectifiedElement]
>     bool = False [BoolElement]
>
>
> I consider this a bug, XML Schema datatypes spec says
> "
> 3.2.2 boolean
>
> [Definition:] boolean has the ·value space· required to support the mathematical concept of binary-valued logic: {true, false}.
>
> 3.2.2.1 Lexical representation
>
> An instance of a datatype that is defined as ·boolean· can have the following legal literals {true, false, 1, 0}.
> ...
> "
>
> Objections to restricting this to said literals, again?

No. I don't think something as short as 'f' and 't' is worth being
considered a boolean value by default. Note that '0' and '1' will never be
considered boolean unless there is a type hint, so the only literals that
will become boolean by default are 'true' and 'false'.

If there really is code that depends on t/f, you can get the old behaviour
by registering a custom PyType for BoolElement before the "bool" type. This
is a global configuration and is backwards compatible, so this isn't much
of a problem for users.
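The literal rules described above can be illustrated with a small, library-independent sketch. This is not lxml.objectify's implementation; parse_bool and its type_hint flag are hypothetical names used only to show which strings would count as boolean by default and which only with an explicit type hint.

```python
# Literals the XML Schema boolean type allows, per the spec text quoted above.
XSD_BOOLEAN_LITERALS = {"true": True, "false": False, "1": True, "0": False}

# Literals treated as boolean without any type hint, as described above.
DEFAULT_BOOLEAN_LITERALS = ("true", "false")


def parse_bool(text, type_hint=False):
    """Return True/False if text is accepted as a boolean, else None.

    Hypothetical helper: with type_hint=True all four XSD literals are
    accepted; without it only 'true' and 'false' are.
    """
    if type_hint:
        return XSD_BOOLEAN_LITERALS.get(text)
    if text in DEFAULT_BOOLEAN_LITERALS:
        return XSD_BOOLEAN_LITERALS[text]
    return None


if __name__ == "__main__":
    for literal in ("true", "false", "1", "0", "t", "f"):
        print(literal, parse_bool(literal), parse_bool(literal, type_hint=True))
```

Under these rules 't' and 'f' are never boolean, and '0' and '1' stay non-boolean unless the caller asks for the boolean type explicitly.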
Can you do the change and add a comment on this to the changelog? Stefan From dalist0 at gmail.com Tue Jun 16 23:08:41 2009 From: dalist0 at gmail.com (D) Date: Tue, 16 Jun 2009 17:08:41 -0400 Subject: [lxml-dev] Difference between xhtml etrees In-Reply-To: References: <4A30983F.3090205@behnel.de> Message-ID: Hi All, > please CC the list on replies. I am sorry, I pressed the wrong button. I made a running example and attached three small files, the code finds the difference between the two files r1.xhtml and r2.xhtml. The output is written to the file rdiff.xhtml. This file does not display correctly in Firefox. Please note that the output diff is not totally correct. r1 reads "Leave some solvent in the bowl." and r2 "Leave some solvent in the bowl and heat." the code marks: bowl and heat.END{http://www.w3.org/1999/xhtml}p> {http://www.w3.org/1999/xhtml}p> Previous Versions: {http://www.w3.org/1999/xhtml}b>{http://www.w3.org/1999/xhtml}p> as inserted, i.e. "bowl and heat." instead of "and heat" > Note that "load_dtd" does not imply validation, just that a DTD will be > loaded if referenced. unfortunately my original xhtml is very non-conforming. (I am planning to migrate a laboratory notebook that was unfortunately written in word. The plan is to copy from word, paste into Kompozer, then parse the result, get rid of all the word-specific stuff and validate later. This is necessary because each experiment is composed of many smaller descriptions which will be put together into big file. 
Unfortunately word 2007 still can not handle a master document that contains other documents) best Daniel def minimalExample(): # files contain entities like   # often r contains illegal attributes (start , type in ol), not DTD conforming element content (br), and illegally nested paragraphs (p in p, p in b) parser = lxml.etree.XMLParser(load_dtd=True, dtd_validation=True, no_network=False) r1 = lxml.etree.parse("r1.xhtml", parser) r2 = lxml.etree.parse("r2.xhtml", parser) diff = lxml.html.diff.htmldiff(r1.getroot(),r2.getroot()) pdiff = lxml.html.document_fromstring(diff) lxml.html.html_to_xhtml(pdiff) pe = lxml.etree.ElementTree(pdiff) pe.write("rdiff.xhtml",pretty_print = True) -------------- next part -------------- A non-text attachment was scrubbed... Name: r1.xhtml Type: application/xhtml+xml Size: 850 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment.xhtml -------------- next part -------------- A non-text attachment was scrubbed... Name: r2.xhtml Type: application/xhtml+xml Size: 845 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment-0001.xhtml -------------- next part -------------- A non-text attachment was scrubbed... 
Name: rdiff.xhtml Type: application/xhtml+xml Size: 1075 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090616/71dda68b/attachment-0002.xhtml From david.antliff at gmail.com Tue Jun 16 23:22:17 2009 From: david.antliff at gmail.com (David Antliff) Date: Wed, 17 Jun 2009 09:22:17 +1200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: On Tue, Jun 16, 2009 at 18:46, Stefan Behnel wrote: > > David Antliff wrote: > > I am trying to get the Windows lxml egg to work with Cygwin, Python 2.5.2 > > on WindowsXP, 32bit. I wasn't able to find a Cygwin package for lxml so > > I felt I needed to obtain or build a win32 version of lxml. I want to > > use the xpath > > functionality that lxml provides in python scripts, running in Cygwin. > > > > *** All commands below are issued in a Cygwin bash shell *** > > Any reason why you use the Cygwin Python installation instead of a normal > Windows build? The situation is that I'm maintaining a build environment for FPGA tools. The users are hardware engineers with little experience in software issues. We use Cygwin because it provides GNU make, python, perl, and a bunch of other GNU things. The main problem is that the environment is already deployed and I want to avoid requiring people to manually update their Cygwin installation to bring in an lxml library. Actually, there doesn't even seem to BE a Cygwin lxml library. I have written a program as part of a new process in our build system that does a lot of XML processing. I was using ElementTree but it's falling short in several areas. Lxml looks like a good solution. So my thinking was that I could bundle the lxml egg with the next version of the build system and nobody would need to manually intervene. 
As it stands, there's also a need to install setuptools but this can be
done easily and automatically.

Do you think the binary egg provided on the web does not work because
I'm using a different version of Python (Cygwin, rather than Windows
native) with it?

If possible, I want to avoid switching to the Windows native python
because nobody else has it installed.

> I never tried building lxml on Windows myself, but with Cygwin, you may
> get away with installing the developer packages of libxml2 and libxslt, or
> with simply passing "--static-deps" as described for the MacOS build.
> MinGW should work, but you need to configure it in your distutils.cfg:
>
> [build]
> compiler=mingw32
>
> http://docs.python.org/install/index.html#location-and-names-of-config-files
>
> If you get it to work, please report anything you had to do to the list.
> Or even better: write a little section for this file:
>
> http://codespeak.net/svn/lxml/trunk/doc/build.txt
>
> Please ask back if you run into further problems.

I'll give this a try - thank you very much.

Regards,
--
David.

From david.antliff at gmail.com Wed Jun 17 02:06:59 2009
From: david.antliff at gmail.com (David Antliff)
Date: Wed, 17 Jun 2009 12:06:59 +1200
Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin
In-Reply-To:
References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

> On Tue, Jun 16, 2009 at 18:46, Stefan Behnel wrote:
>> I never tried building lxml on Windows myself, but with Cygwin, you may
>> get away with installing the developer packages of libxml2 and libxslt, or
>> with simply passing "--static-deps" as described for the MacOS build.
>> MinGW should work, but you need to configure it in your distutils.cfg:
>>
>> [build]
>> compiler=mingw32
>>
>> http://docs.python.org/install/index.html#location-and-names-of-config-files
>>
>> If you get it to work, please report anything you had to do to the list.
>> Or even better: write a little section for this file: >> >> http://codespeak.net/svn/lxml/trunk/doc/build.txt >> >> Please ask back if you run into further problems. I am trying a slightly different approach - compiling entirely within Cygwin, using Cygwin's gcc. What I have done is unpacked lxml-2.2.1.tgz.gz into lxml-2.2.1, then inside that directory I try: $ python setup.py build --static-deps This proceeds to download libxml2 and libxslt, unpack them, and build them. But it runs into numerous problems related to include/library paths. Here are my 'fixes' as they pop up. I run the correct command manually, then type 'make' to continue the process. Unfortunately if I run that setup.py command again, it clears out the progress-so-far, so I'm not sure what I'm going to do once Make completes successfully... running build running build_py running build_ext building 'lxml.objectify' extension gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.objectify.o /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libexslt.a /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxslt.a -L/cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib -L/usr/lib/python2.5/config -lz -lm -lpython2.5 -o build/lib.cygwin-1.5.25-i686-2.5/lxml/objectify.dll /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a(encoding.o): In function `xmlFindCharEncodingHandler': /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2-2.7.3/encoding.c:1614: undefined reference to `_libiconv_open' ...more libiconv errors... 
I fix this by adding "/usr/lib/libiconv.a": $ gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.objectify.o /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libexslt.a /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxslt.a -L/cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib -L/usr/lib/python2.5/config -lz -lm -lpython2.5 -o build/lib.cygwin-1.5.25-i686-2.5/lxml/objectify.dll /usr/lib/libiconv.a I wonder why setup.py didn't automatically download libiconv when it downloaded libxml2 and libxslt... hmm Now I type 'make' to continue to the next problem: Using build configuration of libxslt running build_ext building 'lxml.etree' extension gcc -shared -Wl,--enable-auto-image-base build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.etree.o -L/usr/lib/python2.5/config -lxslt -lexslt -lxml2 -lz -lm -lpython2.5 -o build/lib.cygwin-1.5.25-i686-2.5/lxml/etree.dll /usr/lib/gcc/i686-pc-cygwin/3.4.4/../../../../i686-pc-cygwin/bin/ld: cannot find -lxslt collect2: ld returned 1 exit status But at this point I'm not convinced 'make' is doing the same thing that the --static-deps would cause. So I'm not sure this is a valid path to take. Is there a way to have 'setup.py build' continue from where it last reached, rather than starting everything again from scratch? -- David. From D.Hendriks at tue.nl Wed Jun 17 08:16:18 2009 From: D.Hendriks at tue.nl (D.Hendriks (Dennis)) Date: Wed, 17 Jun 2009 08:16:18 +0200 Subject: [lxml-dev] parser target exception recovery bug? In-Reply-To: <4A37E9DA.8010809@behnel.de> References: <4A37AA17.7060902@tue.nl> <4A37E9DA.8010809@behnel.de> Message-ID: <4A388A32.5040401@tue.nl> Hello, > Yes. Should be fixed in SVN now: Thanks. > I may have to look into this a bit closer, but so far, I don't see an > obligation to call it in the case of an error. 
The lxml 2.2 documentation (pdf) states: "You can reuse the parser and its target as often as you like, so you should take care that the .close() methods really resets the target to a usable state (also in the case of an error!)." I assumed this meant that the close() method will always be called, even in case of errors, to (re)set the parser target into a usable state for the next parsing. Is this not true? If not, the next parsing operation will most likely fail, since the state of the parser target was not reset yet. Dennis Stefan Behnel wrote: > Hi, > > D.Hendriks (Dennis) wrote: >> Using lxml 2.2 with a custom parser target (tree builder), I've run into >> a problem when the parser target raises an exception. In this case, >> parsing continues, although only for 'data' (not for 'start' and 'end'). >> >> [nicely detailed example stripped] >> >> The first output (between --- and ---) is ok, since it is for the >> non-exception parser target. The second output (after the second ---) is >> not ok for me. You can see 'ERROR' at the point where the exception is >> raised. After that, two 'data' events are generated in the parser >> target. Clearly, parsing continued. Also, the 'close' is never called. >> After the entire input is parsed, the exception is finally re-raised. >> >> Two questions: >> - Is the continued parsing ('data' function calls) a bug? > > Yes. Should be fixed in SVN now: > > https://codespeak.net/viewvc/?view=rev&revision=65796 > >> - Is the not calling 'close' a bug? > > I don't know. ElementTree doesn't specify the behaviour in the error case. > > http://effbot.org/elementtree/elementtree-xmlparser.htm > > In my tests, ET 1.3 didn't call the .close() method either. I may have to > look into this a bit closer, but so far, I don't see an obligation to call > it in the case of an error. 
> > Stefan > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From stefan_ml at behnel.de Wed Jun 17 09:05:49 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Jun 2009 09:05:49 +0200 (CEST) Subject: [lxml-dev] parser target exception recovery bug? In-Reply-To: <4A388A32.5040401@tue.nl> References: <4A37AA17.7060902@tue.nl> <4A37E9DA.8010809@behnel.de> <4A388A32.5040401@tue.nl> Message-ID: <9cd34d6e05d236ba7b9d23bebb645ae3.squirrel@groupware.dvs.informatik.tu-darmstadt.de> D.Hendriks (Dennis) wrote: > Stefan Behnel wrote: >> D.Hendriks (Dennis) wrote: >>> - Is the not calling 'close' a bug? >> >> I don't know. ElementTree doesn't specify the behaviour in the error >> case. >> >> http://effbot.org/elementtree/elementtree-xmlparser.htm >> >> In my tests, ET 1.3 didn't call the .close() method either. I may have >> to look into this a bit closer, but so far, I don't see an obligation >> to call it in the case of an error. >> I may have to look into this a bit closer, but so far, I don't see an >> obligation to call it in the case of an error. > > The lxml 2.2 documentation (pdf) states: "You can reuse the parser and > its target as often as you like, so you should take care that the > .close() methods really resets the target to a usable state (also in the > case of an error!)." I assumed this meant that the close() method will > always be called, even in case of errors, to (re)set the parser target > into a usable state for the next parsing. Is this not true? If not, the > next parsing operation will most likely fail, since the state of the > parser target was not reset yet. Hmm, yes, in that light, it does make sense to call .close() even after an error. However, the question is if existing code is prepared for such a call even if the parsing failed for some reason. 
Imagine one of the callbacks (like ".data()") raises an exception which
needs to get propagated, and then we call ".close()", which also happens
to raise an exception (maybe due to an inconsistent state). I assume that
the latter should be ignored then, although it may really hide a bug (or
even a resource leak) in the user code, so both exceptions are of
interest. Dropping exceptions is a bad thing... We should at least write
something to the parser log, I think.

This could be handled differently in Py3, though. We could raise the
.close() exception there (instead of dropping it) and pass the original
exception as its context (instead of raising it). You would get a
different exception in Py3 in that case, but at least you wouldn't lose
any information.

This would effectively map to this Python code in Py3:

    try:
        parse_to_target(parser_target, input)
    except:
        parser_target.close()
        raise
    else:
        return parser_target.close()

and to this code in Py2:

    try:
        parse_to_target(parser_target, input)
    except:
        try:
            parser_target.close()
        except:
            pass
        raise
    else:
        return parser_target.close()

Given the documented requirement that you quote above (which I likely
wrote myself ;), it's actually a clear bug in user code if .close() fails
to reset the target in an error case. So I think users will just have to
fix their code. And since this only affects a case where things went wrong
anyway, I doubt that there will be a huge impact in fixing this. You may
get a different exception in some cases, but it makes things a lot safer
to always call .close().
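For readers who want to try the Py3 variant above, here is a minimal runnable sketch. It uses the standard library's xml.etree.ElementTree target interface rather than lxml, purely so that the example is self-contained; RecordingTarget and this parse_to_target are hypothetical names, and the sketch says nothing about how lxml implements the behaviour internally.

```python
import xml.etree.ElementTree as ET


class RecordingTarget(object):
    """Hypothetical parser target that records tag events and can fail on a tag."""

    def __init__(self, fail_on=None):
        self.fail_on = fail_on
        self.events = []
        self.close_calls = 0

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def data(self, text):
        pass  # text content is ignored in this sketch

    def end(self, tag):
        self.events.append(("end", tag))
        if tag == self.fail_on:
            raise ValueError("error in end(%r)" % tag)

    def close(self):
        # Reset the target to a usable state, also after an error.
        self.close_calls += 1
        events, self.events = self.events, []
        return events


def parse_to_target(xml_text, target):
    """Feed xml_text to target; call target.close() on success and after a callback error."""
    parser = ET.XMLParser(target=target)
    try:
        parser.feed(xml_text)
    except BaseException:
        # If close() raises here, Python 3 attaches the original exception
        # as __context__ of the new one, so neither is lost.
        target.close()
        raise
    else:
        # On success the parser's close() calls target.close() and returns its result.
        return parser.close()


if __name__ == "__main__":
    print(parse_to_target("<a><b>test</b></a>", RecordingTarget()))
    failing = RecordingTarget(fail_on="b")
    try:
        parse_to_target("<a><b>test</b><d>test2</d></a>", failing)
    except ValueError as exc:
        print("raised:", exc, "close calls:", failing.close_calls)
```

Because close() runs on both paths, the same target instance can be handed to another parse afterwards without carrying over events from the failed run.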
Stefan From stefan_ml at behnel.de Wed Jun 17 08:09:49 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Jun 2009 08:09:49 +0200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <4A3888AD.6060107@behnel.de> Hi, David Antliff wrote: > On Tue, Jun 16, 2009 at 18:46, Stefan Behnel wrote: >> David Antliff wrote: >>> I am trying to get the Windows lxml egg to work with Cygwin, Python 2.5.2 >>> on WindowsXP, 32bit. I wasn't able to find a Cygwin package for lxml so >>> I felt I needed to obtain or build a win32 version of lxml. I want to >>> use the xpath >>> functionality that lxml provides in python scripts, running in Cygwin. >>> >>> *** All commands below are issued in a Cygwin bash shell *** >> Any reason why you use the Cygwin Python installation instead of a normal >> Windows build? > > The situation is that I'm maintaining a build environment for FPGA > tools. The users are hardware engineers with little experience in > software issues. We use Cygwin because it provides GNU make, python, > perl, and a bunch of other GNU things. The main problem is that the > environment is already deployed and I want to avoid requiring people > to manually update their Cygwin installation to bring in an lxml > library. Actually, there doesn't even seem to BE a Cygwin lxml > library. Yes, I assumed there wasn't. > If possible, I want to avoid switching to the Windows native python > because nobody else has it installed. Thanks for the explanation, that makes sense to me. > my thinking was that I could bundle the lxml egg with the next version of > the build system and nobody would need to manually intervene. As it > stands, there's also a need to install setuptools but this can be done > easily and automatically. I don't think we really /require/ setuptools. IIRC, it's always been optional. 
> Do you think the binary egg provided on the web does not work because > I'm using a different version of Python (Cygwin, rather than Windows > native) with it? Yes, I'm pretty sure that is the case. It was built with a different compiler for a different environment. Stefan From stefan_ml at behnel.de Wed Jun 17 08:35:44 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Jun 2009 08:35:44 +0200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <4A388EC0.6040501@behnel.de> Hi, David Antliff wrote: > I am trying a slightly different approach - compiling entirely within > Cygwin, using Cygwin's gcc. Right, that should be best anyway. > What I have done is unpacked lxml-2.2.1.tgz.gz into lxml-2.2.1, then > inside that directory I try: > > $ python setup.py build --static-deps > > This proceeds to download libxml2 and libxslt, unpack them, and build > them. But it runs into numerous problems related to include/library > paths. ... which you may be able to fix using appropriate CFLAGS/LDFLAGS. > Here are my 'fixes' as they pop up. I run the correct command > manually, then type 'make' to continue the process. Unfortunately if I > run that setup.py command again, it clears out the progress-so-far, so > I'm not sure what I'm going to do once Make completes successfully... 
> > running build > running build_py > running build_ext > building 'lxml.objectify' extension > gcc -shared -Wl,--enable-auto-image-base > build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.objectify.o > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libexslt.a > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxslt.a > -L/cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib > -L/usr/lib/python2.5/config -lz -lm -lpython2.5 -o > build/lib.cygwin-1.5.25-i686-2.5/lxml/objectify.dll > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a(encoding.o): > In function `xmlFindCharEncodingHandler': > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2-2.7.3/encoding.c:1614: > undefined reference to `_libiconv_open' > ...more libiconv errors... > > I fix this by adding "/usr/lib/libiconv.a": > $ gcc -shared -Wl,--enable-auto-image-base > build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.objectify.o > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libexslt.a > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxml2.a > /cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib/libxslt.a > -L/cygdrive/d/git/work/lxml-exp/lxml-2.2.1/build/tmp/libxml2/lib > -L/usr/lib/python2.5/config -lz -lm -lpython2.5 -o > build/lib.cygwin-1.5.25-i686-2.5/lxml/objectify.dll > /usr/lib/libiconv.a I don't know if you install the libiconv developer package on your machine (I hope that exists in Cygwin), because building against the shared lib should work just fine here. Maybe you need to point gcc to the right include and/or lib directory. I wonder why it didn't add "-liconv" automatically... You can also try to go the ugly route and add "/usr/lib/libiconv.a" to your CFLAGS, but I think you'll be happier with the shared lib. 
> I wonder why setup.py didn't automatically download libiconv when it > downloaded libxml2 and libxslt... hmm Because the static build was designed for MacOS-X, where only those two libraries are a problem. The libiconv is binary compatible enough across versions not to pose any major problems. So it's best to build dynamically against libiconv. That said, it shouldn't be too hard to add code to also download libiconv and build it. I would be happy to receive a patch that accepts an optional list of library names for the --static-deps option, as in "--static-deps=libxml2,libxslt,iconv", and would then download and build all requested libraries. Although I doubt that it would make sense (or even work) to pass only "--static-deps=iconv", so maybe a new option "--build-iconv" would be better. > Now I type 'make' to continue to the next problem: > Using build configuration of libxslt > running build_ext > building 'lxml.etree' extension > gcc -shared -Wl,--enable-auto-image-base > build/temp.cygwin-1.5.25-i686-2.5/src/lxml/lxml.etree.o > -L/usr/lib/python2.5/config -lxslt -lexslt -lxml2 -lz -lm -lpython2.5 > -o build/lib.cygwin-1.5.25-i686-2.5/lxml/etree.dll > /usr/lib/gcc/i686-pc-cygwin/3.4.4/../../../../i686-pc-cygwin/bin/ld: > cannot find -lxslt > collect2: ld returned 1 exit status > > But at this point I'm not convinced 'make' is doing the same thing > that the --static-deps would cause. So I'm not sure this is a valid > path to take. Yes, this is linking dynamically against libxslt. > Is there a way to have 'setup.py build' continue from where it last > reached, rather than starting everything again from scratch? No. The goal is a completely automated build, and there is no option for manual interaction. (Well, you can always start the debugger from the setup script and run your own code, but I doubt that that's something that's worth maintaining in the long term). 
Stefan From D.Hendriks at tue.nl Wed Jun 17 09:47:14 2009 From: D.Hendriks at tue.nl (D.Hendriks (Dennis)) Date: Wed, 17 Jun 2009 09:47:14 +0200 Subject: [lxml-dev] parser target exception recovery bug? In-Reply-To: <9cd34d6e05d236ba7b9d23bebb645ae3.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <4A37AA17.7060902@tue.nl> <4A37E9DA.8010809@behnel.de> <4A388A32.5040401@tue.nl> <9cd34d6e05d236ba7b9d23bebb645ae3.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <4A389F82.6030708@tue.nl> Hello, Stefan Behnel wrote: > D.Hendriks (Dennis) wrote: >> Stefan Behnel wrote: >>> D.Hendriks (Dennis) wrote: >>>> - Is the not calling 'close' a bug? >>> I don't know. ElementTree doesn't specify the behaviour in the error >>> case. >>> >>> http://effbot.org/elementtree/elementtree-xmlparser.htm >>> >>> In my tests, ET 1.3 didn't call the .close() method either. I may have >>> to look into this a bit closer, but so far, I don't see an obligation >>> to call it in the case of an error. >>> I may have to look into this a bit closer, but so far, I don't see an >>> obligation to call it in the case of an error. >> The lxml 2.2 documentation (pdf) states: "You can reuse the parser and >> its target as often as you like, so you should take care that the >> .close() methods really resets the target to a usable state (also in the >> case of an error!)." I assumed this meant that the close() method will >> always be called, even in case of errors, to (re)set the parser target >> into a usable state for the next parsing. Is this not true? If not, the >> next parsing operation will most likely fail, since the state of the >> parser target was not reset yet. > > Hmm, yes, in that light, it does make sense to call .close() even after an > error. However, the question is if existing code is prepared for such a > call even if the parsing failed for some reason. 
> Imagine one of the callbacks (like ".data()") raises an exception which
> needs to get propagated, and then we call ".close()", which also happens
> to raise an exception (maybe due to an inconsistent state). I assume that
> the latter should be ignored then, although it may really hide a bug (or
> even a resource leak) in the user code, so both exceptions are of interest.
> Dropping exceptions is a bad thing... We should at least write something
> to the parser log, I think.
>
> This could be handled differently in Py3, though. We could raise the
> .close() exception there (instead of dropping it) and pass the original
> exception as its context (instead of raising it). You would get a
> different exception in Py3 in that case, but at least you wouldn't lose
> any information. This would effectively map to this Python code in Py3:
>
>     try:
>         parse_to_target(parser_target, input)
>     except:
>         parser_target.close()
>         raise
>     else:
>         return parser_target.close()
>
> and to this code in Py2:
>
>     try:
>         parse_to_target(parser_target, input)
>     except:
>         try: parser_target.close()
>         except: pass
>         raise
>     else:
>         return parser_target.close()
>
> Given the documented requirement that you quote above (which I likely
> wrote myself ;), it's actually a clear bug in user code if .close() fails
> to reset the target in an error case. So I think users will just have to
> fix their code. And since this only affects a case where things went wrong
> anyway, I doubt that there will be a huge impact in fixing this. You may
> get a different exception in some cases, but it makes things a lot safer
> to always call .close().
>
> Stefan

Thanks for the reply. I agree that close() should always be called.

In case of multiple exceptions, you could create an instance of yet another
exception, containing the 'multiple exceptions'. This could for instance be
called MultiParseException or whatever. This would work for both Python 2.x
and Python 3.x...
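For illustration, the "always call close()" behaviour discussed above can be tried out with a toy target. This is only a sketch under stated assumptions: CollectingTarget, feed_all() and parse_with_close() are made-up names, feed_all() merely stands in for the real parser, and none of this is lxml's actual implementation.

```python
class CollectingTarget(object):
    """Toy parser target (hypothetical): collects text, fails on 'boom'."""

    def __init__(self):
        self.chunks = []
        self.closed = 0

    def data(self, text):
        if text == "boom":
            raise ValueError("bad data")
        self.chunks.append(text)

    def close(self):
        # Reset to a usable state, also after an error.
        result, self.chunks = "".join(self.chunks), []
        self.closed += 1
        return result


def feed_all(target, pieces):
    """Stand-in for the real parser: pass each piece to target.data()."""
    for piece in pieces:
        target.data(piece)


def parse_with_close(target, pieces):
    """Call close() on success and on failure, as in the Py2 variant above."""
    try:
        feed_all(target, pieces)
    except BaseException:
        try:
            target.close()
        except Exception:
            pass  # the first exception takes priority over a close() failure
        raise
    else:
        return target.close()


target = CollectingTarget()
print(parse_with_close(target, ["a", "b"]))
try:
    parse_with_close(target, ["x", "boom"])
except ValueError as exc:
    print("parse failed: %s" % exc)
# The target was reset by close(), so it can be reused after the error.
print(parse_with_close(target, ["c"]))
```

Because close() runs in the error path too, the leftover "x" from the failed run does not leak into the next parse.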
Anyway, let me know what you decide and when it will be in SVN.

Thanks!
Dennis

From jholg at gmx.de Wed Jun 17 17:38:45 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 17 Jun 2009 17:38:45 +0200
Subject: [lxml-dev] objectify parses 't' and 'f' as BoolElement?
In-Reply-To: <4A37ECE1.8020004@behnel.de>
References: <20090616144109.25650@gmx.net> <4A37ECE1.8020004@behnel.de>
Message-ID: <20090617153845.103380@gmx.net>

Hi,

> > An instance of a datatype that is defined as "boolean" can have the
> > following legal literals {true, false, 1, 0}.
> > ...
> > "
> >
> > Objections to restricting this to said literals, again?
>
> No. I don't think something as short as 'f' and 't' is worth being
> considered a boolean value by default.
>
> Note that '0' and '1' will never be considered boolean unless there is a
> type hint, so the only literals that will become boolean by default are
> 'true' and 'false'. Which is due to the standard objectify lookup order.
> Can you do the change and add a comment on this to the changelog?

Done:
https://codespeak.net/viewvc/?view=rev&revision=65803

A question, though:
I noticed some optimization with respect to using _cstr() on the text and
then comparing the resulting C string in the 2.0alpha I previously used.

Curious about what _cstr() does, I checked: it is a shortcut to
PyString_AS_STRING.

Python docs say:
char* PyString_AS_STRING(PyObject *string)
    Macro form of PyString_AsString but without error checking. Only string
    objects are supported; no Unicode objects should be passed.

That basically means I *cannot* use _cstr() and C string comparison for
__parseBoolAsInt(), as I would run into havoc with unicode, right?

Holger

From stefan_ml at behnel.de Wed Jun 17 18:00:45 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 17 Jun 2009 18:00:45 +0200
Subject: [lxml-dev] objectify parses 't' and 'f' as BoolElement?
In-Reply-To: <20090617153845.103380@gmx.net> References: <20090616144109.25650@gmx.net> <4A37ECE1.8020004@behnel.de> <20090617153845.103380@gmx.net> Message-ID: <4A39132D.9020509@behnel.de> jholg at gmx.de wrote: > I noticed some optimization wrt to using _cstr() on the text and then > comparing the resulting C string in the 2.0alpha I previously used. > > Curious about what _cstr() does I checked, it is a shortcut to > PyString_AS_STRING. > > Python docs say: > char* PyString_AS_STRING(PyObject *string)? > Macro form of PyString_AsString but without error checking. Only string > objects are supported; no Unicode objects should be passed. > > That basically means I *cannot* use _cstr() and C string comparison for > __parseBoolAsInt(), as I would run into havoc with unicode, right? Right. _cstr() is only used as an optimisation when we know we have a plain byte string. It's quite a bit less overhead than the couple of checks that a call to PyString_AsString() does, plus there's no error handling code generated by Cython. The bool parsing code in lxml.objectify isn't very fast. Making it faster requires some special casing and a call to utf8() - that may be doable without too much code bloat. Not sure it's worth it, though. There's a lot more overhead in other places of lxml.objectify (I'm not saying it's slow, just that the convenient API is bought with some overhead). Stefan From stefan_ml at behnel.de Wed Jun 17 18:40:28 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Jun 2009 18:40:28 +0200 Subject: [lxml-dev] parser target exception recovery bug? 
In-Reply-To: <4A389F82.6030708@tue.nl> References: <4A37AA17.7060902@tue.nl> <4A37E9DA.8010809@behnel.de> <4A388A32.5040401@tue.nl> <9cd34d6e05d236ba7b9d23bebb645ae3.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A389F82.6030708@tue.nl> Message-ID: <4A391C7C.2020500@behnel.de> D.Hendriks (Dennis) wrote: > In case of multiple exceptions, you could create an instance of yet another > exception, containing the 'multiple exceptions'. This could for instance > be called MultiParseException or whatever. This would work for both > Python 2.x and Python 3.x... No, not that easily. 1) I deem the first exception more important than the one raised by .close(), and 2) the user may anticipate or even control the exception being raised, so raising a different one may miss a well-forged except statement in user code. I actually mixed up the way exception chaining works in Py3. It's right to always raise the first exception in both Py2 and Py3, and attach the second exception only in Py3. The Python code snippets I gave were correct, though. > Anyway, let me know what you decide and when it will be in SVN. Thanks! No schedule yet. Could you file a bug report for now? Stefan From az at svilendobrev.com Thu Jun 18 23:07:07 2009 From: az at svilendobrev.com (az at svilendobrev.com) Date: Fri, 19 Jun 2009 00:07:07 +0300 Subject: [lxml-dev] can validation yield multiple errors? Message-ID: <200906190007.07744.az@svilendobrev.com> hi i'm new to lxml/libxml. as i see, the validation (i use schemas) stops at first error. is it possible to get all errors that parser is able to recover from? ciao svilen www.svilendobrev.com From stefan_ml at behnel.de Fri Jun 19 07:06:33 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 19 Jun 2009 07:06:33 +0200 Subject: [lxml-dev] can validation yield multiple errors? 
In-Reply-To: <200906190007.07744.az@svilendobrev.com> References: <200906190007.07744.az@svilendobrev.com> Message-ID: <4A3B1CD9.5080107@behnel.de> az at svilendobrev.com wrote: > as i see, the validation (i use schemas) stops at first error. > is it possible to get all errors that parser is able to recover from? You didn't state if you were validating at parse time or afterwards (please try both), but in any case, I have no idea if (and if so, how well) libxml2 can recover from validation errors. At least, I didn't see an obvious switch anywhere in the API docs. Stefan From stefan_ml at behnel.de Fri Jun 19 07:08:38 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 19 Jun 2009 07:08:38 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A34DD37.8000207@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> Message-ID: <4A3B1D56.3000007@behnel.de> Stefan Behnel wrote: > Stephan Eletzhofer wrote: >>>> Do we want to give the static versions a dedicated name? >>>> "lxml-x.y.z-static" or smth like >>>> that? Maybe not everyone wants static eggs (even not on OSX boxen)? >>> I would want them to be the default download on MacOS (i.e. no special >>> naming), but only after some testing by some MacOS users on differently >>> configured systems, i.e. using different MacOS versions, different >>> package >>> distributions, different Python versions, etc. So maybe you could provide >>> them separately first? >> Yes sure -- you mean "separately" as in "not on pypi"? > > Yep, just build some and advertise them on lxml-dev asking for feedback. If > they seem to work for everyone, we can upload them to PyPI. ... any news from the front? 
Stefan From limi at plone.org Fri Jun 19 07:30:51 2009 From: limi at plone.org (Alexander Limi) Date: Thu, 18 Jun 2009 22:30:51 -0700 Subject: [lxml-dev] Binary egg for Mac OS X References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> Message-ID: On Thu, 18 Jun 2009 22:08:38 -0700, Stefan Behnel wrote: > > Stefan Behnel wrote: >> Stephan Eletzhofer wrote: >>>>> Do we want to give the static versions a dedicated name? >>>>> "lxml-x.y.z-static" or smth like >>>>> that? Maybe not everyone wants static eggs (even not on OSX boxen)? >>>> I would want them to be the default download on MacOS (i.e. no special >>>> naming), but only after some testing by some MacOS users on >>>> differently >>>> configured systems, i.e. using different MacOS versions, different >>>> package >>>> distributions, different Python versions, etc. So maybe you could >>>> provide >>>> them separately first? >>> Yes sure -- you mean "separately" as in "not on pypi"? >> >> Yep, just build some and advertise them on lxml-dev asking for >> feedback. If >> they seem to work for everyone, we can upload them to PyPI. > > ... any news from the front? Martin Aspeli has indicated that they worked for him. I haven't had time to test it yet, but will hopefully get around to it this weekend. Don't let me hold you up, though ? PyPI eggs need testing too, right? :) -- Alexander Limi ? http://limi.net From optilude+lists at gmail.com Fri Jun 19 12:25:39 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Fri, 19 Jun 2009 18:25:39 +0800 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> Message-ID: Alexander Limi wrote: > Martin Aspeli has indicated that they worked for him. 
I haven't had time > to test it yet, but will hopefully get around to it this weekend. Don't > let me hold you up, though ? PyPI eggs need testing too, right? :) Yeah, they seem to work well for me. 2.2.1 on Python 2.4, OS X 10.5. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From sergio at sergiomb.no-ip.org Fri Jun 19 19:21:46 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Fri, 19 Jun 2009 18:21:46 +0100 Subject: [lxml-dev] why this works on fedora and don't work on debian Message-ID: <1245432106.5753.4.camel@segulix> Hi, I had install python lxml-2.2.1 on both machines $ python testing.py on fedora ['nuevologo.jpg', 'http://www.google.com/logos/Logo_25wht.gif', 'secciones/movil/promo.gif', 'img-noticias/AAA_279.JPG', 'img-noticias/bresuc_508.jpg', 'img-noticias/nacional_116.JPG', 'img-noticias/CONOMI1_84.JPG', 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', '../web/webfisico/paginas/01_220.jpg', 'digitales.jpg', 'bannermi.gif', 'http://pikis.net/banner/pikis_banner1_3.jpg'] on debian $ python testing.py ['nuevologo.jpg', 'http://www.google.com/logos/Logo_25wht.gif'] Note: the page have a double on fedora the two html are consider on debian no . but I need this works on debian What could I do ? -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: testing.py Type: text/x-python Size: 257 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090619/dce6c5bb/attachment-0001.py -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090619/dce6c5bb/attachment-0001.bin From ted at milo.com Fri Jun 19 19:40:19 2009 From: ted at milo.com (Ted Dziuba) Date: Fri, 19 Jun 2009 10:40:19 -0700 Subject: [lxml-dev] why this works on fedora and don't work on debian In-Reply-To: <1245432106.5753.4.camel@segulix> References: <1245432106.5753.4.camel@segulix> Message-ID: <6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> Are your versions of libxml and libxslt the same? Ted On Fri, Jun 19, 2009 at 10:21 AM, Sergio Monteiro Basto < sergio at sergiomb.no-ip.org> wrote: > Hi, > I had install python lxml-2.2.1 on both machines > > $ python testing.py on fedora ['nuevologo.jpg', > 'http://www.google.com/logos/Logo_25wht.gif', > 'secciones/movil/promo.gif', 'img-noticias/AAA_279.JPG', > 'img-noticias/bresuc_508.jpg', 'img-noticias/nacional_116.JPG', > 'img-noticias/CONOMI1_84.JPG', > 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', > '../web/webfisico/paginas/01_220.jpg', 'digitales.jpg', 'bannermi.gif', > 'http://pikis.net/banner/pikis_banner1_3.jpg'] > > on debian > $ python testing.py > ['nuevologo.jpg', 'http://www.google.com/logos/Logo_25wht.gif'] > > Note: the page have a double > on fedora the two html are consider > on debian no . > but I need this works on debian > > What could I do ? > > -- > S?rgio M. B. > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > -- Ted Dziuba Co-Founder and Engineer Milo.com, Inc. 165 University Avenue Palo Alto, CA, 94301 http://milo.com Cell: (609)-665-2639 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090619/dcac9235/attachment.htm From sergio at sergiomb.no-ip.org Fri Jun 19 20:35:59 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Fri, 19 Jun 2009 19:35:59 +0100 Subject: [lxml-dev] why this works on fedora and don't work on debian In-Reply-To: <6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> References: <1245432106.5753.4.camel@segulix> <6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> Message-ID: <1245436559.5753.6.camel@segulix> On Fri, 2009-06-19 at 10:40 -0700, Ted Dziuba wrote: > Are your versions of libxml and libxslt the same? libxml no /usr/lib/libxml2.so.2.7.3 /usr/lib/libxml2.so.2.6.32 but libxslt yes /usr/lib/libxslt.so.1.1.24 /usr/lib/libxslt.so.1.1.24 > > Ted > > On Fri, Jun 19, 2009 at 10:21 AM, Sergio Monteiro Basto > wrote: > Hi, > I had install python lxml-2.2.1 on both machines > > $ python testing.py on fedora ['nuevologo.jpg', > 'http://www.google.com/logos/Logo_25wht.gif', > 'secciones/movil/promo.gif', 'img-noticias/AAA_279.JPG', > 'img-noticias/bresuc_508.jpg', > 'img-noticias/nacional_116.JPG', > 'img-noticias/CONOMI1_84.JPG', > 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', > '../web/webfisico/paginas/01_220.jpg', 'digitales.jpg', > 'bannermi.gif', > 'http://pikis.net/banner/pikis_banner1_3.jpg'] > > on debian > $ python testing.py > ['nuevologo.jpg', > 'http://www.google.com/logos/Logo_25wht.gif'] > > Note: the page have a double > on fedora the two html are consider > on debian no . > but I need this works on debian > > What could I do ? > > -- > S?rgio M. B. > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > -- > Ted Dziuba > Co-Founder and Engineer > > Milo.com, Inc. > 165 University Avenue > Palo Alto, CA, 94301 > http://milo.com > > Cell: (609)-665-2639 > -- S?rgio M. B. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 2192 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090619/e3b6a653/attachment.bin

From ted at milo.com Fri Jun 19 21:52:22 2009
From: ted at milo.com (Ted Dziuba)
Date: Fri, 19 Jun 2009 12:52:22 -0700
Subject: [lxml-dev] why this works on fedora and don't work on debian
In-Reply-To: <1245436559.5753.6.camel@segulix>
References: <1245432106.5753.4.camel@segulix>
	<6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com>
	<1245436559.5753.6.camel@segulix>
Message-ID: <6451ccbf0906191252n14a1aebmf91f6b2999f34e42@mail.gmail.com>

It doesn't seem like there would be such a stark difference in
functionality across minor versions of libxml, but I suppose it's not out
of the question. Also check that the version you see in /usr/lib is
actually the version that lxml is loading, and that the compilation
versions match the runtime versions:

from lxml import etree

print "lxml.etree: ", etree.LXML_VERSION
print "libxml used: ", etree.LIBXML_VERSION
print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
print "libxslt used: ", etree.LIBXSLT_VERSION
print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION

On my workstation the above script outputs:

ted at gonzo:~$ python lxml_versions.py
lxml.etree: (2, 1, 2, 0)
libxml used: (2, 7, 2)
libxml compiled: (2, 7, 2)
libxslt used: (1, 1, 22)
libxslt compiled: (1, 1, 22)

Ted

On Fri, Jun 19, 2009 at 11:35 AM, Sergio Monteiro Basto <sergio at sergiomb.no-ip.org> wrote:
> On Fri, 2009-06-19 at 10:40 -0700, Ted Dziuba wrote:
> > Are your versions of libxml and libxslt the same?
> libxml no > /usr/lib/libxml2.so.2.7.3 > /usr/lib/libxml2.so.2.6.32 > > but libxslt yes > /usr/lib/libxslt.so.1.1.24 > /usr/lib/libxslt.so.1.1.24 > > > > > > Ted > > > > On Fri, Jun 19, 2009 at 10:21 AM, Sergio Monteiro Basto > > wrote: > > Hi, > > I had install python lxml-2.2.1 on both machines > > > > $ python testing.py on fedora ['nuevologo.jpg', > > 'http://www.google.com/logos/Logo_25wht.gif', > > 'secciones/movil/promo.gif', 'img-noticias/AAA_279.JPG', > > 'img-noticias/bresuc_508.jpg', > > 'img-noticias/nacional_116.JPG', > > 'img-noticias/CONOMI1_84.JPG', > > 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', > > '../web/webfisico/paginas/01_220.jpg', 'digitales.jpg', > > 'bannermi.gif', > > 'http://pikis.net/banner/pikis_banner1_3.jpg'] > > > > on debian > > $ python testing.py > > ['nuevologo.jpg', > > 'http://www.google.com/logos/Logo_25wht.gif'] > > > > Note: the page have a double > > on fedora the two html are consider > > on debian no . > > but I need this works on debian > > > > What could I do ? > > > > -- > > S?rgio M. B. > > > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > > > > > > -- > > Ted Dziuba > > Co-Founder and Engineer > > > > Milo.com, Inc. > > 165 University Avenue > > Palo Alto, CA, 94301 > > http://milo.com > > > > Cell: (609)-665-2639 > > > -- > S?rgio M. B. > -- Ted Dziuba Co-Founder and Engineer Milo.com, Inc. 165 University Avenue Palo Alto, CA, 94301 http://milo.com Cell: (609)-665-2639 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090619/d966078a/attachment.htm From jkrukoff at ltgc.com Fri Jun 19 22:18:26 2009 From: jkrukoff at ltgc.com (John Krukoff) Date: Fri, 19 Jun 2009 14:18:26 -0600 Subject: [lxml-dev] Converting an objectified lxml tree to a standard etree one. 
Message-ID: <1245442706.28204.32.camel@localhost.localdomain>

Is the best way to convert an objectified (with lxml.objectify) element
tree to a standard etree based one just to serialize and reparse? Is the
reverse transform just as hard?

As a use case, I've been experimenting with lxml.objectify, and quite
liking it. However, I've a large library of support functions that I want
to use that expect plain etree elements, and I'm trying to find an
efficient way to convert between the two. I care more about memory than
CPU time.

Additionally, there's something odd about the objectify module that
prevents help from working:

>>> help( objectify )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/site.py", line 430, in __call__
    return pydoc.help(*args, **kwds)
  File "/usr/lib/python2.6/pydoc.py", line 1720, in __call__
    self.help(request)
  File "/usr/lib/python2.6/pydoc.py", line 1766, in help
    else: doc(request, 'Help on %s:')
  File "/usr/lib/python2.6/pydoc.py", line 1508, in doc
    pager(render_doc(thing, title, forceload))
  File "/usr/lib/python2.6/pydoc.py", line 1503, in render_doc
    return title % desc + '\n\n' + text.document(object, name)
  File "/usr/lib/python2.6/pydoc.py", line 327, in document
    if inspect.ismodule(object): return self.docmodule(*args)
  File "/usr/lib/python2.6/pydoc.py", line 1086, in docmodule
    inspect.getclasstree(classlist, 1), name)]
  File "/usr/lib/python2.6/inspect.py", line 720, in getclasstree
    for parent in c.__bases__:
TypeError: 'lxml.objectify._ObjectifyElementMakerCaller' object is not iterable

--
John Krukoff
Land Title Guarantee Company

From sergio at sergiomb.no-ip.org Fri Jun 19 23:33:26 2009
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Fri, 19 Jun 2009 22:33:26 +0100
Subject: [lxml-dev] why this works on fedora and don't work on debian
In-Reply-To: <6451ccbf0906191252n14a1aebmf91f6b2999f34e42@mail.gmail.com>
References: <1245432106.5753.4.camel@segulix>
<6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> <1245436559.5753.6.camel@segulix> <6451ccbf0906191252n14a1aebmf91f6b2999f34e42@mail.gmail.com> Message-ID: <1245447206.5753.8.camel@segulix> Good: In [1]: from lxml import etree In [3]: print "lxml.etree: ", etree.LXML_VERSION ('lxml.etree: ', (2, 2, 1, 0)) In [5]: print "libxml used: ", etree.LIBXML_VERSION ('libxml used: ', (2, 7, 3)) In [6]: print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION ('libxml compiled: ', (2, 7, 3)) In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION ('libxslt used: ', (1, 1, 24)) In [8]: print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION ('libxslt compiled: ', (1, 1, 24)) bad: In [1]: from lxml import etree In [3]: print "lxml.etree: ", etree.LXML_VERSION lxml.etree: (2, 2, 1, 0) In [5]: print "libxml used: ", etree.LIBXML_VERSION libxml used: (2, 6, 32) In [6]: print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION libxml compiled: (2, 6, 32) In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION libxslt used: (1, 1, 24) In [8]: print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION libxslt compiled: (1, 1, 24) On Fri, 2009-06-19 at 12:52 -0700, Ted Dziuba wrote: > It doesn't seem like there would be such a stark difference in > functionality across minor versions of libxml, but I suppose it's not > out of the question. 
> > > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > > > > > > -- > > Ted Dziuba > > Co-Founder and Engineer > > > > Milo.com, Inc. > > 165 University Avenue > > Palo Alto, CA, 94301 > > http://milo.com > > > > Cell: (609)-665-2639 > > > > -- > S?rgio M. B. > > > > -- > Ted Dziuba > Co-Founder and Engineer > > Milo.com, Inc. > 165 University Avenue > Palo Alto, CA, 94301 > http://milo.com > > Cell: (609)-665-2639 > -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090619/d370df40/attachment-0001.bin From ted at milo.com Sat Jun 20 00:39:38 2009 From: ted at milo.com (Ted Dziuba) Date: Fri, 19 Jun 2009 15:39:38 -0700 Subject: [lxml-dev] why this works on fedora and don't work on debian In-Reply-To: <1245447206.5753.8.camel@segulix> References: <1245432106.5753.4.camel@segulix> <6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> <1245436559.5753.6.camel@segulix> <6451ccbf0906191252n14a1aebmf91f6b2999f34e42@mail.gmail.com> <1245447206.5753.8.camel@segulix> Message-ID: <6451ccbf0906191539o6ffc1f52rd4d055d223b1e645@mail.gmail.com> Try updating libxml on the bad. That's really the only difference that could cause the different behavior (I think). 
ted On Fri, Jun 19, 2009 at 2:33 PM, Sergio Monteiro Basto < sergio at sergiomb.no-ip.org> wrote: > Good: > In [1]: from lxml import etree > > In [3]: print "lxml.etree: ", etree.LXML_VERSION > ('lxml.etree: ', (2, 2, 1, 0)) > > In [5]: print "libxml used: ", etree.LIBXML_VERSION > ('libxml used: ', (2, 7, 3)) > > In [6]: print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION > ('libxml compiled: ', (2, 7, 3)) > > In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION > ('libxslt used: ', (1, 1, 24)) > > In [8]: print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION > ('libxslt compiled: ', (1, 1, 24)) > > bad: > In [1]: from lxml import etree > > In [3]: print "lxml.etree: ", etree.LXML_VERSION > lxml.etree: (2, 2, 1, 0) > > In [5]: print "libxml used: ", etree.LIBXML_VERSION > libxml used: (2, 6, 32) > > In [6]: print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION > libxml compiled: (2, 6, 32) > > In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION > libxslt used: (1, 1, 24) > > In [8]: print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION > libxslt compiled: (1, 1, 24) > > > On Fri, 2009-06-19 at 12:52 -0700, Ted Dziuba wrote: > > It doesn't seem like there would be such a stark difference in > > functionality across minor versions of libxml, but I suppose it's not > > out of the question. 
Also check that the version you see in /usr/lib > > is actually the version that lxml is loading, and that the compilation > > versions match the runtime versions: > > > > from lxml import etree > > > > print "lxml.etree: ", etree.LXML_VERSION > > > > print "libxml used: ", etree.LIBXML_VERSION > > print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION > > print "libxslt used: ", etree.LIBXSLT_VERSION > > print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION > > > > On my workstation the above script outputs: > > > > ted at gonzo:~$ python lxml_versions.py > > lxml.etree: (2, 1, 2, 0) > > libxml used: (2, 7, 2) > > libxml compiled: (2, 7, 2) > > libxslt used: (1, 1, 22) > > libxslt compiled: (1, 1, 22) > > > > Ted > > > > On Fri, Jun 19, 2009 at 11:35 AM, Sergio Monteiro Basto > > wrote: > > On Fri, 2009-06-19 at 10:40 -0700, Ted Dziuba wrote: > > > Are your versions of libxml and libxslt the same? > > > > libxml no > > /usr/lib/libxml2.so.2.7.3 > > /usr/lib/libxml2.so.2.6.32 > > > > but libxslt yes > > /usr/lib/libxslt.so.1.1.24 > > /usr/lib/libxslt.so.1.1.24 > > > > > > > > > > > > Ted > > > > > > On Fri, Jun 19, 2009 at 10:21 AM, Sergio Monteiro Basto > > > wrote: > > > Hi, > > > I had install python lxml-2.2.1 on both machines > > > > > > $ python testing.py on fedora ['nuevologo.jpg', > > > 'http://www.google.com/logos/Logo_25wht.gif', > > > 'secciones/movil/promo.gif', > > 'img-noticias/AAA_279.JPG', > > > 'img-noticias/bresuc_508.jpg', > > > 'img-noticias/nacional_116.JPG', > > > 'img-noticias/CONOMI1_84.JPG', > > > > > 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', > > > '../web/webfisico/paginas/01_220.jpg', > > 'digitales.jpg', > > > 'bannermi.gif', > > > 'http://pikis.net/banner/pikis_banner1_3.jpg'] > > > > > > on debian > > > $ python testing.py > > > ['nuevologo.jpg', > > > 'http://www.google.com/logos/Logo_25wht.gif'] > > > > > > Note: the page have a double > > > on fedora the two html are consider > > > on debian no . 
> > > but I need this works on debian > > > > > > What could I do ? > > > > > > -- > > > S?rgio M. B. > > > > > > > > > _______________________________________________ > > > lxml-dev mailing list > > > lxml-dev at codespeak.net > > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > > > > > > > > > > > -- > > > Ted Dziuba > > > Co-Founder and Engineer > > > > > > Milo.com, Inc. > > > 165 University Avenue > > > Palo Alto, CA, 94301 > > > http://milo.com > > > > > > Cell: (609)-665-2639 > > > > > > > -- > > S?rgio M. B. > > > > > > > > -- > > Ted Dziuba > > Co-Founder and Engineer > > > > Milo.com, Inc. > > 165 University Avenue > > Palo Alto, CA, 94301 > > http://milo.com > > > > Cell: (609)-665-2639 > > > -- > S?rgio M. B. > -- Ted Dziuba Co-Founder and Engineer Milo.com, Inc. 165 University Avenue Palo Alto, CA, 94301 http://milo.com Cell: (609)-665-2639 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090619/926826d3/attachment.htm From stefan_ml at behnel.de Sat Jun 20 07:51:23 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 20 Jun 2009 07:51:23 +0200 Subject: [lxml-dev] Converting an objectified lxml tree to a standard etree one. In-Reply-To: <1245442706.28204.32.camel@localhost.localdomain> References: <1245442706.28204.32.camel@localhost.localdomain> Message-ID: <4A3C78DB.5030100@behnel.de> Hi, John Krukoff wrote: > Is the best way to convert an objectified (with lxml.objectify) element > tree to a standard etree based one just to serialize and reparse? Is the > reverse transform just as hard? I would say so. 
The problem is that if I allow changing the element lookup
while the tree is alive in Python space (which would be required since you
need to pass at least one Element instance into lxml to request the
change), lxml would have to replace the proxies used inside the tree, which
would mean that all live proxies in the tree would become Zombies
(including the one you passed). That's rather dangerous.

Deep copying the tree and returning a root node from the new parser context
would be a solution if you need the tree in memory, which I assume is the
case here. But IIRC, there isn't currently a way to deep-copy the tree so
that it uses a new element lookup.

> I care more about memory than CPU time.

A serialised byte string is several times smaller in memory than the tree
itself, so I doubt that serialising and parsing would really hurt memory
consumption that much (if you take care to drop the original tree when it's
serialised). Do your benchmarks indicate that this is a problem? It *can*
be if the tree needs to be garbage collected due to reference cycles (e.g.
when you use ".attrib"). That might hold it in memory longer than necessary.

As a side-note, it would be possible to compress the memory buffer during
serialisation (see IDEAS.txt), but that's not trivial to implement and
would add a compile time dependency on zlib. It also wouldn't help much if
the next thing you do is parse the tree back into memory...

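A minimal sketch of the serialise-and-reparse round trip discussed above, not a recipe taken from this thread. It assumes the default lxml.etree parser has not been given a custom element class lookup; the helper names objectify_to_etree and etree_to_objectify are hypothetical.

```python
from lxml import etree, objectify


def objectify_to_etree(root):
    """Serialise an objectified tree and reparse it as a plain etree tree."""
    data = etree.tostring(root)           # compact byte string
    plain_parser = etree.XMLParser()      # fresh parser, default element lookup
    return etree.fromstring(data, plain_parser)


def etree_to_objectify(root):
    """Reverse direction: reparse the serialised tree with objectify."""
    return objectify.fromstring(etree.tostring(root))


obj_root = objectify.fromstring("<root><a>1</a><b>text</b></root>")
plain_root = objectify_to_etree(obj_root)
back_again = etree_to_objectify(plain_root)
```

To keep peak memory low, the original tree can be dropped (for example with `del obj_root`) once it has been serialised, as suggested above.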
> Additionally, there's something odd about the objectify module that
> prevents help from working:
>
> >>> help( objectify )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python2.6/site.py", line 430, in __call__
>     return pydoc.help(*args, **kwds)
>   File "/usr/lib/python2.6/pydoc.py", line 1720, in __call__
>     self.help(request)
>   File "/usr/lib/python2.6/pydoc.py", line 1766, in help
>     else: doc(request, 'Help on %s:')
>   File "/usr/lib/python2.6/pydoc.py", line 1508, in doc
>     pager(render_doc(thing, title, forceload))
>   File "/usr/lib/python2.6/pydoc.py", line 1503, in render_doc
>     return title % desc + '\n\n' + text.document(object, name)
>   File "/usr/lib/python2.6/pydoc.py", line 327, in document
>     if inspect.ismodule(object): return self.docmodule(*args)
>   File "/usr/lib/python2.6/pydoc.py", line 1086, in docmodule
>     inspect.getclasstree(classlist, 1), name)]
>   File "/usr/lib/python2.6/inspect.py", line 720, in getclasstree
>     for parent in c.__bases__:
> TypeError: 'lxml.objectify._ObjectifyElementMakerCaller' object is not
> iterable

Hmm, yes, that looks weird. It works with lxml.etree, but not with
lxml.objectify. Could you please file a bug report on this?

Stefan

From stefan_ml at behnel.de Sun Jun 21 09:57:56 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 21 Jun 2009 09:57:56 +0200
Subject: [lxml-dev] lxml 2.2.2 released
Message-ID: <4A3DE804.3010309@behnel.de>

Hi all,

I just released lxml 2.2.2 to PyPI. This is (mainly) a bug fix release that
contains a fix for a potential tree corruption in namespace handling during
tree manipulations. Since this can be triggered by input data (depending on
your code), you might want to upgrade if you do not control your input.

Changelog follows below, as usual.

Stefan

2.2.2 (2009-06-21)

Features added

* New helper functions strip_attributes(), strip_elements(), strip_tags()
  in lxml.etree to remove attributes/subtrees/tags from a subtree.

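A short sketch of how the three new helpers might be used, assuming lxml 2.2.2 or later. The sample markup is made up for illustration; with_tail=False is passed so that the text following a removed element is kept.

```python
from lxml import etree

root = etree.fromstring('<p class="x">Hello <b>big</b><i>bad</i> world</p>')

# Remove the "class" attribute from every element in the subtree.
etree.strip_attributes(root, "class")

# Remove the <b> tags but keep their text content in place.
etree.strip_tags(root, "b")

# Remove <i> elements together with their content; keep the tail text.
etree.strip_elements(root, "i", with_tail=False)

result = etree.tostring(root)
```

All three functions modify the tree in place and accept several names at once.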
Bugs fixed * Namespace cleanup on subtree insertions could result in missing namespace declarations (and potentially crashes) if the element defining a namespace was deleted and the namespace was not used by the top element of the inserted subtree but only in deeper subtrees. * Raising an exception from a parser target callback didn't always terminate the parser. * Only {true, false, 1, 0} are accepted as the lexical representation for BoolElement ({True, False, T, F, t, f} not any more), restoring lxml <= 2.0 behaviour. From limi at plone.org Mon Jun 22 20:46:53 2009 From: limi at plone.org (Alexander Limi) Date: Mon, 22 Jun 2009 11:46:53 -0700 Subject: [lxml-dev] Binary egg for Mac OS X References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> Message-ID: On Thu, 18 Jun 2009 22:30:51 -0700, Alexander Limi wrote: > Martin Aspeli has indicated that they worked for him. I haven't had time > to test it yet, but will hopefully get around to it this weekend. Don't > let me hold you up, though ? PyPI eggs need testing too, right? :) OK, I have tested the OS X binary egg, and it works for me too. Time to upload one for the recently-released lxml 2.2.2? :) -- Alexander Limi ? http://limi.net From jkrukoff at ltgc.com Mon Jun 22 20:57:50 2009 From: jkrukoff at ltgc.com (John Krukoff) Date: Mon, 22 Jun 2009 12:57:50 -0600 Subject: [lxml-dev] Converting an objectified lxml tree to a standard etree one. In-Reply-To: <4A3C78DB.5030100@behnel.de> References: <1245442706.28204.32.camel@localhost.localdomain> <4A3C78DB.5030100@behnel.de> Message-ID: <1245697070.23804.6.camel@localhost.localdomain> On Sat, 2009-06-20 at 07:51 +0200, Stefan Behnel wrote: > Hi, > > John Krukoff wrote: > > Is the best way to convert an objectified (with lxml.objectify) element > > tree to a standard etree based one just to serialize and reparse? 
Is the > > reverse transform just as hard? > > I would say so. The problem is that if I allow changing the element lookup > while the tree is alive in Python space (which would be required since you > need to pass at least one Element instance into lxml to request the > change), lxml would have to replace the proxies used inside the tree, which > would mean that all live proxies in the tree would become Zombies > (including the one you passed). That's rather dangerous. > > Deep copying the tree and returning a root node from the new parser context > would be a solution if you need the tree in memory, which I assume is the > case here. But IIRC, there isn't currently a way to deep-copy the tree so > that it uses a new element lookup. > Yeah, this sounds like what I'll hope for in the future to make this more efficient. I won't hold my breath though, and it looks like the worst performance hit I'm seeing in my test cases is the assert which checks the tree to make sure I'm not leaking any objectify elements unintentionally. Obviously not a real problem. :) [ snipped for length ] > Hmm, yes, that looks weird. It works with lxml.etree, but not with > lxml.objectify. Could you please file a bug report on this? > > Stefan Bug report filed. Side note, having to register for bug reports is always a pain. I don't think I've ever actually reused one of those logins. -- John Krukoff Land Title Guarantee Company From seasong at chantofwaves.com Tue Jun 23 02:53:34 2009 From: seasong at chantofwaves.com (Thomas Weigel) Date: Mon, 22 Jun 2009 19:53:34 -0500 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: References: Message-ID: <4A40278E.3020805@chantofwaves.com> I am using lxml to parse HTML documents, which include a custom namespace (for example, "

FRUIT

"). In lxml 2.2.0, on Windows, this worked just fine, and elements could be processed based on this data. In lxml 2.2.2, on Linux, this fails. The above example becomes "

FRUIT

" as soon as it is parsed by lxml.html (or lxml.etree.HTMLParser()). I don't know if this is caused by the switch to Linux, or the upgrade to 2.2.2. I don't have control over the installation, so I can't switch to 2.2.2 under Windows, or 2.2.0 under Linux to check. I did find this reference (the only reference to this I could find) to the HTML ignoring namespaces: http://codespeak.net/lxml/lxmlhtml.html#running-html-doctests ...however, it wasn't doing that before, and it seems odd that this is only mentioned in the doctests section. Is there a way to work around this? Are custom namespaces simply not possible in lxml's HTML? Notes: 1. The XML parser will not work. Some documents will have legal HTML that breaks an XML parser, like "
". 2. Here is the sample code: ----- >>> import lxml.html as parser >>> document = parser.fromstring("""Help!

My namespaces are going to disappear!

FRUIT

""") >>> print parser.tostring(document) ----- The output: ----- Help!

My namespaces are going to disappear!

FRUIT

----- Thomas Weigel From jholg at gmx.de Tue Jun 23 09:33:41 2009 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 23 Jun 2009 09:33:41 +0200 Subject: [lxml-dev] Converting an objectified lxml tree to a standard etree one. In-Reply-To: <1245697070.23804.6.camel@localhost.localdomain> References: <1245442706.28204.32.camel@localhost.localdomain> <4A3C78DB.5030100@behnel.de> <1245697070.23804.6.camel@localhost.localdomain> Message-ID: <20090623073341.69330@gmx.net> Hi, > [ snipped for length ] > > Hmm, yes, that looks weird. It works with lxml.etree, but not with > > lxml.objectify. Could you please file a bug report on this? This seems to be the villain: >>> for (name, obj) in objectify.__dict__.items(): ... if hasattr(obj, '__bases__'): ... try: ... i = iter(getattr(obj, '__bases__')) ... except: ... print name, obj, getattr(obj, '__bases__') ... E >>> Holger -- GRATIS f?r alle GMX-Mitglieder: Die maxdome Movie-FLAT! Jetzt freischalten unter http://portal.gmx.net/de/go/maxdome01 From david.antliff at gmail.com Tue Jun 23 11:37:32 2009 From: david.antliff at gmail.com (David Antliff) Date: Tue, 23 Jun 2009 21:37:32 +1200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A388EC0.6040501@behnel.de> Message-ID: On Wed, Jun 17, 2009 at 18:35, Stefan Behnel wrote: > David Antliff wrote: >> I am trying a slightly different approach - compiling entirely within >> Cygwin, using Cygwin's gcc. > > Right, that should be best anyway. > > >> What I have done is unpacked lxml-2.2.1.tgz.gz into lxml-2.2.1, then >> inside that directory I try: >> >> $ python setup.py build --static-deps >> >> This proceeds to download libxml2 and libxslt, unpack them, and build >> them. But it runs into numerous problems related to include/library >> paths. > > ... which you may be able to fix using appropriate CFLAGS/LDFLAGS. 
[snip] > I don't know if you install the libiconv developer package on your machine > (I hope that exists in Cygwin), because building against the shared lib > should work just fine here. Maybe you need to point gcc to the right > include and/or lib directory. I wonder why it didn't add "-liconv" > automatically... > > You can also try to go the ugly route and add "/usr/lib/libiconv.a" to your > CFLAGS, but I think you'll be happier with the shared lib. > > >> I wonder why setup.py didn't automatically download libiconv when it >> downloaded libxml2 and libxslt... hmm > > Because the static build was designed for MacOS-X, where only those two > libraries are a problem. The libiconv is binary compatible enough across > versions not to pose any major problems. So it's best to build dynamically > against libiconv. > > That said, it shouldn't be too hard to add code to also download libiconv > and build it. I would be happy to receive a patch that accepts an optional > list of library names for the --static-deps option, as in > "--static-deps=libxml2,libxslt,iconv", and would then download and build > all requested libraries. Although I doubt that it would make sense (or even > work) to pass only "--static-deps=iconv", so maybe a new option > "--build-iconv" would be better. Hi Stefan, I must admit I don't understand how setup.py, setupinfo.py and distutils all fits together. By watching the output of 'python setup.py build --static-deps' it's fairly clear to me what is needed to fix each step, but I can't work out where one sets extra CFLAGS or LDFLAGS in setup.py so that I can continue the process beyond each error. What would also help is if I knew what command I was meant to be using. The documentation suggests all sorts of options and I'm really not sure what I'm doing. Here's what I want to end up with: - a statically compiled python 'egg' of lxml that I can simply distribute with a python script and use in Cygwin. 
Here's what I currently have: - lxml-2.2.1 tar.gz unpacked - Cygwin *without* libxml2 or libxslt installed (I don't want my users to have to install these via Cygwin setup.exe). - gcc, ld, etc are all present As per a previous email, I tried: $ python setup.py build --static-deps But this has problems finding libiconv. You suggested I could use-liconv to fix this, but I can't work out where to actually put this in setup.py or setupinfo.py. $ ls /usr/lib/libic* /usr/lib/libiconv.a /usr/lib/libiconv.dll.a /usr/lib/libiconv.la However that's on my own machine - it turns out that nobody else has Cygwin's libiconv installed, so I'd like to *statically* link in libiconv too. Dynamically linking won't work in my case. I think the biggest problem I have is that having to run "python setup.py build" every time clears out everything. It would be good if I could get setup.py to simply print all the command lines it intends to use without actually running them... Any assistance you can give me would be greatly appreciated please. Regards, -- David. From stefan_ml at behnel.de Wed Jun 24 06:35:52 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Jun 2009 06:35:52 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> Message-ID: <4A41AD28.4010202@behnel.de> Hi, Alexander Limi wrote: > On Thu, 18 Jun 2009 22:30:51 -0700, Alexander Limi wrote: > >> Martin Aspeli has indicated that they worked for him. I haven't had time >> to test it yet, but will hopefully get around to it this weekend. Don't >> let me hold you up, though ? PyPI eggs need testing too, right? :) > > OK, I have tested the OS X binary egg, and it works for me too. Time to > upload one for the recently-released lxml 2.2.2? :) Well, let's go for it then. Stefan, could you please upload the egg to PyPI? 
Please also indicate which lib versions you used for building (the latest, I assume?). Stefan From stefan_ml at behnel.de Wed Jun 24 06:41:22 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Jun 2009 06:41:22 +0200 Subject: [lxml-dev] Using, or building, lxml in Windows with Cygwin In-Reply-To: References: <3cacf976a02541d97693e587746ce3c6.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A388EC0.6040501@behnel.de> Message-ID: <4A41AE72.6040804@behnel.de> Hi, David Antliff wrote: > I must admit I don't understand how setup.py, setupinfo.py and > distutils all fits together. By watching the output of 'python > setup.py build --static-deps' it's fairly clear to me what is needed > to fix each step, but I can't work out where one sets extra CFLAGS or > LDFLAGS in setup.py so that I can continue the process beyond each > error. You can pass them externally, as in CFLAGS="..." LDFLAGS="..." python setup.py --static-deps > What would also help is if I knew what command I was meant to be > using. The documentation suggests all sorts of options and I'm really > not sure what I'm doing. Here's what I want to end up with: > - a statically compiled python 'egg' of lxml that I can simply > distribute with a python script and use in Cygwin. > > Here's what I currently have: > - lxml-2.2.1 tar.gz unpacked > - Cygwin *without* libxml2 or libxslt installed (I don't want my > users to have to install these via Cygwin setup.exe). > - gcc, ld, etc are all present Fair enough, should be doable. > As per a previous email, I tried: > $ python setup.py build --static-deps > > But this has problems finding libiconv. You suggested I could > use-liconv to fix this, but I can't work out where to actually put > this in setup.py or setupinfo.py. 
> > $ ls /usr/lib/libic* > /usr/lib/libiconv.a /usr/lib/libiconv.dll.a /usr/lib/libiconv.la > > However that's on my own machine - it turns out that nobody else has > Cygwin's libiconv installed, so I'd like to *statically* link in > libiconv too. Dynamically linking won't work in my case. You can try passing LDFLAGS=/usr/lib/libiconv.a to setup.py. Not sure if that works, but it may. If the above doesn't work, you can take a look at buildlibxml.py. The function at the end does the library build. Adding your /usr/lib/libiconv.a to the "static_binaries" list at the end should make it work. Does that help? Stefan From cattafra at hotmail.com Wed Jun 24 12:29:30 2009 From: cattafra at hotmail.com (Francesco) Date: Wed, 24 Jun 2009 10:29:30 +0000 (UTC) Subject: [lxml-dev] =?utf-8?q?clean=5Fhtml?= Message-ID: I have written the following code: >>> from lxml.html.clean import clean_html >>> html = "?" >>> print clean_html(html)

??

I am wondering why I have an extra character (?) in my output. What should I do to avoid that? Thanks, Francesco From stefan_ml at behnel.de Wed Jun 24 14:10:16 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Jun 2009 14:10:16 +0200 (CEST) Subject: [lxml-dev] clean_html In-Reply-To: References: Message-ID: <3032207af8b3c41f215704b2f7297b3a.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Francesco wrote: > I have written the following code: > >>>> from lxml.html.clean import clean_html >>>> html = "??" Note that you are passing a byte string here. Without any encoding information, the HTML parser of libxml2 will fall back to the Latin-1 encoding. >>>> print clean_html(html) >

????

> > I am wondering why I have an extra character (??) in my output. > What should I do to avoid that? That's just because the serialised HTML output is encoded as UTF-8. If you want to print the resulting byte string, use .decode('UTF-8') to decode it to unicode first. If you want to write it to a file (or send it through the network), keeping it in UTF-8 is the right thing, though. Stefan From kevin.p.dwyer at gmail.com Wed Jun 24 14:10:49 2009 From: kevin.p.dwyer at gmail.com (Kev Dwyer) Date: Wed, 24 Jun 2009 13:10:49 +0100 Subject: [lxml-dev] clean_html In-Reply-To: References: Message-ID: <4d3439f90906240510k589f426bq1fdfde7f0eb5ca93@mail.gmail.com> Hello Francesco, For me the problem can be avoided by defining html as a unicode string: >>> html = u"?" >>> print clean_html(html)

?

I suspect this is only a problem if the encoding of the html string passed to clean_html is undefined, or incorrectly defined. Kevin 2009/6/24 Francesco > I have written the following code: > > >>> from lxml.html.clean import clean_html > >>> html = "?" > >>> print clean_html(html) >

??

> > I am wondering why I have an extra character (?) in my output. > What should I do to avoid that? > > Thanks, > > Francesco > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090624/66fc56fe/attachment.htm From cattafra at hotmail.com Wed Jun 24 14:46:40 2009 From: cattafra at hotmail.com (Francesco) Date: Wed, 24 Jun 2009 12:46:40 +0000 (UTC) Subject: [lxml-dev] =?utf-8?q?clean=5Fhtml?= References: Message-ID: Thank you very much for your answers! The html string is read from a file with: inputfile = "test.txt" # where test.txt contains "My site » Homepage" input = open(inputfile, "rb") html = input.read() How could I define the encoding for html? Thanks, Francesco From piet at cs.uu.nl Wed Jun 24 15:04:57 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Wed, 24 Jun 2009 15:04:57 +0200 Subject: [lxml-dev] clean_html In-Reply-To: References: Message-ID: <19010.9337.789099.101036@cochabamba.local> >>>>> Francesco (F) wrote: >F> Thank you very much for your answers! >F> The html string is read from a file with: >F> inputfile = "test.txt" >F> # where test.txt contains "My site » Homepage" >F> input = open(inputfile, "rb") >F> html = input.read() >F> How could I define the encoding for html? Do you know the encoding beforehand? If so, you could use codecs, and come up with a unicode string. import codecs input = codecs.open( inputfile, "r", "utf-8" ) Why do you use "rb"? 
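A minimal sketch of the suggestion above, assuming the file is known to be UTF-8. It uses io.open with an encoding argument as a stand-in for codecs.open, and writes its own temporary sample file, so the path and content are illustrative only.

```python
import io
import os
import tempfile

# Create a sample file that contains UTF-8 encoded text.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "wb") as handle:
    handle.write(u"My site \u00bb Homepage".encode("utf-8"))

# Option 1: read raw bytes ("rb") and decode them explicitly.
with io.open(path, "rb") as handle:
    raw = handle.read()
text_from_bytes = raw.decode("utf-8")

# Option 2: let the file object decode while reading.
with io.open(path, "r", encoding="utf-8") as handle:
    text_from_file = handle.read()
```

Both options end up with the same unicode string; the difference is only where the decoding step happens.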
-- 
Piet van Oostrum
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org

From stefan_ml at behnel.de Wed Jun 24 15:17:41 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 24 Jun 2009 15:17:41 +0200 (CEST)
Subject: [lxml-dev] clean_html
In-Reply-To:
References:
Message-ID: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Francesco wrote:
> Thank you very much for your answers!
>
> The html string is read from a file with:
> inputfile = "test.txt"
> # where test.txt contains "My site » Homepage"
> input = open(inputfile, "rb")
> html = input.read()
>
> How could I define the encoding for html?

*Iff* you know it beforehand, you can create a new parser and use the
parse() function:

    parser = lxml.html.HTMLParser(encoding='UTF-8')
    html_tree = lxml.html.parse(inputfile, parser=parser)

Most functions in lxml.html can deal with both tree and string input, and
will return the type you passed in. However, working with trees allows you
to control the parsing and serialisation more accurately, and avoids
redundant parse-serialise cycles if you want to run more than one operation
on the data.

Stefan

From stefan_ml at behnel.de Wed Jun 24 15:18:56 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 24 Jun 2009 15:18:56 +0200 (CEST)
Subject: [lxml-dev] clean_html
In-Reply-To: <19010.9337.789099.101036@cochabamba.local>
References: <19010.9337.789099.101036@cochabamba.local>
Message-ID:

Piet van Oostrum wrote:
>>>>>> Francesco (F) wrote:
>
>>F> Thank you very much for your answers!
>>F> The html string is read from a file with:
>>F> inputfile = "test.txt"
>>F> # where test.txt contains "My site » Homepage"
>>F> input = open(inputfile, "rb")
>>F> html = input.read()
>
> Why do you use "rb"?

Because the file contains byte encoded data.

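The extra character reported at the start of this thread is consistent with UTF-8 bytes being read as Latin-1. A small sketch of that effect, assuming the input really was UTF-8 (the thread does not confirm this):

```python
# The character U+00BB as stored in a UTF-8 encoded file: two bytes.
raw = u"\u00bb".encode("utf-8")

# Reading those bytes as Latin-1 yields two characters instead of one.
misread = raw.decode("latin-1")

# Decoding with the encoding that was actually used recovers the original.
correct = raw.decode("utf-8")

# Serialising the misread text as UTF-8 doubles the byte count.
reencoded = misread.encode("utf-8")
```

This is why telling the parser the real encoding, or passing it an already decoded unicode string, avoids the problem.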
Stefan From cattafra at hotmail.com Wed Jun 24 15:30:48 2009 From: cattafra at hotmail.com (Francesco) Date: Wed, 24 Jun 2009 13:30:48 +0000 (UTC) Subject: [lxml-dev] =?utf-8?q?clean=5Fhtml?= References: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: Thank you very much! I need now a way to find out the encoding of my data... Because it is a webpage there must be a way to extract that information... Should I look for something like charset=XXXXXXX? Is there a way to extract that info easily after a call to urlopen? html = urlopen(webpage).read() Thanks, Francesco From stefan_ml at behnel.de Wed Jun 24 15:45:36 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 24 Jun 2009 15:45:36 +0200 Subject: [lxml-dev] clean_html In-Reply-To: References: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <4A422E00.9060503@behnel.de> Francesco wrote: > Thank you very much! > > I need now a way to find out the encoding of my data... Because it is a webpage > there must be a way to extract that information... > > Should I look for something like charset=XXXXXXX? > > Is there a way to extract that info easily after a call to urlopen? > html = urlopen(webpage).read() The HTML parser knows about the meta/charset tags in HTML, so if the web page provides it, there is no need to override the parser encoding. Stefan From cattafra at hotmail.com Wed Jun 24 17:28:23 2009 From: cattafra at hotmail.com (Francesco) Date: Wed, 24 Jun 2009 15:28:23 +0000 (UTC) Subject: [lxml-dev] =?utf-8?q?clean=5Fhtml?= References: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A422E00.9060503@behnel.de> Message-ID: Thank you for your reply! What should I do if I want to save to disk my data? I am using etree.parse and also etree.xpath... but I always have encoding problems... Is it possible to set the encoding to the same that was used by the parser? 
Thanks, Francesco From cattafra at hotmail.com Wed Jun 24 18:27:19 2009 From: cattafra at hotmail.com (Francesco) Date: Wed, 24 Jun 2009 16:27:19 +0000 (UTC) Subject: [lxml-dev] XPath return values to file? Message-ID: How could I save to a file with the right encoding the results from a XPath call? My XPath data is in a list... and I have problems saving it to a file... Thanks, Francesco From herve.cauwelier at free.fr Wed Jun 24 19:26:40 2009 From: herve.cauwelier at free.fr (=?UTF-8?B?SGVydsOpIENhdXdlbGllcg==?=) Date: Wed, 24 Jun 2009 19:26:40 +0200 Subject: [lxml-dev] parsing and serializing XML fragments Message-ID: <4A4261D0.8080703@free.fr> Hi, I'm trying to load fragments of XML to inject them in an existing document tree. They look like this: (It's OpenDocument format.) Converting the fragment to the "{uri}name" syntax is not an option since I must remain agnostic to the XML parser. I would expect the XML() function to take an "nsmap" argument, like the xpath() method on elements, or parts of the API for subclassing elements. For now I have another template, a complete document with namespace declaration, and I inject my fragment using string formatting. Lxml will parse it and I extract the first child element. (I was using this technique with the libxml2 Python wrapper and I have seen it in the lxml.html source code for loading HTML fragments.) I have looked at custom elements and other resolving methods but lxml was raising a namespace error before my "print"'s show up. Another issue is to save the element back to its snippet form, for unit test validation. Lxml will produce a valid document with namespace declaration. Either how to serialize without namespace declaration or how to remove it while keeping prefixes? 
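A rough sketch of the wrapper technique described above: the fragment is placed inside a throwaway root element that declares the prefixes, and the first child is extracted after parsing. The function name parse_fragment, the wrapper element name and the sample fragment are hypothetical; only the OpenDocument text namespace URI is a real value.

```python
from lxml import etree

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def parse_fragment(fragment, nsmap):
    """Parse a prefixed XML fragment by wrapping it in a declaring root."""
    declarations = " ".join(
        'xmlns:%s="%s"' % (prefix, uri) for prefix, uri in sorted(nsmap.items())
    )
    wrapper = "<wrapper %s>%s</wrapper>" % (declarations, fragment)
    root = etree.fromstring(wrapper)
    return root[0]  # the fragment's own root element


element = parse_fragment(
    '<text:p text:style-name="Standard">Hello</text:p>',
    {"text": TEXT_NS},
)
```

The caller keeps working with prefixed snippets, and the "{uri}name" form only appears inside the parsed tree.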
Thanks in advance From sergio at sergiomb.no-ip.org Wed Jun 24 22:51:11 2009 From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto) Date: Wed, 24 Jun 2009 21:51:11 +0100 Subject: [lxml-dev] why this works on fedora and don't work on debian In-Reply-To: <6451ccbf0906191539o6ffc1f52rd4d055d223b1e645@mail.gmail.com> References: <1245432106.5753.4.camel@segulix> <6451ccbf0906191040m74544169hefc1dbf98700ec9@mail.gmail.com> <1245436559.5753.6.camel@segulix> <6451ccbf0906191252n14a1aebmf91f6b2999f34e42@mail.gmail.com> <1245447206.5753.8.camel@segulix> <6451ccbf0906191539o6ffc1f52rd4d055d223b1e645@mail.gmail.com> Message-ID: <1245876671.12777.14.camel@segulix> Hi, upgrading libxml to 2.7.3 fix the issue Big thanks , On Fri, 2009-06-19 at 15:39 -0700, Ted Dziuba wrote: > Try updating libxml on the bad. That's really the only difference > that could cause the different behavior (I think). > > ted > > On Fri, Jun 19, 2009 at 2:33 PM, Sergio Monteiro Basto > wrote: > Good: > In [1]: from lxml import etree > > In [3]: print "lxml.etree: ", etree.LXML_VERSION > ('lxml.etree: ', (2, 2, 1, 0)) > > In [5]: print "libxml used: ", etree.LIBXML_VERSION > ('libxml used: ', (2, 7, 3)) > > In [6]: print "libxml compiled: ", > etree.LIBXML_COMPILED_VERSION > ('libxml compiled: ', (2, 7, 3)) > > In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION > ('libxslt used: ', (1, 1, 24)) > > In [8]: print "libxslt compiled: ", > etree.LIBXSLT_COMPILED_VERSION > ('libxslt compiled: ', (1, 1, 24)) > > bad: > In [1]: from lxml import etree > > In [3]: print "lxml.etree: ", etree.LXML_VERSION > lxml.etree: (2, 2, 1, 0) > > In [5]: print "libxml used: ", etree.LIBXML_VERSION > libxml used: (2, 6, 32) > > In [6]: print "libxml compiled: ", > etree.LIBXML_COMPILED_VERSION > libxml compiled: (2, 6, 32) > > In [7]: print "libxslt used: ", etree.LIBXSLT_VERSION > libxslt used: (1, 1, 24) > > In [8]: print "libxslt compiled: ", > etree.LIBXSLT_COMPILED_VERSION > libxslt compiled: (1, 1, 24) > > > > On 
Fri, 2009-06-19 at 12:52 -0700, Ted Dziuba wrote: > > It doesn't seem like there would be such a stark difference > in > > functionality across minor versions of libxml, but I suppose > it's not > > out of the question. Also check that the version you see > in /usr/lib > > is actually the version that lxml is loading, and that the > compilation > > versions match the runtime versions: > > > > from lxml import etree > > > > print "lxml.etree: ", etree.LXML_VERSION > > > > print "libxml used: ", etree.LIBXML_VERSION > > print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION > > print "libxslt used: ", etree.LIBXSLT_VERSION > > print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION > > > > On my workstation the above script outputs: > > > > ted at gonzo:~$ python lxml_versions.py > > lxml.etree: (2, 1, 2, 0) > > libxml used: (2, 7, 2) > > libxml compiled: (2, 7, 2) > > libxslt used: (1, 1, 22) > > libxslt compiled: (1, 1, 22) > > > > Ted > > > > On Fri, Jun 19, 2009 at 11:35 AM, Sergio Monteiro Basto > > wrote: > > On Fri, 2009-06-19 at 10:40 -0700, Ted Dziuba wrote: > > > Are your versions of libxml and libxslt the same? 
> > > > libxml no > > /usr/lib/libxml2.so.2.7.3 > > /usr/lib/libxml2.so.2.6.32 > > > > but libxslt yes > > /usr/lib/libxslt.so.1.1.24 > > /usr/lib/libxslt.so.1.1.24 > > > > > > > > > > > > Ted > > > > > > On Fri, Jun 19, 2009 at 10:21 AM, Sergio Monteiro > Basto > > > wrote: > > > Hi, > > > I had install python lxml-2.2.1 on both > machines > > > > > > $ python testing.py on fedora > ['nuevologo.jpg', > > > > 'http://www.google.com/logos/Logo_25wht.gif', > > > 'secciones/movil/promo.gif', > > 'img-noticias/AAA_279.JPG', > > > 'img-noticias/bresuc_508.jpg', > > > 'img-noticias/nacional_116.JPG', > > > 'img-noticias/CONOMI1_84.JPG', > > > > > > 'http://www.tutiempo.net/imagenes_asociados/84x38/SVVG.png', > > > '../web/webfisico/paginas/01_220.jpg', > > 'digitales.jpg', > > > 'bannermi.gif', > > > > 'http://pikis.net/banner/pikis_banner1_3.jpg'] > > > > > > on debian > > > $ python testing.py > > > ['nuevologo.jpg', > > > > 'http://www.google.com/logos/Logo_25wht.gif'] > > > > > > Note: the page have a double > > > on fedora the two html are consider > > > on debian no . > > > but I need this works on debian > > > > > > What could I do ? > > > > > > -- > > > S?rgio M. B. > > > > > > > > > > _______________________________________________ > > > lxml-dev mailing list > > > lxml-dev at codespeak.net > > > > http://codespeak.net/mailman/listinfo/lxml-dev > > > > > > > > > > > > > > > -- > > > Ted Dziuba > > > Co-Founder and Engineer > > > > > > Milo.com, Inc. > > > 165 University Avenue > > > Palo Alto, CA, 94301 > > > http://milo.com > > > > > > Cell: (609)-665-2639 > > > > > > > -- > > S?rgio M. B. > > > > > > > > -- > > Ted Dziuba > > Co-Founder and Engineer > > > > Milo.com, Inc. > > 165 University Avenue > > Palo Alto, CA, 94301 > > http://milo.com > > > > Cell: (609)-665-2639 > > > > -- > S?rgio M. B. > > > > -- > Ted Dziuba > Co-Founder and Engineer > > Milo.com, Inc. 
> 165 University Avenue > Palo Alto, CA, 94301 > http://milo.com > > Cell: (609)-665-2639 > -- S?rgio M. B. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 2192 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090624/f42cd6a6/attachment-0001.bin From kris at cs.ucsb.edu Thu Jun 25 00:34:41 2009 From: kris at cs.ucsb.edu (kristian kvilekval) Date: Wed, 24 Jun 2009 15:34:41 -0700 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A41AD28.4010202@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> Message-ID: <1245882881.10362.239.camel@loup.ece.ucsb.edu> Is this available to the general public for testing? On Wed, 2009-06-24 at 06:35 +0200, Stefan Behnel wrote: > Hi, > > Alexander Limi wrote: > > On Thu, 18 Jun 2009 22:30:51 -0700, Alexander Limi wrote: > > > >> Martin Aspeli has indicated that they worked for him. I haven't had time > >> to test it yet, but will hopefully get around to it this weekend. Don't > >> let me hold you up, though ? PyPI eggs need testing too, right? :) > > > > OK, I have tested the OS X binary egg, and it works for me too. Time to > > upload one for the recently-released lxml 2.2.2? :) > > Well, let's go for it then. Stefan, could you please upload the egg to > PyPI? Please also indicate which lib versions you used for building (the > latest, I assume?). 
> > Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From limi at plone.org Thu Jun 25 01:29:53 2009 From: limi at plone.org (Alexander Limi) Date: Wed, 24 Jun 2009 16:29:53 -0700 Subject: [lxml-dev] Binary egg for Mac OS X References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> Message-ID: You can test the not-yet-official version of the Python 2.4-compatible egg from here: http://eletztrick.de/static/lxml/ (note that this location will disappear once the real egg is up on PyPI ??for testing only!) ? Alexander On Wed, 24 Jun 2009 15:34:41 -0700, kristian kvilekval wrote: > Is this available to the general public for testing? > > > On Wed, 2009-06-24 at 06:35 +0200, Stefan Behnel wrote: >> Hi, >> >> Alexander Limi wrote: >> > On Thu, 18 Jun 2009 22:30:51 -0700, Alexander Limi wrote: >> > >> >> Martin Aspeli has indicated that they worked for him. I haven't had >> time >> >> to test it yet, but will hopefully get around to it this weekend. >> Don't >> >> let me hold you up, though ? PyPI eggs need testing too, right? :) >> > >> > OK, I have tested the OS X binary egg, and it works for me too. Time >> to >> > upload one for the recently-released lxml 2.2.2? :) >> >> Well, let's go for it then. Stefan, could you please upload the egg to >> PyPI? Please also indicate which lib versions you used for building (the >> latest, I assume?). 
>> >> Stefan >> _______________________________________________ >> lxml-dev mailing list >> lxml-dev at codespeak.net >> http://codespeak.net/mailman/listinfo/lxml-dev > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -- Alexander Limi ? http://limi.net From stefan_ml at behnel.de Thu Jun 25 08:36:53 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Jun 2009 08:36:53 +0200 Subject: [lxml-dev] clean_html In-Reply-To: References: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A422E00.9060503@behnel.de> Message-ID: <4A431B05.2050902@behnel.de> Francesco wrote: > What should I do if I want to save to disk my data? > > I am using etree.parse and also etree.xpath... XPath is not related to parsing or serialising. > but I always have encoding problems... On parsing or on serialising? And what kind of problem? > Is it possible to set the encoding to the same that was used by the > parser? You can check if the ElementTree object returned by parse() has something useful in its ".docinfo.encoding" property. But why not use UTF-8 on output in general? Stefan From stefan_ml at behnel.de Thu Jun 25 08:39:21 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Jun 2009 08:39:21 +0200 Subject: [lxml-dev] XPath return values to file? In-Reply-To: References: Message-ID: <4A431B99.4040505@behnel.de> Francesco wrote: > How could I save to a file with the right encoding the results from a XPath call? > > My XPath data is in a list... and I have problems saving it to a file... How do you want to save it? As an HTML document? As a list of fragments? Please try to describe what you are trying to achieve, that may allow us to come up with a solution. 
Stefan From stefan_ml at behnel.de Thu Jun 25 09:14:19 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 25 Jun 2009 09:14:19 +0200 Subject: [lxml-dev] parsing and serializing XML fragments In-Reply-To: <4A4261D0.8080703@free.fr> References: <4A4261D0.8080703@free.fr> Message-ID: <4A4323CB.4000107@behnel.de> Hi, Hervé Cauwelier wrote: > Hi, I'm trying to load fragments of XML to inject them in an existing > document tree. > > They look like this: > > Just curious: why do you create a document in which you can do string replacements? > Converting the fragment to the "{uri}name" syntax is not an option since > I must remain agnostic to the XML parser. That can't be parsed by lxml's parser either. > I would expect the XML() function to take an "nsmap" argument, like the > xpath() method on elements, or parts of the API for subclassing elements. No, it's a namespace aware XML parser, so it will reject documents that are not namespace well-formed. > For now I have another template, a complete document with namespace > declaration, and I inject my fragment using string formatting. Lxml will > parse it and I extract the first child element. You can use the feed parser instead and do

    parser = etree.XMLParser(...)
    parser.feed('')
    parser.feed(the_fragment)
    parser.feed('')
    fragment = parser.close()[0]

Feed parsers are reusable after a call to close(), BTW. > I have looked at custom elements and other resolving methods but lxml > was raising a namespace error before my "print"'s show up. Element proxies are created /after/ parsing, so this won't help. > Another issue is to save the element back to its snippet form, for unit > test validation. Lxml will produce a valid document with namespace > declaration. Either how to serialize without namespace declaration or > how to remove it while keeping prefixes? I put a lot of work into preventing lxml from serialising broken documents, sorry.
I also doubt that lxml's doctest support can help you here, as it also requires parsing. But you can insert your document into a new root Element that defines all used namespaces (either fixed or collected at runtime), serialise that, and strip the root element from top and bottom of the serialised byte string. I do a bit of string mangling in lxml's own doctests to make them work in both Py2 and Py3, it's not that hard to add these things. You can write a little wrapper class around lxml.etree and override tostring() and parse() to fit your needs (kudos to Fredrik for making them functions, BTW). Here's an example: http://codespeak.net/svn/lxml/trunk/doc/api.txt Stefan From herve.cauwelier at free.fr Thu Jun 25 11:29:07 2009 From: herve.cauwelier at free.fr (Hervé Cauwelier) Date: Thu, 25 Jun 2009 11:29:07 +0200 Subject: [lxml-dev] parsing and serializing XML fragments In-Reply-To: <4A4323CB.4000107@behnel.de> References: <4A4261D0.8080703@free.fr> <4A4323CB.4000107@behnel.de> Message-ID: <4A434363.7040306@free.fr> Stefan Behnel wrote: > Hi, > > Hervé Cauwelier wrote: >> Hi, I'm trying to load fragments of XML to inject them in an existing >> document tree. >> >> They look like this: >> >> > > Just curious: why do you create a document in which you can do string > replacements? I have templates for paragraphs, lists, tables, etc. You just provide its name, style, text contents, whatever applies. > But you can insert your document into a new root Element that defines all > used namespaces (either fixed or collected at runtime), serialise that, and > strip the root element from top and bottom of the serialised byte string. I > do a bit of string mangling in lxml's own doctests to make them work in > both Py2 and Py3, it's not that hard to add these things. You can write a > little wrapper class around lxml.etree and override tostring() and parse() > to fit your needs (kudos to Fredrik for making them functions, BTW).
> > Here's an example: > > http://codespeak.net/svn/lxml/trunk/doc/api.txt Thanks, I'll look deeper into the guts of lxml. From stefan_ml at behnel.de Fri Jun 26 08:55:48 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Jun 2009 08:55:48 +0200 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: <4A40278E.3020805@chantofwaves.com> References: <4A40278E.3020805@chantofwaves.com> Message-ID: <4A4470F4.7040708@behnel.de> Hi, Thomas Weigel wrote: > I am using lxml to parse HTML documents, which include a custom > namespace (for example, "

FRUIT

"). > > In lxml 2.2.0, on Windows, this worked just fine, and elements could be > processed based on this data. > > In lxml 2.2.2, on Linux, this fails. The above example becomes "

content='fruit'>FRUIT

" as soon as it is parsed by lxml.html (or > lxml.etree.HTMLParser()). You forgot to mention which versions of libxml2 you are using on both systems. That's likely the reason for the difference. http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do Stefan From cattafra at hotmail.com Fri Jun 26 11:48:57 2009 From: cattafra at hotmail.com (Francesco) Date: Fri, 26 Jun 2009 09:48:57 +0000 (UTC) Subject: [lxml-dev] =?utf-8?q?clean=5Fhtml?= References: <03965a2dcb750acaf571d8de1742cd3c.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <4A422E00.9060503@behnel.de> <4A431B05.2050902@behnel.de> Message-ID: Thank you for your answer... I will try the ".docinfo.encoding" property. How could I use UTF-8 on output in general? I have tried output.write(unicode(result)) and output.write(result.encode('utf-8')). With the first I got "UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 17: ordinal not in range(128)" while with the second the extra character "?" before "?". result is u'La Repubblica.it \xbb Homepage' Thanks, Francesco From cattafra at hotmail.com Fri Jun 26 13:23:39 2009 From: cattafra at hotmail.com (Francesco) Date: Fri, 26 Jun 2009 11:23:39 +0000 (UTC) Subject: [lxml-dev] XML file and XPath Message-ID: What is the best way to load an XML file, then query it using XPath and keeping the same encoding? Thanks, Francesco From stefan_ml at behnel.de Fri Jun 26 13:30:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Jun 2009 13:30:25 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: References: Message-ID: <4A44B151.3000101@behnel.de> Francesco wrote: > What is the best way to load an XML file, then query it using XPath and keeping > the same encoding? I'm sure you can infer the first two parts from the tutorial and the rest of the documentation. Regarding the last bit: did you follow my advice to look at the .docinfo.encoding property of the ElementTree object? 
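For the archive, a minimal sketch of the three steps asked about here (load, query with XPath, check the declared encoding). The sample document is made up, and parsing from an in-memory byte string rather than a file name is only an assumption to keep the example self-contained:

```python
import io
from lxml import etree

# Made-up sample input, stored as ISO-8859-1 bytes like the page in this thread.
xml_bytes = (
    u'<?xml version="1.0" encoding="iso-8859-1"?>'
    u'<page><title>La Repubblica.it \xbb Homepage</title></page>'
).encode('iso-8859-1')

# etree.parse() accepts a file name or a file-like object; BytesIO stands in
# for the real file here.
tree = etree.parse(io.BytesIO(xml_bytes))

# The encoding as declared by the parsed document.
declared_encoding = tree.docinfo.encoding

# XPath text results come back as Unicode strings, whatever the input encoding.
titles = tree.xpath('//title/text()')

# An encoding is only chosen again when writing the result out.
output_bytes = u'\n'.join(titles).encode('utf-8')
```

The declared encoding is informational at this point: the tree itself no longer depends on it, so the output encoding can be picked freely.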
Stefan From cattafra at hotmail.com Fri Jun 26 14:43:47 2009 From: cattafra at hotmail.com (Francesco) Date: Fri, 26 Jun 2009 12:43:47 +0000 (UTC) Subject: [lxml-dev] XML file and XPath References: <4A44B151.3000101@behnel.de> Message-ID: Hi Stefan, thank you for your replay... Yes, I followed your advice and I was able to save files with the right encoding... Thank you very much! Here is my code: " input = open(fileNameXML, 'rb') output = open(fileNameXPath, 'wb') parser = etree.XMLParser() xml_tree = etree.parse(input, parser) document_encoding = xml_tree.docinfo.encoding print(document_encoding) xpath_query = xml_tree.xpath('//text()') output.write(result) output.close() " and the Error: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' in position 17: ordinal not in range(128)" xpath_query contains: ['\n ', '\n ', u'La Repubblica.it \xbb Homepage', '\n ', '\n ', '\n ', '\n ', '\n\n\n\t', '\r\n\t\r\n\t'] document_encoding: iso-8859-1 What should I do to use XPath with the right encoding? Thanks, Francesco From stefan_ml at behnel.de Fri Jun 26 16:10:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 26 Jun 2009 16:10:42 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: References: <4A44B151.3000101@behnel.de> Message-ID: <4A44D6E2.3030605@behnel.de> Hi, Francesco wrote: > Here is my code: > " > input = open(fileNameXML, 'rb') > output = open(fileNameXPath, 'wb') > > parser = etree.XMLParser() > xml_tree = etree.parse(input, parser) Creating a parser here is redundant. Also, it's better to pass the filename than to pass an open file, i.e. do input = etree.parse(fileNameXML) > document_encoding = xml_tree.docinfo.encoding > > print(document_encoding) > > xpath_query = xml_tree.xpath('//text()') > > output.write(result) What's "result" here? 
> and the Error: "UnicodeEncodeError: 'ascii' codec can't encode character u'\xbb' > in position 17: ordinal not in range(128)" > > xpath_query contains: > ['\n ', '\n ', u'La Repubblica.it \xbb Homepage', '\n ', '\n ', '\n > ', '\n ', '\n\n\n\t', '\r\n\t\r\n\t'] > document_encoding: > iso-8859-1 > > What should I do to use XPath with the right encoding? Seems to me like the XPath results are perfectly ok. Am I right in guessing that you want to write out the XPath result as a plain text file? Then what about collect_text_content = etree.XPath('string()') text_content = collect_text_content(xml_tree) output = codecs.open(fileNameXPath, 'wb', document_encoding) output.write(text_content) output.close() Stefan From cattafra at hotmail.com Fri Jun 26 23:10:00 2009 From: cattafra at hotmail.com (Francesco) Date: Fri, 26 Jun 2009 21:10:00 +0000 (UTC) Subject: [lxml-dev] XML file and XPath References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> Message-ID: Thank you for your reply. Yes, I want to write out the XPath result as a plain text file... I have followed your advice but I still get an encoding error message: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 9199: ordinal not in range(256) Can you set the encoding to XPath? Thanks, Francesco From seasong at chantofwaves.com Fri Jun 26 23:48:01 2009 From: seasong at chantofwaves.com (Thomas Weigel) Date: Fri, 26 Jun 2009 16:48:01 -0500 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: <4A4470F4.7040708@behnel.de> References: <4A40278E.3020805@chantofwaves.com> <4A4470F4.7040708@behnel.de> Message-ID: <4A454211.1080003@chantofwaves.com> Hello, Stefan Behnel wrote: > Thomas Weigel wrote: >> I am using lxml to parse HTML documents, which include a custom >> namespace (for example, "

FRUIT

"). > > You forgot to mention which versions of libxml2 you are using on both > systems. That's likely the reason for the difference. Thank you for being kind. > http://codespeak.net/lxml/FAQ.html#i-think-i-have-found-a-bug-in-lxml-what-should-i-do I have begun investigating down this path. I will not bother you again until I have finished there. In the meantime, I am working around the problem with a regular expression to replace 'custom_namespace:' with 'custom_namespace_', depending on whether or not lxml deletes the custom namespace. Thank you for your time. Thomas Weigel From stefan_ml at behnel.de Sat Jun 27 07:08:48 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 27 Jun 2009 07:08:48 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> Message-ID: <4A45A960.1090404@behnel.de> Francesco wrote: > Thank you for your reply. Yes, I want to write out the XPath result as a plain > text file... > > I have followed your advice but I still get an encoding error message: > UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position > 9199: ordinal not in range(256) Please provide everything that is needed to reproduce this without major guesswork. This usually implies a small script and some input data that makes the code fail. This may turn out to be an interesting read: http://www.catb.org/~esr/faqs/smart-questions.html And this surely is: http://www.amk.ca/python/howto/unicode > Can you set the encoding to XPath? No, that wouldn't make sense. XPath works on Unicode, so there is no encoding involved. Stefan From stefan_ml at behnel.de Sat Jun 27 07:23:10 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 27 Jun 2009 07:23:10 +0200 Subject: [lxml-dev] lxml.html, now with ignored namespaces! 
In-Reply-To: <4A40278E.3020805@chantofwaves.com> References: <4A40278E.3020805@chantofwaves.com> Message-ID: <4A45ACBE.40107@behnel.de> Hi, I actually didn't read up to your example, sorry. Thomas Weigel wrote: > I am using lxml to parse HTML documents, which include a custom > namespace (for example, "

FRUIT

"). > > Notes: > > 1. The XML parser will not work. Some documents will have legal HTML > that breaks an XML parser, like "
". > > 2. Here is the sample code: > > ----- > >>> import lxml.html as parser > >>> document = parser.fromstring(""" XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" > xmlns:cs="http://something.com/cs" xml:lang="en" > lang="en">Help!

My namespaces are > going to disappear!

FRUIT

""") > >>> print parser.tostring(document) > ----- That's an XHTML document, for which the XML parser would be the right tool. If you have XHTML documents that contain unterminated
tags, they are not well-formed, and thus simply not XML, i.e. not XHTML. But you could try creating a custom XMLParser with the "recover" option, which will try to keep parsing despite errors. There's no guarantee that it won't kick out some data that it failed to parse, though, as usual when parsing broken documents. Obviously, the best way to deal with this kind of problem is fixing the input documents. > The output: > ----- > cs="http://something.com/cs" xml:lang="en" > lang="en">Help!

My namespaces are > going to disappear!

FRUIT

> ----- That's because HTML parsers are not namespace aware. Namespaces are simply not defined for HTML. But if you get a difference on different systems, I'd still suspect the reason to be different libxml2 versions. There's nothing lxml can do about this. Stefan From stefan_ml at behnel.de Sat Jun 27 08:57:52 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 27 Jun 2009 08:57:52 +0200 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> Message-ID: <4A45C2F0.5030802@behnel.de> Alexander Limi wrote: > You can test the not-yet-official version of the Python 2.4-compatible egg > from here: > > http://eletztrick.de/static/lxml/ I've uploaded the 2.2.1 egg to PyPI for now. Let's see what becomes of it. Stefan From kris at cs.ucsb.edu Sat Jun 27 09:20:34 2009 From: kris at cs.ucsb.edu (kristian kvilekval) Date: Sat, 27 Jun 2009 00:20:34 -0700 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A45C2F0.5030802@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> Message-ID: <1246087234.3475.28.camel@krispc.sd.cox.net> Would it be possible upload a python 2.5 version as well. Recent Macs come with python 2.5. Thanks, Kris On Sat, 2009-06-27 at 08:57 +0200, Stefan Behnel wrote: > Alexander Limi wrote: > > You can test the not-yet-official version of the Python 2.4-compatible egg > > from here: > > > > http://eletztrick.de/static/lxml/ > > I've uploaded the 2.2.1 egg to PyPI for now. Let's see what becomes of it. 
> > Stefan > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From cattafra at hotmail.com Sat Jun 27 09:47:27 2009 From: cattafra at hotmail.com (Francesco) Date: Sat, 27 Jun 2009 07:47:27 +0000 (UTC) Subject: [lxml-dev] XML file and XPath References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> <4A45A960.1090404@behnel.de> Message-ID: Stefan, it is really interesting what you have just written: "XPath works on Unicode" Should I then transform the input for XPath from another encoding into Unicode? Thanks, Francesco From piet at cs.uu.nl Sat Jun 27 10:19:16 2009 From: piet at cs.uu.nl (Piet van Oostrum) Date: Sat, 27 Jun 2009 10:19:16 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> Message-ID: <19013.54788.502590.70749@cochabamba.cs.uu.nl> >>>>> Francesco (F) wrote: >F> Thank you for your reply. Yes, I want to write out the XPath >F> result as a plain text file... >F> I have followed your advice but I still get an encoding error >F> message: UnicodeEncodeError: 'latin-1' codec can't encode >F> character u'\u2014' in position 9199: ordinal not in range(256) Unicode character \u2014 (which is an em-dash) cannot be represented in latin-1 (which goes only up to \u00ff). So you will have to use another encoding for your output file, e.g. utf-8. In general utf-8 is the best encoding unless you know that you have a restricted character set and further processing expects another encoding. 
There are two other options: - replace the em-dash with a hyphen or some such - use a Microsoft encoding like windows-1252 -- Piet van Oostrum URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: piet at vanoostrum.org From cattafra at hotmail.com Sat Jun 27 10:43:21 2009 From: cattafra at hotmail.com (Francesco) Date: Sat, 27 Jun 2009 08:43:21 +0000 (UTC) Subject: [lxml-dev] XML file and XPath References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> <19013.54788.502590.70749@cochabamba.cs.uu.nl> Message-ID: Piet, thank you very much! I have just changed the output encoding to utf-8 and it works! I thought I have done something like that before... I am changing to many little things... Thank you very much! Ciao, Francesco From optilude+lists at gmail.com Sat Jun 27 12:35:37 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Sat, 27 Jun 2009 18:35:37 +0800 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <4A45C2F0.5030802@behnel.de> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> Message-ID: Stefan Behnel wrote: > Alexander Limi wrote: >> You can test the not-yet-official version of the Python 2.4-compatible egg >> from here: >> >> http://eletztrick.de/static/lxml/ > > I've uploaded the 2.2.1 egg to PyPI for now. Let's see what becomes of it. Just tested it, and it works great. :) Any chance of a 2.2.2 egg? :) Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. 
See http://martinaspeli.net/plone-book From optilude+lists at gmail.com Sat Jun 27 13:33:28 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Sat, 27 Jun 2009 19:33:28 +0800 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> Message-ID: Martin Aspeli wrote: > Stefan Behnel wrote: >> Alexander Limi wrote: >>> You can test the not-yet-official version of the Python 2.4-compatible egg >>> from here: >>> >>> http://eletztrick.de/static/lxml/ >> I've uploaded the 2.2.1 egg to PyPI for now. Let's see what becomes of it. > > Just tested it, and it works great. :) Erm... seems I spoke too soon. I thought it was working, but somehow my setuptools is building an lxml-2.2.1-py2.4-macosx-10.3-i386.egg in preference over lxml-2.2.1-py2.4-macosx-10.5-i386.egg, even though I'm on OSX 10.5. Any ideas why setuptools might be doing this? Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From optilude+lists at gmail.com Sat Jun 27 15:40:19 2009 From: optilude+lists at gmail.com (Martin Aspeli) Date: Sat, 27 Jun 2009 21:40:19 +0800 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> Message-ID: <4A462143.5080907@gmail.com> Martin Aspeli wrote: > Erm... seems I spoke too soon.
I thought it was working, but somehow my > setuptools is building a lxml-2.2.1-py2.4-macosx-10.3-i386.egg in > preference over lxml-2.2.1-py2.4-macosx-10.5-i386.egg, even though I'm > on OSX 10.5. Okay. So it seems there were two problems. 1) Buildout tried to build an lxml 10.3 egg instead of using the binary one. That was a local issue. I fixed it like this: - Upgrade MacPorts' python to 2.4.6 (possibly unnecessary) - Re-run boostrap.py in the buildout - Remove the lxml egg from my eggs cache - Remove the lxml tgz download from the dist directory of my download cache - Re-running buildout 2) The egg now installs, but it appears to be broken. The problem is that the egg is somehow 'nested'. In my eggs cache, I have: $ ls lxml-2.2.1-py2.4-macosx-10.5-i386.egg/ lxml-2.2.1-py2.4-macosx-10.5-i386.egg/ $ ls lxml-2.2.1-py2.4-macosx-10.5-i386.egg/lxml-2.2.1-py2.4-macosx-10.5-i386.eggEGG-INFO/ lxml/ If I fix the egg manually, it works fine. Martin -- Author of `Professional Plone Development`, a book for developers who want to work with Plone. See http://martinaspeli.net/plone-book From stefan_ml at behnel.de Sat Jun 27 16:45:42 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 27 Jun 2009 16:45:42 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> <4A45A960.1090404@behnel.de> Message-ID: <4A463096.3090607@behnel.de> Francesco wrote: > it is really interesting what you have just written: > "XPath works on Unicode" > > Should I then transform the input for XPath from another encoding into Unicode? The XML parser does that for you. 
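To illustrate with a minimal sketch (the sample bytes are invented for the example): the parser decodes the declared ISO-8859-1 input on its own, XPath then hands back Unicode text, and an encoding only matters again when the result is written out.

```python
from lxml import etree

# Invented Latin-1 input: one directly encoded character (e-acute) and one
# character reference (&#8212;, U+2014) that Latin-1 itself cannot represent.
data = (
    u'<?xml version="1.0" encoding="iso-8859-1"?>'
    u'<p>caf\xe9 &#8212; bar</p>'
).encode('iso-8859-1')

root = etree.fromstring(data)   # the parser decodes the bytes for us
text = root.xpath('string()')   # Unicode text, no encoding attached

# Encoding back to the input encoding can fail, because the character
# reference produced a character outside Latin-1.
try:
    text.encode('iso-8859-1')
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# UTF-8 can represent every character, so this always succeeds.
utf8_bytes = text.encode('utf-8')
```

So there is nothing to convert before calling XPath; the only decision left is which encoding to use for the output file.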
Stefan From stefan_ml at behnel.de Sat Jun 27 17:03:25 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 27 Jun 2009 17:03:25 +0200 Subject: [lxml-dev] XML file and XPath In-Reply-To: <19013.54788.502590.70749@cochabamba.cs.uu.nl> References: <4A44B151.3000101@behnel.de> <4A44D6E2.3030605@behnel.de> <19013.54788.502590.70749@cochabamba.cs.uu.nl> Message-ID: <4A4634BD.4000004@behnel.de> Piet van Oostrum wrote: >>>>>> Francesco wrote: > >> F> Thank you for your reply. Yes, I want to write out the XPath >> F> result as a plain text file... > >> F> I have followed your advice but I still get an encoding error >> F> message: UnicodeEncodeError: 'latin-1' codec can't encode >> F> character u'\u2014' in position 9199: ordinal not in range(256) > > Unicode character \u2014 (which is an em-dash) cannot be represented > in latin-1 (which goes only up to \u00ff). So you will have to use > another encoding for your output file, e.g. utf-8. In general utf-8 is > the best encoding unless you know that you have a restricted character > set and further processing expects another encoding. True. The fact that the original document was encoded as Latin-1 does not mean that Latin-1 can represent all characters used in the document. Character references (Ѣ) or external entities can result in non-encodable characters appearing in the text content. So UTF-8 is the safest choice (if you know that the tools used to read the text file afterwards can handle it). Stefan From seasong at chantofwaves.com Sun Jun 28 23:09:22 2009 From: seasong at chantofwaves.com (Thomas Weigel) Date: Sun, 28 Jun 2009 16:09:22 -0500 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: References: Message-ID: <4A47DC02.3040804@chantofwaves.com> Stefan Behnel wrote: > That's an XHTML document, for which the XML parser would be the right tool. Sadly, not every page will be an XHTML document. 
Nor will every page be created by someone like me, an individual who loves XHTML and strictness. I apologize for giving the impression that my users might be sane and decent. > If you have XHTML documents that contain unterminated
tags, they are > not well-formed, and thus simply not XML, i.e. not XHTML. I will have HTML 4 Loose and HTML5 documents that contain unterminated
tags, among others. > Obviously, the best way to deal with this kind of problem is fixing the > input documents. Sadly, not possible. I mean, it would be nice. It surely would. But no. >> ----- >> > cs="http://something.com/cs" xml:lang="en" >> lang="en">Help!

My namespaces are >> going to disappear!

FRUIT

>> ----- > > That's because HTML parsers are not namespace aware. Namespaces are simply > not defined for HTML. But if you get a difference on different systems, I'd > still suspect the reason to be different libxml2 versions. There's nothing > lxml can do about this. Yes, I gathered that from the previous reply. There's not much I can do about it, either, since I won't be in control of the specific libxml2 installation. Currently, I have a small unit test built in that checks the parser for eliminating namespaces or not. If the parser eliminates the namespace, I replace all "cs:something" attributes with "cs_something" attributes. It's far from ideal, but it at least works. Again, thank you. Thomas Weigel From stefan_ml at behnel.de Mon Jun 29 08:36:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 29 Jun 2009 08:36:01 +0200 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: <4A47DC02.3040804@chantofwaves.com> References: <4A47DC02.3040804@chantofwaves.com> Message-ID: <4A4860D1.3040409@behnel.de> Hi, Thomas Weigel wrote: > Stefan Behnel wrote: >> That's an XHTML document, for which the XML parser would be the right tool. > > Sadly, not every page will be an XHTML document. Nor will every page be > created by someone like me, an individual who loves XHTML and > strictness. I apologize for giving the impression that my users might be > sane and decent. > >> If you have XHTML documents that contain unterminated
tags, they are
>> not well-formed, and thus simply not XML, i.e. not XHTML.
>
> I will have HTML 4 Loose and HTML5 documents that contain unterminated >
tags, among others.

Well, that's not XHTML then, though, and both aren't that hard to distinguish even before parsing. What about running the XML parser on the document first, and only fall back to the HTML parser if the XML parser fails? Parsing should be fast enough to just go and pay it twice for the increase in convenience that you get. If you parse from a (byte?) string, you could also just check if the XHTML namespace appears in the input data or if the data starts with an XML declaration ("

Hi,

On http://codespeak.net/lxml/capi.html#writing-external-modules-in-cython ,
I believe there is a typo in the second example section:

DefaultElementClassLookup

should be

etree.ElementDefaultClassLookup

Which is the way it is written on http://codespeak.net/lxml/element_classes.html#setting-up-a-class-lookup-scheme.

Thanks.

--
Elliott Slaughter

"Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090629/f32383de/attachment.htm

From stefan_ml at behnel.de Tue Jun 30 12:38:41 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 30 Jun 2009 12:38:41 +0200
Subject: [lxml-dev] Writing external modules in Cython
In-Reply-To: <42c0ab790906291614o7de16b15r5be8311270bedd90@mail.gmail.com>
References: <42c0ab790906291614o7de16b15r5be8311270bedd90@mail.gmail.com>
Message-ID: <4A49EB31.5090500@behnel.de>

Elliott Slaughter wrote:
> On http://codespeak.net/lxml/capi.html#writing-external-modules-in-cython ,
> I believe there is a typo in the second example section:
>
> DefaultElementClassLookup
>
> should be
>
> etree.ElementDefaultClassLookup

Ah, yes. Thanks for catching this. That page isn't backed by doctests since it's using Cython code.

Stefan
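
Going back to the "lxml.html, now with ignored namespaces!" reply above: the idea of checking a byte string for the XHTML namespace or a leading XML declaration before picking a parser might be sketched roughly as below. This is only an illustration, not code from the thread. The function name looks_like_xhtml is made up, tolerating a byte order mark and leading whitespace is an assumption, and the check only works for ASCII-compatible encodings (so not UTF-16).

```python
import re

# The namespace URI that XHTML documents declare on their root element.
XHTML_NAMESPACE = b"http://www.w3.org/1999/xhtml"

# Assumption: tolerate an optional UTF-8 byte order mark and leading
# whitespace before the declaration. Requiring whitespace after "<?xml"
# keeps processing instructions such as "<?xml-stylesheet" from matching.
_XML_DECLARATION = re.compile(rb"^(?:\xef\xbb\xbf)?\s*<\?xml\s")


def looks_like_xhtml(data: bytes) -> bool:
    """Guess from the raw bytes whether the XML parser is worth trying.

    Hypothetical helper: returns True if the data starts with an XML
    declaration or mentions the XHTML namespace anywhere.
    """
    if _XML_DECLARATION.match(data):
        return True
    return XHTML_NAMESPACE in data
```

The result would presumably only decide which parser to try first (for example lxml.etree.fromstring versus lxml.html.fromstring). A document that passes this check can still fail to be well-formed, so falling back to the HTML parser on a parse error would still be needed.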