From matt.barto at gmail.com  Thu Jul  1 22:01:39 2010
From: matt.barto at gmail.com (matt barto)
Date: Thu, 1 Jul 2010 13:01:39 -0700
Subject: [lxml-dev] html.fromstring returning encoded string from a non unicoded string source
Message-ID:

Hello,

I am trying to obtain a title from a website which has a Unicode tm and
register mark, but the Unicode behavior is not what I expect.

    tree = html.fromstring("Apple&#174; - iPad&#153; with Wi-Fi - 16GB")
    print tree.text_content()  -------->  Apple? - iPad? with Wi-Fi - 16GB

I would expect the output from the print to be "Apple® - iPad™ with Wi-Fi -
16GB", but it seems some encoding occurred during the tree creation. The
browser is able to handle these html characters correctly using the
"iso-8859-1" char set (even though the document says "utf-8").

Can you provide some insight into how these two html tags are handled and
what I can do to get the expected behavior? My lxml version is 2.2.2.

Thanks in advance.

Best,
Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100701/15bfe9b9/attachment.htm

From eugene.vandenbulke at gmail.com  Fri Jul  2 10:49:45 2010
From: eugene.vandenbulke at gmail.com (Eugene Van den Bulke)
Date: Fri, 2 Jul 2010 18:49:45 +1000
Subject: [lxml-dev] regexp
Message-ID:

Hi,

I am experimenting with web scraping using lxml.

I have played a little with BeautifulSoup in the past and scrapy recently.

I am recoding something I did with scrapy with lxml but encounter a
problem I am not sure how to iron out.

With scrapy, hxs is an xpath selector which has a select and re method

types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')

Which will return a list of the matches in href.

How would I do the same thing with lxml?

types = doc.xpath('.//a[@href]/@href') ...

Thanks a lot,

--
EuGeNe -- I lend my books on COlivri
http://www.colivri.org/user/eugene, do you?

From stefan_ml at behnel.de  Fri Jul  2 11:42:29 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 Jul 2010 11:42:29 +0200
Subject: [lxml-dev] regexp
In-Reply-To:
References:
Message-ID: <4C2DB485.2080504@behnel.de>

Eugene Van den Bulke, 02.07.2010 10:49:
> I am experimenting with web scraping using lxml.
>
> I have played a little with BeautifulSoup in the past and scrapy recently.
>
> I am recoding something I did with scrapy with lxml but encounter a
> problem I am not sure how to iron out.
>
> With scrapy, hxs is an xpath selector which has a select and re method
>
> types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')
>
> Which will return a list of the matches in href.
>
> How would I do the same thing with lxml?
>
> types = doc.xpath('.//a[@href]/@href') ...

http://lmgtfy.com/?q=lxml+regular+expressions&l=1

;-)

Stefan

From stefan_ml at behnel.de  Fri Jul  2 12:13:56 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 Jul 2010 12:13:56 +0200
Subject: [lxml-dev] regexp
In-Reply-To:
References: <4C2DB485.2080504@behnel.de>
Message-ID: <4C2DBBE4.8080807@behnel.de>

Eugene Van den Bulke, 02.07.2010 11:53:
> Stefan Behnel, 02.07.2010 11:42:
>> Eugene Van den Bulke, 02.07.2010 10:49:
>>> I am experimenting with web scraping using lxml.
>>>
>>> I have played a little with BeautifulSoup in the past and scrapy
>>> recently.
>>>
>>> I am recoding something I did with scrapy with lxml but encounter a
>>> problem I am not sure how to iron out.
>>>
>>> With scrapy, hxs is an xpath selector which has a select and re method
>>>
>>> types = hxs.select('.//a[@href]/@href').re(r'type=([A-Z]*)')
>>>
>>> Which will return a list of the matches in href.
>>>
>>> How would I do the same thing with lxml?
>>>
>>> types = doc.xpath('.//a[@href]/@href') ...

Note that this is redundant, './/a/@href' is enough.

>> http://lmgtfy.com/?q=lxml+regular+expressions&l=1
>
> I did read the doc before I took the liberty to post ... I am afraid I
> just don't get it.

Personally, I wouldn't even use XPath regular expressions here. I'd rather
do something like this:

    from lxml import html
    import re

    parse_type_value = re.compile(r'type=([A-Z]*)').findall

    root = html.parse(the_file).getroot()

    for el, attr, link, pos in root.iterlinks():
        if 'type=' in link:
            print el.tag, parse_type_value(link)

Note that this will give you all links, not only those in href's.
If you really only want those, the XPath expression above will do just fine.

Stefan

From eugene.vandenbulke at gmail.com  Fri Jul  2 12:18:38 2010
From: eugene.vandenbulke at gmail.com (Eugene Van den Bulke)
Date: Fri, 2 Jul 2010 20:18:38 +1000
Subject: [lxml-dev] regexp
In-Reply-To: <4C2DBBE4.8080807@behnel.de>
References: <4C2DB485.2080504@behnel.de> <4C2DBBE4.8080807@behnel.de>
Message-ID:

On Fri, Jul 2, 2010 at 8:13 PM, Stefan Behnel wrote:
>>>> types = doc.xpath('.//a[@href]/@href') ...
>
> Note that this is redundant, './/a/@href' is enough.

I am discovering XPath as well as you can tell :P

> Personally, I wouldn't even use XPath regular expressions here. I'd rather
> do something like this:
>
>    from lxml import html
>    import re
>
>    parse_type_value = re.compile(r'type=([A-Z]*)').findall
>
>    root = html.parse(the_file).getroot()
>
>    for el, attr, link, pos in root.iterlinks():
>        if 'type=' in link:
>            print el.tag, parse_type_value(link)
>
> Note that this will give you all links, not only those in href's. If you
> really only want those, the XPath expression above will do just fine.
>
> Stefan

Thanks !

--
EuGeNe -- I lend my books on COlivri
http://www.colivri.org/user/eugene, do you?

From paul.girard at sciences-po.fr  Fri Jul  2 12:20:06 2010
From: paul.girard at sciences-po.fr (Paul Girard)
Date: Fri, 02 Jul 2010 12:20:06 +0200
Subject: [lxml-dev] mac os x installation process
Message-ID: <4C2DBD56.9020800@sciences-po.fr>

Hi dear people of lxml,

I wrote a small lib in Python using lxml to generate graph files in a
specific format, gexf. This little thing is called pygexf:

http://packages.python.org/pygexf/
http://github.com/paulgirard/pygexf

First, thanks for your great work on lxml, I am loving it!

Second, I am experiencing rude probs with installing lxml on my mac os x.
I read that this hasn't been completely ported yet, but still some
workarounds with static libs seem to have worked for some of us.
I will only focus on the method:

STATIC_DEPS=true sudo easy_install lxml

I am having probs with gcc. I have 3 different versions of gcc:

gcc-4: coming from fink install gcc42
gcc-4.0
gcc-4.2: both coming from Xcode mac os x tools

note: I am changing the link gcc to the different versions to test all of
them

Now here are the different errors I have with the various gcc versions:

gcc-4

$ls -la /usr/bin/gcc
lrwxr-xr-x 1 root wheel 13 2 jul 11:43 /usr/bin/gcc -> /sw/bin/gcc-4
$STATIC_DEPS=true sudo easy_install lxml
searching for lxml
[...]
Building against libxml2/libxslt in the following directory: /usr/lib
gcc: unrecognized option '-no-cpp-precomp'
cc1: erreur: option "-mno-fused-madd" de la ligne de commande non reconnue
cc1: erreur: option "-arch" de la ligne de commande non reconnue
cc1: erreur: option "-arch" de la ligne de commande non reconnue
cc1: erreur: option "-Wno-long-double" de la ligne de commande non reconnue
error: Setup script exited with error: command 'gcc' failed with exit status 1

gcc-4.2

$ STATIC_DEPS=true sudo easy_install lxml
Searching for lxml
[...]
Using build configuration of libxslt 1.1.12
Building against libxml2/libxslt in the following directory: /usr/lib
cc1: error: unrecognized command line option "-Wno-long-double"
cc1: error: unrecognized command line option "-Wno-long-double"
lipo: can't open input file: /var/tmp//ccDn5F16.out (No such file or directory)
error: Setup script exited with error: command 'gcc' failed with exit status 1

The error list of gcc-4.0 is huge. I'll not post it here but I could if
necessary.

So here I am facing building problems. I am not an expert in that kind of
thing. I am usually developing on linux (ubuntu) but many users of my small
lib (including me) are mac users. I don't really want to change my code to
use another xml lib so I hope I'll finally find a way.
If anyone can help on this issue it'd be more than great,

thanks for reading me

Paul

ps: I couldn't try the darwin port method, I couldn't understand how to use
it...

--
Paul Girard
responsable numérique médialab
paul.girard at sciences-po.fr
01 45 49 63 58
médialab | Sciences Po
medialab.sciences-po.fr
13 rue de l'université
75007 PARIS
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100702/81d7d8c0/attachment-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sc-Po-Medialab-Gris.jpg
Type: image/jpeg
Size: 7276 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100702/81d7d8c0/attachment-0001.jpg

From ab at rdprojekt.pl  Fri Jul  2 17:47:22 2010
From: ab at rdprojekt.pl (Adam Bielański)
Date: Fri, 02 Jul 2010 17:47:22 +0200
Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation
Message-ID: <4C2E0A0A.40009@rdprojekt.pl>

Hi,

Can anyone help me with validating XML file against internal DTD with
iterparse()? I just can't make iterparse() use dtd_validation flag.

I ended up with two execution paths, one that uses etree.XMLParser and
finds all errors in validated file and another one which uses iterparse and
just prints all elements from XML file. Below is the code I used, and
attached are sample XML and DTD files I used.

Just to make it clear: I'm using lxml in version 2.2.4 on Python 2.6.5.

The XmlWithDTDStream class (mentioned in code below) behaves like a stream,
returning XML declaration (first line of well-formed XML file), DTD string
and then - rest of XML file. It works correctly, since you can see that
XMLParser returns correct errors and iterparse returns actual elements from
input.xml.


from lxml import etree

if __name__ == "__main__":
    print "XMLParser"
    with open("internalSchema.dtd") as dtdFile:
        dtd = dtdFile.read()
        stream = XmlWithDTDStream(dtd, 'input.xml')
        parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
        try:
            root = etree.XML(stream.read(32768), parser)
        except Exception, e:
            print "An exception:: ", e
            print "Error log:: ", parser.error_log

    print "iterparse: "
    with open("internalSchema.dtd") as dtdFile:
        dtd = dtdFile.read()
        stream = XmlWithDTDStream(dtd, 'input.xml')
        for aTuple in etree.iterparse(stream, dtd_validation=True,
                                      load_dtd=True):
            print aTuple

The result is then as follows:

XMLParser
An exception::  No declaration for attribute badAttribute of element pName, line 23, column 26
Error log::  :23:26:ERROR:VALID:DTD_UNKNOWN_ATTRIBUTE: No declaration for attribute badAttribute of element pName
:25:7:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element property--badName
:26:5:ERROR:VALID:DTD_CONTENT_MODEL: Element propertyGroup content does not follow the DTD, expecting (property)+, got (property--badName )
:29:29:ERROR:VALID:DTD_UNKNOWN_ELEM: No declaration for element name

iterparse:
(u'end',)
(u'end',)
(u'end',)
(u'end',)
(u'end',)
(u'end',)
(u'end',)
(u'end',)

Hope someone could assist me, or state that this is a bug in lxml. My issue
comes from the fact that I need to validate files that are too large to be
read into memory using etree.XML(). If there is a way I could do it without
iterparse - I'd gladly learn it.

Best regards,
Adam.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: input.xml
Type: text/xml
Size: 418 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100702/4da2ccdd/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: internalSchema.dtd
Type: application/xml-dtd
Size: 602 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100702/4da2ccdd/attachment-0001.bin

From stefan_ml at behnel.de  Fri Jul  2 17:57:54 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 Jul 2010 17:57:54 +0200
Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation
In-Reply-To: <4C2E0A0A.40009@rdprojekt.pl>
References: <4C2E0A0A.40009@rdprojekt.pl>
Message-ID: <4C2E0C82.7020808@behnel.de>

Adam Bielański, 02.07.2010 17:47:
> Can anyone help me with validating XML file against internal DTD with
> iterparse()? I just can't make iterparse() use dtd_validation flag.
> [...]
>
>
> ...

To reference a DTD, your XML document needs a DOCTYPE declaration.

http://xmlsoft.org/xmldtd.html

Once that's in the document, the "dtd_validation" flag should work.
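
A minimal sketch of the suggestion above, using a made-up two-element
document (`root` and `item` are illustrative names, not the poster's files).
The DTD sits in the DOCTYPE as an internal subset and the parser is created
with dtd_validation=True. The iterparse() call only shows that the same
keyword is accepted there; whether it reports validity errors is the open
question in this thread, so the sketch checks errors on the XMLParser path
only.

```python
from io import BytesIO

from lxml import etree

# Hypothetical data: the DTD is embedded in the DOCTYPE (internal subset).
DTD_HEADER = b"""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (item+)>
  <!ELEMENT item (#PCDATA)>
]>
"""
VALID_DOC = DTD_HEADER + b"<root><item>ok</item></root>"
INVALID_DOC = DTD_HEADER + b"<root><bogus/></root>"


def validate_in_memory(data):
    """Parse with DTD validation on; return a list of error messages."""
    parser = etree.XMLParser(dtd_validation=True)
    try:
        etree.fromstring(data, parser)
    except etree.XMLSyntaxError as exc:
        return [str(exc)]
    return []


def iter_tags(data):
    """Stream the document with iterparse(), passing the same flag."""
    events = etree.iterparse(BytesIO(data), dtd_validation=True)
    return [element.tag for _event, element in events]


print(validate_in_memory(VALID_DOC))    # no errors expected
print(validate_in_memory(INVALID_DOC))  # at least one validity error
print(iter_tags(VALID_DOC))
```

The in-memory helper reads the whole document at once, so it is only a
reference point for what a streaming validator would need to report.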

Stefan

From ab at rdprojekt.pl  Fri Jul  2 19:26:26 2010
From: ab at rdprojekt.pl (Adam Bielański)
Date: Fri, 02 Jul 2010 10:26:26 -0700
Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation
In-Reply-To: <4C2E0C82.7020808@behnel.de>
References: <4C2E0A0A.40009@rdprojekt.pl> <4C2E0C82.7020808@behnel.de>
Message-ID:

On Fri, 02 Jul 2010 17:57:54 +0200, Stefan Behnel wrote:
> Adam Bielański, 02.07.2010 17:47:
>> Can anyone help me with validating XML file against internal DTD with
>> iterparse()? I just can't make iterparse() use dtd_validation flag.
> > [...]
> >
> >
> > ...
>
> To reference a DTD, your XML document needs a DOCTYPE declaration.
>
> http://xmlsoft.org/xmldtd.html
>
> Once that's in the document, the "dtd_validation" flag should work.
>
> Stefan

Well, my problem is that I need to validate XML with *internal* DTD (one
that is embedded in file). It's because XML files come from untrusted
parties and I can't rely on them providing correct DTD declaration, so I'm
putting my DTD into XML file with this custom stream class mentioned in my
previous post. It works when I use that stream output and do
etree.XML(stream.read(), etree.XMLParser(dtd_validation=True)), that's why
I still believe that there is a way to make it work also with iterparse().

In other words:

I need to make a program that takes a potentially large XML file from
user, embeds known and constant (not provided by user) DTD into it and
feeds the result (XML with embedded DTD) to lxml validator without putting
whole XML tree into memory. The part that combines DTD with XML already
exists as a stream and works fine with etree.XML(). Now I want to have
incremental validation (to save memory), so I wanted to use
etree.iterparse(). Does lxml design allows it, is there some special
'catch' (like - DTD has to be surrounded with something else than just
) to make it work, or wouldn't it work at all this way?

Regards,
Adam.

From stefan_ml at behnel.de  Fri Jul  2 20:37:00 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 02 Jul 2010 20:37:00 +0200
Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation
In-Reply-To:
References: <4C2E0A0A.40009@rdprojekt.pl> <4C2E0C82.7020808@behnel.de>
Message-ID: <4C2E31CC.1070205@behnel.de>

Adam Bielański, 02.07.2010 19:26:
> Well, my problem is that I need to validate XML with *internal* DTD (one
> that is embedded in file).

I guess you mean "external" here, i.e. not the internal subset that is
included in the XML document but a DTD in an external file. In this case,
even one that is not referenced by the document itself.

> I need to make a program that takes a potentially large XML file from
> user, embeds known and constant (not provided by user) DTD into it and
> feeds the result (XML with embedded DTD) to lxml validator without putting
> whole XML tree into memory. The part that combines DTD with XML already
> exists as a stream and works fine with etree.XML(). Now I want to have
> incremental validation (to save memory), so I wanted to use
> etree.iterparse(). Does lxml design allows it, is there some special
> 'catch' (like - DTD has to be surrounded with something else than just
> ) to make it work, or wouldn't it work at all this way?

There isn't currently a way to provide an unrelated DTD to the parser, but
you can push your DTD through "trang" to generate an XML Schema from it,
which you can then pass into the parser.

http://www.thaiopensource.com/relaxng/trang.html

Stefan

From piet at cs.uu.nl  Sat Jul  3 05:05:33 2010
From: piet at cs.uu.nl (Piet van Oostrum)
Date: Fri, 2 Jul 2010 23:05:33 -0400
Subject: [lxml-dev] html.fromstring returning encoded string from a non unicoded string source
In-Reply-To:
References:
Message-ID: <19502.43261.171205.402707@cochabamba.vanoostrum.org>

>>>>> matt barto (mb) wrote:

> Hello,
> I am trying to obtain a title from a website which has a Unicode tm and
> register mark, but the Unicode behavior is not what I expect.
> tree = html.fromstring("Apple&#174; - iPad&#153; with Wi-Fi - 16GB")
> print tree.text_content() -------->
> Apple? - iPad? with Wi-Fi - 16GB
> I would expect the output from the print to be "Apple® - iPad™ with
> Wi-Fi - 16GB", but it seems some encoding occurred during the tree creation.
> The browser is able to handle these html characters correctly using the
> "iso-8859-1" char set (even though the document says "utf-8").
> Can you provide some insight into how these two html tags are handled and
> what I can do to get the expected behavior? My lxml version is 2.2.2.

&#174; is the same as ®. 174 is the Unicode code point for ® and this has no
relation to the character set used (it is a common misunderstanding that it
is the code point in the current encoding but it is not).

&#153; isn't a valid Unicode code point, but apparently it is accepted
because this is the Windows-1252 code and these are often used by mistake
as if they are Unicode code points. In that case it is the code for ™. But
the official code for trademark is &#8482;

In HTML there is no difference between the numeric code and the
corresponding character, therefore the parser parses the code to get the
Unicode character. On output you can get the entity code back by using
tostring with the default encoding (=ASCII). Note that this will also
output numeric entity codes for non-ASCII characters that were in the
string as characters rather than as numeric entity codes.

>>> html.tostring(tree[0][0])
'Apple® - iPad™ with Wi-Fi - 16GB'

or if you want only the text, use the string encode method:

>>> tree[0][0].text.encode('ascii', 'xmlcharrefreplace')
'Apple® - iPad™ with Wi-Fi - 16GB'
>>> print(tree.text_content().encode('ascii', 'xmlcharrefreplace'))
Apple® - iPad™ with Wi-Fi - 16GB

--
Piet van Oostrum
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
Nu Fair Trade woonwaar op http://www.zylja.com

From p.oberndoerfer at urheberrecht.org  Sat Jul  3 17:59:59 2010
From: p.oberndoerfer at urheberrecht.org ("Pascal Oberndörfer")
Date: Sat, 3 Jul 2010 17:59:59 +0200
Subject: [lxml-dev] lxml eggs for Mac OS X 10.4 (maybe 10.5?)
Message-ID:

All,

Don't know if these eggs are of help to anybody else on "older" Macs
(i.e. 10.4 on ppc and i386):

I needed lxml on such machines, so here they are at your own risk!

The 2.2.6 eggs should statically link to:

-libxml2-2.7.7
-libxslt-1.1.26
-libiconv-1.13.1
-zlib-1.2.5

Only tested with 'selftest.py' and 'selftest2.py'. All tests pass. But
further testing is certainly needed!

Anybody with PyPI rights, feel free to upload them, as I cannot guarantee
the above link will remain forever ;-)

Thanks.

Pascal

From egnor at ofb.net  Mon Jul  5 02:20:01 2010
From: egnor at ofb.net (Dan Egnor)
Date: Sun, 4 Jul 2010 17:20:01 -0700
Subject: [lxml-dev] Handling charset properly in lxml.html.parse() and friends
In-Reply-To:
References:
Message-ID:

I'm not sure if this is a bug, a missing feature, maybe something that
should be discussed in the documentation/FAQ more clearly, or maybe I'm
just a moron and missed something obvious, but:

Handling charsets in HTML is always challenging. Charset information may
come from an HTTP Content-Type header (if the HTML comes via HTTP, which
often it does), from an http-equiv meta tag in the file itself, from some
other external assertion, or it may need to be guessed from the content.
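
Of the charset sources listed above, the HTTP header is the easiest one to
handle with the standard library alone. A minimal sketch, not part of lxml;
the function name is made up for illustration:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The result can then be used as one candidate among the others (meta tag,
guessing), in whatever order of trust the application prefers.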

The lxml documentation gives this advice at one point:

    Similarly, you will get errors when you try the same with HTML data in
    a unicode string that specifies a charset in a meta tag of the header.
    You should generally avoid converting XML/HTML data to unicode before
    passing it into the parsers. It is both slower and error prone.

However, if you take that advice, then I don't know any way to supply
external charset information to lxml.html.parse() (or
lxml.html.document_fromstring() and so on). These functions also don't seem
to do much in the way of charset guessing. The *html5lib* parser has a
"guess_charset" flag, but the ordinary HTML parser apparently does not.

I have resorted to calling chardet myself and converting the input bytes to
Unicode using whatever charset I get from chardet, and passing Unicode to
lxml.html.document_fromstring(), in explicit contravention of the advice
quoted above. This ignores HTTP headers (which I do have access to in my
application, which is a content transforming proxy) as well as meta tags in
the document, which is a bummer.

Ideally, the HTML parsing functions would accept a byte string along with
any charset assertions you might have access to, and then do the right thing
(using your assertions, looking for meta tags, and applying chardet, in some
order). _Really_ ideally it would even parse Content-type header values
(since it will need to do so to understand the meta tag anyway). Maybe
there's a way to get it to do something like this, but if so I couldn't
figure it out.

Are there currently known best practices? Searching around the forums, I
find other puzzled users, but no crisp answers. I'm happy to help however I
can.

-- egnor
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100704/c6875da5/attachment.htm

From stefan_ml at behnel.de  Mon Jul  5 07:35:54 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 05 Jul 2010 07:35:54 +0200
Subject: [lxml-dev] lxml eggs for Mac OS X 10.4 (maybe 10.5?)
In-Reply-To:
References:
Message-ID: <4C316F3A.1050602@behnel.de>

"Pascal Oberndörfer", 03.07.2010 17:59:
> Don't know if these eggs are of help to anybody
> else on "older" Macs (i.e. 10.4 on ppc and i386):
>
>
>
> I needed lxml on such machines, so here they are at your own risk!
>
> The 2.2.6 eggs should statically link to:
>
> -libxml2-2.7.7
> -libxslt-1.1.26
> -libiconv-1.13.1
> -zlib-1.2.5
>
> Only tested with 'selftest.py' and 'selftest2.py'. All tests pass. But
> further testing is certainly needed!
>
> Anybody with PyPI rights, feel free to upload them, as I cannot guarantee
> the above link will remain forever ;-)

I uploaded the fat egg. Thanks a lot!

Stefan

From stefan_ml at behnel.de  Mon Jul  5 08:04:22 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 05 Jul 2010 08:04:22 +0200
Subject: [lxml-dev] Handling charset properly in lxml.html.parse() and friends
In-Reply-To:
References:
Message-ID: <4C3175E6.60706@behnel.de>

Dan Egnor, 05.07.2010 02:20:
> Handling charsets in HTML is always challenging. Charset information may
> come from an HTTP Content-Type header (if the HTML comes via HTTP, which
> often it does), from an http-equiv meta tag in the file itself, from some
> other external assertion, or it may need to be guessed from the content.
>
> The lxml documentation gives this advice at one point:
>
> Similarly, you will get errors when you try the same with HTML data in a
> unicode string that specifies a charset in a meta tag of the header. You
> should generally avoid converting XML/HTML data to unicode before passing it
> into the parsers. It is both slower and error prone.
>
>
> However, if you take that advice, then I don't know any way to supply
> external charset information to lxml.html.parse() (or
> lxml.html.document_fromstring() and so on).

Create your own parser instance.

http://codespeak.net/lxml/api/lxml.etree.HTMLParser-class.html
http://codespeak.net/lxml/parsing.html#parser-options

I just noticed that the 'encoding' option wasn't documented on the web
site, only in the docstrings and API docs. Fixed in SVN.

> These functions also don't seem
> to do much in the way of charset guessing. The *html5lib* parser has a
> "guess_charset" flag, but the ordinary HTML parser apparently does not.
>
> I have resorted to calling chardet myself and converting the input bytes to
> Unicode using whatever charset I get from chardet, and passing Unicode to
> lxml.html.document_fromstring(), in explicit contravention of the advice
> quoted above. This ignores HTTP headers (which I do have access to in my
> application, which is a content transforming proxy) as well as meta tags in
> the document, which is a bummer.
>
> Ideally, the HTML parsing functions would accept a byte string along with
> any charset assertions you might have access to, and then do the right thing
> (using your assertions, looking for meta tags, and applying chardet, in some
> order). _Really_ ideally it would even parse Content-type header values
> (since it will need to do so to understand the meta tag anyway). Maybe
> there's a way to get it to do something like this, but if so I couldn't
> figure it out.
>
> Are there currently known best practices? Searching around the forums, I
> find other puzzled users, but no crisp answers. I'm happy to help however I
> can.

This has been an often requested feature. Unknown or broken encodings are
the most common reason why parsing HTML fails, AFAICT. It would be nice to
have something like this to point users to. Either integrated into
lxml.html or as a recipe. Any help is appreciated.
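
The "create your own parser instance" suggestion earlier in this reply could
look roughly like this. It is a sketch under assumptions: the input bytes
and the ISO-8859-1 charset are invented for illustration, standing in for
an encoding learned from an HTTP header.

```python
from lxml import html

# Illustrative input: ISO-8859-1 bytes with no charset hint in the markup,
# as if the encoding had only been announced outside the document.
raw = (
    "<html><head><title>Caf\u00e9 \u00ae</title></head>"
    "<body></body></html>"
).encode("iso-8859-1")

# lxml.html.HTMLParser takes the same keyword options as etree.HTMLParser,
# including 'encoding', which overrides any detection by the parser.
parser = html.HTMLParser(encoding="iso-8859-1")
root = html.document_fromstring(raw, parser=parser)

title = root.findtext(".//title")
print(repr(title))
```

The bytes go to the parser undecoded, which follows the documentation's
advice while still letting the caller assert the charset.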

Stefan

From ab at rdprojekt.pl  Mon Jul  5 11:46:35 2010
From: ab at rdprojekt.pl (Adam Bielański)
Date: Mon, 05 Jul 2010 11:46:35 +0200
Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation
In-Reply-To: <4C2E31CC.1070205@behnel.de>
References: <4C2E0A0A.40009@rdprojekt.pl> <4C2E0C82.7020808@behnel.de> <4C2E31CC.1070205@behnel.de>
Message-ID: <4C31A9FB.9070404@rdprojekt.pl>

Stefan,

I see that I wrote too much about the background of my problem and it only
obfuscated it.

In the attachment you can find a sample xml file I'd like to validate with
etree.iterparse(dtd_validation=True)

I'm using lxml 2.2.4 (as I mentioned before) and the problem is that
iterparse returns all elements *without* raising validation errors while
tag is invalid as well as badAttribute attribute of tag. Errors are raised
correctly when I do "etree.XML(file.read(), XMLParser(dtd_validation=True))"
so I know that lxml is capable of validating this file correctly and I hope
that the problem is with the way I'm using it.

Regards,
Adam.

P.S. Thanks for the link to trang. I don't think I use it in this place,
but still - it's a nice tool to have.

On 2010-07-02 20:37, Stefan Behnel wrote:
> Adam Bielański, 02.07.2010 19:26:
>> Well, my problem is that I need to validate XML with *internal* DTD (one
>> that is embedded in file).
>
> I guess you mean "external" here, i.e. not the internal subset that is
> included in the XML document but a DTD in an external file. In this
> case, even one that is not referenced by the document itself.
>
>
>> I need to make a program that takes a potentially large XML file from
>> user, embeds known and constant (not provided by user) DTD into it and
>> feeds the result (XML with embedded DTD) to lxml validator without
>> putting
>> whole XML tree into memory. The part that combines DTD with XML already
>> exists as a stream and works fine with etree.XML().
>> Now I want to have
>> incremental validation (to save memory), so I wanted to use
>> etree.iterparse(). Does lxml design allows it, is there some special
>> 'catch' (like - DTD has to be surrounded with something else than just
>> ) to make it work, or wouldn't it work at all this
>> way?
>
> There isn't currently a way to provide an unrelated DTD to the parser,
> but you can push your DTD through "trang" to generate an XML Schema
> from it, which you can then pass into the parser.
>
> http://www.thaiopensource.com/relaxng/trang.html
>
> Stefan
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sample.xml
Type: text/xml
Size: 992 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100705/69735cd8/attachment.bin

From burak.arslan at arskom.com.tr  Mon Jul  5 17:34:57 2010
From: burak.arslan at arskom.com.tr (Burak Arslan)
Date: Mon, 05 Jul 2010 18:34:57 +0300
Subject: [lxml-dev] validating one document with multiple xml schemas
Message-ID: <4C31FBA1.4040906@arskom.com.tr>

hi,

i'm sorry if i'm asking the obvious here, but i could not deduce the proper
way of doing this from the api documentation.

my schemas are defined using multiple schema tags that define data
structures for multiple namespaces, which import from each other when
necessary. (fyi, this is for validating the input to a soap server,
available here: github.com/arskom/soaplib)

how do i instantiate an etree.XMLSchema object with multiple schema tags?

best regards,
burak

From jholg at gmx.de  Tue Jul  6 08:50:51 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 06 Jul 2010 08:50:51 +0200
Subject: [lxml-dev] validating one document with multiple xml schemas
In-Reply-To: <4C31FBA1.4040906@arskom.com.tr>
References: <4C31FBA1.4040906@arskom.com.tr>
Message-ID: <20100706065051.302120@gmx.net>

Hi,

> my schemas are defined using multiple schema
> tags that define data structures for multiple namespaces, which import
> from each other when necessary.
(fyi, this is for validating the > input to a soap server, available here: github.com/arskom/soaplib) > > how do i instantiate an etree.XMLSchema object with multiple schema tags? Works fine for me by simply instantiating the importing schema: importme.xsd: importing.xsd: >>> schemadoc = etree.parse("/var/tmp/importing.xsd") >>> schema = etree.XMLSchema(schemadoc) >>> Note that you need to give a hint to the imported schema's location with the schemaLocation attribute; if you leave this out the XMLSchema instantiation will fail: >>> schema = etree.XMLSchema(schemadoc) Traceback (most recent call last): File "", line 1, in ? File "xmlschema.pxi", line 103, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:115904) lxml.etree.XMLSchemaParseError: element decl. 'root', attribute 'type': The QName value '{http://www.example.com/import/me}myType' does not resolve to a(n) type definition., line 6 (Maybe this isn't strictly necessary if the imported namespace actually refers to an accessible url so libxml2 shortcuts that, but I haven't tested that - in my example the files are simply in the same directory) Best regards, Holger From burak.arslan at arskom.com.tr Tue Jul 6 09:58:34 2010 From: burak.arslan at arskom.com.tr (Burak Arslan) Date: Tue, 06 Jul 2010 10:58:34 +0300 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <20100706065051.302120@gmx.net> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> Message-ID: <4C32E22A.4080403@arskom.com.tr> On 07/06/10 09:50, jholg at gmx.de wrote: > Note that you need to give a hint to the imported schema's location > with the schemaLocation attribute; if you leave this out the XMLSchema > instantiation will fail: Hi there, schemaLocation attribute is not specified for schemas that are part of the same wsdl file.
See an example here: http://mssoapinterop.org/asmx/simple.asmx?WSDL I'm looking for a way to manually supply missing schema tags to the XMLSchema object before it throws an exception. best regards, burak From stefan_ml at behnel.de Tue Jul 6 10:01:44 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 06 Jul 2010 10:01:44 +0200 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C32E22A.4080403@arskom.com.tr> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> Message-ID: <4C32E2E8.2030206@behnel.de> Burak Arslan, 06.07.2010 09:58: > On 07/06/10 09:50, jholg at gmx.de wrote: >> Note that you need to give a hint to the imported schema's location >> with the schemaLocation attribute; if you leave this out the XMLSchema >> instantiation will fail: > > schemaLocation attribute is not specified for schemas that are part of > the same wsdl file. > > See an example here: > > http://mssoapinterop.org/asmx/simple.asmx?WSDL > > I'm looking for a way to manually supply missing schema tags to the XMLSchema > object before it throws an exception. What about a - parse schema document - insert schemaLocation - hand over to XMLSchema() approach?
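The three steps suggested above (parse the schema document, insert schemaLocation, hand over to XMLSchema()) can be sketched roughly as follows. This is only an illustration under assumptions: the two schemas are made-up stand-ins modelled on the names in Holger's traceback ('root', 'myType', http://www.example.com/import/me), the helper build_schema() and its locations mapping are hypothetical, and the imported schema is written to a temporary file so the location is resolvable.

```python
import os
import tempfile

from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical imported schema, written to disk so it has a location.
IMPORTED = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.com/import/me">
  <xs:simpleType name="myType">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>"""

# Hypothetical importing schema whose xs:import has no schemaLocation.
IMPORTING = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:me="http://www.example.com/import/me">
  <xs:import namespace="http://www.example.com/import/me"/>
  <xs:element name="root" type="me:myType"/>
</xs:schema>"""


def build_schema(importing_xml, locations):
    """Parse the schema, add missing schemaLocation hints, build XMLSchema."""
    root = etree.fromstring(importing_xml)
    for imp in root.iter("{%s}import" % XS):
        namespace = imp.get("namespace")
        if imp.get("schemaLocation") is None and namespace in locations:
            imp.set("schemaLocation", locations[namespace])
    return etree.XMLSchema(root)


tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "importme.xsd")
with open(path, "wb") as f:
    f.write(IMPORTED)

schema = build_schema(IMPORTING, {"http://www.example.com/import/me": path})
ok = schema.validate(etree.fromstring(b"<root>hello</root>"))
rejected = schema.validate(etree.fromstring(b"<other/>"))
print(ok, rejected)
```

The temporary file is exactly the part Burak objects to in his reply; the sketch only shows that patching the parsed tree before constructing XMLSchema is mechanically simple.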
Stefan From burak.arslan at arskom.com.tr Tue Jul 6 11:23:43 2010 From: burak.arslan at arskom.com.tr (Burak Arslan) Date: Tue, 06 Jul 2010 12:23:43 +0300 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C32E2E8.2030206@behnel.de> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> Message-ID: <4C32F61F.7000100@arskom.com.tr> On 07/06/10 11:01, Stefan Behnel wrote: > Burak Arslan, 06.07.2010 09:58: >> On 07/06/10 09:50, jholg at gmx.de wrote: >>> Note that you need to give a hint to the imported schema's location >>> with the schemaLocation attribute; if you leave this out the XMLSchema >>> instantiation will fail: >> >> schemaLocation attribute is not specified for schemas that are part of >> the same wsdl file. >> >> See an example here: >> >> http://mssoapinterop.org/asmx/simple.asmx?WSDL >> >> I'm looking for a way to manually suppy missing schema tags to XMLSchema >> object before it throws an exception. > > What about a > > - parse schema document > - insert schemaLocation > - hand over to XMLSchema() > > approach? and write schema nodes to separate temporary files while doing that? that'd be a horrible hack, but i guess you're suggesting this because there's no easy way to patch lxml (or even libxml2) to support manual schema node feeding, right? 
thanks burak From stefan_ml at behnel.de Tue Jul 6 11:31:29 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 06 Jul 2010 11:31:29 +0200 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C32F61F.7000100@arskom.com.tr> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> Message-ID: <4C32F7F1.2040301@behnel.de> Burak Arslan, 06.07.2010 11:23: > On 07/06/10 11:01, Stefan Behnel wrote: >> Burak Arslan, 06.07.2010 09:58: >>> On 07/06/10 09:50, jholg at gmx.de wrote: >>>> Note that you need to give a hint to the imported schema's location >>>> with the schemaLocation attribute; if you leave this out the XMLSchema >>>> instantiation will fail: >>> >>> schemaLocation attribute is not specified for schemas that are part of >>> the same wsdl file. >>> >>> See an example here: >>> >>> http://mssoapinterop.org/asmx/simple.asmx?WSDL >>> >>> I'm looking for a way to manually supply missing schema tags to XMLSchema >>> object before it throws an exception. >> >> What about a >> >> - parse schema document >> - insert schemaLocation >> - hand over to XMLSchema() >> >> approach? > > and write schema nodes to separate temporary files while doing that? > that'd be a horrible hack, but i guess you're suggesting this because > there's no easy way to patch lxml (or even libxml2) to support manual > schema node feeding, right? Erm, no, I was just suggesting that because you mentioned that your schema documents were missing the schemaLocation attribute, so adding it sounds like a viable solution to me. However, now that you mention it, I never tested whether the custom document loaders are used by the XML Schema parser. Would be worth testing here.
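A test of the custom document loader idea might look like the sketch below. It assumes an lxml version in which the XML Schema parser consults the resolvers of the parser that read the schema document; whether that holds for the 2.2.x releases discussed in this thread is not established here. The schemas, the location name "in-memory.xsd" and the InMemoryResolver class are all hypothetical.

```python
from io import BytesIO

from lxml import etree

# Hypothetical schema that only exists in memory.
EXTERNAL = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://www.example.com/import/me">
  <xs:simpleType name="myType">
    <xs:restriction base="xs:string"/>
  </xs:simpleType>
</xs:schema>"""

# Hypothetical importing schema; the schemaLocation is a made-up name
# that the resolver below recognises.
INTERNAL = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:me="http://www.example.com/import/me">
  <xs:import namespace="http://www.example.com/import/me"
             schemaLocation="in-memory.xsd"/>
  <xs:element name="root" type="me:myType"/>
</xs:schema>"""


class InMemoryResolver(etree.Resolver):
    """Serve known locations from a dict instead of the file system."""

    def __init__(self, documents):
        self.documents = documents
        self.seen = []  # URLs the parser asked for, kept for inspection

    def resolve(self, url, pubid, context):
        self.seen.append(url)
        if url in self.documents:
            return self.resolve_string(self.documents[url], context)
        return None  # fall back to the default loading behaviour


resolver = InMemoryResolver({"in-memory.xsd": EXTERNAL})
parser = etree.XMLParser()
parser.resolvers.add(resolver)

# The schema document must be parsed with the parser that owns the resolver.
schema_doc = etree.parse(BytesIO(INTERNAL), parser)
schema = etree.XMLSchema(schema_doc)
valid = schema.validate(etree.fromstring(b"<root>hello</root>"))
print(valid, resolver.seen)
```

Note that the import still carries a schemaLocation here; the resolver only changes where that location is loaded from, so no temporary files are needed.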
Stefan From stefan_ml at behnel.de Tue Jul 6 11:39:10 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 06 Jul 2010 11:39:10 +0200 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C32F7F1.2040301@behnel.de> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> Message-ID: <4C32F9BE.7080801@behnel.de> Stefan Behnel, 06.07.2010 11:31: > Burak Arslan, 06.07.2010 11:23: >> On 07/06/10 11:01, Stefan Behnel wrote: >>> Burak Arslan, 06.07.2010 09:58: >>>> On 07/06/10 09:50, jholg at gmx.de wrote: >>>>> Note that you need to give a hint to the imported schema's location >>>>> with the schemaLocation attribute; if you leave this out the XMLSchema >>>>> instantiation will fail: >>>> >>>> schemaLocation attribute is not specified for schemas that are part of >>>> the same wsdl file. >>>> >>>> See an example here: >>>> >>>> http://mssoapinterop.org/asmx/simple.asmx?WSDL >>>> >>>> I'm looking for a way to manually suppy missing schema tags to XMLSchema >>>> object before it throws an exception. >>> >>> What about a >>> >>> - parse schema document >>> - insert schemaLocation >>> - hand over to XMLSchema() >>> >>> approach? >> >> and write schema nodes to separate temporary files while doing that? >> that'd be a horrible hack, but i guess you're suggesting this because >> there's no easy way to patch lxml (or even libxml2) to support manual >> schema node feeding, right? > > Erm, no, I was just suggesting that because you mentioned that your schema > documents were missing the schemaLocation attribute, so adding it sounds > like a viable solution to me. > > However, now that you mention it, I never tried if the custom document > loaders are used by the XML Schema parser. Would be worth testing here. 
At least according to the sources, it should just work: http://codespeak.net/lxml/resolvers.html Stefan From ab at rdprojekt.pl Tue Jul 6 11:52:03 2010 From: ab at rdprojekt.pl (=?UTF-8?B?QWRhbSBCaWVsYcWEc2tp?=) Date: Tue, 06 Jul 2010 11:52:03 +0200 Subject: [lxml-dev] Fwd: iterparse() doesn't do dtd_validation In-Reply-To: <4C2E31CC.1070205@behnel.de> References: <4C2E0A0A.40009@rdprojekt.pl> <4C2E0C82.7020808@behnel.de> <4C2E31CC.1070205@behnel.de> Message-ID: <4C32FCC3.4060401@rdprojekt.pl> Stefan, I see that I wrote too much about the background of my problem and it only obfuscated my question for lxml. Below you can find sample xml file. I'd like to validate it using etree.iterparse(dtd_validation=True) I'm using lxml 2.2.4 (as I mentioned before) and the problem is that iterparse returns all elements *without* raising validation errors. You can see easily that tag is invalid as well as badAttribute attribute of tag though. Errors are raised as expected when I call "etree.XML(file.read(), XMLParser(dtd_validation=True))", so I know that lxml is capable of validating this file correctly and I hope that the problem is with the way I'm using it. Could you tell me if there is a bug in iterparse and I should seek different approach, or can I change the way I call it to make it work? Regards, Adam. The XML to validate: ]> SchemaVersion 21 Attribute name Regards, Adam. P.S. Seems like my previous post was incorrectly interpreted as an answer to my post so I'm resending it. Sorry for the mess. W dniu 2010-07-02 20:37, Stefan Behnel pisze: > Adam Biela?ski, 02.07.2010 19:26: >> Well, my problem is that I need to validate XML with *internal* DTD (one >> that is embedded in file). > > I guess you mean "external" here, i.e. not the internal subset that is > included in the XML document but a DTD in an external file. In this > case, even one that is not referenced by the document itself. 
> > >> I need to make a program that takes a potentially large XML file from >> user, embeds known and constant (not provided by user) DTD into it and >> feeds the result (XML with embedded DTD) to lxml validator without >> putting >> whole XML tree into memory. The part that combines DTD with XML already >> exists as a stream and works fine with etree.XML(). Now I want to have >> incremental validation (to save memory), so I wanted to use >> etree.iterparse(). Does lxml design allows it, is there some special >> 'catch' (like - DTD has to be surrounded with something else than just >> ) to make it work, or wouldn't it work at all this >> way? > > There isn't currently a way to provide an unrelated DTD to the parser, > but you can push your DTD through "trang" to generate an XML Schema > from it, which you can then pass into the parser. > > http://www.thaiopensource.com/relaxng/trang.html > > Stefan > From burak.arslan at arskom.com.tr Tue Jul 6 13:22:39 2010 From: burak.arslan at arskom.com.tr (Burak Arslan) Date: Tue, 06 Jul 2010 14:22:39 +0300 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C32F9BE.7080801@behnel.de> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> <4C32F9BE.7080801@behnel.de> Message-ID: <4C3311FF.5020309@arskom.com.tr> On 07/06/10 12:39, Stefan Behnel wrote: > Stefan Behnel, 06.07.2010 11:31: >> Burak Arslan, 06.07.2010 11:23: >>> On 07/06/10 11:01, Stefan Behnel wrote: >>>> Burak Arslan, 06.07.2010 09:58: >>>>> On 07/06/10 09:50, jholg at gmx.de wrote: >>>>>> Note that you need to give a hint to the imported schema's location >>>>>> with the schemaLocation attribute; if you leave this out the >>>>>> XMLSchema >>>>>> instantiation will fail: >>>>> >>>>> schemaLocation attribute is not specified for schemas that are >>>>> part of >>>>> the same wsdl file. 
>>>>> >>>>> See an example here: >>>>> >>>>> http://mssoapinterop.org/asmx/simple.asmx?WSDL >>>>> >>>>> I'm looking for a way to manually suppy missing schema tags to >>>>> XMLSchema >>>>> object before it throws an exception. >>>> >>>> What about a >>>> >>>> - parse schema document >>>> - insert schemaLocation >>>> - hand over to XMLSchema() >>>> >>>> approach? >>> >>> and write schema nodes to separate temporary files while doing that? >>> that'd be a horrible hack, but i guess you're suggesting this because >>> there's no easy way to patch lxml (or even libxml2) to support manual >>> schema node feeding, right? >> >> Erm, no, I was just suggesting that because you mentioned that your >> schema >> documents were missing the schemaLocation attribute, so adding it sounds >> like a viable solution to me. >> >> However, now that you mention it, I never tried if the custom document >> loaders are used by the XML Schema parser. Would be worth testing here. > > At least according to the sources, it should just work: > > http://codespeak.net/lxml/resolvers.html > > Stefan hi stefan, thanks for the pointer. there's one issue with it though: i'm generating the xml schema, not parsing it from some input. is there a way to have the resolvers work for already-parsed input? thanks, burak From stefan_ml at behnel.de Tue Jul 6 13:30:23 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 06 Jul 2010 13:30:23 +0200 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C3311FF.5020309@arskom.com.tr> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> <4C32F9BE.7080801@behnel.de> <4C3311FF.5020309@arskom.com.tr> Message-ID: <4C3313CF.9050004@behnel.de> Burak Arslan, 06.07.2010 13:22: > On 07/06/10 12:39, Stefan Behnel wrote: >> http://codespeak.net/lxml/resolvers.html > > thanks for the pointer. 
there's one issue with it though: i'm generating > the xml schema, not parsing it from some input. is there a way to have > the resolvers work for already-parsed input? How do you generate it? Do you build it manually through the ET API? Even then, it would be enough to parse the root element instead of calling Element(), and then appending to that. Or parse it once and then use deepcopy. Or ... Stefan From mike_mp at zzzcomputing.com Tue Jul 6 19:33:02 2010 From: mike_mp at zzzcomputing.com (Michael Bayer) Date: Tue, 6 Jul 2010 13:33:02 -0400 Subject: [lxml-dev] schema validation issue Message-ID: <9098E2B4-D8F6-4729-B709-1F8B9AC21D62@zzzcomputing.com> I was playing with the libxml versions that underlie my lxml installation in order to help someone with the "static deps" flag. I noticed that when I build against the latest libxml (2.7.7), as opposed to the older one I have lying around on my mac (2.2.6) , some validations change behavior. The script below illustrates three XML docs that validate completely with 2.2.6 but the second two fail on 2.7.7. My question is, does elem1, elem2, .. mean that *any* of elem1, elem2, etc. can be present , or that every subelement, or none, must be present ? In 2.7.7 it seems to be interpreting it as, "all of elem1, elem2, elem3, ... must be present, or none". This doesn't match what I can find in the xml-schema spec (http://www.w3.org/TR/xmlschema-0/#ref18, "All the elements in the group may appear once or not at all, and they may appear in any order.") and various online tutorials (http://www.w3schools.com/Schema/el_all.asp, "The example above indicates that the "firstname" and the "lastname" elements can appear in any order and each element CAN appear zero or one time!"). So is libxml wrong or am I misinterpreting ? How would I lay out a tag like with each child tag being independently optional ? The other one is how to make an "xs:int" that is optional, but that one I haven't researched yet. 
Previously, I could say without issue, now it says it's an invalid integer.

from lxml import etree
from StringIO import StringIO

schema = """ """

xmlschema = etree.XMLSchema(etree.parse(StringIO(schema)))

# passes
doc1 = """ some value 12 """

# fails. it wants both "int-attr" and "str-attr" to be present.
# didn't think this was how "xs:all" worked ?
doc2 = """ 12 """

# fails. doesn't allow blank for "type='xs:int'".
doc3 = """ some value """

for i, doc in enumerate((doc1, doc2, doc3)):
    doc = etree.parse(StringIO(doc))
    try:
        xmlschema.assertValid(doc)
        print "document %d is valid." % i
    except Exception, e:
        print "document %d is not valid." % i
        print e

output:

document 0 is valid.
document 1 is not valid.
Element 'parent': Missing child element(s). Expected is ( str-attr )., line 2
document 2 is not valid.
Element 'int-attr': '' is not a valid value of the atomic type 'xs:int'., line 4

From rasmusscholer at gmail.com Tue Jul 6 21:26:07 2010 From: rasmusscholer at gmail.com (=?ISO-8859-1?Q?Rasmus_Sch=F8ler_S=F8rensen?=) Date: Tue, 6 Jul 2010 21:26:07 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) Message-ID: Hi, Is there any easy (i.e. no source) way to get lxml to work with python3? I'm on an ubuntu (9.04) system with python 3.0.1 installed using "apt-get install python3". If I try to use "from lxml import etree" in a python script, I get "ImportError: No module named lxml". So I go to http://codespeak.net/lxml/installation.html and see that I need to "easy_install lxml". Great, I go to http://peak.telecommunity.com/DevCenter/EasyInstall to see how to get easy_install. Under http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install it tells me to go to http://pypi.python.org/pypi/setuptools. Ok, I go there and find instructions for Windows, RPM and "Cygwin, Mac, Linux, Other" - it tells me to find my appropriate egg in the file list. I look in the file list and see versions 2.3 to 2.6. So, I ask: No 3.0 ??
So, the easy_install'er -- does not work (out-of-the-box) with 3.0... does this also mean that there is no easy way to install lxml on python 3.0? If not, I really think you should add a note about that on http://codespeak.net/lxml/installation.html -- it's rather much trouble to go through to figure out. Cheers, From bboissin at gmail.com Tue Jul 6 23:00:48 2010 From: bboissin at gmail.com (Benoit Boissinot) Date: Tue, 6 Jul 2010 23:00:48 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: References: Message-ID: 2010/7/6 Rasmus Schøler Sørensen : > So, the easy_install'er -- does not work (out-of-the-box) with 3.0... does > this also mean that there is no easy way to install lxml on python 3.0? > If not, I really think you should add a note about that > on http://codespeak.net/lxml/installation.html -- it's rather much trouble > to go through to figure out. FYI, with python libs and apps, you should assume they do *not* work on 3.x unless stated otherwise. regards, Benoit From burak.arslan at arskom.com.tr Tue Jul 6 23:39:19 2010 From: burak.arslan at arskom.com.tr (Burak Arslan) Date: Wed, 07 Jul 2010 00:39:19 +0300 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C3313CF.9050004@behnel.de> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> <4C32F9BE.7080801@behnel.de> <4C3311FF.5020309@arskom.com.tr> <4C3313CF.9050004@behnel.de> Message-ID: <4C33A287.1030008@arskom.com.tr> On 07/06/10 14:30, Stefan Behnel wrote: > Burak Arslan, 06.07.2010 13:22: >> On 07/06/10 12:39, Stefan Behnel wrote: >>> http://codespeak.net/lxml/resolvers.html >> >> thanks for the pointer.
there's one issue with it though: i'm generating >> the xml schema, not parsing it from some input. is there a way to have >> the resolvers work for already-parsed input? > > How do you generate it? Do you build it manually through the ET API? yes. there's another way? > Even then, it would be enough to parse the root element instead of > calling Element(), and then appending to that. Or parse it once and > then use deepcopy. Or ... > see the __build_validator function here: http://bit.ly/bmpKnS is this what you suggested me to do? and here's the output from running the test. notice that the resolver function does not get called (no output) http://dpaste.com/hold/215262/ Is there a way to customize how XMLSchema object resolves imports? Best regards, Burak From stefan_ml at behnel.de Wed Jul 7 06:55:41 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 07 Jul 2010 06:55:41 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: References: Message-ID: <4C3408CD.7030104@behnel.de> Benoit Boissinot, 06.07.2010 23:00: > 2010/7/6 Rasmus Sch?ler S?rensen: >> So, the easy_install'er -- does not work (out-of-the-box) with 3.0... does >> this also mean that there is no easy way to install lxml on python 3.0? >> If not, I really think you should add a note about that >> on http://codespeak.net/lxml/installation.html -- it's rather much trouble >> to go through to figure out. > > FYI, with python libs and apps, you should assume they do *not* work > on 3.x unless stated otherwise. Well, it's pretty clearly stated on the PyPI page that lxml supports 3.x, and it did so since the early days. Stefan From stefan_ml at behnel.de Wed Jul 7 06:58:36 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 07 Jul 2010 06:58:36 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: References: Message-ID: <4C34097C.3080002@behnel.de> Rasmus Sch?ler S?rensen, 06.07.2010 21:26: > Is there any easy (i.e. 
no source) way to get lxml to work with python3? I'm > on an ubuntu (9.04) system with python 3.0.1 installed using "apt-get > install python3". You shouldn't bother with 3.0; install 3.1.x instead. > If I try to use "from lxml import etree" in a python script, I get > "ImportError: No module named lxml". > > So I go to http://codespeak.net/lxml/installation.html and see that I need > to "easy_install lxml". > Great, I go to http://peak.telecommunity.com/DevCenter/EasyInstall to see > how to get easy_install. > Under > http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install it > tells me to go to http://pypi.python.org/pypi/setuptools. > Ok, I go there and find instructions for Windows, RPM and "Cygwin, Mac, > Linux, Other" - it tells me to find my appropriate egg in the file list. > I look in the file list and see versions 2.3 to 2.6. So, I ask: No 3.0 ?? > > So, the easy_install'er -- does not work (out-of-the-box) with 3.0... does > this also mean that there is no easy way to install lxml on python 3.0? You can use "distribute" instead of setuptools, or just do the normal setup.py call yourself, as you would for any other Python module. I'll add a note to the docs that distribute can be used on Py3.
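Whichever install route is used, a quick way to confirm that the interpreter in question really sees lxml, and which libxml2 it was built against, is a version check along these lines. A minimal sketch: the helper name describe_versions() is made up for illustration, while the version constants are attributes of lxml.etree.

```python
import sys

from lxml import etree


def describe_versions():
    """Collect the version tuples of the interpreter, lxml and its C libraries."""
    return {
        "python": tuple(sys.version_info[:3]),
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (runtime)": etree.LIBXSLT_VERSION,
    }


versions = describe_versions()
for name, version in versions.items():
    print(name, ".".join(str(part) for part in version))
```

If the import itself fails with "ImportError: No module named lxml", the package was most likely installed for a different interpreter than the one running the script.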
Stefan From stefan_ml at behnel.de Wed Jul 7 07:19:42 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 07 Jul 2010 07:19:42 +0200 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C33A287.1030008@arskom.com.tr> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> <4C32F9BE.7080801@behnel.de> <4C3311FF.5020309@arskom.com.tr> <4C3313CF.9050004@behnel.de> <4C33A287.1030008@arskom.com.tr> Message-ID: <4C340E6E.9070901@behnel.de> Burak Arslan, 06.07.2010 23:39: > On 07/06/10 14:30, Stefan Behnel wrote: >> Burak Arslan, 06.07.2010 13:22: >>> On 07/06/10 12:39, Stefan Behnel wrote: >>>> http://codespeak.net/lxml/resolvers.html >>> >>> thanks for the pointer. there's one issue with it though: i'm generating >>> the xml schema, not parsing it from some input. is there a way to have >>> the resolvers work for already-parsed input? >> >> How do you generate it? Do you build it manually through the ET API? > > yes. there's another way? Tons of them. ;) I'm currently preparing a lecture about these things, so I should know. >> Even then, it would be enough to parse the root element instead of >> calling Element(), and then appending to that. Or parse it once and >> then use deepcopy. Or ... > > see the __build_validator function here: http://bit.ly/bmpKnS > is this what you suggested me to do? Yes, without looking too closely, I think that should work. Use fromstring() instead of StringIO, though. > and here's the output from running the test. notice that the resolver > function does not get called (no output) > > http://dpaste.com/hold/215262/ > > Is there a way to customize how XMLSchema object resolves imports? Sorry, I really can't debug your code for you. Please post a short code snippet that shows what doesn't work for you. 
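For reference, the difference between the two parsing calls mentioned above, sketched with a made-up document: parse() on a file-like object returns an ElementTree, while fromstring() takes the data directly and returns the root Element, so no StringIO wrapper is needed.

```python
from io import BytesIO

from lxml import etree

data = b"<root><child>text</child></root>"

# Wrapping the data in a file-like object: returns an ElementTree.
tree = etree.parse(BytesIO(data))

# Parsing the data directly: returns the root Element.
root = etree.fromstring(data)

print(type(tree).__name__, type(root).__name__)
print(etree.tostring(tree.getroot()) == etree.tostring(root))
```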
Stefan From stefan_ml at behnel.de Wed Jul 7 07:41:56 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 07 Jul 2010 07:41:56 +0200 Subject: [lxml-dev] schema validation issue In-Reply-To: <9098E2B4-D8F6-4729-B709-1F8B9AC21D62@zzzcomputing.com> References: <9098E2B4-D8F6-4729-B709-1F8B9AC21D62@zzzcomputing.com> Message-ID: <4C3413A4.7040202@behnel.de> Michael Bayer, 06.07.2010 19:33: > I was playing with the libxml versions that underlie my lxml > installation in order to help someone with the "static deps" flag. I > noticed that when I build against the latest libxml (2.7.7), as opposed > to the older one I have lying around on my mac (2.2.6) , some > validations change behavior. The script below illustrates three XML > docs that validate completely with 2.2.6 but the second two fail on > 2.7.7. Are you sure you mean libxml2 2.2.6? That's a pretty old version (and not supported by lxml). Anyway, the right place to ask this is the libxml2 mailing list. Stefan From burak.arslan at arskom.com.tr Wed Jul 7 09:58:30 2010 From: burak.arslan at arskom.com.tr (Burak Arslan) Date: Wed, 07 Jul 2010 10:58:30 +0300 Subject: [lxml-dev] validating one document with multiple xml schemas In-Reply-To: <4C340E6E.9070901@behnel.de> References: <4C31FBA1.4040906@arskom.com.tr> <20100706065051.302120@gmx.net> <4C32E22A.4080403@arskom.com.tr> <4C32E2E8.2030206@behnel.de> <4C32F61F.7000100@arskom.com.tr> <4C32F7F1.2040301@behnel.de> <4C32F9BE.7080801@behnel.de> <4C3311FF.5020309@arskom.com.tr> <4C3313CF.9050004@behnel.de> <4C33A287.1030008@arskom.com.tr> <4C340E6E.9070901@behnel.de> Message-ID: <4C3433A6.8060809@arskom.com.tr> >>> Even then, it would be enough to parse the root element instead of >>> calling Element(), and then appending to that. Or parse it once and >>> then use deepcopy. Or ... >> >> see the __build_validator function here: http://bit.ly/bmpKnS >> is this what you suggested me to do? > > Yes, without looking too closely, I think that should work. 
Use > fromstring() instead of StringIO, though. > > here's an even smaller version: http://dpaste.com/hold/215375/ to me it says: lxml.etree.XMLSchemaParseError: element decl. '{TestService.TestService}p', attribute 'type': The QName value '{TestService}Person' does not resolve to a(n) type definition., line 29 as i said earlier, the resolver does not get called at all. and i just can't see anything wrong with the statement, even after i've re-examined the relevant docs. so i think there's something missing in the code snippet here. thanks a lot, burak From rasmusscholer at gmail.com Wed Jul 7 11:36:17 2010 From: rasmusscholer at gmail.com (=?ISO-8859-1?Q?Rasmus_Sch=F8ler_S=F8rensen?=) Date: Wed, 7 Jul 2010 11:36:17 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: <4C34097C.3080002@behnel.de> References: <4C34097C.3080002@behnel.de> Message-ID: Hi guys, thanks for your replies. What I did: *wget http://python-distribute.org/distribute_setup.py* *sudo python3 distribute_setup.py (it installs ok)* *sudo apt-get install libxml2-dev* *sudo apt-get install libxslt1-dev* *sudo apt-get install python3-dev* *sudo easy_install lxml* I'm not 100% sure that was the neatest way to do it, but I got an error from easy_install until python3-dev was installed. I can now successfully do: from xml.etree import ElementTree tree = ElementTree.parse("people.xml") That's not completely consistent with the tutorial at http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files which says: >>> from lxml import etree >>> tree = etree.parse(some_file_like) But the above gives an error. (I probably missed something as I scrolled down the tutorial, but I don't have time to read everything carefully, I just want quick answers...) Anyways, it works now. Feel free to copy my text in bold to your installation instructions at http://codespeak.net/lxml/installation.html (maybe add a small "how-to-install-on-ubuntu-with-python3" section?
After all, Ubuntu is for people who expect simple "I'm used to use Windows where everything works out of the box" kind of instructions ;-) Regarding using 3.0.1+ vs 3.1: 3.0.1+ is currently what is delivered by the Ubuntu repositories, so that's what I use -- I expect that my version will be automagically updated when a newer version hits the repos. Regarding Benoit's "If it doesn't explicitly state that it works on python3, then it is only for 2.x": This is actually exactly what I think is currently the worst thing about python at the moment, at least as a newcomer. I don't want to write my apps as 2.x scripts since 3.x is naturally the future. However, a great deal of python's tutorials, libs and frameworks have not been converted to 3.x. And even worse, they don't even mention anything about 3.x vs 2.x (or at least they don't try to make it clear) which makes it very hard to figure out which sections are useful and which are obsolete -- it is so frustrating that I am sometimes considering going back to Java or some of the original .NET languages or maybe go with Ruby or Groovy -- even PHP would sometimes seem like a better option... But anyways, thanks a lot for getting me up and running. Best regards, Rasmus Scholer. 2010/7/7 Stefan Behnel > Rasmus Sch?ler S?rensen, 06.07.2010 21:26: > > Is there any easy (i.e. no source) way to get lxml to work with python3? >> I'm >> on an ubuntu (9.04) system with python 3.0.1 installed using "apt-get >> install python3". >> > > You shouldn't both with 3.0, install 3.1.x instead. > > > > If I try to use "from lxml import etree" in a python script, I get >> "ImportError: No module named lxml". >> >> So I go to http://codespeak.net/lxml/installation.html and see that I >> need >> to "easy_install lxml". >> Great, I go to http://peak.telecommunity.com/DevCenter/EasyInstall to see >> how to get easy_install. 
>> Under >> >> http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install it >> tells me to go to http://pypi.python.org/pypi/setuptools. >> Ok, I go there and find instructions for Windows, RPM and "Cygwin, Mac, >> Linux, Other" - it tells me to find my appropriate egg in the file list. >> I look in the file list and see versions 2.3 to 2.6. So, I ask: No 3.0 ?? >> >> So, the easy_install'er -- does not work (out-of-the-box) with 3.0... does >> this also mean that there is no easy way to install lxml on python 3.0? >> > > You can use "distribute" instead of setuptools, or just do the normal > setup.py call yourself. As you would for any other Python module. > > I'll add a note to the docs that distribute can be used on Py3. > > Stefan > From stefan_ml at behnel.de Wed Jul 7 11:43:43 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 07 Jul 2010 11:43:43 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: References: <4C34097C.3080002@behnel.de> Message-ID: <4C344C4F.6050400@behnel.de> Rasmus Schøler Sørensen, 07.07.2010 11:36: > What I did: > *wget http://python-distribute.org/distribute_setup.py* > *sudo python3 distribute_setup.py (it installs ok)* > *sudo apt-get install libxml2-dev* > *sudo apt-get install libxslt1-dev* > *sudo apt-get install python3-dev* > *sudo easy_install lxml* Looks good. > I can now successfully do: > from xml.etree import ElementTree > tree = ElementTree.parse("people.xml") You can always do that starting from Python 2.5, even if lxml is not installed. ;-) > That's not completely consistent with the tutorial at > http://codespeak.net/lxml/tutorial.html#parsing-from-strings-and-files which > says: > > >>> from lxml import etree > >>> tree = etree.parse(some_file_like) > > But the above gives an error. ... such as?
> Regarding using 3.0.1+ vs 3.1:
> 3.0.1+ is currently what is delivered by the Ubuntu repositories, so that's
> what I use

It's pretty dead, though. Only 3.1+ will be maintained in the future.

Stefan

From ab at rdprojekt.pl Wed Jul 7 16:25:13 2010
From: ab at rdprojekt.pl (Adam Bielański)
Date: Wed, 07 Jul 2010 16:25:13 +0200
Subject: [lxml-dev] No error position when validating against XMLSchema
Message-ID: <4C348E49.3090006@rdprojekt.pl>

Hello,

I'm creating an object with
"etree.iterparse(open('NotValidFile.xml'), schema=schema)", and iterating
over it raises etree.XMLSyntaxError, as expected. Unfortunately, the value
of the 'position' attribute of that error is (0,0), and the value of the
'offset' attribute is None.

Is there a way to get at least the line number of the offending tag in my
XML file?
When etree.XMLSyntaxError is raised in etree.XML(), it contains position
info pointing precisely to the place where the error occurred. Is that also
possible when using iterparse()?

Regards,
Adam Bielański.

From mike_mp at zzzcomputing.com Wed Jul 7 17:35:06 2010
From: mike_mp at zzzcomputing.com (Michael Bayer)
Date: Wed, 7 Jul 2010 11:35:06 -0400
Subject: [lxml-dev] schema validation issue
In-Reply-To: <4C3413A4.7040202@behnel.de>
References: <9098E2B4-D8F6-4729-B709-1F8B9AC21D62@zzzcomputing.com> <4C3413A4.7040202@behnel.de>
Message-ID: <447D1B9A-6717-433E-8DEE-B0CD02EB1D5F@zzzcomputing.com>

On Jul 7, 2010, at 1:41 AM, Stefan Behnel wrote:

> Michael Bayer, 06.07.2010 19:33:
>> I was playing with the libxml versions that underlie my lxml
>> installation in order to help someone with the "static deps" flag. I
>> noticed that when I build against the latest libxml (2.7.7), as opposed
>> to the older one I have lying around on my mac (2.2.6) , some
>> validations change behavior. The script below illustrates three XML
>> docs that validate completely with 2.2.6 but the second two fail on
>> 2.7.7.
>
> Are you sure you mean libxml2 2.2.6?
That's a pretty old version (and not supported by lxml). you're correct in that I am incorrect, its libxml2 2.6. > > Anyway, the right place to ask this is the libxml2 mailing list. My real question first off is one of "what is the correct behvior for XML schema", before I go report a bug with them and get slapped down, so I will ask first on stackoverflow. From mike_mp at zzzcomputing.com Wed Jul 7 18:27:15 2010 From: mike_mp at zzzcomputing.com (Michael Bayer) Date: Wed, 7 Jul 2010 12:27:15 -0400 Subject: [lxml-dev] schema validation issue In-Reply-To: <447D1B9A-6717-433E-8DEE-B0CD02EB1D5F@zzzcomputing.com> References: <9098E2B4-D8F6-4729-B709-1F8B9AC21D62@zzzcomputing.com> <4C3413A4.7040202@behnel.de> <447D1B9A-6717-433E-8DEE-B0CD02EB1D5F@zzzcomputing.com> Message-ID: <43B9FC47-7F24-4F61-846F-5397FEAAE0BD@zzzcomputing.com> On Jul 7, 2010, at 11:35 AM, Michael Bayer wrote: > > On Jul 7, 2010, at 1:41 AM, Stefan Behnel wrote: > >> Michael Bayer, 06.07.2010 19:33: >>> I was playing with the libxml versions that underlie my lxml >>> installation in order to help someone with the "static deps" flag. I >>> noticed that when I build against the latest libxml (2.7.7), as opposed >>> to the older one I have lying around on my mac (2.2.6) , some >>> validations change behavior. The script below illustrates three XML >>> docs that validate completely with 2.2.6 but the second two fail on >>> 2.7.7. >> >> Are you sure you mean libxml2 2.2.6? That's a pretty old version (and not supported by lxml). > > you're correct in that I am incorrect, its libxml2 2.6. > >> >> Anyway, the right place to ask this is the libxml2 mailing list. > > > My real question first off is one of "what is the correct behvior for XML schema", before I go report a bug with them and get slapped down, so I will ask first on stackoverflow. 
I'm asking on SO, but also here is where they changed the behavior: https://bugzilla.gnome.org/show_bug.cgi?id=571271 > > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From p.oberndoerfer at urheberrecht.org Tue Jul 6 11:41:00 2010 From: p.oberndoerfer at urheberrecht.org (Pascal =?utf-8?b?T2Jlcm5kw7ZyZmVy?=) Date: Tue, 6 Jul 2010 09:41:00 +0000 (UTC) Subject: [lxml-dev] lxml eggs for Mac OS X 10.4 (maybe 10.5?) References: <4C316F3A.1050602@behnel.de> Message-ID: Stefan Behnel behnel.de> writes: > "Pascal Obernd?rfer", 03.07.2010 17:59: > > Don't know if these eggs are of help to anybody > > else on "older" Macs (i.e. 10.4 on ppc and i386): > > > > > > [...] > > Anybody with PyPI rights, feel free to upload them, as I cannot guarantee > > the above link will remain forever > > I uploaded the fat egg. Thanks a lot! > > Stefan Well, the strange thing is that the fat egg is indeed named 'fat' only when lxml is built against Python 2.6. The two others ('ppc' and 'i386') are 'fat' as well (according to lipo), but the correct naming isn't carried over, as they were built against Python 2.5. I'll try to build against Python 2.7 in the next couple of days. Pascal From p.oberndoerfer at urheberrecht.org Wed Jul 7 21:54:00 2010 From: p.oberndoerfer at urheberrecht.org (=?iso-8859-1?Q?=22Pascal_Obernd=F6rfer=22?=) Date: Wed, 7 Jul 2010 21:54:00 +0200 Subject: [lxml-dev] lxml eggs for Mac OS X 10.4 (maybe 10.5?) In-Reply-To: <4C316F3A.1050602@behnel.de> References: <4C316F3A.1050602@behnel.de> Message-ID: <314890dea6148df77f90161ebf65288b.squirrel@mail.urheberrecht.org> > "Pascal Obernd??rfer", 03.07.2010 17:59: >> Don't know if these eggs are of help to anybody >> else on "older" Macs (i.e. 10.4 on ppc and i386): >> >> >> >> I needed lxml on such machines, so here they are at your own risk! 
>> >> The 2.2.6 eggs should statically link to: >> >> -libxml2-2.7.7 >> -libxslt-1.1.26 >> -libiconv-1.13.1 >> -zlib-1.2.5 >> >> Only tested with 'selftest.py' and 'selftest2.py'. All tests pass. But >> further testing is certainly needed! >> >> Anybody with PyPI rights, feel free to upload them, as I cannot >> guarantee >> the above link will remain forever ;-) > > I uploaded the fat egg. Thanks a lot! > > Stefan I added an egg for Python 2.7 and MacOS X 10.4 as well: Same limited testing as above though! Pascal From public at codethief.eu Thu Jul 8 03:51:20 2010 From: public at codethief.eu (codethief) Date: Thu, 8 Jul 2010 03:51:20 +0200 Subject: [lxml-dev] Python3 support (easy, not messy) In-Reply-To: <4C344C4F.6050400@behnel.de> References: <4C34097C.3080002@behnel.de> <4C344C4F.6050400@behnel.de> Message-ID: lxml works perfectly with Python 3. All you have to do is to use python3-setuptools, i.e. install lxml by doing "easy_install3 lxml". And Python 3.1+ is in the repositories, too. You just have to upgrade to Ubuntu 10.04, I guess. ;) -- Simon Hirscher http://simonhirscher.de From sidnei.da.silva at gmail.com Fri Jul 9 06:15:33 2010 From: sidnei.da.silva at gmail.com (Sidnei da Silva) Date: Fri, 9 Jul 2010 01:15:33 -0300 Subject: [lxml-dev] (unofficial) lxml 2.2.6 Windows Binaries for Python 2.6 Message-ID: HI all, I finally managed to get the builds going again. I have some unofficial builds for ready for testing, and will upload them to pypi there's confirmation that they work just fine: http://bit.ly/bNGEol Worth of note is the fact that the x64 binaries have been stripped of iconv support because I couldn't figure out (yet!) how to build that one properly. According to Stefan, that means that the x64 build won't be able to parse even ISO8859-* encoded files. YMMV. If you're interested in some background about why it took so long, there's more information on my blog: http://bit.ly/9GCgvn Thanks for your patience, and get on testing! 
-- Sidnei

From gialloporpora at gmail.com Sun Jul 11 01:29:22 2010
From: gialloporpora at gmail.com (gialloporpora)
Date: Sun, 11 Jul 2010 01:29:22 +0200
Subject: [lxml-dev] building lxml from source using Mingw32
Message-ID: <4C390252.2080105@gmail.com>

Dear all,
I have a problem building lxml from source using the MinGW compiler.
I have already built lxml from source and created the installer using
Visual C++, but I have now updated Python to 2.7 and would like to create
the installer using MinGW (I do not have MS Visual C++ installed on this
machine).

I have followed all the instructions on the lxml documentation page and
downloaded the packages from:
ftp://ftp.zlatkovic.com/pub/libxml/

but when I try to build the installer I get many error messages. I used
this command:

python setup.py build --compiler mingw32 --static bdist_wininst

(If you want, I can post the full output.)

I have tried to solve this myself using Google, but I am not very
experienced and have not managed to fix it.

Reading this document:
http://docs.python.org/install/

it seems that I must convert all the DLLs in the iconv, libxml, libxslt and
zlib packages using the pexports tool suggested on that page. Is that true?

If the answer to the previous question is yes, is there a package of DLLs
that is already compatible with MinGW?
If not, no problem: I can install Visual C++ and build the installer with
it.

Sandro
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 552 bytes Desc: OpenPGP digital signature Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100711/bb036665/attachment.pgp From gialloporpora at gmail.com Sun Jul 11 19:34:08 2010 From: gialloporpora at gmail.com (gialloporpora) Date: Sun, 11 Jul 2010 19:34:08 +0200 Subject: [lxml-dev] building lxml from source using Mingw32 In-Reply-To: <4C390252.2080105@gmail.com> References: <4C390252.2080105@gmail.com> Message-ID: <4C3A0090.3000808@gmail.com> Risposta al messaggio di gialloporpora : > Dear all, > I have a problem building lxml from source using Mingw compiler. > I have already built lxml from source creating the installer using the > Visual C++, but now I have updated Python to 2.7 and I would like to > create the installer using Mingw (I have not MS Visual C++ installed on > this machine). > > I have followed all instructions on the lxml documentation page, I have > downloaded the packages from: > ftp://ftp.zlatkovic.com/pub/libxml/ > > but if I try to build the installer I receive many errors messages, I > have used this commands: > > python setup.py build --compiler mingw32 --static bdist_wininst > > (if you want, I could report all the messages of the output). > > I have tried to solve this myself using Google but I am not very expert > and I have not solved. > > Reading this document: > http://docs.python.org/install/ > > > It seem that I must convert all DLL's in the iconv, libxml, libxslt and > zlib package using the pexports tool suggested in the page. Is it true? > > If the reply to previous question is affermative, there exist a package > of DLL's already compatible with Mingw? > If not, no problem I could install the Visual C++ and building the > installer using it. 
>
> Sandro
>
>
>
>
>
Ok, I have installed MS Visual C++ and created the installer with it, and
lxml now works again on Python 2.7. The installer is here if anybody is
interested:
http://bit.ly/bwCHDw

If anybody tries to create it with MS Visual C++ 2010, first read this
article:
http://bit.ly/bR2ZOy

variable to avoid errors.

I have another question: what is different about the packages with "sec" in
the name?
ftp://ftp.zlatkovic.com/pub/libxml/

For example, which of these packages is the better one for building lxml?

libxmlsec-1.2.13.win32.zip

I used the package without "sec", as suggested in the documentation, but I
would like to know whether I could also use the "sec" package and whether
it is better.
libxml2-2.7.6.win32.zip

Regards
Sandro
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 552 bytes
Desc: OpenPGP digital signature
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100711/2b725ab0/attachment-0001.pgp

From lists at cheimes.de Sun Jul 11 22:20:19 2010
From: lists at cheimes.de (Christian Heimes)
Date: Sun, 11 Jul 2010 22:20:19 +0200
Subject: [lxml-dev] building lxml from source using Mingw32
In-Reply-To: <4C3A0090.3000808@gmail.com>
References: <4C390252.2080105@gmail.com> <4C3A0090.3000808@gmail.com>
Message-ID:

> Ok, I have installed MS Visual C++ and created the installer with it and
> now lxml works again on Python 2.7, the installer is here if somebody
> is interested:
> http://bit.ly/bwCHDw
>
> If somebody try to create it with MS Visual C++ 2010, first read this
> article:
> http://bit.ly/bR2ZOy
>
> variable to avoid errors.

Please don't use anything other than Visual Studio 2008 for official
packages. You must not mix multiple CRTs, since this can lead to errors
and segfaults. Data structures like FILE*, memory management via
malloc/free, errno and more are local to each CRT. The need for the
/MANIFEST option is a big warning sign for future trouble.
In general .pyd files shouldn't have an embedded manifest.

Christian

From lists at cheimes.de Mon Jul 12 00:16:05 2010
From: lists at cheimes.de (Christian Heimes)
Date: Mon, 12 Jul 2010 00:16:05 +0200
Subject: [lxml-dev] building lxml from source using Mingw32
In-Reply-To: <4C3A3F65.4050604@gmail.com>
References: <4C390252.2080105@gmail.com> <4C3A0090.3000808@gmail.com> <4C3A3F65.4050604@gmail.com>
Message-ID: <4C3A42A5.4020904@cheimes.de>

> Ok, thanks for your suggestion, I didn't know that. I used it because
> it was the only download that I found on the MS website.
> My first idea was to build lxml from source using MinGW, but it was
> too complicated for my skills (if I have understood the procedure
> correctly, I must convert all DLLs in the required packages using a
> program called pexports).

It's tricky to compile extensions with MinGW32. You have to use an
unsupported version in order to get proper C++ support, and even that
isn't ABI compatible with MSVC++ (AFAIK).

> Do you know if it is possible to get a free version of Visual C++ 2008?
> Sandro

Here you are: http://www.microsoft.com/express/downloads/#2008-All

I highly recommend the DVD ISO. It's larger, but it will still work after
Microsoft has removed VS EE 2008 from its site. MS tends to remove old
Visual Studio stuff after the new version has been released. The Express
edition can't build X64 binaries, but it's possible to integrate the
64-bit compiler from the Windows 7 SDK. Please don't ask how. I haven't
tried it because I'm in the lucky position to have a free MSDN
membership for my Python open source work.
;) Christian From sidnei.da.silva at canonical.com Mon Jul 12 16:13:59 2010 From: sidnei.da.silva at canonical.com (Sidnei da Silva) Date: Mon, 12 Jul 2010 11:13:59 -0300 Subject: [lxml-dev] building lxml from source using Mingw32 In-Reply-To: <4C3A42A5.4020904@cheimes.de> References: <4C390252.2080105@gmail.com> <4C3A0090.3000808@gmail.com> <4C3A3F65.4050604@gmail.com> <4C3A42A5.4020904@cheimes.de> Message-ID: On Sun, Jul 11, 2010 at 7:16 PM, Christian Heimes wrote: > I hightly recommand the DVD ISO. It's larger but it will work after > Microsoft has removed VS EE 2008 from its site. MS tends to remove old > Visual Studio stuff after the new version has been released. The express > edition can't build X64 binaries but it's possible to integrate the > 64bit compiler from the Windows 7 SDK. Please don't ask how. I haven't > tried it because I'm in the lucky position to have a free MSDN > membership for my Python open source work. ;) It is actually possible to build everything with *only* the SDK and without VC installed at all. You just need to set the right environment variables so that distutils knows you're using the SDK and some more variables so that it knows where to find the SDK. 
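A minimal sketch of the kind of environment setup described here, for readers who cannot reach the linked script. It assumes the two switches that distutils' MSVC support in Python 2.6/2.7 looks for, DISTUTILS_USE_SDK and MSSdk, and it assumes the SDK's own command prompt has already put cl.exe on PATH and set INCLUDE and LIB. The helper name sdk_build_env is hypothetical and is not taken from Sidnei's script.

```python
import os


def sdk_build_env(base_env=None):
    """Return a copy of the environment with the distutils SDK switches set.

    Hypothetical helper: it only sets the two flags. Locating the SDK
    itself (PATH, INCLUDE, LIB) is assumed to have been done already,
    for example by the SDK command prompt.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask distutils to use the compiler already on PATH instead of
    # searching for a Visual Studio installation.
    env["DISTUTILS_USE_SDK"] = "1"
    # Normally set by the SDK shell; distutils only checks that it exists.
    env["MSSdk"] = "1"
    return env


if __name__ == "__main__":
    env = sdk_build_env()
    print(env["DISTUTILS_USE_SDK"], env["MSSdk"])
```

The returned mapping can be passed as the env argument of subprocess.call when running setup.py, so the current shell is left unchanged.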
This is what I'm using to setup my environment: http://bit.ly/9vKQDE And this is what I use to build the lxml binaries: http://bit.ly/c82me7 -- Sidnei From tseaver at palladion.com Mon Jul 12 17:01:41 2010 From: tseaver at palladion.com (Tres Seaver) Date: Mon, 12 Jul 2010 11:01:41 -0400 Subject: [lxml-dev] building lxml from source using Mingw32 In-Reply-To: References: <4C390252.2080105@gmail.com> <4C3A0090.3000808@gmail.com> Message-ID: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Christian Heimes wrote: >> Ok, I have installed MS Visual C++ and created the installer with it and >> now lxml works again on Python 2.7, the installer is here if somebody >> is interested: >> http://bit.ly/bwCHDw >> >> If somebody try to create it with MS Visual C++ 2010, first read this >> article: >> http://bit.ly/bR2ZOy >> >> variable to avoid errors. > > Please don't use anything else than Visual Studio 2008 for official > packages. You must not mix multiple CRTs since this can lead to errors > and segfaults. Data structures like *FILE, memory management over > malloc/free, errno and more are local to each CRT. The need for the > /MANIFEST option is a big warning sign for future trouble. In general > .pyd files shouldn't have an embedded manifest. FWIW, mingw's gcc uses the native CRT, unlike cygwin's gcc. Tres. 
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkw7LlUACgkQ+gerLs4ltQ7CfgCeNC7mkcOhp/bpXW9k1bqAJI6M MpAAoIm5Ic+NEgjT5Jxv/mvNCAMDWzDV =j5q9 -----END PGP SIGNATURE----- From gialloporpora at gmail.com Mon Jul 12 17:21:33 2010 From: gialloporpora at gmail.com (gialloporpora) Date: Mon, 12 Jul 2010 17:21:33 +0200 Subject: [lxml-dev] building lxml from source using Mingw32 In-Reply-To: References: <4C390252.2080105@gmail.com> <4C3A0090.3000808@gmail.com> <4C3A3F65.4050604@gmail.com> <4C3A42A5.4020904@cheimes.de> Message-ID: <4C3B32FD.1080601@gmail.com> Risposta al messaggio di Sidnei da Silva : > It is actually possible to build everything with*only* the SDK and > without VC installed at all. You just need to set the right > environment variables so that distutils knows you're using the SDK and > some more variables so that it knows where to find the SDK. > > This is what I'm using to setup my environment: > > http://bit.ly/9vKQDE > > And this is what I use to build the lxml binaries: > > http://bit.ly/c82me7 > > -- Sidnei Thanks Sidnei for your suggestions, but I have already followed the suggestion of Christian and I have used the MS Visual C++ 2008 to build lxml from source and with works nicely :-) without making editing to distutils files. 
I have updated the file linked in my first message, the executable has been created with Visual C++ 2008 and these libs packages: iconv-1.9.2.win32.zip libxml2-2.7.6.win32.zip libxslt-1.1.26.win32.zip zlib-1.2.3.win32.zip and lxml 2.6.6 using the command: setup.py bdist_wininst --static Thanks again for your help and suggestions Sandro File: http://bit.ly/bwCHDw MD5: fb225ef59f43f3f4e14b5653c90fcd9c -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 552 bytes Desc: OpenPGP digital signature Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100712/3019f976/attachment.pgp From jaraco at jaraco.com Tue Jul 13 10:22:54 2010 From: jaraco at jaraco.com (Jason R. Coombs) Date: Tue, 13 Jul 2010 01:22:54 -0700 Subject: [lxml-dev] LXML Windows builds for Python 2.7 Message-ID: <12C7AB425F0DD546B6049311F827C74E08A2E65269@VA3DIAXVS141.RED001.local> No sooner does Sidnei get the builds out for Python 2.6 (thanks!) as I need a build for Python 2.7. I was able to build lxml for Python 2.7 on a 32-bit platform using this script: http://dl.dropbox.com/u/54081/linked/lxml-build.py Comments are welcome. The script also has support for building on a 64-bit platform using binaries I found built for PHP, though I haven't yet got that to compile (it seems they've used optimizations for VS2008 SP1, so I'm applying that SP now). I'm pleased to report that thusfar, the tentative builds for Python 2.6 are working for me. Please consider also building against Python 2.7 when releasing to PyPI. Thanks, Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100713/69b2f64e/attachment-0001.htm -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 6460 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100713/69b2f64e/attachment-0001.bin

From masetto4ever at gmail.com Fri Jul 16 14:07:46 2010
From: masetto4ever at gmail.com (masetto)
Date: Fri, 16 Jul 2010 14:07:46 +0200
Subject: [lxml-dev] Invalid Schematron
Message-ID:

Hi all,

I'm trying to validate an OVAL (http://oval.mitre.org) XML document against
its Schematron rules (
http://oval.mitre.org/language/version5.7/ovaldefinition/schematron/oval-definitions-schematron.sch
)

Following the manual (http://codespeak.net/lxml/validation.html), I wrote
the following piece of code:

    from lxml import etree

    rule = open("oval-definitions-schematron.sch")
    defs = open("oval.xml")

    sct_doc = etree.parse(rule)
    try:
        schematron = etree.Schematron(sct_doc)
    except etree.SchematronParseError, e:
        print e.args
        print e.error_log
        print e.message

    doc = etree.parse(defs)
    print schematron.validate(doc)

but I got the following error(s):

Document is not a valid Schematron schema

oval-definitions-schematron.sch:34:0:ERROR:SCHEMASP:SCHEMAP_NOROOT:
Expecting a pattern element instead of phase
(repeated N times)
...
oval-definitions-schematron.sch:1547:0:ERROR:SCHEMASP:SCHEMAP_NOROOT:
Failed to compile context expression
oval-def:objects/*/*[@datatype='binary']|oval-def:states/*/*[@datatype='binary']|oval-def:states/*/*
(repeated N times)

$ dpkg -l | grep libxml2
ii  libxml-libxml-perl  1.70.ds-1            Perl interface to the libxml2 library
ii  libxml2             2.7.6.dfsg-1ubuntu1  GNOME XML library
ii  libxml2-dev         2.7.6.dfsg-1ubuntu1  Development files for the GNOME XML library
ii  libxml2-utils       2.7.6.dfsg-1ubuntu1  XML utilities
ii  python-libxml2      2.7.6.dfsg-1ubuntu1  Python bindings for the GNOME XML library

on Ubuntu 10.04

Can you help me?

Thanks
---
Masetto
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100716/cfa3421b/attachment.htm

From frank at chagford.com Sat Jul 17 13:08:01 2010
From: frank at chagford.com (Frank Millman)
Date: Sat, 17 Jul 2010 13:08:01 +0200
Subject: [lxml-dev] Quick question about objectify
Message-ID: <20100717110808.876061F06@rrba-ip-smtp-5-1.saix.net>

Hi all

Sometimes an element may or may not exist in a document. If it exists, I
need to perform some processing on it.

Using etree and xpath, the following is convenient -

    for obj in root.xpath('bp:object', namespaces=nsmap):
        [do something]

If there are no 'object's, an empty list is returned, so it works.

I tried the same with objectify -

    for obj in root.object:
        [do something]

If there are no 'object's, AttributeError is raised.

This means that I must either precede this with 'if hasattr(...)', or wrap
it in a try/except. I have to do this quite a lot, so the code does not flow
as nicely.

Is there any way to use objectify and retain the convenience of xpath?

Thanks

Frank Millman

From stefan_ml at behnel.de Sat Jul 17 15:55:32 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 17 Jul 2010 15:55:32 +0200
Subject: [lxml-dev] Quick question about objectify
In-Reply-To: <20100717110808.876061F06@rrba-ip-smtp-5-1.saix.net>
References: <20100717110808.876061F06@rrba-ip-smtp-5-1.saix.net>
Message-ID: <4C41B654.40707@behnel.de>

Frank Millman, 17.07.2010 13:08:
> Sometimes an element may or may not exist in a document. If it exists, I
> need to perform some processing on it.
>
> Using etree and xpath, the following is convenient -
>
>     for obj in root.xpath('bp:object', namespaces=nsmap):
>         [do something]
>
> If there are no 'object's, an empty list is returned, so it works.
>
> I tried the same with objectify -
>
>     for obj in root.object:
>         [do something]
>
> If there are no 'object's, AttributeError is raised.
>
> This means that I must either precede this with 'if hasattr(...)', or wrap
> it in a try/except.
> I have to do this quite a lot, so the code does not flow
> as nicely.
>
> Is there any way to use objectify and retain the convenience of xpath?

getattr?

    for obj in getattr(root, 'object', ()):
        ...

Stefan

From sylvain.duvillard at stericsson.com Mon Jul 19 18:03:07 2010
From: sylvain.duvillard at stericsson.com (Sylvain DUVILLARD)
Date: Mon, 19 Jul 2010 18:03:07 +0200
Subject: [lxml-dev] lxml.etree.XMLSyntaxError reported on a valid xml
Message-ID: <2AC7D4AD8BA1C640B4C60C61C8E5201539DECDA25D@EXDCVYMBSTM006.EQ1STM.local>

Hi!

I get the following error when parsing an XML file with objectify on this
call:

    objectify.fromstring(etree.tostring(raw_doc), parser)

...
  File "lxml.objectify.pyx", line 1826, in lxml.objectify.fromstring (src/lxml/lxml.objectify.c:18625)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48634)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71106)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67875)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
lxml.etree.XMLSyntaxError: Element '{http://www.spiritconsortium.org/XMLSchema/SPIRIT/1.4}remapPort': No match found for key-sequence ['phantom_remap'] of keyref '{http://www.spiritconsortium.org/XMLSchema/SPIRIT/1.4}remapStatePortRef'.
The parser was made using a schema:

    component_schema = etree.XMLSchema(file='component.xsd')
    component_parser = objectify.makeparser(schema = component_schema, remove_blank_text = True)

Now, when I use xmllint stand-alone like this, it works:

    xmllint --noout --schema component.xsd Vendor_Library_adec_1.0.xml
    Vendor_Library_adec_1.0.xml validates

I'm using the latest version of all released modules:

>>> lxml.etree: (2, 2, 6, 0)
>>> libxml used: (2, 7, 7)
>>> libxml compiled: (2, 7, 7)
>>> libxslt used: (1, 1, 26)
>>> libxslt compiled: (1, 1, 26)

Should I enter a bug, or am I doing something obviously wrong here?
I could also validate the file against the schema using xmlspy.

Thanks in advance

Sylvain
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100719/0760ad38/attachment.htm

From arfrever.fta at gmail.com Tue Jul 20 03:02:38 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Tue, 20 Jul 2010 03:02:38 +0200
Subject: [lxml-dev] SyntaxErrors with Python 3
Message-ID: <201007200303.15139.Arfrever.FTA@gmail.com>

LXML r76211 generally supports Python 3, but there are still some
SyntaxErrors. (The problem with ez_setup.py can be ignored.)

$ python3.1 -m compileall -q .
*** Error compiling ./benchmark/bench_etree.py ...
  File "./benchmark/bench_etree.py", line 10
    UTEXT = u"some klingon: \F8D2"
    ^
SyntaxError: invalid syntax

*** Error compiling ./benchmark/benchbase.py ...
  File "./benchmark/benchbase.py", line 10
    _UTEXT = u"some klingon: \F8D2" * TREE_FACTOR
    ^
SyntaxError: invalid syntax

*** Error compiling ./doc/mklatex.py ...
  File "./doc/mklatex.py", line 247
    print "Creating %s" % outname
    ^
SyntaxError: invalid syntax

*** Error compiling ./doc/rest2html.py ...
  File "./doc/rest2html.py", line 41
    except ValueError, e:
    ^
SyntaxError: invalid syntax

*** Error compiling ./doc/rest2latex.py ...
  File "./doc/rest2latex.py", line 44
    except ValueError, e:
    ^
SyntaxError: invalid syntax

*** Error compiling ./doc/s5/rst2s5.py ...
  File "./doc/s5/rst2s5.py", line 82
    parsed = highlight(u'\n'.join(self.content), lexer, formatter)
    ^
SyntaxError: invalid syntax

*** Error compiling ./ez_setup.py ...
  File "./ez_setup.py", line 106
    except pkg_resources.VersionConflict, e:
    ^
SyntaxError: invalid syntax

*** Error compiling ./src/local_doctest.py ...
  File "./src/local_doctest.py", line 372
    raise TypeError, 'Expected a module: %r' % module
    ^
SyntaxError: invalid syntax

*** Error compiling ./src/lxml/html/_diffcommand.py ...
  File "./src/lxml/html/_diffcommand.py", line 85
    print "Not yet implemented"
    ^
SyntaxError: invalid syntax

*** Error compiling ./src/lxml/html/tests/transform_feedparser_data.py ...
  File "./src/lxml/html/tests/transform_feedparser_data.py", line 91
    print 'Bad data in %s:' % filename
    ^
SyntaxError: invalid syntax

*** Error compiling ./src/lxml/tests/test_errors.py ...
  File "./src/lxml/tests/test_errors.py", line 1
    ?# -*- coding: utf-8 -*-
    ^
SyntaxError: invalid character in identifier

*** Error compiling ./tools/xpathgrep.py ...
  File "./tools/xpathgrep.py", line 5
    except ImportError, e:
    ^
SyntaxError: invalid syntax

--
Arfrever Frehtes Taifersar Arahesis
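Most of the failures in the report above come from a few Python 2-only spellings. The following is a small sketch, using only the built-in compile(), that checks two of them against forms that both Python 2.6+ and Python 3 accept. The snippet strings are modeled on the report; the helper name compiles and the lists are illustrations and are not part of lxml.

```python
# Pairs of (Python 2-only spelling, spelling accepted by 2.6+ and 3.x).
SNIPPETS = [
    ('print "Creating %s" % outname',
     'print("Creating %s" % outname)'),
    ("raise TypeError, 'Expected a module'",
     "raise TypeError('Expected a module')"),
]

# The "as" form of except, also accepted by Python 2.6+ and 3.x.
PORTABLE = [
    "try:\n    pass\nexcept ValueError as e:\n    pass\n",
]


def compiles(source):
    """Return True if `source` parses under the running interpreter."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True


# Only parsing is checked; nothing in the snippets is executed.
results = [(compiles(old), compiles(new)) for old, new in SNIPPETS]

if __name__ == "__main__":
    for (old, new), (old_ok, new_ok) in zip(SNIPPETS, results):
        print("%-40r %-5s -> %-40r %s" % (old, old_ok, new, new_ok))
```

Under Python 3, each old form is rejected and each new form is accepted. The u"..." literals in the report are a separate case: Python 3.0 to 3.2 reject them, while Python 3.3 and later accept them again.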
From stefan_ml at behnel.de Tue Jul 20 09:42:56 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 20 Jul 2010 09:42:56 +0200
Subject: [lxml-dev] SyntaxErrors with Python 3
In-Reply-To: <201007200303.15139.Arfrever.FTA@gmail.com>
References: <201007200303.15139.Arfrever.FTA@gmail.com>
Message-ID: <4C455380.3020905@behnel.de>

Arfrever Frehtes Taifersar Arahesis, 20.07.2010 03:02:
> LXML r76211 generally supports Python 3, but there are still some SyntaxErrors.
> [snip]

Thanks. Only 2 or 3 of those are relevant to Py3, but I'll see if I can fix
them. A patch could easily speed this up, BTW.

Stefan

From frank at chagford.com Wed Jul 21 15:35:06 2010
From: frank at chagford.com (Frank Millman)
Date: Wed, 21 Jul 2010 15:35:06 +0200
Subject: [lxml-dev] Xpath function converts int to float
Message-ID: <20100721133508.42EBE15ED@rrba-ip-smtp-4-1.saix.net>

Hi all

As the subject line states, an xpath function that should return an int
actually returns a float.

Here is a simple example -

---------------------
from lxml import etree

obj = {'one':1, 'two':2}

def get_obj(context, param):
    return obj[param]

ns = etree.FunctionNamespace(None)
ns['get_obj'] = get_obj

xml = "get_obj('one')"

elem = etree.fromstring(xml)
ob1 = elem.find('obj')
ob2 = ob1.xpath(ob1.text)
print ob2
---------------------

This prints 1.0 .

Is this a bug? Is there a workaround?

Thanks

Frank Millman

From stefan_ml at behnel.de Wed Jul 21 15:43:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 21 Jul 2010 15:43:33 +0200
Subject: [lxml-dev] Xpath function converts int to float
In-Reply-To: <20100721133508.42EBE15ED@rrba-ip-smtp-4-1.saix.net>
References: <20100721133508.42EBE15ED@rrba-ip-smtp-4-1.saix.net>
Message-ID: <4C46F985.2090903@behnel.de>

Frank Millman, 21.07.2010 15:35:
> As the subject line states, an xpath function that should return an int
> actually returns a float.
>
> Here is a simple example -
>
> ---------------------
> from lxml import etree
>
> obj = {'one':1, 'two':2}
>
> def get_obj(context, param):
>     return obj[param]
>
> ns = etree.FunctionNamespace(None)
> ns['get_obj'] = get_obj
>
> xml = "get_obj('one')"
>
> elem = etree.fromstring(xml)
> ob1 = elem.find('obj')
> ob2 = ob1.xpath(ob1.text)
> print ob2
> ---------------------
>
> This prints 1.0 .
>
> Is this a bug?

No, that's how XPath 1.0 works. Check the spec.

OTOH, lxml.etree could try to be smart and return an integer if the
truncated result equals the float result. Not sure if that is the expected
behaviour.

Stefan

From frank at chagford.com Wed Jul 21 16:34:42 2010
From: frank at chagford.com (Frank Millman)
Date: Wed, 21 Jul 2010 16:34:42 +0200
Subject: [lxml-dev] Xpath function converts int to float
In-Reply-To: <4C46F985.2090903@behnel.de>
Message-ID: <20100721143444.3B64E33D3@rrba-ip-smtp-2-4.saix.net>

Stefan Behnel wrote:
>
> Frank Millman, 21.07.2010 15:35:
> > As the subject line states, an xpath function that should
> > return an int actually returns a float.
> > [...]
> >
> > Is this a bug?
>
> No, that's how XPath 1.0 works. Check the spec.
>
> OTOH, lxml.etree could try to be smart and return an integer if the
> truncated result equals the float result. Not sure if that is
> the expected behaviour.
>

If that is the spec, I don't think we should tamper with it. I can convert
it back to an int myself.

Thanks, Stefan

Frank

From stefan_ml at behnel.de Sat Jul 24 22:39:02 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 24 Jul 2010 22:39:02 +0200
Subject: [lxml-dev] lxml 2.2.7 and 2.3alpha2 released
Message-ID: <4C4B4F66.4000404@behnel.de>

Hi all,

I just pushed lxml 2.2.7 and 2.3alpha2 to PyPI. They are bug-fix-only
releases that fix a crash showing mostly in XSLT.

http://pypi.python.org/pypi/lxml/2.2.7
http://codespeak.net/lxml/

http://pypi.python.org/pypi/lxml/2.3alpha2
http://codespeak.net/lxml/dev/

The 2.2 release series continues to be built with Cython 0.11.3, whereas
the 2.3 release was built with the close-to-final Cython 0.13beta0.

Have fun,

Stefan


2.2.7 (2010-07-24)
==================

Bugs fixed
----------

* Crash in XSLT when generating text-only result documents with a
  stylesheet created in a different thread.


2.3alpha2 (2010-07-24)
======================

Features added
--------------

Bugs fixed
----------

* Crash in XSLT when generating text-only result documents with a
  stylesheet created in a different thread.

Other changes
--------------

* ``repr()`` of Element objects shows the hex ID with leading 0x
  (following ElementTree 1.3).

From arfrever.fta at gmail.com Sun Jul 25 01:56:19 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Sun, 25 Jul 2010 01:56:19 +0200
Subject: [lxml-dev] lxml 2.2.7 and 2.3alpha2 released
In-Reply-To: <4C4B4F66.4000404@behnel.de>
References: <4C4B4F66.4000404@behnel.de>
Message-ID: <201007250157.33669.Arfrever.FTA@gmail.com>

2010-07-24 22:39:02 Stefan Behnel napisał(a):
> http://pypi.python.org/pypi/lxml/2.3alpha2
> http://codespeak.net/lxml/dev/

http://codespeak.net/lxml/dev/ contains URL http://codespeak.net/lxml/dev/lxml-2.3alpha2.tgz,
but http://codespeak.net/lxml/lxml-2.3alpha2.tgz (without "dev/") is the only working URL of
this tarball.

--
Arfrever Frehtes Taifersar Arahesis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100725/41bc789c/attachment.pgp

From arfrever.fta at gmail.com Sun Jul 25 02:37:40 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Sun, 25 Jul 2010 02:37:40 +0200
Subject: [lxml-dev] lxml 2.2.7 and 2.3alpha2 released
In-Reply-To: <4C4B4F66.4000404@behnel.de>
References: <4C4B4F66.4000404@behnel.de>
Message-ID: <201007250237.41724.Arfrever.FTA@gmail.com>

I ran test.py, selftest.py and selftest2.py (with appropriate PYTHONPATH).
All tests of lxml 2.2.7 passed with Python 2.6 and 2.7.

Tests from test.py and selftest2.py of lxml 2.3alpha2 passed with Python 2.6 and 2.7.
1 test from selftest.py of lxml 2.3alpha2 failed with Python 2.6 and 2.7.
The output of selftest.py was:

**********************************************************************
File "/var/tmp/portage/dev-python/lxml-2.3_alpha2/work/lxml-2.3alpha2/selftest.py", line 298, in selftest.bad_find
Failed example:
    elem.findall("section//")
Expected:
    Traceback (most recent call last):
    SyntaxError: invalid path
Got:
    []
**********************************************************************
1 items had failures:
   1 of   3 in selftest.bad_find
***Test Failed*** 1 failures.
184 tests ok.

--
Arfrever Frehtes Taifersar Arahesis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100725/d52f0a45/attachment.pgp

From stefan_ml at behnel.de Sun Jul 25 09:10:18 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 25 Jul 2010 09:10:18 +0200
Subject: [lxml-dev] lxml 2.2.7 and 2.3alpha2 released
In-Reply-To: <201007250157.33669.Arfrever.FTA@gmail.com>
References: <4C4B4F66.4000404@behnel.de> <201007250157.33669.Arfrever.FTA@gmail.com>
Message-ID: <4C4BE35A.1090404@behnel.de>

Arfrever Frehtes Taifersar Arahesis, 25.07.2010 01:56:
> 2010-07-24 22:39:02 Stefan Behnel napisał(a):
>> http://pypi.python.org/pypi/lxml/2.3alpha2
>> http://codespeak.net/lxml/dev/
>
> http://codespeak.net/lxml/dev/ contains URL http://codespeak.net/lxml/dev/lxml-2.3alpha2.tgz,
> but http://codespeak.net/lxml/lxml-2.3alpha2.tgz (without "dev/") is the only working URL of
> this tarball.

Thanks, fixed.

Stefan

From stefan_ml at behnel.de Sun Jul 25 09:13:47 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 25 Jul 2010 09:13:47 +0200
Subject: [lxml-dev] lxml 2.2.7 and 2.3alpha2 released
In-Reply-To: <201007250237.41724.Arfrever.FTA@gmail.com>
References: <4C4B4F66.4000404@behnel.de> <201007250237.41724.Arfrever.FTA@gmail.com>
Message-ID: <4C4BE42B.3030109@behnel.de>

Arfrever Frehtes Taifersar Arahesis, 25.07.2010 02:37:
> I ran test.py, selftest.py and selftest2.py (with appropriate PYTHONPATH).
> All tests of lxml 2.2.7 passed with Python 2.6 and 2.7.
>
> Tests from test.py and selftest2.py of lxml 2.3alpha2 passed with Python 2.6 and 2.7.
> 1 test from selftest.py of lxml 2.3alpha2 failed with Python 2.6 and 2.7.
> The output of selftest.py was:
>
> **********************************************************************
> File "/var/tmp/portage/dev-python/lxml-2.3_alpha2/work/lxml-2.3alpha2/selftest.py", line 298, in selftest.bad_find
> Failed example:
>     elem.findall("section//")
> Expected:
>     Traceback (most recent call last):
>     SyntaxError: invalid path
> Got:
>     []
> **********************************************************************
> 1 items had failures:
>    1 of   3 in selftest.bad_find
> ***Test Failed*** 1 failures.
> 184 tests ok.

Right, I noticed that, too. This is due to the updated ElementPath
implementation, i.e. it's actually a bug in ElementTree 1.3.

Stefan

From arfrever.fta at gmail.com Sun Jul 25 17:14:53 2010
From: arfrever.fta at gmail.com (Arfrever Frehtes Taifersar Arahesis)
Date: Sun, 25 Jul 2010 17:14:53 +0200
Subject: [lxml-dev] SyntaxErrors with Python 3
In-Reply-To: <4C455380.3020905@behnel.de>
References: <201007200303.15139.Arfrever.FTA@gmail.com> <4C455380.3020905@behnel.de>
Message-ID: <201007251715.35395.Arfrever.FTA@gmail.com>

2010-07-20 09:42:56 Stefan Behnel napisał(a):
> Arfrever Frehtes Taifersar Arahesis, 20.07.2010 03:02:
> > LXML r76211 generally supports Python 3, but there are still some SyntaxErrors.
> > [snip]
>
> Thanks. Only 2 or 3 of those are relevant to Py3, but I'll see if I can fix
> them. A patch could easily speed this up, BTW.

I'm attaching the partial patch.

--
Arfrever Frehtes Taifersar Arahesis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lxml-syntax_errors.patch
Type: text/x-patch
Size: 8191 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100725/d8ec6b91/attachment-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part.
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100725/d8ec6b91/attachment-0001.pgp

From public at codethief.eu Mon Jul 26 15:09:19 2010
From: public at codethief.eu (codethief)
Date: Mon, 26 Jul 2010 15:09:19 +0200
Subject: [lxml-dev] No error position when validating against XMLSchema
In-Reply-To: <4C348E49.3090006@rdprojekt.pl>
References: <4C348E49.3090006@rdprojekt.pl>
Message-ID:

I would be interested in a solution to that issue, too.

On Wed, Jul 7, 2010 at 16:25, Adam Bielański wrote:
> Hello,
>
> I'm calling creating object with
> "etree.iterparse(open('NotValidFile.xml'), schema=schema)" and iterating
> over it raises etree.XMLSyntaxError, as expected.
>
> Unfortunately, value of 'position' attribute of that error is (0,0).
> Value of 'offset' attribute is None. Is there a way to get at least the
> line number of the offending tag in my XML file?
>
> When etree.XMLSyntaxError is raised in etree.XML(), it contains position
> info pointing precisely to the place where error occured. Is it possible
> also when using iterparse() ?
>
> Regards,
>     Adam Bielański.
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>

--
Simon Hirscher
http://simonhirscher.de

From herve.cauwelier at free.fr Mon Jul 26 17:21:40 2010
From: herve.cauwelier at free.fr (=?ISO-8859-1?Q?Herv=E9_Cauwelier?=)
Date: Mon, 26 Jul 2010 17:21:40 +0200
Subject: [lxml-dev] text nodes... again
Message-ID: <4C4DA804.1000602@free.fr>

Hi, I'm having difficulties working on the kind of XML used by OpenDocument.

Consider that example from the tutorial:

<html><body>Hello<br/>World</body></html>

An application on top of OpenDocument would be to "strip" <br/> from a
document.

Using "Element.remove()", I lose "World".

What pattern of code would you recommend to merge "Hello" and "World"?
Yes, it would be glued like "HelloWorld", don't worry.

The FAQ[1] mentions the issue: "A good way to deal with this is to use
helper functions that copy the Element without its tail." In my case, it
would be "remove the Element without its tail".

Are these helper functions shipped with lxml? Is there some cookbook online?

Thanks in advance,

Hervé Cauwelier

[1] http://codespeak.net/lxml/FAQ.html#what-about-that-trailing-text-on-serialised-elements

From sergio at sergiomb.no-ip.org Mon Jul 26 19:00:13 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Mon, 26 Jul 2010 18:00:13 +0100
Subject: [lxml-dev] text nodes... again
In-Reply-To: <4C4DA804.1000602@free.fr>
References: <4C4DA804.1000602@free.fr>
Message-ID: <1280163613.31502.23.camel@segulix>

On Mon, 2010-07-26 at 17:21 +0200, Hervé Cauwelier wrote:
> Hi, I'm having difficulties working on the kind of XML used by OpenDocument.
>
> Consider that example from the tutorial:
>
> <html><body>Hello<br/>World</body></html>
>
> An application on top of OpenDocument would be to "strip" <br/> from a
> document.
>
> Using "Element.remove()", I lose "World".
>
> What pattern of code would you recommend to merge "Hello" and "World"?
> Yes, it would be glued like "HelloWorld", don't worry.

> Are these helper functions shipped with lxml? Is there some cookbook online?

yes and yes,

from lxml import html
html.fromstring(f)
delelems = frags.xpath(delxpath)
for delnode in delelems:
    delnode.drop_tree()

function drop_tree() will do it for you (is what I use now)

before that I had use:

import lxml, lxml.html
etree_document = etree.HTML(f)
delelems = frags.xpath(delxpath)
(...)
for delnode in delelems:
    parent = delnode.getparent()
    if delnode.tail:
        prevnode = delnode.getprevious()
        if prevnode is not None:
            if prevnode.tail:
                prevnode.tail += delnode.tail
            else:
                prevnode.tail = delnode.tail
        elif parent.text:
            parent.text += delnode.tail
        else:
            parent.text = delnode.tail
    parent.remove(delnode)

--
Sérgio M. B.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 3293 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20100726/d2ac8a3e/attachment.bin

From p.oberndoerfer at urheberrecht.org Tue Jul 27 00:00:03 2010
From: p.oberndoerfer at urheberrecht.org (=?iso-8859-1?Q?=22Pascal_Obernd=F6rfer=22?=)
Date: Tue, 27 Jul 2010 00:00:03 +0200
Subject: [lxml-dev] lxml 2.2.7 eggs for MacOS X 10.4 (and up?)
Message-ID: <796a05af45ad7b4db1b8fafe1b4e7927.squirrel@mail.urheberrecht.org>

Dear all,

Some lxml 2.2.7 eggs for Mac OS X have been added here:

Regards,
Pascal

From herve.cauwelier at free.fr Wed Jul 28 19:04:48 2010
From: herve.cauwelier at free.fr (=?UTF-8?B?SGVydsOpIENhdXdlbGllcg==?=)
Date: Wed, 28 Jul 2010 19:04:48 +0200
Subject: [lxml-dev] text nodes... again
In-Reply-To: <1280163613.31502.23.camel@segulix>
References: <4C4DA804.1000602@free.fr> <1280163613.31502.23.camel@segulix>
Message-ID: <4C506330.7070707@free.fr>

On 07/26/10 19:00, Sergio Monteiro Basto wrote:
> On Mon, 2010-07-26 at 17:21 +0200, Hervé Cauwelier wrote:
>> Are these helper functions shipped with lxml? Is there some cookbook online?
> yes and yes,
>
> from lxml import html
> html.fromstring(f)
> delelems = frags.xpath(delxpath)
> for delnode in delelems:
>     delnode.drop_tree()
>
> function drop_tree() will do it for you (is what I use now)

Thanks for the notice. I also found "drop_tag" to be useful.

Unfortunately, these are only available in HTML elements, not through
the basic etree.

I used HTML as an example, but even though OpenDocument was quite
influenced by HTML at first, it's advanced XML. An accurate example would
look like:

HelloWorld

> before that I had use:
>
> import lxml, lxml.html
> etree_document = etree.HTML(f)
> delelems = frags.xpath(delxpath)
> (...)
> for delnode in delelems:
>     parent = delnode.getparent()
>     if delnode.tail:
>         prevnode = delnode.getprevious()
>         if prevnode is not None:
>             if prevnode.tail:
>                 prevnode.tail += delnode.tail
>             else:
>                 prevnode.tail = delnode.tail
>         elif parent.text:
>             parent.text += delnode.tail
>         else:
>             parent.text = delnode.tail
>     parent.remove(delnode)
>

Writing this function in my project is another possibility. I'll also
look at the source code in case these functions are written in Python.

Thanks,

Hervé

From burak.arslan at arskom.com.tr Thu Jul 29 10:57:32 2010
From: burak.arslan at arskom.com.tr (Burak Arslan)
Date: Thu, 29 Jul 2010 11:57:32 +0300
Subject: [lxml-dev] decoding unicode strings
Message-ID: <4C51427C.7060206@arskom.com.tr>

hi,

the lxml.etree.XMLID function does not accept unicode strings when the
xml declaration tag is present at the beginning of the xml document.
however, not all soap clients send the xml declaration, so sometimes i
must rely on information in http headers to decode the string.

my solution was this:

try:
    root, xmlids = etree.XMLID(xml_string.decode(http_charset))
except ValueError,e:
    logger.debug('%s -- falling back to str decoding.' % (e))
    root, xmlids = etree.XMLID(xml_string)

is this the proper way to check whether an xml document candidate has an
xml declaration at the beginning?

thanks,
burak

From donn.ingle at gmail.com Thu Jul 29 16:34:30 2010
From: donn.ingle at gmail.com (donn)
Date: Thu, 29 Jul 2010 16:34:30 +0200
Subject: [lxml-dev] Virtual elements (python)
Message-ID: <4C519176.7070203@gmail.com>

Hi,
I am hacking slowly on an SVG editor in Python and I am using lxml which
is making it all possible.

I would like to use many svg trees at once, but not in one big tree;
rather in a 'forest' with each one rooted alone. Elements in one can
refer to elements in another, by way of the <use> tag.

Is there any way to make a 'virtual' element, such that it forwards its
children on-to another element (in any other Tree)? *

I mean something like this:
Root1 has element W with children: e1 (mapped-to:Root2/e2), e5, bob, sally
Root2 has element e2 (id="e2_id")
e2 has children: c2, c3

Thus:
>>> list(e1)
[c2,c3]
>>> e1.attrib.get('id')
'e2_id'
>>> list(W)
[e1,e5,bob,sally]
>>> e1.getparent() #it 'knows' its actual Tree
>>> for elem in etree.iterwalk(Root1): print elem
W...e1...[now into Root2] c2,c3... [back into Root1] e5, etc.
>>> etree.tostring(W)
""

To top it off, can xpaths be used across such a system? So I'd say
'gimme all the elements with attrib=xyz in Root1' and it will *also* skip
along into Root2 by way of that virtual element?

\d

(* I guess I would have to walk the trees and identify the elements that
are tag=="use" and then replace them with this fancy element.)
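
[Editorial sketch, not part of the original thread.] The idea in donn's footnote (walk the trees and give "use" elements special treatment) can be tried out with plain functions before reaching for custom element classes. The snippet below is a minimal sketch under stated assumptions: the `forest` dictionary, the `resolve_use()` and `walk_forest()` helpers, and the "docname#id" form of `xlink:href` are all hypothetical and chosen only for illustration. It does not make XPath cross tree boundaries; it only shows a traversal that does.

```python
from lxml import etree

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"
USE_TAG = "{%s}use" % SVG_NS
HREF_ATTR = "{%s}href" % XLINK_NS


def resolve_use(use_elem, forest):
    """Return the element a <use> points at, or None (hypothetical helper).

    Assumes hrefs of the form "docname#id", where docname is a key of the
    forest dictionary. This naming scheme is an assumption of the sketch.
    """
    href = use_elem.get(HREF_ATTR, "")
    doc_name, sep, target_id = href.partition("#")
    root = forest.get(doc_name)
    if not sep or root is None:
        return None
    hits = root.xpath("//*[@id=$target]", target=target_id)
    return hits[0] if hits else None


def walk_forest(elem, forest, _active=frozenset()):
    """Yield elements in document order, following <use> into other trees."""
    yield elem
    if elem.tag == USE_TAG:
        href = elem.get(HREF_ATTR)
        target = resolve_use(elem, forest)
        # _active holds the hrefs currently being expanded, to stop cycles.
        if target is not None and href not in _active:
            yield from walk_forest(target, forest, _active | {href})
        return
    for child in elem:
        if isinstance(child.tag, str):  # skip comments and processing instructions
            yield from walk_forest(child, forest, _active)


ROOT1 = etree.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<g id="W"><use id="e1" xlink:href="root2#e2_id"/><rect id="e5"/></g>'
    '</svg>'
)
ROOT2 = etree.fromstring(
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g id="e2_id"><circle id="c2"/><circle id="c3"/></g>'
    '</svg>'
)
FOREST = {"root1": ROOT1, "root2": ROOT2}

visited = [e.get("id") for e in walk_forest(ROOT1, FOREST)]
print(visited)
```

Keeping the cross-tree logic in a generator leaves every tree untouched, so serialising W still gives the plain "use" element rather than the referenced content.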

From jholg at gmx.de Thu Jul 29 17:06:26 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 29 Jul 2010 17:06:26 +0200
Subject: [lxml-dev] Virtual elements (python)
In-Reply-To: <4C519176.7070203@gmail.com>
References: <4C519176.7070203@gmail.com>
Message-ID: <20100729150626.154470@gmx.net>

Hi,

> I would like to use many svg trees at once, but not in one big tree;
> rather in a 'forest' with each one rooted alone. Elements in one can
> refer to elements in another, by way of the <use> tag.
>
> Is there any way to make a 'virtual' element, such that it forwards its
> children on-to another element (in any other Tree)? *
>

You might be able to achieve what you want using custom element classes:
http://codespeak.net/lxml/element_classes.html

Holger

From donn.ingle at gmail.com Thu Jul 29 17:39:30 2010
From: donn.ingle at gmail.com (donn)
Date: Thu, 29 Jul 2010 17:39:30 +0200
Subject: [lxml-dev] Virtual elements (python)
In-Reply-To: <20100729150626.154470@gmx.net>
References: <4C519176.7070203@gmail.com> <20100729150626.154470@gmx.net>
Message-ID: <4C51A0B2.6000804@gmail.com>

On 29/07/2010 17:06, jholg at gmx.de wrote:
> You might be able to achieve what you want using custom element classes: http://codespeak.net/lxml/element_classes.html

I suspected as much, and I can do with some help. I am a little lost as
to how to begin.

\d

From sakshichawla12354 at gmail.com Thu Jul 29 18:58:25 2010
From: sakshichawla12354 at gmail.com (sakshi chawla)
Date: Thu, 29 Jul 2010 22:28:25 +0530
Subject: [lxml-dev] reading xml document
Message-ID:

hi
how can I read the attributes from xml file?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100729/79ef8eb1/attachment.htm

From d.rothe at semantics.de Thu Jul 29 22:27:18 2010
From: d.rothe at semantics.de (Dirk Rothe)
Date: Thu, 29 Jul 2010 22:27:18 +0200
Subject: [lxml-dev] reading xml document
In-Reply-To:
References:
Message-ID:

On Thu, 29 Jul 2010 18:58:25 +0200, sakshi chawla wrote:

> hi
> how can I read the attributes from xml file?

hi,

better explain, what "the attributes" are.

From spuzhava at purdue.edu Fri Jul 30 00:24:08 2010
From: spuzhava at purdue.edu (spuzhava at purdue.edu)
Date: Thu, 29 Jul 2010 18:24:08 -0400
Subject: [lxml-dev] reading xml document
In-Reply-To:
References:
Message-ID: <20100729182408.91116and866y2r48@boilermail.purdue.edu>

If you are referring to the attributes of an Element in the XML - I think
you should be using the get() method of the Element class. More details
on the API of the Element class can be found here
http://codespeak.net/lxml/api/lxml.etree._Element-class.html

~Shankar.

Quoting Dirk Rothe :

> On Thu, 29 Jul 2010 18:58:25 +0200, sakshi chawla
> wrote:
>
>> hi
>> how can I read the attributes from xml file?
>
> hi,
>
> better explain, what "the attributes" are.
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>
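
[Editorial sketch, not part of the original thread.] Shankar's pointer to get() can be illustrated with a few lines. The sample document and its attribute names below are made up for the example; only `get()`, `attrib` and an attribute XPath are used, which is probably the smallest set that covers the common cases.

```python
from lxml import etree

# Made-up sample document; in practice the root would more likely come
# from etree.parse("some_file.xml").getroot().
xml = b'<book id="42" lang="en"><title>Example</title></book>'
root = etree.fromstring(xml)

book_id = root.get("id")                # value of a single attribute
missing = root.get("isbn")              # None when the attribute is absent
fallback = root.get("isbn", "unknown")  # explicit default instead of None
all_attrs = dict(root.attrib)           # every attribute as a plain dict
langs = root.xpath("/book/@lang")       # attribute values selected via XPath

print(book_id, missing, fallback, all_attrs, langs)
```

Whether get() or an XPath expression is the better fit likely depends on whether the attribute sits on an element already at hand or has to be located first.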