From Russell.S.Johnson at gmail.com Tue Jun 1 19:34:07 2010
From: Russell.S.Johnson at gmail.com (Russ Johnson)
Date: Tue, 1 Jun 2010 13:34:07 -0400
Subject: [lxml-dev] Unable to build lxml 2.2.6 on WinXP
Message-ID:

I'm attempting to build lxml 2.2.6 with MinGW for Python 2.7b2 on
Windows XP Pro SP 3 using static linking, but the build fails with
the following error. Any suggestions?

C:\lxml-2.2.6>python setup.py install --static
Building lxml version 2.2.6.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c'
needs to be available.
ERROR: 'xslt-config' is not recognized as an internal or external command,
operable program or batch file.
** make sure the development packages of libxml2 and libxslt are
installed **

Using build configuration of libxslt
Building against libxml2/libxslt in one of the following directories:
  ..\iconv-1.9.2.win32\lib
  ..\libxml2-2.7.6.win32\lib
  ..\libxslt-1.1.26.win32\lib
  ..\zlib-1.2.3.win32\lib
running install
install_dir C:\Python27\Lib\site-packages\
running bdist_egg
running egg_info
writing src\lxml.egg-info\PKG-INFO
writing top-level names to src\lxml.egg-info\top_level.txt
writing dependency_links to src\lxml.egg-info\dependency_links.txt
reading manifest file 'src\lxml.egg-info\SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'lxml.etree.c' under directory 'src\lxml'
warning: no files found matching 'lxml.objectify.c' under directory 'src\lxml'
warning: no files found matching 'lxml.etree.h' under directory 'src\lxml'
warning: no files found matching 'lxml.etree_api.h' under directory 'src\lxml'
warning: no files found matching 'etree_defs.h' under directory 'src\lxml'
warning: no files found matching 'pubkey.asc' under directory 'doc'
warning: no files found matching 'tagpython*.png' under directory 'doc'
writing manifest file 'src\lxml.egg-info\SOURCES.txt'
installing library code to build\bdist.win32\egg
running install_lib
running build_py
running build_ext
building 'lxml.etree' extension
C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -I..\iconv-1.9.2.win32\include -
I..\libxml2-2.7.6.win32\include -I..\libxslt-1.1.26.win32\include -I..\zlib-1.2.
3.win32\include -IC:\Python27\include -IC:\Python27\PC -c src/lxml/lxml.etree.c
-o build\temp.win32-2.7\Release\src\lxml\lxml.etree.o -w
writing build\temp.win32-2.7\Release\src\lxml\etree.def
C:\MinGW\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32-2.7\Release\src\lxm
l\lxml.etree.o build\temp.win32-2.7\Release\src\lxml\etree.def -L..\iconv-1.9.2.
win32\lib -L..\libxml2-2.7.6.win32\lib -L..\libxslt-1.1.26.win32\lib -L..\zlib-1
.2.3.win32\lib -LC:\Python27\libs -LC:\Python27\PCbuild -llibxslt_a -llibexslt_a
-llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython27 -lmsvcr90 -o build\lib.win32-2
.7\lxml\etree.pyd
Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS
VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS
VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS
VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized
Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"ws
2_32.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized

-- snip --

collect2: ld returned 1 exit status
error: command 'gcc' failed with exit status 1

From sidnei.da.silva at gmail.com Tue Jun 1 22:21:01 2010
From: sidnei.da.silva at gmail.com (Sidnei da Silva)
Date: Tue, 1 Jun 2010 17:21:01 -0300
Subject: [lxml-dev] Unable to build lxml 2.2.6 on WinXP
In-Reply-To:
References:
Message-ID:

On Tue, Jun 1, 2010 at 2:34 PM, Russ Johnson wrote:
> I'm attempting to built lxml 2.2.6 with MinGW for Python 2.7b2 on
> Windows XP Pro SP 3 using
> static linking, but the build fails with the following error. Any suggestions?

I have never tried to build with MinGW, so no clue on that.
I know there are many people waiting for a 2.2.6 build for Windows, which I forgot to upload when I built the others. Stefan advised me to build it against libxml2 2.7.7 because of some potential crasher bugs, so I'm waiting for a new build from http://www.zlatkovic.com/libxml.en.html, which is where I get the libxml2 binaries from. OTOH, I *can* make a 2.2.6 build with libxml2 2.7.6 right now if people are willing to put up with potential crashes. Sorry everyone for not speaking up earlier. Between traveling and news that I will be dad of twins, life has been crazy lately. I'll make sure to put up a Baby Registry on Amazon where you can kindly contribute to my happiness as I do to yours *wink*. -- Sidnei From stefan_ml at behnel.de Wed Jun 2 10:16:37 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 02 Jun 2010 10:16:37 +0200 Subject: [lxml-dev] Unable to build lxml 2.2.6 on WinXP In-Reply-To: References: Message-ID: <4C061365.5000804@behnel.de> Russ Johnson, 01.06.2010 19:34: > I'm attempting to built lxml 2.2.6 with MinGW for Python 2.7b2 on > Windows XP Pro SP 3 using > static linking, but the build fails with the following error. Any suggestions? > > C:\lxml-2.2.6>python setup.py install --static > Building lxml version 2.2.6. > NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' > needs to be available. > ERROR: 'xslt-config' is not recognized as an internal or external command, > operable program or batch file. > ** make sure the development packages of libxml2 and libxslt are > installed ** It tells you that you need to have "xslt-config" in your PATH. It comes with libxslt. Not sure if this is required for the static build, though. > Using build configuration of libxslt > Building against libxml2/libxslt in one of the following directories: > ..\iconv-1.9.2.win32\lib > ..\libxml2-2.7.6.win32\lib > ..\libxslt-1.1.26.win32\lib > ..\zlib-1.2.3.win32\lib This looks ok. 
> running install > install_dir C:\Python27\Lib\site-packages\ > running bdist_egg > running egg_info > writing src\lxml.egg-info\PKG-INFO > writing top-level names to src\lxml.egg-info\top_level.txt > writing dependency_links to src\lxml.egg-info\dependency_links.txt > reading manifest file 'src\lxml.egg-info\SOURCES.txt' > reading manifest template 'MANIFEST.in' > warning: no files found matching 'lxml.etree.c' under directory 'src\lxml' > warning: no files found matching 'lxml.objectify.c' under directory 'src\lxml' > warning: no files found matching 'lxml.etree.h' under directory 'src\lxml' > warning: no files found matching 'lxml.etree_api.h' under directory 'src\lxml' > warning: no files found matching 'etree_defs.h' under directory 'src\lxml' This looks weird. They seem to exist below ... > writing manifest file 'src\lxml.egg-info\SOURCES.txt' > installing library code to build\bdist.win32\egg > running install_lib > running build_py > running build_ext > building 'lxml.etree' extension > C:\MinGW\bin\gcc.exe -mno-cygwin -mdll -O -Wall -I..\iconv-1.9.2.win32\include - > I..\libxml2-2.7.6.win32\include -I..\libxslt-1.1.26.win32\include -I..\zlib-1.2. > 3.win32\include -IC:\Python27\include -IC:\Python27\PC -c src/lxml/lxml.etree.c > -o build\temp.win32-2.7\Release\src\lxml\lxml.etree.o -w > writing build\temp.win32-2.7\Release\src\lxml\etree.def > C:\MinGW\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32-2.7\Release\src\lxm > l\lxml.etree.o build\temp.win32-2.7\Release\src\lxml\etree.def -L..\iconv-1.9.2. 
> win32\lib -L..\libxml2-2.7.6.win32\lib -L..\libxslt-1.1.26.win32\lib -L..\zlib-1 > .2.3.win32\lib -LC:\Python27\libs -LC:\Python27\PCbuild -llibxslt_a -llibexslt_a > -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython27 -lmsvcr90 -o build\lib.win32-2 > .7\lxml\etree.pyd > Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS > VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized > Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS > VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized > Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MS > VCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized > Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"ws > 2_32.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized > > -- snip -- > > collect2: ld returned 1 exit status > error: command 'gcc' failed with exit status 1 The "snip" hides the interesting part. Stefan From sidnei.da.silva at gmail.com Wed Jun 2 14:36:21 2010 From: sidnei.da.silva at gmail.com (Sidnei da Silva) Date: Wed, 2 Jun 2010 09:36:21 -0300 Subject: [lxml-dev] Unable to build lxml 2.2.6 on WinXP In-Reply-To: <4C061365.5000804@behnel.de> References: <4C061365.5000804@behnel.de> Message-ID: On Wed, Jun 2, 2010 at 5:16 AM, Stefan Behnel wrote: > It tells you that you need to have "xslt-config" in your PATH. It comes > with libxslt. Not sure if this is required for the static build, though. It's not. At least I have ignored it forever and haven't had an issue. 
-- 
Sidnei

From Russell.S.Johnson at gmail.com Wed Jun 2 14:33:20 2010
From: Russell.S.Johnson at gmail.com (Russ Johnson)
Date: Wed, 2 Jun 2010 08:33:20 -0400
Subject: [lxml-dev] Unable to build lxml 2.2.6 on WinXP
In-Reply-To: <4C061365.5000804@behnel.de>
References: <4C061365.5000804@behnel.de>
Message-ID:

That's the thing, I got the libraries from
ftp://ftp.zlatkovic.com/pub/libxml/, as indicated in
http://codespeak.net/lxml/build.html#static-linking-on-windows.
However, I can't find xslt-config anywhere in libxslt. Is there
somewhere else I should/could get them from?

Full output is here:
http://docs.google.com/document/pub?id=1lFTEQaW-76immaD3f9pDGDVZj26Dbs0iF6p3PxtzZRk

From sergio at sergiomb.no-ip.org Wed Jun 2 17:34:19 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Wed, 02 Jun 2010 16:34:19 +0100
Subject: [lxml-dev] another bug
Message-ID: <1275492859.3346.2.camel@segulix>

Hi, I have many sites on test.
a parking domain,

import lxml.html
hparser = lxml.html.HTMLParser(encoding='utf-8', remove_comments=True)
content=""" """
etree_document = lxml.html.fromstring(content, parser=hparser)

TypeError                                 Traceback (most recent call last)

/home/sergio/ in ()

/usr/lib/python2.6/site-packages/lxml/html/__init__.pyc in fromstring(html, base_url, parser, **kw)
    634                 other_head.drop_tree()
    635             return doc
--> 636     if (len(body) == 1 and (not body.text or not body.text.strip())
    637         and (not body[-1].tail or not body[-1].tail.strip())):
    638         # The body has just one element, so it was probably a single

TypeError: object of type 'NoneType' has no len()

thanks,
-- 
Sérgio M. B.

From edxxgardo at gmail.com Fri Jun 4 14:07:36 2010
From: edxxgardo at gmail.com (Edgardo C.)
Date: Fri, 4 Jun 2010 14:07:36 +0200
Subject: [lxml-dev] lxml for Python 3.1
Message-ID:

Hello everyone,

I've installed Ubuntu 10.04 LTS on my PC, and python-lxml 2.2.4-1 is
already available. I installed it, but it ended up in the Python 2.6
installation that was already on the system. However, I'm developing for
3.1 and I would like to have lxml working for it. My question is, does the
latest version of lxml compile for Python 3.1? What do I need for that? I
would really appreciate any help because I need it very soon. Thanks in
advance,

Ed.

From ovnicraft at gmail.com Fri Jun 4 17:06:04 2010
From: ovnicraft at gmail.com (Ovnicraft)
Date: Fri, 4 Jun 2010 10:06:04 -0500
Subject: [lxml-dev] lxml for Python 3.1
In-Reply-To:
References:
Message-ID:

2010/6/4 Edgardo C.
> Hello everyone,
>
> I've installed Ubuntu 10.04LTS in my pc and python-lxml 2.2.4-1 is already
> in. I installed it, but once done it finished in the Python 2.6 already
> installed version. But, I'm developing for 3.1 and I would like to have lxml
> working for it. My question is, does the last version of lxml compile for
> Python 3.1? What do I need for that? I would really appreciate any help
> because I need it very soon. Thanks in advance,
>

I haven't tried it on 3.1, so I suggest you try installing it there,
identify any errors, and report them **if** you get them.

On the other hand, the Python community still recommends developing on 2.x
versions. By the way, your choice can help to test lxml on 3.x.

My 2c

> Ed.
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>
>

-- 
Cristian

From public at codethief.eu Fri Jun 4 18:39:50 2010
From: public at codethief.eu (codethief)
Date: Fri, 4 Jun 2010 18:39:50 +0200
Subject: [lxml-dev] lxml for Python 3.1
In-Reply-To:
References:
Message-ID:

lxml should work fine with Python 3. There are a few bugs (see previous
posts in this mailing list) but aside from that it works. And if you come
across such a bug, you can still report it here, as Ovnicraft suggested. ;)

Just do a "$ easy_install3 lxml" and you should be good to go. (If you
haven't installed easy_install3, you must do that first, of course.)

On Fri, Jun 4, 2010 at 2:07 PM, Edgardo C. wrote:
> Hello everyone,
>
> I've installed Ubuntu 10.04LTS in my pc and python-lxml 2.2.4-1 is already
> in. I installed it, but once done it finished in the Python 2.6 already
> installed version. But, I'm developing for 3.1 and I would like to have lxml
> working for it.
> My question is, does the last version of lxml compile for
> Python 3.1? What do I need for that? I would really appreciate any help
> because I need it very soon. Thanks in advance,
>
> Ed.

-- 
Simon Hirscher
http://simonhirscher.de

From isaac at wagnerfam.com Tue Jun 8 18:40:32 2010
From: isaac at wagnerfam.com (Isaac Wagner)
Date: Tue, 8 Jun 2010 12:40:32 -0400
Subject: [lxml-dev] Pretty print indent level
Message-ID:

Here's a little snippet of what I'm doing:

from lxml import etree
doc = etree.parse(file)
text = etree.tostring(doc, pretty_print=True)
...

So, that works great and outputs a nicely formatted XML document.
However, the indent level of pretty_print is 2 spaces whereas I'd like
4 spaces. Is there any way to control the number of spaces per indent
level?

From kris at cs.ucsb.edu Tue Jun 8 21:26:37 2010
From: kris at cs.ucsb.edu (kristian kvilekval)
Date: Tue, 08 Jun 2010 12:26:37 -0700
Subject: [lxml-dev] comment processing (bug?)
Message-ID: <1276025197.15709.278.camel@loup.ece.ucsb.edu>

I am having a problem with embedded comments not being ignored by the
parser.

From the examples on iterparse and iterwalk

http://codespeak.net/lxml/parsing.html

I tried the examples and it seems to work as advertised (comments
ignored unless 'comment' is in events).

However, I changed the input slightly to use an embedded comment:

commented_xml = ''' text '''

xml = etree.XML (commented_xml)
context = etree.iterwalk (xml, events = ('start','end'))
for action, elem in context:
    print("%s: -%s-" % (action, elem.tag))

start: -root-
start: -element-
end: -element-
start: --
end: --
end: -root-

Since events does not contain comments, I wasn't expecting the comment
and seem to be unable to filter it out. Any pointers appreciated..
Kris From basti at redtoad.de Wed Jun 9 07:32:05 2010 From: basti at redtoad.de (Sebastian Rahlf) Date: Wed, 09 Jun 2010 07:32:05 +0200 Subject: [lxml-dev] Force element content to be string In-Reply-To: <20100531173639.44972c1fzxilwps0@webmail.ssl-gateway.de> References: <20100531145102.51434veknbsgx3n4@webmail.ssl-gateway.de> <4C03B5D6.10203@behnel.de> <20100531173639.44972c1fzxilwps0@webmail.ssl-gateway.de> Message-ID: <1276061525.2447.2.camel@phoenix> > >> I have a bit of XML which I want to parse via lxml.objectify. > >> > >> from lxml import objectify > >> node = objectify.fromstring(''' > >> > >> 0747532745 > >> > >> Joanne K. Rowling > >> Bloomsbury Publishing > >> Book > >> Harry Potter and the Philosopher's Stone > >> > >> > >> ''') > >> > >> I have the following problem: node.ASIN is evaluated to integer > value > >> 747532745 but should be a string ('0747532745'). > >> > >> There is no way for me to influence the incoming XML, so any > py:pytype > >> magic or adding a schema is out of the question. Is there a way to > >> ensure that ASIN elements are always evaluated to a string? > > > > Just add the type attribute after parsing and remove it before > serialisation. > > > > Alternatively, you can register your own Element type for the ASIN > > tag. There should be something about that in the objectify docs. > > > > Stefan > > Thanks for the tip. I wrote my own class lookup > > from lxml import etree > class MyLookup(etree.CustomElementClassLookup): > def lookup(self, node_type, document, namespace, name): > if name == 'ASIN': > return objectify.StringElement > > lookup = MyLookup() > parser = etree.XMLParser() > parser.set_element_class_lookup(lookup) > node = objectify.fromstring(xml, parser) > > which now returns the right element type. 
>
> node = objectify.fromstring(xml, parser)
> objectify.annotate(node)
> print objectify.dump(node)
>
> Item = None [_Element]
>   * py:pytype = 'str'
>     ASIN = '0747532745' [StringElement]
>       * py:pytype = 'int'
>     DetailPageURL = 'http://www.amazon.de/Harry-...' [_Element]
>       * py:pytype = 'str'
>     ItemAttributes = None [_Element]
>       * py:pytype = 'str'
>         Author = 'Joanne K. Rowling' [_Element]
>           * py:pytype = 'str'
>         Manufacturer = 'Bloomsbury Publishing' [_Element]
>           * py:pytype = 'str'
>         ProductGroup = 'Book' [_Element]
>           * py:pytype = 'str'
>         Title = "Harry Potter and the Philosopher's Stone" [_Element]
>           * py:pytype = 'str'
>
> How do I make it fall back to objectify.ObjectifyElementClassLookup?

To answer my own question: You can set a fallback lookup

lookup.set_fallback(objectify.ObjectifyElementClassLookup())

As easy as pie!

Seb.

-- 
Sebastian Rahlf

From fantasai.lists at inkedblade.net Sun Jun 13 01:14:22 2010
From: fantasai.lists at inkedblade.net (fantasai)
Date: Sat, 12 Jun 2010 16:14:22 -0700
Subject: [lxml-dev] Resolver not getting proper SYSTEM id / parsing XHTML with entities
Message-ID:

Hello,

I'm trying to process around 10,000 XHTML files with lxml: specifically parsing
them, extracting some metadata, then serializing to XHTML and HTML with html5lib.

My problem is that lxml chokes on &nbsp; et al. Now, I could turn on dtd_validation,
and maybe (maybe) that would work, but it would also make the W3C sysadmins
hate me because lxml does not cache the DTD:
http://www.hoboes.com/Mimsy/hacks/caching-dtds-using-lxml-and-etree/

There is also no option to turn off entity checking and just parse their names
into the tree, which would also solve my problem:
https://bugs.launchpad.net/lxml/+bug/267825

So the best idea I could come up with was to borrow Jerry Stratton's code
for caching the DTDs. So I copied it:

class CachingDTDResolver(etree.Resolver):
    """
    For caching DTDs.
    Copied from
    http://www.hoboes.com/Mimsy/hacks/caching-dtds-using-lxml-and-etree/
    """
    def resolve(self, URL, id, context):
        print "args %s %s %s %s" % (self, URL, id, context)
        ...

from Utils import CachingDTDResolver
__parser = etree.XMLParser(no_network=False,
                           dtd_validation=True,
                           remove_comments=False,
                           strip_cdata=False,
                           resolve_entities=False)
__parser.resolvers.add(CachingDTDResolver())

And tried to parse with it:

self.tree = etree.parse(self.sourcepath, parser=self.__parser)

The following file (sourcepath = path/to/file.xht)

...

Which has the correct DOCTYPE according to W3C:
http://www.w3.org/TR/xhtml1/#strict

And got the following debugging output:

args path/to/file.xht None

I double-checked the Resolver interface:

resolve(...)
    resolve(self, system_url, public_id, context)

    Override this method to resolve an external source by
    ``system_url`` and ``public_id``. The third argument is an opaque
    context object.

And I double-checked that lxml found the correct system ID without
the custom resolver:

print self.tree.docinfo.system_url
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

And I double-checked that the file validated at W3C:
http://validator.w3.org/

And could not understand why
a) I didn't get a proper system_url
b) I didn't get a proper public_id
c) I got the relative filepath???

End result: Entity-parsing problem not solved! :(

Can anyone tell me what I'm doing wrong? Surely parsing XHTML
with lxml is a solved problem!

~fantasai

From arcriley at gmail.com Sun Jun 13 19:15:02 2010
From: arcriley at gmail.com (Arc Riley)
Date: Sun, 13 Jun 2010 13:15:02 -0400
Subject: [lxml-dev] Generating lxml objects from expat
Message-ID:

We have a C application that uses expat to parse an XML stream (XMPP). The
stream only terminates when the connection closes and we need to process
(and respond) to it while it's open, so a DOM parser would not work.
Inside the (root element) are "stanzas", elements at depth=1 which are generally short and easy to parse, so its desirable to pass each stanza to a DOM tree for further processing in Python. We can do this with cElementTree due to the exposed expat handlers, but I have not yet seen anything in lxml's C api which would suggest that its capable of this. Is it possible to construct and populate a lxml.etree externally using startElement, endElement, etc calls in a manner which is compatible with libxslt? Its also a bit frustrating that lxml's C headers are not installed to the system (and thus unpackaged on both Gentoo and Debian/Ubuntu). -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100613/67bd2a36/attachment.htm From stefan_ml at behnel.de Sun Jun 13 21:13:36 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 13 Jun 2010 21:13:36 +0200 Subject: [lxml-dev] Generating lxml objects from expat In-Reply-To: References: Message-ID: <4C152DE0.6030502@behnel.de> Arc Riley, 13.06.2010 19:15: > We have a C application that uses expat to parse a XML stream (XMPP). The > stream only terminates when the connection closes and we need to process > (and respond) to it while its open, so a DOM parser would not work. > > Inside the (root element) are "stanzas", elements at depth=1 which > are generally short and easy to parse, so its desirable to pass each stanza > to a DOM tree for further processing in Python. You didn't mention what your C application actually does and through what kind of 'connection' it gets its data, but this sounds like you should drop the C code entirely and just use iterparse(some_stream, tag='stanzas') in lxml.etree. When done with each element, .clear() it to discard its content from memory. If you still need your C code for some reason, there may still be ways to interact easily, but you will need to provide more information. 
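
A rough sketch of the streaming loop suggested above, not code from this thread. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml installed; lxml.etree.iterparse provides the same loop and additionally accepts a tag= keyword for filtering. The sample stream, its element names and the helper name iter_stanzas are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # stand-in; lxml.etree.iterparse works the same way


def iter_stanzas(stream):
    """Yield each depth-1 child of the root as soon as its end tag is parsed."""
    depth = 0
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 1:
                # A complete stanza: hand it to the caller, then free its content.
                yield elem
                elem.clear()


# Hypothetical stream content, only for demonstration.
data = b"<stream><message id='1'><body>hi</body></message><presence id='2'/></stream>"
seen = [(stanza.tag, stanza.get("id")) for stanza in iter_stanzas(io.BytesIO(data))]
```

Each stanza has to be used before the generator resumes, because clear() empties it afterwards. With a real socket the file-like object would block until more data arrives, which is worth checking against the application's latency needs.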
Stefan

From stefan_ml at behnel.de Mon Jun 14 08:33:46 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 14 Jun 2010 08:33:46 +0200
Subject: [lxml-dev] Pretty print indent level
In-Reply-To:
References:
Message-ID: <4C15CD4A.6090305@behnel.de>

Isaac Wagner, 08.06.2010 18:40:
> Here's a little snippet of what I'm doing:
>
> from lxml import etree
> doc = etree.parse(file)
> text = etree.tostring(doc, pretty_print=True)
> ...
>
> So, that works great and outputs a nicely formatted XML document.
> However, the indent level of pretty_print is 2 spaces whereas I'd like
> 4 spaces. Is there any way to control the number of spaces per indent
> level?

Potentially, yes, libxml2 supports this. But it's not available through the
lxml.etree API.

Note that there is a little indent() function on effbot's ElementTree site
that implements pretty printing at the user level. Works fine with lxml.

Stefan

From stefan_ml at behnel.de Mon Jun 14 08:46:13 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 14 Jun 2010 08:46:13 +0200
Subject: [lxml-dev] Resolver not getting proper SYSTEM id / parsing XHTML with entities
In-Reply-To:
References:
Message-ID: <4C15D035.4000302@behnel.de>

fantasai, 13.06.2010 01:14:
> I'm trying to process around 10,000 XHTML files with lxml: specifically parsing
> them, extracting some metadata, then serializing to XHTML and HTML with html5lib.
>
> My problem is that lxml chokes on &nbsp; et al. Now, I could turn on dtd_validation,
> and maybe (maybe) that would work, but it would also make the W3C sysadmins
> hate me because lxml does not cache the DTD

The right way to handle this is XML catalogs.

http://xmlsoft.org/catalog.html

Also, you don't need to enable validation, you only need to let lxml.etree
load the DTD, that's a different option.

> self.tree = etree.parse(self.sourcepath, parser=self.__parser)
> [...]
> And got the following debugging output: > args > path/to/file.xht None No idea what's in "self.sourcepath", but that's likely why you get the relative path out. Stefan From isaac at wagnerfam.com Tue Jun 15 18:05:30 2010 From: isaac at wagnerfam.com (Isaac Wagner) Date: Tue, 15 Jun 2010 12:05:30 -0400 Subject: [lxml-dev] Problem with schema validation Message-ID: I'm using LXML version 2.2.6, libxml 2.7.7, and libxslt 1.1.26. I've gone through the example of validating a XML document against a schema, but I get an error every time. Here is the relevant portion of my Python code: from lxml import etree schemaDoc = etree.parse(schemaFileName) schema = etree.XMLSchema(schemaDoc) xmlDoc = etree.parse(xmlFileName) if schema.validate(xmlDoc): print "Valid" else: print schema.error_log My XML document fails with this error: Test.xsd:3:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element '{http://www.w3.org/2001/XMLSchema}schema': No matching global declaration available for the validation root. I've run xmllint against the schema and input XML and it works fine: $ xmllint --schema Test.xsd --noout Test.xml Test.xml validates I've also loaded my schema and XML into Eclipse and run the XML and XSD validators and those work. So, I believe the problem is not in libxml nor in my documents. Can someone please shed some light on this problem? From abhijeet.thatte at gmail.com Tue Jun 15 21:02:11 2010 From: abhijeet.thatte at gmail.com (abhijeet thatte) Date: Tue, 15 Jun 2010 12:02:11 -0700 Subject: [lxml-dev] Need to parse python dictionaries into xml Message-ID: Hello, I am a novice Python user. I am using Python to parse some hardware specifications and create xml files from them. I generate dict of really huge sizes. (I am parsing some 10,000 register definitions.) So, it looks like : {elem1,elem2, elem3,dict1,{elem4,elem5, dict2 {elem6, elem7, dict3{.....}}}}. Is it possible to parse such dictionaries into xml without any new tags other than tags used in dictionaries. 
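
One possible way to map nested dictionaries onto elements, shown as a minimal sketch rather than an answer from this thread. It assumes every key is a valid XML tag name and every leaf value can be turned into text; the register names and the helper dict_to_element are invented for the example. The standard library's xml.etree.ElementTree is used here, and lxml.etree offers the same Element and tostring calls. Note that one enclosing root tag is still needed, since an XML document must have a single root.

```python
import xml.etree.ElementTree as ET  # lxml.etree provides the same Element API


def dict_to_element(tag, value):
    """Build an element named `tag`; nested dicts become child elements."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            elem.append(dict_to_element(key, child))
    else:
        # Leaf value: store it as the element's text.
        elem.text = str(value)
    return elem


# Hypothetical register description, only for demonstration.
registers = {"ctrl": {"offset": "0x00", "fields": {"enable": 1, "mode": 3}}}
root = dict_to_element("registers", registers)
xml_text = ET.tostring(root, encoding="unicode")
```

The recursion depth equals the nesting depth of the dictionaries, so very deeply nested input may need an iterative variant.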
Thanks,
Abhijeet

From jholg at gmx.de Wed Jun 16 16:20:19 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 16 Jun 2010 16:20:19 +0200
Subject: [lxml-dev] Problem with schema validation
In-Reply-To:
References:
Message-ID: <20100616142019.241140@gmx.net>

Hi,

> from lxml import etree
> schemaDoc = etree.parse(schemaFileName)
> schema = etree.XMLSchema(schemaDoc)
> xmlDoc = etree.parse(xmlFileName)
> if schema.validate(xmlDoc):
>     print "Valid"
> else:
>     print schema.error_log
>
> My XML document fails with this error:
>
> Test.xsd:3:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element
> '{http://www.w3.org/2001/XMLSchema}schema': No matching global
> declaration available for the validation root.
>
> I've run xmllint against the schema and input XML and it works fine:
> [...]

The error message suggests that you are inadvertently trying to validate
the *schema doc* instead of the instance document.

Apart from that, your code works as expected for me:

>>> xmlFileName = "test.xml"
>>> schemaFileName = "test.xsd"
>>> from lxml import etree
>>> schemaDoc = etree.parse(schemaFileName)
>>> schema = etree.XMLSchema(schemaDoc)
>>> xmlDoc = etree.parse(xmlFileName)
>>> if schema.validate(xmlDoc):
...     print "Valid"
... else:
...     print schema.error_log
...
Valid

However:

>>> if schema.validate(schemaDoc):
...     print "Valid"
... else:
...     print schema.error_log
...
test.xsd:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element '{http://www.w3.org/2001/XMLSchema}schema': No matching global declaration available for the validation root.

Greets,
Holger

From stefan_ml at behnel.de Thu Jun 17 19:38:27 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 17 Jun 2010 19:38:27 +0200
Subject: [lxml-dev] Need to parse python dictionaries into xml
In-Reply-To:
References:
Message-ID: <4C1A5D93.5050903@behnel.de>

abhijeet thatte, 15.06.2010 21:02:
> I am a novice Python user. I am using Python to parse some hardware
> specifications and create xml files from them.
> I generate dict of really huge sizes. (I am parsing some 10,000 register
> definitions.)
> So, it looks like : {elem1,elem2, elem3,dict1,{elem4,elem5, dict2 {elem6,
> elem7, dict3{.....}}}}.
>
> Is it possible to parse such dictionaries into xml without any new tags
> other than tags used in dictionaries.

Just a quick note that this has been discussed on c.l.py.

Stefan

From hanooter at gmail.com Fri Jun 18 21:42:15 2010
From: hanooter at gmail.com (Kyle Hanson)
Date: Fri, 18 Jun 2010 12:42:15 -0700
Subject: [lxml-dev] Using Regex to search CSSselectors?
Message-ID:

I am switching my code from BeautifulSoup to lxml for HTML parsing; however,
there is one area of my code that I was hoping someone could help with.
BeautifulSoup allows re compiled patterns as arguments to search for nodes.
I was hoping that lxml would allow for something similar.

Basically I need to find all tags with a class name that matches the regex
r'post_?\d+' and another one that matches the tag's id to the regex
r'(user|profile)_\d+'

Thanks

From stefan_ml at behnel.de Sat Jun 19 10:13:26 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 19 Jun 2010 10:13:26 +0200
Subject: [lxml-dev] comment processing (bug?)
In-Reply-To: <1276025197.15709.278.camel@loup.ece.ucsb.edu> References: <1276025197.15709.278.camel@loup.ece.ucsb.edu> Message-ID: <4C1C7C26.1030007@behnel.de> kristian kvilekval, 08.06.2010 21:26: > I am having a problem with embedded comments not being ignored by the > parser. Note that iterwalk() is not related to the parser, it just walks the tree. >> From the examples on iterparse and iterwalk > > http://codespeak.net/lxml/parsing.html > > I tried the examples and it seems to work as advertised (comments > ignored unless 'comment' is in events). > > However, I changed the input slightly to use an embedded comment: > > commented_xml = ''' > > text > ''' > > > xml = etree.XML (commented_xml) > context = etree.iterwalk (xml, events = ('start','end')) > for action, elem in context: > print("%s: -%s-" % (action, elem.tag)) > > start: -root- > start: -element- > end: -element- > start: -- > end: -- > end: -root- > > > Since events does not contain comments, I wasn't expecting the comment > and seem to unable to filter it out. Any pointers appreciated.. Yes, that's a bug. Comments and PIs shouldn't be returned unless explicitly requested. That's the behaviour of iterparse(), and iterwalk() should behave the same. I'll see if I can fix that for 2.3. Could you please file a bug report in the launchpad tracker? Thanks! Stefan From stefan_ml at behnel.de Sat Jun 19 10:19:28 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Jun 2010 10:19:28 +0200 Subject: [lxml-dev] another bug In-Reply-To: <1275492859.3346.2.camel@segulix> References: <1275492859.3346.2.camel@segulix> Message-ID: <4C1C7D90.4020902@behnel.de> Hi, note that subjects like "another bug" are less likely to receive interest than something that describes the actual problem. Sergio Monteiro Basto, 02.06.2010 17:34: > Hi, I have many sites on test. 
> > a parking domain, > > import lxml.html > hparser = lxml.html.HTMLParser(encoding='utf-8', remove_comments=True) > > content=""" > > id="srcpg" > frameborder="0" scrolling="Auto" marginwidth="" > marginheight="0"> > > """ > > etree_document = lxml.html.fromstring(content, parser=hparser) > TypeError Traceback (most recent call > last) > > /home/sergio/ in() > > /usr/lib/python2.6/site-packages/lxml/html/__init__.pyc in > fromstring(html, base_url, parser, **kw) > 634 other_head.drop_tree() > 635 return doc > --> 636 if (len(body) == 1 and (not body.text or not > body.text.strip()) > 637 and (not body[-1].tail or not > body[-1].tail.strip())): > 638 # The body has just one element, so it was probably a > single > > > TypeError: object of type 'NoneType' has no len() Yes, the exception is a bug. I'm not sure what the parser should return in this case. I'll have to look into this, maybe it's worth special casing. Could you file a bug report? Thanks! Stefan From stefan_ml at behnel.de Sat Jun 19 10:28:31 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Jun 2010 10:28:31 +0200 Subject: [lxml-dev] Using Regex to search CSSselectors? In-Reply-To: References: Message-ID: <4C1C7FAF.5050501@behnel.de> Kyle Hanson, 18.06.2010 21:42: > I am switching my code from BeautifulSoup to LXML for HTML parsing however > there is one area of my code that I was hoping someone could help on. > > Beautfiul soup allows re compiled patterns as arguments to search for nodes. > I was hoping that LXML would allow for something similar. Basically I need > to find all tags with a class name that matches the regex expression > r'post_?\d+' and another one that matches the tag's id to the regex > r'(user|profile)_\d+' The CSS selectors don't support this, but XPath does. Just use the EXSLT namespace for regular expressions. 
http://codespeak.net/lxml/xpathxslt.html#the-xpath-class Note that you can also use the cssselect module manually to convert a given CSS selector to an equivalent XPath expression. Stefan From stefan_ml at behnel.de Sat Jun 19 12:52:49 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Jun 2010 12:52:49 +0200 Subject: [lxml-dev] lxml 2.3alpha1 released Message-ID: <4C1CA181.10004@behnel.de> Hi all, I'm happy to announce the first alpha release of lxml 2.3. This is *not* a production-ready release, and there will be further fixes and changes before the final 2.3 release. Therefore, it is not the default version on PyPI but only available via "easy_install lxml==2.3alpha1" and from here: http://pypi.python.org/pypi/lxml/2.3alpha1 http://codespeak.net/lxml/dev/ The main reasons for this release are: There hasn't been a release for too long; there were too many changes in the meantime, including various bug fixes; I'd like to make it easier for users to benefit from those improvements and to report bugs and unexpected changes that get in their way. Note that not all reported bugs in the tracker that I'd like to fix for 2.3 are closed with this alpha, but I hope that a couple of others will make it before the final release. If you think that your Very Important Bug hasn't received enough interest so far, consider adding a comment to the bug tracker. ("fix this bug!" may not be enough ;) Major new features in this release include better compatibility with ElementTree 1.3 (which will be in the Python 3.2 standard library), ISO-Schematron support and various usability fixes for XSLT and XPath. The (IMHO) most important bug fix is the hardening of the API against potential crashes due to badly initialised wrapper objects. It is the clear intention for the future development to make lxml safer to use and hard to crash. 
I'm also planning to offer commercial (paid) support for lxml's further development, so if you really need a new feature or can't wait for a bug to get fixed, you may consider paying me to do it for you. Important notice: This does *not* mean that lxml is going non-free in any way, or that its OpenSource development is endangered. Lxml will continue to ship its complete source code under the liberal BSD license, and any (generally interesting) paid changes will become part of the mainline distribution. It just means that there will be a way for users to financially support my work on the project and to invest resources into targeted work that they can't (or don't want to) do themselves. This release was built using revision "c71560698d0d" of the cython-closures development branch (pre-0.13). A final Cython 0.13 is expected for early July, which will be the base for the final lxml 2.3 release. ... and I'm pretty sure this is the longest release announcement I've ever written. ;) Have fun, Stefan Here's the complete post-2.2.6 changelog for this release: 2.3alpha1 (2010-06-19) ====================== Features added -------------- * Keyword argument ``namespaces`` in ``lxml.cssselect.CSSSelector()`` to pass a prefix-to-namespace mapping for the selector. * New function ``lxml.etree.register_namespace(prefix, uri)`` that globally registers a namespace prefix for a namespace that newly created Elements in that namespace will use automatically. Follows ElementTree 1.3. * Support 'unicode' string name as encoding parameter in ``tostring()``, following ElementTree 1.3. * Support 'c14n' serialisation method in ``ElementTree.write()`` and ``tostring()``, following ElementTree 1.3. * The ElementPath expression syntax (``el.find*()``) was extended to match the upcoming ElementTree 1.3 that will ship in the standard library of Python 3.2/2.7. This includes extended support for predicates as well as namespace prefixes (as known from XPath). 
* During regular XPath evaluation, various EXSLT functions are available
  within their namespace when using libxslt 1.1.26 or later.

* Support passing a readily configured logger instance into
  ``PyErrorLog``, instead of a logger name.

* On serialisation, the new ``doctype`` parameter can be used to
  override the DOCTYPE (internal subset) of the document.

* New parameter ``output_parent`` to ``XSLTExtension.apply_templates()``
  to append the resulting content directly to an output element.

* ``XSLTExtension.process_children()`` to process the content of the
  XSLT extension element itself.

* ISO-Schematron support based on the de-facto Schematron reference
  'skeleton implementation'.

* XSLT objects now take XPath objects as ``__call__`` stylesheet
  parameters.

* Enable path caching in ElementPath (``el.find*()``) to avoid parsing
  overhead.

* Setting the value of a namespaced attribute always uses a prefixed
  namespace instead of the default namespace even if both declare the
  same namespace URI. This avoids serialisation problems when an
  attribute from a default namespace is set on an element from a
  different namespace.

* XSLT extension elements: support for XSLT context nodes other than
  elements: document root, comments, processing instructions.

* Support for strings (in addition to Elements) in node-sets returned
  by extension functions.

* Forms that lack an ``action`` attribute default to the base URL of
  the document on submit.

* XPath attribute result strings have an ``attrname`` property.

* Namespace URIs get validated against RFC 3986 at the API level
  (required by the XML namespace specification).

* Target parsers show their target object in the ``.target`` property
  (compatible with ElementTree).

Bugs fixed
----------

* API is hardened against invalid proxy instances to prevent crashes
  due to incorrectly instantiated Element instances.

* Prevent crash when instantiating ``CommentBase`` and friends.
* Export ElementTree compatible XML parser class as ``XMLTreeBuilder``,
  as it is called in ET 1.2.

* ObjectifiedDataElements in lxml.objectify were not hashable. They now
  use the hash value of the underlying Python value (string, number,
  etc.) to which they compare equal.

* Parsing broken fragments in lxml.html could fail if the fragment
  contained an orphaned closing '' tag.

* Using XSLT extension elements around the root of the output document
  crashed.

* ``lxml.cssselect`` did not distinguish between ``x[attr="val"]`` and
  ``x [attr="val"]`` (with a space). The latter now matches the
  attribute independent of the element.

* Rewriting multiple links inside of HTML text content could end up
  replacing unrelated content as replacements could impact the reported
  position of subsequent matches. Modifications are now simplified by
  letting the ``iterlinks()`` generator in ``lxml.html`` return links in
  reversed order if they appear inside the same text node. Thus,
  replacements and link-internal modifications no longer change the
  position of links reported afterwards.

* The ``.value`` attribute of ``textarea`` elements in lxml.html did not
  represent the complete raw value (including child tags etc.). It now
  serialises the complete content on read and replaces the complete
  content by a string on write.

* Target parser didn't call ``.close()`` on the target object if parsing
  failed. Now it is guaranteed that ``.close()`` will be called after
  parsing, regardless of the outcome.

Other changes
-------------

* Official support for Python 3.1.2 and later.

* Static MS Windows builds can now download their dependencies
  themselves.

* ``Element.attrib`` no longer uses a cyclic reference back to its
  Element object. It therefore no longer requires the garbage collector
  to clean up.

* Static builds include libiconv, in addition to libxml2 and libxslt.
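
Two of the ElementTree 1.3 compatible additions listed in this changelog, ``register_namespace()`` and the 'unicode' encoding name for ``tostring()``, can be tried out with a minimal sketch like the following. The namespace URI, the prefix and the tag names are made up for the illustration, and the fallback import of the standard library's ElementTree is only an assumption for machines where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library ElementTree offers the same two calls.
    import xml.etree.ElementTree as etree

# Hypothetical namespace, used only for this example.
NS = "http://example.com/registers"

# Globally associate the prefix "reg" with the namespace, so that newly
# created elements in that namespace serialise with this prefix.
etree.register_namespace("reg", NS)

root = etree.Element("{%s}device" % NS)
etree.SubElement(root, "{%s}register" % NS, name="CTRL")

# encoding="unicode" asks for a text string instead of encoded bytes.
xml_text = etree.tostring(root, encoding="unicode")
print(xml_text)
```

Without the ``register_namespace()`` call the serialiser would pick a generated prefix such as ``ns0`` instead.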
From sridharr at activestate.com Mon Jun 21 19:15:16 2010 From: sridharr at activestate.com (Sridhar Ratnakumar) Date: Mon, 21 Jun 2010 10:15:16 -0700 Subject: [lxml-dev] lxml 2.3alpha1 released In-Reply-To: <4C1CA181.10004@behnel.de> References: <4C1CA181.10004@behnel.de> Message-ID: On 2010-06-19, at 3:52 AM, Stefan Behnel wrote: > it is not the default > version on PyPI but only available via "easy_install lxml==2.3alpha1" Hmm, the thing is ... even if a release is "hidden" in PyPI, easy_install (specifically setuptools.package_index, which is what PyPM uses as well) and pip will find it. $ bin/easy_install lxml Searching for lxml Reading http://pypi.python.org/simple/lxml/ Reading http://codespeak.net/lxml Best match: lxml 2.3alpha1 Downloading http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz [...] If this is not intended, please do not upload alpha releases to PyPI. -srid From stefan_ml at behnel.de Mon Jun 21 20:01:28 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 21 Jun 2010 20:01:28 +0200 Subject: [lxml-dev] lxml 2.3alpha1 released In-Reply-To: References: <4C1CA181.10004@behnel.de> Message-ID: <4C1FA8F8.50807@behnel.de> Sridhar Ratnakumar, 21.06.2010 19:15: > On 2010-06-19, at 3:52 AM, Stefan Behnel wrote: > >> it is not the default >> version on PyPI but only available via "easy_install lxml==2.3alpha1" > > Hmm, the thing is ... even if a release is "hidden" in PyPI, > easy_install (specifically setuptools.package_index, which is what > PyPM uses as well) and pip will find it. Ah, and I was wondering why the downloads were firing up that quickly. ;) > $ bin/easy_install lxml > Searching for lxml > Reading http://pypi.python.org/simple/lxml/ > Reading http://codespeak.net/lxml > Best match: lxml 2.3alpha1 > Downloading http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz > [...] > > If this is not intended, please do not upload alpha releases to PyPI. It's not intended. 
I'd consider it a bug in PyPI (and yet another one in setuptools). The "simple" page shows all sorts of files that make absolutely no sense. Just because a URL appears somewhere on a page about an outdated and disabled software version doesn't mean tools like setuptools should get their nose bumped into it. And I can't see a way to prevent entries from appearing on that page.

I guess the only way to prevent automatic installation for now is to remove the file from PyPI, which, I guess, will disable easy_install support completely...

Stefan

From relativityguy at gmail.com Mon Jun 21 20:05:08 2010
From: relativityguy at gmail.com (Lashkara Singh)
Date: Mon, 21 Jun 2010 23:35:08 +0530
Subject: [lxml-dev] XPath vs lxml methods
Message-ID:

Hi everyone

I am new to Python and lxml.etree in general, so please answer accordingly!

I am parsing a file with Python and lxml, and after reading up on it, I am unsure when to choose XPath and when to choose the lxml API methods, given a choice. I will take the following code to illustrate what I mean:

To get the text of all tags with name *price*, I can use this:

    for ele in Tree.getiterator('price'):
        print ele.text

Or, I can do this using XPath:

    price_list = Tree.xpath('//price/text()')

The question is, given that both of them return the same output and mean the same thing, which one should I use? XPath or the methods provided by the lxml API? Please note that I am not primarily looking to optimize my code, though that is one part; what I am more concerned about is which is better from a code point of view.

Thanks for your time.
L Singh
-------------- next part -------------- An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100621/c81b707e/attachment.htm From sridharr at activestate.com Mon Jun 21 20:07:51 2010 From: sridharr at activestate.com (Sridhar Ratnakumar) Date: Mon, 21 Jun 2010 11:07:51 -0700 Subject: [lxml-dev] lxml 2.3alpha1 released In-Reply-To: <4C1FA8F8.50807@behnel.de> References: <4C1CA181.10004@behnel.de> <4C1FA8F8.50807@behnel.de> Message-ID: <8F2E6FEB-FEA8-440F-A670-ECFEC5FBBADE@activestate.com> On 2010-06-21, at 11:01 AM, Stefan Behnel wrote: >> $ bin/easy_install lxml >> Searching for lxml >> Reading http://pypi.python.org/simple/lxml/ >> Reading http://codespeak.net/lxml >> Best match: lxml 2.3alpha1 >> Downloading http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz >> [...] >> >> If this is not intended, please do not upload alpha releases to PyPI. > > It's not intended. I'd consider it a bug in PyPI (and right another one in setuptools). The "simple" page shows all sorts of files that make absolute no sense. Just because a URL appears somewhere on a page about an outdated and disabled software version doesn't mean tools like setuptools should get their nose bumped into it. And I can't see a way to prevent entries from appearing on that page. This was discussed before http://www.mail-archive.com/catalog-sig at python.org/msg01511.html ... but no resolution was taken. > I guess the only way to prevent automatic installation for now is to remove the file from PyPI, which, I guess, will disable easy_install support completely... Yes, it will also disable pip/PyPM support. Perhaps you can poke Martin in distutils-sig@ about this bug in PyPI. :-) -srid From public at codethief.eu Mon Jun 21 22:23:05 2010 From: public at codethief.eu (codethief) Date: Mon, 21 Jun 2010 22:23:05 +0200 Subject: [lxml-dev] XPath vs lxml methods In-Reply-To: References: Message-ID: Use the code whose purpose, in your opinion, is more obvious and which can be debugged more easily. 
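
For what it's worth, the two variants from the question are not always strictly equivalent, which may matter when picking one. A small sketch, assuming a made-up catalog document and a current lxml on Python 3 (``iter()`` is used here in place of ``getiterator()``):

```python
from lxml import etree

# Hypothetical sample data; the last <price> element is empty.
catalog = etree.fromstring(
    "<catalog>"
    "<item><price>10.50</price></item>"
    "<item><price>7.25</price></item>"
    "<item><price/></item>"
    "</catalog>"
)

# Tree API: one entry per element, None where an element has no text.
prices_via_iter = [el.text for el in catalog.iter("price")]

# XPath: one entry per text node, so empty elements simply do not show up.
prices_via_xpath = catalog.xpath("//price/text()")

print(prices_via_iter)   # ['10.50', '7.25', None]
print(prices_via_xpath)  # ['10.50', '7.25']
```

So if empty elements have to be detected, iterating over the elements is the more direct choice; if only the existing text is of interest, the XPath expression says so in one line.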

On Mon, Jun 21, 2010 at 8:05 PM, Lashkara Singh wrote:
> Hi everyone
>
> I am new to Python and etree.lxml in general, so please answer accordingly!
>
> I am parsing a file with Python and lxml, and after reading up stuff, I have
> a confusion as to where to choose between XPath and lxml, given a choice.
> For example, I will take the following code to illustrate what I mean:
>
> To get the text of all tags with name price, I can use this:
>
>> for ele in Tree.getiterator('price'):
>>     print ele.text
>
> Or, I can do this using XPath:
>
>> price_list = Tree.xpath('//price/text()')
>
> The question is, given both of them return the same output and mean the same
> thing, which one should I use? XPath or the methods provided by the lxml
> API? Please note that I am not looking to optimize my code, though that is
> one part, what I am more concerned about is which is good, from a code point
> of view.
>
> Thanks for your time.
> L Singh

--
Simon Hirscher
http://simonhirscher.de

From spuzhava at purdue.edu Mon Jun 21 22:54:12 2010
From: spuzhava at purdue.edu (spuzhava at purdue.edu)
Date: Mon, 21 Jun 2010 16:54:12 -0400
Subject: [lxml-dev] Xml to flat file
Message-ID: <20100621165412.962713as2yc3mdm8@boilermail.purdue.edu>

Hello,

I was looking for a way to convert an xml fragment (_ElementTree object) to a flat file (comma separated perhaps). Though I understand this can be done quite easily by iterating, I just wanted to check if there was already something built in the library for this purpose. I tried going through the API (http://codespeak.net/lxml/api/index.html) but wasn't able to lay my hands on something that did this. I would be grateful if someone could let me know if they have come across a function that does this in lxml.

Thanks,
Shankar.
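
The "iterating" approach mentioned in this message could look roughly like the sketch below. The document layout (one ``register`` element per row with ``name`` and ``address`` children), the sample data and the helper name ``rows_to_csv`` are all assumptions for the illustration; a real input format would need its own column mapping. The fallback import only exists so the sketch also runs where lxml is not installed.

```python
import csv
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library ElementTree is enough for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical input: one row element per record, one child per column.
XML = """<registers>
  <register><name>CTRL</name><address>0x00</address></register>
  <register><name>STATUS</name><address>0x04</address></register>
</registers>"""


def rows_to_csv(root, row_tag, columns):
    """Write one CSV line per `row_tag` element, one field per column tag."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)  # header line
    for row in root.iter(row_tag):
        # findtext() returns the text of the first matching child,
        # or the given default if there is no such child.
        writer.writerow([row.findtext(col, default="") for col in columns])
    return out.getvalue()


csv_text = rows_to_csv(etree.fromstring(XML), "register", ["name", "address"])
print(csv_text)
```

Leaving the quoting and escaping to the ``csv`` module avoids broken output when a field itself contains a comma or a quote character.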
From hanooter at gmail.com Mon Jun 21 23:46:19 2010 From: hanooter at gmail.com (Kyle Hanson) Date: Mon, 21 Jun 2010 14:46:19 -0700 Subject: [lxml-dev] Wrapping one Tag around existing text Message-ID: Hello, I was trying to find all the text elements in an html doc whose parents are divs and make it so their parents were p=elements. e.g. turn this:

Title

text

stuff

into this:

Title

text

stuff

so far I have:

for text in (x for x in parsed.xpath('//div/text()') if len(x.strip())):
    # ''+text is needed because text is not recognized as a plain string
    # type, and str() chokes on the conversion because of unicode.
    p = builder.P(''+text)

I just don't know how to insert the created p element into the div while deleting the text node. I tried text.getparent().insert(text.getparent().index(text), p), but it says that the argument is supposed to be an _Element, not an _ElementStringResult.

Also, it says that the parent of the _ElementStringResult "text" is h1, and not div. Why is this? In order to get the div container (the true parent of 'text') I had to do text.getparent().getparent().

Thanks

From hanooter at gmail.com Tue Jun 22 05:23:41 2010
From: hanooter at gmail.com (Kyle Hanson)
Date: Mon, 21 Jun 2010 20:23:41 -0700
Subject: [lxml-dev] Wrapping one Tag around existing text
In-Reply-To:
References:
Message-ID:

Hey guys,

I actually just finished it. For future reference in case anybody needs it:

for text in (x for x in self.parsed.xpath('//div/text()') if len(x.strip())):
    p = E.P(''+text)
    if text.is_tail:
        # Tail text: getparent() is the preceding sibling element,
        # so the div is one level further up.
        textIndex = text.getparent().getparent().index(text.getparent()) + 1
        text.getparent().getparent().insert(textIndex, p)
        text.getparent().tail = None
    elif text.is_text:
        # Leading text: getparent() is already the div itself.
        text.getparent().insert(0, p)
        text.getparent().text = None

Although I do have some issues. My main issue is the way that text is handled with lxml in parsing an HTML document. I can understand .tail and .text attributes for XML, but it is my belief that LXML should handle text in HTML like an Element.
Because the tail text for an Element in LXML in the context of HTML has nothing to do with the Element, I believe that lxml.html should handle text like an element, and should be included in the list. e.g. for this: parsed = lxml.html.document_fromstring('
This is some textthat has linkswhich interrupts the text.In a couple of places.
') list(parsed.cssselect('div')[0]) should return [<_TextElement>, <_Element>, <_TextElement>, <_Element>, <_TextElement>, <_Element>, <_TextElement>] print str(parsed.cssselect('div')[0][0]) should return "This is some text" parsed.cssselect('div')[0][1] is the a container with the text "that has links" parsed.xpath('//text()') should return [<_TextElement>, <_TextElement>, <_TextElement>, <_TextElement>] Also _Element.index should work properly with text. parsed.cssselect('div')[0].index(parsed.xpath('//text()')[1]) would return 2 I believe this because it is more intuitive. In my example the text "a couple" should have only a sibling relationship with the A container, but lxml.html designs it so that the "parent" of the text is the A container and not the true parent (in which it would gain all of its attributes and non from the A container) the DIV. Why would you look in the A container when the text isn't even in it? Although I am sure that there is something preventing this from happening, I would appreciate if it was considered. -- Kyle Hanson On Mon, Jun 21, 2010 at 2:46 PM, Kyle Hanson wrote: > Hello, > > I was trying to find all the text elements in an html doc whose parents are > divs and make it so their parents were p=elements. > > e.g. > > turn this: >

Title

text

stuff

> > into this: > >

Title

text

stuff

> > so far I have: > > > for text in (x for x in parsed.xpath('//div/text()') if len(x.strip())): > -----p = builder.P(''+text) #I have to put ""+ because it doesnt recognize > text is not a string type and str() chokes on string conversion because of > unicode. > > I just dont know how to insert the created p element into the div while > deleting the text node. I tried > text.getparent().insert(text.getparent().index(text), p) but it says that > the argument is supposed to be an _Element not an _ElementStringResult > > Also it says that the _ElementStringResult "text" 's parent is h1, and not > div. Why is this? In order for me to get the div container (the true parent > of 'text') I had to do text.getparent().getparent() > > Thanks > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100621/3ded965a/attachment-0001.htm From stefan_ml at behnel.de Tue Jun 22 06:46:23 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 22 Jun 2010 06:46:23 +0200 Subject: [lxml-dev] Wrapping one Tag around existing text In-Reply-To: References: Message-ID: <4C20401F.2070905@behnel.de> Kyle Hanson, 22.06.2010 05:23: > parsed = lxml.html.document_fromstring('
This is some > textthat has linkswhich interrupts the text.In a couple > of places.
') >[...] > In my example the text "a > couple" should have only a sibling relationship with the A container, but > lxml.html designs it so that the "parent" of the text is the A container and > not the true parent Well, it *is* the 'true parent' in the tree model. You'd be rather surprised if the text wasn't accessible on the Element that getparent() returned. There isn't a sibling relationship for text, and I certainly don't want to add that as well. So, changing the parent would mean that you'd have to search all children of your proposed parent in order to find the text (and if you're unlucky, you'd find the text more than once that way). In the current implementation, all you have to do to get to the Element that holds the text is to call getparent(). And to get to the surrounding container of tail text, you can call getparent() twice. That's a lot simpler than an unsafe subtree search, or an additional sibling API just for tail text content. Stefan From stefan_ml at behnel.de Tue Jun 22 13:17:57 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 22 Jun 2010 13:17:57 +0200 Subject: [lxml-dev] Xml to flat file In-Reply-To: <20100621165412.962713as2yc3mdm8@boilermail.purdue.edu> References: <20100621165412.962713as2yc3mdm8@boilermail.purdue.edu> Message-ID: <4C209BE5.6060105@behnel.de> spuzhava at purdue.edu, 21.06.2010 22:54: > I was looking for a way to convert an xml fragment (_ElementTree > object) to a flat file (comma separated perhaps). Though I understand > this can be done quite easily by iterating, I just wanted to check if > there was already something built in the library for this purpose. I > tried going through the API (http://codespeak.net/lxml/api/index.html) > but wasn't able to lay my hands on something that did this. I would be > grateful if someone could let me know if they have come across a > function that does this in lxml. No, there's nothing pre-built. 
In any case, implementing this for your specific format should be rather simple and straight forward, but, IMHO, implementing it in a generic, configurable way that works with all sorts of different XML input formats and CSV (character separated values) output formats is futile. Stefan From leorochael at gmail.com Tue Jun 22 17:36:53 2010 From: leorochael at gmail.com (Leonardo Rochael Almeida) Date: Tue, 22 Jun 2010 12:36:53 -0300 Subject: [lxml-dev] broken link for 2.3alpha release on PyPI "simple" index Message-ID: Hi, The "simple" index for lxml on PyPI [1] has a "2.3alpha1 download_url" link that results in 404 [2] This is breaking "easy_install" based systems (ex. buildout) that point use "simple" PyPI index [3] [1] http://pypi.python.org/simple/lxml/ [2] http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz [3] http://pypi.python.org/simple/ Work-arounds include: - using the regular PyPI as index - explicitly requiring version 2.2.6 of lxml (ex. pinning the version in buildout) Cheers, Leo From sridharr at activestate.com Wed Jun 23 05:03:11 2010 From: sridharr at activestate.com (Sridhar) Date: Tue, 22 Jun 2010 20:03:11 -0700 Subject: [lxml-dev] broken link for 2.3alpha release on PyPI "simple" index In-Reply-To: References: Message-ID: <4C21796F.3080007@activestate.com> Stefan, I think you will also have to delete the PyPI release http://pypi.python.org/pypi/lxml/2.3alpha1 to make 2.2.6 the default. On 6/22/2010 8:36 AM, Leonardo Rochael Almeida wrote: > Hi, > > The "simple" index for lxml on PyPI [1] has a "2.3alpha1 download_url" > link that results in 404 [2] > > This is breaking "easy_install" based systems (ex. buildout) that > point use "simple" PyPI index [3] > > [1] http://pypi.python.org/simple/lxml/ > > [2] http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz > > [3] http://pypi.python.org/simple/ > > Work-arounds include: > > - using the regular PyPI as index > > - explicitly requiring version 2.2.6 of lxml (ex. 
pinning the version > in buildout) > > Cheers, > > Leo > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From stefan_ml at behnel.de Wed Jun 23 08:51:25 2010 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 23 Jun 2010 08:51:25 +0200 Subject: [lxml-dev] broken link for 2.3alpha release on PyPI "simple" index In-Reply-To: References: Message-ID: <4C21AEED.5090201@behnel.de> Leonardo Rochael Almeida, 22.06.2010 17:36: > The "simple" index for lxml on PyPI [1] has a "2.3alpha1 download_url" > link that results in 404 [2] > > This is breaking "easy_install" based systems (ex. buildout) that > point use "simple" PyPI index [3] > > [1] http://pypi.python.org/simple/lxml/ > > [2] http://pypi.python.org/packages/source/l/lxml/lxml-2.3alpha1.tar.gz > > [3] http://pypi.python.org/simple/ > > Work-arounds include: > > - using the regular PyPI as index > > - explicitly requiring version 2.2.6 of lxml (ex. pinning the version > in buildout) Works for me after deleting the broken download link. 

Thanks,
Stefan

From jholg at gmx.de Fri Jun 25 11:49:08 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 25 Jun 2010 11:49:08 +0200
Subject: [lxml-dev] some troubles building lxml dev version
Message-ID: <20100625094908.210960@gmx.net>

Hi,

a few quick notes as I've just had some difficulties to build the lxml dev
version from the dev instructions (http://codespeak.net/lxml/dev/build.html):

* easy_install 'Cython>=0.13' does not work as there seems not to be a 0.13
  release of Cython
* I tried the latest official Cython release 0.12.1, but cythoning the lxml
  sources failed
* I was then unsure what to download from cython.org as the repository
  layout seems a bit unusual (coming from a rather cvs/subversion
  background, that is - but still: no version-tagged repos, no 'trunk',
  'cython' denotes the latest release); also didn't find any wiki entry on
  how to download/checkout cython "trunk"

I finally tried hg clone http://hg.cython.org/cython-devel and built lxml
trunk with it, which just seemed to succeed.

One more note: cython-devel version info says 0.12.1.

Holger

From stefan_ml at behnel.de Fri Jun 25 12:18:24 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 25 Jun 2010 12:18:24 +0200
Subject: [lxml-dev] some troubles building lxml dev version
In-Reply-To: <20100625094908.210960@gmx.net>
References: <20100625094908.210960@gmx.net>
Message-ID: <4C248270.9040804@behnel.de>

Hi Holger,

sorry for the inconvenience.

jholg at gmx.de, 25.06.2010 11:49:
> a few quick notes as I've just had some difficulties to build the lxml
> dev version from the dev instructions
> (http://codespeak.net/lxml/dev/build.html):

I know, that will change as soon as Cython 0.13 is out.

> * easy_install 'Cython>=0.13' does not work as there seems not to be a
> 0.13 release of Cython

Right, I think I mentioned something like that in the last release notes.

> * I tried the latest official Cython release 0.12.1, but cythoning the
> lxml sources failed

The right version is 0.11.2 for lxml 2.2.x, but 0.13 for lxml 2.3. The trunk
requires 0.13 now, mainly due to some code cleanups on my side, trying to
make the code in lxml.etree less C-ish and more accessible, readable,
maintainable. The next Cython version really has some wonderful features in
that regard (and I'd really wish I had had all that available when starting
my work on lxml...).

> * I was then unsure what to download from cython.org as the repository
> layout seems a bit unusual (coming from a rather cvs/subversion
> background, that is - but still: no version-tagged repos, no 'trunk',
> 'cython' denotes the latest release); also didn't find any wiki entry on
> how to download/checkout cython "trunk"
>
> I finally tried hg clone http://hg.cython.org/cython-devel and built
> lxml trunk with it, which just seemed to succeed.

You can always get an archive from the download links at the left side of
the hg branch pages:

http://hg.cython.org/cython-devel/
http://hg.cython.org/cython-closures/

You used cython-devel, whereas I used cython-closures for 2.3alpha1. Both
will work just fine, but cython-closures is currently closer to what will
become Cython 0.13.

> One more note: cython-devel version info says 0.12.1.

Right. Changed now.

Stefan

From stefan_ml at behnel.de Fri Jun 25 12:57:36 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 25 Jun 2010 12:57:36 +0200
Subject: [lxml-dev] German 1-day course on XML/ET/lxml in Leipzig, Germany - September 3rd, 2010
Message-ID: <4C248BA0.708@behnel.de>

Hi everyone,

[English version]

I will be giving a beginners course on XML, ElementTree and lxml in September.
It will be held in German (although sufficient interest may convince me to
give it in English as well). It's called "High-Performance XML with Python"
and will take place at the Python Academy in Leipzig/Germany on September
3rd, 2010.

http://www.python-academy.de/Kurse/kurs_xml_python.html
http://www.python-academy.com/courses/prices.html

If you are interested, please contact me and/or Mike Müller of the PyA.

http://www.python-academy.com/contact.html

[German version]

On September 3rd, 2010, I will be giving a one-day, German-language
beginners course on XML, ElementTree and lxml. The course title is
"High-Performance XML mit Python"; the venue is the Python Academy in
Leipzig.

http://www.python-academy.de/Kurse/kurs_xml_python.html
http://www.python-academy.de/Kurse/preise.html

If you are interested, please contact me and/or Mike Müller of the PyA.

http://www.python-academy.de/kontakt.html

Have fun,

Stefan

From jholg at gmx.de Fri Jun 25 14:22:41 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 25 Jun 2010 14:22:41 +0200
Subject: [lxml-dev] objectify.DataElement() fixed to use stringify
Message-ID: <20100625122241.210920@gmx.net>

Hi,

I recently noticed that DataElement() doesn't make use of PyType.stringify.

Patch committed to trunk, with test:

Committed revision 75584.
0 pytaf at adevp02 .../lxml $ svn diff -rCOMMITTED
Index: src/lxml/tests/test_objectify.py
===================================================================
--- src/lxml/tests/test_objectify.py (revision 75482)
+++ src/lxml/tests/test_objectify.py (working copy)
@@ -318,6 +318,13 @@
         arg = objectify.DataElement(3.1415)
         self.assertRaises(ValueError, objectify.DataElement, arg,
                           _xsi="xsd:int")
+
+    def test_data_element_element_arg(self):
+        arg = objectify.Element('arg')
+        value = objectify.DataElement(arg)
+        self.assert_(isinstance(value, objectify.ObjectifiedElement))
+        for attr in arg.attrib:
+            self.assertEquals(value.get(attr), arg.get(attr))
 
     def test_root(self):
         root = self.Element("test")
@@ -1968,6 +1975,14 @@
         self.assertEquals(r.date.pyval, parse_date(stringify_date(time)))
         self.assertEquals(r.date.text, stringify_date(time))
 
+        date = objectify.DataElement(time)
+
+        self.assert_(isinstance(date, DatetimeElement))
+        self.assert_(isinstance(date.pyval, datetime))
+
+        self.assertEquals(date.pyval, parse_date(stringify_date(time)))
+        self.assertEquals(date.text, stringify_date(time))
+
     def test_object_path(self):
         root = self.XML(xml_str)
         path = objectify.ObjectPath( "root.c1.c2" )
Index: src/lxml/lxml.objectify.pyx
===================================================================
--- src/lxml/lxml.objectify.pyx (revision 75482)
+++ src/lxml/lxml.objectify.pyx (working copy)
@@ -1973,6 +1973,9 @@
             if dict_result is not NULL:
                 _pytype = (<PyType>dict_result).name
 
+    if _pytype is None:
+        _pytype = _pytypename(_value)
+
     if _value is None and _pytype != u"str":
         _pytype = _pytype or u"NoneType"
         strval = None
@@ -1984,11 +1987,12 @@
         else:
             strval = u"false"
     else:
-        strval = unicode(_value)
+        stringify = unicode
+        dict_result = python.PyDict_GetItem(_PYTYPE_DICT, _pytype)
+        if dict_result is not NULL:
+            stringify = (<PyType>dict_result).stringify
+        strval = stringify(_value)
 
-    if _pytype is None:
-        _pytype = _pytypename(_value)
-
     if _pytype is not None:
         if _pytype == u"NoneType" or _pytype == u"none":
             strval = None

Btw.: Before I noticed the existing stringify test in test_objectify I
thought I'd add my custom pure-python DecimalElement class to have something
to test. I didn't need to in the end but I wondered: What (and where?) about
adding this very simple DecimalElement class to lxml? Maybe have an extra
subdir for objectify element classes that are not enabled by default but
can be imported & registered?

I could also provide a DatetimeElement implementation, though this currently
depends on dateutil - but I don't think this should be a big problem, it
simply wouldn't work on machines without dateutil.

Holger

From jholg at gmx.de Fri Jun 25 14:49:12 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 25 Jun 2010 14:49:12 +0200
Subject: [lxml-dev] some troubles building lxml dev version
In-Reply-To: <4C248270.9040804@behnel.de>
References: <20100625094908.210960@gmx.net> <4C248270.9040804@behnel.de>
Message-ID: <20100625124912.210960@gmx.net>

Hi,

> sorry for the inconvenience.

No problem at all.

> jholg at gmx.de, 25.06.2010 11:49:
> > a few quick notes as I've just had some difficulties to build the lxml
> > dev version from the dev instructions
> > (http://codespeak.net/lxml/dev/build.html):
>
> I know, that will change as soon as Cython 0.13 is out.

I see. I just corrected the build doc to use quotation marks, and also the
doc/s5/Makefile to follow the PYTHON Makefile option:

Committed revision 75585.
$ svn diff -c75585 doc/
Index: doc/s5/Makefile
===================================================================
--- doc/s5/Makefile (revision 75584)
+++ doc/s5/Makefile (revision 75585)
@@ -1,3 +1,4 @@
+PYTHON?=python
 SLIDES=$(subst .txt,.html,$(wildcard *.txt))
 
@@ -4,7 +5,7 @@
 slides: $(SLIDES)
 
 %.html: %.txt
-	python rst2s5.py --current-slide --language=en $< $@
+	$(PYTHON) rst2s5.py --current-slide --language=en $< $@
 
 clean:
 	rm -f *~ $(SLIDES)
Index: doc/build.txt
===================================================================
--- doc/build.txt (revision 75584)
+++ doc/build.txt (revision 75585)
@@ -46,7 +46,7 @@
 you want to be an lxml developer, then you do need a working Cython
 installation. You can use EasyInstall_ to install it::
 
-    easy_install Cython>=0.13
+    easy_install "Cython>=0.13"
 
 lxml currently requires Cython 0.13, later release versions should work as
 well.

Holger

From manu3d at gmail.com Mon Jun 28 10:11:02 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Mon, 28 Jun 2010 09:11:02 +0100
Subject: [lxml-dev] AAAAAARGGHHH!!!
Message-ID:

AAAAAARGGHH!!!

I just installed the latest version of an SDK I'm using and they are now
based on Python 2.6.x!!!! I guess I'll have to go back to the previous
version of the SDK.

I'm noticing *here* that the hold up for a 2.6-compatible distribution of
lxml is the wait for a new build of libxml as 2.7.6 generates some crashes.
Is that still the case?

Manu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100628/a95cc745/attachment.htm

From stefan_ml at behnel.de Mon Jun 28 10:36:28 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 28 Jun 2010 10:36:28 +0200
Subject: [lxml-dev] AAAAAARGGHHH!!!
In-Reply-To:
References:
Message-ID: <4C285F0C.5080002@behnel.de>

Emanuele D'Arrigo, 28.06.2010 10:11:
> I'm noticing *here* that the hold
> up for a 2.6-compatible distribution of lxml is the wait for a new build of
> libxml as 2.7.6 generates some crashes. Is that still the case?

Well, lxml *is* Py2.6 compatible, just like it runs in Py2.3 and Py3.1.

If you are referring to Windows binaries, Sidnei is actually trying to build
binaries based on libxml2 2.7.3 now, as that currently seems to be the
best-behaving version. He ran into some problems, though, that I need to
look into. They'll eventually become available.

Stefan

From manu3d at gmail.com Mon Jun 28 11:23:54 2010
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Mon, 28 Jun 2010 10:23:54 +0100
Subject: [lxml-dev] AAAAAARGGHHH!!!
In-Reply-To: <4C285F0C.5080002@behnel.de>
References: <4C285F0C.5080002@behnel.de>
Message-ID:

On 28 June 2010 09:36, Stefan Behnel wrote:

> Emanuele D'Arrigo, 28.06.2010 10:11:
>
>> I'm noticing *here* that the hold
>>
>> up for a 2.6-compatible distribution of lxml is the wait for a new build
>> of
>> libxml as 2.7.6 generates some crashes. Is that still the case?
>>
>
> Well, lxml *is* Py2.6 compatible, just like it runs in Py2.3 and Py3.1.
>
> If you are referring to Windows binaries,

Ooops, I omitted that detail, didn't I? Yes, unfortunately I'm stuck in
Windows' world for now... =P

> Sidnei is actually trying to build binaries based on libxml2 2.7.3 now, as
> that currently seems to be the best-behaving version. He ran into some
> problems, though, that I need to look into. They'll eventually become
> available.
>

Thank you both then! Looking forward to the new release!

Manu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100628/57c0a336/attachment.htm

From fdrake at acm.org Mon Jun 28 15:12:49 2010
From: fdrake at acm.org (Fred Drake)
Date: Mon, 28 Jun 2010 09:12:49 -0400
Subject: [lxml-dev] Relax NG validation question
In-Reply-To:
References:
Message-ID:

On Tue, May 4, 2010 at 12:09 PM, Fred Drake wrote:
> I suspect something about the include structure of my schema is
> causing lxml this heartburn.
...
> Any ideas? I'm using Python 2.6.5 and lxml 2.2.6, libxml2 2.7.7, and
> libxslt 1.1.26.

I've not seen any responses to this; has anyone else had similar
experiences with RelaxNG includes?

I'd really like to be able to use lxml in my application, instead of
requiring a Java runtime, but this approaches being a blocker.

-Fred

--
Fred L. Drake, Jr.
"A storm broke loose in my mind." --Albert Einstein

From stefan_ml at behnel.de Mon Jun 28 15:56:26 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 28 Jun 2010 15:56:26 +0200
Subject: [lxml-dev] Relax NG validation question
In-Reply-To:
References:
Message-ID: <4C28AA0A.7010609@behnel.de>

Fred Drake, 04.05.2010 18:09:
> I've been using jing to validate documents against a schema written in
> the RelaxNG compact syntax. I'd like to use lxml to validate instead
> of using jing.
>
> After converting the schema to the XML syntax using trang, I'm loading
> the schema like so:
>
> >>> import lxml.etree
> >>> doc = lxml.etree.parse("top.rng")
> >>> schema = lxml.etree.RelaxNG(doc)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "relaxng.pxi", line 84, in lxml.etree.RelaxNG.__init__
> (src/lxml/lxml.etree.c:114962)
> RelaxNGParseError: xmlRelaxNG: include middle.rng has a define
> anyElement but not the included grammar, line 9

I get the same error with xmllint. This may indicate a problem in libxml2.
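
One possible way around the include handling is to expand the includes by
hand before the schema is built. What follows is only a rough sketch under
stated assumptions, not what libxml2 or lxml do internally: it uses the
standard library's ElementTree on Python 3 instead of lxml, the helper names
(rng, components, expand_includes) are made up for this illustration, it only
handles includes that are direct components of a grammar, it skips the spec's
check that every overriding define has a counterpart, and it follows one
reading of section 4.7 of the RELAX NG spec (resolve the inner includes
first, then drop the overridden start/define components). ElementTree also
drops namespace declarations that no tag uses, so schemas with prefixed names
in attribute values would need more care.

```python
import os
import xml.etree.ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"


def rng(tag):
    """Return the namespace-qualified ElementTree name of an RNG tag."""
    return "{%s}%s" % (RNG_NS, tag)


def components(element):
    """Yield (parent, child) pairs, looking through nested <div> elements."""
    for child in list(element):
        if child.tag == rng("div"):
            yield from components(child)
        else:
            yield element, child


def expand_includes(grammar, base_dir, active=()):
    """Hypothetical helper: replace each <include> by the grammar it names."""
    for _, include in list(components(grammar)):
        if include.tag != rng("include"):
            continue
        path = os.path.abspath(os.path.join(base_dir, include.get("href")))
        if path in active:
            raise ValueError("include loop via %s" % path)
        included = ET.parse(path).getroot()
        # Depth first: resolve the inner includes before looking at overrides.
        expand_includes(included, os.path.dirname(path), active + (path,))

        # Collect what the <include> element overrides ...
        overrides = [child for _, child in components(include)]
        replace_start = any(c.tag == rng("start") for c in overrides)
        replaced = {c.get("name") for c in overrides
                    if c.tag == rng("define")}
        # ... and drop those components from the included grammar.
        for parent, child in list(components(included)):
            if (child.tag == rng("start") and replace_start) or (
                    child.tag == rng("define")
                    and child.get("name") in replaced):
                parent.remove(child)

        # Turn <include> into a <div>: first the included grammar's
        # content (wrapped in its own <div>), then the overriding children.
        inner = ET.Element(rng("div"), dict(included.attrib))
        inner.extend(list(included))
        include.tag = rng("div")
        del include.attrib["href"]
        include.insert(0, inner)
    return grammar
```

The expanded tree can then be serialised with ET.tostring() and parsed again
with lxml before it is passed to RelaxNG(); whether libxml2 accepts the
result for a given schema is something to verify case by case.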

I never used the override feature that you deploy in the top.rng file, but
the error indicates that libxml2 expects "anyElement" to be a grammar rule
defined in the included RNG file that the including file overrides. My guess
is that the existence test happens before processing the inner import in
middle.rng, so that the overridden element isn't included yet.

The RNG spec isn't particularly clear here and does not even mention this
specific case.

http://www.relaxng.org/spec-20011203.html#IDAG3YR

This is a very special case that is worth bringing to the attention of the
libxml2 mailing list.

> Any ideas?

You can try running the include in a different tool and loading the
resulting RNG instead. No idea what to use for this, though, trang doesn't
do it, for example.

You can also try applying the includes yourself after parsing the document
and before handing it to RelaxNG(). Shouldn't be too hard now that it's
clear what you have to take care of. A recursive depth-first include
processor would do the job just fine.

Stefan

From powcarz at gmail.com Wed Jun 30 16:44:54 2010
From: powcarz at gmail.com (Piotr Owcarz)
Date: Wed, 30 Jun 2010 16:44:54 +0200
Subject: [lxml-dev] html.xpath returns not decoded unicode string
Message-ID:

Hi guys,

I try to parse html encoded in 'iso-8859-2' and with xpath want to get a
specific content. The content I usually get with xpath is python unicode,
but in this case it does not contain unicode code points but characters
encoded in 'iso-8859-2' just like it was never decoded and put in unicode
object as it is.

Let's take for example this url:
'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1',
and do something in command line:

>>> from lxml import html
>>> import urllib2
>>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
>>> root.docinfo.encoding
'iso-8859-2'
>>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
>>> header
u'Soboty, niedziele i \xb6wi\xeata'
>>> uc = u'Soboty, niedziele i święta'
>>> uc
u'Soboty, niedziele i \u015bwi\u0119ta'
>>> uc == header
False

I expect header and uc variables to be equal but they're not, while uc is
the actual unicode representation of my string.

I use this code in a script and run it on Windows with english locale and
the script has # -*- coding: utf-8 -*- directive.

Interesting thing is that the script passes the comparison uc==header on
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*13*&kier=1
but does not pass on
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*14*&kier=1.

Needless to say, the content I try to get (Soboty, niedziele i święta) on
both pages is binary the same, as well as declared encoding and they both
render correctly in a web browser.

Can anybody help me with this?

OS: Windows XP (english) 32 bit
Python: 2.6.5
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 2)
libxml compiled: (2, 7, 2)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

Regards
Piotr
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20100630/99fdab81/attachment.htm

From stefan_ml at behnel.de Wed Jun 30 17:34:40 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 30 Jun 2010 17:34:40 +0200
Subject: [lxml-dev] html.xpath returns not decoded unicode string
In-Reply-To:
References:
Message-ID: <4C2B6410.6080807@behnel.de>

Piotr Owcarz, 30.06.2010 16:44:
> I try to parse html encoded in 'iso-8859-2' and with xpath want to get a
> specific content. The content I usually get with xpath is python unicode,
> but in this case it does not contain unicode code points but characters
> encoded in 'iso-8859-2' just like it was never decoded and put in unicode
> object as it is.

Note that the problem at hand is unrelated to XPath. Only the parser has an
impact on the encodings.

> Let's take for example this url: '
> http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1',
> and do something in command line:
>
> >>> from lxml import html
> >>> import urllib2
> >>> root = html.parse(urllib2.urlopen('
> http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'
> ))
> >>> root.docinfo.encoding
> 'iso-8859-2'
> >>> header =
> root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
> >>> header
> u'Soboty, niedziele i \xb6wi\xeata'
> >>> uc = u'Soboty, niedziele i święta'
> >>> uc
> u'Soboty, niedziele i \u015bwi\u0119ta'
> >>> uc == header
> False

Seems to work for me:

In [1]: from lxml import html

In [2]: root = html.parse('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1')

In [3]: root.docinfo.encoding
Out[3]: 'iso-8859-2'

In [4]: root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
Out[4]: u'Soboty, niedziele i \u015bwi\u0119ta'

Even when I use urllib2, I get

In [14]: root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))

In [15]: header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()

In [16]: header
Out[16]: u'Soboty, niedziele i \u015bwi\u0119ta'

> I use this code in a script and run it on Windows with english locale and
> the script has # -*- coding: utf-8 -*- directive.

That doesn't matter.

> Interesting thing is that the script passes the comparison uc==header on
> http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*13*&kier=1
> but does not pass on
> http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*14*&kier=1.

Same result for both on my side.

> Needless to say, the content I try to get (Soboty, niedziele i święta) on
> both pages is binary the same, as well as declared encoding and they both
> render correctly in a web browser.

Note that "renders correctly in a web browser" is not a good indicator for a
page being valid HTML. Browsers are extremely advanced when dealing with
broken HTML. But once a page is broken, there is no such thing as "correct"
behaviour.

> Can anybody help me with this?
>
> OS: Windows XP (english) 32 bit
> Python: 2.6.5
> lxml.etree: (2, 2, 0, 0)
> libxml used: (2, 7, 2)
> libxml compiled: (2, 7, 2)
> libxslt used: (1, 1, 24)
> libxslt compiled: (1, 1, 24)

I'm using lxml 2.3alpha1 and libxml2 2.7.6. The libxml2 version may make a
difference here. Try the lxml 2.2.4 binaries for Windows, I think they use a
newer lib version.

Stefan
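
For reference, the u'\xb6wi\xeata' from the original report is exactly what
comes out when the ISO-8859-2 bytes are mapped to code points one to one, as
if the page were Latin-1. The sketch below only illustrates that symptom and
a stopgap repair; it is written for Python 3 with the standard library only,
the helper names (looks_undecoded, repair) are made up for the illustration,
and it says nothing about where inside the parser the decoding went wrong.

```python
EXPECTED = "Soboty, niedziele i \u015bwi\u0119ta"

# The bytes as a server would send them for an iso-8859-2 page.
RAW = EXPECTED.encode("iso-8859-2")


def repair(text, encoding="iso-8859-2"):
    """Undo a byte-for-byte (Latin-1 style) decoding of `encoding` data."""
    return text.encode("latin-1").decode(encoding)


def looks_undecoded(text, encoding="iso-8859-2"):
    """Heuristic: would reinterpreting `text` as `encoding` bytes change it?"""
    try:
        return repair(text, encoding) != text
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1 (or invalid bytes): not this symptom.
        return False


# Decoding with the wrong codec reproduces the string from the report.
broken = RAW.decode("latin-1")
print(ascii(broken))
print(ascii(repair(broken)))
print(looks_undecoded(broken), looks_undecoded(EXPECTED))
```

The check is only a heuristic: a string that legitimately contains Latin-1
characters which differ between the two code pages would be flagged as well,
so upgrading to binaries built against a newer libxml2 remains the actual
fix to try first.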