From dfedoruk at gmail.com Tue Apr 1 16:13:03 2008
From: dfedoruk at gmail.com (Dmitri Fedoruk)
Date: Tue, 1 Apr 2008 18:13:03 +0400
Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests
Message-ID:

Hi all,

I'm using lxml-2.0_1 now (I have not upgraded to the most recent versions, as I have not noticed any features relevant to me), libxml2-2.6.30, libxslt-1.1.22, FreeBSD 6.2 and 7.0; the application runs within mod_python / apache 2.2.8.

My situation is pretty straightforward: fetch xml as plain text via http, parse it and get an etree object, then apply xslt and get the resulting html.

The code is the following:

self.xmlParser = etree.XMLParser(no_network = False, resolve_entities = False, load_dtd = True )

I use load_dtd=True as sometimes I encounter html entities in my input data. They are included in my dtd in this way:

%HTMLlat1;
%HTMLsymbol;
%HTMLspecial;

Then eventually it comes up to
...
xmlres = etree.parse( StringIO.StringIO( reply['data'] ), self.xmlParser )

And here I have serious problems. Parsing time is usually up to 100 ms (even this is critical time for me). But sometimes I have 3, 5 and even 60 seconds (!) of parsing. This situation happens under a heavy load (~20 simultaneous parsings/transformations per sec).

So, I have several questions:
1) What am I doing wrong?
2) Is there any way to limit the runtime of etree.parse? Is there any way to kill a thread maybe? I cannot afford to wait even 150 ms, to say nothing of 1 second and more.

Any help would be appreciated!

Cheers,
Dmitri

From jkrukoff at ltgc.com Tue Apr 1 22:00:03 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Tue, 01 Apr 2008 14:00:03 -0600
Subject: [lxml-dev] ElementTree.find does not accept QName objects.
In-Reply-To: <47EE1D00.3000109@behnel.de>
References: <1206736251.5734.29.camel@localhost.localdomain> <47EE1D00.3000109@behnel.de>
Message-ID: <1207080003.6208.1.camel@localhost.localdomain>

On Sat, 2008-03-29 at 11:42 +0100, Stefan Behnel wrote:
> Hi,
>
> John Krukoff wrote:
> > Since I was the one that complained about the find method on Elements
> > not accepting QNames, it's probably not surprising that I expected them
> > to work with the ElementTree find method as well. Instead an unsliceable
> > error is thrown, due to the value being expected to be a string
>
> Sure, here's the obvious patch.
>
> BTW, I expect ET to have the same problem here.
>
> Stefan
>

Thanks for your always quick response. Yeah, ET has the same issue, but then it doesn't accept QNames for element.find either. Only one of many reasons I gave up on ET compatibility a long time ago.

-- 
John Krukoff
Land Title Guarantee Company

From jkrukoff at ltgc.com Tue Apr 1 22:24:22 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Tue, 01 Apr 2008 14:24:22 -0600
Subject: [lxml-dev] ElementTree.find does not accept QName objects.
In-Reply-To: <47EE1D00.3000109@behnel.de>
References: <1206736251.5734.29.camel@localhost.localdomain> <47EE1D00.3000109@behnel.de>
Message-ID: <1207081462.6208.13.camel@localhost.localdomain>

Okay, that's weird. I knew that I'd been able to use QNames with ET in the past, but when I double checked I found that it didn't work for me. It looks like I just managed to hit some magic special case in ET to make this work at all, as this works:

Python 2.5.1 (r251:54863, Jan 8 2008, 15:02:32)
[GCC 4.1.2 (Gentoo 4.1.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from elementtree import ElementTree
>>> x = ElementTree.XML( '' )
>>> x.find( 'b' )
>>> x.find( ElementTree.QName( 'b' ) )

But this doesn't:

Python 2.5.1 (r251:54863, Jan 8 2008, 15:02:32)
[GCC 4.1.2 (Gentoo 4.1.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from elementtree import ElementTree
>>> x = ElementTree.XML( '' )
>>> x.find( ElementTree.QName( 'b' ) )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/elementtree/ElementTree.py", line 327, in find
    return ElementPath.find(self, path)
  File "/usr/lib/python2.5/site-packages/elementtree/ElementPath.py", line 183, in find
    return _compile(path).find(element)
  File "/usr/lib/python2.5/site-packages/elementtree/ElementPath.py", line 173, in _compile
    p = Path(path)
  File "/usr/lib/python2.5/site-packages/elementtree/ElementPath.py", line 69, in __init__
    tokens = xpath_tokenizer(path)
TypeError: expected string or buffer

It looks like it only works when ET.find has already been called with the string value of the same name that a following QName find specifies. Some internal caching, perhaps?

In any case, it does look like accepting QNames at all is a bug in ET, or at least an accident. Don't know what that means for lxml, but it would seem to me that strict compatibility would mean that find should be restricted to strings.

As for my use case, I suppose I can always just wrap all my find parameters with str to make it work.

On Sat, 2008-03-29 at 11:42 +0100, Stefan Behnel wrote:
> Hi,
>
> John Krukoff wrote:
> > Since I was the one that complained about the find method on Elements
> > not accepting QNames, it's probably not surprising that I expected them
> > to work with the ElementTree find method as well. Instead an unsliceable
> > error is thrown, due to the value being expected to be a string
>
> Sure, here's the obvious patch.
>
> BTW, I expect ET to have the same problem here.
>
> Stefan
>

-- 
John Krukoff
Land Title Guarantee Company

From jholg at gmx.de Fri Apr 4 12:58:25 2008
From: jholg at gmx.de (Holger Joukl)
Date: Fri, 04 Apr 2008 12:58:25 +0200
Subject: [lxml-dev] [lxml] adding __float__, __int__ etc. to objectify.StringElement
Message-ID: <20080404110548.115260@gmx.net>

Hi,

I suggest adding __float__, __int__ etc. methods to lxml.objectify StringElement, to enable things like

>>> float(objectify.DataElement("234"))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: float() argument must be a string or a number
>>>

These will just try to invoke the very same operation on the underlying pyval.

Maybe there are other classes where such methods would be helpful (BoolElement comes to mind, but I'll have to look).

Any objections? I can add this plus some tests otherwise.

Holger

-- 
Psssst! Have you heard of the new GMX MultiMessenger yet? It works with them all: http://www.gmx.net/de/go/multimessenger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080404/1adc604a/attachment.htm

From stefan_ml at behnel.de Sun Apr 6 16:20:18 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 06 Apr 2008 16:20:18 +0200
Subject: [lxml-dev] ElementTree.find does not accept QName objects.
In-Reply-To: <1207081462.6208.13.camel@localhost.localdomain>
References: <1206736251.5734.29.camel@localhost.localdomain> <47EE1D00.3000109@behnel.de> <1207081462.6208.13.camel@localhost.localdomain>
Message-ID: <47F8DC22.5080706@behnel.de>

John Krukoff wrote:
> It looks like it only works when ET.find has already been called with
> the string value of the same name that a following QName find specifies.
> Some internal caching, perhaps?

There is an internal cache in ElementPath.py that might cut in here.

> In any case, it does look like accepting QNames at all is a bug in ET,
> or at least an accident.
> Don't know what that means for lxml, but it
> would seem to me that strict compatibility would mean that find should
> be restricted to strings.

I think it's right to accept QName objects wherever tag names are accepted. So it's ET that's wrong here.

Stefan

From stefan_ml at behnel.de Sun Apr 6 16:26:59 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 06 Apr 2008 16:26:59 +0200
Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests
In-Reply-To:
References:
Message-ID: <47F8DDB3.8070603@behnel.de>

Hi,

Dmitri Fedoruk wrote:
> The code is the following:
> self.xmlParser = etree.XMLParser(no_network = False, resolve_entities
> = False, load_dtd = True )
>
> I use load_dtd=True as sometimes I encounter html entities in my input
> data. They are included in my dtd in this way:
>
> %HTMLlat1;
>
> %HTMLsymbol;
>
> %HTMLspecial;
>
> Then eventually it comes up to
> ...
> xmlres = etree.parse( StringIO.StringIO( reply['data'] ), self.xmlParser )
>
> And here I have serious problems. Parsing time is usually up to 100
> ms (even this is critical time for me). But sometimes I have 3, 5 and
> even 60 seconds (!) of parsing. This situation happens under a heavy
> load (~20 simultaneous parsings/transformations per sec).
>
> So, I have several questions:
> 1) What am I doing wrong?
> 2) Is there any way to limit the runtime of the etree.parse? Is there
> any way to kill a thread maybe? I can not afford to wait even 150 ms,
> to say nothing about 1 second and more.

It seems you only want to parse DTDs locally from disc, so setting "no_network=True" (which is the default in lxml 2.0) should prevent any accidental remote access.

Does that help?

Stefan

From stefan_ml at behnel.de Sun Apr 6 16:32:40 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 06 Apr 2008 16:32:40 +0200
Subject: [lxml-dev] adding __float__, __int__ etc.
 to objectify.StringElement
In-Reply-To: <20080404110548.115260@gmx.net>
References: <20080404110548.115260@gmx.net>
Message-ID: <47F8DF08.1090402@behnel.de>

Hi,

Holger Joukl wrote:
> I suggest adding __float__, __int__ etc. methods
>
> to lxml.objectify StringElement, to enable things like
>
> >>> float(objectify.DataElement("234"))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: float() argument must be a string or a number
>
> These will just try to invoke the very same operation on the underlying pyval.

Ok with me, StringElement should behave as much like a string as possible.

> Maybe there are other classes where such methods would
> be helpful (BoolElement comes to mind, but I'll have to look).

You mean because of the int/bool duality in Python, but I don't think that's something we should easily enable without a compelling use case. Remember that it would mean converting the string value "true" to int(1), I don't think that's obvious behaviour.

> Any objections? I can add this plus some tests otherwise.

Go ahead.

Stefan

From jholg at gmx.de Mon Apr 7 15:33:07 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 07 Apr 2008 15:33:07 +0200
Subject: [lxml-dev] adding __float__, __int__ etc. to objectify.StringElement
In-Reply-To: <47F8DF08.1090402@behnel.de>
References: <20080404110548.115260@gmx.net> <47F8DF08.1090402@behnel.de>
Message-ID: <20080407133428.293320@gmx.net>

Hi,

Checked in as revision 53527:

$ svn diff -r 53526
Index: src/lxml/tests/test_objectify.py
===================================================================
--- src/lxml/tests/test_objectify.py    (revision 53526)
+++ src/lxml/tests/test_objectify.py    (working copy)
@@ -815,7 +815,27 @@
         el = objectify.DataElement(s)
         val = 5
         self.assertRaises(TypeError, el.__mod__, val)
+
+    def test_type_str_as_int(self):
+        v = "1"
+        el = objectify.DataElement(v)
+        self.assertEquals(int(el), 1)
 
+    def test_type_str_as_long(self):
+        v = "1"
+        el = objectify.DataElement(v)
+        self.assertEquals(long(el), 1)
+
+    def test_type_str_as_float(self):
+        v = "1"
+        el = objectify.DataElement(v)
+        self.assertEquals(float(el), 1)
+
+    def test_type_str_as_complex(self):
+        v = "1"
+        el = objectify.DataElement(v)
+        self.assertEquals(complex(el), 1)
+
     def test_type_str_mod_data_elements(self):
         s = "%d %f %s %r"
         el = objectify.DataElement(s)
Index: src/lxml/lxml.objectify.pyx
===================================================================
--- src/lxml/lxml.objectify.pyx (revision 53526)
+++ src/lxml/lxml.objectify.pyx (working copy)
@@ -773,6 +773,18 @@
     def __mod__(self, other):
         return _strValueOf(self) % other
 
+    def __int__(self):
+        return int(textOf(self._c_node))
+
+    def __long__(self):
+        return long(textOf(self._c_node))
+
+    def __float__(self):
+        return float(textOf(self._c_node))
+
+    def __complex__(self):
+        return complex(textOf(self._c_node))
+
 cdef class NoneElement(ObjectifiedDataElement):
     def __str__(self):
         return "None"

> You mean because of the int/bool duality in Python, but I don't think that's
> something we should easily enable without a compelling use case. Remember that
> it would mean converting the string value "true" to int(1), I don't think
> that's obvious behaviour.

Yes, I was referring to that:

>>> int(True)
1
>>> float(True)
1.0
>>> long(True)
1L
>>> complex
>>> complex(True)
(1+0j)
>>>

I actually wasn't aware of that behaviour of Python booleans. And this is definitely no priority for me. Then again, one could argue that BoolElement should behave as much as a native bool in Python, only that its XML representation is the string value "true".
And there are already subtleties for BoolElement:

>>> root = etree.fromstring("true")
>>> type(root.x)
>>> root.x.text
'true'
>>> str(root.x)
'True'
>>>

Cheers,
Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080407/f181fae2/attachment.htm

From stefan_ml at behnel.de Tue Apr 8 09:52:11 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 08 Apr 2008 09:52:11 +0200
Subject: [lxml-dev] adding __float__, __int__ etc. to objectify.StringElement
In-Reply-To: <20080407133428.293320@gmx.net>
References: <20080404110548.115260@gmx.net> <47F8DF08.1090402@behnel.de> <20080407133428.293320@gmx.net>
Message-ID: <47FB242B.6000504@behnel.de>

Hi,

jholg at gmx.de wrote:
>>>> int(True)
> 1
>>>> float(True)
> 1.0
>>>> long(True)
> 1L
>>>> complex
>
>>>> complex(True)
> (1+0j)
>
> I actually wasn't aware of that behaviour of Python booleans.
> And this is definitely no priority for me. Then again, one could argue
> that BoolElement should behave as much as a native bool in Python,
> only that its XML representation is the string value "true".

Hmm, I buy that. As long as the conversion is explicit, I think objectify Elements /should/ behave as their Python counter types.

I'll check if inheriting from IntElement does the right thing.

Stefan

From jholg at gmx.de Wed Apr 9 08:29:04 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 09 Apr 2008 08:29:04 +0200
Subject: [lxml-dev] adding __float__, __int__ etc. to objectify.StringElement
In-Reply-To: <47FB242B.6000504@behnel.de>
References: <20080404110548.115260@gmx.net> <47F8DF08.1090402@behnel.de> <20080407133428.293320@gmx.net> <47FB242B.6000504@behnel.de>
Message-ID: <20080409063140.167580@gmx.net>

Hi,

> > I actually wasn't aware of that behaviour of Python booleans.
> > And this is definitely no priority for me.
> > Then again, one could argue
> > that BoolElement should behave as much as a native bool in Python,
> > only that its XML representation is the string value "true".
>
> Hmm, I buy that. As long as the conversion is explicit, I think objectify
> Elements /should/ behave as their Python counter types.
>
> I'll check if inheriting from IntElement does the right thing.
>

Maybe the __int__, __float__ etc. methods should even go to the ObjectifiedDataElement class? So basically every explicit to-number conversion for data elements would work right out of the box, if the corresponding pyval class supports it.

For BoolElement, you'd need to override it anyway, as
str(textOf(self._c_node))
will not work for "true".

Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080409/316d666b/attachment.htm

From dfedoruk at gmail.com Wed Apr 9 08:59:03 2008
From: dfedoruk at gmail.com (=?KOI8-R?B?5M3J1NLJyiDmxcTP0tXL?=)
Date: Wed, 9 Apr 2008 10:59:03 +0400
Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests
In-Reply-To: <47F8DDB3.8070603@behnel.de>
References: <47F8DDB3.8070603@behnel.de>
Message-ID: <69B9724A-1BE8-4644-8067-DF8F743FAA2D@gmail.com>

Hi,

> It seems you only want to parse DTDs locally from disc, so setting
> "no_network=True" (which is the default in lxml 2.0) should prevent
> any accidental remote access.
>

Eventually it turned out that I'm working fine without DTD. So, setting no_network = True and load_dtd = False really solved the problem. The parsing time is almost insignificant now.

To be frank, I found that out by accident. I encountered problems with machines carrying DTD, and every parsing began to take 60 sec. I was very quick to disable DTD and felt relieved to see it working, and working fast :)

Nevertheless, thanks for the advice!
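
The parser settings described above can be sketched as follows. This is only a minimal illustration, assuming a current lxml release on Python 3; the payload in `data` is a hypothetical stand-in for reply['data'] and contains no entity references, since undefined entities would fail to parse once the DTD is no longer loaded.

```python
from io import BytesIO

from lxml import etree

# Parser configured as described: no network access, no DTD loading.
parser = etree.XMLParser(no_network=True, load_dtd=False, resolve_entities=False)

# Hypothetical payload standing in for reply['data'].
data = b"<results><item>python lxml</item></results>"

# Parse from an in-memory buffer, as in the original etree.parse() call.
tree = etree.parse(BytesIO(data), parser)
root = tree.getroot()
print(root.tag, root[0].text)
```

Reusing one parser object across requests, as the original code does with self.xmlParser, avoids rebuilding the configuration on every call.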
If you're interested, here it goes:
http://beta.rambler.ru/srch?query=python+lxml&searchtype=web
This is a search engine frontend, xml\xslt based, everything is carried via lxml.

Cheers,
Dmitri

From stefan_ml at behnel.de Wed Apr 9 09:02:44 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 9 Apr 2008 09:02:44 +0200 (CEST)
Subject: [lxml-dev] adding __float__, __int__ etc. to objectify.StringElement
In-Reply-To: <20080409063140.167580@gmx.net>
References: <20080404110548.115260@gmx.net> <47F8DF08.1090402@behnel.de> <20080407133428.293320@gmx.net> <47FB242B.6000504@behnel.de> <20080409063140.167580@gmx.net>
Message-ID: <52309.194.114.62.66.1207724564.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

jholg at gmx.de wrote:
> Maybe the __int__, __float__ etc. methods should even go the
> the ObjectifiedDataElement class? So basically every explicit to-number
> conversion for data elements would work right out of the box, if the
> corresponding pyval class supports it.

They behave differently for numbers and strings now, though. int(string_element) will do a straight int(string_element.text), while int(number_element) will first parse the number according to the type rules and then convert the result to an int. I changed this for 2.1, as int(element.text) will not work for non-int types such as a float string, for example. I may even consider it a bug to be fixed in 2.0, but I find it safer to leave the change for the new release series.

> For BoolElement, you'd need to override it anyway, as
> str(textOf(self._c_node))
> will not work for "true".

Same problem.
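
The difference between the two conversion paths can be sketched like this. It is a minimal illustration, assuming a current lxml release in which StringElement implements __int__ and __float__ as in the patch discussed in this thread; the explicit _pytype arguments are only there to make the element types unambiguous.

```python
from lxml import objectify

# A string-typed data element: int()/float() operate on its text directly.
string_el = objectify.DataElement("234", _pytype="str")
as_int = int(string_el)
as_float = float(string_el)

# A float-typed data element: the value is parsed as a float first,
# then converted to int.
float_el = objectify.DataElement("2.5", _pytype="float")
truncated = int(float_el)

# The same text held by a string element cannot be converted by a plain
# int(text) call, which is the limitation mentioned above.
try:
    int(objectify.DataElement("2.5", _pytype="str"))
    string_conversion_failed = False
except ValueError:
    string_conversion_failed = True

print(as_int, as_float, truncated, string_conversion_failed)
```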
Stefan

From stefan_ml at behnel.de Wed Apr 9 09:46:17 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 9 Apr 2008 09:46:17 +0200 (CEST)
Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests
Message-ID: <9172.194.114.62.66.1207727177.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

>> It seems you only want to parse DTDs locally from disc, so setting
>> "no_network=True" (which is the default in lxml 2.0) should prevent any
>> accidental remote access.
>
> Eventually it turned out that I'm working fine without DTD. So,
> setting no_network = True and load_dtd = False really solved the problem.

Hmm, do you really need to turn off DTD loading or is disabling network access enough? I wouldn't expect loading the DTD from the disk cache to take that much time (although, if you can live without it and time is really critical, then it's obviously better to save that bit of time also).

> The parsing time is almost insignificant now.

Shameless plug:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

> If you're interested, here it goes:
> http://beta.rambler.ru/srch?query=python+lxml&searchtype=web
> This is a search engine frontend, xml\xslt based, everything is
> carried via lxml.

Nice. That's a meta search engine, right? Will that site stay online? I.e. does it make sense to set a link from our "who uses lxml" FAQ entry?

http://codespeak.net/lxml/FAQ.html#who-uses-lxml

Stefan

From jwashin at vt.edu Wed Apr 9 14:00:58 2008
From: jwashin at vt.edu (Jim Washington)
Date: Wed, 09 Apr 2008 08:00:58 -0400
Subject: [lxml-dev] sednaobject - Pythonic interface to Sedna XML Database
Message-ID: <47FCAFFA.9060703@vt.edu>

I've put out a new alpha (0.10alpha2) of zif.sedna, a Python adapter to Sedna, a multi-user XML database, at the Python Cheese Shop.

The new alpha has a start at objectifying XML from the Sedna database in a manner kind of like sqlobject does for SQL.
The aim is to make easy, pythonic CRUD (create, read, update, delete) operations for the XML store. Push data in, get data back out.

sednaobject, now included in zif.sedna, provides three classes.

First is SednaXQuery, which gives you list-like semantics for the results of any arbitrary XQuery or XPath expression. You init it with a Sedna cursor, a query statement, and an optional parser method. Then, you can iterate, or obtain items by index or slice. If you do not provide a parser, you get items as unicode strings.

Second is SednaContainer, which behaves like a SednaXQuery, except it is read-write. Like SednaXQuery, you init it with a cursor, a query statement, and an optional parser method. The query statement must refer to exactly one element in the database. This is the container, and you can obtain and replace items in the container by index. Slicing works for retrieval, and append, remove, insert, and del work as per the elementtree API.

Third is SednaObjectifiedElement, which also operates on a single element in the database. SednaObjectifiedElement is a thin wrapper around lxml.objectify. Alter the item with the objectify API, and save(). Thanks, lxml team, for making this really easy!

Since, in XML, an element is an element is an element, you can use the second and third sednaobject classes on any element in the database. Which you would use in a situation depends on the aspect you are interested in at the moment.

I see Sedna as an attractive middle ground between SQL databases and object databases like ZODB. Data size is practically unlimited. You can alter a small portion of a data set transactionally, in a multi-user environment, without a full rewrite of the data. Like SQL databases, it uses a query language to obtain and format just the data you want, from anywhere in the database. XQuery has nice built-in functionality for counting, filtering, reordering, doing math, etc., on items.
Like ZODB, you can store and retrieve items of arbitrary complexity without too much fuss. A Sedna database can have multiple XML documents and multiple 'collections' of (similar) documents that can be queried together or separately.

The Sedna team just released version 3.0 of the Sedna server, which has improved speed and reliability. 3.0 now runs on Mac OSX, in addition to x86 Linux and Win2K/XP.

zif.sedna with sednaobject version 0.10 is alpha, so interfaces can and probably will change. The included doctests all pass using a Sedna 2.x server. I have not included the new features of 3.0 (e.g., faster, read-only queries) yet. Testing with a 3.0 server results in a single harmless failure.

Speed? I'm getting 60-70 single-query transactions per second through Pylons on a 2 GHz Opteron. Transaction speed of course depends on how many queries are in the transaction and what the queries do.

zif.sedna: http://pypi.python.org/pypi/zif.sedna/
Sedna: http://modis.ispras.ru/sedna/

I am currently the sole developer for zif.sedna. Feedback is welcome.

- Jim Washington

From xphuture at gmail.com Thu Apr 10 01:16:56 2008
From: xphuture at gmail.com (Fabien)
Date: Thu, 10 Apr 2008 01:16:56 +0200
Subject: [lxml-dev] Help getting lxml to work reliably on MacOS-X
In-Reply-To: <20080314125920.01c867a1@bhuda.mired.org>
References: <47A6D0B1.5020600@behnel.de> <622afeaa0803140757k2854afccx26d87167e4b6a1b8@mail.gmail.com> <47DAA828.6040507@behnel.de> <20080314125920.01c867a1@bhuda.mired.org>
Message-ID: <622afeaa0804091616r20357a9et8b630694a5753ce0@mail.gmail.com>

Hello,

> IIRC, macports builds binaries for your system, not universal
> binaries. /usr/bin/python, on the other hand, is a universal
> binary. See if the library ports have a universal variant, and if so,
> try installing that.

Yes! It nearly works.
Now when I'm trying to install lxml with easy_install I get the following (in verbose mode):

http://pastebin.com/f4c1468c8

The build seems to be ok, but it fails when installing.

-- 
Fabien SCHWOB
_____________________________________________________________
Behind every bug there is a developer, a man who made a mistake.
(Well, OK, sometimes it takes several of them.)

From stefan_ml at behnel.de Fri Apr 11 11:59:39 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 11 Apr 2008 11:59:39 +0200
Subject: [lxml-dev] CDATA and lxml
In-Reply-To: <5ebe3201-80a6-45ba-8b88-e8ece6883efa@w8g2000prd.googlegroups.com>
References: <5ebe3201-80a6-45ba-8b88-e8ece6883efa@w8g2000prd.googlegroups.com>
Message-ID: <47FF368B.4030708@behnel.de>

Silfheed wrote:
> So first off I know that CDATA is generally hated and just shouldn't
> be done, but I'm simply required to parse it and spit it back out.
> Parsing is pretty easy with lxml, but it's the spitting back out
> that's giving me issues. The fact that lxml strips all the CDATA
> stuff off isnt really a big issue either, so long as I can create
> CDATA blocks later with <>&'s showing up instead of &lt;&gt;&amp; .
> I've scoured through the lxml docs, but probably not hard enough, so
> anyone know the page I'm looking for or have a quick how to?

There's nothing in the docs because lxml doesn't allow you to create CDATA sections. You're not the first one asking that, but so far, no one really had a take on this.

It's not as trivial as it sounds. Removing the CDATA sections in the parser is just for fun. It simplifies the internal tree traversal and text aggregation, so this would be affected if we allowed CDATA content in addition to normal text content. It's not that hard, it's just that it hasn't been done so far.
Stefan

From stefan_ml at behnel.de Fri Apr 11 12:49:18 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 11 Apr 2008 12:49:18 +0200
Subject: [lxml-dev] CDATA and lxml
In-Reply-To: <47FF368B.4030708@behnel.de>
References: <5ebe3201-80a6-45ba-8b88-e8ece6883efa@w8g2000prd.googlegroups.com> <47FF368B.4030708@behnel.de>
Message-ID: <47FF422E.5090104@behnel.de>

Stefan Behnel wrote:
> It's not as trivial as it sounds. Removing the CDATA sections in the parser is
> just for fun.

... *not* just for fun ... obviously ...

Stefan

From stefan_ml at behnel.de Fri Apr 11 19:33:11 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 11 Apr 2008 19:33:11 +0200
Subject: [lxml-dev] CDATA and lxml
In-Reply-To: <47FF368B.4030708@behnel.de>
References: <5ebe3201-80a6-45ba-8b88-e8ece6883efa@w8g2000prd.googlegroups.com> <47FF368B.4030708@behnel.de>
Message-ID: <47FFA0D7.1030901@behnel.de>

Hi again,

Stefan Behnel wrote:
> Silfheed wrote:
>> So first off I know that CDATA is generally hated and just shouldn't
>> be done, but I'm simply required to parse it and spit it back out.
>> Parsing is pretty easy with lxml, but it's the spitting back out
>> that's giving me issues. The fact that lxml strips all the CDATA
>> stuff off isnt really a big issue either, so long as I can create
>> CDATA blocks later with <>&'s showing up instead of &lt;&gt;&amp; .
>> I've scoured through the lxml docs, but probably not hard enough, so
>> anyone know the page I'm looking for or have a quick how to?
>
> There's nothing in the docs because lxml doesn't allow you to create CDATA
> sections. You're not the first one asking that, but so far, no one really had
> a take on this.

So I gave it a try, then. In lxml 2.1, you will be able to do this:

>>> root = Element("root")
>>> root.text = CDATA('test')
>>> tostring(root)
'<root><![CDATA[test]]></root>'

This does not work for .tail content, only for .text content (no technical reason, I just don't see why that should be enabled).
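
As a self-contained variant of the session above, here is a minimal sketch assuming a current lxml release on Python 3, where tostring() returns bytes by default. The sample text is hypothetical and was chosen to contain the characters that would otherwise be escaped.

```python
from lxml import etree

root = etree.Element("root")

# Wrap the text in CDATA so that <, > and & are written out unescaped.
root.text = etree.CDATA("a < b && c > d")
serialized = etree.tostring(root)

# Reading the text back yields the plain string content.
plain = root.text

print(serialized)
print(plain)
```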
There's also a parser option "strip_cdata" now that allows you to leave CDATA sections in the tree. However, they will *not* behave any differently from normal text, so you can't even see at the API level that you are dealing with CDATA. If you want to be really, really sure, you can always do this:

>>> root.text = CDATA(root.text)

Hope that helps,
Stefan

From brunobg at gmail.com Sat Apr 12 22:38:20 2008
From: brunobg at gmail.com (Bruno)
Date: Sat, 12 Apr 2008 20:38:20 +0000 (UTC)
Subject: [lxml-dev] Weird errors in tostring
Message-ID:

Hi,

I'm getting a weird error in lxml.html.tostring; it happens on one machine but not on another. Both are using lxml 2.0.2, but one has python 2.5 (which works all the time) and the other python 2.4 (which doesn't). Here's the relevant backtrace:

  File "/home/spyder/spyder/core/base.py", line 289, in treetostring
    return tostring(root, method='xml', encoding=unicode)
  File "/usr/lib/python2.4/site-packages/lxml-2.0.2-py2.4-linux-i686.egg/lxml/html/__init__.py", line 1313, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2455, in lxml.etree.tostring
  File "serializer.pxi", line 61, in lxml.etree._tostring
  File "serializer.pxi", line 126, in lxml.etree._tounicode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 21-24: invalid data

In the other machine all goes well. FYI, the tree (root variable) is being built with root = lxml.html.fromstring(data). I'm parsing data in utf8 and iso-8859-1, and this particular backtrace happened in a HTML document correctly labelled with a meta charset=iso-8859-1.

Do you have any ideas on how to trace what is going wrong?

From stefan_ml at behnel.de Sun Apr 13 09:04:19 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 13 Apr 2008 09:04:19 +0200
Subject: [lxml-dev] Weird errors in tostring
In-Reply-To:
References:
Message-ID: <4801B073.3090806@behnel.de>

Hi,

Bruno wrote:
> In the other machine all goes well.
> FYI, the tree (root variable) is being
> built with root = lxml.html.fromstring(data). I'm parsing data in utf8 and
> iso-8859-1, and this particular backtrace happened in a HTML document
> correctly labelled with a meta charset=iso-8859-1.

You can ask the document which encoding it was parsed with:

>>> print root.getroottree().docinfo.encoding

It should say "iso-8859-1" if the parser picked up the <meta> tag correctly.

Also, maybe the <meta> tag comes behind the <title> in the document? AFAIR, libxml2's HTML parser switches encodings when it sees a <meta> declaration, but it doesn't reparse the document (as most browsers do to work around this problem).

If the parser gets the encoding wrong, you can try parsing with BeautifulSoup (separate install) by using the fromstring() function in lxml.html.ElementSoup instead. That's quite a bit slower, but it *might* give you better results in this case.

http://codespeak.net/lxml/elementsoup.html

(note that the soupparser module was added in 2.0.3 to fix the parse() function. Just use the ElementSoup module in 2.0.2)

Stefan

From stefan_ml at behnel.de Sun Apr 13 19:54:22 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 13 Apr 2008 19:54:22 +0200
Subject: [lxml-dev] lxml 2.0.4 released
Message-ID: <480248CE.7040005@behnel.de>

Hi all,

lxml 2.0.4 is on PyPI. This is a bug-fix release for the stable 2.0 series.

Have fun,
Stefan

2.0.4 (2008-04-13)

Bugs fixed

* Crash bug in iterparse when moving elements into other documents.
* HTML elements' .cssselect() method was broken.
* ElementTree.find*() didn't accept QName objects.

From stefan_ml at behnel.de Sun Apr 13 20:15:37 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 13 Apr 2008 20:15:37 +0200
Subject: [lxml-dev] lxml 2.0.4 released
In-Reply-To: <480249EB.5030407@necoro.eu>
References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu>
Message-ID: <48024DC9.3090900@behnel.de>

René 'Necoro' Neumann wrote:
> Just me again :) - what's about the GTK-hang-bug?
Sorry, that fix went missing. I just re-released a new source distro that has it. Sorry for the inconvenience. Stefan From lists at necoro.eu Mon Apr 14 01:34:02 2008 From: lists at necoro.eu (=?UTF-8?B?UmVuw6kgJ05lY29ybycgTmV1bWFubg==?=) Date: Mon, 14 Apr 2008 01:34:02 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <48024DC9.3090900@behnel.de> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> Message-ID: <4802986A.8020102@necoro.eu> Stefan Behnel schrieb: > > Ren? 'Necoro' Neumann wrote: >> Just me again :) - what's about the GTK-hang-bug? > > Sorry, that fix went missing. I just re-released a new source distro that has it. Great -- thanks :) From stefan_ml at behnel.de Tue Apr 15 09:38:49 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 09:38:49 +0200 Subject: [lxml-dev] Weird errors in tostring In-Reply-To: <4803D2DC.808@gmail.com> References: <loom.20080412T201340-966@post.gmane.org> <4801B073.3090806@behnel.de> <4803D2DC.808@gmail.com> Message-ID: <48045B89.9030004@behnel.de> Hi, Bruno Barberi Gnecco wrote: > Hi Stephan, -f- >>> In the other machine all goes well. FYI, the tree (root variable) is >>> being built with root = lxml.html.fromstring(data). I'm parsing data >>> in utf8 and >>> iso-8859-1, and this particular backtrace happened in a HTML document >>> correctly labelled with a meta charset=iso-8859-1. >> >> >> You can ask the document which encoding it was parsed with: >> >> >>> print root.getroottree().docinfo.encoding >> >> It should say "iso-8859-1" if the parser picked up the <meta> tag >> correctly. > > It says 'None', actually. Then that's a clear sign that libxml2 didn't pick up the encoding. > Shouldn't it give the error when *parsing* and creating the tree, > instead of when converting the tree to something else? HTML is parsed with the "recover" option, which lets libxml2 try to work around all sorts of broken page content *without* raising an error. 
You can still check the error log of the parser to see what happened on the way through the page. > I thought lxml stored the parsed tree in unicode. UTF-8, actually, which is much easier (and faster) to handle in C than any other unicode encoding. > Besides, I'm asking for a unicode string: > > tostring(root, method='xml', encoding=unicode) Which lets lxml serialise the tree to a Python unicode character sequence in XML style. I know, this looks simple, but there's actually work being done here. >> Also, maybe the <meta> tag comes behind the <title> in the document? >> AFAIR, >> libxml2's HTML parser switches encodings when it sees a <meta> >> declaration, >> but it doesn't reparse the document (as most browsers do to work >> around this >> problem). > > It happens with fragments of HTML as well (I'm actually reading HTML > messages). Yet I was having this problem with pages downloaded from the > internet, in which the encoding was incorrectly detected. Which implies most of the time that it was incorrectly specified as well. That is a very common problem in real world HTML pages. Browsers do a great deal of work in their Quirks mode to figure out the page encoding. libxml2's HTML parser works pretty well, but fortune telling wasn't one of its design goals. > Since I had more information > in that case (HTTP headers, with a chardet pass just to be sure) I ended up > forcing the encoding with a 'html.decode(encoding)' step before building > the tree. I think it's weird that it works (since some pages declare one > encoding and use a different one), but it does. You might want to strip <meta> Content Type tags from the string using a regex, that should make sure it works in all cases. Read the function "htmlCheckEncoding()" in libxml2's HTMLparser.c to see what works and what doesn't. For example, there is some code to prevent changing the parser encoding a second time, so that you can override it with the "encoding" parser keyword in lxml.
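As a rough illustration of the "encoding" parser keyword mentioned above: the sketch below forces the input encoding when building the tree, on the assumption that the real encoding is already known from somewhere else (say, the HTTP headers). The helper name parse_html_as and the sample bytes are made up for this example; HTMLParser(encoding=...), fromstring(), docinfo.encoding and the parser's error_log are regular lxml API.

```python
import lxml.html


def parse_html_as(data, encoding):
    """Parse raw HTML bytes, telling the parser which encoding to use.

    Hypothetical helper: 'encoding' is assumed to come from a trusted
    source such as the HTTP Content-Type header.
    """
    parser = lxml.html.HTMLParser(encoding=encoding)
    root = lxml.html.fromstring(data, parser=parser)
    return root, parser


# Made-up Latin-1 sample without any <meta> charset declaration.
SAMPLE = (b"<html><head><title>caf\xe9</title></head>"
          b"<body><p>na\xefve</p></body></html>")

root, parser = parse_html_as(SAMPLE, "iso-8859-1")

# The encoding the document was parsed with, as reported by lxml.
print(root.getroottree().docinfo.encoding)
# repr() keeps the output printable on terminals that are not UTF-8.
print(repr(root.findtext(".//title")))

# Whatever the recovering parser complained about on the way through.
for entry in parser.error_log:
    print(entry.line, entry.message)
```

Serialising right after parsing, as suggested further down in this mail, is then just a matter of calling lxml.html.tostring(root, encoding="unicode") on the result.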
>> If the parser gets the encoding wrong, you can try parsing with >> BeautifulSoup >> (separate install) by using the fromstring() function in >> lxml.html.ElementSoup >> instead. That's quite a bit slower, but it *might* give you better >> results in this case. I wrote a little doc section on that topic: http://codespeak.net/lxml/elementsoup.html#using-soupparser-as-a-fallback > First, why does it work in one of the machines and not in the other, > even with the same data? I installed Python2.5, but with the same results. > Maybe the cause is libxml2 (2.6.30 where it works, 2.6.26 where it > doesn't)? That's almost definitely the reason, yes. > Second, if the tree is created, how to know if the encoding is > wrong? I only convert to string much later. You can serialise immediately, just for testing, that will tell you. Or, you can check the parser error log for encoding errors. Stefan From faassen at startifact.com Tue Apr 15 12:45:00 2008 From: faassen at startifact.com (Martijn Faassen) Date: Tue, 15 Apr 2008 12:45:00 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <48024DC9.3090900@behnel.de> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> Message-ID: <fu20vd$4dv$2@ger.gmane.org> Stefan Behnel wrote: > > Ren? 'Necoro' Neumann wrote: >> Just me again :) - what's about the GTK-hang-bug? > > Sorry, that fix went missing. I just re-released a new source distro that has it. Please don't do that! I know it's tempting, but released is released, and this should've been 2.0.5 or 2.0.4.1. Releasing two things under the same version number risks quite a bit of confusion. The simple rule is to avoid this under all circumstances. 
Regards, Martijn From stefan_ml at behnel.de Tue Apr 15 13:22:00 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 13:22:00 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <fu20vd$4dv$2@ger.gmane.org> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> <fu20vd$4dv$2@ger.gmane.org> Message-ID: <48048FD8.8010702@behnel.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: >> René 'Necoro' Neumann wrote: >>> Just me again :) - what's about the GTK-hang-bug? >> Sorry, that fix went missing. I just re-released a new source distro that has it. > > Please don't do that! I know it's tempting, but released is released, > and this should've been 2.0.5 or 2.0.4.1. Releasing two things under the > same version number risks quite a bit of confusion. The simple rule is > to avoid this under all circumstances. I know, I normally wouldn't do that, sorry. But in this case, there were some 10 minutes between the time I sent the release mail and deleting the source tar from PyPI, and less than 20 minutes before having the fix up. It usually takes a couple of days to weeks to walk up into Debian etc., and I don't expect many people to jump on the train within seconds, so the harm done here should really be negligible... Note also that PyPI presents the MD5 sum of the file, so you can click on that to see if what you have is what you want. Anyway, as I said, you're right in general. Stefan From jholg at gmx.de Tue Apr 15 13:38:52 2008 From: jholg at gmx.de (Holger Joukl) Date: Tue, 15 Apr 2008 13:38:52 +0200 Subject: [lxml-dev] trunk build problems + cython version Message-ID: <20080415123353.139340@gmx.net> Hi, I am trying to build lxml svn trunk to check out the new objectify type conversions and have some difficulty: $ PYTHONPATH=/data/pydev/DOWNLOADS/LXML/cython/versions/Cython-0.9.6.11b/build/lib/ /apps/pydev/bin/python2.4 setup.py build Building lxml version 2.1.alpha1-53751.
Building with Cython 0.9.6.11b. Using build configuration of libxslt 1.1.20 Building against libxml2/libxslt in one of the following directories: /apps/prod//lib /data/pydev/DOWNLOADS/LXML/libxml2/libxml2-2.6.27 running build running build_py writing byte-compilation script '/tmp/tmpVmrBsO.py' /apps/pydev/bin/python2.4 -O /tmp/tmpVmrBsO.py removing /tmp/tmpVmrBsO.py running build_ext cythoning src/lxml/lxml.etree.pyx to src/lxml/lxml.etree.c building 'lxml.etree' extension gcc -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -I/apps/prod//include -I/apps/prod//include/libxml2 -I/apps/prod/include/libxml2 -I/apps/prod/include -I/apps/pydev/include/python2.4 -c src/lxml/lxml.etree.c -o build/temp.solaris-2.8-sun4u-2.4/src/lxml/lxml.etree.o -w gcc -shared build/temp.solaris-2.8-sun4u-2.4/src/lxml/lxml.etree.o -L/apps/prod//lib -L/data/pydev/DOWNLOADS/LXML/libxml2/libxml2-2.6.27 -L/apps/prod/lib -Wl,-R/apps/prod/lib -lxslt -lexslt -lxml2 -lz -lm -o build/lib.solaris-2.8-sun4u-2.4/lxml/etree.so cythoning src/lxml/lxml.objectify.pyx to src/lxml/lxml.objectify.c Error converting Pyrex file to C: ------------------------------------------------------------ ... if s is not None: value = __parseBoolAsInt(s) if value == -1: raise ValueError cpdef __parseBool(s):
^ ------------------------------------------------------------ /data/pydev/DOWNLOADS/LXML/lxml/versions/SVN_CHECKOUTS/TRUNK/lxml/src/lxml/lxml.objectify.pyx:842:6: Overridable cdef function not allowed here building 'lxml.objectify' extension gcc -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -fPIC -I/apps/prod//include -I/apps/prod//include/libxml2 -I/apps/prod/include/libxml2 -I/apps/prod/include -I/apps/pydev/include/python2.4 -c src/lxml/lxml.objectify.c -o build/temp.solaris-2.8-sun4u-2.4/src/lxml/lxml.objectify.o -w gcc: src/lxml/lxml.objectify.c: No such file or directory gcc: No input files error: command 'gcc' failed with exit status 1 Is the Cython version I'm using up to par for this? The dev docs still indicate so: "lxml currently requires Cython 0.9.6.11b or 0.9.6.12, later versions were not tested." I also tried to build with the latest Cython, but this fails due to generated code not working with my oldtimer compiler (filed a Cython bug report for it). Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080415/93d1313b/attachment.htm From stefan_ml at behnel.de Tue Apr 15 14:59:49 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 14:59:49 +0200 Subject: [lxml-dev] trunk build problems + cython version In-Reply-To: <20080415123353.139340@gmx.net> References: <20080415123353.139340@gmx.net> Message-ID: <4804A6C5.90301@behnel.de> Hi, Holger Joukl wrote: > "lxml currently requires Cython 0.9.6.11b or 0.9.6.12, later versions > were not tested. " Ah, thanks, that should say 0.9.6.12 for the trunk. > I also tried to build with the latest Cython, but this fails due to > generated code not working with my > oldtimer compiler (filed a Cython bug report for it).
That was a bug in Cython 0.9.6.13. I fixed it, but there isn't a new release yet. Stefan From jholg at gmx.de Tue Apr 15 15:23:12 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 15 Apr 2008 15:23:12 +0200 Subject: [lxml-dev] trunk build problems + cython version In-Reply-To: <4804A6C5.90301@behnel.de> References: <20080415123353.139340@gmx.net> <4804A6C5.90301@behnel.de> Message-ID: <20080415132450.318040@gmx.net> Hi Stefan, > Ah, thanks, that should say 0.9.6.12 for the trunk. Is there any place to download 0.9.6.12 (or other older versions)? I can find only the latest version around. > > I also tried to build with the latest Cython, but this fails due to > > generated code not working with my > > oldtimer compiler (filed a Cython bug report for it). > > That was a bug in Cython 0.9.6.13. I fixed it, but there isn't a new > release yet. Thanks. Gonna try devel. Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080415/9600b8c0/attachment.htm From stefan_ml at behnel.de Tue Apr 15 15:45:18 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 15:45:18 +0200 Subject: [lxml-dev] trunk build problems + cython version In-Reply-To: <20080415132450.318040@gmx.net> References: <20080415123353.139340@gmx.net> <4804A6C5.90301@behnel.de> <20080415132450.318040@gmx.net> Message-ID: <4804B16E.6020106@behnel.de> Hi, jholg at gmx.de wrote: >> Ah, thanks, that should say 0.9.6.12 for the trunk. > > Is there any place to download 0.9.6.12 (or other older versions)? I can > find only the latest version around.
http://codespeak.net/lxml/build.html#cython in short: easy_install Cython==0.9.6.12 Stefan From faassen at startifact.com Tue Apr 15 15:48:22 2008 From: faassen at startifact.com (Martijn Faassen) Date: Tue, 15 Apr 2008 15:48:22 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <48048FD8.8010702@behnel.de> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> <fu20vd$4dv$2@ger.gmane.org> <48048FD8.8010702@behnel.de> Message-ID: <8928d4e90804150648g4c3ae67et6bb86f660027ac83@mail.gmail.com> Hey Stefan, On Tue, Apr 15, 2008 at 1:22 PM, Stefan Behnel <stefan_ml at behnel.de> wrote: > I know, I normally wouldn't do that, sorry. But in this case, there were some > 10 minutes between the time I sent the release mail and deleting the source > tar from PyPI, and less than 20 minutes before having the fix up. It usually > takes a couple of days to weeks to walk up into Debian etc., and I don't > expect many people to jump on the train within seconds, so the harm done here > should really be negligible... I know the harm done is probably negligible, but that's only "probably"; you still shouldn't do it. :) People who use easy_install or zc.buildout might've hit the 10 minute window and will end up with a slightly different version. People are pulling stuff from the cheeseshop automatically quite frequently these days. Your average Plone buildout includes lxml, for instance. > Note also that PyPI presents the MD5 sum of the file, so you can click on that > to see if what you have is what you want. If there is a problem (admittedly unlikely), it might be quite a while before they consider checking MD5 sums. Anyway, it's up to you, of course. It's just that even while doing this was low-risk, the risk can be entirely eliminated instead. Regards, Martijn P.S. I should add that overall you're doing a most excellent job with lxml, much better than I could've done myself.
So this is just a small issue while I actually continue to be blissfully happy. :) From jholg at gmx.de Tue Apr 15 16:14:32 2008 From: jholg at gmx.de (jholg at gmx.de) Date: Tue, 15 Apr 2008 16:14:32 +0200 Subject: [lxml-dev] trunk build problems + cython version In-Reply-To: <4804B16E.6020106@behnel.de> References: <20080415123353.139340@gmx.net> <4804A6C5.90301@behnel.de> <20080415132450.318040@gmx.net> <4804B16E.6020106@behnel.de> Message-ID: <20080415141823.139350@gmx.net> > > Is there any place to download 0.9.6.12 (or other older versions)? I can > > find only the latest version around. > > http://codespeak.net/lxml/build.html#cython > > in short: > > easy_install Cython==0.9.6.12 Sorry I should've been clearer. I've read that but was missing a way to download apart from using easy_install. My development systems can not always simply connect. But I should've looked closer: Here's all the stuff I've been looking for: http://pypi.python.org/packages/source/C/Cython/ Holger -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080415/4827cf8f/attachment-0001.htm From HargraveJE at ldschurch.org Tue Apr 15 18:22:17 2008 From: HargraveJE at ldschurch.org (Jim Hargrave) Date: Tue, 15 Apr 2008 16:22:17 +0000 (UTC) Subject: [lxml-dev] Can't load external DTD. References: <loom.20080220T152155-538@post.gmane.org> Message-ID: <loom.20080415T161703-188@post.gmane.org> Nef Asus <nefasus <at> gmail.com> writes: > > Hello everyone, > I've written this little program that refuses to work: > > from lxml import etree > if __name__ == "__main__": > xml_input = "C:\Desarrollo\pythontests\lxml\foo.xml" > parser = etree.XMLParser(load_dtd = True, dtd_validation = True, > attribute_defaults = True) > doc = etree.parse(xml_input, parser) > > Here's the traceback.
> Traceback (most recent call last): > File "C:\Desarrollo\pythontests\lxml\dtd_loader.py", \ > line 27, in <module> doc = etree.parse(xml_input, parser) > File "lxml.etree.pyx", line 2515, in lxml.etree.parse > File "parser.pxi", line 1755, in lxml.etree._parseDocument > File "parser.pxi", line 1759, in lxml.etree._parseDocumentFromURL > File "parser.pxi", line 1681, in lxml.etree._parseDocFromFile > File "parser.pxi", line 826 ,in lxml.etree._BaseParser._parseDocFromFile > File "parser.pxi",line 450,in lxml.etree._ParserContext._handleParseResultDoc > File "parser.pxi", line 534, in lxml.etree._handleParseResult > File "parser.pxi", line 476, in lxml.etree._raiseParseError > lxml.etree.XMLSyntaxError: failed to load external entity "NULL", > line 9, column 83 > > This is a snippet of foo.xml : > <?xml version="1.0" encoding="iso-8859-1" ?> > <!DOCTYPE rem:requirementsProject > SYSTEM "C:\Desarrollo\pythontests\lxml\foo.dtd"> > ... > > Then, I tried to write a custom resolver. > > from lxml import etree > class DTDResolver(etree.Resolver): > def resolve(self, url, id, context): > print("Resolving (url, %s)(id, %s)"% (url,id)) > self.resolve_filename("C:\Desarrollo\pythontests\lxml\JENSEN.dtd", \ > context) > I had the same exact same problem with lxml 2.03 with a DITA XML file referencing the DITA DTD's (pretty complicated). Switching back to lxml 1.3.6 fixed the problem. Is this problem fixed in any of the 2.x series? Heres's my resolver: class DITA_DTD_Resolver(etree.Resolver): def __init__(self, dtdDir): self.dtdDir = dtdDir def resolve(self, url, id, context): (entityName, ext) = os.path.splitext(url) #dtd = u'<!DOCTYPE %s PUBLIC "%s" "%s">' % (entityName, id, self.dtdDir + '/' + url) return self.resolve_filename(self.dtdDir + '/' + url, context) Jim From stefan_ml at behnel.de Tue Apr 15 19:52:44 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 19:52:44 +0200 Subject: [lxml-dev] Can't load external DTD. 
In-Reply-To: <loom.20080415T161703-188@post.gmane.org> References: <loom.20080220T152155-538@post.gmane.org> <loom.20080415T161703-188@post.gmane.org> Message-ID: <4804EB6C.9030802@behnel.de> Hi, Jim Hargrave wrote: > Nef Asus <nefasus <at> gmail.com> writes: >> >> from lxml import etree >> if __name__ == "__main__": >> xml_input = "C:\Desarrollo\pythontests\lxml\foo.xml" >> parser = etree.XMLParser(load_dtd = True, dtd_validation = True, >> attribute_defaults = True) >> doc = etree.parse(xml_input, parser) >> >> Here's the traceback. >> Traceback (most recent call last): >> File "C:\Desarrollo\pythontests\lxml\dtd_loader.py", \ >> line 27, in <module> doc = etree.parse(xml_input, parser) >> File "lxml.etree.pyx", line 2515, in lxml.etree.parse >> File "parser.pxi", line 1755, in lxml.etree._parseDocument >> File "parser.pxi", line 1759, in lxml.etree._parseDocumentFromURL >> File "parser.pxi", line 1681, in lxml.etree._parseDocFromFile >> File "parser.pxi", line 826 ,in lxml.etree._BaseParser._parseDocFromFile >> File "parser.pxi",line 450,in > lxml.etree._ParserContext._handleParseResultDoc >> File "parser.pxi", line 534, in lxml.etree._handleParseResult >> File "parser.pxi", line 476, in lxml.etree._raiseParseError >> lxml.etree.XMLSyntaxError: failed to load external entity "NULL", >> line 9, column 83 >> >> This is a snippet of foo.xml : >> <?xml version="1.0" encoding="iso-8859-1" ?> >> <!DOCTYPE rem:requirementsProject >> SYSTEM "C:\Desarrollo\pythontests\lxml\foo.dtd"> > > I had the same exact same problem with lxml 2.03 with a DITA XML file > referencing the DITA DTD's (pretty complicated). Switching back to lxml 1.3.6 > fixed the problem. > > Is this problem fixed in any of the 2.x series? Thanks for pointing me at the problem. Here's a patch. Will be fixed in 2.1beta1 and 2.0.5 (when it comes out). 
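For reference, a minimal sketch of what resolving a DTD to a local file can look like once resolve_filename() behaves as intended. It is only an illustration under a few assumptions: the class name LocalDTDResolver, the helper make_parser() and the throwaway entities.dtd are invented for the example, and the local copy is assumed to carry the same file name as the last path component of the system identifier. Note that the resolver has to return the result of resolve_filename(); a resolve() method that returns nothing leaves the request unresolved.

```python
import os
import tempfile

from lxml import etree


class LocalDTDResolver(etree.Resolver):
    """Serve external DTDs from a local directory (hypothetical helper)."""

    def __init__(self, dtd_dir):
        super().__init__()
        self.dtd_dir = dtd_dir

    def resolve(self, url, public_id, context):
        if not url:
            return None
        # Assumption: the local copy is named like the last path
        # component of the system identifier.
        local_path = os.path.join(self.dtd_dir, os.path.basename(url))
        if os.path.exists(local_path):
            # Returning the result is essential; returning None means
            # "not resolved here" and lxml falls back to its default lookup.
            return self.resolve_filename(local_path, context)
        return None


def make_parser(dtd_dir):
    """Build a parser that loads DTDs but only through the local resolver."""
    parser = etree.XMLParser(load_dtd=True, no_network=True,
                             resolve_entities=True)
    parser.resolvers.add(LocalDTDResolver(dtd_dir))
    return parser


# Small demo with a throwaway DTD that declares a single entity.
with tempfile.TemporaryDirectory() as dtd_dir:
    with open(os.path.join(dtd_dir, "entities.dtd"), "w") as f:
        f.write('<!ENTITY nbsp "&#160;">')
    xml = b'<!DOCTYPE doc SYSTEM "entities.dtd"><doc>a&nbsp;b</doc>'
    doc = etree.fromstring(xml, make_parser(dtd_dir))
    print(repr(doc.text))
```

The same parser object can be reused for many documents, so the mapping from system identifier to local file only has to be set up once per process.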
Stefan === src/lxml/parser.pxi ================================================================== --- src/lxml/parser.pxi (revision 3984) +++ src/lxml/parser.pxi (local) @@ -333,7 +333,7 @@ c_context, _cstr(data)) elif doc_ref._type == PARSER_DATA_FILENAME: c_input = xmlparser.xmlNewInputFromFile( - c_context, _cstr(doc_ref._data_bytes)) + c_context, _cstr(doc_ref._filename)) elif doc_ref._type == PARSER_DATA_FILE: file_context = _FileReaderContext(doc_ref._file, context, url) c_input = file_context._createParserInput(c_context) From stefan_ml at behnel.de Tue Apr 15 21:07:10 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 21:07:10 +0200 Subject: [lxml-dev] lxml 2.1beta1 released Message-ID: <4804FCDE.4080801@behnel.de> Hi all, lxml 2.1beta1 is on PyPI. This is the first beta release of the upcoming 2.1 release series. http://codespeak.net/lxml/dev/ http://pypi.python.org/pypi/lxml/2.1beta1 This release contains the bug fixes that were included in 2.0.4 as well as a couple of other important fixes over 2.1alpha1. It also brings some new features, some of which were long awaited - CDATA support was first requested in December 2005, according to the mailing list archive. Another long awaited feature, XSLT extension element support, was already released in 2.1alpha1. Feedback on both will be appreciated. I hope you like it. Have fun, Stefan 2.1beta1 (2008-04-15) Features added * Error logging in Schematron (requires libxml2 2.6.32 or later). * Parser option strip_cdata for normalising or keeping CDATA sections. Defaults to True as before, thus replacing CDATA sections by their text content. * CDATA() factory to wrap string content as CDATA section. Bugs fixed * Resolving to a filename in custom resolvers didn't work. * lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently. * Memory leak in Schematron with libxml2 >= 2.6.31. * lxml.etree accepted non well-formed namespace prefix names. 
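To make the two CDATA-related features listed above a bit more concrete, here is a short sketch. The sample documents are made up; strip_cdata and CDATA() are the names given in the release notes, and the rest is plain lxml.etree usage.

```python
from lxml import etree

SAMPLE = "<root><![CDATA[a < b]]></root>"

# Default behaviour (strip_cdata=True): the CDATA section is replaced
# by its text content, so it is escaped on serialisation.
default_root = etree.fromstring(SAMPLE)
print(etree.tostring(default_root))

# With strip_cdata=False the section is kept in the tree, although
# .text still hands back the plain text content.
keeping_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(SAMPLE, keeping_parser)
print(kept_root.text)
print(etree.tostring(kept_root))

# CDATA() wraps new text content so that it is written out as a
# CDATA section.
new_root = etree.Element("root")
new_root.text = etree.CDATA("x & y")
print(etree.tostring(new_root))
```

At the API level the text reads the same either way; the difference only shows up in the serialised output.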
Other changes * Major cleanup in internal moveNodeToDocument() function, which takes care of namespace cleanup when moving elements between different namespace contexts. * New Elements created through the makeelement() method of an HTML parser or through lxml.html now end up in a new HTML document (doctype HTML 4.01 Transitional) instead of a generic XML document. This mostly impacts the serialisation and the availability of a DTD context. 2.1alpha1 (2008-03-27) Features added * New event types 'comment' and 'pi' in iterparse(). * XSLTAccessControl instances have a property options that returns a dict of access configuration options. * Constant instances DENY_ALL and DENY_WRITE on XSLTAccessControl class. * Extension elements for XSLT (experimental!) * Element.base property returns the xml:base or HTML base URL of an Element. * docinfo.URL property is writable. Bugs fixed * Default encoding for plain text serialisation was different from that of XML serialisation (UTF-8 instead of ASCII). Other changes * Minor API speed-ups. * The benchmark suite now uses tail text in the trees, which makes the absolute numbers incomparable to previous results. * Generating the HTML documentation now requires Pygments, which is used to enable syntax highlighting for the doctest examples. Most long-time deprecated functions and methods were removed: * etree.clearErrorLog(), use etree.clear_error_log() * etree.useGlobalPythonLog(), use etree.use_global_python_log() * etree.ElementClassLookup.setFallback(), use etree.ElementClassLookup.set_fallback() * etree.getDefaultParser(), use etree.get_default_parser() * etree.setDefaultParser(), use etree.set_default_parser() * etree.setElementClassLookup(), use etree.set_element_class_lookup() Note that parser.setElementClassLookup() has not been removed yet, although parser.set_element_class_lookup() should be used instead. 
* xpath_evaluator.registerNamespace(), use xpath_evaluator.register_namespace() * xpath_evaluator.registerNamespaces(), use xpath_evaluator.register_namespaces() * objectify.setPytypeAttributeTag, use objectify.set_pytype_attribute_tag * objectify.setDefaultParser(), use objectify.set_default_parser() From stefan_ml at behnel.de Tue Apr 15 21:58:57 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 15 Apr 2008 21:58:57 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <8928d4e90804150648g4c3ae67et6bb86f660027ac83@mail.gmail.com> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> <fu20vd$4dv$2@ger.gmane.org> <48048FD8.8010702@behnel.de> <8928d4e90804150648g4c3ae67et6bb86f660027ac83@mail.gmail.com> Message-ID: <48050901.3080903@behnel.de> Hi, Martijn Faassen wrote: > People who use easy_install or zc.buildout might've hit the 10 minute > window and will end up with a slightly different version. People are > pulling stuff from the cheeseshop automatically quite frequently these > days. Your average Plone buildout includes lxml, for instance. You're right. I noticed that 2.1alpha1 was downloaded from PyPI more than a thousand times - within not even three weeks. That's more than every thirty minutes. :) That's even the first time I see four digits for the download counter. Looks like there's a lot going on automatically... BTW, is there a default/required version used in Plone? I'm asking because I was planning to work on a 1.3.7 quite a while ago. It seems that even Ubuntu Hardy will ship with 1.3.6 on board. And there are definitely enough bugs in 1.3.6 to get them fixed in their own right. I just didn't find the time (and incentive) so far to touch that old code again. 
Stefan From faassen at startifact.com Wed Apr 16 09:29:38 2008 From: faassen at startifact.com (Martijn Faassen) Date: Wed, 16 Apr 2008 09:29:38 +0200 Subject: [lxml-dev] lxml 2.0.4 released In-Reply-To: <48050901.3080903@behnel.de> References: <480248CE.7040005@behnel.de> <480249EB.5030407@necoro.eu> <48024DC9.3090900@behnel.de> <fu20vd$4dv$2@ger.gmane.org> <48048FD8.8010702@behnel.de> <8928d4e90804150648g4c3ae67et6bb86f660027ac83@mail.gmail.com> <48050901.3080903@behnel.de> Message-ID: <8928d4e90804160029g5fce11g1ba7afa6ddc04948@mail.gmail.com> Hi there, > BTW, is there a default/required version used in Plone? I'm asking because I > was planning to work on a 1.3.7 quite a while ago. It seems that even Ubuntu > Hardy will ship with 1.3.6 on board. And there are definitely enough bugs in > 1.3.6 to get them fixed in their own right. I just didn't find the time (and > incentive) so far to touch that old code again. Hm, I think I was mistaken with the plone buildout, as I can't find obvious evidence of it including lxml, but it's quite possible I missed some dependency somewhere. I *know* the Plone people are including lxml with the Windows installer, but that of course won't affect the downloads directly. That said, I certainly have (automatic) dependencies on lxml 1.3.x in some of my applications still. Regards, Martijn From dfedoruk at gmail.com Wed Apr 16 19:50:26 2008 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Wed, 16 Apr 2008 21:50:26 +0400 Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests In-Reply-To: <6689.194.114.62.66.1207725907.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <a36e1d4b0804010713q69b75fdfpc4cbc81f1a73a4bc@mail.gmail.com> <47F8DDB3.8070603@behnel.de> <69B9724A-1BE8-4644-8067-DF8F743FAA2D@gmail.com> <6689.194.114.62.66.1207725907.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <a36e1d4b0804161050m394eba1cge24cba46175a13cb@mail.gmail.com> Hi! 
Speaking again about the issue with DTD loading, parsing etc. > >> It seems you only want to parse DTDs locally from disc, so setting > >> "no_network=True" (which is the default in lxml 2.0) should prevent > >> any accidental remote access. > > > > Eventually it turned out that I'm working fine without DTD. So, > > setting no_network = True and load_dtd = False really solved the > > problem. > > Hmm, do you really need to turn off DTD loading or is disabling network > access enough? I wouldn't expect loading the DTD from the disk cache to > take that much time (although, if you can live without it and time is > really critical, then it's obviously better to save that bit of time > also). I was wrong - I do need DTD to resolve entities correctly. Sometimes I got the html   and things like these. My DTD included all the required entities, but it is referenced by URL. And the only way to deal with this entity is to load the DTD, isn't it? Which options do I have other than switching the URL to a local path in the SYSTEM definition? Setting up the DTD catalog on every machine that runs the application? The ideal option would be to tell the parser "load the given DTD from a given location (i.e. disk) and use it from now on for parsing all incoming data", but is it possible?
Cheers, Dmitri From stefan_ml at behnel.de Wed Apr 16 20:46:27 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 Apr 2008 20:46:27 +0200 Subject: [lxml-dev] etree.parse hangs with a lot of parallel requests In-Reply-To: <a36e1d4b0804161050m394eba1cge24cba46175a13cb@mail.gmail.com> References: <a36e1d4b0804010713q69b75fdfpc4cbc81f1a73a4bc@mail.gmail.com> <47F8DDB3.8070603@behnel.de> <69B9724A-1BE8-4644-8067-DF8F743FAA2D@gmail.com> <6689.194.114.62.66.1207725907.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <a36e1d4b0804161050m394eba1cge24cba46175a13cb@mail.gmail.com> Message-ID: <48064983.7050300@behnel.de> Hi, Dmitri Fedoruk wrote: > Which options do I have except of switching URL to a local path in > SYSTEM definition? Setting up the DTD catalog on every machine that > runs the application? The ideal option would be to tell the parser > "load the given DTD from a given location(i.e. disk) and use it from > now and on for parsing all incoming data", but is it possible? You can use a custom resolver and cache the DTD (by its URL) once it's loaded. http://codespeak.net/lxml/resolvers.html Stefan From stefan_ml at behnel.de Wed Apr 16 23:27:14 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 16 Apr 2008 23:27:14 +0200 Subject: [lxml-dev] Faster parsing! Message-ID: <48066F32.3000604@behnel.de> Hi, here is a (pretty ugly, hackish) patch against libxml2 2.6.32 that replaces the hash function of the internal hash table implementation by one that I found on the web: http://www.azillionmonkeys.com/qed/hash.html Remember that cElementTree is still the fastest XML tree parser for Python? By a factor of up to 10 compared to lxml? http://codespeak.net/lxml/performance.html#parsing-and-serialising According to lxml's benchmark suite, this patch brings the parser of libxml2/lxml down to 2 times (10x->2x!) the parsing time of cElementTree for larger files (some MB). I find this quite impressive.
Here are the numbers (lower is better):

lxe: XML (SAXR T1) 39.4800 msec/pass # pretty large tree
cET: XML (SAXR T1) 20.0679 msec/pass
lxe: XML (SAXR T3) 25.9020 msec/pass
cET: XML (SAXR T3) 33.2189 msec/pass
lxe: XML (SAXR T4) 0.7598 msec/pass
cET: XML (SAXR T4) 0.7181 msec/pass

While the benchmark is not a particularly good measure for this exact case as it generates the XML tag names instead of sticking to a (likely smaller) fixed set of language tags, this gives me a factor of 7 (!) in performance improvement for in-memory parsing compared to an unpatched libxml2. I also reran the old testament benchmark for a more realistic benchmark scenario. The speedup there is up to 30%, not that bad either. And lxml's "parse()+iter()" implementation of that benchmark is now as fast as cET's "iterparse()" version. :) I would love to get some feedback from others who want to test this. Just patch your copy of libxml2 and let lxml run against it. I'm eager to hear some numbers to convince Daniel to get a cleaned up version of this patch into mainstream libxml2. Hope you like it, Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: faster-dict.c.patch Type: text/x-patch Size: 4108 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080416/f5885e3a/attachment-0001.bin From Marcin.Kasperski at softax.com.pl Fri Apr 18 16:00:51 2008 From: Marcin.Kasperski at softax.com.pl (Marcin Kasperski) Date: Fri, 18 Apr 2008 16:00:51 +0200 Subject: [lxml-dev] (patch) Generating PDF documentation Message-ID: <8763ufb6uk.fsf@softax.com.pl> A non-text attachment was scrubbed...
Name: lxmlpdf.diff Type: text/x-diff Size: 9711 bytes Desc: patch Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080418/fb9817ab/attachment.bin From stefan_ml at behnel.de Fri Apr 18 21:44:45 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 18 Apr 2008 21:44:45 +0200 Subject: [lxml-dev] (patch) Generating PDF documentation In-Reply-To: <8763ufb6uk.fsf@softax.com.pl> References: <8763ufb6uk.fsf@softax.com.pl> Message-ID: <4808FA2D.4090703@behnel.de> Hi, Marcin Kasperski wrote: > As this is my first post to this list, let me start from great thanks > for this excellent library. :) > I created quick&dirty patch which adds "make pdf" target to the lxml > Makefile, generating latex files and then pdflatex-ing them to obtain > the PDF book. I fixed up the patch a little and committed it to the trunk. If there are no objections, I'll just add the generated PDF to the source release; that's not even 1MB more. > The result is surely unpolished (some LaTeX presentation > tuning could make sense, also the section structure should be imported > from common file instead of being copied), but nevertheless may be > useful. It definitely is. I already thought about that a couple of times, but didn't really take a shot at it. So, thanks a lot for the contribution. Polish will come over time. I take more patches here. :) Stefan From stefan_ml at behnel.de Sat Apr 19 15:14:40 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Apr 2008 15:14:40 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <1208552385.2937.18.camel@platon> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> Message-ID: <4809F040.10800@behnel.de> Hi, Marcin Kasperski wrote: >> If there are no >> objection, I'll just add the generated PDF to the source release, that's not >> even 1MB more. > > I'd suggest distributing PDF separately (and linking it from the website > so it is easily available).
PDFs aren't compressing too well. The PDF gzips from 900KB down to 600KB, so that's not too much in addition. On the other hand, many people just use easy_install or a buildout to grab the tar.gz, so they won't benefit from the docs at all. We currently have a source distribution of a bit less than 2MB. Most of that is generated HTML documentation (gzipped 1.1MB, mainly the API docs). If we removed everything that easy_install doesn't use, maybe even the test suites, I guess we would be down to some 600KB, including the generated C source. We could still provide a complete tar.gz for download, maybe next to the plain source on PyPI, or just from the lxml homepage. It's harder to do, though, as I don't think distutils supports this without a little setup.py tweaking. Any opinions on this? > And - > especially - some people who do not use source release may find use for > PDF ;-) Sure, I didn't say it wouldn't show up on the web page. Stefan From stefan_ml at behnel.de Sat Apr 19 17:15:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 19 Apr 2008 17:15:05 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <4809F040.10800@behnel.de> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> Message-ID: <480A0C79.1080309@behnel.de> Stefan Behnel wrote: > Marcin Kasperski wrote: >> especially - some people who do not use source release may find use for >> PDF ;-) > > Sure, I didn't say it wouldn't show up on the web page. I uploaded the PDF docs for lxml 2.1beta1 here: http://codespeak.net/lxml/dev/lxmldoc-2.1beta1.pdf It's now built automatically with my normal web site upload. Any patches to improve doc/mklatex.py are welcome. 
Stefan From tseaver at palladion.com Sat Apr 19 19:51:23 2008 From: tseaver at palladion.com (Tres Seaver) Date: Sat, 19 Apr 2008 13:51:23 -0400 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <4809F040.10800@behnel.de> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> Message-ID: <480A311B.30107@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stefan Behnel wrote: > Hi, > > Marcin Kasperski wrote: >>> If there are no >>> objection, I'll just add the generated PDF to the source release, that's not >>> even 1MB more. >> I'd suggest distributing PDF separately (and linking it from the website >> so it is easily available). PDFs aren't compressing too well. > > The PDF gzips from 900KB down to 600KB, so that's not too much in addition. > > On the other hand, many people just use easy_install or a buildout to grab the > tar.gz, so they won't benefit from the docs at all. > > We currently have a source distribution of a bit less than 2MB. Most of that > is generated HTML documentation (gzipped 1.1MB, mainly the API docs). If we > removed everything that easy_install doesn't use, maybe even the test suites, > I guess we would be down to some 600KB, including the generated C source. We > could still provide a complete tar.gz for download, maybe next to the plain > source on PyPI, or just from the lxml homepage. It's harder to do, though, as > I don't think distutils supports this without a little setup.py tweaking. > > Any opinions on this? +1 for leaving the docs, etc. in the source dist (*especially* the test suite code). Tres. 
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFICjEb+gerLs4ltQ4RAtOLAJ0TGL+7oYOYP3ITdlOXq/fTVXuecQCgodBq OQfh8NAQYVE44uVUqMN9vh4= =ZZoD -----END PGP SIGNATURE----- From stefan_ml at behnel.de Sun Apr 20 18:27:51 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 20 Apr 2008 18:27:51 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <480A311B.30107@palladion.com> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A311B.30107@palladion.com> Message-ID: <480B6F07.5070207@behnel.de> Tres Seaver wrote: > +1 for leaving the docs, etc. in the source dist (*especially* the test > suite code). :) I didn't mean to close down the source or something, just considering a way to adapt to the fact that many downloads of lxml happen automatically through easy_install, so most people won't even see that there is more in there than the installable code. The idea was to make a stripped down distribution the *default* download from PyPI, not the *only* download. But I think this would lead to more confusion than it helps anyone. 
Stefan From faassen at startifact.com Tue Apr 22 14:14:02 2008 From: faassen at startifact.com (Martijn Faassen) Date: Tue, 22 Apr 2008 14:14:02 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <480A0C79.1080309@behnel.de> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> Message-ID: <fukkqd$265$1@ger.gmane.org> Stefan Behnel wrote: > Stefan Behnel wrote: >> Marcin Kasperski wrote: >>> especially - some people who do not use source release may find use for >>> PDF ;-) >> Sure, I didn't say it wouldn't show up on the web page. > > I uploaded the PDF docs for lxml 2.1beta1 here: > > http://codespeak.net/lxml/dev/lxmldoc-2.1beta1.pdf > > It's now built automatically with my normal web site upload. > > Any patches to improve doc/mklatex.py are welcome. This looks cool. Quite interesting to see we have almost 200 pages of documentation now (not counting the changelogs). That's impressive! My suggestion for the stylesheet would be to use whitespace between paragraphs, instead of what it does now, indenting the paragraphs. I find that easier to read. There also seems to be a jumping of the left margin on alternate pages. Perhaps this is nice if you print it? I don't see this in many PDFs I see though, and it makes it a bit harder to read on the screen. 
Regards, Martijn From ad at papyrus-gmbh.de Tue Apr 22 23:01:17 2008 From: ad at papyrus-gmbh.de (Andreas Degert) Date: Tue, 22 Apr 2008 23:01:17 +0200 Subject: [lxml-dev] Problem with using the same URI twice in a namespace Message-ID: <20080422230117.02d0bbaf@pluto.noname> I assume it is legal to have the following namespace declaration/usage:
<top xmlns="a" xmlns:a="a" xmlns:b="b">
  <foo bar=""/>
  <b:foobar a:bar=""/>
</top>
It works when I read such a definition with lxml.etree.parse, but I can't construct it with lxml.etree.Element because then the nsmap dict will be normalized in such a way that each URI occurs only once. Is this a bug in lxml or shouldn't it be used in this way? cheers Andreas From stefan_ml at behnel.de Tue Apr 22 20:17:28 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 22 Apr 2008 20:17:28 +0200 Subject: [lxml-dev] Faster parsing! In-Reply-To: <48066F32.3000604@behnel.de> References: <48066F32.3000604@behnel.de> Message-ID: <480E2BB8.1070209@behnel.de> Hi, Stefan Behnel wrote: > here is a patch against libxml2 2.6.32 that replaces > the hash function of the internal hash table implementation by one that I > found on the web a cleaned up version of this patch will be integrated into libxml2 2.6.33. It won't make a difference for those who parse 'only' HTML or other single languages with a somewhat small vocabulary (tags/attributes), but if you parse many different types of XML documents (XSD, XSLT, your language, ...), you will notice a difference.
Stefan From stefan_ml at behnel.de Tue Apr 22 22:42:12 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 22 Apr 2008 22:42:12 +0200 Subject: [lxml-dev] Problem with using the same URI twice in a namespace In-Reply-To: <20080422230117.02d0bbaf@pluto.noname> References: <20080422230117.02d0bbaf@pluto.noname> Message-ID: <480E4DA4.6090609@behnel.de> Hi, Andreas Degert wrote: > I assume it is legal to have to following namespace declaration/usage: > > <top xmlns="a" xmlns:a="a" xmlns:b="b"> > <foo bar=""/> > <b:foobar a:bar=""/> > </top> Sure, the spec calls this well-formed XML - not talking aesthetics, though. > It works when I read such a definition with lxml.etree.parse, but I > can't construct it with lxml.etree.Element because then the nsmap dict > will be normalized in such a way that each URI occurs only once. Finally someone complaining that there are too *few* namespace declarations instead of too many. ;o) lxml does a lot of work behind the scenes to keep namespaces consistent and simple throughout whatever operation you affect at the API level. In the case you describe, lxml checks on each new namespace prefix declaration if that namespace is already defined in the tree context of the Element and reuses the old prefix if that is the case. The function that does that is _initNodeNamespaces() in apihelpers.pxi, in case you're interested. > Is this a bug in lxml or shouldn't it be used in this way? I don't see the use case. What could you do with redundant namespace prefix declarations that you can't do with a single one? Imagine you have two prefixes defined for a namespace and you add a subelement with that namespace. Which prefix should be used? What purpose does that ambiguity serve? 
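For readers following the namespace question, here is a minimal sketch that parses the example document from the original post. It uses the standard library's xml.etree.ElementTree rather than lxml, purely so that it runs without extra installs; the variable names are illustrative. It shows what the two declarations for namespace "a" resolve to: unprefixed element names pick up the default namespace, while an unprefixed attribute does not.

```python
import xml.etree.ElementTree as ET

# The document from the question: namespace "a" is bound twice,
# once as the default namespace and once to the prefix "a".
DOC = '''<top xmlns="a" xmlns:a="a" xmlns:b="b">
  <foo bar=""/>
  <b:foobar a:bar=""/>
</top>'''

root = ET.fromstring(DOC)
foo, foobar = list(root)

# Unprefixed element names pick up the default namespace ...
print(root.tag)             # {a}top
print(foo.tag)              # {a}foo
# ... but an unprefixed attribute stays in no namespace at all.
print(list(foo.attrib))     # ['bar']
# Only the prefixed form puts the attribute into namespace "a".
print(foobar.tag)           # {b}foobar
print(list(foobar.attrib))  # ['{a}bar']
```

In other words, the prefixed declaration is the only one of the two that can carry an attribute into namespace "a", which is where a second prefix for the same URI stops being purely redundant.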
Stefan From ad at papyrus-gmbh.de Wed Apr 23 10:20:07 2008 From: ad at papyrus-gmbh.de (Andreas Degert) Date: Wed, 23 Apr 2008 10:20:07 +0200 Subject: [lxml-dev] Problem with using the same URI twice in a namespace In-Reply-To: <480E4DA4.6090609@behnel.de> References: <20080422230117.02d0bbaf@pluto.noname> <480E4DA4.6090609@behnel.de> Message-ID: <20080423102007.12c33672@pluto.noname> On Tue, 22 Apr 2008 22:42:12 +0200 Stefan Behnel <stefan_ml at behnel.de> wrote: > Hi, > > Andreas Degert wrote: > > I assume it is legal to have to following namespace > > declaration/usage: > > > > <top xmlns="a" xmlns:a="a" xmlns:b="b"> > > <foo bar=""/> > > <b:foobar a:bar=""/> > > </top> > > Sure, the spec calls this well-formed XML - not talking aesthetics, > though. > > > > It works when I read such a definition with lxml.etree.parse, but I > > can't construct it with lxml.etree.Element because then the nsmap > > dict will be normalized in such a way that each URI occurs only > > once. > > Finally someone complaining that there are too *few* namespace > declarations instead of too many. ;o) > > lxml does a lot of work behind the scenes to keep namespaces > consistent and simple throughout whatever operation you affect at the > API level. In the case you describe, lxml checks on each new > namespace prefix declaration if that namespace is already defined in > the tree context of the Element and reuses the old prefix if that is > the case. The function that does that is _initNodeNamespaces() in > apihelpers.pxi, in case you're interested. > > > > Is this a bug in lxml or shouldn't it be used in this way? > > I don't see the use case. What could you do with redundant namespace > prefix declarations that you can't do with a single one? 
I think the behaviour leads to a bug:
t = Element("top",nsmap={None:"a","b":"b"})
SubElement(t, "{b}foobar", {"{a}bar":""})
print tostring(t, pretty_print=True)
-----
<top xmlns="a" xmlns:b="b">
  <b:foobar bar=""/>
</top>
-----
In the output the attribute bar should have namespace a, but it has no namespace (the default namespace doesn't apply to attributes as specified in http://www.w3.org/TR/REC-xml-names/#scoping-defaulting, section 6.2). hmmm... even simpler example:
Element("top", {"bar":"", "{a}bar":""}, nsmap={None:"a","b":"b"})
yields
<top xmlns="a" xmlns:b="b" bar="" bar=""/>
> Imagine you have two prefixes defined for a namespace and you add a > subelement with that namespace. Which prefix should be used? What > purpose does that ambiguity serve? The default namespace is a special case because it doesn't apply to attributes (this means when attributes have a namespace value they must be serialized with a prefix). When serializing elements the default namespace should have a higher priority, i.e. those elements can be written without prefix. > Stefan > From friedel at translate.org.za Wed Apr 23 13:28:50 2008 From: friedel at translate.org.za (F Wolff) Date: Wed, 23 Apr 2008 13:28:50 +0200 Subject: [lxml-dev] Faster parsing! In-Reply-To: <480E2BB8.1070209@behnel.de> References: <48066F32.3000604@behnel.de> <480E2BB8.1070209@behnel.de> Message-ID: <1208950130.7034.3.camel@dhcppc2> On Tuesday 2008-04-22, Stefan Behnel wrote: > Hi, > > Stefan Behnel wrote: > > here is a patch against libxml2 2.6.32 that replaces > > the hash function of the internal hash table implementation by one that I > > found on the web > > a cleaned up version of this patch will be integrated into libxml2 2.6.33. It > won't make a difference for those who parse 'only' HTML or other single > languages with a somewhat small vocabulary (tags/attributes), but if you parse > many different types of XML documents (XSD, XSLT, your language, ...), you > will notice a difference.
> > Stefan Well done, Stefan. I'm not sure if this patch will help me specifically, but I really appreciate the work you put into lxml. I'm really glad I ported our code in this direction :-) Keep well Friedel From stefan_ml at behnel.de Wed Apr 23 15:41:51 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 23 Apr 2008 15:41:51 +0200 (CEST) Subject: [lxml-dev] Problem with using the same URI twice in a namespace In-Reply-To: <20080423102007.12c33672@pluto.noname> References: <20080422230117.02d0bbaf@pluto.noname> <480E4DA4.6090609@behnel.de> <20080423102007.12c33672@pluto.noname> Message-ID: <9122.194.114.62.65.1208958111.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Andreas Degert wrote: > In the output the attribute bar should have namespace a, but it has no > namespace (the default namespace doesn't apply to attributes as > specified in http://www.w3.org/TR/REC-xml-names/#scoping-defaulting, > section 6.2). > > Element("top", {"bar":"", "{a}bar":""}, nsmap={None:"a","b":"b"}) > > yields <top xmlns="a" xmlns:b="b" bar="" bar=""/> > > The default namespace is a special case because it doesn't apply to > attributes (this means when attributes have a namespace value they > must be serialized with a prefix). I see the problem. Actually, now that you mention it, it is not uncommon to define multiple prefixes for a namespace, e.g. in XSD or WSDL. Maybe we can somehow prioritise namespace declarations on the way in, or special case the default namespace in the cleanup procedure (like: making sure it comes last in the declaration list, although that wouldn't impact the parser). It would be nice to have some simple rules how to check that this has to be done, as it definitely adds overhead. I could even accept not simplifying the nsmap at all, but there still is the problem of namespace cleanup when moving elements (moveNodeToDocument() in proxi.pxi). 
We would need special rules there, too, like: allow adding a second prefix for the default namespace - no idea if that case is easy to recognise and handle. > When serializing elements the default > namespace should have a higher priority, i.e. those elements can be > written without prefix. The serialiser is part of libxml2. If you want changes in this part of lxml, ask on the libxml2 mailing list. However, I think the more general problem is in lxml here. Stefan From stefan_ml at behnel.de Wed Apr 23 18:35:26 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 23 Apr 2008 18:35:26 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <fukkqd$265$1@ger.gmane.org> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> Message-ID: <480F654E.7000709@behnel.de> Hi, Martijn Faassen wrote: > This looks cool. Quite interesting to see we have almost 200 pages of > documentation now (not counting the changelogs). That's impressive! I also got epydoc into generating the API documentation in LaTeX now. Actually, the biggest problem was how to include it in the existing document without breaking the document structure. That means we're at 435 pages now. :) Stefan From stefan_ml at behnel.de Wed Apr 23 12:15:37 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 23 Apr 2008 12:15:37 +0200 Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <fukkqd$265$1@ger.gmane.org> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> Message-ID: <480F0C49.3000508@behnel.de> Hi Martijn, Martijn Faassen wrote: > My suggestion for the stylesheet would be to use whitespace between > paragraphs, instead of what it does now, indenting the paragraphs. Done. 
> There also seems to be a jumping of the left > margin on alternate pages. Perhaps this is nice if you print it? The LaTeX document class was set to "book", which automatically alternates left and right pages. The "report" class is better here. Stefan From Marcin.Kasperski at softax.com.pl Thu Apr 24 10:48:30 2008 From: Marcin.Kasperski at softax.com.pl (Marcin Kasperski) Date: Thu, 24 Apr 2008 10:48:30 +0200 Subject: [lxml-dev] Generating PDF documentation References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F0C49.3000508@behnel.de> Message-ID: <877ien7i5d.fsf@softax.com.pl> > The LaTeX document class was set to "book", which automatically alternates > left and right pages. The "report" class is better here. There are no \part's in report. But book can be set to oneside... From Marcin.Kasperski at softax.com.pl Thu Apr 24 10:49:37 2008 From: Marcin.Kasperski at softax.com.pl (Marcin Kasperski) Date: Thu, 24 Apr 2008 10:49:37 +0200 Subject: [lxml-dev] Generating PDF documentation References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F654E.7000709@behnel.de> Message-ID: <873apb7i3i.fsf@softax.com.pl> > That means we're at 435 pages now. :) I'd strongly consider publishing two separate PDFs. 435 is a bit too much to print and API refs are less useful in print than tutorials and design docs. -- ---------------------------------------------------------------------- | Marcin Kasperski | If Staff, Scope and Schedule are all fixed, | http://mekk.waw.pl | managers will have no options, other than | | prayer. 
(Martin) ---------------------------------------------------------------------- From stefan_ml at behnel.de Thu Apr 24 11:08:46 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 Apr 2008 11:08:46 +0200 (CEST) Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <877ien7i5d.fsf@softax.com.pl> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F0C49.3000508@behnel.de> <877ien7i5d.fsf@softax.com.pl> Message-ID: <54816.194.114.62.37.1209028126.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Marcin Kasperski wrote: >> The LaTeX document class was set to "book", which automatically >> alternates >> left and right pages. The "report" class is better here. > > There are no \part's in report. Yes, there are. Look for "part" in this file, for example: http://ftp.funet.fi/pub/TeX/TeX-3.14/latex/report.sty The differences between document classes are really small, and "report" is almost like "book", except for the things that make a difference here. Stefan From stefan_ml at behnel.de Thu Apr 24 11:13:21 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 Apr 2008 11:13:21 +0200 (CEST) Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <873apb7i3i.fsf@softax.com.pl> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F654E.7000709@behnel.de> <873apb7i3i.fsf@softax.com.pl> Message-ID: <31113.194.114.62.37.1209028401.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Marcin Kasperski wrote: >> That means we're at 435 pages now. :) > > I'd strongly consider publishing two separate PDFs. 435 is a bit too > much to print and API refs are less useful in print than tutorials and > design docs. True.
What about just moving the generated stuff to the end of the PDF? That way, you can print the file up to page X (somewhere around 200) if you want to print it, but you'd still have everything in one file to carry around and do a text search. Stefan From Marcin.Kasperski at softax.com.pl Thu Apr 24 11:37:26 2008 From: Marcin.Kasperski at softax.com.pl (Marcin Kasperski) Date: Thu, 24 Apr 2008 11:37:26 +0200 Subject: [lxml-dev] Generating PDF documentation References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F0C49.3000508@behnel.de> <877ien7i5d.fsf@softax.com.pl> <54816.194.114.62.37.1209028126.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <87lk3361bd.fsf@softax.com.pl> > Yes, there are. Look for "part" in this file, for example: Sorry if so. > The differences between document classes are really small, and "report" is > almost like "book", except for the things that make a difference here. IIRC there are also things like forcing part start on the odd page etc. Whatever - surely one can start from any of those classes and tune it. The main problem is what is the expected result ;-) Some things to consider:
(purely look&feel)
- more aesthetic title page (ideas welcome, things like underlined title in big letters on the lower part of the page come to mind)
- more aesthetic chapter and part titles
- some font tuning (there are a few fonts which look better than computer modern)
- more aesthetic footers
(content)
- information ordering and chapter/titles review (see the table of contents, there is chapter lxml containing section lxml, What's new in lxml 2.0 is in first part while detailed change history at the end, some discussion about ElementTree compatibility in between performance and FAQ and before the first tutorial etc etc)
- (?)
index (of crucial terms) -- ---------------------------------------------------------------------- | Marcin Kasperski | You have the right to peace, fun, and | http://mekk.waw.pl | productive and enjoyable work. (Beck) | | ---------------------------------------------------------------------- From Marcin.Kasperski at softax.com.pl Thu Apr 24 11:41:58 2008 From: Marcin.Kasperski at softax.com.pl (Marcin Kasperski) Date: Thu, 24 Apr 2008 11:41:58 +0200 Subject: [lxml-dev] Generating PDF documentation References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F654E.7000709@behnel.de> <873apb7i3i.fsf@softax.com.pl> <31113.194.114.62.37.1209028401.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <87hcdr613t.fsf@softax.com.pl> "Stefan Behnel" <stefan_ml at behnel.de> writes: > Marcin Kasperski schrieb: >>> That means we're at 435 pages now. :) >> >> I'd strongly consider publishing two separate PDFs. 435 is a bit too >> much to print and API refs are less useful in print than tutorials and >> design docs. > > True. What about just moving the generated stuff to the end of the PDF? > That way, you can print the file up to page X (somewhere around 200) if > you want to print it, but you'd still have everything in one file to carry > around and do a text search. Table of contents will still be polluted, plus ... partial printing is amazingly error-prone (the fact that printed page numbers mismatch with PDF-browser page numbers is one of the factors). IMO it would be nice to have two PDF - lxml Developer Guide (or so) and lxml Reference. As a sidenote: I am not sure whether 'Changes' chapter is at all needed in PDF, this is 30 pages of information nobody is likely to read. 
-- ---------------------------------------------------------------------- | Marcin Kasperski | Communication takes place between people, | http://mekk.waw.pl | documents are secondary. (Booch) | | ---------------------------------------------------------------------- From stefan_ml at behnel.de Thu Apr 24 11:54:50 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 Apr 2008 11:54:50 +0200 (CEST) Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <31113.194.114.62.37.1209028401.squirrel@groupware.dvs.informatik.tu-darmstadt.de> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F654E.7000709@behnel.de> <873apb7i3i.fsf@softax.com.pl> <31113.194.114.62.37.1209028401.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <23122.194.114.62.37.1209030890.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Stefan Behnel wrote: > Marcin Kasperski wrote: >>> That means we're at 435 pages now. :) >> >> I'd strongly consider publishing two separate PDFs. 435 is a bit too >> much to print and API refs are less useful in print than tutorials and >> design docs. > > True. What about just moving the generated stuff to the end of the PDF? > That way, you can print the file up to page X (somewhere around 200) ... or less, considering that you are most likely not interested in the complete ChangeLog either. I think the best thing to do is to move both the changelog and the API docs into the appendix. That gives us some 170 pages of 'real' documentation, which is visibly separated from the rest.
Stefan From stefan_ml at behnel.de Thu Apr 24 12:43:52 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 Apr 2008 12:43:52 +0200 (CEST) Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <87hcdr613t.fsf@softax.com.pl> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F654E.7000709@behnel.de> <873apb7i3i.fsf@softax.com.pl> <31113.194.114.62.37.1209028401.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <87hcdr613t.fsf@softax.com.pl> Message-ID: <32512.194.114.62.37.1209033832.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Marcin Kasperski wrote: > partial printing is > amazingly error-prone (the fact that printed page numbers mismatch with > PDF-browser page numbers is one of the factors). That can be fixed. > IMO it would be nice to have two PDF - lxml Developer Guide (or so) > and lxml Reference. I don't see that need. Be it a 200 page document or one with 400 pages, people will only read the parts they are interested in. And if the page numbers match the print range, I'm just fine with providing a single download instead of risking that people might want to look things up on the train and only notice then that they forgot to download the second PDF. > As a sidenote: I am not sure whether 'Changes' chapter is at all needed > in PDF, this is 30 pages of information nobody is likely to read. People who must consider backwards compatibility might care. 
Stefan From stefan_ml at behnel.de Thu Apr 24 12:48:20 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 24 Apr 2008 12:48:20 +0200 (CEST) Subject: [lxml-dev] Generating PDF documentation In-Reply-To: <87lk3361bd.fsf@softax.com.pl> References: <8763ufb6uk.fsf@softax.com.pl> <4808FA2D.4090703@behnel.de> <1208552385.2937.18.camel@platon> <4809F040.10800@behnel.de> <480A0C79.1080309@behnel.de> <fukkqd$265$1@ger.gmane.org> <480F0C49.3000508@behnel.de> <877ien7i5d.fsf@softax.com.pl> <54816.194.114.62.37.1209028126.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <87lk3361bd.fsf@softax.com.pl> Message-ID: <5919.194.114.62.37.1209034100.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Marcin Kasperski wrote: > - more aestethic title page (ideas welcome, things like underlined > title in big letters on the lower part of the page come to min) I added a logo for now, but there's more that can be done, sure. > - information ordering and chapter/titles review (see the table of > contents, > there is chapter lxml containing section lxml, What's new in lxml 2.0 is > in > first part while detailed change history on the end, some discussion > about > ElementTree compatibility i between performance and FAQ and before the > first tutorial etc etc) That might be something worth changing on the web page menu, too. See doc/docstructure.py > - (?) index (of crucial terms) That's a hard one. The information is not there in the ReST files. Stefan From stefan_ml at behnel.de Fri Apr 25 10:11:36 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 25 Apr 2008 10:11:36 +0200 Subject: [lxml-dev] help needed for cleaning up the docs Message-ID: <48119238.1080601@behnel.de> Hi, since Marcin brought this up, I'd like to aggregate some effort for cleaning up the docs in the "doc" directory (ReStructured Text format). http://codespeak.net/svn/lxml/trunk/doc/ For example, I find the current homepage (main.txt) lengthy, unfocused and overwhelming. 
It only remains usable due to the menu on the left, not by its content itself. There should be a better balance between the amount of content and its accessibility. Maybe the "Changes" section and a part of the "Download" section could go on the installation page, for example. I also find the "Documentation" section too detailed. The list may be fine, but the explanations could be moved to the FAQ entry on documentation, or maybe even a dedicated documentation page. There's also the "APIs specific to lxml.etree" page (api.txt), which continues to be a mess of things that belong nowhere else. It's outdated, badly integrated into the rest of the documentation and overlaps with other docs. The tutorial, for example, has a more extensive section on iteration, and things like the error log should also be handled in the tutorial. Maybe the current api.txt is the right place to start a general "start here" page on documentation? I'm unsure what to do with the "why lxml?" page (intro.txt). I think the content should be there, although it's somewhat dated by now. But I don't feel like it merits its own page (or chapter in the PDF). Maybe a FAQ section would do, although that would be a much less prominent place (and it's not really a FAQ anyway...). Same for the "what's new in 2.0" page (lxml2.txt). Since lxml 2.1 is close, it will soon lose its front-page topicality. I would be very happy if a couple of people could spare hands and thoughts on this topic. I find accessible documentation very important, and it's something everyone can work on who wants to give something back to the project. Thanks for any help!
Stefan From stefan_ml at behnel.de Sun Apr 27 00:25:05 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 27 Apr 2008 00:25:05 +0200 Subject: [lxml-dev] Problem with using the same URI twice in a namespace In-Reply-To: <20080423102007.12c33672@pluto.noname> References: <20080422230117.02d0bbaf@pluto.noname> <480E4DA4.6090609@behnel.de> <20080423102007.12c33672@pluto.noname> Message-ID: <4813ABC1.2080206@behnel.de> Andreas Degert wrote:
> I think the behaviour leads to a bug:
>
> t = Element("top",nsmap={None:"a","b":"b"})
> SubElement(t, "{b}foobar", {"{a}bar":""})
> print tostring(t, pretty_print=True)
> -----
> <top xmlns="a" xmlns:b="b">
>   <b:foobar bar=""/>
> </top>
> -----
This is definitely a problem in the serialiser of libxml2:
>>> t = Element("top",nsmap={None:"a","b":"b",'a':'a'})
>>> SubElement(t, "{b}foobar", {"{a}bar":""})
<Element {b}foobar at b798dd9c>
>>> print tostring(t, pretty_print=True)
<top xmlns="a" xmlns:a="a" xmlns:b="b">
  <b:foobar bar=""/>
</top>
It would have to prefer the prefixed namespace instead of the default one to get this right. But this does not come for free, imagine this case:
<top xmlns:a="a" xmlns:b="b">
  <test xmlns="a">
    <b:foobar bar=""/>
  </test>
</top>
So it would always have to check the entire root path if the attribute target namespace is defined with an empty prefix, and the current element has a different namespace. Stefan
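The lookup described in the last paragraph can be sketched in plain Python. This is only an illustration of the reasoning, not the code of libxml2 or lxml: the function name pick_attribute_prefix and the list-of-dicts representation of the declarations along the root path are assumptions made for the example.

```python
def pick_attribute_prefix(scopes, uri):
    """Return a usable non-empty prefix for an attribute in namespace `uri`.

    `scopes` lists the namespace declarations on the path from the root
    element down to the element that carries the attribute, outermost
    first, each as a {prefix: uri} dict (None stands for xmlns="...").
    Returns None if only the default namespace (or nothing) is bound to
    `uri`, in which case a new prefixed declaration would be needed.
    """
    in_scope = {}
    for declarations in scopes:
        # Declarations on inner elements shadow those further out.
        in_scope.update(declarations)
    for prefix, bound_uri in in_scope.items():
        # The default namespace never applies to attributes, so skip it.
        if prefix is not None and bound_uri == uri:
            return prefix
    return None


# The nested case from the message: the prefix "a" is declared on <top>,
# while <test> rebinds the same namespace as the default.
nested = [{"a": "a", "b": "b"}, {None: "a"}, {}]
print(pick_attribute_prefix(nested, "a"))

# The tree from the original report: only the default namespace is bound to "a".
default_only = [{None: "a", "b": "b"}]
print(pick_attribute_prefix(default_only, "a"))
```

The first call finds the prefix "a" two levels up; the second finds nothing, so a serialiser following this logic would have to emit an extra declaration. Collecting the declarations from every ancestor is the per-attribute cost that the message warns about.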