From ianb at colorstudy.com Tue Jul 1 02:48:30 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 30 Jun 2008 19:48:30 -0500
Subject: [lxml-dev] Segmentation fault in lxml.html after pickling
In-Reply-To: <35866.145.253.136.18.1213885001.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de>
 <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <35866.145.253.136.18.1213885001.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <48697EDE.4060507@colorstudy.com>

Stefan Behnel wrote:
> Martijn Faassen wrote:
>> I'd love it if I could somehow store lxml trees in the ZODB, and that'd
>> need pickle support. Whether it could be made to be efficient I don't
>> know - you'd not want the whole tree to be pickled as a whole in case of
>> large trees, but some form of partitioning scheme into separate pickles.
>> You're right that custom-element binding would be nice in this case, and
>> that means the pickle can't simply be the XML content unless it's
>> somehow annotated first.
>>
>> Anyway, this is a rather out there use case. I am just intrigued to
>> learn that objectify elements can be pickled.
>
> It's just easier to do in objectify, as it has a pretty comprehensive
> setup for Element class mapping. If you want to be sure to get back
> exactly the same Element tree after pickling, you can just annotate() an
> objectify tree before pickling it.
>
> Doing the same thing in lxml.etree would require storing some information
> about the current Element lookup, which may be a lot of information, e.g.
> for the namespace class setup. That's a parser-local setup, so we can't
> just use the setup of the default parser either but need a concrete
> context for the unpickling.
>
> lxml.html might be considered having such a context in a similar way
> lxml.objectify has it, as it comes with its own classes and lookup scheme.
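
The annotate-then-pickle suggestion quoted above can be illustrated with a
minimal sketch. It assumes a current lxml under Python 3; the sample document
and the element names (total, label) are made up for the example and do not
come from the thread.

```python
import pickle

from lxml import objectify

# Hypothetical sample document, chosen only for illustration.
root = objectify.fromstring("<root><total>5</total><label>lxml</label></root>")

# Store the detected Python types as attributes so that the same
# types come back after the tree has been serialised and re-parsed.
objectify.annotate(root)

# Round-trip the objectify tree through pickle.
restored = pickle.loads(pickle.dumps(root))

print(restored.total + 1)
print(restored.label.text)
```

The pickle here is essentially the serialised XML, which is why the
annotation step matters when the restored types have to match exactly.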
Just what would end up being pickled, do you think? The entire document?

A first thought is that the document gets pickled, and then the element
is an offset in that document. Like, erm...

class HtmlMixin:
    def __getstate__(self):
        return (self.getroottree(), self._indexes_to_self())

    def _indexes_to_self(self):
        result = []
        el = self
        while el.getparent():
            result.insert(0, el.getparent().index(el))
            el = el.getparent()
        return result

    def __setstate__(self, state):
        # Dammit... this doesn't actually work:
        doc, indexes_to_self = state
        el = doc.getroot()
        for index in indexes_to_self:
            el = el[index]
        return el

There is no return value for __setstate__, and no way to indicate a
constructor method for creating instances. That's dumb. I don't like
pickle.

For documents, if the pickle hooks worked reasonably I'd just store the
serialization of the document (as a string) plus all the special
attributes (doctype, url, etc). Given that the hooks don't work
reasonably I'm not sure how to do it; maybe people with the ZODB
experience to have hit this problem would have an idea? From what I can
tell there's no reason to store the document as anything but a string --
serializing and re-parsing the string is faster than any other means of
storing a document (it all ends up as strings eventually anyway).

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From mmaccana at au1.ibm.com Tue Jul 1 05:13:30 2008
From: mmaccana at au1.ibm.com (Mike MacCana)
Date: Tue, 01 Jul 2008 13:13:30 +1000
Subject: [lxml-dev] Premature end of data in tag - but it looks well formed
Message-ID: <1214882010.19173.12.camel@mmaccana-laptop>

Hi gents,

Firstly, thanks for lxml. It's by far the nicest tool for someone who
needs to do xpath in python without being an XML god.

I'm a first time user of lxml attempting to etree.parse a document.
My code (below) works fine on some sample text, but libxml complains about
the real data with:

etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5

The data is below. Line 5 seems OK to me, but I'm new to XML coding so
maybe I'm missing something.

__________________________________
1 2 3 4 5
__________________________________

Any ideas? The full code is below.

Cheers,

Mike

#!/usr/bin/env python
import urllib, sys, lxml, StringIO
from lxml import etree
from StringIO import StringIO

# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://xpvm:3128'}
url='http://peoplesearch.in.telstra.com.au:8094/peoplesearch/userdetail.aspx?BaseDN=CN=d299061,OU=People,OU=eProfile,DC=PeopleSearch,DC=Telstra,DC=Com'
filehandle = urllib.urlopen(url, proxies=proxies)
print filehandle

## Real html
html=filehandle.read()
## Test html
#html="underpants"

print "--------------------------------"
print html
print '=========================='

f = StringIO(html)
tree = etree.parse(f)

## Real xpath
r = tree.xpath('/html/body/div[4]/form/div[3]/div/div/div/div/table/tbody/tr[6]/td')
## Test xpath
#r = tree.xpath('/foo/bar/baz')

print 'length:'
print len(r)
print 'tag:'
print r[0].tag
print 'contents:'
print r[0].text

________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services
IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana at au1.ibm.com

From stefan_ml at behnel.de Tue Jul 1 07:20:54 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 01 Jul 2008 07:20:54 +0200
Subject: [lxml-dev] namespace strangeness in lxml 1.1
In-Reply-To: <1214856697.868.38.camel@localhost.localdomain>
References: <1214856697.868.38.camel@localhost.localdomain>
Message-ID: <4869BEB6.9000609@behnel.de>

Hi,

Eric Jahn wrote:
> type="{http://domain2.info}someattribute
>
> element = etree.Element(NS2 + "secondelement", nsmap=NSMAP, type = NS2 +
> "someattribute")

You are setting a namespace as attribute /value/ here, not as attribute
/name/. lxml will not modify content unless you tell it to do so. If you
want it to replace the namespace by a resolved prefix, use

    type = etree.QName(NS2 + "...")

If it's just a mistake and you wanted to set the attribute /namespace/
instead, pass

    attrib = {NS2 + "someattribute" : "somevalue"}

to the Element factory. There should also be a section on this in the
tutorial IIRC.

Stefan

From stefan_ml at behnel.de Tue Jul 1 07:24:17 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 01 Jul 2008 07:24:17 +0200
Subject: [lxml-dev] Premature end of data in tag - but it looks well formed
In-Reply-To: <1214882010.19173.12.camel@mmaccana-laptop>
References: <1214882010.19173.12.camel@mmaccana-laptop>
Message-ID: <4869BF81.20501@behnel.de>

Hi,

Mike MacCana wrote:
> Hi gents,

Are you sure you don't want advice from any girls?

> I'm a first time user of lxml attempting to etree.parse a document. My
> code (below) works fine on some sample text, but libxml complains about
> the real data with:
>
> etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
>
> The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> maybe I'm missing something.

The problem is not in line 5 (where the html tag starts) but in line 196,
where it apparently ends. Try validating it at the W3C validator if you
don't believe lxml.
;)

Stefan

From stefan_ml at behnel.de Tue Jul 1 08:35:31 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 1 Jul 2008 08:35:31 +0200 (CEST)
Subject: [lxml-dev] Segmentation fault in lxml.html after pickling
In-Reply-To: <48697EDE.4060507@colorstudy.com>
References: <47812.XVNZDFwXRQM=.1213866327.squirrel@webmailer.hosteurope.de>
 <53226.145.253.136.18.1213867558.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <35866.145.253.136.18.1213885001.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <48697EDE.4060507@colorstudy.com>
Message-ID: <49012.213.61.181.86.1214894131.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Ian Bicking wrote:
> A first thought is that the document gets pickled, and then the element
> is an offset in that document.

That's a brilliant idea, but why so complicated? :)

pickle:
    doc = self.getroottree()
    return (tostring(doc), doc.getpath(self))

unpickle:
    doc, path = pickle_value
    return doc.xpath(path)

would do the trick. Maybe we should serialise as XML instead of HTML, so
that we don't run into any "relaxed parser" problems (I remember a not so
old libxml2 HTML serialiser bug with roundtrips, for example).

> There is no return value for __setstate__, and no way to indicate a
> constructor method for creating instances. That's dumb. I don't like
> pickle.

:)

You don't have to use __[sg]etstate__(). You can define an external
function to do it for you, just like objectify does (search
src/lxml/lxml.objectify.pyx for "pickle"). The stupid thing is that this
function has to be registered /and/ public. It's not enough to register it
and delete it afterwards...

Still, the problem remains that we need to assure we keep the element
lookup context, so this is still not a general solution for lxml.etree.
But it should be suitable for lxml.html.
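
The serialise-plus-path idea above can be sketched as a pair of helper
functions. This is only an illustration under stated assumptions: Python 3,
plain lxml.etree elements, and a document without namespaces (getpath()
output may not be directly usable with xpath() otherwise). The helper names
dump_element and load_element are hypothetical, and the sketch does not
address the element lookup context mentioned above.

```python
import pickle

from lxml import etree


def dump_element(el):
    """Pickle an element as (serialised document, path to the element)."""
    tree = el.getroottree()
    return pickle.dumps((etree.tostring(tree), tree.getpath(el)))


def load_element(data):
    """Re-parse the document and look the element up again by its path."""
    xml_bytes, path = pickle.loads(data)
    root = etree.fromstring(xml_bytes)
    # getpath() returns an absolute path, so evaluate it on the tree.
    return root.getroottree().xpath(path)[0]


# Hypothetical sample document for illustration.
doc = etree.fromstring("<root><a/><a><b>hi</b></a></root>")
el = doc[1][0]

restored = load_element(dump_element(el))
print(restored.tag, restored.text)
```

Note that the unpickle step has to parse the string before running the path
expression, and that every pickled element carries a copy of its whole
document.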
Stefan

From jholg at gmx.de Tue Jul 1 09:37:42 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 01 Jul 2008 09:37:42 +0200
Subject: [lxml-dev] objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta
In-Reply-To: <48608FC4.6000200@behnel.de>
References: <20080623160833.295700@gmx.net> <48608FC4.6000200@behnel.de>
Message-ID: <20080701073742.85060@gmx.net>

Hi,

> Holger Joukl wrote:
> > I have a usecase where I need to deannotate an objectified tree
> > and then manually set py:pytype or xsi:type attributes.
> >
> > However, this seems to be getting difficult with 2.1beta as deannotate
> > wipes out all nsmap information with its call to cleanup_namespaces(),
> > and I cannot set a namespaced
> > attribute through .set(...)

Just to be precise: A namespaced attribute value like "xsd:string". It is
easy to set a ns-qualified attribute using Clark notation, as anywhere in
lxml.

> > could we make the call to cleanup_namespaces optional (defaults
> > to True) in deannotate()?
>
> I wasn't entirely sure if it was a good idea when I added it. I guess
> it's best to keep it out or make it optional (default False).

I'll remove it, then.

Rationale: lxml does a good job of keeping namespace declarations clean
when adding elements to a tree anyway, so with objectify's default nsmap
namespace declarations concerning xsi:type and py:pytype are usually
located at the root element only.

Anyone who needs a real clean document can still conveniently call
etree.cleanup_namespaces() after deannotate().

Holger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080701/9b2ed745/attachment.htm

From jholg at gmx.de Tue Jul 1 10:48:36 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 01 Jul 2008 10:48:36 +0200
Subject: [lxml-dev] objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta
In-Reply-To: <20080701073742.85060@gmx.net>
References: <20080623160833.295700@gmx.net> <48608FC4.6000200@behnel.de>
 <20080701073742.85060@gmx.net>
Message-ID: <20080701084842.85090@gmx.net>

> > > could we make the call to cleanup_namespaces optional (defaults
> > > to True) in deannotate()?
> >
> > I wasn't entirely sure if it was a good idea when I added it. I guess
> > it's best to keep it out or make it optional (default False).
>
> I'll remove it, then.
>
> Rationale: lxml does a good job of keeping namespace declarations clean
> when adding elements to a tree anyway, so with objectify's default nsmap
> namespace declarations concerning xsi:type and py:pytype are usually
> located at the root element only.
>
> Anyone who needs a real clean document can still conveniently call
> etree.cleanup_namespaces() after deannotate().

Committed to trunk (revision 56199):

$ svn diff -r55702:56199 src/lxml/lxml.objectify.pyx
Index: src/lxml/lxml.objectify.pyx
===================================================================
--- src/lxml/lxml.objectify.pyx (revision 55702)
+++ src/lxml/lxml.objectify.pyx (revision 56199)
@@ -1752,7 +1752,6 @@
             cetree.delAttributeFromNsName(
                 c_node, _XML_SCHEMA_INSTANCE_NS, "type")
         tree.END_FOR_EACH_ELEMENT_FROM(c_node)
-    etree.cleanup_namespaces(element)
 
 
 ################################################################################

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080701/7a3ba92a/attachment.htm

From mmaccana at au1.ibm.com Tue Jul 1 11:12:58 2008
From: mmaccana at au1.ibm.com (Mike MacCana)
Date: Tue, 01 Jul 2008 19:12:58 +1000
Subject: [lxml-dev] A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)
In-Reply-To: <4869BF81.20501@behnel.de>
References: <1214882010.19173.12.camel@mmaccana-laptop> <4869BF81.20501@behnel.de>
Message-ID: <1214903578.19173.41.camel@mmaccana-laptop>

Ladies and gentlemen,

On Tue, 2008-07-01 at 07:24 +0200, Stefan Behnel wrote:
> Hi,
>
> Mike MacCana wrote:
> > Hi gents,
>
> Are you sure you don't want advice from any girls?
>
> > I'm a first time user of lxml attempting to etree.parse a document. My
> > code (below) works fine on some sample text, but libxml complains about
> > the real data with:
> >
> > etree.XMLSyntaxError: line 196: Premature end of data in tag html line 5
> >
> > The data is below. Line 5 seems OK to me, but I'm new to XML coding so
> > maybe I'm missing something.
>
> The problem is not in line 5 (where the html tag starts) but in line 196,
> where it apparently ends. Try validating it at the W3C validator if you
> don't believe lxml. ;)

Thanks Stefan. I solved the crap HTML problem as follows. Hopefully the
following will be useful to anyone beginning XPath with lxml.

#!/usr/bin/env python
import urllib, sys, lxml, StringIO, lxml.html,os
from lxml import etree
from StringIO import StringIO
from lxml.html.clean import Cleaner

## Point this at your XP VM used to get to Telstra
proxies = {'http': 'http://xpvm:3128'}
url='http://domain.com/page'

## Function to strip non-ascii characters
## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
## for list
def onlyascii(char):
    if ord(char) < 32 or ord(char) > 176:
        return ''
    else:
        return char

## Open the URL and read its contents
filehandle = urllib.urlopen(url, proxies=proxies)
html=filehandle.read()
asciihtml=filter(onlyascii, html)

## Customer's HTML content is REALLY bad. Clean it.
## See http://codespeak.net/lxml/lxmlhtml.html#cleaning-up-html
## and 'pydoc lxml.html.clean.Cleaner'
## Clean HTML and strip a bunch of tags that are broken and that we dont care about.
badtags=['img','a','div','span','h2','h1','style','title','ul','li','col']
cleaner = Cleaner(page_structure=False, links=False, remove_tags=badtags )

## We can now access our cleaned content as 'cleanedcontent'
cleanedcontent=cleaner.clean_html(asciihtml)

## Save Clean content to disk for debugging purposes
os.remove('debug.html')
outputfile = open('debug.html','w')
outputfile.write(cleanedcontent)
outputfile.close()

## Go parse our content
cleanedcontentstringio = StringIO(cleanedcontent)
parser = etree.XMLParser(recover=True)
tree = etree.parse(cleanedcontentstringio)

## Xpath locations of what we're interested in (element zero is all we care about
## text is the text within the tags, and strip off any whitespace
## You can find XPath locations by loading up 'debug.html' in Firefox with the Firebug extension
name = tree.xpath('/html/body/table/tbody/tr/td')[0].text.strip()
email = tree.xpath('/html/body/table/tbody/tr[7]/td')[0].text.strip().lower()
print name+","+email

Cheers,

Mike
________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services
IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana at au1.ibm.com

From stefan_ml at behnel.de Tue Jul 1 13:51:32 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 1 Jul 2008 13:51:32 +0200 (CEST)
Subject: [lxml-dev] objectify.deannotate: call to etree.cleanup_namespaces in 2.1beta
In-Reply-To: <20080701084842.85090@gmx.net>
References: <20080623160833.295700@gmx.net> <48608FC4.6000200@behnel.de>
 <20080701073742.85060@gmx.net> <20080701084842.85090@gmx.net>
Message-ID: <48955.213.61.181.86.1214913092.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

jholg at gmx.de wrote:
>>> could we make the call to cleanup_namespaces optional (defaults
>>> to True) in deannotate()?
>>>
>>> I wasn't entirely sure if it was a good idea when I added it. I guess
>>> it's best to keep it out or make it optional (default False).
>>
>> I'll remove it, then.

Ok, thanks.

Stefan

From stefan_ml at behnel.de Tue Jul 1 14:03:35 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 1 Jul 2008 14:03:35 +0200 (CEST)
Subject: [lxml-dev] A quick and simple xpath solution for nasty HTML (was Re: Premature end of data in tag - but it looks well formed)
In-Reply-To: <1214903578.19173.41.camel@mmaccana-laptop>
References: <1214882010.19173.12.camel@mmaccana-laptop> <4869BF81.20501@behnel.de>
 <1214903578.19173.41.camel@mmaccana-laptop>
Message-ID: <61901.213.61.181.86.1214913815.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

Mike MacCana wrote:
> I solved the crap HTML problem as follows. Hopefully the following will
> be useful to anyone beginning XPath with lxml.

Just adding a few comments as I see fit.

> ## Function to strip non-ascii characters
> ## See http://en.wikipedia.org/wiki/Ascii#ASCII_printable_characters
> ## for list
> def onlyascii(char):
>     if ord(char) < 32 or ord(char) > 176:
>         return ''
>     else:
>         return char

Note that this will not work as expected with multi-byte encodings such as
UTF-8.

> ## We can now access our cleaned content as 'cleanedcontent'
> cleanedcontent=cleaner.clean_html(asciihtml)

This will (obviously) parse the HTML into a tree internally, so it's more
efficient to pass a parsed tree directly.

> ## Go parse our content
> cleanedcontentstringio = StringIO(cleanedcontent)
> parser = etree.XMLParser(recover=True)
> tree = etree.parse(cleanedcontentstringio)

I wonder why you use an XML parser here. The HTML parser will likely work
better, as it knows about self-closing HTML tags.

Stefan

From jholg at gmx.de Tue Jul 1 12:48:56 2008
From: jholg at gmx.de (Holger Joukl)
Date: Tue, 01 Jul 2008 12:48:56 +0200
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
Message-ID: <20080701121932.85070@gmx.net>

Hi,

I guess the module functions XML() and parse() should also support the
base_url arg?

Also, I suppose enableRecursiveStr() could be removed?

Btw I realized that etree.tounicode() is bound to be deprecated in favor
of tostring(..., encoding=unicode). I suppose this is owed to ElementTree
API compat which doesn't have tounicode() - or is this a py3k issue?

IMHO unicode is not an encoding and from my experience it confuses people
starting out with unicode to think of unicode as an encoding. That said, I
rather like the tostring()/tounicode() distinction API-wise, if purely for
documentational purposes.

I'm not at all questioning the possibility to produce a unicode
serialization of an XML tree: This is really helpful in lxml as it enables
one to fallback to python encoding capabilities if libxml2 does not
support some intended target encoding.

Holger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080701/888dbe25/attachment.htm

From stefan_ml at behnel.de Tue Jul 1 15:36:24 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 1 Jul 2008 15:36:24 +0200 (CEST)
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
In-Reply-To: <20080701121932.85070@gmx.net>
References: <20080701121932.85070@gmx.net>
Message-ID: <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi Holger,

looks like you started cleaning up. :)

Holger Joukl wrote:
> I guess the module functions XML() and parse() should also support the
> base_url arg?

Yes.

> Also, I suppose enableRecursiveStr() could be removed?

I never really liked it, but why would you want to remove it?

> Btw I realized that etree.tounicode() is bound to be deprecated in favor
> of tostring(..., encoding=unicode).

Yes. Having a second function for a more limited functional scope is just
superfluous.

BTW, does that affect objectify in any way or is it just curiosity (or
users interest) on your side?

> I suppose this is owed to ElementTree API compat which doesn't have
> tounicode() - or is this a py3k issue?

Actually, the "encoding=unicode" bit has a Py3k issue. In Py3, you have to
say "encoding=str" instead...

> IMHO unicode is not an encoding and from my experience it confuses
> people starting out with unicode to think of unicode as an encoding.

If you start with unicode, I think this is your smallest problem.

You are right that it's not an encoding and I admit that this might look a
little hackish if you think about it.
However, a unicode string is a well-defined way of representing the data,
and it replaces the byte encoding that you'd normally get from the
tostring() function. So it fits into the existing API quite well.

> I'm not at all questioning the possibility to produce a unicode
> serialization of an XML tree:
>
> This is really helpful in lxml as it enables one to fallback to python
> encoding capabilities if libxml2 does not support some intended target
> encoding.

... although you'd have to take care that you strip the encoding
declaration.

My favourite use case are actually doctests.

Stefan

From jholg at gmx.de Tue Jul 1 16:21:29 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 01 Jul 2008 16:21:29 +0200
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
In-Reply-To: <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <20080701121932.85070@gmx.net>
 <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <20080701154552.85080@gmx.net>

Hi Stefan,

> looks like you started cleaning up. :)

Quite right. I started having a bad conscience for never really looking at
2.1 for quite a while. Works smoothly for me for all I can tell.

> Holger Joukl wrote:
> > I guess the module functions XML() and parse() should also support the
> > base_url arg?
>
> Yes.

Implemented on trunk, revision 56201. I stole the unittests from
test_etree and noticed that I also had to special case 'base' in
objectify's __setattr__ magic.

> > Also, I suppose enableRecursiveStr() could be removed?
>
> I never really liked it, but why would you want to remove it?

I put it the wrong way: There's already enable_recursive_str() which
should be used instead. I for one actually *need* it, so I do like it :)
But some other of the old CamelCase method/function names went away, so I
figured this can also go.

> > Btw I realized that etree.tounicode() is bound to be deprecated in
> > favor of tostring(..., encoding=unicode).
>
> Yes. Having a second function for a more limited functional scope is just
> superfluous.
>
> BTW, does that affect objectify in any way or is it just curiosity (or
> users interest) on your side?

No, just curiosity. I currently use tounicode() for what I outlined
(fallback to python encoding capabilities) but can just as well switch to
the new conventions.

> > I suppose this is owed to ElementTree API compat which doesn't have
> > tounicode() - or is this a py3k issue?
>
> Actually, the "encoding=unicode" bit has a Py3k issue. In Py3, you have
> to say "encoding=str" instead...

How do you specify which actual encoding, e.g. 'ISO-8859-15', here?

> > IMHO unicode is not an encoding and from my experience it confuses
> > people starting out with unicode to think of unicode as an encoding.
>
> If you start with unicode, I think this is your smallest problem.
>
> You are right that it's not an encoding and I admit that this might look
> a little hackish if you think about it. However, a unicode string is a
> well-defined way of representing the data, and it replaces the byte
> encoding that you'd normally get from the tostring() function. So it fits
> into the existing API quite well.

lxml is just a fine design. So even smallest deviations in the realms of
hackishness provoke protest storms ;). Just joking, and maybe I'm being
anal about it but it still feels a little uncomfortable to hand in
something that isn't an encoding to a parameter that is named 'encoding',
if only from an educational perspective.

Not that I can't live with it, especially since I can't think of a good
alternative... Yet another parameter to tostring() feels awkward, and
renaming the parameter conflicts with ElementTree compatibility.

Holger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080701/ee2b7cf1/attachment.htm

From eric at ejahn.net Tue Jul 1 18:19:03 2008
From: eric at ejahn.net (Eric Jahn)
Date: Tue, 01 Jul 2008 12:19:03 -0400
Subject: [lxml-dev] lxml website
Message-ID: <1214929144.10505.36.camel@localhost.localdomain>

Hello folks,

Just wondering if we could remove the Table of Contents sidebar shown
throughout the http://codespeak.net/lxml/ site. On my 12" laptop running
Iceweasel/Firefox or Epiphany, it occupies most of the screen space and
even flows into the main text area. I usually use Firebug to remove the "
and its subelements so I can read the web page. Could we make that
sidemenu resizeable or just get rid of it? Thanks!

-Eric

From eric at ejahn.net Tue Jul 1 18:54:19 2008
From: eric at ejahn.net (Eric Jahn)
Date: Tue, 01 Jul 2008 12:54:19 -0400
Subject: [lxml-dev] namespace strangeness in lxml 1.1
In-Reply-To: <4869BEB6.9000609@behnel.de>
References: <1214856697.868.38.camel@localhost.localdomain> <4869BEB6.9000609@behnel.de>
Message-ID: <1214931259.10505.48.camel@localhost.localdomain>

On Tue, 2008-07-01 at 07:20 +0200, Stefan Behnel wrote:
> ... You are setting a namespace as attribute /value/ here, not as attribute
> /name/. ...

Stefan, thank you very much for the response. Yes, I am intending to set
the namespace in an attribute value (not attribute name) here as I am
creating an XML Schema Document with lxml, not an XML Instance Document.
I apologize for not clarifying that in my post.

> lxml will not modify content unless you tell it to do so. If you want
> it to replace the namespace by a resolved prefix, use
>
> type = etree.QName(NS2 + "...")

No, I don't want the prefix resolved to the URL, so I guess my only
option is to do something like the following and just pass the type value
a string with the namespace prefix explicitly stated:

child1 = etree.SubElement(root,NS2 + "secondelement", nsmap=NSMAP,
                          type = "NS2:someattribute")

I think the tutorial could benefit from a small section on creating
schema docs as opposed to instance docs. I'd be happy to submit the start
of such a new section? Would this be helpful? I know it would have saved
me a little time...

Thanks again for your help!
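
To make the name-versus-value distinction in this thread concrete, here is a
minimal sketch, assuming a current lxml under Python 3. The "ns2" prefix, the
NS2_URI constant and the variable names are chosen for the example only; the
namespace URI is the one quoted earlier in the thread.

```python
from lxml import etree

NS2_URI = "http://domain2.info"
NS2 = "{%s}" % NS2_URI
NSMAP = {"ns2": NS2_URI}  # prefix picked for this example

# Attribute *name* in a namespace: use Clark notation as the attrib key.
qualified = etree.Element(
    NS2 + "secondelement", nsmap=NSMAP,
    attrib={NS2 + "someattribute": "somevalue"})

# Prefixed attribute *value*, as used in schema documents: a plain string.
# lxml leaves it alone, so the prefix must match the nsmap entry by hand.
typed = etree.Element(
    NS2 + "secondelement", nsmap=NSMAP,
    type="ns2:someattribute")

print(etree.tostring(qualified).decode())
print(etree.tostring(typed).decode())
```

In the second case nothing ties the string "ns2:someattribute" to the
declared prefix, so renaming the prefix in the nsmap would silently leave
the value pointing at an undeclared one.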

-Eric

From stefan_ml at behnel.de Tue Jul 1 19:13:04 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 01 Jul 2008 19:13:04 +0200
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
In-Reply-To: <20080701154552.85080@gmx.net>
References: <20080701121932.85070@gmx.net>
 <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <20080701154552.85080@gmx.net>
Message-ID: <486A65A0.1080205@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> I started having a bad conscience for never really looking at
> 2.1 for quite a while.
>
> Works smoothly for me for all I can tell.

Glad to hear that.

>>> I guess the module functions XML() and parse() should also support the
>>> base_url arg?
>
> Implemented on trunk, revision 56201.
> I stole the unittests from test_etree and noticed that I also had to
> special case 'base' in objectify's __setattr__ magic.

Thanks!

>>> Also, I suppose enableRecursiveStr() could be removed?
>> I never really liked it, but why would you want to remove it?
>
> some other of the old CamelCase method/function names went away, so I
> figured this can also go.

Yep, please do so.

>>> I suppose this is owed to ElementTree API compat which doesn't have
>>> tounicode() - or is this a py3k issue?
>> Actually, the "encoding=unicode" bit has a Py3k issue. In Py3, you have
>> to say "encoding=str" instead...
>
> How do you specify which actual encoding, e.g 'ISO-8859-15', here?

Same as before. You get a byte string when you pass an encoding name, and
a unicode string (str type) when you pass str. That's also something I
like about the new interface.

> Yet another parameter to tostring() feels awkward, and renaming the
> parameter conflicts with ElementTree compatibility.

It would also require a bit more parameter checking and exception raising.
The unicode option and the encoding are mutually exclusive, and unicode is
not so far from an encoding that it would really merit an option on its
own.

Note also that you do not pass "Unicode" as a string but the unicode type,
and you get a unicode object back.

Stefan

From stefan_ml at behnel.de Tue Jul 1 19:19:36 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 01 Jul 2008 19:19:36 +0200
Subject: [lxml-dev] namespace strangeness in lxml 1.1
In-Reply-To: <1214931259.10505.48.camel@localhost.localdomain>
References: <1214856697.868.38.camel@localhost.localdomain> <4869BEB6.9000609@behnel.de>
 <1214931259.10505.48.camel@localhost.localdomain>
Message-ID: <486A6728.9060200@behnel.de>

Hi,

Eric Jahn wrote:
> On Tue, 2008-07-01 at 07:20 +0200, Stefan Behnel wrote:
>> type = etree.QName(NS2 + "...")
>
> No, I don't want the prefix resolved the the url, so I guess my only
> option is to do something like the following and just pass the type
> value a string with the namespace prefix explicity stated:
>
> child1 = etree.SubElement(root,NS2 + "secondelement", nsmap=NSMAP, type
> = "NS2:someattribute")

Ah, now that you mention it, the above doesn't actually work. It only
works for element text, not for attributes. I'll see if I can change that.

> I think the tutorial could benefit from a small section on creating
> schema docs as opposed to instance docs. I'd be happy to submit the
> start of such a new section? Would this be helpful? I know it would
> have saved me a little time...

Please do, any contribution is appreciated.

Stefan

From stefan_ml at behnel.de Wed Jul 2 08:11:33 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 02 Jul 2008 08:11:33 +0200
Subject: [lxml-dev] lxml website
In-Reply-To: <1214929144.10505.36.camel@localhost.localdomain>
References: <1214929144.10505.36.camel@localhost.localdomain>
Message-ID: <486B1C15.3020502@behnel.de>

Hi,

Eric Jahn wrote:
> Just wondering if we could remove the Table of Contents sidebar shown
> thought the http://codespeak.net/lxml/ site. On my 12" laptop running
> Iceweasel/Firefox or Epiphany, it occupies most of the screen space and
> even flows into the main text area.

I actually find it really helpful and I don't think many people have the
problem you are experiencing. You didn't state what screen resolution or
font size you are using, but it looks nice for me even in 1024x768. Maybe
you can just try a different font?

The web site is part of the source download, BTW, so you can also fix up
doc/html/style.css yourself.

Stefan

From jholg at gmx.de Wed Jul 2 09:27:16 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 02 Jul 2008 09:27:16 +0200
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
In-Reply-To: <486A65A0.1080205@behnel.de>
References: <20080701121932.85070@gmx.net>
 <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <20080701154552.85080@gmx.net> <486A65A0.1080205@behnel.de>
Message-ID: <20080702072716.85050@gmx.net>

> >>> Also, I suppose enableRecursiveStr() could be removed?
> >> I never really liked it, but why would you want to remove it?
> >
> > some other of the old CamelCase method/function names went away, so I
> > figured this can also go.
>
> Yep, please do so.

Done, revision 56229.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080702/6e4f8e85/attachment.htm

From jholg at gmx.de Wed Jul 2 09:30:36 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 02 Jul 2008 09:30:36 +0200
Subject: [lxml-dev] 2.1beta questions: objectify.XML, objectify.parse base_url arg, deprecate enableRecursiveStr, etree.tounicode()
In-Reply-To: <486A65A0.1080205@behnel.de>
References: <20080701121932.85070@gmx.net> <50336.213.61.181.86.1214919384.squirrel@groupware.dvs.informatik.tu-darmstadt.de> <20080701154552.85080@gmx.net> <486A65A0.1080205@behnel.de>
Message-ID: <20080702073426.85060@gmx.net>

Hi Stefan,

> It would also require a bit more parameter checking and exception raising.
> The unicode option and the encoding are mutually exclusive, and unicode is
> not so far from an encoding that it would really merit an option on its own.
>
> Note also that you do not pass "Unicode" as a string but the unicode type,
> and you get a unicode object back.
>

I took the liberty to modify the documentation a tiny little bit to reward this behaviour:

$ svn diff -rPREV doc/parsing.txt
Index: doc/parsing.txt
===================================================================
--- doc/parsing.txt     (revision 56229)
+++ doc/parsing.txt     (working copy)
@@ -675,8 +675,8 @@
   >>> etree.tostring(root, encoding='UTF-8', xml_declaration=False)
   b' \xef\xa3\x91 + \xef\xa3\x92 '
 
-As an extension, lxml.etree recognises the unicode type as encoding to
-build a Python unicode representation of a tree:
+As an extension, lxml.etree recognises the unicode type as an argument to the
+encoding parameter to build a Python unicode representation of a tree:
 
 .. sourcecode:: pycon

I think this subtly documents that unicode is not an encoding, strictly speaking.

Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080702/8aa6515c/attachment-0001.htm

From floris.bruynooghe at gmail.com Wed Jul 2 11:34:49 2008
From: floris.bruynooghe at gmail.com (Floris Bruynooghe)
Date: Wed, 2 Jul 2008 10:34:49 +0100
Subject: [lxml-dev] Using the Xpath id function
Message-ID:

Hi

I sent this to comp.lang.python a few days ago with no response, so I hope you don't mind if I try here (if there is a better place please let me know)...

Basically I'm trying to use the .xpath('id("foo")') method on an lxml tree but can't get it to work.

Given the following XML:

And its XMLSchema:

Or in more readable, compact RelaxNG, form:

element root {
   element child {
      attribute id { xsd:ID }
   }
}

Now I'm trying to parse the XML and use the .xpath() method to find the element using the id XPath function:

from lxml import etree
schema_root = etree.parse(file('schema.xsd'))
schema = etree.XMLSchema(schema_root)
parser = etree.XMLParser(schema=schema)
root = etree.fromstring('', parser)
root.xpath('id("foo")') --> []

I was expecting to get the element with that last statement (well, inside a list that is), but instead I just get an empty list. Is there anything obvious I'm doing wrong? As far as I can see the lxml documentation says this should work.
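
One point that may be relevant here: XPath 1.0 defines id() in terms of attributes declared as type ID in a DTD, so a schema-validating parser does not necessarily register any IDs. The following is only a minimal sketch under that assumption. The element names root and child come from the RelaxNG snippet above, while the document and its internal DTD subset are made up for illustration.

```python
from lxml import etree

# Hypothetical document: the internal DTD subset declares the "id"
# attribute of <child> as type ID, which is what XPath's id() keys on.
xml = b"""<!DOCTYPE root [
  <!ELEMENT root (child*)>
  <!ELEMENT child EMPTY>
  <!ATTLIST child id ID #REQUIRED>
]>
<root><child id="foo"/></root>"""

root = etree.fromstring(xml)

# Lookup through the ID table built from the DTD declaration.
by_id = root.xpath('id("foo")')

# Plain attribute comparison; this needs no DTD or schema at all.
by_attr = root.xpath('//*[@id="foo"]')

print([el.tag for el in by_id], [el.tag for el in by_attr])
```

The attribute comparison is the more robust of the two when the document carries no DTD, since it does not depend on how the attribute was declared.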
Cheers
Floris

For completeness, `dpkg -l python-lxml' tells me I'm using Debian lenny's lxml 2.0.6-1

--
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org

From jholg at gmx.de Wed Jul 2 14:08:35 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 02 Jul 2008 14:08:35 +0200
Subject: [lxml-dev] Using the Xpath id function
In-Reply-To:
References:
Message-ID: <20080702133828.70720@gmx.net>

Hi,

> from lxml import etree
> schema_root = etree.parse(file('schema.xsd'))
> schema = etree.XMLSchema(schema_root)
> parser = etree.XMLParser(schema=schema)
> root = etree.fromstring('', parser)
> root.xpath('id("foo")') --> []
>
> I was expecting to get the element with that last statement
> (well, inside a list that is), but instead I just get an empty list.
> Is there anything obvious I'm doing wrong? As far as I can see the
> lxml documentation says this should work.
>

You can always check for the id attribute without any schema involvement:

>>> from lxml import etree
>>> root = etree.fromstring('')
>>> root.xpath('//*[@id="foo"]')
['']
>>> for elt in root.xpath('//*[@id="foo"]'): print etree.tostring(elt)
...
>>>

Regarding the xpath id() function I think you'd need a DTD:

See http://www.w3.org/TR/xpath:

"""
Function: node-set id(object)

The id function selects elements by their unique ID (see [5.2.1 Unique IDs]).
"""

and:

"""
5.2.1 Unique IDs

An element node may have a unique identifier (ID). This is the value of the attribute that is declared in the DTD as type ID. No two elements in a document may have the same unique ID. If an XML processor reports two elements in a document as having the same unique ID (which is possible only if the document is invalid) then the second element in document order must be treated as not having a unique ID.

> NOTE: If a document does not have a DTD, then no element in the document
> will have a unique ID.
"""

Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080702/f0341afb/attachment.htm

From jwashin at vt.edu Wed Jul 2 16:17:12 2008
From: jwashin at vt.edu (Jim Washington)
Date: Wed, 02 Jul 2008 10:17:12 -0400
Subject: [lxml-dev] Using the Xpath id function
In-Reply-To: <20080702133828.70720@gmx.net>
References: <20080702133828.70720@gmx.net>
Message-ID: <486B8DE8.9010003@vt.edu>

jholg at gmx.de wrote:
>
>
> Hi,
>
>> from lxml import etree
>> schema_root = etree.parse(file('schema.xsd'))
>> schema = etree.XMLSchema(schema_root)
>> parser = etree.XMLParser(schema=schema)
>> root = etree.fromstring('', parser)
>> root.xpath('id("foo")') --> []
>>
>> I was expecting to get the element with that last statement
>> (well, inside a list that is), but instead I just get an empty list.
>> Is there anything obvious I'm doing wrong? As far as I can see the
>> lxml documentation says this should work.
>
[...]
> You can always check for the id attribute without any schema involvement:
[...]
> Regarding the xpath id() function I think you'd need a DTD:
[...]

xml:id, http://www.w3.org/TR/xml-id/ works OK, if you do not want to use a schema. At least, it works with lxml 2.1.beta3.

from lxml import etree
XML_NAMESPACE='http://www.w3.org/XML/1998/namespace'
XML_PREFIX= '{%s}' % XML_NAMESPACE
f = etree.Element('test')
f.set(XML_PREFIX+'id','23455')
etree.tostring(f)
''
g = etree.SubElement(f,'test1')
g.set(XML_PREFIX+'id','23456')
f.xpath('id("23456")')
[]
f.xpath('id("23455")')
[]
>>>

- Jim Washington

From floris.bruynooghe at gmail.com Wed Jul 2 17:31:45 2008
From: floris.bruynooghe at gmail.com (Floris Bruynooghe)
Date: Wed, 2 Jul 2008 16:31:45 +0100
Subject: [lxml-dev] Using the Xpath id function
In-Reply-To: <486B8DE8.9010003@vt.edu>
References: <20080702133828.70720@gmx.net> <486B8DE8.9010003@vt.edu>
Message-ID:

Hello

2008/7/2 Jim Washington :
> jholg at gmx.de wrote:
>>> from lxml import etree
>>> schema_root = etree.parse(file('schema.xsd'))
>>> schema = etree.XMLSchema(schema_root)
>>> parser = etree.XMLParser(schema=schema)
>>> root = etree.fromstring('', parser)
>>> root.xpath('id("foo")') --> []
>>>
>>> I was expecting to get the element with that last statement
>>> (well, inside a list that is), but instead I just get an empty list.
>>> Is there anything obvious I'm doing wrong? As far as I can see the
>>> lxml documentation says this should work.
>>
> [...]
>
>
>> You can always check for the id attribute without any schema involvement:
>
> [...]
>
>> Regarding the xpath id() function I think you'd need a DTD:

Indeed, converting the schema to a DTD solved my problem! Too simple to think of myself of course. Somehow the documentation must have confused me into believing validation with XMLSchema or DTD would work (I did understand RelaxNG wouldn't work). Converting the schema is simple using trang anyway (which I need to do in any case as I'm writing it in compact RelaxNG).

> xml:id, http://www.w3.org/TR/xml-id/ works OK, if you do not want to
> use a schema. At least, it works with lxml 2.1.beta3.
>
> from lxml import etree
> XML_NAMESPACE='http://www.w3.org/XML/1998/namespace'
> XML_PREFIX= '{%s}' % XML_NAMESPACE
> f = etree.Element('test')
> f.set(XML_PREFIX+'id','23455')
> etree.tostring(f)
> ''
> g = etree.SubElement(f,'test1')
> g.set(XML_PREFIX+'id','23456')
> f.xpath('id("23456")')
> []
> f.xpath('id("23455")')
> []
> >>>

That's quite nifty too! Don't think I'll use it this time but very good to know.

Thanks for your tips
Floris

--
Debian GNU/Linux -- The Power of Freedom
www.debian.org | www.gnu.org | www.kernel.org

From jwashin at vt.edu Wed Jul 2 19:24:07 2008
From: jwashin at vt.edu (Jim Washington)
Date: Wed, 02 Jul 2008 13:24:07 -0400
Subject: [lxml-dev] Using the Xpath id function
In-Reply-To: <486B8DE8.9010003@vt.edu>
References: <20080702133828.70720@gmx.net> <486B8DE8.9010003@vt.edu>
Message-ID: <486BB9B7.9090605@vt.edu>

Jim Washington wrote:
> xml:id, http://www.w3.org/TR/xml-id/ works OK, if you do not want to
> use a schema. At least, it works with lxml 2.1.beta3.
>
> from lxml import etree
> XML_NAMESPACE='http://www.w3.org/XML/1998/namespace'
> XML_PREFIX= '{%s}' % XML_NAMESPACE
> f = etree.Element('test')
> f.set(XML_PREFIX+'id','23455')
> etree.tostring(f)
> ''
> g = etree.SubElement(f,'test1')
> g.set(XML_PREFIX+'id','23456')
> f.xpath('id("23456")')
> []
> f.xpath('id("23455")')
> []
>

A quick blunder fix:

The above works, but is not quite in accordance with the xml:id specification. Other XML processors may balk at the above. The error in this case is that the value of xml:id attributes cannot start with digits. The concept works, but it is a bad example.

- Jim Washington

From jholg at gmx.de Thu Jul 3 07:56:34 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 03 Jul 2008 07:56:34 +0200
Subject: [lxml-dev] Using the Xpath id function
In-Reply-To:
References: <20080702133828.70720@gmx.net> <486B8DE8.9010003@vt.edu>
Message-ID: <20080703060256.70710@gmx.net>

Hi,

> Indeed, converting the schema to a DTD solved my problem!
> Too simple to think of myself of course. Somehow the documentation must have
> confused me into believing validation with XMLSchema or DTD would work
> (I did understand RelaxNG wouldn't work). Converting the schema is
> simple using trang anyway (which I need to do in any case as I'm
> writing it in compact RelaxNG).
>

Just to clarify: lxml *does* validate (if told so), both against a W3C XMLSchema and a RelaxNG schema. So from my point of view, using a schema rather than a DTD nowadays is usually the right choice, if only because the schema is itself well-formed XML - if possible.

How this relates to the XPath id() function and the concept of the ID attribute in general I don't really know, from what I've looked up in the spec this concept seems somehow tied to DTDs (?)

Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080703/664f31a5/attachment.htm

From stromnov at gmail.com Thu Jul 3 12:45:25 2008
From: stromnov at gmail.com (Andrew Stromnov)
Date: Thu, 3 Jul 2008 14:45:25 +0400
Subject: [lxml-dev] Parsing XML with undefined namespace
Message-ID: <9bfc77170807030345p7748355ei46d6b416deab78fe@mail.gmail.com>

Hi,

Is it possible to parse slightly broken XML like this?

etree.parse("""""")

>>> lxml.etree.XMLSyntaxError: Namespace prefix sanitizer for value on xml is not defined, line 1, column 31

--
Andrew Stromnov

From jholg at gmx.de Thu Jul 3 12:59:33 2008
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 03 Jul 2008 12:59:33 +0200
Subject: [lxml-dev] Parsing XML with undefined namespace
In-Reply-To: <9bfc77170807030345p7748355ei46d6b416deab78fe@mail.gmail.com>
References: <9bfc77170807030345p7748355ei46d6b416deab78fe@mail.gmail.com>
Message-ID: <20080703110050.70700@gmx.net>

Hi,

> Is it possible to parse slightly broken XML like this?
>
> etree.parse("""""")
>
> >>> lxml.etree.XMLSyntaxError: Namespace prefix sanitizer for value
> on xml is not defined, line 1, column 31
>

You can use a parser that is up to the task:

>>> parser = etree.XMLParser(recover=True)
>>> root = etree.fromstring("""""", parser=parser)
>>> print root
>>> print etree.tostring(root)
>>>

Please take a look at help(etree.XMLParser):

...
 - recover            - try hard to parse through broken XML

Cheers,
Holger

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080703/8b4ec1be/attachment-0001.htm

From jman at zultron.com Thu Jul 3 21:07:41 2008
From: jman at zultron.com (John Morris)
Date: Fri, 04 Jul 2008 03:07:41 +0800
Subject: [lxml-dev] lxml segfaults
Message-ID: <486D237D.1040507@zultron.com>

I'm trying to subclass lxml._Element. I'm completely new to python, so I'm definitely doing something wrong. The code below segfaults. Tried this with both lxml 1.3.6 and 2.0.7 (both from EPEL 5) along with libxslt 1.1.17 (CentOS 5 updates). Someone beat me over the head with a clue?

Thanks!

John


Sysconf.py:
--------------------
from lxml import etree

class Sysconf(etree._Element):
    def __init__(self,datastore):
        etree._Element.__init__(self)


test.py:
--------------------
#!/usr/bin/python

from lxml import etree
import Sysconf

sysconf = Sysconf.Sysconf("sysconf.xml")

print etree.tostring(sysconf)

From stefan_ml at behnel.de Thu Jul 3 21:17:05 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 03 Jul 2008 21:17:05 +0200
Subject: [lxml-dev] lxml segfaults
In-Reply-To: <486D237D.1040507@zultron.com>
References: <486D237D.1040507@zultron.com>
Message-ID: <486D25B1.10306@behnel.de>

Hi,

John Morris wrote:
> I'm trying to subclass lxml._Element.
[...]
> Someone beat me over the head with a clue?
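
The supported route is to derive from etree.ElementBase rather than the internal _Element class, and to let a parser-level class lookup create the instances instead of calling the constructor with a file name. A minimal sketch, assuming a made-up sysconf document; the `describe` method is a hypothetical helper added only to show that the custom class is in use:

```python
from lxml import etree


class Sysconf(etree.ElementBase):
    # No __init__ override: lxml creates these proxy objects itself,
    # so per-instance setup does not belong in a constructor here.
    def describe(self):
        # Hypothetical helper, for illustration only.
        return "%s with %d children" % (self.tag, len(self))


# Tell a parser to use Sysconf for every element it creates.
lookup = etree.ElementDefaultClassLookup(element=Sysconf)
parser = etree.XMLParser()
parser.set_element_class_lookup(lookup)

# Made-up input standing in for the contents of sysconf.xml.
root = etree.fromstring("<sysconf><host/><host/></sysconf>", parser)
print(root.describe())
```

With this setup every element produced by that parser is a Sysconf instance, including the children. A lookup keyed on tag name or namespace would be the next step if only some elements should get the custom class.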

http://codespeak.net/lxml/dev/element_classes.html

Hope it doesn't hurt.

Stefan

From jman at zultron.com Fri Jul 4 07:26:24 2008
From: jman at zultron.com (John Morris)
Date: Fri, 04 Jul 2008 13:26:24 +0800
Subject: [lxml-dev] lxml segfaults
In-Reply-To: <486D237D.1040507@zultron.com>
References: <486D237D.1040507@zultron.com>
Message-ID: <486DB480.8020208@zultron.com>

Update: recompiled python-lxml-2.0.7-1.el5.x86_64.rpm with --without-threading, same problem.

Thanks!

John

John Morris wrote:
> I'm trying to subclass lxml._Element. I'm completely new to python, so
> I'm definitely doing something wrong. The code below segfaults. Tried
> this with both lxml 1.3.6 and 2.0.7 (both from EPEL 5) along with
> libxslt 1.1.17 (CentOS 5 updates). Someone beat me over the head with a
> clue?
>
> Thanks!
>
> John
>
>
> Sysconf.py:
> --------------------
> from lxml import etree
>
> class Sysconf(etree._Element):
>     def __init__(self,datastore):
>         etree._Element.__init__(self)
>
>
> test.py:
> --------------------
> #!/usr/bin/python
>
> from lxml import etree
> import Sysconf
>
> sysconf = Sysconf.Sysconf("sysconf.xml")
>
> print etree.tostring(sysconf)
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev

From stefan_ml at behnel.de Fri Jul 4 22:39:01 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 04 Jul 2008 22:39:01 +0200
Subject: [lxml-dev] [Bug 245541] [NEW] Segfault caused by double free after GC in lxml 1.1.2
In-Reply-To: <20080704143729.6279.42061.malonedeb@potassium.ubuntu.com>
References: <20080704143729.6279.42061.malonedeb@potassium.ubuntu.com> <20080704143729.6279.42061.malonedeb@potassium.ubuntu.com>
Message-ID: <486E8A65.5090600@behnel.de>

Hi,

Mark Seaborn wrote:
> I know lxml 1.1.2 is quite an old version but I wanted to
> debug this segfault to find the root cause in case the root cause was
> not fixed in later versions.

It definitely was, likely over more than one release.

> Also, the newer releases do not build on
> Ubuntu edgy without upgrading lots of other components.

Apparently, edgy has libxml2 2.6.26, so it should build and work. What is the problem you are experiencing?

> I believe this bug is fixed in commit 44623 in version 1.3.1 (where the
> description in https://codespeak.net/lxml/changes-2.0.6.html is "Better way
> to prevent crashes in Element proxy cleanup code").

The respective code has been fixed and rewritten multiple times since 1.1 and IIRC the last corner cases were fixed as recent as somewhere in 2.0.x.

In any case, don't use 1.3.1, as it's completely broken. I don't quite remember, that might be related to the fix or not.

Stefan

From foolistbar at googlemail.com Sat Jul 5 10:48:50 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sat, 5 Jul 2008 09:48:50 +0100
Subject: [lxml-dev] Removing an attribute
Message-ID:

Hi,

I can't seem to find any way to totally remove an attribute, short of using Element.clear() (which has obvious issues when you just want to remove a single attribute). I need to remove at least @id: conformance requirements say this must be at least one character, thus just setting it to an empty string does not suffice.

--
Geoffrey Sneddon

From john at nmt.edu Sat Jul 5 16:38:25 2008
From: john at nmt.edu (John W. Shipman)
Date: Sat, 5 Jul 2008 08:38:25 -0600 (MDT)
Subject: [lxml-dev] Removing an attribute
In-Reply-To:
References:
Message-ID:

On Sat, 5 Jul 2008, Geoffrey Sneddon wrote:

> Hi,
>
> I can't seem to find any way to totally remove an attribute, short of
> using Element.clear() (which has obvious issues when you just want to
> remove a single attribute). I need to remove at least @id: conformance
> requirements say this must be at least one character, thus just
> setting it to an empty string does not suffice.

>>> from lxml import etree
>>> elt=etree.Element('gi', a1='one', a2='two')
>>> etree.tostring(elt)
''
>>> del elt.attrib['a2']
>>> etree.tostring(elt)
''
>>>

Best regards,
John Shipman (john at nmt.edu), Applications Specialist, NM Tech Computer Center,
Speare 119, Socorro, NM 87801, (505) 835-5950, http://www.nmt.edu/~john
``Let's go outside and commiserate with nature.'' --Dave Farber

From foolistbar at googlemail.com Sat Jul 5 16:51:19 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sat, 5 Jul 2008 15:51:19 +0100
Subject: [lxml-dev] Removing an attribute
In-Reply-To:
References:
Message-ID: <7178F7F6-A126-4413-ACA6-6B806D8D8573@googlemail.com>

On 5 Jul 2008, at 15:38, John W. Shipman wrote:

> On Sat, 5 Jul 2008, Geoffrey Sneddon wrote:
>
>> Hi,
>>
>> I can't seem to find any way to totally remove an attribute, short of
>> using Element.clear() (which has obvious issues when you just want to
>> remove a single attribute). I need to remove at least @id: conformance
>> requirements say this must be at least one character, thus just
>> setting it to an empty string does not suffice.
>
>
> >>> from lxml import etree
> >>> elt=etree.Element('gi', a1='one', a2='two')
> >>> etree.tostring(elt)
> ''
> >>> del elt.attrib['a2']
> >>> etree.tostring(elt)
> ''
> >>>
>

Ah, simply that. Much thanks.

--
Geoffrey Sneddon

From foolistbar at googlemail.com Sat Jul 5 20:13:43 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sat, 5 Jul 2008 19:13:43 +0100
Subject: [lxml-dev] Behaviour of Element.iter() changes when removing a node depending on whether the node has children or not
Message-ID:

Hi (and sorry for the stupidly long subject, but otherwise it doesn't actually cover the subject),

I'm not sure whether this is a bug or not (if not, there probably ought to be some note in the docs), but, for example:

>>> from lxml import etree
>>> foo = etree.fromstring("")
>>> for element in foo.iter(etree.Element):
...     print element.tag
...     if element.tag == "a":
...         element.getparent().remove(element)
...
root
a
b
>>> foo = etree.fromstring("")
>>> for element in foo.iter(etree.Element):
...     print element.tag
...     if element.tag == "a":
...         element.getparent().remove(element)
...
root
a
a

Only in the latter case does the final a element actually appear in the iteration, whereas in the former case it just vanishes. What changes the behaviour is the fact that in the former case the first a element has a child: an empty b element.

--
Geoffrey Sneddon

From stefan_ml at behnel.de Sun Jul 6 21:13:01 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 06 Jul 2008 21:13:01 +0200
Subject: [lxml-dev] Behaviour of Element.iter() changes when removing a node depending on whether the node has children or not
In-Reply-To:
References:
Message-ID: <4871193D.7010102@behnel.de>

Hi,

Geoffrey Sneddon wrote:
> Behaviour of Element.iter() changes when removing a node
> depending on whether the node has children or not

Admittedly, the docs are not very explicit here, but it's common that container modification during iteration results in undefined behaviour. This also applies to lxml's tree iterator.

Stefan

From azuriel at gmail.com Mon Jul 7 03:07:12 2008
From: azuriel at gmail.com (Andrew Wang)
Date: Sun, 6 Jul 2008 21:07:12 -0400
Subject: [lxml-dev] Does etree.iter() handle the refcount properly?
Message-ID: <1c7233eb0807061807l6d4e5201k813153b7fac1cb8c@mail.gmail.com>

Hi,

I just started using lxml and I ran into an odd problem. I'm using version 2.0.4-14 from the OpenSuSE repositories.

import lxml.etree as etree

# test.py
xml = '''\

'''

for i in range(10000):
    et = etree.fromstring(xml)
    for el in et.iter():
        pass

$ python test.py
Fatal Python error: deallocating None
Aborted

Is this a bug, or am I using etree.iter() incorrectly?

Thanks,
Andrew

From stefan_ml at behnel.de Mon Jul 7 10:05:11 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 07 Jul 2008 10:05:11 +0200
Subject: [lxml-dev] Does etree.iter() handle the refcount properly?
In-Reply-To: <1c7233eb0807061807l6d4e5201k813153b7fac1cb8c@mail.gmail.com>
References: <1c7233eb0807061807l6d4e5201k813153b7fac1cb8c@mail.gmail.com>
Message-ID: <4871CE37.80705@behnel.de>

Hi,

Andrew Wang wrote:
> I just started using lxml and I ran into an odd problem. I'm using
> version 2.0.4-14 from the OpenSuSE repositories.
>
> import lxml.etree as etree
>
> # test.py
> xml = '''\
>
> '''
>
> for i in range(10000):
>     et = etree.fromstring(xml)
>     for el in et.iter():
>         pass
>
> $ python test.py
> Fatal Python error: deallocating None
> Aborted
>
> Is this a bug, or am I using etree.iter() incorrectly?

Works for me in the latest version.

Stefan

From aryeh at bigfoot.com Mon Jul 7 19:56:57 2008
From: aryeh at bigfoot.com (Arye)
Date: Mon, 7 Jul 2008 19:56:57 +0200
Subject: [lxml-dev] validation with multiple XSD files
In-Reply-To: <4861228D.8040904@behnel.de>
References: <4861228D.8040904@behnel.de>
Message-ID:

Hello.

Thanks for your attention. What I was trying to do is load MANY XSD files with lxml. I understand now that the proper way to do this is to load just ONE file that includes the others and let lxml go and load the required includes.

For this, this document had the piece of information that I was missing, i.e. the different options to manage multiple schema files:
http://www.xfront.com/ZeroOneOrManyNamespaces.html

Sorry about the confusion and thanks again for your help.

Sincerely,
Arye.

On Tue, Jun 24, 2008 at 6:36 PM, Stefan Behnel wrote:

> Hi,
>
> Arye wrote:
> > Now I would like to extend this to a XSD file that
> > includes many other files. In other words I have a directory of XSD files
> > that I would like to use.
> > The include statement look like this (the included
> > file is referenced by its name):
> >
> >
> >
> > elementFormDefault="qualified">
> >
> >
> >
> > ...
> > ... some types defined in "base.xsd" are used here
>
> I'm not sure what you are trying to do here. Including or importing XSD files
> should not be a problem at all, so maybe you could elaborate on the actual
> problem you are facing? Maybe with some example code that shows what you are
> doing?
>
> Stefan
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080707/fff9967a/attachment-0001.htm

From azuriel at gmail.com Wed Jul 9 02:28:52 2008
From: azuriel at gmail.com (Andrew Wang)
Date: Tue, 8 Jul 2008 20:28:52 -0400
Subject: [lxml-dev] Does etree.iter() handle the refcount properly?
In-Reply-To: <4871CE37.80705@behnel.de>
References: <1c7233eb0807061807l6d4e5201k813153b7fac1cb8c@mail.gmail.com> <4871CE37.80705@behnel.de>
Message-ID: <1c7233eb0807081728g6e365ccbke32d439ea2eeb6e@mail.gmail.com>

On Mon, Jul 7, 2008 at 4:05 AM, Stefan Behnel wrote:
> Hi,
>
> Andrew Wang wrote:
>> I just started using lxml and I ran into an odd problem. I'm using
>> version 2.0.4-14 from the OpenSuSE repositories.
>>
>> import lxml.etree as etree
>>
>> # test.py
>> xml = '''\
>>
>> '''
>>
>> for i in range(10000):
>>     et = etree.fromstring(xml)
>>     for el in et.iter():
>>         pass
>>
>> $ python test.py
>> Fatal Python error: deallocating None
>> Aborted
>>
>> Is this a bug, or am I using etree.iter() incorrectly?
>
> Works for me in the latest version.
>
> Stefan
>

Thanks for the feedback. I've tried version 2.0.7 now and I get the same error. I'm starting to suspect it's something wrong with my OpenSUSE 11.0 install, because this same code worked on lxml version 2.0.5 on a Gentoo box...I'll take my problem to the OpenSUSE bugzilla unless someone else can reproduce it.

Andrew

From mmaccana at au1.ibm.com Wed Jul 9 08:43:41 2008
From: mmaccana at au1.ibm.com (Mike MacCana)
Date: Wed, 09 Jul 2008 16:43:41 +1000
Subject: [lxml-dev] Proper syntax to insert elements with namespaces?
Message-ID: <1215585821.7447.138.camel@mmaccana-laptop>

Hello all,

I'm reading some OpenDocument files with lxml, and inserting various elements. A snippet, which works fine, is below. Yay LXML!

## Parse our content as 'document'
document = etree.parse(contentfile)
print etree.tostring(document.getroot(), pretty_print=True),

## The body of the document, where we'll add our text elements
body = document.xpath('/office:document-content/office:body/office:text', namespaces={'office':'urn:oasis:names:tc:opendocument:xmlns:office:1.0','text':'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})

## Insert some text as element one
print "adding element:"
body[0].insert(1, etree.Element("text",name="Standard"))

## Identify the element by its XPath and change its contents.
newelement = document.xpath('/office:document-content/office:body/office:text/text', namespaces={'office':'urn:oasis:names:tc:opendocument:xmlns:office:1.0','text':'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})
newelement[0].text="new element text"


1. Is there a better way to refer to, and set the text of, the newly created element?

2. How do I insert an element with a namespace? The code above creates:
new element text

But I'd prefer:
old element

Cheers,

Mike
________________________________________________
Mike MacCana
Technical Specialist
Australia Linux and Virtualisation Services
IBM Global Services
Level 14, 60 City Rd
Southgate Vic 3000

Phone: +61-3-8656-2138
Fax: +61-3-8656-2423
Email: mmaccana at au1.ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080709/9b1c6a46/attachment.htm

From stefan_ml at behnel.de Wed Jul 9 09:17:49 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 09 Jul 2008 09:17:49 +0200
Subject: [lxml-dev] Proper syntax to insert elements with namespaces?
In-Reply-To: <1215585821.7447.138.camel@mmaccana-laptop>
References: <1215585821.7447.138.camel@mmaccana-laptop>
Message-ID: <4874661D.20205@behnel.de>

Hi,

Mike MacCana wrote:
> I'm reading some OpenDocument files with lxml, and inserting various
> elements. A snippet, which works fine, is below. Yay LXML!

:)

> ## Parse our content as 'document'
> document = etree.parse(contentfile)
> print etree.tostring(document.getroot(), pretty_print=True),
>
> ## The body of the document, where we'll add our text elements
> body = document.xpath('/office:document-content/office:body/office:text', namespaces={'office':'urn:oasis:names:tc:opendocument:xmlns:office:1.0','text':'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})
>
> ## Insert some text as element one
> print "adding element:"
> body[0].insert(1, etree.Element("text",name="Standard"))
>
> ## Identify the element by it's xpath and change its contents.
> newelement = document.xpath('/office:document-content/office:body/office:text/text', namespaces={'office':'urn:oasis:names:tc:opendocument:xmlns:office:1.0','text':'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})
> newelement[0].text="new element text"
>
>
> 1. Is there a better way to refer to, and set the text of, the newly
> created element?

You mean like

    text_element = etree.Element("text",name="Standard")
    text_element.text = "new text"
    body[0].insert(1, text_element)

although I'd prefer doing

    text_element = body[0].makeelement("text",name="Standard")
    ...

because it's faster and a bit more memory friendly (no need to create a new document for the element).

> 2. How do I insert an element with a namespace?
> The code above creates:
> new element text
>
> But I'd prefer:
> old element

http://codespeak.net/lxml/tutorial.html#namespaces

Stefan

From stefan_ml at behnel.de Wed Jul 9 15:27:21 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 09 Jul 2008 15:27:21 +0200
Subject: [lxml-dev] lxml 2.1 released
Message-ID: <4874BCB9.9020506@behnel.de>

Hi,

lxml 2.1 finally made it to PyPI!

This is a major new release that follows the 2.0 series with a couple of cleanups and tons of new features. The complete changelog follows below.

This is also the first version that officially supports Python 3, as released in 3.0beta1.

Have fun,
Stefan


2.1 (2008-07-09)
================

Features added
--------------

* Smart strings can be switched off in XPath (``smart_string`` keyword option).

* ``lxml.html.rewrite_links()`` strips links to work around documents with whitespace in URL attributes.

* Pickling ``ElementTree`` objects in lxml.objectify.

* Major overhaul of ``tools/xpathgrep.py`` script.

* Support for parsing from file-like objects that return unicode strings.

* New function ``etree.cleanup_namespaces(el)`` that removes unused namespace declarations from a (sub)tree (experimental).

* XSLT results support the buffer protocol in Python 3.

* Polymorphic functions in ``lxml.html`` that accept either a tree or a parsable string will return either a UTF-8 encoded byte string, a unicode string or a tree, based on the type of the input. Previously, the result was always a byte string or a tree.

* Support for Python 2.6 and 3.0 beta.

* File name handling now uses a heuristic to convert between byte strings (usually filenames) and unicode strings (usually URLs).

* Parsing from a plain file object frees the GIL under Python 2.x.

* Running ``iterparse()`` on a plain file (or filename) frees the GIL on reading under Python 2.x.

* Conversion functions ``html_to_xhtml()`` and ``xhtml_to_html()`` in lxml.html (experimental).

* Most features in lxml.html work for XHTML namespaced tag names (experimental).

* All parse functions in lxml.html take a ``parser`` keyword argument.

* lxml.html has a new parser class ``XHTMLParser`` and a module attribute ``xhtml_parser`` that provide XML parsers that are pre-configured for the lxml.html package.

* Error logging in Schematron (requires libxml2 2.6.32 or later).

* Parser option ``strip_cdata`` for normalising or keeping CDATA sections. Defaults to ``True`` as before, thus replacing CDATA sections by their text content.

* ``CDATA()`` factory to wrap string content as CDATA section.

* New event types 'comment' and 'pi' in ``iterparse()``.

* ``XSLTAccessControl`` instances have a property ``options`` that returns a dict of access configuration options.

* Constant instances ``DENY_ALL`` and ``DENY_WRITE`` on ``XSLTAccessControl`` class.

* Extension elements for XSLT (experimental!)

* ``Element.base`` property returns the xml:base or HTML base URL of an Element.

* ``docinfo.URL`` property is writable.

Bugs fixed
----------

* Custom resolvers were not used for XMLSchema includes/imports and XInclude processing.

* CSS selector parser dropped remaining expression after a function with parameters.

* Descending dot-separated classes in CSS selectors were not resolved correctly.

* ``ElementTree.parse()`` didn't handle target parser result.

* Potential threading problem in XInclude.

* Crash in Element class lookup classes when the __init__() method of the super class is not called from Python subclasses.

* A number of problems related to unicode/byte string conversion of filenames and error messages were fixed.

* Building on MacOS-X now passes the "flat_namespace" option to the C compiler, which reportedly prevents build quirks and crashes on this platform.

* Windows build was broken.

* Rare crash when serialising to a file object with certain encodings.

* Incorrect evaluation of ``el.find("tag[child]")``.
* Moving a subtree from a document created in one thread into a document of another thread could crash when the rest of the source document is deleted while the subtree is still in use. * Passing an nsmap when creating an Element will no longer strip redundantly defined namespace URIs. This prevented the definition of more than one prefix for a namespace on the same Element. * Resolving to a filename in custom resolvers didn't work. * lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently. * Memory leak in Schematron with libxml2 >= 2.6.31. * lxml.etree accepted non well-formed namespace prefix names. * Hanging thread in conjunction with GTK threading. * Crash bug in iterparse when moving elements into other documents. * HTML elements' ``.cssselect()`` method was broken. * ``ElementTree.find*()`` didn't accept QName objects. * Default encoding for plain text serialisation was different from that of XML serialisation (UTF-8 instead of ASCII). Other changes ------------- * ``objectify.enableRecursiveStr()`` was removed, use ``objectify.enable_recursive_str()`` instead * Speed-up when running XSLTs on documents from other threads * Non-ASCII characters in attribute values are no longer escaped on serialisation. * Passing non-ASCII byte strings or invalid unicode strings as .tag, namespaces, etc. will result in a ValueError instead of an AssertionError (just like the tag well-formedness check). * Up to several times faster attribute access (i.e. tree traversal) in lxml.objectify. * lxml should now build without problems on MacOS-X. * If the default namespace is redundantly defined with a prefix on the same Element, the prefix will now be preferred for subelements and attributes. This allows users to work around a problem in libxml2 where attributes from the default namespace could serialise without a prefix even when they appear on an Element with a different namespace (i.e. they would end up in the wrong namespace). 
* Major cleanup in internal ``moveNodeToDocument()`` function, which takes care of namespace cleanup when moving elements between different namespace contexts. * New Elements created through the ``makeelement()`` method of an HTML parser or through lxml.html now end up in a new HTML document (doctype HTML 4.01 Transitional) instead of a generic XML document. This mostly impacts the serialisation and the availability of a DTD context. * Minor API speed-ups. * The benchmark suite now uses tail text in the trees, which makes the absolute numbers incomparable to previous results. * Generating the HTML documentation now requires Pygments_, which is used to enable syntax highlighting for the doctest examples. .. _Pygments: http://pygments.org/ Most long-time deprecated functions and methods were removed: - ``etree.clearErrorLog()``, use ``etree.clear_error_log()`` - ``etree.useGlobalPythonLog()``, use ``etree.use_global_python_log()`` - ``etree.ElementClassLookup.setFallback()``, use ``etree.ElementClassLookup.set_fallback()`` - ``etree.getDefaultParser()``, use ``etree.get_default_parser()`` - ``etree.setDefaultParser()``, use ``etree.set_default_parser()`` - ``etree.setElementClassLookup()``, use ``etree.set_element_class_lookup()`` Note that ``parser.setElementClassLookup()`` has not been removed yet, although ``parser.set_element_class_lookup()`` should be used instead. - ``xpath_evaluator.registerNamespace()``, use ``xpath_evaluator.register_namespace()`` - ``xpath_evaluator.registerNamespaces()``, use ``xpath_evaluator.register_namespaces()`` - ``objectify.setPytypeAttributeTag``, use ``objectify.set_pytype_attribute_tag`` - ``objectify.setDefaultParser()``, use ``objectify.set_default_parser()`` From Dominique.Holzwarth at ch.delarue.com Wed Jul 9 15:37:20 2008 From: Dominique.Holzwarth at ch.delarue.com (Dominique.Holzwarth at ch.delarue.com) Date: Wed, 9 Jul 2008 14:37:20 +0100 Subject: [lxml-dev] questions about xslt with xml, where to get help? 
Message-ID: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>

Hi everyone

I'm not completely sure whether this mailing list is only intended to further
develop lxml or is also for _users_ of lxml... but I haven't found any other
mailing list / forum :-(

I'm having problems getting an XSL transformation to run. When I do:

xslt_doc = etree.parse(stylesheet)

transform = etree.XSLT(xslt_doc)

result = transform(DOMdocument, **{statustext='some text', mode='navigation'})

the result is just empty, even though the documentation states that the
parameters should be passed as keyword parameters...

greetings
Dominique

Dominique Holzwarth | De La Rue International Limited
Software Engineer
dominique.holzwarth at ch.delarue.com | tel: +41 (0) 31 997 56 13 | fax: +41 (0) 997 56 80
Berne Branch, Morgenstrasse 131, 3018 Berne, Switzerland | www.delarue.com

*****************************************************************************
This e-mail and any files attached are strictly confidential, may be legally
privileged and are intended solely for the addressee. If you are not the
intended recipient please notify the sender immediately by return email and
then delete the e-mail and any attachments immediately. The views and or
opinions expressed in this e-mail are not necessarily the views of De La Rue
plc or any of its subsidiaries and the De La Rue Group of companies, their
directors, officers and employees make no representation about and accept no
liability for its accuracy or completeness. You should ensure that you have
adequate virus protection as the De La Rue Group of companies do not accept
liability for any viruses.
De La Rue plc Registered No.3834125, De La Rue Holdings plc Registered No
58025 and De La Rue International Limited Registered No 720284 are all
registered in England with their registered office at: De La Rue House, Jays
Close, Viables, Hampshire RG22 4BS
*****************************************************************************

From stefan_ml at behnel.de Wed Jul 9 16:19:27 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 09 Jul 2008 16:19:27 +0200
Subject: [lxml-dev] questions about xslt with xml, where to get help?
In-Reply-To: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
References: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
Message-ID: <4874C8EF.4010703@behnel.de>

Hi,

Dominique.Holzwarth at ch.delarue.com wrote:
> I'm not completely sure whether this mailing list is only intended to further
> develop lxml or is also for _users_ of lxml...

both.

> I'm having problems getting an XSL transformation to run. When I do:
>
> xslt_doc = etree.parse(stylesheet)
>
> transform = etree.XSLT(xslt_doc)
>
> result = transform(DOMdocument, **{statustext='some text', mode='navigation'})
>
> the result is just empty

this is the right way of doing it. I would expect a problem with your
stylesheet.

Stefan

From marius at pov.lt Wed Jul 9 20:56:05 2008
From: marius at pov.lt (Marius Gedminas)
Date: Wed, 9 Jul 2008 21:56:05 +0300
Subject: [lxml-dev] questions about xslt with xml, where to get help?
In-Reply-To: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
References: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
Message-ID: <20080709185605.GA26399@fridge.pov.lt>

On Wed, Jul 09, 2008 at 02:37:20PM +0100, Dominique.Holzwarth at ch.delarue.com wrote:
> result = transform(DOMdocument, **{statustext='some text', mode='navigation'})

FWIW this code is equivalent to the simpler

    result = transform(DOMdocument, statustext='some text', mode='navigation')

Marius Gedminas
--
"I may not understand what I'm installing, but that's not my job. I just need
to click Next, Next, Finish here so I can walk to the next system and repeat
the process"
                -- Anonymous NT Admin

From limi at plone.org Thu Jul 10 05:59:38 2008
From: limi at plone.org (Alexander Limi)
Date: Wed, 09 Jul 2008 20:59:38 -0700
Subject: [lxml-dev] lxml 2.1 released
References: <4874BCB9.9020506@behnel.de>
Message-ID:

The download link on the web site reports a 404:

http://codespeak.net/lxml/lxml-2.1.tgz

This is also present in the Deliverance buildout, which is how I got the
error. Going to the web site didn't help. :)

Alexander Limi

On Wed, 09 Jul 2008 06:27:21 -0700, Stefan Behnel wrote:
> Hi,
>
> lxml 2.1 finally made it to PyPI!
>
> This is a major new release that follows the 2.0 series with a couple of
> cleanups and tons of new features. The complete changelog follows below.
>
> This is also the first version that officially supports Python 3, as
> released in 3.0beta1.
>
> Have fun,
> Stefan

--
Alexander Limi http://limi.net

From stefan_ml at behnel.de Thu Jul 10 07:33:35 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 10 Jul 2008 07:33:35 +0200
Subject: [lxml-dev] lxml 2.1 released
In-Reply-To:
References: <4874BCB9.9020506@behnel.de>
Message-ID: <48759F2F.6050800@behnel.de>

Hi,

Alexander Limi wrote:
> The download link on the web site reports a 404:
>
> http://codespeak.net/lxml/lxml-2.1.tgz

Ah, sorry, wrong upload name (tar.gz vs. .tgz). I should really put the upload
command into my release script one day...
Stefan

From stefan_ml at behnel.de Thu Jul 10 08:48:01 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 10 Jul 2008 08:48:01 +0200
Subject: [lxml-dev] questions about xslt with xml, where to get help?
In-Reply-To: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
References: <5213E58D85BC414998FA553C701E386C0EF2A5FDD6@SGBD012511.dlrmail.ad.delarue.com>
Message-ID: <4875B0A1.3060500@behnel.de>

Hi,

Dominique.Holzwarth at ch.delarue.com wrote:
> result = transform(DOMdocument, **{statustext='some text', mode='navigation'})

this fails because there is no node called "some text" in your data. The
parameter you pass is an XPath value, not a plain Python value, so if you want
to pass a string instead of a node name, use "'some text'".

Any patch to the docs will be appreciated. :)

Stefan

From limi at plone.org Thu Jul 10 09:28:31 2008
From: limi at plone.org (Alexander Limi)
Date: Thu, 10 Jul 2008 00:28:31 -0700
Subject: [lxml-dev] lxml 2.1 released
References: <4874BCB9.9020506@behnel.de> <48759F2F.6050800@behnel.de>
Message-ID:

On Wed, 09 Jul 2008 22:33:35 -0700, Stefan Behnel wrote:
> Alexander Limi wrote:
>> The download link on the web site reports a 404:
>>
>> http://codespeak.net/lxml/lxml-2.1.tgz
>
> Ah, sorry, wrong upload name (tar.gz vs. .tgz). I should really put the
> upload command into my release script one day...

Thanks for the fix! I can't believe nobody noticed until now, though. :)

--
Alexander Limi http://limi.net

From Dominique.Holzwarth at ch.delarue.com Thu Jul 10 10:32:31 2008
From: Dominique.Holzwarth at ch.delarue.com (Dominique.Holzwarth at ch.delarue.com)
Date: Thu, 10 Jul 2008 09:32:31 +0100
Subject: [lxml-dev] questions about xslt with xml, where to get help?
In-Reply-To: <20080709185605.GA26399@fridge.pov.lt>
Message-ID: <5213E58D85BC414998FA553C701E386C0EF2A6008D@SGBD012511.dlrmail.ad.delarue.com>

Well my posted code is for easy debugging / getting help only.
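
For a parameter dict that is built at runtime, the quoting rule Stefan describes above could be applied in one place before the dict is expanded into keyword arguments. What follows is only a minimal sketch under that assumption: the helper names `xpath_string_literal` and `quote_params` are hypothetical and not part of lxml. They do nothing more than turn plain Python strings into XPath string literals. Later lxml versions also ship `etree.XSLT.strparam()` for the same purpose.

```python
def xpath_string_literal(text):
    """Return `text` written as an XPath 1.0 string literal.

    Hypothetical helper, not an lxml function.
    """
    if "'" not in text:
        return "'%s'" % text
    if '"' not in text:
        return '"%s"' % text
    # XPath 1.0 has no escape syntax inside literals, so a string that
    # contains both quote characters is assembled with concat().
    pieces = ["'%s'" % part for part in text.split("'")]
    return "concat(%s)" % ", \"'\", ".join(pieces)


def quote_params(params):
    """Quote every value of a {name: plain string} dict (hypothetical helper)."""
    return dict((name, xpath_string_literal(value))
                for name, value in params.items())


# A dynamically generated parameter dict, written with valid dict syntax.
params = {"statustext": "some text", "mode": "navigation"}
quoted = quote_params(params)
```

With an `etree.XSLT` object like the `transform` from the earlier posts, the call would then be `transform(DOMdocument, **quoted)`, so that the stylesheet receives the strings themselves instead of trying to evaluate them as XPath expressions.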
Actually, the dict that'll be passed is generated dynamically somewhere else,
with varying keys etc. ;-) So I'll pass the dict by its name and not pass the
individual keys/values =)

> -----Original Message-----
> From: lxml-dev-bounces at codespeak.net
> [mailto:lxml-dev-bounces at codespeak.net] On Behalf Of Marius Gedminas
> Sent: Wednesday, 9 July 2008 20:56
> To: lxml-dev at codespeak.net
> Subject: Re: [lxml-dev] questions about xslt with xml, where
> to get help?
>
> On Wed, Jul 09, 2008 at 02:37:20PM +0100,
> Dominique.Holzwarth at ch.delarue.com wrote:
> > result = transform(DOMdocument, **{statustext='some text',
> > mode='navigation'})
>
> FWIW this code is equivalent to the simpler
>
> result = transform(DOMdocument, statustext='some text',
> mode='navigation')
>
> Marius Gedminas
> --
> "I may not understand what I'm installing, but that's not my
> job. I just need to click Next, Next, Finish here so I can
> walk to the next system and repeat the process"
> -- Anonymous NT Admin
>

From stefan_ml at behnel.de Thu Jul 10 10:53:16 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 10 Jul 2008 10:53:16 +0200
Subject: [lxml-dev] questions about xslt with xml, where to get help?
In-Reply-To: <5213E58D85BC414998FA553C701E386C0EF2A60088@SGBD012511.dlrmail.ad.delarue.com>
References: <5213E58D85BC414998FA553C701E386C0EF2A60088@SGBD012511.dlrmail.ad.delarue.com>
Message-ID: <4875CDFC.4040203@behnel.de>

Hi,

please keep the list on CC.

Dominique.Holzwarth at ch.delarue.com wrote:
> Ok, that seems to work now :-) Thanks a lot! I indeed was wondering what
> this weird syntax a="'A'" means in the doc. Is it correct, that if I want
> to pass a complete DOM-node as xsl-param I can just write something like
> myParam="node-tag"?

That doesn't pass the node, it just passes the reference (i.e. the tag name),
which will be resolved from the XML source document that you are transforming,
in the current transformation context (i.e. from within the node you are
operating on).

I think your best bet to pass tree content is through XSLT extension elements
or XPath extension functions.

Stefan

From armin.ronacher at active-4.com Sat Jul 12 18:47:12 2008
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 12 Jul 2008 16:47:12 +0000 (UTC)
Subject: [lxml-dev] Proposal: Better html5lib Support
Message-ID:

Hi,

I'm lately working a lot with html5lib which has a tree builder that can
generate an lxml tree which is awesome :-)

There are however a few inconveniences in the html5lib lxml support.
Mostly because the html5lib API is quite complex to use and I've seen that
there is a beautiful soup parser support in html5lib, so why not move the
html5lib tree builder into an lxml.html.html5 module or so that provides the
same API as the html (that is `fragment_fromstring`, `document_fromstring`,
etc.)

html5lib is currently the most advanced HTML parsing module for Python I know
about and it is able to deal with most HTML the same way popular browsers do.

There is another small problem with html5lib and lxml interoperability that
is the HTML5 doctype ("<!DOCTYPE html>") that lxml naturally cannot handle.

I know that lxml is an XML library after all, but maybe support for this
special doctype could be added.

Regards,
Armin

From stefan_ml at behnel.de Sat Jul 12 21:54:26 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 12 Jul 2008 21:54:26 +0200
Subject: [lxml-dev] Proposal: Better html5lib Support
In-Reply-To:
References:
Message-ID: <48790BF2.3090905@behnel.de>

Hi,

Armin Ronacher wrote:
> I'm lately working a lot with html5lib which has a tree builder that can
> generate an lxml tree which is awesome :-)

:)

> There are however a few inconveniences in the html5lib lxml support. Mostly
> because the html5lib API is quite complex to use and I've
> seen that there is a beautiful soup parser support in html5lib, so why not
> move the html5lib tree builder into an lxml.html.html5 module or so that
> provides the same API as the html (that is `fragment_fromstring`,
> `document_fromstring`, etc.)

I do not use html5lib myself, but I'm happily taking patches if you can fix it
up in a more convenient way.

> There is another small problem with html5lib and lxml interoperability that
> is the HTML5 doctype ("<!DOCTYPE html>") that lxml naturally cannot handle.

Does the "cannot handle" result in any visible problems?

> I know that lxml is an XML library after all, but maybe support for this
> special doctype could be added.

This is something that is handled at the level of libxml2 and the system wide
catalogs. Check the catalogs on your system to see if there is anything that
resembles that doctype. Maybe it can be added.

Stefan

From armin.ronacher at active-4.com Sat Jul 12 23:14:50 2008
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sat, 12 Jul 2008 21:14:50 +0000 (UTC)
Subject: [lxml-dev] Proposal: Better html5lib Support
References: <48790BF2.3090905@behnel.de>
Message-ID:

Stefan Behnel <stefan_ml at behnel.de> writes:
> > There are however a few inconveniences in the html5lib lxml support. Mostly
> > because the html5lib API is quite complex to use and I've
> > seen that there is a beautiful soup parser support in html5lib, so why not
> > move the html5lib tree builder into an lxml.html.html5 module or so that
> > provides the same API as the html (that is `fragment_fromstring`,
> > `document_fromstring`, etc.)
>
> I do not use html5lib myself, but I'm happily taking patches if you can fix it
> up in a more convenient way.

I'll happily create a patch :-)

> > There is another small problem with html5lib and lxml interoperability that
> > is the HTML5 doctype ("<!DOCTYPE html>") that lxml naturally cannot handle.
>
> Does the "cannot handle" result in any visible problems?

This document::

    foo

    blah

Comes out as (lxml.etree.tostring)::

    ...

Not a big deal, as I'm not writing out data as a whole document, and if I did,
then as HTML4. I think the html5 doctype is not a valid XML doctype but HTML5
as serialization format is not really XML. For HTML5 serialization one would
have to use the html5lib serializer anyways and that could add a workaround
for lxml.

Regards,
Armin

From stefan_ml at behnel.de Sun Jul 13 06:57:05 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 13 Jul 2008 06:57:05 +0200
Subject: [lxml-dev] Proposal: Better html5lib Support
In-Reply-To:
References: <48790BF2.3090905@behnel.de>
Message-ID: <48798B21.7090408@behnel.de>

Hi,

Armin Ronacher wrote:
> Stefan Behnel <stefan_ml at behnel.de> writes:
>>> There is another small problem with html5lib and lxml interoperability that
>>> is the HTML5 doctype ("<!DOCTYPE html>") that lxml naturally cannot handle.
>> Does the "cannot handle" result in any visible problems?
> This document::
>
>     foo
>
>     blah
>
> Comes out as (lxml.etree.tostring)::
>
>     ...

We are actually serialising the DOCTYPE ourselves. Try this patch.

I'm not sure if <!DOCTYPE html> is actually allowed in SGML, didn't find
anything on that so far. If it isn't, I'll have to see if I can restrict the
impact of the patch to this specific case.

Note that you will need Cython 0.9.8 installed to build a patched lxml.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: html5-doctype.patch
Type: text/x-patch
Size: 1781 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080713/1f011c61/attachment.bin

From stefan_ml at behnel.de Sun Jul 13 17:12:32 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 13 Jul 2008 17:12:32 +0200
Subject: [lxml-dev] Proposal: Better html5lib Support
In-Reply-To: <48798B21.7090408@behnel.de>
References: <48790BF2.3090905@behnel.de> <48798B21.7090408@behnel.de>
Message-ID: <487A1B60.8010005@behnel.de>

Stefan Behnel wrote:
> Armin Ronacher wrote:
>> This document::
>>
>>     foo
>>
>>     blah
>>
>> Comes out as (lxml.etree.tostring)::
>>
>>     ...
>
> I'm not sure if <!DOCTYPE html> is actually allowed in SGML, didn't find
> anything on that so far.

http://xml.coverpages.org//sgmlsyn/sgmlsyn.htm#P110

Looks like that's the right thing to do, so I committed a fixed version of the
patch to the trunk.

Stefan

From armin.ronacher at active-4.com Sun Jul 13 22:32:30 2008
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sun, 13 Jul 2008 20:32:30 +0000 (UTC)
Subject: [lxml-dev] Proposal: Better html5lib Support
References: <48790BF2.3090905@behnel.de>
Message-ID:

Hi,

Stefan Behnel <stefan_ml at behnel.de> writes:
> I do not use html5lib myself, but I'm happily taking patches if you can fix it
> up in a more convenient way.

I created a patch now: http://paste.pocoo.org/show/79376/

That however has two disadvantages. For one it extends the lxml etree builder
in a pretty ugly way but that could probably be improved, and it also creates
etree.Comment objects and not etree.html.HtmlComments. The same problem exists
with the soupparser, mainly because there is no way to generate HtmlComment
objects without creating a segfault. (The only way is to use html.fromstring
with the comment there, but that's an ugly hack).

Regards,
Armin

From armin.ronacher at active-4.com Sun Jul 13 22:33:47 2008
From: armin.ronacher at active-4.com (Armin Ronacher)
Date: Sun, 13 Jul 2008 20:33:47 +0000 (UTC)
Subject: [lxml-dev] Proposal: Better html5lib Support
References: <48790BF2.3090905@behnel.de> <48798B21.7090408@behnel.de> <487A1B60.8010005@behnel.de>
Message-ID:

Hi,

Stefan Behnel <stefan_ml at behnel.de> writes:
> http://xml.coverpages.org//sgmlsyn/sgmlsyn.htm#P110
>
> Looks like that's the right thing to do, so I committed a fixed version of the
> patch to the trunk.

Thanks a lot for the quick fix!

Regards,
Armin

From Dominique.Holzwarth at ch.delarue.com Mon Jul 14 09:40:42 2008
From: Dominique.Holzwarth at ch.delarue.com (Dominique.Holzwarth at ch.delarue.com)
Date: Mon, 14 Jul 2008 08:40:42 +0100
Subject: [lxml-dev] problem with e-factory input text field
Message-ID: <5213E58D85BC414998FA553C701E386C0EF43A2483@SGBD012511.dlrmail.ad.delarue.com>

Hi everyone

I'm trying to generate an HTML page with the E-Factory (from lxml.html import
builder as E) but I'm having some problems with checkboxes...

The output I'd like to get is the following:

label text

I'm creating the input field with:

E.INPUT("label text", name="something", type="checkbox", value="1")

but when I do a lxml.html.tostring(...) of my whole page object, the
"label text" will be missing and the tag will look like this:

<-- Note: no closing tag and no text-node

When I do a lxml.etree.tostring of just that tag the result is perfect, but
that method will cause trouble with tags like