From stefan_ml at behnel.de Tue Sep 2 11:35:19 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 2 Sep 2008 11:35:19 +0200 (CEST)
Subject: [lxml-dev] [Fwd: [Bug 263898] [NEW] Windows Installer crashes due to access violation]
Message-ID: <64883.213.61.181.86.1220348119.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

has anyone seen this before, or could someone please test whether this is reproducible on other machines?

Thanks,
Stefan

--------------------------------------------------------------------------
Subject: [Bug 263898] [NEW] Windows Installer crashes due to access violation
--------------------------------------------------------------------------

Public bug reported:

I tried to install lxml-2.1.1.win32-py2.5.exe on my WinXP PC and got the attached crash from the installer. It happened after confirming all settings and clicking the "Next" button on the "Ready to install" page of the installer.

Using the Visual Studio debugger to investigate the crash:

Call stack:

  ntdll.dll!_RtlEnterCriticalSection@4() + 0xb
  msvcr71.dll!_lock_file(void * pf=0x00000000) Line 236 C
> msvcr71.dll!fprintf(_iobuf * str=0x00000000, const char * format=0x0012d278, ...) Line 63 + 0x6 C
  lxml-2.1.1.win32-py2.5.exe!00402ca8()
  user32.dll!77d48734()
  user32.dll!77d48bd9()
  user32.dll!77d541dc()
  user32.dll!77d541a9()
  user32.dll!77d53fd9()
  ntdll.dll!_RtlpFreeToHeapLookaside@8() + 0x26
  ntdll.dll!_RtlFreeHeap@12() + 0x114
  ntdll.dll!_RtlpFreeAtom@4() + 0x1b
  c5ffffff()

Looking at the fprintf call, I can see that the "str" argument is NULL. The _lock_file function then uses this NULL pointer to access a structure member, which eventually results in the reported crash.
** Affects: lxml
     Importance: Undecided
         Status: New

--
Windows Installer crashes due to access violation
https://bugs.launchpad.net/bugs/263898

From stefan_ml at behnel.de Fri Sep 5 14:30:45 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 05 Sep 2008 14:30:45 +0200
Subject: [lxml-dev] lxml 2.1.2 and 2.0.9 released
Message-ID: <48C12675.5080800@behnel.de>

Hi,

lxml 2.1.2 and 2.0.9 are on PyPI. Both are bug-fix releases for the stable and mature release series. They mainly fix a thread-related memory problem that was introduced in the last releases of both branches. Updating is recommended.

The complete changelog follows below.

Have fun,
Stefan


2.0.9 (2008-09-05)

Bugs fixed

* Memory problem when passing documents between threads.
* Target parser did not honour the recover option and raised an exception instead of calling .close() on the target.


2.1.2 (2008-09-05)

Features added

* lxml.etree now tries to find the absolute path name of files when parsing from a file-like object. This helps custom resolvers when resolving relative URLs, as libxml2 can prepend them with the path of the source document.

Bugs fixed

* Memory problem when passing documents between threads.
* Target parser did not honour the recover option and raised an exception instead of calling .close() on the target.

From klizhentas at gmail.com Sat Sep 6 14:18:42 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Sat, 6 Sep 2008 16:18:42 +0400
Subject: [lxml-dev] Preventing XPath injection
Message-ID: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>

Hi All,

I'm facing the following issue:

xslt transformations accept xpath expressions as parameters, and if you write something like:

transform(a,param = " ' ' ' ") - xpath evaluation will fail.

Is there any common/standard way to prevent that?

Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080906/24b89275/attachment.htm

From foolistbar at googlemail.com Sat Sep 6 15:24:59 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sat, 6 Sep 2008 14:24:59 +0100
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
Message-ID:

On 6 Sep 2008, at 13:18, Alex Klizhentas wrote:

> Hi All, I'm facing the following issue:
>
> xslt transformations accept xpath expressions as parameters, and if you
> write something like:
>
> transform(a,param = " ' ' ' ") - xpath evaluation will fail. Is there any
> common/standard way to prevent that?

No, what I've been using is:

def escapeXPathString(string):
    return u"concat('', '%s')" % string.replace(u"'", u"', \"'\", '")

The first parameter to the concat function is needed because it must always have at least two parameters.

--
Geoffrey Sneddon

From klizhentas at gmail.com Sat Sep 6 19:52:30 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Sat, 6 Sep 2008 21:52:30 +0400
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To:
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
Message-ID: <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>

That's strange, I thought it should be quoted like: '

2008/9/6 Geoffrey Sneddon

>
> On 6 Sep 2008, at 13:18, Alex Klizhentas wrote:
>
>> Hi All, I'm facing the following issue:
>>
>> xslt transformations accept xpath expressions as parameters, and if you
>> write something like:
>>
>> transform(a,param = " ' ' ' ") - xpath evaluation will fail. Is there any
>> common/standard way to prevent that?
>>
>
> No, what I've been using is:
>
> def escapeXPathString(string):
>     return u"concat('', '%s')" % string.replace(u"'", u"', \"'\", '")
>
> The first parameter to the concat function is needed because it must always
> have at least two parameters.
>
>
> --
> Geoffrey Sneddon
>
>
>

--
Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080906/e47974f8/attachment-0001.htm

From foolistbar at googlemail.com Sun Sep 7 17:53:33 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 7 Sep 2008 16:53:33 +0100
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>
Message-ID: <89168140-C25E-4E28-AAFB-5754D82344AC@googlemail.com>

On 6 Sep 2008, at 18:52, Alex Klizhentas wrote:

> That's strange, I thought it should be quoted like: '

Nope. A string is "[^"]*" or '[^']*' - it is exactly what is between the quotes.

--
Geoffrey Sneddon

From ianb at colorstudy.com Sun Sep 7 19:16:25 2008
From: ianb at colorstudy.com (Ian Bicking)
Date: Sun, 07 Sep 2008 12:16:25 -0500
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <89168140-C25E-4E28-AAFB-5754D82344AC@googlemail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>
 <89168140-C25E-4E28-AAFB-5754D82344AC@googlemail.com>
Message-ID: <48C40C69.2070302@colorstudy.com>

Geoffrey Sneddon wrote:
> On 6 Sep 2008, at 18:52, Alex Klizhentas wrote:
>
>> That's strange, I thought it should be quoted like: '
>
> Nope. A string is "[^"]*" or '[^']*' - it is exactly what is between
> the quotes.

When I was trying to figure out CSS to XPath translation, I tried to figure out how string quoting worked in XPath. Unfortunately I couldn't find any reference to string quoting in the specs (though of course I might have missed it). This seemed like a very peculiar omission.

--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org

From marius at pov.lt Sun Sep 7 20:05:58 2008
From: marius at pov.lt (Marius Gedminas)
Date: Sun, 7 Sep 2008 21:05:58 +0300
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <48C40C69.2070302@colorstudy.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>
 <89168140-C25E-4E28-AAFB-5754D82344AC@googlemail.com>
 <48C40C69.2070302@colorstudy.com>
Message-ID: <20080907180558.GA23636@fridge.pov.lt>

On Sun, Sep 07, 2008 at 12:16:25PM -0500, Ian Bicking wrote:
> Geoffrey Sneddon wrote:
> > On 6 Sep 2008, at 18:52, Alex Klizhentas wrote:
> >
> >> That's strange, I thought it should be quoted like: '
> >
> > Nope. A string is "[^"]*" or '[^']*' - it is exactly what is between
> > the quotes.
>
> When I was trying to figure out CSS to XPath translation, I tried to
> figure out how string quoting worked in XPath. Unfortunately I couldn't
> find any reference to string quoting in the specs (though of course I
> might have missed it). This seemed like a very peculiar omission.

XPath 2.0 spec rectifies that:

    The value of a string literal is an atomic value whose type is
    xs:string and whose value is the string denoted by the characters
    between the delimiting apostrophes or quotation marks. If the literal
    is delimited by apostrophes, two adjacent apostrophes within the
    literal are interpreted as a single apostrophe. Similarly, if the
    literal is delimited by quotation marks, two adjacent quotation marks
    within the literal are interpreted as one quotation mark.

    -- http://www.w3.org/TR/xpath20/#id-literals

XPath 1.0 is silent on the matter. I suppose you could always concatenate strings, e.g. concat("Look, it's a ", '"quoted string"!')...
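
For readers who want a reusable helper, here is a minimal sketch of the concat() idea discussed in this thread. The function name xpath_string_literal is made up for illustration and is not part of lxml; the sketch assumes XPath 1.0, where a string literal cannot contain its own delimiter.

```python
def xpath_string_literal(value):
    """Return an XPath 1.0 expression that evaluates to `value`.

    Hypothetical helper, not an lxml API.
    """
    if "'" not in value:
        # No apostrophes: a plain apostrophe-delimited literal is enough.
        return "'%s'" % value
    if '"' not in value:
        # Apostrophes but no double quotes: switch the delimiter.
        return '"%s"' % value
    # Both kinds of quote occur: split on apostrophes and glue the pieces
    # back together with concat(), passing each apostrophe as "'".
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'%s'" % part for part in parts) + ")"


print(xpath_string_literal("Look, it's a \"quoted string\"!"))
```

The result can be spliced into an expression, e.g. "//item[@name = %s]" % xpath_string_literal(user_input). Where XPath variables are available (lxml's xpath() method accepts keyword arguments that the expression references as $name), passing the value as a variable avoids the quoting problem altogether. Later lxml releases also offer etree.XSLT.strparam() for passing arbitrary strings as XSLT parameters.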

Marius Gedminas
--
Hoping the problem magically goes away by ignoring it is the "microsoft approach to programming" and should never be allowed.
        -- Linus Torvalds
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080907/92505927/attachment.pgp

From foolistbar at googlemail.com Sun Sep 7 20:24:30 2008
From: foolistbar at googlemail.com (Geoffrey Sneddon)
Date: Sun, 7 Sep 2008 19:24:30 +0100
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <20080907180558.GA23636@fridge.pov.lt>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809061052y4b79fe0fyf8333c7d0980b9b6@mail.gmail.com>
 <89168140-C25E-4E28-AAFB-5754D82344AC@googlemail.com>
 <48C40C69.2070302@colorstudy.com>
 <20080907180558.GA23636@fridge.pov.lt>
Message-ID:

On 7 Sep 2008, at 19:05, Marius Gedminas wrote:

> XPath 1.0 is silent on the matter. I suppose you could always
> concatenate strings, e.g. concat("Look, it's a ", '"quoted
> string"!')...

I just interpreted the XML EBNF as meaning there was no escaping, and removed the leading/trailing quote char for it to be logical. Which seems to be how things work.

--
Geoffrey Sneddon

From jg307 at cam.ac.uk Sun Sep 7 22:15:59 2008
From: jg307 at cam.ac.uk (James Graham)
Date: Sun, 07 Sep 2008 21:15:59 +0100
Subject: [lxml-dev] lxml.html adds a default doctype to HTML documents
Message-ID: <48C4367F.3030204@cam.ac.uk>

In [2]: from lxml import html

In [3]: t = html.fromstring("

Hello World")

In [4]: docinfo = t.getroottree().docinfo

In [5]: docinfo.public_id
Out[5]: '-//W3C//DTD HTML 4.0 Transitional//EN'

Is it possible to prevent this from occurring? I couldn't see anything in the API documentation but I might have been missing something obvious. Silently gaining incorrect data is annoying :)

--
"Eternity's a terrible thought. I mean, where's it all going to end?"
 -- Tom Stoppard, Rosencrantz and Guildenstern are Dead

From stefan_ml at behnel.de Mon Sep 8 13:48:02 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 8 Sep 2008 13:48:02 +0200 (CEST)
Subject: [lxml-dev] Resolving entities
In-Reply-To: <20080907222951.6933.89031.launchpad@canonical@palladium.canonical.com >
References: <20080907222951.6933.89031.launchpad@canonical@palladium.canonical.com>
Message-ID: <49178.213.61.181.86.1220874482.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Kovid Goyal wrote:
> My application needs to process XML files that do not have DTD
> declarations but that contain entities.

In this case your document is not well-formed, i.e. not XML.

http://www.w3.org/TR/REC-xml/#sec-references

> Can I inform XMLParser of the entities somehow?

No, there isn't currently a way to work around such a broken document. libxml2 follows the XML spec strictly in that it rejects references to undeclared entities in the absence of a DTD.

ElementTree lacks DTD support and instead allows you to specify entities through a parser local "entity" dictionary. lxml could potentially support a similar interface by intercepting the entity reference resolving at the SAX layer ("getEntity()" callback function), but that's not implemented. Please file a wishlist bug.
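
To illustrate the well-formedness point: an entity reference becomes legal once the document itself declares the entity in an internal DTD subset. The following is only a sketch under stated assumptions. The helper name add_entity_declarations is hypothetical, the input is assumed to carry neither an XML declaration nor a DOCTYPE of its own, each entity is assumed to stand for a single character, and the standard library parser is used purely so that the snippet runs without extra dependencies.

```python
import xml.etree.ElementTree as ET


def add_entity_declarations(xml_text, root_name, entities):
    """Prepend a DOCTYPE whose internal subset declares `entities`.

    Hypothetical helper: `entities` maps entity names to single characters.
    Assumes `xml_text` has no XML declaration and no DOCTYPE yet.
    """
    declarations = "".join(
        '<!ENTITY %s "&#%d;">' % (name, ord(char))
        for name, char in sorted(entities.items())
    )
    return "<!DOCTYPE %s [%s]>%s" % (root_name, declarations, xml_text)


# A document that uses entities without declaring them.
broken = "<doc>caf&eacute;&nbsp;au lait</doc>"

fixed = add_entity_declarations(
    broken, "doc", {"eacute": "\u00e9", "nbsp": "\u00a0"}
)
root = ET.fromstring(fixed)
print(repr(root.text))
```

This repairs the input before parsing rather than teaching the parser about the entities, so it is a different approach from the ElementTree "entity" dictionary mentioned above. The repaired string should also be acceptable to lxml.etree.fromstring(), since entities declared in the internal subset are ordinary XML, but that is not exercised here.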

Stefan

From klizhentas at gmail.com Mon Sep 8 16:44:27 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Mon, 8 Sep 2008 18:44:27 +0400
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
Message-ID: <6310a8f80809080744n2ce1c4f2ob4bd1fe1f7c63888@mail.gmail.com>

Hi All,

The context is a first parameter in the xpath/xslt extension functions and the tutorial states that it can be used to save function state. I wonder whether it is thread safe.

Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080908/94ba1355/attachment.htm

From stefan_ml at behnel.de Mon Sep 8 17:01:06 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 8 Sep 2008 17:01:06 +0200 (CEST)
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <6310a8f80809080744n2ce1c4f2ob4bd1fe1f7c63888@mail.gmail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809080744n2ce1c4f2ob4bd1fe1f7c63888@mail.gmail.com>
Message-ID: <39630.213.61.181.86.1220886066.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

please start a new thread for a new question.

Stefan


Alex Klizhentas wrote:
> The context is a first parameter in the xpath/xslt extension functions and
> the tutorial states that it can be used to save function state.
> I wonder whether it is thread safe.
>
> Regards,
> Alex

From klizhentas at gmail.com Mon Sep 8 17:23:10 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Mon, 8 Sep 2008 19:23:10 +0400
Subject: [lxml-dev] context parameter thread safety
Message-ID: <6310a8f80809080823w17ba4d68tdd8bd98488ccee4e@mail.gmail.com>

Hi All,

The context is a first parameter in the xpath/xslt extension functions and the tutorial states that it can be used to save function state. I wonder whether it is thread safe.

Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080908/f088d79e/attachment.htm

From klizhentas at gmail.com Mon Sep 8 17:23:45 2008
From: klizhentas at gmail.com (Alex Klizhentas)
Date: Mon, 8 Sep 2008 19:23:45 +0400
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <39630.213.61.181.86.1220886066.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809080744n2ce1c4f2ob4bd1fe1f7c63888@mail.gmail.com>
 <39630.213.61.181.86.1220886066.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <6310a8f80809080823i1855a463u867a618239e1cd0a@mail.gmail.com>

Sorry, forgot to change the subject field :)

2008/9/8 Stefan Behnel

> Hi,
>
> please start a new thread for a new question.
>
> Stefan
>
>
> Alex Klizhentas wrote:
> > The context is a first parameter in the xpath/xslt extension functions
> > and
> > the tutorial states that it can be used to save function state.
> > I wonder whether it is thread safe.
> >
> > Regards,
> > Alex
>
>

--
Regards,
Alex
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080908/286541a9/attachment.htm

From mwm-keyword-lxml.9112b8 at mired.org Mon Sep 8 17:32:46 2008
From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer)
Date: Mon, 8 Sep 2008 11:32:46 -0400
Subject: [lxml-dev] Preventing XPath injection
In-Reply-To: <6310a8f80809080823i1855a463u867a618239e1cd0a@mail.gmail.com>
References: <6310a8f80809060518v3f16002auc581605d9223bce8@mail.gmail.com>
 <6310a8f80809080744n2ce1c4f2ob4bd1fe1f7c63888@mail.gmail.com>
 <39630.213.61.181.86.1220886066.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
 <6310a8f80809080823i1855a463u867a618239e1cd0a@mail.gmail.com>
Message-ID: <20080908113246.280aba1d@bhuda.mired.org>

On Mon, 8 Sep 2008 19:23:45 +0400 "Alex Klizhentas" wrote:

> Sorry, forgot to change the subject field :)

And the "In-Reply-To" and "References" fields - at the very least.

2008/9/8 Stefan Behnel
>
> > Hi,
> >
> > please start a new thread for a new question.
> >
> > Stefan
> >
> >
> > Alex Klizhentas wrote:
> > > The context is a first parameter in the xpath/xslt extension functions
> > > and
> > > the tutorial states that it can be used to save function state.
> > > I wonder whether it is thread safe.
> > >
> > > Regards,
> > > Alex
> >
> >

--
Mike Meyer http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

O< ascii ribbon campaign - stop html mail - www.asciiribbon.org

From stefan_ml at behnel.de Mon Sep 8 21:38:15 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 08 Sep 2008 21:38:15 +0200
Subject: [lxml-dev] context parameter thread safety
In-Reply-To: <6310a8f80809080823w17ba4d68tdd8bd98488ccee4e@mail.gmail.com>
References: <6310a8f80809080823w17ba4d68tdd8bd98488ccee4e@mail.gmail.com>
Message-ID: <48C57F27.8050302@behnel.de>

Hi,

Alex Klizhentas wrote:
> The context is a first parameter in the xpath/xslt extension functions and
> the tutorial states that it can be used to save function state.
> I wonder whether it is thread safe.

It doesn't have to be. XPath evaluations use an instance-local lock per evaluator, so their context will not be used concurrently. XSLT objects use a new context for each evaluation, so there isn't any context concurrency there either.

Stefan

From Peter.Santoro at po.state.ct.us Wed Sep 10 17:48:13 2008
From: Peter.Santoro at po.state.ct.us (Peter Santoro )
Date: Wed, 10 Sep 2008 11:48:13 -0400
Subject: [lxml-dev] lxml 2.1.2 and 2.0.9 released
In-Reply-To:
References:
Message-ID: <010C664C16DC4BAAA3C6AC7ABFB6E22D@drshmain.drs.state.ct.us>

Does anyone know when the win32 binaries and eggs will be available for the 2.1.2 release?

From sidnei at enfoldsystems.com Wed Sep 10 18:36:52 2008
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Wed, 10 Sep 2008 14:36:52 -0200
Subject: [lxml-dev] lxml 2.1.2 and 2.0.9 released
In-Reply-To: <010C664C16DC4BAAA3C6AC7ABFB6E22D@drshmain.drs.state.ct.us>
References: <010C664C16DC4BAAA3C6AC7ABFB6E22D@drshmain.drs.state.ct.us>
Message-ID:

Soon. Maybe today. :)

On Wed, Sep 10, 2008 at 1:48 PM, Peter Santoro wrote:
> Does anyone know when the win32 binaries and eggs will be available for the
> 2.1.2 release?
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From castironpi at gmail.com Fri Sep 12 02:08:22 2008
From: castironpi at gmail.com (Aaron Brady)
Date: Thu, 11 Sep 2008 19:08:22 -0500
Subject: [lxml-dev] xmlns / xmlns:xmlns inconsistency
Message-ID: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>

Hello,

I don't know if anyone checks this or cares.

I tried to set a 'xmlns' attribute of a node. 'etree.tostring' produced an attribute name of 'xmlns:xmlns' instead.
I found a workaround: call 'root.set( 'xmlns', 'urn:schemas-microsoft-com:office:spreadsheet' )' after creating the node.

Full report, thanks for your time.

from lxml import etree

print "lxml.etree: ", etree.LXML_VERSION
print "libxml used: ", etree.LIBXML_VERSION
print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
print "libxslt used: ", etree.LIBXSLT_VERSION
print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION

'''
lxml.etree: (2, 1, 0, 0)
libxml used: (2, 6, 32)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 23)
libxslt compiled: (1, 1, 23)
'''

'''
Description:
Setting 'xmlns' attribute of tag produces 'xmlns:xmlns' instead.
'''

'''
Possibly related issue:
* With ``lxml.doctestcompare`` if you do ```` in your output, it will then be namespace-neutral (before the ellipsis was treated as a real namespace).
'''

'''
Target:

Current:

'''

nns= {
    'xmlns': 'urn:schemas-microsoft-com:office:spreadsheet',
    'o': 'urn:schemas-microsoft-com:office:office',
    'x': 'urn:schemas-microsoft-com:office:excel',
    'ss': 'urn:schemas-microsoft-com:office:spreadsheet',
    'html': 'http://www.w3.org/TR/REC-html40' }
root= etree.Element( 'Workbook', nsmap= nns )
out= etree.tostring( root, pretty_print= True, xml_declaration=True )
print( out )

'''
Workaround:
'''

nns= {
    'o': 'urn:schemas-microsoft-com:office:office',
    'x': 'urn:schemas-microsoft-com:office:excel',
    'ss': 'urn:schemas-microsoft-com:office:spreadsheet',
    'html': 'http://www.w3.org/TR/REC-html40' }
root= etree.Element( 'Workbook', nsmap= nns )
root.set( 'xmlns', 'urn:schemas-microsoft-com:office:spreadsheet' )
out= etree.tostring( root, pretty_print= True, xml_declaration=True )
print( out )

'''
Output:

---------^-------
'''

From stefan_ml at behnel.de Fri Sep 12 09:05:29 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 12 Sep 2008 09:05:29 +0200
Subject: [lxml-dev] xmlns / xmlns:xmlns inconsistency
In-Reply-To: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>
References: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>
Message-ID: <48CA14B9.6040807@behnel.de>

Hi,

Aaron Brady wrote:
> I tried to set a 'xmlns' attribute of a node. 'etree.tostring'
> produced an attribute name of 'xmlns:xmlns' instead.

Interesting. I think we should raise a descriptive error when someone tries to set an attribute with that name.

> Target:
>

References: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>
 <48CA14B9.6040807@behnel.de>
Message-ID: <20080912073659.15890@gmx.net>

Hi,

> Aaron Brady wrote:
> > I tried to set a 'xmlns' attribute of a node. 'etree.tostring'
> > produced an attribute name of 'xmlns:xmlns' instead.
>
> Interesting. I think we should raise a descriptive error when someone
> tries to set an attribute with that name.
>
> > Target:
> >
>
> Use
>
> root = etree.Element(
>     '{urn:schemas-microsoft-com:office:spreadsheet}Workbook' )
>
> and read the docs at
>
> http://codespeak.net/lxml/tutorial.html#namespaces
>

While I can't see the usecase for it, lxml doesn't allow to use two different ns-prefixes for the same namespace through the API, but it does when parsing:

>>> root = etree.fromstring('')
>>> print etree.tostring(root)

>>> root.nsmap
{'foo': '/foo/bar/namespace', None: '/foo/bar/namespace'}
>>> root2 = etree.Element("root", nsmap=root.nsmap)
>>> print etree.tostring(root2)

>>>

Holger
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080912/7f31c214/attachment.htm

From stefan_ml at behnel.de Fri Sep 12 10:42:49 2008
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 12 Sep 2008 10:42:49 +0200 (CEST)
Subject: [lxml-dev] xmlns / xmlns:xmlns inconsistency
In-Reply-To: <20080912073659.15890@gmx.net>
References: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>
 <48CA14B9.6040807@behnel.de>
 <20080912073659.15890@gmx.net>
Message-ID: <43528.213.61.181.86.1221208969.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

jholg at gmx.de wrote:
>> Aaron Brady wrote:
>> >
>
>> Use
>> root = etree.Element(
>>     '{urn:schemas-microsoft-com:office:spreadsheet}Workbook' )
>
> While I can't see the usecase for it, lxml doesn't allow to use two
> different ns-prefixes for the
> same namespace through the API, but it does when parsing:
>
> >>> root = etree.fromstring(' xmlns="/foo/bar/namespace"/>')
> >>> print etree.tostring(root)
>
> >>> root.nsmap
> {'foo': '/foo/bar/namespace', None: '/foo/bar/namespace'}
> >>> root2 = etree.Element("root", nsmap=root.nsmap)
> >>> print etree.tostring(root2)
>

Yes, now that you mention it...

lxml (starting with 2.1 IIRC, or maybe also in 2.0.x) prefers the prefixed namespace over the default namespace if both are defined in one nsmap and have the same URI. The code that handles this is in apihelpers.pxi, function _initNodeNamespaces().

The reason is that the prefixed namespace can also be used for attributes and within text values, while the default namespace only applies to elements. This is not a 100% solution, rather a "works in most cases" one. There are corner cases where the default namespace still wins, e.g. when a parsed document defines it before the equivalent prefixed namespace, so that libxml2 finds it first when it looks for a declaration.

I consider it best to avoid the default namespace when you're dealing with multiple (say, more than two) namespaces in one document, regardless of the tool you are using. You never need the default namespace, it's always pure convenience.

Stefan

From castironpi at gmail.com Fri Sep 12 14:53:39 2008
From: castironpi at gmail.com (Aaron Brady)
Date: Fri, 12 Sep 2008 07:53:39 -0500
Subject: [lxml-dev] xmlns / xmlns:xmlns inconsistency
In-Reply-To: <43528.213.61.181.86.1221208969.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <7862feb10809111708y74542f48y81387e23bf4b7d9d@mail.gmail.com>
 <48CA14B9.6040807@behnel.de>
 <20080912073659.15890@gmx.net>
 <43528.213.61.181.86.1221208969.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <7862feb10809120553p6406cecbq3560eaf8cc6bca38@mail.gmail.com>

On Fri, Sep 12, 2008 at 3:42 AM, Stefan Behnel wrote:
> jholg at gmx.de wrote:
>>> Aaron Brady wrote:
>>> >
>>
>>> Use
>>> root = etree.Element(
>>>     '{urn:schemas-microsoft-com:office:spreadsheet}Workbook' )
>>
>> While I can't see the usecase for it, lxml doesn't allow to use two
>> different ns-prefixes for the
>> same namespace through the API, but it does when parsing:
>>
>> >>> root = etree.fromstring('> xmlns="/foo/bar/namespace"/>')
>> >>> print etree.tostring(root)
>>
>> >>> root.nsmap
>> {'foo': '/foo/bar/namespace', None: '/foo/bar/namespace'}
>> >>> root2 = etree.Element("root", nsmap=root.nsmap)
>> >>> print etree.tostring(root2)
>>
>
> Yes, now that you mention it...
>
> lxml (starting with 2.1 IIRC, or maybe also in 2.0.x) prefers the prefixed
> namespace over the default namespace if both are defined in one nsmap and
> have the same URI. The code that handles this is in apihelpers.pxi,
> function _initNodeNamespaces().
>
> The reason is that the prefixed namespace can also be used for attributes
> and within text values, while the default namespace only applies to
> elements. This is not a 100% solution, rather a "works in most cases" one.
> There are corner cases where the default namespace still wins, e.g. when a
> parsed document defines it before the equivalent prefixed namespace, so
> that libxml2 finds it first when it looks for a declaration.
>
> I consider it best to avoid the default namespace when you're dealing with
> multiple (say, more than two) namespaces in one document, regardless of
> the tool you are using. You never need the default namespace, it's always
> pure convenience.
>
> Stefan

Whoops, sorry Stefan. Reply to all:

I was getting round-trip errors, plus I was targeting exact MS XML output. If the semantics are the same, then it's not a problem, or shouldn't be.

Here's the MS XML:

...

Which as you inferred defines 'xmlns' in the same node as other attributes that use it.

From steven.vereecken at gmail.com Mon Sep 15 23:43:50 2008
From: steven.vereecken at gmail.com (Steven Vereecken)
Date: Mon, 15 Sep 2008 23:43:50 +0200
Subject: [lxml-dev] resolvers
Message-ID:

Hi,

I've got a couple of strange things with custom resolvers:

1) I can't get the resolve_file method to work, not with actual file objects, and not with StringIO either. I always get the following message:

Exception exceptions.TypeError: 'function takes exactly 4 arguments (3 given)' in 'lxml.etree._local_resolver' ignored

As it says, this is ignored, but I get lxml.etree.XMLSyntaxError: None later on.

This is with lxml 2.1.1 (on Windows XP). I've also tried an older version (1.3.6) that was installed on my Ubuntu box, and that worked; however, after building 2.1.2 there, I get the same problem. I'm not sure whether this is an actual bug, or whether something has changed that is not clear from the docs.

2) resolve_string seems only to work for unicode strings and ASCII-only bytestrings. I would expect behaviour more consistent with etree.fromstring, so I'd assume this is a bug too?

If you can confirm these are bugs, I'll file them in the tracker.
greetings, Steven From xphuture at gmail.com Tue Sep 16 11:42:26 2008 From: xphuture at gmail.com (Fabien) Date: Tue, 16 Sep 2008 11:42:26 +0200 Subject: [lxml-dev] Best practice for lxml and html5lib Message-ID: <622afeaa0809160242y328334e4scaebb4c93e4bfa32@mail.gmail.com> Hello, I'm trying to make lxml and html5lib working together and I must say that I've some difficulties to find the best solution. lxml 2.2 seems to works well with lxml.html.html5parser but I'm a bit reluctant to put alpha software in production. And when using lxml 2.1.x and html5lib "lxml" tree builder, I get "ValueError: Invalid attribute name u'xml:lang'". Any help for the best combination of lxml and html5lib is welcomed. -- Fabien SCHWOB From stefan_ml at behnel.de Tue Sep 16 14:37:02 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 16 Sep 2008 14:37:02 +0200 (CEST) Subject: [lxml-dev] Best practice for lxml and html5lib In-Reply-To: <622afeaa0809160242y328334e4scaebb4c93e4bfa32@mail.gmail.com> References: <622afeaa0809160242y328334e4scaebb4c93e4bfa32@mail.gmail.com> Message-ID: <51859.213.61.181.86.1221568622.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Fabien wrote: > I'm trying to make lxml and html5lib working together and I must say > that I've some difficulties to find the best solution. lxml 2.2 seems > to works well with lxml.html.html5parser but I'm a bit reluctant to > put alpha software in production. And when using lxml 2.1.x and > html5lib "lxml" tree builder, I get "ValueError: Invalid attribute > name u'xml:lang'". Have you tried copying the html5parser.py file over into your lxml 2.1 install? Stefan From steven.vereecken at gmail.com Tue Sep 16 17:41:46 2008 From: steven.vereecken at gmail.com (Steven Vereecken) Date: Tue, 16 Sep 2008 17:41:46 +0200 Subject: [lxml-dev] resolvers In-Reply-To: References: Message-ID: I've added some simple examples to illustrate what I mean. Hope this helps... 
2008/9/15 Steven Vereecken : > Hi, > > I 've got a couple of strange things with custom resolvers: > > 1) I can't get the resolve_file method to work, not with actual file > objects, and not with StringIO either. I always get the following > message: > Exception exceptions.TypeError: 'function takes exactly 4 arguments (3 > given)' in 'lxml.etree._local_resolver' ignored > As it says, this is ignored, but I get lxml.etree.XMLSyntaxError: None later on. > This is with lxml 2.1.1 (on Windows XP) . I've also tried an older > version (1.3.6) that was installed on my Ubuntu box, and that worked, > however, after building 2.1.2 there, I get the same problem. > I'm not sure whether this is an actual bug, or whether something has > changed that is not clear from the docs. > Example: from cStringIO import StringIO from lxml import etree test_file = StringIO(''' h\xc3\xa9llo world''') class TestResolver(etree.Resolver): def resolve(self, url, id, context): return self.resolve_file(test_file, context) test_parser = etree.XMLParser() test_parser.resolvers.add(TestResolver()) # this works etree.parse(test_file)te # this doesn't etree.parse('dummy', parser=test_parser) > 2) resolve_string seems only to work for unicode strings and > ASCII-only bytestrings. I would expect behaviour more consistent with > etree.fromstring, so I'd assume this is a bug too? > > If you can confirm these are bugs, I'll file them in the tracker. 
>
> greetings,
>
> Steven
>

Example:

from lxml import etree

test_string = ''' h\xc3\xa9ello world'''

class TestResolver(etree.Resolver):
    def resolve(self, url, id, context):
        return self.resolve_string(test_string, context)

test_parser = etree.XMLParser()
test_parser.resolvers.add(TestResolver())

# this works
etree.fromstring(test_string)

# this doesn't
etree.parse('dummy', parser=test_parser)

From owenzhang.chicago at gmail.com Tue Sep 16 22:43:00 2008
From: owenzhang.chicago at gmail.com (Owen Zhang)
Date: Tue, 16 Sep 2008 16:43:00 -0400
Subject: [lxml-dev] Build error of lxml 2.1
Message-ID: <505558a80809161343m3524d55dxdbcb5652a47b6a95@mail.gmail.com>

I am trying to build *lxml* 2.1 package in SunOS 5.10. I got the following errors. Does anybody know why?

$ python setup.py build
Building *lxml* version 2.1.
NOTE: Trying to build without Cython, pre-generated 'src/*lxml*/
*lxml*.etree.c' needs to be available.
Using build configuration of libxslt 1.1.7
Building against libxml2/libxslt in the following directory: /usr/lib
running build
running build_py
running build_ext
building '*lxml*.etree' extension
gcc -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -
fPIC -I/opt/swt/install/libxml2-2.6.27/include/libxml2 -I/opt/swt/
install/Python-2.5/include/python2.5 -c src/*lxml*/*lxml*.etree.c -o build/
temp.solaris-2.10-sun4u-2.5/src/*lxml*/*lxml*.etree.o -w
In file included from /usr/include/sys/wait.h:24,
from /usr/include/stdlib.h:22,
from /opt/swt/install/Python-2.5/include/python2.5/
Python.h:41,
from src/*lxml*/*lxml*.etree.c:4:
/usr/include/sys/siginfo.h:259: error: syntax error before "ctid_t"
/usr/include/sys/siginfo.h:292: error: syntax error before '}' token
/usr/include/sys/siginfo.h:294: error: syntax error before '}' token
/usr/include/sys/siginfo.h:390: error: syntax error before "ctid_t"
/usr/include/sys/siginfo.h:398: error: conflicting types for '__fault'
/usr/include/sys/siginfo.h:267: error: previous declaration of '__fault' was here
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20080916/7fb460e8/attachment.htm From stefan_ml at behnel.de Wed Sep 17 14:13:10 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Sep 2008 14:13:10 +0200 Subject: [lxml-dev] resolvers In-Reply-To: References: Message-ID: <48D0F456.7000507@behnel.de> Hi, Steven Vereecken wrote: > I've added some simple examples to illustrate what I mean. Hope this helps... Thanks for the report. These definitely are bugs (also in 2.0 AFAICT). I'll look into them when I find the time. They shouldn't be hard to fix, so you can give it a try yourself. Stefan From stefan_ml at behnel.de Wed Sep 17 15:25:04 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Sep 2008 15:25:04 +0200 Subject: [lxml-dev] resolvers In-Reply-To: References: Message-ID: <48D10530.8010102@behnel.de> Hi, Steven Vereecken wrote: > 1) I can't get the resolve_file method to work, not with actual file > objects, and not with StringIO either. I always get the following > message: > Exception exceptions.TypeError: 'function takes exactly 4 arguments (3 > given)' in 'lxml.etree._local_resolver' ignored Here's a quick-fix for that problem. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: file-resolver-fix.patch Type: text/x-patch Size: 612 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080917/9447bcb1/attachment.bin From stefan_ml at behnel.de Wed Sep 17 19:21:33 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Sep 2008 19:21:33 +0200 Subject: [lxml-dev] Build error of lxml 2.1 In-Reply-To: <505558a80809161343m3524d55dxdbcb5652a47b6a95@mail.gmail.com> References: <505558a80809161343m3524d55dxdbcb5652a47b6a95@mail.gmail.com> Message-ID: <48D13C9D.8030900@behnel.de> Hi, I never used SunOS or tried to build lxml on it, but ... 
Owen Zhang wrote: > I am trying to build *lxml* 2.1 package in SunOS 5.10. I got the following > errors. Does anybody know why? > > $ python setup.py build > Building *lxml* version 2.1. > NOTE: Trying to build without Cython, pre-generated 'src/*lxml*/ > *lxml*.etree.c' needs to be available. > Using build configuration of libxslt 1.1.7 > Building against libxml2/libxslt in the following directory: /usr/lib > running build > running build_py > running build_ext > building '*lxml*.etree' extension > gcc -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes - > fPIC -I/opt/swt/install/libxml2-2.6.27/include/libxml2 -I/opt/swt/ > install/Python-2.5/include/python2.5 -c src/*lxml*/*lxml*.etree.c -o build/ > temp.solaris-2.10-sun4u-2.5/src/*lxml*/*lxml*.etree.o -w > In file included from /usr/include/sys/wait.h:24, > from /usr/include/stdlib.h:22, > from /opt/swt/install/Python-2.5/include/python2.5/ > Python.h:41, > from src/*lxml*/*lxml*.etree.c:4: > /usr/include/sys/siginfo.h:259: error: syntax error before "ctid_t" > /usr/include/sys/siginfo.h:292: error: syntax error before '}' token > /usr/include/sys/siginfo.h:294: error: syntax error before '}' token > /usr/include/sys/siginfo.h:390: error: syntax error before "ctid_t" > /usr/include/sys/siginfo.h:398: error: conflicting types for '__fault' > /usr/include/sys/siginfo.h:267: error: previous declaration of > '__fault' was here This looks like there is something wrong on your system. Have you looked into the /usr/include/sys/siginfo.h file to see what the compiler is complaining about in the above lines? Where is the ctid_t type defined on your system? Stefan From jkrukoff at ltgc.com Wed Sep 17 21:08:43 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Wed, 17 Sep 2008 13:08:43 -0600 Subject: [lxml-dev] lxml parser encodings? What's supported? 
Message-ID: <1221678523.4123.25.camel@jmk> So, I've been trying to deal with some places where I need to force the parser's encoding, and I've been surprised by how little it seems to support. Specifically, 'ascii' isn't a supported encoding: Python 2.5.2 (r252:60911, Jul 31 2008, 15:38:58) [GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from lxml import etree >>> p = etree.XMLParser( encoding = 'ascii' ) Traceback (most recent call last): File "", line 1, in File "parser.pxi", line 1240, in lxml.etree.XMLParser.__init__ (src/lxml/lxml.etree.c:58722) File "parser.pxi", line 711, in lxml.etree._BaseParser.__init__ (src/lxml/lxml.etree.c:55050) LookupError: unknown encoding: 'ascii' >>> etree.__version__ u'2.1.1' I checked the libxml2 documentation, and that claims that on linux it supports all the encodings that iconv does, which is quite a lot. Almost none of those returned by iconv actually work, though. Am I doing something wrong here by trying to specify the encoding in this way? Is there something weird about my build? If everything is working as intended, is there anyplace I can find a list of the encodings lxml does support? My current workaround is to do the decoding to unicode first, then hand the unicode string to lxml, but that seems less efficient than letting the parser handle it. -- John Krukoff Land Title Guarantee Company From stefan_ml at behnel.de Wed Sep 17 21:31:58 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Sep 2008 21:31:58 +0200 Subject: [lxml-dev] lxml parser encodings? What's supported? In-Reply-To: <1221678523.4123.25.camel@jmk> References: <1221678523.4123.25.camel@jmk> Message-ID: <48D15B2E.9090800@behnel.de> Hi, John Krukoff wrote: > So, I've been trying to deal with some places where I need to force the > parser's encoding, and I've been surprised by how little it seems to > support. 
Specifically, 'ascii' isn't a supported encoding: > > Python 2.5.2 (r252:60911, Jul 31 2008, 15:38:58) > [GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2 > Type "help", "copyright", "credits" or "license" for more information. >>>> from lxml import etree >>>> p = etree.XMLParser( encoding = 'ascii' ) > Traceback (most recent call last): > File "", line 1, in > File "parser.pxi", line 1240, in lxml.etree.XMLParser.__init__ > (src/lxml/lxml.etree.c:58722) > File "parser.pxi", line 711, in lxml.etree._BaseParser.__init__ > (src/lxml/lxml.etree.c:55050) > LookupError: unknown encoding: 'ascii' >>>> etree.__version__ > u'2.1.1' > > > I checked the libxml2 documentation, and that claims that on linux it > supports all the encodings that iconv does, which is quite a lot. Almost > none of those returned by iconv actually work, though. Am I doing > something wrong here by trying to specify the encoding in this way? No, you've found a bug. The way the override input encoding is checked by the parser instantiation is simply wrong, it doesn't find any "standard" encoding (utf-8 or ASCII), neither does it find iconv encodings. Here's a fix. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: parser-encoding.patch Type: text/x-patch Size: 1581 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20080917/ba9bff40/attachment-0001.bin From stefan_ml at behnel.de Wed Sep 17 21:56:26 2008 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 17 Sep 2008 21:56:26 +0200 Subject: [lxml-dev] lxml parser encodings? What's supported? In-Reply-To: <48D15B2E.9090800@behnel.de> References: <1221678523.4123.25.camel@jmk> <48D15B2E.9090800@behnel.de> Message-ID: <48D160EA.7070709@behnel.de> Stefan Behnel wrote: > John Krukoff wrote: >> So, I've been trying to deal with some places where I need to force the >> parser's encoding, and I've been surprised by how little it seems to >> support. 
Specifically, 'ascii' isn't a supported encoding: > > Here's a fix. ... not a complete one, though. It doesn't work for iterparse(). I'll have to look into that. Stefan From jkrukoff at ltgc.com Wed Sep 17 22:14:28 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Wed, 17 Sep 2008 14:14:28 -0600 Subject: [lxml-dev] lxml parser encodings? What's supported? In-Reply-To: <48D15B2E.9090800@behnel.de> References: <1221678523.4123.25.camel@jmk> <48D15B2E.9090800@behnel.de> Message-ID: <1221682468.4123.36.camel@jmk> On Wed, 2008-09-17 at 21:31 +0200, Stefan Behnel wrote: > No, you've found a bug. The way the override input encoding is checked by the > parser instantiation is simply wrong, it doesn't find any "standard" encoding > (utf-8 or ASCII), neither does it find iconv encodings. > > Here's a fix. > > Stefan FWIW, utf-8 does work. Here's some that I tried, and that worked for me: utf8 utf16 iso-8859-1 through iso-8859-9 shift_jis Just mentioning, since you said that utf8 doesn't work (I would have noticed that a long time ago). Anyway, I'll give the patch a try. Thanks for your always quick response! -- John Krukoff Land Title Guarantee Company From jkrukoff at ltgc.com Fri Sep 26 01:32:38 2008 From: jkrukoff at ltgc.com (John Krukoff) Date: Thu, 25 Sep 2008 17:32:38 -0600 Subject: [lxml-dev] lxml parser encodings? What's supported? In-Reply-To: <48D15B2E.9090800@behnel.de> References: <1221678523.4123.25.camel@jmk> <48D15B2E.9090800@behnel.de> Message-ID: <1222385558.4123.96.camel@jmk> On Wed, 2008-09-17 at 21:31 +0200, Stefan Behnel wrote: > Hi, > No, you've found a bug. The way the override input encoding is checked by the > parser instantiation is simply wrong, it doesn't find any "standard" encoding > (utf-8 or ASCII), neither does it find iconv encodings. > > Here's a fix. > > Stefan After some abortive fumbling until I figured out I needed to have cython installed to use the patch, I gave it a try. 
Looks like it works fine here for my use case:

>>> html.fromstring( '', parser = html.HTMLParser( ) )
>>> html.fromstring( '', parser = html.HTMLParser( encoding = 'us-ascii' ) )

--
John Krukoff
Land Title Guarantee Company

From jkrukoff at ltgc.com Fri Sep 26 01:33:11 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Thu, 25 Sep 2008 17:33:11 -0600
Subject: [lxml-dev] HTML Meta Content-Type Tag not created as documentation states?
Message-ID: <1222385591.4123.97.camel@jmk>

So, I was trying to figure out what happened to my meta tags when using the lxml.html module, and saw the note in the documentation that html.tostring will handle them as so:

> Note: if include_meta_content_type is true this will create a
> ```` tag in the head;
> regardless of the value of include_meta_content_type any existing
> ```` tag will be removed
>

However, that doesn't seem to actually be the case. It looks like etree.tostring is never creating the meta tag as html.tostring appears to expect, and instead the include_meta_content_type flag is simply controlling whether any found meta tag is removed from the output (with an re!).

Python 2.5.2 (r252:60911, Sep 22 2008, 12:08:38)
[GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import html, etree
>>> h = html.fromstring( '

©2008

' ) >>> html.tostring( h, encoding = 'us-ascii', include_meta_content_type = True ) '

©2008

' Not present there, so I figure maybe it's because it's not being treated as a complete document? >>> h = html.document_fromstring( '

©2008

' ) >>> html.tostring( h, encoding = 'us-ascii', include_meta_content_type = True ) '

©2008

' Parsing as document doesn't create it either. >>> html.tostring( h, encoding = 'iso-8859-1', include_meta_content_type = True ) '

\xa92008

' Okay, maybe it's because I'm using the default encoding for HTML (us-ascii)? Nope, trying something else doesn't cause it to exist either. >>> html.tostring( etree.ElementTree( h ), encoding = 'iso-8859-1', include_meta_content_type = True ) '\n

\xa92008

'

Maybe wrapping in an ElementTree? Get a doctype declaration out of that, but still no meta tag.

>>> from lxml import etree
>>> etree.__version__
u'2.1.2'

In further testing, it appeared that if a Meta Content-Type tag was specified, it was passed through as is, as long as include_meta_content_type was True.

The really weird part of this for me though, is that I've set include_meta_content_type on my much more complicated application server, and it does in fact appear to be generating meta tags automatically (or at least something in my XSLT heavy processing chain is). My testing was an attempt to duplicate that, and I was quite surprised when I couldn't.

I've tried this on boxes with both libxml2 2.6.26 (RHEL5) & 2.6.32, and didn't see a difference there.

--
John Krukoff
Land Title Guarantee Company

From friedel at translate.org.za Mon Sep 29 12:49:15 2008
From: friedel at translate.org.za (F Wolff)
Date: Mon, 29 Sep 2008 12:49:15 +0200
Subject: [lxml-dev] Simple doctypes not in docinfo.doctype
Message-ID: <1222685355.29104.6.camel@localhost>

Hallo list

I've hit a snag with lxml and a DOCTYPE declaration. I don't know if I'm to blame here, but would appreciate help either way. I've tried this with an old (1.3.2) and newer (2.0.6) lxml version.

(this example is roughly based on the code at http://codespeak.net/lxml/tutorial.html)

from lxml import etree
from StringIO import StringIO
tree = etree.parse(StringIO(""""""))
tree.docinfo.doctype
''

From my understanding this DOCTYPE declaration is valid (and occurring in the wild in Qt .ts files). My real issue is round-trip problems in a reading-writing cycle where the DOCTYPE is lost, but I guess not being able to use .docinfo.doctype is already a problem.

Any help will be appreciated.
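
One way to check the claim that a bare DOCTYPE is well-formed is to run it through a second parser. The following is only a minimal sketch and it deliberately uses the standard library's xml.dom.minidom instead of lxml, so it says nothing about what lxml's docinfo returns. The XML literal in the archived message was lost, so "<!DOCTYPE TS>" is an assumed stand-in for a Qt .ts style declaration, and doctype_round_trip is a hypothetical helper name.

```python
from xml.dom import minidom

# Assumption: the original XML literal did not survive in the archive.
# "<!DOCTYPE TS>" stands in for a bare DOCTYPE of the kind found in Qt .ts files.
SOURCE = "<!DOCTYPE TS><TS></TS>"


def doctype_round_trip(xml_text):
    """Parse xml_text and return (doctype name or None, reserialized document)."""
    doc = minidom.parseString(xml_text)
    # doc.doctype is None when the input carries no DOCTYPE declaration.
    name = doc.doctype.name if doc.doctype is not None else None
    return name, doc.toxml()


if __name__ == "__main__":
    name, output = doctype_round_trip(SOURCE)
    print(name)
    print(output)
```

If the declaration survives a parse and serialize cycle in another parser, that would support the reading that the input itself is fine and that the empty docinfo.doctype is worth reporting against lxml.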

Keep well
Friedel

--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/vrot-mango

From ivanov.maxim at gmail.com Mon Sep 29 16:56:41 2008
From: ivanov.maxim at gmail.com (Max Ivanov)
Date: Mon, 29 Sep 2008 18:56:41 +0400
Subject: [lxml-dev] Writing TargetParser in Cython
Message-ID:

Hi all!

I'm trying to write TargetParser in Cython just to compare performance. The problem is with data types. If I define the data method as "def data(self, char *data):" I'm unable to use it as TargetParser. I get this error:

    def data(self, char *data):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

I can instantiate the class and call the data() and close() methods directly and everything works fine, but it refuses to work with lxml.

Small testcase following:

----- _target.pyx -----------
cdef class Target:
    cdef list _data
    def __init__(self):
        self._data = []
    def data(self, char *data):
        self._data.append(data)
    def close(self):
        return ''.join(self._data)
---- end of _target.pyx ------

---- test.py -------
# -*- encoding: utf-8 -*-
import lxml.html
from lxml import etree
from _target import Target
res = etree.HTML(u"ABCD", parser=lxml.html.HTMLParser(target = Target()))
---- end of test.py ------

From joao.moreira at free.fr Tue Sep 30 23:37:58 2008
From: joao.moreira at free.fr (Joao Moreira)
Date: Tue, 30 Sep 2008 23:37:58 +0200
Subject: [lxml-dev] Internal compiler error when compiling lxml
Message-ID: <48E29C36.40401@free.fr>

Hi all,

I've just installed the latest lxml and tried to compile it (first I had to get the latest libxml2 and libxslt), it crashes on the first file lxml.etree.c with an "internal compiler error". I'm on a Linux box running Fedora 5, with python 2.4.2. Should I report this as a bug somewhere else ?

Any suggestions on how to work around this ? maybe a problem with version compatibility, between lxml and python ? or libxml2, libxslt ?
this is my very first attempt at installing anything written in python, please forgive my ignorance. Detailed trace below.

Thanks for helping,
Joao

---------------------
[joao at fox lxml-2.1.2]$ python setup.py build
Building lxml version 2.1.2.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxslt 1.1.24
Building against libxml2/libxslt in one of the following directories:
/w7/u/libxslt-1.1.24/lib
/w7/u/libxml2-2.7.1/lib
running build
running build_py
running build_ext
building 'lxml.etree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -D_GNU_SOURCE -fPIC -fPIC -I/w7/u/libxslt-1.1.24/include -I/w7/u/libxml2-2.7.1/include/libxml2 -I/usr/include/python2.4 -c src/lxml/lxml.etree.c -o build/temp.linux-i686-2.4/src/lxml/lxml.etree.o -w
src/lxml/lxml.etree.c: In function '__pyx_f_4lxml_5etree_17_IterparseContext__setEventFilter':
src/lxml/lxml.etree.c:73968: internal compiler error: in merge_alias_info, at tree-ssa-copy.c:235
Please submit a full bug report, with preprocessed source if appropriate.
See for instructions.
Preprocessed source stored into /tmp/cc9Y1mVr.out file, please attach this to your bugreport.
error: command 'gcc' failed with exit status 1
[joao at fox lxml-2.1.2]$

--
Joao Moreira de Sa Coutinho
joao.moreira at free.fr

From jkrukoff at ltgc.com Tue Sep 30 23:59:54 2008
From: jkrukoff at ltgc.com (John Krukoff)
Date: Tue, 30 Sep 2008 15:59:54 -0600
Subject: [lxml-dev] Internal compiler error when compiling lxml
In-Reply-To: <48E29C36.40401@free.fr>
References: <48E29C36.40401@free.fr>
Message-ID: <1222811994.4123.114.camel@jmk>

On Tue, 2008-09-30 at 23:37 +0200, Joao Moreira wrote:
> Any suggestions on how to work around this ? maybe a problem with version
> compatibility, between lxml and python ?
or libxml2, libxslt ? this is > my very > first attempt at installing anything written in python, please forgive > my ignorance. > Detailed trace below. Just a random guess, but FC5 is pretty ancient as linux distributions go. You might have better luck trying a version of lxml that's as old as your distro, like 1.0.x or so. If you're lucky, maybe 1.3.6 will work. If you want to be insistent about getting 2.1.2 built, I wouldn't be surprised if you'll need to build an updated gcc and libc to be able to build it. Upgrading your OS would be easier. -- John Krukoff Land Title Guarantee Company