From elliottslaughter at gmail.com Wed Jul 1 00:04:26 2009
From: elliottslaughter at gmail.com (Elliott Slaughter)
Date: Tue, 30 Jun 2009 15:04:26 -0700
Subject: [lxml-dev] parsing DTDs - listing of valid elements
Message-ID: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com>

Hi,

I'm trying to get the elements in a DTD. Since these internals are not exported in the Python interface of lxml.etree, I am trying to write a Cython extension to do so, as previously suggested on this mailing list (see link below).

http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html

To quote the message, "all you'd really need is the internal _c_dtd field of the DTD class, which you could cimport". I'm wondering exactly how I am supposed to do that (my attempts so far are described below). It would also be nice to know if the last attempt to do so was successful or not.

Thanks. Any help would be appreciated.

Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows):

The DTD class is not declared in etreepublic.pxd, so I can't just "cimport etreepublic". The actual DTD class definition is in dtd.pxi, as stated in the message. But I can't just "include 'dtd.pxi'" because it inherits from the _Validator class in lxml.etree.pyx. And I can't "cimport lxml.etree" because there is no file lxml.etree.pxd.

I tried writing a lxml.etree.pxd file to circumvent these barriers (which was thoroughly confusing because _Validator contains an _ErrorLog which made me search through several other files...), but even when I got the entire thing to compile, it failed to load in Python:

>>> import mydtd
Traceback (most recent call last):
  File "", line 1, in
  File "lxml.etree.pxd", line 3, in mydtd (mydtd.c:513)
    cdef class _LogEntry:
ValueError: lxml.etree._LogEntry does not appear to be the correct type object

I have attached my lxml.etree.pxd in case I made any mistakes, in the event that this method can be made to work.

-- Elliott Slaughter "Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090630/0a7b98d7/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: lxml.etree.pxd Type: application/octet-stream Size: 602 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20090630/0a7b98d7/attachment.obj From kris at cs.ucsb.edu Wed Jul 1 00:18:31 2009 From: kris at cs.ucsb.edu (kristian kvilekval) Date: Tue, 30 Jun 2009 15:18:31 -0700 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <1246087234.3475.28.camel@krispc.sd.cox.net> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> <1246087234.3475.28.camel@krispc.sd.cox.net> Message-ID: <1246400311.10362.293.camel@loup.ece.ucsb.edu> Hi, We are critical need of an installable recent lxml egg for Mac OS 10.5.7 with python 2.5 Could you please post the building instructions you used to create the eggs for python 2.4 Thanks, Kris From tseaver at palladion.com Wed Jul 1 01:15:32 2009 From: tseaver at palladion.com (Tres Seaver) Date: Tue, 30 Jun 2009 19:15:32 -0400 Subject: [lxml-dev] Binary egg for Mac OS X In-Reply-To: <1246400311.10362.293.camel@loup.ece.ucsb.edu> References: <4A2112AE.8040903@behnel.de> <4A2CA7AE.8030508@behnel.de> <97EFCE41-4A6E-4CB2-8593-CE7CF5F5A2D3@inquant.de> <4A34DD37.8000207@behnel.de> <4A3B1D56.3000007@behnel.de> <4A41AD28.4010202@behnel.de> <1245882881.10362.239.camel@loup.ece.ucsb.edu> <4A45C2F0.5030802@behnel.de> <1246087234.3475.28.camel@krispc.sd.cox.net> <1246400311.10362.293.camel@loup.ece.ucsb.edu> 
Message-ID:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

kristian kvilekval wrote:
> Hi,
>
> We are critical need of an installable
> recent lxml egg for Mac OS 10.5.7 with python 2.5
>
> Could you please post the building instructions you used to
> create the eggs for python 2.4

This is what I use to build lxml in CentOS4, which has the "too-old-libxml2-and-libxslt" problem:

  $ wget http://pypi.python.org/packages/source/l/lxml/lxml-2.2.2.tar.gz
  ...
  $ tar xzf lxml-2.2.2.tar.gz
  $ cd lxml-2.2.2
  $ /path/to/python setup.py bdist_egg --static-deps

Tres.
- --
===================================================================
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFKSpyU+gerLs4ltQ4RAjrpAJ9OSCMg2OrVviezfDnAdg9krNqcnQCfdBne
3MOjrE76sMTqtjcrDMatLcE=
=Ihsx
-----END PGP SIGNATURE-----

From elliottslaughter at gmail.com Wed Jul 1 02:29:19 2009
From: elliottslaughter at gmail.com (Elliott Slaughter)
Date: Tue, 30 Jun 2009 17:29:19 -0700
Subject: [lxml-dev] parsing DTDs - listing of valid elements
In-Reply-To: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com>
References: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com>
Message-ID: <42c0ab790906301729w326e3f5ahbbf15ea4d920a914@mail.gmail.com>

Please ignore my previous message; I solved my own problem by finding an XML schema for what I need to do. Sorry for the noise.

On Tue, Jun 30, 2009 at 3:04 PM, Elliott Slaughter <elliottslaughter at gmail.com> wrote:

> Hi,
>
> I'm trying to get the elements in a DTD. Since these internals are not
> exported in the Python interface of lxml.etree, I am trying to write a
> Cython extension to do so, as previously suggested on this mailing list (see
> link below).
> > http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html > > To quote the message, "all you'd really need is the internal _c_dtd field > of the DTD class, which you could cimport". I'm wondering exactly how I am > supposed to do that (my attempts so far are described below). It would also > be nice to know if the last attempt to do so was successful or not. > > Thanks. Any help would be appreciated. > > > > Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows): > > The DTD class is not declared in etreepublic.pxd, so I can't just "cimport > etreepublic". The actual DTD class definition is in dtd.pxi, as stated in > the message. But I can't just "include 'dtd.pxi' " because it inherits from > the _Validator class in lxml.etree.pyx . And I can't "cimport lxml.etree" > because there is no file lxml.etree.pxd. > > I tried writing a lxml.etree.pxd file to circumvent these barriers (which > was thoroughly confusing because _Validator contains an _ErrorLog which made > me search through several other files...), but even when I got the entire > thing to compile, it failed to load in Python: > > >>> import mydtd > Traceback (most recent call last): > File "", line 1, in > File "lxml.etree.pxd", line 3, in mydtd (mydtd.c:513) > cdef class _LogEntry: > ValueError: lxml.etree._LogEntry does not appear to be the correct type > object > > I have attached my lxml.etree.pxd in case I made any mistakes, in the event > that this method can be made to work. > > -- > Elliott Slaughter > > "Don't worry about what anybody else is going to do. The best way to > predict the future is to invent it." - Alan Kay > -- Elliott Slaughter "Don't worry about what anybody else is going to do. The best way to predict the future is to invent it." - Alan Kay -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090630/3a28afc0/attachment.htm From stefan_ml at behnel.de Wed Jul 1 07:37:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 01 Jul 2009 07:37:05 +0200 Subject: [lxml-dev] parsing DTDs - listing of valid elements In-Reply-To: <42c0ab790906301729w326e3f5ahbbf15ea4d920a914@mail.gmail.com> References: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com> <42c0ab790906301729w326e3f5ahbbf15ea4d920a914@mail.gmail.com> Message-ID: <4A4AF601.7010900@behnel.de> Hi, Elliott Slaughter wrote: > Please ignore my previous message; I solved my own problem by finding an XML > schema for what I need to do. Note that you can always use trang to convert a DTD to an XML Schema. http://www.thaiopensource.com/relaxng/trang.html If all you need is a list of allowed elements, the required logic to extract that from the schema shouldn't be too hard to figure out. Although I wonder if RelaxNG wouldn't be easier to work on. Stefan From stefan_ml at behnel.de Wed Jul 1 08:14:57 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 01 Jul 2009 08:14:57 +0200 Subject: [lxml-dev] parsing DTDs - listing of valid elements In-Reply-To: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com> References: <42c0ab790906301504x3eddea6at21c32c3b17f0aa1f@mail.gmail.com> Message-ID: <4A4AFEE1.9080707@behnel.de> Hi, Elliott Slaughter wrote: > I'm trying to get the elements in a DTD. Since these internals are not > exported in the Python interface of lxml.etree, I am trying to write a > Cython extension to do so, as previously suggested on this mailing list (see > link below). > > http://codespeak.net/pipermail/lxml-dev/2009-January/004298.html > > To quote the message, "all you'd really need is the internal _c_dtd field of > the DTD class, which you could cimport". I'm wondering exactly how I am > supposed to do that > [...] 
> Here is what I've tried so far (on Python 2.5.4, Cython 0.11.2, Windows):
>
> The DTD class is not declared in etreepublic.pxd, so I can't just "cimport
> etreepublic". The actual DTD class definition is in dtd.pxi, as stated in
> the message. But I can't just "include 'dtd.pxi' " because it inherits from
> the _Validator class in lxml.etree.pyx . And I can't "cimport lxml.etree"
> because there is no file lxml.etree.pxd.

True. So your only chance is to write one yourself. And yes, it needs to be called "lxml.etree.pxd".

> I tried writing a lxml.etree.pxd file to circumvent these barriers (which
> was thoroughly confusing because _Validator contains an _ErrorLog which made
> me search through several other files...),

All you should really need is this:

    cimport tree

    cdef class _Validator:
        cdef object _error_log

    cdef class DTD(_Validator):
        cdef tree.xmlDtd* _c_dtd

Cython needs to know the exact /layout/ of the classes that you use (at least if they are not exported as C header files), but it doesn't need to know the exact class types of attributes. "object" will do just fine if you don't care.

I know that this is harder than necessary (thanks for bringing this up, BTW), but that's just because _DTD isn't an 'officially' C-exported type, just like all other schema types.

Stefan

From maruadventurer at gmail.com Thu Jul 2 04:08:33 2009
From: maruadventurer at gmail.com (john mcginnis)
Date: Wed, 1 Jul 2009 21:08:33 -0500
Subject: [lxml-dev] Really dumb question
Message-ID:

I have completed a build with easy_install. Looking at site-packages I also see a lxml directory that exists. What I don't find however is a lxml.py module anywhere.

If you have a successful build should there not be a lxml.py module somewhere?

Thanks.

JohnMc
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090701/6dcd978c/attachment-0001.htm From stefan_ml at behnel.de Thu Jul 2 06:18:02 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 02 Jul 2009 06:18:02 +0200 Subject: [lxml-dev] Really dumb question In-Reply-To: References: Message-ID: <4A4C34FA.3060008@behnel.de> john mcginnis wrote: > I have completed a build with easy_install. Looking at site-packages I also > see a lxml directory that exists. What I don't find however is a lxml.py > module anywhere. > > If you have a successful build should there not be a lxml.py module > somewhere? No, lxml is a package, not a module. When you import "lxml", Python will load "lxml/__init__.py" instead. Stefan From friedel at translate.org.za Fri Jul 3 15:48:44 2009 From: friedel at translate.org.za (F Wolff) Date: Fri, 03 Jul 2009 15:48:44 +0200 Subject: [lxml-dev] space normalisation for .text and .tail Message-ID: <1246628924.19137.298.camel@localhost> Hallo all On 2009-03-24 I wrote about space normalisation with reference to the xml:space attribute, and the string() and normalize-string() functions in xpath. I solved my problem in code, partly due to slightly changing requirements. Now I need to do similar magic, but need to handle the text nodes separately, without descending into child nodes. >From the xpath document: > The string-value of an element node is the concatenation of the > string-values of all text node descendants of the element node in > document order. ...which is not what I need to do in this case. Is there a way to apply the normalize-text() to a node's .text or .tail only? Is there another way to obtain the same result? From the looks of it, there is no reliable way that I can normalise correctly in code, since I won't know if a newline (for example) was given as a newline or as a character reference, and this should influence the normalisation. Any help is appreciated. 
Friedel -- Recently on my blog: http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009 From foolistbar at googlemail.com Sat Jul 4 11:13:48 2009 From: foolistbar at googlemail.com (Geoffrey Sneddon) Date: Sat, 4 Jul 2009 11:13:48 +0200 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: <4A45ACBE.40107@behnel.de> References: <4A40278E.3020805@chantofwaves.com> <4A45ACBE.40107@behnel.de> Message-ID: <847184AA-5B2B-4C97-8F8F-F50FCD04C8F0@googlemail.com> On 27 Jun 2009, at 07:23, Stefan Behnel wrote: >> The output: >> ----- >> > cs="http://something.com/cs" xml:lang="en" >> lang="en">Help!

My namespaces are >> going to disappear!

FRUIT

>> ----- > > That's because HTML parsers are not namespace aware. Namespaces are > simply > not defined for HTML. But if you get a difference on different > systems, I'd > still suspect the reason to be different libxml2 versions. There's > nothing > lxml can do about this. It should still be outputting an element with a name of "cs:content", it shouldn't be dropping the "cs:", as, as you say, there are not namespaces in HTML, so it has no meaning. My basic advice to the OP would be to use html5lib, which is far slower, but does cope with this fine. -- Geoffrey Sneddon From stefan_ml at behnel.de Sat Jul 4 12:03:01 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 04 Jul 2009 12:03:01 +0200 Subject: [lxml-dev] lxml.html, now with ignored namespaces! In-Reply-To: <847184AA-5B2B-4C97-8F8F-F50FCD04C8F0@googlemail.com> References: <4A40278E.3020805@chantofwaves.com> <4A45ACBE.40107@behnel.de> <847184AA-5B2B-4C97-8F8F-F50FCD04C8F0@googlemail.com> Message-ID: <4A4F28D5.2020008@behnel.de> Hi, Geoffrey Sneddon wrote: >>> The output: >>> ----- >>> >> cs="http://something.com/cs" xml:lang="en" >>> lang="en">Help!

My namespaces are >>> going to disappear!

FRUIT

>>> ----- > > My basic advice to the OP would be to use html5lib, which is far slower, > but does cope with this fine. Well, as I said, it just depends on the version of libxml2 that you are using. >>> from lxml import etree >>> print "lxml.etree: ", etree.LXML_VERSION lxml.etree: (2, 2, 2, 0) >>> print "libxml used: ", etree.LIBXML_VERSION libxml used: (2, 6, 32) >>> from lxml.html import fromstring >>> document = fromstring("""Help!

My namespaces are ... going to disappear!

FRUIT

... """) >>> print parser.tostring(document) Help!

My namespaces are going to disappear!

FRUIT

Stefan From stefan_ml at behnel.de Sat Jul 4 13:24:00 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 04 Jul 2009 13:24:00 +0200 Subject: [lxml-dev] space normalisation for .text and .tail In-Reply-To: <1246628924.19137.298.camel@localhost> References: <1246628924.19137.298.camel@localhost> Message-ID: <4A4F3BD0.7070108@behnel.de> Hi, F Wolff wrote: > On 2009-03-24 I wrote about space normalisation with reference to the > xml:space attribute, and the string() and normalize-string() functions > in xpath. I solved my problem in code, partly due to slightly changing > requirements. > > Now I need to do similar magic, but need to handle the text nodes > separately, without descending into child nodes. > >>From the xpath document: >> The string-value of an element node is the concatenation of the >> string-values of all text node descendants of the element node in >> document order. > ...which is not what I need to do in this case. > > Is there a way to apply the normalize-text() to a node's .text or .tail > only? Is there another way to obtain the same result? Well, lxml will not allow you to modify individual text nodes that the parser created next to each other for whatever reason (likely due to implementation details), even if XPath allows you to get your hands on them using "text()". The text/tail properties are as deep down as it gets. > From the looks of > it, there is no reliable way that I can normalise correctly in code, > since I won't know if a newline (for example) was given as a newline or > as a character reference, and this should influence the normalisation. Why is that? XML parsers will always replace character references by their Unicode character value, and there is no way XPath could see them. If you need that information for your algorithm, you will have to parse the XML byte stream yourself. Neither the XML infoset nor the XPath data model provide this. 
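The point about character references can be checked with a small sketch. The two input strings below are hypothetical, and the fallback import is only there so the snippet also runs where lxml is not installed; for this element-text case the standard library's ElementTree is expected to behave the same way.

```python
# Sketch only: prefer lxml, fall back to the standard library.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical inputs: the same newline, once written literally and
# once written as the character reference &#10;.
literal = etree.fromstring("<a>x\ny</a>")
reference = etree.fromstring("<a>x&#10;y</a>")

# After parsing, both elements carry the same text; the tree keeps no
# record of which spelling was used in the source.
same_text = literal.text == reference.text
print(repr(literal.text), repr(reference.text), same_text)
```

This only covers element content. It says nothing about attribute values, which the parser normalises under separate rules.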
Stefan From friedel at translate.org.za Tue Jul 7 09:29:51 2009 From: friedel at translate.org.za (F Wolff) Date: Tue, 07 Jul 2009 09:29:51 +0200 Subject: [lxml-dev] space normalisation for .text and .tail In-Reply-To: <4A4F3BD0.7070108@behnel.de> References: <1246628924.19137.298.camel@localhost> <4A4F3BD0.7070108@behnel.de> Message-ID: <1246951791.15406.14.camel@localhost.localdomain> Op Sa, 2009-07-04 om 13:24 +0200 skryf Stefan Behnel: > Hi, > > F Wolff wrote: > > On 2009-03-24 I wrote about space normalisation with reference to the > > xml:space attribute, and the string() and normalize-string() functions > > in xpath. I solved my problem in code, partly due to slightly changing > > requirements. > > > > Now I need to do similar magic, but need to handle the text nodes > > separately, without descending into child nodes. > > > >>From the xpath document: > >> The string-value of an element node is the concatenation of the > >> string-values of all text node descendants of the element node in > >> document order. > > ...which is not what I need to do in this case. > > > > Is there a way to apply the normalize-text() to a node's .text or .tail > > only? Is there another way to obtain the same result? > > Well, lxml will not allow you to modify individual text nodes that the > parser created next to each other for whatever reason (likely due to > implementation details), even if XPath allows you to get your hands on them > using "text()". The text/tail properties are as deep down as it gets. Sorry, let me rephrase: I don't need to alter the internal XML structure, I just want to obtain normalised versions of the .text and .tail nodes in a tree with text and xml elements intertwined. For example: Moo Mew bla bla In this case I'm looking or a way to obtain the strings "Moo", "Mew", and "bla bla" (with the the spaces normalised). XPath's normalize-text() can give me "Moo Mew bla bla", but I still want access to each .text and .tail separately normalised. 
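One way to get each .text and .tail normalised separately is to do the normalisation in Python instead of XPath. The sketch below is an assumption-laden illustration, not something taken from the thread: the XPath function referred to above is presumably `normalize-space()`, the helper names `normalize_space` and `normalized_parts` are made up here, the sample document is hypothetical, and comments or processing instructions in the tree are not handled.

```python
import re

# Sketch only: prefer lxml, fall back to the standard library.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# XPath's normalize-space() treats only these four characters as whitespace.
_XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Collapse whitespace runs to one space and trim both ends."""
    if text is None:
        return ""
    return _XML_WHITESPACE.sub(" ", text).strip(" ")


def normalized_parts(element):
    """Yield (tag, 'text' or 'tail', normalised string) in document order."""
    text = normalize_space(element.text)
    if text:
        yield element.tag, "text", text
    for child in element:
        # Everything inside the child comes before the child's tail.
        for part in normalized_parts(child):
            yield part
        tail = normalize_space(child.tail)
        if tail:
            yield child.tag, "tail", tail


# Hypothetical document with untidy whitespace around every text chunk.
root = etree.fromstring("<root>  Moo <b> Mew </b> bla \n  bla </root>")
parts = list(normalized_parts(root))
print(parts)
```

The tree itself is left untouched; only the strings that are handed out get normalised, so the original spacing is still there for serialisation.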
> > From the looks of
> > it, there is no reliable way that I can normalise correctly in code,
> > since I won't know if a newline (for example) was given as a newline or
> > as a character reference, and this should influence the normalisation.
>
> Why is that? XML parsers will always replace character references by their
> Unicode character value, and there is no way XPath could see them. If you
> need that information for your algorithm, you will have to parse the XML
> byte stream yourself. Neither the XML infoset nor the XPath data model
> provide this.
>
> Stefan

My understanding was that the normalisation does not touch entities, and that the following two is not equivalent when normalised; vs. ...but playing now with normalize-string() it seems that they are equivalent.

Would it be possible to normalise correctly in code in all cases?

Thank you for the help.

Friedel

--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/presentation-afrilex-alasa-2009

From herve.cauwelier at free.fr Thu Jul 9 20:35:10 2009
From: herve.cauwelier at free.fr (=?UTF-8?B?SGVydsOpIENhdXdlbGllcg==?=)
Date: Thu, 09 Jul 2009 20:35:10 +0200
Subject: [lxml-dev] blather question about XML declaration
Message-ID: <4A56385E.1000001@free.fr>

Hi,

I just wonder why lxml is producing XML declarations such as or sometimes instead of the most common (to my knowledge)

For now I don't ask tostring() to print it, I concatenate my own with the tree serialisation.

Thanks,

Hervé

From stefan_ml at behnel.de Thu Jul 9 22:03:28 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 09 Jul 2009 22:03:28 +0200
Subject: [lxml-dev] blather question about XML declaration
In-Reply-To: <4A56385E.1000001@free.fr>
References: <4A56385E.1000001@free.fr>
Message-ID: <4A564D10.7010504@behnel.de>

Hi,

Hervé Cauwelier wrote:
> I just wonder why lxml is producing XML declarations such as
>
>
>
> or sometimes
>
>

Except that the latter is impossible to get out of the serialiser.

> instead of the most common (to my knowledge) > > Why should this be more common? Just because Java doesn't understand strings in single quotes? The XML spec allows both: http://www.w3.org/TR/REC-xml/#sec-prolog-dtd and I see no reason to use '"'. It just degrades readability. > For now I don't ask tostring() to print it, I concatenate my own with > the tree serialisation. Should work as long as you only serialise to UTF-8 (in which case the declaration is optional anyway). Stefan From stefan_ml at behnel.de Fri Jul 10 06:49:12 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 10 Jul 2009 06:49:12 +0200 Subject: [lxml-dev] bizarre crashes under FreeBSD Message-ID: <4A56C848.2030202@behnel.de> Hi all, I got a bug report on crashes under FreeBSD that are not reproducible on other systems. The (varying) problem descriptions sound rather bizarre and unlikely. They are all threading related, one even regards the thread-local storage in Python. https://bugs.launchpad.net/bugs/397516 Does anyone have experience with threading on FreeBSD? Any ideas where these problems may arise from? Could anyone test the attached test script on other FreeBSD systems? Thanks for any hints. Stefan From stefan_ml at behnel.de Sat Jul 11 21:12:40 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 11 Jul 2009 21:12:40 +0200 Subject: [lxml-dev] bizarre crashes under FreeBSD In-Reply-To: <4A56C848.2030202@behnel.de> References: <4A56C848.2030202@behnel.de> Message-ID: <4A58E428.8070908@behnel.de> Stefan Behnel wrote: > I got a bug report on crashes under FreeBSD that are not reproducible on > other systems. The (varying) problem descriptions sound rather bizarre and > unlikely. They are all threading related, one even regards the thread-local > storage in Python. > > https://bugs.launchpad.net/bugs/397516 > > Does anyone have experience with threading on FreeBSD? Any ideas where > these problems may arise from? 
Could anyone test the attached test script
> on other FreeBSD systems?

As a quick follow-up: it seems that static building fixes this. No idea what the original problem was, though...

Stefan

From jonas.esp at googlemail.com Sun Jul 12 19:09:15 2009
From: jonas.esp at googlemail.com (Jonas)
Date: Sun, 12 Jul 2009 17:09:15 +0000
Subject: [lxml-dev] complex node
Message-ID: <20cf13940907121009m5488fd55q5a04b60e32409161@mail.gmail.com>

Hi,

I'm trying to parsing a xml file (in python) to convert it to dictionaries and then to JSON.

This is the file:
http://unicode.org/cldr/data/common/main/es.xml
and I've got all nodes except dates', which is enought complex.

Any idea to get it? because I'm very tired of try it.

From stefan_ml at behnel.de Mon Jul 13 08:16:16 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 13 Jul 2009 08:16:16 +0200
Subject: [lxml-dev] complex node
In-Reply-To: <20cf13940907121009m5488fd55q5a04b60e32409161@mail.gmail.com>
References: <20cf13940907121009m5488fd55q5a04b60e32409161@mail.gmail.com>
Message-ID: <4A5AD130.9080104@behnel.de>

Jonas wrote:
> I'm trying to parsing a xml file (in python) to convert it to
> dictionaries and then to JSON.
>
> This is the file:
> http://unicode.org/cldr/data/common/main/es.xml
> and I've got all nodes except dates', which is enought complex.
>
> Any idea to get it? because I'm very tired of try it.

What's your question?

- how to find the 'dates' element in a parsed tree?
- how to traverse the subtree of the 'dates' element?
- how to extract the data from the parsed tree?
- how to extract the data using iterparse()?
- how to map the data to a dictionary?
- how to map the data to a dictionary while parsing?
- how to map the data to a recursive structure of dictionaries?
- how to find a dictionary mapping that works well with JSON?
- how to find a dictionary mapping that works well in the use case that
  I didn't tell you about?
- ... something else ?

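For the "recursive structure of dictionaries" item, one possible mapping is sketched below. Everything about it is an assumption: the thread never states the target structure, the `@attributes` and `#text` keys are an arbitrary convention chosen here, the helper name `element_to_dict` is made up, and the sample XML is a hypothetical stand-in, not the real layout of the CLDR file.

```python
import json

# Sketch only: prefer lxml, fall back to the standard library.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree


def element_to_dict(element):
    """Map one element to a dict of attributes, text and child lists."""
    node = {}
    if element.attrib:
        node["@attributes"] = dict(element.attrib)
    text = (element.text or "").strip()
    if text:
        node["#text"] = text
    for child in element:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions (lxml)
        # Children are always stored as lists, so repeated tags are kept.
        node.setdefault(child.tag, []).append(element_to_dict(child))
    return node


# Hypothetical sample, loosely modelled on a 'dates' subtree.
sample = (
    '<dates><calendar type="gregorian">'
    '<month type="1">enero</month><month type="2">febrero</month>'
    "</calendar></dates>"
)
root = etree.fromstring(sample)
as_dict = {root.tag: element_to_dict(root)}
as_json = json.dumps(as_dict, ensure_ascii=False, sort_keys=True)
print(as_json)
```

Tail text is ignored in this sketch, which is acceptable for data-style XML but loses information for mixed content. Storing every child under a list keeps the shape predictable for JSON consumers, at the cost of some extra nesting.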
Stefan From stefan_ml at behnel.de Mon Jul 13 08:24:15 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 13 Jul 2009 08:24:15 +0200 Subject: [lxml-dev] lxml binary egg redux In-Reply-To: <05CD06C6-9260-4D5F-BD10-CB87B9B00465@inquant.de> References: <189D48B6-52BA-4C47-9922-3467060BCAD7@inquant.de> <5af21ed10907090716t76324991pc7a4839ff4fdf5c4@mail.gmail.com> <4A5A398F.5050105@behnel.de> <05CD06C6-9260-4D5F-BD10-CB87B9B00465@inquant.de> Message-ID: <4A5AD30F.8020000@behnel.de> Stephan Eletzhofer wrote: > Am 12.07.2009 um 21:29 schrieb Stefan Behnel: >> Stephan Eletzhofer wrote: >>> I've built 2.4, 2.5 and 2.6 versions of LXML 2.2.2 and uploaded them. >> >> Any reason why the 2.5 egg is so much bigger than the others? > > Hmm: > > $ ls -lah lib.macosx-10.5-i386-2.?/lxml/*.so > -rwxr-xr-x 1 seletz staff 3,0M 12. Jul 11:09 > lib.macosx-10.5-i386-2.4/lxml/etree.so > -rwxr-xr-x 1 seletz staff 1,7M 12. Jul 11:09 > lib.macosx-10.5-i386-2.4/lxml/objectify.so > -rwxr-xr-x 1 seletz staff 5,5M 12. Jul 21:15 > lib.macosx-10.5-i386-2.5/lxml/etree.so > -rwxr-xr-x 1 seletz staff 3,0M 12. Jul 21:16 > lib.macosx-10.5-i386-2.5/lxml/objectify.so > -rwxr-xr-x 1 seletz staff 2,8M 12. Jul 21:20 > lib.macosx-10.5-i386-2.6/lxml/etree.so > -rwxr-xr-x 1 seletz staff 1,5M 12. Jul 21:20 > lib.macosx-10.5-i386-2.6/lxml/objectify.so > > I've built the 2.5 version using the Apple-supplied python 2.5, the > other ones using the > macports variants. I've no idea about the sizes, really. :p I guess the macports builds are platform specific, while the Apple-builds are fat eggs, i.e. they would work on both PPC and x86. Given that Apple's platform Python is 2.5, I guess it's ok to provide the other eggs only for macports. And I also guess that 'fatness' isn't really a future-proof requirement, either... Anyway, thanks for the contribution! 
Stefan From p.oberndoerfer at urheberrecht.org Mon Jul 13 22:56:00 2009 From: p.oberndoerfer at urheberrecht.org (Pascal Oberndoerfer) Date: Mon, 13 Jul 2009 22:56:00 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? Message-ID: <4A5B9F60.6070108@urheberrecht.org> Hello, I am desperately trying to build lxml on a Mac PPC with OS X 10.4. It is not the most current plattform, but I am stuck with it... I tried "STATIC_DEPS= true easy_install lxml", which works perfectly on all Intel machines I have access to. But unfortunately there seems to be a problem with libiconv on the PPC side. Has anyone else had success? I searched the archives for any hints on building with this combination, but didn't find a solution/suggestions. Should I try something similar to "./configure --without-iconv"? But how would I pass this to easy_install? Or maybe try a local static build including not only libxml2 and libxslt but libiconv as well? Any help is greatly appreciated! and sorry for bothering you with these newbee questions. Pascal From manu3d at gmail.com Mon Jul 13 13:35:33 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Mon, 13 Jul 2009 12:35:33 +0100 Subject: [lxml-dev] Targeted XSL transformations Message-ID: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> Hi everybody, newbie here, typing from an unusually sunny UK. I've been using lxml with great pleasure for a few months now and I'd like to thank the main authors and the rest of the lxml community for the work that has gone into it. From my perspective it's a well done API, with many useful features and it's easy to use. I was wondering if I could pick this mailing-list's collective brain as my own seems to be insufficient for the matter at hand. Let's say I have parsed an xml document, representing xhtml data, into an ElementTree, and I have applied one or more xsl transformations to add style information to it. 
Now, let's say I want to add a whole subtree with a top-element of class "sub" to it. The subtree to be added might have been partially styled already, i.e. because it comes from an xml file that has its own xml-stylesheet processing instructions. However, some styling can occurr only in the context of the whole tree, i.e. because the document-wide xsl transformation file specifically establishes that elements of class "sub" must have a yellow background. Furthermore, the same, xsl file asserts that if the document contains an element of class "sub" the document's background color must be purple rather than blue. My fundamental question in this context is: how do I avoid re-applying the xsl transformation to the whole ElementTree and only apply the bits of the transformation that are necessary, due to the change in the tree? a Manu typing from a now much more normal, cloudy UK - [sigh] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090713/f624848b/attachment.htm From jonas.esp at googlemail.com Mon Jul 13 15:46:47 2009 From: jonas.esp at googlemail.com (Jonas) Date: Mon, 13 Jul 2009 13:46:47 +0000 Subject: [lxml-dev] complex node In-Reply-To: <4A5AD130.9080104@behnel.de> References: <20cf13940907121009m5488fd55q5a04b60e32409161@mail.gmail.com> <4A5AD130.9080104@behnel.de> Message-ID: <20cf13940907130646w1ceea04free9f0a938723e933@mail.gmail.com> 2009/7/13 Stefan Behnel : > > Jonas wrote: >> I'm trying to parsing a xml file (in python) to convert it to >> dictionaries and then to JSON. >> >> This is the file: >> http://unicode.org/cldr/data/common/main/es.xml >> and I've got all nodes except dates', which is enought complex. >> >> Any idea to get it? because I'm very tired of try it. > > What's your question? > > - how to find the 'dates' element in a parsed tree? > - how to traverse the subtree of the 'dates' element? > - how to extract the data from the parsed tree? 
> - how to extract the data using iterparse()? > - how to map the data to a dictionary? > - how to map the data to a dictionary while parsing? > - how to map the data to a recursive structure of dictionaries? This is my main problem. But checking i.e. if each element has attributes to manage it > - how to find a dictionary mapping that works well with JSON? > - how to find a dictionary mapping that works well in the use case that > ?I didn't tell you about? > - ... something else ? * For if anybody is interested here there is an awesome conversor from XML to JSON http://code.google.com/p/xmltojson/ From mike at it-loops.com Tue Jul 14 07:46:30 2009 From: mike at it-loops.com (Michael Guntsche) Date: Tue, 14 Jul 2009 07:46:30 +0200 Subject: [lxml-dev] =?utf-8?q?building_lxml_for_OS_X_10=2E4_on_PPC=3F?= In-Reply-To: <4A5B9F60.6070108@urheberrecht.org> References: <4A5B9F60.6070108@urheberrecht.org> Message-ID: <8d93cb3af9d1da196b2b8b0000336a8d@localhost> On Mon, 13 Jul 2009 22:56:00 +0200, Pascal Oberndoerfer wrote: > Hello, > > I am desperately trying to build lxml on a Mac PPC with OS X 10.4. It is > not the most current plattform, but I am stuck with it... > > I tried "STATIC_DEPS= true easy_install lxml", which works perfectly on > all Intel machines I have access to. But unfortunately there seems to be > a problem with libiconv on the PPC side. I built lxml several months ago on my Tiger G5 without any problems. What error do you get exactly? 
Kind regards, Michael From stefan_ml at behnel.de Tue Jul 14 21:16:55 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 14 Jul 2009 21:16:55 +0200 Subject: [lxml-dev] complex node In-Reply-To: <20cf13940907130646w1ceea04free9f0a938723e933@mail.gmail.com> References: <20cf13940907121009m5488fd55q5a04b60e32409161@mail.gmail.com> <4A5AD130.9080104@behnel.de> <20cf13940907130646w1ceea04free9f0a938723e933@mail.gmail.com> Message-ID: <4A5CD9A7.4010304@behnel.de> Hi, Jonas wrote: > 2009/7/13 Stefan Behnel wrote: >> Jonas wrote: >>> I'm trying to parsing a xml file (in python) to convert it to >>> dictionaries and then to JSON. >>> >>> This is the file: >>> http://unicode.org/cldr/data/common/main/es.xml >>> and I've got all nodes except dates', which is enought complex. >>> >>> Any idea to get it? because I'm very tired of try it. >> What's your question? >> >> - how to find the 'dates' element in a parsed tree? >> - how to traverse the subtree of the 'dates' element? >> - how to extract the data from the parsed tree? >> - how to extract the data using iterparse()? >> - how to map the data to a dictionary? >> - how to map the data to a dictionary while parsing? >> - how to map the data to a recursive structure of dictionaries? > > This is my main problem. But > checking i.e. if each element has attributes to manage it What about providing some more background and detail so that we can understand the problem you are trying to solve? So far, I have no idea about the target data structure that you want to construct, or what information in the XML document you consider important. 
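For illustration only, here is one possible reading of the "recursive structure of dictionaries" item in the list above. It is a sketch under assumptions, not a recommendation for the CLDR file: the helper name element_to_dict, the key names "attrib", "text" and "children", and the sample XML are all made up. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml.etree provides an ElementTree-compatible API for the calls used here.

```python
import json
import xml.etree.ElementTree as ET  # lxml.etree is API-compatible for these calls


def element_to_dict(elem):
    """Map one element to a dict (hypothetical layout: attrib/text/children)."""
    node = {}
    if elem.attrib:
        # Attributes are kept under their own key so they cannot clash
        # with child tag names.
        node["attrib"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    if text:
        node["text"] = text
    children = {}
    for child in elem:
        # Repeated tags are collected in a list, so nothing is overwritten.
        children.setdefault(child.tag, []).append(element_to_dict(child))
    if children:
        node["children"] = children
    return node


# Made-up sample, loosely shaped like a calendar section; not taken from es.xml.
SAMPLE = """
<dates>
  <calendar type="gregorian">
    <month type="1">enero</month>
    <month type="2">febrero</month>
  </calendar>
</dates>
"""

root = ET.fromstring(SAMPLE)
result = {root.tag: element_to_dict(root)}
print(json.dumps(result, indent=2, ensure_ascii=False))
```

Whether attributes, text and children should be kept apart like this depends entirely on the target structure, which is exactly the detail still missing from the question.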
This might be worth reading: http://www.catb.org/~esr/faqs/smart-questions.html Stefan From stefan_ml at behnel.de Tue Jul 14 21:31:04 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 14 Jul 2009 21:31:04 +0200 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> Message-ID: <4A5CDCF8.3000602@behnel.de> Hi, Emanuele D'Arrigo wrote: > I've been using lxml with great pleasure for a few months now and I'd like > to thank the main authors and the rest of the lxml community for the work > that has gone into it. From my perspective it's a well done API, with many > useful features and it's easy to use. Happy to hear that. :) > Let's say I have parsed an xml document, representing xhtml data, into an > ElementTree, and I have applied one or more xsl transformations to add style > information to it. > > Now, let's say I want to add a whole subtree with a top-element of class > "sub" to it. The subtree to be added might have been partially styled > already, i.e. because it comes from an xml file that has its own > xml-stylesheet processing instructions. However, some styling can occurr > only in the context of the whole tree, i.e. because the document-wide xsl > transformation file specifically establishes that elements of class "sub" > must have a yellow background. Furthermore, the same, xsl file asserts that > if the document contains an element of class "sub" the document's background > color must be purple rather than blue. > > My fundamental question in this context is: how do I avoid re-applying the > xsl transformation to the whole ElementTree and only apply the bits of the > transformation that are necessary, due to the change in the tree? Hmm, to me, this description contains a bit too many uncertainties. 
If you can control the input documents, I'd rather try to make the stylesheets distinct and apply them in a progressing order on a clean input, instead of trying to figure out which XSLTs you may have to run on a document. If you can't do that, you can try to make the XSLTs idempotent, so that they do not break already styled documents or subtrees. But you should definitely try to let different stylesheets do independent things. My preferred solution, however: if you can avoid putting any style information into the (X)HTML document (or at least the style bits that interfere in multiple XSL docs), and move it out into a CSS file, I think you can get yourself out of a lot of hassle. Many of the decisions that you would take in XSLT to put style information in the right places can be left to the browser's CSS engine. Stefan From lei at ipac.caltech.edu Tue Jul 14 22:11:25 2009 From: lei at ipac.caltech.edu (Mary Lei) Date: Tue, 14 Jul 2009 13:11:25 -0700 Subject: [lxml-dev] lxml cssselect Message-ID: <4A5CE66D.305@ipac.caltech.edu> Can I use cssselect to select a set of input tags with image for img in root.cssselect('input'): ... works fine but for img in root.cssselect('input type="image"'): ... sends lxml spinnning. -- Mary Lei Software Testing IPAC-NExScl Rm: KS-233 MS: 220-6 Phone: 395-1998 From ted at milo.com Tue Jul 14 22:16:02 2009 From: ted at milo.com (Ted Dziuba) Date: Tue, 14 Jul 2009 13:16:02 -0700 Subject: [lxml-dev] lxml cssselect In-Reply-To: <4A5CE66D.305@ipac.caltech.edu> References: <4A5CE66D.305@ipac.caltech.edu> Message-ID: <6451ccbf0907141316v1b7aa0cu82cbaf8c74ee7243@mail.gmail.com> I don't think that's a valid CSS selector. Try xpath: root.xpath("//input[@type = 'image']") ted On Tue, Jul 14, 2009 at 1:11 PM, Mary Lei wrote: > Can I use cssselect to select a set of input tags > with image > > type="image"> > > for img in root.cssselect('input'): > ... > > > works fine but > for img in root.cssselect('input type="image"'): > ... 
> sends lxml spinning.
>
> --
> Mary Lei
>
> Software Testing
> IPAC-NExScl
>
> Rm: KS-233
> MS: 220-6
> Phone: 395-1998
>
> _______________________________________________
> lxml-dev mailing list
> lxml-dev at codespeak.net
> http://codespeak.net/mailman/listinfo/lxml-dev
>

--
Ted Dziuba
Co-Founder and Engineer
Milo.com, Inc.
165 University Avenue
Palo Alto, CA, 94301
http://milo.com
Cell: (609)-665-2639
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090714/9b9e3c43/attachment.htm

From lostlogic at lostlogicx.com Wed Jul 15 02:20:20 2009
From: lostlogic at lostlogicx.com (Brandon Low)
Date: Tue, 14 Jul 2009 17:20:20 -0700
Subject: [lxml-dev] lxml cssselect
In-Reply-To: <4A5CE66D.305@ipac.caltech.edu>
References: <4A5CE66D.305@ipac.caltech.edu>
Message-ID: <20090715002020.GS1135@lostlogicx.com>

root.cssselect('input[type="image"]')

is the syntax you need

--Brandon

On 2009-07-14 (Tue) at 13:11:25 -0700, Mary Lei wrote:
> Can I use cssselect to select a set of input tags
> with image
>
> type="image">
>
> for img in root.cssselect('input'):
> ...
>
>
> works fine but
> for img in root.cssselect('input type="image"'):
> ...
> sends lxml spinning.
> > -- > Mary Lei > > Software Testing > IPAC-NExScl > > Rm: KS-233 > MS: 220-6 > Phone: 395-1998 > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From stefan_ml at behnel.de Wed Jul 15 08:36:56 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 15 Jul 2009 08:36:56 +0200 Subject: [lxml-dev] lxml cssselect In-Reply-To: <20090715002020.GS1135@lostlogicx.com> References: <4A5CE66D.305@ipac.caltech.edu> <20090715002020.GS1135@lostlogicx.com> Message-ID: <4A5D7908.7050900@behnel.de> Hi, Brandon Low wrote: > On 2009-07-14 (Tue) at 13:11:25 -0700, Mary Lei wrote: >> Can I use cssselect to select a set of input tags >> with image >> >> > type="image"> >> >> for img in root.cssselect('input'): >> ... >> >> works fine but >> for img in root.cssselect('input type="image"'): >> ... >> sends lxml spinnning. > > root.cssselect('input[type="image"]') > > is the syntax you need Right, here are the docs: http://www.w3.org/TR/CSS2/selector.html#matching-attrs Stefan From manu3d at gmail.com Thu Jul 16 01:08:51 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Thu, 16 Jul 2009 00:08:51 +0100 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <4A5CDCF8.3000602@behnel.de> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> Message-ID: <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> Stefan, thank you for your reply. 2009/7/14 Stefan Behnel > > My fundamental question in this context is: how do I avoid re-applying > the > > xsl transformation to the whole ElementTree and only apply the bits of > the > > transformation that are necessary, due to the change in the tree? > > Hmm, to me, this description contains a bit too many uncertainties. 
If you > can control the input documents, I'd rather try to make the stylesheets > distinct and apply them in a progressing order on a clean input, instead of > trying to figure out which XSLTs you may have to run on a document. Hmm... the intention IS to apply distinct stylesheets in progressing order on a clean input. But eventually that input must be used in the application and that's when the application might add pieces that require styling and might change the style of other elements that are already in the tree. I.e., suppose I'm starting with just a node and an xslt file that gives the attribute color="blue" to it and the attribute color="orange" to any of its children - when any exists. After the initial transformation I then have . Now, let's suppose the application then does the following: 1) changes the color attribute of from blue to red. 2) adds to an node that needs to be transformed according to the initial xslt file. If I now apply the initial xslt to the whole tree the child node will correctly acquire its orange color attribute, but at the same time the root node will go back to its initial blue color when it shouldn't be touched. -Ideally- there should be a way to apply only the transformations that are due to the addition of the child element. This might very well be impossible or impractical. I'm just asking if that's the case. If you can't do that, you can try to make the XSLTs idempotent, so that > they do not break already styled documents or subtrees. But you should > definitely try to let different stylesheets do independent things. Uhm... I think I understand what you mean. In the example above you are talking about splitting the stylesheet so that the initial stylesheet is applied and then, only when a child is added, a second stylesheet is applied bringing only transformations that are related to the addition of the child. Hmmm... I wonder if this can be made to work in all circumstances. I.e. 
in the example above if I apply the same orange-coloring stylesheet every time I add a child I might end up overwriting the color of children that in the meantime has been changed from the default orange. > My preferred solution, however: if you can avoid putting any style > information into the (X)HTML document (or at least the style bits that > interfere in multiple XSL docs), and move it out into a CSS file, I think > you can get yourself out of a lot of hassle. Many of the decisions that you > would take in XSLT to put style information in the right places can be left > to the browser's CSS engine. I wish I could leave that to something indeed rather than having to implement it myself. Unfortunately, although my original post referred to xhtml, in practice I'm not dealing with a browser nor with xhtml. Instead I'm attempting to write a GUI system based on Mozilla's XUL. I initially thought of using CSS but, I seem to understand, lxml only allows me to -select- elements in the tree via css. This is quite short from generating the cascade of css rules and transforming the tree accordingly. For this reason I'm currently oriented toward XSLT: once a stylesheet is ready it's quite easy to create a transform object out of it and apply it to an xml object. What do you think? Manu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/5eedb658/attachment.htm

From manu3d at gmail.com Thu Jul 16 01:15:38 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Thu, 16 Jul 2009 00:15:38 +0100
Subject: [lxml-dev] objectify and processing instructions
Message-ID: <915dc91d0907151615q76dc9363yb4135a7d36e5210d@mail.gmail.com>

Hi everybody,

having a piece of xml such as:

I can use:

xmlObject = objectify.fromstring()
piElement = xmlObject.getprevious()

but the returned object (unlike an xml object generated with
etree.fromstring()) is not a processing instruction but an
ObjectifiedElement, and doesn't seem to give me access to the href and type
attributes. I'm aware that they are actually a single pseudo-attribute, but
how do I access it?

Manu
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/599f8520/attachment.htm

From qhlonline at 163.com Thu Jul 16 07:06:04 2009
From: qhlonline at 163.com (qhlonline)
Date: Thu, 16 Jul 2009 13:06:04 +0800 (CST)
Subject: [lxml-dev] About the position of html parsing by HTML Target parser
Message-ID: <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com>

Hi, all

I am parsing html files with the lxml target parser. When I have reached
some HTML tag, how can I find out the position in the HTML document I am
parsing? Are there any callbacks in the target parser that can tell me the
total stream length I have parsed so far?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/9efbea4a/attachment.htm From stefan_ml at behnel.de Thu Jul 16 07:11:30 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 07:11:30 +0200 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> Message-ID: <4A5EB682.6030501@behnel.de> Hi, just a quick answer to the last bit (skipping over your mail somewhat). Emanuele D'Arrigo wrote: > 2009/7/14 Stefan Behnel wrote: >> My preferred solution, however: if you can avoid putting any style >> information into the (X)HTML document (or at least the style bits that >> interfere in multiple XSL docs), and move it out into a CSS file, I think >> you can get yourself out of a lot of hassle. Many of the decisions that you >> would take in XSLT to put style information in the right places can be left >> to the browser's CSS engine. > > I wish I could leave that to something indeed rather than having to > implement it myself. Unfortunately, although my original post referred to > xhtml, in practice I'm not dealing with a browser nor with xhtml. Instead > I'm attempting to write a GUI system based on Mozilla's XUL. How does that discourage CSS? > I initially thought of using CSS but, I seem to understand, lxml only allows > me to -select- elements in the tree via css. This is quite short from > generating the cascade of css rules and transforming the tree accordingly. I didn't say "generate" them - just write them down. There's also a package called "cssutils" which you might (or might not) find useful for your needs. http://cthedot.de/cssutils/ > For this reason I'm currently oriented toward XSLT: once a stylesheet is > ready it's quite easy to create a transform object out of it and apply it to > an xml object. 
It's common to use both: XSLT for structural tree transformations, and CSS for applying style information. That keeps the details of content structure and visual representation somewhat separated, which tends to be a rather big advantage. Stefan From stefan_ml at behnel.de Thu Jul 16 07:14:40 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 07:14:40 +0200 Subject: [lxml-dev] objectify and processing instructions In-Reply-To: <915dc91d0907151615q76dc9363yb4135a7d36e5210d@mail.gmail.com> References: <915dc91d0907151615q76dc9363yb4135a7d36e5210d@mail.gmail.com> Message-ID: <4A5EB740.50305@behnel.de> Hi, Emanuele D'Arrigo wrote: > having a piece of xml such as: > > > > > I can use: > > xmlObject = objectify.fromstring() > piElement = xmlObject.getprevious() > > but the returned object (differently from an xml object generated with > etree.fromstring()), is not a processing instruction but an > ObjectifiedElement, and doesn't seem to give me access to the href and type > attributes. I'm aware that they are actually a single pseudo-attribute, but > how do I access it? Hmmm, without looking into it myself, this sounds like a bug to me. It shouldn't return an ObjectifiedElement for Processing instructions, a normal etree PI element should work here (given that there are no children, for example). Stefan From dglick at gmail.com Thu Jul 16 08:28:00 2009 From: dglick at gmail.com (David Glick) Date: Thu, 16 Jul 2009 06:28:00 +0000 (UTC) Subject: [lxml-dev] binary egg problems with OSX 10.5 / official Python Mac Installer Message-ID: I just tried out the OSX binary egg of lxml 2.2.2 for the first time. 
When I try to import lxml.etree, I get the following traceback:

    from lxml import etree
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.5-i386.egg/lxml/etree.so, 2): Library not loaded: /opt/local/lib/libiconv.2.dylib
  Referenced from: /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/lxml-2.2.2-py2.6-macosx-10.5-i386.egg/lxml/etree.so
  Reason: Incompatible library version: etree.so requires version 8.0.0 or later, but libiconv.2.dylib provides version 7.0.0

I presume this is because I'm using Python 2.6 from the official Python Mac
Installer rather than from macports, so I don't have the right version of
libiconv in my macports directory where the binary egg is trying to link
against it. Does that sound like an accurate diagnosis? Does the binary egg
require installing Python 2.6 via macports?

I'm working around this by building the egg myself, but wanted to make sure
you're aware of it.

peace,
David Glick

From manu3d at gmail.com Thu Jul 16 10:45:45 2009
From: manu3d at gmail.com (Emanuele D'Arrigo)
Date: Thu, 16 Jul 2009 09:45:45 +0100
Subject: [lxml-dev] Targeted XSL transformations
In-Reply-To: <4A5EB682.6030501@behnel.de>
References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> <4A5EB682.6030501@behnel.de>
Message-ID: <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com>

2009/7/16 Stefan Behnel

> (skipping over your mail somewhat).

Oh, thank you for that! =D

> Emanuele D'Arrigo wrote:
> > I wish I could leave that to something indeed rather than having to
> > implement it myself. Unfortunately, although my original post referred to
> > xhtml, in practice I'm not dealing with a browser nor with xhtml. Instead
> > I'm attempting to write a GUI system based on Mozilla's XUL.
>
> How does that discourage CSS?
It isn't XUL or me working on a GUI system that does. But lxml does not have an etree.CSS() function which generates a CSS transformation, does it? However, it does have an etree.XSLT() function which generates an xslt transformation. Isn't it? I didn't say "generate" them - just write them down. There's also a package > called "cssutils" which you might (or might not) find useful for your > needs. > > http://cthedot.de/cssutils/ Thank you for this! I did have a look and it has all sorts of nice things. Still, it doesn't seem to have functionality ordering the CSS rules according to the cascade nor there is functionality to apply all those rules at once to an xml/xhtml object. Does it? It's common to use both: XSLT for structural tree transformations, and CSS > for applying style information. That keeps the details of content structure > and visual representation somewhat separated, which tends to be a rather > big advantage. Totally agree. But in my case the content structure is already separate from the visual representation. It is stored in a XUL file with no style information. Only when I apply the xsl transformations XUL content gains style (and locale, in a separate xsl transformation) information. I somehow suspect I wasn't particularly clear in my explanation of the problem. Was I? Manu -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/4156f745/attachment.htm From stefan_ml at behnel.de Thu Jul 16 11:41:04 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 11:41:04 +0200 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> <4A5EB682.6030501@behnel.de> <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com> Message-ID: <4A5EF5B0.3070507@behnel.de> Emanuele D'Arrigo wrote: > 2009/7/16 Stefan Behnel wrote: >> Emanuele D'Arrigo wrote: >>> I wish I could leave that to something indeed rather than having to >>> implement it myself. Unfortunately, although my original post referred to >>> xhtml, in practice I'm not dealing with a browser nor with xhtml. Instead >>> I'm attempting to write a GUI system based on Mozilla's XUL. >> >> How does that discourage CSS? > > It isn't XUL or me working on a GUI system that does. But lxml does not have > an etree.CSS() function which generates a CSS transformation, does it? > However, it does have an etree.XSLT() function which generates an xslt > transformation. Isn't it? That's what I meant: you don't need that. CSS doesn't do tree transformations, it only cares about layout and 'visual effects'. The idea is to build up a document that contains the complete content, and then associate a CSS file with it that defines where the different parts of the content appear and how they should look like (colour, fonts, etc.). You can use XSLT to aggregate the content into a document, but you'd write up a CSS file manually (or with a suitable CSS editor or whatever) to define the visual layout of the presentation. Do you have a need to do that programmatically? 
Stefan From manu3d at gmail.com Thu Jul 16 12:54:53 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Thu, 16 Jul 2009 11:54:53 +0100 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <4A5EF5B0.3070507@behnel.de> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> <4A5EB682.6030501@behnel.de> <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com> <4A5EF5B0.3070507@behnel.de> Message-ID: <915dc91d0907160354j7e145213i93bd33ff1e632d1@mail.gmail.com> 2009/7/16 Stefan Behnel > > It isn't XUL or me working on a GUI system that does. But lxml does not > have > > an etree.CSS() function which generates a CSS transformation, does it? > > However, it does have an etree.XSLT() function which generates an xslt > > transformation. Isn't it? > > That's what I meant: you don't need that. CSS doesn't do tree > transformations, it only cares about layout and 'visual effects'. The idea > is to build up a document that contains the complete content, and then > associate a CSS file with it that defines where the different parts of the > content appear and how they should look like (colour, fonts, etc.). Hmmm. I might be starting to grasp what you are suggesting. You are suggesting to load the xml/xul content separately from the css information and associate the two together only as needed, inside the application, as I build the actual objects that are described by the xul content and are eventually displayed to screen. This effectively means that the css doesn't act on the xml/xul content file nor the generated ElementTree, but instead it acts directly on the display objects that are created out of it. It acts on the objects that are eventually displayed rather than the xml objects generated by etree or objectify. Hmmm. 
That means I'd have to create my own css selectors that act on the display objects rather than the elements in an ElementTree object. And of course I'd have to call upon the right selectors as the relationships of the display objects changes - the original problem I am facing even with xslt providing style information. Right now I can't quite think how to do that. But it's certainly something worth investigating further. Thank you! > You can use XSLT to aggregate the content into a document, but you'd write > up a CSS file manually (or with a suitable CSS editor or whatever) to > define the visual layout of the presentation. Do you have a need to do that > programmatically? Programmatically what I need is to be able to add composite GUI elements to an interface. I.e. a composite GUI element might have text labels, text input fields and buttons. The description of the element in terms of its sub-components is in a XUL file but the style is dependent on 1) some default style information 2) any overriding style information that has been added or modified at runtime, i.e. by the user. When I add a composite GUI element to an interface (i.e. a new record in a table) it's easy enough to add it to the tree of objects in the GUI, but in terms of style I'd need to find out which style information applies to it given the nature of the added element and its location in the tree. In a sense, rather than selecting the element given a (css) rule, I'd need to select the applicable rules from the element and its position. Furthermore I'd need to find a way to apply any rule that the addition, removal or modification of an element has made applicable to other elements. I.e. the rule E F {color:red} does not apply until the element E has an element F as a child. When I add that child I must somehow retrieve that rule and apply it to E. Right now I don't quite know how to do that. But I realise I am digressing into non-lxml related issues. 
I apologize if I'm abusing of your and other's patience. Admittedly there are various parts of the problem statement that remain unclear even to me. Attempting to explain it to you and the list does help a little, hence my potentially unwelcome verbosity. Thank you again. Manu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/b0fd855e/attachment.htm From stefan_ml at behnel.de Thu Jul 16 13:31:39 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 13:31:39 +0200 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <915dc91d0907160354j7e145213i93bd33ff1e632d1@mail.gmail.com> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> <4A5EB682.6030501@behnel.de> <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com> <4A5EF5B0.3070507@behnel.de> <915dc91d0907160354j7e145213i93bd33ff1e632d1@mail.gmail.com> Message-ID: <4A5F0F9B.8090601@behnel.de> Emanuele D'Arrigo wrote: > Hmmm. I might be starting to grasp what you are suggesting. You are > suggesting to load the xml/xul content separately from the css information > and associate the two together only as needed, inside the application, as I > build the actual objects that are described by the xul content and are > eventually displayed to screen. This effectively means that the css doesn't > act on the xml/xul content file nor the generated ElementTree, but instead > it acts directly on the display objects that are created out of it. It acts > on the objects that are eventually displayed rather than the xml objects > generated by etree or objectify. Sort of, yes. > Hmmm. That means I'd have to create my own > css selectors that act on the display objects rather than the elements in an > ElementTree object. What? Why? I thought you used the Mozilla engine for displaying the XUL pages? 
They are perfectly capable of handling CSS. Actually, CSS is the only way to map style information onto a XUL document. I'm only suggesting to put it into a separate file rather than into the XUL document itself. Asking Google for "xul css" immediately brought up this, for example, although there may well be some better tutorials out there: https://developer.mozilla.org/en/introduction_to_xul > Programmatically what I need is to be able to add composite GUI elements to > an interface. I.e. a composite GUI element might have text labels, text > input fields and buttons. The description of the element in terms of its > sub-components is in a XUL file but the style is dependent on 1) some > default style information 2) any overriding style information that has been > added or modified at runtime, i.e. by the user. > > When I add a composite GUI element to an interface (i.e. a new record in a > table) it's easy enough to add it to the tree of objects in the GUI, but in > terms of style I'd need to find out which style information applies to it > given the nature of the added element and its location in the tree. In a > sense, rather than selecting the element given a (css) rule, I'd need to > select the applicable rules from the element and its position. And that needs to be decided at runtime? Aren't there any general rules like "when element X appears within the panel named Y, give it style Z"? What you can generate with XSLT (or using whatever type of tree manipulation) is the element X, which now appears in the panel Y and thus picks up the style that was predefined in the CSS file for exactly this case: "Y > X { ... }". I doubt that you really have requirements that you cannot express statically (i.e. in advance) using CSS selectors. If you really need to distinguish identical elements depending on some logic in the XSLT driven document generation process, just give them different IDs and use those in the CSS file. 
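The idea of predefined rules such as "Y > X { ... }" can be sketched roughly as follows. This is only an illustration under assumptions: the rule table, the element names and the helper compute_styles are hypothetical, and ElementTree path expressions stand in for real CSS selectors so that the sketch runs with the standard library alone (with lxml, lxml.cssselect.CSSSelector could be used for actual CSS selector syntax). The point it tries to show is that the style is looked up from static rules and never written into the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: (ElementTree path standing in for a CSS selector, style).
# Later rules override earlier ones, a crude stand-in for the CSS cascade.
RULES = [
    (".//button", {"color": "orange"}),                # like: button { ... }
    (".//panel[@id='Y']/button", {"color": "green"}),  # like: #Y > button { ... }
]


def compute_styles(root):
    """Return {element: style dict} without writing any style into the tree."""
    styles = {}
    for path, style in RULES:
        for elem in root.findall(path):
            styles.setdefault(elem, {}).update(style)
    return styles


root = ET.fromstring(
    "<window><panel id='Y'/><panel id='Q'><button id='b1'/></panel></window>"
)
before = compute_styles(root)

# The application adds a button to panel Y at runtime...
panel_y = root.find(".//panel[@id='Y']")
new_button = ET.SubElement(panel_y, "button", id="b2")

# ...and recomputes; the rule for panel Y now matches the new element.
after = compute_styles(root)
print(after[new_button])
```

Because nothing is stored on the elements themselves, recomputing after a change cannot overwrite values held in the document; a real engine would of course avoid recomputing everything on each change.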
> Furthermore I'd need to find a way to apply any rule that the addition, > removal or modification of an element has made applicable to other elements. > I.e. the rule E F {color:red} does not apply until the element E has an > element F as a child. When I add that child I must somehow retrieve that > rule and apply it to E. Right now I don't quite know how to do that. That's exactly what the CSS engine is there for. Keep moving, nothing to do here. > But I realise I am digressing into non-lxml related issues. I apologize if > I'm abusing of your and other's patience. Admittedly there are various parts > of the problem statement that remain unclear even to me. Attempting to > explain it to you and the list does help a little, hence my potentially > unwelcome verbosity. What I think you need to understand is the difference between the document status of an element (e.g. descendant of Z, child of X, @id="Y", @enabled="true") and the visual representation (visible, green, large, ...). The mapping between those two is what CSS is designed for. And the CSS engine that sits right inside your XUL layout engine will do that mapping for you whenever either the document or the style information changes. Stefan From manu3d at gmail.com Thu Jul 16 14:10:27 2009 From: manu3d at gmail.com (Emanuele D'Arrigo) Date: Thu, 16 Jul 2009 13:10:27 +0100 Subject: [lxml-dev] Targeted XSL transformations In-Reply-To: <4A5F0F9B.8090601@behnel.de> References: <915dc91d0907130435k7aad6e3aq2726b87ad59bd59e@mail.gmail.com> <4A5CDCF8.3000602@behnel.de> <915dc91d0907151608o28747c4di7cd58334888be195@mail.gmail.com> <4A5EB682.6030501@behnel.de> <915dc91d0907160145s43bb854dy893cc754412238b0@mail.gmail.com> <4A5EF5B0.3070507@behnel.de> <915dc91d0907160354j7e145213i93bd33ff1e632d1@mail.gmail.com> <4A5F0F9B.8090601@behnel.de> Message-ID: <915dc91d0907160510t22df6264m437be462ac332ea1@mail.gmail.com> 2009/7/16 Stefan Behnel > What? Why? 
I thought you used the Mozilla engine for displaying the XUL > pages? Ahem. No. I am using the XUL -format- as a description of interfaces. But I'm implementing my own layout engine in a completely different rendering framework (Panda3d ). Sorry if this wasn't clear from my previous messages. > When I add a composite GUI element to an interface (i.e. a new record in a > > table) it's easy enough to add it to the tree of objects in the GUI, but > in > > terms of style I'd need to find out which style information applies to it > > given the nature of the added element and its location in the tree. In a > > sense, rather than selecting the element given a (css) rule, I'd need to > > select the applicable rules from the element and its position. > > And that needs to be decided at runtime? Aren't there any general rules > like "when element X appears within the panel named Y, give it style Z"? Not only. Some rules are applicable to an element even though they are specific to that element. I.e. ".myClass" applies to any element of class "myClass". And that might not be that difficult to find. But there are some rules that specify that IF element Y has a child element X you should apply style Z (to Y, not X). To find those rules sounds a little harder. > I doubt that you really have requirements that you cannot express > statically (i.e. in advance) using CSS selectors. If you really need to > distinguish identical elements depending on some logic in the XSLT driven > document generation process, just give them different IDs and use those in > the CSS file. Yes, true, I might be attempting to cover all cases programmatically when some sensible use of css/xslt might simplify the implementation. What I think you need to understand is the difference between the document > status of an element (e.g. descendant of Z, child of X, @id="Y", > @enabled="true") and the visual representation (visible, green, large, > ...). The mapping between those two is what CSS is designed for. 
And the > CSS engine that sits right inside your XUL layout engine will do that > mapping for you whenever either the document or the style information > changes. Except, as I mentioned at the very beginning and is the crux of the discussion, I do not have a css engine nor a xul layout engine. I am making those more or less from scratch. Manu -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/bb5e5bbc/attachment-0001.htm From qhlonline at 163.com Thu Jul 16 15:31:10 2009 From: qhlonline at 163.com (qhlonline) Date: Thu, 16 Jul 2009 21:31:10 +0800 (CST) Subject: [lxml-dev] About lxml Target parser Message-ID: <27417447.901721247751070945.JavaMail.coremail@bj163app72.163.com> Hi all, I have read part of the source file Saxparser.pxi. The two classes named _SaxParserTarget and _SaxParserContext seem to be closely related to the target parser. A function _handleSaxData has been defined to deal with the HTML data part. But why is the data returned only the text between tags? I want to know how many characters of the HTML file have been parsed by the target parser when a "start_element" event arrives. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090716/41226d05/attachment.htm From dfedoruk at gmail.com Thu Jul 16 16:22:28 2009 From: dfedoruk at gmail.com (Dmitri Fedoruk) Date: Thu, 16 Jul 2009 18:22:28 +0400 Subject: [lxml-dev] bizarre crashes under FreeBSD In-Reply-To: <4A56C848.2030202@behnel.de> References: <4A56C848.2030202@behnel.de> Message-ID: Greetings, > Does anyone have experience with threading on FreeBSD? Any ideas where > these problems may arise from? Could anyone test the attached test script > on other FreeBSD systems? 
So far I can only reproduce this: FreeBSD 7.0-RELEASE i386 lxml.etree: (2, 2, -199, 59835) libxml used: (2, 6, 32) libxml compiled: (2, 6, 30) libxslt used: (1, 1, 22) libxslt compiled: (1, 1, 22) Coredump with the following backtrace: #0 0x2853eb1b in __pyx_f_4lxml_5etree__forwardError () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so [New Thread 0x28302b00 (LWP 100819)] [New Thread 0x28302a00 (LWP 100646)] [New Thread 0x28302900 (LWP 100572)] [New Thread 0x28302800 (LWP 100561)] [New Thread 0x28302700 (LWP 100524)] [New Thread 0x28302600 (LWP 100522)] [New Thread 0x28302500 (LWP 100500)] [New Thread 0x28302400 (LWP 100444)] [New Thread 0x28302300 (LWP 100335)] [New Thread 0x28302200 (LWP 100254)] [New Thread 0x28301100 (LWP 100182)] (gdb) bt #0 0x2853eb1b in __pyx_f_4lxml_5etree__forwardError () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so #1 0x28641699 in __xmlRaiseError () from /usr/local/lib/libxml2.so.5 #2 0x2867dc1b in htmlParseErr () from /usr/local/lib/libxml2.so.5 #3 0x2867fe3d in htmlParseEndTag () from /usr/local/lib/libxml2.so.5 #4 0x28684de0 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #5 0x286849a9 in htmlParseElement () from /usr/local/lib/libxml2.so.5 #6 0x28684e3e in htmlParseContent () from /usr/local/lib/libxml2.so.5 #7 0x286849a9 in htmlParseElement () from /usr/local/lib/libxml2.so.5 #8 0x28684e3e in htmlParseContent () from /usr/local/lib/libxml2.so.5 #9 0x286849a9 in htmlParseElement () from /usr/local/lib/libxml2.so.5 #10 0x28684e3e in htmlParseContent () from /usr/local/lib/libxml2.so.5 #11 0x286849a9 in htmlParseElement () from /usr/local/lib/libxml2.so.5 #12 0x28684e3e in htmlParseContent () from /usr/local/lib/libxml2.so.5 #13 0x28685299 in htmlParseDocument () from /usr/local/lib/libxml2.so.5 #14 0x28685460 in htmlDoRead () from /usr/local/lib/libxml2.so.5 #15 0x285ac36b in 
__pyx_f_4lxml_5etree_11_BaseParser__parseDoc () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so #16 0x285b128b in __pyx_f_4lxml_5etree__parseDoc () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so #17 0x285b24f2 in __pyx_f_4lxml_5etree__parseMemoryDocument () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so #18 0x285b28b3 in __pyx_pf_4lxml_5etree_fromstring () from /usr/local/lib/python2.5/site-packages/lxml-2.2alpha1-py2.5-freebsd-7.0-RELEASE-i386.egg/lxml/etree.so #19 0x08059cd7 in PyObject_Call () #20 0x080af811 in PyEval_EvalFrameEx () #21 0x080b21c9 in PyEval_EvalCodeEx () #22 0x080edd5c in PyClassMethod_New () #23 0x08059cd7 in PyObject_Call () #24 0x080af811 in PyEval_EvalFrameEx () #25 0x080b21c9 in PyEval_EvalCodeEx () #26 0x080b0cc0 in PyEval_EvalFrameEx () #27 0x080b1858 in PyEval_EvalFrameEx () #28 0x080b21c9 in PyEval_EvalCodeEx () #29 0x080eddd1 in PyClassMethod_New () #30 0x08059cd7 in PyObject_Call () #31 0x0805f111 in PyClass_IsSubclass () #32 0x08059cd7 in PyObject_Call () #33 0x080abddc in PyEval_CallObjectWithKeywords () #34 0x080d3f38 in initthread () #35 0x28174b1f in pthread_getprio () from /lib/libthr.so.3 Another machine, FreeBSD 7.0 amd64, lxml.etree: (2, 0, 7, 0) libxml used: (2, 6, 32) ml compiled: (2, 6, 32) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) : #0 0x00000000004394d5 in PyObject_GetAttr () [New Thread 0x705780 (LWP 100540)] [New Thread 0x705600 (LWP 100539)] [New Thread 0x705180 (LWP 100538)] [New Thread 0x705000 (LWP 100537)] [New Thread 0x704d00 (LWP 100536)] [New Thread 0x704e80 (LWP 100535)] [New Thread 0x705900 (LWP 100534)] [New Thread 0x705a80 (LWP 100533)] [New Thread 0x705480 (LWP 100532)] [New Thread 0x705300 (LWP 100193)] [New Thread 0x702180 (LWP 101248)] (gdb) bt #0 0x00000000004394d5 in PyObject_GetAttr () #1 0x0000000800ecd2a3 in 
__pyx_f_4lxml_5etree_13_BaseErrorLog () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #2 0x0000000800ed8b20 in __pyx_f_4lxml_5etree__forwardError () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #3 0x0000000801305910 in __xmlRaiseError () from /usr/local/lib/libxml2.so.5 #4 0x000000080133e3a6 in htmlParseErr () from /usr/local/lib/libxml2.so.5 #5 0x00000008013404b0 in htmlParseEndTag () from /usr/local/lib/libxml2.so.5 #6 0x00000008013452c2 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #7 0x0000000801344f8f in htmlParseElement () from /usr/local/lib/libxml2.so.5 #8 0x0000000801345320 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #9 0x0000000801344f8f in htmlParseElement () from /usr/local/lib/libxml2.so.5 #10 0x0000000801345320 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #11 0x0000000801344f8f in htmlParseElement () from /usr/local/lib/libxml2.so.5 #12 0x0000000801345320 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #13 0x0000000801344f8f in htmlParseElement () from /usr/local/lib/libxml2.so.5 #14 0x0000000801345320 in htmlParseContent () from /usr/local/lib/libxml2.so.5 #15 0x00000008013457bf in htmlParseDocument () from /usr/local/lib/libxml2.so.5 #16 0x000000080134596c in htmlDoRead () from /usr/local/lib/libxml2.so.5 #17 0x0000000800edeb44 in __pyx_f_4lxml_5etree_11_BaseParser__parseDoc () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #18 0x0000000800f22fba in __pyx_f_4lxml_5etree__parseDoc () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #19 0x0000000800f24181 in __pyx_f_4lxml_5etree__parseMemoryDocument () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #20 0x0000000800f28e75 in 
__pyx_pf_4lxml_5etree_HTML () from /usr/local/lib/python2.5/site-packages/lxml-2.0.7-py2.5-freebsd-7.0-20081001-SNAP-amd64.egg/lxml/etree.so #21 0x0000000000415173 in PyObject_Call () #22 0x000000000046e5f0 in PyEval_EvalFrameEx () #23 0x00000000004710ac in PyEval_EvalCodeEx () #24 0x00000000004af191 in PyClassMethod_New () #25 0x0000000000415173 in PyObject_Call () #26 0x000000000046e5f0 in PyEval_EvalFrameEx () #27 0x00000000004710ac in PyEval_EvalCodeEx () #28 0x000000000046fbe4 in PyEval_EvalFrameEx () #29 0x00000000004708e4 in PyEval_EvalFrameEx () #30 0x00000000004708e4 in PyEval_EvalFrameEx () #31 0x00000000004710ac in PyEval_EvalCodeEx () #32 0x00000000004af1f8 in PyClassMethod_New () #33 0x0000000000415173 in PyObject_Call () #34 0x000000000041adad in PyClass_IsSubclass () #35 0x0000000000415173 in PyObject_Call () #36 0x000000000046aa02 in PyEval_CallObjectWithKeywords () #37 0x0000000000494cad in initthread () #38 0x000000080093fa27 in pthread_getprio () from /lib/libthr.so.3 #39 0x00007fffffb5b000 in ?? () But no ideas about the reason, sorry Dmitri From stefan_ml at behnel.de Thu Jul 16 16:53:00 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 16:53:00 +0200 Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> References: <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> Message-ID: <4A5F3ECC.1060104@behnel.de> qhlonline wrote: > Hi, all I am parsing html files with lxml target parser, now I wan't to > know when I have reached some HTML tag, how can I know the position of > the HTML document I am parsing? These are two different requirements. Do you really need the line/character information here? Isn't the structural position enough? > Is there any callbacks in target parser > who can tell me the total stream length I have parsed? Not that I know of. Same as in ElementTree, I'd say. 
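A minimal sketch of what tracking the "structural position" could look like. It assumes the standard-library xml.etree.ElementTree target interface (start/end/data/close), which lxml's target parser follows; the class name PathTarget and the path notation are hypothetical and only serve as illustration.

```python
from xml.etree.ElementTree import XMLParser


class PathTarget:
    """Hypothetical parser target: records the path from the root
    to each element at the moment its start event arrives."""

    def __init__(self):
        self.steps = []        # path steps of the currently open elements
        self.counters = [{}]   # one sibling counter per open level
        self.paths = []        # one path per start event, in document order

    def start(self, tag, attrib):
        siblings = self.counters[-1]
        siblings[tag] = siblings.get(tag, 0) + 1
        # tag[n] means "the n-th child with this tag", which makes the path unique
        self.steps.append("%s[%d]" % (tag, siblings[tag]))
        self.counters.append({})
        self.paths.append("/" + "/".join(self.steps))

    def end(self, tag):
        self.steps.pop()
        self.counters.pop()

    def data(self, text):
        pass  # text content is not needed for the structural position

    def close(self):
        return self.paths


parser = XMLParser(target=PathTarget())
parser.feed("<html><body><p>one</p><p>two</p></body></html>")
paths = parser.close()
for path in paths:
    print(path)
```

The path identifies where an element sits in the tree, but it says nothing about byte or character offsets in the input stream.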
Stefan From stefan_ml at behnel.de Thu Jul 16 16:56:06 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 16 Jul 2009 16:56:06 +0200 Subject: [lxml-dev] About lxml Target parser In-Reply-To: <27417447.901721247751070945.JavaMail.coremail@bj163app72.163.com> References: <27417447.901721247751070945.JavaMail.coremail@bj163app72.163.com> Message-ID: <4A5F3F86.2060508@behnel.de> qhlonline wrote: > I have read part of the source file Saxparser.pxi, May be the two > classes named _SaxParserTarget and _SaxParserContext have close > relationship with Target Parser. Yes, they are (somewhat) related. > A function _handleSaxData has been > defined to deal with HTML data part. But why the data returned is only > the text part between tags? Because that's how the SAX interface is defined (and how XML defines text content, BTW). > I wan't to know how much characters of the > HTML file have been parsed with the target parser, when a > "start_element" event comes. I replied in your other thread about the same topic. Stefan From p.oberndoerfer at urheberrecht.org Thu Jul 16 19:59:09 2009 From: p.oberndoerfer at urheberrecht.org (Pascal Oberndoerfer) Date: Thu, 16 Jul 2009 19:59:09 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? In-Reply-To: <8d93cb3af9d1da196b2b8b0000336a8d@localhost> References: <4A5B9F60.6070108@urheberrecht.org> <8d93cb3af9d1da196b2b8b0000336a8d@localhost> Message-ID: <4A5F6A6D.1000509@urheberrecht.org> Michael Guntsche schrieb: > On Mon, 13 Jul 2009 22:56:00 +0200, Pascal Oberndoerfer > wrote: >> Hello, >> >> I am desperately trying to build lxml on a Mac PPC with OS X 10.4. It is >> not the most current plattform, but I am stuck with it... >> >> I tried "STATIC_DEPS= true easy_install lxml", which works perfectly on >> all Intel machines I have access to. But unfortunately there seems to be >> a problem with libiconv on the PPC side. > > I built lxml several months ago on my Tiger G5 without any problems. 
What > error do you get exactly? > > I get the following "last lines" (below) during compiling libxml2, where I suppose the important bit is: > /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: > warning /usr/lib/libiconv.dylib cputype (18, > architecture ppc) does not match cputype (7) > for specified -arch flag: i386 (file not loaded) This is a PPC G5, 2Ghz, 1GB, 10.4.11, XCode 2.5. Maybe I got an incompatible libiconv installed by a previous version of Apple's DevTools? I am really wildly guessing here... Thanks a lot. Pascal > (cd .libs && rm -f testdso.la && ln -s ../testdso.la testdso.la) > gcc -DHAVE_CONFIG_H -I. -I./include -I./include -D_REENTRANT -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat - > Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-alig > n -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -c xmllint.c > /bin/sh ./libtool --tag=CC --mode=link gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -W > return-type -Wswitch -Wcomment -Wtrigraphs -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Wa > ggregate-return -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10 > .4u.sdk -o xmllint xmllint.o ./libxml2.la -lpthread -lz -liconv -lm > gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2 -pedantic -W -Wformat -Wunused -Wimplicit -Wreturn-type -Wswitch -Wcomment -Wtrigraphs > -Wformat -Wchar-subscripts -Wuninitialized -Wparentheses -Wshadow -Wpointer-arith -Wcast-align -Wwrite-strings -Waggregate-return -Wstrict-prototypes -Wmiss > ing-prototypes -Wnested-externs -Winline -Wredundant-decls -arch ppc -arch i386 
-isysroot /Developer/SDKs/MacOSX10.4u.sdk -o xmllint xmllint.o ./.libs/libxm > l2.a -lpthread -lz /usr/lib/libiconv.dylib -lm > /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: for architecture i386 > /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: warning /usr/lib/libiconv.dylib cputype (18, architecture ppc) does not match cputype (7) for specified -arch f > lag: i386 (file not loaded) > /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: Undefined symbols: > _libiconv > _libiconv_close > _libiconv_open > collect2: ld returned 1 exit status > lipo: can't open input file: /var/tmp//cc5Ydgsq.out (No such file or directory) > make[2]: *** [xmllint] Error 1 > make[1]: *** [all-recursive] Error 1 > make: *** [all] Error 2 > Traceback (most recent call last): > File "/Library/Frameworks/Python.framework/Versions/Current/bin/easy_install", line 8, in > load_entry_point('setuptools==0.6c9', 'console_scripts', 'easy_install')() > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 1671, in main > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 1659, in with_ei_usage > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 1675, in > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/core.py", line 151, in setup > dist.run_commands() > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 974, in run_commands > self.run_command(cmd) > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/distutils/dist.py", line 994, in run_command > cmd_obj.run() > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 
211, in run > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 446, in easy_install > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 476, in install_item > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 655, in install_eggs > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 930, in build_and_install > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/command/easy_install.py", line 919, in run_setup > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/sandbox.py", line 27, in run_setup > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/sandbox.py", line 63, in run > File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/setuptools-0.6c9-py2.5.egg/setuptools/sandbox.py", line 29, in > File "setup.py", line 115, in > File "/tmp/easy_install-O_UuTG/lxml-2.2.2/setupinfo.py", line 50, in ext_modules > File "/tmp/easy_install-O_UuTG/lxml-2.2.2/buildlibxml.py", line 200, in build_libxml2xslt > File "/tmp/easy_install-O_UuTG/lxml-2.2.2/buildlibxml.py", line 158, in call_subprocess > Exception: Command "make" returned code 2 From sidnei at enfoldsystems.com Thu Jul 16 21:35:51 2009 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Thu, 16 Jul 2009 16:35:51 -0300 Subject: [lxml-dev] Compilation of lxml on Windows 64-bit In-Reply-To: <789d27b10810170810q50ccabd7v1c95312d6286b0de@mail.gmail.com> References: 
<789d27b10810020656j147558e9yb52af3bfafd2ca17@mail.gmail.com> <789d27b10810150943v1e156111r64dde24aeb53d03d@mail.gmail.com> <789d27b10810170810q50ccabd7v1c95312d6286b0de@mail.gmail.com> Message-ID: Hi Hanni, I finally managed to compile lxml for x64 myself. Since I now have a working environment capable of generating those, I will start uploading x64 binaries to PyPI. Would be awesome if you could give those lxml 2.2.1 binaries a try and let me know if they work for you. They are completely statically linked so no extra dependencies should be needed. http://drop.io/ouequws -- Sidnei From mike at it-loops.com Thu Jul 16 23:04:48 2009 From: mike at it-loops.com (Michael Guntsche) Date: Thu, 16 Jul 2009 23:04:48 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? In-Reply-To: <4A5F6A6D.1000509@urheberrecht.org> References: <4A5B9F60.6070108@urheberrecht.org> <8d93cb3af9d1da196b2b8b0000336a8d@localhost> <4A5F6A6D.1000509@urheberrecht.org> Message-ID: <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> On Jul 16, 2009, at 19:59, Pascal Oberndoerfer wrote: > I get the following "last lines" (below) during compiling libxml2, > where > I suppose the important bit is: > >> /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: >> warning /usr/lib/libiconv.dylib cputype (18, >> architecture ppc) does not match cputype (7) >> for specified -arch flag: i386 (file not loaded) > Ok, I was able to reproduce the problem. I also have a quick fix, but it's more a hack than anything else. First the problem. During the creation of the static libxml.a library, libtool also creates a libxml2.la file. libtool finds /usr/lib/libiconv.la, which defines /usr/lib/libiconv.dylib above. Now this file is not a fat binary and only has the PPC architecture. As a workaround, rename /usr/lib/libiconv.la or move it somewhere else, build lxml and move/rename the file back. This way the "correct" libiconv.dylib will be used. 
I think the most proper fix would be to build libiconv as part of the static-deps as well. Stefan, what do you think? Kind regards, Michael From p.oberndoerfer at urheberrecht.org Fri Jul 17 01:18:25 2009 From: p.oberndoerfer at urheberrecht.org (Pascal Oberndoerfer) Date: Fri, 17 Jul 2009 01:18:25 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? In-Reply-To: <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> References: <4A5B9F60.6070108@urheberrecht.org> <8d93cb3af9d1da196b2b8b0000336a8d@localhost> <4A5F6A6D.1000509@urheberrecht.org> <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> Message-ID: <4A5FB541.2070804@urheberrecht.org> Michael Guntsche wrote: > > On Jul 16, 2009, at 19:59, Pascal Oberndoerfer wrote: >> I get the following "last lines" (below) during compiling libxml2, where >> I suppose the important bit is: >> >>> /usr/libexec/gcc/i686-apple-darwin8/4.0.1/ld: >>> warning /usr/lib/libiconv.dylib cputype (18, >>> architecture ppc) does not match cputype (7) >>> for specified -arch flag: i386 (file not loaded) >> > > Ok, I was able to reproduce the problem. I also have a quick fix but > it's more a hack than anything else. > First the problem. During the creation of the static libxml.a library > libtool also creates an libxml2.la file. > libtool finds /usr/lib/libiconv.la which defines /usr/lib/libiconv.dylib > above. Now this file is no fat-binary and just has the PCC architecture. > As a workaround rename /usr/lib/libiconv.la or move it somewhere else, > build lxml and move/rename the file back. This way the "correct" > libiconv.dylib will be used. > > I think the most proper fix would be to build libiconv as part of the > static-deps as well. Stefan what do you think? > > Kind regards, > Michael > This allowed me to build on a G4 PPC and a G5 PPC Mac. Thanks! The very last step: "Building against libxml2/libxslt in the following directory: /private/tmp/easy_install-W57UyT/lxml-2.2.2/build/tmp/libxml2/lib" takes quite a long time. 
Is this expected behaviour? I will do some testing over the weekend. Please let me know if I can do some testing etc. Best regards & again: thanks! Pascal From qhlonline at 163.com Fri Jul 17 05:15:55 2009 From: qhlonline at 163.com (qhlonline) Date: Fri, 17 Jul 2009 11:15:55 +0800 (CST) Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <4A5F3ECC.1060104@behnel.de> References: <4A5F3ECC.1060104@behnel.de> <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> Message-ID: <15125988.136891247800555967.JavaMail.coremail@bj163app61.163.com> "Stefan Behnel" wrote: > >qhlonline wrote: >> Hi, all I am parsing html files with lxml target parser, now I wan't to >> know when I have reached some HTML tag, how can I know the position of >> the HTML document I am parsing? > >These are two different requirements. Do you really need the line/character >information here? Isn't the structural position enough? > I have to know the real parsing position when certain special tags are found by the target parser. Does 'structural position' mean information about the line and column, like that in a parsing error report? I don't think that helps in computing the parsed stream length. In the libxml2 source file SAX2.c there is a callback interface (charactersSAXFunc) for the character event: hdlr->characters = xmlSAX2Characters The event handler has a 'len' parameter which tells the currently parsed HTML stream length, and I noticed that in the lxml source file Saxparser.pxi there is a function definition: cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil: It works as the handler of the SAX characters event. How can I change the lxml source code of the target parser to add SAX characters event processing with the 'data_len' parameter? I don't mean the default 'data' method of the target parser, of course; it has no parameter like 'data_len', and its 'data' argument is only the text inside an element, not the whole parsed string. 
>> Is there any callbacks in target parser >> who can tell me the total stream length I have parsed? > >Not that I know of. Same as in ElementTree, I'd say. > >Stefan -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090717/1f213c46/attachment.htm From qhlonline at 163.com Fri Jul 17 05:34:16 2009 From: qhlonline at 163.com (qhlonline) Date: Fri, 17 Jul 2009 11:34:16 +0800 (CST) Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <4A5F3ECC.1060104@behnel.de> References: <4A5F3ECC.1060104@behnel.de> <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> Message-ID: <452989.142481247801656118.JavaMail.coremail@bj163app72.163.com> On 2009-07-16, "Stefan Behnel" wrote: > >qhlonline wrote: >> Hi, all I am parsing html files with lxml target parser, now I wan't to >> know when I have reached some HTML tag, how can I know the position of >> the HTML document I am parsing? > >These are two different requirements. Do you really need the line/character >information here? Isn't the structural position enough? > > >> Is there any callbacks in target parser >> who can tell me the total stream length I have parsed? > >Not that I know of. Same as in ElementTree, I'd say. > >Stefan If there is some way for me to get the parsing context, and if I can access this structure directly, maybe this problem can be solved. In libxml2 there is a definition of "struct _xmlParserCtxt". This structure has a member "long nbChars;", which is just the "number of xmlChar processed". -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090717/5c1329fc/attachment.htm From stefan_ml at behnel.de Fri Jul 17 08:53:38 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 17 Jul 2009 08:53:38 +0200 Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <15125988.136891247800555967.JavaMail.coremail@bj163app61.163.com> References: <4A5F3ECC.1060104@behnel.de> <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> <15125988.136891247800555967.JavaMail.coremail@bj163app61.163.com> Message-ID: <4A601FF2.8080604@behnel.de> qhlonline wrote: > I have to know the real parsing position when some special tags found by > target parser. Interesting requirement. I wonder who designs XML formats where you have to know the stream position to read them. Do you actually mean the bytes position or the character position? > Is the 'structural position ' means information about which > line and which column, like that in Parsing Error Report? No, with "structural position" I meant the position of the element within the tree structure, such as the unique path from the root element to the currently parsed element. > In libxml2 source file > SAX2.c there is an callback interface (charactersSAXFunc) for character event: > > hdlr->characters = xmlSAX2Characters > > The event handler has a 'len' parameter which tells current parsed HTML > stream length. and I noticed that lxml source Saxparser.pxi there is a > function defination: > > cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil: > > It works just as processer of the sax.character event. How can I change > the lxml source code of target parser to add sax.character event > processing to it with 'data_len' parameter? You don't have to. Just take the string that you receive in .data(), encode it as UTF-8, and take its len(). However, that doesn't help you with your problem, as it is not the information you are looking for. 
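A minimal sketch of the .data() approach described above, and of why it does not yield a stream position. It assumes the standard-library xml.etree.ElementTree target interface, which lxml's target parser follows; the class name TextLengthTarget and the sample document are hypothetical.

```python
from xml.etree.ElementTree import XMLParser


class TextLengthTarget:
    """Hypothetical parser target: sums the UTF-8 length of all text
    received through data() and notes the running total at each start event."""

    def __init__(self):
        self.text_bytes = 0
        self.seen = []  # (tag, text bytes received before this start event)

    def start(self, tag, attrib):
        self.seen.append((tag, self.text_bytes))

    def end(self, tag):
        pass

    def data(self, text):
        # data() may be called several times for one text node;
        # the running sum is the same either way.
        self.text_bytes += len(text.encode("utf-8"))

    def close(self):
        return self.seen


document = "<root><a>h\u00e9llo</a><b>x</b></root>"
parser = XMLParser(target=TextLengthTarget())
parser.feed(document)
seen = parser.close()

# Real byte offset of the <b> tag in the UTF-8 encoded input, for comparison.
real_offset = document.encode("utf-8").index(b"<b>")
print(seen, real_offset)
```

The running total only counts text content, so it falls behind the real offset by the size of all markup seen so far.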
Stefan From stefan_ml at behnel.de Fri Jul 17 09:00:44 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 17 Jul 2009 09:00:44 +0200 Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <452989.142481247801656118.JavaMail.coremail@bj163app72.163.com> References: <4A5F3ECC.1060104@behnel.de> <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> <452989.142481247801656118.JavaMail.coremail@bj163app72.163.com> Message-ID: <4A60219C.7050806@behnel.de> qhlonline wrote: > If there are some way for me to get the parsing context, and if I can > access this structure directly, may be this problem can get solved. In > libxml2 there is a defination of "struct _xmlParserCtxt". This structure > have a member "long nbChars; " , It is just the "number of xmlChar > processed" . You could subtype the XMLParser class in Cython. That's not trivial, since it's not exported at the C-API level. You'll have to redefine the class hierarchy in a separate lxml.etree.pxd file to do that. Note that you only need to access the _parser_context and (maybe) _push_parser_context. The other object type fields in the classes can be set to type "object" instead of their real type. But remember that the type isn't public. Future lxml versions may change it, which means that you will have to adapt your code. That said, I still do not understand why you need the character stream position for parsing. Could you elaborate on that? Stefan From stefan_ml at behnel.de Fri Jul 17 09:02:46 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 17 Jul 2009 09:02:46 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? 
In-Reply-To: <4A5FB541.2070804@urheberrecht.org> References: <4A5B9F60.6070108@urheberrecht.org> <8d93cb3af9d1da196b2b8b0000336a8d@localhost> <4A5F6A6D.1000509@urheberrecht.org> <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> <4A5FB541.2070804@urheberrecht.org> Message-ID: <4A602216.9020606@behnel.de> Pascal Oberndoerfer wrote: > The very last step: "Building against libxml2/libxslt in the following > directory: > /private/tmp/easy_install-W57UyT/lxml-2.2.2/build/tmp/libxml2/lib" takes > quite long. Is this expected behaviour? Yes. The generated C source files are pretty huge. Stefan From stefan_ml at behnel.de Fri Jul 17 09:04:46 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 17 Jul 2009 09:04:46 +0200 Subject: [lxml-dev] building lxml for OS X 10.4 on PPC? In-Reply-To: <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> References: <4A5B9F60.6070108@urheberrecht.org> <8d93cb3af9d1da196b2b8b0000336a8d@localhost> <4A5F6A6D.1000509@urheberrecht.org> <38E808CD-2715-41AE-BF4B-3618FA57F484@it-loops.com> Message-ID: <4A60228E.3090607@behnel.de> Michael Guntsche wrote: > I think the most proper fix would be to build libiconv as part of the > static-deps as well. Stefan what do you think? Yes, I agree that that should at least be an option. Want to give it a try? Stefan From qhlonline at 163.com Fri Jul 17 09:23:56 2009 From: qhlonline at 163.com (qhlonline) Date: Fri, 17 Jul 2009 15:23:56 +0800 (CST) Subject: [lxml-dev] About the position of html parsing by HTML Target parser In-Reply-To: <4A601FF2.8080604@behnel.de> References: <4A601FF2.8080604@behnel.de> <4A5F3ECC.1060104@behnel.de> <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com> <15125988.136891247800555967.JavaMail.coremail@bj163app61.163.com> Message-ID: <25627354.246611247815436083.JavaMail.coremail@bj163app69.163.com> 2009-07-17?"Stefan Behnel" ? > >qhlonline wrote: >> I have to know the real parsing position when some special tags found by >> target parser. 
>
>Interesting requirement. I wonder who designs XML formats where you
>know the stream position to read them. Do you actually mean the bytes
>position or the character position?

We are parsing some HTML files. I don't think an HTML parser has any
obligation to provide position information either, but our team leader
requires position information for target events. The position need not be
exact, but it must be a character position, e.g. how many bytes of the HTML
file have been parsed so far.

>
>
>> Is the 'structural position ' means information about which
>> line and which column, like that in Parsing Error Report?
>
>No, with "structural position" I meant the position of the element within
>the tree structure, such as the unique path from the root element to the
>currently parsed element.
>
>
>> In libxml2 source file
>> SAX2.c there is an callback interface (charactersSAXFunc) for character event:
>>
>> hdlr->characters = xmlSAX2Characters
>>
>> The event handler has a 'len' parameter which tells current parsed HTML
>> stream length. and I noticed that lxml source Saxparser.pxi there is a
>> function definition:
>>
>> cdef void _handleSaxData(void* ctxt, char* c_data, int data_len) with gil:
>>
>> It works just as processor of the sax.character event. How can I change
>> the lxml source code of target parser to add sax.character event
>> processing to it with 'data_len' parameter?
>
>You don't have to. Just take the string that you receive in .data(), encode
>it as UTF-8, and take its len(). However, that doesn't help you with your
>problem, as it is not the information you are looking for.
>

Yes, I have tested this against the libxml2 library directly. I defined my
function according to the argument list of libxml2's character event
handler. The result shows that its 'len' argument only gives the string
length of the text between closed tags, which doesn't help to solve my
problem. Worse, I found that its "ctxt" parameter, whose type is
"xmlParserCtxtPtr", is NULL.
That means I can't get the parser context now. The xmlParserCtxt structure
has a member showing its current parsing position. What I am trying to find
out now is whether an lxml target parser (or a self-defined libxml2 SAX
parser) generates a parser context of type "xmlParserCtxtPtr", and how I can
get it. I guess I may be on the wrong track. I don't know which I should
focus on (libxml2 or lxml) to solve my problem. Could you give me some
suggestions?

>Stefan
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090717/bbbc5562/attachment-0001.htm

From hanni.ali at gmail.com Fri Jul 17 09:36:47 2009
From: hanni.ali at gmail.com (Hanni Ali)
Date: Fri, 17 Jul 2009 08:36:47 +0100
Subject: [lxml-dev] Compilation of lxml on Windows 64-bit
In-Reply-To:
References: <789d27b10810020656j147558e9yb52af3bfafd2ca17@mail.gmail.com>
 <789d27b10810150943v1e156111r64dde24aeb53d03d@mail.gmail.com>
 <789d27b10810170810q50ccabd7v1c95312d6286b0de@mail.gmail.com>
Message-ID: <789d27b10907170036k69984df0oe9b767c603d90aea@mail.gmail.com>

Hi Sidnei,

Seems to work well for me here. I will install it across our dev
environment to give it some thorough testing time, but it looks good to me.

Cheers,
Hanni

2009/7/16 Sidnei da Silva

> Hi Hanni,
>
> I finally managed to compile lxml for x64 myself. Since now I have a
> working environment capable of generating those, I will start
> uploading x64 binaries to PyPI.
>
> Would be awesome if you could give those lxml 2.2.1 binaries a try and
> let me know if it works for you. They are completely statically linked
> so no extra dependencies should be needed.
>
> http://drop.io/ouequws
>
> -- Sidnei
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090717/e9feb31c/attachment.htm

From ndudfield at gmail.com Fri Jul 17 12:48:24 2009
From: ndudfield at gmail.com (Nicholas Dudfield)
Date: Fri, 17 Jul 2009 18:48:24 +0800
Subject: [lxml-dev] About the position of html parsing by HTML Target parser
Message-ID:

Wow, someone else with this requirement. I was meaning to post to the
list about this. I'm using lxml to implement an XPath / CSS selection
plugin for a Python-extensible editor. I'd like to have a mapping of
view buffer regions to XML nodes. The workaround I used to get the
exact character position was to use the feed interface, a character at
a time, and manually monitor the bytestream position. It's fairly slow
though. I'd like to implement this in Cython or use whatever
underlying facility there is to speed it up.

You can see some screencasts at this forum thread, which should make
it more obvious what I mean re: CSS / XPath document selections:

http://www.sublimetext.com/forum/viewtopic.php?f=5&t=547

Cheers.

From qhlonline at 163.com Fri Jul 17 14:52:18 2009
From: qhlonline at 163.com (qhlonline)
Date: Fri, 17 Jul 2009 20:52:18 +0800 (CST)
Subject: [lxml-dev] Fw:Re:Re: About the position of html parsing by HTML
 Target parser
Message-ID: <21282282.401601247835138011.JavaMail.coremail@bj163app59.163.com>

Re:Re: [lxml-dev] About the position of html parsing by HTML Target parser

On 2009-07-17, "Nicholas Dudfield" wrote:
>Wow, someone else with this requirement. I was meaning to post to the
>list about this. I'm using lxml to implement an XPath / CSS selection
>plugin for a Python-extensible editor. I'd like to have a mapping of
>view buffer regions to XML nodes. The workaround I used to get the
>exact character position was to use the feed interface, a character at
>a time, and manually monitor the bytestream position. It's fairly slow
>though. I'd like to implement this in Cython or use whatever
>underlying facility there is to speed it up.
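
The workaround quoted above (feeding the parser in very small pieces and
counting what has been fed) might be sketched roughly as follows. This is
only an illustration under stated assumptions, not code from the thread: it
uses the standard library's xml.etree.ElementTree.XMLParser, whose feed()
and target interface lxml's parsers are modelled on, and the names
PositionTarget and start_tag_positions are hypothetical. The recorded value
is a byte count, not a character count, and it is only an upper bound on
where a start tag was completed, since the underlying parser may buffer
input before reporting events.

```python
from xml.etree.ElementTree import XMLParser


class PositionTarget:
    """Hypothetical parser target: remembers how many bytes had been fed
    to the parser at the moment each start tag was reported."""

    def __init__(self):
        self.fed = 0          # bytes handed to the parser so far
        self.positions = []   # list of (tag, bytes fed when start() fired)

    def start(self, tag, attrib):
        self.positions.append((tag, self.fed))

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        # The parser's close() returns whatever the target's close() returns.
        return self.positions


def start_tag_positions(document, chunk_size=1):
    """Feed `document` (bytes) in chunks of `chunk_size` bytes and return
    the (tag, byte count) pairs collected by the target."""
    target = PositionTarget()
    parser = XMLParser(target=target)
    for offset in range(0, len(document), chunk_size):
        chunk = document[offset:offset + chunk_size]
        # Update the counter first so that events fired during feed()
        # see a count that includes the current chunk.
        target.fed = offset + len(chunk)
        parser.feed(chunk)
    return parser.close()


if __name__ == "__main__":
    for tag, fed in start_tag_positions(b"<root><a>text</a><b/></root>"):
        print(tag, fed)
```

A larger chunk_size makes this faster but coarser, which is the trade-off
behind the "fairly slow" remark in the quoted message.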
>

Thank you for your suggestion. I have another idea: I could accumulate the
total number of characters of the tags and text parsed so far each time I
encounter an element. That means keeping a counter and adding the
characters received by the startElement function and the data function of
the target parser. The result would not be accurate, though. The key
problem is that we also need high parsing speed, so we should get the
position value during the parsing process. libxml2's parser context does
provide the current parsing position, and I want to read it in lxml. So I
think I have to define a callback function in libxml2 to access the value
and then alter part of the lxml .pxi source to receive the value in the
target parser. I don't know whether this will work, but I am trying. Thank
you again for your suggestion!

>You can see some screen casts at this forum thread which should make
>it more obvious what I mean re: css / xpath document selections:
>http://www.sublimetext.com/forum/viewtopic.php?f=5&t=547
>
>Cheers.
>_______________________________________________
>lxml-dev mailing list
>lxml-dev at codespeak.net
>http://codespeak.net/mailman/listinfo/lxml-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090717/1c3f56dd/attachment.htm

From cz at gocept.com Sun Jul 19 10:25:34 2009
From: cz at gocept.com (Christian Zagrodnick)
Date: Sun, 19 Jul 2009 10:25:34 +0200
Subject: [lxml-dev] Weakref support?
Message-ID:

Hoi,

would it be possible to add weakref support to lxml trees (objectify in
particular)?

http://docs.python.org/extending/newtypes.html#weakref-support

Regards,
--
Christian Zagrodnick · cz at gocept.com
gocept gmbh & co. kg · forsterstraße 29 · 06112 halle (saale) · germany
http://gocept.com · tel +49 345 1229889 4 ·
fax +49 345 1229889 1
Zope and Plone consulting and development

From stefan_ml at behnel.de Sun Jul 19 15:25:52 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 19 Jul 2009 15:25:52 +0200
Subject: [lxml-dev] Weakref support?
In-Reply-To:
References:
Message-ID: <4A631EE0.3080200@behnel.de>

Christian Zagrodnick wrote:
> would it be possible to add weakref support to lxml trees (objectify in
> particular)?

Possible: yes. You can simply add a "__weakref__" attribute to the _Element
class (or just the ObjectifiedElement class) and Cython will do the rest.

http://docs.cython.org/docs/extension_types.html#making-extension-types-weak-referenceable

However, making the class weak-referenceable will add another bit to the
size of each Element instance, which you will notice when you do lots of
tree work in Python space. I'd prefer avoiding that for a class as vital as
_Element, that's why it isn't currently there.

I also don't know if it has any performance implications if a class is
weak-referenceable. Does anyone know if there is any impact on the garbage
collector, for example?

At least the memory size isn't that big an issue. When you look at the
class, you may notice that there already is some stuff in there that isn't
100% required (e.g. the cached tag name). Adding 4/8 bytes to it may not be
that bad.

Is there a specific use case you have in mind?

Stefan

From stefan_ml at behnel.de Sun Jul 19 20:31:56 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 19 Jul 2009 20:31:56 +0200
Subject: [lxml-dev] About the position of html parsing by HTML Target parser
In-Reply-To:
References:
Message-ID: <4A63669C.4050404@behnel.de>

Nicholas Dudfield wrote:
> Wow, someone else with this requirement. I was meaning to post to the
> list about this. I'm using lxml to implement a XPath / CSS selection
> plugin for a python extensible editor. I'd like to have a mapping of
> view buffer regions to xml nodes.

At least the line is available from the "sourceline" property of an element
- although only up to 65536:

http://bugzilla.gnome.org/show_bug.cgi?id=325533

If you are in a position to whitespace-clean and pretty-print the XML
document, that would give you a simple mapping from elements to document
positions that you can exploit at the application level. Even if you can't,
that would still give you a usable model to work with that you could match
with the original stream to find the 'real' positions.

> The workaround I used to get the
> exact character position was to use the feed interface, a character at
> a time and manually monitor bytestream position. It's fairly slow
> though. I'd like to implement this in Cython or use whatever
> underlying facility there is to speed it up.

Speeding up this approach is pretty much futile IMHO. The parser gains
speed from efficient I/O and memory management. Passing a byte at a time
totally counters that (and even then it's only a *byte* at a time, not a
*character* at a time).

What I could imagine to do instead is to traverse the element tree and to
do an incremental text search for each element tag (i.e. the regexp "

References: <4A63669C.4050404@behnel.de>
Message-ID: <9148036.360291248060120782.JavaMail.coremail@bj163app72.163.com>

On 2009-07-20, "Stefan Behnel" wrote:
>
>Nicholas Dudfield wrote:
>> Wow, someone else with this requirement. I was meaning to post to the
>> list about this. I'm using lxml to implement a XPath / CSS selection
>> plugin for a python extensible editor. I'd like to have a mapping of
>> view buffer regions to xml nodes.
>
>At least the line is available from the "sourceline" property of an element
>- although only up to 65536:
>
>http://bugzilla.gnome.org/show_bug.cgi?id=325533
>
>If you are in a position to whitespace-clean and pretty-print the XML
>document, that would give you a simple mapping from elements to document
>positions that you can exploit at the application level.
>Even if you can't,
>that would still give you a usable model to work with that you could match
>with the original stream to find the 'real' positions.
>
>
>> The workaround I used to get the
>> exact character position was to use the feed interface, a character at
>> a time and manually monitor bytestream position. It's fairly slow
>> though. I'd like to implement this in Cython or use whatever
>> underlying facility there is to speed it up.
>
>Speeding up this approach is pretty much futile IMHO. The parser gains
>speed from efficient I/O and memory management. Passing a byte at a time
>totally counters that (and even then it's only a *byte* at a time, not a
>*character* at a time).
>
>What I could imagine to do instead is to traverse the element tree and to
>do an incremental text search for each element tag (i.e. the regexp
>"allow you to work at the character level rather than the byte level
>(assuming that the editor works at that level, too). Searching for the
>above regexp is safe as "<" cannot occur anywhere in the XML data stream
>except for a tag start/end or comment/PI (ok, minus DTDs, but that's easy
>to catch using the line number of the root element).
>
>I will also check if there is a way to provide the position at the (target)
>parser level, but that needs to fit the current interface. And I currently
>do not have much time to dig into this.
>
>Stefan
>_______________________________________________
>lxml-dev mailing list
>lxml-dev at codespeak.net
>http://codespeak.net/mailman/listinfo/lxml-dev

Hi all,

I have tried altering the libxml2 source to add a callback that reports the
current position when an element is parsed. Now I want to get this position
in the lxml target parser. I don't know where and how to alter the lxml
source to add the target parser event definition and event handler
definition so that my new libxml2 event (callback) can be processed, and I
have never compiled Cython source before. Can anybody give me some
suggestions? Thank you!
Best Regards
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090720/193a6136/attachment-0001.htm

From qhlonline at 163.com Mon Jul 20 05:48:52 2009
From: qhlonline at 163.com (qhlonline)
Date: Mon, 20 Jul 2009 11:48:52 +0800 (CST)
Subject: [lxml-dev] About the position of html parsing by HTML Target parser
In-Reply-To: <4A60219C.7050806@behnel.de>
References: <4A60219C.7050806@behnel.de> <4A5F3ECC.1060104@behnel.de>
 <2514262.666501247720764592.JavaMail.coremail@bj163app16.163.com>
 <452989.142481247801656118.JavaMail.coremail@bj163app72.163.com>
Message-ID: <13861823.379461248061732358.JavaMail.coremail@bj163app72.163.com>

On 2009-07-17, "Stefan Behnel" wrote:
>
>qhlonline wrote:
>> If there are some way for me to get the parsing context, and if I can
>> access this structure directly, may be this problem can get solved. In
>> libxml2 there is a definition of "struct _xmlParserCtxt". This structure
>> have a member "long nbChars; " , It is just the "number of xmlChar
>> processed" .
>
>You could subtype the XMLParser class in Cython. That's not trivial, since
>it's not exported at the C-API level. You'll have to redefine the class
>hierarchy in a separate lxml.etree.pxd file to do that. Note that you only
>need to access the _parser_context and (maybe) _push_parser_context. The
>other object type fields in the classes can be set to type "object" instead
>of their real type.
>
>But remember that the type isn't public. Future lxml versions may change
>it, which means that you will have to adapt your code.
>

I have changed the libxml2 code to add a new callback that reports the
current position when an element is seen. I think this avoids direct access
to the parser context. But I am now thinking about how to change the target
parser so that it can access the newly defined callback at the Python
level. I don't even know where to find the target-related lxml source code.
Nor do I know whether my idea is feasible. Is the target parser inherited
from the TreeBuilder class? Can it be changed, and how? I urgently need
more detailed information about the lxml source.

>That said, I still do not understand why you need the character stream
>position for parsing. Could you elaborate on that?

Well, the position information is useful. Some outside source of the HTML
document is declared in a separate file, like