From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 1 11:40:48 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 01 Jun 2006 11:40:48 +0200
Subject: [lxml-dev] improved valgrind suppressions for Python?
In-Reply-To: <447D5F5B.305@infrae.com>
References: <447C4351.7050804@infrae.com> <447C4437.4020806@infrae.com> <447C4DB4.2030303@gkec.informatik.tu-darmstadt.de> <447C660F.6010505@infrae.com> <447C6D16.5000206@gkec.informatik.tu-darmstadt.de> <447D5F5B.305@infrae.com>
Message-ID: <447EB620.7000300@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
>>> Are you checking with valgrind, by the way?
> How are the suppressions working for you?

There are a lot of "uninitialised values" and "conditional jumps" before we get to etree.initetree. So I happily ignore those. When the test cases run (I run test.py -vv), I get a few more, but most of them do not make me too suspicious, as they seem to be triggered by Python code (might still be GC issues, though). Note that ElementTree actually triggers most of those.

There are a few things left I'll look at today.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 1 20:20:13 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 01 Jun 2006 20:20:13 +0200
Subject: [lxml-dev] lxml 1.0 is on cheeseshop!
Message-ID: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de>

Hallo everyone,

I have the honour to announce the availability of lxml 1.0 on cheeseshop.

While the list of features added since the beta version (1.0.beta) is rather small, this version contains a pretty large number of bug fixes found by various users and testers. Thank you all for your help!

Features added since 0.9.2:

* Element.getiterator() and the findall() methods support finding arbitrary elements from a namespace (pattern {namespace}*)
* Another speedup in tree iteration code
* General speedup of Python Element object creation and deallocation
* Writing C14N no longer serializes in memory (reduced memory footprint)
* PyErrorLog for error logging through the Python logging module
* element.getroottree() returns an ElementTree for the root node of the document that contains the element.
* ElementTree.getpath(element) returns a simple, absolute XPath expression to find the element in the tree structure
* Error logs have a last_error attribute for convenience
* Comment texts can be changed through the API
* Formatted output via pretty_print keyword to serialization functions
* XSLT can block access to file system and network via XSLTAccessControl
* ElementTree.write() no longer serializes in memory (reduced memory footprint)
* Speedup of Element.findall(tag) and Element.getiterator(tag)
* Support for writing the XML representation of Elements and ElementTrees to Python unicode strings via etree.tounicode()
* Support for writing XSLT results to Python unicode strings via unicode()
* Parsing a unicode string no longer copies the string (reduced memory footprint)
* Parsing file-like objects now reads chunks rather than the whole file (reduced memory footprint)
* Parsing StringIO objects from the start avoids copying the string (reduced memory footprint)
* Read-only 'docinfo' attribute in ElementTree class holds DOCTYPE information, original encoding and XML version as seen by the parser
* etree module can be compiled without libxslt by commenting out the line include "xslt.pxi" near the end of the etree.pyx source file
* Better error messages in parser exceptions
* Error reporting now also works in XSLT
* Support for custom document loaders (URI resolvers) in parsers and XSLT, resolvers are registered at parser level
* Implementation of exslt:regexp for XSLT based on the Python 're' module, enabled by default, can be switched off with 'regexp=False' keyword argument
* Support for exslt extensions (libexslt) and libxslt extra functions (node-set, document, write, output)
* Substantial speedup in XPath.evaluate()
* HTMLParser for parsing (broken) HTML
* XMLDTDID function parses XML into tuple (root node, ID dict) based on xml:id implementation of libxml2 (as opposed to ET compatible XMLID)

Bugs fixed since 0.9.2:

* Memory leak in Element.__setitem__
* Memory leak in Element.attrib.items() and Element.attrib.values()
* Memory leak in XPath extension functions
* Memory leak in unicode related setup code
* Element now raises ValueError on empty tag names
* Namespace fixing after moving elements between documents could fail if the source document was freed too early
* Setting namespace-less tag names on namespaced elements ('{ns}t' -> 't') didn't reset the namespace
* Unknown constants from newer libxml2 versions could raise exceptions in the error handlers
* lxml.etree compiles much faster
* On libxml2 <= 2.6.22, parsing strings with encoding declaration could fail in certain cases
* Document reference in ElementTree objects was not updated when the root element was moved to a different document
* Running absolute XPath expressions on an Element now evaluates against the root tree
* Evaluating absolute XPath expressions (/*) on an ElementTree could fail
* Crashes when calling XSLT, RelaxNG, etc. with uninitialized ElementTree objects
* Removed public function initThreadLogging(), replaced by more general initThread() which fixes a number of setup problems in threads
* Memory leak when using iconv encoders in tostring/write
* Deep copying Elements and ElementTrees maintains the document information
* Serialization functions raise LookupError for unknown encodings
* Memory deallocation crash resulting from deep copying elements
* Some ElementTree methods could crash if the root node was not initialized (neither file nor element passed to the constructor)
* Element/SubElement failed to set attribute namespaces from passed attrib dictionary
* tostring() now adds an XML declaration for non-ASCII encodings
* tostring() failed to serialize encodings that contain 0-bytes
* ElementTree.xpath() and XPathDocumentEvaluator were not using the ElementTree root node as reference point
* Calling document('') in XSLT failed to return the stylesheet

I feel a certain fascination when I look back on the relatively short time it took Martijn and me (and several other contributors) to implement the large set of features that this version has and to bring it to this level of maturity.

A big "Thank you!" to all code contributors, egg builders, bug finders, testers, users and everyone else who helped in bringing lxml towards 1.0!
Stefan From buro at petr.com Thu Jun 1 20:27:54 2006 From: buro at petr.com (Petr van Blokland) Date: Thu, 1 Jun 2006 20:27:54 +0200 Subject: [lxml-dev] namespaces Message-ID: Hi lxml-developers, I have another question about the way lxml handles namespaces in combination with Python based elements: Think of an example derived from the doc/namespace_extension.txt from lxml.etree import Namespace, ElementBase namespace = Namespace('http://hui.de/honk') class HonkElement(ElementBase): def __str__(self): return 'String of this element' tree = XML('') The question is: is it possible to make the tree build to xml by using lxml.etree.tostring(tree) where during the evaluation the __str__ (or another method) of the Python element is called. Otherwise the example as given in the doc will only work when the Python element is root of a tree as in: honk_element = XML('') Kind regards, Petr van Blokland ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From buro at petr.com Thu Jun 1 20:34:05 2006 From: buro at petr.com (Petr van Blokland) Date: Thu, 1 Jun 2006 20:34:05 +0200 Subject: [lxml-dev] lxml 1.0 is on cheeseshop! In-Reply-To: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> References: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> Message-ID: Congratulations, Stefan, Maarten and all the others, for crossing the bridge of 1.0 Thank you. Petr van Blokland On Jun 1, 2006, at 8:20 PM, Stefan Behnel wrote: > Hallo everyone, > > I have the honour to announce the availability of lxml 1.0 on > cheeseshop. > > While the list of features added since the beta version (1.0.beta) > is rather > small, this version contains a pretty large number of bug fixes > found by > various users and testers. Thank you all for your help! 
> I feel a certain fascination when I look back on the relatively > short time it > took Martijn and me (and several other contributors) to implement > the large > set of features that this version has and to bring it to this level > of maturity. > > A big "Thank you!" to all code contributors, egg builders, bug > finders, > testers, users and everyone else who helped in bringing lxml > towards 1.0! > > Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 1 22:21:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 01 Jun 2006 22:21:01 +0200 Subject: [lxml-dev] namespaces In-Reply-To: References: Message-ID: <447F4C2D.9010706@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > Think of an example derived from the doc/namespace_extension.txt > class HonkElement(ElementBase): > def __str__(self): > return 'String of this element' > > tree = XML(' >') > > The question is: is it possible to make the tree build to xml by using > > lxml.etree.tostring(tree) > > where during the evaluation the __str__ (or another method) of the > Python element is called. No. As I said before, elements are never 'executed' by lxml. Remember that lxml builds on top of libxml2, so things like serialisation, XSLT, XPath or validation run entirely in C. While there are ways to hook into XPath and XSLT for extension functions (and elements, btw), there is no way to hook into validation or serialisation. However, I still think it is worth looking at the current Namespace mechanism and seeing if there are other interesting things we can do with it. 
The function registration, for example, unified and simplified the API towards extension functions. XSLT elements are an obvious step forwards, although the API is far from clear to me. What you proposed in your last mail is viable, but only a special case. It misses the support for XSLT subtrees below the extension element, for example.

I also thought about an extension towards a more XIST-like API for creating trees, like

    el = html(
        head(title("title")),
        body(p("test paragraph"), "more text")
    )

but I found that that's not trivial either. One reason is that we can't easily call "title()", i.e. without arguments, as we use that internally for normal element construction.

So, there are a lot of interesting paths to follow here, but I haven't sorted out yet how to get this nicely integrated (and implemented). You seem to have some good ideas, so if you want to discuss them further, here's the place to do it. :)

Stefan

From apaku at gmx.de Thu Jun 1 23:51:15 2006
From: apaku at gmx.de (Andreas Pakulat)
Date: Thu, 1 Jun 2006 23:51:15 +0200
Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0
Message-ID: <20060601215115.GA14937@morpheus.apaku.dnsalias.org>

Hi,

as per Stefan's request I'm announcing here (and on the PyQt mailing list) the first release of a small tool I "just" (i.e. the last 3 days) wrote.

XPathEvaluator is a tool to test what results an XPath expression gives you when executed on a specific XML file. With the help of lxml it can also parse pretty broken (and of course correct) HTML files. It loads its data from URLs if you want and highlights all nodes that an XPath evaluation returns so you can easily identify them.

I couldn't use lxml for more than HTML parsing because there's unfortunately no easy way to find out the element to which an attribute result or text result belongs. I might actually use lxml for the initial XML parsing too, because it's way faster than the PyXML parser I currently have.
I hope somebody finds this useful, I might add new features in the future, however there's no priority at the moment for the development of XPathEvaluator. As always with open source software: Patches welcome. XPathEvaluator can be downloaded from: http://www.apaku.de/linux/xpathevaluator/index.php That page also mentions all required software. Hope somebody will find this useful. Andreas -- You will obey or molten silver will be poured into your ears. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060601/e44055eb/attachment.pgp From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 07:37:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 07:37:20 +0200 Subject: [lxml-dev] windows compiler error: string too big In-Reply-To: <1768108014.20060601160009@carcass.dhs.org> References: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> <1768108014.20060601160009@carcass.dhs.org> Message-ID: <447FCE90.6040906@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > I was trying to build the Windows eggs, when got the following: > > C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin\cl.exe /c /nologo /Ox /MD /W3 /GX /DNDEBUG -Iz:/xml/include - > IZ:\python24\include -IZ:\python24\PC /Tcsrc/lxml/etree.c /Fobuild\temp.win32-2.4\Release\src/lxml/etree.obj -w > cl : Command line warning D4025 : overriding '/W3' with '/w' > etree.c > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters 
truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > src\lxml\etree.c(964) : error C2026: string too big, trailing characters truncated > error: command '"C:\Program Files\Microsoft Visual Studio .NET 2003\Vc7\bin\cl.exe"' failed with exit status 2 > > And indeed, that line is quite large (~32kb)... what should I do ? Oh well, how is that for a bug. :) > That compiler is the same which compiled Python, so its the "official" one. Then it's the right one to use. > Could that string be broken apart ? No problem. Here's a patch that does that. Compiles nicely on my machine. Likely adds nothing but a microsecond to the setup time, but is a little harder to maintain as we have to remember to apply something alike when we update the error code constants. Just apply the patch to the official version and compile that. I'll see if I can find a better patch for the trunk. As we know, most people won't compile under windows themselves, so that's not really something we need corrected in the source distribution. Sorry for the hassle. 
Stefan

Index: src/lxml/xmlerror.pxi
===================================================================
--- src/lxml/xmlerror.pxi (Revision 28057)
+++ src/lxml/xmlerror.pxi (Arbeitskopie)
@@ -561,6 +561,8 @@
 XML_NS_ERR_QNAME = 202 : 202
 XML_NS_ERR_ATTRIBUTE_REDEFINED = 203 : 203
 XML_NS_ERR_EMPTY = 204 : 204
+""" + \
+"""
 XML_DTD_ATTRIBUTE_DEFAULT = 500
 XML_DTD_ATTRIBUTE_REDEFINED = 501 : 501
 XML_DTD_ATTRIBUTE_VALUE = 502 : 502
@@ -727,6 +729,8 @@
 XML_RNGP_VALUE_NO_CONTENT = 1120 : 1120
 XML_RNGP_XMLNS_NAME = 1121 : 1121
 XML_RNGP_XML_NS = 1122 : 1122
+""" + \
+"""
 XML_XPATH_EXPRESSION_OK = 1200
 XML_XPATH_NUMBER_ERROR = 1201 : 1201
 XML_XPATH_UNFINISHED_LITERAL_ERROR = 1202 : 1202
@@ -1017,6 +1021,8 @@
 XML_SCHEMAV_CVC_TYPE_2 = 1876 : 1876
 XML_SCHEMAV_CVC_IDC = 1877 : 1877
 XML_SCHEMAV_CVC_WILDCARD = 1878 : 1878
+""" + \
+"""
 XML_XPTR_UNKNOWN_SCHEME = 1900
 XML_XPTR_CHILDSEQ_START = 1901 : 1901
 XML_XPTR_EVAL_FAILED = 1902 : 1902

From howe at carcass.dhs.org Fri Jun 2 08:45:50 2006
From: howe at carcass.dhs.org (Steve Howe)
Date: Fri, 2 Jun 2006 03:45:50 -0300
Subject: [lxml-dev] windows compiler error: string too big
In-Reply-To: <447FCE90.6040906@gkec.informatik.tu-darmstadt.de>
References: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> <1768108014.20060601160009@carcass.dhs.org> <447FCE90.6040906@gkec.informatik.tu-darmstadt.de>
Message-ID: <814681539.20060602034550@carcass.dhs.org>

Hello Stefan,

Friday, June 2, 2006, 2:37:20 AM, you wrote:

> No problem. Here's a patch that does that. Compiles nicely on my machine.
> Likely adds nothing but a microsecond to the setup time, but is a little
> harder to maintain as we have to remember to apply something alike when we
> update the error code constants.
[...]

Thanks, it worked, but now we get the same thing on line 972:

etree.c
src\lxml\etree.c(972) : error C2026: string too big, trailing characters truncated

Every line above 32000 chars will not be compiled.
I could be fixing those myself, but I think it's better to have it right in the trunk... -- Best regards, Steve mailto:howe at carcass.dhs.org From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 09:00:36 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 09:00:36 +0200 Subject: [lxml-dev] windows compiler error: string too big In-Reply-To: <814681539.20060602034550@carcass.dhs.org> References: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> <1768108014.20060601160009@carcass.dhs.org> <447FCE90.6040906@gkec.informatik.tu-darmstadt.de> <814681539.20060602034550@carcass.dhs.org> Message-ID: <447FE214.9080304@gkec.informatik.tu-darmstadt.de> Hi Steve, Steve Howe wrote: > Friday, June 2, 2006, 2:37:20 AM, you wrote: >> No problem. Here's a patch that does that. Compiles nicely on my machine. >> Likely adds nothing but a microsecond to the setup time, but is a little >> harder to maintain as we have to remember to apply something alike when we >> update the error code constants. > [...] > > Thanks, it worked, but now we the same thing on line 972: > > etree.c > src\lxml\etree.c(972) : error C2026: string too big, trailing characters truncated > > Every line above 32000 chars will not be compiled. I could be fixing > those myself, but I think it's better to have it right in the trunk... That's the same place as before. Are you sure the maximum is 32k? Because the patch I sent you should have cut the longest line down to some 13k... Maybe setup.py didn't rebuild etree.c? That's not done automatically because we only have it depend on etree.pyx (in which there were no changes). I sent you a generated etree.c where the longest line is below 8k. But I'll look into the issue to fix it on the trunk... 
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 09:15:14 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 09:15:14 +0200
Subject: [lxml-dev] etree.c
In-Reply-To: <1642664414.20060602040344@carcass.dhs.org>
References: <447FE12D.9000504@gkec.informatik.tu-darmstadt.de> <1642664414.20060602040344@carcass.dhs.org>
Message-ID: <447FE582.4020502@gkec.informatik.tu-darmstadt.de>

Hi Steve,

Steve Howe wrote:
> Friday, June 2, 2006, 3:56:45 AM, you wrote:
>> here's a new etree.c that /should/ compile. All lines below 8k.
> I was wrong about the max line size. It seems max size is 2k (argh!)

2k for a string? That's nothing! How can this thing call itself a C compiler?

> The platform (MSVC6) docs for C2026:
>
> After adjacent strings are concatenated, a string cannot be longer
> than 2048 characters.
>
> However, this should work:
>
> "huge part 1" "huge part 1"
>
> The compiler only gripes about a single string literal > 2K.

Huh? But it says "after adjacent strings are concatenated"... Anyway, I can't do that from Pyrex code.

Ok, so that means I have to revert the change that introduced these long strings, which are essentially long lists of constants, one per line. I did that to reduce the time it takes to compile. If we use Python lists to store them or Python objects that have them as attributes, Pyrex will generate tons of setup code that builds the objects or lists and that bloats the code and raises the compile time to the sky. A quick loop over a split string is just the fastest and smallest thing I could imagine.

I'll see what I can come up with. I guess the best way is an auto-generator for these constants from the libxml2 source I currently cut&pasted them from. Maybe we should store them in C arrays and then have our own loop to generate Python strings from them.
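
The approach discussed above (one long string of constants, parsed by a quick loop, but kept below the compiler's 2048-character limit) can be sketched roughly as follows. This is only an illustration, not the patch or the generator used for lxml: the helper names split_constants and parse_constants, the "NAME=value" line format, the 2000-byte safety margin and the constant names are all assumptions made for the example.

```python
# Hypothetical sketch: split a newline-separated block of constants into
# chunks that each stay below a size limit, then parse them back in a loop.

MAX_CHUNK_BYTES = 2000  # assumed safety margin below the 2048-character limit


def split_constants(block, limit=MAX_CHUNK_BYTES):
    """Group the lines of `block` into chunks of at most `limit` bytes.

    Chunks are only cut at line boundaries, so no "NAME=value" entry is
    ever split in the middle.
    """
    chunks, current, size = [], [], 0
    for line in block.splitlines():
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if line_size > limit:
            raise ValueError("single entry exceeds the chunk limit: %r" % line)
        if current and size + line_size > limit:
            # The current chunk is full: close it and start a new one.
            chunks.append("\n".join(current) + "\n")
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("\n".join(current) + "\n")
    return chunks


def parse_constants(chunks):
    """Loop over the split strings and build a name -> number mapping."""
    constants = {}
    for chunk in chunks:
        for line in chunk.splitlines():
            name, _, value = line.partition("=")
            constants[name.strip()] = int(value)
    return constants


# Made-up constant names, only to have enough data to force several chunks.
block = "".join("XML_ERR_EXAMPLE_%04d=%d\n" % (i, i) for i in range(500))
chunks = split_constants(block)
constants = parse_constants(chunks)
```

Cutting only at line boundaries keeps the parsing loop unchanged: it simply runs once per chunk instead of once over a single huge string.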
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 12:50:59 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 12:50:59 +0200
Subject: [lxml-dev] windows compiler error: string too big
In-Reply-To: <447FE214.9080304@gkec.informatik.tu-darmstadt.de>
References: <447F2FDD.8050602@gkec.informatik.tu-darmstadt.de> <1768108014.20060601160009@carcass.dhs.org> <447FCE90.6040906@gkec.informatik.tu-darmstadt.de> <814681539.20060602034550@carcass.dhs.org> <447FE214.9080304@gkec.informatik.tu-darmstadt.de>
Message-ID: <44801813.2080405@gkec.informatik.tu-darmstadt.de>

Ok, one for the archives.

MSVC only supports strings of up to 2048 bytes, so we had to find a way to split these up. Steve wrote a little script that does that in etree.c. It makes lxml compile, but doesn't work well with distutils, so that's not really an option.

I wrote a new script "update-error-constants.py" that fixes the problem at the root. Since we have to update the constants from time to time anyway to support new libxml2 versions, this script parses the HTML documentation page of libxml2 (it obviously uses lxml for that :) and generates the declarations in xmlerror.pxd and xmlerror.pxi. The strings it puts into the .pxi are split at about 2000 bytes, so that fixes the MSVC problem.

To update the constants for a new version of libxml2, run the script as follows:

    cd /path/to/lxml
    python update-error-constants.py /path/to/libxml2-doc-dir

It will then pick up the file "html/libxml2-xmlerror.html" from that directory and parse it. A "svn ci" will do the rest, but please remember to run "make clean test" first!

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 13:11:42 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 13:11:42 +0200
Subject: [lxml-dev] Element.getnext() and Element.getprevious() ?
Message-ID: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> Hi all, when I wrote the constant updater script, I noticed that navigating through an ElementTree to find the preceding sibling of an element is not trivial. However, that's not an unusual thing to do in HTML, where you might want to find a specific heading in the body, for example, and then look through the paragraphs belonging to the heading. It's ok as long as you stick with ET and traverse the tree yourself to find the heading. However, if you find the heading with XPath, you're lost as you can't easily find out how the XML structure continues at the same level... I'm therefore tempted to add the (trivially implemented) methods getnext() and getprevious() to Element, in the style of getparent(), getchildren() and gettreeroot(), but I wanted to ask here first if there are any objections to this extension. I think, we already have opened up the ET API towards a document based structure, so these would actually match the other extensions rather nicely. Stefan From elephantum at cyberzoo.ru Fri Jun 2 13:31:32 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 02 Jun 2006 15:31:32 +0400 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> Message-ID: <1149247893.3007.16.camel@zoo.yandex.ru> On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote: > Hi all, > > when I wrote the constant updater script, I noticed that navigating through an > ElementTree to find the preceding sibling of an element is not trivial. > However, that's not an unusual thing to do in HTML, where you might want to > find a specific heading in the body, for example, and then look through the > paragraphs belonging to the heading. > > It's ok as long as you stick with ET and traverse the tree yourself to find > the heading. 
However, if you find the heading with XPath, you're lost as you > can't easily find out how the XML structure continues at the same level... > > I'm therefore tempted to add the (trivially implemented) methods getnext() and > getprevious() to Element, in the style of getparent(), getchildren() and > gettreeroot(), but I wanted to ask here first if there are any objections to > this extension. I think, we already have opened up the ET API towards a > document based structure, so these would actually match the other extensions > rather nicely. It's better to think more on naming. What are you talking about is called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there are more than parent, children, and siblings axes. I'd propose to create properties, which act like lists. So the following would be correct: >>> node.following_sibling[0] there could be exception for parent node, as there couldn't be more than one parent. From faassen at infrae.com Fri Jun 2 14:02:58 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 02 Jun 2006 14:02:58 +0200 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <1149247893.3007.16.camel@zoo.yandex.ru> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> Message-ID: <448028F2.2040106@infrae.com> Andrey Tatarinov wrote: > On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote: >> Hi all, >> >> when I wrote the constant updater script, I noticed that navigating through an >> ElementTree to find the preceding sibling of an element is not trivial. >> However, that's not an unusual thing to do in HTML, where you might want to >> find a specific heading in the body, for example, and then look through the >> paragraphs belonging to the heading. >> >> It's ok as long as you stick with ET and traverse the tree yourself to find >> the heading. 
However, if you find the heading with XPath, you're lost as you >> can't easily find out how the XML structure continues at the same level... >> >> I'm therefore tempted to add the (trivially implemented) methods getnext() and >> getprevious() to Element, in the style of getparent(), getchildren() and >> gettreeroot(), but I wanted to ask here first if there are any objections to >> this extension. I think, we already have opened up the ET API towards a >> document based structure, so these would actually match the other extensions >> rather nicely. > > It's better to think more on naming. What are you talking about is > called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there > are more than parent, children, and siblings axes. > > I'd propose to create properties, which act like lists. So the following > would be correct: > >>>> node.following_sibling[0] > > > there could be exception for parent node, as there couldn't be more than > one parent. I don't consider this to be easier to understand though. getnext() and getprevious() tend to be easier to grasp. I mean, I'm sure XPath axes are nice to use occasionally, and have some conceptual attraction, but if you want to use those, why not just use XPath? Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 14:17:31 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 14:17:31 +0200 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <1149247893.3007.16.camel@zoo.yandex.ru> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> Message-ID: <44802C5B.50509@gkec.informatik.tu-darmstadt.de> Hi Andrey, Andrey Tatarinov wrote: > On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote: >> when I wrote the constant updater script, I noticed that navigating through an >> ElementTree to find the preceding sibling of an element is not trivial. 
>> However, that's not an unusual thing to do in HTML, where you might want to >> find a specific heading in the body, for example, and then look through the >> paragraphs belonging to the heading. >> >> It's ok as long as you stick with ET and traverse the tree yourself to find >> the heading. However, if you find the heading with XPath, you're lost as you >> can't easily find out how the XML structure continues at the same level... >> >> I'm therefore tempted to add the (trivially implemented) methods getnext() and >> getprevious() to Element, in the style of getparent(), getchildren() and >> gettreeroot(), but I wanted to ask here first if there are any objections to >> this extension. I think, we already have opened up the ET API towards a >> document based structure, so these would actually match the other extensions >> rather nicely. > > It's better to think more on naming. What are you talking about is > called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there > are more than parent, children, and siblings axes. I know. I thought about that, too. But I didn't want to add "sibling" to make it longer without making it clearer. > I'd propose to create properties, which act like lists. So the following > would be correct: > >>>> node.following_sibling[0] > Ok, let's walk that through. Here are the other axes and their current API: * ancestor - subsequent calls to getparent() * child - element[i] or getchildren() * descendant - getiterator() * following - ? * following-sibling - ? * parent - getparent() * preceding - ? * preceding-sibling - ? So all that's currently missing is really the sibling stuff. However, your above proposal would also encourage an ancestor 'list'. Note also that the preceding axis is rather tricky (and rarely used IMHO), so it's rather unlikely it will make it into the API. The following axis, on the other hand, can be seen as a combination of getnext() and getiterator(), so that's covered by adding a getnext(). 
I'm a bit opposed to the list idea, as it is not very explicit. Just for performance, how would you distinguish between these two from the point of view of the property itself: >>> element.following_sibling >>> element.following_sibling[0] Should we always build a list of all siblings for both cases? Also, it doesn't match the getchildren() API call (which /is/ explicit). So, if we follow the axis naming exactly, all that is really missing is getfollowingsibling() and getprecedingsibling(). Now, those two are rather hard to read, but getfollowing() and getpreceding() are just wrong in terms of XPath. So, I still prefer getnext() and getprevious(). Stefan From elephantum at cyberzoo.ru Fri Jun 2 14:20:45 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 02 Jun 2006 16:20:45 +0400 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <448028F2.2040106@infrae.com> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <448028F2.2040106@infrae.com> Message-ID: <1149250846.3007.27.camel@zoo.yandex.ru> On Fri, 2006-06-02 at 14:02 +0200, Martijn Faassen wrote: > Andrey Tatarinov wrote: > > On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote: > >> Hi all, > >> > >> when I wrote the constant updater script, I noticed that navigating through an > >> ElementTree to find the preceding sibling of an element is not trivial. > >> However, that's not an unusual thing to do in HTML, where you might want to > >> find a specific heading in the body, for example, and then look through the > >> paragraphs belonging to the heading. > >> > >> It's ok as long as you stick with ET and traverse the tree yourself to find > >> the heading. However, if you find the heading with XPath, you're lost as you > >> can't easily find out how the XML structure continues at the same level... 
> >> > >> I'm therefore tempted to add the (trivially implemented) methods getnext() and > >> getprevious() to Element, in the style of getparent(), getchildren() and > >> gettreeroot(), but I wanted to ask here first if there are any objections to > >> this extension. I think, we already have opened up the ET API towards a > >> document based structure, so these would actually match the other extensions > >> rather nicely. > > > > It's better to think more on naming. What are you talking about is > > called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there > > are more than parent, children, and siblings axes. > > > > I'd propose to create properties, which act like lists. So the following > > would be correct: > > > >>>> node.following_sibling[0] > > > > > > there could be exception for parent node, as there couldn't be more than > > one parent. > > I don't consider this to be easier to understand though. getnext() and > getprevious() tend to be easier to grasp. I mean, I'm sure XPath axes > are nice to use occasionally, and have some conceptual attraction, but > if you want to use those, why not just use XPath? There is such a thing as 'consistency'. As we are working in domain of XML manipulation and there is already well-thought-of dictionary of terms and definitions, well-thought language, we should adopt it as much as possible. It's like Occam's razor. (I know really well, that term consistency is very important, cause at the moment I'm working on a huge system that lacks it. often it's really hard to understand what is meant by this or that word) Thus introducing new naming scheme (that is not thought through at all) is a bad thing. All that ElementTree/lxml is about is thin and intuitive wrapper of XML domain for python. I think that it should not be forgotten. 
From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 14:30:26 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 14:30:26 +0200
Subject: [lxml-dev] Element.getnext() and Element.getprevious() ?
In-Reply-To: <448028F2.2040106@infrae.com>
References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <448028F2.2040106@infrae.com>
Message-ID: <44802F62.8000900@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> Andrey Tatarinov wrote:
>> >>> node.following_sibling[0]
>
> I don't consider this to be easier to understand though. getnext() and
> getprevious() tend to be easier to grasp.

I think so, too. Another thing would be "itersiblings()" to match
iter(element), similar in naming to what the Python container classes (most
notably dict) do. I think that would also make a nice companion. Something like:

    def itersiblings(self, preceding=False):
        ...

to reach both directions.

Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 14:42:05 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 14:42:05 +0200
Subject: [lxml-dev] Element.getnext() and Element.getprevious() ?
In-Reply-To: <1149250846.3007.27.camel@zoo.yandex.ru>
References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <448028F2.2040106@infrae.com> <1149250846.3007.27.camel@zoo.yandex.ru>
Message-ID: <4480321D.8000407@gkec.informatik.tu-darmstadt.de>

Hi Andrey,

Andrey Tatarinov wrote:
> Thus introducing new naming scheme (that is not thought through at all)
> is a bad thing.

I personally find element.getnext() and element.getprevious() *very*
intuitive, given the already existing getparent(), getchildren() and
getroottree().
Maybe getiterator() is a bit less intuitive, as it doesn't tell you what it iterates over, but then again, when you know elements iterate over their own children, it becomes close-to-intuitive that getiterator() does the other thing, you know, iterate over the elements in the tree. It's the same with element.getnext(). I'd just go: "Can't be children as there's getchildren() for that. Can't be the parent, as it wouldn't make sense and there's getparent() for that. So I guess it's the siblings. And getprevious() matches it, obviously." That's what I mean with intuitive. Stefan From elephantum at cyberzoo.ru Fri Jun 2 14:53:31 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 02 Jun 2006 16:53:31 +0400 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <44802C5B.50509@gkec.informatik.tu-darmstadt.de> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <44802C5B.50509@gkec.informatik.tu-darmstadt.de> Message-ID: <1149252812.3007.46.camel@zoo.yandex.ru> On Fri, 2006-06-02 at 14:17 +0200, Stefan Behnel wrote: > Hi Andrey, > > Andrey Tatarinov wrote: > > On Fri, 2006-06-02 at 13:11 +0200, Stefan Behnel wrote: > >> when I wrote the constant updater script, I noticed that navigating through an > >> ElementTree to find the preceding sibling of an element is not trivial. > >> However, that's not an unusual thing to do in HTML, where you might want to > >> find a specific heading in the body, for example, and then look through the > >> paragraphs belonging to the heading. > >> > >> It's ok as long as you stick with ET and traverse the tree yourself to find > >> the heading. However, if you find the heading with XPath, you're lost as you > >> can't easily find out how the XML structure continues at the same level... 
> >> > >> I'm therefore tempted to add the (trivially implemented) methods getnext() and > >> getprevious() to Element, in the style of getparent(), getchildren() and > >> gettreeroot(), but I wanted to ask here first if there are any objections to > >> this extension. I think, we already have opened up the ET API towards a > >> document based structure, so these would actually match the other extensions > >> rather nicely. > > > > It's better to think more on naming. What are you talking about is > > called "axes" in XPath ( http://www.w3.org/TR/xpath#axes ), and there > > are more than parent, children, and siblings axes. > > I know. I thought about that, too. But I didn't want to add "sibling" to make > it longer without making it clearer. > > > > I'd propose to create properties, which act like lists. So the following > > would be correct: > > > >>>> node.following_sibling[0] > > > > Ok, let's walk that through. Here are the other axes and their current API: > > * ancestor - subsequent calls to getparent() > * child - element[i] or getchildren() > * descendant - getiterator() > * following - ? > * following-sibling - ? > * parent - getparent() > * preceding - ? > * preceding-sibling - ? sorry, but it's a mess > So all that's currently missing is really the sibling stuff. However, your > above proposal would also encourage an ancestor 'list'. Note also that the > preceding axis is rather tricky (and rarely used IMHO), so it's rather > unlikely it will make it into the API. The following axis, on the other hand, > can be seen as a combination of getnext() and getiterator(), so that's covered > by adding a getnext(). > > I'm a bit opposed to the list idea, as it is not very explicit. Just for > performance, how would you distinguish between these two from the point of > view of the property itself: > > >>> element.following_sibling > >>> element.following_sibling[0] > > Should we always build a list of all siblings for both cases? 
Also, it doesn't
> match the getchildren() API call (which /is/ explicit).

A list and a list-like object are different things, in case you care about
performance.

Explicit means using a well-known, interoperable interface as much as
possible (file-like objects are a great example). This doesn't mean using
the _exact_ class for a task, but using the _exact_ interface.

Of course it's a little bit less explicit than using a list .children, which
could be the only container and the means to access contained nodes, but
things are already not that way, so it doesn't count.

> So, if we follow the axis naming exactly, all that is really missing is
> getfollowingsibling() and getprecedingsibling(). Now, those two are rather
> hard to read, but getfollowing() and getpreceding() are just wrong in terms of
> XPath. So, I still prefer getnext() and getprevious().

I thought a little more about it; that wouldn't hurt much, I suppose.

at the moment lxml is a bloat of different approaches and inconsistent
api's, so adding just a little bit more of it is nothing.

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 15:07:40 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 15:07:40 +0200
Subject: [lxml-dev] Element.getnext() and Element.getprevious() ?
In-Reply-To: <1149252812.3007.46.camel@zoo.yandex.ru>
References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <44802C5B.50509@gkec.informatik.tu-darmstadt.de> <1149252812.3007.46.camel@zoo.yandex.ru>
Message-ID: <4480381C.2050302@gkec.informatik.tu-darmstadt.de>

Hi Andrey,

Andrey Tatarinov wrote:
> at the moment lxml is a bloat of different approaches and inconsistent
> api's, so adding just a little bit more of it is nothing.

Ah, finally, that's good news. Anything specific in your mind that you might
want to change regarding the current API?
Stefan :) From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 15:19:23 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 15:19:23 +0200 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <44802F62.8000900@gkec.informatik.tu-darmstadt.de> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <448028F2.2040106@infrae.com> <44802F62.8000900@gkec.informatik.tu-darmstadt.de> Message-ID: <44803ADB.1070105@gkec.informatik.tu-darmstadt.de> Stefan Behnel wrote: > Martijn Faassen wrote: >> Andrey Tatarinov wrote: >>> >>> node.following_sibling[0] >> I don't consider this to be easier to understand though. getnext() and >> getprevious() tend to be easier to grasp. > > I think so, too. Another thing would be "itersiblings()" to match > iter(element), similar in naming to what the Python container classes (most > notably dict) do. I think that would also make a nice companion. Something like: > > def itersiblings(self, preceding=False): > ... Hmm, now that I think about it, we'd then also want iterparents(), right? But then it's really the question if we use iterparents() or iterancestors(). There is only one parent, but many siblings, so iterparents() is not really the right idea... I'll leave it out for now, until someone has a good argument for either of the two. Stefan From elephantum at cyberzoo.ru Fri Jun 2 15:22:42 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Fri, 02 Jun 2006 17:22:42 +0400 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? 
In-Reply-To: <4480381C.2050302@gkec.informatik.tu-darmstadt.de> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <44802C5B.50509@gkec.informatik.tu-darmstadt.de> <1149252812.3007.46.camel@zoo.yandex.ru> <4480381C.2050302@gkec.informatik.tu-darmstadt.de> Message-ID: <1149254562.3007.53.camel@zoo.yandex.ru> On Fri, 2006-06-02 at 15:07 +0200, Stefan Behnel wrote: > Hi Andrey, > > Andrey Tatarinov wrote: > > at the moment lxml is a bloat of different approaches and inconsistent > > api's, so adding just a little bit more of it is nothing. > > Ah, finally, that's good news. Anything specific in your mind that you might > want to change regarding the current API? That's a question for more than just 10 minutes which I can afford at the moment. The obvious ones: - element's .xpath, .getiterator - xslt result's .__str__ I hope, I wouldn't forget to make deeper examination and will write it to ML sometime. From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 15:30:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 15:30:30 +0200 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <1149254562.3007.53.camel@zoo.yandex.ru> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <44802C5B.50509@gkec.informatik.tu-darmstadt.de> <1149252812.3007.46.camel@zoo.yandex.ru> <4480381C.2050302@gkec.informatik.tu-darmstadt.de> <1149254562.3007.53.camel@zoo.yandex.ru> Message-ID: <44803D76.6000100@gkec.informatik.tu-darmstadt.de> Hi Andrey, Andrey Tatarinov schrieb: > On Fri, 2006-06-02 at 15:07 +0200, Stefan Behnel wrote: >> Hi Andrey, >> >> Andrey Tatarinov wrote: >>> at the moment lxml is a bloat of different approaches and inconsistent >>> api's, so adding just a little bit more of it is nothing. >> Ah, finally, that's good news. 
Anything specific in your mind that you might >> want to change regarding the current API? > > That's a question for more than just 10 minutes which I can afford at > the moment. > > The obvious ones: > - element's .xpath, .getiterator Uhm, I guess you mean .xpath() and .findall() here, .getiterator() does something different. Sure, the path expressions accepted by both are different. > - xslt result's .__str__ How's that inconsistent? All this is saying is "I know how to become a string", which is right away true for XSLT results (but not for arbitrary trees, if that's what you're comparing to). I'm actually happy you didn't come up with anything important in your first shot. That makes me confident for your future criticism. Stefan From faassen at infrae.com Fri Jun 2 15:43:39 2006 From: faassen at infrae.com (Martijn Faassen) Date: Fri, 02 Jun 2006 15:43:39 +0200 Subject: [lxml-dev] Element.getnext() and Element.getprevious() ? In-Reply-To: <1149252812.3007.46.camel@zoo.yandex.ru> References: <44801CEE.5000606@gkec.informatik.tu-darmstadt.de> <1149247893.3007.16.camel@zoo.yandex.ru> <44802C5B.50509@gkec.informatik.tu-darmstadt.de> <1149252812.3007.46.camel@zoo.yandex.ru> Message-ID: <4480408B.9090603@infrae.com> Andrey Tatarinov wrote: [snip] >> Ok, let's walk that through. Here are the other axes and their current API: >> >> * ancestor - subsequent calls to getparent() >> * child - element[i] or getchildren() >> * descendant - getiterator() >> * following - ? >> * following-sibling - ? >> * parent - getparent() >> * preceding - ? >> * preceding-sibling - ? > > sorry, but it's a mess It's possible to express XPath axes in terms of simple operations like this. 
Here's an example of code (from Forest, an attempt at an XML database):

    self.selfAxis = lambda nodes: nodes
    self.childAxis = Concat(Map(doc.firstChild),
                            TransitiveClosure([doc.nextSibling]))
    self.parentAxis = Concat(TransitiveClosure([doc.nextSiblingInverse]),
                             Map(doc.firstChildInverse))
    self.descendantAxis = Concat(Map(doc.firstChild),
                                 TransitiveClosure([doc.firstChild,
                                                    doc.nextSibling]))
    self.ancestorAxis = Concat(
        TransitiveClosure([doc.firstChildInverse, doc.nextSiblingInverse]),
        Map(doc.firstChildInverse))
    self.descendantOrSelfAxis = AxisUnion(self.descendantAxis, self.selfAxis)
    self.ancestorOrSelfAxis = AxisUnion(self.ancestorAxis, self.selfAxis)
    self.followingAxis = Concat(
        Concat(Concat(self.ancestorOrSelfAxis, Map(doc.nextSibling)),
               TransitiveClosure([doc.nextSibling])),
        self.descendantOrSelfAxis)
    self.precedingAxis = Concat(
        Concat(Concat(self.ancestorOrSelfAxis, Map(doc.nextSiblingInverse)),
               TransitiveClosure([doc.nextSiblingInverse])),
        self.descendantOrSelfAxis)
    self.followingSiblingAxis = Concat(
        Map(doc.nextSibling), TransitiveClosure([doc.nextSibling]))
    self.precedingSiblingAxis = Concat(
        Map(doc.nextSiblingInverse),
        TransitiveClosure([doc.nextSiblingInverse]))
    self.attributeAxis = Concat(Map(doc.firstAttribute),
                                TransitiveClosure([doc.nextAttribute]))

https://infrae.com/viewvc/old/forest/trunk/src/forest/axes.py

And also an example of some higher order functional programming in Python. :)

As you can see, to define all the axes you only need firstChild (element[0]),
nextSibling (getnext()), firstChildInverse (getparent()) and
nextSiblingInverse (getprevious()), except for the attribute axis.

Of course, as far as I'm aware, XPath is defined to walk over a tree with
text nodes present so I'm not sure whether this is all relevant at all.

Anyway, the XPath database model is not the be-all and end-all of XML tree
navigation. DOM for instance defines things much like getparent(), getnext()
and so on.
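To make the "four primitive steps" idea concrete, here is a small sketch in the same spirit, written against the standard library's xml.etree.ElementTree rather than Forest or lxml. The names make_steps() and closure() are made up for this illustration, and closure() is assumed to include its starting node, which is only a guess at how TransitiveClosure behaves in the Forest code above.

```python
import xml.etree.ElementTree as ET


def make_steps(root):
    """Build the four primitive steps for a plain ElementTree tree.

    Each step maps an element to a single element, or to None.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def first_child(el):
        return el[0] if len(el) else None

    def next_sibling(el):
        parent = parent_of.get(el)
        if parent is None:
            return None
        siblings = list(parent)
        i = siblings.index(el) + 1
        return siblings[i] if i < len(siblings) else None

    def previous_sibling(el):
        parent = parent_of.get(el)
        if parent is None:
            return None
        siblings = list(parent)
        i = siblings.index(el)
        return siblings[i - 1] if i > 0 else None

    return first_child, next_sibling, previous_sibling, parent_of.get


def closure(step, start):
    """Yield start, step(start), step(step(start)), ... until a step gives None."""
    node = start
    while node is not None:
        yield node
        node = step(node)


root = ET.fromstring("<r><a/><b><b1/><b2/></b><c/></r>")
first_child, next_sibling, previous_sibling, parent = make_steps(root)
b = root.find("b")

# Each axis is one primitive step followed by the closure of another.
child_axis = [e.tag for e in closure(next_sibling, first_child(b))]
following_siblings = [e.tag for e in closure(next_sibling, next_sibling(b))]
preceding_siblings = [e.tag for e in closure(previous_sibling, previous_sibling(b))]
ancestors = [e.tag for e in closure(parent, parent(b))]
```

Note that the preceding-sibling list comes out in reverse document order, nearest sibling first, because it is produced by walking backwards.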
Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 21:41:02 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Fri, 02 Jun 2006 21:41:02 +0200
Subject: [lxml-dev] Defining name spaces in lxml/xslt
In-Reply-To: <3A7A4B1D-B4CA-4473-8A5F-114B18B8871C@petr.com>
References: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> <3A7A4B1D-B4CA-4473-8A5F-114B18B8871C@petr.com>
Message-ID: <4480944E.3040400@gkec.informatik.tu-darmstadt.de>

Hi Petr,

Petr van Blokland wrote:
> I could solve the problem by only using xpath in combination with
> xsl:value-of. It might work (although not as readable as tags). Trying it,
> though, I found that also < and > returned by an external xpath function
> are escaped. Any idea how to avoid that?

Well, unfortunately, that's intentional.

Maybe you could describe your use case in a little more detail so that I
understand what you're trying to do. Is it only about injecting a tree into
the XSL result? Or are you trying to make things configurable, i.e. do you
need parametrized XSLT elements?

Stefan

From buro at petr.com Fri Jun 2 22:36:36 2006
From: buro at petr.com (Petr van Blokland)
Date: Fri, 2 Jun 2006 22:36:36 +0200
Subject: [lxml-dev] Defining name spaces in lxml/xslt
In-Reply-To: <4480944E.3040400@gkec.informatik.tu-darmstadt.de>
References: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> <3A7A4B1D-B4CA-4473-8A5F-114B18B8871C@petr.com> <4480944E.3040400@gkec.informatik.tu-darmstadt.de>
Message-ID: <7AD44204-E314-49AC-8F31-C00E33DB6AD6@petr.com>

On Jun 2, 2006, at 9:41 PM, Stefan Behnel wrote:
> Hi Petr,
>
> Petr van Blokland wrote:
>> I could solve the problem by only using xpath in combination with
>> xsl:value-of. It might work (although not as readable as tags). Trying it,
>> though, I found that also < and > returned by an external xpath function
>> are escaped. Any idea how to avoid that?
> > Well, unfortunately, that's intentional. > > Maybe you could describe your use case in a little more detail so > that I > understand what you're trying to do. Is it only about injecting a > tree into > the XSL result? Or are you trying to make things configurable, i.e. > do you > need parametrized XSLT elements? > > Stefan Hi Stefan, I try to inject a tree (or even direct a source containing tags as return value from an external xpath function. I found a solution to it by using: where the br function is defined as: def xhtml_br(dummy): return '
'

and namespace:

    ns = FunctionNamespace('http://xml.petr.com/xpyth3/xpath/xhtml')
    ns.prefix = 'xhtml'
    ns['br'] = xhtml_br

so this works. It fits my earlier need for creating external elements.
Although a little less nice to read, this way it is possible to make the
evaluation of tostring() from an XSLT transformation call the pieces of
Python code during the process.

We use this kind of technique to insert the result of specific functions,
such as the results of SQL queries, etc., inside the XSL transformation.
Some years ago we wrote our own XSL parser in Python that was almost but not
quite following the standard. Now we are facing 3 disadvantages to that
approach:

1 people using the system cannot use standard XSL knowledge and
  documentation
2 it is much slower than doing the processing in C
3 it allowed us to drift away from the XSLT standard (which partly is an
  advantage, because it made features possible that are not available in
  the standard).

Now I am trying to get the system to run inside lxml because especially
1 and 2 look VERY promising. The part to solve is 3, where previous freedom
needs to fit inside the "limitations" of lxml, libxml2 and real XSLT.

But I am making progress.
Thanks for the help so far.
Petr ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Fri Jun 2 22:58:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Fri, 02 Jun 2006 22:58:30 +0200 Subject: [lxml-dev] Defining name spaces in lxml/xslt In-Reply-To: <7AD44204-E314-49AC-8F31-C00E33DB6AD6@petr.com> References: <4EDDA00D-C9E2-441C-9398-3B3EC77A7169@petr.com> <447B63E4.1060709@gkec.informatik.tu-darmstadt.de> <3A7A4B1D-B4CA-4473-8A5F-114B18B8871C@petr.com> <4480944E.3040400@gkec.informatik.tu-darmstadt.de> <7AD44204-E314-49AC-8F31-C00E33DB6AD6@petr.com> Message-ID: <4480A676.4070506@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > I try to inject a tree (or even direct a source containing tags as > return value from > an external xpath function. I found a solution to it by using: > > > > where the br function is defined as: > > def xhtml_br(dummy): > return '
' > > and namespace: > > ns = FunctionNamespace('http://xml.petr.com/xpyth3/xpath/xhtml') > ns.prefix = 'xhtml' > ns['br'] = xhtml_br Why don't you return an Element? You might have to use xsl:copy-of in that case, but it should do what you want. You should always try to stay within the world of XML with these things instead of writing out tags "by hand". Makes life easier. :) > so this works. It fits my earlier need for creating external > elements. Although > a little less nice to read, this way it is possible to make the > evaluation of tostring() > from an XSLT transformation call the pieces of Python code during the > process. It's not called in tostring(). It's called during the XSLT evaluation. You can traverse the result tree to see that what you generate above is really the string "
", not an element. lxml just doesn't know what you meant to do, so it can't help you. > We use this kind of technique to insert the result of specific > functions as the > result of SQL queries, etc. inside the XSL transformation. Some years > ago we wrote > our own XSL parser in Python that was almost but not quite following > the standard. > Now we are facing 3 disadvantages to that approach: > 1 people using the system cannot use standard XSL knowledge and > documentation > 2 it is much slower than doing the processing in C > 3 it allowed us to drift away from XSLT standard (which partly is an > advandage, because > it made features possible that are not available in the standard). > > Now I am trying to get the system run inside lxml because especially > 1 and 2 look > VERY promising. The part to solve is 3 where previous freedom needs > to fit inside > the "limitations" of lxml, libxml2 and real XSLT. There is not that much you can do from extension elements that you could not do from XPath extensions. They are mainly syntactic sugar. Nice sugar, but sugar. Stefan From apaku at gmx.de Sat Jun 3 04:17:13 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 04:17:13 +0200 Subject: [lxml-dev] "Extract" namespace from tag Message-ID: <20060603021713.GA7078@morpheus> Hi, I know I had some discussions about this already but in a different context, I think. I'd like to be able to extract the namespace and namespace prefix for an element and it seems I can't do that with the current API (unless I'm overlooking something). Is there a way to make this work? Of course I can just "parse" the tag name for the namespace, but this doesn't give me the namespace prefix used. Andreas -- You will forget that you ever knew me. 
From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 09:11:48 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 09:11:48 +0200 Subject: [lxml-dev] "Extract" namespace from tag In-Reply-To: <20060603021713.GA7078@morpheus> References: <20060603021713.GA7078@morpheus> Message-ID: <44813634.7030501@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > I'd like to be able to extract the namespace and namespace prefix for an > element and it seems I can't do that with the current API (unless I'm > overlooking something). Is there a way to make this work? > > Of course I can just "parse" the tag name for the namespace, but this > doesn't give me the namespace prefix used. The prefix is returned by the "prefix" property on an Element. However, there is currently no direct way to parse a tag name, although IMHO there should be. Maybe we should extend the QName class with "namespace" and "local_name" attributes to provide a way for parsing "{ns}tag" names. Stefan From apaku at gmx.de Sat Jun 3 11:49:26 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 11:49:26 +0200 Subject: [lxml-dev] "Extract" namespace from tag In-Reply-To: <44813634.7030501@gkec.informatik.tu-darmstadt.de> References: <20060603021713.GA7078@morpheus> <44813634.7030501@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603094926.GA5516@morpheus> On 03.06.06 09:11:48, Stefan Behnel wrote: > Hi Andreas, > > Andreas Pakulat wrote: > > I'd like to be able to extract the namespace and namespace prefix for an > > element and it seems I can't do that with the current API (unless I'm > > overlooking something). Is there a way to make this work? > > > > Of course I can just "parse" the tag name for the namespace, but this > > doesn't give me the namespace prefix used. > > The prefix is returned by the "prefix" property on an Element. However, there > is currently no direct way to parse a tag name, although IMHO there should be. 
Seems I overlooked that one. > Maybe we should extend the QName class with "namespace" and "local_name" > attributes to provide a way for parsing "{ns}tag" names. I think prefix should be in there too. Andreas -- You will triumph over your enemy. From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 12:09:54 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 12:09:54 +0200 Subject: [lxml-dev] "Extract" namespace from tag In-Reply-To: <20060603094926.GA5516@morpheus> References: <20060603021713.GA7078@morpheus> <44813634.7030501@gkec.informatik.tu-darmstadt.de> <20060603094926.GA5516@morpheus> Message-ID: <44815FF2.2050101@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 03.06.06 09:11:48, Stefan Behnel wrote: >> Maybe we should extend the QName class with "namespace" and "local_name" >> attributes to provide a way for parsing "{ns}tag" names. > > I think prefix should be in there too. Hmmm, it's easy to say that. However, why should a prefix be part of a qualified tag name? Stefan From apaku at gmx.de Sat Jun 3 12:34:37 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 12:34:37 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603103437.GA10248@morpheus> On 30.05.06 22:10:06, Stefan Behnel wrote: > Andreas Pakulat wrote: > > However if I want to highlight the tree node that the xpath > > matches I have a "problem" when the xpath matches attributes or text > > nodes. So the question is: Is there a way using lxml to find out to > > which element a certain non-element result of an xpath evaluation > > belongs? > > Not straight away. Both are returned as strings, so you loose the information > where it came from. 
> > You can try to run a second XPath expression to find the result text or > attribute value in the tree, but that's bound to fail if text data is not > unique (which is pretty likely for attributes). I tried a few things and to me it seems running a second XPath-Expression using the extra step /parent::node() gives me the element node. Now the question is: Can I assume that the last step either contains text() or attribute:: or @attrname? The only problem I see is that I need to traverse the text-childs of the elements returned when the XPath selects text nodes to know which strings belong to which elements. Are there other ways to get at text() or attribute nodes? Andreas -- Are you ever going to do the dishes? Or will you change your major to biology? From apaku at gmx.de Sat Jun 3 12:41:25 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 12:41:25 +0200 Subject: [lxml-dev] "Extract" namespace from tag In-Reply-To: <44815FF2.2050101@gkec.informatik.tu-darmstadt.de> References: <20060603021713.GA7078@morpheus> <44813634.7030501@gkec.informatik.tu-darmstadt.de> <20060603094926.GA5516@morpheus> <44815FF2.2050101@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603104125.GB10248@morpheus> On 03.06.06 12:09:54, Stefan Behnel wrote: > Andreas Pakulat wrote: > > On 03.06.06 09:11:48, Stefan Behnel wrote: > >> Maybe we should extend the QName class with "namespace" and "local_name" > >> attributes to provide a way for parsing "{ns}tag" names. > > > > I think prefix should be in there too. > > Hmmm, it's easy to say that. However, why should a prefix be part of a > qualified tag name? Well according to the XML Namespace standard it belongs to a QName: http://www.w3.org/TR/REC-xml-names/#ns-qualnames Is that enough? Andreas -- You recoil from the crude; you tend naturally toward the exquisite. 
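For reference, the manual "parsing" of a tag name mentioned earlier in this thread could look roughly like the sketch below. split_tag() is a hypothetical helper written for this illustration, not part of lxml or ElementTree. It recovers only the namespace URI and the local name, since the "{ns}tag" notation carries no prefix, which is the limitation raised above.

```python
def split_tag(tag):
    """Split an ElementTree-style tag into (namespace, local_name).

    '{ns}tag' gives ('ns', 'tag'); an unqualified 'tag' gives (None, 'tag').
    """
    if tag.startswith("{"):
        if "}" not in tag:
            raise ValueError("unbalanced '{' in tag name: %r" % tag)
        namespace, _, local_name = tag[1:].partition("}")
        return namespace, local_name
    return None, tag


qualified = split_tag("{http://www.w3.org/1999/xhtml}br")
unqualified = split_tag("br")
print(qualified, unqualified)
```

The prefix has to come from somewhere else, for example the "prefix" property on an Element mentioned in the earlier reply.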
From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 12:46:18 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 12:46:18 +0200 Subject: [lxml-dev] "Extract" namespace from tag In-Reply-To: <20060603104125.GB10248@morpheus> References: <20060603021713.GA7078@morpheus> <44813634.7030501@gkec.informatik.tu-darmstadt.de> <20060603094926.GA5516@morpheus> <44815FF2.2050101@gkec.informatik.tu-darmstadt.de> <20060603104125.GB10248@morpheus> Message-ID: <4481687A.10402@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 03.06.06 12:09:54, Stefan Behnel wrote: >> Andreas Pakulat wrote: >>> On 03.06.06 09:11:48, Stefan Behnel wrote: >>>> Maybe we should extend the QName class with "namespace" and "local_name" >>>> attributes to provide a way for parsing "{ns}tag" names. >>> I think prefix should be in there too. >> Hmmm, it's easy to say that. However, why should a prefix be part of a >> qualified tag name? > > Well according to the XML Namespace standard it belongs to a QName: > http://www.w3.org/TR/REC-xml-names/#ns-qualnames > > Is that enough? Sure, I mixed that up (didn't see the prefix between all those element trees). So, you can pass QNames into Element() and the like. That would be a rather handy way of specifying prefixes for elements without using the nsmap dictionary, I'd say... I'll have to think this through a bit more to see if there are other implications, but I'm close to liking this. Stefan From apaku at gmx.de Sat Jun 3 12:51:58 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 12:51:58 +0200 Subject: [lxml-dev] documentation improvement Message-ID: <20060603105158.GC10248@morpheus> Hi, I recently overlooked the prefix attribute for lxml Element's and I wonder if there are any plans to provide an api document that is similar to the one of ElementTree but includes all "extensions"? 
I already tried to use the docstrings from the source, but there are quite some things missing (I didn't get any _* classes on the first try). I know pretty much the whole documentation is already "there", but it's distributed over many text files. One needs to read all and remember everything, especially for the extensions because there's no way to easily see the extensions of a given class. I'm willing to help here and maybe for starters I could "collect" all current documentation in a howto-like latex file (using Python's howto document class). From this one can easily generate various formats (pdf, html, ps and ascii) using Pythons mkhowto tool. Andreas -- Courage is your greatest present need. From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 12:59:10 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 12:59:10 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <20060603103437.GA10248@morpheus> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> <20060603103437.GA10248@morpheus> Message-ID: <44816B7E.1080405@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 30.05.06 22:10:06, Stefan Behnel wrote: >> Andreas Pakulat wrote: >>> However if I want to highlight the tree node that the xpath >>> matches I have a "problem" when the xpath matches attributes or text >>> nodes. So the question is: Is there a way using lxml to find out to >>> which element a certain non-element result of an xpath evaluation >>> belongs? >> Not straight away. Both are returned as strings, so you loose the information >> where it came from. >> >> You can try to run a second XPath expression to find the result text or >> attribute value in the tree, but that's bound to fail if text data is not >> unique (which is pretty likely for attributes). 
> > I tried a few things and to me it seems running a second > XPath-Expression using the extra step /parent::node() gives me the > element node. Sure, good idea. > Now the question is: Can I assume that the last step either contains > text() or attribute:: or @attrname? You mean as the result of an XPath expression? Well, you may get back bool values or generated strings (can you?), in which case you can't expect to find out what node (or nodes) they came from. Also, AFAIR, you can merge multiple XPath expressions into one and that case may be hard to detect. The last part of an XPath expression is not always what returned the result... > The only problem I see is that I need to traverse the text-childs of the > elements returned when the XPath selects text nodes to know which > strings belong to which elements. Note that you can get back a wild combination of strings, nodes and numbers, so there is a bit of work to do anyway. > Are there other ways to get at text() or attribute nodes? What do you mean? Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 13:13:40 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 13:13:40 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <20060603105158.GC10248@morpheus> References: <20060603105158.GC10248@morpheus> Message-ID: <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > I recently overlooked the prefix attribute for lxml Element's and I > wonder if there are any plans to provide an api document that is similar > to the one of ElementTree but includes all "extensions"? No plans so far. We always contented ourselves with documenting the differences. Even more so now that ET is in Python's stdlib. > I already tried to use the docstrings from the source, but there are > quite some things missing (I didn't get any _* classes on the first > try). True, there are a lot of docstrings missing. 
Might be a good starting point if you want to improve things. :) Note, however, that the support for the MSVC compiler forces us to restrict their size to 2k, so we can't simply copy the documentation from api.txt or something... > I know pretty much the whole documentation is already "there", but it's > distributed over many text files. One needs to read all and remember > everything, especially for the extensions because there's no way to > easily see the extensions of a given class. Hmm, there are actually not so many different files. There is the external ET documentation, obviously, but the rest is mainly in api.txt and compatibility.txt. We might consider moving a bit of the compatibility.txt back into api.txt, though, now that the lxml API is comprehensive enough to merit some more explanation. I recently started doing that with a section on "trees and documents", maybe there's more to say here. http://codespeak.net/svn/lxml/trunk/doc/api.txt We should also consider getting some more structure into that long document to make it more readable. > I'm willing to help here and maybe for starters I could "collect" all > current documentation in a howto-like latex file (using Python's howto > document class). From this one can easily generate various formats (pdf, > html, ps and ascii) using Pythons mkhowto tool. The current documentation is already generated using the ReST tools, maybe you can try to get some improvements into that part? 
Thanks for looking into this, Stefan From apaku at gmx.de Sat Jun 3 14:39:21 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 14:39:21 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <44816B7E.1080405@gkec.informatik.tu-darmstadt.de> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> <20060603103437.GA10248@morpheus> <44816B7E.1080405@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603123920.GA13709@morpheus> On 03.06.06 12:59:10, Stefan Behnel wrote: > Andreas Pakulat wrote: > > Now the question is: Can I assume that the last step either contains > > text() or attribute:: or @attrname? > > You mean as the result of an XPath expression? I mean if the result of an XPath is a list of strings, can I assume that this was created by either a ::text() or an attribute:: expression. > Well, you may get back bool values or generated strings (can you?), in which > case you can't expect to find out what node (or nodes) they came from. Well, for bool values you cannot get at the "matched" nodes anyway and it doesn't make sense for that, I think. The same is true AFAIK for generated strings. The evaluator can only highlight tree nodes that are part of the result of an XPath expression, it will probably show the result in cases of generated strings or bools but it cannot display any tree node for there anyway. > Also, AFAIR, you can merge multiple XPath expressions into one and that case > may be hard to detect. The last part of an XPath expression is not always what > returned the result... Hmm, could you give me an example for something like that? I'm not that familiar with XPath... > > The only problem I see is that I need to traverse the text-childs of the > > elements returned when the XPath selects text nodes to know which > > strings belong to which elements. 
> > Note that you can get back a wild combination of strings, nodes and numbers, > so there is a bit of work to do anyway. Ah, I didn't see "|" until now. Well, that makes the whole thing a bit "harder", because I can't tell wether a given string is created from a text node or is the value of an attribute. I'm thinking, maybe I should just highlight the element for any text returned (regardless wether it is from an attribute or a text node) and not the try to find the proper attribute or text... After all you can easily see the attributes and text of the element. > > Are there other ways to get at text() or attribute nodes? > > What do you mean? I mean, is there another way to have the text node of an element as result of the xpath expression, other than having the text() function somewhere (probably at the end of the path)? The same for attributes, is there another way to get the values of attributes in the result, other than using attribuge::(*|) or using @? Andreas -- Don't hate yourself in the morning -- sleep till noon. From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 14:54:58 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 14:54:58 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <20060603123920.GA13709@morpheus> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> <20060603103437.GA10248@morpheus> <44816B7E.1080405@gkec.informatik.tu-darmstadt.de> <20060603123920.GA13709@morpheus> Message-ID: <448186A2.2070404@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 03.06.06 12:59:10, Stefan Behnel wrote: >> Andreas Pakulat wrote: >>> Now the question is: Can I assume that the last step either contains >>> text() or attribute:: or @attrname? >> You mean as the result of an XPath expression? 
> > I mean if the result of an XPath is a list of strings, can I assume that > this was created by either a ::text() or an attribute:: expression. What about "string(a)"? >> Also, AFAIR, you can merge multiple XPath expressions into one and that case >> may be hard to detect. The last part of an XPath expression is not always what >> returned the result... > > Hmm, could you give me an example for something like that? I'm not that > familiar with XPath...

>>> from lxml import etree
>>> el = etree.Element("root")
>>> etree.SubElement(el, "a")
<Element a at ...>
>>> etree.SubElement(el, "b")
<Element b at ...>
>>> el.xpath("a|b|c|d")
[<Element a at ...>, <Element b at ...>]
>>> el.xpath("string(a|b|c|d)")
''
>>> el.xpath("string(a|b|c|d)|string(a)")
''

There's all sorts of weird expressions you could come up with... >>> The only problem I see is that I need to traverse the text-childs of the >>> elements returned when the XPath selects text nodes to know which >>> strings belong to which elements. >> Note that you can get back a wild combination of strings, nodes and numbers, >> so there is a bit of work to do anyway. > > Ah, I didn't see "|" until now. Well, that makes the whole thing a bit > "harder", because I can't tell wether a given string is created from a > text node or is the value of an attribute. It already makes it harder to find its parent element. You may still end up having to parse the expression to find partial '|' expressions etc. >>> Are there other ways to get at text() or attribute nodes? >> What do you mean? > > I mean, is there another way to have the text node of an element as > result of the xpath expression, other than having the text() function > somewhere (probably at the end of the path)? The same for attributes, is > there another way to get the values of attributes in the result, other > than using attribuge::(*|) or using @? Functions are a good way.
Stefan From apaku at gmx.de Sat Jun 3 17:58:10 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 17:58:10 +0200 Subject: [lxml-dev] element of an xpath evaluation In-Reply-To: <448186A2.2070404@gkec.informatik.tu-darmstadt.de> References: <20060530194624.GA28762@morpheus.apaku.dnsalias.org> <447CA69E.6060903@gkec.informatik.tu-darmstadt.de> <20060603103437.GA10248@morpheus> <44816B7E.1080405@gkec.informatik.tu-darmstadt.de> <20060603123920.GA13709@morpheus> <448186A2.2070404@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603155810.GB17910@morpheus> On 03.06.06 14:54:58, Stefan Behnel wrote: > There's all sorts of weird expressions you could come up with... Thanks for that input. I guess for now I just skip attributes and text nodes and only highlight element nodes. Maybe I'll look into this at a later time again. > >>> The only problem I see is that I need to traverse the text-childs of the > >>> elements returned when the XPath selects text nodes to know which > >>> strings belong to which elements. > >> Note that you can get back a wild combination of strings, nodes and numbers, > >> so there is a bit of work to do anyway. > > > > Ah, I didn't see "|" until now. Well, that makes the whole thing a bit > > "harder", because I can't tell wether a given string is created from a > > text node or is the value of an attribute. > > It already makes it harder to find its parent element. You may still end up > having to parse the expression to find partial '|' expressions etc. I might end up writing my own XPath Parser, which is clearly out of my reach atm. Again thanks for your help on this. After all, I myself only need XPath's that return element nodesets for my purposes... Andreas -- You never hesitate to tackle the most difficult problems. 
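The two ideas from this exchange (results come back as a mix of Python types, and a second query with an extra parent::node() step recovers the owning element) can be sketched roughly as follows. This assumes a current lxml install; the sample document and the describe() helper are made up for illustration and are not part of lxml.

```python
from lxml import etree

# Hypothetical sample document, only for illustration.
root = etree.fromstring('<root><a href="x">hello</a><b>world</b></root>')


def describe(result):
    """Label an XPath result by the Python type lxml hands back."""
    if isinstance(result, bool):    # test bool first, before numbers
        return "boolean"
    if isinstance(result, float):
        return "number"
    if isinstance(result, str):
        return "string"
    if isinstance(result, list):    # elements and/or text/attribute strings
        return "node-set"
    return "other"


kinds = {
    expr: describe(root.xpath(expr))
    for expr in ("a|b", "a/text()", "string(a)", "count(*)", "boolean(c)")
}

# Second query with an extra parent step, as suggested in the thread.
text_owners = [e.tag for e in root.xpath("a/text()/parent::node()")]
attr_owners = [e.tag for e in root.xpath("a/@href/parent::node()")]

# A union can mix elements and strings; keep only what can be highlighted.
elements_only = [r.tag for r in root.xpath("a|b|a/text()") if etree.iselement(r)]
```

Generated strings, numbers and booleans have no owning node, so only the "node-set" case can be mapped back to the tree at all.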
From apaku at gmx.de Sat Jun 3 18:14:17 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 18:14:17 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603161417.GD17910@morpheus> On 03.06.06 13:13:40, Stefan Behnel wrote: > Andreas Pakulat wrote: > > I recently overlooked the prefix attribute for lxml Element's and I > > wonder if there are any plans to provide an api document that is similar > > to the one of ElementTree but includes all "extensions"? > > No plans so far. We always contented ourselves with documenting the > differences. Even more so now that ET is in Python's stdlib. I think for somebody not familiar with ET and lxml it's easier if he has to "scan" only one document for the API and not look at one and than also read another (which is not in a particular order, or so it seems) for what else is available with lxml. Of course, maybe that's just me not being able to do this ;-) > > I already tried to use the docstrings from the source, but there are > > quite some things missing (I didn't get any _* classes on the first > > try). > > True, there are a lot of docstrings missing. Might be a good starting point if > you want to improve things. :) I'll have a look. > > I know pretty much the whole documentation is already "there", but it's > > distributed over many text files. One needs to read all and remember > > everything, especially for the extensions because there's no way to > > easily see the extensions of a given class. > > Hmm, there are actually not so many different files. There is the external ET > documentation, obviously, but the rest is mainly in api.txt and compatibility.txt. 
Yeah, but at least compatibility.txt doesn't seem to have any structure, the various things are just listed and you always have to switch between ET and lxml docs to check wether the API of Element is extended or not and with which attributes/functions. I know if you know the API well, this is no problem, but for beginners its rather hard, IMHO. Andreas -- You are only young once, but you can stay immature indefinitely. From apaku at gmx.de Sat Jun 3 19:24:03 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 19:24:03 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> Message-ID: <20060603172403.GA20255@morpheus> On 01.06.06 23:51:15, Andreas Pakulat wrote: > as per Stefan's request I'm announcing here (and on the PyQt mailing > list) the first release of a small tool I "just" (i.e. the last 3 days) > wrote. > > XPathEvaluator is a tool to test what results a XPath expression gives > you when executed on a specific XML file. With the help of lxml it can > also parse pretty broken (and of course correct) HTML files. It loads > it's data from URLs if you want and highlights all nodes that an XPath > evaluation returns so you can easily identify them. > > I couldn't use lxml for more than HTML parsing because there's > unfortunately no easy way to find out the element to which an attribute > result or text result belongs to. I might actually use lxml for the > initial XML parsing too, because it's way faster than the PyXML parser I > currently have. > > I hope somebody finds this useful, I might add new features in the > future, however there's no priority at the moment for the development of > XPathEvaluator. As always with open source software: Patches welcome. > > XPathEvaluator can be downloaded from: > http://www.apaku.de/linux/xpathevaluator/index.php > That page also mentions all required software. 
I just "released" version 0.2.0, which now supports 3 different parsing methods and 2 different XPath implementations:

1. Plain PyXML: this is relatively slow compared to lxml, but it supports highlighting of any node in the tree.
2. lxml->PyXML: this uses lxml to parse the input and then traverses the lxml ElementTree, building a PyXML DOM out of it. This already builds faster than 1, even for rather small XML files. It also supports highlighting of any tree node.
3. Plain lxml: I guess this is the fastest parsing and XPath implementation (it's not visibly faster at parsing than 2 here with my small XML files), however only element nodes that are contained in an XPath result are highlighted. So if your XPath result is an attribute value or some kind of text, you won't see anything.

I'm planning to release 0.3.0 around next weekend. That version will show errors on a separate tab and will also present the output of the XPath evaluation on a tab, possibly linked to the tree. Andreas -- Generosity and perfection are your everlasting goals. From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 22:35:02 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 22:35:02 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <20060603172403.GA20255@morpheus> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> Message-ID: <4481F276.1050700@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > 2. lxml->PyXML, this uses lxml to parse the input and the traverses the > lxml ElementTree building a PyXML DOM out of it. This proves to build > already faster than 1 for rather small xml files. Just a note on this one, have you tried using lxml.sax to do the conversion? I don't know if that's faster (or slower), but it should be simpler at least...
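A minimal sketch of that lxml.sax route, for reference. It uses the stdlib's xml.dom.pulldom.SAX2DOM as the content handler and assumes an lxml version in which saxify() emits startDocument()/endDocument() events; the sample document is made up.

```python
from xml.dom.pulldom import SAX2DOM

from lxml import etree
from lxml.sax import saxify

# Hypothetical input; note the tail text after <a>.
root = etree.fromstring("<root><a>hello</a>tail</root>")

handler = SAX2DOM()       # stdlib ContentHandler that builds a minidom tree
saxify(root, handler)     # replay the lxml tree as SAX events
dom = handler.document    # the resulting xml.dom.minidom Document

xml_text = dom.documentElement.toxml()
```

Because the SAX events include every character run, tail text ends up in the DOM without any special handling.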
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 22:41:14 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 22:41:14 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <20060603161417.GD17910@morpheus> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> Message-ID: <4481F3EA.8050400@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > Yeah, but at least compatibility.txt doesn't seem to have any structure, > the various things are just listed True. Feel free to take a shot at it. Note, though, that compatibility.txt is more targeted towards people who want to port their applications from ET. Everyone else will likely find api.txt more useful. > and you always have to switch between > ET and lxml docs to check wether the API of Element is extended or not > and with which attributes/functions. I know if you know the API well, > this is no problem, but for beginners its rather hard, IMHO. What makes this worse is that etree is a C extension. You can't just check the arguments of a function via help() because Python can't see them. So we really should start putting more docstrings everywhere to make online help available.
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sat Jun 3 23:05:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sat, 03 Jun 2006 23:05:20 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <20060603161417.GD17910@morpheus> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> Message-ID: <4481F990.6020707@gkec.informatik.tu-darmstadt.de> Hi again, Andreas Pakulat wrote: > I think for somebody not familiar with ET and lxml it's easier if he > has to "scan" only one document for the API and not look at one and than > also read another (which is not in a particular order, or so it seems) > for what else is available with lxml. Note that there's a tutorial on ET, which is also referenced in the FAQ. http://effbot.org/zone/element.htm It's not quite as simple as it could be for lxml (it has a lot of references to ET versions etc.), but it should help people grasp the ideas behind the ElementTree API (which you need in order to make any use of lxml). And after the tutorial, you can read (and bookmark) the reference page of ET http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm or skip to lxml/doc/api.txt right away to see what other great things you can do with lxml. The rest is listed in http://codespeak.net/lxml/#documentation I think that's the order of choice. Just read the first two FAQ entries on this. 
http://codespeak.net/lxml/FAQ.html Stefan From apaku at gmx.de Sat Jun 3 23:59:50 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sat, 3 Jun 2006 23:59:50 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <4481F276.1050700@gkec.informatik.tu-darmstadt.de> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> <4481F276.1050700@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603215950.GB26716@morpheus> On 03.06.06 22:35:02, Stefan Behnel wrote: > Hi Andreas, > > Andreas Pakulat wrote: > > 2. lxml->PyXML, this uses lxml to parse the input and the traverses the > > lxml ElementTree building a PyXML DOM out of it. This proves to build > > already faster than 1 for rather small xml files. > > Just a note on this one, have you tried using lxml.sax to do the conversion? I > don't know if that's faster (or slower), but it should be simpler at least... Well, doing the conversion using the elementtree is just a matter of walking over the tree, for each element creating the attributes and then adding the text. However, I just saw: I miss some text nodes :-( (because I only add elem.text, not the tails of the child elements). Will have a look at the sax-stuff. Andreas -- Beware of a tall blond man with one black shoe. From apaku at gmx.de Sun Jun 4 01:04:28 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sun, 4 Jun 2006 01:04:28 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <4481F3EA.8050400@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F3EA.8050400@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603230428.GC26716@morpheus> On 03.06.06 22:41:14, Stefan Behnel wrote: > Andreas Pakulat wrote: > > Yeah, but at least compatibility.txt doesn't seem to have any structure, > > the various things are just listed > > True. Feel free to take a shot at it. I will. 
> Note, though, that compatibility.txt is more targeted towards people who want > to port their applications from ET. Everyone else will likely find api.txt > more useful. As for the reason I wrote the initial mail: the prefix attribute of the element class is not mentioned in api.txt (and of course not in the elementtree api). BTW: regarding the inclusion of elementtree in pythons stdlib, that will be for Python 2.5 right? > > and you always have to switch between > > ET and lxml docs to check wether the API of Element is extended or not > > and with which attributes/functions. I know if you know the API well, > > this is no problem, but for beginners its rather hard, IMHO. > > What makes this worse is that etree is a C extension. You can't just check the > arguments of a function via help() because Python can't see them. So we really > should starting putting more docstrings everywhere to make online help available. Well, especially for people learning lxml (like me ;-) it would be helpful to have the elementtree-api-page just including all "extensions" that lxml does (i.e. extra attributes of the classes, new classes and functions). Of course it would be great if that could be "built" from the python docstrings. Andreas -- If you learn one useless thing every day, in a single year you'll learn 365 useless things. 
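Until such a combined API page exists, one rough way to see what lxml adds to a class is to diff the attribute names at runtime. The sketch below assumes a current Python (where ElementTree lives at xml.etree.ElementTree) and relies on lxml.etree._Element, which is a private class name; public_names() is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

from lxml import etree


def public_names(cls):
    """Collect the attribute names of a class that do not start with '_'."""
    return {name for name in dir(cls) if not name.startswith("_")}


# Names offered by lxml's element class but not by the stdlib Element.
lxml_only = sorted(public_names(etree._Element) - public_names(ET.Element))
```

Printing lxml_only gives a quick list of candidates (prefix among them) to look up in api.txt; it says nothing about arguments or behaviour, so it complements docstrings rather than replacing them.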
From apaku at gmx.de Sun Jun 4 01:09:14 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sun, 4 Jun 2006 01:09:14 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <4481F990.6020707@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603230914.GD26716@morpheus> On 03.06.06 23:05:20, Stefan Behnel wrote: > Andreas Pakulat wrote: > > I think for somebody not familiar with ET and lxml it's easier if he > > has to "scan" only one document for the API and not look at one and than > > also read another (which is not in a particular order, or so it seems) > > for what else is available with lxml. > > Note that there's a tutorial on ET, which is also referenced in the FAQ. > > http://effbot.org/zone/element.htm Yeah, this is great to get to know ET. > It's not quite as simple as it could be for lxml (it has a lot of references > to ET versions etc.), but it should help people grasp the ideas behind the > ElementTree API (which you need in order to make any use of lxml). And after > the tutorial, you can read (and bookmark) the reference page of ET > > http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm > > or skip to lxml/doc/api.txt right away to see what other great things you can > do with lxml. The rest is listed in > > http://codespeak.net/lxml/#documentation > > I think that's the order of choice. Just read the first two FAQ entries on this. > > http://codespeak.net/lxml/FAQ.html I know that all the documentation "is there", I'm more concerned about it being available from one place. At least the basic stuff. Andreas -- You have no real enemies. 
From apaku at gmx.de Sun Jun 4 01:42:34 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sun, 4 Jun 2006 01:42:34 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <4481F276.1050700@gkec.informatik.tu-darmstadt.de> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> <4481F276.1050700@gkec.informatik.tu-darmstadt.de> Message-ID: <20060603234234.GA29185@morpheus> On 03.06.06 22:35:02, Stefan Behnel wrote: > Hi Andreas, > > Andreas Pakulat wrote: > > 2. lxml->PyXML, this uses lxml to parse the input and the traverses the > > lxml ElementTree building a PyXML DOM out of it. This proves to build > > already faster than 1 for rather small xml files. > > Just a note on this one, have you tried using lxml.sax to do the conversion? I > don't know if that's faster (or slower), but it should be simpler at least... Well, as far as I understood I'd need a ContentHandler that builds a DOM out of the SAX events, however I couldn't find a "ready-to-use" implementation, neither in pythons stdlib (pulldom somehow doesn't work) or in PyXml. Writing my own is basically what I already did, only without the "need" of SAX events. Andreas -- You have the body of a 19 year old. Please return it before it gets wrinkled. 
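A hand-written tree walk of the kind described here might look roughly like the following. It is only a sketch: to_minidom() is a hypothetical helper, it targets the stdlib's xml.dom.minidom rather than PyXML, and it ignores namespaces. The point is that each child's tail has to be copied as well as each element's text, otherwise text nodes go missing.

```python
from xml.dom import minidom

from lxml import etree


def to_minidom(root):
    """Copy an lxml element tree into a new minidom Document."""
    doc = minidom.Document()

    def convert(elem, parent):
        node = doc.createElement(elem.tag)
        for name, value in elem.attrib.items():
            node.setAttribute(name, value)
        parent.appendChild(node)
        if elem.text:                        # text before the first child
            node.appendChild(doc.createTextNode(elem.text))
        for child in elem:
            if isinstance(child.tag, str):   # skip comments and PIs
                convert(child, node)
            if child.tail:                   # text following the child
                node.appendChild(doc.createTextNode(child.tail))

    convert(root, doc)
    return doc


dom = to_minidom(etree.fromstring('<root id="1">a<b>c</b>d<e/>f</root>'))
top = dom.documentElement
texts = [n.data for n in top.childNodes if n.nodeType == n.TEXT_NODE]
tags = [n.tagName for n in top.childNodes if n.nodeType == n.ELEMENT_NODE]
```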
From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Jun 4 08:37:47 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 04 Jun 2006 08:37:47 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <20060603234234.GA29185@morpheus> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> <4481F276.1050700@gkec.informatik.tu-darmstadt.de> <20060603234234.GA29185@morpheus> Message-ID: <44827FBB.4060000@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > Well, as far as I understood I'd need a ContentHandler that builds a DOM > out of the SAX events, however I couldn't find a "ready-to-use" > implementation, neither in pythons stdlib (pulldom somehow doesn't work) > or in PyXml. Right, I only noticed by now that minidom is actually built directly on top of expat (for historical reasons, I assume). Although pulldom should work (any idea why it failed for you?), that's not very helpful for you anyway, since you'd need to convert it back into a minidom afterwards. I'll write up a SAX test case against pulldom so that we can make sure this works in the future. > Writing my own is basically what I already did, only without the "need" > of SAX events. Sure. Traversing ET is simple enough anyway. Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Jun 4 09:17:01 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 04 Jun 2006 09:17:01 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <20060603234234.GA29185@morpheus> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> <4481F276.1050700@gkec.informatik.tu-darmstadt.de> <20060603234234.GA29185@morpheus> Message-ID: <448288ED.6040809@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > On 03.06.06 22:35:02, Stefan Behnel wrote: >> Andreas Pakulat wrote: >>> 2. 
lxml->PyXML, this uses lxml to parse the input and the traverses the >>> lxml ElementTree building a PyXML DOM out of it. This proves to build >>> already faster than 1 for rather small xml files. >> Just a note on this one, have you tried using lxml.sax to do the conversion? I >> don't know if that's faster (or slower), but it should be simpler at least... > > Well, as far as I understood I'd need a ContentHandler that builds a DOM > out of the SAX events, however I couldn't find a "ready-to-use" > implementation, neither in pythons stdlib (pulldom somehow doesn't work) > or in PyXml. Sorry for that, that was a bug in sax.py, I appended a patch. You can simply create an xml.dom.pulldom.SAX2DOM() and pass it into lxml.sax.saxify. That nicely creates a minidom for you. I'll add a doctest to sax.txt also. Stefan

Index: src/lxml/sax.py
===================================================================
--- src/lxml/sax.py	(revision 28159)
+++ src/lxml/sax.py	(working copy)
@@ -122,7 +122,9 @@
         self._empty_attributes = attr_class({}, {})
 
     def saxify(self):
+        self._content_handler.startDocument()
         self._recursive_saxify(self._element, {})
+        self._content_handler.endDocument()
 
     def _recursive_saxify(self, element, prefixes):
         new_prefixes = []

From apaku at gmx.de Sun Jun 4 10:38:28 2006 From: apaku at gmx.de (Andreas Pakulat) Date: Sun, 4 Jun 2006 10:38:28 +0200 Subject: [lxml-dev] ANN: XPathEvaluator 0.1.0 In-Reply-To: <448288ED.6040809@gkec.informatik.tu-darmstadt.de> References: <20060601215115.GA14937@morpheus.apaku.dnsalias.org> <20060603172403.GA20255@morpheus> <4481F276.1050700@gkec.informatik.tu-darmstadt.de> <20060603234234.GA29185@morpheus> <448288ED.6040809@gkec.informatik.tu-darmstadt.de> Message-ID: <20060604083828.GC6202@morpheus> On 04.06.06 09:17:01, Stefan Behnel wrote: > Andreas Pakulat wrote: > > On 03.06.06 22:35:02, Stefan Behnel wrote: > >> Andreas Pakulat wrote: > >>> 2.
lxml->PyXML, this uses lxml to parse the input and the traverses the > >>> lxml ElementTree building a PyXML DOM out of it. This proves to build > >>> already faster than 1 for rather small xml files. > >> Just a note on this one, have you tried using lxml.sax to do the conversion? I > >> don't know if that's faster (or slower), but it should be simpler at least... > > > > Well, as far as I understood I'd need a ContentHandler that builds a DOM > > out of the SAX events, however I couldn't find a "ready-to-use" > > implementation, neither in pythons stdlib (pulldom somehow doesn't work) > > or in PyXml. > > Sorry for that, that was a bug in sax.py, I appended a patch. Ah, I only had a quick glimpse at the backtrace after sending the mail and saw sax.py and just wanted to report it, when I saw this :-) Andreas -- Caution: Keep out of reach of children. From buro at petr.com Sun Jun 4 10:43:03 2006 From: buro at petr.com (Petr van Blokland) Date: Sun, 4 Jun 2006 10:43:03 +0200 Subject: [lxml-dev] Resolving Message-ID: <23592F85-947B-4734-BF40-FD5951BFEAD6@petr.com> Hi, I get a very consistent error when including an XSL stylesheet from another XSL stylesheet using or But the problem is so basic that I cannot believe it could be something inside lxml. So I am overlooking something in my code. When there is a template file named "template.xsl" with ------------------------------------------------ ... 
------------------------------------------------ and another template file named: ------------------------------------------------ This is a paragraph ------------------------------------------------ and then executing the code below
------------------------------------------------
import lxml.etree

f = open('template.xsl', 'rb')
xslttree = lxml.etree.parse(f)
f.close()
transformer = lxml.etree.XSLT(xslttree)
------------------------------------------------
the following error is raised:

  File "xslt.pxi", line 261, in etree.XSLT.__init__
  File "etree.pyx", line 133, in etree._ExceptionContext._raise_if_stored
etree.XSLTParseError: Cannot resolve URI XSLT://para.xsl

All 3 files are in the same directory. And I am in that directory when executing the Python code. What am I doing wrong that XSLT cannot resolve the para.xsl file? Petr ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Jun 4 12:16:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Sun, 04 Jun 2006 12:16:32 +0200 Subject: [lxml-dev] Resolving In-Reply-To: <23592F85-947B-4734-BF40-FD5951BFEAD6@petr.com> References: <23592F85-947B-4734-BF40-FD5951BFEAD6@petr.com> Message-ID: <4482B300.5070605@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > I get a very consistent error when including an XSL stylesheet from > another XSL stylesheet using or > But the problem is so basic that I cannot believe it could be > something inside lxml. > > the following error is raised: > > File "xslt.pxi", line 261, in etree.XSLT.__init__ > File "etree.pyx", line 133, in > etree._ExceptionContext._raise_if_stored > etree.XSLTParseError: Cannot resolve URI XSLT://para.xsl > > What do I do wrong that XSLT cannot resolve the para.xsl file? Nothing. That was a bug in lxml, thanks for reporting it.
It stopped working
when I started making lxml read file-like objects directly wherever it can.

The problem was that libxml2 doesn't get to know the file name in this case
and thus can't store it in the document when you call parse(). This prevents
the stylesheet from knowing where it came from. Here's a patch:

Index: src/lxml/parser.pxi
===================================================================
--- src/lxml/parser.pxi (Revision 28159)
+++ src/lxml/parser.pxi (Arbeitskopie)
@@ -419,6 +419,8 @@
 
     if result is NULL:
         _raiseParseError(ctxt, c_filename)
+    elif result.URL is NULL and c_filename is not NULL:
+        result.URL = tree.xmlStrdup(c_filename)
     return result
 
 ############################################################

BTW, you can also call parse on the filename rather than an open file. That's
even more efficient as it doesn't go through Python to read the file.

Sorry for the inconvenience,
Stefan

From behnel_ml at gkec.informatik.tu-darmstadt.de Sun Jun 4 17:10:50 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Sun, 04 Jun 2006 17:10:50 +0200
Subject: [lxml-dev] Axis iteration
Message-ID: <4482F7FA.8060306@gkec.informatik.tu-darmstadt.de>

Hi all,

I thought a bit more about the axis issues that Andrey raised. While list-like
access to them is not a viable solution IMHO (and already covered by XPath
calls anyway), I think it would still be nice to provide them as iterators.

I therefore added a few new methods to Element:

  itersiblings(preceding=False)
  iterancestors()
  iterdescendants()

Note that iterdescendants() is almost like getiterator(), except that it does
not include the element itself, which is consistent with XPath's descendant
axis. getiterator() is therefore the equivalent of the descendant-or-self
axis.

I think these methods are pretty Pythonic and make tree navigation fairly
simple.
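To illustrate the difference between the descendant and descendant-or-self
axes, here is a minimal sketch. It deliberately uses the standard library's
xml.etree.ElementTree rather than lxml, so descendants() below is a
hypothetical helper standing in for iterdescendants(), and iter() plays the
role of getiterator(). The sample document is made up.

```python
import xml.etree.ElementTree as ET
from itertools import islice

# Made-up sample tree: <a> contains <b> (which contains <c>) and <d>.
root = ET.fromstring("<a><b><c/></b><d/></a>")


def descendants(element):
    """Hypothetical stand-in for iterdescendants(): like iter(), but
    skipping the first item, which is the element itself."""
    return islice(element.iter(), 1, None)


# descendant-or-self axis: the element itself comes first, in document order
descendant_or_self = [el.tag for el in root.iter()]

# descendant axis: the same sequence without the element itself
descendant_only = [el.tag for el in descendants(root)]

print(descendant_or_self)
print(descendant_only)
```

With lxml itself, the helper should not be needed, since the methods named
above are called directly on the element.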
Stefan From buro at petr.com Sun Jun 4 23:32:29 2006 From: buro at petr.com (Petr van Blokland) Date: Sun, 4 Jun 2006 23:32:29 +0200 Subject: [lxml-dev] Resolving In-Reply-To: <4482B300.5070605@gkec.informatik.tu-darmstadt.de> References: <23592F85-947B-4734-BF40-FD5951BFEAD6@petr.com> <4482B300.5070605@gkec.informatik.tu-darmstadt.de> Message-ID: <38C77A08-F5E3-43DA-8203-2B9B9640316E@petr.com> Stefan, thanks for the patch. I don't seem to get the change appear in a new compile, however. Isn't it enough to do: python setup.py build python setup.py install Petr On Jun 4, 2006, at 12:16 PM, Stefan Behnel wrote: > Hi Petr, > > Petr van Blokland wrote: >> I get a very consistent error when including an XSL stylesheet from >> another XSL stylesheet using or >> But the problem is so basic that I cannot believe it could be >> something inside lxml. >> >> the following error is raised: >> >> File "xslt.pxi", line 261, in etree.XSLT.__init__ >> File "etree.pyx", line 133, in >> etree._ExceptionContext._raise_if_stored >> etree.XSLTParseError: Cannot resolve URI XSLT://para.xsl >> >> What do I do wrong that XSLT cannot resolve the para.xsl file? > > Nothing. That was a bug in lxml, thanks for reporting it. It > stopped working > when I started making lxml read file-like objects directly where > ever it can. > > The problem was that libxml2 doesn't get to know the file name in > this case > and thus can't store it in the document when you call parse(). This > prevents > the stylesheet from knowing where it came from. 
Here's a patch: > > Index: src/lxml/parser.pxi > =================================================================== > --- src/lxml/parser.pxi (Revision 28159) > +++ src/lxml/parser.pxi (Arbeitskopie) > @@ -419,6 +419,8 @@ > > if result is NULL: > _raiseParseError(ctxt, c_filename) > + elif result.URL is NULL and c_filename is not NULL: > + result.URL = tree.xmlStrdup(c_filename) > return result > > ############################################################ > > > BTW, you can also call parse on the filename rather than an open > file. That's > even more efficient as it doesn't go through Python to read the file. > > Sorry for the inconvenience, > > Stefan > ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Jun 5 07:50:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 05 Jun 2006 07:50:30 +0200 Subject: [lxml-dev] Resolving In-Reply-To: <38C77A08-F5E3-43DA-8203-2B9B9640316E@petr.com> References: <23592F85-947B-4734-BF40-FD5951BFEAD6@petr.com> <4482B300.5070605@gkec.informatik.tu-darmstadt.de> <38C77A08-F5E3-43DA-8203-2B9B9640316E@petr.com> Message-ID: <4483C626.9080602@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > Stefan, > thanks for the patch. > I don't seem to get the change appear in a new compile, however. > Isn't it enough to do: > python setup.py build > python setup.py install Do "make clean" first or remove the file src/lxml/etree.c (I assume you have Pyrex 0.9.4.1 installed?). 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Jun 5 09:26:25 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 05 Jun 2006 09:26:25 +0200 Subject: [lxml-dev] Threading redux Message-ID: <4483DCA1.5010203@gkec.informatik.tu-darmstadt.de> Hi all, there is now a branch where I implemented thread concurrency for the parsers. http://codespeak.net/svn/lxml/branch/threading/ This branch is not very well tested. I attached the only test program I have so far. There are likely some bizarre race conditions that I overlooked. Note that parsers are not currently locked, so lxml will crash if multiple threads use /the same/ parser concurrently. I'll see how to do that efficiently. If you have an interest in getting this integrated, please build the branch on your side and do some testing. There are other places where we can try to add concurrency, but that is still to come. Feel free to post a list of personal preferences including little test programs (like the one attached) that show concurrent usage. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: testthreads.py Type: text/x-python Size: 697 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060605/51f89d8c/attachment.py From behnel_ml at gkec.informatik.tu-darmstadt.de Mon Jun 5 17:45:22 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Mon, 05 Jun 2006 17:45:22 +0200 Subject: [lxml-dev] Threading redux In-Reply-To: <4483DCA1.5010203@gkec.informatik.tu-darmstadt.de> References: <4483DCA1.5010203@gkec.informatik.tu-darmstadt.de> Message-ID: <44845192.6080103@gkec.informatik.tu-darmstadt.de> Hi again, > there is now a branch where I implemented thread concurrency for the parsers. > http://codespeak.net/svn/lxml/branch/threading/ Here is a new (and slightly more complex) test program that also tests serialisation. 
The current branch status for parsing and serialisation is:

* all in-memory operations (tostring, parse(StringIO), etc.) free the GIL
* file operations (on file names) free the GIL
* reading from file-like objects frees the GIL and reacquires it for reading
* serialisation to file-like objects is single-threaded (high lock overhead)

Note that you *must* create independent parsers for each thread. Sharing
parsers between threads will serialise the calls or (currently) crash.

I'd be very happy if anyone who has a multi-processor machine could give it a
try. That would allow us to see if there are 'real' concurrency issues and if
lxml manages to saturate more than one processor.

Any feedback is appreciated.

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: testthreads.py
Type: text/x-python
Size: 1213 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060605/b3cfa3d3/attachment.py

From buro at petr.com Mon Jun 5 22:50:30 2006
From: buro at petr.com (Petr van Blokland)
Date: Mon, 5 Jun 2006 22:50:30 +0200
Subject: [lxml-dev] Bus Error in external XPath function
Message-ID:

Hi,
I have built lxml XSLT transformation into a webserver,
based on Twisted Matrix.
It runs all right, unless I use a call to a Python function
in an XSL template with XPath. The generation of a page
prints a remark to the output "transforming page".
The following error is very consistent, and predictable
in the number of pages after which it goes wrong.

--> python start.py
Starting server as user/group 501/501 (petr)
====== transforming page
====== transforming page
====== transforming page
====== transforming page
Exception exceptions.AssertionError: 'Tried to unregister unknown proxy' in Bus error
-->

So after the rendering of 5 identical pages (reloads from
a browser) the XPath call generates a Bus error (as said,
this does not happen with XSL without this call).
I guess this is too little information, so I'll try to make a small test program that can reproduce it. Regards, Petr ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 08:02:47 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 08:02:47 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: References: Message-ID: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > I have build lxml XSLT transformation into a webserver, > based on Twisted Matrix. > I runs all right, unless I used a call to a Python function > in an XSL template with XPath. The generation of a page > prints a remark to the output "transforming page". > The following error is very consistent, and predictable > for the amount of pages after which is goes wrong. > > --> python start.py > Starting server as user/group 501/501 (petr) > ====== transforming page > ====== transforming page > ====== transforming page > ====== transforming page > Exception exceptions.AssertionError: 'Tried to unregister unknown > proxy' in Bus error > --> > > So after the rendering of 5 identical pages (reloads from > a browser) the XPyth call generates a Bus error (as said, > this does not happen with XSL without this call). > I guess this is too little information, so I'll try to make > a small test program that can reproduce it. Yes, this is not quite enough to pinpoint the bug. Thanks for reporting it, though. It should be a Python garbage collection issue. You might have triggered some obscure case under which the document is freed before its Python elements. 
This is normally related to the Python garbage collector that frees objects in dependency order when they are refcounted out, but out-of-order when they have cyclic dependencies and are freed by the cyclic GC at a more or less random time. But then, that's just guessing. It would really be very helpful if you could cut down your code to a reasonably sized sequence of commands that shows the problem. GC issues are extremely hard to find through normal test cases... Could you also tell me the Python version that you are using? Most likely, this is not related to Twisted - also just guessing... Thanks for helping, Stefan From buro at petr.com Tue Jun 6 10:34:23 2006 From: buro at petr.com (Petr van Blokland) Date: Tue, 6 Jun 2006 10:34:23 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> Message-ID: <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com> On Jun 6, 2006, at 8:02 AM, Stefan Behnel wrote: > Could you also tell me the Python version that you are using? Most > likely, > this is not related to Twisted - also just guessing... Stefan, I use Python 2.3. I don't seem to get the same error in a test script without Twisted, so I guess that I have to build a mini-server using it. But that is not a lot a work. I'll keep you informed. Petr ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From buro at petr.com Tue Jun 6 11:17:40 2006 From: buro at petr.com (Petr van Blokland) Date: Tue, 6 Jun 2006 11:17:40 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> Message-ID: <3C5E2AB8-66E7-474D-899F-7FC25974DE07@petr.com> > > Could you also tell me the Python version that you are using? 
Most > likely, > this is not related to Twisted - also just guessing... Hi Stefan, This is a small server source. Problem now is that is just works alright. No problems with it. I guess I have to make it larger and more complex as the original to get the error back. -------------- next part -------------- A non-text attachment was scrubbed... Name: lxmlparsertest.py Type: text/x-python-script Size: 1742 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060606/77d1fdd1/attachment.bin -------------- next part -------------- Start it and then the browser should have a page on http://127.0.0.1 Reloads in the browser give repeated new pages with a counter. Petr ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 11:21:15 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 11:21:15 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com> Message-ID: <4485490B.7080402@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > On Jun 6, 2006, at 8:02 AM, Stefan Behnel wrote: > >> Could you also tell me the Python version that you are using? Most >> likely, >> this is not related to Twisted - also just guessing... > > Stefan, > I use Python 2.3. I don't seem to get the same error in a test script > without Twisted, > so I guess that I have to build a mini-server using it. But that is not > a lot a work. > I'll keep you informed. Thanks. Thing is, we could relatively easily prevent these bugs by doubling the tree traversal done on freeing documents, but if we can avoid that by fixing bugs from time to time, I'm much for it. 
Sometimes it's really just little things...

Here's an unclean patch that might or might not change something for the bug
you witness. Please try it.

Stefan

Index: src/lxml/extensions.pxi
===================================================================
--- src/lxml/extensions.pxi (Revision 28350)
+++ src/lxml/extensions.pxi (Arbeitskopie)
@@ -169,16 +169,17 @@
         """
         cdef _NodeBase element
         if isinstance(obj, _NodeBase):
-            obj = (obj,)
-        elif not python.PySequence_Check(obj):
+            self._temp_refs.add(obj)
             return
+        elif _isString(obj) or not python.PySequence_Check(obj):
+            return
         for o in obj:
             if isinstance(o, _NodeBase):
                 element = <_NodeBase>o
                 #print "Holding element:", element._c_node
                 self._temp_refs.add(element)
                 #print "Holding document:", element._doc._c_doc
-                self._temp_refs.add(element._doc)
+                #self._temp_refs.add(element._doc)
 
 
 def Extension(module, function_mapping, ns=None):

From buro at petr.com Tue Jun 6 11:30:14 2006
From: buro at petr.com (Petr van Blokland)
Date: Tue, 6 Jun 2006 11:30:14 +0200
Subject: [lxml-dev] Bus Error in external XPath function
In-Reply-To: <4485490B.7080402@gkec.informatik.tu-darmstadt.de>
References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de>
 <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com>
 <4485490B.7080402@gkec.informatik.tu-darmstadt.de>
Message-ID:

On Jun 6, 2006, at 11:21 AM, Stefan Behnel wrote:
>
> Here's an unclean patch that might or might not change something
> for the bug
> you witness. Please try it.

At least some change.
The patch on the original server has the following effect: Starting server as user/group 501/501 (petr) ====== transforming page ====== transforming page ====== transforming page Exception exceptions.AssertionError: 'Tried to unregister unknown proxy' in ignored Exception exceptions.AssertionError: 'Tried to unregister unknown proxy' in ignored Exception exceptions.AssertionError: 'Tried to unregister unknown proxy' in Bus error ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From buro at petr.com Tue Jun 6 11:49:13 2006 From: buro at petr.com (Petr van Blokland) Date: Tue, 6 Jun 2006 11:49:13 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <4485490B.7080402@gkec.informatik.tu-darmstadt.de> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com> <4485490B.7080402@gkec.informatik.tu-darmstadt.de> Message-ID: <07DFB8EC-89AB-470E-AF7D-85CF7DC320B8@petr.com> On Jun 6, 2006, at 11:21 AM, Stefan Behnel wrote: > > Here's an unclean patch that might or might not change something > for the bug > you witness. Please try it. 
The small test server that I have sent now gives the following error with your patch: === Rendering page 0 Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/protocols/basic.py", line 223, in dataReceived why = self.lineReceived(line) File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/protocols/http.py", line 950, in lineReceived self.allContentReceived() File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/protocols/http.py", line 991, in allContentReceived req.requestReceived(command, path, version) File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/protocols/http.py", line 549, in requestReceived self.process() --- --- File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/web/server.py", line 159, in process self.render(resrc) File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/ python2.3/site-packages/twisted/web/server.py", line 166, in render body = resrc.render(self) File "lxmlparsertest.py", line 79, in render return parser.render(xslt, xml) File "lxmlparsertest.py", line 68, in render return str(transformer(xmltree)) File "xslt.pxi", line 362, in etree.XSLT.__call__ etree.XSLTApplyError: runtime error (element 'value-of') (BTW, I am using OSX 10.4.6, Python 2.3) ---------------------------------------------- Petr van Blokland buro at petr.com | www.petr.com | +31 15 219 10 40 ---------------------------------------------- From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 12:19:44 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 12:19:44 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <07DFB8EC-89AB-470E-AF7D-85CF7DC320B8@petr.com> References: 
<44851A87.5020708@gkec.informatik.tu-darmstadt.de> <9A69AD8E-EEB5-405B-BAF5-4E09497635B5@petr.com> <4485490B.7080402@gkec.informatik.tu-darmstadt.de> <07DFB8EC-89AB-470E-AF7D-85CF7DC320B8@petr.com> Message-ID: <448556C0.8060600@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > On Jun 6, 2006, at 11:21 AM, Stefan Behnel wrote: >> >> Here's an unclean patch that might or might not change something for >> the bug you witness. Please try it. > > The small test server that I have sent now gives the following error > with your patch: > etree.XSLTApplyError: runtime error (element 'value-of') > > (BTW, I am using OSX 10.4.6, Python 2.3) Hmm, ok, then revert it (was worth a try). The other error messages you get show that the libxml2 node was already freed when its Python element was GCed. So that's just as I suspected. I guess I'll really have to write up some more conservative cleanup code then... Stefan From fredrik at pythonware.com Tue Jun 6 12:27:35 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 06 Jun 2006 12:27:35 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <4481F990.6020707@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > And after the tutorial, you can read (and bookmark) the reference page of ET > > http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm > > or skip to lxml/doc/api.txt right away to see what other great things you can > do with lxml. 
one approach would be to generate a composite reference page, using the original ET pythondoc infoset for the core API, and a "dummy module" for the lxml additions (this can be a text file with pythondoc syntax), and merge them together: http://online.effbot.org/2003_11_01_archive.htm#pythondoc-merge (just make sure you mark the extensions clearly; it's a bit sad to see lxml-specific addons that could be made to work with any ElementTree implementation with very little work. guess the idea of a portable API still is a bit foreign to some pythoneers...) From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 12:29:48 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 12:29:48 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <3C5E2AB8-66E7-474D-899F-7FC25974DE07@petr.com> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> <3C5E2AB8-66E7-474D-899F-7FC25974DE07@petr.com> Message-ID: <4485591C.9010804@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > This is a small server source. > Problem now is that is just works alright. No problems with it. > I guess I have to make it larger and more complex as the original to get > the error back. > > ------------------------------------------------------------------------ > def ns_get(dummy, key): > global pagecounter > return str(pagecounter) Wait, I though you were creating Elements here. I what you do in the external function really that simple? 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 12:53:20 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 12:53:20 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> Message-ID: <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> Hi Fredrik, Fredrik Lundh wrote: > Stefan Behnel wrote: >> And after the tutorial, you can read (and bookmark) the reference page of ET >> >> http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm >> >> or skip to lxml/doc/api.txt right away to see what other great things you can >> do with lxml. > > one approach would be to generate a composite reference page, using the > original ET pythondoc infoset for the core API, and a "dummy module" for > the lxml additions (this can be a text file with pythondoc syntax), and > merge them together: > > http://online.effbot.org/2003_11_01_archive.htm#pythondoc-merge I never used PythonDoc, but that sounds like a good idea. > (just make sure you mark the extensions clearly; it's a bit sad to see > lxml-specific addons that could be made to work with any ElementTree > implementation with very little work. guess the idea of a portable API > still is a bit foreign to some pythoneers...) You're always invited to comment on proposed changes when we discuss them on the list. Except for the tounicode() bit and the xinclude() method, I don't think there are many places where lxml adds things that ET could offer. Feel free to correct me here. 
Stefan From fredrik at pythonware.com Tue Jun 6 13:24:32 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Tue, 06 Jun 2006 13:24:32 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > You're always invited to comment on proposed changes when we discuss them on > the list. Except for the tounicode() bit and the xinclude() method, I don't > think there are many places where lxml adds things that ET could offer. Feel > free to correct me here. oh, I didn't mean lxml.etree API extensions, I meant 3rd party code written for lxml.etree that could be useful for a larger ET audience. the lxml.etree API extensions are very important input for future ET improvements (including the elusive 1.3 release ;-). From faassen at infrae.com Tue Jun 6 13:43:48 2006 From: faassen at infrae.com (Martijn Faassen) Date: Tue, 06 Jun 2006 13:43:48 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> Message-ID: <44856A74.6020105@infrae.com> Stefan Behnel wrote: > Hi Fredrik, > > Fredrik Lundh wrote: >> Stefan Behnel wrote: >>> And after the tutorial, you can read (and bookmark) the reference page of ET >>> >>> http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm >>> >>> or skip to lxml/doc/api.txt right away to see what other great things you can >>> do with lxml. 
>> one approach would be to generate a composite reference page, using the >> original ET pythondoc infoset for the core API, and a "dummy module" for >> the lxml additions (this can be a text file with pythondoc syntax), and >> merge them together: >> >> http://online.effbot.org/2003_11_01_archive.htm#pythondoc-merge > > I never used PythonDoc, but that sounds like a good idea. Sounds good. Don't have experience with PythonDoc either but I'll try to help out. It would be nice if we could make it look like the Python standard documentation, with, as Fredrik suggests, clearly marked extensions added. Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Tue Jun 6 14:24:13 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Tue, 06 Jun 2006 14:24:13 +0200 Subject: [lxml-dev] documentation improvement In-Reply-To: References: <20060603105158.GC10248@morpheus> <44816EE4.1090504@gkec.informatik.tu-darmstadt.de> <20060603161417.GD17910@morpheus> <4481F990.6020707@gkec.informatik.tu-darmstadt.de> <44855EA0.5020802@gkec.informatik.tu-darmstadt.de> Message-ID: <448573ED.8050009@gkec.informatik.tu-darmstadt.de> Hi Fredrik, Fredrik Lundh wrote: > Stefan Behnel wrote: > >> You're always invited to comment on proposed changes when we discuss them on >> the list. Except for the tounicode() bit and the xinclude() method, I don't >> think there are many places where lxml adds things that ET could offer. Feel >> free to correct me here. > > oh, I didn't mean lxml.etree API extensions, I meant 3rd party code > written for lxml.etree that could be useful for a larger ET audience. Ah, ok, sure. I actually tend to encourage people to write stuff against the ET API when it's obvious that it will work. It's for their own advantage: everything that can work with stdlib should. On the other hand, many things really are easier to do in lxml than in ET, XPath is only one reason. 
> the lxml.etree API extensions are very important input for future ET
> improvements (including the elusive 1.3 release ;-).

Sure, go ahead. If you add them to the "official" API, we can blame /you/ if
they are not documented well enough. :)

BTW, is there any release schedule for 1.3?

Stefan

From faassen at infrae.com Wed Jun 7 11:36:49 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Wed, 07 Jun 2006 11:36:49 +0200
Subject: [lxml-dev] parser bug in lxml 1.0
Message-ID: <44869E31.3070509@infrae.com>

Hi there,

After a hint from Guido Wesdorp, I tried the following with lxml 1.0:

utf.xml:

This is some UTF-8 content: ?

and this script (tryparse.py):

from lxml import etree

f = open('utf.xml', 'r')
etree.parse(f)
f.close()

running it gives the following traceback:

Traceback (most recent call last):
  File "tryparse.py", line 4, in ?
    etree.parse(f)
  File "etree.pyx", line 1468, in etree.parse
  File "parser.pxi", line 671, in etree._parseDocument
  File "parser.pxi", line 697, in etree._parseFilelikeDocument
  File "parser.pxi", line 622, in etree._parseDocFromFilelike
  File "parser.pxi", line 379, in etree._BaseParser._parseDocFromFilelike
  File "parser.pxi", line 418, in etree._handleParseResult
  File "etree.pyx", line 151, in etree._ExceptionContext._raise_if_stored
  File "parser.pxi", line 159, in etree.copyToBuffer
  File "apihelpers.pxi", line 319, in etree._utf8
AssertionError: All strings must be Unicode or ASCII

This is of course wrong. lxml should definitely be able to parse UTF-8
encoded XML files. This did work in previous versions of lxml too. It
also looks like it is going into an in-memory string parser. I recall in
earlier versions of lxml this wasn't necessary - the file object was
inspected and the filename was extracted, passing it into libxml2 directly.
Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Jun 7 12:08:51 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Wed, 07 Jun 2006 12:08:51 +0200
Subject: [lxml-dev] parser bug in lxml 1.0
In-Reply-To: <44869E31.3070509@infrae.com>
References: <44869E31.3070509@infrae.com>
Message-ID: <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de>

Hi Martijn,

Martijn Faassen wrote:
> After a hint from Guido Wesdorp, I tried the following with lxml 1.0:
>
> utf.xml:
>
>
>
> This is some UTF-8 content: ?
>
>
> and this script (tryparse.py):
>
> from lxml import etree
>
> f = open('utf.xml', 'r')
> etree.parse(f)
> f.close()
>
> AssertionError: All strings must be Unicode or ASCII
>
> This is of course wrong. lxml should definitely be able to parse UTF-8
> encoded XML files.

By filename, sure. When you pass through Python, however, that's different.

> This did work in previous versions of lxml too. It
> also looks like it is going into an in-memory string parser. I recall in
> earlier versions of lxml this wasn't necessary - the file object was
> inspected and the filename was extracted, passing it into libxml2 directly.

True. However, what do you do with code like this:

  f = open('embedded-xml.txt', 'r')
  f.seek(non_xml_header_length)
  etree.parse(f)
  f.close()

I think, if you pass a file-like object, then lxml should assume that there
must be a reason for you to do that. Otherwise you'd just pass the plain file
name in the first place, right? There is an obvious semantic difference between

  etree.parse("file.xml")

and

  f = open("file.xml", "r")
  etree.parse(f)
  f.close()

We could use the same trick as for StringIO, ask the file object if it reads
from the beginning (f.tell() == 0) and then special case that to read from the
file name. But that would destroy the semantics of the second call: "read from
the file object I passed". You'd also have hidden tempfile issues and the like.
Both calls are not equivalent.
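The embedded-XML case above can be sketched as follows. This is only an
illustration under assumptions: it uses the standard library's
xml.etree.ElementTree instead of lxml, an in-memory io.BytesIO as a stand-in
for the open file, and a made-up header and payload.

```python
import io
import xml.etree.ElementTree as ET

# Made-up data: a non-XML header followed by a UTF-8 encoded XML document.
header = b"NOT-XML HEADER\n"
payload = b"<?xml version='1.0' encoding='utf-8'?><root>caf\xc3\xa9</root>"

f = io.BytesIO(header + payload)
f.seek(len(header))        # skip the non-XML part, as in the example above
tree = ET.parse(f)         # reads from the current position of the object
text = tree.getroot().text

# The parser consumed the object itself, so its position has moved to the end.
consumed_everything = (f.tell() == len(header) + len(payload))
f.close()

print(repr(text), consumed_everything)
```

Re-opening the underlying file by name would start at the header and leave the
position of the passed object untouched, which is the semantic difference
described above.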
I guess we should not encode what comes in from a file-like object, so I think the bug is rather at that place. However, then we'd have no way to deal with file-like objects returning unicode strings - we just don't know what they return before we start reading from them... Maybe it's acceptable to just raise an exception for file-like objects that return unicode strings or read(). Stefan From fredrik at pythonware.com Wed Jun 7 12:33:29 2006 From: fredrik at pythonware.com (Fredrik Lundh) Date: Wed, 07 Jun 2006 12:33:29 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> References: <44869E31.3070509@infrae.com> <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> Message-ID: Stefan Behnel wrote: > There is an obvious semantic difference between > > etree.parse("file.xml") > > and > > f = open("file.xml", "r") > etree.parse(f) > f.close() if there is, that's a bug. parsing from a file-like object for which "read" returns binary data (8-bit strings containing encoded data) should be equivalent to parsing from a file. this is standard Python behaviour. From faassen at infrae.com Wed Jun 7 12:43:23 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 07 Jun 2006 12:43:23 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> References: <44869E31.3070509@infrae.com> <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> Message-ID: <4486ADCB.2010804@infrae.com> Stefan Behnel wrote: >> This is of course wrong. lxml should definitely be able to parse UTF-8 >> encoded XML files. > > By filename, sure. When you pass through Python, however, that's different. lxml supported the reading of UTF-8 encoded content when you pass in a file object. I'm pretty sure ElementTree does. So, this is a bug. >> This did work in previous versions of lxml too. It >> also looks like it is going into an in-memory string parser. 
I recall in >> earlier versions of lxml this wasn't necessary - the file object was >> inspected and the filename was extracted, passing it into libxml2 directly. > > True. However, what do you do with code like this: > > f = open('embedded-xml.txt', 'r') > f.seek(non_xml_header_length) > etree.parse(f) > f.close() > > I think, if you pass a file-like object, then lxml should assume that there > must be a reason for you to do that. Otherwise you'd just pass the plain file > name in the first place, right? There is an obvious semantic difference between Sure, if there's no way to make this work better, then we should go the string route. But it still shouldn't bail out if you pass in a file that contains UTF-8 data (or *any* encoding as long as there's an encoding declaration). > I guess we should not encode what comes in from a file-like object, so I think > the bug is rather at that place. However, then we'd have no way to deal with > file-like objects returning unicode strings - we just don't know what they > return before we start reading from them... The file-like object (StringIO) that returns unicode case is not so important to me - more important is not to break backwards compatibility, not to break compatibility with ElementTree, and the ability to actually parse XML files that are passed in through a file object. > Maybe it's acceptable to just raise an exception for file-like objects that > return unicode strings or read(). Yes, I agree I'd rather have that be broken than the current situation. 
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Jun 7 12:48:21 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 07 Jun 2006 12:48:21 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: References: <44869E31.3070509@infrae.com> <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> Message-ID: <4486AEF5.4040307@gkec.informatik.tu-darmstadt.de> Hi Fredrik, Fredrik Lundh wrote: > Stefan Behnel wrote: > > There is an obvious semantic difference between >> etree.parse("file.xml") >> >> and >> >> f = open("file.xml", "r") >> etree.parse(f) >> f.close() > > if there is, that's a bug. Well, there is one, though. The first call to parse() says: "Here is a filename, look it up in the file system and parse the data contained in the file it points to." The second one says: "Here is an object that will give you XML data when you call its read() method. Do that and parse what it returns." *That* is standard Python behaviour. It would be wrong to change the second into: "Here is an object that will give you XML data when you call its read() method. Try to figure out which file in the file system it originally came from, then open that file again and read the data you get from it." One of the differences is that the file position will be magically unchanged after reading "from the file object". Another difference is that the file will be read from the beginning, as I pointed out. Both are totally unexpected behaviour. I'm not questioning the general bug in lxml. I'm just saying that it's better to read the file-like object as such than to read from the file system where the user wanted us to read from a file(-like) object.
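To make the difference between the two calls concrete, here is a minimal sketch of the "embedded XML" case. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption, chosen so that the example runs without lxml installed), and an in-memory io.BytesIO stands in for the 'embedded-xml.txt' file. The header bytes and the XML payload are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical container: a non-XML header followed by an XML document.
header = b"SOME-NON-XML-HEADER\n"
payload = b"<root><child>text</child></root>"
f = io.BytesIO(header + payload)

# Skip the header, then hand the *object* to the parser. The parser
# calls f.read() and therefore starts at the current file position.
f.seek(len(header))
tree = ET.parse(f)

root_tag = tree.getroot().tag
child_text = tree.getroot().find("child").text

# The parser consumed the object itself, so its position has moved
# to the end of the data.
position_after = f.tell()
```

Re-opening the underlying file by name would instead start at the header and fail to parse, and it would leave the position of f untouched.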
Stefan From faassen at infrae.com Wed Jun 7 12:51:39 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 07 Jun 2006 12:51:39 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: References: <44869E31.3070509@infrae.com> <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> Message-ID: <4486AFBB.8050709@infrae.com> Fredrik Lundh wrote: > Stefan Behnel wrote: > > > There is an obvious semantic difference between >> etree.parse("file.xml") >> >> and >> >> f = open("file.xml", "r") >> etree.parse(f) >> f.close() > > if there is, that's a bug. parsing from a file-like object for which > "read" returns binary data (8-bit strings containing encoded data) > should be equivalent to parsing from a file. this is standard Python > behaviour. I agree that this should semantically be exactly the same. Stefan goes on to explain that in the case of the file object a seek can happen first, which is not the case for referring a filename. It would be nice if we could support seek(), and of course if you pass in a file object you can mess with the file object in advance, something you cannot do when you pass in a filename. I assume Stefan refers to that. Then again I think that *this* case, where no seek occurs, should behave the same way as the direct opening of file, and hopefully follow the same codepath in lxml, ideally a fast one that doesn't need the reading in as the file as a string. I think Fredrik had a point in previous discussions that involving unicode in both the parsing and serialization side of XML makes things less clear. We should be very careful to make sure lxml works correctly for both parsing and serialization of encoded text (in strings or files, referenced through fileobject or filename). That's the *main* usecase and this usecase should ideally be reflected in the source code. Anything to do with dealing with Python unicode strings at that level is a bonus. 
Regards, Martijn From faassen at infrae.com Wed Jun 7 12:55:20 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 07 Jun 2006 12:55:20 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <4486AEF5.4040307@gkec.informatik.tu-darmstadt.de> References: <44869E31.3070509@infrae.com> <4486A5B3.2010903@gkec.informatik.tu-darmstadt.de> <4486AEF5.4040307@gkec.informatik.tu-darmstadt.de> Message-ID: <4486B098.9030408@infrae.com> Stefan Behnel wrote: [snip] > I'm not questioning the general bug in lxml. I'm just saying that it's better > to read the file-like object as such than to read from the file system where > the user wanted us to read from a file(-like) object. Well, figuring out the filename is necessary anyway if we want to make things like, say, XSLT includes, work correctly right? That said of course lxml should behave correctly for Python file objects, also in the case of seek(). But, the current behavior attempting at correctness breaks far more than the hack I coded in reading the filename directly - so you could call my old approach "worse is better". :) Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Jun 7 13:50:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 07 Jun 2006 13:50:32 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <44869E31.3070509@infrae.com> References: <44869E31.3070509@infrae.com> Message-ID: <4486BD88.4020907@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > After a hint from Guido Wesdorp, I tried the following with lxml 1.0: > > utf.xml: > > > > This is some UTF-8 content: ? > > > and this script (tryparse.py): > > from lxml import etree > > f = open('utf.xml', 'r') > etree.parse(f) > f.close() > > running it gives the following traceback: > AssertionError: All strings must be Unicode or ASCII This is fixed. 
The new behaviour is: Parsing a file object or file-like object reads the data in chunks from the object and checks each chunk if it's a plain string. If not, it raises a TypeError. Otherwise, it passes the bytes directly into libxml2. This means that parsing unicode strings from file-like objects is no longer supported, mainly due to the encoding sensing bug in libxml2. Parsing file-objects from the file system rather than the Python object could be made to work as in previous versions of lxml, but I really don't see why. If you want plain speed, pass a file name. If you want control over the data and the way it is read, pass a file(-like) object. Stefan From faassen at infrae.com Wed Jun 7 14:04:29 2006 From: faassen at infrae.com (Martijn Faassen) Date: Wed, 07 Jun 2006 14:04:29 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <4486BD88.4020907@gkec.informatik.tu-darmstadt.de> References: <44869E31.3070509@infrae.com> <4486BD88.4020907@gkec.informatik.tu-darmstadt.de> Message-ID: <4486C0CD.2020403@infrae.com> Stefan Behnel wrote: [snip bug] > This is fixed. The new behaviour is: Great! We need to do a 1.0.1 release soon then. :) > Parsing a file object or file-like object reads the data in chunks from the > object and checks each chunk if it's a plain string. If not, it raises a > TypeError. Otherwise, it passes the bytes directly into libxml2. > > This means that parsing unicode strings from file-like objects is no longer > supported, mainly due to the encoding sensing bug in libxml2. Okay, too bad but not a big loss. > Parsing file-objects from the file system rather than the Python object could > be made to work as in previous versions of lxml, but I really don't see why. > If you want plain speed, pass a file name. If you want control over the data > and the way it is read, pass a file(-like) object. Okay, understood. How does this deal with XSLT parsing where includes need to be resolved? Do you somehow pass the filepath along? 
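The chunk-wise reading described in this thread can be sketched roughly as follows. This is not lxml's actual implementation (which is written in Pyrex and feeds libxml2 directly), only an illustration in plain Python of the type check; the function name and the chunk size are assumptions, and "plain string" is spelled bytes as in current Python versions.

```python
import io

def iter_byte_chunks(fileobj, chunk_size=32768):
    """Yield byte chunks read from fileobj; reject anything that is not bytes."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not isinstance(chunk, bytes):
            # e.g. a file-like object whose read() returns unicode text
            raise TypeError("reading file objects must return plain strings")
        if not chunk:
            break  # end of data
        yield chunk

# A byte stream is passed through unchanged, chunk by chunk.
data = b"".join(iter_byte_chunks(io.BytesIO(b"<a>text</a>"), chunk_size=4))

# A text stream is rejected as soon as the first chunk is read.
try:
    list(iter_byte_chunks(io.StringIO(u"<a>text</a>")))
    rejected = False
except TypeError:
    rejected = True
```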
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Wed Jun 7 14:30:32 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Wed, 07 Jun 2006 14:30:32 +0200 Subject: [lxml-dev] parser bug in lxml 1.0 In-Reply-To: <4486C0CD.2020403@infrae.com> References: <44869E31.3070509@infrae.com> <4486BD88.4020907@gkec.informatik.tu-darmstadt.de> <4486C0CD.2020403@infrae.com> Message-ID: <4486C6E8.3050701@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Stefan Behnel wrote: > [snip bug] > >> This is fixed. The new behaviour is: > > Great! We need to do a 1.0.1 release soon then. :) I had planned that anyway. But there is still the mysterious crash that Petr encountered, and I don't want to release 1.0.1 until we made sure there won't necessarily be a 1.0.2 a week later. >> Parsing a file object or file-like object reads the data in chunks >> from the >> object and checks each chunk if it's a plain string. If not, it raises a >> TypeError. Otherwise, it passes the bytes directly into libxml2. >> >> This means that parsing unicode strings from file-like objects is no >> longer >> supported, mainly due to the encoding sensing bug in libxml2. > > Okay, too bad but not a big loss. I think so, too. >> Parsing file-objects from the file system rather than the Python >> object could >> be made to work as in previous versions of lxml, but I really don't >> see why. >> If you want plain speed, pass a file name. If you want control over >> the data >> and the way it is read, pass a file(-like) object. > > How does this deal with XSLT parsing where includes need to be resolved? > Do you somehow pass the filepath along? Yes. It uses mainly the same file-name-figuring-out code that you wrote. However, it's good you asked. You just made me aware of a second bug: loading external entities during parsing. This should also work *now*. It never worked for things like GZipFiles etc. but it stopped working in 1.0 for plain file objects. 
Oh, well... Stefan From johnny at johnnydebris.net Wed Jun 7 22:51:08 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Wed, 07 Jun 2006 22:51:08 +0200 Subject: [lxml-dev] Losing namespace information after deepcopy() Message-ID: <44873C3C.4030106@johnnydebris.net> Hello! This is new since 1.0: after deepcopying certain structures, namespace information gets lost. Attached is a small test case, if you run it you'll see that the 't' namespace is not displayed in a string representation of a copy of the node with the t:foo attribute. I'm really clueless what could cause this... Hope the snippet helps. Cheers, Guido P.S. Thanks for the file issue fixes, works like a charm now. -------------- next part -------------- A non-text attachment was scrubbed... Name: lxml_namespace_issue.py Type: text/x-python Size: 255 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20060607/00d8009e/attachment.py From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 00:56:16 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 00:56:16 +0200 Subject: [lxml-dev] Losing namespace information after deepcopy() In-Reply-To: <44873C3C.4030106@johnnydebris.net> References: <44873C3C.4030106@johnnydebris.net> Message-ID: <44875990.1070003@gkec.informatik.tu-darmstadt.de> Hi Johnny, Johnny deBris wrote: > This is new since 1.0: after deepcopying certain structures, namespace > information gets lost. Attached is a small test case, if you run it > you'll see that the 't' namespace is not displayed in a string > representation of a copy of the node with the t:foo attribute. I'm > really clueless what could cause this... Hope the snippet helps. Thanks for the report and the test case, I can reproduce this. It looks like copying fake root documents makes us lose namespace information.
I had changed the deepcopy method in 1.0 to use fake root documents to avoid creating new documents and get a 'more complete' copy of the original. Well, more data on one side, loss of information on the other. I also found that deepcopy did not previously copy the tail of an element. This is now ET compatible. Both bugs are fixed in the current trunk and the 1.0 branch. Deep copying now uses this function throughout lxml:
-------------------------------
cdef xmlDoc* _copyDocRoot(xmlDoc* c_doc, xmlNode* c_new_root):
    "Recursively copy the document and make c_new_root the new root node."
    cdef xmlDoc* result
    cdef xmlNode* c_node
    result = tree.xmlCopyDoc(c_doc, 2) # non recursive, but with ns
    __GLOBAL_PARSER_CONTEXT._initDocDict(result)
    c_node = tree.xmlDocCopyNode(c_new_root, result, 1) # recursive
    tree.xmlDocSetRootElement(result, c_node)
    if c_new_root.parent is not c_doc:
        # do not copy the tail for the root node - done automatically
        _copyTail(c_new_root.next, c_node)
    if c_doc.URL is not NULL:
        # handle a bug in older libxml2 versions
        if result.URL is not NULL:
            tree.xmlFree(result.URL)
        result.URL = tree.xmlStrdup(c_doc.URL)
    return result
-------------------------------
Regards, Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 10:00:39 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 10:00:39 +0200 Subject: [lxml-dev] Losing namespace information after deepcopy() In-Reply-To: <44875990.1070003@gkec.informatik.tu-darmstadt.de> References: <44873C3C.4030106@johnnydebris.net> <44875990.1070003@gkec.informatik.tu-darmstadt.de> Message-ID: <4487D927.5040401@gkec.informatik.tu-darmstadt.de> Hi again, Stefan Behnel wrote: > Both bugs are fixed in the current trunk and the 1.0 branch.
Deep copying now > uses this function throughout lxml: [snip] Another simplification, another stupid bug fix for the last implementation:
-------------------------------
cdef xmlDoc* _copyDocRoot(xmlDoc* c_doc, xmlNode* c_new_root):
    "Recursively copy the document and make c_new_root the new root node."
    cdef xmlDoc* result
    cdef xmlNode* c_node
    result = tree.xmlCopyDoc(c_doc, 0) # non recursive
    __GLOBAL_PARSER_CONTEXT._initDocDict(result)
    c_node = tree.xmlDocCopyNode(c_new_root, result, 1) # recursive
    tree.xmlDocSetRootElement(result, c_node)
    _copyTail(c_new_root.next, c_node)
    if c_doc.URL is not NULL:
        # handle a bug in older libxml2 versions
        if result.URL is not NULL:
            tree.xmlFree(result.URL)
        result.URL = tree.xmlStrdup(c_doc.URL)
    return result
-------------------------------
Late at night is not the best time to get bugs fixed ... Anyway, this works now as expected. Stefan From johnny at johnnydebris.net Thu Jun 8 10:06:40 2006 From: johnny at johnnydebris.net (Johnny deBris) Date: Thu, 08 Jun 2006 10:06:40 +0200 Subject: [lxml-dev] Losing namespace information after deepcopy() In-Reply-To: <4487D927.5040401@gkec.informatik.tu-darmstadt.de> References: <44873C3C.4030106@johnnydebris.net> <44875990.1070003@gkec.informatik.tu-darmstadt.de> <4487D927.5040401@gkec.informatik.tu-darmstadt.de> Message-ID: <4487DA90.80200@johnnydebris.net> Stefan Behnel wrote: > > Late at night is not the best time to get bugs fixed ... > No, late at night should be reserved for working out vague ideas that you throw away later... ;) > Anyway, this works now as expected. > My tests are happy again, and so am I! Great! Thanks again!
Cheers, Guido From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 12:27:47 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 12:27:47 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <36A849FD-E071-40DA-88C2-49D03C99BEFC@petr.com> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> <3C5E2AB8-66E7-474D-899F-7FC25974DE07@petr.com> <4485591C.9010804@gkec.informatik.tu-darmstadt.de> <44855C02.4040201@gkec.informatik.tu-darmstadt.de> <53884052-2D5A-42B1-9B34-ECA4F5A3DC5B@petr.com> <44867305.4090703@gkec.informatik.tu-darmstadt.de> <2F4BC6DF-609B-45D6-AE89-D524BA2250DC@petr.com> <4486DB49.9020002@gkec.informatik.tu-darmstadt.de> <4486E5B8.8080705@gkec.informatik.tu-darmstadt.de> <017CF475-E3A0-4718-BC3E-4E81BA719A3C@petr.com> <4486F899.5060400@gkec.informatik.tu-darmstadt.de> <36A849FD-E071-40DA-88C2-49D03C99BEFC@petr.com> Message-ID: <4487FBA3.3070401@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > > On Jun 7, 2006, at 6:02 PM, Stefan Behnel wrote: > >> Hi Petr, >> >> Petr van Blokland wrote: >>> The error remains, different for both versions though. >>> De different behaviour is reproducable. >> >> It is no difference what the error messages end with as long as they >> state the >> "unregistering unknown proxy" bit. The rest is simply trying to access >> invalid >> (= already freed) memory regions, which simply crashes in most cases. >> >> Still, thanks for testing. The sooner we have this resolved, the >> sooner I can >> release 1.0.1. >> >> Stefan >> > > If you change the > > in > > > in the template.xsl, then the template works alright. I committed a (rather large) patch to the trunk, not yet to the branch. It should fix the problem, please test it. I still have to come up with a suitable test case for the test suite. 
Stefan From mircea at ag-projects.com Thu Jun 8 13:47:24 2006 From: mircea at ag-projects.com (Mircea Amarascu) Date: Thu, 08 Jun 2006 14:47:24 +0300 Subject: [lxml-dev] XPath default namespace Message-ID: <44880E4C.3020007@ag-projects.com> Hello, I'm having this XML fragment: Text If I apply the following XPath expression on the parsed tree: tree.xpath('/a/b[@id="first"]') this will not return anything. Of course I should give a second argument, a dictionary that maps the namespace prefix to the namespace URI: tree.xpath('/a/b[@id="first"]', {'?': 'ns1'}) but I don't know what prefix to use for the default namespace (None doesn't work, nor the empty string). What are the best sources for documentation for lxml's XPath and XMLSchema features? The answer to the XPath question is probably common knowledge for the rest of you, however I've posted the question here as I couldn't find detailed info regarding these topics. Thank you for your time. From elephantum at cyberzoo.ru Thu Jun 8 13:51:21 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Thu, 08 Jun 2006 15:51:21 +0400 Subject: [lxml-dev] XPath default namespace In-Reply-To: <44880E4C.3020007@ag-projects.com> References: <44880E4C.3020007@ag-projects.com> Message-ID: <1149767481.6676.40.camel@zoo.yandex.ru> On Thu, 2006-06-08 at 14:47 +0300, Mircea Amarascu wrote: > Hello, > > I'm having this XML fragment: > > > Text > > > If I aplly the following XPath expression on the parsed tree: > > tree.xpath('/a/b[@id="first"]') > > this will not return anything. > > Of course I should give a second argument, a dictionary that maps the > namespace prefix to the namespace URI: > tree.xpath('/a/b[@id="first"]', {'?': 'ns1'}) It has nothing to do with lxml, just with the XPath evaluation. use tree.xpath('/x:a/b[@id="first"]', {'x': 'ns1'}) and you will get what you want.
From elephantum at cyberzoo.ru Thu Jun 8 13:54:24 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Thu, 08 Jun 2006 15:54:24 +0400 Subject: [lxml-dev] XPath default namespace In-Reply-To: <1149767481.6676.40.camel@zoo.yandex.ru> References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> Message-ID: <1149767664.6676.42.camel@zoo.yandex.ru> On Thu, 2006-06-08 at 15:51 +0400, Andrey Tatarinov wrote: > On Thu, 2006-06-08 at 14:47 +0300, Mircea Amarascu wrote: > > Hello, > > > > I'm having this XML fragment: > > > > > > Text > > > > > > If I aplly the following XPath expression on the parsed tree: > > > > tree.xpath('/a/b[@id="first"]') > > > > this will not return anything. > > > > Of course I should give a second argument, a dictionary that maps the > > namespace prefix to the namespace URI: > > tree.xpath('/a/b[@id="first"]', {'?': 'ns1'}) > > It has nothing to do with lxml, just with the XPath evaluation. > use > > tree.xpath('/x:a/b[@id="first"]', {'x': 'ns1'}) oops, of course I've meant .../x:a/x:b[... > and you will get what you want. 
From mircea at ag-projects.com Thu Jun 8 14:04:00 2006 From: mircea at ag-projects.com (Mircea Amarascu) Date: Thu, 08 Jun 2006 15:04:00 +0300 Subject: [lxml-dev] XPath default namespace In-Reply-To: <1149767664.6676.42.camel@zoo.yandex.ru> References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> Message-ID: <44881230.6070606@ag-projects.com> Andrey Tatarinov wrote: > On Thu, 2006-06-08 at 15:51 +0400, Andrey Tatarinov wrote: > >> On Thu, 2006-06-08 at 14:47 +0300, Mircea Amarascu wrote: >> >>> Hello, >>> >>> I'm having this XML fragment: >>> >>> >>> Text >>> >>> >>> If I aplly the following XPath expression on the parsed tree: >>> >>> tree.xpath('/a/b[@id="first"]') >>> >>> this will not return anything. >>> >>> Of course I should give a second argument, a dictionary that maps the >>> namespace prefix to the namespace URI: >>> tree.xpath('/a/b[@id="first"]', {'?': 'ns1'}) >>> >> It has nothing to do with lxml, just with the XPath evaluation. >> use >> >> tree.xpath('/x:a/b[@id="first"]', {'x': 'ns1'}) >> > > oops, of course I've meant > > .../x:a/x:b[.. Yes, I know the tree.xpath('/x:a/x:b[@id="first"]', {'x': 'ns1'}) approach works, however I was curious if there's a way not to alter the XPath expression (the 'x' prefix is not specifically defined in the XML document, so this looked to me like a workaround). Thanks a lot!
:) From elephantum at cyberzoo.ru Thu Jun 8 14:08:13 2006 From: elephantum at cyberzoo.ru (Andrey Tatarinov) Date: Thu, 08 Jun 2006 16:08:13 +0400 Subject: [lxml-dev] XPath default namespace In-Reply-To: <44881230.6070606@ag-projects.com> References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> <44881230.6070606@ag-projects.com> Message-ID: <1149768493.6676.46.camel@zoo.yandex.ru> On Thu, 2006-06-08 at 15:04 +0300, Mircea Amarascu wrote: > Andrey Tatarinov wrote: > > On Thu, 2006-06-08 at 15:51 +0400, Andrey Tatarinov wrote: > > > >> On Thu, 2006-06-08 at 14:47 +0300, Mircea Amarascu wrote: > >> > >>> Hello, > >>> > >>> I'm having this XML fragment: > >>> > >>> > >>> Text > >>> > >>> > >>> If I aplly the following XPath expression on the parsed tree: > >>> > >>> tree.xpath('/a/b[@id="first"]') > >>> > >>> this will not return anything. > >>> > >>> Of course I should give a second argument, a dictionary that maps the > >>> namespace prefix to the namespace URI: > >>> tree.xpath('/a/b[@id="first"]', {'?': 'ns1'}) > >>> > >> It has nothing to do with lxml, just with the XPath evaluation. > >> use > >> > >> tree.xpath('/x:a/b[@id="first"]', {'x': 'ns1'}) > >> > > > > oops, of course I've meant > > > > .../x:a/x:b[.. > Yes, I know the > > tree.xpath('/x:a/x:b[@id="first"]', {'x': 'ns1'}) > > approach works, however I was curious if there's a way not to alter the > Xpath expression (the 'x' prefix is not specificly defined in the XML > document, so this looked to me like a workaround). Thanks a lot! :) Namespace declarations in XPath have nothing to do with namespace declarations in the XML document. In the general case you will not know which namespaces are declared in the document and which prefixes are used.
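A small sketch of this point: the prefix used in the expression is only a local name for the namespace URI and can be chosen freely. The example uses the standard library's xml.etree.ElementTree and its findall() with a prefix map (an assumption, so that it runs without lxml; with lxml the same idea applies to the xpath() call from the thread). The document below is hypothetical, written in the spirit of the fragment discussed above, whose markup did not survive in the archive.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both elements live in the default namespace 'ns1'.
doc = '<a xmlns="ns1"><b id="first">Text</b></a>'
root = ET.fromstring(doc)

# An unprefixed name only matches elements in *no* namespace,
# so this search finds nothing.
no_prefix = root.findall('b[@id="first"]')

# Any prefix works, as long as the caller maps it to the right URI.
with_x = root.findall('x:b[@id="first"]', {'x': 'ns1'})
with_other = root.findall('whatever:b[@id="first"]', {'whatever': 'ns1'})

found_text = with_x[0].text
```

Neither 'x' nor 'whatever' appears anywhere in the document; only the URI has to match.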
From faassen at infrae.com Thu Jun 8 14:23:12 2006 From: faassen at infrae.com (Martijn Faassen) Date: Thu, 08 Jun 2006 14:23:12 +0200 Subject: [lxml-dev] XPath default namespace In-Reply-To: <44881230.6070606@ag-projects.com> References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> <44881230.6070606@ag-projects.com> Message-ID: <448816B0.1090101@infrae.com> Mircea Amarascu wrote: [snip] > Yes, I know the > > tree.xpath('/x:a/x:b[@id="first"]', {'x': 'ns1'}) > > approach works, however I was curious if there's a way not to alter the > Xpath expression (the 'x' prefix is not specificly defined in the XML > document, so this looked to me like a workaround). Thanks a lot! :) It's not a workaround but the right way to do it, though admittedly this is something that trips up many users. The XPath recommendation specifies that an unprefixed name in XPath only matches elements (and attributes) without a namespace. Since you cannot use namespace URIs directly in XPath expressions (Clark notation is not supported), in order to find elements defined in a namespace, you'll have to map a prefix to a namespace URI first. It *might* be possible to create a convenience API in lxml that digs up prefix -> namespace URI mappings from the document so you can create this mapping automatically. How this API would work is trickier than it seems at first, though, as the same namespace prefix could mean entirely different namespaces in an XML document. It might also not be very efficient, as namespace definitions could exist on any element.
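Both caveats can be seen in a rough sketch of such a lookup. This is not an lxml API; it uses the "start-ns" events of the standard library's xml.etree.ElementTree.iterparse() (an assumption, so that the example runs without lxml), and the document and variable names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix 'p' is declared twice, with different URIs.
doc = b'<r xmlns:p="ns1"><p:a/><s xmlns:p="ns2"><p:a/></s></r>'

# Declarations can sit on any element, so the whole document has to be
# walked to collect them all.
declarations = []
for event, item in ET.iterparse(io.BytesIO(doc), events=("start-ns",)):
    # For "start-ns" events, item is a (prefix, uri) tuple.
    declarations.append(item)

# Group the URIs by prefix to spot prefixes that are ambiguous.
by_prefix = {}
for prefix, uri in declarations:
    by_prefix.setdefault(prefix, []).append(uri)

ambiguous = sorted(p for p, uris in by_prefix.items() if len(set(uris)) > 1)
```

Here the prefix 'p' cannot be turned into a single prefix -> URI entry automatically, which is why the caller is asked to supply the mapping.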
Regards, Martijn From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 14:43:30 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 14:43:30 +0200 Subject: [lxml-dev] XPath default namespace In-Reply-To: <448816B0.1090101@infrae.com> References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> <44881230.6070606@ag-projects.com> <448816B0.1090101@infrae.com> Message-ID: <44881B72.8090706@gkec.informatik.tu-darmstadt.de> Hi Martijn, Martijn Faassen wrote: > Mircea Amarascu wrote: > [snip] >> Yes, I know the >> >> tree.xpath('/x:a/x:b[@id="first"]', {'x': 'ns1'}) >> >> approach works, however I was curious if there's a way not to alter the >> Xpath expression (the 'x' prefix is not specificly defined in the XML >> document, so this looked to me like a workaround). Thanks a lot! :) > > It's not a workaround but the right way to do it, though admittedly this > is something that trips up many users. I've added a FAQ entry on this in branch and trunk. You can update the FAQ from the branch. 
Stefan From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 15:14:55 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 15:14:55 +0200 Subject: [lxml-dev] Bus Error in external XPath function In-Reply-To: <680C2EE7-0A1F-4643-AFEE-7C31907715E6@petr.com> References: <44851A87.5020708@gkec.informatik.tu-darmstadt.de> <3C5E2AB8-66E7-474D-899F-7FC25974DE07@petr.com> <4485591C.9010804@gkec.informatik.tu-darmstadt.de> <44855C02.4040201@gkec.informatik.tu-darmstadt.de> <53884052-2D5A-42B1-9B34-ECA4F5A3DC5B@petr.com> <44867305.4090703@gkec.informatik.tu-darmstadt.de> <2F4BC6DF-609B-45D6-AE89-D524BA2250DC@petr.com> <4486DB49.9020002@gkec.informatik.tu-darmstadt.de> <4486E5B8.8080705@gkec.informatik.tu-darmstadt.de> <017CF475-E3A0-4718-BC3E-4E81BA719A3C@petr.com> <4486F899.5060400@gkec.informatik.tu-darmstadt.de> <36A849FD-E071-40DA-88C2-49D03C99BEFC@petr.com> <4487FBA3.3070401@gkec.informatik.tu-darmstadt.de> <680C2EE7-0A1F-4643-AFEE-7C31907715E6@petr.com> Message-ID: <448822CF.404@gkec.informatik.tu-darmstadt.de> Hi Petr, Petr van Blokland wrote: > On Jun 8, 2006, at 12:27 PM, Stefan Behnel wrote: >> I committed a (rather large) patch to the trunk, not yet to the >> branch. It >> should fix the problem, please test it. I still have to come up with a >> suitable test case for the test suite. > > That seems to work much better. Thanks. I am going to test is further > in the real server. But at first glance it works. Great, finally. > Just interested: what kind of problem was it? Garbage collection? Sorry, I normally write a bit more about these things, but I was just in a hurry and wanted to get a mail out to have you test it. So here's a longer explanation. It was quite a bit of work to get that fixed. I had replaced the extension function call registry stuff with a generic run-time lookup function in 1.0. However, libxslt seems to have problems with that when the call is coming from an included stylesheet. 
So the current test case I have is: * a stylesheet that builds a variable value from an extension function call * a second stylesheet that includes and uses the first one * the lxml code that parses the second stylesheet and applies it to a tree The result is that the stylesheet is broken after the first run and subsequent runs no longer work as expected. I have no idea why that happens, no idea at all. I assume that it is a bug in libxslt, but I can't tell for sure. All I know is that it does *not* break when the extension function is registered with libxslt when it is called. So, basically what I did is, I rewrote the code to use pre-registered extension functions as before. So now, there still is a bit of cleanup to be done, also in the XPath code. But the bug should be fixed. Stefan From ogrisel at nuxeo.com Thu Jun 8 15:37:56 2006 From: ogrisel at nuxeo.com (Olivier Grisel) Date: Thu, 08 Jun 2006 15:37:56 +0200 Subject: [lxml-dev] using pyflakes Message-ID: Hi Stefan and others, I have just checked in a Makefile target to run pyflakes on the lxml source tree: $ sudo easy_install pyflakes $ make lint pyflakes . 
./selftest2.py:7: 'sys' imported but unused
./selftest2.py:7: 'StringIO' imported but unused
./setup.py:1: redefinition of unused 'os' from line 1
./setup.py:10: redefinition of unused 'setup' from line 5
./setup.py:11: redefinition of unused 'Extension' from line 6
./bench.py:2: 'from itertools import *' used; unable to detect undefined names
./bench.py:632: redefinition of unused 'time' from line 1
./selftest.py:12: 'sys' imported but unused
./selftest.py:12: 'StringIO' imported but unused
./update-error-constants.py:3: redefinition of unused 'os' from line 3
./src/doctest.py:98: 'types' imported but unused
./src/lxml/tests/common_imports.py:11: redefinition of unused 'ElementTree' from line 9
./src/lxml/tests/test_htmlparser.py:10: 'fileInTestDir' imported but unused
./src/lxml/tests/test_etree.py:14: 'SillyFileLike' imported but unused
./src/lxml/tests/test_errors.py:2: 'doctest' imported but unused
./src/lxml/tests/test_unicode.py:2: 'doctest' imported but unused
./src/lxml/tests/test_unicode.py:4: 'SillyFileLike' imported but unused
./src/lxml/tests/test_io.py:10: 'fileInTestDir' imported but unused
./src/lxml/tests/test_elementtree.py:11: 'doctest' imported but unused
./src/lxml/tests/test_elementtree.py:14: 'HelperTestCase' imported but unused
pylint and pychecker might be useful too but pyflakes is simple and very fast. pylint is much more verbose and pychecker has to import the modules to do its work and thus needs in-place compilation first. -- Olivier From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 15:49:40 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 15:49:40 +0200 Subject: [lxml-dev] using pyflakes In-Reply-To: References: Message-ID: <44882AF4.70209@gkec.informatik.tu-darmstadt.de> Hi Olivier, Olivier Grisel wrote: > I have just checked in a Makefile target to run pyflakes on the lxml source tree: Hmm, may I ask, why? There isn't much code to test here.
Mostly everything that's interesting is in Pyrex code, which is not checkable by any Python *lint* I have heard of so far.

> ./selftest2.py:7: 'sys' imported but unused
> ./selftest2.py:7: 'StringIO' imported but unused
> ./selftest.py:12: 'sys' imported but unused
> ./selftest.py:12: 'StringIO' imported but unused

We didn't write those, and most likely the reason for the two is that some warnings were commented out in them.

> ./setup.py:1: redefinition of unused 'os' from line 1
> ./setup.py:10: redefinition of unused 'setup' from line 5
> ./setup.py:11: redefinition of unused 'Extension' from line 6

Those warnings are misleading. The names are imported twice, depending on ImportError.

> ./bench.py:2: 'from itertools import *' used; unable to detect undefined names
> ./bench.py:632: redefinition of unused 'time' from line 1
> ./update-error-constants.py:3: redefinition of unused 'os' from line 3
> ./src/doctest.py:98: 'types' imported but unused
> ./src/lxml/tests/common_imports.py:11: redefinition of unused 'ElementTree' from
> line 9
> ./src/lxml/tests/test_htmlparser.py:10: 'fileInTestDir' imported but unused
> ./src/lxml/tests/test_etree.py:14: 'SillyFileLike' imported but unused
> ./src/lxml/tests/test_errors.py:2: 'doctest' imported but unused
> ./src/lxml/tests/test_unicode.py:2: 'doctest' imported but unused
> ./src/lxml/tests/test_unicode.py:4: 'SillyFileLike' imported but unused
> ./src/lxml/tests/test_io.py:10: 'fileInTestDir' imported but unused
> ./src/lxml/tests/test_elementtree.py:11: 'doctest' imported but unused
> ./src/lxml/tests/test_elementtree.py:14: 'HelperTestCase' imported but unused

I don't think any of these are relevant. And they don't give me the impression that pyflakes is a reasonably sophisticated tool either.

I'll revert the change.
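
The setup.py warnings quoted above come from a fallback import: the same name is imported once in a try block and again in the matching except ImportError block, so only one of the two imports ever takes effect. A minimal sketch of that pattern follows. The module name fast_json_assumed is made up for illustration and is not taken from lxml's setup.py:

```python
# Fallback import: both branches bind the same name, but only one of them runs.
try:
    # Hypothetical preferred module (assumption: it is not installed).
    from fast_json_assumed import loads
except ImportError:
    # Standard-library fallback that is always available.
    from json import loads

# Whichever branch ran, 'loads' is bound exactly once at this point.
print(loads('{"answer": 42}')["answer"])
```

A checker that only counts bindings sees two imports of the same name and may report the second as a redefinition, although the code is doing what was intended.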
Stefan

From ogrisel at nuxeo.com Thu Jun 8 15:57:14 2006
From: ogrisel at nuxeo.com (Olivier Grisel)
Date: Thu, 08 Jun 2006 15:57:14 +0200
Subject: [lxml-dev] using pyflakes
In-Reply-To: <44882AF4.70209@gkec.informatik.tu-darmstadt.de>
References: <44882AF4.70209@gkec.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel wrote:
> Hi Olivier,
>
> Olivier Grisel wrote:
>> I have just checked in a Makefile target to run pyflakes on the lxml source tree:
>
> Hmm, may I ask, why? There isn't much code to test here. Mostly everything
> that's interesting is in Pyrex code, which is not checkable by any Python
> *lint* I heard of so far.
>
[snip]
> I don't think any of these are relevant. And they don't give me the impression
> that pyflakes is a reasonably sophisticated tool either.
>
> I'll revert the change.

OK, as you wish. I thought about pyflakes when I saw your last commit on mkhtml.py with the useless shutil import. I find it quite handy to run on Python projects before launching the tests. It's quite fast and finds most of the variable name typos that a compiler would catch in a statically typed language. But as you said, it does not work on Pyrex files. Maybe pychecker is a better choice for lxml.
--
Olivier

From mircea at ag-projects.com Thu Jun 8 16:08:11 2006
From: mircea at ag-projects.com (Mircea Amarascu)
Date: Thu, 08 Jun 2006 17:08:11 +0300
Subject: [lxml-dev] XPath default namespace
In-Reply-To: <448816B0.1090101@infrae.com>
References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> <44881230.6070606@ag-projects.com> <448816B0.1090101@infrae.com>
Message-ID: <44882F4B.8060705@ag-projects.com>

Martijn Faassen wrote:
> Mircea Amarascu wrote:
> [snip]
>> Yes, I know the
>>
>> tree.xpath('/x:a/x:b[@id="first"]', {'x': 'ns1'})
>>
>> approach works, however I was curious if there's a way not to alter the
>> XPath expression (the 'x' prefix is not specifically defined in the XML
>> document, so this looked to me like a workaround). Thanks a lot! :)
>
> It's not a workaround but the right way to do it, though admittedly
> this is something that trips up many users.
>

I see that the behaviour of XPath 1.0 is that if a QName does not have a prefix, the namespace in that axis is null (http://www.w3.org/TR/xpath#node-tests). This seems to have changed in XPath 2.0 to the default element namespace in the expression context (http://www.w3.org/TR/xpath20/#node-tests).

So my understanding is that for the following XML:

the expression "/root/bar" would select the first bar element in XPath 2.0, and no element in XPath 1.0.

And since lxml (libxml2) implements XPath 1.0, things work as expected.

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 16:10:50 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 08 Jun 2006 16:10:50 +0200
Subject: [lxml-dev] using pyflakes
In-Reply-To:
References: <44882AF4.70209@gkec.informatik.tu-darmstadt.de>
Message-ID: <44882FEA.8010600@gkec.informatik.tu-darmstadt.de>

Hi Olivier,

Olivier Grisel wrote:
> I thought about pyflakes when I saw your last commit on mkhtml.py
> with the useless shutil import.
Sure, that became superfluous then. Not too much of a worry, though. That file is rarely run by normal users.

> I find it quite handy on Python projects to run
> before launching the tests. It's quite fast to run and finds most variable name
> typos that a usual compiler would find on statically typed languages. But as you
> said, it does not work on Pyrex files. Maybe pychecker is a better choice for lxml.

I never heard anything about pychecker being able to parse Pyrex code either.

The problem is, when you add these things as Makefile targets, people will think we actually use them and start telling us about 'bugs' that these tools appear to have found.

Stefan

From apaku at gmx.de Thu Jun 8 16:47:18 2006
From: apaku at gmx.de (Andreas Pakulat)
Date: Thu, 8 Jun 2006 16:47:18 +0200
Subject: [lxml-dev] Improved docstrings for etree.pyx
Message-ID: <20060608144718.GA17544@morpheus.apaku.dnsalias.org>

Hi,

attached is a patch that adds docstrings to those methods of Element and ElementTree that didn't have any. I also changed some docstrings to use """ instead of just " to make them consistent with the rest.

Most of the strings are "copied" from the official ElementTree API. I did not include any reference to the parameters, because help() doesn't show the parameters here anyway.

The diff is against revision 28530 of trunk.

Andreas

--
Afternoon very favorable for romance. Try a single person for a change.
-------------- next part --------------
diff -x .svn -ur lxml.org/src/lxml/etree.pyx lxml/src/lxml/etree.pyx --- lxml.org/src/lxml/etree.pyx 2006-06-08 16:41:16.000000000 +0200 +++ lxml/src/lxml/etree.pyx 2006-06-08 16:41:47.000000000 +0200 @@ -382,6 +382,8 @@ return self._context_node def getroot(self): + """Gets the root element for this tree.
+ """ return self._context_node property docinfo: @@ -417,6 +419,8 @@ c_write_declaration, bool(pretty_print)) def getpath(self, _NodeBase element not None): + """Returns a structural, absolute XPath expression to find that element. + """ cdef _Document doc cdef xmlDoc* c_doc cdef char* c_path @@ -433,12 +437,17 @@ return path def getiterator(self, tag=None): + """Creates an iterator for the root element. The iterator loops over all elements + in this tree, in document order. + """ root = self.getroot() if root is None: return () return root.getiterator(tag) def find(self, path): + """Finds the first toplevel element with given tag. Same as getroot().find(path). + """ self._assertHasRoot() root = self.getroot() if path[:1] == "/": @@ -446,6 +455,8 @@ return root.find(path) def findtext(self, path, default=None): + """Finds the element text for the first toplevel element with given tag. Same as getroot().findtext(path) + """ self._assertHasRoot() root = self.getroot() if path[:1] == "/": @@ -453,6 +464,8 @@ return root.findtext(path, default) def findall(self, path): + """Finds all toplevel elements with the given tag. Same as getroot().findall(path). + """ self._assertHasRoot() root = self.getroot() if path[:1] == "/": @@ -574,6 +587,8 @@ # MANIPULATORS def __setitem__(self, Py_ssize_t index, _NodeBase element not None): + """Replaces the given subelement. + """ cdef xmlNode* c_node cdef xmlNode* c_next c_node = _findChild(self._c_node, index) @@ -587,6 +602,8 @@ attemptDeallocation(c_node) def __delitem__(self, Py_ssize_t index): + """Deletes the given subelement. + """ cdef xmlNode* c_node c_node = _findChild(self._c_node, index) if c_node is NULL: @@ -595,11 +612,16 @@ _removeNode(c_node) def __delslice__(self, Py_ssize_t start, Py_ssize_t stop): + """Deletes a number of subelements. 
+ """ cdef xmlNode* c_node c_node = _findChild(self._c_node, start) _deleteSlice(c_node, start, stop) def __setslice__(self, Py_ssize_t start, Py_ssize_t stop, value): + """Replaces a number of subelements with elements + from a sequence. + """ cdef xmlNode* c_node cdef xmlNode* c_next cdef _Element mynode @@ -641,9 +663,14 @@ return new_doc.getroot() def set(self, key, value): + """Sets an element attribute. + """ _setAttributeValue(self, key, value) def append(self, _Element element not None): + """ + Adds a subelement to the end of this element. + """ cdef xmlNode* c_next cdef xmlNode* c_node c_node = element._c_node @@ -659,6 +686,10 @@ moveNodeToDocument(element, self._doc) def clear(self): + """Resets an element. This function removes all subelements, + clears all attributes and sets the text and tail + attributes to None. + """ cdef xmlAttr* c_attr cdef xmlAttr* c_attr_next cdef xmlNode* c_node @@ -684,6 +715,8 @@ c_node = c_node_next def insert(self, index, _Element element not None): + """Inserts a subelement at the given position in this element + """ cdef xmlNode* c_node cdef xmlNode* c_next c_node = _findChild(self._c_node, index) @@ -696,6 +729,10 @@ moveNodeToDocument(element, self._doc) def remove(self, _Element element not None): + """Removes a matching subelement. Unlike the find methods, this + method compares elements based on identity, not on tag value + or contents. + """ cdef xmlNode* c_node c_node = element._c_node if c_node.parent is not self._c_node: @@ -705,6 +742,8 @@ # PROPERTIES property tag: + """Element tag + """ def __get__(self): if self._tag is not None: return self._tag @@ -722,6 +761,8 @@ # not in ElementTree, read-only property prefix: + """Namespace Prefix or None. + """ def __get__(self): if self._c_node.ns is not NULL: if self._c_node.ns.prefix is not NULL: @@ -729,6 +770,9 @@ return None property attrib: + """Element attribute dictionary. Where possible, use + get, set, keys and items to access element attributes. 
+ """ def __get__(self): # do *NOT* keep a reference here to prevent cyclic dependencies # this would free the element in the Cyclic GC, which might let @@ -736,6 +780,9 @@ return _Attrib(self) property text: + """Text before the first subelement. This is either a string or + the value None, if there was no text + """ def __get__(self): return _collectText(self._c_node.children) @@ -756,6 +803,10 @@ c_text_node) property tail: + """Text after this element's end tag, but before the next sibling + element's start tag. This is either a string or the value None, if + there was no text. + """ def __get__(self): return _collectText(self._c_node.next) @@ -775,6 +826,8 @@ return "" % (self.tag, id(self)) def __getitem__(self, Py_ssize_t index): + """Returns the given subelement. + """ cdef xmlNode* c_node c_node = _findChild(self._c_node, index) if c_node is NULL: @@ -782,6 +835,8 @@ return _elementFactory(self._doc, c_node) def __getslice__(self, Py_ssize_t start, Py_ssize_t stop): + """Returns a list containing subelements in the given range. + """ cdef xmlNode* c_node cdef _Document doc cdef Py_ssize_t c @@ -804,6 +859,8 @@ return result def __len__(self): + """Returns the number of subelements. + """ cdef Py_ssize_t c cdef xmlNode* c_node c = 0 @@ -905,16 +962,25 @@ raise ValueError, "list.index(x): x not in list" def get(self, key, default=None): + """Gets an element attribute. + """ return _getAttributeValue(self, key, default) def keys(self): + """Gets a list of attribute names. The names are returned in an arbitrary + order (just like for an ordinary Python dictionary). + """ return self.attrib.keys() def items(self): + """Gets element attributes, as a sequence. The attributes are returned in + an arbitrary order. + """ return self.attrib.items() def getchildren(self): - "Return a list with all children of this element." + """Returns all subelements. The elements are returned in document order. 
+ """ cdef xmlNode* c_node cdef _Document doc cdef int ret @@ -930,7 +996,8 @@ return result def getparent(self): - "Returns the parent of this element or None for the root element" + """Returns the parent of this element or None for the root element. + """ cdef xmlNode* c_node c_node = _parentElement(self._c_node) if c_node is NULL: @@ -939,7 +1006,8 @@ return _elementFactory(self._doc, c_node) def getnext(self): - "Returns the following sibling of this element or None" + """Returns the following sibling of this element or None. + """ cdef xmlNode* c_node c_node = _nextElement(self._c_node) if c_node is not NULL: @@ -947,7 +1015,8 @@ return None def getprevious(self): - "Returns the preceding sibling of this element or None" + """Returns the preceding sibling of this element or None. + """ cdef xmlNode* c_node c_node = _previousElement(self._c_node) if c_node is not NULL: @@ -962,7 +1031,8 @@ return SiblingsIterator(self, preceding) def iterancestors(self): - "Iterate over the ancestors of this element (from parent to parent)." + """Iterate over the ancestors of this element (from parent to parent). + """ return AncestorsIterator(self) def iterdescendants(self): @@ -994,7 +1064,8 @@ return ElementDepthFirstIterator(self, tag) def makeelement(self, _tag, attrib=None, nsmap=None, **_extra): - "Creates a new element associated with the same document." + """Creates a new element associated with the same document. + """ # a little code duplication, but less overhead through doc reuse cdef xmlNode* c_node cdef xmlDoc* c_doc @@ -1009,15 +1080,23 @@ return _elementFactory(doc, c_node) def find(self, path): + """Finds the first matching subelement, by tag name or path. + """ return _elementpath.find(self, path) def findtext(self, path, default=None): + """Finds text for the first matching subelement, by tag name or path. + """ return _elementpath.findtext(self, path, default) def findall(self, path): + """Finds all matching subelements, by tag name or path. 
+ """ return _elementpath.findall(self, path) def xpath(self, _path, namespaces=None, extensions=None, **_variables): + """Evaluate an xpath expression using the element as context node. + """ evaluator = XPathElementEvaluator(self, namespaces, extensions) return evaluator.evaluate(_path, **_variables) @@ -1409,6 +1488,8 @@ # module-level API for ElementTree def Element(_tag, attrib=None, nsmap=None, **_extra): + """Element factory. This function returns an object implementing the Element interface. + """ cdef xmlNode* c_node cdef xmlDoc* c_doc cdef _Document doc @@ -1423,6 +1504,9 @@ return _elementFactory(doc, c_node) def Comment(text=None): + """Comment element factory. This factory function creates a special element that will + be serialized as an XML comment. + """ cdef _Document doc cdef xmlNode* c_node cdef xmlDoc* c_doc @@ -1438,6 +1522,9 @@ def SubElement(_Element _parent not None, _tag, attrib=None, nsmap=None, **_extra): + """Subelement factory. This function creates an element instance, and appends it to an + existing element. + """ cdef xmlNode* c_node cdef _Document doc ns_utf, name_utf = _getNsTag(_tag) @@ -1450,6 +1537,8 @@ return _elementFactory(doc, c_node) def ElementTree(_Element element=None, file=None, _BaseParser parser=None): + """ElementTree wrapper class. + """ cdef xmlNode* c_next cdef xmlNode* c_node cdef xmlNode* c_node_copy @@ -1468,11 +1557,17 @@ return _elementTreeFactory(doc, element) def HTML(text): + """Parses an HTML document from a string constant. This function can be used + to embed "HTML literals" in Python code. + """ cdef _Document doc doc = _parseMemoryDocument(text, None, __DEFAULT_HTML_PARSER) return doc.getroot() def XML(text): + """Parses an XML document from a string constant. This function can be used + to embed "XML literals" in Python code. 
+ """ cdef _Document doc doc = _parseMemoryDocument(text, None, __DEFAULT_XML_PARSER) return doc.getroot() @@ -1480,6 +1575,8 @@ fromstring = XML cdef class QName: + """QName wrapper. + """ cdef readonly object text def __init__(self, text_or_uri, tag=None): if tag is not None: @@ -1493,9 +1590,14 @@ return self.text.__hash__() def iselement(element): + """Checks if an object appears to be a valid element object. + """ return isinstance(element, _Element) def dump(_NodeBase elem not None, pretty_print=True): + """Writes an element tree or element structure to sys.stdout. This function + should be used for debugging only. + """ _dumpToFile(sys.stdout, elem._c_node, bool(pretty_print)) def tostring(element_or_tree, encoding=None, From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 16:52:12 2006 From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel) Date: Thu, 08 Jun 2006 16:52:12 +0200 Subject: [lxml-dev] Improved docstrings for etree.pyx In-Reply-To: <20060608144718.GA17544@morpheus.apaku.dnsalias.org> References: <20060608144718.GA17544@morpheus.apaku.dnsalias.org> Message-ID: <4488399C.1010804@gkec.informatik.tu-darmstadt.de> Hi Andreas, Andreas Pakulat wrote: > attached is a patch that adds docstrings to those methods of Element and > Elementtree that didn't have any. I also changed some docstrings to use > """ instead of just " to make it consistent with the rest. > > Most of the strings are "copied" from the official element tree api. I > did not include any reference to the parameters, because help() doesn't > show the parameteres here anyway. > > The diff is against revision 28530 of trunk. Great. That was just in time. I was already working on getting 1.0.1 ready for release - and there we go with the doc updates. I'll look through it and then see how I get it applied to the branch to merge it into 1.0.1. 
Thanks a million,
Stefan

From apaku at gmx.de Thu Jun 8 16:52:52 2006
From: apaku at gmx.de (Andreas Pakulat)
Date: Thu, 8 Jun 2006 16:52:52 +0200
Subject: [lxml-dev] make html doesn't work due to rest2html being renamed to rst2html
Message-ID: <20060608145252.GA17773@morpheus.apaku.dnsalias.org>

Hi,

I just executed make html and got:

andreas at morpheus:~/compiling/python/xml/lxml>make html
mkdir -p doc/html
python doc/mkhtml.py doc/html . `cat version.txt`
sh: rest2html: command not found

:-( I guess docutils renamed rest2html to rst2html some time ago; at least my docutils package, version 0.4, uses rst2html.

Of course I have no problem renaming rest2html in mkhtml.py myself, but others might.

Andreas

--
You'll feel devilish tonight. Toss dynamite caps under a flamenco dancer's heel.

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 16:59:48 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 08 Jun 2006 16:59:48 +0200
Subject: [lxml-dev] make html doesn't work due to rest2html being renamed to rst2html
In-Reply-To: <20060608145252.GA17773@morpheus.apaku.dnsalias.org>
References: <20060608145252.GA17773@morpheus.apaku.dnsalias.org>
Message-ID: <44883B64.3080905@gkec.informatik.tu-darmstadt.de>

Hi Andreas,

Andreas Pakulat wrote:
> I just executed make html and got:
>
> andreas at morpheus:~/compiling/python/xml/lxml>make html
> mkdir -p doc/html
> python doc/mkhtml.py doc/html . `cat version.txt`
> sh: rest2html: command not found
>
> :-( I guess docutils renamed rest2html to rst2html some time ago, at
> least my docutils package version 0.4 uses rst2html.

Actually, it's even called "rst2html.py" in my installation. Anyway, that script is some 25 lines long and in the public domain, so I guess the best way to deal with this is to merge it into the mkhtml.py script.
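
Since the docutils HTML front end shows up under three names in this thread (rest2html, rst2html and rst2html.py), an alternative to merging the script into mkhtml.py would be to probe for whichever name is installed. A minimal sketch, assuming a current Python with only the standard library; find_rst2html and CANDIDATES are hypothetical names and not part of lxml's mkhtml.py:

```python
import shutil

# Names under which the docutils HTML front end is mentioned in this thread.
CANDIDATES = ("rst2html", "rst2html.py", "rest2html")


def find_rst2html(candidates=CANDIDATES):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    # Report what was found instead of failing with "command not found".
    print(find_rst2html() or "no docutils HTML front end found on PATH")
```

Probing keeps the build script independent of how a particular docutils package names its front end, at the cost of still needing an external command.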
Stefan

From faassen at infrae.com Thu Jun 8 17:13:25 2006
From: faassen at infrae.com (Martijn Faassen)
Date: Thu, 08 Jun 2006 17:13:25 +0200
Subject: [lxml-dev] XPath default namespace
In-Reply-To: <44882F4B.8060705@ag-projects.com>
References: <44880E4C.3020007@ag-projects.com> <1149767481.6676.40.camel@zoo.yandex.ru> <1149767664.6676.42.camel@zoo.yandex.ru> <44881230.6070606@ag-projects.com> <448816B0.1090101@infrae.com> <44882F4B.8060705@ag-projects.com>
Message-ID: <44883E95.2080804@infrae.com>

Mircea Amarascu wrote:
[snip]
> I see that the behaviour of XPath 1.0 is that if a QName does not have a
> prefix, the namespace in that axis is null (http://www.w3.org/TR/xpath#node-tests).
> This seems to have changed in XPath 2.0 to the default element namespace in the expression
> context (http://www.w3.org/TR/xpath20/#node-tests).
>
> So my understanding is that for the following XML:
>
>
>
>
>
> the expression "/root/bar" would select the first bar element in XPath
> 2.0, and no element in XPath 1.0
>
> And since lxml (libxml2) implements XPath 1.0, things work as expected.

Correct. I don't know much about XPath 2.0, but I had gotten the impression that some things changed there. Should've mentioned that.

Expression context may also be the best way to get the convenience function working, though that wouldn't work for absolute XPath queries, I think.

Regards,

Martijn

From behnel_ml at gkec.informatik.tu-darmstadt.de Thu Jun 8 22:15:37 2006
From: behnel_ml at gkec.informatik.tu-darmstadt.de (Stefan Behnel)
Date: Thu, 08 Jun 2006 22:15:37 +0200
Subject: [lxml-dev] New deallocation procedure
Message-ID: <44888569.2020507@gkec.informatik.tu-darmstadt.de>

Hi all,

I tried to resolve the bug that Petr found, but it seems to be a general problem with Python's garbage collector. Apparently, we cannot rely on the assumption that _Elements that hold a reference to _Documents are always freed *before* the _Document. This *may* be the case for the normal ref-count deallocator.
However, whenever the cyclic GC is required to resolve the dependencies, it is free to deallocate objects in arbitrary order.

The reason why this is a problem is that lxml previously freed C nodes when deallocating their proxy _Element (if it was not contained in a C document), and the entire C document when the _Document was deallocated. This works as long as _Documents are freed after all their proxy _Elements. It crashes when the GC decides to deallocate any of the _Elements after their _Document.

It is not easy to solve this, since _Elements can't just free the entire C document at their own deallocation time. There may be other _Elements still sitting on it. Only the _Document can do that - after all proxies are disconnected.

The way I resolved this is by adding a deep traversal to the _Document deallocation code. It searches the entire tree for _Element proxies and clears their C node reference. This prevents these proxies from caring about their C representation when they are deallocated - no more access to already freed data, no more crashes.

The downside is that this is costly. It doesn't matter too much for small documents, as traversal is very fast and the memory free() call is already costly enough. However, it does matter for large trees, especially if they do not fit into memory. If swapping is involved, we have to swap the tree twice now.

It's sad we have to do this, but I really cannot see any way to force the Python memory management to always free _Document objects after all _Elements, regardless of the way and time they are freed. This is a very conservative approach, but it is the only way I can see that definitely prevents crashes.

Stefan

From ianb at colorstudy.com Thu Jun 8 22:46:58 2006
From: ianb at colorstudy.com (Ian Bicking)
Date: Thu, 08 Jun 2006 15:46:58 -0500
Subject: [lxml-dev] HTML serialization
Message-ID: <44888CC2.7080303@colorstudy.com>

Howdy.
Just started using lxml (http://blog.ianbicking.org/neutral-templating.html), mostly for its HTML parser. One thing that felt missing to me was (X)HTML serialization. Specifically, XML serialization tends to create HTML which browsers don't like (like