From stefan_ml at behnel.de Sun Jul 1 15:20:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 01 Jul 2007 15:20:12 +0200 Subject: [lxml-dev] lxml 1.3 coming up In-Reply-To: <20070621165252.177750@gmx.net> References: <466EDCE5.8020407@behnel.de> <20070613071920.19070@gmx.net> <466FD558.7020404@behnel.de> <20070613120039.40630@gmx.net> <20070614094533.276670@gmx.net> <46717FF3.3060900@behnel.de> <20070615123522.202110@gmx.net> <46728DD9.6080202@behnel.de> <20070618153131.19060@gmx.net> <46779A23.6030208@behnel.de> <20070619145137.287570@gmx.net> <4677F497.8030000@behnel.de> <20070619162609.287610@gmx.net> <46782836.4000408@behnel.de> <20070620153043.276660@gmx.net> <20070621165252.177750@gmx.net> Message-ID: <4687AA0C.2080803@behnel.de> Hi Holger, jholg at gmx.de wrote: > Find attached a patch that: > > - changes the above to apply xsi:nil="true" for None value arguments Ok. > - lets DataElement() graciously handle ObjectifiedDataElement arguments, > keeping their attributes intact, if not overridden by the DataElement() > args. This also reuses existing xsi:type or py:pytype information, unless > _pytype and/or _xsi are provided as parameters to DataElement() > > Previously, DataElement() cut off all attributes if given an > ObjectifiedDataElement instance. Ok. > - Type-checks the _value against the given type hint: > You will run into the error anyway - sooner or > later - when accessing the .pyval in any way, so why not during > instantiation. Ok. > Tests are included for the described behaviour. Cool, thanks. > Additionally, I've revamped some of the tests I provided earlier and split > them up: More but smaller test methods now. That's even better. :) > Please try it out, if any of the DataElement changes are not ok I can also > send only the split-up tests, of course. 
> > Btw.: I'm always getting > > IOError: Error reading file > '/data/pydev/hjoukl/LXML/lxml-1.3/src/lxml/tests/test_xinclude.xml': failed > to load external entity > "/data/pydev/hjoukl/LXML/lxml-1.3/src/lxml/tests/test_xinclude.xml" > > due to some missing xml file lately when running the tests. I moved an XML file to a subdirectory to also test relative references in a base directory. But it should be fixed in the 1.3 release... Stefan From stefan_ml at behnel.de Mon Jul 2 10:32:36 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 02 Jul 2007 10:32:36 +0200 Subject: [lxml-dev] Some XPath questions... In-Reply-To: <468538ED.9060004@colorstudy.com> References: <468538ED.9060004@colorstudy.com> Message-ID: <4688B824.1020707@behnel.de> Hi Ian, just to comment on your actual first post in this thread, which I kinda oversaw because of the later discussion. I think this is pretty cool stuff and I love to have this in lxml. The html module really seems to be getting somewhere. I think we shouldn't even wait too long with a release so that we get some more feedback on the new APIs. Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and not only alpha, beta, final). Ian Bicking wrote: > div:contains('celia') -- means a div where the textual content has the > word 'celia' in it, case insensitive. At least, I think it's case > insensitive -- the CSS spec is annoyingly vague, but implementations > seem to work like this. I translate this to: > > descendant-or-self::div[contains(css:lower-case(string(.)), 'celia'] > > I added the lower-case function like: > > def _make_lower_case(context, s): > return s.lower() > etree.FunctionNamespace("css")['lower-case'] = _make_lower_case "css" is not the namespace, it's the prefix. 
You can do this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns.prefix = "css"
    ns['lower-case'] = _make_lower_case

or this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns['lower-case'] = _make_lower_case

    def css_to_xpath(css):
        xpath = build_xpath(css)
        return etree.XPath(xpath, {'css' : "http://my/css/namespace"})

You should consider providing a default namespace map here, and maybe even
return compiled XPath objects, i.e. callables. Note that these provide a
"path" attribute that returns the original path, so if you have to extend an
expression later on, you can still do so by creating a new XPath object.

Note that this would also allow you to wrap the function with an additional
call to set(), so that or-ed results really become the union and not the sum
of all parts.

> But XPath gives so few errors that it's hard to tell if it's really
> working.

Sadly, there doesn't seem to be a simple way to find out that a function was
undeclared. Or maybe I'll just have to look back into that... didn't I do that
already? :)

> There's also
> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among
> siblings with patterns like "2" (second sibling), "3n" (every third
> element), "odd" (odd elements) and some other selections. I kind of see
> how to deal with this using position(), but I'm not sure how to do
> either nth-of-type or nth-child (and the ones I do understand I am also
> vague about).

If I understand this correctly, this would be

    nth-of-type: //*/NAME[position() = x]
    nth-child:   //*/*[position() = x]

To deal with things like "2n", try this:

    //*/NAME[(position() mod 2) = 0]

> I've committed the incomplete code in lxml.html.css

I skipped through it a bit and found it really cool. I'm not completely
satisfied with the naming, but I now see that the context of the css module
makes it clearer what the semantics are.
Still, I prefer "css_to_xpath()", and providing a top-level class XPath() makes me think it should return an etree.XPath object, i.e. a compiled path. One more note: def run_xpath(doc, xpath): return [el for el in doc.xpath(xpath) if isinstance(el, etree.ElementBase)] Do you mean "etree.iselement(el)" here or are you intentionally restricting this to real-element subclasses of _Element? (i.e. no plain lxml.etree elements, no PIs, no comments) I actually think this module merits its own top-level placing, not necessarily only as part of lxml.html. It could just as well become "lxml.css", and should thus not rely too much on a specific API from lxml.html. Stefan From stefan_ml at behnel.de Mon Jul 2 18:35:59 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 02 Jul 2007 18:35:59 +0200 Subject: [lxml-dev] lxml 1.3.1 on cheeseshop Message-ID: <4689296F.3030004@behnel.de> Hi all, I just released lxml 1.3.1. This is a bugfix release for the stable 1.3 series. Changelog follows. Have fun, Stefan 1.3.1 (2007-07-02) ================== Features added -------------- * objectify.DataElement now supports setting values from existing data elements (not just plain Python types) and reuses defined namespaces etc. * E-factory support for lxml.objectify (``objectify.E``) Bugs fixed ---------- * Better way to prevent crashes in Element proxy cleanup code * objectify.DataElement didn't set up None value correctly * objectify.DataElement didn't check the value against the provided type hints * Reference-counting bug in ``Element.attrib.pop()`` From ianb at colorstudy.com Mon Jul 2 19:21:54 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 02 Jul 2007 12:21:54 -0500 Subject: [lxml-dev] Some XPath questions... 
In-Reply-To: <4688B824.1020707@behnel.de> References: <468538ED.9060004@colorstudy.com> <4688B824.1020707@behnel.de> Message-ID: <46893432.10404@colorstudy.com> Stefan Behnel wrote: > Hi Ian, > > just to comment on your actual first post in this thread, which I kinda > oversaw because of the later discussion. > > I think this is pretty cool stuff and I love to have this in lxml. The html > module really seems to be getting somewhere. I think we shouldn't even wait > too long with a release so that we get some more feedback on the new APIs. > Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and > not only alpha, beta, final). Yeah, I was thinking about writing up a summary of things that need to be done in the html package; there's still some outstanding stuff, but not too much. The clean module needs to be cleaned up (I'm thinking of moving from a function to a class). I'd like to make the usedoctest hack a little more general, as elsewhere I'm now using a similar hack to enable ELLIPSIS, and I'd like them not to conflict. And then some docs, but I guess that's it. > Ian Bicking wrote: >> div:contains('celia') -- means a div where the textual content has the >> word 'celia' in it, case insensitive. At least, I think it's case >> insensitive -- the CSS spec is annoyingly vague, but implementations >> seem to work like this. I translate this to: >> >> descendant-or-self::div[contains(css:lower-case(string(.)), 'celia'] >> >> I added the lower-case function like: >> >> def _make_lower_case(context, s): >> return s.lower() >> etree.FunctionNamespace("css")['lower-case'] = _make_lower_case > > "css" is not the namespace, it's the prefix. You can do this: > > ns = etree.FunctionNamespace("http://my/css/namespace") > ns.prefix = "css" > ns['lower-case'] = _make_lower_case OK, I've switched to this. 
> or this: > > ns = etree.FunctionNamespace("http://my/css/namespace") > ns['lower-case'] = _make_lower_case > > def css_to_xpath(css): > xpath = build_xpath(css) > return etree.XPath(xpath, {'css' : "http://my/css/namespace"}) Is there any advantage to this, over a more global prefix? I suppose there's a possible collision of css:, but I doubt that will be a problem. > You should consider providing a default namespace map here, and maybe even > return compiled XPath objects, i.e. callables. Note that these provide a > "path" attribute that returns the original path, so if you have to extend an > expression later on, you can still do so by creating a new XPath object. That's handy. I was thinking of creating a CSSXPath subclass or something, that would keep the original CSS selector around, in addition the translated XPath. > Note that this would also allow you to wrap the function with an additional > call to set(), so that or-ed results really become the union and not the sum > of all parts. If you use | in the XPath expression it seems to work out that there won't be any duplicates. >> But XPath gives so few errors that it's hard to tell if it's really >> working. > > Sadly, there doesn't seem to be a simple way to find out that a function was > undeclared. Or maybe I'll just have to look back into that... didn't I do that > already? :) We talked about it previously when I was trying to use match(), and instead of errors got bizarre results. But I don't think it resulted in any improvements on error messages. >> There's also >> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among >> siblings with patterns like "2" (second sibling), "3n" (every third >> element), "odd" (odd elements) and some other selections. I kind of see >> how to deal with this using position(), but I'm not sure how to do >> either nth-of-type or nth-child (and the ones I do understand I am also >> vague about). 
> > If I understand this correctly, this would be > > nth-of-type: //*/NAME[position() = x] > nth-child: //*/*[position() = x] > > To deal with things like "2n", try this: > > //*/NAME[(position() mod 2) = 0] I think I already have all this working now... though I wish there was a test case I could use, as I'm not 100% sure that my tests are testing for the correct results. >> I've committed the incomplete code in lxml.html.css > > I skipped through it a bit and found it really cool. I'm not completely > satisfied with the naming, but I now see that the context of the css module > makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and > providing a top-level class XPath() makes me think it should return an > etree.XPath object, i.e. a compiled path. I was thinking about changing around all the public naming. I'd like for it to be a method on elements, though I'm not sure what to call the method. .css(expr) is a bit funny, as it's not "css", it's just a css selector. .select(expr) doesn't say what kind of selector you are using. Another public function would be like XPath, something that compiles the entire CSS expression. Especially since the CSS parsing is non-trivial (just like the XPath parsing is non-trivial), precompiling will be beneficial. I'm thinking of also adding a fast path for a couple common kinds of selectors, that translate them more quickly into XPath. E.g., search for r'^\.(\w+)' for class name matches, or '^#(\w+)' for id matches, etc. And there's the question about whether simple CSS selectors should be translated to XPath at all (especially when they aren't precompiled). For people that are familiar with CSS selectors, it seems entirely possible that they will use it for very simple queries, like el.css('div'). 
If I detect that case and turn it into el.findall('div') then it would be completely reasonable; but if it gets tokenized, parsed, translated to XPath, compiled, then run, then that's going to be pretty inefficient. Anyway, back to naming -- if there's a method and a function/object to compile expressions, that's all the public interface I think it needs. I don't think translating css to xpath without compiling is particularly important. > One more note: > > def run_xpath(doc, xpath): > return [el for el in doc.xpath(xpath) > if isinstance(el, etree.ElementBase)] > > Do you mean "etree.iselement(el)" here or are you intentionally restricting > this to real-element subclasses of _Element? (i.e. no plain lxml.etree > elements, no PIs, no comments) I wasn't aware of iselement(). I'm not actually sure this is even necessary; I'm not sure if I can ever match non-elements with the expressions at all. I think I put it in there at some point when I wasn't sure. Instead it should probably be an assertion in the tests. > I actually think this module merits its own top-level placing, not necessarily > only as part of lxml.html. It could just as well become "lxml.css", and should > thus not rely too much on a specific API from lxml.html. Yes, you can do selections on anything. CSS it seems uses | for namespaces, like "atom|title", and it doesn't know anything special about HTML (except for special handling of the class attribute). Right now I'm assuming the XPath picks up the prefixes from elsewhere in the document. CSS uses "@namespace prefix URI", but that's part of a CSS document, and we're only handling selectors. So I just translate "atom|title" to "//atom:title", and assume it'll work. The CSS syntax does seem handier for a lot of kinds of selections, and after translating them I find the equivalent XPath rather complex in some cases (e.g., li:first-child). So there's some benefit there. 
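The fast path for trivial selectors mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name `fast_css_to_xpath` is hypothetical, the patterns are anchored with `$` (unlike the regexes quoted in the message) so that compound selectors fall through, and the XPath templates are a common idiom rather than what lxml.html.css actually emits.

```python
import re

# Hypothetical names; the XPath templates below are assumptions, not the
# translation produced by lxml.html.css.
_TAG_RE = re.compile(r'^(\w+)$')
_ID_RE = re.compile(r'^#(\w+)$')
_CLASS_RE = re.compile(r'^\.(\w+)$')


def fast_css_to_xpath(selector):
    """Return an XPath string for a trivial selector, or None.

    None means "not a simple case": the caller would fall back to the
    full tokenize/parse/translate path.
    """
    selector = selector.strip()
    match = _TAG_RE.match(selector)
    if match:
        return "descendant-or-self::%s" % match.group(1)
    match = _ID_RE.match(selector)
    if match:
        return "descendant-or-self::*[@id = '%s']" % match.group(1)
    match = _CLASS_RE.match(selector)
    if match:
        # Pad the class attribute with spaces so only whole class names match.
        return ("descendant-or-self::*[contains(concat(' ', "
                "normalize-space(@class), ' '), ' %s ')]" % match.group(1))
    return None
```

Anything the three patterns do not recognise returns None, so the slower general translation stays the single source of truth for everything non-trivial.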
-- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Tue Jul 3 01:26:06 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 02 Jul 2007 18:26:06 -0500 Subject: [lxml-dev] Some XPath questions... In-Reply-To: <4686A230.70105@behnel.de> References: <468538ED.9060004@colorstudy.com> <18053.22206.159599.207098@bhuda.mired.org> <468579E3.7010802@colorstudy.com> <18053.36353.691485.8754@bhuda.mired.org> <46859771.8@colorstudy.com> <4686A230.70105@behnel.de> Message-ID: <4689898E.9080509@colorstudy.com> Stefan Behnel wrote: >> So when I use // it works. Huh. I prefer descendant-or-self, because I >> find it peculiar to do a search from the root when you've called the >> method on some particular element (that may not be at the root). > > There's also ".//*". That seems to be equivalent to //*, i.e., // goes directly to the root regardless of context. >>>>>> div:empty (no children, including text, maybe not including whitespace). >>>>> Ouch. let me think about that one. >>>> Yeah, I couldn't figure that one out. I thought this might work: >>>> >>> xpath('E:empty') >>>> e[count(./children::*) = 0 and string(.) = ''] >>>> But maybe I don't understand how count() works; this isn't a valid XPath >>>> expression. >>> You want "child" not "children". Using normalize-space(.) instead of >>> string(.) will exclude whitespace. This does assume you are ignoring >>> comments and PIs; I believe that's the behavior you want. >> Cool, that seems to work right. > > What about "e[not(*) and not(normalize-space())]" ? Yes, that works too. >> One query I'm realizing might be really hard (maybe too hard in XPath) >> is *:first-of-type, *:last-of-type, and *:only-of-type, since they match >> in a funny sort of way. 
You can't really do: >> >> *[count(../*[name() = name()) = 1] > > You need two expressions here, one to find the node and one to compare it to > others (note that name() can also take an argument) - but those are really > trick, you're right. They may already touch the borders of what XPath can express. I could probably do it by adding a new function, I suppose; css:last-of-type() for instance. It's not that hard to do in Python, after all. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From ianb at colorstudy.com Tue Jul 3 01:45:37 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 02 Jul 2007 18:45:37 -0500 Subject: [lxml-dev] Some XPath questions... In-Reply-To: <18057.35970.547924.621080@bhuda.mired.org> References: <468538ED.9060004@colorstudy.com> <18053.22206.159599.207098@bhuda.mired.org> <468579E3.7010802@colorstudy.com> <18053.36353.691485.8754@bhuda.mired.org> <46859771.8@colorstudy.com> <4686A230.70105@behnel.de> <4689898E.9080509@colorstudy.com> <18057.35970.547924.621080@bhuda.mired.org> Message-ID: <46898E21.4030508@colorstudy.com> Mike Meyer wrote: > In <4689898E.9080509 at colorstudy.com>, Ian Bicking typed: >> Stefan Behnel wrote: >>>> So when I use // it works. Huh. I prefer descendant-or-self, because I >>>> find it peculiar to do a search from the root when you've called the >>>> method on some particular element (that may not be at the root). >>> There's also ".//*". >> That seems to be equivalent to //*, i.e., // goes directly to the root >> regardless of context. > > Not quite. '//*' always goes to the root. './/*' starts at the current > node and matches from there down. If you always test at the root of > the document, they'll look the same. It seems to be changing the results when I replace 'descendant-or-self::' with './/'. I want to include the current node if it matches; at least to me, that seems most logical. 
Also necessary when I was doing microformat parsing, as a single element can have multiple roles. It seems like .// excludes the current node, only looking at descendants. >>>>>>>> div:empty (no children, including text, maybe not including whitespace). >>>>>>> Ouch. let me think about that one. >>>>>> Yeah, I couldn't figure that one out. I thought this might work: >>>>>> >>> xpath('E:empty') >>>>>> e[count(./children::*) = 0 and string(.) = ''] >>>>>> But maybe I don't understand how count() works; this isn't a valid XPath >>>>>> expression. >>>>> You want "child" not "children". Using normalize-space(.) instead of >>>>> string(.) will exclude whitespace. This does assume you are ignoring >>>>> comments and PIs; I believe that's the behavior you want. >>>> Cool, that seems to work right. >>> What about "e[not(*) and not(normalize-space())]" ? >> Yes, that works too. > > That's the 'implicit conversion' I was talking about. You're relying > on 0 and the empty string being false. It's a standard idiom, and > pythonic, but I'm not sure you want to use it in automatically > generated code, since it means you can't generalize the code from "has > 0 children" to "has n children". In this case it's a fixed expression used for e:empty, and nothing else, so it seems fine. And possibly makes the resulting expression a bit easier to recognize from its CSS roots. -- Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org | Write code, do good | http://topp.openplans.org/careers From stefan_ml at behnel.de Tue Jul 3 08:54:03 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 03 Jul 2007 08:54:03 +0200 Subject: [lxml-dev] Some XPath questions... 
In-Reply-To: <46898E21.4030508@colorstudy.com> References: <468538ED.9060004@colorstudy.com> <18053.22206.159599.207098@bhuda.mired.org> <468579E3.7010802@colorstudy.com> <18053.36353.691485.8754@bhuda.mired.org> <46859771.8@colorstudy.com> <4686A230.70105@behnel.de> <4689898E.9080509@colorstudy.com> <18057.35970.547924.621080@bhuda.mired.org> <46898E21.4030508@colorstudy.com> Message-ID: <4689F28B.6080808@behnel.de> Ian Bicking wrote: >>>>>>> >>> xpath('E:empty') >>>>>>> e[count(./children::*) = 0 and string(.) = ''] >>>>>>> But maybe I don't understand how count() works; this isn't a >>>>>>> valid XPath expression. >>>>>> You want "child" not "children". Using normalize-space(.) instead of >>>>>> string(.) will exclude whitespace. This does assume you are ignoring >>>>>> comments and PIs; I believe that's the behavior you want. >>>>> Cool, that seems to work right. >>>> What about "e[not(*) and not(normalize-space())]" ? >>> Yes, that works too. >> >> That's the 'implicit conversion' I was talking about. You're relying >> on 0 and the empty string being false. It's a standard idiom, and >> pythonic, but I'm not sure you want to use it in automatically >> generated code, since it means you can't generalize the code from "has >> 0 children" to "has n children". > > In this case it's a fixed expression used for e:empty, and nothing else, > so it seems fine. And possibly makes the resulting expression a bit > easier to recognize from its CSS roots. It's also likely faster. I don't think libxml2 optimises the comparisons, so looking for "not(*)" can stop false after the first node, while "count(./child::*) = 0" needs to count all children and then sees that, oh, the number is bigger than 0. 
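The difference between the two predicates can be mirrored in plain Python. The sketch below is an analogy only: it uses the standard library's ElementTree instead of libxml2's XPath engine, and both function names are made up for illustration. It shows why an existence test can stop at the first child while a count has to visit all of them.

```python
import xml.etree.ElementTree as ET

XPATH_WHITESPACE = " \t\r\n"  # the characters normalize-space() treats as space


def is_empty_existence(el):
    """Rough mirror of e[not(*) and not(normalize-space())]."""
    has_child = any(True for _ in el)   # stops at the first child element
    text = (el.text or "").strip(XPATH_WHITESPACE)
    return not has_child and not text


def is_empty_counting(el):
    """Rough mirror of e[count(./child::*) = 0 and normalize-space(.) = '']."""
    child_count = sum(1 for _ in el)    # walks every child before comparing
    text = (el.text or "").strip(XPATH_WHITESPACE)
    return child_count == 0 and text == ""


root = ET.fromstring("<r><e/><e>  </e><e>x</e><e><c/></e></r>")
results = [is_empty_existence(e) for e in root]
```

Both functions agree on every element; they differ only in how much work they do on elements with many children.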
Stefan

From jholg at gmx.de Tue Jul 3 09:43:03 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 03 Jul 2007 09:43:03 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
Message-ID: <20070703074303.12840@gmx.net>

Hi,

the setup.py script in 1.3.1 seems to try to remove the dependency on
setuptools (which is a very good thing imho!) but fails:

Traceback (most recent call last):
  File "setup.py", line 7, in ?
    except pkg_resources.VersionConflict, e:
NameError: name 'pkg_resources' is not defined
1 lb54320 at adevp02 .../lxml-1.3 $

I must admit I don't fully understand the intention of the relevant code
portion, as it raises ImportError even if the pkg_resources import and version
check run smoothly; maybe this is the intended behaviour?

try:
    import pkg_resources
    try:
        pkg_resources.require("setuptools>=0.6c5")
    except pkg_resources.VersionConflict, e:
        from ez_setup import use_setuptools
        use_setuptools(version="0.6c5")
    from setuptools import setup
except ImportError:
    # setuptools not installed
    from distutils.core import setup

(Note: This is untested code, I have not tested with setuptools installed)

Oh, btw I couldn't find a 1.3.1 tag in the repository when trying to check out
1.3.1.

Holger

From stefan_ml at behnel.de Tue Jul 3 15:16:24 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 15:16:24 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
In-Reply-To: <20070703074303.12840@gmx.net>
References: <20070703074303.12840@gmx.net>
Message-ID: <468A4C28.8040701@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> the setup.py script in 1.3.1 seems to try to remove the dependency on
> setuptools (which is a very good thing imho!) but fails:
>
> Traceback (most recent call last): File "setup.py", line 7, in ? except
> pkg_resources.VersionConflict, e: NameError: name 'pkg_resources' is not
> defined 1 lb54320 at adevp02 .../lxml-1.3 $
>
> I must admit I don't fully understand the intention of the relevant code
> portion, as it raises ImportError even if pkg_resources import and version
> check runs smoothly; maybe this is the intended behaviour?

Ah, great. That was plain debug code. :)

Thanks, I just re-released the sources. Could you check if it works now?

> Oh, btw I couldn't find a 1.3.1 tag in the repository when trying to check
> out 1.3.1.

Luckily, yes. I'll tag it with the fix applied. :)

Stefan

From jholg at gmx.de Tue Jul 3 15:40:18 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 03 Jul 2007 15:40:18 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
In-Reply-To: <468A4C28.8040701@behnel.de>
References: <20070703074303.12840@gmx.net> <468A4C28.8040701@behnel.de>
Message-ID: <20070703134018.327480@gmx.net>

Hi Stefan,

> Thanks, I just re-released the sources. Could you check if it works now?

Works for me now.

Note: I get

Building lxml version 1.3.1-44702
/apps/prod/lib/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'zip_safe'
  warnings.warn(msg)

but simply ignore it because I bet this is just some setuptools-related stuff
and can be safely ignored by plain-old-distutillers like me.

Holger

From dgrimes at navisite.com Tue Jul 3 21:15:57 2007
From: dgrimes at navisite.com (David M. Grimes)
Date: Tue, 03 Jul 2007 15:15:57 -0400
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
Message-ID: <468AA06D.3030502@navisite.com>

I posted a patch for what I believed to be a reference-counting bug in
Attrib.pop() based on the 1.3 release. The patch was accepted, and is present
in 1.3.1. The patch is included at the end of this message.
Looking through the generated C code, I'm no longer sure my patch was
correct - perhaps just masking the underlying problem in 1.3. I'm not fluent
in Pyrex, so not sure if the python.Py_INCREF is really necessary for
something which would be a "borrowed reference" in the C-API
(PyTuple_GET_ITEM result). It looks like the Pyrex "return" is generating its
own INCREF ...

Now, what is intriguing is that the 1.3.1 stock build is crashing again with
the same symptom, and is easily reproducible with the following test program
(this crashed after iteration 956 on i686 with python 2.4.4):

import lxml.etree as etree

xml = '''\
'''

for i in range(10000):
    print i
    et = etree.fromstring(xml)
    et.attrib.pop('x', None)

This dies at this point:

...
...
951
952
953
954
955
956
Fatal Python error: deallocating None
Aborted

Original 1.3 patch:

diff -urN lxml-1.3~/src/lxml/etree.pyx lxml-1.3/src/lxml/etree.pyx
--- lxml-1.3~/src/lxml/etree.pyx 2007-06-25 02:25:37.000000000 -0400
+++ lxml-1.3/src/lxml/etree.pyx 2007-06-27 15:36:15.000000000 -0400
@@ -1480,10 +1480,12 @@
             if python.PyTuple_GET_SIZE(default) == 0:
                 raise KeyError, key
             else:
-                return python.PyTuple_GET_ITEM(default, 0)
+                result = python.PyTuple_GET_ITEM(default, 0)
+                python.Py_INCREF(result)
         else:
             _delAttribute(self._element, key)
-            return result
+
+        return result
 
     def clear(self):
         cdef xmlNode* c_node

From stefan_ml at behnel.de Tue Jul 3 22:41:17 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 22:41:17 +0200
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
In-Reply-To: <468AA06D.3030502@navisite.com>
References: <468AA06D.3030502@navisite.com>
Message-ID: <468AB46D.3020804@behnel.de>

David M. Grimes wrote:
> I posted a patch for what I believed to be a reference-counting bug in
> Attrib.pop() based on the 1.3 release. The patch was accepted, and is
> present in 1.3.1. The patch is included at the end of this message.
> Looking through the generated C code, I'm no longer sure my patch was
> correct - perhaps just masking the underlying problem in 1.3. I'm not
> fluent in Pyrex, so not sure if the python.Py_INCREF is really necessary
> for something which would be a "borrowed reference" in the C-API
> (PyTuple_GET_ITEM result). It looks like the Pyrex "return" is
> generating its own INCREF ...

You can debug this kind of problem with

    print sys.getrefcount(None)

When I run the following on 1.3.1:

    et = etree.fromstring(xml)
    for i in range(10000):
        print i
        print sys.getrefcount(None)
        et.attrib.pop('x', None)
        print sys.getrefcount(None)

instead of your test, it shows me that the problem is not "pop()", as the
ref-count is constant at each iteration. Trying to remove the Py_INCREF() from
your patch makes it crash with a continuously decreasing ref-count.

However, when I run your test:

    for i in range(10000):
        print i
        et = etree.fromstring(xml)
        print sys.getrefcount(None)
        et.attrib.pop('x', None)
        print sys.getrefcount(None)

the ref-count keeps increasing until the garbage collector hits and then drops
below the start value and finally crashes on the second GC run.

So the problem is somewhere else. I'll investigate.

Thanks for the report,
Stefan

From stefan_ml at behnel.de Tue Jul 3 23:24:03 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 23:24:03 +0200
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
In-Reply-To: <468AB46D.3020804@behnel.de>
References: <468AA06D.3030502@navisite.com> <468AB46D.3020804@behnel.de>
Message-ID: <468ABE73.1080408@behnel.de>

Stefan Behnel wrote:
> David M. Grimes wrote:
>> I posted a patch for what I believed to be a reference-counting bug in
>> Attrib.pop() based on the 1.3 release. The patch was accepted, and is
>> present in 1.3.1.
>
> the problem is somewhere else. I'll investigate.
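The ref-count check described in the earlier reply can be wrapped into a small helper. This is a minimal sketch for current Python 3, not code from the thread: `refcount_delta` is a hypothetical name, and it watches an ordinary sentinel object rather than None because on CPython 3.12 and later None is immortal and its reported count no longer changes.

```python
import sys


def refcount_delta(func, obj, iterations=1000):
    """Call func(obj) repeatedly and report how obj's ref-count changed.

    A result of 0 suggests the call neither leaks nor drops references to
    obj; a steadily positive or negative value points at a counting bug.
    """
    before = sys.getrefcount(obj)
    for _ in range(iterations):
        func(obj)
    return sys.getrefcount(obj) - before


sentinel = object()
balanced = refcount_delta(lambda o: None, sentinel, 100)

kept = []  # deliberately holds on to every reference it is given
leaking = refcount_delta(kept.append, sentinel, 100)
```

Only the difference between the two readings matters; the absolute value from sys.getrefcount() includes temporary references held by the call itself.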
Ok, the problem was actually in the new deallocation code. For those who want
to know: The GC calls tp_clear() before tp_dealloc(), so that Pyrex has
already set the _Document reference of the _Element to None when __dealloc__
is called on the _Element and tries to Py_DECREF the doc reference =>
deallocating None.

I worked around this by adding a redundant PyObject* to _Element that
references the document. Pyrex does not set it to None so that we can keep a
pointer in there when the Python reference is already None-ed and DECREF it
ourselves.

Obviously, that's a hack, but it works, so I'll leave it in and release a
1.3.2 with it...

Stefan

From dgrimes at navisite.com Tue Jul 3 23:27:46 2007
From: dgrimes at navisite.com (Grimes, David)
Date: Tue, 3 Jul 2007 17:27:46 -0400
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
References: <468AA06D.3030502@navisite.com> <468AB46D.3020804@behnel.de>
Message-ID:

If there's anything else I can do to help test/diagnose, let me know ...

--Dave

From stefan_ml at behnel.de Tue Jul 3 23:43:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 23:43:22 +0200
Subject: [lxml-dev] lxml 1.3.2 released to cheeseshop
Message-ID: <468AC2FA.9020403@behnel.de>

Hi all,

due to a severe crash bug in 1.3.1, I released 1.3.2 today. The only change is
the bug fix. So please don't use or package 1.3.1, use 1.3.2 instead.
Have fun, Stefan ChangeLog: 1.3.2 (2007-07-03) ================== Bugs fixed ---------- * "deallocating None" crash bug 1.3.1 (2007-07-02) ================== Features added -------------- * objectify.DataElement now supports setting values from existing data elements (not just plain Python types) and reuses defined namespaces etc. * E-factory support for lxml.objectify (``objectify.E``) Bugs fixed ---------- * Better way to prevent crashes in Element proxy cleanup code * objectify.DataElement didn't set up None value correctly * objectify.DataElement didn't check the value against the provided type hints * Reference-counting bug in ``Element.attrib.pop()`` From jholg at gmx.de Wed Jul 4 11:28:48 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 04 Jul 2007 11:28:48 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text Message-ID: <20070704092848.254480@gmx.net> Hi, playing around with the new E-factory I found that it does not handle unicode the way the rest of the API does: >>> STR = objectify.E.str >>> STR(unicode("???", 'latin-1')) Traceback (most recent call last): File "<stdin>", line 1, in ? File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in <lambda> return lambda *args, **kwargs: func(tag, *args, **kwargs) File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 177, in __call__ v = t(elem, item) File "objectify.pyx", line 1661, in objectify.__add_text UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128) >>> This is easily fixed by changing __add_text to def __add_text(_Element elem not None, text): cdef tree.xmlNode* c_child if not python._isString(text): if isinstance(text, bool): text = str(text).lower() else: text = str(text) c_child = cetree.findChildBackwards(elem._c_node, 0) [...] >>> STR = objectify.E.str >>> STR(unicode("???", 'latin-1')) Patches for trunk / 1.3 branch appended.
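For readers following along, the conversion rule in the changed __add_text can be sketched in plain Python. This is only an illustration under stated assumptions: it is written for Python 3, where str is already a Unicode type, and to_text is a hypothetical helper name, not part of lxml.

```python
def to_text(value):
    """Return the text to store for value (hypothetical helper, not lxml API)."""
    if isinstance(value, str):
        # Strings, including non-ASCII ones, pass through unchanged
        # instead of being pushed through an ASCII-only str() call.
        return value
    if isinstance(value, bool):
        # Booleans use the lowercase spelling: True becomes "true".
        return str(value).lower()
    # Everything else falls back to its str() form.
    return str(value)
```

The point of testing for strings first is that only non-string values ever reach str(), which is what avoids the UnicodeEncodeError shown in the traceback.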
Another issue with E-factory is that it currently does not have support for the custom objectify classes that you can add with the PyType mechanisms: E.g. I'm using datetime and decimal additions, which leads to >>> import decimal >>> DEC = objectify.E.decimal >>> DEC(decimal.Decimal(0)) Traceback (most recent call last): File "<stdin>", line 1, in ? File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in <lambda> return lambda *args, **kwargs: func(tag, *args, **kwargs) File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__ raise TypeError("bad argument type: %r" % item) TypeError: bad argument type: Decimal("0") >>> So I'd have to add decimal.Decimal into objectify.E._typemap. The nicest way to handle this would be PyType.register() doing it for me, but PyType uses type names rather than type objects for its purposes. Maybe the easiest thing is to instrument ElementMaker with its own register/unregister() methods and well-document it? Holger -------------- next part -------------- A non-text attachment was scrubbed... Name: trunk_efactory_unicode.patch Type: application/octet-stream Size: 671 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/65923b81/attachment.obj -------------- next part -------------- A non-text attachment was scrubbed...
Name: branch13_efactory_unicode.patch Type: application/octet-stream Size: 671 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/65923b81/attachment-0001.obj From stefan_ml at behnel.de Wed Jul 4 12:07:01 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 12:07:01 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704092848.254480@gmx.net> References: <20070704092848.254480@gmx.net> Message-ID: <468B7145.4050706@behnel.de> jholg at gmx.de wrote: > playing around with the new E-factory I found that it does not handle > unicode the way the rest of the API does: > >>>> STR = objectify.E.str >>>> STR(unicode("???", 'latin-1')) > Traceback (most recent call last): > File "", line 1, in ? > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in > return lambda *args, **kwargs: func(tag, *args, **kwargs) > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 177, in __call__ > v = t(elem, item) > File "objectify.pyx", line 1661, in objectify.__add_text > UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128) > > This is easily fixed by changing __add_text to > > def __add_text(_Element elem not None, text): > cdef tree.xmlNode* c_child > if not python._isString(text): > if isinstance(text, bool): > text = str(text).lower() > else: > text = str(text) > c_child = cetree.findChildBackwards(elem._c_node, 0) > [...] > >>>> STR = objectify.E.str >>>> STR(unicode("???", 'latin-1')) > > Patches for trunk / 1.3 branch appended. Thanks, that fixes it. Maybe we should even split __add_text up into a function for strings and a function that handles other stuff. > Another issue with E-factory is that it currently does not have support for the custom objectify classes that you can add with the PyType mechanisms: E.g. 
I'm using datetime and decimal additions, which leads to > >>>> import decimal >>>> DEC = objectify.E.decimal >>>> DEC(decimal.Decimal(0)) > Traceback (most recent call last): > File "", line 1, in ? > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in > return lambda *args, **kwargs: func(tag, *args, **kwargs) > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__ > raise TypeError("bad argument type: %r" % item) > TypeError: bad argument type: Decimal("0") > > So I'd have to add decimal.decimal into objectify.E._typemap. The nicest way to handle this would be PyType.register() doing it for me, but > PyType uses type names rather than type objects for its purposes. Maybe the easiest thing is to instrument ElementMaker with its own register/unregister() methods and well-document it? No, one registry should be enough. Even with names, you can always check globals() in the PyType registry. Maybe we should even feed the typemap in ElementMaker.__init__ from the PyType registry (and just update objectify.E when the registry is changed). Could you look into that? Stefan From stefan_ml at behnel.de Wed Jul 4 13:50:49 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 13:50:49 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468B7145.4050706@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> Message-ID: <468B8999.6080701@behnel.de> Stefan Behnel wrote: >> Another issue with E-factory is that it currently does not have support for >> the custom objectify classes that you can add with the PyType mechanisms: E.g. >> I'm using datetime and decimal additions, which leads to >> >>>>> import decimal >>>>> DEC = objectify.E.decimal >>>>> DEC(decimal.Decimal(0)) >> Traceback (most recent call last): >> File "", line 1, in ? 
>> File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in >> return lambda *args, **kwargs: func(tag, *args, **kwargs) >> File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__ >> raise TypeError("bad argument type: %r" % item) >> TypeError: bad argument type: Decimal("0") >> >> So I'd have to add decimal.decimal into objectify.E._typemap. The nicest way to handle this would be PyType.register() doing it for me, but >> PyType uses type names rather than type objects for its purposes. Maybe the easiest thing is to instrument ElementMaker with its own register/unregister() methods and well-document it? > > No, one registry should be enough. Even with names, you can always check > globals() in the PyType registry. Maybe we should even feed the typemap in > ElementMaker.__init__ from the PyType registry (and just update objectify.E > when the registry is changed). Ah, I guess the problem here is that your external types are not in the module's globals(). Maybe we could extend the data element classes with a non-public function that converts a value to a string. Would that fit here? Note that the Element proxy is already created when the text value is updated, that's like your _setText() case. Stefan From jholg at gmx.de Wed Jul 4 15:40:10 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 04 Jul 2007 15:40:10 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468B8999.6080701@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> Message-ID: <20070704134010.72530@gmx.net> Hi Stefan, > Stefan Behnel wrote: > >> So I'd have to add decimal.decimal into objectify.E._typemap. The > nicest way to handle this would be PyType.register() doing it for me, but > >> PyType uses type names rather than type objects for its purposes. 
Maybe > the easiest thing is to instrument ElementMaker with its own > register/unregister() methods and well-document it? > > > > No, one registry should be enough. Even with names, you can always check > > globals() in the PyType registry. Maybe we should even feed the typemap > in > > ElementMaker.__init__ from the PyType registry (and just update > objectify.E > > when the registry is changed). > > Ah, I guess the problem here is that your external types are not in the > module's globals(). Maybe we could extend the data element classes with a > non-public function that converts a value to a string. Would that fit > here? > Note that the Element proxy is already created when the text value is > updated, > that's like your _setText() case. What one actually does for registration is datetimeType = PyType("datetime", parseDatetime, DatetimeElement) datetimeType.register() just like objectify does for the standard builtin types. I think that PyType.register()/unregister() should update E._typemap; the problem here is that register() does not really know about the Python type, just a name, a check function and the ObjectifiedDataElement class; this is also nice because it is so versatile. What about simply adding an optional argument python_type where one can supply the actual python type/class the custom element class does mimic? Holger
From stefan_ml at behnel.de Wed Jul 4 15:59:17 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 15:59:17 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704134010.72530@gmx.net> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> Message-ID: <468BA7B5.7070103@behnel.de> jholg at gmx.de wrote: >> Stefan Behnel wrote: >>>> So I'd have to add decimal.decimal into objectify.E._typemap. The >> nicest way to handle this would be PyType.register() doing it for me, but >>>> PyType uses type names rather than type objects for its purposes. Maybe >> the easiest thing is to instrument ElementMaker with its own >> register/unregister() methods and well-document it? >>> No, one registry should be enough. Even with names, you can always check >>> globals() in the PyType registry. Maybe we should even feed the typemap >> in >>> ElementMaker.__init__ from the PyType registry (and just update >> objectify.E >>> when the registry is changed). >> Ah, I guess the problem here is that your external types are not in the >> module's globals(). Maybe we could extend the data element classes with a >> non-public function that converts a value to a string. Would that fit >> here? >> Note that the Element proxy is already created when the text value is >> updated, >> that's like your _setText() case. > > What one actually does for registration is > > datetimeType = PyType("datetime", parseDatetime, DatetimeElement) > datetimeType.register() > > just like objectify does for the standard builtin types. I think that > PyType.register()/unregister() should update E._typemap; the problem > here is that register() does not really know about the Python type, just > a name, a check function and the ObjectifiedDataElement class; this is also nice because it is so versatile.
> What about simply adding an optional argument python_type where one can supply the actual python type/class the custom element class does mimic? Well, all you'd really need is a conversion to a string, so, given such a type would Do The Right Thing for __str__, that would work. But then, if str() did the right thing, we could just as well use the existing behaviour of the E factory and just extend typemap to also check for the type /name/, not only the type itself. Maybe that's the way to go? Stefan From rogerpatterson at gmail.com Wed Jul 4 22:22:45 2007 From: rogerpatterson at gmail.com (Roger Patterson) Date: Wed, 04 Jul 2007 13:22:45 -0700 Subject: [lxml-dev] xslt exceptions Message-ID: <468C0195.2010605@gmail.com> Hello, I am getting errors from within an XPath call within a custom extension being called while doing an XSLT transform. I am able to access the global error log as well as the error_log on the transform object, but the error information is sketchy at best. Unfortunately no exception is being thrown. Is this normal? Is there a way of turning on exception throwing? -Roger From stefan_ml at behnel.de Wed Jul 4 23:14:38 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 23:14:38 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704142538.8170@gmx.net> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> Message-ID: <468C0DBE.4020101@behnel.de> Hi Holger, jholg at gmx.de wrote: >>> What about simply adding an optional argument python_type where one can >> supply the actual python type/class the custom element class does mimic? >> >> Well, all you'd really need is a conversion to a string, so, given such a >> type >> would Do The Right Thing for __str__, that would work. 
But then, if str() >> did >> the right thing, we could just as well use the existing behaviour of the E >> factory and just extend typemap to also check for the type /name/, not >> only >> the type itself. Maybe that's the way to go? > > You mean by using a customized typemap that uses additional(typename, > ) entries and a get() method that also tries to lookup by > typename? The convention then being that the typenames one uses in the > PyType registry must correspond to the actual python type name he models, > if he wants to make use of objectify.E for custom DataElements. Sounds > reasonable. Here is a patch that (I think) might be a way to solve this problem. The idea is to use a custom class instead of the typemap dictionary and have it fall back to the PyType registry. If your type does not support a simple str() conversion, you can pass a conversion function as an additional argument to PyType() when registering your type. It's currently untested, so please play with it if you think it's the right approach. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: objectify-data-type-support-in-elementmaker-class.patch Type: text/x-diff Size: 7015 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/60b1ba59/attachment-0001.bin From stefan_ml at behnel.de Wed Jul 4 23:42:16 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 23:42:16 +0200 Subject: [lxml-dev] xslt exceptions In-Reply-To: <468C0195.2010605@gmail.com> References: <468C0195.2010605@gmail.com> Message-ID: <468C1438.6020804@behnel.de> Hi, Roger Patterson wrote: > I am getting errors from within an XPath call within a custom extension > being called while doing an XSLT transform. > I am able to access the global error log as well as the error_log on the > transform object, but the error information is sketchy at best. > Unfortunately no exception is being thrown. Is this normal? 
Is there a > way of turning on exception throwing? Hmm, if I understand that right, what you do is: from an XSLT, you call into a Python extension function and from that you call an XPath expression. This expression fails and you want to do what? * stop XSLT execution and propagate the error? Then throw an exception yourself. * know why it failed? The XPath code has had a major refactoring on the current SVN trunk that will become lxml 2.0. The XPath class now has its own error log and may even throw a meaningful exception for you. I encourage you to check out the trunk and try it out. Any comments are appreciated. http://codespeak.net/svn/lxml/trunk/ Stefan From jholg at gmx.de Thu Jul 5 15:42:44 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 05 Jul 2007 15:42:44 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468C0DBE.4020101@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de> Message-ID: <20070705134244.145570@gmx.net> Hi Stefan, I tried it out (latest trunk instead of this patch) > It's currently untested, so please play with it if you think it's the > right > approach. With this small fix 0 lb54320 at adevp02 .../lxml $ svn diff Index: src/lxml/objectify.pyx =================================================================== --- src/lxml/objectify.pyx (revision 44735) +++ src/lxml/objectify.pyx (working copy) @@ -1054,7 +1054,7 @@ result = python.PyDict_GetItem(_PYTYPE_DICT, name) if result is NULL: return None - return (result)._stringify + return (result)._stringify return result def __contains__(self, type): 0 lb54320 at adevp02 .../lxml $ this works for me.
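The lookup idea discussed in this thread (try the type object first, then fall back to a registry keyed by type name) can be sketched in plain Python. This is an illustration only, not the code on trunk: FallbackTypemap, by_type and by_name are hypothetical names, and plain dictionaries stand in for the E-factory typemap and the PyType registry.

```python
import decimal


class FallbackTypemap:
    """Hypothetical stand-in for a typemap that falls back to a name registry."""

    def __init__(self, by_type, by_name):
        self._by_type = by_type    # type object -> converter function
        self._by_name = by_name    # registered type name -> converter function

    def get(self, type_, default=None):
        # Prefer an exact match on the type object itself.
        converter = self._by_type.get(type_)
        if converter is not None:
            return converter
        # Otherwise fall back to the registry, keyed by the type's name.
        return self._by_name.get(type_.__name__, default)


typemap = FallbackTypemap(
    by_type={int: str, bool: lambda b: str(b).lower()},
    by_name={"Decimal": str},
)
```

With this shape, typemap.get(decimal.Decimal) finds a converter only because the registered name "Decimal" equals decimal.Decimal.__name__, which is the naming convention the thread goes on to discuss.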
The E-factory has some strangeness to it regarding objectify: >>> msg.x = objectify.E.INT(5,3,2) >>> print objectify.dump(msg) msg = None [ObjectifiedElement] x = 532 [IntElement] >>> but I don't think it makes sense to investigate. This would now mean to rename my "decimal" type to "Decimal", so that it matches decimal.Decimal.__name__, which enables the lookup in _ObjectifyTypemap.get(). Although that might break some code here I think it might be a good thing to use the names of the actual python types (by convention). One of the biggest issues for my users here is that the way objectify works, they might sometimes assign a string "2323" into a tree, which gets then interpreted as an IntElement. Although this is insignificant for most practical issues, and they can always use DataElement() to type-fix anything anytime, using such a naming convention would enable me to make type-fixing even easier: >>> def PYTYPE(value): ... if isinstance(value, ObjectifiedElement): ... return deepcopy(value) ... else: ... return DataElement(value, type(value).__name__) ... (Note: I don't like the name of that function yet...) What about None? It is currently called "none" in the type registry, whereas type(None).__name__ == "NoneType". I'd prefer to special-case it here (and maybe in _ObjectifyTypemap), as "none" is so nice and short. Holger
From stefan_ml at behnel.de Thu Jul 5 22:04:23 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 05 Jul 2007 22:04:23 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070705134244.145570@gmx.net> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de> <20070705134244.145570@gmx.net> Message-ID: <468D4EC7.1070705@behnel.de> Hi Holger, jholg at gmx.de wrote: > With this small fix > [...] > this works for me. Argh. :) Thanks. > The E-factory has some strangeness to it regarding objectify: >>>> msg.x = objectify.E.INT(5,3,2) >>>> print objectify.dump(msg) > msg = None [ObjectifiedElement] > x = 532 [IntElement] > but I don't think it makes sense to investigate. Hmmm, looks funny but is normal. This is how the factory works normally: http://codespeak.net/lxml/dev/tutorial.html#the-e-factory It takes a content list as argument, so that's expected behaviour. ;) > This would now mean to rename my "decimal" type to "Decimal", so that it > matches decimal.Decimal.__name__, which enables the lookup in > _ObjectifyTypemap.get(). If that's all it takes, that's perfect. > Although that might break some code here I think it might be a good thing > to use the names of the actual python types (by convention). One of the biggest > issues for my users here is that the way objectify works, they might sometimes > assign a string "2323" into a tree, which gets then interpreted as an IntElement. > Although this is insignificant for most practical issues, and they can > always use DataElement() to type-fix anything anytime, using such a naming > convention would enable me to make type-fixing even easier: > >>>> def PYTYPE(value): > ... if isinstance(value, ObjectifiedElement): > ...
return deepcopy(value) > ... else: > ... return DataElement(value, type(value).__name__) > ... > (Note: I don't like the name of that function yet...) Good idea, we should add something like that to objectify - and definitely to the E-factory in objectify. > What about None? It is currently called "none" in the type registry, > whereas type(None).__name__ == "NoneType". I'd prefer to special-case it > here (and maybe in _ObjectifyTypemap), as "none" is so nice and short. Sure, go ahead. Stefan From stefan_ml at behnel.de Thu Jul 5 22:08:40 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 05 Jul 2007 22:08:40 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468D4EC7.1070705@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de> <20070705134244.145570@gmx.net> <468D4EC7.1070705@behnel.de> Message-ID: <468D4FC8.1090609@behnel.de> Stefan Behnel wrote: > jholg at gmx.de wrote: >> The E-factory has some strangeness to it regarding objectify: >>>>> msg.x = objectify.E.INT(5,3,2) >>>>> print objectify.dump(msg) >> msg = None [ObjectifiedElement] >> x = 532 [IntElement] >> but I don't think it makes sense to investigate. > > Hmmm, looks funny but is normal. This is how the factory works normaly: > > http://codespeak.net/lxml/dev/tutorial.html#the-e-factory > > It takes a content list as argument, so that's expected behaviour. ;) >> >> >>> def PYTYPE(value): >> ... if isinstance(value, ObjectifiedElement): >> ... return deepcopy(value) >> ... else: >> ... return DataElement(value, type(value).__name__) >> ... >> (Note: I don't like the name of that function yet...) > > Good idea, we should add something like that to objectify - and definitely to > the E-factory in objectify. ... which then should also fix the above problem, BTW. 
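The naming convention behind the PYTYPE idea, including the special case for None mentioned earlier in the thread, can be sketched as a small stand-alone function. This is a hedged illustration: pytype_name is a hypothetical helper, not an lxml function, and it only shows how a type name would be derived from a value, not how objectify stores it.

```python
import decimal


def pytype_name(value):
    """Return the registry name for value's type (hypothetical helper)."""
    if value is None:
        # type(None).__name__ is "NoneType"; keep the shorter registry name.
        return "none"
    # By convention, every other name is the Python type's own __name__.
    return type(value).__name__
```

Under this convention the string "2323" maps to "str" rather than being guessed as an integer, and a decimal.Decimal value maps to "Decimal".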
Stefan From stefan_ml at behnel.de Fri Jul 6 10:10:02 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 06 Jul 2007 10:10:02 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <468D88F8.3010009@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> Message-ID: <468DF8DA.7030306@behnel.de> Ian Bicking wrote: > I'm still not sure what to call all the parsing functions for HTML. Hmm, there isn't really something comparable in lxml's API so far, so we can't just copy names here. "parse_string()" would match their intention, so that would make it "parse_string_element()" and "parse_string_elements()". Maybe that's too long for an every-day-use function, but at least the names are clear. I don't even think length matters here as parse functions may be used in every program, but likely only once or a couple of times in a few selected places, so clarity outweighs typing here IMHO. "strparse()" would be shorter but might suggest that they only parse plain strings, not unicode (although unicode parsing is somewhat 'advanced use' anyway). On the other hand, I'm wondering why they parse strings in the first place. Wouldn't parsing from a file make more sense? There's always StringIO if you need it (which is efficiently special cased in lxml). Note that libxml2 can even parse from http and ftp URLs directly, so you would even lose something (if only performance) if you required people to load a document into memory first and then pass it to the parser as a string. You'd also lose base URL information, BTW. So, my preferred solution would be to keep the names and make them functions that parse from a filename or file-like object, just like etree.parse() works. Admittedly, that's a bit tricky as you can't check what the file starts with to decide how to parse it without opening it first... > Also > I'd like some method on at least HTML elements for doing CSS selections, > but I'm not sure what to call it.
Any ideas? Well, the xpath() method is named after the language, so why not just call the method "cssselect()" ? That makes it clear where the implementation comes from and matches the existing API. Stefan From ianb at colorstudy.com Fri Jul 6 19:01:31 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 06 Jul 2007 12:01:31 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <468DF8DA.7030306@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> Message-ID: <468E756B.3060202@colorstudy.com> Stefan Behnel wrote: > Ian Bicking wrote: >> I'm still not sure what to call all the parsing functions for HTML. > > Hmm, there isn't really something comparable in lxml's API so far, so we can't > just copy names here. > > "parse_string()" would match their intention, so that would make it > "parse_string_element()" and "parse_string_elements()". Maybe that's too long > for an every-day-use function, but at least the names are clear. I don't even > think length matters here as parse functions may be used in every program, but > likely only once or a couple of times in a few selected places, so clarity > outweighs typing here IMHO. > > "strparse()" would be shorter but might suggest that they only parse plain > strings, not unicode (although unicode parsing is somewhat 'advanced use' anyway). For the different varieties, I wonder if they should just be attributes on the parser? E.g., HTML() (full doc), HTML.element(), HTML.elements(). Similarly, parse(fn) (full doc), parse.element(fn), parse.elements(fn). Then we just have HTML and parse. One nice thing about this is that you don't have to fiddle with imports when you change your mind about what you are parsing. > On the other hand, I'm wondering why they parse strings in the first place. > Wouldn't parsing from a file make more sense? There's always StringIO if you > need it (which is efficiently special cased in lxml). 
Note that libxml2 can > even parse from http and ftp URLs directly, so you would even loose something > (if only performance) if you required people to load a document into memory > first and then pass it to the parser as a string. You'd also loose base URL > information, BTW. Where is base URL information kept? This should be an optional argument for all the parsing functions that don't use a URL. > So, my preferred solution would be to keep the names and make them functions > that parse from a filename or file-like object, just like etree.parse() works. > Admittedly, that's a bit tricky as you can't check what the file starts with > to decide how to parse it without opening it first... If I did that, I'd just have to write the string-based versions over and over, as that's what I use (and pretty much have to use) in all the tests. I suppose outside of tests it's not that useful, but tests are of course important. Plus lxml.XML, HTML, etc., already work on strings, so there should be equivalent parsers. >> Also >> I'd like some method on at least HTML elements for doing CSS selections, >> but I'm not sure what to call it. Any ideas? > > Well, the xpath() method is named after the language, so why not just call the > method "cssselect()" ? That makes it clear where the implementation comes from > and matches the existing API. Sure. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From doug at isotoma.com Sat Jul 7 14:45:35 2007 From: doug at isotoma.com (Doug Winter) Date: Sat, 07 Jul 2007 13:45:35 +0100 Subject: [lxml-dev] xpath on newly created elements Message-ID: <468F8AEF.8060405@isotoma.com> I can't make xpath work on elements that have been created using etree.Element when they have a namespace that doesn't use Clark notation. 
I have a test case: -- begins -- from lxml import etree print "lxml.etree: ", etree.LXML_VERSION print "libxml used: ", etree.LIBXML_VERSION print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION print "libxslt used: ", etree.LIBXSLT_VERSION print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION nsmap=dict(test="http://test.com") e = [] e.append(etree.fromstring('')) e.append(etree.Element("test:foo", nsmap=nsmap)) e.append(etree.Element("test:foo", {'xmlns:test': nsmap['test']})) e.append(etree.Element("{%(test)s}foo" % nsmap)) e.append(etree.Element("{%(test)s}foo" % nsmap, nsmap=nsmap)) for i, elem in enumerate(e): print i, elem.xpath("/test:foo", nsmap) -- ends -- I get this output if I run the above: lxml.etree: (1, 3, 2, 0) libxml used: (2, 6, 27) libxml compiled: (2, 6, 27) libxslt used: (1, 1, 20) libxslt compiled: (1, 1, 20) 0 [] 1 [] 2 [] 3 [] 4 [] I would expect all 5 cases to match the root element, but cases 1 and 2 do not. It appears to be only for elements created using namespace prefixes - and yet these work perfectly well in all other respects. Is this a bug, or should elements not be created this way? Cheers, Doug. -- Isotoma, Open Source Software Consulting - http://www.isotoma.com Tel: 01904 567349, Mobile: 07879 423002, Fax: 020 79006980 Postal Address: Tower House, Fishergate, York, YO10 4UA, UK Registered in England. Company No 5171172. VAT GB843570325. 
Registered Office: 19a Goodge Street, London, W1T 2PH From stefan_ml at behnel.de Sun Jul 8 08:52:00 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Jul 2007 08:52:00 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <468E756B.3060202@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> Message-ID: <46908990.3020707@behnel.de> Hi Ian, Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> I'm still not sure what to call all the parsing functions for HTML. >> > For the different varieties, I wonder if they should just be attributes > on the parser? E.g., HTML() (full doc), HTML.element(), > HTML.elements(). Similarly, parse(fn) (full doc), parse.element(fn), > parse.elements(fn). Then we just have HTML and parse. Funny idea. But that reminds me that HTML is a factory function, so what about calling the string parser functions "HTML()", "HTMLFragment()" and "HTMLFragments()"? The "parse()" function could then get equivalent functions "parse_fragment()" and "parse_fragments()" - although I rate them less important. If you're dealing with fragments, they'd most likely not come from the file. And if you really need to parse some fragments from a file for a rare use case, you can still read the file first and then pass it into "HTMLFragments()". > Where is base URL information kept? This should be an optional argument > for all the parsing functions that don't use a URL. libxml2 stores it in the xmlDoc, but we can overwrite it if we need to. 
We should make that a general option in the string parse functions of etree:

    etree.HTML("", base_url="http://codespeak.net/lxml")
    etree.XML("", base_url="http://codespeak.net/lxml")

Stefan

From stefan_ml at behnel.de Sun Jul 8 09:11:26 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 08 Jul 2007 09:11:26 +0200
Subject: [lxml-dev] xpath on newly created elements
In-Reply-To: <468F8AEF.8060405@isotoma.com>
References: <468F8AEF.8060405@isotoma.com>
Message-ID: <46908E1E.8060107@behnel.de>

Hi,

Doug Winter wrote:
> I can't make xpath work on elements that have been created using
> etree.Element when they have a namespace that doesn't use Clark notation.

Please distinguish between namespaces and prefixes. Prefixes are not
namespaces. Their only use is to reduce the redundancy in XML documents. They
have no meaning by themselves.

> e.append(etree.fromstring(''))

Ok.

> e.append(etree.Element("test:foo", nsmap=nsmap))
> e.append(etree.Element("test:foo", {'xmlns:test': nsmap['test']}))

These actually create elements with the tag name "test:foo", which is
different from "foo" and also different from "{http://test.com}foo" (which is
the only one that declares a namespace). Tag names with colons are not special
cased.

> e.append(etree.Element("{%(test)s}foo" % nsmap))
> e.append(etree.Element("{%(test)s}foo" % nsmap, nsmap=nsmap))

These are ok, too.

> >>> for i, elem in enumerate(e):
> ...     print i, elem.xpath("/test:foo", nsmap)
> 0 []
> 1 []
> 2 []
> 3 []
> 4 []

As expected, as you're looking for "{http://test.com}foo", not for "test:foo".

> It appears to be only for elements created using namespace prefixes -
> and yet these work perfectly well in all other respects.
>
> Is this a bug, or should elements not be created this way?

Well, you can currently create them this way, but it doesn't give you what you
want. Maybe we should catch the case where ':' is contained in a tag name and
raise an exception instead (it won't give you well-formed XML anyway).
That way, it would be clear that this can't work.

When you create elements with namespaces, use the Clark notation.

Stefan

From doug at isotoma.com Sun Jul 8 10:26:34 2007
From: doug at isotoma.com (Doug Winter)
Date: Sun, 08 Jul 2007 09:26:34 +0100
Subject: [lxml-dev] xpath on newly created elements
In-Reply-To: <46908E1E.8060107@behnel.de>
References: <468F8AEF.8060405@isotoma.com> <46908E1E.8060107@behnel.de>
Message-ID: <46909FBA.1030607@isotoma.com>

Stefan Behnel wrote:
> Well, you can currently create them this way, but it doesn't give you what you
> want. Maybe we should catch the case where ':' is contained in a tag name and
> raise an exception instead (it won't give you well-formed XML anyway). That
> way, it would be clear that this can't work.
>
> When you create elements with namespaces, use the Clark notation.

Thanks for the clarification. What threw me is that serialising a tag with a
colon in it works fine, and this all feels quite natural:

>>> nsmap = {'test': 'http://test.com'}
>>> e = etree.Element('test:foo', nsmap=nsmap)
>>> e2 = etree.fromstring(etree.tostring(e))
>>> e.xpath("/test:foo", nsmap)
[]
>>> e2.xpath("/test:foo", nsmap)
[]

I'd expect the round trip to produce something identical, which I guess is
where I got confused. I think it may be worth raising an exception on tags
with colons, since it is a bit surprising.

Cheers,

Doug.

--
Isotoma, Open Source Software Consulting - http://www.isotoma.com
Tel: 01904 567349, Mobile: 07879 423002, Fax: 020 79006980
Postal Address: Tower House, Fishergate, York, YO10 4UA, UK
Registered in England. Company No 5171172. VAT GB843570325.
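The difference between a prefixed tag name and a Clark-notation tag name, and the round trip described above, can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption made so that it runs without lxml installed; like lxml 1.3 in this thread, it accepts a colon in a tag name), and the element names are for illustration only.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping, as in the test case from this thread.
NS = {"test": "http://test.com"}

root = ET.Element("root")

# The tag name is the literal string "test:foo"; no namespace is involved.
literal = ET.SubElement(root, "test:foo")

# Clark notation: this element really is in the http://test.com namespace.
clark = ET.SubElement(root, "{%(test)s}foo" % NS)

# The prefix in the path is resolved through NS, so the search looks for
# "{http://test.com}foo" and only the Clark-notation element matches.
matches = root.findall("test:foo", NS)

# Round trip: serialising the literal tag together with an xmlns:test
# attribute produces text that a parser reads back as a namespaced element.
original = ET.Element("test:foo", {"xmlns:test": NS["test"]})
reparsed = ET.fromstring(ET.tostring(original))

print(literal.tag)
print(clark.tag)
print(matches == [clark])
print(reparsed.tag)
```

The element built in memory keeps the colon as part of its name, while the reparsed one has had the prefix resolved by the parser, which is why the two behave differently under a namespace-aware search.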
Registered Office: 19a Goodge Street, London, W1T 2PH From stefan_ml at behnel.de Sun Jul 8 11:15:07 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Jul 2007 11:15:07 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46908990.3020707@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> Message-ID: <4690AB1B.5090707@behnel.de> Stefan Behnel wrote: > HTML is a factory function, so what about > calling the string parser functions "HTML()", "HTMLFragment()" and > "HTMLFragments()"? That would also make the semantics pretty simple: HTML() will always return a complete HTML document, i.e. wrapped by html/body if necessary. HTMLFragment() will always return a fragment, i.e. a single element that can be pasted into a body. This means: remove html/body if they are present and add a
<div> if there are multiple elements. Maybe check if there actually are any block tags and just wrap the fragments in a <p>
otherwise, but that's more of an optimisation. HTMLFragments() will always return a list of fragments, i.e. text and/or elements and remove any html/body parts that come from the document or were added by the parser. Does that sound like a suitable API? Stefan From stefan_ml at behnel.de Mon Jul 9 21:44:15 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 21:44:15 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4690AB1B.5090707@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> Message-ID: <4692900F.7060008@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> HTML is a factory function, so what about >> calling the string parser functions "HTML()", "HTMLFragment()" and >> "HTMLFragments()"? > > That would also make the semantics pretty simple: > > HTML() will always return a complete HTML document, i.e. wrapped by html/body > if necessary. > > HTMLFragment() will always return a fragment, i.e. a single element that can > be pasted into a body. This means: remove html/body if they are present and > add a

<div> if there are multiple elements. Maybe check if there actually are > any block tags and just wrap the fragments in a <p>
otherwise, but that's more > of an optimisation. > > HTMLFragments() will always return a list of fragments, i.e. text and/or > elements and remove any html/body parts that come from the document or were > added by the parser. I changed this on the branch and also renamed the current do-what-I-mean "parse()" function to "fromstring()". This means that "HTML()" now behaves differently from "fromstring()", although "XML()" and "fromstring()" behave the same in etree. But I find that ok, since they behave as you would expect. HTML() gives you an HTML page (including html/body) and "fromstring()" more or less gives you what you passed in as a string, be it with or without . So, that makes the API complete (for now), I think. I'll double check the modules to see if everything looks nice and consistent and will then try to merge the branch back into the trunk soon to get out a "2.0alpha1". The API may still change during the alpha cycle, but this will hopefully get us some broader feedback on the new package. Stefan From ianb at colorstudy.com Mon Jul 9 21:53:11 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 14:53:11 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692900F.7060008@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> Message-ID: <46929227.9010604@colorstudy.com> Stefan Behnel wrote: > Stefan Behnel wrote: >> Stefan Behnel wrote: >>> HTML is a factory function, so what about >>> calling the string parser functions "HTML()", "HTMLFragment()" and >>> "HTMLFragments()"? >> That would also make the semantics pretty simple: >> >> HTML() will always return a complete HTML document, i.e. wrapped by html/body >> if necessary. >> >> HTMLFragment() will always return a fragment, i.e. a single element that can >> be pasted into a body. 
This means: remove html/body if they are present and >> add a

<div> if there are multiple elements. Maybe check if there actually are
>> any block tags and just wrap the fragments in a <p> otherwise, but that's more
>> of an optimisation.

I think we talked about using <span> if there were no block tags, not <p>.

Something about HTMLFragment(s) seems weird to me. I guess HTML() itself is
weird, though it is reminiscent of XML(). Which is itself weird, since neither
is a class. HTMLFragment() bothers me more because it definitely doesn't
return a different type of object from HTML(), but the naming implies it does.

>> HTMLFragments() will always return a list of fragments, i.e. text and/or
>> elements and remove any html/body parts that come from the document or were
>> added by the parser.
>
> I changed this on the branch and also renamed the current do-what-I-mean
> "parse()" function to "fromstring()".

That seems like a fine name.

> This means that "HTML()" now behaves differently from "fromstring()", although
> "XML()" and "fromstring()" behave the same in etree. But I find that ok, since
> they behave as you would expect. HTML() gives you an HTML page (including
> html/body) and "fromstring()" more or less gives you what you passed in as a
> string, be it with or without .

Sometimes you actually don't get a body, like if you parse
HTML('<link rel="foo">') you only get a head. And sometimes you don't get a
head. Maybe the parsing should normalize this too, as it's a corner case
people often don't think about. For that matter, I think there should
probably be a body property on the html element (or all elements?), since I
find myself commonly plucking out the body element right away.
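The normalisation suggested above could look roughly like the following. This is only a sketch and not how lxml.html behaves: `ensure_head_and_body` is a hypothetical helper, and the standard library's ElementTree stands in for the HTML parser, so the input has to be well-formed markup.

```python
import xml.etree.ElementTree as ET


def ensure_head_and_body(html_root):
    """Make sure an <html> element has both a <head> and a <body> child.

    Hypothetical helper for illustration; returns the (head, body) pair.
    """
    if html_root.tag != "html":
        raise ValueError("expected an <html> root, got <%s>" % html_root.tag)

    head = html_root.find("head")
    if head is None:
        # A missing head is created as the first child.
        head = ET.Element("head")
        html_root.insert(0, head)

    body = html_root.find("body")
    if body is None:
        # A missing body is appended after everything else.
        body = ET.SubElement(html_root, "body")

    return head, body


# A document that only has a head.
doc = ET.fromstring('<html><head><link rel="foo"/></head></html>')
head, body = ensure_head_and_body(doc)
print([child.tag for child in doc])
```

With something like this in place, code that wants the body can rely on it being there instead of checking for None first.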
-- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Mon Jul 9 22:07:47 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 22:07:47 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46929227.9010604@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> Message-ID: <46929593.40508@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >>> HTMLFragment() will always return a fragment, i.e. a single element >>> that can >>> be pasted into a body. This means: remove html/body if they are >>> present and >>> add a

<div> if there are multiple elements. Maybe check if there >>> actually are >>> any block tags and just wrap the fragments in a <p>
otherwise, but >>> that's more >>> of an optimisation. > > I think we talked about using <span> if there were no block tags, not <p>
. Ah, sure. Anyway, I didn't change your implementation, so everything works as before (except for the naming). > Something about HTMLFragment(s) seems weird to me. I guess HTML() > itself is weird, though it is reminiscent of XML(). Which is itself > weird, since neither is a class. It's a factory though, that is mainly meant for HTML 'literals'. And it gives you an HtmlElement or a list of those. Hmmm, I admit that HTMLFragments() does not really sound like returning a list... > HTMLFragment() bothers me more because > it definitely doesn't return a different type of object from HTML(), but > the naming implies it does. Hmmm, I don't really feel the same way, but maybe I'm too biased already. :) It's Python after all, so the actual type is not that relevant. >> This means that "HTML()" now behaves differently from "fromstring()", >> although >> "XML()" and "fromstring()" behave the same in etree. But I find that >> ok, since >> they behave as you would expect. HTML() gives you an HTML page (including >> html/body) and "fromstring()" more or less gives you what you passed >> in as a >> string, be it with or without . > > Sometimes you actually don't get a body, like if you parse HTML(' rel="foo">') you only get a head. And sometimes you don't get a head. > Maybe the parsing should normalize this too, as it's a corner case > people often don't think about. For that matter, I think there should > probably be a body property on the html element (or all elements?), > since I find myself commonly plucking out the body element right away. If we keep the current names, we should make sure they fit the expectations. Having HTML() always return a complete document sounds natural to me. Checking the returned tag for 'body' or 'head' is simple enough. 
Stefan From ianb at colorstudy.com Mon Jul 9 22:13:05 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 15:13:05 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46929593.40508@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> Message-ID: <469296D1.2080804@colorstudy.com> Stefan Behnel wrote: > > Ian Bicking wrote: >> Stefan Behnel wrote: >>>> HTMLFragment() will always return a fragment, i.e. a single element >>>> that can >>>> be pasted into a body. This means: remove html/body if they are >>>> present and >>>> add a

<div> if there are multiple elements. Maybe check if there >>>> actually are >>>> any block tags and just wrap the fragments in a <p>
otherwise, but >>>> that's more >>>> of an optimisation. >> I think we talked about using <span> if there were no block tags, not <p>
. > > Ah, sure. Anyway, I didn't change your implementation, so everything works as > before (except for the naming). > > >> Something about HTMLFragment(s) seems weird to me. I guess HTML() >> itself is weird, though it is reminiscent of XML(). Which is itself >> weird, since neither is a class. > > It's a factory though, that is mainly meant for HTML 'literals'. And it gives > you an HtmlElement or a list of those. Hmmm, I admit that HTMLFragments() does > not really sound like returning a list... Everything is potentially a factory. dict.items() is a list factory. HTML and HTMLFragment are factories for the same kind of object. >> HTMLFragment() bothers me more because >> it definitely doesn't return a different type of object from HTML(), but >> the naming implies it does. > > Hmmm, I don't really feel the same way, but maybe I'm too biased already. :) > > It's Python after all, so the actual type is not that relevant. Yes, but we're already badly abusing naming conventions. These aren't classes, but they are named like classes. This has caused confusion for me in the past. >>> This means that "HTML()" now behaves differently from "fromstring()", >>> although >>> "XML()" and "fromstring()" behave the same in etree. But I find that >>> ok, since >>> they behave as you would expect. HTML() gives you an HTML page (including >>> html/body) and "fromstring()" more or less gives you what you passed >>> in as a >>> string, be it with or without . >> Sometimes you actually don't get a body, like if you parse HTML('> rel="foo">') you only get a head. And sometimes you don't get a head. >> Maybe the parsing should normalize this too, as it's a corner case >> people often don't think about. For that matter, I think there should >> probably be a body property on the html element (or all elements?), >> since I find myself commonly plucking out the body element right away. > > If we keep the current names, we should make sure they fit the expectations. 
> Having HTML() always return a complete document sounds natural to me. I'd be inclined to feel the other way, that HTML() would be more like fromstring(), and return what you give it instead of interpreting everything as a document. But I'm not too concerned there. > Checking the returned tag for 'body' or 'head' is simple enough. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Mon Jul 9 22:52:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 22:52:19 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <469296D1.2080804@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> Message-ID: <4692A003.3050700@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> HTMLFragment() bothers me more because >>> it definitely doesn't return a different type of object from HTML(), but >>> the naming implies it does. >> >> Hmmm, I don't really feel the same way, but maybe I'm too biased >> already. :) >> >> It's Python after all, so the actual type is not that relevant. > > Yes, but we're already badly abusing naming conventions. These aren't > classes, but they are named like classes. This has caused confusion for > me in the past. Ok, I buy that. But what would be the alternative? * element_from_string(s) and elements_from_string(s) * fragment_from_string(s) and fragments_from_string(s) * parse_element_string(s) and ??? * parse_string_element(s) and parse_string_elements(s) I could maybe live with the first. 
Stefan From ianb at colorstudy.com Tue Jul 10 00:48:06 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 17:48:06 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692A003.3050700@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> <4692A003.3050700@behnel.de> Message-ID: <4692BB26.9020304@colorstudy.com> Stefan Behnel wrote: > Ian Bicking wrote: >> Stefan Behnel wrote: >>> Ian Bicking wrote: >>>> HTMLFragment() bothers me more because >>>> it definitely doesn't return a different type of object from HTML(), but >>>> the naming implies it does. >>> Hmmm, I don't really feel the same way, but maybe I'm too biased >>> already. :) >>> >>> It's Python after all, so the actual type is not that relevant. >> Yes, but we're already badly abusing naming conventions. These aren't >> classes, but they are named like classes. This has caused confusion for >> me in the past. > > Ok, I buy that. But what would be the alternative? > > * element_from_string(s) and elements_from_string(s) > * fragment_from_string(s) and fragments_from_string(s) > * parse_element_string(s) and ??? > * parse_string_element(s) and parse_string_elements(s) > > I could maybe live with the first. I'm somewhat more comfortable with fromstring() being do-what-I-mean (i.e., return a document only if a document is passed in), and document_fromstring() for what HTML() currently does (maybe with a little normalization), and fragment_fromstring() for something that *must* be a fragment (which I suppose should strip everything but body, if it is passed a full document, and I think I then even rename body to div in the current code). 
That is, most people are really comfortable working with HTML fragments, and this whole notion of a "valid HTML document" is less of an issue for most people. So when libxml2 turns their fragment into a valid HTML document it can be disconcerting. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From rcdailey at gmail.com Tue Jul 10 06:34:16 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 9 Jul 2007 23:34:16 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows Message-ID: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> Hi, I'm attempting to build LXML for windows. Below are details on the linker errors I'm getting (the compile works fine). Anyone that can help would be greatly appreciated. Thank you! Here is my modified paths in the setup.py file: STATIC_INCLUDE_DIRS = [ "..\\libxml2\\include", "..\\libxslt\\include", "..\\zlib\\include", "..\\iconv\\include" ] STATIC_LIBRARY_DIRS = [ "..\\libxml2\\lib", "..\\libxslt\\lib", "..\\zlib\\lib", "..\\iconv\\lib", "C:\\Program Files\\Microsoft Visual Studio 8\\VC\\lib" ] STATIC_CFLAGS = [] I get the following output in the command line (note the first line is the line I typed in): C:\IT\SDK\lxml>python setup.py build -c mingw32 --static Building lxml version 1.3.2 C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution option: 'zip_safe' warnings.warn(msg) running build running build_py running build_ext building 'lxml.etree' extension writing build\temp.win32-2.5\Release\src\lxml\etree.def C:\mingw\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32- 2.5\Release\src\lxml\etree.o build\temp .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib -L..\zlib\lib -L..\iconv\lib "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" -LC:\Python25\libs -LC:\Python25\PCBuild -lli bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 -lmsvcr71 -o 
build\lib.win32-2 .5\lxml\etree.pyd Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"ws2_32.lib" /DEFAULTLI B:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve 
`/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"advapi32.lib" /DEFAULT LIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve 
`/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized 
Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /xsltutils.obj):..\libxslt\xsltuti:(.text[_xsltTimestamp] +0xa5): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat Decimal]+0x9c): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat Alpha]+0x4b): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat ]+0x6): undefined reference to `_chkstk' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateParseDur ation]+0x226): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateParseDur ation]+0x230): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x119): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x175): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x213): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x28a): more undefined references to `_ftol2' follow ..\libxml2\lib\libxml2_a.lib(int.a.msvc/encoding.obj):..\encoding.c:(.text[_xmlByteConsumed]+0x6): u ndefined 
reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /valid.obj):..\valid.c:(.text[_xmlValidBuildContentModel]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /valid.obj):..\valid.c:(.text[_xmlValidateElementContent]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti on]+0x65): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti on]+0x9d): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /debugXML.obj):..\debugXML.c:(.text[_xmlCtxtDumpElemDecl]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal idateDuration]+0x21c): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal idateDuration]+0x226): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaCom pareDurations]+0x2f): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0xfe): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0x120): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0x171): more undefined references to `_ftol2' follow ..\libxml2\lib\libxml2_a.lib(int.a.msvc /nanohttp.obj):..\nanohttp.c:(.text[_xmlNanoHTTPReadLine]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc/nanoftp.obj):..\nanoftp.c:(.text[_xmlNanoFTPList]+0x6): unde fined reference to `_chkstk' 
..\libxml2\lib\libxml2_a.lib(int.a.msvc/nanoftp.obj):..\nanoftp.c:(.text[_xmlNanoFTPGet]+0x6): undef ined reference to `_chkstk' ..\iconv\lib\iconv_a.lib(iconv.obj):./iconv.c:(.text[_libiconvlist]+0x9): undefined reference to `_c hkstk' ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined reference to `_chkstk' collect2: ld returned 1 exit status error: command 'gcc' failed with exit status 1 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070709/c89306aa/attachment-0001.htm From stefan_ml at behnel.de Tue Jul 10 08:36:00 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jul 2007 08:36:00 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692BB26.9020304@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> <4692A003.3050700@behnel.de> <4692BB26.9020304@colorstudy.com> Message-ID: <469328D0.2060301@behnel.de> Ian Bicking wrote: > I'm somewhat more comfortable with fromstring() being do-what-I-mean > (i.e., return a document only if a document is passed in), and > document_fromstring() for what HTML() currently does (maybe with a > little normalization), and fragment_fromstring() for something that > *must* be a fragment (which I suppose should strip everything but body, > if it is passed a full document, and I think I then even rename body to > div in the current code). Sure, that works well, I think. What about the "fragments" function? I think "fragments_fromstring()" would fit nicely in there, and in the Python context, people would suspect it to return a list. 
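The semantics discussed in this thread for the fragment functions can be illustrated with a toy model. This is a sketch only: the `toy_` functions are hypothetical and are not lxml's implementation, and they rely on the standard library's ElementTree, so they only accept well-formed markup (a real HTML parser is far more lenient).

```python
import xml.etree.ElementTree as ET


def _parse_wrapped(markup):
    # Parse the markup inside a temporary <div> so that several top-level
    # elements and surrounding text are accepted.
    return ET.fromstring("<div>%s</div>" % markup)


def toy_fragments_fromstring(markup):
    """Return a list: optional leading text, then the top-level elements."""
    wrapper = _parse_wrapped(markup)
    items = []
    if wrapper.text and wrapper.text.strip():
        items.append(wrapper.text)
    # Text that follows an element stays attached to it as its .tail.
    items.extend(wrapper)
    return items


def toy_fragment_fromstring(markup):
    """Return exactly one element, wrapping in a <div> when necessary."""
    wrapper = _parse_wrapped(markup)
    has_text = bool(wrapper.text and wrapper.text.strip())
    if len(wrapper) == 1 and not has_text and not (wrapper[0].tail or "").strip():
        # A single element with no stray text: hand it back unwrapped.
        return wrapper[0]
    return wrapper


single = toy_fragment_fromstring("<p>one</p>")
several = toy_fragment_fromstring("<p>one</p><p>two</p>")
pieces = toy_fragments_fromstring("hello <b>world</b>!")

print(single.tag, several.tag, len(several))
print(pieces[0], pieces[1].tag, pieces[1].tail)
```

The plural function returns a list and the singular one always returns a single element, which is the expectation the proposed names are meant to set.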
> That is, most people are really comfortable working with HTML fragments, > and this whole notion of a "valid HTML document" is less of an issue for > most people. So when libxml2 turns their fragment into a valid HTML > document it can be disconcerting. That's why I'm not arguing your functions technically. I think they all make sense, I just want them to be less of a surprise for people who use them. Stefan From stefan_ml at behnel.de Tue Jul 10 09:10:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jul 2007 09:10:44 +0200 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> Message-ID: <469330F4.5060402@behnel.de> Hi, Robert Dailey wrote: > I'm attempting to build LXML for windows. Below are details on the > linker errors I'm getting (the compile works fine). Anyone that can help > would be greatly appreciated. Thank you! [...] > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > Building lxml version 1.3.2 > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution > option: 'zip_safe' > warnings.warn(msg) > running build > running build_py > running build_ext > building 'lxml.etree' extension > writing build\temp.win32-2.5\Release\src\lxml\etree.def > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib > -L..\zlib\lib -L..\iconv\lib > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > -lmsvcr71 -o build\lib.win32-2 > .5\lxml\etree.pyd [...] 
> ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > on]+0x65): undefined reference to `_ftol2' > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > idateDuration]+0x21c): undefined reference to `_ftol2' > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal [...] > idateDuration]+0x226): undefined reference to `_ftol2' > ..\iconv\lib\iconv_a.lib(iconv.obj):./iconv.c:(.text[_libiconvlist]+0x9): > undefined reference to `_c > hkstk' > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined > reference to `_chkstk' > collect2: ld returned 1 exit status > error: command 'gcc' failed with exit status 1 See these: http://mail.gnome.org/archives/xml/2005-April/msg00028.html http://mail.gnome.org/archives/xml/2005-April/msg00042.html Stefan From rcdailey at gmail.com Tue Jul 10 16:15:26 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 09:15:26 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <469330F4.5060402@behnel.de> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> Message-ID: <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> Stefan, Thank you very much for your reply. The articles you linked to me are an interesting read, however I don't feel like they solve my problem. Perhaps I'm a little bit confused on what the articles are suggesting. I'm still stuck on what to do to fix this problem. In fact, I don't even know what the problem is to begin with. I had a hard time relating my problems to the topics discussed in the linked articles. Any further assistance would be greatly appreciated. Thanks again for your reply. On 7/10/07, Stefan Behnel wrote: > > Hi, > > Robert Dailey wrote: > > I'm attempting to build LXML for windows. 
Below are details on the > > linker errors I'm getting (the compile works fine). Anyone that can help > > would be greatly appreciated. Thank you! > [...] > > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > > Building lxml version 1.3.2 > > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution > > option: 'zip_safe' > > warnings.warn(msg) > > running build > > running build_py > > running build_ext > > building 'lxml.etree' extension > > writing build\temp.win32-2.5\Release\src\lxml\etree.def > > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib > > -L..\zlib\lib -L..\iconv\lib > > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > > -lmsvcr71 -o build\lib.win32-2 > > .5\lxml\etree.pyd > [...] > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > > on]+0x65): undefined reference to `_ftol2' > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > idateDuration]+0x21c): undefined reference to `_ftol2' > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > [...] 
> > idateDuration]+0x226): undefined reference to `_ftol2' > > ..\iconv\lib\iconv_a.lib(iconv.obj > ):./iconv.c:(.text[_libiconvlist]+0x9): > > undefined reference to `_c > > hkstk' > > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined > > reference to `_chkstk' > > collect2: ld returned 1 exit status > > error: command 'gcc' failed with exit status 1 > > See these: > > http://mail.gnome.org/archives/xml/2005-April/msg00028.html > http://mail.gnome.org/archives/xml/2005-April/msg00042.html > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/f79a828d/attachment.htm From rcdailey at gmail.com Tue Jul 10 22:00:23 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 15:00:23 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> Message-ID: <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> Can anyone respond on this issue? I would really appreciate it. On 7/10/07, Robert Dailey wrote: > > Stefan, > > Thank you very much for your reply. The articles you linked to me are an > interesting read, however I don't feel like they solve my problem. Perhaps > I'm a little bit confused on what the articles are suggesting. I'm still > stuck on what to do to fix this problem. In fact, I don't even know what the > problem is to begin with. I had a hard time relating my problems to the > topics discussed in the linked articles. Any further assistance would be > greatly appreciated. Thanks again for your reply. > > On 7/10/07, Stefan Behnel wrote: > > > > Hi, > > > > Robert Dailey wrote: > > > I'm attempting to build LXML for windows. 
Below are details on the > > > linker errors I'm getting (the compile works fine). Anyone that can > > help > > > would be greatly appreciated. Thank you! > > [...] > > > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > > > Building lxml version 1.3.2 > > > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown > > distribution > > > option: 'zip_safe' > > > warnings.warn(msg) > > > running build > > > running build_py > > > running build_ext > > > building 'lxml.etree' extension > > > writing build\temp.win32-2.5\Release\src\lxml\etree.def > > > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > > > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > > > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib > > -L..\libxslt\lib > > > -L..\zlib\lib -L..\iconv\lib > > > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > > > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > > > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > > > -lmsvcr71 -o build\lib.win32-2 > > > .5\lxml\etree.pyd > > [...] > > > ..\libxml2\lib\libxml2_a.lib( int.a.msvc > > /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > > > on]+0x65): undefined reference to `_ftol2' > > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > > > > idateDuration]+0x21c): undefined reference to `_ftol2' > > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > [...] 
> > > idateDuration]+0x226): undefined reference to `_ftol2' > > > ..\iconv\lib\iconv_a.lib(iconv.obj > > ):./iconv.c:(.text[_libiconvlist]+0x9): > > > undefined reference to `_c > > > hkstk' > > > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): > > undefined > > > reference to `_chkstk' > > > collect2: ld returned 1 exit status > > > error: command 'gcc' failed with exit status 1 > > > > See these: > > > > http://mail.gnome.org/archives/xml/2005-April/msg00028.html > > http://mail.gnome.org/archives/xml/2005-April/msg00042.html > > > > Stefan > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/aadba114/attachment.htm From sidnei at enfoldsystems.com Tue Jul 10 22:26:06 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 10 Jul 2007 17:26:06 -0300 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> Message-ID: On 7/10/07, Robert Dailey wrote: > Can anyone respond on this issue? I would really appreciate it. Which version are you trying to build? I'm the 'official' maintainer of the binary for Windows. I am planning to make a build of 1.3.2 sometime this week. 
--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From rcdailey at gmail.com Tue Jul 10 22:32:39 2007
From: rcdailey at gmail.com (Robert Dailey)
Date: Tue, 10 Jul 2007 15:32:39 -0500
Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows
In-Reply-To:
References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com>
Message-ID: <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com>

I'm attempting to build 1.3.2 for Windows. So far, here are the steps I've taken:

- Download the Windows binaries for iconv, zlib, libxml2, and libxslt as
  directed from the tutorial on the lxml website.

- Extract all of the folders to the same folder, placing lxml 1.3.2 in this
  folder as well. I now have 5 folders in the same directory. I remove the
  version numbers from the folder names to allow a more readable path to the
  include directories.

- Modify the setup.py file, adding the following code:

  STATIC_INCLUDE_DIRS = [
      "..\\libxml2\\include",
      "..\\libxslt\\include",
      "..\\zlib\\include",
      "..\\iconv\\include"
      ]

  STATIC_LIBRARY_DIRS = [
      "..\\libxml2\\lib",
      "..\\libxslt\\lib",
      "..\\zlib\\lib",
      "..\\iconv\\lib"
      ]

  STATIC_CFLAGS = []

- I then pass the following to the command line (minus the quotes):
  "python setup.py build -c mingw32 --static"

- The compile succeeds fine, but the link stage can't find various symbols,
  such as "xmlFree" and "_ftol2".

This is where I'm stuck. I don't have VS2003 installed so I can't use that.
Thanks for responding.

On 7/10/07, Sidnei da Silva wrote:
>
> On 7/10/07, Robert Dailey wrote:
> > Can anyone respond on this issue? I would really appreciate it.
>
> Which version are you trying to build? I'm the 'official' maintainer
> of the binary for Windows. I am planning to make a build of 1.3.2
> sometime this week.
> > -- > Sidnei da Silva > Enfold Systems http://enfoldsystems.com > Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/6377a105/attachment-0001.htm From sidnei at enfoldsystems.com Tue Jul 10 23:01:45 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 10 Jul 2007 18:01:45 -0300 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> Message-ID: On 7/10/07, Robert Dailey wrote: > I'm attempting to build 1.3.2 for windows. ... > I don't have VS2003 installed so I can't use that. Uh, and I've never tried mingw32, so I can't comment :( If you can wait until tomorrow, I will upload a VS2003-built binary to PyPI. Which version of python are you using? 2.4 or 2.5? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From rcdailey at gmail.com Wed Jul 11 00:16:06 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 17:16:06 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> Message-ID: <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> I'm using Python 2.5. If you would upload a binary I would greatly appreciate it. 
If you also wouldn't mind giving me the link to where I can find the file when it is available I would also appreciate that (if it will be located on Python Cheese Shop, I know how to find it). I realize it may be impractical for you, but if you have any spare time: if you could attempt to build using mingw32 maybe you could figure out the outstanding linker issues I've been having. Maybe you'd be able to solve the problem and then post with your results. This is completely optional; I just ask that you look at it if you're willing and if time allows. I can most definitely wait until tomorrow for your generous binary distribution for Windows. I'm greatly appreciative of your efforts. Thanks for following up with me on this. Take care. On 7/10/07, Sidnei da Silva wrote: > > On 7/10/07, Robert Dailey wrote: > > I'm attempting to build 1.3.2 for windows. > ... > > I don't have VS2003 installed so I can't use that. > > Uh, and I've never tried mingw32, so I can't comment :( > > If you can wait until tomorrow, I will upload a VS2003-built binary to > PyPI. Which version of python are you using? 2.4 or 2.5? > > -- > Sidnei da Silva > Enfold Systems http://enfoldsystems.com > Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/c7df90de/attachment.htm

From stefan_ml at behnel.de Wed Jul 11 08:50:23 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jul 2007 08:50:23 +0200
Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows
In-Reply-To: <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com>
References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com>
Message-ID: <46947DAF.2060701@behnel.de>

Robert Dailey wrote:
> I'm using Python 2.5. If you would upload a binary I would greatly
> appreciate it. If you also wouldn't mind giving me the link to where I
> can find the file when it is available I would also appreciate that (if
> it will be located on Python Cheese Shop, I know how to find it).

They'll be on Cheeseshop.

> I realize it may be impractical for you, but if you have any spare time:
> if you could attempt to build using mingw32 maybe you could figure out
> the outstanding linker issues I've been having. Maybe you'd be able to
> solve the problem and then post with your results. This is completely
> optional; I just ask that you look at it if you're willing and if time
> allows.

From what I've read about the topic so far, it might be a problem to build
against a VC-built libxml2 etc. with mingw32 (that was in the links I posted),
but using it to build extensions against the official Python release /should/
work. So what you could try if you want to build it yourself is build libxml2
and libxslt from sources first using mingw32 and then build lxml against those.
It would actually be interesting for us to know if this works, as it would
allow others to work on lxml from Windows more easily (without buying VC first).

Stefan

From reder at jpl.nasa.gov Wed Jul 11 09:02:58 2007
From: reder at jpl.nasa.gov (Leonard J. Reder)
Date: Wed, 11 Jul 2007 00:02:58 -0700
Subject: [lxml-dev] Compact RelaxNG Validation
Message-ID: <469480A2.3020608@jpl.nasa.gov>

Hello,

Does the lxml validation support the compact form of RelaxNG Schema
language?

Thanks,

Len
--
____________________________________________________
Leonard J. Reder
Jet Propulsion Laboratory
Mars Science Laboratory Project
Flight Software Applications & Data Product Management, Section 316D
Email: reder at jpl.nasa.gov
Phone (Voice): 818-354-3639
Mail Address:
Mail Stop: 171-113
4800 Oak Grove Dr.
Pasadena, CA. 91109
---------------------------------------------------

From stefan_ml at behnel.de Wed Jul 11 09:47:05 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jul 2007 09:47:05 +0200
Subject: [lxml-dev] Compact RelaxNG Validation
In-Reply-To: <469480A2.3020608@jpl.nasa.gov>
References: <469480A2.3020608@jpl.nasa.gov>
Message-ID: <46948AF9.5020707@behnel.de>

Leonard J. Reder wrote:
> Does the lxml validation support the compact form of RelaxNG Schema
> language?

No, but that's been on the wish list for a while. There is a patch for libxml2
that supports it and has been waiting for inclusion for ages. Once libxml2
supports it, we can see if we can also support it in lxml (obviously requires a
backwards compatible implementation, as it must still compile on older libxml2
versions).

The other solution would be to add a separate (Python-)implementation to lxml,
but I am not aware of a spec-compliant Python implementation here. There are
two partial implementations, but they currently fail to handle a larger number
of non-trivial RNC schemas, so there is not much use in integrating them.

Any help is obviously appreciated.
It might already help to keep asking on the libxml2 mailing list.

Stefan

From stefan_ml at behnel.de Wed Jul 11 14:13:12 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jul 2007 14:13:12 +0200
Subject: [lxml-dev] Compact RelaxNG Validation
In-Reply-To: <469480A2.3020608@jpl.nasa.gov>
References: <469480A2.3020608@jpl.nasa.gov>
Message-ID: <4694C958.9050609@behnel.de>

Leonard J. Reder wrote:
> Does the lxml validation support the compact form of RelaxNG Schema
> language?

A possible (though not portable) way would be to pipe RNC through trang:

http://www.thaiopensource.com/relaxng/trang.html

It's written in Java, but there are GCJ'ed Linux binaries available.

Stefan

From micxer at micxer.de Wed Jul 11 14:13:38 2007
From: micxer at micxer.de (micxer)
Date: Wed, 11 Jul 2007 14:13:38 +0200
Subject: [lxml-dev] Ignoring unknown namespaces in XML while validating
In-Reply-To: <469480A2.3020608@jpl.nasa.gov>
References: <469480A2.3020608@jpl.nasa.gov>
Message-ID: <4694C972.9060004@micxer.de>

Hello,

I'm using lxml primarily for validation of XML documents and requests of
UPnP devices. Since many vendors are going to make their devices DLNA
compliant, some additional XML elements appear in the XML docs. I would
have to pay for the DLNA specs so I have no other choice than deleting
these elements in advance and validate the XML afterwards. Is there an
easy way to do this with lxml? Am I missing something?

Thanks,
Michael

From stefan_ml at behnel.de Wed Jul 11 14:30:17 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jul 2007 14:30:17 +0200
Subject: [lxml-dev] Ignoring unknown namespaces in XML while validating
In-Reply-To: <4694C972.9060004@micxer.de>
References: <469480A2.3020608@jpl.nasa.gov> <4694C972.9060004@micxer.de>
Message-ID: <4694CD59.1060904@behnel.de>

Hi,

first of all: please don't respond to posts from a different thread when you
want to start a new one. Mail readers will sort the e-mail into the wrong
thread and confuse people.

micxer wrote:
> I'm using lxml primarily for validation of XML documents and requests of
> UPnP devices. Since many vendors are going to make their devices DLNA
> compliant, some additional XML elements appear in the XML docs. I would
> have to pay for the DLNA specs so I have no other choice than deleting
> these elements in advance and validate the XML afterwards. Is there an
> easy way to do this with lxml? Am I missing something?

Not sure what your problem is exactly. Are these "additional elements" in a
specific namespace? That would make it easy to remove them:

    for el in root.getiterator("{http://the/namespace}*"):
        parent = el.getparent()
        if parent is not None: # not the root element
            parent.remove(el)

Or are they in other namespaces than the main one?

    MAIN_NS = "{http://the/namespace}"
    for el in root.getiterator("*"):
        if not el.tag.startswith(MAIN_NS):
            parent = el.getparent()
            if parent is not None: # not the root element
                parent.remove(el)

Similarly, if you have a set of tag names that must be kept or removed, you can
iterate over all elements and check the tag names against the set.

Does that solve your problem?

Stefan

From jholg at gmx.de Wed Jul 11 15:45:39 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 11 Jul 2007 15:45:39 +0200
Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper
Message-ID: <20070711134539.193100@gmx.net>

Hi,

attached patch (against trunk)

* adds a typed E-factory (called T-factory)
* inserts NoneType into the E-factory/T-factory typemap
* adds the PT() (="PyTyped()") convenience function, a thin wrapper that
  uses the argument value's type to set the pytype
* provides unittests for E-factory, T-factory and PT()
* fixes DataElement() to care for some previously-unhandled corner cases
  concerning None and/or _pytype "none"

Despite what I previously said ;-) I now think it would be better to rename
"none" to "NoneType", to use the same name as the Python builtin original.
While it is a longer name I seriously doubt you need to actually use it
explicitly very often.

By convention, the PyType name should match the Python builtin type name; then
both the T-factory and the PT() function can work smoothly (the only thing
special-cased is the Python type name "unicode" which gets substituted by "str").

Therefore, the patch also changes "none" to "NoneType" in objectify and the
objectify tests/doctests.

I'd really like to see the PT() function go into the 1.3 series, too.

Please take a look, I can come up with some documentation if you like it.

Holger
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tfactory_pt_nonetype.patch
Type: application/octet-stream
Size: 23331 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070711/02980d26/attachment-0001.obj

From rcdailey at gmail.com Wed Jul 11 17:25:09 2007
From: rcdailey at gmail.com (Robert Dailey)
Date: Wed, 11 Jul 2007 10:25:09 -0500
Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows
In-Reply-To: <46947DAF.2060701@behnel.de>
References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> <46947DAF.2060701@behnel.de>
Message-ID: <496954360707110825o141c43e9o1138f4ebecc4dacb@mail.gmail.com>

> It would actually be interesting for us to know if this works, as it would
> allow others to work on lxml from windows more easily (without buying VC
> first).

It would be interesting, however I've tried building libxml from sources
before and it was very non-trivial.
In fact, it was so difficult I never actually succeeded. It turns out that
there's a chain of API dependencies that I can never fulfill. You end up
building sources for say 10 different libraries just in order to build the
sources for libxml. If there's a nice, clean walkthrough on it I imagine I
could figure it out. Likewise with libxslt.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/07c057bb/attachment.htm

From rcdailey at gmail.com Wed Jul 11 22:03:01 2007
From: rcdailey at gmail.com (Robert Dailey)
Date: Wed, 11 Jul 2007 15:03:01 -0500
Subject: [lxml-dev] Version 1.2.1 not working?
Message-ID: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com>

Hi,

I have the following Python code:

from lxml import etree
from StringIO import StringIO

def loadXMLFile( filename ):
    f = open( filename, 'r' )
    xmldata = f.read()
    root = etree.parse( StringIO( xmldata ) )
    f.close()
    return root

Python either crashes or hangs at the etree.parse() call. Below is the
contents of the XML file I'm opening:

Anyone know why it isn't working? Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/03c7d2ff/attachment.htm

From rogerpatterson at gmail.com Wed Jul 11 22:25:39 2007
From: rogerpatterson at gmail.com (Roger Patterson)
Date: Wed, 11 Jul 2007 13:25:39 -0700
Subject: [lxml-dev] Version 1.2.1 not working?
In-Reply-To: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com>
References: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com>
Message-ID: <46953CC3.9000503@gmail.com>

Hi Robert,

Firstly, you don't need to read your XML file into a string and then
convert it to a StringIO object. You can just feed the file handle to
etree.parse().
But other than that, your XML file parsed fine for me, and I'm using
lxml 1.2.1.
You may want to make sure you've installed correctly.
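For illustration, a minimal sketch (not taken from the thread) of handing a
path or an open file object straight to etree.parse(), without the StringIO
round trip. The helper names load_xml_file and load_xml_handle and the sample
document are hypothetical; the fallback import of the standard library's
ElementTree is an assumption added only so the snippet still runs where lxml
is not installed.

```python
import os
import tempfile

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib ElementTree accepts the same parse() arguments
    # for this simple case, so it serves as a stand-in.
    import xml.etree.ElementTree as etree


def load_xml_file(filename):
    """Parse an XML file given its path; parse() opens the file itself."""
    return etree.parse(filename)


def load_xml_handle(filename):
    """Parse from an open file object; binary mode leaves decoding to the parser."""
    with open(filename, "rb") as f:
        return etree.parse(f)


# Hypothetical sample document, written to a temporary file for the demo.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "w") as out:
    out.write("<config><item name='a'/><item name='b'/></config>")

try:
    tree_from_path = load_xml_file(path)
    tree_from_handle = load_xml_handle(path)
    root_tag = tree_from_path.getroot().tag
    item_count = len(tree_from_handle.getroot().findall("item"))
    print(root_tag, item_count)
finally:
    os.remove(path)
```

Both calls return an ElementTree object, so getroot() is needed to reach the
root element, which also applies to the loadXMLFile() function quoted above.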
-Roger Robert Dailey wrote: > Hi, > > I have the following Python code: > > > from lxml import etree > from StringIO import StringIO > > def loadXMLFile( filename ): > f = open( filename, 'r' ) > xmldata = f.read() > root = etree.parse( StringIO( xmldata ) ) > f.close() > return root > > > Python either crashes or hangs at the etree.parse() call. Below is the > contents of the XML file I'm opening: > > > > > > > > > > > > > > > > > > > > > > Anyone know why it isn't working? Thanks! > > ------------------------------------------------------------------------ > > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From rcdailey at gmail.com Wed Jul 11 23:21:04 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Wed, 11 Jul 2007 16:21:04 -0500 Subject: [lxml-dev] Version 1.2.1 not working? In-Reply-To: <46953CC3.9000503@gmail.com> References: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com> <46953CC3.9000503@gmail.com> Message-ID: <496954360707111421r20ab8758yd633412a84db5d8a@mail.gmail.com> Roger, Thank you for your reply. I wasn't aware that you could just pass in a filename. From the examples on the website I've only seen strings passed in. I suppose it might be worth my time to look at the API reference. The way I installed lxml is through the binary distribution (installer for windows) on the Cheese Shop. Unless the installer is broken, it should have installed fine. There wasn't much user intervention during the installation process. Thanks for your help. I'll try passing in the file and see what I get. Thank you. On 7/11/07, Roger Patterson wrote: > > Hi Robert, > > Firstly, you don't need to read your XML file into a string and then > convert it to a StringIO object. You can just feed the file handle to > etree.parse(). 
> But other than that, your XML file parsed fine for me, and I'm using > lxml 1.2.1 > You may want to make sure you've installed correctly. > -Roger > > Robert Dailey wrote: > > Hi, > > > > I have the following Python code: > > > > > > from lxml import etree > > from StringIO import StringIO > > > > def loadXMLFile( filename ): > > f = open( filename, 'r' ) > > xmldata = f.read() > > root = etree.parse( StringIO( xmldata ) ) > > f.close() > > return root > > > > > > Python either crashes or hangs at the etree.parse() call. Below is the > > contents of the XML file I'm opening: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anyone know why it isn't working? Thanks! > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/811ee121/attachment.htm From rcdailey at gmail.com Wed Jul 11 23:31:12 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Wed, 11 Jul 2007 16:31:12 -0500 Subject: [lxml-dev] Version 1.2.1 not working? In-Reply-To: <46953CC3.9000503@gmail.com> References: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com> <46953CC3.9000503@gmail.com> Message-ID: <496954360707111431w5d3417c4nc1686559589f0f03@mail.gmail.com> One more thing: I just tested again. Running my python script through the command line works fine. However, running it through my IDE (PyScripter), it crashes on the second time I run it. Interesting. On 7/11/07, Roger Patterson wrote: > > Hi Robert, > > Firstly, you don't need to read your XML file into a string and then > convert it to a StringIO object. You can just feed the file handle to > etree.parse(). 
> But other than that, your XML file parsed fine for me, and I'm using > lxml 1.2.1 > You may want to make sure you've installed correctly. > -Roger > > Robert Dailey wrote: > > Hi, > > > > I have the following Python code: > > > > > > from lxml import etree > > from StringIO import StringIO > > > > def loadXMLFile( filename ): > > f = open( filename, 'r' ) > > xmldata = f.read() > > root = etree.parse( StringIO( xmldata ) ) > > f.close() > > return root > > > > > > Python either crashes or hangs at the etree.parse() call. Below is the > > contents of the XML file I'm opening: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anyone know why it isn't working? Thanks! > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > lxml-dev mailing list > > lxml-dev at codespeak.net > > http://codespeak.net/mailman/listinfo/lxml-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/86177c39/attachment.htm From stefan_ml at behnel.de Thu Jul 12 00:49:43 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Jul 2007 00:49:43 +0200 Subject: [lxml-dev] Version 1.2.1 not working? In-Reply-To: <496954360707111421r20ab8758yd633412a84db5d8a@mail.gmail.com> References: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com> <46953CC3.9000503@gmail.com> <496954360707111421r20ab8758yd633412a84db5d8a@mail.gmail.com> Message-ID: <46955E87.3070100@behnel.de> Robert Dailey wrote: > Thank you for your reply. I wasn't aware that you could just pass in a > filename. From the examples on the website I've only seen strings passed > in. True, thanks for noting this. I updated the trunk docs to explain a bit more what parse() and fromstring() actually support. 
http://codespeak.net/lxml/dev/parsing.html > The way I installed lxml is through the binary distribution (installer > for windows) on the Cheese Shop. Unless the installer is broken, it > should have installed fine. There wasn't much user intervention during > the installation process. lxml's windows binaries are static builds, so there isn't much that can go wrong here. I suggest waiting for Sidnei's 1.3.2 build to check if the problem remains. Stefan From sidnei at enfoldsystems.com Thu Jul 12 02:33:30 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 11 Jul 2007 21:33:30 -0300 Subject: [lxml-dev] Test Failures in lxml 1.3.2 Message-ID: I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it has something to do with the libxml2 version? ====================================================================== FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBas e) ---------------------------------------------------------------------- Traceback (most recent call last): File "c:\Python24\lib\unittest.py", line 260, in run testMethod() File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33 , in test_module_HTML_unicode unicode(self.uhtml_str.encode('UTF8'), 'UTF8')) File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual raise self.failureException, \ AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u' test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

' ---------------------------------------------------------------------- -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at enfoldsystems.com Thu Jul 12 02:40:56 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 11 Jul 2007 21:40:56 -0300 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: References: Message-ID: Versions used, FWIW: TESTED VERSION: Python: (2, 5, 0, 'final', 0) lxml.etree: (1, 3, 2, 0) libxml used: (2, 6, 28) libxml compiled: (2, 6, 28) libxslt used: (1, 1, 19) libxslt compiled: (1, 1, 19) -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From sidnei at enfoldsystems.com Thu Jul 12 02:48:21 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Wed, 11 Jul 2007 21:48:21 -0300 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> Message-ID: Binaries for Python 2.4 and 2.5 up! http://cheeseshop.python.org/pypi/lxml/1.3.2 -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From stefan_ml at behnel.de Thu Jul 12 09:31:57 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Jul 2007 09:31:57 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: References: Message-ID: <4695D8ED.8090002@behnel.de> Sidnei da Silva wrote: > I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it > has something to do with the libxml2 version? 
> > ====================================================================== > FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBas > e) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "c:\Python24\lib\unittest.py", line 260, in run > testMethod() > File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33 > , in test_module_HTML_unicode > unicode(self.uhtml_str.encode('UTF8'), 'UTF8')) > File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual > raise self.failureException, \ > AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92 head>

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u' > test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

'

Hmmm, didn't I take that test out? :)

Erik Swanson reported the same problem on OS-X. I guess that makes parsing
HTML from a unicode string pretty much a Unix-only thing, though maybe it's
actually rather a UCS4-only thing. No idea how to fix that (or what actually
goes wrong here).

It seems like the problem only arises on UCS-2 systems. Could anyone with a
UCS-2 Linux system check if this also fails there? UCS-2 can be detected
with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't.

The test case itself is pretty simple:

  >>> import lxml.etree as et
  >>> html = et.HTML(u'\xc3\xa1\uf8d2')
  >>> print repr(et.tounicode(html))
  u'\xc3\xa1\uf8d2'

To see that the actual problem is the parser, not the serialiser, you can do:

  >>> print repr(et.tostring(html, 'utf-8'))
  '\xc3\x83\xc2\xa1\xef\xa3\x92'

Hoping for feedback and ideas,

Stefan

From stefan_ml at behnel.de Thu Jul 12 09:36:37 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 12 Jul 2007 09:36:37 +0200
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <4695D8ED.8090002@behnel.de>
References: <4695D8ED.8090002@behnel.de>
Message-ID: <4695DA05.5010707@behnel.de>

Stefan Behnel wrote:
> Erik Swanson reported the same problem on OS-X. I guess that makes parsing
> HTML from a unicode string pretty much a Unix-only thing

Before anyone else notices, I obviously meant "pretty much a *Linux*-only
thing"...

Stefan :)

From jholg at gmx.de Thu Jul 12 09:58:11 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 12 Jul 2007 09:58:11 +0200
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <4695D8ED.8090002@behnel.de>
References: <4695D8ED.8090002@behnel.de>
Message-ID: <20070712075811.193110@gmx.net>

Hi,

> It seems like the problem only arises on UCS-2 systems. Could anyone with > a > UCS-2 Linux system check if this is also fails there?
UCS-2 can be detected
> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111
> here. I

Runs without failures on Solaris with sys.maxunicode==65535:

0 lb54320 at adevp02 .../lxml-1.3 $ python2.4 -c "import sys; print sys.maxunicode"
65535
0 lb54320 at adevp02 .../lxml-1.3 $ make test PYTHON=python2.4
python2.4 setup.py build_ext -i
Building lxml version 1.3.3-44945
/apps/prod/lib/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'zip_safe'
  warnings.warn(msg)
running build_ext
python2.4 test.py -p -v
TESTED VERSION:
    Python:           (2, 4, 4, 'final', 0)
    lxml.etree:       (1, 3, 3, 44945)
    libxml used:      (2, 6, 27)
    libxml compiled:  (2, 6, 27)
    libxslt used:     (1, 1, 20)
    libxslt compiled: (1, 1, 20)
607/607 (100.0%): Doctest: xpathxslt.txt
----------------------------------------------------------------------
Ran 607 tests in 1.310s

OK
PYTHONPATH=src python2.4 selftest.py
126 tests ok.
PYTHONPATH=src python2.4 selftest2.py
88 tests ok.
0 lb54320 at adevp02 .../lxml-1.3 $

Note that I ran from the 1.3 branch, not the 1.3.2 release (found no 1.3.2 tag
in the repository), so maybe the offending test has been disabled already (?)

Holger
--
GMX FreeMail: 1 GB mailbox, 5 e-mail addresses, 10 free SMS.
Full details and free sign-up: http://www.gmx.net/de/go/freemail

From stefan_ml at behnel.de Thu Jul 12 10:11:41 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 12 Jul 2007 10:11:41 +0200
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <20070712075811.193110@gmx.net>
References: <4695D8ED.8090002@behnel.de> <20070712075811.193110@gmx.net>
Message-ID: <4695E23D.1030605@behnel.de>

jholg at gmx.de wrote:
>> It seems like the problem only arises on UCS-2 systems. Could anyone with a
>> UCS-2 Linux system check if this is also fails there? UCS-2 can be detected
>> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111
>> here. I
>
> Runs without failures on Solaris with sys.maxunicode==65535:

Thanks for testing.
Just to be sure, Sun Solaris machines are big endian, right? Intel is little endian, so Solaris actually uses a different byte encoding here. So I think we can restrict the problem to UCS-2 little endian. Any non-Windows, non-Mac-OS-X testers for that one? Or maybe any Mac-OS PPC testers? > Note that I ran from 1.3 branch, not 1.3.2 release (found not 1.3.2 tag in the repository) Thanks for reminding me. It's there now. > so maybe the offending test has been disabled already (?) No, it's still in there. Stefan From jholg at gmx.de Thu Jul 12 10:48:18 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 12 Jul 2007 10:48:18 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <4695E23D.1030605@behnel.de> References: <4695D8ED.8090002@behnel.de> <20070712075811.193110@gmx.net> <4695E23D.1030605@behnel.de> Message-ID: <20070712084818.193080@gmx.net> Hi Stefan, > > Runs without failures on Solaris with sys.maxunicode==65535: > > Thanks for testing. Just to be sure, Sun Solaris machines are big endian, > right? Intel is little endian, so Solaris actually uses a different byte > encoding here. > > So I think we can restrict the problem to UCS-2 little endian. Any > non-Windows, non-Mac-OS-X testers for that one? Right, and I've been a bit sloppy: SUN Sparc is big endian architecture, whereas Intel is little endian, so I should've rather said "Runs without failures on *Sparc* Solaris". Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. 
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail From tseaver at palladion.com Thu Jul 12 18:53:07 2007 From: tseaver at palladion.com (Tres Seaver) Date: Thu, 12 Jul 2007 12:53:07 -0400 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <4695D8ED.8090002@behnel.de> References: <4695D8ED.8090002@behnel.de> Message-ID: <46965C73.8030505@palladion.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Stefan Behnel wrote: > Sidnei da Silva wrote: >> I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it >> has something to do with the libxml2 version? >> >> ====================================================================== >> FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBas >> e) >> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "c:\Python24\lib\unittest.py", line 260, in run >> testMethod() >> File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33 >> , in test_module_HTML_unicode >> unicode(self.uhtml_str.encode('UTF8'), 'UTF8')) >> File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual >> raise self.failureException, \ >> AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92> head>

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u' >> test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

' > > Hmmm, didn't I take that test out? :) > > Erik Swanson reported the same problem on OS-X. I guess that makes parsing > HTML from a unicode string pretty much a Unix-only thing, though maybe it's > actually rather a UCS4-only thing. No idea how to fix that (or what actually > goes wrong here). > > It seems like the problem only arises on UCS-2 systems. Could anyone with a > UCS-2 Linux system check if this is also fails there? UCS-2 can be detected > with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I > heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't. > > The test case itself is pretty simple: > > >>> import lxml.etree as et > >>> html = et.HTML(u'\xc3\xa1\uf8d2') > >>> print repr(et.tounicode(html)) > u'\xc3\xa1\uf8d2' > > To see that the actual problem is the parser, not the serialiser, you can do: > > >>> print repr(et.tostring(html, 'utf-8')) > '\xc3\x83\xc2\xa1\xef\xa3\x92' > > Hoping for feedback and ideas, > > Stefan I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my Ubuntu laptop:: $ cat et_test.py import sys print sys.version print sys.maxunicode import lxml.etree as et html = et.HTML(u'\xc3\xa1\uf8d2') print repr(et.tounicode(html)) $ /path/to/ucs4/bin/python et_test.py 2.4.3 (#2, Oct 6 2006, 07:52:30) [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] 1114111 u'\xc3\xa1\uf8d2' [/home/tseaver] $ /path/to/ucs2/bin/python et_test.py 2.4.4 (#1, Apr 19 2007, 16:14:47) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] 65535 u'\xc3\xa1\uf8d2' Tres. 
- -- =================================================================== Tres Seaver +1 540-429-0999 tseaver at palladion.com Palladion Software "Excellence by Design" http://palladion.com -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFGllxz+gerLs4ltQ4RAjZ/AJ9Pvf4WBX1cZywNmaePspGyFiD/TQCfTGIO mPMPYd0dfCk/uCVyRJpmAu4= =Y4mN -----END PGP SIGNATURE----- From stefan_ml at behnel.de Thu Jul 12 23:21:29 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 12 Jul 2007 23:21:29 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <46965C73.8030505@palladion.com> References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> Message-ID: <46969B59.3050105@behnel.de> Hi Tres, thanks for testing. Tres Seaver wrote: > Stefan Behnel wrote: >> It seems like the problem only arises on UCS-2 systems. Could anyone with a >> UCS-2 Linux system check if this is also fails there? UCS-2 can be detected >> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I >> heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't. 
>>
>> The test case itself is pretty simple:
>>
>>   >>> import lxml.etree as et
>>   >>> html = et.HTML(u'\xc3\xa1\uf8d2')
>>   >>> print repr(et.tounicode(html))
>>   u'\xc3\xa1\uf8d2'
>
>> To see that the actual problem is the parser, not the serialiser, you can do:
>
>>   >>> print repr(et.tostring(html, 'utf-8'))
>>   '\xc3\x83\xc2\xa1\xef\xa3\x92'
>
> I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my
> Ubuntu laptop::
>
>   $ cat et_test.py
>   import sys
>   print sys.version
>   print sys.maxunicode
>   import lxml.etree as et
>   html = et.HTML(u'\xc3\xa1\uf8d2')
>   print repr(et.tounicode(html))
>
>   $ /path/to/ucs4/bin/python et_test.py
>   2.4.3 (#2, Oct 6 2006, 07:52:30)
>   [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
>   1114111
>   u'\xc3\xa1\uf8d2'
>   [/home/tseaver]
>
>   $ /path/to/ucs2/bin/python et_test.py
>   2.4.4 (#1, Apr 19 2007, 16:14:47)
>   [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
>   65535
>   u'\xc3\xa1\uf8d2'

Hmmm, that leaves me hoping that my test case actually touched the problem.
Could we get feedback from someone with a non-working setup here?

So far, we have the following cases:

- it fails on MacOS-X (Intel) with a UCS-2 little endian Python
- it fails on Windows with a UCS-2 little endian Python
- it works on Linux/Intel with UCS-2 little endian
- it works on Linux/Intel with UCS-4 little endian
- it works on Solaris/Sparc with UCS-2 big endian

I can't really see a pattern there...
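
[Editor's sketch] The cases listed above differ along two axes: the width of Python's internal unicode representation and the byte order of the machine. A minimal way to report both when sending in test results, assuming only the standard sys module; the function name describe_unicode_build is made up for illustration, and on Python 3.3 or later the width check always reports the wide value:

```python
import sys


def describe_unicode_build():
    """Return (width, byteorder) for the running interpreter.

    width is 'UCS-2' when sys.maxunicode is 65535 (narrow build) and
    'UCS-4' otherwise (wide build; always the case on Python >= 3.3).
    byteorder is sys.byteorder, i.e. 'little' or 'big'.
    """
    # Hypothetical helper, not part of lxml or of the thread's patches.
    width = 'UCS-2' if sys.maxunicode == 0xFFFF else 'UCS-4'
    return width, sys.byteorder


if __name__ == '__main__':
    print("%s %s endian" % describe_unicode_build())
```

Posting this one line together with the lxml, libxml2 and libxslt versions would make reports from different platforms directly comparable.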
Stefan From stefan_ml at behnel.de Fri Jul 13 10:08:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 13 Jul 2007 10:08:05 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <46969B59.3050105@behnel.de> References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> Message-ID: <469732E5.6040000@behnel.de> Stefan Behnel wrote: > - it fails on MacOS-X (Intel) with a UCS-2 little endian Python > - it fails on Windows with a UCS-2 little endian Python > - it works on Linux/Intel with UCS-2 little endian > - it works on Linux/Intel with UCS-4 little endian > - it works on Solaris/Sparc with UCS-2 big endian > > I can't really see a pattern there... I've heard from a few people who tested (either failing or succeeding) that they have fairly recent libxml2 versions. Also, libxml2 works for me from 2.6.20 through 2.6.29. But what about the iconv version? Is there any difference on the systems that were tested so far? "iconv --version" says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu also). What about the others? Stefan From stefan_ml at behnel.de Fri Jul 13 10:34:29 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 13 Jul 2007 10:34:29 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <469732E5.6040000@behnel.de> References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> Message-ID: <46973915.8080506@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> - it fails on MacOS-X (Intel) with a UCS-2 little endian Python >> - it fails on Windows with a UCS-2 little endian Python >> - it works on Linux/Intel with UCS-2 little endian >> - it works on Linux/Intel with UCS-4 little endian >> - it works on Solaris/Sparc with UCS-2 big endian >> >> I can't really see a pattern there... 
>
> I've heard from a few people who tested (either failing or succeeding) that
> they have fairly recent libxml2 versions. Also, libxml2 works for me from
> 2.6.20 through 2.6.29. But what about the iconv version? Is there any
> difference on the systems that were tested so far? "iconv --version" says 2.5
> for me. I assume it's about the same for Tres (who's on Ubuntu also). What
> about the others?

Sure, that must be it. Erik said he had iconv 1.9 on MacOS-X, the Windows
binaries of libxml2 come with a pre-built iconv 1.9.2, but most Linux systems
have a more recent version installed (at least the Debian based ones).

I sent a mail to Igor Zlatković to update the official libxml2 builds with a
newer iconv version.

Sidnei, in case it's not too much hassle for you, could you install a more
recent iconv version yourself to try this out?

Stefan

From jholg at gmx.de Fri Jul 13 10:38:08 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Fri, 13 Jul 2007 10:38:08 +0200
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <46973915.8080506@behnel.de>
References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de>
Message-ID: <20070713083808.17490@gmx.net>

Hi,

> > 2.6.20 through 2.6.29. But what about the iconv version? Is there any
> > difference on the systems that were tested so far? "iconv --version" says 2.5
> > for me. I assume it's about the same for Tres (who's on Ubuntu also). What
> > about the others?

libxml2 built without iconv here (Sparc Solaris).

Holger
--
The GMX SmartSurfer helps you save up to 70% of your online costs!
Ideal for modem and ISDN: http://www.gmx.net/de/go/smartsurfer

From sidnei at enfoldsystems.com Fri Jul 13 14:46:56 2007
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Fri, 13 Jul 2007 09:46:56 -0300
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <46973915.8080506@behnel.de>
References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de>
Message-ID:

On 7/13/07, Stefan Behnel wrote:
> Sure, that must be it. Erik said he had iconv 1.9 on MacOS-X, the Windows
> binaries of libxml2 come with a pre-built iconv 1.9.2, but most Linux systems
> have a more recent version installed (at least the Debian based ones).
>
> I sent a mail to Igor Zlatković to update the official libxml2 builds with a
> newer iconv version.
>
> Sidnei, in case it's not too much hassle for you, could you install a more
> recent iconv version yourself to try this out?

As soon as Igor makes a new iconv build, sure.

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

From stefan_ml at behnel.de Fri Jul 13 15:01:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 13 Jul 2007 15:01:22 +0200
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <20070713083808.17490@gmx.net>
References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net>
Message-ID: <469777A2.5070804@behnel.de>

jholg at gmx.de wrote:
>>> 2.6.20 through 2.6.29. But what about the iconv version? Is there any
>>> difference on the systems that were tested so far? "iconv --version"
>>> says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu
>>> also). What about the others?
>
> libxml2 built without iconv here (Sparc Solaris).
I first thought your comment wasn't relevant as Sparc uses a different
encoding already, but then I looked back into the code of libxml2 and found
that iconv is not used for detecting the encoding, only for later decoding if
libxml2 itself doesn't support the encoding.

So iconv isn't the real problem here, it's rather libxml2 that fails to detect
the encoding on some platforms. What we use here is the function
xmlDetectCharEncoding() in encoding.c, which (AFAICT) checks for a BOM. Maybe
these platforms do not have that in their unicode strings...

Here is a patch that will print out the internal representation of a unicode
string when importing etree.

Could someone with a Windows or MacOS machine please try this and send me the
results?

Stefan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unicode-debug.patch
Type: text/x-diff
Size: 1019 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070713/f42be58d/attachment.bin

From sidnei at enfoldsystems.com Fri Jul 13 15:19:10 2007
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Fri, 13 Jul 2007 10:19:10 -0300
Subject: [lxml-dev] Test Failures in lxml 1.3.2
In-Reply-To: <469777A2.5070804@behnel.de>
References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net> <469777A2.5070804@behnel.de>
Message-ID:

On 7/13/07, Stefan Behnel wrote:
> Here is a patch that will print out the internal representation of a unicode
> string when importing etree.
>
> Could someone with a Windows or MacOS machine please try this and send me the
> results?
There you have it: C:\src\lxml-build\lxml-1.3.2>python test.py -vv '<\x00t\x00e\x00s\x00t\x00/\x00>\x00' -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From stefan_ml at behnel.de Fri Jul 13 15:40:17 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 13 Jul 2007 15:40:17 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net> <469777A2.5070804@behnel.de> Message-ID: <469780C1.6030207@behnel.de> Sidnei da Silva wrote: > On 7/13/07, Stefan Behnel wrote: >> Here is a patch that will print out the internal representation of a >> unicode string when importing etree. >> >> Could someone with a Windows or MacOS machine please try this and send >> me the results? > > There you have it: > > C:\src\lxml-build\lxml-1.3.2>python test.py -vv > '<\x00t\x00e\x00s\x00t\x00/\x00>\x00' Thanks, the attached patch should work around it. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: unicode-detect.patch Type: text/x-diff Size: 1059 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070713/4e5547ef/attachment.bin From sidnei at enfoldsystems.com Fri Jul 13 15:52:05 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Fri, 13 Jul 2007 10:52:05 -0300 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <469780C1.6030207@behnel.de> References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net> <469777A2.5070804@behnel.de> <469780C1.6030207@behnel.de> Message-ID: On 7/13/07, Stefan Behnel wrote: > Thanks, the attached patch should work around it. Nope, still fails the same way. 
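
[Editor's sketch] The dump posted earlier in this exchange ('<\x00t\x00e\x00s\x00t\x00/\x00>\x00') has the byte layout of '<test/>' encoded as UTF-16 little endian without a byte order mark. The following sketch uses only the standard codecs module to reproduce those bytes and to show that no BOM is present. It illustrates the data only; it is not the content of either attached patch and makes no claim about how libxml2 detects encodings. It assumes a Python version that accepts both b'' and u'' literals (2.6+, 3.3+):

```python
import codecs

# The bytes from the debug output quoted in the thread.
dumped = b'<\x00t\x00e\x00s\x00t\x00/\x00>\x00'

# Encoding the text as UTF-16 little endian *without* a BOM
# reproduces the dump byte for byte.
assert u'<test/>'.encode('utf-16-le') == dumped

# A check that relies on a byte order mark has nothing to go on here:
# the data starts with '<\x00', not with the UTF-16LE BOM '\xff\xfe'.
has_bom = dumped.startswith(codecs.BOM_UTF16_LE)
print(has_bom)

# With the BOM prepended by hand, the generic 'utf-16' codec can pick
# the right byte order on its own and recover the original text.
with_bom = codecs.BOM_UTF16_LE + dumped
print(with_bom.decode('utf-16') == u'<test/>')
```

Whether prepending a BOM or naming the encoding explicitly is the right workaround inside lxml is a separate question that the thread leaves open at this point.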
====================================================================== FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBas e) ---------------------------------------------------------------------- Traceback (most recent call last): File "c:\Python24\lib\unittest.py", line 260, in run testMethod() File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33 , in test_module_HTML_unicode unicode(self.uhtml_str.encode('UTF8'), 'UTF8')) File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual raise self.failureException, \ AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u' test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

' -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From mwm-keyword-lxml.9112b8 at mired.org Sat Jul 14 10:09:37 2007 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sat, 14 Jul 2007 04:09:37 -0400 Subject: [lxml-dev] Element children not right? Message-ID: <20070714040937.3b280164@bhuda.mired.org> I think I have a very, very basic bug here: >>> from sys import version >>> version '2.5.1 (r251:54863, May 15 2007, 15:31:37) \n[GCC 3.4.6 [FreeBSD] 20060305]' >>> from lxml import etree >>> etree.LXML_VERSION (1, 3, 2, 0) >>> etree.LIBXML_VERSION (2, 6, 29) >>> etree.LIBXSLT_VERSION (1, 1, 21) >>> d = etree.parse('/home/mwm/.plpwmrc.xml') >>> n = d.find('namemenu') >>> [el for el in n] [, , , , , , , , , , ] >>> from xml.etree.ElementTree import parse >>> d2 = parse('/home/mwm/.plpwmrc.xml') >>> n2 = d2.find('namemenu') >>> [el for el in n2] [, , , , , , , , , ] Note that the original ElementTree implementation only returns children that are *elements*, whereas the lxml version returns all the children. This makes life much more interesting, especially as there isn't an obvious method for checking whether or not the node value is actually an element. http://www.mired.org/consulting.html Independent Network/Unix/Perforce consultant, email for more information. From mwm-keyword-lxml.9112b8 at mired.org Sat Jul 14 10:11:24 2007 From: mwm-keyword-lxml.9112b8 at mired.org (Mike Meyer) Date: Sat, 14 Jul 2007 04:11:24 -0400 Subject: [lxml-dev] Custom resolvers vs. RelaxNG? Message-ID: <20070714041124.3f736a93@bhuda.mired.org> Hi, Should custom resolvers work with Relax NG documents? They don't seem to be being invoked at all in my tests (and I already figured out that I have to have libxml2.6.29 for them to work at all, so that's what I'm using). 
Here's the resolver class:

    class MyResolver(Resolver):
        """A resolver that returns local strings."""

        __entities = {'vivid.rng': _vivid, 'dev.rng': _dev}

        def resolve(self, uri, id, context):
            """Returns the right string for the given name."""

            print "Resolving", uri, id, context
            if self.__entities.has_key(uri):
                return self.resolve_string(self.__entities[uri], context)
            else:
                return None

And it's invoked like so:

    parser = XMLParser()
    parser.resolvers.add(MyResolver())
    vivid = RelaxNG(fromstring(_vivid, parser))
    dev = RelaxNG(fromstring(_dev, parser))

The critical schema is dev:

    _dev = """ """

I'd like the include of vivid.rng to pick up the _vivid string. However,
when I run this, it just complains about the RelaxNG format, even though
it works fine if I have a copy of _vivid in vivid.rng in the current
directory:

    Traceback (most recent call last):
      File "", line 1, in
      File "schema/__init__.py", line 545, in
        dev = RelaxNG(fromstring(_dev, parser), parser)
      File "relaxng.pxi", line 70, in etree.RelaxNG.__init__
    etree.RelaxNGParseError: Document is not valid Relax NG

Clearly, there's a bug here. The question is, is it in my understanding
of things, or in lxml?

Thanks,
http://www.mired.org/consulting.html
Independent Network/Unix/Perforce consultant, email for more information.

From stefan_ml at behnel.de Sat Jul 14 13:17:24 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 14 Jul 2007 13:17:24 +0200
Subject: [lxml-dev] Element children not right?
In-Reply-To: <20070714040937.3b280164@bhuda.mired.org> References: <20070714040937.3b280164@bhuda.mired.org> Message-ID: <4698B0C4.8060008@behnel.de> Mike Meyer schrieb: > I think I have a very, very basic bug here: > >>>> from sys import version >>>> version > '2.5.1 (r251:54863, May 15 2007, 15:31:37) \n[GCC 3.4.6 [FreeBSD] 20060305]' >>>> from lxml import etree >>>> etree.LXML_VERSION > (1, 3, 2, 0) >>>> etree.LIBXML_VERSION > (2, 6, 29) >>>> etree.LIBXSLT_VERSION > (1, 1, 21) >>>> d = etree.parse('/home/mwm/.plpwmrc.xml') >>>> n = d.find('namemenu') >>>> [el for el in n] > [, , , , , , , , , , ] >>>> from xml.etree.ElementTree import parse >>>> d2 = parse('/home/mwm/.plpwmrc.xml') >>>> n2 = d2.find('namemenu') >>>> [el for el in n2] > [, , , , , , , , , ] > > Note that the original ElementTree implementation only returns > children that are *elements*, whereas the lxml version returns all > the children. This makes life much more interesting, especially as > there isn't an obvious method for checking whether or not the node > value is actually an element. Sure there is, just check for the tag property being a string. lxml.etree is compatible to ElementTree in that it returns the factory functions for everything that does not have a tag (comments, PIs, entities). >>> [el for el in n if isinstance(el.tag, basestring)] should do what you want, in both lxml.etree and ElementTree. I don't see why this should be a bug, it's just an extended tree model. Comments are nothing to frown upon, they are as much part of the XML world as element nodes, so why would you want to ignore them? Stefan From stefan_ml at behnel.de Sat Jul 14 13:24:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 14 Jul 2007 13:24:44 +0200 Subject: [lxml-dev] Custom resolvers vs. RelaxNG? 
In-Reply-To: <20070714041124.3f736a93@bhuda.mired.org> References: <20070714041124.3f736a93@bhuda.mired.org> Message-ID: <4698B27C.7030707@behnel.de> Hi, Mike Meyer wrote: > Should custom resolvers work with Relax NG documents? They don't seem > to be being invoked at all in my tests (and I already figured out that > I have to have libxml2.6.29 for them to work at all, so that's what > I'm using). I don't think there is a test case for calling custom resolvers from RNG, but there are definitely test cases for custom resolvers, so I can assure you that they work nicely with libxml2 2.6.20 through 2.6.29. You cannot currently use them with XInclude as libxml2 does not support it, so maybe RNG suffers from a similar problem or needs some special setup. > Here's the resovler class: > > class MyResolver(Resolver): > """A resolver that returns local strings.""" > > __entities = {'vivid.rng': _vivid, 'dev.rng': _dev} > > def resolve(self, uri, id, context): > """Returns the right string for the given name.""" > > print "Resolving", uri, id, context > if self.__entities.has_key(uri): > return self.resolve_string(self.__entities[uri], context) > else: > return None > > And it's invoked like so: > > parser = XMLParser() > parser.resolvers.add(MyResolver()) > vivid = RelaxNG(fromstring(_vivid, parser)) > dev = RelaxNG(fromstring(_dev, parser)) > > The critical schema is dev: > > _dev = """ > > > > > > > > > > > > > > > > > > > > > > > """ > > I'd like the include of vivid.rng to pick up the _vivid > string. 
However, when I run this, it just complains about the RelaxNG
> format, even though they it works fine if I have a copy of _vivid in
> vivid.rng in the current directory:
>
> Traceback (most recent call last):
>   File "", line 1, in
>   File "schema/__init__.py", line 545, in
>     dev = RelaxNG(fromstring(_dev, parser), parser)
>   File "relaxng.pxi", line 70, in etree.RelaxNG.__init__
> etree.RelaxNGParseError: Document is not valid Relax NG

At first sight, I can't see why this should not work. I'll look into it.

Stefan

From jf.pieronne at laposte.net Sat Jul 14 15:01:54 2007
From: jf.pieronne at laposte.net (Jean-François Piéronne)
Date: Sat, 14 Jul 2007 15:01:54 +0200
Subject: [lxml-dev] OpenVMS port of lxml
Message-ID: <4698C942.7060408@laposte.net>

Hi,

lxml 1.3.2 has been successfully ported to OpenVMS (Alpha and Itanium
platform).

The problems found are:

- A lot of compilation warnings; I can send them if there is some interest.

TESTED VERSION:
    Python:           (2, 5, 1, 'final', 0)
    lxml.etree:       (1, 3, 2, 0)
    libxml used:      (2, 6, 29)
    libxml compiled:  (2, 6, 29)
    libxslt used:     (1, 1, 21)
    libxslt compiled: (1, 1, 21)

- Some of the tests failed

* test_etree.py
................................................................................
..................................EE ====================================================================== ERROR: test_xinclude (__main__.XIncludeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_etree.py", line 1604, in test_xinclude self.include( tree ) AttributeError: 'XIncludeTestCase' object has no attribute 'include' ====================================================================== ERROR: test_xinclude_text (__main__.XIncludeTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_etree.py", line 1597, in test_xinclude_text self.include( etree.ElementTree(root) ) AttributeError: 'XIncludeTestCase' object has no attribute 'include' ---------------------------------------------------------------------- Ran 116 tests in 0.258s FAILED (errors=2) I don't think it's a VMS-specific problem, but I don't have any other platforms to test on * test_xslt.py The Python interpreter crashes with an error during the rundown of the program: assert error: expression = autoInterpreterState, in file PYTHON_ROOT:[Python]pystate.c;1 at line 563 I can provide a complete traceback * test_elementtree.py and some other tests raised errors like ====================================================================== ERROR: test_ElementTree (__main__.ETreeTestCaseBase) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_elementtree.py", line 252, in test_ElementTree Element = self.etree.Element AttributeError: 'NoneType' object has no attribute 'Element' Thanks for any advice. Jean-François From stefan_ml at behnel.de Sun Jul 15 17:17:34 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 17:17:34 +0200 Subject: [lxml-dev] Custom resolvers vs. RelaxNG?
In-Reply-To: <20070714041124.3f736a93@bhuda.mired.org> References: <20070714041124.3f736a93@bhuda.mired.org> Message-ID: <469A3A8E.8080703@behnel.de> Mike Meyer wrote: > Should custom resolvers work with Relax NG documents? They should, yes, but they can't currently (as of libxml2 2.6.29). libxml2 parses includes out of context. It passes neither the original parser options nor any kind of reference to the original document on to the parser used for parsing the imported document. Instead, it calls a plain "xmlReadFile(URL, NULL, 0)". So it's impossible for lxml to figure out which resolver to use. The same applies to XMLSchema, BTW. It creates a new parser context internally, but forgets to pass on the private context from the original document. Bad luck. I'll file a bug report on libxml2. Stefan From stefan_ml at behnel.de Sun Jul 15 19:46:31 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 19:46:31 +0200 Subject: [lxml-dev] Element children not right? In-Reply-To: <20070714094300.16f76ba5@bhuda.mired.org> References: <20070714040937.3b280164@bhuda.mired.org> <4698B0C4.8060008@behnel.de> <20070714094300.16f76ba5@bhuda.mired.org> Message-ID: <469A5D77.7040407@behnel.de> Hi, Mike Meyer wrote: > On Sat, 14 Jul 2007 13:17:24 +0200 Stefan Behnel wrote: >> Mike Meyer wrote: >>> I think I have a very, very basic bug here: >>> Note that the original ElementTree implementation only returns >>> children that are *elements*, whereas the lxml version returns all >>> the children. This makes life much more interesting, especially as >>> there isn't an obvious method for checking whether or not the node >>> value is actually an element. >> Sure there is, just check for the tag property being a string. lxml.etree is >> compatible to ElementTree in that it returns the factory functions for >> everything that does not have a tag (comments, PIs, entities). >> >>> [el for el in n if isinstance(el.tag, basestring)] > > Obvious is relative. 
I did find that test, but this doesn't say "I'm > processing only element children". You have to know that those are the > only children an element can have for which the tag attribute is a > string. Maybe that's obvious to you, it certainly wasn't to me, and > probably isn't to the casual reader either. Not when compared to > something like: > > ***** WARNING: CODE FOR EXPOSITION ONLY. THESE DO NOT WORK ****** > > [el for el in n if isinstance(el, etree.Element)] > [el for el in n if etree.is_element(el)] > [el for el in n if el.type == etree.ELEMENT_TYPE] > > (or any of a dozen ways that says "n is an element" as opposed to > saying "some attribute of n is some other type that the two examples > we're looking at only use for that attribute on the type of interest.") Well, I admit that it's not the most obvious thing ever and that the lxml docs do not provide obvious advise here. This is definitely something we can improve. Still, that's how ElementTree works. So, if you start by comparing lxml.etree's behaviour to ElementTree and claim that we "have a very, very basic bug here", you may have to accept that someone tells you that it's not a bug in lxml.etree. Note that ElementTree only shows this behaviour because its parser silently drops comments and processing instructions completely. If you constructed the same tree through the API, the behaviour of lxml.etree and ElementTree would be identical for what you tested and you would run into the same problem with both. >>> import xml.etree.ElementTree as ET >>> root = ET.XML("") >>> root.getchildren() [] >>> root.append(ET.PI("test")) >>> root.getchildren() [, at b79dd04c>] The difference is that lxml.etree also behaves this way for a tree that was parsed. I find this rather consistent. >>> import lxml.etree as et >>> et.XML("") >>> root = et.XML("") >>> root.getchildren() [, ] >> should do what you want, in both lxml.etree and ElementTree. > > Right. 
But I don't need the extra filter in ElementTree - it returns > just the element children and that's what it's documented as doing. I > quite literally found this when I took working ElementTree code and > tried to move it to lxml's implementation of ElementTree (and then > moved back to ElementTree rather than use test on the tag attributes > type). Hmm, I see your point. Maybe we should provide an ElemenTree compatible parser that strips comments and PIs from the document. That would at least help in porting existing code. > And this doesn't help at all if I want to distinguish between PIs and > comments. Well, read the ElementTree docs (or re-read my last mail). For elements, the tag property returns the tag name, for everything else, it returns the respective *factory function*, i.e. etree.Comment or etree.ProcessingInstruction. So, testing what you have if it's not an Element is actually not more than a straight "is". >> I don't see why this should be a bug, it's just an extended tree model. >> Comments are nothing to frown upon, they are as much part of the XML world as >> element nodes, so why would you want to ignore them? > > You want to ignore them because you're working with an API that > ignores them. If ElementTree returned them, then I wouldn't consider > it a bug. ElementTree *does* return them if they are in the tree. It's just the parser that does not put them in the tree. So it actually encourages you to write code that depends on the way the tree was constructed. If, one day, you feed one of your functions with a tree containing Comments that were added through the API, it will just stop working and you will have a hard time figuring out why your perfectly working and not-touched-in-a-long-time function fails for a certain subset of the input. > However, the single most common thing to want to do when > processing the children of a node is to recurse on the element > children. 
ElementTree makes that easy by only exposing the children > that are elements via the list API, as otherwise you wind up having to > write more code for the most common use case. > > Even if you don't want to ignore them, I've never seen a case where > you wanted to do the same things to all the children of a node. So > there should be an easy way to figure out what kind of child you're > dealing with, if nothing else. > > Basically, I think this is a bug in the design of the lxml extensions > to the ElementTree API. Rather than extending the API, lxml changes > the API by adding comments and PIs to the list interface. This isn't > mentioned in any of the lxml.etree documentation, other than one > paragraph on the compatibility page that implies that this might > happen. As an aside, lxml doesn't add the other children of an element > to that interface, though it would make as much sense to do so. This > is presumably because the ElementTree API already has good interfaces > for dealing with text and attribute children. ... and Elements and PIs and Comments, which all behave mostly the same at the API level of lxml.etree and ElementTree. > That same paragraph notes that you can enable ignoring comments by > tweaking the parser (but doesn't deal with ignoring PIs). True, as I said, an ET compatibility parser would help here. > If the goal > is to be compatible with ElementTree, this is wrong - the default mode > should be the one that's compatible, and getting the incompatible mode > should take extra work. If the goal is to be easy-to-use, then it's > still wrong - the default mode should be the more common use case > (though what's more common clearly depends on your environment). Ok, but your proposal is based on the wrong assumption that it's the API that shows this behaviour in ET. But since it's the parser, being compatible would mean drop PIs and comments by default, thus changing the document behind the scenes in an I/O cycle. 
I think this is the wrong behaviour for a default. > Even if you don't agree that this is the most common use case, being > easy-to-use means there should be a way to check for each of the three > types in the list API that's obvious when you read it and easy to find > by looking at the class docstrings. Of course, doing this breaks > compatibility with other ElementTree implementations, but we punted on > that when we added extra children to the list API. I assume you read carefully up to this point, so I won't need to comment on this. > My gut reaction is that it would be better to actually extend the API > rather than changing it. For example, have get_children, getiterator, > iterchildren, find, and so on accept extra keyword arguments to > indicate it will *not* (won't because we're extending the ElementTree > API, which only provides "will") filter PIs or comments (or even > elements, though that one has the opposite default). However, that's > just a first thought on the issue. At least for all the iterator functions, you can pass a "tag" argument. If you pass nothing, you will get all the nodes. If you pass "*", you will get only Elements (i.e. nodes that have any tag name). If you pass one of the factory functions, you will get only those. Currently, you cannot pass the Element function (which is ET compatible behaviour), but maybe that would be a nice and consistent alternative to passing "*". Stefan From stefan_ml at behnel.de Sun Jul 15 19:58:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 19:58:19 +0200 Subject: [lxml-dev] OpenVMS port of lxml In-Reply-To: <4698C942.7060408@laposte.net> References: <4698C942.7060408@laposte.net> Message-ID: <469A603B.3020809@behnel.de> Jean-François Piéronne wrote: > lxml 1.3.2 has been successfully ported to OpenVMS (Alpha and Itanium > platform). Cool. It's been working on AMD64 already, but I'm happy to see other 64-bit platforms are not too much of a hassle either.
> The problems founds are: > - A lot of compilation warning, I can send then if there is some interest. Definitely, send them in private E-Mail. Most likely, we'd have to fix them in Pyrex, though. > - Some of the tests failed > > ====================================================================== > ERROR: test_xinclude (__main__.XIncludeTestCase) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_etree.py", line 1604, in test_xinclude > self.include( tree ) > AttributeError: 'XIncludeTestCase' object has no attribute 'include' > > ====================================================================== > ERROR: test_xinclude_text (__main__.XIncludeTestCase) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_etree.py", line 1597, in test_xinclude_text > self.include( etree.ElementTree(root) ) > AttributeError: 'XIncludeTestCase' object has no attribute 'include' > > ====================================================================== > ERROR: test_ElementTree (__main__.ETreeTestCaseBase) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_elementtree.py", line 252, in test_ElementTree > Element = self.etree.Element > AttributeError: 'NoneType' object has no attribute 'Element' :) those are ok, they are not supposed to be run on their own anyway. Please run "python test.py etree xinclude" or "test.py elementtree" from the main directory instead (or just run "python test.py" to test everything). > * test_xslt.py > Python interpreter crash with an error during the rundown of the program: > assert error: expression = autoInterpreterState, in file > PYTHON_ROOT:[Python]pystate.c;1 at line 563 Hmmm, that one looks bad, though. Would you have any more hints on what happens here? 
Stefan From stefan_ml at behnel.de Sun Jul 15 21:13:31 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 21:13:31 +0200 Subject: [lxml-dev] DOM2 range() support? In-Reply-To: <4698F51A.9@comcast.net> References: <4698AF5E.50807@behnel.de> <4698F51A.9@comcast.net> Message-ID: <469A71DB.1090200@behnel.de> Hi, just continuing this discussion on the right mailing list (from XML-SIG). Gloria W wrote: > I need to be able to do DOM2 range() functionality, to meet the > requirements of a back end Dojo interface I have written in Python. > I have already written my own DOM2 compliant node schema, out of > necessity, but without the range functionality, since it is so tedious. > I have forced a requirement on my Dojo developer colleague to not make > use of range() for the time being. > I wish it were properly supported in Python, but for the time being, I > seem to be the only person needing it. This is actually the first time I come across the concept of DOM2 ranges. http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html Please correct me, but from a quick look at this: http://www.w3.org/TR/DOM-Level-2-Traversal-Range/ranges.html#Level-2-Range-Definitions isn't that simply a tuple of two positions, where each position contains an Element and optionally one of the following: - an attribute "{ns}name" and a string position in the attribute value - a string position in the text - a string position in the tail Admittedly, the DOM2 interface on top of that is a little more complex and (should I say it?) DOM-ishly obfuscated, but there shouldn't be more to it than that, right? I mean, the respective W3C spec is only some 13 sections long, there *can't* be more than that. :) Hmm, now that we have cool HTML support and CSS selection, I wouldn't mind having a Range class hanging around in some (lxml.range?) module. Doesn't even sound like you'd have to implement it in Pyrex, Python code should be enough here. 
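The description above (a range as a pair of positions) could be sketched roughly as follows. This is only an illustration, not an existing lxml module: the names Position, Range and same_string_slice are made up for this example, the stdlib ElementTree stands in for lxml.etree, and only the simple case where both ends fall into the same text, tail or attribute string is handled.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET  # stand-in; lxml.etree has the same .text/.tail/.get()


@dataclass(frozen=True)
class Position:
    """Hypothetical boundary point: an element plus an optional string offset."""
    element: ET.Element
    where: str = "element"            # "element", "text", "tail" or "attribute"
    offset: int = 0                   # character offset into the chosen string
    attribute: Optional[str] = None   # "{ns}name", used only if where == "attribute"

    def string(self) -> Optional[str]:
        """Return the string this position indexes into, or None for 'element'."""
        if self.where == "text":
            return self.element.text or ""
        if self.where == "tail":
            return self.element.tail or ""
        if self.where == "attribute":
            return self.element.get(self.attribute, "")
        return None


@dataclass(frozen=True)
class Range:
    """Hypothetical range: simply a pair of positions."""
    start: Position
    end: Position

    def same_string_slice(self) -> Optional[str]:
        """Common case only: both ends lie in the same text/tail/attribute."""
        s, e = self.start, self.end
        if (s.element is e.element and s.where == e.where
                and s.attribute == e.attribute and s.where != "element"):
            return s.string()[s.offset:e.offset]
        return None  # range spans several nodes: not handled in this sketch


root = ET.XML("<p>Hello <b>big</b> world</p>")
r = Range(Position(root, "text", 0), Position(root, "text", 5))
print(r.same_string_slice())  # prints "Hello"
```

A range that crosses node boundaries would need a tree traversal between the two positions, which this sketch deliberately leaves out.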
Stefan From stefan_ml at behnel.de Sun Jul 15 21:47:45 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 21:47:45 +0200 Subject: [lxml-dev] OpenVMS port of lxml In-Reply-To: <469A7343.1070601@laposte.net> References: <4698C942.7060408@laposte.net> <469A603B.3020809@behnel.de> <469A7343.1070601@laposte.net> Message-ID: <469A79E1.2060903@behnel.de> Jean-François Piéronne wrote: > Stefan Behnel wrote: >> Jean-François Piéronne wrote: >>> lxml 1.3.2 has been successfully ported to OpenVMS (Alpha and Itanium >>> platform). > I have attached the generated warnings for one of the source files. > Other source files produce the same kind of warnings. Thanks. That's actually my 'fault' rather than Pyrex', but those are only irrelevant warnings about generated generic code, nothing that could break anything. It looks like the C compiler on VMS is a bit more picky about some things. The warnings should go away with an explicit cast in the right place of the generated header file (see attached pyrex-cast.patch). > ====================================================================== > FAIL: test_module_HTML_unicode > (lxml.tests.test_htmlparser.HtmlParserTestCaseBase) > ---------------------------------------------------------------------- I know that one. Does the VMS port of libxml2 use iconv? Do you know the native unicode encoding that VMS Python uses? UTF16 or UCS-4? big or little endian? > The NAN/NaNQ is probably a VMS Looks like it.
Here's a simple fix for the test case: ======================= Index: src/lxml/tests/test_xpathevaluator.py =================================================================== --- src/lxml/tests/test_xpathevaluator.py (Revision 44997) +++ src/lxml/tests/test_xpathevaluator.py (Arbeitskopie) @@ -23,7 +23,7 @@ tree.xpath('number(/a)')) tree = self.parse('A') actual = str(tree.xpath('number(/a)')) - expected = ['nan', '1.#qnan'] + expected = ['nan', '1.#qnan', 'nanq'] if not actual.lower() in expected: self.fail('Expected a NAN value, got %s' % actual) ======================= > So on AXP 432 tests are executed and 794 on Itanium. You probably have ElementTree installed on Itanium (maybe Python 2.5?), but not on AXP. The test suite also contains comparative compatibility tests against ElementTree that only run when it is installed. >>> * test_xslt.py >>> Python interpreter crash with an error during the rundown of the program: >>> assert error: expression = autoInterpreterState, in file >>> PYTHON_ROOT:[Python]pystate.c;1 at line 563 >> >> Hmmm, that one looks bad, though. Would you have any more hints on what >> happens here? > > Full traceback: > #> python test_xslt.py > ............................................... > ---------------------------------------------------------------------- > Ran 47 tests in 0.223s > > OK > assert error: expression = autoInterpreterState, in file > PYTHON_ROOT:[Python]pys > tate.c;1 at line 563 > %SYSTEM-F-OPCCUS, opcode reserved to customer fault at > PC=FFFFFFFF80AA0DF4, PS=0 > 000001B > %TRACE-F-TRACEBACK, symbolic stack dump follows > image module routine line rel PC abs PC > 0 FFFFFFFF80AA0DF4 > FFFFFFFF80AA0DF4 > 0 FFFFFFFF80B34A74 > FFFFFFFF80B34A74 > PYTHONSHR pystate PyGILState_Ensure 20250 0000000000001374 > 00000000002C7E04 > libxml2xsltlxmlmod etree __pyx_f_5etree__receiveError > 67362 000000000002F1E4 > 0000000000B1E704 > libxml2xsltlxmlmod error __xmlRaiseError > 16027 0000000000001248 [...] 
Hmmm, but your above tests ran through, didn't they? Not sure what happens here. Could this be a problem with differences in threading on VMS? Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: pyrex-cast.patch Type: text/x-diff Size: 681 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070715/728527b1/attachment.bin From stefan_ml at behnel.de Sun Jul 15 23:07:54 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 15 Jul 2007 23:07:54 +0200 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: References: <4695D8ED.8090002@behnel.de> <46965C73.8030505@palladion.com> <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net> <469777A2.5070804@behnel.de> <469780C1.6030207@behnel.de> Message-ID: <469A8CAA.8000507@behnel.de> Sidnei da Silva wrote: > On 7/13/07, Stefan Behnel wrote: >> Thanks, the attached patch should work around it. > > Nope, still fails the same way. Ok, I read in the libxml2 sources a bit more and found that I was actually using iconv alias names for the UTF16 encodings, "UTF16LE" instead of "UTF-16LE". It looks like libxml2 only understands the latter natively, so I switched to using it instead. Could you test the attached patch? Thanks, Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: unicode-detect-name-fix.patch Type: text/x-diff Size: 1126 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070715/021acd23/attachment.bin From stefan_ml at behnel.de Mon Jul 16 00:24:50 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Jul 2007 00:24:50 +0200 Subject: [lxml-dev] Element children not right? 
In-Reply-To: <20070715143753.573b47cd@bhuda.mired.org> References: <20070714040937.3b280164@bhuda.mired.org> <4698B0C4.8060008@behnel.de> <20070714094300.16f76ba5@bhuda.mired.org> <469A5D77.7040407@behnel.de> <20070715143753.573b47cd@bhuda.mired.org> Message-ID: <469A9EB2.8080805@behnel.de> Mike Meyer wrote: > On Sun, 15 Jul 2007 19:46:31 +0200 Stefan Behnel wrote: >>> Right. But I don't need the extra filter in ElementTree - it returns >>> just the element children and that's what it's documented as doing. I >>> quite literally found this when I took working ElementTree code and >>> tried to move it to lxml's implementation of ElementTree (and then >>> moved back to ElementTree rather than use test on the tag attributes >>> type). >> Hmm, I see your point. Maybe we should provide an ElemenTree compatible parser >> that strips comments and PIs from the document. That would at least help in >> porting existing code. > > I think that would certainly solve the problem I ran into. The current trunk has an "ETCompatXMLParser" class that is a subclass of XMLParser with a different default setup. Currently that only means that remove_comments and remove_pis are active by default. Please try setting it as a default parser on your side and report back if you find any further incompatibilities. BTW: ETCompatXMLParser will not become the default parser in lxml.etree, it will only be available to simplify code portability. Stefan From ianb at colorstudy.com Mon Jul 16 08:33:25 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 16 Jul 2007 02:33:25 -0400 Subject: [lxml-dev] lxml.html and forms Message-ID: <469B1135.9030807@colorstudy.com> I feel a little bad adding a bunch of stuff to lxml.html when it's supposed to get all stable. But I was getting ready for a presentation on lxml.html, and this seemed like it made it a lot more fun. 
So with my last commit you can do things like: from lxml.html import parse, open_in_browser url = 'http://tripsweb.rtachicago.com/' page = parse(url) page.make_links_absolute(url) form = page.forms[0] form.inputs['Orig'].value = '1535 W Leland' form.inputs['Dest'].value = '847 W Bertrand' res = form.submit() res_page = parse(res) res_page.make_links_absolute(res_page.geturl()) open_in_browser(res_page) It's kind of like Mechanize, only of course better. There's some things I still haven't figured out. Some data structures are convenient, but maybe have some non-obvious aspects. Like form.inputs, which doesn't always return elements (for things like checkboxes it can return something that is more like a logical element). Also, I'd like to merge in most of the functionality of lxml.html.formfill (except for error-filling), so where form.form_values() currently returns a list of all the values as they'd be if submitted, I'd like to make it settable. And maybe even have form.form_values return something that would modify inputs in-place, like form.form_values['Orig'] = '1535 W Leland' mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'. Another option question is actual form submission. Right now it uses urllib. But I like httplib2, for instance, and I'd like it to be possible to use that. Also, I'm wondering about how to keep track of the URL when a page is parsed. Stefan mentioned if you use parse(url) it would keep track of that... where? I'd like it to be possible to keep the URL around for any kind of parsing, e.g., with document_fromstring(html, url=X). 
Ian From stefan_ml at behnel.de Mon Jul 16 10:19:49 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Jul 2007 10:19:49 +0200 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469B1135.9030807@colorstudy.com> References: <469B1135.9030807@colorstudy.com> Message-ID: <469B2A25.9000204@behnel.de> Hi Ian, Ian Bicking wrote: > I feel a little bad adding a bunch of stuff to lxml.html when it's > supposed to get all stable. And you should! You're lucky that there will be more than one 2.0alpha version. :) No, seriously. We do this for fun, right? And adding cool stuff from time to time is a pretty good way to keep up the motivation. > So with my last commit you can do things like: > > from lxml.html import parse, open_in_browser > url = 'http://tripsweb.rtachicago.com/' > page = parse(url) > page.make_links_absolute(url) > form = page.forms[0] > form.inputs['Orig'].value = '1535 W Leland' > form.inputs['Dest'].value = '847 W Bertrand' > res = form.submit() > res_page = parse(res) > res_page.make_links_absolute(res_page.geturl()) > open_in_browser(res_page) Sounds like you should put something like that into the docs. (hint, hint) > It's kind of like Mechanize, only of course better. There's some things > I still haven't figured out. Some data structures are convenient, but > maybe have some non-obvious aspects. Like form.inputs, which doesn't > always return elements (for things like checkboxes it can return > something that is more like a logical element). Have I ever encouraged you to look at objectify? It has special data Elements that behave like normal Python data classes, but are actually objects. Something similar could apply here, you could use a string-like Element for "input" and a boolean-like Element for "checkbox". Hmmm, and radio buttons could be lists? Although a boolean-like Element always has the disadvantage that bool() would behave different for it than for an in-tree element (i.e.: does it have children?) 
> Also, I'd like to merge > in most of the functionality of lxml.html.formfill (except for > error-filling), so where form.form_values() currently returns a list of > all the values as they'd be if submitted, I'd like to make it settable. > And maybe even have form.form_values return something that would > modify inputs in-place, like form.form_values['Orig'] = '1535 W Leland' > mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'. Hmmm, I already stumbled over the name "form_values" when it actually behaves more like "form_items". This looks like it should be a dictionary-like class, but it's actually more like a hash bag, as parameters can repeat. Those don't seem to have an intuitive mapping to Python idioms, at least not when the most common use case with unique keys is supposed to be convenient. Although, you could actually return a subclass of "list" in form_values that also supports __getitem__ and __setitem__ with string keys. Then, at least, it would be consistent for reading *and* writing. That sounds nicely polymorphic and is sufficiently close to a dict to be helpful in the most common case, but stays mainly a list for the general case. You could then call it "inputitems" to let it match with "inputs" and dicts. > Another open question is actual form submission. Right now it uses > urllib. But I like httplib2, for instance, and I'd like it to be > possible to use that. What about a module global setting? You would most likely not want to use both. Alternatively, you could provide a simple interface that takes a URL and a list of name-value pairs and opens it. Then implement it for both libraries and provide an optional keyword argument in submit() that takes a callable function with that signature (or maybe an instance of a dedicated abstract superclass, if you want to make the interface visible). > Also, I'm wondering about how to keep track of the URL when a page is > parsed.
Stefan mentioned if you use parse(url) it would keep track of > that... where? I'd like it to be possible to keep the URL around for > any kind of parsing, e.g., with document_fromstring(html, url=X). You can pass a "base_url" keyword arg to HTML(). If you want to read the original URL, wrap a document in an ElementTree and read its "docinfo.URL" property. Stefan From stefan_ml at behnel.de Mon Jul 16 11:06:31 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Jul 2007 11:06:31 +0200 Subject: [lxml-dev] DOM2 range() support? In-Reply-To: <469B30B6.2060802@comcast.net> References: <4698AF5E.50807@behnel.de> <4698F51A.9@comcast.net> <469A71DB.1090200@behnel.de> <469B30B6.2060802@comcast.net> Message-ID: <469B3517.4010702@behnel.de> Gloria W wrote: > Stefan Behnel wrote: >> This is actually the first time I come across the concept of DOM2 ranges. >> isn't that simply a tuple of two positions, where each position contains an >> Element and optionally one of the following: >> >> - an attribute "{ns}name" and a string position in the attribute value >> - a string position in the text >> - a string position in the tail > > Yes, I think so. > The time consuming part of the implementation is that a range can access > any position in any component of the DOM structure, without restriction. > So this means accessing characters anywhere in the tag elements, values, > etc. > For example, the DOM range can begin at the 'r' and end at the 'x' in > '' Hmm, but honestly, that use pattern isn't very likely, right? I mean, I would consider putting the index /on/ an element (start and/or end) or somewhere into textual content to be the only useful patterns anyway. Wouldn't such an implementation be much simpler and still largely sufficient? You can always tell users to stick to that... > The DOM structure has to be able to be treated like a node structure and > a string index simultaneously. 
It has to be able to interpret changes to > the string and translate those into node changes, which is the very > tedious, time-consuming part of implementing this feature. Wait, does this mean adding 20 to the range index could let you end up on a different node? Say, switch from an element to an attribute, or maybe from an element to its closing parent? Who designs these things??? And: is a string index defined based on the original encoding or is it the unicode content of the infoset? (I *hope* the latter) >> Admittedly, the DOM2 interface on top of that is a little more complex and >> (should I say it?) DOM-ishly obfuscated, but there shouldn't be more to it >> than that, right? I mean, the respective W3C spec is only some 13 sections >> long, there *can't* be more than that. :) >> > 8) I would be so happy with just a fully functional range() > implementation. I'm sure there's more to it, but developers can > hopefully contribute as they need these features on the server side. As I said, go for the common cases first. >> Hmm, now that we have cool HTML support and CSS selection, I wouldn't mind >> having a Range class hanging around in some (lxml.range?) module. Doesn't even >> sound like you'd have to implement it in Pyrex, Python code should be enough here. >> > I think so as well. regex should be enough. Well, what you describe above sounds more like I'd try to traverse the tree to get along, not the completely serialised representation, and then serialise each single element, attribute, etc. in order to see where we end up. > I would have liked to do it, but I don't have the spare time right now. Well, that's usually a show-stopper in the OS world. It's either time or money (but at least it's not *always* money). 
Stefan From stefan_ml at behnel.de Mon Jul 16 15:16:03 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Jul 2007 15:16:03 +0200 Subject: [lxml-dev] new ElementSoup module in lxml.html Message-ID: <469B6F93.7020208@behnel.de> Hi, I rewrote Fredrik's ElementSoup.py module for lxml.html so that you can now have lxml read in tag soup with BeautifulSoup and convert it into an lxml.html tree of Elements. While libxml2 can also parse broken HTML, it is not made to parse sick soup of tags, so if you need to work with web pages that sort of look like they might have been HTML once, the lxml.html.ElementSoup module can help you get there. http://codespeak.net/svn/lxml/branch/html/doc/elementsoup.txt http://codespeak.net/svn/lxml/branch/html/src/lxml/html/ElementSoup.py Have fun, Stefan From jholg at gmx.de Mon Jul 16 15:46:25 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 16 Jul 2007 15:46:25 +0200 Subject: [lxml-dev] invalid tag names get serialized Message-ID: <20070716134625.41330@gmx.net> Hi, I noticed that lxml (both objectify and etree) happily accepts broken tag names (numbers, containing whitespace, ...) throughout the API and also serializes such document; only when trying to re-parse it this fails: >>> root = etree.Element("root") >>> etree.SubElement(root, " __foo bar ") '' >>> print etree.tostring(root) < __foo bar /> >>> print etree.fromstring(etree.tostring(root)) Traceback (most recent call last): File "", line 1, in ? File "etree.pyx", line 1970, in etree.fromstring File "parser.pxi", line 980, in etree._parseMemoryDocument File "parser.pxi", line 876, in etree._parseDoc File "parser.pxi", line 533, in etree._BaseParser._parseDoc File "parser.pxi", line 660, in etree._handleParseResult File "parser.pxi", line 608, in etree._raiseParseError etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 8 >>> I gather this is basically libxml2 behaviour. 
It is not nice, though, since you can produce serialized data without knowing your evil doings, and only detect it when you try to parse it back in (in vain). Would it be a problem to have the tag name checked before it is set for an element? Holger -- Psssst! Schon vom neuen GMX MultiMessenger geh?rt? Der kanns mit allen: http://www.gmx.net/de/go/multimessenger From sidnei at enfoldsystems.com Mon Jul 16 20:41:05 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Mon, 16 Jul 2007 15:41:05 -0300 Subject: [lxml-dev] Test Failures in lxml 1.3.2 In-Reply-To: <469A8CAA.8000507@behnel.de> References: <46969B59.3050105@behnel.de> <469732E5.6040000@behnel.de> <46973915.8080506@behnel.de> <20070713083808.17490@gmx.net> <469777A2.5070804@behnel.de> <469780C1.6030207@behnel.de> <469A8CAA.8000507@behnel.de> Message-ID: On 7/15/07, Stefan Behnel wrote: > Could you test the attached patch? Tested, didn't make a difference apparently. ====================================================================== FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBas e) ---------------------------------------------------------------------- Traceback (most recent call last): File "c:\Python24\lib\unittest.py", line 260, in run testMethod() File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33 , in test_module_HTML_unicode unicode(self.uhtml_str.encode('UTF8'), 'UTF8')) File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual raise self.failureException, \ AssertionError: u'test \xc3\x83\xc2\xa1\xef\xa3\x92

page \xc3\x83\xc2\xa1\xef\xa3\x92 title

' != u' test \xc3\xa1\uf8d2

page \xc3\xa1\uf8d2 title

' ---------------------------------------------------------------------- -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From ianb at colorstudy.com Mon Jul 16 21:23:21 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 16 Jul 2007 14:23:21 -0500 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469B2A25.9000204@behnel.de> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> Message-ID: <469BC5A9.3050600@colorstudy.com> Stefan Behnel wrote: >> So with my last commit you can do things like: >> >> from lxml.html import parse, open_in_browser >> url = 'http://tripsweb.rtachicago.com/' >> page = parse(url) >> page.make_links_absolute(url) >> form = page.forms[0] >> form.inputs['Orig'].value = '1535 W Leland' >> form.inputs['Dest'].value = '847 W Bertrand' >> res = form.submit() >> res_page = parse(res) >> res_page.make_links_absolute(res_page.geturl()) >> open_in_browser(res_page) > > Sounds like you should put something like that into the docs. (hint, hint) Yeah, I put in docstrings for everything, but it needs more docs to show how it fits together. > >> It's kind of like Mechanize, only of course better. There's some things >> I still haven't figured out. Some data structures are convenient, but >> maybe have some non-obvious aspects. Like form.inputs, which doesn't >> always return elements (for things like checkboxes it can return >> something that is more like a logical element). > > Have I ever encouraged you to look at objectify? It has special data Elements > that behave like normal Python data classes, but are actually objects. > Something similar could apply here, you could use a string-like Element for > "input" and a boolean-like Element for "checkbox". Hmmm, and radio buttons > could be lists? > > Although a boolean-like Element always has the disadvantage that bool() would > behave different for it than for an in-tree element (i.e.: does it have children?) 
Ugh... I don't like that idea at all. Elements aren't strings, or bools, or whatever. Well, elements do have truthiness, but that itself drives me nuts -- I refuse to think of an element with no children as "false". I usually test len(el) == 0 if I want to test for children, just out of a stubborn refusal to consider something like an input element false. Using .value is a bit crude, though easy enough to figure out -- but I'd rather use wrappers to give a more convenient access than override the elements more than they are already overridden. I'm still thinking about how microformat parsing should really work, but I suspect that will also be a wrapper around elements and not something in the elements itself. My intuition is that microformats don't exactly map to elements or to classes. Anyway, a somewhat similar issue. >> Also, I'd like to merge >> in most of the functionality of lxml.html.formfill (except for >> error-filling), so where form.form_values() currently returns a list of >> all the values as they'd be if submitted, I'd like to make it settable. >> And maybe even have form.form_values return something that would >> modify inputs in-place, like form.form_values['Orig'] = '1535 W Leland' >> mean the equivalent of form.inputs['Orig'].value = '1535 W Leland'. > > Hmmm, I already stumbled over the name "form_values" when it actually behaves > more like "form_items". This looks like it should be a dictionary-like class, > but it's actually more like a hash bag, as parameters can repeat. Those don't > seem to have an intuitive mapping to Python idioms, at least not when the most > common use case with unique keys is supposed to be convienient. > > Although, you could actually return a subclass of "list" in form_values that > also supports __getitem__ and __setitem__ with string keys. Then, at least, it > would be consistent for reading *and* writing. 
That sounds nicely polymorphic > and is sufficiently close to a dict to be helpful in the most common case, but > stays mainly a list for the general case. You could then call it "inputitems" > to let it match with "inputs" and dicts. In Paste I use an multi-key dict implementation to hold form keys, so that you get something dict-like that doesn't lose information like ordering. It's basically a view over a list of tuples. The implementation is here: http://svn.pythonpaste.org/Paste/trunk/paste/util/multidict.py Unfortunately there's no clear convention for how these kinds of dict-like objects work. I chose to make them as much like normal dicts as possible (so, for instance, if you do d[key] = value, then it's always true that d[key] == value), since most of the time keys are single-valued. But for an actual form I'd like to present the entire form if possible. Like I am now, I guess, with set-like objects and whatnot. And really what form_values gives is intended for urllib.urlencode, and maybe can just be left that way. The order doesn't matter as much to the Python side, as it's just intrinsic in the way the page is laid out. That is, you can't (usually) "make item 4 be (name, value)", because item 4 already has a name, and the value might be constrained anyway. You could say, possibly, "make the second text input with name X have value Y", but that's relatively uncommon in forms and still more constrained than a general dictionary interface. I.e., you can't invent new names, you can't change the order of the fields, and constrained fields like checkboxes stay constrained. So maybe keep form_values, and use something else entirely that is more dict-like for this more dynamic get/set structure. Something a bit like form.inputs, but maybe fully embrace the wrapperness of it. That thing would be more strictly dict-like, and every key would map to some structure that represents the entirety of what represents that key in the form. 
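[Editorial sketch] The "view over a list of tuples" idea mentioned above can be illustrated with a few lines of Python. This is not Paste's actual `multidict` code; the class name and the method names `add` and `getall` are assumptions chosen for the example. It follows the rule stated above: after `d[key] = value`, `d[key] == value` holds, while repeated keys and their order are kept until a key is overwritten.

```python
from urllib.parse import urlencode


class MultiDict:
    """Dict-like view over an ordered list of (key, value) pairs."""

    def __init__(self, items=()):
        self._items = list(items)

    def __getitem__(self, key):
        # Return the first value stored under the key.
        for k, v in self._items:
            if k == key:
                return v
        raise KeyError(key)

    def __setitem__(self, key, value):
        # Replace every existing entry for the key with a single new one,
        # so that d[key] == value is true afterwards.
        self._items = [(k, v) for k, v in self._items if k != key]
        self._items.append((key, value))

    def add(self, key, value):
        # Append without touching existing entries (allows repeated keys).
        self._items.append((key, value))

    def getall(self, key):
        return [v for k, v in self._items if k == key]

    def items(self):
        return list(self._items)


if __name__ == "__main__":
    d = MultiDict([("a", "1"), ("b", "2"), ("a", "3")])
    print(d.getall("a"))
    d["a"] = "9"
    print(urlencode(d.items()))
```

Because `items()` returns plain pairs, the result can be handed directly to a URL-encoding function.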
So a single text input would map to a string. A single checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps to None/the-value-of-the-checkbox, but I could allow a true/false setter as well). Multi-select to a set, etc. Radio buttons would map to a single value, but I'd also want to give some access to the possible set of values (since unlike a text box there is a constrained set of possible values). Right now you get that with form.inputs['radio_name'].value_options, but that won't work with a flatter dictionary. Maybe there'd generally be a form_values.options('field_name'), which would be None for unconstrained, and a set for constrained fields. >> Another option question is actual form submission. Right now it uses >> urllib. But I like httplib2, for instance, and I'd like it to be >> possible to use that. > > What about a module global setting? You would most likely not want to use both. > > Alternatively, you could provide a simple interface that takes a URL and a > list of name-value pairs and opens it. Then implement it for both libraries > and provide an optional keyword argument in submit() that takes a callable > function with that signature (or maybe an instance of a dedicated abstract > superclass, if you want to make the interface visible). That's what I was thinking of. I don't like module global settings at all. Passing it in to submit seems fine. I was thinking about using a class variable too, if you wanted to subclass the elements, or just set it manually on a particular instance. Maybe it would be attached to the tree object? E.g.: foo = parse(blah) foo.getroottree().urlfetch = my_url_fetch I was also thinking about whether I should return a new parsed page, or just a file-like, or what. Or a file-like object that has a method to get the page, perhaps; e.g., new_page = form.submit().document(). 
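[Editorial sketch] The "pass a callable to submit" idea could look roughly like this. The function names `default_fetch` and `submit` and the `(url, values)` signature are assumptions made for the illustration; they are not taken from the lxml.html code under discussion. The point is only that the fetcher has a minimal interface and the caller can swap in another HTTP library.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def default_fetch(url, values):
    """Default fetcher: POST the name/value pairs and return a file-like."""
    data = urlencode(values).encode("ascii")
    return urlopen(url, data)


def submit(url, values, fetch=default_fetch):
    """Submit form values through a pluggable fetcher.

    `fetch` is any callable taking (url, list_of_pairs) and returning
    whatever the caller wants to work with, typically a file-like object.
    """
    return fetch(url, list(values))


if __name__ == "__main__":
    # A stand-in fetcher, e.g. for tests or for wrapping another library.
    def recording_fetch(url, values):
        print("would request", url, "with", values)
        return None

    submit("http://example.invalid/form", [("Orig", "1535 W Leland")],
           fetch=recording_fetch)
```

Keeping the fetcher as an argument avoids a module-level setting and still leaves room for a default.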
I don't think the url fetching function would need to do any of this, it would just have a very minimal interface and the submit method would wrap it up in whatever seems most convenient. >> Also, I'm wondering about how to keep track of the URL when a page is >> parsed. Stefan mentioned if you use parse(url) it would keep track of >> that... where? I'd like it to be possible to keep the URL around for >> any kind of parsing, e.g., with document_fromstring(html, url=X). > > You can pass a "base_url" keyword arg to HTML(). If you want to read the > original URL, wrap a document in an ElementTree and read its "docinfo.URL" > property. OK, I guess that keyword argument should be available in all the parsing functions. Maybe I should add a property to elements too, that fetches that information from the tree. And possibly something in parse that uses fp.geturl() if it is available. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From micxer at micxer.de Mon Jul 16 22:03:16 2007 From: micxer at micxer.de (micxer) Date: Mon, 16 Jul 2007 22:03:16 +0200 Subject: [lxml-dev] Ignoring unknown namespaces in XML while validating In-Reply-To: <4694CD59.1060904@behnel.de> References: <469480A2.3020608@jpl.nasa.gov> <4694C972.9060004@micxer.de> <4694CD59.1060904@behnel.de> Message-ID: <469BCF04.9040206@micxer.de> Hi, Stefan Behnel wrote: > Hi, > > first of all: please don't respond to posts from a different thread when you > want to start a new one. Mail-Readers will sort the e-mail into the wrong > thread and confuse people. > Sorry about that. I thought I removed everything from the old post but I forgot about the headers. And sorry for the late reply. I just found your message in the Junk folder. > > micxer wrote: >> I'm using lxml primarily for validation of XML documents and requests of >> UPnP devices. 
Since many vendors are going to make their devices DLNA >> compliant, some additional XML elements appear in the XML docs. I would >> have to pay for the DLNA specs so I have no other choice than deleting >> these elements in advance and validate the XML afterwards. Is there an >> easy way to do this with lxml? Am I missing something? > > Not sure what your problem is exactly. Are these "additional elements" in a > specific namespace? That would make it easy to remove them: > > for el in root.getiterator("{http://the/namespace}*"): > parent = el.getparent() > if parent is not None: # not the root element > parent.remove(el) > > Or are they in other namespaces than the main one? > > MAIN_NS = "{http://the/namespace}" > for el in root.getiterator("*"): > if not el.tag.startswith(MAIN_NS): > parent = el.getparent() > if parent is not None: # not the root element > parent.remove(el) > > Similarly, if you have a set of tag names that must be kept or removed, you > can iterate over all elements and check the tag names against the set. > That's exactly the problem I have. I already thought about this manual approach, but I also assumed there must be an easier way like telling the parser to ignore any unknown tag or any tag that's not listed in the schema. > > Does that solve your problem? > Absolutely, Thanks :-) > > Stefan Michael From stefan_ml at behnel.de Mon Jul 16 23:05:07 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 16 Jul 2007 23:05:07 +0200 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469BC5A9.3050600@colorstudy.com> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> Message-ID: <469BDD83.2080809@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: > And really what form_values gives is intended for urllib.urlencode, and > maybe can just be left that way. The order doesn't matter as much to > the Python side, as it's just intrinsic in the way the page is laid out. 
> That is, you can't (usually) "make item 4 be (name, value)", because > item 4 already has a name, and the value might be constrained anyway. > You could say, possibly, "make the second text input with name X have > value Y", but that's relatively uncommon in forms and still more > constrained than a general dictionary interface. I.e., you can't invent > new names, you can't change the order of the fields, and constrained > fields like checkboxes stay constrained. So maybe keep form_values, and > use something else entirely that is more dict-like for this more dynamic > get/set structure. Something a bit like form.inputs, but maybe fully > embrace the wrapperness of it. Makes sense to me. > That thing would be more strictly dict-like, and every key would map to > some structure that represents the entirety of what represents that key > in the form. So a single text input would map to a string. Sure. > A single > checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps > to None/the-value-of-the-checkbox, but I could allow a true/false setter > as well). Hmm, except for an empty string value, Python's idea of a truth value would match that. And as you said, changing the form structure is not really intended, so you'd normally not change the value string but rather the "checked" property. So, assigning a truth value would simply change that, whereas a string value could still change the value property. The return value would then be the string value or None. For the special case of an empty string, you could return a string subclass that evaluates to the bool value True. Not sure if I like this, though, sounds like too much magic - and you never know where values end up in in application code... Maybe it's a rare enough corner case to accept this, though. Or isn't there a Unicode character like "zero width space" or something like that, that we could return instead? > Multi-select to a set, etc. 
Radio buttons would map to a > single value, but I'd also want to give some access to the possible set > of values (since unlike a text box there is a constrained set of > possible values). Ok, so, how would you set them? >>> form.inputs["my_radio_name"] = "new_value" Like this? This would then deselect all other radio buttons with the name "my_radio_name" and only select the one with the "new_value" value. If we adopt this, reading the property should definitely return the selected value as a single string: >>> form.inputs["my_radio_name"] 'new_value' Maybe we could return a subclass with an "element" property that returns the Element that carries that value? >>> form.inputs["my_radio_name"].element > Right now you get that with > form.inputs['radio_name'].value_options, but that won't work with a > flatter dictionary. Why not? I actually like that. > Maybe there'd generally be a > form_values.options('field_name'), which would be None for > unconstrained, and a set for constrained fields. Sounds too generic for a simple case. You shouldn't forget that you can't really fill a form without knowing what is a radio button and what is a checkbox, so there is not much to gain by providing a generic API. hasattr(el, "value_options") is also easy to write and reads better than el.value_options is None >>> Another option question is actual form submission. Right now it uses >>> urllib. But I like httplib2, for instance, and I'd like it to be >>> possible to use that. >> >> Alternatively, you could provide a simple interface that takes a URL >> and a list of name-value pairs and opens it. > > That's what I was thinking of. I don't like module global settings at > all. Passing it in to submit seems fine. I was thinking about using a > class variable too, if you wanted to subclass the elements, or just set > it manually on a particular instance. Maybe it would be attached to the > tree object? 
E.g.: > > foo = parse(blah) > foo.getroottree().urlfetch = my_url_fetch That wouldn't work, as ElementTrees (and Elements) are not kept alive by the tree, so you can't store state in them. > I was also thinking about whether I should return a new parsed page, or > just a file-like, or what. Or a file-like object that has a method to > get the page, perhaps; e.g., new_page = form.submit().document(). I > don't think the url fetching function would need to do any of this, it > would just have a very minimal interface and the submit method would > wrap it up in whatever seems most convenient. You can't return a parsed tree as the server reply can be anything from XML to weird binary. I think a file-like serves most purposes. Maybe an additional "parse()" method would work here, but I don't think it's necessary. >>> reply_tree = parse(form.submit()) works just fine, is intuitive and avoids overhead. > OK, I guess that keyword argument should be available in all the parsing > functions. "string" parsing functions. Sure. > Maybe I should add a property to elements too, that fetches > that information from the tree. And possibly something in parse that > uses fp.geturl() if it is available. etree already does that internally: cdef _getFilenameForFile(source): """Given a Python File or Gzip object, give filename back. Returns None if not a file object. 
""" # file instances have a name attribute if hasattr(source, 'name'): return source.name # gzip file instances have a filename attribute if hasattr(source, 'filename'): return source.filename # urllib2 if hasattr(source, 'geturl'): return source.geturl() return None Stefan From ianb at colorstudy.com Mon Jul 16 23:33:05 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 16 Jul 2007 16:33:05 -0500 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469BDD83.2080809@behnel.de> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> Message-ID: <469BE411.9070502@colorstudy.com> Stefan Behnel wrote: >> A single >> checkbox to a boolean (kind of... it's a little fuzzy; it kind of maps >> to None/the-value-of-the-checkbox, but I could allow a true/false setter >> as well). > > Hmm, except for an empty string value, Python's idea of a truth value would > match that. And as you said, changing the form structure is not really > intended, so you'd normally not change the value string but rather the > "checked" property. So, assigning a truth value would simply change that, > whereas a string value could still change the value property. The return value > would then be the string value or None. > > For the special case of an empty string, you could return a string subclass > that evaluates to the bool value True. Not sure if I like this, though, sounds > like too much magic - and you never know where values end up in in application > code... Maybe it's a rare enough corner case to accept this, though. Or isn't > there a Unicode character like "zero width space" or something like that, that > we could return instead? The empty string is definitely a corner case, as many server-side languages would treat that as false already. Maybe it could just be returned as True in that case. This could break code that expects a string, but it's such a strange case anyway that I don't mind too much. 
Or I could return a string subclass of str that is true, which is also very weird, but again it's very much a corner case so maybe it's not that big a deal. If you don't give a value to a checkbox it defaults to "on" anyway, so only an explicit value="" causes this. >> Multi-select to a set, etc. Radio buttons would map to a >> single value, but I'd also want to give some access to the possible set >> of values (since unlike a text box there is a constrained set of >> possible values). > > Ok, so, how would you set them? > > >>> form.inputs["my_radio_name"] = "new_value" > > Like this? This would then deselect all other radio buttons with the name > "my_radio_name" and only select the one with the "new_value" value. If we > adopt this, reading the property should definitely return the selected value > as a single string: > > >>> form.inputs["my_radio_name"] > 'new_value' Yes, right now it works like: form.inputs['my_radio_name'].value = 'new_value' Where form.inputs['my_radio_name'] is a subclass of list, which contains all the radio input elements and also allows this group setting. If it's a group of checkboxes, it's: form.inputs['my_checkbox_name'].value.add('value1') Which checks the checkbox with the value 'value1'. You can also assign to value, which clears the set and assigns values from the iterator you give. So basically I could take what I have now, and just always get/set .value to create a flatish dictionary. And if you assign directly to the dictionary, it would clear the current values and then update with the values you give, just like the set works. Whether this should replace or augment .inputs, I'm not sure. I think augment, since .inputs gives you access to all the elements, which sometimes you will want. > Maybe we could return a subclass with an "element" property that returns the > Element that carries that value? > > >>> form.inputs["my_radio_name"].element > Then we have something stringish, but isn't quite a string. 
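[Editorial sketch] The "string subclass that is true" idea for the empty-value checkbox corner case could be written as below. The class name `TrueString` is hypothetical, and the sketch targets Python 3, where truthiness is controlled by `__bool__` (Python 2, current when this thread was written, used `__nonzero__` instead).

```python
class TrueString(str):
    """A str that always evaluates as true, even when empty."""

    def __bool__(self):
        return True


if __name__ == "__main__":
    checked_value = TrueString("")
    # Still compares and behaves like an ordinary (empty) string ...
    print(checked_value == "", len(checked_value))
    # ... but passes a truth test, unlike a plain "".
    print(bool(checked_value), bool(""))
```

As noted in the discussion, the object stops being truthy as soon as code derives a new plain `str` from it (for example by slicing or concatenation), which is part of why the approach feels like magic.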
And when you an assignment, you get back something that's different than what you assigned. It all feels too magic to me. I think we can just have two accessors, one that gives you elements (like the current form.inputs) and one that gives you values only. >> Right now you get that with >> form.inputs['radio_name'].value_options, but that won't work with a >> flatter dictionary. > > Why not? I actually like that. You'd also have to augment the string-like object, since form.inputs['radio_name'] would be the value of the currently checked radio button. >> Maybe there'd generally be a >> form_values.options('field_name'), which would be None for >> unconstrained, and a set for constrained fields. > > Sounds too generic for a simple case. You shouldn't forget that you can't > really fill a form without knowing what is a radio button and what is a > checkbox, so there is not much to gain by providing a generic API. > > hasattr(el, "value_options") > > is also easy to write and reads better than > > el.value_options is None Yes, most of the time you'll be filling out forms that you expect to have very particular fields. But it's useful generally. With a flat dictionary it's hard to get access to per-field information, so there has to be some other means of access. Anyway, currently value_options is only set on those elements and objects where it makes sense. >>>> Another option question is actual form submission. Right now it uses >>>> urllib. But I like httplib2, for instance, and I'd like it to be >>>> possible to use that. >>> Alternatively, you could provide a simple interface that takes a URL >>> and a list of name-value pairs and opens it. >> That's what I was thinking of. I don't like module global settings at >> all. Passing it in to submit seems fine. I was thinking about using a >> class variable too, if you wanted to subclass the elements, or just set >> it manually on a particular instance. Maybe it would be attached to the >> tree object? 
E.g.: >> >> foo = parse(blah) >> foo.getroottree().urlfetch = my_url_fetch > > That wouldn't work, as ElementTrees (and Elements) are not kept alive by the > tree, so you can't store state in them. Hrm... that's too bad. I'd like to keep some kind of local information around, ideally inherited as you go from page to page. I really hate global settings. >> I was also thinking about whether I should return a new parsed page, or >> just a file-like, or what. Or a file-like object that has a method to >> get the page, perhaps; e.g., new_page = form.submit().document(). I >> don't think the url fetching function would need to do any of this, it >> would just have a very minimal interface and the submit method would >> wrap it up in whatever seems most convenient. > > You can't return a parsed tree as the server reply can be anything from XML to > weird binary. I think a file-like serves most purposes. Maybe an additional > "parse()" method would work here, but I don't think it's necessary. > > >>> reply_tree = parse(form.submit()) > > works just fine, is intuitive and avoids overhead. Yeah, you are probably right. The etree parse method works just fine right now, especially if it already picks up the url. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From ianb at colorstudy.com Tue Jul 17 01:35:56 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 16 Jul 2007 18:35:56 -0500 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469BDD83.2080809@behnel.de> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> Message-ID: <469C00DC.6010304@colorstudy.com> Stefan Behnel wrote: >> OK, I guess that keyword argument should be available in all the parsing >> functions. > > "string" parsing functions. Sure. I'm not sure how to do this. 
lxml.etree.HTML doesn't take a base_url argument, and root.getroottree().docinfo.URL is read-only. Also, why are all the signatures "..." in help? E.g., help(lxml.etree.HTML) gives "HTML(...)". Is this a Pyrex thing? Perhaps fixable? If not, it would be nice to give signature help in the docstrings. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Tue Jul 17 08:21:50 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 17 Jul 2007 08:21:50 +0200 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469C00DC.6010304@colorstudy.com> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> <469C00DC.6010304@colorstudy.com> Message-ID: <469C5FFE.7080403@behnel.de> Hi, Ian Bicking wrote: > Stefan Behnel wrote: >>> OK, I guess that keyword argument should be available in all the parsing >>> functions. >> >> "string" parsing functions. Sure. > > I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url > argument It works for me in both trunk and html branch: def HTML(text, _BaseParser parser=None, base_url=None): > Also, why are all the signatures "..." in help? E.g., > help(lxml.etree.HTML) gives "HTML(...)". Is this a Pyrex thing? Perhaps > fixable? If not, it would be nice to give signature help in the > docstrings. Yes, that's a Pyrex (or rather C) thing. The signature is not visible in C modules. I tried to make Pyrex add signatures to the docstrings automatically, but that turned out to be harder than I thought and I didn't have the time to get it right since then. I attached a patch that gets you part of the way. I started experimenting with epydoc and it can actually read a signature line that you prepend to docstrings, so doing that would give us nicely formatted HTML docs. 
http://codespeak.net/lxml/dev/api/ Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: sigs.patch Type: text/x-diff Size: 2188 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070717/81639919/attachment.bin From ianb at colorstudy.com Tue Jul 17 19:44:00 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 17 Jul 2007 12:44:00 -0500 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469C5FFE.7080403@behnel.de> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> <469C00DC.6010304@colorstudy.com> <469C5FFE.7080403@behnel.de> Message-ID: <469CFFE0.1040504@colorstudy.com> Stefan Behnel wrote: > Hi, > > Ian Bicking wrote: >> Stefan Behnel wrote: >>>> OK, I guess that keyword argument should be available in all the parsing >>>> functions. >>> "string" parsing functions. Sure. >> I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url >> argument > > It works for me in both trunk and html branch: > > def HTML(text, _BaseParser parser=None, base_url=None): Really? Here's what I get on the html branch: >>> from lxml.etree import HTML >>> h = HTML('', None, 'http://foo.com') Traceback (most recent call last): File "", line 1, in ? TypeError: function takes at most 2 arguments (3 given) >>> h = HTML('', base_url='http://foo.com') Traceback (most recent call last): File "", line 1, in ? 
TypeError: 'base_url' is an invalid keyword argument for this function -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Tue Jul 17 20:00:24 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 17 Jul 2007 20:00:24 +0200 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469CFFE0.1040504@colorstudy.com> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> <469C00DC.6010304@colorstudy.com> <469C5FFE.7080403@behnel.de> <469CFFE0.1040504@colorstudy.com> Message-ID: <469D03B8.5070905@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> Stefan Behnel wrote: >>>>> OK, I guess that keyword argument should be available in all the >>>>> parsing >>>>> functions. >>>> "string" parsing functions. Sure. >>> I'm not sure how to do this. lxml.etree.HTML doesn't take a base_url >>> argument >> >> It works for me in both trunk and html branch: >> >> def HTML(text, _BaseParser parser=None, base_url=None): > > Really? Here's what I get on the html branch: > >>>> from lxml.etree import HTML >>>> h = HTML('', None, 'http://foo.com') > Traceback (most recent call last): > File "", line 1, in ? > TypeError: function takes at most 2 arguments (3 given) >>>> h = HTML('', base_url='http://foo.com') > Traceback (most recent call last): > File "", line 1, in ? > TypeError: 'base_url' is an invalid keyword argument for this function Believe me: >>> import lxml.etree as et >>> et.HTML("", None, "oh") >>> et.HTML("", base_url="oh") $ LANG=en_GB svn info src/lxml/etree.pyx Path: src/lxml/etree.pyx Name: etree.pyx URL: https://scoder at codespeak.net/svn/lxml/branch/html/src/lxml/etree.pyx [...] 
Revision: 45164 Node Kind: file Schedule: normal Last Changed Author: scoder Last Changed Rev: 44837 Last Changed Date: 2007-07-08 09:54:56 +0200 (Sun, 08 Jul 2007) Text Last Updated: 2007-07-16 15:01:37 +0200 (Mon, 16 Jul 2007) Checksum: 9b183d6891d5e5f3606cd13350b582ad Have you rebuilt lxml.etree lately? Or do you have an installed egg version that takes precedence? Stefan From ianb at colorstudy.com Tue Jul 17 20:47:23 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 17 Jul 2007 13:47:23 -0500 Subject: [lxml-dev] lxml.html and forms In-Reply-To: <469D03B8.5070905@behnel.de> References: <469B1135.9030807@colorstudy.com> <469B2A25.9000204@behnel.de> <469BC5A9.3050600@colorstudy.com> <469BDD83.2080809@behnel.de> <469C00DC.6010304@colorstudy.com> <469C5FFE.7080403@behnel.de> <469CFFE0.1040504@colorstudy.com> <469D03B8.5070905@behnel.de> Message-ID: <469D0EBB.1050209@colorstudy.com> Stefan Behnel wrote: > Have you rebuilt lxml.etree lately? Or do you have an installed egg version > that takes precedence? Ah, you are right, I had not rebuilt lately. python setup.py develop got it back in order. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From ianb at colorstudy.com Tue Jul 17 21:30:59 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Tue, 17 Jul 2007 14:30:59 -0500 Subject: [lxml-dev] lxml & parsing: return of a classes Message-ID: <469D18F3.3080806@colorstudy.com> So I was thinking a little about how we could allow easy customization of the URL getter, since we can't attach it to the tree or any element. And then generally how any customization could be done, for instance if you want a new method on all elements. This isn't that easy currently. You'd have to subclass a bunch of classes and rewrite a bunch of functions. But I think if we move all parsing to a single class it would help a great deal. 
The idea is something like:

    class Parser(object):
        _etree_parser_class = etree.HTMLParser
        def __init__(self):
            self._etree_parser = self._etree_parser_class()
            self._etree_parser.setElementClassLookup(self)
        def __call__(self, filename, **kw):
            return etree.parse(filename, self._etree_parser, **kw)
        def fromstring(...):
            ...

And so forth. Then either expose this via:

    parse = Parser()

Or perhaps:

    _parser = Parser()
    parse = _parser
    fromstring = _parser.fromstring

And so forth. If you want to adjust something, you don't have to
reimplement all the forms of parsers, since they all would just use
self, and are mostly defined in terms of each other. We could support
subclassing with something like this:

    class Parser(object):
        _element_classes = {}
        _element_mixins = {}
        def __init__(self):
            self._element_classes = self._element_classes.copy()
            mixers = {}
            # .items() is needed to get (name, mixin) pairs out of the dict
            for name, value in self._element_mixins.items():
                if name == '*':
                    for n in self._element_classes.keys():
                        mixers.setdefault(n, []).append(value)
                else:
                    mixers.setdefault(name, []).append(value)
            for name, mixins in mixers.items():
                cur = self._element_classes.get(name, HtmlElement)
                bases = mixins + [cur]
                new_class = type(cur.__name__, tuple(bases), {})
                self._element_classes[name] = new_class

    class MyMixin(object):
        pass  # extra methods
    class FormMixin(object):
        pass  # other methods for the form element

    class ParserMixedIn(Parser):
        _element_mixins = {'*': MyMixin, 'form': FormMixin}

And then it would be really easy to create local extensions for all HTML
elements, or particular elements.

I'm not sure exactly how to attach the URL getting method to the Parser
object in this model, because I'm not sure how to give elements a
reference back to it. We could do it with class variables, but then the
parser would *have* to subclass every element every time it was
instantiated, so it could make new classes with a reference back to
itself. But maybe there's a better way.
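
The class-mixing step in the proposal above can be tried on its own, without lxml. The following is only a sketch of that one step: `HtmlElement` here is a plain stand-in class, and `mix_element_classes`, `DescribeMixin` and the `FormMixin` body are hypothetical names invented for the illustration, not part of lxml.html.

```python
class HtmlElement(object):
    """Stand-in for the real element base class (an assumption)."""

    def __init__(self, tag):
        self.tag = tag


def mix_element_classes(element_classes, element_mixins, default=HtmlElement):
    """Return a new tag -> class mapping with the mixins prepended as bases.

    The key '*' applies a mixin to every tag already present in
    element_classes; any other key applies it to that tag only.
    """
    mixers = {}
    for name, mixin in element_mixins.items():
        targets = list(element_classes) if name == '*' else [name]
        for tag in targets:
            mixers.setdefault(tag, []).append(mixin)

    result = dict(element_classes)
    for tag, mixins in mixers.items():
        cur = element_classes.get(tag, default)
        # Build a new class on the fly: mixins first, original class last.
        result[tag] = type(cur.__name__, tuple(mixins) + (cur,), {})
    return result


class DescribeMixin(object):
    def describe(self):
        return "<%s>" % self.tag


class FormMixin(object):
    def field_names(self):
        return []  # placeholder behaviour for the sketch


classes = mix_element_classes(
    {'form': HtmlElement, 'a': HtmlElement},
    {'*': DescribeMixin, 'form': FormMixin},
)
form = classes['form']('form')
link = classes['a']('a')
```

With this setup `form` carries both mixins while `link` only gets the `'*'` one. Note that `'*'` only reaches tags that are already listed in the class mapping, which mirrors the loop in the proposal.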
Do the elements already have a reference back to that etree.HTMLParser() instance, and could we attach this to that instance? Or perhaps extend HTMLParser directly instead of having this other parser class? -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From alexander.kozlovsky at gmail.com Tue Jul 17 22:49:12 2007 From: alexander.kozlovsky at gmail.com (Alexander Kozlovsky) Date: Wed, 18 Jul 2007 00:49:12 +0400 Subject: [lxml-dev] Encoding bug in lxml.etree.HTML Message-ID: <778728455.20070718004912@gmail.com> Hello! I've probably discovered a bug in lxml.etree.HTML: >>> from lxml import etree >>> a = u'

\u044b

' >>> b = etree.HTML(a) >>> b[0][0].text u'\xd1\x8b' Expected: u'\u044b' It seems that etree.HTML function works with non-ascii symbols incorrectly. I can reproduce it on Windows. This bug is relatively new: it happens with lxml with statically linked libxml2 version 2.6.28 and libxslt2 version 1.1.19 (current version of lxml-1.2.1 from Cheese Shop and the newer releases). Older versions of lxml (lxml-1.2.1 with libxml2 version 2.6.26 and libxslt2 version 1.1.17, which are no longer available from Cheese Shop, or older releases such as lxml-1.2) work fine. -- Best regards, Alexander mailto:alexander.kozlovsky at gmail.com From stefan_ml at behnel.de Wed Jul 18 08:42:45 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 18 Jul 2007 08:42:45 +0200 Subject: [lxml-dev] Encoding bug in lxml.etree.HTML In-Reply-To: <778728455.20070718004912@gmail.com> References: <778728455.20070718004912@gmail.com> Message-ID: <469DB665.2050205@behnel.de> Hi, Alexander Kozlovsky wrote: > I've probably discovered a bug in lxml.etree.HTML: > > >>> from lxml import etree > >>> a = u'

\u044b

' > >>> b = etree.HTML(a) > >>> b[0][0].text > u'\xd1\x8b' > > Expected: u'\u044b' > > It seems that etree.HTML function works with non-ascii symbols incorrectly. > I can reproduce it on Windows. Thanks for the extensive report. This is actually a bug that has been fixed two days ago, so there isn't a release yet containing the fix. It will go away in lxml 1.3.3. Stefan From stefan_ml at behnel.de Wed Jul 18 09:36:32 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 18 Jul 2007 09:36:32 +0200 Subject: [lxml-dev] invalid tag names get serialized In-Reply-To: <20070716134625.41330@gmx.net> References: <20070716134625.41330@gmx.net> Message-ID: <469DC300.4050909@behnel.de> jholg at gmx.de wrote: > I noticed that lxml (both objectify and etree) happily accepts broken tag names (numbers, containing whitespace, ...) throughout the API and also serializes such document; only when trying to re-parse it this fails: > > >>> root = etree.Element("root") > >>> etree.SubElement(root, " __foo bar ") > '' > >>> print etree.tostring(root) > < __foo bar /> > >>> print etree.fromstring(etree.tostring(root)) > Traceback (most recent call last): > File "", line 1, in ? > File "etree.pyx", line 1970, in etree.fromstring > File "parser.pxi", line 980, in etree._parseMemoryDocument > File "parser.pxi", line 876, in etree._parseDoc > File "parser.pxi", line 533, in etree._BaseParser._parseDoc > File "parser.pxi", line 660, in etree._handleParseResult > File "parser.pxi", line 608, in etree._raiseParseError > etree.XMLSyntaxError: StartTag: invalid element name, line 1, column 8 > > I gather this is basically libxml2 behaviour. It is not nice, though, since > you can produce serialized data without knowing your evil doings, and only > detect it when you try to parse it back in (in vain). Would it be a problem > to have the tag name checked before it is set for an element? Not entirely "libxml2 behaviour", since it actually provides functions to check names. 
You just have to use them. Although 'just' is slightly too simplistic here. The straight forward patch actually breaks lots of test cases, e.g. getiterator('*'). I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would be enough to check names only in the factory functions, 'el.set()' and 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have to care. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: name-validation.patch Type: text/x-diff Size: 1419 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070718/1d0f962b/attachment.bin From jholg at gmx.de Wed Jul 18 09:48:22 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 18 Jul 2007 09:48:22 +0200 Subject: [lxml-dev] invalid tag names get serialized In-Reply-To: <469DC300.4050909@behnel.de> References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> Message-ID: <20070718074822.116730@gmx.net> Hi Stefan, > > detect it when you try to parse it back in (in vain). Would it be a > problem > > to have the tag name checked before it is set for an element? > > Not entirely "libxml2 behaviour", since it actually provides functions to > check names. You just have to use them. Although 'just' is slightly too > simplistic here. The straight forward patch actually breaks lots of test > cases, e.g. getiterator('*'). > > I'll have to look into this, but this is definitely 2.0 stuff. Maybe it > would > be enough to check names only in the factory functions, 'el.set()' and > 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have > to care. For my purposes, it would be sufficient if a tree did not serialize successfully; what I want to avoid is that I store/pickle documents that then turn out to not have been well-formed XML in the first place. So maybe that's easier to achieve than to check names straight away, although I fear not... 
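
The fallback described here, refusing to store a document that will not parse back, can be approximated in user code. This is a minimal sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree` (it also writes out invalid tag names unchecked), and `serialize_checked` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def serialize_checked(root):
    """Serialize root, raising ValueError if the result cannot be re-parsed."""
    data = ET.tostring(root)
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        raise ValueError("tree does not serialize to well-formed XML: %s" % exc)
    return data


good = ET.Element("root")
ET.SubElement(good, "child")

bad = ET.Element("root")
ET.SubElement(bad, " __foo bar ")  # the invalid tag name from the report
```

Calling `serialize_checked(good)` returns the serialized bytes, while `serialize_checked(bad)` raises `ValueError`. The cost is one extra parse per document, and the error only shows up at storage time rather than where the bad name was introduced.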
Holger

From stefan_ml at behnel.de Wed Jul 18 10:23:23 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 18 Jul 2007 10:23:23 +0200
Subject: [lxml-dev] invalid tag names get serialized
In-Reply-To: <20070718074822.116730@gmx.net>
References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net>
Message-ID: <469DCDFB.4070706@behnel.de>

jholg at gmx.de wrote:
>>> detect it when you try to parse it back in (in vain). Would it be a problem
>>> to have the tag name checked before it is set for an element?
>> Not entirely "libxml2 behaviour", since it actually provides functions to
>> check names. You just have to use them. Although 'just' is slightly too
>> simplistic here. The straight forward patch actually breaks lots of test
>> cases, e.g. getiterator('*').
>>
>> I'll have to look into this, but this is definitely 2.0 stuff. Maybe it would
>> be enough to check names only in the factory functions, 'el.set()' and
>> 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have
>> to care.
>
> For my purposes, it would be sufficient if a tree did not serialize
> successfully

:) that's actually the heaviest thing to implement, as we currently only pass
a tree to libxml2 and let it do the rest.

Also, it's too late and too hard to debug. No, this patch works much better,
but the now failing tests seem to imply that Klingon tag names are not allowed
in well-formed XML documents. I'll have to check if it's the XML spec that's
xenophobe here or only libxml2...

Stefan

-------------- next part --------------
A non-text attachment was scrubbed...
Name: name-validation.patch Type: text/x-diff Size: 1992 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070718/b7f270c8/attachment.bin From stefan_ml at behnel.de Wed Jul 18 16:32:10 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 18 Jul 2007 16:32:10 +0200 Subject: [lxml-dev] invalid tag names get serialized In-Reply-To: <469DCDFB.4070706@behnel.de> References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de> Message-ID: <469E246A.9050203@behnel.de> Stefan Behnel wrote: > jholg at gmx.de wrote: >>>> detect it when you try to parse it back in (in vain). Would it be a >>> problem >>>> to have the tag name checked before it is set for an element? >>> Not entirely "libxml2 behaviour", since it actually provides functions to >>> check names. You just have to use them. Although 'just' is slightly too >>> simplistic here. The straight forward patch actually breaks lots of test >>> cases, e.g. getiterator('*'). >>> >>> I'll have to look into this, but this is definitely 2.0 stuff. Maybe it >>> would >>> be enough to check names only in the factory functions, 'el.set()' and >>> 'el.attrib.__setitem__()'. Lookup and search methods/functions don't have >>> to care. >> For my purposes, it would be sufficient if a tree did not serialize >> successfully > > :) that's actually the heaviest thing to implement, as we currently only pass > a tree to libxml2 and let it do the rest. > > Also, it's too late and too hard to debug. No, this patch works much better, > but the now failing tests seem to imply that Klingon tag names are not allowed > in well-formed XML documents. I'll have to check if it's the XML spec that's > xenophobe here or only libxml2... Actually it's the spec, libxml2 is right here. So the current trunk no longer accepts invalid tag names at the API level when *creating* elements or attributes. 
It still accepts them when searching tags or looking up attributes.

Stefan

From jholg at gmx.de Wed Jul 18 17:04:13 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 18 Jul 2007 17:04:13 +0200
Subject: [lxml-dev] invalid tag names get serialized
In-Reply-To: <469DCDFB.4070706@behnel.de>
References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de>
Message-ID: <20070718150413.83900@gmx.net>

Hi,

I've just seen you've already been looking into this, so my comment below
concerning test cases is just for reference, but:

The name check should go directly into _createElement, otherwise
etree.SubElement will not pick it up.

I'm also pro renaming TagNameIsValid to NCNameIsValid, as it is used on
attributes also.

> Also, it's too late and too hard to debug. No, this patch works much better,
> but the now failing tests seem to imply that Klingon tag names are not allowed
> in well-formed XML documents. I'll have to check if it's the XML spec that's
> xenophobe here or only libxml2...

I do think that the character \u1234 is not allowed for XML NCNames:
BaseChar production snippet:

    [...] #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] [...]

Thanks,
Holger

From stefan_ml at behnel.de Wed Jul 18 20:25:11 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 18 Jul 2007 20:25:11 +0200
Subject: [lxml-dev] invalid tag names get serialized
In-Reply-To: <20070718150413.83900@gmx.net>
References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de> <20070718150413.83900@gmx.net>
Message-ID: <469E5B07.7020700@behnel.de>

jholg at gmx.de wrote:
> The name check should go directly into _createElement,

No, _createElement() is only a tiny wrapper around the element node creation in libxml2.
No Python exceptions allowed there. > otherwise etree.SubElement will not pick it up. Then SubElement will get its own check. I factored out the exception raising so that it's only a one-liner to prevent invalid tags from passing through the API. > I'm also pro renaming TagNameIsValid to NCNameIsValid, as it is used on attributes also. I actually renamed it to "_xmlNameIsValid()". It's not a public function yet, but I might reconsider that. >> Also, it's too late and too hard to debug. No, this patch works much >> better, >> but the now failing tests seem to imply that Klingon tag names are not >> allowed >> in well-formed XML documents. I'll have to check if it's the XML spec >> that's xenophobe here or only libxml2... > > I do think that the character \u1234 is not allowed for XML NCNames: > BaseChar production snippet: > > [...] #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] [...] Right, I noticed that also. I also fixed the test cases now and added a bunch of new ones. Stefan From jholg at gmx.de Thu Jul 19 09:51:20 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 19 Jul 2007 09:51:20 +0200 Subject: [lxml-dev] invalid tag names get serialized In-Reply-To: <469E5B07.7020700@behnel.de> References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de> <20070718150413.83900@gmx.net> <469E5B07.7020700@behnel.de> Message-ID: <20070719075120.3300@gmx.net> Hi, > > The name check should go directly into _createElement, > > No, _createElement() is only a tiny wrapper around the element node > creation > in libxml2. No Python exceptions allowed there. Just out of curiosity: Is this by policy, or would it really cause problems? Because I tried just that and didn't see any problems. But of course I didn't test any tricky stuff or threading or you know what. > > otherwise etree.SubElement will not pick it up. > > Then SubElement will get its own check. 
> I factored out the exception raising so that it's only a one-liner to
> prevent invalid tags from passing through the API.

Great, thanks.

Holger

From stefan_ml at behnel.de Thu Jul 19 10:24:31 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 19 Jul 2007 10:24:31 +0200
Subject: [lxml-dev] how to hack (and where)
In-Reply-To: <20070719075120.3300@gmx.net>
References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de> <20070718150413.83900@gmx.net> <469E5B07.7020700@behnel.de> <20070719075120.3300@gmx.net>
Message-ID: <469F1FBF.2090806@behnel.de>

jholg at gmx.de wrote:
>>> The name check should go directly into _createElement,
>> No, _createElement() is only a tiny wrapper around the element node
>> creation in libxml2. No Python exceptions allowed there.
>
> Just out of curiosity: Is this by policy, or would it really cause
> problems? Because I tried just that and didn't see any problems. But of
> course I didn't test any tricky stuff or threading or you know what.

They are meant to be as simple as they are: just create a plain xmlNode. They
basically give that a more explicit name and make sure we always call the same
thing, so if we ever really need to change something here, we have one place
to do so.

But they are internal functions, not API level functions. Error checking must
be done at the API level, before entering into the internals.

For example, _makeElement is the main function for creating an Element proxy
at the API level. It's also a public C function that can be safely used in
external modules (like objectify). It does all the error checking and figuring
out what you meant and it can happily throw an exception if you provided
rubbish, as it will only be used from API functions.
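
The layering described here (validate at the API level, keep the internal constructors trivial) can be illustrated in plain Python. This is a sketch only, not lxml's actual code: `_create_node`, `make_element`, `sub_element` and `_check_name` are hypothetical names, nodes are plain dicts, and the name check is an ASCII-only approximation of the XML NCName rule, which is stricter than the real production.

```python
import re

# ASCII-only approximation of an XML NCName; the real production also
# admits many non-ASCII letters, so this is deliberately stricter.
_ASCII_NCNAME = re.compile(r'[A-Za-z_][A-Za-z0-9._-]*\Z')


def _check_name(name):
    """The factored-out check: one call per API entry point."""
    if not isinstance(name, str) or _ASCII_NCNAME.match(name) is None:
        raise ValueError("Invalid tag name %r" % (name,))


def _create_node(tag):
    """Internal helper: builds the node and trusts its caller completely."""
    return {'tag': tag, 'children': []}


def make_element(tag):
    """API-level factory: rejects rubbish before touching the internals."""
    _check_name(tag)
    return _create_node(tag)


def sub_element(parent, tag):
    """API-level as well, so it repeats the one-line check."""
    _check_name(tag)
    child = _create_node(tag)
    parent['children'].append(child)
    return child


root = make_element("root")
child = sub_element(root, "child")
```

Because `_create_node` never raises, every public entry point has to call `_check_name` itself, which is why a second factory such as `sub_element` needs its own check.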
Another good example is _getNsTag, which is the main API-level helper for
splitting up something that came from the user into a UTF-8 encoded namespace
*string* (or None) and a tag name *string*. It throws an exception if that
fails and guarantees to return objects of the expected types. That really
helps internally, because all internal code can just rely on that.

There's also public-api.pxi that wraps some of the half-public C functions and
adds some additional error checking in some cases to make them publicly usable.

Maybe a good way to detect an API-level C function is if a) they already throw
an exception, b) they return things like _Element or other API-level objects
or c) they are often used at the beginning of API functions. Admittedly, it's
not always 100% clear from the code (_createElement is a bad example as it's
directly used in SubElement), but those are good rules of thumb.

Does that make the difference clear?

Stefan

From jholg at gmx.de Thu Jul 19 14:16:00 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 19 Jul 2007 14:16:00 +0200
Subject: [lxml-dev] how to hack (and where)
In-Reply-To: <469F1FBF.2090806@behnel.de>
References: <20070716134625.41330@gmx.net> <469DC300.4050909@behnel.de> <20070718074822.116730@gmx.net> <469DCDFB.4070706@behnel.de> <20070718150413.83900@gmx.net> <469E5B07.7020700@behnel.de> <20070719075120.3300@gmx.net> <469F1FBF.2090806@behnel.de>
Message-ID: <20070719121600.130180@gmx.net>

Hi Stefan,

> Maybe a good way to detect an API-level C function is if a) they already throw
> an exception, b) they return things like _Element or other API-level objects
> or c) they are often used at the beginning of API functions. Admittedly, it's
> not always 100% clear from the code (_createElement is a bad example as it's
> directly used in SubElement), but those are good rules of thumb.
>
> Does that make the difference clear?

Yes, that's very helpful, and good to know.

Thanks,
Holger

From itamar at itamarst.org Thu Jul 19 17:46:19 2007
From: itamar at itamarst.org (Itamar Shtull-Trauring)
Date: Thu, 19 Jul 2007 11:46:19 -0400 (EDT)
Subject: [lxml-dev] Segfault in lxml 1.3.2
Message-ID: <10842.63.107.91.99.1184859979.squirrel@webmail.zoteca.com>

Hi,

One of my co-workers is getting an occasional segfault in some code using
lxml; it uses no other non-stdlib libraries, and it's single threaded. The
segfault is in:

    __pyx_tp_dealloc_5etree__Element (src/lxml/etree.c:11282).

Any info I can try to get to help debug this? Would a core dump be
helpful? He's so far been unable to create a minimal reproducing example.

Thanks,
Itamar Shtull-Trauring

From stefan_ml at behnel.de Thu Jul 19 17:57:13 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 19 Jul 2007 17:57:13 +0200
Subject: [lxml-dev] lxml & parsing: return of a classes
In-Reply-To: <469D18F3.3080806@colorstudy.com>
References: <469D18F3.3080806@colorstudy.com>
Message-ID: <469F89D9.1040500@behnel.de>

Hi Ian,

Ian Bicking wrote:
> So I was thinking a little about how we could allow easy customization
> of the URL getter, since we can't attach it to the tree or any element.
> And then generally how any customization could be done, for instance
> if you want a new method on all elements.
>
> This isn't that easy currently. You'd have to subclass a bunch of
> classes and rewrite a bunch of functions. But I think if we move all
> parsing to a single class it would help a great deal.
>
> The idea is something like:
>
> class Parser(object):
>     _etree_parser_class = etree.HTMLParser
>     def __init__(self):
>         self._etree_parser = self._etree_parser_class()
>         self._etree_parser.setElementClassLookup(self)
>     def __call__(self, filename, **kw):
>         return etree.parse(filename, self._etree_parser, **kw)
>     def fromstring(...):
>         ...
That's a good idea, but as you suggest at the end, extending the HTMLParser class directly is the way to go. Documents in lxml.etree keep a reference to their parser to support inheritance of resolvers. It's even readable from Python as "parser" property of an ElementTree. That would nicely solve most of your problems. > If you want to adjust something, you don't have to > reimplement all the forms of parsers, since they all would just use > self, and are mostly defined in terms of each other. We could support > subclassing with something like this: > > class Parser(object): > _element_classes = {} > _element_mixins = {} > def __init__(self): > self._element_classes = self._element_classes.copy() > mixers = {} > for name, value in _element_mixins: > if name == '*': > for n in self._element_classes.keys(): > mixers.setdefault(n, []).append(value) > else: > mixers.setdefault(name, []).append(value) > for name, mixins in mixers: > cur = self._element_classes.get(name, HtmlElement) > bases = mixins + [cur] > new_class = type(cur.__name__, tuple(bases), {}) > self._element_classes[name] = new_class > > class MyMixin(object): > extra methods > class FormMixin(object): > other methods for the form element > > class ParserMixedIn(Parser): > _element_mixins = {'*': MyMixin, 'form': FormMixin} > > And then it would be really easy to create local extensions for all HTML > elements, or particular elements. I would have to see how this looks if you inherit from HTMLParser and how this matches with the existing class lookup mechanisms. > I'm not sure exactly how to attach the URL getting method to the Parser > object in this model, because I'm not sure how to give elements a > reference back to it. I think we should try to integrate with the normal Resolver mechanism here (doc/resolvers.txt). Not sure how this works exactly if we want to use it from Python code (currently it's only called from libxml2 internally), but I would like to avoid adding yet another way to resolve URLs. 
Currently, resolvers receive an opaque "context" object as last argument and return an opaque object with a string or file-like object etc. We could easily replace the context with an object containing a sequence of form arguments (which would be None when calling from libxml2). Stefan From stefan_ml at behnel.de Thu Jul 19 23:14:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 19 Jul 2007 23:14:12 +0200 Subject: [lxml-dev] Segfault in lxml 1.3.2 In-Reply-To: <10842.63.107.91.99.1184859979.squirrel@webmail.zoteca.com> References: <10842.63.107.91.99.1184859979.squirrel@webmail.zoteca.com> Message-ID: <469FD424.6060604@behnel.de> Hi, Itamar Shtull-Trauring wrote: > One of my co-workers is getting an occasional segfault in some code using > lxml; it uses no other non-stdlib libraries, and it's single threaded. The > segfault is in: > > __pyx_tp_dealloc_5etree__Element (src/lxml/etree.c:11282). That's definitely a weird place for a crash. I definitely need more infos here. > Any info I can try to get to help debug this? Would a core dump be > helpful? He's so far been unable to create a minimal reproducing example. What usually helps more than a core dump is a crash under valgrind control. The Makefile has a command line for it. Valgrind usually prints the exact error, the original reason why that memory location is not accessible (where it was freed or allocated) and what was the last thing that happened when it crashed. Stefan From rogerpatterson at gmail.com Sat Jul 21 00:10:21 2007 From: rogerpatterson at gmail.com (Roger Patterson) Date: Fri, 20 Jul 2007 15:10:21 -0700 Subject: [lxml-dev] new ElementSoup module in lxml.html In-Reply-To: <469B6F93.7020208@behnel.de> References: <469B6F93.7020208@behnel.de> Message-ID: <46A132CD.6010703@gmail.com> Hi Stefan, I hadn't tried to use the lxml.html module before, but it doesn't seem to be in trunk (only in branch). So I guess this means it can only be installed from source? 
(eggs are only made from the trunk?)

In which case, does your elementsoup.py really need lxml.html? I
noticed elementsoup.py only uses "makeelement" from
lxml.html.html_parser. Can I get away with using anything from the
trunk instead?

cheers,
-Roger

Stefan Behnel wrote:
> Hi,
>
> I rewrote Fredrik's ElementSoup.py module for lxml.html so that you can now
> have lxml read in tag soup with BeautifulSoup and convert it into an lxml.html
> tree of Elements. While libxml2 can also parse broken HTML, it is not made to
> parse sick soup of tags, so if you need to work with web pages that sort of
> look like they might have been HTML once, the lxml.html.ElementSoup module can
> help you get there.
>
> http://codespeak.net/svn/lxml/branch/html/doc/elementsoup.txt
> http://codespeak.net/svn/lxml/branch/html/src/lxml/html/ElementSoup.py
>
> Have fun,
> Stefan

From stefan_ml at behnel.de Sat Jul 21 09:11:17 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jul 2007 09:11:17 +0200
Subject: [lxml-dev] new ElementSoup module in lxml.html
In-Reply-To: <46A132CD.6010703@gmail.com>
References: <469B6F93.7020208@behnel.de> <46A132CD.6010703@gmail.com>
Message-ID: <46A1B195.6060207@behnel.de>

Roger Patterson wrote:
> I hadn't tried to use the lxml.html module before, but it doesn't seem
> to be in trunk (only in branch). So I guess this means it can only be
> installed from source? (eggs are only made from the trunk?)
>
> In which case, does your elementsoup.py really need lxml.html? I
> noticed elementsoup.py only uses "makeelement" from
> lxml.html.html_parser. Can I get away with using anything from the
> trunk instead?

Hmm, there isn't currently a trunk release either, but it /should/ also work
with 1.3.2. Just take ElementSoup.py and pass your own "makeelement" function
to parse(), try etree.Element for starters.
Stefan From stefan_ml at behnel.de Sat Jul 21 20:06:29 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 21 Jul 2007 20:06:29 +0200 Subject: [lxml-dev] lxml & parsing: return of a classes In-Reply-To: <469F89D9.1040500@behnel.de> References: <469D18F3.3080806@colorstudy.com> <469F89D9.1040500@behnel.de> Message-ID: <46A24B25.20203@behnel.de> Stefan Behnel wrote: > Ian Bicking wrote: >> I'm not sure exactly how to attach the URL getting method to the Parser >> object in this model, because I'm not sure how to give elements a >> reference back to it. > > I think we should try to integrate with the normal Resolver mechanism here > (doc/resolvers.txt). Not sure how this works exactly if we want to use it from > Python code (currently it's only called from libxml2 internally), but I would > like to avoid adding yet another way to resolve URLs. Currently, resolvers > receive an opaque "context" object as last argument and return an opaque > object with a string or file-like object etc. We could easily replace the > context with an object containing a sequence of form arguments (which would be > None when calling from libxml2). Ok, this is basically how resolvers work currently: They have a resolve() method that takes a system URL and a public ID (as in a DTD DOCTYPE), as well as an opaque "context" object. They return another opaque object of type _InputDocument, that is created by calling one of the resolve_*() methods in the _Resolver base type. It is evaluated internally to read from the source (string, file, ...) that was passed to resolve_*(). I could imagine making the parse() function aware of _InputDocument as input type, so that you could subclass _Resolver in your use case, call its resolve() method directly from the submit() method and return the result so that the user can pass the return value to parse() and have it read the result into a tree. This would allow "parse( form.submit() )" to work with the existing resolver infrastructure. 
Current problems:

- resolvers do not support URL options (?x=y&a=b). As described above, this
  would have to be passed through the context object somehow.

- this would work with parse(), but there's also the case where the result is
  not XML or HTML. We would need a different API to retrieve the result as a
  bare string.

So, this would work, but it's far from clean currently. We should put some
more thoughts into this.

Stefan

From stefan_ml at behnel.de Sat Jul 21 22:44:51 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 21 Jul 2007 22:44:51 +0200
Subject: [lxml-dev] new Namespace() registry behaviour for lxml 2.0
Message-ID: <46A27043.8000804@behnel.de>

Hi all,

this has been in the queue for quite a while: the global Namespace() class
lookup will go away in lxml 2.0, although it will be possible to write
backwards compatible code.

In lxml 1.x, the namespace class lookup was configured using the global
Namespace() factory. This was easy, but it was also an application wide setup:

    >>> ns = etree.Namespace("http://my/namespace")
    >>> ns["mytag"] = MyElementClass

The lookup itself was then configured through the ElementNamespaceClassLookup
class, which (starting with lxml 1.2) was a parser local setup:

    >>> lookup = etree.ElementNamespaceClassLookup()
    >>> parser = etree.XMLParser()
    >>> parser.setElementClassLookup(lookup)

In lxml 2.0, the namespace registry will become local to the lookup instance.
Consequently, the new setup will look like this:

    >>> lookup = etree.ElementNamespaceClassLookup()
    >>> parser = etree.XMLParser()
    >>> parser.setElementClassLookup(lookup)

    >>> ns = lookup.get_namespace("http://my/namespace")
    >>> ns["mytag"] = MyElementClass

Not a big difference, but this still has the huge benefit that the namespace
registry can now be different for different parsers, i.e. different modules
using lxml can now have their own registry without having to fear any
interference with other modules or parts of an application.
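
To see the parser-local registry in action, two parsers can be given separate lookups and fed the same document. This is a sketch, assuming a current lxml release, where (to my knowledge) the parser method is spelled `set_element_class_lookup` rather than the `setElementClassLookup` used in this announcement. `MyElementClass`, `make_parser` and the sample document are made up for the illustration.

```python
from lxml import etree


class MyElementClass(etree.ElementBase):
    """Custom element class registered for one namespace/tag pair."""

    def describe(self):
        return "custom:" + self.tag


def make_parser(element_class=None):
    """Build a parser with its own, private namespace registry."""
    lookup = etree.ElementNamespaceClassLookup()
    parser = etree.XMLParser()
    parser.set_element_class_lookup(lookup)
    if element_class is not None:
        ns = lookup.get_namespace("http://my/namespace")
        ns["mytag"] = element_class
    return parser


xml = b'<mytag xmlns="http://my/namespace"/>'
custom = etree.fromstring(xml, make_parser(MyElementClass))
plain = etree.fromstring(xml, make_parser())
```

Only the element produced by the first parser is an instance of `MyElementClass`; the second parser never sees the registration, which is the isolation between modules that the announcement describes.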
This means that all Element class lookup schemes are now parser local. As the usage itself does not change, it is easy to adapt existing code so that it runs with lxml 2.0 while still supporting lxml 1.2/3 (at least, it was easy enough to adapt the unit tests in lxml...). All you have to do is set the same lookup instance for all parsers you use (note that the initial setup is already required today) and replace the occurrences of calls to etree.Namespace by the "get_namespace" method of that lookup instance. I believe that this simple change is easy enough to work around for existing code, so it will not have a big impact here. However, the advantage of removing the global setup and thus supporting independent configurations for different parts of an application (and third party libs) outweighs this overhead considerably. Have fun, Stefan From stefan_ml at behnel.de Sun Jul 22 14:38:53 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 22 Jul 2007 14:38:53 +0200 Subject: [lxml-dev] lxml & parsing: return of a classes In-Reply-To: <469F89D9.1040500@behnel.de> References: <469D18F3.3080806@colorstudy.com> <469F89D9.1040500@behnel.de> Message-ID: <46A34FDD.3060806@behnel.de> Hi Ian, Stefan Behnel wrote: > Ian Bicking wrote: >> If you want to adjust something, you don't have to >> reimplement all the forms of parsers, since they all would just use >> self, and are mostly defined in terms of each other. 
We could support >> subclassing with something like this: >> >> class Parser(object): >> _element_classes = {} >> _element_mixins = {} >> def __init__(self): >> self._element_classes = self._element_classes.copy() >> mixers = {} >> for name, value in _element_mixins: >> if name == '*': >> for n in self._element_classes.keys(): >> mixers.setdefault(n, []).append(value) >> else: >> mixers.setdefault(name, []).append(value) >> for name, mixins in mixers: >> cur = self._element_classes.get(name, HtmlElement) >> bases = mixins + [cur] >> new_class = type(cur.__name__, tuple(bases), {}) >> self._element_classes[name] = new_class >> >> class MyMixin(object): >> extra methods >> class FormMixin(object): >> other methods for the form element >> >> class ParserMixedIn(Parser): >> _element_mixins = {'*': MyMixin, 'form': FormMixin} >> >> And then it would be really easy to create local extensions for all HTML >> elements, or particular elements. The right way to do this is not to subclass the parser, but to rewrite the HtmlLookup (which I renamed to HtmlElementClassLookup) with the above __init__ code. I did that in revision 45244, please check if it matches your requirements. Stefan From stefan_ml at behnel.de Sun Jul 22 16:18:13 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 22 Jul 2007 16:18:13 +0200 Subject: [lxml-dev] lxml & parsing: return of a classes In-Reply-To: <46A24B25.20203@behnel.de> References: <469D18F3.3080806@colorstudy.com> <469F89D9.1040500@behnel.de> <46A24B25.20203@behnel.de> Message-ID: <46A36725.7080108@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> I'm not sure exactly how to attach the URL getting method to the Parser >>> object in this model, because I'm not sure how to give elements a >>> reference back to it. >> I think we should try to integrate with the normal Resolver mechanism here >> (doc/resolvers.txt). 
Not sure how this works exactly if we want to use it from >> Python code (currently it's only called from libxml2 internally), but I would >> like to avoid adding yet another way to resolve URLs. Currently, resolvers >> receive an opaque "context" object as last argument and return an opaque >> object with a string or file-like object etc. We could easily replace the >> context with an object containing a sequence of form arguments (which would be >> None when calling from libxml2). I thought about this some more and I now think that it would be inappropriate to use etree's resolver interface here. It serves a totally different purpose and the extension with form data would be useless everywhere else. A simple function passed as argument to submit would do. If you want a set-once-use-everywhere setup, the parser is the only place where this would work, although I dislike the idea of using the /parser/ to set a method for /submitting/ form data. Especially if you have to parse the result yourself. I now made the submit() method a "submit_form()" module function. That way, you can easily write your own function with the same interface that simply passes the appropriate HTTP mechanism in for you. The signature is: def submit_form(form, extra_values=None, open_http=None) Stefan From albert.brandl at tttech.com Mon Jul 23 13:22:49 2007 From: albert.brandl at tttech.com (Albert Brandl) Date: Mon, 23 Jul 2007 13:22:49 +0200 Subject: [lxml-dev] Strange behaviour with namespaces Message-ID: <20070723112248.GA20941@tttech.com> Hi! I'd like to serialize a part of an XML document for later retrieval. Since the elements are defined using namespaces, I created an ElementTree instance for the element to be serialized and called its write_c14n method. >>> from import etree >>> import cStringIO >>> XML=""" ... ... ... ... ... 
""" >>> e = etree.fromstring(XML) >>> et = etree.ElementTree(e[0]) >>> sb = cStringIO.StringIO() >>> et.write_c14n(sb) >>> sb.getvalue() '\n \n ' The xmlns information is transported correctly, but the information about the namespaces for the elements is lost. I assume this is a bug. Is there another way to extract a part of the document in textual form such that namespace information is preserved? Using tostring does not work, since this method throws away the xmlns attributes altogether. Thanks & best regards, Albert Brandl From jholg at gmx.de Mon Jul 23 13:51:40 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Mon, 23 Jul 2007 13:51:40 +0200 Subject: [lxml-dev] Strange behaviour with namespaces In-Reply-To: <20070723112248.GA20941@tttech.com> References: <20070723112248.GA20941@tttech.com> Message-ID: <20070723115140.209470@gmx.net> Hi, -------- Original-Nachricht -------- Datum: Mon, 23 Jul 2007 13:22:49 +0200 Von: Albert Brandl > >>> from import etree > >>> import cStringIO > >>> XML=""" > ... xmlns:c="http://c.org"> > ... > ... > ... > ... """ > >>> e = etree.fromstring(XML) > >>> et = etree.ElementTree(e[0]) > >>> sb = cStringIO.StringIO() > >>> et.write_c14n(sb) > >>> sb.getvalue() > ' xmlns:c="http://c.org">\n \n ' > > The xmlns information is transported correctly, but the information about > the namespaces for the elements is lost. I assume this is a bug. You got your prefixes confused: The ns-prefixes of the elements do not correspond to any xmlns-declaration (n1:a vs xmlns:a="...") Try this with consistent prefixes and it works like a charm: >>> from lxml import etree >>> import cStringIO >>> XML=""" ... ... ... ... ... """ >>> >>> e = etree.fromstring(XML) >>> et = etree.ElementTree(e[0]) >>> sb = cStringIO.StringIO() >>> et.write_c14n(sb) >>> sb.getvalue() >>> sb.getvalue()'' >>> >>> etree.tostring(e) '' > > Is there another way to extract a part of the document in textual form > such that namespace information is preserved? 
Using tostring does not > work, since this method throws away the xmlns attributes altogether. No it doesn't: >>> etree.tostring(e) '' >>> Regards, Holger From albert.brandl at tttech.com Mon Jul 23 14:39:23 2007 From: albert.brandl at tttech.com (Albert Brandl) Date: Mon, 23 Jul 2007 14:39:23 +0200 Subject: [lxml-dev] Strange behaviour with namespaces In-Reply-To: <20070723115140.209470@gmx.net> References: <20070723112248.GA20941@tttech.com> <20070723115140.209470@gmx.net> Message-ID: <20070723123922.GC20941@tttech.com> Hi, On Mon, Jul 23, 2007 at 01:51:40PM +0200, jholg at gmx.de wrote: > You got your prefixes confused: The ns-prefixes of the elements do not > correspond to any xmlns-declaration (n1:a vs xmlns:a="...") you are correct. Thanks for the quick reply. > No it doesn't: > >>> etree.tostring(e) > '' This is much better than first wrapping the element in an ElementTree. It seems to have been fixed somewhere between lxml 1.1.2 and the current version: >>> lxml.etree.LXML_VERSION (1, 1, 2, 0) >>> e = fromstring('') >>> tostring(e[0]) '' Looks like it's time to upgrade :-) Regards, Albert Brandl From jholg at gmx.de Wed Jul 25 16:47:25 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 25 Jul 2007 16:47:25 +0200 Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper In-Reply-To: <20070711134539.193100@gmx.net> References: <20070711134539.193100@gmx.net> Message-ID: <20070725144725.83900@gmx.net> Hi, due to (seemingly) pressing needs of my users, I propose a change of ObjectifiedElement '.'-operator subelement-setting (the _setElementValue function, to be exact) behaviour, possibly configurable as a module setup. The proposed change is to auto-add the python type() of the RVAL of an assignment as a py:pytype attribute.
This is much like what you would use the PT convenience function proposed below for, given the current behaviour. Current behaviour: >>> root = objectify.Element("root") >>> root.s = "0003" >>> print objectify.dump(root) root = None [ObjectifiedElement] s = 3 [IntElement] >>> Proposed behaviour (switchable): >>> root = objectify.Element("root") >>> root.s = "0003" >>> print objectify.dump(root) root = None [ObjectifiedElement] s = '0003' [StringElement] * py:pytype = 'str' >>> I am well aware that this * auto-adds an attribute to an Element, where you now need to tell this explicitly, e.g. by using DataElement() * in a sense maybe means some loss of symmetry, considering adding content through the objectify API vs. parsing from an XML file or string However, my users just can't seem to grasp the notion of "assignment only cares about RVAL literals, and type-lookup happens on element-access by type-guessing". Especially, they seem to have trouble with situations like this: >>> root = objectify.Element("root") >>> root.comment = "this is my 500. comment" >>> print objectify.dump(root) root = None [ObjectifiedElement] comment = 'this is my 500. comment' [StringElement] >>> root.comment = root.comment.pyval[11:14] >>> print objectify.dump(root) root = None [ObjectifiedElement] comment = 500 [IntElement] >>> where you cut some parts out of a string and might then get this presented as an IntElement, due to the int-able literal.
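The effect can be modelled without lxml. The following is a rough sketch, assuming that type guessing simply tries int, then float, and otherwise falls back to str; guess_pytype and remember_pytype are hypothetical names made up for this illustration, and objectify's real type lookup covers more types than this.

```python
def guess_pytype(text):
    """Guess a type from the text alone (models the current behaviour)."""
    for name, convert in (("int", int), ("float", float)):
        try:
            return name, convert(text)
        except ValueError:
            continue
    # Nothing numeric matched, so the text stays a string.
    return "str", text


def remember_pytype(value):
    """Record the type of the assigned value (models the proposed behaviour)."""
    return type(value).__name__, value


comment = "this is my 500. comment"
piece = comment[11:14]  # a slice of a string, which is still a string

guessed = guess_pytype(piece)          # guessing looks only at the characters
guessed_padded = guess_pytype("0003")  # leading zeros do not survive int()
remembered = remember_pytype("0003")   # the type of the assigned value is kept
```

With guessing, the slice and the zero-padded literal both come back as integers even though strings were assigned; recording the type at assignment time keeps them as strings, at the cost of carrying an extra annotation.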
In addition, I still propose what I posted before ;-): Betreff: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper > Hi, > > attached patch (against trunk) > > * adds a typed E-factory (called T-factory) > * inserts NoneType into the E-factory/T-factory typemap > * adds the PT() (="PyTyped()) convenience function that is a thin > wrapper uses the argument value's type to set the pytype > * provides unittests for E-factory, T-factory and PT() > * fixes DataElement() to care for some previously-unhandled corner cases > concerning None and/or _pytype "none" > > Despite of what I previously said ;-) I now think it would be better to > rename "none" to "NoneType", to use the same name as the Python builtin > original. While it is a longer name I seriously doubt you need to actually use > it explicitly very often. > By convention, the PyType name should match the Python builtin type name; > then both the T-factory and the PT() function can work smoothly (the only > thing special-cased is the Python type name "unicode" with gets substituted > by "str"). > > Therefore, the patch also changes "none" to "NoneType" in objectify and > the objectify tests/doctests. > > I'd really like to see the PT() function go into the 1.3 series, too. > > Please take a look, I can come up with some documentation if you like it. > > Holger Holger -- GMX FreeMail: 1 GB Postfach, 5 E-Mail-Adressen, 10 Free SMS. 
Alle Infos und kostenlose Anmeldung: http://www.gmx.net/de/go/freemail From stefan_ml at behnel.de Wed Jul 25 17:06:58 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 25 Jul 2007 17:06:58 +0200 Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper In-Reply-To: <20070725144725.83900@gmx.net> References: <20070711134539.193100@gmx.net> <20070725144725.83900@gmx.net> Message-ID: <46A76712.1040300@behnel.de> jholg at gmx.de wrote: > due to (seemingly) pressing needs of my users, I propose a change of > ObjectifiedElement '.'-operator subelement-setting (the _setElementValue > function, to be exact) behaviour, possibly configurable as a module setup. > The proposed change is to auto-add the python type() of the RVAL of an > assignment as a py:pytype attribute. This is much what you would use the > proposed-below PT convenience function for, with current behaviour. > > Current behaviour: > >>> root = objectify.Element("root") > >>> root.s = "0003" > >>> print objectify.dump(root) > root = None [ObjectifiedElement] > s = 3 [IntElement] > > Proposed behaviour (switchable): > >>> root = objectify.Element("root") > >>> root.s = "0003" > >>> print objectify.dump(root) > root = None [ObjectifiedElement] > s = '0003' [StringElement] > * py:pytype = 'str' Hmmm, this makes sense at first sight, but I'll have to think this through to figure out the implications. I'm not all together happy with the attribute type business today, as it keeps people from generating 'clean' XML. Ok, you can run deannotate() on trees before you serialise, but that might mean that objectify could behave differently the next time you parse it. So it's somewhat quirky either way: live with the artifacts or live with surprises. Sounds like the first is a lot better, though. :) > I am well aware that this > * auto-adds an attribute to an Element, where you now need to tell this explicitly, e.g. 
by using DataElement() We already do that in a couple of places now, so it wouldn't add much ugliness. It would even make the type behaviour less surprising - you get out what you put in. I agree that this actually helps users. We're talking about 2.0 behaviour here, though. > In addition, I still propose what I posted before ;-): Right, I'll look at that also. We really need a bug tracker for lxml... Stefan From stefan_ml at behnel.de Wed Jul 25 17:39:38 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 25 Jul 2007 17:39:38 +0200 Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper In-Reply-To: <20070711134539.193100@gmx.net> References: <20070711134539.193100@gmx.net> Message-ID: <46A76EBA.9090508@behnel.de> Hi Holger, sorry, I keep pushing non trivial decisions back into the FIFO when I first see them and the queue was pretty long this time. I already looked at your patch, but didn't get through it completely. You should really cut down the size of your patches... :) jholg at gmx.de wrote: > * adds a typed E-factory (called T-factory) > * inserts NoneType into the E-factory/T-factory typemap > * adds the PT() (="PyTyped()) convenience function that is a thin wrapper uses the argument value's type to set the pytype > * provides unittests for E-factory, T-factory and PT() > * fixes DataElement() to care for some previously-unhandled corner cases concerning None and/or _pytype "none" I'll take another look at the patch. > Despite of what I previously said ;-) I now think it would be better to > rename "none" to "NoneType", to use the same name as the Python builtin > original. While it is a longer name I seriously doubt you need to actually use > it explicitly very often. > By convention, the PyType name should match the Python builtin type name; > then both the T-factory and the PT() function can work smoothly (the only > thing special-cased is the Python type name "unicode" with gets substituted by > "str"). 
> > Therefore, the patch also changes "none" to "NoneType" in objectify and the objectify tests/doctests. This will break existing documents, though, if they do not additionally use xsi:nil. No idea how many there are... We could accept both names for the time being, though, and write out the new one in 2.0 and the old one in 1.3. Stefan From jholg at gmx.de Thu Jul 26 09:42:03 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 26 Jul 2007 09:42:03 +0200 Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper In-Reply-To: <46A76EBA.9090508@behnel.de> References: <20070711134539.193100@gmx.net> <46A76EBA.9090508@behnel.de> Message-ID: <20070726074203.298440@gmx.net> Hi, > patch, but didn't get through it completely. You should really cut down > the > size of your patches... :) Just for the defence the biggest portion was the unittests ;-) Btw I now have an svn account (thanks, Philipp) so if I can help in a way that is easier for you to manage & quality-assure, just let me know. And I do have one other thing in the queue, namely adding a keep_tree option to the *annotate() functions, renaming annotatate() to pyannotate() and giving _annotate a public interface that is pretty backwards-compatible to the 1.x annotate(). > > Therefore, the patch also changes "none" to "NoneType" in objectify and > the objectify tests/doctests. > > This will break existing documents, though, if they do not additionally > use > xsi:nil. No idea how many there are... > > We could accept both names for the time being, though, and write out the > new > one in 2.0 and the old one in 1.3. Sounds good. I'll take a look at DataElement() to see where it should handle both names. Holger -- Psssst! Schon vom neuen GMX MultiMessenger geh?rt? 
Der kanns mit allen: http://www.gmx.net/de/go/multimessenger From stefan_ml at behnel.de Thu Jul 26 12:31:36 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Jul 2007 12:31:36 +0200 Subject: [lxml-dev] lxml has its page on launchpad Message-ID: <46A87808.7050704@behnel.de> Hi all, I added the lxml project to launchpad, the Ubuntu Bug-Tracker. It also has a FAQ engine and a couple of other goodies. https://launchpad.net/lxml It's easy to sign up for launchpad, BTW, no 90%-footnotes-contract. Have fun, Stefan From jholg at gmx.de Thu Jul 26 14:38:58 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Thu, 26 Jul 2007 14:38:58 +0200 Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper In-Reply-To: <46A76712.1040300@behnel.de> References: <20070711134539.193100@gmx.net> <20070725144725.83900@gmx.net> <46A76712.1040300@behnel.de> Message-ID: <20070726123858.77870@gmx.net> Hi, > I'm not all together happy with the attribute type business today, as it > keeps > people from generating 'clean' XML. Ok, you can run deannotate() on trees > before you serialise, but that might mean that objectify could behave > differently the next time you parse it. So it's somewhat quirky either > way: > live with the artifacts or live with surprises. Sounds like the first is a > lot > better, though. :) If we make the behaviour I proposed switchable, then maybe this switch should also affect the auto-generation of py:pytype="TREE" in the objectify.Element() factory. That way the user can decide what you suggested: -no artefacts, build up clean XML, at the cost of type-"uncertainty" -with artefacts, you get py:pytype attributes everywhere and can rely on "stable types" So one could basically use the objectify API in 2 ways, one being more or less an (arguably simpler) alternative to the etree API (if you need "clean" trees and are prepared to protect yourself against type confusion), the other being "fully type-annotated". 
I think objectify.Element() is the only place where "TREE" gets auto-generated. Holger From stefan_ml at behnel.de Thu Jul 26 18:49:42 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 26 Jul 2007 18:49:42 +0200 Subject: [lxml-dev] lxml 1.3.3 released Message-ID: <46A8D0A6.30000@behnel.de> Hi all, I just released lxml 1.3.3 to cheeseshop. This is mainly a bug-fix release for the stable 1.3 series. Changelog follows. Have fun, Stefan 1.3.3 (2007-07-26) ================== Features added -------------- * ElementTree compatible parser ETCompatXMLParser strips processing instructions and comments while parsing XML * Parsers now support stripping PIs (keyword argument 'remove_pis') * etree.fromstring() now supports parsing both HTML and XML, depending on the parser you pass. * Support base_url keyword argument in HTML() and XML() Bugs fixed ---------- * Parsing from Python Unicode strings failed on some platforms * Element() did not raise an exception on tag names containing ':' * Element.getiterator(tag) did not accept Comment and ProcessingInstruction as tags. It also accepts Element now. From jholg at gmx.de Fri Jul 27 08:57:47 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Fri, 27 Jul 2007 08:57:47 +0200 Subject: [lxml-dev] reloading a network resource fails Message-ID: <20070727065747.182500@gmx.net> Hi, parsing a network resource multiple times fails in current trunk: *** 2.0.dev-45372 *** >>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") >>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") Traceback (most recent call last): File "", line 1, in ?
File "etree.pyx", line 2065, in etree.parse File "parser.pxi", line 1004, in etree._parseDocument File "parser.pxi", line 1008, in etree._parseDocumentFromURL File "parser.pxi", line 925, in etree._parseDocFromFile File "parser.pxi", line 585, in etree._BaseParser._parseDocFromFile File "parser.pxi", line 682, in etree._handleParseResult File "parser.pxi", line 630, in etree._raiseParseError etree.XMLSyntaxError: Attempt to load network entity http://adevp01:8080/validSummary-1.2.xml >>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") Traceback (most recent call last): File "", line 1, in ? File "etree.pyx", line 2065, in etree.parse File "parser.pxi", line 1004, in etree._parseDocument File "parser.pxi", line 1008, in etree._parseDocumentFromURL File "parser.pxi", line 925, in etree._parseDocFromFile File "parser.pxi", line 585, in etree._BaseParser._parseDocFromFile File "parser.pxi", line 682, in etree._handleParseResult File "parser.pxi", line 630, in etree._raiseParseError etree.XMLSyntaxError: Attempt to load network entity http://adevp01:8080/validSummary-1.2.xml >>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") Traceback (most recent call last): File "", line 1, in ? File "etree.pyx", line 2065, in etree.parse File "parser.pxi", line 1004, in etree._parseDocument File "parser.pxi", line 1008, in etree._parseDocumentFromURL File "parser.pxi", line 925, in etree._parseDocFromFile File "parser.pxi", line 585, in etree._BaseParser._parseDocFromFile File "parser.pxi", line 682, in etree._handleParseResult File "parser.pxi", line 630, in etree._raiseParseError etree.XMLSyntaxError: Attempt to load network entity http://adevp01:8080/validSummary-1.2.xml >>> I rather often use local files, so I ran into this by accident. 1.3 does not have the same problem (same libxml2 version). I also reported this one on the new launchpad bugtracker. 
Can this be set up to let one choose the version, or should we introduce a convention on naming the summary line, e.g. [trunk], [1.3]? Holger From stefan_ml at behnel.de Fri Jul 27 09:54:04 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 27 Jul 2007 09:54:04 +0200 Subject: [lxml-dev] reloading a network resource fails In-Reply-To: <20070727065747.182500@gmx.net> References: <20070727065747.182500@gmx.net> Message-ID: <46A9A49C.6050406@behnel.de> Hi, jholg at gmx.de wrote: > parsing a network resource multiple times fails in current trunk: > > *** 2.0.dev-45372 *** >>>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") > >>>> etree.parse("http://adevp01:8080/validSummary-1.2.xml") > Traceback (most recent call last): > File "", line 1, in ? > File "etree.pyx", line 2065, in etree.parse > File "parser.pxi", line 1004, in etree._parseDocument > File "parser.pxi", line 1008, in etree._parseDocumentFromURL > File "parser.pxi", line 925, in etree._parseDocFromFile > File "parser.pxi", line 585, in etree._BaseParser._parseDocFromFile > File "parser.pxi", line 682, in etree._handleParseResult > File "parser.pxi", line 630, in etree._raiseParseError > etree.XMLSyntaxError: Attempt to load network entity http://adevp01:8080/validSummary-1.2.xml Funny. This actually changed in the trunk, as lxml 2.0 will no longer read network resources by default (pass False for the "no_network" kw arg of the parser if you want to enable it). However, it should also fail the first time you do it. I'll have to see where the problem is (libxml2 or lxml). > I also reported this one on the new launchpad bugtracker. Can this be set > up to let one choose the version, or should we introduce a convention on > naming the summary line, e.g. [trunk], [1.3]? I'm also missing such an option. Looks like I should dig a little more.
Stefan From rcdailey at gmail.com Sun Jul 29 22:35:29 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Sun, 29 Jul 2007 15:35:29 -0500 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <46A8D0A6.30000@behnel.de> References: <46A8D0A6.30000@behnel.de> Message-ID: <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> When can we expect a binary distribution of 1.3.3 for windows? On 7/26/07, Stefan Behnel wrote: > > Hi all, > > I just released lxml 1.3.3 to cheeseshop. This is mainly a bug-fix release > for > the stable 1.3 series. Changelog follows. > > Have fun, > Stefan > > > 1.3.3 (2007-07-26) > ================== > > Features added > -------------- > > * ElementTree compatible parser ETCompatXMLParser strips processing > instructions and comments while parsing XML > * Parsers now support stripping PIs (keyword argument 'remove_pis') > * etree.fromstring() now supports parsing both HTML and XML, depending > on > the parser you pass. > * Support base_url keyword argument in HTML() and XML() > > Bugs fixed > ---------- > > * Parsing from Python Unicode strings failed on some platforms > * Element() did not raise an exception on tag names containing ':' > * Element.getiterator(tag) did not accept Comment and > ProcessingInstruction as tags. It also accepts Element now. > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev From stefan_ml at behnel.de Mon Jul 30 13:03:33 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 30 Jul 2007 13:03:33 +0200 Subject: [lxml-dev] lxml 2.0 will require Cython instead of Pyrex Message-ID: <46ADC585.2030006@behnel.de> Hi all, the new Cython project has started its work on the original Pyrex source code.
It already has several advantages over Pyrex, including support for list comprehension and type based optimisation. http://www.cython.org/ Cython includes the patches lxml originally needed in Pyrex. lxml 2.0 will therefore stop shipping with a modified Pyrex and simply require Cython if people want to build from non-release sources. Note that we will continue to ship releases with the generated .c files, so no Cython is required by people who just want to build from release sources. Have fun, Stefan From stefan_ml at behnel.de Mon Jul 30 14:22:39 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 30 Jul 2007 14:22:39 +0200 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> References: <46A8D0A6.30000@behnel.de> <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> Message-ID: <46ADD80F.2050700@behnel.de> Hi, Robert Dailey wrote: > When can we expect a binary distribution of 1.3.3 for windows? Windows builds of lxml require a) MS Windows and b) a MS compiler, two reasons why I can't do them myself but rely on our official Windows maintainer Sidnei. It sometimes takes a couple of days until binary builds become available, so I can only ask you to wait a little longer - or to bug Sidnei yourself :) Stefan From rcdailey at gmail.com Mon Jul 30 19:53:07 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 30 Jul 2007 12:53:07 -0500 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <46ADD80F.2050700@behnel.de> References: <46A8D0A6.30000@behnel.de> <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> <46ADD80F.2050700@behnel.de> Message-ID: <496954360707301053y2fd0abb1i4d66460878c18841@mail.gmail.com> Hi, I apologize if I came across as impatient. I meant no rush; I'm simply just wondering. I have an MS compiler (msvc8), but I don't think the build process supports it. I've actually tried making the binaries for windows myself at one point and I couldn't do it. 
It's been a while so I don't remember exactly what the problem was. However, I do appreciate your reply. Take care. On 7/30/07, Stefan Behnel wrote: > > Hi, > > Robert Dailey wrote: > > When can we expect a binary distribution of 1.3.3 for windows? > > Windows builds of lxml require a) MS Windows and b) a MS compiler, two > reasons > why I can't do them myself but rely on our official Windows maintainer > Sidnei. > It sometimes takes a couple of days until binary builds become available, > so > I can only ask you to wait a little longer - or to bug Sidnei yourself :) > > Stefan From philipp at weitershausen.de Tue Jul 31 08:50:45 2007 From: philipp at weitershausen.de (Philipp von Weitershausen) Date: Tue, 31 Jul 2007 08:50:45 +0200 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <496954360707301053y2fd0abb1i4d66460878c18841@mail.gmail.com> References: <46A8D0A6.30000@behnel.de> <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> <46ADD80F.2050700@behnel.de> <496954360707301053y2fd0abb1i4d66460878c18841@mail.gmail.com> Message-ID: <46AEDBC5.9000005@weitershausen.de> Robert Dailey wrote: > I apologize if I came across as impatient. I meant no rush; I'm simply > just wondering. I have an MS compiler (msvc8), but I don't think the > build process supports it. I've actually tried making the binaries for > windows myself at one point and I couldn't do it. It's been a while so I > don't remember exactly what the problem was. However, I do appreciate > your reply. Take care.
Perhaps this may help, at least in the mean time: http://www.z3lab.org/sections/blogs/philipp-weitershausen/2007_07_26_cheap-binary-windows -- http://worldcookery.com -- Professional Zope documentation and training From rcdailey at gmail.com Tue Jul 31 17:28:36 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 31 Jul 2007 10:28:36 -0500 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <46AEDBC5.9000005@weitershausen.de> References: <46A8D0A6.30000@behnel.de> <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> <46ADD80F.2050700@behnel.de> <496954360707301053y2fd0abb1i4d66460878c18841@mail.gmail.com> <46AEDBC5.9000005@weitershausen.de> Message-ID: <496954360707310828k15c2835do473caba11f32c01b@mail.gmail.com> Philipp, Thank you kindly for your website reference. It came in handy. On 7/31/07, Philipp von Weitershausen wrote: > > Robert Dailey wrote: > > I apologize if I came across as impatient. I meant no rush; I'm simply > > just wondering. I have an MS compiler (msvc8), but I don't think the > > build process supports it. I've actually tried making the binaries for > > windows myself at one point and I couldn't do it. It's been a while so I > > don't remember exactly what the problem was. However, I do appreciate > > your reply. Take care. > > Perhaps this may help, at least in the mean time: > > http://www.z3lab.org/sections/blogs/philipp-weitershausen/2007_07_26_cheap-binary-windows > > > -- > http://worldcookery.com -- Professional Zope documentation and training
From sidnei at enfoldsystems.com Tue Jul 31 19:07:03 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 31 Jul 2007 14:07:03 -0300 Subject: [lxml-dev] lxml 1.3.3 released In-Reply-To: <46ADD80F.2050700@behnel.de> References: <46A8D0A6.30000@behnel.de> <496954360707291335m392786b9vae956191a9eeae3b@mail.gmail.com> <46ADD80F.2050700@behnel.de> Message-ID: Hey there, Sorry for the delay, I was taking a short vacation in LA, just arrived in Houston now. I've built 1.3.3 for Python 2.4 and 2.5 and uploaded them to the cheeseshop. -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214