From stefan_ml at behnel.de Sun Jul 1 15:20:12 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sun, 01 Jul 2007 15:20:12 +0200
Subject: [lxml-dev] lxml 1.3 coming up
In-Reply-To: <20070621165252.177750@gmx.net>
References: <466EDCE5.8020407@behnel.de> <20070613071920.19070@gmx.net>
    <466FD558.7020404@behnel.de> <20070613120039.40630@gmx.net>
    <20070614094533.276670@gmx.net> <46717FF3.3060900@behnel.de>
    <20070615123522.202110@gmx.net> <46728DD9.6080202@behnel.de>
    <20070618153131.19060@gmx.net> <46779A23.6030208@behnel.de>
    <20070619145137.287570@gmx.net> <4677F497.8030000@behnel.de>
    <20070619162609.287610@gmx.net> <46782836.4000408@behnel.de>
    <20070620153043.276660@gmx.net> <20070621165252.177750@gmx.net>
Message-ID: <4687AA0C.2080803@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> Find attached a patch that:
>
> - changes the above to apply xsi:nil="true" for None value arguments

Ok.

> - lets DataElement() graciously handle ObjectifiedDataElement arguments,
>   keeping their attributes intact, if not overridden by the DataElement()
>   args. This also reuses existing xsi:type or py:pytype information, unless
>   _pytype and/or _xsi are provided as parameters to DataElement()
>
>   Previously, DataElement() cut off all attributes if given an
>   ObjectifiedDataElement instance.

Ok.

> - Type-checks the _value against the given type hint:
>   You will run into the error anyway - sooner or
>   later - when accessing the .pyval in any way, so why not during
>   instantiation.

Ok.

> Tests are included for the described behaviour.

Cool, thanks.

> Additionally, I've revamped some of the tests I provided earlier and split
> them up: More but smaller test methods now.

That's even better. :)

> Please try it out, if any of the DataElement changes are not ok I can also
> send only the split-up tests, of course.
>
> Btw.: I'm always getting
>
> IOError: Error reading file
> '/data/pydev/hjoukl/LXML/lxml-1.3/src/lxml/tests/test_xinclude.xml': failed
> to load external entity
> "/data/pydev/hjoukl/LXML/lxml-1.3/src/lxml/tests/test_xinclude.xml"
>
> due to some missing xml file lately when running the tests.

I moved an XML file to a subdirectory to also test relative references in a
base directory. But it should be fixed in the 1.3 release...

Stefan

From stefan_ml at behnel.de Mon Jul 2 10:32:36 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 02 Jul 2007 10:32:36 +0200
Subject: [lxml-dev] Some XPath questions...
In-Reply-To: <468538ED.9060004@colorstudy.com>
References: <468538ED.9060004@colorstudy.com>
Message-ID: <4688B824.1020707@behnel.de>

Hi Ian,

just to comment on your actual first post in this thread, which I kinda
overlooked because of the later discussion.

I think this is pretty cool stuff and I'd love to have this in lxml. The html
module really seems to be getting somewhere. I think we shouldn't even wait
too long with a release so that we get some more feedback on the new APIs.
Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and
not only alpha, beta, final).

Ian Bicking wrote:
> div:contains('celia') -- means a div where the textual content has the
> word 'celia' in it, case insensitive. At least, I think it's case
> insensitive -- the CSS spec is annoyingly vague, but implementations
> seem to work like this. I translate this to:
>
>     descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]
>
> I added the lower-case function like:
>
>     def _make_lower_case(context, s):
>         return s.lower()
>     etree.FunctionNamespace("css")['lower-case'] = _make_lower_case

"css" is not the namespace, it's the prefix.
You can do this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns.prefix = "css"
    ns['lower-case'] = _make_lower_case

or this:

    ns = etree.FunctionNamespace("http://my/css/namespace")
    ns['lower-case'] = _make_lower_case

    def css_to_xpath(css):
        xpath = build_xpath(css)
        return etree.XPath(xpath, {'css' : "http://my/css/namespace"})

You should consider providing a default namespace map here, and maybe even
return compiled XPath objects, i.e. callables. Note that these provide a
"path" attribute that returns the original path, so if you have to extend an
expression later on, you can still do so by creating a new XPath object.

Note that this would also allow you to wrap the function with an additional
call to set(), so that or-ed results really become the union and not the sum
of all parts.

> But XPath gives so few errors that it's hard to tell if it's really
> working.

Sadly, there doesn't seem to be a simple way to find out that a function was
undeclared. Or maybe I'll just have to look back into that... didn't I do that
already? :)

> There's also
> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among
> siblings with patterns like "2" (second sibling), "3n" (every third
> element), "odd" (odd elements) and some other selections. I kind of see
> how to deal with this using position(), but I'm not sure how to do
> either nth-of-type or nth-child (and the ones I do understand I am also
> vague about).

If I understand this correctly, this would be

    nth-of-type:  //*/NAME[position() = x]
    nth-child:    //*/*[position() = x]

To deal with things like "2n", try this:

    //*/NAME[(position() mod 2) = 0]

> I've committed the incomplete code in lxml.html.css

I skipped through it a bit and found it really cool. I'm not completely
satisfied with the naming, but I now see that the context of the css module
makes it clearer what the semantics are.
Still, I prefer "css_to_xpath()", and providing a top-level class XPath()
makes me think it should return an etree.XPath object, i.e. a compiled path.

One more note:

    def run_xpath(doc, xpath):
        return [el for el in doc.xpath(xpath)
                if isinstance(el, etree.ElementBase)]

Do you mean "etree.iselement(el)" here or are you intentionally restricting
this to real-element subclasses of _Element? (i.e. no plain lxml.etree
elements, no PIs, no comments)

I actually think this module merits its own top-level placing, not necessarily
only as part of lxml.html. It could just as well become "lxml.css", and should
thus not rely too much on a specific API from lxml.html.

Stefan

From stefan_ml at behnel.de Mon Jul 2 18:35:59 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 02 Jul 2007 18:35:59 +0200
Subject: [lxml-dev] lxml 1.3.1 on cheeseshop
Message-ID: <4689296F.3030004@behnel.de>

Hi all,

I just released lxml 1.3.1. This is a bugfix release for the stable 1.3
series.

Changelog follows.

Have fun,
Stefan


1.3.1 (2007-07-02)
==================

Features added
--------------

* objectify.DataElement now supports setting values from existing data
  elements (not just plain Python types) and reuses defined namespaces etc.

* E-factory support for lxml.objectify (``objectify.E``)

Bugs fixed
----------

* Better way to prevent crashes in Element proxy cleanup code

* objectify.DataElement didn't set up None value correctly

* objectify.DataElement didn't check the value against the provided type
  hints

* Reference-counting bug in ``Element.attrib.pop()``

From ianb at colorstudy.com Mon Jul 2 19:21:54 2007
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 02 Jul 2007 12:21:54 -0500
Subject: [lxml-dev] Some XPath questions...
In-Reply-To: <4688B824.1020707@behnel.de>
References: <468538ED.9060004@colorstudy.com> <4688B824.1020707@behnel.de>
Message-ID: <46893432.10404@colorstudy.com>

Stefan Behnel wrote:
> Hi Ian,
>
> just to comment on your actual first post in this thread, which I kinda
> overlooked because of the later discussion.
>
> I think this is pretty cool stuff and I'd love to have this in lxml. The html
> module really seems to be getting somewhere. I think we shouldn't even wait
> too long with a release so that we get some more feedback on the new APIs.
> Maybe I should fix lxml's versioning so that we can put out a 2.0alpha1 (and
> not only alpha, beta, final).

Yeah, I was thinking about writing up a summary of things that need to be
done in the html package; there's still some outstanding stuff, but not
too much. The clean module needs to be cleaned up (I'm thinking of moving
from a function to a class). I'd like to make the usedoctest hack a
little more general, as elsewhere I'm now using a similar hack to enable
ELLIPSIS, and I'd like them not to conflict. And then some docs, but I
guess that's it.

> Ian Bicking wrote:
>> div:contains('celia') -- means a div where the textual content has the
>> word 'celia' in it, case insensitive. At least, I think it's case
>> insensitive -- the CSS spec is annoyingly vague, but implementations
>> seem to work like this. I translate this to:
>>
>>     descendant-or-self::div[contains(css:lower-case(string(.)), 'celia')]
>>
>> I added the lower-case function like:
>>
>>     def _make_lower_case(context, s):
>>         return s.lower()
>>     etree.FunctionNamespace("css")['lower-case'] = _make_lower_case
>
> "css" is not the namespace, it's the prefix. You can do this:
>
>     ns = etree.FunctionNamespace("http://my/css/namespace")
>     ns.prefix = "css"
>     ns['lower-case'] = _make_lower_case

OK, I've switched to this.
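The rule behind "css is the prefix, not the namespace" can be tried with the standard library alone. The sketch below is an illustration under stated assumptions, not lxml code: xml.etree.ElementTree has no extension functions, so the prefix map is applied to element names rather than to XPath functions. The Atom namespace URI and the sample document are chosen only for the example.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: a tiny document in the Atom namespace.
ATOM_NS = "http://www.w3.org/2005/Atom"
doc = ET.fromstring('<feed xmlns="%s"><title>lxml-dev</title></feed>' % ATOM_NS)

# The prefix is only a local alias; the namespace URI is what identifies the name.
by_atom = doc.findall("atom:title", {"atom": ATOM_NS})
by_other = doc.findall("x:title", {"x": ATOM_NS})

# Without a prefix, the path looks for a 'title' in no namespace and finds nothing.
unprefixed = doc.findall("title")

print(by_atom == by_other, len(by_atom), unprefixed)
```

Passing the map with each query, as in the second variant quoted in this thread, keeps the prefix choice local to the caller instead of claiming a global one.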
> or this:
>
>     ns = etree.FunctionNamespace("http://my/css/namespace")
>     ns['lower-case'] = _make_lower_case
>
>     def css_to_xpath(css):
>         xpath = build_xpath(css)
>         return etree.XPath(xpath, {'css' : "http://my/css/namespace"})

Is there any advantage to this, over a more global prefix? I suppose
there's a possible collision of css:, but I doubt that will be a problem.

> You should consider providing a default namespace map here, and maybe even
> return compiled XPath objects, i.e. callables. Note that these provide a
> "path" attribute that returns the original path, so if you have to extend an
> expression later on, you can still do so by creating a new XPath object.

That's handy. I was thinking of creating a CSSXPath subclass or something,
that would keep the original CSS selector around, in addition to the
translated XPath.

> Note that this would also allow you to wrap the function with an additional
> call to set(), so that or-ed results really become the union and not the sum
> of all parts.

If you use | in the XPath expression it seems to work out that there won't
be any duplicates.

>> But XPath gives so few errors that it's hard to tell if it's really
>> working.
>
> Sadly, there doesn't seem to be a simple way to find out that a function was
> undeclared. Or maybe I'll just have to look back into that... didn't I do that
> already? :)

We talked about it previously when I was trying to use match(), and
instead of errors got bizarre results. But I don't think it resulted in
any improvements on error messages.

>> There's also
>> div:nth-child(matcher) and div:nth-of-type(matcher), which selects among
>> siblings with patterns like "2" (second sibling), "3n" (every third
>> element), "odd" (odd elements) and some other selections. I kind of see
>> how to deal with this using position(), but I'm not sure how to do
>> either nth-of-type or nth-child (and the ones I do understand I am also
>> vague about).
>
> If I understand this correctly, this would be
>
>     nth-of-type:  //*/NAME[position() = x]
>     nth-child:    //*/*[position() = x]
>
> To deal with things like "2n", try this:
>
>     //*/NAME[(position() mod 2) = 0]

I think I already have all this working now... though I wish there was a
test case I could use, as I'm not 100% sure that my tests are testing for
the correct results.

>> I've committed the incomplete code in lxml.html.css
>
> I skipped through it a bit and found it really cool. I'm not completely
> satisfied with the naming, but I now see that the context of the css module
> makes it clearer what the semantics are. Still, I prefer "css_to_xpath()", and
> providing a top-level class XPath() makes me think it should return an
> etree.XPath object, i.e. a compiled path.

I was thinking about changing around all the public naming. I'd like for
it to be a method on elements, though I'm not sure what to call the
method. .css(expr) is a bit funny, as it's not "css", it's just a css
selector. .select(expr) doesn't say what kind of selector you are using.

Another public function would be like XPath, something that compiles the
entire CSS expression. Especially since the CSS parsing is non-trivial
(just like the XPath parsing is non-trivial), precompiling will be
beneficial. I'm thinking of also adding a fast path for a couple common
kinds of selectors, that translate them more quickly into XPath. E.g.,
search for r'^\.(\w+)' for class name matches, or '^#(\w+)' for id
matches, etc.

And there's the question about whether simple CSS selectors should be
translated to XPath at all (especially when they aren't precompiled).
For people that are familiar with CSS selectors, it seems entirely
possible that they will use it for very simple queries, like el.css('div').
If I detect that case and turn it into el.findall('div') then it would be
completely reasonable; but if it gets tokenized, parsed, translated to
XPath, compiled, then run, then that's going to be pretty inefficient.

Anyway, back to naming -- if there's a method and a function/object to
compile expressions, that's all the public interface I think it needs.
I don't think translating css to xpath without compiling is particularly
important.

> One more note:
>
>     def run_xpath(doc, xpath):
>         return [el for el in doc.xpath(xpath)
>                 if isinstance(el, etree.ElementBase)]
>
> Do you mean "etree.iselement(el)" here or are you intentionally restricting
> this to real-element subclasses of _Element? (i.e. no plain lxml.etree
> elements, no PIs, no comments)

I wasn't aware of iselement(). I'm not actually sure this is even
necessary; I'm not sure if I can ever match non-elements with the
expressions at all. I think I put it in there at some point when I wasn't
sure. Instead it should probably be an assertion in the tests.

> I actually think this module merits its own top-level placing, not necessarily
> only as part of lxml.html. It could just as well become "lxml.css", and should
> thus not rely too much on a specific API from lxml.html.

Yes, you can do selections on anything. CSS it seems uses | for
namespaces, like "atom|title", and it doesn't know anything special about
HTML (except for special handling of the class attribute). Right now I'm
assuming the XPath picks up the prefixes from elsewhere in the document.
CSS uses "@namespace prefix URI", but that's part of a CSS document, and
we're only handling selectors. So I just translate "atom|title" to
"//atom:title", and assume it'll work.

The CSS syntax does seem handier for a lot of kinds of selections, and
after translating them I find the equivalent XPath rather complex in some
cases (e.g., li:first-child). So there's some benefit there.
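The fast path for trivial selectors mentioned in this message could look roughly like the sketch below. It is a sketch under stated assumptions, not the code committed to lxml.html.css: the name fast_css_to_xpath is hypothetical, the patterns are anchored at both ends (unlike the open-ended regexes quoted above) so that only whole-selector matches take the shortcut, and the class test uses the usual concat/normalize-space idiom for whitespace-separated class lists.

```python
import re

# Whole-selector patterns; anything more complex falls through to the full parser.
_CLASS_RE = re.compile(r'^\.(\w+)$')
_ID_RE = re.compile(r'^#(\w+)$')
_TAG_RE = re.compile(r'^(\w+)$')


def fast_css_to_xpath(selector, prefix='descendant-or-self::'):
    """Return an XPath string for a trivial selector, or None if not trivial."""
    selector = selector.strip()
    match = _CLASS_RE.match(selector)
    if match:
        # Match one word inside a whitespace-separated class attribute.
        return ("%s*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]"
                % (prefix, match.group(1)))
    match = _ID_RE.match(selector)
    if match:
        return "%s*[@id = '%s']" % (prefix, match.group(1))
    match = _TAG_RE.match(selector)
    if match:
        return prefix + match.group(1)
    # Not a trivial selector: the caller should use the full translator.
    return None


print(fast_css_to_xpath('#main'))
print(fast_css_to_xpath('div > p'))
```

Returning None for everything else keeps the shortcut purely an optimisation, so the tokenizer-based translation remains the single source of truth for complex selectors.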

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
            | Write code, do good | http://topp.openplans.org/careers

From ianb at colorstudy.com Tue Jul 3 01:26:06 2007
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 02 Jul 2007 18:26:06 -0500
Subject: [lxml-dev] Some XPath questions...
In-Reply-To: <4686A230.70105@behnel.de>
References: <468538ED.9060004@colorstudy.com>
    <18053.22206.159599.207098@bhuda.mired.org>
    <468579E3.7010802@colorstudy.com>
    <18053.36353.691485.8754@bhuda.mired.org>
    <46859771.8@colorstudy.com> <4686A230.70105@behnel.de>
Message-ID: <4689898E.9080509@colorstudy.com>

Stefan Behnel wrote:
>> So when I use // it works. Huh. I prefer descendant-or-self, because I
>> find it peculiar to do a search from the root when you've called the
>> method on some particular element (that may not be at the root).
>
> There's also ".//*".

That seems to be equivalent to //*, i.e., // goes directly to the root
regardless of context.

>>>>>> div:empty (no children, including text, maybe not including whitespace).
>>>>> Ouch. let me think about that one.
>>>> Yeah, I couldn't figure that one out. I thought this might work:
>>>> >>> xpath('E:empty')
>>>> e[count(./children::*) = 0 and string(.) = '']
>>>> But maybe I don't understand how count() works; this isn't a valid XPath
>>>> expression.
>>> You want "child" not "children". Using normalize-space(.) instead of
>>> string(.) will exclude whitespace. This does assume you are ignoring
>>> comments and PIs; I believe that's the behavior you want.
>> Cool, that seems to work right.
>
> What about "e[not(*) and not(normalize-space())]" ?

Yes, that works too.

>> One query I'm realizing might be really hard (maybe too hard in XPath)
>> is *:first-of-type, *:last-of-type, and *:only-of-type, since they match
>> in a funny sort of way.
>> You can't really do:
>>
>>     *[count(../*[name() = name()]) = 1]
>
> You need two expressions here, one to find the node and one to compare it to
> others (note that name() can also take an argument) - but those are really
> tricky, you're right. They may already touch the borders of what XPath can
> express.

I could probably do it by adding a new function, I suppose;
css:last-of-type() for instance. It's not that hard to do in Python,
after all.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
            | Write code, do good | http://topp.openplans.org/careers

From ianb at colorstudy.com Tue Jul 3 01:45:37 2007
From: ianb at colorstudy.com (Ian Bicking)
Date: Mon, 02 Jul 2007 18:45:37 -0500
Subject: [lxml-dev] Some XPath questions...
In-Reply-To: <18057.35970.547924.621080@bhuda.mired.org>
References: <468538ED.9060004@colorstudy.com>
    <18053.22206.159599.207098@bhuda.mired.org>
    <468579E3.7010802@colorstudy.com>
    <18053.36353.691485.8754@bhuda.mired.org>
    <46859771.8@colorstudy.com> <4686A230.70105@behnel.de>
    <4689898E.9080509@colorstudy.com>
    <18057.35970.547924.621080@bhuda.mired.org>
Message-ID: <46898E21.4030508@colorstudy.com>

Mike Meyer wrote:
> In <4689898E.9080509 at colorstudy.com>, Ian Bicking typed:
>> Stefan Behnel wrote:
>>>> So when I use // it works. Huh. I prefer descendant-or-self, because I
>>>> find it peculiar to do a search from the root when you've called the
>>>> method on some particular element (that may not be at the root).
>>> There's also ".//*".
>> That seems to be equivalent to //*, i.e., // goes directly to the root
>> regardless of context.
>
> Not quite. '//*' always goes to the root. './/*' starts at the current
> node and matches from there down. If you always test at the root of
> the document, they'll look the same.

It seems to be changing the results when I replace 'descendant-or-self::'
with './/'. I want to include the current node if it matches; at least to
me, that seems most logical.
Also necessary when I was doing microformat parsing, as a single element
can have multiple roles. It seems like .// excludes the current node,
only looking at descendants.

>>>>>>>> div:empty (no children, including text, maybe not including whitespace).
>>>>>>> Ouch. let me think about that one.
>>>>>> Yeah, I couldn't figure that one out. I thought this might work:
>>>>>> >>> xpath('E:empty')
>>>>>> e[count(./children::*) = 0 and string(.) = '']
>>>>>> But maybe I don't understand how count() works; this isn't a valid XPath
>>>>>> expression.
>>>>> You want "child" not "children". Using normalize-space(.) instead of
>>>>> string(.) will exclude whitespace. This does assume you are ignoring
>>>>> comments and PIs; I believe that's the behavior you want.
>>>> Cool, that seems to work right.
>>> What about "e[not(*) and not(normalize-space())]" ?
>> Yes, that works too.
>
> That's the 'implicit conversion' I was talking about. You're relying
> on 0 and the empty string being false. It's a standard idiom, and
> pythonic, but I'm not sure you want to use it in automatically
> generated code, since it means you can't generalize the code from "has
> 0 children" to "has n children".

In this case it's a fixed expression used for e:empty, and nothing else,
so it seems fine. And possibly makes the resulting expression a bit
easier to recognize from its CSS roots.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
            | Write code, do good | http://topp.openplans.org/careers

From stefan_ml at behnel.de Tue Jul 3 08:54:03 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 08:54:03 +0200
Subject: [lxml-dev] Some XPath questions...
In-Reply-To: <46898E21.4030508@colorstudy.com>
References: <468538ED.9060004@colorstudy.com>
    <18053.22206.159599.207098@bhuda.mired.org>
    <468579E3.7010802@colorstudy.com>
    <18053.36353.691485.8754@bhuda.mired.org>
    <46859771.8@colorstudy.com> <4686A230.70105@behnel.de>
    <4689898E.9080509@colorstudy.com>
    <18057.35970.547924.621080@bhuda.mired.org>
    <46898E21.4030508@colorstudy.com>
Message-ID: <4689F28B.6080808@behnel.de>

Ian Bicking wrote:
>>>>>>> >>> xpath('E:empty')
>>>>>>> e[count(./children::*) = 0 and string(.) = '']
>>>>>>> But maybe I don't understand how count() works; this isn't a
>>>>>>> valid XPath expression.
>>>>>> You want "child" not "children". Using normalize-space(.) instead of
>>>>>> string(.) will exclude whitespace. This does assume you are ignoring
>>>>>> comments and PIs; I believe that's the behavior you want.
>>>>> Cool, that seems to work right.
>>>> What about "e[not(*) and not(normalize-space())]" ?
>>> Yes, that works too.
>>
>> That's the 'implicit conversion' I was talking about. You're relying
>> on 0 and the empty string being false. It's a standard idiom, and
>> pythonic, but I'm not sure you want to use it in automatically
>> generated code, since it means you can't generalize the code from "has
>> 0 children" to "has n children".
>
> In this case it's a fixed expression used for e:empty, and nothing else,
> so it seems fine. And possibly makes the resulting expression a bit
> easier to recognize from its CSS roots.

It's also likely faster. I don't think libxml2 optimises the comparisons, so
looking for "not(*)" can stop and return false after the first node, while
"count(./child::*) = 0" needs to count all children and then sees that, oh,
the number is bigger than 0.
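The difference between the two predicates can be mimicked in plain Python. This is only an analogy for the argument above, not a measurement of libxml2: the helper names are hypothetical, and the tracked() generator merely records how many "children" each strategy has to look at.

```python
def tracked(items, log):
    """Yield items one by one, recording each item that is actually visited."""
    for item in items:
        log.append(item)
        yield item


def is_empty_by_count(children):
    # Analogue of count(child::*) = 0: walks every child before comparing.
    return sum(1 for _ in children) == 0


_MISSING = object()


def is_empty_by_first(children):
    # Analogue of not(*): the answer is known as soon as one child is seen.
    return next(iter(children), _MISSING) is _MISSING


visited_by_count, visited_by_first = [], []
result_count = is_empty_by_count(tracked(range(1000), visited_by_count))
result_first = is_empty_by_first(tracked(range(1000), visited_by_first))

print(len(visited_by_count), len(visited_by_first))
```

Both strategies agree on the answer; they differ only in how much of the child list they touch, which is why the gap grows with the number of children.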

Stefan

From jholg at gmx.de Tue Jul 3 09:43:03 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 03 Jul 2007 09:43:03 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
Message-ID: <20070703074303.12840@gmx.net>

Hi,

the setup.py script in 1.3.1 seems to try to remove the dependency on
setuptools (which is a very good thing imho!) but fails:

    Traceback (most recent call last):
      File "setup.py", line 7, in ?
        except pkg_resources.VersionConflict, e:
    NameError: name 'pkg_resources' is not defined
    1 lb54320 at adevp02 .../lxml-1.3 $

I must admit I don't fully understand the intention of the relevant code
portion, as it raises ImportError even if pkg_resources import and version
check runs smoothly; maybe this is the intended behaviour?

    try:
        import pkg_resources
        try:
            pkg_resources.require("setuptools>=0.6c5")
        except pkg_resources.VersionConflict, e:
            from ez_setup import use_setuptools
            use_setuptools(version="0.6c5")
        from setuptools import setup
    except ImportError:
        # not setuptools installed
        from distutils.core import setup

(Note: This is untested code, I have not tested with setuptools installed)

Oh, btw I couldn't find a 1.3.1 tag in the repository when trying to check
out 1.3.1.

Holger

From stefan_ml at behnel.de Tue Jul 3 15:16:24 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 15:16:24 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
In-Reply-To: <20070703074303.12840@gmx.net>
References: <20070703074303.12840@gmx.net>
Message-ID: <468A4C28.8040701@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> the setup.py script in 1.3.1 seems to try to remove the dependency on
> setuptools (which is a very good thing imho!) but fails:
>
> Traceback (most recent call last):
>   File "setup.py", line 7, in ?
>     except pkg_resources.VersionConflict, e:
> NameError: name 'pkg_resources' is not defined
> 1 lb54320 at adevp02 .../lxml-1.3 $
>
> I must admit I don't fully understand the intention of the relevant code
> portion, as it raises ImportError even if pkg_resources import and version
> check runs smoothly; maybe this is the intended behaviour?

Ah, great. That was plain debug code. :)

Thanks, I just re-released the sources. Could you check if it works now?

> Oh, btw I couldn't find a 1.3.1 tag in the repository when trying to check
> out 1.3.1.

Luckily, yes. I'll tag it with the fix applied. :)

Stefan

From jholg at gmx.de Tue Jul 3 15:40:18 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 03 Jul 2007 15:40:18 +0200
Subject: [lxml-dev] lxml 1.3.1 setup.py bug
In-Reply-To: <468A4C28.8040701@behnel.de>
References: <20070703074303.12840@gmx.net> <468A4C28.8040701@behnel.de>
Message-ID: <20070703134018.327480@gmx.net>

Hi Stefan,

> Thanks, I just re-released the sources. Could you check if it works now?

Works for me now.

Note: I get

    Building lxml version 1.3.1-44702
    /apps/prod/lib/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'zip_safe'
      warnings.warn(msg)

but simply ignore it because I bet this is just some setuptools-related stuff
and can be safely ignored by plain-old-distutillers like me.

Holger

From dgrimes at navisite.com Tue Jul 3 21:15:57 2007
From: dgrimes at navisite.com (David M. Grimes)
Date: Tue, 03 Jul 2007 15:15:57 -0400
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
Message-ID: <468AA06D.3030502@navisite.com>

I posted a patch for what I believed to be a reference-counting bug in
Attrib.pop() based on the 1.3 release. The patch was accepted, and is
present in 1.3.1. The patch is included at the end of this message.
Looking through the generated C code, I'm no longer sure my patch was
correct - perhaps just masking the underlying problem in 1.3. I'm not
fluent in Pyrex, so not sure if the python.Py_INCREF is really necessary
for something which would be a "borrowed reference" in the C-API
(PyTuple_GET_ITEM result). It looks like the Pyrex "return" is
generating its own INCREF ...

Now, what is intriguing is that the 1.3.1 stock build is crashing again
with the same symptom, and is easily reproducible with the following test
program (this crashed after iteration 956 in i686 with python 2.4.4):

    import lxml.etree as etree

    xml = '''\
    '''

    for i in range(10000):
        print i
        et = etree.fromstring(xml)
        et.attrib.pop('x', None)

This dies at this point:

    ...
    ...
    951
    952
    953
    954
    955
    956
    Fatal Python error: deallocating None
    Aborted

Original 1.3 patch:

diff -urN lxml-1.3~/src/lxml/etree.pyx lxml-1.3/src/lxml/etree.pyx
--- lxml-1.3~/src/lxml/etree.pyx 2007-06-25 02:25:37.000000000 -0400
+++ lxml-1.3/src/lxml/etree.pyx 2007-06-27 15:36:15.000000000 -0400
@@ -1480,10 +1480,12 @@
             if python.PyTuple_GET_SIZE(default) == 0:
                 raise KeyError, key
             else:
-                return python.PyTuple_GET_ITEM(default, 0)
+                result = python.PyTuple_GET_ITEM(default, 0)
+                python.Py_INCREF(result)
         else:
             _delAttribute(self._element, key)
-        return result
+
+        return result
 
     def clear(self):
         cdef xmlNode* c_node

From stefan_ml at behnel.de Tue Jul 3 22:41:17 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 22:41:17 +0200
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
In-Reply-To: <468AA06D.3030502@navisite.com>
References: <468AA06D.3030502@navisite.com>
Message-ID: <468AB46D.3020804@behnel.de>

David M. Grimes wrote:
> I posted a patch for what I believed to be a reference-counting bug in
> Attrib.pop() based on the 1.3 release.
> The patch was accepted, and is
> present in 1.3.1. The patch is included at the end of this message.
> Looking through the generated C code, I'm no longer sure my patch was
> correct - perhaps just masking the underlying problem in 1.3. I'm not
> fluent in Pyrex, so not sure if the python.Py_INCREF is really necessary
> for something which would be a "borrowed reference" in the C-API
> (PyTuple_GET_ITEM result). It looks like the Pyrex "return" is
> generating its own INCREF ...

You can debug this kind of problem with

    print sys.getrefcount(None)

When I run the following on 1.3.1:

    et = etree.fromstring(xml)
    for i in range(10000):
        print i
        print sys.getrefcount(None)
        et.attrib.pop('x', None)
        print sys.getrefcount(None)

instead of your test, it shows me that the problem is not "pop()", as the
ref-count is constant at each iteration. Trying to remove the Py_INCREF() from
your patch makes it crash with a continuously decreasing ref-count.

However, when I run your test:

    for i in range(10000):
        print i
        et = etree.fromstring(xml)
        print sys.getrefcount(None)
        et.attrib.pop('x', None)
        print sys.getrefcount(None)

the ref-count keeps increasing until the garbage collector hits and then drops
below the start value and finally crashes on the second GC run. So the problem
is somewhere else. I'll investigate.

Thanks for the report,
Stefan

From stefan_ml at behnel.de Tue Jul 3 23:24:03 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 23:24:03 +0200
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
In-Reply-To: <468AB46D.3020804@behnel.de>
References: <468AA06D.3030502@navisite.com> <468AB46D.3020804@behnel.de>
Message-ID: <468ABE73.1080408@behnel.de>

Stefan Behnel wrote:
> David M. Grimes wrote:
>> I posted a patch for what I believed to be a reference-counting bug in
>> Attrib.pop() based on the 1.3 release. The patch was accepted, and is
>> present in 1.3.1.
>
> the problem is somewhere else. I'll investigate.
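The sys.getrefcount() check used earlier in this thread can be wrapped in a small helper. The following is a hedged sketch for current CPython, not the code that was run against lxml 1.3.1: refcount_delta is a hypothetical name, and it watches an ordinary sentinel object rather than None, since on CPython 3.12 and later None is immortal and its reported count no longer moves.

```python
import sys


def refcount_delta(obj, func, repeat=1000):
    """Call func() repeatedly and return how much obj's ref-count changed."""
    before = sys.getrefcount(obj)
    for _ in range(repeat):
        func()
    return sys.getrefcount(obj) - before


sentinel = object()
leaked = []

# A call that keeps one extra reference per iteration shows up as a steady drift.
leaky_delta = refcount_delta(sentinel, lambda: leaked.append(sentinel), repeat=100)

# A balanced call leaves the count where it started.
clean_delta = refcount_delta(sentinel, lambda: None, repeat=100)

print(leaky_delta, clean_delta)
```

A delta that grows or shrinks with the repeat count points at the call under test; a flat delta, as reported for pop() above, suggests the imbalance lies elsewhere.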

Ok, the problem was actually in the new deallocation code.

For those who want to know: The GC calls tp_clear() before tp_dealloc(), so
that Pyrex has already set the _Document reference of the _Element to None
when __dealloc__ is called on the _Element and tries to Py_DECREF the doc
reference => deallocating None.

I worked around this by adding a redundant PyObject* to _Element that
references the document. Pyrex does not set it to None so that we can keep a
pointer in there when the Python reference is already None-ed and DECREF it
ourselves.

Obviously, that's a hack, but it works, so I'll leave it in and release a
1.3.2 with it...

Stefan

From dgrimes at navisite.com Tue Jul 3 23:27:46 2007
From: dgrimes at navisite.com (Grimes, David)
Date: Tue, 3 Jul 2007 17:27:46 -0400
Subject: [lxml-dev] Ref-counting bug returns in 1.3.1
References: <468AA06D.3030502@navisite.com> <468AB46D.3020804@behnel.de>
Message-ID:

If there's anything else I can do to help test/diagnose, let me know ...

--Dave

From stefan_ml at behnel.de Tue Jul 3 23:43:22 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 03 Jul 2007 23:43:22 +0200
Subject: [lxml-dev] lxml 1.3.2 released to cheeseshop
Message-ID: <468AC2FA.9020403@behnel.de>

Hi all,

due to a severe crash bug in 1.3.1, I released 1.3.2 today. The only change is
the bug fix. So please don't use or package 1.3.1, use 1.3.2 instead.
Have fun,
Stefan


ChangeLog:

1.3.2 (2007-07-03)
==================

Bugs fixed
----------

* "deallocating None" crash bug


1.3.1 (2007-07-02)
==================

Features added
--------------

* objectify.DataElement now supports setting values from existing data
  elements (not just plain Python types) and reuses defined namespaces etc.

* E-factory support for lxml.objectify (``objectify.E``)

Bugs fixed
----------

* Better way to prevent crashes in Element proxy cleanup code

* objectify.DataElement didn't set up None value correctly

* objectify.DataElement didn't check the value against the provided type hints

* Reference-counting bug in ``Element.attrib.pop()``

From jholg at gmx.de Wed Jul 4 11:28:48 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 04 Jul 2007 11:28:48 +0200
Subject: [lxml-dev] objectify E-factory does not handle unicode text
Message-ID: <20070704092848.254480@gmx.net>

Hi,

playing around with the new E-factory I found that it does not handle unicode
the way the rest of the API does:

>>> STR = objectify.E.str
>>> STR(unicode("???", 'latin-1'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in <lambda>
    return lambda *args, **kwargs: func(tag, *args, **kwargs)
  File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 177, in __call__
    v = t(elem, item)
  File "objectify.pyx", line 1661, in objectify.__add_text
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>>

This is easily fixed by changing __add_text to

def __add_text(_Element elem not None, text):
    cdef tree.xmlNode* c_child
    if not python._isString(text):
        if isinstance(text, bool):
            text = str(text).lower()
        else:
            text = str(text)
    c_child = cetree.findChildBackwards(elem._c_node, 0)
    [...]

>>> STR = objectify.E.str
>>> STR(unicode("???", 'latin-1'))

Patches for trunk / 1.3 branch appended.
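For illustration only, the conversion rule in the changed __add_text can be sketched in plain Python. This is not the Pyrex code from the patch: `to_xml_text` is a made-up name, and the sketch assumes current Python, where `str` is already a Unicode type, so the `python._isString()` check becomes a plain `isinstance` test.

```python
def to_xml_text(value):
    """Hypothetical helper mirroring the rule in the patched __add_text."""
    # Strings are passed through untouched, so non-ASCII text is never
    # forced through an ASCII-only str() conversion.
    if isinstance(value, str):
        return value
    # Booleans are written in lowercase ("true"/"false").
    if isinstance(value, bool):
        return str(value).lower()
    # Everything else falls back to its str() representation.
    return str(value)


print(to_xml_text(True))
print(to_xml_text(5))
print(to_xml_text("\u00e4\u00f6\u00fc"))
```

The point of the ordering is that only non-string values are ever converted, which is what avoids the UnicodeEncodeError shown in the traceback.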

Another issue with E-factory is that it currently does not have support for
the custom objectify classes that you can add with the PyType mechanisms: E.g.
I'm using datetime and decimal additions, which leads to

>>> import decimal
>>> DEC = objectify.E.decimal
>>> DEC(decimal.Decimal(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in <lambda>
    return lambda *args, **kwargs: func(tag, *args, **kwargs)
  File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__
    raise TypeError("bad argument type: %r" % item)
TypeError: bad argument type: Decimal("0")
>>>

So I'd have to add decimal.Decimal into objectify.E._typemap. The nicest way
to handle this would be PyType.register() doing it for me, but PyType uses
type names rather than type objects for its purposes. Maybe the easiest thing
is to instrument ElementMaker with its own register/unregister() methods and
document it well?

Holger

-------------- next part --------------
A non-text attachment was scrubbed...
Name: trunk_efactory_unicode.patch
Type: application/octet-stream
Size: 671 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/65923b81/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: branch13_efactory_unicode.patch Type: application/octet-stream Size: 671 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/65923b81/attachment-0001.obj From stefan_ml at behnel.de Wed Jul 4 12:07:01 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 12:07:01 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704092848.254480@gmx.net> References: <20070704092848.254480@gmx.net> Message-ID: <468B7145.4050706@behnel.de> jholg at gmx.de wrote: > playing around with the new E-factory I found that it does not handle > unicode the way the rest of the API does: > >>>> STR = objectify.E.str >>>> STR(unicode("???", 'latin-1')) > Traceback (most recent call last): > File "", line 1, in ? > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in > return lambda *args, **kwargs: func(tag, *args, **kwargs) > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 177, in __call__ > v = t(elem, item) > File "objectify.pyx", line 1661, in objectify.__add_text > UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128) > > This is easily fixed by changing __add_text to > > def __add_text(_Element elem not None, text): > cdef tree.xmlNode* c_child > if not python._isString(text): > if isinstance(text, bool): > text = str(text).lower() > else: > text = str(text) > c_child = cetree.findChildBackwards(elem._c_node, 0) > [...] > >>>> STR = objectify.E.str >>>> STR(unicode("???", 'latin-1')) > > Patches for trunk / 1.3 branch appended. Thanks, that fixes it. Maybe we should even split __add_text up into a function for strings and a function that handles other stuff. > Another issue with E-factory is that it currently does not have support for the custom objectify classes that you can add with the PyType mechanisms: E.g. 
I'm using datetime and decimal additions, which leads to > >>>> import decimal >>>> DEC = objectify.E.decimal >>>> DEC(decimal.Decimal(0)) > Traceback (most recent call last): > File "", line 1, in ? > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in > return lambda *args, **kwargs: func(tag, *args, **kwargs) > File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__ > raise TypeError("bad argument type: %r" % item) > TypeError: bad argument type: Decimal("0") > > So I'd have to add decimal.decimal into objectify.E._typemap. The nicest way to handle this would be PyType.register() doing it for me, but > PyType uses type names rather than type objects for its purposes. Maybe the easiest thing is to instrument ElementMaker with its own register/unregister() methods and well-document it? No, one registry should be enough. Even with names, you can always check globals() in the PyType registry. Maybe we should even feed the typemap in ElementMaker.__init__ from the PyType registry (and just update objectify.E when the registry is changed). Could you look into that? Stefan From stefan_ml at behnel.de Wed Jul 4 13:50:49 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 13:50:49 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468B7145.4050706@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> Message-ID: <468B8999.6080701@behnel.de> Stefan Behnel wrote: >> Another issue with E-factory is that it currently does not have support for >> the custom objectify classes that you can add with the PyType mechanisms: E.g. >> I'm using datetime and decimal additions, which leads to >> >>>>> import decimal >>>>> DEC = objectify.E.decimal >>>>> DEC(decimal.Decimal(0)) >> Traceback (most recent call last): >> File "", line 1, in ? 
>> File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 43, in >> return lambda *args, **kwargs: func(tag, *args, **kwargs) >> File "/data/pydev/hjoukl/LXML/lxml-1.3/build/lib.solaris-2.8-sun4u-2.4/lxml/builder.py", line 175, in __call__ >> raise TypeError("bad argument type: %r" % item) >> TypeError: bad argument type: Decimal("0") >> >> So I'd have to add decimal.decimal into objectify.E._typemap. The nicest way to handle this would be PyType.register() doing it for me, but >> PyType uses type names rather than type objects for its purposes. Maybe the easiest thing is to instrument ElementMaker with its own register/unregister() methods and well-document it? > > No, one registry should be enough. Even with names, you can always check > globals() in the PyType registry. Maybe we should even feed the typemap in > ElementMaker.__init__ from the PyType registry (and just update objectify.E > when the registry is changed). Ah, I guess the problem here is that your external types are not in the module's globals(). Maybe we could extend the data element classes with a non-public function that converts a value to a string. Would that fit here? Note that the Element proxy is already created when the text value is updated, that's like your _setText() case. Stefan From jholg at gmx.de Wed Jul 4 15:40:10 2007 From: jholg at gmx.de (jholg at gmx.de) Date: Wed, 04 Jul 2007 15:40:10 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468B8999.6080701@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> Message-ID: <20070704134010.72530@gmx.net> Hi Stefan, > Stefan Behnel wrote: > >> So I'd have to add decimal.decimal into objectify.E._typemap. The > nicest way to handle this would be PyType.register() doing it for me, but > >> PyType uses type names rather than type objects for its purposes. 
Maybe > the easiest thing is to instrument ElementMaker with its own > register/unregister() methods and well-document it? > > > > No, one registry should be enough. Even with names, you can always check > > globals() in the PyType registry. Maybe we should even feed the typemap > in > > ElementMaker.__init__ from the PyType registry (and just update > objectify.E > > when the registry is changed). > > Ah, I guess the problem here is that your external types are not in the > module's globals(). Maybe we could extend the data element classes with a > non-public function that converts a value to a string. Would that fit > here? > Note that the Element proxy is already created when the text value is > updated, > that's like your _setText() case. What one actually does for registration is datetimeType = PyType("datetime", parseDatetime, DatetimeElement) datetimeType.register() just like objectify does for the standard builtin types. I think that PyType.register()/unregister() should update E._typemap; the problem here is that register() does not really know about the Python type, just a name, a check function and the ObjectifiedDataElement class; this is also nice because it is so versatile. What about simply adding an optional argument python_type where one can supply the actual python type/class the custom element class does mimic? Holger -- Psssst! Schon vom neuen GMX MultiMessenger geh?rt? 
Der kanns mit allen: http://www.gmx.net/de/go/multimessenger From stefan_ml at behnel.de Wed Jul 4 15:59:17 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 15:59:17 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704134010.72530@gmx.net> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> Message-ID: <468BA7B5.7070103@behnel.de> jholg at gmx.de wrote: >> Stefan Behnel wrote: >>>> So I'd have to add decimal.decimal into objectify.E._typemap. The >> nicest way to handle this would be PyType.register() doing it for me, but >>>> PyType uses type names rather than type objects for its purposes. Maybe >> the easiest thing is to instrument ElementMaker with its own >> register/unregister() methods and well-document it? >>> No, one registry should be enough. Even with names, you can always check >>> globals() in the PyType registry. Maybe we should even feed the typemap >> in >>> ElementMaker.__init__ from the PyType registry (and just update >> objectify.E >>> when the registry is changed). >> Ah, I guess the problem here is that your external types are not in the >> module's globals(). Maybe we could extend the data element classes with a >> non-public function that converts a value to a string. Would that fit >> here? >> Note that the Element proxy is already created when the text value is >> updated, >> that's like your _setText() case. > > What one actually does for registration is > > datetimeType = PyType("datetime", parseDatetime, DatetimeElement) > datetimeType.register() > > just like objectify does for the standard builtin types. I think that > PyType.register()/unregister() should update E._typemap; the problem > here is that register() does not really know about the Python type, just > a name, a check function and the ObjectifiedDataElement class; this is also nice because it is so versatile. 
> What about simply adding an optional argument python_type where one can supply the actual python type/class the custom element class does mimic? Well, all you'd really need is a conversion to a string, so, given such a type would Do The Right Thing for __str__, that would work. But then, if str() did the right thing, we could just as well use the existing behaviour of the E factory and just extend typemap to also check for the type /name/, not only the type itself. Maybe that's the way to go? Stefan From rogerpatterson at gmail.com Wed Jul 4 22:22:45 2007 From: rogerpatterson at gmail.com (Roger Patterson) Date: Wed, 04 Jul 2007 13:22:45 -0700 Subject: [lxml-dev] xslt exceptions Message-ID: <468C0195.2010605@gmail.com> Hello, I am getting errors from within an XPath call within a custom extension being called while doing an XSLT transform. I am able to access the global error log as well as the error_log on the transform object, but the error information is sketchy at best. Unfortunately no exception is being thrown. Is this normal? Is there a way of turning on exception throwing? -Roger From stefan_ml at behnel.de Wed Jul 4 23:14:38 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 23:14:38 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <20070704142538.8170@gmx.net> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> Message-ID: <468C0DBE.4020101@behnel.de> Hi Holger, jholg at gmx.de wrote: >>> What about simply adding an optional argument python_type where one can >> supply the actual python type/class the custom element class does mimic? >> >> Well, all you'd really need is a conversion to a string, so, given such a >> type >> would Do The Right Thing for __str__, that would work. 
But then, if str() >> did >> the right thing, we could just as well use the existing behaviour of the E >> factory and just extend typemap to also check for the type /name/, not >> only >> the type itself. Maybe that's the way to go? > > You mean by using a customized typemap that uses additional(typename, > ) entries and a get() method that also tries to lookup by > typename? The convention then being that the typenames one uses in the > PyType registry must correspond to the actual python type name he models, > if he wants to make use of objectify.E for custom DataElements. Sounds > reasonable. Here is a patch that (I think) might be a way to solve this problem. The idea is to use a custom class instead of the typemap dictionary and have it fall back to the PyType registry. If your type does not support a simple str() conversion, you can pass a conversion function as an additional argument to PyType() when registering your type. It's currently untested, so please play with it if you think it's the right approach. Stefan -------------- next part -------------- A non-text attachment was scrubbed... Name: objectify-data-type-support-in-elementmaker-class.patch Type: text/x-diff Size: 7015 bytes Desc: not available Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070704/60b1ba59/attachment-0001.bin From stefan_ml at behnel.de Wed Jul 4 23:42:16 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 04 Jul 2007 23:42:16 +0200 Subject: [lxml-dev] xslt exceptions In-Reply-To: <468C0195.2010605@gmail.com> References: <468C0195.2010605@gmail.com> Message-ID: <468C1438.6020804@behnel.de> Hi, Roger Patterson wrote: > I am getting errors from within an XPath call within a custom extension > being called while doing an XSLT transform. > I am able to access the global error log as well as the error_log on the > transform object, but the error information is sketchy at best. > Unfortunately no exception is being thrown. Is this normal? 
Is there a
> way of turning on exception throwing?

Hmm, if I understand that right, what you do is: from an XSLT, you call into a
Python extension function and from that you call an XPath expression. This
expression fails and you want to do what?

* stop XSLT execution and propagate the error?

  Then throw an exception yourself.

* know why it failed?

  The XPath code has had a major refactoring on the current SVN trunk that
  will become lxml 2.0. The XPath class now has its own error log and may even
  throw a meaningful exception for you. I encourage you to check out the trunk
  and try it out. Any comments are appreciated.

http://codespeak.net/svn/lxml/trunk/

Stefan

From jholg at gmx.de Thu Jul 5 15:42:44 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Thu, 05 Jul 2007 15:42:44 +0200
Subject: [lxml-dev] objectify E-factory does not handle unicode text
In-Reply-To: <468C0DBE.4020101@behnel.de>
References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de>
Message-ID: <20070705134244.145570@gmx.net>

Hi Stefan,

I tried it out (latest trunk instead of this patch)

> It's currently untested, so please play with it if you think it's the right
> approach.

With this small fix

0 lb54320 at adevp02 .../lxml $ svn diff
Index: src/lxml/objectify.pyx
===================================================================
--- src/lxml/objectify.pyx (revision 44735)
+++ src/lxml/objectify.pyx (working copy)
@@ -1054,7 +1054,7 @@
     result = python.PyDict_GetItem(_PYTYPE_DICT, name)
     if result is NULL:
         return None
-    return (result)._stringify
+    return (result)._stringify
     return result

 def __contains__(self, type):
0 lb54320 at adevp02 .../lxml $

this works for me.
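
As a rough picture of the lookup this fix touches, here is a plain-Python sketch of a typemap that falls back to a registry keyed by type name. It is an illustration under assumptions, not the code in objectify.pyx: `RegisteredType` and `FallbackTypemap` are hypothetical names standing in for PyType and _ObjectifyTypemap, and the registry is a plain dict.

```python
import decimal


class RegisteredType:
    """Hypothetical registry entry: a type name plus a to-string converter."""

    def __init__(self, name, stringify=str):
        self.name = name
        # str() is the default; a custom converter can be supplied for
        # types whose str() output is not suitable as XML text.
        self.stringify = stringify


class FallbackTypemap:
    """Hypothetical typemap: direct type lookup first, then lookup by name."""

    def __init__(self, direct, registry):
        self._direct = direct      # maps type objects to converters
        self._registry = registry  # maps type names to RegisteredType entries

    def get(self, type_, default=None):
        converter = self._direct.get(type_)
        if converter is not None:
            return converter
        # Fall back to the registry, keyed by the Python type's own name.
        entry = self._registry.get(type_.__name__)
        if entry is None:
            return default
        return entry.stringify


registry = {"Decimal": RegisteredType("Decimal")}
typemap = FallbackTypemap({int: str}, registry)

convert = typemap.get(decimal.Decimal)
text = convert(decimal.Decimal("1.5"))
print(text)
```

Note that the fallback only succeeds when the registered name equals `type.__name__`, so an entry registered as "decimal" would not be found for decimal.Decimal.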

The E-factory has some strangeness to it regarding objectify:

>>> msg.x = objectify.E.INT(5,3,2)
>>> print objectify.dump(msg)
msg = None [ObjectifiedElement]
    x = 532 [IntElement]
>>>

but I don't think it makes sense to investigate.

This would now mean renaming my "decimal" type to "Decimal", so that it
matches decimal.Decimal.__name__, which enables the lookup in
_ObjectifyTypemap.get().

Although that might break some code here, I think it might be a good thing to
use the names of the actual Python types (by convention). One of the biggest
issues for my users here is that, the way objectify works, they might
sometimes assign a string "2323" into a tree, which then gets interpreted as
an IntElement. Although this is insignificant for most practical purposes, and
they can always use DataElement() to type-fix anything anytime, using such a
naming convention would enable me to make type-fixing even easier:

>>> def PYTYPE(value):
...     if isinstance(value, ObjectifiedElement):
...         return deepcopy(value)
...     else:
...         return DataElement(value, type(value).__name__)
...

(Note: I don't like the name of that function yet...)

What about None? It is currently called "none" in the type registry, whereas
type(None).__name__ == "NoneType". I'd prefer to special-case it here (and
maybe in _ObjectifyTypemap), as "none" is so nice and short.

Holger

From stefan_ml at behnel.de Thu Jul 5 22:04:23 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 05 Jul 2007 22:04:23 +0200
Subject: [lxml-dev] objectify E-factory does not handle unicode text
In-Reply-To: <20070705134244.145570@gmx.net>
References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de> <20070705134244.145570@gmx.net>
Message-ID: <468D4EC7.1070705@behnel.de>

Hi Holger,

jholg at gmx.de wrote:
> With this small fix
> [...]
> this works for me.

Argh. :) Thanks.

> The E-factory has some strangeness to it regarding objectify:
>>>> msg.x = objectify.E.INT(5,3,2)
>>>> print objectify.dump(msg)
> msg = None [ObjectifiedElement]
> x = 532 [IntElement]
> but I don't think it makes sense to investigate.

Hmmm, looks funny but is normal. This is how the factory normally works:

http://codespeak.net/lxml/dev/tutorial.html#the-e-factory

It takes a content list as argument, so that's expected behaviour. ;)

> This would now mean to rename my "decimal" type to "Decimal", so that it
> matches decimal.Decimal.__name__, which enables the lookup in
> _ObjectifyTypemap.get().

If that's all it takes, that's perfect.

> Although that might break some code here I think it might be a good thing
> to use the names of the actual python types (by convention). One of the biggest
> issues for my users here is that the way objectify works, they might sometimes
> assign a string "2323" into a tree, which gets then interpreted as an IntElement.
> Although this is insignificant for most practical issues, and they can
> always use DataElement() to type-fix anything anytime, using such a naming
> convention would enable me to make type-fixing even easier:
>
>>>> def PYTYPE(value):
> ... if isinstance(value, ObjectifiedElement):
> ...
return deepcopy(value) > ... else: > ... return DataElement(value, type(value).__name__) > ... > (Note: I don't like the name of that function yet...) Good idea, we should add something like that to objectify - and definitely to the E-factory in objectify. > What about None? It is currently called "none" in the type registry, > whereas type(None).__name__ == "NoneType". I'd prefer to special-case it > here (and maybe in _ObjectifyTypemap), as "none" is so nice and short. Sure, go ahead. Stefan From stefan_ml at behnel.de Thu Jul 5 22:08:40 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Thu, 05 Jul 2007 22:08:40 +0200 Subject: [lxml-dev] objectify E-factory does not handle unicode text In-Reply-To: <468D4EC7.1070705@behnel.de> References: <20070704092848.254480@gmx.net> <468B7145.4050706@behnel.de> <468B8999.6080701@behnel.de> <20070704134010.72530@gmx.net> <468BA7B5.7070103@behnel.de> <20070704142538.8170@gmx.net> <468C0DBE.4020101@behnel.de> <20070705134244.145570@gmx.net> <468D4EC7.1070705@behnel.de> Message-ID: <468D4FC8.1090609@behnel.de> Stefan Behnel wrote: > jholg at gmx.de wrote: >> The E-factory has some strangeness to it regarding objectify: >>>>> msg.x = objectify.E.INT(5,3,2) >>>>> print objectify.dump(msg) >> msg = None [ObjectifiedElement] >> x = 532 [IntElement] >> but I don't think it makes sense to investigate. > > Hmmm, looks funny but is normal. This is how the factory works normaly: > > http://codespeak.net/lxml/dev/tutorial.html#the-e-factory > > It takes a content list as argument, so that's expected behaviour. ;) >> >> >>> def PYTYPE(value): >> ... if isinstance(value, ObjectifiedElement): >> ... return deepcopy(value) >> ... else: >> ... return DataElement(value, type(value).__name__) >> ... >> (Note: I don't like the name of that function yet...) > > Good idea, we should add something like that to objectify - and definitely to > the E-factory in objectify. ... which then should also fix the above problem, BTW. 
Stefan

From stefan_ml at behnel.de Fri Jul 6 10:10:02 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 06 Jul 2007 10:10:02 +0200
Subject: [lxml-dev] naming the lxml.html parse functions
In-Reply-To: <468D88F8.3010009@colorstudy.com>
References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com>
Message-ID: <468DF8DA.7030306@behnel.de>

Ian Bicking wrote:
> I'm still not sure what to call all the parsing functions for HTML.

Hmm, there isn't really something comparable in lxml's API so far, so we can't
just copy names here.

"parse_string()" would match their intention, so that would make it
"parse_string_element()" and "parse_string_elements()". Maybe that's too long
for an every-day-use function, but at least the names are clear. I don't even
think length matters here as parse functions may be used in every program, but
likely only once or a couple of times in a few selected places, so clarity
outweighs typing here IMHO.

"strparse()" would be shorter but might suggest that they only parse plain
strings, not unicode (although unicode parsing is somewhat 'advanced use'
anyway).

On the other hand, I'm wondering why they parse strings in the first place.
Wouldn't parsing from a file make more sense? There's always StringIO if you
need it (which is efficiently special cased in lxml). Note that libxml2 can
even parse from http and ftp URLs directly, so you would even lose something
(if only performance) if you required people to load a document into memory
first and then pass it to the parser as a string. You'd also lose base URL
information, BTW.

So, my preferred solution would be to keep the names and make them functions
that parse from a filename or file-like object, just like etree.parse() works.
Admittedly, that's a bit tricky as you can't check what the file starts with
to decide how to parse it without opening it first...

> Also
> I'd like some method on at least HTML elements for doing CSS selections,
> but I'm not sure what to call it.
Any ideas? Well, the xpath() method is named after the language, so why not just call the method "cssselect()" ? That makes it clear where the implementation comes from and matches the existing API. Stefan From ianb at colorstudy.com Fri Jul 6 19:01:31 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Fri, 06 Jul 2007 12:01:31 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <468DF8DA.7030306@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> Message-ID: <468E756B.3060202@colorstudy.com> Stefan Behnel wrote: > Ian Bicking wrote: >> I'm still not sure what to call all the parsing functions for HTML. > > Hmm, there isn't really something comparable in lxml's API so far, so we can't > just copy names here. > > "parse_string()" would match their intention, so that would make it > "parse_string_element()" and "parse_string_elements()". Maybe that's too long > for an every-day-use function, but at least the names are clear. I don't even > think length matters here as parse functions may be used in every program, but > likely only once or a couple of times in a few selected places, so clarity > outweighs typing here IMHO. > > "strparse()" would be shorter but might suggest that they only parse plain > strings, not unicode (although unicode parsing is somewhat 'advanced use' anyway). For the different varieties, I wonder if they should just be attributes on the parser? E.g., HTML() (full doc), HTML.element(), HTML.elements(). Similarly, parse(fn) (full doc), parse.element(fn), parse.elements(fn). Then we just have HTML and parse. One nice thing about this is that you don't have to fiddle with imports when you change your mind about what you are parsing. > On the other hand, I'm wondering why they parse strings in the first place. > Wouldn't parsing from a file make more sense? There's always StringIO if you > need it (which is efficiently special cased in lxml). 
Note that libxml2 can > even parse from http and ftp URLs directly, so you would even loose something > (if only performance) if you required people to load a document into memory > first and then pass it to the parser as a string. You'd also loose base URL > information, BTW. Where is base URL information kept? This should be an optional argument for all the parsing functions that don't use a URL. > So, my preferred solution would be to keep the names and make them functions > that parse from a filename or file-like object, just like etree.parse() works. > Admittedly, that's a bit tricky as you can't check what the file starts with > to decide how to parse it without opening it first... If I did that, I'd just have to write the string-based versions over and over, as that's what I use (and pretty much have to use) in all the tests. I suppose outside of tests it's not that useful, but tests are of course important. Plus lxml.XML, HTML, etc., already work on strings, so there should be equivalent parsers. >> Also >> I'd like some method on at least HTML elements for doing CSS selections, >> but I'm not sure what to call it. Any ideas? > > Well, the xpath() method is named after the language, so why not just call the > method "cssselect()" ? That makes it clear where the implementation comes from > and matches the existing API. Sure. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From doug at isotoma.com Sat Jul 7 14:45:35 2007 From: doug at isotoma.com (Doug Winter) Date: Sat, 07 Jul 2007 13:45:35 +0100 Subject: [lxml-dev] xpath on newly created elements Message-ID: <468F8AEF.8060405@isotoma.com> I can't make xpath work on elements that have been created using etree.Element when they have a namespace that doesn't use Clark notation. 
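
As background on the term: Clark notation writes a qualified tag name as the namespace URI in braces followed by the local name. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only so that it runs without extra packages; lxml.etree uses the same Clark notation for tag names). It shows that a tag containing a colon is stored literally and is not placed in any namespace.

```python
import xml.etree.ElementTree as ET

NS = "http://test.com"

# Clark notation: "{namespace-uri}localname".
clark = ET.Element("{%s}foo" % NS)

# A colon is not interpreted as prefix:local here; the tag is kept as-is.
literal = ET.Element("test:foo")

root = ET.Element("root")
root.append(clark)
root.append(literal)

# A namespace-aware search resolves the prefix "test" to the URI, so it
# looks for "{http://test.com}foo" and only finds the Clark-notation element.
matches = root.findall("test:foo", {"test": NS})

print(clark.tag)
print(literal.tag)
print(len(matches))
```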

I have a test case:

-- begins --
from lxml import etree

print "lxml.etree: ", etree.LXML_VERSION
print "libxml used: ", etree.LIBXML_VERSION
print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
print "libxslt used: ", etree.LIBXSLT_VERSION
print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION

nsmap=dict(test="http://test.com")

e = []
e.append(etree.fromstring(''))
e.append(etree.Element("test:foo", nsmap=nsmap))
e.append(etree.Element("test:foo", {'xmlns:test': nsmap['test']}))
e.append(etree.Element("{%(test)s}foo" % nsmap))
e.append(etree.Element("{%(test)s}foo" % nsmap, nsmap=nsmap))

for i, elem in enumerate(e):
    print i, elem.xpath("/test:foo", nsmap)
-- ends --

I get this output if I run the above:

lxml.etree: (1, 3, 2, 0)
libxml used: (2, 6, 27)
libxml compiled: (2, 6, 27)
libxslt used: (1, 1, 20)
libxslt compiled: (1, 1, 20)
0 []
1 []
2 []
3 []
4 []

I would expect all 5 cases to match the root element, but cases 1 and 2 do
not.

It appears to be only for elements created using namespace prefixes - and yet
these work perfectly well in all other respects.

Is this a bug, or should elements not be created this way?

Cheers,

Doug.

--
Isotoma, Open Source Software Consulting - http://www.isotoma.com
Tel: 01904 567349, Mobile: 07879 423002, Fax: 020 79006980
Postal Address: Tower House, Fishergate, York, YO10 4UA, UK
Registered in England. Company No 5171172. VAT GB843570325.
Registered Office: 19a Goodge Street, London, W1T 2PH From stefan_ml at behnel.de Sun Jul 8 08:52:00 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Jul 2007 08:52:00 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <468E756B.3060202@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> Message-ID: <46908990.3020707@behnel.de> Hi Ian, Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> I'm still not sure what to call all the parsing functions for HTML. >> > For the different varieties, I wonder if they should just be attributes > on the parser? E.g., HTML() (full doc), HTML.element(), > HTML.elements(). Similarly, parse(fn) (full doc), parse.element(fn), > parse.elements(fn). Then we just have HTML and parse. Funny idea. But that reminds me that HTML is a factory function, so what about calling the string parser functions "HTML()", "HTMLFragment()" and "HTMLFragments()"? The "parse()" function could then get equivalent functions "parse_fragment()" and "parse_fragments()" - although I rate them less important. If you're dealing with fragments, they'd most likely not come from the file. And if you really need to parse some fragments from a file for a rare use case, you can still read the file first and then pass it into "HTMLFragments()". > Where is base URL information kept? This should be an optional argument > for all the parsing functions that don't use a URL. libxml2 stores it in the xmlDoc, but we can overwrite it if we need to. 
We should make that a general option in the string parse functions of etree: etree.HTML("", base_url="http://codespeak.net/lxml") etree.XML("", base_url="http://codespeak.net/lxml") Stefan From stefan_ml at behnel.de Sun Jul 8 09:11:26 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Jul 2007 09:11:26 +0200 Subject: [lxml-dev] xpath on newly created elements In-Reply-To: <468F8AEF.8060405@isotoma.com> References: <468F8AEF.8060405@isotoma.com> Message-ID: <46908E1E.8060107@behnel.de> Hi, Doug Winter wrote: > I can't make xpath work on elements that have been created using > etree.Element when they have a namespace that doesn't use Clark notation. Please distinguish between namespaces and prefixes. Prefixes are not namespaces. Their only use is to reduce the redundancy in XML documents. They have no meaning by themselves. > e.append(etree.fromstring('')) Ok. > e.append(etree.Element("test:foo", nsmap=nsmap)) > e.append(etree.Element("test:foo", {'xmlns:test': nsmap['test']})) These actually create elements with the tag name "test:foo", which is different from "foo" and also different from "{http://test.com}foo" (which is the only one that declares a namespace). Tag names with colons are not special-cased. > e.append(etree.Element("{%(test)s}foo" % nsmap)) > e.append(etree.Element("{%(test)s}foo" % nsmap, nsmap=nsmap)) These are ok, too. > >>> for i, elem in enumerate(e): > ... print i, elem.xpath("/test:foo", nsmap) > 0 [] > 1 [] > 2 [] > 3 [] > 4 [] As expected, as you're looking for "{http://test.com}foo", not for "test:foo". > It appears to be only for elements created using namespace prefixes - > and yet these work perfectly well in all other respects. > > Is this a bug, or should elements not be created this way? Well, you can currently create them this way, but it doesn't give you what you want. Maybe we should catch the case where ':' is contained in a tag name and raise an exception instead (it won't give you well-formed XML anyway).
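To illustrate the prefix-versus-namespace distinction made in this reply, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that its Clark-notation handling matches the ElementTree API that lxml.etree follows; the wrapping "root" element and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://test.com"
nsmap = {"test": NS}

# Hypothetical parent element, only there so we can search its children.
root = ET.Element("root")

# Clark notation: this tag really lives in the http://test.com namespace.
clark = ET.SubElement(root, "{%s}foo" % NS)

# A colon in the tag is just part of the name; no namespace is attached.
literal = ET.SubElement(root, "test:foo")

# The prefix in the path is resolved through nsmap to {http://test.com}foo,
# so only the Clark-notation element is found.
matches = root.findall("test:foo", nsmap)
print([el.tag for el in matches])
print(literal.tag)
```

The lookup never sees the string "test:foo" as a tag name; it only sees the namespace URI the prefix maps to, which is why the element created with a literal colon is skipped.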
That way, it would be clear that this can't work. When you create elements with namespaces, use the Clark notation. Stefan From doug at isotoma.com Sun Jul 8 10:26:34 2007 From: doug at isotoma.com (Doug Winter) Date: Sun, 08 Jul 2007 09:26:34 +0100 Subject: [lxml-dev] xpath on newly created elements In-Reply-To: <46908E1E.8060107@behnel.de> References: <468F8AEF.8060405@isotoma.com> <46908E1E.8060107@behnel.de> Message-ID: <46909FBA.1030607@isotoma.com> Stefan Behnel wrote: > Well, you can currently create them this way, but it doesn't give you what you > want. Maybe we should catch the case where ':' is contained in a tag name and > raise an exception instead (it won't give you well-formed XML anyway). That > way, it would be clear that this can't work. > > When you create elements with namespaces, use the Clark notation. Thanks for the clarification. What threw me is that serialising a tag with a colon in it works fine, and this all feels quite natural: >>> nsmap = {'test': 'http://test.com'} >>> e = etree.Element('test:foo', nsmap=nsmap) >>> e2 = etree.fromstring(etree.tostring(e)) >>> e.xpath("/test:foo", nsmap) [] >>> e2.xpath("/test:foo", nsmap) [] I'd expect the round trip to produce something identical, which I guess is where I got confused. I think it may be worth raising an exception on tags with colons, since it is a bit surprising. Cheers, Doug. -- Isotoma, Open Source Software Consulting - http://www.isotoma.com Tel: 01904 567349, Mobile: 07879 423002, Fax: 020 79006980 Postal Address: Tower House, Fishergate, York, YO10 4UA, UK Registered in England. Company No 5171172. VAT GB843570325. 
Registered Office: 19a Goodge Street, London, W1T 2PH From stefan_ml at behnel.de Sun Jul 8 11:15:07 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sun, 08 Jul 2007 11:15:07 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46908990.3020707@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> Message-ID: <4690AB1B.5090707@behnel.de> Stefan Behnel wrote: > HTML is a factory function, so what about > calling the string parser functions "HTML()", "HTMLFragment()" and > "HTMLFragments()"? That would also make the semantics pretty simple: HTML() will always return a complete HTML document, i.e. wrapped by html/body if necessary. HTMLFragment() will always return a fragment, i.e. a single element that can be pasted into a body. This means: remove html/body if they are present and add a
<div> if there are multiple elements. Maybe check if there actually are any block tags and just wrap the fragments in a <p>
otherwise, but that's more of an optimisation. HTMLFragments() will always return a list of fragments, i.e. text and/or elements and remove any html/body parts that come from the document or were added by the parser. Does that sound like a suitable API? Stefan From stefan_ml at behnel.de Mon Jul 9 21:44:15 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 21:44:15 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4690AB1B.5090707@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> Message-ID: <4692900F.7060008@behnel.de> Stefan Behnel wrote: > Stefan Behnel wrote: >> HTML is a factory function, so what about >> calling the string parser functions "HTML()", "HTMLFragment()" and >> "HTMLFragments()"? > > That would also make the semantics pretty simple: > > HTML() will always return a complete HTML document, i.e. wrapped by html/body > if necessary. > > HTMLFragment() will always return a fragment, i.e. a single element that can > be pasted into a body. This means: remove html/body if they are present and > add a
<div> if there are multiple elements. Maybe check if there actually are > any block tags and just wrap the fragments in a <p>
otherwise, but that's more > of an optimisation. > > HTMLFragments() will always return a list of fragments, i.e. text and/or > elements and remove any html/body parts that come from the document or were > added by the parser. I changed this on the branch and also renamed the current do-what-I-mean "parse()" function to "fromstring()". This means that "HTML()" now behaves differently from "fromstring()", although "XML()" and "fromstring()" behave the same in etree. But I find that ok, since they behave as you would expect. HTML() gives you an HTML page (including html/body) and "fromstring()" more or less gives you what you passed in as a string, be it with or without . So, that makes the API complete (for now), I think. I'll double check the modules to see if everything looks nice and consistent and will then try to merge the branch back into the trunk soon to get out a "2.0alpha1". The API may still change during the alpha cycle, but this will hopefully get us some broader feedback on the new package. Stefan From ianb at colorstudy.com Mon Jul 9 21:53:11 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 14:53:11 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692900F.7060008@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> Message-ID: <46929227.9010604@colorstudy.com> Stefan Behnel wrote: > Stefan Behnel wrote: >> Stefan Behnel wrote: >>> HTML is a factory function, so what about >>> calling the string parser functions "HTML()", "HTMLFragment()" and >>> "HTMLFragments()"? >> That would also make the semantics pretty simple: >> >> HTML() will always return a complete HTML document, i.e. wrapped by html/body >> if necessary. >> >> HTMLFragment() will always return a fragment, i.e. a single element that can >> be pasted into a body. 
This means: remove html/body if they are present and >> add a
<div> if there are multiple elements. Maybe check if there actually are >> any block tags and just wrap the fragments in a <p>
otherwise, but that's more >> of an optimisation. I think we talked about using <span> if there were no block tags, not <p>
. Something about HTMLFragment(s) seems weird to me. I guess HTML() itself is weird, though it is reminiscent of XML(). Which is itself weird, since neither is a class. HTMLFragment() bothers me more because it definitely doesn't return a different type of object from HTML(), but the naming implies it does. >> HTMLFragments() will always return a list of fragments, i.e. text and/or >> elements and remove any html/body parts that come from the document or were >> added by the parser. > > I changed this on the branch and also renamed the current do-what-I-mean > "parse()" function to "fromstring()". That seems like a fine name. > This means that "HTML()" now behaves differently from "fromstring()", although > "XML()" and "fromstring()" behave the same in etree. But I find that ok, since > they behave as you would expect. HTML() gives you an HTML page (including > html/body) and "fromstring()" more or less gives you what you passed in as a > string, be it with or without . Sometimes you actually don't get a body, like if you parse HTML('') you only get a head. And sometimes you don't get a head. Maybe the parsing should normalize this too, as it's a corner case people often don't think about. For that matter, I think there should probably be a body property on the html element (or all elements?), since I find myself commonly plucking out the body element right away. 
-- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Mon Jul 9 22:07:47 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 22:07:47 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46929227.9010604@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> Message-ID: <46929593.40508@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >>> HTMLFragment() will always return a fragment, i.e. a single element >>> that can >>> be pasted into a body. This means: remove html/body if they are >>> present and >>> add a
<div> if there are multiple elements. Maybe check if there >>> actually are >>> any block tags and just wrap the fragments in a <p>
otherwise, but >>> that's more >>> of an optimisation. > > I think we talked about using <span> if there were no block tags, not <p>
. Ah, sure. Anyway, I didn't change your implementation, so everything works as before (except for the naming). > Something about HTMLFragment(s) seems weird to me. I guess HTML() > itself is weird, though it is reminiscent of XML(). Which is itself > weird, since neither is a class. It's a factory though, that is mainly meant for HTML 'literals'. And it gives you an HtmlElement or a list of those. Hmmm, I admit that HTMLFragments() does not really sound like returning a list... > HTMLFragment() bothers me more because > it definitely doesn't return a different type of object from HTML(), but > the naming implies it does. Hmmm, I don't really feel the same way, but maybe I'm too biased already. :) It's Python after all, so the actual type is not that relevant. >> This means that "HTML()" now behaves differently from "fromstring()", >> although >> "XML()" and "fromstring()" behave the same in etree. But I find that >> ok, since >> they behave as you would expect. HTML() gives you an HTML page (including >> html/body) and "fromstring()" more or less gives you what you passed >> in as a >> string, be it with or without . > > Sometimes you actually don't get a body, like if you parse HTML(' rel="foo">') you only get a head. And sometimes you don't get a head. > Maybe the parsing should normalize this too, as it's a corner case > people often don't think about. For that matter, I think there should > probably be a body property on the html element (or all elements?), > since I find myself commonly plucking out the body element right away. If we keep the current names, we should make sure they fit the expectations. Having HTML() always return a complete document sounds natural to me. Checking the returned tag for 'body' or 'head' is simple enough. 
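As a rough sketch of the tag check mentioned above: the helper below looks at the tag of whatever element came back and returns the body if one is reachable. The name find_body is hypothetical, and the standard library's xml.etree.ElementTree stands in for the parsed result here, so this is not the lxml.html API itself.

```python
import xml.etree.ElementTree as ET

def find_body(root):
    """Return the body element for a parsed result, or None if there is none."""
    if root.tag == "body":
        # The parser handed back the body directly.
        return root
    if root.tag == "html":
        # Full document: the body, if present, is a direct child.
        return root.find("body")
    # Anything else (e.g. a lone head) has no body to offer.
    return None

full = ET.fromstring("<html><head/><body><p>hi</p></body></html>")
head_only = ET.fromstring("<html><head><title>t</title></head></html>")

print(find_body(full).tag)
print(find_body(head_only))
```

Returning None for the head-only case keeps the corner case visible to the caller instead of silently inventing an empty body.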
Stefan From ianb at colorstudy.com Mon Jul 9 22:13:05 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 15:13:05 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <46929593.40508@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> Message-ID: <469296D1.2080804@colorstudy.com> Stefan Behnel wrote: > > Ian Bicking wrote: >> Stefan Behnel wrote: >>>> HTMLFragment() will always return a fragment, i.e. a single element >>>> that can >>>> be pasted into a body. This means: remove html/body if they are >>>> present and >>>> add a
<div> if there are multiple elements. Maybe check if there >>>> actually are >>>> any block tags and just wrap the fragments in a <p>
otherwise, but >>>> that's more >>>> of an optimisation. >> I think we talked about using <span> if there were no block tags, not <p>
. > > Ah, sure. Anyway, I didn't change your implementation, so everything works as > before (except for the naming). > > >> Something about HTMLFragment(s) seems weird to me. I guess HTML() >> itself is weird, though it is reminiscent of XML(). Which is itself >> weird, since neither is a class. > > It's a factory though, that is mainly meant for HTML 'literals'. And it gives > you an HtmlElement or a list of those. Hmmm, I admit that HTMLFragments() does > not really sound like returning a list... Everything is potentially a factory. dict.items() is a list factory. HTML and HTMLFragment are factories for the same kind of object. >> HTMLFragment() bothers me more because >> it definitely doesn't return a different type of object from HTML(), but >> the naming implies it does. > > Hmmm, I don't really feel the same way, but maybe I'm too biased already. :) > > It's Python after all, so the actual type is not that relevant. Yes, but we're already badly abusing naming conventions. These aren't classes, but they are named like classes. This has caused confusion for me in the past. >>> This means that "HTML()" now behaves differently from "fromstring()", >>> although >>> "XML()" and "fromstring()" behave the same in etree. But I find that >>> ok, since >>> they behave as you would expect. HTML() gives you an HTML page (including >>> html/body) and "fromstring()" more or less gives you what you passed >>> in as a >>> string, be it with or without . >> Sometimes you actually don't get a body, like if you parse HTML('> rel="foo">') you only get a head. And sometimes you don't get a head. >> Maybe the parsing should normalize this too, as it's a corner case >> people often don't think about. For that matter, I think there should >> probably be a body property on the html element (or all elements?), >> since I find myself commonly plucking out the body element right away. > > If we keep the current names, we should make sure they fit the expectations. 
> Having HTML() always return a complete document sounds natural to me. I'd be inclined to feel the other way, that HTML() would be more like fromstring(), and return what you give it instead of interpreting everything as a document. But I'm not too concerned there. > Checking the returned tag for 'body' or 'head' is simple enough. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From stefan_ml at behnel.de Mon Jul 9 22:52:19 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 09 Jul 2007 22:52:19 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <469296D1.2080804@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> Message-ID: <4692A003.3050700@behnel.de> Ian Bicking wrote: > Stefan Behnel wrote: >> Ian Bicking wrote: >>> HTMLFragment() bothers me more because >>> it definitely doesn't return a different type of object from HTML(), but >>> the naming implies it does. >> >> Hmmm, I don't really feel the same way, but maybe I'm too biased >> already. :) >> >> It's Python after all, so the actual type is not that relevant. > > Yes, but we're already badly abusing naming conventions. These aren't > classes, but they are named like classes. This has caused confusion for > me in the past. Ok, I buy that. But what would be the alternative? * element_from_string(s) and elements_from_string(s) * fragment_from_string(s) and fragments_from_string(s) * parse_element_string(s) and ??? * parse_string_element(s) and parse_string_elements(s) I could maybe live with the first. 
Stefan From ianb at colorstudy.com Tue Jul 10 00:48:06 2007 From: ianb at colorstudy.com (Ian Bicking) Date: Mon, 09 Jul 2007 17:48:06 -0500 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692A003.3050700@behnel.de> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> <4692A003.3050700@behnel.de> Message-ID: <4692BB26.9020304@colorstudy.com> Stefan Behnel wrote: > Ian Bicking wrote: >> Stefan Behnel wrote: >>> Ian Bicking wrote: >>>> HTMLFragment() bothers me more because >>>> it definitely doesn't return a different type of object from HTML(), but >>>> the naming implies it does. >>> Hmmm, I don't really feel the same way, but maybe I'm too biased >>> already. :) >>> >>> It's Python after all, so the actual type is not that relevant. >> Yes, but we're already badly abusing naming conventions. These aren't >> classes, but they are named like classes. This has caused confusion for >> me in the past. > > Ok, I buy that. But what would be the alternative? > > * element_from_string(s) and elements_from_string(s) > * fragment_from_string(s) and fragments_from_string(s) > * parse_element_string(s) and ??? > * parse_string_element(s) and parse_string_elements(s) > > I could maybe live with the first. I'm somewhat more comfortable with fromstring() being do-what-I-mean (i.e., return a document only if a document is passed in), and document_fromstring() for what HTML() currently does (maybe with a little normalization), and fragment_fromstring() for something that *must* be a fragment (which I suppose should strip everything but body, if it is passed a full document, and I think I then even rename body to div in the current code). 
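The single-fragment behaviour discussed in this thread (one element back, wrapped in a parent when several top-level elements are present) could be sketched roughly as follows. The helper wrap_fragments and the BLOCK_TAGS set are assumptions made for illustration, the choice of span as the inline wrapper is likewise assumed, and xml.etree.ElementTree is used in place of lxml so the snippet stays self-contained.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; not lxml's actual list of block-level tags.
BLOCK_TAGS = {"div", "p", "ul", "ol", "table", "h1", "h2", "h3",
              "pre", "blockquote"}

def wrap_fragments(elements):
    """Return one element: the fragment itself, or a div/span wrapper."""
    if len(elements) == 1:
        # A single element is already a usable fragment.
        return elements[0]
    # Pick a block wrapper only if some child is block-level.
    has_block = any(el.tag in BLOCK_TAGS for el in elements)
    wrapper = ET.Element("div" if has_block else "span")
    wrapper.extend(elements)
    return wrapper

inline = [ET.Element("b"), ET.Element("i")]
mixed = [ET.Element("p"), ET.Element("b")]

print(wrap_fragments(inline).tag)
print(wrap_fragments(mixed).tag)
print(wrap_fragments([ET.Element("p")]).tag)
```

Choosing the wrapper from the content keeps an inline-only fragment pasteable into running text, which is the concern raised about wrapping everything in a block element.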
That is, most people are really comfortable working with HTML fragments, and this whole notion of a "valid HTML document" is less of an issue for most people. So when libxml2 turns their fragment into a valid HTML document it can be disconcerting. -- Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org : Write code, do good : http://topp.openplans.org/careers From rcdailey at gmail.com Tue Jul 10 06:34:16 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Mon, 9 Jul 2007 23:34:16 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows Message-ID: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> Hi, I'm attempting to build LXML for windows. Below are details on the linker errors I'm getting (the compile works fine). Anyone that can help would be greatly appreciated. Thank you! Here is my modified paths in the setup.py file: STATIC_INCLUDE_DIRS = [ "..\\libxml2\\include", "..\\libxslt\\include", "..\\zlib\\include", "..\\iconv\\include" ] STATIC_LIBRARY_DIRS = [ "..\\libxml2\\lib", "..\\libxslt\\lib", "..\\zlib\\lib", "..\\iconv\\lib", "C:\\Program Files\\Microsoft Visual Studio 8\\VC\\lib" ] STATIC_CFLAGS = [] I get the following output in the command line (note the first line is the line I typed in): C:\IT\SDK\lxml>python setup.py build -c mingw32 --static Building lxml version 1.3.2 C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution option: 'zip_safe' warnings.warn(msg) running build running build_py running build_ext building 'lxml.etree' extension writing build\temp.win32-2.5\Release\src\lxml\etree.def C:\mingw\bin\gcc.exe -mno-cygwin -shared -s build\temp.win32- 2.5\Release\src\lxml\etree.o build\temp .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib -L..\zlib\lib -L..\iconv\lib "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" -LC:\Python25\libs -LC:\Python25\PCBuild -lli bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 -lmsvcr71 -o 
build\lib.win32-2 .5\lxml\etree.pyd Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"ws2_32.lib" /DEFAULTLI B:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve 
`/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"advapi32.lib" /DEFAULT LIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve 
`/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"uuid.lib" /DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"O LDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized 
Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized Warning: .drectve `/DEFAULTLIB:"MSVCRT" /DEFAULTLIB:"OLDNAMES" ' unrecognized ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /xsltutils.obj):..\libxslt\xsltuti:(.text[_xsltTimestamp] +0xa5): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat Decimal]+0x9c): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat Alpha]+0x4b): undefined reference to `_ftol2' ..\libxslt\lib\libxslt_a.lib(int.xslta.msvc /numbers.obj):..\libxslt\numbers:(.text[_xsltNumberFormat ]+0x6): undefined reference to `_chkstk' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateParseDur ation]+0x226): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateParseDur ation]+0x230): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x119): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x175): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x213): undefined reference to `_ftol2' ..\libxslt\lib\libexslt_a.lib(int.exslta.msvc /date.obj):..\libexslt\date.c:(.text[_exsltDateFormatDu ration]+0x28a): more undefined references to `_ftol2' follow ..\libxml2\lib\libxml2_a.lib(int.a.msvc/encoding.obj):..\encoding.c:(.text[_xmlByteConsumed]+0x6): u ndefined 
reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /valid.obj):..\valid.c:(.text[_xmlValidBuildContentModel]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /valid.obj):..\valid.c:(.text[_xmlValidateElementContent]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti on]+0x65): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti on]+0x9d): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /debugXML.obj):..\debugXML.c:(.text[_xmlCtxtDumpElemDecl]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal idateDuration]+0x21c): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal idateDuration]+0x226): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaCom pareDurations]+0x2f): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0xfe): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0x120): undefined reference to `_ftol2' ..\libxml2\lib\libxml2_a.lib(int.a.msvc /xmlschemastypes.obj):..\xmlschemastypes:(.text[__xmlSchemaDa teAdd]+0x171): more undefined references to `_ftol2' follow ..\libxml2\lib\libxml2_a.lib(int.a.msvc /nanohttp.obj):..\nanohttp.c:(.text[_xmlNanoHTTPReadLine]+0x6 ): undefined reference to `_chkstk' ..\libxml2\lib\libxml2_a.lib(int.a.msvc/nanoftp.obj):..\nanoftp.c:(.text[_xmlNanoFTPList]+0x6): unde fined reference to `_chkstk' 
..\libxml2\lib\libxml2_a.lib(int.a.msvc/nanoftp.obj):..\nanoftp.c:(.text[_xmlNanoFTPGet]+0x6): undefined reference to `_chkstk' ..\iconv\lib\iconv_a.lib(iconv.obj):./iconv.c:(.text[_libiconvlist]+0x9): undefined reference to `_chkstk' ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined reference to `_chkstk' collect2: ld returned 1 exit status error: command 'gcc' failed with exit status 1 From stefan_ml at behnel.de Tue Jul 10 08:36:00 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jul 2007 08:36:00 +0200 Subject: [lxml-dev] naming the lxml.html parse functions In-Reply-To: <4692BB26.9020304@colorstudy.com> References: <468D4B9C.9050503@behnel.de> <468D88F8.3010009@colorstudy.com> <468DF8DA.7030306@behnel.de> <468E756B.3060202@colorstudy.com> <46908990.3020707@behnel.de> <4690AB1B.5090707@behnel.de> <4692900F.7060008@behnel.de> <46929227.9010604@colorstudy.com> <46929593.40508@behnel.de> <469296D1.2080804@colorstudy.com> <4692A003.3050700@behnel.de> <4692BB26.9020304@colorstudy.com> Message-ID: <469328D0.2060301@behnel.de> Ian Bicking wrote: > I'm somewhat more comfortable with fromstring() being do-what-I-mean > (i.e., return a document only if a document is passed in), and > document_fromstring() for what HTML() currently does (maybe with a > little normalization), and fragment_fromstring() for something that > *must* be a fragment (which I suppose should strip everything but body, > if it is passed a full document, and I think I then even rename body to > div in the current code). Sure, that works well, I think. What about the "fragments" function? I think "fragments_fromstring()" would fit nicely in there, and in the Python context, people would expect it to return a list.
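To make the "returns a list" expectation concrete, here is a small sketch of what a list of fragments ("text and/or elements", as described earlier in the thread) might look like. The helper fragments_of is hypothetical and works on an already-parsed body element from xml.etree.ElementTree; it is not the fragments_fromstring() function being named here.

```python
import xml.etree.ElementTree as ET

def fragments_of(body):
    """Return leading text (if any) followed by the child elements."""
    fragments = []
    if body.text and body.text.strip():
        # Text before the first child becomes a plain string entry.
        fragments.append(body.text)
    # Child elements follow in document order; each keeps its own tail text.
    fragments.extend(body)
    return fragments

body = ET.fromstring("<body>intro <b>bold</b> and <i>italic</i></body>")
parts = fragments_of(body)
print(type(parts).__name__, len(parts))
```

Text between elements stays attached to the preceding element as its tail, so only the leading text needs a separate list entry in this sketch.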
> That is, most people are really comfortable working with HTML fragments, > and this whole notion of a "valid HTML document" is less of an issue for > most people. So when libxml2 turns their fragment into a valid HTML > document it can be disconcerting. That's why I'm not arguing your functions technically. I think they all make sense, I just want them to be less of a surprise for people who use them. Stefan From stefan_ml at behnel.de Tue Jul 10 09:10:44 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 10 Jul 2007 09:10:44 +0200 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> Message-ID: <469330F4.5060402@behnel.de> Hi, Robert Dailey wrote: > I'm attempting to build LXML for windows. Below are details on the > linker errors I'm getting (the compile works fine). Anyone that can help > would be greatly appreciated. Thank you! [...] > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > Building lxml version 1.3.2 > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution > option: 'zip_safe' > warnings.warn(msg) > running build > running build_py > running build_ext > building 'lxml.etree' extension > writing build\temp.win32-2.5\Release\src\lxml\etree.def > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib > -L..\zlib\lib -L..\iconv\lib > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > -lmsvcr71 -o build\lib.win32-2 > .5\lxml\etree.pyd [...] 
> ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > on]+0x65): undefined reference to `_ftol2' > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > idateDuration]+0x21c): undefined reference to `_ftol2' > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal [...] > idateDuration]+0x226): undefined reference to `_ftol2' > ..\iconv\lib\iconv_a.lib(iconv.obj):./iconv.c:(.text[_libiconvlist]+0x9): > undefined reference to `_c > hkstk' > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined > reference to `_chkstk' > collect2: ld returned 1 exit status > error: command 'gcc' failed with exit status 1 See these: http://mail.gnome.org/archives/xml/2005-April/msg00028.html http://mail.gnome.org/archives/xml/2005-April/msg00042.html Stefan From rcdailey at gmail.com Tue Jul 10 16:15:26 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 09:15:26 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <469330F4.5060402@behnel.de> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> Message-ID: <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> Stefan, Thank you very much for your reply. The articles you linked to me are an interesting read, however I don't feel like they solve my problem. Perhaps I'm a little bit confused on what the articles are suggesting. I'm still stuck on what to do to fix this problem. In fact, I don't even know what the problem is to begin with. I had a hard time relating my problems to the topics discussed in the linked articles. Any further assistance would be greatly appreciated. Thanks again for your reply. On 7/10/07, Stefan Behnel wrote: > > Hi, > > Robert Dailey wrote: > > I'm attempting to build LXML for windows. 
Below are details on the > > linker errors I'm getting (the compile works fine). Anyone that can help > > would be greatly appreciated. Thank you! > [...] > > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > > Building lxml version 1.3.2 > > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown distribution > > option: 'zip_safe' > > warnings.warn(msg) > > running build > > running build_py > > running build_ext > > building 'lxml.etree' extension > > writing build\temp.win32-2.5\Release\src\lxml\etree.def > > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib -L..\libxslt\lib > > -L..\zlib\lib -L..\iconv\lib > > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > > -lmsvcr71 -o build\lib.win32-2 > > .5\lxml\etree.pyd > [...] > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > > on]+0x65): undefined reference to `_ftol2' > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > idateDuration]+0x21c): undefined reference to `_ftol2' > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > [...] 
> > idateDuration]+0x226): undefined reference to `_ftol2' > > ..\iconv\lib\iconv_a.lib(iconv.obj > ):./iconv.c:(.text[_libiconvlist]+0x9): > > undefined reference to `_c > > hkstk' > > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): undefined > > reference to `_chkstk' > > collect2: ld returned 1 exit status > > error: command 'gcc' failed with exit status 1 > > See these: > > http://mail.gnome.org/archives/xml/2005-April/msg00028.html > http://mail.gnome.org/archives/xml/2005-April/msg00042.html > > Stefan > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/f79a828d/attachment.htm From rcdailey at gmail.com Tue Jul 10 22:00:23 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 15:00:23 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> Message-ID: <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> Can anyone respond on this issue? I would really appreciate it. On 7/10/07, Robert Dailey wrote: > > Stefan, > > Thank you very much for your reply. The articles you linked to me are an > interesting read, however I don't feel like they solve my problem. Perhaps > I'm a little bit confused on what the articles are suggesting. I'm still > stuck on what to do to fix this problem. In fact, I don't even know what the > problem is to begin with. I had a hard time relating my problems to the > topics discussed in the linked articles. Any further assistance would be > greatly appreciated. Thanks again for your reply. > > On 7/10/07, Stefan Behnel wrote: > > > > Hi, > > > > Robert Dailey wrote: > > > I'm attempting to build LXML for windows. 
Below are details on the > > > linker errors I'm getting (the compile works fine). Anyone that can > > help > > > would be greatly appreciated. Thank you! > > [...] > > > C:\IT\SDK\lxml>python setup.py build -c mingw32 --static > > > Building lxml version 1.3.2 > > > C:\Python25\lib\distutils\dist.py:263: UserWarning: Unknown > > distribution > > > option: 'zip_safe' > > > warnings.warn(msg) > > > running build > > > running build_py > > > running build_ext > > > building 'lxml.etree' extension > > > writing build\temp.win32-2.5\Release\src\lxml\etree.def > > > C:\mingw\bin\gcc.exe -mno-cygwin -shared -s > > > build\temp.win32-2.5\Release\src\lxml\etree.o build\temp > > > .win32-2.5\Release\src\lxml\etree.def -L..\libxml2\lib > > -L..\libxslt\lib > > > -L..\zlib\lib -L..\iconv\lib > > > "-LC:\Program Files\Microsoft Visual Studio 8\VC\lib" > > > -LC:\Python25\libs -LC:\Python25\PCBuild -lli > > > bxslt_a -llibexslt_a -llibxml2_a -liconv_a -lzlib -lWS2_32 -lpython25 > > > -lmsvcr71 -o build\lib.win32-2 > > > .5\lxml\etree.pyd > > [...] > > > ..\libxml2\lib\libxml2_a.lib( int.a.msvc > > /xpointer.obj):..\xpointer.c:(.text[_xmlXPtrStringRangeFuncti > > > on]+0x65): undefined reference to `_ftol2' > > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc/xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > > > > idateDuration]+0x21c): undefined reference to `_ftol2' > > > ..\libxml2\lib\libxml2_a.lib(int.a.msvc > > /xmlschemastypes.obj):..\xmlschemastypes:(.text[_xmlSchemaVal > > [...] 
> > > idateDuration]+0x226): undefined reference to `_ftol2' > > > ..\iconv\lib\iconv_a.lib(iconv.obj > > ):./iconv.c:(.text[_libiconvlist]+0x9): > > > undefined reference to `_c > > > hkstk' > > > ..\zlib\lib\zlib.lib(gzio.obj):gzio.c:(.text[_gzprintf]+0x6): > > undefined > > > reference to `_chkstk' > > > collect2: ld returned 1 exit status > > > error: command 'gcc' failed with exit status 1 > > > > See these: > > > > http://mail.gnome.org/archives/xml/2005-April/msg00028.html > > http://mail.gnome.org/archives/xml/2005-April/msg00042.html > > > > Stefan > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/aadba114/attachment.htm From sidnei at enfoldsystems.com Tue Jul 10 22:26:06 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 10 Jul 2007 17:26:06 -0300 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> Message-ID: On 7/10/07, Robert Dailey wrote: > Can anyone respond on this issue? I would really appreciate it. Which version are you trying to build? I'm the 'official' maintainer of the binary for Windows. I am planning to make a build of 1.3.2 sometime this week. 
-- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From rcdailey at gmail.com Tue Jul 10 22:32:39 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 15:32:39 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> Message-ID: <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> I'm attempting to build 1.3.2 for windows. So far here's the steps I've taken: - Download the windows binaries for iconv, zlib, libxml2, and libxslt as directed from the tutorial on the lxml website. - Extract all of the folders to the same folder, placing lxml 1.3.2 in this folder as well. I now have 5 folders in the same directory. I remove the version numbers from the folder names to allow a more readable path to the include directories. - Modify the setup.py file, adding the following code: STATIC_INCLUDE_DIRS = [ "..\\libxml2\\include", "..\\libxslt\\include", "..\\zlib\\include", "..\\iconv\\include" ] STATIC_LIBRARY_DIRS = [ "..\\libxml2\\lib", "..\\libxslt\\lib", "..\\zlib\\lib", "..\\iconv\\lib" ] STATIC_CFLAGS = [] - I then pass the following to the command line (minus the quotes): "python setup.py build -c mingw32 --static" - The compile succeeds fine, but the link stage can't find various symbols, such as "xmlFree" and "_ftol2". This is where I'm stuck. I don't have VS2003 installed so I can't use that. Thanks for responding. On 7/10/07, Sidnei da Silva wrote: > > On 7/10/07, Robert Dailey wrote: > > Can anyone respond on this issue? I would really appreciate it. > > Which version are you trying to build? I'm the 'official' maintainer > of the binary for Windows. I am planning to make a build of 1.3.2 > sometime this week. 
> > -- > Sidnei da Silva > Enfold Systems http://enfoldsystems.com > Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/6377a105/attachment-0001.htm From sidnei at enfoldsystems.com Tue Jul 10 23:01:45 2007 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Tue, 10 Jul 2007 18:01:45 -0300 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> Message-ID: On 7/10/07, Robert Dailey wrote: > I'm attempting to build 1.3.2 for windows. ... > I don't have VS2003 installed so I can't use that. Uh, and I've never tried mingw32, so I can't comment :( If you can wait until tomorrow, I will upload a VS2003-built binary to PyPI. Which version of python are you using? 2.4 or 2.5? -- Sidnei da Silva Enfold Systems http://enfoldsystems.com Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 From rcdailey at gmail.com Wed Jul 11 00:16:06 2007 From: rcdailey at gmail.com (Robert Dailey) Date: Tue, 10 Jul 2007 17:16:06 -0500 Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows In-Reply-To: References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> Message-ID: <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> I'm using Python 2.5. If you would upload a binary I would greatly appreciate it. 
If you also wouldn't mind giving me the link to where I can find the file when it is available I would also appreciate that (if it will be located on Python Cheese Shop, I know how to find it). I realize it may be impractical for you, but if you have any spare time: if you could attempt to build using mingw32 maybe you could figure out the outstanding linker issues I've been having. Maybe you'd be able to solve the problem and then post with your results. This is completely optional; I just ask that you look at it if you're willing and if time allows. I can most definitely wait until tomorrow for your generous binary distribution for Windows. I'm greatly appreciative of your efforts. Thanks for following up with me on this. Take care. On 7/10/07, Sidnei da Silva wrote: > > On 7/10/07, Robert Dailey wrote: > > I'm attempting to build 1.3.2 for windows. > ... > > I don't have VS2003 installed so I can't use that. > > Uh, and I've never tried mingw32, so I can't comment :( > > If you can wait until tomorrow, I will upload a VS2003-built binary to > PyPI. Which version of python are you using? 2.4 or 2.5? > > -- > Sidnei da Silva > Enfold Systems http://enfoldsystems.com > Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214 > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070710/c7df90de/attachment.htm

From stefan_ml at behnel.de Wed Jul 11 08:50:23 2007
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 11 Jul 2007 08:50:23 +0200
Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows
In-Reply-To: <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com>
References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com>
Message-ID: <46947DAF.2060701@behnel.de>

Robert Dailey wrote:
> I'm using Python 2.5. If you would upload a binary I would greatly
> appreciate it. If you also wouldn't mind giving me the link to where I
> can find the file when it is available I would also appreciate that (if
> it will be located on Python Cheese Shop, I know how to find it).

They'll be on Cheeseshop.

> I realize it may be impractical for you, but if you have any spare time:
> if you could attempt to build using mingw32 maybe you could figure out
> the outstanding linker issues I've been having. Maybe you'd be able to
> solve the problem and then post with your results. This is completely
> optional; I just ask that you look at it if you're willing and if time
> allows.

From what I've read about the topic so far, it might be a problem to build against a VC-built libxml2 etc. with mingw32 (that was in the links I posted), but using it to build extensions against the official Python release /should/ work. So what you could try if you want to build it yourself is build libxml2 and libxslt from sources first using mingw32 and then build lxml against those.
It would actually be interesting for us to know if this works, as it would allow others to work on lxml from windows more easily (without buying VC first). Stefan From reder at jpl.nasa.gov Wed Jul 11 09:02:58 2007 From: reder at jpl.nasa.gov (Leonard J. Reder) Date: Wed, 11 Jul 2007 00:02:58 -0700 Subject: [lxml-dev] Compact RelaxNG Validation Message-ID: <469480A2.3020608@jpl.nasa.gov> Hello, Does the lxml validation support the compact form of RelaxNG Schema language? Thanks, Len -- ____________________________________________________ Leonard J. Reder Jet Propulsion Laboratory Mar Science Laboratory Project Flight Software Applications & Data Product Management, Section 316D Email: reder at jpl.nasa.gov Phone (Voice): 818-354-3639 Mail Address: Mail Stop: 171-113 4800 Oak Grove Dr. Pasadena, CA. 91109 --------------------------------------------------- From stefan_ml at behnel.de Wed Jul 11 09:47:05 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 11 Jul 2007 09:47:05 +0200 Subject: [lxml-dev] Compact RelaxNG Validation In-Reply-To: <469480A2.3020608@jpl.nasa.gov> References: <469480A2.3020608@jpl.nasa.gov> Message-ID: <46948AF9.5020707@behnel.de> Leonard J. Reder wrote: > Does the lxml validation support the compact form of RelaxNG Schema > language? No, but that's been on the wish list for a while. There is a patch for libxml2 that supports it and has been waiting for inclusion for ages. Once libxml2 supports it, we can see if we can also support it in lxml (obviously requires a backwards compatible implementation, as it must still compile on older libxml2 versions). The other solution would be to add a separate (Python-)implementation to lxml, but I am not aware of a spec-compliant Python implementation here. There are two partial implementations, but they currently fail to handle a larger number of non-trivial RNC schemas, so there is not much use in integrating them. Any help is obviously appreciated. 
It might already help to keep asking on the libxml2 mailing list. Stefan From stefan_ml at behnel.de Wed Jul 11 14:13:12 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 11 Jul 2007 14:13:12 +0200 Subject: [lxml-dev] Compact RelaxNG Validation In-Reply-To: <469480A2.3020608@jpl.nasa.gov> References: <469480A2.3020608@jpl.nasa.gov> Message-ID: <4694C958.9050609@behnel.de> Leonard J. Reder wrote: > Does the lxml validation support the compact form of RelaxNG Schema > language? A possible (though not portable) way would be to pipe RNC through trang: http://www.thaiopensource.com/relaxng/trang.html It's written in Java, but there are GCJ'ed Linux binaries available. Stefan From micxer at micxer.de Wed Jul 11 14:13:38 2007 From: micxer at micxer.de (micxer) Date: Wed, 11 Jul 2007 14:13:38 +0200 Subject: [lxml-dev] Ignoring unknown namespaces in XML while validating In-Reply-To: <469480A2.3020608@jpl.nasa.gov> References: <469480A2.3020608@jpl.nasa.gov> Message-ID: <4694C972.9060004@micxer.de> Hello, I'm using lxml primarily for validation of XML documents and requests of UPnP devices. Since many vendors are going to make their devices DLNA compliant, some additional XML elements appear in the XML docs. I would have to pay for the DLNA specs so I have no other choice than deleting these elements in advance and validate the XML afterwards. Is there an easy way to do this with lxml? Am I missing something? Thanks, Michael From stefan_ml at behnel.de Wed Jul 11 14:30:17 2007 From: stefan_ml at behnel.de (Stefan Behnel) Date: Wed, 11 Jul 2007 14:30:17 +0200 Subject: [lxml-dev] Ignoring unknown namespaces in XML while validating In-Reply-To: <4694C972.9060004@micxer.de> References: <469480A2.3020608@jpl.nasa.gov> <4694C972.9060004@micxer.de> Message-ID: <4694CD59.1060904@behnel.de> Hi, first of all: please don't respond to posts from a different thread when you want to start a new one. Mail-Readers will sort the e-mail into the wrong thread and confuse people. 
micxer wrote:
> I'm using lxml primarily for validation of XML documents and requests of
> UPnP devices. Since many vendors are going to make their devices DLNA
> compliant, some additional XML elements appear in the XML docs. I would
> have to pay for the DLNA specs so I have no other choice than deleting
> these elements in advance and validate the XML afterwards. Is there an
> easy way to do this with lxml? Am I missing something?

Not sure what your problem is exactly. Are these "additional elements" in a specific namespace? That would make it easy to remove them:

    # list() takes a snapshot first, so removing elements
    # cannot disturb the running iteration
    for el in list(root.getiterator("{http://the/namespace}*")):
        parent = el.getparent()
        if parent is not None:  # not the root element
            parent.remove(el)

Or are they in other namespaces than the main one?

    MAIN_NS = "{http://the/namespace}"
    for el in list(root.getiterator("*")):
        if not el.tag.startswith(MAIN_NS):
            parent = el.getparent()
            if parent is not None:  # not the root element
                parent.remove(el)

Similarly, if you have a set of tag names that must be kept or removed, you can iterate over all elements and check the tag names against the set.

Does that solve your problem?

Stefan

From jholg at gmx.de Wed Jul 11 15:45:39 2007
From: jholg at gmx.de (jholg at gmx.de)
Date: Wed, 11 Jul 2007 15:45:39 +0200
Subject: [lxml-dev] [objectify] Typed E-factory for objectify, PT DataElement()-wrapper
Message-ID: <20070711134539.193100@gmx.net>

Hi,

attached patch (against trunk)

* adds a typed E-factory (called T-factory)
* inserts NoneType into the E-factory/T-factory typemap
* adds the PT() (= "PyTyped()") convenience function, a thin wrapper around DataElement() that uses the argument value's type to set the pytype
* provides unittests for E-factory, T-factory and PT()
* fixes DataElement() to care for some previously-unhandled corner cases concerning None and/or _pytype "none"

Despite what I previously said ;-) I now think it would be better to rename "none" to "NoneType", to use the same name as the Python builtin original.
While it is a longer name I seriously doubt you need to actually use it explicitly very often. By convention, the PyType name should match the Python builtin type name; then both the T-factory and the PT() function can work smoothly (the only thing special-cased is the Python type name "unicode", which gets substituted by "str"). Therefore, the patch also changes "none" to "NoneType" in objectify and the objectify tests/doctests.

I'd really like to see the PT() function go into the 1.3 series, too.

Please take a look, I can come up with some documentation if you like it.

Holger
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tfactory_pt_nonetype.patch
Type: application/octet-stream
Size: 23331 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070711/02980d26/attachment-0001.obj

From rcdailey at gmail.com Wed Jul 11 17:25:09 2007
From: rcdailey at gmail.com (Robert Dailey)
Date: Wed, 11 Jul 2007 10:25:09 -0500
Subject: [lxml-dev] Can't build lxml sources - failure to link - Windows
In-Reply-To: <46947DAF.2060701@behnel.de>
References: <496954360707092134p1107021by7292c19e9d59f36d@mail.gmail.com> <469330F4.5060402@behnel.de> <496954360707100715u58cb359i18a0845efbf39e3a@mail.gmail.com> <496954360707101300p71cd187cl1639e0c573bfc2f@mail.gmail.com> <496954360707101332k25e5b9c0lc59a4527ce5f3d3d@mail.gmail.com> <496954360707101516j48eed328t7ae06821c615335c@mail.gmail.com> <46947DAF.2060701@behnel.de>
Message-ID: <496954360707110825o141c43e9o1138f4ebecc4dacb@mail.gmail.com>

> It would actually be interesting for us to know if this works, as it would
> allow others to work on lxml from windows more easily (without buying VC
> first).

It would be interesting, however I've tried building libxml from sources before and it was very non-trivial.
In fact, it was so difficult I never actually succeeded. It turns out that there's a chain of API dependencies that I can never fulfill. You end up building sources for, say, 10 different libraries just in order to build the sources for libxml. If there's a nice, clean walkthrough on it I imagine I could figure it out. Likewise with libxslt.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/07c057bb/attachment.htm

From rcdailey at gmail.com Wed Jul 11 22:03:01 2007
From: rcdailey at gmail.com (Robert Dailey)
Date: Wed, 11 Jul 2007 15:03:01 -0500
Subject: [lxml-dev] Version 1.2.1 not working?
Message-ID: <496954360707111303q102e03d7i8177642eead8e510@mail.gmail.com>

Hi,

I have the following Python code:

    from lxml import etree
    from StringIO import StringIO

    def loadXMLFile( filename ):
        f = open( filename, 'r' )
        xmldata = f.read()
        root = etree.parse( StringIO( xmldata ) )
        f.close()
        return root

Python either crashes or hangs at the etree.parse() call. Below is the contents of the XML file I'm opening:

Anyone know why it isn't working? Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20070711/03c7d2ff/attachment.htm

From rogerpatterson at gmail.com Wed Jul 11 22:25:39 2007
From: rogerpatterson at gmail.com (Roger Patterson)
Date: Wed, 11 Jul 2007 13:25:39 -0700
Subject: [lxml-dev] Version 1.2.1 not working?
In-Reply-T