From Kevin.Dwyer at misys.com Thu Apr 2 18:48:22 2009
From: Kevin.Dwyer at misys.com (Dwyer, Kevin)
Date: Thu, 2 Apr 2009 17:48:22 +0100
Subject: [lxml-dev] XMLSchemaParseError: Document is not XML Schema
Message-ID: <63C2A154B1708946B60726AFDBA00AC004894CE2@ukmailemea01.misys.global.ad>

Hello,

I have encountered a problem with schema object creation with lxml; the
problem relates to the namespace used for the root element of the schema.

>>> import lxml.etree
>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>> et
>>> xsd = lxml.etree.XMLSchema(et)
Traceback (most recent call last):
  File "", line 1, in
    xsd = lxml.etree.XMLSchema(et)
  File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:120919)
XMLSchemaParseError: Document is not XML Schema

Looking in subversion
(http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
XMLSchema class I see:

    # work around for libxml2 bug if document is not XML schema at all
    #if _LIBXML_VERSION_INT < 20624:
    c_node = root_node._c_node
    c_href = _getNs(c_node)
    if c_href is NULL or \
        cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
        raise XMLSchemaParseError, u"Document is not XML Schema"

The schemas that I am using use this root element: If I change them to they validate.

Can you explain why the earlier namespace definition is unacceptable? Is
there a workaround? The schemas are not built by my application, so changing
them might be an issue.

Cheers,

Kevin

From victoryetc at gmail.com Fri Apr 3 07:53:17 2009
From: victoryetc at gmail.com (Victor Borda)
Date: Thu, 2 Apr 2009 22:53:17 -0700
Subject: [lxml-dev] problems with custom build
Message-ID: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com>

Hi List,

I was very excited today to find the lxml module for python. As I need to
write some xml checking scripts and finding bash scripting not well suited,
I have decided to give python a try. So far I like it. However, here is the
situation:

1) The target platform has an older RedHat installation with glibc2.3.4, so
none of the binaries for libxml or libxslt were of any use. So I had to
build those from source. Not too painful.
2) However, trying to get lxml running has been really difficult. I need
help here.
3) The target machine is not connected to the internet. It is not able to
remotely retrieve packages.

Questions/Steps:
0) There don't appear to be any RPMs for lxml. Is this correct?
1) Since I don't have an internet connection from this machine it means I
have to build from source, don't I (i.e. easy_install is not an option)?
2) I have assumed that I do have to build from source so I have given it a
shot. I copied over the lxml2.2 tar file, unzipped it.
3) I got setuptools-0.6c9-py2.3.egg and dropped it into that unzipped
directory, and ran python ez_setup.py which seemed to go fine.
4) Then I ran python setup.py build.
The build seemed to go fine. 5) I go to run test.py and I get this error message: []# python test.py Traceback (most recent call last): File "test.py", line 595, in ? exitcode = main(sys.argv) File "test.py", line 558, in main test_cases = get_test_cases(test_files, cfg, tracer=tracer) File "test.py", line 260, in get_test_cases module = import_module(file, cfg, tracer=tracer) File "test.py", line 203, in import_module mod = __import__(modname) File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py", line 12, in ? from lxml import etree ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so: undefined symbol: xmlSchematronFree And with that, I tried running 'make test' and got the same result. The build appeared to go fine. The contents of lxml-2.2/build/lib.linux-x86_64-2.3/lxml are: -rw-r--r-- 1 root root 7637 Jun 19 2008 builder.py -rw-r--r-- 1 root root 28750 Nov 23 19:33 cssselect.py -rw-r--r-- 1 root root 18287 May 31 2008 doctestcompare.py -rw-r--r-- 1 root root 7641 Jul 9 2008 ElementInclude.py -rw-r--r-- 1 root root 6407 Feb 27 14:45 _elementpath.py -rwxr-xr-x 1 root root 3125362 Apr 3 04:49 etree.so drwxr-xr-x 2 root root 4096 Apr 3 04:49 html -rw-r--r-- 1 root root 21 Oct 22 2007 __init__.py -rwxr-xr-x 1 root root 846592 Apr 3 04:49 objectify.so -rw-r--r-- 1 root root 87 Mar 2 2008 pyclasslookup.py -rw-r--r-- 1 root root 8229 May 31 2008 sax.py -rw-r--r-- 1 root root 230 May 31 2008 usedoctest.py -- Best Regards, Victor Borda -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090402/3928b71d/attachment.htm From stefan_ml at behnel.de Fri Apr 3 08:09:05 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 03 Apr 2009 08:09:05 +0200 Subject: [lxml-dev] problems with custom build In-Reply-To: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com> References: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com> Message-ID: <49D5A801.2020004@behnel.de> Hi, Victor Borda wrote: > I was very excited today to find the lxml module for python. As I need to > write some xml checking scripts and finding bash scripting not well suited, > I have decided to give python a try. So far I like it. However, here is the > situation: > > 1) The target platform has an older RedHat installation with glibc2.3.4, so > none of the binaries for libxml or libxslt were of any use. So I had to > build from source on those. Not too painful. > 2) However, trying to get lxml running has been really difficult. I need > help here. > 3) The target machine is not connected to the internet. It is not able to > remotely retrieve packages. > > Questions/Steps: > 0) There don't appear to be any rpm's for lxml. Is this correct? > 1) Since I don't have an internet connection from this machine it means I > have to build from source, don't I (ie easy_install is not an option)? > 2) I have assumed that I do have to build from source so I have given it a > shot. I copied over the lxml2.2 tar file, unzipped it. > 3) I got setuptools-0.6c9-py2.3.egg and dropped in that unzipped directly, > and ran python ez_setup.py which seemed to go fine. > 4) Then I ran python setup.py build. The build seemed to go fine. > 5) I go to run test.py and I get this error message: > > []# python test.py > Traceback (most recent call last): > File "test.py", line 595, in ? 
> exitcode = main(sys.argv) > File "test.py", line 558, in main > test_cases = get_test_cases(test_files, cfg, tracer=tracer) > File "test.py", line 260, in get_test_cases > module = import_module(file, cfg, tracer=tracer) > File "test.py", line 203, in import_module > mod = __import__(modname) > File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py", > line 12, in ? > from lxml import etree > ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so: > undefined symbol: xmlSchematronFree I assume that you have installed newer versions of libxml2 and libxslt somewhere, but it looks like lxml can't find them at runtime. Try to compile with lxml with the "--auto-rpath" option to make it remember where it found the libraries it was built against. Another option is to copy the libxml2 and libxslt tar.gz archives into "lxml-2.2/libs/" and pass --static-deps --libxml2-version=2.X.Y --libxslt-version=1.1.XY to setup.py, which will then build those libs first and build lxml statically against them. Stefan From kevin.p.dwyer at gmail.com Fri Apr 3 17:42:44 2009 From: kevin.p.dwyer at gmail.com (Kev Dwyer) Date: Fri, 3 Apr 2009 16:42:44 +0100 Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is not "http://www.w3.org/2001/XMLSchema" Message-ID: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com> Hello, This is a re-post of my earlier posting, at Stefan's request, without the corporate boilerplate that I inadvertently sent last time. Sorry about that. Bug 354574 logged at Stefan's request. I have encountered a problem with schema object creation with lxml; the problem relates to namespace used for the root element of the schema. 
>>> import lxml.etree >>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r')) >>> et >>> xsd = lxml.etree.XMLSchema(et) Traceback (most recent call last): File "", line 1, in xsd = lxml.etree.XMLSchema(et) File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__ (src/lxml/lxml.etree.c:120919) XMLSchemaParseError: Document is not XML Schema Looking in subversion (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the XMLSchema class I see: # work around for libxml2 bug if document is not XML schema at all #if _LIBXML_VERSION_INT < 20624: c_node = root_node._c_node c_href = _getNs(c_node) if c_href is NULL or \ cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0: raise XMLSchemaParseError, u"Document is not XML Schema" The schemas that I am using use this root element: If I change them to they validate. Can you explain why the earlier namespace definition is unacceptable? Is there a workaround? The schemas are not built by my application, so changing them might be an issue. Cheers, Kevin -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090403/02eb6f38/attachment.htm From stefan_ml at behnel.de Fri Apr 3 21:31:07 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Fri, 03 Apr 2009 21:31:07 +0200 Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is not "http://www.w3.org/2001/XMLSchema" In-Reply-To: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com> References: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com> Message-ID: <49D663FB.2060909@behnel.de> Hi, Kev Dwyer wrote: > I have encountered a problem with schema object creation with lxml; the > problem relates to namespace used for the root element of the schema. 
> > >>>> import lxml.etree >>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r')) >>>> et > >>>> xsd = lxml.etree.XMLSchema(et) > > Traceback (most recent call last): > File "", line 1, in > xsd = lxml.etree.XMLSchema(et) > File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__ > (src/lxml/lxml.etree.c:120919) > XMLSchemaParseError: Document is not XML Schema > > > Looking in subversion > (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the > XMLSchema class I see: > > > > # work around for libxml2 bug if document is not XML schema at > all > #if _LIBXML_VERSION_INT < 20624: > c_node = root_node._c_node > c_href = _getNs(c_node) > if c_href is NULL or \ > cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') > != 0: > raise XMLSchemaParseError, u"Document is not XML Schema" Thanks for pointing me to this, this is a left-over work-around for a bug that no longer exists in more recent libxml2 versions. I'll try to figure out when it was fixed and disable this from that point on. Note that this will not solve your problem, though. > The schemas that I am using use this root element: > I actually had to look this up, and found a lot of documents containing this namespace, but little information why it was changed at the time. It appears to be part of an older specification version that happens to still work for your stylesheets. Note that libxml2 does not support this namespace at all, just like most other validators I could find a link about. > The schemas are not built by my application, so changing them might be > an issue. You can always do a string replace before passing the XML data to the schema parser. Or, you can parse the XML tree using iterparse and fix the namespaces while doing so, simply by overwriting the tag names. You can pass "tag={http://www.w3.org/2000/10/XMLSchema}*" to iterparse() to make sure it only intercepts on the interesting elements. 
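A minimal sketch of the iterparse() approach described above, assuming a reasonably recent lxml release. The helper name parse_with_new_namespace and the inline sample schema are made up for illustration and do not come from this thread. It only rewrites element tags; prefixed names inside attribute values or text content are left untouched.

```python
from io import BytesIO

from lxml import etree

OLD_NS = "http://www.w3.org/2000/10/XMLSchema"
NEW_NS = "http://www.w3.org/2001/XMLSchema"


def parse_with_new_namespace(source):
    """Parse `source`, moving every OLD_NS element into NEW_NS.

    Hypothetical helper: only element tags are rewritten here.
    """
    # The tag filter restricts events to elements in the old namespace.
    it = etree.iterparse(source, events=("end",), tag="{%s}*" % OLD_NS)
    for _event, element in it:
        local_name = etree.QName(element).localname
        element.tag = "{%s}%s" % (NEW_NS, local_name)
    # iterparse() still builds the whole tree; its root is available here.
    return it.root


# Illustrative input only, not one of the schemas from the report.
sample = BytesIO(
    b'<schema xmlns="http://www.w3.org/2000/10/XMLSchema">'
    b'<element name="a"/>'
    b'</schema>'
)
root = parse_with_new_namespace(sample)
print(root.tag)
```

Whether the resulting tree is then accepted by XMLSchema() depends on the schema itself, in particular on whether it refers to types through a prefix that is still bound to the old namespace.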
It will still build the complete tree for you, which you can retrieve using "it.root" at the end. Note that a string replace might still be the safer way to do it, as it also keeps any prefix mappings intact that XMLSchema may use in text content (i.e. qualified names). To be sure that you can safely replace the string, you can parse the XML, serialise it to UTF-8, do the replacement, and then parse it again. Both parsing and serialising are fast, so you may not even notice the difference. Does that help? Stefan From cthedot at gmail.com Sat Apr 4 11:57:46 2009 From: cthedot at gmail.com (chris hoke) Date: Sat, 4 Apr 2009 11:57:46 +0200 Subject: [lxml-dev] setting xslt output encoding with lxml Message-ID: hi, (hope this is the right list for my question) To set the XSL output encoding I normally use in the stylesheet. At least in the Java based XSLT processors it is possible to set some attributes of xsl:output from the "outside" meaning when initializing or starting the transformation. So it is possible for example to overwrite any encoding specified in >> xslt_tree = etree.XML('''\ ... > ... ... ... ... ''') >>> transform = etree.XSLT(xslt_tree) >>> f = StringIO('Text') >>> doc = etree.parse(f) seems to miss an parameter. I have not checked if it works without it but I guess it would be good style to declare any incoming parameters if not for setting a default value, would it not? thanks for any hints, Christof -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090404/e9d7a6e8/attachment.htm From stefan_ml at behnel.de Sat Apr 4 21:01:44 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 04 Apr 2009 21:01:44 +0200 Subject: [lxml-dev] setting xslt output encoding with lxml In-Reply-To: References: Message-ID: <49D7AE98.1000205@behnel.de> Hi, chris hoke wrote: > (hope this is the right list for my question) Yes. 
> To set the XSL output encoding I normally use > in the stylesheet. > > At least in the Java based XSLT processors it is possible to set some > attributes of xsl:output from the "outside" meaning when initializing or > starting the transformation. So it is possible for example to overwrite any > encoding specified in Is there any way to do this with LXML? You can parse the stylesheet with the normal XML parser, change the xsl:output element according to your needs, and pass the result to XSLT(). Note that you can use iterparse(the_file, tag="{...XSL NS...}output") to update the element while parsing. > BTW, the example on http://codespeak.net/lxml/xpathxslt.html#xslt > >>>> xslt_tree = etree.XML('''\ > ... ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> > > ... > ... > ... > ... ''') >>>> transform = etree.XSLT(xslt_tree) >>>> f = StringIO('Text') >>>> doc = etree.parse(f) > > seems to miss an parameter. I have not checked if it > works without it but I guess it would be good style to declare any incoming > parameters if not for setting a default value, would it not? Yes, thanks for catching that. Stefan From stefan_ml at behnel.de Sat Apr 4 22:03:10 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 04 Apr 2009 22:03:10 +0200 Subject: [lxml-dev] setting xslt output encoding with lxml In-Reply-To: <49D7AE98.1000205@behnel.de> References: <49D7AE98.1000205@behnel.de> Message-ID: <49D7BCFE.2060801@behnel.de> Stefan Behnel wrote: > chris hoke wrote: >> To set the XSL output encoding I normally use >> in the stylesheet. >> >> At least in the Java based XSLT processors it is possible to set some >> attributes of xsl:output from the "outside" meaning when initializing or >> starting the transformation. So it is possible for example to overwrite any >> encoding specified in > lxml.etree does not currently support this. I skimmed through the libxslt source and it looks like such a feature is not easily available. 
So the best way to do it is actually to copy and modify the stylesheet document as I explained. Stefan From l at lrowe.co.uk Sat Apr 4 22:34:57 2009 From: l at lrowe.co.uk (Laurence Rowe) Date: Sat, 4 Apr 2009 22:34:57 +0200 Subject: [lxml-dev] setting xslt output encoding with lxml In-Reply-To: <49D7BCFE.2060801@behnel.de> References: <49D7AE98.1000205@behnel.de> <49D7BCFE.2060801@behnel.de> Message-ID: It seems that libxslt respects the last tag found, so just append your required version to the end of the stylesheet: >>> xslt_doc = etree.XML(''' ... ... ...
...
''') >>> str(etree.XSLT(xslt_doc)(etree.XML(''''''))) '\n
\n' >>> xslt_doc.append(etree.XML('''''')) >>> str(etree.XSLT(xslt_doc)(etree.XML(''''''))) '
\n' Laurence 2009/4/4 Stefan Behnel : > > Stefan Behnel wrote: >> chris hoke wrote: >>> To set the XSL output encoding I normally use >>> in the stylesheet. >>> >>> At least in the Java based XSLT processors it is possible to set some >>> attributes of xsl:output from the "outside" meaning when initializing or >>> starting the transformation. So it is possible for example to overwrite any >>> encoding specified in > >> lxml.etree does not currently support this. > > I skimmed through the libxslt source and it looks like such a feature is > not easily available. So the best way to do it is actually to copy and > modify the stylesheet document as I explained. > > Stefan > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev > From kevin.p.dwyer at gmail.com Mon Apr 6 11:32:16 2009 From: kevin.p.dwyer at gmail.com (Kev Dwyer) Date: Mon, 6 Apr 2009 10:32:16 +0100 Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is not "http://www.w3.org/2001/XMLSchema" In-Reply-To: <49D663FB.2060909@behnel.de> References: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com> <49D663FB.2060909@behnel.de> Message-ID: <4d3439f90904060232k7272e44evd84331e3b29c634e@mail.gmail.com> Hello Stefan, Thanks for the speedy response, and for the workaround suggestions. All the best, Kevin 2009/4/3 Stefan Behnel > Hi, > > Kev Dwyer wrote: > > I have encountered a problem with schema object creation with lxml; the > > problem relates to namespace used for the root element of the schema. 
> > > > > >>>> import lxml.etree > >>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r')) > >>>> et > > > >>>> xsd = lxml.etree.XMLSchema(et) > > > > Traceback (most recent call last): > > File "", line 1, in > > xsd = lxml.etree.XMLSchema(et) > > File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__ > > (src/lxml/lxml.etree.c:120919) > > XMLSchemaParseError: Document is not XML Schema > > > > > > Looking in subversion > > (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the > > XMLSchema class I see: > > > > > > > > # work around for libxml2 bug if document is not XML schema > at > > all > > #if _LIBXML_VERSION_INT < 20624: > > c_node = root_node._c_node > > c_href = _getNs(c_node) > > if c_href is NULL or \ > > cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema > ') > > != 0: > > raise XMLSchemaParseError, u"Document is not XML Schema" > > Thanks for pointing me to this, this is a left-over work-around for a bug > that no longer exists in more recent libxml2 versions. I'll try to figure > out when it was fixed and disable this from that point on. Note that this > will not solve your problem, though. > > > > The schemas that I am using use this root element: > > > > I actually had to look this up, and found a lot of documents containing > this namespace, but little information why it was changed at the time. It > appears to be part of an older specification version that happens to still > work for your stylesheets. > > Note that libxml2 does not support this namespace at all, just like most > other validators I could find a link about. > > > > The schemas are not built by my application, so changing them might be > > an issue. > > You can always do a string replace before passing the XML data to the > schema parser. Or, you can parse the XML tree using iterparse and fix the > namespaces while doing so, simply by overwriting the tag names. 
You can
> pass "tag={http://www.w3.org/2000/10/XMLSchema}*"
> to iterparse() to make
> sure it only intercepts on the interesting elements. It will still build
> the complete tree for you, which you can retrieve using "it.root" at the
> end.
>
> Note that a string replace might still be the safer way to do it, as it
> also keeps any prefix mappings intact that XMLSchema may use in text
> content (i.e. qualified names). To be sure that you can safely replace the
> string, you can parse the XML, serialise it to UTF-8, do the replacement,
> and then parse it again. Both parsing and serialising are fast, so you may
> not even notice the difference.
>
> Does that help?
>
> Stefan
>

From friedel at translate.org.za Wed Apr 8 10:27:23 2009
From: friedel at translate.org.za (F Wolff)
Date: Wed, 08 Apr 2009 10:27:23 +0200
Subject: [lxml-dev] Low ASCII values as text
Message-ID: <1239179243.714.11.camel@localhost>

Hallo list

I encountered a small issue from a user's error report, and a way to
duplicate the issue is from this example code:

from lxml import etree
l = etree.Element('cow')
l.text = unicode('\xd0\x94\x1bi\x1b\x1b\x1b?', "utf-8")
etree.fromstring(etree.tostring(l))

With lxml 2.1 I get:

XMLSyntaxError: PCDATA invalid Char value 27, line 1, column 13

It seems that etree.tostring() can generate XML that etree.fromstring()
can't handle. But with a newer version (I think a beta of 2.2), I get
"All strings must be XML compatible : Unicode or ASCII, no NULL bytes"
on the assignment statement (l.text = ...).

So in either case my question is if lxml's handling of these low values
in ASCII is correct, since it doesn't seem possible to actually
represent them at all, but I guess I am missing something important. As
far as I know the XML 1.0 specification demands indicating these with
numeric entities.

Keep well
Friedel

--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/monolingual-translation-formats-considered-harmful

From stefan_ml at behnel.de Wed Apr 8 10:53:01 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 8 Apr 2009 10:53:01 +0200 (CEST)
Subject: [lxml-dev] Low ASCII values as text
In-Reply-To: <1239179243.714.11.camel@localhost>
References: <1239179243.714.11.camel@localhost>
Message-ID: <8d762cc9c058f3f92553166838561d42.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Hi,

F Wolff wrote:
> I encountered a small issue from a user's error report, and a way to
> duplicate the issue is from this example code:
>
> from lxml import etree
> l = etree.Element('cow')
> l.text = unicode('\xd0\x94\x1bi\x1b\x1b\x1b?', "utf-8")
> etree.fromstring(etree.tostring(l))
>
> With lxml 2.1 I get:
>
> XMLSyntaxError: PCDATA invalid Char value 27, line 1, column 13
>
> It seems that etree.tostring() can generate XML that etree.fromstring()
> can't handle.

To be precise, tostring() could generate output that was not XML. That was
clearly a bug.

> But with a newer version (I think a beta of 2.2), I get
> "All strings must be XML compatible : Unicode or ASCII, no NULL bytes"
> on the assignment statement (l.text = ...).

This is in line with the set of allowed characters in XML, the relevant
snippet being:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | ...

"\x1b" is not in this set.

http://www.w3.org/TR/REC-xml/#charsets

> So in either case my question is if lxml's handling of these low values
> in ASCII is correct, since it doesn't seem possible to actually
> represent them at all, but I guess I am missing something important. As
> far as I know the XML 1.0 specification demands indicating these with
> numeric entities.

No, you cannot even represent them as character references, they are simply
not allowed. The only (sensible) way to pass binary data through XML is to
encode it, e.g. using base64.
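The base64 route mentioned above could look roughly like this. It is a minimal sketch: the byte string is the one from the original report, while the variable names are illustrative. The receiving side has to know that the text is base64 and decode it again.

```python
import base64

from lxml import etree

# The bytes from the report, including ESC (0x1b), which XML 1.0 forbids.
raw = b"\xd0\x94\x1bi\x1b\x1b\x1b?"

element = etree.Element("cow")
# base64 output is plain ASCII, so it is always valid XML text content.
element.text = base64.b64encode(raw).decode("ascii")

xml = etree.tostring(element)

# Round trip: parse the serialised XML and decode the payload again.
restored = base64.b64decode(etree.fromstring(xml).text)
print(restored == raw)
```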
This specification was weakened in XML 1.1, which simply allows more
characters, including the range "[#x1-#xD7FF]". However, it still carries
this warning:

"""
Document authors are encouraged to avoid "compatibility characters", as
defined in Unicode [Unicode]. The characters defined in the following ranges
are also discouraged. They are either control characters or permanently
undefined Unicode characters:

[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
...
"""

http://www.w3.org/TR/xml11/#charsets

So, even in XML 1.1, it is still considered a bad idea to use these
characters in text content.

Stefan

From dood at zworg.com Wed Apr 8 11:50:04 2009
From: dood at zworg.com (Adam)
Date: Wed, 8 Apr 2009 09:50:04 +0000 (UTC)
Subject: [lxml-dev] Unicode oddness
Message-ID:

The following seems wrong to me:

I have a utf-8 encoded string with html containing the word 'Français':
>>> html = 'Fran\xc3\xa7ais'

I feed it to lxml.html:
>>> root = lxml.html.fromstring(html)

When I get the text from lxml, it is a unicode string, but it has not been
decoded!:
>>> root.text_content()
u'Fran\xc3\xa7ais'

The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'

Or just get back the encoded utf-8 string 'Fran\xc3\xa7ais'

Either of these results would make sense and work for me. But the result is
an odd confusion of the two.

Is this an lxml problem, or have I misunderstood something?

Thanks,
Adam

From stefan_ml at behnel.de Wed Apr 8 12:12:30 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 8 Apr 2009 12:12:30 +0200 (CEST)
Subject: [lxml-dev] Unicode oddness
In-Reply-To:
References:
Message-ID: <4d5f031bb27ad06e5c1a1872ffe86cc3.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Adam wrote:
> I have a utf-8 encoded string with html containing the word 'Français':
> >>> html = 'Fran\xc3\xa7ais'
>
> I feed it to lxml.html:
> >>> root = lxml.html.fromstring(html)
>
> When I get the text from lxml, it is a unicode string, but it has not been
> decoded!:
> >>> root.text_content()
> u'Fran\xc3\xa7ais'

Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of
knowing what encoding your HTML snippet uses. It therefore falls back to
assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite
happy about this default.

If you know the encoding in advance, you can create your own parser
instance and pass it the "encoding" keyword option.

There are tools that can try to detect an encoding from a string that you
pass in, e.g. chardet. It is, however, impossible for any tool in the world
to always recover the missing encoding information for all possible data.

Stefan

From dood at zworg.com Wed Apr 8 12:27:13 2009
From: dood at zworg.com (Adam)
Date: Wed, 8 Apr 2009 10:27:13 +0000 (UTC)
Subject: [lxml-dev] Unicode oddness
References: <4d5f031bb27ad06e5c1a1872ffe86cc3.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

Stefan Behnel <stefan_ml at behnel.de> writes:
> Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of
> knowing what encoding your HTML snippet uses. It therefore falls back to
> assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite
> happy about this default.
>
> If you know the encoding in advance, you can create your own parser
> instance and pass it the "encoding" keyword option.

Of course! Thank you, I had a feeling I was overlooking something simple.
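For reference, the parser-with-encoding suggestion from this thread might be sketched as follows. The <p> wrapper around the word is an assumption, since the markup of the original snippet did not survive in the archive, and the example is written for a current Python 3 and lxml rather than the 2009 versions discussed here.

```python
import lxml.html

# UTF-8 encoded bytes with no <meta> tag or other encoding declaration.
data = b"<p>Fran\xc3\xa7ais</p>"

# Tell the parser the encoding up front instead of letting it guess.
parser = lxml.html.HTMLParser(encoding="utf-8")
root = lxml.html.fromstring(data, parser=parser)

text = root.text_content()
print(repr(text))
```

This only helps when the encoding really is known in advance; for data of unknown origin a detection step would still be needed, with the limits described above.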
From npowell3 at gmail.com Tue Apr 14 15:55:19 2009
From: npowell3 at gmail.com (Nelson Powell)
Date: Tue, 14 Apr 2009 09:55:19 -0400
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
Message-ID:

I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already built
and installed the libxslt, libxml2, and libgcrypt libraries without any
compile/link issues. The libraries were installed at /usr/local/lib which
is part of my LD_LIBRARY_PATH.

I'm using:
libxslt 1.1.22
libxml2 2.7.2
libgcrypt 11.5.2

However, after building lxml, importing lxml produces the following
messages in python:

bash-3.2# python
Python 2.5.2 (r252:60911, Oct 8 2008, 21:15:13)
[GCC 4.2.4] on qnx6
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree as ET
unknown symbol: gcry_cipher_open
unknown symbol: gcry_cipher_ctl
unknown symbol: gcry_md_hash_buffer
unknown symbol: gcry_cipher_close
unknown symbol: gcry_cipher_encrypt
unknown symbol: gcry_check_version
unknown symbol: gcry_cipher_decrypt
unknown symbol: gcry_strerror
Traceback (most recent call last):
  File "", line 1, in
ImportError: Unresolved symbols
>>>

Is there a way around this libgcrypt stuff? Anyone seen this before?

From stefan_ml at behnel.de Tue Apr 14 17:36:09 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 14 Apr 2009 17:36:09 +0200 (CEST)
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
In-Reply-To:
References:
Message-ID: <22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>

Nelson Powell wrote:
> I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already built
> and installed the libxslt, libxml2, and libgcrypt libraries without any
> compile/link issues.

Does "xsltproc" work on your machine? It comes with libxslt.

libgcrypt is an *optional* dependency of libxslt. Do you need it? How did
you configure the build?

Stefan

From npowell3 at gmail.com Tue Apr 14 17:46:32 2009
From: npowell3 at gmail.com (Nelson Powell)
Date: Tue, 14 Apr 2009 11:46:32 -0400
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
In-Reply-To: <22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References: <22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:

I've run "xsltproc -v" and it reports

Using libxml 20702, libxslt 10122 and libexslt 813
...

So I am assuming it's working. I used the standard ./configure line to set
up before building any of the three library packages. I don't need gcrypt
at all. I just need a few items out of lxml.etree for a build environment
to work like the Windows XP build environment.

Can I remove the need for libgcrypt from the libxslt build?

On Tue, Apr 14, 2009 at 11:36 AM, Stefan Behnel wrote:
> Nelson Powell wrote:
> > I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already built
> > and installed the libxslt, libxml2, and libgcrypt libraries without any
> > compile/link issues.
>
> Does "xsltproc" work on your machine? It comes with libxslt.
>
> libgcrypt is an *optional* dependency of libxslt. Do you need it? How did
> you configure the build?
>
> Stefan
>

From stefan_ml at behnel.de Fri Apr 17 17:10:21 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 17 Apr 2009 17:10:21 +0200 (CEST)
Subject: [lxml-dev] lxml 2.2 released
In-Reply-To: <7bed6e38-b738-484d-99c9-542c992dfdfa@3g2000yqk.googlegroups.com>
References: <7bed6e38-b738-484d-99c9-542c992dfdfa@3g2000yqk.googlegroups.com>
Message-ID:

jasonrbriggs at gmail.com wrote:
> Hope you don't mind a quick question -- is "from the source
> distribution" the only way to install on Python 3?
There are currently no binary packages (that I know of) for Py3.0 or 3.1, so, yes, you have to build it yourself. Stefan From sidnei at enfoldsystems.com Sat Apr 18 04:48:09 2009 From: sidnei at enfoldsystems.com (Sidnei da Silva) Date: Fri, 17 Apr 2009 23:48:09 -0300 Subject: [lxml-dev] docinfo.doctype doesn't include internal entities? Message-ID: Hi there, I am looking for a way to output internal entities that have been parsed from the original document when writing out a tree, but apparently this is not exposed in any attribute. Here's an example: {{{ import lxml.etree document = """ ]>   """ tree = lxml.etree.fromstring(document) print tree.getroottree().docinfo.doctype }}} I would expect this to output: {{{ ]> }}} But instead it gives me: {{{ }}} Is it a bug or I'm not looking at the right place? -- Sidnei da Silva From stefan_ml at behnel.de Sat Apr 18 08:46:40 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Sat, 18 Apr 2009 08:46:40 +0200 Subject: [lxml-dev] docinfo.doctype doesn't include internal entities? In-Reply-To: References: Message-ID: <49E97750.1010903@behnel.de> Sidnei da Silva wrote: > I am looking for a way to output internal entities that have been > parsed from the original document when writing out a tree, but > apparently this is not exposed in any attribute. > > Here's an example: > > {{{ > import lxml.etree > > document = """ > > ]> >   > """ > > > tree = lxml.etree.fromstring(document) > print tree.getroottree().docinfo.doctype > }}} > > I would expect this to output: > {{{ > > ]> > }}} > > But instead it gives me: > > {{{ > > }}} > > Is it a bug or I'm not looking at the right place? What you are looking for is the internal subset of the document, which is not (really) part of the DOCTYPE itself. It's available through the "docinfo.internalDTD" property. However, lxml.etree doesn't expose the content of the DTD, so this is currently only usable for validation (i.e. not very helpful in your case). 
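
To make the distinction between the DOCTYPE and the internal subset concrete, here is a minimal sketch that is not part of the original thread. The sample document and the variable names are made up for illustration, and it assumes a reasonably recent lxml. It reads docinfo.doctype and docinfo.internalDTD, and it parses with resolve_entities=False so that entity references stay in the tree and their names can be collected.

```python
from lxml import etree

# Hypothetical sample document with one entity in its internal subset.
document = b'<!DOCTYPE root [<!ENTITY nbsp "&#160;">]><root>&nbsp;</root>'

# Keep entity references as nodes instead of replacing them with their values.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(document, parser)
docinfo = root.getroottree().docinfo

doctype = docinfo.doctype            # the declaration only, no internal subset
internal_dtd = docinfo.internalDTD   # a DTD object, or None if there is none

# Names of all entity references that remain in the tree.
entity_names = set(node.name for node in root.iter(etree.Entity))

print(doctype)
print(sorted(entity_names))
```

This only recovers the entity names, not their declared values, so it would still not be enough to write the internal subset back out.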
What you could try is to parse the document without resolving the entities, then traverse the Entity elements and collect their names in a set. That will not give you the resolved entity values, though...

I think it would be nice if tostring() could serialise DTDs, but I doubt that there are so many use cases for that. In your case, you'd then have to parse the DTD yourself, which you could also do by clearing the root node and serialising the document to unicode.

Stefan

From agoldgod at gmail.com Wed Apr 22 19:55:32 2009
From: agoldgod at gmail.com (goldgod a)
Date: Wed, 22 Apr 2009 23:25:32 +0530
Subject: [lxml-dev] wsdl link validation.
Message-ID: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>

Hi,

I am using lxml. I have one WSDL (created on the fly using soaplib). The WSDL contains three XSD schemas. I am passing all the XSD schemas in one file as the request. I want to validate each XSD schema one by one. I need your help to implement this. I went through the tutorial and found that XSD schema validation is possible, but my WSDL contains WSDL messages as well as XSD schemas. Please advise me.

--
Thanks & Regards,
Goldgod

From akafubu.kibombo at gmail.com Thu Apr 23 08:41:12 2009
From: akafubu.kibombo at gmail.com (Akafubu Kibombo)
Date: Thu, 23 Apr 2009 01:41:12 -0500
Subject: [lxml-dev] Forms, Cookies, Headers, and Time
Message-ID: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>

I am trying to write a script which fetches a url, logs into the site, then fetches particular items from the page, and goes to the next page, fetching the same type of files on the new page until there are no new pages to fetch from. So I need form and cookie handling, as well as manipulating the headers. What do I need to use?
I found this thread, but I don't understand it: http://codespeak.net/pipermail/lxml-dev/2008-December/004272.html. Also, I don't want to wipe out the server with so many requests, is there a "wait 2 - 3 seconds before fetching the next element" type function?.. Thank you so, so much. -A.F. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090423/84b17707/attachment.htm From douglas at openplans.org Thu Apr 23 17:24:22 2009 From: douglas at openplans.org (Douglas Mayle) Date: Thu, 23 Apr 2009 11:24:22 -0400 Subject: [lxml-dev] Forms, Cookies, Headers, and Time In-Reply-To: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com> References: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com> Message-ID: I wrote a tool to sync safari books downloads that does similar things to what you're talking about. I found the various issues you run into with form and cookie handling when using lxml (and wrote an article about it here: http://douglas.mayle.org/2009/03/05/syncing-safari-downloads-intro-screen-scraping/ ). I spent some time making sure the code was clean and very well documented, so it should help you to get started. The example is here: http://projects.mayle.org/hg/safarisync/file/23cfad04ce3a/safarisync/safarisync/safarisync.py Douglas Mayle On Apr 23, 2009, at 2:41 AM, Akafubu Kibombo wrote: > I am trying to write a script which fetches a url, logs into the > site, then fetches particular items from the page, and goes to the > next page, fetching the same type of files on the new page until > there are no new pages to fetch from. So I need form and cooke > handling, as well as manipulating the headers. What do I need to > use? I found this thread, but I don't understand it: http://codespeak.net/pipermail/lxml-dev/2008-December/004272.html > . 
> > Also, I don't want to wipe out the server with so many requests, is > there a "wait 2 - 3 seconds before fetching the next element" type > function?.. > > Thank you so, so much. > > -A.F. > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090423/20916a1a/attachment.htm From douglas at openplans.org Thu Apr 23 17:24:31 2009 From: douglas at openplans.org (Douglas Mayle) Date: Thu, 23 Apr 2009 11:24:31 -0400 Subject: [lxml-dev] Forms, Cookies, Headers, and Time In-Reply-To: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com> References: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com> Message-ID: <50CA39A0-5DEE-4C12-9795-1CAA9C7DE056@openplans.org> Ahh, randomly enough, the thread you link to is the one I started. After browsing through the lxml code, it turned out that there was no need to pass an open_http parameter, as the default method did almost exactly the same thing as the code sample given and so monkey patching the library (the standard way to add cookie support) already works. Unfortunately, I found out that passing a URL directly to lxml causes it to use libxml's native downloading support, which has no support for cookies. As such, you have to handle all of the downloading of content yourself (except when taking advantage of lxml forms). As to waiting 2-3 seconds before requests, you can just put sleeps into your code, or find some sort of bandwidth throttling package... Douglas Mayle On Apr 23, 2009, at 2:41 AM, Akafubu Kibombo wrote: > I am trying to write a script which fetches a url, logs into the > site, then fetches particular items from the page, and goes to the > next page, fetching the same type of files on the new page until > there are no new pages to fetch from. 
So I need form and cooke > handling, as well as manipulating the headers. What do I need to > use? I found this thread, but I don't understand it: http://codespeak.net/pipermail/lxml-dev/2008-December/004272.html > . > > Also, I don't want to wipe out the server with so many requests, is > there a "wait 2 - 3 seconds before fetching the next element" type > function?.. > > Thank you so, so much. > > -A.F. > _______________________________________________ > lxml-dev mailing list > lxml-dev at codespeak.net > http://codespeak.net/mailman/listinfo/lxml-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090423/638581ac/attachment.htm From frank at chagford.com Mon Apr 27 11:33:06 2009 From: frank at chagford.com (Frank Millman) Date: Mon, 27 Apr 2009 11:33:06 +0200 Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults Message-ID: <20090427093431.366E73F4389@fcserver.chagford.com> Hi all I need to validate an xml document with a schema, and at the same time populate it with any default attributes. This works correctly with minixsv, but it is rather slow, so I am trying lxml. It validates correctly, but I cannot get it to load the default attributes. This is what I am doing - schema = etree.XMLSchema(file='bpmnxpdl_31.xsd') parser = etree.XMLParser(schema=schema, attribute_defaults=True) root = etree.parse('order.xml', parser) Any assistance will be appreciated. 
Thanks Frank Millman From stefan_ml at behnel.de Mon Apr 27 13:08:15 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 27 Apr 2009 13:08:15 +0200 (CEST) Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults In-Reply-To: <20090427093431.366E73F4389@fcserver.chagford.com> References: <20090427093431.366E73F4389@fcserver.chagford.com> Message-ID: <078ab7e7acd478aee7669a6e89c2787a.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Frank Millman wrote: > I need to validate an xml document with a schema, and at the same time > populate it with any default attributes. This works correctly with > minixsv, > but it is rather slow, so I am trying lxml. It validates correctly, but I > cannot get it to load the default attributes. > > This is what I am doing - > > schema = etree.XMLSchema(file='bpmnxpdl_31.xsd') > parser = etree.XMLParser(schema=schema, attribute_defaults=True) > root = etree.parse('order.xml', parser) The "attribute_defaults" flag is currently only used for DTDs. Enabling the same for XML Schema would require setting the "XML_SCHEMA_VAL_VC_I_CREATE" option on the schema validation context, which doesn't seem to work for older (<=2006) libxml2 versions and is not currently done for newer versions. Could you file a feature request for this, so that it doesn't get lost? Thanks, Stefan From frank at chagford.com Mon Apr 27 13:55:39 2009 From: frank at chagford.com (Frank Millman) Date: Mon, 27 Apr 2009 13:55:39 +0200 Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults In-Reply-To: <078ab7e7acd478aee7669a6e89c2787a.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Message-ID: <20090427115704.6359D3F4389@fcserver.chagford.com> Stefan Behnel wrote: > > Frank Millman wrote: > > I need to validate an xml document with a schema, and at > > the same time populate it with any default attributes. > > The "attribute_defaults" flag is currently only used for > DTDs. 
Enabling > the same for XML Schema would require setting the > "XML_SCHEMA_VAL_VC_I_CREATE" option on the schema validation context, > which doesn't seem to work for older (<=2006) libxml2 > versions and is not > currently done for newer versions. Could you file a feature > request for > this, so that it doesn't get lost? > I can't find a section for feature requests. Should I just use the Launchpad bug tracker, or am I looking in the wrong place? Frank From stefan_ml at behnel.de Mon Apr 27 14:19:26 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Mon, 27 Apr 2009 14:19:26 +0200 (CEST) Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults In-Reply-To: <20090427115704.6359D3F4389@fcserver.chagford.com> References: <20090427115704.6359D3F4389@fcserver.chagford.com> Message-ID: Frank Millman wrote: > I can't find a section for feature requests. Should I just use the > Launchpad bug tracker Yep, that's the right place. Stefan From frank at chagford.com Mon Apr 27 14:52:58 2009 From: frank at chagford.com (Frank Millman) Date: Mon, 27 Apr 2009 14:52:58 +0200 Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults In-Reply-To: Message-ID: <20090427125424.0E8CB3F43E0@fcserver.chagford.com> Stefan Behnel wrote: > > Frank Millman wrote: > > I can't find a section for feature requests. Should I just use the > > Launchpad bug tracker > > Yep, that's the right place. > Done - #367942 Thanks, Stefan Frank From dgardner at creatureshop.com Tue Apr 28 01:59:54 2009 From: dgardner at creatureshop.com (David Gardner) Date: Mon, 27 Apr 2009 16:59:54 -0700 Subject: [lxml-dev] eetree.fromsring() returns Element, expected ElementTree Message-ID: <49F646FA.5030001@creatureshop.com> Ran into something that maybe a bug, or at least isn't clear from the documentation [http://codespeak.net/lxml/api/lxml.etree-module.html#fromstring] because it doesn't mention a return type for etree.fromstring(). I had expected it to behave similar to etree.parse(). 
Currently I have a work-around of: tree = etree.ElementTree(etree.fromstring(xml_data)) See below for simple test, and output. ------- #!/usr/bin/python import sys,StringIO from lxml import etree print "lxml.etree: ", etree.LXML_VERSION print "libxml used: ", etree.LIBXML_VERSION print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION print "libxslt used: ", etree.LIBXSLT_VERSION print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION some_xml_data = "data" tree1=etree.fromstring(some_xml_data) tree2=etree.parse(StringIO.StringIO(some_xml_data)) print type(tree1) print type(tree2) -------------- lxml.etree: (2, 1, 5, 0) libxml used: (2, 7, 3) libxml compiled: (2, 6, 32) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) 'lxml.etree._Element' object has no attribute 'write' -- David Gardner Pipeline Tools Programmer, "Sid the Science Kid" Jim Henson Creature Shop dgardner at creatureshop.com From dgardner at creatureshop.com Tue Apr 28 02:03:27 2009 From: dgardner at creatureshop.com (David Gardner) Date: Mon, 27 Apr 2009 17:03:27 -0700 Subject: [lxml-dev] eetree.fromsring() returns Element, expected ElementTree In-Reply-To: <49F646FA.5030001@creatureshop.com> References: <49F646FA.5030001@creatureshop.com> Message-ID: <49F647CF.8040902@creatureshop.com> Woops sorry, I added a bit to the test, before re-pasting, the test code should be: --------------------- #!/usr/bin/python import sys,StringIO from lxml import etree print "lxml.etree: ", etree.LXML_VERSION print "libxml used: ", etree.LIBXML_VERSION print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION print "libxslt used: ", etree.LIBXSLT_VERSION print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION some_xml_data = "data" tree1=etree.fromstring(some_xml_data) tree2=etree.parse(StringIO.StringIO(some_xml_data)) print type(tree1) print type(tree2) out1=StringIO.StringIO() out2=StringIO.StringIO() try: tree1.write(out1,pretty_print=True) except Exception,e: print str(e) try: 
tree2.write(out2,pretty_print=True) except Exception,e: print str(e) ------------------------ lxml.etree: (2, 1, 5, 0) libxml used: (2, 7, 3) libxml compiled: (2, 6, 32) libxslt used: (1, 1, 24) libxslt compiled: (1, 1, 24) 'lxml.etree._Element' object has no attribute 'write' David Gardner wrote: > Ran into something that maybe a bug, or at least isn't clear from the > documentation > [http://codespeak.net/lxml/api/lxml.etree-module.html#fromstring] > because it doesn't mention a return type for etree.fromstring(). I had > expected it to behave similar to etree.parse(). > > Currently I have a work-around of: > tree = etree.ElementTree(etree.fromstring(xml_data)) > > See below for simple test, and output. > > ------- > #!/usr/bin/python > > import sys,StringIO > from lxml import etree > > print "lxml.etree: ", etree.LXML_VERSION > print "libxml used: ", etree.LIBXML_VERSION > print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION > print "libxslt used: ", etree.LIBXSLT_VERSION > print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION > > some_xml_data = "data" > > tree1=etree.fromstring(some_xml_data) > tree2=etree.parse(StringIO.StringIO(some_xml_data)) > > print type(tree1) > print type(tree2) > > -------------- > lxml.etree: (2, 1, 5, 0) > libxml used: (2, 7, 3) > libxml compiled: (2, 6, 32) > libxslt used: (1, 1, 24) > libxslt compiled: (1, 1, 24) > > > 'lxml.etree._Element' object has no attribute 'write' > > -- David Gardner Pipeline Tools Programmer, "Sid the Science Kid" Jim Henson Creature Shop dgardner at creatureshop.com From stefan_ml at behnel.de Tue Apr 28 07:15:27 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 28 Apr 2009 07:15:27 +0200 Subject: [lxml-dev] eetree.fromsring() returns Element, expected ElementTree In-Reply-To: <49F646FA.5030001@creatureshop.com> References: <49F646FA.5030001@creatureshop.com> Message-ID: <49F690EF.7080708@behnel.de> Hi, David Gardner wrote: > Ran into something that maybe a bug, or at least isn't 
clear from the > documentation > [http://codespeak.net/lxml/api/lxml.etree-module.html#fromstring] > because it doesn't mention a return type for etree.fromstring(). I had > expected it to behave similar to etree.parse(). Yes, that's a common misconception. Let's see if this works better: https://codespeak.net/viewvc/lxml/trunk/src/lxml/lxml.etree.pyx?r1=63185&r2=64752 The reason for this difference is that fromstring()/XML() is often used for XML fragments, where returning an ElementTree wouldn't make sense. Stefan From Grimm at juris.de Tue Apr 28 14:32:06 2009 From: Grimm at juris.de (Grimm, Markus) Date: Tue, 28 Apr 2009 14:32:06 +0200 Subject: [lxml-dev] How to set an attribute with a xml-namepace Message-ID: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de> Hi all, I want to set a new element with an attribute using the xml-namespace, f.e. >> xml = "" >> root = etree.fromstring(xml) >> print etree.tostring(root) everything fine, and now... >> root = etree.Element("root", space="preserve") >> print etree.tostring(root) How can I bind the space-attribute to the xml-namespace, so I can output the same result as above ? Thanks, Markus From stefan_ml at behnel.de Tue Apr 28 15:48:45 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 28 Apr 2009 15:48:45 +0200 (CEST) Subject: [lxml-dev] How to set an attribute with a xml-namepace In-Reply-To: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de> References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de> Message-ID: <34f4f681bc565063e45026a82d538ed9.squirrel@groupware.dvs.informatik.tu-darmstadt.de> Grimm, Markus wrote: > I want to set a new element with an attribute using the xml-namespace, > f.e. > > >> xml = "" > >> root = etree.fromstring(xml) > >> print etree.tostring(root) > > > everything fine, and now... 
>
> >> root = etree.Element("root", space="preserve")
> >> print etree.tostring(root)
>
>
> How can I bind the space-attribute to the xml-namespace, so I can output
> the same result as above ?

The namespace that is bound to the "xml" prefix is

    http://www.w3.org/XML/1998/namespace

You use it like this:

    >>> root = etree.Element("root",
    ...     {'{http://www.w3.org/XML/1998/namespace}space' : "preserve"})
    >>> print etree.tostring(root)

Note that you have to pass a dictionary here as you cannot pass the name as a keyword argument. Or use the .set() Element method.

Also see the section on namespaces in the tutorial.

Stefan

From jholg at gmx.de Tue Apr 28 15:50:25 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 28 Apr 2009 15:50:25 +0200
Subject: [lxml-dev] How to set an attribute with a xml-namepace
In-Reply-To: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
Message-ID: <20090428135025.27910@gmx.net>

Hi,

> >> xml = ""
> >> root = etree.fromstring(xml)
> >> print etree.tostring(root)
>
>
> everything fine, and now...
>
> >> root = etree.Element("root", space="preserve")
> >> print etree.tostring(root)
>

>>> root = etree.Element("root", attrib={'{http://www.w3.org/XML/1998/namespace}space': 'preserve'})
>>> print etree.tostring(root)
>>>

Cheers,
Holger

From Grimm at juris.de Tue Apr 28 16:11:33 2009
From: Grimm at juris.de (Grimm, Markus)
Date: Tue, 28 Apr 2009 16:11:33 +0200
Subject: [lxml-dev] How to set an attribute with a xml-namepace
References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de> <20090428135025.27910@gmx.net>
Message-ID: <60C0E6193EB5684091937E26E9862503F8A292@JUREX2.juris.de>

Thanks to Holger and Stefan. I didn't manage to make the mental leap from element to attribute as described in http://codespeak.net/lxml/tutorial.html#the-e-factory :-)

Thanks,
Markus

From velvetcrafter.subscriber at gmail.com Tue Apr 28 19:39:24 2009
From: velvetcrafter.subscriber at gmail.com (Alexis Georges)
Date: Tue, 28 Apr 2009 13:39:24 -0400
Subject: [lxml-dev] XML Documents & I18N (the way Cocoon does it)
Message-ID: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com>

Hello everyone,

I am maintaining a multilingual website which works with XML, XSLT to generate XHTML.

I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using (among other things) their I18NTransformer. Basically I can use elements in the I18N (http://apache.org/cocoon/i18n/2.1) namespace, and then tell Cocoon to apply the I18NTransformer to the document; this replaces the I18N elements with a localized value (e.g. a formatted date/number, a translated label/attribute, etc...).

I have been looking at lxml a little bit to see if I could move to a Python-based framework for the website.
I am not quite sure how to go about the I18N part though. Using the Babel library (http://babel.edgewall.org/) along with request headers to generate localized data, I have everything I need. What is missing is the "parser" for the I18N elements. All I can think of right now is to implement a SAX parser, the way Cocoon does (in Java). Does anyone have suggestions? Am I making this too complicated? Thanks! Alexis -------------- next part -------------- An HTML attachment was scrubbed... URL: http://codespeak.net/pipermail/lxml-dev/attachments/20090428/1bb0d079/attachment.htm From stefan_ml at behnel.de Tue Apr 28 19:59:50 2009 From: stefan_ml at behnel.de (Stefan Behnel) Date: Tue, 28 Apr 2009 19:59:50 +0200 Subject: [lxml-dev] XML Documents & I18N (the way Cocoon does it) In-Reply-To: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com> References: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com> Message-ID: <49F74416.7050209@behnel.de> Hi, Alexis Georges wrote: > I am maintaining a multilingual website which works with XML, XSLT to > generate XHTML. > > I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using > (among other things) their I18NTransformer. Basically I can use elements > in the I18N (http://apache.org/cocoon/i18n/2.1) namespace, and then tell > Cocoon to apply the I18NTransfomer to the document; this replaces the > I18N elements with a localized value (eg. a formatted date/number, a > translated label/attribute, etc...). > > I have been looking at lxml a little bit to see if I could move to a > Python-based framework for the website. I am not quite sure how to go > about the I18N part though. > > Using the Babel library (http://babel.edgewall.org/) along with request > headers to generate localized data, I have everything I need. What is > missing is the "parser" for the I18N elements. All I can think of right > now is to implement a SAX parser, the way Cocoon does (in Java). 
There is a SAX-like interface in lxml.etree, called "target parser". However, if your documents fit into memory, using iterparse() is a lot simpler (and likely not even much slower).

Something like this might work:

    context = etree.iterparse(
        "somefile.xml", tag="{http://apache.org/cocoon/i18n/2.1}*")

    for event, i18n_element in context:
        new_element = get_i18n_replacement_for(i18n_element)
        i18n_element.getparent().replace(i18n_element, new_element)

    # the parsed root element is available as context.root once parsing is done
    context.root.getroottree().write("newfile.xml")

See here for some documentation:

http://codespeak.net/lxml/parsing.html

You can also achieve the same thing in XSLT, or using XPath, or ...

Stefan

From stefan_ml at behnel.de Tue Apr 28 20:11:37 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 28 Apr 2009 20:11:37 +0200
Subject: [lxml-dev] wsdl link validation.
In-Reply-To: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>
References: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>
Message-ID: <49F746D9.5060705@behnel.de>

Hi,

goldgod a wrote:
> I am using the lxml. I have one wsdl(on the fly creation using soaplib).
> The wsdl contains three XSD schema.I am passing all XSD schema in one file
> as request. I want to validate the each XSD schema one by one. I need your
> help to implement this. I gone through the tutorial and found XSD schema
> validation can do but my wsdl contains XSD schema and wsdl messages also.

Well, you could search the three schemas by iterating over the schema root elements using .iter("{schema-namespace}tag-name"), then create an XMLSchema() instance from each of them, and use the three validators to validate your document.

Does that help?

Stefan

From jamie at artefact.org.nz Wed Apr 29 10:44:31 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Wed, 29 Apr 2009 20:44:31 +1200
Subject: [lxml-dev] xpath on text nodes
Message-ID: <1240994671.8989.9.camel@atman.artefact.org.nz>

The xpath method is currently available only for ElementTree and Element objects.
Is it possible for it to be available to text nodes also?

My current use case is getting a certain length text context for a particular element node, and I'd like to implement that through a recursive call to a function that returns the content of a supplied text node appended to the content of the next text node in sequence (provided the required length has not been passed).

Jamie

From stefan_ml at behnel.de Wed Apr 29 17:24:51 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 29 Apr 2009 17:24:51 +0200 (CEST)
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <1240994671.8989.9.camel@atman.artefact.org.nz>
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
Message-ID:

Hi,

Jamie Norrish wrote:
> The xpath method is currently available only for ElementTree and Element
> objects. Is it possible for it to be available to text nodes also?

There is no such concept as a text node in lxml.etree.

> My current use case is getting a certain length text context for a
> particular element node, and I'd like to implement that through a
> recursive call to a function that returns the content of a supplied text
> node appended to the content of the next text node in sequence (provided
> the required length has not been passed).

That sounds a lot like you should do that in Python by using iterwalk() and collecting .text and .tail attributes of Elements, not by using XPath.
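
A minimal sketch of that suggestion, not taken from the original message: the function name leading_text and the sample document are made up for illustration. It walks the subtree with iterwalk(), collects .text on "start" events and .tail on "end" events, and stops once enough characters have been gathered.

```python
from lxml import etree

def leading_text(element, max_chars):
    """Return up to max_chars of text content, in document order,
    from inside `element` (hypothetical helper for illustration)."""
    parts = []
    total = 0
    for event, el in etree.iterwalk(element, events=("start", "end")):
        if event == "start":
            chunk = el.text
        elif el is element:
            break  # the tail of the top element lies outside of it
        else:
            chunk = el.tail
        if chunk:
            parts.append(chunk)
            total += len(chunk)
            if total >= max_chars:
                break
    return "".join(parts)[:max_chars]

root = etree.fromstring("<p>Hello <b>big</b> world<i>!</i></p>")
print(leading_text(root, 8))
print(leading_text(root, 100))
```

The early break means that a large document is only walked as far as needed to fill the requested length.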
Stefan

From jamie at artefact.org.nz Thu Apr 30 06:30:53 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Thu, 30 Apr 2009 16:30:53 +1200
Subject: [lxml-dev] xpath on text nodes
In-Reply-To:
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
Message-ID: <1241065853.5570.4.camel@atman.artefact.org.nz>

Hi,

> There is no such concept as a text node in lxml.etree.

Okay, but the string results of an XPath selecting text nodes in the XML have additional attributes - it just seems a pity that an xpath method isn't one of them.

> That sounds a lot like you should do that in Python by using iterwalk()
> and collecting .text and .tail attributes of Elements, not by using XPath.

Well, I like XPath. :) In fact I already have an implementation of the use case that, while slightly suboptimal, is sufficient - it just seemed like one obvious way of doing it better was to use XPath. I shall investigate using iterwalk instead.

Thanks!

Jamie

From stefan_ml at behnel.de Thu Apr 30 09:42:00 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 30 Apr 2009 09:42:00 +0200 (CEST)
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <1241065853.5570.4.camel@atman.artefact.org.nz>
References: <1240994671.8989.9.camel@atman.artefact.org.nz> <1241065853.5570.4.camel@atman.artefact.org.nz>
Message-ID:

Jamie Norrish wrote:
>> There is no such concept as a text node in lxml.etree.
>
> Okay, but the string results of an XPath selecting text nodes in the XML
> have additional attributes - it just seems a pity that an xpath method
> isn't one of them.

It would be rarely used, I'd say.
What sort of interesting XPath queries could you possibly do on a node that doesn't have any children, nor attributes, nor a tag name or namespace. Also, XPath queries can return Elements and (special) strings, but also plain numbers and boolean values. So you'd still not have a common interface for all possible result types. >> That sounds a lot like you should do that in Python by using iterwalk() >> and collecting .text and .tail attributes of Elements, not by using >> XPath. > > Well, I like XPath. :) In fact I already have an implementation of the > use case that, while slightly subobtimal, is sufficient - it just seemed > like one obvious way of doing it better was to use XPath. I shall > investigate using iterwalk instead. This should basically be a no-brainer with iterwalk(). You iterate over start and end events and just collect the .text values on start and the .tail values on end. Put them in a list, count the total character length on the way, break when it's long enough and ''.join() the list. Stefan From jamie at artefact.org.nz Thu Apr 30 22:07:26 2009 From: jamie at artefact.org.nz (Jamie Norrish) Date: Fri, 01 May 2009 08:07:26 +1200 Subject: [lxml-dev] xpath on text nodes In-Reply-To: References: <1240994671.8989.9.camel@atman.artefact.org.nz> <1241065853.5570.4.camel@atman.artefact.org.nz> Message-ID: <1241122046.5549.19.camel@atman.artefact.org.nz> On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote: > It would be rarely used, I'd say. What sort of interesting XPath queries > could you possibly do on a node that doesn't have any children, nor > attributes, nor a tag name or namespace. Besides selecting other nodes and values relative to the text? Yes, it is possible to use text_result.getparent() and proceed from there - but this has the downside of requiring, for some XPath expressions, the code to modify the expression based on whether text_result was the text or tail of its parent, which is annoying. 
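
For readers unfamiliar with the string results being discussed, here is a small sketch that is not part of the original thread. The helper name describe_text_results and the sample document are assumptions for illustration. It shows the getparent(), is_text and is_tail attributes that lxml attaches to text results of an XPath expression.

```python
from lxml import etree

def describe_text_results(element):
    """Return (string, kind, parent tag) for each text result below
    `element` (hypothetical helper for illustration)."""
    described = []
    for result in element.xpath(".//text()"):
        # is_tail is set when the string is the tail of its "parent"
        kind = "tail" if result.is_tail else "text"
        described.append((str(result), kind, result.getparent().tag))
    return described

root = etree.fromstring("<p>Hello <b>big</b> world</p>")
for row in describe_text_results(root):
    print(row)
```

Note that for a tail string, getparent() returns the element the tail follows rather than the enclosing element, which is why a follow-up expression has to differ between the two cases.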
> Also, XPath queries can return Elements and (special) strings, but
> also plain numbers and boolean values.
> So you'd still not have a common interface for all possible result types.

Well, I'm not really asking for a common interface - only that XPath be enabled for the results of an XPath expression for text(). This would bring it into line with XSLT behaviour, for one. However, I accept that it's not going to be used often, and probably isn't worth you implementing for that reason.

About using iterwalk: this wouldn't seem (on a quick perusal of the documentation) to easily allow for me to get the preceding context of the text result, unless I picked some arbitrary earlier element as the starting point. What am I missing?

Jamie
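
One possible way to approach the preceding-context question, offered only as a sketch and not as an answer from the thread: walk from the document root (rather than an arbitrary earlier element), collect text until the target element starts, and keep the last few characters. The function name preceding_text and the sample document are assumptions for illustration, and the whole prefix is held in memory, which may not suit large documents.

```python
from lxml import etree

def preceding_text(root, target, max_chars):
    """Return up to max_chars of the text that precedes `target`
    in document order (hypothetical helper for illustration)."""
    if max_chars <= 0:
        return ""
    parts = []
    for event, el in etree.iterwalk(root, events=("start", "end")):
        if event == "start":
            if el is target:
                break  # everything collected so far precedes the target
            chunk = el.text
        else:
            # the tail of the root element is not part of the tree content
            chunk = el.tail if el is not root else None
        if chunk:
            parts.append(chunk)
    return "".join(parts)[-max_chars:]

root = etree.fromstring("<p>One <b>two</b> three <i>four</i> five</p>")
target = root.find("i")
print(repr(preceding_text(root, target, 6)))
```

If the starting point is a tail string from an XPath result, the same walk could be stopped at the "end" event of its getparent() element instead.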