From Kevin.Dwyer at misys.com Thu Apr 2 18:48:22 2009
From: Kevin.Dwyer at misys.com (Dwyer, Kevin)
Date: Thu, 2 Apr 2009 17:48:22 +0100
Subject: [lxml-dev] XMLSchemaParseError: Document is not XML Schema
Message-ID: <63C2A154B1708946B60726AFDBA00AC004894CE2@ukmailemea01.misys.global.ad>
Hello,
I have encountered a problem with schema object creation with lxml; the
problem relates to namespace used for the root element of the schema.
>>> import lxml.etree
>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>> et
>>> xsd = lxml.etree.XMLSchema(et)
Traceback (most recent call last):
File "", line 1, in
xsd = lxml.etree.XMLSchema(et)
File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
(src/lxml/lxml.etree.c:120919)
XMLSchemaParseError: Document is not XML Schema
Looking in subversion
(http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
XMLSchema class I see:
    # work around for libxml2 bug if document is not XML schema at all
    #if _LIBXML_VERSION_INT < 20624:
    c_node = root_node._c_node
    c_href = _getNs(c_node)
    if c_href is NULL or \
            cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
        raise XMLSchemaParseError, u"Document is not XML Schema"
The schemas that I am using declare their root element in the
http://www.w3.org/2000/10/XMLSchema namespace.
If I change them to use http://www.w3.org/2001/XMLSchema, they validate.
Can you explain why the earlier namespace definition is unacceptable?
Is there a workaround?
The schemas are not built by my application, so changing them might be
an issue.
Cheers,
Kevin
From victoryetc at gmail.com Fri Apr 3 07:53:17 2009
From: victoryetc at gmail.com (Victor Borda)
Date: Thu, 2 Apr 2009 22:53:17 -0700
Subject: [lxml-dev] problems with custom build
Message-ID: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com>
Hi List,
I was very excited today to find the lxml module for python. As I need to
write some xml checking scripts and finding bash scripting not well suited,
I have decided to give python a try. So far I like it. However, here is the
situation:
1) The target platform has an older RedHat installation with glibc2.3.4, so
none of the binaries for libxml or libxslt were of any use. So I had to
build from source on those. Not too painful.
2) However, trying to get lxml running has been really difficult. I need
help here.
3) The target machine is not connected to the internet. It is not able to
remotely retrieve packages.
Questions/Steps:
0) There don't appear to be any rpm's for lxml. Is this correct?
1) Since I don't have an internet connection from this machine it means I
have to build from source, don't I (ie easy_install is not an option)?
2) I have assumed that I do have to build from source so I have given it a
shot. I copied over the lxml2.2 tar file, unzipped it.
3) I got setuptools-0.6c9-py2.3.egg and dropped it in that unzipped directory,
and ran python ez_setup.py which seemed to go fine.
4) Then I ran python setup.py build. The build seemed to go fine.
5) I go to run test.py and I get this error message:
[]# python test.py
Traceback (most recent call last):
File "test.py", line 595, in ?
exitcode = main(sys.argv)
File "test.py", line 558, in main
test_cases = get_test_cases(test_files, cfg, tracer=tracer)
File "test.py", line 260, in get_test_cases
module = import_module(file, cfg, tracer=tracer)
File "test.py", line 203, in import_module
mod = __import__(modname)
File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py",
line 12, in ?
from lxml import etree
ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so:
undefined symbol: xmlSchematronFree
And with that, I tried running 'make test' and got the same result. The
build appeared to go fine. The contents of
lxml-2.2/build/lib.linux-x86_64-2.3/lxml
are:
-rw-r--r-- 1 root root 7637 Jun 19 2008 builder.py
-rw-r--r-- 1 root root 28750 Nov 23 19:33 cssselect.py
-rw-r--r-- 1 root root 18287 May 31 2008 doctestcompare.py
-rw-r--r-- 1 root root 7641 Jul 9 2008 ElementInclude.py
-rw-r--r-- 1 root root 6407 Feb 27 14:45 _elementpath.py
-rwxr-xr-x 1 root root 3125362 Apr 3 04:49 etree.so
drwxr-xr-x 2 root root 4096 Apr 3 04:49 html
-rw-r--r-- 1 root root 21 Oct 22 2007 __init__.py
-rwxr-xr-x 1 root root 846592 Apr 3 04:49 objectify.so
-rw-r--r-- 1 root root 87 Mar 2 2008 pyclasslookup.py
-rw-r--r-- 1 root root 8229 May 31 2008 sax.py
-rw-r--r-- 1 root root 230 May 31 2008 usedoctest.py
--
Best Regards,
Victor Borda
From stefan_ml at behnel.de Fri Apr 3 08:09:05 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 03 Apr 2009 08:09:05 +0200
Subject: [lxml-dev] problems with custom build
In-Reply-To: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com>
References: <287f11b30904022253r6ceef7a3k3d5f448d5dcb50f6@mail.gmail.com>
Message-ID: <49D5A801.2020004@behnel.de>
Hi,
Victor Borda wrote:
> I was very excited today to find the lxml module for python. As I need to
> write some xml checking scripts and finding bash scripting not well suited,
> I have decided to give python a try. So far I like it. However, here is the
> situation:
>
> 1) The target platform has an older RedHat installation with glibc2.3.4, so
> none of the binaries for libxml or libxslt were of any use. So I had to
> build from source on those. Not too painful.
> 2) However, trying to get lxml running has been really difficult. I need
> help here.
> 3) The target machine is not connected to the internet. It is not able to
> remotely retrieve packages.
>
> Questions/Steps:
> 0) There don't appear to be any rpm's for lxml. Is this correct?
> 1) Since I don't have an internet connection from this machine it means I
> have to build from source, don't I (ie easy_install is not an option)?
> 2) I have assumed that I do have to build from source so I have given it a
> shot. I copied over the lxml2.2 tar file, unzipped it.
> 3) I got setuptools-0.6c9-py2.3.egg and dropped in that unzipped directly,
> and ran python ez_setup.py which seemed to go fine.
> 4) Then I ran python setup.py build. The build seemed to go fine.
> 5) I go to run test.py and I get this error message:
>
> []# python test.py
> Traceback (most recent call last):
> File "test.py", line 595, in ?
> exitcode = main(sys.argv)
> File "test.py", line 558, in main
> test_cases = get_test_cases(test_files, cfg, tracer=tracer)
> File "test.py", line 260, in get_test_cases
> module = import_module(file, cfg, tracer=tracer)
> File "test.py", line 203, in import_module
> mod = __import__(modname)
> File "/home/victorborda/buildstuff/lxml-2.2/src/lxml/html/__init__.py",
> line 12, in ?
> from lxml import etree
> ImportError: /home/victorborda/buildstuff/lxml-2.2/src/lxml/etree.so:
> undefined symbol: xmlSchematronFree
I assume that you have installed newer versions of libxml2 and libxslt
somewhere, but it looks like lxml can't find them at runtime. Try to
compile lxml with the "--auto-rpath" option to make it remember where
it found the libraries it was built against.
Another option is to copy the libxml2 and libxslt tar.gz archives into
"lxml-2.2/libs/" and pass
--static-deps --libxml2-version=2.X.Y --libxslt-version=1.1.XY
to setup.py, which will then build those libs first and build lxml
statically against them.
Stefan
From kevin.p.dwyer at gmail.com Fri Apr 3 17:42:44 2009
From: kevin.p.dwyer at gmail.com (Kev Dwyer)
Date: Fri, 3 Apr 2009 16:42:44 +0100
Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is not
"http://www.w3.org/2001/XMLSchema"
Message-ID: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com>
Hello,
This is a re-post of my earlier posting, at Stefan's request, without the
corporate boilerplate
that I inadvertently sent last time. Sorry about that.
Bug 354574 logged at Stefan's request.
I have encountered a problem with schema object creation with lxml; the
problem relates to namespace used for the root element of the schema.
>>> import lxml.etree
>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>> et
>>> xsd = lxml.etree.XMLSchema(et)
Traceback (most recent call last):
File "", line 1, in
xsd = lxml.etree.XMLSchema(et)
File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
(src/lxml/lxml.etree.c:120919)
XMLSchemaParseError: Document is not XML Schema
Looking in subversion
(http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
XMLSchema class I see:
    # work around for libxml2 bug if document is not XML schema at all
    #if _LIBXML_VERSION_INT < 20624:
    c_node = root_node._c_node
    c_href = _getNs(c_node)
    if c_href is NULL or \
            cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
        raise XMLSchemaParseError, u"Document is not XML Schema"
The schemas that I am using declare their root element in the
http://www.w3.org/2000/10/XMLSchema namespace.
If I change them to use http://www.w3.org/2001/XMLSchema,
they validate.
Can you explain why the earlier namespace definition is unacceptable?
Is there a workaround?
The schemas are not built by my application, so changing them might be
an issue.
Cheers,
Kevin
From stefan_ml at behnel.de Fri Apr 3 21:31:07 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 03 Apr 2009 21:31:07 +0200
Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is
not "http://www.w3.org/2001/XMLSchema"
In-Reply-To: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com>
References: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com>
Message-ID: <49D663FB.2060909@behnel.de>
Hi,
Kev Dwyer wrote:
> I have encountered a problem with schema object creation with lxml; the
> problem relates to namespace used for the root element of the schema.
>
>
>>>> import lxml.etree
>>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
>>>> et
>
>>>> xsd = lxml.etree.XMLSchema(et)
>
> Traceback (most recent call last):
> File "", line 1, in
> xsd = lxml.etree.XMLSchema(et)
> File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
> (src/lxml/lxml.etree.c:120919)
> XMLSchemaParseError: Document is not XML Schema
>
>
> Looking in subversion
> (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
> XMLSchema class I see:
>
>
>
>     # work around for libxml2 bug if document is not XML schema at all
>     #if _LIBXML_VERSION_INT < 20624:
>     c_node = root_node._c_node
>     c_href = _getNs(c_node)
>     if c_href is NULL or \
>             cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
>         raise XMLSchemaParseError, u"Document is not XML Schema"
Thanks for pointing me to this, this is a left-over work-around for a bug
that no longer exists in more recent libxml2 versions. I'll try to figure
out when it was fixed and disable this from that point on. Note that this
will not solve your problem, though.
> The schemas that I am using use this root element:
>
I actually had to look this up, and found a lot of documents containing
this namespace, but little information why it was changed at the time. It
appears to be part of an older specification version that happens to still
work for your schemas.
Note that libxml2 does not support this namespace at all, just like most
other validators I could find a link about.
> The schemas are not built by my application, so changing them might be
> an issue.
You can always do a string replace before passing the XML data to the
schema parser. Or, you can parse the XML tree using iterparse and fix the
namespaces while doing so, simply by overwriting the tag names. You can
pass "tag={http://www.w3.org/2000/10/XMLSchema}*" to iterparse() to make
sure it only intercepts on the interesting elements. It will still build
the complete tree for you, which you can retrieve using "it.root" at the end.
Note that a string replace might still be the safer way to do it, as it
also keeps any prefix mappings intact that XMLSchema may use in text
content (i.e. qualified names). To be sure that you can safely replace the
string, you can parse the XML, serialise it to UTF-8, do the replacement,
and then parse it again. Both parsing and serialising are fast, so you may
not even notice the difference.
Does that help?
Stefan
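
A minimal sketch of the string-replace workaround described above. The
function names fix_schema_namespace and load_legacy_schema are made up for
this illustration, and the code assumes Python 3 and that the old namespace
URI is not used for anything unrelated in the document. Rewriting the URI
does not translate constructs that changed between the draft and the final
specification, so the patched document may still fail to compile as a schema.

```python
OLD_NS = b"http://www.w3.org/2000/10/XMLSchema"
NEW_NS = b"http://www.w3.org/2001/XMLSchema"


def fix_schema_namespace(data):
    """Replace the draft XML Schema namespace URI in serialised UTF-8 XML.

    Note: this also rewrites URIs that merely start with the old one,
    such as the matching "-instance" namespace.
    """
    return data.replace(OLD_NS, NEW_NS)


def load_legacy_schema(path):
    """Parse, serialise to UTF-8, patch the namespace, build the schema.

    Hypothetical helper; requires lxml to be installed.
    """
    from lxml import etree  # imported lazily; the helper above needs no lxml

    # Round-trip through the parser so the byte string is known to be UTF-8.
    data = etree.tostring(etree.parse(path), encoding="UTF-8")
    return etree.XMLSchema(etree.fromstring(fix_schema_namespace(data)))
```

Because the replacement works on the serialised bytes, namespace prefixes
used inside attribute values and text content are left untouched.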
From cthedot at gmail.com Sat Apr 4 11:57:46 2009
From: cthedot at gmail.com (chris hoke)
Date: Sat, 4 Apr 2009 11:57:46 +0200
Subject: [lxml-dev] setting xslt output encoding with lxml
Message-ID:
hi,
(hope this is the right list for my question)
To set the XSL output encoding I normally use an xsl:output element with an
encoding attribute in the stylesheet.
At least in the Java based XSLT processors it is possible to set some
attributes of xsl:output from the "outside" meaning when initializing or
starting the transformation. So it is possible for example to overwrite any
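encoding specified in xsl:output. Is there any way to do this with LXML?
BTW, the example on http://codespeak.net/lxml/xpathxslt.html#xslt
>>> xslt_tree = etree.XML('''\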
... >
...
...
...
... ''')
>>> transform = etree.XSLT(xslt_tree)
>>> f = StringIO('Text')
>>> doc = etree.parse(f)
seems to miss an xsl:param declaration. I have not checked if it
works without it but I guess it would be good style to declare any incoming
parameters if not for setting a default value, would it not?
thanks for any hints,
Christof
From stefan_ml at behnel.de Sat Apr 4 21:01:44 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 04 Apr 2009 21:01:44 +0200
Subject: [lxml-dev] setting xslt output encoding with lxml
In-Reply-To:
References:
Message-ID: <49D7AE98.1000205@behnel.de>
Hi,
chris hoke wrote:
> (hope this is the right list for my question)
Yes.
> To set the XSL output encoding I normally use
> in the stylesheet.
>
> At least in the Java based XSLT processors it is possible to set some
> attributes of xsl:output from the "outside" meaning when initializing or
> starting the transformation. So it is possible for example to overwrite any
> encoding specified in xsl:output. Is there any way to do this with LXML?
You can parse the stylesheet with the normal XML parser, change the
xsl:output element according to your needs, and pass the result to XSLT().
Note that you can use
iterparse(the_file, tag="{...XSL NS...}output")
to update the element while parsing.
> BTW, the example on http://codespeak.net/lxml/xpathxslt.html#xslt
>
>>>> xslt_tree = etree.XML('''\
> ... ... xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>
> ...
> ...
> ...
> ... ''')
>>>> transform = etree.XSLT(xslt_tree)
>>>> f = StringIO('Text')
>>>> doc = etree.parse(f)
>
> seems to miss an parameter. I have not checked if it
> works without it but I guess it would be good style to declare any incoming
> parameters if not for setting a default value, would it not?
Yes, thanks for catching that.
Stefan
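
The approach suggested above (parse the stylesheet as ordinary XML, adjust
xsl:output, then compile it) might look roughly like this. The name
set_output_encoding is invented for this sketch, and the example stylesheet
is a placeholder. The code only uses the ElementTree API that lxml.etree
shares with the standard library, so it is shown here with
xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
OUTPUT_TAG = "{%s}output" % XSL_NS


def set_output_encoding(stylesheet_root, encoding):
    """Set the encoding on every top-level xsl:output element.

    Returns the number of elements that were updated.
    """
    outputs = stylesheet_root.findall(OUTPUT_TAG)
    for output in outputs:
        output.set("encoding", encoding)
    return len(outputs)


# Placeholder stylesheet, only here to exercise the function.
stylesheet = ET.fromstring(
    '<xsl:stylesheet version="1.0" xmlns:xsl="%s">'
    '<xsl:output method="xml" encoding="UTF-8"/>'
    '</xsl:stylesheet>' % XSL_NS
)
changed = set_output_encoding(stylesheet, "ISO-8859-1")
```

With lxml, the same function should work on the root of a tree returned by
lxml.etree.parse(), and the modified tree can then be passed to
lxml.etree.XSLT().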
From stefan_ml at behnel.de Sat Apr 4 22:03:10 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 04 Apr 2009 22:03:10 +0200
Subject: [lxml-dev] setting xslt output encoding with lxml
In-Reply-To: <49D7AE98.1000205@behnel.de>
References:
<49D7AE98.1000205@behnel.de>
Message-ID: <49D7BCFE.2060801@behnel.de>
Stefan Behnel wrote:
> chris hoke wrote:
>> To set the XSL output encoding I normally use
>> in the stylesheet.
>>
>> At least in the Java based XSLT processors it is possible to set some
>> attributes of xsl:output from the "outside" meaning when initializing or
>> starting the transformation. So it is possible for example to overwrite any
>> encoding specified in
> lxml.etree does not currently support this.
I skimmed through the libxslt source and it looks like such a feature is
not easily available. So the best way to do it is actually to copy and
modify the stylesheet document as I explained.
Stefan
From l at lrowe.co.uk Sat Apr 4 22:34:57 2009
From: l at lrowe.co.uk (Laurence Rowe)
Date: Sat, 4 Apr 2009 22:34:57 +0200
Subject: [lxml-dev] setting xslt output encoding with lxml
In-Reply-To: <49D7BCFE.2060801@behnel.de>
References:
<49D7AE98.1000205@behnel.de> <49D7BCFE.2060801@behnel.de>
Message-ID:
It seems that libxslt respects the last xsl:output tag found, so
just append your required version to the end of the stylesheet:
>>> xslt_doc = etree.XML('''
...
...
...
... ''')
>>> str(etree.XSLT(xslt_doc)(etree.XML('''''')))
'\n
\n'
>>> xslt_doc.append(etree.XML(''''''))
>>> str(etree.XSLT(xslt_doc)(etree.XML('''''')))
'
\n'
Laurence
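
A rough sketch of the append-an-override idea, written against the
ElementTree API shared by lxml.etree and the standard library. The helper
name with_output_override is hypothetical. Whether the processor honours the
last xsl:output is the observation reported above, not something this code
checks. The stylesheet is copied first so the original tree stays unchanged.

```python
import copy
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def with_output_override(stylesheet_root, **attributes):
    """Return a copy of the stylesheet with an extra xsl:output appended."""
    patched = copy.deepcopy(stylesheet_root)
    override = patched.makeelement("{%s}output" % XSL_NS, dict(attributes))
    patched.append(override)  # last top-level element in the stylesheet
    return patched


# Placeholder stylesheet, only here to exercise the helper.
original = ET.fromstring(
    '<xsl:stylesheet version="1.0" xmlns:xsl="%s">'
    '<xsl:output encoding="UTF-8"/>'
    '</xsl:stylesheet>' % XSL_NS
)
patched = with_output_override(original, encoding="ISO-8859-1")
```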
2009/4/4 Stefan Behnel :
>
> Stefan Behnel wrote:
>> chris hoke wrote:
>>> To set the XSL output encoding I normally use
>>> in the stylesheet.
>>>
>>> At least in the Java based XSLT processors it is possible to set some
>>> attributes of xsl:output from the "outside" meaning when initializing or
>>> starting the transformation. So it is possible for example to overwrite any
>>> encoding specified in >
>> lxml.etree does not currently support this.
>
> I skimmed through the libxslt source and it looks like such a feature is
> not easily available. So the best way to do it is actually to copy and
> modify the stylesheet document as I explained.
>
> Stefan
From kevin.p.dwyer at gmail.com Mon Apr 6 11:32:16 2009
From: kevin.p.dwyer at gmail.com (Kev Dwyer)
Date: Mon, 6 Apr 2009 10:32:16 +0100
Subject: [lxml-dev] XMLSchemaParseError if XML schema namespace uri is
not "http://www.w3.org/2001/XMLSchema"
In-Reply-To: <49D663FB.2060909@behnel.de>
References: <4d3439f90904030842y2a6c22e1i2fd65fac510e57e@mail.gmail.com>
<49D663FB.2060909@behnel.de>
Message-ID: <4d3439f90904060232k7272e44evd84331e3b29c634e@mail.gmail.com>
Hello Stefan,
Thanks for the speedy response, and for the workaround suggestions.
All the best,
Kevin
2009/4/3 Stefan Behnel
> Hi,
>
> Kev Dwyer wrote:
> > I have encountered a problem with schema object creation with lxml; the
> > problem relates to namespace used for the root element of the schema.
> >
> >
> >>>> import lxml.etree
> >>>> et = lxml.etree.ElementTree(file=open('c:\\temp\\MySchema', 'r'))
> >>>> et
> >
> >>>> xsd = lxml.etree.XMLSchema(et)
> >
> > Traceback (most recent call last):
> > File "", line 1, in
> > xsd = lxml.etree.XMLSchema(et)
> > File "xmlschema.pxi", line 50, in lxml.etree.XMLSchema.__init__
> > (src/lxml/lxml.etree.c:120919)
> > XMLSchemaParseError: Document is not XML Schema
> >
> >
> > Looking in subversion
> > (http://codespeak.net/svn/lxml/trunk/src/lxml/xmlschema.pxi), in the
> > XMLSchema class I see:
> >
> >
> >
> >     # work around for libxml2 bug if document is not XML schema at all
> >     #if _LIBXML_VERSION_INT < 20624:
> >     c_node = root_node._c_node
> >     c_href = _getNs(c_node)
> >     if c_href is NULL or \
> >             cstd.strcmp(c_href, 'http://www.w3.org/2001/XMLSchema') != 0:
> >         raise XMLSchemaParseError, u"Document is not XML Schema"
>
> Thanks for pointing me to this, this is a left-over work-around for a bug
> that no longer exists in more recent libxml2 versions. I'll try to figure
> out when it was fixed and disable this from that point on. Note that this
> will not solve your problem, though.
>
>
> > The schemas that I am using use this root element:
> >
>
> I actually had to look this up, and found a lot of documents containing
> this namespace, but little information why it was changed at the time. It
> appears to be part of an older specification version that happens to still
> work for your stylesheets.
>
> Note that libxml2 does not support this namespace at all, just like most
> other validators I could find a link about.
>
>
> > The schemas are not built by my application, so changing them might be
> > an issue.
>
> You can always do a string replace before passing the XML data to the
> schema parser. Or, you can parse the XML tree using iterparse and fix the
> namespaces while doing so, simply by overwriting the tag names. You can
> pass "tag={http://www.w3.org/2000/10/XMLSchema}*"
> to iterparse() to make
> sure it only intercepts on the interesting elements. It will still build
> the complete tree for you, which you can retrieve using "it.root" at the
> end.
>
> Note that a string replace might still be the safer way to do it, as it
> also keeps any prefix mappings intact that XMLSchema may use in text
> content (i.e. qualified names). To be sure that you can safely replace the
> string, you can parse the XML, serialise it to UTF-8, do the replacement,
> and then parse it again. Both parsing and serialising are fast, so you may
> not even notice the difference.
>
> Does that help?
>
> Stefan
>
From friedel at translate.org.za Wed Apr 8 10:27:23 2009
From: friedel at translate.org.za (F Wolff)
Date: Wed, 08 Apr 2009 10:27:23 +0200
Subject: [lxml-dev] Low ASCII values as text
Message-ID: <1239179243.714.11.camel@localhost>
Hallo list
I encountered a small issue from a user's error report, and a way to
duplicate the issue is from this example code:
from lxml import etree
l = etree.Element('cow')
l.text = unicode('\xd0\x94\x1bi\x1b\x1b\x1b?', "utf-8")
etree.fromstring(etree.tostring(l))
With lxml 2.1 I get:
XMLSyntaxError: PCDATA invalid Char value 27, line 1, column 13
It seems that etree.tostring() can generate XML that etree.fromstring()
can't handle.
But with a newer version (I think a beta of 2.2), I get
"All strings must be XML compatible : Unicode or ASCII, no NULL bytes"
on the assignment statement (l.text = ...).
So in either case my question is if lxml's handling of these low values
in ASCII is correct, since it doesn't seem possible to actually
represent them at all, but I guess I am missing something important. As
far as I know the XML 1.0 specification demands indicating these with
numeric entities.
Keep well
Friedel
--
Recently on my blog:
http://translate.org.za/blogs/friedel/en/content/monolingual-translation-formats-considered-harmful
From stefan_ml at behnel.de Wed Apr 8 10:53:01 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 8 Apr 2009 10:53:01 +0200 (CEST)
Subject: [lxml-dev] Low ASCII values as text
In-Reply-To: <1239179243.714.11.camel@localhost>
References: <1239179243.714.11.camel@localhost>
Message-ID: <8d762cc9c058f3f92553166838561d42.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Hi,
F Wolff wrote:
> I encountered a small issue from a user's error report, and a way to
> duplicate the issue is from this example code:
>
> from lxml import etree
> l = etree.Element('cow')
> l.text = unicode('\xd0\x94\x1bi\x1b\x1b\x1b?', "utf-8")
> etree.fromstring(etree.tostring(l))
>
> With lxml 2.1 I get:
>
> XMLSyntaxError: PCDATA invalid Char value 27, line 1, column 13
>
> It seems that etree.tostring() can generate XML that etree.fromstring()
> can't handle.
To be precise, tostring() could generate output that was not XML. That was
clearly a bug.
> But with a newer version (I think a beta of 2.2), I get
> "All strings must be XML compatible : Unicode or ASCII, no NULL bytes"
> on the assignment statement (l.text = ...).
This is in line with the set of allowed characters in XML, the relevant
snippet being:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | ...
"\x1b" is not in this set.
http://www.w3.org/TR/REC-xml/#charsets
> So in either case my question is if lxml's handling of these low values
> in ASCII is correct, since it doesn't seem possible to actually
> represent them at all, but I guess I am missing something important. As
> far as I know the XML 1.0 specification demands indicating these with
> numeric entities.
No, you cannot even represent them as character references, they are
simply not allowed. The only (sensible) way to pass binary data through
XML is to encode it, e.g. using base64.
This specification was weakened in XML 1.1, which simply allows more
characters, including the range "[#x1-#xD7FF]". However, it still carries
this warning:
"""
Document authors are encouraged to avoid "compatibility characters", as
defined in Unicode [Unicode]. The characters defined in the following
ranges are also discouraged. They are either control characters or
permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
...
"""
http://www.w3.org/TR/xml11/#charsets
So, even in XML 1.1, it is still considered a bad idea to use these
characters in text content.
Stefan
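
The two points above (which characters XML 1.0 allows, and base64 as a way
to carry arbitrary bytes) can be sketched as follows. This is an
illustration only, assuming Python 3. The helper names are invented here and
are not part of lxml.

```python
import base64
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def is_xml_compatible(text):
    """Return True if every character of text is allowed in XML 1.0."""
    return _INVALID_XML_CHARS.search(text) is None


def to_xml_safe_text(data):
    """Encode arbitrary bytes as base64 text that is safe in XML content."""
    return base64.b64encode(data).decode("ascii")


def from_xml_safe_text(text):
    """Reverse to_xml_safe_text()."""
    return base64.b64decode(text)


raw = b"\xd0\x94\x1bi\x1b\x1b\x1b?"  # the UTF-8 bytes from the report
```

Checking a string with is_xml_compatible() before assigning it to .text
shows in advance whether the assignment would be rejected. The base64
helpers keep the original bytes recoverable, at the cost of the element
text no longer being human-readable.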
From dood at zworg.com Wed Apr 8 11:50:04 2009
From: dood at zworg.com (Adam)
Date: Wed, 8 Apr 2009 09:50:04 +0000 (UTC)
Subject: [lxml-dev] Unicode oddness
Message-ID:
The following seems wrong to me:
I have a utf-8 encoded string with html containing the word 'Français':
>>> html = 'Fran\xc3\xa7ais'
I feed it to lxml.html:
>>> root = lxml.html.fromstring(html)
When I get the text from lxml, it is a unicode string, but it has not been
decoded!:
>>> root.text_content()
u'Fran\xc3\xa7ais'
The expected output would be decoded unicode, i.e. the result of:
>>> 'Fran\xc3\xa7ais'.decode('utf-8')
u'Fran\xe7ais'
Or just get back the encoded utf-8 string 'Fran\xc3\xa7ais'
Either of these results would make sense and work for me. But the result is an
odd confusion of the two. Is this an lxml problem, or have I misunderstood
something?
Thanks, Adam
From stefan_ml at behnel.de Wed Apr 8 12:12:30 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 8 Apr 2009 12:12:30 +0200 (CEST)
Subject: [lxml-dev] Unicode oddness
In-Reply-To:
References:
Message-ID: <4d5f031bb27ad06e5c1a1872ffe86cc3.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Adam wrote:
> I have a utf-8 encoded string with html containing the word 'Français':
> >>> html = 'Fran\xc3\xa7ais'
>
> I feed it to lxml.html:
> >>> root = lxml.html.fromstring(html)
>
> When I get the text from lxml, it is a unicode string, but it has not been
> decoded!:
> >>> root.text_content()
> u'Fran\xc3\xa7ais'
Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of
knowing what encoding your HTML snippet uses. It therefore falls back to
assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite
happy about this default.
If you know the encoding in advance, you can create your own parser
instance and pass it the "encoding" keyword option. There are tools that
can try to detect an encoding from a string that you pass in, e.g.
chardet. It is, however, impossible for any tool in the world to always
recover the missing encoding information for all possible data.
Stefan
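
A small sketch of what happens here and of the explicit-encoding fix,
assuming Python 3 byte strings. The first part only uses the standard
library to show how the same bytes read under the two encodings. The helper
parse_utf8_html is a made-up name and needs lxml installed.

```python
data = b"Fran\xc3\xa7ais"  # UTF-8 bytes, as in the report above

as_latin1 = data.decode("latin-1")  # what a Latin-1 fallback produces
as_utf8 = data.decode("utf-8")      # what the poster expected


def parse_utf8_html(html_bytes):
    """Parse HTML bytes with a parser that is told the input is UTF-8."""
    import lxml.html  # imported lazily; requires lxml

    parser = lxml.html.HTMLParser(encoding="utf-8")
    return lxml.html.fromstring(html_bytes, parser=parser)
```

The Latin-1 reading maps each byte to one character, which is why the
two-byte UTF-8 sequence for the c-cedilla shows up as two characters in the
text returned by the parser.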
From dood at zworg.com Wed Apr 8 12:27:13 2009
From: dood at zworg.com (Adam)
Date: Wed, 8 Apr 2009 10:27:13 +0000 (UTC)
Subject: [lxml-dev] Unicode oddness
References:
<4d5f031bb27ad06e5c1a1872ffe86cc3.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:
Stefan Behnel <stefan_ml at behnel.de> writes:
> Your HTML snippet lacks a <meta> tag, so the HTMLParser has no way of
> knowing what encoding your HTML snippet uses. It therefore falls back to
> assuming Latin-1. If your snippet was encoded in Latin-1, you'd be quite
> happy about this default.
>
> If you know the encoding in advance, you can create your own parser
> instance and pass it the "encoding" keyword option.
Of course! Thank you, I had a feeling I was overlooking something simple.
From npowell3 at gmail.com Tue Apr 14 15:55:19 2009
From: npowell3 at gmail.com (Nelson Powell)
Date: Tue, 14 Apr 2009 09:55:19 -0400
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
Message-ID:
I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already built
and installed the libxslt, libxml2, and libgcrypt libraries without any
compile/link issues. The libraries were installed at /usr/local/lib which
is part of my LD_LIBRARY_PATH. I'm using:
libxslt 1.1.22
libxml2 2.7.2
libgcrypt 11.5.2
However, after building lxml, importing lxml produces the following messages
in python:
bash-3.2# python
Python 2.5.2 (r252:60911, Oct 8 2008, 21:15:13)
[GCC 4.2.4] on qnx6
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree as ET
unknown symbol: gcry_cipher_open
unknown symbol: gcry_cipher_ctl
unknown symbol: gcry_md_hash_buffer
unknown symbol: gcry_cipher_close
unknown symbol: gcry_cipher_encrypt
unknown symbol: gcry_check_version
unknown symbol: gcry_cipher_decrypt
unknown symbol: gcry_strerror
Traceback (most recent call last):
File "", line 1, in
ImportError: Unresolved symbols
>>>
Is there a way around this libgcrypt stuff? Anyone seen this before?
From stefan_ml at behnel.de Tue Apr 14 17:36:09 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 14 Apr 2009 17:36:09 +0200 (CEST)
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
In-Reply-To:
References:
Message-ID: <22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Nelson Powell wrote:
> I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already built
> and installed the libxslt, libxml2, and libgcrypt libraries without any
> compile/link issues.
Does "xsltproc" work on your machine? It comes with libxslt.
libgcrypt is an *optional* dependency of libxslt. Do you need it? How did
you configure the build?
Stefan
From npowell3 at gmail.com Tue Apr 14 17:46:32 2009
From: npowell3 at gmail.com (Nelson Powell)
Date: Tue, 14 Apr 2009 11:46:32 -0400
Subject: [lxml-dev] Porting lxml to QNX 6.4.0 issue
In-Reply-To: <22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
References:
<22c49d0d5af3087d92590d569d001cfe.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID:
I've run "xsltproc -v" and it reports
Using libxml 20702, libxslt 10122 and libexslt 813
...
So I am assuming it's working. I used the standard ./configure line to set up
before building any of the three library packages. I don't need gcrypt
at all. I just need a few items out of lxml.etree for a build
environment to work like the Windows XP build environment. Can I remove the
need for libgcrypt from the libxslt build?
On Tue, Apr 14, 2009 at 11:36 AM, Stefan Behnel wrote:
> Nelson Powell wrote:
> > I've attempted to port lxml to a QNX 6.4.0 PC (x86) and have already
> built
> > and installed the libxslt, libxml2, and libgcrypt libraries without any
> > compile/link issues.
>
> Does "xsltproc" work on your machine? It comes with libxslt.
>
> libgcrypt is an *optional* dependency of libxslt. Do you need it? How did
> you configure the build?
>
> Stefan
>
>
From stefan_ml at behnel.de Fri Apr 17 17:10:21 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 17 Apr 2009 17:10:21 +0200 (CEST)
Subject: [lxml-dev] lxml 2.2 released
In-Reply-To: <7bed6e38-b738-484d-99c9-542c992dfdfa@3g2000yqk.googlegroups.com>
References:
<7bed6e38-b738-484d-99c9-542c992dfdfa@3g2000yqk.googlegroups.com>
Message-ID:
jasonrbriggs at gmail.com wrote:
> Hope you don't mind a quick question -- is "from the source
> distribution" the only way to install on Python 3?
There are currently no binary packages (that I know of) for Py3.0 or 3.1,
so, yes, you have to build it yourself.
Stefan
From sidnei at enfoldsystems.com Sat Apr 18 04:48:09 2009
From: sidnei at enfoldsystems.com (Sidnei da Silva)
Date: Fri, 17 Apr 2009 23:48:09 -0300
Subject: [lxml-dev] docinfo.doctype doesn't include internal entities?
Message-ID:
Hi there,
I am looking for a way to output internal entities that have been
parsed from the original document when writing out a tree, but
apparently this is not exposed in any attribute.
Here's an example:
{{{
import lxml.etree
document = """
]>
"""
tree = lxml.etree.fromstring(document)
print tree.getroottree().docinfo.doctype
}}}
I would expect this to output:
{{{
]>
}}}
But instead it gives me:
{{{
}}}
Is it a bug, or am I not looking in the right place?
--
Sidnei da Silva
From stefan_ml at behnel.de Sat Apr 18 08:46:40 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Sat, 18 Apr 2009 08:46:40 +0200
Subject: [lxml-dev] docinfo.doctype doesn't include internal entities?
In-Reply-To:
References:
Message-ID: <49E97750.1010903@behnel.de>
Sidnei da Silva wrote:
> I am looking for a way to output internal entities that have been
> parsed from the original document when writing out a tree, but
> apparently this is not exposed in any attribute.
>
> Here's an example:
>
> {{{
> import lxml.etree
>
> document = """
>
> ]>
>
> """
>
>
> tree = lxml.etree.fromstring(document)
> print tree.getroottree().docinfo.doctype
> }}}
>
> I would expect this to output:
> {{{
>
> ]>
> }}}
>
> But instead it gives me:
>
> {{{
>
> }}}
>
> Is it a bug, or am I not looking in the right place?
What you are looking for is the internal subset of the document, which is
not (really) part of the DOCTYPE itself. It's available through the
"docinfo.internalDTD" property. However, lxml.etree doesn't expose the
content of the DTD, so this is currently only usable for validation (i.e.
not very helpful in your case).
What you could try is to parse the document without resolving the entities,
then traverse the Entity elements and collect their names in a set. That
will not give you the resolved entity values, though...
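A minimal sketch of that approach, in Python 3: parse with `resolve_entities=False` so that entity references stay in the tree as nodes, then collect their names. The sample document and the entity name `greeting` are made up for illustration, and the helper name `collect_entity_names` is hypothetical.

```python
from lxml import etree

# Hypothetical sample document with one internal entity.
DOCUMENT = b"""<!DOCTYPE root [
<!ENTITY greeting "hello">
]>
<root><a>&greeting;</a><b>&greeting; again</b></root>"""


def collect_entity_names(xml_bytes):
    """Return the names of all entity references left in the parsed tree."""
    # Keep entity references as nodes instead of replacing them by their text.
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(xml_bytes, parser)
    # Entity reference nodes are selected by passing the Entity factory as tag.
    return {node.name for node in root.iter(etree.Entity)}


entity_names = collect_entity_names(DOCUMENT)
print(sorted(entity_names))
```

As noted above, this only yields the names that are actually referenced in the document, not the declared replacement values.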
I think it would be nice if tostring() could serialise DTDs, but I doubt
that there are so many use cases for that. In your case, you'd then have to
parse the DTD yourself, which you could also do by clearing the root node
and serialising the document to unicode.
Stefan
From agoldgod at gmail.com Wed Apr 22 19:55:32 2009
From: agoldgod at gmail.com (goldgod a)
Date: Wed, 22 Apr 2009 23:25:32 +0530
Subject: [lxml-dev] wsdl link validation.
Message-ID: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>
Hi,
I am using lxml. I have a WSDL (created on the fly using soaplib) that
contains three XSD schemas. I am passing all of the XSD schemas in one file
as the request. I want to validate against each XSD schema one by one, and I
need your help to implement this. I went through the tutorial and found that
XSD schema validation is possible, but my WSDL contains WSDL messages as well
as the XSD schemas. Please advise.
--
Thanks & Regards,
Goldgod
From akafubu.kibombo at gmail.com Thu Apr 23 08:41:12 2009
From: akafubu.kibombo at gmail.com (Akafubu Kibombo)
Date: Thu, 23 Apr 2009 01:41:12 -0500
Subject: [lxml-dev] Forms, Cookies, Headers, and Time
Message-ID: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>
I am trying to write a script which fetches a url, logs into the site, then
fetches particular items from the page, and goes to the next page, fetching
the same type of files on the new page until there are no new pages to fetch
from. So I need form and cookie handling, as well as manipulating the
headers. What do I need to use? I found this thread, but I don't understand
it: http://codespeak.net/pipermail/lxml-dev/2008-December/004272.html.
Also, I don't want to wipe out the server with so many requests, is there a
"wait 2 - 3 seconds before fetching the next element" type function?..
Thank you so, so much.
-A.F.
From douglas at openplans.org Thu Apr 23 17:24:22 2009
From: douglas at openplans.org (Douglas Mayle)
Date: Thu, 23 Apr 2009 11:24:22 -0400
Subject: [lxml-dev] Forms, Cookies, Headers, and Time
In-Reply-To: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>
References: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>
Message-ID:
I wrote a tool to sync safari books downloads that does similar things
to what you're talking about. I found the various issues you run into
with form and cookie handling when using lxml (and wrote an article
about it here: http://douglas.mayle.org/2009/03/05/syncing-safari-downloads-intro-screen-scraping/
). I spent some time making sure the code was clean and very well
documented, so it should help you to get started. The example is here:
http://projects.mayle.org/hg/safarisync/file/23cfad04ce3a/safarisync/safarisync/safarisync.py
Douglas Mayle
From douglas at openplans.org Thu Apr 23 17:24:31 2009
From: douglas at openplans.org (Douglas Mayle)
Date: Thu, 23 Apr 2009 11:24:31 -0400
Subject: [lxml-dev] Forms, Cookies, Headers, and Time
In-Reply-To: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>
References: <39210b4b0904222341t79f8b11fmcf1ac75b3a745ef0@mail.gmail.com>
Message-ID: <50CA39A0-5DEE-4C12-9795-1CAA9C7DE056@openplans.org>
Ahh, randomly enough, the thread you link to is the one I started.
After browsing through the lxml code, it turned out that there was no
need to pass an open_http parameter, as the default method did almost
exactly the same thing as the code sample given and so monkey patching
the library (the standard way to add cookie support) already works.
Unfortunately, I found out that passing a URL directly to lxml causes
it to use libxml's native downloading support, which has no support
for cookies. As such, you have to handle all of the downloading of
content yourself (except when taking advantage of lxml forms).
As to waiting 2-3 seconds before requests, you can just put sleeps
into your code, or find some sort of bandwidth throttling package...
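To make the "download it yourself, with cookies, and sleep between requests" idea concrete, here is a rough sketch in Python 3 that uses only the standard library. The names `make_cookie_opener`, `Throttle` and `fetch` are invented for this example; they are not part of lxml or of the safarisync code linked earlier.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def make_cookie_opener():
    """Build a urllib opener that stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch(opener, throttle, url, headers=None):
    """Download one page, pausing first if the last request was too recent."""
    throttle.wait()
    request = urllib.request.Request(url, headers=headers or {})
    with opener.open(request) as response:
        return response.read()
```

The bytes returned by `fetch()` can then be passed to `lxml.html.fromstring()` for parsing, so that lxml never has to do the download itself. A `Throttle(2.5)` would give roughly the 2-3 second pause asked about.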
Douglas Mayle
From frank at chagford.com Mon Apr 27 11:33:06 2009
From: frank at chagford.com (Frank Millman)
Date: Mon, 27 Apr 2009 11:33:06 +0200
Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults
Message-ID: <20090427093431.366E73F4389@fcserver.chagford.com>
Hi all
I need to validate an xml document with a schema, and at the same time
populate it with any default attributes. This works correctly with minixsv,
but it is rather slow, so I am trying lxml. It validates correctly, but I
cannot get it to load the default attributes.
This is what I am doing -
schema = etree.XMLSchema(file='bpmnxpdl_31.xsd')
parser = etree.XMLParser(schema=schema, attribute_defaults=True)
root = etree.parse('order.xml', parser)
Any assistance will be appreciated.
Thanks
Frank Millman
From stefan_ml at behnel.de Mon Apr 27 13:08:15 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 27 Apr 2009 13:08:15 +0200 (CEST)
Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults
In-Reply-To: <20090427093431.366E73F4389@fcserver.chagford.com>
References: <20090427093431.366E73F4389@fcserver.chagford.com>
Message-ID: <078ab7e7acd478aee7669a6e89c2787a.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Frank Millman wrote:
> I need to validate an xml document with a schema, and at the same time
> populate it with any default attributes. This works correctly with
> minixsv,
> but it is rather slow, so I am trying lxml. It validates correctly, but I
> cannot get it to load the default attributes.
>
> This is what I am doing -
>
> schema = etree.XMLSchema(file='bpmnxpdl_31.xsd')
> parser = etree.XMLParser(schema=schema, attribute_defaults=True)
> root = etree.parse('order.xml', parser)
The "attribute_defaults" flag is currently only used for DTDs. Enabling
the same for XML Schema would require setting the
"XML_SCHEMA_VAL_VC_I_CREATE" option on the schema validation context,
which doesn't seem to work for older (<=2006) libxml2 versions and is not
currently done for newer versions. Could you file a feature request for
this, so that it doesn't get lost?
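For comparison, this is roughly what the DTD case looks like in Python 3. The document, the element name `order` and the attribute `status` are invented for illustration; the sketch only shows the DTD behaviour described above, not anything for XML Schema.

```python
from lxml import etree

# Hypothetical document whose internal DTD declares a default attribute value.
DOCUMENT = b"""<!DOCTYPE order [
<!ELEMENT order EMPTY>
<!ATTLIST order status CDATA "pending">
]>
<order/>"""

# attribute_defaults=True asks the parser to add defaults declared in the DTD.
parser = etree.XMLParser(attribute_defaults=True)
order = etree.fromstring(DOCUMENT, parser)
print(order.get("status"))
```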
Thanks,
Stefan
From frank at chagford.com Mon Apr 27 13:55:39 2009
From: frank at chagford.com (Frank Millman)
Date: Mon, 27 Apr 2009 13:55:39 +0200
Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults
In-Reply-To: <078ab7e7acd478aee7669a6e89c2787a.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Message-ID: <20090427115704.6359D3F4389@fcserver.chagford.com>
Stefan Behnel wrote:
>
> Frank Millman wrote:
> > I need to validate an xml document with a schema, and at
> > the same time populate it with any default attributes.
>
> The "attribute_defaults" flag is currently only used for
> DTDs. Enabling
> the same for XML Schema would require setting the
> "XML_SCHEMA_VAL_VC_I_CREATE" option on the schema validation context,
> which doesn't seem to work for older (<=2006) libxml2
> versions and is not
> currently done for newer versions. Could you file a feature
> request for
> this, so that it doesn't get lost?
>
I can't find a section for feature requests. Should I just use the Launchpad
bug tracker, or am I looking in the wrong place?
Frank
From stefan_ml at behnel.de Mon Apr 27 14:19:26 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 27 Apr 2009 14:19:26 +0200 (CEST)
Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults
In-Reply-To: <20090427115704.6359D3F4389@fcserver.chagford.com>
References: <20090427115704.6359D3F4389@fcserver.chagford.com>
Message-ID:
Frank Millman wrote:
> I can't find a section for feature requests. Should I just use the
> Launchpad bug tracker
Yep, that's the right place.
Stefan
From frank at chagford.com Mon Apr 27 14:52:58 2009
From: frank at chagford.com (Frank Millman)
Date: Mon, 27 Apr 2009 14:52:58 +0200
Subject: [lxml-dev] Problem with XMLSchema and attribute_defaults
In-Reply-To:
Message-ID: <20090427125424.0E8CB3F43E0@fcserver.chagford.com>
Stefan Behnel wrote:
>
> Frank Millman wrote:
> > I can't find a section for feature requests. Should I just use the
> > Launchpad bug tracker
>
> Yep, that's the right place.
>
Done - #367942
Thanks, Stefan
Frank
From dgardner at creatureshop.com Tue Apr 28 01:59:54 2009
From: dgardner at creatureshop.com (David Gardner)
Date: Mon, 27 Apr 2009 16:59:54 -0700
Subject: [lxml-dev] eetree.fromsring() returns Element, expected ElementTree
Message-ID: <49F646FA.5030001@creatureshop.com>
Ran into something that may be a bug, or at least isn't clear from the
documentation
[http://codespeak.net/lxml/api/lxml.etree-module.html#fromstring]
because it doesn't mention a return type for etree.fromstring(). I had
expected it to behave similar to etree.parse().
Currently I have a work-around of:
tree = etree.ElementTree(etree.fromstring(xml_data))
See below for simple test, and output.
-------
#!/usr/bin/python
import sys,StringIO
from lxml import etree
print "lxml.etree: ", etree.LXML_VERSION
print "libxml used: ", etree.LIBXML_VERSION
print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
print "libxslt used: ", etree.LIBXSLT_VERSION
print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
some_xml_data = "data"
tree1=etree.fromstring(some_xml_data)
tree2=etree.parse(StringIO.StringIO(some_xml_data))
print type(tree1)
print type(tree2)
--------------
lxml.etree: (2, 1, 5, 0)
libxml used: (2, 7, 3)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)
<type 'lxml.etree._Element'>
<type 'lxml.etree._ElementTree'>
'lxml.etree._Element' object has no attribute 'write'
--
David Gardner
Pipeline Tools Programmer, "Sid the Science Kid"
Jim Henson Creature Shop
dgardner at creatureshop.com
From dgardner at creatureshop.com Tue Apr 28 02:03:27 2009
From: dgardner at creatureshop.com (David Gardner)
Date: Mon, 27 Apr 2009 17:03:27 -0700
Subject: [lxml-dev] eetree.fromsring() returns Element,
expected ElementTree
In-Reply-To: <49F646FA.5030001@creatureshop.com>
References: <49F646FA.5030001@creatureshop.com>
Message-ID: <49F647CF.8040902@creatureshop.com>
Woops sorry, I added a bit to the test, before re-pasting, the test code
should be:
---------------------
#!/usr/bin/python
import sys,StringIO
from lxml import etree
print "lxml.etree: ", etree.LXML_VERSION
print "libxml used: ", etree.LIBXML_VERSION
print "libxml compiled: ", etree.LIBXML_COMPILED_VERSION
print "libxslt used: ", etree.LIBXSLT_VERSION
print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION
some_xml_data = "data"
tree1=etree.fromstring(some_xml_data)
tree2=etree.parse(StringIO.StringIO(some_xml_data))
print type(tree1)
print type(tree2)
out1=StringIO.StringIO()
out2=StringIO.StringIO()
try:
    tree1.write(out1,pretty_print=True)
except Exception,e:
    print str(e)
try:
    tree2.write(out2,pretty_print=True)
except Exception,e:
    print str(e)
------------------------
lxml.etree: (2, 1, 5, 0)
libxml used: (2, 7, 3)
libxml compiled: (2, 6, 32)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)
<type 'lxml.etree._Element'>
<type 'lxml.etree._ElementTree'>
'lxml.etree._Element' object has no attribute 'write'
--
David Gardner
Pipeline Tools Programmer, "Sid the Science Kid"
Jim Henson Creature Shop
dgardner at creatureshop.com
From stefan_ml at behnel.de Tue Apr 28 07:15:27 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 28 Apr 2009 07:15:27 +0200
Subject: [lxml-dev] eetree.fromsring() returns Element,
expected ElementTree
In-Reply-To: <49F646FA.5030001@creatureshop.com>
References: <49F646FA.5030001@creatureshop.com>
Message-ID: <49F690EF.7080708@behnel.de>
Hi,
David Gardner wrote:
> Ran into something that may be a bug, or at least isn't clear from the
> documentation
> [http://codespeak.net/lxml/api/lxml.etree-module.html#fromstring]
> because it doesn't mention a return type for etree.fromstring(). I had
> expected it to behave similar to etree.parse().
Yes, that's a common misconception. Let's see if this works better:
https://codespeak.net/viewvc/lxml/trunk/src/lxml/lxml.etree.pyx?r1=63185&r2=64752
The reason for this difference is that fromstring()/XML() is often used for
XML fragments, where returning an ElementTree wouldn't make sense.
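A short Python 3 sketch of the difference, with made-up sample data: `fromstring()` hands back the root Element, `parse()` hands back an ElementTree, and there are two ways to get from the former to the latter.

```python
from io import BytesIO
from lxml import etree

XML_DATA = b"<root><child>data</child></root>"  # made-up sample input

element = etree.fromstring(XML_DATA)   # an Element (the root node)
tree = etree.parse(BytesIO(XML_DATA))  # an ElementTree (the whole document)

# Two ways to get an ElementTree for an Element:
wrapped = etree.ElementTree(element)   # wrap it explicitly
owner = element.getroottree()          # or ask for the tree it belongs to

out = BytesIO()
wrapped.write(out, pretty_print=True)  # write() lives on ElementTree, not Element
print(out.getvalue().decode())
```

If only a serialised string is needed, `etree.tostring(element, pretty_print=True)` works directly on the Element without any wrapping.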
Stefan
From Grimm at juris.de Tue Apr 28 14:32:06 2009
From: Grimm at juris.de (Grimm, Markus)
Date: Tue, 28 Apr 2009 14:32:06 +0200
Subject: [lxml-dev] How to set an attribute with a xml-namepace
Message-ID: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
Hi all,
I want to create a new element with an attribute in the xml namespace,
e.g.
>> xml = "<root xml:space='preserve'/>"
>> root = etree.fromstring(xml)
>> print etree.tostring(root)
<root xml:space="preserve"/>
everything fine, and now...
>> root = etree.Element("root", space="preserve")
>> print etree.tostring(root)
<root space="preserve"/>
How can I bind the space attribute to the xml namespace, so that I can output
the same result as above?
Thanks,
Markus
From stefan_ml at behnel.de Tue Apr 28 15:48:45 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 28 Apr 2009 15:48:45 +0200 (CEST)
Subject: [lxml-dev] How to set an attribute with a xml-namepace
In-Reply-To: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
Message-ID: <34f4f681bc565063e45026a82d538ed9.squirrel@groupware.dvs.informatik.tu-darmstadt.de>
Grimm, Markus wrote:
> I want to set a new element with an attribute using the xml-namespace,
> f.e.
>
> >> xml = ""
> >> root = etree.fromstring(xml)
> >> print etree.tostring(root)
>
>
> everything fine, and now...
>
>>> root = etree.Element("root", space="preserve")
>>> print etree.tostring(root)
>
>
> How can I bind the space-attribute to the xml-namespace, so I can output
> the same result as above ?
The namespace that is bound to the "xml" prefix is
http://www.w3.org/XML/1998/namespace
You use it like this:
>>> root = etree.Element("root",
...     {'{http://www.w3.org/XML/1998/namespace}space' : "preserve"})
>>> print etree.tostring(root)
<root xml:space="preserve"/>
Note that you have to pass a dictionary here as you cannot pass the name
as a keyword argument. Or use the .set() Element method.
Also see the section on namespaces in the tutorial.
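The .set() variant mentioned above might look like this in Python 3. The constant name `XML_NS` is just a label chosen for this sketch.

```python
from lxml import etree

XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of the "xml" prefix

root = etree.Element("root")
# Namespaced attribute names use the same "{namespace}name" notation as tags.
root.set("{%s}space" % XML_NS, "preserve")

serialized = etree.tostring(root)
print(serialized.decode())
```

Reading the value back uses the same qualified name, i.e. `root.get("{%s}space" % XML_NS)`.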
Stefan
From jholg at gmx.de Tue Apr 28 15:50:25 2009
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 28 Apr 2009 15:50:25 +0200
Subject: [lxml-dev] How to set an attribute with a xml-namepace
In-Reply-To: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
Message-ID: <20090428135025.27910@gmx.net>
Hi,
> >> xml = ""
> >> root = etree.fromstring(xml)
> >> print etree.tostring(root)
>
>
> everything fine, and now...
>
> >> root = etree.Element("root", space="preserve")
> >> print etree.tostring(root)
>
>>> root = etree.Element("root", attrib={'{http://www.w3.org/XML/1998/namespace}space': 'preserve'})
>>> print etree.tostring(root)
>>>
Cheers,
Holger
From Grimm at juris.de Tue Apr 28 16:11:33 2009
From: Grimm at juris.de (Grimm, Markus)
Date: Tue, 28 Apr 2009 16:11:33 +0200
Subject: [lxml-dev] How to set an attribute with a xml-namepace
References: <60C0E6193EB5684091937E26E9862503F37C73@JUREX2.juris.de>
<20090428135025.27910@gmx.net>
Message-ID: <60C0E6193EB5684091937E26E9862503F8A292@JUREX2.juris.de>
thanks to Holger and Stefan,
I hadn't managed to make the mental leap from elements to attributes, as described in
http://codespeak.net/lxml/tutorial.html#the-e-factory :-)
Thanks,
Markus
From velvetcrafter.subscriber at gmail.com Tue Apr 28 19:39:24 2009
From: velvetcrafter.subscriber at gmail.com (Alexis Georges)
Date: Tue, 28 Apr 2009 13:39:24 -0400
Subject: [lxml-dev] XML Documents & I18N (the way Cocoon does it)
Message-ID: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com>
Hello everyone,
I am maintaining a multilingual website which works with XML, XSLT to
generate XHTML.
I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using
(among other things) their I18NTransformer. Basically I can use
elements in the I18N (http://apache.org/cocoon/i18n/2.1) namespace,
and then tell Cocoon to apply the I18NTransformer to the document; this
replaces the I18N elements with a localized value (eg. a formatted
date/number, a translated label/attribute, etc...).
I have been looking at lxml a little bit to see if I could move to a
Python-based framework for the website. I am not quite sure how to go
about the I18N part though.
Using the Babel library (http://babel.edgewall.org/) along with
request headers to generate localized data, I have everything I need.
What is missing is the "parser" for the I18N elements. All I can think
of right now is to implement a SAX parser, the way Cocoon does (in
Java).
Does anyone have suggestions? Am I making this too complicated?
Thanks!
Alexis
From stefan_ml at behnel.de Tue Apr 28 19:59:50 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 28 Apr 2009 19:59:50 +0200
Subject: [lxml-dev] XML Documents & I18N (the way Cocoon does it)
In-Reply-To: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com>
References: <89A8F7A1-A544-49C8-8E81-7F88EF77E31A@gmail.com>
Message-ID: <49F74416.7050209@behnel.de>
Hi,
Alexis Georges wrote:
> I am maintaining a multilingual website which works with XML, XSLT to
> generate XHTML.
>
> I am working with Apache Cocoon (http://cocoon.apache.org/2.1/) using
> (among other things) their I18NTransformer. Basically I can use elements
> in the I18N (http://apache.org/cocoon/i18n/2.1) namespace, and then tell
> Cocoon to apply the I18NTransformer to the document; this replaces the
> I18N elements with a localized value (eg. a formatted date/number, a
> translated label/attribute, etc...).
>
> I have been looking at lxml a little bit to see if I could move to a
> Python-based framework for the website. I am not quite sure how to go
> about the I18N part though.
>
> Using the Babel library (http://babel.edgewall.org/) along with request
> headers to generate localized data, I have everything I need. What is
> missing is the "parser" for the I18N elements. All I can think of right
> now is to implement a SAX parser, the way Cocoon does (in Java).
There is a SAX-like interface in lxml.etree, called "target parser".
However, if your documents fit into memory, using iterparse() is a lot
simpler (and likely not even much slower).
Something like this might work:
context = etree.iterparse(
    "somefile.xml",
    tag = "{http://apache.org/cocoon/i18n/2.1}*")
for event, i18n_element in context:
    new_element = get_i18n_replacement_for(i18n_element)
    i18n_element.getparent().replace(i18n_element, new_element)
context.root.getroottree().write("newfile.xml")
See here for some documentation:
http://codespeak.net/lxml/parsing.html
You can also achieve the same thing in XSLT, or using XPath, or ...
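A self-contained Python 3 sketch of the iterparse() idea, under a few assumptions: the catalogue is a plain dict standing in for whatever Babel or gettext would provide, only `i18n:text` elements are handled, each one is replaced by a `<span>`, and no i18n element is the document root. To stay on the cautious side, the matches are collected during parsing and the tree is only modified afterwards. The names `CATALOGUE` and `localize` are made up for this example.

```python
from io import BytesIO
from lxml import etree

I18N_NS = "http://apache.org/cocoon/i18n/2.1"
I18N_TEXT = "{%s}text" % I18N_NS

# Hypothetical message catalogue; a real one might come from Babel/gettext.
CATALOGUE = {"greeting": "Bonjour"}


def localize(source):
    """Replace every i18n:text element by a <span> holding its translation."""
    context = etree.iterparse(source, events=("end",), tag=I18N_TEXT)
    # Collect the matches first and modify the tree only after parsing is done.
    matches = [element for _event, element in context]
    for old in matches:
        new = etree.Element("span")
        new.text = CATALOGUE.get(old.text, old.text)
        new.tail = old.tail  # the tail text would otherwise leave with `old`
        old.getparent().replace(old, new)
    return context.root.getroottree()


SAMPLE = (b'<page xmlns:i18n="http://apache.org/cocoon/i18n/2.1">'
          b"<p><i18n:text>greeting</i18n:text>, world</p></page>")
localized = etree.tostring(localize(BytesIO(SAMPLE)))
print(localized.decode())
```

Copying `.tail` matters because lxml keeps the text that follows an element attached to that element, so it would disappear from the document together with the replaced node.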
Stefan
From stefan_ml at behnel.de Tue Apr 28 20:11:37 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 28 Apr 2009 20:11:37 +0200
Subject: [lxml-dev] wsdl link validation.
In-Reply-To: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>
References: <105c9ccc0904221055j3a7f3ce9j73a370976c8680c9@mail.gmail.com>
Message-ID: <49F746D9.5060705@behnel.de>
Hi,
goldgod a wrote:
> I am using lxml. I have a WSDL (created on the fly using soaplib) that
> contains three XSD schemas. I am passing all of the XSD schemas in one file
> as the request. I want to validate against each XSD schema one by one, and I
> need your help to implement this. I went through the tutorial and found that
> XSD schema validation is possible, but my WSDL contains WSDL messages as well
> as the XSD schemas.
Well, you could search the three schemas by iterating over the schema root
elements using .iter("{schema-namespace}tag-name"), then create an
XMLSchema() instance from each of them, and use the three validators to
validate your document.
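As a rough Python 3 sketch of those steps: the WSDL below is a made-up, heavily reduced stand-in with a single embedded schema, and `embedded_validators` is a hypothetical helper name. A real soaplib WSDL would contain several schemas, usually with target namespaces and imports that this example does not cover.

```python
from lxml import etree

XSD_SCHEMA = "{http://www.w3.org/2001/XMLSchema}schema"

# Made-up, heavily reduced WSDL with a single embedded schema.
WSDL = b"""<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
  <wsdl:types>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="quantity" type="xsd:int"/>
    </xsd:schema>
  </wsdl:types>
</wsdl:definitions>"""


def embedded_validators(wsdl_root):
    """Build one XMLSchema validator per xsd:schema element in the WSDL."""
    return [etree.XMLSchema(node) for node in wsdl_root.iter(XSD_SCHEMA)]


validators = embedded_validators(etree.fromstring(WSDL))
good = etree.fromstring(b"<quantity>3</quantity>")
bad = etree.fromstring(b"<quantity>three</quantity>")
print([v.validate(good) for v in validators])
print([v.validate(bad) for v in validators])
```

After a failed `validate()` call, the validator's `error_log` attribute holds the reasons.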
Does that help?
Stefan
From jamie at artefact.org.nz Wed Apr 29 10:44:31 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Wed, 29 Apr 2009 20:44:31 +1200
Subject: [lxml-dev] xpath on text nodes
Message-ID: <1240994671.8989.9.camel@atman.artefact.org.nz>
The xpath method is currently available only for ElementTree and Element
objects. Is it possible for it to be available to text nodes also?
My current use case is getting a certain length text context for a
particular element node, and I'd like to implement that through a
recursive call to a function that returns the content of a supplied text
node appended to the content of the next text node in sequence (provided
the required length has not been passed).
Jamie
From stefan_ml at behnel.de Wed Apr 29 17:24:51 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 29 Apr 2009 17:24:51 +0200 (CEST)
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <1240994671.8989.9.camel@atman.artefact.org.nz>
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
Message-ID:
Hi,
Jamie Norrish wrote:
> The xpath method is currently available only for ElementTree and Element
> objects. Is it possible for it to be available to text nodes also?
There is no such concept as a text node in lxml.etree.
> My current use case is getting a certain length text context for a
> particular element node, and I'd like to implement that through a
> recursive call to a function that returns the content of a supplied text
> node appended to the content of the next text node in sequence (provided
> the required length has not been passed).
That sounds a lot like you should do that in Python by using iterwalk()
and collecting .text and .tail attributes of Elements, not by using XPath.
Stefan
From jamie at artefact.org.nz Thu Apr 30 06:30:53 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Thu, 30 Apr 2009 16:30:53 +1200
Subject: [lxml-dev] xpath on text nodes
In-Reply-To:
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
Message-ID: <1241065853.5570.4.camel@atman.artefact.org.nz>
Hi,
> There is no such concept as a text node in lxml.etree.
Okay, but the string results of an XPath selecting text nodes in the XML
have additional attributes - it just seems a pity that an xpath method
isn't one of them.
> That sounds a lot like you should do that in Python by using iterwalk()
> and collecting .text and .tail attributes of Elements, not by using XPath.
Well, I like XPath. :) In fact I already have an implementation of the
use case that, while slightly suboptimal, is sufficient - it just seemed
like one obvious way of doing it better was to use XPath. I shall
investigate using iterwalk instead.
Thanks!
Jamie
From stefan_ml at behnel.de Thu Apr 30 09:42:00 2009
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 30 Apr 2009 09:42:00 +0200 (CEST)
Subject: [lxml-dev] xpath on text nodes
In-Reply-To: <1241065853.5570.4.camel@atman.artefact.org.nz>
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
<1241065853.5570.4.camel@atman.artefact.org.nz>
Message-ID:
Jamie Norrish wrote:
>> There is no such concept as a text node in lxml.etree.
>
> Okay, but the string results of an XPath selecting text nodes in the XML
> have additional attributes - it just seems a pity that an xpath method
> isn't one of them.
It would be rarely used, I'd say. What sort of interesting XPath queries
could you possibly do on a node that doesn't have any children, nor
attributes, nor a tag name or namespace? Also, XPath queries can return
Elements and (special) strings, but also plain numbers and boolean values.
So you'd still not have a common interface for all possible result types.
>> That sounds a lot like you should do that in Python by using iterwalk()
>> and collecting .text and .tail attributes of Elements, not by using
>> XPath.
>
> Well, I like XPath. :) In fact I already have an implementation of the
> use case that, while slightly suboptimal, is sufficient - it just seemed
> like one obvious way of doing it better was to use XPath. I shall
> investigate using iterwalk instead.
This should basically be a no-brainer with iterwalk(). You iterate over
start and end events and just collect the .text values on start and the
.tail values on end. Put them in a list, count the total character length
on the way, break when it's long enough and ''.join() the list.
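Spelled out in Python 3, that recipe could look roughly like this. The function name `leading_text` and the sample markup are invented for the example. The tail of the starting element is skipped on purpose, since it lies outside the subtree being walked.

```python
from lxml import etree


def leading_text(element, limit):
    """Return up to `limit` characters of text found inside `element`."""
    parts = []
    total = 0
    for event, node in etree.iterwalk(element, events=("start", "end")):
        if event == "start":
            chunk = node.text
        elif node is element:
            chunk = None  # the tail of the starting element lies outside it
        else:
            chunk = node.tail
        if chunk:
            parts.append(chunk)
            total += len(chunk)
            if total >= limit:
                break
    return "".join(parts)[:limit]


sample = etree.fromstring("<p>Hello <b>big</b> wide <i>world</i>!</p>")
print(leading_text(sample, 9))
```

This only collects text that follows the start of the given element; it does not look backwards in the document.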
Stefan
From jamie at artefact.org.nz Thu Apr 30 22:07:26 2009
From: jamie at artefact.org.nz (Jamie Norrish)
Date: Fri, 01 May 2009 08:07:26 +1200
Subject: [lxml-dev] xpath on text nodes
In-Reply-To:
References: <1240994671.8989.9.camel@atman.artefact.org.nz>
<1241065853.5570.4.camel@atman.artefact.org.nz>
Message-ID: <1241122046.5549.19.camel@atman.artefact.org.nz>
On Thu, 2009-04-30 at 09:42 +0200, Stefan Behnel wrote:
> It would be rarely used, I'd say. What sort of interesting XPath queries
> could you possibly do on a node that doesn't have any children, nor
> attributes, nor a tag name or namespace?
Besides selecting other nodes and values relative to the text? Yes, it
is possible to use text_result.getparent() and proceed from there - but
this has the downside of requiring, for some XPath expressions, the code
to modify the expression based on whether text_result was the text or
tail of its parent, which is annoying.
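To illustrate the point with a Python 3 sketch: the strings returned for text() carry `getparent()`, `is_text` and `is_tail`, and a query that continues "from the text" has to pick a different expression depending on which of the two applies. The helper `elements_after` and the sample document are made up for this example.

```python
from lxml import etree

root = etree.fromstring("<doc><p>Hello <b>big</b> world</p><note/></doc>")


def elements_after(text_result):
    """Return the tags of all elements that start after the given text."""
    owner = text_result.getparent()
    if text_result.is_tail:
        # Tail text sits after `owner`, so only what follows `owner` counts.
        found = owner.xpath("following::*")
    else:
        # .text sits inside `owner`, so its descendants come after it as well.
        found = owner.xpath("descendant::* | following::*")
    return [element.tag for element in found]


results = [(str(text), elements_after(text)) for text in root.xpath("//text()")]
for text, tags in results:
    print(repr(text), tags)
```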
> Also, XPath queries can return Elements and (special) strings, but
> also plain numbers and boolean values.
> So you'd still not have a common interface for all possible result types.
Well, I'm not really asking for a common interface - only that XPath be
enabled for the results of an XPath expression for text(). This would
bring it into line with XSLT behaviour, for one.
However, I accept that it's not going to be used often, and probably
isn't worth you implementing for that reason.
About using iterwalk: this wouldn't seem (on a quick perusal of the
documentation) to easily allow for me to get the preceding context of
the text result, unless I picked some arbitrary earlier element as the
starting point. What am I missing?
Jamie