s and put it, with all descendants, to the result document.
> print html.tostring(elembody, method="html", encoding="utf-8")
>
> It prints only 3 divs for the example in this thread.
>
> BTW: the XPath id() function seems buggy (as I wrote in my previous email);
> this method doesn't work with id() and loops forever.
No idea about this issue. When's a node selectable using id()?
XPath rec says:
"[...] NOTE: If a document does not have a DTD, then no element in the document will have a unique ID. [...]"
(http://www.w3.org/TR/xpath/#unique-id)
So I suspect this is not the same as elements simply having an id-attribute.
Holger
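A minimal sketch of the distinction Holger suspects, assuming a plain XML parse with lxml on Python 3 (the thread itself uses lxml.html on Python 2, where behaviour may differ). The document string and the id value are made up for illustration.

```python
from lxml import etree

# Hypothetical document: there is no DTD, so "id" is just an ordinary attribute.
doc = etree.fromstring('<root><div id="content">text</div></root>')

# id() only sees attributes that are declared to be of type ID.
by_id_function = doc.xpath("id('content')")

# An attribute test works regardless of any DTD.
by_attribute = doc.xpath("//*[@id='content']")

print(len(by_id_function))
print(by_attribute[0].tag)
```

When no DTD is available, selecting on the attribute value is the safer choice.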
From cmtaylor at ti.com Thu Mar 18 13:30:55 2010
From: cmtaylor at ti.com (Taylor, Martin)
Date: Thu, 18 Mar 2010 07:30:55 -0500
Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython
2.6
In-Reply-To: <4BA1D948.10204@behnel.de>
References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com>
<4BA1D948.10204@behnel.de>
Message-ID: <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com>
I hadn't tried the Mac build yesterday because the instructions really looked like there could be many "gotchas". However, Stefan's comment challenged me to give it a try this morning. I downloaded and extracted the lxml-2.2.6.tar.gz source file, then ran the "easy_install" and it immediately didn't work:
$which easy_install
/Library/Frameworks/Python.framework/Versions/2.6/bin/easy_install
$ STATIC_DEPS=true sudo easy_install lxml
Searching for lxml
Reading http://pypi.python.org/simple/lxml/
Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
Reading http://pypi.python.org/simple/lxml/
Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
Couldn't find index page for 'lxml' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading http://pypi.python.org/simple/
Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
No local packages or download links found for lxml
error: Could not find suitable distribution for Requirement.parse('lxml')
I've never used easy_install before and I've never had ANY success with FTP from behind our company firewall. So I'm not surprised that I got these error messages about "servname". Does anyone have any suggestions as to how I might get this to work?
Based on comments here: http://www.explain.com.au/oss/libxml2xslt.html it would seem that as of "Leopard" (OS X 10.5 and I'm using OS X 10.6), the libxml2 and libxslt that come with the Mac OS X are OK to use with lxml. So I tried a simple install:
$python setup.py install
Building lxml version 2.2.6.
NOTE: Trying to build without Cython, pre-generated 'src/lxml/lxml.etree.c' needs to be available.
Using build configuration of libxslt 1.1.24
running install
running bdist_egg
running egg_info
writing src/lxml.egg-info/PKG-INFO
writing top-level names to src/lxml.egg-info/top_level.txt
writing dependency_links to src/lxml.egg-info/dependency_links.txt
reading manifest file 'src/lxml.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'src/lxml.egg-info/SOURCES.txt'
installing library code to build/bdist.macosx-10.3-fat/egg
running install_lib
running build_py
creating build/lib.macosx-10.3-fat-2.6
creating build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/__init__.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/_elementpath.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/builder.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/cssselect.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/doctestcompare.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/ElementInclude.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/sax.py -> build/lib.macosx-10.3-fat-2.6/lxml
copying src/lxml/usedoctest.py -> build/lib.macosx-10.3-fat-2.6/lxml
creating build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/__init__.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/_dictmixin.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/builder.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/clean.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/defs.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/diff.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/formfill.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/html5parser.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/soupparser.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.3-fat-2.6/lxml/html
running build_ext
building 'lxml.etree' extension
creating build/temp.macosx-10.3-fat-2.6
creating build/temp.macosx-10.3-fat-2.6/src
creating build/temp.macosx-10.3-fat-2.6/src/lxml
gcc -arch ppc -arch i386 -fno-strict-aliasing -fPIC -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/include/libxml2 -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -w -flat_namespace
Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
Please check your Xcode installation
gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so
ld: library not found for -lbundle1.o
collect2: ld returned 1 exit status
ld: library not found for -lbundle1.o
collect2: ld returned 1 exit status
lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccjg1Ufi.out (No such file or directory)
error: command 'gcc' failed with exit status 1
I think the real problem here is:
Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
Of course I don't have a 10.4 SDK on my 10.6 Xcode. At TI we do support our products on 10.5 and 10.6 but 10.4 was declared "obsolete" last year.
So I'm back to my original question: Has anyone built lxml for Mac OS X 10.6 and, if so, could you provide me a link to a binary installer?
Thanks very much,
Martin
> -----Original Message-----
> From: Stefan Behnel [mailto:stefan_ml at behnel.de]
> Sent: Thursday, March 18, 2010 2:42 AM
> To: Taylor, Martin
> Cc: ML-Lxml-dev
> Subject: Re: [lxml-dev] lxml pre-built for MacTel OS X 10.6
> and ActivePython 2.6
>
> Taylor, Martin, 17.03.2010 22:50:
> > Does anyone know of a .dmg installer for lxml pre-built for MacTel OS X 10.6 and ActivePython 2.6?
> >
> > This web page: http://codespeak.net/lxml/installation.html#installation-in-activepython suggests doing this:
> >
> > pypm install lxml
> >
> > but that requires a special Business License from ActiveState at a cost of about $1000!
>
> Ok, guess I'll just remove that from the web site then.
>
>
> > The build-it-yourself instructions for Mac look REALLY HAIRY
>
> Did you actually *try* the one-liner that they present?
>
> STATIC_DEPS=true easy_install lxml
>
> Stefan
>
From stefan_ml at behnel.de Thu Mar 18 13:58:48 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 18 Mar 2010 13:58:48 +0100
Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython
2.6
In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com>
References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de>
<92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com>
Message-ID: <4BA22388.8070507@behnel.de>
Taylor, Martin, 18.03.2010 13:30:
> I hadn't tried the Mac build yesterday because the instructions really looked like there could be many "gotchas". However, Stefan's comment challenged me to give it a try this morning. I downloaded and extracted the lxml-2.2.6.tar.gz source file, then ran the "easy_install" and it immediately didn't work:
>
> $which easy_install
> /Library/Frameworks/Python.framework/Versions/2.6/bin/easy_install
> $ STATIC_DEPS=true sudo easy_install lxml
> Searching for lxml
> Reading http://pypi.python.org/simple/lxml/
> Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
> Reading http://pypi.python.org/simple/lxml/
> Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
> Couldn't find index page for 'lxml' (maybe misspelled?)
> Scanning index of all packages (this may take a while)
> Reading http://pypi.python.org/simple/
> Download error: [Errno 8] nodename nor servname provided, or not known -- Some packages may not be found!
> No local packages or download links found for lxml
> error: Could not find suitable distribution for Requirement.parse('lxml')
>
> I've never used easy_install before and I've never had ANY success with FTP from behind our company firewall. So I'm not surprised that I got these error messages about "servname". Does anyone have any suggestions as to how I might get this to work?
>
> Based on comments here: http://www.explain.com.au/oss/libxml2xslt.html it would seem that as of "Leopard" (OS X 10.5 and I'm using OS X 10.6), the libxml2 and libxslt that come with the Mac OS X are OK to use with lxml. So I tried a simple install:
>
> $python setup.py install
You forgot to say "STATIC_DEPS=true".
> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
> Please check your Xcode installation
> gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so
> ld: library not found for -lbundle1.o
> collect2: ld returned 1 exit status
> ld: library not found for -lbundle1.o
> collect2: ld returned 1 exit status
> lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccjg1Ufi.out (No such file or directory)
> error: command 'gcc' failed with exit status 1
>
> I think the real problem here is:
> Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
There's at least some code in buildlibxml2.py that deals with this case:
major_version, minor_version = map(int,
    platform.mac_ver()[0].split('.')[:2])
if major_version > 7:
    env = os.environ.copy()
    if minor_version < 6:
        env.update({
            'CFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2",
            'LDFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk",
            'MACOSX_DEPLOYMENT_TARGET' : "10.3"
            })
    else:
        env.update({
            'CFLAGS' : "-arch ppc -arch i386 -arch x86_64 -O2",
            'LDFLAGS' : "-arch ppc -arch i386 -arch x86_64",
            'MACOSX_DEPLOYMENT_TARGET' : "10.6"
            })
    call_setup['env'] = env
Stefan
From wichert at wiggy.net Thu Mar 18 15:16:24 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Thu, 18 Mar 2010 15:16:24 +0100
Subject: [lxml-dev] Unicode behaviour of Element.text
Message-ID: <4BA235B8.9010201@wiggy.net>
I tried to figure out the unicode-behaviour of Element.text. The lxml
documentation does mention how parsing unicode data and serializing to
unicode works, but I can not find any information on how Element.text
returns strings. From what I can see it appears that Element.text
returns either a str or a unicode instance, depending on the presence of
non-ASCII text. That behaviour feels inconsistent, and for unicode using
applications it means that every use of Element.text has to be written
as unicode(node.text), which is not very pretty. Would it be possible to
add an option to make the text attribute always return a unicode instance?
Wichert.
From wichert at wiggy.net Thu Mar 18 15:16:05 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Thu, 18 Mar 2010 15:16:05 +0100
Subject: [lxml-dev] adding a namespace
Message-ID: <4BA235A5.7070507@wiggy.net>
I am having some problems adding a new namespace to a parsed document.
My goal is to take an input file like this:
and turn it into this:
the code is fairly simple, and looks like this (simplified from original):
NS="http://xml.zope.org/namespaces/i18n"
tree=lxml.etree.parse(input)
root=tree.getroot()
count=1
if "i18n" not in root.nsmap:
    root.nsmap["i18n"]=NS
for el in root.iter():
    if "{%s}translate" % NS in el.attrib:
        continue
    if hasText(el):
        el.attrib["{%s}translate" % NS]="string%d" % count
        count+=1
print lxml.etree.tostring(tree)
However the resulting output looks like this:
while trying to debug this I noticed something odd: lxml allows you to
modify the nsmap for an element, but ignores what you do:
>>> root.nsmap
{None: 'http://www.w3.org/1999/xhtml', 'py':
'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
>>> root.nsmap["frop"]='http://frip'
>>> root.nsmap
{None: 'http://www.w3.org/1999/xhtml', 'py':
'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
I would expect that to either work, or raise an exception telling me I
am trying to do something that is not allowed. The current behaviour
feels a bit unpythonic.
It is possible to specify your own nsmap when creating elements, but I
can not find an API to modify the nsmap for a parsed tree. Is that a
missing feature, or is there another way to do this?
Wichert.
From stefan_ml at behnel.de Thu Mar 18 16:25:18 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 18 Mar 2010 16:25:18 +0100
Subject: [lxml-dev] Unicode behaviour of Element.text
In-Reply-To: <4BA235B8.9010201@wiggy.net>
References: <4BA235B8.9010201@wiggy.net>
Message-ID: <4BA245DE.8010901@behnel.de>
Wichert Akkerman, 18.03.2010 15:16:
> I tried to figure out the unicode-behaviour of Element.text. The lxml
> documentation does mention how parsing unicode data and serializing to
> unicode works, but I can not find any information on how Element.text
> returns strings. From what I can see it appears that Element.text
> returns either a str or a unicode instance, depending on the presence of
> non-ASCII text. That behaviour feels inconsistent, and for unicode using
> applications it means that every use of Element.text has to be written
> as unicode(node.text), which is not very pretty. Would it be possible to
> add an option to make the text attribute always return a unicode instance?
Since this has been asked a couple of times before, here's a short answer:
That's how ElementTree works in Py2 and lxml.etree is compatible with it.
It's also faster for plain ASCII data (which is common). In Python 3,
lxml.etree always returns Unicode strings for .tag, .text and .tail.
Stefan
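A small sketch of the Python 3 behaviour Stefan describes, assuming lxml is installed; the two sample documents are invented for illustration.

```python
from lxml import etree

# One element with pure ASCII text, one with a non-ASCII character.
ascii_el = etree.fromstring("<p>plain</p>")
accented_el = etree.fromstring("<p>caf\u00e9</p>")

# Under Python 3 both values are str, so no unicode(...) wrapper is needed.
for el in (ascii_el, accented_el):
    print(type(el.text).__name__, repr(el.text))
```

Under Python 2, the mixed str/unicode result described in this thread still applies.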
From cmtaylor at ti.com Thu Mar 18 18:31:09 2010
From: cmtaylor at ti.com (Taylor, Martin)
Date: Thu, 18 Mar 2010 12:31:09 -0500
Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython
2.6
In-Reply-To: <4BA22388.8070507@behnel.de>
References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com>
<4BA1D948.10204@behnel.de>
<92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com>
<4BA22388.8070507@behnel.de>
Message-ID: <92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com>
I'm making some progress but am stuck again on this Mac OS X 10.6 build of lxml. I downloaded the two dependent tarballs manually, since the FTP access through our firewall didn't work:
ls libs
libxml2-2.7.6.tar.gz libxslt-1.1.26.tar.gz
Then I ran this build command:
python setup.py build --static-deps --libxml2-version=2.7.6 --libxslt-version=1.1.26
I think it built the two libraries successfully 'cause I saw messages like this:
----------------------------------------------------------------------
Libraries have been installed in:
.../lxml/build/tmp/libxml2/lib
For both libraries. But when it got to this stage:
building 'lxml.etree' extension
I got these error messages:
Compiling with an SDK that doesn't seem to exist: /Developer/SDKs/MacOSX10.4u.sdk
Please check your Xcode installation
gcc -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -bundle -undefined dynamic_lookup build/temp.macosx-10.3-fat-2.6/src/lxml/lxml.etree.o -liconv /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libexslt.a /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libxml2.a /Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib/libxslt.a -L/Users/epsqainfprod/_AccuRev_/RF-integ_EPSQAInf5Lab09/TI-RF/robotframework-NspireOracleLibrary/3rdPartyTools/lxml/build/tmp/libxml2/lib -lz -lm -o build/lib.macosx-10.3-fat-2.6/lxml/etree.so
ld: library not found for -lbundle1.o
collect2: ld returned 1 exit status
ld: library not found for -lbundle1.o
collect2: ld returned 1 exit status
lipo: can't open input file: /var/folders/Bm/BmG3PdbEFTqwQbJ-5p7qw++++TI/-Tmp-//ccxV0tCt.out (No such file or directory)
error: command 'gcc' failed with exit status 1
Which indicates to me that it is trying to build with the SDK for the wrong Mac OS version. I've searched the entire lxml code tree and can't find anywhere where it does this kind of logic for the building of lxml itself (buildlibxml2.py only builds that library, as far as I can tell):
> There's at least some code in buildlibxml2.py that deals with
> this case:
>
>
> major_version, minor_version = map(int, platform.mac_ver()[0].split('.')[:2])
> if major_version > 7:
>     env = os.environ.copy()
>     if minor_version < 6:
>         env.update({
>             'CFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk -O2",
>             'LDFLAGS' : "-arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk",
>             'MACOSX_DEPLOYMENT_TARGET' : "10.3"
>             })
>     else:
>         env.update({
>             'CFLAGS' : "-arch ppc -arch i386 -arch x86_64 -O2",
>             'LDFLAGS' : "-arch ppc -arch i386 -arch x86_64",
>             'MACOSX_DEPLOYMENT_TARGET' : "10.6"
>             })
>     call_setup['env'] = env
SO now the question is "What hidden magic is used to determine the Mac OS X version and the SDK that should be used for building lxml itself?"
Thanks again,
Martin
From stefan_ml at behnel.de Thu Mar 18 20:11:23 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 18 Mar 2010 20:11:23 +0100
Subject: [lxml-dev] lxml pre-built for MacTel OS X 10.6 and ActivePython
2.6
In-Reply-To: <92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com>
References: <92CDD168D1E81F4F9D3839DC45903FC6763A2FA0@dlee03.ent.ti.com> <4BA1D948.10204@behnel.de> <92CDD168D1E81F4F9D3839DC45903FC6763A3232@dlee03.ent.ti.com> <4BA22388.8070507@behnel.de>
<92CDD168D1E81F4F9D3839DC45903FC67642B31E@dlee03.ent.ti.com>
Message-ID: <4BA27ADB.1020608@behnel.de>
Taylor, Martin, 18.03.2010 18:31:
> SO now the question is "What hidden magic is used to determine the Mac
> OS X version and the SDK that should be used for building lxml itself?"
This looks like a distutils related question (or maybe even ActivePython
related). So unless someone else can answer it on this list, I'd suggest
you ask on comp.lang.python or the distutils sig mailing list.
BTW, have you succeeded in building any other binary extensions on your
platform yet? That might tell you if it's a general problem with your
installation or something that's specific to lxml.
Another thing you could try is use the pre-built lxml 2.2.2 binaries on
PyPI. Not totally up-to-date, but certainly usable.
http://pypi.python.org/pypi/lxml/2.2.2
Those were built by Stephan Eletzhofer, maybe he can upload a build of 2.2.6?
Stefan
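As background to the distutils pointer: on Unix-like platforms distutils takes its compiler and linker flags from the configuration the interpreter itself was built with, which may be where the 10.4u SDK path in the failing gcc command comes from. The following is only a diagnostic sketch. It uses the standard-library sysconfig module of current Python versions (on Python 2.6 the equivalent function lives in distutils.sysconfig), and the choice of variables to print is an assumption about what is relevant here.

```python
import sysconfig

# Build-time settings recorded for this interpreter; values may be None
# on platforms that do not define them.
names = ("CC", "CFLAGS", "LDFLAGS", "LDSHARED", "MACOSX_DEPLOYMENT_TARGET")
build_vars = {name: sysconfig.get_config_var(name) for name in names}

for name, value in sorted(build_vars.items()):
    print("%-25s %s" % (name, value))
```

If an -isysroot option shows up in CFLAGS or LDSHARED, it will be applied to every extension built with that interpreter, not only to lxml.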
From wichert at wiggy.net Tue Mar 23 08:33:29 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Tue, 23 Mar 2010 08:33:29 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA235A5.7070507@wiggy.net>
References: <4BA235A5.7070507@wiggy.net>
Message-ID: <4BA86EC9.2060807@wiggy.net>
I apologize if I'm being impatient, but I am wondering if the lack of
response means that people are too busy to look at this, or if it means
that this is, at least currently, not possible with lxml?
Regards,
Wichert.
On 3/18/10 15:16 , Wichert Akkerman wrote:
> I am having some problems adding a new namespace to a parsed document.
> My goal is to take an input file like this:
>
>
>
>
>
>
>
>
>
> and turn it into this:
>
> xmlns:i18n="http://xml.zope.org/namespaces/i18n">
>
>
>
>
>
>
> the code is fairly simple, and looks like this (simplified from original):
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=lxml.etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     root.nsmap["i18n"]=NS
> for el in root.iter():
>     if "{%s}translate" % NS in el.attrib:
>         continue
>     if hasText(el):
>         el.attrib["{%s}translate" % NS]="string%d" % count
>         count+=1
> print lxml.etree.tostring(tree)
>
> However the resulting output looks like this:
>
>
>
>
ns0:translate="string1">first paragraph
>
ns1:translate="string2">second paragraph
>
>
>
> while trying to debug this I noticed something odd: lxml allows you to
> modify the nsmap for an element, but ignores what you do:
>
> >>> root.nsmap
> {None: 'http://www.w3.org/1999/xhtml', 'py':
> 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
> >>> root.nsmap["frop"]='http://frip'
> >>> root.nsmap
> {None: 'http://www.w3.org/1999/xhtml', 'py':
> 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
>
> I would expect that to either work, or raise an exception telling me I
> am trying to do something that is not allowed. The current behaviour
> feels a bit unpythonic.
>
> It is possible to specify your own nsmap when creating elements, but I
> can not find an API to modify the nsmap for a parsed tree. Is that a
> missing feature, or is there another way to do this?
>
> Wichert.
From wichert at wiggy.net Tue Mar 23 10:02:14 2010
From: wichert at wiggy.net (Wichert Akkerman)
Date: Tue, 23 Mar 2010 10:02:14 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <1269333717.10101.127.camel@ddbc-it-simon>
References: <4BA235A5.7070507@wiggy.net> <4BA86EC9.2060807@wiggy.net>
<1269333717.10101.127.camel@ddbc-it-simon>
Message-ID: <4BA88396.9010607@wiggy.net>
On 3/23/10 09:41 , Simon Wiles 魏希明 wrote:
> On Tue, 2010-03-23 at 08:33 +0100, Wichert Akkerman wrote:
>> I apologize if I'm being impatient, but I am wondering if the lack of
>> response means that people are too busy to look at this, or if it means
>> that this is, at least currently, not possible with lxml?
>>
>> Regards,
>> Wichert.
>
>
> You could try something like this:
>
> ====================================
>
> from lxml import etree
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     new_root = etree.Element(root.tag, nsmap=dict(i18n=NS, **root.nsmap))
>     new_root[:] = root[:]
> for el in new_root.iter():
>     if "{%s}translate" % NS in el.attrib:
>         continue
>     if el.text is not None and el.text.strip() != '':
>         el.attrib["{%s}translate" % NS]="string%d" % count
>         count+=1
> print etree.tostring(new_root)
>
>
> ====================================
>
>
> Is that what you had in mind?
Almost! The problem with this approach is that you lose the doctype,
since that is serialised as part of tree.docinfo, while you are now only
outputting the root and its children. As a workaround I could manually
output tree.docinfo.doctype I suppose.
Wichert.
From simonjwiles at gmail.com Tue Mar 23 09:41:57 2010
From: simonjwiles at gmail.com (Simon Wiles 魏希明)
Date: Tue, 23 Mar 2010 16:41:57 +0800
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA86EC9.2060807@wiggy.net>
References: <4BA235A5.7070507@wiggy.net> <4BA86EC9.2060807@wiggy.net>
Message-ID: <1269333717.10101.127.camel@ddbc-it-simon>
On Tue, 2010-03-23 at 08:33 +0100, Wichert Akkerman wrote:
> I apologize if I'm being impatient, but I am wondering if the lack of
> response means that people are too busy to look at this, or if it means
> that this is, at least currently, not possible with lxml?
>
> Regards,
> Wichert.
You could try something like this:
====================================
from lxml import etree

NS="http://xml.zope.org/namespaces/i18n"
tree=etree.parse(input)
root=tree.getroot()
count=1
if "i18n" not in root.nsmap:
    new_root = etree.Element(root.tag, nsmap=dict(i18n=NS, **root.nsmap))
    new_root[:] = root[:]
for el in new_root.iter():
    if "{%s}translate" % NS in el.attrib:
        continue
    if el.text is not None and el.text.strip() != '':
        el.attrib["{%s}translate" % NS]="string%d" % count
        count+=1
print etree.tostring(new_root)
====================================
Is that what you had in mind?
simon
From stefan_ml at behnel.de Tue Mar 23 20:09:29 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Tue, 23 Mar 2010 20:09:29 +0100
Subject: [lxml-dev] adding a namespace
In-Reply-To: <4BA235A5.7070507@wiggy.net>
References: <4BA235A5.7070507@wiggy.net>
Message-ID: <4BA911E9.4030800@behnel.de>
Hi,
bumping this thread was a good idea, it seems. ;)
Wichert Akkerman, 18.03.2010 15:16:
> I am having some problems adding a new namespace to a parsed document.
> My goal is to take an input file like this:
>
>
>
>
>
>
>
>
> and turn it into this:
>
> xmlns:i18n="http://xml.zope.org/namespaces/i18n">
>
>
>
>
>
>
> the code is fairly simple, and looks like this (simplified from original):
>
> NS="http://xml.zope.org/namespaces/i18n"
> tree=lxml.etree.parse(input)
> root=tree.getroot()
> count=1
> if "i18n" not in root.nsmap:
>     root.nsmap["i18n"]=NS
Ok, this won't work as the return value of the nsmap property is a newly
created dict. The reason is that it returns a map of all prefixes that are
defined in the context of the Element, including all live prefixes defined
on its ancestors.
I've added a short section to the tutorial that explains this (not on the
website yet).
> I would expect that to either work, or raise an exception telling me I
> am trying to do something that is not allowed. The current behaviour
> feels a bit unpythonic.
You get a plain dict here, so an exception won't work. It would also be
unfriendly to return a read-only dict (which would raise an exception on
changes) as it's quite reasonable to use the dict in other places of your code.
> It is possible to specify your own nsmap when creating elements, but I
> can not find an API to modify the nsmap for a parsed tree. Is that a
> missing feature, or is there another way to do this?
Simon showed you a way, but apart from that, it's a missing feature.
Changing namespace mappings is nothing that the ElementTree API needs to
care about, and lxml clearly lacks a good way to do it.
Could you file a ticket on the bug tracker? This should be doable for 2.3.
Stefan
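The two points in this thread (nsmap hands back a fresh dict, and a prefix can be declared on a replacement root) can be sketched as follows. This is a Python 3 variation on Simon's suggestion, not an lxml API for editing namespace maps. The input document is invented, and root attributes and the doctype are not carried over.

```python
from lxml import etree

NS = "http://xml.zope.org/namespaces/i18n"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical input document with a default namespace.
root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><p>first paragraph</p></body></html>'
)

# Each access to .nsmap builds a new dict, so this assignment is silently lost.
root.nsmap["i18n"] = NS
lost = "i18n" not in root.nsmap

# Declare the prefix on a replacement root, then move the children across.
nsmap = dict(root.nsmap)
nsmap["i18n"] = NS
new_root = etree.Element(root.tag, nsmap=nsmap)
new_root.text = root.text
new_root.extend(list(root))

# Attributes in the i18n namespace now pick up the declared prefix.
for count, el in enumerate(new_root.iter("{%s}p" % XHTML_NS), 1):
    el.set("{%s}translate" % NS, "string%d" % count)

result = etree.tostring(new_root)
print(result.decode("ascii"))
```

Copying nsmap into a plain dict first avoids passing the None key (the default namespace) as a keyword argument, which Python 3 rejects.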
From nickle at gmail.com Thu Mar 25 11:10:36 2010
From: nickle at gmail.com (Nick Leaton)
Date: Thu, 25 Mar 2010 10:10:36 +0000
Subject: [lxml-dev] Namespaces
Message-ID: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
I'm trying to generate the following header for an xml file
However after reading the section on nsmap on this page
http://codespeak.net/lxml/tutorial.html I'm none the wiser
Can anyone give a hand?
Thanks
--
Nick
From Tim.Arnold at sas.com Thu Mar 25 20:00:55 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Thu, 25 Mar 2010 15:00:55 -0400
Subject: [lxml-dev] help with a special attribute
Message-ID:
Hi, I have some citation keys that contain colons in my source xml. I use lxml to manipulate that source into valid docbook.
For example, a key might look like this "kdpm_c:78" I don't have any way to change the keys in the original source to get rid of the colon.
I'm currently manipulating the key into which is ok but not valid docbook. The attribute needs to be xml:id="kdpm_c78".
But if I create it that way, then lxml won't parse it since the key has the colon. On the other hand if I try to postprocess it by adding an xml:id attribute like this:
elem.set('xml:id', elem.get('id').replace(':', ''))
lxml says "Invalid attribute name u'xml:id'".
Is there any way to start with "kdpm_c:78" and end up with without plain-text-processing?
thanks,
--Tim
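A possible approach, sketched for Python 3: lxml addresses namespaced attributes by their full namespace URI in Clark notation rather than by prefix, and the xml prefix is bound to a fixed URI. The element name citation is hypothetical, since the original markup did not survive in this message.

```python
from lxml import etree

# The namespace that the reserved "xml" prefix is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical element carrying the original key with its colon.
elem = etree.Element("citation", id="kdpm_c:78")

# Set xml:id via {namespace}name, then drop the plain id attribute.
elem.set("{%s}id" % XML_NS, elem.get("id").replace(":", ""))
del elem.attrib["id"]

out = etree.tostring(elem)
print(out.decode("ascii"))
```

The colon is stripped before the value is stored, so the resulting xml:id contains no colon.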
From stefan_ml at behnel.de Thu Mar 25 20:47:33 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Thu, 25 Mar 2010 20:47:33 +0100
Subject: [lxml-dev] Namespaces
In-Reply-To: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
References: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
Message-ID: <4BABBDD5.5040301@behnel.de>
Nick Leaton, 25.03.2010 11:10:
> I'm trying to generate the following header for an xml file
>
> <messages xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:noNamespaceSchemaLocation="cmf.xsd">
>
> However after reading the section on nsmap on this page
> http://codespeak.net/lxml/tutorial.html I'm none the wiser
>
> Can anyone give a hand?
This should work:
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
messages = etree.Element('messages', nsmap = {'xsi' : XSI_NS})
messages.set("{%s}noNamespaceSchemaLocation" % XSI_NS, "cmf.xsd")
Stefan
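For reference, Stefan's three lines can be run end to end as below to inspect the generated header. This is a Python 3 sketch that only adds the import and the serialisation step.

```python
from lxml import etree

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Declare the xsi prefix on the root, then set the namespaced attribute.
messages = etree.Element("messages", nsmap={"xsi": XSI_NS})
messages.set("{%s}noNamespaceSchemaLocation" % XSI_NS, "cmf.xsd")

header = etree.tostring(messages, pretty_print=True).decode("ascii")
print(header)
```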
From nickle at gmail.com Thu Mar 25 21:16:09 2010
From: nickle at gmail.com (Nick Leaton)
Date: Thu, 25 Mar 2010 20:16:09 +0000
Subject: [lxml-dev] Namespaces
In-Reply-To: <4BABBDD5.5040301@behnel.de>
References: <8797930d1003250310k637e4836yfc6f2ada35353e23@mail.gmail.com>
<4BABBDD5.5040301@behnel.de>
Message-ID: <8797930d1003251316u11b118f1y27d0c191ef6d808c@mail.gmail.com>
Thanks - I'll try it out tomorrow
Nick
On Thu, Mar 25, 2010 at 7:47 PM, Stefan Behnel wrote:
> Nick Leaton, 25.03.2010 11:10:
>
> I'm trying to generate the following header for an xml file
>>
>> > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:noNamespaceSchemaLocation="cmf.xsd">
>>
>> However after reading the section on nsmap on this page
>> http://codespeak.net/lxml/tutorial.html I'm none the wiser
>>
>> Can anyone give a hand?
>>
>
> This should work:
>
> XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
>
> messages = etree.Element('messages', nsmap = {'xsi' : XSI_NS})
> messages.set("{%s}noNamespaceSchemaLocation" % XSI_NS, "cmf.xsd")
>
> Stefan
>
--
Nick
From Tim.Arnold at sas.com Fri Mar 26 18:19:59 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:19:59 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
Message-ID:
Hi,
I apply several manipulations on an xml document and it *seems* like some of them are ignored.
For example, the fix_nested_optional method is called last in a sequence of manipulations:
-----------------------------------
xns = {'d':'http://docbook.org/ns/docbook'}
class DocBookProcessor(object):
    def __init__(self, trees):
        self.trees = trees

    def process(self):
        for _, tree in self.trees.items():
            ... many methods called .....
            self.fix_nested_optional()
        return self.trees

    def fix_nested_optional(self):
        for optional in self.tree.xpath('//d:optional/d:optional', namespaces=xns):
            optional.tag = 'phrase'
-----------------------------------
But when the tree is written out, I still have nested optional tags. In fact if I apply the same function to the newly written file, the nested optionals are taken care of.
How can this be? Is the lxml document changed immediately or is there some sort of wait before the changes take effect?
thanks,
--Tim
From Tim.Arnold at sas.com Fri Mar 26 18:39:00 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:39:00 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To:
References:
Message-ID:
> -----Original Message-----
> From: Jens Quade [mailto:jq at qdevelop.de]
> Sent: Friday, March 26, 2010 1:32 PM
> To: Tim Arnold
> Subject: Re: [lxml-dev] multiple manipulation of xml file, some are ignored
>
>
> On 26.03.2010, at 18:19, Tim Arnold wrote:
>
> where does self.tree come from? Is it part of self.trees?
> wouldn't it be clearer if "tree" was a parameter to
> fix_nested_optional(self, tree)
>
>
Sorry, that's important isn't it. Each tree is an lxml document representing a chapter in a book.
The loop called 'process' above sets 'self.tree' to tree and then calls the methods. I think you're right though, just sending tree as the argument to the methods would be cleaner.
The current method looks like this:
    def process(self):
        for _, tree in self.trees.items():
            self.tree = tree
            self.fix_options()
            self.fix_optionalias()
            self.create_outputs()
            self.clean_bibliography()
            self.drop_pdftext()
            self.fix_SAS_output()
            self.drop_empty_elem('para')
            self.drop_empty_elem('blockquote')
            self.drop_elem_with_inlineequation('indexterm')
            self.fix_nested_optional()
        return self.trees
Do you think this setup is causing the problem? I'll rewrite to send tree to the method as an argument and see if that changes anything.
thanks,
--Tim
From Tim.Arnold at sas.com Fri Mar 26 18:54:17 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Fri, 26 Mar 2010 13:54:17 -0400
Subject: [lxml-dev] update: multiple manipulation, some ignored
Message-ID:
Hi,
It doesn't seem to matter whether the lxml object is passed as an argument to the method or not. I recoded with identical results. For completeness sake, here is how the self.trees object is created:
    for f in [x for x in os.listdir(path) if x.endswith('.xml')]:
        fname = os.path.join(path, f)
        fd = codecs.open(fname, 'rb', encoding='utf8')
        try:
            self.trees[fname] = etree.fromstring(fd.read())
        except etree.XMLSyntaxError as e:
            print 'ERROR: %s cannot be parsed: ' % (os.path.basename(fname))
            print '%s \n' % e
        finally:
            fd.close()
It is that dictionary object self.trees that contains the etrees that is passed to the DocBookProcessor class, as described in the first part of this thread:
http://codespeak.net/pipermail/lxml-dev/2010-March/005341.html
again, thanks for any help. I'm stymied.
--Tim
From stefan_ml at behnel.de Fri Mar 26 19:56:19 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 19:56:19 +0100
Subject: [lxml-dev] update: multiple manipulation, some ignored
In-Reply-To:
References:
Message-ID: <4BAD0353.3020404@behnel.de>
Tim Arnold, 26.03.2010 18:54:
>     fname = os.path.join(path, f)
>     fd = codecs.open(fname, 'rb', encoding='utf8')
>     try:
>         self.trees[fname] = etree.fromstring(fd.read())
>     except etree.XMLSyntaxError as e:
>         print 'ERROR: %s cannot be parsed: ' % (os.path.basename(fname))
>         print '%s \n' % e
>     finally:
>         fd.close()
Note that this code is extremely inefficient. It recodes characters
multiple times (even using the rather slow codecs module), passes through
various I/O layers and creates several unnecessary objects on the way.
It's likely several times faster to just write
self.trees[fname] = etree.parse(fname).getroot()
Stefan
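A hedged sketch of the loading loop rewritten along the lines Stefan suggests, assuming Python 3. The helper name load_trees and the demo file names are made up for illustration; only etree.parse(), getroot() and etree.XMLSyntaxError come from lxml.

```python
import os
import tempfile

from lxml import etree


def load_trees(path):
    """Parse every *.xml file in `path` and return {filename: root element}."""
    trees = {}
    for name in sorted(os.listdir(path)):
        if not name.endswith(".xml"):
            continue
        fname = os.path.join(path, name)
        try:
            # The parser opens the file itself and takes the encoding from
            # the XML declaration, so no codecs layer is needed.
            trees[fname] = etree.parse(fname).getroot()
        except etree.XMLSyntaxError as e:
            print("ERROR: %s cannot be parsed: %s" % (name, e))
    return trees


# Demo with one well-formed and one broken file (hypothetical names).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "ch1.xml"), "w", encoding="utf-8") as f:
    f.write("<chapter><para>ok</para></chapter>")
with open(os.path.join(demo_dir, "broken.xml"), "w", encoding="utf-8") as f:
    f.write("<chapter><para>unclosed</chapter>")

trees = load_trees(demo_dir)
```

Broken files are reported and skipped, so the returned dictionary only holds trees that parsed cleanly.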
From stefan_ml at behnel.de Fri Mar 26 20:10:55 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 20:10:55 +0100
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To:
References:
Message-ID: <4BAD06BF.8080405@behnel.de>
Tim Arnold, 26.03.2010 18:39:
> From: Jens Quade
>> On 26.03.2010, at 18:19, Tim Arnold wrote:
>>> I apply several manipulations on an xml document and it *seems* like some
>>> of them are ignored.
>>> For example, the fix_nested_optional method is called last in a sequence
>>> of manipulations:
>>> -----------------------------------
>>> xns = {'d':'http://docbook.org/ns/docbook'}
>>> class DocBookProcessor(object):
>>>     def __init__(self, trees):
>>>         self.trees = trees
>>>
>>>     def process(self):
>>>         for _, tree in self.trees.items():
>>>             ... many methods called .....
>>>             self.fix_nested_optional()
>>>         return self.trees
>>>
>>>     def fix_nested_optional(self):
>>>         for optional in self.tree.xpath('//d:optional/d:optional', namespaces=xns):
>>>             optional.tag = 'phrase'
>>> -----------------------------------
>>>
>>> But when the tree is written out, I still have nested optional tags. In
>>> fact if I apply the same function to the newly written file, the nested
>>> optionals are taken care of.
>>
>> where does self.tree come from? Is it part of self.trees?
>> wouldn't it be clearer if "tree" was a parameter to
>> fix_nested_optional(self, tree)
I second that.
>>> How can this be? Is the lxml document changed immediately
Yes.
> Each tree is an lxml document representing a chapter in a book. The loop
> called 'process' above sets 'self.tree' to tree and then calls the
> methods. I think you're right though, just sending tree as the argument
> to the methods would be cleaner. The current method looks like this:
>
>     def process(self):
>         for _, tree in self.trees.items():
>             self.tree = tree
>             self.fix_options()
>             self.fix_optionalias()
>
>             self.create_outputs()
>             self.clean_bibliography()
>             self.drop_pdftext()
>             self.fix_SAS_output()
>             self.drop_empty_elem('para')
>             self.drop_empty_elem('blockquote')
>             self.drop_elem_with_inlineequation('indexterm')
>             self.fix_nested_optional()
>         return self.trees
Looking at your pipeline, it's quite possible that you messed up your
namespaces somewhere along the path. You may have added elements to the
tree that do not have a namespace (or maybe renamed their tags), which then
can't be found by the namespaced XPath expression.
To debug, print the namespaced tag names between two pipeline steps:
    for el in tree.iter():
        print el.tag
That being said, without a deeper look into your code it's impossible to
figure out what's going wrong and where. Try to strip down the pipeline by
eliminating steps that do not induce problems, and reduce your code to an
easily testable example that reproduces the problem.
Stefan
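To illustrate the failure mode Stefan describes, here is a small sketch under stated assumptions: Python 3, a made-up two-element document, and a hypothetical earlier pipeline step that renames an "opt" element to "optional". It is not Tim's actual pipeline.

```python
from lxml import etree

DB_NS = "http://docbook.org/ns/docbook"
xns = {"d": DB_NS}
SOURCE = '<para xmlns="%s"><opt><opt>x</opt></opt></para>' % DB_NS


def count_nested(root):
    """Number of d:optional elements directly inside another d:optional."""
    return len(root.xpath("//d:optional/d:optional", namespaces=xns))


# Renaming with a bare tag name drops the element out of the DocBook
# namespace, so the namespaced XPath expression no longer matches it.
bad = etree.fromstring(SOURCE)
for el in list(bad.iter("{%s}opt" % DB_NS)):
    el.tag = "optional"
bad_count = count_nested(bad)

# Renaming with the namespace in Clark notation keeps the element findable.
good = etree.fromstring(SOURCE)
for el in list(good.iter("{%s}opt" % DB_NS)):
    el.tag = "{%s}optional" % DB_NS
good_count = count_nested(good)

# Stefan's debugging aid: list the fully qualified tag names.
bad_tags = [el.tag for el in bad.iter()]
good_tags = [el.tag for el in good.iter()]
print(bad_tags)
print(good_tags)
```

Printing the tags between pipeline steps makes the difference visible at a glance: namespaced tags carry the {uri} prefix, bare ones do not.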
From stefan_ml at behnel.de Fri Mar 26 20:25:53 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Fri, 26 Mar 2010 20:25:53 +0100
Subject: [lxml-dev] help with a special attribute
In-Reply-To:
References:
Message-ID: <4BAD0A41.2030802@behnel.de>
Tim Arnold, 25.03.2010 20:00:
> Hi, I have some citation keys that contain colons in my source xml. I
> use lxml to manipulate that source into valid docbook. For example, a
> key might look like this "kdpm_c:78" I don't have any way to change the
> keys in the original source to get rid of the colon.
>
> I'm currently manipulating the key into which is
> ok but not valid docbook. The attribute needs to be xml:id="kdpm_c78".
'xml:id' is a qualified name consisting of a namespace prefix and a local
name. By specification, the 'xml' prefix maps to the namespace URI
http://www.w3.org/XML/1998/namespace. lxml.etree (and ElementTree) writes
this in Clark notation: "{http://www.w3.org/XML/1998/namespace}id".
http://www.jclark.com/xml/xmlns.htm
> But if I create it that way, then lxml won't parse it since the key has
> the colon. On the other hand if I try to postprocess it by adding an
> xml:id attribute like this: elem.set('xml:id',
> elem.get('id').replace(':', ''))
>
> lxml says "Invalid attribute name u'xml:id'
>
> Is there any way to start with "kdpm_c:78" and end up with xml:id="kdpm_c78"> without plain-text-processing?
This should work:
    for el in root.iter():
        id_text = el.get('id')
        if id_text:
            el.set("{http://www.w3.org/XML/1998/namespace}id",
                   id_text.replace(':', ''))
Stefan
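A runnable sketch of the snippet above, assuming Python 3. The sample document and the decision to delete the old plain "id" attribute afterwards are illustrative assumptions, not part of the original answer.

```python
from lxml import etree

XML_NS = "http://www.w3.org/XML/1998/namespace"

root = etree.fromstring(
    '<bibliography><biblioentry id="kdpm_c:78"/><biblioentry/></bibliography>'
)

for el in root.iter():
    id_text = el.get("id")
    if id_text:
        # 'xml:id' is spelled in Clark notation; lxml writes it back
        # with the reserved 'xml' prefix on serialization.
        el.set("{%s}id" % XML_NS, id_text.replace(":", ""))
        del el.attrib["id"]  # assumption: the old attribute is not wanted

serialized = etree.tostring(root, encoding="unicode")
print(serialized)
```

Passing the literal string 'xml:id' to set() fails because a colon is not allowed in a plain attribute name; the namespace has to be given by its URI.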
From jq at qdevelop.de Fri Mar 26 22:05:46 2010
From: jq at qdevelop.de (Jens Quade)
Date: Fri, 26 Mar 2010 22:05:46 +0100
Subject: [lxml-dev] update: multiple manipulation, some ignored
In-Reply-To:
References:
Message-ID: <0CC817C5-9A0B-4503-80BB-0A2FFFC2CBAD@qdevelop.de>
On 26.03.2010, at 18:54, Tim Arnold wrote:
> Hi,
> It doesn't seem to matter whether the lxml object is passed as an argument to the method or not. I recoded with identical results.
Can you provide a minimum document that shows the behavior? Simple tests, like
>>> tree = XML('foobar')
>>> for b in tree.xpath('//b/b'):
... b.tag = 'c'
...
>>> dump(tree)
foo
bar
>>>
seem to work.
Can you dump and compare the document before and after the call to the tag rewriting function? Does that provide any clues when nested tags are not fixed?
If you reparse the tree before the tag rewriting function, like XML(tostring(tree)), does the function then work?
From sergio at sergiomb.no-ip.org Sun Mar 28 03:52:11 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Sun, 28 Mar 2010 02:52:11 +0100
Subject: [lxml-dev] copy a xpath from an element to an element without
double copies #2
Message-ID: <1269741131.30391.24.camel@segulix>
Hello,
based on
"Note that the .append() method *moves* the element to the new position.
If you want to copy it, use the copy module to create a deep copy of the
element before moving it over."
I wrote this simple "realxpath" function. It moves all matching elements
from one tree to another. If the root of an element returned by the XPath
is no longer the root of the original html_document, that element has
already been moved (as a descendant of an earlier match), so we don't move
it again.
    from lxml import html
    f = open("teste.html").read()
    html_document = html.fromstring(f)
    elems=html_document.xpath('//h1|//div[@id="articleTitle"]')
    text=""
    elembody = html.Element("body")
    for frags in elems:
        if frags.getroottree().getroot() != html_document:
            continue
        elembody.append(frags)
    text += html.tostring(elembody, method="html", encoding="utf-8")
My previous solution, which re-evaluated the XPath each time an element was
moved, doesn't work with XPath expressions that use the position of a node,
like '//table[1]'.
I hope the realxpath function could become a feature of lxml :)
Best regards, and thanks for your help.
--
Sérgio M. B.
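The quoted documentation note also mentions copying instead of moving. The following is a sketch of that variant, not Sergio's realxpath function: it swaps in a different duplicate check (skip any match that has another match among its ancestors) and leaves the source document untouched. The sample HTML is made up, and Python 3 is assumed.

```python
import copy

from lxml import html

html_document = html.fromstring(
    "<html><body><h1>Top</h1>"
    '<div id="articleTitle"><h1>Nested</h1><p>text</p></div>'
    "</body></html>"
)

elems = html_document.xpath('//h1|//div[@id="articleTitle"]')
selected = set(elems)

elembody = html.Element("body")
for el in elems:
    # A match nested inside another match is already part of that copy.
    if any(anc in selected for anc in el.iterancestors()):
        continue
    elembody.append(copy.deepcopy(el))

result = html.tostring(elembody, method="html", encoding="unicode")
print(result)
```

Because nothing is moved, positional expressions such as '//table[1]' are evaluated once against an unchanged tree.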
From mykingheaven at gmail.com Sun Mar 28 06:09:26 2010
From: mykingheaven at gmail.com (David Shieh)
Date: Sun, 28 Mar 2010 12:09:26 +0800
Subject: [lxml-dev] How to get HTML charset ?
Message-ID:
Hi all,
I have used lxml for a long time and it works fine for me.
But now I am confused about charsets. To find the original charset of an
HTML file, I used the code below:
    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)
Surprisingly, there is no output for any <meta http-equiv="Content-Type" .. />
element. So how can I find out the original charset of this HTML?
BTW, I also used urllib2 to get the charset, with the code below:
    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
    except HTTPError, e:
        print e.code
    else:
        print response.headers.getheader('Content-Type')
Not every site returns its charset; some sites don't return any charset
information.
What can I do if I really want to know the charset?
Thanks, guys.
Best wishes,
David
--
----------------------------------------------
Attitude determines everything !
----------------------------------------------
From sergio at sergiomb.no-ip.org Sun Mar 28 12:11:37 2010
From: sergio at sergiomb.no-ip.org (Sergio Monteiro Basto)
Date: Sun, 28 Mar 2010 11:11:37 +0100
Subject: [lxml-dev] How to get HTML charset ?
In-Reply-To:
References:
Message-ID: <1269771097.2155.7.camel@segulix>
On Sun, 2010-03-28 at 12:09 +0800, David Shieh wrote:
> Hi all,
>
> I use lxml for a long time and it works fine for me.
> But now, I get confused about the charset thing. When I want to get
> the original charset of a html file, I used codes below:
>
> file_content = ''.join(
> [i.rstrip('\r\n ').lstrip() for i in
> response.readlines()]
> )
> html = lxml.html.fromstring(file_content)
> for i in html.xpath('head/meta'):
xpath('.//meta[@http-equiv="Content-Type"]/@content')
I don't know whether this also matches a lower-case "content-type".
If not, try:
xpath('.//meta[re:test(@http-equiv, "^Content-Type$", "i")]',
namespaces={"re": "http://exslt.org/regular-expressions"})
> print lxml.html.tostring(i)
>
> Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. />
> element. So, how can I know the original charset of this html?
> BTW, I used urllib2 to get charset, using the codes below:
>
> req = urllib2.Request(url)
> try:
> response = urllib2.urlopen(req)
> except HTTPError, e:
> print e.code
> else:
> print response.headers.getheader('Content-Type')
>
> Not every sites return its charset, some sites don't return any
> charset information.
> What I gonna do if I really want to know the charset?
>
> Thanks, guys.
>
> Best wishes,
> David
--
Sérgio M. B.
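A sketch of how the case-insensitive XPath above might be wrapped into a helper, assuming Python 3. The function name declared_charset, the sample page and the regular expression used to pick the charset out of the content value are illustrative assumptions.

```python
import re

from lxml import html

RE_NS = {"re": "http://exslt.org/regular-expressions"}


def declared_charset(raw_bytes):
    """Return the charset named in a Content-Type meta tag, or None."""
    doc = html.fromstring(raw_bytes)
    contents = doc.xpath(
        './/meta[re:test(@http-equiv, "^Content-Type$", "i")]/@content',
        namespaces=RE_NS,
    )
    for content in contents:
        match = re.search(r"charset=([\w.:-]+)", content, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


page = (
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=ISO-8859-1"><title>t</title></head>'
    b"<body><p>hello</p></body></html>"
)
charset = declared_charset(page)
no_charset = declared_charset(b"<p>no head at all</p>")
print(charset, no_charset)
```

This only reports what the page declares about itself; a page without such a tag yields None, and the HTTP Content-Type header remains the other place to look.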
From jholg at gmx.de Mon Mar 29 16:49:10 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Mon, 29 Mar 2010 16:49:10 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
Message-ID: <20100329144910.108930@gmx.net>
Hi,
I just noticed that ObjectifiedDataElements can not be used as dict keys (I expected that) but ObjectifiedElements can (which I expected not):
>>> root = objectify.Element('root')
>>> {root: 1}
{: 1}
>>> root.s = "some string"
>>> {root.s: 1}
Traceback (most recent call last):
File "", line 1, in ?
TypeError: unhashable type
>>>
>>> hash(root)
4445024
>>> hash(root.s)
Traceback (most recent call last):
File "", line 1, in ?
TypeError: unhashable type
>>>
But:
>>> root.s.__hash__
>>> root.s.__hash__()
4444928
>>>
So I'm obviously missing something about the hashability rules.
Any quick hint on that?
Holger
From Tim.Arnold at sas.com Mon Mar 29 19:29:18 2010
From: Tim.Arnold at sas.com (Tim Arnold)
Date: Mon, 29 Mar 2010 13:29:18 -0400
Subject: [lxml-dev] multiple manipulation of xml file, some are ignored
In-Reply-To: <4BAD06BF.8080405@behnel.de>
References:
<4BAD06BF.8080405@behnel.de>
Message-ID:
Thanks for your input Stefan. You're right--I had messed up my namespaces by changing a tag name without prepending the appropriate namespace. Once I changed that, things started working.
Also, thank you for the comment about the inefficient code in reading in the XML. I now just use etree to parse the file in the first step without resorting to codecs.
The xml:id code-snippet you sent works well. I understood it as I read the code, but it wasn't something I would have figured out without your help.
In short, thanks to you and to Jens Quade, my workflow from LaTeX to DocBook5 is working now and I'm starting to clean things up better.
thanks very much for your help,
--Tim Arnold
From stefan_ml at behnel.de Mon Mar 29 21:56:07 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Mon, 29 Mar 2010 21:56:07 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <20100329144910.108930@gmx.net>
References: <20100329144910.108930@gmx.net>
Message-ID: <4BB105D7.5080604@behnel.de>
Hi Holger,
it's funny to see exactly the same question come up on the Cython mailing
list and here within just a couple of days.
jholg at gmx.de, 29.03.2010 16:49:
> I just noticed that ObjectifiedDataElements can not be used as dict
> keys (I expected that) but ObjectifiedElements can (which I expected
> not):
>
>>>> root = objectify.Element('root')
>>>> {root: 1}
> {: 1}
>>>> root.s = "some string"
>>>> {root.s: 1}
> Traceback (most recent call last):
> File "", line 1, in ?
> TypeError: unhashable type
>>>>
>
>>>> hash(root)
> 4445024
>>>> hash(root.s)
> Traceback (most recent call last):
> File "", line 1, in ?
> TypeError: unhashable type
>>>>
>
> But:
>
>>>> root.s.__hash__
>
>>>> root.s.__hash__()
> 4444928
>>>>
>
> So I'm obviously missing something about the hashability rules.
> Any quick hint on that?
http://docs.python.org/c-api/typeobj.html#tp_compare
http://docs.python.org/reference/datamodel.html#object.__hash__
The reason is that ODE overrides __richcmp__ but not __hash__, whereas the
baseclass (OE) overrides none of the two. The CPython runtime lets the OE
type inherit both from the baseclass in this case, whereas it considers the
ODE type an unhashable type and inherits none.
The currently proposed solution is to fix this in Cython by automatically
setting up both if they are implemented within the type hierarchy. However,
the quick fix is to add a __hash__ to ODE that returns the base type's hash
value.
Stefan
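The same rule can be reproduced in plain Python 3, where a class that defines __eq__ without __hash__ becomes unhashable. The classes below are stand-ins invented for illustration; they are not lxml's ObjectifiedElement and ObjectifiedDataElement, and the pyval-based hash is only one possible choice.

```python
class Element:
    """Stand-in for OE: overrides neither comparison nor hashing."""


class DataElement(Element):
    """Stand-in for ODE: overrides comparison only."""

    def __init__(self, pyval):
        self.pyval = pyval

    def __eq__(self, other):
        return self.pyval == getattr(other, "pyval", other)


class HashableDataElement(DataElement):
    """The quick fix: add a __hash__ next to the inherited __eq__."""

    def __hash__(self):
        return hash(self.pyval)


def is_hashable(obj):
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable(Element()))                  # inherits object's hash
print(is_hashable(DataElement("s")))           # __eq__ without __hash__
print(is_hashable(HashableDataElement("s")))   # hash restored
```

Whatever value __hash__ returns, it has to agree with __eq__: objects that compare equal must hash equal, which is why hashing the wrapped value is a natural fit when comparison is also based on it.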
From dkuhlman at rexx.com Mon Mar 29 23:48:31 2010
From: dkuhlman at rexx.com (Dave Kuhlman)
Date: Mon, 29 Mar 2010 14:48:31 -0700
Subject: [lxml-dev] Tempory data attached to custom subclasses
Message-ID: <20100329214830.GA21855@cutter.rexx.com>
I've been using the custom subclasses capability of lxml. It's
slick.
I do, however, miss the ability to attach temporary data to the
ElementBase subclasses. (see the warnings under "Element
initialization" at http://codespeak.net/lxml/element_classes.html)
I can, as suggested by the docs, add attributes or children to the
underlying etree.Element, but that means that I'd have to strip
that temporary data off when I want to serialize the tree.
(please stop me if you've already heard this request, or if there is
another solution.)
I'd have a solution (see below) to this need if I could get a
value, say an ID, (1) that is unique to each node and (2) that does
not change during the existence of the ElementTree. Note that this
"ID" does not have to be meaningful, and does not need to enable me
to do anything with the underlying XML object (other than
re-identify it).
If I could get this opaque ID (or whatever it might be called),
then I could use a dictionary and something like the following to
store and retrieve temporary data::
    Datadict1 = {}

    def get_temp_data(node, datadict):
        id = node.get_opaque_id()
        if id in datadict:
            return datadict[id]
        else:
            data = {}
            datadict[id] = data
            return data

    def test():
        doc = lxml.etree.parse('somedoc.xml')
        root = doc.getroot()
        node = root[0]
        data = get_temp_data(node, Datadict1)
        value1 = 'some temporary data'
        data['key1'] = value1
        o
        o
        o
        data = get_temp_data(node, Datadict1)
        print data['key1']

    test()
Looking at lxml-2.2.4/src/lxml/lxml.etree.pyx, it seems like that
would be a trivial function to add. (see below)
What do you think? It's a pretty simple solution. Has it been tried
or rejected already?
Here is a patch that seems to add the necessary function. This
function returns the C pointer to the libxml2 object that is
underneath the lxml/etree object. Am I right that this value would
be (1) unique and (2) persistent across the lifetime of the
lxml/etree ElementTree?
Index: lxml.etree.pyx
===================================================================
--- lxml.etree.pyx (revision 71999)
+++ lxml.etree.pyx (working copy)
@@ -1185,6 +1185,21 @@
             return None
         return _elementFactory(self._doc, c_node)
+    def getopaqueid(self):
+        u"""getopaqueid(self)
+
+        Returns an opaque ID for the underlying XML C node. This
+        opaque ID is guaranteed (1) to be unique to each node
+        and (2) not to change during the existence of the
+        ElementTree.
+        """
+        cdef xmlNode* c_node
+        cdef int intnode
+        c_node = self._c_node
+        intnode = c_node
+        opaqueid = intnode
+        return opaqueid
+
     def getnext(self):
         u"""getnext(self)
- Dave
--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
From jholg at gmx.de Tue Mar 30 16:29:33 2010
From: jholg at gmx.de (jholg at gmx.de)
Date: Tue, 30 Mar 2010 16:29:33 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <4BB105D7.5080604@behnel.de>
References: <20100329144910.108930@gmx.net> <4BB105D7.5080604@behnel.de>
Message-ID: <20100330142933.271000@gmx.net>
Hi,
> > I just noticed that ObjectifiedDataElements can not be used as dict
> > keys (I expected that) but ObjectifiedElements can (which I expected
> > not):
> >
> > [...]
> http://docs.python.org/c-api/typeobj.html#tp_compare
> http://docs.python.org/reference/datamodel.html#object.__hash__
>
> The reason is that ODE overrides __richcmp__ but not __hash__, whereas the
> baseclass (OE) overrides none of the two. The CPython runtime lets the OE
> type inherit both from the baseclass in this case, whereas it considers
> the
> ODE type an unhashable type and inherits none.
>
Ah, thanks a lot for this explanation. I'd probably have had a hard time finding this out in detail.
> The currently proposed solution is to fix this in Cython by automatically
> setting up both if they are implemented within the type hierarchy.
> However,
> the quick fix is to add a __hash__ to ODE that returns the base type's
> hash
> value.
ODE should probably get a __hash__ that returns the underlying pyval hash results rather than the hash of its .text, anyway, then.
Holger
From Joe at skyscanner.net Tue Mar 30 16:53:52 2010
From: Joe at skyscanner.net (Joe Sarre)
Date: Tue, 30 Mar 2010 15:53:52 +0100
Subject: [lxml-dev] lxml iterparse generator not returning anything
Message-ID:
Hi everyone,
I'm finding that when using iterparse, the generator always throws StopIteration immediately, without returning any data. I must be doing something wrong, or I must have some kind of setup problem, but I'm struggling to work out what it is. If anybody has any ideas, then that would be greatly appreciated, or if this is a bug, I will raise it on the bug tracker.
My version details are:
>>> print etree.LXML_VERSION
(2, 2, 2, 0)
>>> print etree.LIBXML_VERSION
(2, 7, 6)
>>> print etree.LIBXML_COMPILED_VERSION
(2, 7, 3)
>>> print etree.LIBXSLT_VERSION
(1, 1, 26)
>>> print etree.LIBXSLT_COMPILED_VERSION
(1, 1, 24)
The most striking thing about this is that LIBXML_VERSION != LIBXML_COMPILED_VERSION, and LIBXSLT_VERSION != LIBXSLT_COMPILED_VERSION. If this version discrepancy is the real cause of the problem, then I think this issue is perhaps more appropriate for the Fedora mailing list, and you can ignore the rest of this mail. An example in which I am seeing this ( taken from http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk ) is:
"""
>>> from lxml import etree
>>> from StringIO import StringIO
>>> xml = '''
... text
... texttail
...
... '''
>>> print xml
text
texttail
>>> context = etree.iterparse(StringIO(xml))
>>> for action, elem in context:
... print("%s: %s" % (action, elem.tag))
end: element
end: element
end: {http://testns/}empty-element
end: root
"""
if __name__ == '__main__':
import doctest
doctest.testmod()
The result of putting this in a file and running it is that python complains:
**********************************************************************
File "test.py", line 20, in __main__
Failed example:
for action, elem in context:
print("%s: %s" % (action, elem.tag))
Expected:
end: element
end: element
end: {http://testns/}empty-element
end: root
Got nothing
**********************************************************************
1 items had failures:
1 of 6 in __main__
***Test Failed*** 1 failures.
Thanks in advance for any help,
Joe Sarre
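For comparison, a minimal check that can be run outside doctest. It assumes Python 3 (hence io.BytesIO instead of StringIO) and uses a small document modelled on the iterparse example in the lxml documentation; on a working installation the loop yields one "end" event per element. It only shows the expected behaviour and does not diagnose the version mismatch reported above.

```python
from io import BytesIO

from lxml import etree

xml = b"""<root>
  <element key='value'>text</element>
  <element>text</element>tail
  <empty-element xmlns="http://testns/" />
</root>"""

# Collect (event, tag) pairs; by default only "end" events are reported.
events = [(action, elem.tag) for action, elem in etree.iterparse(BytesIO(xml))]

for action, tag in events:
    print("%s: %s" % (action, tag))

# Version information that is useful to include in a bug report.
print(etree.LXML_VERSION, etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION)
```

If the events list comes back empty with this input, the problem lies in the installation rather than in the calling code.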
From ethan.jucovy at gmail.com Tue Mar 30 16:57:38 2010
From: ethan.jucovy at gmail.com (Ethan Jucovy)
Date: Tue, 30 Mar 2010 10:57:38 -0400
Subject: [lxml-dev] How to get HTML charset ?
In-Reply-To:
References:
Message-ID:
On Sun, Mar 28, 2010 at 12:09 AM, David Shieh wrote:
> Hi all,
>
> I use lxml for a long time and it works fine for me.
> But now, I get confused about the charset thing. When I want to get the
> original charset of a html file, I used codes below:
>
>         file_content = ''.join(
>                 [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
>             )
>         html = lxml.html.fromstring(file_content)
>         for i in html.xpath('head/meta'):
>             print lxml.html.tostring(i)
>
> Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. />
> element. So, how can I know the original charset of this html?
You need to pass the kwarg `include_meta_content_type=True` to
`tostring`, or the tag will
always be stripped on the way out --
>>> from lxml.html import fromstring, tostring
>>> x=fromstring("""""")
>>> x.xpath("head/meta")
[]
>>> [tostring(u) for u in x.xpath("head/meta")]
['']
>>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
['']
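A self-contained sketch of the behaviour described in this message, assuming Python 3 and a made-up sample page. Exact byte output is deliberately not asserted; the point is only that the meta element is present in the tree and is dropped by lxml.html.tostring() unless include_meta_content_type=True is passed.

```python
from lxml.html import fromstring, tostring

page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"><title>t</title></head>'
    b"<body><p>hi</p></body></html>"
)

doc = fromstring(page)
metas = doc.xpath("head/meta")

# Default: the Content-Type meta tag is stripped from the serialized output.
stripped = [tostring(m) for m in metas]

# With the keyword argument the tag is kept.
kept = [tostring(m, include_meta_content_type=True) for m in metas]

print(len(metas), stripped, kept)
```

So the element is parsed correctly; the empty output in the original loop comes from serialization, and reading the attribute directly (for example m.get("content")) avoids the issue altogether.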
From stefan_ml at behnel.de Wed Mar 31 10:34:34 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 31 Mar 2010 10:34:34 +0200
Subject: [lxml-dev] Tempory data attached to custom subclasses
In-Reply-To: <20100329214830.GA21855@cutter.rexx.com>
References: <20100329214830.GA21855@cutter.rexx.com>
Message-ID: <4BB3091A.5090708@behnel.de>
Dave Kuhlman, 29.03.2010 23:48:
> I've been using the custom subclasses capability of lxml. It's
> slick.
>
> I do, however, miss the ability to attach temporary data to the
> ElementBase subclasses. (see the warnings under "Element
> initialization" at http://codespeak.net/lxml/element_classes.html)
>
> I can, as suggested by the docs, add attributes or children to the
> underlying etree.Element, but that means that I'd have to strip
> that temporary data off when I want to serialize the tree.
As long as your tree doesn't change, the easiest solution is to keep a
reference to all Elements ("list(root.iter())") and then just store the
data in the proxy instances. They are guaranteed not to change as long as
there is a live reference to them.
If your tree changes, you can still try to add new Elements to your
keep-alive list to get the same behaviour, but you may need to take a
little more care when you remove elements, so that you only remove them
from the keep-alive list when you are sure they'll get discarded.
> I'd have a solution (see below) to this need if I could get a
> value, say an ID, (1) that is unique to each node and (2) that does
> not change during the existence of the ElementTree. Note that this
> "ID" does not have to be meaningful, and does not need to enable me
> to do anything with the underlying XML object (other than
> re-identify it).
>
> If I could get this opaque ID (or whatever it might be called),
> then I could use a dictionary and something like the following to
> store and retrieve temporary data:
I usually suggest using the generated XPath of the element:
http://codespeak.net/lxml/xpathxslt.html#generating-xpath-expressions
But that's certainly more expensive than just returning a Py_ssize_t value.
Stefan
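A sketch of the XPath-keyed variant mentioned above, assuming Python 3 and a tree whose structure does not change while the temporary data is in use. The helper name get_temp_data follows the earlier message; the sample document and the dictionary name are made up.

```python
from lxml import etree

root = etree.fromstring("<root><item/><item/><other/></root>")

temp_data = {}


def get_temp_data(node, datadict):
    """Return the temporary-data dict for `node`, keyed by its generated XPath."""
    key = node.getroottree().getpath(node)
    return datadict.setdefault(key, {})


get_temp_data(root[0], temp_data)["key1"] = "some temporary data"

first = get_temp_data(root[0], temp_data)   # same node, same dict
second = get_temp_data(root[1], temp_data)  # sibling gets its own dict

print(first, second, sorted(temp_data))
```

The generated path identifies a position in the tree rather than a node object, so inserting or removing siblings invalidates the keys; for trees that are being restructured, the keep-alive list described above is the safer route.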
From stefan_ml at behnel.de Wed Mar 31 12:55:42 2010
From: stefan_ml at behnel.de (Stefan Behnel)
Date: Wed, 31 Mar 2010 12:55:42 +0200
Subject: [lxml-dev] ObjectifiedDataElements as dict keys?
In-Reply-To: <20100330142933.271000@gmx.net>
References: <20100329144910.108930@gmx.net> <4BB105D7.5080604@behnel.de>
<20100330142933.271000@gmx.net>
Message-ID: <4BB32A2E.1000408@behnel.de>
jholg at gmx.de, 30.03.2010 16:29:
> Stefan Behnel:
>> The currently proposed solution is to fix this in Cython by automatically
>> setting up both if they are implemented within the type hierarchy.
... although this hasn't been decided yet. I guess we'll end up going with
the Py3 semantics here (i.e. the current semantics in Cython anyway), and
just emit a warning.
> ODE should probably get a __hash__ that returns the underlying pyval
> hash results rather than the hash of its .text, anyway, then.
Done:
https://codespeak.net/viewvc/?view=rev&revision=73205
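For readers unfamiliar with the Py3 semantics mentioned above: a class that defines __eq__ without __hash__ becomes unhashable, so both have to be provided together. The following is a plain-Python sketch of a value-based hash, using a hypothetical stand-in class ValueProxy; it is not the actual lxml.objectify implementation:

```python
class ValueProxy:
    """Hypothetical stand-in for a data element that wraps a typed value."""

    def __init__(self, text, pytype):
        self.text = text
        self._pytype = pytype

    @property
    def pyval(self):
        # Convert the stored text to its Python value on access.
        return self._pytype(self.text)

    def __eq__(self, other):
        other_val = other.pyval if isinstance(other, ValueProxy) else other
        return self.pyval == other_val

    def __hash__(self):
        # Hash the value, not the text, so that equal objects hash equally.
        return hash(self.pyval)


a = ValueProxy("1.0", float)
b = ValueProxy("1", float)   # different text, same value

lookup = {a: "first"}
via_equal_proxy = lookup[b]
via_plain_value = lookup[1.0]
```

Hashing the text instead would break the lookup through b here, because "1.0" and "1" hash differently even though the two objects compare equal.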
Stefan
From mykingheaven at gmail.com Wed Mar 31 14:01:28 2010
From: mykingheaven at gmail.com (David Shieh)
Date: Wed, 31 Mar 2010 20:01:28 +0800
Subject: [lxml-dev] How to get HTML charset ?
In-Reply-To:
References:
Message-ID:
2010/3/30 Ethan Jucovy
> On Sun, Mar 28, 2010 at 12:09 AM, David Shieh
> wrote:
> > Hi all,
> >
> > I've used lxml for a long time and it works fine for me.
> > But now I'm confused about the charset thing. When I want to get the
> > original charset of an HTML file, I use the code below:
> >
> > file_content = ''.join(
> >     [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
> > )
> > html = lxml.html.fromstring(file_content)
> > for i in html.xpath('head/meta'):
> >     print lxml.html.tostring(i)
> >
> > Surprisingly, there's no output of any <meta ... />
> > element. So, how can I know the original charset of this HTML?
>
> You need to pass the kwarg `include_meta_content_type=True` to
> `tostring`, or the Content-Type <meta> tag will always be stripped on
> the way out --
>
But I actually got the charset using Sergio's way. I think your method is
also great; I will add it as well, to be safe.
Thanks!
> >>> from lxml.html import fromstring, tostring
> >>> x=fromstring(""" content="text/html; charset=ASCII">""")
> >>> x.xpath("head/meta")
> []
> >>> [tostring(u) for u in x.xpath("head/meta")]
> ['']
> >>> [tostring(u, include_meta_content_type=True) for u in
> x.xpath("head/meta")]
> ['']
>
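The quoted session lost its markup in the archive, so here is a small self-contained sketch of the same idea. The sample document is an assumption, not the original input. It shows that `tostring` drops the Content-Type meta tag unless `include_meta_content_type=True` is passed, and that the declared charset can also be read from the element's `content` attribute:

```python
from lxml.html import fromstring, tostring

# Assumed sample document, passed as bytes.
doc = fromstring(
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=ASCII">'
    b'</head><body><p>hi</p></body></html>'
)
meta = doc.xpath("head/meta")[0]

# Default: the Content-Type meta tag is stripped from the output.
stripped = tostring(meta)

# With the keyword argument, the tag is kept.
kept = tostring(meta, include_meta_content_type=True)

# The declared charset can be read directly from the attribute.
content = meta.get("content")
charset = content.split("charset=")[-1].strip()
```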
--
----------------------------------------------
Attitude determines everything !
----------------------------------------------