[lxml-dev] Test Failures in lxml 1.3.2
Stefan Behnel
stefan_ml at behnel.de
Thu Jul 12 23:21:29 CEST 2007
Hi Tres,

thanks for testing.

Tres Seaver wrote:
> Stefan Behnel wrote:
>> It seems like the problem only arises on UCS-2 systems. Could anyone with a
>> UCS-2 Linux system check if this also fails there? UCS-2 can be detected
>> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
>> heard rumours that Red Hat systems have UCS-2 builds. Ubuntu definitely doesn't.
>>
>> The test case itself is pretty simple:
>>
>> >>> import lxml.etree as et
>> >>> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
>> >>> print repr(et.tounicode(html))
>> u'<html><body>\xc3\xa1\uf8d2</body></html>'
>>
>> To see that the actual problem is the parser, not the serialiser, you can do:
>>
>> >>> print repr(et.tostring(html, 'utf-8'))
>> '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
>
> I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my
> Ubuntu laptop::
>
> $ cat et_test.py
> import sys
> print sys.version
> print sys.maxunicode
> import lxml.etree as et
> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
> print repr(et.tounicode(html))
>
> $ /path/to/ucs4/bin/python et_test.py
> 2.4.3 (#2, Oct 6 2006, 07:52:30)
> [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
> 1114111
> u'<html><body>\xc3\xa1\uf8d2</body></html>'
> [/home/tseaver]
>
> $ /path/to/ucs2/bin/python et_test.py
> 2.4.4 (#1, Apr 19 2007, 16:14:47)
> [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
> 65535
> u'<html><body>\xc3\xa1\uf8d2</body></html>'
Hmmm, that leaves me hoping that my test case actually touched the problem.
Could we get feedback from someone with a non-working setup here?
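As a cross-check that does not depend on lxml at all: in a correct
tostring(html, 'utf-8') result, the bytes between the body tags should simply
be the UTF-8 encoding of the three input characters. Below is a minimal sketch
of that comparison. The names utf8_hex and EXPECTED_HEX are made up for
illustration and are not part of lxml.

```python
import binascii

# The three characters from the test document. All of them are in the
# Basic Multilingual Plane, so no surrogate pairs are involved on either
# a UCS-2 or a UCS-4 build.
BODY = u'\xc3\xa1\uf8d2'


def utf8_hex(text):
    """Return a hex dump of the UTF-8 encoding of a unicode string."""
    return binascii.hexlify(text.encode('utf-8')).decode('ascii')


# U+00C3 -> c3 83, U+00A1 -> c2 a1, U+F8D2 -> ef a3 92
# (hypothetical constant name, derived from the tostring() output quoted above)
EXPECTED_HEX = 'c383c2a1efa392'

if __name__ == '__main__':
    assert utf8_hex(BODY) == EXPECTED_HEX
```

If the body bytes from a failing setup differ from this, then the text was
already changed by the time it was serialised.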
So far, we have the following cases:

- it fails on Mac OS X (Intel) with a UCS-2 little-endian Python
- it fails on Windows with a UCS-2 little-endian Python
- it works on Linux/Intel with a UCS-2 little-endian Python
- it works on Linux/Intel with a UCS-4 little-endian Python
- it works on Solaris/SPARC with a UCS-2 big-endian Python

I can't really see a pattern there...
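To keep further reports comparable, something like the following sketch could
be run next to the test case. It prints the three things that vary in the list:
platform, byte order and Unicode width. The function names unicode_width and
build_report are my own invention, and the mapping from sys.maxunicode to
"UCS-2"/"UCS-4" uses the two values mentioned earlier in this thread.

```python
import sys


def unicode_width(maxunicode=None):
    """Classify a Python build by its sys.maxunicode value."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:      # 65535: narrow build
        return 'UCS-2'
    if maxunicode == 0x10FFFF:    # 1114111: wide build
        return 'UCS-4'
    return 'unknown'


def build_report():
    """Return a one-line summary suitable for pasting into a bug report."""
    return '%s / %s endian / %s (maxunicode=%d)' % (
        sys.platform, sys.byteorder, unicode_width(), sys.maxunicode)


if __name__ == '__main__':
    sys.stdout.write(build_report() + '\n')
```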
Stefan