[lxml-dev] Test Failures in lxml 1.3.2

Stefan Behnel stefan_ml at behnel.de
Thu Jul 12 23:21:29 CEST 2007


Hi Tres,

thanks for testing.

Tres Seaver wrote:
> Stefan Behnel wrote:
>> It seems like the problem only arises on UCS-2 systems. Could anyone with a
>> UCS-2 Linux system check if this also fails there? UCS-2 can be detected
>> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
>> heard rumours that Red Hat systems have UCS-2 builds. Ubuntu definitely doesn't.
>>
>> The test case itself is pretty simple:
>>
>>    >>> import lxml.etree as et
>>    >>> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
>>    >>> print repr(et.tounicode(html))
>>    u'<html><body>\xc3\xa1\uf8d2</body></html>'
> 
>> To see that the actual problem is the parser, not the serialiser, you can do:
> 
>>    >>> print repr(et.tostring(html, 'utf-8'))
>>    '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
> 
> I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my
> Ubuntu laptop::
> 
>  $ cat et_test.py
>  import sys
>  print sys.version
>  print sys.maxunicode
>  import lxml.etree as et
>  html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
>  print repr(et.tounicode(html))
> 
>  $ /path/to/ucs4/bin/python et_test.py
>  2.4.3 (#2, Oct  6 2006, 07:52:30)
>  [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
>  1114111
>  u'<html><body>\xc3\xa1\uf8d2</body></html>'
>  [/home/tseaver]
> 
>  $ /path/to/ucs2/bin/python et_test.py
>  2.4.4 (#1, Apr 19 2007, 16:14:47)
>  [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
>  65535
>  u'<html><body>\xc3\xa1\uf8d2</body></html>'
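
For reference, the bytes a correct round trip should produce can be computed
in plain Python, without lxml. This is only a sketch to cross-check the
tostring() output quoted above; the names "text" and "expected" are made up
for the example, and the b'' literal assumes Python 2.6 or later.

```python
# Expected UTF-8 bytes for the body text of the test case, computed without lxml.
text = u'\xc3\xa1\uf8d2'
expected = text.encode('utf-8')
print(repr(expected))

# Each of the three characters maps to its own UTF-8 sequence:
#   U+00C3 -> C3 83,  U+00A1 -> C2 A1,  U+F8D2 -> EF A3 92
assert expected == b'\xc3\x83\xc2\xa1\xef\xa3\x92'
```

If the parser mangles the input, the serialised body will differ from these
bytes.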

Hmmm, so it works on both of your builds. That leaves me hoping that my test
case actually triggers the problem. Could we get feedback from someone with a
non-working setup here?

So far, we have the following cases:

- it fails on MacOS-X (Intel) with a UCS-2 little endian Python
- it fails on Windows with a UCS-2 little endian Python
- it works on Linux/Intel with UCS-2 little endian
- it works on Linux/Intel with UCS-4 little endian
- it works on Solaris/Sparc with UCS-2 big endian

I can't really see a pattern there...
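
To make further reports easier to compare, something like the following could
be run alongside the test case. It is only a sketch: the function names
unicode_build() and describe_setup() are invented for this example, and it
reports nothing beyond what sys and platform already expose.

```python
import platform
import sys


def unicode_build(maxunicode):
    """Name the unicode width implied by a sys.maxunicode value."""
    if maxunicode == 0xFFFF:      # 65535
        return "UCS-2"
    if maxunicode == 0x10FFFF:    # 1114111
        return "UCS-4"
    return "unknown"


def describe_setup():
    """Return one line in the same shape as the list of cases above."""
    return "%s/%s with %s %s endian, Python %s" % (
        platform.system(),            # e.g. Linux, Darwin, Windows
        platform.machine(),           # CPU architecture as reported by the OS
        unicode_build(sys.maxunicode),
        sys.byteorder,                # 'little' or 'big'
        platform.python_version(),
    )


print(describe_setup())
```

Posting that single line together with the repr() output of the test case
should be enough to add a setup to the list.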

Stefan

