[lxml-dev] Test Failures in lxml 1.3.2

Tres Seaver tseaver at palladion.com
Thu Jul 12 18:53:07 CEST 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Stefan Behnel wrote:
> Sidnei da Silva wrote:
>> I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it
>> has something to do with the libxml2 version?
>>
>> ======================================================================
>> FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "c:\Python24\lib\unittest.py", line 260, in run
>>     testMethod()
>>   File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
>>     unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
>>   File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
>>     raise self.failureException, \
>> AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'
> 
> Hmmm, didn't I take that test out? :)
> 
> Erik Swanson reported the same problem on OS-X. I guess that makes parsing
> HTML from a unicode string pretty much a Unix-only thing, though maybe it's
> actually rather a UCS4-only thing. No idea how to fix that (or what actually
> goes wrong here).
> 
> It seems like the problem only arises on UCS-2 systems. Could anyone with a
> UCS-2 Linux system check if this also fails there? UCS-2 can be detected
> with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
> heard rumours that Red Hat systems have UCS-2 builds. Ubuntu definitely doesn't.
> 
> The test case itself is pretty simple:
> 
>    >>> import lxml.etree as et
>    >>> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
>    >>> print repr(et.tounicode(html))
>    u'<html><body>\xc3\xa1\uf8d2</body></html>'
> 
> To see that the actual problem is the parser, not the serialiser, you can do:
> 
>    >>> print repr(et.tostring(html, 'utf-8'))
>    '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
> 
> Hoping for feedback and ideas,
> 
> Stefan
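
The mismatch in the quoted AssertionError can be reproduced without lxml. The sketch below only shows that the failing string equals the UTF-8 encoding of the expected text read back one byte per character (as Latin-1 would do). That this is what actually happens inside the parser is an assumption, not something confirmed in this thread, and misread_utf8_as_latin1 is a hypothetical helper name.

```python
# Sketch only: compares the two strings from the quoted AssertionError.
# misread_utf8_as_latin1 is a hypothetical helper, not part of lxml.

def misread_utf8_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes one byte per character."""
    return text.encode('utf-8').decode('latin-1')

expected = u'\xc3\xa1\uf8d2'                  # characters used in the test case
observed = u'\xc3\x83\xc2\xa1\xef\xa3\x92'    # characters in the failing result

# Same byte sequence that et.tostring(html, 'utf-8') shows in the quoted session.
print(repr(expected.encode('utf-8')))

# True if the failing result is the UTF-8 bytes misread as single characters.
print(misread_utf8_as_latin1(expected) == observed)
```

If the comparison holds, the serialised UTF-8 bytes in Stefan's second example are correct for the input, which is consistent with his point that the serialiser is not at fault.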

I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my
Ubuntu laptop::

 $ cat et_test.py
 import sys
 print sys.version
 print sys.maxunicode
 import lxml.etree as et
 html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
 print repr(et.tounicode(html))

 $ /path/to/ucs4/bin/python et_test.py
 2.4.3 (#2, Oct  6 2006, 07:52:30)
 [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
 1114111
 u'<html><body>\xc3\xa1\uf8d2</body></html>'
 [/home/tseaver]

 $ /path/to/ucs2/bin/python et_test.py
 2.4.4 (#1, Apr 19 2007, 16:14:47)
 [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
 65535
 u'<html><body>\xc3\xa1\uf8d2</body></html>'
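
The sys.maxunicode check used in et_test.py can also be wrapped in a small helper that needs no lxml. This is a minimal sketch: describe_unicode_build is a hypothetical name, not part of Python or lxml. Python 3.3 and later always report 1114111, so the distinction only matters on older interpreters such as the 2.4 builds above.

```python
import sys

def describe_unicode_build(maxunicode=sys.maxunicode):
    """Hypothetical helper: name the build type implied by sys.maxunicode."""
    if maxunicode == 0xFFFF:       # 65535: narrow (UCS-2) build
        return 'UCS-2'
    if maxunicode == 0x10FFFF:     # 1114111: wide (UCS-4) build
        return 'UCS-4'
    return 'unknown'

# Report the build type of the interpreter running this script.
print(describe_unicode_build())
```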


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGllxz+gerLs4ltQ4RAjZ/AJ9Pvf4WBX1cZywNmaePspGyFiD/TQCfTGIO
mPMPYd0dfCk/uCVyRJpmAu4=
=Y4mN
-----END PGP SIGNATURE-----


