[lxml-dev] Test Failures in lxml 1.3.2

Stefan Behnel stefan_ml at behnel.de
Thu Jul 12 09:31:57 CEST 2007


Sidnei da Silva wrote:
> I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it
> has something to do with the libxml2 version?
> 
> ======================================================================
> FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "c:\Python24\lib\unittest.py", line 260, in run
>     testMethod()
>   File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
>     unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
>   File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
>     raise self.failureException, \
> AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'

Hmmm, didn't I take that test out? :)

Erik Swanson reported the same problem on OS X. I guess that makes parsing
HTML from a unicode string pretty much a Unix-only thing, though maybe it's
actually rather a UCS-4-only thing. I have no idea how to fix that (or what
actually goes wrong here).

It seems like the problem only arises on UCS-2 systems. Could anyone with a
UCS-2 Linux system check whether this also fails there? A UCS-2 build can be
detected by "sys.maxunicode" being 65535; UCS-4 builds report 1114111. I've
heard rumours that Red Hat systems have UCS-2 builds. Ubuntu definitely doesn't.
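
If it helps with reporting, here is a minimal sketch of that check. The
function name "describe_unicode_build" is made up for this example and is not
part of lxml or Python. Note that Python 3.3 and later always report 1114111,
so the distinction only matters on older interpreters.

```python
import sys


def describe_unicode_build(maxunicode=sys.maxunicode):
    """Classify a Python build by the largest code point it stores natively.

    Hypothetical helper for illustration only; the parameter exists so the
    classification can be tried with either value without a second interpreter.
    """
    if maxunicode == 0xFFFF:      # 65535: narrow (UCS-2) build
        return "UCS-2"
    if maxunicode == 0x10FFFF:    # 1114111: wide (UCS-4) build
        return "UCS-4"
    return "unknown"


# Report the build of the interpreter running this script.
print(describe_unicode_build())
```

Including that one line of output with a test report should be enough to tell
whether the failure correlates with the build type.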

The test case itself is pretty simple:

   >>> import lxml.etree as et
   >>> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
   >>> print repr(et.tounicode(html))
   u'<html><body>\xc3\xa1\uf8d2</body></html>'

To see that the actual problem is the parser, not the serialiser, you can do:

   >>> print repr(et.tostring(html, 'utf-8'))
   '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
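
For comparing results against the quoted traceback: the failing value there
looks as if the UTF-8 bytes of the text had been read back one byte per
character (i.e. as Latin-1). The sketch below reproduces that symptom in plain
Python, without lxml. It only describes what the wrong value looks like and
makes no claim about where in the parser it happens; the helper name
"looks_double_decoded" is made up for this example.

```python
# Body text used in the test case above.
text = u'\xc3\xa1\uf8d2'

# Correct UTF-8 serialisation of that text.
correct_utf8 = text.encode('utf-8')

# Symptom from the traceback: each UTF-8 byte shows up as its own character.
mojibake = correct_utf8.decode('latin-1')


def looks_double_decoded(actual, expected):
    """Return True if `actual` equals the UTF-8 bytes of `expected` read as Latin-1.

    Hypothetical helper for illustration only.
    """
    return actual == expected.encode('utf-8').decode('latin-1')


print(repr(correct_utf8))
print(repr(mojibake))
print(looks_double_decoded(mojibake, text))
```

If et.tounicode() on an affected system returns something for which this
helper says True, that would match what Sidnei's traceback shows.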

Hoping for feedback and ideas,

Stefan

