[lxml-dev] non-ascii characters get garbled

Stefan Behnel stefan_ml at behnel.de
Thu Sep 20 14:23:22 CEST 2007


Stefan Behnel wrote:
> js wrote:
>> I downgraded libxml2 from 2.6.29_0 to 2.6.27_0
>> and re-ran the test script.
>> Surprise, now it all works as in the lxml doc!
> 
> So the default encoding is no longer UTF-8 and instead it tries auto detection
> (which apparently fails for your page, so it's likely the page that is broken
> here).

I added an "encoding" keyword argument to the parsers in the current trunk to
override the document encoding (in case you happen to know better). So you
can now parse the HTML document with

    >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
    >>> tree = etree.parse("http://the/file.html", utf8_html_parser)

This will (very, very likely) give you an exception if the document is not
UTF-8, so you can then fall back to another parser.
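
A minimal sketch of that fallback, assuming the page has already been read
into a byte string. The helper name parse_html_with_fallback, the choice of
ISO-8859-1 as the second encoding, and the recover=False flag are assumptions
for illustration, not part of the change described above. recover=False is
meant to make the parser report errors instead of silently recovering; whether
a given libxml2 version actually raises on invalid byte sequences may vary.

```python
import io

from lxml import etree


def parse_html_with_fallback(data, encodings=("UTF-8", "ISO-8859-1")):
    """Parse HTML bytes, trying each encoding in order.

    Returns (tree, encoding) for the first parse that succeeds.
    Hypothetical helper; the name and defaults are illustrative only.
    """
    last_error = None
    for encoding in encodings:
        # Override the document encoding with the "encoding" keyword.
        # recover=False (an assumption) asks the parser to raise on errors.
        parser = etree.HTMLParser(encoding=encoding, recover=False)
        try:
            # A fresh BytesIO per attempt, so each parser starts at byte 0.
            tree = etree.parse(io.BytesIO(data), parser)
        except etree.XMLSyntaxError as exc:
            last_error = exc
            continue
        return tree, encoding
    raise ValueError("no encoding worked: %r" % (encodings,)) from last_error
```

Called as tree, used = parse_html_with_fallback(page_bytes), this also returns
the encoding that was accepted, so the caller can log which parser won.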

Note that building the SVN trunk currently requires Cython 0.9.6.6, but the
third alpha shouldn't be /that/ far away.

Stefan

