[lxml-dev] non-ascii characters get garbled
Stefan Behnel
stefan_ml at behnel.de
Thu Sep 20 14:23:22 CEST 2007
Stefan Behnel wrote:
> js wrote:
>> I downgraded libxml2 from 2.6.29_0 to 2.6.27_0
>> and re-ran the test script.
>> Surprise: now it all works as in the lxml doc!
>
> So the default encoding is no longer UTF-8 and instead it tries auto detection
> (which apparently fails for your page, so it's likely the page that is broken
> here).
I added an "encoding" keyword argument to the parsers in the current trunk to
override the document encoding (in case you happen to know better). So you
could now parse the HTML document with

  >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
  >>> tree = etree.parse("http://the/file.html", utf8_html_parser)

This will (very, very likely) give you an exception if the document is not
UTF-8, so you can then fall back to another parser.
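
A rough sketch of what such a fallback could look like, assuming the
"encoding" keyword described above is available. The helper name
parse_html_with_fallback and the candidate list are made up for
illustration and are not part of lxml. Whether a mismatched encoding
actually raises depends on the libxml2 version and on the parser's
recovery behaviour, so treat this as a starting point only.

```python
from io import BytesIO

from lxml import etree


def parse_html_with_fallback(data, encodings=("UTF-8", "ISO-8859-1")):
    """Try each candidate encoding in order; return (tree, encoding).

    Hypothetical helper, not part of lxml. 'data' is the raw document
    as bytes; 'encodings' is an assumed list of candidates.
    """
    last_error = None
    for encoding in encodings:
        # Override the document encoding for this attempt.
        parser = etree.HTMLParser(encoding=encoding)
        try:
            tree = etree.parse(BytesIO(data), parser)
        except etree.XMLSyntaxError as exc:
            # The parser rejected the input; try the next candidate.
            last_error = exc
            continue
        return tree, encoding
    if last_error is None:
        raise ValueError("no candidate encodings given")
    raise last_error


if __name__ == "__main__":
    raw = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")
    tree, used = parse_html_with_fallback(raw)
    print(used, tree.findtext(".//p"))
```

The candidates are tried in order, so put the encoding you believe in
first and a permissive one such as ISO-8859-1 last.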
Note that building the SVN trunk currently requires Cython 0.9.6.6, but the
third alpha shouldn't be /that/ far away.
Stefan