[lxml-dev] clean_html

Stefan Behnel stefan_ml at behnel.de
Wed Jun 24 14:10:16 CEST 2009


Francesco wrote:
> I have written the following code:
>
>>>> from lxml.html.clean import clean_html
>>>> html = "»"

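Note that you are passing a byte string here. Without any encoding
information, the HTML parser of libxml2 will fall back to the Latin-1
encoding. If your terminal uses UTF-8, "»" is stored as the two bytes
0xC2 0xBB, and Latin-1 reads those as the two characters "Â" and "»".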
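
A minimal sketch of that fallback, using only the standard library rather
than lxml itself. It assumes the terminal encoded the typed character as
UTF-8 (the original mail does not say so) and uses Python 3 syntax; the
variable names are purely illustrative.

```python
# Assumption: the terminal encoded the typed character as UTF-8.
raw = "\u00bb".encode("UTF-8")          # the two bytes 0xC2 0xBB

# What a Latin-1 fallback sees: every byte becomes one character.
misread = raw.decode("ISO-8859-1")      # two characters: U+00C2 and U+00BB

# Serialising those two characters as UTF-8 yields four bytes.
reserialised = misread.encode("UTF-8")

print(len(raw), len(misread), len(reserialised))
print(ascii(misread))
```

The extra U+00C2 character is "Â", which would explain the output in the
question.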

>>>> print clean_html(html)
> <p>Â»</p>
>
> I am wondering why I have an extra character (Â) in my output.
> What should I do to avoid that?

That's just because the serialised HTML output is a byte string encoded
as UTF-8. If you want to print it, use .decode('UTF-8') to decode it to
unicode first. If you want to write it to a file (or send it over the
network), keeping it in UTF-8 is the right thing, though.
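
A rough sketch of the two cases, in Python 3 syntax and with the standard
library only. The `serialised` value is a hand-written stand-in for the
serialised output, not something produced by lxml here, and the file name
is made up for the example.

```python
import os
import tempfile

# Stand-in for a UTF-8 encoded, serialised HTML fragment (assumption).
serialised = "<p>\u00bb</p>".encode("UTF-8")

# For display: decode the bytes to a text (unicode) string first.
text = serialised.decode("UTF-8")
# ascii() keeps this demo independent of the terminal's own encoding.
print(ascii(text))

# For storage or transport: write the UTF-8 bytes unchanged (binary mode).
path = os.path.join(tempfile.mkdtemp(), "out.html")
with open(path, "wb") as f:
    f.write(serialised)
```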

Stefan


