[lxml-dev] clean_html
Stefan Behnel
stefan_ml at behnel.de
Wed Jun 24 15:17:41 CEST 2009
Francesco wrote:
> Thank you very much for your answers!
>
> The html string is read from a file with:
> inputfile = "test.txt"
> # where test.txt contains "<title>My site » Homepage</title>"
> input = open(inputfile, "rb")
> html = input.read()
>
> How could I define the encoding for html?
*Iff* you know it beforehand, you can create a new parser and use the
parse() function:
import lxml.html
# inputfile is the file name from your snippet, e.g. "test.txt"
parser = lxml.html.HTMLParser(encoding='UTF-8')
html_tree = lxml.html.parse(inputfile, parser=parser)
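As a self-contained illustration of the above, here is a minimal sketch that writes the title from the question to a temporary file as UTF-8 bytes and parses it with an explicit encoding. The temporary file and the sample data are assumptions made for the example; only HTMLParser(encoding=...) and parse() come from the advice above.

```python
import os
import tempfile

import lxml.html

# Hypothetical sample data: the title from the question, stored as UTF-8 bytes.
data = "<title>My site \u00bb Homepage</title>".encode("utf-8")

# Write the bytes to a temporary file that stands in for test.txt.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
    f.write(data)
    path = f.name

try:
    # Tell the parser the encoding up front instead of letting it guess.
    parser = lxml.html.HTMLParser(encoding="UTF-8")
    tree = lxml.html.parse(path, parser=parser)
    # The HTML parser wraps the fragment in <html><head>, so search for the title.
    title = tree.findtext(".//title")
finally:
    os.remove(path)
```

If the file were really in another encoding, the same pattern should apply with that encoding name passed to HTMLParser.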
Most functions in lxml.html can deal with both tree and string input, and
will return the type you passed in. However, working with trees allows you
to control the parsing and serialisation more accurately, and avoids
redundant parse-serialise cycles if you want to run more than one
operation on the data.
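To make the difference concrete, here is a rough sketch using link rewriting rather than clean_html. The snippet and the base URL are made up for the example. The first call takes a string and hands back a string, parsing and serialising internally; the second part parses once, runs two operations on the tree, and serialises once at the end.

```python
import lxml.html

# Hypothetical input and base URL, chosen only for illustration.
snippet = '<p><a href="page.html">Home</a></p>'
base = "http://example.com/"

# String in, string out: parse and serialise both happen inside the call.
as_string = lxml.html.make_links_absolute(snippet, base)

# Tree in: parse once, run several operations, serialise once.
root = lxml.html.fromstring(snippet)
root.make_links_absolute(base)                         # modifies the tree in place
links = [link for _, _, link, _ in root.iterlinks()]   # second operation, no re-parse
serialised = lxml.html.tostring(root, encoding="unicode")
```

With string input, every additional operation would repeat the parse and serialise steps; with the tree, they happen once each.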
Stefan