[lxml-dev] clean_html
Piet van Oostrum
piet at cs.uu.nl
Wed Jun 24 15:04:57 CEST 2009
>>>>> Francesco <cattafra at hotmail.com> (F) wrote:
>F> Thank you very much for your answers!
>F> The html string is read from a file with:
>F> inputfile = "test.txt"
>F> # where test.txt contains "<title>My site » Homepage</title>"
>F> input = open(inputfile, "rb")
>F> html = input.read()
>F> How could I define the encoding for html?
Do you know the encoding beforehand? If so, you could open the file with
codecs and get a unicode string straight from read():
import codecs
input = codecs.open(inputfile, "r", "utf-8")
html = input.read()  # html is now a unicode string, not raw bytes
Why do you use "rb"?
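
To make the difference concrete, here is a minimal sketch, assuming the
file really is UTF-8. The helper names read_html and read_html_binary and
the temporary file are made up for illustration only. The first helper
lets codecs do the decoding; the second keeps "rb" but then has to decode
the bytes explicitly.

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read the file through codecs, which decodes while reading."""
    with codecs.open(path, "r", encoding) as f:
        return f.read()


def read_html_binary(path, encoding="utf-8"):
    """Read raw bytes in binary mode, then decode them explicitly."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


# Build a small UTF-8 test file (assumed content, mirroring test.txt).
sample = u"<title>My site \u00bb Homepage</title>"
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "wb") as f:
    f.write(sample.encode("utf-8"))

html_text = read_html(path)
html_from_bytes = read_html_binary(path)
os.remove(path)
```

Both helpers should give the same unicode string. If the encoding given
does not match the file, the decode step will either raise
UnicodeDecodeError or silently produce wrong characters, so it only works
when you really know the encoding.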
--
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org