[lxml-dev] clean_html

Piet van Oostrum piet at cs.uu.nl
Wed Jun 24 15:04:57 CEST 2009


>>>>> Francesco <cattafra at hotmail.com> (F) wrote:

>F> Thank you very much for your answers!
>F> The html string is read from a file with:
>F> inputfile = "test.txt"
>F> # where test.txt contains "<title>My site &raquo; Homepage</title>"
>F> input = open(inputfile, "rb")
>F> html = input.read()

>F> How could I define the encoding for html?

Do you know the encoding beforehand? If so, you could open the file
with the codecs module, which decodes the bytes as it reads and gives
you a unicode string.

import codecs
input = codecs.open(inputfile, "r", "utf-8")
html = input.read()   # html is now a unicode string, not bytes
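
To spell that out a bit, here is a minimal sketch, assuming the file
really is UTF-8. The helper name read_html and the temporary test file
are made up for illustration, and the test file holds a literal
U+00BB character instead of &raquo; so that the difference between
bytes and characters shows up.

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Return the file contents as a unicode string (hypothetical helper)."""
    # codecs.open decodes the bytes while reading, using the given encoding.
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Build a small test file (assumed content, encoded as UTF-8).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "test.txt")
with open(path, "wb") as f:
    f.write(u"<title>My site \u00bb Homepage</title>".encode("utf-8"))

# "rb" gives the raw, undecoded bytes ...
with open(path, "rb") as f:
    raw = f.read()

# ... while codecs.open gives decoded text.
text = read_html(path)
```

With "rb" you get the encoded bytes and have to decode them yourself;
with codecs.open the decoding happens while reading. If the encoding
passed in does not match the file, reading will either raise a
UnicodeDecodeError or quietly give wrong characters, so this only
helps when you know the encoding.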

Why do you use "rb"?
-- 
Piet van Oostrum <piet at cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: piet at vanoostrum.org


