[lxml-dev] non-ascii characters get garbled
Stefan Behnel
stefan_ml at behnel.de
Thu Sep 20 17:38:32 CEST 2007
js wrote:
> On 9/20/07, Stefan Behnel <stefan_ml at behnel.de> wrote:
>> I added an "encoding" keyword argument to the parsers in the current trunk to
>> override the document encoding (in case you happen to know better). So you
>> could now parse the HTML document with
>>
>> >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
>> >>> tree = etree.parse("http://the/file.html", utf8_html_parser)
>>
>> This will (very, very likely) give you an exception if the document is not
>> UTF-8, so you can then fall back to another parser.
>
> Thank you for your effort.
> but I wonder how can we know in what character set the document is written
> before GETing the page and check the response header, meta tag and contents
> itself?
The libxml2 HTML parser does that for you. If there is a <meta> Content-Type
(which is not too hidden inside the tag soup), the parser will obey it. It
doesn't know about the HTTP header, but for that, you can pass in the "encoding"
keyword. Note that you usually don't have to read the file if all you want is
the header. Just read the header, check if you have to override the input
encoding and then pass the file into parse().
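
As a rough sketch of that workflow (Python 3 standard library plus lxml; the
helper names charset_from_header and parse_html are made up for illustration,
and the sample header and body stand in for a real HTTP response):

```python
from email.message import Message
from io import BytesIO

from lxml import etree


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def parse_html(fileobj, content_type):
    """Parse HTML from a file-like object, letting the header override detection."""
    # encoding=None leaves detection (e.g. a <meta> tag) to the libxml2 parser.
    parser = etree.HTMLParser(encoding=charset_from_header(content_type))
    return etree.parse(fileobj, parser)


# Stand-in for a response body: UTF-8 bytes without any <meta> hint.
body = BytesIO("<html><body><p>caf\u00e9</p></body></html>".encode("utf-8"))
tree = parse_html(body, "text/html; charset=UTF-8")
print(tree.findtext(".//p"))
```

With a urllib response in Python 3, response.headers.get_content_charset()
should give the same value without the helper.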
> We really need to GET the doc first.
> So I think urlopen(url).read().decode(somecharset) and
This will not work if the document contains an encoding hint, such as a <meta>
tag in HTML or an XML declaration, as the parser will switch encodings when it
sees it. Thus the "encoding" keyword for detection override.
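
To illustrate with the XML case: lxml, as far as I know, refuses a decoded
string that still carries an encoding declaration and raises ValueError, while
the undecoded bytes parse fine. A minimal sketch (the sample document and the
helper name text_or_error are illustrative assumptions):

```python
from lxml import etree

xml_bytes = '<?xml version="1.0" encoding="UTF-8"?><doc>caf\u00e9</doc>'.encode("utf-8")


def text_or_error(data):
    """Return the root element's text, or the name of the error raised."""
    try:
        return etree.fromstring(data).text
    except ValueError as exc:
        return type(exc).__name__


print(text_or_error(xml_bytes))                  # bytes: the declaration is honoured
print(text_or_error(xml_bytes.decode("utf-8")))  # str with declaration: rejected
```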
> letting lxml parse it is not only easier but also giving us more flexibility.
> For example, by using python's urllib2, you can easily set User-Agent,
> adding more handlers, etc.
You can pass the result of urlopen(url) into parse() instead of reading the
string first.
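
A possible shape for this, written for Python 3 where urllib2 has become
urllib.request (the function name fetch_and_parse and the User-Agent string are
placeholders, not anything lxml provides):

```python
from urllib.request import Request, urlopen

from lxml import etree


def fetch_and_parse(url, user_agent="example-client/0.1", encoding=None):
    """Open url with a custom User-Agent and hand the response to lxml."""
    request = Request(url, headers={"User-Agent": user_agent})
    # encoding=None keeps the parser's own detection; pass a name to override it.
    parser = etree.HTMLParser(encoding=encoding)
    with urlopen(request) as response:
        # The response is file-like, so parse() reads the raw bytes itself.
        return etree.parse(response, parser)
```

This keeps the flexibility of urllib (headers, handlers) while the parser still
sees undecoded bytes.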
> Stefan, Is it possible to change lxml to avoid "ValueError" exception when
> passing decoded string to lxml.parse()?
> If the answer is no, could you please give me some advice or your idea on
> this problem?
Use etree.fromstring() for parsing strings; parse() expects a file name, URL or
file-like object.
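
A small sketch of parsing data that has already been read, assuming the bytes
came from something like urlopen(url).read() and are known to be UTF-8 (the
sample document is made up):

```python
from lxml import etree

# Stand-in for urlopen(url).read(): undecoded UTF-8 bytes.
html_bytes = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

# fromstring() takes the data itself; the parser (with its encoding
# override) is passed as the second argument.
root = etree.fromstring(html_bytes, etree.HTMLParser(encoding="utf-8"))
print(root.tag, root.findtext(".//p"))
```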
Stefan