[lxml-dev] non-ascii characters get garbled

js ebgssth at gmail.com
Thu Sep 20 15:59:16 CEST 2007


On 9/20/07, Stefan Behnel <stefan_ml at behnel.de> wrote:
> I added an "encoding" keyword argument to the parsers in the current trunk to
> override the document encoding (in case you happen to know better). So you
> could now parse the HTML document with
>
>     >>> utf8_html_parser = etree.HTMLParser(encoding="UTF-8")
>     >>> tree = etree.parse("http://the/file.html", utf8_html_parser)
>
> This will (very, very likely) give you an exception if the document is not
> UTF-8, so you can then fall back to another parser.

Thank you for your effort.
But I wonder how we can know which character set the document is
written in before GETting the page and checking the response header,
the meta tag, and the contents themselves.
We really need to GET the document first.

So I think calling urlopen(url).read().decode(somecharset) and
letting lxml parse the result is not only easier but also gives us more
flexibility. For example, with Python's urllib2 you can easily set the
User-Agent, add more handlers, etc.
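
The approach described above can be sketched roughly as follows. This is
only an illustration under some assumptions: it uses Python 3's
urllib.request in place of urllib2, it looks at the Content-Type response
header only (not the meta tag or the contents), and the helper names
charset_from_content_type, fetch_decoded and parse_decoded_html, as well
as the User-Agent string, are made up for the example.

```python
import re
from urllib.request import Request, urlopen


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    match = re.search(r"charset=[\"']?([\w.:-]+)", content_type or "", re.I)
    return match.group(1) if match else default


def fetch_decoded(url, user_agent="example-agent/0.1"):
    """GET `url` with a custom User-Agent and return the body as decoded text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        raw = response.read()
        content_type = response.headers.get("Content-Type")
    # Decode using the charset from the header; falls back to UTF-8 (an assumption).
    return raw.decode(charset_from_content_type(content_type))


def parse_decoded_html(text):
    """Hand already-decoded text to lxml's HTML parser."""
    # lxml is third-party; it is imported here so the helpers above need only
    # the standard library. Depending on the input, lxml may still raise
    # ValueError for decoded strings, which is the problem raised in this message.
    from lxml import etree
    return etree.fromstring(text, etree.HTMLParser())
```

Usage would then look like
parse_decoded_html(fetch_decoded("http://the/file.html")). Keeping the
fetch and decode steps separate from the parse step is what leaves room
for custom headers, extra handlers, or a different charset guess.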

Stefan, is it possible to change lxml so that it does not raise a
"ValueError" exception when a decoded string is passed to lxml.parse()?
If the answer is no, could you please give me some advice or your ideas
on this problem?

Thank you in advance.

