[lxml-dev] non-ascii characters get garbled

Stefan Behnel stefan_ml at behnel.de
Thu Sep 27 08:09:48 CEST 2007


Hi,

js wrote:
> You're right. When libxml2 finds a meta tag, it converts the encoding
> according to it.
> But on the real web, it doesn't always work.

I know, libxml2's HTML parser works pretty well, but it's not perfect.
Especially robust encoding detection is still an issue.


> -------------------------------------------------------------------------------------------------------------
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
> "http://www.w3.org/TR/html4/loose.dtd">
> <html>
> <head>
> 	<title>애플컴퓨터코리아</title>
> 	<meta http-equiv="content-type" content="text/html; charset=utf-8">
> 
> -------------------------------------------------------------------------------------------------------------

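Yes, libxml2 will switch encodings when it sees the <meta> tag, but it will
not start over to make sure the beginning is parsed correctly. Anything that
comes before the tag, such as the <title> in your example, has already been
read with the encoding that was in effect up to that point.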

There are a couple of things you can do. For example, you can parse the page
and then check the encoding through the docinfo property (after wrapping the
result Element with an ElementTree, if you use "fromstring"), or look for a
<meta> tag through find() or XPath. Then, reparse the document with the
"encoding" keyword set.

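A minimal sketch of the parse, check, reparse approach might look like this.
The function names are made up for illustration, and looking at the <meta>
tags before falling back to docinfo is just one possible order. An encoding
name that Python does not know will make HTMLParser() raise a LookupError,
which this sketch does not handle.

```python
import re
from lxml import etree


def declared_encoding(root):
    """Return the encoding a parsed HTML tree declares, or None."""
    # The attribute values of <meta> are plain ASCII, so they survive
    # even if the rest of the page was decoded with the wrong encoding.
    for content in root.xpath("//meta/@content"):
        match = re.search(r"charset=\s*([\w.:-]+)", content, re.IGNORECASE)
        if match:
            return match.group(1)
    # docinfo lives on the tree, so wrap the Element from fromstring().
    return etree.ElementTree(root).docinfo.encoding


def parse_html(data):
    """Parse once to find the declared encoding, then reparse with it."""
    root = etree.fromstring(data, etree.HTMLParser())
    encoding = declared_encoding(root)
    if encoding is None:
        return root  # nothing declared, keep the first result
    return etree.fromstring(data, etree.HTMLParser(encoding=encoding))
```
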
Or, you can install BeautifulSoup and use lxml.html.ElementSoup for parsing.
BeautifulSoup has an HTML parser that comes with brilliant encoding detection.
ElementSoup will build the lxml.html tree for you automatically.

Or, you can use a regexp to detect a <meta> tag yourself before parsing. The
function you use below would mainly check for something like

    <meta[^>]*charset=["']?([^"'>\s]+)
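
As a rough sketch, such a check could be done on the raw bytes before
parsing, using only the standard library. The pattern is along the lines of
the one above and accepts the charset value with or without quotes of its
own. The function name and the 4096 byte limit are assumptions for
illustration only.

```python
import re

# Matches charset=... inside a <meta> tag; the value may or may not be quoted.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset=["']?([^"'>\s]+)""", re.IGNORECASE
)


def sniff_meta_charset(data, limit=4096):
    """Return the charset named in a <meta> tag as lower-case text, or None.

    Assumption: the declaration sits within the first `limit` bytes.
    """
    match = META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()
```

The result, if there is one, can then be passed to the parser through the
"encoding" keyword.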

Stefan

> -------------------------------------------------------------------------------------------------------------
> from urllib2 import urlopen
> import chardet
> from lxml import etree
>
> res = urlopen(url)
> doc = res.read()
> # Precedence rules from
> # http://www.w3.org/International/tutorials/tutorial-char-enc/
> encoding = (
>     res.headers.getparam('charset') or
>     checkXMLDeclarationForEncoding(doc) or  # returns charset value in XML declaration
>     checkMetaForEncoding(doc) or            # returns charset value in meta tag
>     chardet.detect(doc).get('encoding')     # http://chardet.feedparser.org/
> )
> tree = etree.fromstring(doc, etree.HTMLParser(encoding=encoding))
> -------------------------------------------------------------------------------------------------------------


