[lxml-dev] non-ascii characters get garbled
js
ebgssth at gmail.com
Wed Sep 26 16:51:59 CEST 2007
Thank you for your reply.
On 9/21/07, Stefan Behnel <stefan_ml at behnel.de> wrote:
> > Thank you for your effort.
> > But I wonder how we can know which character set the document is
> > written in before
> > GETting the page and checking the response header, meta tag, and the contents themselves?
>
> The libxml2 HTML parser does that for you. If there is a <meta> Content-Type
> (which is not too hidden inside the tag soup), the parser will obey it. It
> doesn't know about the header, but for that, you can pass in the "encoding"
> keyword. Note that you usually don't have to read the file if all you want is
> the header. Just read the header, check if you have to override the input
> encoding and then pass the file into parse().
You're right. When libxml2 finds the meta tag, it converts the encoding
according to it.
But on the real web, it doesn't always work.
For example, look at the HTML of http://www.apple.com/kr/
-------------------------------------------------------------------------------------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>애플컴퓨터코리아</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
-------------------------------------------------------------------------------------------------------------
As you can see above, the meta tag (and, in fact, the Content-Type
HTTP header) declares that the character encoding used in this page is UTF-8.
libxml2, however, cannot detect this until it reads the meta tag,
because the title appears before the meta tag.
For this reason, libxml2 treats the title value as ISO 8859-1 (also
known as latin-1),
and the value gets garbled.
(I'm not quite sure, but in this case some text after the meta tag also seems to get garbled.)
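To make the garbling concrete, here is a minimal Python 3 sketch (illustration only; it does not involve libxml2 at all) of what happens when the UTF-8 bytes of that title are read as ISO 8859-1. The variable names are my own.

```python
# Illustration only: UTF-8 bytes misread as ISO 8859-1 (latin-1).
title = "애플컴퓨터코리아"        # the <title> text from the page above
raw = title.encode("utf-8")      # the bytes the server actually sends

garbled = raw.decode("latin-1")  # a Latin-1 reading: one character per byte
print(ascii(garbled))            # escaped form of the mojibake

# Each Hangul syllable takes 3 bytes in UTF-8, so the garbled text is 3x longer.
assert len(raw) == 3 * len(title)
assert len(garbled) == len(raw)

# Latin-1 maps every byte to exactly one character, so the damage can be
# undone as long as nothing else has modified the string in between.
repaired = garbled.encode("latin-1").decode("utf-8")
assert repaired == title
```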
To avoid this, I can, just as you said, pass the charset value, which I
can get from the HTTP header using a HEAD/GET request, in the "encoding" keyword.
That works for this page, but what can I do if the web server doesn't
return a Content-Type
HTTP header? I cannot rely on libxml2, for the reason I explained above.
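The header-based approach can be sketched roughly as follows, using only the Python 3 standard library. The function name charset_from_content_type is hypothetical; it returns None when the header is missing or carries no charset parameter, which is exactly the case that needs a fallback.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not value:
        return None  # the server sent no Content-Type header at all
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() lower-cases the name and returns None if absent.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
print(charset_from_content_type(None))
```

A non-None result is what would be passed as the "encoding" keyword of the HTML parser; a None result means the encoding has to come from somewhere else.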
The best approach I can think of is to GET the page and analyze it myself,
using a module similar to Perl's LWP.
So,
> > We really need to GET the doc first.
What do you think?
> > Stefan, is it possible to change lxml to avoid the "ValueError" exception
> > when passing
> > a decoded string to lxml.parse()?
> > If the answer is no, could you please give me some advice or your ideas
> > on this problem?
>
> Use lxml.fromstring() for parsing strings.
Oh, that's exactly what I was looking for.
Now I can do something like the following.
-------------------------------------------------------------------------------------------------------------
from urllib2 import urlopen
import chardet
from lxml import etree

res = urlopen(url)
doc = res.read()
# Precedence rules from http://www.w3.org/International/tutorials/tutorial-char-enc/
encoding = (
    res.headers.getparam('charset') or
    checkXMLDeclarationForEncoding(doc) or  # returns charset value in XML declaration
    checkMetaForEncoding(doc) or            # returns charset value in meta tag
    chardet.detect(doc).get('encoding')     # http://chardet.feedparser.org/
)
tree = etree.fromstring(doc, etree.HTMLParser(encoding=encoding))
-------------------------------------------------------------------------------------------------------------
I don't have checkXMLDeclarationForEncoding or checkMetaForEncoding yet, though.
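A rough idea of what those two helpers might look like, written for Python 3 with only the standard library. This is a sketch under assumptions, not a tested implementation: the regular expressions are crude and do not replace a real HTML parser, and the "limit" parameters and byte counts are arbitrary choices of mine.

```python
import re

# Hypothetical stand-ins for the two missing helpers. Both take the raw bytes
# of the page and return a charset name (str) or None.

# XML declaration at the very start, optionally preceded by a UTF-8 BOM.
_XML_DECL = re.compile(
    rb"""^(?:\xef\xbb\xbf)?\s*<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._:-]+)["']""")

# Any <meta ...> tag containing charset=..., e.g. the http-equiv form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def checkXMLDeclarationForEncoding(doc, limit=1024):
    """Look for encoding="..." in an XML declaration near the start of doc."""
    m = _XML_DECL.search(doc[:limit])
    return m.group(1).decode("ascii") if m else None


def checkMetaForEncoding(doc, limit=4096):
    """Look for charset=... inside a <meta> tag in the first bytes of doc."""
    m = _META_CHARSET.search(doc[:limit])
    return m.group(1).decode("ascii") if m else None


# Usage with a fragment like the apple.com/kr page quoted earlier.
page = ('<html>\n<head>\n<title>애플컴퓨터코리아</title>\n'
        '<meta http-equiv="content-type" content="text/html; charset=utf-8">'
        ).encode("utf-8")
print(checkXMLDeclarationForEncoding(page))
print(checkMetaForEncoding(page))
```

Scanning the raw bytes works here because the tag names and charset labels are plain ASCII, so they can be found before the real encoding is known.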