[lxml-dev] non-ascii characters get garbled

Stefan Behnel stefan_ml at behnel.de
Wed Sep 19 17:13:40 CEST 2007


js wrote:
> I downgraded libxml2 from 2.6.29_0 to 2.6.27_0
> and re-ran the test script.
> Surprise, now it all works as in the lxml doc!
> 
> It seems newer libxml2 has some problem converting charsets.
> (2.6.28_1 doesn't work either.)

I think it's this change in the function htmlCtxtReset() of HTMLparser.c in
libxml2 2.6.28:

@@ -5806,7 +5850,7 @@
     ctxt->inSubset = 0;
     ctxt->errNo = XML_ERR_OK;
     ctxt->depth = 0;
-    ctxt->charset = XML_CHAR_ENCODING_UTF8;
+    ctxt->charset = XML_CHAR_ENCODING_NONE;
     ctxt->catalogs = NULL;
     xmlInitNodeInfoSeq(&ctxt->node_seq);

So the parser context no longer defaults to UTF-8, and libxml2 tries to detect
the encoding instead. That detection apparently fails for your page, so it's
likely the page itself that is broken here.
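
If you already know the real encoding of a page, you don't have to rely on the
detection at all. A minimal sketch, assuming an lxml version whose HTMLParser
accepts an "encoding" keyword; the sample markup is made up for illustration
and is not your page:

```python
from lxml import etree

# Made-up sample: UTF-8 bytes without any <meta> charset declaration,
# so the parser gets no hint from the document itself.
html_bytes = "<html><body><p>Grüße, 世界</p></body></html>".encode("utf-8")

# Tell the parser the encoding explicitly instead of relying on
# libxml2's default or its auto detection.
parser = etree.HTMLParser(encoding="utf-8")
root = etree.fromstring(html_bytes, parser)

text = root.findtext(".//p")
print(text)
```

This only helps when the encoding is known up front, e.g. from the HTTP
headers; it does not fix a page that declares the wrong encoding.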

The problem is that you can't really defend UTF-8 (or any other encoding) as
a default, as I don't think there is a clear winner among the encodings of
all the web pages out there. UTF-8 in particular will fail for many pages,
while something like ISO-8859-1 just lets the content pass, so that you can
still fix it by hand (if you feel like it). So libxml2 is actually right in
not defaulting to UTF-8.
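
To illustrate the difference, here is a small sketch using plain Python codecs
(not libxml2's own decoder, which reports errors differently). The sample
string and the helper try_decode() are made up for this example:

```python
# Made-up sample text, encoded both ways.
utf8_bytes = "café".encode("utf-8")          # b'caf\xc3\xa9'
latin1_bytes = "café".encode("iso-8859-1")   # b'caf\xe9'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Assuming UTF-8 fails outright on ISO-8859-1 input: 0xE9 starts a
# multi-byte sequence that never gets its continuation bytes.
strict_result = try_decode(latin1_bytes, "utf-8")

# ISO-8859-1 maps every byte to some character, so UTF-8 input passes
# through, just with the non-ASCII characters garbled.
garbled = try_decode(utf8_bytes, "iso-8859-1")

# The "fix it by hand" step: turn the characters back into the original
# bytes and decode them with the encoding the page really used.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(strict_result, garbled, repaired)
```

The round trip in the last step works because ISO-8859-1 decoding loses no
bytes, which is exactly why it is the more forgiving fallback.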

Just in case you can't accept that, have you tried installing BeautifulSoup
and parsing with lxml.html.ElementSoup? BeautifulSoup has pretty good encoding
detection support.

Stefan


