[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Frederik Elwert felwert at uni-bremen.de
Thu Nov 29 19:14:26 CET 2007


On Thursday, 29.11.2007, at 18:21 +0100, Artur Siekielski wrote:
> Yes, the same error occurs with h1. But I noticed that when I moved the
> meta tag with the charset declaration before <title>, all parsing went
> fine, including the h1 tag. So is it a libxml2 bug/limitation? (I tried
> the latest libxml2 from trunk and the behavior is the same.)
> 
> I'm parsing third-party HTML, so I must find some workaround. Is this a
> good solution: parse the HTML, change the order of the elements in
> <head>, serialize the document, and parse it again?

No, I think the better way would be to parse it, look up the encoding
(either via <tree>.docinfo.encoding or by locating the meta tag with
find()), and then reparse the unaltered document, this time passing the
detected value through the parser's "encoding" keyword. This is what
Stefan suggests:
http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
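
A minimal sketch of that two-pass approach, assuming the raw document is
available as a byte string. The helper names declared_charset() and
parse_html_twice() are made up for illustration and are not part of lxml;
the sketch uses iter() rather than find() so that both common forms of the
meta tag are covered.

```python
from io import BytesIO

from lxml import etree


def declared_charset(root):
    """Return the charset named in a <meta> tag, or None if there is none."""
    for meta in root.iter("meta"):
        # Short form: <meta charset="utf-8">
        if meta.get("charset"):
            return meta.get("charset").strip()
        # Long form: <meta http-equiv="Content-Type"
        #                  content="text/html; charset=utf-8">
        if (meta.get("http-equiv") or "").lower() == "content-type":
            for part in (meta.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    return value.strip().strip("\"'")
    return None


def parse_html_twice(data):
    """Parse raw HTML bytes, detect the encoding, then reparse with it."""
    # First pass: let the parser guess; we only want the encoding hint.
    first = etree.parse(BytesIO(data), etree.HTMLParser())
    encoding = declared_charset(first.getroot()) or first.docinfo.encoding
    if not encoding:
        # Nothing to go on, so keep the first result.
        return first
    # Second pass: same unaltered bytes, encoding now forced.
    parser = etree.HTMLParser(encoding=encoding)
    return etree.parse(BytesIO(data), parser)
```

Calling parse_html_twice(raw_bytes) returns an ElementTree. A page that
declares an encoding name the parser does not know will make HTMLParser()
raise an error, so real third-party input probably needs a fallback there.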

Cheers,
Frederik


