[lxml-dev] Parsing i18n UTF-8 files

Stefan Behnel stefan_ml at behnel.de
Wed Jul 23 08:58:52 CEST 2008


Matt Grove wrote:
> I'm attempting to parse several UTF-8 encoded xml files in several
> different languages - ranging from English to Japanese - but I've run
> into some trouble. I first want to parse the xml files, gather certain
> elements into a dictionary or some other data structure, and then write
> them out to other files. I know how to parse the files when they're in
> English, but I don't know how to read the Japanese text (for example)
> without encountering encoding exceptions. I've tried understanding the
> process through the tutorials but I'm still confused. Would someone like
> to try steering me on the correct path?

If they are correctly encoded XML files (UTF-8 or any other encoding that
the file declares), lxml will parse them correctly and give you unicode
strings for the content. You do not have to do anything else.
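
As a minimal sketch of the parse, collect and write-out workflow described
above: the sample document, the element names and the "lang" attribute are
made up for illustration, and the code assumes Python 3, where unicode
strings are plain str objects.

```python
from lxml import etree

# Hypothetical sample document: UTF-8 bytes with a matching XML declaration.
xml_bytes = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<messages>'
    '<msg lang="en">Hello</msg>'
    '<msg lang="ja">こんにちは</msg>'
    '</messages>'
).encode("utf-8")

# Parse from bytes so that lxml reads the encoding from the declaration.
root = etree.fromstring(xml_bytes)

# Collect the element text into a dictionary keyed by the (made-up)
# "lang" attribute. The values are unicode strings.
texts = {msg.get("lang"): msg.text for msg in root.iter("msg")}

# Serialize again, asking explicitly for UTF-8 encoded bytes.
out_bytes = etree.tostring(root, encoding="UTF-8", xml_declaration=True)
```

For files on disk, etree.parse() with a file name works the same way.
Letting lxml read the raw bytes itself, rather than decoding the file to
text first, keeps the encoding declaration in charge.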

If your code works for English ASCII text but not for non-ASCII content, I
would expect that your XML files are not really UTF-8 encoded or do not
specify their encoding correctly. Would you have an example of both your
code and a short XML snippet that fails with it?
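
One way to check the first suspicion (that a file is not really UTF-8) is
to try decoding its raw bytes. This is only a sketch using the standard
library: the helper name utf8_problem and the sample data are hypothetical,
and it does not check whether the XML declaration names the right encoding.

```python
def utf8_problem(data):
    """Return None if data is valid UTF-8, else describe the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return "invalid byte 0x%02x at offset %d" % (data[exc.start], exc.start)
    return None

# Hypothetical samples: the same snippet saved in two different encodings.
good = "<msg>日本語</msg>".encode("utf-8")
bad = "<msg>日本語</msg>".encode("shift_jis")

print(utf8_problem(good))
print(utf8_problem(bad))
```

In practice the bytes would come from open(filename, "rb").read(). A file
that fails this check needs either re-encoding or an XML declaration that
names its real encoding.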

Stefan


