[lxml-dev] problem about lxml encoding

Stefan Behnel stefan_ml at behnel.de
Sun May 31 08:10:16 CEST 2009


qhlonline wrote:
> Hi all, I have a question about lxml encoding and HTML encoding. What is
> the relationship between the encoding that the lxml parser uses and the
> encoding of an HTML file? If the HTML file's actual encoding is not the
> same as its <meta charset="some encoding">, which one will lxml's
> parser choose?

The HTML parser will use the encoding specified by the <meta> tag if it's
present, otherwise it will expect the document to be Latin-1. It will not
magically guess the right encoding if the <meta> tag happens to be
incorrect. If you somehow know the encoding better, you can override the
behaviour by passing the "encoding" option when instantiating the parser.
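
As a rough illustration of the override, here is a minimal sketch. The
sample document, the variable names and the choice of UTF-8 are made up for
the example; only the "encoding" keyword of the parser comes from the
explanation above. The input is really UTF-8, but its <meta> tag claims
Latin-1.

```python
from lxml import etree

# Hypothetical input: the bytes are UTF-8, but the <meta> tag says Latin-1.
html_bytes = (
    '<html><head>'
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1">'
    '</head><body><p>caf\u00e9</p></body></html>'
).encode("utf-8")

# Default parser: it follows the (incorrect) <meta> declaration, so the
# non-ASCII character is likely to come out garbled.
default_root = etree.fromstring(html_bytes, etree.HTMLParser())
default_text = default_root.findtext(".//p")

# Parser with an explicit encoding: this overrides the <meta> declaration.
utf8_parser = etree.HTMLParser(encoding="utf-8")
override_root = etree.fromstring(html_bytes, utf8_parser)
override_text = override_root.findtext(".//p")

print(repr(default_text))
print(repr(override_text))
```

The override only helps if you really do know the encoding from some other
source (an HTTP header, for instance), since the parser will not check it
against the content.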

Does that help?

Stefan

