[lxml-dev] How to get HTML charset ?

David Shieh mykingheaven at gmail.com
Wed Mar 31 14:01:28 CEST 2010


2010/3/30 Ethan Jucovy <ethan.jucovy at gmail.com>

> On Sun, Mar 28, 2010 at 12:09 AM, David Shieh <mykingheaven at gmail.com>
> wrote:
> > Hi all,
> >
> > I have used lxml for a long time and it works fine for me.
> > But now I am confused about charsets. When I want to get the
> > original charset of an HTML file, I use the code below:
> >
> >         file_content = ''.join(
> >                 [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
> >             )
> >         html = lxml.html.fromstring(file_content)
> >         for i in html.xpath('head/meta'):
> >             print lxml.html.tostring(i)
> >
> > Surprisingly, there's no output of any <meta http-equiv="Content-Type" .. />
> > element. So, how can I know the original charset of this html?
>
> You need to pass the kwarg `include_meta_content_type=True` to
> `tostring`, or the <meta http-equiv="Content-Type" .. /> tag will
> always be stripped on the way out --
>
I actually got the charset using Sergio's approach, but I think your method
is great too. I will keep it for reference.
Thanks!

> >>> from lxml.html import fromstring, tostring
> >>> x=fromstring("""<html><head><meta http-equiv="Content-Type" content="text/html; charset=ASCII"></head></html>""")
> >>> x.xpath("head/meta")
> [<Element meta at 2004bb0>]
> >>> [tostring(u) for u in x.xpath("head/meta")]
> ['']
> >>> [tostring(u, include_meta_content_type=True) for u in x.xpath("head/meta")]
> ['<meta http-equiv="Content-Type" content="text/html; charset=ASCII">']
>
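For the archive: the session above shows that the meta element is still in the tree and is only dropped when it is serialized. So another option is to read its attributes directly and skip tostring altogether. Below is a rough sketch of that idea, written for Python 3. The helper names charset_from_content_type and declared_charset are made up for illustration and are not part of lxml. The parsing of the Content-Type value is delegated to the standard library's email.message.Message, which is my own choice here, not something lxml does.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None.

    Hypothetical helper: it reuses the stdlib email header parser rather
    than hand-splitting the string on ';' and '='.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def declared_charset(markup):
    """Return the charset declared in a <meta> element of `markup`, or None.

    Hypothetical helper. It reads element attributes, so it does not
    depend on how tostring() treats the Content-Type meta element.
    """
    import lxml.html  # third-party; imported lazily so the helper above works alone

    doc = lxml.html.fromstring(markup)
    for meta in doc.iter("meta"):
        # HTML5 short form: <meta charset="...">
        if meta.get("charset"):
            return meta.get("charset").lower()
        # Classic form: <meta http-equiv="Content-Type" content="...">
        if (meta.get("http-equiv") or "").lower() == "content-type":
            charset = charset_from_content_type(meta.get("content", ""))
            if charset:
                return charset
    return None
```

Calling declared_charset on the sample document from the session above should give 'ascii'. Note that this only reports what the page claims about itself; an HTTP Content-Type header, if the server sends one, may say something different.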



-- 
----------------------------------------------
Attitude determines everything !
----------------------------------------------