[lxml-dev] HTML Meta Content-Type Tag not created as documentation states?

John Krukoff jkrukoff at ltgc.com
Fri Sep 26 01:33:11 CEST 2008


So, I was trying to figure out what happened to my meta tags when using
the lxml.html module, and saw the note in the documentation saying that
html.tostring will handle them like so:

> Note: if include_meta_content_type is true this will create a
>     ``<meta http-equiv="Content-Type" ...>`` tag in the head;
>     regardless of the value of include_meta_content_type any existing
>     ``<meta http-equiv="Content-Type" ...>`` tag will be removed
>     

However, that doesn't seem to actually be the case. It looks like
etree.tostring is never creating the meta tag as html.tostring appears
to expect, and instead the include_meta_content_type flag is simply
controlling whether any found meta tag is removed from the output (with
an re!).

Python 2.5.2 (r252:60911, Sep 22 2008, 12:08:38) 
[GCC 4.1.2 (Gentoo 4.1.2 p1.1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import html, etree
>>> h = html.fromstring( '<p>&copy;2008</p>' )
>>> html.tostring( h, encoding = 'us-ascii', include_meta_content_type = True )
'<p>&#169;2008</p>'

Not present there, so I figure maybe it's because it's not being treated
as a complete document?

>>> h = html.document_fromstring( '<p>&copy;2008</p>' )
>>> html.tostring( h, encoding = 'us-ascii', include_meta_content_type = True )
'<html><body><p>&#169;2008</p></body></html>'

Parsing as document doesn't create it either.

>>> html.tostring( h, encoding = 'iso-8859-1', include_meta_content_type = True )
'<html><body><p>\xa92008</p></body></html>'

Okay, maybe it's because I'm using the default encoding for HTML
(us-ascii)? Nope, trying something else doesn't cause it to exist
either.

>>> html.tostring( etree.ElementTree( h ), encoding = 'iso-8859-1', include_meta_content_type = True )
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body><p>\xa92008</p></body></html>'

Maybe wrapping in an ElementTree? Get a doctype declaration out of that,
but still no meta tag.

>>> from lxml import etree
>>> etree.__version__
u'2.1.2'

In further testing, it appeared that if a meta Content-Type tag was
already present in the input, it was passed through as is, as long as
include_meta_content_type was True.
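
The behaviour observed above can be sketched roughly as follows. This is
only an illustration using the standard re module: the function name
finish_serialization and the exact pattern are assumptions made up for
this example, not lxml's actual code. The point is that the flag never
adds a tag; it only decides whether an existing one survives.

```python
import re

# Hypothetical pattern, for illustration only; not copied from lxml.
_META_CONTENT_TYPE = re.compile(
    r'<meta\s+http-equiv="Content-Type"[^>]*>', re.IGNORECASE)


def finish_serialization(markup, include_meta_content_type=False):
    """Post-process already-serialized HTML the way the flag appears to work.

    Nothing is ever added; an existing tag is either kept or stripped.
    """
    if include_meta_content_type:
        return markup
    return _META_CONTENT_TYPE.sub('', markup)


with_meta = ('<html><head><meta http-equiv="Content-Type" '
             'content="text/html; charset=us-ascii"></head>'
             '<body><p>&#169;2008</p></body></html>')
without_meta = '<html><body><p>&#169;2008</p></body></html>'

# Existing tag is passed through untouched when the flag is True.
kept = finish_serialization(with_meta, include_meta_content_type=True)
# Existing tag is removed when the flag is False.
stripped = finish_serialization(with_meta, include_meta_content_type=False)
# No tag in the input means no tag in the output, whatever the flag says.
unchanged = finish_serialization(without_meta, include_meta_content_type=True)
```

If this sketch matches what html.tostring really does, then a document
with no meta tag in its input will never gain one from this step alone.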

The really weird part of this for me, though, is that I've set
include_meta_content_type on my much more complicated application
server, and it does in fact appear to be generating meta tags
automatically (or at least something in my XSLT-heavy processing chain
is). My testing was an attempt to duplicate that, and I was quite
surprised when I couldn't. I've tried this on boxes with both libxml2
2.6.26 (RHEL5) and 2.6.32, and didn't see a difference between them.

-- 
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company


