[lxml-dev] A bit of oddness in the HTML parser

Jim Washington jwashin at vt.edu
Tue Mar 27 13:06:50 CEST 2007


I'm seeing the HTML parser do something undesirable with an unclosed
script tag.

If I have (note, the script tag is not closed):

>>> from lxml.etree import HTML, tostring
>>> s = '<div>d_txt<script src="blah">s_tail</div>'

and I use the HTML parser on that string, the closing </div> ends up
HTML-escaped inside the script's text.

>>> r = HTML(s)
>>> tostring(r)
'<html><body><div>d_txt<script src="blah">s_tail&lt;/div&gt;</script></div></body></html>'

I'm guessing this is upstream (libxml2) behavior? I was hoping to get:

'<html><body><div>d_txt<script src="blah">s_tail</script></div></body></html>'

I think I can live with this behavior if nobody else thinks this is a
bug.  Yes, I realize that tag-soup parsers are hard to do. :)
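
For anyone who wants to reproduce this, here is a minimal sketch. It
assumes lxml is installed. The helper name script_text is hypothetical
and used only for illustration. No output is shown because it may
differ between lxml/libxml2 versions. The second tostring call uses
method="html" only as a comparison against the default serialization.

```python
# Minimal reproduction sketch; assumes lxml is installed.
from lxml.etree import HTML, tostring


def script_text(markup):
    """Hypothetical helper: parse markup with the HTML parser and
    return the text of the first <script> element, or None."""
    root = HTML(markup)
    script = root.find(".//script")
    return None if script is None else script.text


# The script tag is deliberately left unclosed.
s = '<div>d_txt<script src="blah">s_tail</div>'

# Text the parser stored inside the script element.
print(repr(script_text(s)))

# Default serialization, as used in the session above.
print(tostring(HTML(s), encoding="unicode"))

# HTML-style serialization of the same tree, for comparison.
print(tostring(HTML(s), method="html", encoding="unicode"))
```

Looking at the script element's text directly, rather than only at the
serialized string, should help tell apart what the parser stored from
what the serializer escaped.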

-Jim Washington


