[lxml-dev] A bit of oddness in the HTML parser
Jim Washington
jwashin at vt.edu
Tue Mar 27 13:06:50 CEST 2007
I'm seeing that the HTML parser is doing something undesirable.
If I have (note, the script tag is not closed):
>>> from lxml.etree import HTML, tostring
>>> s = '<div>d_txt<script src="blah">s_tail</div>'
and I run the HTML parser on that string, the closing </div> ends up
HTML-escaped inside the script's text:
>>> r = HTML(s)
>>> tostring(r)
'<html><body><div>d_txt<script
src="blah">s_tail&lt;/div&gt;</script></div></body></html>'
I'm guessing this is upstream behavior? I was hoping to get
'<html><body><div>d_txt<script
src="blah">s_tail</script></div></body></html>'
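A rough sketch of what may be going on, assuming the parser treats script
content as raw text that ends only at a literal </script (as HTML
tokenizers generally do). The helper name script_raw_text is made up for
illustration and is not part of lxml or libxml2:

```python
import re

# Hypothetical helper, not part of lxml or libxml2. It mimics the
# "raw text" rule: script content runs until a literal </script or,
# failing that, to the end of the input.
_SCRIPT_OPEN = re.compile(r'<script\b[^>]*>', re.IGNORECASE)
_SCRIPT_CLOSE = re.compile(r'</script\b', re.IGNORECASE)


def script_raw_text(html):
    """Return the text a raw-text tokenizer would assign to the first <script>."""
    opened = _SCRIPT_OPEN.search(html)
    if opened is None:
        return None  # no script element at all
    closed = _SCRIPT_CLOSE.search(html, opened.end())
    # With no closing tag, everything up to the end of input is script text.
    end = closed.start() if closed else len(html)
    return html[opened.end():end]


s = '<div>d_txt<script src="blah">s_tail</div>'
print(script_raw_text(s))  # s_tail</div>
```

Under that assumption the </div> is never seen as a tag at all, which
would explain why it lands in the script's text instead of closing the div.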
I think I can live with this behavior if nobody else thinks this is a
bug. Yes, I realize that tag-soup parsers are hard to do. :)
-Jim Washington