[lxml-dev] html entities and lxml.html.ElementSoup
Roger Patterson
rogerpatterson at gmail.com
Tue Mar 18 20:03:44 CET 2008
Hi there,
I've run into an interesting situation. When using the very cool
ElementSoup add-on to lxml.html on source HTML files that already
contain encoded entities (e.g. &pound;), ElementSoup.parse() messes up
the entities.
I looked through the code and see that you are using the unescape()
function from ElementTree's ElementSoup. I think the problem is that
unescape() should only be called if the HTML was initially parsed by
BeautifulSoup with convertEntities="html" (as in ElementTree's
ElementSoup). Otherwise, pages whose entities have already been
unescaped can get them unescaped a second time.
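
A minimal sketch of the double-unescape effect described above. It uses
the standard library's html.unescape as a stand-in, not ElementSoup's
own unescape(), and the sample string is made up for illustration:

```python
from html import unescape

# Hypothetical page text that is meant to display the literal string
# "&pound;5", so the ampersand itself is escaped in the source.
source_text = "Price: &amp;pound;5"

# A single unescape pass gives the text the author intended to show.
once = unescape(source_text)

# A second pass treats the now-literal "&pound;" as an entity and
# replaces it with the pound sign, which changes the content.
twice = unescape(once)

print(repr(once))
print(repr(twice))
```

The first pass is the correct decoding; the second one is the kind of
extra unescaping that would corrupt text which was already decoded.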
My current workaround is to first parse the page with
BeautifulSoup(html, convertEntities="html") and then call
ElementSoup.convert_tree(soup). This works fine, but I thought I'd
bring it to your attention.
cheers
-Roger