[lxml-dev] html entities and lxml.html.ElementSoup

Roger Patterson rogerpatterson at gmail.com
Tue Mar 18 20:03:44 CET 2008


Hi there,
I'm running into an interesting situation.  When I use the very cool 
ElementSoup add-on to lxml.html on certain source HTML files that 
already encode entities (e.g. £), ElementSoup.parse() messes up the 
entities.

I looked through the code and see that you are using the unescape() 
function from ElementTree's ElementSoup.  Unfortunately, I think 
unescape() should only be called if the HTML was initially parsed by 
BeautifulSoup with convertEntities="html" (as in ElementTree's 
ElementSoup).  Otherwise, pages whose entities have already been 
unescaped once can sometimes get unescaped a second time.
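
To illustrate the kind of double unescaping I mean, here is a minimal 
sketch.  It uses html.unescape from Python's standard library purely as 
a stand-in for ElementSoup's unescape() (an assumption for illustration, 
not the actual lxml code), and the sample string is made up:

```python
from html import unescape  # stand-in for ElementSoup's unescape()

# Hypothetical page text: the author wants the literal characters
# "&pound;5" to be displayed, so the ampersand itself is escaped.
source = "Price: &amp;pound;5"

# One unescape pass gives the intended text.
once = unescape(source)

# A second pass treats the leftover "&pound;" as an entity and
# converts it too, which changes the meaning of the text.
twice = unescape(once)

print(repr(once))
print(repr(twice))
```

The first pass is the one the page author intended; the second pass is 
the unwanted extra conversion.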

My current workaround is to parse the HTML first with 
BeautifulSoup(html, convertEntities="html") and then call 
ElementSoup.convert_tree(soup).  This works fine, but I thought I'd 
bring it to your attention.
cheers
-Roger

