[lxml-dev] html entities and lxml.html.ElementSoup

Stefan Behnel stefan_ml at behnel.de
Tue Mar 18 22:25:53 CET 2008


Hi,

Roger Patterson wrote:
> I'm getting an interesting situation.  When using the very cool 
> ElementSoup add-on to lxml.html with certain source-html files that 
> already encode entities (e.g. &pound;), using the ElementSoup.parse() 
> messes up the entities.

It looks like the problem is not in parse() itself, but shows up in the
serialisation. What happens is that the entity references end up in the /text/
content of the tree, which is clearly wrong, as it leads to re-escaping of the
references on the way out.
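
To illustrate the re-escaping, here is a minimal sketch. It uses the standard
library's xml.etree.ElementTree and html.unescape() as stand-ins, which is an
assumption made for the sake of a self-contained example; it is not the
ElementSoup code itself, and the sample string is made up.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample input: text that still contains an entity reference.
raw_text = "Price: &pound;5"

# The problem case: the reference is stored literally as text content.
# On serialisation the "&" is escaped again, giving "&amp;pound;".
el = ET.Element("p")
el.text = raw_text
broken = ET.tostring(el, encoding="unicode")

# The intended behaviour: resolve the reference to its character first,
# so that the text content holds the real character.
el.text = html.unescape(raw_text)
fixed = ET.tostring(el, encoding="unicode")

print(broken)  # <p>Price: &amp;pound;5</p>
print(fixed)   # <p>Price: £5</p>
```

The point is only that text content must hold the decoded character, not the
reference, before the tree is serialised.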


> What I'm currently doing to solve this is first parsing it with 
> BeautifulSoup(html, convertEntities="html"), then calling 
> ElementSoup.convert_tree(soup).  This work-around works fine, but I 
> thought I'd bring it to your attention.

ElementSoup should do that for you. I fixed it on the trunk.

Stefan



