[lxml-dev] html entities and lxml.html.ElementSoup
Stefan Behnel
stefan_ml at behnel.de
Tue Mar 18 22:25:53 CET 2008
Hi,
Roger Patterson wrote:
> I'm getting an interesting situation. When using the very cool
> ElementSoup add-on to lxml.html with certain source-html files that
> already encode entities (e.g. &pound;), using the ElementSoup.parse()
> messes up the entities.
It looks like the problem is not in parse() itself but only shows up on
serialisation. What happens is that the entity references end up verbatim in
the /text/ content of the tree, which is clearly wrong, as the serialiser then
escapes the references a second time on the way out.
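
To illustrate the re-escaping, here is a minimal sketch that uses only the
Python standard library (xml.etree.ElementTree and html.unescape) instead of
lxml or ElementSoup. The sample string and the variable names are made up for
the example; it shows the general effect, not the actual ElementSoup code path.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical input: text that still contains an unresolved entity reference.
raw_text = "&pound;5"

# Case 1: the reference is stored verbatim as text content.
# The serialiser must escape the "&", so the reference gets escaped again.
broken = ET.Element("p")
broken.text = raw_text
escaped_twice = ET.tostring(broken, encoding="unicode")

# Case 2: the reference is resolved to a character before it goes into the tree.
# The text now holds U+00A3 and nothing needs re-escaping.
fixed = ET.Element("p")
fixed.text = html.unescape(raw_text)
decoded_once = ET.tostring(fixed, encoding="unicode")

print(escaped_twice)
print(decoded_once)
```

The first result contains "&amp;pound;", which is the kind of mangled output
described above; the second contains the pound sign itself.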
> What I'm currently doing to solve this is first parsing it with
> BeautifulSoup(html, convertEntities="html"), then calling
> ElementSoup.convert_tree(soup). This work-around works fine, but I
> thought I'd bring it to your attention.
ElementSoup should do that for you. I fixed it on the trunk.
Stefan