[lxml-dev] Problem handling &nbsp;

Brad Smith usernamenumber at gmail.com
Tue Jun 24 02:22:34 CEST 2008


Hello,

I am trying to handle some html data (the content of which I don't
have control over) with lxml. The problem is that whenever &nbsp; is
encountered lxml.etree.fromstring throws "XMLSyntaxError: Entity
'nbsp' not defined" and parsing fails. I have to admit I'm at a loss
for how to deal with this. I've looked up the DTDs for html and xhtml
and the entity isn't defined there, so where would it be or, since I
just want to store certain bits of the content, not render it, can I
make lxml less picky? The lxml.html.soupparser can handle entities,
but actually mis-interprets the html because it is "too" well-formed.
For example, the author does:

  <a name="foo"/> <b>foostuff</b>

Which soupparser interprets as

  <a name="foo"> <b>foostuff</b> </a>

...wrongly "correcting" the original markup.
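
One possible workaround, given only as a rough sketch: rewrite named HTML
entities such as &nbsp; as numeric character references before the text
reaches a strict XML parser, since numeric references need no DTD. The
helper name named_to_numeric and the sample string are hypothetical, and
the sketch uses only the Python standard library (html.entities). It
assumes the input contains no CDATA sections or comments in which an
entity-like string must be preserved literally.

```python
import re
from html.entities import name2codepoint

# Entities that every XML parser already understands; leave these alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &nbsp; or &eacute;
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace known HTML named entities with numeric character references."""
    def _sub(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Predefined XML entity or unknown name: keep the text unchanged.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(_sub, markup)


# Hypothetical sample resembling the markup described above.
sample = '<p><a name="foo"/> <b>foo&nbsp;stuff &amp; more</b></p>'
print(named_to_numeric(sample))
```

The converted string should then be acceptable to lxml.etree.fromstring
as far as the entity error goes, and because the XML parser is still
used, <a name="foo"/> stays an empty element. Any other well-formedness
problems in the input would still cause parsing to fail.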

Any help here would be greatly appreciated.

--Brad

