[lxml-dev] ElementSoup doesn't work as in doc/elementsoup.txt

Stefan Behnel stefan_ml at behnel.de
Fri Sep 28 22:38:07 CEST 2007


Hi,

js wrote:
> I'm learning ElementSoup, but it doesn't work the way it's supposed to.
> I tried the sample code in doc/elementsoup.txt, but it failed with an error.
> ---------------------------------------------------------------------------------------------------------------------
> >>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
> >>> from lxml.html.ElementSoup import parse
> >>> from StringIO import StringIO
> >>> root = parse(StringIO(tag_soup))
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 19, in parse
>     root = _convert_tree(tree, makeelement)
>   File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py", line 40, in _convert_tree
>     attrib=dict(beautiful_soup_tree.attrs))
>   File "parser.pxi", line 702, in etree._BaseParser.makeelement
>   File "apihelpers.pxi", line 102, in etree._makeElement
>   File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
> ValueError: Invalid tag name u'[document]'
> ---------------------------------------------------------------------------------------------------------------------

That's because of the tag name validation. "[document]" is the name that
BeautifulSoup gives to its root node, and it isn't a valid tag name, so
makeelement() rejects it. Sadly, the doctest above was not yet included in
the test suite.
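
To illustrate why the name is rejected, here is a rough sketch of a strict
name check. It is not lxml's actual code: the regular expression only
approximates the ASCII subset of the XML name rules, and looks_like_xml_name
is a made-up helper.

```python
import re

# Assumption: ASCII-only approximation of an XML name without a prefix
# (a letter or underscore, then letters, digits, '.', '-' or '_').
# The real XML rules also allow many non-ASCII characters.
_ASCII_XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def looks_like_xml_name(tag):
    """Return True if tag passes the approximate strict check (hypothetical helper)."""
    return _ASCII_XML_NAME.match(tag) is not None


# Ordinary HTML tag names pass; BeautifulSoup's root name does not.
for name in ("title", "body", "[document]"):
    print(name, looks_like_xml_name(name))
```

The square brackets are what make "[document]" fail a check of this kind.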

However, this behaviour will change in alpha 4: lxml will no longer reject
tag names unless they contain spaces or XML special characters. See this
recent thread, which also has a patch:

http://comments.gmane.org/gmane.comp.python.lxml.devel/3003?set_lines=100000
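
As a rough sketch of what such a relaxed rule could look like (this is not
the code from the patch; the helper name, the exact set of rejected
characters and the handling of empty names are assumptions made for
illustration):

```python
# Assumption: which characters count as "XML special characters" is a guess;
# the patch in the thread linked above is the authoritative source.
_REJECTED = set('<>&"\'/')


def acceptable_tag_name(tag):
    """Hypothetical lenient check: non-empty, no whitespace, no special characters."""
    if not tag:
        return False
    return not any(ch.isspace() or ch in _REJECTED for ch in tag)


# "[document]" passes a check like this; names with spaces or markup do not.
for name in ("[document]", "my tag", "a<b"):
    print(name, acceptable_tag_name(name))
```

Under a rule like this, the "[document]" name from the traceback above would
be accepted.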

Sorry for the inconvenience, but don't forget that this is alpha software.
Things might not always work as expected or might change unexpectedly
(although we try to keep these changes as rare as possible).

Stefan

