[lxml-dev] ElementSoup doesn't work as in doc/elementsoup.txt
Stefan Behnel
stefan_ml at behnel.de
Fri Sep 28 22:38:07 CEST 2007
Hi,
js wrote:
> I'm learning ElementSoup, but it doesn't work the way it's supposed to.
> I tried the sample code in doc/elementsoup.txt, but it failed with an error.
> ---------------------------------------------------------------------------------------------------------------------
>>>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
>>>> from lxml.html.ElementSoup import parse
>>>> from StringIO import StringIO
>>>> root = parse(StringIO(tag_soup))
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py",
> line 19, in parse
> root = _convert_tree(tree, makeelement)
> File "/opt/local/lib/python2.5/site-packages/lxml-2.0alpha3-py2.5-macosx-10.3-i386.egg/lxml/html/ElementSoup.py",
> line 40, in _convert_tree
> attrib=dict(beautiful_soup_tree.attrs))
> File "parser.pxi", line 702, in etree._BaseParser.makeelement
> File "apihelpers.pxi", line 102, in etree._makeElement
> File "apihelpers.pxi", line 798, in etree._tagValidOrRaise
> ValueError: Invalid tag name u'[document]'
> ---------------------------------------------------------------------------------------------------------------------
That's because of the tag name validation. "[document]" is the name that
BeautifulSoup gives to its root object, and it isn't a valid XML tag name, so
makeelement() rejects it. Sadly, the doctest above had not yet been included
in the test suite, so the failure went unnoticed.
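To illustrate why the name is rejected, here is a minimal sketch of a strict
check in the spirit of the XML "Name" production, restricted to ASCII for
brevity. The helper name looks_like_xml_name and the regular expression are
assumptions made for this illustration; they are not lxml's actual validation
code.

```python
import re

# Simplified, ASCII-only approximation of the XML "Name" production.
# Hypothetical helper for illustration only, not lxml's real check.
_ASCII_XML_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*\Z")


def looks_like_xml_name(name):
    """Return True if name passes the simplified ASCII-only name check."""
    return _ASCII_XML_NAME.match(name) is not None


# The square brackets in BeautifulSoup's root name fail the check,
# while ordinary HTML tag names pass.
for candidate in ("html", "body", "[document]"):
    print(candidate, looks_like_xml_name(candidate))
```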
However, the behaviour will change in alpha 4. From then on, lxml will reject
tag names only if they contain spaces or XML special characters. See this
recent thread, which also has a patch:
http://comments.gmane.org/gmane.comp.python.lxml.devel/3003?set_lines=100000
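As a rough sketch of the relaxed rule described above: a name is accepted
unless it contains whitespace or an XML special character. The function name
is_lenient_tag_name and the exact character set are assumptions for
illustration; the real check is the one in the patch from the thread.

```python
# Characters treated as "XML special" in this sketch. This set is an
# assumption; the set actually used by lxml may differ.
_SPECIAL_CHARS = set("<>&\"'/")


def is_lenient_tag_name(name):
    """Reject only empty names and names with whitespace or special characters."""
    if not name:
        return False
    return not any(ch.isspace() or ch in _SPECIAL_CHARS for ch in name)


# Under this relaxed rule, BeautifulSoup's root name is accepted,
# while names with spaces or markup characters are still refused.
for candidate in ("[document]", "my tag", "a<b"):
    print(repr(candidate), is_lenient_tag_name(candidate))
```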
Sorry for the inconvenience, but don't forget that this is alpha software.
Things might not always work as expected or might change unexpectedly
(although we try to keep these changes as rare as possible).
Stefan