[lxml-dev] parser target exception recovery bug?

D.Hendriks (Dennis) D.Hendriks at tue.nl
Tue Jun 16 16:20:07 CEST 2009


Hello all,

Using lxml 2.2 with a custom parser target (tree builder), I've run into 
a problem when the parser target raises an exception. In that case, 
parsing continues: the 'data' callback is still invoked, although 'start' 
and 'end' are not. I created the XMLParser with recover=False.

Using the following code:

     import sys
     from lxml import etree

     # Parser target without exceptions.
     class MyTreeBuilder1(object):
         def close(self):
             print 'close'

         def start(self, tag, attrs):
             print 'start', tag, attrs

         def data(self, data):
             if len(data.strip()) > 0:
                 print 'data: data=', repr(data)

         def end(self, tag):
             print 'end', tag

     # Parser target with exceptions.
     class MyTreeBuilder2(MyTreeBuilder1):
         def close(self):
             print 'close'

         def start(self, tag, attrs):
             print 'start', tag, attrs

         def data(self, data):
             if len(data.strip()) > 0:
                 print 'data: data=', repr(data)

         def end(self, tag):
             print 'end', tag
             if tag=='b':
                 print 'ERROR'
                 raise ValueError('error')

     xml_data='''<a>
         <b>test</b>
         <d>test2</d>
         <d>test2</d>
     </a>'''

     # Successful parsing.
     print '---'
     builder = MyTreeBuilder1()
     parser = etree.XMLParser(target=builder, recover=False)
     rslt = etree.fromstring(xml_data, parser)

     # Unsuccessful parsing.
     print '---'
     builder = MyTreeBuilder2()
     parser = etree.XMLParser(target=builder, recover=False)
     rslt = etree.fromstring(xml_data, parser)

I get this output:

     ---
     start a {}
     start b {}
     data: data= u'test'
     end b
     start d {}
     data: data= u'test2'
     end d
     start d {}
     data: data= u'test2'
     end d
     end a
     close
     ---
     start a {}
     start b {}
     data: data= u'test'
     end b
     ERROR
     data: data= u'test2'
     data: data= u'test2'
     Traceback (most recent call last):
       File "lxml_parser_target_bug.py", line 49, in ?
         rslt = etree.fromstring(xml_data, parser)
       File "lxml.etree.pyx", line 2534, in lxml.etree.fromstring (src/lxml/lxml.etree.c:51135)
       File "parser.pxi", line 1523, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:76176)
       File "parser.pxi", line 1402, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:74927)
       File "parser.pxi", line 928, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:71707)
       File "parsertarget.pxi", line 135, in lxml.etree._TargetParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:82586)
       File "lxml.etree.pyx", line 230, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:6813)
       File "saxparser.pxi", line 227, in lxml.etree._handleSaxEnd (src/lxml/lxml.etree.c:78230)
       File "parsertarget.pxi", line 78, in lxml.etree._PythonSaxParserTarget._handleSaxEnd (src/lxml/lxml.etree.c:81918)
       File "lxml_parser_target_bug.py", line 33, in end
         raise ValueError('error')
     ValueError: error

The first output (between the two '---' lines) is as expected, since it 
comes from the parser target that raises no exceptions. The second output 
(after the second '---') is not what I expected. 'ERROR' marks the point 
where the exception is raised. After that, the parser target still 
receives two 'data' events, so parsing clearly continued. Also, 'close' 
is never called. Only after the entire input has been parsed is the 
exception finally re-raised.

Two questions:
  - Is the continued parsing (the 'data' function calls) a bug?
  - Is it a bug that 'close' is never called?
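
In the meantime, a possible stopgap is to wrap the target so that it 
ignores every event arriving after its first exception. The following is 
only a minimal sketch under assumptions: FailFastTarget and 
RecordingTarget are hypothetical names, not part of lxml, and the event 
sequence is replayed by hand from the output above rather than produced 
by a parser.

```python
class FailFastTarget(object):
    """Wrap a parser target; once it raises, drop all later events."""

    def __init__(self, target):
        self._target = target
        self.failed = False

    def _forward(self, name, *args):
        if self.failed:
            return None  # an earlier callback raised: ignore this event
        try:
            return getattr(self._target, name)(*args)
        except Exception:
            self.failed = True  # remember the failure, then propagate it
            raise

    def start(self, tag, attrs):
        return self._forward('start', tag, attrs)

    def data(self, data):
        return self._forward('data', data)

    def end(self, tag):
        return self._forward('end', tag)

    def close(self):
        return self._forward('close')


class RecordingTarget(object):
    """Hypothetical stand-in for MyTreeBuilder2 that records its events."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrs):
        self.events.append(('start', tag))

    def data(self, data):
        if data.strip():
            self.events.append(('data', data))

    def end(self, tag):
        self.events.append(('end', tag))
        if tag == 'b':
            raise ValueError('error')

    def close(self):
        self.events.append(('close',))


# Replay, by hand, the callback order seen in the failing run above.
recorder = RecordingTarget()
guard = FailFastTarget(recorder)
try:
    guard.start('a', {})
    guard.start('b', {})
    guard.data('test')
    guard.end('b')  # raises ValueError
except ValueError:
    pass
guard.data('test2')  # the two late 'data' events are now dropped
guard.data('test2')

print(recorder.events)
# [('start', 'a'), ('start', 'b'), ('data', 'test'), ('end', 'b')]
```

With a real parser, the wrapper would presumably be passed as 
etree.XMLParser(target=FailFastTarget(MyTreeBuilder2()), recover=False). 
It does not stop the parser itself; it only hides the late callbacks 
from the wrapped target, and 'close' is likewise skipped after a failure.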

Any replies would be greatly appreciated.

Dennis

