[lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)

Ben bba at inbox.com
Fri May 9 16:10:06 CEST 2008


Hello

I'm writing some code to check whether our daily backups worked. Backup Exec stores its
results in XML files. Sometimes bad characters (or maybe binary data) end up in
these XML files, and then lxml chokes:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95

The offending line looks like this (not sure if the bad characters will make it through
the email):

</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\??ā?\\VIC-ve\TT\miscellaneous and its subdirectories.

Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):
##################################
from lxml import etree

XmlFileName = r'c:/BEX03194.xml'
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##################################

The code works fine unless the file contains invalid characters. I would be happy for any
suggestion: the bit I'm interested in is always near the end of the XML file, so I hope
there is a way to get it reliably regardless of the gunk elsewhere in the file.
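
One possible workaround, sketched here under assumptions rather than offered as a confirmed fix, is to strip the control characters that XML 1.0 forbids before handing the data to the parser. The helper names `strip_invalid_chars` and `parse_cleaned` are hypothetical, and the byte-level approach assumes the file uses an ASCII-compatible encoding such as UTF-8; it would corrupt a UTF-16 file. The fallback import is only there so the sketch also runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the parser used in this post
except ImportError:
    # Assumption: the stdlib module is close enough for fromstring/findtext.
    import xml.etree.ElementTree as etree

# Control bytes that XML 1.0 does not allow: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Char value 11 from the error message above is 0x0B.
_INVALID_XML_BYTES = re.compile(br"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_chars(data):
    """Return `data` (bytes) with the forbidden control bytes removed."""
    return _INVALID_XML_BYTES.sub(b"", data)


def parse_cleaned(path):
    """Read the file at `path`, clean it, and return the root element."""
    with open(path, "rb") as f:
        raw = f.read()
    return etree.fromstring(strip_invalid_chars(raw))
```

With that in place, something like `parse_cleaned(XmlFileName).findtext(".//end_time")` should reach the element near the end of the file, provided the bad characters were the only problem.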

Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get
this:

C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root

The code I tried for the 'recover' parser option:

XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml   = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")

I guess I'm just specifying the option wrong, but can't see how I should be doing it.
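
A note on the snippet above: `StringIO(XmlFileName)` wraps the path string itself, so the parser is handed the text `c:/BEX03194.xml` rather than the contents of that file, and recovery leaves nothing to build a root element from. A minimal sketch of passing the parser together with the file name follows; `parse_with_recover` is a hypothetical helper name. How much of the document survives around an invalid character in recover mode depends on the underlying libxml2, so this is not guaranteed to keep the elements that come after the bad data.

```python
from lxml import etree


def parse_with_recover(path):
    """Parse the XML file at `path`, asking libxml2 to recover from errors."""
    parser = etree.XMLParser(recover=True)
    # Pass the file name (or an open file object) directly; wrapping the
    # name in StringIO would make lxml parse the name as if it were XML.
    return etree.parse(path, parser)
```

Usage would then look like `parse_with_recover(XmlFileName).findtext(".//end_time")`. If the data is already in memory as a string, `etree.fromstring(data, parser)` accepts the same parser object.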

Any suggestion, including how to work around the problem, is most welcome.



