[lxml-dev] Getting info from an XML file that has invalid character data in it (and how to specify recover option)
Ben
bba at inbox.com
Fri May 9 16:10:06 CEST 2008
Hello
I'm writing some code to check whether our daily backups worked. Backup Exec stores its
results in XML files. Sometimes bad characters (or maybe binary data) end up in
these XML files, and then lxml chokes:
C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 5, in <module>
    Xml = etree.parse(XmlFileName)
  File "lxml.etree.pyx", line 2520, in lxml.etree.parse (src/lxml/lxml.etree.c:22062)
  File "parser.pxi", line 1309, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:53088)
  File "parser.pxi", line 1338, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:53337)
  File "parser.pxi", line 1248, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:52584)
  File "parser.pxi", line 828, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:50115)
  File "parser.pxi", line 452, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:47023)
  File "parser.pxi", line 536, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:47861)
  File "parser.pxi", line 478, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:47285)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 11, line 132, column 95
The offending line looks like this (not sure if the bad characters will make it through
the email):
</error><error>Directory not found. Can not backup directory \Data\\l Strategy - Progress Rep.doc\\\??ā?\\VIC-ve\TT\miscellaneous and its subdirectories.
Example code to demonstrate how I use it (with lxml-2.0.5 and Python 2.5.2):
##################################
from lxml import etree
XmlFileName = r'c:/BEX03194.xml'
Xml = etree.parse(XmlFileName)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
##############################
The code works fine unless the file contains invalid characters. I'd be happy for any
suggestion: the bit I'm interested in is always near the end of the XML file, so
there should be a way to get at it reliably regardless of the gunk elsewhere in the file (or that's what I hope).
Also, I've tried the 'recover' parser option, but I'm doing something wrong, because I get
this:
C:\>python sb-lxml.py
Traceback (most recent call last):
  File "sb-lxml.py", line 9, in <module>
    print Xml.findtext(".//end_time")
  File "lxml.etree.pyx", line 1656, in lxml.etree._ElementTree.findtext (src/lxml/lxml.etree.c:15354)
  File "lxml.etree.pyx", line 1489, in lxml.etree._ElementTree._assertHasRoot (src/lxml/lxml.etree.c:14116)
AssertionError: ElementTree not initialized, missing root
The code I tried for the 'recover' parser option:
XmlFileName = r'c:/BEX03194.xml'
parser = etree.XMLParser(recover=True)
Xml = etree.parse(StringIO(XmlFileName), parser)
print Xml.findtext(".//end_time")
print Xml.findtext(".//engine_completion_status")
I guess I'm just specifying the option wrong, but can't see how I should be doing it.
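A likely cause of the "missing root" error: StringIO(XmlFileName) wraps the file *name* in a file-like object, so the parser is handed the text c:/BEX03194.xml as if it were the document, recovers nothing usable from it, and returns a tree without a root. A minimal sketch of passing the path itself together with the recovering parser might look like this. The helper name read_backup_status is hypothetical (not part of lxml), the code is written for a current Python rather than 2.5.2, and how much libxml2 salvages in recover mode may depend on its version.

```python
from lxml import etree


def read_backup_status(xml_path):
    """Parse xml_path leniently; return (end_time, engine_completion_status).

    Hypothetical helper for illustration, not part of lxml.
    """
    # recover=True asks libxml2 to keep going after well-formedness errors.
    parser = etree.XMLParser(recover=True)
    # Pass the path itself (not StringIO(path)); lxml opens and reads the file.
    tree = etree.parse(xml_path, parser)
    return (tree.findtext(".//end_time"),
            tree.findtext(".//engine_completion_status"))
```

Calling read_backup_status(r'c:/BEX03194.xml') would then return the two values, or None for any element the recovered tree does not contain.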
Any suggestion, including how to work around the problem, is most welcome.
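One possible workaround is to read the file as bytes and strip the control characters that XML 1.0 forbids before parsing: everything below 0x20 except tab, line feed and carriage return (value 11 in the error above is a vertical tab). The sketch below is an illustration under assumptions, not a tested fix for Backup Exec output. The helper names are hypothetical; it uses the standard library's xml.etree.ElementTree so that it runs without lxml (lxml.etree.fromstring should accept the same cleaned bytes); it is written for a current Python rather than 2.5.2; and it assumes an ASCII-compatible encoding such as UTF-8, since in UTF-16 these byte values occur inside ordinary characters.

```python
import re
import xml.etree.ElementTree as ET

# Bytes 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F are not legal XML 1.0 characters.
# Tab (0x09), line feed (0x0A) and carriage return (0x0D) are kept.
_ILLEGAL_CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_illegal_control_bytes(data):
    """Return data with the forbidden control bytes removed (hypothetical helper)."""
    return _ILLEGAL_CONTROL_BYTES.sub(b"", data)


def parse_cleaned(xml_path):
    """Read xml_path, drop forbidden control bytes, and return the root element."""
    with open(xml_path, "rb") as handle:
        raw = handle.read()
    return ET.fromstring(strip_illegal_control_bytes(raw))
```

With this, parse_cleaned(r'c:/BEX03194.xml').findtext(".//end_time") would look up the same element as the original snippet. Note that this only handles stray control characters; other kinds of damage (for example broken tags) would still raise a parse error.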