[lxml-dev] Trouble parsing large XML document with ElementTree
Sam Kuper
sam.kuper at uclmail.net
Sat May 24 21:09:45 CEST 2008
Dear Stefan,
I did read your other post, but using the file name directly when calling
the parser didn't work for me. Here is what I tried:
from lxml import etree

outfile = open("output_test001.txt", "w")

class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
                print line
    def close(self):
        return "closed!"

parser = etree.XMLParser(target = EchoTarget())
result = etree.XML("content.example.xml", parser)
This gives the following error:
Traceback (most recent call last):
  File "extract_links_dmoz005.py", line 15, in <module>
    result = etree.XML("content.example.xml", parser)
  File "lxml.etree.pyx", line 2358, in lxml.etree.XML
  File "parser.pxi", line 1354, in lxml.etree._parseMemoryDocument
  File "parser.pxi", line 1243, in lxml.etree._parseDoc
  File "parser.pxi", line 795, in lxml.etree._BaseParser._parseDoc
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
I have been reading the docs, but I'm new to processing XML in Python, so I
don't find them all that easy to understand. I think I'm improving, though
:) Thanks for your patience.
Best,
Sam
2008/5/24 Stefan Behnel <stefan_ml at behnel.de>:
> Hi,
>
> did you read my other post?
>
> Sam Kuper wrote:
> > result = etree.XML(infile.read(), parser)
>
> make that
>
> result = etree.parse("thefile.xml", parser)
>
> and consider reading the parser docs on the web page.
>
> Stefan
>
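
For reference, a minimal sketch of the suggestion quoted above. etree.XML() parses its first argument as XML text, so passing a file name makes the parser read the name itself as markup, which is consistent with the "Start tag expected" error; etree.parse() accepts a file name and reads the file from disk. The LinkCollector class, the extract_links() helper, the sample document and its namespace are all made up for illustration, and the fallback to the standard library's ElementTree is an assumption added so the sketch runs without lxml. It is written for Python 3, unlike the code earlier in this message.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which offers the same
    # XMLParser(target=...) and parse(source, parser) interface.
    import xml.etree.ElementTree as etree


class LinkCollector:
    """Hypothetical parser target that collects 'about' attributes."""

    def __init__(self):
        self.links = []

    def start(self, tag, attrib):
        # With a namespaced document the tag arrives as "{uri}ExternalPage",
        # so match on the local name at the end of the string.
        if tag.endswith("ExternalPage"):
            url = attrib.get("about", "")
            if url:
                self.links.append(url)

    def close(self):
        return "closed!"


def extract_links(path):
    """Parse the file at 'path' and return the collected URLs."""
    target = LinkCollector()
    parser = etree.XMLParser(target=target)
    # parse() takes a file name (or file object); XML() takes XML text.
    etree.parse(path, parser)
    return target.links


# Made-up sample standing in for the real input file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.org/a"><Title>A</Title></ExternalPage>
  <ExternalPage about=""/>
  <ExternalPage about="http://example.org/b"/>
</RDF>
"""

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(SAMPLE)
    sample_path = handle.name

links = extract_links(sample_path)
os.remove(sample_path)
print(links)
```

The sketch does not use the value returned by etree.parse(); it reads the collected URLs from the target object instead, so nothing depends on what the parser hands back.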
--
http://five.sentenc.es | http://tinyurl.com/3x9se4
--
Mr Sam Pablo Kuper BSc MRI
Research Assistant
Darwin Correspondence Project
Cambridge University Library
West Road
Cambridge CB3 9DR
spk30 at cam.ac.uk
Office: +44 (0)1223 333008
Mobile: +44 (0) 7971858176
www.darwinproject.ac.uk