[lxml-dev] Trouble parsing large XML document with ElementTree

Sam Kuper sam.kuper at uclmail.net
Sat May 24 21:09:45 CEST 2008


Dear Stefan,

I did read your other post, but using the file name directly when calling
the parser didn't work for me. Here is what I tried:

from lxml import etree
outfile = open("output_test001.txt", "w")
class EchoTarget():
    def start(self, tag, attrib):
        if tag.endswith("xternalPage"):
            line = attrib["about"]
            if line != "":
                outfile.write(line+"\n")
            print line
    def close(self):
        return "closed!"
parser = etree.XMLParser(target = EchoTarget())
result = etree.XML("content.example.xml", parser)

This gives the following error:

Traceback (most recent call last):
  File "extract_links_dmoz005.py", line 15, in <module>
    result = etree.XML("content.example.xml", parser)
  File "lxml.etree.pyx", line 2358, in lxml.etree.XML
  File "parser.pxi", line 1354, in lxml.etree._parseMemoryDocument
  File "parser.pxi", line 1243, in lxml.etree._parseDoc
  File "parser.pxi", line 795, in lxml.etree._BaseParser._parseDoc
  File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResultDoc
  File "parser.pxi", line 478, in lxml.etree._raiseParseError
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1


I have been reading the docs, but I'm new to processing XML in Python, so I
don't find them all that easy to understand. I think I'm improving, though
:) Thanks for your patience.

Best,

Sam


2008/5/24 Stefan Behnel <stefan_ml at behnel.de>:

> Hi,
>
> did you read my other post?
>
> Sam Kuper wrote:
> > result = etree.XML(infile.read(), parser)
>
> make that
>
>    result = etree.parse("thefile.xml", parser)
>
> and consider reading the parser docs on the web page.
>
> Stefan
>
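For readers following the thread, here is a minimal sketch of the difference between the two calls. It assumes that etree.XML() takes the document text itself while etree.parse() takes a file name or file object, which matches the traceback above and Stefan's suggestion. The LinkCollector class, the sample document, and the example.org URLs are made up for illustration, and the standard-library fallback import is only there so the sketch runs without lxml installed.

```python
import os
import tempfile

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the standard library, whose XMLParser
    # also accepts a target object with start()/close() methods.
    import xml.etree.ElementTree as etree


class LinkCollector(object):
    """Hypothetical parser target: collects 'about' attributes of ExternalPage tags."""

    def __init__(self):
        self.links = []

    def start(self, tag, attrib):
        # Tags arrive namespace-qualified, e.g. "{http://...}ExternalPage".
        if tag.endswith("ExternalPage"):
            link = attrib.get("about", "")
            if link:
                self.links.append(link)

    def close(self):
        return self.links


# Made-up sample document, only for this sketch.
SAMPLE = (
    '<RDF xmlns="http://example.org/rdf/">'
    '<ExternalPage about="http://example.org/a"/>'
    '<ExternalPage about="http://example.org/b"/>'
    '</RDF>'
)

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "content.example.xml")
    with open(path, "w") as f:
        f.write(SAMPLE)

    # etree.XML() treats its first argument as the document text, so a
    # file name is parsed as (invalid) XML and a syntax error is raised.
    try:
        etree.XML("content.example.xml", etree.XMLParser(target=LinkCollector()))
        xml_call_failed = False
    except SyntaxError:
        xml_call_failed = True

    # etree.parse() opens the named file and feeds its contents to the parser.
    collector = LinkCollector()
    etree.parse(path, etree.XMLParser(target=collector))

print(xml_call_failed)
print(collector.links)
```

Collecting the links on the target object, rather than relying on the return value of etree.parse(), keeps the sketch independent of what that call returns when a target is in use.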



-- 
http://five.sentenc.es | http://tinyurl.com/3x9se4
--
Mr Sam Pablo Kuper BSc MRI
Research Assistant
Darwin Correspondence Project
Cambridge University Library
West Road
Cambridge CB3 9DR
spk30 at cam.ac.uk
Office: +44 (0)1223 333008
Mobile: +44 (0) 7971858176
www.darwinproject.ac.uk