[lxml-dev] combining target parser class with DTD validation
Stefan Behnel
stefan_ml at behnel.de
Thu Aug 7 14:06:49 CEST 2008
Hi,
Gary V. Vaughan wrote:
> I have an xml.sax based parser that works like this:
>
>               .-----.                        ,-------------------.
>   file.xml -> | gpp | -> preprocessed.xml -> |saxexts.make_parser| --.
>               `-----'                        `-------------------'   |
>                                              ,-------------------.   |
>                    custom class hierarchy <- |sax DocumentHandler| <-'
>                                              `-------------------'
Excellent picture.
> The main driver behind moving the project to lxml is to perform DTD
> validation on 'preprocessed.xml', but setting a target keyword in
> etree.XMLParser turns off DTD validation at parse time :(
Hmm, I never tried that. If it doesn't work, I would assume that the
validation is done after passing the callbacks through the SAX interface
of libxml2, which lxml connects to directly, i.e.
                         -> DTD validation -> normal tree builder
input -> parser -> SAX <
                         -> lxml's target parser
(I didn't verify this, though, so I might be mistaken. Maybe it's easy to
enable validation, maybe it's not...)
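
For reference, a minimal sketch of the target parser interface under discussion, assuming a reasonably current lxml. The `EventCollector` class, the `collect_events` helper and the sample input are made up for illustration. `start`, `end`, `data` and `close` are the callbacks lxml calls on a target object, and `parser.close()` returns whatever the target's `close()` returns instead of a tree.

```python
from lxml import etree


class EventCollector:
    """Hypothetical target object: records parser callbacks in order."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag, dict(attrib)))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # Text may arrive in several pieces, e.g. when it spans feed() calls.
        self.events.append(("data", text))

    def close(self):
        events, self.events = self.events, []
        return events


def collect_events(chunks):
    """Feed byte chunks to a target parser and return the recorded events."""
    parser = etree.XMLParser(target=EventCollector())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


# Illustrative input, split in the middle of a text node.
events = collect_events([b"<doc><item id='1'>te", b"xt</item></doc>"])
```

No tree is built in this mode, which fits the picture above: the target sits on the SAX branch next to the normal tree builder.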
> Also, with dtd_validation=True, even xml documents with no DOCTYPE
> declaration throw an exception. Since I have a zillion files that
> I'd like to migrate gradually, while interoperating with other users
> that haven't installed the lxml based parser yet, I only want to do
> validation when there is a DOCTYPE declaration. Older files don't
> have it, so I'd like to ignore the missing DTD reference on those
> until they are upgraded.
Hmmm, I wasn't aware of that, but it seems you have to have a DOCTYPE in
your XML to use a validating parser. I understand that that can be
undesirable.
> My question is: what is the cleanest/fastest way to combine
> (i) passing input to the parser with a feed function
> (ii) reusing most of the sax DocumentHandler with a target class
> (iii) performing DTD validation on the fly during parsing
> (iv) but skipping validation if there is no DOCTYPE declaration
>
> I've ended up implementing an lxml based validating parser to replace
> the above like this:
>
>   # input file
>   fh = open (gpp_input, 'r')
>
>   # For backwards compatibility, skip dtd validation when there
>   # is no DOCTYPE declaration:
>   match = re.compile ('^<!DOCTYPE ', re.M).search (fh.read (1024), 1)
>   fh.seek(0)
>
>   # prepare XML parser to read data
>   parser = etree.XMLParser (dtd_validation=(match != None))
>
>   gpp_r, my_w = os.pipe ()
>   my_r, gpp_w = os.pipe ()
>   gpp = os.fork ()
>   if gpp == 0:
>       ...
>       # set up pipes to gpp stdin and stdout
>       ...
>
>   while fds:
>       ...
>       # collect gpp stdout with select
>       ...
>
>       parser.feed (gpp_output)
>
>   # get the etree
>   tree = parser.close ()
>
>   # walk the etree and fire synthetic sax events
>   xmlh = old_sax_DocumentHandler ()
>   context = etree.iterwalk (tree, events=("start", "end"))
>   for action, element in context:
>       if action == 'start':
>           xmlh.startElement (element.tag, element.attrib)
>           if element.text and hasattr (xmlh, 'characters'):
>               xmlh.characters (element.text)
>       elif action == 'end':
>           xmlh.endElement (element.tag)
>           if element.tail and hasattr (xmlh, 'characters'):
>               xmlh.characters (element.tail)
You are aware of lxml.sax? Although your code is fairly short and special,
so I guess it's fair enough to just use this. It might even be faster than
lxml.sax after all...
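
In case it is useful, here is a rough sketch of the lxml.sax route. Note that `lxml.sax.saxify()` fires the namespace-aware callbacks (`startElementNS` / `endElementNS`), so a handler that only implements `startElement` / `endElement` would need a thin adapter. `RecordingHandler` and the sample document are made up for illustration.

```python
from xml.sax.handler import ContentHandler

import lxml.sax
from lxml import etree


class RecordingHandler(ContentHandler):
    """Hypothetical handler: records the SAX events generated from a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace_uri, local_name) pair
        self.events.append(("start", name[1]))

    def endElementNS(self, name, qname):
        self.events.append(("end", name[1]))

    def characters(self, content):
        self.events.append(("text", content))


root = etree.fromstring(b"<a>hi<b/>there</a>")
handler = RecordingHandler()
lxml.sax.saxify(root, handler)  # walk the tree and fire SAX events
sax_events = handler.events
```

Text and tail content both arrive through `characters()`, in document order, much like in the hand-written loop quoted above.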
> It works well enough, but it feels kludgy to manually peek into each
> xml file and look for a DOCTYPE at the start; and since I have to walk
> the tree once while building it and again when calling the handler
> function, I'm sure it is slower than it could be.
I know, threads are often frowned upon in Python, but given that
lxml.etree frees the GIL for all sorts of C-level and I/O operations, they
can really help you here to speed up your overall processing of your
"zillions" of files.
A couple of ideas for improvements:
1) if there is a way to let gpp read its input file directly through a
command line option - do that. It keeps your Python interpreter from
wasting GIL time on copying data from the outside world back into it.
2) connect lxml's parser directly to gpp's output pipe and run it in a
separate thread.
3) try to do post-parsing validation instead of on-the-fly validation by
validating with tree.docinfo.externalDTD (not sure if that's faster,
YMMV). Again, do this in a thread.
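
A minimal sketch of idea 3), under a few assumptions: the helper name `validate_if_dtd` is made up, and the sample documents carry their DTD in the internal subset so that the example is self-contained. As far as I can tell, `docinfo.externalDTD` is only populated when the parser actually loaded the external DTD (e.g. with `load_dtd=True`), so the sketch falls back to `docinfo.internalDTD`.

```python
from lxml import etree


def validate_if_dtd(tree):
    """Validate tree against its own DTD, if it has one.

    Returns None when the document carries no DTD at all (validation is
    skipped), otherwise True or False. Hypothetical helper, not lxml API.
    """
    docinfo = tree.docinfo
    dtd = docinfo.externalDTD
    if dtd is None:
        dtd = docinfo.internalDTD
    if dtd is None:
        return None
    return dtd.validate(tree)


def parse(data):
    """Parse bytes without validation and return an ElementTree."""
    return etree.fromstring(data).getroottree()


# Illustrative documents: valid, invalid, and one without a DOCTYPE.
valid = parse(b"<!DOCTYPE b [<!ELEMENT b (a)><!ELEMENT a EMPTY>]><b><a/></b>")
invalid = parse(
    b"<!DOCTYPE b [<!ELEMENT b (a)><!ELEMENT a (c)><!ELEMENT c EMPTY>]>"
    b"<b><a/></b>"
)
plain = parse(b"<b><a/></b>")

results = [validate_if_dtd(t) for t in (valid, invalid, plain)]
```

One caveat: a document whose DOCTYPE only names an external DTD that was never loaded still has a (declaration-free) internal subset, so it would be reported as invalid rather than skipped.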
Your application looks heavily I/O bound, so even if you do things in a
tree instead of on-the-way-in, doing things in parallel will give you a
big speed-up.
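
To make ideas 1) and 2) a bit more concrete, here is a sketch that lets each worker thread start the preprocessor itself and hand its output pipe straight to the parser. Since I do not know gpp's command line, `fake_preprocessor()` uses the Python interpreter as a stand-in; `parse_command_output()` and `parse_all()` are made-up helper names as well.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

from lxml import etree


def parse_command_output(argv):
    """Run argv and parse its stdout as XML, reading directly from the pipe."""
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        tree = etree.parse(proc.stdout)
    # Leaving the with-block waits for the process, so returncode is set here.
    if proc.returncode != 0:
        raise RuntimeError("preprocessor exited with status %d" % proc.returncode)
    return tree


def parse_all(commands, max_workers=4):
    """Parse the output of several commands in parallel threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_command_output, commands))


def fake_preprocessor(root_tag):
    """Stand-in for the real gpp command line: prints a tiny XML document."""
    script = "print('<%s><item/></%s>')" % (root_tag, root_tag)
    return [sys.executable, "-c", script]


trees = parse_all([fake_preprocessor("a"), fake_preprocessor("b")])
root_tags = [tree.getroot().tag for tree in trees]
```

Whether this is actually faster for your files is something you would have to measure.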
Hope this helps.
Stefan