[lxml-dev] combining target parser class with DTD validation

Gary V. Vaughan lxml-dev at mlists.thewrittenword.com
Thu Aug 7 10:26:44 CEST 2008


Hi,

I have an xml.sax based parser that works like this:

              .-----.                        ,-------------------.
  file.xml -> | gpp | -> preprocessed.xml -> |saxexts.make_parser| --.
              `-----'                        `-------------------'   |
                                             ,-------------------.   |
                   custom class hierarchy <- |sax DocumentHandler| <-'
                                             `-------------------'

I'm in the process of converting all of this from xml.sax to lxml.etree.

The DocumentHandler function is very complex, but well debugged, so
I'd really like to convert it to work with the etree.XMLParser target
keyword, which is straightforward enough (changing startElement to
start, characters to data, etc.) and avoids churning the handler
function and the custom class hierarchy it builds.  So far so good...

Also, performance is important, so I'm passing the output of gpp
(general pre-processor) to the parser with a feed function while gpp
is still running on file.xml.
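
A minimal sketch of the two points above (an adapter that lets the old
handler act as a parser target, fed chunk by chunk) might look like the
following.  SaxTargetAdapter, RecordingHandler and parse_chunks are
hypothetical names used only for illustration, and RecordingHandler
merely stands in for the real DocumentHandler.  The stdlib fallback is
there only so the sketch runs without lxml installed; it is an
assumption that the adapter needs nothing beyond start/end/data/close.

```python
# Sketch only: all class and function names here are hypothetical.
try:
    from lxml import etree                  # the parser discussed in this post
except ImportError:
    import xml.etree.ElementTree as etree   # stdlib parser, same target interface


class SaxTargetAdapter:
    """Forward parser-target callbacks to an old SAX-style handler."""

    def __init__(self, handler):
        self.handler = handler

    def start(self, tag, attrib):
        self.handler.startElement(tag, attrib)

    def end(self, tag):
        self.handler.endElement(tag)

    def data(self, text):
        # Text may arrive in several pieces, as with SAX characters().
        if hasattr(self.handler, 'characters'):
            self.handler.characters(text)

    def close(self):
        # Whatever is returned here becomes the result of parser.close().
        return self.handler


class RecordingHandler:
    """Stand-in for the real DocumentHandler; it only records events."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name, dict(attrs)))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, text):
        self.events.append(('chars', text))


def parse_chunks(chunks, handler):
    """Feed an iterable of text chunks to a target parser as they arrive."""
    parser = etree.XMLParser(target=SaxTargetAdapter(handler))
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()
```

In the real driver the chunks would be whatever the select loop
collects from gpp's stdout.  Note that no tree is built in this mode,
so the handler sees each event exactly once.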

The main driver behind moving the project to lxml is to perform DTD
validation on 'preprocessed.xml', but setting a target keyword in
etree.XMLParser turns off DTD validation at parse time :(

Also, with dtd_validation=True, even xml documents with no DOCTYPE
declaration throw an exception.  Since I have a zillion files that
I'd like to migrate gradually, while interoperating with other users
that haven't installed the lxml based parser yet, I only want to do
validation when there is a DOCTYPE declaration.  Older files don't
have it, so I'd like to ignore the missing DTD reference on those
until they are upgraded.

My question is: what is the cleanest/fastest way to combine
   (i) passing input to the parser with a feed function
  (ii) reusing most of the sax DocumentHandler with a target class
 (iii) performing DTD validation on the fly during parsing
  (iv) but skipping validation if there is no DOCTYPE declaration

I've ended up implementing an lxml based validating parser to replace
the above like this:

  import os
  import re
  from lxml import etree

  # input file
  fh = open (gpp_input, 'r')

  # For backwards compatibility, skip dtd validation when there 
  # is no DOCTYPE declaration: 
  match = re.compile ('^<!DOCTYPE ', re.M).search (fh.read (1024))
  fh.seek(0) 

  # prepare XML parser to read data
  parser = etree.XMLParser (dtd_validation=(match is not None))

  gpp_r, my_w = os.pipe ()
  my_r, gpp_w = os.pipe ()
  gpp = os.fork ()
  if gpp == 0:
    ...
    # set up pipes to gpp stdin and stdout
    ...

  while fds:
    ...
    # collect gpp stdout with select
    ...

    parser.feed (gpp_output)

  # get the etree
  tree = parser.close ()

  # walk the etree and fire synthetic sax events
  xmlh    = old_sax_DocumentHandler ()
  context = etree.iterwalk (tree, events=("start", "end"))
  for action, element in context:
    if action == 'start':
      xmlh.startElement (element.tag, element.attrib)
      if element.text and hasattr (xmlh, 'characters'):
        xmlh.characters (element.text)
    elif action == 'end':
      xmlh.endElement (element.tag)
      if element.tail and hasattr (xmlh, 'characters'):
        xmlh.characters (element.tail)

It works well enough, but it feels kludgy to manually peek into each
xml file and look for a DOCTYPE at the start; and since I have to walk
the tree once while building it and again when calling the handler
function, I'm sure it is slower than it could be.
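
For reference, the DOCTYPE peek can at least be confined to one small
helper.  The sketch below is an assumption-laden illustration, not a
replacement for real DOCTYPE detection: peek_for_doctype is a
hypothetical name, the regex is slightly looser than the one above (it
tolerates leading whitespace), and like the original it is fooled by a
DOCTYPE that sits inside a comment, shares a line with the XML
declaration, or starts beyond the first 1024 characters.

```python
import io
import re

# Heuristic: a line that starts (after optional whitespace) with "<!DOCTYPE".
_DOCTYPE = re.compile(r'^\s*<!DOCTYPE\s', re.MULTILINE)


def peek_for_doctype(fh, size=1024):
    """Return True if the first `size` characters look like they contain
    a DOCTYPE declaration; the file position is restored afterwards."""
    start = fh.tell()
    head = fh.read(size)
    fh.seek(start)
    return _DOCTYPE.search(head) is not None


# Small demonstration with in-memory files standing in for real input.
upgraded = io.StringIO('<?xml version="1.0"?>\n'
                       '<!DOCTYPE doc SYSTEM "doc.dtd">\n<doc/>')
legacy = io.StringIO('<?xml version="1.0"?>\n<doc/>')
print(peek_for_doctype(upgraded), peek_for_doctype(legacy))
```

The result can then be passed directly as the dtd_validation argument
of etree.XMLParser, as in the code above.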

Advice gratefully received!

Cheers,
	Gary
-- 
Gary V. Vaughan (gary at thewrittenword.com)

