[lxml-dev] lxml & parsing: return of a classes

Ian Bicking ianb at colorstudy.com
Tue Jul 17 21:30:59 CEST 2007


So I was thinking a little about how we could allow easy customization 
of the URL getter, since we can't attach it to the tree or any element. 
And then, more generally, about how any customization could be done, for 
instance if you want a new method on all elements.

This isn't that easy currently.  You'd have to subclass a bunch of 
classes and rewrite a bunch of functions.  But I think if we move all 
parsing to a single class it would help a great deal.

The idea is something like:

   class Parser(object):
       _etree_parser_class = etree.HTMLParser
       def __init__(self):
           self._etree_parser = self._etree_parser_class()
           self._etree_parser.setElementClassLookup(self)
       def __call__(self, filename, **kw):
           return etree.parse(filename, self._etree_parser, **kw)
       def fromstring(...):
           ...

And so forth.  Then either expose this via:

   parse = Parser()

Or perhaps:

   _parser = Parser()
   parse = _parser
   fromstring = _parser.fromstring
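
To make the shape of this concrete, here is a minimal runnable sketch. It is not lxml code: it assumes the standard library's xml.etree.ElementTree as a stand-in for etree so it runs without lxml, and _new_parser and CountingParser are hypothetical names made up for the illustration.

```python
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


class Parser(object):
    _etree_parser_class = ET.XMLParser

    def _new_parser(self):
        # ElementTree parsers are single-use, so build a fresh one per call.
        return self._etree_parser_class()

    def __call__(self, source):
        # The one place where parsing actually happens.
        return ET.parse(source, self._new_parser())

    def fromstring(self, text):
        # Defined in terms of __call__, so overriding __call__ covers both.
        return self(io.BytesIO(text.encode("utf-8"))).getroot()


class CountingParser(Parser):
    """Hypothetical customization: count every parse, whatever the entry point."""

    def __init__(self):
        self.calls = 0

    def __call__(self, source):
        self.calls += 1
        return Parser.__call__(self, source)


# Module-level names bound to one shared instance, as in the second variant.
_parser = Parser()
parse = _parser
fromstring = _parser.fromstring
```

With this layout, a subclass that overrides only __call__ changes the behaviour of fromstring as well, because fromstring goes through self.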

And so forth.  If you want to adjust something, you don't have to 
reimplement all the forms of parsers, since they all would just use 
self, and are mostly defined in terms of each other.  We could support 
subclassing with something like this:

   class Parser(object):
       # tag name -> element class; subclasses may override
       _element_classes = {}
       # tag name (or '*' for every known tag) -> mixin class
       _element_mixins = {}
       def __init__(self):
           self._element_classes = self._element_classes.copy()
           mixers = {}
           for name, value in self._element_mixins.items():
               if name == '*':
                   for n in self._element_classes.keys():
                       mixers.setdefault(n, []).append(value)
               else:
                   mixers.setdefault(name, []).append(value)
           for name, mixins in mixers.items():
               cur = self._element_classes.get(name, HtmlElement)
               bases = mixins + [cur]
               new_class = type(cur.__name__, tuple(bases), {})
               self._element_classes[name] = new_class

   class MyMixin(object):
       pass  # extra methods for every element go here
   class FormMixin(object):
       pass  # other methods for the form element go here

   class ParserMixedIn(Parser):
       _element_mixins = {'*': MyMixin, 'form': FormMixin}

And then it would be really easy to create local extensions for all HTML 
elements, or particular elements.
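
The mixin idea above can be tried out without lxml. The following is only a sketch under assumptions: HtmlElement and FormElement are plain stand-in classes, and element_class, shout and field_names are hypothetical names invented for the example.

```python
class HtmlElement(object):
    """Stand-in for the real base element class (assumption for this sketch)."""

    def __init__(self, tag):
        self.tag = tag


class FormElement(HtmlElement):
    """Stand-in for a specialized form element class."""


class Parser(object):
    _element_classes = {'form': FormElement, 'a': HtmlElement}
    _element_mixins = {}

    def __init__(self):
        # Copy so the per-instance changes never touch the class-level dict.
        self._element_classes = dict(self._element_classes)
        mixers = {}
        for name, mixin in self._element_mixins.items():
            if name == '*':
                # '*' applies to every tag that has a registered class.
                for n in self._element_classes:
                    mixers.setdefault(n, []).append(mixin)
            else:
                mixers.setdefault(name, []).append(mixin)
        for name, mixins in mixers.items():
            cur = self._element_classes.get(name, HtmlElement)
            # Mixins go first so their methods win in the MRO.
            bases = tuple(mixins) + (cur,)
            self._element_classes[name] = type(cur.__name__, bases, {})

    def element_class(self, tag):
        # Hypothetical lookup helper; unknown tags get the plain base class.
        return self._element_classes.get(tag, HtmlElement)


class MyMixin(object):
    def shout(self):
        return self.tag.upper()


class FormMixin(object):
    def field_names(self):
        return []


class ParserMixedIn(Parser):
    _element_mixins = {'*': MyMixin, 'form': FormMixin}
```

One thing the sketch makes visible: written this way, '*' only reaches tags that already have an entry in _element_classes, so a tag that falls back to the default HtmlElement does not pick up MyMixin.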

I'm not sure exactly how to attach the URL-getting method to the Parser 
object in this model, because I'm not sure how to give elements a 
reference back to it.  We could do it with class variables, but then the 
parser would *have* to subclass every element class every time it was 
instantiated, so it could make new classes with a reference back to 
itself.  But maybe there's a better way.  Do the elements already have a 
reference back to that etree.HTMLParser() instance, and could we attach 
this to that instance?  Or perhaps extend HTMLParser directly instead of 
having this other parser class?
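
For reference, the class-variable approach could look roughly like this. It is a sketch with stand-in classes rather than lxml's real ones, and _parser, get_url and fetch are hypothetical names used only for illustration.

```python
class HtmlElement(object):
    """Stand-in for the real element base class (assumption for this sketch)."""

    _parser = None  # filled in on the per-parser subclasses below

    def __init__(self, tag):
        self.tag = tag

    def fetch(self, url):
        # Elements reach the URL getter through the class-level back-reference.
        return self._parser.get_url(url)


class Parser(object):
    _element_classes = {'form': HtmlElement}

    def __init__(self):
        # Every instance builds its own subclasses that point back at itself.
        self._element_classes = dict(
            (name, type(cls.__name__, (cls,), {'_parser': self}))
            for name, cls in self._element_classes.items())

    def get_url(self, url):
        raise NotImplementedError


class CannedParser(Parser):
    """Hypothetical subclass with a fake URL getter that does no network I/O."""

    def get_url(self, url):
        return 'fetched %s' % url
```

The cost mentioned above shows up directly: two parser instances never share element classes, so each instantiation creates a full set of new classes.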

-- 
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
             : Write code, do good : http://topp.openplans.org/careers

