[lxml-dev] lxml & parsing: return of a classes

Stefan Behnel stefan_ml at behnel.de
Thu Jul 19 17:57:13 CEST 2007


Hi Ian,

Ian Bicking wrote:
> So I was thinking a little about how we could allow easy customization 
> of the URL getter, since we can't attach it to the tree or any element. 
>   And then generally how any customization could be done, for instance 
> if you want a new method on all elements.
> 
> This isn't that easy currently.  You'd have to subclass a bunch of 
> classes and rewrite a bunch of functions.  But I think if we move all 
> parsing to a single class it would help a great deal.
> 
> The idea is something like:
> 
>    class Parser(object):
>        _etree_parser_class = etree.HTMLParser
>        def __init__(self):
>            self._etree_parser = self._etree_parser_class()
>            self._etree_parser.setElementClassLookup(self)
>        def __call__(self, filename, **kw):
>            return etree.parse(filename, self._etree_parser, **kw)
>        def fromstring(...):
>            ...

That's a good idea, but as you suggest at the end, extending the HTMLParser
class directly is the way to go. Documents in lxml.etree keep a reference to
their parser to support inheritance of resolvers. It's even readable from
Python as the "parser" property of an ElementTree. That would nicely solve most of
your problems.
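
A minimal sketch of that approach, assuming a hypothetical get_url() method as
the customisation hook (the method name and its behaviour are invented for
illustration and are not part of lxml):

```python
from io import BytesIO
from lxml import etree


class MyHTMLParser(etree.HTMLParser):
    """HTMLParser subclass carrying a hypothetical URL getter."""

    def get_url(self, url):
        # Hypothetical hook: a real implementation would fetch the URL.
        return "fetched:" + url


parser = MyHTMLParser()
tree = etree.parse(BytesIO(b"<html><body><p>hi</p></body></html>"), parser)

# The document remembers the parser that built it.
found = tree.parser

# The same lookup works when starting from an element.
root_parser = tree.getroot().getroottree().parser
```

Code that only holds an element could then reach the hook through
getroottree().parser, without anything being attached to the element itself.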


> If you want to adjust something, you don't have to 
> reimplement all the forms of parsers, since they all would just use 
> self, and are mostly defined in terms of each other.  We could support 
> subclassing with something like this:
> 
>    class Parser(object):
>        _element_classes = {}
>        _element_mixins = {}
>        def __init__(self):
>            self._element_classes = self._element_classes.copy()
>            mixers = {}
>            for name, value in self._element_mixins.items():
>                if name == '*':
>                    for n in self._element_classes.keys():
>                        mixers.setdefault(n, []).append(value)
>                else:
>                    mixers.setdefault(name, []).append(value)
>            for name, mixins in mixers.items():
>                cur = self._element_classes.get(name, HtmlElement)
>                bases = mixins + [cur]
>                new_class = type(cur.__name__, tuple(bases), {})
>                self._element_classes[name] = new_class
> 
>    class MyMixin(object):
>        extra methods
>    class FormMixin(object):
>        other methods for the form element
> 
>    class ParserMixedIn(Parser):
>        _element_mixins = {'*': MyMixin, 'form': FormMixin}
> 
> And then it would be really easy to create local extensions for all HTML 
> elements, or particular elements.

I would have to see how this looks if you inherit from HTMLParser and how this
matches with the existing class lookup mechanisms.
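
For reference, the mixin composition from the quoted proposal can be exercised
on its own. This is only a sketch under stated assumptions: HtmlElement here is
a plain stand-in class rather than a real lxml element class,
build_element_classes() is a hypothetical helper, and mixins are passed as
(name, mixin) pairs so that one tag can receive several of them. Unlike the
quoted code, '*' is also applied to tags that have no registered class.

```python
class HtmlElement(object):
    """Stand-in for the real element base class (assumption for this sketch)."""

    def __init__(self, tag):
        self.tag = tag


class MyMixin(object):
    def describe(self):
        return "<%s>" % self.tag


class FormMixin(object):
    def is_form(self):
        return True


def build_element_classes(element_classes, element_mixins, default=HtmlElement):
    """Return (per-tag class dict, fallback class) with mixins applied."""
    classes = dict(element_classes)
    # Mixins registered under '*' apply to every element class.
    wildcard = [mixin for name, mixin in element_mixins if name == '*']
    named = {}
    for name, mixin in element_mixins:
        if name != '*':
            named.setdefault(name, []).append(mixin)

    def mix(cur, mixins):
        # Mixins go first in the bases so their methods take precedence.
        if not mixins:
            return cur
        return type(cur.__name__, tuple(mixins) + (cur,), {})

    for name in set(classes) | set(named):
        cur = classes.get(name, default)
        classes[name] = mix(cur, named.get(name, []) + wildcard)
    return classes, mix(default, wildcard)


classes, fallback = build_element_classes(
    {}, [('*', MyMixin), ('form', FormMixin)])

form = classes['form']('form')
para = classes.get('p', fallback)('p')
```

A class lookup could consult such a dict by tag name and fall back to the
second return value for everything else.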


> I'm not sure exactly how to attach the URL getting method to the Parser 
> object in this model, because I'm not sure how to give elements a 
> reference back to it.

I think we should try to integrate with the normal Resolver mechanism here
(doc/resolvers.txt). Not sure how this works exactly if we want to use it from
Python code (currently it's only called from libxml2 internally), but I would
like to avoid adding yet another way to resolve URLs. Currently, resolvers
receive an opaque "context" object as the last argument and return an opaque
object with a string or file-like object etc. We could easily replace the
context with an object containing a sequence of form arguments (which would be
None when calling from libxml2).
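
For comparison, a minimal sketch of the resolver mechanism as it is used today,
where libxml2 triggers the call while loading a DTD. The RecordingResolver
class, the entity name and the DTD file name are made up for illustration; the
context argument is passed through untouched because it is opaque to Python
code.

```python
from io import StringIO
from lxml import etree


class RecordingResolver(etree.Resolver):
    """Records requested URLs and serves an in-memory DTD fragment."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def resolve(self, url, public_id, context):
        self.seen.append(url)
        # Hand back a string; the opaque context is passed along as-is.
        return self.resolve_string('<!ENTITY myentity "resolved">', context)


resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

xml = '<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
tree = etree.parse(StringIO(xml), parser)
```

After parsing, resolver.seen holds the URLs the parser asked for. Calling the
same resolvers from Python code, with form arguments in place of the context,
is the part that remains to be designed.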

Stefan



