[lxml-dev] lxml & parsing: return of a classes
Stefan Behnel
stefan_ml at behnel.de
Thu Jul 19 17:57:13 CEST 2007
Hi Ian,
Ian Bicking wrote:
> So I was thinking a little about how we could allow easy customization
> of the URL getter, since we can't attach it to the tree or any element.
> And then generally how any customization could be done, for instance
> if you want a new method on all elements.
>
> This isn't that easy currently. You'd have to subclass a bunch of
> classes and rewrite a bunch of functions. But I think if we move all
> parsing to a single class it would help a great deal.
>
> The idea is something like:
>
> class Parser(object):
>     _etree_parser_class = etree.HTMLParser
>     def __init__(self):
>         self._etree_parser = self._etree_parser_class()
>         self._etree_parser.setElementClassLookup(self)
>     def __call__(self, filename, **kw):
>         return etree.parse(filename, self._etree_parser, **kw)
>     def fromstring(...):
>         ...
That's a good idea, but as you suggest at the end, extending the HTMLParser
class directly is the way to go. Documents in lxml.etree keep a reference to
their parser to support inheritance of resolvers. The parser is even readable
from Python as the "parser" property of an ElementTree. That would nicely solve most of
your problems.
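
A minimal sketch of that idea, assuming a subclass of etree.HTMLParser that carries a URL getter. The class name UrlAwareHTMLParser and the method open_url are hypothetical and not part of lxml. Only the "parser" property of the ElementTree is existing API.

```python
from io import BytesIO
from lxml import etree

class UrlAwareHTMLParser(etree.HTMLParser):
    """HTMLParser subclass with a hypothetical URL getter attached."""

    def open_url(self, url):
        # Hypothetical hook: a real implementation would fetch the URL.
        return "fetched:" + url

parser = UrlAwareHTMLParser()
html = b"<html><body><a href='/next'>next</a></body></html>"
tree = etree.parse(BytesIO(html), parser)

# The document remembers the parser that created it.
same_parser = tree.parser is parser

# Any element can get back to the parser through its tree.
link = tree.find(".//a")
result = link.getroottree().parser.open_url(link.get("href"))
```

Since every element can reach its tree, the customization lives in one place and nothing needs to be attached to individual elements.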
> If you want to adjust something, you don't have to
> reimplement all the forms of parsers, since they all would just use
> self, and are mostly defined in terms of each other. We could support
> subclassing with something like this:
>
> class Parser(object):
>     _element_classes = {}
>     _element_mixins = {}
>     def __init__(self):
>         self._element_classes = self._element_classes.copy()
>         mixers = {}
>         for name, value in self._element_mixins.items():
>             if name == '*':
>                 for n in self._element_classes.keys():
>                     mixers.setdefault(n, []).append(value)
>             else:
>                 mixers.setdefault(name, []).append(value)
>         for name, mixins in mixers.items():
>             cur = self._element_classes.get(name, HtmlElement)
>             bases = mixins + [cur]
>             new_class = type(cur.__name__, tuple(bases), {})
>             self._element_classes[name] = new_class
>
> class MyMixin(object):
>     extra methods
> class FormMixin(object):
>     other methods for the form element
>
> class ParserMixedIn(Parser):
>     _element_mixins = {'*': MyMixin, 'form': FormMixin}
>
> And then it would be really easy to create local extensions for all HTML
> elements, or particular elements.
I would have to see how this looks if you inherit from HTMLParser and how this
matches with the existing class lookup mechanisms.
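
One possible shape, as a rough sketch only: the mixin table moves onto an HTMLParser subclass and the class building happens inside a CustomElementClassLookup. The names MixinLookup, MixedInHTMLParser, tag_upper and field_names are made up for illustration. The sketch uses the current method name set_element_class_lookup.

```python
from lxml import etree

class MyMixin(object):
    """Hypothetical mixin applied to every element."""
    def tag_upper(self):
        return self.tag.upper()

class FormMixin(object):
    """Hypothetical mixin applied to <form> elements only."""
    def field_names(self):
        return [el.get("name") for el in self.iter("input")]

class MixinLookup(etree.CustomElementClassLookup):
    def __init__(self, mixins):
        super().__init__()
        self._mixins = mixins
        self._classes = {}  # cache: one generated class per tag name

    def lookup(self, node_type, document, namespace, name):
        if node_type != "element":
            return None  # let lxml fall back to its default classes
        if name not in self._classes:
            # Tag-specific mixin first, then the '*' mixin, then the base.
            bases = [self._mixins[k] for k in (name, "*") if k in self._mixins]
            bases.append(etree.ElementBase)
            self._classes[name] = type("Mixed_%s" % name, tuple(bases), {})
        return self._classes[name]

class MixedInHTMLParser(etree.HTMLParser):
    _element_mixins = {"*": MyMixin, "form": FormMixin}

    def __init__(self, **kw):
        super().__init__(**kw)
        self.set_element_class_lookup(MixinLookup(self._element_mixins))

parser = MixedInHTMLParser()
root = etree.fromstring(
    '<html><body><form><input name="q"/></form></body></html>', parser)
form = root.find(".//form")
```

Here root gets only MyMixin, while form gets both FormMixin and MyMixin, so local extensions stay a matter of overriding _element_mixins in a subclass.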
> I'm not sure exactly how to attach the URL getting method to the Parser
> object in this model, because I'm not sure how to give elements a
> reference back to it.
I think we should try to integrate with the normal Resolver mechanism here
(doc/resolvers.txt). Not sure how this works exactly if we want to use it from
Python code (currently it's only called from libxml2 internally), but I would
like to avoid adding yet another way to resolve URLs. Currently, resolvers
receive an opaque "context" object as their last argument and return an opaque
object wrapping a string, a file-like object, etc. We could easily replace the
context with an object containing a sequence of form arguments (which would be
None when calling from libxml2).
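
For reference, this is roughly how the existing Resolver mechanism looks from Python today, following the pattern in doc/resolvers.txt. The class name RecordingResolver and its "seen" list are only there for illustration. The context is passed through untouched, which is the part that could carry form arguments under the proposal above.

```python
from io import BytesIO
from lxml import etree

class RecordingResolver(etree.Resolver):
    """Resolver that answers every request with an in-memory DTD."""

    def __init__(self):
        super().__init__()
        self.seen = []  # URLs that libxml2 asked us to resolve

    def resolve(self, system_url, public_id, context):
        self.seen.append(system_url)
        dtd = '<!ENTITY myentity "[resolved text: %s]">' % system_url
        # The opaque context goes straight back into the resolve_* helper.
        return self.resolve_string(dtd, context)

resolver = RecordingResolver()
parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(resolver)

xml = b'<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
tree = etree.parse(BytesIO(xml), parser)
```

The resolver is registered on the parser, so a document that keeps a reference to its parser also keeps its resolvers.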
Stefan