[lxml-dev] lxml & parsing: return of a classes
Ian Bicking
ianb at colorstudy.com
Tue Jul 17 21:30:59 CEST 2007
So I was thinking a little about how we could allow easy customization
of the URL getter, since we can't attach it to the tree or any element.
And then generally how any customization could be done, for instance
if you want a new method on all elements.
This isn't that easy currently. You'd have to subclass a bunch of
classes and rewrite a bunch of functions. But I think if we move all
parsing to a single class it would help a great deal.
The idea is something like:
    class Parser(object):
        _etree_parser_class = etree.HTMLParser

        def __init__(self):
            self._etree_parser = self._etree_parser_class()
            self._etree_parser.setElementClassLookup(self)

        def __call__(self, filename, **kw):
            return etree.parse(filename, self._etree_parser, **kw)

        def fromstring(...):
            ...
And so forth. Then either expose this via:
    parse = Parser()
Or perhaps:
    _parser = Parser()
    parse = _parser
    fromstring = _parser.fromstring
And so forth. If you want to adjust something, you don't have to
reimplement all the forms of parsers, since they all would just use
self, and are mostly defined in terms of each other. We could support
subclassing with something like this:
    class Parser(object):
        _element_classes = {}
        _element_mixins = {}

        def __init__(self):
            self._element_classes = self._element_classes.copy()
            # Collect the mixins that apply to each element name
            mixers = {}
            for name, value in self._element_mixins.items():
                if name == '*':
                    for n in self._element_classes.keys():
                        mixers.setdefault(n, []).append(value)
                else:
                    mixers.setdefault(name, []).append(value)
            # Build a new class per element, with the mixins listed first
            for name, mixins in mixers.items():
                cur = self._element_classes.get(name, HtmlElement)
                bases = mixins + [cur]
                new_class = type(cur.__name__, tuple(bases), {})
                self._element_classes[name] = new_class

    class MyMixin(object):
        # extra methods
        pass

    class FormMixin(object):
        # other methods for the form element
        pass

    class ParserMixedIn(Parser):
        _element_mixins = {'*': MyMixin, 'form': FormMixin}
And then it would be really easy to create local extensions for all HTML
elements, or particular elements.
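As a rough check that the mixin scheme behaves as intended, the following self-contained sketch runs without lxml. HtmlElement and FormElement here are plain stand-in classes rather than the real lxml.html ones, and the describe and field_names methods are hypothetical examples of "extra methods".

```python
class HtmlElement(object):
    # Stand-in for lxml.html.HtmlElement, used only for this sketch.
    pass


class FormElement(HtmlElement):
    pass


class Parser(object):
    _element_classes = {'form': FormElement, 'a': HtmlElement}
    _element_mixins = {}

    def __init__(self):
        # Copy so the class-level mapping is never modified.
        self._element_classes = self._element_classes.copy()
        mixers = {}
        for name, value in self._element_mixins.items():
            # '*' applies the mixin to every known element name.
            targets = list(self._element_classes) if name == '*' else [name]
            for n in targets:
                mixers.setdefault(n, []).append(value)
        for name, mixins in mixers.items():
            cur = self._element_classes.get(name, HtmlElement)
            # Mixins come first so their methods win in the MRO.
            self._element_classes[name] = type(
                cur.__name__, tuple(mixins + [cur]), {})


class MyMixin(object):
    def describe(self):
        return 'element of class %s' % type(self).__name__


class FormMixin(object):
    def field_names(self):
        return []


class ParserMixedIn(Parser):
    _element_mixins = {'*': MyMixin, 'form': FormMixin}


classes = ParserMixedIn()._element_classes
form = classes['form']()
link = classes['a']()
```

Here form gets both mixins while link only gets MyMixin, and the mapping on the base Parser class is left untouched.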
I'm not sure exactly how to attach the URL getting method to the Parser
object in this model, because I'm not sure how to give elements a
reference back to it. We could do it with class variables, but then the
parser would *have* to subclass every element every time it was
instantiated, so it could make new classes with a reference back to
itself. But maybe there's a better way. Do the elements already have a
reference back to that etree.HTMLParser() instance, and could we attach
this to that instance? Or perhaps extend HTMLParser directly instead of
having this other parser class?
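For reference, the class-variable approach could look roughly like the sketch below. It is not tied to lxml: HtmlElement is a stand-in class, and the _parser attribute, the get_url and fetch methods, and CachedParser are all hypothetical names chosen for the example. It mainly shows the cost mentioned above, namely that each parser instance ends up with its own set of element subclasses.

```python
class HtmlElement(object):
    # Stand-in for lxml.html.HtmlElement.
    _parser = None  # set on the per-parser subclasses below

    def __init__(self, href=None):
        self.href = href

    def fetch(self):
        # Hypothetical method: delegate URL fetching to the owning parser.
        return self._parser.get_url(self.href)


class Parser(object):
    _element_classes = {'a': HtmlElement}

    def __init__(self):
        # Every instance builds its own subclasses that point back to it.
        self._element_classes = dict(
            (name, type(cls.__name__, (cls,), {'_parser': self}))
            for name, cls in self._element_classes.items())

    def get_url(self, url):
        # Hypothetical URL getter; a real one would open the URL.
        raise NotImplementedError


class CachedParser(Parser):
    # Hypothetical customization: serve URLs from an in-memory mapping.
    def __init__(self, pages):
        Parser.__init__(self)
        self.pages = pages

    def get_url(self, url):
        return self.pages[url]


p1 = CachedParser({'/x': 'first'})
p2 = CachedParser({'/x': 'second'})
a1 = p1._element_classes['a'](href='/x')
a2 = p2._element_classes['a'](href='/x')
```

Each element reaches its own parser through the class attribute, so a1.fetch() and a2.fetch() use different URL getters even though both elements come from the same base class.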
--
Ian Bicking : ianb at colorstudy.com : http://blog.ianbicking.org
: Write code, do good : http://topp.openplans.org/careers