[lxml-dev] lhtml

Stefan Behnel stefan_ml at behnel.de
Fri May 25 20:58:48 CEST 2007


Hi Ian,

Ian Bicking wrote:
> Stefan Behnel wrote:
>> Ian Bicking wrote:
>>> Ian Bicking wrote:
>>>> I really want to take all our HTML-related routines and put them
>>>> into a proper package
>>> And maybe a bit of advice -- we could just do this as a set of
>>> functions (what we currently have), or potentially explore objectify
>>> and add the routines as methods.  E.g., el.find_by_class('classname')
>>
>> You're not using objectify as a base, are you? I mean, HTML is mainly
>> about text, so objectify will not help you much.
> 
> I'm not using it now, no.  But if I used objectify as a base, it would
> be to add methods like .html_serialize() to elements, or any number of
> other handy methods.

I don't know what you mean here. Maybe I'm just missing something that's
obvious to you, or are you talking about custom element classes in general
rather than objectify?

http://codespeak.net/lxml/dev/element_classes.html
http://codespeak.net/lxml/dev/objectify.html
http://codespeak.net/lxml/dev/FAQ.html#what-is-the-difference-between-lxml-etree-and-lxml-objectify

>>> This feels like a cleaner API, but I'm worried that it will mean
>>> problems when mixing non-objectify-HTML with other elements, and if
>>> there's problems with threads or memory overhead, or any other
>>> issues. I don't really mind functions, which is why I am unsure;
>>> OTOH, almost every function has a first argument of "el", which makes
>>> them seem like methods.
>>
>> What about implementing the HTML namespace in a couple of Element
>> subclasses and adding the methods where they are appropriate? That
>> sounds like a nice API to me.
> 
> The HTML() parser doesn't actually use namespaces.

True, I forgot. Still, you can use something like:

  >>> from lxml import etree

  >>> class HtmlElement(etree.ElementBase):
  ...     pass  # your implementation here

  >>> # some more subclasses for different HTML tags, e.g. AnchorElement
  >>> class AnchorElement(HtmlElement):
  ...     pass

  >>> HTML_CLASSES = {
  ...     "a": AnchorElement,
  ...     # ...
  ... }

  >>> class HtmlLookup(etree.CustomElementClassLookup):
  ...     def lookup(self, node_type, document, namespace, name):
  ...         if node_type == "element":
  ...             return HTML_CLASSES.get(name, HtmlElement)
  ...         else:
  ...             return None  # delegate to the fallback lookup

  >>> html_parser = etree.HTMLParser()
  >>> # older lxml releases spell this setElementClassLookup()
  >>> html_parser.set_element_class_lookup(HtmlLookup())

  >>> def HTML(html):
  ...     return etree.HTML(html, html_parser)

That does almost the same as the Namespace classes would.
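
To make that a bit more concrete, here is a minimal sketch of the same lookup with a couple of methods filled in, written against the current lxml spelling (etree.ElementBase, set_element_class_lookup). The method find_by_class and the property href are made up for illustration and only show where such helpers could live.

```python
from lxml import etree


class HtmlElement(etree.ElementBase):
    """Hypothetical base class carrying HTML helper methods."""

    def find_by_class(self, class_name):
        # Walk this element and its descendants; skip comments and PIs,
        # whose .tag is not a string.
        return [
            el for el in self.iter()
            if isinstance(el.tag, str)
            and class_name in el.get("class", "").split()
        ]


class AnchorElement(HtmlElement):
    """Hypothetical subclass used only for <a> tags."""

    @property
    def href(self):
        return self.get("href")


HTML_CLASSES = {"a": AnchorElement}


class HtmlLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if node_type == "element":
            return HTML_CLASSES.get(name, HtmlElement)
        return None  # delegate comments, PIs, entities to the fallback


html_parser = etree.HTMLParser()
html_parser.set_element_class_lookup(HtmlLookup())


def HTML(html):
    return etree.HTML(html, html_parser)


root = HTML(
    '<p class="intro lead">See '
    '<a class="ext" href="http://example.org/">this</a></p>'
)
links = root.find_by_class("ext")
```

Every element that comes out of this parser is an HtmlElement (or a tag-specific subclass), so the helpers are available as methods without touching elements created by other parsers.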


> I'm not entirely clear on how namespaces fit in.  Most of the methods
> would apply to all HTML elements, but HTML 4 elements aren't easy to
> distinguish.

I would expect only HTML elements in an HTML4 document. That makes it rather
easy. But if you like, you can add any kind of special casing into the lookup
method above, such as: if the tag has a namespace that's not XHTML, return
None (i.e. the default Element class).
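
As a rough sketch of that special casing, assuming an XML parser (the HTML parser produces no namespaces) and a hypothetical HtmlElement base class: elements in a foreign namespace fall through to the default class, everything else gets the HTML class.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"


class HtmlElement(etree.ElementBase):
    """Hypothetical base class for HTML helper methods."""


class NamespaceAwareLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if node_type != "element":
            return None  # comments, PIs, entities: use the fallback
        if namespace is not None and namespace != XHTML_NS:
            return None  # foreign namespace: default Element class
        return HtmlElement


parser = etree.XMLParser()
parser.set_element_class_lookup(NamespaceAwareLookup())

root = etree.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><svg xmlns="http://www.w3.org/2000/svg"/></body></html>',
    parser,
)
body = root[0]
svg = body[0]
```

Here root and body are HtmlElement instances, while the svg element stays a plain lxml element.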


>> Any chance you could post your code somewhere so that I could take a
>> look at what you're really contributing here?
> 
> Sure; I started collecting a few of the routines from various libraries
> yesterday.  There's still stuff in Deliverance and htmldiff that I
> haven't integrated.  I haven't copied over any tests and there may be
> broken imports in many of the modules, but it should give you a vague
> idea of scope.

I took a quick look at it and I totally like the doctestcompare module. I'd
love to use it for lxml's own doctests first of all, so, sure, that's a
perfect companion to lxml's other modules. You already have write access to
lxml's SVN repository, so there's not much of a problem with release cycles or
anything. If you want to add new stuff, that may even be a good reason for a
new version of lxml. :)

Questions:

doctest module:

- is there any reason why you require a call to "lxmldoctest.install()"? I'd
rather execute that immediately when you import the module. That's less
intrusive for doctests (which is the main use case after all).

- I'd like to call that module "lxml.xmldoctest" or something like that, so
that you can "import xmldoctest" in a doctest file, which is rather readable.

serialise.py:

- libxml2 actually has some internal support for serialising HTML, so maybe
it's worth looking at that first, in case we ever decide to wrap it.

parse, serialize and fixuplinks:

- I'll have to take a closer look at that to see if this makes sense in general.

__init__.py:

- some of this can be rewritten using plain XPath, e.g. get_parent_with_class
(there's now RegExp support in lxml 1.3) or get_text (basically what
'string()' does). contains_class_xpath doesn't gain much over an XPath
expression with variables, ditto for get_elements_by_class and get_rel_links,
e.g. the latter is better written as:

  get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]")
  get_rel_links(el, rel="whatever")
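
The other helpers could be sketched along the same lines. The function names are taken from the list above, but the XPath expressions are only my guess at what each helper is meant to do, and the sample markup is invented.

```python
from lxml import etree

# EXSLT regular expressions namespace, needed for re:test()
REGEXP_NS = "http://exslt.org/regular-expressions"

doc = etree.HTML(
    '<div class="post main"><p>Hello <a rel="tag" href="/t">world</a></p>'
    '<a rel="nofollow" href="/x">x</a></div>'
)

get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]")

# string() concatenates all text content below the context node
get_text = etree.XPath("string()")

# whitespace-separated class match, with the class name as a variable
get_elements_by_class = etree.XPath(
    "descendant-or-self::*[contains("
    "concat(' ', normalize-space(@class), ' '), concat(' ', $cls, ' '))]"
)

# nearest ancestor (or self) whose class attribute matches a regexp
get_parent_with_class = etree.XPath(
    "ancestor-or-self::*[re:test(@class, $pattern)][1]",
    namespaces={"re": REGEXP_NS},
)

tag_link = get_rel_links(doc, rel="tag")[0]
text = get_text(doc)
main_elements = get_elements_by_class(doc, cls="main")
post_parent = get_parent_with_class(tag_link, pattern=r"(^|\s)post(\s|$)")
```

The variable is called $cls rather than $class only because class is a Python keyword and could not be passed as a keyword argument.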


> (I'm actually looking for a home for htmldiff, so it's
> possible it could also go in this library; it's at
> https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py
> and
> https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt)

I'll take a look at this when I have a bit more time.

Stefan

