[lxml-dev] lhtml
Ian Bicking
ianb at colorstudy.com
Fri May 25 21:44:52 CEST 2007
Stefan Behnel wrote:
> Hi Ian,
>
> Ian Bicking wrote:
>> Stefan Behnel wrote:
>>> Ian Bicking wrote:
>>>> Ian Bicking wrote:
>>>>> I really want to take all our HTML-related routines and put them
>>>>> into a proper package
>>>> And maybe a bit of advice -- we could just do this as a set of
>>>> functions (what we currently have), or potentially explore objectify
>>>> and add the routines as methods. E.g., el.find_by_class('classname')
>>> You're not using objectify as a base, are you? I mean, HTML is mainly
>>> about text, so objectify will not help you much.
>> I'm not using it now, no. But if I used objectify as a base, it would
>> be to add methods like .html_serialize() to elements, or any number of
>> other handy methods.
>
> I don't know what you mean here. Maybe I'm just missing something that's more
> obvious to you, or are you talking about custom element classes in general rather
> than objectify?
>
> http://codespeak.net/lxml/dev/element_classes.html
> http://codespeak.net/lxml/dev/objectify.html
> http://codespeak.net/lxml/dev/FAQ.html#what-is-the-difference-between-lxml-etree-and-lxml-objectify
I probably confused the terms/modules, since I haven't used any of them.
I think you are right, I'm just thinking about a custom element class.
>>>> This feels like a cleaner API, but I'm worried that it will mean
>>>> problems when mixing non-objectify-HTML with other elements, and if
>>>> there are problems with threads or memory overhead, or any other
>>>> issues. I don't really mind functions, which is why I am unsure;
>>>> OTOH, almost every function has a first argument of "el", which makes
>>>> them seem like methods.
>>> What about implementing the HTML namespace in a couple of Element
>>> subclasses
>>> and add the methods where they are appropriate? That sounds like a
>>> nice API to me.
>> The HTML() parser doesn't actually use namespaces.
>
> True, I forgot. Still, you can use something like:
>
> >>> class HtmlElement(etree._ElementBase):
> ...     pass  # your implementation here
>
> >>> # some more subclasses for different HTML tags, e.g. AnchorElement
>
> >>> HTML_CLASSES = {
> ...     "a": AnchorElement,
> ...     # ...
> ... }
>
> >>> class HtmlLookup(etree.CustomElementClassLookup):
> ...     def lookup(self, node_type, document, namespace, name):
> ...         if node_type == "element":
> ...             return HTML_CLASSES.get(name, HtmlElement)
> ...         else:
> ...             return None  # delegate
>
> >>> html_parser = etree.HTMLParser()
> >>> html_parser.setElementClassLookup(HtmlLookup())
>
> >>> def HTML(html):
> ...     return etree.HTML(html, html_parser)
>
> That does almost the same as the Namespace classes would.
Yes, that's the sort of thing I was thinking about (but was fuzzy on the
details because I haven't tried it).
It relies on a different parser from lxml.etree.HTML, and I would guess
that elements created with etree.Element wouldn't necessarily use the
right class. I'm just worried it adds more confusion, because things
act differently depending on how the element was created or how a
document is parsed. Functions are fairly straightforward in comparison
-- they just do stuff. They are also somewhat easier to document and
browse through as a new user.
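To make the concern concrete, here is a minimal sketch of the lookup approach quoted above, written against the spellings current lxml releases use (etree.ElementBase and set_element_class_lookup, rather than the names in the quoted example). The find_by_class method and the href property are hypothetical illustrations, not existing lxml API. The last line shows the mismatch: an element created with etree.Element does not go through the parser's lookup.

```python
from lxml import etree


class HtmlElement(etree.ElementBase):
    """Base class for parsed HTML elements (name taken from the quoted example)."""

    def find_by_class(self, class_name):
        # Hypothetical helper: match whole whitespace-separated class tokens.
        return [el for el in self.iter()
                if isinstance(el.tag, str)  # skip comments and PIs
                and class_name in el.get("class", "").split()]


class AnchorElement(HtmlElement):
    @property
    def href(self):
        # Hypothetical convenience accessor.
        return self.get("href")


HTML_CLASSES = {"a": AnchorElement}


class HtmlLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if node_type == "element":
            return HTML_CLASSES.get(name, HtmlElement)
        return None  # delegate to the default lookup


html_parser = etree.HTMLParser()
html_parser.set_element_class_lookup(HtmlLookup())

root = etree.HTML('<p class="note big">see <a href="/x">x</a></p>', html_parser)
anchor = root.find(".//a")   # an AnchorElement, because it came from html_parser
plain = etree.Element("a")   # created directly, so it is not an HtmlElement
```

Only elements that come out of html_parser get the extra methods, which is exactly the "acts differently depending on how it was created" behaviour.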
For instance, it would be amusing to have an AnchorElement.GET() method.
But what exactly would it do? Which HTTP library would it use? I
don't know; if it was a function then it wouldn't matter, you'd just
implement however many functions were necessary to do what people wanted
to do. And those functions may or may not be implemented in lxml.html
-- someone else could distribute their own implementations using
whatever library they liked.
But not all methods are like GET(). find_by_class() is probably more
obvious -- it gets all elements according to a class name, and multiple
implementations aren't necessary.
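For comparison, a rough sketch of the function style, assuming nothing beyond the standard library's ElementTree API (which lxml.etree elements also follow). The name find_by_class comes from this thread; the body is only an illustration of one possible implementation.

```python
import xml.etree.ElementTree as ET


def find_by_class(el, class_name):
    """Return el and its descendants whose class attribute contains class_name."""
    # Compare whole whitespace-separated tokens, so "big" does not match "bigger".
    return [node for node in el.iter()
            if class_name in node.get("class", "").split()]


doc = ET.fromstring(
    '<div class="entry"><p class="summary big">hi</p><p class="bigger">no</p></div>'
)
matches = find_by_class(doc, "big")
```

Because it only needs iter() and get(), the same function works on any element regardless of which parser or factory produced it.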
>> I'm not entirely clear on how namespaces fit in. Most of the methods
>> would apply to all HTML elements, but HTML 4 elements aren't easy to
>> distinguish.
>
> I would expect only HTML elements in an HTML4 document. That makes it rather
> easy. But if you like, you can add any kind of special casing into the lookup
> method above, such as: if the tag has a namespace that's not XHTML, return
> None (i.e. the default Element class).
>
>
>>> Any chance you could post your code somewhere so that I could take a
>>> look at what you're really contributing here?
>> Sure; I started collecting a few of the routines from various libraries
>> yesterday. There's still stuff in Deliverance and htmldiff that I
>> haven't integrated. I haven't copied over any tests and there may be
>> broken imports in many of the modules, but it should give you a vague
>> idea of scope.
>
> I took a quick look at it and I totally like the doctestcompare module. I'd
> love to use it for lxml's own doctests first of all, so, sure, that's a
> perfect companion to lxml's other modules. You already have write access to
> lxml's SVN repository, so there's not much of a problem with release cycles or
> anything. If you want to add new stuff, that may even be a good reason for a
> new version of lxml. :)
>
> Questions:
>
> doctest module:
>
> - is there any reason why you require a call to "lxmldoctest.install()"? I'd
> rather execute that immediately when you import the module. That's less
> intrusive for doctests (which is the main use case after all).
I dislike having modules do something to the system when you import
them. OTOH, I dislike that I have to monkeypatch doctest to get the
comparison function in, but it's not practical to do anything else. So
maybe I just have to put up with it.
> - I'd like to call that module "lxml.xmldoctest" or something like that, so
> that you can "import xmldoctest" in a doctest file, which is rather readable.
I'd be surprised if this would actually work -- I'd expect that it would
be too late once you were running the doctest. But I haven't tried.
> serialise.py:
>
> - libxml2 actually has some internal support for serialising HTML, so maybe
> it's worth looking at that first, in case we ever decide to wrap it.
Sure; this was just the most expedient thing we figured out. The XSLT
serialization probably uses something else in libxml2, so maybe direct
access to that is possible. There's nothing terribly wrong about it
either, it's just a little roundabout (which isn't so bad if it is
implemented in a reusable function, of course -- but reimplementing it
each time you want to serialize HTML isn't so good).
> parse, serialize and fixuplinks:
>
> - I'll have to take a closer look at that to see if this makes sense in general.
The parse stuff is really just charset detection. I don't think
lxml/libxml2 does this natively (checking the meta tag), but I'm not
actually 100% sure. It should include parsing HTML fragments too, which
is a little hard (HTML() interprets all text as complete documents, and
adds in elements to make the document valid, which often isn't what
you'd want).
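As an illustration of the meta-tag check, here is a rough standard-library sketch. It is not the code from the parse module; detect_charset, the 4096-byte window and the iso-8859-1 fallback are all assumptions made for the example.

```python
import re

# Look for charset=... inside a <meta> tag; covers both
# <meta charset="..."> and the http-equiv Content-Type form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def detect_charset(html_bytes, default="iso-8859-1"):
    """Return the charset named in a <meta> tag near the top of the document."""
    match = _META_CHARSET.search(html_bytes[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head><body>\xc3\xa9</body></html>')
text = page.decode(detect_charset(page))
```

A real implementation would also want to consider the HTTP Content-Type header and byte-order marks before trusting the meta tag.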
> __init__.py:
>
> - some of this can be rewritten using plain XPath, e.g. get_parent_with_class
> (there's now RegExp support in lxml 1.3) or get_text (basically what
> 'string()' does). contains_class_xpath is not really much better than an XPath
> expression with variables, dito for get_elements_by_class and get_rel_links,
> e.g. the latter is better written as:
>
> get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]")
> get_rel_links(el, rel="whatever")
I tried doing class name matching with a regular expression, but never
got it to work. It might have been a bug in my code or in lxml's; I'm not
sure -- whatever it was, I was in a mind to move on ;). General CSS
selector support would be wonderful. But anyway, these are things that
weren't obvious to me, so I think it's still useful to include the
functions even if their implementation is fairly trivial.
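For the record, class matching can also be done without regular expressions, using the common XPath 1.0 idiom of padding the attribute with spaces. This is a sketch in the same style as the quoted get_rel_links example; the variable name $class_name is my own choice.

```python
from lxml import etree

# Pad the normalized class attribute with spaces so that only whole
# tokens match: " vcard big " contains " big ", " bigger " does not.
find_by_class = etree.XPath(
    "descendant-or-self::*[contains(concat(' ', normalize-space(@class), ' '),"
    " concat(' ', $class_name, ' '))]"
)

doc = etree.HTML('<div><p class="vcard big">a</p><p class="bigger">b</p></div>')
matches = find_by_class(doc, class_name="big")
```

That the expression is this long is part of why a named function still seems worth shipping.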
I'm sure there are other implementation details that could be improved.
Most of those particular functions came from some microformat parsing,
and most microformats are just built on a small number of queries.
Deliverance and htmldiff had more stuff for modifying the structure,
which is often quite awkward with the ElementTree model (doing something
that seems easy, like removing a tag, is nontrivial). A lot of those
things aren't that specific to HTML, except that HTML has lots of
situations where tags and text are mixed together.
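To show why removing a tag is awkward, here is a minimal sketch using the standard library's ElementTree. The helper name drop_tag is hypothetical, and it assumes el is not the root. The element's text and tail have to be re-attached to a neighbour, because text in this model hangs off elements rather than being a node of its own.

```python
import xml.etree.ElementTree as ET


def drop_tag(root, el):
    """Remove el from the tree but keep its text, children and tail in place."""
    # Plain ElementTree has no parent pointer, so search for the parent.
    parent = next(p for p in root.iter() if any(c is el for c in p))
    index = list(parent).index(el)
    children = list(el)
    if children:
        # el's tail follows its last child once el is gone.
        last = children[-1]
        last.tail = (last.tail or "") + (el.tail or "")
        leading = el.text or ""
    else:
        leading = (el.text or "") + (el.tail or "")
    # Leading text joins the previous sibling's tail, or the parent's text.
    if index > 0:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + leading
    else:
        parent.text = (parent.text or "") + leading
    parent.remove(el)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)


doc = ET.fromstring("<p>Hello <b>big <i>bad</i> wide</b> world</p>")
drop_tag(doc, doc.find("b"))
result = ET.tostring(doc, encoding="unicode")
```

Four separate cases for what looks like a one-line operation is the kind of thing these helper functions are meant to hide.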
>> (I'm actually looking for a home for htmldiff, so it's
>> possible it could also go in this library; it's at
>> https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py
>> and
>> https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt)
>
> I'll take a look at this when I have a bit more time.
Sure; note it's very much oriented towards human-readable diffs, not
formal diffs. Which fits HTML fairly well (where the tags are more like
annotations of the text), but not most other XML documents.
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
| Write code, do good | http://topp.openplans.org/careers