[lxml-dev] html branch

Ian Bicking ianb at colorstudy.com
Thu May 31 20:40:03 CEST 2007


Stefan Behnel wrote:
> Ian Bicking wrote:
>> Stefan Behnel wrote:
>>>> lxml.doctestcompare: XML/HTML doctests
>>> As people would rarely import this, why not have it start with an
>>> underscore?
>> I guess... the usedoctest technique is a pretty egregious hack; I
>> actually change doctest.OutputChecker.check_output.im_func.func_code
>> because there's a local bound method that has to be changed.  So the
>> more conventional installation method still seems good, if there are
>> interaction bugs.
> 
> Ok. I'll have to take a look at the hack anyway, but I believe you already did
> enough to search for a better solution...
> 
> But isn't there a way to copy over a bit of the doctest code to make this
> easier? Like the whole method in OutputChecker?

We can't copy over the code; at the point usedoctest is imported, 
doctest code is already running.  We aren't adding our own runner, we're 
modifying the runner that is already in progress.

Instead of swapping in the doctestcompare check_output, we could swap in 
code that does something simpler, like calling a method indirectly (and we 
could then swap that method).  Either way, it involves messing with 
func_code, because of that blasted bound method in __run.  But a 
permanent change in code would at least make it less important to 
disable the patch.
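
Roughly, the indirection could look like the sketch below.  It is only an 
illustration: Checker, compare and the two comparison functions are made-up 
stand-ins, not doctest or doctestcompare code, and it uses the Python 3 
spelling (__code__ where Python 2 has func_code).  The point is that a bound 
method captured before the patch still picks up the new behaviour, and that 
later swaps need no further code patching.

```python
class Checker:
    """Hypothetical stand-in for doctest.OutputChecker."""

    def check_output(self, want, got):
        return want == got


def _indirect_check_output(self, want, got):
    # Permanent replacement body: delegate to an attribute that is
    # looked up at call time, so it can be swapped freely later.
    return self.compare(want, got)


def strict_compare(self, want, got):
    return want == got


def whitespace_insensitive_compare(self, want, got):
    return want.split() == got.split()


checker = Checker()
# Captured before patching, like the local bound method in the runner.
bound = checker.check_output

# One-time patch: replace the code object of the existing function.
# Both functions take the same arguments and have no closure variables,
# which is what makes the code objects interchangeable.
Checker.compare = strict_compare
Checker.check_output.__code__ = _indirect_check_output.__code__

before = bound("a  b", "a b")   # still the strict comparison

# Later swaps only touch the ordinary attribute.
Checker.compare = whitespace_insensitive_compare
after = bound("a  b", "a b")    # same bound method, new behaviour
```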

>> I agree the remove names are ambiguous -- both how they relate to each
>> other, and that they seem similar to remove().
> 
> Ok, so, what other words do we have for that? discard? extract? drop?

I like drop, I'll switch to that.

>>>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>>> That rather looks like an HtmlElement method to me: "cleanup(...)",
>>> and the
>>> clean_html() function would fit right into the top-level of the
>>> lxml.html module.
>> The long signature of the function made me reluctant to do this.  Any
>> function with that many parameters feels non-authoritative to me.  And I
>> would encourage people to actually write their own clean function with
>> the parameter defaults that are appropriate for their domain (e.g.,
>> clean_untrusted_comment, clean_wysiwyg_submission, etc).  I just guessed
>> reasonable defaults for those keyword arguments.
> 
> Ah, ok, good point. Still, I would like to keep the number of modules low.
> lxml.html should be as close to "one point for solving your HTML needs" as
> possible.

OK.  *Actually* putting them all in one module would make the module 
feel too big to me.  I could import them all into __init__.py.  That 
might make the import unnecessarily slow, I'm not sure.

For some reason I've never used lazy-loading functions, though the 
implementation seems obvious enough; just something like:

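def clean(*args, **kw):
    # Import from the submodule under another name; "from lxml.html import
    # clean" inside __init__.py could hand back this wrapper itself.
    from lxml.html.clean import clean as _clean
    return _clean(*args, **kw)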

It breaks documentation tools, I guess (though at least I can refer to 
the real function in the docstring).
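
A runnable version of the same pattern, as a sketch only: it wraps the 
standard library's html.escape instead of anything in lxml, and the wrapper 
name escape_markup is invented for the example.  Giving the wrapper a name 
that differs from the module avoids any clash, and the docstring points at 
the real function for the benefit of documentation tools.

```python
def escape_markup(*args, **kw):
    """Lazy wrapper; see html.escape for the real documentation."""
    # The import statement runs on each call, but the module itself is
    # only loaded the first time; afterwards it comes from sys.modules.
    from html import escape as _escape
    return _escape(*args, **kw)


escaped = escape_markup("<b>bold</b>")
```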


>>>> lxml.[html.]rewritelinks: change the links in a document
>>> Maybe too special and too long for integration into the lxml.html and
>>> HtmlElement, not sure. Some of this might fit, though.
>> This I feel a little more comfortable about than the cleanup. Especially
>> making all links absolute is really convenient when you are doing parsing.
>>
>> I'd like to do some kind of query (returning all links in the document),
>> but I'm not sure what that would look like.  Generally *just* the link
>> is kind of boring.  Usually the link plus the element that has the link
>> is more interesting.  But some kinds of links don't have elements; CSS
>> particularly.  OTOH, a method that didn't cover that particular case
>> (even though the rewriting did) would still be useful.  Maybe it would
>> return [(element_with_link, attribute_where_link_is), ...].  Or it could
>> be (element_with_link, attribute_where_link_is, link), and for CSS
>> that'd be (<style element>, None, link).
>>
>> So potentially I see the methods:
>>
>> make_links_absolute(base_href)
>> resolve_base_href() # kind of icky, but still useful; for <base href>
>> iter_links() # as described
>> rewrite_links(link_repl_func)
>>
>> Does that make for too many methods?  Doesn't seem too bad, especially
>> since links are important.
> 
> Agreed, and I think the above are good ones.
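
To make the proposed shapes concrete, here is a small sketch of iter_links(), 
rewrite_links() and make_links_absolute().  It is not the branch's 
implementation: it uses the standard library's ElementTree on a well-formed 
snippet instead of lxml, only looks at href and src attributes, and finds CSS 
links with a deliberately naive url(...) regular expression.  All of those 
choices are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

LINK_ATTRS = ("href", "src")  # assumption: just the two common attributes
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


def iter_links(root):
    """Yield (element, attribute, link); attribute is None for CSS links."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                yield el, attr, el.get(attr)
        if el.tag == "style" and el.text:
            for match in CSS_URL.finditer(el.text):
                yield el, None, match.group(1).strip()


def rewrite_links(root, link_repl_func):
    """Replace every link in place with link_repl_func(link)."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                el.set(attr, link_repl_func(el.get(attr)))
        if el.tag == "style" and el.text:
            # Quotes around the URL are dropped; fine for a sketch.
            el.text = CSS_URL.sub(
                lambda m: "url(%s)" % link_repl_func(m.group(1).strip()),
                el.text,
            )


def make_links_absolute(root, base_href):
    rewrite_links(root, lambda link: urljoin(base_href, link))


doc = ET.fromstring(
    "<html><head><style>body { background: url(bg.png); }</style></head>"
    "<body><a href='/about'>About</a><img src='logo.png'/></body></html>"
)
links = [(el.tag, attr, link) for el, attr, link in iter_links(doc)]
make_links_absolute(doc, "http://example.com/dir/")
absolute = [link for _, _, link in iter_links(doc)]
```

Here make_links_absolute() is just rewrite_links() with urljoin as the 
replacement function, which suggests the four methods share most of their code.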
> 
> 
>> I've also added two new methods: get_element_by_id() (a long name, but
>> at least easy to remember) and text_only(), which gives the text of the
>> tree with all the tags removed.  I don't really like the text_only name,
>> though, but the function is useful.
> 
> What about gettext() or gettextcontent() ? Having a very visible .text
> property makes it clear that these two do more. Even collecttext() would work
> well. (BTW, I keep favouring xpath's "//text()" or even "string()" for the
> implementation: fast and simple).

OK, switched to get_text_content().  Is there a style guideline for 
naming?  I'm using underscores and avoiding smashed-together words, hence 
get_text_content() rather than gettextcontent().  Though the "get_" seems 
unnecessary; text_content() seems better to me.
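
For comparison with the .text property, a minimal sketch of what such a 
method returns.  It uses the standard library's ElementTree and itertext() 
as a stand-in, not lxml or the XPath string() approach suggested above, and 
the free function form is only for illustration.

```python
import xml.etree.ElementTree as ET


def text_content(element):
    """Return all text inside element, with the tags removed."""
    # itertext() walks text and tail strings in document order.
    return "".join(element.itertext())


para = ET.fromstring("<p>Hello <b>brave <i>new</i></b> world</p>")
only_leading_text = para.text          # stops at the first child tag
all_text = text_content(para)
```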

I've been trying to use find_* for methods that return lists of nodes, 
and get_* for things that return a single node.

For a number of the methods I'd also like a function version that takes 
a string and returns a string.  I think this makes it easier to convince 
people to use the functions.  Obviously this doesn't make sense for a 
lot of the methods, but does for clean, htmldiff, make_links_absolute, 
and maybe rewrite_links.
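
One way the string-in, string-out versions could be derived from the tree 
versions, sketched with the standard library only.  The helper name 
string_version is invented here, the parser is ElementTree on well-formed 
input instead of lxml's HTML parser, and the make_links_absolute below 
handles only href attributes to keep the example short.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def string_version(tree_func):
    """Wrap a function that modifies a tree into one that maps str -> str."""
    def wrapper(markup, *args, **kw):
        root = ET.fromstring(markup)
        tree_func(root, *args, **kw)
        return ET.tostring(root, encoding="unicode")
    wrapper.__doc__ = tree_func.__doc__
    return wrapper


def make_links_absolute(root, base_href):
    """Resolve every href in the tree against base_href (in place)."""
    for el in root.iter():
        if "href" in el.attrib:
            el.set("href", urljoin(base_href, el.get("href")))


make_links_absolute_str = string_version(make_links_absolute)

result = make_links_absolute_str(
    '<p><a href="page.html">next</a></p>', "http://example.com/docs/"
)
```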


-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

