[lxml-dev] html branch

Ian Bicking ianb at colorstudy.com
Fri Jun 1 06:46:46 CEST 2007


Stefan Behnel wrote:
>>>>>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>>>>> That rather looks like an HtmlElement method to me: "cleanup(...)",
>>>>> and the
>>>>> clean_html() function would fit right into the top-level of the
>>>>> lxml.html module.
>>>> The long signature of the function made me reluctant to do this.  Any
>>>> function with that many parameters feels non-authoritative to me.  And I
>>>> would encourage people to actually write their own clean function with
>>>> the parameter defaults that are appropriate for their domain (e.g.,
>>>> clean_untrusted_comment, clean_wysiwyg_submission, etc).  I just guessed
>>>> reasonable defaults for those keyword arguments.
>>> Ah, ok, good point. Still, I would like to keep the number of modules
>>> low.
>>> lxml.html should be as close to "one point for solving your HTML
>>> needs" as
>>> possible.
>> OK.  *Actually* putting them all in one module would make the module
>> feel too big to me.  I could import them all into __init__.py.  That
>> might make the import unnecessarily slow, I'm not sure.
> 
> Avoiding imports tends to be not worth the effort. It already takes a while to
> import etree, so importing some more Python modules doesn't add much.
> 
> 
>> For some reason I've never used lazy-loading functions, though the
>> implementation seems obvious enough; just something like:
>>
>> def clean(*args, **kw):
>>     from lxml.html import clean
>>     return clean(*args, **kw)
>>
>> It breaks documentation tools, I guess (though at least I can refer to
>> the real function in the docstring).
> 
> I wouldn't do that. Calling things happens much more often than importing
> them, so adding overhead to the call that is usually done only once feels
> wrong to me.

The overhead of an import (if the module has already been imported) 
isn't very significant, and could be cached easily enough.  That said, 
the clean module isn't particularly large and doesn't import much 
itself.  But htmldiff is the only module of substantial size.  I've 
integrated rewritelinks directly into __init__, which after refactoring 
the algorithm a bit isn't very big anyway.

I dunno; I'm okay just requiring htmldiff to be imported directly, and 
importing clean into __init__.

>> For a number of the methods I'd also like a function version that takes
>> a string and returns a string.  I think this makes it easier to convince
>> people to use the functions.  Obviously this doesn't make sense for a
>> lot of the methods, but does for clean, htmldiff, make_links_absolute,
>> and maybe rewrite_links.
> 
> I like that pattern, too.

So I made a generic wrapper for exposing methods as functions.  It 
parses the first argument if it's not already a parsed document, then 
does something, and returns the result of the method or the serialized 
form of the document if the method returns None.  This might be a bit 
too fancy/automatic.

But anyway, putting that aside, I was thinking that maybe the general 
pattern should be like:

def make_links_absolute(doc, base_href, fragment=False):
     if isinstance(doc, basestring):
         if fragment:
             doc = parse_element(doc)
         else:
             doc = HTML(doc)
         return_string = True
     else:
         doc = copy.deepcopy(doc)
         return_string = False
     doc.make_links_absolute(doc, base_href)
     if return_string:
         return tostring(doc)
     else:
         return doc

This makes the function also a handy way to do functional-style 
transformations of elements.  It bothers me a bit to change the return 
type (which I generally dislike doing), except that it matches the input 
type which seems like it might be okay.

Does this seem okay?

Also, I'm wondering if (a) I should try to automatically determine 
fragment unless it is explicitly given, and/or (b) if parse_element 
doesn't work (raises an exception) I should use parse_element(doc, 
create_parent=True) which will wrap the fragment in a <div>.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers


More information about the lxml-dev mailing list