[lxml-dev] html branch
Ian Bicking
ianb at colorstudy.com
Fri Jun 1 06:46:46 CEST 2007
Stefan Behnel wrote:
>>>>>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>>>>> That rather looks like an HtmlElement method to me: "cleanup(...)",
>>>>> and the
>>>>> clean_html() function would fit right into the top-level of the
>>>>> lxml.html module.
>>>> The long signature of the function made me reluctant to do this. Any
>>>> function with that many parameters feels non-authoritative to me. And I
>>>> would encourage people to actually write their own clean function with
>>>> the parameter defaults that are appropriate for their domain (e.g.,
>>>> clean_untrusted_comment, clean_wysiwyg_submission, etc). I just guessed
>>>> reasonable defaults for those keyword arguments.
>>> Ah, ok, good point. Still, I would like to keep the number of modules
>>> low.
>>> lxml.html should be as close to "one point for solving your HTML
>>> needs" as
>>> possible.
>> OK. *Actually* putting them all in one module would make the module
>> feel too big to me. I could import them all into __init__.py. That
>> might make the import unnecessarily slow, I'm not sure.
>
> Avoiding imports tends to be not worth the effort. It already takes a while to
> import etree, so importing some more Python modules doesn't add much.
>
>
>> For some reason I've never used lazy-loading functions, though the
>> implementation seems obvious enough; just something like:
>>
>> def clean(*args, **kw):
>> from lxml.html import clean
>> return clean(*args, **kw)
>>
>> It breaks documentation tools, I guess (though at least I can refer to
>> the real function in the docstring).
>
> I wouldn't do that. Calling things happens much more often than importing
> them, so adding overhead to the call that is usually done only once feels
> wrong to me.
The overhead of an import (if the module has already been imported)
isn't very significant, and could be cached easily enough. That said,
the clean module isn't particularly large and doesn't import much
itself. But htmldiff is the only module of substantial size. I've
integrated rewritelinks directly into __init__, which after refactoring
the algorithm a bit isn't very big anyway.
I dunno; I'm okay just requiring htmldiff to be imported directly, and
importing clean into __init__.
>> For a number of the methods I'd also like a function version that takes
>> a string and returns a string. I think this makes it easier to convince
>> people to use the functions. Obviously this doesn't make sense for a
>> lot of the methods, but does for clean, htmldiff, make_links_absolute,
>> and maybe rewrite_links.
>
> I like that pattern, too.
So I made a generic wrapper for exposing methods as functions. It
parses the first argument if it's not already a parsed document, then
does something, and returns the result of the method or the serialized
form of the document if the method returns None. This might be a bit
too fancy/automatic.
But anyway, putting that aside, I was thinking that maybe the general
pattern should be like:
def make_links_absolute(doc, base_href, fragment=False):
if isinstance(doc, basestring):
if fragment:
doc = parse_element(doc)
else:
doc = HTML(doc)
return_string = True
else:
doc = copy.deepcopy(doc)
return_string = False
doc.make_links_absolute(doc, base_href)
if return_string:
return tostring(doc)
else:
return doc
This makes the function also a handy way to do functional-style
transformations of elements. It bothers me a bit to change the return
type (which I generally dislike doing), except that it matches the input
type which seems like it might be okay.
Does this seem okay?
Also, I'm wondering if (a) I should try to automatically determine
fragment unless it is explicitly given, and/or (b) if parse_element
doesn't work (raises an exception) I should use parse_element(doc,
create_parent=True) which will wrap the fragment in a <div>.
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
| Write code, do good | http://topp.openplans.org/careers
More information about the lxml-dev
mailing list