[lxml-dev] html branch
Stefan Behnel
stefan_ml at behnel.de
Thu May 31 23:11:51 CEST 2007
Hi Ian,
Ian Bicking wrote:
> Stefan Behnel wrote:
>> Ian Bicking wrote:
>>> Stefan Behnel wrote:
>>>>> lxml.doctestcompare: XML/HTML doctests
>>>> As people would rarely import this, why not have it start with an
>>>> underscore?
>>> I guess... the usedoctest technique is a pretty egregious hack; I
>>> actually change doctest.OutputChecker.check_output.im_func.func_code
>>> because there's a local bound method that has to be changed. So the
>>> more conventional installation method still seems good, if there are
>>> interaction bugs.
>>
>> Ok. I'll have to take a look at the hack anyway, but I believe you
>> already did enough to search for a better solution...
>>
>> But isn't there a way to copy over a bit of the doctest code to make this
>> easier? Like the whole method in OutputChecker?
>
> We can't copy over the code; at the point usedoctest is imported,
> doctest code is already running. We aren't adding our own runner, we're
> modifying the runner that is already in progress.
>
> Instead of swapping in the doctestcompare check_output, we could swap in
> code that does something simpler, like calls a method indirectly (and we
> could swap that method). Either way, it involves messing with
> func_code, because of that blasted bound method in __run. But a
> permanent change in code would at least make it less important to
> disable the patch.
I'll have to take a closer look into this, but won't have the time during the
next week.
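For reference, here is a rough sketch of the indirection Ian describes, written with Python 3 names (`__code__` instead of `func_code`). The `OutputChecker` class below is a simplified stand-in and not doctest's real one, and `_check_hook`, `strict_check` and `lenient_check` are hypothetical names made up for illustration. The point is that a bound method captured before the patch still sees the new behaviour, because the function object stays the same and only its code is replaced.

```python
class OutputChecker:
    """Simplified stand-in for doctest.OutputChecker (not the real class)."""

    def check_output(self, want, got):
        return want == got


def strict_check(self, want, got):
    # Hypothetical default behaviour: exact comparison.
    return want == got


def lenient_check(self, want, got):
    # Hypothetical replacement: ignore differences in whitespace.
    return want.split() == got.split()


def _dispatch(self, want, got):
    # Uses only a builtin and attribute lookups, so the code object does not
    # depend on the globals of the function it gets transplanted into.
    return type(self)._check_hook(self, want, got)


checker = OutputChecker()
# Captured before the patch, like the local bound method in the runner.
bound = checker.check_output

# One-time, permanent change: the function now calls the hook indirectly.
OutputChecker._check_hook = strict_check
OutputChecker.check_output.__code__ = _dispatch.__code__

# From here on, enabling or disabling is a plain attribute assignment.
OutputChecker._check_hook = lenient_check
print(bound("a  b\n", "a b\n"))
```

With this layout the code swap happens once, and switching behaviour afterwards no longer touches any code objects.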
>>>>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>>>> That rather looks like an HtmlElement method to me: "cleanup(...)",
>>>> and the clean_html() function would fit right into the top-level of
>>>> the lxml.html module.
>>> The long signature of the function made me reluctant to do this. Any
>>> function with that many parameters feels non-authoritative to me. And I
>>> would encourage people to actually write their own clean function with
>>> the parameter defaults that are appropriate for their domain (e.g.,
>>> clean_untrusted_comment, clean_wysiwyg_submission, etc). I just guessed
>>> reasonable defaults for those keyword arguments.
>>
>> Ah, ok, good point. Still, I would like to keep the number of modules
>> low. lxml.html should be as close to "one point for solving your HTML
>> needs" as possible.
>
> OK. *Actually* putting them all in one module would make the module
> feel too big to me. I could import them all into __init__.py. That
> might make the import unnecessarily slow, I'm not sure.
Avoiding imports tends not to be worth the effort. It already takes a while to
import etree, so importing some more Python modules doesn't add much.
> For some reason I've never used lazy-loading functions, though the
> implementation seems obvious enough; just something like:
>
>     def clean(*args, **kw):
>         from lxml.html import clean
>         return clean(*args, **kw)
>
> It breaks documentation tools, I guess (though at least I can refer to
> the real function in the docstring).
I wouldn't do that. Calling things happens much more often than importing
them, so adding overhead to every call just to defer an import that usually
happens only once feels wrong to me.
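To illustrate the trade-off, here is a small sketch that uses the standard library's json module as a stand-in (it is not lxml code, and `lazy_dumps` and `eager_dumps` are names made up for the example). The lazy wrapper re-executes its import statement on every call. After the first call that is only a module cache lookup, but it is still extra work per call, whereas the eager import is paid once.

```python
import timeit

# Eager variant: the import cost is paid once, at module import time.
from json import dumps as eager_dumps


def lazy_dumps(*args, **kw):
    # Lazy variant: the import statement runs on every call. After the
    # first call it only hits the module cache, but that is not free.
    from json import dumps
    return dumps(*args, **kw)


if __name__ == "__main__":
    data = {"a": 1}
    print("lazy :", timeit.timeit(lambda: lazy_dumps(data), number=10_000))
    print("eager:", timeit.timeit(lambda: eager_dumps(data), number=10_000))
```

The absolute numbers depend on the machine, so none are given here. What matters is where the cost lands: on each call, or once at import.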
>>> I've also added two new methods: get_element_by_id() (a long name, but
>>> at least easy to remember) and text_only(), which gives the text of the
>>> tree with all the tags removed. I don't really like the text_only name,
>>> though, but the function is useful.
>>
>> What about gettext() or gettextcontent() ? Having a very visible .text
>> property makes it clear that these two do more. Even collecttext()
>> would work well. (BTW, I keep favouring xpath's "//text()" or even
>> "string()" for the implementation: fast and simple).
>
> OK, switched to get_text_content(). Is there a style guideline for
> naming?
lxml has not been very consistent here, but I'm planning to get closer to PEP
8 in the long term.
http://www.python.org/dev/peps/pep-0008/
ElementTree has traditionally used CamelCase for module names and
"smashedwords" for methods, which is not quite compliant with what PEP 8 says
today (long after ET was written). But there are not enough examples of
multi-word methods to make that a naming principle. I think underscore names
are just right.
> I'm using underscores, and avoiding smashed words, which would
> be get_text_content(). Though the "get_" seems unnecessary;
> text_content() seems better to me.
Make it so.
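As a rough idea of what text_content() would return, here is a sketch on top of the standard library's ElementTree, used only as a stand-in so that the example runs without lxml. In lxml itself the XPath "string()" call mentioned above would be the natural implementation. The free function below is an assumption for illustration, not the method on the branch.

```python
import xml.etree.ElementTree as ET


def text_content(element):
    """Return all text below element, with the tags removed."""
    # itertext() walks the subtree in document order and yields every
    # .text and .tail string it finds inside the element.
    return "".join(element.itertext())


root = ET.fromstring("<p>Hello <b>big</b> world<br/>!</p>")
print(text_content(root))  # prints: Hello big world!
```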
> I've been trying to use find_* for methods that return lists of nodes,
> and get_* for things that return a single node.
Ok, that sounds consistent, though it won't necessarily be obvious to users
right away, as you need to know a couple of examples before you start seeing
the pattern - *if* you care to see one. Anyway, having a consistent naming
pattern is always a good idea.
> For a number of the methods I'd also like a function version that takes
> a string and returns a string. I think this makes it easier to convince
> people to use the functions. Obviously this doesn't make sense for a
> lot of the methods, but does for clean, htmldiff, make_links_absolute,
> and maybe rewrite_links.
I like that pattern, too.
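A minimal sketch of that pattern, again using the standard library's ElementTree and urllib as stand-ins so that it runs on its own: one function works on a parsed tree, and a thin wrapper takes a string and returns a string. The name `make_links_absolute_tree` and both signatures are assumptions for illustration, not the API on the branch, and the input has to be well-formed XML here because ElementTree has no HTML parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def make_links_absolute_tree(root, base_url):
    """Tree version: rewrite the href of every <a> element in place."""
    for el in root.iter("a"):
        href = el.get("href")
        if href is not None:
            el.set("href", urljoin(base_url, href))


def make_links_absolute(markup, base_url):
    """String version: parse, delegate to the tree version, serialise."""
    root = ET.fromstring(markup)
    make_links_absolute_tree(root, base_url)
    return ET.tostring(root, encoding="unicode")


print(make_links_absolute('<p><a href="/doc">x</a></p>',
                          "http://example.org/base/"))
```

The string version stays a few lines long because all the real work lives in the tree version, so both can be offered without duplicating logic.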
Stefan