[lxml-dev] html branch
Ian Bicking
ianb at colorstudy.com
Wed May 30 00:10:13 CEST 2007
Stefan Behnel wrote:
>> lxml.doctestcompare: XML/HTML doctests
>
> As people would rarely import this, why not have it start with an underscore?
I guess... the usedoctest technique is a pretty egregious hack; I
actually change doctest.OutputChecker.check_output.im_func.func_code
because there's a local bound method that has to be changed. So the
more conventional installation method still seems worth keeping, in case
the hack runs into interaction bugs.
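For illustration, here is a minimal sketch of what the conventional route can look like: a doctest.OutputChecker subclass handed to the runner explicitly, with no patching of function code. It uses only the Python 3 standard library (xml.etree.ElementTree instead of lxml), and the names XMLOutputChecker and run_with_xml_checker are made up for this example; it is not how lxml.doctestcompare is implemented.

```python
import doctest
import xml.etree.ElementTree as ET


def _normalize(el):
    """Reduce an element to a comparable structure, ignoring attribute
    order and leading/trailing whitespace in text and tail."""
    return (
        el.tag,
        sorted(el.attrib.items()),
        (el.text or '').strip(),
        (el.tail or '').strip(),
        [_normalize(child) for child in el],
    )


class XMLOutputChecker(doctest.OutputChecker):
    """Hypothetical checker: compare expected and actual output as XML
    when both parse, otherwise fall back to the plain string comparison."""

    def check_output(self, want, got, optionflags):
        try:
            want_tree = ET.fromstring(want.strip())
            got_tree = ET.fromstring(got.strip())
        except ET.ParseError:
            return super().check_output(want, got, optionflags)
        return _normalize(want_tree) == _normalize(got_tree)


def run_with_xml_checker(source):
    """Run one doctest source string with the XML-aware checker installed
    the conventional way, via the runner's checker argument."""
    test = doctest.DocTestParser().get_doctest(source, {}, 'sketch', None, 0)
    runner = doctest.DocTestRunner(checker=XMLOutputChecker(), verbose=False)
    return runner.run(test, out=lambda text: None)
```

The trade-off is the one described above: this needs the caller to build the runner, whereas the usedoctest import works from inside the doctest itself.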
>> lxml.usedoctest: enable the doctest from within a doctest
>> lxml.html.usedoctest: enable the doctest, using the HTML parser
>
> Good idea. That way it automatically gets the same 'interface'.
>
> I'm not sure about the "use...", though. It needs to read well with "import":
>
>     from lxml import usedoctest
>
> Too many verbs IMHO (but as long as I can't come up with a better name, I'll
> just leave it as is :)
I feel like there needs to be a verb in the name, since the import does
stuff. The module itself is useless.
>> lxml.html:
>> * lxml.html.HtmlMixin, defining on each element:
>>   - remove_element: element removes itself from a tree
>>   - remove_tag: element removes itself but not its children from a tree
>
> remove() already exists and removes the element you pass (not the element you
> call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
> would be to go through the parent:
>
>     def cut_out_tree(self, element):
>         if element.tail:
>             previous = element.getprevious()
>             previous.tail = (previous.tail or '') + element.tail
>         self.remove(element)
>
>     def cut_out_element(self, element):
>         pos = self.index(element)
>         if element.text:
>             self.text = (self.text or '') + element.text
>         self.cut_out_tree(element)
>         self[pos:pos] = element[:]
I am a little reluctant to add self-delete methods in general in Python,
but with this technique I would *always* do
el.getparent().cut_out_tree(el). I pretty much always find an element
then get rid of it. Doing it from the parent is consistent but
inconvenient.
I agree the remove names are ambiguous -- both how they relate to each
other, and that they seem similar to remove().
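To make the two operations concrete, here is a rough sketch of the parent-based variant as plain functions. It is written against the Python 3 standard library's xml.etree.ElementTree, which has no getparent() or getprevious(), so the parent is passed in explicitly; the helper _append_text is invented for this example. Unlike the quoted draft, it assumes the removed element may be the first child (its text then moves to the parent's text) and it puts the element's tail after the element's former children.

```python
import xml.etree.ElementTree as ET


def _append_text(parent, index, text):
    """Attach text immediately before child position `index` of parent."""
    if not text:
        return
    if index == 0:
        # No preceding sibling: the text belongs to the parent itself.
        parent.text = (parent.text or '') + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or '') + text


def cut_out_tree(parent, element):
    """Remove element and its whole subtree, keeping its tail text."""
    pos = list(parent).index(element)
    tail = element.tail
    parent.remove(element)
    _append_text(parent, pos, tail)


def cut_out_element(parent, element):
    """Remove only the element's tag; its text and children stay in place."""
    pos = list(parent).index(element)
    children = list(element)
    text, tail = element.text, element.tail
    parent.remove(element)
    _append_text(parent, pos, text)
    for offset, child in enumerate(children):
        parent.insert(pos + offset, child)
    _append_text(parent, pos + len(children), tail)
```

With lxml, where every element knows its parent, the call would shrink to something like el.getparent().cut_out_tree(el), which is exactly the repetition complained about above.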
>> * Element: apparently a highly broken element factory (segfaults?!)
>
> Yup, that won't work that way. Element classes cannot be instantiated on their
> own. But you can do
>
> Element = html_parser.makeelement
OK. What's the distinction between Element and SubElement?
>> * tostring: HTML serialization
>
> Based on XSLT, as I've seen before. Sure, why not.
Yeah; it works. I hate removing the <meta http-equiv="Content-Type"> tag
with a regex, but leaving it in bugs the hell out of me and I see no
other way to get rid of it. If I were more apt to dig into the libxml2
code I'm sure I'd find a better technique, but I'm shy around C code.
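As a rough idea of what such a regex can look like (this is an illustrative pattern, not necessarily the one in the branch, and strip_content_type_meta is a made-up name), assuming the serializer emits the http-equiv attribute first and double-quoted:

```python
import re

# Assumption: the tag is serialized as <meta http-equiv="Content-Type" ...>
# with http-equiv as the first attribute; other layouts are not matched.
_META_CONTENT_TYPE = re.compile(
    r'<meta\s+http-equiv="Content-Type"[^>]*>\s*', re.IGNORECASE)


def strip_content_type_meta(html):
    """Drop the first Content-Type meta tag from a serialized document."""
    return _META_CONTENT_TYPE.sub('', html, count=1)
```

Being tied to one exact serialization is what makes this feel like a hack.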
>> lxml.[html.]clean: clean Javascript and other problem code from HTML
>
> That rather looks like an HtmlElement method to me: "cleanup(...)", and the
> clean_html() function would fit right into the top-level of the lxml.html module.
The long signature of the function made me reluctant to do this. Any
function with that many parameters feels non-authoritative to me. And I
would encourage people to actually write their own clean function with
the parameter defaults that are appropriate for their domain (e.g.,
clean_untrusted_comment, clean_wysiwyg_submission, etc). I just guessed
reasonable defaults for those keyword arguments.
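The pattern of domain-specific wrappers could look roughly like the sketch below. The clean_fragment function and its keyword arguments are toy stand-ins invented for this example, not the signature of the real cleaning function, and regex-based stripping like this is nowhere near a safe sanitizer; only the wrapping idea is the point.

```python
import re
from functools import partial


def clean_fragment(html, scripts=True, comments=True, style=False):
    """Toy stand-in for a cleaner with many switches (hypothetical API)."""
    if scripts:
        html = re.sub(r'(?is)<script\b.*?</script\s*>', '', html)
    if comments:
        html = re.sub(r'(?s)<!--.*?-->', '', html)
    if style:
        html = re.sub(r'(?is)<style\b.*?</style\s*>', '', html)
    return html


# Each domain pins down its own defaults once, under a descriptive name.
clean_untrusted_comment = partial(
    clean_fragment, scripts=True, comments=True, style=True)
clean_wysiwyg_submission = partial(
    clean_fragment, scripts=True, comments=False, style=False)
```

Callers then use clean_untrusted_comment(html) and never see the long parameter list.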
>> lxml.[html.]rewritelinks: change the links in a document
>
> Maybe too special and too long for integration into the lxml.html and
> HtmlElement, not sure. Some of this might fit, though.
This I feel a little more comfortable about than the cleanup. Making
all links absolute, in particular, is really convenient when you are
parsing documents.
I'd like to do some kind of query (returning all links in the document),
but I'm not sure what that would look like. Generally *just* the link
is kind of boring. Usually the link plus the element that has the link
is more interesting. But some kinds of links don't have elements; CSS
particularly. OTOH, a method that didn't cover that particular case
(even though the rewriting did) would still be useful. Maybe it would
return [(element_with_link, attribute_where_link_is), ...]. Or it could
be (element_with_link, attribute_where_link_is, link), and for CSS
that'd be (<style element>, None, link).
So potentially I see the methods:
    make_links_absolute(base_href)
    resolve_base_href()  # kind of icky, but still useful; for <base href>
    iter_links()  # as described
    rewrite_links(link_repl_func)
Does that make for too many methods? Doesn't seem too bad, especially
since links are important.
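A sketch of how three of these could fit together, written as plain functions over the Python 3 standard library's ElementTree rather than as element methods. The LINK_ATTRS tuple and the CSS url(...) pattern are simplifying assumptions for this example (a real version would cover more attributes and CSS forms), and resolve_base_href() is left out.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

LINK_ATTRS = ('href', 'src', 'action')  # assumed subset of link attributes
_CSS_URL = re.compile(r'url\(\s*["\']?([^"\')]+)["\']?\s*\)')


def iter_links(root):
    """Yield (element, attribute, link); attribute is None for CSS links."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                yield el, attr, el.attrib[attr]
        if el.tag == 'style' and el.text:
            for match in _CSS_URL.finditer(el.text):
                yield el, None, match.group(1).strip()


def rewrite_links(root, link_repl_func):
    """Replace every link with link_repl_func(link), in place."""
    def css_repl(match):
        return 'url(%s)' % link_repl_func(match.group(1).strip())

    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                el.set(attr, link_repl_func(el.attrib[attr]))
        if el.tag == 'style' and el.text:
            el.text = _CSS_URL.sub(css_repl, el.text)


def make_links_absolute(root, base_href):
    """Resolve every link against base_href."""
    rewrite_links(root, lambda link: urljoin(base_href, link))
```

Here make_links_absolute() is just rewrite_links() with urljoin as the replacement function, which suggests the extra methods cost little once the general one exists.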
I've also added two new methods: get_element_by_id() (a long name, but
at least easy to remember) and text_only(), which gives the text of the
tree with all the tags removed. I don't really like the text_only name,
but the function is useful.
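In spirit, the two helpers do something like the following, shown here as standalone functions over the Python 3 standard library's ElementTree. Raising KeyError for a missing id is an assumption of this sketch, not a statement about the branch.

```python
import xml.etree.ElementTree as ET


def get_element_by_id(root, id_):
    """Return the first element whose id attribute equals id_."""
    for el in root.iter():
        if el.get('id') == id_:
            return el
    raise KeyError(id_)  # assumed behaviour when nothing matches


def text_only(root):
    """Return the text of the tree with all the tags removed."""
    return ''.join(root.itertext())
```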
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
| Write code, do good | http://topp.openplans.org/careers