[lxml-dev] html branch

Ian Bicking ianb at colorstudy.com
Wed May 30 00:10:13 CEST 2007


Stefan Behnel wrote:
>> lxml.doctestcompare: XML/HTML doctests
> 
> As people would rarely import this, why not have it start with an underscore?

I guess... the usedoctest technique is a pretty egregious hack; I 
actually change doctest.OutputChecker.check_output.im_func.func_code 
because there's a local bound method that has to be changed.  So the 
more conventional installation method still seems worth keeping, in case 
there are interaction bugs.
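
For reference, the conventional installation amounts to handing doctest a custom OutputChecker rather than patching the stock one in place. What follows is only a minimal sketch under assumptions: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and XMLOutputChecker, _normalize and run_xml_doctest are hypothetical names, not the code in the branch.

```python
import doctest
import xml.etree.ElementTree as ET


def _normalize(el):
    # Reduce an element to a comparable structure that ignores attribute
    # order and leading/trailing whitespace in text and tail.
    return (
        el.tag,
        sorted(el.attrib.items()),
        (el.text or '').strip(),
        [_normalize(child) for child in el],
        (el.tail or '').strip(),
    )


class XMLOutputChecker(doctest.OutputChecker):
    """Compare expected and actual output as XML when both parse as XML."""

    def check_output(self, want, got, optionflags):
        try:
            want_tree = ET.fromstring(want.strip())
            got_tree = ET.fromstring(got.strip())
        except ET.ParseError:
            # Not XML on one side: fall back to the normal text comparison.
            return super().check_output(want, got, optionflags)
        return _normalize(want_tree) == _normalize(got_tree)


def run_xml_doctest(text, name='example'):
    # Hypothetical helper: run doctest examples in `text` with the XML-aware
    # checker and return doctest's TestResults(failed, attempted).
    test = doctest.DocTestParser().get_doctest(text, {}, name, None, 0)
    runner = doctest.DocTestRunner(checker=XMLOutputChecker(), verbose=False)
    return runner.run(test, out=lambda message: None)
```

Passing the checker to DocTestRunner explicitly keeps the change local to one test run, which is what makes it a useful fallback if the import-time hack interacts badly with other doctest users.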

>> lxml.usedoctest: enable the doctest from within a doctest
>> lxml.html.usedoctest: enable the doctest, using the HTML parser
> 
> Good idea. That way it automatically gets the same 'interface'.
> 
> I'm not sure about the "use...", though. It needs to read well with "import":
> 
>     from lxml import usedoctest
> 
> Too many verbs IMHO (but as long as I can't come up with a better name, I'll
> just leave it as is :)

I feel like there needs to be a verb in the name, since the import does 
stuff.  The module itself is useless.

>> lxml.html:
>>    * lxml.html.HtmlMixin, defining on each element:
>>      - remove_element: element removes itself from a tree
>>      - remove_tag: element removes itself but not its children from a tree
> 
> remove() already exists and removes the element you pass (not the element you
> call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
> would be to go through the parent:
> 
>     def cut_out_tree(self, element):
>         if element.tail:
>             previous = element.getprevious()
>             if previous is None:
>                 # first child: the tail joins the parent's text
>                 self.text = (self.text or '') + element.tail
>             else:
>                 previous.tail = (previous.tail or '') + element.tail
>         self.remove(element)
> 
>     def cut_out_element(self, element):
>         pos = self.index(element)
>         if element.text:
>             self.text = (self.text or '') + element.text
>         self.cut_out_tree(element)
>         self[pos:pos] = element[:]

I am a little reluctant to add self-delete methods in general in Python, 
but with this technique I would *always* do 
el.getparent().cut_out_tree(el).  I pretty much always find an element 
then get rid of it.  Doing it from the parent is consistent but 
inconvenient.

I agree the remove names are ambiguous -- both how they relate to each 
other, and that they seem similar to remove().
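
To make the parent-based technique concrete, here is a minimal sketch written as standalone functions. It assumes the standard library's xml.etree.ElementTree, which has no getparent(), so the parent is passed in explicitly; _append_before is a hypothetical helper, and the handling of text and tail position is one possible reading of the intended behaviour, not the branch's implementation.

```python
import xml.etree.ElementTree as ET


def _append_before(parent, pos, text):
    # Attach `text` immediately before child position `pos`: to the parent's
    # text if pos is 0, otherwise to the previous sibling's tail.
    if not text:
        return
    if pos == 0:
        parent.text = (parent.text or '') + text
    else:
        previous = parent[pos - 1]
        previous.tail = (previous.tail or '') + text


def cut_out_tree(parent, element):
    """Remove element and all its descendants, keeping its tail text."""
    pos = list(parent).index(element)
    _append_before(parent, pos, element.tail)
    parent.remove(element)


def cut_out_element(parent, element):
    """Remove only the element's tag; its text and children stay in place."""
    pos = list(parent).index(element)
    children = list(element)
    _append_before(parent, pos, element.text)
    if children:
        # The tail follows the last child once the tag is gone.
        if element.tail:
            last = children[-1]
            last.tail = (last.tail or '') + element.tail
    else:
        _append_before(parent, pos, element.tail)
    # Replace the element with its children in one slice assignment.
    parent[pos:pos + 1] = children
```

With lxml the caller would obtain the parent via el.getparent(), which is exactly the extra step that makes the parent-based spelling feel inconvenient.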

>>    * Element: apparently a highly broken element factory (segfaults?!)
> 
> Yup, that won't work that way. Element classes cannot be instantiated on their
> own. But you can do
> 
>     Element = html_parser.makeelement

OK.  What's the distinction between Element and SubElement?

>>    * tostring: HTML serialization
> 
> Based on XSLT, as I've seen before. Sure, why not.

Yeah; it works.  I hate removing the <meta http-equiv="Content-Type"> tag 
via a regex, but leaving it in bugs the hell out of me and I see no 
other way to get rid of it.  If I were more apt to dig into libxml2 
code I'm sure I'd find a better technique, but I'm shy around C code.
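
A minimal sketch of this kind of regex post-processing, for illustration only. The pattern and the strip_content_type_meta name are assumptions; the actual expression used in the branch is not shown in this thread.

```python
import re

# Matches a <meta> tag whose http-equiv attribute is Content-Type, plus any
# whitespace that follows it. Assumes the tag contains no literal ">" inside
# attribute values, which holds for the serializer-generated tag.
_META_CONTENT_TYPE = re.compile(
    r'<meta\b[^>]*\bhttp-equiv\s*=\s*["\']?content-type["\']?[^>]*>\s*',
    re.IGNORECASE,
)


def strip_content_type_meta(html):
    """Return html with any Content-Type <meta> tag removed."""
    return _META_CONTENT_TYPE.sub('', html)
```

Other <meta> tags (for example name="description") are left alone because the pattern requires the http-equiv attribute.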

>> lxml.[html.]clean: clean Javascript and other problem code from HTML
> 
> That rather looks like an HtmlElement method to me: "cleanup(...)", and the
> clean_html() function would fit right into the top-level of the lxml.html module.

The long signature of the function made me reluctant to do this.  Any 
function with that many parameters feels non-authoritative to me.  And I 
would encourage people to actually write their own clean function with 
the parameter defaults that are appropriate for their domain (e.g., 
clean_untrusted_comment, clean_wysiwyg_submission, etc).  I just guessed 
reasonable defaults for those keyword arguments.
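
A sketch of what such domain-specific wrappers could look like. The clean_html function here is a hypothetical toy stand-in with made-up keyword arguments (scripts, comments, styles); it is not the branch's cleaner and is not a real sanitizer. Only the wrapping pattern is the point.

```python
import re
from functools import partial


def clean_html(html, *, scripts=True, comments=True, styles=False):
    # Hypothetical stand-in for a cleaner with many keyword options.
    # Regex-based removal is NOT safe against hostile input.
    if scripts:
        html = re.sub(r'(?is)<script\b.*?</script\s*>', '', html)
    if comments:
        html = re.sub(r'(?s)<!--.*?-->', '', html)
    if styles:
        html = re.sub(r'(?is)<style\b.*?</style\s*>', '', html)
    return html


# Each domain fixes the defaults it needs once, under a descriptive name.
clean_untrusted_comment = partial(
    clean_html, scripts=True, comments=True, styles=True)
clean_wysiwyg_submission = partial(
    clean_html, scripts=True, comments=False, styles=False)
```

Callers then use clean_untrusted_comment(text) and never see the long signature, while the defaults stay visible in one place.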

>> lxml.[html.]rewritelinks: change the links in a document
> 
> Maybe too special and too long for integration into the lxml.html and
> HtmlElement, not sure. Some of this might fit, though.

I feel a little more comfortable about this than about the cleanup. 
Making all links absolute, in particular, is really convenient when you 
are parsing.

I'd like to do some kind of query (returning all links in the document), 
but I'm not sure what that would look like.  Generally *just* the link 
is kind of boring.  Usually the link plus the element that has the link 
is more interesting.  But some kinds of links don't have elements; CSS 
particularly.  OTOH, a method that didn't cover that particular case 
(even though the rewriting did) would still be useful.  Maybe it would 
return [(element_with_link, attribute_where_link_is), ...].  Or it could 
be (element_with_link, attribute_where_link_is, link), and for CSS 
that'd be (<style element>, None, link).

So potentially I see the methods:

make_links_absolute(base_href)
resolve_base_href() # kind of icky, but still useful; for <base href>
iter_links() # as described
rewrite_links(link_repl_func)

Does that make for too many methods?  Doesn't seem too bad, especially 
since links are important.
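
One way the proposed methods could fit together, sketched as plain functions. This assumes the standard library's xml.etree.ElementTree on well-formed markup as a stand-in for lxml; LINK_ATTRS and the CSS url() pattern are illustrative simplifications, and resolve_base_href() is left out.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Assumed, incomplete set of link-carrying attributes.
LINK_ATTRS = ('href', 'src', 'action')
# Simplified matcher for url(...) references inside <style> text.
_CSS_URL = re.compile(r'url\(\s*[\'"]?([^)\'"\s]+)[\'"]?\s*\)')


def iter_links(root):
    """Yield (element, attribute, link); attribute is None for CSS links."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                yield el, attr, el.get(attr)
        if el.tag == 'style' and el.text:
            for match in _CSS_URL.finditer(el.text):
                yield el, None, match.group(1)


def rewrite_links(root, link_repl_func):
    """Replace every link with link_repl_func(link), in place."""
    for el in root.iter():
        for attr in LINK_ATTRS:
            if attr in el.attrib:
                el.set(attr, link_repl_func(el.get(attr)))
        if el.tag == 'style' and el.text:
            el.text = _CSS_URL.sub(
                lambda m: 'url(%s)' % link_repl_func(m.group(1)), el.text)


def make_links_absolute(root, base_href):
    """Resolve every link against base_href."""
    rewrite_links(root, lambda link: urljoin(base_href, link))
```

Building make_links_absolute() on top of rewrite_links() keeps the traversal logic in one place, so the method count grows without much extra code.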


I've also added two new methods: get_element_by_id() (a long name, but 
at least easy to remember) and text_only(), which gives the text of the 
tree with all the tags removed.  I don't really like the text_only name, 
but the function is useful.
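
A rough sketch of what these two helpers do, written as functions over the standard library's xml.etree.ElementTree. This is an assumption about their behaviour based on the description here, not the branch's implementation.

```python
import xml.etree.ElementTree as ET


def get_element_by_id(root, id_value):
    """Return the first element whose id attribute equals id_value, or None."""
    for el in root.iter():
        if el.get('id') == id_value:
            return el
    return None


def text_only(root):
    """Return all text inside root, in document order, with tags removed."""
    return ''.join(root.itertext())
```

For example, on <div><p id="intro">Hello <b>big</b> world</p></div>, get_element_by_id(root, 'intro') finds the paragraph and text_only(root) yields its text with the <b> tag dropped.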


-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

