[lxml-dev] html branch

Stefan Behnel stefan_ml at behnel.de
Tue May 29 23:41:22 CEST 2007


Hi Ian,

Ian Bicking wrote:
> I've started a branch with lxml.html, in 
> http://codespeak.net/svn/lxml/branch/html

Sure, cool.


> lxml.doctestcompare: XML/HTML doctests

As people would rarely import this, why not have it start with an underscore?


> lxml.usedoctest: enable the doctest from within a doctest
> lxml.html.usedoctest: enable the doctest, using the HTML parser

Good idea. That way it's automatically gets the same 'interface'.

I'm not sure about the "use...", though. It needs to read well with "import":

    from lxml import usedoctest

Too many verbs IMHO (but as long as I can't come up with a better name, I'll
just leave it as is :)


> lxml.html:
>    * lxml.html.HtmlMixin, defining on each element:
>      - remove_element: element removes itself from a tree
>      - remove_tag: element removes itself but not its children from a tree

remove() already exists and removes the element you pass (not the element you
call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
would be to go through the parent:

    def cut_out_tree(self, element):
         if element.tail:
             previous = element.getprevious()
             previous.tail = (previous.tail or '') + element.tail
         self.remove(element)

    def cut_out_element(self, element):
         pos = self.index(element)
         if element.text:
              self.text = (self.text or '') + element.text
         self.cut_out_tree(element)
         self[pos:pos] = element[:]


>    * HTML: parser
>    * parse_elements: parse fragment, return list of elements
>    * parse_element: parse fragment, return single element

I'll look into those, but they look ok at first glance.


>    * Element: apparently a highly broken element factory (segfaults?!)

Yup, that won't work that way. Element classes cannot be instantiated on their
own. But you can do

    Element = html_parser.makeelement


>    * tostring: HTML serialization

Based on XSLT, as I've seen before. Sure, why not.


> lxml.[html.]defs: lists of HTML tags (e.g., block_tags)

Ok.


> lxml.[html.]clean: clean Javascript and other problem code from HTML

That rather looks like an HtmlElement method to me: "cleanup(...)", and the
clean_html() function would fit right into the top-level of the lxml.html module.


> lxml.[html.]rewritelinks: change the links in a document

Maybe too special and too long for integration into the lxml.html and
HtmlElement, not sure. Some of this might fit, though.


> lxml.[html.]htmldiff: make human-readable diffs and blame reports
>
> The usedoctest modules are based on a really horrible hack.  It seems to 
> work, except for some reason lxml/html/tests/test_clean.txt is sometimes 
> run without the doctest change.  The other doctests aren't run like 
> this, and when you explicitly run the test (e.g., python test.py 
> test_clean) it runs fine.  So something weird with the test runner, I guess.

I'll take a look at these later.

Stefan



More information about the lxml-dev mailing list