[lxml-dev] html branch
Stefan Behnel
stefan_ml at behnel.de
Tue May 29 23:41:22 CEST 2007
Hi Ian,
Ian Bicking wrote:
> I've started a branch with lxml.html, in
> http://codespeak.net/svn/lxml/branch/html
Sure, cool.
> lxml.doctestcompare: XML/HTML doctests
As people would rarely import this, why not have it start with an underscore?
> lxml.usedoctest: enable the doctest from within a doctest
> lxml.html.usedoctest: enable the doctest, using the HTML parser
Good idea. That way it's automatically gets the same 'interface'.
I'm not sure about the "use...", though. It needs to read well with "import":
from lxml import usedoctest
Too many verbs IMHO (but as long as I can't come up with a better name, I'll
just leave it as is :)
> lxml.html:
> * lxml.html.HtmlMixin, defining on each element:
> - remove_element: element removes itself from a tree
> - remove_tag: element removes itself but not its children from a tree
remove() already exists and removes the element you pass (not the element you
call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way
would be to go through the parent:
def cut_out_tree(self, element):
if element.tail:
previous = element.getprevious()
previous.tail = (previous.tail or '') + element.tail
self.remove(element)
def cut_out_element(self, element):
pos = self.index(element)
if element.text:
self.text = (self.text or '') + element.text
self.cut_out_tree(element)
self[pos:pos] = element[:]
> * HTML: parser
> * parse_elements: parse fragment, return list of elements
> * parse_element: parse fragment, return single element
I'll look into those, but they look ok at first glance.
> * Element: apparently a highly broken element factory (segfaults?!)
Yup, that won't work that way. Element classes cannot be instantiated on their
own. But you can do
Element = html_parser.makeelement
> * tostring: HTML serialization
Based on XSLT, as I've seen before. Sure, why not.
> lxml.[html.]defs: lists of HTML tags (e.g., block_tags)
Ok.
> lxml.[html.]clean: clean Javascript and other problem code from HTML
That rather looks like an HtmlElement method to me: "cleanup(...)", and the
clean_html() function would fit right into the top-level of the lxml.html module.
> lxml.[html.]rewritelinks: change the links in a document
Maybe too special and too long for integration into the lxml.html and
HtmlElement, not sure. Some of this might fit, though.
> lxml.[html.]htmldiff: make human-readable diffs and blame reports
>
> The usedoctest modules are based on a really horrible hack. It seems to
> work, except for some reason lxml/html/tests/test_clean.txt is sometimes
> run without the doctest change. The other doctests aren't run like
> this, and when you explicitly run the test (e.g., python test.py
> test_clean) it runs fine. So something weird with the test runner, I guess.
I'll take a look at these later.
Stefan
More information about the lxml-dev
mailing list