[lxml-dev] html branch

Ian Bicking ianb at colorstudy.com
Tue May 29 17:37:04 CEST 2007


I've started a branch that adds lxml.html, at
http://codespeak.net/svn/lxml/branch/html

It currently includes:

lxml.doctestcompare: XML/HTML doctests
lxml.usedoctest: enable the XML doctest comparison from within a doctest
lxml.html.usedoctest: enable the same comparison, using the HTML parser
lxml.html:
   * lxml.html.HtmlMixin, defining on each element:
     - remove_element: element removes itself from a tree
     - remove_tag: element removes itself but not its children from a tree
     - find_rel_links: find <a rel="?">
     - find_class: find <* class="?">
   * HTML: parser
   * parse_elements: parse fragment, return list of elements
   * parse_element: parse fragment, return single element
   * Element: apparently a highly broken element factory (segfaults?!)
   * tostring: HTML serialization
lxml.defs: lists of HTML tags (e.g., block_tags)
lxml.clean: clean JavaScript and other problem code from HTML
lxml.rewritelinks: change the links in a document
lxml.htmldiff: make human-readable diffs and blame reports
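
To make the difference between remove_element and remove_tag concrete, 
here is a minimal sketch using only the standard library's 
xml.etree.ElementTree.  It is not the code from the branch: the function 
signatures (taking the root explicitly) and the helpers _parent_map and 
_append_text are assumptions made for illustration, since ElementTree 
elements don't know their parent.

```python
import xml.etree.ElementTree as ET


def _parent_map(root):
    # Hypothetical helper: ElementTree has no getparent(), so build a lookup.
    return {child: parent for parent in root.iter() for child in parent}


def _append_text(parent, index, text):
    # Hypothetical helper: attach text just before position `index`
    # among the children of `parent`.
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def remove_element(root, el):
    """Remove el and everything inside it; keep the text that follows it."""
    parent = _parent_map(root)[el]
    index = list(parent).index(el)
    parent.remove(el)
    _append_text(parent, index, el.tail)


def remove_tag(root, el):
    """Remove only el's tag; its text and children move into the parent."""
    parent = _parent_map(root)[el]
    index = list(parent).index(el)
    children = list(el)
    parent.remove(el)
    _append_text(parent, index, el.text)
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)
    _append_text(parent, index + len(children), el.tail)
```

Given <div>Hello <b>big <i>wide</i> world</b>!</div>, calling remove_tag 
on the <b> leaves <div>Hello big <i>wide</i> world!</div>, while 
remove_element leaves <div>Hello !</div>.  The fiddly part in both cases 
is the tail text, which has to be re-attached to whatever now precedes it.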

The usedoctest modules are based on a really horrible hack.  It seems to 
work, except that for some reason lxml/html/tests/test_clean.txt is 
sometimes run without the doctest change applied.  The other doctests 
don't show this problem, and when you run the test explicitly (e.g., 
python test.py test_clean) it runs fine.  So I guess something weird is 
going on in the test runner.
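
For anyone unfamiliar with the idea behind the doctest change, here is a 
rough sketch of structure-aware doctest comparison using only the standard 
library.  It is not the implementation in lxml.doctestcompare, and it does 
not show how usedoctest installs itself: XMLOutputChecker, _norm, _same, 
SAMPLE and run_sample are names made up for this example, and it only 
handles well-formed XML (real HTML would need an HTML parser).

```python
import doctest
import xml.etree.ElementTree as ET


def _norm(text):
    # Collapse whitespace so indentation differences don't matter.
    return " ".join((text or "").split())


def _same(a, b):
    # Compare two elements by tag, attributes, text, and children.
    return (a.tag == b.tag and a.attrib == b.attrib
            and _norm(a.text) == _norm(b.text)
            and _norm(a.tail) == _norm(b.tail)
            and len(a) == len(b)
            and all(_same(x, y) for x, y in zip(a, b)))


class XMLOutputChecker(doctest.OutputChecker):
    def check_output(self, want, got, optionflags):
        try:
            want_el = ET.fromstring(want.strip())
            got_el = ET.fromstring(got.strip())
        except ET.ParseError:
            # Not XML on one side: fall back to the normal string comparison.
            return super().check_output(want, got, optionflags)
        return _same(want_el, got_el)


SAMPLE = '''
>>> print('<a  y="2" x="1"><b/></a>')
<a x="1" y="2">
  <b></b>
</a>
'''


def run_sample(checker):
    # Run the sample doctest with the given checker, discarding the report.
    test = doctest.DocTestParser().get_doctest(SAMPLE, {}, "sample", None, 0)
    runner = doctest.DocTestRunner(checker=checker, verbose=False)
    return runner.run(test, out=lambda s: None)
```

With the stock doctest.OutputChecker the sample fails, because attribute 
order and whitespace differ; with XMLOutputChecker it passes.  In this 
sketch the checker is handed to the runner explicitly, which is the part a 
usedoctest-style module has to do from inside a running doctest.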


-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

