[lxml-dev] lhtml

Ian Bicking ianb at colorstudy.com
Fri May 25 18:11:12 CEST 2007


Stefan Behnel wrote:
> Hi Ian,
> 
> Ian Bicking wrote:
>> Ian Bicking wrote:
>>> I really want to take all our HTML-related routines and put them into a 
>>> proper package
>> And maybe a bit of advice -- we could just do this as a set of functions 
>> (what we currently have), or potentially explore objectify and add the 
>> routines as methods.  E.g., el.find_by_class('classname')
> 
> You're not using objectify as a base, are you? I mean, HTML is mainly about
> text, so objectify will not help you much.

I'm not using it now, no.  But if I used objectify as a base, it would 
be to add methods like .html_serialize() to elements, or any number of 
other handy methods.  At least "handy" for dealing with the mixed 
content that HTML has, which is relatively uncommon in other XML.
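
A minimal sketch of the "methods on elements" idea, assuming lxml's custom element class support (etree.ElementBase plus a parser-level class lookup). The class name HtmlElement and the matching logic are illustrative assumptions; only the method name find_by_class comes from the thread, and this is not the code in the attachment.

```python
from lxml import etree


class HtmlElement(etree.ElementBase):
    """Hypothetical element class that carries HTML helper methods."""

    def find_by_class(self, class_name):
        # Return this element and/or its descendants, in document order,
        # whose class attribute contains class_name as a whole token.
        # iter('*') visits elements only, skipping comments and PIs.
        return [
            el for el in self.iter('*')
            if class_name in (el.get('class') or '').split()
        ]


# Ask the HTML parser to build HtmlElement instances for every element.
parser = etree.HTMLParser()
parser.set_element_class_lookup(
    etree.ElementDefaultClassLookup(element=HtmlElement))

root = etree.fromstring(
    '<html><body><p class="note big">a</p><p>b</p>'
    '<div class="note">c</div></body></html>',
    parser,
)
hits = root.find_by_class('note')
print([el.tag for el in hits])
```

Only elements created by a parser configured this way get the extra methods, which is one form of the mixing concern raised above: elements from a default parser would lack them.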

>> This feels like a cleaner API, but I'm worried that it will mean 
>> problems when mixing non-objectify-HTML with other elements, and about 
>> whether there are problems with threads, memory overhead, or other issues. 
>> I don't really mind functions, which is why I am unsure; OTOH, almost 
>> every function has a first argument of "el", which makes them seem like 
>> methods.
> 
> What about implementing the HTML namespace in a couple of Element subclasses
> and adding the methods where they are appropriate? That sounds like a nice API to me.

The HTML() parser doesn't actually use namespaces.  Well, maybe it does 
if you give it XHTML, or maybe you really have to use XML() to get that. 
It's never come up because I don't deal with any XHTML sites (because 
there are almost no XHTML sites ;).
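
A quick way to check this is to compare the tags each parser produces. The sketch below assumes a current lxml; the sample documents are made up for illustration. The last line prints what HTML() does with XHTML input without asserting a result here.

```python
from lxml import etree

XHTML_NS = 'http://www.w3.org/1999/xhtml'
plain_html = '<html><body><p>hi</p></body></html>'
xhtml = '<html xmlns="%s"><body><p>hi</p></body></html>' % XHTML_NS

# Plain HTML through HTML(): tags come back without any namespace.
plain_root = etree.HTML(plain_html)
print(plain_root.tag)

# XHTML through XML(): tags are qualified as {namespace}name.
xml_root = etree.XML(xhtml)
print(xml_root.tag)

# XHTML through HTML(): run this to see which of the two forms you get.
print(etree.HTML(xhtml).tag)
```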

I'm not entirely clear on how namespaces fit in.  Most of the methods 
would apply to all HTML elements, but HTML 4 elements have no namespace, 
so they aren't easy to distinguish from other elements.

> Any chance you could post your code somewhere so that I could take a look at
> what you're really contributing here?

Sure; I started collecting a few of the routines from various libraries 
yesterday.  There's still stuff in Deliverance and htmldiff that I 
haven't integrated.  I haven't copied over any tests and there may be 
broken imports in many of the modules, but it should give you a vague 
idea of scope.  (I'm actually looking for a home for htmldiff, so it's 
possible it could also go in this library; it's at 
https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/htmldiff2.py 
and 
https://svn.openplans.org/svn/opencore/trunk/opencore/nui/wiki/test_htmldiff2.txt)

Anyway, it's not too big so I'll just attach the stuff I have collected.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lhtml.tar.gz
Type: application/x-gzip
Size: 5480 bytes
Desc: not available
Url : http://codespeak.net/pipermail/lxml-dev/attachments/20070525/9e5fcd8c/attachment.bin 

