[lxml-dev] html branch

Ian Bicking ianb at colorstudy.com
Mon Jun 4 18:27:24 CEST 2007


Stefan Behnel wrote:
>> This makes the function also a handy way to do functional-style
>> transformations of elements.  It bothers me a bit to change the return
>> type (which I generally dislike doing), except that it matches the input
>> type which seems like it might be okay.
>>
>> Does this seem okay?
> 
> It looks Pythonic to me. You get out what you put in and whatever you put in,
> it does the same thing to it. So it's just a perfectly polymorphic function.
> 
> 
>> Also, I'm wondering if (a) I should try to automatically determine
>> fragment unless it is explicitly given, and/or (b) if parse_element
>> doesn't work (raises an exception) I should use parse_element(doc,
>> create_parent=True) which will wrap the fragment in a <div>.
> 
> Defaulting to a "wrap with <div>" fallback means changing the input in a not
> really predictable way. That sounds like too much magic to me. In most cases,
> users will know what they are dealing with. Otherwise, they can well catch the
> exception and then fall back to an alternative *if they want*. I'm fine with
> having a function that can handle HTML trees or serialised HTML documents and
> requires users to parse things themselves if it's not a document.

I imported a bunch of HTML cleaning tests from other sources, and in the 
process I found "parse this somehow and give me an element" to be very 
convenient.  Of course, HTML() *does* exactly that kind of parsing, but 
at least for cleaning you usually don't want a full document, you really 
just want a fragment.  And that's not too uncommon.

To make this easier I implemented a parse() function that does its best 
to parse your content.  If your content is a full page, you get a full 
page back.  If it's not a full page and it contains just one element, 
you get that element back.  But if it's not a full page and it contains 
multiple elements, it gets wrapped in a <div>.  This seems less 
intrusive than wrapping it in <html><body>, which is effectively what 
the standard parser does.  <div> is really a generic wrapper (though I 
suppose since it is block level, it's not *entirely* generic -- it might 
be better to check whether the content contains any block-level elements, 
and if not just wrap in <span>).  Dealing with ordered lists of elements 
with no parent isn't that easy or natural anywhere in the API.  If there 
were some kind of anonymous container, that would be a better fit, but 
there isn't one.  Is it possible to make something like that?  It seems 
like a new kind of node could cause a lot of problems.
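
The three cases described above (full page, single element, multiple 
elements) can be sketched roughly as follows.  This is only an 
illustration, not the parse() from the branch: the name 
parse_best_effort and the regex-based full-page check are assumptions 
made for the example, and it assumes str input.  It relies on 
lxml.html.fragment_fromstring() raising etree.ParserError when the 
input is not exactly one element.

```python
import re

from lxml import etree, html

# Assumed heuristic: input counts as a full page if it starts with a
# doctype or an <html> tag. The real function may decide differently.
_FULL_PAGE = re.compile(r"\s*<(?:!doctype|html)\b", re.IGNORECASE)


def parse_best_effort(text):
    """Return a single element for any chunk of HTML (illustrative sketch)."""
    if _FULL_PAGE.match(text):
        # Full page in, full page out.
        return html.document_fromstring(text)
    try:
        # Exactly one element: hand it back without a wrapper.
        return html.fragment_fromstring(text)
    except etree.ParserError:
        # Several elements or loose text: wrap them in a generic <div>.
        return html.fragment_fromstring(text, create_parent="div")


print(parse_best_effort("<p>one</p>").tag)                          # p
print(parse_best_effort("<p>one</p><p>two</p>").tag)                # div
print(parse_best_effort("<html><body><p>x</p></body></html>").tag)  # html
```

Passing create_parent="span" instead would give the inline wrapper 
mentioned above, if the content is known to hold no block-level elements.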

Notably, with the HTML parser you frequently get something out with more 
elements than were in the original.  It'll add <p> or <div> tags fairly 
liberally, rearrange tags, etc., to make the document valid.  So adding 
a <div> tag isn't that far from what can already happen.
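
A small illustration of the point about extra elements, using 
lxml.html.document_fromstring() on a one-element input.  It only shows 
the added <html> and <body> wrappers; which further tags the parser 
inserts or rearranges depends on the input and on the libxml2 version.

```python
from lxml import html

source = "<p>hello</p>"  # one element in

# Parse as a full document and list every element that comes out.
doc = html.document_fromstring(source)
tags = [el.tag for el in doc.iter()]

print(tags)  # ['html', 'body', 'p']
```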

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

