[lxml-dev] html branch
Ian Bicking
ianb at colorstudy.com
Mon Jun 4 18:27:24 CEST 2007
Stefan Behnel wrote:
>> This makes the function also a handy way to do functional-style
>> transformations of elements. It bothers me a bit to change the return
>> type (which I generally dislike doing), except that it matches the input
>> type which seems like it might be okay.
>>
>> Does this seem okay?
>
> It looks Pythonic to me. You get out what you put in and whatever you put in,
> it does the same thing to it. So it's just a perfectly polymorphic function.
>
>
>> Also, I'm wondering if (a) I should try to automatically determine
>> fragment unless it is explicitly given, and/or (b) if parse_element
>> doesn't work (raises an exception) I should use parse_element(doc,
>> create_parent=True) which will wrap the fragment in a <div>.
>
> Defaulting to a "wrap with <div>" fallback means changing the input in a not
> really predictable way. That sounds like too much magic to me. In most cases,
> users will know what they are dealing with. Otherwise, they can well catch the
> exception and then fall back to an alternative *if they want*. I'm fine with
> having a function that can handle HTML trees or serialised HTML documents and
> requires users to parse things themselves if it's not a document.

I imported a bunch of HTML cleaning tests from other sources, and in the
process I found "parse this somehow and give me an element" to be very
convenient. Of course, HTML() *does* exactly that kind of parsing, but
at least for cleaning you usually don't want a full document, you really
just want a fragment. And that's not too uncommon.

To make this easier I implemented a parse() function that does its best
to parse your content. If your content is a full page, you get a full
page back. If it's not a full page and it contains just one element,
you get that element back. But if it's not a full page and it contains
multiple elements, it gets wrapped in a <div>. This seems less
intrusive than wrapping it in <html><body>, which is effectively what
the standard parser does. <div> is really a generic wrapper (though I
suppose since it is block level, it's not *entirely* generic -- it might
be more ideal to see if the content contains any block level elements,
and if not just wrap in <span>). Dealing with ordered lists of elements
with no parent isn't that easy or natural anywhere in the API. If there
were some kind of anonymous container, that would be a nicer wrapper,
but there isn't one. Is it possible to make something like that? It
seems like a new kind of node could cause a lot of problems.

Notably, with the HTML parser you frequently get something out with more
elements than were in the original. It'll add <p> or <div> tags fairly
liberally, rearrange tags, etc., to make the document valid. So adding
a <div> tag isn't that far from what can already happen.
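
As a rough illustration of the rule described above (full page stays a
full page, a single element stays as it is, anything else gets wrapped
in a <div>), here is a minimal sketch. It is not the parse() function
from the html branch: it uses only the standard library and works on
strings rather than lxml elements, and the names classify() and
wrap_if_needed() are hypothetical. It also assumes reasonably
well-nested input; unclosed tags such as a bare <p> will confuse the
depth counting.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class _TopLevelCounter(HTMLParser):
    """Counts elements at nesting depth 0 and notes stray top-level text."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.top_level = 0
        self.has_text = False

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.top_level += 1
        if tag not in VOID:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax like <br/>: counts as an element, depth unchanged.
        if self.depth == 0:
            self.top_level += 1

    def handle_endtag(self, tag):
        if tag not in VOID and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.has_text = True


def classify(html):
    """Return 'document', 'element', or 'wrap' for an HTML string."""
    head = html.lstrip().lower()
    if head.startswith("<!doctype") or head.startswith("<html"):
        return "document"
    counter = _TopLevelCounter()
    counter.feed(html)
    counter.close()
    if counter.top_level == 1 and not counter.has_text:
        return "element"
    return "wrap"


def wrap_if_needed(html):
    """Wrap multi-element (or text-bearing) fragments in a <div>."""
    if classify(html) == "wrap":
        return "<div>" + html + "</div>"
    return html
```

With this, wrap_if_needed("<p>one</p><p>two</p>") adds the <div>, while
a lone "<p>one</p>" or a full "<html>..." page is passed through
unchanged. Choosing <span> when no block-level elements are present
would be a small extension of the same counter.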
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
| Write code, do good | http://topp.openplans.org/careers