[lxml-dev] html branch

Stefan Behnel stefan_ml at behnel.de
Mon Jun 4 19:34:55 CEST 2007


Hi Ian,

Ian Bicking wrote:
> I imported a bunch of HTML cleaning tests from other sources, and in the
> process I found "parse this somehow and give me an element" to be very
> convenient.  Of course, HTML() *does* exactly that kind of parsing, but
> at least for cleaning you usually don't want a full document, you really
> just want a fragment.  And that's not too uncommon.

Ok, that makes sense.


> To make this easier I implemented a parse() function that does its best
> to parse your content.  If your content is a full page, you get a full
> page back.  If it's not a full page and it contains just one element,
> you get that element back.  But if it's not a full page and it contains
> multiple elements, it gets wrapped in a <div>.  This seems less
> intrusive than wrapping it in <html><body>, which is effectively what
> the standard parser does.  <div> is really a generic wrapper (though I
> suppose since it is block level, it's not *entirely* generic

Adding a block-level wrapper might also break things like CSS.


> -- it might
> be more ideal to see if the content contains any block level elements,
> and if not just wrap in <span>).

That's a good idea. The parse() function could do that, as it already aims to
be smart about what it returns (otherwise, you could just use the normal
etree.parse() with an HTMLParser). If you pass it something that can't be
returned as a single element, I find it legitimate to wrap it in something
that fits. And once we've determined that we need to wrap it, we can also
decide what to wrap it in by looking at the tree(s). As a quick check, we
first look only at the parsed root elements. Only if none of them is a block
element do we need to traverse each tree completely. If we find at least one
block element (easy to check by testing the tag against a set of known
block-level tags), we wrap with <div>; otherwise, we wrap with <span>.
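
A minimal sketch of that check, for illustration only and not the code in the
branch. BLOCK_TAGS, choose_wrapper_tag() and as_single_element() are
hypothetical names, the tag set is deliberately incomplete, and the stdlib
xml.etree.ElementTree stands in for lxml.etree here on the assumption that
only the shared element API (.tag, .iter(), .extend()) is needed:

```python
import xml.etree.ElementTree as ET  # stand-in; lxml.etree offers the same calls used below

# Assumed, incomplete set of block-level tags (hypothetical, for illustration).
BLOCK_TAGS = frozenset([
    "address", "blockquote", "div", "dl", "fieldset", "form",
    "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ol", "p", "pre",
    "table", "ul",
])


def choose_wrapper_tag(roots):
    """Return "div" if any element in the given trees is block level, else "span"."""
    # Quick check: look only at the top-level elements first.
    if any(root.tag in BLOCK_TAGS for root in roots):
        return "div"
    # Slow path: none of the roots is a block element, so walk each tree fully.
    for root in roots:
        if any(el.tag in BLOCK_TAGS for el in root.iter()):
            return "div"
    return "span"


def as_single_element(roots):
    """Return the only root unchanged, or wrap several roots in one element."""
    if len(roots) == 1:
        return roots[0]
    wrapper = ET.Element(choose_wrapper_tag(roots))
    wrapper.extend(roots)  # append all parsed roots as children, in order
    return wrapper
```

With two inline roots such as <b> and <i> this yields a <span> wrapper, while
any <p> or <ul> anywhere in the trees switches the wrapper to <div>. Text
between the root elements is ignored in this sketch.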


> Dealing with ordered lists of elements
> with no parent isn't that easy or natural anywhere in the API.  If there
> was some kind of anonymous container then that would be a nice
> container, but there isn't one.  Is it possible to make something like
> that?  It seems like a new kind of node could cause a lot of problems.

It definitely would. Adding such a beast would cause overhead in basically all
API functions, in traversal code, etc. I'd be very happy to avoid that.


> Notably, with the HTML parser you frequently get something out with more
> elements than were in the original.  It'll add <p> or <div> tags fairly
> liberally, rearrange tags, etc., to make the document valid.  So adding
> a <div> tag isn't that far from what can already happen.

True. As I said, having a parse() function that accompanies etree.parse() and
that deliberately says "I return *one* element and I do it the smart way" is
definitely the way to go.

Stefan


