[lxml-dev] lhtml

Ian Bicking ianb at colorstudy.com
Fri May 25 23:39:59 CEST 2007


Stefan Behnel wrote:
> Ian Bicking wrote:
>> Stefan Behnel wrote:
>> It relies on a different parser from lxml.etree.HTML, and I would guess
>> that elements created with etree.Element wouldn't necessarily use the
>> right class.
> 
> objectify replicates the XML() and Element() factories for exactly this
> purpose. lxml.html could do likewise.

Sure.  Presumably at least a parser would be in there (HTML()).  I
suppose there's no reason Element can't be too.

How does this interact with XSLT transformations?  When you transform a
document, does the result keep the parser and hence the custom classes?

>>> - I'd like to call that module "lxml.xmldoctest" or something like
>>> that, so
>>> that you can "import xmldoctest" in a doctest file, which is rather
>>> readable.
>> I'd be surprised if this would actually work -- I'd expect that it would
>> be too late once you were running the doctest.  But I haven't tried.
> 
> Me neither. But *if* it works, *then* requiring a call to install() shouldn't
> be necessary.

Probably we could *make* it work.

It would make me more comfortable if it were at least a separate module.
So there'd be an lxml.xmldoctest module, and an lxml.usexmldoctest or
something.  Then you wouldn't *have* to enable the checker if you just
want to import the module (e.g., to make your own checker based on that 
checker).
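
To make the "own checker" case concrete, here is a rough sketch of a
checker that can be imported and subclassed without being installed
globally.  It uses only the standard library (doctest and
xml.etree.ElementTree) rather than lxml, and the names XMLOutputChecker
and _canonical are made up for illustration; this is not the checker
under discussion.

```python
import doctest
import xml.etree.ElementTree as ET


def _canonical(elem):
    """Reduce an element to a comparable tuple (hypothetical helper).

    Attribute order and surrounding whitespace are ignored.
    """
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        (elem.text or "").strip(),
        (elem.tail or "").strip(),
        [_canonical(child) for child in elem],
    )


class XMLOutputChecker(doctest.OutputChecker):
    """Compare expected and actual output as XML when both parse."""

    def check_output(self, want, got, optionflags):
        try:
            want_tree = ET.fromstring(want.strip())
            got_tree = ET.fromstring(got.strip())
        except ET.ParseError:
            # Not XML on one side: fall back to the normal text comparison.
            return super().check_output(want, got, optionflags)
        return _canonical(want_tree) == _canonical(got_tree)
```

Importing this does nothing by itself; you would opt in explicitly, for
example with doctest.DocTestRunner(checker=XMLOutputChecker()).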

There's also some ambiguity between HTML and XML.  When do you parse 
something as HTML, and when only as XML?  It depends on the doctest. 
You can kind of tell by looking for <html>, but I actually spend more 
time looking at HTML snippets than documents when doing testing.
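
The "look for <html>" guess might be sketched like this (the function
name is hypothetical).  It only recognizes complete documents, which is
exactly why it doesn't help with snippets: a snippet comes back False
whether it was meant as HTML or as XML.

```python
import re

# Matches text that opens with a doctype or an <html> tag.
_DOCUMENT_START = re.compile(r"\s*(<!DOCTYPE\s+html|<html[\s>])", re.IGNORECASE)


def looks_like_html_document(text):
    """Guess whether text is a complete HTML document (heuristic only)."""
    return _DOCUMENT_START.match(text) is not None
```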

With enough work it would probably be possible to use that import to 
selectively activate the checker only during the doctest it was imported 
into.  That would be ideal to me.  Then you could use that to indicate
whether you prefer HTML or XML parsing for your checking.  I generally like
doctests to be standalone, so being able to enable your preferred 
checker directly in the doctest would certainly be nice.

>>> parse, serialize and fixuplinks:
>>>
>>> - I'll have to take a closer look at that to see if this makes sense
>>> in general.
>> The parse stuff is really just charset detection.  I don't think
>> lxml/libxml2 does this natively (checking the meta tag), but I'm not
>> actually 100% sure.
> 
> It does actually. You will see that when you pass in a unicode string that
> contains a meta-tag with some byte encoding (say, UTF-8). This will break
> immediately.
> 
> Note, however, that libxml2 requires a bit of structure to actually find the
> <meta> tag. Simply prepending a complete HTML document with such a tag (which
> I've seen in a couple of real-life broken HTML documents) will not work.

OK.  I don't know quite why we had that code; maybe we never tested 
exactly how it worked.  Including chardet as a fallback could be kind of 
interesting.  I don't usually deal with the web when it is *quite* that 
wild and woolly that its character sets have to be guessed, but I'd like 
to handle that nicely anyway.
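
A fallback chain could look roughly like the sketch below.  This is not
the code we had; decode_html and the latin-1 default are assumptions for
illustration, and chardet is treated as an optional third-party package
that is only used if it happens to be installed.

```python
import re

# Finds a charset declared in a <meta> tag near the start of the document.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE
)


def decode_html(data, default="latin-1"):
    """Decode HTML bytes: meta declaration, then chardet, then a default."""
    match = _META_CHARSET.search(data[:4096])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            return data.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration; keep trying
    try:
        import chardet  # optional dependency, assumed but not required
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(data).get("encoding")
        if guess:
            try:
                return data.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    # Last resort: never fail, replace anything undecodable.
    return data.decode(default, errors="replace")
```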

>> It should include parsing HTML fragments too, which
>> is a little hard (HTML() interprets all text as complete documents, and
>> adds in elements to make the document valid, which often isn't what
>> you'd want).
> 
> Maybe a simple approach here would be to check if a string starts with a known
> inner HTML tag, then just prefix it with <html><body> before parsing and
> return their child (or children) after parsing.

I'm comfortable (probably more comfortable) with different parsing 
functions.  I imagine parse, parse_fragment, and parse_element.  parse 
is like HTML(), parse_fragment returns a list of elements, and parse_element
returns only a single element (and raises an exception if you give it a
document with multiple elements).  Leading text for parse_fragment is a
little awkward.

In addition to returning the children, I'd like to break the reference 
to the artificial parent that was added in.  You can get at the parent 
with many kinds of queries, which can be confusing.
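
Roughly what I have in mind, as a sketch on top of lxml.etree.HTML.  The
function names are just the ones proposed above and nothing here exists
in lxml yet; the wrapping in <html><body> follows the suggestion quoted
earlier.  Leading text is simply dropped in this version, which is the
awkward part.

```python
from lxml import etree


def parse_fragment(text):
    """Return the top-level elements of an HTML fragment as a list."""
    root = etree.HTML("<html><body>%s</body></html>" % text)
    body = root.find("body")
    elements = list(body)
    for el in elements:
        # Detach from the artificial <body> so getparent() returns None.
        body.remove(el)
    return elements


def parse_element(text):
    """Return the single element in text, or raise ValueError."""
    elements = parse_fragment(text)
    if len(elements) != 1:
        raise ValueError("expected one element, got %d" % len(elements))
    return elements[0]
```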

>>> __init__.py:
>>>
>>> - some of this can be rewritten using plain XPath, e.g.
>>> get_parent_with_class
>>> (there's now RegExp support in lxml 1.3) or get_text (basically what
>>> 'string()' does). contains_class_xpath is not really much better than
>>> an XPath
>>> expression with variables, ditto for get_elements_by_class and
>>> get_rel_links,
>>> e.g. the latter is better written as:
>>>
>>>   get_rel_links = etree.XPath("descendant-or-self::a[@rel=$rel]")
>>>   get_rel_links(el, rel="whatever")
>> I tried doing class name matching with a regular expression, but never
>> got it to work.  It might have been a bug in my or lxml's code, I'm not
>> sure -- whatever it was, I was in a mind to move on ;).
> 
> I recently fixed a few problems with the regexp support, quite possibly
> it was those that stopped you.

Perhaps so.  \b wasn't working right for me, if I remember.

> Fredrik wrote a nice factory class for generating (X|HT)ML a while ago, I felt
> free to add it as "lxml.htmlbuilder" (although I'm still waiting for his reply
> to see if it can stay there to become part of lxml 1.3). But the other API
> side of parsing and treating HTML documents in a convenient way is much more
> ambitious.

How are attributes handled in his version?  That's always the place 
where opinions vary on builders.

-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

