[lxml-dev] atom model

Ian Bicking ianb at colorstudy.com
Fri Jun 8 19:03:33 CEST 2007


Stefan Behnel wrote:
> Ian Bicking wrote:
>> After writing an object model for Atom that serialized to and parsed from 
>> XML, I realized Atom and its XML representation really shouldn't be 
>> separated, so I created some custom elements to make it a bit easier to 
>> handle, while still using the XML as the sole source of information.
> 
> That's the way I would do it, too.
> 
> I brought up the idea of having an "lxml.elementlib" package with this kind of
> Namespace implementations a couple of times, but I now think it would actually
> make sense. However, "lxml.ns" might be a better name (XIST uses the same
> package name, BTW).
> 
> That way, we'd have "lxml.ns.html" and "lxml.ns.atom" and hopefully others in
> the future.

I'm not a fan of deep hierarchies myself; lxml.html and lxml.atom seem 
self-explanatory to me.  Maybe with more obscure formats it would be 
less so.

>> I'm a little unsure what to do about the namespaces.  Everything is in a 
>> namespace, but it's tedious to put in everywhere. I've put in some
>> little helper methods internally, and mostly you don't need to use the 
>> namespace globally, but I'm unsure about it all.  I've created an 
>> Element() function that automatically adds the namespace if no namespace 
>> is given; helps a little I guess.
> 
> As you say below, builder.py would definitely make this more usable. RSS is
> even the example FL uses to present the factory:
> 
> http://online.effbot.org/2006_11_01_archive.htm#et-builder-rss
> 
> 
>> The standard way of creating elements is a bit tedious, really.  I 
>> guess builder can help there a bit, though I don't see a way to give my 
>> own parser (which I need in this case).  Any suggestions about any of it 
>> are welcome.
> 
> I added a "parser" keyword argument to the factory (trunk), which reuses the
> "makeelement" method of the parser for Element creation.

OK; I suppose the pattern would then be that atom.E would be 
ElementMaker(parser=atom_parser)?  I don't really understand what 
typemap is in ElementMaker, but it looks like it's not important here. 
Although, looking at the RSS example, I suppose it could be used to 
write dates in the native Atom format.
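
A minimal sketch of that pattern, assuming a recent lxml in which the
factory accepts a makeelement callable (the parser's bound method is
passed here instead of a parser keyword).  The names atom_parser and
format_atom_date and the sample values are illustrative, not taken from
atom.py.  The typemap entry shows one way dates could come out in Atom's
timestamp format:

```python
from datetime import datetime, timezone

from lxml import etree
from lxml.builder import ElementMaker

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical stand-in for the atom module's own parser.
atom_parser = etree.XMLParser()


def format_atom_date(elem, value):
    # Typemap handlers are called with (element, value); a returned
    # string is added to the element as text.
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


E = ElementMaker(
    namespace=ATOM_NS,                    # applied to tags without a namespace
    nsmap={None: ATOM_NS},                # serialize Atom as the default namespace
    makeelement=atom_parser.makeelement,  # elements are created via the parser
    typemap={datetime: format_atom_date},
)

entry = E.entry(
    E.title("atom model"),
    E.updated(datetime(2007, 6, 8, 17, 3, 33, tzinfo=timezone.utc)),
)
print(etree.tostring(entry, pretty_print=True).decode())
```

Passing the namespace to the factory also takes care of the tedium of
spelling it out on every tag.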

Incidentally, something that builder doesn't do but many other XML 
builders do is check for any keyword arguments that end in _ and strip 
the _.  This lets you do class_=something, for_=something, etc., 
despite class and for being Python keywords.  It's handy.
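
A small sketch of the trailing-underscore convention, using the standard
library's ElementTree rather than builder itself.  The helper name
make_element is made up for illustration:

```python
import xml.etree.ElementTree as ET


def make_element(tag, text=None, **attrs):
    """Create an element, stripping one trailing "_" from keyword names."""
    cleaned = {}
    for name, value in attrs.items():
        if name.endswith("_"):
            name = name[:-1]  # class_ -> class, for_ -> for
        cleaned[name] = value
    elem = ET.Element(tag, cleaned)
    elem.text = text
    return elem


label = make_element("label", "Name", for_="name-field", class_="required")
print(ET.tostring(label, encoding="unicode"))
```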

>> https://svn.openplans.org/svn/TaggerStore/trunk/taggerstore/atom.py
> 
> A few remarks:
> 
> - for the lookup, you can either use the Namespace registry mechanism of lxml,
> which makes it a global setup (I'm considering making this parser-local in
> lxml 2.0)
> 
> http://codespeak.net/lxml/dev/element_classes.html#namespace-class-lookup
> http://codespeak.net/lxml/dev/element_classes.html#id1

I don't quite get what's going on there.  Does this mean that you would 
globally say that, say, {http://www.w3.org/2005/Atom}feed maps to 
lxml.atom.Feed?  That's not so bad, I guess, but I'm pretty comfortable 
with just using the parser in the atom module.  Feels a bit less surprising.
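
For comparison, a sketch of the parser-local variant, assuming an lxml
version that provides ElementNamespaceClassLookup.  The Feed class and
its get_title method are hypothetical.  Only documents parsed with
atom_parser get the custom class; anything else is untouched:

```python
from lxml import etree

ATOM_NS = "http://www.w3.org/2005/Atom"


class Feed(etree.ElementBase):
    # Hypothetical element class; the XML stays the only source of data.
    def get_title(self):
        return self.findtext("{%s}title" % ATOM_NS)


# Register {http://www.w3.org/2005/Atom}feed -> Feed on a lookup object.
lookup = etree.ElementNamespaceClassLookup()
lookup.get_namespace(ATOM_NS)["feed"] = Feed

# Attach the lookup to one parser instead of installing it globally.
atom_parser = etree.XMLParser()
atom_parser.set_element_class_lookup(lookup)

XML = '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title></feed>'

custom = etree.fromstring(XML, atom_parser)  # parsed with the Atom parser
plain = etree.fromstring(XML)                # parsed with the default parser
print(type(custom).__name__, custom.get_title())
print(type(plain).__name__)
```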

One of the things this got me to thinking about was augmenting HTML with 
specific microformat-related attributes.  I'm not sure how to do this at 
all.  For instance, imagine:

   doc.findall_vcards()

Returns things fitting the hCard microformat (elements with a class of 
"vcard").  The object really *is* the element with that vcard class 
(there's a weird ambiguity between the name vcard and hCard in this 
particular microformat).  And it would have attributes like fn, url, 
etc.  One of the funny parts of microformats is that a bit of HTML can 
be multiple microformats at the same time, by adding more classes.  So 
if you have a review of a business, you use hReview, and the item you 
are reviewing is an hCard.  Often the item and the hCard will be on the 
same element.  Which is handy and unambiguous, but an element can't be 
both kinds of objects at once.

So maybe in the microformat case something like findall_vcards() should 
really return an object that wraps the HTML: a kind of hCard view on the 
HTML.  It could still be stateless (ideally it would be, like with Atom).
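
A rough sketch of the view idea, using the standard library's ElementTree
on a well-formed snippet.  All helper and class names here (has_class,
ClassView, HCardView, ReviewItemView, find_views) and the sample markup
are invented for illustration.  Two views wrap the same element, and
neither stores any state of its own:

```python
import xml.etree.ElementTree as ET


def has_class(elem, name):
    # The class attribute holds a whitespace-separated list of names.
    return name in elem.get("class", "").split()


class ClassView(object):
    """Stateless view: every read goes back to the wrapped element."""
    root_class = None

    def __init__(self, element):
        self.element = element

    def _text_of(self, class_name):
        for child in self.element.iter():
            if has_class(child, class_name):
                return "".join(child.itertext()).strip()
        return None


class HCardView(ClassView):
    root_class = "vcard"

    def get_fn(self):
        return self._text_of("fn")


class ReviewItemView(ClassView):
    root_class = "item"

    def get_name(self):
        return self._text_of("fn")


def find_views(root, view_class):
    return [view_class(el) for el in root.iter()
            if has_class(el, view_class.root_class)]


doc = ET.fromstring(
    '<div class="hreview">'
    '<span class="summary">Good coffee</span>'
    '<div class="item vcard"><span class="fn">Cafe Example</span></div>'
    '</div>'
)
card = find_views(doc, HCardView)[0]
item = find_views(doc, ReviewItemView)[0]
print(card.get_fn(), "|", item.get_name(), "|", card.element is item.element)
```

Since the views only hold a reference to the element, any change made to
the HTML is visible through every view of it.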

> or just a dictionary (.get) instead of the many ifs.

Sure.
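
A sketch of the dictionary variant, assuming lxml's
CustomElementClassLookup.  The Feed and Entry classes and the mapping
are hypothetical, not copied from atom.py.  Returning None hands the
decision to the fallback lookup, so unknown tags get the default class:

```python
from lxml import etree

ATOM_NS = "http://www.w3.org/2005/Atom"


class Feed(etree.ElementBase):
    pass


class Entry(etree.ElementBase):
    pass


# One mapping replaces a chain of "if name == ..." tests.
ATOM_CLASSES = {"feed": Feed, "entry": Entry}


class AtomLookup(etree.CustomElementClassLookup):
    def lookup(self, node_type, document, namespace, name):
        if node_type == "element" and namespace == ATOM_NS:
            return ATOM_CLASSES.get(name)  # None means "use the fallback"
        return None


parser = etree.XMLParser()
parser.set_element_class_lookup(AtomLookup())

root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom"><entry/><title/></feed>',
    parser,
)
print([type(el).__name__ for el in root.iter()])
```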

> - please avoid "@property" as lxml wants to stay compatible with Python 2.3.

Sure



-- 
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
             | Write code, do good | http://topp.openplans.org/careers

