[lxml-dev] atom model
Ian Bicking
ianb at colorstudy.com
Fri Jun 8 19:03:33 CEST 2007
Stefan Behnel wrote:
> Ian Bicking wrote:
>> After writing an object model for Atom that serialized to and parsed from
>> XML, I realized Atom and its XML representation really shouldn't be
>> separated, so I created some custom elements to make it a bit easier to
>> handle, while still using the XML as the sole source of information.
>
> That's the way I would do it, too.
>
> I brought up the idea of having an "lxml.elementlib" package with these kinds of
> Namespace implementations a couple of times, but I now think it would actually
> make sense. However, "lxml.ns" might be a better name (XIST uses the same
> package name, BTW).
>
> That way, we'd have "lxml.ns.html" and "lxml.ns.atom" and hopefully others in
> the future.
I'm not a fan of deep hierarchies myself; lxml.html and lxml.atom seem
self-explanatory to me. Maybe with more obscure formats it would be
less so.
>> I'm a little unsure what to do about the namespaces. Everything is in a
>> namespace, but it's tedious to put in everywhere. I've put in some
>> little helper methods internally, and mostly you don't need to use the
>> namespace globally, but I'm unsure about it all. I've created an
>> Element() function that automatically adds the namespace if no namespace
>> is given; helps a little I guess.
>
> As you say below, builder.py would definitely make this more usable. RSS is
> even the example FL uses to present the factory:
>
> http://online.effbot.org/2006_11_01_archive.htm#et-builder-rss
>
>
>> The standard ways of creating elements are a bit tedious, really. I
>> guess builder can help there a bit, though I don't see a way to give my
>> own parser (which I need in this case). Any suggestions about any of it
>> are welcome.
>
> I added a "parser" keyword argument to the factory (trunk), which reuses the
> "makeelement" method of the parser for Element creation.
OK; I suppose the pattern would then be that atom.E would be
ElementMaker(parser=atom_parser)? I don't really understand what
typemap is in ElementMaker, but it looks like it's not important here.
Though, looking at the RSS example, I suppose it could be used to emit
dates in the native Atom format.
Incidentally, something that builder doesn't do but many other XML
builders do is check for any keyword arguments that end in _ and then
strip the _. This lets you write class_=something, for_=something, etc.
It's handy.
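A minimal sketch of that trailing-underscore convention, written against the standard library's ElementTree rather than lxml's builder. make_element is a hypothetical helper for illustration only; it is not part of lxml.builder.

```python
import xml.etree.ElementTree as ET


def make_element(tag, *children, **attrib):
    """Hypothetical helper: build an element, stripping one trailing "_"
    from keyword names so reserved words can be used as attributes."""
    cleaned = {}
    for name, value in attrib.items():
        if name.endswith("_"):
            name = name[:-1]  # class_ -> class, for_ -> for
        cleaned[name] = value
    elem = ET.Element(tag, cleaned)
    for child in children:
        if isinstance(child, str):
            # Text before the first child element goes in .text,
            # text after a child element goes in that child's .tail.
            if len(elem):
                elem[-1].tail = (elem[-1].tail or "") + child
            else:
                elem.text = (elem.text or "") + child
        else:
            elem.append(child)
    return elem


label = make_element("label", "Name", for_="fn", class_="required")
print(ET.tostring(label, encoding="unicode"))
```

Only a single trailing underscore is removed, so an attribute whose real name ends in _ would still need to be set some other way.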
>> https://svn.openplans.org/svn/TaggerStore/trunk/taggerstore/atom.py
>
> A few remarks:
>
> - for the lookup, you can either use the Namespace registry mechanism of lxml,
> which makes it a global setup (I'm considering making this parser-local in
> lxml 2.0)
>
> http://codespeak.net/lxml/dev/element_classes.html#namespace-class-lookup
> http://codespeak.net/lxml/dev/element_classes.html#id1
I don't quite get what's going on there. Does this mean you would
declare globally that, say, {http://www.w3.org/2005/Atom}feed maps to
lxml.atom.Feed? That's not so bad, I guess, but I'm pretty comfortable
with just using the parser in the atom module. Feels a bit less surprising.
One thing this got me thinking about was augmenting HTML with
specific microformat-related attributes. I'm not sure how to do this at
all. For instance, imagine:
doc.findall_vcards()
Returns things fitting the hCard microformat (elements with a class of
"vcard"). The object really *is* the element with that vcard class
(there's a weird ambiguity between the name vcard and hCard in this
particular microformat). And it would have attributes like fn, url,
etc. One of the funny parts of microformats is that a bit of HTML can
be multiple microformats at the same time, by adding more classes. So
if you have a review of a business, you use hReview and the item you
are reviewing is an hCard. Often the item and the hCard will be on the
same element. Which is handy and unambiguous, but an element can't be
both kinds of objects at once.
So maybe in the microformat case something like findall_vcards()
should really return an object that wraps the HTML: a kind of hCard view
on the HTML. It could still be stateless (ideally it would be, as with Atom).
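A rough sketch of what such a view could look like, assuming well-formed markup and using the standard library's ElementTree in place of lxml. MicroformatView, HCardView and ReviewItemView are hypothetical names, and findall_vcards is written as a plain function rather than a method on the document. The views keep no state of their own; every read goes back to the wrapped element, so two views can share one element. Plain methods are used instead of properties.

```python
import xml.etree.ElementTree as ET


def has_class(elem, name):
    """True if `name` is one of the space-separated classes on `elem`."""
    return name in (elem.get("class") or "").split()


class MicroformatView:
    """Wraps an element; stores nothing except the element itself."""

    def __init__(self, element):
        self.element = element

    def _first(self, class_name):
        # iter() yields the wrapped element first, then its descendants.
        for el in self.element.iter():
            if has_class(el, class_name):
                return el
        return None


class HCardView(MicroformatView):
    def fn(self):
        el = self._first("fn")
        return None if el is None else "".join(el.itertext()).strip()

    def url(self):
        el = self._first("url")
        return None if el is None else el.get("href")


class ReviewItemView(MicroformatView):
    def name(self):
        el = self._first("fn")
        return None if el is None else "".join(el.itertext()).strip()


def findall_vcards(root):
    return [HCardView(el) for el in root.iter() if has_class(el, "vcard")]


doc = ET.fromstring(
    '<div class="hreview">'
    '<span class="item vcard">'
    '<a class="fn url" href="http://example.com/">Example Cafe</a>'
    '</span>'
    '<span class="rating">4</span>'
    '</div>'
)

card = findall_vcards(doc)[0]
item = ReviewItemView(card.element)  # a second view on the same element
print(card.fn(), card.url(), item.name())
```

Because the element stays the single source of information, changing the underlying HTML is immediately visible through every view that wraps it.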
> or just a dictionary (.get) instead of the many ifs.
Sure.
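Something along these lines, as a sketch only. The class names and the lookup_class function are placeholders for illustration, not the actual code in atom.py.

```python
ATOM_NS = "http://www.w3.org/2005/Atom"


class AtomElement:
    """Placeholder base class for Atom elements."""


class Feed(AtomElement):
    pass


class Entry(AtomElement):
    pass


# One table instead of a chain of ifs; anything not listed falls back
# to the generic AtomElement class.
_CLASSES = {
    "feed": Feed,
    "entry": Entry,
}


def lookup_class(tag):
    """Map a "{namespace}localname" tag to a class, or None if the tag
    is not in the Atom namespace."""
    if not tag.startswith("{"):
        return None
    namespace, _, localname = tag[1:].partition("}")
    if namespace != ATOM_NS:
        return None
    return _CLASSES.get(localname, AtomElement)


print(lookup_class("{http://www.w3.org/2005/Atom}feed").__name__)
```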
> - please avoid "@property" as lxml wants to stay compatible with Python 2.3.
Sure.
--
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org
| Write code, do good | http://topp.openplans.org/careers