[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element

Stefan Behnel stefan_ml at behnel.de
Fri Jul 18 15:12:05 CEST 2008


Hi,

Dr R. Sanderson wrote:
> On Fri, 18 Jul 2008, Ivan Begtin wrote:
>> For some processing procedures I need to calculate lots of metadata
>> about each tag, wherever it is located on the page and whatever size it is.
>> The calculation is quite a simple task; it is much harder to map the
>> calculated values back to the specific nodes. For example, I could generate
>> an MD5 hash value for each 'tag.text' and keep it in a dictionary somewhere
>> (DB / Python dict / etc.), but later I need to find the tag again by that
>> hash, and this is not so simple.
>> The same situation arises when processing a saved HTML file. For example, if
>> I've found an important block of tags during the initial HTML parse, I want
>> to find this block later in a saved copy of this HTML page. Yes, sure, I can
>> generate the original XPath query for this tag to find it later, but that is
>> a resource-consuming operation and surely could be simplified.
>
> I have a similar requirement, and approached it in the following way:
>
> While processing the elements of interest, generate the full path of the
> element, hash() it and put it into a dictionary.
>
> e.g.:
>
> tree = node.getroottree()
> elems = node.xpath('//p')
> info = {}
> for e in elems:
>    # hash the element's absolute path to get an integer ID
>    eid = abs(hash(tree.getpath(e)))
>    # processing here...
>
> Then later I can check if the data came from the same element or not by
> comparing the eids.

That's a nice way of doing it. If you additionally want to make the ID stick
with the Element to avoid recalculation or to make it survive a tree split,
you can store the hash value as xml:id of the node, as in

  # md5sum() is a placeholder for a hex-digest helper, e.g. built on hashlib.md5
  e.set("{http://www.w3.org/XML/1998/namespace}id", md5sum(tree.getpath(e)))

Just remember to remove all xml:id attributes and run cleanup_namespaces() in
lxml 2.1 before you serialise it to plain HTML.
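
A minimal sketch of the steps above, assuming a small inline document for
illustration. The helper names path_id(), tag_elements() and strip_ids() are
hypothetical and not part of lxml. The "id-" prefix is also an assumption,
added because an xml:id value must not start with a digit and an MD5 hex
digest may do so.

```python
import hashlib
from lxml import etree

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def path_id(tree, elem):
    # Hypothetical helper: derive an ID from the element's absolute path.
    # The "id-" prefix keeps the value from starting with a digit.
    path = tree.getpath(elem)
    return "id-" + hashlib.md5(path.encode("utf-8")).hexdigest()

def tag_elements(tree, xpath_expr):
    # Store the ID on each matching element and return an ID -> element map.
    index = {}
    for elem in tree.xpath(xpath_expr):
        eid = path_id(tree, elem)
        elem.set(XML_ID, eid)
        index[eid] = elem
    return index

def strip_ids(tree):
    # Remove every xml:id attribute again, then drop namespace
    # declarations that are no longer used.
    for elem in tree.iter("*"):
        elem.attrib.pop(XML_ID, None)
    etree.cleanup_namespaces(tree)

# Illustrative input only; any parsed document works the same way.
root = etree.fromstring(
    "<html><body><p>one</p><div><p>two</p></div></body></html>")
tree = root.getroottree()

index = tag_elements(tree, "//p")
tagged = etree.tostring(root)    # serialised copy that carries the IDs

strip_ids(tree)
plain = etree.tostring(root)     # serialised copy without the IDs
```

Unlike the built-in hash(), an MD5 digest of the path does not change between
interpreter runs, so the same ID can be recomputed later from a saved copy as
long as the element's path has not changed.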

Stefan

