[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element

Dr R. Sanderson azaroth at liverpool.ac.uk
Fri Jul 18 14:37:58 CEST 2008


Hi all,

I have a similar requirement, and approached it in the following way:

While processing the elements of interest, generate the full path of the 
element, hash() it and put it into a dictionary.

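e.g.:

# node is any element of the already parsed document
tree = node.getroottree()
elems = node.xpath('//p')
info = {}
for e in elems:
    # integer id derived from the element's absolute path
    eid = abs(hash(tree.getpath(e)))
    # processing here... (results go into info, keyed by eid)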


I need it to be an integer for storage reasons.

Then later I can check if the data came from the same element or not by 
comparing the eids.
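
To make the round trip concrete, here is a minimal sketch of the same idea. It is not a definitive implementation: the sample document and the helper name path_id are made up for illustration, and hashlib.md5 is swapped in for hash(), because on Python 3 the built-in hash() of a string is salted per process and so would not give the same eid from one run to the next. The path is stored next to the eid so the element can be found again with tree.xpath().

```python
import hashlib
from lxml import etree


def path_id(tree, elem):
    """Hypothetical helper: return (eid, path) for an element.

    eid is a non-negative integer below 2**63 derived from the element's
    absolute path, so it is the same on every run for the same path.
    """
    path = tree.getpath(elem)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    eid = int.from_bytes(digest[:8], "big") >> 1  # drop one bit: fits a signed 64-bit field
    return eid, path


# Made-up sample document, only for illustration.
doc = etree.fromstring(
    "<html><body><p>one</p><div><p>two</p></div></body></html>"
)
tree = doc.getroottree()

# First pass: compute metadata and store it keyed by eid.
info = {}
for e in doc.xpath("//p"):
    eid, path = path_id(tree, e)
    info[eid] = {"path": path, "length": len(e.text)}

# Later: resolve each stored path back to its element and compare eids.
for eid, meta in info.items():
    found = tree.xpath(meta["path"])[0]
    assert path_id(tree, found)[0] == eid
```

The paths returned by getpath() are positional, so this only identifies the same element as long as the document structure has not changed between the two passes.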

Not sure if that helps, but it might be useful to someone! :)

Rob


On Fri, 18 Jul 2008, Ivan Begtin wrote:

> Hi,
>  I am working on several HTML processing algorithms for so-called
> "structured data extraction". Earlier I used BeautifulSoup and
> ElementTree/XPath for these tasks, but now I am looking for something more
> advanced.
>
> For some of the processing procedures I need to calculate lots of metadata
> about each tag, wherever it is located on the page and whatever size it is.
> The calculation is quite a simple task; it is much harder to map the
> calculated values back to specific nodes. For example, I could generate an
> MD5 hash value for each 'tag.text' and keep it in a dictionary somewhere
> (DB / Python dict / etc.), but later I need to find the tag again by the
> specified hash, and this is not so simple.
> The same situation arises when processing a saved HTML file. For example, if
> I found an important block of tags during the initial HTML parse, I want to
> find this block later in a saved copy of this HTML page. Yes, sure, I can
> generate the original XPath query for this tag to find it later, but it is a
> resource-consuming operation and surely could be simplified.
>
> I've tried one way to achieve this by creating a specific Element class
> with metadata populated in _init(), so I don't lose the connection between
> the metadata and the node itself. The problem is that lxml generates Elements
> dynamically, so when I created a custom xpath method to obtain the metadata,
> performance degraded significantly since the Elements were created once again.
>
> I propose one of two changes, or both of them:
> 1. Automatically generate an ID, unique for the tag within this etree
> instance, which will be the same for different instances of the Element.
>   This makes it possible to find the unique id of a tag and to get the node
> by this unique id.
>   For example, it could be a getNodeByID(id) method of etree, and each
> Element would have an "id" property.
>
> 2. Persistent user data for Element
>
>   Another solution would be if lxml could support "data transfer" between
> dynamic instances of Element for the same tag.
>     For example, it could be setUserData(key, value) and getUserData(key)
> methods for Element and a 'userdata' property to access it. So if I set my
> data for this Element once, I will be able to access it in any
>   following Element instance of this tag.
>
> If I have missed something and this can already be done with the current
> lxml API, please let me know.
>
> Best Regards,
>   Ivan Begtin
>

