[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element
Ivan Begtin
ibegtin at gmail.com
Fri Jul 18 14:21:57 CEST 2008
Hi,
I am working on several HTML processing algorithms for so-called
"structured data extraction". Previously I used BeautifulSoup and
ElementTree/XPath for these tasks, but I am now looking for something more
advanced.
For some of the processing procedures I need to calculate a lot of metadata
about each tag, wherever it is located on the page and whatever its size.
The calculation itself is quite simple; it is much harder to map the
calculated values back to the specific nodes. For example, I could generate
an MD5 hash of each 'tag.text' and keep it in a dictionary somewhere (DB,
Python dict, etc.), but later I need to find the tag again by a given hash,
and this is not so simple.
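As a minimal sketch of the hash bookkeeping described above: the example
below uses the standard library's xml.etree.ElementTree, an assumption made
so that it runs without lxml installed (lxml.etree exposes a largely
compatible API). The PAGE sample and the text_hash helper are illustrative
names, not part of any library.

```python
import hashlib
import xml.etree.ElementTree as ET

# Small well-formed sample page (illustrative only).
PAGE = "<html><body><div><p>first</p><p>second</p></div><p>first</p></body></html>"


def text_hash(element):
    """Return the MD5 hex digest of the element's text ('' if it has none)."""
    return hashlib.md5((element.text or "").encode("utf-8")).hexdigest()


root = ET.fromstring(PAGE)

# hash -> list of elements, because several tags may share the same text
index = {}
for element in root.iter():
    index.setdefault(text_hash(element), []).append(element)

# Looking a tag up again by its hash
matches = index[hashlib.md5(b"first").hexdigest()]
print([m.tag for m in matches])
```

This only works while the parsed tree and the dictionary live in the same
process; a hash stored in a DB says nothing about where the tag sits in a
freshly parsed copy, which is the difficulty described here.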
The same situation arises when processing a saved HTML file. For example, if
I have found an important block of tags during the initial HTML parse, I want
to find this block later in the saved copy of this HTML page. Yes, sure, I can
generate an XPath query for this tag to find it later, but it is a
resource-consuming operation and surely could be simplified.
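To make the XPath round trip concrete, here is a rough sketch, again written
against the standard library's xml.etree.ElementTree as an assumption. The
build_path helper is hypothetical; lxml offers getpath() on its ElementTree
objects for a similar purpose.

```python
import xml.etree.ElementTree as ET

PAGE = "<html><body><div><p>a</p><p>b</p></div><div><p>c</p></div></body></html>"


def build_path(root, target):
    """Return a positional path from root to target, e.g. './body[1]/div[2]/p[1]'."""
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        # Path positions are 1-based.
        steps.append("%s[%d]" % (node.tag, same_tag.index(node) + 1))
        node = parent
    return "./" + "/".join(reversed(steps)) if steps else "."


first_tree = ET.fromstring(PAGE)
target = first_tree.findall(".//p")[2]  # the <p>c</p> element
path = build_path(first_tree, target)

# Parsing the same text again stands in for reading the saved copy.
second_tree = ET.fromstring(PAGE)
found = second_tree.find(path)
print(path, found.text)
```

Building the parent map walks the whole tree, which illustrates why
generating a path per tag is not free on large pages.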
I've tried one way to achieve this by creating a custom Element class
with metadata populated in _init(), so I don't lose the connection between
the metadata and the node itself. The problem is that lxml generates Elements
dynamically, so when I created a custom xpath method to obtain the metadata,
performance degraded significantly since the Elements were created once again.
I propose one of two changes or both of them:
1. Automatically generate an ID that is unique for the tag within this etree
instance and is the same for different instances of the Element.
This would make it possible to get the unique ID of a tag and to get the
node back by this unique ID.
For example, it could be a getNodeByID(id) method of etree, and each
Element would have an "id" property.
2. Persistent user data for Element
Another solution would be for lxml to support "data transfer" between
dynamic instances of the Element of the same tag.
For example, it could be setUserData(key, value) and getUserData(key)
methods on Element and a 'userdata' property to access it. So if I set my data
for this Element once, I will be able to access it in any
subsequent Element instance of this tag.
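To show what the two proposals could look like from the user's side, here is
a sketch that emulates them outside the library. NodeIndex and all of its
methods are hypothetical names and are not part of lxml; the example uses the
standard library's xml.etree.ElementTree as an assumption, and the IDs are
simply document-order positions, so they stay valid only while the tree is
not modified.

```python
import xml.etree.ElementTree as ET


class NodeIndex:
    """Hypothetical helper emulating both proposals on top of a parsed tree."""

    def __init__(self, root):
        # Document order gives every element a stable integer ID.
        self._nodes = list(root.iter())
        self._ids = {element: i for i, element in enumerate(self._nodes)}
        self._user_data = {}  # node ID -> {key: value}

    def id_of(self, element):
        return self._ids[element]

    def get_node_by_id(self, node_id):
        return self._nodes[node_id]

    def set_user_data(self, element, key, value):
        self._user_data.setdefault(self.id_of(element), {})[key] = value

    def get_user_data(self, element, key, default=None):
        return self._user_data.get(self.id_of(element), {}).get(key, default)


PAGE = "<html><body><p>a</p><p>b</p></body></html>"
root = ET.fromstring(PAGE)
index = NodeIndex(root)

second_p = root.findall(".//p")[1]
node_id = index.id_of(second_p)
index.set_user_data(second_p, "score", 0.75)

# Later: get the node back by ID and read the data stored for it.
again = index.get_node_by_id(node_id)
print(node_id, again.text, index.get_user_data(again, "score"))
```

Because the user data is keyed by ID rather than by Python object, it does
not depend on which Element instance is used to reach the tag.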
If I have missed something and this can already be done with the current
lxml API, please let me know.
Best Regards,
Ivan Begtin