[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element
Dr R. Sanderson
azaroth at liverpool.ac.uk
Fri Jul 18 14:37:58 CEST 2008
Hi all,
I have a similar requirement, and approached it in the following way:
While processing the elements of interest, generate the full path of the
element, hash() it and put it into a dictionary.
eg:
tree = node.getroottree()
elems = node.xpath('//p')
info = {}
for e in elems:
    eid = abs(hash(tree.getpath(e)))
    # processing here...
I need it to be an integer for storage reasons.
Then later I can check if the data came from the same element or not by
comparing the eids.
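
One caveat: on Python 3.3 and later, hash() of a string is randomized per process unless PYTHONHASHSEED is set, so the eids above are only comparable within a single run. If they need to survive across runs (e.g. when stored in a DB), a digest-based id is one option. The following is a minimal sketch, not part of lxml: stable_eid is a hypothetical helper, and the paths are made-up examples of the kind of string tree.getpath(e) returns.

```python
import hashlib


def stable_eid(path: str) -> int:
    """Map an element path to a non-negative 63-bit integer.

    Unlike the built-in hash(), the result does not change between
    interpreter runs, because it is derived from a SHA-1 digest.
    """
    digest = hashlib.sha1(path.encode("utf-8")).digest()
    # Use the first 8 bytes and mask off the top bit so the value
    # fits into a signed 64-bit integer column.
    return int.from_bytes(digest[:8], "big") & 0x7FFFFFFFFFFFFFFF


# Hypothetical paths; with lxml these would come from tree.getpath(e).
paths = ["/html/body/div/p[1]", "/html/body/div/p[2]"]

info = {}
for path in paths:
    eid = stable_eid(path)
    # Keep the path next to the data so the element can be located again.
    info[eid] = {"path": path}
```

Since getpath() returns an XPath expression, keeping the path alongside the eid should also allow finding the element again later with tree.xpath(path), as long as the document structure has not changed.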
Not sure if that helps, but it might be useful to someone! :)
Rob
On Fri, 18 Jul 2008, Ivan Begtin wrote:
> Hi,
> I am working on several HTML processing algorithms for so-called
> "structured data extraction". Earlier I used BeautifulSoup and
> ElementTree/XPath for these tasks, but I am now looking for something more
> advanced.
>
> For some of the processing procedures I need to calculate lots of metadata
> about each tag, wherever it is located on the page and whatever size it is.
> The calculation is quite simple; it is much harder to map the calculated values
> back to the specific nodes. For example, I could generate an MD5 hash value for
> each 'tag.text' and keep it in a dictionary somewhere (DB, Python dict, etc.),
> but later I need to find the tag again by the specified hash, and this is not
> so simple.
> The same situation arises when processing a saved HTML file. For example, if I
> have found an important block of tags during the initial HTML parse, I want to
> find this block later in a saved copy of this HTML page. Yes, sure, I can
> generate the original XPath query for this tag to find it later, but it's a
> resource-consuming operation and surely could be simplified.
>
> I've tried one way to achieve this by creating a specific Element class
> with metadata populated in _init(), so I don't lose the connection between
> the metadata and the node itself. The problem is that lxml generates Elements
> dynamically, so when I created a custom xpath method to obtain the metadata,
> performance degraded significantly since the Elements were created once again.
>
> I propose one of two changes, or both of them:
> 1. Automatically generate an ID unique to each tag in this etree instance,
> which will be the same for different instances of the Element.
> This makes it possible to find the unique ID of a tag and to get a node by
> this unique ID.
> For example, it could be a getNodeByID(id) method of etree, and each
> Element would have an "id" property.
>
> 2. Persistent user data for Element
>
> Another solution would be if lxml could support "data transfer" between
> dynamic instances of Element for the same tag.
> For example, it could be setUserData(key, value) and getUserData(key)
> methods for Element and a 'userdata' property to access it. So if I set my data
> for this Element once, I will be able to access it in any
> following Element instance of this tag.
>
> If I missed something and this can already be done with the current lxml API,
> please let me know.
>
> Best Regards,
> Ivan Begtin
>