[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element
Paul Everitt
paul at agendaless.com
Fri Jul 18 17:32:15 CEST 2008
I don't know if it would help, but you could run the tree through an
XSLT and leverage the generate-id() XPath function:
http://www.dpawson.co.uk/xsl/sect2/N4416.html
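A minimal sketch of that idea, assuming it is acceptable to stamp the generated value onto each element as an attribute. The attribute name "data-gid" is made up for illustration. The values generate-id() returns are processor-specific and only guaranteed unique within one transformation, so the point is to write them into the tree once and save that copy:

```python
from lxml import etree

# Identity transform that also stamps every element with generate-id().
# "data-gid" is a hypothetical attribute name chosen for this example.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="*" priority="1">
    <xsl:copy>
      <xsl:attribute name="data-gid">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)

doc = etree.XML("<html><body><p>one</p><p>two</p></body></html>")
stamped = transform(doc)

# Collect the stamped IDs in document order: html, body, p, p.
ids = [el.get("data-gid") for el in stamped.getroot().iter("*")]

# Later (for example after saving and re-parsing the stamped copy),
# look a node up again by its ID.
found = stamped.xpath("//*[@data-gid = $gid]", gid=ids[2])
print(found[0].text)
```

Since the IDs live in the serialized document, a lookup on the saved copy is a plain attribute match rather than a regenerated path expression.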
--Paul
Ivan Begtin wrote:
> Hi,
> I am working on several HTML processing algorithms for so-called
> "structured data extraction". Earlier I used BeautifulSoup and
> ElementTree/XPath for these tasks, but I am now looking for something more
> advanced.
>
> For some of the processing procedures I need to calculate a lot of metadata
> about each tag, wherever it is located on the page and whatever size it is.
> The calculation is quite simple; it is much harder to map the calculated
> values back to the specific nodes. For example, I could generate an MD5 hash
> value for each 'tag.text' and keep it in a dictionary somewhere (DB / Python
> dict / etc.), but later I need to find the tag again by a given hash, and
> this is not so simple.
> The same situation arises when processing a saved HTML file. For example, if
> I found an important block of tags during the initial HTML parse, I want to
> find this block later in a saved copy of this HTML page. Yes, sure, I can
> generate the original XPath query for this tag to find it later, but it's
> a resource-consuming operation and surely could be simplified.
>
> I've tried one way to achieve this by creating a specific Element
> class with metadata populated in _init(), so I don't lose the connection
> between the metadata and the node itself. The problem is that lxml generates
> Elements dynamically, so when I created a custom xpath method to obtain
> metadata, performance degraded significantly since the Elements were created
> once again.
>
> I propose one of two changes, or both of them:
> 1. Automatically generate an ID that is unique for each tag in this etree
> instance and stays the same across different instances of the Element,
> so that one can get the unique ID of a tag and get the node back by
> this unique ID.
> For example, it could be a getNodeByID(id) method of etree, and each
> Element would have an "id" property.
>
> 2. Persistent user data for Element
>
> Another solution would be if lxml could support "data transfer"
> between dynamic instances of Element for the same tag.
> For example, it could be setUserData(key, value) and
> getUserData(key) methods for Element and a 'userdata' property to access
> it. So if I set my data for this Element once, I will be able to access
> it in any following Element instance of this tag.
>
> If I missed something and it can already be done using the current lxml API,
> please let me know.
>
> Best Regards,
> Ivan Begtin
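
With the current API, one possible workaround (a sketch, not something confirmed in this thread) is to keep a plain Python dict keyed by the Element objects themselves. It relies on lxml handing back the same Python proxy object for a node for as long as a reference to that proxy is alive, and the dict keys are such references. The names `metadata` and `by_hash` are made up for illustration:

```python
import hashlib
from lxml import etree

root = etree.XML("<html><body><p>one</p><p>two</p></body></html>")

metadata = {}   # Element -> calculated values (hypothetical name)
by_hash = {}    # MD5 hex digest -> list of Elements (hypothetical name)

for el in root.iter("*"):
    digest = hashlib.md5((el.text or "").encode("utf-8")).hexdigest()
    # Storing the Element as a dict key keeps its proxy alive, so later
    # lookups through XPath or iteration return this same object.
    metadata[el] = {"md5": digest}
    by_hash.setdefault(digest, []).append(el)

# Node -> metadata: an Element obtained later maps back to its entry.
first_p = root.xpath("//p")[0]
print(metadata[first_p]["md5"])

# Hash -> node: find the tag again from a stored hash value.
wanted = hashlib.md5(b"one").hexdigest()
print([el.tag for el in by_hash[wanted]])
```

This only covers a tree that stays in memory; for a saved copy of the page the values would have to be written into the document itself or stored alongside a path to each node.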