[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element

Paul Everitt paul at agendaless.com
Fri Jul 18 17:32:15 CEST 2008


I don't know if it would help, but you could run the tree through an 
XSLT stylesheet and use the XSLT generate-id() function:

   http://www.dpawson.co.uk/xsl/sect2/N4416.html
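
As a rough sketch of what I mean (not tested against your data; the
"gen-id" attribute name and the tag_with_ids() helper are made up for
illustration), a stylesheet can copy every element and stamp it with
generate-id(). Note that XSLT does not promise the same identifiers
across separate transformation runs, so the IDs only become useful once
they are written into the output tree:

```python
from lxml import etree

# Copies every element and adds a hypothetical "gen-id" attribute that
# holds generate-id() for that node. Text is copied by the built-in
# rules; comments and processing instructions are dropped.
STYLESHEET = etree.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:attribute name="gen-id">
        <xsl:value-of select="generate-id()"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
""")


def tag_with_ids(doc):
    """Return a copy of doc in which each element carries a gen-id attribute."""
    transform = etree.XSLT(STYLESHEET)
    return transform(doc)


doc = etree.XML(b"<html><body><p>one</p><p>two</p></body></html>")
result = tag_with_ids(doc)

# Collect the generated IDs in document order.
ids = [el.get("gen-id") for el in result.getroot().iter()]

# Look a node up again by its generated ID.
matches = result.xpath("//*[@gen-id=$wanted]", wanted=ids[-1])
```

If the result is serialized with etree.tostring(result), the IDs travel
with the saved copy and can be queried the same way after re-parsing.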

--Paul

Ivan Begtin wrote:
> Hi,
>   I am working on several HTML processing algorithms for so-called 
> "structured data extraction". Earlier I used BeautifulSoup and 
> ElementTree/XPath for these tasks, but now I'm looking for something 
> more advanced.
>  
> For some of the processing procedures I need to calculate lots of 
> metadata about each tag, wherever it is located on the page and 
> whatever its size. The calculation itself is quite simple; it is much 
> harder to map the calculated values back to specific nodes. For 
> example, I could generate an MD5 hash of each 'tag.text' and keep it 
> in a dictionary somewhere (DB / Python dict / etc.), but later I need 
> to find the tag again by its hash, and this is not so simple.
> The same situation arises when processing a saved HTML file. For 
> example, if I find an important block of tags during the initial HTML 
> parse, I want to find this block later in a saved copy of this HTML 
> page. Yes, sure, I can generate the original XPath query for this tag 
> to find it later, but that is a resource-consuming operation and 
> surely could be simplified.
> 
> I tried to achieve this by creating a specific Element class with 
> metadata populated in _init(), so I don't lose the connection between 
> the metadata and the node itself. The problem is that lxml generates 
> Elements dynamically, so when I created a custom XPath method to 
> obtain the metadata, performance degraded significantly since the 
> Elements were created once again.
> 
> I propose one of two changes or both of them:
> 1. Automatically generate an ID that is unique for a tag within this 
> etree instance and is the same across different instances of the 
> Element.
>    This would make it possible to find the unique ID of a tag and to 
> get a node by this unique ID.
>    For example, it could be a getNodeByID(id) method of etree, and 
> each Element would have an "id" property.
> 
> 2. Persistent user data for Element
> 
>    Another solution would be if lxml could support "data transfer" 
> between dynamic instances of Element for the same tag.
>      For example, it could be setUserData(key, value) and 
> getUserData(key) methods for Element and a 'userdata' property to 
> access it. So if I set my data for this Element once, I will be able 
> to access it in any
>    following Element instance of this tag. 
> 
> If I missed something and this can already be done with the current 
> lxml API, please let me know.
> 
> Best Regards,
>    Ivan Begtin
> 

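For the XPath-as-key approach mentioned in the quoted message, a minimal
sketch with the existing API might look like the following. It assumes
the document structure does not change between the two parses; the
metadata dict and its "text_len" field are hypothetical placeholders for
whatever is actually computed per tag.

```python
from lxml import etree

html = b"<html><body><div><p>first</p><p>second</p></div></body></html>"

# First pass: compute metadata and key it by each element's positional
# path, as returned by ElementTree.getpath().
tree = etree.ElementTree(etree.XML(html))
metadata = {}
for el in tree.getroot().iter():
    metadata[tree.getpath(el)] = {"text_len": len(el.text or "")}

# Later pass: re-parse the saved copy and map the stored paths back to
# the corresponding nodes in the new tree.
saved = etree.ElementTree(etree.XML(html))
found = {path: saved.xpath(path)[0] for path in metadata}
```

The side table lives outside the tree, so it does not depend on which
Element instance lxml hands back for a given node.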