[lxml-dev] Proposal: Automatic unique ID generation for each tag or persistent user data for Element

Ivan Begtin ibegtin at gmail.com
Fri Jul 18 17:03:43 CEST 2008


Stefan Behnel writes:

> Could you explain what an html comment with an attribute is?


That was my mistake. I meant to write that an HTML comment has no
attributes, so this method is not applicable to it.



> Here's the code:
>
>    for el in root.iter():
>        try:
>            del el.attrib["{http://www.w3.org/XML/1998/namespace}id"]
>        except KeyError:
>            pass
>    etree.cleanup_namespaces(root)
>    print etree.tostring(root, method="html")
>
> Stefan


I understand what you are saying, and it will certainly work in some
cases. But imagine a situation where the attributes assigned to a node
are important.
When the data of a node is evaluated as well as its structure, the
extra attribute affects the number of cycles needed to process a tree
node.
For example, I need to keep hashes of the tag data of pages in order to
find similar blocks and to compare data on different pages of a site.

Here's the code:

    # md5sum() stands for a helper that returns an MD5 hex digest, and
    # tree for the ElementTree containing root; neither is shown here.
    XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
    data_hashes = {}

    for el in root.iter():
        el.set(XML_ID, md5sum(tree.getpath(el)))
        data_hashes[el.get(XML_ID)] = md5sum(etree.tostring(el))


Since I changed each element before calculating its MD5, but had not
yet changed its children, data_hashes will be different on the next
cycle. It will be especially different when I am looking for similar
blocks on one page or on different pages of one website, since the
IDs will also be different.
So here is the problem: changing the original data of an XML/HTML tag
makes it harder to compare with other tags.

Sure, I could use the code you wrote to remove the IDs before each
etree.tostring(...) call and restore them afterwards, but that is not
logical and certainly not good for performance.


Best Regards,
   Ivan Begtin

