[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Sun Jul 19 20:31:56 CEST 2009
Nicholas Dudfield wrote:
> Wow, someone else with this requirement. I was meaning to post to the
> list about this. I'm using lxml to implement an XPath / CSS selection
> plugin for a Python extensible editor. I'd like to have a mapping of
> view buffer regions to xml nodes.
At least the line is available from the "sourceline" property of an element
- although only up to 65535, as libxml2 stores it in an unsigned short:
http://bugzilla.gnome.org/show_bug.cgi?id=325533
If you are in a position to whitespace-clean and pretty-print the XML
document, that would give you a simple mapping from elements to document
positions that you can exploit at the application level. Even if you can't,
that would still give you a usable model to work with that you could match
with the original stream to find the 'real' positions.
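A minimal sketch of that normalise-then-map idea, assuming lxml is
installed and that the application can work on the pretty-printed text.
The sample document and the variable names are made up for illustration:

```python
from lxml import etree

raw = b"<root><a>text</a><b><c/></b></root>"

# Drop ignorable whitespace while parsing, then serialise with indentation
# so that each start tag ends up on its own line.
parser = etree.XMLParser(remove_blank_text=True)
pretty = etree.tostring(etree.fromstring(raw, parser), pretty_print=True)

# Re-parse the normalised text; sourceline now refers to lines of `pretty`.
root = etree.fromstring(pretty)
lines = [(el.tag, el.sourceline) for el in root.iter()]

for tag, line in lines:
    print(tag, line)
```

Line numbers are 1-based and refer to the line of the start tag.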
> The workaround I used to get the
> exact character position was to use the feed interface, a character at
> a time and manually monitor bytestream position. It's fairly slow
> though. I'd like to implement this in Cython or use whatever
> underlying facility there is to speed it up.
Speeding up this approach is pretty much futile IMHO. The parser gains
speed from efficient I/O and memory management. Passing a byte at a time
totally counters that (and even then it's only a *byte* at a time, not a
*character* at a time).
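To illustrate the byte vs. character difference, here is a small sketch
that converts a byte offset into a character offset. The helper name
byte_to_char_offset is hypothetical, and it assumes that the data is UTF-8
and that the offset falls on a character boundary:

```python
def byte_to_char_offset(data: bytes, byte_offset: int,
                        encoding: str = "utf-8") -> int:
    """Translate a byte offset into `data` into a character offset.

    Raises UnicodeDecodeError if byte_offset splits a multi-byte character.
    """
    return len(data[:byte_offset].decode(encoding))


text = "<p>naïve café</p><q/>"
data = text.encode("utf-8")

byte_pos = data.index(b"<q/>")            # what a byte-wise feed would count
char_pos = byte_to_char_offset(data, byte_pos)

print(byte_pos, char_pos)
```

The two offsets differ as soon as any non-ASCII character precedes the
position, which is why an editor that counts characters cannot use the
byte count directly.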
What I could imagine doing instead is to traverse the element tree and
do an incremental text search for each element tag (i.e. the regexp
"<tagname[\s/>]", so that tags without attributes match as well) to recover
the exact original positions. That would also allow you to work at the
character level rather than the byte level (assuming that the editor works
at that level, too). Searching for the above regexp is safe as "<" cannot
occur anywhere in the XML data stream except for a tag start/end, a
comment/PI or a CDATA section (ok, minus DTDs, but that's easy to catch
using the line number of the root element).
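A rough sketch of that incremental search, not a finished implementation.
The function name locate_elements and the sample document are made up. It
assumes the text is a unicode string without an encoding declaration,
skips comments, PIs and CDATA sections while scanning, ignores namespace
prefixes when matching, and does not try to handle markup inside a DTD.
It falls back to the standard library's ElementTree if lxml is missing:

```python
import re

try:
    from lxml import etree
except ImportError:  # the stdlib API is compatible for the calls used here
    import xml.etree.ElementTree as etree

# Constructs that may contain a literal "<" without starting an element.
SKIP = r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>"


def locate_elements(text):
    """Return (element, character offset of its start tag) in document order."""
    root = etree.fromstring(text)
    positions = []
    offset = 0
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # lxml also yields comments and PIs here
        local = el.tag.rpartition("}")[2]  # strip "{namespace}" if present
        pattern = re.compile(
            SKIP + r"|(<(?:[^\s<>/:]+:)?" + re.escape(local) + r"(?=[\s/>]))",
            re.S)
        while True:
            m = pattern.search(text, offset)
            if m is None:
                raise ValueError("start tag of %r not found" % el.tag)
            offset = m.end()
            if m.group(1) is not None:
                break  # a real start tag, not a skipped construct
        positions.append((el, m.start(1)))
    return positions


sample = ('<root>\n'
          '  <!-- <item > is not a tag here -->\n'
          '  <item id="1">é</item>\n'
          '  <item/>\n'
          '</root>')
found = locate_elements(sample)
for el, pos in found:
    print(el.tag, pos)
```

Since the search only ever moves forward and the tree is visited in
document order, each element costs one regexp search from the end of the
previous match.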
I will also check if there is a way to provide the position at the (target)
parser level, but that needs to fit the current interface. And I currently
do not have much time to dig into this.
Stefan