[lxml-dev] About the position of html parsing by HTML Target parser
Stefan Behnel
stefan_ml at behnel.de
Mon Jul 20 19:43:36 CEST 2009
Nicholas Dudfield wrote:
> Asking more regarding possibilities than your personal inclinations
> and time allowances, would it be very difficult to add a `character`
> index attribute (startchpos, endchpos etc) to nodes? I was thinking of
> something akin to the sourceline attributes. Would this be possible
> purely from lxml/cython land or would libxml2 need to be patched?
The latter. Here's what libxml2 knows about a node:
http://xmlsoft.org/html/libxml-tree.html#xmlNode
So it doesn't remember any character positions, and it only knows source
line numbers up to 65535 (because of memory considerations).
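For reference, a minimal sketch of the position information that lxml does expose today, namely the sourceline property of each element. The sample document is made up for illustration:

```python
from lxml import etree

# Made-up sample document: each element starts on its own line.
doc = b"<root>\n  <a/>\n  <b/>\n</root>"
root = etree.fromstring(doc)

# sourceline is the line on which the element's start tag was parsed.
# There is no column or character offset to go with it.
lines = {el.tag: el.sourceline for el in root.iter()}
print(lines)
```

That is all you get per node: a line number, and nothing finer grained than that.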
> Being a novice programmer I have little C or Cython experience however
> it would be an interesting and motivating project to learn on.
>
> As `unicode` is the future of python `text` a character based index
> would be useful (admittedly for not that many uses) regardless of
> encoding? For my use case, character based positions would be
> perfect.
Certainly. However, it does require some work to recover this information
from inside the parser framework (due to recoding), so I doubt that it's
worth adding such a feature for 'general' use.
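To illustrate why the recoding matters: libxml2 works on UTF-8 internally, so any offset you could pick up inside the parser counts bytes of the recoded buffer, not characters of your original text. A small plain-Python sketch of that mismatch (no libxml2 involved, and the sample string is made up):

```python
# Made-up sample input; "ü" and "ß" each take two bytes in UTF-8.
text = "<p>Grüße</p><b>x</b>"
data = text.encode("utf-8")

char_pos = text.index("<b>")   # offset counted in characters
byte_pos = data.index(b"<b>")  # offset counted in UTF-8 bytes

# Turning a byte offset back into a character offset means decoding
# everything in front of it and counting the resulting characters.
recovered = len(data[:byte_pos].decode("utf-8"))

print(char_pos, byte_pos, recovered)
```

The two offsets drift apart as soon as the input contains non-ASCII characters, which is the bookkeeping the parser would have to do for every node.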
If you want to dig into this, you'll have to start reading through the
source code of libxml2 to figure out where this information could become
available. "xmlIO.c" might be a good place to start, as it implements the I/O
routines that copy between encoded buffers.
> Please forgive any misconceptions driving boneheaded questions.
It's always fine to ask.
Stefan