[lxml-dev] About the position of html parsing by HTML Target parser

Nicholas Dudfield ndudfield at gmail.com
Mon Jul 20 17:34:03 CEST 2009


>>> No, the codec layer only recodes the characters, so you'd get a UTF-8 encoded """ byte string as result.

Makes sense. I was hoping that was the case :)

Asking more about what is possible than about your personal inclinations
and time allowances: would it be very difficult to add `character`
index attributes (startchpos, endchpos, etc.) to nodes?  I was thinking
of something akin to the sourceline attribute.  Would this be possible
purely from lxml/Cython land, or would libxml2 need to be patched?
Being a novice programmer, I have little C or Cython experience; however,
it would be an interesting and motivating project to learn on.

As `unicode` is the future of Python `text`, wouldn't a character-based
index be useful (admittedly for not that many uses) regardless of
encoding?  For my use case, character-based positions would be
perfect.
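
To illustrate the byte/character distinction I mean, here is a minimal
sketch in plain Python.  The helper name `byte_to_char_offset` is made up
for this example and is not part of lxml or libxml2.  It also assumes the
byte offset falls on a character boundary, and it says nothing about how
a parser would actually report positions.

```python
def byte_to_char_offset(data: bytes, byte_offset: int, encoding: str = "utf-8") -> int:
    """Map a byte offset in encoded text to a character offset.

    Hypothetical helper for illustration only; assumes byte_offset
    lies on a character boundary in the given encoding.
    """
    # Decoding the prefix and counting its characters gives the
    # number of characters that precede the byte offset.
    return len(data[:byte_offset].decode(encoding))


text = "caf\u00e9 <b>bold</b>"      # the e-acute takes two bytes in UTF-8
data = text.encode("utf-8")

byte_pos = data.index(b"<b>")       # position counted in bytes
char_pos = byte_to_char_offset(data, byte_pos)  # position counted in characters

# The two positions differ as soon as a non-ASCII character precedes the tag.
print(byte_pos, char_pos, text[char_pos:char_pos + 3])
```

With a character-based index, slicing the original `unicode` string at the
reported position would land on the tag directly, whatever encoding the
document was read from.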

Please forgive any misconceptions driving boneheaded questions.

