[lxml-dev] how to get line,col position
Mary Lei
lei at ipac.caltech.edu
Thu May 7 19:28:59 CEST 2009
My responses are below:
Stefan Behnel wrote:
> Hi,
>
> Mary Lei wrote:
>> How can I get dtd.validate to return the
>> line, column number for the xhtml in error?
>
> You can't if you use the HTML parser, that's a known bug in libxml2:
>
> http://bugzilla.gnome.org/show_bug.cgi?id=580705
>
> Note that this bug has a patch associated to it, which you can apply to
> libxml2 to get what you want.
Where can I find this patch?
>
> Otherwise, for parsing XHTML you should use the XML parser anyway, which
> will track line numbers correctly.
Using the XML parser results in an error when it tries to load the DTD from the network:
lxml.etree.XMLSyntaxError: Attempt to load network entity
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
If I turn off the option, I don't get anything from the parser.
I don't really want to load the DTD each time, so I downloaded a copy with the
entities and decided to use etree.DTD.validate to validate against it instead. But as
mentioned, this does not give the line,col info.
If I use the XMLParser, I get
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 24, column 13
even though my XHTML specifies a DTD.
I have looked these issues up on the web, but it is not clear to me
how to fix this.
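
One possible way around both errors (the network load and the undefined
'nbsp') is to register a custom resolver that maps the W3C DTD URL to a local
copy and to parse with load_dtd=True. The sketch below is only an
illustration under assumptions: LocalDTDResolver, TINY_DTD and
parse_with_local_dtd are hypothetical names, and TINY_DTD is a tiny stand-in
for the downloaded xhtml1-transitional.dtd so that the example is
self-contained. With a real local copy, resolve_filename() would probably be
the natural call instead of resolve_string().

```python
from io import BytesIO
from lxml import etree

XHTML_DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

# Hypothetical stand-in for the downloaded DTD (assumption: the real local
# copy declares the same 'nbsp' entity along with the full XHTML grammar).
TINY_DTD = """\
<!ELEMENT html (body)>
<!ELEMENT body (p*)>
<!ELEMENT p (#PCDATA)>
<!ENTITY nbsp "&#160;">
"""


class LocalDTDResolver(etree.Resolver):
    """Serve the XHTML DTD from local data instead of the network."""

    def resolve(self, system_url, public_id, context):
        if system_url == XHTML_DTD_URL:
            return self.resolve_string(TINY_DTD, context)
        return None  # fall back to the default resolution for other URLs


def parse_with_local_dtd(data):
    # load_dtd makes the parser read the DTD (and its entities);
    # no_network keeps it from fetching anything over HTTP.
    parser = etree.XMLParser(load_dtd=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver())
    return etree.parse(BytesIO(data), parser).getroot()


DOC = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<body>
<p>a&nbsp;b</p>
</body>
</html>"""

root = parse_with_local_dtd(DOC)
p = root.find("body/p")
# The entity is expanded, and the element remembers its source line.
print(repr(p.text), p.sourceline)
```

Because the tree comes from the XML parser, each element's sourceline should
be available afterwards.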
>
>
>> But if I apply xmllint, it gives the same messages but with positional info:
>> /home/lei/python-stuff/CoRoTHome.html:83: HTML parser error :
>> htmlParseStartTag: invalid element name
>> dedicated to asteroseismology of bright stars (typically V<10mag) and
>> ^
>> /home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error :
>> standalone: tr declared in the external subset contains white spaces nodes
>> ...
>> Document /home/lei/python-stuff/CoRoTHome.html does not validate against
>> xhtml1-transitional.dtd
>
> You didn't say if you used the HTML parser or the XML parser in xmllint. In
> any case, xmllint does the DTD validation at parse time, where the line
> information is still available. It only gets lost when building the tree,
> so that running a validator on the tree cannot report line numbers anymore.
>
> lxml.etree does not currently support parse-time validation against a
> user-provided DTD (i.e. one that is not referenced by the document itself).
> Might be worth a bug report.
My xmllint command is:
xmllint --dtdvalid xhtml1-transitional.dtd --noout /home/lei/python-stuff/CoRoTHome.html --recover --html
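
A possible workaround for getting positions at parse time, sketched under
assumptions: if the document itself references the DTD in its DOCTYPE, the
XML parser can validate while parsing (dtd_validation=True), and a resolver
can point that reference at local data. This does not validate against an
arbitrary user-provided DTD; it only redirects the one the document names.
LocalDTDResolver, TINY_DTD and validate_at_parse_time are hypothetical names,
and TINY_DTD stands in for the real downloaded DTD.

```python
from io import BytesIO
from lxml import etree

XHTML_DTD_URL = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

# Hypothetical miniature DTD so the sketch runs without any files on disk.
TINY_DTD = """\
<!ELEMENT html (body)>
<!ELEMENT body (p*)>
<!ELEMENT p (#PCDATA)>
<!ENTITY nbsp "&#160;">
"""

DOCTYPE = (b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n'
           b'  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n')


class LocalDTDResolver(etree.Resolver):
    def resolve(self, system_url, public_id, context):
        if system_url == XHTML_DTD_URL:
            return self.resolve_string(TINY_DTD, context)
        return None


def validate_at_parse_time(data):
    """Return (line, column, message) tuples for errors seen while parsing."""
    parser = etree.XMLParser(dtd_validation=True, no_network=True)
    parser.resolvers.add(LocalDTDResolver())
    try:
        etree.parse(BytesIO(data), parser)
    except etree.XMLSyntaxError:
        pass  # details are collected from the parser's error log below
    return [(e.line, e.column, e.message) for e in parser.error_log]


VALID = DOCTYPE + b"<html>\n<body>\n<p>a&nbsp;b</p>\n</body>\n</html>"
INVALID = DOCTYPE + b"<html>\n<body>\n<div>oops</div>\n</body>\n</html>"

for line, column, message in validate_at_parse_time(INVALID):
    print("line %d, column %d: %s" % (line, column, message))
```

The error log entries carry the line and column that the parser had at the
time each problem was reported.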
>
> Stefan
Thanks.
--
Mary Lei
Software Testing
IPAC-NExScl
Rm: KS-233
MS: 220-6
Phone: 395-1998