[lxml-dev] how to get line,col position

Mary Lei lei at ipac.caltech.edu
Thu May 7 19:28:59 CEST 2009


My responses are below:

Stefan Behnel wrote:
> Hi,
> 
> Mary Lei wrote:
>> How can I get dtd.validate to return the
>> line, column number for the xhtml in error?
> 
> You can't if you use the HTML parser, that's a known bug in libxml2:
> 
> http://bugzilla.gnome.org/show_bug.cgi?id=580705
> 
> Note that this bug has a patch associated to it, which you can apply to
> libxml2 to get what you want.
Where can I locate this patch?
> 
> Otherwise, for parsing XHTML you should use the XML parser anyway, which
> will track line numbers correctly.
Using the XML parser results in an error when it tries to load the DTD
from the network:
lxml.etree.XMLSyntaxError: Attempt to load network entity 
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
If I turn off that option, I don't get anything from the parser.

I don't really want to load the DTD each time, so I downloaded a copy with
the entities and decided to use etree.DTD.validate to validate against it
instead. But as mentioned, this does not give the line, col info.
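
For reference, here is a minimal sketch of how positions can be read from a
validator's error log when the document was parsed with the XML parser. The
DTD and the document below are made-up stand-ins (not the XHTML DTD), and
whether a meaningful column is reported may depend on the libxml2 version:

```python
from io import StringIO

from lxml import etree

# Hypothetical, tiny DTD used only for illustration.
dtd = etree.DTD(StringIO(
    "<!ELEMENT doc (item*)>\n"
    "<!ELEMENT item (#PCDATA)>"
))

# The <bogus/> element on the third line is not declared in the DTD.
xml_text = "<doc>\n  <item>ok</item>\n  <bogus/>\n</doc>"
root = etree.fromstring(xml_text)  # parsed with the default XML parser

is_valid = dtd.validate(root)

# Each log entry carries the position libxml2 recorded for the error.
errors = [(entry.line, entry.column, entry.message)
          for entry in dtd.error_log.filter_from_errors()]

for line, column, message in errors:
    print("line %d, column %d: %s" % (line, column, message))
```
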

If I use the XMLParser, I run into
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 24, column 13

even though my XHTML specifies a DTD.
I have looked these issues up on the web, but it is not clear
how to fix them.
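
One approach that might help with both the network error and the undefined
'nbsp' entity is a custom resolver that answers the DTD request from a local
copy. This is only a sketch: the class name LocalDTDResolver is made up, and
the one-line DTD string stands in for the downloaded
xhtml1-transitional.dtd (resolve_filename could be used to point at the real
file instead):

```python
from lxml import etree

# Stand-in for the contents of a locally saved DTD (assumption for the demo).
LOCAL_DTD_TEXT = '<!ENTITY nbsp "&#160;">'


class LocalDTDResolver(etree.Resolver):
    """Hypothetical resolver that serves the XHTML DTD without network access."""

    def resolve(self, system_url, public_id, context):
        if system_url and system_url.endswith("xhtml1-transitional.dtd"):
            return self.resolve_string(LOCAL_DTD_TEXT, context)
        return None  # fall back to the default lookup for anything else


# load_dtd makes the parser read the DTD so entity declarations are known;
# no_network keeps it from fetching anything over HTTP.
parser = etree.XMLParser(load_dtd=True, no_network=True)
parser.resolvers.add(LocalDTDResolver())

document = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><p>a&nbsp;b</p></body></html>"""

root = etree.fromstring(document, parser)
text = root.findtext("body/p")
print(repr(text))
```
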
> 
> 
>> But if I apply xmllint, it gives the same messages but with positional info:
>> /home/lei/python-stuff/CoRoTHome.html:83: HTML parser error : 
>> htmlParseStartTag: invalid element name
>> dedicated to asteroseismology of bright stars (typically V<10mag) and
>>                                                             ^
>> /home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error : 
>> standalone: tr declared in the external subset contains white spaces nodes
>> ...
>> Document /home/lei/python-stuff/CoRoTHome.html does not validate against 
>> xhtml1-transitional.dtd
> 
> You didn't say if you used the HTML parser or the XML parser in xmllint. In
> any case, xmllint does the DTD validation at parse time, where the line
> information is still available. It only gets lost when building the tree,
> so that running a validator on the tree cannot report line numbers anymore.
> 
> lxml.etree does not currently support parse-time validation against a
> user-provided DTD (i.e. one that is not referenced by the document itself).
> Might be worth a bug report.

My xmllint command is:
xmllint --dtdvalid xhtml1-transitional.dtd --noout /home/lei/python-stuff/CoRoTHome.html --recover --html
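
As a stopgap, the positions could be pulled out of xmllint's messages. The
sketch below assumes the "path:line: kind : message" layout seen in the
output quoted above, with each message on a single line, and takes the column
from the caret line. The function name and the sample text (including the
caret offset) are illustrative only:

```python
import re

# Matches lines such as "/path/file.html:83: HTML parser error : message".
XMLLINT_LINE = re.compile(
    r"^(?P<path>.+?):(?P<line>\d+): (?P<kind>.+?) : (?P<message>.*)$"
)


def parse_xmllint_output(text):
    """Return one dict per xmllint message; 'column' is None without a caret."""
    entries = []
    for raw in text.splitlines():
        match = XMLLINT_LINE.match(raw)
        if match:
            entries.append({
                "path": match.group("path"),
                "line": int(match.group("line")),
                "kind": match.group("kind"),
                "message": match.group("message").strip(),
                "column": None,
            })
        elif entries and raw.strip() == "^":
            # The caret line marks the column (1-based) of the latest message.
            entries[-1]["column"] = raw.index("^") + 1
    return entries


sample = (
    "/home/lei/python-stuff/CoRoTHome.html:83: HTML parser error : "
    "htmlParseStartTag: invalid element name\n"
    "dedicated to asteroseismology of bright stars (typically V<10mag) and\n"
    + " " * 60 + "^\n"
    "/home/lei/python-stuff/CoRoTHome.html:23: element tr: validity error : "
    "standalone: tr declared in the external subset contains white spaces nodes\n"
)

entries = parse_xmllint_output(sample)
for entry in entries:
    print(entry["line"], entry["column"], entry["kind"])
```
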
> 
> Stefan
Thanks.

-- 
Mary Lei

Software Testing
IPAC-NExScl

Rm: KS-233
MS: 220-6
Phone: 395-1998


