[lxml-dev] docinfo.doctype doesn't include internal entities?

John Krukoff jkrukoff at ltgc.com
Fri Jul 24 22:19:25 CEST 2009


On Sat, 2009-04-18 at 08:46 +0200, Stefan Behnel wrote:
> Sidnei da Silva wrote:
> > I am looking for a way to output internal entities that have been
> > parsed from the original document when writing out a tree, but
> > apparently this is not exposed in any attribute.
> > 
> > Here's an example:
> > 
> > {{{
> > import lxml.etree
> > 
> > document = """<?xml version="1.0"?>
> >   <!DOCTYPE application [
> >     <!ENTITY nbsp "\&#160;">
> >   ]>
> >   <application>&nbsp;</application>
> > """
> > 
> > 
> > tree = lxml.etree.fromstring(document)
> > print tree.getroottree().docinfo.doctype
> > }}}
> > 
> > I would expect this to output:
> > {{{
> >   <!DOCTYPE application [
> >     <!ENTITY nbsp "\&#160;">
> >   ]>
> > }}}
> > 
> > But instead it gives me:
> > 
> > {{{
> >   <!DOCTYPE application>
> > }}}
> > 
> > Is it a bug or I'm not looking at the right place?
> 
> What you are looking for is the internal subset of the document, which is
> not (really) part of the DOCTYPE itself. It's available through the
> "docinfo.internalDTD" property. However, lxml.etree doesn't expose the
> content of the DTD, so this is currently only usable for validation (i.e.
> not very helpful in your case).
> 
> What you could try is to parse the document without resolving the entities,
> then traverse the Entity elements and collect their names in a set. That
> will not give you the resolved entity values, though...
> 
> I think it would be nice if tostring() could serialise DTDs, but I doubt
> that there are so many use cases for that. In your case, you'd then have to
> parse the DTD yourself, which you could also do by clearing the root node
> and serialising the document to unicode.
> 
> Stefan

Hello,

I'm sorry to be resurrecting an ancient thread, but it seemed the
easiest way to bring up the fact that I've recently come up with a use
case for exactly the feature mentioned here, namely serializing internal
DTD subsets.

I've been writing a round-trip converter for a personal XML shorthand,
and internal DTD subsets are the one thing I haven't found a good
workaround for: I can't pull the subset out of the original document,
make my modifications, and then create an identical new document from
the result. So far the best I've managed is to destructively modify a
copy of the document until I have a reasonable chance of extracting the
internal DTD subset with my own string parser. That parser is looking
unfortunately complicated, since it has to cope with multiple top level
nodes (comments being the most common for me).

Really, I'd be happy in the simplest case if docinfo.doctype included
the internal DTD subset exactly as defined in the original document
(parsed or not), as then it'd be reasonably easy to at least stick
things back together at the string level. Is this the kind of thing
you'd accept a wishlist bugtracker item on?
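
To illustrate the kind of string-level extraction described above, here
is a minimal sketch that uses only the Python standard library. The
function name extract_internal_subset is hypothetical and not part of
lxml. It assumes a well-formed document whose first "<!DOCTYPE" is the
real declaration (not text inside an earlier comment), and it raises
ValueError if a comment, processing instruction, or quoted literal is
left unterminated.

```python
def extract_internal_subset(xml_text):
    """Return the text between '[' and ']' of the DOCTYPE, or None.

    Hypothetical helper: assumes the first '<!DOCTYPE' in xml_text is the
    actual document type declaration.
    """
    start = xml_text.find("<!DOCTYPE")
    if start == -1:
        return None
    i = start + len("<!DOCTYPE")
    n = len(xml_text)
    # Scan the DOCTYPE header up to '[' (subset present) or '>' (no subset),
    # skipping quoted public/system identifiers on the way.
    while i < n:
        ch = xml_text[i]
        if ch in "\"'":
            i = xml_text.index(ch, i + 1) + 1
        elif ch == ">":
            return None
        elif ch == "[":
            break
        else:
            i += 1
    else:
        return None
    subset_start = i + 1
    i = subset_start
    # Inside the subset, a ']' only closes it when it is outside comments,
    # processing instructions, and quoted literals.
    while i < n:
        if xml_text.startswith("<!--", i):
            i = xml_text.index("-->", i + 4) + 3
        elif xml_text.startswith("<?", i):
            i = xml_text.index("?>", i + 2) + 2
        elif xml_text[i] in "\"'":
            i = xml_text.index(xml_text[i], i + 1) + 1
        elif xml_text[i] == "]":
            return xml_text[subset_start:i]
        else:
            i += 1
    return None
```

The returned text could then be spliced back into a new DOCTYPE
declaration when writing the document out, which is the "stick things
back together at the string level" approach.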

Somewhat related, I was surprised to discover that the TreeBuilder API
doesn't deal well with additional top level elements. For example, 

Python 2.6.2 (r262:71600, May 29 2009, 09:48:09) 
[GCC 4.3.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import etree
>>> t = etree.TreeBuilder( )
>>> t.comment( 'first' )
<!--first-->
>>> t.start( 'element', {} )
<Element element at b773aa7c>
>>> e = t.end( 'element' )
>>> t.comment( 'last' )
<!--last-->
>>> t.close( )
<!--last-->

I was expecting that the TreeBuilder class was actually for creating XML
documents, but I guess it's actually meant for XML fragments? I expected
this:

>>> etree.tostring( e )
'<element/>'

But when I went to retrieve the root tree, I was surprised that my other
top level elements were missing.

>>> etree.tostring( e.getroottree( ) )
'<element/>'

But since it also allows you to create multiple top level elements with
start and end, it obviously doesn't care about the restrictions of
creating an XML document. This actually mattered to me, as my XML here
makes heavy use of top level comments for documentation.

Anyway, not a big deal, I figured I'd just keep track of all the top
level parts myself, and create an ElementTree manually. Only, it doesn't
look like there's any way to add top level elements like doctype
information or comments or processing instructions without first
serializing all the parts to strings, sticking them together, and
running them back through the parser again. I can't find any way to
duplicate what the parser does through the API, and am hoping I'm just
missing some obscure corner of the ElementTree API that would let me
build this programmatically.

Sure, I can read them using ElementTree.getroot( ).itersiblings( ), but
I couldn't find any way to create them or doctype information without
resorting to string parsing.
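
For comparison, here is a minimal sketch of what building such a
document through the API might look like. It assumes an lxml release in
which addprevious() and addnext() accept comments (and processing
instructions) as siblings of the root element, and in which tostring()
takes a doctype keyword argument; whether the version discussed in this
thread offers both is not something this sketch establishes. The helper
name build_document is hypothetical.

```python
from lxml import etree


def build_document(root, before=(), after=()):
    """Attach top-level comments around root and return its ElementTree.

    Hypothetical helper; 'before' and 'after' are sequences of comment texts.
    """
    for text in before:
        # Each comment lands directly before the root, so order is kept.
        root.addprevious(etree.Comment(text))
    for text in reversed(after):
        # Each comment lands directly after the root, hence the reversal.
        root.addnext(etree.Comment(text))
    return root.getroottree()


root = etree.Element("element")
tree = build_document(root, before=["first"], after=["last"])

# Serialising the tree (not just the root element) includes the siblings.
serialized = etree.tostring(tree)

# The doctype string is written verbatim ahead of the tree; lxml does not
# parse or validate it here.
doctype = '<!DOCTYPE element [<!ENTITY nbsp "&#160;">]>'
with_doctype = etree.tostring(tree, doctype=doctype)
```

Note that the doctype string is passed through as plain text, so an
internal subset supplied this way has no effect on the tree's own
docinfo.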

So, yeah, really just a diary of my misunderstandings of the TreeBuilder
API, and my attempts to work around it. Hopefully I'm missing something
obvious.

I should also add a note of thanks here for all the effort you've put
into lxml. I've been using it daily for over 2 years now, and I can't
imagine programming XML with Python using anything else. Even the
original ElementTree seems limited in comparison now, to say nothing of
ending up in JavaScript looking at DOM code. It's been the most useful
and best supported library I depend on. Thank you.

-- 
John Krukoff <jkrukoff at ltgc.com>
Land Title Guarantee Company


