[lxml-dev] Entity handling in lxml
Stefan Behnel
stefan_ml at behnel.de
Sun May 27 15:33:00 CEST 2007
Hi all,
let's make this a new thread to discuss the topic that was raised by Eric Garin.
The parsers in lxml are currently configured to replace entity references
(&entity;) with their definitions. This requires a DTD, either inside the
document, as an external URL reference, or from the system catalog.
The parsers do not currently load DTDs by default, nor do they validate.
So, the current situation is:
1) If you use the default parser, all entities will pass through without
raising an exception, but each one puts an error message in the error_log:
entity.xml:5:ERROR:PARSER:WAR_UNDECLARED_ENTITY: Entity 'oneXXX' not defined
They will not be visible at the API level, and they will cut off text that
contains them ("my &entity; value" will result in a text property value of
"my "), but they will be serialised correctly. They may also break a lot of
things internally, as the implementation is not prepared for dealing with
things like entity reference nodes.
2) If you configure a parser to load the DTD, declared entities will be
replaced and undeclared entities will behave as above.
3) If you configure a parser to validate against a DTD, it will still behave
exactly as above.
This behaviour is definitely a bug. It would be cleaner to do this:
1) The default parser should replace internally defined entities and report
all other entities as an error.
2) A parser that loads the DTD should report undeclared entities as an error
(although it would not do any validation).
3) A validating parser should report undeclared entities as an error, just as
any other structural or semantic deviation from the DTD.
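To make proposal 1) concrete, here is a minimal sketch of that behaviour. It
assumes an lxml release in which the default parser already works this way
(the release current at the time of this post does not). The entity names
"known" and "unknown" are made up for illustration.

```python
from lxml import etree

# Internal DTD subset declaring a single entity (hypothetical names).
DTD = b'<!DOCTYPE root [<!ENTITY known "hello">]>'

# A declared entity is replaced by its definition.
good = etree.fromstring(DTD + b"<root>my &known; value</root>")
print(good.text)

# An undeclared entity is reported as an error instead of passing through.
try:
    etree.fromstring(DTD + b"<root>my &unknown; value</root>")
    rejected = False
except etree.XMLSyntaxError as exc:
    rejected = True
    print("parse error:", exc)
```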
The alternative would be to provide an API for entities and to rewrite the
internals to deal with them somehow. We could potentially make entity
references a sort of element that behaves more or less like a comment.
Entities would mainly have a name and a tail. We would then need an Entity()
factory and integrate entity reference nodes into the internal traversal code
(basically: let _isElement(c_entity_node) return 1).
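As a rough sketch of what such an API could look like from the Python side,
assuming an lxml release that ships an Entity() factory along these lines
(this post only proposes it), an entity reference would be appended like a
comment and carry a name and a tail:

```python
from lxml import etree

root = etree.Element("root")
root.text = "my "

# Assumption: Entity(name) creates an entity reference node.
ref = etree.Entity("known")   # "known" is a made-up entity name
ref.tail = " value"           # text following the reference
root.append(ref)

# The reference sits in the tree like a comment-style child node.
serialised = etree.tostring(root)
print(ref.name, ref.text, serialised)
```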
When would they appear in the tree? We would additionally need a
"resolve_entities" keyword argument for the parsers; that would be the easiest
way to deal with this. If it is set, unresolvable entities will result in an
error as described above. Otherwise, entity references will not be replaced
and will stay in the tree as entity reference nodes.
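A minimal sketch of the switched-off case, assuming an lxml release in which
XMLParser accepts the "resolve_entities" keyword proposed here (the entity
name "known" is again made up). With the option disabled, the reference stays
in the tree as a child node between the text and its tail:

```python
from lxml import etree

DOC = b'<!DOCTYPE root [<!ENTITY known "hello">]><root>my &known; value</root>'

# Assumption: resolve_entities=False keeps references instead of replacing them.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(DOC, parser)

ref = root[0]                 # the entity reference node
print(repr(root.text))        # text up to the reference
print(ref.name, repr(ref.tail))

# Serialising the element writes the reference back out unchanged.
serialised = etree.tostring(root)
print(serialised)
```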
Any comments?
Stefan