[lxml-dev] Fwd: News flash: Python possibly guilty in excessive DTD traffic

Stefan Behnel stefan_ml at behnel.de
Mon Feb 11 20:44:30 CET 2008


Hi again,

Stefan Behnel wrote:
> Sidnei da Silva wrote:
>> http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
>> Does any of that apply to lxml?
> 
> I don't think so, the article relates to DTD loading through urllib. lxml
> leaves that to libxml2's parser.
> 
> 
>> I suppose lxml supports dtd catalogs?
> 
> Yes, libxml2 has catalog support (although you can compile that out), so it
> will normally see network access as a last resort to resolve external entities.
> 
> 
>> Does it cache dtds in any way?
> 
> There is no internal document caching (except for repeated access to the same
> document during a single operation, e.g. in XSLT). If you do not provide
> catalogs on your system, that's your own 'decision'. You can still write your
> own caching resolver in that case, but I would consider catalogs the best
> solution to this problem.
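
For illustration, a caching resolver of the kind mentioned above might look
roughly like the following. This is only a sketch: the class name
CachingDTDResolver, the in-memory dict used as the cache, and the
example.invalid URL are all made up for the example. A real setup would
more likely read from files on disk, or simply use catalogs.

```python
from lxml import etree


class CachingDTDResolver(etree.Resolver):
    """Hypothetical resolver that serves known DTDs from an in-memory dict."""

    def __init__(self, cache):
        super().__init__()
        self.cache = cache   # maps system URL -> DTD text
        self.hits = []       # URLs that were answered from the cache

    def resolve(self, system_url, public_id, context):
        dtd_text = self.cache.get(system_url)
        if dtd_text is None:
            # Unknown URL: returning None lets lxml fall back to its
            # default resolution.
            return None
        self.hits.append(system_url)
        return self.resolve_string(dtd_text, context)


DTD_URL = "http://example.invalid/root.dtd"  # made-up URL, never fetched
resolver = CachingDTDResolver({DTD_URL: "<!ELEMENT root EMPTY>"})

parser = etree.XMLParser(load_dtd=True, dtd_validation=True)
parser.resolvers.add(resolver)

xml = '<!DOCTYPE root SYSTEM "%s"><root/>' % DTD_URL
root = etree.fromstring(xml, parser)
```

The resolver is consulted before any default lookup, so the DTD reference
is satisfied from the dict without touching the network.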

Two more things to add:

lxml does not validate by default. You have to explicitly enable validation in
a parser if you want it, in which case it should not be too much to ask that
you make sure your catalogs are properly installed. :)
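
A minimal sketch of the difference, assuming a small made-up document with
an internal DTD subset so that nothing has to be fetched. The helper
is_valid() and the sample documents are invented for the example.

```python
from io import BytesIO
from lxml import etree

# Internal DTD subset shared by both sample documents (made up).
DOCTYPE = b"""<!DOCTYPE root [
<!ELEMENT root (child)>
<!ELEMENT child (#PCDATA)>
]>
"""
VALID_XML = DOCTYPE + b"<root><child>text</child></root>"
INVALID_XML = DOCTYPE + b"<root><other/></root>"  # well-formed, but not valid

default_parser = etree.XMLParser()                         # no validation
validating_parser = etree.XMLParser(dtd_validation=True)   # explicit opt-in


def is_valid(data, parser):
    """Return True if `data` parses without error using `parser`."""
    try:
        etree.parse(BytesIO(data), parser)
        return True
    except etree.XMLSyntaxError:
        return False
```

With the default parser both documents are accepted, since only
well-formedness is checked. The validating parser rejects the second one.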

Secondly, lxml 2.0 does not load referenced network resources by default.
It does load documents that you explicitly ask it to download by parsing
from a URL, but network access for referenced resources such as DTDs and
schemas has to be enabled explicitly, again by configuring a parser.
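
As a rough sketch of what that looks like in practice: the document and
its example.invalid DTD URL below are made up, and the opt-in parser is
only constructed, not used, because parsing with it could trigger a real
download attempt.

```python
from lxml import etree

# Made-up document whose DOCTYPE points at a (non-existent) network DTD.
XML = b'<!DOCTYPE root SYSTEM "http://example.invalid/root.dtd"><root/>'

# Default parser: the DTD reference is recorded in the document info,
# but the DTD itself is not fetched.
root = etree.fromstring(XML)
docinfo = root.getroottree().docinfo

# Explicit opt-in: ask for DTD loading and allow network access for it.
network_parser = etree.XMLParser(load_dtd=True, no_network=False)
```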

So, no, lxml will not use any network resources behind your back. It only
does so if you explicitly tell it to.

Stefan

