[lxml-dev] Fwd: News flash: Python possibly guilty in excessive DTD traffic
Stefan Behnel
stefan_ml at behnel.de
Mon Feb 11 20:44:30 CET 2008
Hi again,
Stefan Behnel wrote:
> Sidnei da Silva wrote:
>> http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
>> Does any of that apply to lxml?
>
> I don't think so, the article relates to DTD loading through urllib. lxml
> leaves that to libxml2's parser.
>
>
>> I suppose lxml supports dtd catalogs?
>
> Yes, libxml2 has catalog support (although you can compile that out), so it
> will normally see network access as a last resort to resolve external entities.
>
>
>> Does it cache dtds in any way?
>
> There is no internal document caching (except for repeated access to the same
> document during a single operation, e.g. in XSLT). If you do not provide
> catalogs on your system, that's your own 'decision'. You can still write your
> own caching resolver in that case, but I would consider catalogs the best
> solution to this problem.
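For those who do want to go the resolver route, here is a rough sketch of what a caching resolver could look like. It subclasses lxml's etree.Resolver and serves DTD text from an in-memory dict instead of the network. The class name CachingDTDResolver, the dict-based cache and the example.invalid URL are made up for illustration; a real setup would more likely read from files on disk, or simply use catalogs.

```python
from lxml import etree

# Hypothetical URL and DTD content, used for illustration only.
DTD_URL = "http://example.invalid/dtd/greeting.dtd"
DTD_CACHE = {DTD_URL: "<!ELEMENT greeting (#PCDATA)>"}


class CachingDTDResolver(etree.Resolver):
    """Serve known DTDs from a local cache instead of fetching them."""

    def __init__(self, cache):
        super().__init__()
        self.cache = cache
        self.hits = []  # URLs that were answered from the cache

    def resolve(self, system_url, public_id, context):
        dtd_text = self.cache.get(system_url)
        if dtd_text is None:
            # Unknown resource: returning None lets other resolvers
            # (and finally libxml2's default loading) take over.
            return None
        self.hits.append(system_url)
        return self.resolve_string(dtd_text, context)


# load_dtd=True asks the parser to load the referenced DTD at all;
# network access stays disabled, the resolver supplies the content.
parser = etree.XMLParser(load_dtd=True)
resolver = CachingDTDResolver(DTD_CACHE)
parser.resolvers.add(resolver)

doc = (
    b'<!DOCTYPE greeting SYSTEM "http://example.invalid/dtd/greeting.dtd">'
    b"<greeting>Hello</greeting>"
)
root = etree.fromstring(doc, parser)
```

After parsing, resolver.hits records which URLs were answered locally, which is a cheap way to check that nothing fell through to the default loader.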
Two more things to add:
First, lxml does not validate by default. You have to explicitly enable
validation in a parser if you want to use it, in which case it should not be
too much to ask that you make sure your catalogs are properly installed. :)
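To illustrate, a minimal sketch of the difference between the default parser and one with validation switched on. The document below carries its own internal DTD subset (so nothing external is needed) and deliberately violates it; the helper name parses_ok is only for this example.

```python
from io import BytesIO
from lxml import etree

# Well-formed but invalid: the DTD says <root> contains one <child>,
# while the document contains an undeclared <other> element.
INVALID_DOC = b"""<?xml version="1.0"?>
<!DOCTYPE root [
  <!ELEMENT root (child)>
  <!ELEMENT child (#PCDATA)>
]>
<root><other/></root>
"""


def parses_ok(parser):
    """Return True if INVALID_DOC parses without error using `parser`."""
    try:
        etree.parse(BytesIO(INVALID_DOC), parser)
        return True
    except etree.XMLSyntaxError:
        return False


# Default parser: no validation, so the well-formed document is accepted.
default_ok = parses_ok(etree.XMLParser())

# Validation has to be requested explicitly on the parser.
validating_ok = parses_ok(etree.XMLParser(dtd_validation=True))
```

With the default parser the document is accepted; only the parser created with dtd_validation=True rejects it.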
Secondly, lxml 2.0 does not load referenced network resources by default.
It does load documents that you explicitly ask it to download by parsing from
a URL, but for referenced resources such as DTDs and schemas you have to
explicitly enable network access, again by configuring a parser.
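As a sketch of what that configuration looks like: the default parser below reads a document whose DOCTYPE points at a remote DTD without trying to fetch it, while the second parser is the explicit opt-in. The example.invalid URL is a placeholder, and the opt-in parser is only constructed here, not used, so the example itself stays off the network.

```python
from io import BytesIO
from lxml import etree

# Placeholder URL for illustration; it is never actually fetched below.
DTD_URL = "http://example.invalid/dtd/greeting.dtd"

doc = (
    b'<!DOCTYPE greeting SYSTEM "http://example.invalid/dtd/greeting.dtd">'
    b"<greeting>Hello</greeting>"
)

# Default parser: the DOCTYPE is recorded, but the referenced DTD is
# neither loaded nor downloaded.
tree = etree.parse(BytesIO(doc), etree.XMLParser())
root = tree.getroot()
recorded_url = tree.docinfo.system_url

# Explicit opt-in: load the referenced DTD and allow network access for it.
networked_parser = etree.XMLParser(load_dtd=True, no_network=False)
```

Passing networked_parser to etree.parse() is what would permit the DTD download; leaving it out keeps the parser offline for referenced resources.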
So, no, using lxml will not unexpectedly waste any network resources unless
you explicitly tell it to do so.
Stefan