[lxml-dev] [Question #65510]: How to set libxml:XML_PARSE_HUGE-option in lxml?

Stefan Behnel stefan_ml at behnel.de
Fri Mar 27 12:56:38 CET 2009


bol wrote:
> we are working with large text corpora (bigger than 10 MB).
> lxml is used for splitting these corpus XML files and for running (old,
> non-XML) binaries via sockets, e.g. for POS tagging or tokenizing.

Ok, that gives you a) the bit of structure that you need and b) safe and
portable encoding support (which I assume is critical here), so that's
fine with me. After all, XML is used for all sorts of things these days...


> The XML_PARSE_HUGE option should be off by default, as in libxml.

That's what I was wondering about. It's (sort of) on by default if you use
libxml2 2.6.x and 2.7.[012], but it's supposed to be off by default if you
use libxml2 2.7.3 and later. That's outside of the control of lxml. So you
would get one behaviour on one system and a different behaviour on another
system, even with the same version of lxml.
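To make the version split concrete, here is a minimal sketch of the check
described above. The function name is hypothetical, and the 2.7.3 threshold
is simply the one stated in this mail, not something read from libxml2.

```python
def restrictions_on_by_default(libxml2_version):
    """Return True if this libxml2 version is expected to enforce the
    parser restrictions (tree depth, text node length) by default.

    Assumption taken from the discussion above: 2.6.x and 2.7.[012]
    do not enforce them, 2.7.3 and later do.
    """
    return tuple(libxml2_version) >= (2, 7, 3)


# Two systems with the same lxml but different libxml2 libraries
# would end up with different behaviour.
old_system = restrictions_on_by_default((2, 6, 32))
new_system = restrictions_on_by_default((2, 7, 3))
```

With lxml installed, the version tuple of the libxml2 library in use is
available as etree.LIBXML_VERSION and can be passed in directly.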

However, the restrictions are meant as a security measure to prevent traps
like the billion laughs attack. Therefore, I do understand that a) most
people won't notice them and b) having them on by default seems like the
right setting.

Is there any opposition to keeping the enforced parser restrictions
(limited tree depth and text node length) enabled by default in newer
libxml2 versions, and to providing a parser switch for disabling them? The
alternative would be to disable them by default on all libxml2 versions,
and to provide a switch that enables them if libxml2 supports it. But a
safe default sounds a lot better.
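From the user side, such a switch could look like the sketch below. Later
lxml releases expose it as the huge_tree keyword of XMLParser, which is off
by default. The helper name parse_corpus and the sample document are
hypothetical, made up for illustration only.

```python
from lxml import etree


def parse_corpus(xml_bytes, allow_huge=False):
    """Parse an XML document, optionally lifting the parser restrictions.

    allow_huge=False keeps the safe default (restrictions enforced where
    libxml2 supports them); allow_huge=True asks libxml2 to accept very
    deep trees and very long text nodes.
    """
    parser = etree.XMLParser(huge_tree=allow_huge)
    return etree.fromstring(xml_bytes, parser)


# Small documents parse the same way with either setting.
root = parse_corpus(b"<corpus><s>one</s><s>two</s></corpus>")
sentences = [s.text for s in root.iter("s")]

# Only enable this for input you trust, e.g. your own corpus files.
trusted_root = parse_corpus(b"<corpus/>", allow_huge=True)
```

Keeping the switch as an explicit, per-parser opt-in means that untrusted
input stays protected unless the caller deliberately asks otherwise.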

Stefan


