[lxml-dev] etree.parse hangs with a lot of parallel requests

Stefan Behnel stefan_ml at behnel.de
Sun Apr 6 16:26:59 CEST 2008


Hi,

Dmitri Fedoruk wrote:
> The code is the following:
> self.xmlParser = etree.XMLParser(no_network=False, resolve_entities=False, load_dtd=True)
> 
> I use load_dtd=True because I sometimes encounter HTML entities in my
> input data. They are included in my DTD in this way:
> <!ENTITY % HTMLlat1 SYSTEM "xhtml-lat1.ent">
> %HTMLlat1;
> 
> <!ENTITY % HTMLsymbol SYSTEM "xhtml-symbol.ent">
> %HTMLsymbol;
> 
> <!ENTITY % HTMLspecial SYSTEM "xhtml-special.ent">
> %HTMLspecial;
> 
> Eventually the code reaches this call:
> ...
> xmlres = etree.parse( StringIO.StringIO( reply['data'] ), self.xmlParser )
> 
> And here I have serious problems. Parsing usually takes up to 100 ms
> (even this is critical for me). But sometimes parsing takes 3, 5, or
> even 60 seconds (!). This happens under heavy load
> (~20 simultaneous parsings/transformations per sec).
> 
> So, I have several questions:
> 1) What am I doing wrong?
> 2) Is there any way to limit the runtime of etree.parse? Is there
> any way to kill a thread, maybe? I cannot afford to wait even 150 ms,
> to say nothing of 1 second or more.

It seems you only want to load DTDs locally from disk, so setting
"no_network=True" (which is the default in lxml 2.0) instead of the
"no_network=False" in your code should prevent any accidental remote
access.
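
A minimal sketch of that change, written for Python 3 (so io.BytesIO
replaces StringIO.StringIO). The helper names make_parser and parse_reply
and the sample document are hypothetical, not taken from your code, and
the sketch assumes the DTD and the xhtml-*.ent files can be found locally:

```python
from io import BytesIO

from lxml import etree


def make_parser():
    # Same options as in the quoted code, except no_network=True, so the
    # parser is not allowed to fetch DTDs or entity files over the network.
    return etree.XMLParser(no_network=True,
                           resolve_entities=False,
                           load_dtd=True)


def parse_reply(data, parser):
    # data is the raw XML as bytes (reply['data'] in the quoted code).
    return etree.parse(BytesIO(data), parser)


# Hypothetical sample input, for illustration only.
sample = b"<reply><item>ok</item></reply>"
tree = parse_reply(sample, make_parser())
root_tag = tree.getroot().tag
first_text = tree.getroot()[0].text
```

If the slow parses were caused by the parser going out to the network for
the DTD or the entity files, they should no longer be possible with this
setting. Whether that is really the cause is something only a test under
load can show.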

Does that help?

Stefan


