[lxml-dev] very long files with many XML entity refs

Moshe Cohen moshec at gmail.com
Fri Aug 29 01:45:56 CEST 2008


I have a sample XML file which contains <text>&#135;&#135; .... </text>
with 8,000,000 (eight million) repetitions of '&#135;'.
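For anyone who wants to reproduce this, a file of that shape could be generated roughly as follows. This is only a sketch: the function name `write_sample`, the output file name and the chunk size are my own choices for illustration, not part of the original test setup.

```python
def write_sample(path, repetitions=8_000_000, chunk=100_000):
    """Write <text>&#135;...&#135;</text> containing `repetitions` entity refs."""
    entity = b"&#135;"
    with open(path, "wb") as out:
        out.write(b"<text>")
        remaining = repetitions
        # Write in chunks so the whole payload is never held in memory at once.
        while remaining > 0:
            n = min(chunk, remaining)
            out.write(entity * n)
            remaining -= n
        out.write(b"</text>")

# Example (hypothetical file name):
# write_sample("sample.xml")
```

With the default arguments the resulting file is about 48 MB, since each reference is 6 bytes.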

A test program for loading it and then writing it is:

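import sys
# import cElementTree as ET  # uncomment (and drop the lxml import) to compare
from lxml import etree as ET

# Parse the file named on the command line, then write the tree back out.
# The file is opened in binary mode so the parser sees the raw bytes.
with open(sys.argv[1], 'rb') as f:
    et = ET.ElementTree(file=f)
et.write('ooo')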

When it is run with cElementTree, it completes successfully in about 1
minute.
When it is run with lxml, it does not complete, even after 12 hours, and
the process stays at 100% CPU the whole time.
Further testing showed that it reaches the 'write' statement quite quickly
and gets stuck there.
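To separate the parse time from the write time, the two steps can be timed individually. The helper below is a minimal sketch: `time_round_trip` is a hypothetical name, and the standard library's xml.etree.ElementTree is imported only as a stand-in so the snippet runs without extra packages. Passing `lxml.etree` or `cElementTree` as the first argument should work the same way, since only `ElementTree(file=...)` and `write()` are used.

```python
import time
import xml.etree.ElementTree as StdET  # stdlib stand-in; swap in lxml.etree to compare

def time_round_trip(ET, src, dst):
    """Return (parse_seconds, write_seconds) for one load/write cycle."""
    t0 = time.perf_counter()
    with open(src, "rb") as f:
        tree = ET.ElementTree(file=f)   # parse step
    t1 = time.perf_counter()
    tree.write(dst)                     # serialization step
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

# Example (hypothetical file names):
# parse_s, write_s = time_round_trip(StdET, "sample.xml", "ooo")
```

On the large file, a small parse time together with a write that never returns would match the behaviour described above.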

Is this a bug, or is lxml just dead slow relative to cElementTree for this
operation?

Notes:
1) There is nothing special about '&#135;'; it is just a simple sample with
the same character repeating. The original problem showed up with a long file
of various entity refs (some encoding of binary data).
2) With shorter files (thousands of characters), cElementTree and lxml ran
at similar speeds.

TIA
Moshe