[lxml-dev] Memory Leak 2.1.1 -> 2.1.2

Dr R. Sanderson azaroth at liverpool.ac.uk
Mon Dec 22 15:33:10 CET 2008


The actual code is below; I've got it to the point where memory usage
inflates very quickly:


[cheshire@edhellond jstor]$ ./memory.py
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
cheshire  1778  1154  0  5861 14204   1 14:25 pts/2    00:00:00 /home/cheshire/install/bin/python -i ./memory.py
0
cheshire  1778  1154 99 20753 73820   1 14:25 pts/2    00:00:01 /home/cheshire/install/bin/python -i ./memory.py
238
cheshire  1778  1154 99 140239 551556 1 14:25 pts/2    00:00:08 /home/cheshire/install/bin/python -i ./memory.py
483
cheshire  1778  1154 99 245972 974616 1 14:25 pts/2    00:00:14 /home/cheshire/install/bin/python -i ./memory.py
734
cheshire  1778  1154 99 319488 1268656 1 14:25 pts/2   00:00:24 /home/cheshire/install/bin/python -i ./memory.py
1269


E.g., after parsing 1269 documents (on average 250k each), it's using a
total of 1.5 gigabytes of memory.  This also happens in 2.1.1.

I've used guppy/hpy to check that the leak isn't in Python-level code.
Putting an hp.heap() call in the loop shows that the only difference per
iteration is the for loop's frame.
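
For anyone wanting to repeat that kind of check without guppy, here is a
minimal sketch using only the standard library.  It is not what I ran: it
assumes Python 3 on a Unix-like system (for tracemalloc and resource), and
heap_vs_rss and work are names made up for illustration.  The idea is to
compare what Python's own allocator reports with what the OS reports for
the whole process.

```python
import resource
import tracemalloc


def heap_vs_rss(work, iterations=5):
    """Call work() repeatedly; after each call record the bytes traced by
    Python's allocator and the process's peak resident set size."""
    tracemalloc.start()
    rows = []
    try:
        for i in range(iterations):
            work()
            # Bytes currently allocated through Python's memory allocators.
            traced, _peak = tracemalloc.get_traced_memory()
            # Peak RSS as seen by the OS (KiB on Linux, bytes on macOS).
            rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
            rows.append((i, traced, rss))
    finally:
        tracemalloc.stop()
    return rows


if __name__ == "__main__":
    kept = []
    # A deliberate Python-level "leak" so the traced column visibly grows.
    for row in heap_vs_rss(lambda: kept.append(bytearray(100000))):
        print(row)
```

If the traced figure stays flat while RSS keeps climbing, the growth is
probably happening outside Python's allocator, for instance in memory a C
library allocates for itself.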

The actual production code works in 2.1.1, but has a lot more xpaths and 
then a serialization phase in the loop as well.


Code, with comments:

----------------------------

import commands
import os

# parse, db and session are set up elsewhere (not shown)

def build_journal(jrnl):
     global nparse
     # Search for journal descriptions
     q = parse('c3.idx-id-journal exact "%s"' % jrnl)
     rs = db.search(session, q)

     # step through matches
     for rsi in rs:
         nparse += 1

         # fetch record out of storage, use etree.XML(data) to parse
         rec = rsi.fetch_record(session)

         # process_xpath passes through directly to node.xpath()
         try:
             year = rec.process_xpath(session,
                 '/issuemap/issue-meta/numerations/pub-date/year/text()')[0]
             month = rec.process_xpath(session,
                 '/issuemap/issue-meta/numerations/pub-date/month/text()')[0]
             day = rec.process_xpath(session,
                 '/issuemap/issue-meta/numerations/pub-date/day/text()')[0]
         except:
             rsi._ymd = (0,0,0)
             del rec
             continue
         rsi._ymd = (year, month, day)
         del rec
     # sort list based on date
     rs._list.sort(key=lambda x: x._ymd)
     del rs

nparse = 0
# scan through all journal identifiers
q = parse('c3.idx-id-journal exact ""')
jids = db.scan(session, q, 1000000)

# get OS memory usage stats
pid = os.getpid()
cmd = "ps -F -p %s" % pid
print commands.getoutput(cmd)
print nparse

# and try to build
for j in jids[100:]:
     build_journal(j[0])
     print commands.getoutput(cmd).split('\n')[1]
     print nparse
----------------------------------------
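
Since the code above depends on my database, here is a rough sketch of
what a standalone version of the same loop might look like.  It makes
several assumptions: Python 3 on a Unix-like system, synthetic documents
shaped like the XPaths above, and the standard library's ElementTree as a
stand-in parser so it runs without lxml installed.  make_doc, extract_ymd,
peak_rss and run are illustrative names only.

```python
import resource
import xml.etree.ElementTree as ET  # stand-in; see the note below for lxml


def make_doc(year, month, day, padding=1000):
    """Build a synthetic document with the pub-date structure from the post."""
    filler = "<p>filler</p>" * padding
    return (
        "<issuemap><issue-meta><numerations><pub-date>"
        "<year>%d</year><month>%d</month><day>%d</day>"
        "</pub-date></numerations></issue-meta>%s</issuemap>"
        % (year, month, day, filler)
    )


def extract_ymd(data):
    """Parse one document and return (year, month, day) as strings."""
    root = ET.fromstring(data)
    base = "issue-meta/numerations/pub-date/"
    return tuple(root.findtext(base + part) for part in ("year", "month", "day"))


def peak_rss():
    """Peak resident set size so far (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run(n_docs, report_every=100):
    """Parse n_docs synthetic documents, sampling peak RSS as we go."""
    samples = []
    for i in range(1, n_docs + 1):
        extract_ymd(make_doc(2008, 12, 1 + i % 28))
        if i % report_every == 0:
            samples.append((i, peak_rss()))
    return samples


if __name__ == "__main__":
    for count, rss in run(300):
        print(count, rss)
```

To point this at lxml, the import can be swapped for
"from lxml import etree as ET", since lxml follows the ElementTree API for
fromstring() and findtext().  Exercising the xpath() route used above
would mean replacing the findtext() calls with root.xpath(...) calls,
which only lxml provides.  A peak RSS that keeps rising in step with the
document count would point to memory not being released between parses.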

Help?

Rob

On Mon, 22 Dec 2008, Dr R. Sanderson wrote:

>
> Hi all,
>
> I'm working on a script to replicate it, but with 2.1.2 or later, no
> memory is freed when parsing multiple documents in quick succession.
> The changelog says there was a memory issue fixed, so perhaps this
> introduced the bug at the same time?
>
> I've seen (but not consistently) the lxml "memory allocation failed:
> growing buffer" message.  Normally it just runs my machine out of memory.
>
> Rob

