[lxml-dev] Memory Leak 2.1.1 -> 2.1.2
Dr R. Sanderson
azaroth at liverpool.ac.uk
Mon Dec 22 15:33:10 CET 2008
The actual code is below; I've got it to the point where memory usage
inflates very quickly:
[cheshire@edhellond jstor]$ ./memory.py
UID        PID  PPID  C     SZ     RSS PSR STIME TTY      TIME     CMD
cheshire  1778  1154  0   5861   14204   1 14:25 pts/2    00:00:00 /home/cheshire/install/bin/python -i ./memory.py
0
cheshire  1778  1154 99  20753   73820   1 14:25 pts/2    00:00:01 /home/cheshire/install/bin/python -i ./memory.py
238
cheshire  1778  1154 99 140239  551556   1 14:25 pts/2    00:00:08 /home/cheshire/install/bin/python -i ./memory.py
483
cheshire  1778  1154 99 245972  974616   1 14:25 pts/2    00:00:14 /home/cheshire/install/bin/python -i ./memory.py
734
cheshire  1778  1154 99 319488 1268656   1 14:25 pts/2    00:00:24 /home/cheshire/install/bin/python -i ./memory.py
1269
E.g., after parsing 1269 documents (on average 250 KB each), it's using a
total of 1.5 gigabytes of memory. This also happens in 2.1.1.
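
To put a number on the growth, here is a small sketch that works out the average memory retained per parsed document from the ps rows above. It assumes the RSS column is in kilobytes (as procps documents for `ps -F`) and that the bare number printed after each row is the running `nparse` count. The names `SAMPLES` and `growth_per_document` are made up for this illustration.

```python
# (documents parsed, RSS in KB) pairs copied from the ps output above.
# Assumption: RSS is reported in kilobytes, as procps ps does.
SAMPLES = [
    (0, 14204),
    (238, 73820),
    (483, 551556),
    (734, 974616),
    (1269, 1268656),
]


def growth_per_document(samples):
    """Average RSS growth in KB per parsed document, first sample to last."""
    (first_n, first_rss), (last_n, last_rss) = samples[0], samples[-1]
    return (last_rss - first_rss) / float(last_n - first_n)


if __name__ == "__main__":
    print("%.1f KB retained per document" % growth_per_document(SAMPLES))
```

On these figures that works out to a little under 1 MB retained per document, roughly four times the average document size, with nothing being given back between iterations.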
I've used guppy/hpy to check that it's not Python-level code. Putting
an hp.heap() call in the loop shows the only difference per iteration to
be the for loop's frame.
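
For anyone wanting to repeat that kind of check without guppy, a rough equivalent can be sketched with the standard library's `tracemalloc` module (Python 3.4+). This is a stand-in technique, not the `hp.heap()` call described above, and `python_level_growth` is a hypothetical helper name. It only sees memory allocated through Python's own allocators, so memory that a C library obtains directly from the system allocator would not show up. A flat result here combined with a growing RSS would therefore point below the Python level.

```python
import gc
import tracemalloc


def python_level_growth(work, iterations):
    """Net bytes of Python-tracked memory retained after repeated work() calls."""
    tracemalloc.start()
    try:
        work()  # warm-up call so one-off caches are not counted as growth
        gc.collect()
        before, _peak = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            work()
        gc.collect()
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    retained = []
    # Deliberately leaky at the Python level: every buffer is kept alive.
    print(python_level_growth(lambda: retained.append(bytearray(10000)), 50))
    # Not leaky: each buffer is dropped as soon as the call returns.
    print(python_level_growth(lambda: bytearray(10000), 50))
```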
The actual production code works in 2.1.1, but has a lot more xpaths and
then a serialization phase in the loop as well.
Code, with comments:
----------------------------
def build_journal(jrnl):
    global nparse
    # Search for journal descriptions
    q = parse('c3.idx-id-journal exact "%s"' % jrnl)
    rs = db.search(session, q)
    # step through matches
    for rsi in rs:
        nparse += 1
        # fetch record out of storage, use etree.XML(data) to parse
        rec = rsi.fetch_record(session)
        # process_xpath passes through directly to node.xpath()
        try:
            year = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/year/text()')[0]
            month = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/month/text()')[0]
            day = rec.process_xpath(session,
                '/issuemap/issue-meta/numerations/pub-date/day/text()')[0]
        except:
            rsi._ymd = (0,0,0)
            del rec
            continue
        rsi._ymd = (year, month, day)
        del rec
    # sort list based on date
    rs._list.sort(key=lambda x: x._ymd)
    del rs

nparse = 0
# scan through all journal identifiers
q = parse('c3.idx-id-journal exact ""')
jids = db.scan(session, q, 1000000)
# get OS memory usage stats
pid = os.getpid()
cmd = "ps -F -p %s" % pid
print commands.getoutput(cmd)
print nparse
# and try to build
for j in jids[100:]:
    build_journal(j[0])
    print commands.getoutput(cmd).split('\n')[1]
    print nparse
----------------------------------------
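
The measuring part of the script above can also be done from inside the process, without shelling out to ps. The following is a minimal sketch only, assuming a Unix system with the standard `resource` module and Python 3. `peak_rss_kb` and `track_growth` are hypothetical names, and `process_one` stands in for whatever parses one document and runs the XPath expressions. Note that it samples peak RSS rather than current RSS, which is still enough to show steady growth.

```python
import resource
import sys


def peak_rss_kb():
    """Peak resident set size of this process, in kilobytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    return peak // 1024 if sys.platform == "darwin" else peak


def track_growth(process_one, items, report_every=100):
    """Call process_one(item) for each item, sampling peak RSS periodically.

    Returns a list of (items_processed, peak_rss_kb) tuples, starting with
    a baseline sample taken before any item is processed.
    """
    samples = [(0, peak_rss_kb())]
    for count, item in enumerate(items, start=1):
        process_one(item)
        if count % report_every == 0:
            samples.append((count, peak_rss_kb()))
    return samples


if __name__ == "__main__":
    # Stand-in workload; a real test would parse one document per call.
    for count, rss in track_growth(lambda n: str(n) * 10, range(300)):
        print(count, rss)
```

Passing a function that fetches one record, parses it, and evaluates the three pub-date XPath expressions should give the same kind of trace as the ps rows, one sample per `report_every` documents.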
Help?
Rob
On Mon, 22 Dec 2008, Dr R. Sanderson wrote:
>
> Hi all,
>
> I'm working on a script to replicate it, but using 2.1.2 or more recent
> results in not freeing any memory when parsing multiple documents in
> quick succession. The changelog says there was a memory issue fixed, so
> perhaps this introduced the bug at the same time?
>
> I've seen (but not consistently) the lxml memory allocation failed:
> growing buffer message. Normally it just runs my machine out of memory.
>
> Rob