===============================================
Benchmarking Memory usage of Python/PyPy
===============================================

$Id$

XXX draft XXX

What we want to measure
===============================================

* measure max/avg RAM usage of a process running various GC
  benchmarks, for a relevant definition of RAM usage.

* measure max/avg combined RAM usage of a few processes
  running the GC benchmarks - either launched independently,
  or forked from a single process.

* measure the "perceived pauses" for interactive apps
  when running on GCs with stop-the-world collection
  (i.e. all our framework GCs so far).

* measure the CPU usage and total execution time of some apps
  (for cross-reference, e.g. to show the space-time trade-offs
  of the various GCs).

Benchmark specifics
--------------------

* RAM usage: we need to refine what exactly is meant by this.
  An approximation is the amount of process-private RAM reported
  by the kernel. See discussion below. Also, we likely want to
  mainly measure the incremental RAM usage of a particular
  benchmark/app - i.e. the RAM that is used in addition to the
  bare interpreter loaded.

* GC microbenchmarks: allocating specific object types
  (strings/dicts/lists/tuples/old-style/new-style instances ...)
  and little scenarios like reading lines from a large file.
  Also consider reusing some of the speed benchmarks.

* allocation patterns, measure memory usage using sampling:

  1. allocate objects all the time, total number of objects
     stays constant;
  2. allocate many objects in a burst, then throw them away,
     and repeat;
  3. combined: do 1 for most of the time with a small number of
     live objects, but occasionally get a large bunch of objects
     and immediately free them;
  4. look at gcbench.

  Use several sets of "small" and "large".

* Aim for reproducible results, e.g.
  by combining some of these techniques:

  - checkpoints (perform measurements at known execution points).
    We would then simply dump allocation information on every
    n'th malloc and a few times in every garbage collection run.
    We could either do this by dumping internal information from
    the garbage collector (not accounting for rawmalloc
    properties) or by blocking the process in the malloc and GC
    code and telling the checkpointing process to take a
    checkpoint. Note that this does not reflect all properties of
    the virtual memory system because the page tables will likely
    also change in between these synchronization points.

  - high-res sampling (either in the real system or in emulators,
    e.g. look at valgrind tools).

  - valgrind provides a heap profiler called "massif", but
    valgrind only runs on x86 or similar. It gives useful hints
    about which part of the memory allocator is claiming the
    actual memory portion (obmalloc, semispaces, etc.), but the
    actual snapshotting feature is only of limited use because it
    cannot, for example, differentiate between zeroed pages and
    pages that are actually used.

  - Systemtap is basically a scripting language that hooks into
    the kernel. It has been used to benchmark the execution of
    certain libraries on systems like the Maemo platform on ARM.
    No memory profiling scripts are known to exist for it yet,
    although they are possible in theory. Currently (as of
    October 2008), Red Hat is working on providing memory
    profiling scripts, but none of this work is released yet.

* The "perceived pause" is probably best approximated by the time
  it takes to perform a single collection. For generational GCs
  we should measure the time of the collections of various
  levels; for example, nursery-only collections are very fast,
  probably too fast to be noticeable in interactive apps. In
  order to gather this information, the framework GCs need to be
  able to dump the collection statistics in a reusable manner.

* real apps:

  * some sympy computation/test?
  * an app involving C extensions?
  * ask around on pypy-dev for example apps

Interpreter instances to consider for measurement
----------------------------------------------------

* CPython 2.5
* pypy-c --opt=mem
* pypy-c --opt=3 (for comparison purposes)

We want to select and optimize good underlying settings for
PyPy's choices regarding "--opt=mem".

We also want to measure builds that include a working set of
modules and ones that include no modules at all (the bare minimum
for an interactive prompt).

XXX consider more specific target environments

Implementation
===========================

Measuring memory usage
------------------------

Linux's "RSS" cannot be used directly because in a running pypy-c
it incorrectly counts many megabytes that are null pages, loaded
from /dev/zero. Such pages are not included in the process-private
memory, which is why the latter might be a better measure. Even
more sophisticated measures might turn out to be better still.

We can test and compare different ways to measure memory usage.
One testing methodology would be to run a virtual machine which
pretends to have X megabytes of RAM, start one or several pypy-c's
running some benchmarks, see when it starts swapping, and apply
the proposed memory usage measure at this point to get a number Y.
Then we replace the pypy-c's with a single trivial C program that
clearly consumes Z megabytes of RAM, and see for which value of Z
it starts swapping. When Y and Z agree, we have found a good way
to measure memory usage.

xoraxax: doesn't that depend a lot on the runtime memory accesses
of the other processes and on reproducible algorithms in the VM
wrt. paging decisions? E.g.
in one scenario the VM could decide to page out data to the
swapfile while in another case it could just unmap .text pages.

Understanding Linux /proc/pid/smaps info
-------------------------------------------

XXX please review, correct, complete so we get to a shared good
understanding

CPython
+++++++++++++++

The most detailed info is provided by /proc/PID/smaps, starting
from Linux 2.6.14. Here is an example output of running
"python2.5" on a Linux 2.6.24 Ubuntu machine::

    08048000-08140000 r-xp 00000000 08:01 89921      /usr/bin/python2.5
    Size:               992 kB
    Rss:                768 kB
    Shared_Clean:       764 kB
    Shared_Dirty:         0 kB
    Private_Clean:        4 kB
    Private_Dirty:        0 kB
    Referenced:         768 kB

The first line indicates that the /usr/bin/python2.5 file is
mapped as Read/eXecute into the given process and is seen at
address 08048000 by the process. The virtual memory size is
992 kB, of which 768 kB are actually mapped into RAM (Rss =
Resident Set Size) - the rest of the file has not been accessed
yet and is thus not mapped. 764 kB are shared (Shared_Clean) - so
if there are other python processes they will each get their own
mapping, but no additional RAM will be used for these 764 kB.
"Clean" means that these pages can easily be evicted by dropping
them and, upon access, retrieving them again from the file.

XXX but why is the Private_Clean page there in this readonly
/usr/bin/python mapping?

Let's look at a mapping that is more indicative of the per-process
"incremental" RAM usage::

    08165000-081e0000 rw-p 08165000 00:00 0          [heap]
    Size:               492 kB
    Rss:                452 kB
    Shared_Clean:         0 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:      452 kB
    Referenced:         452 kB

Here we have a read-write anonymous mapping: objects allocated on
the heap. It uses 492 kB of virtual address space, of which 452 kB
are actually mapped in physical RAM. "Dirty" means that these
pages have been modified. Whether a page is "dirty" or "clean" is
important for swapping but not too relevant for us when measuring
the memory footprint.
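The per-mapping fields discussed above can be summed over a whole process to get an approximate "process-private" figure. The following is only a minimal sketch, not part of any existing tool: the names ``sum_smaps_fields``, ``private_kb`` and ``SAMPLE`` are made up for illustration, and the sample text is simply the two CPython mappings quoted above.

```python
import re

# Matches smaps field lines such as "Private_Dirty:       452 kB".
# Mapping header lines (address ranges) do not match and are skipped.
FIELD_RE = re.compile(r"^(\w+):\s+(\d+) kB$")


def sum_smaps_fields(smaps_text):
    """Return {field name: total kB}, summed over all mappings in the text."""
    totals = {}
    for line in smaps_text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            name, kb = match.group(1), int(match.group(2))
            totals[name] = totals.get(name, 0) + kb
    return totals


def private_kb(totals):
    """Approximate process-private RAM: pages not shared with other processes."""
    return totals.get("Private_Clean", 0) + totals.get("Private_Dirty", 0)


# Hypothetical input: the two CPython mappings shown in this document.
SAMPLE = """\
08048000-08140000 r-xp 00000000 08:01 89921      /usr/bin/python2.5
Size:               992 kB
Rss:                768 kB
Shared_Clean:       764 kB
Shared_Dirty:         0 kB
Private_Clean:        4 kB
Private_Dirty:        0 kB
Referenced:         768 kB
08165000-081e0000 rw-p 08165000 00:00 0          [heap]
Size:               492 kB
Rss:                452 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:      452 kB
Referenced:         452 kB
"""

totals = sum_smaps_fields(SAMPLE)
print("Rss: %d kB, private: %d kB" % (totals["Rss"], private_kb(totals)))
```

On a live Linux system the same functions could be fed the contents of /proc/PID/smaps for the process of interest. Whether the private total is the right measure is exactly the open question discussed under "Measuring memory usage".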
Of course, there are many more mappings, including one for the
stack. Let's see what changes if we do::

    >>> l = ["xasd"] * 1000000

We get this new mapping in the python process::

    b7890000-b7c61000 rw-p b7890000 00:00 0
    Size:              3908 kB
    Rss:               3908 kB
    Shared_Clean:         0 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:     3908 kB
    Referenced:        3908 kB

This is an anonymous read-write mapping, and all of its 3908 kB
are mapped into physical RAM. Almost all of it is the list itself,
which holds one million references to the same string object.

For some more information, see
http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
which also points to the mem_usage.py tool that presents a
process's mappings in a somewhat nicer summarized format.

For understanding how swapping and Linux memory management work,
here is a nice read:
http://sourcefrog.net/weblog/software/linux-kernel/swap.html

pypy-c
++++++++++++++++++++++++++++

The same with a pypy-c process (using the hybrid GC) shows many
mostly small mappings, and four big ones. They are::

    08048000-084d2000 r-xp 00000000 03:03 2105442    /path/to/pypy-c
    Size:              4648 kB
    Rss:               2076 kB
    Shared_Clean:      2076 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:        0 kB

This is the code section. It is mapped read-only from the disk.
We see that the code section is 4.6MB in size, of which 2MB have
been loaded so far. These 2MB are Clean, so if the system runs out
of memory they can simply be discarded, and later reloaded on
demand from the executable. The 2MB are also Shared, so several
pypy-c processes share them with each other.

Second mapping::

    084d3000-088f6000 rw-p 0048a000 03:03 2105442    /path/to/pypy-c
    Size:              4236 kB
    Rss:               1516 kB
    Shared_Clean:      1232 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:      284 kB

It's the data section of the executable. 1.5MB have been touched
so far and loaded from the disk. Out of these, 1.2MB are Clean and
Shared because they haven't been modified at all, and 0.3MB are
Private and Dirty because the data was modified in this process.
Third mapping::

    088f6000-08d9b000 rw-p 088f6000 00:00 0          [heap]
    Size:              4756 kB
    Rss:               4680 kB
    Shared_Clean:         0 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:     4680 kB

This is malloc-ed memory. I am not sure but I think that it
contains mostly the long-lived objects of the hybrid GC. We should
first try to look at a pypy-c using the generation GC; the
generation GC doesn't call malloc.

Fourth mapping::

    b6c3f000-b7d24000 rw-p b6c3f000 00:00 0
    Size:             17300 kB
    Rss:              17104 kB
    Shared_Clean:     14308 kB
    Shared_Dirty:         0 kB
    Private_Clean:        0 kB
    Private_Dirty:     2796 kB

This is the GC heap, for the framework GC. It is a private mmap
initialized by reading from /dev/zero. The total size is 17MB. Of
these, 14MB are Shared and Clean because they are pages full of
zeroes. These 14MB don't consume any RAM anywhere; I suspect that
Linux implements them by having only 4KB of zeroes in a corner of
the kernel, and making all the pages in these 14MB shared with
this single page. The remaining 2.8MB are Private and Dirty
because they contain GC-managed objects.

In summary (and as always assuming the OS is Linux), 5
independently-started pypy-c processes consume::

    1 * size(code section accessed so far) +
    1 * size(data section accessed and not modified so far) +
    5 * size(data section modified) +
    5 * size(malloc heap) +
    5 * size(framework GC heap in use)

Tool to measure Python interpreter memory footprint
-------------------------------------------------------

We need a tool that can invoke python apps and benchmarks and
measure their memory footprint, producing data that can be parsed
back and used for producing graphs, tables etc.

Also cross-check against the tools for maemo:

* http://maemo.org/development/tools/doc/diablo/sp-smaps-measure/
* http://maemo.org/development/tools/doc/diablo/sp-memusage/

Exmap can be used to see very useful statistics about processes,
including very precise shared RSS figures.
It can also show whether a specific symbol is mapped into RAM, but
this did not seem to be precise in xorAxAx's tests, as the
information seemed to be invariant to e.g. the usage of
unicodedata.

http://labs.o-hand.com/exmap-console/ for embedded devices

http://www.berthels.co.uk/exmap/ for the main tool

http://lwn.net/Articles/230975/ presents some ideas and a set of
patches for more precise page mapping information by Matt Mackall

runbench.py / report.py
++++++++++++++++++++++++++++++++

At http://codespeak.net/svn/pypy/build/benchmem there are Linux
scripts to measure the memory usage of benchmarks and generate a
report from the results:

* runbench.py runs the benchmarks in the ``benchmark`` directory
  and writes information about memory usage into ``bench.log``.
  One can specify multiple Python interpreters for execution of
  the benchmarks.

* report.py takes a bench.log and generates a textual report.

Current benchmarks:

* ``sizes.py``: a number of benchmarks that create different
  python objects

Considerations about static memory usage
----------------------------------------

While languages like C guarantee locality of static global data
and code through their module systems, PyPy generates a lot of
functions and global data into various modules. One question is
whether the binary produced by the C compiler still provides
locality for these structures. Imagine the vtables of the
compiler's AST nodes spread all over the resulting data segment: a
single bytecode compilation would then load the whole data segment
into RAM. Similar problems might be found with functions. One
consequence is that reducing the static memory usage of the
executable might not lead to reduced memory usage at runtime if
the locality is already good, because pages that are never touched
are never loaded. Conversely, increasing the locality would help
to reduce the runtime memory usage.
In order to find out what kind of global data is created by a
translation of an RPython program, one could write a
reftracker-like graph viewer page that recursively sums up the
value sizes and provides navigation to referenced container
values. This information could also be generated in a report-like
way with grouping by RPython module ("please list the sums of the
size of the ll values involved in all prebuilt constants
referenced only by pypy.module.xx").

GC related Papers and links
===================================

http://www-cs.canisius.edu/~hertzm/thesis.pdf
    a GC with OS support that behaves well in the context of
    swapping.

http://portal.acm.org/citation.cfm?id=1070891.1065943
    An Energy Efficient Garbage Collector for Java Embedded
    Devices (mark/compact + deferred ref counting)

http://www.ibm.com/developerworks/ibm/library/i-incrcomp/
    "How to minimize pause times and free the heap from dark
    matter"