PyPy
PyPy[garbage_collection]
modified Aug 17, 2008 by Armin Rigo

Garbage Collection in PyPy

1   Introduction

Warning: The overview and description of our garbage collection strategy and framework is not here but in the EU-report on this topic. The present document describes the specific garbage collectors that we wrote in our framework.

2   Garbage collectors currently written for the GC framework

(Very rough sketch only for now.)

Reminder: to select which GC you want to include in a translated RPython program, use the --gc=NAME option of translate.py. For more details, see the overview of command line options for translation.

2.1   Mark and Sweep

Classical Mark and Sweep collector. Also contains a lot of experimental and half-unmaintained features. See rpython/memory/gc/marksweep.py.

2.2   Semispace copying collector

Two arenas of equal size, with only one arena in use and getting filled with new objects. When the arena is full, the live objects are copied into the other arena using Cheney's algorithm. The old arena is then cleared. See rpython/memory/gc/semispace.py.

On Unix the clearing is done by reading /dev/zero into the arena, which is extremely memory efficient at least on Linux: it lets the kernel free the RAM that the old arena used and replace it all with allocated-on-demand memory.

The size of each semispace starts at 8MB but grows as needed when the amount of objects alive grows.

2.3   Generational GC

This is a two-generations GC. See rpython/memory/gc/generation.py.

It is implemented as a subclass of the Semispace copying collector. It adds a nursery, which is a chunk of the current semispace. Its size is computed to be half the size of the CPU Level 2 cache. Allocations fill the nursery, and when it is full, it is collected and the objects still alive are moved to the rest of the current semispace.

The idea is that it is very common for objects to die soon after they are created. Generational GCs help a lot in this case, particularly if the amount of live objects really manipulated by the program fits in the Level 2 cache. Moreover, the semispaces fill up much more slowly, making full collections less frequent.

2.4   Hybrid GC

This is a three-generations GC.

It is implemented as a subclass of the Generational GC. The Hybrid GC can handle both objects that are inside and objects that are outside the semispaces ("external"). The external objects are not moving and collected in a mark-and-sweep fashion. Large objects are allocated as external objects to avoid costly moves. Small objects that survive for a long enough time (several semispace collections) are also made external so that they stop moving.

This is coupled with a segregation of the objects in three generations. Each generation is collected much less often than the previous one. The division of the generations is slightly more complicated than just nursery / semispace / external; see the diagram at the start of the source code, in rpython/memory/gc/hybrid.py.