Porting the JIT to CLI (part 3) =============================== In my two previous_ posts_, we talked about the PyPy JIT generator, seeing that it can produce huge speedups and how its backend-independent frontend works. In this post, we will look closer at the internals of the CLI JIT backend; in particular, we will see how we work around some serious limitations of the platform, and why these workarounds didn't have any serious impact on the performances of our `toy virtual machine`_. .. _previous: http://morepypy.blogspot.com/2008/11/porting-jit-to-cli-part-1.html .. _posts: http://morepypy.blogspot.com/2008/11/porting-jit-to-cli-part-2.html .. _`toy virtual machine`: http://codespeak.net/svn/pypy/branch/oo-jit/pypy/jit/tl/tlc.py Graphs, blocks, links --------------------- One of the core aspect of PyPy translator is the concept of **flow graph**: a flow graph is a data structure that represents the code we are operating on. It is composed by a set of **basic blocks**, each block containing a sequence of operations; blocks are connected together by **links**, and each link can carry a variable number of arguments whose value is passed to the target block. In case a block contains more than one outgoing links, the one to follow is selected by looking at the value of a designated variable (the **exitswitch**), thus making possible to implement conditional jumps. To have a more complete description of the flow graphs model, check the documentation_. .. image:: flowgraph.png .. _documentation: http://codespeak.net/pypy/dist/pypy/doc/translation.html#the-flow-model As we saw in the previous post, the generated JIT compiler makes heavy use of **flexswitches** to generate efficient code, continuously intermixing JIT-compile time and runtime. In terms of graphs, we can think of a flexswitch as a special block whose links change over time. In particular, adding a new case to the flexswitch is equivalent to create a link whose target is a new block where the just generated code starts. Thus, the graphs **grows** over the time, as showed by the following images: .. image:: promotion-1.png .. image:: promotion-2.png .. image:: promotion-3.png In the images above, the block containing the flexswitch is colored in cyan. In the first picture, there is only one block connected to the flexswitch: this block contains the code to restart the JIT compilation. The second picture shows the graph after the first case has been added: you can clearly see that a new block has been created and attached to the flexswitch. Finally, the third picture shows the graph after a while, with a lot of new blocks attached. Translate graphs to CLI ----------------------- Conceptually, the goal of the CLI JIT backend is to express these graphs in terms of CLI bytecode. Translating the single block is easy, as it is just a list of sequential operation, and it's straightforward to map each operation to the equivalent CLI opcode or to a call to a helper method. Moreover, we need a way to express **links** between the various basic blocks: if the links are known in advance, render them is as easy as emitting a (potentially conditional) jump to the target block. Thus, we won't discuss this part in detail, as it is quite straightforward. The hard part is how to implement **flexswitches**: at the time when we are emitting the code, some of the blocks of this growable graph don't even exist: how can we make a jump to a non existent block of code? For backends that emit assembly code, it is rather easy: when they need to add a new case to the flexswitch, they can just **patch** the existing code to insert a jump to a newly allocated area of the memory, where the new code is being generated in. For CLI this approach is not feasible, as the VM will never allow us to modify existing code. Thus, we need to think of a different approach. Graphs and methods ------------------ In .NET, the basic unit of compilation is the **method**: the only way to execute some bytecode is to wrap it into a method. Moreover, it is not possible to execute a method until it has been completed, and after this point it is no longer possible to **add new code**. Because of all these constraints we cannot simply map each graph to its own method, since we saw that our graphs can **grow** after they have already been executed few times. Hence, we need to distinguish between the two concepts: - a **graph** is the logical unit of code as seen by the JIT compiler: concretely, the CLI JIT backend renders it as *one or more* methods; - a **method** is a collection of basic blocks; each method has the so called *parent graph*, i.e. the graph its blocks logically belongs to. The first method of a graph is called **main method** (which has nothing to do with the ``Main`` static methods found in ``.exe`` files); other methods are called **children methods**. When we want to add a new case to the flexswitch, we create a method containing all the new code; then we wrap the method inside a `delegate`_ (the .NET equivalent of a function pointer) and pass it to the flexswitch, so that it can later invoke it. .. _`delegate`: http://msdn.microsoft.com/en-us/magazine/cc301810.aspx The hard bit: non-local links ----------------------------- Using this approach, after a while the blocks of our original graph are scattered over a lot of different methods; however, there are no constraints about how these blocks can be linked together, so it happens to have links between blocks which are not in the same method. In the following, we will refer to them as **non-local links**. If the non-local block we want to jump to happens to be at the beginning of its containing method, it is enough to invoke the method; but, what if we want to jump somewhere in the middle? What we really want is to produce a method which has **multiple entry-points**; again, doing it in assembly would be trivial, but the virtual machine does not provide any support for it, so we need a work around. Each method in a graph is assigned an unique 16 bit *method id*; each block in a method is assigned a progressive 16 bit *block number*. From this two numbers, we can compute the *block id* as an ``unsigned integer``, by storing the method id in the first 16 bits and the block number in the second 16 bits. By construction, the block id is guaranteed to be unique in the graph. The following picture shows an excerpt of a graph composed of three methods; the id of each method is shown in red, while the block ids are shown in red (for the method id part) and black (for the block number part). The graph contains three non-local links. In particular, the link between blocks ``0x00020001`` and ``0x00010001`` is particularly hard to implement, as it connects two blocks in different children, and the target block is not at the beginning of the method. .. image:: flexswitch-cli.png Every method contains a special **dispatch block**, (not shown in the picture above) whose goal is to jump to the specified block number inside the method itself. The first argument of a child method is always a block id; when the method starts, it immediately jumps to the dispatch block, and thus to the desired block. For example, suppose to have a method which contains 3 blocks numbered 0, 1, 2; here is how its dispatch blocks looks like; for simplicity it is shown as C# code, but it is actually generated as IL bytecode:: // dispatch block int methodid = (blockid & 0xFFFF0000) >> 16); // take the first 16 bits int blocknum = blockid && 0x0000FFFF; // take the second 16 bits if (methodid != MY_METHOD_ID) { // jump_to_unknown block ... } switch(blocknum) { case 0: goto block0; case 1: goto block1; case 2: goto block2; default: throw new Exception("Invalid block id"); } Whenever we want to jump to a non-local block, it is enough to store the block id in the appropriate variable and jump to the dispatch block. If the block resides in a different method, the ``jump_to_unknown`` block is entered; this special block is implemented differently by the main method and the child methods, as we will see soon. Each time a new method is added to the graph, we build a delegate for it, and store it in a special array called ``method_map``; since we assign the method id sequentially starting from 0, we are sure that to fetch the method whose id is ``n`` we can simply load the ``n``-th element of the array. The ``jump_to_unknown`` block of the main method uses this array to select the right method, and calls it (``FlexSwitchCase`` is the type of delegates for all children methods):: // jump_to_unknown block of the main method FlexSwitchCase meth = method_map[methodid]; blockid = meth(blockid, ...); // execute the method goto dispatch_block; Each child method returns a block id specifying the next block to jump to; after its execution, we assign the return value to the ``blockid`` variable, and jump again to the dispatch block, which will jump again to the appropriate block. Keeping this in mind, it is straightforward to implement the ``jump_to_unknown`` block of children methods: it is enough to return the target block id to the caller, and let its dispatch loop do the right thing. If the caller is also a child method, it will return it again, until we reach the dispatch loop of the main method, which will finally do the jump. In theory, we could implement things differently and jumping directly from a child method to another one, but in that case the call stack could grows indefinitely in case of a tight loop between two blocks residing in different methods. To implement the dispatch block we can exploit the ``switch`` opcode of the CLI; if the .NET JIT is smart enough, it can render it using an indirect jump; overall, jumping to a non-local block consists of an indirect function call (by invoking the delegate) plus an indirect jump (by executing the ``switch`` opcode); even if this is more costly than a simple direct jump, we will see in the next section that this not the main source of overhead when following a non-local link. Obviously, the slow dispatching logic is needed only when we want to jump to a non-local block; if the target block happens to reside in the same method as the current one, we can directly jump to it, completely removing the overhead. Moreover, the dispatch blocks are emitted only if needed, i.e. if the parent graph contains at least one flexswitch. Graphs without flexswitches are rendered in the obvious way, by making one method per graph. The slow bit: passing arguments --------------------------------- Jumping to the correct block is not enough to follow a link: as we said before, each link carries a **set of arguments** to be passed from the source to the target block. As usual, passing arguments across local links is easy, as we can just use local variables to hold their values; on the other hand, non-local links make things more complex. The only way to jump to a block is to invoke its containing method, so the first solution that comes to mind is to specify its input arguments as parameter of the method; however, each block has potentially a different number (and different types) of input arguments than every other block, so we need to think of something else. An alternative solution could be to compute the union of the sets of input arguments of **all the blocks in the method**, and use this set as a signature for the method; this way, there would be enough space to specify the input arguments for every block we might want to jump to, each block ignoring the exceeding unused parameters. Unfortunately, all the children methods must have the **very same signature**, as they are all called from the same calling site in the dispatch block of the main method. Since the union of the set of input arguments (and hence the computed signature) varies from method to method, this solution cannot work. We might think to determine the signature by computing the union of input arguments of **all blocks in the graph**; this way, all the children methods would have the same signature. But as we said above, the graph grows new blocks at runtime, so we cannot determine in advance which set of input arguments we will need. To solve the problem we need a way to pass a variable number of arguments without knowing in advance neither their number nor their types. Thus, we use an instance of this class:: public class InputArgs { public int[] ints; public float[] floats; public object[] objs; ... } Since the fields are arrays, they can grow as needed to contain any number of arguments; arguments whose type is primitive are stored in the ``ints`` or ``floats`` array, depending on their type; arguments whose type is a reference type are stored in the ``objs`` array: it's up to each block to cast each argument back to the needed type. This solution impose a huge overhead on both writing and reading arguments: - when writing, we need to make sure that the arrays are big enough to contains all the arguments we need; if not, we need to allocate a bigger array. Moreover, for each argument we store into the array the virtual machine performs a bound-check, even if we know the index will never be out of bounds (because we checked the size of the array in advance); - when reading, the same bound-check is performed for each argument read; moreover, for each value read from the ``objs`` array we need to insert a downcast. To mitigate the performance drop, we avoid to allocate a new ``InputArgs`` object each time we do a non-local jump; instead, we preallocate one at the beginning of the main method, and reuse it all the time. Our benchmarks_ show that passing arguments in arrays is about 10 times slower than passing them as real parameter of a method. Unfortunately, we couldn't come up with anything better. .. _benchmarks: http://codespeak.net/svn/user/antocuni/cli-bench/arguments.cs Implement flexswitches ---------------------- Now, we can exploit all this machinery to implement flexswitches, as this is our ultimate goal. As described above, the point is to be able to **add new cases** at runtime, each case represented as a delegate. Here is an excerpt of the C# class that implements a flexswitch that switches over an integer value:: public class IntLowLevelFlexSwitch: { public uint default_blockid = 0xFFFFFFFF; public int numcases = 0; public int[] values = new int[4]; public FlexSwitchCase[] cases = new FlexSwitchCase[4]; public void add_case(int value, FlexSwitchCase c) { ... } public uint execute(int value, InputArgs args) { for(int i=0; i