\chapter{The CLI JIT backend} \label{cha:cli-backend} \section{Teaching the JIT frontend Object Orientation} \label{sec:teaching-oo} As explained by Section \ref{sec:pypy-architecture}, the Translation Toolchain uses either \emph{lltype} or \emph{ootype} for its internal representation. Originally, the PyPy JIT compiler generator has been designed for \emph{lltype}, in particular for being translated with the C backend and to emit machine code for a CPU like the \emph{x86} instruction set. Before writing the backend for the CLI virtual machine, we had to port both the \emph{codewriter} and the JIT frontend to \emph{ootype}. In the spirit of Chapter \ref{cha:target-vm}, the porting has been done with both the CLI and the JVM in mind, with the idea that writing a JIT backend for the JVM should be very easy once the whole infrastructure is ready. In particular, the most important difference between \emph{lltype} and \emph{ootype} is how to inline method calls. As we saw in Section \ref{sec:jit-architecture}, to inline a method call the tracer must know which \emph{jitcode} it corresponds to. In general, to know which implementation of the method will be called on a specific object, we must know its class: thus, during the tracing phase, the tracer put a \lstinline{guard_class} in front of each method invocation, which from that point on can be assumed to be known. In \emph{lltype} a method implementation is represented by a function pointer stored in the \emph{vtable}\footnotemark\ of the class, called using the \lstinline{indirect_call} operation. Since the class, and hence its \emph{vtable}, is known, we can easily discover the memory address of the target method. Then, we can look up a table, called \lstinline{indirectcall_dict}, which maps memory addresses to \emph{jitcodes} to find the desired one. \footnotetext{\needreview The \emph{Virtual Method Table} or \emph{vtable} is the standard mechanism to implement late binding in statically typed object oriented languages. The \emph{vtable} is a per-class table of function pointers pointing to the correct implementation of each method for the given class.} In \emph{ootype}, classes and methods are first class citizens: methods are invoked using the \lstinline{oosend} operation, which takes care of dispatching to the correct implementation and is translated into a \emph{virtual method call} on the target VM\footnote{\lstinline{callvirt} on the CLI and \lstinline{invokevirtual} on the JVM}. Thus, the virtual method dispatching mechanism is hidden from us and neither the \emph{vtable} nor the memory addresses of the method implementations can be accessed. Therefore, instead of using the above mentioned \lstinline{indirectcall_dict}, we build a different table for each possible method which maps the \textbf{exact class} of the object to the \emph{jitcode} that corresponds to the right method implementation. Then, when the tracer needs to inline a method call, it can just look up the class of the object, or one of its superclasses, in the table for the method it is considering. Figure \ref{fig:method-tables-src} shows an example in RPython with one class \lstinline{A} which defines two methods, and its subclass \lstinline{B} which overrides one. Figure \ref{fig:method-tables} shows the method tables computed for both methods. Being RPython, the set of methods for each class does not change at runtime (see Section \ref{sec:pypy-architecture}), thus the method tables can be computed at translation time by the \emph{codewriter}, while the frontend uses them at compile time. \begin{figure} \centering \begin{minipage}[b]{0.48\linewidth} % A minipage that covers half the page \begin{lstlisting}[frame=none] class A: def f(self): return 1 def g(self): return 2 class B(A): def f(self): return 3 \end{lstlisting} \hfigrule \caption{RPython hierarchy of classes} \label{fig:method-tables-src} \end{minipage} \hspace{0.02\linewidth} \begin{minipage}[b]{0.48\linewidth} \centering \begin{tabular}{c|c} \multicolumn{2}{l}{\textbf{Method \emph{f}}} \\ Class & \emph{Jitcode} \\ \hline A & A.f \\ B & B.f \\ \end{tabular} \vspace{0.6cm} \begin{tabular}{c|c} \multicolumn{2}{l}{\textbf{Method \emph{g}}} \\ Class & \emph{Jitcode} \\ \hline A & A.g \\ \end{tabular} \vspace{0.2cm} \hfigrule \caption{Method tables for the \emph{jitcodes}} \label{fig:method-tables} \end{minipage} \end{figure} \section{The interface between the JIT frontend and backend} This section describes how the JIT frontend interacts with the backend: even if most of the job of the backend is done during the \emph{compilation} phase (see Figure \ref{fig:tracing-phases}), we will see that it plays an important role also for the other phases of the tracing JIT. The frontend is designed to be as generic as possible, but there are tasks for which it is indispensable to interact directly with the target platform: thus, the backend connects the abstract layer used by the frontend and the concrete level of the target platform. In this section, we describe the interface for \emph{ootype} backends. The interface for \emph{lltype} backends is the same for the most part and the differences are minimal (for example, they do not have any notion of \emph{oosend}). Concretely, the backend must implement the \lstinline{CPU} class: the frontend stores a reference to a singleton instance of this class, and invokes method on it. \lstinline{CPU} is the only class seen by the frontend, but internally the code of the backend is divided into more classes. The subsequent subsections describes the interface of the \lstinline{CPU} class. Then, the next sections of this chapter will explain how the CLI backend implements this interface. \subsection{Descriptors} \label{sec:descriptors} As described in Section \ref{sec:teaching-oo}, the \emph{ootype} version of the JIT frontend is aware of the object oriented entities of the target platform. However, the concrete internal representation of classes, methods, etc. used by \emph{ootype} is obviously different from the representation used by the native tools of the platform. For example, in \emph{ootype} the concept of ``class'' is denoted by instances of the class \lstinline{pypy.rpython.ootype.Instance}, while on the CLI side we use instances of \lstinline{System.Type}. The concept of \emph{descriptors} fills the gaps between the two worlds: a descriptor is a token that can be obtained by the frontend at translation time. Such token is \emph{opaque} and the frontend has no way to inspect its inside, and its only use is to be passed back to the backend later. The following table lists all possible descriptors: \begin{tabular}{ll} \toprule \textbf{Descriptor} & \textbf{Meaning} \\ \midrule \texttt{typedescr(T)} & the type \texttt{T} \\ \texttt{arradescr(T)} & the array type whose items are of type \texttt{T} \\ \texttt{methdescr(T, name)} & the method \texttt{name} of the type \texttt{T} \\ \texttt{fielddescr(T, name)} & the field \texttt{name} of the type \texttt{T} \\ \texttt{calldescr(ARGS, RESULT)} & a static\footnotemark\ method type of the specified signature\\ \bottomrule \end{tabular} \footnotetext{I.e., a method which is not bound to any particular instance. Static methods are the CLI/JVM equivalent of functions.} \subsection{Executing operations} As already explained, during the tracing phase the operations are \textbf{both} executed and recorded. To guarantee correctness, the result of the execution must the same that would be obtained by the underlying VM. For simple operations the JIT frontend knows everything it needs for execution: for example, the operation \lstinline{int_add(40, 2)} yields always the same result whatever platform we are targeting. However, some operations cannot be performed without knowing the implementation details of the target platform. Take, for example, the operation \lstinline{getfield_gc}\footnote{the \lstinline{_gc} suffix is there for historical reasons and makes sense only for \emph{lltype}, thus can be ignored for \emph{ootype}}, which reads the content of a field of an object: depending on the platform, it could be implemented in very different ways, e.g. by fetching a value stored in memory at a certain offset from the address of the object (as it is the case for the \emph{x86} backend), or by invoking the reflection APIs of the runtime system to actually get the desired value (as the \emph{CLI} backend might do, but the actual implementation is slightly different for efficiency reasons). Therefore the JIT frontend has to delegate the execution of some operations to the backend. The needed extra information are passed as \emph{descriptors}: following the \lstinline{getfield_gc} example, the field to read from is encoded in a \emph{fielddescr}. The backend exports one method for each of the operations it implements, as shown by Figure \ref{fig:do-operations}. Each method takes a \emph{descriptor} and a list of arguments, which contains the actual operands. \needreview The difference between \lstinline{do_new_with_vtable} and \lstinline{do_runtimenew} is the same as between creating an object in C\# using the statement \lstinline{new} and creating an object using the reflection. Note that these methods directly \emph{execute} the operations, but do not emit machine code for them. Machine code generation is described in Section \ref{sec:compiling-loops}. \begin{figure}\footnotesize \begin{tabular}{lllp{6.5cm}} \toprule \textbf{Operation} & \textbf{Arguments} & \textbf{Descr} \\ \midrule \texttt{do\_new\_with\_vtable} & & \texttt{typedescr} & New object of type \texttt{typedescr} \\ \texttt{do\_new\_array} & \texttt{length} & \texttt{typedescr} & New array \texttt{typedescr[length]} \\ \texttt{do\_runtimenew} & \texttt{class} & & New object of type \texttt{class} \\ \midrule[0.03em] \texttt{do\_getfield\_gc} & \texttt{obj} & \texttt{fielddescr} & Get the specified field out of \texttt{obj} \\ \texttt{do\_setfield\_gc} & \texttt{obj, value} & \texttt{fielddescr} & Set the specified field of \texttt{obj} to \texttt{value} \\ \texttt{do\_oosend} & \texttt{args\_array} & \texttt{methdescr} & Call the specified instance method. The receiver is in \texttt{args\_array[0]} \\ \texttt{do\_instanceof} & \texttt{obj} & \texttt{typedescr} & Test whether \texttt{obj} is an instance of \texttt{typedescr} \\ \midrule[0.03em] \texttt{do\_arraylen\_gc} & \texttt{array} & \texttt{arraydescr} & Return the length of \texttt{array} \\ \texttt{do\_getarrayitem\_gc} & \texttt{array, i} & \texttt{arraydescr} & Return \texttt{array[i]} \\ \texttt{do\_setarrayitem\_gc} & \texttt{array, i, value} & \texttt{arraydescr} & \texttt{array[i] = value} \\ \midrule[0.03em] \texttt{do\_call} & \texttt{args\_array} & \texttt{calldescr} & Call the static method in \texttt{args\_array[0]} \\ \bottomrule \end{tabular} \caption{The operations implemented by the JIT backends} \label{fig:do-operations} \end{figure} \subsection{Loops and bridges} The most important task of the backend is to translate loop and bridges (see Section \ref{sec:loops-bridges}) into efficient machine code that can be directly executed on the target platform. To request the compilation of a loop, the frontend calls the method \lstinline{compile_loop}, passing the trace, i.e. the list of operations to be emitted, and the list of \emph{input arguments}. Despite the name for historical reasons, \lstinline{compile_loop} is also called to compile entry bridges. Similarly, to compile inner and exit bridges the frontend calls \lstinline{compile_bridge}, which takes the same parameters as \lstinline{compile_loop}, with the addition of a token that describes which \emph{hot guard} the bridge starts from. The backend connects the existing guard and the newly created bridge so that the next time the guard fails, the bridge will be executed. \subsection{Executing loops} \label{sec:executing-loops} Once compiled, a loop is ready to be executed: before doing so, the JIT frontend needs to pass it the correct \emph{input arguments}, so that it starts in the right state. The backend exposes three methods for setting the input arguments: \begin{itemize} \item \lstinline{set_future_value_int(i, value)} \item \lstinline{set_future_value_float(i, value)} \item \lstinline{set_future_value_ref(i, value)} \end{itemize} Each of them sets the $i$th input argument to $value$, and they only differ for the type of the \lstinline{value} parameter\footnote{We need three distinct names because the C backend does not support the overloading of functions.}, which can be respectively \lstinline{int}, \lstinline{float} or \lstinline{object}\footnote{The suffix \lstinline{_ref} stands for ``reference value'', i.e. an object for \emph{ootype} and a pointer for \emph{lltype}}. Note that the type of \lstinline{value} for \lstinline{set_future_value_ref} is \lstinline{object}: whenever an object is used as an input argument, it is casted to \lstinline{object}, passed to \lstinline{set_future_value_ref}, and casted back to its original type inside the loop. This has to be done because the frontend does not know in advance all the possible types that could be used for input arguments. Once the input arguments have been set, the JIT frontend triggers the execution of the loop by invoking the \lstinline{execute_token} method, which takes a single parameters that uniquely identifies the loop in question. Then, the loop runs until a guard fails. At this point, the JIT frontend needs to know the values of the internal variables of the loop to restart the interpreter in the right state. These can be queried by invoking \lstinline{get_latest_value_int}, \lstinline{get_latest_value_float} and \lstinline{get_latest_value_ref}, which work in a similar way as their \lstinline{set_future_value_*} counterparts. Note that in this way input arguments passing and value returning is not very efficient, as a method invocation is needed for each of them. However, tracing JITs spend most of their time \textbf{inside} the loops, so this process does not affect negatively efficiency. \section{Compiling loops} \label{sec:compiling-loops} As anticipated by Section \ref{sec:dynamic-generation}, the PyPy JIT emits new bytecode at runtime. In .NET, the unit of compilation is the \emph{method}, hence each compiled loop generates a method. The signature of \emph{all} the generated methods is \lstinline[language=Java]{void Loop(InputArgs args)}, expressed in C\# syntax. The only parameter is of type \lstinline{InputArgs}, which contains three arrays of integers, floats and objects. The \lstinline{CPU} holds an instance of \lstinline{InputArgs}, which is then passed to all the loops when invoked\footnotemark. The methods \lstinline{set_future_value_*} store the actual values of the input arguments in these arrays. Despite its name, which survives for historical reasons, \lstinline{InputArgs} is used \textbf{both} for input arguments and for return values. Thus, the generated code will store the relevant values into the same arrays, which will then be read by \lstinline{get_latest_value_*}. \footnotetext{Thus, since \lstinline{CPU} is a singleton, so is \lstinline{InputArgs}. This works well as long as we support only one thread, but in case of multi-threading we may want to have a per-thread \lstinline{InputArgs} instance} A generated method is composed of three parts: \begin{itemize} \item the \textbf{setup code}, which initializes all the local variables to their starting values, according to the content of \lstinline{args} \item the \textbf{loop body}, which actually executes the loop. Generating the body is easy, as each operation can be easily mapped into a sequence of \emph{CLI instructions}. Section \ref{sec:code-generation} explains in detail how the code is actually generated \item the \textbf{tear-down code}: for each guard, there is a bit of code which is triggered when it fails and stores the relevant local variables into \lstinline{args} \end{itemize} \section{Code generation} \label{sec:code-generation} To generate the actual loop body, the backend needs to convert each operation of the intermediate representation used by the JIT into one or more CLI instructions. This intermediate representation directly derives from the one used by the translation toolchain (see Section \ref{sec:pypy-architecture}): in particular, most of the operations have the same name and semantics. By exploiting this fact, we can write the JIT backend very efficiently without need to teach it explicitly how to translate each operation: the CLI static backend\footnotemark\ already knows how to convert operations into CLI instructions, thus we just need a way to reuse this knowledge. \footnotetext{It is important here not to confuse the CLI static and JIT backends: the first operates at translation time, and produces a text file containing the classes and the methods composing \emph{pypy-cli}, which is then compiled into an executable. On the other hand, the JIT backend operates at runtime and generates CLI bytecode by invoking the API exposed by the \texttt{Reflection.Emit} namespace.} The CLI static backend contains a table mapping operations into CLI instructions: Figure \ref{fig:cli-opcodes} shows an excerpt of this table. We won't explain in detail its content, but from the figure it should be clear that it uses an internal language to express CLI instructions: in particular, every string such as \lstinline{'add'}, \lstinline{'sub'} or \lstinline{'ldnull'} is translated directly into the corresponding CLI instruction, and moreover there are internal commands to do a more complex translation (e.g., \lstinline{PushAllArgs} pushes all the arguments of the operation on the CLI stack). \begin{figure} \centering \begin{lstlisting} Not = ['ldc.i4.0', 'ceq'] operations = { 'same_as': [PushAllArgs, StoreResult], 'bool_not': [PushAllArgs] + Not + [StoreResult], ... 'ooisnull': [PushAllArgs, 'ldnull', 'ceq', StoreResult], 'oononnull': [PushAllArgs, 'ldnull', 'ceq'] + Not + [StoreResult], ... 'int_add': 'add', 'int_sub': 'sub', 'int_mul': 'mul', 'int_floordiv': 'div', 'int_mod': 'rem', 'int_lt': 'clt', ... } \end{lstlisting} \caption{Operations table for the static CLI backend} \label{fig:cli-opcodes} \end{figure} Unfortunately, the CLI JIT backend cannot directly use this table for generating the code. As explained in Section \ref{sec:jit-architecture} the JIT frontend and backends need to be written in RPython, while the \lstinline{operations} table is not\footnote{Remind that the Translation Toolchain does not need to be written in RPython, thus the static CLI backend can freely use the table.}. However, we can use the \emph{metaprogramming} capabilities of RPython to automatically generate valid RPython code from the table. Due to the way it is designed, the Translation Toolchain starts to analyze RPython modules only \emph{after} they have been imported: this means that at import time we can use the full Python language to generate RPython classes and functions that will be later fed into the Translation Toolchain. In fact, we can think of Python as the metaprogramming language for RPython\cite{rpython}. \begin{figure}[t] \centering \begin{lstlisting} def emit_op_getfield_gc(self, op): descr = op.descr assert isinstance(descr, FieldDescr) clitype = descr.get_self_clitype() fieldinfo = descr.get_field_info() obj = op.args[0] # push the object on the stack, and maybe cast it to the correct type obj.load(self) if obj.getCliType(self) is not clitype: self.il.Emit(OpCodes.Castclass, clitype) # actually load the field value self.il.Emit(OpCodes.Ldfld, fieldinfo) # store the result in the proper variable self.store_result(op) \end{lstlisting} \caption{Implementation of the \lstinline{getfield_gc} operation} \label{fig:emit-op-getfield-gc} \end{figure} \begin{wrapfigure}{r}{210pt} \centering \begin{lstlisting} def emit_op_int_add(self, op): self.push_all_args(op) self.il.Emit(OpCodes.Add) self.store_result(op) \end{lstlisting} \caption{Method generated for \lstinline{int_add}} \label{fig:emit-op-int-add} \end{wrapfigure} The basic idea is turn each row in the table into a method that emits the corresponding CLI instructions: Figure \ref{fig:emit-op-int-add} shows for example the method generated for the \lstinline{int_add} operation, where \lstinline{self.il} is an object of type \lstinline{System.Reflection.Emit.ILGenerator} and it is used to emit the actual CLI bytecode. By exploiting this technique, we were able to considerably cut both the time needed to write the backend and the number of bugs, as the operation table was already well tested in the static CLI backend. However, there were more complex operations for which a simple translation from the operation table is not enough: in those cases, we wrote the corresponding method manually, as for example it is the case of \lstinline{getfield_gc}, shown in Figure \ref{fig:emit-op-getfield-gc}: in this example, it is also possible to see an usage of the \lstinline{FieldDescr} descriptor introduced by Section \ref{sec:descriptors}. \section{Compiling inner bridges} \label{sec:compiling-inner-bridges} Consider Figures \ref{fig:loop-first} and \ref{fig:loop-second}, reported here for convenience as Figures \ref{fig:loop-first-2} and \ref{fig:loop-second-2}. After the trace has been produced, the JIT frontend invokes the backend by calling \lstinline{compile_loop}, which in turns generates a .NET method that contains the code of the loop. Later \emph{guard\_true($t_1$)} (highlighted in blue) becomes hot, and the JIT frontend produces a new trace, which is shown in green in Figure \ref{fig:loop-second-2}. At this point, the frontend calls \lstinline{compile_bridge}. \provideboolean{loop-tree-step2} \begin{figure}[ht] \begin{minipage}[b]{0.35\linewidth} % A minipage that covers half the page \setboolean{loop-tree-step2}{false} \input{loop-tree.tikz} \hfigrule \caption{The trace produced by the code in Figure \ref{fig:loop-code}} \label{fig:loop-first-2} \end{minipage} \hspace{0.02\linewidth} \begin{minipage}[b]{0.63\linewidth} \setboolean{loop-tree-step2}{true} \input{loop-tree.tikz} \hfigrule \caption{The trace tree produced when the guard \newline becomes hot} \label{fig:loop-second-2} \end{minipage} \end{figure} The \emph{x86 backend} implements \lstinline{compile_bridge} in a straightforward way: first, it emits the code for the new trace in memory, then it modifies the machine code for \emph{guard\_true($t_1$)} to jump to the newly generated bridge. However, this implementation strategy does not work with the CLI. As described in Section \ref{sec:immutable-methods}, .NET methods are immutable after they have been compiled, thus, we cannot modify the code of the guard. There are three main alternatives to solve the problem: \begin{description} \item[Recompiling the whole method] Although we cannot modify the original method, we can always create a new one that contains the code for both the original loop and the new inner bridge. The code generated in this way is optimal, as we can directly use the fast \lstinline{br} CLI instruction to jump from the failing guard to the bridge. However, this solution may imply that a considerable ammount of time is wasted for re-compiling code: each time a new bridge is generated, the loop it belongs to must be recompiled together with all its current bridges. In the worst case, the time taken before having a stable loop (i.e. a loop that does no longer need to be recompiled because all its fast paths have been generated) is quadratic on the number of its paths \item[Using a \emph{trampoline}] The CLI does not offer the possibility of augmenting the code of a method after its compilation, but it is possible to simulate this behaviour. The idea is to have a \emph{trampoline} that transfers the execution to a \emph{child method}, which after execution returns a delegate (see Section \ref{sec:delegates}) pointing to the next method to be called by the trampoline, and so on. The \emph{next method} delegates returned by the children are not hard-coded in the bytecode, but stored in some place where they can be modified from the outside: later, we can alter the flow of control by modifying the \emph{next method} delegate that is called when the guard fails to point to the newly created method that implements the bridge. The advantage of this technique is that we can compile new bridges without recompiling the existing loops, while the disadvantage is that it is extremely slow at runtime, because each control transfer involves a lot of operations. The author of this thesis has already explored this solution in \cite{Cuni2009} and \cite{PyPyJIT09}. \item[Using tail calls] Tail calls are described in Section \ref{sec:tail-calls} and offer an alternative to the previous solution: instead of having the control flow bouncing between the trampoline and the children methods, we eliminate the former and directly transfer the execution from one method to the next one with a tail call. Again, the \emph{next method} delegates are stored in some place where they can be modified, so that they can be edited when we compile a new bridge. In an ideal world, tail calls would be the perfect solution for us: they enable incremental compilation of bridges, and if implemented in the right way they are as efficient as local jumps. However, at the moment of writing they are too slow and too buggy to be used in practice when performance matters. \end{description} We chose to implement the first alternative, and to recompile the whole method every time a new bridge is generated. This way, the programs take more time to reach a stable state, but from that point on the code produced is as efficient as we can. A better solution would consist in a mixed approach between the ``recompile'' approach and one of the other two outlined alternative solutions. At first, we could attach bridges using a trampoline or tail calls: then, when we see that the loop is ``stable enough'' and that no more fast path are discovered, we could recompile the whole loop into an efficient method. This approach has not been implemented yet, but we plan to experiment on it as a future work. \subsection{Entry and exit bridges} \label{sec:mutual-recursive-loops} \begin{wrapfigure}{r}{245pt} \input{mutual-recursive-loop.tikz} \hfigrule \caption{Mutual recursive loops} \label{fig:mutual-recursive-loop} \end{wrapfigure} Both entry and exit bridges (see section \ref{sec:loops-bridges}) are executed only once, then they jump to another loop. Since they are created \emph{after} the target loop, implementing them on the CLI poses the same problems as for inner bridges. However, we chose to implement them differently: the rationale is that contrarily to inner bridges, which are supposed to be executed often and thus need to be implemented efficiently, entry and exit bridges are executed only once, then they transfer the execution to a loop which will probably run for a considerable amount of time. Hence, it is perfectly reasonable to use a slower implementation, if it has other advantages. In particular, we implement jumps between bridges and loops as \emph{tail calls}: this way we can avoid to recompile the whole loop tree every time we generate an entry or exit bridge that jumps to it. Note that tail calls are necessary because it is possible to have loops that mutually jump to each other, as shown by Figure \ref{fig:mutual-recursive-loop}. In case we used simple calls instead of tail calls, we would exhaust the stack space after a while. % LocalWords: PyPy runtime toolchain RPython online codewriter jitcodes JIT's % LocalWords: prebuilt frontend backend backends CLI tikz bytecodes VM opcode % LocalWords: bytecode DECR elif opcodes PyPy's MOV goto pc strgetitem Const % LocalWords: SSI lltype ootype callee inline inlining JITs vtable jitcode % LocalWords: callvirt invokevirtual startup InputArgs superclasses metaprogramming