\chapter{Tracing JITs in a nutshell} \label{cha:tracing-jits} \section{What is a tracing JIT compiler?} \label{sec:what-is-a-tracing-jit} There are two general well known rules to consider when tackling the problem of compiler optimization: \begin{itemize} \item The \textbf{Pareto principle}\cite{pareto}, or \emph{80-20 rule}: the 20\% of the program will account for the 80\% of the runtime. In other words, when executing a program most of the time will be spent in few regions, called \emph{hot-spots}. \item The \textbf{Fast Path principle}\cite{Hoschka93controlflow}: it is usually possible to split expensive operations into a \emph{fast path} and a \emph{slow path}: the former handles the most common cases and, therefore, is expected to be executed for most time and has to be highly optimized, whereas the latter deals with the remaining cases and is not required to be particularly efficient. \end{itemize} \emph{Tracing JIT compilers} are designed to exploit these rules to produce better and more efficient code. In particular, they are built on the following basic assumptions: \begin{itemize} \item programs spend most of their runtime in loops \item several iterations of the same loop are likely to take similar code paths \end{itemize} The basic approach of a tracing JIT is to generate machine code only for the hot code paths of most frequently executed loops and to interpret the rest of the program. The code for those common loops however is highly optimized, including aggressive inlining of the fast paths. Tracing optimizations were initially explored by the Dynamo project \cite{bala_dynamo:transparent_2000} to dynamically optimize machine code at runtime. Its techniques were then successfully exploited to implement a JIT compiler for a Java VM \cite{gal_hotpathvm:effective_2006, andreas_gal_incremental_2006}. Subsequently these tracing JITs were discovered to be a relatively simple way to implement JIT compilers for dynamic languages \cite{mason_chang_efficient_2007}. The technique is now being used by both Mozilla's TraceMonkey JavaScript VM \cite{andreas_gal_trace-based_2009} and has been tried for Adobe's Tamarin ActionScript VM \cite{chang_tracing_2009}. This chapter gives an overview of how a typical tracing JIT compiler works: it is not meant to be an exhaustive survey on \emph{all} the various strategy adopted by known implementations, but it is useful to introduce some key concepts that are needed to understand the next chapter, which explains how the PyPy JIT compiler works. \section{Phases of execution} \label{sec:phases-of-execution} Typically, a tracing virtual machine goes through various phases when executing a program: \begin{description} \item [Interpretation] When the program starts, everything is interpreted. The interpreter behaves like a standard one, except for the addition of some lightweight code to profile the execution, and in particular to establish which loops are run most frequently. To detect loops, it associates a counter to each backward jump, which is incremented every time the jump is taken. When the counter hits a certain threshold, the VM enters the \emph{tracing} phase. \item [Tracing] During this phase, the interpreter records a history of all the operations it executes. It traces until it has recorded the execution of one iteration of the hot loop\footnotemark. To decide when this is the case, the trace is repeatedly checked as to whether the interpreter is at a position in the program where it had been earlier. The history recorded by the tracer is called a \emph{trace}: it is a list of operations, together with their actual operands and results. The trace is passed to the JIT compiler. \footnotetext{In case the trace becomes too long and its length exceeds a certain threshold, tracing is aborted and the control returns to the interpreter. This can occur e.g. in the unlucky case in which the iteration of the loop that is being traced is the last, and thus the backward jump is never taken.} \item[Compilation] The job of the JIT compiler is to turn a trace into efficient machine code. The generated machine code is immediately executable, and can be used in the next iteration of the loop. \item[Running] In this phase, the code generated by the JIT compiler is executed. Remind that a trace represent a \emph{loop}, thus the code runs repeatedly until some exit condition is triggered. When the loop exits, the VM either goes back into the \emph{interpretation} mode or transfers the control to another already compiled piece of code. \end{description} Being sequential, the trace represents only one of the many possible paths through the code. To ensure correctness, the trace contains a \emph{guard} at every possible point where the path could have followed another branch, for example at conditions and indirect or virtual calls. When generating the machine code, every guard is turned into a quick check to guarantee that the path we are executing is still valid. If a guard fails, we immediately quit the machine code and continue the execution by falling back to interpretation. Depending on the tracing technique, every guard might have an associated \emph{guard failure counter}, similar to the loop counter seen above. The failure counter is incremented every time the guard fails, until it hits a certain threshold and the guard becomes \emph{hot}. Then, we assume that the particular code path which starts from the guard failure is executed often enough to be worth compiling, and start tracing again from there. See Section \ref{sec:trace-trees} for more details. Figure \ref{fig:tracing-phases} summarize the transitions between the various phases of a tracing VM. \begin{figure}[ht] \centering \input{tracing_phases.tikz} \caption{Various phases of a tracing VM} \label{fig:tracing-phases} \end{figure} \section{A deeper look at traces} \label{sec:traces} In the previous section we have seen that a trace contains a history of all the operations executed, plus some \emph{exit conditions} that are called \emph{guards}. But, which kind of operations are traced? And what is a guard, exactly? Giving a precise answer is difficult, as it depends on the details of the implementation: however, we can abstract some properties that are always valid. To understand which kind of operations are traced, we need to recall that \textbf{tracing} is performed by an interpreter: every time the interpreter executes the next instruction, it also add an entry to the trace. Thus, we can think of it as an unrolled version of the original code, where the operations recorded in the history correspond to the opcodes encoded in the bytecode. Moreover, for each operation we also record its operands, as well as its result. For each of them, we record both the origin (e.g. if it is the result of a previous operation) and its actual value (e.g., $42$). Thus, a typical operation in the trace could look like \lstinline{c[42] = add(a[30], b[12])}, where $a$, $b$ and $c$ are variables, whose value at the moment of execution is indicated inside the square brackets. Traces are meant to be \emph{linear}, i.e. a sequential list of all the operations that were actually executed, no matter where they come from. Thus, unconditional jumps are not recorded: instead, we simply continue the execution and record the subsequent operations. The same principle applies to \emph{direct calls}, i.e. all calls in which the target function is statically known: the tracer does not record the call itself, but the operations executed inside the called function: this way, we automatically get the effect of \emph{inlining}. As long as we only encounter unconditional jumps, there is only one possible code path to be taken, so we are sure that all the subsequent iterations of the loop will follow the same one. Obviously, this is no longer valid as soon as we execute a \emph{conditional} jump, i.e. a jump whose target destination depends on the runtime value of some condition. To ensure a correct behaviour each time a conditional jump is encountered, a \emph{guard} is checked to ensure that the condition is still valid, and thus that the code path which is being executed is the correct one. The same technique is used to follow calls whose target is not statically known. In particular, for calls through a function pointer, which we call \emph{indirect calls},the actual \emph{value} of the pointer is guarded, while for \emph{virtual method calls} the guard is on the exact \emph{class} of the receiver object. \section{Example of tracing} \label{sec:tracing-example} In this section, we examine a concrete example of tracing to see how it works and how the resulting trace might look like. Figures \ref{fig:src-tracing-example} and \ref{fig:bytecode-tracing-example} show the source code and the corresponding bytecode for a simple programs that contains a frequently executed loop, a virtual method call and a conditional branching instruction. We chose to use Java for these examples because it is a widely known programming language and because a lot of research on tracing JIT compilers has been done for the JVM. Moreover, by using Java we emphasize that this chapter describes the general techniques about tracing JIT compilers and not theirs implementation in PyPy, which is the subject of Chapter \ref{cha:pypy-jit}. \lstset{ language=Java, } \begin{figure}[t] \begin{lstlisting} interface Operation { int DoSomething(int x); } class IncrOrDecr implements Operation { public int DoSomething(int x) { if (x < 0) return x-1; else return x+1; } } class tracing { public static void main(String argv[]) { int N = 100; int i = 0; Operation op = new IncrOrDecr(); while (i < N) { i = op.DoSomething(i); } System.out.println(i); } } \end{lstlisting} \caption{Source code of the tracing example} \label{fig:src-tracing-example} \end{figure} \begin{figure}[p] \begin{lstlisting} class IncrOrDecr { ... public DoSomething(I)I ILOAD 1 IFGE LABEL_0 ILOAD 1 ICONST_1 ISUB IRETURN LABEL_0 ILOAD 1 ICONST_1 IADD IRETURN } class tracing { ... public static main([Ljava/lang/String;)V ... LABEL_0 ILOAD 2 ILOAD 1 IF_ICMPGE LABEL_1 ALOAD 3 ILOAD 2 INVOKEINTERFACE Operation.DoSomething (I)I ISTORE 2 GOTO LABEL_0 LABEL_1 ... } \end{lstlisting} \caption{Bytecode of the tracing example} \label{fig:bytecode-tracing-example} \end{figure} For the purposes of this example, we set the \emph{hot loop threshold} to 3. Thus, the \textbf{tracing} phase begins at the fourth iteration of the loop, at line 19 in the Java code and 46 in the bytecode. Figure \ref{fig:trace-tracing-example} shows the recorded trace: to ease the reading, it shows the individual JVM instructions which are actually recorded side by side with the corresponding Java source code. We can see that most instructions are recorded as they are executed: however, few of them are not recorded at all, or have been replaced with something different. In particular: \begin{itemize} \item \emph{Conditional jumps} are replaced by \emph{guards} that check the expected condition is true. It is interesting to note that conditional jumps that are expected \textbf{not} to be taken are replaced by a guard that checks the \emph{opposite} condition: in the example, \lstinline{IF_ICMPGE} (``jump if great or equal'') is replaced by \lstinline{GUARD_ICMPLT} (``check less than'') \item \emph{unconditional jumps} (\lstinline{IRETURN} and \lstinline{GOTO} in the example) are just removed from the trace \item \emph{virtual method calls} are replaced by a guard that checks the exact class of the receiver, and the content of the method is just appended to the trace. In the example, the \lstinline{INVOKEINTERFACE} to \lstinline{DoSomething} is turned into a \lstinline{GUARD_CLASS(IncrOrDecr)}, then the body of \lstinline{IncrOrDecr.DoSomething} is recorded into the trace \end{itemize} \begin{figure}[t]\footnotesize \centering \begin{tabular}{l|l|lr} \textbf{Method} & \textbf{Java code} & \textbf{Trace} & \textbf{Value} \\ \hline \emph{Main} & \texttt{while (i < N) \{} & \texttt{ILOAD 2} & $3$ \\ & & \texttt{ILOAD 1} & $100$ \\ & & \sout{\texttt{IF\_ICMPGE LABEL\_1}} & $false$ \\ & & \emph{\texttt{GUARD\_ICMPLT}} & \\ \cline{2-4} & \texttt{\ \ i = op.DoSomething(i);} & \texttt{ALOAD 3} & \emph{IncrOrDecr} obj \\ & & \texttt{ILOAD 2} & $3$ \\ & & \sout{\texttt{INVOKEINTERFACE ...}} & \\ & & \emph{\texttt{GUARD\_CLASS(IncrOrDecr)}} & \\ \hline \emph{DoSomething} & \texttt{\ \ \ \ if (x < 0)} & \texttt{ILOAD 1} & $3$ \\ & & \sout{\texttt{IFGE LABEL\_0}} & $true$ \\ & & \emph{\texttt{GUARD\_GE}} & \\ \cline{2-4} & \texttt{\ \ \ \ \ \ return x+1;} & \texttt{ILOAD 1} & $3$ \\ & & \texttt{ICONST 1} & $1$ \\ & & \texttt{IADD} & $4$ \\ & & \sout{\texttt{IRETURN}} & \\ \hline \emph{Main} & \texttt{\ \ i = op.DoSomething(i);} & \texttt{ISTORE 2} & $4$ \\ \cline{2-4} & \texttt{\}} & \sout{\texttt{GOTO LABEL\_0}} & \\ \end{tabular} \vspace{0.2cm} \caption*{\emph{ \sout{\texttt{INSTR}}: Instruction executed but not recorded \\ \emph{\texttt{INSTR}}: Instruction added to the trace but not executed }} \hfigrule \caption{Trace produced by the code in Figure \ref{fig:src-tracing-example}} \label{fig:trace-tracing-example} \end{figure} \section{Generalization of the trace and code generation} Once the trace has been completed, we can finally compile it. The goal of the \textbf{compilation} phase is to generate efficient machine code that can be used to execute the next iterations of the loop that we have just traced. Remember that a trace represents a single, concrete iteration of the loop, thus before actually emitting the machine code, we \emph{generalize} the trace in order to make it representing a wider range of the next iterations. As usually happens in most situations, there is a tension between abstraction and efficiency: \begin{itemize} \item if we generalize too much, we might end up with sub-optimal machine code \item if we don't generalize enough, we might end up with code bloats, as the code generated is too specialized and cannot be reused for some or most of the subsequent iterations of the loop \end{itemize} Sometimes, finding the right generalization is easy. Consider for example the trace above in Figure \ref{fig:trace-tracing-example}: most of the operands of the instructions are concrete integers (e.g., $1$, $3$, $4$), thus we can generalize them to variables of type \lstinline{int} and be sure that the generated code is optimal for all the next iterations. However, sometimes the job is more complex, especially if we have complex inheritance trees: for example, if some concrete value seen in the trace is an instance of, say, class \lstinline{A}, we might want to generate code that accepts only \lstinline{A} instances, or we might want to generate code that is more reusable and accepts also instances of some superclass of \lstinline{A}. The concrete strategy to solve this problem vary between each implementation of a tracing JIT. In Section \ref{sec:virtuals} we will see how PyPy deals with it. \section{Linear traces vs. trace trees} \label{sec:trace-trees} So far, we have only seen examples of linear traces, but there are situations where the generated code is not sufficiently efficient. For the next example, we consider the loop in Figure \ref{fig:tree-example-src} and its bytecode in Figure \ref{fig:tree-example-bytecode}, and assume that the \emph{hot loop threshold} is again set to 3. The \textbf{tracing} phase begins when $i==3$, and produces the trace shown in Figure \ref{fig:tree-example-trace}. As we can see, the concrete iteration of the loop follows the path through the \lstinline{else} clause of the \lstinline{if}, guarded by the \lstinline{GUARD_NE} condition. \begin{figure} \centering \begin{minipage}[t]{0.48\linewidth} % A minipage that covers half the page \hfigrule \begin{lstlisting}[frame=none,showlines=true] public static void trace_trees() { int a = 0; int i = 0; int N = 100; while(i < N) { if (i%2 == 0) a++; else a*=2; i++; } } \end{lstlisting} \hfigrule \caption{Tracing tree example} \label{fig:tree-example-src} \end{minipage} \hspace{0.02\linewidth} \begin{minipage}[t]{0.48\linewidth} \hfigrule \begin{lstlisting}[frame=none] public static void trace_trees() { ... LABEL_0 ILOAD 1 ILOAD 2 IF_ICMPGE LABEL_1 ILOAD 1 ICONST_2 IREM IFNE LABEL_2 IINC 0 1 GOTO LABEL_3 LABEL_2 ILOAD 0 ICONST_2 IMUL ISTORE 0 LABEL_3 IINC 1 1 GOTO LABEL_0 LABEL_1 RETURN } \end{lstlisting} \hfigrule \caption{Bytecode for \texttt{trace\_trees}} \label{fig:tree-example-bytecode} \end{minipage} \end{figure} \begin{figure}[p]\footnotesize \centering \begin{tabular}{l|l|lr} \textbf{Method} & \textbf{Java code} & \textbf{Trace} & \textbf{Value} \\ \hline \emph{trace\_trees} & \texttt{while (i < N) \{} & \texttt{ILOAD 1} & $3$ \\ & & \texttt{ILOAD 2} & $100$ \\ & & \sout{\texttt{IF\_ICMPGE LABEL\_1}} & $false$ \\ & & \emph{\texttt{GUARD\_ICMPLT}} & \\ \cline{2-4} & \texttt{\ \ \ \ if (i\%2 == 0)} & \texttt{ILOAD 1} & $3$ \\ & & \texttt{ICONST\_2} & $2$ \\ & & \texttt{IREM} & $1$ \\ & & \sout{\texttt{IFNE LABEL\_2}} & $true$ \\ & & \emph{\texttt{GUARD\_NE}} & \\ \cline{2-4} & \texttt{\ \ \ \ \ \ \ \ a*=2;} & \texttt{ILOAD 0} & $3$ \\ & & \texttt{ICONST\_2} & $2$ \\ & & \texttt{IMUL} & $6$ \\ & & \texttt{ISTORE 0} & $6$ \\ \cline{2-4} & \texttt{\ \ \ \ i++;} & \texttt{IINC 1 1} & $4$ \\ \cline{2-4} & \texttt{\}} & \sout{\texttt{GOTO LABEL\_0}} & \\ \hline \end{tabular} \caption{Trace produced by the code in Figure \ref{fig:tree-example-src}} \label{fig:tree-example-trace} \end{figure} However, it is clear that while \textbf{running} the guard will succeed only for the odd iterations of the loop: when it fails, the VM switches back to the \textbf{interpretation} phase (following the \emph{cold guard failed} arrow in Figure \ref{fig:tracing-phases}), until the end of the loop. Then, when it is time to reenter the loop, it reuses the already compiled machine code, switching again to the \textbf{running} phase (\emph{entering compiled loop}) Such a behaviour is far less than optimal, because we need to interpret half of the iterations through the loop, and moreover the switch between the two phases produces some overhead. This happens because one of the original assumptions, the \emph{Fast Path Principle} (see Section \ref{sec:what-is-a-tracing-jit}) has been violated: in this case, there is not just \emph{one} fast path, but two that are equally important. To deal with such cases in an efficient way, we need to modify the data structure that contains the traces, in order to support not only linear traces, but also \emph{trees} of traces: if we observe that a guard fails often enough, it becomes \emph{hot} and we start tracing again from there (the arrow \emph{guard failure $\rightarrow$ hot} in Figure \ref{fig:tracing-phases}): then, we compile the trace and teach the existing machine code to call the new one when that particular guard fails in the future (i.e., to follow the \emph{hot guard failed} arrow). This way, after an intial phase of warm-up, all the hot paths ends up being compiled, and thus the loop can be executed entirely in machine code. Moreover, this technique allows the VM to automatically adapt itself to the behaviour of the program, in case it varies during the execution, because if a cold path starts to be executed very often, it will be soon or later turned into a hot path and optimized as well. However, implementing this technique in a virtual machine like the CLI and the JVM is not easy at all, as it does not offer direct support to modify the behaviour of existing code. As we will see in Section \ref{sec:compiling-inner-bridges}, this problem that we had to face during the development of the CLI backend is one of the most challenging. \lstset{ language=Python, } % LocalWords: runtime PyPy bytecode superclass CLI backend