\chapter{The PyPy JIT compiler generator} \label{cha:pypy-jit} \section{There is a time for everything} Usually, the lifetime of a program is split into \emph{compile time}, when the source code is translated into executable machine code, and \emph{runtime}, when the machined code is executed. However, this is not the case with PyPy when the JIT is enabled, as the lifetime is split into \emph{four} different steps: \begin{description} \item[Translation time] The step in which the translation toolchain (see Section \ref{sec:pypy-architecture}) generates the actual executable from the RPython source of our interpreter. In this phase, the \emph{JIT compiler generator} runs and generates the tracing JIT compiler for our language. \item[Interpretation time] The step in which the user program\footnotemark\ is run by the interpreter. It comprises both the \emph{Interpretation} and the \emph{Tracing} bubbles in Figure \ref{fig:tracing-phases}. \footnotetext{\needreview In our terminology, the final user is the programmer which write programs in the language.} \item[Compile time] The step in which the generated JIT compiler actually runs. It corresponds to the \emph{Compilation} bubble in Figure \ref{fig:tracing-phases}. \item[Runtime] The step in which the code generated by the JIT compiler is executed. It corresponds to the \emph{Running} bubble in Figure \ref{fig:tracing-phases}. \end{description} From this description, it is clear that the only static step is the \emph{translation time}, which is performed only once whenever the source code of the interpreter is modified. The remaining three steps are all dynamic, i.e. they are active when the user program is running. \section{Architecture of the JIT compiler generator} \label{sec:jit-architecture} Figure \ref{fig:jit-architecture} shows a more detailed view of the architecture. At the top of the picture there is the \emph{interpreter}, which is written in RPython and follows two independent paths through the translation toolchain: the \emph{regular translation} step is always done, and produces an efficient version of the interpreter which is included in the final executable. However the path which is more interesting for our purposes goes through the \emph{codewriter}, and is active only when the JIT is enabled. The \emph{codewriter} translates the source code of the interpreter into intermediate code that will exploited to trace the execution of the user programs. The result is a set of \emph{jitcodes}, one for each RPython function or method that can be traced\footnotemark\ from the main interpreter loop: these \emph{jitcodes} are compiled into the final executable as static data, i.e. prebuilt instances that are already fully initialized when the program starts. We can think of the \emph{jitcodes} as something that expresses the interpreter in the instruction set of a custom virtual machine, used solely for the JIT's internal use. \footnotetext{\needreview Only a subset of all the RPython functions of the interpreter is seen by the JIT, and thus \emph{traceable}: when the tracer encounters a call to a non-traceable function, it inserts a \lstinline{call} operation in the trace, instead of inlining the body.} \needreview It is important to underline that there are three different kinds of intermediate code, which are executed at different levels. In the case of the Python interpreter, there is the Python bytecode, which represents the user program and is executed by the interpreter which is compiled to CLI bytecode. The \emph{jitcodes}, which represents the interpreter itself in a format that is understandable by the JIT frontend to produce CLI bytecode for the hot loops. The CLI bytecode, which is executed by the underlying virtual machine. The tracing phase (see Section \ref{sec:phases-of-execution}) is implemented by the \emph{JIT frontend}, which can be thought of as the custom virtual machine which can execute the \emph{jitcodes}. To draw a parallel with the examples seen in Chapter \ref{cha:tracing-jits}, in PyPy RPython plays the role of Java, the \emph{jitcodes} of the JVM bytecodes, and the JIT frontend is the equivalent of the tracing part of the JVM. During the tracing phase, the frontend records a history of all the operations it executes, i.e. a \emph{trace} (see Section \ref{sec:traces}). Once completed, the trace is then passed to the selected \emph{JIT backend}, which translates it into efficient machine code, which is finally executed at \emph{runtime}. Like the interpreter, the JIT frontend and backends are written in RPython, and are compiled to CLI bytecode. Thus, the final executable is composed of three parts: the interpreter, the JIT frontend and backend, and the static data, which includes the \emph{jitcodes}. \begin{figure}[p] \centering \input{jit_architecture.tikz} \hfigrule \caption{The architecture of the PyPy JIT compiler generator} \label{fig:jit-architecture} \end{figure} \section{Tracing the meta level} \label{sec:tracing-the-meta-level} From the explanation above, it is clear that there is an important difference between the PyPy approach and the usual approach used by the other tracing JIT compilers described in the previous chapter. Consider again the examples in Sections \ref{sec:tracing-example} and \ref{sec:trace-trees}: during the interpretation phase the VM executes the bytecode instructions, which are also executed and recorded while tracing. On the other hand, in PyPy we do not trace directly the user program, but we \emph{trace the interpreter} while executing the user program: in other words, the tracer operates at the meta level. This has already been described by the author of the thesis, together with other components of the PyPy team, in \cite{bolz09}. \needreview Tracing the meta level is the central idea that enables the generation of a JIT compiler for \emph{all} the interpreters written in RPython, because the meta level is precisely what they have in common. As a result, the code is more modular and reusable, because the meta JIT compiler is strictly separated from the language and even unaware of its semantics. Moreover, the code is more portable to new target platforms and architectures, as proved by this thesis and the porting of the JIT to object oriented platforms and to the CLI in particular. To avoid confusions between the two levels, we introduce some terminology to distinguish them. The \emph{tracing interpreter} is used by the JIT to perform tracing. The \emph{language interpreter} executes what we call the user's programs. In the following, we will assume that the language interpreter is bytecode-based. Finally, the \emph{interpreter loops} are those contained in the language interpreter, while \emph{user loops} are those in the user program. \subsection{Applying a tracing JIT to an interpreter} Unfortunately, we cannot directly apply the techniques described in Chapter \ref{cha:tracing-jits} for tracing the simple interpreter defined in Figure \ref{fig:tlr-basic}: while the Pareto principle mentioned in Section \ref{sec:what-is-a-tracing-jit} holds, because the language interpreter spends most of its execution time in the main loop switching to the current instruction to be interpreted (see Section \ref{sec:interp-overhead}), the Fast Path principle does not hold. Indeed, each iteration of the main loop of the language interpreter corresponds to the execution of a single instruction, and it is very unlikely that the interpreter will keep executing the same instruction many times in a row. \begin{figure}[p] \lstinputlisting{code/tlr-paper.py} \caption{A very simple bytecode interpreter with registers and an accumulator.} \label{fig:tlr-basic} \end{figure} An example is given in Figure \ref{fig:tlr-basic}. It shows the code of a very simple bytecode interpreter written in RPython with 256 registers and an accumulator. The \lstinline{bytecode} argument is a string of bytes, all register and the accumulator contains integers. A program for this interpreter that computes the square of the accumulator is shown in Figure \ref{fig:square}. If the tracing interpreter traces the execution of the \lstinline{DECR_A} opcode (whose integer value is 7, hence the \lstinline{guard_value} on \lstinline{opcode0}), the trace would look as in Figure \ref{fig:trace-normal}. PyPy traces are expressed in \emph{Static Single Information Form} (SSI) \cite{ananian99static}, i.e. all variables are assigned only once and gets renamed at branch points. Because of the guard on \lstinline{opcode0}, the code compiled from this trace will be useful only when executing a long series of \lstinline{DECR_A} opcodes. For all the other operations the guard will fail, which will mean that performance is not improved at all. \begin{figure}[ht] \begin{lstlisting} MOV_A_R 0 # i = a MOV_A_R 1 # copy of 'a' # 4: MOV_R_A 0 # i-- DECR_A MOV_A_R 0 MOV_R_A 2 # res += a ADD_R_TO_A 1 MOV_A_R 2 MOV_R_A 0 # if i!=0: goto 4 JUMP_IF_A 4 MOV_R_A 2 # return res RETURN_A \end{lstlisting} \caption{Example bytecode: Compute the square of the accumulator} \label{fig:square} \end{figure} \begin{figure}[ht] \begin{lstlisting}[language=Java] loop_start(a0, regs0, bytecode0, pc0) opcode0 = strgetitem(bytecode0, pc0) pc1 = int_add(pc0, Const(1)) guard_value(opcode0, Const(7)) a1 = int_sub(a0, Const(1)) jump(a1, regs0, bytecode0, pc1) \end{lstlisting} \caption{Trace when executing the \lstinline{DECR_A} opcode} \label{fig:trace-normal} \end{figure} \subsection{Detecting user loops} The key idea to solve this problem is to trace the user loops instead of the interpreter ones. One iteration over a user loop includes many iterations through the main loop of the language interpreter. Thus, to trace a user loop we can \emph{unroll} the main loop of the language interpreter until the user loop is closed. How to detect when a user loop closes? User loops occur when the language interpreter executes twice an instruction stored at the same location. Typically, such location is identified by one or several variables in the language interpreter, for example the bytecode string of the currently executed function of the user program and the position of the current bytecode within that. In the example above, such variables are \lstinline{bytecode} and \lstinline{pc}. The set of all values that represent a particular position in the user program is called \emph{position key}. The tracing JIT does not have any knowledge of how the language interpreter it is implemented, hence it does not know the position key: thus, the author of the language interpreter has to mark the relevant variables with a \emph{hint}. The tracing interpreter will then effectively add the values of these variables to the position key. This means that the loop will only be considered to be closed if the position key assumes the same values twice. Loops found in this way are, by definition, user loops. For obvious efficiency reasons the tracing interpreter does not check the position key after every instruction, but only when it encounters a backward jump. Since the tracing JIT cannot easily identify the code fragments where the interpreter implements backward jumps, the author of the language interpreter has to indicate such fragments with the help of a hint. A similar technique is used to detect \emph{hot user loops}: a counter is associated to each backward branch of the user program and it is incremented every time the branch is actually taken. When the counter reaches a certain threshold, the loop is considered hot. The condition for reusing existing machine code also needs to be adapted to this new situation. In a classical tracing JIT there is no or one piece of assembler code per loop of the jitted program, which in our case is the language interpreter. When applying the tracing JIT to the language interpreter as described so far, \emph{all} pieces of assembler code correspond to the bytecode dispatch loop of the language interpreter. However, they correspond to different paths through the loop and different ways to unroll it. To ascertain which of them to use when trying to enter assembler code again, the program counter of the language interpreter needs to be checked. If it corresponds to the position key of one of the pieces of assembler code, then this assembler code can be executed. This check again only needs to be performed at the backward branches of the language interpreter. \subsection{Applying the hints} \label{sec:applying-the-hints} Let us look at the example interpreter of Figure \ref{fig:tlr-basic} to see where hints would be inserted. Figure \ref{fig:tlr-full} shows the relevant parts of the interpreter which have been annotated with hints. Initially, the class \lstinline{JitDriver} has to be instantiated by listing all the variables of the loop. The variables are classified into two groups: \emph{green} variables and \emph{red} variables. The green variables are those identifying the position key, in this case \lstinline{pc} and \lstinline{bytecode}. All other variables are red. Red variables needs to be explicitly listed for purely technical reasons. \begin{figure}[ht] \lstset{keywords={def,while,if,elif,JitDriver,jit_merge_point,can_enter_jit}} \lstinputlisting{code/tlr-paper-full.py} \caption{Simple bytecode interpreter with hints annotations} \label{fig:tlr-full} \end{figure} In addition to the classification of the variables, there are two methods of \lstinline{JitDriver} that need to be called. The first one is \lstinline{jit_merge_point} which needs to be put at the beginning of the body of the main loop. The other is \lstinline{can_enter_jit}, which is called at the end of any instruction which may implement a backward jump. In the example, only the \lstinline{JUMP_IF_A} instruction can be a backward jump. Here is where the language interpreter performs profiling to decide when to start tracing. It is also the place where the tracing JIT checks whether a loop is closed. That is when the green variables assume values already recorded for an earlier call to \lstinline{can_enter_jit}. In this tiny example a considerable number of lines have to be added for the hints; however, in a real interpreter the number of added lines would be negligible in comparison with the length of the program, as Section \ref{sec:pypy-cli-jit} shows for the case of the Python Interpreter. \begin{figure}[p] \begin{lstlisting}[numbers=left, keywords={loop_start, strgetitem, int_add, guard_value, list_getitem, int_sub, list_setitem, int_is_true, guard_true, jump}] loop_start(a0, regs0, bytecode0, pc0) # MOV_R_A 0 opcode0 = strgetitem(bytecode0, pc0) pc1 = int_add(pc0, 1) guard_value(opcode0, 2) n1 = strgetitem(bytecode0, pc1) pc2 = int_add(pc1, 1) a1 = list_getitem(regs0, n1) # DECR_A opcode1 = strgetitem(bytecode0, pc2) pc3 = int_add(pc2, 1) guard_value(opcode1, 7) a2 = int_sub(a1, 1) # MOV_A_R 0 opcode1 = strgetitem(bytecode0, pc3) pc4 = int_add(pc3, 1) guard_value(opcode1, 1) n2 = strgetitem(bytecode0, pc4) pc5 = int_add(pc4, 1) list_setitem(regs0, n2, a2) # MOV_R_A 2 ... # ADD_R_TO_A 1 opcode3 = strgetitem(bytecode0, pc7) pc8 = int_add(pc7, 1) guard_value(opcode3, 5) n4 = strgetitem(bytecode0, pc8) pc9 = int_add(pc8, 1) i0 = list_getitem(regs0, n4) a4 = int_add(a3, i0) # MOV_A_R 2 ... # MOV_R_A 0 ... # JUMP_IF_A 4 opcode6 = strgetitem(bytecode0, pc13) pc14 = int_add(pc13, 1) guard_value(opcode6, 3) target0 = strgetitem(bytecode0, pc14) pc15 = int_add(pc14, 1) i1 = int_is_true(a5) guard_true(i1) jump(a5, regs0, bytecode0, target0) \end{lstlisting} \caption{Trace when executing the Square function of Figure \ref{fig:square}, with the corresponding bytecodes as comments.} \label{fig:trace-no-green-folding} \end{figure} When executing the Square function of Figure \ref{fig:square}, the profiling will identify the loop in the square function to be hot, and start tracing. It traces the execution of the interpreter running the loop of the square function for one iteration, and the unroll the interpreter loop of the example interpreter eight times. The resulting trace can be seen in Figure \ref{fig:trace-no-green-folding}. The trace is divided into 8 different fragments, one for each opcode that were traced: each fragment begins by fetching the current opcode from \lstinline{bytecode} and incrementing the program counter \lstinline{pc}, then a guard checks the value of the opcode against its numeric value (e.g., \lstinline{MOV_R_A} corresponds to $2$, as can be seen in line 5). Moreover, additional guards are inserted at each branching point: for example, the \lstinline{guard_true} at line 42 corresponds to the \lstinline{if a} in Figure \ref{fig:tlr-full}. \section{Loops, bridges and guards} \label{sec:loops-bridges} So far, we know that a trace is a sequence of jitcode instructions. However, traces are classified in four different kinds (see Figure \ref{fig:loops-and-bridges}) accordingly to the way they are connected together: \begin{figure}[hb] \centering \input{loop-and-bridges.tikz} \hfigrule \caption{Loops and inner, entry and exit bridges} \label{fig:loops-and-bridges} \end{figure} \begin{itemize} \item \textbf{Loop}, drawn in red and yellow: a trace which implicitly jumps back to its top when it finishes. The only way to exit a loop is when a guard fails. \item \textbf{Entry Bridge}, drawn in blue and yellow: a trace which is executed only once, then jumps to a loop. Entry bridges are used to execute ``setup code'' before entering the real loop. See Section \ref{sec:entry-bridges} for more details. \item \textbf{Inner Bridge} or simply \textbf{Bridge}, drawn in green: a trace that starts from a failing guard when it becomes hot. They are useful in case a loop has more than one fast path. When entered, the code in a bridge is executed only once, then it jumps to the original loop. \item \textbf{Exit Bridge}, drawn in grey: like an inner bridge, but instead of jumping back to the original loop, it jumps to another one. Exit bridges share characteristics with both inner and entry bridges: like inner bridges they starts from a guard failure, and like entry bridges they jumps to a different loop at the end. \end{itemize} \begin{wrapfigure}{r}{5cm} \begin{lstlisting} a = 0 i = 0 N = 100 while i < N: if not i % 2: a += 1 else: a *= 2 i += 1 \end{lstlisting} \caption{RPython loop with two equally hot code paths} \label{fig:loop-code} \end{wrapfigure} A loop is linear as long as all its guards are cold. When a guard becomes hot, an inner bridge is generated from there and attached to the guard itself: the idea is that when the guard fails, the execution is directly transferred to this new trace, without any need of exiting the \emph{runtime} phase. Thus, we effectively get a tree of traces, as described in section \ref{sec:trace-trees}. \needreview As an example, take the code listed in Figure \ref{fig:loop-code}: it shows the very same algorithm as in Figure \ref{fig:tree-example-src} written in RPython. Since it is written in RPython it is an \emph{interpreter loop}, following the definition given in Section \ref{sec:tracing-the-meta-level}: it can be thought as the equivalent of an interpreter main loop. For clarity, the hints \lstinline{jit_merge_point} and \lstinline{can_enter_jit} have been omitted. There are two different code paths in the loop, which are both frequently executed and, hence, hot. Suppose that the \emph{hot loop threshold} is set to $3$. The tracing phase starts at the beginning of the fourth iteration, with $i$ equal to $3$: thus, the code will follow that path inside the \emph{else} branch, producing the loop shown in Figure \ref{fig:loop-first}. There are several interesting details to note about that loop: \begin{itemize} \item two \emph{guards} are produced: the first one corresponds to the condition of the \lstinline{if}, the second one to the condition of the \lstinline{while} loop. The conditions are in that order because tracing starts just \textbf{after} the loop has been entered, so the first statement it executes is the \lstinline{if} \item the loop takes two parameters as input, called \emph{input arguments}. Note that the \emph{jump} at the end loops back passing the updated values for $i$ and $a$ \end{itemize} \newboolean{loop-tree-step2} \begin{figure}[ht] \begin{minipage}[b]{0.35\linewidth} \setboolean{loop-tree-step2}{false} \input{loop-tree.tikz} \hfigrule \caption{The trace produced by the code in Figure \ref{fig:loop-code}} \label{fig:loop-first} \end{minipage} \hspace{0.02\linewidth} \begin{minipage}[b]{0.63\linewidth} \setboolean{loop-tree-step2}{true} \input{loop-tree.tikz} \hfigrule \caption{The trace tree produced when the guard \newline becomes hot} \label{fig:loop-second} \end{minipage} \end{figure} Note also that the loop contains the code to handle only the \emph{else} branch of the \lstinline{if}: when at runtime the condition is false, the \emph{guard\_true($t_1$)} (highlighted in blue) fails\footnotemark, and the execution is switched back to the interpretation phase. Eventually the guard becomes hot, so we start tracing from there and attach the newly created bridge to the existing loop, getting the result shown in Figure \ref{fig:loop-second}. The fact that one branch is in the main loop and the other is in the inner bridge is only due to the threshold being odd. It it were set to an even number, the two traces would be swapped. \footnotetext{The original condition is \lstinline{not i \% 2} but the logical negation does not appears in the trace: this is due to an optimization that replaces pairs of \lstinline{bool_not}/\lstinline{guard_false} with \lstinline{guard_true}.} It is important to note that the bridge needs to be compiled and attached to the main loop \emph{after} it has already been compiled and executed for while: thus the machine code must be generated in a way that can be dynamically extended as needed. As we will see in Chapter \ref{cha:cli-backend}, this requirement poses some problems for the CLI JIT backend. \section{Escape analysis and trace optimization} \label{sec:trace-optimizations} \needreview Once the trace has been created, it is passed to the \emph{trace optimizer}, which does constant folding, removes unneeded operations, and more generally produces smaller and more efficient traces. For the most part, it is possible to exploit all the well known and standard techniques that have been developed during many years of research in compilers (see e.g. \cite{Bacon94} for a survey on them). \subsection{Escape analysis} \label{sec:virtuals} \begin{wrapfigure}{r}{8.5cm} \begin{lstlisting} class W_Int: def __init__(self, val): self.val = val def getval(self): return self.val def f(n): obj = W_Int(n) while obj.getval() > 0: nextval = obj.getval() - 1 obj = W_Int(nextval) \end{lstlisting} \caption{Example of \emph{boxed arithmetic}} \label{fig:loop-virtuals-src} \end{wrapfigure} One of the most important optimizations is obtained by performing \emph{escape analysis}\cite{Blanchet99escapeanalysis} \cite{Choi99escapeanalysis} to remove unnecessary allocations of objects. For example, consider the code in Figure \ref{fig:loop-virtuals-src}, which shows a typical pattern that appears in dynamic languages. The \lstinline{W_Int} class represents \emph{boxed integers}: because of the arithmetic operation performed inside the loop, a new temporary \lstinline{W_Int} instance is created at each iteration (see also Section \ref{sec:boxing-ovf}). The unoptimized loop produced during tracing is shown in Figure \ref{fig:loop-virtuals-first}. The loop takes $obj_0$, of type \lstinline{W_Int}, as an input argument. In the first block, the class of $obj_0$ is guarded in order to safely inline the body of \lstinline{getval} (as explained in Section \ref{sec:tracing-example}), then the value for \lstinline{nextval} is computed. Then, the second block allocates a new object of type \lstinline{W_Int} and inlines the body of \lstinline{__init__}, initializing the field \lstinline{val}. Finally, the last block inlines again the body of \lstinline{getval}, then it evaluates the condition of the loop. \newboolean{loop-virtuals-step1} \newboolean{loop-virtuals-step2} \newboolean{loop-virtuals-step3} \begin{figure}[ht] \centering \begin{minipage}[b]{0.48\linewidth} % A minipage that covers half the page \setboolean{loop-virtuals-step1}{true} \input{loop-virtuals.tikz} \hfigrule \caption{Unoptimized trace: \texttt{obj} is allocated} \label{fig:loop-virtuals-first} \end{minipage} \hspace{0.02\linewidth} \begin{minipage}[b]{0.48\linewidth} \setboolean{loop-virtuals-step1}{false} \setboolean{loop-virtuals-step2}{true} \input{loop-virtuals.tikz} \hfigrule \caption{Optimized trace: \texttt{obj} is \emph{virtualized}} \label{fig:loop-virtuals-opt} \end{minipage} \end{figure} A simple static analysis of the trace shows that the object allocated inside the loop \textbf{never escapes} outside. Thus, we can avoid to allocate it, and \emph{explode} its fields into simple local variables. Thus, each \lstinline{setfield_gc} to the object becomes an assignment to the corresponding variable, and accordingly for \lstinline{getfield_gc}. Moreover, operations like \lstinline{guard_class} can be removed as they are statically known to be true. The resulting optimized loop is shown in Figure \ref{fig:loop-virtuals-opt}: note that instead of allocating a new object inside the loop, we reserve the space for all its field as local variables (in this case only one, $obj\_val$, which gets renamed because of the \emph{SSI} form). Accordingly, instead of having one input argument of type \lstinline{W_Int}, we have as many input arguments as its fields: in this case, $obj\_val_0$ is of type \lstinline{int}. Obviously, such an optimized loop works only if \lstinline{obj} is already of type \lstinline{W_Int} \textbf{before} entering the loop. Thus, we say that the loop is \emph{specialized} for that type. If later in the execution of the program we try to enter again the loop but \lstinline{obj} is of a different type than \lstinline{W_Int}, we start tracing again and produce another specialized version of the loop. The objects for which we avoid the allocation are called \emph{virtual instances} and the same technique works also for arrays whose size is statically known at tracing time, which are called \emph{virtual arrays}. Generically, we refers to either virtual instances or arrays with the term \emph{virtuals}. To our knowledge, the JIT compilers generated by PyPy are the only tracing JITs that exploit escape analysis to optimize the generated code. \subsection{Reentering a specialized loop} \label{sec:entry-bridges} Once the specialized loop has been compiled, we need a way to \emph{enter} it. This is not straightforward, as now there is a mismatch between the layer of the interpreter, which gives us a real \lstinline{W_Int} instance, and the layer of the loop, which needs an \emph{unboxed} one. Entering the loop for the first time it is easy: remind that during tracing we have actually \emph{executed} one iteration of the loop shown in Figure \ref{fig:loop-virtuals-first}, so we know the current value of $obj_1.nextval$, which corresponds to $obj\_val_0$ in Figure \ref{fig:loop-virtuals-opt}: therefore we can just enter the specialized loop passing it that value. The loop will run until one guard fails, then the execution of the program continues as usual. However what if we want to re-enter the specialized loop a second time, later in the execution of the program? We still have the mismatch between the \lstinline{W_Int} instance given by the interpreter and the virtualized one expected by the loop, but this time we do not know which value to pass it. One easy solution is to enter again the tracing phase in order to do one (unoptimized) iteration of the loop and compute the value for $obj_1.nextval$, then discard the trace and enter the compiled loop. \begin{figure}[htb] \centering \setboolean{loop-virtuals-step2}{false} \setboolean{loop-virtuals-step3}{true} \input{loop-virtuals.tikz} \hfigrule \caption{The \emph{entry bridge} that jumps to the optimized loop} \label{fig:loop-virtuals-bridge} \end{figure} Clearly computing and discarding a new trace every time a loop is entered is not efficient. Therefore instead of being thrown away, it is compiled into an \emph{entry bridge} (see Section \ref{sec:loops-bridges}) that then jumps directly to the specialized loop, as shown in Figure \ref{fig:loop-virtuals-bridge}. Once it has been compiled, the entry bridge will be reused all the times it is necessary to re-enter the loop. \subsection{Forcing virtuals} Instances and arrays that are allocated inside the loop can be \emph{virtualized} only if they don't \emph{escape} the loop itself. But, what happens if later we add an inner bridge to the loop which contains some operations that make the object escaping? Consider for example the code in Figure \ref{fig:forcing-src}, where \lstinline{external_func} is an arbitrary function that for some reason cannot be traced, and thus inlined, by the JIT. Calling external functions is one of the operations that make an object escaping. \begin{wrapfigure}{r}{8.5cm} \begin{lstlisting} def f(n): obj = W_Int(n) while obj.getval() > 0: if obj.getval() < 10: external_func(obj) nextval = obj.getval() - 1 obj = W_Int(nextval) \end{lstlisting} \caption{Example of virtual which is \emph{forced}} \label{fig:forcing-src} \end{wrapfigure} If we call \lstinline{f} with a large \lstinline{n}, for the first iterations the condition of the \lstinline{if} will never be true, thus the JIT sees a trace which is very similar to the one in Figure \ref{fig:loop-virtuals-first} and \emph{virtualizes} \lstinline{obj}. Figure \ref{fig:loop-virtuals-forcing} shows the resulting loop: the yellow blocks represents the main loop, while the green blocks represents the bridge which is attached to \emph{guard\_false($t_0$)}, when it eventually starts to fail. As long we stay in the main loop, \lstinline{obj} is virtual. However, when we enter the bridge we need a \emph{real} object of type \lstinline{W_Int} to pass to \lstinline{external_func}: thus, a new object is allocated and initialized using the values of its virtual fields, in this case $obj\_val_0$. This process is called \emph{forcing} of virtuals. Once the virtual has been forced, we can call the function and continue the execution as normal. This technique is very important to get good performance, as it allocates objects lazily only when it is really necessary. \begin{figure}[ht] \input{loop-virtuals-forcing.tikz} \hfigrule \caption{Loop that shows how virtuals are \emph{forced}} \label{fig:loop-virtuals-forcing} \end{figure} % LocalWords: PyPy runtime toolchain RPython online codewriter jitcodes JIT's % LocalWords: prebuilt frontend backend backends CLI tikz bytecodes VM opcode % LocalWords: bytecode DECR elif opcodes PyPy's MOV goto pc strgetitem Const % LocalWords: SSI inlined inline inlines virtualized Prolog virtuals % LocalWords: virtualizes