\section{Benchmarks} \label{sec:benchmarks} To measure the performances of the CLI JIT backend, we wrote a simple virtual machine for a dynamic toy language, called \emph{TLC}. The design goal of the language is to be very simple (the interpreter of the full language consists of about 600 lines of RPython code) but to still have the typical properties of dynamic languages that make them hard to compile. TLC is implemented with a small interpreter that interprets a custom bytecode instruction set. Since our main interest is in the runtime performance of the interpreter, we did not implement the whole language, but just its virtual machine. Despite being very simple and minimalistic, \lstinline{TLC} is a good candidate as a language to test our JIT generator, as it has some of the properties that makes most of current dynamic languages (e.g. Python) so slow: \begin{itemize} \item \textbf{Stack based interpreter}: this kind of interpreter requires all the operands to be on top of the evaluation stack. As a consequence programs spend a lot of time pushing and popping values to/from the stack, or doing other stack related operations. However, thanks to its simplicity this is still the most common and preferred way to implement interpreters. \item \textbf{Boxed integers}: integer objects are internally represented as an instance of the \lstinline{IntObj} class, whose field \lstinline{value} contains the real value. By having boxed integers, common arithmetic operations are made very slow, because each time we want to load/store their value we need to go through an extra level of indirection. Moreover, in case of a complex expression, it is necessary to create many temporary objects to hold intermediate results. \item \textbf{Dynamic lookup}: attributes and methods are looked up at runtime, because there is no way to know in advance if and where an object has that particular attribute or method. \end{itemize} In the following sections, we present some benchmarks that show how our generated JIT can handle all these features very well. To measure the speedup we get with the JIT, we run each program three times: \begin{enumerate} \item By plain interpretation, without any jitting (\emph{Interp}). \item With the JIT enabled: this run includes the time spent by doing the compilation itself, plus the time spent by running the produced code (\emph{JIT}). \item Again with the JIT enabled, but this time the compilation has already been done, so we are actually measuring how good is the code we produced (\emph{JIT 2}). \end{enumerate} The columns \emph{Interp/JIT2} and \emph{JIT2/C\#} measure the speedup of the generated code compared to plain interpretation and of the equivalent C\# program compared to the generated code, respectively. Moreover, for each benchmark we also show the time taken by running the equivalent program written in C\#.\footnote{The sources for both TLC and C\# programs are available at: \texttt{http://codespeak.net/$\sim$antocuni/tlc-benchmarks/}} The benchmarks have been run on a machine with an Intel Pentium 4 CPU running at 3.20 GHz and 2 GB of RAM, running Microsoft Windows XP and Microsoft .NET Framework 2.0. \subsection{Arithmetic operations} To benchmark arithmetic operations between integers, we wrote a simple program that computes the factorial of a given number. The algorithm is straightforward, thus we are not showing the source code. The loop contains only three operations: one multiplication, one subtraction, and one comparison to check if we have finished the job. When doing plain interpretation, we need to create and destroy three temporary objects (the results of each operation) at each iteration. By contrast, the code generated by the JIT does much better. At the first iteration, the classes of the two operands of the multiplication go through a flexswitch; then, the JIT compiler knows that both are integers, so it can inline the code to compute the result. Moreover, thanks to escape analysis, it can remove the allocation of all the temporary objects, because they never escape from the inner loop. The same remarks apply to the other two operations inside the loop. As a result, the code executed after the first iteration is close to optimal: the intermediate values are stored as \lstinline{int} local variables, and the multiplication, subtraction and \emph{less-than} comparison are mapped to a single CLI opcode (\lstinline{mul}, \lstinline{sub} and \lstinline{clt}, respectively). Similarly, we wrote a program to calculate the $n_{th}$ Fibonacci number, for which we can do the same reasoning as above. \begin{table}[ht] \begin{center} \begin{tabular}{l|rrrrrr} \multicolumn{5}{c}{\textbf{Factorial}} \\ [0.5ex] \textbf{$n$} & $10$ & $10^7$ & $10^8$ & $10^9$ \\ \hline \textbf{Interp} & 0.031 & 30.984 & N/A & N/A \\ \textbf{JIT} & 0.422 & 0.453 & 0.859 & 4.844 \\ \textbf{JIT 2} & 0.000 & 0.047 & 0.453 & 4.641 \\ \textbf{C\#} & 0.000 & 0.031 & 0.359 & 3.438 \\ \textbf{Interp/JIT 2} & N/A & \textbf{661.000} & N/A & N/A \\ \textbf{JIT 2/C\#} & N/A & \textbf{1.500} & \textbf{1.261} & \textbf{1.350} \\ [3ex] \multicolumn{5}{c}{\textbf{Fibonacci}} \\ [0.5ex] \textbf{$n$} & $10$ & $10^7$ & $10^8$ & $10^9$ \\ \hline \textbf{Interp} & 0.031 & 29.359 & 0.000 & 0.000 \\ \textbf{JIT} & 0.453 & 0.469 & 0.688 & 2.953 \\ \textbf{JIT 2} & 0.000 & 0.016 & 0.250 & 2.500 \\ \textbf{C\#} & 0.000 & 0.016 & 0.234 & 2.453 \\ \textbf{Interp/JIT 2} & N/A & \textbf{1879.962}& N/A & N/A \\ \textbf{JIT 2/C\#} & N/A & \textbf{0.999} & \textbf{1.067} & \textbf{1.019} \\ \end{tabular} \end{center} \caption{Factorial and Fibonacci benchmarks} \label{tab:factorial-fibo} \end{table} Table \ref{tab:factorial-fibo} shows the seconds spent to calculate the factorial and Fibonacci for various $n$. As we can see, for small values of $n$ the time spent running the JIT compiler is much higher than the time spent to simply interpret the program. This is an expected result which, however, can be improved, since so far no effort has been direct to enhance the performance of the compiler itself. On the other, for reasonably high values of $n$ we obtain very good results, which are valid despite the obvious overflow, since the same operations are performed for all experiments. For $n$ greater than $10^7$, we did not run the interpreted program as it would have taken too much time, without adding anything to the discussion. As we can see, the code generated by the JIT can be up to about 1800 times faster than the non-jitted case. Moreover, it often runs at the same speed as the equivalent program written in C\#, being only 1.5 slower in the worst case. The difference in speed is probably due to both the fact that the current CLI backend emits slightly non-optimal code and that the underyling .NET JIT compiler is highly optimized to handle bytecode generated by C\# compilers. As we saw in Section~\ref{sec:flexswitches-cli}, the implementation of flexswitches on top of CLI is hard and inefficient. However, our benchmarks show that this inefficiency does not affect the overall performances of the generated code. This is because in most programs the vast majority of the time is spent in the inner loop: the graphs are built in such a way that all the blocks that are part of the inner loop reside in the same method, so that all links inside are internal (and fast). \subsection{Object-oriented features} To measure how the JIT handles object-oriented features, we wrote a very simple benchmark that involves attribute lookups and polymorphic method calls. Since the TLC assembler source is long and hard to read, figure~\ref{fig:accumulator} shows the equivalent program written in an invented Python-like syntax. \begin{figure}[h] \begin{center} \begin{lstlisting} def main(n): if n < 0: n = -n obj = new(value, accumulate=count) else: obj = new(value, accumulate=add) obj.value = 0 while n > 0: n = n - 1 obj.accumulate(n) return obj.value def count(x): this.value = this.value + 1 def add(x): this.value = this.value + x \end{lstlisting} \caption{The \emph{accumulator} example, written in a invented Python-like syntax} \label{fig:accumulator} \end{center} \end{figure} The two \lstinline{new} operations create an object with exactly one field \lstinline{value} and one method \lstinline{accumulate}, whose implementation is found in the functions \lstinline{count} and \lstinline{add}, respectively. When calling a method, the receiver is implicity passed and can be accessed through the special name \lstinline{this}. The computation \emph{per se} is trivial, as it calculates either $-n$ or $1+2...+n-1$, depending on the sign of $n$. The interesting part is the polymorphic call to \lstinline{accumulate} inside the loop, because the interpreter has no way to know in advance which method to call (unless it does flow analysis, which could be feasible in this case but not in general). The equivalent C\# code we wrote uses two classes and a \lstinline{virtual} method call to implement this behaviour. As already discussed, our generated JIT does not compile the whole function at once. Instead, it compiles and executes code chunk by chunk, waiting until it knows enough information to generate highly efficient code. In particular, at the time it emits the code for the inner loop it exactly knows the type of \lstinline{obj}, thus it can remove the overhead of dynamic dispatch and inline the method call. Moreover, since \lstinline{obj} never escapes the function, the allocation is avoided and its field \lstinline{value} is stored as a local variable. As a result, the generated code turns out to be a simple loop doing additions in-place. \begin{table}[ht] \begin{center} \begin{tabular}{l|rrrrrr} \multicolumn{5}{c}{\textbf{Accumulator}} \\ [0.5ex] \textbf{$n$} & $10$ & $10^7$ & $10^8$ & $10^9$ \\ \hline \textbf{Interp} & 0.031 & 43.063 & N/A & N/A \\ \textbf{JIT} & 0.453 & 0.516 & 0.875 & 4.188 \\ \textbf{JIT 2} & 0.000 & 0.047 & 0.453 & 3.672 \\ \textbf{C\#} & 0.000 & 0.063 & 0.563 & 5.953 \\ \textbf{Interp/JIT 2} & N/A & \textbf{918.765} & N/A & N/A \\ \textbf{JIT 2/C\#} & N/A & \textbf{0.750} & \textbf{0.806} & \textbf{0.617} \\ \end{tabular} \end{center} \caption{Accumulator benchmark} \label{tab:accumulator} \end{table} Table \ref{tab:accumulator} show the results for the benchmark. Again, we can see that the speedup of the JIT over the interpreter is comparable to the other two benchmarks. However, the really interesting part is the comparison with the equivalent C\# code, as the code generated by the JIT is up to 1.62 times \textbf{faster}. Probably, the C\# code is slower because: \begin{itemize} \item The object is still allocated on the heap, and thus there is an extra level of indirection to access the \lstinline{value} field. \item The method call is optimized through a \emph{polymorphic inline cache} \cite{hoelzle_optimizing_1991}, that requires a guard check at each iteration. \end{itemize} Despite being only a microbenchmark, this result is very important as it proves that our strategy of intermixing compile time and runtime can yield to better performances than current techniques. The result is even more impressive if we take in account that dynamically typed languages as TLC are usually considered much slower than the statically typed ones.