\chapter{Characterization of the target platform} \label{cha:target-vm} From an implementation point of view, this thesis is about a dynamic JIT compiler for the .NET platform, and in particular for its \emph{Virtual Machine} (VM), the CLI. However, our research is not strictly limited to the CLI, but it is potentially applicable to all the VMs which are similar enough. What does \emph{``similar enough''} mean? Defining the concept is hard, because even a small difference about a particular feature of the VM can have a big impact on the implementation strategy. On the other hand, some of the solutions proposed in this thesis can be applied equally well to virtual machines such as e.g. \emph{LLVM}, whose underlying philosophy is very different from the CLI's one, but still share some commonality. LLVM \cite{Lattner04llvm:a} stands for Low Level Virtual Machine and, as the name suggests, it is a virtual machine whose instruction sets is particularly close to the metal, contrarily to object oriented virtual machines whose instruction set is more high level. This is especially true when speaking about performance: when targeting a VM, it is hard to accurately predict the performance behavior of the compiled programs, as they are going to be executed on a number of different implementations of the virtual machine. For example, .NET programs can be run either under the CLR, i.e. the original implementation from Microsoft, or under Mono, an alternative open source implementation. Moreover, all the implementations rapidly evolve over time, each new version providing slightly different performance of each feature or construct of the VM. However, for languages implementors, this is far from being ideal: during the development of a compiler, they need to take a huge number of decisions about which constructs to use and which not to use to implement each particular feature, but their assumption might not be valid for alternative implementations or newer/older versions of the VM. As we will see next in this chapter, this is especially true for more esoteric features or for constructs used in a slightly different way than they have been designed. \section{Dynamic loading of new bytecode} \label{sec:dynamic-generation} By definition, a JIT compiler emits new code during the execution of the program: thus, the first and most important requirement that a VM needs to fulfill in order to apply this research is the ability of generating and loading new bytecode at runtime, although the exact details vary between the CLI and the JVM. For the CLI, the standard library of the .NET Framework, provides the necessary tools to generate and load single methods, in the namespace \lstinline{System.Reflection.Emit}: in particular, by instantiating the class \lstinline{DynamicMethod} it is possible to create new methods that are not bound to any particular class. On the other hand, in the JVM the minimum unit of loading is the \emph{class}: by writing a \emph{custom classloader}, it is possible to generate and load new classes, and hence new methods, on the fly. Moreover, there are external libraries such as \emph{ASM}\cite{asm} and \emph{BCEL}\cite{bcel} that simplify the task of generating and loading these classes. \section{JIT layering} \label{sec:jit-layering} Another important feature shared by the CLI and the JVM is the presence of a JIT compiler that translates the intermediate code into executable machine code\footnote{The specifications of both the CLI and the JVM does not mandate the presence of a JIT compiler, which should be considered an implementation detail. However, all the current implementations of the CLI employ a JIT compiler, as well as all the most popular ones of the JVM.}. For the cases we are analyzing, the JIT compiler generated by the PyPy translation toolchain emits code in the intermediate format proper of the hosting virtual machine. Then, this intermediate code is in turn compiled into executable machine code by the JIT compiler of the VM itself. Thus, before being translated to executable machine code, the source code of our programs passes through two different JIT compilation \emph{layers}: \begin{itemize} \item the \textbf{high level layer}, implemented by the PyPy translation toolchain \item the \textbf{low level layer}, implemented by the VM itself \end{itemize} \textbf{JIT layering} is a novel concept introduced with this thesis. In theory, if the low level JIT compiler were good enough, it could produce optimal code for whatever construct it encounters; in practice however, this rarely happens because either the low level JIT does not employ advanced techniques (as it is the case of the CLI), or it cannot have a deep understanding of the language semantics, thus missing a lot of optimization opportunities (as proved by the current implementations of dynamic languages for the JVM). By adding an additional JIT compilation layer, specialized for a specific high-level language, much better and efficient code can be generated without modifying the underlying VM. This has several advantages: \begin{itemize} \item since the high level JIT compilers do not touch any of the internals of the VM, it is automaticaly portable across multiple implementations of the VM itself \item usually, the existing VMs and their corresponding JITs are very complex pieces of software, hard to modify: by writing our JIT on top of that, we avoid this problem; moreover, this way it is much easier to experiment with new features \item for the same reason, our approach is the only viable solution in cases we do not have access to the codebase of the VM, as in the case of Microsoft .NET \end{itemize} The main drawback of this approach is that the high level JIT compiler adds a overhead: if on the long run the time spent in the compiler is negligible compared to the time saved by running the optimized code instead of the non optimized one, in the short run programs could be slower. As we will show in Chapter \ref{cha:benchmarks}, this thesis proves that this approach is effective, and that the resulting implementation of the language can be much faster than a more ordinary implementation which relies only on the low level JIT. At the same time, we will also see that the overhead of the high level JIT compiler is not worth of in case of short running programs. \section{Immutable methods} \label{sec:immutable-methods} As we seen in Section \ref{sec:dynamic-generation}, both the CLI and the JVM offer the possibility of generating and loading new code at runtime. However once it has been loaded, it is not possibile to modify the code inside methods. In particular, both VMs do not offer any support for incremental lazy compilation. For example, there are cases in which we do not want to (or we cannot) eagerly generate the bytecode for all the possible code paths inside a method: the usual solution is to generate the code only for the hot paths, and delay the generation of the others until they are reached for the first time, or until they have proved to be hot enough to justify the cost of the compilation. Unfortunately, since it is not possible to modify a method, such a strategy cannot be easily implemented. As we will see in Section \ref{sec:compiling-inner-bridges}, this restriction is a serious limitation for language implementors who want to use adaptive techniques, and the proposed solutions (or, better, workarounds) either have a negative impact on the time spent for the high level JIT compilation or on the efficiency of the generated code. \section{Tail calls} \label{sec:tail-calls} On the CLI, we can explicitly mark a method invocation as a \emph{tail call}, assuming that it is in \emph{tail position}\footnote{A call is in tail position if it is immediately followed by a \emph{return} instruction}. Tail calls behaves exactly as normal calls, with the difference that the \emph{call frame} of the caller is removed from the stack and replaced by the call frame of the callee: the result is that we can have an arbitrary deep chain of tail calls without any risk of exhausting the stack, as it would happen with normal calls. This process is called \emph{tail call optimization}. Many functional programming languages such as Lisp, Scheme or Caml implement tail call optimization very efficiently, so that tail calls are as efficient as \emph{gotos} \cite{steele77}. As we will see in Section \ref{sec:compiling-inner-bridges}, the problem of immutable methods could be partly solved by the presence of efficient tail calls. However, this is not the case for the current VMs: \begin{itemize} \item In the current implementations of the CLI tail calls are too slow to be used in practice. In particular, on Microsoft CLR tail calls are about 10 times slower than normal calls, while on Mono they are not even implemented correctly; in either case, they are orders of magnitude slower than a simple \emph{goto}, making the \lstinline{tail.call} instruction unusable for code that needs to be executed often. \item At the moment Java HotSpot does not support tail call optimization (see for example \cite{Schwaighofer09} for a description of a possible implementation). \end{itemize} \subsection{Tail calls benchmark} \needreview Figure \ref{fig:tail-call-src} shows the source code used to benchmark the efficiency of tail calls. Both static methods compute the sum of the first $n$ integers, the first using a loop, the second using tail recursion. Note that, although the code in the figure is written in C\# for clarity, the language does not support the \lstinline{tail.call} instruction, so we had to manually write the algorithm directly in IL bytecode. If tail call optimization is applied correctly, we would expect both versions to have about the same performance. However, this is not the case: on Mono, the recursive version is about 2.3 times slower than the loop, while on the CLR it is about 18.3 times slower. Moreover, the Mono implementation of tail calls is known to be buggy\cite{bug-mono}: each tail call leaks a bit of stack space, with the result that a deeply nested calls might result in a stack overflow. As we will see in Section \ref{sec:macrobench}, this might trigger a crash in the code generated. \begin{figure}[t] \begin{lstlisting}[language=Java] public static int loop(int n) { int sum = 0; while(n>0) { sum += n; n--; } return sum; } public static int tailcall(int n, int sum) { if (n < 0) return sum; // note: C# does not support the tail.call instruction return tailcall(n-1, sum+n); } \end{lstlisting} \caption{Tail call benchmark} \label{fig:tail-call-src} \end{figure} \section{Primitive types vs reference types} Both the CLI and the JVM support values of primitive types and of reference types\footnote{Actually, the CLI also supports value types, enumerations, generic types, etc., but they outside the scope of this paragraph.}: \begin{itemize} \item \emph{primitive types} include numeric values, such as integers and floating point numbers of various sizes \item \emph{reference types} include objects, that is, class instances \end{itemize} Although values of primitive types are not objects, it is possible to convert them through the mechanism of \emph{boxing}: for each primitive type, there is a corresponding boxed type which wraps the value into a proper object. Once you have a boxed object, it is possible to get its original value by \emph{unboxing} it. Unfortunately, arithmetic operations between boxed values are much slower than between primitive values, due to the extra level of indirection and to the fact that it is necessary to allocate a new object on the heap to hold every intermediate result. Thus, to get high performance it is very important to use primitive types instead of boxed whenever it is possible. \section{Delegates (or: object oriented function pointers)} \label{sec:delegates} In the CLI, a \emph{delegate} is a special kind of object that wraps a method (either instance or static): once created, the delegates can be freely stored and passed around, just as normal objects. The only method exposed by delegates is \lstinline{Invoke}, which calls the wrapped method. Delegates are type-safe: to instantiate one, we first need to create a \emph{delegate type}, i.e. a class which inherits from \lstinline{System.MulticastDelegate} and specifies its exact signature. Then, the VM can check that the signature of the delegate matches the signature of the method being wrapped. Moreover, it is possible to bind a delegate to an object, which will be automatically passed as the first argument to the method when the delegate is called. This is a limited form of \emph{closure}, which is usually exploited to create delegates that invokes an instance method on a specific object. The JVM does not offer anything similar to delegates natively. However, it is possible to implement them by using the classical \emph{Command} design pattern \cite{Gamma93designpatterns}, i.e. by defining an interface for each signature we are interested in, and a class that implements it for each method we want to invoke. % LocalWords: CLI goto gotos VM implementors LLVM toolchain VMs natively PyPy % LocalWords: runtime DynamicMethod bytecode classloader CLR