\chapter{Introduction} \section{Python implementations} Python \cite{python} is a highly flexible and open source scripting language which has significantly grown in popularity in the last few years. The reference implementation is called \emph{CPython} and is written in C. The language is implemented by a compiler that translates Python source code into a very high level bytecode, and by a simple virtual machine that interprets the bytecode. Since the beginning, one of CPython goals have been to keep the implementation simple to read, maintain and extend, in order for the language to be easier to mature and evolve. However, one drawback of this approach is that the final performance of the programs are not as good as they could be. Although CPython is the reference implementation, it is not the sole. In particular, over the years four main different alternative implementations emerged: \emph{Jython} \cite{jython}, \emph{IronPython} \cite{ironpython}, \emph{PyPy} \cite{pypy} and \emph{Unladen Swallow} \cite{unladen-swallow}\footnote{These are not the only alternatives to CPython. In particular, there are forks that extend CPython to have new features such as Stackless Python and there are compilers for Python-like languages which aims at better performance. However, they are not relevant for the topic of this thesis.}. \textbf{Jython} is an implementation of Python written in Java and running on the Java Virtual Machine (JVM). Contrarily to CPython, it does not implement an interpreter but instead it compiles Python source code into JVM bytecode. \textbf{IronPython} is the Jython equivalent for the .NET Framework \cite{dotnet}. As Jython, it translates Python source code to bytecode for the CLI \cite{Ecma335}, which is the virtual machine at the heart of the .NET Framework. Due to the differences between the object model of Python and the native object model of the JVM or the CLI, both Jython and IronPython suffer performance problems when they have to deal with some highly dynamic features of Python. Since the language is continuously evolving, it is hard to keep all the alternative implementations synchronized with the reference one. Among the others, this is one of the problems which \textbf{PyPy} aims to solve. PyPy is an implementation of the language written in Python itself: the main idea is to write a high level specification of the interpreter in a restricted subset of Python (called RPython, for Restricted Python), to be translated into lower-level efficient executables for the C/POSIX environment, for the JVM and for the CLI, thus providing a platform independent implementation of Python which can be easily ported by adding the corresponding backends Moreover, the other major goal of PyPy is to develop a Just In Time compiler for Python, in order to improve the performance of the languages. Finally, PyPy is much more than just an implementation of Python: indeed, it is a general framework to implement effectively and efficiently all kinds of dynamic languages for various target platforms. Despite the fact that Jython and IronPython are compilers and not interpreters, their performance is not significantly better than CPython, and often they are even slower. This is partly due to the fact that Python is inherently hard to implement efficiently, and partly because neither the JVM nor the CLI have been designed to implement dynamic languages. The scope of this thesis is to demonstrate that it is possible to write very efficient implementations of Python, and more generally of dynamic languages, on top of these virtual machines, by porting the Just In Time compilation techniques of PyPy to the CLI and the JVM as well. Finally, the last Python implementation that came out in chronological order is \textbf{Unladen Swallow}: differently from the others, Unladen Swallow is not written from scratch but it is a fork of CPython to seek for better performance. As PyPy, Unladen Swallow implements a Just In Time compiler for Python: however, differently than Unladen Swallow this thesis is centered on targeting virtual machines such as the CLI, so the problems we had to face are different. \subsection{Digression: Python 2 vs. Python 3} Since the beginning each new version of Python has always been backward compatible with the preceding ones. This rule has been broken with the advent of Python 3.0, which has been intentionally made incompatible to remove old, unwanted features and introduce new ones which could not be added in a backward compatible manner. At the moment of writing, CPython is the only available implementation of Python 3, while all the other alternative implementations are still targeting Python 2.x. For this reason, in this thesis we talk about Python 2.x: however, all the concepts apply equally well also to Python 3, and to dynamic languages in general. \section{A gentle introduction to Just In Time compilation} \label{sec:exiting-solutions} As \cite{Arnold05} nicely illustrates, over the years the problem of making dynamic languages faster has been tackled in a number of different ways, which can be divided into two large categories: \begin{enumerate} \item writing fast interpreters \item writing optimizing Just In Time (JIT) compilers \end{enumerate} The key idea behind fast interpreters is to reduce at minimum the cost of interpretation overhead. The various techniques to meet this goal are greatly summarized by \cite{Ertl03}: however, they are effective only for those languages for which the cost of interpretation dominates the total execution time. As we will see in Section \ref{sec:interp-overhead}, this is not the case for Python. On the other hand, JIT compilers have proved to be very effective at optimizing different kinds of overhead introduced by dynamic languages. The main characteristic of JIT compilers is that they generate machine code at runtime and only when needed (hence the name, \emph{just in time}): usually the program is represented in form of a bytecode for a virtual machine, which is portable and architecture independent, to be translated to machine code. Depending on the strategy, a JIT compiler can be either the only way to execute the bytecode or it can work side by side with an interpreter. In the latter case, usually the program is interpreted at first, then the JIT compiler optimizes only the \emph{hotspots}, i.e. the parts of the program which are executed very often and thus are responsible for the majority of the time spent. To underline better the differences between the various strategies, we introduce a bit of terminology: \begin{itemize} \item A \textbf{Batch} or \textbf{Ahead Of Time} (AOT) compiler translates the source code (or the bytecode) into machine code before the program starts. \item A \textbf{Just In Time} (JIT) compiler translates the source code (or, more often, the bytecode) into machine code immediately before executing it, possibly caching the result to avoid that the same code fragment is compiled twice. In the subsequent chapters, we will abbreviate the term JIT compiler with ``JIT''. \item A \textbf{Static} compiler uses exclusively the information which can be derived from the source code or from the bytecode to perform the translation process. \item A \textbf{Dynamic} or \textbf{Adaptive} compiler uses also extra information collected at runtime: for example, the code paths which are taken most frequently, the dynamic types of the objects stored in a given variable, and so on. \end{itemize} Obviously, dynamic compilers have better chances to generate more efficient code, since they have more knowledge about the program than their static counterpart. Following this definition, it is also clear that while almost all AOT compilers are static, not all JIT compilers are dynamic. \section{Related work} Historically, the first important contribution to this research area consists in the implementation of the \emph{SELF} language \cite{hlzle_adaptive_1994}, where several proposed novel techniques have been reused and enhanced to implement highly efficient virtual machines such as \emph{HotSpot} for Java \cite{PVC01}. Another interesting project for more recent languages is \emph{Psyco} \cite{rigo_representation-based_2004}: it is an implementation of a dynamic JIT compiler for Python, which generalizes the techniques developed for SELF, in particular the concept of \emph{Polymorphic Inline Cache} (PIC) \cite{PIC91}. It gets very good results, as programs optimized by Psyco can run up to 60 times faster than CPython. However, Psyco has some drawbacks: \begin{itemize} \item While on the surface Python is a clean and simple language, under the hood it has a very complex semantics with many corner cases: thus, developing a compiler which keeps the exact same behaviour as the reference implementation is very hard and time consuming. \item For the same reason, each time the reference implementation evolves, e.g. by adding a new feature, the JIT compiler needs to be updated. Over the time, adding new features become more and more complex, as they may interact each other in various ways. For example, implementing support for \emph{generators}\footnotemark \cite{pep-255} in Psyco took approximately four years, proving that this approach does not scale well as the language changes. \footnotetext{In Python, \emph{generators} are special functions that can be interrupted and resumed at specific points, returning a new value at each resume} \item Because of the way it is designed, Psyco compiles all code that is executed from each of the starting points that are profiled, instead of compiling only the hotspots. As a consequence, it ends up using too much memory, limiting its uses. \end{itemize} The author of the thesis has already explored the implementation of a Psyco-style JIT compiler for the CLI in \cite{Cuni2009} and \cite{PyPyJIT09}: however, the solutions based on Polymorphic Inline Caches rely on the fact that it is possible to modify the code at runtime, after it has already been executed. As Section \ref{sec:immutable-methods} explains, this feature is not supported by the Object Oriented Virtual Machines like the CLI or the JVM and all the solutions to try to overcome this limitation are not particularly efficient. Finally, there is one particular strategy for implementing dynamic compilers called \emph{Tracing JIT compilation}: it was initially explored by the Dynamo project \cite{bala_dynamo:transparent_2000} to dynamically optimize machine code at runtime. Its techniques were then successfully used to implement a JIT compiler for a Java VM \cite{gal_hotpathvm:effective_2006, andreas_gal_incremental_2006}. Subsequently these tracing JITs were discovered to be a relatively simple way to implement JIT compilers for dynamic languages \cite{mason_chang_efficient_2007}. The technique is now being used by both Mozilla's TraceMonkey JavaScript VM \cite{andreas_gal_trace-based_2009} and has been tried for Adobe's Tamarin ActionScript VM \cite{chang_tracing_2009}. Tracing JITs are built on the following basic assumptions: \begin{itemize} \item programs spend most of their runtime in loops \item several iterations of the same loop are likely to take similar code paths \end{itemize} The tracing JIT approach is based on the idea that machine code is generated only for the hot code paths of mostly executed loops, while the rest of the program is interpreted. The code for those mostly executed loops is highly optimized, including aggressive inlining. The difference with more traditional approaches, such as the one implemented by Java HotSpot, is the granularity of JIT compilation: HotSpot compiles whole methods at once, while Tracing JIT compilers compile loops, which can either be contained in a single method, or span over several ones in case of inlining. Chapter \ref{cha:tracing-jits} explains in details how tracing JITs work. \section{Contributions of this thesis} This thesis explores and investigates new techniques to implement in an efficient and effective way dynamic languages on top of object oriented, statically typed virtual machines. In particular, it introduces the concept of \emph{JIT layering} (see Section \ref{sec:jit-layering}): the idea is that the host VM and its low level JIT compiler do not know enough about the dynamic language in question to optimize it in a proper way. Instead, we add a higher layer of JIT compilation that is specific for the language and dynamically emits bytecode for the target VM that is further compiled to by the low level JIT. Instead of writing a JIT compiler by hand, we adapt and use the PyPy JIT compiler generator (see Chapter \ref{cha:pypy-jit}), which automatically generates a JIT compiler from the source code of an interpreter for the given language. It is important to underline that the work done on the PyPy JIT compiler generator it is not solely of the author of the thesis but of the whole PyPy team. To make the PyPy JIT compiler generator working with the CLI virtual machine, we wrote the CLI JIT backend (see \ref{cha:cli-backend}). Finally, we apply the PyPy JIT compiler generator to the PyPy Python Interpreter to get a JIT compiler for Python targeting the CLI, and we measure its performance on a set of benchmarks against IronPython (see Chapter \ref{cha:benchmarks}), the other mainstream and mature implementation of Python for the CLI. \section{Structure of the thesis} The rest of this thesis is structured as follows. Chapter \ref{cha:the-problem} provides an overview on the current implementations of Python and explains why implementing Python efficiently is so hard. Chapter \ref{cha:pypy} introduces PyPy, which is the project which thesis is based on. Chapter \ref{cha:target-vm} presents the target platform, the CLI. It contains a description of the virtual machine, of its peculiarities, and a comparison with the JVM. Chapter \ref{cha:tracing-jits} explains how tracing JIT compilers work in general. Chapter \ref{cha:pypy-jit} explains more in detail the tracing JIT techniques employed by PyPy. In particular, it also explains how the JIT compiler is automatically generated from the source code of the interpreter. Chapter \ref{cha:cli-backend} describes the functioning of CLI JIT backend of PyPy, and the problems that have been encountered during its development. Chapter \ref{cha:benchmarks} explains how the JIT compiler generator has been applied to the PyPy Python Interpreter, and measures its performance when translated with the CLI JIT backend against IronPython. In Chapter \ref{cha:conclusion}, we conclude the work by summarizing the contributions presented in this thesis, describing related works and anticipating future directions of research. % LocalWords: bytecode CPython Jython IronPython PyPy CLI executables POSIX % LocalWords: Stackless Psyco Inline inlining HotSpot hotspots backend