% -*- encoding: utf-8 -*- \chapter{The problem} \label{cha:the-problem} %% - Is Python intrisically slow? %% - Why Python is hard to implement efficiently? %% * dynamic typing %% * dynamic lookup %% * long integers %% * the-world-could-change-under-your-feet problem %% * etc. etc. %% - Interpreters vs. compilers: limits of static analysis for Python %% - Existing solutions %% * fast interpreters (e.g., direct threading techniques --> forth, V8) %% * JIT compilers: SELF, Psyco, Tracemonkey %% - Our solution: (automatically generated) JIT compiler \section{Is Python intrinsically slow?} \label{sec:is-python-slow?} At the moment, there are four main implementations of Python as a language: CPython is the reference implementation, written in C; Jython is written in Java and targets the JVM; IronPython is written in C\# and targets the CLI (i.e., the VM of .NET); finally, PyPy is written in Python itself and targets a number of different platforms, including C, JVM and .NET. In the following, we are discussing CPython, when not explicitly stated otherwise; most of concepts are equally applicable to the other implementations, though. Running Python code is slow; depending on the benchmark used, Python programs can run up to 250 times slower than the corresponding programs written in C. There are several reasons that make Python so slow; here is a rough attempt to list some of what we think are the most important ones: \begin{enumerate} \item Interpretation overhead \item Boxed arithmetic and automatic overflow handling \item Dynamic dispatch of operations \item Dynamic lookup of methods and attributes \item ``\emph{The world can change under your feet}'' \item Extreme introspective and reflective capabilities \end{enumerate} \subsection{Interpretation overhead} \label{sec:interp-overhead} In CPython, Python programs are compiled into bytecode which is then interpreted by a virtual machine. The interpreter consists of a \emph{main interpreter loop} that fetches the next instruction from the bytecode stream, decodes it and then jumps to the appropriate implementation: this is called \emph{instruction dispatching}. Compared to emitting directly executable machine code, the extra work needed for instruction dispatching adds some overhead in terms of performance, which is called \emph{interpretation overhead}. The interpretation overhead strongly depends on the nature of the bytecode language: for languages whose bytecode instructions are fast to execute the time spent to do instruction dispatching is relatively higher than for languages whose instructions take more time. For Python, it has been measured that the interpretation overhead causes roughly a 2.5x slowdown compared to executing native code\cite{D12.1}. Both Jython and IronPython do not suffer from this problem, since they emit bytecode for the underlying virtual machine (either JVM of CLI), which is then translated into native code by the Just In Time compiler. \subsection{Boxed arithmetic and automatic overflow handling} \label{sec:boxing-ovf} In Python all values, including numbers, are objects; the default type for signed integers is \lstinline{int}, which represents native integers (i.e., numbers whose length is the same as a WORD, typically 32 or 64 bit). Moreover, Python provides also a \lstinline{long} type, representing arbitrary precision numbers; recent versions of Python take care of switching automatically to \lstinline{long} whenever the result of an operations cannot fit into \lstinline{int}. The most natural way to implement these features is to allocate both values of type \lstinline{int} and \lstinline{long} on the heap, thus giving them the status of ``objects'' to switch from one to the other very easily. \begin{wrapfigure}{r}{5cm} \begin{lstlisting} i = 0 while i < 10000000: i = i+1 \end{lstlisting} \caption{Simple loop} \label{fig:simple-loop} \end{wrapfigure} Unfortunately, this incurs a severe performance penalty. Figure \ref{fig:simple-loop} shows a simple loop doing only arithmetic operations, without overflows. Benchmarks show that this code runs about 40 times slower than the corresponding program written in C and compiled by \lstinline{gcc} 4.4.1 without optimizations. For more information on the benchmarking machine, see Section \ref{sec:methodology}. However, most of the integer objects used in Python programs does not escape the function in which they are used and never or rarely overflow, so they do not really need to be \emph{effectively} represented as objects in the heap; a smart implementation could detect places where numbers can be stored ``on the stack'' and unboxed, and emit efficient machine code for those cases. The same discussion applies equally well also to floating point numbers, with the difference that in this case we do not need to check for overflows at every operation. \subsection{Dynamic dispatch of operations} \label{sec:dynamic-dispatch} \begin{wrapfigure}{r}{8cm} \begin{lstlisting} # while i < 10000000 9 LOAD_FAST 0 (i) 12 LOAD_CONST 2 (10000000) 15 COMPARE_OP 0 (<) 18 JUMP_IF_FALSE 14 (to 35) 21 POP_TOP # i = i + 1 22 LOAD_FAST 0 (i) 25 LOAD_CONST 3 (1) 28 BINARY_ADD 29 STORE_FAST 0 (i) # close the loop 32 JUMP_ABSOLUTE 9 \end{lstlisting} \caption{Bytecode for the loop in Figure \ref{fig:simple-loop}} \label{fig:simple-loop-dis} \end{wrapfigure} Most operations in Python are dynamically overloaded, i.e., they can be dynamically applied to values of different types. Consider for example the + operator: depending on the types of its arguments, it can either add two numbers, concatenate two strings, append two lists, or call some user-defined method. Consider again the loop in Figure \ref{fig:simple-loop} and an excerpt of its bytecode, shown in Figure \ref{fig:simple-loop-dis}. The \lstinline{BINARY_ADD} operation corresponds to the \lstinline{i+1} expression. Being dynamically typed, the virtual machine does not know in advance the types of the operands of \lstinline{BINARY_ADD}, hence it has to check them at runtime to select the proper implementation. This could lead to poor performance, especially inside a loop: for the loop in question, the virtual machine always dispatches \lstinline{BINARY_ADD} to the same implementation. In this example, it is very clear that an efficient implementation could detect that there are a fast and a slow path, and emit efficient code for the former. \subsection{Dynamic lookup of methods and attributes} \label{sec:dynamic-lookup} Python is an attribute based language: the forms \lstinline{obj.x} and \lstinline{obj.x = y} are used to get and set the value of the attribute \lstinline{x} of \lstinline{obj}, respectively. In contrast to other object oriented language, message sending or method invocation is not a separate operation than attribute access: methods are simply a special kind of attribute that can be called. The process to get the value of an attribute is called \emph{attribute lookup}. The semantics of the attribute lookup differs depending on the dynamic type of the object in question: for example, if the object is an instance of a class the attribute will be first searched in the object, then in its class, then in its superclasses. Moreover, classes can override the default behavior by defining the special method \lstinline{__getattribute__} which is called whenever an attribute lookup occurs. Since its semantics depends on the dynamic type of the objects, in Python the lookup of attributes is performed at run-time. Moreover, there are special built-in functions that allow to get and set attributes by specifying their name as a string. For example, the code in Figure \ref{fig:dynamic-lookup} is a valid Python program that prints \lstinline{42} twice. Note that the behavior of line 14 depends on the value of the flag passed to \lstinline{foo}: if we passed \lstinline{False}, an exception would be raised. \begin{figure} \begin{lstlisting}[numbers=left] # an empty class class MyClass: pass # arguments passed to foo can be of any type def foo(target, flag): # the attribute x is set only if flag is True if flag: target.x = 42 obj = MyClass() # if we pass False, obj would not have any attribute x foo(obj, True) print obj.x print getattr(obj, "x") # use a string for the attribute name \end{lstlisting} \caption{Dynamic lookup of attributes} \label{fig:dynamic-lookup} \end{figure} As usual, if on the one hand these features allow a great degree of flexibility, on the other hand they are very difficult to implement efficiently; since the set of methods and attributes is not statically known, it is impossible to use a fixed memory layout for objects, as it happens in the implementation of statically typed object-oriented languages based on the notion of virtual method tables. Instead, classes and objects are usually implemented using a \emph{dictionary} (i.e. a hashtable) that maps attributes, represented as strings, to values, whilst methods and instance variables are typically managed differently in statically typed object oriented languages. \subsection{The world can change under your feet} \label{sec:the-world-can-change} \begin{wrapfigure}{r}{7cm} \begin{lstlisting} def fn(): return 42 def hello(): return 'Hello world!' def change_the_world(): global fn fn = hello print fn() # 42 change_the_world() print fn() # 'Hello world!' \end{lstlisting} \caption{Changing the global state} \label{fig:changing-global-state} \end{wrapfigure} In Python, the global state of the program can continuously and dynamically evolve: almost everything can change during the execution, including for example the definition of functions and classes, or the inheritance relationship between classes. Consider, for example, the program in Figure \ref{fig:changing-global-state}: the function associated to the name \lstinline{fn} changes at runtime, with the result that two subsequent invocations of the same name can lead to very different code paths. The same principle applies to classes: we have already seen in section \ref{sec:dynamic-lookup} that the attributes of an object can dynamically grow runtime, but things can change even deeper: the example in Figure \ref{fig:dynamic-class-change} demonstrates how objects can change their class at runtime, even for unrelated classes. Note also that \lstinline{my_pet} preserves its original attributes. Thus, in Python it is usually unsafe to assume anything about the outside world, and we need to access every single class, function, method or attribute dynamically, because it might not be what used to be previously. \begin{figure}[ht] \begin{lstlisting} class Dog: def __init__(self): self.name = "Fido" def talk(self): print "%s: Arf! Arf!" % self.name class Cat: def __init__(self): self.name = "Felix" def talk(self): print "%s: Meowww!" % self.name my_pet = Dog() my_pet.talk() # Fido: Arf! Arf! my_pet.__class__ = Cat my_pet.talk() # Fido: Meowww! \end{lstlisting} \caption{Dynamically changing the class of an object} \label{fig:dynamic-class-change} \end{figure} \subsection{Extreme introspective and reflective capabilities} \label{sec:frames-and-tracing} Python offers a lot of ways to inspect and modify a running program. For example, you can find out the methods of a class, the attributes of an instance, the names and the values of all locals variables defined in the current scope, and so on. \begin{wrapfigure}{r}{8.5cm} \begin{lstlisting} def fill_list(name): # get the caller frame object frame = sys._getframe().f_back # get the variable "name" in # the caller's context lst = frame.f_locals[name] lst.append(42) def foo(): mylist = [] fill_list("mylist") print mylist # prints [42] \end{lstlisting} \caption{Introspection of the stack frame} \label{fig:sys-getframe} \end{wrapfigure} Moreover, often it is also possible to modify this information: it is possible to add, remove or modify methods of a class, change the class of an object, etc. Since in CPython efficiency is not considered a compelling requirement, all these features are implemented in a straightforward way; however, it is very challenging to find alternative ways to implement them efficiently. In particular, CPython allows access to all the frames that compose the current execution stack: each frame contains information such as the instruction pointer, the frame of the caller and the local variables; moreover, under some circumstances it is possible to mutate the value of the local variables of the callers, as shown by the example if Figure \ref{fig:sys-getframe}\footnote{The official documentation says that \lstinline{sys._getframe} is a CPython implementation details. However it is used by real world programs, so it is required to be a real world alternative to CPython.}. The function \lstinline{sys.settrace()} is another example of feature that add overheads to the overall execution time of programs. It allows the programmer to install a \emph{tracer} to monitor and possibly alter the execution of the program: the events notified to the tracer include entering/leaving a function, executing a new line of code, raising an exception. Figure \ref{fig:settrace} shows an example of using \lstinline{sys.settrace}, and Figure \ref{fig:settrace-out} shoes its output. \begin{figure}[ht] \begin{minipage}[b]{0.65\linewidth} % A minipage that covers half the page \begin{lstlisting}[numbers=left] import sys def test(n): j = 0 for i in range(n): j = j + i return j def tracer(frame, event, arg): fname = frame.f_code.co_filename lineno = frame.f_lineno pos = '%s:%s' % (fname, lineno) print pos, event return tracer sys.settrace(tracer) # install the tracer test(2) # trace the call \end{lstlisting} \caption{Example of using \lstinline{sys.settrace}} \label{fig:settrace} \end{minipage} \begin{minipage}[b]{0.35\linewidth} % A minipage that covers half the page \begin{lstlisting}[keywords={}] settrace.py:3 call settrace.py:4 line settrace.py:5 line settrace.py:6 line settrace.py:5 line settrace.py:6 line settrace.py:5 line settrace.py:7 line settrace.py:7 return \end{lstlisting} \caption{Output} \label{fig:settrace-out} \end{minipage} \end{figure} These functionalities are heavily used by some applications such as debuggers or testing frameworks, even though their abuse in ``regular code'' is strongly discouraged. \section{Interpreters vs. compilers: limits of static analysis for Python} \label{sec:py-compilers} During the years, people often proposed to make Python faster by implementing a compiler, which should be supposedly much faster than an interpreter. However, if you carefully read the previous section, it is immediately clear that a naïve compiler is not enough to get good performance. Implementing a Python compiler to get good performance is extremely challenging. Since Python is dynamically typed, a static compiler can know little about the types of the variables in a program: hence, the compiler has little chances to discover short paths, thus inevitably incurring in the same problems described in Section \ref{sec:is-python-slow?}. Note that aggressive type inference does not help much, since unfortunately Python has not really been designed with this goal in mind, with the result that it is hard, if not impossible, to do it effectively: as described in section \ref{sec:the-world-can-change}, the majority of entities in a Python program can change at runtime: calling function \lstinline{fn()} could execute different code than the previous time, objects of class \lstinline{Cls} can suddenly get new attributes or methods that they did not have before, and so on. The net result is that every time we reference a global entity, either directly (e.g. by calling a function) or indirectly (e.g. by invoking a method on an object whose class is known), little can be assumed about the result. In the past, there have been few attempts to perform type inference for Python, but they did not work well \cite{Cannon05localizedtype} or they never saw the light \cite{Salib04fasterthan}. In conclusion, a static compiler would mainly solve the problem of the interpretation overhead described in section \ref{sec:interp-overhead}: hence, we can expect it to speed up programs by a factor of 2-2.5x, not more. Thus, to enhance performance we have to employ other techniques that go beyond a static compiler. % "anti-drawback": even if the language is very complicated it's ok to not % support everything in Psyco (and in PyPy JIT too) \section{Our solution: automatically generated JIT compiler} As Section \ref{sec:exiting-solutions} explains, the two major strategies to implement dynamic languages are writing a fast interpreter, and writing a JIT compiler. For the specific case of Python, writing a fast interpreter is not enough, because even if we manage to completely remove the intepretation overhead we can gain a limited speedup, as discussed in section \ref{sec:interp-overhead}. Moreover, The static JIT compilation techniques employed by Jython and IronPython seems not to give good performance, as they rarely outperform CPython. Instead, what we need is a \emph{dynamic JIT compiler} that behaves better than Psyco and solves its major drawbacks. In particular, we seek a compiler that can be easily maintained and extended when the language is modified; moreover, it should be easy to port such a compiler to different target platforms, e.g. x86 and CLI, and to implement it for other dynamic languages. The solution adopted by \emph{PyPy} is to \textbf{generate automatically} a JIT compiler instead of writing it by hand. The language is implemented through a simple interpreter, which is then fed into a translator which augments it with many features, including a dynamic JIT compiler. The JIT generator is divided into a frontend and several backends, one for each target platform. The following chapters describes in detail the solution implemented by PyPy. % LocalWords: runtime CPython Psyco bytecode lookup superclasses Inline Jython % LocalWords: IronPython