\chapter{Benchmarks}
\label{cha:benchmarks}

\section{Applying the JIT to the Python Interpreter}
\label{sec:pypy-cli-jit}

Chapters \ref{cha:pypy-jit} and \ref{cha:cli-backend} explain how the PyPy JIT generator works. In particular, in Section \ref{sec:jit-architecture} we have shown that it is possible to generate a JIT compiler from every interpreter written in RPython.

To measure the performance of the generated JIT, and in particular of the JIT with the CLI backend, we apply the JIT generator to the \emph{Python Standard Interpreter} of PyPy (see Section \ref{sec:pypy-architecture}).

As explained in Section \ref{sec:applying-the-hints}, it is necessary to apply some \emph{hints} to guide the JIT generation process. The two most important hints are \lstinline{jit_merge_point}, which indicates the start of the main interpreter loop, and \lstinline{can_enter_jit}, which indicates all the places where user loops have to be checked. Figure \ref{fig:pypy-main-loop} shows a simplified version of the main interpreter loop of the Python Interpreter: note that \lstinline{can_enter_jit} is called only at the end of \lstinline{JUMP_ABSOLUTE}, which is the only Python opcode that can set the instruction pointer to an earlier value, i.e. the only one that can possibly lead to a user loop.

Apart from these two fundamental hints, the Python Interpreter contains about 30 more hints that guide the JIT to produce better code. These hints have not been explained in this thesis; for example, some of them mark classes as ``immutable'', to let the JIT know that their instances never change their state after they have been created. The number of hints needed is quite modest compared to the roughly 100K LOC that make up the Python Interpreter: this supports the claim made in Section \ref{sec:applying-the-hints} that the extra work needed to enable the JIT for a large interpreter is negligible.
Once the hints are in place, it is possible to invoke the translator with the JIT generator enabled to get a final executable that contains both the Python Interpreter and the corresponding JIT compiler. In the following, we will use \lstinline{pypy-cli} to refer to the Python interpreter translated with the CLI translation backend and the CLI JIT backend.

\begin{figure}[t]
\begin{lstlisting}
pypyjitdriver = JitDriver(greens = ['next_instr', 'pycode'],
                          reds = ['self', 'ec'])

class PyFrame(Frame):

    def dispatch(self, pycode, next_instr, ec):
        try:
            while True:
                # start of the main interpreter loop
                pypyjitdriver.jit_merge_point()
                co_code = pycode.co_code
                next_instr = self.handle_bytecode(co_code, next_instr, ec)
        except ExitFrame:
            return self.popvalue()

    def JUMP_ABSOLUTE(self, jumpto, _, ec=None):
        self.last_instr = jumpto
        ec.bytecode_trace(self)
        jumpto = self.last_instr
        # a backward jump is the only way to start a user loop
        pypyjitdriver.can_enter_jit()
        return jumpto
    ...
\end{lstlisting}
\caption{Main interpreter loop of the \emph{PyPy Python Interpreter}}
\label{fig:pypy-main-loop}
\end{figure}

\section{Making the interpreter JIT friendly}
\label{sec:jit-friendly}

In order to get good performance it is necessary to have a very good trace optimizer (see Section \ref{sec:trace-optimizations}). In an ideal world, the optimizer would be smart enough to optimize all the programming patterns that are found in the interpreter. In reality, at the moment not all programming patterns are optimized equally well, so some code fragments benefit from the JIT more than others. As a result, \lstinline{pypy-cli} can occasionally encounter a user program that exercises interpreter fragments which the JIT cannot reasonably optimize, thus generating suboptimal and sometimes even inefficient code. There are two possible solutions to these problems: either enhancing the optimizer, or rewriting the involved parts of the interpreter so that the JIT can perform some useful optimization.
At the time of writing there are known fragments of the Python Interpreter where the JIT performs badly, as Section \ref{sec:microbench} clearly shows. These fragments include:
\begin{itemize}
\item Old-style classes\footnote{Old-style classes implement the classical object model of Python. Since version 2.2, they have been deprecated and coexist with \emph{new-style} classes, which are slightly incompatible.}: the PyPy team has always put more effort into optimizing new-style classes than old-style ones, with the result that the former are dramatically faster both during normal interpretation and with the JIT.
\item String manipulation: some of the algorithms for string processing implemented in PyPy are inferior to the ones used by CPython or IronPython, thus even with the help of the JIT there are cases in which they perform badly.
\item Regular expressions: in theory, it should be possible to consider the regular expression engine as another interpreter to be JITted separately from the main one, thus getting very optimized machine code for each different regular expression. However, at the moment this is not possible for technical reasons, and thus the regular expression engine is not considered by the JIT.
\end{itemize}
None of the above items represents a fundamental issue that cannot be solved by our approach; they have simply not been tackled so far due to time constraints. However, at the time of writing the PyPy team is actively working on them.

\section{Methodology}
\label{sec:methodology}

To measure the performance of \lstinline{pypy-cli} we selected two sets of benchmarks: the first is a collection of 82 microbenchmarks which test various aspects of the Python language, while the second is a collection of 13 middle-sized programs that perform various kinds of ``real life'' computations. The performance of \lstinline{pypy-cli} is measured against IronPython.
IronPython has a number of command-line options to enable or disable some Python features that are hard to implement efficiently. Since \lstinline{pypy-cli} fully supports all of these features, we decided to enable them in IronPython as well to make a fair comparison. In particular, we launched IronPython with the options \lstinline{-X:FullFrames}, which enables support for \lstinline{sys._getframe}, and \lstinline{-X:Tracing}, which enables support for \lstinline{sys.settrace} (see Section \ref{sec:frames-and-tracing} for more details).

Note that the spirit behind the choices of the PyPy and IronPython teams is different: in IronPython, it is considered reasonable to disable such rarely used features by default because they lead to less efficient code, while the goal of PyPy is to implement the full semantics of the language without compromises, and to let the JIT optimize the code fragments where such dynamic features are not used, while still producing correct results when they are actually used.

Following the methodology proposed by \cite{Georges07statisticallyrigorous}, each microbenchmark and middle-sized benchmark has been run 10 and 30 times in a row, respectively. Since we are interested in steady-state performance, the first run of each benchmark has been discarded, as it presumably includes the time spent by all the layers of JIT compilation. The final numbers are the average of all the remaining runs, and the confidence intervals were computed using a 95\% confidence level.

All tests have been performed on an otherwise idle machine with an \emph{Intel Core i7 920} CPU running at 2.67 GHz with 4 GB of RAM. All benchmarks were run both under Linux using Mono, the open source implementation of the CLI, and under Microsoft Windows XP using the CLR, which is the original implementation of the CLI by Microsoft.
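
The steady-state averaging described in this section can be sketched as follows. This is only an illustration, not the script actually used for the measurements: the function name \lstinline{steady_state_summary} and the table \lstinline{T_975} are hypothetical, and the use of Student's $t$ distribution for the confidence interval is an assumption, since the text only states the 95\% confidence level.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t distribution, keyed by
# degrees of freedom. Assumption: only the two sample sizes of this chapter
# are needed, i.e. 9 and 29 retained runs after dropping the first one.
T_975 = {8: 2.306, 28: 2.048}


def steady_state_summary(timings):
    """Return (mean, half_width) for a list of consecutive run times.

    The first timing is treated as warm-up (JIT compilation) and dropped;
    the interval mean +/- half_width is the 95% confidence interval.
    """
    steady = timings[1:]
    n = len(steady)
    mean = statistics.mean(steady)
    stdev = statistics.stdev(steady)  # sample standard deviation
    half_width = T_975[n - 1] * stdev / math.sqrt(n)
    return mean, half_width


if __name__ == "__main__":
    # Made-up timings (in seconds) for one microbenchmark run 10 times.
    runs = [5.0, 1.0, 1.2, 1.0, 1.2, 1.0, 1.2, 1.0, 1.2, 1.1]
    mean, half = steady_state_summary(runs)
    print("%.3f +/- %.3f s" % (mean, half))
```

Dropping the first run before averaging keeps the compilation cost out of the reported numbers, which is the intent stated above; a different warm-up policy would only change the slice.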
The following is a list of the versions of the software used:
\begin{itemize}
\item Ubuntu 9.10 \emph{Karmic Koala}, with Linux 2.6.31
\item Mono 2.4.2.3
\item IronPython 2.6.10920.0
\item Microsoft Windows XP SP2
\item Microsoft CLR 2.0.50727.1433\footnote{At the time of writing, the most recent version of the .NET Framework is 3.5. However, the virtual machine at the core of the Framework has not been updated since .NET 2.0, hence the version of the CLR.}
\end{itemize}

\section{Microbenchmarks}
\label{sec:microbench}

Figure \ref{fig:speedup-micro} shows the results of the microbenchmarks: for each benchmark, the bar indicates the speedup (or slowdown) factor of \lstinline{pypy-cli} compared to IronPython.

\begin{figure}[p]
\includegraphics[angle=-90, clip, trim=1.2cm 1.9cm 0cm 0cm]{graphs/speedup-micro.pdf}
\hfigrule
\caption{Speedup of pypy-cli vs IronPython (microbench)}
\label{fig:speedup-micro}
\end{figure}

On Mono \lstinline{pypy-cli} performs very well on most benchmarks, with speedups of up to 215 times and only 20 benchmarks slowed down, as summarized by Figure \ref{fig:speedup-micro-mono}.

On CLR \lstinline{pypy-cli} performs a bit worse, but the results are still very satisfactory, with speedups of up to 155 times and only 21 benchmarks slowed down. However, in some cases the slowdown is considerable, up to 78 times slower. Figure \ref{fig:speedup-micro-clr} summarizes the results.
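
To make the reading of the summary tables concrete, the following sketch shows one way to turn two timings into a speedup factor and to find the table row it falls into. It assumes that the factor is the ratio between the two mean execution times, which the text does not state explicitly; the names \lstinline{speedup_factor} and \lstinline{bucket} are hypothetical, and only the range boundaries are taken from Figures \ref{fig:speedup-micro-mono} and \ref{fig:speedup-micro-clr}.

```python
import bisect

# Range boundaries of the rows in the summary tables.
BOUNDS = [1, 2, 10, 25, 100, 250]


def speedup_factor(ironpython_time, pypy_cli_time):
    """Return (factor, direction) with factor >= 1.

    Assumption: the factor is the ratio of the two mean execution times,
    reported as "faster" when pypy-cli takes less time than IronPython.
    """
    if pypy_cli_time <= ironpython_time:
        return ironpython_time / pypy_cli_time, "faster"
    return pypy_cli_time / ironpython_time, "slower"


def bucket(factor):
    """Return the (low, high) range of the table row containing factor."""
    i = bisect.bisect_right(BOUNDS, factor) - 1
    # Clamp so that out-of-range factors fall into the first or last row.
    i = min(max(i, 0), len(BOUNDS) - 2)
    return BOUNDS[i], BOUNDS[i + 1]


if __name__ == "__main__":
    # Made-up timings in seconds.
    factor, direction = speedup_factor(430.0, 2.0)
    print(factor, direction, bucket(factor))
```

A factor that lies exactly on a boundary (for example 2) is assigned here to the higher range; the tables do not say which convention was used.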
\begin{figure}[ht]
\begin{minipage}[b]{0.48\linewidth}
\centering
\begin{tabular}{rcrlr}
\toprule
\multicolumn{4}{l}{\textbf{Speedup factor}} & \textbf{No.} \\
\midrule
100 & to & 250 & faster & 6 \\
 25 & to & 100 &        & 16 \\
 10 & to &  25 &        & 17 \\
  2 & to &  10 &        & 15 \\
  1 & to &   2 &        & 8 \\
\midrule
  1 & to &   2 & slower & 4 \\
  2 & to &  10 &        & 14 \\
 10 & to &  25 &        & 2 \\
 25 & to & 100 &        & 0 \\
\bottomrule
\end{tabular}
\caption{Microbenchmarks on Mono}
\label{fig:speedup-micro-mono}
\end{minipage}
\hspace{0.02\linewidth}
\begin{minipage}[b]{0.48\linewidth}
\centering
\begin{tabular}{rcrlr}
\toprule
\multicolumn{4}{l}{\textbf{Speedup factor}} & \textbf{No.} \\
\midrule
100 & to & 250 & faster & 4 \\
 25 & to & 100 &        & 6 \\
 10 & to &  25 &        & 15 \\
  2 & to &  10 &        & 27 \\
  1 & to &   2 &        & 9 \\
\midrule
  1 & to &   2 & slower & 6 \\
  2 & to &  10 &        & 11 \\
 10 & to &  25 &        & 2 \\
 25 & to & 100 &        & 2 \\
\bottomrule
\end{tabular}
\caption{Microbenchmarks on CLR}
\label{fig:speedup-micro-clr}
\end{minipage}
\end{figure}

It is interesting to note that most of the bad results are related to the execution of those fragments where the JIT is known to perform badly, as described in Section \ref{sec:jit-friendly}: in particular, old-style classes and operations that involve strings. Moreover, it also emerges that the microbenchmarks that create a massive number of new objects are usually slower.

Another noteworthy observation is that \lstinline{pypy-cli} generally has a greater speedup factor over IronPython on Mono than on CLR.

\section{Middle-sized benchmarks}
\label{sec:macrobench}

Figure \ref{fig:macrobench-descr} contains short descriptions of the 13 middle-sized benchmarks that have been executed to further evaluate the performance of \lstinline{pypy-cli} versus IronPython.
\begin{figure}[ht]
\centering
\begin{tabular}{lp{12.5cm}}
\toprule
\textbf{Name} & \textbf{Description} \\
\midrule
\texttt{build-tree} & Create and traverse a huge binary tree (1 million elements) \\
\texttt{chaos} & Generate an image containing a chaos game-like fractal \\
\texttt{f1} & Two nested loops doing intensive computation with integers \\
\texttt{fannkuch} & The classical \emph{fannkuch} benchmark \cite{Anderson_performinglisp} \\
\texttt{float} & Exercise both floating point and object oriented operations by encapsulating $(x, y)$ pairs in a \texttt{Point} class \\
\texttt{html} & Generate a huge HTML table \\
\texttt{k-nucleotide} & Analyze nucleotide sequences expressed in \emph{FASTA format} \cite{fasta} \\
\texttt{linked-list} & Create a huge linked list \\
\texttt{oobench} & Measure the performance of method invocation on objects \\
\texttt{pystone} & The standard Python benchmark, derived from the classical \emph{Dhrystone} \cite{dhrystone} \\
\texttt{pystone-new} & The same as \texttt{pystone}, but using new-style classes instead of old-style ones \\
\texttt{richards} & The classical \emph{Richards} benchmark, originally written in BCPL and rewritten in Python \\
\texttt{spectral-norm} & Compute the eigenvalue using the power method \\
\bottomrule
\end{tabular}
\caption{Description of the benchmarks}
\label{fig:macrobench-descr}
\end{figure}

Figure \ref{fig:bench-mono-win} shows the time taken by \lstinline{pypy-cli} and IronPython to complete each benchmark. The results on Mono clearly show that \lstinline{pypy-cli} is faster than IronPython on most of the benchmarks.
It is interesting to analyze the benchmarks that are slower:

\begin{figure}[p]
\centering
\includegraphics{graphs/mono.pdf}
\includegraphics{graphs/win.pdf}
\hfigrule
\caption{Time taken to complete the benchmarks on Mono and CLR (the lower the better)}
\label{fig:bench-mono-win}
\end{figure}

\begin{itemize}
\item \lstinline{build-tree} creates a massive number of new objects, showing the same behaviour already noted for the microbenchmarks. This is an unexpected result that has not been investigated further due to time constraints, but will be analyzed as part of the future work.
\item \lstinline{fannkuch} does not even complete. This is because the JIT creates two \emph{mutually recursive loops} (see Section \ref{sec:mutual-recursive-loops}) that call each other with a tail call. However, due to a bug in the implementation of tail calls on Mono (see Section \ref{sec:tail-calls}), the program exhausts the stack space and terminates abnormally.
\item \lstinline{html} involves a lot of operations on strings, thus \lstinline{pypy-cli} was expected to be slower. However, it is interesting to see that the difference is minimal: this is probably due to the fact that all the other operations (like method invocation and attribute access) are sped up by \lstinline{pypy-cli}, balancing the slowdown caused by the string operations.
\item \lstinline{k-nucleotide} consists almost entirely of operations on strings, thus the slowdown is in line with what was observed in the microbenchmarks.
\item \lstinline{pystone} is slower because it uses old-style classes. \lstinline{pystone-new}, which uses new-style classes, is much faster on \lstinline{pypy-cli}.
\end{itemize}

On CLR the results are less satisfactory: 7 out of 13 benchmarks are faster, but the other 6 are slower, sometimes even by a large factor.
It is interesting to note that on the two benchmarks that involve a lot of string manipulation (\lstinline{html} and \lstinline{k-nucleotide}), \lstinline{pypy-cli} behaves better on CLR than on Mono and outperforms IronPython. In the CLR histogram, the data for \lstinline{chaos} have been omitted for \lstinline{pypy-cli} because it produces incorrect results: this is probably due to a bug in the CLR itself, as on Mono it works correctly.

The histogram in Figure \ref{fig:speedup} summarizes the results by showing the speedup (or slowdown) factor of \lstinline{pypy-cli} over IronPython. \lstinline{f1} and \lstinline{oobench} are the benchmarks with the largest speedup factor, up to about 31 and 37 times faster on Mono, and 12 and 13 times faster on CLR. \lstinline{float}, \lstinline{richards} and \lstinline{spectral-norm} are also consistently sped up, but by a smaller factor. Some benchmarks (\lstinline{html}, \lstinline{k-nucleotide}, \lstinline{linked-list}, \lstinline{pystone-new}) exhibit mixed behavior, as they are sped up on one implementation and slowed down on the other. Finally, \lstinline{build-tree} and \lstinline{pystone} are consistently slowed down by \lstinline{pypy-cli}.

\begin{figure}[ht]
\centering
\includegraphics[width=16cm]{graphs/speedup.pdf}
\hfigrule
\caption{Speedup factor of \lstinline{pypy-cli} over IronPython on Mono and CLR (the higher the better for \lstinline{pypy-cli})}
\label{fig:speedup}
\end{figure}

\subsection{Differences between Mono and CLR}

Why does \lstinline{pypy-cli} get much better results compared to IronPython on Mono than on CLR? Figure \ref{fig:pypy-ipyfull} compares side by side the performance of the two platforms when running either \lstinline{pypy-cli} or IronPython: recall that all the benchmarks were run on the same machine (although on different operating systems), so the results are directly comparable.
With the sole exception of \lstinline{build-tree}, IronPython is consistently faster on CLR than on Mono. This is probably due to the fact that IronPython has been explicitly optimized for the CLR and vice versa, since both projects have been developed by Microsoft.

On the other hand, for some benchmarks \lstinline{pypy-cli} behaves better on Mono than on CLR, while for others the opposite is true. This is probably due to the fact that the bytecode generated by the higher level JIT compiler of \lstinline{pypy-cli} is very different from the typical bytecode generated by other compilers for the CLI (e.g. by C\# compilers), and is thus not always properly optimized by the lower level JIT compiler. The fact that \lstinline{chaos} does not work correctly on CLR supports this theory, as the code produced by \lstinline{pypy-cli} triggers a bug that has probably gone unnoticed for years.

\begin{figure}[ht]
\centering
\includegraphics{graphs/ipyfull.pdf}
\includegraphics{graphs/pypy-cli.pdf}
\hfigrule
\caption{Time taken to complete the benchmarks by \lstinline{pypy-cli} and IronPython (the lower the better)}
\label{fig:pypy-ipyfull}
\end{figure}

% LocalWords: backend RPython PyPy cli opcode JITted microbenchmark GHz CLR
% LocalWords: IronPython Microbenchmark microbenchmarks speeded oobench
% LocalWords: bytecode