\documentclass{article} \usepackage{amssymb,xy,html} \xyoption{all} { \catcode`\_\active \global\def_{\char`\_} } { \catcode`\&\active \global\def&{\char`\&} } \def\code{\catcode`\_\active\catcode`\&\active\codeendgroup} \def\codeendgroup#1{\texttt{#1}\catcode`\_=8\catcode`\&=4{}} \def\defining#1{{\bf #1}} \def\dom{\mathop{dom}} \def\id{\mathop{id}} \def\N{{\mathbb N}} \newtheorem{definition}{Definition} \begin{document} \begin{center} {\large\bf Representation-based Just-in-time Specialization} and {\bf the Psyco prototype for Python} \vspace{0.6cm} Armin Rigo \end{center} \vspace{0.8cm} \noindent {\bf Abstract.} {\small\quad A powerful application of specialization is to remove interpretative overhead: a language can be implemented with an interpreter, whose performance is then improved by specializing it for a given program source. This approach is only moderately successful with very dynamic languages, where the operation of each single step can be highly dependent on run-time data. In the present paper, the Psyco prototype for the Python language is presented. It introduces two novel techniques. The first is \emph{just-in-time specialization,} or \emph{specialization by need,} which introduces the ``unlifting'' ability for a value to be promoted from run-time to compile-time during specialization -- the converse of the lift operator of partial evaluation. Its presence gives an unusual and powerful perspective on the specialization process. The second technique is \emph{representations,} a theory of data specialization generalizing the traditional specialization domains (i.e.\ the compile-time/run-time dichotomy). \vspace{0.6cm} %% ------------------------------------------------------------ %% INTRODUCTION %% ------------------------------------------------------------ \section{Introduction} \label{sec_introduction} Some programming languages do not lend themselves easily to static compilation. We are interested in techniques to recover execution efficiency transparently for the programmer. Our contribution is at the intersection of \emph{on-line partial evaluation} and \emph{just-in-time compilation:} \defining{Just-in-time compilation} (as defined by \cite{Aycock}) broadly refers to any kind of compilation (translation between languages, e.g.\ from Java bytecode to native machine code) that occurs in parallel with the actual execution of the program. \defining{Specialization} refers to translation (typically from a language into itself) of a general program into a more limited version of it, in the hope that the specialized version can be more efficient than the general one. \defining{Partial evaluation} \cite{partial_eval} is the specialization technique that we will generally consider in the sequel: partial information about the variables and arguments of a program is propagated by abstractedly ``evaluating'', or interpreting, the program. \subsection{The limitation of staged analysis} The classical presentation of partial evaluation is the following: consider a function $f(x,y)$ of two arguments. If, during the execution of a program, the value of the first argument $x$ is generally less variable than the value of $y$, then it can be interesting to generate a family of functions $f_1, f_2, f_3\ldots$ for a family of commonly occurring values $x_1, x_2, x_3\ldots$ such that $f_n(y)=f(x_n,y)$. Each function $f_n$ can then be optimized independently. The notation $f(x,y)$ points to an aspect of specialization that is a major difficulty: the choice of how exactly to divide the arguments into the compile-time ($x$) and the run-time ($y$) ones. This compile-time/run-time categorization is typically provided either by the programmer as annotations in the source, or by automatic analysis. An efficient result is a delicate balance between under-specialization (missing optimization opportunities) and over-specialization (creating numerous versions of the same code which are only slightly or even not better at all than the more general version). In recent approaches (e.g.\ \cite{staged}) the categorization includes more than two stages, with gradually more information (and less computational time) available for optimization. The categorization, however, is typically done statically, i.e.\ the \emph{amount} of information that will be available at each stage is fixed in advance. One could argue that the difficulties involved in finding such a categorization stem from two problems: % \begin{itemize} \item What we are looking for is a global solution of a problem defined by local (but non-modular) constrains induced by the program source. Combining two locally optimal solutions is generally impossible. \item It might not be clear for a single source level operation what amount of information the specializer will need; every single fully polymorphic operation can involve an arbitrary amount of both built-in and user code, and in some languages all operations are fully polymorphic. \end{itemize} A related problem in the most dynamic languages is that some of their features could potentially make any static analysis not just incomplete but invalid (dynamic source execution, custom meta-object protocols, etc.). \subsection{Contributions of the present paper} In the present paper we investigate the extra operational power available by doing specialization entirely at run time instead of compile time, including the compile-time/run-time categorization itself, a process which could be called \defining{just-in-time specialization}. This setting allows specialization to constantly query for individual run-time values (``poll'') as needed, a process which is effectively the converse of the lift operator (section \ref{sec_unlift}).\footnote{This is much finer-grained than statistic-oriented recompilation.} A combination of lift and unlift allows arbitrary local categorizations to be combined together, removing the need to find a globally consistent one. It also naturally leads to specialization-by-need, where we specialize for all relevant values and only them (section \ref{sec_topdown}). The extra flexibility prompts for a more general theory of data specialization, sketched in section \ref{sec_representation}. The insights in the present paper have been motivated by Psyco, a just-in-time specializer for the Python language (section \ref{sec_psyco}). \subsection{Related work} The classical reference for efficient execution of dynamic programming languages is the implementation of Self \cite{Self}, which transparently specializes functions for specific argument types using statistical feed-back. A number of projects have followed with a similar approach, e.g.\ \cite{Cecil} and \cite{VCC97}. Trying to apply the techniques on increasingly reflective languages in which the user can tamper with increasingly essential features (e.g.\ via a meta-object protocol, or MOP \cite{MOP}) eventually led to entirely run-time specialization; Sullivan introduces in \cite{DynamicPE} the theory of \emph{dynamic partial evaluation,} which is specialization performed as a side effect of regular evaluation. To our knowledge this is the closest work to ours because the specializer does not only know what set of values a given variable can take, but also which specific value it takes right now. (Sullivan does not seem to address run-time choice points in \cite{DynamicPE}, i.e.\ how the multiple paths of a residual conditional expressions are handled.) Intermediate approaches for removing the interpretative overhead in specific reflective object-oriented languages can be found in \cite{MY98} and \cite{BN00}; however, both assume a limited MOP model. Java has recently given just-in-time compilation much public exposure; Aycock \cite{Aycock} gives a history and references. Some projects (e.g.\ J3 \cite{J3} for Squeak \cite{Squeak}) aim at replacing an interpreter with a compiler within an environment that provides the otherwise unmodified supporting library. Throughout history, a number of projects (see \cite{Aycock}) offered the ability to use both the interpreter and the compiler at the same time, thought considerable care was required to keep the interpreted and compiled evaluations synchronized (as was attempted by J2, the precursor of J3; \cite{J3} describes the related hassle). Whaley \cite{W01} discusses compilation with a finer granularity than whole functions. Low-level code generation techniques include lazy compilation of uncommon branches (\cite{Self}, p.\ 123) and optimistic optimization using likely invariants, with guards in the generated code (\cite{PMI88}). \vspace{0.6cm} %% ------------------------------------------------------------ %% Just-in-time specialization %% ------------------------------------------------------------ \section{Just-in-time specialization} \label{sec_jit} The present section introduces the basic idea behind just-in-time specialization from a practical point of view. The following section \ref{sec_representation} will sketch the formal theory supporting it. \subsection{The Unlift operator} \label{sec_unlift} Specialization makes use of a subset of all information that will available at execution, the \emph{compile-time} data. The rest, the \emph{run-time} information, is only available later, and the specialization cannot depend on it. While it is obvious in off-line specialization, which runs before execution, it is also true in on-line specialization: the specialized version of a function, by definition, should only depend on the compile-time values, because the goal is to apply the same specialized function several times, to multiple run-time values. It is possible for the specializer to copy the value of a compile-time variable into a run-time variable, thus ``forgetting'' it. The converse is normally not possible. This restriction places a strong global condition on the compile-time/run-time classification of a program. There are cases where it would be interesting to gather compile-time (i.e.\ early) information about a run-time value. This operation is essential; in some respect, it is what on-line specializers implicitly do when they start their job: they take an input (run-time) value, and start generating a version of the source specialized for this (now considered compile-time) value. Let us make this operation explicit. We call it \defining{unlift}, as it is effectively the converse of the \defining{lift} operator which in partial evaluation denotes that a compile-time value should be ``forgotten'' (i.e.\ considered as run-time) in the interest of a greater generality of the residual code. Although the possibility of unlift is not often considered, it does not raise serious problems. By comparison, the common problems found in most forms of on-line specialization (see section \ref{sec_issues}) are much more difficult. How we can in practice read a run-time value from the specializer is best explained with explicit continuations: when a run-time value is asked for, the specializer is suspended (we capture its state in a continuation); and residual code is emitted that will resume the specializer (by invoking the continuation) with the run-time value. In other words, specialization is not simply guided by run-time feed-back; it is literally controlled by the run-time, and does not take place at all (the continuation remains suspended) before these run-time values actually show up. \subsection{The top-down approach} \label{sec_topdown} Unlifting makes specialization and execution much more intermixed in time than even on-line specialization, as we will see on an example in section \ref{sec_example2}. We call this particular technique \defining{just-in-time specialization.} This approach sidesteps a number of common issues, e.g.\ it terminates if the program does and even has a constant worse-case overhead, and lowers the need for complex heuristics or detailed source code annotations. For example, \emph{widening} is not essential in the top-down approach. Instead of starting with highly specialized versions of the code and generalizing when new values are found that do not fit in the previous constrains (occasionally over-generalizing for fear of never terminating), we can start with the most general inputs and gradually specialize by applying the unlift operator. Perhaps even more important: we can unlift only when there is a need, i.e.\ an immediately obvious benefit in doing so. In other words, we can do need-based specialization. A ``need to specialize'' is generally easy to define: try to avoid the presence in the residual code of some constructs like indirect function calls or large switches, because they prevent further optimizations by introducing run-time choice points. Specializing away this kind of language construct is a natural target. This can be done simply by unlifting the value on which the dispatch takes place. \subsection{Example} \label{sec_example2} Consider the following function:\footnote{We use the syntax of Python, though it should be immediately obvious.} % \begin{verbatim} def f(n): return 2*(n+1) \end{verbatim} As discussed in section \ref{sec_topdown} we will start the specializer with the most general case: nothing is known about the input argument $n$. Figure \ref{fig_specexec} shows how specialization and execution are intermixed in time in this top-down approach. Note that specialization only starts when the first actual (run-time) call to $f$ takes place. \begin{figure}[hbt] %\def\tablleft#1{$\stackrel{#1}{\Longleftarrow}$} %\def\tablright#1{$\stackrel{#1}{\Longrightarrow}$} \def\tablright#1{$\leftarrow #1 -{}$} \def\tablleft#1{${}- #1 \rightarrow$} \begin{center} \begin{tabular}{ccc} \hline execution & \vline & specialization \\ \hline program call $f(12)$ & \vline & \\ &\tablleft{start}& start compiling $f(n)$ with \\ & \vline & nothing known about $n$ \\ & \vline & \\ & \vline & for $n+1$ it would be better \\ & \vline & to know the type of $n$. \\ start executing $f(n)$ as &\tablright{run} & What is the type of $n$? \\ compiled so far with $n=12$& \vline & \\ & \vline & \\ read the type of $n$: \code{int}&\vline & \\ the value asked for: &\tablleft{int} & proceed with the addition of \\ & \vline & two integer values: read the \\ & \vline & value into a register, write \\ & \vline & code that adds $1$. \\ execute the addition machine&\tablright{run} & Did it overflow? \\ instruction, result 13. & \vline & \\ the answer asked for: &\tablleft{no} & we know that $(n+1)$ and $2$ \\ & \vline & are integers so we write \\ & \vline & code that multiply them. \\ execute the multiplication &\tablright{run} & Did it overflow? \\ instruction, result 26. & \vline & \\ the answer asked for: &\tablleft{no} & result is an integer, \\ & \vline & return it. \\ return 26. &\tablright{run} & \\ \hline \end{tabular} \end{center} \caption{Mixed specialization and execution with three applications of the unlift operator.}\label{fig_specexec} \end{figure} \medskip Subsequent invocations of $f$ with another integer argument $n$ will reuse the already-compiled code, i.e.\ the left column of the table. Reading the left column only, you will see that it is nothing less than the optimal run-time code for doing the job of the function $f$, i.e.\ it is how the function would have been manually written, at least for the signature ``accepts an arbitrary value and returns an arbitrary value''. In fact, each excursion through the right column is compiled into a single conditional jump in the left column. For example, an ``overflow?'' question corresponds to a ``jump-if-not-overflow'' instruction whose target is the next line in the left column. As long as the question receives the same answer, it is a single machine jump that no longer goes through the specializer. If, however, a different answer is later encountered (e.g.\ when executing $f(2147483647)$ which overflows on 32-bit machines), then it is passed back to the specializer again, which resumes its job at that point. This results in a different code path, which does not replace the previously-generated code but completes it. When invoked, the specializer patches the conditional jump instruction to include the new case as well. In the above example, the ``jump-if-overflow'' instruction will be patched: the non-overflowing case is (as before) the first version of the code, but the overflowing case now points to the new code. As another example, say that $f$ is later called with a floating-point value. Then new code will be compiled, that will fork away from the existing code at the first question, ``what is the type of $n$?''. After this additional compilation, the patched processor instruction at that point is a three-way jump\footnote{which probably requires more than one processor instruction, and which grows when new cases are encountered. This kind of machine code patching is quite interesting in practice.}: when the answer is \code{int} it jumps to the first version; when it is \code{float} to the second version; and otherwise it calls back again to the specializer. \subsection{Issues with just-in-time specialization} \label{sec_issues} Just-in-time specialization, just like on-line specialization, requires caching techniques to manage the set of specialized versions of the code, typically mapping compile-time values to generated machine code. This cache potentially requires sophisticated heuristics to keep memory usage under control, and to avoid over-specialization. This cache is not only used on function entry points, but also at the head of loops and other control path merge points in the function bodies, so that we can detect when specialization is reaching an already-generated case. The bottom-up approach of traditional on-line specialization requires widening (when too many different compile-time values have been found at the same source point, they are tentatively generalized) to avoid generating infinitely many versions of a loop or a function. The top-down specialization-by-need approach of just-in-time specialization might remove the need for widenening, although more experimentation is needed to settle the question (the Psyco prototype does some widening which we have not tried to remove so far). Perhaps the most important problems introduced by the top-down approach are: % \begin{enumerate} \item memory usage, not for the generated code, but because a large number of continuations are kept around for a long time --- forever, a priori. In the above example, we can never be sure that $f$ will not be called later with an argument of yet another type. \item low-level performance: the generated code blocks are extremely fine-grained. As seen above, only a few machine instructions can typically be generated before the specializer must give the control back to execution, and often this immediately executes the instructions just produced. This defies common compiler optimization techniques like register allocation. Care must also be taken to keep some code locality: processors are not good at running code spread over numerous small blocks linked together with far-reaching jumps. \end{enumerate} A possible solution to these low-level problems would be to consider the code generated by the specializer as an intermediate version on the efficiency scale. It may even be a low-level pseudo-code instead of real machine code, which makes memory management easier. It would then be completed with a better compiler that is able to re-read it later and optimize it more seriously based on real usage statistics. Such a two-phase compilation has been successfully used in a number of projects (see \cite{Aycock}). \medskip The Psyco prototype currently implements a subset of these possible techniques, as described in section \ref{sec_solutions}. \vspace{0.6cm} %% ------------------------------------------------------------ %% REPRESENTATION-BASED SPECIALIZATION %% ------------------------------------------------------------ \section{Representation-based specialization} \label{sec_representation} This section introduces a formalism to support the process intuitively described above; more specifically, how we can represent partial information about a value, e.g.\ as in the case of the input argument $n$ of the function $f(n)$ in \ref{sec_example2}, which is promoted from run-time to ``known-to-be-of-type-\code{int}''. \subsection{Representations} \label{sec_def_repr} We call \emph{type} a set of values; the type of a variable is the set of its allowed values. \begin{definition}\label{def_repr_obj} Let $X$ be a type. A \defining{(type) representation} of $X$ is a function $r: X'\rightarrow X$. The set $X'=\dom(r)$ is called the \defining{domain} of the representation. \end{definition} The name \emph{representation} comes from the fact that $r$ allows some values from $X$ to be ``represented'' by an element of $X'$. An $x'\in X'$ represents the value $r(x')\in X$. For example, the domain $X'$ could be a subtype of $X$, $r$ being just the inclusion; or if $X$ is the set of all first-class objects of a programming language, and $X'$ is the set of machine-sized words, then $r$ could map a machine word to the corresponding integer object in the programming language, a representation which is often not trivial (because the interpreter or the compiler might associate meta-data to integer objects). The two extreme examples of representations are \begin{enumerate} \item the universal representation $\id_X: X \rightarrow X$ that represents any object as itself; \item for any $x\in X$, the constant representation $c_x: \{\cdot\} \rightarrow X$, whose domain is a set with just one (arbitrary) element ``$\ \cdot\ $'', whose image $c_x(\cdot)$ is precisely $x$. \end{enumerate} \begin{definition} Let $f: X\rightarrow Y$ be a function. A \defining{(function) representation}\footnote{We use the word ``representation'' for both types and functions: a function representation is exactly a type representation in the arrow category.} of $f$ is a function $f': X'\rightarrow Y'$ together with two type representations $r: X'\rightarrow X$ and $s: Y'\rightarrow Y$ such that $s(f'(x'))=f(r(x'))$ for any $x'\in X'$: $$\xymatrix{ \txt{X}\ar[r]^f & \txt{Y} \\ \txt{X'}\ar[r]^{f'}\ar[u]^r & \txt{Y'}\ar[u]_s }$$ $r$ is called the \defining{argument representation} and $s$ the \defining{output representation}. A \defining{partial representation} is a partial function $f'$ with $r$ and $s$ as above, where the commutativity relation holds only where $f'$ is defined. \end{definition} If $r$ is the inclusion of a subtype $X'$ into $X$, and if $s=\id_Y$, then $f'$ is a \defining{specialization} of $f$: indeed, it is a function that gives exactly the same results as $f$, but which is restricted to the subtype $X'$. Computationally, $f'$ can be more efficient than $f$ --- it it the whole purpose of specialization. More generally, a representation $f'$ of $f$ can be more efficient than $f$ not only because it is specialized to some input arguments, but also because both its input and its output can be represented more efficiently. For example, if $f: \N\rightarrow\N$ is a mathematical function, it could be partially represented by a partial function $f': M\rightarrow M$ implemented in assembly language, where $M$ is the set of machine-sized words and $r,s: M\rightarrow\N$ both represent small integers using (say, unsigned) machine words. This example also shows how representation can naturally express relationships between levels of abstractions: $r$ is not an inclusion of a subtype into a type; the type $M$ is much lower-level than a type like $\N$ which can be expected in high-level programming languages. \subsection{Specializers} \begin{definition} Let $f: X\rightarrow Y$ be a function and $R$ a family of representations of $X$. We call \defining{$R$-specializer} a map $S_f$ that can extend any $r\in R$ into a representation of $f$ with argument $r$: $$\xymatrix{ \txt{X}\ar[r]^f & \txt{Y} \\ \txt{X'}\ar[r]^{S_f(r)}\ar[u]^r & \txt{Y'}\ar[u]_s }$$ \end{definition} Note that if $R$ contains the universal representation $\id_X$, then $S_f$ can also produce the (unspecialized) function $f$ itself: $s(S_f(\id_X)(x))=f(x)$ i.e.\ $f=s\circ S_f(\id_X)$, where $s$ is the appropriate output representation of $Y$. \medskip The function $x'\mapsto S_f(r)(x')$ generalizes the compile-time/run-time division of the list of arguments of a function. Intuitively, $r$ encodes in itself information about the ``compile-time'' part in the arguments of $f$, whereas $x'$ provides the ``run-time'' portion. In theory, we can compute $r(x')$ by expanding the run-time part $x'$ with the information contained in $r$; this produces the complete value $x\in X$. Then the result $f(x)$ is represented as $s(S_f(r)(x'))$. For example, consider the particular case of a function $g(w,x')$ of two arguments. For convenience, rewrite it as a function $g((w,x'))$ of a single argument which is itself a couple $(w,x')$. Call $X$ the type of all such couples. To make a specific value of $w$ compile-time but keep $x'$ at run-time, pick the following representation of $X$: % \begin{eqnarray*} r_w: X' & \longrightarrow & X = W \times X' \\ x' & \longmapsto & (w,x') \end{eqnarray*} % \noindent and indeed: $$\xymatrix{ \txt{W $\times$ X'}\ar[r]^>>>>>g & \txt{Y} \\ \txt{X'}\ar[r]^{S_g(r_w)}\ar[u]^{r_w} & \txt{Y}\ar[u] }$$ $S_g(r_w)(x')=g(r_w(x'))=g((w,x'))$, so that $S_g(r_w)$ is the specialized function $g((w,-))$.\footnote{If $R$ contains at least all the $r_w$ representations, for all $w$, then we can also reconstruct the three Futamura projections, though we will not use them in the sequel.} With the usual notation $f_1\times f_2$ for the function $(a_1,a_2)\mapsto(f_1(a_1),f_2(a_2))$, a compact way to define $r_w$ is $r_w = c_w\times\id_{X'}$.\footnote{The latter is actually a function $\{\cdot\}\times X'\rightarrow W\times X'$ but we will systematically identify $\{\cdot\}\times X'$ with $X'$.} \subsection{Example} \label{sec_example} Consider a compiler able to do constant propagation for a statically typed language like C. For simplicity we will only consider variables of type \code{int}, taking values in the set $Int$. % \begin{verbatim} void f(int x) { int y = 2; int z = y + 5; return x + z; } \end{verbatim} The job of the compiler is to \emph{choose a representation} for each variable. In the above example, say that the input argument will be passed in the machine register $A$; then the argument $x$ is given the representation % \begin{eqnarray*} r_A: Machine\ States & \longrightarrow & Int \\ \hbox{state} & \longmapsto & \hbox{register $A$ in state} \end{eqnarray*} The variable $y$, on the other hand, is given the constant representation $c_2$. The compiler could work then by ``interpreting'' symbolically the C code with representations. The first addition above adds the representations $c_2$ and $c_5$, whose result is the representation $c_7$. The second addition is between $c_7$ and $r_A$; to do this, the compiler emits machine code that will compute the sum of $A$ and $7$ and store it in (say) the register $B$; this results in the representation $r_B$. Note how neither the representation alone, nor the machine state alone, is enough to know the value of a variable in the source program. This is because this source-level value is given by $r(x')$, where $r$ is the (compile-time) representation and $x'$ is the (run-time) value in $\dom(r)$ (in the case of $r_A$ and $r_B$, $x'$ is a machine state; in the case of $c_2$ and $c_5$ it is nothing, i.e.\ ``$\cdot$'' -- all the information is stored in the representation in these extreme cases). This is an example of off-line specialization of the body of a function $f$. If we repeated the process with, say, $c_{10}$ as the input argument's representation, then it would produce a specialized (no-op) function and return the $c_{17}$ representation. At run-time, this function does nothing and returns nothing, but it is a nothing (i.e.\ a ``$\cdot$'') that represents the value $17$, as specified by $c_{17}$. \medskip An alternative point of view on the symbolic interpretation described above is that we are specializing a C interpreter $interp(source, input)$ with an argument representation $c_f\times r_A$. This representation means ``$source$ is known to be exactly $f$, but $input$ is only known to be in the run-time register $A$''. \subsection{Integration with an interpreter} \label{sec_integration} For specializers, the practical trade-off lies in the choice of the family $R$ of representations. It must be large enough to include interesting cases for the program considered, but small enough to allow $S_f(r)$ to be computed and optimized with ease. But there is no reason to limit it to the examples seen above instead of introducing some more flexibility. Representations can be used to encode high-level objects in a number of ways, as fit for the language. For example, in high-level languages, for any two representations $r$ and $r'$ there could be a representation $r\times r'$ representing an object which is a couple of objects; the first item in the couple is represented by $r$, and the second by $r'$. (Section \ref{sec_solutions} has more examples.) \medskip One way to ensure that all values can be represented (without adding ever more cases in the definition of $R$) is to include the universal representation $\id_X$ among the family $R$. This slight change suddenly makes the compiler tightly integrated with a regular interpreter. Indeed, this most general representation stands for an arbitrary value whose type is not known at compile-time. This representation is very pervasive: typically, operations involving it produce a result that is also represented by $\id_X$. A function ``compiled'' with all its variables represented as $\id_X$ is inefficient: it still contains the overhead of decoding the operand types for all the operations and dispatching to the correct implementation. In other words it is very close to an interpreted version of $f$. Let us assume that a regular interpreter is already available for the language. Then the introduction of $\id_X$ provides a safe ``fall-back'' behavior: the compiler cannot fail; at worst it falls back to interpreter-style dispatching. This is an essential property if we consider interpreters which are large and even dynamically extensible: no predefined representation set $R$ can cover all possible cases unless it contains $\id_X$. \medskip A different but related problem is that in practice, a number of functions (both built-in and user-defined) have an efficient representation for ``common cases'' but require a significantly more complex representation to cover all cases. For example, integer addition is often representable by the processor's addition of machine words, but this representation is partial in case of overflow. In the spirit of section \ref{sec_example2} we solve this problem by forking the code into a common case and an exceptional one (e.g.\ by default we select the (partial) representation ``addition of machine-words'' for $S_{+}(r_{i_1}\times r_{i_2})$; if an overflow is detected we fork the exceptional branch using a more general representation $S_{+}(r) = {+}: \N\times\N\rightarrow\N$. Generalization cannot fail: in the worst case we can use the fall-back representation $S_{+}(\id_X\times\id_X)$. (This is similar to recent successful attempts at using a regular interpreter as a fall-back for exceptional cases, e.g.\ \cite{W01}.) \subsection{Changes of representation} Any kind of behavior using $\id_X$ as a fall-back raises the problem of the pervasiveness of $\id_X$ in the subsequent computations. This was the major motivation behind section \ref{sec_jit}: just-in-time specialization enables ``unlifting''. Recall that to \emph{lift} is to move a value from compile-time to run-time; in term of representation, it means that we change from a specific representation (e.g.\ $c_{42}$) to a more general one (e.g.\ $r_{i_1}$, the change being done by the residual C code \code{i1 = 42;}). Then \emph{unlifting} is a technique to solve the pervasiveness problem by doing the converse, i.e.\ switching from a general representation like $\id_X$ to a more specific one like $r_{i_1}$. We leave as an exercise to the reader the reformulation of the example of section \ref{sec_example2} in terms of representations. \medskip Both lifting and unlifting are instances of the more general \emph{change of representation} kind of operation. In the terminology of section \ref{sec_def_repr}, a change of representation is a representation of an identity, i.e.\ some low-level code that has no high-level effect: $$\vcenter{\xymatrix{ {X}\ar[r]^{\id} & {X} \\ {X_1}\ar[r]^{g}\ar[u]^{r_1} & {X_2}\ar[u]_{r_2} }} \hskip 20pt\txt{or equivalently}\hskip 20pt \vcenter{\xymatrix@C=1pt{ & {X} & \\ {X_1}\ar[rr]^{g}\ar@/^/[ur]^{r_1} & & {X_2}\ar@/_/[ul]_{r_2} }}$$ A lift is a function $g$ that is an inclusion $X_1\subset X_2$, i.e.\ the domain of the representation $r_1$ is widened to make the domain of $r_2$. Conversely, an unlift is a function $g$ that is a restriction: using run-time feedback about the actual $x_1\in X_1$ the specializer restricts the domain $X_1$ to a smaller domain $X_2$. Unlifts are partial representations. Run-time values may later show up that a given unlift (or any other partial representation) cannot handle, requiring re-specialization, as in \ref{sec_integration}. \vspace{0.6cm} %% ------------------------------------------------------------ %% PSYCO %% ------------------------------------------------------------ \section{Psyco} \label{sec_psyco} In the terminology introduced above, \htmladdnormallinkfoot{Psyco}{http://psyco.sourceforge.net} is a \emph{just-in-time representation-based specializer} operating on the \htmladdnormallinkfoot{Python}{http://www.python.org} language. \subsection{Overview} \label{sec_psyco_overview} The goal of Psyco is to transparently accelerate the execution of user Python code. It is not an independent tool; it is an extension module, written in C, for the standard Python interpreter. Its basic operating technique was described in section \ref{sec_example2}. It generates machine code by writing the corresponding bytes directly into executable memory (it cannot save machine code to disk; there is no linker to read it back). Its architecture is given in figure \ref{fig_arch}. \begin{figure}[hbt] $$\xymatrix{ & *+++[F-,]\txt{Python C API\\and various support code} & \\ *+++[F--] \txt{Machine code\\written by Psyco} \ar@<5pt>[r]^{jump} \ar@/^20pt/[ur]^{call} & *+++[F-,] \txt{Run-time\\dispatcher} \ar@<5pt>[l]^{jump} \ar[r]^{call} \ar[u]^{call} & *+++[F-,] \txt{Specializer} \ar@{-->}@/^50pt/[ll]^{generate} \ar@/_20pt/[ul]_{call} }$$ \caption{The architecture of Psyco (bottom row) within the standard Python interpreter (top box)}\label{fig_arch} \end{figure} \subsection{Implementation notes} \label{sec_solutions} The \defining{representations} in Psyco are implemented using a recursive data structure called \code{vinfo_t}. These representations closely follow the C implementation of the standard Python interpreter. Technically, they are representations of the C types manipulated by the interpreter (as in section \ref{sec_example}). However, we use them mostly to represent the data structure \code{PyObject} that implements Python language-level objects. Representations that are neither constant nor universal include machine-sized integers representing a \code{PyIntObject} structure that would stand for a Python integer object; representations for lists that are arithmetic progressions (the very common ``ranges'' in Python, for looping over); for dynamic function or method closures; for string slices (i.e.\ substrings of an existing string); etc. \medskip Particular attention has been paid to the \defining{continuations} underlying the whole specialization process. Obviously, being implemented in C, we do not have general-purpose continuations in the language. However, general tools like Lisp or Scheme continuations would probably be totally impractical. The reason is the memory impact (section \ref{sec_issues}). It would not be possible to save the state of the specializer at all the points where it could potentially be resumed from. Psyco emulates continuations by saving the state only at some specific positions, which are always between the specialization of two opcodes (pseudo-code instructions) -- and not between any two opcodes, but only between carefully selected ones. Intermediate states can be reconstructed by carefully repeating the execution of C code. \medskip \defining{Code generation} is also based on custom algorithms, not only for performance reason, but because general compilation techniques cannot be applied to code that is being executed piece by piece almost as soon as it is created. Besides the Intel i386-compatible machine code, Psyco has recently been ``ported'' to a custom \defining{low-level virtual machine architecture.} This architecture will be described in a separate paper. \medskip The \defining{profiler hooks} in Psyco select the functions to specialize based on an ``exponential decay'' weighting algorithm, also used e.g.\ in Self \cite{decay}. An interesting feature is that, because the specializer is very close in structure to the original interpreter (being a symbolic interpreter for the same language), it was easy to allow the profiler hooks to initiate the specialization of a function \emph{while it is running,} in the middle of its execution -- e.g.\ after some number of iterations in a long-running loop, to accelerate the remaining iterations. This is done essentially by building the universal representation of the current (interrupted) interpreter position (i.e.\ the representation in which nothing specific is known about the objects), and starting the specializer from there. \medskip In its current incarnation, Psyco uses a mixture of \defining{widening, lifting and unlifting} that may be over-complicated. To avoid infinite loops in the form of a representation being unlifted and then widened again, the compile-time representations are marked as \emph{fixed} when they are unlifted. The diagram of figure \ref{fig_str} lists all the state transitions that may occur in a \code{vinfo_t}. \begin{figure}[hbt] $$\xymatrix{ *+[F-:<8 pt>]\txt{virtual-time}\ar[rd]^-1\ar[rr]^-2 & & *+[F-:<8 pt>]\txt{run-time}\ar[rd]^-5 \\ & *+[F-:<8 pt>]\txt{non-fixed\\compile-time}\ar[ru]^-3\ar[rr]^-4 & & *+[F-:<8 pt>]\txt{fixed\\compile-time} }$$ \caption{State transitions in Psyco: widening (3), unlifting (5) and other trivial (4) and non-trivial (1, 2) representation changes}\label{fig_str} \end{figure} \subsection{Performance results} \label{sec_performance} As expected, Psyco gives massive performance improvements in specific situations. Larger applications where time is not spent in any obvious place benefit much less from the current, rather low-level incarnation of this prototype. In general, on small benchmarks, Python programs run with Psyco exhibit a performance that is near the middle of the (large) gap between interpreters and static compilers. This result is already remarkable, given that few efforts have been spent on optimizing the generated machine code. \def\benchmark#1{{\bf #1\ }} \benchmark{Int arithmetic} serves as a test of the quality of the machine code. \benchmark{Float arithmetic} shows the gain of removing the object allocation and deconstruction overhead, without accelerating the computation itself (Psyco does not know how to generate machine code handling floating points so has to generate function calls). \benchmark{Complex arithmetic} shows the raw gain of removing the interpretative overhead only (Psyco does not know about complex numbers). \benchmark{Files and lists} count the frequency of each character in a set of files. \benchmark{Pystone} is a classical benchmark for Python.\footnote{Available in \code{Lib/test/pystone.py} in the Python distribution.} \benchmark{ZPT} is Zope Page Template, an HTML templating language interpreted in Python. \benchmark{PyPy 1} is the test suite of PyPy, a Python interpreter written in Python, first part (interpreter and module tests). \benchmark{PyPy 2} is the second part (object library implementation). The results (figure \ref{fig_benchmark}) have been obtained on a Linux Pentium III laptop at 700MHz with 64MB RAM. Times are seconds per run. Numbers in parenthesis are the acceleration factor with respect to Python times. All tests are run in maximum compilation mode (\code{psyco.full()}), i.e.\ without using the profiler but blindly compiling as much code as possible. \begin{figure}[hbt] \begin{tabular}{rccc} Benchmark & Python (2.3.3) & Psyco & C (gcc 2.95.2) \\ \hline int arithmetic & 28.5 & 0.262 ($109\times$) & 0.102 ($281\times$) \\ & & & ovf:\footnotemark\ 0.393 ($73\times$) \\ float arithmetic & 28.2 & 2.85 ($9.9\times$) & 0.181 ($156\times$) \\ complex arithmetic & 19.1 & 7.24 ($2.64\times$) & 0.186 ($102\times$) \\ & & & sqrt:\footnotemark\ 0.480 ($40\times$) \\ files and lists & 20.1 & 1.45 ($13.9\times$) & 0.095 ($211\times$) \\ Pystone & 19.3 & 3.94 ($4.9\times$) & \\ ZPT & 123 & 61 ($2\times$) & \\ PyPy 1 & 5.27 & 3.54 ($1.49\times$) & \\ PyPy 2 & 60.7 & 59.9 ($1.01\times$) & \\ \end{tabular} \caption{Timing the performance improvement of Psyco}\label{fig_benchmark} \end{figure} \medskip These results are \emph{not representative} in general because we have, obviously, selected examples where good results were expected. They show the behavior of Psyco on specific, algorithmic tasks. Psyco does not handle large, unalgorithmic applications very well. It is also difficult to get meaningful comparisons for this kind of application, because the same application is generally not available both in Python and in a statically compiled language like C. \addtocounter{footnote}{-1} \footnotetext{Although no operation in this test overflows the 32-bit words, both Python and Psyco systematically check for it. The second version of the equivalent C program also does these checks (encoded in the C source). Psyco is faster because it can use the native processor overflow checks.} \stepcounter{footnote} \footnotetext{This second version extracts the square root to check if the norm of a complex number is greater than 2, which is what Python and Psyco do, but we also included the C version with the obvious optimization because most of the time is spent there.} The present prototype moreover requires some tweaking to give good results on non-trivial examples, as described in \cite{UserGuide}, section 2.2. \subsection{Conclusion} In conclusion, we presented a novel ``just-in-time specialization'' technique motivated by a complete prototype. It differs from on-line specialization as follows: % \begin{itemize} \item The top-down approach (\ref{sec_topdown}) introduces specialization-by-need based on the \emph{unlift} operator as a promising alternative to the widening heuristics. \item It introduces some low-level efficiency issues (\ref{sec_issues}, \ref{sec_solutions}) not present in on-line specialization. \item It prompts for a more involved ``representation-based'' theory of data specialization (\ref{sec_def_repr}), which is in turn more powerful and gives a natural way to map data between abstraction levels. \item Our approach makes specialization more tightly coupled with regular interpreters (\ref{sec_integration}). \end{itemize} \subsection{Acknowledgments} All my gratitude goes to the Python community as a whole for a great language that never sacrifices design to performance, forcing new and interesting optimization techniques to be developed. \vspace{0.6cm} %% ------------------------------------------------------------ %% BIBLIOGRAPHY %% ------------------------------------------------------------ \begin{thebibliography}{W01} \bibitem[A03]{Aycock} John Aycock. {\em A Brief History of Just-In-Time.} ACM Computing Surveys, Vol.\ 35, No.\ 2, June 2003, pp.\ 97-113. \bibitem[B00]{BN00} Mathias Braux and Jacques Noy{\'e}. Towards partially evaluating reflection in Java. In {\em Proceedings of the 2000 ACM SIGPLAN Workshop on Evaluation and Semantics-Based Program Manipulation (PEPM-00),} pages 2--11, N.Y., January 22-23 2000. ACM Press. \bibitem[C92]{Self} Craig Chambers. {\em The Design and Implementation of the Self Compiler, an Optimizing Compiler for Object-Oriented Programming Languages.} PhD thesis, Computer Science Departement, Stanford University, March 1992. \bibitem[C02]{staged} Craig Chambers. Staged Compilation. In {\em Proceedings of the 2002 ACM SIGPLAN workshop on Partial evaluation and semantics-based program manipulation,} pages 1--8. ACM Press, 2002. \bibitem[D95]{Cecil} Jeffrey Dean, Craig Chambers, and David Grove. Selective specialization for object-oriented languages. In {\em Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation (PLDI),} pages 93--102, La Jolla, California, 18-21 June 1995. {\em SIGPLAN Notices} 30(6), June 1995. \bibitem[H96]{decay} Urs Holzle and David Ungar. {\em Reconciling responsiveness with performance in pure object-oriented languages.} ACM Transactions on Programming Languages and Systems, 18(4):355--400, July 1996. \bibitem[I97]{Squeak} Dan Ingalls, Ted Kaehler, John Maloney, Scott Wallace, and Alan Kay. Back to the future: The story of Squeak, A practical Smalltalk written in itself. In {\em Proceedings OOPSLA '97,} pages 318--326, November 1997. \bibitem[J93]{partial_eval} Neil D.\ Jones, Carsten K.\ Gomard, and Peter Sestoft. {\em Partial Evaluation and Automatic Program Generation.} Prentice Hall International, International Series in Computer Science, June 1993. ISBN number 0-13-020249-5 (pbk). \bibitem[K91]{MOP} G.~Kiczales, J.~des Rivi{\`e}res, and D.~G.~Bobrow. {\em The Art of the Meta-Object Protocol.} MIT Press, Cambridge (MA), USA, 1991. \bibitem[M98]{MY98} Hidehiko Masuhara, Satoshi Matsuoka, Kenichi Asai, and Akinori Yonezawa. Compiling away the meta-level in object-oriented concurrent reflective languages using partial evaluation. In {\em OOPSLA '95 Conference Proceedings: Object-Oriented Programming Systems, Languages, and Applications,} pages 300--315. ACM Press, 1995. \bibitem[Piu]{J3} Ian Piumarta. {\em J3 for Squeak.} \hfill\break\htmladdnormallink{http://www-sor.inria.fr/\~{}piumarta/squeak/unix/zip/j3-2.6.0/doc/j3/}{http://www-sor.inria.fr/~piumarta/squeak/unix/zip/j3-2.6.0/doc/j3/} \bibitem[P88]{PMI88} Calton Pu, Henry Massalin, and John Ioannidis. The synthesis kernel. In USENIX Association, editor, {\em Computing Systems, Winter, 1988.,} volume 1, pages 11--32, Berkeley, CA, USA, Winter 1988. USENIX. \bibitem[R03]{UserGuide} Armin Rigo. {\em The Ultimate Psyco Guide.} \hfill\break\htmladdnormallink{http://psyco.sourceforge.net/psycoguide.ps.gz}{http://psyco.sourceforge.net/psycoguide.ps.gz} \bibitem[R04]{PsycoFull} Armin Rigo. {\em Representation-based Just-in-time Specialization and the Psyco prototype for Python,} extended version of the present paper. \hfill\break\htmladdnormallink{http://psyco.sourceforge.net/theory\_psyco.ps.gz}{http://psyco.sourceforge.net/theory_psyco.ps.gz} \bibitem[S01]{DynamicPE} Gregory T.~Sullivan. Dynamic Partial Evaluation. In {\em Lecture Notes In Computer Science, Proceedings of the Second Symposium on Programs as Data Objects,} pp.\ 238-256, Springer-Verlag, London, UK, 2001. \bibitem[V97]{VCC97} Engen N.~Volanschi, Charles Consel, and Crispin Cowan. {\em Declarative specialization of object-oriented programs.} In Proceedings of the ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA-97), volume 32, 10 of ACM SIGPLAN Notices, pages 286--300, New York, October 5-9 1997. ACM Press. \bibitem[W01]{W01} John Whaley. Partial Method Compilation using Dynamic Profile Information. In {\em Proceedings of the OOPSLA '01 Conference on Object Oriented Programming Systems, Languages, and Applications,} October 2001, pages 166--179, Tampa Bay, FL, USA. ACM Press. \end{thebibliography} \end{document}