\section{A Gentle Introduction to \\ RPython}\label{rpython-intro} \subsection{The \emph{PyPy} project} \label{pypy-intro} %%% PyPy intro was previously in intro.tex %%% the most technical part has been moved in the section on the compiler The \emph{PyPy} project aims to create a platform that makes it easy to experiment with different virtual machine designs, but without sacrificing efficiency. This is achieved by separating the semantics of the language being implemented, such as Python or Javascript, from low-level aspects of its implementation, such as memory management or the threading model. In a similar fashion to Aspect-Oriented Programming\cite{KiczalesEtAl97}, a complete interpreter is constructed at build time by weaving together the interpreter definition and each low-level aspect into a complete --- and efficient --- whole.\cite{RigoPedroni06} In order to best separate the low-level aspects of an interpreter from the language semantics, the interpreter definition itself is must be written in a high-level language that abstracts these details away, but which is not too dynamic to be analyzed and manipulated. This language is RPython. Thus, PyPy is composed of two major parts: the \emph{Standard Interpreter} and the \emph{Translation Toolchain}. The Standard Interpreter is a full-featured Python interpeter written in RPython. The Translation Toolchain analyzes the interpreter, infers the static type of all its variables, and produces an efficient executable version. It is important to emphasize that the Translation Toolchain is a general compiler that can be used for other projects beyond the Standard Interpreter, although it has been heavily optimized to work well with the latter. In fact, interpreters for such languages as Prolog and Javascript also exist in various states of completeness. %%% end of PyPy intro \subsection{RPython as a language} As explained above, RPython was born as an implementation detail of PyPy. During the development of PyPy, RPython proved to be a very convenient language for development, and eventually grew into a useful end-product in its own right. Because RPython is a subset of Python, every RPython program can run unmodified on the top of the Python interpreter. This allows programmers to take advantage of all the introspective features of Python during testing and debugging. The standard version of Python is a dynamically typed language: type information is attached to the objects, rather than to the method parameters, local variables, or return values. For instance, the following is a valid Python function which, depending on the value of its \verb"x" parameter, returns either an integer or a string: \begin{lstlisting} def foo(x): if x: return 42 else: return 'bar' \end{lstlisting} Since this flexibility comes at the cost of a loss of efficiency, the above definition is not correct in RPython; indeed, all those dynamic features that cannot be efficiently implemented have been necessarily sacrified. Therefore, the static type system of RPython forbids using values of incompatible types together; however, this is not a serious limitation as experience has shown that most well-written Python programs already conform to RPython's type system. A more significant limitation of RPython is that it is not possible to dynamically change class definitions, such as by adding or removing methods and fields. In RPython, as in Java or C\#, each object object belongs to a class and each class has a fixed set of fields, methods, and superclasses (see section \ref{typesystem}). Although this limitation significantly reduces RPython's expressive power when compared to full Python, it is necessary for generating efficient executables; moreover, special care has been taken to make RPython ``as pythonic as possible'', meaning that developers can still use most of the typical patterns and idioms they use when developing in Python. Despite the above restrictions, RPython is still much more dynamic and expressive than most of the static mainstream languages such as Java and C\#. It supports all the features you could expect to find in any modern object oriented language such as classes, single inheritance and exceptions, and furthermore it is statically typed; moreover, it also supports a lot of features that are ordinary for Python programmers but that are usually not found in those languages: limited support for mixins, first-order function and class values, limited use of bound methods, metaclasses. \subsection{Type system and object model} \label{typesystem} RPython supports the following primitive types, with some variants: \lstinline!SomeInteger! (signed, unsigned, non-negative), \lstinline!SomeFloat!, \lstinline!SomeBool!, \lstinline!SomeChar!, \lstinline!SomeString!. While in Python each of these types is an object, in RPython these types can only be used in restricted ways. This ensures that it will be possible to represent instances of these types using their native counterparts during execution. For example, integers can be stored as a scalar \lstinline!int! rather than some kind of wrapped integer object. Although Python does not distinguish between strings and characters, RPython uses different types for them; during type inference, strings whose length is exactly one character are annotated with the type \lstinline{SomeChar}. This allows backends to use the native types for chars, when available (see \ref{sect:ootype}). In addition to the primitive types, there is a small set of built-in container types: \lstinline!SomeTuple!, \lstinline!SomeList!, and \lstinline!SomeDict!. These types are generic types, meaning that they are parameterized by the types of the items which they store. \lstinline{SomeTuple} represent \emph{tuples}, and are used in both Python and RPython to group together small sets of non-homogeneous objects. Tuples are found in many languages, and can be thought of as a read-only, anonymous record. Their items are accessed either by index or by \textbf{tuple unpacking}. They are commonly used to return multiple values from a function, as in this example: \begin{lstlisting} def divmod(a, b): return a/b, a%b quot, rem = divmod(10, 3) # unpacking print quot # prints 3 print rem # prints 1 \end{lstlisting} \lstinline{SomeList} objects are used to store mutable sequences of items, all of which must be of the same type. They take the place of both fixed-size arrays and growable sequences such as the \lstinline{ArrayList} class from Java and .NET. The annotator automatically determines whether the list's size is fixed or variable, so that the backend can select the most efficient representation. \lstinline{SomeList} objects provides most of the methods of Python's lists, such as \lstinline{sort}, \lstinline{reverse}, \lstinline{append} and \lstinline{pop}. Finally, \lstinline{SomeDict} is the RPython equivalent of a Python dictionary; dictionaries represent a mapping between keys and values, and are usually implemented with an hashtable. As with lists, RPython dictionaries need to be homogeneous, i.e. all the keys must belongs to the same type, as well as all the values (the type of the keys does not need to be necessarily the same as the type of values). In addition to basic types, developers can define their own classes. In keeping with RPython's roots as a subset of Python, however, the set of fields contained by a class are not declared as they would be in a traditional language. Instead, the fields (and their types) are inferred automatically based on the assignments in the program. For example, consider the following code: \begin{lstlisting} class MyClass: def __init__(self, a, b): self.a = a self.b = b def foo(): obj = MyClass("Hello world", 42) \end{lstlisting} When analyzing the function \lstinline{foo}, the annotator can automatically detect that instances of \lstinline{MyClass} contain a string field named \lstinline{a} and an integer one named \lstinline{b}. RPython classes only support single inheritance; however, they also allow \emph{mixin} definitions, which allow common methods to be shared among many classes, without affecting the inheritance hierarchy. Intuitively, mixins can be seen as a collection of methods that are cut-and-pasted into the class definition. %% removed the comment below %% if we want RPython programs to behave exactly as in Python %% isinstance should be allowed for mixins %% so it is better to avoid the comment %Note that this implies %that Python's \lstinline{isinstance} method does not work as expected %when used with a mixin class. The syntax for declaring and using a mixin is similar to multiple inheritance, except that mixin classes are marked by a special \lstinline{_mixin_} attribute. The following example demonstrates two mixins, \lstinline{Displayable} and \lstinline{Addable}, and two classes which use them: \begin{lstlisting} class Displayable: _mixin_ = True def display(self): print 'value = ',self.value class Addable: _mixin_ = True def add(self, x): return self.value + x class Number(Displayable,Addable): def __init__(self, value): self.value = value class String(Displayable,Addable): def __init__(self, value): self.value = value def main(): n = Number(40) s = String('Hello ') print n.add(2) # 42 print s.add('world!') # Hello world n.display() # value = 40 s.display() # value = Hello \end{lstlisting} The notion of mixin in RPython shares some similarities both with the notion of trait \cite{DucasseEtAl06} and with the standard notion of mixin \cite{BrachaCook90}. As traits, RPython mixins do not affect the inheritance hierarchy and the methods of the class take the precedence on the methods of the traits. As standard mixins, the order in which RPython mixins are composed is relevant. % comment below removed fo the same reason as that above %this is the same behaviour you %get in Python, so RPython programmers can again continue to exploit %the common patterns they are used to (see for example %\cite{PythonMixIn} and \cite{VanRossum}). Finally, classes and methods are first-order entities, i.e. they can be stored and passed around to be called/instantiated later: \begin{lstlisting} def add(a, b): return a+b def sub(a, b): return a-b def do(f, a, b): print 'The answer is', f(a, b) def main(): do(add, 30, 12) do(sub, 50, 8) \end{lstlisting} \subsection{Initialization-time, translation-time and run-time} In Python, functions and classes are not defined by declarations, but by executing the \lstinline{def} and \lstinline{class} statements, which have the side effect of creating a function or class object, respectively. When a module is imported into the interpreter, its top-level statements are executed and the resulting objects are collected into the module's namespace. \begin{figure} \centering \epsfig{file=translation, scale=0.3} \caption{From Python sources to compiled executables.} \label{fig:translation} \end{figure} Unlike most compilers, the translation of RPython programs does not start from the source files but from live Python objects that have been imported and initialized by the standard Python interpreter. Thus, the life cycle of an RPython program is divided into three phases, as shown by Figure \ref{fig:translation}. \begin{description} \item[Initialization] the process of constructing and intializing Python classes, functions and constants to be compiled. \item[Translation] the process of analyzing the program as whole, inferring the types of all the variables and function parameters and producing an executable. \item[Run] the execution of the output produced by the \emph{translation} step. \end{description} Because the \emph{translation step} only examines the objects once they have been fully created, it is not aware of how they were generated. This means that at initialization-time we can exploit the full power of Python, without restrictions, to dynamically build these live objects; this includes, but it is not limited to, \lstinline{exec}, nested scopes, and metaclasses. In other words, we could say that RPython programs are not \emph{written} in the form of source code, but they are \emph{generated} by the Python statements that create those live objects; thus, we can think of Python as the \textbf{meta-programming} language for RPython. This allows for a wide range of language extensions. \subsection{Useful RPython programming patterns} This section contains an overview of useful techniques that take advantage of the Python pre-processing step to make programs simpler and easier to maintain. Some of what we describe in this section could be implemented in Java or C\# today using reflection: however, this entails a heavy runtime price, whereas in PyPy these patterns operate purely in a pre-processing phase, exacting zero cost at runtime. \subsubsection{Automatically generated source code} It is often the case when programming that a large number of similar, but not identical, tasks need to be performed. Generally, this is achieved by cut-and-pasting the relevant code and modifying it by hand for each case. Needless to say, this is a boring and error-prone process. For example, consider the implementation of \lstinline{rlib.streamio}, the RPython library which manages files and streams in general: \emph{streams} are implemented with a low-level class that give access to the physical media where the stream resides and a tower of \emph{decorators} (as described by the homonym pattern in \cite{GoF95}) which provides additional functionality such as buffering, automatic line-ending conversion, and so on. Each decorator is a wrapper around a stream: typically, the decorator overrides a few of the stream methods, but delegates most calls to the underlying stream unchanged. As a simple example, consider the following hypothetical filter for automatic line-ending conversion: \begin{lstlisting} class CRLFFilter(Stream): def __init__(self, base_stream): self.base = base_stream ... def flush(self): return self.base.flush() def close(self): return self.base.close() \end{lstlisting} Methods \lstinline!flush! and \lstinline!close! are examples of methods that simply delegate to the base stream. Although they are not the same, both follow the same pattern. While maintaining two similar, but not identical, methods is a relatively small burden, the actual \lstinline{streamio} package contains fifty such methods. As more decorators are implemented, maintaining these delegating methods becomes a significant burden. As an alternative to writing the methods by hand, we can automatically generate them using Python's \lstinline{exec} statement, which allows us to execute code from a string constant: \begin{lstlisting} def PassThrough(meth_name): code = """def %s(self): return self.base.%s()""" d = {} exec code % (meth_name, meth_name) in d return d[meth_name] class CRLFFilter(Stream): ... flush = PassThrough("flush") close = PassThrough("close") \end{lstlisting} In fact, the RPython \lstinline{streamio} package does something very similar, although the real implementation of \lstinline{PassThrough} is more complex due to the need to handle arguments and other small variations. This approach is possible because the calls to \lstinline{PassThrough}, and consequently to \lstinline{exec}, happen during the initialization phase, when full Python's semantics is allowed. Once translation begins, the new methods have been created, and the translation step treats them identically to those methods which were entered by hand. \subsubsection{Initialization of complex constants} RPython's meta-programming capabilites can also be very helpful for computing complex constants at intitialization time. For example, suppose that your program needs to frequently use the first \lstinline{N} \emph{Fibonacci's numbers}; the usual solution in C\# or Java is to pre-compute those numbers as soon as the program starts, usually inside the \emph{static constructor} of some class. This solution might be problematic if this computation takes a long time, because it would impact on the start-up time of the program. With RPython, you can simply do the computation during the initialization phase, and store the results into a constant: \begin{lstlisting} def fibo(N): sequence = [] a, b = 1, 1 for i in xrange(N): sequence.append(a) a, b = b, a+b return sequence # pre-compute the first 100 numbers fibo_numbers = fibo(100) \end{lstlisting} Note that, even if we called \lstinline{fibo} at init-time, nothing prevent us from calling it at run-time as well. \emph{The same function serves equally well for both meta-programming and programming}. \subsubsection{Extending the language through metaclasses} \emph{Metaclasses} are a powerful feature of Python which allows the programmer to extend the behavior of the language. A metaclass is the class of a class, and it can override the default behavior of class objects to customize what happens when new instances are created, and so on. Metaclasses operate at class-definition time, i.e. when the \lstinline{class} statement is executed. Because RPython does not allow classes to be defined after the initialization period, metaclasses run entirely in the Python interpreter and are automatically fully supported. Explaining the details of metaclasses in Python is beyond the scope of this article; for more informations, see \cite{MertzSimionato03} and \cite{VanRossum}. To showcase the almost infinite possibilities, we present an example using the \lstinline{extendabletype} metaclass. This metaclass is widely used in PyPy to allow programmers to \emph{extend} an already defined class with new methods. Suppose that we have a class class hierarchy of business objects. Each object can do two things: save itself into a database, and present itself in a GUI. The resulting class structure might look like: \begin{lstlisting} class Root: def save(self, db): ... # abstract def show(self, window): ... # abstract class MyFirstObject(Root): ... class MySecondObject(Root): ... \end{lstlisting} This design is not optimal, because it ties together three unrelated aspects: the business model, the storage and the presentation. Several approaches have been proposed to solve this problem, such as the \emph{visitor pattern} \cite{GoF95}, but most of them are, in fact, ways to workaround the inability to define classes incrementally. The \lstinline{extendabletype} metaclass solve this problem by letting the programmer to split the same class definition into different files, so that only related aspects are grouped together: \begin{lstlisting} # file model.py class Root: __metaclass__ = extendabletype ... class MyFirstObject(Root): ... class MySecondObject(Root): ... # file db.py class __extend__(Root): def save(self, db): ... # abstract class __extend__(MyFirstObject): ... class __extend__(MySecondObject): ... # file gui.py class __extend__(Root): def show(self, window): ... # abstract class __extend__(MyFirstObject): ... class __extend__(MySecondObject): ... \end{lstlisting} The \lstinline{extendabletype} metaclass allows most class definitions to complete normally, unless the class being defined has the name \lstinline{__extend__}. In that case, instead of creating a new class object, the metaclass adds the newly declared methods and fields to the already-existing class being extended (either \lstinline!Root!, \lstinline!MyFirstObject!, or \lstinline!MySecondObject!, in this example). The code makes heavy use of the introspective and dynamic features of Python, but since it runs during the initialization phase it is still usable in RPython. C\# 2.0's \emph{partial classes} \cite{ECMA-334} offer the same benefits, as well as a Java extension called \emph{MultiJava} \cite{multijava}; the key point of this example is that in RPython this behaviour is simply \emph{metaprogrammable}, while for C\# and Java the only way to do it is to extend the language or to run a preprocessor. \subsection{A complete example} This sections shows how RPython's features help to write a real (though small) program: a simple parser and interpreter for reverse polish notation. The program first parses the command line arguments to create an expression tree, and then evaluates the tree into an integer result and prints it. Although the example only includes addition and subtraction, it can be trivially extended to deal with more operators. The example uses a variety of meta-programming techniques to automatically generate the code for handling the various operators. Of course, since the example is so small, it would be easier just to inline the code that is required. However, this approach makes it trivial to add new operators, or to separate the operator definitions from the parser, and thus is important as the language to be parsed becomes more complicated. \begin{small} \begin{lstlisting} # AST classes class Expr: def eval(self): raise NotImplementedError class Number(Expr): def __init__(self,n): self.n = n def eval(self): return self.n class BinaryExpr(Expr): def __init__(self,l,r): self.l = l self.r = r # operators: the docstrings contain the # symbol associated to each operator class Op_Add(BinaryExpr): '+' class Op_Sub(BinaryExpr): '-' # entry-point def main(argv): try: e = parse(argv[1:]) print e.eval() return 0 except SyntaxError: print 'Invalid expression' return 1 # parse a list of string into an AST def parse(lst): stack = [] for token in lst: try: node = Number(int(token)) except ValueError: # find opcode class op = OPCODES.get(token, None) if op is None: raise SyntaxError y, x = stack.pop(), stack.pop() node = op(x, y) # instantiate the class stack.append(node) if len(stack) != 1: raise SyntaxError return stack[0] # INIT-TIME only: automatically build the # table of opcodes by inspecting all the # declared classes and their docstrings # and add eval methods to Op_* classes # generate the appropriate eval method def gen_eval(ch): code = """ def eval(self): return self.l.eval() %s self.r.eval() """ d = {} exec code.strip() % (ch) in d return d['eval'] OPCODES = {} def build_opcodes(): 'NOT_RPYTHON' for name, value in globals().items(): if name.startswith('Op_'): value.eval = \ gen_eval(value.__doc__) OPCODES[value.__doc__] = value build_opcodes() \end{lstlisting} \end{small} The core of program is the \lstinline!parse! function. It takes a list of tokens as input and builds an abstract syntax tree representing the expression. To locate the appropriate class it relies on a precomputed dictionary, \lstinline{OPCODES}, which maps each operator token (\lstinline{"+"} or \lstinline{"-"}) to its corresponding class. This exploits the fact that RPython classes are first-order values (see section \ref{typesystem}). The program relies on an initialization step performed by the function \lstinline!build_opcodes!. This function does two things: first, it adds an automatically-generated \lstinline{eval} method to each binary operator class, and adds the class to the \lstinline{OPCODES} dictionary. Both steps rely on the convention that the \emph{docstring}\footnote{\emph{docstrings} are the Python way to attach the documentation to functions and classes. Compared to other approaches such as \emph{JavaDoc}, the main difference is that they are also available at run-time, and they can be accessed via the \lstinline{__doc__} attribute.} of each operator class contains its corresponding symbol. The \lstinline{eval} method is generated via a call to the helper routine \lstinline{gen_eval}, which creates a string of Python code that performs the desired computation and uses \lstinline{exec} to create compile this string into a Python method. Adding the class object to the \lstinline{OPCODES} dictionary is done by simple assignment, using the class docstring as the key and the class object itself as the value. Note the \lstinline{'NOT_RPYTHON'} flag that is present on the \lstinline!build_opcodes! function: this serves both as documentation, indicating that a method is restricted to preprocessing time, and to allow the tool chain to emit a warning should the function be used afterwards.