=========================================================
       Python 2.2: an Attribute-Oriented Language
=========================================================

:Abstract:

    This is an overview of the original attribute-centric,
    descriptor-based object model first introduced in Python 2.2.
    We will see in particular how it compares to the more classical
    OOP notions of sending and responding to messages.


Introduction
============

The history of the Python programming language is one of apparent
stability with gradual underlying semantic shifts.  The most important
recent change was introduced in Python 2.2, and although it is several
years old now it is only starting to be documented.  It is possibly the
last and deepest of a series of gradual changes in the way the syntactic
operators and constructs (which are all imperative) map to object
behavior (which is fully object-oriented).

Before 2.2, Python was usually defined as a series of standard built-in
types, and the way various combinations of them react to various
syntactic operations.  This description was based on a balance between
purity and practicality, with rules and exceptions.  The
`Python Language Reference`_ is still written in this way, which is
accurate enough for most purposes.

By contrast, in theory -- backward compatibility and performance tricks
set aside -- Python 2.2 could be defined differently.  This is not to say
that the Reference_ is not accurate any more, but simply that the change
occurred at a deeper level than what is usually seen by programmers.  It
is, in effect, introducing a full classical Smalltalk-ish OOP model
"under the hood", and then mostly hiding it between the syntax and the
built-in types.

This document attempts to give an overview of this model, and how the
syntax and the built-in types are built on top of it.  It is not a
replacement for the Reference_, at most possibly an appendix.
Its intended audience is both people very familiar with Python who want
to learn more about its inner workings, and people with a Smalltalk-OO
background, who I guess (blame me!) will then consider Python as just a
library of syntaxes and classes instead of an original language.


Object model
============

The object model described below is one of several possible points of
view, which may or may not be the most relevant one.  I have tried to
make an explicit comparison between the traditional message-based
Object-Oriented Programming (OOP) model and Python's own attribute-based
one.


Objects
-------

As usual in OOP, Python programs manipulate exclusively *objects*:
independent entities that can reference each other or be referenced from
variables, and whose behavior is determined by their *class.*  An object
is said to be an *instance* of its class.

Every object intrinsically has a class, which can be read using the
``type()`` built-in function, and which can also be changed, to some
extent.  All classes are themselves instances of the predefined class
``type`` (or a subclass of it, as seen below); only such instances of
``type`` can be used as the class of an object.


Classes
-------

The classes are themselves objects, which have a class of their own, and
so on.  A class is called a *metaclass* if it is the class of an object
which is itself a class.  Classes are usually called *types* or
*new-style classes* in Python, to avoid confusion with a previous,
different notion now known as *old-style classes.*  The "metaclass
tower" obtained by repeatedly taking the class always ends up at
``type``, whose class is ``type`` itself.

The class of an object controls the basic behavior of the object, as
well as how much extra data the object holds.  For classes themselves,
``type`` ensures that all classes have at least a name, a list of base
classes, and a dictionary, which is a name-attribute mapping.
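
The relationships described above can be checked directly in the
interpreter.  The following is only a sketch: the names ``Meta``,
``Shape``, ``Circle``, ``Square`` and ``Tagged`` are invented for the
example, and it is written so that it also runs on current Python
versions, where every class is new-style::

    class Meta(type):
        """A metaclass: its instances are themselves classes."""

    class Shape(object):
        pass

    class Circle(Shape):
        sides = 0                # stored in the class dictionary

    class Square(Shape):
        sides = 4

    # Calling a metaclass with a name, the bases and a dictionary
    # creates a new class whose class is that metaclass.
    Tagged = Meta("Tagged", (Shape,), {})

    c = Circle()
    assert type(c) is Circle     # the class of an object
    assert type(Circle) is type  # classes are instances of type...
    assert type(Tagged) is Meta  # ...or of a subclass of type
    assert type(Meta) is type    # the metaclass tower ends at type,
    assert type(type) is type    # whose class is type itself

    # The three things type guarantees for every class.
    assert Circle.__name__ == "Circle"
    assert Circle.__bases__ == (Shape,)
    assert Circle.__dict__["sides"] == 0

    # The class of an object can be changed, to some extent.
    c.__class__ = Square
    assert type(c) is Square and c.sides == 4
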

Attribute lookup
----------------

The *base classes* of a class are used to implement (possibly multiple)
inheritance: a class logically inherits all the attributes of its base
classes, minus the ones it overrides explicitly in its own dictionary.
Inheritance is not implemented by copying over the base classes'
attributes, but by systematically looking in all parent classes, to
ensure that late changes in a base class are visible in all subclasses.

The base classes, and the bases' bases, and so on, are collectively
called the *parent* classes of a class.  These parents are ordered at
class creation time into something misleadingly called the *method
resolution order,* starting from the class itself and generally
proceeding in a subclass-first, base-next order.  (In the presence of
multiple inheritance the precise order is defined by an algorithm called
C3_.)  All classes have the same ultimate root class, called ``object``.

When an attribute must be looked up by name on a class, the search is
conducted in the method resolution order until a class is found that has
an entry for the attribute name in its dictionary.  This attribute
lookup algorithm is in some sense the most primitive operation of Python
since 2.2, although it is not directly available to user programs.


Messages
--------

Python's object model is unusual in that it has, strictly speaking, only
a predefined and non-extensible set of possible *messages.*  The only
way to send a message is implicitly, by using a corresponding syntactic
construct.  For example, the expression ``a+b`` will send a message
called ``__add__`` to the object bound to the variable ``a``, and if it
fails, send a message called ``__radd__`` (reverse add) to the object
bound to the variable ``b``.  In the rich syntax of Python, each
construct corresponds to one or possibly several messages sent to the
operands.
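
The ``a+b`` example can be made concrete with a pair of toy classes.
This is only a sketch: the classes ``Metres`` and ``Feet`` are
hypothetical and not part of the original text, and ``NotImplemented``
is the standard value a special method returns to signal that it
declines the operation::

    class Metres(object):
        def __init__(self, value):
            self.value = value

        def __add__(self, other):
            # Handles only Metres + Metres; declines anything else.
            if isinstance(other, Metres):
                return Metres(self.value + other.value)
            return NotImplemented


    class Feet(object):
        def __init__(self, value):
            self.value = value

        def __radd__(self, other):
            # Receives the reverse message when the left operand declined.
            if isinstance(other, Metres):
                return Metres(other.value + self.value * 0.3048)
            return NotImplemented


    total = Metres(1.0) + Metres(2.0)   # answered by Metres.__add__
    mixed = Metres(1.0) + Feet(10.0)    # __add__ declines, __radd__ answers

    try:
        Metres(1.0) + 5                 # both messages fail
    except TypeError:
        both_failed = True
    else:
        both_failed = False
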
Messages are what Python refers to as *special method names.*

Note that normal method calls are *not* implemented as a single message
send.  As has always been the case in Python, what appears to be a
method call::

    result = obj.method(arg1, arg2, arg3)

is actually a combination of two imperative statements::

    the_method_object = obj.method
    result = the_method_object(arg1, arg2, arg3)

first getting the attribute called ``method`` out of the specified
object, which returns a method object bound to ``obj``; and then calling
this method object.  The first operation is implemented by sending the
``__getattribute__`` message to ``obj``; the second one by sending the
``__call__`` message to ``the_method_object``.

The way an object responds to messages is defined by its class.  The
rule is that the message name is looked up in the class -- this is an
attribute lookup as described above.  The rest of the process to
actually perform a call is, by design, an emulation of what occurs for
normal method calls.  This allows normal method calls and special
message sends to appear similar, although they are not.  More about it
below.

A visible difference between method calls and message sends is that
traditionally, Python allows arbitrary overriding of methods in
individual instances, simply by storing a new entry in the instance's
dictionary.  After such an override, the call::

    result = obj.method(arg1, arg2, arg3)

might find the attribute called ``method`` in the object itself, and not
on its class.  However, this is only a (desired) artefact of the
implementation of the ``__getattribute__`` message.  It does not apply
to messages themselves: if a syntactic construct is implemented by
sending some message ``__xyz__`` to an object ``obj``, it will always
look up the name ``__xyz__`` in the class of ``obj`` only, not in
``obj``.  This is essential to avoid "metaconfusion", i.e. confusion
between messages relevant to an object (and handled by its class) and
messages relevant to the class itself (and handled by its metaclass).
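
This difference can be observed directly.  In the following sketch (the
class ``Box`` and its attribute names are invented for the illustration,
and a new-style class is assumed), an entry stored in the instance
dictionary overrides a normal method but is ignored by the ``len()``
construct, which sends the ``__len__`` message::

    class Box(object):
        def size(self):
            return "size from the class"

        def __len__(self):
            return 3


    box = Box()

    # Store overriding entries in the instance dictionary.
    box.size = lambda: "size from the instance"
    box.__len__ = lambda: 99

    via_method = box.size()         # found in the instance dictionary
    via_message = len(box)          # looks up __len__ in type(box) only
    via_attribute = box.__len__()   # a plain attribute read sees the override
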

Descriptors
-----------

In Python, the ubiquitous expression ``obj.name``, if it is not the
target of an assignment, is evaluated by first evaluating ``obj`` (but
not ``name``!); then ``name`` is searched both in the object that
``obj`` evaluated to, and (as an attribute lookup) on the class of
``obj``.

1. If ``name`` exists only in ``obj``, the associated value is returned.

2. If ``name`` exists only on the class (or in a parent class), then the
   returned value is the result of sending the ``__get__`` message to
   the object found in the class.  The ``obj`` is passed as argument to
   ``__get__``.

3. A priority rule, spelled out later, is used if ``name`` is found on
   both the object and the class.

The intention of calling ``__get__`` in rule 2 is so that a class can
provide a global "object template" that applies to all its instances,
but still control for each instance which exact object is returned.
Such an object is called a *descriptor*.

The most common descriptors are *function objects:* function objects
respond to a ``__get__`` message by building and returning a new *method
object* which contains a reference to both the original function object
and the ``obj`` above.  When such a method object is called, it inserts
the ``obj`` in front of the argument list it got and forwards the call
to the original function object.  The first argument of that function is
traditionally called ``self``, and plays the role of the ``this``
argument of other OOP languages.

If the object found in rule 2 is not actually a descriptor, i.e. has no
``__get__`` (which is the case for most other built-in types), the
object found is returned unmodified.  This allows class attributes of
common types to be directly stored in the class dictionary.  Reading
them from an instance returns the same object, irrespective of the
instance.  Note that this behavior is actually overridable (see
`Descriptors and attributes`_ below).
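
Rule 2 can be reproduced by hand.  The sketch below assumes a
hypothetical class ``Account``; it calls ``__get__`` explicitly on the
function stored in the class dictionary.  The attribute names
``__self__`` and ``__func__`` are those of current Python versions
(Python 2.2 spelled them ``im_self`` and ``im_func``)::

    class Account(object):
        currency = "EUR"                # a plain string: not a descriptor

        def describe(self):
            return "account in " + self.currency


    acct = Account()

    function = Account.__dict__["describe"]    # the raw function object
    method = function.__get__(acct, Account)   # what acct.describe does

    # The method object references both the function and the instance.
    same_target = method.__self__ is acct and method.__func__ is function

    # Calling it inserts acct in front of the (empty) argument list.
    result = method()

    # A string has no __get__, so reading it from an instance returns
    # the very object stored in the class dictionary.
    plain = acct.currency is Account.__dict__["currency"]
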

How messages are sent
---------------------

We can now detail what occurs when a message is sent to an object
``obj``.  As we have seen above, this is done by emulating what a
regular method call would have done, with the exception that the
dictionary of ``obj`` itself is ignored:

1. We perform an attribute lookup for the name of the message in the
   class of ``obj``.

2. We send a ``__get__`` message to the object found (which is usually a
   descriptor).

3. We send a ``__call__`` message to the resulting object.

Note that this definition of sending a message is recursive: it requires
two new messages to be sent.  This is resolved by special-casing some
objects that occur very commonly as the descriptors found in step 1.
If, say, the descriptor is a Python function, then steps 2 and 3 can be
carried out without actually sending new messages but by directly
invoking the internal function call machinery with ``obj`` as an extra
first argument.


Syntactic constructs and built-in types
=======================================

The object model described above is ultimately the "glue" between
syntactic elements and the semantics they induce on objects,
particularly objects of built-in types.  Each syntactic element of a
Python program is turned into one or (more often) several message sends.
The built-in objects respond to these messages (or refrain from
responding) in a carefully synchronized way.

The precise rules for both which messages are sent and how they are
responded to are loaded with particular cases for usefulness and/or
backward compatibility.  We will describe only a few examples here.  The
`Python Language Reference`_ is the ultimate reference; we only intend
to give an idea of how we can express the same rules in terms of sending
messages.

Note that this is more than a theoretical exercise: although it is not
possible to send messages explicitly, all the message-responding methods
are visible in the built-in types.
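
To make the three steps of `How messages are sent`_ more tangible, they
can be approximated in ordinary Python.  This is a rough sketch, not
what the interpreter executes: ``send_message`` is a hypothetical
helper, steps 2 and 3 are carried out with ordinary calls (playing the
role of the special-casing mentioned in that section), and ``Silent`` is
a toy class invented for the demonstration::

    def send_message(obj, name, *args):
        cls = type(obj)
        # Step 1: attribute lookup in the class, following the method
        # resolution order; the dictionary of obj is never consulted.
        for klass in cls.__mro__:
            if name in klass.__dict__:
                found = klass.__dict__[name]
                break
        else:
            raise TypeError("%s does not respond to %s"
                            % (cls.__name__, name))
        # Step 2: let the object found act as a descriptor, if it is one.
        getter = getattr(type(found), "__get__", None)
        if getter is not None:
            found = getter(found, obj, cls)
        # Step 3: call the resulting object.
        return found(*args)


    class Silent(object):
        def __len__(self):
            return 0


    s = Silent()
    s.__len__ = lambda: 42      # instance entry, ignored by messages

    length = send_message(s, "__len__")      # agrees with len(s)
    total = send_message(3, "__add__", 4)    # agrees with 3 + 4
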
For example, ``object.__getattribute__`` is the method responding to the
``__getattribute__`` message for default objects.  It is thus possible
to emulate the behavior of message sends by calling specific methods
explicitly, although this requires care about the subtle difference
between regular method calls and message sends (see
`How messages are sent`_).

So here are a few relevant examples...


Binary operators
----------------

Binary operators are mostly two-argument numeric operations like
addition.  They are implemented by sending a message (e.g. ``__add__``)
to the left operand, passing the right operand as argument.  The message
is allowed to fail (either by not being handled at all, or by returning
the special value ``NotImplemented``).  If it fails, then another
message (e.g. ``__radd__``) is sent to the right operand with the left
operand as argument.  If both messages fail, a ``TypeError`` exception
is raised.

Additional rules can influence the order in which the direct and the
reverse message (e.g. ``__add__`` and ``__radd__``, respectively) are
sent.  If the class of the right operand is a subclass of the class of
the left operand, and if it explicitly overrides the reverse method,
then the reverse message is tried first (this allows a subclass to
override the behavior of ``a+b`` even if ``a`` is an instance of the
parent class and ``b`` an instance of the subclass).  If both the direct
and the reverse methods are found in the same class, the reverse one is
not attempted at all.


Descriptors and attributes
--------------------------

On the sending side of attribute access:

* Reading an attribute, with the expression ``obj.name``, causes the
  ``__getattribute__`` message to be sent; if it raises
  ``AttributeError``, then the ``__getattr__`` message is tried.

* Writing to an attribute (``obj.name = value``) is done by sending a
  ``__setattr__`` message.

* Deleting an attribute (``del obj.name``) is done by sending a
  ``__delattr__`` message.
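
The three sending rules above can be watched in action with a class that
records which messages it receives.  This is a sketch with a
hypothetical class ``Recorder``; it leaves ``__getattribute__`` alone
and delegates the actual storage to the default ``object`` methods::

    class Recorder(object):
        def __init__(self):
            # Bypass our own __setattr__ while creating the log itself.
            object.__setattr__(self, "log", [])

        def __getattr__(self, name):
            # Only reached after __getattribute__ raised AttributeError.
            self.log.append(("__getattr__", name))
            return None

        def __setattr__(self, name, value):
            self.log.append(("__setattr__", name))
            object.__setattr__(self, name, value)

        def __delattr__(self, name):
            self.log.append(("__delattr__", name))
            object.__delattr__(self, name)


    r = Recorder()
    r.x = 1          # sends __setattr__
    found = r.x      # __getattribute__ succeeds; __getattr__ is not tried
    missing = r.y    # __getattribute__ fails; __getattr__ is tried
    del r.x          # sends __delattr__
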

All objects respond to ``__getattribute__``, ``__setattr__`` and
``__delattr__`` in a standard way inherited from the base class
``object``, unless explicitly overridden.  The extra message
``__getattr__`` allows user classes to override only reading from
otherwise undefined attributes.

In addition to these messages, there are three "descriptor messages"
which are among the most mysterious ones because there is no syntax that
invokes them directly: the ``__get__``, ``__set__`` and ``__delete__``
messages.  They are only called from the ``object.__getattribute__``,
``object.__setattr__`` and ``object.__delattr__`` methods.

These three messages are only sent to objects that play the role of
Descriptors_, i.e. objects found in the dictionary of a class.  They are
nevertheless sent quite often during the execution of a program.
Indeed, here is what reading attribute ``obj.attrname`` might trigger,
if ``obj`` is of class ``C``::

                                       ,----------------------------.
                                       |          class D           |
                                       |  {'__get__': getterfunc}   |
      ,----------------------.     `----------------------------'
      |       class C        |                    ^
      |   {'attrname': d}    |                    | class
      `----------------|-----'                    |
            ^          |              ,-----------------------.
            | class    `------------> |  descriptor object d  |
            |                         `-----------------------'
        ,-------.
        |  obj  |
        `-------'

The expression ``obj.attrname`` will locate the descriptor object ``d``
in the class of ``obj``, and send it a ``__get__`` message; as usual,
the ``__get__`` message itself is handled by the class of ``d``.  The
purpose of this indirection is to allow a generic object ``d`` to be
stored on the class and control reads and writes to the attribute
``attrname`` in a per-instance way.

We have seen ``__get__`` in the paragraph about Descriptors_ already;
``__set__`` and ``__delete__`` play a similar role for respectively
setting and deleting an attribute from an instance.  To complete the
overview given in Descriptors_, let us see how the default ``object``
methods behave.  First ``object.__getattribute__``:

1. look up the attribute name in the class;

2. if found and it is a *data descriptor,* send it the ``__get__``
   message and return the result;

3. search for the attribute name in the object's own dictionary, if any;

4. if found, return it unmodified;

5. if something that is not a data descriptor was found in the class,
   now send the ``__get__`` message to it and return the result;

6. otherwise, the attribute was not found anywhere, and
   ``AttributeError`` is raised.

Descriptors are divided into two categories:

* *non-data descriptors* are the ones that only want to control which
  object is returned when reading a name from an instance; they still
  allow being shadowed by a value being stored directly in the object's
  dictionary under the same name.  This is the behavior of function
  objects (methods can be shadowed in instances).

* *data descriptors* want to control both reading from and writing to a
  specific attribute name.  When such a descriptor is present in a
  class, the instance usually cannot store an entry with the same name
  in its dictionary (unless the dictionary is modified directly with
  ``obj.__dict__['name'] = value``).  Properties are an example of data
  descriptors (see property_ in the library reference).

Formally, a data descriptor is an object which responds to the
``__set__`` message.  This is the message sent by
``object.__setattr__``, which works as follows:

1. look up the attribute name in the class;

2. if found and it is a *data descriptor,* send it the ``__set__``
   message;

3. otherwise, write the new value in the object's own dictionary.

Similarly, ``object.__delattr__``:

1. look up the attribute name in the class;

2. if found and it is a *data descriptor,* send it the ``__delete__``
   message;

3. otherwise, just delete the value from the object's own dictionary.


Iteration
---------

``for`` loops::

    for i in sequence:
        ...

are translated into a use of the so-called iterator protocol.
The ``__iter__`` message should return an *iterator object;* the latter
produces the actual elements one by one every time it is sent a ``next``
message.  (This is the only occurrence in Python of a message name not
surrounded by underscores.)  More precisely, the above ``for`` loop does
the following:

1. send the ``__iter__`` message to ``sequence``, and remember the
   result (called *the iterator* below);

2. send the ``next`` message to the iterator; if it raises a
   ``StopIteration`` exception, discard the exception and interrupt the
   loop;

3. assign the result of the ``next`` message to ``i`` and execute the
   loop body;

4. loop back to step 2.

A lot of built-in functions also send the ``__iter__`` and ``next``
messages themselves; for example, ``list(seq)``, which creates a list
out of the elements of the given sequence, uses these messages to
enumerate all the elements.

Step 1 above is actually more involved for backward compatibility: if
``sequence`` doesn't respond to ``__iter__`` but does respond to
``__getitem__``, then a *sequence iterator* object is created, which
responds to ``next`` messages by reading the ``n``-th item out of the
sequence (with a ``__getitem__`` message) and advancing the counter
``n``.  This kind of fall-back, when messages are not responded to, is
quite typical.


Conclusion
==========

Python is by no means a simple language, though it appears to be at
first glance.  The move of the object model in version 2.2 is both a
simplification at the deepest levels and an extra complication at levels
closer to the day-to-day programmer, particularly when dealing with the
host of implementation-specific, underdocumented extensions and
limitations of this nice theoretical model: slots, limitations in
subclassing built-in types, rules for changing the type of objects, and
so on.  (This draft or another one should at some point be extended to
cover some of these aspects.)
It is useful to keep in mind the model presented in this paper for a
deeper understanding of the clockworks of Python.

PyPy_, a modern (i.e. post-2.2) reimplementation of Python, gains a lot
by approaching it along the lines presented above.  PyPy uses the
terminology *space operation* instead of *message;* it uses Python
itself as the host language, and regular host-level method calls to
implement application-level messages.


.. _Reference:
.. _`Python Language Reference`: http://docs.python.org/ref/ref.html
.. _C3: http://www.python.org/2.3/mro.html
.. _property: http://docs.python.org/lib/built-in-funcs.html
.. _PyPy: http://codespeak.net/pypy/