\documentclass{amsart} \usepackage{amssymb} \newtheorem{definition}{Definition} \newtheorem{theorem}{Theorem} \newcommand{\defining}[1]{\emph{#1}} \newcommand{\lc}{\trianglelefteq} \newcommand{\sst}{\prec} \newcommand{\st}{\curlyeqprec} \newcommand{\meet}{\mathop{\vartriangle}} \newcommand{\join}{\mathop{\triangledown}} \newcommand{\Meet}{\mathop{\bigtriangleup}} \newcommand{\JOIN}{\mathop{\bigtriangledown}} \DeclareMathOperator{\bases}{bases} \DeclareMathOperator{\mro}{mro} \DeclareMathOperator{\solid}{solid} \title{Memory Layouts and Python Subtyping} \author{Michael Hudson} %\authoremail{mwh@python.net} \begin{document} \maketitle \section{Introduction} This note is an attempt to document some of the assumptions and principles operating in the file \texttt{Objects/typeobject.c} of the Python source pertaining to the memory layout of instances of ``new-style classes'' and prove that these principles, correctly implemented, do not lead to unsafe memory accesses being performed by the Python virtual machine. Let \(T\) be the set of (possible) Python types. Denote by \(T^*\) the set of finite sequences of elements of \(T\). We have functions: \begin{align*} \bases & \colon T \to T^* \\ \mro & \colon T \to T^* \end{align*} which associate each type with its immediate base classes and its method resolution order. \section{Some Relations} We define some relations on \(T\): \begin{definition} For \(t_1, t_2 \in T\) we say \(t_1\) is \defining{layout compatible} with \(t_2\), written \(t_1 \lc t_2\) if \(t_1\)'s description of the memory layout of its instances is safe for use with instances of \(t_2\). \end{definition} This definition is slightly ambiguous. It could be argued that \texttt{A} and \texttt{B} in \begin{verbatim} class A(object): __slots__ = ['a'] class B(object): __slots__ = ['b'] \end{verbatim} are layout compatible; although This is clearly reflexive, and if it's not transitive we're in serious trouble. It isn't a partial order as it's not anti-symmetric, but: \begin{definition} For \(t_1, t_2 \in T\) we say \(t_1\) and \(t_2\) are \defining{layout equivalent}, written \(t_1 \sim t_2\) if \(t_1 \lc t_2\) and \(t_2 \lc t_1\). \end{definition} \(\lc\) defines a partial order on \(\sim\)-equivalence classes. There exists a zero for \(\lc\), the ``null object layout'' represented by the Python type \texttt{object}. More familiarly, we have: \begin{definition} For \(t_1, t_2\) in \(T\) we say \(t_1\) is an \defining{immediate supertype} of \(t_2\) if \(t_1\) is one of the elements of \(\bases(t_2)\). \end{definition} \begin{definition} We define the relation \defining{strict supertype}, written \(t_1 \sst t_2\) as the transitive closure of immediate supertype. \end{definition} \begin{definition} We define the relation \defining{supertype}, written \(t_1 \st t_2\) as the reflexive transitive closure of immediate supertype. \end{definition} Note, however that these are not the definitions currently used by Python! Python uses as a definition \(t_1 \st^p t_2\) iff \(t_1\) is one of the elements of \(\mro(t_2)\). These definitions are equivalent if all MRO's are computed by Python's default algorithm. In a sense, the goal of this note is to propose a set of rules to be enforced at type creation and during type mutation that ensure that \(\st^p\) refines \(\lc\); essentially this means that a call to \texttt{PyObject\textunderscore{}IsInstance} can be safely followed by a C-pointer typecast. \section{Subclass Creation} Given types \(t_1\) and \(t_2\) the \defining{meet} \(t_1 \meet t_2\) is always defined, up to \(\sim\), but the \defining{join} \(t_1 \join t_2\) may not be -- in fact it's only defined when the \(t_i\) are \(\lc\)-related in some order and in that case is the \(\lc\)-greater of the \(t_i\). Note that this means \(t_1 \join t_2\) can be defined precisely, not merely up to \(\sim\). \begin{definition} A (finite) set \(\{t_i\}\) of types is \defining{acceptable for subclassing} if the set has a \(\lc\)-maximal element. \end{definition} A subclass of an acceptable \(\{t_i\}\) will be \(\lc\)-greater than this maximal element. Python allows user code to define the MRO of a the subclass. As Python uses the MRO to compute \(\st^p\), a user supplied MRO must be checked to ensure that it does not allow unsafe operations. \begin{definition} A sequence \(t_1, \ldots, t_n\) is an \defining{acceptable MRO for \(t\)} if \(t_i \lc t\) for all \(i=1,\ldots,n\). \end{definition} \begin{theorem} If all MRO's in the system are acceptable, \(\st^p\) refines \(\lc\). \end{theorem} \begin{proof} Clear. \end{proof} It is obviously important that Python's default MRO computation does not violate this assumption. This follows inductively from: \begin{theorem} If a set of types \(\{t_i\}\) is acceptable for subclassing and each \(t_i\) has an acceptable MRO, then the default MRO computation for Python will produce an acceptable MRO for the new subclass. \end{theorem} \begin{proof} Let \((t_{i1}, t_{i2}, \ldots, t_{in_i}) = \mro(t_i)\), so we have \(t_{ij} \lc t_i\) (each \(t_i\) has an acceptable MRO). Let \(t\) be the join of \(t_i\), so \(t_i \lc t\) (\(\{t_i\}\) is acceptable for subclassing). Let the new subclass of \(\{t_i\}\) be \(u\) (so \(t \lc u\)). Let the output of the default MRO computation be \(v_1,\ldots,v_k\). Now the default MRO computation produces a sequence which contains only types already contained in the MRO of one of the \(t_i\) or the new subclass itself \(u\). If \(v_l = t_{ij}\) for some \(i\) and \(j\), then \(v_l = t_{ij} \lc t_i \lc t \lc u\). Trivially, if \(v_l = u\), \(v_l = u \lc u\). So \(v_1,\ldots,v_k\) is an acceptable MRO for \(u\). \end{proof} \section{An Algorithm For Layout Compatibility} You'll want the moon on a stick next! It is a consequence of the above that for any type \(t\in T\) other than \texttt{object}, \(\JOIN\bases(t)\) exists (this is stored as the \texttt{tb\textunderscore{}base} member in Python). We call \(t\) an \defining{extending type} if \(t\) is \texttt{object} or is not layout equivalent to \(\JOIN\bases(t)\). We define the \defining{solid base} of \(t\) via the following recursion: \begin{equation} \solid(t) = \begin{cases} t & \text{if \(t\) is extending} \\ \solid(\JOIN\bases(t)) & \text{otherwise} \end{cases} \end{equation} The essential point is that \(\solid\) chooses a canonical element of a types \(\sim\)-equivalence class. \begin{theorem} For \(t_1, t_2\in T\), \(t_1 \sim t_2\) if and only if \(\solid(t_1) = \solid(t_2)\). \end{theorem} \begin{proof} Clear. \end{proof} Given a type \(t\), we can define the \defining{solid chain} \(s_1,\ldots,s_n\) of \(t\) by \(s_1 = \solid(t)\), \(s_{i+1} = \solid(\JOIN\bases(s_i))\) and stopping when you get to \texttt{object}. We can compute \(\lc\) in terms of this: \begin{theorem} For \(t_1, t_2\in T\), \(t_1 \lc t_2\) if and only if \(\solid(t_1)\) is an element of the solid chain of \(t_2\). \end{theorem} \begin{proof} Let \(s_1,\ldots,s_n\) be the solid chain of \(t_2\). ``If'' follows from the simple facts: \begin{align*} t & \sim \solid(t) \forall t\in T\\ \JOIN\bases(t) & \lc t \forall t\in T \end{align*} ``Only if''? A bit hard! \end{proof} \section{What Should Be Allowed} At various times Python must decide if certain operations are to be allowed. \subsection{Type Creation} If user code attempts to subclass a sequence of types \((t_1,\ldots,t_n)\), it is necessary that \((t_1,\ldots,t_n)\) So, does this formalism match up with what the code does? Potential problems: \begin{itemize} \item is layout compatibility really transitive? \item does the code compute layout compatibility correctly? \end{itemize} I'd be \textit{reasonably} -- but not completely -- confident of both if we ignore variable length types. I don't know whether it would be better to try to prove the code we have now computes what it thinks it does, or to rewrite it using language closer to what I have used in this note. \end{document}