bytecode: linearize hintannotated low-level exceptionstransformed graphs into a datastructure containing a bytecode string (the opcodes) and constants
opcodes:
- all basics operations in red and green variants
- additional operations for JIT housekeeping like split, merge, promote, calls?, return, box_green_var
special care needed (later) for implicit promotes (e.g. for exception paths and virtualizables)
red variables are boxes, already stored on the VirtualFrame
green variables are GenConsts, should be stored on the VirtualFrame as well
variables are valid for one block, renaming op (with variable number of args)
interpreter is manually written in a stackless style: jitstates have linked lists of frames anyway already
Update: this is work in progress in the jit-hotpath branch
A few notes about a refactoring that I was thinking about for after the rainbow interpreter works nicely. Let's use the Python interpreter as a motivating example.
The current approach tends to produce far too much machine code at run-time - many cold paths give machine code. So let's see what would be involved in producing as little machine code as possible (maybe that's the opposite extreme and some middle path would be better). While we're at it let's include the question of how to produce machine code for only the relevant parts of the user program.
We'd replace portals and global merge points with the following variant: two hints, "can_enter_jit" and "jit_merge_point", which are where the execution can go from interpreter to JITted and back. Very similar to the older "global_merge_point", the "jit_merge_point" is present at the beginning of the main interpreter loop; in this model it has the additional meaning of being where the JIT can be left in order to go back to regular interpretation.
The other hint, "can_enter_jit", is the place where some lightweight profiling occurs in order to know if we should enter the JIT. It's important to not execute one "can_enter_jit" for each opcode -- that's a too heavy slow-down for regularly interpreted code (but it would be correct too). A probably reasonable idea is to put it in the opcodes that close loops (JUMP_ABSOLUTE, CONTINUE). This would make the regular Python interpreter try to start JITting the Python-level loops that are often executed. (In time, the JIT should follow calls too, so that means that the functions called by loops also get JITted.)
The "can_enter_jit" is transformed into a call to a helper function, maybe_enter_jit(), with the following logic:
Note that to make things easier the JIT compilation really starts at the unique "jit_merge_point". So the "can_enter_jit" hints should all be put just before the "jit_merge_point", control-flow-wise -- i.e. "can_enter_jit" should be at the end of JUMP_ABSOLUTE and CONTINUE, so that they are immediately followed by the "jit_merge_point" which is at the start of the next iteration of the interpreter main loop.
The machine code makes the current Python frame progress, maybe to its end or not, but at least up to an opcode boundary (as explained later). To simplify things, in all cases the machine code raises an exception when it is done. The reasoning is that the current Python frame has progressed, so that the original caller of maybe_enter_jit() now contains out of sync local variables. Getting out with an exception gets rid of these. There are three kinds of exception that can be raised here:
The DoneWithThisFrame exception is raised to mean that the machine code completed the execution of this frame (it carries the return value computed by the machine code). The ContinueRunningNormally exception is raised when we want to switch back from machine code to regular non-JITted interpretation, which can only occur at a Python opcode boundary (this exception carries the new values needed to resume the regular interpreter, like the opcode position).
To catch and handle these two special exceptions, we need to transform the graph of the regular interpreter -- we split it and insert a small wrapper. Say the original interpreter is:
def my_interpreter(..):
stuff
while 1:
jit_merge_point(*live_vars)
more stuff
We (automatically) mutate it so that it becomes:
def my_interpreter(..):
stuff
return portal_runner(*live_vars)
def portal_runner(*args):
"""Small wrapper to handle the special JIT exceptions"""
while 1:
try:
return portal(*args)
except ContinueRunningNormally, e:
args = e.new_args
continue
except DoneWithThisFrame, e:
return e.result
def portal(*live_vars):
while 1:
more stuff
A few extra random notes:
PyPy contains some custom logic to virtualize the frame and the value stack; in this new model it should go somewhere related to "can_enter_jit".
The "can_enter_jit" hint becomes nothing in the rainbow interpreter's bytecode. Conversely, the "jit_merge_point" hint becomes nothing in the regular interpreter, but an important bytecode in the rainbow bytecode.
Now to the controversial part (if the above wasn't already). The idea is for the JIT to be as lazy as possible producing machine code. The simplest approach allows us to always maintain a single JITState, never a chained list of pending-to-be-compiled JITStates. (Note that this is not necessary; it's quite possible that it's better to combine approaches and compile things a bit more eagerly along several paths. I'm mostly decribing the other extreme here.)
The basic idea is to stop compiling early, and wait before execution actually followed one of the possible paths often enough before continuing. "Early" means at some red splits and all promotions. The picture is that the JIT should compile a single straight-code path corresponding to maybe half an opcode or a few opcodes, and then wait; then compile a bit more, and wait; and progress like this. In this model we get the nice effect that in a Python-level loop, we would end up compiling only the loop instead of the whole function that contains it: indeed, the "can_enter_jit" profiling only triggers on the start of the loop, and the stop-early logic means that the path that exits the loop is cold and will not be compiled.
We would identify two kinds of red splits: the ones that just correspond to "simple if-then-else" patterns; and the "complicated" ones. We can be more clever about simple if-then-else patterns, but for all other red splits, we would just stop emitting machine code. The JIT puts in the machine code a jump to a special "fallback rainbow interpreter". This interpreter is a variant that considers everything as green and just interprets everything normally. The idea is that when execution reaches the red split, in the middle of the rainbow bytecode of whatever function of the Python interpreter, we only want to produce more machine code for the hot path; so we have to do something to continue executing when we don't want to generate more code immediately.
The "something" in question, the fallback rainbow interpreter, is quite slow, but only runs until the end of the current opcode and can directly perform all nested calls instead of interpreting them. When it reaches the "jit_merge_point", it raises ContinueRunningNormally; as described in the Hints section this should go all the way back to the portal_runner() wrapper and cause the control flow to come back to the regular interpreter main loop, in portal(). The regular interpreter goes on interpreting at its normal speed from there.
All in all I guess that there is a chance that the fallback rainbow interpreter is not too much of an overhead. The important point is that whenever we use the fallback rainbow interpreter, we also update counters, and when enough executions have been seen, we compile the hot path (and only the hot path, unless we find out quickly that the other path is hot enough too). So after the compilation converges overall, the fallback rainbow interpreter is no longer executed except on the cold paths.
As noted above, we can (later) be clever about simple if-then-else patterns, and always immediately compile both branches. If we still want a single JITState, we need to make sure that it's a good idea to always merge the two states at the end of the two branches; a criteria could be that an if-then-else is "simple enough" if the branches contain no allocation (i.e. no potential new virtual stuff that could raise DontMerge in the current rvalue.py). This should be good enough to directly compile machine code like:
x = 5
if condition:
x += 1
do_more_stuff
Promotions are similar to red splits -- update counters and go to the fallback rainbow interpreter, and later resume compilation for the values that seem to be hot. For further improvements, this also makes it easy to decide, looking at the counters, that a site is "megamorphic", i.e. receives tons of different values with no clear winner. For this case we can really compile a megamorphic path where the promotion did not occur (i.e. the value stays as a red variable after all). The megamorphic path is "hot", in a sense, so compiling for it puts the fallback rainbow interpreter out of the hot path again.
About calls: non-residual calls would always just return a single JITState in this simplified model, so no need for the careful red/yellow call logic (at least for now). Residual calls, like now, would be followed by the equivalent of a promotion, checking if the residual call caused an exception or forced devirtualization (though we could immediately compile the expected common case, which is no exception and no forcing).
About local merge points: in this model of a single JITState, I vaguely suspect that it gives better results to have less local merge points, e.g. only at the beginning of local loops. To be experimented with. It might remove the need for the DontMerge exception and the need to maintain (and linearly scan through) more than one state per green key.
in the "jit_merge_point", so far we'd record one state snapshot for each opcode; instead, we can use the idea implemented in the flow object space of only recording the state at the beginning of an opcode that actually causes machine code to be produced (or, more practically, to throw away the latest recorded state if no new machine code was generated in the meantime).
maybe it's a bit of a mess but the fallback rainbow interpreter could also record profiling information about more than one red split or promotion -- all the ones it finds alongs its path.
I didn't think about the impact of this model on our compact Path objects. As step one we can always make complete state snapshot at each red split and promotion, and reintroduce the compact Paths as step two.
compiling of more code: we could tweak the flexswitch interface of the JIT backends. For example, instead of "please add a new path", it would make sense to have an API "please override the switch completely so that it has this new set of paths".
we also need a "can_enter_jit" at the end of the stack unroller corresponding to CONTINUE, for the case where the "continue" statement was in a try:finally:. This is not necessarily a problem, just a note that we have to allow this hint to be in some subfunction, potentially.
the previous pypy-c-jit used SomeInstance annotations with a special flag "access_directly" for frame objects at some point, with the goal of making the regular (non-JITing) interpretation access the frame directly without going through the overhead of checking if it was virtualized. I don't quite see how to keep this approach now. Let's try to think about a "safe" way to detect when the overhead can be removed.
Here is a vague plan (draft to be expanded): let's try to add a data flow pass that follows local variables and function call arguments only. An object can be virtualized only in two cases: if it is passed as an argument to a residual call by the JIT, or if it is read out of a heap data structure. We can easily record which functions are residually callable during the generation of the JIT bytecode, but not really before, so we need to remove the overhead in an extra backendopt-style pass.
Conversely, an object cannot be virtualized if we can see it going from its creation point to its usage point via local vars and args (without entering a residually callable function). It cannot be virtualized either if we can see it going from the 'jit_merge_point' hint to its usage point (without entering a residually callable function): indeed, if we cross 'jit_merge_point' but stay in the non-JITing path, then we are sure that all the local vars are non-virtualized.