Does PyPy translate itself?

Alex picture Alex · Dec 9, 2011 · Viewed 11.9k times · Source

Am I getting this straight? Does the PyPy interpreter actually interpret itself and then translate itself?

So here's my current understanding:

  • RPython's toolchain involves partially executing the program to be translated to get a sort of preprocessed version to annotate and translate.
  • The PyPy interpreter, running on top of CPython, executes to partially interpret itself, at which point it hands control off to its RPython half, which performs the translation?

If this is true, then this is one of the most mind-bending things I have ever seen.

Answer

Ben picture Ben · Dec 20, 2011

PyPy's translation process is actually much less conceptually recursive than it sounds.

Really all it is is a Python program that processes Python function/class/other objects (not Python source code) and outputs C code. But of course it doesn't process just any Python objects; it can only handle particular forms, which are what you get if you write your to-be-translated code in RPython.

Since the translation toolchain is a Python program, you can run it on top of any Python interpreter, which obviously includes PyPy's python interpreter. So that's nothing special.

Since it translates RPython objects, you can use it to translate PyPy's python interpreter, which is written in RPython.

But you can't run it on the translation framework itself, which is not RPython. Only PyPy's python interpreter itself is RPython.

Things only get interesting because RPython code is also Python code (but not the reverse), and because RPython doesn't ever "really exist" in source files, but only in memory inside a working Python process that necessarily includes other non-RPython code (there are no "pure-RPython" imports or function definitions, for example, because the translator operates on functions that have already been defined and imported).

Remember that the translation toolchain operates on in-memory Python code objects. Python's execution model means that these don't exist before some Python code has been running. You can imagine that starting the translation process looks a bit like this, if you highly simplify it:

from my_interpreter import main
from pypy import translate

translate(main)

As we all know, just importing main is going to run lots of Python code, including all the other modules my_interpreter imports. But the translation process starts analysing the function object main; it never sees, and doesn't care about, whatever code was executed to come up with main.

One way to think of this is that "programming in RPython" means "writing a Python program which generates an RPython program and then feeds it to the translation process". That's relatively easy to understand and is kind of similar to how many other compilers work (e.g. one way to think of programming in C is that you are essentially writing a C pre-processor program that generates a C program, which is then fed to the C compiler).

Things only get confusing in the PyPy case because all 3 components (the Python program which generates the RPython program, the RPython program, and the translation process) are loaded into the same Python interpreter. This means it's quite possible to have functions that are RPython when called with some arguments and not when called with other arguments, to call helper functions from the translation framework as part of generating your RPython program, and lots of other weird things. So the situation gets rather blurry around the edges, and you can't necessarily divide your source lines cleanly into "RPython to be translated", "Python generating my RPython program" and "handing the RPython program over to the translation framework".


The PyPy interpreter, running on top of CPython, executes to partially interpret itself

What I think you're alluding to here is PyPy's use of the the flow object space during translation, to do abstract interpretation. Even this isn't as crazy and mind-bending as it seems at first. I'm much less informed about this part of PyPy, but as I understand it:

PyPy implements all of the operations of a Python interpreter by delegating them to an "object space", which contains an implementation of all the basic built in operations. But you can plug in different object spaces to get different effects, and so long as they implement the same "object space" interface the interpreter will still be able to "execute" Python code.

The RPython code objects that the PyPy translation toolchain processes is Python code that could be executed by an interpreter. So PyPy re-uses part of their Python interpreter as part of the translation tool-chain, by plugging in the flow object space. When "executing" code with this object space, the interpreter doesn't actually carry out the operations of the code, it instead produces flow graphs, which are analogous to the sorts of intermediate representation used by many other compilers; it's just a simple machine-manipulable representation of the code, to be further processed. This is how regular (R)Python code objects get turned into the input for the rest of the translation process.

Since the usual thing that is translated with the translation process is PyPy's Python interpreter, it indeed "interprets itself" with the flow object space. But all that really means is that you have a Python program that is processing Python functions, including the ones doing the processing. In itself it isn't any more mind-bending than applying a decorator to itself, or having a wrapper-class wrap an instance of itself (or wrap the class itself).


Um, that got a bit rambly. I hope it helps, anyway, and I hope I haven't said anything inaccurate; please correct me if I have.