Python code is first compiled to bytecode, and then interpreted by the python
virtual machine. Pytype follows this strategy, “interpreting” the bytecode with
a virtual machine (vm.py/VirtualMachine) which manipulates types rather than
values. This means that when analysing a file, pytype’s first step is to run the
python interpreter over it and compile it to bytecode. The bytecode is then
disassembled into a list of Opcodes, pytype’s internal representation of a
python opcode, and this list of Opcodes is used as the canonical representation
of the program by the rest of the code.
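The compile-then-disassemble step can be illustrated with the standard library alone. This is a minimal sketch, not pytype's code: it uses the host interpreter's built-in compile() and the dis module rather than pytype's own Opcode classes, and disassemble_source is a hypothetical helper name.

```python
import dis


def disassemble_source(src, filename="<example>"):
    """Compile source with the host interpreter and list its instructions.

    Returns (offset, opname, arg) tuples. pytype builds its own Opcode
    objects instead; this only shows the general shape of the data.
    """
    code = compile(src, filename, "exec")
    return [(ins.offset, ins.opname, ins.arg)
            for ins in dis.get_instructions(code)]


if __name__ == "__main__":
    for offset, opname, arg in disassemble_source("x = 1\n"):
        print(offset, opname, arg)
```

The exact instruction sequence depends on the python version running the sketch, which is the reason the target version matters in the sections that follow.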
One caveat for any sort of python code tool is that the python language, and
hence its bytecode, are evolving over time. For pytype specifically, one of the
consequences of this is that we accept a
--version argument specifying the
python version of the code being analysed, and make sure that our internal model
of the language matches the exact state of the code’s version.
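Since pytype itself is written in python, we have to make a clear distinction between the host and the target python version: the host version is the python version pytype itself is running under, and the target version is the python version of the code being analysed (as specified by --version). If no version is specified, we assume it is the same as the host version.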
If the host and target versions differ, we need to compile python source files
to bytecode using a target-version interpreter. For example, if we are running
under python 3.7 and are passed --version=3.6, we cannot use python 3.7’s
internal libraries to compile the code; we have to launch a python 3.6
interpreter, compile the target code to bytecode, and then retrieve that
bytecode to run through our VirtualMachine.
The relevant compilation code can be found at pyc/pyc.py. The process is:

    if host_version == target_version:
      bytecode = compile_source(src)
    else:
      write source to tmpfile src.py
      subprocess.call(target_python_exe, tmpfile)  # generates src.pyc
      bytecode = read(src.pyc)

For the host != target case, we have a check in
config.py/_store_python_version() to make sure there is a target-version
python interpreter available, and a standalone executable,
pyc/compile_bytecode.py, that can be called as a subprocess.
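The host/target branch can be sketched as follows. This is an illustration under stated assumptions, not the code in pyc/pyc.py: the function name, the target_python_exe parameter and the helper script are hypothetical, and the bytecode is passed back over stdout as a marshalled code object rather than read from a .pyc file.

```python
import marshal
import os
import subprocess
import sys
import tempfile

# Hypothetical helper script run by the target interpreter: it compiles the
# file given as argv[1] and writes the marshalled code object to stdout.
_COMPILE_SNIPPET = (
    "import marshal, sys\n"
    "path = sys.argv[1]\n"
    "with open(path, 'rb') as f:\n"
    "    code = compile(f.read(), path, 'exec')\n"
    "sys.stdout.buffer.write(marshal.dumps(code))\n"
)


def compile_to_marshalled_bytecode(src, target_version, target_python_exe=None):
    """Return marshalled bytecode for src, compiled for target_version.

    target_version is a (major, minor) tuple. When it differs from the host
    version, target_python_exe must name an interpreter of that version.
    """
    host_version = tuple(sys.version_info[:2])
    if tuple(target_version) == host_version:
        # Same version: the host's own compiler produces the right bytecode.
        return marshal.dumps(compile(src, "<src>", "exec"))
    if target_python_exe is None:
        raise ValueError("no interpreter available for the target version")
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "src.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(src)
        result = subprocess.run(
            [target_python_exe, "-c", _COMPILE_SNIPPET, path],
            check=True, capture_output=True)
    return result.stdout
```

Note that the marshal format is itself version-specific, so bytes produced by a different target interpreter cannot simply be passed to the host's marshal.loads(); they need a reader that understands the target version's format.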
As the name suggests, “bytecode” is a binary representation consisting of a
series of bytes, each with a meaning defined by the interpreter (e.g.
UNARY_POSITIVE). Pytype reads in the bytecode version of a .py file and
disassembles it into Opcodes, our internal representation of a bytecode VM
opcode.
The relevant code is in opcodes.py, which defines two classes, Opcode and
OpcodeWithArg, and then creates a subclass corresponding to every python
opcode. Opcodes have a set of properties such as HAS_NAME and HAS_ARGUMENT
(stored as a bitvector, see the top of opcodes.py for explanations of each
bit), and optionally a single argument. The semantic value of the argument
depends on the opcode. For example,

    class STORE_ATTR(OpcodeWithArg):  # Indexes into name list
      FLAGS = HAS_NAME|HAS_ARGUMENT
      __slots__ = ()

means that the opcode
STORE_ATTR references the name table, has a single
associated argument, and that that argument is an index into the list of names.
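To make the bitvector idea concrete, here is a minimal, self-contained sketch of such a class hierarchy. The flag values, the index attribute and the has_flag helper are assumptions made for illustration; the real definitions live in opcodes.py and may differ.

```python
# Hypothetical bit values; the real ones are defined at the top of opcodes.py.
HAS_ARGUMENT = 1 << 0
HAS_NAME = 1 << 1


class Opcode:
    """An opcode that carries no argument."""
    __slots__ = ("index",)
    FLAGS = 0

    def __init__(self, index):
        self.index = index  # position of the opcode in the instruction list

    @classmethod
    def has_flag(cls, flag):  # hypothetical helper, for illustration only
        return bool(cls.FLAGS & flag)


class OpcodeWithArg(Opcode):
    """An opcode with a single argument whose meaning depends on the opcode."""
    __slots__ = ("arg",)

    def __init__(self, index, arg):
        super().__init__(index)
        self.arg = arg


class STORE_ATTR(OpcodeWithArg):  # Indexes into name list
    FLAGS = HAS_NAME | HAS_ARGUMENT
    __slots__ = ()


class POP_TOP(Opcode):
    __slots__ = ()
```

Because the flags are combined with bitwise OR, a single integer per class is enough to answer questions such as "does this opcode reference the name table?" with a bitwise AND.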
TIP: The meaning of the argument is not part of the opcode definition (hence the need to document it as a comment). Looking at vm.py, every opcode has a corresponding byte_<Opcode> method in the VirtualMachine class, which deals with actually interpreting that opcode. The byte_STORE_ATTR method starts off with the code

    name = self.frame.f_code.co_names[op.arg]

which essentially says “use the opcode’s argument to index into the list of names and retrieve the name of the attr that we are storing”. The comments on the Opcode class document these semantic meanings.
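The byte_<Opcode> naming convention lends itself to dispatch by method name. The sketch below is a toy illustration of that pattern, not pytype's VirtualMachine: MiniVM, run_instruction and the stored list are hypothetical, and looking the handler up with getattr is an assumption about how such a dispatch could be written.

```python
from types import SimpleNamespace


class STORE_ATTR:
    """Stand-in opcode object; its arg is an index into the list of names."""

    def __init__(self, arg):
        self.arg = arg


class MiniVM:
    """Toy VM that dispatches each opcode to a byte_<Opcode> method."""

    def __init__(self, co_names):
        # Mimic the frame -> code object -> co_names chain used in the text.
        code = SimpleNamespace(co_names=tuple(co_names))
        self.frame = SimpleNamespace(f_code=code)
        self.stored = []  # names seen by byte_STORE_ATTR, for demonstration

    def run_instruction(self, op):
        handler = getattr(self, "byte_" + type(op).__name__, None)
        if handler is None:
            raise NotImplementedError(type(op).__name__)
        return handler(op)

    def byte_STORE_ATTR(self, op):
        # Use the opcode's argument to look up the attribute name.
        name = self.frame.f_code.co_names[op.arg]
        self.stored.append(name)
        return name


if __name__ == "__main__":
    vm = MiniVM(co_names=("obj", "attr"))
    print(vm.run_instruction(STORE_ATTR(1)))
```

With this layout, supporting a new opcode means adding one method, and the argument's interpretation stays next to the code that uses it.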
After defining a class for every python opcode,
opcodes.py defines a series of
tables mapping between bytecodes and opcodes for each python version we support.
The opcodes.py/dis() function gets the right mapping table for the target
python version, and then passes it to a bytecode reader which iterates over the
block of bytes, converting each one into an opcode or into the argument for the