Python bytecode peek

When the CPython interpreter executes a program, it first translates it into a series of bytecode instructions. Bytecode is the intermediate language of the Python virtual machine and serves as a performance optimization.

The CPython interpreter does not directly execute human-readable source code. Instead, it works with compact numeric codes, constants, and references that represent the result of the compiler's parsing and semantic analysis.
This saves time and memory when the same program is executed again. The bytecode produced by the compilation step is cached on disk in .pyc files (before Python 3.5, also in .pyo files), so loading it is faster than parsing and compiling the same Python file again.
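To get a feel for where this cache lives, the standard library can compute the path CPython would use for a given source file. This is only a small sketch: greet.py is a hypothetical file name chosen for illustration, and the version tag in the result depends on the interpreter you run it with.

```python
import importlib.util

# 'greet.py' is a hypothetical source file used only for illustration.
# cache_from_source() just computes a path; the file does not need to exist.
source_path = 'greet.py'
cache_path = importlib.util.cache_from_source(source_path)

# By default this is a .pyc file inside a __pycache__ directory, with an
# interpreter tag (for example "cpython-312") embedded in the file name.
print(cache_path)
```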
All of these steps are completely transparent to the programmer. There’s no need to worry about these intermediate conversion steps or how the Python virtual machine handles the bytecode. In fact, the bytecode format is an implementation detail and isn’t necessarily stable or compatible across Python versions.
Peering into the CPython interpreter's internals can be enlightening. Understanding how it works is not only fun; more importantly, it can help you write more efficient code.
Take the following simple greet() function as an example to learn about Python bytecode:

def greet(name):
    return 'Hello, ' + name + '!'

>>> greet('Guido')
'Hello, Guido!'

As mentioned earlier, CPython first converts this code into an intermediate language before running it. If this is true, then we should be able to see the results of this compilation step. And indeed, we can.
In Python 3, every function has a __code__ attribute that provides access to the virtual machine instructions, constants, and variables used by the greet function:

>>> greet.__code__.co_code
b'd\x01|\x00\x17\x00d\x02\x17\x00S\x00'
>>> greet.__code__.co_consts
(None, 'Hello, ', '!')
>>> greet.__code__.co_varnames
('name',)

As you can see, co_consts contains the string parts used to assemble the greeting in the greet function. Constants are stored separately from the code to save space. Because constants never change, the same stored value can safely be reused in multiple places.
Thus, instead of duplicating the actual constant values in the co_code instruction stream, Python stores constants in a separate lookup table. The instruction stream then references the constant using an index into the lookup table, as do the variables stored in the co_varnames field.
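The index lookup described above can be checked by hand. The following sketch uses dis.get_instructions() from the standard library and only looks at LOAD_CONST instructions, since other opcode names differ between CPython versions:

```python
import dis

def greet(name):
    return 'Hello, ' + name + '!'

code = greet.__code__

# Collect every LOAD_CONST instruction and resolve its index manually.
resolved = []
for instr in dis.get_instructions(greet):
    if instr.opname == 'LOAD_CONST':
        # instr.arg is the raw index stored in the instruction stream;
        # instr.argval is the value the dis module already looked up.
        value = code.co_consts[instr.arg]
        assert value == instr.argval
        resolved.append(value)

print(resolved)
```

Each instruction carries only a small integer, and the actual string lives once in co_consts.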
I hope the overall idea is becoming clear, but the raw instruction stream in co_code is still hard to read. This intermediate language is clearly designed for the CPython virtual machine, while text-based source code is meant for human consumption.
The CPython developers recognized this and provided another tool called a disassembler to make it easier to view bytecode.
CPython’s bytecode disassembler is located in the dis module of the standard library. Importing it and calling dis.dis() on the greet function displays the corresponding bytecode in a slightly easier-to-read format. The exact output varies between Python versions; the listing below matches CPython 3.6 through 3.10:

>>> import dis
>>> dis.dis(greet)
  2           0 LOAD_CONST               1 ('Hello, ')
              2 LOAD_FAST                0 (name)
              4 BINARY_ADD
              6 LOAD_CONST               2 ('!')
              8 BINARY_ADD
             10 RETURN_VALUE

The main task of disassembly is to divide the instruction stream and assign a human-readable name to each operation code (opcode), such as LOAD_CONST.
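To make that concrete, here is a rough sketch of splitting the instruction stream by hand. It assumes CPython 3.6 or later, where every instruction is two bytes wide, and it does not handle EXTENDED_ARG or give special treatment to the inline cache entries that newer versions place in the stream; dis.dis() handles those details for you.

```python
import dis

def greet(name):
    return 'Hello, ' + name + '!'

raw = greet.__code__.co_code

# Assumption: CPython 3.6+, where each instruction is one opcode byte
# followed by one argument byte.
decoded = []
for offset in range(0, len(raw), 2):
    opcode, arg = raw[offset], raw[offset + 1]
    # dis.opname maps a numeric opcode to its human-readable name.
    decoded.append((offset, dis.opname[opcode], arg))

for offset, opname, arg in decoded:
    print(f'{offset:>3} {opname:<16} {arg}')
```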
You can also see that constant and variable references are now shown alongside the bytecode with their values printed in full, saving us from having to manually look them up in the co_consts or co_varnames tables by index. Isn’t that great!
With these human-readable opcodes, we can now begin to understand how CPython represents and executes the expression 'Hello, ' + name + '!' in the original greet() function.
The interpreter first looks for the constant at index 1 ('Hello, ') and places it on the stack. Then, it places the contents of the name variable on the stack.
This stack serves as the internal working storage of the virtual machine. There are different types of virtual machines, and CPython’s is a stack-based one. Since this type of virtual machine is named after the stack, it’s easy to see how important this data structure is. By the way, this only scratches the surface. If you’re interested in this topic, you can check out the book recommendation at the end of this section. Delving into virtual machine theory is both rewarding and enjoyable.
The interesting thing about a stack, as an abstract data structure, is that it only supports two operations: push and pop. Pushing adds a value to the top of the stack, while popping removes and returns the top value. Unlike an array, a stack cannot access elements below the top.
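As a quick illustration, a plain Python list can play the role of a stack if you only use append() and pop(). This is just a model of the abstract data structure, not how CPython stores its stack internally:

```python
stack = []

# Push: append to the end of the list, which acts as the top of the stack.
stack.append('Hello, ')
stack.append('Guido')

# Pop: remove and return the most recently pushed value.
top = stack.pop()
print(top)    # Guido
print(stack)  # ['Hello, ']
```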
The stack is fascinating; such a simple data structure has so many uses. But I won’t get off topic this time…
Assuming the stack is initially empty, after executing the first two opcodes, the contents of the virtual machine stack (0 is the top element) are as follows:

0: 'Guido' (contents of "name")
1: 'Hello, '

The BINARY_ADD instruction pops the two string values from the stack, concatenates them, and pushes the result back onto the stack:

0: 'Hello, Guido'

Another LOAD_CONST instruction then pushes the exclamation mark string onto the stack:

0: '!'
1: 'Hello, Guido'

The next BINARY_ADD opcode concatenates the two strings again to produce the final greeting string:

0: 'Hello, Guido!'

The final bytecode instruction is RETURN_VALUE, which tells the virtual machine that the return value of this function is currently on the stack and can be passed to the caller.
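The whole trace can be replayed with a toy stack machine. The run() function below is a hypothetical helper written for this illustration. It only borrows the four opcode names from the listing above and is in no way CPython's actual implementation:

```python
def run(program, consts, local_vars):
    """Execute a list of (opname, arg) pairs on a toy stack machine."""
    stack = []
    for opname, arg in program:
        if opname == 'LOAD_CONST':
            stack.append(consts[arg])       # push constant by index
        elif opname == 'LOAD_FAST':
            stack.append(local_vars[arg])   # push local variable by index
        elif opname == 'BINARY_ADD':
            right = stack.pop()             # top of stack is the right operand
            left = stack.pop()
            stack.append(left + right)
        elif opname == 'RETURN_VALUE':
            return stack.pop()              # hand the top value to the caller
        else:
            raise ValueError(f'unsupported instruction: {opname}')
    raise RuntimeError('program ended without RETURN_VALUE')

# The same six instructions the disassembler printed for greet().
program = [
    ('LOAD_CONST', 1),
    ('LOAD_FAST', 0),
    ('BINARY_ADD', None),
    ('LOAD_CONST', 2),
    ('BINARY_ADD', None),
    ('RETURN_VALUE', None),
]
consts = (None, 'Hello, ', '!')
result = run(program, consts, ['Guido'])
print(result)  # Hello, Guido!
```

Note the order of the two pops in BINARY_ADD: the value pushed last is the right-hand operand, which is why the result is 'Hello, Guido' and not 'GuidoHello, '.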
Voila, we just traced the execution of the greet() function inside the CPython virtual machine. Isn’t that great?
There’s a lot more to say about virtual machines, but that’s beyond the scope of this book. If you’re interested in this fascinating topic, I highly recommend reading more.
It can be fun to define your own bytecode language and try to build a virtual machine for it. For a book on virtual machines, I recommend Compiler Design: Virtual Machines by Wilhelm and Seidl.

Key Points

  • CPython first converts the program into an intermediate bytecode, then runs the bytecode on a stack-based virtual machine to execute the program.

  • Use the dis module from the standard library to inspect the bytecode and gain insight into how your code runs.

  • The virtual machine deserves a closer look.
