1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
// This Source Code Form is subject to the terms of the Mozilla Public
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

/*!
Technical Implementation Notes

When trying to understand the code, a good place to start is
`MainPythonInterpreter.new()`, as this will initialize the CPython runtime and
Python initialization is where most of the magic occurs.

A lot of initialization code revolves around mapping
`OxidizedPythonInterpreterConfig` members to C API calls. This functionality is
rather straightforward. There's nothing really novel or complicated here. So
we won't cover it.

# Python Memory Allocators

There exist several
[CPython APIs for memory management](https://docs.python.org/3/c-api/memory.html).
CPython defines multiple memory allocator *domains* and it is possible to
use a custom memory allocator for each using the `PyMem_SetAllocator()` API.

See the documentation in the `pyalloc` module for more on this topic.

# Module Importing

The module importing mechanisms provided by this crate are one of the
most complicated parts of the crate. This section aims to explain how it
works. But before we go into the technical details, we need an understanding
of how Python module importing works.

## High Level Python Importing Overview

A *meta path importer* is a Python object implementing
the [importlib.abc.MetaPathFinder](https://docs.python.org/3.7/library/importlib.html#importlib.abc.MetaPathFinder)
interface and is registered on [sys.meta_path](https://docs.python.org/3.7/library/sys.html#sys.meta_path).
Essentially, when the `__import__` function / `import` statement is called,
Python's importing internals traverse entities in `sys.meta_path` and
ask each *finder* to load a module. The first *meta path importer* that knows
about the module is used.

By default, Python configures 3 *meta path importers*: an importer for
built-in extension modules (`BuiltinImporter`), frozen modules
(`FrozenImporter`), and filesystem-based modules (`PathFinder`). You can
see these on a fresh Python interpreter:

```text
   $ python3.7 -c 'import sys; print(sys.meta_path)`
   [<class '_frozen_importlib.BuiltinImporter'>, <class '_frozen_importlib.FrozenImporter'>, <class '_frozen_importlib_external.PathFinder'>]
```

These types are all implemented in Python code in the Python standard
library, specifically in the `importlib._bootstrap` and
`importlib._bootstrap_external` modules.

Built-in extension modules are compiled into the Python library. These are often
extension modules required by core Python (such as the `_codecs`, `_io`, and
`_signal` modules). But it is possible for other extensions - such as those
provided by Python's standard library or 3rd party packages - to exist as
built-in extension modules as well.

For importing built-in extension modules, there's a global `PyImport_Inittab`
array containing members defining the extension/module name and a pointer to
its C initialization function. There are undocumented functions exported to
Python (such as `_imp.exec_builtin()` that allow Python code to call into C code
which knows how to e.g. instantiate these extension modules. The
`BuiltinImporter` calls into these C-backed functions to service imports of
built-in extension modules.

Frozen modules are Python modules that have their bytecode backed by memory.
There is a global `PyImport_FrozenModules` array that - like
`PyImport_Inittab` - defines module names and a pointer to bytecode data. The
`FrozenImporter` calls into undocumented C functions exported to Python to try
to service import requests for frozen modules.

Path-based module loading via the `PathFinder` meta path importer is what
most people are likely familiar with. It uses `sys.path` and a handful of
other settings to traverse filesystem paths, looking for modules in those
locations. e.g. if `sys.path` contains
`['', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/lib/python3/dist-packages']`,
`PathFinder` will look for `.py`, `.pyc`, and compiled extension modules
(`.so`, `.pyd`, etc) in each of those paths to service an import request.
Path-based module loading is a complicated beast, as it deals with all
kinds of complexity like caching bytecode `.pyc` files, differentiating
between Python modules and extension modules, namespace packages, finding
search locations in registry entries, etc. Altogether, there are 1500+ lines
constituting path-based importing logic in `importlib._bootstrap_external`!

## Default Initialization of Python Importing Mechanism

CPython's internals go through a convoluted series of steps to initialize
the importing mechanism. This is because there's a bit of chicken-and-egg
scenario going on. The *meta path importers* are implemented as Python
modules using Python source code (`importlib._bootstrap` and
`importlib._bootstrap_external`). But in order to execute Python code you
need an initialized Python interpreter. And in order to execute a Python
module you need to import it. And how do you do any of this if the importing
functionality is implemented as Python source code and as a module?!

A few tricks are employed.

At Python build time, the source code for `importlib._bootstrap` and
`importlib._bootstrap_external` are compiled into bytecode. This bytecode is
made available to the global `PyImport_FrozenModules` array as the
`_frozen_importlib` and `_frozen_importlib_external` module names,
respectively. This means the bytecode is available for Python to load
from memory and the original `.py` files are not needed.

During interpreter initialization, Python initializes some special
built-in extension modules using its internal import mechanism APIs. These
bypass the Python-based APIs like `__import__`. This limited set of
modules includes `_imp` and `sys`, which are both completely implemented in
C.

During initialization, the interpreter also knows to explicitly look for
and load the `_frozen_importlib` module from its frozen bytecode. It creates
a new module object by hand without going through the normal import mechanism.
It then calls the `_install()` function in the loaded module. This function
executes Python code on the partially bootstrapped Python interpreter which
culminates with `BuiltinImporter` and `FrozenImporter` being registered on
`sys.meta_path`. At this point, the interpreter can import compiled
built-in extension modules and frozen modules. Subsequent interpreter
initialization henceforth uses the initialized importing mechanism to
import modules via normal import means.

Later during interpreter initialization, the `_frozen_importlib_external`
frozen module is loaded from bytecode and its `_install()` is also called.
This self-installation adds `PathFinder` to `sys.meta_path`. At this point,
modules can be imported from the filesystem. This includes `.py` based modules
from the Python standard library as well as any 3rd party modules.

Interpreter initialization continues on to do other things, such as initialize
signal handlers, initialize the filesystem encoding, set up the `sys.std*`
streams, etc. This involves importing various `.py` backed modules (from the
filesystem). Eventually interpreter initialization is complete and the
interpreter is ready to execute the user's Python code!

## Our Importing Mechanism

We have made significant modifications to how the Python importing
mechanism is initialized and configured. (Note: we do not require these
modifications. It is possible to initialize a Python interpreter with
*default* behavior, without support for in-memory module importing.)

The `importer` Rust module of this crate defines a Python extension module.
To the Python interpreter, an extension module is a C function that calls
into the CPython C APIs and returns a `PyObject*` representing the
constructed Python module object. This extension module behaves like any
other extension module you've seen. The main differences are it is implemented
in Rust (instead of C) and it is compiled into the binary containing Python,
as opposed to being a standalone shared library that is loaded into the Python
process.

This extension module provides the `oxidized_importer` Python module,
which defines a meta path importer.

When we initialize the Python interpreter, the `oxidized_importer`
extension module is appended to the global `PyImport_Inittab` array,
allowing it to be recognized as a *built-in* extension module and
imported as such.

We use the PEP-587
[Python Initialization Configuration](https://docs.python.org/3/c-api/init_config.html)
API to have granular control over Python initialization. Our most
notable departure is we force a multi-phase initialization so
initialization pauses between _core_ and _main_ initialization.

When _core_ is initialized, `_frozen_importlib._install()` is called to
register `BuiltinImporter` and `FrozenImporter` on `sys.meta_path`.
At our break point after _core_ initialization, we import our
`oxidized_importer` module using the Python C APIs. This import
is serviced by `BuiltinImporter`. Our Rust-implemented module initialization
function runs and creates a module object. We then call another Rust
function to complete the module initialization given the current
configuration. This will create a new *meta path importer* and register
it on `sys.meta_path`. The chief goal of our importer is to support
importing Python resources using an efficient binary data structure.

Our extension module grabs a handle on the `&[u8]` containing modules
data embedded into the binary. (See
[../specifications/index.html](Specifications) for the format of this blob.)
The in-memory data structure is parsed into a Rust collection type
(basically a `HashMap<&str, (&[u8], &[u8])>`) mapping Python module names
to their source and bytecode data.

The extension module defines an `OxidizedFinder` Python type that
implements the requisite `importlib.abc.*` interfaces for providing a
*meta path importer*. An instance of this type is constructed from the
parsed data structure containing known Python modules. That instance is
registered as the first entry on `sys.meta_path`.

When our module's `_setup()` completes, we trigger the *main*
initialization. This will *always* register the traditional filesystem
importer (`PathFinder`) on `sys.meta_path`. But, since our finder is
registered first, it should always be used.

As part of _main_ interpreter initialization, Python attempts various
imports of `.py` based modules. The standard `sys.meta_path` traversal
is performed. The Rust-implemented `OxidizedFinder` converts the
requested Python module name to a Rust `&str` and does a lookup in a
`HashMap<&str, ...>` to see if it knows about the module. Assuming the
module is found, a `&[u8]` handle on that module's source or bytecode is
obtained. That pointer is used to construct a Python `memoryview` object,
which allows Python to access the raw bytes without a memory copy.
Depending on the type, the source code is decoded to a Python `str` or
the bytecode is sent to `marshal.loads()`, converted into a Python `code`
object, which is then executed via the equivalent of
`exec(code, module.__dict__)` to populate an empty Python module object.

In addition, `OxidizedFinder` indexes the built-in extension modules
and frozen modules. It removes `BuiltinImporter` and `FrozenImporter`
from `sys.meta_path`. When `OxidizedFinder` sees a request for a
built-in or frozen module, it dispatches to `BuiltinImporter` or
`FrozenImporter` to complete the request. The reason we do this is
performance. Imports have to traverse `sys.meta_path` entries until a
registered finder says it can service the request. So the more entries
there are, the more overhead there is. Compounding the problem is that
`BuiltinImporter` and `FrozenImporter` do a `strcmp()`
against the global module arrays when trying to service an import.
`OxidizedFinder` already has an index of module name to data. So it
was not that much effort to also index built-in and frozen modules
so there's a fixed, low cost for finding modules (a Rust `HashMap` key
lookup).

It's worth explicitly noting that it is important for our custom importer
to run *before* the _main_ initialization phase completes. This
is because Python interpreter initialization relies on the fact that
`.py` implemented standard library modules are available for import
during initialization. For example, initializing the filesystem encoding
needs to import the `encodings` module, which is provided by a `.py` file
on the filesystem in standard installations.

After the _main_ initialization phase completes, we remove `PathFinder`
from `sys.meta_path` if the configuration says to disable filesystem
based imports. The overhead of registering then unregistering it should
be trivial and no I/O should have been performed.

*/