MicroPython string interning
============================

MicroPython uses `string interning`_ to save both RAM and ROM. This avoids
having to store duplicate copies of the same string. Primarily, this applies to
identifiers in your code, as something like a function or variable name is very
likely to appear in multiple places in the code. In MicroPython an interned
string is called a QSTR (uniQue STRing).

A QSTR value (with type ``qstr``) is an index into a linked list of QSTR pools.
QSTRs store their length and a hash of their contents for fast comparison during
the de-duplication process. All bytecode operations that work with strings use
a QSTR argument.

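As a rough illustration of the paragraph above, the following Python sketch models a ``qstr`` as an index into a chain of pools and uses the stored hash and length to reject non-matching entries before comparing contents. The names ``Pool``, ``base``, ``find_qstr``, ``qstr_data`` and ``toy_hash`` are hypothetical and chosen for this example only; the hash function is a stand-in, not necessarily the one MicroPython uses.

```python
def toy_hash(data: bytes) -> int:
    """Stand-in 16-bit hash (an assumption for this sketch); never returns 0."""
    h = 5381
    for b in data:
        h = ((h * 33) ^ b) & 0xFFFF
    return h or 1


class Pool:
    """One pool in the chain; `prev` links to the pool created before it."""

    def __init__(self, prev, strings):
        self.prev = prev
        self.strings = list(strings)
        # Number of entries held by all earlier pools, so that indices
        # are unique across the whole chain.
        self.base = 0 if prev is None else prev.base + len(prev.strings)
        # Hash and length are stored next to each string.
        self.meta = [(toy_hash(s), len(s)) for s in self.strings]


def find_qstr(last_pool, data: bytes):
    """Return the index of `data` if it is already interned, else None."""
    h, n = toy_hash(data), len(data)
    pool = last_pool
    while pool is not None:
        for i, (ph, pn) in enumerate(pool.meta):
            # Cheap checks first: only compare bytes when hash and length match.
            if ph == h and pn == n and pool.strings[i] == data:
                return pool.base + i
        pool = pool.prev
    return None


def qstr_data(last_pool, q: int) -> bytes:
    """Resolve an index back to its string by walking the chain."""
    pool = last_pool
    while q < pool.base:
        pool = pool.prev
    return pool.strings[q - pool.base]
```

In this model a later pool continues the numbering of the earlier ones, so an index identifies exactly one string no matter which pool holds it.
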
Compile-time QSTR generation
----------------------------

In the MicroPython C code, any strings that should be interned in the final
firmware are written as ``MP_QSTR_Foo``. At compile time this evaluates to
a ``qstr`` value that is the index of ``"Foo"`` in the QSTR pool.

A multi-step process in the ``Makefile`` makes this work. In summary this
process has three parts:

1. Find all ``MP_QSTR_Foo`` tokens in the code.

2. Generate a static QSTR pool containing all the string data (including lengths
   and hashes).

3. Replace all ``MP_QSTR_Foo`` (via the preprocessor) with their corresponding
   index.

``MP_QSTR_Foo`` tokens are searched for in two sources:

1. All files referenced in ``$(SRC_QSTR)``. This covers all C code (e.g. ``py``,
   ``extmod``, ``ports/stm32``), excluding third-party code such as ``lib``.

2. Additional ``$(QSTR_GLOBAL_DEPENDENCIES)`` (which includes ``mpconfig*.h``).

*Note:* ``frozen_mpy.c`` (generated by ``mpy-tool.py``) has its own QSTR generation
and pool.

Some additional strings that can't be expressed using the ``MP_QSTR_Foo`` syntax
(e.g. they contain non-alphanumeric characters) are explicitly provided in
``qstrdefs.h`` and ``qstrdefsport.h`` via the ``$(QSTR_DEFS)`` variable.

Processing happens in the following stages:

1. ``qstr.i.last`` is the concatenated output of running every input file
   through the C preprocessor. Conditionally disabled code is therefore removed
   and macros are expanded, so strings that won't be used in the final firmware
   are not added to the pool. At this stage there is no definition for
   ``MP_QSTR_Foo`` (thanks to the ``NO_QSTR`` macro added by
   ``QSTR_GEN_EXTRA_CFLAGS``), so these tokens pass through unaffected. The
   file also includes comments from the preprocessor that carry line number
   information. Note that this step only processes files that have changed, so
   ``qstr.i.last`` only contains data from files that have changed since the
   last compile.

2. ``qstr.split`` is an empty file created after running ``makeqstrdefs.py split``
   on ``qstr.i.last``. It is only used as a dependency to indicate that the step
   ran. The script outputs one file per input C file,
   ``genhdr/qstr/...file.c.qstr``, which contains only the matched QSTRs. Each
   QSTR is printed as ``Q(Foo)``. This step is necessary to combine the existing
   files with the new data generated from the incremental update in
   ``qstr.i.last``.

3. ``qstrdefs.collected.h`` is the output of concatenating ``genhdr/qstr/*``
   using ``makeqstrdefs.py cat``. This is the full set of ``MP_QSTR_Foo`` tokens
   found in the code, formatted as ``Q(Foo)``, one per line, with duplicates.
   This file is only updated if the set of QSTRs has changed. A hash of the QSTR
   data is written to another file (``qstrdefs.collected.h.hash``), which allows
   changes to be tracked across builds.

4. ``qstrdefs.preprocessed.h`` adds in the QSTRs from ``qstrdefs*.h``. It
   concatenates ``qstrdefs.collected.h`` with ``qstrdefs*.h``, then transforms
   each line from ``Q(Foo)`` to ``"Q(Foo)"`` so that it passes through the
   preprocessor unchanged. The preprocessor is then used to deal with any
   conditional compilation in ``qstrdefs*.h``. Finally the transformation is
   undone back to ``Q(Foo)``, and the result is saved as
   ``qstrdefs.preprocessed.h``.

5. ``qstrdefs.generated.h`` is the output of ``makeqstrdata.py``. For each
   ``Q(Foo)`` in ``qstrdefs.preprocessed.h`` (plus some extra hard-coded ones), it
   outputs ``QDEF(MP_QSTR_Foo, (const byte*)"hash" "Foo")``.

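The stages above can be approximated in a few lines of Python. This is a minimal sketch of the idea only, not the actual ``makeqstrdefs.py`` or ``makeqstrdata.py`` code: the function names, the regular expression, the djb2-style hash and the tuple-shaped output are all assumptions made for illustration, and the real scripts emit ``QDEF(...)`` lines rather than tuples.

```python
import re

# Assumed token shape: MP_QSTR_ followed by identifier characters.
QSTR_TOKEN = re.compile(r"MP_QSTR_([_a-zA-Z0-9]+)")


def extract_qstrs(preprocessed_text):
    """Roughly the 'split' stage: list every match as Q(name), duplicates kept."""
    return ["Q(%s)" % name for name in QSTR_TOKEN.findall(preprocessed_text)]


def compute_hash(data, bytes_hash=2):
    """Illustrative djb2-style hash truncated to `bytes_hash` bytes, never 0."""
    h = 5381
    for b in data:
        h = (h * 33) ^ b
    return (h & ((1 << (8 * bytes_hash)) - 1)) or 1


def generate_records(q_lines):
    """Roughly the final stage: de-duplicate, then emit one record per string.

    Each record is (identifier, hash, length, text).
    """
    names = sorted({line[2:-1] for line in q_lines})  # strip "Q(" and ")"
    records = []
    for name in names:
        data = name.encode("utf-8")
        records.append(("MP_QSTR_" + name, compute_hash(data), len(data), name))
    return records


if __name__ == "__main__":
    source = "mp_obj_t a = MP_QSTR_Foo; mp_obj_t b = MP_QSTR_bar; use(MP_QSTR_Foo);"
    for record in generate_records(extract_qstrs(source)):
        print(record)
```

Duplicates survive the extraction step and are only removed when the final records are generated, which mirrors the "with duplicates" remark for ``qstrdefs.collected.h``.
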
Then in the main compile, two things happen with ``qstrdefs.generated.h``:

1. In ``qstr.h``, each QDEF becomes an entry in an enum, which makes
   ``MP_QSTR_Foo`` available to code and equal to the index of that string in
   the QSTR table.

2. In ``qstr.c``, the actual QSTR data table is generated as elements of
   ``mp_qstr_const_pool->qstrs``.

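The key property of these two uses is that a single generated list drives both the identifiers and the data table, so they cannot drift apart. The Python analogue below is only a sketch of that idea; ``GENERATED``, ``Qstr`` and ``CONST_POOL`` are hypothetical names, and the real mechanism is the C preprocessor expanding ``QDEF`` in two different ways.

```python
from enum import IntEnum

# Stand-in for the generated header: one (identifier, text) pair per QDEF.
GENERATED = [
    ("MP_QSTR_Foo", "Foo"),
    ("MP_QSTR_bar", "bar"),
    ("MP_QSTR_baz", "baz"),
]

# First expansion: an enum whose members are numbered in list order from 0.
Qstr = IntEnum("Qstr", [name for name, _ in GENERATED], start=0)

# Second expansion: the data table, in exactly the same order.
CONST_POOL = tuple(text for _, text in GENERATED)

if __name__ == "__main__":
    # An enum member can be used directly as an index into the table.
    print(int(Qstr.MP_QSTR_bar), CONST_POOL[Qstr.MP_QSTR_bar])
```

Because both structures are derived from the same ordered list, adding or removing an entry renumbers the enum and the table together.
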
.. _`string interning`: https://en.wikipedia.org/wiki/String_interning

Run-time QSTR generation
------------------------

Additional QSTR pools can be created at runtime so that strings can be added to
them. For example, the code::

   foo[x] = 3

will need to create a QSTR for the value of ``x`` so it can be used by the
"load attr" bytecode.

Also, when compiling Python code, identifiers and literals need to have QSTRs
created. Note that only literals shorter than 10 characters become QSTRs. This
is because a regular string on the heap always takes up a minimum of 16 bytes
(one GC block), whereas QSTRs allow short strings to be packed more efficiently
into the pool.

QSTR pools (and the underlying "chunks" that store the string data) are allocated
on-demand on the heap with a minimum size.

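On-demand pool allocation can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not MicroPython's implementation: ``MIN_POOL_SLOTS``, the doubling growth policy, and the ``RuntimePool`` and ``Interner`` classes are all hypothetical, and the separate "chunks" that hold string data are not modelled.

```python
MIN_POOL_SLOTS = 4  # hypothetical minimum size for a newly allocated pool


class RuntimePool:
    """A fixed-capacity pool linked to the pool that precedes it."""

    def __init__(self, prev, capacity):
        self.prev = prev
        self.capacity = capacity
        self.strings = []
        # Index of this pool's first entry within the whole chain.
        self.base = 0 if prev is None else prev.base + len(prev.strings)


class Interner:
    def __init__(self, static_strings):
        # The first pool stands in for the static pool built at compile time.
        self.last = RuntimePool(None, len(static_strings))
        self.last.strings = list(static_strings)
        self.pools_allocated = 0

    def _find(self, s):
        pool = self.last
        while pool is not None:
            if s in pool.strings:
                return pool.base + pool.strings.index(s)
            pool = pool.prev
        return None

    def intern(self, s):
        """Return the existing index for `s`, or add it and return a new one."""
        q = self._find(s)
        if q is not None:
            return q
        if len(self.last.strings) >= self.last.capacity:
            # Only allocate a new pool when the current one is full.
            new_capacity = max(MIN_POOL_SLOTS, 2 * self.last.capacity)
            self.last = RuntimePool(self.last, new_capacity)
            self.pools_allocated += 1
        self.last.strings.append(s)
        return self.last.base + len(self.last.strings) - 1
```

In this model, interning a string that already exists never allocates, and a new pool only appears once the most recent one has no free slots.
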