diff options
Diffstat (limited to 'blog')
-rw-r--r-- | blog/boolin.md | 231 |
1 files changed, 231 insertions, 0 deletions
diff --git a/blog/boolin.md b/blog/boolin.md new file mode 100644 index 0000000..6acae55 --- /dev/null +++ b/blog/boolin.md @@ -0,0 +1,231 @@ +<!-- + -- title = "Inclusion in Python Environment Algebra" + -- rss = "Specifying a subset of environment constraints for Python packages" + -- date = Date(2021, 11, 26) + -- tags = ["fun", "math", "packaging", "python"] + --> +# Inclusion in Python Environment Algebra + +!!! note "Note" + + If you do not want to hear me rambling about the shortcomings + of Python packaging metadata standards, feel free to skip + to section [1.c](#objective). + +\toc + +## Background + +### Motivation + +I sometimes [praise] Python wheels as the best binary packaging format +for cross-platform delivery. It truly is, at least to my limited knowledge. +Via compatibility tags, the constraints for the target environment +can be exclusively specified. Even though {{pep 425 compressed-tag-sets}} +was introduced + +> to allow for compact filenames of bdists[^bdists] +> that work with more than one compatibility tag triple, + +the filenames are not always *compact* and it is not always possible +to specify all environments a module can work on either. For instance, +there's no direct answer for one of the FAQ: + +> **What tag do I use if my distribution uses +> a feature exclusive to the newest version of Python?** + +I have seen workarounds like `cp35.cp36.cp37.cp38.cp39`, which accept +CPython from version 3.5, but not 3.10, which is the topic of the next Q&A: + +> **Why isn't there a `.` in the Python version number?** +> +> CPython has lasted 20+ years without a 3-digit major release. +> This should continue for some time. Other implementations may use `_` +> as a delimiter, since both `-` and `.` delimit the surrounding filename. + +The platform tag was not carefully standardized either, +which gives us inconsistent cyphers such as `win32`, `win_amd64` +and `manylinux2010_x86_64`.[^glibc] + +### Inspiration + +A few years later, {{pep 508 environment-markers}} came along after multiple +iterations ({{pep 345}}, {{pep 426}}) with a much more human-readable format, +which is essentially a Boolean algebra of comparisons. With this, +the aforementioned CPython constraint can be correctly expressed as + +```python +platform_python_implementation == 'CPython' +and python_version >= '3.5' +``` + +Unfortunately, they are only standardized for marking dependencies, +usually to watch out for environments where a library is either unnecessary +or unusable. But *who watches the watchers*? While it is possible +to [specify Python and platform tags when building wheels][params], the feature +is undocumented and almost unknown. Therefore, many, if not most, +platform-specific pure-Python wheels are tagged as `none` platform and all +their dependents needs to have the same redundant markers. Furthermore, +due to the exclusive nature of compatibility tags, a GNU/Linux-only +purelib wheel should be tagged with something like the following abomination: + +``` +manylinux_2_17_x86_64.manylinux_2_17_i686.manylinux_2_17_aarch64.manylinux_2_17_armv7l.manylinux_2_17_ppc64.manylinux_2_17_ppc64le.manylinux_2_17_s390x +``` + +Now imagine the filename of a wheel that only works on Unix-like systems. +`uvloop` is one and it is among the 500 [most-downloaded on PyPI][3719394] +at the time of writing. + +### Objective + +This article is, though, not going to sell you the idea of porting +environment markers to top-level `WHEEL` metadata. I should sit down +with the [PyPA] people and [discuss] it one day, but the currently defined +markers are lacking CPU and ABI information. The reason for this is that +Python's standard library does not implement a way to retrieve such information, +and for completeness, the operating system's version is not meaningfully +available either. If we want to redo this right, I highly recommend +looking into [Zig]'s [std.Target], which was designed from the start +for cross-compilation. Whatever we will come up with, it shall not be +done overnight. + +Instead, right now I need to solve a more immediate problem. For the last +few months, I have been working on [IPWHL][ipwhl], a downstream repository for +Python distributions. Since we enforce that all dependencies to be satisfied, +packages with too strict or too lax environment declaration will break +the automated tests, even though effectively the declared packages integrate +well with each other. We came to a conclusion that adding information would +improve the checks' accuracy and decided to use the [floating cheeses][ipwhl] +as a testbed for [top-level environment specifiers][proposal]. + +In short, we are going to merge the `METADATA` field `Requires-Python` +with other environmental information to a single field `environment`. +Since other information may or may not be present in the filename, +during metadata integrity check we can only check if what declared +is [included][inclusion] by what found from the original wheel. + +### Theory + +Ignoring `extra`, a Python environment can be represented as a 11-tuple of + +```python +(python_version, python_full_version, os_name, sys_platform, + platform_release, platform_system, platform_version, + platform_machine, platform_python_implementation, + implementation_name, implementation_version) +``` + +Denote $S$ as the set of all strings, then the set of all environments +$E \subset S^{11}$. Let +\[O = \{\le, <, \ne, =, \ge, >, \cong, \equiv, \in, \notin, \ni, \notni\}\] +the set of all marker expressions[^var] can be defined as +\[X = \{e \mapsto e \bigcirc s \mid \bigcirc \in O \land s \in S\}\] +and the set of all environment markers[^markers] can be recursively defined as +\[\begin{aligned} + M = \{c_\top\} &\cup \{(e_1, \ldots, e_{11}) + \mapsto m(e_1, \ldots, e_{11}) \land x(e_{\imath}) + \mid m \in M \land x \in X \land \imath \in I\}\\ + &\cup \{(e_1, \ldots, e_{11}) + \mapsto m(e_1, \ldots, e_{11}) \lor x(e_{\imath}) + \mid m \in M \land x \in X \land \imath \in I\} +\end{aligned}\] +where $\top$ is the *top* element in Boolean algebra, $c_\top: E \to \{\top\}$ +is a constant function and $I = [1, 11]$. With all Boolean expressions +converted into one of their standard forms, we have the following alternative +definitions of an environment marker +\[\begin{aligned} + (e_1, \ldots, e_{11}) + &\mapsto \bigvee_{\imath = 1}^n\bigwedge_{\jmath = 1}^{f(\imath)} + x(\jmath)\left(e_{k(\jmath)}\right)\\ + (e_1, \ldots, e_{11}) + &\mapsto \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} + x(\jmath)\left(e_{k(\jmath)}\right) +\end{aligned}\] +where $n \in \N$, $f: \N^* \to \N^*$, $x: \N^* \to X$, $k: \N^* \to I$ +and an empty Boolean expression is implicitly $c_\top$. + +Now it is possible to formalize inclusion for markers as +\[\forall\,\{m, n\} \subset M: (m \le n) = (\forall\,e \in E: m(e) \le n(e))\] + +Thus, checking if a marker includes another is equivalent to evaluating +the following expression, for all $(e_1, \ldots, e_{11}) \in E$ +\[\begin{aligned} + r &= \bigvee_{\imath' = 1}^{n'}\bigwedge_{\jmath' = 1}^{f'(\imath')} + x'(\jmath')\left(e_{k'(\jmath')}\right) + \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} + x(\jmath)\left(e_{k(\jmath)}\right)\\ + &= \bigwedge_{\imath' = 1}^{n'}\left(\bigwedge_{\jmath' = 1}^{f'(\imath')} + x'(\jmath')\left(e_{k'(\jmath')}\right) + \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} + x(\jmath)\left(e_{k(\jmath)}\right)\right)\\ + &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n + \left(\bigwedge_{\jmath' = 1}^{f'(\imath')} + x'(\jmath')\left(e_{k'(\jmath')}\right) + \le \bigvee_{\jmath = 1}^{f(\imath)} + x(\jmath)\left(e_{k(\jmath)}\right)\right) +\end{aligned}\] +where $r$ is whether the original marker (defined by $n$, $f$, $x$ and $k$) +includes the declared one (defined by $n'$, $f'$, $x'$ and $k'$). Let +\[\begin{aligned} + g(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f(\imath)] + \land k(\jmath) = \ell\}\\ + g'(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f'(\imath)] + \land k'(\jmath) = \ell\} +\end{aligned}\] +and assume $e_1, \ldots, e_{11}$ are orthogonal, we have +\[\begin{aligned} + r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} + \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} + x'(\jmath')\left(e_{\ell}\right) + \le \bigvee_{\jmath \in g(\imath, \ell)} + x(\jmath)\left(e_{\ell}\right)\right)\\ + &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} + \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} + x'(\jmath')\left(e_{\ell}\right) + \land \lnot \bigvee_{\jmath \in g(\imath, \ell)} + x(\jmath)\left(e_{\ell}\right) = \bot\right) +\end{aligned}\] +where $\bot$ is the Boolean algebra bottom element. By [De Morgan's laws], +\[\begin{aligned} + r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} + \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} + x'(\jmath')\left(e_{\ell}\right) + \land \bigwedge_{\jmath \in g(\imath, \ell)} + \lnot x(\jmath)\left(e_{\ell}\right) = \bot\right) +\end{aligned}\] + +Essentially, checking if +\[\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\jmath')\left(e_{\ell}\right) + \land \bigwedge_{\jmath \in g(\imath, \ell)} + \lnot x(\jmath)\left(e_{\ell}\right)\] +is unsatisfiable for all $e_{\ell}$ is a [SAT problem] which we can write +a solver for. + +## Implementation + +### Generation + +### Conversion + +### Verification + +[^bdists]: [Built distributions][bdist], as opposed to [sdist] +[^glibc]: Bonus: try to sort `manylinux1`, `manylinux_2_24` and `manylinux2014` +[^var]: Ignoring relations between two variables or two strings +[^markers]: Or *environment specifiers* when [not marking anything][proposal] + +[praise]: https://nixnet.social/notice/A9GyFTnNnQDVZrmtUm +[bdist]: https://packaging.python.org/glossary/#term-Built-Distribution +[sdist]: https://packaging.python.org/glossary/#term-Source-Distribution-or-sdist +[params]: https://github.com/pypa/wheel/blob/0.37.0/src/wheel/bdist_wheel.py#L136-L164 +[3719394]: https://hugovk.github.io/top-pypi-packages +[PyPA]: https://pypa.io +[discuss]: https://discuss.python.org/t/7831 +[Zig]: https://ziglang.org +[std.Target]: https://ziglang.org/documentation/master/std/#std;Target +[ipwhl]: https://man.sr.ht/~cnx/ipwhl +[proposal]: https://lists.sr.ht/~cnx/ipwhl-discuss/%3CCEP1VQH719Y4.1EX12MJII0WVE%40nix%3E +[inclusion]: https://en.wikipedia.org/wiki/Inclusion_(Boolean_algebra) +[De Morgan's laws]: https://en.wikipedia.org/wiki/De_Morgan%27s_laws +[SAT problem]: https://en.wikipedia.org/wiki/Boolean_satisfiability_problem |