about summary refs log tree commit diff homepage
diff options
context:
space:
mode:
-rw-r--r--blog/boolin.md231
1 files changed, 231 insertions, 0 deletions
diff --git a/blog/boolin.md b/blog/boolin.md
new file mode 100644
index 0000000..6acae55
--- /dev/null
+++ b/blog/boolin.md
@@ -0,0 +1,231 @@
+<!--
+  -- title = "Inclusion in Python Environment Algebra"
+  -- rss = "Specifying a subset of environment constraints for Python packages"
+  -- date = Date(2021, 11, 26)
+  -- tags = ["fun", "math", "packaging", "python"]
+  -->
+# Inclusion in Python Environment Algebra
+
+!!! note "Note"
+
+    If you do not want to hear me rambling about the shortcomings
+    of Python packaging metadata standards, feel free to skip
+    to section [1.c](#objective).
+
+\toc
+
+## Background
+
+### Motivation
+
+I sometimes [praise] Python wheels as the best binary packaging format
+for cross-platform delivery.  It truly is, at least to my limited knowledge.
+Via compatibility tags, the constraints for the target environment
+can be exclusively specified.  Even though {{pep 425 compressed-tag-sets}}
+was introduced
+
+> to allow for compact filenames of bdists[^bdists]
+> that work with more than one compatibility tag triple,
+
+the filenames are not always *compact* and it is not always possible
+to specify all environments a module can work on either.  For instance,
+there's no direct answer for one of the FAQ:
+
+> **What tag do I use if my distribution uses
+> a feature exclusive to the newest version of Python?**
+
+I have seen workarounds like `cp35.cp36.cp37.cp38.cp39`, which accept
+CPython from version 3.5, but not 3.10, which is the topic of the next Q&A:
+
+> **Why isn't there a `.` in the Python version number?**
+>
+> CPython has lasted 20+ years without a 3-digit major release.
+> This should continue for some time.  Other implementations may use `_`
+> as a delimiter, since both `-` and `.` delimit the surrounding filename.
+
+The platform tag was not carefully standardized either,
+which gives us inconsistent cyphers such as `win32`, `win_amd64`
+and `manylinux2010_x86_64`.[^glibc]
+
+### Inspiration
+
+A few years later, {{pep 508 environment-markers}} came along after multiple
+iterations ({{pep 345}}, {{pep 426}}) with a much more human-readable format,
+which is essentially a Boolean algebra of comparisons.  With this,
+the aforementioned CPython constraint can be correctly expressed as
+
+```python
+platform_python_implementation == 'CPython'
+and python_version >= '3.5'
+```
+
+Unfortunately, they are only standardized for marking dependencies,
+usually to watch out for environments where a library is either unnecessary
+or unusable.  But *who watches the watchers*?  While it is possible
+to [specify Python and platform tags when building wheels][params], the feature
+is undocumented and almost unknown.  Therefore, many, if not most,
+platform-specific pure-Python wheels are tagged as `none` platform and all
+their dependents needs to have the same redundant markers.  Furthermore,
+due to the exclusive nature of compatibility tags, a GNU/Linux-only
+purelib wheel should be tagged with something like the following abomination:
+
+```
+manylinux_2_17_x86_64.manylinux_2_17_i686.manylinux_2_17_aarch64.manylinux_2_17_armv7l.manylinux_2_17_ppc64.manylinux_2_17_ppc64le.manylinux_2_17_s390x
+```
+
+Now imagine the filename of a wheel that only works on Unix-like systems.
+`uvloop` is one and it is among the 500 [most-downloaded on PyPI][3719394]
+at the time of writing.
+
+### Objective
+
+This article is, though, not going to sell you the idea of porting
+environment markers to top-level `WHEEL` metadata.  I should sit down
+with the [PyPA] people and [discuss] it one day, but the currently defined
+markers are lacking CPU and ABI information.  The reason for this is that
+Python's standard library does not implement a way to retrieve such information,
+and for completeness, the operating system's version is not meaningfully
+available either.  If we want to redo this right, I highly recommend
+looking into [Zig]'s [std.Target], which was designed from the start
+for cross-compilation.  Whatever we will come up with, it shall not be
+done overnight.
+
+Instead, right now I need to solve a more immediate problem.  For the last
+few months, I have been working on [IPWHL][ipwhl], a downstream repository for
+Python distributions.  Since we enforce that all dependencies to be satisfied,
+packages with too strict or too lax environment declaration will break
+the automated tests, even though effectively the declared packages integrate
+well with each other.  We came to a conclusion that adding information would
+improve the checks' accuracy and decided to use the [floating cheeses][ipwhl]
+as a testbed for [top-level environment specifiers][proposal].
+
+In short, we are going to merge the `METADATA` field `Requires-Python`
+with other environmental information to a single field `environment`.
+Since other information may or may not be present in the filename,
+during metadata integrity check we can only check if what declared
+is [included][inclusion] by what found from the original wheel.
+
+### Theory
+
+Ignoring `extra`, a Python environment can be represented as a 11-tuple of
+
+```python
+(python_version, python_full_version, os_name, sys_platform,
+ platform_release, platform_system, platform_version,
+ platform_machine, platform_python_implementation,
+ implementation_name, implementation_version)
+```
+
+Denote $S$ as the set of all strings, then the set of all environments
+$E \subset S^{11}$.  Let
+\[O = \{\le, <, \ne, =, \ge, >, \cong, \equiv, \in, \notin, \ni, \notni\}\]
+the set of all marker expressions[^var] can be defined as
+\[X = \{e \mapsto e \bigcirc s \mid \bigcirc \in O \land s \in S\}\]
+and the set of all environment markers[^markers] can be recursively defined as
+\[\begin{aligned}
+  M = \{c_\top\} &\cup \{(e_1, \ldots, e_{11})
+                         \mapsto m(e_1, \ldots, e_{11}) \land x(e_{\imath})
+                         \mid m \in M \land x \in X \land \imath \in I\}\\
+                 &\cup \{(e_1, \ldots, e_{11})
+                         \mapsto m(e_1, \ldots, e_{11}) \lor x(e_{\imath})
+                         \mid m \in M \land x \in X \land \imath \in I\}
+\end{aligned}\]
+where $\top$ is the *top* element in Boolean algebra, $c_\top: E \to \{\top\}$
+is a constant function and $I = [1, 11]$.  With all Boolean expressions
+converted into one of their standard forms, we have the following alternative
+definitions of an environment marker
+\[\begin{aligned}
+  (e_1, \ldots, e_{11})
+  &\mapsto \bigvee_{\imath = 1}^n\bigwedge_{\jmath = 1}^{f(\imath)}
+  x(\jmath)\left(e_{k(\jmath)}\right)\\
+  (e_1, \ldots, e_{11})
+  &\mapsto \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
+  x(\jmath)\left(e_{k(\jmath)}\right)
+\end{aligned}\]
+where $n \in \N$, $f: \N^* \to \N^*$, $x: \N^* \to X$, $k: \N^* \to I$
+and an empty Boolean expression is implicitly $c_\top$.
+
+Now it is possible to formalize inclusion for markers as
+\[\forall\,\{m, n\} \subset M: (m \le n) = (\forall\,e \in E: m(e) \le n(e))\]
+
+Thus, checking if a marker includes another is equivalent to evaluating
+the following expression, for all $(e_1, \ldots, e_{11}) \in E$
+\[\begin{aligned}
+  r &= \bigvee_{\imath' = 1}^{n'}\bigwedge_{\jmath' = 1}^{f'(\imath')}
+       x'(\jmath')\left(e_{k'(\jmath')}\right)
+       \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
+       x(\jmath)\left(e_{k(\jmath)}\right)\\
+    &= \bigwedge_{\imath' = 1}^{n'}\left(\bigwedge_{\jmath' = 1}^{f'(\imath')}
+       x'(\jmath')\left(e_{k'(\jmath')}\right)
+       \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
+       x(\jmath)\left(e_{k(\jmath)}\right)\right)\\
+    &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n
+       \left(\bigwedge_{\jmath' = 1}^{f'(\imath')}
+       x'(\jmath')\left(e_{k'(\jmath')}\right)
+       \le \bigvee_{\jmath = 1}^{f(\imath)}
+       x(\jmath)\left(e_{k(\jmath)}\right)\right)
+\end{aligned}\]
+where $r$ is whether the original marker (defined by $n$, $f$, $x$ and $k$)
+includes the declared one (defined by $n'$, $f'$, $x'$ and $k'$). Let
+\[\begin{aligned}
+  g(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f(\imath)]
+                              \land k(\jmath) = \ell\}\\
+  g'(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f'(\imath)]
+                               \land k'(\jmath) = \ell\}
+\end{aligned}\]
+and assume $e_1, \ldots, e_{11}$ are orthogonal, we have
+\[\begin{aligned}
+  r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
+       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
+       x'(\jmath')\left(e_{\ell}\right)
+       \le \bigvee_{\jmath \in g(\imath, \ell)}
+       x(\jmath)\left(e_{\ell}\right)\right)\\
+    &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
+       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
+       x'(\jmath')\left(e_{\ell}\right)
+       \land \lnot \bigvee_{\jmath \in g(\imath, \ell)}
+       x(\jmath)\left(e_{\ell}\right) = \bot\right)
+\end{aligned}\]
+where $\bot$ is the Boolean algebra bottom element.  By [De Morgan's laws],
+\[\begin{aligned}
+  r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
+       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
+       x'(\jmath')\left(e_{\ell}\right)
+       \land \bigwedge_{\jmath \in g(\imath, \ell)}
+       \lnot x(\jmath)\left(e_{\ell}\right) = \bot\right)
+\end{aligned}\]
+
+Essentially, checking if
+\[\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\jmath')\left(e_{\ell}\right)
+  \land \bigwedge_{\jmath \in g(\imath, \ell)}
+  \lnot x(\jmath)\left(e_{\ell}\right)\]
+is unsatisfiable for all $e_{\ell}$ is a [SAT problem] which we can write
+a solver for.
+
+## Implementation
+
+### Generation
+
+### Conversion
+
+### Verification
+
+[^bdists]: [Built distributions][bdist], as opposed to [sdist]
+[^glibc]: Bonus: try to sort `manylinux1`, `manylinux_2_24` and `manylinux2014`
+[^var]: Ignoring relations between two variables or two strings
+[^markers]: Or *environment specifiers* when [not marking anything][proposal]
+
+[praise]: https://nixnet.social/notice/A9GyFTnNnQDVZrmtUm
+[bdist]: https://packaging.python.org/glossary/#term-Built-Distribution
+[sdist]: https://packaging.python.org/glossary/#term-Source-Distribution-or-sdist
+[params]: https://github.com/pypa/wheel/blob/0.37.0/src/wheel/bdist_wheel.py#L136-L164
+[3719394]: https://hugovk.github.io/top-pypi-packages
+[PyPA]: https://pypa.io
+[discuss]: https://discuss.python.org/t/7831
+[Zig]: https://ziglang.org
+[std.Target]: https://ziglang.org/documentation/master/std/#std;Target
+[ipwhl]: https://man.sr.ht/~cnx/ipwhl
+[proposal]: https://lists.sr.ht/~cnx/ipwhl-discuss/%3CCEP1VQH719Y4.1EX12MJII0WVE%40nix%3E
+[inclusion]: https://en.wikipedia.org/wiki/Inclusion_(Boolean_algebra)
+[De Morgan's laws]: https://en.wikipedia.org/wiki/De_Morgan%27s_laws
+[SAT problem]: https://en.wikipedia.org/wiki/Boolean_satisfiability_problem