blog/boolin.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233

<!--
  -- title = "Inclusion in Python Environment Algebra"
  -- rss = "Specifying a subset of environment constraints for Python packages"
  -- date = Date(2021, 11, 26)
  -- tags = ["fun", "math", "packaging", "python"]
  -->
# Inclusion in Python Environment Algebra

!!! note "Note"

    If you do not want to hear me rambling about the shortcomings
    of Python packaging metadata standards, feel free to skip
    to section [1.c](#objective).

\toc

## Background

### Motivation

I sometimes [praise] Python wheels as the best binary packaging format
for cross-platform delivery.  It truly is, at least to my limited knowledge.
Via compatibility tags, the constraints for the target environment
can be exclusively specified.  Even though {{pep 425 compressed-tag-sets}}
was introduced

> to allow for compact filenames of bdists[^bdists]
> that work with more than one compatibility tag triple,

the filenames are not always *compact* and it is not always possible
to specify all environments a module can work on either.  For instance,
there's no direct answer for one of the FAQ:

> **What tag do I use if my distribution uses
> a feature exclusive to the newest version of Python?**

I have seen workarounds like `cp35.cp36.cp37.cp38.cp39`, which accept
CPython from version 3.5, but not 3.10, which is the topic of the next Q&A:

> **Why isn't there a `.` in the Python version number?**
>
> CPython has lasted 20+ years without a 3-digit major release.
> This should continue for some time.  Other implementations may use `_`
> as a delimiter, since both `-` and `.` delimit the surrounding filename.

The platform tag was not carefully standardized either,
which gives us inconsistent cyphers such as `win32`, `win_amd64`
and `manylinux2010_x86_64`.[^glibc]

### Inspiration

A few years later, {{pep 508 environment-markers}} came along after multiple
iterations ({{pep 345}}, {{pep 426}}) with a much more human-readable format,
which is essentially a Boolean algebra of comparisons.  With this,
the aforementioned CPython constraint can be correctly expressed as

```python
platform_python_implementation == 'CPython'
and python_version >= '3.5'
```

Unfortunately, they are only standardized for marking dependencies,
usually to watch out for environments where a library is either unnecessary
or unusable.  But *who watches the watchers*?  While it is possible
to [specify Python and platform tags when building wheels][params], the feature
is undocumented and almost unknown.  Therefore, many, if not most,
platform-specific pure-Python wheels are tagged as `none` platform and all
their dependents needs to have the same redundant markers.  Furthermore,
due to the exclusive nature of compatibility tags, a GNU/Linux-only
purelib wheel should be tagged with something like the following abomination:

```
manylinux_2_17_x86_64.manylinux_2_17_i686.manylinux_2_17_aarch64.manylinux_2_17_armv7l.manylinux_2_17_ppc64.manylinux_2_17_ppc64le.manylinux_2_17_s390x
```

Now imagine the filename of a wheel that only works on Unix-like systems.
`uvloop` is one and it is among the 500 [most-downloaded on PyPI][3719394]
at the time of writing.

### Objective

This article is, though, not going to sell you the idea of porting
environment markers to top-level `WHEEL` metadata.  I should sit down
with the [PyPA] people and [discuss] it one day, but the currently defined
markers are lacking CPU and ABI information.  The reason for this is that
Python's standard library does not implement a way to retrieve such information,
and for completeness, the operating system's version is not meaningfully
available either.  If we want to redo this right, I highly recommend
looking into [Zig]'s [std.Target], which was designed from the start
for cross-compilation.  Whatever we will come up with, it shall not be
done overnight.

Instead, right now I need to solve a more immediate problem.  For the last
few months, I have been working on [IPWHL][ipwhl], a downstream repository for
Python distributions.  Since we enforce that all dependencies to be satisfied,
packages with too strict or too lax environment declaration will break
the automated tests, even though effectively the declared packages integrate
well with each other.  We came to a conclusion that adding information would
improve the checks' accuracy and decided to use the [floating cheeses][ipwhl]
as a testbed for [top-level environment specifiers][proposal].

In short, we are going to merge the `METADATA` field `Requires-Python`
with other environmental information to a single field `environment`.
Since other information may or may not be present in the filename,
during metadata integrity check we can only check if what declared
is [included][inclusion] by what found from the original wheel.

### Theory

Ignoring `extra`, a Python environment can be represented as a 11-tuple of

```python
(python_version, python_full_version, os_name, sys_platform,
 platform_release, platform_system, platform_version,
 platform_machine, platform_python_implementation,
 implementation_name, implementation_version)
```

Denote $S$ as the set of all strings, then the set of all environments
$E \subset S^{11}$.  Let
\[O = \{\le, <, \ne, =, \ge, >, \cong, \equiv, \in, \notin, \ni, \notni\}\]
the set of all marker expressions[^var] can be defined as
\[X = \{e \mapsto e \bigcirc s \mid \bigcirc \in O \land s \in S\}\]
and the set of all environment markers[^markers] can be recursively defined as
\[\begin{aligned}
  M = \{c_\top\} &\cup \{(e_1, \ldots, e_{11})
                         \mapsto m(e_1, \ldots, e_{11}) \land x(e_{\imath})
                         \mid m \in M \land x \in X \land \imath \in I\}\\
                 &\cup \{(e_1, \ldots, e_{11})
                         \mapsto m(e_1, \ldots, e_{11}) \lor x(e_{\imath})
                         \mid m \in M \land x \in X \land \imath \in I\}
\end{aligned}\]
where $\top$ is the *top* element in Boolean algebra, $c_\top: E \to \{\top\}$
is a constant function and $I = [1, 11]$.  With all Boolean expressions
converted into one of their standard forms, we have the following alternative
definitions of an environment marker
\[\begin{aligned}
  (e_1, \ldots, e_{11})
  &\mapsto \bigvee_{\imath = 1}^n\bigwedge_{\jmath = 1}^{f(\imath)}
  x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\\
  (e_1, \ldots, e_{11})
  &\mapsto \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
  x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)
\end{aligned}\]
where $n \in \N$, $f: \N_1 \to \N_1$, $x: \N_1^2 \to X$, $k: \N_1^2 \to I$
and an empty Boolean expression is implicitly $c_\top$.

Now it is possible to formalize inclusion for markers as
\[\forall\,\{m, n\} \subset M: (m \le n) = (\forall\,e \in E: m(e) \le n(e))\]

Thus, checking if a marker includes another is equivalent to evaluating
the following expression, for all $(e_1, \ldots, e_{11}) \in E$
\[\begin{aligned}
  r &= \bigvee_{\imath' = 1}^{n'}\bigwedge_{\jmath' = 1}^{f'(\imath')}
       x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right)
       \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
       x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\\
    &= \bigwedge_{\imath' = 1}^{n'}\left(\bigwedge_{\jmath' = 1}^{f'(\imath')}
       x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right)
       \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)}
       x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\right)\\
    &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n
       \left(\bigwedge_{\jmath' = 1}^{f'(\imath')}
       x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right)
       \le \bigvee_{\jmath = 1}^{f(\imath)}
       x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\right)
\end{aligned}\]
where $r$ is whether the original marker (defined by $n$, $f$, $x$ and $k$)
includes the declared one (defined by $n'$, $f'$, $x'$ and $k'$). Let
\[\begin{aligned}
  g(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f(\imath)]
                              \land k(\imath, \jmath) = \ell\}\\
  g'(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f'(\imath)]
                               \land k'(\imath, \jmath) = \ell\}
\end{aligned}\]
and assume $e_1, \ldots, e_{11}$ are orthogonal, we have
\[\begin{aligned}
  r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
       x'(\imath', \jmath')\left(e_{\ell}\right)
       \le \bigvee_{\jmath \in g(\imath, \ell)}
       x(\imath, \jmath)\left(e_{\ell}\right)\right)\\
    &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
       x'(\imath', \jmath')\left(e_{\ell}\right)
       \land \lnot \bigvee_{\jmath \in g(\imath, \ell)}
       x(\imath, \jmath)\left(e_{\ell}\right) = \bot\right)
\end{aligned}\]
where $\bot$ is the Boolean algebra bottom element.  By [De Morgan's laws],
\[\begin{aligned}
  r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I}
       \left(\bigwedge_{\jmath' \in g'(\imath', \ell)}
       x'(\imath', \jmath')\left(e_{\ell}\right)
       \land \bigwedge_{\jmath \in g(\imath, \ell)}
       \lnot x(\imath, \jmath)\left(e_{\ell}\right) = \bot\right)
\end{aligned}\]

Essentially, checking if
\[\bigwedge_{\jmath' \in g'(\imath', \ell)}
  x'(\imath', \jmath')\left(e_{\ell}\right)
  \land \bigwedge_{\jmath \in g(\imath, \ell)}
  \lnot x(\imath, \jmath)\left(e_{\ell}\right)\]
is unsatisfiable for all $e_{\ell}$ is a [SAT problem] which we can write
a solver for.  Indeed, the original expression could be formulated as a SAT,
but it would be more complex due to the multiple dimensions of environments.

## Implementation

### Generation

### Conversion

### Verification

[^bdists]: [Built distributions][bdist], as opposed to [sdist]
[^glibc]: Bonus: try to sort `manylinux1`, `manylinux_2_24` and `manylinux2014`
[^var]: Ignoring relations between two variables or two strings
[^markers]: Or *environment specifiers* when [not marking anything][proposal]

[praise]: https://nixnet.social/notice/A9GyFTnNnQDVZrmtUm
[bdist]: https://packaging.python.org/glossary/#term-Built-Distribution
[sdist]: https://packaging.python.org/glossary/#term-Source-Distribution-or-sdist
[params]: https://github.com/pypa/wheel/blob/0.37.0/src/wheel/bdist_wheel.py#L136-L164
[3719394]: https://hugovk.github.io/top-pypi-packages
[PyPA]: https://pypa.io
[discuss]: https://discuss.python.org/t/7831
[Zig]: https://ziglang.org
[std.Target]: https://ziglang.org/documentation/master/std/#std;Target
[ipwhl]: https://man.sr.ht/~cnx/ipwhl
[proposal]: https://lists.sr.ht/~cnx/ipwhl-discuss/%3CCEP1VQH719Y4.1EX12MJII0WVE%40nix%3E
[inclusion]: https://en.wikipedia.org/wiki/Inclusion_(Boolean_algebra)
[De Morgan's laws]: https://en.wikipedia.org/wiki/De_Morgan%27s_laws
[SAT problem]: https://en.wikipedia.org/wiki/Boolean_satisfiability_problem