+++
title = "Unexpected Things When You're Expecting"
rss = "GSoC 2020: Unexpected Things When You're Expecting"
date = Date(2020, 6, 9)
tags = ["gsoc", "pip", "python"]
+++

Hi everyone, I hope that you are all doing well, and I wish you all good health!
The last week has not been really kind to me, with a decent amount of
academic pressure (my school year lasts until early July).
It would be bold to say that I have spent even 10 hours working on my GSoC project
since the last check-in, let alone meeting the 30-hours-per-week requirement.
That being said, there were still some discoveries that I wish to share.

\toc

## The `multiprocessing[.dummy]` wrapper

Most of the time I spent went into finalizing the multi{processing,threading}
wrapper for the `map` function that submits tasks to the worker pool.
To my surprise, it is rather difficult to write something that is
not only portable but also easy to read and test.

As of {{pip 8320 "the latest commit"}}, I have realized the following:

1. The `multiprocessing` module was not designed for its implementation
   details to be abstracted away entirely.  For example, the lazy `map`
   variants can be really slow without a suitable chunk size
   (which cuts the input iterable into chunks and distributes them to the
   workers in the pool).  By *suitable*, I mean only an order of magnitude
   smaller than the length of the input (the first sketch after this list
   illustrates the idea).  This defeats half of the purpose of making it
   lazy: allowing the input to be evaluated lazily.  Luckily, in the use
   case I'm aiming for, the iterable argument is short and the laziness is
   only needed for the output (to pipeline download and installation).
2. Mocking `import` for testing purposes can never be pretty.  One reason
   is that we (Python users) have very little control over calls to the
   `import` statement and its lower-level implementation `__import__`.
   To properly patch this built-in function, unlike others of the same
   group, we have to `monkeypatch` the name from `builtins`
   (or `__builtins__` under Python 2) instead of the module doing the
   importing.  Furthermore, because of the special namespacing, we need to
   alias the original function to a different name to fall back on, or we
   end up with infinite recursion (the second sketch after this list shows
   one way to do this).
3. To add to the problem, `multiprocessing` lazily imports the fragile
   module during pool creation.  Since the failure is platform-specific
   (the lack of `sem_open`), it was decided to perform the check upon the
   import of `pip`'s wrapper module.  Although this behavior is easier to
   reason about in human language, testing it requires invalidating the
   cached import and re-importing the wrapper module.
4. Last but not least, I now understand the pain of maintaining Python 2
   compatibility that many package maintainers still have to deal with
   every day (although Python 2 has reached its end of life, `pip`, for
   example, {{pip 6148 "will still support it for another year"}}).
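
To make the first point more concrete, here is a minimal sketch of the lazy
`imap` with an explicit chunk size, using the thread-backed
`multiprocessing.dummy` pool.  The `fetch` function, the URLs and the chunk
size are all made-up placeholders, not `pip`'s actual code:

```python
from multiprocessing.dummy import Pool  # thread-backed twin of multiprocessing


def fetch(url):
    """Stand-in for a download routine (hypothetical)."""
    return "downloaded " + url


urls = ["https://example.org/pkg{}.whl".format(i) for i in range(8)]

with Pool(4) as pool:
    # imap is the lazy variant of map: results are yielded as they complete,
    # so the consumer below can start working on the first package while the
    # remaining downloads are still in flight.  The chunksize argument
    # controls how many items are handed to a worker at once: the default of
    # 1 maximizes laziness but adds scheduling overhead, while a value an
    # order of magnitude smaller than the input keeps the overhead low
    # without consuming the whole iterable up front.
    for result in pool.imap(fetch, urls, chunksize=2):
        print(result)
```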
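
And for the second and third points, here is a rough sketch (not `pip`'s
actual test suite) of patching `__import__` from `builtins` with pytest's
`monkeypatch` fixture, keeping an alias to the real function to avoid
infinite recursion; the wrapper module `mywrapper` and its attributes are
hypothetical:

```python
import builtins
import importlib

# Keep a reference under a different name so the fake can fall back to the
# real implementation without recursing into itself.
_original_import = builtins.__import__


def test_fallback_without_sem_open(monkeypatch):
    """Pretend we are on a platform where multiprocessing cannot be used."""

    def fake_import(name, *args, **kwargs):
        # Simulate the platform-specific failure: importing the
        # synchronization primitives fails when sem_open is unavailable.
        if name == "multiprocessing.synchronize":
            raise ImportError("this platform lacks a functioning sem_open")
        return _original_import(name, *args, **kwargs)

    # The built-in must be patched on the builtins module itself, not on the
    # module that performs the import.
    monkeypatch.setattr(builtins, "__import__", fake_import)

    import mywrapper  # hypothetical module wrapping multiprocessing[.dummy]

    # Re-execute the module body so the patched __import__ is actually hit.
    importlib.reload(mywrapper)
    assert mywrapper.map_multithread is mywrapper.map_sequential  # hypothetical
```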

## The change in direction

Since last week, my mentor Pradyun Gedam and I have set up a weekly real-time
meeting (a fancy term for a video/audio chat in the worldwide quarantine
era) for the entire GSoC period.  During the last session, we decided to
put the parallelization of downloads during resolution on hold, in favor of a
more beneficial goal: {{pip 7819 "partially download the wheels during
dependency resolution"}}.

![](/assets/swirl.png)

As discussed by Danny McClanahan and the maintainers of `pip`, it is feasible
to download only a few kB of a wheel to obtain enough metadata for
dependency resolution.  While this is only applicable to wheels
(i.e. prebuilt packages), other packaging formats make up less than 20%
of the downloads (at least on PyPI), and the figure is much lower for
the most popular packages.  Therefore, this optimization alone could bring
[the upcoming backtracking resolver]'s performance on par with the legacy one.

Over the last few years, a lot of effort has been poured into
replacing `pip`'s current resolver, which is unable to resolve conflicts.
While its correctness will be ensured by some of the most talented and
hard-working developers in the Python packaging community, from the users'
point of view, it would be better if its performance did not lag
behind the old one's.  Aside from the increase in CPU cycles for more
rigorous resolution, more I/O, especially networking operations, is expected
to be performed.  This is due to {{pip 7406#issuecomment-583891169 "the lack
of a standard and efficient way to acquire the metadata"}}.  Therefore, unlike
most package managers we are familiar with, `pip` has to fetch
(and possibly build) the packages solely for their dependency information.

Fortunately, {{pep 427 recommended-archiver-features}} recommends that
package builders place the metadata at the end of the archive.
This allows the resolver to fetch only the last few kB via
[HTTP range requests] for the relevant information.
Simply appending `Range: bytes=-8000` to the request headers
in `pip._internal.network.download` makes the resolution process
*lightning* fast.  Of course this breaks the installation, but I am confident
that it is not difficult to implement this optimization cleanly.
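
As a rough illustration of the idea (not the actual change to
`pip._internal.network.download`), fetching just the tail of a wheel could
look like the following; the URL is a made-up example and `requests` is used
directly for simplicity:

```python
import requests

# Hypothetical wheel URL; any index that serves static files will do.
url = "https://example.org/packages/example-1.0-py3-none-any.whl"

# A negative range asks for the *last* N bytes of the file, which is where
# PEP 427 recommends archivers place the .dist-info metadata.
response = requests.get(url, headers={"Range": "bytes=-8000"})

if response.status_code == 206:
    # 206 Partial Content: the index honored the range request, so we only
    # have the tail of the wheel, usually enough to read the metadata from.
    tail = response.content
else:
    # The index ignored the header and sent the whole file; fall back to
    # slicing the tail out of the full response.
    tail = response.content[-8000:]
```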

One drawback of this optimization is compatibility.  Not every Python
package index supports range requests, and a partially downloaded wheel
cannot be verified against its hash.  While the first case is unavoidable,
the second matters less in practice: hash checking is usually used with
pinned/locked-version requirements, for which no backtracking is done during
dependency resolution anyway.

Either way, before installation, the packages selected by the resolver
can be downloaded in parallel.  This guarantees a larger batch of packages
to fetch at once, compared to parallelization during resolution, where the
number of simultaneous downloads can be as low as one while different
versions of the same package are being tried.

Unfortunately, I have not been able to do much other than
{{pip 8411 "a minor clean up"}}.  I am looking forward to accomplishing more
this week and seeing what this path will lead us to!  At the moment,
I am happy that I'm able to meet the blog deadline, at least in UTC!

[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts
[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests