From 2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 Mon Sep 17 00:00:00 2001
From: Nguyễn Gia Phong
Date: Tue, 21 Sep 2021 17:02:17 +0700
Subject: Reorganize GSoC 2020

---
 blog/2020/gsoc/article/1.md | 112 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 blog/2020/gsoc/article/1.md

(limited to 'blog/2020/gsoc/article/1.md')

diff --git a/blog/2020/gsoc/article/1.md b/blog/2020/gsoc/article/1.md
new file mode 100644
index 0000000..b0e6a7b
--- /dev/null
+++ b/blog/2020/gsoc/article/1.md
@@ -0,0 +1,112 @@
++++
+rss = "GSoC 2020: Unexpected Things When You're Expecting"
+date = Date(2020, 6, 9)
++++
+@def tags = ["pip", "gsoc"]
+
+# Unexpected Things When You're Expecting
+
+Hi everyone, I hope that you are all doing well, and I wish you all good health!
+The last week has not been really kind to me, with a decent amount of
+academic pressure (my school year lasts until early July).
+It would be bold to say that I have spent 10 hours working on my GSoC project
+since the last check-in, let alone the 30 hours per week requirement.
+That being said, there were still some discoveries that I wish to share.
+
+\toc
+
+## The `multiprocessing[.dummy]` wrapper
+
+Most of my time was spent finalizing the multi{processing,threading}
+wrapper for the `map` function that submits tasks to the worker pool.
+To my surprise, it is rather difficult to write something that is
+not only portable but also easy to read and test.
+
+By {{pip 8320 "the latest commit"}}, I realized the following:
+
+1. The `multiprocessing` module was not designed for its implementation
+   details to be abstracted away entirely. For example, the lazy `map`s
+   can be really slow unless a suitable chunk size is specified (to cut
+   the input iterable into pieces and distribute them to the workers in
+   the pool). By *suitable*, I mean only an order of magnitude smaller than
+   the input. This defeats half of the purpose of making it lazy:
+   allowing the input to be evaluated lazily.
+   Luckily, in my use case, the iterable argument is short and laziness is
+   only needed for the output (to pipeline download and installation).
+2. Mocking `import` for testing purposes can never be pretty. One reason
+   is that we (Python users) have very little control over the calls of
+   `import` statements and their lower-level implementation `__import__`.
+   In order to properly patch this built-in function, unlike others
+   of its kind, we have to `monkeypatch` the name from `builtins`
+   (or `__builtin__` under Python 2) instead of the module doing the import.
+   Furthermore, because of the special namespacing, to avoid infinite recursion
+   we need to alias the function to a different name for fallback.
+3. To add to the problem, `multiprocessing` lazily imports the fragile module
+   during pool creation. Since the failure is platform-specific
+   (the lack of `sem_open`), it was decided to check for it upon the import
+   of `pip`'s module. Although this behavior is easier to reason about
+   in human language, testing it requires invalidating the cached import and
+   re-importing the wrapper module.
+4. Last but not least, I now understand the pain of keeping Python 2
+   compatibility that many package maintainers still need to deal with
+   every day (although Python 2 has reached its end-of-life, `pip`, for
+   example, {{pip 6148 "will still support it for another year"}}).
+
+## The change in direction
+
+Last week, my mentor Pradyun Gedam and I set up a weekly real-time
+meeting (a fancy term for video/audio chat in the worldwide quarantine
+era) for the entire GSoC period. During the last session, we decided to
+put parallelization of download during resolution on hold, in favor of a
+more beneficial goal: {{pip 7819 "partially download the wheels during
+dependency resolution"}}.
+
+![](/assets/swirl.png)
+
+As discussed by Danny McClanahan and the maintainers of `pip`, it is feasible
+to only download a few kB of a wheel to obtain enough metadata for
+dependency resolution. While this is only applicable to wheels
+(i.e. prebuilt packages), other packaging formats make up less than 20%
+of the downloads (at least on PyPI), and the figure is much lower for
+the most popular packages. Therefore, this optimization alone could put
+[the upcoming backtracking resolver][]'s performance on par with the legacy one.
+
+Over the last few years, a lot of effort has been poured into
+replacing `pip`'s current resolver, which is unable to resolve conflicts.
+While its correctness will be ensured by some of the most talented and
+hard-working developers in the Python packaging community, from the users'
+point of view, it would be better if its performance did not lag
+behind the old one's. Aside from the increase in CPU cycles for more
+rigorous resolution, more I/O, especially network operations, is expected
+to be performed. This is due to {{pip 7406#issuecomment-583891169 "the lack
+of a standard and efficient way to acquire the metadata"}}. Therefore, unlike
+most package managers we are familiar with, `pip` has to fetch
+(and possibly build) the packages solely for dependency information.
+
+Fortunately, {{pep 427 recommended-archiver-features}} recommends that
+package builders place the metadata at the end of the archive.
+This allows the resolver to only fetch the last few kB using
+[HTTP range requests][] for the relevant information.
+Simply adding `Range: bytes=-8000` to the request headers
+in `pip._internal.network.download` makes the resolution process
+*lightning* fast. Of course, this breaks the installation, but I am confident
+that it is not difficult to implement this optimization cleanly.
+
+One drawback of this optimization is compatibility.
+Not every Python package index supports range requests, and it is not
+possible to verify the partial wheel. While the first case is unavoidable,
+for the other, hash checking is usually used with pinned/locked-version
+requirements, so no backtracking is done during dependency resolution.
+
+Either way, before installation, the packages selected by the resolver
+can be downloaded in parallel. This guarantees a larger batch of packages,
+compared to parallelization during resolution, where the number of downloads
+can be as low as one while trying different versions of the same package.
+
+Unfortunately, I have not been able to do much other than
+{{pip 8411 "a minor clean up"}}. I am looking forward to accomplishing more
+this week and seeing where this path will lead us! At the moment,
+I am happy that I'm able to meet the blog deadline, at least in UTC!
+
+[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts
+[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
--
cgit 1.4.1
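
The range-request idea described in the post can be tried out without a network. The sketch below is not `pip`'s implementation: `FakeServer`, `RangeReader`, `build_fake_wheel` and the `spam`/`eggs` names are all hypothetical, and `get_range` merely stands in for an HTTP request carrying a `Range: bytes=start-end` header. It also assumes the archive size is known up front (over HTTP it could, for instance, come from a `Content-Length` header). The point is only to show that, when the metadata sits at the end of the archive, `zipfile` can read it while touching a small fraction of the bytes.

```python
import io
import zipfile


def build_fake_wheel() -> bytes:
    """Build an in-memory zip laid out like a wheel: payload first, metadata last."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("spam/__init__.py", b"#" * 100_000)
        archive.writestr(
            "spam-1.0.dist-info/METADATA",
            "Name: spam\nVersion: 1.0\nRequires-Dist: eggs\n",
        )
    return buffer.getvalue()


class FakeServer:
    """Hypothetical stand-in for a package index that honors range requests."""

    def __init__(self, body: bytes) -> None:
        self.body = body
        self.bytes_sent = 0

    def get_range(self, start: int, end: int) -> bytes:
        """Mimic a response to 'Range: bytes=start-end' (end is inclusive)."""
        chunk = self.body[start:end + 1]
        self.bytes_sent += len(chunk)
        return chunk


class RangeReader:
    """Read-only, seekable file-like object that fetches bytes on demand."""

    def __init__(self, size: int, fetch) -> None:
        self._size = size    # total archive size, assumed known in advance
        self._fetch = fetch  # callable(start, end) -> bytes
        self._pos = 0

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}
        self._pos = max(0, base[whence] + offset)
        return self._pos

    def read(self, n: int = -1) -> bytes:
        # A negative or missing size means "read to the end of the file".
        end = self._size if n is None or n < 0 else min(self._pos + n, self._size)
        if end <= self._pos:
            return b""
        data = self._fetch(self._pos, end - 1)
        self._pos += len(data)
        return data


server = FakeServer(build_fake_wheel())
reader = RangeReader(len(server.body), server.get_range)
with zipfile.ZipFile(reader) as archive:
    metadata = archive.read("spam-1.0.dist-info/METADATA").decode()

print(metadata)
print(f"fetched {server.bytes_sent} of {len(server.body)} bytes")
```

Over real HTTP every `get_range` call would cost a round trip, so fetching one larger tail in a single request (in the spirit of the `Range: bytes=-8000` experiment) and serving the reads from that buffer would likely be preferable to the many tiny reads made here.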