From 2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 Mon Sep 17 00:00:00 2001
From: Nguyễn Gia Phong
Date: Tue, 21 Sep 2021 17:02:17 +0700
Subject: Reorganize GSoC 2020
---
 blog/gsoc2020/blog20200609.md | 112 ------------------------------------------
 1 file changed, 112 deletions(-)
 delete mode 100644 blog/gsoc2020/blog20200609.md

diff --git a/blog/gsoc2020/blog20200609.md b/blog/gsoc2020/blog20200609.md
deleted file mode 100644
index b0e6a7b..0000000
--- a/blog/gsoc2020/blog20200609.md
+++ /dev/null
@@ -1,112 +0,0 @@

+++
rss = "GSoC 2020: Unexpected Things When You're Expecting"
date = Date(2020, 6, 9)
+++
@def tags = ["pip", "gsoc"]

# Unexpected Things When You're Expecting

Hi everyone, I hope that you are all doing well, and I wish you all good health!
The last week has not been kind to me, with a decent amount of academic
pressure (my school year lasts until early July).  It would be bold to say
that I have spent ten hours working on my GSoC project since the last
check-in, let alone the required 30 hours per week.  That being said,
there were still some discoveries that I wish to share.

\toc

## The `multiprocessing[.dummy]` wrapper

Most of the time I spent was on finalizing the multi{processing,threading}
wrapper for the `map` function that submits tasks to the worker pool.
To my surprise, it is rather difficult to write something that is
not only portable but also easy to read and test.

As of {{pip 8320 "the latest commit"}}, I have realized the following:

1. The `multiprocessing` module was not designed for its implementation
   details to be abstracted away entirely.  For example, the lazy `map`
   variants can be really slow without specifying a suitable chunk size
   (used to cut the input iterable into batches and distribute them to
   the workers in the pool).  By *suitable*, I mean only an order of
   magnitude smaller than the input size.  This defeats half of the
   purpose of making it lazy: allowing the input to be evaluated lazily.
   Luckily, in the use case I'm aiming for, the iterable argument is short,
   and laziness is only needed for the output (to pipeline download and
   installation).
2. Mocking `import` for testing purposes can never be pretty.  One reason
   is that we (Python users) have very little control over calls to the
   `import` statement and its lower-level implementation, `__import__`.
   In order to properly patch this built-in function, unlike for others
   of the same group, we have to `monkeypatch` the name from `builtins`
   (or `__builtins__` under Python 2) instead of the module doing the
   importing.  Furthermore, because of the special namespacing, to avoid
   infinite recursion we need to alias the function under a different
   name to use as a fallback.
3. To add to the problem, `multiprocessing` lazily imports the fragile
   module during pool creation.  Since the failure is platform-specific
   (caused by the lack of `sem_open`), it was decided to perform the
   check upon import of `pip`'s wrapper module.  Although this behavior
   is easier to reason about, testing it requires invalidating the
   cached import and re-importing the wrapper module.
4. Last but not least, I now understand the pain of maintaining Python 2
   compatibility that many package maintainers still have to deal with
   every day (although Python 2 has reached its end of life, `pip`, for
   example, {{pip 6148 "will still support it for another year"}}).

## The change in direction

Since last week, my mentor Pradyun Gedam and I have set up a weekly
real-time meeting (a fancy term for a video/audio chat in this worldwide
quarantine era) for the entire GSoC period.  During the last session,
we decided to put the parallelization of downloads during resolution
on hold, in favor of a more beneficial goal: {{pip 7819 "partially
download the wheels during dependency resolution"}}.
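Before moving on, here is point 2 from the list above made concrete: a minimal
sketch of patching `builtins.__import__`, with the alias-before-patching dance
to avoid infinite recursion.  This is illustrative only, not `pip`'s actual
test code; I picked `multiprocessing.synchronize` as the patched name because
it is the module that needs `sem_open`.

```python
import builtins

# Alias the real function under a different name FIRST; otherwise the
# patched function would recurse into itself when falling back.
real_import = builtins.__import__

def fake_import(name, *args, **kwargs):
    # Simulate the platform-specific failure: the module that needs
    # sem_open fails to import.
    if name == "multiprocessing.synchronize":
        raise ImportError("this platform lacks sem_open")
    return real_import(name, *args, **kwargs)

# Patch the name on builtins, not on the module doing the importing.
builtins.__import__ = fake_import
try:
    try:
        import multiprocessing.synchronize
        have_semaphores = True
    except ImportError:
        have_semaphores = False
finally:
    builtins.__import__ = real_import

print(have_semaphores)  # → False
```

Note that the `import` statement always goes through `builtins.__import__`,
even when the module is already cached in `sys.modules`, which is why the
patch takes effect at all.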
![](/assets/swirl.png)

As discussed by Danny McClanahan and the maintainers of `pip`, it is
feasible to download only a few kB of a wheel to obtain enough metadata
for dependency resolution.  While this is only applicable to wheels
(i.e. prebuilt packages), other packaging formats make up less than 20%
of the downloads (at least on PyPI), and the figure is much lower for
the most popular packages.  Therefore, this optimization alone could put
[the upcoming backtracking resolver][]'s performance on par with the
legacy one's.

Over the last few years, a lot of effort has been poured into replacing
`pip`'s current resolver, which is unable to resolve dependency
conflicts.  While its correctness will be ensured by some of the most
talented and hard-working developers in the Python packaging community,
from the users' point of view it would be better if its performance did
not lag behind the old one's.  Aside from the extra CPU cycles spent on
more rigorous resolution, more I/O, especially networking, is expected
to be performed.  This is due to {{pip 7406#issuecomment-583891169 "the
lack of a standard and efficient way to acquire the metadata"}}.
Therefore, unlike most package managers we are familiar with, `pip` has
to fetch (and possibly build) the packages solely for their dependency
information.

Fortunately, {{pep 427 recommended-archiver-features}} recommends that
package builders place the metadata at the end of the archive.  This
allows the resolver to fetch only the last few kB via
[HTTP range requests][] for the relevant information.  Simply appending
`Range: bytes=-8000` to the request header in
`pip._internal.network.download` makes the resolution process
*lightning* fast.  Of course this breaks the installation, but I am
confident that it is not difficult to implement this optimization
cleanly.

One drawback of this optimization is compatibility.
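To see why fetching only the tail works, recall that a wheel is a zip archive
whose central directory (the index of its members) sits at the very end of the
file.  Here is a rough, self-contained simulation of a `Range: bytes=-8000`
fetch; the archive contents are made up for illustration, and this is not
`pip`'s actual code:

```python
import io
import zipfile

# Build a wheel-like zip in memory: a large payload first, then the
# metadata last, as PEP 427 recommends archivers do.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as wheel:
    wheel.writestr("demo/big_module.py", "x = 'data'\n" * 10_000)
    wheel.writestr("demo-1.0.dist-info/METADATA",
                   "Name: demo\nVersion: 1.0\n")
data = buf.getvalue()

# Simulate "Range: bytes=-8000": keep only the last 8000 bytes and pad
# the front with zeros so the member offsets still line up.
tail = data[-8000:]
partial = b"\0" * (len(data) - len(tail)) + tail

with zipfile.ZipFile(io.BytesIO(partial)) as wheel:
    # The central directory at the end is intact, so the member list
    # is readable...
    names = wheel.namelist()
    # ...and so is the metadata, since it was stored near the end.
    # Reading demo/big_module.py would fail: its bytes were zeroed out.
    metadata = wheel.read("demo-1.0.dist-info/METADATA").decode()

print(names)
print(metadata)
```

The zeroed-out region is exactly why the partial wheel can serve metadata but
not installation: any member stored before the fetched tail is unreadable.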
Not every Python package index supports range requests, and it is not
possible to verify a partial wheel against its hash.  While the first
case is unavoidable, in the second, hash checking is usually used with
pinned/locked-version requirements, for which no backtracking is done
during dependency resolution anyway.

Either way, before installation, the packages selected by the resolver
can be downloaded in parallel.  This guarantees a larger pool of
packages to fetch concurrently than parallelization during resolution,
where the number of downloads can be as low as one during trials of
different versions of the same package.

Unfortunately, I have not been able to do much other than
{{pip 8411 "a minor clean-up"}}.  I am looking forward to accomplishing
more this week and seeing where this path will lead us!  At the moment,
I am happy that I was able to meet the blog deadline, at least in UTC!

[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts
[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests