about summary refs log tree commit diff homepage
path: root/blog/gsoc2020/blog20200609.md
diff options
context:
space:
mode:
Diffstat (limited to 'blog/gsoc2020/blog20200609.md')
-rw-r--r--blog/gsoc2020/blog20200609.md112
1 files changed, 0 insertions, 112 deletions
diff --git a/blog/gsoc2020/blog20200609.md b/blog/gsoc2020/blog20200609.md
deleted file mode 100644
index b0e6a7b..0000000
--- a/blog/gsoc2020/blog20200609.md
+++ /dev/null
@@ -1,112 +0,0 @@
-+++
-rss = "GSoC 2020: Unexpected Things When You're Expecting"
-date = Date(2020, 6, 9)
-+++
-@def tags = ["pip", "gsoc"]
-
-# Unexpected Things When You're Expecting
-
-Hi everyone, I hope that you are all doing well and wishes you all good health!
-The last week has not been really kind to me with a decent amount of
-academic pressure (my school year is lasting until early Jully).
-It would be bold to say that I have spent 10 hours working on my GSoC project
-since the last check-in, let alone the 30 hours per week requirement.
-That being said, there were still some discoveries that I wish to share.
-
-\toc
-
-## The `multiprocessing[.dummy]` wrapper
-
-Most of the time I spent was to finalize the multi{processing,threading}
-wrapper for `map` function that submit tasks to the worker pool.
-To my surprise, it is rather difficult to write something that is
-not only portable but also easy to read and test.
-
-By {{pip 8320 "the latest commit"}}, I realized the following:
-
-1. The `multiprocessing` module was not designed for the implementation
-   details to be abstracted away entirely.  For example, the lazy `map`'s
-   could be really slow without specifying suitable chunk size
-   (to cut the input iterable and distribute them to workers in the pool).
-   By *suitable*, I mean only an order smaller than the input.  This defeats
-   half of the purpose of making it lazy: allowing the input to be
-   evaluated lazily.  Luckily, in the use case I'm aiming for, the length of
-   the iterable argument is small and the laziness is only needed for the output
-   (to pipeline download and installation).
-2. Mocking `import` for testing purposes can never be pretty.  One reason
-   is that we (Python users) have very little control over the calls of
-   `import` statements and its lower-level implementation `__import__`.
-   In order to properly patch this built-in function, unlike for others
-   of the same group, we have to `monkeypatch` the name from `builtins`
-   (or `__builtins__` under Python 2) instead of the module that import stuff.
-   Furthermore, because of the special namespacing, to avoid infinite recursion
-   we need to alias the function to a different name for fallback.
-3. To add to the problem, `multiprocessing` lazily imports the fragile module
-   during pools creation.  Since the failure is platform-specific
-   (the lack of `sem_open`), it was decided to check upon the import
-   of the `pip`'s module.  Although the behavior is easier to reason
-   in human language, testing it requires invalidating cached import and
-   re-import the wrapper module.
-4. Last but not least, I now understand the pain of keeping Python 2
-   compatibility that many package maintainers still need to deal with
-   everyday (although Python 2 has reached its end-of-life, `pip`, for
-   example, {{pip 6148 "will still support it for another year"}}).
-
-## The change in direction
-
-Since last week, my mentor Pradyun Gedam and I set up weekly real-time
-meeting (a fancy term for video/audio chat in the worldwide quarantine
-era) for the entire GSoC period. During the last session, we decided to
-put parallelization of download during resolution on hold, in favor of a
-more beneficial goal: {{pip 7819 "partially download the wheels during
-dependency resolution"}}.
-
-![](/assets/swirl.png)
-
-As discussed by Danny McClanahan and the maintainers of `pip`, it is feasible
-to only download a few kB of a wheel to obtain enough metadata for
-the resolution of dependency.  While this is only applicable to wheels
-(i.e. prebuilt packages), other packaging format only make up less than 20%
-of the downloads (at least on PyPI), and the figure is much less for
-the most popular packages.  Therefore, this optimization alone could make
-[the upcoming backtracking resolver][]'s performance par with the legacy one.
-
-During the last few years, there has been a lot of effort being poured into
-replacing `pip`'s current resolver that is unable to resolve conflicts.
-While its correctness will be ensured by some of the most talented and
-hard-working developers in the Python packaging community, from the users'
-point of view, it would be better to have its performance not lagging
-behind the old one.  Aside from the increase in CPU cycles for more
-rigorous resolution, more I/O, especially networking operations is expected
-to be performed.  This is due to {{pip 7406#issuecomment-583891169 "the lack
-of a standard and efficient way to acquire the metadata"}}.  Therefore, unlike
-most package managers we are familiar with, `pip` has to fetch
-(and possibly build) the packages solely for dependency informations.
-
-Fortunately, {{pep 427 recommended-archiver-features}} recommends
-package builders to place the metadata at the end of the archive.
-This allows the resolver to only fetch the last few kB using
-`HTTP range requests`_ for the relevant information.
-Simply appending `Range: bytes=-8000` to the request header
-in `pip._internal.network.download` makes the resolution process
-*lightning* fast.  Of course this breaks the installation but I am confident
-that it is not difficult to implement this optimization cleanly.
-
-One drawback of this optimization is the compatibility.  Not every Python
-package index support range requests, and it is not possible to verify
-the partial wheel.  While the first case is unavoidable, for the other,
-hashes checking is usually used for pinned/locked-version requirements,
-thus no backtracking is done during dependency resolution.
-
-Either way, before installation, the packages selected by the resolver
-can be downloaded in parallel.  This warranties a larger crowd of packages,
-compared to parallelization during resolution, where the number of downloads
-can be as low as one during trail of different versions of the same package.
-
-Unfortunately, I have not been able to do much other than
-{{pip 8411 "a minor clean up"}}.  I am looking forward to accomplishing more
-this week and seeing what this path will lead us too!  At the moment,
-I am happy that I'm able to meet the blog deadline, at least in UTC!
-
-[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts
-[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests