From 2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 Mon Sep 17 00:00:00 2001 From: Nguyễn Gia Phong Date: Tue, 21 Sep 2021 17:02:17 +0700 Subject: Reorganize GSoC 2020 --- blog/2020/gsoc/article/1.md | 112 +++++++++++++++++++++++++++++ blog/2020/gsoc/article/2.md | 113 +++++++++++++++++++++++++++++ blog/2020/gsoc/article/3.md | 78 ++++++++++++++++++++ blog/2020/gsoc/article/4.md | 84 ++++++++++++++++++++++ blog/2020/gsoc/article/5.md | 46 ++++++++++++ blog/2020/gsoc/article/6.md | 52 ++++++++++++++ blog/2020/gsoc/article/7.md | 109 ++++++++++++++++++++++++++++ blog/2020/gsoc/article/index.md | 12 ++++ blog/2020/gsoc/checkin/1.md | 45 ++++++++++++ blog/2020/gsoc/checkin/2.md | 45 ++++++++++++ blog/2020/gsoc/checkin/3.md | 44 ++++++++++++ blog/2020/gsoc/checkin/4.md | 35 +++++++++ blog/2020/gsoc/checkin/5.md | 37 ++++++++++ blog/2020/gsoc/checkin/6.md | 33 +++++++++ blog/2020/gsoc/checkin/7.md | 26 +++++++ blog/2020/gsoc/checkin/index.md | 11 +++ blog/2020/gsoc/index.md | 151 +++++++++++++++++++++++++++++++++++++++ blog/gsoc2020/blog20200609.md | 112 ----------------------------- blog/gsoc2020/blog20200622.md | 113 ----------------------------- blog/gsoc2020/blog20200706.md | 78 -------------------- blog/gsoc2020/blog20200720.md | 84 ---------------------- blog/gsoc2020/blog20200803.md | 46 ------------ blog/gsoc2020/blog20200817.md | 52 -------------- blog/gsoc2020/blog20200831.md | 109 ---------------------------- blog/gsoc2020/checkin20200601.md | 45 ------------ blog/gsoc2020/checkin20200615.md | 45 ------------ blog/gsoc2020/checkin20200629.md | 44 ------------ blog/gsoc2020/checkin20200713.md | 35 --------- blog/gsoc2020/checkin20200727.md | 37 ---------- blog/gsoc2020/checkin20200810.md | 33 --------- blog/gsoc2020/checkin20200824.md | 26 ------- blog/gsoc2020/index.md | 151 --------------------------------------- index.md | 10 ++- works.md | 2 +- 34 files changed, 1041 insertions(+), 1014 deletions(-) create mode 100644 blog/2020/gsoc/article/1.md create mode 
100644 blog/2020/gsoc/article/2.md create mode 100644 blog/2020/gsoc/article/3.md create mode 100644 blog/2020/gsoc/article/4.md create mode 100644 blog/2020/gsoc/article/5.md create mode 100644 blog/2020/gsoc/article/6.md create mode 100644 blog/2020/gsoc/article/7.md create mode 100644 blog/2020/gsoc/article/index.md create mode 100644 blog/2020/gsoc/checkin/1.md create mode 100644 blog/2020/gsoc/checkin/2.md create mode 100644 blog/2020/gsoc/checkin/3.md create mode 100644 blog/2020/gsoc/checkin/4.md create mode 100644 blog/2020/gsoc/checkin/5.md create mode 100644 blog/2020/gsoc/checkin/6.md create mode 100644 blog/2020/gsoc/checkin/7.md create mode 100644 blog/2020/gsoc/checkin/index.md create mode 100644 blog/2020/gsoc/index.md delete mode 100644 blog/gsoc2020/blog20200609.md delete mode 100644 blog/gsoc2020/blog20200622.md delete mode 100644 blog/gsoc2020/blog20200706.md delete mode 100644 blog/gsoc2020/blog20200720.md delete mode 100644 blog/gsoc2020/blog20200803.md delete mode 100644 blog/gsoc2020/blog20200817.md delete mode 100644 blog/gsoc2020/blog20200831.md delete mode 100644 blog/gsoc2020/checkin20200601.md delete mode 100644 blog/gsoc2020/checkin20200615.md delete mode 100644 blog/gsoc2020/checkin20200629.md delete mode 100644 blog/gsoc2020/checkin20200713.md delete mode 100644 blog/gsoc2020/checkin20200727.md delete mode 100644 blog/gsoc2020/checkin20200810.md delete mode 100644 blog/gsoc2020/checkin20200824.md delete mode 100644 blog/gsoc2020/index.md diff --git a/blog/2020/gsoc/article/1.md b/blog/2020/gsoc/article/1.md new file mode 100644 index 0000000..b0e6a7b --- /dev/null +++ b/blog/2020/gsoc/article/1.md @@ -0,0 +1,112 @@ ++++ +rss = "GSoC 2020: Unexpected Things When You're Expecting" +date = Date(2020, 6, 9) ++++ +@def tags = ["pip", "gsoc"] + +# Unexpected Things When You're Expecting + +Hi everyone, I hope that you are all doing well and wish you all good health! 
+The last week has not been really kind to me with a decent amount of +academic pressure (my school year is lasting until early July). +It would be bold to say that I have spent 10 hours working on my GSoC project +since the last check-in, let alone the 30 hours per week requirement. +That being said, there were still some discoveries that I wish to share. + +\toc + +## The `multiprocessing[.dummy]` wrapper + +Most of the time I spent was to finalize the multi{processing,threading} +wrapper for the `map` function that submits tasks to the worker pool. +To my surprise, it is rather difficult to write something that is +not only portable but also easy to read and test. + +By {{pip 8320 "the latest commit"}}, I realized the following: + +1. The `multiprocessing` module was not designed for the implementation + details to be abstracted away entirely. For example, the lazy `map`s + could be really slow without specifying a suitable chunk size + (to cut the input iterable and distribute them to workers in the pool). + By *suitable*, I mean only an order of magnitude smaller than the input. This defeats + half of the purpose of making it lazy: allowing the input to be + evaluated lazily. Luckily, in the use case I'm aiming for, the length of + the iterable argument is small and the laziness is only needed for the output + (to pipeline download and installation). +2. Mocking `import` for testing purposes can never be pretty. One reason + is that we (Python users) have very little control over the calls of + `import` statements and its lower-level implementation `__import__`. + In order to properly patch this built-in function, unlike for others + of the same group, we have to `monkeypatch` the name from `builtins` + (or `__builtins__` under Python 2) instead of the module that imports stuff. + Furthermore, because of the special namespacing, to avoid infinite recursion + we need to alias the function to a different name for fallback. +3. 
To add to the problem, `multiprocessing` lazily imports the fragile module + during pool creation. Since the failure is platform-specific + (the lack of `sem_open`), it was decided to check upon the import + of `pip`'s module. Although the behavior is easier to reason about + in human language, testing it requires invalidating the cached import and + re-importing the wrapper module. +4. Last but not least, I now understand the pain of keeping Python 2 + compatibility that many package maintainers still need to deal with + every day (although Python 2 has reached its end-of-life, `pip`, for + example, {{pip 6148 "will still support it for another year"}}). + +## The change in direction + +Since last week, my mentor Pradyun Gedam and I set up a weekly real-time +meeting (a fancy term for video/audio chat in the worldwide quarantine +era) for the entire GSoC period. During the last session, we decided to +put parallelization of download during resolution on hold, in favor of a +more beneficial goal: {{pip 7819 "partially download the wheels during +dependency resolution"}}. + +![](/assets/swirl.png) + +As discussed by Danny McClanahan and the maintainers of `pip`, it is feasible +to only download a few kB of a wheel to obtain enough metadata for +the resolution of dependencies. While this is only applicable to wheels +(i.e. prebuilt packages), other packaging formats make up less than 20% +of the downloads (at least on PyPI), and the figure is much less for +the most popular packages. Therefore, this optimization alone could make +[the upcoming backtracking resolver][]'s performance on par with the legacy one. + +During the last few years, there has been a lot of effort being poured into +replacing `pip`'s current resolver that is unable to resolve conflicts. 
+While its correctness will be ensured by some of the most talented and +hard-working developers in the Python packaging community, from the users' +point of view, it would be better to have its performance not lagging +behind the old one. Aside from the increase in CPU cycles for more +rigorous resolution, more I/O, especially networking operations, is expected +to be performed. This is due to {{pip 7406#issuecomment-583891169 "the lack +of a standard and efficient way to acquire the metadata"}}. Therefore, unlike +most package managers we are familiar with, `pip` has to fetch +(and possibly build) the packages solely for dependency information. + +Fortunately, {{pep 427 recommended-archiver-features}} recommends +that package builders place the metadata at the end of the archive. +This allows the resolver to only fetch the last few kB using +[HTTP range requests][] for the relevant information. +Simply appending `Range: bytes=-8000` to the request header +in `pip._internal.network.download` makes the resolution process +*lightning* fast. Of course this breaks the installation but I am confident +that it is not difficult to implement this optimization cleanly. + +One drawback of this optimization is the compatibility. Not every Python +package index supports range requests, and it is not possible to verify +the partial wheel. While the first case is unavoidable, for the other, +hash checking is usually used for pinned/locked-version requirements, +thus no backtracking is done during dependency resolution. + +Either way, before installation, the packages selected by the resolver +can be downloaded in parallel. This guarantees a larger batch of packages, +compared to parallelization during resolution, where the number of downloads +can be as low as one during trial of different versions of the same package. + +Unfortunately, I have not been able to do much other than +{{pip 8411 "a minor clean up"}}. 
I am looking forward to accomplishing more +this week and seeing where this path will lead us! At the moment, +I am happy that I'm able to meet the blog deadline, at least in UTC! + +[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts +[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests diff --git a/blog/2020/gsoc/article/2.md b/blog/2020/gsoc/article/2.md new file mode 100644 index 0000000..3bb3a2c --- /dev/null +++ b/blog/2020/gsoc/article/2.md @@ -0,0 +1,113 @@ ++++ +rss = "GSoC 2020: The Wonderful Wizard of O'zip" +date = Date(2020, 6, 22) ++++ +@def tags = ["pip", "gsoc"] + +# The Wonderful Wizard of O'zip + +> Never give up... No one knows what's going to happen next. + +\toc + +## Preface + +Greetings and best wishes! I had a lot of fun during the last week, +although admittedly nothing was really finished. In summary, +this is the work I carried out in the last seven days: + +* Finalizing {{pip 8320 "utilities for parallelization"}} +* {{pip 8467 "Continuing experimenting"}} + on {{pip 8442 "using lazy wheels for dependency resolution"}} +* Polishing up {{pip 8411 "the patch"}} refactoring + `operations.prepare.prepare_linked_requirement` +* Adding `flake8-logging-format` + {{pip 8423#issuecomment-645418725 "to the linter"}} +* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding + the license requirement to vendor README"}} + +## The `multiprocessing[.dummy]` wrapper + +Yes, you read it right, this is the same section as last fortnight's blog. +My mentor Pradyun Gedam gave me a green light to have {{pip 8411}} merged +without support for Python 2 and the non-lazy map variant, which turns out +to be troublesome for multithreading. + +The tests still need to pass of course and the flaky tests (see failing tests +over Azure Pipeline in the past) really gave me a panic attack earlier today. 
+We probably need to mark them as xfail or investigate why they are +nondeterministic specifically on Azure, but the real reason I was *all caught up +and confused* was that the unit tests I added mess with the cached imports +and as `pip`'s tests are run in parallel, who knows what it might affect. +I was so relieved to not discover any new set of tests made flaky by ones +I'm trying to add! + +## The file-like object mapping ZIP over HTTP + +This is where the fun starts. Before we dive in, let's recall some +background information on this. As discovered by Danny McClanahan +in {{pip 7819}}, it is possible to only download a portion of a wheel +and it's still valid for `pip` to get the distribution's metadata. +In the same thread, Daniel Holth suggested that one may use +HTTP range requests to specifically ask for the tail of the wheel, +where the ZIP's central directory record as well as (usually) +`dist-info` (the directory containing `METADATA`) can be found. + +Well, *usually*. While {{pep 427}} does indeed recommend + +> Archivers are encouraged to place the `.dist-info` files physically +> at the end of the archive. This enables some potentially interesting +> ZIP tricks including the ability to amend the metadata without +> rewriting the entire archive. + +one of the mentioned *tricks* is adding shared libraries to wheels +of extension modules (using e.g. `auditwheel` or `delocate`). +Thus for non-pure Python wheels, it is unlikely that the metadata +lie in the last few megabytes. Ignoring source distributions is bad enough; +we can't afford making an optimization that doesn't work for extension modules, +which are still an integral part of the Python ecosystem )-: + +But hey, the ZIP's directory record is guaranteed to be at the end of the file! +Couldn't we do something about that? The short answer is yes. The long answer +is, well, yessssssss! That, plus magic provided by most operating systems, +this is what we figured out: + +1. 
We can download a relatively small chunk at the end of the wheel + until it is recognizable as a valid ZIP file. +2. In order for the end of the archive to actually appear as the end to + `zipfile`, we feed to it an object with `seek` and `read` defined. + As navigating to the rear of the file is performed by calling `seek` + with relative offset and `whence=SEEK_END` (see `man 3 fseek` + for more details), we are completely able to make the wheels in the cloud + behave as if they were available locally. + + ![Wheel in the cloud](/assets/cloud.gif) + +3. For large wheels, it is better to store them on disk instead of in memory. + For smaller ones, it is also preferable to store them as files to avoid + (error-prone and often not really efficient) manual tracking and joining + of downloaded segments. We only use a small portion of the wheel, however + just in case one is wondering, we have very little control over + when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid + is not exactly working as expected. +4. With all these in mind, all we have to do is to define an intermediate object + that checks for local availability and downloads if needed on calls to `read`, + to lazily provide the data over HTTP and reduce execution time. + +The only theoretical challenge left is to keep track of downloaded intervals, +which I finally figured out after a few trials and errors. The code +was submitted as a pull request to `pip` at {{pip 8467}}. A more modern +(read: Python 3-only) variant was packaged and uploaded to PyPI under +the name of [lazip][]. I am unaware of any use case for it outside of `pip`, +but it's certainly fun to play with d-: + +## What's next? + +I have been falling short of getting the PRs mentioned above merged for +quite a while. 
With `pip`'s next beta coming really soon, I have to somehow +make the patches reach a certain standard and get enough attention to be part of +the pre-release; beta-testing would greatly help the success of the GSoC project. +To other GSoC students and mentors reading this, I also hope your projects +turn out successful! + +[lazip]: https://pypi.org/project/lazip/ diff --git a/blog/2020/gsoc/article/3.md b/blog/2020/gsoc/article/3.md new file mode 100644 index 0000000..9c41b31 --- /dev/null +++ b/blog/2020/gsoc/article/3.md @@ -0,0 +1,78 @@ ++++ +rss = "GSoC 2020: I'm Not Drowning On My Own" +date = Date(2020, 7, 6) ++++ +@def tags = ["pip", "gsoc"] + +# I'm Not Drowning On My Own + +\toc + +## Cold Water + +Hello there! My school year is coming to an end, with some final assignments +and group projects left to be done. I for sure underestimated the workload +of these and in the last (and probably next) few days I'm drowning in work +trying to meet my deadlines. + +One project that might be remotely relevant is [cheese-shop][], which tries to +manage the metadata of packages from the real [Cheese Shop][]. Other than that, +schoolwork is draining a lot of my time and I can't remember the last time +I came up with something new for my GSoC project )-; + +## Warm Water + +On the bright side, I received a lot of help and encouragement +from contributors and stakeholders of `pip`. In the last week alone, +I had five pull requests merged: + +* {{pip 8332}}: Add license requirement to `_vendor/README.rst` +* {{pip 8320}}: Add utilities for parallelization +* {{pip 8504}}: Parallelize `pip list --outdated` and `--uptodate` +* {{pip 8411}}: Refactor `operations.prepare.prepare_linked_requirement` +* {{pip 8467}}: Add utility to lazily acquire wheel metadata over HTTP + +In addition to helping me get my PRs merged, my mentor Pradyun Gedam +also gave me my first official feedback, including what I'm doing right +(and wrong too!) 
and what I should keep doing to increase the chance of +the project being successful. + +{{pip 7819}}'s roadmap (Danny McClanahan's discoveries and works on lazy wheels) +is being closely tracked by `hatch`'s maintainer Ofek Lev, which really +makes me proud and warms my heart, that what I'm helping build is actually +needed by the community! + +## Learning How To Swim + +With {{pip 8467}} and {{pip 8530}} merged, I'm now working on {{pip 8532}} +which aims to roll out the lazy wheel as the way to obtain +dependency information via the CLI flag `--use-feature=lazy-wheel`. + +{{pip 8532}} was failing initially, despite being relatively trivial and +the commit it was based on passing. Surprisingly, after rebasing it +on top of {{pip 8530}}, it mysteriously became green. After the first +(early) review, I was able to iterate on my earlier code, which used +the ambiguous exception `RuntimeError`. + +The rest to be done is *just* adding some functional tests (I'm pretty sure +this will be either overwhelming or underwhelming) to make sure that +the command-line flag is working correctly. Hopefully this can make it into +the beta of the upcoming release {{pip 8511 "this month"}}. + +![Lazy wheel](/assets/lazy-wheel.jpg) + +In other news, I've also submitted {{pip 8538 "a patch improving the tests +for the parallelization utilities"}}, which were really messy as I wrote them. +Better late than never! + +Metaphors aside, I actually can't swim d-: + +## Diving Plan + +After {{pip 8532}}, I think I'll try to parallelize downloads of wheels +that are lazily fetched only for metadata. By the current implementation +of the new resolver, for `pip install`, this can be injected directly +between the resolution and build/installation process. 
+ +[cheese-shop]: https://github.com/McSinyx/cheese-shop +[Cheese Shop]: https://pypi.org diff --git a/blog/2020/gsoc/article/4.md b/blog/2020/gsoc/article/4.md new file mode 100644 index 0000000..43738a7 --- /dev/null +++ b/blog/2020/gsoc/article/4.md @@ -0,0 +1,84 @@ ++++ +rss = "GSoC 2020: I've Walked 500 Miles..." +date = Date(2020, 7, 20) ++++ +@def tags = ["pip", "gsoc"] + +# I've Walked 500 Miles... + +> ... and I would walk 500 more\ +> Just to be the man who walks a thousand miles\ +> To fall down at your door +> +> ![500 miles](/assets/500-miles.gif) + +\toc + +## The Main Road + +Hi, have you met `fast-deps`? It's (going to be) the name of `pip`'s +experimental feature that may improve the speed of dependency resolution +of the new resolver. By avoiding downloading whole wheels just to +obtain metadata, it is especially helpful when `pip` has to do +heavy backtracking to resolve conflicts. + +Thanks to {{pip 8532#discussion_r453990728 "Chris Hunt's review on GH-8537"}}, +my mentor Pradyun Gedam and I worked out a less hacky approach to interject +the call to lazy wheel during the resolution process. A new PR {{pip 8588}} +was filed to implement it; I could have *just* worked on top of the old PR +and rebased, but my `git` skill is far from gud enuff to confidently do it. + +Testing this one has been a lot of fun though. At first, integration tests +were added as a rerun of the tests for the new resolver, with an additional flag +to use feature `fast-deps`. It indeed made me feel guilty towards [Travis][], +who has to work around 30 minutes more every run. Per Chris Hunt's suggestion, +in the new PR, I instead wrote a few functional tests for the area relating +the most to the feature, namely `pip`'s subcommands `wheel`, +`download` and `install`. + +It was also suggested that a mock server with HTTP range requests support +might be better (in terms of performance and reliability) for testing. 
+However, {{pip 8584#issuecomment-659227702 "I have yet to be able to make +Werkzeug do it"}}. + +Why did I say I'm half way there? With the parallel utilities merged and a way +to quickly get the list of distributions to be downloaded being really close, +what is left is *only* to figure out a way to properly download them in parallel. +With no distribution to be added during the download progress, the model of this +will fit very well with the architecture in [my original proposal][]. +A batch downloader can be implemented to track the progress of each download +and thus report them cleanly as e.g. a progress bar or percentage. This is +the part I am second-most excited about of my GSoC project this summer +(after the synchronization of downloads written in my proposal, which was then +superseded by `fast-deps`) and I can't wait to do it! + +## The Side Quests + +As usual, I make sure that I complete every side quest I see during the journey: + +* {{pip 8568}}: Declare constants in `configuration.py` as such +* {{pip 8571}}: Clean up `Configuration.unset_value` + and nit the class' `__init__` +* {{pip 8578}}: Allow verbose/quiet level + to be specified via config file and env var +* {{pip 8599}}: Replace tabs by spaces for consistency + +## Snap Back to Reality + +A bit about me, I actually walked 500 meters earlier today to a bank +and walked 500 more to another to prepare my Visa card for purchasing +the upcoming Pinephone prototype. It's one of the first smartphones +to fully support a GNU/Linux distribution, where one can run desktop apps +(including proper terminals) as well as traditional services like SSH, +HTTP server and IPFS node because why not? Just a few hours ago, +I pre-ordered the [postmarketOS community edition][] with additional hardware +for convergence. 
+ +If you did not come here for a Pinephone ad, please accept my apologies though d-; +and to those reading this, I hope you all can become the person who walks +a thousand miles to fall down at the door opening to all +you ever wished for! + +[Travis]: https://travis-ci.com +[my original proposal]: /assets/pip-parallel-dl.pdf +[postmarketOS community edition]: https://postmarketos.org/blog/2020/07/15/pinephone-ce-preorder/ diff --git a/blog/2020/gsoc/article/5.md b/blog/2020/gsoc/article/5.md new file mode 100644 index 0000000..de2ef8d --- /dev/null +++ b/blog/2020/gsoc/article/5.md @@ -0,0 +1,46 @@ ++++ +rss = "GSoC 2020: Sorting Things Out" +date = Date(2020, 8, 3) ++++ +@def tags = ["pip", "gsoc"] + +# Sorting Things Out + +Hi! I really hope that everyone reading this is still doing okay, +and if that isn't the case, I wish you a good day! + +## `pip` 20.2 Released! + +Last Wednesday, `pip` 20.2 was released, delivering the `2020-resolver` +as well as many other improvements! I was lucky to be able +to get the `fast-deps` feature included as part of the release. +A brief description of this *experimental* feature as well as testing +instructions can be found on [Python Discuss][]. + +The public exposure of the feature also reminded me of some further +{{pip 8681 optimization}} to make on {{pip 8670 "the lazy wheel"}}. +Hopefully without download parallelization it would not be so slow as +to put off testing by concerned users of `pip`. + +## Preparation for Download Parallelization + +As of this moment, we already have: + +* {{pip 8162#issuecomment-667504162 "Multithreading pool fallback working"}} +* An opt-in to use lazy wheel to obtain dependency information, + and thus getting a list of wheels at the end of resolution + ready to be downloaded together + +What's left is *only* to interject a parallel download somewhere after +the dependency resolution step. Still, I am struggling with this way more than +I had ever imagined. 
I got so stuck that I had to give myself a day off +in the middle of the week (and study some Rust), then I came up with +{{pip 8638 "something that was agreed upon as difficult to maintain"}}. + +Indeed, a large part of this is my fault, for not communicating the design +thoroughly with `pip`'s maintainers and not carefully noting stuff down +during (verbal) discussions with my mentor. Thankfully {{pip 8685 +"Chris Hunt came to the rescue"}} and did a refactoring that will +make my future work much easier and cleaner. + +[Python Discuss]: https://discuss.python.org/t/announcement-pip-20-2-release/4863/2 diff --git a/blog/2020/gsoc/article/6.md b/blog/2020/gsoc/article/6.md new file mode 100644 index 0000000..40caad5 --- /dev/null +++ b/blog/2020/gsoc/article/6.md @@ -0,0 +1,52 @@ ++++ +rss = "GSoC 2020: Parallelizing Wheel Downloads" +date = Date(2020, 8, 17) ++++ +@def tags = ["pip", "gsoc"] + +# Parallelizing Wheel Downloads + +> And now it's clear as this promise\ +> That we're making\ +> Two progress bars into one + +\toc + +Hello there! It has been raining a lot lately and some mosquito has given me +dengue fever today. To whoever is reading this, I hope it never happens +to you. + +Download Parallelization +------------------------ + +I've been working on `pip`'s download parallelization for quite a while now. +As distribution download in `pip` was modeled as a lazily evaluated iterable +of chunks, parallelizing such a procedure is as simple as submitting routines +that write files to disk to a worker pool. + +Or at least that is what I thought. + +Progress Reporting UI +--------------------- + +`pip` is currently using custom progress reporting classes, +which were not designed to work with multithreaded code. Firstly, I want to +try using these instead of defining a separate UI for multithreaded progress. +As they use system signals for termination, the progress bars have to be +running in the main thread. Or sort of. 
+ +Since the progress bars are designed as iterators, I realized that we +can call `next` on them. So I quickly threw in some queues and locks, +and prototyped the first *working* {{pip 8771 "implementation of +progress synchronization"}}. + +Performance Issues +------------------ + +Welp, I only said that it works, but I didn't mention the performance, +which is terrible. I am pretty sure that the slowdown is with +the synchronization, since the `map_multithread` call doesn't seem +to trigger anything that may introduce any sort of blocking. + +This seems like a lot of fun, and I hope I'll get better tomorrow +to continue playing with it! diff --git a/blog/2020/gsoc/article/7.md b/blog/2020/gsoc/article/7.md new file mode 100644 index 0000000..58d8d33 --- /dev/null +++ b/blog/2020/gsoc/article/7.md @@ -0,0 +1,109 @@ ++++ +rss = "GSoC 2020: Outro" +date = Date(2020, 8, 31) ++++ +@def tags = ["pip", "gsoc"] + +# Outro + +> Steamed fish was amazing, matter of fact\ +> Let me get some jerk chicken to go\ +> Grabbed me one of them lemon pie theories\ +> And let me get some of them benchmarks you theories too + +\toc + +## The Look + +At the time of writing, +{{pip 8771 "implementation-wise parallel download is ready"}}: + +[![asciicast](/assets/pip-8771.svg)](https://asciinema.org/a/356704) + +Does this mean I've finished everything just-in-time? This sounds too good +to be true! And how does it perform? Welp... + +## The Benchmark + +Here comes the bad news: under a decent connection to the package index, +using `fast-deps` does not make `pip` faster. 
For best comparison, +I will time `pip download` on the following cases: + +### Average Distribution + +For convenience purposes, let's refer to the commands to be used as follows + +```console +$ pip --no-cache-dir download {requirement} # legacy-resolver +$ pip --use-feature=2020-resolver \ + --no-cache-dir download {requirement} # 2020-resolver +$ pip --use-feature=2020-resolver --use-feature=fast-deps \ + --no-cache-dir download {requirement} # fast-deps +``` + +In the first test, I used [axuy][] and obtained the following results + +| legacy-resolver | 2020-resolver | fast-deps | +| --------------- | ------------- | --------- | +| 7.709s | 7.888s | 10.993s | +| 7.068s | 7.127s | 11.103s | +| 8.556s | 6.972s | 10.496s | + +Funny enough, running `pip download` with `fast-deps` in a directory +with downloaded files already took around 7-8 seconds. This is because +to lazily download a wheel, `pip` has to {{pip 8670 "make many requests"}} +which are apparently more expensive than actual data transmission on my network. + +!!! note "When is it useful then?" 
+ + With an unstable connection to PyPI (for some reason I am not confident enough + to state), this is what I got + + | 2020-resolver | fast-deps | + | ------------- | --------- | + | 1m16.134s | 0m54.894s | + | 1m0.384s | 0m40.753s | + | 0m50.102s | 0m41.988s | + + As the connection was *unstable* and the majority of `pip` networking + is performed in CI/CD with large and stable bandwidth, I am unsure what this + result is supposed to tell (-; + +### Large Distribution + +In this test, I used [TensorFlow][] as the requirement and obtained +the following figures: + +| legacy-resolver | 2020-resolver | fast-deps | +| --------------- | ------------- | --------- | +| 0m52.135s | 0m58.809s | 1m5.649s | +| 0m50.641s | 1m14.896s | 1m28.168s | +| 0m49.691s | 1m5.633s | 1m22.131s | + +### Distribution with Conflicting Dependencies + +A requirement that triggers a decent amount of backtracking in +the current implementation of the new resolver is `oslo-utils==1.4.0`: + +| 2020-resolver | fast-deps | +| ------------- | --------- | +| 14.497s | 24.010s | +| 17.680s | 28.884s | +| 16.541s | 26.333s | + +## What Now? + +I don't know, to be honest. At this point I feel I've failed my own +expectations (and those of other stakeholders of `pip`) and wasted the time +and effort of `pip`'s maintainers reviewing dozens of PRs I've made +in the last three months. + +On the bright side, this has been an opportunity for me to explore the codebase +of the package manager and discover various edge cases that the new resolver +has yet to cover (e.g. I've just noticed that `pip download` would save +to-be-discarded distributions, I'll file an issue on that soon). Plus I got +to know many new and cool people and ideas, which make me a more helpful +individual to work on Python packaging in the future, I hope. 
+ +[TensorFlow]: https://www.tensorflow.org +[axuy]: https://sr.ht/~cnx/axuy diff --git a/blog/2020/gsoc/article/index.md b/blog/2020/gsoc/article/index.md new file mode 100644 index 0000000..827c2a0 --- /dev/null +++ b/blog/2020/gsoc/article/index.md @@ -0,0 +1,12 @@ +# GSoC 2020 Blog Posts + +Blog posts are longer descriptions of the work +I was doing as a Python GSoC student: + +* {{abslink blog/2020/gsoc/article/1}} +* {{abslink blog/2020/gsoc/article/2}} +* {{abslink blog/2020/gsoc/article/3}} +* {{abslink blog/2020/gsoc/article/4}} +* {{abslink blog/2020/gsoc/article/5}} +* {{abslink blog/2020/gsoc/article/6}} +* {{abslink blog/2020/gsoc/article/7}} diff --git a/blog/2020/gsoc/checkin/1.md b/blog/2020/gsoc/checkin/1.md new file mode 100644 index 0000000..a362f28 --- /dev/null +++ b/blog/2020/gsoc/checkin/1.md @@ -0,0 +1,45 @@ ++++ +rss = "GSoC 2020: First Check-In" +date = Date(2020, 6, 1) ++++ +@def tags = ["pip", "gsoc"] + +# First Check-In + +Hi everyone, I am McSinyx, a Vietnamese undergraduate student +who loves [free software][]. This summer I am working with +the maintainers and the contributors of `pip` to make +the package manager {{pip 825 "download in parallel"}}. + +## What did I do during the community bonding period? + +Aside from bonding with `pip`'s maintainers and contributors as well as +with my mentors, I was also experimenting on the theoretical and technical +obstacles blocking this GSoC project. Pradyun Gedam (a mentor of mine) +suggested making [a proof of concept][] to determine if parallel downloading +can play nicely with [ResolveLib][]'s abstraction and we are reviewing it +together. On the technical side, we `pip`'s committers are exploring +{{pip 8169 "available options for parallelization"}} and I made an attempt to +{{pip 8320 "make use of Python's standard worker pool in a portable way"}}. + +## Did I get stuck anywhere? + +Yes, of course! Neither of the experiments above is finished as of +this moment. 
I am optimistic, though, that the issues will not be +real blockers and that we will figure them out in the next few days. + +## What is coming up next? + +As planned, this week I am going to refactor the package downloading code +in `pip`. The main purpose is to decouple the networking code from +the package preparation operation and make sure that it is thread-safe. + +In addition, I am also continuing the experiments mentioned above to gain more +confidence in the future of this GSoC project. + +To other GSoC students, mentors and admins reading this, I am wishing +you all good health and successful projects this summer! + +[free software]: https://www.gnu.org/philosophy/free-sw.html +[a proof of concept]: https://gist.github.com/McSinyx/513dbff71174fcc79f1cb600e09881af +[ResolveLib]: https://pypi.org/project/resolvelib diff --git a/blog/2020/gsoc/checkin/2.md b/blog/2020/gsoc/checkin/2.md new file mode 100644 index 0000000..e59cac2 --- /dev/null +++ b/blog/2020/gsoc/checkin/2.md @@ -0,0 +1,45 @@ ++++ +rss = "GSoC 2020: Second Check-In" +date = Date(2020, 6, 15) ++++ +@def tags = ["pip", "gsoc"] + +# Second Check-In + +Hi everyone and may the odds be ever in your favor, especially during this +tough time! + +## What did I do last week? + +Not as much as I wished, apparently (-: + +* Finalizing {{pip 8411 "the refactoring patch"}} + of `operations.prepare.prepare_linked_requirement` +* {{pip 8423 "Nitpicking some logging calls"}}. This (as well as the next one) + was to fill the time when my brain was not as productive as I wanted it to be XD +* {{pip 8423 "Beginning to migrate"}} from `%`- to `{}`-style logging. + The number of tests failing due to this was way beyond my imagination, + but I got functional tests for `pip install` and unit tests passing now! +* {{pip 8442 "Mocking up a working partial wheel download during + dependency resolution"}} for [the new resolver][]. + +## Did I get stuck anywhere? + +Yes, of course!
{{pip 8320 "Parallel maps"}} are still stalling, +as are the other small PRs listed above. The failures related to +`logging` are still making me pull my hair out and the proof of +concept for partial wheel downloading is too ugly even for a PoC. +I imagine that I will have a lot of cleanup to do this week (yay!). + +## What is coming up next? + +I'm trying to get the multi-{threading,processing} facilities merged ASAP +to start rolling them out in practice. The first thing popping into my +head is to get back {{pip 7962 "the multi-threaded"}} `pip list -o`. + +The other experimental improvement (this phrase does not sound right!) +I would like to get done is the partial wheel download. It would be +really nice if I could get both included as `unstable-feature`'s +in {{pip 7628#issuecomment-636319539 "the upcoming beta release of pip 20.2"}}. + +[the new resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts/ diff --git a/blog/2020/gsoc/checkin/3.md b/blog/2020/gsoc/checkin/3.md new file mode 100644 index 0000000..32a94ab --- /dev/null +++ b/blog/2020/gsoc/checkin/3.md @@ -0,0 +1,44 @@ ++++ +rss = "GSoC 2020: Third Check-In" +date = Date(2020, 6, 29) ++++ +@def tags = ["pip", "gsoc"] + +# Third Check-In + +Holla, holla, holla! The last seven days have not been really productive +for me, though I think there are still some nice things to share with +you all here! The good news is that I've finished my last leçon as a sophomore; +the bad news is that I have a bunch of upcoming tests, mainly in the form +of group projects and/or presentations (phew!). Enough about me, +let's get back to `pip`: + +## What did I do last week? + +Not much, actually )-: + +* Write some tests for {{pip 8467 "the HTTP range mapping for wheel"}}. +* {{pip 8504 "Try to bring back"}} multithreaded `pip list --outdated` + and `--uptodate`, as {{pip 8320 "the parallel map"}} was merged + earlier today.
+* Nitpick {{pip 8332}} + (yep it's a new low for me to include this in the list (-:). + +## Did I get stuck anywhere? + +Not exactly, since I didn't do much d-; [Many of my PRs][] are stalling though. +On the one hand the maintainers of `pip` are all volunteers working in +their free time; on the other hand I don't think I have tried hard enough +to get their attention on my PRs. + +## What is coming up next? + +I'll try my best to get the following merged upstream before +{{pip 8206 "the upcoming beta release"}}: + +* Parallel networking for `pip list`: {{pip 8504}} +* Lazy wheel for dependency information: {{pip 8467}}, {{pip 8411}} + (to determine if hashing is required) and {{pip 8467#issuecomment-648717032 + "a new patch introducing this as an unstable feature"}} + +[Many of my PRs]: https://github.com/pulls?q=is:open+is:pr+author:McSinyx+repo:pypa/pip+sort:updated-desc diff --git a/blog/2020/gsoc/checkin/4.md b/blog/2020/gsoc/checkin/4.md new file mode 100644 index 0000000..417db58 --- /dev/null +++ b/blog/2020/gsoc/checkin/4.md @@ -0,0 +1,35 @@ ++++ +rss = "GSoC 2020: Fourth Check-In" +date = Date(2020, 7, 13) ++++ +@def tags = ["pip", "gsoc"] + +# Fourth Check-In + +Hello there! I'm having my second year's last exam tomorrow, +but it [feels like summer][] already! I've been finalizing quite a few things +to get them ready for pip 20.2b2. + +## What did I do last week? + +I've spent most of the time on getting {{pip 8532 "the opt-in"}} for obtaining +dependency information via lazy wheels ready. It will be available as +`--use-feature=fast-deps` and only has an effect when +`--use-feature=2020-resolver` is also present. + +While waiting for reviews and suggestions, I made some patches for +internal cleanup, namely {{pip 8568}}, {{pip 8571}} and {{pip 8578}}. +Some of the similar patches I made earlier were also merged last week: +{{pip 8456}} and {{pip 8538}}. + +## Did I get stuck anywhere? + +Not really, everything was going as expected for me.
+ +## What is coming up next? + +After {{pip 8532}}, I'll work on the parallel download of the postponed wheels. +My main concern at the moment is how the download progress will be reported +to the users, but I think I'll figure it out soon. + +[feels like summer]: https://www.youtube.com/watch?v=F1B9Fk_SgI0 diff --git a/blog/2020/gsoc/checkin/5.md b/blog/2020/gsoc/checkin/5.md new file mode 100644 index 0000000..5e50f67 --- /dev/null +++ b/blog/2020/gsoc/checkin/5.md @@ -0,0 +1,37 @@ ++++ +rss = "GSoC 2020: Fifth Check-In" +date = Date(2020, 7, 27) ++++ +@def tags = ["pip", "gsoc"] + +# Fifth Check-In + +Hello and I hope y'all are still doing well! + +## What did I do last week? + +I was not really productive last week: most of the following tickets are fillers +to make use of the spare cycles I had when I was still trying to figure out +the way to implement the main work. + +* Finalize the `--use-feature=fast-deps` flag ({{pip 8588}}) +* Improve mocking of environment variables in the test suite ({{pip 8614}}) +* Finalize the fix for verbose/quiet options specified via + configuration files and environment variables ({{pip 8578}}) +* Clean up a tiny bit in the resolver internal API ({{pip 8629}}) +* Start working on separating the download of wheels + from dependency resolution ({{pip 8638}}) + +## Did I get stuck anywhere? + +I'm struggling with refactoring the code to support separate download. +`pip`'s codebase was not intended for this and thus there are +many execution paths and other details entangled around the relevant area. + +## What is coming up next? + +`pip` 20.2 is going to be released within the next few days with +`--use-feature=fast-deps` included and I'm mentally prepared to fix +any undiscovered problems. At the same time, I will continue working +on {{pip 8638}} and hopefully get it done soon enough to begin drafting +download parallelization strategies, mostly with the UI.
diff --git a/blog/2020/gsoc/checkin/6.md b/blog/2020/gsoc/checkin/6.md new file mode 100644 index 0000000..aea9d5a --- /dev/null +++ b/blog/2020/gsoc/checkin/6.md @@ -0,0 +1,33 @@ ++++ +rss = "GSoC 2020: Sixth Check-In" +date = Date(2020, 8, 10) ++++ +@def tags = ["pip", "gsoc"] + +# Sixth Check-In + +Hello there! + +## What did I do last week? + +It has been quite a fun week for me, given the current state of +development and the bugs newly discovered thanks to the pip 20.2 release: + +* Initiate discussion with the maintainers of pip on isolating + networking code for late download in parallel ({{pip 8697}}) +* Discuss the UI of parallel download ({{pip 8698}}) +* Log debug information relating to the lazy wheel decision ({{pip 8710}}) +* Disable caching for range requests ({{pip 8716}}) +* Dedent late download logs ({{pip 8722}}) +* Add a hook for batch downloading (third attempt I think) ({{pip 8737}}) +* Test hash checking for fast-deps ({{pip 8743}}) + +## Did I get stuck anywhere? + +Not exactly, everything is going smoothly and I'm feeling awesome! + +## What is coming up next? + +I'll try to solve {{pip 8697}} and {{pip 8698}} within the next few days. +I am optimistic that the parallel download prototype will be done +within this week. diff --git a/blog/2020/gsoc/checkin/7.md b/blog/2020/gsoc/checkin/7.md new file mode 100644 index 0000000..b87a7fd --- /dev/null +++ b/blog/2020/gsoc/checkin/7.md @@ -0,0 +1,26 @@ ++++ +rss = "GSoC 2020: Final Check-In" +date = Date(2020, 8, 24) ++++ +@def tags = ["pip", "gsoc"] + +# Final Check-In + +Hello there! + +## What did I do last week? + +Not much, but implementation-wise I seem to have finished my GSoC project: + +* Finish the implementation of wheels' parallel download ({{pip 8771}}) +* Help make `pip`'s CI green again ({{pip 8790}}) +* Reformat a few spots in the user guide ({{pip 8795}}) + +## Did I get stuck anywhere? + +I got sick, but I am recovering now! + +## What is coming up next?
+ +I will try to spend the time I got left within the scope of GSoC +to {{pip 8720 "improve cache usage of the fast-deps feature"}}. diff --git a/blog/2020/gsoc/checkin/index.md b/blog/2020/gsoc/checkin/index.md new file mode 100644 index 0000000..a95e2ff --- /dev/null +++ b/blog/2020/gsoc/checkin/index.md @@ -0,0 +1,11 @@ +# GSoC 2020 Check Ins + +Weekly check ins answer a few short questions as a sort of status report: + +* {{abslink blog/2020/gsoc/checkin/1}} +* {{abslink blog/2020/gsoc/checkin/2}} +* {{abslink blog/2020/gsoc/checkin/3}} +* {{abslink blog/2020/gsoc/checkin/4}} +* {{abslink blog/2020/gsoc/checkin/5}} +* {{abslink blog/2020/gsoc/checkin/6}} +* {{abslink blog/2020/gsoc/checkin/7}} diff --git a/blog/2020/gsoc/index.md b/blog/2020/gsoc/index.md new file mode 100644 index 0000000..b1c1a1d --- /dev/null +++ b/blog/2020/gsoc/index.md @@ -0,0 +1,151 @@ ++++ +rss = "GSoC 2020 final report" +date = Date(2020, 8, 31) +internship = "https://summerofcode.withgoogle.com/archive/2020/projects/6238594655584256" +benchmark = "/blog/2020/gsoc/blog20200831/#the_benchmark" +python_gsoc = "https://blogs.python-gsoc.org/en/mcsinyxs-blog" ++++ +@def tags = ["fun", "pip", "gsoc"] + +# Google Summer of Code 2020 + +In the summer of 2020, I worked with the contributors of `pip`, +trying to improve the networking performance of the package manager. +Admittedly, at the end of [the internship]({{internship}}) period, +[the benchmark said otherwise]({{benchmark}}); though I really hope +the clean-up and minor fixes I happened to be doing to the codebase +over the summer, in addition to the implementation of parallel +utils and lazy wheel, might actually help the project. + +Personally, I learned a lot: not just about Python packaging and +networking stuff, but also on how to work with others. 
I am really +grateful to {{github pradyunsg}} (my mentor), {{github chrahunt}}, +{{github uranusjr}}, {{github pfmoore}}, {{github brainwane}}, +{{github sbidoul}}, {{github xavfernandez}}, {{github webknjaz}}, +{{github jaraco}}, {{github deveshks}}, {{github gutsytechster}}, +{{github dholth}}, {{github dstufft}}, {{github cosmicexplorer}} +and {{github ofek}}. While this feels like a long shout-out list, +it really isn't. These people are the maintainers and contributors of `pip` +and/or other Python packaging projects, and more importantly, they have been +more than helpful, encouraging and patient with me throughout all my activities, +showing me the way when I was lost, correcting me when I was wrong, putting up with +my carelessness and showing me support across different social media. + +To best serve the community, below I have tried my best to document +what I have done, how I've done it and why I've done it over +the last three months. At the time of writing, some work is still in progress, +so these also serve as a reference point for myself and others to reason +about decisions in relevant topics. + +\toc + +## The Main Story + +The storyline can be divided into the following four main acts. + +### Act One: Parallelization Utilities + +In this first act, I ensured the portability of parallelization +measures for later use in the final act. The multithreading and multiprocessing +`map` variants fall back gracefully on platforms without full support. + +* {{pip 8320}}: Add utilities for parallelization (close {{pip 8169}}) +* {{pip 8538}}: Make `utils.parallel` tests tear down properly +* {{pip 8504}}: Parallelize `pip list --outdated` and `--uptodate` + (using {{pip 8320}}) + +### Act Two: Lazy Wheels + +As proposed by {{github cosmicexplorer}} in {{pip 7819}}, it is possible to only +download a portion of a wheel to obtain metadata during dependency resolution.
+Not only would this reduce the total amount of data to be transmitted over +the network in case the resolver needs to perform heavy backtracking, but +it would also create a synchronization point at the end of the resolution process +where parallel downloading can be applied to the needed wheels (some wheels +solely serve their metadata during dependency backtracking and are not needed +by the users). + +* {{pip 8467}}: Add utility to lazily acquire wheel metadata over HTTP +* {{pip 8584}}: Revise lazy wheel and its tests +* {{pip 8681}}: Make range requests closer to chunk size (help {{pip 8670}}) +* {{pip 8716}} and {{pip 8730}}: Disable caching for range requests + +### Act Three: Late Downloading + +During this act, the main work was refactoring to integrate the *lazy wheel* +into `pip`'s codebase and clearing the way for download parallelization. + +* {{pip 8411}}: Refactor `operations.prepare.prepare_linked_requirement` +* {{pip 8629}}: Abstract away `AbstractDistribution` + in higher-level resolver code +* {{pip 8442}}, {{pip 8532}} and {{pip 8588}} (later reworked by + {{github chrahunt}} in {{pip 8685}}): Use lazy wheel to obtain + dependency information for the new resolver +* {{pip 8743}}: Test hash checking for `fast-deps` +* {{pip 8804}}: Check download directory before making range requests + +### Act Four: Batch Downloading in Parallel + +The final act is mostly about the UI of the parallel download. +My work revolved around how the progress should be displayed +and how other relevant information should be reported to the users.
+ +* {{pip 8710}}: Revise method fetching metadata using lazy wheels +* {{pip 8722}}: Dedent late download logs (fix {{pip 8721}}) +* {{pip 8737}}: Add a hook for batch downloading +* {{pip 8771}}: Parallelize wheel download + +The Side Quests +--------------- + +In order to keep the wheel turning (no pun intended) and avoid wasting time +waiting for the pull requests above to be reviewed, I decided to create +even more PRs (as I am typing this, many of the patches listed below +are nowhere near being merged). + +* {{pip 7878}}: Fail early when install path is not writable +* {{pip 7928}}: Fix rst syntax in Getting Started guide +* {{pip 7988}}: Fix tabulate col size in case of empty cell +* {{pip 8137}}: Add subcommand alias mechanism +* {{pip 8143}}: Make mypy happy with beta release automation +* {{pip 8248}}: Fix typo and simplify ireq call +* {{pip 8332}}: Add license requirement to `_vendor/README.rst` +* {{pip 8423}}: Nitpick logging calls +* {{pip 8435}}: Use str.format style in logging calls +* {{pip 8456}}: Lint `src/pip/_vendor/README.rst` +* {{pip 8568}}: Declare constants in configuration.py as such +* {{pip 8571}}: Clean up `Configuration.unset_value` and nit `__init__` +* {{pip 8578}}: Allow verbose/quiet level to be specified + via config files and environment variables +* {{pip 8599}}: Replace tabs by spaces for consistency +* {{pip 8614}}: Use `monkeypatch.setenv` to mock environment variables +* {{pip 8674}}: Fix `tests/functional/test_install_check.py`, + when run with new resolver +* {{pip 8692}}: Make assertion failure give better message +* {{pip 8709}}: List downloaded distributions before exiting (fix {{pip 8696}}) +* {{pip 8759}}: Allow py2 deprecation warning from setuptools +* {{pip 8766}}: Use the new resolver for test requirements +* {{pip 8790}}: Mark tests using remote svn and hg as xfail +* {{pip 8795}}: Reformat a few spots in user guide + +## The Plot Summary + +Every Monday throughout the Summer of Code, I summarized what I had 
done +in the week before in the form of either a short blog post or an (even shorter) +check-in. These write-ups often contain handfuls of popular culture references +and were originally hosted on [Python GSoC]({{python_gsoc}}). + +* {{abslink blog/2020/gsoc/checkin/1}} +* {{abslink blog/2020/gsoc/article/1}} +* {{abslink blog/2020/gsoc/checkin/2}} +* {{abslink blog/2020/gsoc/article/2}} +* {{abslink blog/2020/gsoc/checkin/3}} +* {{abslink blog/2020/gsoc/article/3}} +* {{abslink blog/2020/gsoc/checkin/4}} +* {{abslink blog/2020/gsoc/article/4}} +* {{abslink blog/2020/gsoc/checkin/5}} +* {{abslink blog/2020/gsoc/article/5}} +* {{abslink blog/2020/gsoc/checkin/6}} +* {{abslink blog/2020/gsoc/article/6}} +* {{abslink blog/2020/gsoc/checkin/7}} +* {{abslink blog/2020/gsoc/article/7}} diff --git a/blog/gsoc2020/blog20200609.md b/blog/gsoc2020/blog20200609.md deleted file mode 100644 index b0e6a7b..0000000 --- a/blog/gsoc2020/blog20200609.md +++ /dev/null @@ -1,112 +0,0 @@ -+++ -rss = "GSoC 2020: Unexpected Things When You're Expecting" -date = Date(2020, 6, 9) -+++ -@def tags = ["pip", "gsoc"] - -# Unexpected Things When You're Expecting - -Hi everyone, I hope that you are all doing well and wish you all good health! -The last week has not been really kind to me with a decent amount of -academic pressure (my school year lasts until early July). -It would be bold to say that I have spent 10 hours working on my GSoC project -since the last check-in, let alone the 30 hours per week requirement. -That being said, there were still some discoveries that I wish to share. - -\toc - -## The `multiprocessing[.dummy]` wrapper - -Most of my time was spent finalizing the multi{processing,threading} -wrapper for the `map` function that submits tasks to the worker pool. -To my surprise, it is rather difficult to write something that is -not only portable but also easy to read and test. - -By {{pip 8320 "the latest commit"}}, I realized the following: -1.
The `multiprocessing` module was not designed for the implementation - details to be abstracted away entirely. For example, the lazy `map` variants - can be really slow without a suitable chunk size - (used to cut the input iterable into pieces and distribute them to workers in the pool). - By *suitable*, I mean only an order of magnitude smaller than the input. This defeats - half of the purpose of making it lazy: allowing the input to be - evaluated lazily. Luckily, in the use case I'm aiming for, the length of - the iterable argument is small and the laziness is only needed for the output - (to pipeline download and installation). -2. Mocking `import` for testing purposes can never be pretty. One reason - is that we (Python users) have very little control over the calls of - `import` statements and its lower-level implementation `__import__`. - In order to properly patch this built-in function, unlike for others - of the same group, we have to `monkeypatch` the name from `builtins` - (or `__builtin__` under Python 2) instead of the module that imports stuff. - Furthermore, because of the special namespacing, to avoid infinite recursion - we need to alias the function to a different name for fallback. -3. To add to the problem, `multiprocessing` lazily imports the fragile module - during pool creation. Since the failure is platform-specific - (the lack of `sem_open`), it was decided to check upon the import - of `pip`'s module. Although the behavior is easier to reason about - in human language, testing it requires invalidating cached imports and - re-importing the wrapper module. -4. Last but not least, I now understand the pain of keeping Python 2 - compatibility that many package maintainers still need to deal with - every day (although Python 2 has reached its end-of-life, `pip`, for - example, {{pip 6148 "will still support it for another year"}}).
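
To make points 1 and 3 above more concrete, here is a minimal sketch of a portable lazy `map`, not the code that went into `pip`. It assumes that a failing import of `multiprocessing.synchronize` is a good enough signal that worker pools cannot be created (e.g. because `sem_open` is missing); the names `POOLS_USABLE`, `map_lazily` and `square` are made up for illustration.

```python
from multiprocessing.dummy import Pool as ThreadPool  # thread-backed pool

try:
    # Assumption: this import fails with ImportError on platforms
    # lacking sem_open, where pool creation would otherwise break later.
    import multiprocessing.synchronize  # noqa: F401
except ImportError:
    POOLS_USABLE = False
else:
    POOLS_USABLE = True


def map_lazily(func, iterable, chunksize=1, workers=4):
    """Yield func(x) for each x in iterable, preserving input order.

    Hypothetical helper: falls back to the built-in map
    when worker pools cannot be used on this platform.
    """
    if not POOLS_USABLE:
        yield from map(func, iterable)
        return
    with ThreadPool(workers) as pool:
        # imap hands the input to workers in chunks of chunksize items
        # and yields results in order as soon as they are ready.
        yield from pool.imap(func, iterable, chunksize)


def square(x):
    return x * x
```

Calling `list(map_lazily(square, range(5), chunksize=2))` gives the same values as `list(map(square, range(5)))`. The check is done once at import time, which is also why testing the fallback requires re-importing the module, as noted in point 3.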
- -## The change in direction - -Since last week, my mentor Pradyun Gedam and I have set up weekly real-time -meetings (a fancy term for video/audio chat in the worldwide quarantine -era) for the entire GSoC period. During the last session, we decided to -put parallelization of download during resolution on hold, in favor of a -more beneficial goal: {{pip 7819 "partially download the wheels during -dependency resolution"}}. - -![](/assets/swirl.png) - -As discussed by Danny McClanahan and the maintainers of `pip`, it is feasible -to only download a few kB of a wheel to obtain enough metadata for -the resolution of dependencies. While this is only applicable to wheels -(i.e. prebuilt packages), other packaging formats make up less than 20% -of the downloads (at least on PyPI), and the figure is much less for -the most popular packages. Therefore, this optimization alone could make -[the upcoming backtracking resolver][]'s performance on par with the legacy one. - -During the last few years, a lot of effort has been poured into -replacing `pip`'s current resolver, which is unable to resolve conflicts. -While the new resolver's correctness will be ensured by some of the most talented and -hard-working developers in the Python packaging community, from the users' -point of view, it would be better to have its performance not lagging -behind the old one. Aside from the increase in CPU cycles for more -rigorous resolution, more I/O, especially networking operations, is expected -to be performed. This is due to {{pip 7406#issuecomment-583891169 "the lack -of a standard and efficient way to acquire the metadata"}}. Therefore, unlike -most package managers we are familiar with, `pip` has to fetch -(and possibly build) the packages solely for dependency information. - -Fortunately, {{pep 427 recommended-archiver-features}} recommends -that package builders place the metadata at the end of the archive.
-This allows the resolver to only fetch the last few kB using -[HTTP range requests][] for the relevant information. -Simply adding `Range: bytes=-8000` to the request headers -in `pip._internal.network.download` makes the resolution process -*lightning* fast. Of course this breaks the installation but I am confident -that it is not difficult to implement this optimization cleanly. - -One drawback of this optimization is compatibility. Not every Python -package index supports range requests, and it is not possible to verify -the partial wheel. While the first case is unavoidable, for the other, -hash checking is usually used for pinned/locked-version requirements, -thus no backtracking is done during dependency resolution. - -Either way, before installation, the packages selected by the resolver -can be downloaded in parallel. This guarantees a larger crowd of packages, -compared to parallelization during resolution, where the number of downloads -can be as low as one during trial of different versions of the same package. - -Unfortunately, I have not been able to do much other than -{{pip 8411 "a minor cleanup"}}. I am looking forward to accomplishing more -this week and seeing what this path will lead us to! At the moment, -I am happy that I'm able to meet the blog deadline, at least in UTC! - -[the upcoming backtracking resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts -[HTTP range requests]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests diff --git a/blog/gsoc2020/blog20200622.md b/blog/gsoc2020/blog20200622.md deleted file mode 100644 index 3bb3a2c..0000000 --- a/blog/gsoc2020/blog20200622.md +++ /dev/null @@ -1,113 +0,0 @@ -+++ -rss = "GSoC 2020: The Wonderful Wizard of O'zip" -date = Date(2020, 6, 22) -+++ -@def tags = ["pip", "gsoc"] - -# The Wonderful Wizard of O'zip - -> Never give up... No one knows what's going to happen next.
- -\toc - -## Preface - -Greetings and best wishes! I had a lot of fun during the last week, -although admittedly nothing was really finished. In summary, -these are the tasks I carried out in the last seven days: - -* Finalizing {{pip 8320 "utilities for parallelization"}} -* {{pip 8467 "Continuing experimenting"}} - on {{pip 8442 "using lazy wheels for dependency resolution"}} -* Polishing up {{pip 8411 "the patch"}} refactoring - `operations.prepare.prepare_linked_requirement` -* Adding `flake8-logging-format` - {{pip 8423#issuecomment-645418725 "to the linter"}} -* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding - the license requirement to vendor README"}} - -## The `multiprocessing[.dummy]` wrapper - -Yes, you read it right, this is the same section as last fortnight's blog. -My mentor Pradyun Gedam gave me a green light to have {{pip 8411}} merged -without support for Python 2 and the non-lazy map variant, which turns out -to be troublesome for multithreading. - -The tests still need to pass, of course, and the flaky tests (see failing tests -over Azure Pipelines in the past) really gave me a panic attack earlier today. -We probably need to mark them as xfail or investigate why they are -nondeterministic specifically on Azure, but the real reason I was *all caught up -and confused* was that the unit tests I added mess with the cached imports -and as `pip`'s tests are run in parallel, who knows what it might affect. -I was so relieved to not discover any new set of tests made flaky by ones -I'm trying to add! - -## The file-like object mapping ZIP over HTTP - -This is where the fun starts. Before we dive in, let's recall some -background information on this. As discovered by Danny McClanahan -in {{pip 7819}}, it is possible to only download a portion of a wheel -and it's still valid for `pip` to get the distribution's metadata.
-In the same thread, Daniel Holth suggested that one may use -HTTP range requests to specifically ask for the tail of the wheel, -where the ZIP's central directory record, and usually -`dist-info` (the directory containing `METADATA`), can be found. - -Well, *usually*. While {{pep 427}} does indeed recommend - -> Archivers are encouraged to place the `.dist-info` files physically -> at the end of the archive. This enables some potentially interesting -> ZIP tricks including the ability to amend the metadata without -> rewriting the entire archive. - -one of the mentioned *tricks* is adding shared libraries to wheels -of extension modules (using e.g. `auditwheel` or `delocate`). -Thus for non-pure Python wheels, it is unlikely that the metadata -lies in the last few megabytes. Ignoring source distributions is bad enough; -we can't afford to make an optimization that doesn't work for extension modules, -which are still an integral part of the Python ecosystem )-: - -But hey, the ZIP's directory record is guaranteed to be at the end of the file! -Couldn't we do something about that? The short answer is yes. The long answer -is, well, yessssssss! With that, plus magic provided by most operating systems, -this is what we figured out: - -1. We can download relatively small chunks from the end of the wheel - until the result is recognizable as a valid ZIP file. -2. In order for the end of the archive to actually appear as the end to - `zipfile`, we feed to it an object with `seek` and `read` defined. - As navigating to the rear of the file is performed by calling `seek` - with a relative offset and `whence=SEEK_END` (see `man 3 fseek` - for more details), we are able to make the wheels in the cloud - behave as if they were available locally. - - ![Wheel in the cloud](/assets/cloud.gif) - -3. For large wheels, it is better to store them on hard disks instead of memory.
- For smaller ones, it is also preferable to store them as files to avoid - (error-prone and often not really efficient) manual tracking and joining - of downloaded segments. We only use a small portion of the wheel; however, - just in case one is wondering, we have very little control over - when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid - is not exactly working as expected. -4. With all these in mind, all we have to do is to define an intermediate object - that checks for local availability and downloads if needed on calls to `read`, - to lazily provide the data over HTTP and reduce execution time. - -The only theoretical challenge left is to keep track of downloaded intervals, -which I finally figured out after some trial and error. The code -was submitted as a pull request to `pip` at {{pip 8467}}. A more modern -(read: Python 3-only) variant was packaged and uploaded to PyPI under -the name of [lazip][]. I am unaware of any use case for it outside of `pip`, -but it's certainly fun to play with d-: - -## What's next? - -I have been falling short of getting the PRs mentioned above merged for -quite a while. With `pip`'s next beta coming really soon, I have to somehow -make the patches reach a certain standard and get enough attention to be part of -the pre-release, since beta-testing would greatly help the success of the GSoC project. -To other GSoC students and mentors reading this, I also hope your projects -turn out successful! - -[lazip]: https://pypi.org/project/lazip/ diff --git a/blog/gsoc2020/blog20200706.md b/blog/gsoc2020/blog20200706.md deleted file mode 100644 index 9c41b31..0000000 --- a/blog/gsoc2020/blog20200706.md +++ /dev/null @@ -1,78 +0,0 @@ -+++ -rss = "GSoC 2020: I'm Not Drowning On My Own" -date = Date(2020, 7, 6) -+++ -@def tags = ["pip", "gsoc"] - -# I'm Not Drowning On My Own - -\toc - -## Cold Water - -Hello there! My school year is coming to an end, with some final assignments -and group projects left to be done.
I for sure underestimated the workload -of these and in the last (and probably next) few days I have been drowning in work -trying to meet my deadlines. - -One project that might be remotely relevant is [cheese-shop][], which tries to -manage the metadata of packages from the real [Cheese Shop][]. Other than that, -schoolwork is draining a lot of my time and I can't remember the last time -I came up with something new for my GSoC project )-; - -## Warm Water - -On the bright side, I received a lot of help and encouragement -from contributors and stakeholders of `pip`. In the last week alone, -I had five pull requests merged: - -* {{pip 8332}}: Add license requirement to `_vendor/README.rst` -* {{pip 8320}}: Add utilities for parallelization -* {{pip 8504}}: Parallelize `pip list --outdated` and `--uptodate` -* {{pip 8411}}: Refactor `operations.prepare.prepare_linked_requirement` -* {{pip 8467}}: Add utility to lazily acquire wheel metadata over HTTP - -In addition to helping me get my PRs merged, my mentor Pradyun Gedam -also gave me my first official feedback, including what I'm doing right -(and wrong too!) and what I should keep doing to increase the chance of -the project being successful. - -{{pip 7819}}'s roadmap (Danny McClanahan's discoveries and work on lazy wheels) -is being closely tracked by `hatch`'s maintainer Ofek Lev, which really -makes me proud and warms my heart: what I'm helping build is actually -needed by the community! - -## Learning How To Swim - -With {{pip 8467}} and {{pip 8530}} merged, I'm now working on {{pip 8532}} -which aims to roll out the lazy wheel as the way to obtain -dependency information via the CLI flag `--use-feature=lazy-wheel`. - -{{pip 8532}} was failing initially, despite being relatively trivial and despite -the commit it was based on passing. Surprisingly, after I rebased it -on top of {{pip 8530}}, it mysteriously became green.
After the first -(early) review, I was able to iterate on my earlier code, which used -the ambiguous exception `RuntimeError`. - -The rest to be done is *just* adding some functional tests (I'm pretty sure -this will be either overwhelming or underwhelming) to make sure that -the command-line flag is working correctly. Hopefully this can make it into -the beta of the upcoming release {{pip 8511 "this month"}}. - -![Lazy wheel](/assets/lazy-wheel.jpg) - -In other news, I've also submitted {{pip 8538 "a patch improving the tests -for the parallelization utilities"}}, which were really messy when I wrote them. -Better late than never! - -Metaphors aside, I actually can't swim d-: - -## Diving Plan - -After {{pip 8532}}, I think I'll try to parallelize downloads of wheels -that are lazily fetched only for metadata. By the current implementation -of the new resolver, for `pip install`, this can be injected directly -between the resolution and build/installation process. - -[cheese-shop]: https://github.com/McSinyx/cheese-shop -[Cheese Shop]: https://pypi.org diff --git a/blog/gsoc2020/blog20200720.md b/blog/gsoc2020/blog20200720.md deleted file mode 100644 index 43738a7..0000000 --- a/blog/gsoc2020/blog20200720.md +++ /dev/null @@ -1,84 +0,0 @@ -+++ -rss = "GSoC 2020: I've Walked 500 Miles..." -date = Date(2020, 7, 20) -+++ -@def tags = ["pip", "gsoc"] - -# I've Walked 500 Miles... - -> ... and I would walk 500 more\ -> Just to be the man who walks a thousand miles\ -> To fall down at your door -> -> ![500 miles](/assets/500-miles.gif) - -\toc - -## The Main Road - -Hi, have you met `fast-deps`? It's (going to be) the name of `pip`'s -experimental feature that may improve the speed of dependency resolution -of the new resolver. By avoiding downloading whole wheels just to -obtain metadata, it is especially helpful when `pip` has to do -heavy backtracking to resolve conflicts.
- -Thanks to {{pip 8532#discussion_r453990728 "Chris Hunt's review on GH-8537"}}, -my mentor Pradyun Gedam and I worked out a less hacky approach to interject -the call to lazy wheel during the resolution process. A new PR {{pip 8588}} -was filed to implement it. I could have *just* worked on top of the old PR -and rebased, but my `git` skill is far from gud enuff to confidently do it. - -Testing this one has been a lot of fun though. At first, integration tests -were added as a rerun of the tests for the new resolver, with an additional flag -to use feature `fast-deps`. It indeed made me feel guilty towards [Travis][], -who has to work around 30 minutes more every run. Per Chris Hunt's suggestion, -in the new PR, I instead wrote a few functional tests for the area relating -the most to the feature, namely `pip`'s subcommands `wheel`, -`download` and `install`. - -It was also suggested that a mock server with HTTP range requests support -might be better for testing (in terms of performance and reliability). -However, {{pip 8584#issuecomment-659227702 "I have yet to be able to make -Werkzeug do it"}}. - -Why did I say I'm half way there? With the parallel utilities merged and a way -to quickly get the list of distributions to be downloaded being really close, -what's left is *only* to figure out a way to properly download them in parallel. -With no distribution to be added during the download progress, the model of this -will fit very well with the architecture in [my original proposal][]. -A batch downloader can be implemented to track the progress of each download -and thus report them cleanly as e.g. progress bar or percentage. This is -the part I am second-most excited about of my GSoC project this summer -(after the synchronization of downloads written in my proposal, which was then -superseded by `fast-deps`) and I can't wait to do it!
- -## The Side Quests - -As usual, I make sure that I complete every side quest I see during the journey: - -* {{pip 8568}}: Declare constants in `configuration.py` as such -* {{pip 8571}}: Clean up `Configuration.unset_value` - and nit the class' `__init__` -* {{pip 8578}}: Allow verbose/quite level - to be specified via config file and env var -* {{pip 8599}}: Replace tabs by spaces for consistency - -## Snap Back to Reality - -A bit about me, I actually walked 500 meters earlier today to a bank -and walked 500 more to another to prepare my Visa card for purchasing -the upcoming Pinephone prototype. It's one of the first smartphones -to fully support a GNU/Linux distribution, where one can run desktop apps -(including proper terminals) as well as traditional services like SSH, -HTTP server and IPFS node because why not? Just a few hours ago, -I pre-ordered the [postmarketOS community edition][] with additional hardware -for convergence. - -If you did not come here for a Pinephone ad, please take my apologies though d-; -and to ones reading this, I hope you all can become the person who walks -a thousand miles to fall down at the door opening to all -what you ever wished for! - -[Travis]: https://travis-ci.com -[my original proposal]: /assets/pip-parallel-dl.pdf -[postmarketOS community edition]: https://postmarketos.org/blog/2020/07/15/pinephone-ce-preorder/ diff --git a/blog/gsoc2020/blog20200803.md b/blog/gsoc2020/blog20200803.md deleted file mode 100644 index de2ef8d..0000000 --- a/blog/gsoc2020/blog20200803.md +++ /dev/null @@ -1,46 +0,0 @@ -+++ -rss = "GSoC 2020: Sorting Things Out" -date = Date(2020, 8, 3) -+++ -@def tags = ["pip", "gsoc"] - -# Sorting Things Out - -Hi! I really hope that everyone reading this is still doing okay, -and if that isn't the case, I wish you a good day! - -## `pip` 20.2 Released! - -Last Wednesday, `pip` 20.2 was released, delivering the `2020-resolver` -as well as many other improvements! 
I was lucky to be able -to get the `fast-deps` feature included as part of the release. -A brief description of this *experimental* feature as well as testing -instructions can be found on [Python Discuss][]. - -The public exposure of the feature also reminded me of some further -{{pip 8681 optimization}} to make on {{pip 8670 "the lazy wheel"}}. -Hopefully, even without download parallelization, it will not be so slow -as to put off testing by concerned users of `pip`. - -## Preparation for Download Parallelization - -As of this moment, we already have: - -* {{pip 8162#issuecomment-667504162 "Multithreading pool fallback working"}} -* An opt-in to use lazy wheel to obtain dependency information, - and thus getting a list of wheels at the end of resolution - ready to be downloaded together - -What's left is *only* to interject a parallel download somewhere after -the dependency resolution step. Still, I am struggling with this way more than -I ever imagined. I got so stuck that I had to give myself a day off -in the middle of the week (and study some Rust), then I came up with -{{pip 8638 "something that was agreed upon as difficult to maintain"}}. - -Indeed, a large part of this is my fault, for not communicating the design -thoroughly with `pip`'s maintainers and not carefully noting stuff down -during (verbal) discussions with my mentor. Thankfully {{pip 8685 -"Chris Hunt came to the rescue"}} and did a refactoring that will -make my future work much easier and cleaner.
- -[Python Discuss]: https://discuss.python.org/t/announcement-pip-20-2-release/4863/2 diff --git a/blog/gsoc2020/blog20200817.md b/blog/gsoc2020/blog20200817.md deleted file mode 100644 index 40caad5..0000000 --- a/blog/gsoc2020/blog20200817.md +++ /dev/null @@ -1,52 +0,0 @@ -+++ -rss = "GSoC 2020: Parallelizing Wheel Downloads" -date = Date(2020, 8, 17) -+++ -@def tags = ["pip", "gsoc"] - -# Parallelizing Wheel Downloads - -> And now it's clear as this promise\ -> That we're making\ -> Two progress bars into one - -\toc - -Hello there! It has been raining a lot lately and some mosquito has given me -the Dengue fever today. To whoever is reading this, I hope it never happens -to you. - -Download Parallelization ------------------------- - -I've been working on `pip`'s download parallelization for quite a while now. -As distribution download in `pip` was modeled as a lazily evaluated iterable -of chunks, parallelizing such a procedure is as simple as submitting routines -that write files to disk to a worker pool. - -Or at least that is what I thought. - -Progress Reporting UI ---------------------- - -`pip` is currently using custom progress reporting classes, -which were not designed to work with multithreaded code. First, I wanted to -try using these instead of defining a separate UI for multithreaded progress. -As they use system signals for termination, the progress bars have to be -running in the main thread. Or sort of. - -Since the progress bars are designed as iterators, I realized that we -can call `next` on them. So I quickly threw in some queues and locks, -and prototyped the first *working* {{pip 8771 "implementation of -progress synchronization"}}. - -Performance Issues ------------------- - -Welp, I only said that it works, but I didn't mention the performance, -which is terrible.
I am pretty sure that the slow down is with -the synchronization, since the `map_multithread` call doesn't seem -to trigger anything that may introduce any sort of blocking. - -This seems like a lot of fun, and I hope I'll get better tomorrow -to continue playing with it! diff --git a/blog/gsoc2020/blog20200831.md b/blog/gsoc2020/blog20200831.md deleted file mode 100644 index 58d8d33..0000000 --- a/blog/gsoc2020/blog20200831.md +++ /dev/null @@ -1,109 +0,0 @@ -+++ -rss = "GSoC 2020: Outro" -date = Date(2020, 8, 31) -+++ -@def tags = ["pip", "gsoc"] - -# Outro - -> Steamed fish was amazing, matter of fact\ -> Let me get some jerk chicken to go\ -> Grabbed me one of them lemon pie theories\ -> And let me get some of them benchmarks you theories too - -\toc - -## The Look - -At the time of writing, -{{pip 8771 "implementation-wise parallel download is ready"}}: - -[![asciicast](/assets/pip-8771.svg)](https://asciinema.org/a/356704) - -Does this mean I've finished everything just-in-time? This sounds to good -to be true! And how does it perform? Welp... - -## The Benchmark - -Here comes the bad news: under a decent connection to the package index, -using `fast-deps` does not make `pip` faster. 
For best comparison, -I will time `pip download` on the following cases: - -### Average Distribution - -For convenience purposes, let's refer to the commands to be used as follows - -```console -$ pip --no-cache-dir download {requirement} # legacy-resolver -$ pip --use-feature=2020-resolver \ - --no-cache-dir download {requirement} # 2020-resolver -$ pip --use-feature=2020-resolver --use-feature=fast-deps \ - --no-cache-dir download {requirement} # fast-deps -``` - -In the first test, I used [axuy][] and obtained the following results - -| legacy-resolver | 2020-resolver | fast-deps | -| --------------- | ------------- | --------- | -| 7.709s | 7.888s | 10.993s | -| 7.068s | 7.127s | 11.103s | -| 8.556s | 6.972s | 10.496s | - -Funny enough, running `pip download` with `fast-deps` in a directory -with downloaded files already took around 7-8 seconds. This is because -to lazily download a wheel, `pip` has to {{pip 8670 "make many requests"}} -which are apparently more expensive than actual data transmission on my network. - -!!! note "When is it useful then?" 
- - With unstable connection to PyPI (for some reason I am not confident enough - to state), this is what I got - - | 2020-resolver | fast-deps | - | ------------- | --------- | - | 1m16.134s | 0m54.894s | - | 1m0.384s | 0m40.753s | - | 0m50.102s | 0m41.988s | - - As the connection was *unstable* and that the majority of `pip` networking - is performed as CI/CD with large and stable bandwidth, I am unsure what this - result is supposed to tell (-; - -### Large Distribution - -In this test, I used [TensorFlow][] as the requirement and obtained -the following figures: - -| legacy-resolver | 2020-resolver | fast-deps | -| --------------- | ------------- | --------- | -| 0m52.135s | 0m58.809s | 1m5.649s | -| 0m50.641s | 1m14.896s | 1m28.168s | -| 0m49.691s | 1m5.633s | 1m22.131s | - -### Distribution with Conflicting Dependencies - -Some requirement that will trigger a decent amount of backtracking by -the current implementation of the new resolver `oslo-utils==1.4.0`: - -| 2020-resolver | fast-deps | -| ------------- | --------- | -| 14.497s | 24.010s | -| 17.680s | 28.884s | -| 16.541s | 26.333s | - -## What Now? - -I don't know, to be honest. At this point I'm feeling I've failed my own -(and that of other stakeholders of `pip`) expectation and wasted the time -and effort of `pip`'s maintainers reviewing dozens of PRs I've made -in the last three months. - -On the bright side, this has been an opportunity for me to explore the codebase -of package manager and discovered various edge cases where the new resolver -has yet to cover (e.g. I've just noticed that `pip download` would save -to-be-discarded distributions, I'll file an issue on that soon). Plus I got -to know many new and cool people and idea, which make me a more helpful -individual to work on Python packaging in the future, I hope. 
- -[TensorFlow]: https://www.tensorflow.org -[axuy]: https://sr.ht/~cnx/axuy diff --git a/blog/gsoc2020/checkin20200601.md b/blog/gsoc2020/checkin20200601.md deleted file mode 100644 index a362f28..0000000 --- a/blog/gsoc2020/checkin20200601.md +++ /dev/null @@ -1,45 +0,0 @@ -+++ -rss = "GSoC 2020: First Check-In" -date = Date(2020, 6, 1) -+++ -@def tags = ["pip", "gsoc"] - -# First Check-In - -Hi everyone, I am McSinyx, a Vietnamese undergraduate student -who loves [free software][]. This summer I am working with -the maintainers and the contributors of `pip` to make -the package manager {{pip 825 "download in parallel"}}. - -## What did I do during the community bonding period? - -Aside from bonding with `pip`'s maintainers and contributors as well as -with my mentors, I was also experimenting on the theoretical and technical -obstacles blocking this GSoC project. Pradyun Gedam (a mentor of mine) -suggested making [a proof of concept][] to determine if parallel downloading -can play nicely with ResolveLib_'s abstraction and we are reviewing it -together. On the technical side, we `pip`'s committers are exploring -{{pip 8169 "available options for parallelization"}} and I made an attempt to -{{pip 8320 "make use of Python's standard worker pool in a portable way"}}. - -## Did I get stuck anywhere? - -Yes, of course! Neither of the experiments above is finished as of -this moment. Though, I am optimistic that the issues will not be -real blockers and we will figure that out in the next few days. - -## What is coming up next? - -As planned, this week I am going to refactor the package downloading code -in `pip`. The main purpose is to decouple the networking code from -the package preparation operation and make sure that it is thread-safe. - -In addition, I am also continuing mentioned experiments to have a better -confidence on the future of this GSoC project. 
- -To other GSoC students, mentors and admins reading this, I am wishing -you all good health and successful projects this summer! - -[free software]: https://www.gnu.org/philosophy/free-sw.html -[a proof of concept]: https://gist.github.com/McSinyx/513dbff71174fcc79f1cb600e09881af -[ResolveLib]: https://pypi.org/project/resolvelib diff --git a/blog/gsoc2020/checkin20200615.md b/blog/gsoc2020/checkin20200615.md deleted file mode 100644 index e59cac2..0000000 --- a/blog/gsoc2020/checkin20200615.md +++ /dev/null @@ -1,45 +0,0 @@ -+++ -rss = "GSoC 2020: Second Check-In" -date = Date(2020, 6, 15) -+++ -@def tags = ["pip", "gsoc"] - -# Second Check-In - -Hi everyone and may the odds be ever in your favor, especially during this -tough time! - -## What did I do last week? - -Not as much as I wished, apparently (-: - -* Finalizing {{pip 8411 "the refactoring patch"}} - of `operations.prepare.prepare_linked_requirement` -* {{pip 8423 "Nitpicking some logging calls"}}. This (as well as the next one) - was to fill up the time when my brain was not as productive as I wanted it to be XD -* {{pip 8423 "Beginning to migrate"}} from `%`- to `{}`-style logging. - The number of tests failing due to this was way beyond my imagination, - but I got functional tests for `pip install` and unit tests passing now! -* {{pip 8442 "Mocking up a working partial wheel download during - dependency resolution"}} for [the new resolver][]. - -## Did I get stuck anywhere? - -Yes, of course! {{pip 8320 "Parallel maps"}} are still stalling -as well as other small PRs listed above. The failures related to -`logging` are still making me pull my hair out and the proof of -concept for partial wheel downloading is too ugly even for a PoC. -I imagine that I will have a lot of clean up to do this week (yay!). - -## What is coming up next? - -I'm trying to get the multi-{threading,processing} facilities merged ASAP -to start rolling them out in practice.
The first thing popping out of my -head is to get back {{pip 7962 "the multi-threaded"}} `pip list -o`. - -The other experimental improvement (this phrase does not sound right!) -I would like to get done is the partial wheel download. It would be -really nice if I can get both included as `unstable-feature`'s -in {{pip 7628#issuecomment-636319539 "the upcoming beta release of pip 20.2"}}. - -[the new resolver]: http://www.ei8fdb.org/thoughts/2020/05/test-pips-alpha-resolver-and-help-us-document-dependency-conflicts/ diff --git a/blog/gsoc2020/checkin20200629.md b/blog/gsoc2020/checkin20200629.md deleted file mode 100644 index 32a94ab..0000000 --- a/blog/gsoc2020/checkin20200629.md +++ /dev/null @@ -1,44 +0,0 @@ -+++ -rss = "GSoC 2020: Third Check-In" -date = Date(2020, 6, 29) -+++ -@def tags = ["pip", "gsoc"] - -# Third Check-In - -Holla, holla, holla! The last seven days have not been a really productive week -for me, though I think there are still some nice things to share with -you all here! The good news is that I've finished my last leçon as a sophomore, -the bad news is that I have a bunch of upcoming tests, mainly in the form -of group projects and/or presentations (phew!). Enough about me, -let's get back to `pip`: - -## What did I do last week? - -Not much, actually )-: - -* Write some tests for {{pip 8467 "the HTTP range mapping for wheel"}}. -* {{pip 8504 "Try to bring back"}} multithreaded `pip list --outdated` - and `--uptodate`, as {{pip 8320 "the parallel map"}} was merged - earlier today. -* Nitpick {{pip 8332}} - (yep it's a new low for me to include this in the list (-:). - -## Did I get stuck anywhere? - -Not exactly, since I didn't do much d-; [Many of my PRs][] are stalling though. -On one hand the maintainers of `pip` are all volunteers working in -their free time, on the other hand I don't think I have tried hard enough -to get their attention on my PRs. - -## What is coming up next?
- -I'll try my best getting the following merged upstream before -{{pip 8206 "the upcoming beta release"}}: - -* Parallel networking for `pip list`: {{pip 8504}} -* Lazy wheel for dependency information: {{pip 8467}}, {{pip 8411}} - (to determine if hashing is required) and {{pip 8467#issuecomment-648717032 - "a new patch introducing this as an unstable feature"}} - -[Many of my PRs]: https://github.com/pulls?q=is:open+is:pr+author:McSinyx+repo:pypa/pip+sort:updated-desc diff --git a/blog/gsoc2020/checkin20200713.md b/blog/gsoc2020/checkin20200713.md deleted file mode 100644 index 417db58..0000000 --- a/blog/gsoc2020/checkin20200713.md +++ /dev/null @@ -1,35 +0,0 @@ -+++ -rss = "GSoC 2020: Fourth Check-In" -date = Date(2020, 7, 13) -+++ -@def tags = ["pip", "gsoc"] - -# Fourth Check-In - -Hello there! I'm having my second year's last exam tomorrow, -but it [feels like summer][] already! I've been finalizing quite a few things -to get them ready for pip 20.2b2. - -## What did I do last week? - -I've spent most of the time on getting {{pip 8532 "the opt-in"}} for obtaining -dependency information via lazy wheels ready. It will be available as -`--use-feature=fast-deps` and only has effect when -`--use-feature=2020-resolver` also presents. - -While waiting for reviews and suggestions, I made some patches for -internal cleansing, namely {{pip 8568}}, {{pip 8571}} and {{pip 8578}}. -Some of the similar patches I made earlier were also merged last week: -{{pip 8456}} and {{pip 8538}}. - -## Did I get stuck anywhere? - -Not really, everything was going as expected for me. - -## What is coming up next? - -After {{pip 8532}}, I'll work on the parallel download of the postponed wheels. -My main current concern is with how the download progress will be reported -to the users, but I think I'll figure it out soon. 
- -[feels like summer]: https://www.youtube.com/watch?v=F1B9Fk_SgI0 diff --git a/blog/gsoc2020/checkin20200727.md b/blog/gsoc2020/checkin20200727.md deleted file mode 100644 index 5e50f67..0000000 --- a/blog/gsoc2020/checkin20200727.md +++ /dev/null @@ -1,37 +0,0 @@ -+++ -rss = "GSoC 2020: Fifth Check-In" -date = Date(2020, 7, 27) -+++ -@def tags = ["pip", "gsoc"] - -# Fifth Check-In - -Hello and I hope y'all are still doing well! - -## What did I do last week? - -I was not really productive last week; most of the following tickets are fillers -to make use of the spare cycles I had when I was still trying to figure out -the way to implement the main work. - -* Finalize the `--use-feature=fast-deps` flag ({{pip 8588}}) -* Improve mocking of environment variables in the test suite ({{pip 8614}}) -* Finalize the fix for verbose/quiet options specified via - configuration files and environment variables ({{pip 8578}}) -* Clean up a tiny bit in the resolver internal API ({{pip 8629}}) -* Start working on separating the download of wheels - from dependency resolution ({{pip 8638}}) - -## Did I get stuck anywhere? - -I'm struggling with refactoring the code to support separate download. -`pip`'s codebase was not intended for this and thus there are -many execution paths and other details entangled around the relevant area. - -## What is coming up next? - -`pip` 20.2 is going to be released within the next few days with -`--use-feature=fast-deps` included and I'm mentally prepared to fix -any undiscovered problems. At the same time, I will continue working -on {{pip 8638}} and hopefully get it done soon enough to begin drafting -download parallelization strategies, mostly with the UI.
diff --git a/blog/gsoc2020/checkin20200810.md b/blog/gsoc2020/checkin20200810.md deleted file mode 100644 index aea9d5a..0000000 --- a/blog/gsoc2020/checkin20200810.md +++ /dev/null @@ -1,33 +0,0 @@ -+++ -rss = "GSoC 2020: Sixth Check-In" -date = Date(2020, 8, 10) -+++ -@def tags = ["pip", "gsoc"] - -# Sixth Check-In - -Hello there! - -## What did I do last week? - -It has been quite a fun week for me, given the current state of -development and the newly discovered bugs thanks to the pip 20.2 release: - -* Initiate discussion with the maintainers of pip on isolating - networking code for late download in parallel ({{pip 8697}}) -* Discuss the UI of parallel download ({{pip 8698}}) -* Log debug information relating lazy wheel decision ({{pip 8710}}) -* Disable caching for range requests ({{pip 8716}}) -* Dedent late download logs ({{pip 8722}}) -* Add a hook for batch downloading (third attempt I think) ({{pip 8737}}) -* Test hash checking for fast-deps ({{pip 8743}}) - -## Did I get stuck anywhere? - -Not exactly, everything is going smoothly and I'm feeling awesome! - -## What is coming up next? - -I'll try to solve {{pip 8697}} and {{pip 8698}} within the next few days. -I am optimistic that the parallel download prototype will be done -within this week. diff --git a/blog/gsoc2020/checkin20200824.md b/blog/gsoc2020/checkin20200824.md deleted file mode 100644 index b87a7fd..0000000 --- a/blog/gsoc2020/checkin20200824.md +++ /dev/null @@ -1,26 +0,0 @@ -+++ -rss = "GSoC 2020: Final Check-In" -date = Date(2020, 8, 24) -+++ -@def tags = ["pip", "gsoc"] - -# Final Check-In - -Hello there! - -## What did I do last week? - -Not much, but seemingly implementation-wise I have finished my GSoC project: - -* Finish the implementation of wheels' parallel download ({{pip 8771}}) -* Help make `pip`'s CI green again ({{pip 8790}}) -* Reformat a few spots in user guide ({{pip 8795}}) - -## Did I get stuck anywhere? - -I got sick, but I am recovering now! - -## What is coming up next?
- -I will try to spend the time I got left within the scope of GSoC -to {{pip 8720 "improve cache usage of the fast-deps feature"}}. diff --git a/blog/gsoc2020/index.md b/blog/gsoc2020/index.md deleted file mode 100644 index c00edcb..0000000 --- a/blog/gsoc2020/index.md +++ /dev/null @@ -1,151 +0,0 @@ -+++ -rss = "GSoC 2020 final report" -date = Date(2020, 8, 31) -internship = "https://summerofcode.withgoogle.com/archive/2020/projects/6238594655584256" -benchmark = "/blog/gsoc2020/blog20200831/#the_benchmark" -python_gsoc = "https://blogs.python-gsoc.org/en/mcsinyxs-blog" -+++ -@def tags = ["fun", "pip", "gsoc"] - -# Google Summer of Code 2020 - -In the summer of 2020, I worked with the contributors of `pip`, -trying to improve the networking performance of the package manager. -Admittedly, at the end of [the internship]({{internship}}) period, -[the benchmark said otherwise]({{benchmark}}); though I really hope -the clean-up and minor fixes I happened to be doing to the codebase -over the summer, in addition to the implementation of parallel -utils and lazy wheel, might actually help the project. - -Personally, I learned a lot: not just about Python packaging and -networking stuff, but also on how to work with others. I am really -grateful to {{github pradyunsg}} (my mentor), {{github chrahunt}}, -{{github uranusjr}}, {{github pfmoore}}, {{github brainwane}}, -{{github sbidoul}}, {{github xavfernandez}}, {{github webknjaz}}, -{{github jaraco}}, {{github deveshks}}, {{github gutsytechster}}, -{{github dholth}}, {{github dstufft}}, {{github cosmicexplorer}} -and {{github ofek}}. While this feels like a long shout-out list, -it really isn't. 
These people are the maintainers, the contributors of `pip` -and/or other Python packaging projects, and more importantly, they have been -more than helpful, encouraging and patient to me throughout all my activities, -showing me the way when I was lost, correcting me when I was wrong, putting up with -my carelessness and showing me support across different social media. - -To best serve the community, below I have tried my best to document -what I have done, how I've done it and why I've done it over -the last three months. At the time of writing, some work is still in progress, -so these also serve as a reference point for myself and others to reason -about decisions in relevant topics. - -\toc - -## The Main Story - -The storyline can be divided into the following four main acts. - -### Act One: Parallelization Utilities - -In this first act, I ensured the portability of parallelization -measures for later use in the final act. Multithreading and multiprocessing -`map` properly fall back on platforms without full support. - -* {{pip 8320}}: Add utilities for parallelization (close {{pip 8169}}) -* {{pip 8538}}: Make `utils.parallel` tests tear down properly -* {{pip 8504}}: Parallelize `pip list --outdated` and `--uptodate` - (using {{pip 8320}}) - -### Act Two: Lazy Wheels - -As proposed by {{github cosmicexplorer}} in {{pip 7819}}, it is possible to only -download a portion of a wheel to obtain metadata during dependency resolution. -Not only would this reduce the total amount of data to be transmitted over -the network in case the resolver needs to perform heavy backtracking, but -it would also create a synchronization point at the end of the resolution process -where parallel downloading can be applied to the needed wheels (some wheels -solely serve their metadata during dependency backtracking and are not needed -by the users).
- -* {{pip 8467}}: Add utitlity to lazily acquire wheel metadata over HTTP -* {{pip 8584}}: Revise lazy wheel and its tests -* {{pip 8681}}: Make range requests closer to chunk size (help {{pip 8670}}) -* {{pip 8716}} and {{pip 8730}}: Disable caching for range requests - -### Act Three: Late Downloading - -During this act, the main works were refactoring to integrate the *lazy wheel* -into `pip`'s codebase and clean up the way for download parallelization. - -* {{pip 8411}}: Refactor `operations.prepare.prepare_linked_requirement` -* {{pip 8629}}: Abstract away `AbstractDistribution` - in higher-level resolver code -* {{pip 8442}}, {{pip 8532}} and {{pip 8588}} (later reworked by - {{github chrahunt}} in {{pip 8685}}): Use lazy wheel to obtain - dependency information for the new resolver -* {{pip 8743}}: Test hash checking for `fast-deps` -* {{pip 8804}}: Check download directory before making range requests - -### Act Four: Batch Downloading in Parallel - -The final act is mostly about the UI of the parallel download. -My work involved around how the progress should be displayed -and how other relevant information should be reported to the users. - -* {{pip 8710}}: Revise method fetching metadata using lazy wheels -* {{pip 8722}}: Dedent late download logs (fix {{pip 8721}}) -* {{pip 8737}}: Add a hook for batch downloading -* {{pip 8771}}: Parallelize wheel download - -The Side Quests ---------------- - -In order to keep the wheel turning (no pun intended) and avoid wasting time -waiting for the pull requests above to be reviewed, I decided to create -even more PRs (as I am typing this, many of the patches listed below -are nowhere near being merged). 
- -* {{pip 7878}}: Fail early when install path is not writable -* {{pip 7928}}: Fix rst syntax in Getting Started guide -* {{pip 7988}}: Fix tabulate col size in case of empty cell -* {{pip 8137}}: Add subcommand alias mechanism -* {{pip 8143}}: Make mypy happy with beta release automation -* {{pip 8248}}: Fix typo and simplify ireq call -* {{pip 8332}}: Add license requirement to `_vendor/README.rst` -* {{pip 8423}}: Nitpick logging calls -* {{pip 8435}}: Use str.format style in logging calls -* {{pip 8456}}: Lint `src/pip/_vendor/README.rst` -* {{pip 8568}}: Declare constants in configuration.py as such -* {{pip 8571}}: Clean up `Configuration.unset_value` and nit `__init__` -* {{pip 8578}}: Allow verbose/quiet level to be specified - via config files and environment variables -* {{pip 8599}}: Replace tabs by spaces for consistency -* {{pip 8614}}: Use `monkeypatch.setenv` to mock environment variables -* {{pip 8674}}: Fix `tests/functional/test_install_check.py`, - when run with new resolver -* {{pip 8692}}: Make assertion failure give better message -* {{pip 8709}}: List downloaded distributions before exiting (fix {{pip 8696}}) -* {{pip 8759}}: Allow py2 deprecation warning from setuptools -* {{pip 8766}}: Use the new resolver for test requirements -* {{pip 8790}}: Mark tests using remote svn and hg as xfail -* {{pip 8795}}: Reformat a few spots in user guide - -## The Plot Summary - -Every Monday throughout the Summer of Code, I summarized what I had done -in the week before in the form of either a short blog or an (even shorter) -check-in. These write-ups often contain handfuls of popular culture references -and was originally hosted on [Python GSoC]({{python_gsoc}}). 
- -* {{abslink blog/gsoc2020/checkin20200601}} -* {{abslink blog/gsoc2020/blog20200609}} -* {{abslink blog/gsoc2020/checkin20200615}} -* {{abslink blog/gsoc2020/blog20200622}} -* {{abslink blog/gsoc2020/checkin20200629}} -* {{abslink blog/gsoc2020/blog20200706}} -* {{abslink blog/gsoc2020/checkin20200713}} -* {{abslink blog/gsoc2020/blog20200720}} -* {{abslink blog/gsoc2020/checkin20200727}} -* {{abslink blog/gsoc2020/blog20200803}} -* {{abslink blog/gsoc2020/checkin20200810}} -* {{abslink blog/gsoc2020/blog20200817}} -* {{abslink blog/gsoc2020/checkin20200824}} -* {{abslink blog/gsoc2020/blog20200831}} diff --git a/index.md b/index.md index 14f8ec3..3bf5a8a 100644 --- a/index.md +++ b/index.md @@ -1,8 +1,8 @@ # About Me -Hi! [My name is][] Nguyễn Gia Phong. I'm a Vietnamese undergrad student -and a [free software][] enthusiast. You can find me under my Internet alias -McSinyx (or CnX for short) in the [Fediverse][]: +Hi! [My name is][] Nguyễn Gia Phong and I'm a Vietnamese [free software][] +enthusiast. You can find me under my Internet alias McSinyx (or CnX for short) +in the [Fediverse][]: * Pleroma: [cnx@nixnet.social][] * PeerTube: [cnx@video.hardlimit.com][] @@ -10,6 +10,10 @@ McSinyx (or CnX for short) in the [Fediverse][]: * Email (and XMPP): [mcsinyx@disroot.org][][^pgp] * Matrix: [@cnx:halogen.city][] +I am generally interested in programming languages, concurrency, +reproducibility and decentralization. In meatspace I also enjoy cooking, dogs +(not necessarily mutually exclusive) and urban music. + [^pgp]: PGP: [27148B2C06A2224B][], also on [OpenPGP][] [My name is]: https://www.youtube.com/watch?v=LDj8kkVwisY diff --git a/works.md b/works.md index 98f310e..fce1797 100644 --- a/works.md +++ b/works.md @@ -35,7 +35,7 @@ local and direct URL to video/audio and its own JSON playlist format. ### pip -[pip][] is a package installer for Python. [Summer 2020](/blog/gsoc2020), +[pip][] is a package installer for Python. 
[Summer 2020](/blog/2020/gsoc), I worked on improving its new resolver's networking performance. The final result was not quite satisfying, but I got to meet some really nice and talented people (-; -- cgit 1.4.1