From 2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 Mon Sep 17 00:00:00 2001
From: Nguyễn Gia Phong
Date: Tue, 21 Sep 2021 17:02:17 +0700
Subject: Reorganize GSoC 2020

---
 blog/2020/gsoc/article/2.md | 113 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 blog/2020/gsoc/article/2.md

(limited to 'blog/2020/gsoc/article/2.md')

diff --git a/blog/2020/gsoc/article/2.md b/blog/2020/gsoc/article/2.md
new file mode 100644
index 0000000..3bb3a2c
--- /dev/null
+++ b/blog/2020/gsoc/article/2.md
@@ -0,0 +1,113 @@
++++
+rss = "GSoC 2020: The Wonderful Wizard of O'zip"
+date = Date(2020, 6, 22)
++++
+@def tags = ["pip", "gsoc"]
+
+# The Wonderful Wizard of O'zip
+
+> Never give up... No one knows what's going to happen next.
+
+\toc
+
+## Preface
+
+Greetings and best wishes! I had a lot of fun during the last week,
+although admittedly nothing was really finished. In summary,
+these are the things I worked on in the last seven days:
+
+* Finalizing {{pip 8320 "utilities for parallelization"}}
+* {{pip 8467 "Continuing to experiment"}}
+  with {{pip 8442 "using lazy wheels for dependency resolution"}}
+* Polishing up {{pip 8411 "the patch"}} refactoring
+  `operations.prepare.prepare_linked_requirement`
+* Adding `flake8-logging-format`
+  {{pip 8423#issuecomment-645418725 "to the linter"}}
+* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding
+  the license requirement to vendor README"}}
+
+## The `multiprocessing[.dummy]` wrapper
+
+Yes, you read that right: this is the same section as in last fortnight's blog.
+My mentor Pradyun Gedam gave me the green light to have {{pip 8411}} merged
+without support for Python 2 and the non-lazy map variant, which turns out
+to be troublesome for multithreading.
+
+The tests still need to pass of course, and the flaky tests (see failing tests
+on Azure Pipelines in the past) really gave me a panic attack earlier today.
+We probably need to mark them as xfail or investigate why they are
+nondeterministic specifically on Azure, but the real reason I was *all caught up
+and confused* was that the unit tests I added mess with the cached imports
+and, as `pip`'s tests are run in parallel, who knows what they might affect.
+I was so relieved not to discover any new set of tests made flaky by the ones
+I'm trying to add!
+
+## The file-like object mapping ZIP over HTTP
+
+This is where the fun starts. Before we dive in, let's recall some
+background information on this. As discovered by Danny McClanahan
+in {{pip 7819}}, it is possible to download only a portion of a wheel
+that is still sufficient for `pip` to get the distribution's metadata.
+In the same thread, Daniel Holth suggested that one may use
+HTTP range requests to specifically ask for the tail of the wheel,
+where the ZIP's central directory record and, usually, the
+`dist-info` directory (containing `METADATA`) can be found.
+
+Well, *usually*. While {{pep 427}} does indeed recommend
+
+> Archivers are encouraged to place the `.dist-info` files physically
+> at the end of the archive. This enables some potentially interesting
+> ZIP tricks including the ability to amend the metadata without
+> rewriting the entire archive.
+
+one of the mentioned *tricks* is adding shared libraries to wheels
+of extension modules (using e.g. `auditwheel` or `delocate`).
+Thus for non-pure Python wheels, it is unlikely that the metadata
+lie in the last few megabytes. Ignoring source distributions is bad enough;
+we can't afford to make an optimization that doesn't work for extension modules,
+which are still an integral part of the Python ecosystem )-:
+
+But hey, the ZIP's central directory record is guaranteed to be at the end of the file!
+Couldn't we do something about that? The short answer is yes. The long answer
+is, well, yessssssss! With that, plus some magic provided by most operating systems,
+this is what we figured out:
+
+1. We can download relatively small chunks from the end of the wheel
+   until what we have is recognizable as a valid ZIP file.
+2. In order for the end of the archive to actually appear as the end to
+   `zipfile`, we feed it an object with `seek` and `read` defined.
+   As navigating to the rear of the file is performed by calling `seek`
+   with a relative offset and `whence=SEEK_END` (see `man 3 fseek`
+   for more details), we are completely able to make the wheel in the cloud
+   behave as if it were available locally.
+
+   ![Wheel in the cloud](/assets/cloud.gif)
+
+3. For large wheels, it is better to store them on disk instead of in memory.
+   For smaller ones, it is also preferable to store them as files to avoid
+   (error-prone and often not really efficient) manual tracking and joining
+   of downloaded segments. We only use a small portion of the wheel; however,
+   just in case one is wondering, we have very little control over
+   when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid
+   is not exactly working as expected.
+4. With all these in mind, all we have to do is to define an intermediate object
+   that checks for local availability and downloads if needed on calls to `read`,
+   to lazily provide the data over HTTP and reduce execution time.
+
+The only theoretical challenge left is to keep track of downloaded intervals,
+which I finally figured out after some trial and error. The code
+was submitted as a pull request to `pip` at {{pip 8467}}. A more modern
+(read: Python 3-only) variant was packaged and uploaded to PyPI under
+the name of [lazip]. I am unaware of any use case for it outside of `pip`,
+but it's certainly fun to play with d-:
+
+## What's next?
+
+I have been falling short of getting the PRs mentioned above merged for
+quite a while. With `pip`'s next beta coming really soon, I have to somehow
+make the patches reach a certain standard and get enough attention to be part of
+the pre-release, as beta-testing would greatly help the success of the GSoC project.
+To other GSoC students and mentors reading this, I also hope your projects
+turn out successful!
+
+[lazip]: https://pypi.org/project/lazip/
-- 
cgit 1.4.1
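
For readers who prefer code to prose, the four steps listed in the section on the file-like object mapping ZIP over HTTP can be sketched roughly as below. This is only a minimal illustration under stated assumptions, not the code from the pull request nor what `lazip` ships: the `LazyFile` class, the `fetch(start, end)` callback and the in-memory stand-in for a remote wheel are all made up for this example, and no real HTTP request is made.

```python
import io
import tempfile
import zipfile
from bisect import bisect_left, bisect_right


class LazyFile:
    """Read-only, seekable file whose content is fetched on demand.

    fetch(start, end) is a hypothetical callback returning the bytes
    in the inclusive range [start, end], for instance via an HTTP
    request with the header "Range: bytes=start-end".
    """

    def __init__(self, fetch, length):
        self._fetch, self._length = fetch, length
        self._file = tempfile.TemporaryFile()
        self._file.truncate(length)  # so that SEEK_END works locally
        # Sorted, disjoint, inclusive intervals that are already on disk
        self._left, self._right = [], []

    def seekable(self):
        return True

    def tell(self):
        return self._file.tell()

    def seek(self, offset, whence=io.SEEK_SET):
        return self._file.seek(offset, whence)

    def read(self, size=-1):
        start = self.tell()
        stop = self._length if size < 0 else min(start + size, self._length)
        if stop <= start:
            return b""
        self._download(start, stop - 1)
        self._file.seek(start)
        return self._file.read(stop - start)

    def _download(self, start, end):
        """Fetch only the parts of [start, end] that are still missing."""
        i = bisect_left(self._right, start)  # first interval ending >= start
        j = bisect_right(self._left, end)    # first interval starting > end
        cursor = start
        for left, right in zip(self._left[i:j], self._right[i:j]):
            if cursor < left:
                self._store(cursor, left - 1)
            cursor = max(cursor, right + 1)
        if cursor <= end:
            self._store(cursor, end)
        # Replace the overlapped intervals with their union
        lo = min([start] + self._left[i:j][:1])
        hi = max([end] + self._right[i:j][-1:])
        self._left[i:j], self._right[i:j] = [lo], [hi]

    def _store(self, start, end):
        self._file.seek(start)
        self._file.write(self._fetch(start, end))


# Demo on an in-memory "wheel" standing in for a remote one
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("spam/data.bin", bytes(1 << 20))  # bulky payload first
    archive.writestr("spam-1.0.dist-info/METADATA", "Name: spam\nVersion: 1.0\n")
wheel = buffer.getvalue()
fetched = []  # every (start, end) range that was requested


def fetch(start, end):
    fetched.append((start, end))
    return wheel[start:end + 1]


with zipfile.ZipFile(LazyFile(fetch, len(wheel))) as archive:
    metadata = archive.read("spam-1.0.dist-info/METADATA")
downloaded = sum(end - start + 1 for start, end in fetched)
```

In this toy setup `zipfile` only ever asks for the end-of-archive records and the `METADATA` member, so `downloaded` stays far below `len(wheel)`. With a real index, `fetch` would presumably issue an HTTP range request and `length` would come from the `Content-Length` of a prior request; both are left out here to keep the sketch runnable offline.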