authorNguyễn Gia Phong <mcsinyx@disroot.org>2021-09-21 17:02:17 +0700
committerNguyễn Gia Phong <mcsinyx@disroot.org>2021-09-21 17:02:17 +0700
commit2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 (patch)
treea0ede5321105f8a92449d17bf0fcd999dac0a382 /blog/2020/gsoc/article/2.md
parent7d8ce2a7f598312e3501b53d34ff8146b4dba0a6 (diff)
downloadsite-2c085d53133fd267a809d0a4e2cbf9421ea2a2a8.tar.gz
Reorganize GSoC 2020
Diffstat (limited to 'blog/2020/gsoc/article/2.md')
-rw-r--r--blog/2020/gsoc/article/2.md113
1 files changed, 113 insertions, 0 deletions
diff --git a/blog/2020/gsoc/article/2.md b/blog/2020/gsoc/article/2.md
new file mode 100644
index 0000000..3bb3a2c
--- /dev/null
+++ b/blog/2020/gsoc/article/2.md
@@ -0,0 +1,113 @@
++++
+rss = "GSoC 2020: The Wonderful Wizard of O'zip"
+date = Date(2020, 6, 22)
++++
+@def tags = ["pip", "gsoc"]
+
+# The Wonderful Wizard of O'zip
+
+> Never give up... No one knows what's going to happen next.
+
+\toc
+
+## Preface
+
+Greetings and best wishes!  I had a lot of fun during the last week,
+although admittedly nothing was really finished.  In summary,
+these are the works I carried out in the last seven days:
+
+* Finalizing {{pip 8320 "utilities for parallelization"}}
+* {{pip 8467 "Continuing experimenting"}}
+  on {{pip 8442 "using lazy wheels for dependency resolution"}}
+* Polishing up {{pip 8411 "the patch"}} refactoring
+  `operations.prepare.prepare_linked_requirement`
+* Adding `flake8-logging-format`
+  {{pip 8423#issuecomment-645418725 "to the linter"}}
+* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding
+  the license requirement to vendor README"}}
+
+## The `multiprocessing[.dummy]` wrapper
+
+Yes, you read it right, this is the same section as last fortnight's blog.
+My mentor Pradyun Gedam gave me a green light to have {{pip 8411}} merged
+without support for Python 2 and the non-lazy map variant, which turns out
+to be troublesome for multithreading.
+
+The tests still need to pass, of course, and the flaky tests (see failing tests
+over Azure Pipelines in the past) really gave me a panic attack earlier today.
+We probably need to mark them as xfail or investigate why they are
+nondeterministic specifically on Azure, but the real reason I was *all caught up
+and confused* was that the unit tests I added mess with the cached imports,
+and as `pip`'s tests are run in parallel, who knows what they might affect.
+I was so relieved to not discover any new set of tests made flaky by ones
+I'm trying to add!
+
+## The file-like object mapping ZIP over HTTP
+
+This is where the fun starts.  Before we dive in, let's recall some
+background information on this.  As discovered by Danny McClanahan
+in {{pip 7819}}, it is possible to only download a portion of a wheel
+and it's still valid for `pip` to get the distribution's metadata.
+In the same thread, Daniel Holth suggested that one may use
+HTTP range requests to specifically ask for the tail of the wheel,
+where the ZIP's central directory record and, usually, `dist-info`
+(the directory containing `METADATA`) can be found.
+
+Well, *usually*.  While {{pep 427}} does indeed recommend
+
+> Archivers are encouraged to place the `.dist-info` files physically
+> at the end of the archive.  This enables some potentially interesting
+> ZIP tricks including the ability to amend the metadata without
+> rewriting the entire archive.
+
+one of the mentioned *tricks* is adding shared libraries to wheels
+of extension modules (using e.g. `auditwheel` or `delocate`).
+Thus for non-pure Python wheels, it is unlikely that the metadata
+lie in the last few megabytes.  Ignoring source distributions is bad enough;
+we can't afford to make an optimization that doesn't work for extension modules,
+which are still an integral part of the Python ecosystem )-:
+
+But hey, the ZIP's directory record is guaranteed to be at the end of the file!
+Couldn't we do something about that?  The short answer is yes.  The long answer
+is, well, yessssssss!  With that, plus some magic provided by most operating
+systems, this is what we figured out:
+
+1. We can download a relatively small chunk at the end of the wheel
+   until it is recognizable as a valid ZIP file.
+2. In order for the end of the archive to actually appear as the end to
+   `zipfile`, we feed it an object with `seek` and `read` defined.
+   As navigating to the rear of the file is performed by calling `seek`
+   with a relative offset and `whence=SEEK_END` (see `man 3 fseek`
+   for more details), we are able to make wheels in the cloud
+   behave as if they were available locally.
+
+   ![Wheel in the cloud](/assets/cloud.gif)
+
+3. For large wheels, it is better to store them on disk instead of in memory.
+   For smaller ones, it is also preferable to store them as files to avoid
+   (error-prone and often not really efficient) manual tracking and joining
+   of downloaded segments.  We only use a small portion of the wheel; however,
+   just in case one is wondering, we have very little control over
+   when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid
+   is not exactly working as expected.
+4. With all these in mind, all we have to do is to define an intermediate object
+   that checks for local availability and downloads if needed on calls to `read`,
+   to lazily provide the data over HTTP and reduce execution time.
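
Steps 2 to 4 can be sketched roughly as follows.  This is only a minimal
illustration under stated assumptions, not the code from the pull request:
`LazyRemoteFile`, its `fetch(start, end)` callback (standing in for an HTTP
range request) and the fixed-size chunking are all made up for the example,
and the "remote" wheel is just bytes in memory so that the sketch runs offline.

```python
import io
import tempfile
import zipfile


class LazyRemoteFile:
    """Seekable, read-only file whose content is fetched on demand.

    `fetch(start, end)` is a hypothetical callable returning the bytes in
    the inclusive range [start, end], like an HTTP range request would.
    """

    def __init__(self, length, fetch, chunk_size=4096):
        self._length = length
        self._fetch = fetch
        self._chunk_size = chunk_size
        self._have = set()  # indices of chunks already stored locally
        self._file = tempfile.TemporaryFile()
        self._file.truncate(length)  # local placeholder of the full size

    def seekable(self):
        return True

    def tell(self):
        return self._file.tell()

    def seek(self, offset, whence=io.SEEK_SET):
        # The local file has the remote size, so SEEK_END just works.
        return self._file.seek(offset, whence)

    def read(self, size=-1):
        start = self._file.tell()
        stop = self._length
        if size is not None and size >= 0:
            stop = min(start + size, self._length)
        if start < stop:
            self._ensure(start, stop)
        return self._file.read(max(stop - start, 0))

    def _ensure(self, start, stop):
        """Download the chunks covering [start, stop) that are missing."""
        first = start // self._chunk_size
        last = (stop - 1) // self._chunk_size
        for index in range(first, last + 1):
            if index in self._have:
                continue
            low = index * self._chunk_size
            high = min(low + self._chunk_size, self._length) - 1
            self._file.seek(low)
            self._file.write(self._fetch(low, high))
            self._have.add(index)
        self._file.seek(start)  # restore the position the caller expects

    def close(self):
        self._file.close()


def build_wheel_like_zip():
    """Build a ZIP whose metadata is at the front, followed by a big payload."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("pkg-1.0.dist-info/METADATA",
                         "Name: pkg\nVersion: 1.0\n")
        archive.writestr("pkg/data.bin", bytes(500_000))
    return buffer.getvalue()


remote = build_wheel_like_zip()
requested = []  # every (start, end) range asked of the fake server


def fetch(start, end):
    requested.append((start, end))
    return remote[start:end + 1]


lazy = LazyRemoteFile(len(remote), fetch)
with zipfile.ZipFile(lazy) as archive:
    metadata = archive.read("pkg-1.0.dist-info/METADATA").decode()
lazy.close()

downloaded = sum(end - start + 1 for start, end in requested)
print(metadata)
print(f"downloaded {downloaded} of {len(remote)} bytes")
```

Here `zipfile` itself drives the reads: it seeks relative to the end to find
the directory record, then jumps to the member it needs, so only the chunks
it touches are fetched.  Tracking whole chunks in a set is a simplification
chosen for brevity; it trades a few extra bytes per request for not having
to join arbitrary segments.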
+
+The only theoretical challenge left is to keep track of downloaded intervals,
+which I finally figured out after some trial and error.  The code
+was submitted as a pull request to `pip` at {{pip 8467}}.  A more modern
+(read: Python 3-only) variant was packaged and uploaded to PyPI under
+the name of [lazip].  I am unaware of any use case for it outside of `pip`,
+but it's certainly fun to play with d-:
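
For the curious, one way to keep track of downloaded intervals is a pair of
sorted lists and `bisect`.  The sketch below is only an illustration of the
idea, not necessarily what the pull request does; the `Intervals` class and
its `add` method are hypothetical names.  It records an inclusive byte range
and reports which parts of it still had to be downloaded.

```python
from bisect import bisect_left, bisect_right


class Intervals:
    """Sorted, disjoint, inclusive byte ranges that are available locally."""

    def __init__(self):
        self._left = []   # start of each stored interval
        self._right = []  # end of each stored interval (inclusive)

    def add(self, start, end):
        """Record [start, end] and return the sub-ranges that were missing."""
        # Stored intervals i..j-1 overlap or touch the requested range.
        i = bisect_left(self._right, start - 1)
        j = bisect_right(self._left, end + 1)

        missing = []
        cursor = start
        for left, right in zip(self._left[i:j], self._right[i:j]):
            if left > cursor:
                missing.append((cursor, left - 1))
            cursor = max(cursor, right + 1)
        if cursor <= end:
            missing.append((cursor, end))

        # Merge the request with everything it overlaps or touches.
        if i < j:
            start = min(start, self._left[i])
            end = max(end, self._right[j - 1])
        self._left[i:j] = [start]
        self._right[i:j] = [end]
        return missing


have = Intervals()
tail = have.add(90, 99)    # nothing stored yet: [(90, 99)]
head = have.add(0, 9)      # disjoint from the tail: [(0, 9)]
middle = have.add(5, 94)   # only the gap is missing: [(10, 89)]
again = have.add(20, 30)   # fully covered already: []
print(tail, head, middle, again)
```

Each returned range corresponds to one HTTP range request that would have to
be made before the data can be served from the local copy.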
+
+## What's next?
+
+I have been falling short of getting the PRs mentioned above merged for
+quite a while.  With `pip`'s next beta coming really soon, I have to somehow
+make the patches reach a certain standard and get enough attention to be part
+of the pre-release, as beta-testing would greatly help the success of the GSoC
+project.  To other GSoC students and mentors reading this, I also hope your
+projects turn out successful!
+
+[lazip]: https://pypi.org/project/lazip/