From 2c085d53133fd267a809d0a4e2cbf9421ea2a2a8 Mon Sep 17 00:00:00 2001
From: Nguyễn Gia Phong
Date: Tue, 21 Sep 2021 17:02:17 +0700
Subject: Reorganize GSoC 2020

---
 blog/gsoc2020/blog20200622.md | 113 ------------------------------------------
 1 file changed, 113 deletions(-)
 delete mode 100644 blog/gsoc2020/blog20200622.md

(limited to 'blog/gsoc2020/blog20200622.md')

diff --git a/blog/gsoc2020/blog20200622.md b/blog/gsoc2020/blog20200622.md
deleted file mode 100644
index 3bb3a2c..0000000
--- a/blog/gsoc2020/blog20200622.md
+++ /dev/null
@@ -1,113 +0,0 @@
-+++
-rss = "GSoC 2020: The Wonderful Wizard of O'zip"
-date = Date(2020, 6, 22)
-+++
-@def tags = ["pip", "gsoc"]
-
-# The Wonderful Wizard of O'zip
-
-> Never give up... No one knows what's going to happen next.
-
-\toc
-
-## Preface
-
-Greetings and best wishes! I had a lot of fun during the last week,
-although admittedly nothing was really finished. In summary,
-this is the work I carried out in the last seven days:
-
-* Finalizing {{pip 8320 "utilities for parallelization"}}
-* {{pip 8467 "Continuing experimenting"}}
-  on {{pip 8442 "using lazy wheels for dependency resolution"}}
-* Polishing up {{pip 8411 "the patch"}} refactoring
-  `operations.prepare.prepare_linked_requirement`
-* Adding `flake8-logging-format`
-  {{pip 8423#issuecomment-645418725 "to the linter"}}
-* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding
-  the license requirement to vendor README"}}
-
-## The `multiprocessing[.dummy]` wrapper
-
-Yes, you read that right: this is the same section as in last fortnight's blog.
-My mentor Pradyun Gedam gave me a green light to have {{pip 8411}} merged
-without support for Python 2 and the non-lazy map variant, which turns out
-to be troublesome for multithreading.
-
-The tests still need to pass, of course, and the flaky tests (see failing tests
-on Azure Pipelines in the past) really gave me a panic attack earlier today.
-We probably need to mark them as xfail or investigate why they are
-nondeterministic specifically on Azure, but the real reason I was *all caught up
-and confused* was that the unit tests I added mess with the cached imports
-and, as `pip`'s tests are run in parallel, who knows what they might affect.
-I was so relieved not to discover any new set of tests made flaky by the ones
-I'm trying to add!
-
-## The file-like object mapping ZIP over HTTP
-
-This is where the fun starts. Before we dive in, let's recall some
-background information on this. As discovered by Danny McClanahan
-in {{pip 7819}}, it is possible to only download a portion of a wheel
-and still let `pip` read the distribution's metadata from it.
-In the same thread, Daniel Holth suggested that one may use
-HTTP range requests to specifically ask for the tail of the wheel,
-where the ZIP's central directory record and, usually,
-`dist-info` (the directory containing `METADATA`) can be found.
-
-Well, *usually*. While {{pep 427}} does indeed recommend
-
-> Archivers are encouraged to place the `.dist-info` files physically
-> at the end of the archive. This enables some potentially interesting
-> ZIP tricks including the ability to amend the metadata without
-> rewriting the entire archive.
-
-one of the mentioned *tricks* is adding shared libraries to wheels
-of extension modules (using e.g. `auditwheel` or `delocate`).
-Thus for non-pure Python wheels, it is unlikely that the metadata
-lies in the last few megabytes. Ignoring source distributions is bad enough;
-we can't afford an optimization that doesn't work for extension modules,
-which are still an integral part of the Python ecosystem )-:
-
-But hey, the ZIP's directory record is guaranteed to be at the end of the file!
-Couldn't we do something about that? The short answer is yes. The long answer
-is, well, yessssssss! With that, plus some magic provided by most operating
-systems, this is what we figured out:
-
-1. We can download relatively small chunks from the end of the wheel
-   until the result is recognizable as a valid ZIP file.
-2. In order for the end of the archive to actually appear as the end to
-   `zipfile`, we feed it an object with `seek` and `read` defined.
-   As navigating to the rear of the file is performed by calling `seek`
-   with a relative offset and `whence=SEEK_END` (see `man 3 fseek`
-   for more details), we are completely able to make the wheel in the cloud
-   behave as if it were available locally.
-
-   ![Wheel in the cloud](/assets/cloud.gif)
-
-3. For large wheels, it is better to store them on hard disks instead of in memory.
-   For smaller ones, it is also preferable to store them as files to avoid
-   (error-prone and often not really efficient) manual tracking and joining
-   of downloaded segments. We only use a small portion of the wheel; however,
-   just in case one is wondering, we have very little control over
-   when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid
-   is not exactly working as expected.
-4. With all these in mind, all we have to do is to define an intermediate object
-   that checks for local availability and downloads if needed on calls to `read`,
-   to lazily provide the data over HTTP and reduce execution time.
-
-The only theoretical challenge left is to keep track of downloaded intervals,
-which I finally figured out after some trial and error. The code
-was submitted as a pull request to `pip` at {{pip 8467}}. A more modern
-(read: Python 3-only) variant was packaged and uploaded to PyPI under
-the name of [lazip]. I am unaware of any use case for it outside of `pip`,
-but it's certainly fun to play with d-:
-
-## What's next?
-
-I have been falling short of getting the PRs mentioned above merged for
-quite a while. With `pip`'s next beta coming really soon, I have to somehow
-make the patches reach a certain standard and get enough attention to be part of
-the pre-release, since beta-testing would greatly help the success of the GSoC project.
-To other GSoC students and mentors reading this, I also hope your projects
-turn out successful!
-
-[lazip]: https://pypi.org/project/lazip/
-- 
cgit 1.4.1
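
The lazy file-like object described in the section on ZIP over HTTP could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the pull request or from lazip: the names `LazyFile`, `fetch` and `http_range_fetcher` are hypothetical, the remote size is assumed to be known in advance, and the server is assumed to honour HTTP range requests.

```python
import tempfile
import urllib.request


class LazyFile:
    """Read-only, seekable file whose bytes are fetched on demand.

    `fetch(start, end)` is a hypothetical callback returning the bytes
    [start, end) of the remote file; `length` is its total size.
    """

    def __init__(self, fetch, length):
        self._fetch = fetch
        self._length = length
        self._have = []  # sorted, disjoint (start, end) intervals held locally
        self._file = tempfile.TemporaryFile()
        self._file.truncate(length)  # reserve the full size, download nothing

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._file.tell()

    def seek(self, offset, whence=0):
        # whence=2 (SEEK_END) works since the local file has the full length.
        return self._file.seek(offset, whence)

    def read(self, size=-1):
        start = self._file.tell()
        if size is None or size < 0:
            end = self._length
        else:
            end = min(start + size, self._length)
        if end <= start:
            return b""
        self._ensure(start, end)
        self._file.seek(start)  # downloading moved the local cursor
        return self._file.read(end - start)

    def close(self):
        self._file.close()

    def _download(self, start, end):
        self._file.seek(start)
        self._file.write(self._fetch(start, end))

    def _ensure(self, start, end):
        """Fetch the missing gaps of [start, end), then merge intervals."""
        kept, cursor = [], start
        merged_lo, merged_hi = start, end
        for lo, hi in self._have:
            if hi < start or lo > end:
                kept.append((lo, hi))  # neither overlapping nor adjacent
                continue
            if lo > cursor:
                self._download(cursor, lo)  # gap before this interval
            cursor = max(cursor, hi)
            merged_lo, merged_hi = min(merged_lo, lo), max(merged_hi, hi)
        if cursor < end:
            self._download(cursor, end)  # gap after the last interval
        kept.append((merged_lo, merged_hi))
        self._have = sorted(kept)


def http_range_fetcher(url):
    """Build a `fetch` callback on top of HTTP range requests."""
    def fetch(start, end):
        # HTTP byte ranges are inclusive, hence end - 1.
        request = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end - 1}"})
        with urllib.request.urlopen(request) as response:
            return response.read()
    return fetch
```

Given a wheel's URL and size (for instance from a `Content-Length` header), something like `zipfile.ZipFile(LazyFile(http_range_fetcher(url), length))` should then only download the parts that `zipfile` actually reads. Backing the object with a pre-sized temporary file keeps the interval bookkeeping separate from the storage, which is the trade-off discussed in item 3 of the list.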