+++
rss = "GSoC 2020: The Wonderful Wizard of O'zip"
date = Date(2020, 6, 22)
tags = ["gsoc", "pip", "python"]
+++

# The Wonderful Wizard of O'zip

> Never give up... No one knows what's going to happen next.

\toc

## Preface

Greetings and best wishes!  I had a lot of fun during the last week,
although admittedly nothing was really finished.  In summary,
these are the tasks I carried out over the last seven days:

* Finalizing {{pip 8320 "utilities for parallelization"}}
* {{pip 8467 "Continuing the experiments"}}
  on {{pip 8442 "using lazy wheels for dependency resolution"}}
* Polishing up {{pip 8411 "the patch"}} refactoring
  `operations.prepare.prepare_linked_requirement`
* Adding `flake8-logging-format`
  {{pip 8423#issuecomment-645418725 "to the linter"}}
* Splitting {{pip 8456 "the linting patch"}} from {{pip 8332 "the PR adding
  the license requirement to vendor README"}}

## The `multiprocessing[.dummy]` wrapper

Yes, you read it right, this is the same section as in last fortnight's blog.
My mentor Pradyun Gedam gave me the green light to have {{pip 8411}} merged
without support for Python 2 and without the non-lazy map variant, which
turned out to be troublesome for multithreading.
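
For a rough idea of what such a wrapper looks like, here is a minimal
sketch.  It is illustrative only, not `pip`'s actual code, and it assumes
(as `pip` has to on some platforms) that `multiprocessing.synchronize`
may fail to import where semaphores are unavailable:

```python
from contextlib import closing
from multiprocessing.dummy import Pool  # thread-based version of Pool

try:
    # On platforms lacking semaphore support this import fails,
    # in which case we fall back to the sequential built-in map.
    import multiprocessing.synchronize  # noqa: F401
except ImportError:
    def map_multithread(func, iterable, chunksize=1):
        return map(func, iterable)  # lazy on Python 3
else:
    def map_multithread(func, iterable, chunksize=1):
        """Lazily map func over iterable with a pool of threads."""
        with closing(Pool()) as pool:
            return pool.imap_unordered(func, iterable, chunksize)
```

Since `imap_unordered` is lazy, the pool can be closed right away and
the results consumed later, as the workers keep draining the task queue.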

The tests still need to pass, of course, and the flaky tests (see the failing
tests on Azure Pipelines in the past) really gave me a panic attack earlier
today.  We probably need to mark them as xfail or investigate why they are
nondeterministic specifically on Azure.  The real reason I was *all caught up
and confused*, though, was that the unit tests I added mess with the cached
imports, and since `pip`'s tests are run in parallel, who knows what that
might affect.  I was so relieved not to discover any new set of tests
made flaky by the ones I was trying to add!

## The file-like object mapping ZIP over HTTP

This is where the fun starts.  Before we dive in, let's recall some
background information.  As discovered by Danny McClanahan
in {{pip 7819}}, it is possible to download only a portion of a wheel
and still have `pip` obtain the distribution's metadata.
In the same thread, Daniel Holth suggested that one may use
HTTP range requests to specifically ask for the tail of the wheel,
where the ZIP's central directory record, as well as (usually)
the `dist-info` directory containing `METADATA`, can be found.
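
In concrete terms, fetching the tail looks roughly like this
(the URL is a placeholder, and the server must honor range requests):

```python
import requests

# Hypothetical URL; any server supporting HTTP range requests will do.
url = 'https://example.com/packages/demo-1.0-py3-none-any.whl'
# A suffix range asks for only the last 8000 bytes of the file,
# which is where the end of the central directory record lives.
response = requests.get(url, headers={'Range': 'bytes=-8000'})
assert response.status_code == 206  # Partial Content, not 200
tail = response.content
```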

Well, *usually*.  While {{pep 427}} does indeed recommend

> Archivers are encouraged to place the `.dist-info` files physically
> at the end of the archive.  This enables some potentially interesting
> ZIP tricks including the ability to amend the metadata without
> rewriting the entire archive.

one of the mentioned *tricks* is appending shared libraries to wheels
of extension modules (as done by e.g. `auditwheel` or `delocate`).
Thus for non-pure-Python wheels, the metadata is unlikely to lie
in the last few megabytes.  Ignoring source distributions is bad enough;
we can't afford an optimization that doesn't work for extension modules,
which are still an integral part of the Python ecosystem )-:

But hey, the ZIP's central directory record is guaranteed to be at the end
of the file!  Couldn't we do something with that?  The short answer is yes.
The long answer is, well, yessssssss!  Combining that with the magic
provided by most operating systems, this is what we figured out:

1. We can download a relatively small chunk at the end of the wheel
   until it is recognizable as a valid ZIP file.
2. In order for the end of the archive to actually appear as the end to
   `zipfile`, we feed it an object with `seek` and `read` defined.
   As navigating to the rear of the file is performed by calling `seek`
   with a relative offset and `whence=SEEK_END` (see `man 3 fseek`
   for more details), we can make a wheel in the cloud behave
   as if it were available locally.

   ![Wheel in the cloud](/assets/cloud.gif)

3. For large wheels, it is better to store them on disk instead of in memory.
   For smaller ones, it is also preferable to use a file to avoid
   (error-prone and often inefficient) manual tracking and joining
   of downloaded segments.  We only use a small portion of the wheel, but
   in case anyone is wondering, we have very little control over
   when `tempfile.SpooledTemporaryFile` rolls over, so the memory-disk hybrid
   is not exactly working as expected.
4. With all of this in mind, all we have to do is define an intermediate
   object that checks for local availability, downloading if needed, on calls
   to `read`, to lazily provide the data over HTTP and reduce execution time.
   A rough sketch follows this list.
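
To illustrate, below is a heavily pared-down sketch of such an object.
It is *not* the code in {{pip 8467}} or lazip: error handling is omitted,
and every requested range is naively refetched instead of being tracked:

```python
from tempfile import NamedTemporaryFile

import requests


class LazyZipOverHTTP:
    """Sketch of a file-like object mapping a ZIP over HTTP."""

    def __init__(self, url, session=None):
        self._url = url
        self._session = session or requests.Session()
        head = self._session.head(url, allow_redirects=True)
        self._length = int(head.headers['Content-Length'])
        # Pad a temporary file with zeros to the full remote size
        # so that seeking relative to SEEK_END just works.
        self._file = NamedTemporaryFile()
        self._file.truncate(self._length)
        self._file.seek(0)

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        return self._file.seek(offset, whence)

    def tell(self):
        return self._file.tell()

    def read(self, size=-1):
        start = self.tell()
        stop = self._length if size < 0 else min(start + size, self._length)
        if start >= stop:
            return b''
        # Naively download the requested range every time; the real
        # code remembers which intervals are already present on disk.
        headers = {'Range': 'bytes={}-{}'.format(start, stop - 1)}
        chunk = self._session.get(self._url, headers=headers).content
        self._file.seek(start)
        self._file.write(chunk)
        self._file.seek(start)
        return self._file.read(stop - start)
```

With an object like that, `zipfile` is none the wiser:

```python
from zipfile import ZipFile

with ZipFile(LazyZipOverHTTP(url)) as wheel:
    print(wheel.namelist())  # only the tail is ever downloaded
```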

The only theoretical challenge left was to keep track of the downloaded
intervals, which I finally figured out after a few rounds of trial and error.
The code was submitted as a pull request to `pip` in {{pip 8467}}.  A more
modern (read: Python 3-only) variant was packaged and uploaded to PyPI under
the name of [lazip].  I am unaware of any use case for it outside of `pip`,
but it's certainly fun to play with d-:
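
For the curious, one straightforward way to do that bookkeeping (not
necessarily the way the PR does it) is to keep a sorted list of disjoint
`(start, stop)` pairs of bytes already on disk:

```python
def add_interval(intervals, start, stop):
    """Insert [start, stop) into sorted disjoint intervals, merging overlaps."""
    merged = []
    for lo, hi in intervals:
        if hi < start or stop < lo:  # disjoint: keep as-is
            merged.append((lo, hi))
        else:  # overlapping or adjacent: grow the new interval
            start, stop = min(start, lo), max(stop, hi)
    merged.append((start, stop))
    return sorted(merged)


def missing(intervals, start, stop):
    """Yield subranges of [start, stop) that are not yet downloaded."""
    for lo, hi in intervals:
        if lo > start:
            yield start, min(stop, lo)
        start = max(start, hi)
        if start >= stop:
            return
    if start < stop:
        yield start, stop
```

For example, with `[(100, 200), (300, 400)]` already fetched, `missing`
on the request `[150, 350)` yields only `(200, 300)`, so just that gap
needs downloading before the whole range can be read from disk.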

## What's next?

I have been falling short of getting the PRs mentioned above merged for
quite a while.  With `pip`'s next beta coming really soon, I have to somehow
bring the patches up to a certain standard and get them enough attention
to be part of the pre-release, since beta-testing would greatly help
the success of the GSoC project.  To the other GSoC students and mentors
reading this, I hope your projects turn out successful too!

[lazip]: https://pypi.org/project/lazip/