+++
rss = "GSoC 2020 final report"
date = Date(2020, 8, 31)
rss_pubdate = date # work around Franklin issue GH-797
fd_full_url = "https://cnx.gdn/blog/2020/gsoc/index.html"
tags = ["fun", "exp", "gsoc", "pkg", "pip"]
internship = "https://summerofcode.withgoogle.com/archive/2020/projects/6238594655584256"
benchmark = "/blog/2020/gsoc/article/7/#the_benchmark"
python_gsoc = "https://blogs.python-gsoc.org/en/mcsinyxs-blog"
+++

# Google Summer of Code 2020

In the summer of 2020, I worked with the contributors of `pip`,
trying to improve the networking performance of the package manager.
Admittedly, at the end of [the internship]({{internship}}) period,
[the benchmark said otherwise]({{benchmark}}).  Still, I really hope that
the clean-up and minor fixes I made to the codebase over the summer,
in addition to the implementation of the parallelization utilities
and lazy wheel, will actually help the project.

Personally, I learned a lot: not just about Python packaging and
networking, but also about how to work with others.  I am really
grateful to {{github pradyunsg}} (my mentor), {{github chrahunt}},
{{github uranusjr}}, {{github pfmoore}}, {{github brainwane}},
{{github sbidoul}}, {{github xavfernandez}}, {{github webknjaz}},
{{github jaraco}}, {{github deveshks}}, {{github gutsytechster}},
{{github dholth}}, {{github dstufft}}, {{github cosmicexplorer}}
and {{github ofek}}.  While this feels like a long shout-out list,
it really isn't.  These people are the maintainers and contributors of `pip`
and/or other Python packaging projects, and more importantly, they have been
more than helpful, encouraging and patient with me throughout all my activities,
showing me the way when I was lost, correcting me when I was wrong, putting up
with my carelessness and showing me support across different social media.

To best serve the community, below I have tried my best to document
what I have done, how I've done it and why I've done it over
the last three months.  At the time of writing, some work is still in progress,
so this report also serves as a reference point for myself and others to reason
about decisions in relevant topics.

\toc

## The Main Story

The storyline can be divided into the following four main acts.

### Act One: Parallelization Utilities

In this first act, I ensured the portability of parallelization
measures for later use in the final act.  The multithreading and multiprocessing
`map` utilities fall back to a plain sequential `map` on platforms
without full support.

* {{pip 8320}}: Add utilities for parallelization (close {{pip 8169}})
* {{pip 8538}}: Make `utils.parallel` tests tear down properly
* {{pip 8504}}: Parallelize `pip list --outdated` and `--uptodate`
  (using {{pip 8320}})

### Act Two: Lazy Wheels

As proposed by {{github cosmicexplorer}} in {{pip 7819}}, it is possible to
download only a portion of a wheel to obtain metadata during dependency
resolution.  Not only would this reduce the total amount of data transmitted
over the network in case the resolver needs to perform heavy backtracking,
but it would also create a synchronization point at the end of the resolution
process where parallel downloading can be applied to the needed wheels (some
wheels solely serve their metadata during dependency backtracking and are not
needed by the users).

* {{pip 8467}}: Add utility to lazily acquire wheel metadata over HTTP
* {{pip 8584}}: Revise lazy wheel and its tests
* {{pip 8681}}: Make range requests closer to chunk size (help {{pip 8670}})
* {{pip 8716}} and {{pip 8730}}: Disable caching for range requests
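
To give an idea of why a partial download is enough, here is a minimal sketch
(not `pip`'s implementation) of a seekable file object that fetches
only the byte ranges `zipfile` asks for.  A wheel is a ZIP archive, whose
index sits at the end of the file, so the metadata can be located without
reading the whole archive.  `LazyRemoteFile`, `fetch_range`, `read_metadata`
and `build_fake_wheel` are hypothetical names; `fetch_range(start, end)`
stands in for an HTTP request carrying a `Range: bytes=start-end` header,
and is faked with an in-memory archive here so that the example needs
no network access.

```python
import io
import zipfile


class LazyRemoteFile:
    """Read-only, seekable file object that fetches bytes on demand."""

    def __init__(self, size, fetch_range):
        self._size = size  # total length of the remote file
        self._fetch_range = fetch_range  # callable(start, end), inclusive
        self._pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        bases = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos,
                 io.SEEK_END: self._size}
        self._pos = max(0, bases[whence] + offset)
        return self._pos

    def read(self, size=-1):
        if size is None or size < 0:
            stop = self._size
        else:
            stop = min(self._pos + size, self._size)
        if stop <= self._pos:
            return b""
        data = self._fetch_range(self._pos, stop - 1)
        self._pos += len(data)
        return data


def read_metadata(size, fetch_range):
    """Return the METADATA text of a wheel, fetching only what is needed."""
    with zipfile.ZipFile(LazyRemoteFile(size, fetch_range)) as archive:
        for name in archive.namelist():
            if name.endswith(".dist-info/METADATA"):
                return archive.read(name).decode()
    raise ValueError("no METADATA found")


def build_fake_wheel():
    """Create a small wheel-like ZIP archive in memory for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo/__init__.py", "x = 1\n" * 10000)
        archive.writestr("demo-1.0.dist-info/METADATA",
                         "Name: demo\nVersion: 1.0\n")
    return buffer.getvalue()


wheel = build_fake_wheel()
fetched_sizes = []  # number of bytes served by each fake range request


def fake_fetch(start, end):
    fetched_sizes.append(end - start + 1)
    return wheel[start:end + 1]


metadata = read_metadata(len(wheel), fake_fetch)
print(f"fetched {sum(fetched_sizes)} of {len(wheel)} bytes")
```

This sketch issues one request per read, which would be wasteful over
a real network; grouping reads into larger chunks and keeping what was
already fetched is left out for brevity.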

### Act Three: Late Downloading

During this act, the main work was refactoring to integrate the *lazy wheel*
into `pip`'s codebase and clear the way for download parallelization.

* {{pip 8411}}: Refactor `operations.prepare.prepare_linked_requirement`
* {{pip 8629}}: Abstract away `AbstractDistribution`
  in higher-level resolver code
* {{pip 8442}}, {{pip 8532}} and {{pip 8588}} (later reworked by
  {{github chrahunt}} in {{pip 8685}}): Use lazy wheel to obtain
  dependency information for the new resolver
* {{pip 8743}}: Test hash checking for `fast-deps`
* {{pip 8804}}: Check download directory before making range requests

### Act Four: Batch Downloading in Parallel

The final act is mostly about the UI of the parallel download.
My work revolved around how the progress should be displayed
and how other relevant information should be reported to the users.

* {{pip 8710}}: Revise method fetching metadata using lazy wheels
* {{pip 8722}}: Dedent late download logs (fix {{pip 8721}})
* {{pip 8737}}: Add a hook for batch downloading
* {{pip 8771}}: Parallelize wheel download

## The Side Quests

In order to keep the wheel turning (no pun intended) and avoid wasting time
waiting for the pull requests above to be reviewed, I decided to create
even more PRs (as I am typing this, many of the patches listed below
are nowhere near being merged).

* {{pip 7878}}: Fail early when install path is not writable
* {{pip 7928}}: Fix rst syntax in Getting Started guide
* {{pip 7988}}: Fix tabulate col size in case of empty cell
* {{pip 8137}}: Add subcommand alias mechanism
* {{pip 8143}}: Make mypy happy with beta release automation
* {{pip 8248}}: Fix typo and simplify ireq call
* {{pip 8332}}: Add license requirement to `_vendor/README.rst`
* {{pip 8423}}: Nitpick logging calls
* {{pip 8435}}: Use str.format style in logging calls
* {{pip 8456}}: Lint `src/pip/_vendor/README.rst`
* {{pip 8568}}: Declare constants in configuration.py as such
* {{pip 8571}}: Clean up `Configuration.unset_value` and nit `__init__`
* {{pip 8578}}: Allow verbose/quiet level to be specified
  via config files and environment variables
* {{pip 8599}}: Replace tabs by spaces for consistency
* {{pip 8614}}: Use `monkeypatch.setenv` to mock environment variables
* {{pip 8674}}: Fix `tests/functional/test_install_check.py`,
  when run with new resolver
* {{pip 8692}}: Make assertion failure give better message
* {{pip 8709}}: List downloaded distributions before exiting (fix {{pip 8696}})
* {{pip 8759}}: Allow py2 deprecation warning from setuptools
* {{pip 8766}}: Use the new resolver for test requirements
* {{pip 8790}}: Mark tests using remote svn and hg as xfail
* {{pip 8795}}: Reformat a few spots in user guide

## The Plot Summary

Every Monday throughout the Summer of Code, I summarized what I had done
in the week before in the form of either a short blog post or an (even shorter)
check-in.  These write-ups often contain handfuls of popular culture references
and were originally hosted on [Python GSoC]({{python_gsoc}}).

* {{abslink blog/2020/gsoc/checkin/1}}
* {{abslink blog/2020/gsoc/article/1}}
* {{abslink blog/2020/gsoc/checkin/2}}
* {{abslink blog/2020/gsoc/article/2}}
* {{abslink blog/2020/gsoc/checkin/3}}
* {{abslink blog/2020/gsoc/article/3}}
* {{abslink blog/2020/gsoc/checkin/4}}
* {{abslink blog/2020/gsoc/article/4}}
* {{abslink blog/2020/gsoc/checkin/5}}
* {{abslink blog/2020/gsoc/article/5}}
* {{abslink blog/2020/gsoc/checkin/6}}
* {{abslink blog/2020/gsoc/article/6}}
* {{abslink blog/2020/gsoc/checkin/7}}
* {{abslink blog/2020/gsoc/article/7}}