Use “pip install” safely
This is part of a series of posts I’m doing as a sort of Python/Django Advent calendar, offering a small tip or piece of information each day from the first Sunday of Advent through Christmas Eve. See the first post for an introduction.
Managing dependencies should be boring
Last year I wrote a post about “boring” dependency management in Python, where I advocated a setup built entirely around standard Python packaging tools: the pip package installer and the venv module for creating virtual environments. (The only non-standard thing I recommended there was pip-tools, specifically its pip-compile script, and the only reason I’m willing to recommend it at all is that you can replace it by writing your own script that strings together the equivalent pip commands; it’s just a convenience to save you doing that yourself.)
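To make that concrete, the lock step with pip-compile looks something like this (the .in file name is my assumption; the output path matches the install command later in this post):

# Resolve and pin the full dependency tree, with hashes, from a
# hand-maintained list of top-level dependencies (file names illustrative):
pip-compile \
    --generate-hashes \
    --output-file requirements/app.txt \
    requirements/app.in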
The end result of that process, if you follow it, will be a set of “requirements files” which can be given to pip to install from, and which will contain your entire dependency tree, fully resolved (including all transitive dependencies-of-dependencies-of-dependencies, etc.), pinned to exact versions, and listing the expected hashes of the packages.
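For illustration, a fragment of such a file might look like this (the package names, versions, and digests here are placeholders, not real values):

# Illustrative fragment of a fully-resolved, hash-pinned requirements file:
asgiref==3.7.2 \
    --hash=sha256:<64-hex-char-digest>
django==4.2.7 \
    --hash=sha256:<digest-for-the-wheel> \
    --hash=sha256:<digest-for-the-sdist>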
The last step, of course, is to install those packages, which I recommend you do with pip. But I want to build on that a little bit, and I may go back and update part of the original post. Originally I did explain why to use python -m pip instead of pip, which is something I’d seen and learned to do a while back, but never had a great one-page explainer for until Brett Cannon wrote one.
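If you want to see the difference for yourself, compare what each command reports (the exact output varies by system):

# “pip” resolves via PATH and may belong to a different interpreter;
# “python -m pip” is always the pip of that exact interpreter.
pip --version
python -m pip --version   # both print which interpreter’s pip actually ran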
But you can and probably should go a little further than this, by passing a few extra flags to pip. Specifically, in the original post I gave the example of running
python -m pip install -r requirements/app.txt
But really it ought to be:
python -m pip install \
--require-hashes \
--no-deps \
--only-binary :all: \
-r requirements/app.txt
Let’s unpack that:
--require-hashes should be redundant, since the process I recommended ensures your requirements file has all the expected hashes of the packages to install, and as soon as pip sees even a single hash in a requirements file it will enforce that all the packages have hashes specified. But what if you accidentally forget a step somewhere, or otherwise accidentally commit a hash-less requirements file? That’s where --require-hashes protects you: it tells pip that you intend this requirements file to have hashes for all packages, and to fail if pip doesn’t find them.

--no-deps tells pip not to try to resolve and install dependencies of the packages you’ve specified. It shouldn’t need to, since your requirements file should already contain the full dependency tree, resolved and pinned to exact versions, so all pip should need to do is fetch and install that set of packages. This does require you to stick to the process and provide a fully-resolved dependency set in your requirements file, but it also shuts down one way you might encounter surprises during installs (as can sometimes happen, and not only due to malice: if newer versions of some transitive dependencies have been published and pip tries to do a fresh resolution, it might see those newer versions, prefer them, and then fail because they weren’t the exact pinned-and-hashed versions given in the requirements file). It’s also a little bit faster, since dependency resolution is an expensive step.

--only-binary :all: is the only one you might need to skip, but I recommend trying to use it until you know you can’t. This tells pip to use only “binary” (.whl) packages, and never to consider installing a “source” (.tar.gz) package. The naming is a bit confusing, since many “binary” packages are pure Python and contain only Python source code, but the “binary” format can also contain pre-compiled extension modules written in C (or other languages). The big advantage here is reproducibility and a smaller attack surface: a “source” package may need to compile extension modules, and may execute a setup.py script containing arbitrary Python code during the build or installation, while installing a “binary” package consists only of unpacking it and moving the files inside to their destinations. Since this can’t run any code during the installation, and always puts the same set of files in the same locations, it massively cuts down on potential worries about a compromised package doing bad things during the install, and gets you as close as the standard Python packaging toolchain can get to exactly-reproducible builds (though, due to platform-specific variants of “binary” packages, it may still not be exactly the same set of bytes on every machine every time). Unfortunately, not all packages provide the “binary” .whl format, so you might not be able to use this flag; if you depend on any packages that only provide the “source” (.tar.gz) format, you can change this to --prefer-binary, which uses .whl for as many packages as possible (as shown below).
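So if you do hit a dependency that ships no wheel, the softened version of the install command just swaps that one flag:

python -m pip install \
    --require-hashes \
    --no-deps \
    --prefer-binary \
    -r requirements/app.txt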
Brett Cannon, who wrote the above-linked explanation of why to use python -m pip, has also published a GitHub Action which automatically runs pip with these flags, to ensure you’re using it as securely as possible when running your CI.
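And if you’d rather not pull in someone else’s Action, a hand-rolled equivalent step is only a few lines; here’s a sketch (the Python version and requirements path are my assumptions):

# Sketch of a GitHub Actions job step enforcing the safer install flags
# (Python version and file path are assumptions, not prescribed):
- uses: actions/setup-python@v5
  with:
    python-version: "3.12"
- name: Install pinned dependencies
  run: |
    python -m pip install \
      --require-hashes \
      --no-deps \
      --only-binary :all: \
      -r requirements/app.txt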