Earlier this week, I was writing a Dockerfile to download and run a binary when I noticed the image size was way more than what I would expect. I’m using ubuntu:21.10 as the base image, which is about 70MB. The binary I’m running is about 80MB. Other packages I’m installing would add 10-ish MB. But the image size is 267MB?
Clearly I’m doing something wrong. But what? My Dockerfile is fairly simple and, what I considered, idiomatic:
FROM ubuntu:21.10 AS downloader
# Install wget, gnupg; download a zip archive; verify checksum; unzip the binary
FROM ubuntu:21.10
LABEL ...
COPY --from=downloader /bin/<binary> /bin/<binary>
RUN apt-get update && apt-get install -y openssl dumb-init iproute2 ca-certificates \
&& rm -rf /var/lib/apt/lists/* \
&& chmod +x /bin/<binary>
&& mkdir -p <couple of empty directories> \
&& ...
...
I checked the history of the image to see the size of individual layers. The problem became very apparent…
$ podman history vamc19/nomad:latest
ID CREATED CREATED BY SIZE COMMENT
...
<missing> 36 minutes ago /bin/sh -c apt-get update && apt-get insta... 94.4 MB
374515aec770 36 minutes ago /bin/sh -c # (nop) COPY file:6dbfa42743cc65... 87.7 MB
22cd380ad224 36 minutes ago /bin/sh -c # (nop) LABEL maintainer="Vamsi"... 0 B FROM docker.io/library/ubuntu:21.10
...
The layer created by COPY
is 87.7MB, which is exactly the size of the extracted binary. So, that is normal. Why is the
layer created by RUN
94.4MB? What am I doing in it? I’m creating a couple of empty directories, running chmod
on the
binary and installing 4 packages. Apt says the packages would only consume ~6MB of additional disk space and it is very
unlikely that these packages would do anything crazy post install. So, is chmod
creating a problem?
To quickly test this, I removed chmod
from RUN
and rebuilt the image. And bingo - the image size is down to 174MB.
And the RUN
layer’s size is down to 6.7MB. So, OverlayFS is copying the binary into RUN
layer even though chmod
is
only updating the metadata of the file…?
My understanding of CoW filesystems is very superficial - unless I write to a file, the filesystem would never copy the file to upper layer. And since chmod is not writing to the binary (did file’s hash change?), it should not be copied, correct? Obviously not. Honestly, I never thought about it. I looked up OverlayFS’ documentation.
When a file in the lower filesystem is accessed in a way the requires write-access, such as opening for write access, changing some metadata etc., the file is first copied from the lower filesystem to the upper filesystem (copy_up).
Well, I have been doing it wrong all these years. I’ve written a lot of Dockerfiles with shell scripts in COPY
and
chmod
in RUN
. Maybe I never realized this because these files are usually very small to make a noticeable difference
in the image size.
So what’s the solution? In my case, since I’m using Podman (which uses Buildah), I can use --chmod
arg with COPY
to
copy a file and set proper permissions in the same layer. If you are using Docker, it is available in
BuildKit.
Note that any metadata update will lead to the same result - not just chmod
. Both Docker and Podman already support
--chown
for both COPY
and ADD
. Maybe this should be added to the Dockerfile Best Practices page.
PS: If you are wondering why a metadata update would make OverlayFS duplicate the entire file, it is for security reasons. You can enable “metadata only copy up” feature which will only copy the metadata instead of the whole file.
Do not use metacopy=on with untrusted upper/lower directories. Otherwise it is possible that an attacker can create a handcrafted file with appropriate REDIRECT and METACOPY xattrs, and gain access to file on lower pointed by REDIRECT. This should not be possible on local system as setting “trusted.” xattrs will require CAP_SYS_ADMIN. But it should be possible for untrusted layers like from a pen drive.
Update: This post was discussed on Hacker News and Lobsters