Use Linux user namespaces to fix permissions in docker volumes
Posted on 2017-07-02 in Programmation
Not long ago, I published an article about using Unix sockets with docker. These sockets were in docker volumes so they could be shared between various containers. The key idea was to change the UID and GID of the user that owns the socket in the container so they match those of the user that built the image. The main issue with this approach is that it requires you to build the container with the user that will run it. This makes the solution not portable.
Fortunately, the Linux kernel gives us an alternative that maps user ids inside the container to predictable user ids outside: user namespaces. According to Wikipedia: Namespaces are a feature of the Linux kernel that isolates and virtualizes system resources of a collection of processes. Examples of resources that can be virtualized include process IDs, hostnames, user IDs, network access, interprocess communication, and filesystems. Namespaces are a fundamental aspect of containers on Linux.
For instance, thanks to the PID namespace, a process running inside a container can "think" it has PID 1 while in fact it has another one on the host. The same is true with user namespaces: a user can "think" it has uid 0 (root) while in fact it has uid 1000 (some standard user). This lets us be sure that, for the files in a docker volume:
- All files belonging to the root user in the container will belong to a non-root user on the host.
- All files belonging to other users in the container will be mapped to a predictable uid (more on that later).
Configure docker
Let's configure docker to do all that.
First we either need to start the docker daemon with the --userns-remap USER flag or make sure the configuration file of the docker daemon (/etc/docker/daemon.json) contains something like:
{ "userns-remap": "USER" }
Notes:
- In both cases, USER must be a valid user of the system (i.e. present in /etc/passwd).
- Don't forget to restart the daemon if you have to edit the file.
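As a minimal sketch of the two notes above, the snippet below refuses to produce the configuration for a user the system does not know. The function name render_daemon_json is made up for this illustration; it only prints the JSON shown above and leaves writing /etc/docker/daemon.json and restarting the daemon to you.

```shell
# render_daemon_json USER
# Print the daemon.json content for USER, but only if USER exists on the system.
# (Hypothetical helper name; it does not touch /etc/docker/daemon.json.)
render_daemon_json() {
    if ! id -u "$1" >/dev/null 2>&1; then
        echo "unknown user: $1" >&2
        return 1
    fi
    printf '{ "userns-remap": "%s" }\n' "$1"
}

# Example (not executed here):
#   render_daemon_json jenselme
```

If the user does not exist, the function prints an error on stderr and returns a non-zero status instead of emitting a configuration that the daemon would reject.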
Configure the subordinate uid/gid
subuid and subgid specify the user/group ids an ordinary user can use to configure id mapping in a user namespace. Each entry is written as username:id:count. For instance, jenselme:100000:65536 means that user jenselme can use 65536 user ids starting at 100000.
Docker uses these entries to remap uids in the container to uids on the host. Ranges are applied in order, starting from uid 0 in the container. With jenselme:100000:65536 alone, uid 0 in the container becomes 100000 on the host and uid 33 becomes 100033. With the two-line configuration used below, the first line takes uid 0, so a file with a uid of 33 in the container is a file with a uid of 100032 on the host. And you will have access to that file (with one caveat on the host side, described in the tests below). Neat, isn't it?
Now that we've seen the theory, let's configure them properly. First, edit /etc/subuid and add (replace jenselme by your own user name):
jenselme:1000:1
jenselme:100000:65536
You should be able to understand the second line. The first one is there for a slightly different purpose: it makes sure that all files created by root in the container belong to the user with uid 1000 on the host. That's me on my machine; you should of course use your own uid (you can get it with id -u USER). Without that line, they would belong to uid 100000.
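To make the arithmetic concrete, here is a minimal sketch of how a container id translates to a host id with the two lines above. The function name map_id is made up for this illustration, and the sketch assumes the ranges are consumed in the order listed, starting at container id 0. It does not read /etc/subuid or talk to docker.

```shell
# map_id CONTAINER_ID START:COUNT [START:COUNT ...]
# Print the host id for CONTAINER_ID, consuming the ranges in the order given.
# (Hypothetical helper; the body runs in a subshell so its variables stay local.)
map_id() (
    id=$1
    shift
    offset=0    # first container id covered by the current range
    for range in "$@"; do
        start=${range%%:*}
        count=${range##*:}
        if [ "$id" -lt $((offset + count)) ]; then
            echo $((start + id - offset))
            return 0
        fi
        offset=$((offset + count))
    done
    echo "id $id is not mapped" >&2
    return 1
)

map_id 0 1000:1 100000:65536     # prints 1000   (root in the container)
map_id 33 1000:1 100000:65536    # prints 100032 (www-data in the container)
```

The one-id range is used up by container id 0, which is why every other id is shifted by one inside the large range.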
Now, edit /etc/subgid and add (replace jenselme by your own user name):
jenselme:982:1
jenselme:100000:65536
The second line is the same in both cases. For the first one, I didn't use jenselme:1000:1 but jenselme:982:1. On my machine, 982 is the gid of the docker group (you can get it with getent group docker). This means that all files created by root will belong to me and to the docker group. This "trick" can be handy if for some reason you need to share files with the docker daemon. For instance, software like Traefik may need to read from and write to the docker socket. By default, for this socket we have:
[root@fastolfe ~]# ll /var/run/docker.sock
srw-rw----. 1 root docker 0 Jun 11 18:18 /var/run/docker.sock
This means that if, outside the container, the uid and gid of root are mapped to those of jenselme, Traefik won't be able to communicate with the socket because of the file's permissions. Mapping the gid of root in the container to the gid of docker on the host prevents that issue.
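The entries for both files follow the same pattern, so they can be generated rather than typed. The sketch below is only an illustration: subid_lines is a made-up name, and the 100000:65536 range is simply the one used in this article. It prints the lines and leaves editing /etc/subuid and /etc/subgid to you.

```shell
# subid_lines USER ROOT_ID
# Print the two entries used in this article: a one-id range that receives
# root from the container, then the large range for every other id.
# (Hypothetical helper; it prints the lines and does not edit any file.)
subid_lines() {
    printf '%s:%s:1\n' "$1" "$2"
    printf '%s:100000:65536\n' "$1"
}

# On a real machine the ids would come from:
#   id -u jenselme                         (uid, for /etc/subuid)
#   getent group docker | cut -d: -f3      (gid, for /etc/subgid)
subid_lines jenselme 1000    # lines for /etc/subuid
subid_lines jenselme 982     # lines for /etc/subgid
```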
Note on security: Giving access to the docker socket is a problem from a security standpoint since it allows a container to create new containers, thus giving it access to the whole host system with root permissions, e.g. by running docker run -it --privileged -v /:/host --userns=host fedora chroot /host. That is why SELinux prevents the docker socket from being mounted in a volume by default. You should be aware of that when you do this. See this for more on that topic.
Tests
Now that we are all set, let's start the docker daemon (or restart it).
Note to SELinux users: You need to append Z (capital z) when mounting the volumes, like this: -v $(pwd)/test:/test/:Z. Otherwise, the SELinux context will not be correct and you won't be able to access the volumes from the container. See this docker tip.
The first thing you should notice is that if you had downloaded images or created containers, you will not see them with docker images or docker ps -a. That's because, when user re-mapping is enabled, all images and containers are located in a dedicated subfolder. On my machine, that is /var/lib/docker/1000.982.
Now that we know this is expected, let's try things. Run somewhere:
docker run -it -v "$(pwd)/test:/test/" nginx /bin/bash
This will open a bash prompt as root in the container. Go to the volume with cd /test and create a file: touch rootfile. If you run ls -l inside the container, you should see something like:
root@02a5bcc1757c:/test# ls -l
total 0
-rw-r--r--. 1 root root 0 Jun 11 16:25 rootfile
Let's check the uid and gid to be sure:
root@02a5bcc1757c:/test# ls -ln
total 0
-rw-r--r--. 1 0 0 0 Jun 11 16:25 rootfile
So the file belongs to root and its uid is 0 as well as its gid.
Now run ls -l in the host:
▶ ls -l
total 0
-rw-r--r--. 1 jenselme docker 0 Jun 11 18:25 rootfile
Let's check the uid and gid:
▶ ls -ln
total 0
-rw-r--r--. 1 1000 982 0 Jun 11 18:25 rootfile
That's correct. Now let's do the same thing with the www-data user. First, let's give some permissions on the /test folder to the www-data user. Since this is just a test, let's run chmod 777 /test. Now, switch to this user with su -s /bin/bash www-data. You should now be in the /test directory, logged in as www-data. Create a file with touch www-data-file. You should see something like:
www-data@02a5bcc1757c:/test$ ls -l
total 0
-rw-r--r--. 1 root root 0 Jun 11 16:36 rootfile
-rw-r--r--. 1 www-data www-data 0 Jun 11 16:38 www-data-file
And:
www-data@02a5bcc1757c:/test$ ls -ln
total 0
-rw-r--r--. 1 0 0 0 Jun 11 16:36 rootfile
-rw-r--r--. 1 33 33 0 Jun 11 16:38 www-data-file
As far as the host is concerned, we have:
▶ ls -l
total 0
-rw-r--r--. 1 jenselme docker 0 Jun 11 18:36 rootfile
-rw-r--r--. 1 100032 100032 0 Jun 11 18:38 www-data-file
And
▶ ls -ln
total 0
-rw-r--r--. 1 1000 982 0 Jun 11 18:36 rootfile
-rw-r--r--. 1 100032 100032 0 Jun 11 18:38 www-data-file
Now let's create some files from the host. For instance, let's do touch www-data-file-from-host. In the host it currently belongs to the current user. Let's see in the container:
www-data@02a5bcc1757c:/test$ ls -l
total 0
-rw-r--r--. 1 root root 0 Jun 11 16:36 rootfile
-rw-r--r--. 1 www-data www-data 0 Jun 11 16:38 www-data-file
-rw-r--r--. 1 root nogroup 0 Jun 11 16:41 www-data-file-from-host
It belongs to root and nogroup, as expected. On the host, the file belongs to jenselme:jenselme, not jenselme:docker, hence the nogroup; I could run chown jenselme:docker www-data-file-from-host to fix the gid. If you check the uid and gid, you will see they are also as expected.
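The same mapping can be read in the other direction to see why a file created on the host shows up as root and nogroup. This is only a sketch: to_container_id is a made-up name, the ranges are the ones configured earlier, and it assumes the jenselme group has gid 1000 on the host, which is not stated above.

```shell
# to_container_id HOST_ID START:COUNT [START:COUNT ...]
# Print the id seen inside the container for HOST_ID, or "unmapped" when no
# range covers it. Unmapped ids are the ones displayed as nobody/nogroup.
# (Hypothetical helper; the body runs in a subshell so its variables stay local.)
to_container_id() (
    host_id=$1
    shift
    offset=0    # first container id covered by the current range
    for range in "$@"; do
        start=${range%%:*}
        count=${range##*:}
        if [ "$host_id" -ge "$start" ] && [ "$host_id" -lt $((start + count)) ]; then
            echo $((offset + host_id - start))
            return 0
        fi
        offset=$((offset + count))
    done
    echo unmapped
)

to_container_id 1000 1000:1 100000:65536      # uid of jenselme: prints 0 (root)
to_container_id 100032 1000:1 100000:65536    # prints 33 (www-data)
to_container_id 1000 982:1 100000:65536       # assumed gid of jenselme: prints unmapped
```

With the /etc/subgid ranges, only gid 982 (docker) lands on gid 0 in the container, so any other low gid from the host has no counterpart inside.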
Now let's change the owner of www-data-file-from-host to 100032:100032 with chown 100032:100032 www-data-file-from-host (this must be run as root to avoid an Operation not permitted error). I'll let you check the owner, uid and gid in the container. You can also check that the www-data user can write to the file with echo 'test' > www-data-file-from-host.
This looks good, doesn't it? I found one dark spot, however. If you try to edit www-data-file-from-host or www-data-file on the host, it will fail with permission denied. As far as I understand subuid and subgid, this is not normal. If someone has an explanation for this, please leave a comment. I see two workarounds:
The basic:
- Create a group with id 100032 (as root): groupadd -g 100032 docker-www-data
- Add yourself to this group (as root): usermod -aG docker-www-data jenselme
- Disconnect/reconnect or use the newgrp docker-www-data command to take this change into account.
- Give write permission to the group in the container: chmod g=rw www-data-file
- Write in the file.
Note: You cannot do anything about the user since you can only have one user id.
The elegant: use ACLs (Access Control Lists). See the external links section to learn more about ACLs. TL;DR: ACLs are a way to extend the standard permissions of the filesystem. With them, you can set permissions on a file or directory with very fine granularity, for each user and group of the system. To enable ACLs, run as root:
setfacl -Rdm u:USER:rwX DIR (replace USER by a username and DIR by a path to a directory or file). This will:
- -R recurse on subfolders.
- -d default to this rule. This means that the ACL will apply to all files and directories created in DIR after the setfacl was run.
- -m modify the rule to u:USER:rwX, that is, give the user (u:) USER the permissions rwX. The capital X gives execute permission to all directories and to files that already have an execute permission. This prevents us from making all files executable.
- apply to DIR
setfacl -Rm u:USER:rwX DIR (replace USER by a username and DIR by a path to a directory or file). This will apply the ACL rule on the existing files in DIR.
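The two commands always go together, so they can be wrapped in a small helper. The sketch below uses made-up names (run, grant_acl) and a DRY_RUN variable that only prints the commands, which is convenient for checking what would be executed before running it as root.

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# grant_acl USER DIR
# Apply the default ACL (for files created later) and then the same rule
# to what already exists in DIR. (Hypothetical helper names.)
grant_acl() {
    run setfacl -Rdm "u:$1:rwX" "$2"
    run setfacl -Rm "u:$1:rwX" "$2"
}

# Only print the commands; nothing is changed on disk.
DRY_RUN=1 grant_acl jenselme test
```

Without DRY_RUN=1, the same call runs both setfacl commands for real.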
Bonus
Create files
You may not be able, or may not want, to create the files (e.g. logs) when the image is built or when the container starts, yet you still want to be sure the container can create them without running chmod -R 777 DIR. In that case, you can use the commands below. We assume in this example that a log folder contains many volumes, each holding the logs of one container.
Create all the containers with the proper volumes.
Run chmod u=rwX,g=rwX,o=rwX -R log/ so all log files can be created.
Run find log -type d > dir-to-create to save a list of the directories you will need.
Run find log -type f > files-to-create to save a list of the files you will need.
Stop the containers and destroy them.
Delete the folder: rm -rf log
Create the directories:
for dir in $(cat dir-to-create); do
    mkdir -p "${dir}"
done
Create the files:
for file in $(cat files-to-create); do
    touch "${file}"
done
Fix the permissions (run the commands as root):
- To change the owner: find log -type f -exec chown 100032:100032 {} \;
- To give write access to the group: find log -type f -exec chmod g=rw {} \;
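The two loops above split on whitespace, so a path containing a space would be recreated incorrectly. As a hedged alternative, the sketch below reads the lists line by line. The function name recreate_tree is made up, and it assumes one path per line with no newline inside a path.

```shell
# recreate_tree DIR_LIST FILE_LIST
# Recreate the directories and empty files recorded by the two find commands.
# Reads one path per line, so paths containing spaces are handled.
# (Hypothetical helper name.)
recreate_tree() {
    while IFS= read -r dir; do
        mkdir -p -- "$dir"
    done < "$1"
    while IFS= read -r file; do
        touch -- "$file"
    done < "$2"
}

# Example (not executed here), from the folder that contained log/:
#   recreate_tree dir-to-create files-to-create
```

The chown and chmod commands above still have to be run as root afterwards.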
Unix sockets
I guess you now see how to improve the solution described in my previous article. Since we can foresee which uid and gid on the host a given uid and gid in a container will map to, what we need to do is:
- Change the uid and gid in the containers so they are the same. For instance usermod -u 60 uwsgi and groupmod -g 60 uwsgi. This will map to 100059 in the host.
- Create a group with gid 100059 on the host.
- Put the proper user in the newly created group.
- Give permission on the socket to the group.
This is a bit easier than before since we only have to do the operation once per machine, and we can share the container images without any issues since it is the kernel that does the mapping dynamically.
Note: you can't just rely on ACLs here, since the uid and gid of all containers that will use the socket need to be the same. You can however use ACLs to give permissions to the socket to a user of the host.
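The host-side part of these steps can be sketched as below. Everything here is illustrative: the function names and the group name docker-uwsgi are made up, the 100000 start is the range configured earlier in this article, and the sketch only prints the commands instead of running them.

```shell
# host_gid_for CONTAINER_GID RANGE_START
# With the layout used in this article, container gid 0 takes the one-id
# range, so every other gid is shifted by one inside the large range.
# Only valid for CONTAINER_GID >= 1. (Hypothetical helper name.)
host_gid_for() {
    echo $(($2 + $1 - 1))
}

# print_host_setup CONTAINER_GID GROUP USER
# Print the commands to run as root on the host.
print_host_setup() {
    echo "groupadd -g $(host_gid_for "$1" 100000) $2"
    echo "usermod -aG $2 $3"
}

print_host_setup 60 docker-uwsgi jenselme
# prints:
#   groupadd -g 100059 docker-uwsgi
#   usermod -aG docker-uwsgi jenselme
```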
External links
- Docker for your users - Introducing user namespace
- Introduction to User Namespaces in Docker Engine
- ACL: Using Access Control Lists on Linux
That's it. If you have a question or a remark, please leave a comment below.