As part of a school project on OS-level virtualization, I am currently studying the Linux kernel mechanisms responsible for isolation in containers. I found this very interesting talk about what containers are made from, by Jérôme Petazzoni from Docker. It ends with a quick demonstration that consists of creating a basic Linux container using namespaces. I found it very informative, but many parts were explained too quickly (or not explained at all, which is perfectly normal for a talk demo) or did not work on my machine. That is why I decided to redo that demonstration with more explanation about each step, as well as references to related documentation.
Required setup
As in the original demo, I will take advantage of the Btrfs filesystem I use on my laptop. Although a Btrfs subvolume is very useful for creating containers, as it can be mounted and snapshotted, it is not required and you can use other approaches. By the way, there was a time I wanted to write a tutorial about installing Arch Linux on Btrfs to report the problems I encountered, but there were already plenty of those on the web (here and here for instance). I will also use Docker to get a base image (Alpine Linux) for my container, but again you can easily do it another way. I am using kernel 4.5.4.
Getting and setting the container’s base image
First, as explained in the video, a good practice is to ensure your mountpoints are private so that you will not pollute your host system when playing with the mount command to create your container (you can see I’m starting from an empty Btrfs subvolume /test):
root@hyperion:/test# ls -la
total 0
drwx------ 1 root root  32 May 16 21:53 .
drwxr-xr-x 1 root root 130 May 16 19:21 ..
root@hyperion:/test# mount --make-rprivate /
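If you want to check the result, findmnt(8) can display mount propagation (a quick sanity check; for the root mount you should now see something like this):
root@hyperion:/test# findmnt -o TARGET,PROPAGATION /
TARGET PROPAGATION
/      private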
Then you need a GNU/Linux system image to be run by your container. You can find an image yourself (here for instance) or you can use Docker as done in the video. I will use the second option as it will show you some Docker basics.
A clean way of handling images is to unpack them in a Btrfs subvolume and then work on a snapshot of that subvolume. This way, you can take advantage of Btrfs copy-on-write to have multiple containers using the same base image and thus minimum storage. To do that, we start with:
root@hyperion:/test# mkdir -p images containers
root@hyperion:/test# btrfs subvolume create images/alpine
Create subvolume 'images/alpine'
As I said earlier, I will use the exact same Docker trick as in the video to easily get a base image for my container:
root@hyperion:/test# docker export $(docker run -d alpine true) | tar -C images/alpine -xf-
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
d0ca440e8637: Pulling fs layer
d0ca440e8637: Download complete
d0ca440e8637: Pull complete
Digest: sha256:f655166f57d91bdfc8b3bc75a20391b7516de9f48ca761249c185fcb022124d2
Status: Downloaded newer image for alpine:latest
root@hyperion:/test# ls images/alpine/
bin dev etc home lib linuxrc media mnt proc root run sbin sys tmp usr var
The first command produced more output than in the video since I did not have the Alpine image locally, and thus the Docker daemon had to pull it from the Docker Hub.
The docker run -d alpine true part runs the Alpine image in a Docker container in detached mode (in the background). This command returns the container’s ID, which is then passed to docker export, which in turn outputs a tarball of the container’s filesystem. The rest of the line is trivial if you know how to use the tar command (or read the manual).
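As an aside, if you would rather not depend on Docker at all, you could fetch Alpine’s mini root filesystem tarball directly and unpack it into the subvolume. Something like the following should do, but the URL and version here are only illustrative, so check the Alpine downloads page first:
root@hyperion:/test# wget https://dl-cdn.alpinelinux.org/alpine/v3.3/releases/x86_64/alpine-minirootfs-3.3.3-x86_64.tar.gz
root@hyperion:/test# tar -C images/alpine -xzf alpine-minirootfs-3.3.3-x86_64.tar.gz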
The second command shows us that we finally obtained a base image as wanted. We can now use a snapshot as I explained earlier:
root@hyperion:/test# btrfs subvolume snapshot images/alpine containers/tutorial
Create a snapshot of 'images/alpine' in 'containers/tutorial'
root@hyperion:/test# touch containers/tutorial/TEST
root@hyperion:/test# ls images/alpine/ containers/tutorial/
containers/tutorial/:
bin dev etc home lib linuxrc media mnt proc root run sbin sys TEST tmp usr var
images/alpine/:
bin dev etc home lib linuxrc media mnt proc root run sbin sys tmp usr var
We now have a root filesystem in containers/tutorial that will serve for our tutorial container (I always lack imagination for choosing hostnames). The touch command illustrates that copy-on-write is working correctly: the snapshot diverges from the image without affecting it. We can then chroot into that subvolume to test what we did (apk is Alpine’s package management tool):
root@hyperion:/test# chroot containers/tutorial/ /bin/sh
/ # /bin/ls
TEST bin dev etc home lib linuxrc media mnt proc root run sbin sys tmp usr var
/ # /sbin/apk
apk-tools 2.6.5, compiled for x86_64.
usage: apk COMMAND [-h|--help] [-p|--root DIR] [-X|--repository REPO] [-q|--quiet] [-v|--verbose] [-i|--interactive] [-V|--version] [-f|--force] [-U|--update-cache] [--progress] [--progress-fd FD] [--no-progress] [--purge] [--allow-untrusted] [--wait TIME]
[--keys-dir KEYSDIR] [--repositories-file REPOFILE] [--no-network] [--no-cache] [--arch ARCH] [--print-arch] [ARGS]...
[...]
/ # exit
root@hyperion:/test#
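By the way, this is where the copy-on-write approach pays off: every additional container based on the same image only costs one more snapshot. For instance (with a hypothetical second container, deleted right away):
root@hyperion:/test# btrfs subvolume snapshot images/alpine containers/another
root@hyperion:/test# btrfs subvolume delete containers/another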
Isolation with namespaces
It is now time to start playing with Linux namespaces to provide our container with some isolation. Linux namespaces are a kernel feature and are the building block of isolation, and therefore of containers. Basically, the idea is that every process is associated with a set of namespaces (there are six types of them) and can only see the resources associated with those namespaces. It is a very powerful and interesting part of the kernel, and it has become a critical security feature. You need a basic understanding of what Linux namespaces are to really follow what we will be doing from now on. This article is a good start.
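A quick way to get a feeling for namespaces is to look at /proc: each process has a /proc/<pid>/ns directory whose entries identify the namespaces it belongs to (the bracketed inode number below identifies the namespace and may differ on your system):
root@hyperion:~# ls /proc/$$/ns
ipc  mnt  net  pid  user  uts
root@hyperion:~# readlink /proc/$$/ns/uts
uts:[4026531838]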
We are going to use the unshare(1) command to run bash with namespaces unshared from its parent process (each namespace except User). In order to truly understand what this command does, let’s use strace(1) to trace system calls:
root@hyperion:/test# strace -e unshare,clone unshare --mount --uts --ipc --net --pid --fork bash
unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd16d89d9d0) = 25958
root@hyperion:/test#
As you can see, the unshare(1) command triggers the unshare(2) and clone(2) system calls to create a new process (the latter because of the --fork option, otherwise you would not have seen the second syscall). You should also notice the CLONE_NEW* flags: they specify which new namespaces are to be created for a given process, and these exact same flags can be provided as parameters to the clone(2) syscall.
It doesn’t seem like much happened but it actually did:
root@hyperion:/test# hostname tutorial
root@hyperion:/test# exec bash
root@tutorial:/test#
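The new prompt is a first hint, but you can verify more rigorously that this shell now lives in its own namespaces by comparing its namespace links with those of a host shell, as shown earlier; the inode numbers will differ (the values below are illustrative, yours will vary):
root@tutorial:/test# readlink /proc/$$/ns/uts
uts:[4026532448]
root@hyperion:~# readlink /proc/$$/ns/uts
uts:[4026531838]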
The UTS namespace is doing its job: we changed the hostname inside our namespace (bash only reads the hostname when it starts, hence the exec bash to refresh the prompt) while the hostname of the host system did not change. You can also experiment with the Mount namespace by mounting a USB stick for instance:
root@tutorial:/test# mount /dev/sdb1 /mnt/usbStick/
root@tutorial:/test# mount | grep sdb1
/dev/sdb1 on /mnt/usbStick type vfat (rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
And if you open another terminal on your host, you won’t see that mount:
root@hyperion:~# mount | grep sdb1 -c
0
So that’s it? We have a real container? I wouldn’t be so sure about that:
root@tutorial:/test# ps
PID TTY TIME CMD
26196 pts/2 00:00:00 sudo
26197 pts/2 00:00:00 bash
26200 pts/2 00:00:00 unshare
26202 pts/2 00:00:00 bash
26294 pts/2 00:00:00 ps
It looks like I am still seeing things I should not see if the PID namespace were doing its job. Actually, that is the source of many security flaws related to Linux containers: the proc filesystem mounted on /proc is still the host’s instance, so processes inside your “container” keep a view of the whole host system. Is the PID namespace inactive though? In fact, no:
root@tutorial:/test# pidof unshare
26200
root@tutorial:/test# kill 26200
bash: kill: (26200) - No such process
Weird. It actually makes sense though: pidof found 26200 by reading the host’s /proc (which is still the one mounted), but the kill(2) syscall operates in our new PID namespace, where no process has that PID. But let’s mount the proc filesystem (remember we are in a separate Mount namespace):
root@tutorial:/test# mount -t proc none /proc
root@tutorial:/test# ps
PID TTY TIME CMD
1 pts/2 00:00:00 bash
12 pts/2 00:00:00 ps
Your container now has its own proc filesystem, and you can see that the bash process has PID 1, as expected for the first process of a new PID namespace. Let’s move on:
root@tutorial:/test# umount /proc
What about the root filesystem? When are we truly entering our container? For that purpose, we need to use pivot_root(8) which in turn uses the pivot_root(2) syscall. In my opinion, the best way to understand what it does is to read its man page. Let’s try:
root@tutorial:/test# cd containers/tutorial/
root@tutorial:/test/containers/tutorial# mkdir oldroot
root@tutorial:/test/containers/tutorial# pivot_root . oldroot/
pivot_root: failed to change root from '.' to 'oldroot/': Device or resource busy
Well, it looks like someone did not follow his own advice and should have read the manual instead of thinking he was smarter than others:
The following restrictions apply to new_root and put_old:
- They must be directories.
- new_root and put_old must not be on the same filesystem as the current root.
- put_old must be underneath new_root, that is, adding a nonzero number of /.. to the string pointed to by put_old must yield the same directory as new_root.
- No other filesystem may be mounted on put_old.
Our new root is on the same mount as the current root, which violates the second restriction. Bind-mounting the subvolume somewhere (here onto /test) turns it into a mount point of its own, and the following commands do the trick:
root@tutorial:/test/containers/tutorial# cd /
root@tutorial:/# mount --bind /test/containers/tutorial/ /test/
root@tutorial:/# cd /test
root@tutorial:/test# ls
bin dev etc home lib linuxrc media mnt oldroot proc root run sbin sys TEST tmp usr var
root@tutorial:/test# pivot_root . oldroot/
root@tutorial:/test# cd
root@tutorial:~# /bin/ls -la
total 4
drwx------ 1 root root 24 May 16 20:24 .
drwx------ 1 root root 158 May 17 17:07 ..
-rw------- 1 root root 233 May 16 20:38 .ash_history
We are in! If you wonder what the difference is between chroot(1) and pivot_root(8), take a look at this.
What about processes? Our container has its own filesystem now so we won’t see any process until we mount proc:
root@tutorial:~# /bin/ps aux
PID USER TIME COMMAND
root@tutorial:~# /bin/mount -t proc none /proc
root@tutorial:~# /bin/ps aux
PID USER TIME COMMAND
1 root 0:00 bash
32 root 0:00 /bin/ps aux
Perfect. What about mounts?
root@tutorial:~# /bin/mount
[...]
tmpfs on /oldroot/tmp type tmpfs (rw,nosuid,nodev,noexec,relatime)
/dev/mapper/ssd-archlinux on /oldroot/home type btrfs (rw,relatime,ssd,space_cache,subvolid=258,subvol=/@home)
/dev/sda1 on /oldroot/boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
[...]
root@tutorial:~#
I still see all my host mounts. But again, we are in a separate namespace so there is no need to hesitate any further:
root@tutorial:~# /bin/umount -a
umount: can't unmount /: Resource busy
umount: can't unmount /oldroot: Resource busy
root@tutorial:~# /bin/mount -t proc none /proc
root@tutorial:~# /bin/mount
/dev/mapper/ssd-archlinux on /oldroot type btrfs (rw,relatime,ssd,space_cache,subvolid=257,subvol=/@root)
/dev/mapper/ssd-archlinux on / type btrfs (rw,relatime,ssd,space_cache,subvolid=297,subvol=/@root/test/containers/tutorial)
none on /proc type proc (rw,relatime)
As you can see, we had to mount proc again because umount -a had done its job and unmounted it along with (almost) everything else. Things look much better now, but we still have the old root around since its “Resource [is] busy”. A lazy unmount will help us:
root@tutorial:~# /bin/umount /oldroot
umount: can't unmount /oldroot: Resource busy
root@tutorial:~# /bin/umount -l /oldroot
root@tutorial:~# /bin/mount
/dev/mapper/ssd-archlinux on / type btrfs (rw,relatime,ssd,space_cache,subvolid=297,subvol=/@root/test/containers/tutorial)
none on /proc type proc (rw,relatime)
And that’s it, we are done with setting up namespaces for our container.
Playing with the network
Until now, we did not play with the Network namespace, and our container currently has no network connectivity at all:
root@tutorial:~# /sbin/ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
root@tutorial:~# /bin/ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: sendto: Network unreachable
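Note that a new Network namespace only contains a loopback interface, and even that one starts down (see the DOWN state above); it is worth bringing it up before going any further:
root@tutorial:~# /sbin/ip link set lo up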
We are going to play with virtual ethernet devices (veth) using the iproute2 utilities. This is a fun part, I promise.
Let’s start by creating a pair of veths. We need our container’s PID as seen from the host system, which is the PID of the unshare process we first launched to jump into new namespaces. You might find some of the following commands hard to understand if you are not familiar enough with the /sbin/ip command, so take your time and read the manual. From the host system:
root@hyperion:~# CPID=$(pidof unshare)
root@hyperion:~# ip l add name hosteth0 type veth peer name cteth0
Back into our container, we cannot see the corresponding veth because of the Network namespace:
root@tutorial:~# /sbin/ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
We need to use /sbin/ip to move the veth into the appropriate Network namespace, the one used by our container:
root@hyperion:~# ip link set cteth0 netns $CPID
Back into the container, we are now able to see our interface:
root@tutorial:~# /sbin/ip l show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
13: cteth0@if14: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN qlen 1000
link/ether be:c0:d3:c0:a7:cc brd ff:ff:ff:ff:ff:ff
Then we need to configure each interface. First in the container:
root@tutorial:~# /sbin/ip a add 192.168.0.2/24 dev cteth0
root@tutorial:~# /sbin/ip l set cteth0 up
Same thing on the host system:
root@hyperion:~# ip a add 192.168.0.1/24 dev hosteth0
root@hyperion:~# ip l set hosteth0 up
Here comes your reward (from the host system and then from the container):
root@hyperion:~# ping 192.168.0.2
PING 192.168.0.2 (192.168.0.2) 56(84) bytes of data.
64 bytes from 192.168.0.2: icmp_seq=1 ttl=64 time=0.086 ms
64 bytes from 192.168.0.2: icmp_seq=2 ttl=64 time=0.040 ms
64 bytes from 192.168.0.2: icmp_seq=3 ttl=64 time=0.035 ms
64 bytes from 192.168.0.2: icmp_seq=4 ttl=64 time=0.033 ms
64 bytes from 192.168.0.2: icmp_seq=5 ttl=64 time=0.058 ms
^C
--- 192.168.0.2 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3998ms
rtt min/avg/max/mdev = 0.033/0.050/0.086/0.020 ms
root@tutorial:~# /bin/ping 192.168.0.1
PING 192.168.0.1 (192.168.0.1): 56 data bytes
64 bytes from 192.168.0.1: seq=0 ttl=64 time=0.051 ms
64 bytes from 192.168.0.1: seq=1 ttl=64 time=0.048 ms
64 bytes from 192.168.0.1: seq=2 ttl=64 time=0.057 ms
64 bytes from 192.168.0.1: seq=3 ttl=64 time=0.059 ms
^C
--- 192.168.0.1 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.048/0.053/0.059 ms
If you wanted to reach the internet from your container, you could do it easily, for instance by adding a default gateway in your container and setting up some NAT rules on the host using iptables or nftables. But that is up to you!
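If you are curious, here is a minimal sketch of what that could look like, assuming you use iptables and that the host’s outgoing interface is wlan0 (adapt both to your setup). In the container, route everything through the host’s end of the veth pair:
root@tutorial:~# /sbin/ip route add default via 192.168.0.1
Then on the host, enable IPv4 forwarding and masquerade the container’s subnet:
root@hyperion:~# sysctl -w net.ipv4.ip_forward=1
root@hyperion:~# iptables -t nat -A POSTROUTING -s 192.168.0.0/24 -o wlan0 -j MASQUERADE
Name resolution would additionally require a valid /etc/resolv.conf inside the container.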
Final step
The Alpine image comes with only sh, not bash. Yet we are still running the latter.
Does that mean I lied to you? I did not. Strictly speaking, although we are done building our basic handmade container, we are still running the host’s runtime: remember that all this container creation started from an unshare <options> bash followed by an exec bash, so our shell is still the host’s bash binary. This is why, in order to do the required handover, we need to perform a final chroot:
root@tutorial:~# exec /usr/sbin/chroot / /bin/sh
/ #
And now, you can truly consider that you are in your container. Play with it, you deserved it.
Conclusion
We covered quite a lot of interesting things along this demonstration, but if you should remember one thing, it is that Linux namespaces are a very elegant and powerful feature. Nonetheless, we have only built a very basic container. Although namespaces are containers’ building blocks, a lot of other techniques come into play, among them cgroups, capabilities and MAC (using SELinux or AppArmor for instance). That is why building a secure container is a tough job. It is also convenient to have ways of quickly customizing containers, migrating them, freezing them, and so on. This explains why, even though many companies are tempted to implement their own containers fitting their needs, trusted and widespread solutions like LXC and Docker are often preferred.
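To give you a small taste of cgroups, here is a hedged sketch of limiting our container’s memory with the cgroup v1 memory controller, assuming it is mounted under /sys/fs/cgroup/memory (the default on my system at the time) and that $CPID still holds the PID of the unshare process:
root@hyperion:~# mkdir /sys/fs/cgroup/memory/tutorial
root@hyperion:~# echo $((64 * 1024 * 1024)) > /sys/fs/cgroup/memory/tutorial/memory.limit_in_bytes
root@hyperion:~# echo $CPID > /sys/fs/cgroup/memory/tutorial/tasks
Processes forked afterwards inherit the cgroup, but processes already running in the container would have to be added to the tasks file as well.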