Create your own Linux container using namespaces

As part of a school project on OS-level virtualization, I am currently studying the Linux kernel mechanisms responsible for isolation in containers. I found this very interesting talk about what containers are made from, by Jérôme Petazzoni from Docker. It ends with a quick demonstration that creates a basic Linux container using namespaces. I found it very informative, but many parts were explained too quickly or not at all (which is perfectly normal for a talk demo), and some did not work on my machines. That is why I decided to redo that demonstration with more explanation of each step, as well as references to related documentation.

Required setup

As in the original demo, I will take advantage of the Btrfs filesystem I use on my laptop. A Btrfs subvolume is very convenient for creating containers because it can be mounted and snapshotted, but it is not required and you can use other approaches. By the way, there was a time when I wanted to write a tutorial about installing Arch Linux on Btrfs to report the problems I encountered, but there were already plenty of them on the web (here and here for instance).
I will also use Docker to get a base image (Alpine Linux) for my container, but again you can easily do it another way.
I am using kernel 4.5.4.

Getting and setting the container’s base image

First, as explained in the video, it is good practice to make sure your mount points are private, so that you do not pollute your host system when playing with the mount command to create your container (you can see I’m starting from an empty Btrfs subvolume, /test):
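A minimal sketch of that step follows. `mount --make-rprivate /` is the util-linux way of making every mount point private, recursively. The two function names are hypothetical helpers of mine, and `propagation_of` only reads the optional fields of /proc/self/mountinfo so you can check the state before and after.

```shell
# Make every mount point private so that later mounts stay out of the host.
# Must be run as root.
make_mounts_private() {
    mount --make-rprivate /
}

# Print the propagation state of a mount point.
# In mountinfo, field 5 is the mount point and the optional fields start at
# field 7 and end at the "-" separator: "shared:N", "master:N" or nothing.
# A mount that is both shared and a slave is reported as "shared" here.
propagation_of() {
    target=$1
    mountinfo=${2:-/proc/self/mountinfo}
    awk -v t="$target" '$5 == t {
        state = "private"
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) state = "shared"
            else if ($i ~ /^master:/ && state != "shared") state = "slave"
        }
        print state
    }' "$mountinfo"
}
```

Calling `propagation_of /` before and after `make_mounts_private` shows whether the change took effect.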

Then you need a GNU/Linux system image to be run by your container. You can find an image yourself (here for instance) or you can use Docker as done in the video. I will use the second option as it will show you some Docker basics.
A clean way of handling images is to unpack them in a Btrfs subvolume and then work on a snapshot of that subvolume. This way, you can take advantage of Btrfs copy-on-write to have multiple containers sharing the same base image, and thus minimal storage. To do that, we start with:
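As a sketch, assuming a layout where base images live under images/ and containers under containers/ (the images/ directory and the function name are my assumptions; only containers/tutorial appears later in this post):

```shell
# Prepare the directory layout and a subvolume for the base image.
prepare_layout() {
    root=${1:-/test}    # the empty Btrfs subvolume we start from
    mkdir -p "$root/images" "$root/containers" || return 1
    # The base image gets its own subvolume so that it can be snapshotted.
    btrfs subvolume create "$root/images/alpine"
}
```

Run `prepare_layout` as root; pass another directory as the first argument if your subvolume is not /test.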

As I said earlier, I will use the exact same Docker trick as in the video to easily get a base image for my container:
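A minimal sketch of that trick, wrapped in a function so that nothing runs until you call it. The destination path and the name `fetch_alpine` are assumptions; it needs a running Docker daemon.

```shell
# Export the filesystem of a throwaway Alpine container into a directory.
fetch_alpine() {
    dest=${1:-/test/images/alpine}
    # Start a container from the alpine image in detached mode; -d prints its ID.
    cid=$(docker run -d alpine true) || return 1
    # Stream the container's filesystem as a tarball and unpack it in place.
    docker export "$cid" | tar -C "$dest" -xf - || return 1
    # The result should look like a root filesystem (bin, etc, usr, ...).
    ls "$dest"
}
```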

The first command outputs more lines than in the video, since I did not have the Alpine image in my local library and the Docker daemon had to pull it from the Docker Hub.
The docker run -d alpine true part runs an Alpine image in a Docker container in detached mode (in the background). This command returns that container’s ID, which is then passed as a parameter to the docker export command, which in turn outputs a tarball of the container’s filesystem. The rest of the line is trivial if you know how to use the tar command (or its manual).
The second command shows us that we finally obtained a base image as wanted.
We can now use a snapshot as I explained earlier:
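Here is a sketch of the snapshot step, reusing the assumed images/alpine subvolume. The function name and the marker file THIS_IS_TUTORIAL are hypothetical.

```shell
# Snapshot the base image into a per-container subvolume and check that
# a write to the snapshot does not show up in the base image.
snapshot_image() {
    root=${1:-/test}
    name=${2:-tutorial}
    btrfs subvolume snapshot "$root/images/alpine" "$root/containers/$name" || return 1
    # Thanks to copy-on-write, this file exists only in the snapshot.
    touch "$root/containers/$name/THIS_IS_TUTORIAL"
    ls "$root/images/alpine" "$root/containers/$name"
}
```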

We now have a base image in containers/tutorial that will serve for our tutorial container (I always lack imagination for choosing hostnames). The touch command illustrates that the copy-on-write is working correctly for the snapshot.
We can then chroot into that subvolume to test what we did (apk is Alpine’s package management tool):

Isolation with namespaces

It is now time to start playing with Linux namespaces to provide our container with isolation. Linux namespaces are a kernel feature and the building block of isolation, and therefore of containers. Basically, the idea is that every process is associated with a set of namespaces (there are six types of them) and can only see the resources associated with those namespaces. It is a very powerful and interesting part of the kernel, and it has become a critical security feature. You need a basic understanding of what Linux namespaces are to really understand what we will be doing from now on. This article is a good start.

We are going to use the unshare(1) command to run bash with namespaces unshared from its parent process (each namespace except User).
In order to truly understand what this command does, let’s use strace(1) to trace syscalls:
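A sketch of both invocations, assuming the long option names of unshare(1) from util-linux. The function names are mine, and the traced command is `true` rather than bash only to keep the trace short.

```shell
# Start bash in new Mount, UTS, IPC, Network and PID namespaces.
# --fork makes bash a child of unshare, so that it becomes PID 1
# of the new PID namespace.
enter_namespaces() {
    unshare --mount --uts --ipc --net --pid --fork bash
}

# Same options under strace: -f follows the forked child and the filter
# keeps only the two syscalls we are interested in.
trace_unshare() {
    strace -f -e trace=unshare,clone unshare --mount --uts --ipc --net --pid --fork true
}
```

Both functions have to be called as root.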

As you can see, the unshare(1) command triggers the unshare(2) and clone(2) syscalls to create a new process (because of the --fork option; without it you would not have seen the second syscall). You should also notice the CLONE_NEW* flags. They specify which new namespaces are to be created for a given process, and these exact same flags can be provided as parameters to the clone(2) syscall.

It doesn’t seem like much happened but it actually did:

You can see that the UTS namespace is working as expected since the hostname of your host system did not change. You can also experiment with the Mount namespace by mounting a USB stick for instance:

And if you open another terminal on your host, you won’t see that mount:

So that’s it? We have a real container? I wouldn’t be so sure about that:

It looks like I am still seeing things I should not see if the PID namespace was doing its job.
Actually, that is the source of many security flaws related to Linux containers: the /proc we are looking at is still the host’s procfs, and many of its files are not namespaced at all. As a consequence, processes inside your “container” still have a view of the whole host system. Is the PID namespace inactive though? In fact, no:
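One way to check this, sketched with two hypothetical helpers, is to compare the namespace identifiers the kernel exposes as symbolic links under /proc/PID/ns.

```shell
# Print the identifier of one namespace of a process, for example
# "pid:[4026531836]". TYPE is one of mnt, uts, ipc, net, pid, user.
ns_id() {    # usage: ns_id TYPE [PID]
    readlink "/proc/${2:-self}/ns/$1"
}

# Succeed when two processes are in the same namespace of the given type.
same_ns() {  # usage: same_ns TYPE PID1 PID2
    a=$(ns_id "$1" "$2") && b=$(ns_id "$1" "$3") && [ "$a" = "$b" ]
}
```

From the host, as root, `same_ns pid 1 <PID of the bash started by unshare>` should fail, which means the two processes live in different PID namespaces. Inside the new bash, `echo $$` is another quick check.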

Weird. But let’s mount the /proc filesystem (remember we are in a separate Mount namespace):
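A sketch of that step (the function name is mine):

```shell
# Mount a fresh procfs over /proc. The new instance reflects the PID
# namespace of the process that mounts it, and thanks to the private
# Mount namespace the host's /proc is left alone.
mount_own_proc() {
    mount -t proc proc /proc || return 1
    ps aux
}
```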

Your container now has its own procfs and you actually see the bash process has PID 1 as expected at the beginning.
Now let’s move on:

What about the root filesystem? When are we truly entering our container? For that purpose, we need to use pivot_root(8), which in turn uses the pivot_root(2) syscall. In my opinion, the best way to understand what it does is to read its man page. Let’s try:

Well, it looks like someone did not follow his own advice and should have read the manual instead of thinking he was smarter than the others.

The following commands do the trick:
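Here is a minimal sketch of a sequence that satisfies the constraints listed in pivot_root(2): the new root must be a mount point, and the directory receiving the old root must be located under the new root. The path, the oldroot directory name and the function name are assumptions.

```shell
# Switch the root filesystem of the current Mount namespace.
enter_rootfs() {
    newroot=${1:-/test/containers/tutorial}
    # Bind-mount the directory onto itself to turn it into a mount point.
    mount --bind "$newroot" "$newroot" || return 1
    # Enter it only after the bind mount, so we stand on the new mount.
    cd "$newroot" || return 1
    mkdir -p oldroot
    # The current directory becomes /, the previous root ends up in /oldroot.
    pivot_root . oldroot || return 1
    cd /
}
```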

We are in! If you wonder what the difference is between chroot(1) and pivot_root(8), take a look at this.
What about processes? Our container has its own filesystem now so we won’t see any process until we mount /proc:

Perfect. What about mounts?

I still see all my host mounts. But again, we are in a separate namespace so there is no need to hesitate any further:

As you can see, we had to mount /proc again because umount had done its job of unmounting everything. That is much better now but we still have the old root since Resource [is] busy. A lazy unmount will help us:
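The whole cleanup can be sketched as follows, assuming the old root was put in /oldroot (the function name is hypothetical):

```shell
# Get rid of the host mounts that are still visible in the container.
drop_host_mounts() {
    # Unmount everything that can be; busy mounts such as the old root stay.
    umount -a
    # /proc was unmounted as well, so bring it back.
    mount -t proc proc /proc
    # Lazy unmount: detach /oldroot right away and let the kernel finish
    # the cleanup once it is no longer busy.
    umount -l /oldroot
}
```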

And that’s it, we are done with setting up namespaces for our container.

Playing with the network

Until now, we have not touched the Network namespace.

We are going to play with virtual ethernet devices (veth) using the iproute2 utilities. This is a fun part, I promise.
Let’s start by creating a pair of veths. We need our “container’s PID as seen from the host system”, which is the PID of the unshare process we first launched to jump into new namespaces.
You might find some of the following commands hard to understand if you are not familiar enough with the /sbin/ip command, so take your time and read the manual. From the host system:
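A sketch of the creation step. The naming scheme (h<PID> for the host end, c<PID> for the container end) and the function name are assumptions.

```shell
# Create a veth pair on the host. $1 is the PID of the unshare process
# as seen from the host (pidof unshare may help you find it).
create_veth_pair() {
    cpid=$1
    ip link add name "h$cpid" type veth peer name "c$cpid"
}
```

At this point both ends are still in the host's Network namespace, as `ip link` shows.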

Back in our container, we cannot see the corresponding veth because of the Network namespace:

We need to use /sbin/ip to move the veth into the Network namespace used by our container:
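As a sketch, with the same assumed c<PID> name for the container end (run from the host):

```shell
# Move the container end of the pair into the Network namespace of the
# process whose PID is given.
move_veth_to_container() {
    cpid=$1
    ip link set "c$cpid" netns "$cpid"
}
```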

Back in the container, we are now able to see our interface:

Then we need to configure each interface. First in the container:
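A sketch for the container side. The address 10.0.0.2/24 is an arbitrary choice of mine; pick a subnet that does not clash with your own network.

```shell
# Inside the container: bring up the loopback and the veth end, then
# give the veth end an address. $1 is the interface name shown by ip link.
configure_container_if() {
    dev=$1
    ip link set lo up
    ip link set "$dev" up
    ip addr add 10.0.0.2/24 dev "$dev"
}
```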

Same thing on the host system:
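And for the host side, a sketch with the same assumed naming scheme and 10.0.0.1/24 as an arbitrary address in the same subnet:

```shell
# On the host: bring up the host end of the pair and give it an address.
configure_host_if() {
    cpid=$1
    ip link set "h$cpid" up
    ip addr add 10.0.0.1/24 dev "h$cpid"
}
```

With these example addresses, `ping -c 3 10.0.0.2` from the host and `ping -c 3 10.0.0.1` from the container should both get replies.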

Here comes your reward (from the host system and then from the container):

If you wanted to reach the internet from your container, you could do it easily, for instance just by adding a default gateway in your container and setting up some NAT rules using iptables/nftables. But that is up to you!
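For the curious, here is a rough sketch with iptables. It assumes the host end of the veth pair has the address 10.0.0.1 in a 10.0.0.0/24 subnet (arbitrary values), and the function names are hypothetical. Depending on your firewall you may also have to allow forwarding in the FORWARD chain and configure DNS in the container.

```shell
# Inside the container: send everything through the host end of the pair.
container_default_route() {
    ip route add default via 10.0.0.1
}

# On the host: enable IPv4 forwarding and masquerade the container subnet.
host_nat() {
    sysctl -w net.ipv4.ip_forward=1
    iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
}
```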

Final step

The Alpine image comes with only sh and not bash. But we are still running the latter.
Does that mean I lied to you? I did not. Strictly speaking, although we are done building our basic handmade container, we are still in the host runtime. Do you remember that the whole container creation started from an unshare <options> bash followed by an exec bash? We are still running that bash process. This is why, in order to do the required hand-off, we need to do a final chroot:
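A sketch of that hand-off, assuming / is already the container's root (the function name is mine):

```shell
# Replace the host's bash with the container's own shell. exec means the
# current process is replaced, so there is no way back to the old bash.
hand_off() {
    exec chroot / /bin/sh
}
```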

And NOW, you can truly consider that you are in your container. Play with it, you deserved it.

Conclusion

We covered quite a lot of interesting things throughout this demonstration, but if you should remember only one thing, it is that Linux namespaces are a very elegant and powerful feature. Nonetheless, we only built a very basic container. Although namespaces are the building blocks of containers, a lot of other techniques come into play. Among them are cgroups, capabilities and MAC (using SELinux or AppArmor for instance). That is why building a secure container is a tough job. It is also convenient to have ways of quickly customizing your containers, migrating them, freezing them, etc. This explains why, even if lots of companies try to implement their own containers to fit their needs, more trusted and widespread solutions like LXC and Docker are often preferred.
