AlexKozhevnikov Dec 7 2021 at 11:33

Enhancing security of containers in Linux

Huawei corporate blog

1. Introduction. What do we call a container?

History textbooks already describe the present day as the time of another change in the industrial order, the fourth industrial revolution (Industry 4.0). The leading role in it belongs to information and IT systems. In an attempt to reduce the cost of IT infrastructure and to unify and accelerate the development of IT solutions, humanity first invented "clouds" to replace traditional data centers, and then containers to replace virtual machines.

By now the words "container", "containerized application", "Docker", "Kubernetes", etc., probably no longer surprise anyone, but it is still useful to briefly recall what a container is and how it differs from a virtual machine (Fig. 1).

Fig.1. Virtualization and containerization.

As the illustration suggests, a container can be thought of simply as a process (or process tree) running on a physical computer under a particular operating system and launched through a special shell (the container runtime). What, then, is the essence of containerization?

In the case of virtual machines, their isolation from each other and their access to hardware resources are provided by dedicated hardware features (virtualization extensions) and by hypervisor support for them. A virtual machine can influence the hypervisor and the physical hardware only through a very limited set of interfaces which, as long as they behave as designed, can affect only the virtual machine itself. In other words, all of the above serves to strictly isolate virtual machines, and everything that may happen inside them, from each other and from the system on which they run.

The main, and practically the only significant, goal of containerization is likewise to isolate processes from each other as far as possible and to eliminate any negative impact on the operating system in which they run. The terms "sandbox" and "jail" are sometimes used to describe this kind of isolation.

Clearly, containers look more vulnerable from a security point of view. What, then, are the advantages of containerization over virtualization? In fact, there are quite a few:

more flexible use of available resources (no need to reserve them in advance, as with virtual machines);
savings in resources (no need to spend them on a separate copy of the OS for each virtual machine);
no delays at startup (starting a process is almost instantaneous compared to the time needed to boot a virtual machine);
interaction between processes, even isolated ones, is much easier to implement when needed than interaction between virtual machines. This, by the way, is how the concept of microservices, which has recently become very popular, came about.

All of the above led to the very rapid development of container technologies, despite the recurring problems with the security of already deployed container cloud systems, their hacks, and data leaks. Accordingly, the work on strengthening container security is also continuing. This is what will be discussed further in this article.

2. How dangerous containers are

As the attentive reader may have guessed, if a container is just a process in the operating system environment, isolating it well is not so easy. Another unpleasant feature complicates life for the "guards of our prison": applications very often need some privileges in order to start or to run normally, so the usual practice is to run container processes as the root superuser. And that is one of the greatest dangers.

What else can be dangerous inside and around the container? Here is a list (obviously incomplete) of the main risks and threats (Fig. 2), which the literature also calls attack vectors:

Software vulnerabilities (1st place) – programs are written by people, and people always make mistakes. Sometimes these mistakes make malicious actions possible, such as gaining access to code or data. Errors are eventually detected and fixed, and new versions of the programs appear, but the versions inside container images are not always replaced in time. At present the only more or less effective countermeasure is constant scanning of container images. Vulnerability scanners are almost always commercial products and one of the main sources of revenue for companies working in computer security. It is not only software vulnerabilities inside the container that can lead to unfortunate results: sometimes ordinary application software that has nothing to do with containers, but runs side by side with them on the same physical machine, is enough for a break-in;
Configuration errors (2nd place) – the configuration parameters that specify how the programs in the container work can, if set incorrectly, easily open very undesirable holes in container isolation;
Image hacking/compromise – if an attacker manages to change or compromise the container image at any stage of deployment, they are guaranteed to get everything they could want. Many different cryptography-based methods are used to protect against this:
- Protected image stores;
- Cryptographic signatures of images;
- Various checksums, etc.
Disclosure of confidential data – very often, as a result of negligence, the confidential data that programs in the container need for normal operation (logins, passwords, etc.) are baked directly into the container image. The image then goes into a shared repository together with this data, and what should be secret no longer is. Of course, there are correct ways to pass this data in after the container has started, but the story repeats over and over, and, most sadly, there is still no way to fight it other than checks at the organizational level;
Network protocol vulnerabilities – network traffic is still the most easily accessible place for anyone who wants to break into something or get in somewhere. The only protection is to encrypt absolutely everything, use Transport Layer Security (TLS), and apply draconian traffic filtering. Nevertheless, the same mistakes are made again and again, and there are still no effective ways to deal with human negligence;
Jailbreak or root escape – a situation in which, through an unlucky alignment of the stars and of all the factors above, hostile code that was inside the container and ran with root privileges (which is quite often the case) gains access, with the same privileges, to the resources of the operating system and the physical machine on which the container is running. This is the most dangerous of all possible scenarios, because the entire server and everything on it at that moment, or even the entire cloud, is placed at the full disposal of the attacker. The consequences can be very sad and unpredictable, and the losses can run to billions of dollars.
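
For the "disclosure of confidential data" item above, here is a minimal Python sketch of passing a secret in at run time instead of baking it into the image. The function name load_secret, the upper-cased environment variable convention, and the /run/secrets directory are assumptions made for this illustration, not something the article prescribes.

```python
import os


def load_secret(name, secrets_dir="/run/secrets"):
    """Fetch a secret supplied at run time rather than stored in the image.

    Looks first at an environment variable (upper-cased name), then at a
    file in secrets_dir. Both conventions are assumptions of this sketch.
    """
    value = os.environ.get(name.upper())
    if value is not None:
        return value
    try:
        with open(os.path.join(secrets_dir, name)) as handle:
            return handle.read().strip()
    except OSError:
        # Fail loudly instead of falling back to a hard-coded default.
        raise RuntimeError(f"secret {name!r} was not provided at run time") from None
```

The point of the design is that the image itself contains no credentials, so publishing it to a shared repository discloses nothing.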

3. How to isolate containers in Linux. Standard tools

The dangers have been considered and the fears stirred up... Now let's talk about the means and methods of protection. There are five standard tools for isolating processes and, accordingly, containers in Linux: DAC (discretionary access control), chroot / pivot_root, namespaces, cgroups and capabilities. This is the first line of defense. But before talking about them, it is worth recalling how the OS kernel and user-space processes are organized and how they interact.

Any action that a user process needs and that involves access to a resource in Linux (or in any Unix-like OS) is performed by the kernel and requested through system calls (syscalls) from the process to the kernel. The number of syscalls has kept growing as Linux has developed and is by now quite large (it already exceeds three hundred). Usually all syscalls are wrapped in functions of the standard library shipped with the OS distribution, so they are simply invisible in typical application programming.
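
To make the library wrapping concrete, here is a minimal Python sketch (Python is chosen only for illustration; the article itself names no language). It assumes a Linux system whose C library can be loaded through ctypes, and the helper name libc_getpid is hypothetical.

```python
import ctypes
import os

# Load the C library that is already linked into the running interpreter.
libc = ctypes.CDLL(None, use_errno=True)


def libc_getpid():
    """Call the C library wrapper getpid(), which requests the PID from the kernel."""
    return libc.getpid()


# os.getpid() is Python's own wrapper around the same kernel service,
# so both paths are expected to report the same process ID.
print(libc_getpid() == os.getpid())
```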

Another part of the Linux philosophy (again, shared by the whole Unix family) is the paradigm that any resource or object in the system is a file. The access control system, DAC (discretionary access control), comes with files and is tied to them: every file is assigned mandatory attributes (the famous string like rwxr-xr-x) that define the read, write, and execute permissions for the file's owner, for the group the file belongs to, and for all other users. This access control system has always been, and remains, the main component of security and therefore of isolation in Linux. But, as is easy to guess, distinguishing only three classes of users and three actions is not always sufficient, especially for containerization.
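
The permission string and the owner/group/others logic can be sketched in a few lines of Python. This is a simplified model: it ignores root, capabilities and ACLs, and the function names are illustrative only.

```python
import stat


def permission_string(mode):
    """Render the nine permission bits of a file mode, e.g. 'rwxr-xr-x'."""
    flags = [
        (stat.S_IRUSR, "r"), (stat.S_IWUSR, "w"), (stat.S_IXUSR, "x"),
        (stat.S_IRGRP, "r"), (stat.S_IWGRP, "w"), (stat.S_IXGRP, "x"),
        (stat.S_IROTH, "r"), (stat.S_IWOTH, "w"), (stat.S_IXOTH, "x"),
    ]
    return "".join(char if mode & bit else "-" for bit, char in flags)


def may_write(mode, file_uid, file_gid, uid, gids):
    """Simplified DAC write check: exactly one class applies, owner first."""
    if uid == file_uid:
        return bool(mode & stat.S_IWUSR)
    if file_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


print(permission_string(0o755))
```

Note that the classes do not accumulate: an owner whose own write bit is cleared is refused even if "others" may write.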

And, as always, there are additional attributes that are more complicated and confusing: setuid and setgid, which affect how, and as which user, a process runs when the given file is executed. Their creators had good intentions: to make life easier for ordinary users by letting them extend their abilities without abusing root privileges. But, as always, this cuts both ways, and in inept hands the result can be exactly the proverbial and terrible root escape.
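
As a small illustration, the setuid and setgid bits can be read from a file mode with Python's stat module. The helper names and the notion of a "risky" file used here (a root-owned executable with setuid set) are assumptions of this sketch, not a complete audit rule.

```python
import stat


def special_bits(mode):
    """List which of the setuid/setgid bits are set in a file mode."""
    names = []
    if mode & stat.S_ISUID:
        names.append("setuid")
    if mode & stat.S_ISGID:
        names.append("setgid")
    return names


def is_risky(mode, owner_uid):
    """Flag root-owned executables that run with the owner's identity."""
    executable = bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    return owner_uid == 0 and bool(mode & stat.S_ISUID) and executable
```

In practice the mode and owner would come from os.stat(path).st_mode and os.stat(path).st_uid for each file in an image.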

The second cornerstone of isolation is the use of the chroot and pivot_root functions (commands), which change the highest visible level of the file system hierarchy (the root) for the current process and all its children. Anything above the newly set root in the hierarchy becomes invisible and therefore inaccessible to the process. The difference between the two is that chroot changes the root directory of the process, while pivot_root changes the root mount point of the file system without changing the current directory. pivot_root is the safer of the two: after chroot the old file system root is still mounted and the process may find a way to reach it again, whereas pivot_root allows the old root to be unmounted altogether. The price is that a "cd /" must be executed immediately after the pivot_root call.
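
The effect of chroot can be observed with a short Python sketch. It covers chroot only, since the Python standard library has no pivot_root wrapper. The function name is hypothetical, and the call needs the CAP_SYS_CHROOT privilege, so for an ordinary user the sketch simply reports the error instead of a directory listing.

```python
import os


def view_root_after_chroot(new_root):
    """Fork a child, chroot it into new_root, and return what it sees as '/'."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: confine ourselves, then report the contents of "/".
        try:
            os.close(read_fd)
            try:
                os.chroot(new_root)  # requires CAP_SYS_CHROOT
                os.chdir("/")        # do not keep a working directory outside the new root
                report = "\n".join(sorted(os.listdir("/")))
            except OSError as exc:
                report = "error: " + str(exc)
            os.write(write_fd, report.encode())
        finally:
            os._exit(0)  # never return into the parent's code path
    # Parent process: collect the child's report.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = pipe.read()
    os.waitpid(pid, 0)
    return report
```

The change is made in a forked child so that the calling process keeps its original root.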

Here we move smoothly to the third pillar of process isolation in Linux, the namespaces paradigm. The point is that, by placing a process in a kind of separate closed "room" (namespace) for some resource, we limit what the process can see and use to the bounds of that "room". The pivot_root call just mentioned does its work by changing the root mount point in the current mount namespace of the process.

Only a few key resources currently have their own namespaces. Historically, file system mount points were the first resource to receive their "rooms". The list of supported resources now includes:

UTS (Unix Timesharing System), in particular host and domain names;
Process IDs;
File system mount points;
Network interfaces;
User and group identifiers;
Interprocess communication (IPC) objects;
Control groups (cgroups).

This list will undoubtedly expand in the future. For example, the Linux community is actively discussing the introduction of namespaces for LSM (Linux Security Modules), a topic we return to a little further on.
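
On a running Linux system, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, each pointing at a target such as "pid:[4026531836]". Below is a minimal Python sketch that reads them; the function names are illustrative, and the sketch returns an empty result where /proc is unavailable.

```python
import os


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    ns_type, _, rest = target.partition(":")
    return ns_type, int(rest.strip("[]"))


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to the inode number of its namespace."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # no /proc, or no permission to inspect this process
    for name in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue
        result[name] = inode
    return result
```

Two processes share a "room" exactly when the corresponding inode numbers are equal, which makes this a quick way to check whether a process really runs in namespaces of its own.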

Everything mentioned so far concerns the file system and virtual, logical objects such as identifiers, names, flags, IPC queues, etc., but does not touch physical resources: CPU usage, memory usage, and so on. Access to these physical resources is governed by control groups, or cgroups. This is the fourth key component of access control in Linux.

Restricting access to physical resources makes it possible to prevent a process from deliberately and maliciously seizing all the physical resources of the server, for example by endlessly cloning itself (the notorious fork bomb). In containerized systems this aspect is extremely important, because nobody knows in advance what is inside a container or what it may start doing.
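
A process can find out which control group it is in from /proc/self/cgroup, where each line has the form "hierarchy-ID:controller-list:path". The Python sketch below parses such a line and, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the pids controller enabled, reads the pids.max limit that caps the number of processes (the defense against a fork bomb). The function names are illustrative.

```python
def parse_cgroup_line(line):
    """Parse 'hierarchy-ID:controller-list:path' from /proc/<pid>/cgroup."""
    hierarchy, controllers, path = line.rstrip("\n").split(":", 2)
    return int(hierarchy), [c for c in controllers.split(",") if c], path


def read_pids_max(cgroup_path, root="/sys/fs/cgroup"):
    """Return the process-count limit of a cgroup: an int, 'max', or None.

    'max' means no limit is set; None means the file could not be read
    (no such group, or the pids controller is not enabled for it).
    """
    try:
        with open(f"{root}{cgroup_path}/pids.max") as handle:
            value = handle.read().strip()
    except OSError:
        return None
    return value if value == "max" else int(value)
```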

The fifth and last pillar of security and process isolation in Linux is capabilities, i.e. individual privileges that can be assigned to a particular process. The idea arose because even basic operations in the system often required privileges that were originally granted only to the root superuser, which left the average user able to do practically nothing useful. To allow as many operations as possible without full root privileges, those privileges were broken down into many small parts, the capabilities.

The list is now very long, and it keeps changing and growing as new capabilities are needed all the time. Nevertheless, the mechanism does its job of reducing dependence on root better and better. Whether it improves security or, given the current confusion in its design, ultimately leads to the opposite result is a provocative and philosophical question, and even posing it this way should be regarded as purely the personal opinion of the author.
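
The capabilities a process currently holds are exposed as a hexadecimal bit mask in the CapEff line of /proc/<pid>/status. Here is a minimal Python sketch that decodes such a mask. The name table is deliberately partial (the bit numbers shown are the kernel's, the selection is ours), and unknown bits are reported generically as "cap_N".

```python
# Partial table of capability bit numbers; the full list is much longer.
CAP_NAMES = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    18: "CAP_SYS_CHROOT",
    21: "CAP_SYS_ADMIN",
    40: "CAP_CHECKPOINT_RESTORE",
}


def decode_caps(hex_mask):
    """Turn a hexadecimal capability mask into a list of capability names."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def effective_caps(status_path="/proc/self/status"):
    """Decode the CapEff line of a /proc status file; empty list if unavailable."""
    try:
        with open(status_path) as handle:
            for line in handle:
                if line.startswith("CapEff:"):
                    return decode_caps(line.split()[1])
    except OSError:
        pass
    return []
```

For an unprivileged process effective_caps() normally returns an empty list, while a process running as root in a default container typically shows a reduced but non-empty set.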

4. Additional tools. LSM and overlayfs

Everything discussed above belongs to DAC (discretionary access control). However, as practice shows, discretionary control is often not enough to fully protect against all possible threats. This led to the introduction of LSM (Linux Security Modules), a mechanism of additional security modules in the Linux kernel. Although optional, when enabled it implements another principle of access control, MAC (Mandatory Access Control). The difference between the approaches is very simple but very significant: with DAC, everything that is not prohibited is allowed; with MAC, everything that is not allowed is forbidden.

There is already a considerable number of LSM modules; they are constantly evolving and new ones appear quite often, so we will consider only four that matter most for working with containers: SELinux, AppArmor, Integrity/IMA (Integrity Measurement Architecture) and SecComp (strictly speaking a separate kernel facility rather than an LSM, but one that fits naturally into the same discussion). All of them share a common approach to implementing their functions (Fig. 3). There are objects (resources), and access to them is governed by a set of customizable rules. Access is granted only if these rules explicitly permit it. By default, everything is denied.
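
The two defaults contrasted in this section can be shown with a toy Python model. This is only an illustration of the principle, with made-up labels and function names; real LSM policies are far richer than a set of triples.

```python
def mac_allows(allow_rules, subject, obj, action):
    """Default deny: access is granted only if a rule explicitly permits it."""
    return (subject, obj, action) in allow_rules


def default_allow(deny_rules, subject, obj, action):
    """Default allow: access is refused only if it was explicitly prohibited."""
    return (subject, obj, action) not in deny_rules


# Hypothetical labels, used purely for illustration.
rules = {("web_server", "web_content", "read")}

print(mac_allows(rules, "web_server", "web_content", "read"))   # explicitly permitted
print(mac_allows(rules, "web_server", "shadow_file", "read"))   # never mentioned, so denied
```

With the same single rule, the default-deny model refuses every request nobody thought about, which is exactly what makes it attractive for containers whose behavior is not known in advance.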

SELinux controls access to files based on labels stored in the extended attributes assigned to each file in the system. AppArmor performs the same kind of control using the absolute path to the file instead of labels. The task of the Integrity module is slightly different: it guards the integrity of system code by computing file checksums and comparing them with reference values, which are also stored in extended attributes in cryptographically protected form. On a mismatch, access to the file is blocked.
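
The integrity check described above boils down to "hash the file, compare with a trusted reference, refuse access on mismatch". The Python sketch below models that idea in user space. It is not how IMA is implemented (the real check happens inside the kernel, with the reference kept in an extended attribute), and the function names and the choice of SHA-256 are assumptions of this example.

```python
import hashlib


def file_digest(path, algorithm="sha256"):
    """Compute the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_if_intact(path, reference_hex):
    """Open the file only if its digest matches the trusted reference value."""
    if file_digest(path) != reference_hex:
        raise PermissionError(f"integrity mismatch for {path}")
    return open(path, "rb")
```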

Fig.3. General principle of LSM operation.

The SecComp module has a completely different application. It filters system calls (syscalls) based on a set of rules and blocks processes that attempt to violate those rules.
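
The rule-based filtering can be illustrated with a small Python model. The profile layout below follows the JSON seccomp profile format used by Docker (a default action plus per-syscall rules), but the profile contents are invented for the example, conditions on syscall arguments are ignored, and the evaluation is a toy: the real filter runs inside the kernel.

```python
import json

# A hypothetical, deliberately tiny profile: deny by default, allow three calls.
PROFILE_JSON = """
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
  ]
}
"""


def action_for(profile, syscall_name):
    """Return the action of the first rule naming the syscall, else the default."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


profile = json.loads(PROFILE_JSON)
print(action_for(profile, "read"), action_for(profile, "mount"))
```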

Another highly recommended container security measure is mounting the file system read-only. This avoids a lot of trouble if something gets out of hand. In practice, however, some applications need to write as well as read, for example to write logs for later analysis. There is an option that makes it possible to write logs and other data without touching the main file system: two-level file systems with overlays, such as overlayfs, fuse-overlayfs, etc.

The general principle of such file systems (Fig. 4) is that there are two levels, that is, two file systems: an upper one and a lower one. The user sees a file from the upper level if it exists there (whether only there or at both levels), and from the lower level if it exists only there. The lower level is never altered and is read-only. The upper level is writable, but all changes remain within the upper level and are not propagated further. Moreover, newly written files can be kept purely in RAM and never reach the disk at all. There can be several lower levels; in that case the file is taken from the highest of them that contains it.

Fig.4. Principle of overlayfs (from Docker documentation).
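
The lookup rule of a layered file system can be modeled in a few lines of Python, with dictionaries standing in for file systems. This is a toy model of the principle only (class and method names are ours, and file deletion is left out), not of the actual overlayfs implementation.

```python
class Overlay:
    """Toy model: one writable upper layer over read-only lower layers."""

    def __init__(self, *lowers):
        self.lowers = lowers  # lower layers, highest first; never modified
        self.upper = {}       # all writes land here

    def read(self, path):
        """Return the file from the upper layer, else from the highest lower layer."""
        if path in self.upper:
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        """Store the data in the upper layer only."""
        self.upper[path] = data


image = {"/etc/app.conf": "v1"}          # read-only image contents
fs = Overlay(image)
fs.write("/etc/app.conf", "v2")          # shadows the lower copy
print(fs.read("/etc/app.conf"), image["/etc/app.conf"])
```

Discarding the upper dictionary returns the container to the pristine image, which is the property that makes the read-only lower layer safe to share.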

5. Unresolved issues

All of the above has been working for quite some time, but it still has many unsolved or half-solved problems, the main one being, of course, the continued need for root privileges to operate containers. Containers declared as "rootless" (non-privileged) are only a half-hearted solution, because root privileges are still present in one form or another. Truly getting rid of root, and strengthening the security of container systems in general, requires solving a number of problems, some small and some rather large, among which we can single out the following (the list is incomplete):

virtual network connections require root privileges to open a connection; the workarounds in use, such as user-mode network emulation (slirp4netns), are unsatisfactorily slow;
LSM modules (Integrity, AppArmor) do not support separate security policies for individual processes or containers (namespaces need to be implemented for them). In the current implementation, any security policy is common to all processes within the OS;
checkpoint/restore functionality without root privileges is not yet fully implemented. Although the kernel already has the necessary capability (CAP_CHECKPOINT_RESTORE), the mechanisms needed to support it in container runtimes are not yet available.

6. "Fat partisans in the forests." Kind of a conclusion

The true current state of affairs can be described in the words of Dan Walsh, one of the gurus of container security: "Docker is about running random crap from the internet as root on your host". This is, of course, somewhat exaggerated, but there is quite a lot of truth in it. The problems listed above have unfortunately not yet been addressed by the Linux community.

For example, proposals for supporting separate security policies in LSM modules have been ready for quite some time and are awaiting approval for inclusion in the Linux upstream, but things are moving very slowly, much more slowly than we would like. This situation certainly raises concerns.

At the same time, there is a clear need to strengthen the security of containerization. So far the vacuum has been filled by various commercial software products that alleviate the problem without providing a definitive solution. Some goodwill is required from the community to finally solve this issue, and, with the hope that this will happen soon, let me finish the article ...

7. Sources and list of references

1. Linux kernel and user documentation, URL:https://www.kernel.org/doc/

2. Liz Rice, "Container security", O'Reilly Media, 2020

3. SELinux project documentation, URL:http://www.selinuxproject.org

4. AppArmor project documentation, URL:https://apparmor.net

5. Integrity/IMA project documentation, URL:http://linux-ima.sourceforge.net

6. Docker project documentation, URL:https://www.docker.com

7. Wayne Jansen, Timothy Grance, "NIST Special Publication 800-144. Guidelines on Security and Privacy in Public Cloud Computing", NIST, 2011.

8. Ramaswamy Chandramouli, "Security Assurance Requirements for Linux Application Container Deployments", NISTIR 8176, NIST, 2017.

9. Murugiah Souppaya, John Morello, Karen Scarfone, "NIST Special Publication 800-190. Application Container Security Guide.", NIST, 2017.
