Cgroups and Namespaces - RH442
You use cgroups to control how much of a given key resource (CPU, memory, network, and
disk I/O) a process or set of processes can access or use.
Cgroups are a key component of containers because there are often multiple processes running in a
container that you need to control together.
Containers make it possible to quickly deploy and run each piece of software in its own segregated
environment, without the need to build individual virtual machines (VMs).
The key feature of namespaces is that they isolate processes from each other.
On a server where you are running many different services, isolating each service and its associated
processes from other services means that there is a smaller blast radius for changes, as well as a smaller
attack surface for security concerns. Above all, isolating services fits the microservices
architectural style described by Martin Fowler.
Using containers during the development process gives the developer an isolated environment that
looks and feels like a complete VM. It’s not a VM, though – it’s a
process running on a server somewhere. If the developer starts two containers, there are
two processes running on a single server somewhere – but they are isolated from each
other.
Types of Namespaces
Within the Linux kernel, there are different types of namespaces.
· A network namespace has an independent network stack: its own private routing table, set of IP
addresses, socket listing, connection tracking table, firewall, and
other network-related resources.
· An interprocess communication (IPC) namespace has its own IPC resources, for example POSIX
message queues.
· A UNIX Time-Sharing (UTS) namespace allows a single system to appear to have different host
and domain names to different processes.
· A process ID (PID) namespace gives processes their own set of process IDs, independent of the
PIDs in other namespaces.
Consider, for example, a parent PID namespace containing two child PID namespaces. The child
processes with PID 2 and PID 3 in the parent namespace also belong to their own PID namespaces,
in which their PID is 1. From within a child namespace, the PID 1 process cannot see anything
outside it. For example, PID 1 in both child namespaces cannot see PID 4 in the parent namespace.
This provides isolation between (in this case) processes within different namespaces.
Creating a Namespace
With all that theory under our belts, let’s cement our understanding by actually creating a new
namespace. The Linux unshare command is a good place to start. The manual page indicates that it does
exactly what we want:
NAME
       unshare - run program in new namespaces
I’m currently logged in as a regular user, svk, which has its own user ID, group, and so on,
but not root privileges:
svk $ id
uid=1000(svk) gid=1000(svk) groups=1000(svk)
context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
Now I run the following unshare command to create a new namespace with its own user
and PID namespaces. I map the root user to the new namespace (in other words, I have
root privilege within the new namespace), mount a new proc filesystem, and fork my
process (in this case, bash) in the newly created namespace.
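Here is the command, with its flags reconstructed from the lsns output shown later in this section:

svk $ unshare --user --map-root-user --fork --pid --mount-proc bash
root #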
The ps -ef command shows there are two processes running – bash and the ps command itself – and the
id command confirms that I’m root in the new namespace (which is also indicated by the changed
command prompt):
root # ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 14:46 pts/0 00:00:00 bash
root 15 1 0 14:46 pts/0 00:00:00 ps -ef
root # id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
The crucial thing to notice is that I can see only the two processes in my namespace, not
any other processes running on the system. I am completely isolated within my own namespace.
Though I can’t see other processes from within the namespace, with the lsns
(list namespaces) command I can list all available namespaces, and display information
about them, from the perspective of the parent namespace (outside the new namespace).
The output shows three namespaces – of types user, mnt, and pid – which correspond to
the arguments on the unshare command I ran above. From this external perspective, each namespace is
running as user svk, not root, whereas inside the namespace processes run
as root, with access to all of the expected resources. (The output below is truncated to the fields
of interest.)
... unshare --user --map-root-user --fork --pid --mount-proc bash 1000 svk
Cgroups
Where namespaces isolate what a process can see, control groups (cgroups) limit what it can use.
Cgroups provide the following features:
· Resource limits – You can configure a cgroup to limit how much of a particular resource
(memory or CPU, for example) a process can use.
· Prioritization – You can control how much of a resource (CPU, disk, or network) a process can
use compared to processes in another cgroup when there is resource contention.
· Accounting – Resource limits are monitored and reported at the cgroup level.
· Control – You can change the status (frozen, stopped, or restarted) of all processes
in a cgroup with a single command.
When you allocate a particular percentage of available system resources to a cgroup (in this case
cgroup-1), the remaining percentage is available to other cgroups (and individual processes) on
the system.
Creating a cgroup
The following commands create a v1 cgroup (as you can tell by the pathname format) called foo,
and set its memory limit to 50,000,000 bytes (50 MB).
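A minimal sketch of those commands, assuming the standard v1 cgroupfs interface:

root # mkdir -p /sys/fs/cgroup/memory/foo
root # echo 50000000 > /sys/fs/cgroup/memory/foo/memory.limit_in_bytes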
Now I can assign a process to the cgroup, thus imposing the cgroup’s memory limit on it.
I’ve written a shell script called test.sh, which prints "cgroup testing tool" to the screen and then
waits, doing nothing. For my purposes, it is a process that continues to run until I stop it.
I start test.sh in the background, and its PID is reported as 2428. The script produces its
output, and then I assign the process to the cgroup by writing its PID into the cgroup file
/sys/fs/cgroup/memory/foo/cgroup.procs, as sketched below.
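A sketch of that session, again assuming the v1 interface:

root # ./test.sh &
[1] 2428
cgroup testing tool
root # echo 2428 > /sys/fs/cgroup/memory/foo/cgroup.procs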
By default, the kernel terminates a process (via the OOM killer) when it exceeds a memory limit
defined by its cgroup.
Conclusion
Namespaces and cgroups are the building blocks for containers and modern applications. Having an
understanding of how they work is important as we refactor applications to
more modern architectures.
Namespaces provide isolation of system resources, and cgroups allow for fine-grained control and
enforcement of limits for those resources.
Containers are not the only way that you can use namespaces and cgroups. Namespaces
and cgroup interfaces are built into the Linux kernel, which means that other applications
can use them to provide separation and resource constraints.
Understanding cgroups
Control groups (or cgroups) are a feature of the Linux kernel by which groups of processes
can be monitored and have their resources limited. For example, if you don’t want a
Google Chrome process (or its many child processes) to exceed a gigabyte of RAM or
30% of total CPU usage, cgroups let you do that. They are an extremely powerful tool
by which you can guarantee limits on performance, but understanding how they work
and how to use them can be a little daunting.
There are 12 different types of cgroups available in Linux. Each of them corresponds to
a resource that processes use, such as the memory cgroup.
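You can see which cgroup subsystems your kernel supports by reading /proc/cgroups (output
abbreviated here; the hierarchy and count numbers are illustrative and vary by system):

$ cat /proc/cgroups
#subsys_name    hierarchy   num_cgroups   enabled
cpuset          11          1             1
cpu             3           64            1
cpuacct         3           64            1
memory          8           104           1
devices         6           64            1
...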
Before we actually dive into cgroups, there are a few bases we should cover. Cgroups
specifically deal with processes, which are a fundamental piece of any operating system.
A process is just a running instance of a program. When you want to run a program, the
Linux kernel loads the executable into memory, assigns a process ID to it, allocates various
resources for it, and begins to run it. Throughout the lifetime of a process, the kernel
keeps track of its state and resource usage information.
You can see all the processes running on your system and some of their resource
statistics with the top command (I prefer htop).
So what do I actually mean when I say a process needs resources? To give some examples: in order
to store and use data, a process needs access to memory. In order to execute its instructions, a
process needs available time to run on the CPU. A process may also need access to devices, such as
saving files to disk or taking in keyboard input. Each of these resources is abstracted by the Linux
kernel. Those abstractions are called ‘subsystems’.
One such example of a subsystem is the virtual memory management system. This is the
layer between the memory management unit (hardware) and the rest of the kernel. When
a running program allocates memory for a new data structure (through something like malloc),
there is functionality in the kernel to resize the heap of that process. All processes
on a system share a single pool of memory that can be allocated for their use. And as anyone
who runs Electron applications has seen, a single process can seriously hog your memory.
In a tool like htop, take a look at the MEM% column. It reflects the percentage of total system
memory that each process is using.
CPU time is similarly a shared resource. We’re going to use the CPU cgroup and its associated
subsystem, the scheduler, as an example to show how cgroups work.
As fast as processors are, there is a limit to how much work they can do. For each core that your CPU
has, only one process can be running at a time. However, try entering the
command ps -e | wc -l. The printed number is the number of live processes on your
system. Since that number is likely in the hundreds, how can they all run when your
CPU may only have four cores?
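For example (the count shown here is illustrative and will differ on your system):

$ ps -e | wc -l
248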
The simple but important thing to note here is that the CPU% you see in a program
like htop refers to the percentage of time that the process is running on the CPU, as
opposed to paused. In other words, the more a process is scheduled to run, the higher its
utilization of the CPU, and therefore the less CPU that other processes can use.
That hopefully makes enough sense, but you may be wondering how the scheduler decides which
processes to schedule on and off. The default scheduler in most Linux distros is called the Completely
Fair Scheduler. If the name is any indication, every process gets an equal amount of time to run on the
CPU. This is generally true, but a few words are missing: in truth, every process within the
same cgroup gets an equal amount of time to run on the CPU.
As the name cgroup implies, we’re controlling groups of processes. Every process within
a CPU cgroup enjoys equal time of the CPU, and it is the group that a process belongs to that
defines the amount of processor time available to it.
Try taking a look at the /proc/[pid]/cgroup file for any process to see which cgroups it’s in.
The output described here is from an EC2 instance that I'm running in the AWS cloud; the cgroup
file shown is for the bash shell process.
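A representative sketch of that file’s contents on a v1 system (hierarchy numbers and paths vary
by system; the lines shown are illustrative):

$ cat /proc/$$/cgroup
12:devices:/
11:memory:/
...
4:cpu,cpuacct:/
...
1:name=systemd:/user.slice/user-1000.slice/session-1.scope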
Each line refers to a different cgroup that the process belongs to. Looking just at
cpu,cpuacct (the cpu and cpuacct controllers combined), we can see that the process is in the / or
“root” cgroup. This just means that it’s in the system-wide cgroup that all processes belong to.
Cgroups are organized in a hierarchy, so cgroups can have child cgroups. For this reason, cgroups are
named by their parent-to-child hierarchical path. For example,
/cgroupA/cgroupB means there’s a cgroup called cgroupB which is a child of cgroupA,
which is in turn a child of the root cgroup. The limits of parent cgroups apply to their children all
the way down.
The semantics for setting these limits are pretty intuitive. Two values must
be set: a period and a quota, each in units of microseconds. The period defines an
amount of time before the pool of available CPU ticks refreshes. The quota
is the number of CPU ticks available in that pool. This is best explained by example:
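In outline, the example’s configuration looks like this:

/        period: 1000000   quota: -1       (many processes)
/Foobar  period: 1000000   quota: 500000   (three processes)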
Here, there are three processes in a cgroup called /Foobar, and there are also many processes in
the / cgroup. As we see in that root cgroup, a quota of -1 is a special value indicating an
unlimited quota. In other words, no limit.
Now let’s think about the /Foobar cgroup. A period of 1000000 microseconds (one second) has been
specified, along with a quota of 500000 microseconds (half a second). Every time a process in the
cgroup spends a microsecond running on the CPU, the quota is decremented, and every process in the
cgroup shares this quota. As an example, let’s say all three processes run at the same time (each on
its own core) starting at the beginning of a period. After around 0.17 seconds (the 500000-microsecond
quota split across three cores is about 167000 microseconds each), the processes in the cgroup will
have spent their entire quota. At that point the scheduler keeps all three of those processes paused
until the period is over, when the quota is refreshed.
The purpose of explaining how the CPU cgroup works is to show the nature of what
cgroups are. They are not the mechanism by which resources are limited, but
rather just a glorified way of collecting arguments for those resource limits. It’s up to the individual
subsystems to read those arguments and take them into consideration. The
same goes for every other cgroup implementation.
Using cgroups
All cgroup functionality is accessed through the cgroup filesystem. This is a virtual
filesystem with special files that act as the interface for creating, removing, or altering cgroups.
You can find where the various cgroup filesystems (one for each cgroup type) are mounted on your
system using mount | grep cgroup. They’re typically in /sys/fs/cgroup.
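For example (output abbreviated; the exact mounts vary by distribution):

$ mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
...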
Continuing to use the CPU cgroup as an example, let’s take a look at the hierarchy and constraints for the
CPU cgroup. Within the CPU directory there are a bunch of files that are used for configuring the
constraints of processes in the cgroup. Since cgroups exist in hierarchies, you can also find directories
that correspond to child cgroups. Making a new
child cgroup is as simple as using mkdir. All the constraint files will be created for you!
Creating a child cpu cgroup using mkdir and writing a process to the tasks file
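A sketch of what that looks like (the cgroup name mygroup is hypothetical):

# mkdir /sys/fs/cgroup/cpu/mygroup
# ls /sys/fs/cgroup/cpu/mygroup
cgroup.procs  cpu.cfs_period_us  cpu.cfs_quota_us  cpu.shares  cpu.stat  tasks  ...
# echo $$ > /sys/fs/cgroup/cpu/mygroup/tasks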
When you’re in a child CPU cgroup there are three main files of interest: tasks, cpu.cfs_period_us,
and cpu.cfs_quota_us.
tasks - the list of PIDs that are part of the cgroup. Appending a PID to this file will add that process
(all threads in the process) to the cgroup. When you start a process, it is automatically added to the
root CPU cgroup.
cpu.cfs_period_us - a single integer value representing the period of the cgroup in microseconds. The
root cpu cgroup defaults to 100000 (or a tenth of a second).
cpu.cfs_quota_us - a single integer value representing the quota of the cgroup in microseconds. The root
cpu cgroup defaults to -1, meaning no limit.
Setting the above constraint files is also as easy as writing values to them:
Setting the period and quota of a cgroup by writing to the period and quota files
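Continuing the hypothetical mygroup example from above, with the period and quota values used
in the /Foobar example earlier:

# echo 1000000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us
# echo 500000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us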
Most container runtimes ‘contain’ using cgroups as one of their main mechanisms for isolation.
If you use Docker, you can set cgroup constraints as flags when running containers, for example:
docker run --cpu-period=100000 --cpu-quota=12345 -it fedora bash. This will
handle setting up the cgroup, but interestingly, all it’s doing is writing to the files for you.
Setting the period and quota of a cgroup by passing flags for a docker container
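You can verify this in the cgroup filesystem; the path below assumes Docker’s cgroupfs driver, and
<container-id> is a placeholder for the container’s ID:

# cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us
12345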
While I didn’t cover every different kind of cgroup, or go deep into how the CPU
cgroup is implemented, I hope this gives you the necessary understanding of one of my favorite
features of the kernel!
Decluttering process management with ps or systemd
Control groups offer a user-friendly look at your process hierarchy.
Control groups, at a basic level, organize processes based on their parent process, arranging them
into a hierarchy.
Here, we will look at two ways to improve on the standard ps command that most people
use. I know many people pair ps with grep, and like pecan pie and Noah's Mill, I fully
endorse this practice. I also encourage you to check into the following two methods, as
they can make understanding process hierarchies a bit easier.
Method one
The first method is a standard ps command with the process tree enabled. When you run
this command:
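A representative invocation with the process tree (forest) output enabled is shown below; the exact
flags are an assumption, built around the a and x options whose man-page descriptions follow:

$ ps xawf -eo pid,user,cgroup,args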
x Lift the BSD-style "must have a tty" restriction, which is imposed upon the set of all
processes when some BSD-style (without "-") options are used or when the ps personality
setting is BSD-like. The set of processes selected in this manner is in
addition to the set of processes selected by other means. An alternate description
is that this option causes ps to list all processes owned by you (same EUID as ps), or
to list all processes when used together with the a option.
a Lift the BSD-style "only yourself" restriction, which is imposed upon the set of all
processes when some BSD-style (without "-") options are used or when the ps
personality setting is BSD-like. The set of processes selected in this manner is in
addition to the set of processes selected by other means. An alternate description
is that this option causes ps to list all processes with a terminal (tty), or to list all
processes when used together with the x option.
Method two
The next option we will look at is a systemd utility. This method is an even better way, in
my humble opinion, to see which job belongs to which parent process or owner. When
you type this:
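The systemd utility that presents the control group hierarchy as a tree is systemd-cgls; assuming
that is the command meant here, its output looks something like this (abbreviated, with
illustrative PIDs):

$ systemd-cgls
Control group /:
-.slice
├─user.slice
│ └─user-1000.slice
│   └─session-1.scope
│     ├─1280 sshd: user [priv]
│     └─1342 bash
└─system.slice
  └─sshd.service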
Control group concepts
Control groups provide a mechanism for aggregating/partitioning sets of tasks, and all
their future children, into hierarchical groups with specialized behaviour.
Definitions:
A cgroup associates a set of tasks with a set of parameters for one or more subsystems.
A subsystem is a module that makes use of the task grouping facilities provided by cgroups
to treat groups of tasks in particular ways. A subsystem is typically a “resource controller”
that schedules a resource or applies per-cgroup limits, but it may be anything that wants
to act on a group of processes, e.g. a virtualization subsystem.
A hierarchy is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of
the cgroups in the hierarchy, and a set of subsystems; each subsystem has system-specific state attached
to each cgroup in the hierarchy. Each hierarchy has an
instance of the cgroup virtual filesystem associated with it.
At any one time there may be multiple active hierarchies of task cgroups. Each hierarchy is a partition of
all tasks in the system.
User-level code may create and destroy cgroups by name in an instance of the cgroup virtual file system,
specify and query to which cgroup a task is assigned, and list the task PIDs assigned to a cgroup. Those
creations and assignments only affect the hierarchy associated with that instance of the cgroup file
system.
On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems
hook into the generic cgroup support to provide new attributes for cgroups,
such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets
(see CPUSETS) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each
cgroup.