LFS258 Kubernetes Fundamentals
Before you begin, please take a moment to familiarize yourself with the course
navigation:
The navigation buttons at the bottom of the page will help you move forward or
backward within the course, one page at a time. You can also use the Right Arrow
keystroke to go to the next page, and the Left Arrow keystroke to go to the
previous page. On touchscreen devices, such as phones and tablets, you can also
navigate by swiping.
To exit the course, you can use the Exit button at the top-right of the page or the X
keystroke.
The Home button at the top right of the page, or the H keystroke, will take you to
the first page of the course.
The drop-down menu (Table of Contents) at the bottom left helps you navigate to
any page in the course. It will always display the title of the current page.
The breadcrumbs at the top of the page indicate your location within the course
(chapter/page).
Numerous resources are available throughout the course, and can be accessed by
clicking hyperlinks that will open Internet pages in a new window.
Where available, you can use the video player functionalities to start, pause, stop,
restart the video, control the volume, turn closed captions on or off, and control the
screen size of the video. Closed captions are enabled for video narrations only.
In order to make it easier to distinguish the various types of content in the course, we use
the color coding and formats below:
Bold: names of programs or services (or used for emphasis)
Light blue: designates hyperlinks
Dark blue: text typed at the command line and system output at the
command line.
This course is entirely self-paced; there is no fixed schedule for going through the material.
You can go through the course at your own pace, and you'll always be returned to exactly
where you left off when you come back to start a new session. However, we still suggest
you avoid long breaks in between periods of work, as learning will be faster and content
retention improved.
You have unlimited access to this course for 12 months from the date you registered, even
after you have completed the course.
The chapters in the course have been designed to build on one another. It is probably best
to work through them in sequence; if you skip or only skim some chapters quickly, you
may find there are topics being discussed you have not been exposed to yet. But this is all
self-paced, and you can always go back, so you can thread your own path through the
material.
The lab exercises were written using Google Compute Engine (GCE) nodes. They have
been written to be vendor-agnostic, so they could run on AWS, local hardware, or inside of
virtual machines, to give you the most flexibility and options.
Each node has 3 vCPUs and 7.5G of memory, running Ubuntu 18.04. Smaller nodes
should work, but you should expect a slow response. Other operating system images
are also possible, but there may be a slight difference in some command outputs.
Using GCE requires setting up an account, and will incur expenses if using nodes of the
size suggested. The Getting Started pages can be viewed online.
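As a rough illustration only, a node close to the suggested size could be created from the
gcloud command line; the instance name below is a placeholder, n1-standard-2 (2 vCPUs, 7.5G)
is slightly smaller than the size described above, and image family names change over time,
so check the current GCE documentation before relying on these values:
$ gcloud compute instances create lfs258-node-1 \
    --machine-type n1-standard-2 \
    --image-family ubuntu-1804-lts --image-project ubuntu-os-cloud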
Amazon Web Service (AWS) is another provider of cloud-based nodes, and requires an
account; you will incur expenses for nodes of the suggested size. You can find
videos and information about how to launch a Linux virtual machine on the AWS website.
Virtual machines such as KVM, VirtualBox, or VMWare can also be used for the lab
systems. Putting the VMs on a private network can make troubleshooting easier. As of
Kubernetes v1.16.1, the minimum (as in barely works) size for VirtualBox is 3vCPU/4G
memory/5G minimal OS for master and 1vCPU/2G memory/5G minimal OS for worker
node.
Finally, using bare-metal nodes, with access to the Internet, will also work for the lab
exercises.
At the end of each chapter, you will also find a series of knowledge check questions.
These questions, just like the labs, were designed with one main goal in mind: to help you
better comprehend the course content and reinforce what you have learned. It is important
to point out that the labs and knowledge check questions are not graded. We would
like to emphasize as well that you will not be required to take a final exam to complete
this course.
Resources for this course can be found online. Making updates to this course takes time.
Therefore, if there are any changes in between updates, you can always access course
updates, as well as the course resources online:
Go to the LFS258 Course Resource webpage.
The user ID is LFtraining and the password is Penguin2014.
One great way to interact with peers taking this course is via the Class Forum on
linux.com. This board can be used in the following ways:
To introduce yourself to other peers taking this course.
To discuss concepts, tools, and technologies presented in this course, or related to
the topics discussed in the course material.
To ask questions about course content.
To share resources and ideas related to Kubernetes and related technologies.
The Class Forum will be reviewed periodically by The Linux Foundation staff, but it is
primarily a community resource, not an 'ask the instructor' service.
Note: If you are using the QuickStart course player engine, and are inactive for more than
30 minutes while signed-in, you will automatically be logged out.
If you enrolled in the course using the Linux Foundation cart and you are logged
out for inactivity, close the course player and the QuickStart tab/window, then log
back into your Linux Foundation portal and re-launch the course from there. Do not
use the login window presented by QuickStart, as you will not be able to log back
in from that page.
If you are using a corporate branded QuickStart portal, you can log back in using
the same URL and credentials that you normally use to access the course.
We use a single sign-on service to launch the course once users are on their
'myportal' page. Do not attempt to change your password on the QuickStart course
player engine, as this will break your single sign-on.
For any issues with your username or password, visit The Linux Foundation ID website.
For any course content-related questions, please use the course forum on Linux.com (see
the details in the Class Forum Guidelines section).
If you need assistance beyond what is available above, please email us at:
[email protected] and provide your Linux Foundation ID username and a
description of your concern.
You can download a list of most frequently asked support questions by clicking on the
Document button below or by using the D keystroke.
If you are a Linux administrator or software developer starting to work with containers and
wondering how to manage them in production, LFS258: Kubernetes Fundamentals is the
course for you.
In this course, you will learn the key principles that will put you on the journey to managing
containerized applications in production.
To make the most of this course, you will need the following:
A good understanding of Linux.
Familiarity with the command line.
Familiarity with package managers.
Familiarity with Git and GitHub.
Access to a Linux server or Linux desktop/laptop.
VirtualBox on your machine, or access to a public cloud.
The material produced by The Linux Foundation is distribution-flexible. This means that
technical explanations, labs and procedures should work on most modern distributions,
and we do not promote products sold by any specific vendor (although we may mention
them for specific scenarios).
In practice, most of our material is written with the three main Linux distribution families in
mind:
Red Hat/Fedora
OpenSUSE/SUSE
Debian.
Distributions used by our students tend to be one of these three alternatives, or a product
that is derived from them.
You should ask yourself several questions when choosing a new distribution:
Has your employer already standardized?
Do you want to learn more?
Do you want to certify?
While there are many reasons that may force you to focus on one Linux distribution versus
another, we encourage you to gain experience on all of them. You will quickly notice that
technical differences are mainly about package management systems, software versions
and file locations. Once you get a grasp of those differences, it becomes relatively painless
to switch from one Linux distribution to another.
Some tools and utilities have vendor-supplied front-ends, especially for more particular or
complex reporting. The steps included in the text may need to be modified to run on a
different platform.
Fedora is the community distribution that forms the basis of Red Hat Enterprise Linux,
CentOS, Scientific Linux and Oracle Linux. Fedora contains significantly more software
than Red Hat's enterprise version. One reason for this is that a diverse community is
involved in building Fedora; it is not just one company.
The Fedora community produces new versions every six months or so. For this reason, we
decided to standardize the Red Hat/Fedora part of the course material on the latest
version of CentOS 7, which provides much longer release cycles. Once installed, CentOS
is also virtually identical to Red Hat Enterprise Linux (RHEL), which is the most popular
Linux distribution in enterprise environments:
Current material is based upon the latest release of Red Hat Enterprise Linux
(RHEL) - 7.x at the time of publication, and should work well with later versions
Supports x86, x86-64, Itanium, PowerPC and IBM System Z
RPM-based, uses yum (or dnf) to install and update
Long release cycle; targets enterprise server environments
Upstream for CentOS, Scientific Linux and Oracle Linux.
Note: CentOS is used for demos and labs because it is available at no cost.
The relationship between OpenSUSE and SUSE Linux Enterprise Server is similar to the
one we just described between Fedora and Red Hat Enterprise Linux. In this case,
however, we decided to use OpenSUSE as the reference distribution for the OpenSUSE
family, due to the difficulty of obtaining a free version of SUSE Linux Enterprise Server.
The two products are extremely similar and material that covers OpenSUSE can typically
be applied to SUSE Linux Enterprise Server with no problem:
Current material is based upon the latest release of OpenSUSE, and should work
well with later versions
RPM-based, uses zypper to install and update
YaST available for administration purposes
x86 and x86-64
Upstream for SUSE Linux Enterprise Server (SLES)
Note: OpenSUSE is used for demos and labs because it is available at no cost.
The Debian distribution is the upstream for several other distributions, including Ubuntu,
Linux Mint, and others. Debian is a pure open source project, and focuses on a key
aspect: stability. It also provides the largest and most complete software repository to its
users.
Ubuntu aims at providing a good compromise between long term stability and ease of use.
Since Ubuntu gets most of its packages from Debian's unstable branch, Ubuntu also has
access to a very large software repository. For those reasons, we decided to use Ubuntu
as the reference Debian-based distribution for our lab exercises.
Current trends and changes to the distributions have reduced some of the differences
between them.
systemd (system startup and service management)
systemd is used by the most common distributions, replacing the SysVinit and
Upstart packages. It also replaces the service and chkconfig commands.
journald (manages system logs)
journald is a systemd service that collects and stores logging data. It creates and
maintains structured, indexed journals based on logging information that is
received from a variety of sources. Depending on the distribution, text-based
system logs may be replaced.
firewalld (firewall management daemon)
firewalld provides a dynamically managed firewall with support for network/firewall
zones to define the trust level of network connections or interfaces. It has support
for IPv4, IPv6 firewall settings and for Ethernet bridges. This replaces the iptables
configurations.
ip (network display and configuration tool)
The ip program is part of the iproute2 package, and is designed to be a
replacement for the ifconfig command from the older net-tools package. The ip
command will show or manipulate routing, network devices, interfaces, and tunnels.
Since these utilities are common across distributions, the course content and lab
information will use these utilities.
If your choice of distribution or release does not support these commands, please translate
accordingly.
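For example, the following pairs show a few of the older commands and their newer
counterparts; the service name nginx is only illustrative:
$ systemctl restart nginx      # instead of: service nginx restart
$ systemctl enable nginx       # instead of: chkconfig nginx on
$ ip addr show                 # instead of: ifconfig
$ ip route                     # instead of: route -n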
The following documents may be of some assistance translating older commands to their
systemd counterparts:
SysVinit Cheat Sheet
Debian Cheat Sheet
openSUSE Cheat Sheet
The Linux Foundation partners with the world's leading developers and companies to solve
the hardest technology problems and accelerate open technology development and
commercial adoption. The Linux Foundation makes it its mission to provide experience
and expertise to any initiative working to solve complex problems through open source
collaboration, providing the tools to scale open source projects: security best practices,
governance, operations and ecosystem development, training and certification, licensing,
and promotion.
Linux is the world's largest and most pervasive open source software project in history.
The Linux Foundation is home to the Linux creator Linus Torvalds and lead maintainer
Greg Kroah-Hartman, and provides a neutral home where Linux kernel development can
be protected and accelerated for years to come. The success of Linux has catalyzed
growth in the open source community, demonstrating the commercial efficacy of open
source and inspiring countless new projects across all industries and levels of the
technology stack.
The Linux Foundation is creating the largest shared technology investment in history. The
Linux Foundation is the umbrella for many critical open source projects that power
corporations today, spanning all industry sectors:
Big data and analytics: ODPi, R Consortium
Networking: OpenDaylight, OPNFV
Embedded: Dronecode, Zephyr
Web tools: JS Foundation, Node.js
Cloud computing: Cloud Foundry, Cloud Native Computing Foundation, Open
Container Initiative
Automotive: Automotive Grade Linux
Security: The Core Infrastructure Initiative
Blockchain: Hyperledger
And many more.
The Linux Foundation produces technical events around the world. Whether it is to provide
an open forum for development of the next kernel release, to bring together developers to
solve problems in a real-time environment, to host work groups and community groups for
active discussions, to connect end users and kernel developers in order to grow Linux and
open source software use in the enterprise or to encourage collaboration among the entire
community, we know that our conferences provide an atmosphere that is unmatched in
their ability to further the platform. The Linux Foundation hosts an increasing number of
events each year, including:
Open Source Summit North America, Europe, Japan, and China
MesosCon North America, Europe, and China
Embedded Linux Conference/OpenIoT Summit North America and Europe
Open Source Leadership Summit
Automotive Linux Summit
Apache: Big Data North America & ApacheCon
KVM Forum
Linux Storage Filesystem and Memory Management Summit
Vault
Open Networking Summit.
The Linux Foundation's training is for the community, by the community, and
features instructors and content straight from the leaders of the Linux developer
community.
The Linux Foundation offers several types of training:
Classroom
Online
On-site
Events-based.
Attendees receive Linux and open source software training that is distribution-flexible,
technically advanced and created with the actual leaders of the Linux and open source
software development community themselves. The Linux Foundation courses give
attendees the broad, foundational knowledge and networking needed to thrive in their
careers today. With either online or in-person training, The Linux Foundation classes can
keep you or your developers ahead of the curve on open source essentials.
The Linux Foundation certifications give you a way to differentiate yourself in a job market
that's hungry for your skills. We've taken a new, innovative approach to open source
certification that allows you to showcase your skills in a way that other peers will respect
and employers will trust:
You can take your certification exam from any computer, anywhere, at any time
The certification exams are performance-based
The exams are distribution-flexible
The exams are up-to-date, testing knowledge and skills that actually matter in today's IT
environment.
The Linux Foundation and its collaborative projects currently offer several different
certifications:
Linux Foundation Certified Sysadmin (LFCS)
Linux Foundation Certified Engineer (LFCE)
Certified Kubernetes Administrator (CKA)
Certified Kubernetes Application Developer (CKAD)
Cloud Foundry Certified Developer (CFCD)
Certified Hyperledger Sawtooth Administrator (CHSA)
Certified Hyperledger Fabric Administrator (CHFA).
The Linux Foundation has two separate training divisions: Course Delivery and
Certification. These two divisions are separated by a firewall.
The curriculum development and maintenance division of The Linux Foundation Training
department has no direct role in developing, administering, or grading certification exams.
Enforcing this self-imposed firewall ensures that independent organizations and
companies can develop third party training material, geared towards helping test takers
pass their certification exams.
Furthermore, it ensures that there are no secret "tips" (or secrets in general) that one
needs to be familiar with in order to succeed.
It also permits The Linux Foundation to develop a very robust set of courses that do far
more than teach the test, but rather equip attendees with a broad knowledge of the many
areas they may be required to master to have a successful career in open source system
administration.
1.26. Open Source Guides for the Enterprise
The Linux Foundation in partnership with the TODO Group developed a set of guides
leveraging best practices for:
Running an open source program office, or
Starting an open source project in an existing organization.
The Open Source Guides For the Enterprise are available for free online.
Deploying containers and using Kubernetes may require a change in the development and
the system administration approach to deploying applications. In a traditional environment,
an application (such as a web server) would be a monolithic application placed on a
dedicated server. As the web traffic increases, the application would be tuned, and
perhaps moved to bigger and bigger hardware. After a couple of years, a lot of
customization may have been done in order to meet the current web traffic needs.
Instead of using a large server, Kubernetes approaches the same issue by deploying a
large number of small web servers, or microservices. The server and client sides of the
application expect that there are many possible agents available to respond to a request. It
is also important that clients expect the server processes to die and be replaced, leading
to a transient server deployment. Instead of a large Apache web server with many httpd
daemons responding to page requests, there would be many nginx servers, each
responding.
The transient nature of smaller services also allows for decoupling. Each aspect of the
traditional application is replaced with a dedicated, but transient, microservice or agent. To
join these agents, or their replacements together, we use services and API calls. A service
ties traffic from one agent to another (for example, a frontend web server to a backend
database) and handles new IP or other information, should either one die and be replaced.
Communication to, as well as internally, between components is API call-driven, which
allows for flexibility. Configuration information is stored in a JSON format, but is most often
written in YAML. Kubernetes agents convert the YAML to JSON prior to persistence to the
database.
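For example, assuming a Deployment named nginx already exists in the cluster, you can
retrieve the same object in either representation:
$ kubectl get deployment nginx -o yaml
$ kubectl get deployment nginx -o json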
Kubernetes is written in Go, a portable language which is something of a hybrid of
C++, Python, and Java. Some claim it incorporates the best (while some claim
the worst) parts of each.
2.6. Challenges
Containers have seen a huge rejuvenation in the past few years. They provide a great way
to package, ship, and run applications - that is the Docker motto.
The developer experience has been boosted tremendously thanks to containers.
Containers, and Docker specifically, have empowered developers with ease of building
container images, simplicity of sharing images via Docker registries, and providing a
powerful user experience to manage containers.
However, managing containers at scale and architecting a distributed application based on
microservices' principles is still challenging.
You first need a continuous integration pipeline to build your container images, test them,
and verify them. Then, you need a cluster of machines acting as your base infrastructure
on which to run your containers. You also need a system to launch your containers, and
watch over them when things fail and self-heal. You must be able to perform rolling
updates and rollbacks, and eventually tear down the resource when no longer needed.
All of these actions require flexible, scalable, and easy-to-use network and storage. As
containers are launched on any worker node, the network must join the resource to other
containers, while still keeping the traffic secure from others. We also need a storage
structure which provides and keeps or recycles storage in a seamless manner.
One of the biggest challenges to adoption is the applications themselves, inside the
container. They need to be written, or re-written, to be truly transient. If you were to deploy
Chaos Monkey, which terminates containers at random, would your customers notice?
Built on open source and easily extensible, Kubernetes is definitely a solution to manage
containerized applications.
There are other solutions as well, including:
Docker Swarm is the Docker Inc. solution. It has been re-architected recently and
is based on SwarmKit. It is embedded with the Docker Engine.
Apache Mesos is a data center scheduler, which can run containers through the
use of frameworks. Marathon is the framework that lets you orchestrate containers.
Nomad from HashiCorp, the makers of Vagrant and Consul, is another solution for
managing containerized applications. Nomad schedules tasks defined in Jobs. It
has a Docker driver which lets you define a running container as a task.
Rancher is a container orchestrator-agnostic system, which provides a single pane
of glass interface for managing applications. It supports Mesos, Swarm, and
Kubernetes.
What primarily distinguishes Kubernetes from other systems is its heritage. Kubernetes is
inspired by Borg - the internal system used by Google to manage its applications (e.g.
Gmail, Apps, GCE).
With Google pouring the valuable lessons they learned from writing and operating Borg for
over 15 years into Kubernetes, this makes Kubernetes a safe choice when having to decide
on what system to use to manage containers. While a powerful tool, part of the current
growth in Kubernetes is making it easier to work with and handle workloads not found in a
Google data center.
To learn more about the ideas behind Kubernetes, you can read the Large-scale cluster
management at Google with Borg paper.
Borg has inspired current data center systems, as well as the underlying technologies
used in container runtime today. Google contributed cgroups to the Linux kernel in 2007;
it limits the resources used by collection of processes. Both cgroups and Linux
namespaces are at the heart of containers today, including Docker.
Mesos was inspired by discussions with Google when Borg was still a secret. Indeed,
Mesos builds a multi-level scheduler, which aims to better use a data center cluster.
The Cloud Foundry Foundation embraces the 12 factor application principles. These
principles provide great guidance to build web applications that can scale easily, can be
deployed in the cloud, and whose build is automated. Borg and Kubernetes address these
principles as well.
To quickly demystify Kubernetes, let's have a look at the Kubernetes Architecture graphic,
which shows a high-level architecture diagram of the system components. Not all
components are shown. Every node running a container would have kubelet and kube-
proxy, for example.
Kubernetes Architecture
In its simplest form, Kubernetes is made of a central manager (aka master) and some
worker nodes, once called minions (we will see in a follow-on chapter how you can actually
run everything on a single node for testing purposes). The manager runs an API server, a
scheduler, various controllers and a storage system to keep the state of the cluster,
container settings, and the networking configuration.
Kubernetes exposes an API via the API server. You can communicate with the API using a
local client called kubectl or you can write your own client and use curl commands. The
kube-scheduler is forwarded the requests for running containers coming to the API and
finds a suitable node to run that container. Each node in the cluster runs two
processes: a kubelet and kube-proxy. The kubelet receives requests to run the
containers, manages any necessary resources and watches over them on the local node.
kubelet interacts with the local container engine, which is Docker by default, but could be
rkt or cri-o, which is growing in popularity.
The kube-proxy creates and manages networking rules to expose the container on the
network.
Using an API-based communication scheme allows for non-Linux worker nodes and
containers. Support for Windows Server 2019 was graduated to Stable with the 1.14
release. Only Linux nodes can be master on a cluster.
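As a brief sketch of what this API-driven communication looks like, the same information
can be requested with kubectl or, after starting a local authenticating proxy, with curl
against the REST endpoints (the port shown is simply the kubectl proxy default):
$ kubectl get pods --namespace=default
$ kubectl proxy --port=8001 &
$ curl http://localhost:8001/api/v1/namespaces/default/pods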
2.11. Innovation
Since its inception, Kubernetes has seen a terrific pace of innovation and adoption. The
community of developers, users, testers, and advocates is continuously growing every
day. The software is also moving at an extremely fast pace, which is even putting GitHub
to the test:
Given to open source in June 2014
Thousands of contributors
More than 83k commits
More than 28k on Slack
Currently, on a three month major release cycle
Constant changes.
2.12. User Community
Kubernetes is being adopted at a very rapid pace. To learn more, you should check out
the case studies presented on the Kubernetes website. eBay, Box, Pearson and
Wikimedia have all shared their stories.
Pokémon GO, one of the fastest-growing mobile games, also runs on Google Kubernetes
Engine (GKE), the managed Kubernetes service from Google Cloud Platform (GCP).
Kubernetes Users
(by Kubernetes, retrieved from the Kubernetes website)
2.13. Tools
There are several tools you can use to work with Kubernetes. As the project has grown,
new tools are made available, while old ones are being deprecated. Minikube is a very
simple tool meant to run inside of VirtualBox. If you have limited resources and do not
want much hassle, it is the easiest way to get up and running. We mention it for those who
are not interested in a typical production environment, but want to use the tool.
Our labs will focus on the use of kubeadm and kubectl, which are very powerful and
complex tools.
There are third-party tools as well, such as Helm, an easy tool for using
Kubernetes charts, and Kompose to translate Docker Compose files into Kubernetes
objects. Expect these tools to change often.
If you want to go beyond this general introduction to Kubernetes, here are a few things we
recommend:
Read the Borg paper
Listen to John Wilkes talking about Borg and Kubernetes
Add the Kubernetes community hangout to your calendar, and attend at least once.
Join the community on Slack and go in the #kubernetes-users channel.
Check out the very active Stack Overflow community.
This chapter is about Kubernetes installation and configuration. We are going to review a
few installation mechanisms that you can use to create your own Kubernetes cluster.
To get started without having to dive right away into installing and configuring a cluster,
there are two main choices.
One way is to use Google Kubernetes Engine (GKE), a cloud service from Google Cloud
Platform that lets you request a Kubernetes cluster running the latest stable version.
Another easy way to get started is to use Minikube. It is a single binary which deploys a
cluster into Oracle VirtualBox, and runs on several operating systems. While Minikube is
local and single node, it will give you a learning, testing, and development
platform. MicroK8s is a newer tool developed by Canonical, aimed at easy, appliance-like
installations; it currently runs on Ubuntu 16.04 and 18.04.
To be able to use the Kubernetes cluster, you will need to have installed the Kubernetes
command line, called kubectl. This runs locally on your machine and targets the API
server endpoint. It allows you to create, manage, and delete all Kubernetes resources (e.g.
Pods, Deployments, Services). It is a powerful CLI that we will use throughout the rest of
this course. So, you should become familiar with it.
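A few representative commands give a feel for how it is used; the Deployment and Pod
names here are hypothetical:
$ kubectl get nodes
$ kubectl get pods --all-namespaces
$ kubectl describe deployment nginx
$ kubectl delete pod nginx-7db9fccd9b-2xkpl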
We will use kubeadm, the community-suggested tool from the Kubernetes project, that
makes installing Kubernetes easy and avoids vendor-specific installers. Getting a cluster
running involves two commands: kubeadm init, that you run on one Master node, and
then, kubeadm join, that you run on your Worker or redundant master nodes, and your
cluster bootstraps itself. The flexibility of these tools allows Kubernetes to be deployed in a
number of places. Lab exercises use this method.
We will also talk about other installation mechanisms, such as kubespray, or kops, which is
another way to create a Kubernetes cluster on AWS. We will note you can create your
systemd unit file in a very traditional way. Additionally, you can use a container image
called hyperkube, which contains all the key Kubernetes binaries, so that you can run a
Kubernetes cluster by just starting a few containers on your nodes.
To configure and manage your cluster, you will probably use the kubectl command. You
can use RESTful calls or the Go language, as well.
Enterprise Linux distributions have the various Kubernetes utilities and other files available
in their repositories. For example, on RHEL 7/CentOS 7, you would find kubectl in the
kubernetes-client package.
You can (if needed) download the code from Github, and go through the usual steps to
compile and install kubectl.
This command line will use ~/.kube/config as a configuration file. This contains all the
Kubernetes endpoints that you might use. If you examine it, you will see cluster definitions
(i.e. IP endpoints), credentials, and contexts.
A context is a combination of a cluster and user credentials. You can pass these
parameters on the command line, or switch the shell between contexts with a command,
as in:
$ kubectl config use-context foobar
This is handy when going from a local environment to a cluster in the cloud, or from one
cluster to another, such as from development to production.
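You can inspect the file and the available contexts with the kubectl config subcommands;
foobar is again just an example context name:
$ kubectl config view
$ kubectl config get-contexts
$ kubectl config current-context
$ kubectl config use-context foobar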
3.6. Using Google Kubernetes Engine (GKE)
Google takes every Kubernetes release through rigorous testing and makes it available via
its GKE service. To be able to use GKE, you will need the following:
An account on Google Cloud.
A method of payment for the services you will use.
The gcloud command line client.
There is extensive documentation to help you get it installed. Pick your favorite method of
installation and set it up. For more details, you can visit the Installing Cloud SDK web
page.
You will then be able to follow the GKE quickstart guide and you will be ready to create
your first Kubernetes cluster:
$ gcloud container clusters create linuxfoundation
$ gcloud container clusters list
$ kubectl get nodes
By installing gcloud, you will have automatically installed kubectl. In the commands
above, we created the cluster, listed it, and then, listed the nodes of the cluster with
kubectl.
Once you are done, do not forget to delete your cluster, otherwise you will keep on
getting charged for it:
$ gcloud container clusters delete linuxfoundation
Once you become familiar with Kubernetes using Minikube, you may want to start building
a real cluster. Currently, the most straightforward method is to use kubeadm, which
appeared in Kubernetes v1.4.0, and can be used to bootstrap a cluster quickly. As the
community has focused on kubeadm, it has moved from beta to stable and added high
availability with v1.15.0.
The Kubernetes website provides documentation on how to use kubeadm to create a
cluster.
Package repositories are available for Ubuntu 16.04 and CentOS 7.1. Packages have not
yet been made available for Ubuntu 18.04, but the Ubuntu 16.04 packages work there, as
you will see in the lab exercises.
To join other nodes to the cluster, you will need at least one token and an SHA256 hash.
This information is returned by the command kubeadm init. Once the master has
initialized, you would apply a network plugin. Main steps:
Run kubeadm init on the head node
Create a network for IP-per-Pod criteria
Run kubeadm join --token token head-node-IP on worker nodes.
You can also create the network with kubectl by using a resource manifest of the network.
For example, to use the Weave network, you would do the following:
$ kubectl create -f https://git.io/weave-kube
Once all the steps are completed, workers and other master nodes joined, you will have a
functional multi-node Kubernetes cluster, and you will be able to use kubectl to interact
with it.
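As a sketch of the join step, with placeholder values standing in for the master IP, token,
and CA certificate hash printed by kubeadm init:
$ sudo kubeadm join 10.128.0.3:6443 \
    --token 27eee4.6e66ff60318da929 \
    --discovery-token-ca-cert-hash sha256:<hash-from-kubeadm-init-output>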
Prior to initializing the Kubernetes cluster, the network must be considered and IP conflicts
avoided. There are several Pod networking choices, in varying levels of development and
feature set:
Calico
A flat Layer 3 network which communicates without IP encapsulation, used in
production with software such as Kubernetes, OpenShift, Docker, Mesos and
OpenStack. Viewed as a simple and flexible networking model, it scales well for
large environments. Another network option, Canal, also part of this project, allows
for integration with Flannel. Calico also allows for the implementation of network policies.
Flannel
A Layer 3 IPv4 network between the nodes of a cluster. Developed by CoreOS, it
has a long history with Kubernetes. Focused on traffic between hosts, not how
containers configure local networking, it can use one of several backend
mechanisms, such as VXLAN. A flanneld agent on each node allocates subnet
leases for the host. While it can be configured after deployment, it is much easier
prior to any Pods being added.
Kube-router
Feature-filled single binary which claims to "do it all". The project is in the alpha
stage, but promises to offer a distributed load balancer, firewall, and router
purposely built for Kubernetes.
Romana
Another project aimed at network and security automation for cloud native
applications. It targets large clusters, offering IPAM-aware topology and integration with
kops clusters.
Weave Net
Typically used as an add-on for a CNI-enabled Kubernetes cluster.
Many of the projects will mention the Container Network Interface (CNI), which is a CNCF
project. Several container runtimes currently use CNI. As a standard to handle deployment
management and cleanup of network resources, it will become more popular.
Since Kubernetes is, after all, like any other application that you install on a server
(whether physical or virtual), all the configuration management systems (e.g. Chef,
Puppet, Ansible, Terraform) can be used. Various recipes are available on the Internet.
Here are just a few examples of installation tools that you can use:
kubespray
kubespray is now in the Kubernetes incubator. It is an advanced Ansible playbook
which allows you to set up a Kubernetes cluster on various operating systems and
use different network providers. It was once known as kargo.
kops
kops lets you create a Kubernetes cluster on AWS via a single command line. Support
is also in beta for GCE and alpha for VMware.
kube-aws
kube-aws is a command line tool that makes use of the AWS Cloud Formation to
provision a Kubernetes cluster on AWS.
kubicorn
kubicorn is a tool which leverages the use of kubeadm to build a cluster. It claims
to have no dependency on DNS, runs on several operating systems, and uses
snapshots to capture a cluster and move it.
The best way to learn how to install Kubernetes using step-by-step manual commands is
to examine the Kelsey Hightower walkthrough.
To begin the installation process, you should start experimenting with a single-node
deployment. This single-node will run all the Kubernetes components (e.g. API server,
controller, scheduler, kubelet, and kube-proxy). You can do this with Minikube for example.
Once you want to deploy on a cluster of servers (physical or virtual), you will have many
choices to make, just like with any other distributed system:
Which provider should I use? A public or private cloud? Physical or virtual?
Which operating system should I use? Kubernetes runs on most operating systems
(e.g. Debian, Ubuntu, CentOS, etc.), plus on container-optimized OSes (e.g.
CoreOS, Atomic).
Which networking solution should I use? Do I need an overlay?
Where should I run my etcd cluster?
Can I configure Highly Available (HA) head nodes?
To learn more about how to choose the best options, you can read the Picking the Right
Solution article.
With systemd becoming the dominant init system on Linux, your Kubernetes components
will end up being run as systemd unit files in most cases. Or, they will be run via a kubelet
running on the head node (i.e. kubeadm).
3.13.a. Systemd Unit File for Kubernetes
In any of these configurations, you will run some of the components as a standard system
daemon. As an example, below is a sample systemd unit file to run the controller-
manager. Using kubeadm will create a system daemon for kubelet, while the rest will be
deployed as containers.
- name: kube-controller-manager.service
  command: start
  content: |
    [Unit]
    Description=Kubernetes Controller Manager
    Documentation=https://github.com/kubernetes/kubernetes
    Requires=kube-apiserver.service
    After=kube-apiserver.service
    [Service]
    ExecStartPre=/usr/bin/curl -L -o /opt/bin/kube-controller-manager -z /opt/bin/kube-controller-manager https://storage.googleapis.com/kubernetes-release/release/v1.7.6/bin/linux/amd64/kube-controller-manager
    ExecStartPre=/usr/bin/chmod +x /opt/bin/kube-controller-manager
    ExecStart=/opt/bin/kube-controller-manager \
      --service-account-private-key-file=/opt/bin/kube-serviceaccount.key \
      --root-ca-file=/var/run/kubernetes/apiserver.crt \
      --master=127.0.0.1:8080 \
    ...
3.13.b. Systemd Unit Files for Kubernetes (Cont.)
This is by no means a perfect unit file. It downloads the controller binary from the
published release of Kubernetes, and sets a few flags to run.
As you dive deeper in the configuration of each component, you will become more familiar
not only with its configuration, but also with the various existing options, including those for
authentication, authorization, HA, container runtime, etc. Expect them to change.
For example, the API server is highly configurable. The Kubernetes documentation
provides more details about the kube-apiserver.
While you can run all the components as regular system daemons in unit files, you can
also run the API server, the scheduler, and the controller-manager as containers. This is
what kubeadm does.
Similar to Minikube, there is a handy all-in-one binary named hyperkube, which is
available as a container image (e.g.
gcr.io/google_containers/hyperkube:v1.10.12). The image is hosted by Google, so
you may need to add a new repository so Docker knows where to pull the image from. You
can find the current release of software here:
https://console.cloud.google.com/gcr/images/google-containers/GLOBAL/hyperkube.
This method of installation consists of running a kubelet as a system daemon and
configuring it to read in manifests that specify how to run the other components (i.e. the
API server, the scheduler, etcd, the controller). In these manifests, the hyperkube image
is used. The kubelet will watch over them and make sure they get restarted if they die.
To get a feel for this, you can simply download the hyperkube image and run a container
to get help usage:
$ docker run --rm gcr.io/google_containers/hyperkube:v1.15.5 /hyperkube apiserver --help
$ docker run --rm gcr.io/google_containers/hyperkube:v1.15.5 /hyperkube scheduler --help
$ docker run --rm gcr.io/google_containers/hyperkube:v1.15.5 /hyperkube controller-manager --help
This is also a very good way to start learning the various configuration flags.
The list of binary releases is available on GitHub. Together with gcloud, Minikube, and
kubeadm, these cover several scenarios to get started with Kubernetes.
Kubernetes can also be compiled from source relatively quickly. You can clone the
repository from GitHub, and then use the Makefile to build the binaries. You can build
them natively on your platform if you have a Golang environment properly setup, or via
Docker containers if you are on a Docker host.
To build natively with Golang, first install Golang. Download files and directions can be
found at https://golang.org/doc/install.
Once Golang is working, you can clone the kubernetes repository, around 500MB in
size. Change into the directory and use make:
$ cd $GOPATH
$ git clone https://github.com/kubernetes/kubernetes
$ cd kubernetes
$ make
On a Docker host, clone the repository anywhere you want and use the make quick-release
command. The build will be done in Docker containers.
The _output/bin directory will contain the newly built binaries.
The Kubernetes master runs various server and manager processes for the cluster.
Among the components of the master node are the kube-apiserver, the kube-scheduler,
and the etcd database. As the software has matured, new components have been created
to handle dedicated needs, such as the cloud-controller-manager; it handles tasks once
handled by the kube-controller-manager to interact with other tools, such as Rancher or
DigitalOcean for third-party cluster management and reporting.
There are several add-ons which have become essential to a typical production cluster,
such as DNS services. Others are third-party solutions where Kubernetes has not yet
developed a local component, such as cluster-level logging and resource monitoring.
4.7. kube-scheduler
The kube-scheduler uses an algorithm to determine which node will host a Pod of
containers. The scheduler will try to view available resources (such as volumes) to bind,
and then try and retry to deploy the Pod based on availability and success.
There are several ways you can affect the algorithm, or a custom scheduler could be used
instead. You can also bind a Pod to a particular node, though the Pod may remain in a
pending state due to other settings.
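For example, a Pod can bypass the scheduler entirely and be bound to a named node by
setting nodeName in its spec; the node name below is hypothetical, and the Pod will remain
Pending if that node cannot run it:
apiVersion: v1
kind: Pod
metadata:
  name: pinned-pod
spec:
  nodeName: worker-node-1    # hypothetical node name; the scheduler is skipped
  containers:
  - name: web
    image: nginx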
One of the first settings referenced is whether the Pod can be deployed within the current quota
restrictions. If so, then the taints and tolerations, and labels of the Pods are used along
with those of the nodes to determine the proper placement.
The details of the scheduler can be found on GitHub.
The state of the cluster, networking, and other persistent information is kept in an etcd
database, or, more accurately, a b+tree key-value store. Rather than finding and
changing an entry, values are always appended to the end. Previous copies of the data
are then marked for future removal by a compaction process. It works with curl and other
HTTP libraries, and provides reliable watch queries.
Simultaneous requests to update a value all travel via the kube-apiserver, which then
passes along the request to etcd in a series. The first request would update the database.
The second request would no longer have the same version number, in which case the
kube-apiserver would reply with an error 409 to the requester. There is no logic past that
response on the server side, meaning the client needs to expect this and act upon the
denial to update.
There is a master database along with possible followers. They communicate
with each other on an ongoing basis to determine which will be master, and to elect a
replacement in the event of failure. While very fast and potentially durable, there have been
some hiccups with new tools, such as kubeadm, and features like whole cluster upgrades.
The kube-controller-manager is a core control loop daemon which interacts with the
kube-apiserver to determine the state of the cluster. If the state does not match, the
manager will contact the necessary controller to match the desired state. There are
several controllers in use, such as endpoints, namespace, and replication. The full list has
expanded as Kubernetes has matured.
Remaining in beta in v1.16, the cloud-controller-manager (ccm) interacts with agents
outside of the cloud. It handles tasks once handled by kube-controller-manager. This
allows faster changes without altering the core Kubernetes control process. Each kubelet
must use the --cloud-provider=external setting passed to the binary. You can
also develop your own CCM, which can be deployed as a daemonset as an in-tree
deployment or as a free-standing out-of-tree installation. The cloud-controller-manager is
an optional agent which takes a few steps to enable. You can find more details about the
cloud controller manager online.
All worker nodes run the kubelet and kube-proxy, as well as the container engine, such
as Docker or rkt. Other management daemons are deployed to watch these agents or
provide services not yet included with Kubernetes.
The kubelet interacts with the underlying Docker Engine also installed on all the nodes,
and makes sure that the containers that need to run are actually running. The kube-proxy
is in charge of managing the network connectivity to the containers. It does so through the
use of iptables entries. It also has a userspace mode, in which it monitors Services
and Endpoints and uses a random port to proxy traffic, as well as an ipvs mode.
You can also run an alternative to the Docker engine: cri-o or rkt. To learn how you can do
that, you should check the documentation. In future releases, it is highly likely that
Kubernetes will support additional container runtime engines.
Supervisord is a lightweight process monitor used in traditional Linux environments to
monitor and notify about other processes. In the cluster, this daemon monitors both the
kubelet and docker processes. It will try to restart them if they fail, and log events. While
not part of a standard installation, some may add this monitor for added reporting.
Kubernetes does not have cluster-wide logging yet. Instead, another CNCF project is
used, called Fluentd. When implemented, it provides a unified logging layer for the cluster,
which filters, buffers, and routes messages.
4.11. kubelet
The kubelet agent is the heavy lifter for changes and configuration on worker nodes. It
accepts the API calls for Pod specifications (a PodSpec is a JSON or YAML file that
describes a pod). It will work to configure the local node until the specification has been
met.
Should a Pod require access to storage, Secrets or ConfigMaps, the kubelet will ensure
access or creation. It also sends back status to the kube-apiserver for eventual
persistence.
Uses PodSpec
Mounts volumes to Pod
Downloads secrets
Passes request to local container engine
Reports status of Pods and node to cluster.
Kubelet calls other components such as the Topology Manager, which uses hints from
other components to configure topology-aware resource assignments such as for CPU
and hardware accelerators. As an alpha feature, it is not enabled by default.
4.12.a. Services
With every object and agent decoupled we need a flexible and scalable agent which
connects resources together and will reconnect, should something die and a replacement
is spawned. Each Service is a microservice handling a particular bit of traffic, such as a
single NodePort or a LoadBalancer to distribute inbound requests among many Pods.
A Service also handles access policies for inbound requests, useful for resource control,
as well as for security.
Connect Pods together
Expose Pods to Internet
Decouple settings
Define Pod access policy.
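A minimal sketch of a NodePort Service, assuming Pods labeled app: webapp already exist;
the names and port numbers are illustrative:
apiVersion: v1
kind: Service
metadata:
  name: webapp-svc
spec:
  type: NodePort
  selector:
    app: webapp              # traffic goes to Pods carrying this label
  ports:
  - port: 80                 # ClusterIP port inside the cluster
    targetPort: 8080         # port the container listens on
    nodePort: 30080          # high-number port opened on every node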
4.12.b. Services
We can use a service to connect one pod to another, or to the outside of the cluster. This
graphic shows a pod with a primary container, App, with an optional sidecar, Logger. Also
seen is the pause container, which is used by the cluster to reserve the IP address in the
namespace prior to starting the other containers in the pod. This container is not seen from
within Kubernetes, but can be seen using docker and crictl.
This graphic also shows a ClusterIP, which is used to connect inside the cluster, not the IP
of the cluster. As the graphic shows, this can be used to connect to a NodePort for outside
the cluster, an IngressController or proxy, or another "backend" pod or pods.
4.13. Controllers
An important concept for orchestration is the use of controllers. Various controllers ship
with Kubernetes, and you can create your own, as well. A simplified view of a controller is
an agent, or Informer, and a downstream store. Using a DeltaFIFO queue, the source
and downstream are compared. A loop process receives an obj or object, which is an
array of deltas from the FIFO queue. As long as the delta is not of the type Deleted, the
logic of the controller is used to create or modify some object until it matches the
specification.
The Informer, which uses the API server as a source, requests the state of an object via
an API call. The data is cached to minimize API server transactions. A similar agent is the
SharedInformer; objects are often used by multiple other objects. It creates a shared
cache of the state for multiple requests.
A Workqueue uses a key to hand out tasks to various workers. The standard Go work
queues of rate limiting, delayed, and time queue are typically used.
The endpoints, namespace, and serviceaccounts controllers each manage the
eponymous resources for Pods.
Moving legacy applications to Kubernetes often brings up the question of whether the
application should be containerized as is, or rewritten as a transient, decoupled microservice. The
cost and time of rewriting legacy applications can be high, but there is also value to
leveraging the flexibility of Kubernetes. This video discusses the issue comparing a city
bus (monolithic legacy application) to a scooter (transient, decoupled microservices).
4.16. Containers
While Kubernetes orchestration does not allow direct manipulation on a container level, we
can manage the resources containers are allowed to consume.
In the resources section of the PodSpec you can pass parameters which will be passed
to the container runtime on the scheduled node:
resources:
  limits:
    cpu: "1"
    memory: "4Gi"
  requests:
    cpu: "0.5"
    memory: "500Mi"
Another way to manage resource usage of the containers is by creating a
ResourceQuota object, which allows hard and soft limits to be set in a namespace. The
quotas allow management of more resources than just CPU and memory, and also allow
limits on the number of several object types.
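A minimal sketch of such a quota, with illustrative values, limiting both compute resources
and object counts in a hypothetical dev namespace:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev             # hypothetical namespace
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 2Gi
    limits.cpu: "4"
    limits.memory: 4Gi
    pods: "10"
    services: "5"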
A beta feature in v1.12 uses the scopeSelector field in the quota spec to run a pod at a
specific priority if it has the appropriate priorityClassName in its pod spec.
Not all containers are the same. Standard containers are sent to the container engine at the same
time, and may start in any order. LivenessProbes, ReadinessProbes, and StatefulSets can be used to
determine the order, but can add complexity. Another option can be an Init Container, which must
complete before app containers can be started. Should the init container fail, it will be restarted until
completion, without the app container running.
The init container can have a different view of the storage and security settings, which allows us to
use utilities and commands that the application would not be allowed to use. Init containers can
contain code or utilities that are not in an app, and they have security settings independent
from the app containers.
The code below will run the init container until the ls command succeeds; then the database
container will start.
spec:
  containers:
  - name: main-app
    image: databaseD
  initContainers:
  - name: wait-database
    image: busybox
    command: ['sh', '-c', 'until ls /db/dir ; do sleep 5; done; ']
Now that we have seen some of the components, let's take another look with some of the
connections shown. Not all connections are shown in this diagram. Note that all of the
components are communicating with kube-apiserver. Only kube-apiserver
communicates with the etcd database.
We also see some commands, which we may need to install separately to work with
various components. There is an etcdctl command to interrogate the database and
calicoctl to view more of how the network is configured. We can see Felix, which is
the primary Calico agent on each machine. This agent, or daemon, is responsible for
interface monitoring and management, route programming, ACL configuration and state
reporting.
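As an illustration (the exact flags, endpoints, and certificates depend on how your cluster
was deployed), typical invocations of these tools look like the following:
$ ETCDCTL_API=3 etcdctl member list    # usually also needs --endpoints and certificate flags
$ calicoctl node status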
BIRD is a dynamic IP routing daemon used by Felix to read routing state and distribute
that information to other nodes in the cluster. This allows a client to connect to any node,
and eventually be connected to the workload on a container, even if not the node originally
contacted.
On the next page, you will find a video that should help you get a better understanding of the API
call flow from a request for a new pod through to pod and container deployment and ongoing
cluster status.
4.20. Node
A node is an API object created outside the cluster representing an instance. While a
master must be Linux, worker nodes can also be Microsoft Windows Server 2019. Once
the node has the necessary software installed, it is ingested into the API server.
At the moment, you can create a master node with kubeadm init and worker nodes by
passing join. In the near future, secondary master nodes and/or etcd nodes may be
joined.
If the kube-apiserver cannot communicate with the kubelet on a node for 5 minutes, the
default NodeLease will schedule the node for deletion and the NodeStatus will change
from ready. The pods will be evicted once a connection is re-established. They are no
longer forcibly removed and rescheduled by the cluster.
Each node has a corresponding Lease object in the kube-node-lease namespace. To
remove a node from the cluster, first use kubectl delete node <node-name> to remove it from the API
server. This will cause pods to be evacuated. Then, use kubeadm reset to remove
cluster-specific information. You may also need to remove iptables information, depending
on if you plan on re-using the node.
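Put together, one common removal sequence looks like the following, with a placeholder
node name; the drain step is an optional way to evict pods gracefully first:
$ kubectl drain worker-node-2 --ignore-daemonsets
$ kubectl delete node worker-node-2
(then, on the node itself)
$ sudo kubeadm reset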
A pod represents a group of co-located containers with some associated data volumes. All
containers in a pod share the same network namespace.
The graphic shows a pod with two containers, A and B, and two data volumes, 1 and 2.
Containers A and B share the network namespace of a third container, known as the pause
container. The pause container is used to get an IP address, then all the containers in the
pod will use its network namespace. Volumes 1 and 2 are shown for completeness.
To communicate with each other, containers within pods can use the loopback interface,
write to files on a common filesystem, or use inter-process communication (IPC).
There is now a network plugin from HPE Labs which allows multiple IP addresses per pod,
but this feature has not grown past this new plugin.
Starting as an alpha feature in 1.16 is the ability to use IPv4 and IPv6 for pods and
services. When creating a service, you would create the endpoint for each address family
separately.
Pod
This graphic shows a node with a single, dual-container pod. A NodePort service connects
the Pod to the outside network. Even though there are two containers, they share the
same namespace and the same IP address, which would be configured by kubectl working
with kube-proxy. The IP address is assigned before the containers are started, and will be
inserted into the containers. The container will have an interface like eth0@tun10. This IP
is set for the life of the pod.
The endpoint is created at the same time as the service. Note that it uses the pod IP
address, but also includes a port. The service connects network traffic from a node high-
number port to the endpoint using iptables, with ipvs on the way. The kube-controller-
manager handles the watch loops to monitor the need for endpoints and services, as well
as any updates or deletions.
4.23. Networking Setup
Getting all the previous components running is a common task for system administrators
who are accustomed to configuration management. But, to get a fully functional
Kubernetes cluster, the network will need to be set up properly, as well.
A detailed explanation about the Kubernetes networking model can be seen on the
Cluster Networking page in the Kubernetes documentation.
If you have experience deploying virtual machines (VMs) based on IaaS solutions, this will
sound familiar. The only caveat is that, in Kubernetes, the lowest compute unit is not a
container, but what we call a pod.
A pod is a group of co-located containers that share the same IP address. From a
networking perspective, a pod can be seen as a virtual machine or a physical host. The
network needs to assign IP addresses to pods, and needs to provide traffic routes between
all pods on any nodes.
The three main networking challenges to solve in a container orchestration system are:
Coupled container-to-container communications (solved by the pod concept).
Pod-to-pod communications.
External-to-pod communications (solved by the services concept, which we will
discuss later).
Kubernetes expects the network configuration to enable pod-to-pod communications to
be available; it will not do it for you.
Tim Hockin, one of the lead Kubernetes developers, has created a very useful slide deck
to help understand Kubernetes networking: An Illustrated Guide to Kubernetes Networking.
4.24.b. CNI Network Configuration File (Cont.)
While a CNI plugin can be used to configure the network of a pod and provide a single IP
per pod, CNI does not help you with pod-to-pod communication across nodes.
The requirement from Kubernetes is the following:
All pods can communicate with each other across nodes.
All nodes can communicate with all pods.
No Network Address Translation (NAT).
Basically, all IPs involved (nodes and pods) are routable without NAT. This can be
achieved at the physical network infrastructure if you have access to it (e.g. GKE). Or, this
can be achieved with a software defined overlay with solutions like:
Weave
Flannel
Calico
Romana.
See this documentation page or the list of networking add-ons for a more complete list.
4.26. Mesos
At a high level, there is nothing different between Kubernetes and other clustering
systems.
A central manager exposes an API, a scheduler places the workloads on a set of nodes,
and the state of the cluster is stored in a persistent layer.
For example, you could compare Kubernetes with Mesos, and you would see the
similarities. In Kubernetes, however, the persistence layer is implemented with etcd,
instead of Zookeeper for Mesos.
You should also consider systems like OpenStack and CloudStack. Think about what runs
on their head node, and what runs on their worker nodes. How do they keep state? How
do they handle networking? If you are familiar with those systems, Kubernetes will not
seem that different.
What really sets Kubernetes apart is its features oriented towards fault-tolerance, self-
discovery, and scaling, coupled with a mindset that is purely API-driven.
Mesos Architecture (by The Apache Software Foundation, retrieved from the Mesos
website)
5.5. RESTful
kubectl makes API calls on your behalf, responding to typical HTTP verbs (GET, POST,
DELETE). You can also make calls externally, using curl or another program. With the
appropriate certificates and keys, you can make requests, or pass JSON files to make
configuration changes.
$ curl --cert userbob.pem --key userBob-key.pem \
--cacert /path/to/ca.pem \
https://k8sServer:6443/api/v1/pods
The ability to impersonate other users or groups, subject to RBAC configuration, allows a
manual override of authentication. This can be helpful for debugging authorization policies of
other users.
While there is more detail on security in a later chapter, it is helpful to check the current
authorizations, both as an administrator, as well as another user. The following shows
what user bob could do in the default namespace and the developer namespace,
using the auth can-i subcommand to query:
$ kubectl auth can-i create deployments
yes
$ kubectl auth can-i create deployments --as bob
no
$ kubectl auth can-i create deployments --as bob --namespace developer
yes
There are currently three APIs which can be applied to set who and what can be queried:
SelfSubjectAccessReview
Access review for the user currently making the request.
LocalSubjectAccessReview
Review is restricted to a specific namespace.
SelfSubjectRulesReview
A review which shows allowed actions for a user within a particular namespace.
The use of reconcile allows a check of authorization necessary to create an object from
a file. No output indicates the creation would be allowed.
The default serialization for API calls is JSON. There is an effort to use Google's
protobuf serialization, but this remains experimental. While we may work with files in a
YAML format, they are converted to and from JSON.
Kubernetes uses the resourceVersion value to determine API updates and implement
optimistic concurrency. In other words, an object is not locked from the time it has been
read until the object is written.
Instead, upon an updated call to an object, the resourceVersion is checked, and a 409
CONFLICT is returned, should the number have changed. The resourceVersion is
currently backed via the modifiedIndex parameter in the etcd database, and is unique
to the namespace, kind, and server. Operations which do not change an object, such as
WATCH or GET, do not update this value.
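As a quick way to see this value, you could query an object's resourceVersion with a jsonpath expression; the pod name used here is just an example:
$ kubectl get pod firstpod -o jsonpath='{.metadata.resourceVersion}'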
Labels are used to work with objects or collections of objects; annotations are not.
Instead, annotations allow for metadata to be included with an object that may be helpful
outside of the Kubernetes object interaction. Similar to labels, they are key/value maps.
They are also able to hold more information, and more human-readable information, than
labels.
Having this kind of metadata can be used to track information such as a timestamp,
pointers to related objects from other ecosystems, or even an email from the developer
responsible for that object's creation.
The annotation data could otherwise be held in an exterior database, but that would limit
the flexibility of the data. The more this metadata is included, the easier it is to integrate
management and deployment tools or shared client libraries.
For example, to annotate only Pods within a namespace, you can overwrite the
annotation, and finally delete it:
$ kubectl annotate pods --all description='Production Pods' -n prod
$ kubectl annotate --overwrite pods description="Old Production Pods" -n prod
$ kubectl annotate pods foo description- -n prod
As discussed earlier, a Pod is the lowest compute unit and individual object we can work
with in Kubernetes. It can be a single container, but often, it will consist of a primary
application container and one or more supporting containers.
Below is an example of a simple pod manifest in YAML format. You can see the
apiVersion (it must match the existing API group), the kind (the type of object to
create), the metadata (at least a name), and its spec (what to create and parameters),
which define the container that actually runs in this pod:
apiVersion: v1
kind: Pod
metadata:
  name: firstpod
spec:
  containers:
  - image: nginx
    name: stan
You can use the kubectl create command to create this pod in Kubernetes. Once it is
created, you can check its status with kubectl get pods. The output is omitted to save
space:
$ kubectl create -f simple.yaml
$ kubectl get pods
$ kubectl get pod firstpod -o yaml
$ kubectl get pod firstpod -o json
5.10. Manage API Resources with kubectl
Kubernetes exposes resources via RESTful API calls, which allows all resources to be
managed via HTTP with JSON or even XML payloads; the typical protocol is HTTP. The
state of the resources can be changed using standard HTTP verbs (e.g. GET, POST,
PATCH, DELETE, etc.).
kubectl has a verbose mode argument which shows details from where the command
gets and updates information. Other output includes curl commands you could use to
obtain the same result. While the verbosity accepts levels from zero to any number, there
is currently no verbosity value greater than nine. You can check this out for kubectl
get. The output below has been formatted for clarity:
$ kubectl --v=10 get pods firstpod
....
I1215 17:46:47.860958 29909 round_trippers.go:417]
curl -k -v -XGET -H "Accept: application/json"
-H "User-Agent: kubectl/v1.8.5 (linux/amd64)
kubernetes/cce11c6"
https://10.128.0.3:6443/api/v1/namespaces/default/pods/firstpod
....
If you delete this pod, you will see that the HTTP method changes from XGET to XDELETE:
$ kubectl --v=9 delete pods firstpod
....
I1215 17:49:32.166115 30452 round_trippers.go:417]
curl -k -v -XDELETE -H "Accept: application/json, */*"
-H "User-Agent: kubectl/v1.8.5 (linux/amd64)
kubernetes/cce11c6"
https://10.128.0.3:6443/api/v1/namespaces/default/pods/firstpod
5.11. Access from Outside the Cluster
The primary tool used from the command line will be kubectl, which makes the API calls
(much as curl would) on your behalf. You can also use the curl command from outside the
cluster to view or make changes.
The basic server information, with redacted TLS certificate information, can be found in the
output of kubectl config view.
If you view the verbose output from a previous page, you will note that the first line
references the configuration file from which this information is pulled, ~/.kube/config.
Without the certificate authority, key and certificate from this file, only insecure curl
commands can be used, which will not expose much due to security settings. We will use
curl to access our cluster using TLS in an upcoming lab.
5.12.a. ~/.kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdF.....
    server: https://10.128.0.3:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: LS0tLS1CRUdJTib.....
    client-key-data: LS0tLS1CRUdJTi....
The output on the previous page shows 19 lines of output, with each of the keys being
heavily truncated. While the keys may look similar, close examination shows them to be
distinct:
apiVersion
As with other objects, this instructs the kube-apiserver where to assign the
data.
clusters
This contains the name of the cluster, as well as where to send the API calls. The
certificate-authority-data is passed to authenticate the curl request.
contexts
A setting which allows easy access to multiple clusters, possibly as various users,
from one configuration file. It can be used to set namespace, user, and cluster.
current-context
Shows which cluster and user the kubectl command would use. These settings
can also be passed on a per-command basis.
kind
Every object within Kubernetes must have this setting; in this case, a declaration of
object type Config.
preferences
Currently not used, optional settings for the kubectl command, such as colorizing
output.
users
A nickname associated with client credentials, which can be client key and
certificate, username and password, and a token. Token and username/password
are mutually exclusive. These can be configured via the kubectl config set-credentials
command.
5.13. Namespaces
The term namespace is used to reference both the kernel feature and the segregation of
API objects by Kubernetes. Both are means to keep resources distinct.
Every API call includes a namespace, using default if not otherwise declared:
https://10.128.0.3:6443/api/v1/namespaces/default/pods.
Kubernetes namespaces are intended to isolate multiple groups and the resources they
have access to work with, via quotas. Eventually, access control policies will work on
namespace boundaries, as well. One could use Labels to group resources for
administrative reasons.
There are four namespaces when a cluster is first created.
default
This is where all resources are assumed, unless set otherwise.
kube-node-lease
The namespace where worker node lease information is kept.
kube-public
A namespace readable by all, even those not authenticated. General information is
often included in this namespace.
kube-system
Contains infrastructure pods.
Should you want to see all the resources on a system, you must pass the
--all-namespaces option to the kubectl command.
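For example, a few commands working with namespaces; the dev namespace name is just an illustration:
$ kubectl get pods --all-namespaces
$ kubectl create namespace dev
$ kubectl get pods -n dev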
5.15. API Resources with kubectl
All API resources exposed are available via kubectl. To get more information, do
kubectl help.
kubectl [command] [type] [Name] [flag]
Expect the list below to change:
all
certificatesigningrequests (csr)
clusterrolebindings
clusterroles
clusters (valid only for federation apiservers)
componentstatuses (cs)
configmaps (cm)
controllerrevisions
cronjobs
customresourcedefinition (crd)
daemonsets (ds)
deployments (deploy)
events (ev)
horizontalpodautoscalers (hpa)
ingresses (ing)
jobs
limitranges (limits)
namespaces (ns)
networkpolicies (netpol)
nodes (no)
persistentvolumeclaims (pvc)
persistentvolumes (pv)
poddisruptionbudgets (pdb)
podpreset
podsecuritypolicies (psp)
podtemplates
replicasets (rs)
replicationcontrollers (rc)
resourcequotas (quota)
rolebindings
roles
secrets
serviceaccounts (sa)
services (svc)
statefulsets
storageclasses
In addition to basic resource management via REST, the API also provides some
extremely useful endpoints for certain resources.
For example, you can access the logs of a container, exec into it, and watch changes to it
with the following endpoints:
$ curl --cert /tmp/client.pem --key /tmp/client-key.pem \
--cacert /tmp/ca.pem -v -XGET \
https://10.128.0.3:6443/api/v1/namespaces/default/pods/firstpod/log
This would be the same as the following command. If the container does not write
anything to standard out, there would be no logs.
$ kubectl logs firstpod
Other calls you could make, following the various API groups on your cluster:
GET /api/v1/namespaces/{namespace}/pods/{name}/exec
GET /api/v1/namespaces/{namespace}/pods/{name}/log
GET /api/v1/watch/namespaces/{namespace}/pods/{name}
5.17. Swagger
The entire Kubernetes API uses a Swagger specification. This is evolving towards the
OpenAPI initiative. It is extremely useful, as it allows, for example, auto-generating client
code. All the stable resource definitions are available on the documentation site.
You can browse some of the API groups via a Swagger UI on the OpenAPI Specification
web page.
5.18. API Maturity
The use of API groups and different versions allows for development to advance without
changes to an existing group of APIs. This allows for easier growth and separation of work
among separate teams. While there is an attempt to maintain some consistency between
API and software versions, they are only indirectly linked.
The use of JSON and Google's Protobuf serialization scheme will follow the same release
guidelines.
An Alpha level release, noted with alpha in the name, may be buggy and is disabled by
default. Features could change or disappear at any time. Only use these features on a test
cluster which is often rebuilt.
The Beta level, found with beta in the name, has more well-tested code and is enabled by
default. It also ensures that, as changes move forward, they will be tested for backwards
compatibility between versions. It has not been adopted and tested enough to be called
stable. You can expect some bugs and issues.
The Stable level, denoted by only an integer which may be preceded by the letter v (for
example, v1), is well-tested code that is considered safe for production use.
6.4. Overview
This chapter is about API resources or objects. We will learn about resources in the v1
API group, among others. Objects move from alpha versions to beta, and then to v1, as
their code matures and becomes more stable.
DaemonSets, which ensure a Pod on every node, and StatefulSets, which stick a
container to a node and otherwise act like a deployment, have progressed
to apps/v1 stability. Jobs and CronJobs are now in batch/v1.
Role-Based Access Control (RBAC), essential to security, has made the leap from
v1alpha1 to the stable v1 status.
As Kubernetes is a fast-moving project, keeping track of changes can be an important part
of ongoing system administration. Release notes, as well as discussions of release notes,
can be found in version-dependent subdirectories in the Features tracking repository for
Kubernetes releases on GitHub. For example, the v1.17 release feature status can be
found on the Kubernetes v1.17.0 Release Notes page.
Starting with v1.16, deprecated API object versions will respond with an error instead of
being accepted. This is an important change from the historic behavior.
The v1 API group is no longer a single group, but rather a collection of groups for each
main object category. For example, there is a v1 group, a storage.k8s.io/v1 group, an
rbac.authorization.k8s.io/v1 group, etc. Currently, there are eight v1 groups.
We have touched on several objects in lab exercises. Here are some details for some of
them:
Node
Represents a machine - physical or virtual - that is part of your
Kubernetes cluster. You can get more information about nodes with
the kubectl get nodes command. You can turn on and off the scheduling to a
node with the kubectl cordon/uncordon commands.
Service Account
Provides an identity for processes running in a pod to access the API server and
perform actions that they are authorized to do.
Resource Quota
An extremely useful tool, allowing you to define quotas per namespace. For
example, if you want to limit a specific namespace to only run a given number of
pods, you can write a ResourceQuota manifest, create it with kubectl, and the quota
will be enforced; a sketch of such a manifest follows this list.
Endpoint
Generally, you do not manage endpoints. They represent the set of IPs for Pods
that match a particular service. They are handy when you want to check that a
service actually matches some running pods. If an endpoint is empty, then it means
that there are no matching pods and something is most likely wrong with your
service definition.
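As mentioned in the Resource Quota entry above, a minimal ResourceQuota manifest limiting a namespace to ten pods might look like the following; the object name and namespace are only examples:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-quota
  namespace: dev
spec:
  hard:
    pods: "10"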
We can take a closer look at the output of the request for current APIs. Each of the name
values can be appended to the URL to see details of that group. For example, you could
drill down to find included objects at this URL:
https://localhost:6443/apis/apiregistration.k8s.io/v1beta1.
If you follow this URL, you will find only one resource, with a name of apiservices. If it
seems to be listed twice, the lower output is for status. You'll notice that there are different
verbs or actions for each. Another entry is if this object is namespaced, or restricted to only
one namespace. In this case, it is not.
$ curl https://localhost:6443/apis --header "Authorization: Bearer
$token" -k
{
"kind": "APIGroupList",
"apiVersion": "v1",
"groups": [
{
"name": "apiregistration.k8s.io",
"versions": [
{
"groupVersion": "apiregistration.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "apiregistration.k8s.io/v1beta1",
"version": "v1beta1"
}
You can then curl each of these URIs and discover additional API objects, their
characteristics and associated verbs.
Using the kubectl create command, we can quickly deploy an application. We have
looked at the Pods created running the application, like nginx. Looking closer, you will
find that a Deployment was created, which manages a ReplicaSet, which then deploys the
Pod. Let's take a closer look at each object:
Deployment
A controller which manages the state of ReplicaSets and the pods within. The
higher level control allows for more flexibility with upgrades and administration.
Unless you have a good reason, use a deployment.
ReplicaSet
Orchestrates individual Pod lifecycle and updates. These are newer versions of
Replication Controllers, which differ only in selector support.
Pod
As we've mentioned, it is the lowest unit we can manage, runs the application
container, and possibly support containers.
6.8. DaemonSets
Should you want to have a logging application on every node, a DaemonSet may be a
good choice. The controller ensures that a single pod, of the same type, runs on every
node in the cluster. When a new node is added to the cluster, a Pod, same as deployed on
the other nodes, is started. When the node is removed, the DaemonSet makes sure the
local Pod is deleted. DaemonSets are often used for logging, metrics and security pods,
and can be configured to avoid nodes.
As usual, you get all the CRUD operations via kubectl:
$ kubectl get daemonsets
$ kubectl get ds
6.10. Autoscaling
In the autoscaling group we find the Horizontal Pod Autoscalers (HPA). This is a stable
resource. HPAs automatically scale Replication Controllers, ReplicaSets, or Deployments
based on a target of 50% CPU usage by default. The usage is checked by the kubelet
every 30 seconds, and retrieved by the Metrics Server API call every minute. HPA checks
with the Metrics Server every 30 seconds. Should a Pod be added or removed, HPA waits
180 seconds before further action.
Other metrics can be used and queried via REST. The autoscaler does not collect the
metrics, it only makes a request for the aggregated information and increases or
decreases the number of replicas to match the configuration.
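As a quick sketch, an HPA could be created for an existing Deployment with kubectl autoscale; the deployment name and limits here are only examples:
$ kubectl autoscale deployment web --min=2 --max=5 --cpu-percent=50
$ kubectl get hpa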
The Cluster Autoscaler (CA) adds or removes nodes to the cluster, based on the inability
to deploy a Pod or having nodes with low utilization for at least 10 minutes. This allows
dynamic requests of resources from the cloud provider and minimizes expenses for
unused nodes. If you are using CA, nodes should be added and removed through
cluster-autoscaler- commands. Scale-up and down of nodes is checked every 10
seconds, but decisions are made on a node every 10 minutes. Should a scale-down fail,
the group will be rechecked in 3 minutes, with the failing node being eligible in five
minutes. The total time to allocate a new node is largely dependent on the cloud provider.
Another project still under development is the Vertical Pod Autoscaler. This component
will adjust the amount of CPU and memory requested by Pods.
6.11. Jobs
Jobs are part of the batch API group. They are used to run a set number of pods to
completion. If a pod fails, it will be restarted until the number of completions is reached.
While they can be seen as a way to do batch processing in Kubernetes, they can also be
used to run one-off pods. A Job specification will have a parallelism and a completions key.
If omitted, they will be set to one. If they are present, the parallelism number will set the
number of pods that can run concurrently, and the completions number will set how many
pods need to run successfully for the Job itself to be considered done. Several Job
patterns can be implemented, like a traditional work queue; a minimal Job manifest is
sketched below.
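A minimal sketch of a Job using these keys (the name, image, and command are illustrative) could look like:
apiVersion: batch/v1
kind: Job
metadata:
  name: sleepy
spec:
  parallelism: 2
  completions: 5
  template:
    spec:
      containers:
      - name: resting
        image: busybox
        command: ["/bin/sleep", "5"]
      restartPolicy: Never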
CronJobs work in a similar manner to Linux cron jobs, with the same time syntax. There
are some cases where a job would not be run during a time period, or could run twice; as
a result, the requested Pod should be idempotent.
An optional spec field is .spec.concurrencyPolicy, which determines how to handle
existing jobs should the time segment expire. If set to Allow, the default, another
concurrent job will be run. If set to Forbid, the current job continues and the new job is
skipped. A value of Replace cancels the current job and starts a new job in its place.
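Following the same assumptions as the Job sketch above, a CronJob wrapping that work on a five-minute schedule might look like the following (on older clusters the apiVersion may be batch/v1beta1):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sleepy-cron
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: resting
            image: busybox
            command: ["/bin/sleep", "5"]
          restartPolicy: Never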
6.12. RBAC
The last API resources that we will look at are in the rbac.authorization.k8s.io group.
We actually have four resources: ClusterRole, Role, ClusterRoleBinding, and RoleBinding.
They are used for Role Based Access Control (RBAC) to Kubernetes.
$ curl localhost:8080/apis/rbac.authorization.k8s.io/v1
...
"groupVersion": "rbac.authorization.k8s.io/v1",
"resources": [
...
"kind": "ClusterRoleBinding"
...
"kind": "ClusterRole"
...
"kind": "RoleBinding"
...
"kind": "Role"
...
These resources allow us to define Roles within a cluster and associate users to these
Roles. For example, we can define a Role for someone who can only read pods in a
specific namespace, or a Role that can create deployments, but no services. A sketch of
such a Role and its binding follows. We will talk more about RBAC later in the course.
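A minimal sketch of a read-only Role and a RoleBinding granting it to the user bob; the names and namespace are only examples:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: developer
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: developer
subjects:
- kind: User
  name: bob
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io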
7.4. Overview
The default controller for a container deployed via kubectl run is a Deployment. While
we have been working with them already, we will take a closer look at configuration
options.
As with other objects, a deployment can be made from a YAML or JSON spec file. When
added to the cluster, the controller will create a ReplicaSet and a Pod automatically. The
containers, their settings and applications can be modified via an update, which generates
a new ReplicaSet, which, in turn, generates new Pods.
The updated objects can be staged to replace previous objects as a block or as a rolling
update, which is determined as part of the deployment specification. Most updates can be
configured by editing a YAML file and running kubectl apply. You can also use
kubectl edit to modify the in-use configuration. Previous versions of the ReplicaSets
are kept, allowing a rollback to return to a previous configuration.
We will also talk more about labels. Labels are essential to administration in Kubernetes,
but are not an API resource. They are user-defined key-value pairs which can be attached
to any resource, and are stored in the metadata. Labels are used to query or select
resources in your cluster, allowing for flexible and complex management of the cluster.
As a label is arbitrary, you could select all resources used by developers, or belonging to a
user, or any attached string, without having to figure out what kind or how many of such
resources exist.
7.5. Deployments
Here we can see the relationship of objects from the container which Kubernetes does not
directly manage, up to the deployment.
7.6.b. Object Relationship
The boxes and shapes are logical, in that they represent the controllers, or watch loops,
running as a thread of kube-controller-manager. Each controller queries the kube-
apiserver for the current state of the object they track. The state of each object on a
worker node is sent back from the local kubelet.
The graphic in the upper left represents a container running nginx 1.11. Kubernetes does
not directly manage the container. Instead, the kubelet daemon checks the pod
specifications by asking the container engine, which could be Docker or cri-o, for the
current status. The graphic to the right of the container shows a pod which represents a
watch loop checking the container status. kubelet compares the current pod spec against
what the container engine replies and will terminate and restart the pod if necessary.
A multi-container pod is shown next. While there are several names used, such as
sidecar or ambassador, these are all multi-container pods. The names are used to
indicate the particular reason to have a second container in the pod, instead of denoting a
new kind of pod.
On the lower left we see a replicaSet. This controller will ensure you have a certain
number of pods running. The pods are all deployed with the same podspec, which is why
they are called replicas. Should a pod terminate or a new pod be found, the replicaSet will
create or terminate pods until the current number of running pods matches the
specifications. Any of the current pods could be terminated should the spec demand fewer
pods running.
The graphic in the lower right shows a deployment. This controller allows us to manage
the versions of images deployed in the pods. Should an edit be made to the deployment, a
new replicaSet is created, which will deploy pods using the new podSpec. The deployment
will then direct the old replicaSet to shut down pods as the new replicaSet pods become
available. Once the old pods are all terminated, the deployment terminates the old
replicaSet and the deployment returns to having only one replicaSet running.
On the previous page, we created a new deployment running a particular version of the
nginx web server.
To generate the YAML file of the newly created objects, do:
$ kubectl get deployments,rs,pods -o yaml
Sometimes, a JSON output can make it more clear:
$ kubectl get deployments,rs,pods -o json
Now, we will look at the YAML output, which also shows default values that were not
passed to the object when it was created.
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
apiVersion
A value of v1 shows that this object is considered to be a stable resource. In this
case, it is not the deployment. It is a reference to the List type.
items
As the previous line is a List, this declares the list of items the command is
showing.
- apiVersion
The dash is a YAML indication of the first item, which declares the apiVersion of
the object as apps/v1. This indicates the object is considered
stable. Deployments are the controller used in most cases.
kind
This is where the type of object to create is declared, in this case, a deployment.
Continuing with the YAML output, we see the next general block of output concerns the
metadata of the deployment. This is where we would find labels, annotations, and other
non-configuration information. Note that this output will not show all possible configuration.
Many settings which are set to false by default are not shown, like podAffinity or
nodeAffinity.
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: 2017-12-21T13:57:07Z
  generation: 1
  labels:
    app: dev-web
  name: dev-web
  namespace: default
  resourceVersion: "774003"
  selfLink: /apis/apps/v1/namespaces/default/deployments/dev-web
  uid: d52d3a63-e656-11e7-9319-42010a800003
Next, you can see an explanation of the information present in the deployment metadata
(the file provided on the previous page):
annotations:
These values do not configure the object, but provide further information that could
be helpful to third-party applications or administrative tracking. Unlike labels, they
cannot be used to select an object with kubectl.
creationTimestamp :
Shows when the object was originally created. Does not update if the object is
edited.
generation :
How many times this object has been edited, such as changing the number of
replicas, for example.
labels :
Arbitrary strings used to select or exclude objects for use with kubectl, or other API
calls. Helpful for administrators to select objects outside of typical object
boundaries.
name :
This is a required string, which we passed from the command line. The name must
be unique to the namespace.
resourceVersion :
A value tied to the etcd database to help with concurrency of objects. Any changes
to the database will cause this number to change.
selfLink :
References how the kube-apiserver will ingest this information into the API.
uid :
Remains a unique ID for the life of the object.
7.9.a. Deployment Configuration Spec
There are two spec declarations for the deployment. The first will modify the ReplicaSet
created, while the second will pass along the Pod configuration.
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: dev-web
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
spec :
A declaration that the following items will configure the object being created.
progressDeadlineSeconds:
Time in seconds until a progress error is reported during a change. Reasons could
be quotas, image issues, or limit ranges.
replicas :
As the object being created is a ReplicaSet, this parameter determines how many
Pods should be created. If you were to use kubectl edit and change this value
to two, a second Pod would be generated.
revisionHistoryLimit:
How many old ReplicaSet specifications to retain for rollback.
The elements present in the example we provided on the previous page are explained
below (continued):
selector :
A collection of values ANDed together. All must be satisfied for the replica to
match. Do not create Pods which match these selectors, as the deployment
controller may try to control the resource, leading to issues.
matchLabels :
Set-based requirements of the Pod selector. Often found with
the matchExpressions statement, to further designate where the resource
should be scheduled.
strategy :
A header for values having to do with updating Pods. Works with the later listed
type. Could also be set to Recreate, which would delete all existing pods before
new pods are created. With RollingUpdate, you can control how many Pods are
deleted at a time with the following parameters.
maxSurge:
Maximum number of Pods over desired number of Pods to create. Can be a
percentage, default of 25%, or an absolute number. This creates a certain number
of new Pods before deleting old ones, for continued access.
maxUnavailable:
A number or percentage of Pods which can be in a state other than Ready during
the update process.
type :
Even though listed last in the section, due to the level of white space indentation, it
is read as the type of object being configured. (e.g. RollingUpdate).
image :
This is the image name passed to the container engine, typically Docker. The
engine will pull the image and create the Pod.
imagePullPolicy :
Policy settings passed along to the container engine, about when and if an image
should be downloaded or used from a local cache.
name :
The leading stub of the Pod names. A unique string will be appended.
resources :
By default, empty. This is where you would set resource restrictions and settings,
such as a limit on CPU or memory for the containers.
terminationMessagePath :
A customizable location of where to output success or failure information of a
container.
terminationMessagePolicy :
The default value is File, which holds the termination method. It could also be set
to FallbackToLogsOnError which will use the last chunk of container log if the
message file is empty and the container shows an error.
dnsPolicy :
Determines if DNS queries should go to coredns or, if set to Default, use the
node's DNS resolution configuration.
restartPolicy :
Should the container be restarted if killed? Automatic restarts are part of the typical
strength of Kubernetes.
schedulerName :
Allows for the use of a custom scheduler, instead of the Kubernetes default.
securityContext :
Flexible setting to pass one or more security settings, such as SELinux context,
AppArmor values, users and UIDs for the containers to use.
terminationGracePeriodSeconds :
The amount of time to wait for a SIGTERM to run until a SIGKILL is used to
terminate the container.
The API server allows the configuration settings to be updated for most values. There
are some immutable values, which may differ depending on the version of Kubernetes
you have deployed.
A common update is to change the number of replicas running. If this number is set to
zero, there would be no containers, but there would still be a ReplicaSet and Deployment.
This is the backend process when a Deployment is deleted.
$ kubectl scale deploy/dev-web --replicas=4
deployment "dev-web" scaled
$ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
dev-web 4/4 4 4 20s
Non-immutable values can be edited via a text editor, as well. Use edit to trigger an
update. For example, to change the deployed version of the nginx web server to an older
version:
$ kubectl edit deployment nginx
....
containers:
- image: nginx:1.8 #<<---Set to an older version
imagePullPolicy: IfNotPresent
name: dev-web
....
This would trigger a rolling update of the deployment. While the deployment would show
an older age, a review of the Pods would show a recent update and older version of the
web server application deployed.
7.13.a. Deployment Rollbacks
With some of the previous ReplicaSets of a Deployment being kept, you can also roll back
to a previous revision by scaling up and down. The number of previous configurations kept
is configurable, and has changed from version to version. During a rollout or rollback, the
Deployment edits the replica counts, decrementing the old ReplicaSet and incrementing the
new one. Next, we will have a closer look at rollbacks, using the --record option of the
kubectl create command, which allows annotation in the resource definition.
$ kubectl create deploy ghost --image=ghost --record
$ kubectl get deployments ghost -o yaml
deployment.kubernetes.io/revision: "1"
kubernetes.io/change-cause: kubectl create deploy ghost --
image=ghost --record
Should an update fail, due to an improper image version, for example, you can roll back
the change to a working version with kubectl rollout undo:
$ kubectl set image deployment/ghost ghost=ghost:09 --all
$ kubectl rollout history deployment/ghost
deployments "ghost":
REVISION  CHANGE-CAUSE
1         kubectl create deploy ghost --image=ghost --record
2         kubectl set image deployment/ghost ghost=ghost:09 --all
$ kubectl get pods
NAME                     READY   STATUS             RESTARTS   AGE
ghost-2141819201-tcths   0/1     ImagePullBackOff   0          1m
$ kubectl rollout undo deployment/ghost ; kubectl get pods
NAME READY STATUS RESTARTS AGE
ghost-3378155678-eq5i6 1/1 Running 0 7s
7.13.b. Deployment Rollbacks (Cont.)
You can roll back to a specific revision with the --to-revision=2 option.
You can also edit a Deployment using the kubectl edit command.
You can also pause a Deployment, and then resume.
$ kubectl rollout pause deployment/ghost
$ kubectl rollout resume deployment/ghost
Please note that you can still do a rolling update on ReplicationControllers with the
kubectl rolling-update command, but this is done on the client side. Hence, if you
close your client, the rolling update will stop.
A newer object to work with is the DaemonSet. This controller ensures that a single pod
exists on each node in the cluster. Every Pod uses the same image. Should a new node
be added, the DaemonSet controller will deploy a new Pod on your behalf. Should a node
be removed, the controller will delete the Pod also.
The use of a DaemonSet allows for ensuring a particular container is always running. In a
large and dynamic environment, it can be helpful to have a logging or metric generation
application on every node without an administrator remembering to deploy that application.
Use kind: DaemonSet.
There are ways of affecting the kube-scheduler such that some nodes will not run a
DaemonSet.
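A minimal DaemonSet manifest might look like the following sketch; the names and image are only illustrative:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-logger
spec:
  selector:
    matchLabels:
      app: node-logger
  template:
    metadata:
      labels:
        app: node-logger
    spec:
      containers:
      - name: logger
        image: fluentd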
7.15.a. Labels
Part of the metadata of an object is a label. Though labels are not API objects, they are an
important tool for cluster administration. They can be used to select an object based on an
arbitrary string, regardless of the object type. Labels are immutable as of API version
apps/v1.
Every resource can contain labels in its metadata. By default, creating a Deployment with
kubectl create adds a label, as we saw in:
....
labels:
  pod-template-hash: "3378155678"
  run: ghost
....
You could then use labels to select pods, or view them in new columns:
$ kubectl get pods -l run=ghost
NAME READY STATUS RESTARTS AGE
ghost-3378155678-eq5i6 1/1 Running 0 10m
$ kubectl get pods -L run
NAME READY STATUS RESTARTS AGE RUN
ghost-3378155678-eq5i6 1/1 Running 0 10m ghost
nginx-3771699605-4v27e 1/1 Running 1 1h nginx
While you typically define labels in pod templates and in the specifications of
Deployments, you can also add labels on the fly:
$ kubectl label pods ghost-3378155678-eq5i6 foo=bar
$ kubectl get pods --show-labels
NAME READY STATUS RESTARTS AGE LABELS
ghost-3378155678-eq5i6   1/1   Running   0   11m   foo=bar,pod-template-hash=3378155678,run=ghost
For example, if you want to force the scheduling of a pod on a specific node, you can use
a nodeSelector in a pod definition, add specific labels to certain nodes in your cluster and
use those labels in the pod.
....
spec:
  containers:
  - image: nginx
  nodeSelector:
    disktype: ssd
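The matching label could be applied to a chosen node with kubectl label; the node name worker-1 is an assumption:
$ kubectl label nodes worker-1 disktype=ssd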
The kubectl expose command created a service for the nginx deployment. This
service used port 80 and generated a random port on all the nodes. A particular port and
targetPort can also be passed during object creation to avoid random values. The
targetPort defaults to the port, but could be set to any value, including a string
referring to a port on a backend Pod. Each Pod could have a different port, but traffic is still
passed via the name. Switching traffic to a different port would maintain a client
connection, while changing versions of software, for example.
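As a sketch, a particular port and targetPort could be set at creation time with kubectl expose; the deployment name and port numbers here are only examples:
$ kubectl expose deployment/nginx --port=80 --target-port=8080 --type=NodePort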
The kubectl get svc command gave you a list of all the existing services, and we saw
the nginx service, which was created with an internal cluster IP.
The range of cluster IPs and the range of ports used for the random NodePort are
configurable in the API server startup options.
Services can also be used to point to a service in a different namespace, or even a
resource outside the cluster, such as a legacy application not yet in Kubernetes.
When developing an application or service, one quick way to check your service is to run a
local proxy with kubectl. It will capture the shell, unless you place it in the background.
When running, you can make calls to the Kubernetes API on localhost and also reach
the ClusterIP services on their API URL. The IP and port where the proxy listens can be
configured with command arguments.
Run a proxy:
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
Next, to access a ghost service using the local proxy, we could use the following URL, for
example, at http://localhost:8001/api/v1/namespaces/default/services/ghost.
If the service port has a name, the path will be
http://localhost:8001/api/v1/namespaces/default/services/ghost:<port_name>.
8.10. DNS
To make sure that your DNS setup works well and that services get registered, the easiest
check is to run a pod in the cluster and exec into it to do a DNS lookup.
Create this sleeping busybox pod with the kubectl create command:
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - image: busybox
    name: busy
    command:
      - sleep
      - "3600"
Then, use kubectl exec to do your nslookup like so:
$ kubectl exec -ti busybox -- nslookup nginx
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: nginx
Address 1: 10.0.0.112
You can see that the DNS name nginx (corresponding to the nginx service) is registered
with the ClusterIP of the service.
Container engines have traditionally not offered storage that outlives the container. As
containers are considered transient, this could lead to a loss of data, or complex exterior
storage options. A Kubernetes volume shares the Pod lifetime, not the containers within.
Should a container terminate, the data would continue to be available to the new
container.
A volume is a directory, possibly pre-populated, made available to containers in a Pod.
The creation of the directory, the backend storage of the data, and the contents depend on
the volume type. As of v1.13, there were 27 different volume types, ranging from rbd for
access to Ceph, to NFS, to dynamic volumes from a cloud provider like Google's
gcePersistentDisk. Each has particular configuration options and dependencies.
The Container Storage Interface (CSI) adoption enables the goal of an industry standard
interface for container orchestration to allow access to arbitrary storage systems.
Currently, volume plugins are "in-tree", meaning they are compiled and built with the core
Kubernetes binaries. This "out-of-tree" object will allow storage vendors to develop a
single driver and allow the plugin to be containerized. This will replace the existing Flex
plugin which requires elevated access to the host node, a large security concern.
Should you want your storage lifetime to be distinct from a Pod, you can use Persistent
Volumes. These allow for empty or pre-populated volumes to be claimed by a Pod using a
Persistent Volume Claim, then outlive the Pod. Data inside the volume could then be used
by another Pod, or as a means of retrieving data.
There are two API Objects which exist to provide data to a Pod already. Encoded data can
be passed using a Secret and non-encoded data can be passed with a ConfigMap. These
can be used to pass important data like SSH keys, passwords, or even a configuration file
like /etc/hosts.
9.5.a. Introducing Volumes
A Pod specification can declare one or more volumes and where they are made available.
Each requires a name, a type, and a mount point. The same volume can be made
available to multiple containers within a Pod, which can be a method of container-to-
container communication. A volume can be made available to multiple Pods, with
each given an access mode to write. There is no concurrency checking, which means
data corruption is probable, unless outside locking takes place.
K8s Pod Volumes
A particular access mode is part of a Pod request. As a request, the user may be granted
more, but not less access, though a direct match is attempted first. The cluster groups
volumes with the same mode together, then sorts volumes by size, from smallest to
largest. The claim is checked against each in that access mode group, until a volume of
sufficient size matches. The three access modes are ReadWriteOnce, which allows read-
write by a single node, ReadOnlyMany, which allows read-only by multiple nodes, and
ReadWriteMany, which allows read-write by many nodes. Thus two pods on the same
node can write to a ReadWriteOnce, but a third pod on a different node would not
become ready due to a FailedAttachVolume error.
When a volume is requested, the local kubelet uses the kubelet_pods.go script to map
the raw devices, determine and make the mount point for the container, then create the
symbolic link on the host node filesystem to associate the storage to the container. The
API server makes a request for the storage to the StorageClass plugin, but the specifics
of the requests to the backend storage depend on the plugin in use.
If a request for a particular StorageClass was not made, then the only parameters used
will be access mode and size. The volume could come from any of the storage types
available, and there is no configuration to determine which of the available ones will be
used.
One of the many types of storage available is an emptyDir. The kubelet will create the
directory in the container, but not mount any storage. Any data created is written to the
shared container space. As a result, it would not be persistent storage. When the Pod is
destroyed, the directory would be deleted along with the container.
apiVersion: v1
kind: Pod
metadata:
  name: busybox
  namespace: default
spec:
  containers:
  - image: busybox
    name: busy
    command:
      - sleep
      - "3600"
    volumeMounts:
    - mountPath: /scratch
      name: scratch-volume
  volumes:
  - name: scratch-volume
    emptyDir: {}
The YAML file above would create a Pod with a single container with a volume named
scratch-volume created, which would create the /scratch directory inside the
container.
There are several types that you can use to define volumes, each with their pros and cons.
Some are local, and many make use of network-based resources.
In GCE or AWS, you can use volumes of type gcePersistentDisk or
awsElasticBlockStore, which allow you to mount GCE and EBS disks in your Pods,
assuming you have already set up accounts and privileges.
emptyDir and hostPath volumes are easy to use. As mentioned, emptyDir is an
empty directory that gets erased when the Pod dies, but is recreated when the container
restarts. The hostPath volume mounts a resource from the host node filesystem. The
resource could be a directory, a file, a socket, or a character or block device. These
resources must already exist on the host to be used. There are two types,
DirectoryOrCreate and FileOrCreate, which create the resources on the host if they do not
already exist, and then use them.
NFS (Network File System) and iSCSI (Internet Small Computer System Interface) are
straightforward choices for multiple-reader scenarios.
rbd for block storage or CephFS and GlusterFS, if available in your Kubernetes cluster,
can be a good choice for multiple writer needs.
Besides the volume types we just mentioned, there are many other possible, with more
being added: azureDisk, azureFile, csi, downwardAPI, fc (fibre channel),
flocker, gitRepo, local, projected, portworxVolume, quobyte, scaleIO,
secret, storageos, vsphereVolume, persistentVolumeClaim, etc.
The following YAML file creates a pod with two containers, both with access to a shared
volume:
containers:
- image: busybox
  name: busy
  volumeMounts:
  - mountPath: /busy
    name: test
- image: busybox
  name: box
  volumeMounts:
  - mountPath: /box
    name: test
volumes:
- name: test
  emptyDir: {}
$ kubectl exec -ti busybox -c box -- touch /box/foobar
$ kubectl exec -ti busybox -c busy -- ls -l /busy
total 0
-rw-r--r-- 1 root root 0 Nov 19 16:26 foobar
You could use emptyDir or hostPath easily, since those types do not require any
additional setup, and will work in your Kubernetes cluster.
Note that one container wrote, and the other container had immediate access to the data.
There is nothing to keep the containers from overwriting the other's data. Locking or
versioning considerations must be part of the application to avoid corruption.
9.9. Persistent Volumes and Claims
A persistent volume (pv) is a storage abstraction used to retain data longer than the Pod
using it. Pods define a volume of type persistentVolumeClaim (pvc) with various
parameters for size and possibly the type of backend storage, known as its
StorageClass. The cluster then attaches the persistentVolume.
Kubernetes will dynamically use volumes that are available, irrespective of their storage
type, allowing claims to any backend storage.
Persistent storage moves through several phases, from provisioning and binding to use
and reclaiming. You can view the persistent volumes and claims in the cluster with:
$ kubectl get pv
$ kubectl get pvc
With a persistent volume created in your cluster, you can then write a manifest for a claim
and use that claim in your pod definition. In the Pod, the volume uses the
persistentVolumeClaim.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
In the Pod:
spec:
  containers:
  ....
  volumes:
  - name: test-volume
    persistentVolumeClaim:
      claimName: myclaim
9.11.b. Persistent Volume Claim (Cont.)
volumeMounts:
- name: rbdpd
  mountPath: /data/rbd
volumes:
- name: rbdpd
  rbd:
    monitors:
    - '10.19.14.22:6789'
    - '10.19.14.23:6789'
    - '10.19.14.24:6789'
    pool: k8s
    image: client
    fsType: ext4
    readOnly: true
    user: admin
    keyring: /etc/ceph/keyring
    imageformat: "2"
    imagefeatures: "layering"
While handling volumes with a persistent volume definition and abstracting the storage
provider using a claim is powerful, a cluster administrator still needs to create those
volumes in the first place. Starting with Kubernetes v1.4, Dynamic Provisioning allowed
for the cluster to request storage from an exterior, pre-configured source. API calls made
by the appropriate plugin allow for a wide range of dynamic storage use.
The StorageClass API resource allows an administrator to define a persistent volume
provisioner of a certain type, passing storage-specific parameters.
With a StorageClass created, a user can request a claim, which the API Server fills via
auto-provisioning. The resource will also be reclaimed as configured by the provider. AWS
and GCE are common choices for dynamic storage, but other options exist, such as a
Ceph cluster or iSCSI. A single default class is possible, set via annotation.
Here is an example of a StorageClass using GCE:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast # Could be any name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
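A claim requesting that class would then reference it by name; a minimal sketch (the claim name and size are examples) might look like:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fast-claim
spec:
  storageClassName: fast
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi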
9.13. Using Rook for Storage Orchestration
In keeping with the decoupled and distributed nature of Cloud technology, the Rook project
allows orchestration of storage using multiple storage providers.
As with other agents of the cluster, Rook uses custom resource definitions (CRD) and a
custom operator to provision storage according to the backend storage type, upon API
call.
Several storage providers are supported:
Ceph
Cassandra
CockroachDB
EdgeFS Geo-Transparent Storage
Minio Object Store
Network File System (NFS)
YugabyteDB.
9.14.a. Secrets
Pods can access local data using volumes, but there is some data you don't want readable
to the naked eye. Passwords may be an example. Using the Secret API resource, the
same password could be encoded or encrypted.
You can create, get, or delete secrets:
$ kubectl get secrets
Secrets can be encoded manually or via kubectl create secret:
$ kubectl create secret generic --help
$ kubectl create secret generic mysql --from-literal=password=root
A secret is not encrypted, only base64-encoded, by default. You must create an
EncryptionConfiguration with a key and proper identity. Then, the kube-apiserver needs
the --encryption-provider-config flag pointed at a previously configured provider, such as
aescbc or kms. Once this is enabled, you need to recreate every secret, as they are
encrypted upon write.
Multiple keys are possible. Each key for a provider is tried during decryption. The first key
of the first provider is used for encryption. To rotate keys, first create a new key, restart
(all) kube-apiserver processes, then recreate every secret.
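A minimal sketch of such an EncryptionConfiguration, assuming the aescbc provider and a placeholder key, could look like the following:
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}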
You can see the encoded string inside the secret with kubectl. The secret will be decoded
and be presented as a string saved to a file. The file can be used as an environmental
variable or in a new directory, similar to the presentation of a volume.
A secret can be made manually as well, then inserted into a YAML file:
$ echo LFTr@1n | base64
TEZUckAxbgo=
$ vim secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: lf-secret
data:
  password: TEZUckAxbgo=
9.15. Using Secrets via Environment Variables
A secret can be used as an environmental variable in a Pod. You can see one being
configured in the following example:
...
spec:
  containers:
  - image: mysql:5.5
    env:
    - name: MYSQL_ROOT_PASSWORD
      valueFrom:
        secretKeyRef:
          name: mysql
          key: password
    name: mysql
There is no limit to the number of Secrets used, but there is a 1MB limit to their size. Each
secret occupies memory, along with other API objects, so very large numbers of secrets
could deplete memory on a host.
They are stored in tmpfs storage on the host node, and are only sent to the host running a
Pod that requires them. All volumes requested by a Pod must be mounted before the
containers within the Pod are started. So, a secret must exist prior to being requested.
9.16. Mounting Secrets as Volumes
You can also mount secrets as files using a volume definition in a pod manifest. The
mount path will contain a file whose name will be the key of the secret created with the
kubectl create secret step earlier.
...
spec:
  containers:
  - image: busybox
    command:
      - sleep
      - "3600"
    volumeMounts:
    - mountPath: /mysqlpassword
      name: mysql
    name: busy
  volumes:
  - name: mysql
    secret:
      secretName: mysql
Once the pod is running, you can verify that the secret is indeed accessible in the
container, for example by reading the mounted file (assuming the pod is named busybox,
as in the earlier examples):
$ kubectl exec -ti busybox -- cat /mysqlpassword/password
LFTr@1n
9.17.a. Portable Data with ConfigMaps
A similar API resource to Secrets is the ConfigMap, except the data is not encoded. In
keeping with the concept of decoupling in Kubernetes, using a ConfigMap decouples a
container image from configuration artifacts.
They store data as sets of key-value pairs or plain configuration files in any format. The
data can come from a collection of files or all files in a directory. It can also be populated
from a literal value.
A ConfigMap can be used in several different ways. A Pod can use the data as
environmental variables from one or more sources. The values contained inside can be
passed to commands inside the pod. A Volume or a file in a Volume can be created,
including different names and particular access modes. In addition, cluster components
like controllers can use the data.
Let's say you have a file on your local filesystem called config.js. You can create a
ConfigMap that contains this file. The configmap object will have a data section containing
the content of the file:
$ kubectl get configmap foobar -o yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: foobar
data:
  config.js: |
    {
...
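One way such a ConfigMap could have been created from the local file is with kubectl create configmap; the name foobar matches the output above:
$ kubectl create configmap foobar --from-file=config.js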
9.17.b. Portable Data with ConfigMaps (Cont.)
Like secrets, you can use ConfigMaps as environment variables or using a volume mount.
They must exist prior to being used by a Pod, unless marked as optional. They also reside
in a specific namespace.
In the case of environment variables, your pod manifest will use the valueFrom key and
the configMapKeyRef value to read the values. For instance:
env:
- name: SPECIAL_LEVEL_KEY
  valueFrom:
    configMapKeyRef:
      name: special-config
      key: special.how
With volumes, you define a volume with the configMap type in your pod and mount it
where it needs to be used.
volumes:
- name: config-volume
  configMap:
    name: special-config
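The container in the same Pod then mounts that volume wherever the configuration files should
appear; a minimal sketch, where the mount path /etc/special-config is illustrative:

containers:
- name: test-container
  image: busybox
  volumeMounts:
  - name: config-volume
    mountPath: /etc/special-config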
10.6. nginx
Deploying an nginx controller has been made easy through the use of provided YAML
files, which can be found in the ingress-nginx GitHub repository.
This page has configuration files to configure nginx on several platforms, such as AWS,
GKE, Azure, and bare-metal, among others.
As with any Ingress Controller, there are some configuration requirements for proper
deployment. Customization can be done via a ConfigMap, Annotations, or, for detailed
configuration, a Custom template:
Easy integration with RBAC
Uses the annotation kubernetes.io/ingress.class: "nginx"
L7 traffic requires the proxy-real-ip-cidr setting
Bypasses kube-proxy to allow session affinity
Does not use conntrack entries for iptables DNAT
TLS requires the host field to be defined.
Chapter 10. Ingress > 10.7. Google Load Balancer Controller (GLBC)
There are several objects which need to be created to deploy the GCE Ingress Controller.
YAML files are available to make the process easy. Be aware that several objects would
be created for each service, and currently, quotas are not evaluated prior to creation.
The GLBC Controller must be created and started first. Also, you must create a
ReplicationController with a single replica, three services for the application Pod, and an
Ingress with two hostnames and three endpoints for each service. The backend is a group
of virtual machine instances, an Instance Group.
Each path for traffic uses a group of like objects referred to as a pool. Each pool regularly
checks the next hop up to ensure connectivity.
The multi-pool path is:
Global Forwarding Rule -> Target HTTP Proxy -> URL map -> Backend
Service -> Instance Group.
Currently, the TLS Ingress only supports port 443 and assumes TLS termination. It does
not support SNI, using only the first certificate provided. The TLS secret must contain keys
named tls.crt and tls.key.
Ingress objects are still an extension API, like Deployments and ReplicaSets. A typical
Ingress object that you can POST to the API server is:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ghost
spec:
  rules:
  - host: ghost.192.168.99.100.nip.io
    http:
      paths:
      - backend:
          serviceName: ghost
          servicePort: 2368
You can manage Ingress resources just as you do Pods, Deployments, Services, and other objects.
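For example, using the resource name from the Ghost example above:

$ kubectl get ingress
$ kubectl describe ingress ghost
$ kubectl delete ingress ghost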
To get exposed with ingress quickly, you can go ahead and try to create a similar rule as
mentioned on the previous page. First, start a Ghost deployment and expose it with an
internal ClusterIP service:
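A minimal sketch of those two steps, assuming the ghost image from Docker Hub and its default
port of 2368:

$ kubectl create deployment ghost --image=ghost
$ kubectl expose deployment ghost --port=2368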
Then, create an Ingress rule like the one shown on the previous page:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: ghost
spec:
  rules:
  - host: ghost.192.168.99.100.nip.io
    http:
      paths:
      - backend:
          serviceName: ghost
          servicePort: 2368

With the deployment exposed and the Ingress rule in place, you should be able to access
the application from outside the cluster.
On the previous page, we defined a single rule. If you have multiple services, you can
define multiple rules in the same Ingress, each rule forwarding traffic to a specific service.
rules:
- host: ghost.192.168.99.100.nip.io
  http:
    paths:
    - backend:
        serviceName: ghost
        servicePort: 2368
- host: nginx.192.168.99.100.nip.io
  http:
    paths:
    - backend:
        serviceName: nginx
        servicePort: 80
For more complex connections or resources such as service discovery, rate limiting, traffic
management and advanced metrics, you may want to implement a service mesh. A
service mesh consists of edge and embedded proxies communicating with each other and
handling traffic based on rules from a control plane. Various options are available including
Envoy, Istio, and linkerd:
Envoy - a modular and extensible proxy favored due to its modular construction,
open architecture and dedication to remaining unmonetized. Often used as a data
plane under other tools of a service mesh.
Istio - a powerful tool set which leverages Envoy proxies via a multi-component
control plane. Built to be platform-independent; it can be used to make the service
mesh flexible and feature-filled.
linkerd - another service mesh, purpose-built to be easy to deploy, fast, and
ultralight.
Chapter 11. Scheduling > 11.4. kube-scheduler
The larger and more diverse a Kubernetes deployment becomes, the more important the
administration of scheduling becomes. The kube-scheduler determines which nodes will run a
Pod, using a topology-aware algorithm.
Users can set the priority of a pod, which will allow preemption of lower priority pods. The
eviction of lower priority pods would then allow the higher priority pod to be scheduled.
The scheduler tracks the set of nodes in your cluster, filters them based on a set of
predicates, then uses priority functions to determine on which node each Pod should be
scheduled. The Pod specification as part of a request is sent to the kubelet on the node
for creation.
The default scheduling decision can be affected through the use of Labels on nodes or
Pods. Labels of podAffinity, taints, and pod bindings allow for configuration from the
Pod or the node perspective. Some, like tolerations, allow a Pod to work with a node,
even when the node has a taint that would otherwise preclude a Pod being scheduled.
Not all labels are drastic. Affinity settings may encourage a Pod to be deployed on a node,
but would deploy the Pod elsewhere if the node was not available. Sometimes,
documentation may use the term require, but practice shows the setting to be more of a
request. As beta features, expect the specifics to change. Some settings will evict Pods
from a node should the required condition no longer be true, such as
requiredDuringSchedulingRequiredDuringExecution.
Other options, like a custom scheduler, need to be programmed and deployed into your
Kubernetes cluster.
11.5. Predicates
The scheduler goes through a set of filters, or predicates, to find available nodes, then
ranks each node using priority functions. The node with the highest rank is selected to run
the Pod.
predicatesOrdering = []string{CheckNodeConditionPred,
GeneralPred, HostNamePred, PodFitsHostPortsPred,
MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred,
CheckNodeLabelPresencePred, checkServiceAffinityPred,
MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred,
MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred,
NoVolumeZoneConflictPred, CheckNodeMemoryPressurePred,
CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
The predicates, such as PodFitsHost or NoDiskConflict, are evaluated in a
particular and configurable order. In this way, a node undergoes the fewest checks needed for
a new Pod deployment; for example, a node that is not in the proper condition can be excluded
from further, unnecessary checks.
For example, there is a filter called HostNamePred, which is also known as HostName,
which filters out nodes that do not match the node name specified in the pod specification.
Another predicate is PodFitsResources to make sure that the available CPU and
memory can fit the resources required by the Pod.
The scheduler can be updated by passing a configuration of kind: Policy, which can
order predicates, give special weights to priorities, and even set
hardPodAffinitySymmetricWeight, which deploys Pods such that if Pod A is set to
run with Pod B, then Pod B is automatically run with Pod A.
11.6. Priorities
Priorities are functions used to weight resources. Unless Pod and node affinity has been
configured, the SelectorSpreadPriority setting, which ranks nodes based on the
number of existing running pods, will select the node with the fewest Pods.
This is a basic way to spread Pods across the cluster.
Other priorities can be used for particular cluster needs. The
ImageLocalityPriorityMap favors nodes which have already downloaded container
images. The total sum of image sizes is compared, with the largest having the highest
priority, but it does not check whether the image about to be used is present.
Currently, there are more than ten included priorities, which range from checking the
existence of a label to choosing a node with the most requested CPU and memory usage.
You can view a list of priorities at master/pkg/scheduler/algorithm/priorities.
A stable feature as of v1.14 allows the setting of a PriorityClass and assigning pods
via the use of PriorityClassName settings. This allows users to preempt, or evict, lower
priority pods so that their higher priority pods can be scheduled. The kube-scheduler
determines a node where the pending pod could run if one or more existing pods were
evicted. If a node is found, the low priority pod(s) are evicted and the higher priority pod is
scheduled. The use of a Pod Disruption Budget (PDB) is a way to limit the number of
pods preemption evicts to ensure enough pods remain running. The scheduler will remove
pods even if the PDB is violated if no other options are available.
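As a hedged sketch (the class name, value, and Pod are illustrative), a PriorityClass and a Pod
referencing it could look like this:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Class for important workloads"
---
apiVersion: v1
kind: Pod
metadata:
  name: important-pod
spec:
  priorityClassName: high-priority
  containers:
  - name: main
    image: nginx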
The default scheduler contains a number of predicates and priorities; however, these can
be changed via a scheduler policy file.
A short version is shown below:
"kind" : "Policy",
"apiVersion" : "v1",
"predicates" : [
{"name" : "MatchNodeSelector", "order": 6},
{"name" : "PodFitsHostPorts", "order": 2},
{"name" : "PodFitsResources", "order": 3},
{"name" : "NoDiskConflict", "order": 4},
{"name" : "PodToleratesNodeTaints", "order": 5},
{"name" : "PodFitsHost", "order": 1}
],
"priorities" : [
{"name" : "LeastRequestedPriority", "weight" : 1},
{"name" : "BalancedResourceAllocation", "weight" : 1},
{"name" : "ServiceSpreadingPriority", "weight" : 2},
{"name" : "EqualPriority", "weight" : 1}
],
"hardPodAffinitySymmetricWeight" : 10
}
Typically, you will configure a scheduler with this policy using the --policy-config-file
parameter and define a name for this scheduler using the --scheduler-name parameter. You
will then have two schedulers running and will be able to specify which scheduler to use in
the pod specification.
With multiple schedulers, there could be conflict in the Pod allocation. Each Pod should
declare which scheduler should be used. But, if separate schedulers determine that a
node is eligible because of available resources and both attempt to deploy, causing the
resource to no longer be available, a conflict would occur. The current solution is for the
local kubelet to return the Pods to the scheduler for reassignment. Eventually, one Pod
will succeed and the other will be scheduled elsewhere.
Most scheduling decisions can be made as part of the Pod specification. A pod
specification contains several fields that inform scheduling, namely:
nodeName
nodeSelector
affinity
schedulerName
tolerations
The nodeName and nodeSelector options allow a Pod to be assigned to a single node
or a group of nodes with particular labels.
Affinity and anti-affinity can be used to require or prefer which node is used
by the scheduler. If using a preference instead, a matching node is chosen first, but other
nodes would be used if no match is present.
The use of taints allows a node to be labeled such that Pods would not be scheduled for
some reason, such as the master node after initialization. A toleration allows a Pod to
ignore the taint and be scheduled assuming other requirements are met.
Should none of these options meet the needs of the cluster, there is also the ability to
deploy a custom scheduler. Each Pod could then include a schedulerName to choose
which scheduler to use.
Pods which may communicate a lot or share data may operate best if co-located, which
would be a form of affinity. For greater fault tolerance, you may want Pods to be as
separate as possible, which would be anti-affinity. These settings are used by the
scheduler based on the labels of Pods that are already running. As a result, the scheduler
must interrogate each node and track the labels of running Pods. Clusters larger than
several hundred nodes may see significant performance loss. Pod affinity rules use In,
NotIn, Exists, and DoesNotExist operators.
The use of requiredDuringSchedulingIgnoredDuringExecution means that the
Pod will not be scheduled on a node unless the rule's operator evaluates to true. If the
condition later becomes false, the Pod will continue to run. This could be seen as
a hard rule.
Similarly, preferredDuringSchedulingIgnoredDuringExecution will choose a
node with the desired setting before those without. If no properly-labeled nodes are
available, the Pod will execute anyway. This is more of a soft setting, which declares a
preference instead of a requirement.
With the use of podAffinity, the scheduler will try to schedule Pods together. The use
of podAntiAffinity would cause the scheduler to keep Pods on different nodes.
The topologyKey allows a general grouping of Pod deployments. Affinity (or the inverse
anti-affinity) will try to run on nodes with the declared topology key and running Pods with
a particular label. The topologyKey could be any legal key, with some important
considerations. If using requiredDuringScheduling and the admission controller
LimitPodHardAntiAffinityTopology setting, the topologyKey must be set to
kubernetes.io/hostname. If using PreferredDuringScheduling, an empty
topologyKey is assumed to be all, or the combination of kubernetes.io/hostname,
failure-domain.beta.kubernetes.io/zone and failure-
domain.beta.kubernetes.io/region.
An example of affinity and podAffinity settings can be seen below. This requires a
particular label to be matched when the Pod starts, but the Pod is not evicted if the label
is later removed.
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone
Inside the declared topology zone, the Pod can be scheduled on a node running a Pod
with a key label of security and a value of S1. If this requirement is not met, the Pod will
remain in a Pending state.
Where Pod affinity/anti-affinity has to do with other Pods, the use of nodeAffinity allows
Pod scheduling based on node labels. This is similar to, and will some day replace, the
nodeSelector setting. The scheduler does not look at other Pods on the system, but at the
labels of the nodes. This should have much less performance impact on the cluster, even
with a large number of nodes.
Uses In, NotIn, Exists, DoesNotExist operators
requiredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
Planned for future: requiredDuringSchedulingRequiredDuringExecution
Until nodeSelector has been fully deprecated, both the selector and required labels must
be met for a Pod to be scheduled.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/colo-tx-name
            operator: In
            values:
            - tx-aus
            - tx-dal
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk-speed
            operator: In
            values:
            - fast
            - quick
The first nodeAffinity rule requires a node with a key of kubernetes.io/colo-tx-name
which has one of two possible values: tx-aus or tx-dal.
The second rule gives extra weight to nodes with a key of disk-speed with a value of
fast or quick. The Pod will be scheduled on some node - in any case, this just prefers a
particular label.
11.15. Taints
A node with a particular taint will repel Pods without tolerations for that taint. A
taint is expressed as key=value:effect. The key and the value are created by
the administrator.
The key and value used can be any legal string, and this allows flexibility to prevent
Pods from running on nodes based off of any need. If a Pod does not have an existing
toleration, the scheduler will not consider the tainted node.
There are three effects, or ways to handle Pod scheduling:
NoSchedule
The scheduler will not schedule a Pod on this node, unless the Pod has this
toleration. Existing Pods continue to run, regardless of toleration.
PreferNoSchedule
The scheduler will avoid using this node for Pods without the toleration, unless no
other suitable node is available. Existing Pods are unaffected.
NoExecute
This taint will cause existing Pods to be evicted and no future Pods to be
scheduled. Should an existing Pod have a toleration, it will continue to run. If the
Pod's tolerationSeconds is set, it will remain for that many seconds, then be
evicted. Certain node issues will cause the kubelet to add 300-second
tolerations to avoid unnecessary evictions.
If a node has multiple taints, the scheduler ignores those with matching tolerations.
The remaining unignored taints have their typical effect.
The use of TaintBasedEvictions is still an alpha feature. The kubelet uses taints to
rate-limit evictions when the node has problems.
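A taint is applied with kubectl; as an illustrative sketch (the node name worker-1 is
hypothetical, and the key, value, and effect match the toleration example in the next section):

$ kubectl taint nodes worker-1 server=ap-east:NoExecute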
11.16. Tolerations
Tolerations, set on Pods, are used to schedule Pods on tainted nodes. Taints
provide an easy way to keep Pods off a node; only those with a matching
toleration would be scheduled there.
An operator can be included in a Pod specification, defaulting to Equal if not declared.
The use of the operator Equal requires a value to match. With the Exists operator, no
value should be specified. If an empty key uses the Exists operator, it will tolerate every
taint. If there is no effect, but a key and operator are declared, all effects are matched with
the declared key.
tolerations:
- key: "server"
  operator: "Equal"
  value: "ap-east"
  effect: "NoExecute"
  tolerationSeconds: 3600
In the above example, the Pod will remain on a node tainted with a key of server and a value
of ap-east for 3600 seconds after the NoExecute taint is applied. When the
time runs out, the Pod will be evicted.
If the default scheduling mechanisms (affinity, taints, policies) are not flexible enough for
your needs, you can write your own scheduler. The programming of a custom scheduler is
outside the scope of this course, but you may want to start with the existing scheduler
code, which can be found in the Scheduler repository on GitHub.
If a Pod specification does not declare which scheduler to use, the standard scheduler is
used by default. If the Pod declares a scheduler that is not running, the Pod
would remain in a Pending state forever.
The end result of the scheduling process is that a pod gets a binding that specifies which
node it should run on. A binding is a Kubernetes API primitive in the api/v1 group.
Technically, without any scheduler running, you could still schedule a pod on a node, by
specifying a binding for that pod.
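A hedged sketch of such a binding (the Pod and node names are illustrative):

apiVersion: v1
kind: Binding
metadata:
  name: my-pod
target:
  apiVersion: v1
  kind: Node
  name: worker-1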
You can also run multiple schedulers simultaneously.
You can view the scheduler and other information with:
kubectl get events
12.4. Overview
Kubernetes relies on API calls and is sensitive to network issues. Standard Linux tools and
processes are the best method for troubleshooting your cluster. If a shell, such as bash, is
not available in an affected Pod, consider deploying another similar pod with a shell, like
busybox. DNS configuration files and tools like dig are a good place to start. For more
difficult challenges, you may need to install other tools, like tcpdump.
Large and diverse workloads can be difficult to track, so monitoring of usage is essential.
Monitoring is about collecting key metrics, such as CPU, memory, and disk usage, and
network bandwidth on your nodes, as well as monitoring key metrics in your applications.
These features are being ingested into Kubernetes with the Metrics Server, which is a cut-
down version of the now deprecated Heapster. Once installed, the Metrics Server exposes
a standard API which can be consumed by other agents, such as autoscalers. The
endpoint can then be found on the master server at:
/apis/metrics.k8s.io/.
Logging activity across all the nodes is another feature not part of Kubernetes. Fluentd
can be a useful data collector for a unified logging layer. Having aggregated logs
can help visualize the issues, and provides the ability to search all logs. It is a good place
to start when local network troubleshooting does not expose the root cause. It can be
downloaded from the Fluentd website.
Another project from CNCF combines logging, monitoring, and alerting and is called
Prometheus - you can learn more from the Prometheus website. It provides a time-series
database, as well as integration with Grafana for visualization and dashboards.
We are going to review some of the basic kubectl commands that you can use to debug
what is happening, and we will walk you through the basic steps to be able to debug your
containers, your pending containers, and also the systems in Kubernetes.
Chapter 12. Logging and Troubleshooting > 12.5. Basic Troubleshooting Steps
The troubleshooting flow should start with the obvious. If there are errors from the
command line, investigate them first. The symptoms of the issue will probably determine
the next step to check. Working from the application running inside a container to the
cluster as a whole may be a good idea. The application may have a shell you can use, for
example:
$ kubectl exec -ti <busybox_pod> -- /bin/sh
If the Pod is running, use kubectl logs pod-name to view the standard out of the
container. Without logs, you may consider deploying a sidecar container in the Pod to
generate and handle logging. The next place to check is networking, including DNS,
firewalls and general connectivity, using standard Linux commands and tools.
Security settings can also be a challenge. RBAC, covered in the security chapter, provides
mandatory or discretionary access control in a granular manner. SELinux and AppArmor
are also common issues, especially with network-centric applications.
A newer feature of Kubernetes is the ability to enable auditing for the kube-apiserver,
which can allow a view into actions after the API call has been accepted.
The issues found with a decoupled system like Kubernetes are similar to those of a
traditional datacenter, plus the added layers of Kubernetes controllers:
Errors from the command line
Pod logs and state of Pods
Use shell to troubleshoot Pod DNS and network
Check node logs for errors, make sure there are enough resources allocated
RBAC, SELinux or AppArmor for security settings
API calls to and from controllers to kube-apiserver
Enable auditing
Inter-node network issues, DNS and firewall
Master server controllers (control Pods in pending or error state, errors in log files,
sufficient resources, etc).
A feature new to the 1.16 version is the ability to add a container to a running pod. This
would allow a feature-filled container to be added to an existing pod without having to
terminate and re-create. Intermittent and difficult to determine problems may take a while
to reproduce, or not exist with the addition of another container.
As an alpha stability feature, it may change or be removed at any time. As well, these
containers will not be restarted automatically, and several fields, such as ports or resource
requests, are not allowed.
These containers are added via the ephemeralcontainers handler through an API call, not
via the podSpec. As a result, the use of kubectl edit is not possible.
You may be able to use the kubectl attach command to join an existing process within
the container. This can be helpful instead of kubectl exec which executes a new
process. The functionality of the attached process depends entirely on what you are
attaching to.
Chapter 12. Logging and Troubleshooting > 12.7. Cluster Start Sequence
The cluster startup sequence begins with systemd if you built the cluster using kubeadm.
Other tools may leverage a different method. Use systemctl status
kubelet.service to see the current state and configuration files used to run the kubelet
binary.
Uses /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
Inside of the config.yaml file you will find several settings for the binary, including the
staticPodPath, which indicates the directory where the kubelet will read every yaml file and
start every pod. If you put a yaml file in this directory, it is a way to troubleshoot the
scheduler, as the pod is created without any request to the scheduler.
Uses /var/lib/kubelet/config.yaml configuration file
staticPodPath is set to /etc/kubernetes/manifests/
The four default yaml files in this directory start the base pods necessary to run the cluster;
the kubelet creates all pods from the *.yaml files found there: kube-apiserver, etcd,
kube-controller-manager, and kube-scheduler.
Once the watch loops and controllers from kube-controller-manager run using etcd data,
the rest of the configured objects will be created.
12.9. Plugins
We have been using the kubectl command throughout the course. The basic commands
can be used together in more complex ways, extending what can be done. There are
over seventy plugins available to interact with Kubernetes objects and components, and
the number keeps growing.
At the time this course was written, plugins cannot overwrite existing kubectl
commands, nor can they add sub-commands to existing commands. Writing new plugins
should take into account the command line runtime package and a Go library for plugin
authors.
With a plugin, the declaration of options such as the namespace or container to use must come
after the command.
Plugins can be distributed in many ways. The use of krew (the kubectl plugin manager)
allows for cross-platform packaging and a helpful plugin index, which makes finding new
plugins easy.
Usage:
krew [command]
Available Commands
The help option explains basic operation. After installation, ensure the $PATH includes
the plugin directory. krew should allow easy installation and use after that.
$ export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
$ kubectl krew search
NAME            DESCRIPTION                                        INSTALLED
access-matrix   Show an RBAC access matrix for server resources    no
advise-psp      Suggests PodSecurityPolicies for cluster.          no
....
....
  Usage:

    # match all pods
    $ kubectl tail

    # match pods in the 'frontend' namespace
    $ kubectl tail --ns staging
....
To install use:
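For the tail plugin shown above, for example:

$ kubectl krew install tail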
Once installed, use it as a kubectl sub-command. You can also upgrade and uninstall plugins.
Chapter 12. Logging and Troubleshooting > 12.11. Sniffing Traffic With Wireshark
The sniff command will use the first found container unless you pass the -c option to
declare which container in the pod to use for traffic monitoring.
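A hedged sketch of installing and running the plugin (the pod and container names are
illustrative, and Wireshark must be available locally):

$ kubectl krew install sniff
$ kubectl sniff mypod -c webcont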
Logging, like monitoring, is a vast subject in IT. It has many tools that you can use as part
of your arsenal.
Typically, logs are collected locally and aggregated before being ingested by a search
engine and displayed via a dashboard which can use the search syntax. While there are
many software stacks that you can use for logging, the Elasticsearch, Logstash, and
Kibana Stack (ELK) has become quite common.
In Kubernetes, the kubelet writes container logs to local files (via the Docker logging
driver). The kubectl logs command allows you to retrieve these logs.
Cluster-wide, you can use Fluentd to aggregate logs. Check the cluster administration
logging concepts for a detailed description.
Fluentd is part of the Cloud Native Computing Foundation and, together with Prometheus,
they make a nice combination for monitoring and logging. You can find a detailed walk-
through of running Fluentd on Kubernetes in the Kubernetes documentation.
Setting up Fluentd for Kubernetes logging is a good exercise in understanding
DaemonSets. Fluentd agents run on each node via a DaemonSet, they aggregate the
logs, and feed them to an Elasticsearch instance prior to visualization in a Kibana
dashboard.
There are several things that you can do to quickly diagnose potential issues with your
application and/or cluster. The official documentation offers additional materials to help
you get familiar with troubleshooting:
General guidelines and instructions (Troubleshooting)
Troubleshooting applications
Troubleshooting clusters
Debugging Pods
Debugging Services
GitHub website for issues and bug tracking
Kubernetes Slack channel.
We have been working with built-in resources, or API endpoints. The flexibility of
Kubernetes allows for the dynamic addition of new resources as well. Once these Custom
Resources have been added, the objects can be created and accessed using standard
calls and commands, like kubectl. The creation of a new object stores new structured data
in the etcd database and allows access via kube-apiserver.
To make a new custom resource part of a declarative API, there needs to be a controller to
retrieve the structured data continually and act to meet and maintain the declared state.
This controller, or operator, is an agent that creates and manages one or more
instances of a specific stateful application. We have worked with built-in controllers such
as Deployments, DaemonSets and other resources.
The functions encoded into a custom operator should be all the tasks a human would need
to perform if deploying the application outside of Kubernetes. The details of building a
custom controller are outside the scope of this course, and thus, not included.
There are two ways to add custom resources to your Kubernetes cluster. The easiest, but
less flexible, way is by adding a Custom Resource Definition (CRD) to the cluster. The
second way, which is more flexible, is the use of Aggregated APIs (AA), which requires a
new API server to be written and added to the cluster.
Either way of adding a new object to the cluster, as distinct from a built-in resource, is
called a Custom Resource.
If you are using RBAC for authorization, you probably will need to grant access to the new
CRD resource and controller. If using an Aggregated API, you can use the same or a
different authentication process.
Chapter 13. Custom Resource Definitions > 13.5. Custom Resource Definitions
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: backups.stable.linux.com
spec:
  group: stable.linux.com
  version: v1
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    shortNames:
    - bks
    kind: BackUp
apiVersion: Should match the current level of stability, currently
apiextensions.k8s.io/v1beta1.
kind: CustomResourceDefinition The object type being inserted by the kube-apiserver.
name: backups.stable.linux.com The name must match the spec field declared later.
The syntax must be <plural name>.<group>.
group: stable.linux.com The group name will become part of the REST API under
/apis/<group>/<version>, or /apis/stable.linux.com/v1 in this case, with the version set
to v1.
scope: Determines if the object exists in a single namespace or is cluster-wide.
plural: Defines the last part of the API URL, such as apis/stable/v1/backups.
singular and shortNames represent the name displayed and make CLI usage easier.
kind: A CamelCased singular type used in resource manifests.
Chapter 13. Custom Resource Definitions > 13.7. New Object Configuration
apiVersion: "stable.linux.com/v1"
kind: BackUp
metadata:
name: a-backup-object
spec:
timeSpec: "* * * * */5"
image: linux-backup-image
replicas: 5
Note that the apiVersion and kind match the CRD we created in a previous step. The
spec parameters depend on the controller.
The object will be evaluated by the controller. If the syntax, such as timeSpec, does not
match the expected value, you will receive an error, should validation be configured.
Without validation, only the existence of the variable is checked, not its details.
Just as with built-in objects, you can use an asynchronous pre-delete hook known as a
Finalizer. If an API delete request is received, the object metadata field
metadata.deletionTimestamp is updated. The controller then triggers whichever
finalizer has been configured. When the finalizer completes, it is removed from the list. The
controller continues to complete and remove finalizers until the list is empty. Then, the
object itself is deleted.
Finalizer:

metadata:
  finalizers:
  - finalizer.stable.linux.com

Validation:

validation:
  openAPIV3Schema:
    properties:
      spec:
        properties:
          timeSpec:
            type: string
            pattern: '^(\d+|\*)(/\d+)?(\s+(\d+|\*)(/\d+)?){4}$'
          replicas:
            type: integer
            minimum: 1
            maximum: 10
A feature in beta starting with v1.9 allows for validation of custom objects via the OpenAPI
v3 schema. This will check various properties of the object configuration being passed
by the API server. In the example above, the timeSpec must be a string matching a
particular pattern and the number of allowed replicas is between 1 and 10. If the validation
does not match, the error returned is the failed line of validation.
Chapter 13. Custom Resource Definitions > 13.9. Understanding Aggregated APIs (AA)
The use of Aggregated APIs allows adding additional Kubernetes-type API servers to the
cluster. The added server acts as a subordinate to kube-apiserver, which, as of v1.7, runs
the aggregation layer in-process. When an extension resource is registered, the
aggregation layer watches a passed URL path and proxies any requests to the newly
registered API service.
The aggregation layer is easy to enable. Edit the flags passed during startup of the kube-
apiserver to include --enable-aggregator-routing=true. Some vendors enable
this feature by default.
The creation of the external API server can be done via YAML configuration files or APIs. Configuring
TLS authorization between components and RBAC rules for various new objects is also
required. A sample API server is available on GitHub. A project currently in the incubation
stage is an API server builder which should handle much of the security and connection
configuration.
We have used Kubernetes tools to deploy simple Docker applications. Starting with the
v1.4 release, the goal was to have a canonical location for software. Helm is similar to a
package manager like yum or apt, with a chart being similar to a package. Helm v3 is
significantly different than v2.
A typical containerized application will have several manifests: manifests for Deployments,
Services, and ConfigMaps. You will probably also create some Secrets, Ingress rules, and other
objects. Each of these needs a manifest.
With Helm, you can package all those manifests and make them available as a single
tarball. You can put the tarball in a repository, search that repository, discover an
application, and then, with a single command, deploy and start the entire application.
With Helm v2, the server component (Tiller) runs in your Kubernetes cluster, and your client is
local, even a local laptop. With your client, you can connect to multiple repositories of applications.
You will also be able to upgrade or roll back an application easily from the command line.
14.8. Templates
The templates are resource manifests that use the Go templating syntax. Variables
defined in the values file, for example, get injected in the template when a release is
created. In the MariaDB example we provided, the database passwords are stored in a
Kubernetes secret, and the database configuration is stored in a Kubernetes
ConfigMap.
We can see that a set of labels are defined in the Secret metadata using the Chart
name, Release name, etc. The actual values of the passwords are read from
the values.yaml file.
apiVersion: v1
kind: Secret
metadata:
  name: {{ template "fullname" . }}
  labels:
    app: {{ template "fullname" . }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    release: "{{ .Release.Name }}"
    heritage: "{{ .Release.Service }}"
type: Opaque
data:
  mariadb-root-password: {{ default "" .Values.mariadbRootPassword | b64enc | quote }}
  mariadb-password: {{ default "" .Values.mariadbPassword | b64enc | quote }}
A default repository is included when initializing helm, but it's common to add other
repositories. Repositories are currently simple HTTP servers that contain an index file and
a tarball of all the Charts present.
You can interact with a repository using the helm repo commands.
$ helm repo add testing http://storage.googleapis.com/kubernetes-charts-testing

$ helm repo list
NAME      URL
stable    http://storage.googleapis.com/kubernetes-charts
local     http://localhost:8879/charts
testing   http://storage.googleapis.com/kubernetes-charts...
Once you have a repository available, you can search for Charts based on keywords.
Below, we search for a redis Chart:
$ helm search redis
WARNING: Deprecated index file format. Try 'helm repo update'
NAME                       VERSION   DESCRIPTION
testing/redis-cluster      0.0.5     Highly available Redis cluster with multiple se...
testing/redis-standalone   0.0.1     Standalone Redis Master
testing/...
Once you find the chart within a repository, you can deploy it on your cluster.
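With Helm v2, the deployment step could be a command like the following, using the chart found
above (a random release name is generated for you):

$ helm install testing/redis-standalone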
Status: DEPLOYED
Resources:
==> v1/ReplicationController
NAME               DESIRED   CURRENT   READY   AGE
redis-standalone   1         1         0       1s

==> v1/Service
NAME   CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
You will be able to list the release, delete it, even upgrade it and roll back.
$ helm list
NAME        REVISION   UPDATED                    STATUS     CHART
amber-eel   1          Fri Oct 21 12:24:01 2016   DEPLOYED   redis-standalone-0.0.1
A unique, colorful name will be created for each helm instance deployed. You can also
use kubectl to view new resources Helm created in your cluster.
The output of the deployment should be carefully reviewed. It often includes information on
access to the applications within. If your cluster did not have a required cluster resource,
the output is often the first place to begin troubleshooting.
15.4. Overview
Security is a big and complex topic, especially in a distributed system like Kubernetes.
Thus, we are just going to cover some of the concepts that deal with security in the context
of Kubernetes.
Then, we are going to focus on the authentication aspect of the API server and we will dive
into authorization, looking at things like ABAC and RBAC, which is now the default
configuration when you bootstrap a Kubernetes cluster with kubeadm.
We are going to look at the admission control system, which lets you look at and
possibly modify the requests that are coming in, and do a final deny or accept on those
requests.
Following that, we're going to look at a few other concepts, including how you can secure
your Pods more tightly using security contexts and pod security policies, which are full-
fledged API objects in Kubernetes.
Finally, we will look at network policies. By default, network policies are not turned on,
which lets any traffic flow through all of our pods, in all the different namespaces. Using
network policies, we can define ingress rules so that we can restrict the ingress
traffic between the different namespaces. The network plugin in use, such as Flannel or
Calico, will determine if a network policy can be implemented. As Kubernetes becomes
more mature, this will become a strongly suggested configuration.
To perform any action in a Kubernetes cluster, you need to access the API and go through
three main steps:
Authentication
Authorization (ABAC or RBAC)
Admission Control.
These steps are described in more detail in the official documentation about controlling
access to the API.
Once a request reaches the API server securely, it will first go through any authentication
module that has been configured. The request can be rejected if authentication fails or it
gets authenticated and passed to the authorization step.
At the authorization step, the request will be checked against existing policies. It will be
authorized if the user has the permissions to perform the requested actions. Then, the
requests will go through the last step of admission. In general, admission controllers will
check the actual content of the objects being created and validate them before admitting
the request.
In addition to these steps, the requests reaching the API server over the network are
encrypted using TLS. This needs to be properly configured using SSL certificates. If you
use kubeadm, this configuration is done for you; otherwise, follow Kelsey Hightower's
guide Kubernetes The Hard Way, or the API server configuration options.
15.8. ABAC
ABAC stands for Attribute Based Access Control. It was the first authorization model in
Kubernetes that allowed administrators to implement the right policies. Today, RBAC is
becoming the default authorization mode.
Policies are defined in a JSON file and referenced by a kube-apiserver startup option:
--authorization-policy-file=my_policy.json
For example, the policy file shown below, authorizes user Bob to read pods in the
namespace foobar:
{
  "apiVersion": "abac.authorization.kubernetes.io/v1beta1",
  "kind": "Policy",
  "spec": {
    "user": "bob",
    "namespace": "foobar",
    "resource": "pods",
    "readonly": true
  }
}
You can check other policy examples in the Kubernetes documentation.
15.9. RBAC
While RBAC can be complex, the basic flow is to create a certificate for a user. As a user
is not an API object of Kubernetes, we rely on outside authentication, such as
OpenSSL certificates. After generating the certificate against the cluster certificate
authority, we can set that credential for the user using a context.
Roles can then be used to configure an association of apiGroups, resources, and the
verbs allowed to them. The user can then be bound to a role limiting what and where they
can work in the cluster.
Here is a summary of the RBAC process:
Determine or create namespace
Create certificate credentials for user
Set the credentials for the user to the namespace using a context
Create a role for the expected task set
Bind the user to the role
Verify the user has limited access.
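A hedged sketch of the role and binding steps (the namespace, role name, and user name are
illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: development
  name: developer
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments"]
  verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: development
  name: developer-binding
subjects:
- kind: User
  name: bob
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io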
Pods and containers within pods can be given specific security constraints to limit what
processes running in containers can do. For example, the UID of the process, the Linux
capabilities, and the filesystem group can be limited.
This security limitation is called a security context. It can be defined for the entire pod or
per container, and is represented as additional sections in the resources manifests. The
notable difference is that Linux capabilities are set at the container level.
For example, if you want to enforce a policy that containers cannot run their process as the
root user, you can add a pod security context like the one below:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  securityContext:
    runAsNonRoot: true
  containers:
  - image: nginx
    name: nginx
Then, when you create this pod, you will see a warning that the container is trying to run
as root and that it is not allowed. Hence, the Pod will never run:
$ kubectl get pods
NAME    READY   STATUS                                                  RESTARTS   AGE
nginx   0/1     container has runAsNonRoot and image will run as root   0          10s
You can read more in the Kubernetes documentation about configuring security contexts
to give proper constraints to your pods or containers.
15.13.a. Pod Security Policies
For Pod Security Policies to be enabled, you need to configure the admission controller of
the controller-manager to contain PodSecurityPolicy. These policies make even more
sense when coupled with the RBAC configuration in your cluster. This will allow you to
finely tune what your users are allowed to run and what capabilities and low level
privileges their containers will have.
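A minimal sketch of a PodSecurityPolicy, assuming the admission controller has been enabled
(the name and allowed volume types are illustrative):

apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - configMap
  - secret
  - emptyDir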
By default, all pods can reach each other; all ingress and egress traffic is allowed. This has
been a high-level networking requirement in Kubernetes. However, network isolation can
be configured and traffic to pods can be blocked. In newer versions of Kubernetes, egress
traffic can also be blocked. This is done by configuring a NetworkPolicy. As all traffic is
allowed, you may want to implement a policy that drops all traffic, then, other policies
which allow desired ingress and egress traffic.
The spec of the policy can narrow down the effect to a particular namespace, which can
be handy. Further settings include a podSelector, or label, to narrow down which Pods
are affected. Further ingress and egress settings declare traffic to and from IP addresses
and ports.
Not all network providers support the NetworkPolicy kind. A non-exhaustive list of
providers with support includes Calico, Romana, Cilium, Kube-router, and WeaveNet.
On the next page, you can find an example of a NetworkPolicy recipe. More network
policy recipes can be found on GitHub.
The use of policies has become stable, noted with the v1 apiVersion. The example
below narrows down the policy to affect the default namespace.
Only Pods with the label of role: db will be affected by this policy, and the policy has
both Ingress and Egress settings.
The ingress setting includes a 172.17 network, with a smaller range of 172.17.1.0 IPs
being excluded from this traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ingress-egress-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978
These rules also allow ingress traffic from any namespace labeled project: myproject,
and from Pods labeled role: frontend. TCP traffic to port 6379 is allowed from these
sources to the affected db Pods.
The egress rules have the to settings, in this case allowing TCP traffic to the
10.0.0.0/24 range on port 5978.
The use of empty ingress or egress rules denies all types of traffic for the included Pods,
though this is not suggested. Use another dedicated NetworkPolicy instead.
Note that there can also be complex matchExpressions statements in the spec, but this
may change as NetworkPolicy matures.
podSelector:
  matchExpressions:
  - {key: inns, operator: In, values: ["yes"]}
The empty braces will match all Pods not selected by other NetworkPolicy and will not
allow ingress traffic. Egress traffic would be unaffected by this policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
  - Ingress
With the potential for complex ingress and egress rules, it may be helpful to create multiple
objects which include simple isolation rules and use easy to understand names and labels.
Some network plugins, such as WeaveNet, may require annotation of the Namespace.
The following shows the setting of a DefaultDeny for the myns namespace:
kind: Namespace
apiVersion: v1
metadata:
  name: myns
  annotations:
    net.beta.kubernetes.io/network-policy: |
      {
        "ingress": {
          "isolation": "DefaultDeny"
        }
      }
A newer feature of kubeadm is the integrated ability to join multiple master nodes with
collocated etcd databases. This allows for higher redundancy and fault tolerance. As long
as the database remains available, the cluster will continue to run, and a master node that
goes down will catch up with kubelet information when it is brought back online.
Three instances are required for etcd to be able to determine quorum, and thereby decide
whether data is accurate or corrupt; without quorum, the database could become unavailable.
Once etcd is able to determine quorum, it will elect a leader and return to functioning as
it had before the failure.
One can either collocate the database with control planes or use an external etcd
database cluster. The kubeadm command makes the collocated deployment easier to
use.
To ensure that workers and other control planes continue to have access, it is a good idea
to use a load balancer. The default configuration leverages SSL, so you may need to
configure the load balancer as a TCP passthrough unless you want the extra work of
certificate configuration. As the certificates will be decoded only for particular node names,
it is a good idea to use a FQDN instead of an IP address, although there are many
possible ways to handle access.
The easiest way to gain higher availability is to use the kubeadm command and join at
least two more master servers to the cluster. The command is almost the same as a
worker join, except for an additional --control-plane flag and a certificate key. The key
will probably need to be generated unless the other master nodes are added within two
hours of the cluster initialization.
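A hedged sketch of such a join (the load balancer name and port are illustrative; the token,
hash, and key are placeholders):

$ sudo kubeadm join k8slb:6443 --token <token> \
      --discovery-token-ca-cert-hash sha256:<hash> \
      --control-plane --certificate-key <key>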
Should a node fail, you would lose both a control plane and a database. As the database
is the one object that cannot be rebuilt, this may not be an important issue.
Using an external cluster of etcd allows for less interruption should a node fail. Creating a
cluster in this manner requires a lot more equipment to properly spread out services and
takes more work to configure.
The external etcd cluster needs to be configured first. The kubeadm command has
options to configure this cluster, or other options are available. Once the etcd cluster is
running, the certificates need to be manually copied to the intended first control plane
node.
The kubeadm-config.yaml file needs to be populated with the etcd set to external,
endpoints, and the certificate locations. Once the first control plane is fully initialized, the
redundant control planes need to be added one at a time, each fully initialized before the
next is added.