Kubernetes Taints & Tolerations

Guide to Kubernetes Autoscaling
Chapter 6 Kubernetes Taints & Tolerations

Organizations and teams often need multi-tenant, heterogeneous Kubernetes clusters to meet users’ application needs. They may also need to address certain special constraints on the Kubernetes cluster; for example, some pods may require special hardware, colocation with other specific pods, or isolation from others. There are many options for placing those application containers into different, separate node groups, one of which is through the use of taints and tolerations. In this article, we describe taints and tolerations and then use an example to illustrate how to use them to place pods on specific worker nodes while avoiding the nodes where you don’t want pods to get scheduled.

Taints and Tolerations – Concepts

Taints and tolerations are a mechanism that allows you to ensure that pods are not placed on inappropriate nodes. Taints are added to nodes, while tolerations are defined in the pod specification. When you taint a node, it will repel all the pods except those that have a toleration for that taint. A node can have one or many taints associated with it.
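
As a minimal sketch of where each half of the mechanism lives, the fragment below shows a taint in a node's spec next to a matching toleration in a pod's spec. The names example-node, example-pod, example-key, and example-value are placeholders chosen for illustration, and the Node object is shown only to make the field location visible; taints are normally set with kubectl taint rather than by applying a Node manifest.

```yaml
# Node side: taints are stored under spec.taints.
apiVersion: v1
kind: Node
metadata:
  name: example-node        # hypothetical node name
spec:
  taints:
  - key: example-key        # hypothetical taint key
    value: example-value    # hypothetical taint value
    effect: NoSchedule
---
# Pod side: a toleration that matches the taint above.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod         # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx
  tolerations:
  - key: example-key
    operator: Equal         # key, value, and effect must all match the taint
    value: example-value
    effect: NoSchedule
```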

For example, most Kubernetes distributions automatically taint the master nodes so that only the pods that manage the control plane are scheduled onto them, and not data plane pods deployed by users. This ensures that the master nodes are dedicated to running control plane pods.

A taint can produce three possible effects:

NoSchedule
The Kubernetes scheduler will place a pod on the tainted node only if the pod has a toleration for the taint. Pods already running on the node are not affected.
PreferNoSchedule
The Kubernetes scheduler will try to avoid placing pods that don’t tolerate the taint on the node, but this is not guaranteed.
NoExecute
Kubernetes will evict running pods from the node if they don’t have a toleration for the taint, and will not schedule new pods that lack one.
Figure: Impact of taints and tolerations on a K8s cluster
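
How a toleration is written determines which taints and effects it matches. The fragment below is a sketch of the tolerations list of a pod spec, using the placeholder key example-key and an arbitrary tolerationSeconds value, with one entry per effect.

```yaml
# Fragment of a pod spec (spec.tolerations); key and value are placeholders.
tolerations:
# Equal: matches only a taint with this exact key, value, and effect.
- key: example-key
  operator: Equal
  value: example-value
  effect: NoSchedule
# Exists: matches any taint with this key and effect, whatever its value.
- key: example-key
  operator: Exists
  effect: PreferNoSchedule
# tolerationSeconds applies to NoExecute only: the pod stays on the node
# for 120 seconds after the taint appears and is then evicted.
- key: example-key
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 120    # arbitrary illustrative value
```

If the effect field is left out of a toleration, it matches taints with the given key regardless of their effect.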

Use Cases for Taints and Tolerations

Dedicated Nodes

If you need to dedicate a group of worker nodes for a set of users, you can add a taint to those nodes, such as by using this command:

kubectl taint nodes nodename dedicated=groupName:NoSchedule

Then add a toleration for the taint to that user group’s pods so they can run on those nodes. A toleration only permits scheduling on the tainted nodes; it does not prevent the pods from landing elsewhere. To ensure that the pods are scheduled only on that set of nodes, also add a label to those nodes, e.g., dedicated=groupName, and use nodeSelector in the deployment/pod spec, which binds the user group’s pods to the node group so that they don’t run anywhere else.
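
A sketch of a pod that combines both pieces is shown below. It assumes the node was tainted with the command above and also labeled, for example with kubectl label nodes nodename dedicated=groupName; the pod name group-app and the nginx image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: group-app           # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx            # placeholder image
  # Allows the pod onto nodes tainted dedicated=groupName:NoSchedule.
  tolerations:
  - key: dedicated
    operator: Equal
    value: groupName
    effect: NoSchedule
  # Restricts the pod to nodes labeled dedicated=groupName.
  nodeSelector:
    dedicated: groupName
```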

Nodes with Special Hardware

If there are worker nodes with special hardware, you need to make sure that normal pods that don’t need the special hardware don’t run on those worker nodes. Do this by adding a taint to those nodes as follows:

kubectl taint nodes nodename special=true:NoSchedule

Later on, the pods requiring special hardware can be run on those worker nodes by adding tolerations for the above taint.
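
For illustration, a pod that tolerates the special=true:NoSchedule taint might look like the sketch below. The pod name, container name, and image are hypothetical. Note that the value is quoted: toleration values are strings, and an unquoted true would be parsed by YAML as a boolean.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardware-job        # hypothetical pod name
spec:
  containers:
  - name: worker
    image: example.com/hardware-workload:latest   # hypothetical image
  tolerations:
  - key: special
    operator: Equal
    value: "true"           # quoted so that YAML treats it as a string
    effect: NoSchedule
```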

Taint-Based Evictions

A taint with the NoExecute effect will evict a running pod from the node if the pod has no toleration for the taint. The Kubernetes node controller automatically adds this kind of taint to a node in some scenarios so that pods can be evicted and the node is “drained” (has all of its pods evicted). For example, suppose a network outage causes a node to be unreachable from the controller. In this scenario, it would be best to move all of the pods off the node so that they can get rescheduled to other nodes. The node controller takes this action automatically to avoid the need for manual intervention.

The following are built-in taints:

node.kubernetes.io/not-ready
Node is not ready. This corresponds to the NodeCondition Ready attribute being "False".
node.kubernetes.io/unreachable
Node is unreachable from the node controller. This corresponds to NodeCondition Ready being "Unknown".
node.kubernetes.io/memory-pressure
Node has memory pressure.
node.kubernetes.io/disk-pressure
Node has disk pressure. High disk utilization on a node can slow applications down, so it is better to place pods elsewhere.
node.kubernetes.io/pid-pressure
Node has PID pressure. Process IDs are a limited resource, and exhausting them can cause downtime for applications, so it is better to place pods elsewhere.
node.kubernetes.io/network-unavailable
Node's network is unavailable.
node.kubernetes.io/unschedulable
Node is unschedulable. This covers any other reason that makes the node inappropriate for hosting pods, for example when the cluster is being scaled down and the node is being removed.
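
A pod can control how long it stays on a node after one of these NoExecute taints appears by setting tolerationSeconds on a matching toleration. By default, Kubernetes adds tolerations for not-ready and unreachable with a tolerationSeconds of 300 to pods that do not define their own. The fragment below is a sketch that shortens this to 60 seconds; the value 60 is an arbitrary choice for illustration.

```yaml
# Fragment of a pod spec; 60 is an arbitrary illustrative value.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60     # evict 60 seconds after the node becomes unreachable
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60     # evict 60 seconds after the node becomes not ready
```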

How to Use Taints and Tolerations

We will now present a scenario to help you better understand taints and tolerations. Let’s start with a Kubernetes cluster that has worker nodes categorized into different groups, such as front-end nodes and back-end nodes. Let’s assume that we need to deploy the front-end application pods so that they are placed only on front-end nodes and not on back-end nodes. We must also ensure that new pods are not scheduled onto master nodes because those nodes run control plane components such as etcd.

Let’s start by getting the list of nodes to see what is already tainted by the Kubernetes default installation. Here we are on a cluster created by the Rancher RKE tool.

kubectl get nodes -o=custom-columns=NodeName:.metadata.name,TaintKey:.spec.taints[*].key,TaintValue:.spec.taints[*].value,TaintEffect:.spec.taints[*].effect
NodeName             TaintKey                                                            TaintValue   TaintEffect
cluster01-master-1   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-master-2   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-master-3   node-role.kubernetes.io/controlplane,node-role.kubernetes.io/etcd   true,true    NoSchedule,NoExecute
cluster01-worker-1   <none>                                                              <none>       <none>

From the output above, we can see that the master nodes are already tainted by the Kubernetes installation, so no user pods will land on them unless the user intentionally adds tolerations for those taints. The output also shows a worker node that has no taints. We will now taint the worker so that only front-end pods can land on it. We can do this by using the kubectl taint command.

kubectl taint nodes cluster01-worker-1 app=frontend:NoSchedule
node/cluster01-worker-1 tainted

The above taint has the key app, the value frontend, and the effect NoSchedule, which means that no pod will be placed on this node unless the pod defines a toleration for the taint. We will see what the toleration looks like in later steps.
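
For reference, the taint is recorded on the Node object under spec.taints. The fragment below sketches the relevant part of what kubectl get node cluster01-worker-1 -o yaml would be expected to show, with all other fields omitted.

```yaml
# Fragment of the Node object for cluster01-worker-1; other fields omitted.
spec:
  taints:
  - effect: NoSchedule
    key: app
    value: frontend
```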

Let’s try to deploy an app on the cluster without any toleration configured in the app deployment specification.

kubectl create ns frontend
namespace/frontend created

kubectl run nginx --image=nginx --namespace frontend
deployment.apps/nginx created
kubectl get pods -n frontend
NAME                    READY   STATUS    RESTARTS   AGE
nginx-76df748b9-gjbs4   0/1     Pending   0          9s


kubectl get events -n frontend

LAST SEEN   TYPE      REASON              OBJECT                       MESSAGE
<unknown>   Warning   FailedScheduling    pod/nginx-76df748b9-gjbs4    0/4 nodes are available: 1 node(s) had taint {app: frontend}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate.
<unknown>   Warning   FailedScheduling    pod/nginx-76df748b9-gjbs4    0/4 nodes are available: 1 node(s) had taint {app: frontend}, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/controlplane: true}, that the pod didn't tolerate.

We created a namespace and deployed Nginx using the kubectl run command (on kubectl 1.18 and later, kubectl run creates a bare pod rather than a deployment; kubectl create deployment is the equivalent there). Looking at the pod status and cluster events, we see that the pod can’t be scheduled because there are no appropriate nodes: the three master nodes and the one worker node all have taints that the pod doesn’t tolerate. To successfully place the pod on the worker node, we need to edit the deployment and add a toleration for the taint we configured earlier on the node.

Let’s see what the current deployment YAML looks like.

kubectl get deployment nginx -n frontend -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2021-08-29T09:39:37Z"
  generation: 1
  labels:
    run: nginx
  name: nginx
  namespace: frontend
  resourceVersion: "13367313"
  selfLink: /apis/apps/v1/namespaces/frontend/deployments/nginx
  uid: e46e026e-3a92-4aac-b985-7110426aa437
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      run: nginx
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

From the output above, we can see that there is no toleration added in the pod spec. Let’s edit and add one.

kubectl edit deployment nginx -n frontend
deployment.apps/nginx edited
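
The change made in the editor amounts to adding a tolerations list under spec.template.spec. As a non-interactive alternative, the same change could be written as a patch file like the sketch below; the filename toleration-patch.yaml is hypothetical. Recent kubectl versions accept such a file with kubectl patch deployment nginx -n frontend --patch-file toleration-patch.yaml, and the same content can be passed inline with -p.

```yaml
# toleration-patch.yaml (hypothetical filename)
# Adds a toleration for the app=frontend:NoSchedule taint to the pod template.
spec:
  template:
    spec:
      tolerations:
      - key: app
        operator: Equal
        value: frontend
        effect: NoSchedule
```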




kubectl get deployment nginx -n frontend -o yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
  creationTimestamp: "2021-08-29T09:39:37Z"
  generation: 3
  labels:
    run: nginx
  name: nginx
  namespace: frontend
  resourceVersion: "13368509"
  selfLink: /apis/apps/v1/namespaces/frontend/deployments/nginx
  uid: e46e026e-3a92-4aac-b985-7110426aa437
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      run: nginx
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: nginx
    spec:
      containers:
      - image: nginx
        imagePullPolicy: Always
        name: nginx
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: app
        operator: Equal
        value: frontend

Notice the tolerations section of the pod spec: We have added a toleration for the taint so that the pod can be scheduled on the worker node.

Now let’s get the pod’s status and events.

kubectl get events -n frontend
LAST SEEN   TYPE      REASON              OBJECT                       MESSAGE

3m56s       Normal    SuccessfulCreate    replicaset/nginx-9cf9fd78f   Created pod: nginx-9cf9fd78f-khc5z
2s          Normal    SuccessfulDelete    replicaset/nginx-9cf9fd78f   Deleted pod: nginx-9cf9fd78f-khc5z
7m7s        Normal    ScalingReplicaSet   deployment/nginx             Scaled up replica set nginx-76df748b9 to 1
3m56s       Normal    ScalingReplicaSet   deployment/nginx             Scaled up replica set nginx-9cf9fd78f to 1
10s         Normal    ScalingReplicaSet   deployment/nginx             Scaled down replica set nginx-76df748b9 to 0
10s         Normal    ScalingReplicaSet   deployment/nginx             Scaled up replica set nginx-8cb54bccc to 1
2s          Normal    ScalingReplicaSet   deployment/nginx             Scaled down replica set nginx-9cf9fd78f to 0


kubectl get pods -n frontend
NAME                    READY   STATUS    RESTARTS   AGE
nginx-8cb54bccc-g4htt   1/1     Running   0          38s

The pod has now been allowed to run on the tainted node. If there are other worker nodes in the cluster that are not tainted, this pod can also land on those nodes. To make sure that the pod lands only on the nodes dedicated to front-end pods, in addition to the taint and toleration we need to label the front-end nodes (e.g., app=frontend) and use nodeSelector in the deployment’s pod spec.
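
Putting the two together, a minimal declarative version of the deployment might look like the sketch below. It keeps only the fields needed for scheduling and assumes the worker node has been labeled, for example with kubectl label nodes cluster01-worker-1 app=frontend.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      run: nginx
  template:
    metadata:
      labels:
        run: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
      # Toleration: lets the pod onto nodes tainted app=frontend:NoSchedule.
      tolerations:
      - key: app
        operator: Equal
        value: frontend
        effect: NoSchedule
      # Node selector: keeps the pod off nodes that lack the app=frontend label.
      # Assumes the label was added to the front-end nodes beforehand.
      nodeSelector:
        app: frontend
```

The taint keeps other pods off the front-end nodes, while the node selector keeps the front-end pods off every other node.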

Conclusion

Taints and tolerations provide advanced pod scheduling in which tainted nodes control which pods can be scheduled on them. They are easier to manage than other custom scheduling methods such as affinities. Dedicating nodes to a group of users, reserving nodes with special hardware, and taint-based pod evictions are some of the known use cases for taints and tolerations.
