Workload Management _ Kubernetes

Kubernetes offers various APIs for workload management, including Deployments, StatefulSets, and DaemonSets, which abstract the management of Pods. Deployments are used for stateless applications, allowing for controlled updates and rollbacks, while StatefulSets manage Pods with unique identities linked to persistent storage. Jobs and CronJobs are also available for tasks that run to completion or on a schedule, respectively.

Uploaded by

sekarann

Workload Management

1: Deployments
2: ReplicaSet
3: StatefulSets
4: DaemonSet
5: Jobs
6: Automatic Cleanup for Finished Jobs
7: CronJob
8: ReplicationController

Kubernetes provides several built-in APIs for declarative management of your workloads and the components of those workloads.

Ultimately, your applications run as containers inside Pods; however, managing individual Pods would be a lot of effort. For example,
if a Pod fails, you probably want to run a new Pod to replace it. Kubernetes can do that for you.

You use the Kubernetes API to create a workload object that represents a higher abstraction level than a Pod, and then the
Kubernetes control plane automatically manages Pod objects on your behalf, based on the specification for the workload object you
defined.

The built-in APIs for managing workloads are:

Deployment (and, indirectly, ReplicaSet), the most common way to run an application on your cluster. Deployment is a good fit for
managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be
replaced if needed. (Deployments are a replacement for the legacy ReplicationController API).

A StatefulSet lets you manage one or more Pods – all running the same application code – where the Pods rely on having a distinct
identity. This is different from a Deployment where the Pods are expected to be interchangeable. The most common use for a
StatefulSet is to be able to make a link between its Pods and their persistent storage. For example, you can run a StatefulSet that
associates each Pod with a PersistentVolume. If one of the Pods in the StatefulSet fails, Kubernetes makes a replacement Pod that is
connected to the same PersistentVolume.

A DaemonSet defines Pods that provide facilities that are local to a specific node; for example, a driver that lets containers on that
node access a storage system. You use a DaemonSet when the driver, or other node-level service, has to run on the node where it's
useful. Each Pod in a DaemonSet performs a role similar to a system daemon on a classic Unix / POSIX server. A DaemonSet might
be fundamental to the operation of your cluster, such as a plugin to let that node access cluster networking, it might help you to
manage the node, or it could provide less essential facilities that enhance the container platform you are running. You can run
DaemonSets (and their pods) across every node in your cluster, or across just a subset (for example, only install the GPU accelerator
driver on nodes that have a GPU installed).

You can use a Job and / or a CronJob to define tasks that run to completion and then stop. A Job represents a one-off task, whereas
each CronJob repeats according to a schedule.

1 - Deployments
A Deployment manages a set of Pods to run an application workload, usually one that doesn't maintain state.

A Deployment provides declarative updates for Pods and ReplicaSets.

You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a
controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their
resources with new Deployments.

Note:
Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use
case is not covered below.

Use Case
The following are typical use cases for Deployments:

Create a Deployment to roll out a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to
see if it succeeds or not.
Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the
Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet
updates the revision of the Deployment.
Roll back to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the
revision of the Deployment.
Scale up the Deployment to facilitate more load.
Pause the rollout of a Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
Use the status of the Deployment as an indicator that a rollout is stuck.
Clean up older ReplicaSets that you don't need anymore.

Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:

controllers/nginx-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

In this example:

A Deployment named nginx-deployment is created, indicated by the .metadata.name field. This name will become the basis for
the ReplicaSets and Pods which are created later. See Writing a Deployment Spec for more details.

The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the .spec.replicas field.

The .spec.selector field defines how the created ReplicaSet finds which Pods to manage. In this case, you select a label that is
defined in the Pod template ( app: nginx ). However, more sophisticated selection rules are possible, as long as the Pod
template itself satisfies the rule.

Note:
The .spec.selector.matchLabels field is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent
to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value".
All of the requirements, from both matchLabels and matchExpressions, must be satisfied in order to match.
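
The equivalence described in this note can be illustrated with a short Python sketch. It is a simplified model rather than the Kubernetes implementation: the function names expand_match_labels and selector_matches are hypothetical, and selectors and labels are represented as plain dictionaries.

```python
def expand_match_labels(match_labels):
    """Rewrite each {key: value} pair as an equivalent matchExpressions element."""
    return [
        {"key": key, "operator": "In", "values": [value]}
        for key, value in match_labels.items()
    ]


def selector_matches(selector, labels):
    """Return True only if every requirement in the selector is satisfied."""
    requirements = expand_match_labels(selector.get("matchLabels", {}))
    requirements += selector.get("matchExpressions", [])
    for req in requirements:
        key, operator = req["key"], req["operator"]
        if operator == "In":
            ok = labels.get(key) in req["values"]
        elif operator == "NotIn":
            ok = labels.get(key) not in req["values"]
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unsupported operator: {operator}")
        if not ok:
            return False
    return True


# The selector from the example Deployment, checked against a Pod's labels.
pod_labels = {"app": "nginx", "pod-template-hash": "75675f5897"}
print(selector_matches({"matchLabels": {"app": "nginx"}}, pod_labels))
```

Because requirements from matchLabels and matchExpressions are combined into one list, a Pod has to satisfy all of them to be selected.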

The template field contains the following sub-fields:

The Pods are labeled app: nginx using the .metadata.labels field.
The Pod template's specification, or .template.spec field, indicates that the Pods run one container, nginx , which runs
the nginx Docker Hub image at version 1.14.2.
Create one container and name it nginx using the .spec.template.spec.containers[0].name field.

Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above
Deployment:

1. Create the Deployment by running the following command:

kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml

2. Run kubectl get deployments to check if the Deployment was created.

If the Deployment is still being created, the output is similar to the following:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/3     0            0           1s

When you inspect the Deployments in your cluster, the following fields are displayed:

NAME lists the names of the Deployments in the namespace.

READY displays how many replicas of the application are available to your users. It follows the pattern ready/desired.

UP-TO-DATE displays the number of replicas that have been updated to achieve the desired state.

AVAILABLE displays how many replicas of the application are available to your users.

AGE displays the amount of time that the application has been running.

Notice how the number of desired replicas is 3 according to .spec.replicas field.

3. To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment .

The output is similar to:

Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
deployment "nginx-deployment" successfully rolled out

4. Run the kubectl get deployments again a few seconds later. The output is similar to this:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           18s

Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template)
and available.
5. To see the ReplicaSet ( rs ) created by the Deployment, run kubectl get rs . The output is similar to this:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-75675f5897   3         3         3       18s

ReplicaSet output shows the following fields:

NAME lists the names of the ReplicaSets in the namespace.


DESIRED displays the desired number of replicas of the application, which you define when you create the Deployment.
This is the desired state.
CURRENT displays how many replicas are currently running.

READY displays how many replicas of the application are available to your users.

AGE displays the amount of time that the application has been running.

Notice that the name of the ReplicaSet is always formatted as [DEPLOYMENT-NAME]-[HASH] . This name will become the basis for
the Pods which are created.

The HASH string is the same as the pod-template-hash label on the ReplicaSet.

6. To see the labels automatically generated for each Pod, run kubectl get pods --show-labels . The output is similar to:

NAME                                READY   STATUS    RESTARTS   AGE   LABELS
nginx-deployment-75675f5897-7ci7o   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-kzszj   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-qqcnn   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897

The created ReplicaSet ensures that there are three nginx Pods.

Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in this case, app: nginx ).

Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't
stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave
unexpectedly.

Pod-template-hash label

Caution:
Do not change this label.

The pod-template-hash label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.

This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the
ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any
existing Pods that the ReplicaSet might have.

Updating a Deployment
Note:
A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, .spec.template) is changed, for example
if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a
rollout.

Follow the steps given below to update your Deployment:

1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of the nginx:1.14.2 image.
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1

or use the following command:

kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1

where deployment/nginx-deployment indicates the Deployment, nginx indicates the container where the update will take place, and
nginx:1.16.1 indicates the new image and its tag.

The output is similar to:

deployment.apps/nginx-deployment image updated

Alternatively, you can edit the Deployment and change .spec.template.spec.containers[0].image from nginx:1.14.2 to
nginx:1.16.1 :

kubectl edit deployment/nginx-deployment

The output is similar to:

deployment.apps/nginx-deployment edited

2. To see the rollout status, run:

kubectl rollout status deployment/nginx-deployment

The output is similar to this:

Waiting for rollout to finish: 2 out of 3 new replicas have been updated...

or

deployment "nginx-deployment" successfully rolled out

Get more details on your updated Deployment:

After the rollout succeeds, you can view the Deployment by running kubectl get deployments . The output is similar to this:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           36s

Run kubectl get rs to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas,
as well as scaling down the old ReplicaSet to 0 replicas.

kubectl get rs

The output is similar to this:


NAME DESIRED CURRENT READY AGE
nginx-deployment-1564180365 3 3 3 6s
nginx-deployment-2035384211 0 0 0 36s

Running get pods should now show only the new Pods:

kubectl get pods

The output is similar to this:

NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-1564180365-khku8   1/1     Running   0          14s
nginx-deployment-1564180365-nacti   1/1     Running   0          14s
nginx-deployment-1564180365-z9gth   1/1     Running   0          14s

Next time you want to update these Pods, you only need to update the Deployment's Pod template again.

Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at
least 75% of the desired number of Pods are up (25% max unavailable).

Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it
ensures that at most 125% of the desired number of Pods are up (25% max surge).

For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod,
and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not
create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that
at most 4 Pods in total are available. For a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
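
The bounds quoted above follow from how percentage values are resolved: maxUnavailable is rounded down and maxSurge is rounded up. The following Python sketch reproduces that arithmetic; the function names resolve and rollout_bounds are hypothetical helpers written for this illustration, not part of any Kubernetes API.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(3))  # (3, 4)
print(rollout_bounds(4))  # (3, 5)
```

With 3 replicas, 25% of 3 is 0.75, which rounds down to 0 unavailable Pods and up to 1 surge Pod, giving the range of 3 to 4 Pods described above.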

Get details of your Deployment:

kubectl describe deployments

The output is similar to this:


Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Thu, 30 Nov 2017 10:56:25 +0000
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision=2
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.16.1
    Port:         80/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   nginx-deployment-1564180365 (3/3 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  2m    deployment-controller  Scaled up replica set nginx-deployment-2035384211 to 3
  Normal  ScalingReplicaSet  24s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 1
  Normal  ScalingReplicaSet  22s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 2
  Normal  ScalingReplicaSet  22s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 2
  Normal  ScalingReplicaSet  19s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 1
  Normal  ScalingReplicaSet  19s   deployment-controller  Scaled up replica set nginx-deployment-1564180365 to 3
  Normal  ScalingReplicaSet  14s   deployment-controller  Scaled down replica set nginx-deployment-2035384211 to 0

Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it
up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and
scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2
so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the
new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet,
and the old ReplicaSet is scaled down to 0.
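
The sequence of scaling events described above can be reproduced with a small simulation. This is a minimal sketch under simplifying assumptions, not the Deployment controller's actual logic: every new Pod is assumed to become available immediately, the limits are passed as absolute counts (1 surge Pod and 0 unavailable Pods, which is what 25% resolves to for 3 replicas), and the function name simulate_rolling_update is hypothetical.

```python
def simulate_rolling_update(replicas, max_surge, max_unavailable):
    """Return the scaling steps as (replica_set, direction, new_size) tuples.

    Assumption: new Pods become available as soon as they are created.
    """
    old, new = replicas, 0
    steps = []
    while new < replicas or old > 0:
        # Scale up the new ReplicaSet as far as the surge limit allows.
        up = min(replicas + max_surge - (old + new), replicas - new)
        if up > 0:
            new += up
            steps.append(("new", "up", new))
        # Scale down the old ReplicaSet without dropping below the
        # minimum number of available Pods.
        down = min(old, old + new - (replicas - max_unavailable))
        if down > 0:
            old -= down
            steps.append(("old", "down", old))
        if up <= 0 and down <= 0:
            raise ValueError("no progress is possible with these limits")
    return steps


for replica_set, direction, size in simulate_rolling_update(3, 1, 0):
    print(f"Scaled {direction} {replica_set} ReplicaSet to {size}")
```

For 3 replicas the steps alternate between scaling the new ReplicaSet up by one and the old ReplicaSet down by one, which matches the order of the ScalingReplicaSet events in the output above.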

Note:
Kubernetes doesn't count terminating Pods when calculating the number of availableReplicas, which must be between replicas
- maxUnavailable and replicas + maxSurge. As a result, you might notice that there are more Pods than expected during a rollout,
and that the total resources consumed by the Deployment are more than replicas + maxSurge until the
terminationGracePeriodSeconds of the terminating Pods expires.

Rollover (aka multiple updates in-flight)


Each time a new Deployment is observed by the Deployment controller, a ReplicaSet is created to bring up the desired Pods. If the
Deployment is updated, the existing ReplicaSets that control Pods whose labels match .spec.selector but whose template does not
match .spec.template are scaled down. Eventually, the new ReplicaSet is scaled to .spec.replicas and all old ReplicaSets are scaled
to 0.

If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and
starts scaling that up, and rolls over the ReplicaSet that it was scaling up previously: it adds it to its list of old ReplicaSets and
starts scaling it down.

For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2 , but then update the Deployment to create 5
replicas of nginx:1.16.1 , when only 3 replicas of nginx:1.14.2 had been created. In that case, the Deployment immediately starts
killing the 3 nginx:1.14.2 Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not wait for the 5 replicas of
nginx:1.14.2 to be created before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need
to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.

Note:
In API version apps/v1, a Deployment's label selector is immutable after it gets created.

Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a
validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets
and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
Selector updates, which change the existing value in a selector key, result in the same behavior as additions.
Selector removals, which remove an existing key from the Deployment selector, do not require any changes in the Pod template
labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in
any existing Pods and ReplicaSets.

Rolling Back a Deployment


Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. By
default, all of the Deployment's rollout history is kept in the system so that you can roll back anytime you want (you can change that
by modifying the revision history limit).

Note:
A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and
only if the Deployment's Pod template (.spec.template) is changed, for example if you update the labels or container images of
the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate
simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod
template part is rolled back.

Suppose that you made a typo while updating the Deployment, by putting the image name as nginx:1.161 instead of
nginx:1.16.1 :

kubectl set image deployment/nginx-deployment nginx=nginx:1.161

The output is similar to this:

deployment.apps/nginx-deployment image updated

The rollout gets stuck. You can verify it by checking the rollout status:

kubectl rollout status deployment/nginx-deployment

The output is similar to this:

Waiting for rollout to finish: 1 out of 3 new replicas have been updated...

Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, read more here.

You see that the number of old replicas (adding the replica count from nginx-deployment-1564180365 and
nginx-deployment-2035384211) is 3, and the number of new replicas (from nginx-deployment-3066724191) is 1.

kubectl get rs

The output is similar to this:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1564180365   3         3         3       25s
nginx-deployment-2035384211   0         0         0       36s
nginx-deployment-3066724191   1         1         0       6s

Looking at the Pods created, you see that 1 Pod created by the new ReplicaSet is stuck in an image pull loop.

kubectl get pods

The output is similar to this:

NAME                                READY   STATUS             RESTARTS   AGE
nginx-deployment-1564180365-70iae   1/1     Running            0          25s
nginx-deployment-1564180365-jbqqo   1/1     Running            0          25s
nginx-deployment-1564180365-hysrc   1/1     Running            0          25s
nginx-deployment-3066724191-08mng   0/1     ImagePullBackOff   0          6s

Note:
The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on
the rollingUpdate parameters (maxUnavailable specifically) that you have specified. Kubernetes by default sets the value to
25%.

Get the description of the Deployment:

kubectl describe deployment

The output is similar to this:


Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Tue, 15 Mar 2016 14:48:04 -0700
Labels:                 app=nginx
Selector:               app=nginx
Replicas:               3 desired | 1 updated | 4 total | 3 available | 1 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.161
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  nginx-deployment-1564180365 (3/3 replicas created)
NewReplicaSet:   nginx-deployment-3066724191 (1/1 replicas created)
Events:
  FirstSeen  LastSeen  Count  From                     SubObjectPath  Type    Reason             Message
  ---------  --------  -----  ----                     -------------  ----    ------             -------
  1m         1m        1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica
  22s        22s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica
  22s        22s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled down repli
  22s        22s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica
  21s        21s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled down repli
  21s        21s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica
  13s        13s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled down repli
  13s        13s       1      {deployment-controller }                Normal  ScalingReplicaSet  Scaled up replica

To fix this, you need to roll back to a previous revision of the Deployment that is stable.

Checking Rollout History of a Deployment


Follow the steps given below to check the rollout history:

1. First, check the revisions of this Deployment:

kubectl rollout history deployment/nginx-deployment

The output is similar to this:

deployments "nginx-deployment"
REVISION    CHANGE-CAUSE
1           kubectl apply --filename=https://k8s.io/examples/controllers/nginx-deployment.yaml
2           kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
3           kubectl set image deployment/nginx-deployment nginx=nginx:1.161

CHANGE-CAUSE is copied from the Deployment annotation kubernetes.io/change-cause to its revisions upon creation. You can
specify the CHANGE-CAUSE message by:

Annotating the Deployment with kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"

Manually editing the manifest of the resource.


2. To see the details of each revision, run:
kubectl rollout history deployment/nginx-deployment --revision=2

The output is similar to this:

deployments "nginx-deployment" revision 2
  Labels:       app=nginx
                pod-template-hash=1159050644
  Annotations:  kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
  Containers:
   nginx:
    Image:      nginx:1.16.1
    Port:       80/TCP
    QoS Tier:
      cpu:      BestEffort
      memory:   BestEffort
    Environment Variables:  <none>
  No volumes.

Rolling Back to a Previous Revision


Follow the steps given below to roll back the Deployment from the current revision to the previous revision, which is revision 2.

1. Now you've decided to undo the current rollout and roll back to the previous revision:

kubectl rollout undo deployment/nginx-deployment

The output is similar to this:

deployment.apps/nginx-deployment rolled back

Alternatively, you can roll back to a specific revision by specifying it with --to-revision :

kubectl rollout undo deployment/nginx-deployment --to-revision=2

The output is similar to this:

deployment.apps/nginx-deployment rolled back

For more details about rollout related commands, read kubectl rollout .

The Deployment is now rolled back to a previous stable revision. As you can see, a DeploymentRollback event for rolling back to
revision 2 is generated from Deployment controller.

2. To check whether the rollback was successful and the Deployment is running as expected, run:

kubectl get deployment nginx-deployment

The output is similar to this:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           30m

3. Get the description of the Deployment:


kubectl describe deployment nginx-deployment

The output is similar to this:

Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Sun, 02 Sep 2018 18:17:55 -0500
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision=4
                        kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx:1.16.1
    Port:         80/TCP
    Host Port:    0/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   nginx-deployment-c4747d96c (3/3 replicas created)
Events:
  Type    Reason              Age   From                   Message
  ----    ------              ----  ----                   -------
  Normal  ScalingReplicaSet   12m   deployment-controller  Scaled up replica set nginx-deployment-75675f5897 to 3
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 1
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 2
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 2
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 1
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-c4747d96c to 3
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled down replica set nginx-deployment-75675f5897 to 0
  Normal  ScalingReplicaSet   11m   deployment-controller  Scaled up replica set nginx-deployment-595696685f to 1
  Normal  DeploymentRollback  15s   deployment-controller  Rolled back deployment "nginx-deployment" to revision 2
  Normal  ScalingReplicaSet   15s   deployment-controller  Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:

kubectl scale deployment/nginx-deployment --replicas=10

The output is similar to this:

deployment.apps/nginx-deployment scaled

Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the
minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80

The output is similar to this:

horizontalpodautoscaler.autoscaling/nginx-deployment autoscaled

Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales
a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the
additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.

For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.

Ensure that the 10 replicas in your Deployment are running.

kubectl get deploy

The output is similar to this:

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   10        10        10           10          50s

You update to a new image which happens to be unresolvable from inside the cluster.

kubectl set image deployment/nginx-deployment nginx=nginx:sometag

The output is similar to this:

deployment.apps/nginx-deployment image updated

The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the
maxUnavailable requirement that you mentioned above. Check out the rollout status:

kubectl get rs

The output is similar to this:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   5         5         0       9s
nginx-deployment-618515232    8         8         8       1m

Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The
Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of
them would be added in the new ReplicaSet. With proportional scaling, you spread the additional replicas across all
ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with fewer
replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.

In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process
should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
kubectl get deploy

The output is similar to this:

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   15        18        7            8           7m

The rollout status confirms how the replicas were added to each ReplicaSet.

kubectl get rs

The output is similar to this:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   7         7         0       7m
nginx-deployment-618515232    11        11        11      7m
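
The split in this example (3 replicas to the old ReplicaSet, 2 to the new one) can be reproduced with a short calculation. The sketch below is a simplified model of proportional scaling for the scale-up case only, not the Deployment controller's code: the function names are hypothetical, and it assumes each ReplicaSet is resized in proportion to the change in the allowed total (replicas + maxSurge), rounded to the nearest integer.

```python
def round_div(numerator, denominator):
    """Integer division rounded to the nearest whole number (halves round up)."""
    return (2 * numerator + denominator) // (2 * denominator)


def proportional_scale(replica_sets, old_replicas, new_replicas, max_surge):
    """Distribute a scale-up across active ReplicaSets.

    replica_sets maps ReplicaSet names to their current sizes.
    """
    old_total = old_replicas + max_surge
    new_total = new_replicas + max_surge
    remaining = new_total - sum(replica_sets.values())
    result = dict(replica_sets)
    # Visit the largest ReplicaSets first so they get their share first.
    for name in sorted(replica_sets, key=replica_sets.get, reverse=True):
        size = replica_sets[name]
        if size == 0:
            continue  # ReplicaSets with zero replicas are not scaled up
        share = round_div(size * new_total, old_total) - size
        share = max(0, min(share, remaining))
        result[name] += share
        remaining -= share
    if remaining > 0:
        # Any leftovers go to the ReplicaSet with the most replicas.
        largest = max(replica_sets, key=replica_sets.get)
        result[largest] += remaining
    return result


before = {"nginx-deployment-1989198191": 5, "nginx-deployment-618515232": 8}
print(proportional_scale(before, old_replicas=10, new_replicas=15, max_surge=3))
```

With the values from the example, the allowed total grows from 13 to 18, so the ReplicaSet with 8 replicas grows to 11 and the one with 5 replicas grows to 7, matching the kubectl get rs output above.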

Pausing and Resuming a rollout of a Deployment


When you update a Deployment, or plan to, you can pause rollouts for that Deployment before you trigger one or more updates.
When you're ready to apply those changes, you resume rollouts for the Deployment. This approach allows you to apply multiple
fixes in between pausing and resuming without triggering unnecessary rollouts.

For example, with a Deployment that was just created, get the Deployment details:

kubectl get deploy

The output is similar to this:

NAME      DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx     3         3         3            3           1m

Get the rollout status:

kubectl get rs

The output is similar to this:

NAME               DESIRED   CURRENT   READY     AGE
nginx-2142116321   3         3         3         1m

Pause by running the following command:

kubectl rollout pause deployment/nginx-deployment

The output is similar to this:


deployment.apps/nginx-deployment paused

Then update the image of the Deployment:

kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1

The output is similar to this:

deployment.apps/nginx-deployment image updated

Notice that no new rollout started:

kubectl rollout history deployment/nginx-deployment

The output is similar to this:

deployments "nginx"
REVISION CHANGE-CAUSE
1 <none>

Get the rollout status to verify that the existing ReplicaSet has not changed:

kubectl get rs

The output is similar to this:

NAME               DESIRED   CURRENT   READY     AGE
nginx-2142116321   3         3         3         2m

You can make as many updates as you wish, for example, update the resources that will be used:

kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi

The output is similar to this:

deployment.apps/nginx-deployment resource requirements updated

The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will
not have any effect as long as the Deployment rollout is paused.

Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:

kubectl rollout resume deployment/nginx-deployment

The output is similar to this:

deployment.apps/nginx-deployment resumed

Watch the status of the rollout until it's done.


kubectl get rs --watch

The output is similar to this:

NAME               DESIRED   CURRENT   READY     AGE
nginx-2142116321   2         2         2         2m
nginx-3926361531   2         2         0         6s
nginx-3926361531   2         2         1         18s
nginx-2142116321   1         2         2         2m
nginx-2142116321   1         2         2         2m
nginx-3926361531   3         2         1         18s
nginx-3926361531   3         2         1         18s
nginx-2142116321   1         1         1         2m
nginx-3926361531   3         3         1         18s
nginx-3926361531   3         3         2         19s
nginx-2142116321   0         1         1         2m
nginx-2142116321   0         1         1         2m
nginx-2142116321   0         0         0         2m
nginx-3926361531   3         3         3         20s

Get the status of the latest rollout:

kubectl get rs

The output is similar to this:

NAME               DESIRED   CURRENT   READY     AGE
nginx-2142116321   0         0         0         2m
nginx-3926361531   3         3         3         28s

Note:
You cannot roll back a paused Deployment until you resume it.

Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete,
or it can fail to progress.

Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:

The Deployment creates a new ReplicaSet.
The Deployment is scaling up its newest ReplicaSet.
The Deployment is scaling down its older ReplicaSet(s).
New Pods become ready or available (ready for at least MinReadySeconds).

When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the
Deployment's .status.conditions :

type: Progressing

status: "True"

reason: NewReplicaSetCreated | reason: FoundNewReplicaSet | reason: ReplicaSetUpdated

You can monitor the progress for a Deployment by using kubectl rollout status.

Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:

All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any
updates you've requested have been completed.
All of the replicas associated with the Deployment are available.
No old replicas for the Deployment are running.

When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's
.status.conditions :

type: Progressing

status: "True"

reason: NewReplicaSetAvailable

This Progressing condition will retain a status value of "True" until a new rollout is initiated. The condition holds even when
availability of replicas changes (which does instead affect the Available condition).

You can check if a Deployment has completed by using kubectl rollout status . If the rollout completed successfully, kubectl
rollout status returns a zero exit code.

kubectl rollout status deployment/nginx-deployment

The output is similar to this:

Waiting for rollout to finish: 2 of 3 updated replicas are available...
deployment "nginx-deployment" successfully rolled out

and the exit status from kubectl rollout is 0 (success):

echo $?

Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the
following factors:

Insufficient quota
Readiness probe failures
Image pull errors
Insufficient permissions
Limit ranges
Application runtime misconfiguration

One way you can detect this condition is to specify a deadline parameter in your Deployment spec (.spec.progressDeadlineSeconds).
.spec.progressDeadlineSeconds denotes the number of seconds the Deployment controller waits before indicating (in the
Deployment status) that the Deployment progress has stalled.

The following kubectl command sets the spec with progressDeadlineSeconds to make the controller report lack of progress of a
rollout for a Deployment after 10 minutes:

kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'


The output is similar to this:

deployment.apps/nginx-deployment patched

Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the
Deployment's .status.conditions :

type: Progressing

status: "False"

reason: ProgressDeadlineExceeded

This condition can also fail early, in which case its status is set to "False" for reasons such as ReplicaSetCreateError. Also, the
deadline is no longer taken into account once the Deployment rollout completes.

See the Kubernetes API conventions for more information on status conditions.

Note:
Kubernetes takes no action on a stalled Deployment other than to report a status condition with reason:
ProgressDeadlineExceeded. Higher-level orchestrators can take advantage of it and act accordingly, for example, by rolling back the
Deployment to its previous version.

Note:
If you pause a Deployment rollout, Kubernetes does not check progress against your specified deadline. You can safely pause a
Deployment rollout in the middle of a rollout and resume without triggering the condition for exceeding the deadline.

You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind
of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you
will notice the following section:

kubectl describe deployment nginx-deployment

The output is similar to this:

<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>

If you run kubectl get deployment nginx-deployment -o yaml, the Deployment status is similar to this:

status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: Replica set "nginx-deployment-4262182780" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  - lastTransitionTime: 2016-10-04T12:25:42Z
    lastUpdateTime: 2016-10-04T12:25:42Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
      object-counts, requested: pods=1, used: pods=3, limited: pods=2'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 3
  replicas: 2
  unavailableReplicas: 2

Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing
condition:

Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate

You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other controllers you may be
running, or by increasing quota in your namespace. If you satisfy the quota conditions and the Deployment controller then
completes the Deployment rollout, you'll see the Deployment's status update with a successful condition ( status: "True" and
reason: NewReplicaSetAvailable ).

Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable

type: Available with status: "True" means that your Deployment has minimum availability. Minimum availability is dictated by the
parameters specified in the deployment strategy. type: Progressing with status: "True" means that your Deployment is either in
the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum required new
replicas are available (see the Reason of the condition for the particulars - in our case reason: NewReplicaSetAvailable means that
the Deployment is complete).
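The way these conditions combine can be sketched in Python. The function below is a minimal illustration that reads only the Progressing condition, using the reasons named in this section; the function name `rollout_state` and its return strings are hypothetical and not part of any Kubernetes API.

```python
def rollout_state(conditions):
    """Classify a Deployment from its .status.conditions list.

    `conditions` is a list of dicts with "type", "status" and "reason",
    as they appear in the Deployment status. Illustrative sketch only.
    """
    progressing = next(
        (c for c in conditions if c.get("type") == "Progressing"), None)
    if progressing is None:
        return "unknown"
    if progressing.get("status") == "False":
        # For example reason: ProgressDeadlineExceeded
        return "failed"
    if progressing.get("status") == "True":
        if progressing.get("reason") == "NewReplicaSetAvailable":
            return "complete"
        # NewReplicaSetCreated, FoundNewReplicaSet or ReplicaSetUpdated
        return "progressing"
    return "unknown"


stalled = [
    {"type": "Available", "status": "True", "reason": "MinimumReplicasAvailable"},
    {"type": "Progressing", "status": "False", "reason": "ProgressDeadlineExceeded"},
    {"type": "ReplicaFailure", "status": "True", "reason": "FailedCreate"},
]
print(rollout_state(stalled))
```

For the stalled example this reports a failed rollout, and a Progressing condition with reason NewReplicaSetAvailable is reported as complete.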

You can check if a Deployment has failed to progress by using kubectl rollout status . kubectl rollout status returns a non-zero
exit code if the Deployment has exceeded the progression deadline.

kubectl rollout status deployment/nginx-deployment

The output is similar to this:

Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline

and the exit status from kubectl rollout is 1 (indicating an error):

echo $?

Operating on a failed deployment


All actions that apply to a complete Deployment also apply to a failed Deployment. You can scale it up/down, roll back to a previous
revision, or even pause it if you need to apply multiple tweaks in the Deployment Pod template.

Clean up Policy
You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to
retain. The rest will be garbage-collected in the background. By default, it is 10.

Note:
Explicitly setting this field to 0 cleans up all the history of your Deployment, so that Deployment will not be able
to roll back.

Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for
each release, following the canary pattern described in managing resources.

Writing a Deployment Spec


As with all other Kubernetes configs, a Deployment needs .apiVersion , .kind , and .metadata fields. For general information about
working with config files, see deploying applications, configuring containers, and using kubectl to manage resources documents.

When the control plane creates new Pods for a Deployment, the .metadata.name of the Deployment is part of the basis for naming
those Pods. The name of a Deployment must be a valid DNS subdomain value, but this can produce unexpected results for the Pod
hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
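The difference between the two naming rules can be checked with a short Python sketch. It assumes the RFC 1123 rules as Kubernetes documents them for object names (lowercase alphanumerics and '-', starting and ending with an alphanumeric; at most 63 characters for a label and 253 for a subdomain, which may also contain '.'). The helper names are hypothetical.

```python
import re

# One DNS label: lowercase alphanumerics and '-', alphanumeric at both ends.
_LABEL = r"[a-z0-9]([-a-z0-9]*[a-z0-9])?"
_DNS_LABEL = re.compile(_LABEL)
# A subdomain is one or more labels separated by dots.
_DNS_SUBDOMAIN = re.compile(_LABEL + r"(\." + _LABEL + r")*")


def is_dns_label(name):
    """True if `name` follows the more restrictive DNS label rules."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


def is_dns_subdomain(name):
    """True if `name` is a valid DNS subdomain value."""
    return len(name) <= 253 and _DNS_SUBDOMAIN.fullmatch(name) is not None


for candidate in ("nginx-deployment", "my.app"):
    print(candidate, is_dns_subdomain(candidate), is_dns_label(candidate))
```

A name such as my.app passes the subdomain check but not the label check, which is the case the compatibility advice above is about.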

A Deployment also needs a .spec section.

Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec .

The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .

In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart
policy. For labels, make sure not to overlap with other controllers. See selector.

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.

Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.

Should you manually scale a Deployment, for example via kubectl scale deployment deployment --replicas=X, and then you update that
Deployment based on a manifest (for example, by running kubectl apply -f deployment.yaml), then applying that manifest
overwrites the manual scaling that you previously did.

If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set
.spec.replicas.

Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.

Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this Deployment.

.spec.selector must match .spec.template.metadata.labels , or it will be rejected by the API.

In API version apps/v1 , .spec.selector and .metadata.labels do not default to .spec.template.metadata.labels if not set. So they
must be set explicitly. Also note that .spec.selector is immutable after creation of the Deployment in apps/v1 .
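The matching rule for the common matchLabels form can be sketched in Python as follows. This is an illustration only: it does not cover matchExpressions or any other API server validation, and the function name `selector_matches` is hypothetical.

```python
def selector_matches(match_labels, labels):
    """True if every key/value pair in matchLabels is present in `labels`."""
    return all(labels.get(key) == value for key, value in match_labels.items())


selector = {"app": "nginx"}
template_labels = {"app": "nginx", "tier": "web"}
print(selector_matches(selector, template_labels))
```

The template may carry extra labels, but every label named in the selector must be present with the same value.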

A Deployment may terminate Pods whose labels match the selector if their template is different from .spec.template or if the total
number of such Pods exceeds .spec.replicas . It brings up new Pods with .spec.template if the number of Pods is less than the
desired number.

Note:
You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by
creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it
created these other Pods. Kubernetes does not stop you from doing this.

If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.

Strategy
.spec.strategy specifies the strategy used to replace old Pods with new ones. .spec.strategy.type can be "Recreate" or
"RollingUpdate". "RollingUpdate" is the default value.

Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate .

Note:
This only guarantees that Pods are terminated before new ones are created during upgrades. If you upgrade a Deployment, all Pods of the old
revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you
manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the
old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.

Rolling Update Deployment


The Deployment updates Pods in a rolling update fashion when .spec.strategy.type==RollingUpdate . You can specify maxUnavailable
and maxSurge to control the rolling update process.

Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable
during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example,
10%). The absolute number is calculated from the percentage by rounding down. The value cannot be 0 if
.spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.

For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the
rolling update starts. Once new Pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new
ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.

Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the
desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%).
The value cannot be 0 if .spec.strategy.rollingUpdate.maxUnavailable is 0. The absolute number is calculated from the percentage
by rounding up. The default value is 25%.

For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such
that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new
ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130%
of desired Pods.
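The rounding rules for both fields can be worked through in a short Python sketch. It assumes only what is stated above (maxUnavailable rounds down, maxSurge rounds up, both default to 25%); the function names are hypothetical, and the sketch does not reproduce the API server's validation of zero values.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute int or an "N%" string against desired replicas."""
    if isinstance(value, int):
        return value
    scaled = replicas * int(value.rstrip("%"))
    # Integer ceiling for maxSurge, integer floor for maxUnavailable.
    return -(-scaled // 100) if round_up else scaled // 100


def rolling_update_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Return the Pod count bounds that hold during a rolling update."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


print(rolling_update_bounds(10, "30%", "30%"))
print(rolling_update_bounds(3))
```

For 10 desired Pods and 30% for both fields, at least 7 Pods stay available and at most 13 Pods exist at any time. For 3 desired Pods with the defaults, 25% rounds down to 0 unavailable Pods and rounds up to 1 surge Pod.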

Here is a Rolling Update Deployment example that sets maxUnavailable:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

Progress Deadline Seconds


.spec.progressDeadlineSeconds is an optional field that specifies the number of seconds you want to wait for your Deployment to
progress before the system reports back that the Deployment has failed progressing. This is surfaced as a condition with type:
Progressing, status: "False", and reason: ProgressDeadlineExceeded in the status of the resource. The Deployment controller will
keep retrying the Deployment. This defaults to 600. In the future, once automatic rollback is implemented, the Deployment
controller will roll back a Deployment as soon as it observes such a condition.

If specified, this field needs to be greater than .spec.minReadySeconds .

Min Ready Seconds


.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be
ready without any of its containers crashing, for it to be considered available. This defaults to 0 (the Pod will be considered available
as soon as it is ready). To learn more about when a Pod is considered ready, see Container Probes.

Revision History Limit


A Deployment's revision history is stored in the ReplicaSets it controls.

.spec.revisionHistoryLimit is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. These old
ReplicaSets consume resources in etcd and crowd the output of kubectl get rs. The configuration of each Deployment revision is
stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to roll back to that revision of the Deployment.
By default, 10 old ReplicaSets will be kept; however, the ideal value depends on the frequency and stability of new Deployments.
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new
Deployment rollout cannot be undone, since its revision history is cleaned up.

Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused
Deployment and one that is not paused is that any changes to the PodTemplateSpec of the paused Deployment will not trigger
new rollouts as long as it is paused. A Deployment is not paused by default when it is created.

What's next
Learn more about Pods.
Run a stateless application using a Deployment.
Read the Deployment API reference to understand the Deployment API.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
Use kubectl to create a Deployment.

2 - ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. Usually, you define
a Deployment and let that Deployment manage ReplicaSets automatically.

As such, a ReplicaSet is often used to guarantee the availability of a specified number of identical Pods.

How a ReplicaSet works


A ReplicaSet is defined with fields, including a selector that specifies how to identify Pods it can acquire, a number of replicas
indicating how many Pods it should be maintaining, and a pod template specifying the data of new Pods it should create to meet the
number of replicas criteria. A ReplicaSet then fulfills its purpose by creating and deleting Pods as needed to reach the desired
number. When a ReplicaSet needs to create new Pods, it uses its Pod template.

A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is
owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences
field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.

A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the
OwnerReference is not a Controller and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
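The acquisition rule can be sketched in Python. This is an illustration of the rule as described here, covering only matchLabels selectors; the function name `can_acquire` is hypothetical, and Pods are represented as plain dicts shaped like their manifests.

```python
def can_acquire(pod, match_labels):
    """True if a ReplicaSet with this matchLabels selector may acquire `pod`."""
    metadata = pod.get("metadata", {})
    owners = metadata.get("ownerReferences", [])
    # A Pod that already has a controller owner is left alone.
    if any(ref.get("controller") for ref in owners):
        return False
    labels = metadata.get("labels", {})
    return all(labels.get(k) == v for k, v in match_labels.items())


bare_pod = {"metadata": {"name": "pod1", "labels": {"tier": "frontend"}}}
owned_pod = {
    "metadata": {
        "name": "frontend-gbgfx",
        "labels": {"tier": "frontend"},
        "ownerReferences": [
            {"kind": "ReplicaSet", "name": "frontend", "controller": True}
        ],
    }
}
print(can_acquire(bare_pod, {"tier": "frontend"}))
print(can_acquire(owned_pod, {"tier": "frontend"}))
```

The bare Pod is eligible because it matches the selector and has no controller owner; the second Pod is not, because a controller already owns it.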

When to use a ReplicaSet


A ReplicaSet ensures that a specified number of pod replicas are running at any given time. However, a Deployment is a higher-level
concept that manages ReplicaSets and provides declarative updates to Pods along with a lot of other useful features. Therefore, we
recommend using Deployments instead of directly using ReplicaSets, unless you require custom update orchestration or don't
require updates at all.

This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your
application in the spec section.

Example
controllers/frontend.yaml

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5

Saving this manifest into frontend.yaml and submitting it to a Kubernetes cluster will create the defined ReplicaSet and the Pods
that it manages.

kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml

You can then get the current ReplicaSets deployed:

kubectl get rs

And see the frontend one you created:

NAME       DESIRED   CURRENT   READY   AGE
frontend   3         3         3       6s

You can also check on the state of the ReplicaSet:

kubectl describe rs/frontend

And you will see output similar to:

Name:         frontend
Namespace:    default
Selector:     tier=frontend
Labels:       app=guestbook
              tier=frontend
Annotations:  <none>
Replicas:     3 current / 3 desired
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  tier=frontend
  Containers:
   php-redis:
    Image:        us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From                   Message
  ----    ------            ----  ----                   -------
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-gbgfx
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-rwz57
  Normal  SuccessfulCreate  13s   replicaset-controller  Created pod: frontend-wkl7w

And lastly you can check for the Pods brought up:

kubectl get pods

You should see Pod information similar to:

NAME             READY   STATUS    RESTARTS   AGE
frontend-gbgfx   1/1     Running   0          10m
frontend-rwz57   1/1     Running   0          10m
frontend-wkl7w   1/1     Running   0          10m

You can also verify that the owner reference of these Pods is set to the frontend ReplicaSet. To do this, get the YAML of one of the
Pods running:

kubectl get pods frontend-gbgfx -o yaml

The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-02-28T22:30:44Z"
  generateName: frontend-
  labels:
    tier: frontend
  name: frontend-gbgfx
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: frontend
    uid: e129deca-f864-481b-bb16-b27abfd92292
...

Non-Template Pod acquisitions


While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels
which match the selector of one of your ReplicaSets. This is because a ReplicaSet is not limited to owning Pods
specified by its template; it can acquire other Pods in the manner specified in the previous sections.

Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:

pods/pod-rs.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    tier: frontend
spec:
  containers:
  - name: hello1
    image: gcr.io/google-samples/hello-app:2.0

---

apiVersion: v1
kind: Pod
metadata:
  name: pod2
  labels:
    tier: frontend
spec:
  containers:
  - name: hello2
    image: gcr.io/google-samples/hello-app:1.0

As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet,
they will immediately be acquired by it.

Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its
replica count requirement:

kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml

The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired
count.

Fetching the Pods:

kubectl get pods

The output shows that the new Pods are either already terminated, or in the process of being terminated:

NAME             READY   STATUS        RESTARTS   AGE
frontend-b2zdv   1/1     Running       0          10m
frontend-vcmts   1/1     Running       0          10m
frontend-wtsmm   1/1     Running       0          10m
pod1             0/1     Terminating   0          1s
pod2             0/1     Terminating   0          1s

If you create the Pods first:

kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml

And then create the ReplicaSet:

kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml

You will see that the ReplicaSet has acquired the Pods and has only created new ones according to its spec until the number of its
new Pods plus the original Pods matches its desired count. Fetch the Pods:

kubectl get pods

The output is similar to this:

NAME             READY   STATUS    RESTARTS   AGE
frontend-hmmj2   1/1     Running   0          9s
pod1             1/1     Running   0          36s
pod2             1/1     Running   0          36s

In this manner, a ReplicaSet can own a non-homogeneous set of Pods.
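Both orderings come down to the same count comparison, which a few lines of Python can illustrate. This is a sketch of the arithmetic only, with a hypothetical function name; it says nothing about which particular Pods the controller chooses to remove.

```python
def reconcile(desired, owned_pods):
    """Return the action needed to bring owned Pods to the desired count."""
    diff = desired - len(owned_pods)
    if diff > 0:
        return ("create", diff)
    if diff < 0:
        return ("delete", -diff)
    return ("none", 0)


# ReplicaSet first: 3 template Pods exist, then pod1 and pod2 are acquired.
print(reconcile(3, ["frontend-b2zdv", "frontend-vcmts", "frontend-wtsmm",
                    "pod1", "pod2"]))
# Bare Pods first: pod1 and pod2 are acquired, one more Pod is needed.
print(reconcile(3, ["pod1", "pod2"]))
```

In the first case two Pods must go; in the second case only one new Pod is created from the template.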

Writing a ReplicaSet manifest


As with all other Kubernetes API objects, a ReplicaSet needs the apiVersion, kind, and metadata fields. For ReplicaSets, the kind is
always ReplicaSet.

When the control plane creates new Pods for a ReplicaSet, the .metadata.name of the ReplicaSet is part of the basis for naming those
Pods. The name of a ReplicaSet must be a valid DNS subdomain value, but this can produce unexpected results for the Pod
hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.

A ReplicaSet also needs a .spec section.

Pod Template
The .spec.template is a pod template which is also required to have labels in place. In our frontend.yaml example we had one label:
tier: frontend . Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.

For the template's restart policy field, .spec.template.spec.restartPolicy , the only allowed value is Always , which is the default.

Pod Selector
The .spec.selector field is a label selector. As discussed earlier these are the labels used to identify potential Pods to acquire. In our
frontend.yaml example, the selector was:

matchLabels:
tier: frontend

In the ReplicaSet, .spec.template.metadata.labels must match .spec.selector, or it will be rejected by the API.

Note:
For 2 ReplicaSets specifying the same .spec.selector but different .spec.template.metadata.labels and .spec.template.spec fields,
each ReplicaSet ignores the Pods created by the other ReplicaSet.

Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas . The ReplicaSet will create/delete its Pods to
match this number.

If you do not specify .spec.replicas , then it defaults to 1.

Working with ReplicaSets


Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete . The Garbage collector automatically deletes all of the dependent Pods
by default.

When using the REST API or the client-go library, you must set propagationPolicy to Background or Foreground in the -d option.
For example:

kubectl proxy --port=8080


curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
-H "Content-Type: application/json"

Deleting just a ReplicaSet


You can delete a ReplicaSet without affecting any of its Pods using kubectl delete with the --cascade=orphan option. When using the
REST API or the client-go library, you must set propagationPolicy to Orphan. For example:

kubectl proxy --port=8080

curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
-d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Orphan"}' \
-H "Content-Type: application/json"

Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new .spec.selector are the same,
then the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod
template. To update Pods to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling update
directly.

Isolating Pods from a ReplicaSet


You can remove Pods from a ReplicaSet by changing their labels. This technique may be used to remove Pods from service for
debugging, data recovery, etc. Pods that are removed in this way will be replaced automatically (assuming that the number of
replicas is not also changed).

Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field. The ReplicaSet controller ensures that a
desired number of Pods with a matching label selector are available and operational.

When scaling down, the ReplicaSet controller chooses which pods to delete by sorting the available pods to prioritize scaling down
pods based on the following general algorithm:

1. Pending (and unschedulable) pods are scaled down first.
2. If controller.kubernetes.io/pod-deletion-cost annotation is set, then the pod with the lower value will come first.
3. Pods on nodes with more replicas come before pods on nodes with fewer replicas.
4. If the pods' creation times differ, the pod that was created more recently comes before the older pod (the creation times are
bucketed on an integer log scale).

If all of the above match, then selection is random.
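The four criteria above can be expressed as a sort key, as in the following Python sketch. It is a simplified illustration, not the ReplicaSet controller's code: it compares exact creation timestamps rather than log-scale buckets, it leaves ties in input order instead of choosing randomly, and the dict keys (`name`, `phase`, `node`, `cost`, `created`) are hypothetical.

```python
from collections import Counter


def deletion_order(pods):
    """Return pods sorted so that the first entries are removed first."""
    # Number of this ReplicaSet's Pods on each node.
    per_node = Counter(p["node"] for p in pods if p.get("node"))

    def key(pod):
        return (
            0 if pod["phase"] == "Pending" else 1,  # 1. pending Pods first
            pod.get("cost", 0),                     # 2. lower deletion cost first
            -per_node.get(pod.get("node"), 0),      # 3. more crowded nodes first
            -pod["created"],                        # 4. newer Pods first
        )

    return sorted(pods, key=key)


pods = [
    {"name": "a", "phase": "Running", "node": "n1", "cost": 0, "created": 100},
    {"name": "b", "phase": "Running", "node": "n1", "cost": 0, "created": 200},
    {"name": "c", "phase": "Running", "node": "n2", "cost": -5, "created": 50},
    {"name": "d", "phase": "Pending", "node": None, "cost": 0, "created": 300},
]
print([p["name"] for p in deletion_order(pods)])
```

Here the pending Pod d is removed first, then c because of its lower deletion cost, then b, which is newer than a on the same node.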

Pod deletion cost

ⓘ FEATURE STATE: Kubernetes v1.22 [beta]

Using the controller.kubernetes.io/pod-deletion-cost annotation, users can set a preference regarding which pods to remove first
when downscaling a ReplicaSet.

The annotation should be set on the pod; its valid range is [-2147483648, 2147483647]. It represents the cost of deleting a pod compared
to other pods belonging to the same ReplicaSet. Pods with lower deletion cost are preferred to be deleted before pods with higher
deletion cost.

The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by
the API server.
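How a client might read this annotation can be sketched in Python. The function name is hypothetical, and the check only mirrors the range and default stated above; Python's int() accepts some inputs (such as surrounding whitespace) that the API server may reject, so this is not a substitute for server-side validation.

```python
ANNOTATION = "controller.kubernetes.io/pod-deletion-cost"
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def deletion_cost(annotations):
    """Return the Pod's deletion cost, defaulting to 0 when unset."""
    raw = annotations.get(ANNOTATION)
    if raw is None:
        return 0
    cost = int(raw)  # raises ValueError for non-integer text
    if not INT32_MIN <= cost <= INT32_MAX:
        raise ValueError(f"deletion cost out of range: {cost}")
    return cost


print(deletion_cost({}))
print(deletion_cost({ANNOTATION: "-100"}))
```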

This feature is beta and enabled by default. You can disable it using the feature gate PodDeletionCost in both kube-apiserver and
kube-controller-manager.

Note:
This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will
generate a significant number of pod updates on the apiserver.

Example Use Case


The different pods of an application could have different utilization levels. On scale down, the application may prefer to remove the
pods with lower utilization. To avoid frequently updating the pods, the application should update
controller.kubernetes.io/pod-deletion-cost once before issuing a scale down (setting the annotation to a value proportional to pod
utilization level). This works if the application itself controls the downscaling; for example, the driver pod of a Spark deployment.
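
As a minimal sketch of the use case above, the metadata fragment below marks one Pod as cheaper to delete than its peers. The Pod name frontend-b2zdv and the value -100 are hypothetical; only the annotation key comes from this page. Annotation values are strings, so the number is quoted.

```yaml
# Fragment of a Pod's metadata; all other fields are omitted.
metadata:
  name: frontend-b2zdv    # hypothetical Pod owned by the frontend ReplicaSet
  annotations:
    # Lower than the implicit value of 0, so this Pod is preferred for removal
    # when the ReplicaSet is scaled down (best effort only).
    controller.kubernetes.io/pod-deletion-cost: "-100"
```

Pods that do not carry the annotation are treated as having a cost of 0, so setting it on only the least-utilized Pods is enough to express a preference.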

ReplicaSet as a Horizontal Pod Autoscaler Target


A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA). That is, a ReplicaSet can be auto-scaled by an HPA. Here is an
example HPA targeting the ReplicaSet we created in the previous example.

controllers/hpa-rs.yaml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should create the defined HPA that autoscales the
target ReplicaSet depending on the CPU usage of the replicated Pods.

kubectl apply -f https://k8s.io/examples/controllers/hpa-rs.yaml

Alternatively, you can use the kubectl autoscale command to accomplish the same thing (and it's easier!):

kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50

Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and their Pods via declarative, server-side rolling updates.
While ReplicaSets can be used independently, today they're mainly used by Deployments as a mechanism to orchestrate Pod
creation, deletion and updates. When you use Deployments you don't have to worry about managing the ReplicaSets that they
create. Deployments own and manage their ReplicaSets. As such, it is recommended to use Deployments when you want
ReplicaSets.

Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are deleted or terminated for any reason, such as
in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a
ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process supervisor, only it supervises multiple
Pods across multiple nodes instead of individual processes on a single node. A ReplicaSet delegates local container restarts to some
agent on the node, such as the kubelet.

Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on their own (that is, batch jobs).

DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-level function, such as machine monitoring or machine
logging. These Pods have a lifetime that is tied to a machine lifetime: the Pod needs to be running on the machine before other Pods
start, and is safe to terminate when the machine is otherwise ready to be rebooted or shut down.

ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose, and behave similarly, except that a
ReplicationController does not support set-based selector requirements as described in the labels user guide. As such, ReplicaSets
are preferred over ReplicationControllers.

What's next
Learn about Pods.
Learn about Deployments.
Run a Stateless Application Using a Deployment, which relies on ReplicaSets to work.
ReplicaSet is a top-level resource in the Kubernetes REST API. Read the ReplicaSet object definition to understand the API for
replica sets.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
3 - StatefulSets
A StatefulSet runs a group of Pods, and maintains a sticky identity for each of those Pods. This is useful for
managing applications that need persistent storage or a stable, unique network identity.

StatefulSet is the workload API object used to manage stateful applications.

A StatefulSet manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet
maintains a sticky identity for each of its Pods. These pods are created from the same spec, but are not interchangeable: each has a
persistent identifier that it maintains across any rescheduling.

If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution.
Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing
volumes to the new Pods that replace any that have failed.

Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following.

Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.

In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers
or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of
stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.

Limitations
The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner (examples here) based on the
requested storage class, or pre-provisioned by an admin.
Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure
data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for
creating this Service.
StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and
graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ), it's possible to get into a broken state
that requires manual intervention to repair.

Components
The example below demonstrates the components of a StatefulSet.

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.24
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

Note:
This example uses the ReadWriteOnce access mode, for simplicity. For production use, the Kubernetes project recommends using
the ReadWriteOncePod access mode instead.

In the above example:

A Headless Service, named nginx , is used to control the network domain.
The StatefulSet, named web , has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
The volumeClaimTemplates will provide stable storage using PersistentVolumes provisioned by a PersistentVolume Provisioner.

The name of a StatefulSet object must be a valid DNS label.


Pod Selector
You must set the .spec.selector field of a StatefulSet to match the labels of its .spec.template.metadata.labels . Failing to specify a
matching Pod Selector will result in a validation error during StatefulSet creation.

Volume Claim Templates


You can set the .spec.volumeClaimTemplates field to create a PersistentVolumeClaim. This will provide stable storage to the
StatefulSet if either:

The StorageClass specified for the volume claim is set up to use dynamic provisioning, or
The cluster already contains a PersistentVolume with the correct StorageClass and sufficient available storage space.

Minimum ready seconds

ⓘ FEATURE STATE: Kubernetes v1.25 [stable]

.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be
running and ready without any of its containers crashing, for it to be considered available. This is used to check progression of a
rollout when using a Rolling Update strategy. This field defaults to 0 (the Pod will be considered available as soon as it is ready). To
learn more about when a Pod is considered ready, see Container Probes.

Pod Identity
StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity, and stable storage. The identity sticks to
the Pod, regardless of which node it's (re)scheduled on.

Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, that is unique over the Set. By
default, pods will be assigned ordinals from 0 up through N-1. The StatefulSet controller will also add a pod label with this index:
apps.kubernetes.io/pod-index .

Start ordinal

ⓘ FEATURE STATE: Kubernetes v1.31 [stable] (enabled by default: true)

.spec.ordinals is an optional field that allows you to configure the integer ordinals assigned to each Pod. It defaults to nil. Within
the field, you can configure the following options:

.spec.ordinals.start : If the .spec.ordinals.start field is set, Pods will be assigned ordinals from .spec.ordinals.start up
through .spec.ordinals.start + .spec.replicas - 1 .
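
As an illustration of the rule above, the fragment below is a sketch that assumes the web StatefulSet from the earlier example; the start value of 5 is an arbitrary choice for the example.

```yaml
# Fragment of a StatefulSet; the selector, template and other fields are omitted.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  ordinals:
    start: 5    # Pods get ordinals 5 through 5 + 3 - 1 = 7: web-5, web-6, web-7
```

Leaving .spec.ordinals unset keeps the default numbering from 0 up through N-1.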

Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the
constructed hostname is $(statefulset name)-$(ordinal) . The example above will create three Pods named web-0, web-1, and web-2. A
StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form:
$(service name).$(namespace).svc.cluster.local , where "cluster.local" is the cluster domain. As each Pod is created, it gets a
matching DNS subdomain, taking the form: $(podname).$(governing service domain) , where the governing service is defined by the
serviceName field on the StatefulSet.

Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod
immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before
it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused,
even after the Pod is running, for at least a few seconds.

If you need to discover Pods promptly after they are created, you have a few options:
Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which
currently caches for 30 seconds).

As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of
the pods.

Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and how that affects the DNS names for
the StatefulSet's Pods.

Cluster Domain  Service (ns/name)  StatefulSet (ns/name)  StatefulSet Domain               Pod DNS                                       Pod Hostname
cluster.local   default/nginx      default/web            nginx.default.svc.cluster.local  web-{0..N-1}.nginx.default.svc.cluster.local  web-{0..N-1}
cluster.local   foo/nginx          foo/web                nginx.foo.svc.cluster.local      web-{0..N-1}.nginx.foo.svc.cluster.local      web-{0..N-1}
kube.local      foo/nginx          foo/web                nginx.foo.svc.kube.local         web-{0..N-1}.nginx.foo.svc.kube.local         web-{0..N-1}

Note:
Cluster Domain will be set to cluster.local unless otherwise configured.

Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim. In the nginx example
above, each Pod receives a single PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If no
StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts
mount the PersistentVolumes associated with its PersistentVolumeClaims. Note that the PersistentVolumes associated with the
Pods' PersistentVolumeClaims are not deleted when the Pods or the StatefulSet are deleted. This must be done manually.

Pod Name Label


When the StatefulSet controller creates a Pod, it adds a label, statefulset.kubernetes.io/pod-name , that is set to the name of the Pod.
This label allows you to attach a Service to a specific Pod in the StatefulSet.
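
A minimal sketch of attaching a Service to one Pod with this label, assuming the web StatefulSet from the example above. The Service name web-0-direct is hypothetical, and the port matches the container port used in that example.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-0-direct    # hypothetical name
spec:
  selector:
    # Matches only the Pod named web-0, because the controller sets this
    # label to the Pod's own name.
    statefulset.kubernetes.io/pod-name: web-0
  ports:
  - port: 80
    targetPort: 80
```

Because the Pod name is stable across rescheduling, the selector keeps matching the replacement Pod if web-0 is recreated.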

Pod index label

ⓘ FEATURE STATE: Kubernetes v1.32 [stable] (enabled by default: true)

When the StatefulSet controller creates a Pod, the new Pod is labelled with apps.kubernetes.io/pod-index . The value of this label is
the ordinal index of the Pod. This label allows you to route traffic to a particular pod index, filter logs/metrics using the pod index
label, and more. The PodIndexLabel feature gate is enabled and locked by default for this feature. To disable it, you
have to use server emulated version v1.31.

Deployment and Scaling Guarantees


For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
Before a Pod is terminated, all of its successors must be completely shut down.

The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds of 0. This practice is unsafe and strongly discouraged.
For further explanation, please refer to force deleting StatefulSet Pods.

When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed
before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after
web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and
becomes Running and Ready.

If a user were to scale the deployed example by patching the StatefulSet such that replicas=1 , web-2 would be terminated first.
web-1 would not be terminated until web-2 is fully shut down and deleted. If web-0 were to fail after web-2 has been terminated and
is completely shut down, but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and Ready.

Pod Management Policies


StatefulSet allows you to relax its ordering guarantees while preserving its uniqueness and identity guarantees via its
.spec.podManagementPolicy field.

OrderedReady Pod Management


OrderedReady pod management is the default for StatefulSets. It implements the behavior described above.

Parallel Pod Management


Parallel pod management tells the StatefulSet controller to launch or terminate all Pods in parallel, and to not wait for Pods to
become Running and Ready or completely terminated prior to launching or terminating another Pod. This option only affects the
behavior for scaling operations. Updates are not affected.
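
For illustration, the fragment below sketches how the policy could be set on the web StatefulSet from the earlier example; everything other than the podManagementPolicy field is omitted.

```yaml
# Fragment of a StatefulSet spec.
spec:
  # Launch or terminate all Pods at once when scaling.
  # Omit the field, or set it to OrderedReady, to keep the default ordering.
  podManagementPolicy: Parallel
  replicas: 3
```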

Update strategies
A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated rolling updates for containers, labels,
resource request/limits, and annotations for the Pods in a StatefulSet. There are two possible values:

OnDelete

When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods
in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a
StatefulSet's .spec.template.

RollingUpdate

The RollingUpdate update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update
strategy.

Rolling Updates
When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate , the StatefulSet controller will delete and recreate each Pod
in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod
one at a time.

The Kubernetes control plane waits until an updated Pod is Running and Ready prior to updating its predecessor. If you have set
.spec.minReadySeconds (see Minimum Ready Seconds), the control plane additionally waits that amount of time after the Pod turns
ready, before moving on.

Partitioned rolling updates


The RollingUpdate update strategy can be partitioned, by specifying a .spec.updateStrategy.rollingUpdate.partition . If a partition is
specified, all Pods with an ordinal that is greater than or equal to the partition will be updated when the StatefulSet's .spec.template
is updated. All Pods with an ordinal that is less than the partition will not be updated, and, even if they are deleted, they will be
recreated at the previous version. If a StatefulSet's .spec.updateStrategy.rollingUpdate.partition is greater than its .spec.replicas ,
updates to its .spec.template will not be propagated to its Pods. In most cases you will not need to use a partition, but they are
useful if you want to stage an update, roll out a canary, or perform a phased roll out.
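
A minimal sketch of a partitioned rollout, assuming the three-replica web StatefulSet from the earlier example; the partition value of 2 is chosen only to show a single-Pod canary.

```yaml
# Fragment of a StatefulSet spec.
spec:
  replicas: 3
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Only Pods with ordinal >= 2 pick up template changes, so web-2 is
      # updated while web-0 and web-1 stay on the previous template.
      partition: 2
```

Lowering the partition step by step (for example to 1, then 0) extends the update to the remaining Pods in a phased way.
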
Maximum unavailable Pods

ⓘ FEATURE STATE: Kubernetes v1.24 [alpha]

You can control the maximum number of Pods that can be unavailable during an update by specifying the
.spec.updateStrategy.rollingUpdate.maxUnavailable field. The value can be an absolute number (for example, 5 ) or a percentage of
desired Pods (for example, 10% ). Absolute number is calculated from the percentage value by rounding it up. This field cannot be 0.
The default setting is 1.

This field applies to all Pods in the range 0 to replicas - 1 . If there is any unavailable Pod in the range 0 to replicas - 1 , it will be
counted towards maxUnavailable .

Note:
The maxUnavailable field is in Alpha stage and it is honored only by API servers that are running with the
MaxUnavailableStatefulSet feature gate enabled.
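
The fragment below is a sketch of setting this field; the value of 2 is an arbitrary example, and as noted above the field only takes effect when the MaxUnavailableStatefulSet feature gate is enabled.

```yaml
# Fragment of a StatefulSet spec.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      # Absolute number or a percentage such as "10%"; cannot be 0.
      # When omitted, the default is 1.
      maxUnavailable: 2
```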

Forced rollback
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ), it's possible to get into a broken state that
requires manual intervention to repair.

If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or
application-level configuration error), StatefulSet will stop the rollout and wait.

In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to
wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.

After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad
configuration. StatefulSet will then begin to recreate the Pods using the reverted template.

PersistentVolumeClaim retention
ⓘ FEATURE STATE: Kubernetes v1.32 [stable] (enabled by default: true)

The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are deleted during the lifecycle of a
StatefulSet. The StatefulSetAutoDeletePVC feature gate must be enabled on the API server and the controller manager to use this
field; as the feature state above indicates, it is enabled by default. There are two policies you can configure for each StatefulSet:

whenDeleted

configures the volume retention behavior that applies when the StatefulSet is deleted

whenScaled

configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when
scaling down the set.

For each policy that you can configure, you can set the value to either Delete or Retain .

Delete

The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted
policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs
corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.

Retain (default)

PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.

Bear in mind that these policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For
example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the
StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is
about to launch.

The default for policies is Retain , matching the StatefulSet behavior before this new feature.

Here is an example policy.

apiVersion: apps/v1
kind: StatefulSet
...
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete
...

The StatefulSet controller adds owner references to its PVCs, which are then deleted by the garbage collector after the Pod is
terminated. This enables the Pod to cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and
volume are deleted, depending on the retain policy). When you set the whenDeleted policy to Delete , an owner reference to the
StatefulSet instance is placed on all PVCs associated with that StatefulSet.

The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a Pod is deleted for another reason. When
reconciling, the StatefulSet controller compares its desired replica count to the actual Pods present on the cluster. Any StatefulSet
Pod whose id is greater than the replica count is condemned and marked for deletion. If the whenScaled policy is Delete , the
condemned Pods are first set as owners to the associated StatefulSet template PVCs, before the Pod is deleted. This causes the PVCs
to be garbage collected after only the condemned Pods have terminated.

This means that if the controller crashes and restarts, no Pod will be deleted before its owner reference has been updated
appropriate to the policy. If a condemned Pod is force-deleted while the controller is down, the owner reference may or may not
have been set up, depending on when the controller crashed. It may take several reconcile loops to update the owner references, so
some condemned Pods may have set up owner references and others may not. For this reason we recommend waiting for the
controller to come back up, which will verify owner references before terminating Pods. If that is not possible, the operator should
verify the owner references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.

Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.

Should you manually scale a StatefulSet, for example via kubectl scale statefulset statefulset --replicas=X , and then update
that StatefulSet based on a manifest (for example: by running kubectl apply -f statefulset.yaml ), then applying that manifest
overwrites the manual scaling that you previously did.

If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a StatefulSet, don't set .spec.replicas .
Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.

What's next
Learn about Pods.
Find out how to use StatefulSets
Follow an example of deploying a stateful application.
Follow an example of deploying Cassandra with Stateful Sets.
Follow an example of running a replicated stateful application.
Learn how to scale a StatefulSet.
Learn what's involved when you delete a StatefulSet.
Learn how to configure a Pod to use a volume for storage.
Learn how to configure a Pod to use a PersistentVolume for storage.
StatefulSet is a top-level resource in the Kubernetes REST API. Read the StatefulSet object definition to understand the API for
stateful sets.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
4 - DaemonSet
A DaemonSet defines Pods that provide node-local facilities. These might be fundamental to the operation of
your cluster, such as a networking helper tool, or be part of an add-on.

A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As
nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.

Some typical uses of a DaemonSet are:

running a cluster storage daemon on every node
running a logs collection daemon on every node
running a node monitoring daemon on every node

In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A more complex setup might use
multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and cpu requests for different
hardware types.

Writing a DaemonSet Spec


Create a DaemonSet
You can describe a DaemonSet in a YAML file. For example, the daemonset.yaml file below describes a DaemonSet that runs the
fluentd-elasticsearch Docker image:

controllers/daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      # it may be desirable to set a high priority class to ensure that a DaemonSet Pod
      # preempts running Pods
      # priorityClassName: important
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log

Create a DaemonSet based on the YAML file:

kubectl apply -f https://k8s.io/examples/controllers/daemonset.yaml

Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion , kind , and metadata fields. For general information about
working with config files, see running stateless applications and object management using kubectl.

The name of a DaemonSet object must be a valid DNS subdomain name.

A DaemonSet also needs a .spec section.

Pod Template
The .spec.template is one of the required fields in .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .

In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify appropriate labels (see pod selector).

A Pod Template in a DaemonSet must have a RestartPolicy equal to Always , or be unspecified, which defaults to Always .

Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.selector of a Job.

You must specify a pod selector that matches the labels of the .spec.template . Also, once a DaemonSet is created, its
.spec.selector cannot be mutated. Mutating the pod selector can lead to the unintentional orphaning of Pods, and it was found to
be confusing to users.

The .spec.selector is an object consisting of two fields:

matchLabels - works the same as the .spec.selector of a ReplicationController.
matchExpressions - allows you to build more sophisticated selectors by specifying a key, a list of values, and an operator that
relates the key and values.

When both are specified, the result is ANDed.

The .spec.selector must match the .spec.template.metadata.labels . Config with these two not matching will be rejected by the API.
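
As a sketch of a selector that uses both fields, the fragment below extends the fluentd-elasticsearch example. The tier label key and its logging value are hypothetical and exist only to show a matchExpressions entry; the template labels satisfy both parts of the selector.

```yaml
# Fragment of a DaemonSet spec.
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
    matchExpressions:
    - key: tier              # hypothetical label key
      operator: In
      values:
      - logging
  template:
    metadata:
      labels:
        # Both conditions above are ANDed, so the template carries both labels.
        name: fluentd-elasticsearch
        tier: logging
```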

Running Pods on select Nodes


If you specify a .spec.template.spec.nodeSelector , then the DaemonSet controller will create Pods on nodes which match that node
selector. Likewise, if you specify a .spec.template.spec.affinity , then the DaemonSet controller will create Pods on nodes which match
that node affinity. If you do not specify either, then the DaemonSet controller will create Pods on all nodes.
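
A minimal sketch of restricting a DaemonSet to a subset of nodes with a node selector. The disktype: ssd label is a hypothetical node label, not one defined on this page.

```yaml
# Fragment of a DaemonSet spec.
spec:
  template:
    spec:
      nodeSelector:
        disktype: ssd    # hypothetical label; only nodes carrying it get a Pod
```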

How Daemon Pods are scheduled


A DaemonSet can be used to ensure that all eligible nodes run a copy of a Pod. The DaemonSet controller creates a Pod for each
eligible node and adds the spec.affinity.nodeAffinity field of the Pod to match the target host. After the Pod is created, the default
scheduler typically takes over and then binds the Pod to the target host by setting the .spec.nodeName field. If the new Pod cannot fit
on the node, the default scheduler may preempt (evict) some of the existing Pods based on the priority of the new Pod.

Note:
If it's important that the DaemonSet pod run on each node, it's often desirable to set the .spec.template.spec.priorityClassName
of the DaemonSet to a PriorityClass with a higher priority to ensure that this eviction occurs.
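
The sketch below shows one way this could look. The PriorityClass name daemonset-critical and the value 1000000 are hypothetical choices for illustration; pick a value that is higher than the priorities of the workloads the DaemonSet should be able to preempt.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical    # hypothetical name
value: 1000000                # hypothetical priority value
globalDefault: false
description: "Priority for node-level daemons that must run on every node."
---
# Fragment of a DaemonSet spec that references the class above.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
spec:
  template:
    spec:
      priorityClassName: daemonset-critical
```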

The user can specify a different scheduler for the Pods of the DaemonSet, by setting the .spec.template.spec.schedulerName field of
the DaemonSet.

The original node affinity specified at the .spec.template.spec.affinity.nodeAffinity field (if specified) is taken into consideration by
the DaemonSet controller when evaluating the eligible nodes, but is replaced on the created Pod with the node affinity that matches
the name of the eligible node.

nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchFields:
      - key: metadata.name
        operator: In
        values:
        - target-host-name

Taints and tolerations
The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods:

Toleration key                          Effect      Details

node.kubernetes.io/not-ready            NoExecute   DaemonSet Pods can be scheduled onto nodes that are not healthy or ready to accept Pods.
                                                    Any DaemonSet Pods running on such nodes will not be evicted.

node.kubernetes.io/unreachable          NoExecute   DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller.
                                                    Any DaemonSet Pods running on such nodes will not be evicted.

node.kubernetes.io/disk-pressure        NoSchedule  DaemonSet Pods can be scheduled onto nodes with disk pressure issues.

node.kubernetes.io/memory-pressure      NoSchedule  DaemonSet Pods can be scheduled onto nodes with memory pressure issues.

node.kubernetes.io/pid-pressure         NoSchedule  DaemonSet Pods can be scheduled onto nodes with process pressure issues.

node.kubernetes.io/unschedulable        NoSchedule  DaemonSet Pods can be scheduled onto nodes that are unschedulable.

node.kubernetes.io/network-unavailable  NoSchedule  Only added for DaemonSet Pods that request host networking, i.e., Pods having
                                                    spec.hostNetwork: true . Such DaemonSet Pods can be scheduled onto nodes with
                                                    unavailable network.

You can add your own tolerations to the Pods of a DaemonSet as well, by defining these in the Pod template of the DaemonSet.

Because the DaemonSet controller sets the node.kubernetes.io/unschedulable:NoSchedule toleration automatically, Kubernetes can
run DaemonSet Pods on nodes that are marked as unschedulable.

If you use a DaemonSet to provide an important node-level function, such as cluster networking, it is helpful that Kubernetes places
DaemonSet Pods on nodes before they are ready. For example, without that special toleration, you could end up in a deadlock
situation where the node is not marked as ready because the network plugin is not running there, and at the same time the network
plugin is not running on that node because the node is not yet ready.
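To make the effect of these tolerations concrete, here is a minimal Python sketch of how a toleration can be matched against a node taint. It is a simplified model for illustration, not the scheduler's code: the Taint and Toleration classes and the tolerates and untolerated helpers are hypothetical names, and the use of operator Exists for the automatically added tolerations is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # for example "NoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""  # an empty effect tolerates any effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches the given taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return toleration.key in ("", taint.key)
    return toleration.key == taint.key and toleration.value == taint.value


def untolerated(taints, tolerations):
    """Return the taints that none of the tolerations match."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Two of the tolerations from the table above (operator Exists is assumed).
DAEMONSET_TOLERATIONS = [
    Toleration("node.kubernetes.io/not-ready", "Exists", effect="NoExecute"),
    Toleration("node.kubernetes.io/unschedulable", "Exists", effect="NoSchedule"),
]
```

In this model, a node carrying only the node.kubernetes.io/unschedulable:NoSchedule taint leaves no untolerated taints for a DaemonSet Pod, while a Pod without any tolerations would be kept off that node.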

Communicating with Daemon Pods


Some possible patterns for communicating with Pods in a DaemonSet are:

Push: Pods in the DaemonSet are configured to send updates to another service, such as a stats database. They do not have
clients.
NodeIP and Known Port: Pods in the DaemonSet can use a hostPort , so that the pods are reachable via the node IPs. Clients
know the list of node IPs somehow, and know the port by convention.
DNS: Create a headless service with the same pod selector, and then discover DaemonSets using the endpoints resource or
retrieve multiple A records from DNS.
Service: Create a service with the same Pod selector, and use the service to reach a daemon on a random node. (There is no way to
reach a specific node.)

Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes and delete Pods from newly not-
matching nodes.

You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be updated. Also, the DaemonSet
controller will use the original template the next time a node (even with the same name) is created.

You can delete a DaemonSet. If you specify --cascade=orphan with kubectl , then the Pods will be left on the nodes. If you
subsequently create a new DaemonSet with the same selector, the new DaemonSet adopts the existing Pods. If any Pods need
replacing, the DaemonSet replaces them according to its updateStrategy .

You can perform a rolling update on a DaemonSet.

Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using init , upstartd , or systemd ). This is
perfectly fine. However, there are several advantages to running such processes via a DaemonSet:

Ability to monitor and manage logs for daemons in the same way as applications.
Same config language and tools (e.g. Pod templates, kubectl ) for daemons and applications.
Running daemons in containers with resource limits increases isolation between daemons and app containers. However, this
can also be accomplished by running the daemons in a container but not in a Pod.

Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a DaemonSet replaces Pods that are deleted
or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this
reason, you should use a DaemonSet rather than creating individual Pods.

Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These are called static pods. Unlike
DaemonSet, static Pods cannot be managed with kubectl or other Kubernetes API clients. Static Pods do not depend on the
apiserver, making them useful in cluster bootstrapping cases. Also, static Pods may be deprecated in the future.

Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have processes which are not expected to
terminate (e.g. web servers, storage servers).

Use a Deployment for stateless services, like frontends, where scaling up and down the number of replicas and rolling out updates
are more important than controlling exactly which host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod
always run on all or certain hosts, if the DaemonSet provides node-level functionality that allows other Pods to run correctly on that
particular node.

For example, network plugins often include a component that runs as a DaemonSet. The DaemonSet component makes sure that
the node where it's running has working cluster networking.

What's next
Learn about Pods.
Learn about static Pods, which are useful for running Kubernetes control plane components.
Find out how to use DaemonSets
Perform a rolling update on a DaemonSet
Perform a rollback on a DaemonSet (for example, if a roll out didn't work how you expected).
Understand how Kubernetes assigns Pods to Nodes.
Learn about device plugins and add-ons, which often run as DaemonSets.
DaemonSet is a top-level resource in the Kubernetes REST API. Read the DaemonSet object definition to understand the API for
daemon sets.
5 - Jobs
Jobs represent one-off tasks that run to completion and then stop.

A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully
terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful
completions is reached, the task (that is, the Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its
active Pods until the Job is resumed again.

A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first
Pod fails or is deleted (for example due to a node hardware failure or a node reboot).

You can also use a Job to run multiple Pods in parallel.

If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.

Running an example Job


Here is an example Job config. It computes π to 2000 places and prints it out. It takes around 10s to complete.

controllers/job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

You can run the example with this command:

kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml

The output is similar to this:

job.batch/pi created

Check on the status of the Job with kubectl :

kubectl describe job pi

To see the full object as YAML instead, run kubectl get job pi -o yaml . The output of kubectl describe is similar to this:


Name:             pi
Namespace:        default
Selector:         batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
Labels:           batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
                  batch.kubernetes.io/job-name=pi
                  ...
Annotations:      batch.kubernetes.io/job-tracking: ""
Parallelism:      1
Completions:      1
Start Time:       Mon, 02 Dec 2019 15:20:11 +0200
Completed At:     Mon, 02 Dec 2019 15:21:16 +0200
Duration:         65s
Pods Statuses:    0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:       batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c
                batch.kubernetes.io/job-name=pi
  Containers:
   pi:
    Image:      perl:5.34.0
    Port:       <none>
    Host Port:  <none>
    Command:
      perl
      -Mbignum=bpi
      -wle
      print bpi(2000)
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  21s   job-controller  Created pod: pi-xf9p4
  Normal  Completed         18s   job-controller  Job completed

To view completed Pods of a Job, use kubectl get pods .

To list all the Pods that belong to a Job in a machine readable form, you can use a command like this:

pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}')


echo $pods

The output is similar to this:

pi-5rwd7

Here, the selector is the same as the selector for the Job. The --output=jsonpath option specifies an expression with the name from
each Pod in the returned list.

View the standard output of one of the pods:

kubectl logs $pods

Another way to view the logs of a Job:

kubectl logs jobs/pi

The output is similar to this:


3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938

Writing a Job spec


As with all other Kubernetes config, a Job needs apiVersion , kind , and metadata fields.

When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the basis for naming those Pods. The
name of a Job must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best
compatibility, the name should follow the more restrictive rules for a DNS label. Even when the name is a DNS subdomain, the name
must be no longer than 63 characters.
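If you generate Job names programmatically, a quick check against the stricter DNS label rules can catch problems early. The sketch below assumes the usual RFC 1123 label form (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric character, at most 63 characters); is_dns_label is a hypothetical helper name, not part of any Kubernetes client.

```python
import re

# RFC 1123 label: lowercase alphanumerics or '-', alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_dns_label(name: str) -> bool:
    """Return True if name looks like a valid DNS label of at most 63 characters."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None
```

A name such as pi or pi-with-timeout passes this check, while a name containing a dot is a valid DNS subdomain but not a DNS label, so the check rejects it.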

A Job also needs a .spec section.

Job Labels
Job labels have the batch.kubernetes.io/ prefix for job-name and controller-uid .

Pod Template
The .spec.template is the only required field of the .spec .

The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .

In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate
restart policy.

Only a RestartPolicy equal to Never or OnFailure is allowed.

Pod selector
The .spec.selector field is optional. In almost all cases you should not specify it. See section specifying your own pod selector.

Parallel execution for Jobs


There are three main types of task suitable to run as a Job:

1. Non-parallel Jobs:
   normally, only one Pod is started, unless the Pod fails.
   the Job is complete as soon as its Pod terminates successfully.
2. Parallel Jobs with a fixed completion count:
   specify a non-zero positive value for .spec.completions .
   the Job represents the overall task, and is complete when there are .spec.completions successful Pods.
   when using .spec.completionMode="Indexed" , each Pod gets a different index in the range 0 to .spec.completions-1 .
3. Parallel Jobs with a work queue:
   do not specify .spec.completions , default to .spec.parallelism .
   the Pods must coordinate amongst themselves or an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
   each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
   when any Pod from the Job terminates with success, no new Pods are created.
   once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
   once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.

For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both are defaulted
to 1.

For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set
.spec.parallelism , or leave it unset and it will default to 1.

For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.

For more information about how to make use of the different types of job, see the job patterns section.
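The defaulting rules above can be summarized in a few lines of Python. This is an illustrative sketch only: job_shape and JobShape are hypothetical names, and the classification is a simplified reading of the three Job types described in this section rather than the API server's defaulting code.

```python
from typing import NamedTuple, Optional


class JobShape(NamedTuple):
    kind: str
    completions: Optional[int]  # None means "unset" (work queue Job)
    parallelism: int


def job_shape(completions: Optional[int] = None,
              parallelism: Optional[int] = None) -> JobShape:
    """Classify a Job by which of the two fields are set, applying the defaults."""
    if completions is None and parallelism is None:
        # Both unset: both default to 1, which gives a non-parallel Job.
        return JobShape("non-parallel", 1, 1)
    effective_parallelism = 1 if parallelism is None else parallelism
    if completions is not None:
        return JobShape("fixed completion count", completions, effective_parallelism)
    # completions unset, parallelism set: the Pods coordinate through a work queue.
    return JobShape("work queue", None, effective_parallelism)
```

For example, job_shape(completions=5) describes a fixed completion count Job that runs one Pod at a time, and job_shape(parallelism=3) describes a work queue Job with three workers.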

Controlling parallelism
The requested parallelism ( .spec.parallelism ) can be set to any non-negative value. If it is unspecified, it defaults to 1. If it is
specified as 0, then the Job is effectively paused until it is increased.

Actual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a variety of reasons:

For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of remaining
completions. Higher values of .spec.parallelism are effectively ignored.
For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining Pods are allowed to complete,
however.
If the Job Controller has not had time to react.
If the Job controller failed to create Pods for any reason (lack of ResourceQuota , lack of permission, etc.), then there may be
fewer pods than requested.
The Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.
When a Pod is gracefully shut down, it takes time to stop.
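For fixed completion count Jobs, the first point in the list above amounts to a simple upper bound. The following sketch expresses it in Python; max_active_pods is a hypothetical helper, and the real controller also accounts for the other factors listed (throttling, creation failures, Pods that are still shutting down).

```python
def max_active_pods(parallelism: int, completions: int, succeeded: int) -> int:
    """Upper bound on concurrently running Pods for a fixed completion count Job."""
    remaining = max(completions - succeeded, 0)
    # Parallelism above the number of remaining completions is effectively ignored.
    return min(parallelism, remaining)
```

With .spec.parallelism set to 3 and .spec.completions set to 12, the bound stays at 3 until only two completions remain, then drops to 2 and 1.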

Completion mode

ⓘ FEATURE STATE: Kubernetes v1.24 [stable]

Jobs with fixed completion count - that is, jobs that have non-null .spec.completions - can have a completion mode that is specified in
.spec.completionMode :

NonIndexed (default): the Job is considered complete when there have been .spec.completions successfully completed Pods. In
other words, each Pod completion is interchangeable with any other. Note that Jobs that have null .spec.completions are implicitly
NonIndexed .

Indexed: the Pods of a Job get an associated completion index from 0 to .spec.completions-1 . The index is available through
four mechanisms:

The Pod annotation batch.kubernetes.io/job-completion-index .


The Pod label batch.kubernetes.io/job-completion-index (for v1.28 and later). Note the feature gate PodIndexLabel must
be enabled to use this label, and it is enabled by default.
As part of the Pod hostname, following the pattern $(job-name)-$(index) . When you use an Indexed Job in combination
with a Service, Pods within the Job can use the deterministic hostnames to address each other via DNS. For more
information about how to configure this, see Job with Pod-to-Pod Communication.
From the containerized task, in the environment variable JOB_COMPLETION_INDEX .
The Job is considered complete when there is one successfully completed Pod for each index. For more information about how
to use this mode, see Indexed Job for Parallel Processing with Static Work Assignment.

Note:
Although rare, more than one Pod could be started for the same index (due to various reasons such as node failures, kubelet
restarts, or Pod evictions). In this case, only the first Pod that completes successfully will count towards the completion count
and update the status of the Job. The other Pods that are running or completed for the same index will be deleted by the Job
controller once they are detected.
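As a sketch of how a containerized task might use the completion index, the Python snippet below reads JOB_COMPLETION_INDEX and uses it to pick a static share of the work. The helper names (completion_index, my_shard, peer_hostname) and the striding scheme are assumptions made for illustration; any deterministic mapping from index to work would do.

```python
import os


def completion_index(env=os.environ) -> int:
    """Read the index that the Job controller assigned to this Pod."""
    return int(env["JOB_COMPLETION_INDEX"])


def my_shard(items, completions, env=os.environ):
    """Pick every completions-th item, starting at this Pod's index."""
    return items[completion_index(env)::completions]


def peer_hostname(job_name: str, index: int) -> str:
    """Hostname of the Pod for a given index, following $(job-name)-$(index)."""
    return f"{job_name}-{index}"
```

Passing the environment as a parameter keeps the functions easy to test outside a cluster, for example my_shard(list(range(10)), 3, {"JOB_COMPLETION_INDEX": "1"}) .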

Handling Pod and container failures


A container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit code, or the
container was killed for exceeding a memory limit, etc. If this happens, and the .spec.template.spec.restartPolicy = "OnFailure" ,
then the Pod stays on the node, but the container is re-run. Therefore, your program needs to handle the case when it is restarted
locally, or else specify .spec.template.spec.restartPolicy = "Never" . See pod lifecycle for more information on restartPolicy .

An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted,
deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never" . When a Pod fails, then the Job
controller starts a new Pod. This means that your application needs to handle the case when it is restarted in a new pod. In
particular, it needs to handle temporary files, locks, incomplete output and the like caused by previous runs.

By default, each pod failure is counted towards the .spec.backoffLimit limit, see pod backoff failure policy. However, you can
customize handling of pod failures by setting the Job's pod failure policy.

Additionally, you can choose to count the pod failures independently for each index of an Indexed Job by setting the
.spec.backoffLimitPerIndex field (for more information, see backoff limit per index).

Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never" ,
the same program may sometimes be started twice.

If you do specify .spec.parallelism and .spec.completions both greater than 1, then there may be multiple pods running at once.
Therefore, your pods must also be tolerant of concurrency.

If you specify the .spec.podFailurePolicy field, the Job controller does not consider a terminating Pod (a pod that has a
.metadata.deletionTimestamp field set) as a failure until that Pod is terminal (its .status.phase is Failed or Succeeded ). However, the
Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller
evaluates .backoffLimit and .podFailurePolicy for the relevant Job, taking this now-terminated Pod into consideration.

If this requirement is not satisfied (the Job does not specify .spec.podFailurePolicy ), the Job controller counts a terminating Pod
as an immediate failure, even if that Pod later terminates with phase: "Succeeded" .

Pod backoff failure policy


There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set
.spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed
Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s ...) capped at six
minutes.
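The delay sequence described above (10s, 20s, 40s and so on, capped at six minutes) can be reproduced with a short calculation. This is a sketch of the arithmetic only; backoff_delays is a hypothetical helper, and the base and cap values are taken from the text rather than from the controller source.

```python
def backoff_delays(retries: int, base: int = 10, cap: int = 360):
    """Delay in seconds before each retry: doubling from base, capped at cap."""
    return [min(base * 2 ** attempt, cap) for attempt in range(retries)]
```

The doubling hits the six-minute cap on the seventh retry, since the sixth delay is 320 seconds and the next doubling would be 640.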

The number of retries is calculated in two ways:

The number of Pods with .status.phase = "Failed" .


When using restartPolicy = "OnFailure" , the number of retries in all the containers of Pods with .status.phase equal to
Pending or Running .

If either of the calculations reaches the .spec.backoffLimit , the Job is considered failed.

Note:
If your job has restartPolicy = "OnFailure", keep in mind that your Pod running the Job will be terminated once the job backoff
limit has been reached. This can make debugging the Job's executable more difficult. We suggest setting restartPolicy = "Never"
when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.

Backoff limit per index

ⓘ FEATURE STATE: Kubernetes v1.29 [beta]

Note:
You can only configure the backoff limit per index for an Indexed Job, if you have the JobBackoffLimitPerIndex feature gate
enabled in your cluster.

When you run an indexed Job, you can choose to handle retries for pod failures independently for each index. To do so, set the
.spec.backoffLimitPerIndex to specify the maximal number of pod failures per index.

When the per-index backoff limit is exceeded for an index, Kubernetes considers the index as failed and adds it to the
.status.failedIndexes field. The succeeded indexes, those with successfully executed Pods, are recorded in the
.status.completedIndexes field, regardless of whether you set the backoffLimitPerIndex field.

Note that a failing index does not interrupt execution of other indexes. Once all indexes finish for a Job where you specified a
backoff limit per index, if at least one of those indexes did fail, the Job controller marks the overall Job as failed, by setting the Failed
condition in the status. The Job gets marked as failed even if some, potentially nearly all, of the indexes were processed successfully.

You can additionally limit the maximal number of indexes marked failed by setting the .spec.maxFailedIndexes field. When the
number of failed indexes exceeds the maxFailedIndexes field, the Job controller triggers termination of all remaining running Pods
for that Job. Once all pods are terminated, the entire Job is marked failed by the Job controller, by setting the Failed condition in the
Job status.

Here is an example manifest for a Job that defines a backoffLimitPerIndex :

/controllers/job-backoff-limit-per-index-example.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-example
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed  # required for the feature
  backoffLimitPerIndex: 1  # maximal number of failures per index
  maxFailedIndexes: 5      # maximal number of failed indexes before terminating the Job execution
  template:
    spec:
      restartPolicy: Never # required for the feature
      containers:
      - name: example
        image: python
        command:           # The Job fails as there is at least one failed index
                           # (all even indexes fail in here), yet all indexes
                           # are executed as maxFailedIndexes is not exceeded.
        - python3
        - -c
        - |
          import os, sys
          print("Hello world")
          if int(os.environ.get("JOB_COMPLETION_INDEX")) % 2 == 0:
            sys.exit(1)

In the example above, the Job controller allows for one restart for each of the indexes. When the total number of failed indexes
exceeds 5, then the entire Job is terminated.

Once the job is finished, the Job status looks as follows:

kubectl get -o yaml job job-backoff-limit-per-index-example

status:
  completedIndexes: 1,3,5,7,9
  failedIndexes: 0,2,4,6,8
  succeeded: 5          # 1 succeeded pod for each of 5 succeeded indexes
  failed: 10            # 2 failed pods (1 retry) for each of 5 failed indexes
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: FailureTarget
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed

The Job controller adds the FailureTarget Job condition to trigger Job termination and cleanup. When all of the Job Pods are
terminated, the Job controller adds the Failed condition with the same values for reason and message as the FailureTarget Job
condition. For details, see Termination of Job Pods.
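To see where the counts in that status come from, the following Python sketch replays the example as a toy model. It processes indexes one after another and ignores parallelism and timing, so it is only an approximation of what the Job controller does; simulate_indexed_job and its parameters are hypothetical names.

```python
def simulate_indexed_job(completions, backoff_limit_per_index, max_failed_indexes, index_fails):
    """Sequential toy model of per-index retries for an Indexed Job."""
    completed, failed_indexes = [], []
    succeeded_pods = failed_pods = 0
    for index in range(completions):
        failures = 0
        while True:
            if not index_fails(index):
                succeeded_pods += 1
                completed.append(index)
                break
            failures += 1
            failed_pods += 1
            if failures > backoff_limit_per_index:
                # Per-index limit exceeded: the index is marked failed.
                failed_indexes.append(index)
                break
        if max_failed_indexes is not None and len(failed_indexes) > max_failed_indexes:
            break  # too many failed indexes: stop running the remaining ones
    return {
        "completedIndexes": completed,
        "failedIndexes": failed_indexes,
        "succeeded": succeeded_pods,
        "failed": failed_pods,
        "jobFailed": bool(failed_indexes),
    }


# Same parameters as the manifest above: every even index always fails.
status = simulate_indexed_job(10, 1, 5, index_fails=lambda i: i % 2 == 0)
```

Each even index consumes two Pods (the first attempt plus one retry) before it is marked failed, which gives ten failed Pods in total, while each odd index succeeds on its first Pod.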

Additionally, you may want to use the per-index backoff along with a pod failure policy. When using per-index backoff, there is a new
FailIndex action available which allows you to avoid unnecessary retries within an index.

Pod failure policy

ⓘ FEATURE STATE: Kubernetes v1.31 [stable] (enabled by default: true)

A Pod failure policy, defined with the .spec.podFailurePolicy field, enables your cluster to handle Pod failures based on the
container exit codes and the Pod conditions.

In some situations, you may want to have better control when handling Pod failures than the control provided by the Pod backoff
failure policy, which is based on the Job's .spec.backoffLimit . These are some examples of use cases:

To optimize costs of running workloads by avoiding unnecessary Pod restarts, you can terminate a Job as soon as one of its
Pods fails with an exit code indicating a software bug.
To guarantee that your Job finishes even if there are disruptions, you can ignore Pod failures caused by disruptions (such as
preemption, API-initiated eviction or taint-based eviction) so that they don't count towards the .spec.backoffLimit limit of
retries.

You can configure a Pod failure policy, in the .spec.podFailurePolicy field, to meet the above use cases. This policy can handle Pod
failures based on the container exit codes and the Pod conditions.

Here is a manifest for a Job that defines a podFailurePolicy :

/controllers/job-pod-failure-policy-example.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"]        # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main      # optional
        operator: In             # one of: In, NotIn
        values: [42]
    - action: Ignore             # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget   # indicates Pod disruption

In the example above, the first rule of the Pod failure policy specifies that the Job should be marked failed if the main container fails
with the 42 exit code. The following are the rules for the main container specifically:

an exit code of 0 means that the container succeeded


an exit code of 42 means that the entire Job failed
any other exit code represents that the container failed, and hence the entire Pod. The Pod will be re-created if the total
number of restarts is below backoffLimit . If the backoffLimit is reached, the entire Job fails.

Note:
Because the Pod template specifies a restartPolicy: Never, the kubelet does not restart the main container in that particular Pod.

The second rule of the Pod failure policy, which specifies the Ignore action for failed Pods with condition DisruptionTarget , excludes Pod
disruptions from being counted towards the .spec.backoffLimit limit of retries.

Note:
If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job is running multiple Pods, Kubernetes
terminates all the Pods in that Job that are still Pending or Running.

These are some requirements and semantics of the API:

if you want to use a .spec.podFailurePolicy field for a Job, you must also define that Job's pod template with
.spec.restartPolicy set to Never .
the Pod failure policy rules you specify under spec.podFailurePolicy.rules are evaluated in order. Once a rule matches a Pod
failure, the remaining rules are ignored. When no rule matches the Pod failure, the default handling applies.
you may want to restrict a rule to a specific container by specifying its name
in spec.podFailurePolicy.rules[*].onExitCodes.containerName . When not specified the rule applies to all containers. When
specified, it should match one of the container or initContainer names in the Pod template.
you may specify the action taken when a Pod failure policy is matched by spec.podFailurePolicy.rules[*].action . Possible
values are:
FailJob : use to indicate that the Pod's job should be marked as Failed and all running Pods should be terminated.

Ignore : use to indicate that the counter towards the .spec.backoffLimit should not be incremented and a replacement
Pod should be created.
Count : use to indicate that the Pod should be handled in the default way. The counter towards the .spec.backoffLimit
should be incremented.
FailIndex : use this action along with backoff limit per index to avoid unnecessary retries within the index of a failed pod.
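The ordered, first-match evaluation described above can be sketched in Python as follows. This is a simplified model, not the Job controller's implementation: action_for and rule_matches are hypothetical names, Pod conditions are reduced to a set of condition types, and the sketch assumes that containers which exited with code 0 are excluded from exit code matching.

```python
def rule_matches(rule, exit_codes, conditions):
    """Check one rule against a failed Pod's container exit codes and condition types."""
    on_exit = rule.get("onExitCodes")
    if on_exit is not None:
        name = on_exit.get("containerName")  # None means the rule covers all containers
        codes = [code for container, code in exit_codes.items()
                 if code != 0 and name in (None, container)]
        if on_exit["operator"] == "In":
            return any(code in on_exit["values"] for code in codes)
        return any(code not in on_exit["values"] for code in codes)  # NotIn
    return any(c["type"] in conditions for c in rule.get("onPodConditions", []))


def action_for(rules, exit_codes, conditions=()):
    """Return the action of the first matching rule, or the default handling."""
    for rule in rules:
        if rule_matches(rule, exit_codes, conditions):
            return rule["action"]
    return "Count"  # no rule matched: the failure counts towards backoffLimit


# The two rules from the example manifest, written as Python data.
RULES = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "main", "operator": "In", "values": [42]}},
    {"action": "Ignore",
     "onPodConditions": [{"type": "DisruptionTarget"}]},
]
```

In this model, action_for(RULES, {"main": 42}) yields FailJob , a Pod that failed with a DisruptionTarget condition yields Ignore , and any other failure falls through to Count .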

Note:
When you use a podFailurePolicy, the job controller only matches Pods in the Failed phase. Pods with a deletion timestamp that
are not in a terminal phase (Failed or Succeeded) are considered still terminating. This implies that terminating pods retain a
tracking finalizer until they reach a terminal phase. Since Kubernetes 1.27, Kubelet transitions deleted pods to a terminal phase
(see: Pod Phase). This ensures that deleted pods have their finalizers removed by the Job controller.

Note:
Starting with Kubernetes v1.28, when Pod failure policy is used, the Job controller recreates terminating Pods only once these
Pods reach the terminal Failed phase. This behavior is similar to podReplacementPolicy: Failed. For more information, see Pod
replacement policy.

When you use the podFailurePolicy , and the Job fails due to the pod matching the rule with the FailJob action, then the Job
controller triggers the Job termination process by adding the FailureTarget condition. For more details, see Job termination and
cleanup.

Success policy
ⓘ FEATURE STATE: Kubernetes v1.31 [beta] (enabled by default: true)

Note:
You can only configure a success policy for an Indexed Job if you have the JobSuccessPolicy feature gate enabled in your cluster.

When creating an Indexed Job, you can define when a Job can be declared as succeeded using a .spec.successPolicy , based on the
pods that succeeded.

By default, a Job succeeds when the number of succeeded Pods equals .spec.completions . These are some situations where you
might want additional control for declaring a Job succeeded:

When running simulations with different parameters, you might not need all the simulations to succeed for the overall Job to
be successful.
When following a leader-worker pattern, only the success of the leader determines the success or failure of a Job. Examples of
this are frameworks like MPI and PyTorch etc.

You can configure a success policy, in the .spec.successPolicy field, to meet the above use cases. This policy can handle Job success
based on the succeeded pods. After the Job meets the success policy, the job controller terminates the lingering Pods. A success
policy is defined by rules. Each rule can take one of the following forms:

When you specify the succeededIndexes only, once all indexes specified in the succeededIndexes succeed, the job controller
marks the Job as succeeded. The succeededIndexes must be a list of intervals between 0 and .spec.completions-1 .
When you specify the succeededCount only, once the number of succeeded indexes reaches the succeededCount , the job
controller marks the Job as succeeded.
When you specify both succeededIndexes and succeededCount , once the number of succeeded indexes from the subset of
indexes specified in the succeededIndexes reaches the succeededCount , the job controller marks the Job as succeeded.

Note that when you specify multiple rules in the .spec.successPolicy.rules , the job controller evaluates the rules in order. Once the
Job meets a rule, the job controller ignores remaining rules.

Here is a manifest for a Job with successPolicy :

/controllers/job-success-policy.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: job-success
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the success policy
  successPolicy:
    rules:
      - succeededIndexes: 0,2-3
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
                          # the overall Job is a success.
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
              sys.exit(0)
            else:
              sys.exit(1)
      restartPolicy: Never

In the example above, both succeededIndexes and succeededCount have been specified. Therefore, the job controller will mark the
Job as succeeded and terminate the lingering Pods when any of the specified indexes (0, 2, or 3) succeeds. The Job that meets the
success policy gets the SuccessCriteriaMet condition with a SuccessPolicy reason. After the removal of the lingering Pods is issued,
the Job gets the Complete condition.

Note that succeededIndexes is a comma-separated list of indexes and intervals. Each interval is represented by the first
and last element of the series, separated by a hyphen.
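The three rule forms and the interval notation can be modeled with a small amount of Python. This sketch is an illustration of the rules as described in this section, not the controller's code; parse_indexes and meets_success_policy are hypothetical helper names.

```python
def parse_indexes(spec: str) -> set:
    """Expand a string such as "0,2-3" into the set {0, 2, 3}."""
    result = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            result.update(range(int(first), int(last) + 1))
        else:
            result.add(int(part))
    return result


def meets_success_policy(rules, succeeded_indexes) -> bool:
    """Evaluate the rules in order; the Job succeeds as soon as one rule is met."""
    succeeded = set(succeeded_indexes)
    for rule in rules:
        if "succeededIndexes" in rule:
            wanted = parse_indexes(rule["succeededIndexes"])
            candidates = succeeded & wanted
            # Without succeededCount, every listed index has to succeed.
            required = rule.get("succeededCount", len(wanted))
        else:
            candidates = succeeded
            required = rule["succeededCount"]
        if len(candidates) >= required:
            return True
    return False
```

For the manifest above, meets_success_policy([{"succeededIndexes": "0,2-3", "succeededCount": 1}], {2}) is true, because index 2 is one of the listed indexes and a single success is enough.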

Note:
When you specify both a success policy and some terminating policies such as .spec.backoffLimit and .spec.podFailurePolicy,
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.

Job termination and cleanup


When a Job completes, no more Pods are created, but the Pods are usually not deleted either. Keeping them around allows you to
still view the logs of completed pods to check for errors, warnings, or other diagnostic output. The job object also remains after it is
completed so that you can view its status. It is up to the user to delete old jobs after noting their status. Delete the job with kubectl
(e.g. kubectl delete jobs/pi or kubectl delete -f ./job.yaml ). When you delete the job using kubectl , all the pods it created are
deleted too.

By default, a Job will run uninterrupted unless a Pod fails ( restartPolicy=Never ) or a Container exits in error
( restartPolicy=OnFailure ), at which point the Job defers to the .spec.backoffLimit described above. Once .spec.backoffLimit has
been reached the Job will be marked as failed and any running Pods will be terminated.

Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to
a number of seconds. The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a
Job reaches activeDeadlineSeconds , all of its running Pods are terminated and the Job status will become type: Failed with reason:
DeadlineExceeded .

Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit . Therefore, a Job that is retrying one or
more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds , even if the
backoffLimit is not yet reached.

Example:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never

Note that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds field. Ensure that you set this
field at the proper level.
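
As a rough illustration of the two levels, the sketch below sets the field in both places. The Job name and the values 600 and 120 are assumptions chosen for this example only. The Job-level field bounds the Job as a whole, while the Pod-level field bounds each Pod individually.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deadline-levels-demo        # hypothetical name
spec:
  activeDeadlineSeconds: 600        # Job level: applies to the whole Job, however many Pods it creates
  backoffLimit: 5
  template:
    spec:
      activeDeadlineSeconds: 120    # Pod level: applies to each Pod separately
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
```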

Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job
status is type: Failed . That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit
result in a permanent Job failure that requires manual intervention to resolve.

Terminal Job conditions


A Job has two possible terminal states, each of which has a corresponding Job condition:

Succeeded: Job condition Complete


Failed: Job condition Failed

Jobs fail for the following reasons:


The number of Pod failures exceeded the specified .spec.backoffLimit in the Job specification. For details, see Pod backoff
failure policy.
The Job runtime exceeded the specified .spec.activeDeadlineSeconds .
An indexed Job that used .spec.backoffLimitPerIndex has failed indexes. For details, see Backoff limit per index.
The number of failed indexes in the Job exceeded the specified .spec.maxFailedIndexes . For details, see Backoff limit per index.
A failed Pod matches a rule in .spec.podFailurePolicy that has the FailJob action. For details about how Pod failure policy
rules might affect failure evaluation, see Pod failure policy.

Jobs succeed for the following reasons:

The number of succeeded Pods reached the specified .spec.completions


The criteria specified in .spec.successPolicy are met. For details, see Success policy.

In Kubernetes v1.31 and later the Job controller delays the addition of the terminal conditions, Failed or Complete , until all of the
Job Pods are terminated.

In Kubernetes v1.30 and earlier, the Job controller added the Complete or the Failed Job terminal conditions as soon as the Job
termination process was triggered and all Pod finalizers were removed. However, some Pods would still be running or terminating at
the moment that the terminal condition was added.

In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions after all of the Pods are terminated. You can
control this behavior by using the JobManagedBy and the JobPodReplacementPolicy (both enabled by default) feature gates.

Termination of Job pods


The Job controller adds the FailureTarget condition or the SuccessCriteriaMet condition to the Job to trigger Pod termination after a
Job meets either the success or failure criteria.

Factors like terminationGracePeriodSeconds might increase the amount of time from the moment that the Job controller adds the
FailureTarget condition or the SuccessCriteriaMet condition to the moment that all of the Job Pods terminate and the Job controller
adds a terminal condition ( Failed or Complete ).

You can use the FailureTarget or the SuccessCriteriaMet condition to evaluate whether the Job has failed or succeeded without
having to wait for the controller to add a terminal condition.

For example, you might want to decide when to create a replacement Job that replaces a failed Job. If you replace the failed Job when
the FailureTarget condition appears, your replacement Job runs sooner, but could result in Pods from the failed and the
replacement Job running at the same time, using extra compute resources.

Alternatively, if your cluster has limited resource capacity, you could choose to wait until the Failed condition appears on the Job,
which would delay your replacement Job but would ensure that you conserve resources by waiting until all of the failed Pods are
removed.

Clean up finished jobs automatically


Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If
the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the
specified capacity-based cleanup policy.

TTL mechanism for finished Jobs

ⓘ FEATURE STATE: Kubernetes v1.23 [stable]

Another way to clean up finished Jobs (either Complete or Failed ) automatically is to use a TTL mechanism provided by a TTL
controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.

When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together
with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.

For example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never

The Job pi-with-ttl will be eligible to be automatically deleted, 100 seconds after it finishes.

If the field is set to 0 , the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job
won't be cleaned up by the TTL controller after it finishes.

Note:
We recommend setting the ttlSecondsAfterFinished field, because unmanaged Jobs (Jobs that you created directly, and not
indirectly through other workload APIs such as CronJob) have a default deletion policy of orphanDependents , which causes Pods created
by an unmanaged Job to be left around after that Job is fully deleted. Even though the control plane eventually garbage collects
the Pods from a deleted Job after they either fail or complete, those lingering Pods may sometimes degrade cluster performance
or, in the worst case, cause the cluster to go offline.

You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a particular namespace can
consume.
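
A minimal sketch of such a cap is shown below. The quota name, the namespace, and the numeric limits are assumptions made for illustration; pick values that match your own cluster.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-quota          # hypothetical name
  namespace: batch-jobs      # hypothetical namespace
spec:
  hard:
    pods: "50"               # at most 50 Pods in a non-terminal state in this namespace
    count/jobs.batch: "100"  # at most 100 Job objects in this namespace
```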

Job patterns
The Job object can be used to process a set of independent but related work items. These might be emails to be sent, frames to be
rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.

In a complex system, there may be multiple different sets of work items. Here we are just considering one set of work items that the
user wants to manage together — a batch job.

There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:

One Job object for each work item, versus a single Job object for all work items. One Job per work item creates some overhead
for the user and for the system to manage large numbers of Job objects. A single Job for all work items is better for large
numbers of items.
Number of Pods created equals number of work items, versus each Pod can process multiple work items. When the number of
Pods equals the number of work items, the Pods typically require less modification to existing code and containers. Having
each Pod process multiple work items is better for large numbers of items.
Several approaches use a work queue. This requires running a queue service, and modifications to the existing program or
container to make it use the work queue. Other approaches are easier to adapt to an existing containerised application.
When the Job is associated with a headless Service, you can enable the Pods within a Job to communicate with each other to
collaborate in a computation.

The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names are also links to
examples and more detailed description.

Pattern                                   Single Job object   Fewer pods than work items?   Use app unmodified?
Queue with Pod Per Work Item              ✓                   -                             sometimes
Queue with Variable Pod Count             ✓                   ✓                             -
Indexed Job with Static Work Assignment   ✓                   -                             ✓
Job with Pod-to-Pod Communication         ✓                   sometimes                     sometimes
Job Template Expansion                    -                   -                             ✓

When you specify completions with .spec.completions , each Pod created by the Job controller has an identical spec . This means
that all pods for a task will have the same command line and the same image, the same volumes, and (almost) the same
environment variables. These patterns are different ways to arrange for pods to work on different things.

This table shows the required settings for .spec.parallelism and .spec.completions for each of the patterns. Here, W is the number
of work items.

Pattern .spec.completions .spec.parallelism

Queue with Pod Per Work Item W any

Queue with Variable Pod Count null any

Indexed Job with Static Work Assignment W any

Job with Pod-to-Pod Communication W W

Job Template Expansion 1 should be 1
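
As one illustration of the table, a "Queue with Variable Pod Count" Job leaves .spec.completions unset and picks any parallelism. The sketch below assumes a hypothetical Job name and uses a busybox command as a stand-in for a real queue worker, which would exit successfully once the queue is empty.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-variable-pods    # hypothetical name
spec:
  # .spec.completions is intentionally omitted (null)
  parallelism: 3               # "any" value; 3 is an arbitrary choice
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.28
        # Stand-in for a worker that pulls items from a queue until it is empty
        command: ["sh", "-c", "echo draining queue; sleep 5"]
      restartPolicy: OnFailure
```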

Advanced usage
Suspending a Job

ⓘ FEATURE STATE: Kubernetes v1.24 [stable]

When a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements and will continue to
do so until the Job is complete. However, you may want to temporarily suspend a Job's execution and resume it later, or start Jobs in
suspended state and have a custom controller decide later when to start them.

To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you want to resume it again, update it to
false. Creating a Job with .spec.suspend set to true will create it in the suspended state.

When a Job is resumed from suspension, its .status.startTime field will be reset to the current time. This means that the
.spec.activeDeadlineSeconds timer will be stopped and reset when a Job is suspended and resumed.

When you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a SIGTERM signal. The
Pod's graceful termination period will be honored and your Pod must handle this signal in this period. This may involve saving
progress for later or undoing changes. Pods terminated this way will not count towards the Job's completions count.

An example Job definition in the suspended state looks like this:

kubectl get job myjob -o yaml


apiVersion: batch/v1
kind: Job
metadata:
  name: myjob
spec:
  suspend: true
  parallelism: 1
  completions: 5
  template:
    spec:
      ...

You can also toggle Job suspension by patching the Job using the command line.

Suspend an active Job:

kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":true}}'

Resume a suspended Job:

kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'

The Job's status can be used to determine if a Job is suspended or has been suspended in the past:

kubectl get jobs/myjob -o yaml

apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  conditions:
  - lastProbeTime: "2021-02-05T13:14:33Z"
    lastTransitionTime: "2021-02-05T13:14:33Z"
    status: "True"
    type: Suspended
  startTime: "2021-02-05T13:13:48Z"

The Job condition of type "Suspended" with status "True" means the Job is suspended; the lastTransitionTime field can be used to
determine how long the Job has been suspended for. If the status of that condition is "False", then the Job was previously suspended
and is now running. If such a condition does not exist in the Job's status, the Job has never been suspended.

Events are also created when the Job is suspended and resumed:

kubectl describe jobs/myjob


Name:           myjob
...
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  12m   job-controller  Created pod: myjob-hlrpl
  Normal  SuccessfulDelete  11m   job-controller  Deleted pod: myjob-hlrpl
  Normal  Suspended         11m   job-controller  Job suspended
  Normal  SuccessfulCreate  3s    job-controller  Created pod: myjob-jvb44
  Normal  Resumed           3s    job-controller  Job resumed

The last four events, particularly the "Suspended" and "Resumed" events, are directly a result of toggling the .spec.suspend field. In
the time between these two events, we see that no Pods were created, but Pod creation restarted as soon as the Job was resumed.

Mutable Scheduling Directives

ⓘ FEATURE STATE: Kubernetes v1.27 [stable]

In most cases, a parallel job will want the pods to run with constraints, like all in the same zone, or all either on GPU model x or y but
not a mix of both.

The suspend field is the first step towards achieving those semantics. Suspend allows a custom queue controller to decide when a
job should start; however, once a job is unsuspended, a custom queue controller has no influence on where the pods of a job will
actually land.

This feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers the ability to
influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler. This is allowed only
for suspended Jobs that have never been unsuspended before.

The fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels, annotations and
scheduling gates.
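
The sketch below shows the kind of Job this applies to: one created in the suspended state, whose Pod template carries scheduling directives that a queue controller could still rewrite before the first resume. The Job name, zone value, and taint key are hypothetical.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: gated-job                             # hypothetical name
spec:
  suspend: true                               # created suspended and never yet resumed
  parallelism: 2
  completions: 2
  template:
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: zone-a   # hypothetical zone; updatable while suspended
      tolerations:
      - key: example.com/batch                # hypothetical taint key; also updatable
        operator: Exists
        effect: NoSchedule
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo running; sleep 5"]
      restartPolicy: Never
```

Once a controller has settled on the placement, it sets .spec.suspend to false and kube-scheduler takes over the actual pod-to-node assignment.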

Specifying your own Pod selector


Normally, when you create a Job object, you do not specify .spec.selector . The system defaulting logic adds this field when the Job
is created. It picks a selector value that will not overlap with any other jobs.

However, in some cases, you might need to override this automatically set selector. To do this, you can specify the .spec.selector of
the Job.

Be very careful when doing this. If you specify a label selector which is not unique to the pods of that Job, and which matches
unrelated Pods, then pods of the unrelated job may be deleted, or this Job may count other Pods as completing it, or one or both
Jobs may refuse to create Pods or run to completion. If a non-unique selector is chosen, then other controllers (e.g.
ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes will not stop you from making a mistake
when specifying .spec.selector .

Here is an example of a case when you might want to use this feature.

Say Job old is already running. You want existing Pods to keep running, but you want the rest of the Pods it creates to use a
different pod template and for the Job to have a new name. You cannot update the Job because these fields are not updatable.
Therefore, you delete Job old but leave its pods running, using kubectl delete jobs/old --cascade=orphan . Before deleting it, you
make a note of what selector it uses:

kubectl get job old -o yaml

The output is similar to this:


kind: Job
metadata:
  name: old
  ...
spec:
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...

Then you create a new Job with name new and you explicitly specify the same selector. Since the existing Pods have label
batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002 , they are controlled by Job new as well.

You need to specify manualSelector: true in the new Job since you are not using the selector that the system normally generates for
you automatically.

kind: Job
metadata:
  name: new
  ...
spec:
  manualSelector: true
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...

The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002 . Setting manualSelector: true tells the system
that you know what you are doing and to allow this mismatch.

Job tracking with finalizers

ⓘ FEATURE STATE: Kubernetes v1.26 [stable]

The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is removed from the API server. To do
that, the Job controller creates Pods with the finalizer batch.kubernetes.io/job-tracking . The controller removes the finalizer only
after the Pod has been accounted for in the Job status, allowing the Pod to be removed by other controllers or users.

Note:
See My pod stays terminating if you observe that pods from a Job are stuck with the tracking finalizer.

Elastic Indexed Jobs

ⓘ FEATURE STATE: Kubernetes v1.31 [stable] (enabled by default: true)

You can scale Indexed Jobs up or down by mutating both .spec.parallelism and .spec.completions together such that
.spec.parallelism == .spec.completions . When scaling down, Kubernetes removes the Pods with higher indexes.

Use cases for elastic Indexed Jobs include batch workloads which require scaling an indexed Job, such as MPI, Horovod, Ray, and
PyTorch training jobs.
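
A minimal sketch of a Job that can be scaled this way is shown below. The Job name and the value 4 are assumptions. To scale, you would update both fields to the same new value in a single change.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: elastic-indexed-demo   # hypothetical name
spec:
  completionMode: Indexed
  completions: 4               # keep equal to parallelism when scaling
  parallelism: 4
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.28
        # JOB_COMPLETION_INDEX is set by Kubernetes for Pods of an Indexed Job
        command: ["sh", "-c", "echo working on index $JOB_COMPLETION_INDEX"]
      restartPolicy: Never
```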

Delayed creation of replacement pods

ⓘ FEATURE STATE: Kubernetes v1.29 [beta]


Note:
You can only set podReplacementPolicy on Jobs if you enable the JobPodReplacementPolicy feature gate (enabled by default).

By default, the Job controller recreates Pods as soon as they either fail or are terminating (have a deletion timestamp). This means that,
at a given time, when some of the Pods are terminating, the number of running Pods for a Job can be greater than parallelism or
greater than one Pod per index (if you are using an Indexed Job).

You may choose to create replacement Pods only when the terminating Pod is fully terminal (has status.phase: Failed ). To do this,
set .spec.podReplacementPolicy: Failed . The default replacement policy depends on whether the Job has a podFailurePolicy set.
With no Pod failure policy defined for a Job, omitting the podReplacementPolicy field selects the TerminatingOrFailed replacement
policy: the control plane creates replacement Pods immediately upon Pod deletion (as soon as the control plane sees that a Pod for
this Job has deletionTimestamp set). For Jobs with a Pod failure policy set, the default podReplacementPolicy is Failed , and no other
value is permitted. See Pod failure policy to learn more about Pod failure policies for Jobs.

kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...

Provided your cluster has the feature gate enabled, you can inspect the .status.terminating field of a Job. The value of the field is
the number of Pods owned by the Job that are currently terminating.

kubectl get jobs/myjob -o yaml

apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase

Delegation of managing a Job object to external controller

ⓘ FEATURE STATE: Kubernetes v1.32 [beta] (enabled by default: true)

Note:
You can only set the managedBy field on Jobs if you enable the JobManagedBy feature gate (enabled by default).

This feature allows you to disable the built-in Job controller, for a specific Job, and delegate reconciliation of the Job to an external
controller.

You indicate the controller that reconciles the Job by setting a custom value for the spec.managedBy field - any value other than
kubernetes.io/job-controller . The value of the field is immutable.
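
For illustration, a Job delegated to an external controller might look like the sketch below. The Job name and the controller identifier example.com/custom-job-controller are hypothetical; the built-in Job controller would skip this Job, so nothing happens unless a controller that recognizes that value is running.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: externally-managed                        # hypothetical name
spec:
  managedBy: example.com/custom-job-controller    # hypothetical value; cannot be changed after creation
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo hello"]
      restartPolicy: Never
```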

Note:
When using this feature, make sure the controller indicated by the field is installed, otherwise the Job may not be reconciled at
all.

Note:
When developing an external Job controller be aware that your controller needs to operate in a fashion conformant with the
definitions of the API spec and status fields of the Job object.

Please review these in detail in the Job API. We also recommend that you run the e2e conformance tests for the Job object to
verify your implementation.

Finally, when developing an external Job controller make sure it does not use the batch.kubernetes.io/job-tracking finalizer,
reserved for the built-in controller.

Warning:
If you are considering disabling the JobManagedBy feature gate, or downgrading the cluster to a version without the feature gate
enabled, check if there are jobs with a custom value of the spec.managedBy field. If there are such jobs, there is a risk that they
might be reconciled by two controllers after the operation: the built-in Job controller and the external controller indicated by the
field value.

Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be restarted. However, a Job will create
new Pods to replace terminated ones. For this reason, we recommend that you use a Job rather than a bare Pod, even if your
application requires only a single Pod.

Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods which are not expected to terminate (e.g.
web servers), and a Job manages Pods that are expected to terminate (e.g. batch tasks).

As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to OnFailure or Never . (Note: If
RestartPolicy is not set, the default value is Always .)

Single Job starts controller Pod


Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those
Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with
Kubernetes.

One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see
spark example), runs a spark driver, and then cleans up.

An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but maintains complete
control over what Pods are created and how work is assigned to them.

What's next
Learn about Pods.
Read about different ways of running Jobs:
Coarse Parallel Processing Using a Work Queue
Fine Parallel Processing Using a Work Queue
Use an indexed Job for parallel processing with static work assignment
Create multiple Jobs based on a template: Parallel Processing using Expansions
Follow the links within Clean up finished jobs automatically to learn more about how your cluster can clean up completed and/or
failed tasks.
Job is part of the Kubernetes REST API. Read the Job object definition to understand the API for jobs.

Read about CronJob, which you can use to define a series of Jobs that will run based on a schedule, similar to the UNIX tool
cron .
Practice how to configure handling of retriable and non-retriable pod failures using podFailurePolicy , based on the step-by-
step examples.
6 - Automatic Cleanup for Finished Jobs
A time-to-live mechanism to clean up old Jobs that have finished execution.

ⓘ FEATURE STATE: Kubernetes v1.23 [stable]

When your Job has finished, it's useful to keep that Job in the API (and not immediately delete the Job) so that you can tell whether
the Job succeeded or failed.

Kubernetes' TTL-after-finished controller provides a TTL (time to live) mechanism to limit the lifetime of Job objects that have finished
execution.

Cleanup for finished Jobs


The TTL-after-finished controller is only supported for Jobs. You can use this mechanism to clean up finished Jobs (either Complete
or Failed ) automatically by specifying the .spec.ttlSecondsAfterFinished field of a Job, as in this example.

The TTL-after-finished controller assumes that a Job is eligible to be cleaned up TTL seconds after the Job has finished. The timer
starts once the status condition of the Job changes to show that the Job is either Complete or Failed ; once the TTL has expired, that
Job becomes eligible for cascading removal. When the TTL-after-finished controller cleans up a job, it will delete it cascadingly, that is
to say it will delete its dependent objects together with it.

Kubernetes honors object lifecycle guarantees on the Job, such as waiting for finalizers.

You can set the TTL seconds at any time. Here are some examples for setting the .spec.ttlSecondsAfterFinished field of a Job:

Specify this field in the Job manifest, so that a Job can be cleaned up automatically some time after it finishes.
Manually set this field of existing, already finished Jobs, so that they become eligible for cleanup.
Use a mutating admission webhook to set this field dynamically at Job creation time. Cluster administrators can use this to
enforce a TTL policy for finished jobs.
Use a mutating admission webhook to set this field dynamically after the Job has finished, and choose different TTL values
based on, for example, the Job's status or labels. For this case, the webhook needs to detect changes to the .status of the Job and only set a TTL
when the Job is being marked as completed.
Write your own controller to manage the cleanup TTL for Jobs that match a particular selector.

Caveats
Updating TTL for finished Jobs
You can modify the TTL period, that is, the .spec.ttlSecondsAfterFinished field of Jobs, after the Job is created or has finished. If you extend
the TTL period after the existing ttlSecondsAfterFinished period has expired, Kubernetes doesn't guarantee to retain that Job, even
if an update to extend the TTL returns a successful API response.

Time skew
Because the TTL-after-finished controller uses timestamps stored in the Kubernetes jobs to determine whether the TTL has expired
or not, this feature is sensitive to time skew in your cluster, which may cause the control plane to clean up Job objects at the wrong
time.

Clocks aren't always correct, but the difference should be very small. Please be aware of this risk when setting a non-zero TTL.

What's next
Read Clean up Jobs automatically

Refer to the Kubernetes Enhancement Proposal (KEP) for adding this mechanism.
7 - CronJob
A CronJob starts one-time Jobs on a repeating schedule.

ⓘ FEATURE STATE: Kubernetes v1.21 [stable]

A CronJob creates Jobs on a repeating schedule.

CronJob is meant for performing regular scheduled actions such as backups, report generation, and so on. One CronJob object is like
one line of a crontab (cron table) file on a Unix system. It runs a Job periodically on a given schedule, written in Cron format.

CronJobs have limitations and idiosyncrasies. For example, in certain circumstances, a single CronJob can create multiple concurrent
Jobs. See the limitations below.

When the control plane creates new Jobs and (indirectly) Pods for a CronJob, the .metadata.name of the CronJob is part of the basis
for naming those Pods. The name of a CronJob must be a valid DNS subdomain value, but this can produce unexpected results for
the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label. Even when the name is
a DNS subdomain, the name must be no longer than 52 characters. This is because the CronJob controller will automatically append
11 characters to the name you provide and there is a constraint that the length of a Job name is no more than 63 characters.

Example
This example CronJob manifest prints the current time and a hello message every minute:

application/job/cronjob.yaml

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

(Running Automated Tasks with a CronJob takes you through this example in more detail).

Writing a CronJob spec


Schedule syntax
The .spec.schedule field is required. The value of that field follows the Cron syntax:
# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
# │ │ │ │ │                                   OR sun, mon, tue, wed, thu, fri, sat
# │ │ │ │ │
# │ │ │ │ │
# * * * * *

For example, 0 3 * * 1 means this task is scheduled to run weekly on a Monday at 3 AM.

The format also includes extended "Vixie cron" step values. As explained in the FreeBSD manual:

Step values can be used in conjunction with ranges. Following a range with /<number> specifies skips of the number's value
through the range. For example, 0-23/2 can be used in the hours field to specify command execution every other hour (the
alternative in the V7 standard is 0,2,4,6,8,10,12,14,16,18,20,22 ). Steps are also permitted after an asterisk, so if you want to say
"every two hours", just use */2 .

Note:
A question mark (?) in the schedule has the same meaning as an asterisk *, that is, it stands for any available value for a given
field.

Other than the standard syntax, some macros like @monthly can also be used:

Entry                    Description                                                  Equivalent to
@yearly (or @annually)   Run once a year at midnight of 1 January                     0 0 1 1 *
@monthly                 Run once a month at midnight of the first day of the month   0 0 1 * *
@weekly                  Run once a week at midnight on Sunday morning                0 0 * * 0
@daily (or @midnight)    Run once a day at midnight                                   0 0 * * *
@hourly                  Run once an hour at the beginning of the hour                0 * * * *

To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
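
To tie the syntax together, the sketch below uses the weekly schedule from the text and lists two alternative values in comments. The CronJob name and the command are assumptions made for this example.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: schedule-demo          # hypothetical name
spec:
  schedule: "0 3 * * 1"        # every Monday at 3 AM
  # Alternatives for the same field:
  #   schedule: "0 */2 * * *"  # at minute 0 of every second hour (step value)
  #   schedule: "@monthly"     # macro, equivalent to "0 0 1 * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: busybox:1.28
            command: ["sh", "-c", "date"]
          restartPolicy: OnFailure
```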

Job template
The .spec.jobTemplate defines a template for the Jobs that the CronJob creates, and it is required. It has exactly the same schema as
a Job, except that it is nested and does not have an apiVersion or kind . You can specify common metadata for the templated Jobs,
such as labels or annotations. For information about writing a Job .spec , see Writing a Job Spec.
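
A short sketch of a job template with common metadata follows. The CronJob name, the label, and the annotation are hypothetical; every Job created from this template would carry them.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: labelled-report              # hypothetical name
spec:
  schedule: "@daily"
  jobTemplate:                       # same schema as a Job, minus apiVersion and kind
    metadata:
      labels:
        team: reporting              # hypothetical label copied to each created Job
      annotations:
        example.com/owner: data-platform   # hypothetical annotation
    spec:
      template:
        spec:
          containers:
          - name: report
            image: busybox:1.28
            command: ["sh", "-c", "echo generating report"]
          restartPolicy: OnFailure
```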

Deadline for delayed Job start


The .spec.startingDeadlineSeconds field is optional. This field defines a deadline (in whole seconds) for starting the Job, if that Job
misses its scheduled time for any reason.

After missing the deadline, the CronJob skips that instance of the Job (future occurrences are still scheduled). For example, if you
have a backup Job that runs twice a day, you might allow it to start up to 8 hours late, but no later, because a backup taken any later
wouldn't be useful: you would instead prefer to wait for the next scheduled run.

For Jobs that miss their configured deadline, Kubernetes treats them as failed Jobs. If you don't specify startingDeadlineSeconds for a
CronJob, the Job occurrences have no deadline.

If the .spec.startingDeadlineSeconds field is set (not null), the CronJob controller measures the time between when a Job is expected
to be created and now. If the difference is higher than that limit, it will skip this execution.

For example, if it is set to 200 , it allows a Job to be created for up to 200 seconds after the actual schedule.
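
The value of 200 from the text would be set as in the sketch below. The CronJob name, schedule, and command are assumptions.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: deadline-demo              # hypothetical name
spec:
  schedule: "*/10 * * * *"
  startingDeadlineSeconds: 200     # a run may start up to 200 seconds late; later than that, it is skipped
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: task
            image: busybox:1.28
            command: ["sh", "-c", "date"]
          restartPolicy: OnFailure
```
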
Concurrency policy
The .spec.concurrencyPolicy field is also optional. It specifies how to treat concurrent executions of a Job that is created by this
CronJob. The spec may specify only one of the following concurrency policies:

Allow (default): The CronJob allows concurrently running Jobs


Forbid : The CronJob does not allow concurrent runs; if it is time for a new Job run and the previous Job run hasn't finished yet,
the CronJob skips the new Job run. Also note that when the previous Job run finishes, .spec.startingDeadlineSeconds is still
taken into account and may result in a new Job run.
Replace : If it is time for a new Job run and the previous Job run hasn't finished yet, the CronJob replaces the currently running
Job run with a new Job run

Note that concurrency policy only applies to the Jobs created by the same CronJob. If there are multiple CronJobs, their respective
Jobs are always allowed to run concurrently.
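
As a sketch, a CronJob whose runs may take longer than the schedule interval could opt out of overlapping runs like this. The name, schedule, and the sleep that simulates a slow task are assumptions.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: no-overlap-demo        # hypothetical name
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid    # skip a new run while the previous one is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: slow-task
            image: busybox:1.28
            command: ["sh", "-c", "sleep 90"]   # longer than the one-minute interval
          restartPolicy: OnFailure
```

With Allow the runs would overlap, and with Replace the running Job would be replaced by the new one.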

Schedule suspension
You can suspend execution of Jobs for a CronJob, by setting the optional .spec.suspend field to true. The field defaults to false.

This setting does not affect Jobs that the CronJob has already started.

If you do set that field to true, all subsequent executions are suspended (they remain scheduled, but the CronJob controller does not
start the Jobs to run the tasks) until you unsuspend the CronJob.

Caution:
Executions that are suspended during their scheduled time count as missed Jobs. When .spec.suspend changes from true to
false on an existing CronJob without a starting deadline, the missed Jobs are scheduled immediately.

Jobs history limits


The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields specify how many completed and failed Jobs should
be kept. Both fields are optional.

.spec.successfulJobsHistoryLimit : This field specifies the number of successful finished jobs to keep. The default value is 3 .
Setting this field to 0 will not keep any successful jobs.

.spec.failedJobsHistoryLimit : This field specifies the number of failed finished jobs to keep. The default value is 1 . Setting this
field to 0 will not keep any failed jobs.

For another way to clean up Jobs automatically, see Clean up finished Jobs automatically.
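As a rough model of how the two limits interact, the sketch below selects which finished Jobs fall outside the limits. It assumes the oldest finished Jobs are removed first; the function name jobs_to_delete and the tuple layout are hypothetical, and the defaults mirror the values stated above.

```python
def jobs_to_delete(finished_jobs, successful_limit=3, failed_limit=1):
    """Pick finished Jobs that exceed the history limits.

    finished_jobs: list of (name, succeeded, finish_time) tuples.
    Assumption: within each group the oldest Jobs are removed first.
    """
    to_delete = []
    for succeeded, limit in ((True, successful_limit), (False, failed_limit)):
        group = sorted(
            (job for job in finished_jobs if job[1] == succeeded),
            key=lambda job: job[2],
        )
        excess = len(group) - limit
        if excess > 0:
            to_delete.extend(name for name, _, _ in group[:excess])
    return to_delete


history = [
    ("run-1", True, 1), ("run-2", True, 2), ("run-3", True, 3), ("run-4", True, 4),
    ("run-5", False, 5), ("run-6", False, 6),
]
print(jobs_to_delete(history))  # ['run-1', 'run-5']
```

With both limits set to 0, every finished Job in the list is selected, which corresponds to keeping no history.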

Time zones

FEATURE STATE: Kubernetes v1.27 [stable]

For CronJobs with no time zone specified, the kube-controller-manager interprets schedules relative to its local time zone.

You can specify a time zone for a CronJob by setting .spec.timeZone to the name of a valid time zone. For example, setting
.spec.timeZone: "Etc/UTC" instructs Kubernetes to interpret the schedule relative to Coordinated Universal Time.

A time zone database from the Go standard library is included in the binaries and used as a fallback in case an external database is
not available on the system.

CronJob limitations
Unsupported TimeZone specification
Specifying a timezone using CRON_TZ or TZ variables inside .spec.schedule is not officially supported (and never has been).
Starting with Kubernetes 1.29, if you try to set a schedule that includes a TZ or CRON_TZ timezone specification, Kubernetes fails
to create the resource with a validation error. Updates to CronJobs already using TZ or CRON_TZ continue to report a warning to
the client.

Modifying a CronJob
By design, a CronJob contains a template for new Jobs. If you modify an existing CronJob, the changes you make will apply to new
Jobs that start to run after your modification is complete. Jobs (and their Pods) that have already started continue to run without
changes. That is, the CronJob does not update existing Jobs, even if those remain running.

Job creation
A CronJob creates a Job object approximately once per execution time of its schedule. The scheduling is approximate because there
are certain circumstances where two Jobs might be created, or no Job might be created. Kubernetes tries to avoid those situations,
but does not completely prevent them. Therefore, the Jobs that you define should be idempotent.

Starting with Kubernetes v1.32, CronJobs apply an annotation batch.kubernetes.io/cronjob-scheduled-timestamp to their created Jobs.
This annotation indicates the originally scheduled creation time for the Job and is formatted in RFC3339.
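One possible use of this annotation is to measure how late a Job was created relative to its schedule. The sketch below parses the RFC 3339 value with the Python standard library; the sample timestamps are made up for illustration, and the helper name scheduled_time is hypothetical.

```python
from datetime import datetime, timezone

ANNOTATION = "batch.kubernetes.io/cronjob-scheduled-timestamp"


def scheduled_time(annotations):
    """Return the scheduled time from a Job's annotations, or None if absent."""
    raw = annotations.get(ANNOTATION)
    if raw is None:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat(),
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


# Hypothetical annotation value and creation time, for illustration only.
job_annotations = {ANNOTATION: "2025-01-15T08:30:00Z"}
created_at = datetime(2025, 1, 15, 8, 30, 4, tzinfo=timezone.utc)

delay = created_at - scheduled_time(job_annotations)
print(delay.total_seconds())  # 4.0
```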

If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to Allow , the Jobs will
always run at least once.

Caution:
If startingDeadlineSeconds is set to a value less than 10 seconds, the CronJob may not be scheduled. This is because the CronJob
controller checks things every 10 seconds.

For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until
now. If there are more than 100 missed schedules, then it does not start the Job and logs the following error:

Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds

It is important to note that if the startingDeadlineSeconds field is set (not nil), the controller counts how many missed Jobs
occurred within the last startingDeadlineSeconds seconds, rather than from the last scheduled time until now. For example, if
startingDeadlineSeconds is 200, the controller counts how many missed Jobs occurred in the last 200 seconds.

A scheduled run is counted as missed if its Job failed to be created at the scheduled time. For example, if concurrencyPolicy is set
to Forbid and the CronJob attempted to schedule a Job while a previously scheduled Job was still running, that run counts as missed.

For example, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its
startingDeadlineSeconds field is not set. If the CronJob controller happens to be down from 08:29:00 to 10:21:00, the Job will not
start because the number of Jobs that missed their schedule is greater than 100.

To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00, and its
startingDeadlineSeconds is set to 200 seconds. If the CronJob controller happens to be down for the same period as the previous
example (08:29:00 to 10:21:00), the Job will still start at 10:22:00. This happens because the controller now checks how many missed
schedules happened in the last 200 seconds (that is, 3 missed schedules), rather than from the last scheduled time until now.
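The arithmetic behind the two examples can be checked with a short simulation. This is a simplified model, not the controller's implementation: the function names are hypothetical, the calendar date is arbitrary, and the strict window boundaries (earliest < t < now) are an assumption chosen to reproduce the counts quoted above.

```python
from datetime import datetime, timedelta

MAX_MISSED = 100  # threshold quoted in the text above


def missed_schedules(first_run, interval, earliest, now):
    """Count schedule times t with earliest < t < now."""
    count = 0
    t = first_run
    while t < now:
        if t > earliest:
            count += 1
        t += interval
    return count


def will_start(first_run, interval, last_checked, now, starting_deadline_seconds=None):
    """Return True if the number of missed schedules does not exceed MAX_MISSED."""
    earliest = last_checked
    if starting_deadline_seconds is not None:
        # With a deadline set, only the most recent window is examined.
        earliest = max(earliest, now - timedelta(seconds=starting_deadline_seconds))
    return missed_schedules(first_run, interval, earliest, now) <= MAX_MISSED


day = datetime(2024, 1, 1)  # arbitrary date; only the clock times matter
first_run = day.replace(hour=8, minute=30)
down_from = day.replace(hour=8, minute=29)
now = day.replace(hour=10, minute=22)
every_minute = timedelta(minutes=1)

print(missed_schedules(first_run, every_minute, down_from, now))  # 112
print(will_start(first_run, every_minute, down_from, now))        # False
print(will_start(first_run, every_minute, down_from, now, 200))   # True
```

Without a deadline, 112 schedule times (08:30:00 through 10:21:00) fall in the window, which exceeds 100. With a 200 second deadline, only 10:19:00, 10:20:00 and 10:21:00 are counted.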

The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of
the Pods it represents.

What's next
Learn about Pods and Jobs, two concepts that CronJobs rely upon.
Read about the detailed format of CronJob .spec.schedule fields.
For instructions on creating and working with CronJobs, and for an example of a CronJob manifest, see Running automated
tasks with CronJobs.
CronJob is part of the Kubernetes REST API. Read the CronJob API reference for more details.

8 - ReplicationController
Legacy API for managing workloads that can scale horizontally. Superseded by the Deployment and ReplicaSet
APIs.

Note:
A Deployment that configures a ReplicaSet is now the recommended way to set up replication.

A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a
ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.

How a ReplicationController works


If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts
more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are
deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade.
For this reason, you should use a ReplicationController even if your application requires only a single pod. A ReplicationController is
similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises
multiple pods across multiple nodes.
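The behavior described above amounts to comparing a desired count with an observed count. The sketch below is a deliberately small model of that comparison; the function name reconcile is hypothetical, and the choice of which surplus pods to terminate (the last ones in the list) is an assumption made only for the example.

```python
def reconcile(desired_replicas, running_pods):
    """Compare desired and observed pod counts and return the corrective action."""
    missing = desired_replicas - len(running_pods)
    if missing > 0:
        # Too few pods: more need to be started.
        return {"create": missing, "delete": []}
    # Too many pods (or exactly enough): terminate any surplus.
    return {"create": 0, "delete": running_pods[desired_replicas:]}


print(reconcile(3, ["nginx-a"]))
# {'create': 2, 'delete': []}
print(reconcile(3, ["nginx-a", "nginx-b", "nginx-c", "nginx-d"]))
# {'create': 0, 'delete': ['nginx-d']}
```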

ReplicationController is often abbreviated to "rc" in discussion, and as a shortcut in kubectl commands.

A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use
case is to run several identical replicas of a replicated service, such as web servers.

Running an example ReplicationController


This example ReplicationController config runs three copies of the nginx web server.

controllers/replication.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Run the example by downloading the example file and then running this command:

kubectl apply -f https://k8s.io/examples/controllers/replication.yaml

The output is similar to this:


replicationcontroller/nginx created

Check on the status of the ReplicationController using this command:

kubectl describe replicationcontrollers/nginx

The output is similar to this:

Name:         nginx
Namespace:    default
Selector:     app=nginx
Labels:       app=nginx
Annotations:  <none>
Replicas:     3 current / 3 desired
Pods Status:  0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:        nginx
    Port:         80/TCP
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  FirstSeen  LastSeen  Count  From                       SubobjectPath  Type    Reason            Message
  ---------  --------  -----  ----                       -------------  ----    ------            -------
  20s        20s       1      {replication-controller }                 Normal  SuccessfulCreate  Created pod
  20s        20s       1      {replication-controller }                 Normal  SuccessfulCreate  Created pod
  20s        20s       1      {replication-controller }                 Normal  SuccessfulCreate  Created pod

Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command
may show:

Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed

To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this:

pods=$(kubectl get pods --selector=app=nginx --output=jsonpath='{.items..metadata.name}')
echo "$pods"

The output is similar to this:

nginx-3ntk0 nginx-4ok8v nginx-qrm3m

Here, the selector is the same as the selector for the ReplicationController (seen in the kubectl describe output), and in a different
form in replication.yaml . The --output=jsonpath option specifies an expression with the name from each pod in the returned list.

Writing a ReplicationController Manifest


As with all other Kubernetes config, a ReplicationController needs apiVersion , kind , and metadata fields.

When the control plane creates new Pods for a ReplicationController, the .metadata.name of the ReplicationController is part of the
basis for naming those Pods. The name of a ReplicationController must be a valid DNS subdomain value, but this can produce
unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
For general information about working with configuration files, see object management.

A ReplicationController also needs a .spec section.

Pod Template
The .spec.template is the only required field of the .spec .

The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .

In addition to required fields for a Pod, a pod template in a ReplicationController must specify appropriate labels and an appropriate
restart policy. For labels, make sure not to overlap with other controllers. See pod selector.

Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.

For local container restarts, ReplicationControllers delegate to an agent on the node, for example the kubelet.

Labels on the ReplicationController


The ReplicationController can itself have labels ( .metadata.labels ). Typically, you would set these the same as the
.spec.template.metadata.labels ; if .metadata.labels is not specified then it defaults to .spec.template.metadata.labels . However,
they are allowed to be different, and the .metadata.labels do not affect the behavior of the ReplicationController.

Pod Selector
The .spec.selector field is a label selector. A ReplicationController manages all the pods with labels that match the selector. It does
not distinguish between pods that it created or deleted and pods that another person or process created or deleted. This allows the
ReplicationController to be replaced without affecting the running pods.

If specified, the .spec.template.metadata.labels must be equal to the .spec.selector , or it will be rejected by the API. If
.spec.selector is unspecified, it will be defaulted to .spec.template.metadata.labels .

Also, you should not normally create any pods whose labels match this selector, either directly, with another ReplicationController,
or with another controller such as a Job. If you do so, the ReplicationController thinks that it created the other pods. Kubernetes
does not stop you from doing this.

If you do end up with multiple controllers that have overlapping selectors, you will have to manage the deletion yourself (see below).
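To make the matching rule concrete, here is a minimal sketch of equality-based selection as described in this section: a pod is managed when every key and value in the selector appears in its labels, regardless of who created it. The function name selector_matches and the pod names are hypothetical.

```python
def selector_matches(selector, pod_labels):
    """True when every selector key/value pair is present in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


selector = {"app": "nginx"}
pods = {
    "nginx-3ntk0": {"app": "nginx"},                  # created by the controller
    "manual-pod": {"app": "nginx", "owner": "alice"},  # created by hand, still matches
    "other-pod": {"app": "redis"},
}

managed = [name for name, labels in pods.items() if selector_matches(selector, labels)]
print(managed)  # ['nginx-3ntk0', 'manual-pod']
```

The hand-created pod is counted toward the replica total, which is why overlapping labels can lead to surprising terminations.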

Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas to the desired number of pods. The number
running at any time may be higher or lower, for example if the replicas were just increased or decreased, or if a pod is
gracefully shut down and a replacement starts early.

If you do not specify .spec.replicas , then it defaults to 1.

Working with ReplicationControllers


Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete. The kubectl tool scales the ReplicationController to zero and
waits for it to delete each pod before deleting the ReplicationController itself. If this kubectl command is interrupted, it can be
restarted.

When using the REST API or client library, you need to do the steps explicitly (scale replicas to 0, wait for pod deletions, then delete
the ReplicationController).

Deleting only a ReplicationController


You can delete a ReplicationController without affecting any of its pods.

Using kubectl, specify the --cascade=orphan option to kubectl delete .

When using the REST API or client library, you can delete the ReplicationController object.

Once the original is deleted, you can create a new ReplicationController to replace it. As long as the old and new .spec.selector are
the same, then the new one will adopt the old pods. However, it will not make any effort to make existing pods match a new,
different pod template. To update pods to a new spec in a controlled way, use a rolling update.

Isolating pods from a ReplicationController


Pods may be removed from a ReplicationController's target set by changing their labels. This technique may be used to remove pods
from service for debugging and data recovery. Pods that are removed in this way will be replaced automatically (assuming that the
number of replicas is not also changed).
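The effect of relabeling can be sketched as follows. This is an illustrative model with hypothetical pod names and a hypothetical debug label value; it only shows that a pod whose labels stop matching is no longer counted, so one replacement becomes necessary.

```python
def replacements_needed(selector, pods, desired_replicas):
    """Number of pods to create after counting only pods that match the selector."""
    managed = [
        name for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]
    return max(desired_replicas - len(managed), 0)


selector = {"app": "nginx"}
pods = {
    "nginx-3ntk0": {"app": "nginx"},
    "nginx-4ok8v": {"app": "nginx"},
    "nginx-qrm3m": {"app": "nginx"},
}
print(replacements_needed(selector, pods, 3))  # 0

# Take one pod out of service for debugging by changing its label.
pods["nginx-qrm3m"]["app"] = "nginx-debug"
print(replacements_needed(selector, pods, 3))  # 1
```

The relabeled pod keeps running and can be inspected, while the controller brings the matching set back up to three.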

Common usage patterns


Rescheduling
As mentioned above, whether you have 1 pod you want to keep running, or 1000, a ReplicationController will ensure that the
specified number of pods exists, even in the event of node failure or pod termination (for example, due to an action by another
control agent).

Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an auto-scaling control agent, by
updating the replicas field.

Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods one-by-one.

As explained in #1353, the recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1) and old
(-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This predictably updates the set of pods
regardless of unexpected failures.

Ideally, the rolling update controller would take application readiness into account, and would ensure that a sufficient number of
pods were productively serving at any given time.

The two ReplicationControllers would need to create pods with at least one differentiating label, such as the image tag of the
primary container of the pod, since it is typically image updates that motivate rolling updates.
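The scale-up and scale-down sequence can be traced with a small simulation. It follows the approach described above (new controller starts at 1 replica, then the two controllers are adjusted one step at a time), but the exact ordering inside each step, old controller down first and then new controller up, is an assumption made for this sketch.

```python
def rolling_update_steps(replicas):
    """Return the sequence of (old_replicas, new_replicas) during a rolling update."""
    old, new = replicas, 1  # the new controller is created with 1 replica
    steps = [(old, new)]
    while old > 0:
        old -= 1  # scale the old controller down by one
        steps.append((old, new))
        if new < replicas:
            new += 1  # scale the new controller up by one
            steps.append((old, new))
    return steps  # the old controller can be deleted once it reaches 0


print(rolling_update_steps(3))
# [(3, 1), (2, 1), (2, 2), (1, 2), (1, 3), (0, 3)]
```

In this trace the total number of pods never drops below the original replica count, which is the property that makes the update predictable.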

Multiple release tracks


In addition to running multiple releases of an application while a rolling update is in progress, it's common to run multiple releases
for an extended period of time, or even continuously, using multiple release tracks. The tracks would be differentiated by labels.

For instance, a service might target all pods with tier in (frontend), environment in (prod) . Now say you have 10 replicated pods
that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a ReplicationController
with replicas set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable , and another
ReplicationController with replicas set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary . Now the
service is covering both the canary and non-canary pods. But you can mess with the ReplicationControllers separately to test things
out, monitor the results, etc.
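The canary arrangement can be modeled with a set-based selector like the one quoted above. In this sketch the Service's requirements ignore the track label, so pods from both controllers are selected; the function name and the way pods are generated are hypothetical.

```python
def matches_requirements(requirements, labels):
    """Set-based match: each key's label value must be in the allowed set."""
    return all(labels.get(key) in allowed for key, allowed in requirements.items())


# Equivalent of: tier in (frontend), environment in (prod)
service_requirements = {"tier": {"frontend"}, "environment": {"prod"}}

base = {"tier": "frontend", "environment": "prod"}
stable_pods = [dict(base, track="stable") for _ in range(9)]  # first controller, 9 replicas
canary_pods = [dict(base, track="canary") for _ in range(1)]  # second controller, 1 replica

selected = [
    pod for pod in stable_pods + canary_pods
    if matches_requirements(service_requirements, pod)
]
canaries = [pod for pod in selected if pod["track"] == "canary"]
print(len(selected), len(canaries))  # 10 1
```

Because the track label is not part of the Service's selector, changing the replica counts of either controller shifts the share of canary pods without touching the Service.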

Using ReplicationControllers with Services


Multiple ReplicationControllers can sit behind a single service, so that, for example, some traffic goes to the old version, and some
goes to the new version.

A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as services. Services may be
composed of pods controlled by multiple ReplicationControllers, and it is expected that many ReplicationControllers may be created
and destroyed over the lifetime of a service (for instance, to perform an update of pods that run the service). Both services
themselves and their clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.

Writing programs for Replication
Pods created by a ReplicationController are intended to be fungible and semantically identical, though their configurations may
become heterogeneous over time. This is an obvious fit for replicated stateless servers, but ReplicationControllers can also be used
to maintain availability of master-elected, sharded, and worker-pool applications. Such applications should use dynamic work
assignment mechanisms, such as the RabbitMQ work queues, as opposed to static/one-time customization of the configuration of
each pod, which is considered an anti-pattern. Any pod customization performed, such as vertical auto-sizing of resources (for
example, cpu or memory), should be performed by another online controller process, not unlike the ReplicationController itself.

Responsibilities of the ReplicationController


The ReplicationController ensures that the desired number of pods matching its label selector exist and are operational. Currently, only
terminated pods are excluded from its count. In the future, readiness and other information available from the system may be taken
into account, we may add more controls over the replacement policy, and we plan to emit events that could be used by external
clients to implement arbitrarily sophisticated replacement and/or scale-down policies.

The ReplicationController is forever constrained to this narrow responsibility. It will not itself perform readiness or liveness probes.
Rather than performing auto-scaling, it is intended to be controlled by an external auto-scaler (as discussed in #492), which would
change its replicas field. We will not add scheduling policies (for example, spreading) to the ReplicationController. Nor should it
verify that the pods controlled match the currently specified template, as that would obstruct auto-sizing and other automated
processes. Similarly, completion deadlines, ordering dependencies, configuration expansion, and other features belong elsewhere.
We even plan to factor out the mechanism for bulk pod creation (#170).

The ReplicationController is intended to be a composable building-block primitive. We expect higher-level APIs and/or tools to be
built on top of it and other complementary primitives for user convenience in the future. The "macro" operations currently
supported by kubectl (run, scale) are proof-of-concept examples of this. For instance, we could imagine something like Asgard
managing ReplicationControllers, auto-scalers, services, scheduling policies, canaries, etc.

API Object
Replication controller is a top-level resource in the Kubernetes REST API. More details about the API object can be found at:
ReplicationController API object.

Alternatives to ReplicationController
ReplicaSet
ReplicaSet is the next-generation ReplicationController that supports the new set-based label selector. It's mainly used by
Deployment as a mechanism to orchestrate pod creation, deletion and updates. Note that we recommend using Deployments
instead of directly using ReplicaSets, unless you require custom update orchestration or don't require updates at all.

Deployment (Recommended)
Deployment is a higher-level API object that updates its underlying ReplicaSets and their Pods. Deployments are recommended if
you want the rolling update functionality, because they are declarative, server-side, and have additional features.

Bare Pods
Unlike in the case where a user directly created pods, a ReplicationController replaces pods that are deleted or terminated for any
reason, for example after node failure or disruptive node maintenance such as a kernel upgrade. For this reason, we
recommend that you use a ReplicationController even if your application requires only a single pod. Think of it similarly to a process
supervisor, only it supervises multiple pods across multiple nodes instead of individual processes on a single node. A
ReplicationController delegates local container restarts to some agent on the node, such as the kubelet.

Job
Use a Job instead of a ReplicationController for pods that are expected to terminate on their own (that is, batch jobs).

DaemonSet
Use a DaemonSet instead of a ReplicationController for pods that provide a machine-level function, such as machine monitoring or
machine logging. These pods have a lifetime that is tied to a machine lifetime: the pod needs to be running on the machine before
other pods start, and is safe to terminate when the machine is otherwise ready to be rebooted or shut down.

What's next
Learn about Pods.
Learn about Deployment, the replacement for ReplicationController.
ReplicationController is part of the Kubernetes REST API. Read the ReplicationController object definition to understand the
API for replication controllers.
