Workload Management | Kubernetes
1: Deployments
2: ReplicaSet
3: StatefulSets
4: DaemonSet
5: Jobs
6: Automatic Cleanup for Finished Jobs
7: CronJob
8: ReplicationController
Kubernetes provides several built-in APIs for declarative management of your workloads and the components of those workloads.
Ultimately, your applications run as containers inside Pods; however, managing individual Pods would be a lot of effort. For example,
if a Pod fails, you probably want to run a new Pod to replace it. Kubernetes can do that for you.
You use the Kubernetes API to create a workload object that represents a higher abstraction level than a Pod, and then the
Kubernetes control plane automatically manages Pod objects on your behalf, based on the specification for the workload object you
defined.
A Deployment (and, indirectly, a ReplicaSet) is the most common way to run an application on your cluster. A Deployment is a good fit for
managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be
replaced if needed. (Deployments are a replacement for the legacy ReplicationController API.)
A StatefulSet lets you manage one or more Pods – all running the same application code – where the Pods rely on having a distinct
identity. This is different from a Deployment where the Pods are expected to be interchangeable. The most common use for a
StatefulSet is to be able to make a link between its Pods and their persistent storage. For example, you can run a StatefulSet that
associates each Pod with a PersistentVolume. If one of the Pods in the StatefulSet fails, Kubernetes makes a replacement Pod that is
connected to the same PersistentVolume.
A DaemonSet defines Pods that provide facilities that are local to a specific node; for example, a driver that lets containers on that
node access a storage system. You use a DaemonSet when the driver, or other node-level service, has to run on the node where it's
useful. Each Pod in a DaemonSet performs a role similar to a system daemon on a classic Unix / POSIX server. A DaemonSet might
be fundamental to the operation of your cluster, such as a plugin to let that node access cluster networking, it might help you to
manage the node, or it could provide less essential facilities that enhance the container platform you are running. You can run
DaemonSets (and their pods) across every node in your cluster, or across just a subset (for example, only install the GPU accelerator
driver on nodes that have a GPU installed).
You can use a Job and / or a CronJob to define tasks that run to completion and then stop. A Job represents a one-off task, whereas
each CronJob repeats according to a schedule.
You describe a desired state in a Deployment, and the Deployment Controller changes the actual state to the desired state at a
controlled rate. You can define Deployments to create new ReplicaSets, or to remove existing Deployments and adopt all their
resources with new Deployments.
Note:
Do not manage ReplicaSets owned by a Deployment. Consider opening an issue in the main Kubernetes repository if your use
case is not covered below.
Use Case
The following are typical use cases for Deployments:
Create a Deployment to roll out a ReplicaSet. The ReplicaSet creates Pods in the background. Check the status of the rollout to
see if it succeeds or not.
Declare the new state of the Pods by updating the PodTemplateSpec of the Deployment. A new ReplicaSet is created and the
Deployment manages moving the Pods from the old ReplicaSet to the new one at a controlled rate. Each new ReplicaSet
updates the revision of the Deployment.
Roll back to an earlier Deployment revision if the current state of the Deployment is not stable. Each rollback updates the
revision of the Deployment.
Scale up the Deployment to facilitate more load.
Pause the rollout of a Deployment to apply multiple fixes to its PodTemplateSpec and then resume it to start a new rollout.
Use the status of the Deployment as an indicator that a rollout has stuck.
Clean up older ReplicaSets that you don't need anymore.
Creating a Deployment
The following is an example of a Deployment. It creates a ReplicaSet to bring up three nginx Pods:
controllers/nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
In this example:
A Deployment named nginx-deployment is created, indicated by the .metadata.name field. This name will become the basis for
the ReplicaSets and Pods which are created later. See Writing a Deployment Spec for more details.
The Deployment creates a ReplicaSet that creates three replicated Pods, indicated by the .spec.replicas field.
The .spec.selector field defines how the created ReplicaSet finds which Pods to manage. In this case, you select a label that is
defined in the Pod template ( app: nginx ). However, more sophisticated selection rules are possible, as long as the Pod
template itself satisfies the rule.
Note:
The .spec.selector.matchLabels field is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent
to an element of matchExpressions, whose key field is "key", the operator is "In", and the values array contains only "value".
All of the requirements, from both matchLabels and matchExpressions, must be satisfied in order to match.
The Pods are labeled app: nginx using the .metadata.labels field.
The Pod template's specification, or .template.spec field, indicates that the Pods run one container, nginx , which runs
the nginx Docker Hub image at version 1.14.2.
Create one container and name it nginx using the .spec.template.spec.containers[0].name field.
Before you begin, make sure your Kubernetes cluster is up and running. Follow the steps given below to create the above
Deployment:
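1. Create the Deployment by running the following command (the manifest URL matches the change-cause recorded later in this section):
kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
2. Run kubectl get deployments to check if the Deployment was created.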
If the Deployment is still being created, the output is similar to the following:
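NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/3     0            0           1s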
When you inspect the Deployments in your cluster, the following fields are displayed:
READY displays how many replicas of the application are available to your users. It follows the pattern ready/desired.
UP-TO-DATE displays the number of replicas that have been updated to achieve the desired state.
AVAILABLE displays how many replicas of the application are available to your users.
AGE displays the amount of time that the application has been running.
3. To see the Deployment rollout status, run kubectl rollout status deployment/nginx-deployment .
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
deployment "nginx-deployment" successfully rolled out
4. Run kubectl get deployments again a few seconds later. The output is similar to this:
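NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           18s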
Notice that the Deployment has created all three replicas, and all replicas are up-to-date (they contain the latest Pod template)
and available.
5. To see the ReplicaSet ( rs ) created by the Deployment, run kubectl get rs . The output is similar to this:
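NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-75675f5897   3         3         3       18s
(The hash suffix is illustrative; yours will differ.)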
READY displays how many replicas of the application are available to your users.
AGE displays the amount of time that the application has been running.
Notice that the name of the ReplicaSet is always formatted as [DEPLOYMENT-NAME]-[HASH] . This name will become the basis for
the Pods which are created.
The HASH string is the same as the pod-template-hash label on the ReplicaSet.
6. To see the labels automatically generated for each Pod, run kubectl get pods --show-labels . The output is similar to:
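NAME                                READY   STATUS    RESTARTS   AGE   LABELS
nginx-deployment-75675f5897-7ci7o   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-kzszj   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
nginx-deployment-75675f5897-qqcnn   1/1     Running   0          18s   app=nginx,pod-template-hash=75675f5897
(Pod name suffixes are illustrative.)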
The created ReplicaSet ensures that there are three nginx Pods.
Note:
You must specify an appropriate selector and Pod template labels in a Deployment (in this case, app: nginx ).
Do not overlap labels or selectors with other controllers (including other Deployments and StatefulSets). Kubernetes doesn't
stop you from overlapping, and if multiple controllers have overlapping selectors those controllers might conflict and behave
unexpectedly.
Pod-template-hash label
Caution:
Do not change this label.
The pod-template-hash label is added by the Deployment controller to every ReplicaSet that a Deployment creates or adopts.
This label ensures that child ReplicaSets of a Deployment do not overlap. It is generated by hashing the PodTemplate of the
ReplicaSet and using the resulting hash as the label value that is added to the ReplicaSet selector, Pod template labels, and in any
existing Pods that the ReplicaSet might have.
Updating a Deployment
Note:
A Deployment's rollout is triggered if and only if the Deployment's Pod template (that is, .spec.template) is changed, for example
if the labels or container images of the template are updated. Other updates, such as scaling the Deployment, do not trigger a
rollout.
1. Let's update the nginx Pods to use the nginx:1.16.1 image instead of the nginx:1.14.2 image.
kubectl set image deployment.v1.apps/nginx-deployment nginx=nginx:1.16.1
where deployment/nginx-deployment indicates the Deployment, nginx indicates the container in which the update will take place, and
nginx:1.16.1 indicates the new image and its tag.
Alternatively, you can edit the Deployment and change .spec.template.spec.containers[0].image from nginx:1.14.2 to
nginx:1.16.1 :
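kubectl edit deployment/nginx-deployment
The output is similar to this: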
deployment.apps/nginx-deployment edited
To see the rollout status, run kubectl rollout status deployment/nginx-deployment. The output is similar to this:
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
or
deployment "nginx-deployment" successfully rolled out
After the rollout succeeds, you can view the Deployment by running kubectl get deployments . The output is similar to this:
Run kubectl get rs to see that the Deployment updated the Pods by creating a new ReplicaSet and scaling it up to 3 replicas,
as well as scaling down the old ReplicaSet to 0 replicas.
kubectl get rs
Running kubectl get pods should now show only the new Pods:
Next time you want to update these Pods, you only need to update the Deployment's Pod template again.
Deployment ensures that only a certain number of Pods are down while they are being updated. By default, it ensures that at
least 75% of the desired number of Pods are up (25% max unavailable).
Deployment also ensures that only a certain number of Pods are created above the desired number of Pods. By default, it
ensures that at most 125% of the desired number of Pods are up (25% max surge).
For example, if you look at the above Deployment closely, you will see that it first creates a new Pod, then deletes an old Pod,
and creates another new one. It does not kill old Pods until a sufficient number of new Pods have come up, and does not
create new Pods until a sufficient number of old Pods have been killed. It makes sure that at least 3 Pods are available and that
at most 4 Pods in total are available. In the case of a Deployment with 4 replicas, the number of Pods would be between 3 and 5.
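To observe this, run kubectl describe deployments and inspect the Events section (the ReplicaSet hashes in the description that follows are illustrative).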
Here you see that when you first created the Deployment, it created a ReplicaSet (nginx-deployment-2035384211) and scaled it
up to 3 replicas directly. When you updated the Deployment, it created a new ReplicaSet (nginx-deployment-1564180365) and
scaled it up to 1 and waited for it to come up. Then it scaled down the old ReplicaSet to 2 and scaled up the new ReplicaSet to 2
so that at least 3 Pods were available and at most 4 Pods were created at all times. It then continued scaling up and down the
new and the old ReplicaSet, with the same rolling update strategy. Finally, you'll have 3 available replicas in the new ReplicaSet,
and the old ReplicaSet is scaled down to 0.
Note:
Kubernetes doesn't count terminating Pods when calculating the number of availableReplicas, which must be between
replicas - maxUnavailable and replicas + maxSurge. As a result, you might notice that there are more Pods than expected during a rollout,
and that the total resources consumed by the Deployment are more than replicas + maxSurge until the
terminationGracePeriodSeconds of the terminating Pods expires.
Rollover (aka multiple updates in-flight)
If you update a Deployment while an existing rollout is in progress, the Deployment creates a new ReplicaSet as per the update and
starts scaling that up, and rolls over the ReplicaSet that it was scaling up previously -- it will add it to its list of old ReplicaSets and start
scaling it down.
For example, suppose you create a Deployment to create 5 replicas of nginx:1.14.2 , but then update the Deployment to create 5
replicas of nginx:1.16.1 , when only 3 replicas of nginx:1.14.2 had been created. In that case, the Deployment immediately starts
killing the 3 nginx:1.14.2 Pods that it had created, and starts creating nginx:1.16.1 Pods. It does not wait for the 5 replicas of
nginx:1.14.2 to be created before changing course.
Label selector updates
It is generally discouraged to make label selector updates and it is suggested to plan your selectors up front. In any case, if you need
to perform a label selector update, exercise great caution and make sure you have grasped all of the implications.
Note:
In API version apps/v1, a Deployment's label selector is immutable after it gets created.
Selector additions require the Pod template labels in the Deployment spec to be updated with the new label too, otherwise a
validation error is returned. This change is a non-overlapping one, meaning that the new selector does not select ReplicaSets
and Pods created with the old selector, resulting in orphaning all old ReplicaSets and creating a new ReplicaSet.
Selector updates change the existing value in a selector key and result in the same behavior as additions.
Selector removals remove an existing key from the Deployment selector and do not require any changes in the Pod template
labels. Existing ReplicaSets are not orphaned, and a new ReplicaSet is not created, but note that the removed label still exists in
any existing Pods and ReplicaSets.
Rolling Back a Deployment
Sometimes, you may want to roll back a Deployment; for example, when the Deployment is not stable, such as crash looping. By default, all of the Deployment's rollout history is kept in the system so that you can roll back anytime you want (you can change that by modifying the revision history limit).
Note:
A Deployment's revision is created when a Deployment's rollout is triggered. This means that the new revision is created if and
only if the Deployment's Pod template (.spec.template) is changed, for example if you update the labels or container images of
the template. Other updates, such as scaling the Deployment, do not create a Deployment revision, so that you can facilitate
simultaneous manual- or auto-scaling. This means that when you roll back to an earlier revision, only the Deployment's Pod
template part is rolled back.
Suppose that you made a typo while updating the Deployment, by putting the image name as nginx:1.161 instead of
nginx:1.16.1 :
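kubectl set image deployment/nginx-deployment nginx=nginx:1.161
The output is similar to this:
deployment.apps/nginx-deployment image updated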
The rollout gets stuck. You can verify it by checking the rollout status:
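kubectl rollout status deployment/nginx-deployment
The output is similar to this: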
Waiting for rollout to finish: 1 out of 3 new replicas have been updated...
Press Ctrl-C to stop the above rollout status watch. For more information on stuck rollouts, see Failed Deployment below.
You see that the number of old replicas (adding the replica count from nginx-deployment-1564180365 and nginx-deployment-
2035384211 ) is 3, and the number of new replicas (from nginx-deployment-3066724191 ) is 1.
kubectl get rs
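The output is similar to this (the ReplicaSet hashes are illustrative):
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1564180365   3         3         3       25s
nginx-deployment-2035384211   0         0         0       36s
nginx-deployment-3066724191   1         1         0       6s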
Looking at the Pods created, you see that one Pod created by the new ReplicaSet is stuck in an image pull loop.
Note:
The Deployment controller stops the bad rollout automatically, and stops scaling up the new ReplicaSet. This depends on
the rollingUpdate parameters (maxUnavailable specifically) that you have specified. Kubernetes by default sets the value to
25%.
To fix this, you need to roll back to a previous revision of the Deployment that is stable.
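First, check the revisions of this Deployment:
kubectl rollout history deployment/nginx-deployment
The output is similar to this: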
deployments "nginx-deployment"
REVISION CHANGE-CAUSE
1 kubectl apply --filename=https://k8s.io/examples/controllers/nginx-deployment.yaml
2 kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
3 kubectl set image deployment/nginx-deployment nginx=nginx:1.161
CHANGE-CAUSE is copied from the Deployment annotation kubernetes.io/change-cause to its revisions upon creation. You can
specify the CHANGE-CAUSE message by:
Annotating the Deployment with kubectl annotate deployment/nginx-deployment kubernetes.io/change-cause="image updated to 1.16.1"
Manually editing the manifest of the resource.
1. Now you've decided to undo the current rollout and roll back to the previous revision:
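kubectl rollout undo deployment/nginx-deployment
The output is similar to this:
deployment.apps/nginx-deployment rolled back
Alternatively, you can roll back to a specific revision by specifying it with --to-revision:
kubectl rollout undo deployment/nginx-deployment --to-revision=2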
For more details about rollout related commands, read kubectl rollout .
The Deployment is now rolled back to a previous stable revision. As you can see, a DeploymentRollback event for rolling back to
revision 2 is generated from Deployment controller.
2. Check if the rollback was successful and the Deployment is running as expected, run:
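kubectl get deployment nginx-deployment
The output is similar to this:
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           30m
Get the description of the Deployment with kubectl describe deployment nginx-deployment; the output is similar to this: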
Name: nginx-deployment
Namespace: default
CreationTimestamp: Sun, 02 Sep 2018 18:17:55 -0500
Labels: app=nginx
Annotations: deployment.kubernetes.io/revision=4
kubernetes.io/change-cause=kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1
Selector: app=nginx
Replicas: 3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 25% max unavailable, 25% max surge
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx:1.16.1
Port: 80/TCP
Host Port: 0/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: nginx-deployment-c4747d96c (3/3 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 12m deployment-controller Scaled up replica set nginx-deployment-75675f5897 to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 2
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 1
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-c4747d96c to 3
Normal ScalingReplicaSet 11m deployment-controller Scaled down replica set nginx-deployment-75675f5897 to 0
Normal ScalingReplicaSet 11m deployment-controller Scaled up replica set nginx-deployment-595696685f to 1
Normal DeploymentRollback 15s deployment-controller Rolled back deployment "nginx-deployment" to revision 2
Normal ScalingReplicaSet 15s deployment-controller Scaled down replica set nginx-deployment-595696685f to 0
Scaling a Deployment
You can scale a Deployment by using the following command:
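kubectl scale deployment/nginx-deployment --replicas=10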
deployment.apps/nginx-deployment scaled
Assuming horizontal Pod autoscaling is enabled in your cluster, you can set up an autoscaler for your Deployment and choose the
minimum and maximum number of Pods you want to run based on the CPU utilization of your existing Pods.
kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80
horizontalpodautoscaler.autoscaling/nginx-deployment autoscaled
Proportional scaling
RollingUpdate Deployments support running multiple versions of an application at the same time. When you or an autoscaler scales
a RollingUpdate Deployment that is in the middle of a rollout (either in progress or paused), the Deployment controller balances the
additional replicas in the existing active ReplicaSets (ReplicaSets with Pods) in order to mitigate risk. This is called proportional scaling.
For example, you are running a Deployment with 10 replicas, maxSurge=3, and maxUnavailable=2.
You update to a new image which happens to be unresolvable from inside the cluster.
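kubectl set image deployment/nginx-deployment nginx=nginx:sometag
deployment.apps/nginx-deployment image updated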
The image update starts a new rollout with ReplicaSet nginx-deployment-1989198191, but it's blocked due to the
maxUnavailable requirement that you mentioned above. Check out the rollout status:
kubectl get rs
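The output is similar to this (the ReplicaSet hashes are illustrative):
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   5         5         0       9s
nginx-deployment-618515232    8         8         8       1m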
Then a new scaling request for the Deployment comes along. The autoscaler increments the Deployment replicas to 15. The
Deployment controller needs to decide where to add these new 5 replicas. If you weren't using proportional scaling, all 5 of
them would be added in the new ReplicaSet. With proportional scaling, you spread the additional replicas across all
ReplicaSets. Bigger proportions go to the ReplicaSets with the most replicas and lower proportions go to ReplicaSets with fewer
replicas. Any leftovers are added to the ReplicaSet with the most replicas. ReplicaSets with zero replicas are not scaled up.
In our example above, 3 replicas are added to the old ReplicaSet and 2 replicas are added to the new ReplicaSet. The rollout process
should eventually move all replicas to the new ReplicaSet, assuming the new replicas become healthy. To confirm this, run:
kubectl get deploy
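The output is similar to this (values are illustrative):
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   15        18        7            8           7m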
The rollout status confirms how the replicas were added to each ReplicaSet.
kubectl get rs
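The output is similar to this (values are illustrative):
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   7         7         0       7m
nginx-deployment-618515232    11        11        11      7m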
deployments "nginx"
REVISION CHANGE-CAUSE
1 <none>
Get the rollout status to verify that the existing ReplicaSet has not changed:
kubectl get rs
You can make as many updates as you wish, for example, update the resources that will be used:
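kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi
The output is similar to this:
deployment.apps/nginx-deployment resource requirements updated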
The initial state of the Deployment prior to pausing its rollout will continue its function, but new updates to the Deployment will
not have any effect as long as the Deployment rollout is paused.
Eventually, resume the Deployment rollout and observe a new ReplicaSet coming up with all the new updates:
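kubectl rollout resume deployment/nginx-deployment
The output is similar to this: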
deployment.apps/nginx-deployment resumed
kubectl get rs
Note:
You cannot roll back a paused Deployment until you resume it.
Deployment status
A Deployment enters various states during its lifecycle. It can be progressing while rolling out a new ReplicaSet, it can be complete,
or it can fail to progress.
Progressing Deployment
Kubernetes marks a Deployment as progressing when one of the following tasks is performed:
The Deployment creates a new ReplicaSet.
The Deployment is scaling up its newest ReplicaSet.
The Deployment is scaling down its older ReplicaSet(s).
New Pods become ready or available (ready for at least MinReadySeconds).
When the rollout becomes “progressing”, the Deployment controller adds a condition with the following attributes to the
Deployment's .status.conditions :
type: Progressing
status: "True"
reason: NewReplicaSetCreated | FoundNewReplicaSet | ReplicaSetUpdated
You can monitor the progress for a Deployment by using kubectl rollout status .
Complete Deployment
Kubernetes marks a Deployment as complete when it has the following characteristics:
All of the replicas associated with the Deployment have been updated to the latest version you've specified, meaning any
updates you've requested have been completed.
All of the replicas associated with the Deployment are available.
No old replicas for the Deployment are running.
When the rollout becomes “complete”, the Deployment controller sets a condition with the following attributes to the Deployment's
.status.conditions :
type: Progressing
status: "True"
reason: NewReplicaSetAvailable
This Progressing condition will retain a status value of "True" until a new rollout is initiated. The condition holds even when
availability of replicas changes (which does instead affect the Available condition).
You can check if a Deployment has completed by using kubectl rollout status . If the rollout completed successfully, kubectl
rollout status returns a zero exit code.
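kubectl rollout status deployment/nginx-deployment
Waiting for rollout to finish: 2 of 3 updated replicas are available...
deployment "nginx-deployment" successfully rolled out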
echo $?
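0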
Failed Deployment
Your Deployment may get stuck trying to deploy its newest ReplicaSet without ever completing. This can occur due to some of the
following factors:
Insufficient quota
Readiness probe failures
Image pull errors
Insufficient permissions
Limit ranges
Application runtime misconfiguration
One way you can detect this condition is to specify a deadline parameter in your Deployment spec: .spec.progressDeadlineSeconds.
.spec.progressDeadlineSeconds denotes the number of seconds the Deployment controller waits before indicating (in the
Deployment status) that the Deployment progress has stalled.
The following kubectl command sets the spec with progressDeadlineSeconds to make the controller report lack of progress of a
rollout for a Deployment after 10 minutes:
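kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":600}}'
The output is similar to this: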
deployment.apps/nginx-deployment patched
Once the deadline has been exceeded, the Deployment controller adds a DeploymentCondition with the following attributes to the
Deployment's .status.conditions :
type: Progressing
status: "False"
reason: ProgressDeadlineExceeded
This condition can also fail early, in which case it is set to a status value of "False" for reasons such as ReplicaSetCreateError. Also, the
deadline is not taken into account anymore once the Deployment rollout completes.
See the Kubernetes API conventions for more information on status conditions.
Note:
Kubernetes takes no action on a stalled Deployment other than to report a status condition with reason:
ProgressDeadlineExceeded. Higher level orchestrators can take advantage of it and act accordingly, for example, rollback the
Deployment to its previous version.
Note:
If you pause a Deployment rollout, Kubernetes does not check progress against your specified deadline. You can safely pause a
Deployment rollout in the middle of a rollout and resume without triggering the condition for exceeding the deadline.
You may experience transient errors with your Deployments, either due to a low timeout that you have set or due to any other kind
of error that can be treated as transient. For example, let's suppose you have insufficient quota. If you describe the Deployment you
will notice the following section:
<...>
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True ReplicaSetUpdated
ReplicaFailure True FailedCreate
<...>
If you run kubectl get deployment nginx-deployment -o yaml , the Deployment status is similar to this:
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: Replica set "nginx-deployment-4262182780" is progressing.
    reason: ReplicaSetUpdated
    status: "True"
    type: Progressing
  - lastTransitionTime: 2016-10-04T12:25:42Z
    lastUpdateTime: 2016-10-04T12:25:42Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2016-10-04T12:25:39Z
    lastUpdateTime: 2016-10-04T12:25:39Z
    message: 'Error creating: pods "nginx-deployment-4262182780-" is forbidden: exceeded quota:
      object-counts, requested: pods=1, used: pods=3, limited: pods=2'
    reason: FailedCreate
    status: "True"
    type: ReplicaFailure
  observedGeneration: 3
  replicas: 2
  unavailableReplicas: 2
Eventually, once the Deployment progress deadline is exceeded, Kubernetes updates the status and the reason for the Progressing
condition:
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing False ProgressDeadlineExceeded
ReplicaFailure True FailedCreate
You can address an issue of insufficient quota by scaling down your Deployment, by scaling down other controllers you may be
running, or by increasing quota in your namespace. If you satisfy the quota conditions and the Deployment controller then
completes the Deployment rollout, you'll see the Deployment's status update with a successful condition ( status: "True" and
reason: NewReplicaSetAvailable ).
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
type: Available with status: "True" means that your Deployment has minimum availability. Minimum availability is dictated by the
parameters specified in the deployment strategy. type: Progressing with status: "True" means that your Deployment is either in
the middle of a rollout and it is progressing or that it has successfully completed its progress and the minimum required new
replicas are available (see the Reason of the condition for the particulars - in our case reason: NewReplicaSetAvailable means that
the Deployment is complete).
You can check if a Deployment has failed to progress by using kubectl rollout status . kubectl rollout status returns a non-zero
exit code if the Deployment has exceeded the progression deadline.
Waiting for rollout to finish: 2 out of 3 new replicas have been updated...
error: deployment "nginx" exceeded its progress deadline
and the exit status from kubectl rollout is 1 (indicating an error):
echo $?
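1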
Clean up Policy
You can set the .spec.revisionHistoryLimit field in a Deployment to specify how many old ReplicaSets for this Deployment you want to
retain. The rest will be garbage-collected in the background. By default, it is 10.
Note:
Explicitly setting this field to 0 results in cleaning up all the history of your Deployment, so the Deployment will not be able
to roll back.
Canary Deployment
If you want to roll out releases to a subset of users or servers using the Deployment, you can create multiple Deployments, one for
each release, following the canary pattern described in managing resources.
Writing a Deployment Spec
As with all other Kubernetes configs, a Deployment needs .apiVersion, .kind, and .metadata fields, as well as a .spec section.
When the control plane creates new Pods for a Deployment, the .metadata.name of the Deployment is part of the basis for naming
those Pods. The name of a Deployment must be a valid DNS subdomain value, but this can produce unexpected results for the Pod
hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
Pod Template
The .spec.template and .spec.selector are the only required fields of the .spec .
The .spec.template is a Pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .
In addition to required fields for a Pod, a Pod template in a Deployment must specify appropriate labels and an appropriate restart
policy. For labels, make sure not to overlap with other controllers. See selector.
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
If you manually scale a Deployment, for example via kubectl scale deployment deployment --replicas=X, and then update that
Deployment based on a manifest (for example, by running kubectl apply -f deployment.yaml), then applying that manifest
overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Deployment, don't set
.spec.replicas .
Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
Selector
.spec.selector is a required field that specifies a label selector for the Pods targeted by this Deployment.
In API version apps/v1 , .spec.selector and .metadata.labels do not default to .spec.template.metadata.labels if not set. So they
must be set explicitly. Also note that .spec.selector is immutable after creation of the Deployment in apps/v1 .
A Deployment may terminate Pods whose labels match the selector if their template is different from .spec.template or if the total
number of such Pods exceeds .spec.replicas . It brings up new Pods with .spec.template if the number of Pods is less than the
desired number.
Note:
You should not create other Pods whose labels match this selector, either directly, by creating another Deployment, or by
creating another controller such as a ReplicaSet or a ReplicationController. If you do so, the first Deployment thinks that it
created these other Pods. Kubernetes does not stop you from doing this.
If you have multiple controllers that have overlapping selectors, the controllers will fight with each other and won't behave correctly.
Strategy
.spec.strategy specifies the strategy used to replace old Pods by new ones. .spec.strategy.type can be "Recreate" or
"RollingUpdate". "RollingUpdate" is the default value.
Recreate Deployment
All existing Pods are killed before new ones are created when .spec.strategy.type==Recreate .
Note:
This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old
revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you
manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the
old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet.
Max Unavailable
.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable
during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example,
10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if
.spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.
For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the
rolling update starts. Once new Pods are ready, old ReplicaSet can be scaled down further, followed by scaling up the new
ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.
Max Surge
.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the
desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%).
The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default
value is 25%.
For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such
that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new
ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130%
of desired Pods.
Here are some Rolling Update Deployment examples that use the maxUnavailable and maxSurge :
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
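A companion sketch showing only the strategy stanza configured with maxSurge instead, so the rollout can temporarily run one extra Pod above the desired count:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1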
Revision History Limit
.spec.revisionHistoryLimit is an optional field that specifies the number of old ReplicaSets to retain to allow rollback. These old
ReplicaSets consume resources in etcd and crowd the output of kubectl get rs. The configuration of each Deployment revision is
stored in its ReplicaSets; therefore, once an old ReplicaSet is deleted, you lose the ability to roll back to that revision of the Deployment.
By default, 10 old ReplicaSets will be kept; however, the ideal value depends on the frequency and stability of new Deployments.
More specifically, setting this field to zero means that all old ReplicaSets with 0 replicas will be cleaned up. In this case, a new
Deployment rollout cannot be undone, since its revision history is cleaned up.
Paused
.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused
Deployment and one that is not paused is that any changes to the PodTemplateSpec of the paused Deployment will not trigger
new rollouts as long as it is paused. A Deployment is not paused by default when it is created.
What's next
Learn more about Pods.
Run a stateless application using a Deployment.
Read the Deployment API reference to understand the Deployment API.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
Use kubectl to create a Deployment.
2 - ReplicaSet
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. Usually, you define
a Deployment and let that Deployment manage ReplicaSets automatically.
A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time. As such, it is often used to guarantee the
availability of a specified number of identical Pods.
A ReplicaSet is linked to its Pods via the Pods' metadata.ownerReferences field, which specifies what resource the current object is
owned by. All Pods acquired by a ReplicaSet have their owning ReplicaSet's identifying information within their ownerReferences
field. It's through this link that the ReplicaSet knows of the state of the Pods it is maintaining and plans accordingly.
A ReplicaSet identifies new Pods to acquire by using its selector. If there is a Pod that has no OwnerReference or the
OwnerReference is not a Controller and it matches a ReplicaSet's selector, it will be immediately acquired by said ReplicaSet.
This actually means that you may never need to manipulate ReplicaSet objects: use a Deployment instead, and define your
application in the spec section.
Example
controllers/frontend.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
Saving this manifest into frontend.yaml and submitting it to a Kubernetes cluster will create the defined ReplicaSet and the Pods
that it manages.
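kubectl apply -f https://kubernetes.io/examples/controllers/frontend.yaml
You can then get the current ReplicaSets deployed: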
kubectl get rs
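The output is similar to this:
NAME       DESIRED   CURRENT   READY   AGE
frontend   3         3         3       6s
You can also check on the state of the ReplicaSet with kubectl describe rs/frontend; you will see output similar to this: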
Name: frontend
Namespace: default
Selector: tier=frontend
Labels: app=guestbook
tier=frontend
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: tier=frontend
Containers:
php-redis:
Image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5
Port: <none>
Host Port: <none>
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 13s replicaset-controller Created pod: frontend-gbgfx
Normal SuccessfulCreate 13s replicaset-controller Created pod: frontend-rwz57
Normal SuccessfulCreate 13s replicaset-controller Created pod: frontend-wkl7w
And lastly you can check for the Pods brought up:
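kubectl get pods
The output is similar to this (ages are illustrative):
NAME             READY   STATUS    RESTARTS   AGE
frontend-gbgfx   1/1     Running   0          10m
frontend-rwz57   1/1     Running   0          10m
frontend-wkl7w   1/1     Running   0          10m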
You can also verify that the owner reference of these Pods is set to the frontend ReplicaSet. To do this, get the YAML of one of the running Pods (for example, kubectl get pods frontend-gbgfx -o yaml). The output will look similar to this, with the frontend ReplicaSet's info set in the metadata's ownerReferences field:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-02-28T22:30:44Z"
  generateName: frontend-
  labels:
    tier: frontend
  name: frontend-gbgfx
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: frontend
    uid: e129deca-f864-481b-bb16-b27abfd92292
...
Non-Template Pod acquisitions
While you can create bare Pods with no problems, it is strongly recommended to make sure that the bare Pods do not have labels which match the selector of one of your ReplicaSets. The reason for this is that a ReplicaSet is not limited to owning Pods specified by its template -- it can acquire other Pods in the manner specified in the previous sections.
Take the previous frontend ReplicaSet example, and the Pods specified in the following manifest:
pods/pod-rs.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  labels:
    tier: frontend
spec:
  containers:
  - name: hello1
    image: gcr.io/google-samples/hello-app:2.0
---
apiVersion: v1
kind: Pod
metadata:
  name: pod2
  labels:
    tier: frontend
spec:
  containers:
  - name: hello2
    image: gcr.io/google-samples/hello-app:1.0
As those Pods do not have a Controller (or any object) as their owner reference and match the selector of the frontend ReplicaSet,
they will immediately be acquired by it.
Suppose you create the Pods after the frontend ReplicaSet has been deployed and has set up its initial Pod replicas to fulfill its
replica count requirement:
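kubectl apply -f https://kubernetes.io/examples/pods/pod-rs.yaml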
The new Pods will be acquired by the ReplicaSet, and then immediately terminated as the ReplicaSet would be over its desired
count.
The output shows that the new Pods are either already terminated, or in the process of being terminated:
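kubectl get pods
The output is similar to this (names and ages are illustrative):
NAME             READY   STATUS        RESTARTS   AGE
frontend-gbgfx   1/1     Running       0          10m
frontend-rwz57   1/1     Running       0          10m
frontend-wkl7w   1/1     Running       0          10m
pod1             0/1     Terminating   0          1s
pod2             0/1     Terminating   0          1s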
If instead you create the Pods first, and then create the ReplicaSet, you will see that the ReplicaSet has acquired the Pods and has only
created new ones according to its spec, until the number of its new Pods and the original Pods matches its desired count, as fetching
the Pods shows.
Pod Template
The .spec.template is a pod template which is also required to have labels in place. In our frontend.yaml example we had one label:
tier: frontend . Be careful not to overlap with the selectors of other controllers, lest they try to adopt this Pod.
For the template's restart policy field, .spec.template.spec.restartPolicy , the only allowed value is Always , which is the default.
Pod Selector
The .spec.selector field is a label selector. As discussed earlier these are the labels used to identify potential Pods to acquire. In our
frontend.yaml example, the selector was:
matchLabels:
  tier: frontend
In the ReplicaSet, .spec.template.metadata.labels must match .spec.selector, or it will be rejected by the API.
Note:
For 2 ReplicaSets specifying the same .spec.selector but different .spec.template.metadata.labels and .spec.template.spec fields,
each ReplicaSet ignores the Pods created by the other ReplicaSet.
Replicas
You can specify how many Pods should run concurrently by setting .spec.replicas . The ReplicaSet will create/delete its Pods to
match this number.
Deleting a ReplicaSet and its Pods
To delete a ReplicaSet and all of its Pods, use kubectl delete. The garbage collector automatically deletes all of the dependent Pods by default.
When using the REST API or the client-go library, you must set propagationPolicy to Background or Foreground in the -d option.
For example:
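A sketch of such a call, assuming kubectl proxy is serving the API on localhost:8080:
kubectl proxy --port=8080
curl -X DELETE 'localhost:8080/apis/apps/v1/namespaces/default/replicasets/frontend' \
  -d '{"kind":"DeleteOptions","apiVersion":"v1","propagationPolicy":"Foreground"}' \
  -H "Content-Type: application/json"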
Deleting just a ReplicaSet
You can delete a ReplicaSet without affecting any of its Pods by using kubectl delete with the --cascade=orphan option. Once the original is deleted, you can create a new ReplicaSet to replace it. As long as the old and new .spec.selector are the same,
the new one will adopt the old Pods. However, it will not make any effort to make existing Pods match a new, different pod
template. To update Pods to a new spec in a controlled way, use a Deployment, as ReplicaSets do not support a rolling update
directly.
Scaling a ReplicaSet
A ReplicaSet can be easily scaled up or down by simply updating the .spec.replicas field. The ReplicaSet controller ensures that a
desired number of Pods with a matching label selector are available and operational.
When scaling down, the ReplicaSet controller chooses which Pods to delete by sorting the available Pods to prioritize scaling down
Pods based on the following general algorithm:
1. Pending (and unschedulable) Pods are scaled down first.
2. If the controller.kubernetes.io/pod-deletion-cost annotation is set, then the Pod with the lower value comes first.
3. Pods on nodes with more replicas come before Pods on nodes with fewer replicas.
4. If the Pods' creation times differ, the Pod that was created more recently comes before the older Pod.
If all of the above match, selection is random.
Pod deletion cost
Using the controller.kubernetes.io/pod-deletion-cost annotation, users can set a preference regarding which Pods to remove first
when downscaling a ReplicaSet.
The annotation should be set on the Pod; the valid range is [-2147483648, 2147483647]. It represents the cost of deleting a Pod compared
to other Pods belonging to the same ReplicaSet. Pods with a lower deletion cost are preferred to be deleted before Pods with a higher
deletion cost.
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted. Invalid values will be rejected by
the API server.
This feature is beta and enabled by default. You can disable it using the feature gate PodDeletionCost in both kube-apiserver and
kube-controller-manager.
Note:
This is honored on a best-effort basis, so it does not offer any guarantees on pod deletion order.
Users should avoid updating the annotation frequently, such as updating it based on a metric value, because doing so will
generate a significant number of pod updates on the apiserver.
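For illustration, a minimal sketch of marking one Pod as cheaper to delete than its peers; the Pod name and cost value are hypothetical:
apiVersion: v1
kind: Pod
metadata:
  name: frontend-low-priority   # hypothetical name
  labels:
    tier: frontend
  annotations:
    controller.kubernetes.io/pod-deletion-cost: "-100"   # deleted before Pods with a higher cost
spec:
  containers:
  - name: php-redis
    image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5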
ReplicaSet as a Horizontal Pod Autoscaler Target
A ReplicaSet can also be a target for Horizontal Pod Autoscalers (HPA); that is, a ReplicaSet can be auto-scaled by an HPA. Here is an example HPA targeting the ReplicaSet we created in the previous example.
controllers/hpa-rs.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-scaler
spec:
  scaleTargetRef:
    kind: ReplicaSet
    name: frontend
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
Saving this manifest into hpa-rs.yaml and submitting it to a Kubernetes cluster should create the defined HPA that autoscales the
target ReplicaSet depending on the CPU usage of the replicated Pods.
Alternatively, you can use the kubectl autoscale command to accomplish the same (and it's easier!)
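kubectl autoscale rs frontend --max=10 --min=3 --cpu-percent=50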
Alternatives to ReplicaSet
Deployment (recommended)
Deployment is an object which can own ReplicaSets and update them and their Pods via declarative, server-side rolling updates.
While ReplicaSets can be used independently, today they're mainly used by Deployments as a mechanism to orchestrate Pod
creation, deletion and updates. When you use Deployments you don't have to worry about managing the ReplicaSets that they
create. Deployments own and manage their ReplicaSets. As such, it is recommended to use Deployments when you want
ReplicaSets.
Bare Pods
Unlike the case where a user directly created Pods, a ReplicaSet replaces Pods that are deleted or terminated for any reason, such as
in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we recommend that you use a
ReplicaSet even if your application requires only a single Pod. Think of it similarly to a process supervisor, only it supervises multiple
Pods across multiple nodes instead of individual processes on a single node. A ReplicaSet delegates local container restarts to some
agent on the node, such as the kubelet.
Job
Use a Job instead of a ReplicaSet for Pods that are expected to terminate on their own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicaSet for Pods that provide a machine-level function, such as machine monitoring or machine
logging. These Pods have a lifetime that is tied to the machine's lifetime: each Pod needs to be running on the machine before other Pods
start, and is safe to terminate when the machine is otherwise ready to be rebooted or shut down.
ReplicationController
ReplicaSets are the successors to ReplicationControllers. The two serve the same purpose, and behave similarly, except that a
ReplicationController does not support set-based selector requirements as described in the labels user guide. As such, ReplicaSets
are preferred over ReplicationControllers.
What's next
Learn about Pods.
Learn about Deployments.
Run a Stateless Application Using a Deployment, which relies on ReplicaSets to work.
ReplicaSet is a top-level resource in the Kubernetes REST API. Read the ReplicaSet object definition to understand the API for
replica sets.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
3 - StatefulSets
A StatefulSet runs a group of Pods, and maintains a sticky identity for each of those Pods. This is useful for
managing applications that need persistent storage or a stable, unique network identity.
Manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of these Pods.
Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet
maintains a sticky identity for each of its Pods. These pods are created from the same spec, but are not interchangeable: each has a
persistent identifier that it maintains across any rescheduling.
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution.
Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing
volumes to the new Pods that replace any that have failed.
Using StatefulSets
StatefulSets are valuable for applications that require one or more of the following:
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.
In the above, stable is synonymous with persistence across Pod (re)scheduling. If an application doesn't require any stable identifiers
or ordered deployment, deletion, or scaling, you should deploy your application using a workload object that provides a set of
stateless replicas. Deployment or ReplicaSet may be better suited to your stateless needs.
Limitations
The storage for a given Pod must either be provisioned by a PersistentVolume Provisioner (examples here) based on the
requested storage class, or pre-provisioned by an admin.
Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure
data safety, which is generally more valuable than an automatic purge of all related StatefulSet resources.
StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. You are responsible for
creating this Service.
StatefulSets do not provide any guarantees on the termination of pods when a StatefulSet is deleted. To achieve ordered and
graceful termination of the pods in the StatefulSet, it is possible to scale the StatefulSet down to 0 prior to deletion.
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ), it's possible to get into a broken state
that requires manual intervention to repair.
Components
The example below demonstrates the components of a StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.24
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi
Note:
This example uses the ReadWriteOnce access mode, for simplicity. For production use, the Kubernetes project recommends using
the ReadWriteOncePod access mode instead.
For the volumeClaimTemplates in the example above to provide each Pod with stable storage, one of the following must hold:
The StorageClass specified for the volume claim is set up to use dynamic provisioning, or
The cluster already contains a PersistentVolume with the correct StorageClass and sufficient available storage space.
Minimum ready seconds
.spec.minReadySeconds is an optional field that specifies the minimum number of seconds for which a newly created Pod should be
running and ready without any of its containers crashing, for it to be considered available. This is used to check progression of a
rollout when using a Rolling Update strategy. This field defaults to 0 (the Pod will be considered available as soon as it is ready). To
learn more about when a Pod is considered ready, see Container Probes.
Pod Identity
StatefulSet Pods have a unique identity that consists of an ordinal, a stable network identity, and stable storage. The identity sticks to
the Pod, regardless of which node it's (re)scheduled on.
Ordinal Index
For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, that is unique over the Set. By
default, pods will be assigned ordinals from 0 up through N-1. The StatefulSet controller will also add a pod label with this index:
apps.kubernetes.io/pod-index .
Start ordinal
.spec.ordinals is an optional field that allows you to configure the integer ordinals assigned to each Pod. It defaults to nil. Within
the field, you can configure the following options:
.spec.ordinals.start : If the .spec.ordinals.start field is set, Pods will be assigned ordinals from .spec.ordinals.start up
through .spec.ordinals.start + .spec.replicas - 1 .
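For illustration, a minimal sketch (only the relevant fields shown; values are hypothetical) that makes a three-replica StatefulSet number its Pods web-2, web-3, and web-4 instead of web-0 through web-2:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  replicas: 3
  ordinals:
    start: 2   # Pods receive ordinals 2, 3, and 4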
Stable Network ID
Each Pod in a StatefulSet derives its hostname from the name of the StatefulSet and the ordinal of the Pod. The pattern for the
constructed hostname is $(statefulset name)-$(ordinal) . The example above will create three Pods named web-0,web-1,web-2 . A
StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form:
$(service name).$(namespace).svc.cluster.local , where "cluster.local" is the cluster domain. As each Pod is created, it gets a
matching DNS subdomain, taking the form: $(podname).$(governing service domain) , where the governing service is defined by the
serviceName field on the StatefulSet.
Depending on how DNS is configured in your cluster, you may not be able to look up the DNS name for a newly-run Pod
immediately. This behavior can occur when other clients in the cluster have already sent queries for the hostname of the Pod before
it was created. Negative caching (normal in DNS) means that the results of previous failed lookups are remembered and reused,
even after the Pod is running, for at least a few seconds.
If you need to discover Pods promptly after they are created, you have a few options:
Query the Kubernetes API directly (for example, using a watch) rather than relying on DNS lookups.
Decrease the time of caching in your Kubernetes DNS provider (typically this means editing the config map for CoreDNS, which
currently caches for 30 seconds).
As mentioned in the limitations section, you are responsible for creating the Headless Service responsible for the network identity of
the pods.
Here are some examples of choices for Cluster Domain, Service name, StatefulSet name, and how that affects the DNS names for
the StatefulSet's Pods.
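The table below is reconstructed from the naming pattern described above; the service and StatefulSet names are illustrative:
Cluster Domain   Service (ns/name)   StatefulSet (ns/name)   StatefulSet Domain                Pod DNS                                        Pod Hostname
cluster.local    default/nginx       default/web             nginx.default.svc.cluster.local   web-{0..N-1}.nginx.default.svc.cluster.local   web-{0..N-1}
cluster.local    foo/nginx           foo/web                 nginx.foo.svc.cluster.local       web-{0..N-1}.nginx.foo.svc.cluster.local       web-{0..N-1}
kube.local       foo/nginx           foo/web                 nginx.foo.svc.kube.local          web-{0..N-1}.nginx.foo.svc.kube.local          web-{0..N-1}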
Note:
Cluster Domain will be set to cluster.local unless otherwise configured.
Stable Storage
For each VolumeClaimTemplate entry defined in a StatefulSet, each Pod receives one PersistentVolumeClaim. In the nginx example
above, each Pod receives a single PersistentVolume with a StorageClass of my-storage-class and 1 GiB of provisioned storage. If no
StorageClass is specified, then the default StorageClass will be used. When a Pod is (re)scheduled onto a node, its volumeMounts
mount the PersistentVolumes associated with its PersistentVolumeClaims. Note that the PersistentVolumes associated with the
Pods' PersistentVolumeClaims are not deleted when the Pods, or the StatefulSet, are deleted. This must be done manually.
Pod index label
When the StatefulSet controller creates a Pod, the new Pod is labelled with apps.kubernetes.io/pod-index. The value of this label is
the ordinal index of the Pod. This label allows you to route traffic to a particular pod index, filter logs/metrics using the pod index
label, and more. Note that the feature gate PodIndexLabel is enabled and locked by default; to disable it, users
will have to use server emulated version v1.31.
Deployment and Scaling Guarantees
For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}. When Pods are being deleted, they are terminated in reverse order, from {N-1..0}. Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready, and before a Pod is terminated, all of its successors must be completely shut down.
The StatefulSet should not specify a pod.Spec.TerminationGracePeriodSeconds of 0. This practice is unsafe and strongly discouraged.
For further explanation, please refer to force deleting StatefulSet Pods.
When the nginx example above is created, three Pods will be deployed in the order web-0, web-1, web-2. web-1 will not be deployed
before web-0 is Running and Ready, and web-2 will not be deployed until web-1 is Running and Ready. If web-0 should fail, after
web-1 is Running and Ready, but before web-2 is launched, web-2 will not be launched until web-0 is successfully relaunched and
becomes Running and Ready.
If a user were to scale the deployed example by patching the StatefulSet such that replicas=1 , web-2 would be terminated first.
web-1 would not be terminated until web-2 is fully shutdown and deleted. If web-0 were to fail after web-2 has been terminated and
is completely shutdown, but prior to web-1's termination, web-1 would not be terminated until web-0 is Running and Ready.
Update strategies
A StatefulSet's .spec.updateStrategy field allows you to configure and disable automated rolling updates for containers, labels,
resource request/limits, and annotations for the Pods in a StatefulSet. There are two possible values:
OnDelete
When a StatefulSet's .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller will not automatically update the Pods
in a StatefulSet. Users must manually delete Pods to cause the controller to create new Pods that reflect modifications made to a
StatefulSet's .spec.template.
RollingUpdate
The RollingUpdate update strategy implements automated, rolling updates for the Pods in a StatefulSet. This is the default update
strategy.
Rolling Updates
When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate , the StatefulSet controller will delete and recreate each Pod
in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod
one at a time.
The Kubernetes control plane waits until an updated Pod is Running and Ready prior to updating its predecessor. If you have set
.spec.minReadySeconds (see Minimum Ready Seconds), the control plane additionally waits that amount of time after the Pod turns
ready, before moving on.
You can control the maximum number of Pods that can be unavailable during an update by specifying the
.spec.updateStrategy.rollingUpdate.maxUnavailable field. The value can be an absolute number (for example, 5 ) or a percentage of
desired Pods (for example, 10% ). Absolute number is calculated from the percentage value by rounding it up. This field cannot be 0.
The default setting is 1.
This field applies to all Pods in the range 0 to replicas - 1 . If there is any unavailable Pod in the range 0 to replicas - 1 , it will be
counted towards maxUnavailable .
Note:
The maxUnavailable field is in Alpha stage and it is honored only by API servers that are running with the
MaxUnavailableStatefulSet feature gate enabled.
Forced rollback
When using Rolling Updates with the default Pod Management Policy ( OrderedReady ), it's possible to get into a broken state that
requires manual intervention to repair.
If you update the Pod template to a configuration that never becomes Running and Ready (for example, due to a bad binary or
application-level configuration error), StatefulSet will stop the rollout and wait.
In this state, it's not enough to revert the Pod template to a good configuration. Due to a known issue, StatefulSet will continue to
wait for the broken Pod to become Ready (which never happens) before it will attempt to revert it back to the working configuration.
After reverting the template, you must also delete any Pods that StatefulSet had already attempted to run with the bad
configuration. StatefulSet will then begin to recreate the Pods using the reverted template.
PersistentVolumeClaim retention
ⓘ FEATURE STATE: Kubernetes v1.32 [stable] (enabled by default: true)
The optional .spec.persistentVolumeClaimRetentionPolicy field controls if and how PVCs are deleted during the lifecycle of a
StatefulSet. This behavior is governed by the StatefulSetAutoDeletePVC feature gate on the API server and the controller manager,
which is enabled by default now that the feature is stable. There are two policies you can configure for each StatefulSet:
whenDeleted
configures the volume retention behavior that applies when the StatefulSet is deleted
whenScaled
configures the volume retention behavior that applies when the replica count of the StatefulSet is reduced; for example, when
scaling down the set.
For each policy that you can configure, you can set the value to either Delete or Retain .
Delete
The PVCs created from the StatefulSet volumeClaimTemplate are deleted for each Pod affected by the policy. With the whenDeleted
policy all PVCs from the volumeClaimTemplate are deleted after their Pods have been deleted. With the whenScaled policy, only PVCs
corresponding to Pod replicas being scaled down are deleted, after their Pods have been deleted.
Retain (default)
PVCs from the volumeClaimTemplate are not affected when their Pod is deleted. This is the behavior before this new feature.
Bear in mind that these policies only apply when Pods are being removed due to the StatefulSet being deleted or scaled down. For
example, if a Pod associated with a StatefulSet fails due to node failure, and the control plane creates a replacement Pod, the
StatefulSet retains the existing PVC. The existing volume is unaffected, and the cluster will attach it to the node where the new Pod is
about to launch.
The default for policies is Retain , matching the StatefulSet behavior before this new feature.
apiVersion: apps/v1
kind: StatefulSet
...
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Delete
...
The StatefulSet controller adds owner references to its PVCs, which are then deleted by the garbage collector after the Pod is
terminated. This enables the Pod to cleanly unmount all volumes before the PVCs are deleted (and before the backing PV and
volume are deleted, depending on the retain policy). When you set the whenDeleted policy to Delete , an owner reference to the
StatefulSet instance is placed on all PVCs associated with that StatefulSet.
The whenScaled policy must delete PVCs only when a Pod is scaled down, and not when a Pod is deleted for another reason. When
reconciling, the StatefulSet controller compares its desired replica count to the actual Pods present on the cluster. Any StatefulSet
Pod whose ordinal is greater than or equal to the replica count is condemned and marked for deletion. If the whenScaled policy is Delete , the
condemned Pods are first set as owners to the associated StatefulSet template PVCs, before the Pod is deleted. This causes the PVCs
to be garbage collected only after the condemned Pods have terminated.
This means that if the controller crashes and restarts, no Pod will be deleted before its owner reference has been updated
appropriately for the policy. If a condemned Pod is force-deleted while the controller is down, the owner reference may or may not
have been set up, depending on when the controller crashed. It may take several reconcile loops to update the owner references, so
some condemned Pods may have set up owner references and others may not. For this reason we recommend waiting for the
controller to come back up, which will verify owner references before terminating Pods. If that is not possible, the operator should
verify the owner references on PVCs to ensure the expected objects are deleted when Pods are force-deleted.
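For example, you could inspect the owner references on a claim with kubectl; the claim name www-web-0 is illustrative, following the usual <volumeClaimTemplate-name>-<pod-name> convention:
kubectl get pvc www-web-0 -o jsonpath='{.metadata.ownerReferences}'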
Replicas
.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.
If you manually scale a StatefulSet, for example via kubectl scale statefulset statefulset --replicas=X , and then you update
that StatefulSet based on a manifest (for example: by running kubectl apply -f statefulset.yaml ), then applying that manifest
overwrites the manual scaling that you previously did.
If a HorizontalPodAutoscaler (or any similar API for horizontal scaling) is managing scaling for a Statefulset, don't set .spec.replicas .
Instead, allow the Kubernetes control plane to manage the .spec.replicas field automatically.
What's next
Learn about Pods.
Find out how to use StatefulSets.
Follow an example of deploying a stateful application.
Follow an example of deploying Cassandra with StatefulSets.
Follow an example of running a replicated stateful application.
Learn how to scale a StatefulSet.
Learn what's involved when you delete a StatefulSet.
Learn how to configure a Pod to use a volume for storage.
Learn how to configure a Pod to use a PersistentVolume for storage.
StatefulSet is a top-level resource in the Kubernetes REST API. Read the StatefulSet object definition to understand the API for
stateful sets.
Read about PodDisruptionBudget and how you can use it to manage application availability during disruptions.
4 - DaemonSet
A DaemonSet defines Pods that provide node-local facilities. These might be fundamental to the operation of
your cluster, such as a networking helper tool, or be part of an add-on.
A DaemonSet ensures that all (or some) Nodes run a copy of a Pod. As nodes are added to the cluster, Pods are added to them. As
nodes are removed from the cluster, those Pods are garbage collected. Deleting a DaemonSet will clean up the Pods it created.
In a simple case, one DaemonSet, covering all nodes, would be used for each type of daemon. A more complex setup might use
multiple DaemonSets for a single type of daemon, but with different flags and/or different memory and cpu requests for different
hardware types.
controllers/daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
  labels:
    k8s-app: fluentd-logging
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      tolerations:
      # these tolerations are to have the daemonset runnable on control plane nodes
      # remove them if your control plane nodes should not run pods
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      # it may be desirable to set a high priority class to ensure that a DaemonSet Pod
      # preempts running Pods
      # priorityClassName: important
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
Required Fields
As with all other Kubernetes config, a DaemonSet needs apiVersion , kind , and metadata fields. For general information about
working with config files, see running stateless applications and object management using kubectl.
Pod Template
The .spec.template is one of the required fields in .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .
In addition to required fields for a Pod, a Pod template in a DaemonSet has to specify appropriate labels (see pod selector).
A Pod Template in a DaemonSet must have a RestartPolicy equal to Always , or be unspecified, which defaults to Always .
Pod Selector
The .spec.selector field is a pod selector. It works the same as the .spec.selector of a Job.
You must specify a pod selector that matches the labels of the .spec.template . Also, once a DaemonSet is created, its
.spec.selector can not be mutated. Mutating the pod selector can lead to the unintentional orphaning of Pods, and it was found to
be confusing to users.
The .spec.selector must match the .spec.template.metadata.labels . Config with these two not matching will be rejected by the API.
Note:
If it's important that the DaemonSet pod run on each node, it's often desirable to set the .spec.template.spec.priorityClassName
of the DaemonSet to a PriorityClass with a higher priority, so that under resource pressure the kubelet evicts other Pods instead
of the DaemonSet Pod.
The user can specify a different scheduler for the Pods of the DaemonSet, by setting the .spec.template.spec.schedulerName field of
the DaemonSet.
The original node affinity specified at the .spec.template.spec.affinity.nodeAffinity field (if specified) is taken into consideration by
the DaemonSet controller when evaluating the eligible nodes, but is replaced on the created Pod with the node affinity that matches
the name of the eligible node.
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchFields:
      - key: metadata.name
        operator: In
        values:
        - target-host-name
Taints and tolerations
The DaemonSet controller automatically adds a set of tolerations to DaemonSet Pods:
| Toleration key | Effect | Details |
| --- | --- | --- |
| node.kubernetes.io/not-ready | NoExecute | DaemonSet Pods can be scheduled onto nodes that are not healthy or ready to accept Pods. Any DaemonSet Pods running on such nodes will not be evicted. |
| node.kubernetes.io/unreachable | NoExecute | DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller. Any DaemonSet Pods running on such nodes will not be evicted. |
| node.kubernetes.io/disk-pressure | NoSchedule | DaemonSet Pods can be scheduled onto nodes with disk pressure issues. |
| node.kubernetes.io/memory-pressure | NoSchedule | DaemonSet Pods can be scheduled onto nodes with memory pressure issues. |
| node.kubernetes.io/pid-pressure | NoSchedule | DaemonSet Pods can be scheduled onto nodes with process pressure issues. |
| node.kubernetes.io/unschedulable | NoSchedule | DaemonSet Pods can be scheduled onto nodes that are unschedulable. |
| node.kubernetes.io/network-unavailable | NoSchedule | Only added for DaemonSet Pods that request host networking, i.e., Pods having spec.hostNetwork: true. Such DaemonSet Pods can be scheduled onto nodes with unavailable network. |
You can add your own tolerations to the Pods of a DaemonSet as well, by defining these in the Pod template of the DaemonSet.
Because the DaemonSet controller sets the node.kubernetes.io/unschedulable:NoSchedule toleration automatically, Kubernetes can
run DaemonSet Pods on nodes that are marked as unschedulable.
If you use a DaemonSet to provide an important node-level function, such as cluster networking, it is helpful that Kubernetes places
DaemonSet Pods on nodes before they are ready. For example, without that special toleration, you could end up in a deadlock
situation where the node is not marked as ready because the network plugin is not running there, and at the same time the network
plugin is not running on that node because the node is not yet ready.
Communicating with Daemon Pods
Some possible patterns for communicating with Pods in a DaemonSet are:
Push: Pods in the DaemonSet are configured to send updates to another service, such as a stats database. They do not have
clients.
NodeIP and Known Port: Pods in the DaemonSet can use a hostPort , so that the pods are reachable via the node IPs. Clients
know the list of node IPs somehow, and know the port by convention.
DNS: Create a headless service with the same pod selector, and then discover DaemonSets using the endpoints resource or
retrieve multiple A records from DNS.
Service: Create a service with the same Pod selector, and use the service to reach a daemon on a random node. (No way to
reach specific node.)
Updating a DaemonSet
If node labels are changed, the DaemonSet will promptly add Pods to newly matching nodes and delete Pods from newly not-
matching nodes.
You can modify the Pods that a DaemonSet creates. However, Pods do not allow all fields to be updated. Also, the DaemonSet
controller will use the original template the next time a node (even with the same name) is created.
You can delete a DaemonSet. If you specify --cascade=orphan with kubectl , then the Pods will be left on the nodes. If you
subsequently create a new DaemonSet with the same selector, the new DaemonSet adopts the existing Pods. If any Pods need
replacing, the DaemonSet replaces them according to its updateStrategy .
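For example, using the fluentd-elasticsearch DaemonSet from the manifest above:
kubectl delete daemonset fluentd-elasticsearch -n kube-system --cascade=orphan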
Alternatives to DaemonSet
Init scripts
It is certainly possible to run daemon processes by directly starting them on a node (e.g. using init , upstartd , or systemd ). This is
perfectly fine. However, there are several advantages to running such processes via a DaemonSet:
Ability to monitor and manage logs for daemons in the same way as applications.
Same config language and tools (e.g. Pod templates, kubectl ) for daemons and applications.
Running daemons in containers with resource limits increases isolation of daemons from app containers. However, this
can also be accomplished by running the daemons in a container but not in a Pod.
Bare Pods
It is possible to create Pods directly which specify a particular node to run on. However, a DaemonSet replaces Pods that are deleted
or terminated for any reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this
reason, you should use a DaemonSet rather than creating individual Pods.
Static Pods
It is possible to create Pods by writing a file to a certain directory watched by Kubelet. These are called static pods. Unlike
DaemonSet, static Pods cannot be managed with kubectl or other Kubernetes API clients. Static Pods do not depend on the
apiserver, making them useful in cluster bootstrapping cases. Also, static Pods may be deprecated in the future.
Deployments
DaemonSets are similar to Deployments in that they both create Pods, and those Pods have processes which are not expected to
terminate (e.g. web servers, storage servers).
Use a Deployment for stateless services, like frontends, where scaling up and down the number of replicas and rolling out updates
are more important than controlling exactly which host the Pod runs on. Use a DaemonSet when it is important that a copy of a Pod
always run on all or certain hosts, if the DaemonSet provides node-level functionality that allows other Pods to run correctly on that
particular node.
For example, network plugins often include a component that runs as a DaemonSet. The DaemonSet component makes sure that
the node where it's running has working cluster networking.
What's next
Learn about Pods.
Learn about static Pods, which are useful for running Kubernetes control plane components.
Find out how to use DaemonSets.
Perform a rolling update on a DaemonSet
Perform a rollback on a DaemonSet (for example, if a roll out didn't work how you expected).
Understand how Kubernetes assigns Pods to Nodes.
Learn about device plugins and add-ons, which often run as DaemonSets.
DaemonSet is a top-level resource in the Kubernetes REST API. Read the DaemonSet object definition to understand the API for
daemon sets.
5 - Jobs
Jobs represent one-off tasks that run to completion and then stop.
A Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them successfully
terminate. As pods successfully complete, the Job tracks the successful completions. When a specified number of successful
completions is reached, the task (i.e., the Job) is complete. Deleting a Job will clean up the Pods it created. Suspending a Job will delete its
active Pods until the Job is resumed again.
A simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a new Pod if the first
Pod fails or is deleted (for example due to a node hardware failure or a node reboot).
If you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.
controllers/job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
You can run the example with this command:
kubectl apply -f https://kubernetes.io/examples/controllers/job.yaml
The output is similar to this:
job.batch/pi created
To list all the Pods that belong to a Job in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}')
echo $pods
The output is similar to this:
pi-5rwd7
Here, the selector is the same as the selector for the Job. The --output=jsonpath option specifies an expression with the name from
each Pod in the returned list.
When the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the basis for naming those Pods. The
name of a Job must be a valid DNS subdomain value, but this can produce unexpected results for the Pod hostnames. For best
compatibility, the name should follow the more restrictive rules for a DNS label. Even when the name is a DNS subdomain, the name
must be no longer than 63 characters.
Job Labels
Job labels will have batch.kubernetes.io/ prefix for job-name and controller-uid .
Pod Template
The .spec.template is the only required field of the .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .
In addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector) and an appropriate
restart policy.
Pod selector
The .spec.selector field is optional. In almost all cases you should not specify it. See section specifying your own pod selector.
Parallel execution for Jobs
There are three main types of task suitable to run as a Job:
1. Non-parallel Jobs
normally, only one Pod is started, unless the Pod fails.
the Job is complete as soon as its Pod terminates successfully.
2. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions .
the Job represents the overall task, and is complete when there are .spec.completions successful Pods.
when using .spec.completionMode="Indexed" , each Pod gets a different index in the range 0 to .spec.completions-1 .
3. Parallel Jobs with a work queue:
do not specify .spec.completions ; it effectively defaults to .spec.parallelism .
the Pods must coordinate amongst themselves or an external service to determine what each should work on. For
example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is
done.
when any Pod from the Job terminates with success, no new Pods are created.
once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They
should all be in the process of exiting.
For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both are defaulted
to 1.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set
.spec.parallelism , or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
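As a minimal sketch, here is a fixed completion count Job that processes five work items, at most two at a time; the name, image, and command are illustrative:
apiVersion: batch/v1
kind: Job
metadata:
  name: process-items # illustrative name
spec:
  completions: 5  # the Job is complete after 5 Pods terminate successfully
  parallelism: 2  # run at most 2 Pods at any one time
  template:
    spec:
      containers:
      - name: worker
        image: busybox:1.28
        command: ["sh", "-c", "echo processing one work item"]
      restartPolicy: Never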
Controlling parallelism
The requested parallelism ( .spec.parallelism ) can be set to any non-negative value. If it is unspecified, it defaults to 1. If it is
specified as 0, then the Job is effectively paused until it is increased.
Actual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a variety of reasons:
For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of remaining
completions. Higher values of .spec.parallelism are effectively ignored.
For work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining Pods are allowed to complete,
however.
If the Job Controller has not had time to react.
If the Job controller failed to create Pods for any reason (lack of ResourceQuota , lack of permission, etc.), then there may be
fewer pods than requested.
The Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.
When a Pod is gracefully shut down, it takes time to stop.
Completion mode
Jobs with fixed completion count - that is, jobs that have non-null .spec.completions - can have a completion mode that is specified in
.spec.completionMode :
NonIndexed (default): the Job is considered complete when there have been .spec.completions successfully completed Pods. In
other words, each Pod completion is homologous to each other. Note that Jobs that have null .spec.completions are implicitly
NonIndexed .
Indexed: the Pods of a Job get an associated completion index from 0 to .spec.completions-1 . The index is available through
four mechanisms:
the Pod annotation batch.kubernetes.io/job-completion-index .
the Pod label batch.kubernetes.io/job-completion-index (for Kubernetes v1.28 and later).
as part of the Pod hostname, following the pattern $(job-name)-$(index) .
from the containerized task, in the environment variable JOB_COMPLETION_INDEX .
Note:
Although rare, more than one Pod could be started for the same index (due to various reasons such as node failures, kubelet
restarts, or Pod evictions). In this case, only the first Pod that completes successfully will count towards the completion count
and update the status of the Job. The other Pods that are running or completed for the same index will be deleted by the Job
controller once they are detected.
Handling Pod and container failures
A container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit code, or the
container was killed for exceeding a memory limit. If this happens, and the .spec.template.spec.restartPolicy = "OnFailure" , then the
Pod stays on the node, but the container is re-run. Therefore, your program needs to handle the case when it is restarted locally, or
else specify .spec.template.spec.restartPolicy = "Never" .
An entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is upgraded, rebooted,
deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy = "Never" . When a Pod fails, then the Job
controller starts a new Pod. This means that your application needs to handle the case when it is restarted in a new pod. In
particular, it needs to handle temporary files, locks, incomplete output and the like caused by previous runs.
By default, each pod failure is counted towards the .spec.backoffLimit limit, see pod backoff failure policy. However, you can
customize handling of pod failures by setting the Job's pod failure policy.
Additionally, you can choose to count the pod failures independently for each index of an Indexed Job by setting the
.spec.backoffLimitPerIndex field (for more information, see backoff limit per index).
Note that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and .spec.template.spec.restartPolicy = "Never" ,
the same program may sometimes be started twice.
If you do specify .spec.parallelism and .spec.completions both greater than 1, then there may be multiple pods running at once.
Therefore, your pods must also be tolerant of concurrency.
If you specify the .spec.podFailurePolicy field, the Job controller does not consider a terminating Pod (a pod that has a
.metadata.deletionTimestamp field set) as a failure until that Pod is terminal (its .status.phase is Failed or Succeeded ). However, the
Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller
evaluates .backoffLimit and .podFailurePolicy for the relevant Job, taking this now-terminated Pod into consideration.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod
later terminates with phase: "Succeeded" .
The number of retries is calculated in two ways: the number of Pods with .status.phase = "Failed" , and, when using
restartPolicy = "OnFailure" , the number of retries in all the containers of Pods with .status.phase equal to Pending or Running .
If either of the calculations reaches the .spec.backoffLimit , the Job is considered failed.
Note:
If your job has restartPolicy = "OnFailure", keep in mind that your Pod running the Job will be terminated once the job backoff
limit has been reached. This can make debugging the Job's executable more difficult. We suggest setting restartPolicy = "Never"
when debugging the Job or using a logging system to ensure output from failed Jobs is not lost inadvertently.
Backoff limit per index
Note:
You can only configure the backoff limit per index for an Indexed Job, if you have the JobBackoffLimitPerIndex feature gate
enabled in your cluster.
When you run an indexed Job, you can choose to handle retries for pod failures independently for each index. To do so, set the
.spec.backoffLimitPerIndex to specify the maximal number of pod failures per index.
When the per-index backoff limit is exceeded for an index, Kubernetes considers the index as failed and adds it to the
.status.failedIndexes field. The succeeded indexes, those with a successfully executed pods, are recorded in the
.status.completedIndexes field, regardless of whether you set the backoffLimitPerIndex field.
Note that a failing index does not interrupt execution of other indexes. Once all indexes finish for a Job where you specified a
backoff limit per index, if at least one of those indexes did fail, the Job controller marks the overall Job as failed, by setting the Failed
condition in the status. The Job gets marked as failed even if some, potentially nearly all, of the indexes were processed successfully.
You can additionally limit the maximal number of indexes marked failed by setting the .spec.maxFailedIndexes field. When the
number of failed indexes exceeds the maxFailedIndexes field, the Job controller triggers termination of all remaining running Pods
for that Job. Once all pods are terminated, the entire Job is marked failed by the Job controller, by setting the Failed condition in the
Job status.
/controllers/job-backoff-limit-per-index-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-example
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed  # required for the feature
  backoffLimitPerIndex: 1  # maximal number of failures per index
  maxFailedIndexes: 5      # maximal number of failed indexes before terminating the Job execution
  template:
    spec:
      restartPolicy: Never # required for the feature
      containers:
      - name: example
        image: python
        command:           # The Job fails as there is at least one failed index
                           # (all even indexes fail in here), yet all indexes
                           # are executed as maxFailedIndexes is not exceeded.
        - python3
        - -c
        - |
          import os, sys
          print("Hello world")
          if int(os.environ.get("JOB_COMPLETION_INDEX")) % 2 == 0:
            sys.exit(1)
In the example above, the Job controller allows for one restart for each of the indexes. When the total number of failed indexes
exceeds 5, then the entire Job is terminated.
Once the Job finishes, its status looks similar to this:
status:
  completedIndexes: 1,3,5,7,9
  failedIndexes: 0,2,4,6,8
  succeeded: 5 # 1 succeeded pod for each of 5 succeeded indexes
  failed: 10   # 2 failed pods (1 retry) for each of 5 failed indexes
  conditions:
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: FailureTarget
  - message: Job has failed indexes
    reason: FailedIndexes
    status: "True"
    type: Failed
The Job controller adds the FailureTarget Job condition to trigger Job termination and cleanup. When all of the Job Pods are
terminated, the Job controller adds the Failed condition with the same values for reason and message as the FailureTarget Job
condition. For details, see Termination of Job Pods.
Additionally, you may want to use the per-index backoff along with a pod failure policy. When using per-index backoff, there is a new
FailIndex action available which allows you to avoid unnecessary retries within an index.
Pod failure policy
ⓘ FEATURE STATE: Kubernetes v1.31 [stable]
A Pod failure policy, defined with the .spec.podFailurePolicy field, enables your cluster to handle Pod failures based on the
container exit codes and the Pod conditions.
In some situations, you may want to have a better control when handling Pod failures than the control provided by the Pod backoff
failure policy, which is based on the Job's .spec.backoffLimit . These are some examples of use cases:
To optimize costs of running workloads by avoiding unnecessary Pod restarts, you can terminate a Job as soon as one of its
Pods fails with an exit code indicating a software bug.
To guarantee that your Job finishes even if there are disruptions, you can ignore Pod failures caused by disruptions (such as
preemption, API-initiated eviction or taint-based eviction) so that they don't count towards the .spec.backoffLimit limit of
retries.
You can configure a Pod failure policy, in the .spec.podFailurePolicy field, to meet the above use cases. This policy can handle Pod
failures based on the container exit codes and the Pod conditions.
/controllers/job-pod-failure-policy-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example
spec:
  completions: 12
  parallelism: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/library/bash:5
        command: ["bash"] # example command simulating a bug which triggers the FailJob action
        args:
        - -c
        - echo "Hello world!" && sleep 5 && exit 42
  backoffLimit: 6
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main # optional
        operator: In        # one of: In, NotIn
        values: [42]
    - action: Ignore        # one of: Ignore, FailJob, Count
      onPodConditions:
      - type: DisruptionTarget # indicates Pod disruption
In the example above, the first rule of the Pod failure policy specifies that the Job should be marked failed if the main container fails
with the 42 exit code. The following are the rules for the main container specifically:
an exit code of 0 means that the container (and the Pod) succeeded.
an exit code of 42 means that the entire Job failed.
any other exit code represents that the container failed, and hence the entire Pod. The Pod will be re-created if the total
number of restarts is below backoffLimit ; if the backoffLimit is reached, the entire Job fails.
Note:
Because the Pod template specifies a restartPolicy: Never, the kubelet does not restart the main container in that particular Pod.
The second rule of the Pod failure policy, specifying the Ignore action for failed Pods with condition DisruptionTarget excludes Pod
disruptions from being counted towards the .spec.backoffLimit limit of retries.
Note:
If the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job is running multiple Pods, Kubernetes
terminates all the Pods in that Job that are still Pending or Running.
These are some requirements and semantics of the API:
if you want to use a .spec.podFailurePolicy field for a Job, you must also define that Job's pod template with
.spec.restartPolicy set to Never .
the Pod failure policy rules you specify under spec.podFailurePolicy.rules are evaluated in order. Once a rule matches a Pod
failure, the remaining rules are ignored. When no rule matches the Pod failure, the default handling applies.
you may want to restrict a rule to a specific container by specifying its name
in spec.podFailurePolicy.rules[*].onExitCodes.containerName . When not specified the rule applies to all containers. When
specified, it should match one of the container or initContainer names in the Pod template.
you may specify the action taken when a Pod failure policy is matched by spec.podFailurePolicy.rules[*].action . Possible
values are:
FailJob : use to indicate that the Pod's job should be marked as Failed and all running Pods should be terminated.
Ignore : use to indicate that the counter towards the .spec.backoffLimit should not be incremented and a replacement
Pod should be created.
Count : use to indicate that the Pod should be handled in the default way. The counter towards the .spec.backoffLimit
should be incremented.
FailIndex : use this action along with backoff limit per index to avoid unnecessary retries within the index of a failed pod.
Note:
When you use a podFailurePolicy, the job controller only matches Pods in the Failed phase. Pods with a deletion timestamp that
are not in a terminal phase (Failed or Succeeded) are considered still terminating. This implies that terminating pods retain a
tracking finalizer until they reach a terminal phase. Since Kubernetes 1.27, the kubelet transitions deleted pods to a terminal phase
(see: Pod Phase). This ensures that deleted pods have their finalizers removed by the Job controller.
Note:
Starting with Kubernetes v1.28, when Pod failure policy is used, the Job controller recreates terminating Pods only once these
Pods reach the terminal Failed phase. This behavior is similar to podReplacementPolicy: Failed. For more information, see Pod
replacement policy.
When you use the podFailurePolicy , and the Job fails due to the pod matching the rule with the FailJob action, then the Job
controller triggers the Job termination process by adding the FailureTarget condition. For more details, see Job termination and
cleanup.
Success policy
ⓘ FEATURE STATE: Kubernetes v1.31 [beta] (enabled by default: true)
Note:
You can only configure a success policy for an Indexed Job if you have the JobSuccessPolicy feature gate enabled in your cluster.
When creating an Indexed Job, you can define when a Job can be declared as succeeded using a .spec.successPolicy , based on the
pods that succeeded.
By default, a Job succeeds when the number of succeeded Pods equals .spec.completions . These are some situations where you
might want additional control for declaring a Job succeeded:
When running simulations with different parameters, you might not need all the simulations to succeed for the overall Job to
be successful.
When following a leader-worker pattern, only the success of the leader determines the success or failure of a Job. Examples of
this are frameworks like MPI and PyTorch etc.
You can configure a success policy, in the .spec.successPolicy field, to meet the above use cases. This policy can handle Job success
based on the succeeded pods. After the Job meets the success policy, the job controller terminates the lingering Pods. A success
policy is defined by rules. Each rule can take one of the following forms:
When you specify the succeededIndexes only, once all indexes specified in the succeededIndexes succeed, the job controller
marks the Job as succeeded. The succeededIndexes must be a list of intervals between 0 and .spec.completions-1 .
When you specify the succeededCount only, once the number of succeeded indexes reaches the succeededCount , the job
controller marks the Job as succeeded.
When you specify both succeededIndexes and succeededCount , once the number of succeeded indexes from the subset of
indexes specified in the succeededIndexes reaches the succeededCount , the job controller marks the Job as succeeded.
Note that when you specify multiple rules in the .spec.successPolicy.rules , the job controller evaluates the rules in order. Once the
Job meets a rule, the job controller ignores remaining rules.
/controllers/job-success-policy.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-success
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed # Required for the success policy
  successPolicy:
    rules:
      - succeededIndexes: 0,2-3
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:          # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,
                          # the overall Job is a success.
        - python3
        - -c
        - |
          import os, sys
          if os.environ.get("JOB_COMPLETION_INDEX") == "2":
            sys.exit(0)
          else:
            sys.exit(1)
      restartPolicy: Never
In the example above, both succeededIndexes and succeededCount have been specified. Therefore, the job controller will mark the
Job as succeeded and terminate the lingering Pods when any one of the specified indexes, 0, 2, or 3, succeeds. The Job that meets the
success policy gets the SuccessCriteriaMet condition with a SuccessPolicy reason. After the removal of the lingering Pods is issued,
the Job gets the Complete condition.
Note that succeededIndexes is represented as a comma-separated list of intervals; each interval is written as its first and last index,
separated by a hyphen.
Note:
When you specify both a success policy and some terminating policies such as .spec.backoffLimit and .spec.podFailurePolicy,
once the Job meets either policy, the job controller respects the terminating policy and ignores the success policy.
Job termination and cleanup
By default, a Job will run uninterrupted unless a Pod fails ( restartPolicy=Never ) or a Container exits in error
( restartPolicy=OnFailure ), at which point the Job defers to the .spec.backoffLimit described above. Once .spec.backoffLimit has
been reached the Job will be marked as failed and any running Pods will be terminated.
Another way to terminate a Job is by setting an active deadline. Do this by setting the .spec.activeDeadlineSeconds field of the Job to
a number of seconds. The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a
Job reaches activeDeadlineSeconds , all of its running Pods are terminated and the Job status will become type: Failed with reason:
DeadlineExceeded .
Note that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit . Therefore, a Job that is retrying one or
more failed Pods will not deploy additional Pods once it reaches the time limit specified by activeDeadlineSeconds , even if the
backoffLimit is not yet reached.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-timeout
spec:
  backoffLimit: 5
  activeDeadlineSeconds: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
Note that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds field. Ensure that you set this
field at the proper level.
Keep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no automatic Job restart once the Job
status is type: Failed . That is, the Job termination mechanisms activated with .spec.activeDeadlineSeconds and .spec.backoffLimit
result in a permanent Job failure that requires manual intervention to resolve.
Termination of Job Pods
In Kubernetes v1.31 and later the Job controller delays the addition of the terminal conditions, Failed or Complete , until all of the
Job Pods are terminated.
In Kubernetes v1.30 and earlier, the Job controller added the Complete or the Failed Job terminal conditions as soon as the Job
termination process was triggered and all Pod finalizers were removed. However, some Pods would still be running or terminating at
the moment that the terminal condition was added.
In Kubernetes v1.31 and later, the controller only adds the Job terminal conditions after all of the Pods are terminated. You can
control this behavior by using the JobManagedBy and the JobPodReplacementPolicy (both enabled by default) feature gates.
Factors like terminationGracePeriodSeconds might increase the amount of time from the moment that the Job controller adds the
FailureTarget condition or the SuccessCriteriaMet condition to the moment that all of the Job Pods terminate and the Job controller
adds a terminal condition ( Failed or Complete ).
You can use the FailureTarget or the SuccessCriteriaMet condition to evaluate whether the Job has failed or succeeded without
having to wait for the controller to add a terminal condition.
For example, you might want to decide when to create a replacement Job that replaces a failed Job. If you replace the failed Job when
the FailureTarget condition appears, your replacement Job runs sooner, but could result in Pods from the failed and the
replacement Job running at the same time, using extra compute resources.
Alternatively, if your cluster has limited resource capacity, you could choose to wait until the Failed condition appears on the Job,
which would delay your replacement Job but would ensure that you conserve resources by waiting until all of the failed Pods are
removed.
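For example, you could block on either outcome with kubectl wait, which can match any condition type; the Job name is illustrative:
# returns as soon as the Job is doomed to fail, possibly before all Pods have terminated
kubectl wait --for=condition=FailureTarget job/myjob --timeout=300s
# returns only once the terminal condition has been added and all Pods are accounted for
kubectl wait --for=condition=Failed job/myjob --timeout=600s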
Clean up finished jobs automatically
Finished Jobs are usually no longer needed in the system; keeping them around puts pressure on the API server. If the Jobs are
managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified
capacity-based cleanup policy.
TTL mechanism for finished Jobs
ⓘ FEATURE STATE: Kubernetes v1.23 [stable]
Another way to clean up finished Jobs (either Complete or Failed ) automatically is to use a TTL mechanism provided by a TTL
controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together
with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
For example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
The Job pi-with-ttl will be eligible to be automatically deleted 100 seconds after it finishes.
If the field is set to 0 , the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job
won't be cleaned up by the TTL controller after it finishes.
Note:
It is recommended to set the ttlSecondsAfterFinished field because unmanaged jobs (Jobs that you created directly, and not
indirectly through other workload APIs such as CronJob) have a default deletion policy of orphanDependents , causing Pods created
by an unmanaged Job to be left around after that Job is fully deleted. Even though the control plane eventually garbage collects
the Pods from a deleted Job after they either fail or complete, sometimes those lingering pods may cause cluster performance
degradation or, in the worst case, cause the cluster to go offline due to this degradation.
You can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a particular namespace can
consume.
Job patterns
The Job object can be used to process a set of independent but related work items. These might be emails to be sent, frames to be
rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.
In a complex system, there may be multiple different sets of work items. Here we are just considering one set of work items that the
user wants to manage together — a batch job.
There are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs are:
One Job object for each work item, versus a single Job object for all work items. One Job per work item creates some overhead
for the user and for the system to manage large numbers of Job objects. A single Job for all work items is better for large
numbers of items.
Number of Pods created equals number of work items, versus each Pod can process multiple work items. When the number of
Pods equals the number of work items, the Pods typically require less modification to existing code and containers. Having
each Pod process multiple work items is better for large numbers of items.
Several approaches use a work queue. This requires running a queue service, and modifications to the existing program or
container to make it use the work queue. Other approaches are easier to adapt to an existing containerised application.
When the Job is associated with a headless Service, you can enable the Pods within a Job to communicate with each other to
collaborate in a computation.
The tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names are also links to
examples and more detailed description.
| Pattern | Single Job object | Fewer pods than work items? | Use app unmodified? |
| --- | --- | --- | --- |
| Queue with Pod Per Work Item | ✓ | | sometimes |
| Queue with Variable Pod Count | ✓ | ✓ | |
| Indexed Job with Static Work Assignment | ✓ | | ✓ |
| Job with Pod-to-Pod Communication | ✓ | sometimes | sometimes |
| Job Template Expansion | | | ✓ |
When you specify completions with .spec.completions , each Pod created by the Job controller has an identical spec . This means
that all pods for a task will have the same command line and the same image, the same volumes, and (almost) the same
environment variables. These patterns are different ways to arrange for pods to work on different things.
This table shows the required settings for .spec.parallelism and .spec.completions for each of the patterns. Here, W is the number
of work items.
| Pattern | .spec.completions | .spec.parallelism |
| --- | --- | --- |
| Queue with Pod Per Work Item | W | any |
| Queue with Variable Pod Count | null | any |
| Indexed Job with Static Work Assignment | W | any |
| Job with Pod-to-Pod Communication | W | W |
| Job Template Expansion | 1 | should be 1 |
Advanced usage
Suspending a Job
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements and will continue to
do so until the Job is complete. However, you may want to temporarily suspend a Job's execution and resume it later, or start Jobs in
suspended state and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you want to resume it again, update it to
false. Creating a Job with .spec.suspend set to true will create it in the suspended state.
When a Job is resumed from suspension, its .status.startTime field will be reset to the current time. This means that the
.spec.activeDeadlineSeconds timer will be stopped and reset when a Job is suspended and resumed.
When you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a SIGTERM signal. The
Pod's graceful termination period will be honored and your Pod must handle this signal in this period. This may involve saving
progress for later or undoing changes. Pods terminated this way will not count towards the Job's completions count.
You can also toggle Job suspension by patching the Job using the command line.
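For example (assuming a Job named myjob):
# suspend a running Job
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":true}}'
# resume it later
kubectl patch job/myjob --type=strategic --patch '{"spec":{"suspend":false}}'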
The Job's status can be used to determine if a Job is suspended or has been suspended in the past:
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  conditions:
  - lastProbeTime: "2021-02-05T13:14:33Z"
    lastTransitionTime: "2021-02-05T13:14:33Z"
    status: "True"
    type: Suspended
  startTime: "2021-02-05T13:13:48Z"
The Job condition of type "Suspended" with status "True" means the Job is suspended; the lastTransitionTime field can be used to
determine how long the Job has been suspended for. If the status of that condition is "False", then the Job was previously suspended
and is now running. If such a condition does not exist in the Job's status, the Job has never been stopped.
Events are also created when the Job is suspended and resumed:
The last four events, particularly the "Suspended" and "Resumed" events, are directly a result of toggling the .spec.suspend field. In
the time between these two events, we see that no Pods were created, but Pod creation restarted as soon as the Job was resumed.
Mutable Scheduling Directives
In most cases, a parallel job will want the pods to run with constraints, like all in the same zone, or all either on GPU model x or y but
not a mix of both.
The suspend field is the first step towards achieving those semantics. Suspend allows a custom queue controller to decide when a
job should start; However, once a job is unsuspended, a custom queue controller has no influence on where the pods of a job will
actually land.
This feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers the ability to
influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler. This is allowed only
for suspended Jobs that have never been unsuspended before.
The fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels, annotations and
scheduling gates.
Specifying your own Pod selector
Normally, when you create a Job object, you do not specify .spec.selector ; the system defaulting logic adds this field when the Job
is created, picking a selector value that will not overlap with any other Jobs.
However, in some cases, you might need to override this automatically set selector. To do this, you can specify the .spec.selector of
the Job.
Be very careful when doing this. If you specify a label selector which is not unique to the pods of that Job, and which matches
unrelated Pods, then pods of the unrelated job may be deleted, or this Job may count other Pods as completing it, or one or both
Jobs may refuse to create Pods or run to completion. If a non-unique selector is chosen, then other controllers (e.g.
ReplicationController) and their Pods may behave in unpredictable ways too. Kubernetes will not stop you from making a mistake
when specifying .spec.selector .
Here is an example of a case when you might want to use this feature.
Say Job old is already running. You want existing Pods to keep running, but you want the rest of the Pods it creates to use a
different pod template and for the Job to have a new name. You cannot update the Job because these fields are not updatable.
Therefore, you delete Job old but leave its pods running, using kubectl delete jobs/old --cascade=orphan . Before deleting it, you
make a note of what selector it uses:
kubectl get job old -o yaml
The output is similar to this:
kind: Job
metadata:
  name: old
  ...
spec:
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
Then you create a new Job with name new and you explicitly specify the same selector. Since the existing Pods have label
batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002 , they are controlled by Job new as well.
You need to specify manualSelector: true in the new Job since you are not using the selector that the system normally generates for
you automatically.
kind: Job
metadata:
  name: new
  ...
spec:
  manualSelector: true
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002
  ...
The new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002 . Setting manualSelector: true tells the system
that you know what you are doing and to allow this mismatch.
Job tracking with finalizers
The control plane keeps track of the Pods that belong to any Job and notices if any such Pod is removed from the API server. To do
that, the Job controller creates Pods with the finalizer batch.kubernetes.io/job-tracking . The controller removes the finalizer only
after the Pod has been accounted for in the Job status, allowing the Pod to be removed by other controllers or users.
Note:
See My pod stays terminating if you observe that pods from a Job are stuck with the tracking finalizer.
Elastic Indexed Jobs
You can scale Indexed Jobs up or down by mutating both .spec.parallelism and .spec.completions together such that
.spec.parallelism == .spec.completions . When scaling down, Kubernetes removes the Pods with higher indexes.
Use cases for elastic Indexed Jobs include batch workloads which require scaling an indexed Job, such as MPI, Horovod, Ray, and
PyTorch training jobs.
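As a sketch, such a scale-up could be applied with kubectl patch, keeping the two fields equal as required; the Job name and sizes are illustrative:
kubectl patch job/myindexedjob --type=merge --patch '{"spec":{"parallelism":5,"completions":5}}'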
Delayed creation of replacement Pods
By default, the Job controller recreates Pods as soon as they either fail or are terminating (have a deletion timestamp). This means that,
at a given time, when some of the Pods are terminating, the number of running Pods for a Job can be greater than parallelism or
greater than one Pod per index (if you are using an Indexed Job).
You may choose to create replacement Pods only when the terminating Pod is fully terminal (has status.phase: Failed ). To do this,
set the .spec.podReplacementPolicy: Failed . The default replacement policy depends on whether the Job has a podFailurePolicy set.
With no Pod failure policy defined for a Job, omitting the podReplacementPolicy field selects the TerminatingOrFailed replacement
policy: the control plane creates replacement Pods immediately upon Pod deletion (as soon as the control plane sees that a Pod for
this Job has deletionTimestamp set). For Jobs with a Pod failure policy set, the default podReplacementPolicy is Failed , and no other
value is permitted. See Pod failure policy to learn more about Pod failure policies for Jobs.
kind: Job
metadata:
  name: new
  ...
spec:
  podReplacementPolicy: Failed
  ...
Provided your cluster has the feature gate enabled, you can inspect the .status.terminating field of a Job. The value of the field is
the number of Pods owned by the Job that are currently terminating.
apiVersion: batch/v1
kind: Job
# .metadata and .spec omitted
status:
  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase
Delegation of managing a Job object to external controller
Note:
You can only set the managedBy field on Jobs if you enable the JobManagedBy feature gate (enabled by default).
This feature allows you to disable the built-in Job controller, for a specific Job, and delegate reconciliation of the Job to an external
controller.
You indicate the controller that reconciles the Job by setting a custom value for the spec.managedBy field - any value other than
kubernetes.io/job-controller . The value of the field is immutable.
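A minimal sketch of delegating a Job; the managedBy value here is a placeholder for whatever external controller you actually install:
apiVersion: batch/v1
kind: Job
metadata:
  name: delegated-job # illustrative name
spec:
  managedBy: example.com/custom-job-controller # any value other than kubernetes.io/job-controller
  template:
    spec:
      containers:
      - name: main
        image: busybox:1.28
        command: ["sh", "-c", "echo reconciled by an external controller"]
      restartPolicy: Never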
Note:
When using this feature, make sure the controller indicated by the field is installed, otherwise the Job may not be reconciled at
all.
Note:
When developing an external Job controller be aware that your controller needs to operate in a fashion conformant with the
definitions of the API spec and status fields of the Job object.
Please review these in detail in the Job API. We also recommend that you run the e2e conformance tests for the Job object to
verify your implementation.
Finally, when developing an external Job controller make sure it does not use the batch.kubernetes.io/job-tracking finalizer,
reserved for the built-in controller.
Warning:
If you are considering to disable the JobManagedBy feature gate, or to downgrade the cluster to a version without the feature gate
enabled, check if there are jobs with a custom value of the spec.managedBy field. If there are such jobs, there is a risk that they
might be reconciled by two controllers after the operation: the built-in Job controller and the external controller indicated by the
field value.
Alternatives
Bare Pods
When the node that a Pod is running on reboots or fails, the pod is terminated and will not be restarted. However, a Job will create
new Pods to replace terminated ones. For this reason, we recommend that you use a Job rather than a bare Pod, even if your
application requires only a single Pod.
Replication Controller
Jobs are complementary to Replication Controllers. A Replication Controller manages Pods which are not expected to terminate (e.g.
web servers), and a Job manages Pods that are expected to terminate (e.g. batch tasks).
As discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to OnFailure or Never . (Note: If
RestartPolicy is not set, the default value is Always .)
Single Job starts controller Pod
Another pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those
Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with
Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see
spark example), runs a Spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but maintains complete
control over what Pods are created and how work is assigned to them.
What's next
Learn about Pods.
Read about different ways of running Jobs:
Coarse Parallel Processing Using a Work Queue
Fine Parallel Processing Using a Work Queue
Use an indexed Job for parallel processing with static work assignment
Create multiple Jobs based on a template: Parallel Processing using Expansions
Follow the links within Clean up finished jobs automatically to learn more about how your cluster can clean up completed and /
or failed tasks.
Job is part of the Kubernetes REST API. Read the Job object definition to understand the API for jobs.
Read about CronJob, which you can use to define a series of Jobs that will run based on a schedule, similar to the UNIX tool
cron .
Practice how to configure handling of retriable and non-retriable pod failures using podFailurePolicy , based on the step-by-
step examples.
6 - Automatic Cleanup for Finished Jobs
A time-to-live mechanism to clean up old Jobs that have finished execution.
When your Job has finished, it's useful to keep that Job in the API (and not immediately delete the Job) so that you can tell whether
the Job succeeded or failed.
Kubernetes' TTL-after-finished controller provides a TTL (time to live) mechanism to limit the lifetime of Job objects that have finished
execution.
The TTL-after-finished controller assumes that a Job is eligible to be cleaned up TTL seconds after the Job has finished. The timer
starts once the status condition of the Job changes to show that the Job is either Complete or Failed ; once the TTL has expired, that
Job becomes eligible for cascading removal. When the TTL-after-finished controller cleans up a job, it will delete it cascadingly, that is
to say it will delete its dependent objects together with it.
Kubernetes honors object lifecycle guarantees on the Job, such as waiting for finalizers.
You can set the TTL seconds at any time. Here are some examples for setting the .spec.ttlSecondsAfterFinished field of a Job:
Specify this field in the Job manifest, so that a Job can be cleaned up automatically some time after it finishes.
Manually set this field of existing, already finished Jobs, so that they become eligible for cleanup (see the example after this list).
Use a mutating admission webhook to set this field dynamically at Job creation time. Cluster administrators can use this to
enforce a TTL policy for finished jobs.
Use a mutating admission webhook to set this field dynamically after the Job has finished, and choose different TTL values
based on job status, labels. For this case, the webhook needs to detect changes to the .status of the Job and only set a TTL
when the Job is being marked as completed.
Write your own controller to manage the cleanup TTL for Jobs that match a particular selector.
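For example, to manually set the field on an existing Job as described above (the Job name and TTL value are illustrative):
kubectl patch job/my-finished-job --type=merge --patch '{"spec":{"ttlSecondsAfterFinished":3600}}'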
Caveats
Updating TTL for finished Jobs
You can modify the TTL period, e.g. .spec.ttlSecondsAfterFinished field of Jobs, after the job is created or has finished. If you extend
the TTL period after the existing ttlSecondsAfterFinished period has expired, Kubernetes doesn't guarantee to retain that Job, even
if an update to extend the TTL returns a successful API response.
Time skew
Because the TTL-after-finished controller uses timestamps stored in the Kubernetes jobs to determine whether the TTL has expired
or not, this feature is sensitive to time skew in your cluster, which may cause the control plane to clean up Job objects at the wrong
time.
Clocks aren't always correct, but the difference should be very small. Please be aware of this risk when setting a non-zero TTL.
What's next
Read Clean up Jobs automatically
Refer to the Kubernetes Enhancement Proposal (KEP) for adding this mechanism.
7 - CronJob
A CronJob starts one-time Jobs on a repeating schedule.
CronJob is meant for performing regular scheduled actions such as backups, report generation, and so on. One CronJob object is like
one line of a crontab (cron table) file on a Unix system. It runs a Job periodically on a given schedule, written in Cron format.
CronJobs have limitations and idiosyncrasies. For example, in certain circumstances, a single CronJob can create multiple concurrent
Jobs. See the limitations below.
When the control plane creates new Jobs and (indirectly) Pods for a CronJob, the .metadata.name of the CronJob is part of the basis
for naming those Pods. The name of a CronJob must be a valid DNS subdomain value, but this can produce unexpected results for
the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label. Even when the name is
a DNS subdomain, the name must be no longer than 52 characters. This is because the CronJob controller will automatically append
11 characters to the name you provide and there is a constraint that the length of a Job name is no more than 63 characters.
Example
This example CronJob manifest prints the current time and a hello message every minute:
application/job/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
(Running Automated Tasks with a CronJob takes you through this example in more detail).
Schedule syntax
The .spec.schedule field is required. Its value follows standard Cron syntax: five space-separated fields representing, in order,
minute (0 - 59), hour (0 - 23), day of the month (1 - 31), month (1 - 12), and day of the week (0 - 6, Sunday to Saturday).
For example, 0 3 * * 1 means this task is scheduled to run weekly on a Monday at 3 AM.
The format also includes extended "Vixie cron" step values. As explained in the FreeBSD manual:
Step values can be used in conjunction with ranges. Following a range with /<number> specifies skips of the number's value
through the range. For example, 0-23/2 can be used in the hours field to specify command execution every other hour (the
alternative in the V7 standard is 0,2,4,6,8,10,12,14,16,18,20,22 ). Steps are also permitted after an asterisk, so if you want to say
"every two hours", just use */2 .
Note:
A question mark ( ? ) in the schedule has the same meaning as an asterisk ( * ): it stands for any available value for the given
field.
Other than the standard syntax, some macros like @monthly can also be used:
@yearly (or @annually): Run once a year at midnight of 1 January (equivalent to 0 0 1 1 * )
@monthly: Run once a month at midnight of the first day of the month (equivalent to 0 0 1 * * )
@weekly: Run once a week at midnight on Sunday morning (equivalent to 0 0 * * 0 )
@daily (or @midnight): Run once a day at midnight (equivalent to 0 0 * * * )
@hourly: Run once an hour at the beginning of the hour (equivalent to 0 * * * * )
To generate CronJob schedule expressions, you can also use web tools like crontab.guru.
Job template
The .spec.jobTemplate defines a template for the Jobs that the CronJob creates, and it is required. It has exactly the same schema as
a Job, except that it is nested and does not have an apiVersion or kind . You can specify common metadata for the templated Jobs,
such as labels or annotations. For information about writing a Job .spec , see Writing a Job Spec.
Starting deadline
The optional .spec.startingDeadlineSeconds field stands for the deadline (in seconds) for starting a Job if that Job misses its
scheduled time for any reason. After missing the deadline, the CronJob skips that instance of the Job (future occurrences are still
scheduled). For example, if you have a backup Job that runs twice a day, you might allow it to start up to 8 hours late, but no later,
because a backup taken any later wouldn't be useful: you would instead prefer to wait for the next scheduled run.
Kubernetes treats Jobs that miss their configured deadline as failed Jobs. If you don't specify startingDeadlineSeconds for a
CronJob, the Job occurrences have no deadline.
If the .spec.startingDeadlineSeconds field is set (not null), the CronJob controller measures the time between when a Job is expected
to be created and now. If the difference is higher than that limit, it skips this execution. For example, if the field is set to 200 , a Job
may be created for up to 200 seconds after its scheduled time.
Concurrency policy
The .spec.concurrencyPolicy field is also optional. It specifies how to treat concurrent executions of a Job that is created by this
CronJob. The spec may specify only one of the following concurrency policies:
Allow (default): the CronJob allows concurrently running Jobs.
Forbid: the CronJob does not allow concurrent runs; if it is time for a new Job run and the previous Job run hasn't finished yet,
the CronJob skips the new Job run (and counts it as missed).
Replace: if it is time for a new Job run and the previous Job run hasn't finished yet, the CronJob replaces the currently running
Job run with a new Job run.
Note that concurrency policy only applies to the Jobs created by the same CronJob. If there are multiple CronJobs, their respective
Jobs are always allowed to run concurrently.
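As a sketch, a CronJob that combines a Forbid policy with a starting deadline might look like this; the name, schedule, and workload here are illustrative:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup   # hypothetical name
spec:
  schedule: "0 3 * * *"           # every day at 3 AM
  startingDeadlineSeconds: 200    # skip a run that cannot start within 200s of its scheduled time
  concurrencyPolicy: Forbid       # never start a new Job while the previous one is still running
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox:1.28
            command: ["/bin/sh", "-c", "echo backing up"]
          restartPolicy: OnFailure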
Schedule suspension
You can suspend execution of Jobs for a CronJob, by setting the optional .spec.suspend field to true. The field defaults to false.
This setting does not affect Jobs that the CronJob has already started.
If you do set that field to true, all subsequent executions are suspended (they remain scheduled, but the CronJob controller does not
start the Jobs to run the tasks) until you unsuspend the CronJob.
Caution:
Executions that are suspended during their scheduled time count as missed Jobs. When .spec.suspend changes from true to
false on an existing CronJob without a starting deadline, the missed Jobs are scheduled immediately.
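For example, assuming the hello CronJob from the example above, one way to suspend and later resume it is with kubectl patch:
kubectl patch cronjob hello -p '{"spec":{"suspend":true}}'
# ...later, resume scheduling:
kubectl patch cronjob hello -p '{"spec":{"suspend":false}}'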
Jobs history limits
.spec.successfulJobsHistoryLimit : This field specifies the number of successful finished jobs to keep. The default value is 3 .
Setting this field to 0 means Kubernetes does not keep any successful jobs.
.spec.failedJobsHistoryLimit : This field specifies the number of failed finished jobs to keep. The default value is 1 . Setting this
field to 0 means Kubernetes does not keep any failed jobs.
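For instance, a CronJob spec fragment that keeps the last five successful Jobs and the last two failed ones could look like this (the values are illustrative):
spec:
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 2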
For another way to clean up Jobs automatically, see Clean up finished Jobs automatically.
Time zones
For CronJobs with no time zone specified, the kube-controller-manager interprets schedules relative to its local time zone.
You can specify a time zone for a CronJob by setting .spec.timeZone to the name of a valid time zone. For example, setting
.spec.timeZone: "Etc/UTC" instructs Kubernetes to interpret the schedule relative to Coordinated Universal Time.
A time zone database from the Go standard library is included in the binaries and used as a fallback in case an external database is
not available on the system.
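For example, a spec fragment that runs every Monday at 3 AM UTC, regardless of the kube-controller-manager's local time zone, might look like this:
spec:
  schedule: "0 3 * * 1"
  timeZone: "Etc/UTC"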
CronJob limitations
Unsupported TimeZone specification
Specifying a timezone using CRON_TZ or TZ variables inside .spec.schedule is not officially supported (and never has been).
Starting with Kubernetes 1.29, if you try to set a schedule that includes TZ or CRON_TZ timezone specification, Kubernetes fails to
create the resource with a validation error. Updates to CronJobs already using TZ or CRON_TZ continue to report a warning to
the client.
Modifying a CronJob
By design, a CronJob contains a template for new Jobs. If you modify an existing CronJob, the changes you make will apply to new
Jobs that start to run after your modification is complete. Jobs (and their Pods) that have already started continue to run without
changes. That is, the CronJob does not update existing Jobs, even if those remain running.
Job creation
A CronJob creates a Job object approximately once per execution time of its schedule. The scheduling is approximate because there
are certain circumstances where two Jobs might be created, or no Job might be created. Kubernetes tries to avoid those situations,
but does not completely prevent them. Therefore, the Jobs that you define should be idempotent.
Starting with Kubernetes v1.32, CronJobs apply an annotation batch.kubernetes.io/cronjob-scheduled-timestamp to their created Jobs.
This annotation indicates the originally scheduled creation time for the Job and is formatted in RFC3339.
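On a created Job, the annotation appears in the metadata, for example (the timestamp here is illustrative):
metadata:
  annotations:
    batch.kubernetes.io/cronjob-scheduled-timestamp: "2024-01-01T08:30:00Z"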
If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to Allow , the Jobs will
always run at least once.
Caution:
If startingDeadlineSeconds is set to a value less than 10 seconds, the CronJob may not be scheduled. This is because the CronJob
controller checks things every 10 seconds.
For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until
now. If there are more than 100 missed schedules, then it does not start the Job and logs the error.
Cannot determine if job needs to be started. Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds
It is important to note that if the startingDeadlineSeconds field is set (not nil ), the controller counts how many missed Jobs
occurred from the value of startingDeadlineSeconds until now rather than from the last scheduled time until now. For example, if
startingDeadlineSeconds is 200 , the controller counts how many missed Jobs occurred in the last 200 seconds.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example, if concurrencyPolicy is set to Forbid
and a CronJob was attempted to be scheduled when there was a previous schedule still running, then it would count as missed.
For example, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00 , and its
startingDeadlineSeconds field is not set. If the CronJob controller happens to be down from 08:29:00 to 10:21:00 , the Job will not
start, because the number of Jobs that missed their schedule is greater than 100.
To illustrate this concept further, suppose a CronJob is set to schedule a new Job every one minute beginning at 08:30:00 , and its
startingDeadlineSeconds is set to 200 seconds. If the CronJob controller happens to be down for the same period as the previous
example ( 08:29:00 to 10:21:00 ), the Job will still start at 10:22:00. This happens because the controller now checks how many missed
schedules happened in the last 200 seconds (that is, 3 missed schedules), rather than from the last scheduled time until now.
The CronJob is only responsible for creating Jobs that match its schedule, and the Job in turn is responsible for the management of
the Pods it represents.
What's next
Learn about Pods and Jobs, two concepts that CronJobs rely upon.
Read about the detailed format of CronJob .spec.schedule fields.
For instructions on creating and working with CronJobs, and for an example of a CronJob manifest, see Running automated
tasks with CronJobs.
CronJob is part of the Kubernetes REST API. Read the CronJob API reference for more details.
8 - ReplicationController
Legacy API for managing workloads that can scale horizontally. Superseded by the Deployment and ReplicaSet
APIs.
Note:
A Deployment that configures a ReplicaSet is now the recommended way to set up replication.
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a
ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use
case is to run several identical replicas of a replicated service, such as web servers.
controllers/replication.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    app: nginx
  template:
    metadata:
      name: nginx
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
Run the example by downloading the example file and then running this command:
kubectl apply -f https://k8s.io/examples/controllers/replication.yaml
The output is similar to this:
replicationcontroller/nginx created
Check on the status of the ReplicationController with this command:
kubectl describe replicationcontrollers/nginx
The output is similar to this:
Name: nginx
Namespace: default
Selector: app=nginx
Labels: app=nginx
Annotations: <none>
Replicas: 3 current / 3 desired
Pods Status: 0 Running / 3 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app=nginx
Containers:
nginx:
Image: nginx
Port: 80/TCP
Environment: <none>
Mounts: <none>
Volumes: <none>
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- ---- ------ -------
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod
20s 20s 1 {replication-controller } Normal SuccessfulCreate Created pod
Here, three pods are created, but none is running yet, perhaps because the image is being pulled. A little later, the same command
may show:
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
To list all the pods that belong to the ReplicationController in a machine readable form, you can use a command like this:
pods=$(kubectl get pods --selector=app=nginx --output=jsonpath={.items..metadata.name})
echo $pods
The output lists one name per pod, similar to: nginx-3ntk0 nginx-4ok8v nginx-qrm3m
Here, the selector is the same as the selector for the ReplicationController (seen in the kubectl describe output), and in a different
form in replication.yaml . The --output=jsonpath option specifies an expression with the name from each pod in the returned list.
When the control plane creates new Pods for a ReplicationController, the .metadata.name of the ReplicationController is part of the
basis for naming those Pods. The name of a ReplicationController must be a valid DNS subdomain value, but this can produce
unexpected results for the Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.
For general information about working with configuration files, see object management.
Pod Template
The .spec.template is the only required field of the .spec .
The .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not have an apiVersion
or kind .
In addition to required fields for a Pod, a pod template in a ReplicationController must specify appropriate labels and an appropriate
restart policy. For labels, make sure not to overlap with other controllers. See pod selector.
Only a .spec.template.spec.restartPolicy equal to Always is allowed, which is the default if not specified.
For local container restarts, ReplicationControllers delegate to an agent on the node, for example the Kubelet.
Pod Selector
The .spec.selector field is a label selector. A ReplicationController manages all the pods with labels that match the selector. It does
not distinguish between pods that it created or deleted and pods that another person or process created or deleted. This allows the
ReplicationController to be replaced without affecting the running pods.
If specified, the .spec.template.metadata.labels must be equal to the .spec.selector , or it will be rejected by the API. If
.spec.selector is unspecified, it will be defaulted to .spec.template.metadata.labels .
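Using the nginx example above, the matching fragment looks like this; if you omitted .spec.selector entirely, it would default to the same app: nginx labels:
spec:
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx   # must equal .spec.selector (or .spec.selector can be omitted so it defaults to these labels)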
Also you should not normally create any pods whose labels match this selector, either directly, with another ReplicationController, or
with another controller such as Job. If you do so, the ReplicationController thinks that it created the other pods. Kubernetes does not
stop you from doing this.
If you do end up with multiple controllers that have overlapping selectors, you will have to manage the deletion yourself (see below).
Multiple Replicas
You can specify how many pods should run concurrently by setting .spec.replicas . The number running at any time may be higher
or lower, such as if the replicas were just increased or decreased, or if a pod is gracefully shut down and a replacement starts early.
Working with ReplicationControllers
Deleting a ReplicationController and its Pods
To delete a ReplicationController and all its pods, use kubectl delete . Kubectl scales the ReplicationController to zero and waits for
it to delete each pod before deleting the ReplicationController itself. When using the REST API or client library, you need to do the
steps explicitly (scale replicas to 0, wait for pod deletions, then delete the ReplicationController).
Deleting only a ReplicationController
You can delete a ReplicationController without affecting any of its pods: using kubectl, specify the --cascade=orphan option.
When using the REST API or client library, you can delete the ReplicationController object.
Once the original is deleted, you can create a new ReplicationController to replace it. As long as the old and new .spec.selector are
the same, then the new one will adopt the old pods. However, it will not make any effort to make existing pods match a new,
different pod template. To update pods to a new spec in a controlled way, use a rolling update.
Scaling
The ReplicationController enables scaling the number of replicas up or down, either manually or by an auto-scaling control agent, by
updating the replicas field.
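For example, using the nginx ReplicationController from earlier, you could scale it manually with kubectl:
kubectl scale replicationcontroller nginx --replicas=5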
Rolling updates
The ReplicationController is designed to facilitate rolling updates to a service by replacing pods one-by-one.
As explained in #1353, the recommended approach is to create a new ReplicationController with 1 replica, scale the new (+1) and old
(-1) controllers one by one, and then delete the old controller after it reaches 0 replicas. This predictably updates the set of pods
regardless of unexpected failures.
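As a sketch, with hypothetical controllers my-app-v1 (running 3 replicas of the old template) and my-app-v2 (the new template), the one-by-one update could be driven like this:
kubectl scale rc my-app-v2 --replicas=1   # new controller starts with 1 replica
kubectl scale rc my-app-v1 --replicas=2   # scale the old controller down by one
kubectl scale rc my-app-v2 --replicas=2   # and the new one up, repeating until...
kubectl scale rc my-app-v1 --replicas=0   # ...the old controller reaches 0 replicas
kubectl delete rc my-app-v1               # then delete it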
Ideally, the rolling update controller would take application readiness into account, and would ensure that a sufficient number of
pods were productively serving at any given time.
The two ReplicationControllers would need to create pods with at least one differentiating label, such as the image tag of the
primary container of the pod, since it is typically image updates that motivate rolling updates.
For instance, a service might target all pods with tier in (frontend), environment in (prod) . Now say you have 10 replicated pods
that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a ReplicationController
with replicas set to 9 for the bulk of the replicas, with labels tier=frontend, environment=prod, track=stable , and another
ReplicationController with replicas set to 1 for the canary, with labels tier=frontend, environment=prod, track=canary . Now the
service is covering both the canary and non-canary pods. But you can mess with the ReplicationControllers separately to test things
out, monitor the results, etc.
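A sketch of the label sets involved, using hypothetical ReplicationController names; the Service selector deliberately omits track so it covers both:
# Service selector (fragment): matches both tracks
selector:
  tier: frontend
  environment: prod
# frontend-stable ReplicationController (fragment)
spec:
  replicas: 9
  selector:
    tier: frontend
    environment: prod
    track: stable
# frontend-canary ReplicationController (fragment)
spec:
  replicas: 1
  selector:
    tier: frontend
    environment: prod
    track: canary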
A ReplicationController will never terminate on its own, but it isn't expected to be as long-lived as services. Services may be
composed of pods controlled by multiple ReplicationControllers, and it is expected that many ReplicationControllers may be created
and destroyed over the lifetime of a service (for instance, to perform an update of pods that run the service). Both services
themselves and their clients should remain oblivious to the ReplicationControllers that maintain the pods of the services.
Writing programs for Replication
Pods created by a ReplicationController are intended to be fungible and semantically identical, though their configurations may
become heterogeneous over time. This is an obvious fit for replicated stateless servers, but ReplicationControllers can also be used
to maintain availability of master-elected, sharded, and worker-pool applications. Such applications should use dynamic work
assignment mechanisms, such as the RabbitMQ work queues, as opposed to static/one-time customization of the configuration of
each pod, which is considered an anti-pattern. Any pod customization performed, such as vertical auto-sizing of resources (for
example, cpu or memory), should be performed by another online controller process, not unlike the ReplicationController itself.
The ReplicationController is forever constrained to this narrow responsibility. It will not itself perform readiness or liveness probes.
Rather than performing auto-scaling, it is intended to be controlled by an external auto-scaler (as discussed in #492), which would
change its replicas field. We will not add scheduling policies (for example, spreading) to the ReplicationController. Nor should it
verify that the pods controlled match the currently specified template, as that would obstruct auto-sizing and other automated
processes. Similarly, completion deadlines, ordering dependencies, configuration expansion, and other features belong elsewhere.
We even plan to factor out the mechanism for bulk pod creation (#170).
The ReplicationController is intended to be a composable building-block primitive. We expect higher-level APIs and/or tools to be
built on top of it and other complementary primitives for user convenience in the future. The "macro" operations currently
supported by kubectl (run, scale) are proof-of-concept examples of this. For instance, we could imagine something like Asgard
managing ReplicationControllers, auto-scalers, services, scheduling policies, canaries, etc.
API Object
Replication controller is a top-level resource in the Kubernetes REST API. More details about the API object can be found at:
ReplicationController API object.
Alternatives to ReplicationController
ReplicaSet
ReplicaSet is the next-generation ReplicationController that supports the new set-based label selector. It's mainly used by
Deployment as a mechanism to orchestrate pod creation, deletion and updates. Note that we recommend using Deployments
instead of directly using Replica Sets, unless you require custom update orchestration or don't require updates at all.
Deployment (Recommended)
Deployment is a higher-level API object that updates its underlying Replica Sets and their Pods. Deployments are recommended if
you want the rolling update functionality, because they are declarative, server-side, and have additional features.
Bare Pods
Unlike in the case where a user directly created pods, a ReplicationController replaces pods that are deleted or terminated for any
reason, such as in the case of node failure or disruptive node maintenance, such as a kernel upgrade. For this reason, we
recommend that you use a ReplicationController even if your application requires only a single pod. Think of it similarly to a process
supervisor, only it supervises multiple pods across multiple nodes instead of individual processes on a single node. A
ReplicationController delegates local container restarts to some agent on the node, such as the kubelet.
Job
Use a Job instead of a ReplicationController for pods that are expected to terminate on their own (that is, batch jobs).
DaemonSet
Use a DaemonSet instead of a ReplicationController for pods that provide a machine-level function, such as machine monitoring or
machine logging. These pods have a lifetime that is tied to the machine's lifetime: the pod needs to be running on the machine before
other pods start, and it is safe to terminate the pods when the machine is otherwise ready to be rebooted or shut down.
What's next
Learn about Pods.
Learn about Deployment, the replacement for ReplicationController.
ReplicationController is part of the Kubernetes REST API. Read the ReplicationController object definition to understand the
API for replication controllers.