Report actionable error when GC fails due to disk pressure #132578

drigz · 2025-06-27T14:11:01Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.

I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: #120698).

Which issue(s) this PR is related to:

Fixes #71869

Special notes for your reviewer:

See individual commits for a breakdown into:

add test for user-visible message
rephrase message to focus on disk usage (saves user from running df -h)

Regarding formatSize(): I considered using resource.NewQuantity but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like 12.3 G.

Does this PR introduce a user-facing change?

When image garbage collection is unable to free enough disk space, the FreeDiskSpaceFailed warning event is now more actionable. Example: `Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images. Consider resizing the disk or deleting unused files.`

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc) I decided this would easily become crufty with irrelevant steps.

linux-foundation-easycla · 2025-06-27T14:11:06Z

The committers listed above are authorized under a signed CLA.

✅ login: drigz / name: Rodrigo Queiro (391967c)

k8s-ci-robot · 2025-06-27T14:11:11Z

Hi @drigz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

drigz · 2025-06-27T14:14:26Z

@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far) but in the meantime, can you /ok-to-test this? Thanks in advance!

drigz · 2025-06-28T15:23:52Z

Looks like this exposes a bug where an error message is used as format string: (%!o(MISSING))

Warning  FreeDiskSpaceFailed      24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5% of 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.
  Warning  ImageGCFailed            24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5%!o(MISSING)f 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.

Should be easy to fix by adding "%s", but is it WAI that both FreeDiskSpaceFailed and ImageGCFailed are generated in this case?

I'm also unsure if the XX.X GB used for active images is adding value here since for containerd it's only the compressed size (containerd/containerd#9261) and for other runtimes it's uncompressed size, but in both cases it's an underestimate, and the CRI API doesn't let us report the true size:

root@kind-control-plane:/# du -sh /var/lib/containerd/* | sort -h
[snip: small directories]
1.6M    /var/lib/containerd/io.containerd.metadata.v1.bolt
421M    /var/lib/containerd/io.containerd.content.v1.content
702M    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But not sure there's a good way to suggest that in the error message...

pkg/kubelet/kubelet.go

kannon92 · 2025-06-28T15:43:52Z

/ok-to-test

Please sign the CLA.

Two of your commits have no author on it which may need to be fixed before merge.

kannon92 · 2025-07-24T08:49:59Z

It is code freeze this week (may already be), so this may not get in this cycle.

Sorry for the late review.

kannon92 · 2025-07-24T08:50:15Z

Can you squash your commits?

drigz · 2025-07-24T08:52:59Z

It is code freeze this week (may already be), so this may not get in this cycle.

NP, absolutely no rush on this item.

Can you squash your commits?

Done, thanks!

SergeyKanzhelev · 2025-07-24T17:52:57Z

pkg/kubelet/images/image_gc_manager.go

+			message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+
+				"Failed to free sufficient space by deleting unused images. "+
+				"Investigate disk usage, as it could be used by active images, logs, volumes, or other data.",
+				usagePercent, formatSize(capacity))


the formatting is great for human. Would user ever need an actual bytes value to compare with some metrics?

Not sure if this question is for me or @kannon92 but I'll take a stab:

IIUC event messages are for humans to read and are normally batched by kubectl, so when the number varies over the different GC rounds every 5 minutes, it'll only show the latest one. But even if the user wants to know one of the following, there are better / more realistic ways for them to find it out, they both would be generated consistently regardless of whether the node is under disk pressure:

how has my disk usage varied over time?

just check node-exporter metrics or similar

how many images become unused each day?

check image GC text logs in the kubelet logs

I guess from the review comments from you and the other reviewer that this component suffers from Hyrum's Law, so if it's too dangerous to change an event description because someone intentionally keeps the node under disk pressure so they can parse this event and measure disk usage + GC history, then we should probably just close the linked issue as infeasible. I don't want to cause an outage for someone just to make one message more beginner-friendly...

I was mostly thinking of situation when user will notice a small increase in sizes from one log message to another. It will indicate something else is using memory. But yes, it feels like abusive use of a log message instead of finding the proper metric. This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.

I am open to be convinced that this is not needed, reviewers, please comment

hat this component suffers from Hyrum's Law

I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.

Plus it is subjective.

This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.

I've changed to raw bytes for the "freed" message to reduce the delta and because I think "freed 0 bytes" is more readable than "0 B freed". The delta in the latest commit:

before: failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes

after: Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.

ie the changes are:

replacing "wanted to free 473948979 bytes" with 95.0% of 10.0 GiB used

IMO this is more readable because it maps to what the user knows from df -h and tells them "your disk is almost full!", so I'd lean towards excluding the raw bytes number here but can restore it if you prefer

replacing but freed 0 bytes with (freed 0 bytes)

not much change in readability as this is normally 0, I'm hoping the parens and removing "but" draw less attention from inexperienced users

LMK if you prefer (95.0% of 10.0 GiB used), (95.0% of 10.0 GiB used, wanted to free 473948979 bytes), or something else.

I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.

Not sure I understand this. As I understand it, removing 9 digit numbers and the internals of the GC algorithm (wanted to free X, freed Y) are required to make the message more readable and convey the point "your disk is almost full!". I believe these details make the user think the unused images / GC algorithm are to blame, which is not the case (GC worked as intended in this codepath).

But maybe I've misunderstood and you're referring to something other than the replacements above?

I mean the full process of review. I think this is more focused now.

@SergeyKanzhelev Do you need any more info from me to choose between the options listed above? Or is the current (Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.) fine?

SergeyKanzhelev · 2025-07-24T17:53:20Z

pkg/kubelet/images/image_gc_manager.go

+			// This usually means the disk is full for reasons other than container
+			// images, such as logs, volumes, or other files. However, it could also
+			// be due to an unusually large number or size of in-use container images.
+			message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+


let's keep the information on how much we believe we freed

I have added (XXX freed) to the error as requested - but can I ask why? Under disk pressure the event is generated every five minutes, so this will either be:

the majority of the time[1]: the node is in steady state, it'll say (0 B freed) because the same images are in-use as the last time it ran GC

rarely: it'll either print the total compressed or uncompressed size of the images (ie underestimating the amount of freed space) that became unused since the last GC attempt, so somewhere between 0 and 15% of the disk size

I can't think of a case where this is useful, except perhaps for an experienced operator of a super-high-churn cluster who knows that every five minutes, hundreds of MB of images should become unused, so if they see this event with less than that, they deduce that somehow the churn rate of their cluster has changed.

I hope it doesn't seem petty to argue about this. I'm trying to make the error easier for inexperienced users to understand, and IMO adding (0 B freed) or but freed 0 bytes might make them think something is wrong with the GC when it's not.

[1]: At least, every instance I found I Stack Overflow after checking the first 10 results had but freed 0 bytes or but only found 0 bytes eligible to free, which sounds very "broken" and I believe this is why people go down the "how can I debug the GC" rabbit hole

Amount freed is not super precise. So I agree with your reasoning. However I believe the ability to distinguish the difference of zero or non-zero sizes is important. It distinguish constant churn situation and kubelet "believes" it freed something (while runtime may be "broken") vs. there is nothing to free - every image is in-use.

drigz · 2025-07-28T11:46:51Z

/retest

Filed #133249 for the flake.

kannon92 · 2025-07-28T20:47:44Z

/milestone v1.35

When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see kubernetes#71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure. I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: kubernetes#120698).

drigz · 2025-07-29T09:06:16Z

/retest

drigz · 2025-07-29T09:15:02Z

/retest

bart0sh · 2025-07-29T12:01:36Z

pkg/kubelet/images/image_gc_manager_test.go

+
+	// Check that a warning event was sent
+	expectedEvent := "Warning FreeDiskSpaceFailed Insufficient free disk space on the node's image filesystem" +
+		" (95% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 524288000 bytes)." +


Would it make sense to use imageSize here instead of its hardcoded value?

I prefer to keep the assertions as close to the behavior the user will observe as possible, otherwise the test code ends up just duplicating the original implementation (including its bugs). This way it's obvious that large sizes are meant to be printed as long numbers with no commas and if someone changes the formatting directive, they'll have to change the expected result too.

SergeyKanzhelev

/lgtm
/approve

once branch is opened we can take this. Thanks, I hope it will help people to be less confused by this log

k8s-ci-robot · 2025-08-07T17:27:24Z

LGTM label has been added.

Git tree hash: bc8b8e51cde8cf9dcb38561053f989befd10f1c2

k8s-ci-robot · 2025-08-07T17:27:34Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: drigz, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [SergeyKanzhelev]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

drigz · 2025-08-08T05:25:03Z

once branch is opened we can take this. Thanks, I hope it will help people to be less confused by this log

Thank you for taking the time to look at this - from this and other PRs I'm starting to get a taste of the complexity of what you have to review, so I appreciate it.

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 27, 2025

k8s-ci-robot requested review from kannon92 and matthyx June 27, 2025 14:11

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 27, 2025

github-project-automation bot added this to SIG Node: code and documentation PRs Jun 27, 2025

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Jun 27, 2025

drigz force-pushed the actionable-gc-error branch from 3680aa4 to f3a0fca Compare June 27, 2025 14:12

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2025

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 28, 2025

kannon92 reviewed Jun 28, 2025

View reviewed changes

pkg/kubelet/kubelet.go Outdated Show resolved Hide resolved

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2025

k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 28, 2025

k8s-ci-robot removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 24, 2025

drigz force-pushed the actionable-gc-error branch from 923f40d to 516340b Compare July 24, 2025 08:51

SergeyKanzhelev reviewed Jul 24, 2025

View reviewed changes

drigz force-pushed the actionable-gc-error branch from 516340b to 23adaa9 Compare July 27, 2025 19:20

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2025

k8s-ci-robot requested review from dchen1107, kannon92, mrunalp and SergeyKanzhelev July 27, 2025 19:20

drigz mentioned this pull request Jul 28, 2025

TestStructuredAuthenticationConfigReload: egress exit error: http: Server closed #133249

Closed

k8s-ci-robot added this to the v1.35 milestone Jul 28, 2025

drigz force-pushed the actionable-gc-error branch from 23adaa9 to 0f9d26f Compare July 29, 2025 08:14

drigz force-pushed the actionable-gc-error branch from 0f9d26f to 391967c Compare July 29, 2025 08:26

bart0sh reviewed Jul 29, 2025

View reviewed changes

SergeyKanzhelev approved these changes Aug 7, 2025

View reviewed changes

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2025

Report actionable error when GC fails due to disk pressure #132578

Are you sure you want to change the base?

Report actionable error when GC fails due to disk pressure #132578

Uh oh!

Conversation

drigz commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

linux-foundation-easycla bot commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jun 27, 2025

Uh oh!

drigz commented Jun 27, 2025

Uh oh!

drigz commented Jun 28, 2025

Uh oh!

Uh oh!

kannon92 commented Jun 28, 2025

Uh oh!

kannon92 commented Jul 24, 2025

Uh oh!

kannon92 commented Jul 24, 2025

Uh oh!

drigz commented Jul 24, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

drigz commented Jul 28, 2025

Uh oh!

kannon92 commented Jul 28, 2025

Uh oh!

drigz commented Jul 29, 2025

Uh oh!

drigz commented Jul 29, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SergeyKanzhelev left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Aug 7, 2025

Uh oh!

k8s-ci-robot commented Aug 7, 2025

Uh oh!

drigz commented Aug 8, 2025

Uh oh!

Uh oh!

drigz commented Jun 27, 2025 •

edited

Loading

linux-foundation-easycla bot commented Jun 27, 2025 •

edited

Loading