Skip to content

Report actionable error when GC fails due to disk pressure #132578

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

drigz
Copy link
Contributor

@drigz drigz commented Jun 27, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.

I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: #120698).

Which issue(s) this PR is related to:

Fixes #71869

Special notes for your reviewer:

See individual commits for a breakdown into:

  1. add test for user-visible message
  2. rephrase message to focus on disk usage (saves user from running df -h)

Regarding formatSize(): I considered using resource.NewQuantity but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like 12.3 G.

Does this PR introduce a user-facing change?

When image garbage collection is unable to free enough disk space, the FreeDiskSpaceFailed warning event is now more actionable. Example: `Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images. Consider resizing the disk or deleting unused files.`

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc) I decided this would easily become crufty with irrelevant steps.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Jun 27, 2025
Copy link

linux-foundation-easycla bot commented Jun 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: drigz / name: Rodrigo Queiro (391967c)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @drigz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 27, 2025
@k8s-ci-robot k8s-ci-robot requested review from kannon92 and matthyx June 27, 2025 14:11
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 27, 2025
@drigz drigz force-pushed the actionable-gc-error branch from 3680aa4 to f3a0fca Compare June 27, 2025 14:12
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2025
@drigz
Copy link
Contributor Author

drigz commented Jun 27, 2025

@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far) but in the meantime, can you /ok-to-test this? Thanks in advance!

@drigz
Copy link
Contributor Author

drigz commented Jun 28, 2025

Looks like this exposes a bug where an error message is used as format string: (%!o(MISSING))

Warning  FreeDiskSpaceFailed      24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5% of 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.
  Warning  ImageGCFailed            24s    kubelet          Insufficient free disk space on the node's image filesystem (53.5%!o(MISSING)f 976.2 GiB used). Failed to free sufficient space by deleting unused images (429.0 MiB used for active images). Consider resizing the disk or deleting unused files.

Should be easy to fix by adding "%s", but is it WAI that both FreeDiskSpaceFailed and ImageGCFailed are generated in this case?

I'm also unsure if the XX.X GB used for active images is adding value here since for containerd it's only the compressed size (containerd/containerd#9261) and for other runtimes it's uncompressed size, but in both cases it's an underestimate, and the CRI API doesn't let us report the true size:

root@kind-control-plane:/# du -sh /var/lib/containerd/* | sort -h
[snip: small directories]
1.6M    /var/lib/containerd/io.containerd.metadata.v1.bolt
421M    /var/lib/containerd/io.containerd.content.v1.content
702M    /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs

That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But not sure there's a good way to suggest that in the error message...

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 28, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 28, 2025
@kannon92
Copy link
Contributor

/ok-to-test

Please sign the CLA.

Two of your commits have no author on it which may need to be fixed before merge.

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jun 28, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 24, 2025
@kannon92
Copy link
Contributor

It is code freeze this week (may already be), so this may not get in this cycle.

Sorry for the late review.

@kannon92
Copy link
Contributor

Can you squash your commits?

@drigz drigz force-pushed the actionable-gc-error branch from 923f40d to 516340b Compare July 24, 2025 08:51
@drigz
Copy link
Contributor Author

drigz commented Jul 24, 2025

It is code freeze this week (may already be), so this may not get in this cycle.

NP, absolutely no rush on this item.

Can you squash your commits?

Done, thanks!

message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+
"Failed to free sufficient space by deleting unused images. "+
"Investigate disk usage, as it could be used by active images, logs, volumes, or other data.",
usagePercent, formatSize(capacity))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the formatting is great for human. Would user ever need an actual bytes value to compare with some metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this question is for me or @kannon92 but I'll take a stab:

IIUC event messages are for humans to read and are normally batched by kubectl, so when the number varies over the different GC rounds every 5 minutes, it'll only show the latest one. But even if the user wants to know one of the following, there are better / more realistic ways for them to find it out, they both would be generated consistently regardless of whether the node is under disk pressure:

  • how has my disk usage varied over time?
    • just check node-exporter metrics or similar
  • how many images become unused each day?
    • check image GC text logs in the kubelet logs

I guess from the review comments from you and the other reviewer that this component suffers from Hyrum's Law, so if it's too dangerous to change an event description because someone intentionally keeps the node under disk pressure so they can parse this event and measure disk usage + GC history, then we should probably just close the linked issue as infeasible. I don't want to cause an outage for someone just to make one message more beginner-friendly...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mostly thinking of situation when user will notice a small increase in sizes from one log message to another. It will indicate something else is using memory. But yes, it feels like abusive use of a log message instead of finding the proper metric. This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.

I am open to be convinced that this is not needed, reviewers, please comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hat this component suffers from Hyrum's Law

I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.

Plus it is subjective.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.

I've changed to raw bytes for the "freed" message to reduce the delta and because I think "freed 0 bytes" is more readable than "0 B freed". The delta in the latest commit:

  • before: failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
  • after: Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.

ie the changes are:

  • replacing "wanted to free 473948979 bytes" with 95.0% of 10.0 GiB used
    • IMO this is more readable because it maps to what the user knows from df -h and tells them "your disk is almost full!", so I'd lean towards excluding the raw bytes number here but can restore it if you prefer
  • replacing but freed 0 bytes with (freed 0 bytes)
    • not much change in readability as this is normally 0, I'm hoping the parens and removing "but" draw less attention from inexperienced users

LMK if you prefer (95.0% of 10.0 GiB used), (95.0% of 10.0 GiB used, wanted to free 473948979 bytes), or something else.

I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.

Not sure I understand this. As I understand it, removing 9 digit numbers and the internals of the GC algorithm (wanted to free X, freed Y) are required to make the message more readable and convey the point "your disk is almost full!". I believe these details make the user think the unused images / GC algorithm are to blame, which is not the case (GC worked as intended in this codepath).

But maybe I've misunderstood and you're referring to something other than the replacements above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the full process of review. I think this is more focused now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyKanzhelev Do you need any more info from me to choose between the options listed above? Or is the current (Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.) fine?

// This usually means the disk is full for reasons other than container
// images, such as logs, volumes, or other files. However, it could also
// be due to an unusually large number or size of in-use container images.
message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep the information on how much we believe we freed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added (XXX freed) to the error as requested - but can I ask why? Under disk pressure the event is generated every five minutes, so this will either be:

  • the majority of the time[1]: the node is in steady state, it'll say (0 B freed) because the same images are in-use as the last time it ran GC
  • rarely: it'll either print the total compressed or uncompressed size of the images (ie underestimating the amount of freed space) that became unused since the last GC attempt, so somewhere between 0 and 15% of the disk size

I can't think of a case where this is useful, except perhaps for an experienced operator of a super-high-churn cluster who knows that every five minutes, hundreds of MB of images should become unused, so if they see this event with less than that, they deduce that somehow the churn rate of their cluster has changed.

I hope it doesn't seem petty to argue about this. I'm trying to make the error easier for inexperienced users to understand, and IMO adding (0 B freed) or but freed 0 bytes might make them think something is wrong with the GC when it's not.

[1]: At least, every instance I found I Stack Overflow after checking the first 10 results had but freed 0 bytes or but only found 0 bytes eligible to free, which sounds very "broken" and I believe this is why people go down the "how can I debug the GC" rabbit hole

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amount freed is not super precise. So I agree with your reasoning. However I believe the ability to distinguish the difference of zero or non-zero sizes is important. It distinguish constant churn situation and kubelet "believes" it freed something (while runtime may be "broken") vs. there is nothing to free - every image is in-use.

@drigz drigz force-pushed the actionable-gc-error branch from 516340b to 23adaa9 Compare July 27, 2025 19:20
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 27, 2025
@drigz
Copy link
Contributor Author

drigz commented Jul 28, 2025

/retest

Filed #133249 for the flake.

@kannon92
Copy link
Contributor

/milestone v1.35

@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Jul 28, 2025
@drigz drigz force-pushed the actionable-gc-error branch from 23adaa9 to 0f9d26f Compare July 29, 2025 08:14
When the image GC fails to free enough space due to disk pressure, the
user-visible event it generates is misleading (see kubernetes#71869) - the
technical detail leads operators to suspect GC problems. This PR makes
the message actionable by focusing on the disk pressure.

I had hoped to include the total disk space consumed by images that
cannot be collected, to help operators quickly determine if disk
pressure is caused by a large number of active images or by other files
on the node, but this is blocked because the CRI API doesn't reveal the
total disk use (depending on the runtime it has either the compressed or
uncompressed space: kubernetes#120698).
@drigz drigz force-pushed the actionable-gc-error branch from 0f9d26f to 391967c Compare July 29, 2025 08:26
@drigz
Copy link
Contributor Author

drigz commented Jul 29, 2025

/retest

1 similar comment
@drigz
Copy link
Contributor Author

drigz commented Jul 29, 2025

/retest


// Check that a warning event was sent
expectedEvent := "Warning FreeDiskSpaceFailed Insufficient free disk space on the node's image filesystem" +
" (95% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 524288000 bytes)." +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use imageSize here instead of its hardcoded value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to keep the assertions as close to the behavior the user will observe as possible, otherwise the test code ends up just duplicating the original implementation (including its bugs). This way it's obvious that large sizes are meant to be printed as long numbers with no commas and if someone changes the formatting directive, they'll have to change the expected result too.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

once branch is opened we can take this. Thanks, I hope it will help people to be less confused by this log

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 7, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: bc8b8e51cde8cf9dcb38561053f989befd10f1c2

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: drigz, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 7, 2025
@drigz
Copy link
Contributor Author

drigz commented Aug 8, 2025

once branch is opened we can take this. Thanks, I hope it will help people to be less confused by this log

Thank you for taking the time to look at this - from this and other PRs I'm starting to get a taste of the complexity of what you have to review, so I appreciate it.

@bart0sh bart0sh moved this from Triage to Needs Approver in SIG Node: code and documentation PRs Aug 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Development

Successfully merging this pull request may close these issues.

failed to garbage collect required amount of images. Wanted to free 473842483 bytes, but freed 0 bytes
7 participants