-
Notifications
You must be signed in to change notification settings - Fork 41.1k
Report actionable error when GC fails due to disk pressure #132578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Hi @drigz. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
3680aa4
to
f3a0fca
Compare
@haircommander I'm still figuring out how to manually test my kubelet change in a real cluster (just ran unit tests so far) but in the meantime, can you /ok-to-test this? Thanks in advance! |
Looks like this exposes a bug where an error message is used as format string: (
Should be easy to fix by adding I'm also unsure if the
That said, if it's less than 0.1 * disk size it's still a pointer that "too many/big images" is not the cause, whereas if it's greater than 0.3 * disk size it might suggest "too many/big images" is the cause. But not sure there's a good way to suggest that in the error message... |
/ok-to-test Please sign the CLA. Two of your commits have no author on it which may need to be fixed before merge. |
It is code freeze this week (may already be), so this may not get in this cycle. Sorry for the late review. |
Can you squash your commits? |
923f40d
to
516340b
Compare
NP, absolutely no rush on this item.
Done, thanks! |
message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+ | ||
"Failed to free sufficient space by deleting unused images. "+ | ||
"Investigate disk usage, as it could be used by active images, logs, volumes, or other data.", | ||
usagePercent, formatSize(capacity)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the formatting is great for human. Would user ever need an actual bytes value to compare with some metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this question is for me or @kannon92 but I'll take a stab:
IIUC event messages are for humans to read and are normally batched by kubectl, so when the number varies over the different GC rounds every 5 minutes, it'll only show the latest one. But even if the user wants to know one of the following, there are better / more realistic ways for them to find it out, they both would be generated consistently regardless of whether the node is under disk pressure:
- how has my disk usage varied over time?
- just check node-exporter metrics or similar
- how many images become unused each day?
- check image GC text logs in the kubelet logs
I guess from the review comments from you and the other reviewer that this component suffers from Hyrum's Law, so if it's too dangerous to change an event description because someone intentionally keeps the node under disk pressure so they can parse this event and measure disk usage + GC history, then we should probably just close the linked issue as infeasible. I don't want to cause an outage for someone just to make one message more beginner-friendly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly thinking of situation when user will notice a small increase in sizes from one log message to another. It will indicate something else is using memory. But yes, it feels like abusive use of a log message instead of finding the proper metric. This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.
I am open to be convinced that this is not needed, reviewers, please comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hat this component suffers from Hyrum's Law
I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.
Plus it is subjective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is why I think human-friendly value with the original value in brackets would be a good least-breaking alternative.
I've changed to raw bytes for the "freed" message to reduce the delta and because I think "freed 0 bytes" is more readable than "0 B freed". The delta in the latest commit:
- before:
failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
- after:
Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.
ie the changes are:
- replacing "wanted to free 473948979 bytes" with
95.0% of 10.0 GiB used
- IMO this is more readable because it maps to what the user knows from
df -h
and tells them "your disk is almost full!", so I'd lean towards excluding the raw bytes number here but can restore it if you prefer
- IMO this is more readable because it maps to what the user knows from
- replacing
but freed 0 bytes
with(freed 0 bytes)
- not much change in readability as this is normally 0, I'm hoping the parens and removing "but" draw less attention from inexperienced users
LMK if you prefer (95.0% of 10.0 GiB used)
, (95.0% of 10.0 GiB used, wanted to free 473948979 bytes)
, or something else.
I don't think this applies here but I am worried about any changes that are not directly related to making the logs easier to read.
Not sure I understand this. As I understand it, removing 9 digit numbers and the internals of the GC algorithm (wanted to free X, freed Y) are required to make the message more readable and convey the point "your disk is almost full!". I believe these details make the user think the unused images / GC algorithm are to blame, which is not the case (GC worked as intended in this codepath).
But maybe I've misunderstood and you're referring to something other than the replacements above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the full process of review. I think this is more focused now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SergeyKanzhelev Do you need any more info from me to choose between the options listed above? Or is the current (Insufficient free disk space on the node's image filesystem (95.0% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 0 bytes). Consider resizing the disk or deleting unused files.
) fine?
// This usually means the disk is full for reasons other than container | ||
// images, such as logs, volumes, or other files. However, it could also | ||
// be due to an unusually large number or size of in-use container images. | ||
message := fmt.Sprintf("Insufficient free disk space on the node's image filesystem (%d%% of %s used). "+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep the information on how much we believe we freed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added (XXX freed)
to the error as requested - but can I ask why? Under disk pressure the event is generated every five minutes, so this will either be:
- the majority of the time[1]: the node is in steady state, it'll say
(0 B freed)
because the same images are in-use as the last time it ran GC - rarely: it'll either print the total compressed or uncompressed size of the images (ie underestimating the amount of freed space) that became unused since the last GC attempt, so somewhere between 0 and 15% of the disk size
I can't think of a case where this is useful, except perhaps for an experienced operator of a super-high-churn cluster who knows that every five minutes, hundreds of MB of images should become unused, so if they see this event with less than that, they deduce that somehow the churn rate of their cluster has changed.
I hope it doesn't seem petty to argue about this. I'm trying to make the error easier for inexperienced users to understand, and IMO adding (0 B freed)
or but freed 0 bytes
might make them think something is wrong with the GC when it's not.
[1]: At least, every instance I found I Stack Overflow after checking the first 10 results had but freed 0 bytes
or but only found 0 bytes eligible to free
, which sounds very "broken" and I believe this is why people go down the "how can I debug the GC" rabbit hole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amount freed is not super precise. So I agree with your reasoning. However I believe the ability to distinguish the difference of zero or non-zero sizes is important. It distinguish constant churn situation and kubelet "believes" it freed something (while runtime may be "broken") vs. there is nothing to free - every image is in-use.
516340b
to
23adaa9
Compare
/retest Filed #133249 for the flake. |
/milestone v1.35 |
23adaa9
to
0f9d26f
Compare
When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see kubernetes#71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure. I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: kubernetes#120698).
0f9d26f
to
391967c
Compare
/retest |
1 similar comment
/retest |
|
||
// Check that a warning event was sent | ||
expectedEvent := "Warning FreeDiskSpaceFailed Insufficient free disk space on the node's image filesystem" + | ||
" (95% of 10.0 GiB used). Failed to free sufficient space by deleting unused images (freed 524288000 bytes)." + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use imageSize
here instead of its hardcoded value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to keep the assertions as close to the behavior the user will observe as possible, otherwise the test code ends up just duplicating the original implementation (including its bugs). This way it's obvious that large sizes are meant to be printed as long numbers with no commas and if someone changes the formatting directive, they'll have to change the expected result too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
once branch is opened we can take this. Thanks, I hope it will help people to be less confused by this log
LGTM label has been added. Git tree hash: bc8b8e51cde8cf9dcb38561053f989befd10f1c2
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: drigz, SergeyKanzhelev The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you for taking the time to look at this - from this and other PRs I'm starting to get a taste of the complexity of what you have to review, so I appreciate it. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the image GC fails to free enough space due to disk pressure, the user-visible event it generates is misleading (see #71869) - the technical detail leads operators to suspect GC problems. This PR makes the message actionable by focusing on the disk pressure.
I had hoped to include the total disk space consumed by images that cannot be collected, to help operators quickly determine if disk pressure is caused by a large number of active images or by other files on the node, but this is blocked because the CRI API doesn't reveal the total disk use (depending on the runtime it has either the compressed or uncompressed space: #120698).
Which issue(s) this PR is related to:
Fixes #71869
Special notes for your reviewer:
See individual commits for a breakdown into:
Regarding formatSize(): I considered using resource.NewQuantity but it wasn't clear how to round the size appropriately, and it can't generate non-whole numbers like
12.3 G
.Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
None - although I considered collecting the suggested fixes from #71869 (kill processes that are holding deleted files open, delete orphaned volumes, etc) I decided this would easily become crufty with irrelevant steps.