-
Notifications
You must be signed in to change notification settings - Fork 41.3k
kubelet: fix /stats/summary
endpoint on Windows when init-containers are present on the node
#90554
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kubelet: fix /stats/summary
endpoint on Windows when init-containers are present on the node
#90554
Conversation
Welcome @vboulineau! |
Hi @vboulineau. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @derekwaynecarr @PatrickLang |
/ok-to-test |
Thanks for the PR @vboulineau! @michmike @benmoss @ddebroy do you all have any input here as well? |
/retest |
This seems reasonable enough. @marosset when you say "non-blocking" I think you mean something like "we should log errors but give a best-effort response of as many containers as possible" as opposed to something about "async IO" but not completely sure. |
@benmoss - yes I meant we should make this best effort to give a response |
…s are present on the node Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats. However, unlike `docker stats` API that was used before, hcsshim does not keep information about exited containers. When the Kubelet lists containers (`docker_container.go:ListContainers()`), it sets `All: true`, retrieving non-running containers. When docker stats is called with such container id, it'll return a valid JSON with all values set to 0. The non-running containers are filtered later on in the process. When the hcsshim is called with such container id, it'll return an error, effectively stopping the stats retrieval for all containers.
820caa7
to
3bff112
Compare
As we all agree, I've modified the PR to implement the best effort mode. Code now ignores all errors and logs unexpected errors. |
/retest |
looks good to me |
/lgtm |
/approve |
@yujuhong @Random-Liu would one of you be able to review this? |
kubelet change is lgtm. /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: benmoss, derekwaynecarr, vboulineau The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…0554-upstream-release-1.16 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
…0554-upstream-release-1.17 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
…0554-upstream-release-1.18 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
What type of PR is this?
/kind bug
/sig windows
/cc @marosset
What this PR does / why we need it:
Currently, calling the
/stats/summary
API always fail with error 500 on any Windows node with at least one init-container.It will always fail with following error:
It's due to changes in #87730, Kubelet is now directly calling hcsshim to gather stats.
However, unlike
docker stats
API that was used before, hcsshim does notkeep information about exited containers.
When the Kubelet lists containers (
docker_container.go:ListContainers()
),it sets
All: true
, retrieving non-running containers.When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.
When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers and causing a 500.
Which issue(s) this PR fixes:
Issue described above
Special notes for your reviewer:
Current PR explicitly filters non-running containers. However during my testing of this change, I observed other temporary errors depending on some containers state:
It seems to be linked to containers crashing/force-killed, but I cannot say for sure.
So maybe we need to relax the check even more and:
IsNotExist()
orIsAlreadyStopped
)It'd make this call more reliable at the expense of having to look at the logs in case of missing data.
WDYT?
Does this PR introduce a user-facing change?: