kubelet: fix `/stats/summary` endpoint on Windows when init-containers are present on the node #90554

vboulineau · 2020-04-28T13:33:48Z

What type of PR is this?
/kind bug
/sig windows
/cc @marosset

What this PR does / why we need it:
Currently, calling the /stats/summary API always fails with error 500 on any Windows node that has at least one init-container.

It always fails with the following error:

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem <container_id>: A virtual machine or container with the specified identifier does not exist.

This is due to the changes in #87730: the Kubelet now calls hcsshim directly to gather stats.
However, unlike the docker stats API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (docker_container.go:ListContainers()),
it sets All: true, retrieving non-running containers.

When docker stats is called with the id of such a container, it returns valid JSON
with all values set to 0. The non-running containers are filtered out later in the process.

When hcsshim is called with the id of such a container, it returns an error, effectively
stopping the stats retrieval for all containers and causing a 500.
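
The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not the kubelet code: the `container` type, the `statsFunc` lookup, and `errNotFound` are hypothetical stand-ins that only mimic the described behavior (a per-container lookup that errors for exited containers). It contrasts a loop that aborts on the first error with one that skips non-running containers before querying.

```go
package main

import (
	"errors"
	"fmt"
)

// containerState is a hypothetical stand-in for the runtime's container state.
type containerState int

const (
	stateRunning containerState = iota
	stateExited
)

// container is a hypothetical, simplified view of a listed container.
type container struct {
	ID    string
	State containerState
}

// errNotFound mimics the "does not exist" error reported for exited containers.
var errNotFound = errors.New("container with the specified identifier does not exist")

// statsFunc is a hypothetical per-container stats lookup.
type statsFunc func(id string) (uint64, error)

// listStatsFailFast aborts on the first lookup error, so a single exited
// container makes the whole listing fail.
func listStatsFailFast(cs []container, get statsFunc) (map[string]uint64, error) {
	out := make(map[string]uint64, len(cs))
	for _, c := range cs {
		s, err := get(c.ID)
		if err != nil {
			return nil, fmt.Errorf("failed to list all container stats: %w", err)
		}
		out[c.ID] = s
	}
	return out, nil
}

// listStatsRunningOnly skips containers that are not running before querying.
func listStatsRunningOnly(cs []container, get statsFunc) (map[string]uint64, error) {
	out := make(map[string]uint64, len(cs))
	for _, c := range cs {
		if c.State != stateRunning {
			continue
		}
		s, err := get(c.ID)
		if err != nil {
			return nil, fmt.Errorf("failed to list all container stats: %w", err)
		}
		out[c.ID] = s
	}
	return out, nil
}

// fakeStats returns a lookup that only knows about the given running containers.
func fakeStats(running map[string]uint64) statsFunc {
	return func(id string) (uint64, error) {
		if v, ok := running[id]; ok {
			return v, nil
		}
		return 0, errNotFound
	}
}

func main() {
	cs := []container{
		{ID: "init", State: stateExited},
		{ID: "app", State: stateRunning},
	}
	get := fakeStats(map[string]uint64{"app": 42})

	_, err := listStatsFailFast(cs, get)
	fmt.Println("fail-fast error:", err)

	stats, err := listStatsRunningOnly(cs, get)
	fmt.Println("running only:", stats, err)
}
```

Filtering on state avoids the error for exited init-containers, but a lookup can still fail if a container stops between the listing and the query.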

Which issue(s) this PR fixes:
Issue described above

Special notes for your reviewer:
The current PR explicitly filters non-running containers. However, during my testing of this change, I observed other temporary errors that depend on the state of some containers:

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 87d3dac87d2a33ca6697463db6940f755175ab28783f48d5cf55e0775410d68f encountered an error during Properties: failure in a Windows system call: The requested virtual machine or container operation is not valid in the current state. (0xc0370105)

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 1948ed0a3802e73b1215142f066d1df8f1905cd122a1bfdb087476dd5efa0f51 encountered an error during Properties: failure in a Windows system call: Access is denied. (0x5)

It seems to be linked to containers crashing/force-killed, but I cannot say for sure.

So maybe we need to relax the check even more and:

- Make all errors from hcsshim non-blocking (allowing stats retrieval for the other containers)
- Log all errors other than IsNotExist() or IsAlreadyStopped()

It'd make this call more reliable at the expense of having to look at the logs in case of missing data.
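
As a rough sketch of the best-effort approach proposed above (not the code in this PR), the loop below never aborts on a per-container error and only logs errors that are unexpected. The sentinel errors and the `isExpected` helper are hypothetical stand-ins for checks such as IsNotExist() and IsAlreadyStopped(); the stats lookup is passed in as a plain function so the example runs without Windows or hcsshim.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical sentinel errors standing in for the error classes seen above.
var (
	errNotExist       = errors.New("container does not exist")
	errAlreadyStopped = errors.New("container already stopped")
	errAccessDenied   = errors.New("access is denied")
)

// isExpected reports whether err is one of the benign errors that an exited
// container is expected to produce. It is a hypothetical stand-in for checks
// like IsNotExist() or IsAlreadyStopped().
func isExpected(err error) bool {
	return errors.Is(err, errNotExist) || errors.Is(err, errAlreadyStopped)
}

// listStatsBestEffort returns stats for every container whose lookup
// succeeds. Failures never abort the loop; unexpected ones are logged.
func listStatsBestEffort(ids []string, get func(string) (uint64, error)) map[string]uint64 {
	out := make(map[string]uint64, len(ids))
	for _, id := range ids {
		s, err := get(id)
		if err != nil {
			if !isExpected(err) {
				log.Printf("failed to get stats for container %q: %v", id, err)
			}
			continue
		}
		out[id] = s
	}
	return out
}

// fakeGet simulates one exited init-container, one container failing with an
// unexpected error, and one healthy container.
func fakeGet(id string) (uint64, error) {
	switch id {
	case "init":
		return 0, errNotExist
	case "crashed":
		return 0, errAccessDenied
	default:
		return 42, nil
	}
}

func main() {
	stats := listStatsBestEffort([]string{"init", "crashed", "app"}, fakeGet)
	fmt.Println(stats)
}
```

The trade-off is the one noted in the description: the call keeps answering, but a container missing from the result has to be explained from the logs.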

WDYT?

Does this PR introduce a user-facing change?:

fixed: kubelet metrics no longer return error 500 when init-containers are present on Windows nodes

k8s-ci-robot · 2020-04-28T13:33:56Z

Welcome @vboulineau!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2020-04-28T13:33:57Z

Hi @vboulineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vboulineau · 2020-04-28T13:37:33Z

/assign @derekwaynecarr @PatrickLang

marosset · 2020-04-28T16:25:41Z

/ok-to-test

marosset · 2020-04-28T19:51:23Z

Thanks for the PR @vboulineau!
I'd be in favor of making all hcsshim calls here non-blocking, to make the call itself more reliable (at the cost of possibly not reporting on some containers).

@michmike @benmoss @ddebroy do you all have any input here as well?

marosset · 2020-04-28T20:00:06Z

/retest

benmoss · 2020-04-29T13:55:52Z

This seems reasonable enough. @marosset when you say "non-blocking" I think you mean something like "we should log errors but give a best-effort response covering as many containers as possible", as opposed to something about "async IO", but I'm not completely sure.

marosset · 2020-04-29T16:10:52Z

@benmoss - yes I meant we should make this best effort to give a response

…s are present on the node Following changes in kubernetes#87730, Kubelet is directly calling hcsshim to gather stats. However, unlike the `docker stats` API that was used before, hcsshim does not keep information about exited containers. When the Kubelet lists containers (`docker_container.go:ListContainers()`), it sets `All: true`, retrieving non-running containers. When docker stats is called with the id of such a container, it returns valid JSON with all values set to 0. The non-running containers are filtered out later in the process. When hcsshim is called with the id of such a container, it returns an error, effectively stopping the stats retrieval for all containers.

vboulineau · 2020-05-04T12:53:38Z

Since we all agree, I've modified the PR to implement the best-effort mode. The code now ignores all errors and logs the unexpected ones.

vboulineau · 2020-05-04T14:51:35Z

/retest

michmike · 2020-05-05T00:41:03Z

looks good to me

marosset · 2020-05-05T00:42:09Z

/lgtm

benmoss · 2020-05-05T13:25:24Z

/approve

benmoss · 2020-05-05T13:32:55Z

@yujuhong @Random-Liu would one of you be able to review this?

derekwaynecarr · 2020-05-13T18:54:34Z

kubelet change is lgtm.

/approve
/lgtm

k8s-ci-robot · 2020-05-13T18:55:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, derekwaynecarr, vboulineau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [derekwaynecarr]
~~test/e2e/windows/OWNERS~~ [benmoss]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested a review from marosset April 28, 2020 13:33

k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2020

k8s-ci-robot assigned derekwaynecarr and PatrickLang Apr 28, 2020

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2020

vboulineau force-pushed the vboulineau/fix_win_stats_init_containers branch from 820caa7 to 3bff112 Compare May 4, 2020 12:42

k8s-ci-robot assigned marosset May 5, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2020

k8s-ci-robot merged commit 4339ac3 into kubernetes:master May 13, 2020

k8s-ci-robot added this to the v1.19 milestone May 13, 2020

vboulineau mentioned this pull request May 14, 2020

Automated cherry pick of #90554: kubelet: fix /stats/summary endpoint on Windows when #91100

Merged

k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 14, 2020

This was referenced May 14, 2020

Automated cherry pick of #90554: kubelet: fix /stats/summary endpoint on Windows when #91103

Merged

Automated cherry pick of #90554: kubelet: fix /stats/summary endpoint on Windows when #91104

Merged

github-actions bot mentioned this pull request May 27, 2020

Week Ending May 17, 2020 dev-obs/actus#156

Closed

marosset mentioned this pull request Jun 2, 2020

Metrics server fails to collect metrics for Windows pods/nodes kubernetes-sigs/metrics-server#539

Closed

bragi92 mentioned this pull request Jun 12, 2020

feat: enable windows daemonset in container monitoring addon Azure/aks-engine#3466

Merged

4 tasks

zhiweiv mentioned this pull request Jun 18, 2020

kubelet /stats not working on Windows Azure/AKS#1685

Closed

marosset mentioned this pull request Jun 19, 2020

fix: use hotfix Windows builds for 1.18.4, 1.17.7, and 1.16.11 Azure/aks-engine#3510

Merged

4 tasks

mikkelhegn mentioned this pull request Jun 23, 2020

Common Horizontal Pod Autoscaler events in AKS cluster Azure/AKS#1137

Closed

nbowes24 mentioned this pull request Jun 23, 2020

Metrics not working for Windows Nodes/Pods - Upgrade to 1.6.11 when available SkillsFundingAgency/das-azure-pipelines-agents#50

Open

marosset mentioned this pull request Jun 26, 2020

Metrics server fails to pull metrics from Windows workers #91575

Closed

k8s-ci-robot added a commit that referenced this pull request Jul 7, 2020

Merge pull request #91104 from vboulineau/automated-cherry-pick-of-#9…

c86bbc5

…0554-upstream-release-1.16 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when

k8s-ci-robot added a commit that referenced this pull request Jul 7, 2020

Merge pull request #91103 from vboulineau/automated-cherry-pick-of-#9…

2d72335

…0554-upstream-release-1.17 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when

k8s-ci-robot added a commit that referenced this pull request Jul 8, 2020

Merge pull request #91100 from vboulineau/automated-cherry-pick-of-#9…

12d56a6

…0554-upstream-release-1.18 Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
