Skip to content

Conversation

vboulineau
Copy link
Contributor

@vboulineau vboulineau commented Apr 28, 2020

What type of PR is this?
/kind bug
/sig windows
/cc @marosset

What this PR does / why we need it:
Currently, calling the /stats/summary API always fail with error 500 on any Windows node with at least one init-container.

It will always fail with following error:

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = hcsshim::OpenComputeSystem <container_id>: A virtual machine or container with the specified identifier does not exist.

It's due to changes in #87730, Kubelet is now directly calling hcsshim to gather stats.
However, unlike docker stats API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (docker_container.go:ListContainers()),
it sets All: true, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers and causing a 500.

Which issue(s) this PR fixes:
Issue described above

Special notes for your reviewer:
Current PR explicitly filters non-running containers. However during my testing of this change, I observed other temporary errors depending on some containers state:

Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 87d3dac87d2a33ca6697463db6940f755175ab28783f48d5cf55e0775410d68f encountered an error during Properties: failure in a Windows system call: The requested virtual machine or container operation is not valid in the curren
t state. (0xc0370105)
Internal Error: failed to list pod stats: failed to list all container stats: rpc error: code = Unknown desc = container 1948ed0a3802e73b1215142f066d1df8f1905cd122a1bfdb087476dd5efa0f51 encountered an error during Properties: failure in a Windows system call: Access is denied. (0x5)

It seems to be linked to containers crashing/force-killed, but I cannot say for sure.

So maybe we need to relax the check even more and:

  • Make all errors from hcsshim non-blocking (allowing retrieving stats for other containers)
  • Logging all errors which are not (IsNotExist() or IsAlreadyStopped)

It'd make this call more reliable at the expense of having to look at the logs in case of missing data.

WDYT?

Does this PR introduce a user-facing change?:

fixed: kubelet metrics no longer return error 500 when init-containers are present on Windows nodes

@k8s-ci-robot k8s-ci-robot requested a review from marosset April 28, 2020 13:33
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. sig/windows Categorizes an issue or PR as relevant to SIG Windows. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 28, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @vboulineau!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @vboulineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/kubelet area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Apr 28, 2020
@vboulineau
Copy link
Contributor Author

/assign @derekwaynecarr @PatrickLang

@marosset
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 28, 2020
@marosset
Copy link
Contributor

Thanks for the PR @vboulineau!
I'd be in favor of making all hcsshim calls here be non-blocking so to make the call itself more reliable (but possibly not report on some containers).

@michmike @benmoss @ddebroy do you all have any input here as well?

@marosset
Copy link
Contributor

/retest

@benmoss
Copy link
Member

benmoss commented Apr 29, 2020

This seems reasonable enough. @marosset when you say "non-blocking" I think you mean something like "we should log errors but give a best-effort response of as many containers as possible" as opposed to something about "async IO" but not completely sure.

@marosset
Copy link
Contributor

@benmoss - yes I meant we should make this best effort to give a response

…s are present on the node

Following changes in kubernetes#87730, Kubelet is directly hcsshim to gather stats.
However, unlike `docker stats` API that was used before, hcsshim does not
keep information about exited containers.

When the Kubelet lists containers (`docker_container.go:ListContainers()`),
it sets `All: true`, retrieving non-running containers.

When docker stats is called with such container id, it'll return a valid JSON
with all values set to 0. The non-running containers are filtered later on in the process.

When the hcsshim is called with such container id, it'll return an error, effectively
stopping the stats retrieval for all containers.
@vboulineau vboulineau force-pushed the vboulineau/fix_win_stats_init_containers branch from 820caa7 to 3bff112 Compare May 4, 2020 12:42
@vboulineau
Copy link
Contributor Author

As we all agree, I've modified the PR to implement the best effort mode. Code now ignores all errors and logs unexpected errors.

@vboulineau
Copy link
Contributor Author

/retest

@michmike
Copy link
Contributor

michmike commented May 5, 2020

looks good to me

@marosset
Copy link
Contributor

marosset commented May 5, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 5, 2020
@benmoss
Copy link
Member

benmoss commented May 5, 2020

/approve

@benmoss
Copy link
Member

benmoss commented May 5, 2020

@yujuhong @Random-Liu would one of you be able to review this?

@derekwaynecarr
Copy link
Member

kubelet change is lgtm.

/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benmoss, derekwaynecarr, vboulineau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4339ac3 into kubernetes:master May 13, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 13, 2020
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels May 14, 2020
k8s-ci-robot added a commit that referenced this pull request Jul 7, 2020
…0554-upstream-release-1.16

Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
k8s-ci-robot added a commit that referenced this pull request Jul 7, 2020
…0554-upstream-release-1.17

Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
k8s-ci-robot added a commit that referenced this pull request Jul 8, 2020
…0554-upstream-release-1.18

Automated cherry pick of #90554: kubelet: fix `/stats/summary` endpoint on Windows when
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/windows Categorizes an issue or PR as relevant to SIG Windows. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants