stop overriding max concurrency in CI, let automaxprocs handle it #130758


Merged: 1 commit merged into kubernetes:master from the integration-concur branch on Mar 12, 2025

Conversation

BenTheElder
Member

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Ensures we use the auto-detected GOMAXPROCS for integration tests in CI instead of overriding it with a hard-coded 4 CPUs.

Which issue(s) this PR fixes:

Fixes #130755

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the following labels on Mar 12, 2025:
  • kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 12, 2025
@k8s-ci-robot requested review from dims and sttts on March 12, 2025 at 20:46
@@ -36,7 +36,6 @@ export KUBE_COVER="n"
export ARTIFACTS=${ARTIFACTS:-"${WORKSPACE}/artifacts"}
# Save the verbose stdout as well.
export KUBE_KEEP_VERBOSE_TEST_OUTPUT=y
export KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4
BenTheElder (Member Author) commented on the removed line:

hack/make-rules/test-integration.sh already calls kube::golang::setup_gomaxprocs

setting this causes that script to override GOMAXPROCS with KUBE_INTEGRATION_TEST_MAX_CONCURRENCY
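
For illustration only, a rough Go sketch of the precedence described above (the real logic lives in the bash hack/ scripts, so this standalone program is not the actual implementation): an explicit KUBE_INTEGRATION_TEST_MAX_CONCURRENCY wins, otherwise go.uber.org/automaxprocs derives GOMAXPROCS from the container's CPU quota.

```go
// A rough sketch, not the actual bash in the hack/ scripts: an explicit
// KUBE_INTEGRATION_TEST_MAX_CONCURRENCY overrides GOMAXPROCS, otherwise
// go.uber.org/automaxprocs derives it from the container's CPU quota.
package main

import (
	"log"
	"os"
	"runtime"
	"strconv"

	"go.uber.org/automaxprocs/maxprocs"
)

func setupGomaxprocs() {
	if v := os.Getenv("KUBE_INTEGRATION_TEST_MAX_CONCURRENCY"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			log.Fatalf("invalid KUBE_INTEGRATION_TEST_MAX_CONCURRENCY: %q", v)
		}
		// This is the hard-coded override path the CI env var used to force.
		runtime.GOMAXPROCS(n)
		log.Printf("Set GOMAXPROCS to %d from KUBE_INTEGRATION_TEST_MAX_CONCURRENCY", n)
		return
	}
	// automaxprocs matches GOMAXPROCS to the cgroup CPU quota (8 for this job)
	// and leaves an explicitly set GOMAXPROCS env var alone.
	if _, err := maxprocs.Set(maxprocs.Logger(log.Printf)); err != nil {
		log.Printf("failed to set GOMAXPROCS automatically: %v", err)
	}
}

func main() {
	setupGomaxprocs()
	log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
}
```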

@BenTheElder
Member Author

BenTheElder commented Mar 12, 2025

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130758/pull-kubernetes-integration/1899925030251270144

+ make test-integration
go: downloading go.uber.org/automaxprocs v1.6.0
+++ [0312 20:48:28] Set GOMAXPROCS automatically to 8
+++ [0312 20:48:28] Setting parallelism to 8

(this is correct: the job has an 8-core CPU limit; you can inspect the job config)
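
For context, that auto-detected value comes from the container's CPU quota. Below is a minimal sketch of the arithmetic, assuming cgroup v2 (which exposes the limit as "quota period" in cpu.max); automaxprocs does this bookkeeping for us, so the path and parsing here are only illustrative.

```go
// Rough sketch of where the "8" comes from: under cgroup v2, the container's
// CPU limit is exposed as "<quota> <period>" in cpu.max, and quota/period is
// the number of usable CPUs.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func cpuQuota() (float64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // cgroup v2 path; v1 differs
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, fmt.Errorf("no CPU limit set")
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

func main() {
	q, err := cpuQuota()
	if err != nil {
		fmt.Println("no quota:", err)
		return
	}
	fmt.Printf("CPU quota: %.1f cores\n", q) // a cpu.max of "800000 100000" prints 8.0
}
```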

@BenTheElder
Member Author

BenTheElder commented Mar 12, 2025

passed in 35m https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130758/pull-kubernetes-integration/1899925030251270144

(previously >1h)

@BenTheElder
Member Author

/triage accepted
/priority important-longterm
/sig testing

@k8s-ci-robot added the following labels on Mar 12, 2025:
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
and removed:
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
  • do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
@aojea
Member

aojea commented Mar 12, 2025

/lgtm

35m5s

🏩

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Mar 12, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 35ebc910f98b46571558e62bb88260ff130fb095

@k8s-ci-robot merged commit c79d3ce into kubernetes:master on Mar 12, 2025
14 checks passed
@k8s-ci-robot added this to the v1.33 milestone on Mar 12, 2025
@BenTheElder deleted the integration-concur branch on March 13, 2025 at 18:32
@liggitt
Member

liggitt commented Mar 14, 2025

hmm... could this be causing flakes in tests like #130810 (comment) ?

@aojea
Member

aojea commented Mar 14, 2025

This merged at Mar 12, 10:51 PM GMT+1

and flakes started to happen around that time

#130810 (comment)

There is a strong correlation that this is the trigger; I didn't see any changes in the scheduler or integration test code that could cause this flakiness.

@macsko
Member

macsko commented Mar 14, 2025

Isn't the flakiness caused by the fact that we now want to use all the CPUs available on the machine, while of course the tests can't use exactly all of them? Tests actually run faster because we have higher concurrency, but Go assumes it has more CPUs available than it can actually use. This causes test cases to run longer than before, because Go tries to run 8 cases at a time with <8 available CPUs.

@liggitt
Member

liggitt commented Mar 14, 2025

I suspect we are unleashing more tests and bottlenecking on something else, possibly I/O, and hitting timeouts of some kind

@BenTheElder
Member Author

Isn't the flakiness caused by the fact that we now want to use all the CPUs available on the machine, while of course the tests can't use exactly all of them? Tests actually run faster because we have higher concurrency, but Go assumes it has more CPUs available than it can actually use. This causes test cases to run longer than before, because Go tries to run 8 cases at a time with <8 available CPUs.

Er, we have 8 available CPUs with guaranteed QoS (request == limit). This is accurately detected and set as GOMAXPROCS.

Go doesn't assume we have more CPUs; it will limit the number of running user threads to GOMAXPROCS, excluding threads blocked on I/O.
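
To make that point concrete, a small standalone demo (unrelated to the Kubernetes test code): GOMAXPROCS caps how many goroutines execute Go code simultaneously, so the same fixed amount of CPU-bound work takes roughly 4x longer at GOMAXPROCS=2 than at GOMAXPROCS=8, given at least 8 genuinely usable cores.

```go
// Illustrative only: GOMAXPROCS caps how many goroutines execute Go code at
// once, so a fixed amount of CPU-bound work spread over 8 workers takes about
// 4x longer at GOMAXPROCS=2 than at GOMAXPROCS=8 (given >=8 usable cores).
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

var sink uint64 // keeps the busy work observable

// work burns a fixed amount of CPU, independent of wall-clock time.
func work(iters int) uint64 {
	var x uint64
	for i := 0; i < iters; i++ {
		x += uint64(i) * 2654435761
	}
	return x
}

func run(procs, workers, iters int) time.Duration {
	runtime.GOMAXPROCS(procs)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddUint64(&sink, work(iters))
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const workers, iters = 8, 200_000_000
	for _, p := range []int{2, 8} {
		fmt.Printf("GOMAXPROCS=%d, %d CPU-bound workers: %v\n", p, workers, run(p, workers, iters))
	}
}
```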

I suspect we are unleashing more tests and bottlenecking on something else, possibly I/O, and hitting timeouts of some kind

Yes, also these are running in EKS where the machines have > 8 cores currently, so there may be other jobs on the node.

There was discussion in SIG K8s Infra about updating the machine types, so we haven't wanted to depend heavily on >8 cores (the GKE cluster uses 8-core machines, which means <8 cores are schedulable; the non-schedulable portion is reserved, so on those machines you will see a job using 7 to 7.2 cores).

By not guaranteeing the whole node, it's possible something else is stomping on I/O.
Alternatively, we're just running more concurrently and stomping on I/O ourselves.

I don't know how to monitor I/O on the EKS clusters ... I actually don't even have direct access setup personally, and the monitoring dashboard at https://monitoring-eks.prow.k8s.io/?orgId=1 shows CPU + memory but not I/O.

For jobs on the GKE cluster, I can see https://monitoring-gke.prow.k8s.io/?orgId=1, and I can also pull up the GCE graphs for the VM (if it still exists) underpinning the node the job ran on and correlate the timing.

@BenTheElder
Member Author

cc @xmudrii @dims

@BenTheElder
Member Author

See also kubernetes/test-infra#34139; while it isn't written down neatly yet, there are pointers there with more details about the environment we're running in.

If we can't get flakiness down, we can resume under-subscribing these for now and revisit in the future.

We can also consider switching this job over to the GKE cluster and reserving the whole node, which we know is in pretty good shape.

@aojea
Member

aojea commented Mar 14, 2025

Since we are this close to code freeze, it's better to go back to the previous state than to start making many changes now ... we can work on this more calmly after code freeze.

@BenTheElder
Member Author

ACK, PR coming ...

@BenTheElder
Member Author

#130821

@ylink-lfs
Contributor

ylink-lfs commented Jul 11, 2025

By not guaranteeing the whole node, it's possible something else is stomping on I/O. Alternatively, we're just running more concurrently and stomping on I/O ourselves.

I don't know how to monitor I/O on the EKS clusters ... I actually don't even have direct access setup personally, and the monitoring dashboard at https://monitoring-eks.prow.k8s.io/?orgId=1 shows CPU + memory but not I/O.

I've done some profiling on the scheduler-perf integration test and found evidence that integration tests with a maximized GOMAXPROCS are stomping on I/O. The graph below, picked from several pull-kubernetes-integration Prow jobs running on EKS, shows how external I/O pressure affects integration performance. I am also proposing some ideas for removing the KUBE_INTEGRATION_TEST_MAX_CONCURRENCY variable without introducing excessive flakiness.

  • Add a volume specification inside the pull-kubernetes-integration Prow YAML file so that we can have IOPS guaranteed by the cloud provider, avoiding I/O interruptions.
  • Add a runtime.GOMAXPROCS(4) call to set perf-related integration tests to a lower value, and use the auto-detected GOMAXPROCS for other integration tests (a rough sketch follows below).

The first approach definitely requires a longer-term discussion, so I suggest we implement the second approach first. WDYT? @BenTheElder
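
As a rough sketch of that second idea (illustrative only; the package name and the cap constant below are placeholders, not the current scheduler-perf code), a perf-heavy integration test package could lower GOMAXPROCS from its own TestMain while every other package keeps the auto-detected value:

```go
// main_test.go in a perf-heavy integration test package (illustrative sketch).
package benchmark

import (
	"os"
	"runtime"
	"testing"
)

const maxPerfProcs = 4 // the lower cap proposed above

func TestMain(m *testing.M) {
	if runtime.GOMAXPROCS(0) > maxPerfProcs {
		// Only lower the value; never raise it above what CI detected.
		runtime.GOMAXPROCS(maxPerfProcs)
	}
	os.Exit(m.Run())
}
```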

The benchmark used the Linux top utility to monitor I/O wait, starting when the scheduler-perf integration test begins and ending when it finishes. The first, failed integration run had significantly more I/O pressure (max=50) than the second, successful one.
[Screenshots: top I/O-wait traces sampled at 1-second intervals, GOMAXPROCS_8_fail_1s1top and GOMAXPROCS_8_success_1s1top]
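
For anyone who wants to reproduce this kind of measurement without top, here is a minimal standalone sketch (an illustration, not what the benchmark above used) that derives the same I/O-wait percentage from the aggregate cpu line of /proc/stat on Linux, sampling once per second:

```go
// iowaitmon samples /proc/stat and prints the share of CPU time spent in
// iowait between samples. Linux-only; field layout per proc(5).
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPU returns the counters from the aggregate "cpu" line of /proc/stat.
func readCPU() ([]uint64, error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return nil, err
	}
	line := strings.SplitN(string(data), "\n", 2)[0] // first line is the aggregate "cpu"
	fields := strings.Fields(line)[1:]               // drop the "cpu" label
	if len(fields) < 5 {
		return nil, fmt.Errorf("unexpected /proc/stat format: %q", line)
	}
	vals := make([]uint64, len(fields))
	for i, f := range fields {
		if vals[i], err = strconv.ParseUint(f, 10, 64); err != nil {
			return nil, err
		}
	}
	return vals, nil
}

func sum(v []uint64) uint64 {
	var s uint64
	for _, x := range v {
		s += x
	}
	return s
}

func main() {
	prev, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for {
		time.Sleep(1 * time.Second)
		cur, err := readCPU()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Field 4 (after dropping the "cpu" label) is iowait jiffies.
		dIowait := cur[4] - prev[4]
		dTotal := sum(cur) - sum(prev)
		if dTotal > 0 {
			fmt.Printf("iowait: %.1f%%\n", 100*float64(dIowait)/float64(dTotal))
		}
		prev = cur
	}
}
```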

Successfully merging this pull request may close these issues.

don't use hardcoded maxprocs for integration tests
6 participants