stop overriding max concurrency in CI, let automaxprocs handle it #130758


Merged: 1 commit merged into kubernetes:master from the integration-concur branch on Mar 12, 2025

Conversation

BenTheElder
Member

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Ensures we use the auto-detected GOMAXPROCS for integration tests in CI instead of overriding it with a hard-coded 4 CPUs.

Which issue(s) this PR fixes:

Fixes #130755

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the following labels on Mar 12, 2025:
  • kind/cleanup (Categorizes issue or PR as related to cleaning up code, process, or technical debt.)
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.)
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 12, 2025
@k8s-ci-robot requested review from dims and sttts on March 12, 2025 at 20:46
@@ -36,7 +36,6 @@ export KUBE_COVER="n"
export ARTIFACTS=${ARTIFACTS:-"${WORKSPACE}/artifacts"}
# Save the verbose stdout as well.
export KUBE_KEEP_VERBOSE_TEST_OUTPUT=y
export KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4
BenTheElder (Member Author) commented on the removed line:

hack/make-rules/test-integration.sh already calls kube::golang::setup_gomaxprocs

setting this causes that script to override GOMAXPROCS with KUBE_INTEGRATION_TEST_MAX_CONCURRENCY
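
For illustration only, a rough Go sketch of the precedence described above (the real logic lives in the bash hack/ scripts, so this standalone program is not the actual implementation): an explicit KUBE_INTEGRATION_TEST_MAX_CONCURRENCY wins, otherwise go.uber.org/automaxprocs derives GOMAXPROCS from the container's CPU quota.

```go
// A rough sketch, not the actual bash in the hack/ scripts: an explicit
// KUBE_INTEGRATION_TEST_MAX_CONCURRENCY overrides GOMAXPROCS, otherwise
// go.uber.org/automaxprocs derives it from the container's CPU quota.
package main

import (
	"log"
	"os"
	"runtime"
	"strconv"

	"go.uber.org/automaxprocs/maxprocs"
)

func setupGomaxprocs() {
	if v := os.Getenv("KUBE_INTEGRATION_TEST_MAX_CONCURRENCY"); v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			log.Fatalf("invalid KUBE_INTEGRATION_TEST_MAX_CONCURRENCY: %q", v)
		}
		// This is the hard-coded override path the CI env var used to force.
		runtime.GOMAXPROCS(n)
		log.Printf("Set GOMAXPROCS to %d from KUBE_INTEGRATION_TEST_MAX_CONCURRENCY", n)
		return
	}
	// automaxprocs matches GOMAXPROCS to the cgroup CPU quota (8 for this job)
	// and leaves an explicitly set GOMAXPROCS env var alone.
	if _, err := maxprocs.Set(maxprocs.Logger(log.Printf)); err != nil {
		log.Printf("failed to set GOMAXPROCS automatically: %v", err)
	}
}

func main() {
	setupGomaxprocs()
	log.Printf("GOMAXPROCS=%d", runtime.GOMAXPROCS(0))
}
```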

@BenTheElder
Member Author

BenTheElder commented Mar 12, 2025

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130758/pull-kubernetes-integration/1899925030251270144

+ make test-integration
go: downloading go.uber.org/automaxprocs v1.6.0
+++ [0312 20:48:28] Set GOMAXPROCS automatically to 8
+++ [0312 20:48:28] Setting parallelism to 8

(this is correct: the job has an 8-core CPU limit; you can inspect the job config)
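
For context, that auto-detected value comes from the container's CPU quota. Below is a minimal sketch of the arithmetic, assuming cgroup v2 (which exposes the limit as "quota period" in cpu.max); automaxprocs does this bookkeeping for us, so the path and parsing here are only illustrative.

```go
// Rough sketch of where the "8" comes from: under cgroup v2, the container's
// CPU limit is exposed as "<quota> <period>" in cpu.max, and quota/period is
// the number of usable CPUs.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func cpuQuota() (float64, error) {
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // cgroup v2 path; v1 differs
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(data))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, fmt.Errorf("no CPU limit set")
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return 0, err
	}
	return quota / period, nil
}

func main() {
	q, err := cpuQuota()
	if err != nil {
		fmt.Println("no quota:", err)
		return
	}
	fmt.Printf("CPU quota: %.1f cores\n", q) // a cpu.max of "800000 100000" prints 8.0
}
```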

@BenTheElder
Member Author

BenTheElder commented Mar 12, 2025

passed in 35m https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130758/pull-kubernetes-integration/1899925030251270144

(previously >1h)

@BenTheElder
Member Author

/triage accepted
/priority important-longterm
/sig testing

@k8s-ci-robot added the following labels on Mar 12, 2025:
  • triage/accepted (Indicates an issue or PR is ready to be actively worked on.)
  • priority/important-longterm (Important over the long term, but may not be staffed and/or may need multiple releases to complete.)
  • sig/testing (Categorizes an issue or PR as relevant to SIG Testing.)
and removed:
  • needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)
  • needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
  • do-not-merge/needs-sig (Indicates an issue or PR lacks a `sig/foo` label and requires one.)
@aojea
Member

aojea commented Mar 12, 2025

/lgtm

35m5s

🏩

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Mar 12, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 35ebc910f98b46571558e62bb88260ff130fb095

@k8s-ci-robot merged commit c79d3ce into kubernetes:master on Mar 12, 2025
14 checks passed
@k8s-ci-robot added this to the v1.33 milestone on Mar 12, 2025
@BenTheElder deleted the integration-concur branch on March 13, 2025 at 18:32
@liggitt
Member

liggitt commented Mar 14, 2025

hmm... could this be causing flakes in tests like #130810 (comment) ?

@aojea
Member

aojea commented Mar 14, 2025

This merged at Mar 12, 10:51 PM GMT+1

and flakes started to happen around that time

#130810 (comment)

There is a strong correlation that this is the trigger; I didn't see any changes in the scheduler or integration test code that could cause this flakiness.

@macsko
Member

macsko commented Mar 14, 2025

Isn't the flakiness caused by the fact that we now want to use all the CPUs available on the machine, while of course the tests can't use exactly all of them? Tests actually run faster because we have higher concurrency, but Go assumes it has more CPUs available than it can actually use. This causes test cases to run longer than before, because Go tries to run 8 cases at a time with <8 available CPUs.

@liggitt
Member

liggitt commented Mar 14, 2025

I suspect we are unleashing more tests and bottlenecking on something else, possibly I/O, and hitting timeouts of some kind

@BenTheElder
Member Author

Isn't the flakiness caused by the fact that we now want to use all the CPUs available on the machine, while of course the tests can't use exactly all of them? Tests actually run faster because we have higher concurrency, but Go assumes it has more CPUs available than it can actually use. This causes test cases to run longer than before, because Go tries to run 8 cases at a time with <8 available CPUs.

Er, we have 8 available CPUs with guaranteed QoS (request == limit). This is accurately detected and set as GOMAXPROCS.

Go doesn't assume we have more CPUs; it will limit the number of running user threads to GOMAXPROCS, excluding threads blocked on I/O.
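
To make that point concrete, a small standalone demo (unrelated to the Kubernetes test code): GOMAXPROCS caps how many goroutines execute Go code simultaneously, so the same fixed amount of CPU-bound work takes roughly 4x longer at GOMAXPROCS=2 than at GOMAXPROCS=8, given at least 8 genuinely usable cores.

```go
// Illustrative only: GOMAXPROCS caps how many goroutines execute Go code at
// once, so a fixed amount of CPU-bound work spread over 8 workers takes about
// 4x longer at GOMAXPROCS=2 than at GOMAXPROCS=8 (given >=8 usable cores).
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

var sink uint64 // keeps the busy work observable

// work burns a fixed amount of CPU, independent of wall-clock time.
func work(iters int) uint64 {
	var x uint64
	for i := 0; i < iters; i++ {
		x += uint64(i) * 2654435761
	}
	return x
}

func run(procs, workers, iters int) time.Duration {
	runtime.GOMAXPROCS(procs)
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddUint64(&sink, work(iters))
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const workers, iters = 8, 200_000_000
	for _, p := range []int{2, 8} {
		fmt.Printf("GOMAXPROCS=%d, %d CPU-bound workers: %v\n", p, workers, run(p, workers, iters))
	}
}
```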

I suspect we are unleashing more tests and bottlenecking on something else, possibly I/O, and hitting timeouts of some kind

Yes, also these are running in EKS where the machines have > 8 cores currently, so there may be other jobs on the node.

There was discussion in SIG K8s Infra about updating the machine types, so we haven't wanted to depend heavily on >8 cores (the GKE cluster uses 8-core machines, which means <8 cores are schedulable; the non-schedulable portion is reserved, so on those machines you will see a job using 7 to 7.2 cores).

By not guaranteeing the whole node, it's possible something else is stomping on I/O.
Alternatively, we're just running more concurrently and stomping on I/O ourselves.

I don't know how to monitor I/O on the EKS clusters ... I actually don't even have direct access setup personally, and the monitoring dashboard at https://monitoring-eks.prow.k8s.io/?orgId=1 shows CPU + memory but not I/O.

For jobs on the GKE cluster, I can see https://monitoring-gke.prow.k8s.io/?orgId=1, and I can also pull up the GCE graphs for the VM (if it still exists) underpinning the node the job ran on and correlate the timing.

@BenTheElder
Member Author

cc @xmudrii @dims

@BenTheElder
Member Author

See also kubernetes/test-infra#34139; while it isn't written down neatly yet, there are pointers there with more details about the environment we're running in.

If we can't get flakiness down, we can resume under-subscribing these for now and revisit in the future.

We can also consider switching this job over to the GKE cluster and reserving the whole node, which we know is in pretty good shape.

@aojea
Member

aojea commented Mar 14, 2025

Since we are this close to code freeze, it's better to go back to the previous state than to start making many changes now ... we can work on this more calmly after code freeze.

@BenTheElder
Member Author

ACK, PR coming ...

@BenTheElder
Member Author

#130821

@ylink-lfs
Contributor

ylink-lfs commented Jul 11, 2025

By not guaranteeing the whole node, it's possible something else is stomping on I/O. Alternatively, we're just running more concurrently and stomping on I/O ourselves.

I don't know how to monitor I/O on the EKS clusters ... I actually don't even have direct access setup personally, and the monitoring dashboard at https://monitoring-eks.prow.k8s.io/?orgId=1 shows CPU + memory but not I/O.

I've done some profiling on the scheduler-perf integration test and found evidence that integration tests with a maximized GOMAXPROCS are stomping on I/O. The graph below, picked from several pull-kubernetes-integration Prow jobs running on EKS, shows how external I/O pressure affects integration performance. I am also proposing some ideas for removing the KUBE_INTEGRATION_TEST_MAX_CONCURRENCY variable without introducing excessive flakiness.

  • Add a volume specification inside the pull-kubernetes-integration Prow YAML file so that we can have IOPS guaranteed by the cloud provider, avoiding I/O interruptions.
  • Add a runtime.GOMAXPROCS(4) call to set perf-related integration tests to a lower value, and use the auto-detected GOMAXPROCS for other integration tests (a rough sketch follows below).

The first approach definitely requires a longer-term discussion, so I suggest we implement the second approach first. WDYT? @BenTheElder
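
As a rough sketch of that second idea (illustrative only; the package name and the cap constant below are placeholders, not the current scheduler-perf code), a perf-heavy integration test package could lower GOMAXPROCS from its own TestMain while every other package keeps the auto-detected value:

```go
// main_test.go in a perf-heavy integration test package (illustrative sketch).
package benchmark

import (
	"os"
	"runtime"
	"testing"
)

const maxPerfProcs = 4 // the lower cap proposed above

func TestMain(m *testing.M) {
	if runtime.GOMAXPROCS(0) > maxPerfProcs {
		// Only lower the value; never raise it above what CI detected.
		runtime.GOMAXPROCS(maxPerfProcs)
	}
	os.Exit(m.Run())
}
```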

The benchmark used the Linux top utility to monitor I/O wait, starting when the scheduler-perf integration test begins and ending when it finishes. The first, failed integration run had significantly more I/O pressure (max=50) than the second, successful one.
[Screenshots: top I/O-wait traces sampled at 1-second intervals, GOMAXPROCS_8_fail_1s1top and GOMAXPROCS_8_success_1s1top]
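
For anyone who wants to reproduce this kind of measurement without top, here is a minimal standalone sketch (an illustration, not what the benchmark above used) that derives the same I/O-wait percentage from the aggregate cpu line of /proc/stat on Linux, sampling once per second:

```go
// iowaitmon samples /proc/stat and prints the share of CPU time spent in
// iowait between samples. Linux-only; field layout per proc(5).
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPU returns the counters from the aggregate "cpu" line of /proc/stat.
func readCPU() ([]uint64, error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return nil, err
	}
	line := strings.SplitN(string(data), "\n", 2)[0] // first line is the aggregate "cpu"
	fields := strings.Fields(line)[1:]               // drop the "cpu" label
	if len(fields) < 5 {
		return nil, fmt.Errorf("unexpected /proc/stat format: %q", line)
	}
	vals := make([]uint64, len(fields))
	for i, f := range fields {
		if vals[i], err = strconv.ParseUint(f, 10, 64); err != nil {
			return nil, err
		}
	}
	return vals, nil
}

func sum(v []uint64) uint64 {
	var s uint64
	for _, x := range v {
		s += x
	}
	return s
}

func main() {
	prev, err := readCPU()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for {
		time.Sleep(1 * time.Second)
		cur, err := readCPU()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Field 4 (after dropping the "cpu" label) is iowait jiffies.
		dIowait := cur[4] - prev[4]
		dTotal := sum(cur) - sum(prev)
		if dTotal > 0 {
			fmt.Printf("iowait: %.1f%%\n", 100*float64(dIowait)/float64(dTotal))
		}
		prev = cur
	}
}
```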

Successfully merging this pull request may close these issues.

don't use hardcoded maxprocs for integration tests
6 participants