
ci: integration test speedup with IO related perf flakiness bypass #132904


Open
wants to merge 1 commit into base: master

Conversation

ylink-lfs
Contributor

@ylink-lfs ylink-lfs commented Jul 12, 2025

What type of PR is this?

/kind cleanup
/sig testing
/sig scheduling
/priority important-longterm

What this PR does / why we need it:

Integration tests often take well over an hour, even with 8 cores and 20Gi of memory assigned in Prow. They used to take roughly 30 minutes or less.

Which issue(s) this PR is related to:

integration tests are painfully slow #126202

Tracking Issue: Presubmit Performance #130762

#130758 (comment)

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@Copilot Copilot AI review requested due to automatic review settings July 12, 2025 04:01
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 12, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

Hi @ylink-lfs. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/test labels Jul 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ylink-lfs
Once this PR has been reviewed and has the lgtm label, please assign macsko, spiffxp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Copilot Copilot AI left a comment

Pull Request Overview

This PR aims to improve integration test performance by removing hardcoded concurrency limits and implementing a GOMAXPROCS coordination mechanism. The changes address slow integration tests that were taking over an hour instead of the expected 30 minutes.

  • Removes hardcoded KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4 from CI scripts to allow better parallelization
  • Adds GOMAXPROCS coordination using sync.WaitGroup to manage goroutine execution limits
  • Refactors the main integration test function to support the new concurrency management
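
As a rough illustration of what such a coordination mechanism can look like, the sketch below gates a GOMAXPROCS cap behind a sync.WaitGroup so that test setup can wait until the cap is in effect. This is not the PR's actual code; apart from the wgGOMAXPROCS variable visible in the diff further down, every name and the specific cap are assumptions.

```go
package coordination

import (
	"runtime"
	"sync"
)

// wgGOMAXPROCS is completed once the GOMAXPROCS adjustment has been applied,
// so later test setup can wait for it before starting heavy cases.
var wgGOMAXPROCS *sync.WaitGroup

// limitGOMAXPROCSForIntegration (hypothetical helper) caps GOMAXPROCS and
// signals the WaitGroup once the cap is in place.
func limitGOMAXPROCSForIntegration(maxProcs int) {
	wgGOMAXPROCS = &sync.WaitGroup{}
	wgGOMAXPROCS.Add(1)
	go func() {
		defer wgGOMAXPROCS.Done()
		if runtime.GOMAXPROCS(0) > maxProcs {
			runtime.GOMAXPROCS(maxProcs)
		}
	}()
}

// waitForGOMAXPROCS (hypothetical helper) blocks until the adjustment, if any,
// has taken effect.
func waitForGOMAXPROCS() {
	if wgGOMAXPROCS != nil {
		wgGOMAXPROCS.Wait()
	}
}
```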

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • test/integration/scheduler_perf/scheduler_perf.go: implements the GOMAXPROCS coordination mechanism and refactors the test execution function
  • hack/jenkins/test-integration-dockerized.sh: removes the hardcoded concurrency limit export
  • hack/jenkins/test-dockerized.sh: removes the hardcoded concurrency limit export
Comments suppressed due to low confidence (1)

test/integration/scheduler_perf/scheduler_perf.go:32

  • [nitpick] The import alias 'rt' for 'runtime' is unnecessarily cryptic. Consider using a more descriptive alias like 'runtime' or importing without an alias since there are no naming conflicts.
	rt "runtime"

@ylink-lfs
Contributor Author

/assign BenTheElder

@ylink-lfs ylink-lfs changed the title ci: integration test speedup with IO flakiness bypass ci: integration test speedup with IO related perf flakiness bypass Jul 12, 2025
@ylink-lfs
Contributor Author

ylink-lfs commented Jul 12, 2025

There are still some implementation ideas open for discussion:

  • One option is to keep using the current method in the PR: simply disable scheduler-perf from using the maximum GOMAXPROCS in integration tests.
  • Another way is adding a top-level entrypoint go program for ALL integration tests, which launches a separate goroutine to continuously monitor I/O usage. This goroutine would lower GOMAXPROCS when I/O pressure is high and restore the default value when the pressure is gone.
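
To make the second idea concrete, here is a rough, purely illustrative sketch of such a watchdog: it reads Linux PSI from /proc/pressure/io, lowers GOMAXPROCS while the avg10 pressure is above a threshold, and restores the default afterwards. Nothing here is implemented in this PR; the file path, thresholds, and polling interval are all assumptions.

```go
package main

import (
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// ioPressurePct parses the first "avg10" value from /proc/pressure/io (Linux PSI).
func ioPressurePct() (float64, bool) {
	data, err := os.ReadFile("/proc/pressure/io")
	if err != nil {
		return 0, false
	}
	for _, field := range strings.Fields(string(data)) {
		if strings.HasPrefix(field, "avg10=") {
			v, err := strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			return v, err == nil
		}
	}
	return 0, false
}

// watchIOPressure lowers GOMAXPROCS when pressure exceeds highPct and restores
// the original value once it drops again. Thresholds are assumed, not tuned.
func watchIOPressure(stop <-chan struct{}, highPct float64, loweredProcs int) {
	defaultProcs := runtime.GOMAXPROCS(0)
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			runtime.GOMAXPROCS(defaultProcs)
			return
		case <-ticker.C:
			if p, ok := ioPressurePct(); ok {
				if p > highPct {
					runtime.GOMAXPROCS(loweredProcs)
				} else {
					runtime.GOMAXPROCS(defaultProcs)
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go watchIOPressure(stop, 50.0, 2) // assumed: throttle to 2 procs above 50% avg10 pressure
	// ... launch the integration tests here ...
	close(stop)
}
```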

@k8s-ci-robot
Contributor

@ylink-lfs: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 29, 2025
@@ -121,6 +122,8 @@ const DefaultLoggingVerbosity = 2

var LoggingFeatureGate FeatureGateFlag
var LoggingConfig *logsapi.LoggingConfiguration
var wgGOMAXPROCS *sync.WaitGroup
Member

we need to comment what the intention is behind this mechanism
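
For illustration only, a comment along these lines could capture that intent; the wording below is inferred from the PR description and is an assumption, not the author's text:

```go
// wgGOMAXPROCS, when non-nil, is waited on before the heavy scheduler-perf
// cases start, ensuring that any GOMAXPROCS cap applied for integration-test
// runs is already in effect and the cases cannot saturate the machine.
var wgGOMAXPROCS *sync.WaitGroup
```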

@BenTheElder
Member

Another way is adding a top-level entrypoint go program for ALL integration tests, which launches a separate goroutine to continuously monitor I/O usage. This goroutine would lower GOMAXPROCS when I/O pressure is high and restore the default value when the pressure is gone.

I don't think we want to do that; it seems like an easy way to introduce subtle flakes, and I have concerns about the underlying systems not being robust.

It looks like this took 40m

@ylink-lfs
Contributor Author


So, to clarify: you think this is a good optimization, and I should move forward with lowering GOMAXPROCS for the flaky performance tests?
