
ci: integration test speedup with IO related perf flakiness bypass #132904


Open
wants to merge 1 commit into base: master

Conversation

ylink-lfs
Contributor

@ylink-lfs ylink-lfs commented Jul 12, 2025

What type of PR is this?

/kind cleanup
/sig testing
/sig scheduling
/priority important-longterm

What this PR does / why we need it:

Integration tests often take well over an hour, even with 8 cores and 20Gi of memory assigned in Prow. They used to take roughly 30 minutes or less.

Which issue(s) this PR is related to:

integration tests are painfully slow #126202

Tracking Issue: Presubmit Performance #130762

#130758 (comment)

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@Copilot Copilot AI review requested due to automatic review settings July 12, 2025 04:01
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 12, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

Hi @ylink-lfs. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. area/test labels Jul 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ylink-lfs
Once this PR has been reviewed and has the lgtm label, please assign macsko, spiffxp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Copilot Copilot AI left a comment

Pull Request Overview

This PR aims to improve integration test performance by removing hardcoded concurrency limits and implementing a GOMAXPROCS coordination mechanism. The changes address slow integration tests that were taking over an hour instead of the expected 30 minutes.

  • Removes hardcoded KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4 from CI scripts to allow better parallelization
  • Adds GOMAXPROCS coordination using sync.WaitGroup to manage goroutine execution limits
  • Refactors the main integration test function to support the new concurrency management
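
As a rough illustration of what such a coordination mechanism can look like, the sketch below gates a GOMAXPROCS cap behind a sync.WaitGroup so that test setup can wait until the cap is in effect. This is not the PR's actual code; apart from the wgGOMAXPROCS variable visible in the diff further down, every name and the specific cap are assumptions.

```go
package coordination

import (
	"runtime"
	"sync"
)

// wgGOMAXPROCS is completed once the GOMAXPROCS adjustment has been applied,
// so later test setup can wait for it before starting heavy cases.
var wgGOMAXPROCS *sync.WaitGroup

// limitGOMAXPROCSForIntegration (hypothetical helper) caps GOMAXPROCS and
// signals the WaitGroup once the cap is in place.
func limitGOMAXPROCSForIntegration(maxProcs int) {
	wgGOMAXPROCS = &sync.WaitGroup{}
	wgGOMAXPROCS.Add(1)
	go func() {
		defer wgGOMAXPROCS.Done()
		if runtime.GOMAXPROCS(0) > maxProcs {
			runtime.GOMAXPROCS(maxProcs)
		}
	}()
}

// waitForGOMAXPROCS (hypothetical helper) blocks until the adjustment, if any,
// has taken effect.
func waitForGOMAXPROCS() {
	if wgGOMAXPROCS != nil {
		wgGOMAXPROCS.Wait()
	}
}
```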

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • test/integration/scheduler_perf/scheduler_perf.go: implements the GOMAXPROCS coordination mechanism and refactors the test execution function
  • hack/jenkins/test-integration-dockerized.sh: removes the hardcoded concurrency limit export
  • hack/jenkins/test-dockerized.sh: removes the hardcoded concurrency limit export
Comments suppressed due to low confidence (1)

test/integration/scheduler_perf/scheduler_perf.go:32

  • [nitpick] The import alias 'rt' for 'runtime' is unnecessarily cryptic. Consider using a more descriptive alias like 'runtime' or importing without an alias since there are no naming conflicts.
	rt "runtime"

@ylink-lfs
Contributor Author

/assign BenTheElder

@ylink-lfs ylink-lfs changed the title ci: integration test speedup with IO flakiness bypass ci: integration test speedup with IO related perf flakiness bypass Jul 12, 2025
@ylink-lfs
Contributor Author

ylink-lfs commented Jul 12, 2025

There are still some implementation ideas open for discussion:

  • One option is to keep using the current method in the PR: simply disable scheduler-perf from using the maximum GOMAXPROCS in integration tests.
  • Another way is adding a top-level entrypoint go program for ALL integration tests, which launches a separate goroutine to continuously monitor I/O usage. This goroutine would lower GOMAXPROCS when I/O pressure is high and restore the default value when the pressure is gone.
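
To make the second idea concrete, here is a rough, purely illustrative sketch of such a watchdog: it reads Linux PSI from /proc/pressure/io, lowers GOMAXPROCS while the avg10 pressure is above a threshold, and restores the default afterwards. Nothing here is implemented in this PR; the file path, thresholds, and polling interval are all assumptions.

```go
package main

import (
	"os"
	"runtime"
	"strconv"
	"strings"
	"time"
)

// ioPressurePct parses the first "avg10" value from /proc/pressure/io (Linux PSI).
func ioPressurePct() (float64, bool) {
	data, err := os.ReadFile("/proc/pressure/io")
	if err != nil {
		return 0, false
	}
	for _, field := range strings.Fields(string(data)) {
		if strings.HasPrefix(field, "avg10=") {
			v, err := strconv.ParseFloat(strings.TrimPrefix(field, "avg10="), 64)
			return v, err == nil
		}
	}
	return 0, false
}

// watchIOPressure lowers GOMAXPROCS when pressure exceeds highPct and restores
// the original value once it drops again. Thresholds are assumed, not tuned.
func watchIOPressure(stop <-chan struct{}, highPct float64, loweredProcs int) {
	defaultProcs := runtime.GOMAXPROCS(0)
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			runtime.GOMAXPROCS(defaultProcs)
			return
		case <-ticker.C:
			if p, ok := ioPressurePct(); ok {
				if p > highPct {
					runtime.GOMAXPROCS(loweredProcs)
				} else {
					runtime.GOMAXPROCS(defaultProcs)
				}
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	go watchIOPressure(stop, 50.0, 2) // assumed: throttle to 2 procs above 50% avg10 pressure
	// ... launch the integration tests here ...
	close(stop)
}
```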

@k8s-ci-robot
Contributor

@ylink-lfs: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 29, 2025
@@ -121,6 +122,8 @@ const DefaultLoggingVerbosity = 2

var LoggingFeatureGate FeatureGateFlag
var LoggingConfig *logsapi.LoggingConfiguration
var wgGOMAXPROCS *sync.WaitGroup
Member

we need to comment what the intention is behind this mechanism
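
For illustration only, a comment along these lines could capture that intent; the wording below is inferred from the PR description and is an assumption, not the author's text:

```go
// wgGOMAXPROCS, when non-nil, is waited on before the heavy scheduler-perf
// cases start, ensuring that any GOMAXPROCS cap applied for integration-test
// runs is already in effect and the cases cannot saturate the machine.
var wgGOMAXPROCS *sync.WaitGroup
```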

@BenTheElder
Member

Another way is adding a top-level entrypoint go program for ALL integration tests, which launches a separate goroutine to continuously monitor I/O usage. This goroutine would lower GOMAXPROCS when I/O pressure is high and restore the default value when the pressure is gone.

I don't think we want to do that; it seems like an easy way to introduce subtle flakes, and I have concerns about the underlying systems not being robust.

It looks like this took 40m

@ylink-lfs
Contributor Author


So, to clarify: you think this is a good optimization, and I should move forward with lowering GOMAXPROCS for the flaky performance tests?
