stop overriding max concurrency in CI, let automax procs handle it #130758
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: BenTheElder. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -36,7 +36,6 @@ export KUBE_COVER="n"
 export ARTIFACTS=${ARTIFACTS:-"${WORKSPACE}/artifacts"}
 # Save the verbose stdout as well.
 export KUBE_KEEP_VERBOSE_TEST_OUTPUT=y
-export KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4
hack/make-rules/test-integration.sh already calls kube::golang::setup_gomaxprocs. Setting this variable causes that script to override GOMAXPROCS with KUBE_INTEGRATION_TEST_MAX_CONCURRENCY.
+ make test-integration
go: downloading go.uber.org/automaxprocs v1.6.0
+++ [0312 20:48:28] Set GOMAXPROCS automatically to 8
+++ [0312 20:48:28] Setting parallelism to 8 (this is correct: the job has an 8-core CPU limit, you can inspect the config)
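A minimal sketch of the interaction described above, assuming the variable-precedence behavior stated in this comment (illustrative only, not the actual contents of hack/make-rules/test-integration.sh):

```bash
#!/usr/bin/env bash
# Illustrative sketch -- not the real script.

kube::golang::setup_gomaxprocs() {
  # Stand-in for the cgroup-aware detection done via go.uber.org/automaxprocs;
  # in the CI log above this resolved to 8.
  if [[ -z "${GOMAXPROCS:-}" ]]; then
    GOMAXPROCS="$(nproc)"
    export GOMAXPROCS
  fi
  echo "Set GOMAXPROCS automatically to ${GOMAXPROCS}"
}

kube::golang::setup_gomaxprocs

# The override removed by this PR: when the CI config exports
# KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4, the detected value is discarded.
if [[ -n "${KUBE_INTEGRATION_TEST_MAX_CONCURRENCY:-}" ]]; then
  export GOMAXPROCS="${KUBE_INTEGRATION_TEST_MAX_CONCURRENCY}"
fi
echo "Setting parallelism to ${GOMAXPROCS}"
```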
passed in 35m: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/130758/pull-kubernetes-integration/1899925030251270144 (previously >1h)
/triage accepted
/lgtm
🏩
LGTM label has been added. Git tree hash: 35ebc910f98b46571558e62bb88260ff130fb095
hmm... could this be causing flakes in tests like #130810 (comment)?
This merged at Mar 12, 10:51 PM GMT+1, and the flakes started to happen around that time, so there is a strong correlation that this was the trigger. I didn't see any changes to the scheduler or integration test code that could cause this flakiness.
Isn't the flakiness caused by the fact that we now want to use all the CPUs available on the machine, and of course the tests can't use exactly all of them? Tests actually run faster because we have higher concurrency, but Go assumes it has more CPUs available than it can actually use. This causes test cases to run longer than before, because Go tries to run 8 cases at a time with <8 available CPUs.
I suspect we are unleashing more tests and bottlenecking on something else, possibly I/O, and hitting timeouts of some kind.
Er, we have 8 available CPUs with Guaranteed QoS (request == limit). This is accurately detected and set as GOMAXPROCS. Go doesn't assume we have more CPUs; it limits the number of running user threads to GOMAXPROCS, excluding threads blocked on I/O.
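As a hedged aside (not from the thread): with Guaranteed QoS the 8-CPU limit shows up inside the job pod as a cgroup CPU quota, which is what automaxprocs reads. A quick sanity check of the value it should detect, assuming cgroup v2 (on cgroup v1 the equivalent files are cpu.cfs_quota_us and cpu.cfs_period_us):

```bash
# Read the cgroup v2 CPU quota for the current container, e.g. "800000 100000".
read -r quota period < /sys/fs/cgroup/cpu.max
if [[ "${quota}" == "max" ]]; then
  echo "no CPU limit set; GOMAXPROCS falls back to the node's CPU count"
else
  echo "CPU limit: $(( quota / period )) cores -> expected GOMAXPROCS"
fi
```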
Yes, and these are running in EKS, where the machines currently have >8 cores, so there may be other jobs on the node. There has been discussion in SIG K8s Infra about updating the machine types, so we haven't wanted to depend heavily on >8 cores (the GKE cluster uses 8-core machines, which means <8 cores are schedulable, but the non-schedulable portion is reserved, so on those you will see a job using 7 .. 7.2 cores). By not guaranteeing the whole node, it's possible something else is stomping on I/O. I don't know how to monitor I/O on the EKS clusters ... I actually don't even have direct access set up personally, and the monitoring dashboard at https://monitoring-eks.prow.k8s.io/?orgId=1 shows CPU + memory but not I/O. For jobs on the GKE cluster, I can see https://monitoring-gke.prow.k8s.io/?orgId=1, and you can also pull up the GCE graphs for the VM (if it still exists) underpinning the node the job ran on and correlate the timing.
See also kubernetes/test-infra#34139; while it isn't written down neatly yet, there are pointers there with more details about the environment we're running in. If we can't get the flakiness down, we can resume under-subscribing these for now and revisit in the future. We can also consider switching this job over to the GKE cluster and reserving the whole node, which we know is in pretty good shape.
Since we are so close to code freeze, it is better to go back to the previous state than to start making many changes now ... we can work on this more calmly after code freeze.
ACK, PR coming ...
I've done some profiling on the scheduler-perf integration test and found evidence that integration tests with a maximized GOMAXPROCS are stomping on I/O. The graph below, picked from several pull-kubernetes-integration Prow jobs running on EKS, shows how external I/O pressure affects integration performance. I am also proposing some ideas for removing the KUBE_INTEGRATION_TEST_MAX_CONCURRENCY variable without introducing excessive flakiness.
The first approach definitely requires a longer-term discussion, so I suggest we implement the second approach first. WDYT? @BenTheElder The benchmark used the Linux
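One hedged suggestion (not necessarily the method used for the profiling above): the kernel's PSI interface can be sampled from inside a job to corroborate external I/O pressure during a run.

```bash
# Sample I/O pressure stats every 10 seconds for the duration of the test run.
# Sustained high "full" avg10 values would support the hypothesis that tests
# are bottlenecked on I/O rather than CPU. Requires a kernel with PSI enabled.
while true; do
  date +%T
  cat /proc/pressure/io
  sleep 10
done
```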
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
Ensures we use the auto-detected max procs for integration tests in CI instead of overriding it with a hard-coded value of 4 CPUs.
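As a quick illustration (the make target appears in the CI log above; re-exporting the variable to restore the old cap is an assumption based on the script behavior described in this thread):

```bash
# With the CI override removed, GOMAXPROCS and test parallelism are auto-detected:
make test-integration

# The previous capped behavior can still be reproduced manually if needed:
KUBE_INTEGRATION_TEST_MAX_CONCURRENCY=4 make test-integration
```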
Which issue(s) this PR fixes:
Fixes #130755
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: