Collect some of scheduling metrics and scheduling throughput #85861

ingvagabund · 2019-12-03T20:46:16Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

In addition to getting overall performance measurements from the Go benchmark,
collect metrics that provide information about the internals of the scheduling itself.

Metrics in question:

scheduler_scheduling_algorithm_predicate_evaluation_seconds
scheduler_scheduling_algorithm_priority_evaluation_seconds
scheduler_binding_duration_seconds
scheduler_e2e_scheduling_duration_seconds

Scheduling throughput is computed on the fly inside benchmarkScheduling.

Which issue(s) this PR fixes:

Fixes #85685

Special notes for your reviewer:

Code for computing percentiles is copy-pasted from Prometheus. The DataItem struct is copied from the clusterloader performance tests.

name	runtime
BenchmarkPerfScheduling	3m4.097s
BenchmarkPerfSchedulingPodAntiAffinity	5m2.016s
BenchmarkPerfSchedulingSecrets	2m11.731s
BenchmarkPerfSchedulingInTreePVs	32m37.221s
BenchmarkPerfSchedulingMigratedInTreePVs	16m5.844s
BenchmarkPerfSchedulingCSIPVs	15m9.842s
BenchmarkPerfSchedulingPodAffinity	5m16.747s
BenchmarkPerfSchedulingNodeAffinity	2m28.232s

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

test/integration/scheduler_perf/perf_util.go

test/integration/scheduler_perf/scheduler_perf_test.go

test/integration/scheduler_perf/perf_util.go

liu-cong · 2019-12-05T20:05:58Z

test/integration/scheduler_perf/scheduler_perf_test.go

+
+// BenchmarkPerfScheduling benchmarks the scheduling rate when the cluster has
+// various quantities of nodes and scheduled pods.
+func BenchmarkPerfScheduling(b *testing.B) {


Let's leave this test only and delete others for now, the reasons being:

There are active changes to the existing benchmark test. It's likely this will get out of sync temporarily anyway.

Eventually the tests will be constructed dynamically from a config file. We won't need all the hard coded tests.

Just keeping one test here as an example makes development easier until we have the new tests working end to end.

liu-cong · 2019-12-06T14:52:12Z

test/integration/scheduler_perf/perf_util.go

+var dateFormat = "2006-01-02T15:04:05Z"
+
+// LatencyMetric represent 50th, 90th and 99th duration quantiles.
+type LatencyMetric struct {


I'd like to use a generic summary struct to represent all metrics, unless some day we find it's not sufficient:
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R8-R14

liu-cong · 2019-12-06T14:55:49Z

test/integration/scheduler_perf/scheduler_perf_test.go

+		},
+	})
+
+	// Start measuring throughput


Can we encapsulate throughput measuring in its own helper method?

Ideally, I'd like to define an interface MetricsCollector, and throughput is one implementation of that interface, like so
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R27-R33

Done. I created a private type and methods around it so we can then change it anytime.

liu-cong · 2019-12-06T14:56:56Z

test/integration/scheduler_perf/perf_util.go

+
+// DataItems is the data point set.
+type DataItems struct {
+	Version   string     `json:"version"`


What is version used for?

Not sure, though it's part of the data type the performance dashboard optionally expects.

liu-cong · 2019-12-06T14:59:10Z

test/integration/scheduler_perf/perf_util.go

+}
+
+// DataItem is the data point.
+type DataItem struct {


I don't find this struct useful. Why don't we use the metric struct (LatencyMetric or the proposed generic summary) directly?

DataItem is a type known to the performance dashboard parser. The parser expects the following json:

{
  "version": "v1",
  "dataItems": [
    {
      "data": { "Average": ddd, "Perc50": ddd, "Perc90": ddd, "Perc99": ddd },
      "unit": "sss",
      "labels": { "key1": "value1", ... }
    },
    ...
  ]
}
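
As a rough sketch of how Go types could produce that JSON shape: the `Version` field and its tag come from the diff above, while the `DataItem` field types, the `dataItems` tag, and the `encode` helper are assumptions inferred from the JSON layout rather than taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DataItem mirrors one entry of "dataItems"; field types are assumed.
type DataItem struct {
	Data   map[string]float64 `json:"data"`
	Unit   string             `json:"unit"`
	Labels map[string]string  `json:"labels,omitempty"`
}

// DataItems is the top-level document expected by the dashboard parser.
type DataItems struct {
	Version   string     `json:"version"`
	DataItems []DataItem `json:"dataItems"`
}

// encode is a hypothetical helper that serializes the items to JSON.
func encode(items DataItems) (string, error) {
	b, err := json.Marshal(items)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encode(DataItems{
		Version: "v1",
		DataItems: []DataItem{{
			Data:   map[string]float64{"Average": 2.5, "Perc50": 2, "Perc90": 4, "Perc99": 5},
			Unit:   "ms",
			Labels: map[string]string{"Metric": "scheduler_binding_duration_seconds"},
		}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Using maps for `data` and `labels` keeps the type generic, which is the flexibility argument made in the follow-up comments.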

Ack.

Then can we just get rid of summary and use DataItem directly? It's more flexible too.

Can you resolve the comment so it's easy to track unresolved ones? Same for others. Thanks!

liu-cong · 2019-12-09T15:26:48Z

/assign

k8s-ci-robot · 2020-02-02T17:07:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/integration/scheduler_perf/OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ingvagabund · 2020-02-02T18:34:46Z

/retest

haosdent · 2020-02-03T01:32:03Z

/retest

ingvagabund · 2020-02-03T09:38:18Z

/retest

ingvagabund · 2020-02-03T14:31:08Z

@BenTheElder @lavalamp I re-implemented the percentile computation and dropped forked prometheus completely.

ahg-g

/lgtm

Thanks!

ahg-g · 2020-02-04T14:27:56Z

test/integration/scheduler_perf/util.go

+	for _, bckt := range hist.Bucket {
+		b := bucket{
+			count:      float64(*bckt.CumulativeCount),
+			upperBound: *bckt.UpperBound,


should we coalesce the buckets here?

fejta-bot · 2020-02-04T20:04:08Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

ingvagabund · 2020-02-04T21:55:02Z

/retest

ingvagabund · 2020-02-04T22:27:09Z

/retest

fejta-bot · 2020-02-05T03:46:16Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

ingvagabund · 2020-02-05T08:45:24Z

/retest

ingvagabund · 2020-02-05T14:38:27Z

/retest

fejta-bot · 2020-02-05T19:52:18Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

ingvagabund · 2020-02-06T01:25:41Z

/retest

fejta-bot · 2020-02-06T08:06:42Z

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

ingvagabund · 2020-02-06T09:38:53Z

/test pull-kubernetes-e2e-gce

ingvagabund · 2020-02-06T10:45:27Z

/retest

ingvagabund · 2020-02-06T13:15:32Z

/test pull-kubernetes-e2e-gce-100-performance

mikedanese · 2020-02-06T15:16:12Z

test/integration/scheduler_perf/BUILD

        "//test/integration/util:go_default_library",
+        "//vendor/github.com/prometheus/client_model/go:go_default_library",


This dependency is breaking the build. See #84302.

Unclear why this wasn't caught by the presubmit... cc @BenTheElder @RainbowMango @serathius

I opened kubernetes/test-infra#16168 to add coverage. Apparently we miss this in the build presubmit.

can we revert this to unbreak master, then reintroduce when fixed?

@serathius so what's the right way to import github.com/prometheus/client_model/go? I am going through #84302 and what's linked there, though it may take some time to figure out what to do.

You should not use or depend on internals of prometheus in integration tests.
Please move functions that rely on prometheus internals into staging/src/k8s.io/component-base/metrics/testutil
Example PR #85289

@serathius @liggitt @mikedanese Moving the code under staging/src/k8s.io/component-base/metrics/testutil in #87923

k8s-ci-robot requested review from draveness and hex108 December 3, 2019 20:47

k8s-ci-robot added the area/test and sig/testing labels and removed the needs-sig label Dec 3, 2019

ingvagabund added the sig/scheduling label Dec 3, 2019

ingvagabund changed the title ~~Collect some of scheduling metrics and scheduling throughput~~ WIP: Collect some of scheduling metrics and scheduling throughput Dec 3, 2019

k8s-ci-robot added the do-not-merge/work-in-progress label Dec 3, 2019

ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from 896f47c to bfd54b3 Compare December 4, 2019 14:04

k8s-ci-robot added the size/XXL label and removed the size/L label Dec 4, 2019

ingvagabund mentioned this pull request Dec 4, 2019

scheduler benchmark: allow to override bench prefix #85915

Merged

ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from bfd54b3 to 6304290 Compare December 5, 2019 12:38

ingvagabund mentioned this pull request Dec 5, 2019

Enhancement for scheduler perf #81719

Closed

5 tasks

ingvagabund changed the title ~~WIP: Collect some of scheduling metrics and scheduling throughput~~ Collect some of scheduling metrics and scheduling throughput Dec 5, 2019

k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 5, 2019

liu-cong reviewed Dec 6, 2019


k8s-ci-robot assigned liu-cong Dec 9, 2019

k8s-ci-robot added the needs-rebase label Dec 9, 2019

ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from 6304290 to 203948f Compare December 13, 2019 14:37

k8s-ci-robot added the size/L label and removed the size/XXL and needs-rebase labels Dec 13, 2019

k8s-ci-robot added the size/L label Feb 2, 2020

k8s-ci-robot added the approved label Feb 2, 2020

ahg-g reviewed Feb 4, 2020


k8s-ci-robot merged commit 6858c25 into kubernetes:master Feb 6, 2020

k8s-ci-robot added this to the v1.18 milestone Feb 6, 2020

mikedanese reviewed Feb 6, 2020


ingvagabund deleted the scheduler-perf-collect-data-items-from-metrics branch February 6, 2020 15:29

This was referenced Feb 6, 2020

build all tests and package in bazel-build presubmit kubernetes/test-infra#16168

Merged

Revert "Collect some of scheduling metrics and scheduling throughput" #87897

Merged

ingvagabund mentioned this pull request Feb 7, 2020

Collect some of scheduling metrics and scheduling throughput (vol. 2) #87923

Merged
