Skip to content

Collect some of scheduling metrics and scheduling throughput #85861

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ingvagabund
Copy link
Contributor

@ingvagabund ingvagabund commented Dec 3, 2019

What type of PR is this?

/kind feature

What this PR does / why we need it:

In addition to getting overall performance measurements from golang benchmark,
collect metrics that provides information about inside of the scheduling itself.

Metrics in question:

  • scheduler_scheduling_algorithm_predicate_evaluation_seconds
  • scheduler_scheduling_algorithm_priority_evaluation_seconds
  • scheduler_binding_duration_seconds
  • scheduler_e2e_scheduling_duration_seconds

Scheduling throughput is computed on the fly inside benchmarkScheduling.

Which issue(s) this PR fixes:

Fixes #85685

Special notes for your reviewer:

Code for computing percentiles is copy pasted from prometheus. DataItem struct copied from clusterloader performance tests.

name runtime
BenchmarkPerfScheduling 3m4.097s
BenchmarkPerfSchedulingPodAntiAffinity 5m2.016s
BenchmarkPerfSchedulingSecrets 2m11.731s
BenchmarkPerfSchedulingInTreePVs 32m37.221s
BenchmarkPerfSchedulingMigratedInTreePVs 16m5.844s
BenchmarkPerfSchedulingCSIPVs 15m9.842s
BenchmarkPerfSchedulingPodAffinity 5m16.747s
BenchmarkPerfSchedulingNodeAffinity 2m28.232s

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 3, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 3, 2019
@ingvagabund ingvagabund added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 3, 2019
@ingvagabund ingvagabund changed the title Collect some of scheduling metrics and scheduling throughput WIP: Collect some of scheduling metrics and scheduling throughput Dec 3, 2019
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 3, 2019
@ingvagabund ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from 896f47c to bfd54b3 Compare December 4, 2019 14:04
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 4, 2019
@ingvagabund ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from bfd54b3 to 6304290 Compare December 5, 2019 12:38
@ingvagabund ingvagabund mentioned this pull request Dec 5, 2019
5 tasks
@ingvagabund ingvagabund changed the title WIP: Collect some of scheduling metrics and scheduling throughput Collect some of scheduling metrics and scheduling throughput Dec 5, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 5, 2019

// BenchmarkPerfScheduling benchmarks the scheduling rate when the cluster has
// various quantities of nodes and scheduled pods.
func BenchmarkPerfScheduling(b *testing.B) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave this test only and delete others for now, the reasons being:

  1. There are active changes to the existing benchmark test. It's likely this get out of sync temporarily anyways.
  2. Eventually the tests will be constructed dynamically from a config file. We won't need all the hard coded tests.
  3. Just keeping one test here as an example makes development easier until we have the new tests working end to end.

var dateFormat = "2006-01-02T15:04:05Z"

// LatencyMetric represent 50th, 90th and 99th duration quantiles.
type LatencyMetric struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to use a generic summary struct to represent all metrics, unless some day we find it's not sufficient:
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R8-R14

},
})

// Start measuring throughput
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we encapsulate throughput measuring in its own helper method?

Ideally, I'd like to define an interface MetricsCollector, and throughput is one implementation of that interface, like so
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R27-R33

Copy link
Contributor Author

@ingvagabund ingvagabund Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I created a private type and methods around it so we can then change it anytime.


// DataItems is the data point set.
type DataItems struct {
Version string `json:"version"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is version used for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno. Though, it's part of data type the performance dashboard optionally expects.

}

// DataItem is the data point.
type DataItem struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this struct useful. Why don't we use the metric struct (LatencyMetric or the proposed generic summary) directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataItem is a type known to the performance dashboard parser. The parser expects the following json:

{
   "version":"v1",
   "dataItems":[
      {
         "data":{
            "Average":ddd,
            "Perc50":ddd,
            "Perc90":ddd,
            "Perc99":ddd
         },
         "unit":"sss",
         "labels":{
            "key1":"value1",
            ...
         }
      },
      ...
   ]
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack.

Then can we just get rid of summary and use DataItem directly? It's more flexible too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you resolve the comment so it's easy to track unresolved ones? Same for others. Thanks!

@liu-cong
Copy link
Contributor

liu-cong commented Dec 9, 2019

/assign

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2019
@ingvagabund ingvagabund force-pushed the scheduler-perf-collect-data-items-from-metrics branch from 6304290 to 203948f Compare December 13, 2019 14:37
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 13, 2019
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 2, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2020
@ingvagabund
Copy link
Contributor Author

/retest

2 similar comments
@haosdent
Copy link
Member

haosdent commented Feb 3, 2020

/retest

@ingvagabund
Copy link
Contributor Author

/retest

@ingvagabund
Copy link
Contributor Author

@BenTheElder @lavalamp I re-implemented the percentile computation and dropped forked prometheus completely.

Copy link
Member

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks!

for _, bckt := range hist.Bucket {
b := bucket{
count: float64(*bckt.CumulativeCount),
upperBound: *bckt.UpperBound,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we coalesce the buckets here?

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 4, 2020
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ingvagabund
Copy link
Contributor Author

/retest

1 similar comment
@ingvagabund
Copy link
Contributor Author

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ingvagabund
Copy link
Contributor Author

/retest

1 similar comment
@ingvagabund
Copy link
Contributor Author

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ingvagabund
Copy link
Contributor Author

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@ingvagabund
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce

@ingvagabund
Copy link
Contributor Author

/retest

@ingvagabund
Copy link
Contributor Author

/test pull-kubernetes-e2e-gce-100-performance

@k8s-ci-robot k8s-ci-robot merged commit 6858c25 into kubernetes:master Feb 6, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.18 milestone Feb 6, 2020
"//test/integration/util:go_default_library",
"//vendor/github.com/prometheus/client_model/go:go_default_library",
Copy link
Member

@mikedanese mikedanese Feb 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency is breaking the build. See #84302.

Unclear why this wasn't caught by the presubmit... cc @BenTheElder @RainbowMango @serathius

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened kubernetes/test-infra#16168 to add coverage. Apparently we miss this in the build presubmit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we revert this to unbreak master, then reintroduce when fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius so what's the right way to import github.com/prometheus/client_model/go ? I am going through #84302 and what's link in there though it may take some time to figure out what to do.

Copy link
Contributor

@serathius serathius Feb 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not use or depend on internals of prometheus in integration tests.
Please move functions that rely on prometheus internals into staging/src/k8s.io/component-base/metrics/testutil
Example PR #85289

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@serathius @liggitt @mikedanese Moving the code under staging/src/k8s.io/component-base/metrics/testutil in #87923

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Collect metrics for scheduler benchmark tests