Collect some of scheduling metrics and scheduling throughput #85861
Conversation
// BenchmarkPerfScheduling benchmarks the scheduling rate when the cluster has
// various quantities of nodes and scheduled pods.
func BenchmarkPerfScheduling(b *testing.B) {
Let's leave only this test and delete the others for now, the reasons being:
- There are active changes to the existing benchmark tests, so it's likely this would get out of sync temporarily anyway.
- Eventually the tests will be constructed dynamically from a config file, so we won't need all the hard-coded tests.
- Keeping just one test here as an example makes development easier until we have the new tests working end to end (see the sketch below).
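For orientation, a minimal sketch of the kind of table-driven benchmark this refers to. The cluster setup and pod scheduling are stubbed out, and everything besides the `BenchmarkPerfScheduling` name is illustrative, not the merged code:

```go
package benchmark

import (
	"fmt"
	"testing"
)

// BenchmarkPerfScheduling runs one sub-benchmark per cluster shape. In the
// real test each case would start an integration test cluster, create the
// given nodes and pods, and measure how fast the remaining pods schedule.
func BenchmarkPerfScheduling(b *testing.B) {
	tests := []struct {
		nodes          int
		existingPods   int
		podsToSchedule int
	}{
		{nodes: 500, existingPods: 500, podsToSchedule: 1000},
	}
	for _, tt := range tests {
		name := fmt.Sprintf("%dNodes/%dPods", tt.nodes, tt.existingPods)
		b.Run(name, func(b *testing.B) {
			for i := 0; i < b.N; i++ {
				_ = tt.podsToSchedule // cluster setup and scheduling stubbed out
			}
		})
	}
}
```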
var dateFormat = "2006-01-02T15:04:05Z"
// LatencyMetric represents 50th, 90th and 99th duration quantiles.
type LatencyMetric struct {
I'd like to use a generic `summary` struct to represent all metrics, unless some day we find it's not sufficient:
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R8-R14
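For context, a rough sketch of the two shapes being compared: `LatencyMetric` pins the quantiles down as fields, while the linked proposal is a generic quantile map. The field names here are assumptions based on this thread, not the merged code:

```go
package benchmark

import "time"

// LatencyMetric holds fixed quantile fields.
type LatencyMetric struct {
	Perc50 time.Duration
	Perc90 time.Duration
	Perc99 time.Duration
}

// summary is the generic alternative: any set of quantiles can be recorded
// without changing the type.
type summary struct {
	unit      string
	quantiles map[float64]float64 // e.g. 0.5 -> value
}
```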
	},
})
// Start measuring throughput |
Can we encapsulate throughput measuring in its own helper method? Ideally, I'd like to define an interface `MetricsCollector`, and throughput is one implementation of that interface, like so:
https://github.com/kubernetes/kubernetes/pull/85662/files#diff-722c6b351de632f8af713505b7cc49b4R27-R33
Done. I created a private type and methods around it so we can change it at any time.
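A sketch of what such a collector could look like, assuming a `scheduledCount` callback supplied by the benchmark. The interface and all names are illustrative, not the exact code that was merged:

```go
package benchmark

import (
	"sync"
	"time"
)

// metricsCollector is the suggested abstraction: the benchmark starts each
// collector, stops it, then gathers its results.
type metricsCollector interface {
	run(stopCh <-chan struct{})
	collect() []float64
}

// throughputCollector samples the scheduled-pod count once per second and
// records the pods-per-second delta between consecutive samples.
type throughputCollector struct {
	scheduledCount func() int // how many pods are scheduled right now

	mu      sync.Mutex
	samples []float64
}

func (tc *throughputCollector) run(stopCh <-chan struct{}) {
	last := tc.scheduledCount()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			now := tc.scheduledCount()
			tc.mu.Lock()
			tc.samples = append(tc.samples, float64(now-last))
			tc.mu.Unlock()
			last = now
		}
	}
}

func (tc *throughputCollector) collect() []float64 {
	tc.mu.Lock()
	defer tc.mu.Unlock()
	return append([]float64(nil), tc.samples...)
}
```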
// DataItems is the data point set.
type DataItems struct {
	Version string `json:"version"`
What is version used for?
Dunno. Though, it's part of the data type the performance dashboard optionally expects.
}
// DataItem is the data point.
type DataItem struct {
I don't find this struct useful. Why don't we use the metric struct (`LatencyMetric` or the proposed generic `summary`) directly?
`DataItem` is a type known to the performance dashboard parser. The parser expects the following JSON:
{
  "version": "v1",
  "dataItems": [
    {
      "data": {
        "Average": ddd,
        "Perc50": ddd,
        "Perc90": ddd,
        "Perc99": ddd
      },
      "unit": "sss",
      "labels": {
        "key1": "value1",
        ...
      }
    },
    ...
  ]
}
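Assuming that schema, the corresponding Go types would look roughly like this (the actual field layout in the merged code may differ):

```go
package benchmark

// DataItems is the top-level document the dashboard parser reads.
type DataItems struct {
	Version   string     `json:"version"`
	DataItems []DataItem `json:"dataItems"`
}

// DataItem is one data point: named values (e.g. "Perc50"), their unit,
// and free-form labels identifying the metric.
type DataItem struct {
	Data   map[string]float64 `json:"data"`
	Unit   string             `json:"unit"`
	Labels map[string]string  `json:"labels,omitempty"`
}
```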
Ack. Then can we just get rid of `summary` and use `DataItem` directly? It's more flexible too.
Done
Can you resolve the comment so it's easy to track unresolved ones? Same for others. Thanks!
/assign
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, ingvagabund

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/retest
/retest
/retest
@BenTheElder @lavalamp I re-implemented the percentile computation and dropped the forked prometheus code completely.
/lgtm
Thanks!
for _, bckt := range hist.Bucket {
	b := bucket{
		count:      float64(*bckt.CumulativeCount),
		upperBound: *bckt.UpperBound,
should we coalesce the buckets here?
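For reference, a sketch of quantile estimation over cumulative histogram buckets in the style of Prometheus, with equal upper bounds coalesced first as suggested here. This is a simplified re-derivation under those assumptions, not the exact merged code:

```go
package benchmark

import (
	"math"
	"sort"
)

type bucket struct {
	count      float64 // cumulative count of observations <= upperBound
	upperBound float64
}

// bucketQuantile linearly interpolates the q-th quantile (0 < q < 1) from
// cumulative histogram buckets; the last bucket is assumed to be +Inf.
func bucketQuantile(q float64, buckets []bucket) float64 {
	sort.Slice(buckets, func(i, j int) bool {
		return buckets[i].upperBound < buckets[j].upperBound
	})
	// Coalesce buckets sharing an upper bound; with cumulative counts the
	// largest count seen for a bound is the authoritative one.
	merged := buckets[:0]
	for _, b := range buckets {
		if n := len(merged); n > 0 && merged[n-1].upperBound == b.upperBound {
			if b.count > merged[n-1].count {
				merged[n-1].count = b.count
			}
			continue
		}
		merged = append(merged, b)
	}
	buckets = merged

	if len(buckets) < 2 {
		return math.NaN() // need at least one finite bucket plus +Inf
	}
	total := buckets[len(buckets)-1].count
	if total == 0 || q <= 0 || q >= 1 {
		return math.NaN()
	}
	rank := q * total
	idx := sort.Search(len(buckets), func(i int) bool {
		return buckets[i].count >= rank
	})
	if idx == len(buckets)-1 {
		// Quantile falls into the +Inf bucket; clamp to the last finite bound.
		return buckets[len(buckets)-2].upperBound
	}
	lowerBound, lowerCount := 0.0, 0.0
	if idx > 0 {
		lowerBound = buckets[idx-1].upperBound
		lowerCount = buckets[idx-1].count
	}
	frac := (rank - lowerCount) / (buckets[idx].count - lowerCount)
	return lowerBound + (buckets[idx].upperBound-lowerBound)*frac
}
```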
/retest
/retest
/retest
/retest
/retest
/retest
/retest
/retest
/retest
/test pull-kubernetes-e2e-gce
/retest
/test pull-kubernetes-e2e-gce-100-performance
"//test/integration/util:go_default_library", | ||
"//vendor/github.com/prometheus/client_model/go:go_default_library", |
This dependency is breaking the build. See #84302.
Unclear why this wasn't caught by the presubmit... cc @BenTheElder @RainbowMango @serathius
I opened kubernetes/test-infra#16168 to add coverage. Apparently we missed this in the build presubmit.
Can we revert this to unbreak master, then reintroduce it when fixed?
@serathius So what's the right way to import `github.com/prometheus/client_model/go`? I am going through #84302 and what's linked in there, though it may take some time to figure out what to do.
You should not use or depend on internals of prometheus in integration tests. Please move functions that rely on prometheus internals into `staging/src/k8s.io/component-base/metrics/testutil`. Example PR: #85289.
@serathius @liggitt @mikedanese Moving the code under `staging/src/k8s.io/component-base/metrics/testutil` in #87923.
What type of PR is this?
/kind feature
What this PR does / why we need it:
In addition to getting overall performance measurements from the Go benchmark, collect metrics that provide information about the internals of scheduling itself.

Metrics in question:

Scheduling throughput is computed on the fly inside `benchmarkScheduling`.
Which issue(s) this PR fixes:
Fixes #85685
Special notes for your reviewer:
Code for computing percentiles is copy-pasted from prometheus. The `DataItem` struct is copied from the clusterloader performance tests.
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: