
add additional log & UT for HPA #127394


Open · wants to merge 1 commit into base: master

Conversation

kevinteng525

What type of PR is this?

/kind feature

What this PR does / why we need it:

We are encountering an issue in production where the Horizontal Pod Autoscaler (HPA) sometimes scales up later than expected when advanced policies are defined. This issue cannot be reproduced with a testing application using the same strategy, nor can it be consistently reproduced with the same production application at a later time.

This PR adds additional logging to the HPA for enhanced troubleshooting, helping diagnose intermittent scale delay issues. The current logs lack sufficient details, and the HPA status does not retain enough information for effective diagnosis. This enhancement will provide greater visibility into metrics and events during scaling, aiding in identifying and resolving the delay problem. Additionally, unit tests have been added to help reproduce and address the issue.
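For context, here is a minimal sketch (not part of this PR; all field values are illustrative) of the kind of advanced scale-up policy involved, expressed with the autoscaling/v2 API types. Rate limits like this are what the behavior code paths being instrumented enforce:

package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	"k8s.io/utils/ptr"
)

func main() {
	// Illustrative values only: allow at most 2 additional pods per 60-second
	// period on scale-up. A cap like this can make the HPA reach the metric's
	// desired replica count later than expected.
	behavior := &autoscalingv2.HorizontalPodAutoscalerBehavior{
		ScaleUp: &autoscalingv2.HPAScalingRules{
			StabilizationWindowSeconds: ptr.To[int32](0),
			SelectPolicy:               ptr.To(autoscalingv2.MaxChangePolicySelect),
			Policies: []autoscalingv2.HPAScalingPolicy{
				{Type: autoscalingv2.PodsScalingPolicy, Value: 2, PeriodSeconds: 60},
			},
		},
	}
	fmt.Printf("scaleUp behavior: %+v\n", behavior.ScaleUp)
}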

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

This enhancement addresses the need for better diagnostics in production environments where the scaling delay cannot be reproduced in testing pools or via unit tests alone.

As normalizeDesiredReplicasWithBehaviors needs to consider past recommendations and scale-up/scale-down events, this PR logs all relevant data to aid in troubleshooting.
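As a standalone illustration (not the exact diff from this PR; the values below are placeholders), the new log lines are structured klog calls of roughly this shape, with key names matching the output shown below:

package main

import "k8s.io/klog/v2"

func main() {
	logger := klog.Background()
	// Placeholder values; the controller passes its computed replica counts,
	// the metric driving the proposal, and the scale target reference.
	// Emitted only when the process runs at verbosity 4 or higher.
	logger.V(4).Info("Proposing desired replicas",
		"desiredReplicas", 6,
		"currentReplicas", 3,
		"metric", "external metric s0-prometheus-CPU",
		"scaleTarget", "Deployment/default/nginx-deployment")
}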

Before this PR
  • At log verbosity 2 (--v=2):
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:36:27.213778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:36:57.357731       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:37:27.812401       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
After this PR
  • At log verbosity 2 (--v=2):
I0916 10:35:42.581074       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=6 currentReplicas=3 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:35:57.785436       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=8 currentReplicas=4 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:57.785600       7 horizontal.go:906] "Decided not to scale" scaleTarget="Deployment/default/nginx-deployment" desiredReplicas=4 lastScaleTime="2024-09-16 10:35:42 +0000 UTC"
I0916 10:35:57.793778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
  • At log verbosity 4 (--v=4):
I0916 10:35:42.581074       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=6 currentReplicas=3 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:42.581358       7 horizontal.go:1155] Stabilizing recommendation: key=default/keda-hpa-nginx-scaledobject, currentReplicas=3, desiredReplicas=6, upRecommendation=6, downRecommendation=6, recommendations=[1 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
I0916 10:35:42.581414       7 horizontal.go:1011] Normalized desired replicas with behaviors - after stabilized recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=3, minReplicas=3, stabilizedRecommendation=6, prenormalizedDesiredReplicas=6, reason=ScaleUpStabilized, message=recent recommendations were lower than current one, applying the lowest recent recommendation
I0916 10:35:42.581426       7 horizontal.go:1186] Converting desired replicas with behavior rate - scale up: key=default/keda-hpa-nginx-scaledobject, currentReplicas=3, desiredReplicas=6, scaleUpEvents=[{2 {13957253141225347267 1910484270720 0x74b4f40} false}], scaleDownEvents=[]
I0916 10:35:42.581442       7 horizontal.go:1021] Normalized desired replicas with behaviors - after rated recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=3, minReplicas=3, stabilizedRecommendation=6, desiredReplicas=4, reason=ScaleUpLimit, message=the desired replica count is increasing faster than the maximum scale rate
I0916 10:35:42.589751       7 horizontal.go:893] "Successfully rescaled" HPA="default/keda-hpa-nginx-scaledobject" scaleTarget="Deployment/default/nginx-deployment" currentReplicas=3 desiredReplicas=4 reason="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}) above target"
I0916 10:35:42.602209       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"
I0916 10:35:57.785436       7 horizontal.go:851] "Proposing desired replicas" desiredReplicas=8 currentReplicas=4 metric="external metric s0-prometheus-CPU(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: nginx-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})" timestamp="0001-01-01 00:00:00 +0000 UTC" scaleTarget="Deployment/default/nginx-deployment" metrics=[{"type":"External","external":{"metric":{"name":"s0-prometheus-CPU","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s1-prometheus-TEST","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"197m"}}},{"type":"External","external":{"metric":{"name":"s2-prometheus-Thread","selector":{"matchLabels":{"scaledobject.keda.sh/name":"nginx-scaledobject"}}},"current":{"value":"0"}}}]
I0916 10:35:57.785527       7 horizontal.go:1155] Stabilizing recommendation: key=default/keda-hpa-nginx-scaledobject, currentReplicas=4, desiredReplicas=8, upRecommendation=6, downRecommendation=8, recommendations=[1 4 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
I0916 10:35:57.785549       7 horizontal.go:1011] Normalized desired replicas with behaviors - after stabilized recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=4, minReplicas=3, stabilizedRecommendation=6, prenormalizedDesiredReplicas=8, reason=ScaleUpStabilized, message=recent recommendations were lower than current one, applying the lowest recent recommendation
I0916 10:35:57.785557       7 horizontal.go:1186] Converting desired replicas with behavior rate - scale up: key=default/keda-hpa-nginx-scaledobject, currentReplicas=4, desiredReplicas=6, scaleUpEvents=[{1 {13957253790305771122 2515024632876 0x74b4f40} false}], scaleDownEvents=[]
I0916 10:35:57.785582       7 horizontal.go:1021] Normalized desired replicas with behaviors - after rated recommendation: scaleTarget=Deployment/default/nginx-deployment, currentReplicas=4, minReplicas=3, stabilizedRecommendation=6, desiredReplicas=4, reason=ScaleUpLimit, message=the desired replica count is increasing faster than the maximum scale rate
I0916 10:35:57.785600       7 horizontal.go:906] "Decided not to scale" scaleTarget="Deployment/default/nginx-deployment" desiredReplicas=4 lastScaleTime="2024-09-16 10:35:42 +0000 UTC"
I0916 10:35:57.793778       7 horizontal.go:1423] "Successfully updated status" HPA="default/keda-hpa-nginx-scaledobject"

Does this PR introduce a user-facing change?

No user-facing change.

Added enhanced logging to the HPA controller for better diagnostics of scaling delays when advanced scaling policies are defined.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 16, 2024
@k8s-ci-robot
Contributor

Hi @kevinteng525. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 16, 2024
@@ -843,11 +843,13 @@ func (a *HorizontalController) reconcileAutoscaler(ctx context.Context, hpaShare
retErr = err
}

-	logger.V(4).Info("Proposing desired replicas",
+	logger.V(2).Info("Proposing desired replicas",
Member


Why do we need to increase the original log level? If we need to troubleshoot a problem, we should raise the verbosity of the controller-manager logs instead of changing the level in code.

Author


Changed the log level back to 4, thanks!
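For readers following the verbosity discussion, a minimal standalone sketch (plain klog usage, not code from this PR) of how the threshold works; V(4) lines only appear once the controller-manager runs with --v=4 or higher:

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	_ = flag.Set("v", "2") // in a cluster this comes from the kube-controller-manager --v flag
	flag.Parse()

	logger := klog.Background()
	logger.V(2).Info("visible at --v=2 and above")
	logger.V(4).Info("suppressed until verbosity is raised to 4")
}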

@pacoxu
Member

pacoxu commented Oct 10, 2024

/test pull-kubernetes-unit

1 similar comment
@kevinteng525
Author

/test pull-kubernetes-unit

@k8s-ci-robot
Contributor

@kevinteng525: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test pull-kubernetes-unit

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pacoxu
Member

pacoxu commented Oct 12, 2024

/ok-to-test

/test pull-kubernetes-unit

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 12, 2024
@kevinteng525 kevinteng525 force-pushed the master branch 2 times, most recently from b1f3879 to abce1d8 on October 13, 2024 08:13
@kevinteng525 kevinteng525 force-pushed the master branch 2 times, most recently from b060f39 to f56faef on October 20, 2024 08:47
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 20, 2024
@pacoxu
Member

pacoxu commented Nov 4, 2024

/cc @mengjiao-liu @gjtempleton
/priority important-longterm
/triage accepted
/assign @MaciekPytel

@k8s-ci-robot k8s-ci-robot removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024
@mengjiao-liu
Member

/wg structured-logging
/area logging
/area test

@k8s-ci-robot k8s-ci-robot added wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging. area/logging area/test labels Nov 4, 2024
"metric", metricName,
"timestamp", metricTimestamp,
"scaleTarget", reference)
"scaleTarget", reference,
Member


I checked and found that the modified logs below also use reference := fmt.Sprintf("%s/%s/%s", hpa.Spec.ScaleTargetRef.Kind, hpa.Namespace, hpa.Spec.ScaleTargetRef.Name) many times. You could instead apply the following approach once at the top level of the call (maybe below line 815), and the logs below would then also include this key-value pair.

If other key-value pairs are used multiple times, you can add them here as well.

if loggerV := logger.V(4); loggerV.Enabled() {
	logger = klog.LoggerWithValues(
		logger,
		"scaleTarget", reference,
		"currentReplicas", currentReplicas,
		...
	)
	ctx = klog.NewContext(ctx, logger)
}

Author


I think this is only for logging, and reference is used just 3 times within this method; the rest of the uses are in other methods. Switching to loggerV with V(4) might add complexity and reduce readability, and it couldn't be reused for other verbosity levels. I would suggest we keep it simple.

Member


It makes sense.

@kevinteng525 kevinteng525 force-pushed the master branch 2 times, most recently from c3122bb to 8da97af on November 10, 2024 08:31
Member

@mengjiao-liu mengjiao-liu left a comment


/lgtm
From WG-structured-logging side

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 22, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a96248aab189d27d8f3d9b60f6582f9f30f1fd01

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 20, 2025
@kangteng525

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2025
@kangteng525

@gjtempleton @mwielgus could you help review this? Thanks!

add additional log & UT for HPA
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kevinteng525
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@omerap12 omerap12 left a comment


Should we add a unit test for the recommendation truncation logic to verify we're correctly handling the case when there are more than 100 recommendations?
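For what it's worth, a rough sketch of what such a test could look like; truncateRecommendations and the plain int32 history are hypothetical stand-ins, since the PR's actual helper and timestamped recommendation type may differ:

package main

import "testing"

// truncateRecommendations is a hypothetical helper that keeps only the most
// recent max entries of a recommendation history.
func truncateRecommendations(recs []int32, max int) []int32 {
	if len(recs) <= max {
		return recs
	}
	return recs[len(recs)-max:]
}

func TestTruncateRecommendations(t *testing.T) {
	recs := make([]int32, 150)
	for i := range recs {
		recs[i] = int32(i)
	}
	got := truncateRecommendations(recs, 100)
	if len(got) != 100 {
		t.Fatalf("expected 100 recommendations after truncation, got %d", len(got))
	}
	if got[0] != 50 {
		t.Fatalf("expected the oldest entries to be dropped; first element = %d", got[0])
	}
}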

@omerap12 omerap12 mentioned this pull request May 9, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 6, 2025
Labels
area/logging area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/structured-logging Categorizes an issue or PR as relevant to WG Structured Logging.
Projects
Status: Needs Review