fix metric scrape port not updated when inference pool target port updated #417

Merged
merged 4 commits into from
Feb 28, 2025

Conversation

Kuromesi
Copy link
Contributor

Currently the scrape port is stored in PodMetrics, so it must be updated whenever the targetPortNumber of the InferencePool changes.

Also, the datastore currently stores all PodMetrics and returns a reference to the stored object on read, which in my opinion is not safe. This may cause inconsistency if a caller reads an entry and modifies it directly in a goroutine. I think we should always return a copy and let the datastore control all writes.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 27, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @Kuromesi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 27, 2025
Copy link

netlify bot commented Feb 27, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit e36719e
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67c15c11794eed0008bb6fa2
😎 Deploy Preview https://deploy-preview-417--gateway-api-inference-extension.netlify.app
@@ -85,6 +85,13 @@ func (c *InferencePoolReconciler) updateDatastore(ctx context.Context, newPool *
// the ones that may have existed already to the store.
c.Datastore.PodResyncAll(ctx, c.Client)
}
if oldPool != nil && newPool.Spec.TargetPortNumber != oldPool.Spec.TargetPortNumber {
Copy link
Collaborator

I actually think we should follow what we do here: https://github.com/Kuromesi/gateway-api-inference-extension/blob/186ac42b10c1c884cb024dc8de8f6c14d76b20be/pkg/epp/handlers/request.go#L111

This allows there to be a single source of truth (the pool port number) and accessing it simplifies keeping everything in sync.

If we make port number a param in: https://github.com/Kuromesi/gateway-api-inference-extension/blob/186ac42b10c1c884cb024dc8de8f6c14d76b20be/pkg/epp/datastore/types.go#L79-L80

We would then be able to remove the ScrapePort field entirely and not need to worry about this func, which I think simplifies the codebase quite a bit. WDYT?
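A minimal sketch of the suggestion above, under stated assumptions: `Pool`, `Pod`, and `scrapeHostPort` are hypothetical stand-ins, not the project's actual types. The pod entry carries only its address, and the port is passed in from the pool at call time, so a change to the pool's port takes effect on the next call without any per-pod update. `net.JoinHostPort` is used here for illustration; the PR itself concatenates strings.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// Pool is a hypothetical stand-in for the InferencePool spec.
type Pool struct {
	TargetPortNumber int32
}

// Pod is a hypothetical stand-in for a datastore entry.
// It deliberately has no port field.
type Pod struct {
	Address string
}

// scrapeHostPort takes the port as a parameter instead of reading
// a copy cached on the pod.
func scrapeHostPort(pod Pod, port int32) string {
	return net.JoinHostPort(pod.Address, strconv.Itoa(int(port)))
}

func main() {
	pool := Pool{TargetPortNumber: 8000}
	pod := Pod{Address: "10.0.0.1"}
	fmt.Println(scrapeHostPort(pod, pool.TargetPortNumber)) // 10.0.0.1:8000

	// Updating the pool is enough; the pod entry is untouched.
	pool.TargetPortNumber = 9000
	fmt.Println(scrapeHostPort(pod, pool.TargetPortNumber)) // 10.0.0.1:9000
}
```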

@ahg-g
Copy link
Contributor

ahg-g commented Feb 27, 2025

> Also, the datastore currently stores all PodMetrics and returns a reference to the stored object on read, which in my opinion is not safe. This may cause inconsistency if a caller reads an entry and modifies it directly in a goroutine. I think we should always return a copy and let the datastore control all writes.

Not copying is by design: we list the pods frequently during probing, and cloning on every read is going to be expensive at scale. The only in-place update is to the metrics status, which should be written only by the metrics probing routine.

I would keep this as is and add a giant comment on which parts of PodMetrics are updated when.
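To make the trade-off concrete, here is a small sketch with hypothetical, simplified types (`Metrics`, `PodMetrics`, `store`, and `Clone` are illustrative and are not the project's definitions). The store hands out the shared pointer, field comments record which routine may write each part, and a caller that needs a stable snapshot can clone explicitly.

```go
package main

import "fmt"

// Metrics is a hypothetical stand-in for the scraped metrics status.
type Metrics struct {
	WaitingQueueSize int
}

// PodMetrics is a hypothetical, simplified entry.
type PodMetrics struct {
	// Address is written only by the pod reconciler (assumption of this sketch).
	Address string
	// Metrics is written only by the metrics probing routine.
	Metrics Metrics
}

// Clone returns an independent copy for callers that need a snapshot.
func (pm *PodMetrics) Clone() *PodMetrics {
	c := *pm
	return &c
}

type store struct {
	pods map[string]*PodMetrics
}

// Get returns the shared pointer; no copy is made on the read path.
func (s *store) Get(name string) *PodMetrics {
	return s.pods[name]
}

func main() {
	s := &store{pods: map[string]*PodMetrics{"pod-a": {Address: "10.0.0.1"}}}

	shared := s.Get("pod-a")
	snapshot := shared.Clone()

	// The prober updates the shared entry in place.
	shared.Metrics.WaitingQueueSize = 5

	fmt.Println(s.Get("pod-a").Metrics.WaitingQueueSize) // 5
	fmt.Println(snapshot.Metrics.WaitingQueueSize)       // 0
}
```

This sketch is single-goroutine; real concurrent access would still need the synchronization the datastore already relies on.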

// Update pod properties if anything changed.
existing.(*PodMetrics).Pod = new.Pod
return false
func (ds *datastore) PodUpdate(f PodUpdateFunc, predicates ...PodListPredicate) {
Copy link
Contributor

where do we use this?

@Kuromesi
Copy link
Contributor Author

Thanks guys! I agree, I'll revert my changes and repush a new commit.

…dated

Signed-off-by: Kuromesi <blackfacepan@163.com>
Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

/lgtm

But better add a unit test

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2025
@@ -121,7 +121,8 @@ func (p *Provider) refreshMetricsOnce(logger logr.Logger) error {
wg.Add(1)
go func() {
defer wg.Done()
updated, err := p.pmc.FetchMetrics(ctx, existing)
pool, _ := p.datastore.PoolGet()
Copy link
Contributor

we need to check for the error here and skip updating if the pool wasn't set yet.
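One way the requested check could look, as a sketch only: the `datastore` type, `errPoolNotSet`, and `refreshOnce` below are hypothetical, and only the two-value shape of `PoolGet()` is taken from the diff. When the pool has not been set, the refresh is skipped rather than dereferencing a nil pool.

```go
package main

import (
	"errors"
	"fmt"
)

// Pool is a hypothetical stand-in for the InferencePool spec.
type Pool struct {
	TargetPortNumber int32
}

// datastore is a hypothetical stand-in that may not have a pool yet.
type datastore struct {
	pool *Pool
}

var errPoolNotSet = errors.New("inference pool is not set")

// PoolGet mirrors the (pool, error) shape used in the diff.
func (d *datastore) PoolGet() (*Pool, error) {
	if d.pool == nil {
		return nil, errPoolNotSet
	}
	return d.pool, nil
}

// refreshOnce calls fetch with the pool's port, or skips the refresh
// when the pool is not available. It reports whether fetch ran.
func refreshOnce(ds *datastore, fetch func(port int32)) bool {
	pool, err := ds.PoolGet()
	if err != nil {
		// Nothing to scrape against yet; try again on the next tick.
		return false
	}
	fetch(pool.TargetPortNumber)
	return true
}

func main() {
	ds := &datastore{}
	fetch := func(port int32) { fmt.Println("scraping port", port) }

	fmt.Println(refreshOnce(ds, fetch)) // false, fetch is not called

	ds.pool = &Pool{TargetPortNumber: 8000}
	fmt.Println(refreshOnce(ds, fetch)) // prints "scraping port 8000", then true
}
```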

) (*datastore.PodMetrics, error) {
logger := log.FromContext(ctx)
loggerDefault := logger.V(logutil.DEFAULT)

// Currently the metrics endpoint is hard-coded, which works with vLLM.
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
- url := existing.BuildScrapeEndpoint()
+ url := existing.Address + ":" + strconv.Itoa(int(port))
Copy link
Contributor

we need to set the scrape path, /metrics

Copy link
Contributor Author

Sorry my bad.

Signed-off-by: Kuromesi <blackfacepan@163.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2025
@Kuromesi
Copy link
Contributor Author

> /lgtm
>
> But better add a unit test

Thanks, I'll see what I can do for the unit test!

@ahg-g
Copy link
Contributor

ahg-g commented Feb 28, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 28, 2025
Signed-off-by: Kuromesi <blackfacepan@163.com>
@Kuromesi
Copy link
Contributor Author

@ahg-g Sorry, I have one more small question: do you think it would be more reasonable to store the scrapePort and scrapePath in the metrics client?

```go
type PodMetricsClientImpl struct {
	scrapePort int32
	scrapePath string
}

func (p *PodMetricsClientImpl) UpdateScrapeOptions(port int32, path string) {
	p.scrapePort = port
	p.scrapePath = path
}
```

So that we don't have to pass the port every time and keep the FetchMetrics tidy? 🤔
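A fuller sketch of the idea in the question above, for illustration only. The mutex and the `scrapeOptions` getter are assumptions added here, not part of the proposal: since metrics are fetched from several goroutines, any options stored on the client would need synchronized reads and writes.

```go
package main

import (
	"fmt"
	"sync"
)

// PodMetricsClientImpl here is a simplified illustration; the mutex
// and the getter below are assumptions of this sketch.
type PodMetricsClientImpl struct {
	mu         sync.RWMutex
	scrapePort int32
	scrapePath string
}

// UpdateScrapeOptions would be called when the pool's port changes.
func (p *PodMetricsClientImpl) UpdateScrapeOptions(port int32, path string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.scrapePort = port
	p.scrapePath = path
}

// scrapeOptions is a hypothetical getter for concurrent fetchers.
func (p *PodMetricsClientImpl) scrapeOptions() (int32, string) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.scrapePort, p.scrapePath
}

func main() {
	c := &PodMetricsClientImpl{}
	c.UpdateScrapeOptions(8000, "/metrics")
	port, path := c.scrapeOptions()
	fmt.Println(port, path) // 8000 /metrics
}
```

The cost of this design is a second copy of the port that has to be kept in sync with the pool, which is the situation the earlier review comment set out to avoid.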

) (*datastore.PodMetrics, error) {
logger := log.FromContext(ctx)
loggerDefault := logger.V(logutil.DEFAULT)

// Currently the metrics endpoint is hard-coded, which works with vLLM.
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
- url := existing.BuildScrapeEndpoint()
+ url := existing.Address + ":" + strconv.Itoa(int(port)) + "/metrics"
Copy link
Contributor

@ahg-g ahg-g Feb 28, 2025

I tested the PR and scraping failed. You need to prepend "http://" to the URL; otherwise the scrape fails.
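The reviewer's point can be reproduced with the standard library alone. In the sketch below, `buildScrapeURL` and `hasSchemeAndHost` are hypothetical helpers: the first assembles the endpoint with the scheme and the `/metrics` path, and the second uses `net/url` to check that a string parses as an absolute URL. A bare `address:port/path` string fails to parse.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// buildScrapeURL is a hypothetical helper that includes the scheme,
// the port passed in by the caller, and the /metrics path.
func buildScrapeURL(address string, port int32) string {
	return "http://" + address + ":" + strconv.Itoa(int(port)) + "/metrics"
}

// hasSchemeAndHost reports whether raw parses as a URL with both a
// scheme and a host, which an HTTP client needs to issue a request.
func hasSchemeAndHost(raw string) bool {
	u, err := url.Parse(raw)
	return err == nil && u.Scheme != "" && u.Host != ""
}

func main() {
	// Without the scheme, parsing fails.
	fmt.Println(hasSchemeAndHost("10.0.0.1:8000/metrics")) // false

	// With the scheme, the URL is well-formed.
	fmt.Println(hasSchemeAndHost(buildScrapeURL("10.0.0.1", 8000))) // true
}
```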

Copy link
Contributor Author

done

@ahg-g
Copy link
Contributor

ahg-g commented Feb 28, 2025

/approve
/lgtm
/hold

If you can address one last comment, that would be great. Otherwise, feel free to unhold and we can address it in a follow-up.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 28, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Kuromesi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 28, 2025
func (p *PodMetricsClientImpl) UpdateScrapeOptions(port int32, path string) {
p.scrapePort = port
p.scrapePath = path
}
Copy link
Contributor

what are those for?

Copy link
Contributor Author

Sorry, I accidentally pushed some local changes...

Signed-off-by: Kuromesi <blackfacepan@163.com>
@ahg-g
Copy link
Contributor

ahg-g commented Feb 28, 2025

/hold cancel
/lgtm

Thanks

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 28, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0d08a07 into kubernetes-sigs:main Feb 28, 2025
8 checks passed
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Apr 29, 2025
…dated (kubernetes-sigs#417)

* fix metric scrape port not updated when inference pool target port updated

Signed-off-by: Kuromesi <blackfacepan@163.com>

* bug fix

Signed-off-by: Kuromesi <blackfacepan@163.com>

* fix ut

Signed-off-by: Kuromesi <blackfacepan@163.com>

* add log

Signed-off-by: Kuromesi <blackfacepan@163.com>

---------

Signed-off-by: Kuromesi <blackfacepan@163.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
5 participants