fix metric scrape port not updated when inference pool target port updated #417
Conversation
Hi @Kuromesi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
@@ -85,6 +85,13 @@ func (c *InferencePoolReconciler) updateDatastore(ctx context.Context, newPool *
 	// the ones that may have existed already to the store.
 	c.Datastore.PodResyncAll(ctx, c.Client)
 }
+	if oldPool != nil && newPool.Spec.TargetPortNumber != oldPool.Spec.TargetPortNumber {
I actually think we should follow what we do here: https://github.com/Kuromesi/gateway-api-inference-extension/blob/186ac42b10c1c884cb024dc8de8f6c14d76b20be/pkg/epp/handlers/request.go#L111
This keeps a single source of truth (the pool port number), and reading it directly keeps everything in sync.
If we make the port number a param in: https://github.com/Kuromesi/gateway-api-inference-extension/blob/186ac42b10c1c884cb024dc8de8f6c14d76b20be/pkg/epp/datastore/types.go#L79-L80
we can actually remove the ScrapPort field entirely and stop worrying about this func, which I think simplifies the codebase quite a bit. WDYT?
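A sketch of what that suggestion could look like, assuming the metrics-client interface in pkg/epp/backend (the added port parameter is the proposed change, not the current signature):

```go
package backend

import (
	"context"

	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore"
)

// Sketch: the caller reads the port from the pool spec (the single source of
// truth) and passes it in, so PodMetrics needs no scrape-port field at all.
type PodMetricsClient interface {
	FetchMetrics(ctx context.Context, existing *datastore.PodMetrics, port int32) (*datastore.PodMetrics, error)
}
```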
> I actually think we should follow what we do here: https://github.com/Kuromesi/gateway-api-inference-extension/blob/186ac42b10c1c884cb024dc8de8f6c14d76b20be/pkg/epp/handlers/request.go#L111
+1
Not copying is by design: we list the pods frequently during probing, and cloning that frequently would be expensive at scale. The update there is strictly on the metrics status part, which should only be updated by the metrics probing routine. I would keep this as is and add a giant comment on which parts of the PodMetrics are updated when.
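A comment along these lines on the type would capture that ownership contract (field names are taken from the snippets in this thread; a sketch, not the repo's exact definition):

```go
// PodMetrics is shared by reference between the datastore and its readers,
// so write ownership must be strict:
//   - Pod is written only by the reconcilers, when pool or pod state changes.
//   - Metrics is written only by the metrics probing routine.
// No other goroutine may mutate a PodMetrics obtained from the datastore.
type PodMetrics struct {
	Pod     Pod
	Metrics Metrics
}
```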
pkg/epp/datastore/datastore.go
Outdated
// Update pod properties if anything changed.
existing.(*PodMetrics).Pod = new.Pod
return false
func (ds *datastore) PodUpdate(f PodUpdateFunc, predicates ...PodListPredicate) {
where do we use this?
Thanks guys! I agree; I'll revert my changes and push a new commit.
fix metric scrape port not updated when inference pool target port updated
Signed-off-by: Kuromesi <blackfacepan@163.com>
/lgtm
But it would be better to add a unit test.
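For instance, a test along these lines could pin the behavior down (a hypothetical sketch: the constructor and the newTestPool helper are assumptions, not the repository's actual test API):

```go
// Hypothetical test sketch: after the pool's targetPortNumber changes, reads
// through the datastore should observe the new port.
func TestScrapePortFollowsTargetPort(t *testing.T) {
	ds := datastore.NewDatastore()
	ds.PoolSet(newTestPool(8000))
	ds.PoolSet(newTestPool(9000)) // simulate a targetPortNumber update

	pool, err := ds.PoolGet()
	if err != nil {
		t.Fatalf("PoolGet: %v", err)
	}
	if got := pool.Spec.TargetPortNumber; got != 9000 {
		t.Errorf("TargetPortNumber = %d, want 9000", got)
	}
}
```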
pkg/epp/backend/provider.go
Outdated
@@ -121,7 +121,8 @@ func (p *Provider) refreshMetricsOnce(logger logr.Logger) error {
 	wg.Add(1)
 	go func() {
 		defer wg.Done()
-		updated, err := p.pmc.FetchMetrics(ctx, existing)
+		pool, _ := p.datastore.PoolGet()
we need to check for the error here and skip updating if the pool wasn't set yet.
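That is, the goroutine body would become something like this (a sketch built from the diff above; the surrounding handling of updated and err stays as before):

```go
go func() {
	defer wg.Done()
	// Skip this refresh cycle entirely if the pool hasn't been set yet.
	pool, err := p.datastore.PoolGet()
	if err != nil {
		logger.V(logutil.DEFAULT).Info("Pool not set yet, skipping metrics refresh", "error", err)
		return
	}
	updated, err := p.pmc.FetchMetrics(ctx, existing, pool.Spec.TargetPortNumber)
	_, _ = updated, err // placeholder: the existing handling continues here
}()
```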
pkg/epp/backend/vllm/metrics.go
Outdated
) (*datastore.PodMetrics, error) {
	logger := log.FromContext(ctx)
	loggerDefault := logger.V(logutil.DEFAULT)

	// Currently the metrics endpoint is hard-coded, which works with vLLM.
	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
-	url := existing.BuildScrapeEndpoint()
+	url := existing.Address + ":" + strconv.Itoa(int(port))
we need to set the scrape path, /metrics
Sorry, my bad.
Thanks, I'll see what I can do for the unit test!
/ok-to-test
@ahg-g Sorry, I have another small question: do you think it would be more reasonable to store the scrapePort and scrapePath in the metrics client?

type PodMetricsClientImpl struct {
	scrapePort int32
	scrapePath string
}

func (p *PodMetricsClientImpl) UpdateScrapeOptions(port int32, path string) {
	p.scrapePort = port
	p.scrapePath = path
}

That way we don't have to pass the port every time and keep the …
pkg/epp/backend/vllm/metrics.go
Outdated
) (*datastore.PodMetrics, error) {
	logger := log.FromContext(ctx)
	loggerDefault := logger.V(logutil.DEFAULT)

	// Currently the metrics endpoint is hard-coded, which works with vLLM.
	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
-	url := existing.BuildScrapeEndpoint()
+	url := existing.Address + ":" + strconv.Itoa(int(port)) + "/metrics"
I tested the PR and scraping failed. You need to add "http://" + in front, otherwise the scraping fails.
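Combining this with the earlier comment about the scrape path, the resulting line is:

```go
// Scheme, host:port, and path are all required for scraping to succeed.
url := "http://" + existing.Address + ":" + strconv.Itoa(int(port)) + "/metrics"
```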
done
/approve
If you can address one last comment, that would be great. Otherwise feel free to unhold and we can address it in a follow-up.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Kuromesi

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
pkg/epp/backend/vllm/metrics.go
Outdated
func (p *PodMetricsClientImpl) UpdateScrapeOptions(port int32, path string) {
	p.scrapePort = port
	p.scrapePath = path
}
what are those for?
Sorry, I accidentally pushed some local changes...
/hold cancel
Thanks!
fix metric scrape port not updated when inference pool target port updated (kubernetes-sigs#417)

* fix metric scrape port not updated when inference pool target port updated
* bug fix
* fix ut
* add log

Signed-off-by: Kuromesi <blackfacepan@163.com>
Currently the scrape port is stored in PodMetrics, and it should be updated when the targetPortNumber of the InferencePool changes.

Also, the datastore currently stores all PodMetrics and returns a reference whenever we read one, which in my opinion is not safe: reading and then directly modifying it from a goroutine can cause inconsistency. I think we should always return a copy and let the datastore control all write actions.