Fix InferenceModel deletion logic #393


Merged: 5 commits merged into kubernetes-sigs:main on Feb 26, 2025
Conversation


@ahg-g ahg-g commented Feb 23, 2025

Currently the logic tracks models by Spec.ModelName. Since this is not guaranteed to be unique within the cluster, we could run into two issues:

  1. If the model name changes on the same InferenceModel object, we don't delete the original model entry in the datastore.
  2. We don't enforce the semantics that the modelName with the oldest creation timestamp is retained. While the API assumes this is enforced by another controller via the Ready condition, we don't have that controller yet, so the current behavior is unpredictable and depends on the order of InferenceModel events.

To address the above, the PR changes both the InferenceModel reconciler and the Model APIs in the datastore to ensure thread-safe updates of the entries. In the store, the sync.Map was replaced with two maps that track InferenceModel entries by both ModelName and the InferenceModel object's NamespacedName. This is needed to properly handle deletions when the object no longer exists (this could be handled in other ways, but this seemed like a reasonable approach).

The PR increases the datastore pkg unit test coverage for the Pool and Model APIs. We still need to follow up with adding unit test coverage for the pods APIs, which is currently non-existent.

Note that I started this PR to increase test coverage for the reconcilers to address #356 and #353, so you will also see refactoring of the tests in the other controllers.

Part of #353 #391
Fixes #394
Fixes #356
Fixes #328

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 23, 2025
netlify bot commented Feb 23, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 4d90fbb
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67bf6475d32f7d000869c520
😎 Deploy Preview: https://deploy-preview-393--gateway-api-inference-extension.netlify.app


ahg-g commented Feb 23, 2025

/hold

Just so it doesn't get merged by accident.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 23, 2025

ahg-g commented Feb 24, 2025

/assign @kfswain


@kfswain kfswain left a comment


Left some smaller comments and one larger. Thanks for this!

}

// Step 2: set model1 with the same modelName, but with criticality set and newer creation timestamp, should update.
ds.ModelSetIfOlder(model1tsNewer)
Collaborator:

nit: Do we need to have this in an imperative order? Or can we refactor this into multiple separate test runs via a dict with test setup?

This might not be a big deal, but just conforming to the common pattern

Contributor Author:

Done. A table is more verbose, but it is indeed the common pattern.

return ctrl.Result{}, nil
}

func (c *InferenceModelReconciler) updateDatastore(logger logr.Logger, infModel *v1alpha2.InferenceModel) {
loggerDefault := logger.V(logutil.DEFAULT)
func (c *InferenceModelReconciler) handleModelDeleted(ctx context.Context, req types.NamespacedName) error {
Collaborator:

It looks like this func is what causes the need for the dual dictionary.

I could see a high amount of adapter churn, but that just augments the existing infModel; I'm not sure there will be a high amount of infModel churn. What do you think about paying the O(n) cost for now, with the benefit of a simpler and less error-prone code base? The double dict makes me worry.

Points in favor:

  • We only take the O(n) hit on deletes.
  • Reconciliation runs on a separate goroutine, so it won't hang the EPP.
  • This is for a delete, so a bit of extra latency doesn't impact a service (activating a duplicate is kind of a band-aid for the short term anyway; when we have a proper controller it would just flip one to an Accepted state, and that event would get picked up by this reconciler).
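For reference, the single-map alternative suggested here (and, per the merged commit list, adopted in a later commit on this PR) might look roughly like this. All names are illustrative, not the actual code, and locking is elided for brevity:

```go
package main

import "fmt"

// Illustrative stand-ins; not the real API types.
type objKey struct{ Namespace, Name string }

type modelEntry struct {
	ObjectName objKey
	ModelName  string
}

// modelStore keeps a single map keyed by ModelName. Deleting by object
// name then costs an O(n) scan, which is acceptable because it only
// happens on deletes and the reconciler runs off the request path.
// (The real store would guard this map with a mutex.)
type modelStore struct {
	byModelName map[string]*modelEntry
}

// deleteByObjectName linearly scans for the entry owned by the given
// object and removes it; reports whether an entry was found.
func (s *modelStore) deleteByObjectName(name objKey) bool {
	for modelName, e := range s.byModelName {
		if e.ObjectName == name {
			delete(s.byModelName, modelName)
			return true
		}
	}
	return false
}

func main() {
	s := &modelStore{byModelName: map[string]*modelEntry{
		"llama": {ObjectName: objKey{"default", "m1"}, ModelName: "llama"},
	}}
	fmt.Println(s.deleteByObjectName(objKey{"default", "m1"})) // prints true
}
```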

@ahg-g ahg-g Feb 26, 2025

Sounds good, I debated this as well. I wanted to address modelName updates, where we would need to do a lookup in the cache by object name every time to check whether a modelName change actually happened, but I believe we should make the modelName immutable, and then we don't need to handle this case here.

I didn't use sync.Map because in ModelSetIfOlder we need to do an atomic lookup+store. We also need to list and replace under the lock (similar to the PodResync case) to avoid race conditions between updating the model from the list and having the selected replacement object deleted.


notFound = true
}

if notFound || !infModel.DeletionTimestamp.IsZero() || infModel.Spec.PoolRef.Name != c.PoolNamespacedName.Name {
@ahg-g ahg-g Feb 26, 2025

modelName is currently ~~immutable~~ mutable, but we plan to make it ~~mutable~~ immutable, and so I didn't address the case where the infModel changes modelName here.

Member:

Is there a use case for mutating modelName?

Contributor Author:

We should make it immutable to guarantee that it doesn't change; otherwise the EPP would behave in an unpredictable way.

Collaborator:

> modelName is currently immutable, but we plan to make it mutable

I think the intended statement was:

> modelName is currently mutable, but we plan to make it immutable

Contributor Author:

Oops, yes :)

I opened #408 to track the work to make it immutable.



loggerDefault.Info("Removing/Not adding InferenceModel", "modelName", infModel.Spec.ModelName)
// If we get here. The model is not relevant to this pool, remove.
c.Datastore.ModelDelete(infModel.Spec.ModelName)
if oldest != nil && c.Datastore.ModelSetIfOlder(oldest) {
Member:

We should short-circuit earlier in Reconcile: stop processing if there is already a cached model and the new one is different.

Contributor Author:

Note that the code you are commenting on handles the fallback to another, older version of the InferenceModel when this model is deleted, so it is not related to model updates.

But to answer your question: we need to check whether the cached one is older, because it may not be, depending on the order of events processed by the reconciler; this is handled at line 76.


kfswain commented Feb 26, 2025

/lgtm
/hold

This LGTM! I just have a minor nit/request: can we mark a TODO to remove all the logic that backfills a deleted model (finds one with the same name and includes it in the datastore)? This will be outsourced to a proper controller in the future, and it would be easier to find and remove that logic when the time comes.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025

ahg-g commented Feb 26, 2025

> /lgtm
> /hold
>
> This LGTM! I just have a minor nit/request: can we mark a TODO to remove all the logic that backfills a deleted model (finds one with the same name and includes it in the datastore)? This will be outsourced to a proper controller in the future, and it would be easier to find and remove that logic when the time comes.

Added: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/393/files#diff-3b7f0128a3c581577a1f5233c2143855a512630a8f83cefbd0e355d150e4fb7bR95


kfswain commented Feb 26, 2025

/lgtm
/approve
/unhold

Awesome, thanks for this!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 26, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025

kfswain commented Feb 26, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7ed54a4 into kubernetes-sigs:main Feb 26, 2025
8 checks passed
kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request Feb 27, 2025

* Convert unit test to a table

* remove the dual map for the models store, and rely on linear search when looking up the model by object name

* Added ModelResync to handle a race condition

* Update pkg/epp/controller/inferencemodel_reconciler.go
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Apr 29, 2025