Fix InferenceModel deletion logic #393
Conversation
/hold Just so it doesn't get merged by accident.
/assign @kfswain
Currently the logic tracks the models by Spec.ModelName. Since this is not guaranteed to be unique within the cluster, we could run into two issues: 1) If the model name changes on the same InferenceModel object, we don't delete the original model entry in the datastore. 2) We don't enforce the semantics that the modelName with the oldest creation timestamp is retained. While the API assumes that this is enforced by another controller via the Ready condition, we don't have this controller yet, and so currently the behavior is unpredictable depending on the order of InferenceModel events.

To address the above, the PR makes changes to both the InferenceModel reconciler and the Model APIs in the datastore to ensure thread-safe updates of the entries. In the store, the sync.Map was replaced with two maps to track the InferenceModel entries by both ModelName and InferenceModel object NamespacedName. This is needed to properly handle deletions when the object doesn't exist anymore (it could be handled in other ways, but this seemed like a reasonable approach).

The PR increases the datastore pkg unit test coverage for the Pool and Model APIs. We still need to follow up with adding unit test coverage for the pods APIs, which is currently non-existent.
Left some smaller comments and one larger. Thanks for this!
pkg/epp/datastore/datastore_test.go
Outdated
// Step 2: set model1 with the same modelName, but with criticality set and newer creation timestamp, should update.
ds.ModelSetIfOlder(model1tsNewer)
nit: Do we need to have this in an imperative order? Or can we refactor this into multiple separate test runs via a dict with test setup?
This might not be a big deal, but just conforming to the common pattern
Done, a table is better, more verbose, but it is indeed the common pattern.
	return ctrl.Result{}, nil
}

func (c *InferenceModelReconciler) updateDatastore(logger logr.Logger, infModel *v1alpha2.InferenceModel) {
	loggerDefault := logger.V(logutil.DEFAULT)
func (c *InferenceModelReconciler) handleModelDeleted(ctx context.Context, req types.NamespacedName) error {
It looks like this func here is what causes the need for the dual dictionary.
I could see a high amount of adapter churn, but that will just augment the existing infModel; I'm not sure there will be a high amount of infModel churn. What do you think about paying the O(n) cost for now, with the benefit being a simpler and less error-prone code base? The double dict makes me worry.
Points in favor:
- Only take the O(n) hit on deletes
- Reconciliation is on a separate goroutine so it won't hang the EPP
- This is for a delete, so a bit longer latency is not impacting a service (activating a duplicate is kind of a bandaid for the short term anyway; when we have a proper controller it would just flip one to an Accepted state, and that event would get picked up by this reconciler.)
Sounds good, I debated this as well. I wanted to address modelName updates, where we would need to do a lookup in the cache by object name every time to check whether a modelName change actually happened, but I believe we should make the modelName immutable, and so we don't need to handle this case here.
I didn't use sync.Map because in ModelSetIfOlder we need to do an atomic lookup+store. We also need to list and replace under the lock (similar to the PodResync case) to avoid race conditions between updating the model from the list and having the selected replacement object deleted.
remove the dual map for the models store, and rely on linear search when looking up the model by object name
	notFound = true
}

if notFound || !infModel.DeletionTimestamp.IsZero() || infModel.Spec.PoolRef.Name != c.PoolNamespacedName.Name {
modelName is currently ~~immutable~~ mutable, but we plan to make it ~~mutable~~ immutable, and so I didn't address the case where the infModel changes modelName here.
Is there a use case for mutating modelName?
We should make it immutable to guarantee that it doesn't change, otherwise the epp would behave in an unpredictable way.
"modelName is currently immutable, but we plan to make it mutable"
I think the intended statement was:
"modelName is currently mutable, but we plan to make it immutable"
Oops, yes :)
I opened #408 to track the work to make it immutable.
	loggerDefault.Info("Removing/Not adding InferenceModel", "modelName", infModel.Spec.ModelName)
	// If we get here. The model is not relevant to this pool, remove.
	c.Datastore.ModelDelete(infModel.Spec.ModelName)
if oldest != nil && c.Datastore.ModelSetIfOlder(oldest) {
We should short circuit earlier in the Reconcile: stop processing if there is already a cached model and the new one is different.
Note that the code you are commenting on is related to handling the fallback to another older version of the InferenceModel when this model is deleted, so it is not related to model update.
But to answer your question: we need to check whether the cached one is older or not. It may not be, because of the order of events being processed by the reconciler, and this is handled at line 76.
/lgtm This LGTM! I just have a minor nit/request: can we mark a TODO to remove all the logic that backfills a deleted model (finds one of the same name and includes it in the datastore)? This will be outsourced to a proper controller in the future. It would just be easier to find and remove that logic when the time comes.

/lgtm Awesome, thanks for this!
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, kfswain

/lgtm
Commits:
* Currently the logic tracks the models by Spec.ModelName, which is not guaranteed to be unique within the cluster (see the PR description above)
* Convert unit test to a table
* remove the dual map for the models store, and rely on linear search when looking up the model by object name
* Added ModelResync to handle a race condition
* Update pkg/epp/controller/inferencemodel_reconciler.go
Note that I started this PR to increase test coverage for the reconcilers to address #356 and #353, and so you will see refactoring to the tests in the other controllers as well.
Part of #353 #391
Fixes #394
Fixes #356
Fixes #328