Fix InferenceModel deletion logic #393


Merged: 5 commits merged into kubernetes-sigs:main on Feb 26, 2025
Conversation


@ahg-g ahg-g commented Feb 23, 2025

Currently the logic tracks models by Spec.ModelName. Since this is not guaranteed to be unique within the cluster, we could run into two issues:

  1. If the model name changes on the same InferenceModel object, we don't delete the original model entry in the datastore.
  2. We don't enforce the semantics that the modelName with the oldest creation timestamp is retained. While the API assumes this is enforced by another controller via the Ready condition, we don't have that controller yet, so the current behavior is unpredictable and depends on the order of InferenceModel events.

To address the above, the PR changes both the InferenceModel reconciler and the Model APIs in the datastore to ensure thread-safe updates of the entries. In the store, the sync.Map was replaced with two maps that track InferenceModel entries by both ModelName and the InferenceModel object's NamespacedName. This is needed to properly handle deletions when the object no longer exists (this could be handled in other ways, but this seemed like a reasonable approach).

The PR increases the datastore pkg unit test coverage for the Pool and Model APIs. We still need to follow up with adding unit test coverage for the pods APIs, which is currently non-existent.

Note that I started this PR to increase test coverage for the reconcilers to address #356 and #353, so you will also see refactoring of the tests in the other controllers.

Part of #353 #391
Fixes #394
Fixes #356
Fixes #328

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 23, 2025
netlify bot commented Feb 23, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 4d90fbb
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67bf6475d32f7d000869c520
😎 Deploy Preview: https://deploy-preview-393--gateway-api-inference-extension.netlify.app


ahg-g commented Feb 23, 2025

/hold

Just so it doesn't get merged by accident.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 23, 2025

ahg-g commented Feb 24, 2025

/assign @kfswain


@kfswain kfswain left a comment


Left some smaller comments and one larger. Thanks for this!

}

// Step 2: set model1 with the same modelName, but with criticality set and newer creation timestamp, should update.
ds.ModelSetIfOlder(model1tsNewer)
Collaborator:

nit: Do we need to have this in an imperative order? Or can we refactor this into multiple separate test runs via a dict with test setup?

This might not be a big deal, but just conforming to the common pattern

Contributor Author:

Done. A table is more verbose, but it is indeed the common pattern.

return ctrl.Result{}, nil
}

func (c *InferenceModelReconciler) updateDatastore(logger logr.Logger, infModel *v1alpha2.InferenceModel) {
loggerDefault := logger.V(logutil.DEFAULT)
func (c *InferenceModelReconciler) handleModelDeleted(ctx context.Context, req types.NamespacedName) error {
Collaborator:

It looks like this func is what causes the need for the dual dictionary.

I could see a high amount of adapter churn, but that just augments the existing infModel; I'm not sure there will be a high amount of infModel churn. What do you think about paying the O(n) cost for now, with the benefit of a simpler and less error-prone code base? The double dict makes me worry.

Points in favor:

  • We only take the O(n) hit on deletes.
  • Reconciliation runs on a separate goroutine, so it won't hang the EPP.
  • This is for a delete, so a bit of extra latency doesn't impact a service (activating a duplicate is kind of a band-aid for the short term anyway; when we have a proper controller it would just flip one to an Accepted state, and that event would get picked up by this reconciler).
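For reference, the single-map alternative suggested here (and, per the merged commit list, adopted in a later commit on this PR) might look roughly like this. All names are illustrative, not the actual code, and locking is elided for brevity:

```go
package main

import "fmt"

// Illustrative stand-ins; not the real API types.
type objKey struct{ Namespace, Name string }

type modelEntry struct {
	ObjectName objKey
	ModelName  string
}

// modelStore keeps a single map keyed by ModelName. Deleting by object
// name then costs an O(n) scan, which is acceptable because it only
// happens on deletes and the reconciler runs off the request path.
// (The real store would guard this map with a mutex.)
type modelStore struct {
	byModelName map[string]*modelEntry
}

// deleteByObjectName linearly scans for the entry owned by the given
// object and removes it; reports whether an entry was found.
func (s *modelStore) deleteByObjectName(name objKey) bool {
	for modelName, e := range s.byModelName {
		if e.ObjectName == name {
			delete(s.byModelName, modelName)
			return true
		}
	}
	return false
}

func main() {
	s := &modelStore{byModelName: map[string]*modelEntry{
		"llama": {ObjectName: objKey{"default", "m1"}, ModelName: "llama"},
	}}
	fmt.Println(s.deleteByObjectName(objKey{"default", "m1"})) // prints true
}
```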

@ahg-g ahg-g Feb 26, 2025

Sounds good, I debated this as well. I wanted to address modelName updates, where we would need to do a lookup in the cache by object name every time to check whether a modelName change actually happened, but I believe we should make the modelName immutable, and then we don't need to handle this case here.

I didn't use sync.Map because in ModelSetIfOlder we need to do an atomic lookup+store. We also need to list and replace under the lock (similar to the PodResync case) to avoid race conditions between updating the model from the list and having the selected replacement object deleted.


notFound = true
}

if notFound || !infModel.DeletionTimestamp.IsZero() || infModel.Spec.PoolRef.Name != c.PoolNamespacedName.Name {
@ahg-g ahg-g Feb 26, 2025

modelName is currently ~~immutable~~ mutable, but we plan to make it ~~mutable~~ immutable, and so I didn't address the case where the infModel changes modelName here.

Member:

Is there a use case for mutating modelName?

Contributor Author:

We should make it immutable to guarantee that it doesn't change; otherwise the EPP would behave in an unpredictable way.

Collaborator:

> modelName is currently immutable, but we plan to make it mutable

I think the intended statement was:

> modelName is currently mutable, but we plan to make it immutable

Contributor Author:

Oops, yes :)

I opened #408 to track the work to make it immutable.



loggerDefault.Info("Removing/Not adding InferenceModel", "modelName", infModel.Spec.ModelName)
// If we get here. The model is not relevant to this pool, remove.
c.Datastore.ModelDelete(infModel.Spec.ModelName)
if oldest != nil && c.Datastore.ModelSetIfOlder(oldest) {
Member:

We should short-circuit earlier in Reconcile: stop processing if there is already a cached model and the new one is different.

Contributor Author:

Note that the code you are commenting on handles the fallback to another, older version of the InferenceModel when this model is deleted, so it is not related to model updates.

But to answer your question: we need to check whether the cached one is older, because it may not be, depending on the order of events processed by the reconciler; this is handled at line 76.


kfswain commented Feb 26, 2025

/lgtm
/hold

This LGTM! I just have a minor nit/request: can we mark a TODO to remove all the logic that backfills a deleted model (finds one with the same name and includes it in the datastore)? This will be outsourced to a proper controller in the future, and it would be easier to find and remove that logic when the time comes.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025

ahg-g commented Feb 26, 2025

> /lgtm
> /hold
>
> This LGTM! I just have a minor nit/request: can we mark a TODO to remove all the logic that backfills a deleted model (finds one with the same name and includes it in the datastore)? This will be outsourced to a proper controller in the future, and it would be easier to find and remove that logic when the time comes.

Added: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/393/files#diff-3b7f0128a3c581577a1f5233c2143855a512630a8f83cefbd0e355d150e4fb7bR95


kfswain commented Feb 26, 2025

/lgtm
/approve
/unhold

Awesome, thanks for this!

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Feb 26, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025

kfswain commented Feb 26, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7ed54a4 into kubernetes-sigs:main Feb 26, 2025
8 checks passed
kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request Feb 27, 2025

* Convert unit test to a table

* remove the dual map for the models store, and rely on linear search when looking up the model by object name

* Added ModelResync to handle a race condition

* Update pkg/epp/controller/inferencemodel_reconciler.go
kfswain pushed a commit to kfswain/llm-instance-gateway that referenced this pull request Apr 29, 2025