Skip to content

Replacing endpointSlice Reconciler with a direct Pod Reconciler #300

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Feb 13, 2025

Conversation

kfswain
Copy link
Collaborator

@kfswain kfswain commented Feb 6, 2025

This works with the reversion commit: #300

This is a stylistic decision to maintain readability of our K8s interface layer.

The PR that was reverted utilizes in-line informers to list the relevant pods from the InferencePool selector.

In the event that we hit scalability concerns wrt pod reconciliation. We should look to move away from the standard controller-runtime Reconciliation and look to either a simple list backed by informers. Such as the referenced PR, or move to self-implemented work-queue, informer based controller implementation (Example here). Which would allow us to only allow reconcile on pods that match the selector.

This PR also removes the serviceName & zone flags, as they are no longer in use with the removal of the EndpointSlice

@k8s-ci-robot k8s-ci-robot requested a review from ahg-g February 6, 2025 21:07
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from danehans February 6, 2025 21:07
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 6, 2025
Copy link

netlify bot commented Feb 6, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit b244225
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67ae3f4a76266a0008e0d531
😎 Deploy Preview https://deploy-preview-300--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 6, 2025
@kfswain
Copy link
Collaborator Author

kfswain commented Feb 6, 2025

/hold

#301 should submit first, and there is still a cache dumping event that needs to be implemented (when the InferencePool changes selector)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2025
@kfswain kfswain changed the title Revert "Replace EndpointSlice reconciler with pod list backed by informer" Replacing endpointSlice Reconciler with a direct Pod Reconciler Feb 6, 2025
Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few other places to cleanup the service/zone flags from:

  • Remove zone from inferencepool_reconciler.go
  • Update the flags in main.go
  • Remove Zone/ServiceName from runserver.go
  • Remove serviceName flag from ext_proc.yaml

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 6, 2025
@kfswain kfswain self-assigned this Feb 11, 2025
@kfswain kfswain force-pushed the pod-recon-2 branch 2 times, most recently from 603f683 to 8cc1b74 Compare February 11, 2025 20:42
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 11, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 12, 2025

func (c *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&corev1.Pod{}).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less efficient, iiuc, each pod event can trigger reconcile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe should migrate to raw k8s informer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we are aware of that, in all cases we will need to start the pod controller only after we know what the inferencePool is to know the selector and setup a sever side filter. I will check if controller-runtime allows us to set that up, but agree switching to raw informer is an option.

Another approach I am thinking about is to have the inferencePool controller create a service to trigger creating endpointslice objects, and we continue to work with what we have. wdyt about this? I know Rob doesn't like it :)

Copy link
Contributor

@ahg-g ahg-g Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I checked, in controller-runtime we can setup a server-side filter, but again the complexity lies in the fact that we don't know the selector beforehand (i.e., at the time we start the operator), but also the challenge is that we need to reset it every time the selector is updated. Which is a complexity with both controller-runtime and informers. Ideally we need the inferencePool controller start the pod controller.

In any case, I think it is ok continue with controller-runtime for now, do the minimal things needed to have a functionally correct setup, and we can revisit in the near future how we want to make this more scalable. I would rather focus our efforts in the short term on optimizing the scheduling algorithm and prefix cache aware routing. @kfswain wdyt?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

I also agree that this is inefficient. But yes, I think we can live with controller-runtime for now, and focus on feature breadth, and then come back to things like this when we are more focused on efficiency.

Happy to make an issue about this so it's not dropped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another approach I am thinking about is to have the inferencePool controller create a service to trigger creating endpointslice objects, and we continue to work with what we have. wdyt about this? I know Rob doesn't like it :)

Persoanlly i donot care much about whether a service is needed to select backend pods.

If without service, we can start multiple informer to use filter, each for a InferencePool?

@ahg-g
Copy link
Contributor

ahg-g commented Feb 13, 2025

/lgtm

Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025
@kfswain
Copy link
Collaborator Author

kfswain commented Feb 13, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 46541d0 into kubernetes-sigs:main Feb 13, 2025
8 checks passed
coolkp pushed a commit to coolkp/llm-instance-gateway that referenced this pull request Feb 13, 2025
…rnetes-sigs#300)

* reversion to pod reconciliation

* adding ready check and unit tests

* updating test

* ablating unnecessary func

* embedding ready status into update so non-ready pods are deleted

* scrubbing serviceName & zone as they are obsolete

* implementing pod cache flushing logic

* Renaming file so merge confilcts can find the diffs easier

* cleaning up messy merge conflict

* nil checking short circuit

* Listing fixes

* feedback cleanup

* log formatting and removing pods if not found

* removing err to provent perma-reconciliation

* removing dev image ref

* cleaning up err logic
@kfswain kfswain deleted the pod-recon-2 branch March 27, 2025 19:39
kfswain added a commit to kfswain/llm-instance-gateway that referenced this pull request Apr 29, 2025
…rnetes-sigs#300)

* reversion to pod reconciliation

* adding ready check and unit tests

* updating test

* ablating unnecessary func

* embedding ready status into update so non-ready pods are deleted

* scrubbing serviceName & zone as they are obsolete

* implementing pod cache flushing logic

* Renaming file so merge confilcts can find the diffs easier

* cleaning up messy merge conflict

* nil checking short circuit

* Listing fixes

* feedback cleanup

* log formatting and removing pods if not found

* removing err to provent perma-reconciliation

* removing dev image ref

* cleaning up err logic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants