Replacing endpointSlice Reconciler with a direct Pod Reconciler #300
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kfswain. The full list of commands accepted by this bot can be found here; the pull request process is described here.
/hold #301 should submit first, and there is still a cache-dumping event that needs to be implemented (when the InferencePool changes its selector).
There are a few other places to clean up the service/zone flags from:
- Remove zone from inferencepool_reconciler.go
- Update the flags in main.go
- Remove Zone/ServiceName from runserver.go
- Remove serviceName flag from ext_proc.yaml
Force-pushed from 603f683 to 8cc1b74.
func (c *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
This is less efficient, IIUC; each pod event in the cluster can trigger a reconcile.
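For illustration only, a minimal sketch of a client-side mitigation in controller-runtime: a predicate that drops pod events before they reach Reconcile. The label key is hypothetical and not from this PR; events still arrive from the API server, so this reduces Reconcile invocations but not watch traffic.

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (c *PodReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Client-side filter: pods missing the (hypothetical) pool label
	// are dropped before Reconcile is invoked.
	hasPoolLabel := predicate.NewPredicateFuncs(func(o client.Object) bool {
		_, ok := o.GetLabels()["example.com/pool"] // hypothetical label key
		return ok
	})
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}, builder.WithPredicates(hasPoolLabel)).
		Complete(c)
}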
Maybe we should migrate to a raw k8s informer.
Yeah, we are aware of that. In all cases we will need to start the pod controller only after we know what the InferencePool is, so that we know the selector and can set up a server-side filter. I will check if controller-runtime allows us to set that up, but I agree switching to a raw informer is an option.
Another approach I am thinking about is to have the InferencePool controller create a Service to trigger creating EndpointSlice objects, and we continue to work with what we have. wdyt about this? I know Rob doesn't like it :)
So I checked: in controller-runtime we can set up a server-side filter, but again the complexity lies in the fact that we don't know the selector beforehand (i.e., at the time we start the operator), and the additional challenge is that we need to reset it every time the selector is updated. That is a complexity with both controller-runtime and informers. Ideally we need the InferencePool controller to start the pod controller.
In any case, I think it is OK to continue with controller-runtime for now, do the minimal things needed to have a functionally correct setup, and revisit in the near future how we want to make this more scalable. I would rather focus our efforts in the short term on optimizing the scheduling algorithm and prefix-cache-aware routing. @kfswain wdyt?
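For reference, a minimal sketch of the server-side filter mentioned above, assuming a recent controller-runtime (v0.15+ cache options); the function name is illustrative. It shows exactly the constraint discussed here: the selector must be known when the manager is built, and changing it later means rebuilding the cache.

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newFilteredManager(selector labels.Selector) (ctrl.Manager, error) {
	// The manager's cache lists/watches only pods matching the selector,
	// so non-matching pod events never leave the API server.
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Pod{}: {Label: selector},
			},
		},
	})
}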
I agree.
I also agree that this is inefficient. But yes, I think we can live with controller-runtime for now, and focus on feature breadth, and then come back to things like this when we are more focused on efficiency.
Happy to make an issue about this so it's not dropped.
Another approach I am thinking about is to have the inferencePool controller create a service to trigger creating endpointslice objects, and we continue to work with what we have. wdyt about this? I know Rob doesn't like it :)
Personally I don't care much about whether a Service is needed to select backend pods.
If we go without a Service, could we start multiple informers with server-side filters, one per InferencePool?
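A minimal sketch of that idea with client-go, assuming one factory per InferencePool; the helper name and resync period are illustrative:

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// newPoolInformerFactory builds a shared informer factory whose list/watch
// calls are filtered server-side by the pool's label selector string, so
// each InferencePool watches only its own pods.
func newPoolInformerFactory(cs kubernetes.Interface, selector string) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactoryWithOptions(cs, 30*time.Second,
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = selector
		}))
}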
/lgtm Thanks!
/unhold
Replacing endpointSlice Reconciler with a direct Pod Reconciler (kubernetes-sigs#300)
- reversion to pod reconciliation
- adding ready check and unit tests
- updating test
- ablating unnecessary func
- embedding ready status into update so non-ready pods are deleted
- scrubbing serviceName & zone as they are obsolete
- implementing pod cache flushing logic
- Renaming file so merge conflicts can find the diffs easier
- cleaning up messy merge conflict
- nil checking short circuit
- Listing fixes
- feedback cleanup
- log formatting and removing pods if not found
- removing err to prevent perma-reconciliation
- removing dev image ref
- cleaning up err logic
This works with the reversion commit: #300
This is a stylistic decision to maintain readability of our K8s interface layer. The PR that was reverted utilizes in-line informers to list the relevant pods from the InferencePool selector. In the event that we hit scalability concerns w.r.t. pod reconciliation, we should look to move away from the standard controller-runtime reconciliation, toward either a simple list backed by informers (such as in the referenced PR) or a self-implemented, work-queue- and informer-based controller (example here; a rough sketch follows below). Either would allow us to reconcile only pods that match the selector.
This PR also removes the serviceName & zone flags, as they are no longer in use with the removal of the EndpointSlice reconciler.