diff --git a/.github/ISSUE_TEMPLATE/blank_issue.md b/.github/ISSUE_TEMPLATE/blank_issue.md new file mode 100644 index 00000000..dd6ebabf --- /dev/null +++ b/.github/ISSUE_TEMPLATE/blank_issue.md @@ -0,0 +1,8 @@ +--- +name: Blank Issue +about: Create a new issue from scratch +title: '' +labels: needs-triage +assignees: '' + +--- \ No newline at end of file diff --git a/.github/ISSUE_TEMPLATE/bug_request.md b/.github/ISSUE_TEMPLATE/bug_request.md index c2597eb3..15ed35e1 100644 --- a/.github/ISSUE_TEMPLATE/bug_request.md +++ b/.github/ISSUE_TEMPLATE/bug_request.md @@ -1,7 +1,9 @@ --- name: Bug Report about: Report a bug you encountered -labels: kind/bug +title: '' +labels: kind/bug, needs-triage +assignees: '' --- diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 00000000..3ba13e0c --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1 @@ +blank_issues_enabled: false diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index 53a885c7..1eee5871 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -2,7 +2,7 @@ name: Feature request about: Suggest an idea for this project title: '' -labels: '' +labels: needs-triage assignees: '' --- @@ -12,4 +12,3 @@ assignees: '' **What would you like to be added**: **Why is this needed**: - diff --git a/.github/ISSUE_TEMPLATE/new-release.md b/.github/ISSUE_TEMPLATE/new-release.md index 6ed3df8c..27e83784 100644 --- a/.github/ISSUE_TEMPLATE/new-release.md +++ b/.github/ISSUE_TEMPLATE/new-release.md @@ -4,6 +4,7 @@ about: Propose a new release title: Release v0.x.0 labels: '' assignees: '' + --- - [Introduction](#introduction) @@ -34,10 +35,10 @@ This document defines the process for releasing Gateway API Inference Extension. export RC=1 ``` -4. The vLLM image tag defaults to `v0.7.1` for a release. Optionally, change the vLLM image tag. For example: +4. The vLLM image tag defaults to `v0.7.2` for a release. Set the `VLLM` environment variable if a newer [tag][vllm-tag] has been published. For example: ```shell - export VLLM=0.7.2 + export VLLM=0.7.3 ``` ## Release Process @@ -45,16 +46,25 @@ This document defines the process for releasing Gateway API Inference Extension. 1. If needed, clone the Gateway API Inference Extension [repo][repo]. ```shell - git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension.git -b main + git clone -o ${REMOTE} https://github.com/kubernetes-sigs/gateway-api-inference-extension.git ``` 2. If you already have the repo cloned, ensure it’s up-to-date and your local branch is clean. -3. Create a new release branch from the `main` branch. The release branch should be named `release-v${MAJOR}.${MINOR}`, e.g. `release-v0.1`. +3. Release Branch Handling: + - For a Release Candidate: + Create a new release branch from the `main` branch. The branch should be named `release-${MAJOR}.${MINOR}`, for example, `release-0.1`: - ```shell - git checkout -b release-v${MAJOR}.${MINOR} - ``` + ```shell + git checkout -b release-${MAJOR}.${MINOR} + ``` + + - For a Major or Minor Release: + A release branch should already exist. In this case, check out the existing branch: + + ```shell + git checkout -b release-${MAJOR}.${MINOR} ${REMOTE}/release-${MAJOR}.${MINOR} + ``` 4. Update release-specific content, generate release artifacts, and stage the changes. @@ -79,7 +89,7 @@ This document defines the process for releasing Gateway API Inference Extension. 6. 
Push your release branch to the Gateway API Inference Extension remote. ```shell - git push ${REMOTE} release-v${MAJOR}.${MINOR} + git push ${REMOTE} release-${MAJOR}.${MINOR} ``` 7. Tag the head of your release branch with the number. @@ -114,7 +124,8 @@ This document defines the process for releasing Gateway API Inference Extension. 9. Pushing the tag triggers Prow to build and publish the container image to the [staging registry][]. 10. Submit a PR against [k8s.io][] to add the staging image tag and SHA to [`k8s-staging-gateway-api-inference-extension/images.yaml`][yaml]. This will - promote the image to the production registry. **Note:** Add a link to this issue when the PR is merged. + promote the image to the production registry, e.g. `registry.k8s.io/gateway-api-inference-extension/epp:v${MAJOR}.${MINOR}.0`. + **Note:** Add a link to this issue when the PR is merged. 11. Test the steps in the tagged quickstart guide after the PR merges, for example: `https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/v0.1.0-rc.1/pkg/README.md`. 12. Create a [new release][]: 1. Choose the tag that you created for the release. @@ -148,3 +159,4 @@ Use the following steps to announce the release. [k8s.io]: https://github.com/kubernetes/k8s.io [yaml]: https://github.com/kubernetes/k8s.io/blob/main/registry.k8s.io/images/k8s-staging-gateway-api-inference-extension/images.yaml [issue]: https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/new/choose +[vllm-tag]: https://hub.docker.com/r/vllm/vllm-openai/tags diff --git a/.golangci.yml b/.golangci.yml index 2ad3b93d..d1b1e112 100644 --- a/.golangci.yml +++ b/.golangci.yml @@ -25,7 +25,6 @@ linters: - makezero - errcheck - goconst - - gocyclo - gofmt - goimports - gosimple diff --git a/Dockerfile b/Dockerfile index e854e133..9cb62e28 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,24 +1,32 @@ # Dockerfile has specific requirement to put this ARG at the beginning: # https://docs.docker.com/engine/reference/builder/#understand-how-arg-and-from-interact -ARG BUILDER_IMAGE=golang:1.23-alpine -ARG BASE_IMAGE=gcr.io/distroless/base-debian10 +ARG BUILDER_IMAGE=golang:1.24 +ARG BASE_IMAGE=gcr.io/distroless/static:nonroot ## Multistage build -FROM ${BUILDER_IMAGE} as builder +FROM ${BUILDER_IMAGE} AS builder ENV CGO_ENABLED=0 ENV GOOS=linux ENV GOARCH=amd64 +ARG COMMIT_SHA=unknown +# Dependencies WORKDIR /src -COPY . . 
-WORKDIR /src/pkg/ext-proc +COPY go.mod go.sum ./ RUN go mod download -RUN go build -o /ext-proc + +# Sources +COPY cmd ./cmd +COPY pkg ./pkg +COPY internal ./internal +COPY api ./api +WORKDIR /src/cmd/epp +RUN go build -ldflags="-X sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics.CommitSHA=${COMMIT_SHA}" -o /epp ## Multistage deploy FROM ${BASE_IMAGE} WORKDIR / -COPY --from=builder /ext-proc /ext-proc +COPY --from=builder /epp /epp -ENTRYPOINT ["/ext-proc"] \ No newline at end of file +ENTRYPOINT ["/epp"] diff --git a/Makefile b/Makefile index 83de8dd1..884d4229 100644 --- a/Makefile +++ b/Makefile @@ -21,29 +21,50 @@ CONTAINER_TOOL ?= docker SHELL = /usr/bin/env bash -o pipefail .SHELLFLAGS = -ec +GIT_COMMIT_SHA ?= "$(shell git rev-parse HEAD 2>/dev/null)" GIT_TAG ?= $(shell git describe --tags --dirty --always) PLATFORMS ?= linux/amd64 DOCKER_BUILDX_CMD ?= docker buildx IMAGE_BUILD_CMD ?= $(DOCKER_BUILDX_CMD) build IMAGE_BUILD_EXTRA_OPTS ?= -IMAGE_REGISTRY ?= us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension +SYNCER_IMAGE_BUILD_EXTRA_OPTS ?= +BBR_IMAGE_BUILD_EXTRA_OPTS ?= +STAGING_IMAGE_REGISTRY ?= us-central1-docker.pkg.dev/k8s-staging-images +IMAGE_REGISTRY ?= $(STAGING_IMAGE_REGISTRY)/gateway-api-inference-extension IMAGE_NAME := epp IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(IMAGE_NAME) IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG) +PROJECT_DIR := $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST)))) +E2E_MANIFEST_PATH ?= config/manifests/vllm/gpu-deployment.yaml + +SYNCER_IMAGE_NAME := lora-syncer +SYNCER_IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(SYNCER_IMAGE_NAME) +SYNCER_IMAGE_TAG ?= $(SYNCER_IMAGE_REPO):$(GIT_TAG) + +BBR_IMAGE_NAME := bbr +BBR_IMAGE_REPO ?= $(IMAGE_REGISTRY)/$(BBR_IMAGE_NAME) +BBR_IMAGE_TAG ?= $(BBR_IMAGE_REPO):$(GIT_TAG) -BASE_IMAGE ?= gcr.io/distroless/base-debian10 -BUILDER_IMAGE ?= golang:1.23-alpine +BASE_IMAGE ?= gcr.io/distroless/static:nonroot +BUILDER_IMAGE ?= golang:1.24 ifdef GO_VERSION BUILDER_IMAGE = golang:$(GO_VERSION) endif ifdef EXTRA_TAG IMAGE_EXTRA_TAG ?= $(IMAGE_REPO):$(EXTRA_TAG) +SYNCER_IMAGE_EXTRA_TAG ?= $(SYNCER_IMAGE_REPO):$(EXTRA_TAG) +BBR_IMAGE_EXTRA_TAG ?= $(BBR_IMAGE_REPO):$(EXTRA_TAG) endif ifdef IMAGE_EXTRA_TAG IMAGE_BUILD_EXTRA_OPTS += -t $(IMAGE_EXTRA_TAG) +SYNCER_IMAGE_BUILD_EXTRA_OPTS += -t $(SYNCER_IMAGE_EXTRA_TAG) +BBR_IMAGE_BUILD_EXTRA_OPTS += -t $(BBR_IMAGE_EXTRA_TAG) endif +# The name of the kind cluster to use for the "kind-load" target. +KIND_CLUSTER ?= kind + ##@ General # The help target prints out all targets with their descriptions organized @@ -72,7 +93,6 @@ generate: controller-gen code-generator manifests ## Generate code containing De $(CONTROLLER_GEN) object:headerFile="hack/boilerplate.go.txt" paths="./..." ./hack/update-codegen.sh -PROJECT_DIR := $(shell dirname $(abspath $(lastword $(MAKEFILE_LIST)))) # Use same code-generator version as k8s.io/api CODEGEN_VERSION := $(shell go list -m -f '{{.Version}}' k8s.io/api) CODEGEN = $(shell pwd)/bin/code-generator @@ -101,16 +121,20 @@ vet: ## Run go vet against code. go vet ./... .PHONY: test -test: manifests generate fmt vet envtest ## Run tests. - KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test $$(go list ./... | grep -v /e2e) -coverprofile cover.out +test: manifests generate fmt vet envtest image-build ## Run tests. + KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test $$(go list ./... 
| grep -v /e2e | grep -v /conformance) -race -coverprofile cover.out + +.PHONY: test-unit +test-unit: ## Run unit tests. + KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test ./pkg/... -race -coverprofile cover.out .PHONY: test-integration -test-integration: manifests generate fmt vet envtest ## Run tests. - KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test ./test/integration -coverprofile cover.out +test-integration: ## Run integration tests. + KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) --bin-dir $(LOCALBIN) -p path)" go test ./test/integration/epp/... -race -coverprofile cover.out .PHONY: test-e2e -test-e2e: ## Run end-to-end tests against an existing Kubernetes cluster with at least 3 available GPUs. - go test ./test/e2e/ -v -ginkgo.v +test-e2e: ## Run end-to-end tests against an existing Kubernetes cluster. When using default configuration, the tests need at least 3 available GPUs. + MANIFEST_PATH=$(PROJECT_DIR)/$(E2E_MANIFEST_PATH) go test ./test/e2e/epp/ -v -ginkgo.v .PHONY: lint lint: golangci-lint ## Run golangci-lint linter @@ -132,28 +156,108 @@ verify: vet fmt-verify manifests generate ci-lint # Build the container image .PHONY: image-local-build -image-local-build: +image-local-build: ## Build the EPP image using Docker Buildx for local development. BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use) $(MAKE) image-build PUSH=$(PUSH) + $(MAKE) image-build LOAD=$(LOAD) $(DOCKER_BUILDX_CMD) rm $$BUILDER .PHONY: image-local-push -image-local-push: PUSH=--push +image-local-push: PUSH=--push ## Build the EPP image for local development and push it to $IMAGE_REPO. image-local-push: image-local-build +.PHONY: image-local-load +image-local-load: LOAD=--load ## Build the EPP image for local development and load it in the local Docker registry. +image-local-load: image-local-build + .PHONY: image-build -image-build: +image-build: ## Build the EPP image using Docker Buildx. $(IMAGE_BUILD_CMD) -t $(IMAGE_TAG) \ --platform=$(PLATFORMS) \ --build-arg BASE_IMAGE=$(BASE_IMAGE) \ --build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \ + --build-arg COMMIT_SHA=${GIT_COMMIT_SHA} \ $(PUSH) \ + $(LOAD) \ $(IMAGE_BUILD_EXTRA_OPTS) ./ .PHONY: image-push -image-push: PUSH=--push +image-push: PUSH=--push ## Build the EPP image and push it to $IMAGE_REPO. image-push: image-build +.PHONY: image-load +image-load: LOAD=--load ## Build the EPP image and load it in the local Docker registry. +image-load: image-build + +.PHONY: image-kind +image-kind: image-build ## Build the EPP image and load it to kind cluster $KIND_CLUSTER ("kind" by default). 
+	kind load docker-image $(IMAGE_TAG) --name $(KIND_CLUSTER)
+
+##@ Lora Syncer
+
+.PHONY: syncer-image-local-build
+syncer-image-local-build:
+	BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use)
+	$(MAKE) syncer-image-build PUSH=$(PUSH)
+	$(DOCKER_BUILDX_CMD) rm $$BUILDER
+
+.PHONY: syncer-image-local-push
+syncer-image-local-push: PUSH=--push
+syncer-image-local-push: syncer-image-local-build
+
+.PHONY: syncer-image-build
+syncer-image-build:
+	cd $(CURDIR)/tools/dynamic-lora-sidecar && $(IMAGE_BUILD_CMD) -t $(SYNCER_IMAGE_TAG) \
+		--platform=$(PLATFORMS) \
+		--build-arg BASE_IMAGE=$(BASE_IMAGE) \
+		--build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \
+		$(PUSH) \
+		$(SYNCER_IMAGE_BUILD_EXTRA_OPTS) ./
+
+.PHONY: syncer-image-push
+syncer-image-push: PUSH=--push
+syncer-image-push: syncer-image-build
+
+##@ Body-based Routing extension
+
+# Build the container image
+.PHONY: bbr-image-local-build
+bbr-image-local-build: ## Build the image using Docker Buildx for local development.
+	BUILDER=$(shell $(DOCKER_BUILDX_CMD) create --use)
+	$(MAKE) bbr-image-build PUSH=$(PUSH)
+	$(MAKE) bbr-image-build LOAD=$(LOAD)
+	$(DOCKER_BUILDX_CMD) rm $$BUILDER
+
+.PHONY: bbr-image-local-push
+bbr-image-local-push: PUSH=--push ## Build the image for local development and push it to $IMAGE_REPO.
+bbr-image-local-push: bbr-image-local-build
+
+.PHONY: bbr-image-local-load
+bbr-image-local-load: LOAD=--load ## Build the image for local development and load it in the local Docker registry.
+bbr-image-local-load: bbr-image-local-build
+
+.PHONY: bbr-image-build
+bbr-image-build: ## Build the image using Docker Buildx.
+	$(IMAGE_BUILD_CMD) -f bbr.Dockerfile -t $(BBR_IMAGE_TAG) \
+		--platform=$(PLATFORMS) \
+		--build-arg BASE_IMAGE=$(BASE_IMAGE) \
+		--build-arg BUILDER_IMAGE=$(BUILDER_IMAGE) \
+		$(PUSH) \
+		$(LOAD) \
+		$(BBR_IMAGE_BUILD_EXTRA_OPTS) ./
+
+.PHONY: bbr-image-push
+bbr-image-push: PUSH=--push ## Build the image and push it to $IMAGE_REPO.
+bbr-image-push: bbr-image-build
+
+.PHONY: bbr-image-load
+bbr-image-load: LOAD=--load ## Build the image and load it in the local Docker registry.
+bbr-image-load: bbr-image-build
+
+.PHONY: bbr-image-kind
+bbr-image-kind: bbr-image-build ## Build the image and load it to kind cluster $KIND_CLUSTER ("kind" by default).
+	kind load docker-image $(BBR_IMAGE_TAG) --name $(KIND_CLUSTER)
+
 ##@ Docs
 
 .PHONY: build-docs
@@ -193,6 +297,16 @@ install: manifests kustomize ## Install CRDs into the K8s cluster specified in ~
 uninstall: manifests kustomize ## Uninstall CRDs from the K8s cluster specified in ~/.kube/config. Call with ignore-not-found=true to ignore resource not found errors during deletion.
 	$(KUSTOMIZE) build config/crd | $(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f -
+
+##@ Helm
+.PHONY: inferencepool-helm-chart-push
+inferencepool-helm-chart-push: yq helm
+	CHART=inferencepool EXTRA_TAG="$(EXTRA_TAG)" IMAGE_REGISTRY="$(IMAGE_REGISTRY)" YQ="$(YQ)" HELM="$(HELM)" ./hack/push-chart.sh
+
+.PHONY: bbr-helm-chart-push
+bbr-helm-chart-push: yq helm
+	CHART=body-based-routing EXTRA_TAG="$(EXTRA_TAG)" IMAGE_REGISTRY="$(IMAGE_REGISTRY)" YQ="$(YQ)" HELM="$(HELM)" ./hack/push-chart.sh
+
 ##@ Release
 
 .PHONY: release-quickstart
@@ -222,12 +336,15 @@ KUSTOMIZE ?= $(LOCALBIN)/kustomize
 CONTROLLER_GEN ?= $(LOCALBIN)/controller-gen
 ENVTEST ?= $(LOCALBIN)/setup-envtest
 GOLANGCI_LINT = $(LOCALBIN)/golangci-lint
+HELM = $(PROJECT_DIR)/bin/helm
+YQ = $(PROJECT_DIR)/bin/yq
 
 ## Tool Versions
 KUSTOMIZE_VERSION ?= v5.4.3
 CONTROLLER_TOOLS_VERSION ?= v0.16.1
 ENVTEST_VERSION ?= release-0.19
 GOLANGCI_LINT_VERSION ?= v1.62.2
+HELM_VERSION ?= v3.17.1
 
 .PHONY: kustomize
 kustomize: $(KUSTOMIZE) ## Download kustomize locally if necessary.
@@ -249,6 +366,14 @@ golangci-lint: $(GOLANGCI_LINT) ## Download golangci-lint locally if necessary.
 $(GOLANGCI_LINT): $(LOCALBIN)
 	$(call go-install-tool,$(GOLANGCI_LINT),github.com/golangci/golangci-lint/cmd/golangci-lint,$(GOLANGCI_LINT_VERSION))
 
+.PHONY: yq
+yq: ## Download yq locally if necessary.
+	GOBIN=$(PROJECT_DIR)/bin GO111MODULE=on go install github.com/mikefarah/yq/v4@v4.45.1
+
+.PHONY: helm
+helm: ## Download helm locally if necessary.
+	GOBIN=$(PROJECT_DIR)/bin GO111MODULE=on go install helm.sh/helm/v3/cmd/helm@$(HELM_VERSION)
+
 # go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
 # $1 - target path with name of binary
 # $2 - package url which can be installed
diff --git a/OWNERS_ALIASES b/OWNERS_ALIASES
index 6e8e0c5d..933fbe9c 100644
--- a/OWNERS_ALIASES
+++ b/OWNERS_ALIASES
@@ -11,6 +11,9 @@ aliases:
   gateway-api-inference-extension-reviewers:
   - liu-cong
   - robscott
+  - shaneutt
+  - nirrozenbaum
+
   wg-serving-leads:
   - ArangoGutierrez
diff --git a/README.md b/README.md
index a15e9542..ffd86758 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,97 @@
-# Gateway API Inference Extension
+[![Go Report Card](https://goreportcard.com/badge/sigs.k8s.io/gateway-api-inference-extension)](https://goreportcard.com/report/sigs.k8s.io/gateway-api-inference-extension)
+[![Go Reference](https://pkg.go.dev/badge/sigs.k8s.io/gateway-api-inference-extension.svg)](https://pkg.go.dev/sigs.k8s.io/gateway-api-inference-extension)
+[![License](https://img.shields.io/github/license/kubernetes-sigs/gateway-api-inference-extension)](/LICENSE)
 
-The Gateway API Inference Extension came out of [wg-serving](https://github.com/kubernetes/community/tree/master/wg-serving) and is sponsored by [SIG Network](https://github.com/kubernetes/community/blob/master/sig-network/README.md#gateway-api-inference-extension). This repo contains: the load balancing algorithm, [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) code, CRDs, and controllers of the extension.
+# Gateway API Inference Extension (GIE)
 
-This extension is intented to provide value to multiplexed LLM services on a shared pool of compute. See the [proposal](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/012-llm-instance-gateway) for more info.
+This project offers tools for AI inference, enabling developers to build [Inference Gateways].
+
+[Inference Gateways]:#concepts-and-definitions
+
+## Concepts and Definitions
+
+The following are some key industry terms that are important to understand for
+this project:
+
+- **Model**: A generative AI model that has learned patterns from data and is
+  used for inference. Models vary in size and architecture, from smaller
+  domain-specific models to massive multi-billion-parameter neural networks that
+  are optimized for diverse language tasks.
+- **Inference**: The process of running a generative AI model, such as a large
+  language model or a diffusion model, to generate text, embeddings, or other
+  outputs from input data.
+- **Model server**: A service (in our case, containerized) responsible for
+  receiving inference requests and returning predictions from a model.
+- **Accelerator**: Specialized hardware, such as Graphics Processing Units
+  (GPUs), that can be attached to Kubernetes nodes to speed up computations,
+  particularly for training and inference tasks.
+
+And the following are terms more specific to this project:
+
+- **Scheduler**: Makes decisions about which endpoint is optimal (best cost /
+  best performance) for an inference request based on `Metrics and Capabilities`
+  from [Model Serving](/docs/proposals/003-model-server-protocol/README.md).
+- **Metrics and Capabilities**: Data provided by model serving platforms about
+  performance, availability, and capabilities to optimize routing. Includes
+  things like [Prefix Cache] status or [LoRA Adapters] availability.
+- **Endpoint Selector**: A `Scheduler`, combined with `Metrics and Capabilities`
+  systems, is often referred to as an [Endpoint Selection Extension]
+  (this is also sometimes referred to as an "endpoint picker", or "EPP").
+- **Inference Gateway**: A proxy/load-balancer which has been coupled with an
+  `Endpoint Selector`. It provides optimized routing and load balancing for
+  serving Kubernetes self-hosted generative Artificial Intelligence (AI)
+  workloads. It simplifies the deployment, management, and observability of AI
+  inference workloads.
+
+For deeper insights and more advanced concepts, refer to our [proposals](/docs/proposals).
+
+[Inference]:https://www.digitalocean.com/community/tutorials/llm-inference-optimization
+[Gateway API]:https://github.com/kubernetes-sigs/gateway-api
+[Prefix Cache]:https://docs.vllm.ai/en/stable/design/v1/prefix_caching.html
+[LoRA Adapters]:https://docs.vllm.ai/en/stable/features/lora.html
+[Endpoint Selection Extension]:https://gateway-api-inference-extension.sigs.k8s.io/#endpoint-selection-extension
+
+## Technical Overview
+
+This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway, such as Envoy Gateway, kGateway, or the GKE Gateway, into an **inference gateway**, supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) for other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.
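+
+To make this concrete, here is a minimal sketch of a client request flowing through an inference gateway. The gateway host (`inference-gateway.example.com`) and the client-facing model name (`food-review`) are placeholders, not values shipped by this project; the request body follows the OpenAI-compatible chat completions API that the gateway fronts:
+
+```go
+package main
+
+import (
+	"bytes"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+)
+
+// chatMessage and chatRequest mirror the minimal subset of the
+// OpenAI-compatible chat completions request body used in this sketch.
+type chatMessage struct {
+	Role    string `json:"role"`
+	Content string `json:"content"`
+}
+
+type chatRequest struct {
+	Model    string        `json:"model"`
+	Messages []chatMessage `json:"messages"`
+}
+
+func main() {
+	// The gateway resolves the client model name to an InferencePool,
+	// and the endpoint picker selects an optimal model server behind it.
+	body, err := json.Marshal(chatRequest{
+		Model: "food-review", // placeholder client model name
+		Messages: []chatMessage{
+			{Role: "user", Content: "Write a short review of a ramen shop."},
+		},
+	})
+	if err != nil {
+		panic(err)
+	}
+
+	resp, err := http.Post(
+		"http://inference-gateway.example.com/v1/chat/completions", // placeholder gateway address
+		"application/json",
+		bytes.NewReader(body),
+	)
+	if err != nil {
+		panic(err)
+	}
+	defer resp.Body.Close()
+
+	out, _ := io.ReadAll(resp.Body)
+	fmt.Println(resp.Status, string(out))
+}
+```
+
+From the client's perspective this is an ordinary OpenAI-style API call; the scheduling and endpoint selection described above happen transparently inside the gateway.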
+
+The inference gateway:
+
+* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is kv-cache and request-cost aware, avoiding evictions or queueing as load increases
+* Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case-specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
+* Adds end-to-end observability around service objective attainment
+* Ensures operational guardrails between different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators
+
+![Architecture Diagram](./docs/inference-gateway-architecture.svg)
+
+It currently requires a version of vLLM that supports the necessary metrics to predict traffic load, as defined in the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol). Support for Google's JetStream, NVIDIA Triton, text-generation-inference, and SGLang is coming soon.
 
 ## Status
 
-This project is currently in development.
+This project is [alpha (0.3 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.3.0). It should not be used in production yet.
 
 ## Getting Started
 
-Follow this [README](./pkg/README.md) to get the inference-extension up and running on your cluster!
+Follow our [Getting Started Guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to get the inference-extension up and running on your cluster!
 
-## End-to-End Tests
+See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.
+
+## Roadmap
 
-Follow this [README](./test/e2e/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.
+As Inference Gateway builds towards a GA release, we will continue to expand our capabilities, namely:
+1. Prefix-cache aware load balancing with interfaces for remote caches
+1. Recommended LoRA adapter pipeline for automated rollout
+1. Fairness and priority between workloads within the same criticality band
+1. HPA support for autoscaling on aggregate metrics derived from the load balancer
+1. Support for large multi-modal inputs and outputs
+1. Support for other GenAI model types (diffusion and other non-completion protocols)
+1. Heterogeneous accelerators - serve workloads on multiple types of accelerator using latency and request cost-aware load balancing
+1. Disaggregated serving support with independently scaling pools
 
-## Website
-Detailed documentation is available on our website: https://gateway-api-inference-extension.sigs.k8s.io/
+
+## End-to-End Tests
+
+Follow this [README](./test/e2e/epp/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.
 
 ## Contributing
diff --git a/api/doc.go b/api/doc.go
new file mode 100644
index 00000000..c91adb92
--- /dev/null
+++ b/api/doc.go
@@ -0,0 +1,17 @@
+/*
+Copyright 2024 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package api diff --git a/api/v1alpha1/groupversion_info.go b/api/v1alpha1/groupversion_info.go deleted file mode 100644 index 8c0a449f..00000000 --- a/api/v1alpha1/groupversion_info.go +++ /dev/null @@ -1,45 +0,0 @@ -/* -Copyright 2024 The Kubernetes Authors. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -*/ - -// Package v1alpha1 contains API Schema definitions for the gateway v1alpha1 API group -// +kubebuilder:object:generate=true -// +groupName=inference.networking.x-k8s.io -package v1alpha1 - -import ( - "k8s.io/apimachinery/pkg/runtime/schema" - "sigs.k8s.io/controller-runtime/pkg/scheme" -) - -var ( - // GroupVersion is group version used to register these objects - GroupVersion = schema.GroupVersion{Group: "inference.networking.x-k8s.io", Version: "v1alpha1"} - - // SchemeGroupVersion is alias to GroupVersion for client-go libraries. - // It is required by pkg/client/informers/externalversions/... - SchemeGroupVersion = GroupVersion - - // SchemeBuilder is used to add go types to the GroupVersionKind scheme - SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion} - - // AddToScheme adds the types in this group-version to the given scheme. - AddToScheme = SchemeBuilder.AddToScheme -) - -// Resource is required by pkg/client/listers/... -func Resource(resource string) schema.GroupResource { - return GroupVersion.WithResource(resource).GroupResource() -} diff --git a/api/v1alpha2/doc.go b/api/v1alpha2/doc.go new file mode 100644 index 00000000..90a35f58 --- /dev/null +++ b/api/v1alpha2/doc.go @@ -0,0 +1,23 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package v1alpha2 contains API Schema definitions for the +// inference.networking.x-k8s.io API group. 
+// +// +k8s:openapi-gen=true +// +kubebuilder:object:generate=true +// +groupName=inference.networking.x-k8s.io +package v1alpha2 diff --git a/api/v1alpha1/inferencemodel_types.go b/api/v1alpha2/inferencemodel_types.go similarity index 88% rename from api/v1alpha1/inferencemodel_types.go rename to api/v1alpha2/inferencemodel_types.go index 3661820d..7cd98a74 100644 --- a/api/v1alpha1/inferencemodel_types.go +++ b/api/v1alpha2/inferencemodel_types.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -14,7 +14,7 @@ See the License for the specific language governing permissions and limitations under the License. */ -package v1alpha1 +package v1alpha2 import ( metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -24,6 +24,11 @@ import ( // // +kubebuilder:object:root=true // +kubebuilder:subresource:status +// +kubebuilder:storageversion +// +kubebuilder:printcolumn:name="Model Name",type=string,JSONPath=`.spec.modelName` +// +kubebuilder:printcolumn:name="Inference Pool",type=string,JSONPath=`.spec.poolRef.name` +// +kubebuilder:printcolumn:name="Criticality",type=string,JSONPath=`.spec.criticality` +// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp` // +genclient type InferenceModel struct { metav1.TypeMeta `json:",inline"` @@ -70,6 +75,7 @@ type InferenceModelSpec struct { // // +kubebuilder:validation:MaxLength=256 // +kubebuilder:validation:Required + // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="modelName is immutable" ModelName string `json:"modelName"` // Criticality defines how important it is to serve the model compared to other models referencing the same pool. @@ -105,29 +111,22 @@ type PoolObjectReference struct { // // +optional // +kubebuilder:default="inference.networking.x-k8s.io" - // +kubebuilder:validation:MaxLength=253 - // +kubebuilder:validation:Pattern=`^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$` - Group string `json:"group,omitempty"` + Group Group `json:"group,omitempty"` // Kind is kind of the referent. For example "InferencePool". // // +optional // +kubebuilder:default="InferencePool" - // +kubebuilder:validation:MinLength=1 - // +kubebuilder:validation:MaxLength=63 - // +kubebuilder:validation:Pattern=`^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` - Kind string `json:"kind,omitempty"` + Kind Kind `json:"kind,omitempty"` // Name is the name of the referent. // - // +kubebuilder:validation:MinLength=1 - // +kubebuilder:validation:MaxLength=253 // +kubebuilder:validation:Required - Name string `json:"name"` + Name ObjectName `json:"name"` } // Criticality defines how important it is to serve the model compared to other models. -// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default. +// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer), and set no default. // This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior. 
 // +kubebuilder:validation:Enum=Critical;Standard;Sheddable
 type Criticality string
@@ -174,7 +173,7 @@ type TargetModel struct {
 	// Conversely weights are optional, so long as ALL targetModels do not specify a weight.
 	//
 	// +optional
-	// +kubebuilder:validation:Minimum=0
+	// +kubebuilder:validation:Minimum=1
 	// +kubebuilder:validation:Maximum=1000000
 	Weight *int32 `json:"weight,omitempty"`
 }
@@ -202,7 +201,7 @@ type InferenceModelConditionType string
 type InferenceModelConditionReason string
 
 const (
-	// This condition indicates if the model config is accepted, and if not, why.
+	// ModelConditionAccepted indicates if the model config is accepted, and if not, why.
 	//
 	// Possible reasons for this condition to be True are:
 	//
@@ -218,17 +217,13 @@ const (
 	//
 	ModelConditionAccepted InferenceModelConditionType = "Accepted"
 
-	// Desired state. Model conforms to the state of the pool.
+	// ModelReasonAccepted is the desired state. Model conforms to the state of the pool.
 	ModelReasonAccepted InferenceModelConditionReason = "Accepted"
 
-	// This reason is used when a given ModelName already exists within the pool.
+	// ModelReasonNameInUse is used when a given ModelName already exists within the pool.
 	// Details about naming conflict resolution are on the ModelName field itself.
 	ModelReasonNameInUse InferenceModelConditionReason = "ModelNameInUse"
 
-	// This reason is the initial state, and indicates that the controller has not yet reconciled the InferenceModel.
+	// ModelReasonPending is the initial state, and indicates that the controller has not yet reconciled the InferenceModel.
 	ModelReasonPending InferenceModelConditionReason = "Pending"
 )
-
-func init() {
-	SchemeBuilder.Register(&InferenceModel{}, &InferenceModelList{})
-}
diff --git a/api/v1alpha1/inferencepool_types.go b/api/v1alpha2/inferencepool_types.go
similarity index 60%
rename from api/v1alpha1/inferencepool_types.go
rename to api/v1alpha2/inferencepool_types.go
index 61a3764d..7018ba21 100644
--- a/api/v1alpha1/inferencepool_types.go
+++ b/api/v1alpha2/inferencepool_types.go
@@ -1,5 +1,5 @@
 /*
-Copyright 2024 The Kubernetes Authors.
+Copyright 2025 The Kubernetes Authors.
 
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -14,9 +14,10 @@ See the License for the specific language governing permissions and
 limitations under the License.
 */
 
-package v1alpha1
+package v1alpha2
 
 import (
+	corev1 "k8s.io/api/core/v1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 )
 
@@ -24,6 +25,7 @@ import (
 //
 // +kubebuilder:object:root=true
 // +kubebuilder:subresource:status
+// +kubebuilder:storageversion
 // +genclient
 type InferencePool struct {
 	metav1.TypeMeta `json:",inline"`
@@ -48,6 +50,8 @@ type InferencePoolSpec struct {
 	// that should be included in the InferencePool.
 	// In some cases, implementations may translate this field to a Service selector, so this matches the simple
 	// map used for Service selectors instead of the full Kubernetes LabelSelector type.
+	// If specified, it will be applied to match the model server pods in the same namespace as the InferencePool.
+	// Cross-namespace selectors are not supported.
 	//
 	// +kubebuilder:validation:Required
 	Selector map[LabelKey]LabelValue `json:"selector"`
@@ -86,11 +90,11 @@ type Extension struct {
 // ExtensionReference is a reference to the extension deployment.
 type ExtensionReference struct {
 	// Group is the group of the referent.
-	// When unspecified or empty string, core API group is inferred.
+	// The default value is "", representing the Core API group.
 	//
 	// +optional
 	// +kubebuilder:default=""
-	Group *string `json:"group,omitempty"`
+	Group *Group `json:"group,omitempty"`
 
 	// Kind is the Kubernetes resource kind of the referent. For example
 	// "Service".
@@ -105,20 +109,19 @@ type ExtensionReference struct {
 	//
 	// +optional
 	// +kubebuilder:default=Service
-	Kind *string `json:"kind,omitempty"`
+	Kind *Kind `json:"kind,omitempty"`
 
 	// Name is the name of the referent.
 	//
 	// +kubebuilder:validation:Required
-	Name string `json:"name"`
+	Name ObjectName `json:"name"`
 
-	// The port number on the pods running the extension. When unspecified, implementations SHOULD infer a
-	// default value of 9002 when the Kind is Service.
+	// The port number on the service running the extension. When unspecified,
+	// implementations SHOULD infer a default value of 9002 when the Kind is
+	// Service.
 	//
-	// +kubebuilder:validation:Minimum=1
-	// +kubebuilder:validation:Maximum=65535
 	// +optional
-	TargetPortNumber *int32 `json:"targetPortNumber,omitempty"`
+	PortNumber *PortNumber `json:"portNumber,omitempty"`
 }
 
 // ExtensionConnection encapsulates options that configures the connection to the extension.
@@ -143,96 +146,101 @@ const (
 	FailClose ExtensionFailureMode = "FailClose"
 )
 
-// LabelKey was originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731
-// Duplicated as to not take an unexpected dependency on gw's API.
-//
-// LabelKey is the key of a label. This is used for validation
-// of maps. This matches the Kubernetes "qualified name" validation that is used for labels.
-// Labels are case sensitive, so: my-label and My-Label are considered distinct.
-//
-// Valid values include:
-//
-// * example
-// * example.com
-// * example.com/path
-// * example.com/path.html
-//
-// Invalid values include:
-//
-// * example~ - "~" is an invalid character
-// * example.com. - can not start or end with "."
-//
-// +kubebuilder:validation:MinLength=1
-// +kubebuilder:validation:MaxLength=253
-// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$`
-type LabelKey string
-
-// LabelValue is the value of a label. This is used for validation
-// of maps. This matches the Kubernetes label validation rules:
-// * must be 63 characters or less (can be empty),
-// * unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]),
-// * could contain dashes (-), underscores (_), dots (.), and alphanumerics between.
-//
-// Valid values include:
-//
-// * MyValue
-// * my.name
-// * 123-my-value
-//
-// +kubebuilder:validation:MinLength=0
-// +kubebuilder:validation:MaxLength=63
-// +kubebuilder:validation:Pattern=`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`
-type LabelValue string
-
 // InferencePoolStatus defines the observed state of InferencePool
 type InferencePoolStatus struct {
+	// Parents is a list of parent resources (usually Gateways) that are
+	// associated with the InferencePool, and the status of the InferencePool
+	// with respect to each parent.
+	//
+	// A maximum of 32 Gateways will be represented in this list. An empty list
+	// means the InferencePool has not been attached to any Gateway.
+	//
+	// +kubebuilder:validation:MaxItems=32
+	Parents []PoolStatus `json:"parent,omitempty"`
+}
+
+// PoolStatus defines the observed state of InferencePool from a Gateway.
+type PoolStatus struct {
+	// GatewayRef indicates the Gateway that observed the state of the InferencePool.
+	GatewayRef corev1.ObjectReference `json:"parentRef"`
+
 	// Conditions track the state of the InferencePool.
 	//
 	// Known condition types are:
 	//
-	// * "Ready"
+	// * "Accepted"
+	// * "ResolvedRefs"
 	//
 	// +optional
 	// +listType=map
 	// +listMapKey=type
 	// +kubebuilder:validation:MaxItems=8
-	// +kubebuilder:default={{type: "Ready", status: "Unknown", reason:"Pending", message:"Waiting for controller", lastTransitionTime: "1970-01-01T00:00:00Z"}}
+	// +kubebuilder:default={{type: "Accepted", status: "Unknown", reason:"Pending", message:"Waiting for controller", lastTransitionTime: "1970-01-01T00:00:00Z"}}
 	Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
 // InferencePoolConditionType is a type of condition for the InferencePool
 type InferencePoolConditionType string
 
-// InferencePoolConditionReason is the reason for a given InferencePoolConditionType
-type InferencePoolConditionReason string
+// InferencePoolReason is the reason for a given InferencePoolConditionType
+type InferencePoolReason string
 
 const (
-	// This condition indicates if the pool is ready to accept traffic, and if not, why.
+	// This condition indicates whether the InferencePool has been accepted or
+	// rejected by a Gateway, and why.
 	//
 	// Possible reasons for this condition to be True are:
 	//
-	// * "Ready"
+	// * "Accepted"
 	//
 	// Possible reasons for this condition to be False are:
 	//
-	// * "EndpointPickerNotHealthy"
+	// * "NotSupportedByGateway"
 	//
 	// Possible reasons for this condition to be Unknown are:
 	//
 	// * "Pending"
 	//
-	PoolConditionReady InferencePoolConditionType = "Ready"
+	// Controllers MAY raise this condition with other reasons, but should
+	// prefer to use the reasons listed above to improve interoperability.
+	InferencePoolConditionAccepted InferencePoolConditionType = "Accepted"
 
-	// Desired state. The pool and its components are initialized and ready for traffic.
-	PoolReasonReady InferencePoolConditionReason = "Ready"
+	// This reason is used with the "Accepted" condition when the InferencePool
+	// has been accepted by the Gateway.
+	InferencePoolReasonAccepted InferencePoolReason = "Accepted"
 
-	// This reason is used when the EPP has not yet passed health checks, or has started failing them.
-	PoolReasonEPPNotHealthy InferencePoolConditionReason = "EndpointPickerNotHealthy"
+	// This reason is used with the "Accepted" condition when the InferencePool
+	// has not been accepted by a Gateway because the Gateway does not support
+	// InferencePool as a backend.
+	InferencePoolReasonNotSupportedByGateway InferencePoolReason = "NotSupportedByGateway"
 
-	// This reason is the initial state, and indicates that the controller has not yet reconciled this pool.
-	PoolReasonPending InferencePoolConditionReason = "Pending"
+	// This reason is used with the "Accepted" condition when a controller has
+	// not yet reconciled the InferencePool.
+	InferencePoolReasonPending InferencePoolReason = "Pending"
 )
 
-func init() {
-	SchemeBuilder.Register(&InferencePool{}, &InferencePoolList{})
-}
+const (
+	// This condition indicates whether the controller was able to resolve all
+	// the object references for the InferencePool.
+	//
+	// Possible reasons for this condition to be True are:
+	//
+	// * "ResolvedRefs"
+	//
+	// Possible reasons for this condition to be False are:
+	//
+	// * "InvalidExtensionRef"
+	//
+	// Controllers MAY raise this condition with other reasons, but should
+	// prefer to use the reasons listed above to improve interoperability.
+	InferencePoolConditionResolvedRefs InferencePoolConditionType = "ResolvedRefs"
+
+	// This reason is used with the "ResolvedRefs" condition when the condition
+	// is true.
+	InferencePoolReasonResolvedRefs InferencePoolReason = "ResolvedRefs"
+
+	// This reason is used with the "ResolvedRefs" condition when the
+	// ExtensionRef is invalid in some way. This can include an unsupported kind
+	// or API group, or a reference to a resource that cannot be found.
+	InferencePoolReasonInvalidExtensionRef InferencePoolReason = "InvalidExtensionRef"
+)
diff --git a/api/v1alpha2/shared_types.go b/api/v1alpha2/shared_types.go
new file mode 100644
index 00000000..ea5ef299
--- /dev/null
+++ b/api/v1alpha2/shared_types.go
@@ -0,0 +1,108 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package v1alpha2
+
+// Group refers to a Kubernetes Group. It must either be an empty string or an
+// RFC 1123 subdomain.
+//
+// This validation is based off of the corresponding Kubernetes validation:
+// https://github.com/kubernetes/apimachinery/blob/02cfb53916346d085a6c6c7c66f882e3c6b0eca6/pkg/util/validation/validation.go#L208
+//
+// Valid values include:
+//
+// * "" - empty string implies core Kubernetes API group
+// * "gateway.networking.k8s.io"
+// * "foo.example.com"
+//
+// Invalid values include:
+//
+// * "example.com/bar" - "/" is an invalid character
+//
+// +kubebuilder:validation:MaxLength=253
+// +kubebuilder:validation:Pattern=`^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
+type Group string
+
+// Kind refers to a Kubernetes Kind.
+//
+// Valid values include:
+//
+// * "Service"
+// * "HTTPRoute"
+//
+// Invalid values include:
+//
+// * "invalid/kind" - "/" is an invalid character
+//
+// +kubebuilder:validation:MinLength=1
+// +kubebuilder:validation:MaxLength=63
+// +kubebuilder:validation:Pattern=`^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
+type Kind string
+
+// ObjectName refers to the name of a Kubernetes object.
+// Object names can have a variety of forms, including RFC 1123 subdomains,
+// RFC 1123 labels, or RFC 1035 labels.
+//
+// +kubebuilder:validation:MinLength=1
+// +kubebuilder:validation:MaxLength=253
+type ObjectName string
+
+// PortNumber defines a network port.
+//
+// +kubebuilder:validation:Minimum=1
+// +kubebuilder:validation:Maximum=65535
+type PortNumber int32
+
+// LabelKey was originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731
+// Duplicated so as not to take an unexpected dependency on gw's API.
+//
+// LabelKey is the key of a label. This is used for validation
+// of maps.
This matches the Kubernetes "qualified name" validation that is used for labels. +// Labels are case sensitive, so: my-label and My-Label are considered distinct. +// +// Valid values include: +// +// * example +// * example.com +// * example.com/path +// * example.com/path.html +// +// Invalid values include: +// +// * example~ - "~" is an invalid character +// * example.com. - can not start or end with "." +// +// +kubebuilder:validation:MinLength=1 +// +kubebuilder:validation:MaxLength=253 +// +kubebuilder:validation:Pattern=`^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?([A-Za-z0-9][-A-Za-z0-9_.]{0,61})?[A-Za-z0-9]$` +type LabelKey string + +// LabelValue is the value of a label. This is used for validation +// of maps. This matches the Kubernetes label validation rules: +// * must be 63 characters or less (can be empty), +// * unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]), +// * could contain dashes (-), underscores (_), dots (.), and alphanumerics between. +// +// Valid values include: +// +// * MyValue +// * my.name +// * 123-my-value +// +// +kubebuilder:validation:MinLength=0 +// +kubebuilder:validation:MaxLength=63 +// +kubebuilder:validation:Pattern=`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$` +type LabelValue string diff --git a/api/v1alpha1/zz_generated.deepcopy.go b/api/v1alpha2/zz_generated.deepcopy.go similarity index 92% rename from api/v1alpha1/zz_generated.deepcopy.go rename to api/v1alpha2/zz_generated.deepcopy.go index fd55379e..3070cdcb 100644 --- a/api/v1alpha1/zz_generated.deepcopy.go +++ b/api/v1alpha2/zz_generated.deepcopy.go @@ -1,7 +1,7 @@ //go:build !ignore_autogenerated /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,7 +18,7 @@ limitations under the License. // Code generated by controller-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -87,17 +87,17 @@ func (in *ExtensionReference) DeepCopyInto(out *ExtensionReference) { *out = *in if in.Group != nil { in, out := &in.Group, &out.Group - *out = new(string) + *out = new(Group) **out = **in } if in.Kind != nil { in, out := &in.Kind, &out.Kind - *out = new(string) + *out = new(Kind) **out = **in } - if in.TargetPortNumber != nil { - in, out := &in.TargetPortNumber, &out.TargetPortNumber - *out = new(int32) + if in.PortNumber != nil { + in, out := &in.PortNumber, &out.PortNumber + *out = new(PortNumber) **out = **in } } @@ -306,9 +306,9 @@ func (in *InferencePoolSpec) DeepCopy() *InferencePoolSpec { // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *InferencePoolStatus) DeepCopyInto(out *InferencePoolStatus) { *out = *in - if in.Conditions != nil { - in, out := &in.Conditions, &out.Conditions - *out = make([]v1.Condition, len(*in)) + if in.Parents != nil { + in, out := &in.Parents, &out.Parents + *out = make([]PoolStatus, len(*in)) for i := range *in { (*in)[i].DeepCopyInto(&(*out)[i]) } @@ -340,6 +340,29 @@ func (in *PoolObjectReference) DeepCopy() *PoolObjectReference { return out } +// DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. 
+func (in *PoolStatus) DeepCopyInto(out *PoolStatus) { + *out = *in + out.GatewayRef = in.GatewayRef + if in.Conditions != nil { + in, out := &in.Conditions, &out.Conditions + *out = make([]v1.Condition, len(*in)) + for i := range *in { + (*in)[i].DeepCopyInto(&(*out)[i]) + } + } +} + +// DeepCopy is an autogenerated deepcopy function, copying the receiver, creating a new PoolStatus. +func (in *PoolStatus) DeepCopy() *PoolStatus { + if in == nil { + return nil + } + out := new(PoolStatus) + in.DeepCopyInto(out) + return out +} + // DeepCopyInto is an autogenerated deepcopy function, copying the receiver, writing into out. in must be non-nil. func (in *TargetModel) DeepCopyInto(out *TargetModel) { *out = *in diff --git a/api/v1alpha2/zz_generated.register.go b/api/v1alpha2/zz_generated.register.go new file mode 100644 index 00000000..07dbf92b --- /dev/null +++ b/api/v1alpha2/zz_generated.register.go @@ -0,0 +1,71 @@ +//go:build !ignore_autogenerated +// +build !ignore_autogenerated + +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ +// Code generated by register-gen. DO NOT EDIT. + +package v1alpha2 + +import ( + v1 "k8s.io/apimachinery/pkg/apis/meta/v1" + runtime "k8s.io/apimachinery/pkg/runtime" + schema "k8s.io/apimachinery/pkg/runtime/schema" +) + +// GroupName specifies the group name used to register the objects. +const GroupName = "inference.networking.x-k8s.io" + +// GroupVersion specifies the group and the version used to register the objects. +var GroupVersion = v1.GroupVersion{Group: GroupName, Version: "v1alpha2"} + +// SchemeGroupVersion is group version used to register these objects +// Deprecated: use GroupVersion instead. +var SchemeGroupVersion = schema.GroupVersion{Group: GroupName, Version: "v1alpha2"} + +// Resource takes an unqualified resource and returns a Group qualified GroupResource +func Resource(resource string) schema.GroupResource { + return SchemeGroupVersion.WithResource(resource).GroupResource() +} + +var ( + // localSchemeBuilder and AddToScheme will stay in k8s.io/kubernetes. + SchemeBuilder runtime.SchemeBuilder + localSchemeBuilder = &SchemeBuilder + // Deprecated: use Install instead + AddToScheme = localSchemeBuilder.AddToScheme + Install = localSchemeBuilder.AddToScheme +) + +func init() { + // We only register manually written functions here. The registration of the + // generated functions takes place in the generated files. The separation + // makes the code compile even when the generated files are missing. + localSchemeBuilder.Register(addKnownTypes) +} + +// Adds the list of known types to Scheme. +func addKnownTypes(scheme *runtime.Scheme) error { + scheme.AddKnownTypes(SchemeGroupVersion, + &InferenceModel{}, + &InferenceModelList{}, + &InferencePool{}, + &InferencePoolList{}, + ) + // AddToGroupVersion allows the serialization of client types like ListOptions. 
+ v1.AddToGroupVersion(scheme, SchemeGroupVersion) + return nil +} diff --git a/bbr.Dockerfile b/bbr.Dockerfile new file mode 100644 index 00000000..03024e49 --- /dev/null +++ b/bbr.Dockerfile @@ -0,0 +1,30 @@ +# Dockerfile has specific requirement to put this ARG at the beginning: +# https://docs.docker.com/engine/reference/builder/#understand-how-arg-and-from-interact +ARG BUILDER_IMAGE=golang:1.23 +ARG BASE_IMAGE=gcr.io/distroless/static:nonroot + +## Multistage build +FROM ${BUILDER_IMAGE} AS builder +ENV CGO_ENABLED=0 +ENV GOOS=linux +ENV GOARCH=amd64 + +# Dependencies +WORKDIR /src +COPY go.mod go.sum ./ +RUN go mod download + +# Sources +COPY cmd ./cmd +COPY pkg ./pkg +COPY internal ./internal +WORKDIR /src/cmd/bbr +RUN go build -o /bbr + +## Multistage deploy +FROM ${BASE_IMAGE} + +WORKDIR / +COPY --from=builder /bbr /bbr + +ENTRYPOINT ["/bbr"] diff --git a/client-go/applyconfiguration/api/v1alpha1/endpointpickerconfig.go b/client-go/applyconfiguration/api/v1alpha2/endpointpickerconfig.go similarity index 96% rename from client-go/applyconfiguration/api/v1alpha1/endpointpickerconfig.go rename to client-go/applyconfiguration/api/v1alpha2/endpointpickerconfig.go index 91895ddc..679cdba8 100644 --- a/client-go/applyconfiguration/api/v1alpha1/endpointpickerconfig.go +++ b/client-go/applyconfiguration/api/v1alpha2/endpointpickerconfig.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 // EndpointPickerConfigApplyConfiguration represents a declarative configuration of the EndpointPickerConfig type for use // with apply. diff --git a/client-go/applyconfiguration/api/v1alpha1/extension.go b/client-go/applyconfiguration/api/v1alpha2/extension.go similarity index 75% rename from client-go/applyconfiguration/api/v1alpha1/extension.go rename to client-go/applyconfiguration/api/v1alpha2/extension.go index 27807448..731467b7 100644 --- a/client-go/applyconfiguration/api/v1alpha1/extension.go +++ b/client-go/applyconfiguration/api/v1alpha2/extension.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,10 +15,10 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // ExtensionApplyConfiguration represents a declarative configuration of the Extension type for use @@ -37,7 +37,7 @@ func Extension() *ExtensionApplyConfiguration { // WithGroup sets the Group field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Group field is set to the value of the last call. 
-func (b *ExtensionApplyConfiguration) WithGroup(value string) *ExtensionApplyConfiguration { +func (b *ExtensionApplyConfiguration) WithGroup(value apiv1alpha2.Group) *ExtensionApplyConfiguration { b.ExtensionReferenceApplyConfiguration.Group = &value return b } @@ -45,7 +45,7 @@ func (b *ExtensionApplyConfiguration) WithGroup(value string) *ExtensionApplyCon // WithKind sets the Kind field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Kind field is set to the value of the last call. -func (b *ExtensionApplyConfiguration) WithKind(value string) *ExtensionApplyConfiguration { +func (b *ExtensionApplyConfiguration) WithKind(value apiv1alpha2.Kind) *ExtensionApplyConfiguration { b.ExtensionReferenceApplyConfiguration.Kind = &value return b } @@ -53,23 +53,23 @@ func (b *ExtensionApplyConfiguration) WithKind(value string) *ExtensionApplyConf // WithName sets the Name field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Name field is set to the value of the last call. -func (b *ExtensionApplyConfiguration) WithName(value string) *ExtensionApplyConfiguration { +func (b *ExtensionApplyConfiguration) WithName(value apiv1alpha2.ObjectName) *ExtensionApplyConfiguration { b.ExtensionReferenceApplyConfiguration.Name = &value return b } -// WithTargetPortNumber sets the TargetPortNumber field in the declarative configuration to the given value +// WithPortNumber sets the PortNumber field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. -// If called multiple times, the TargetPortNumber field is set to the value of the last call. -func (b *ExtensionApplyConfiguration) WithTargetPortNumber(value int32) *ExtensionApplyConfiguration { - b.ExtensionReferenceApplyConfiguration.TargetPortNumber = &value +// If called multiple times, the PortNumber field is set to the value of the last call. +func (b *ExtensionApplyConfiguration) WithPortNumber(value apiv1alpha2.PortNumber) *ExtensionApplyConfiguration { + b.ExtensionReferenceApplyConfiguration.PortNumber = &value return b } // WithFailureMode sets the FailureMode field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the FailureMode field is set to the value of the last call. -func (b *ExtensionApplyConfiguration) WithFailureMode(value apiv1alpha1.ExtensionFailureMode) *ExtensionApplyConfiguration { +func (b *ExtensionApplyConfiguration) WithFailureMode(value apiv1alpha2.ExtensionFailureMode) *ExtensionApplyConfiguration { b.ExtensionConnectionApplyConfiguration.FailureMode = &value return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/extensionconnection.go b/client-go/applyconfiguration/api/v1alpha2/extensionconnection.go similarity index 84% rename from client-go/applyconfiguration/api/v1alpha1/extensionconnection.go rename to client-go/applyconfiguration/api/v1alpha2/extensionconnection.go index be9eeaa1..bd968ec6 100644 --- a/client-go/applyconfiguration/api/v1alpha1/extensionconnection.go +++ b/client-go/applyconfiguration/api/v1alpha2/extensionconnection.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. 
+Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,16 +15,16 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // ExtensionConnectionApplyConfiguration represents a declarative configuration of the ExtensionConnection type for use // with apply. type ExtensionConnectionApplyConfiguration struct { - FailureMode *apiv1alpha1.ExtensionFailureMode `json:"failureMode,omitempty"` + FailureMode *apiv1alpha2.ExtensionFailureMode `json:"failureMode,omitempty"` } // ExtensionConnectionApplyConfiguration constructs a declarative configuration of the ExtensionConnection type for use with @@ -36,7 +36,7 @@ func ExtensionConnection() *ExtensionConnectionApplyConfiguration { // WithFailureMode sets the FailureMode field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the FailureMode field is set to the value of the last call. -func (b *ExtensionConnectionApplyConfiguration) WithFailureMode(value apiv1alpha1.ExtensionFailureMode) *ExtensionConnectionApplyConfiguration { +func (b *ExtensionConnectionApplyConfiguration) WithFailureMode(value apiv1alpha2.ExtensionFailureMode) *ExtensionConnectionApplyConfiguration { b.FailureMode = &value return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/extensionreference.go b/client-go/applyconfiguration/api/v1alpha2/extensionreference.go similarity index 64% rename from client-go/applyconfiguration/api/v1alpha1/extensionreference.go rename to client-go/applyconfiguration/api/v1alpha2/extensionreference.go index c72c0306..4db2dae1 100644 --- a/client-go/applyconfiguration/api/v1alpha1/extensionreference.go +++ b/client-go/applyconfiguration/api/v1alpha2/extensionreference.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,15 +15,19 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 + +import ( + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" +) // ExtensionReferenceApplyConfiguration represents a declarative configuration of the ExtensionReference type for use // with apply. 
type ExtensionReferenceApplyConfiguration struct { - Group *string `json:"group,omitempty"` - Kind *string `json:"kind,omitempty"` - Name *string `json:"name,omitempty"` - TargetPortNumber *int32 `json:"targetPortNumber,omitempty"` + Group *apiv1alpha2.Group `json:"group,omitempty"` + Kind *apiv1alpha2.Kind `json:"kind,omitempty"` + Name *apiv1alpha2.ObjectName `json:"name,omitempty"` + PortNumber *apiv1alpha2.PortNumber `json:"portNumber,omitempty"` } // ExtensionReferenceApplyConfiguration constructs a declarative configuration of the ExtensionReference type for use with @@ -35,7 +39,7 @@ func ExtensionReference() *ExtensionReferenceApplyConfiguration { // WithGroup sets the Group field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Group field is set to the value of the last call. -func (b *ExtensionReferenceApplyConfiguration) WithGroup(value string) *ExtensionReferenceApplyConfiguration { +func (b *ExtensionReferenceApplyConfiguration) WithGroup(value apiv1alpha2.Group) *ExtensionReferenceApplyConfiguration { b.Group = &value return b } @@ -43,7 +47,7 @@ func (b *ExtensionReferenceApplyConfiguration) WithGroup(value string) *Extensio // WithKind sets the Kind field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Kind field is set to the value of the last call. -func (b *ExtensionReferenceApplyConfiguration) WithKind(value string) *ExtensionReferenceApplyConfiguration { +func (b *ExtensionReferenceApplyConfiguration) WithKind(value apiv1alpha2.Kind) *ExtensionReferenceApplyConfiguration { b.Kind = &value return b } @@ -51,15 +55,15 @@ func (b *ExtensionReferenceApplyConfiguration) WithKind(value string) *Extension // WithName sets the Name field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Name field is set to the value of the last call. -func (b *ExtensionReferenceApplyConfiguration) WithName(value string) *ExtensionReferenceApplyConfiguration { +func (b *ExtensionReferenceApplyConfiguration) WithName(value apiv1alpha2.ObjectName) *ExtensionReferenceApplyConfiguration { b.Name = &value return b } -// WithTargetPortNumber sets the TargetPortNumber field in the declarative configuration to the given value +// WithPortNumber sets the PortNumber field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. -// If called multiple times, the TargetPortNumber field is set to the value of the last call. -func (b *ExtensionReferenceApplyConfiguration) WithTargetPortNumber(value int32) *ExtensionReferenceApplyConfiguration { - b.TargetPortNumber = &value +// If called multiple times, the PortNumber field is set to the value of the last call. 
+func (b *ExtensionReferenceApplyConfiguration) WithPortNumber(value apiv1alpha2.PortNumber) *ExtensionReferenceApplyConfiguration { + b.PortNumber = &value return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencemodel.go b/client-go/applyconfiguration/api/v1alpha2/inferencemodel.go similarity index 98% rename from client-go/applyconfiguration/api/v1alpha1/inferencemodel.go rename to client-go/applyconfiguration/api/v1alpha2/inferencemodel.go index b6201467..8c810170 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencemodel.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencemodel.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -39,7 +39,7 @@ func InferenceModel(name, namespace string) *InferenceModelApplyConfiguration { b.WithName(name) b.WithNamespace(namespace) b.WithKind("InferenceModel") - b.WithAPIVersion("api/v1alpha1") + b.WithAPIVersion("inference.networking.x-k8s.io/v1alpha2") return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencemodelspec.go b/client-go/applyconfiguration/api/v1alpha2/inferencemodelspec.go similarity index 92% rename from client-go/applyconfiguration/api/v1alpha1/inferencemodelspec.go rename to client-go/applyconfiguration/api/v1alpha2/inferencemodelspec.go index 9bbdda06..f9b453a4 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencemodelspec.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencemodelspec.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,17 +15,17 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // InferenceModelSpecApplyConfiguration represents a declarative configuration of the InferenceModelSpec type for use // with apply. type InferenceModelSpecApplyConfiguration struct { ModelName *string `json:"modelName,omitempty"` - Criticality *apiv1alpha1.Criticality `json:"criticality,omitempty"` + Criticality *apiv1alpha2.Criticality `json:"criticality,omitempty"` TargetModels []TargetModelApplyConfiguration `json:"targetModels,omitempty"` PoolRef *PoolObjectReferenceApplyConfiguration `json:"poolRef,omitempty"` } @@ -47,7 +47,7 @@ func (b *InferenceModelSpecApplyConfiguration) WithModelName(value string) *Infe // WithCriticality sets the Criticality field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Criticality field is set to the value of the last call. 
-func (b *InferenceModelSpecApplyConfiguration) WithCriticality(value apiv1alpha1.Criticality) *InferenceModelSpecApplyConfiguration { +func (b *InferenceModelSpecApplyConfiguration) WithCriticality(value apiv1alpha2.Criticality) *InferenceModelSpecApplyConfiguration { b.Criticality = &value return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencemodelstatus.go b/client-go/applyconfiguration/api/v1alpha2/inferencemodelstatus.go similarity index 96% rename from client-go/applyconfiguration/api/v1alpha1/inferencemodelstatus.go rename to client-go/applyconfiguration/api/v1alpha2/inferencemodelstatus.go index b0b003bb..4c9e10a9 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencemodelstatus.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencemodelstatus.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( v1 "k8s.io/client-go/applyconfigurations/meta/v1" diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencepool.go b/client-go/applyconfiguration/api/v1alpha2/inferencepool.go similarity index 98% rename from client-go/applyconfiguration/api/v1alpha1/inferencepool.go rename to client-go/applyconfiguration/api/v1alpha2/inferencepool.go index a7f3ed6d..15649a60 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencepool.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencepool.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" @@ -39,7 +39,7 @@ func InferencePool(name, namespace string) *InferencePoolApplyConfiguration { b.WithName(name) b.WithNamespace(namespace) b.WithKind("InferencePool") - b.WithAPIVersion("api/v1alpha1") + b.WithAPIVersion("inference.networking.x-k8s.io/v1alpha2") return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencepoolspec.go b/client-go/applyconfiguration/api/v1alpha2/inferencepoolspec.go similarity index 87% rename from client-go/applyconfiguration/api/v1alpha1/inferencepoolspec.go rename to client-go/applyconfiguration/api/v1alpha2/inferencepoolspec.go index e132f74b..ba0fe3c3 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencepoolspec.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencepoolspec.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,16 +15,16 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. 
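
The extension hunks above swap plain `string` and `int32` fields for the API's named types (`Group`, `Kind`, `ObjectName`, `PortNumber`). As a rough, illustrative sketch of what caller code looks like against the regenerated builders — the name, port, and failure-mode values here are examples, not taken from this diff:

```go
package main

import (
	v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
	v1alpha2ac "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2"
)

// exampleExtension builds an Extension apply configuration with the renamed,
// strongly typed setters; callers of WithTargetPortNumber move to WithPortNumber.
func exampleExtension() *v1alpha2ac.ExtensionApplyConfiguration {
	return v1alpha2ac.Extension().
		WithName(v1alpha2.ObjectName("inference-gateway-ext-proc")). // example name
		WithPortNumber(v1alpha2.PortNumber(9002)).                   // example port
		WithFailureMode(v1alpha2.ExtensionFailureMode("FailClose"))
}
```
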
-package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // InferencePoolSpecApplyConfiguration represents a declarative configuration of the InferencePoolSpec type for use // with apply. type InferencePoolSpecApplyConfiguration struct { - Selector map[apiv1alpha1.LabelKey]apiv1alpha1.LabelValue `json:"selector,omitempty"` + Selector map[apiv1alpha2.LabelKey]apiv1alpha2.LabelValue `json:"selector,omitempty"` TargetPortNumber *int32 `json:"targetPortNumber,omitempty"` EndpointPickerConfigApplyConfiguration `json:",inline"` } @@ -39,9 +39,9 @@ func InferencePoolSpec() *InferencePoolSpecApplyConfiguration { // and returns the receiver, so that objects can be build by chaining "With" function invocations. // If called multiple times, the entries provided by each call will be put on the Selector field, // overwriting an existing map entries in Selector field with the same key. -func (b *InferencePoolSpecApplyConfiguration) WithSelector(entries map[apiv1alpha1.LabelKey]apiv1alpha1.LabelValue) *InferencePoolSpecApplyConfiguration { +func (b *InferencePoolSpecApplyConfiguration) WithSelector(entries map[apiv1alpha2.LabelKey]apiv1alpha2.LabelValue) *InferencePoolSpecApplyConfiguration { if b.Selector == nil && len(entries) > 0 { - b.Selector = make(map[apiv1alpha1.LabelKey]apiv1alpha1.LabelValue, len(entries)) + b.Selector = make(map[apiv1alpha2.LabelKey]apiv1alpha2.LabelValue, len(entries)) } for k, v := range entries { b.Selector[k] = v diff --git a/client-go/applyconfiguration/api/v1alpha1/inferencepoolstatus.go b/client-go/applyconfiguration/api/v1alpha2/inferencepoolstatus.go similarity index 68% rename from client-go/applyconfiguration/api/v1alpha1/inferencepoolstatus.go rename to client-go/applyconfiguration/api/v1alpha2/inferencepoolstatus.go index f61a81b3..daf3be20 100644 --- a/client-go/applyconfiguration/api/v1alpha1/inferencepoolstatus.go +++ b/client-go/applyconfiguration/api/v1alpha2/inferencepoolstatus.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,16 +15,12 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 - -import ( - v1 "k8s.io/client-go/applyconfigurations/meta/v1" -) +package v1alpha2 // InferencePoolStatusApplyConfiguration represents a declarative configuration of the InferencePoolStatus type for use // with apply. type InferencePoolStatusApplyConfiguration struct { - Conditions []v1.ConditionApplyConfiguration `json:"conditions,omitempty"` + Parents []PoolStatusApplyConfiguration `json:"parent,omitempty"` } // InferencePoolStatusApplyConfiguration constructs a declarative configuration of the InferencePoolStatus type for use with @@ -33,15 +29,15 @@ func InferencePoolStatus() *InferencePoolStatusApplyConfiguration { return &InferencePoolStatusApplyConfiguration{} } -// WithConditions adds the given value to the Conditions field in the declarative configuration +// WithParents adds the given value to the Parents field in the declarative configuration // and returns the receiver, so that objects can be build by chaining "With" function invocations. -// If called multiple times, values provided by each call will be appended to the Conditions field. 
-func (b *InferencePoolStatusApplyConfiguration) WithConditions(values ...*v1.ConditionApplyConfiguration) *InferencePoolStatusApplyConfiguration { +// If called multiple times, values provided by each call will be appended to the Parents field. +func (b *InferencePoolStatusApplyConfiguration) WithParents(values ...*PoolStatusApplyConfiguration) *InferencePoolStatusApplyConfiguration { for i := range values { if values[i] == nil { - panic("nil value passed to WithConditions") + panic("nil value passed to WithParents") } - b.Conditions = append(b.Conditions, *values[i]) + b.Parents = append(b.Parents, *values[i]) } return b } diff --git a/client-go/applyconfiguration/api/v1alpha1/poolobjectreference.go b/client-go/applyconfiguration/api/v1alpha2/poolobjectreference.go similarity index 76% rename from client-go/applyconfiguration/api/v1alpha1/poolobjectreference.go rename to client-go/applyconfiguration/api/v1alpha2/poolobjectreference.go index 692a185e..7227560e 100644 --- a/client-go/applyconfiguration/api/v1alpha1/poolobjectreference.go +++ b/client-go/applyconfiguration/api/v1alpha2/poolobjectreference.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,14 +15,18 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 + +import ( + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" +) // PoolObjectReferenceApplyConfiguration represents a declarative configuration of the PoolObjectReference type for use // with apply. type PoolObjectReferenceApplyConfiguration struct { - Group *string `json:"group,omitempty"` - Kind *string `json:"kind,omitempty"` - Name *string `json:"name,omitempty"` + Group *apiv1alpha2.Group `json:"group,omitempty"` + Kind *apiv1alpha2.Kind `json:"kind,omitempty"` + Name *apiv1alpha2.ObjectName `json:"name,omitempty"` } // PoolObjectReferenceApplyConfiguration constructs a declarative configuration of the PoolObjectReference type for use with @@ -34,7 +38,7 @@ func PoolObjectReference() *PoolObjectReferenceApplyConfiguration { // WithGroup sets the Group field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Group field is set to the value of the last call. -func (b *PoolObjectReferenceApplyConfiguration) WithGroup(value string) *PoolObjectReferenceApplyConfiguration { +func (b *PoolObjectReferenceApplyConfiguration) WithGroup(value apiv1alpha2.Group) *PoolObjectReferenceApplyConfiguration { b.Group = &value return b } @@ -42,7 +46,7 @@ func (b *PoolObjectReferenceApplyConfiguration) WithGroup(value string) *PoolObj // WithKind sets the Kind field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Kind field is set to the value of the last call. 
-func (b *PoolObjectReferenceApplyConfiguration) WithKind(value string) *PoolObjectReferenceApplyConfiguration { +func (b *PoolObjectReferenceApplyConfiguration) WithKind(value apiv1alpha2.Kind) *PoolObjectReferenceApplyConfiguration { b.Kind = &value return b } @@ -50,7 +54,7 @@ func (b *PoolObjectReferenceApplyConfiguration) WithKind(value string) *PoolObje // WithName sets the Name field in the declarative configuration to the given value // and returns the receiver, so that objects can be built by chaining "With" function invocations. // If called multiple times, the Name field is set to the value of the last call. -func (b *PoolObjectReferenceApplyConfiguration) WithName(value string) *PoolObjectReferenceApplyConfiguration { +func (b *PoolObjectReferenceApplyConfiguration) WithName(value apiv1alpha2.ObjectName) *PoolObjectReferenceApplyConfiguration { b.Name = &value return b } diff --git a/client-go/applyconfiguration/api/v1alpha2/poolstatus.go b/client-go/applyconfiguration/api/v1alpha2/poolstatus.go new file mode 100644 index 00000000..9d7d7294 --- /dev/null +++ b/client-go/applyconfiguration/api/v1alpha2/poolstatus.go @@ -0,0 +1,57 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ +// Code generated by applyconfiguration-gen. DO NOT EDIT. + +package v1alpha2 + +import ( + v1 "k8s.io/api/core/v1" + metav1 "k8s.io/client-go/applyconfigurations/meta/v1" +) + +// PoolStatusApplyConfiguration represents a declarative configuration of the PoolStatus type for use +// with apply. +type PoolStatusApplyConfiguration struct { + GatewayRef *v1.ObjectReference `json:"parentRef,omitempty"` + Conditions []metav1.ConditionApplyConfiguration `json:"conditions,omitempty"` +} + +// PoolStatusApplyConfiguration constructs a declarative configuration of the PoolStatus type for use with +// apply. +func PoolStatus() *PoolStatusApplyConfiguration { + return &PoolStatusApplyConfiguration{} +} + +// WithGatewayRef sets the GatewayRef field in the declarative configuration to the given value +// and returns the receiver, so that objects can be built by chaining "With" function invocations. +// If called multiple times, the GatewayRef field is set to the value of the last call. +func (b *PoolStatusApplyConfiguration) WithGatewayRef(value v1.ObjectReference) *PoolStatusApplyConfiguration { + b.GatewayRef = &value + return b +} + +// WithConditions adds the given value to the Conditions field in the declarative configuration +// and returns the receiver, so that objects can be build by chaining "With" function invocations. +// If called multiple times, values provided by each call will be appended to the Conditions field. 
+func (b *PoolStatusApplyConfiguration) WithConditions(values ...*metav1.ConditionApplyConfiguration) *PoolStatusApplyConfiguration { + for i := range values { + if values[i] == nil { + panic("nil value passed to WithConditions") + } + b.Conditions = append(b.Conditions, *values[i]) + } + return b +} diff --git a/client-go/applyconfiguration/api/v1alpha1/targetmodel.go b/client-go/applyconfiguration/api/v1alpha2/targetmodel.go similarity index 97% rename from client-go/applyconfiguration/api/v1alpha1/targetmodel.go rename to client-go/applyconfiguration/api/v1alpha2/targetmodel.go index f6ac83f8..1c9277fa 100644 --- a/client-go/applyconfiguration/api/v1alpha1/targetmodel.go +++ b/client-go/applyconfiguration/api/v1alpha2/targetmodel.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by applyconfiguration-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 // TargetModelApplyConfiguration represents a declarative configuration of the TargetModel type for use // with apply. diff --git a/client-go/applyconfiguration/internal/internal.go b/client-go/applyconfiguration/internal/internal.go index 756160bd..e1bbb864 100644 --- a/client-go/applyconfiguration/internal/internal.go +++ b/client-go/applyconfiguration/internal/internal.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/client-go/applyconfiguration/utils.go b/client-go/applyconfiguration/utils.go index 1a71b674..cec3969a 100644 --- a/client-go/applyconfiguration/utils.go +++ b/client-go/applyconfiguration/utils.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,43 +18,45 @@ limitations under the License. package applyconfiguration import ( - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha1" - internal "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/internal" runtime "k8s.io/apimachinery/pkg/runtime" schema "k8s.io/apimachinery/pkg/runtime/schema" testing "k8s.io/client-go/testing" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2" + internal "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/internal" ) // ForKind returns an apply configuration type for the given GroupVersionKind, or nil if no // apply configuration type exists for the given GroupVersionKind. 
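
The new `PoolStatus` apply configuration above pairs a parent (gateway) reference with its conditions, replacing the flat `Conditions` slice that `InferencePoolStatus` carried in v1alpha1. A minimal sketch of assembling a status under the new shape, assuming the standard client-go condition builder — the gateway name and condition values are illustrative only:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1ac "k8s.io/client-go/applyconfigurations/meta/v1"
	v1alpha2ac "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2"
)

// examplePoolStatus nests per-parent conditions under WithParents, which is
// where former WithConditions callers on InferencePoolStatus now land.
func examplePoolStatus() *v1alpha2ac.InferencePoolStatusApplyConfiguration {
	parent := v1alpha2ac.PoolStatus().
		WithGatewayRef(corev1.ObjectReference{Kind: "Gateway", Name: "my-gateway"}).
		WithConditions(metav1ac.Condition().
			WithType("Accepted").
			WithStatus("True").
			WithReason("Accepted"))
	return v1alpha2ac.InferencePoolStatus().WithParents(parent)
}
```
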
func ForKind(kind schema.GroupVersionKind) interface{} { switch kind { - // Group=api, Version=v1alpha1 - case v1alpha1.SchemeGroupVersion.WithKind("EndpointPickerConfig"): - return &apiv1alpha1.EndpointPickerConfigApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("Extension"): - return &apiv1alpha1.ExtensionApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("ExtensionConnection"): - return &apiv1alpha1.ExtensionConnectionApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("ExtensionReference"): - return &apiv1alpha1.ExtensionReferenceApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferenceModel"): - return &apiv1alpha1.InferenceModelApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferenceModelSpec"): - return &apiv1alpha1.InferenceModelSpecApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferenceModelStatus"): - return &apiv1alpha1.InferenceModelStatusApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferencePool"): - return &apiv1alpha1.InferencePoolApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferencePoolSpec"): - return &apiv1alpha1.InferencePoolSpecApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("InferencePoolStatus"): - return &apiv1alpha1.InferencePoolStatusApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("PoolObjectReference"): - return &apiv1alpha1.PoolObjectReferenceApplyConfiguration{} - case v1alpha1.SchemeGroupVersion.WithKind("TargetModel"): - return &apiv1alpha1.TargetModelApplyConfiguration{} + // Group=inference.networking.x-k8s.io, Version=v1alpha2 + case v1alpha2.SchemeGroupVersion.WithKind("EndpointPickerConfig"): + return &apiv1alpha2.EndpointPickerConfigApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("Extension"): + return &apiv1alpha2.ExtensionApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("ExtensionConnection"): + return &apiv1alpha2.ExtensionConnectionApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("ExtensionReference"): + return &apiv1alpha2.ExtensionReferenceApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferenceModel"): + return &apiv1alpha2.InferenceModelApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferenceModelSpec"): + return &apiv1alpha2.InferenceModelSpecApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferenceModelStatus"): + return &apiv1alpha2.InferenceModelStatusApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferencePool"): + return &apiv1alpha2.InferencePoolApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferencePoolSpec"): + return &apiv1alpha2.InferencePoolSpecApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("InferencePoolStatus"): + return &apiv1alpha2.InferencePoolStatusApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("PoolObjectReference"): + return &apiv1alpha2.PoolObjectReferenceApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("PoolStatus"): + return &apiv1alpha2.PoolStatusApplyConfiguration{} + case v1alpha2.SchemeGroupVersion.WithKind("TargetModel"): + return &apiv1alpha2.TargetModelApplyConfiguration{} } return nil diff --git a/client-go/clientset/versioned/clientset.go b/client-go/clientset/versioned/clientset.go index 18e3236a..9ed7187b 100644 --- a/client-go/clientset/versioned/clientset.go +++ b/client-go/clientset/versioned/clientset.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The 
Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -21,26 +21,26 @@ import ( fmt "fmt" http "net/http" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1" discovery "k8s.io/client-go/discovery" rest "k8s.io/client-go/rest" flowcontrol "k8s.io/client-go/util/flowcontrol" + inferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2" ) type Interface interface { Discovery() discovery.DiscoveryInterface - ApiV1alpha1() apiv1alpha1.ApiV1alpha1Interface + InferenceV1alpha2() inferencev1alpha2.InferenceV1alpha2Interface } // Clientset contains the clients for groups. type Clientset struct { *discovery.DiscoveryClient - apiV1alpha1 *apiv1alpha1.ApiV1alpha1Client + inferenceV1alpha2 *inferencev1alpha2.InferenceV1alpha2Client } -// ApiV1alpha1 retrieves the ApiV1alpha1Client -func (c *Clientset) ApiV1alpha1() apiv1alpha1.ApiV1alpha1Interface { - return c.apiV1alpha1 +// InferenceV1alpha2 retrieves the InferenceV1alpha2Client +func (c *Clientset) InferenceV1alpha2() inferencev1alpha2.InferenceV1alpha2Interface { + return c.inferenceV1alpha2 } // Discovery retrieves the DiscoveryClient @@ -87,7 +87,7 @@ func NewForConfigAndClient(c *rest.Config, httpClient *http.Client) (*Clientset, var cs Clientset var err error - cs.apiV1alpha1, err = apiv1alpha1.NewForConfigAndClient(&configShallowCopy, httpClient) + cs.inferenceV1alpha2, err = inferencev1alpha2.NewForConfigAndClient(&configShallowCopy, httpClient) if err != nil { return nil, err } @@ -112,7 +112,7 @@ func NewForConfigOrDie(c *rest.Config) *Clientset { // New creates a new Clientset for the given RESTClient. func New(c rest.Interface) *Clientset { var cs Clientset - cs.apiV1alpha1 = apiv1alpha1.New(c) + cs.inferenceV1alpha2 = inferencev1alpha2.New(c) cs.DiscoveryClient = discovery.NewDiscoveryClient(c) return &cs diff --git a/client-go/clientset/versioned/fake/clientset_generated.go b/client-go/clientset/versioned/fake/clientset_generated.go index dda29ec6..f2f42110 100644 --- a/client-go/clientset/versioned/fake/clientset_generated.go +++ b/client-go/clientset/versioned/fake/clientset_generated.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,15 +18,15 @@ limitations under the License. 
package fake import ( - applyconfiguration "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration" - clientset "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1" - fakeapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1/fake" "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/watch" "k8s.io/client-go/discovery" fakediscovery "k8s.io/client-go/discovery/fake" "k8s.io/client-go/testing" + applyconfiguration "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration" + clientset "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" + inferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2" + fakeinferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2/fake" ) // NewSimpleClientset returns a clientset that will respond with the provided objects. @@ -115,7 +115,7 @@ var ( _ testing.FakeClient = &Clientset{} ) -// ApiV1alpha1 retrieves the ApiV1alpha1Client -func (c *Clientset) ApiV1alpha1() apiv1alpha1.ApiV1alpha1Interface { - return &fakeapiv1alpha1.FakeApiV1alpha1{Fake: &c.Fake} +// InferenceV1alpha2 retrieves the InferenceV1alpha2Client +func (c *Clientset) InferenceV1alpha2() inferencev1alpha2.InferenceV1alpha2Interface { + return &fakeinferencev1alpha2.FakeInferenceV1alpha2{Fake: &c.Fake} } diff --git a/client-go/clientset/versioned/fake/doc.go b/client-go/clientset/versioned/fake/doc.go index 634bd02c..0f3cdf28 100644 --- a/client-go/clientset/versioned/fake/doc.go +++ b/client-go/clientset/versioned/fake/doc.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/client-go/clientset/versioned/fake/register.go b/client-go/clientset/versioned/fake/register.go index f252a096..0966faea 100644 --- a/client-go/clientset/versioned/fake/register.go +++ b/client-go/clientset/versioned/fake/register.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,19 +18,19 @@ limitations under the License. package fake import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" schema "k8s.io/apimachinery/pkg/runtime/schema" serializer "k8s.io/apimachinery/pkg/runtime/serializer" utilruntime "k8s.io/apimachinery/pkg/util/runtime" + inferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) var scheme = runtime.NewScheme() var codecs = serializer.NewCodecFactory(scheme) var localSchemeBuilder = runtime.SchemeBuilder{ - apiv1alpha1.AddToScheme, + inferencev1alpha2.AddToScheme, } // AddToScheme adds all types of this clientset into the given scheme. 
This allows composition diff --git a/client-go/clientset/versioned/scheme/doc.go b/client-go/clientset/versioned/scheme/doc.go index 40e42c29..a3e95ed2 100644 --- a/client-go/clientset/versioned/scheme/doc.go +++ b/client-go/clientset/versioned/scheme/doc.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/client-go/clientset/versioned/scheme/register.go b/client-go/clientset/versioned/scheme/register.go index 6e243827..1e4975e5 100644 --- a/client-go/clientset/versioned/scheme/register.go +++ b/client-go/clientset/versioned/scheme/register.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,19 +18,19 @@ limitations under the License. package scheme import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" schema "k8s.io/apimachinery/pkg/runtime/schema" serializer "k8s.io/apimachinery/pkg/runtime/serializer" utilruntime "k8s.io/apimachinery/pkg/util/runtime" + inferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) var Scheme = runtime.NewScheme() var Codecs = serializer.NewCodecFactory(Scheme) var ParameterCodec = runtime.NewParameterCodec(Scheme) var localSchemeBuilder = runtime.SchemeBuilder{ - apiv1alpha1.AddToScheme, + inferencev1alpha2.AddToScheme, } // AddToScheme adds all types of this clientset into the given scheme. This allows composition diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencemodel.go b/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencemodel.go deleted file mode 100644 index e33b311d..00000000 --- a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencemodel.go +++ /dev/null @@ -1,52 +0,0 @@ -/* -Copyright 2024 The Kubernetes Authors. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -*/ -// Code generated by client-gen. DO NOT EDIT. 
- -package fake - -import ( - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha1" - typedapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1" - gentype "k8s.io/client-go/gentype" -) - -// fakeInferenceModels implements InferenceModelInterface -type fakeInferenceModels struct { - *gentype.FakeClientWithListAndApply[*v1alpha1.InferenceModel, *v1alpha1.InferenceModelList, *apiv1alpha1.InferenceModelApplyConfiguration] - Fake *FakeApiV1alpha1 -} - -func newFakeInferenceModels(fake *FakeApiV1alpha1, namespace string) typedapiv1alpha1.InferenceModelInterface { - return &fakeInferenceModels{ - gentype.NewFakeClientWithListAndApply[*v1alpha1.InferenceModel, *v1alpha1.InferenceModelList, *apiv1alpha1.InferenceModelApplyConfiguration]( - fake.Fake, - namespace, - v1alpha1.SchemeGroupVersion.WithResource("inferencemodels"), - v1alpha1.SchemeGroupVersion.WithKind("InferenceModel"), - func() *v1alpha1.InferenceModel { return &v1alpha1.InferenceModel{} }, - func() *v1alpha1.InferenceModelList { return &v1alpha1.InferenceModelList{} }, - func(dst, src *v1alpha1.InferenceModelList) { dst.ListMeta = src.ListMeta }, - func(list *v1alpha1.InferenceModelList) []*v1alpha1.InferenceModel { - return gentype.ToPointerSlice(list.Items) - }, - func(list *v1alpha1.InferenceModelList, items []*v1alpha1.InferenceModel) { - list.Items = gentype.FromPointerSlice(items) - }, - ), - fake, - } -} diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencepool.go b/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencepool.go deleted file mode 100644 index 92bc5cbe..00000000 --- a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_inferencepool.go +++ /dev/null @@ -1,52 +0,0 @@ -/* -Copyright 2024 The Kubernetes Authors. - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and -limitations under the License. -*/ -// Code generated by client-gen. DO NOT EDIT. 
- -package fake - -import ( - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha1" - typedapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1" - gentype "k8s.io/client-go/gentype" -) - -// fakeInferencePools implements InferencePoolInterface -type fakeInferencePools struct { - *gentype.FakeClientWithListAndApply[*v1alpha1.InferencePool, *v1alpha1.InferencePoolList, *apiv1alpha1.InferencePoolApplyConfiguration] - Fake *FakeApiV1alpha1 -} - -func newFakeInferencePools(fake *FakeApiV1alpha1, namespace string) typedapiv1alpha1.InferencePoolInterface { - return &fakeInferencePools{ - gentype.NewFakeClientWithListAndApply[*v1alpha1.InferencePool, *v1alpha1.InferencePoolList, *apiv1alpha1.InferencePoolApplyConfiguration]( - fake.Fake, - namespace, - v1alpha1.SchemeGroupVersion.WithResource("inferencepools"), - v1alpha1.SchemeGroupVersion.WithKind("InferencePool"), - func() *v1alpha1.InferencePool { return &v1alpha1.InferencePool{} }, - func() *v1alpha1.InferencePoolList { return &v1alpha1.InferencePoolList{} }, - func(dst, src *v1alpha1.InferencePoolList) { dst.ListMeta = src.ListMeta }, - func(list *v1alpha1.InferencePoolList) []*v1alpha1.InferencePool { - return gentype.ToPointerSlice(list.Items) - }, - func(list *v1alpha1.InferencePoolList, items []*v1alpha1.InferencePool) { - list.Items = gentype.FromPointerSlice(items) - }, - ), - fake, - } -} diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/api_client.go b/client-go/clientset/versioned/typed/api/v1alpha2/api_client.go similarity index 59% rename from client-go/clientset/versioned/typed/api/v1alpha1/api_client.go rename to client-go/clientset/versioned/typed/api/v1alpha2/api_client.go index 84a4a0bb..16c14453 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/api_client.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/api_client.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,39 +15,39 @@ limitations under the License. */ // Code generated by client-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( http "net/http" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - scheme "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" rest "k8s.io/client-go/rest" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + scheme "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" ) -type ApiV1alpha1Interface interface { +type InferenceV1alpha2Interface interface { RESTClient() rest.Interface InferenceModelsGetter InferencePoolsGetter } -// ApiV1alpha1Client is used to interact with features provided by the api group. -type ApiV1alpha1Client struct { +// InferenceV1alpha2Client is used to interact with features provided by the inference.networking.x-k8s.io group. 
+type InferenceV1alpha2Client struct { restClient rest.Interface } -func (c *ApiV1alpha1Client) InferenceModels(namespace string) InferenceModelInterface { +func (c *InferenceV1alpha2Client) InferenceModels(namespace string) InferenceModelInterface { return newInferenceModels(c, namespace) } -func (c *ApiV1alpha1Client) InferencePools(namespace string) InferencePoolInterface { +func (c *InferenceV1alpha2Client) InferencePools(namespace string) InferencePoolInterface { return newInferencePools(c, namespace) } -// NewForConfig creates a new ApiV1alpha1Client for the given config. +// NewForConfig creates a new InferenceV1alpha2Client for the given config. // NewForConfig is equivalent to NewForConfigAndClient(c, httpClient), // where httpClient was generated with rest.HTTPClientFor(c). -func NewForConfig(c *rest.Config) (*ApiV1alpha1Client, error) { +func NewForConfig(c *rest.Config) (*InferenceV1alpha2Client, error) { config := *c if err := setConfigDefaults(&config); err != nil { return nil, err @@ -59,9 +59,9 @@ func NewForConfig(c *rest.Config) (*ApiV1alpha1Client, error) { return NewForConfigAndClient(&config, httpClient) } -// NewForConfigAndClient creates a new ApiV1alpha1Client for the given config and http client. +// NewForConfigAndClient creates a new InferenceV1alpha2Client for the given config and http client. // Note the http client provided takes precedence over the configured transport values. -func NewForConfigAndClient(c *rest.Config, h *http.Client) (*ApiV1alpha1Client, error) { +func NewForConfigAndClient(c *rest.Config, h *http.Client) (*InferenceV1alpha2Client, error) { config := *c if err := setConfigDefaults(&config); err != nil { return nil, err @@ -70,12 +70,12 @@ func NewForConfigAndClient(c *rest.Config, h *http.Client) (*ApiV1alpha1Client, if err != nil { return nil, err } - return &ApiV1alpha1Client{client}, nil + return &InferenceV1alpha2Client{client}, nil } -// NewForConfigOrDie creates a new ApiV1alpha1Client for the given config and +// NewForConfigOrDie creates a new InferenceV1alpha2Client for the given config and // panics if there is an error in the config. -func NewForConfigOrDie(c *rest.Config) *ApiV1alpha1Client { +func NewForConfigOrDie(c *rest.Config) *InferenceV1alpha2Client { client, err := NewForConfig(c) if err != nil { panic(err) @@ -83,13 +83,13 @@ func NewForConfigOrDie(c *rest.Config) *ApiV1alpha1Client { return client } -// New creates a new ApiV1alpha1Client for the given RESTClient. -func New(c rest.Interface) *ApiV1alpha1Client { - return &ApiV1alpha1Client{c} +// New creates a new InferenceV1alpha2Client for the given RESTClient. +func New(c rest.Interface) *InferenceV1alpha2Client { + return &InferenceV1alpha2Client{c} } func setConfigDefaults(config *rest.Config) error { - gv := apiv1alpha1.SchemeGroupVersion + gv := apiv1alpha2.SchemeGroupVersion config.GroupVersion = &gv config.APIPath = "/apis" config.NegotiatedSerializer = rest.CodecFactoryForGeneratedClient(scheme.Scheme, scheme.Codecs).WithoutConversion() @@ -103,7 +103,7 @@ func setConfigDefaults(config *rest.Config) error { // RESTClient returns a RESTClient that is used to communicate // with API server by this client implementation. 
-func (c *ApiV1alpha1Client) RESTClient() rest.Interface { +func (c *InferenceV1alpha2Client) RESTClient() rest.Interface { if c == nil { return nil } diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/doc.go b/client-go/clientset/versioned/typed/api/v1alpha2/doc.go similarity index 91% rename from client-go/clientset/versioned/typed/api/v1alpha1/doc.go rename to client-go/clientset/versioned/typed/api/v1alpha2/doc.go index 28991e22..0240168e 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/doc.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/doc.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -16,4 +16,4 @@ limitations under the License. // Code generated by client-gen. DO NOT EDIT. // This package has the automatically generated typed clients. -package v1alpha1 +package v1alpha2 diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/fake/doc.go b/client-go/clientset/versioned/typed/api/v1alpha2/fake/doc.go similarity index 94% rename from client-go/clientset/versioned/typed/api/v1alpha1/fake/doc.go rename to client-go/clientset/versioned/typed/api/v1alpha2/fake/doc.go index fbfccbb9..01839331 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/fake/doc.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/fake/doc.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_api_client.go b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_api_client.go similarity index 67% rename from client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_api_client.go rename to client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_api_client.go index d5dbc1a8..5bd7fd40 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/fake/fake_api_client.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_api_client.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,26 +18,26 @@ limitations under the License. package fake import ( - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha1" rest "k8s.io/client-go/rest" testing "k8s.io/client-go/testing" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2" ) -type FakeApiV1alpha1 struct { +type FakeInferenceV1alpha2 struct { *testing.Fake } -func (c *FakeApiV1alpha1) InferenceModels(namespace string) v1alpha1.InferenceModelInterface { +func (c *FakeInferenceV1alpha2) InferenceModels(namespace string) v1alpha2.InferenceModelInterface { return newFakeInferenceModels(c, namespace) } -func (c *FakeApiV1alpha1) InferencePools(namespace string) v1alpha1.InferencePoolInterface { +func (c *FakeInferenceV1alpha2) InferencePools(namespace string) v1alpha2.InferencePoolInterface { return newFakeInferencePools(c, namespace) } // RESTClient returns a RESTClient that is used to communicate // with API server by this client implementation. 
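
The renamed constructors above keep the usual client-go shape, so the consumer-side migration is mostly mechanical: `cs.ApiV1alpha1()` becomes `cs.InferenceV1alpha2()`. A hedged sketch of a caller after the rename — the kubeconfig plumbing and namespace are illustrative, not prescribed by this diff:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	versioned "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned"
)

// listPools shows the post-rename call path into the typed client.
func listPools(kubeconfig string) error {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	cs, err := versioned.NewForConfig(cfg)
	if err != nil {
		return err
	}
	_, err = cs.InferenceV1alpha2().InferencePools("default").List(context.TODO(), metav1.ListOptions{})
	return err
}
```
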
-func (c *FakeApiV1alpha1) RESTClient() rest.Interface { +func (c *FakeInferenceV1alpha2) RESTClient() rest.Interface { var ret *rest.RESTClient return ret } diff --git a/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencemodel.go b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencemodel.go new file mode 100644 index 00000000..50f78c52 --- /dev/null +++ b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencemodel.go @@ -0,0 +1,52 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ +// Code generated by client-gen. DO NOT EDIT. + +package fake + +import ( + gentype "k8s.io/client-go/gentype" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2" + typedapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2" +) + +// fakeInferenceModels implements InferenceModelInterface +type fakeInferenceModels struct { + *gentype.FakeClientWithListAndApply[*v1alpha2.InferenceModel, *v1alpha2.InferenceModelList, *apiv1alpha2.InferenceModelApplyConfiguration] + Fake *FakeInferenceV1alpha2 +} + +func newFakeInferenceModels(fake *FakeInferenceV1alpha2, namespace string) typedapiv1alpha2.InferenceModelInterface { + return &fakeInferenceModels{ + gentype.NewFakeClientWithListAndApply[*v1alpha2.InferenceModel, *v1alpha2.InferenceModelList, *apiv1alpha2.InferenceModelApplyConfiguration]( + fake.Fake, + namespace, + v1alpha2.SchemeGroupVersion.WithResource("inferencemodels"), + v1alpha2.SchemeGroupVersion.WithKind("InferenceModel"), + func() *v1alpha2.InferenceModel { return &v1alpha2.InferenceModel{} }, + func() *v1alpha2.InferenceModelList { return &v1alpha2.InferenceModelList{} }, + func(dst, src *v1alpha2.InferenceModelList) { dst.ListMeta = src.ListMeta }, + func(list *v1alpha2.InferenceModelList) []*v1alpha2.InferenceModel { + return gentype.ToPointerSlice(list.Items) + }, + func(list *v1alpha2.InferenceModelList, items []*v1alpha2.InferenceModel) { + list.Items = gentype.FromPointerSlice(items) + }, + ), + fake, + } +} diff --git a/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencepool.go b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencepool.go new file mode 100644 index 00000000..a7f6a185 --- /dev/null +++ b/client-go/clientset/versioned/typed/api/v1alpha2/fake/fake_inferencepool.go @@ -0,0 +1,52 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. +*/ +// Code generated by client-gen. DO NOT EDIT. + +package fake + +import ( + gentype "k8s.io/client-go/gentype" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2" + typedapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/typed/api/v1alpha2" +) + +// fakeInferencePools implements InferencePoolInterface +type fakeInferencePools struct { + *gentype.FakeClientWithListAndApply[*v1alpha2.InferencePool, *v1alpha2.InferencePoolList, *apiv1alpha2.InferencePoolApplyConfiguration] + Fake *FakeInferenceV1alpha2 +} + +func newFakeInferencePools(fake *FakeInferenceV1alpha2, namespace string) typedapiv1alpha2.InferencePoolInterface { + return &fakeInferencePools{ + gentype.NewFakeClientWithListAndApply[*v1alpha2.InferencePool, *v1alpha2.InferencePoolList, *apiv1alpha2.InferencePoolApplyConfiguration]( + fake.Fake, + namespace, + v1alpha2.SchemeGroupVersion.WithResource("inferencepools"), + v1alpha2.SchemeGroupVersion.WithKind("InferencePool"), + func() *v1alpha2.InferencePool { return &v1alpha2.InferencePool{} }, + func() *v1alpha2.InferencePoolList { return &v1alpha2.InferencePoolList{} }, + func(dst, src *v1alpha2.InferencePoolList) { dst.ListMeta = src.ListMeta }, + func(list *v1alpha2.InferencePoolList) []*v1alpha2.InferencePool { + return gentype.ToPointerSlice(list.Items) + }, + func(list *v1alpha2.InferencePoolList, items []*v1alpha2.InferencePool) { + list.Items = gentype.FromPointerSlice(items) + }, + ), + fake, + } +} diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/generated_expansion.go b/client-go/clientset/versioned/typed/api/v1alpha2/generated_expansion.go similarity index 92% rename from client-go/clientset/versioned/typed/api/v1alpha1/generated_expansion.go rename to client-go/clientset/versioned/typed/api/v1alpha2/generated_expansion.go index 65c88eb1..1b9be99f 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/generated_expansion.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/generated_expansion.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by client-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 type InferenceModelExpansion interface{} diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/inferencemodel.go b/client-go/clientset/versioned/typed/api/v1alpha2/inferencemodel.go similarity index 58% rename from client-go/clientset/versioned/typed/api/v1alpha1/inferencemodel.go rename to client-go/clientset/versioned/typed/api/v1alpha2/inferencemodel.go index 1f5315ad..c5fb5c3d 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/inferencemodel.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/inferencemodel.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,18 +15,18 @@ limitations under the License. */ // Code generated by client-gen. DO NOT EDIT. 
-package v1alpha1 +package v1alpha2 import ( context "context" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - applyconfigurationapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha1" - scheme "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" types "k8s.io/apimachinery/pkg/types" watch "k8s.io/apimachinery/pkg/watch" gentype "k8s.io/client-go/gentype" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + applyconfigurationapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2" + scheme "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" ) // InferenceModelsGetter has a method to return a InferenceModelInterface. @@ -37,37 +37,37 @@ type InferenceModelsGetter interface { // InferenceModelInterface has methods to work with InferenceModel resources. type InferenceModelInterface interface { - Create(ctx context.Context, inferenceModel *apiv1alpha1.InferenceModel, opts v1.CreateOptions) (*apiv1alpha1.InferenceModel, error) - Update(ctx context.Context, inferenceModel *apiv1alpha1.InferenceModel, opts v1.UpdateOptions) (*apiv1alpha1.InferenceModel, error) + Create(ctx context.Context, inferenceModel *apiv1alpha2.InferenceModel, opts v1.CreateOptions) (*apiv1alpha2.InferenceModel, error) + Update(ctx context.Context, inferenceModel *apiv1alpha2.InferenceModel, opts v1.UpdateOptions) (*apiv1alpha2.InferenceModel, error) // Add a +genclient:noStatus comment above the type to avoid generating UpdateStatus(). - UpdateStatus(ctx context.Context, inferenceModel *apiv1alpha1.InferenceModel, opts v1.UpdateOptions) (*apiv1alpha1.InferenceModel, error) + UpdateStatus(ctx context.Context, inferenceModel *apiv1alpha2.InferenceModel, opts v1.UpdateOptions) (*apiv1alpha2.InferenceModel, error) Delete(ctx context.Context, name string, opts v1.DeleteOptions) error DeleteCollection(ctx context.Context, opts v1.DeleteOptions, listOpts v1.ListOptions) error - Get(ctx context.Context, name string, opts v1.GetOptions) (*apiv1alpha1.InferenceModel, error) - List(ctx context.Context, opts v1.ListOptions) (*apiv1alpha1.InferenceModelList, error) + Get(ctx context.Context, name string, opts v1.GetOptions) (*apiv1alpha2.InferenceModel, error) + List(ctx context.Context, opts v1.ListOptions) (*apiv1alpha2.InferenceModelList, error) Watch(ctx context.Context, opts v1.ListOptions) (watch.Interface, error) - Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts v1.PatchOptions, subresources ...string) (result *apiv1alpha1.InferenceModel, err error) - Apply(ctx context.Context, inferenceModel *applyconfigurationapiv1alpha1.InferenceModelApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha1.InferenceModel, err error) + Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts v1.PatchOptions, subresources ...string) (result *apiv1alpha2.InferenceModel, err error) + Apply(ctx context.Context, inferenceModel *applyconfigurationapiv1alpha2.InferenceModelApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha2.InferenceModel, err error) // Add a +genclient:noStatus comment above the type to avoid generating ApplyStatus(). 
- ApplyStatus(ctx context.Context, inferenceModel *applyconfigurationapiv1alpha1.InferenceModelApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha1.InferenceModel, err error) + ApplyStatus(ctx context.Context, inferenceModel *applyconfigurationapiv1alpha2.InferenceModelApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha2.InferenceModel, err error) InferenceModelExpansion } // inferenceModels implements InferenceModelInterface type inferenceModels struct { - *gentype.ClientWithListAndApply[*apiv1alpha1.InferenceModel, *apiv1alpha1.InferenceModelList, *applyconfigurationapiv1alpha1.InferenceModelApplyConfiguration] + *gentype.ClientWithListAndApply[*apiv1alpha2.InferenceModel, *apiv1alpha2.InferenceModelList, *applyconfigurationapiv1alpha2.InferenceModelApplyConfiguration] } // newInferenceModels returns a InferenceModels -func newInferenceModels(c *ApiV1alpha1Client, namespace string) *inferenceModels { +func newInferenceModels(c *InferenceV1alpha2Client, namespace string) *inferenceModels { return &inferenceModels{ - gentype.NewClientWithListAndApply[*apiv1alpha1.InferenceModel, *apiv1alpha1.InferenceModelList, *applyconfigurationapiv1alpha1.InferenceModelApplyConfiguration]( + gentype.NewClientWithListAndApply[*apiv1alpha2.InferenceModel, *apiv1alpha2.InferenceModelList, *applyconfigurationapiv1alpha2.InferenceModelApplyConfiguration]( "inferencemodels", c.RESTClient(), scheme.ParameterCodec, namespace, - func() *apiv1alpha1.InferenceModel { return &apiv1alpha1.InferenceModel{} }, - func() *apiv1alpha1.InferenceModelList { return &apiv1alpha1.InferenceModelList{} }, + func() *apiv1alpha2.InferenceModel { return &apiv1alpha2.InferenceModel{} }, + func() *apiv1alpha2.InferenceModelList { return &apiv1alpha2.InferenceModelList{} }, ), } } diff --git a/client-go/clientset/versioned/typed/api/v1alpha1/inferencepool.go b/client-go/clientset/versioned/typed/api/v1alpha2/inferencepool.go similarity index 58% rename from client-go/clientset/versioned/typed/api/v1alpha1/inferencepool.go rename to client-go/clientset/versioned/typed/api/v1alpha2/inferencepool.go index 46a2b378..6cbfb546 100644 --- a/client-go/clientset/versioned/typed/api/v1alpha1/inferencepool.go +++ b/client-go/clientset/versioned/typed/api/v1alpha2/inferencepool.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,18 +15,18 @@ limitations under the License. */ // Code generated by client-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( context "context" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - applyconfigurationapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha1" - scheme "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" types "k8s.io/apimachinery/pkg/types" watch "k8s.io/apimachinery/pkg/watch" gentype "k8s.io/client-go/gentype" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + applyconfigurationapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/applyconfiguration/api/v1alpha2" + scheme "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned/scheme" ) // InferencePoolsGetter has a method to return a InferencePoolInterface. 
@@ -37,37 +37,37 @@ type InferencePoolsGetter interface { // InferencePoolInterface has methods to work with InferencePool resources. type InferencePoolInterface interface { - Create(ctx context.Context, inferencePool *apiv1alpha1.InferencePool, opts v1.CreateOptions) (*apiv1alpha1.InferencePool, error) - Update(ctx context.Context, inferencePool *apiv1alpha1.InferencePool, opts v1.UpdateOptions) (*apiv1alpha1.InferencePool, error) + Create(ctx context.Context, inferencePool *apiv1alpha2.InferencePool, opts v1.CreateOptions) (*apiv1alpha2.InferencePool, error) + Update(ctx context.Context, inferencePool *apiv1alpha2.InferencePool, opts v1.UpdateOptions) (*apiv1alpha2.InferencePool, error) // Add a +genclient:noStatus comment above the type to avoid generating UpdateStatus(). - UpdateStatus(ctx context.Context, inferencePool *apiv1alpha1.InferencePool, opts v1.UpdateOptions) (*apiv1alpha1.InferencePool, error) + UpdateStatus(ctx context.Context, inferencePool *apiv1alpha2.InferencePool, opts v1.UpdateOptions) (*apiv1alpha2.InferencePool, error) Delete(ctx context.Context, name string, opts v1.DeleteOptions) error DeleteCollection(ctx context.Context, opts v1.DeleteOptions, listOpts v1.ListOptions) error - Get(ctx context.Context, name string, opts v1.GetOptions) (*apiv1alpha1.InferencePool, error) - List(ctx context.Context, opts v1.ListOptions) (*apiv1alpha1.InferencePoolList, error) + Get(ctx context.Context, name string, opts v1.GetOptions) (*apiv1alpha2.InferencePool, error) + List(ctx context.Context, opts v1.ListOptions) (*apiv1alpha2.InferencePoolList, error) Watch(ctx context.Context, opts v1.ListOptions) (watch.Interface, error) - Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts v1.PatchOptions, subresources ...string) (result *apiv1alpha1.InferencePool, err error) - Apply(ctx context.Context, inferencePool *applyconfigurationapiv1alpha1.InferencePoolApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha1.InferencePool, err error) + Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts v1.PatchOptions, subresources ...string) (result *apiv1alpha2.InferencePool, err error) + Apply(ctx context.Context, inferencePool *applyconfigurationapiv1alpha2.InferencePoolApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha2.InferencePool, err error) // Add a +genclient:noStatus comment above the type to avoid generating ApplyStatus(). 
- ApplyStatus(ctx context.Context, inferencePool *applyconfigurationapiv1alpha1.InferencePoolApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha1.InferencePool, err error) + ApplyStatus(ctx context.Context, inferencePool *applyconfigurationapiv1alpha2.InferencePoolApplyConfiguration, opts v1.ApplyOptions) (result *apiv1alpha2.InferencePool, err error) InferencePoolExpansion } // inferencePools implements InferencePoolInterface type inferencePools struct { - *gentype.ClientWithListAndApply[*apiv1alpha1.InferencePool, *apiv1alpha1.InferencePoolList, *applyconfigurationapiv1alpha1.InferencePoolApplyConfiguration] + *gentype.ClientWithListAndApply[*apiv1alpha2.InferencePool, *apiv1alpha2.InferencePoolList, *applyconfigurationapiv1alpha2.InferencePoolApplyConfiguration] } // newInferencePools returns a InferencePools -func newInferencePools(c *ApiV1alpha1Client, namespace string) *inferencePools { +func newInferencePools(c *InferenceV1alpha2Client, namespace string) *inferencePools { return &inferencePools{ - gentype.NewClientWithListAndApply[*apiv1alpha1.InferencePool, *apiv1alpha1.InferencePoolList, *applyconfigurationapiv1alpha1.InferencePoolApplyConfiguration]( + gentype.NewClientWithListAndApply[*apiv1alpha2.InferencePool, *apiv1alpha2.InferencePoolList, *applyconfigurationapiv1alpha2.InferencePoolApplyConfiguration]( "inferencepools", c.RESTClient(), scheme.ParameterCodec, namespace, - func() *apiv1alpha1.InferencePool { return &apiv1alpha1.InferencePool{} }, - func() *apiv1alpha1.InferencePoolList { return &apiv1alpha1.InferencePoolList{} }, + func() *apiv1alpha2.InferencePool { return &apiv1alpha2.InferencePool{} }, + func() *apiv1alpha2.InferencePoolList { return &apiv1alpha2.InferencePoolList{} }, ), } } diff --git a/client-go/informers/externalversions/api/interface.go b/client-go/informers/externalversions/api/interface.go index 6ca4f9da..572f5230 100644 --- a/client-go/informers/externalversions/api/interface.go +++ b/client-go/informers/externalversions/api/interface.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -18,14 +18,14 @@ limitations under the License. package api import ( - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/api/v1alpha1" - internalinterfaces "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/api/v1alpha2" + internalinterfaces "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" ) // Interface provides access to each of this group's versions. type Interface interface { - // V1alpha1 provides access to shared informers for resources in V1alpha1. - V1alpha1() v1alpha1.Interface + // V1alpha2 provides access to shared informers for resources in V1alpha2. + V1alpha2() v1alpha2.Interface } type group struct { @@ -39,7 +39,7 @@ func New(f internalinterfaces.SharedInformerFactory, namespace string, tweakList return &group{factory: f, namespace: namespace, tweakListOptions: tweakListOptions} } -// V1alpha1 returns a new v1alpha1.Interface. -func (g *group) V1alpha1() v1alpha1.Interface { - return v1alpha1.New(g.factory, g.namespace, g.tweakListOptions) +// V1alpha2 returns a new v1alpha2.Interface. 
+func (g *group) V1alpha2() v1alpha2.Interface { + return v1alpha2.New(g.factory, g.namespace, g.tweakListOptions) } diff --git a/client-go/informers/externalversions/api/v1alpha1/inferencemodel.go b/client-go/informers/externalversions/api/v1alpha2/inferencemodel.go similarity index 75% rename from client-go/informers/externalversions/api/v1alpha1/inferencemodel.go rename to client-go/informers/externalversions/api/v1alpha2/inferencemodel.go index f887ff4a..d21f9cda 100644 --- a/client-go/informers/externalversions/api/v1alpha1/inferencemodel.go +++ b/client-go/informers/externalversions/api/v1alpha2/inferencemodel.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,27 +15,27 @@ limitations under the License. */ // Code generated by informer-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( context "context" time "time" - gatewayapiinferenceextensionapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - versioned "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" - internalinterfaces "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/listers/api/v1alpha1" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" watch "k8s.io/apimachinery/pkg/watch" cache "k8s.io/client-go/tools/cache" + gatewayapiinferenceextensionapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + versioned "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" + internalinterfaces "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/listers/api/v1alpha2" ) // InferenceModelInformer provides access to a shared informer and lister for // InferenceModels. 
type InferenceModelInformer interface { Informer() cache.SharedIndexInformer - Lister() apiv1alpha1.InferenceModelLister + Lister() apiv1alpha2.InferenceModelLister } type inferenceModelInformer struct { @@ -61,16 +61,16 @@ func NewFilteredInferenceModelInformer(client versioned.Interface, namespace str if tweakListOptions != nil { tweakListOptions(&options) } - return client.ApiV1alpha1().InferenceModels(namespace).List(context.TODO(), options) + return client.InferenceV1alpha2().InferenceModels(namespace).List(context.TODO(), options) }, WatchFunc: func(options v1.ListOptions) (watch.Interface, error) { if tweakListOptions != nil { tweakListOptions(&options) } - return client.ApiV1alpha1().InferenceModels(namespace).Watch(context.TODO(), options) + return client.InferenceV1alpha2().InferenceModels(namespace).Watch(context.TODO(), options) }, }, - &gatewayapiinferenceextensionapiv1alpha1.InferenceModel{}, + &gatewayapiinferenceextensionapiv1alpha2.InferenceModel{}, resyncPeriod, indexers, ) @@ -81,9 +81,9 @@ func (f *inferenceModelInformer) defaultInformer(client versioned.Interface, res } func (f *inferenceModelInformer) Informer() cache.SharedIndexInformer { - return f.factory.InformerFor(&gatewayapiinferenceextensionapiv1alpha1.InferenceModel{}, f.defaultInformer) + return f.factory.InformerFor(&gatewayapiinferenceextensionapiv1alpha2.InferenceModel{}, f.defaultInformer) } -func (f *inferenceModelInformer) Lister() apiv1alpha1.InferenceModelLister { - return apiv1alpha1.NewInferenceModelLister(f.Informer().GetIndexer()) +func (f *inferenceModelInformer) Lister() apiv1alpha2.InferenceModelLister { + return apiv1alpha2.NewInferenceModelLister(f.Informer().GetIndexer()) } diff --git a/client-go/informers/externalversions/api/v1alpha1/inferencepool.go b/client-go/informers/externalversions/api/v1alpha2/inferencepool.go similarity index 75% rename from client-go/informers/externalversions/api/v1alpha1/inferencepool.go rename to client-go/informers/externalversions/api/v1alpha2/inferencepool.go index 2311a025..4d042db7 100644 --- a/client-go/informers/externalversions/api/v1alpha1/inferencepool.go +++ b/client-go/informers/externalversions/api/v1alpha2/inferencepool.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,27 +15,27 @@ limitations under the License. */ // Code generated by informer-gen. DO NOT EDIT. 
-package v1alpha1 +package v1alpha2 import ( context "context" time "time" - gatewayapiinferenceextensionapiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - versioned "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" - internalinterfaces "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/listers/api/v1alpha1" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" watch "k8s.io/apimachinery/pkg/watch" cache "k8s.io/client-go/tools/cache" + gatewayapiinferenceextensionapiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + versioned "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" + internalinterfaces "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/client-go/listers/api/v1alpha2" ) // InferencePoolInformer provides access to a shared informer and lister for // InferencePools. type InferencePoolInformer interface { Informer() cache.SharedIndexInformer - Lister() apiv1alpha1.InferencePoolLister + Lister() apiv1alpha2.InferencePoolLister } type inferencePoolInformer struct { @@ -61,16 +61,16 @@ func NewFilteredInferencePoolInformer(client versioned.Interface, namespace stri if tweakListOptions != nil { tweakListOptions(&options) } - return client.ApiV1alpha1().InferencePools(namespace).List(context.TODO(), options) + return client.InferenceV1alpha2().InferencePools(namespace).List(context.TODO(), options) }, WatchFunc: func(options v1.ListOptions) (watch.Interface, error) { if tweakListOptions != nil { tweakListOptions(&options) } - return client.ApiV1alpha1().InferencePools(namespace).Watch(context.TODO(), options) + return client.InferenceV1alpha2().InferencePools(namespace).Watch(context.TODO(), options) }, }, - &gatewayapiinferenceextensionapiv1alpha1.InferencePool{}, + &gatewayapiinferenceextensionapiv1alpha2.InferencePool{}, resyncPeriod, indexers, ) @@ -81,9 +81,9 @@ func (f *inferencePoolInformer) defaultInformer(client versioned.Interface, resy } func (f *inferencePoolInformer) Informer() cache.SharedIndexInformer { - return f.factory.InformerFor(&gatewayapiinferenceextensionapiv1alpha1.InferencePool{}, f.defaultInformer) + return f.factory.InformerFor(&gatewayapiinferenceextensionapiv1alpha2.InferencePool{}, f.defaultInformer) } -func (f *inferencePoolInformer) Lister() apiv1alpha1.InferencePoolLister { - return apiv1alpha1.NewInferencePoolLister(f.Informer().GetIndexer()) +func (f *inferencePoolInformer) Lister() apiv1alpha2.InferencePoolLister { + return apiv1alpha2.NewInferencePoolLister(f.Informer().GetIndexer()) } diff --git a/client-go/informers/externalversions/api/v1alpha1/interface.go b/client-go/informers/externalversions/api/v1alpha2/interface.go similarity index 90% rename from client-go/informers/externalversions/api/v1alpha1/interface.go rename to client-go/informers/externalversions/api/v1alpha2/interface.go index 9ba07025..6db5619e 100644 --- a/client-go/informers/externalversions/api/v1alpha1/interface.go +++ b/client-go/informers/externalversions/api/v1alpha2/interface.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. 
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,10 +15,10 @@ limitations under the License. */ // Code generated by informer-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - internalinterfaces "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" + internalinterfaces "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" ) // Interface provides access to all the informers in this group version. diff --git a/client-go/informers/externalversions/factory.go b/client-go/informers/externalversions/factory.go index 39c96068..9b52e814 100644 --- a/client-go/informers/externalversions/factory.go +++ b/client-go/informers/externalversions/factory.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -22,13 +22,13 @@ import ( sync "sync" time "time" - versioned "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" - api "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/api" - internalinterfaces "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" schema "k8s.io/apimachinery/pkg/runtime/schema" cache "k8s.io/client-go/tools/cache" + versioned "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" + api "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/api" + internalinterfaces "sigs.k8s.io/gateway-api-inference-extension/client-go/informers/externalversions/internalinterfaces" ) // SharedInformerOption defines the functional option type for SharedInformerFactory. @@ -253,9 +253,9 @@ type SharedInformerFactory interface { // client. InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer - Api() api.Interface + Inference() api.Interface } -func (f *sharedInformerFactory) Api() api.Interface { +func (f *sharedInformerFactory) Inference() api.Interface { return api.New(f, f.namespace, f.tweakListOptions) } diff --git a/client-go/informers/externalversions/generic.go b/client-go/informers/externalversions/generic.go index a5f15f73..143f9289 100644 --- a/client-go/informers/externalversions/generic.go +++ b/client-go/informers/externalversions/generic.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
@@ -20,9 +20,9 @@ package externalversions import ( fmt "fmt" - v1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" schema "k8s.io/apimachinery/pkg/runtime/schema" cache "k8s.io/client-go/tools/cache" + v1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // GenericInformer is type of SharedIndexInformer which will locate and delegate to other @@ -51,11 +51,11 @@ func (f *genericInformer) Lister() cache.GenericLister { // TODO extend this to unknown resources with a client pool func (f *sharedInformerFactory) ForResource(resource schema.GroupVersionResource) (GenericInformer, error) { switch resource { - // Group=api, Version=v1alpha1 - case v1alpha1.SchemeGroupVersion.WithResource("inferencemodels"): - return &genericInformer{resource: resource.GroupResource(), informer: f.Api().V1alpha1().InferenceModels().Informer()}, nil - case v1alpha1.SchemeGroupVersion.WithResource("inferencepools"): - return &genericInformer{resource: resource.GroupResource(), informer: f.Api().V1alpha1().InferencePools().Informer()}, nil + // Group=inference.networking.x-k8s.io, Version=v1alpha2 + case v1alpha2.SchemeGroupVersion.WithResource("inferencemodels"): + return &genericInformer{resource: resource.GroupResource(), informer: f.Inference().V1alpha2().InferenceModels().Informer()}, nil + case v1alpha2.SchemeGroupVersion.WithResource("inferencepools"): + return &genericInformer{resource: resource.GroupResource(), informer: f.Inference().V1alpha2().InferencePools().Informer()}, nil } diff --git a/client-go/informers/externalversions/internalinterfaces/factory_interfaces.go b/client-go/informers/externalversions/internalinterfaces/factory_interfaces.go index 488aca6f..b11099a0 100644 --- a/client-go/informers/externalversions/internalinterfaces/factory_interfaces.go +++ b/client-go/informers/externalversions/internalinterfaces/factory_interfaces.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -20,10 +20,10 @@ package internalinterfaces import ( time "time" - versioned "inference.networking.x-k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" runtime "k8s.io/apimachinery/pkg/runtime" cache "k8s.io/client-go/tools/cache" + versioned "sigs.k8s.io/gateway-api-inference-extension/client-go/clientset/versioned" ) // NewInformerFunc takes versioned.Interface and time.Duration to return a SharedIndexInformer. diff --git a/client-go/listers/api/v1alpha1/expansion_generated.go b/client-go/listers/api/v1alpha2/expansion_generated.go similarity index 95% rename from client-go/listers/api/v1alpha1/expansion_generated.go rename to client-go/listers/api/v1alpha2/expansion_generated.go index ffbe67cf..6abe0b37 100644 --- a/client-go/listers/api/v1alpha1/expansion_generated.go +++ b/client-go/listers/api/v1alpha2/expansion_generated.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,7 +15,7 @@ limitations under the License. */ // Code generated by lister-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 // InferenceModelListerExpansion allows custom methods to be added to // InferenceModelLister. 
diff --git a/client-go/listers/api/v1alpha1/inferencemodel.go b/client-go/listers/api/v1alpha2/inferencemodel.go similarity index 78% rename from client-go/listers/api/v1alpha1/inferencemodel.go rename to client-go/listers/api/v1alpha2/inferencemodel.go index b0c33b61..22ca6a16 100644 --- a/client-go/listers/api/v1alpha1/inferencemodel.go +++ b/client-go/listers/api/v1alpha2/inferencemodel.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,13 +15,13 @@ limitations under the License. */ // Code generated by lister-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" labels "k8s.io/apimachinery/pkg/labels" listers "k8s.io/client-go/listers" cache "k8s.io/client-go/tools/cache" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // InferenceModelLister helps list InferenceModels. @@ -29,7 +29,7 @@ import ( type InferenceModelLister interface { // List lists all InferenceModels in the indexer. // Objects returned here must be treated as read-only. - List(selector labels.Selector) (ret []*apiv1alpha1.InferenceModel, err error) + List(selector labels.Selector) (ret []*apiv1alpha2.InferenceModel, err error) // InferenceModels returns an object that can list and get InferenceModels. InferenceModels(namespace string) InferenceModelNamespaceLister InferenceModelListerExpansion @@ -37,17 +37,17 @@ type InferenceModelLister interface { // inferenceModelLister implements the InferenceModelLister interface. type inferenceModelLister struct { - listers.ResourceIndexer[*apiv1alpha1.InferenceModel] + listers.ResourceIndexer[*apiv1alpha2.InferenceModel] } // NewInferenceModelLister returns a new InferenceModelLister. func NewInferenceModelLister(indexer cache.Indexer) InferenceModelLister { - return &inferenceModelLister{listers.New[*apiv1alpha1.InferenceModel](indexer, apiv1alpha1.Resource("inferencemodel"))} + return &inferenceModelLister{listers.New[*apiv1alpha2.InferenceModel](indexer, apiv1alpha2.Resource("inferencemodel"))} } // InferenceModels returns an object that can list and get InferenceModels. func (s *inferenceModelLister) InferenceModels(namespace string) InferenceModelNamespaceLister { - return inferenceModelNamespaceLister{listers.NewNamespaced[*apiv1alpha1.InferenceModel](s.ResourceIndexer, namespace)} + return inferenceModelNamespaceLister{listers.NewNamespaced[*apiv1alpha2.InferenceModel](s.ResourceIndexer, namespace)} } // InferenceModelNamespaceLister helps list and get InferenceModels. @@ -55,15 +55,15 @@ func (s *inferenceModelLister) InferenceModels(namespace string) InferenceModelN type InferenceModelNamespaceLister interface { // List lists all InferenceModels in the indexer for a given namespace. // Objects returned here must be treated as read-only. - List(selector labels.Selector) (ret []*apiv1alpha1.InferenceModel, err error) + List(selector labels.Selector) (ret []*apiv1alpha2.InferenceModel, err error) // Get retrieves the InferenceModel from the indexer for a given namespace and name. // Objects returned here must be treated as read-only. 
- Get(name string) (*apiv1alpha1.InferenceModel, error) + Get(name string) (*apiv1alpha2.InferenceModel, error) InferenceModelNamespaceListerExpansion } // inferenceModelNamespaceLister implements the InferenceModelNamespaceLister // interface. type inferenceModelNamespaceLister struct { - listers.ResourceIndexer[*apiv1alpha1.InferenceModel] + listers.ResourceIndexer[*apiv1alpha2.InferenceModel] } diff --git a/client-go/listers/api/v1alpha1/inferencepool.go b/client-go/listers/api/v1alpha2/inferencepool.go similarity index 78% rename from client-go/listers/api/v1alpha1/inferencepool.go rename to client-go/listers/api/v1alpha2/inferencepool.go index 0b0c1d6e..48879560 100644 --- a/client-go/listers/api/v1alpha1/inferencepool.go +++ b/client-go/listers/api/v1alpha2/inferencepool.go @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. @@ -15,13 +15,13 @@ limitations under the License. */ // Code generated by lister-gen. DO NOT EDIT. -package v1alpha1 +package v1alpha2 import ( - apiv1alpha1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" labels "k8s.io/apimachinery/pkg/labels" listers "k8s.io/client-go/listers" cache "k8s.io/client-go/tools/cache" + apiv1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" ) // InferencePoolLister helps list InferencePools. @@ -29,7 +29,7 @@ import ( type InferencePoolLister interface { // List lists all InferencePools in the indexer. // Objects returned here must be treated as read-only. - List(selector labels.Selector) (ret []*apiv1alpha1.InferencePool, err error) + List(selector labels.Selector) (ret []*apiv1alpha2.InferencePool, err error) // InferencePools returns an object that can list and get InferencePools. InferencePools(namespace string) InferencePoolNamespaceLister InferencePoolListerExpansion @@ -37,17 +37,17 @@ type InferencePoolLister interface { // inferencePoolLister implements the InferencePoolLister interface. type inferencePoolLister struct { - listers.ResourceIndexer[*apiv1alpha1.InferencePool] + listers.ResourceIndexer[*apiv1alpha2.InferencePool] } // NewInferencePoolLister returns a new InferencePoolLister. func NewInferencePoolLister(indexer cache.Indexer) InferencePoolLister { - return &inferencePoolLister{listers.New[*apiv1alpha1.InferencePool](indexer, apiv1alpha1.Resource("inferencepool"))} + return &inferencePoolLister{listers.New[*apiv1alpha2.InferencePool](indexer, apiv1alpha2.Resource("inferencepool"))} } // InferencePools returns an object that can list and get InferencePools. func (s *inferencePoolLister) InferencePools(namespace string) InferencePoolNamespaceLister { - return inferencePoolNamespaceLister{listers.NewNamespaced[*apiv1alpha1.InferencePool](s.ResourceIndexer, namespace)} + return inferencePoolNamespaceLister{listers.NewNamespaced[*apiv1alpha2.InferencePool](s.ResourceIndexer, namespace)} } // InferencePoolNamespaceLister helps list and get InferencePools. @@ -55,15 +55,15 @@ func (s *inferencePoolLister) InferencePools(namespace string) InferencePoolName type InferencePoolNamespaceLister interface { // List lists all InferencePools in the indexer for a given namespace. // Objects returned here must be treated as read-only. 
- List(selector labels.Selector) (ret []*apiv1alpha1.InferencePool, err error) + List(selector labels.Selector) (ret []*apiv1alpha2.InferencePool, err error) // Get retrieves the InferencePool from the indexer for a given namespace and name. // Objects returned here must be treated as read-only. - Get(name string) (*apiv1alpha1.InferencePool, error) + Get(name string) (*apiv1alpha2.InferencePool, error) InferencePoolNamespaceListerExpansion } // inferencePoolNamespaceLister implements the InferencePoolNamespaceLister // interface. type inferencePoolNamespaceLister struct { - listers.ResourceIndexer[*apiv1alpha1.InferencePool] + listers.ResourceIndexer[*apiv1alpha2.InferencePool] } diff --git a/cloudbuild.yaml b/cloudbuild.yaml index 2da147f4..f05c8c00 100644 --- a/cloudbuild.yaml +++ b/cloudbuild.yaml @@ -4,7 +4,7 @@ timeout: 3000s # For each build step, Prow executes a job. steps: # see https://github.com/kubernetes/test-infra/tree/master/config/jobs/image-pushing - - name: gcr.io/k8s-testimages/gcb-docker-gcloud:v20220830-45cbff55bc + - name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20240718-5ef92b5c36 entrypoint: make args: - image-push @@ -12,6 +12,31 @@ steps: - GIT_TAG=$_GIT_TAG - EXTRA_TAG=$_PULL_BASE_REF - DOCKER_BUILDX_CMD=/buildx-entrypoint + - GIT_COMMIT_SHA=$COMMIT_SHA + - name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20240718-5ef92b5c36 + entrypoint: make + args: + - syncer-image-push + env: + - GIT_TAG=$_GIT_TAG + - EXTRA_TAG=$_PULL_BASE_REF + - DOCKER_BUILDX_CMD=/buildx-entrypoint + - name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20240718-5ef92b5c36 + entrypoint: make + args: + - inferencepool-helm-chart-push + - bbr-helm-chart-push + env: + - EXTRA_TAG=$_PULL_BASE_REF + - GOTOOLCHAIN=auto + - name: gcr.io/k8s-staging-test-infra/gcb-docker-gcloud:v20240718-5ef92b5c36 + entrypoint: make + args: + - bbr-image-push + env: + - GIT_TAG=$_GIT_TAG + - EXTRA_TAG=$_PULL_BASE_REF + - DOCKER_BUILDX_CMD=/buildx-entrypoint substitutions: # _GIT_TAG will be filled with a git-based tag for the image, of the form vYYYYMMDD-hash, and # can be used as a substitution diff --git a/cmd/bbr/health.go b/cmd/bbr/health.go new file mode 100644 index 00000000..7d1b5fd5 --- /dev/null +++ b/cmd/bbr/health.go @@ -0,0 +1,40 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package main + +import ( + "context" + + "github.com/go-logr/logr" + "google.golang.org/grpc/codes" + healthPb "google.golang.org/grpc/health/grpc_health_v1" + "google.golang.org/grpc/status" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +type healthServer struct { + logger logr.Logger +} + +func (s *healthServer) Check(ctx context.Context, in *healthPb.HealthCheckRequest) (*healthPb.HealthCheckResponse, error) { + s.logger.V(logutil.VERBOSE).Info("gRPC health check serving", "service", in.Service) + return &healthPb.HealthCheckResponse{Status: healthPb.HealthCheckResponse_SERVING}, nil +} + +func (s *healthServer) Watch(in *healthPb.HealthCheckRequest, srv healthPb.Health_WatchServer) error { + return status.Error(codes.Unimplemented, "Watch is not implemented") +} diff --git a/cmd/bbr/main.go b/cmd/bbr/main.go new file mode 100644 index 00000000..84b1fffa --- /dev/null +++ b/cmd/bbr/main.go @@ -0,0 +1,209 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package main + +import ( + "flag" + "net" + "net/http" + "os" + "strconv" + + "github.com/go-logr/logr" + "github.com/prometheus/client_golang/prometheus/promhttp" + uberzap "go.uber.org/zap" + "go.uber.org/zap/zapcore" + "google.golang.org/grpc" + healthPb "google.golang.org/grpc/health/grpc_health_v1" + "k8s.io/client-go/rest" + "k8s.io/component-base/metrics/legacyregistry" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/log/zap" + "sigs.k8s.io/controller-runtime/pkg/manager" + "sigs.k8s.io/controller-runtime/pkg/metrics/filters" + "sigs.k8s.io/gateway-api-inference-extension/internal/runnable" + runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/bbr/server" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +var ( + grpcPort = flag.Int( + "grpcPort", + 9004, + "The gRPC port used for communicating with Envoy proxy") + grpcHealthPort = flag.Int( + "grpcHealthPort", + 9005, + "The port used for gRPC liveness and readiness probes") + metricsPort = flag.Int( + "metricsPort", 9090, "The metrics port") + streaming = flag.Bool( + "streaming", false, "Enables streaming support for Envoy full-duplex streaming mode") + logVerbosity = flag.Int("v", logging.DEFAULT, "number for the log level verbosity") + + setupLog = ctrl.Log.WithName("setup") +) + +func main() { + if err := run(); err != nil { + os.Exit(1) + } +} + +func run() error { + opts := zap.Options{Development: true} + opts.BindFlags(flag.CommandLine) + flag.Parse() + initLogging(&opts) + + // Print all flag values + flags := make(map[string]any) + flag.VisitAll(func(f *flag.Flag) { + flags[f.Name] = f.Value + }) + setupLog.Info("Flags processed", "flags", flags) + + // Init runtime. 
+ cfg, err := ctrl.GetConfig() + if err != nil { + setupLog.Error(err, "Failed to get rest config") + return err + } + + mgr, err := ctrl.NewManager(cfg, ctrl.Options{}) + if err != nil { + setupLog.Error(err, "Failed to create manager", "config", cfg) + return err + } + + ctx := ctrl.SetupSignalHandler() + + // Setup runner. + serverRunner := runserver.NewDefaultExtProcServerRunner(*grpcPort, *streaming) + + // Register health server. + if err := registerHealthServer(mgr, ctrl.Log.WithName("health"), *grpcHealthPort); err != nil { + return err + } + + // Register ext-proc server. + if err := mgr.Add(serverRunner.AsRunnable(ctrl.Log.WithName("ext-proc"))); err != nil { + setupLog.Error(err, "Failed to register ext-proc gRPC server") + return err + } + + // Register metrics handler. + if err := registerMetricsHandler(mgr, *metricsPort, cfg); err != nil { + return err + } + + // Start the manager. This blocks until a signal is received. + setupLog.Info("Manager starting") + if err := mgr.Start(ctx); err != nil { + setupLog.Error(err, "Error starting manager") + return err + } + setupLog.Info("Manager terminated") + return nil +} + +// registerHealthServer adds the Health gRPC server as a Runnable to the given manager. +func registerHealthServer(mgr manager.Manager, logger logr.Logger, port int) error { + srv := grpc.NewServer() + healthPb.RegisterHealthServer(srv, &healthServer{ + logger: logger, + }) + if err := mgr.Add( + runnable.NoLeaderElection(runnable.GRPCServer("health", srv, port))); err != nil { + setupLog.Error(err, "Failed to register health server") + return err + } + return nil +} + +func initLogging(opts *zap.Options) { + useV := true + flag.Visit(func(f *flag.Flag) { + if f.Name == "zap-log-level" { + useV = false + } + }) + if useV { + // See https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/log/zap#Options.Level + lvl := -1 * (*logVerbosity) + opts.Level = uberzap.NewAtomicLevelAt(zapcore.Level(int8(lvl))) + } + + logger := zap.New(zap.UseFlagOptions(opts), zap.RawZapOpts(uberzap.AddCaller())) + ctrl.SetLogger(logger) +} + +const metricsEndpoint = "/metrics" + +// registerMetricsHandler adds the metrics HTTP handler as a Runnable to the given manager. +func registerMetricsHandler(mgr manager.Manager, port int, cfg *rest.Config) error { + metrics.Register() + + // Init HTTP server. 
+ h, err := metricsHandlerWithAuthenticationAndAuthorization(cfg) + if err != nil { + return err + } + + mux := http.NewServeMux() + mux.Handle(metricsEndpoint, h) + + srv := &http.Server{ + Addr: net.JoinHostPort("", strconv.Itoa(port)), + Handler: mux, + } + + if err := mgr.Add(&manager.Server{ + Name: "metrics", + Server: srv, + }); err != nil { + setupLog.Error(err, "Failed to register metrics HTTP handler") + return err + } + return nil +} + +func metricsHandlerWithAuthenticationAndAuthorization(cfg *rest.Config) (http.Handler, error) { + h := promhttp.HandlerFor( + legacyregistry.DefaultGatherer, + promhttp.HandlerOpts{}, + ) + httpClient, err := rest.HTTPClientFor(cfg) + if err != nil { + setupLog.Error(err, "Failed to create http client for metrics auth") + return nil, err + } + + filter, err := filters.WithAuthenticationAndAuthorization(cfg, httpClient) + if err != nil { + setupLog.Error(err, "Failed to create metrics filter for auth") + return nil, err + } + metricsLogger := ctrl.Log.WithName("metrics").WithValues("path", metricsEndpoint) + metricsAuthHandler, err := filter(metricsLogger, h) + if err != nil { + setupLog.Error(err, "Failed to create metrics auth handler") + return nil, err + } + return metricsAuthHandler, nil +} diff --git a/cmd/epp/health.go b/cmd/epp/health.go new file mode 100644 index 00000000..93697002 --- /dev/null +++ b/cmd/epp/health.go @@ -0,0 +1,46 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package main + +import ( + "context" + + "github.com/go-logr/logr" + "google.golang.org/grpc/codes" + healthPb "google.golang.org/grpc/health/grpc_health_v1" + "google.golang.org/grpc/status" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +type healthServer struct { + logger logr.Logger + datastore datastore.Datastore +} + +func (s *healthServer) Check(ctx context.Context, in *healthPb.HealthCheckRequest) (*healthPb.HealthCheckResponse, error) { + if !s.datastore.PoolHasSynced() { + s.logger.V(logutil.DEFAULT).Info("gRPC health check not serving", "service", in.Service) + return &healthPb.HealthCheckResponse{Status: healthPb.HealthCheckResponse_NOT_SERVING}, nil + } + s.logger.V(logutil.TRACE).Info("gRPC health check serving", "service", in.Service) + return &healthPb.HealthCheckResponse{Status: healthPb.HealthCheckResponse_SERVING}, nil +} + +func (s *healthServer) Watch(in *healthPb.HealthCheckRequest, srv healthPb.Health_WatchServer) error { + return status.Error(codes.Unimplemented, "Watch is not implemented") +} diff --git a/cmd/epp/main.go b/cmd/epp/main.go new file mode 100644 index 00000000..2bd779c5 --- /dev/null +++ b/cmd/epp/main.go @@ -0,0 +1,323 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package main
+
+import (
+	"flag"
+	"fmt"
+	"net"
+	"net/http"
+	"os"
+	"strconv"
+
+	"github.com/go-logr/logr"
+	"github.com/prometheus/client_golang/prometheus/promhttp"
+	uberzap "go.uber.org/zap"
+	"go.uber.org/zap/zapcore"
+	"google.golang.org/grpc"
+	healthPb "google.golang.org/grpc/health/grpc_health_v1"
+	"k8s.io/apimachinery/pkg/types"
+	"k8s.io/client-go/rest"
+	"k8s.io/component-base/metrics/legacyregistry"
+	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/log/zap"
+	"sigs.k8s.io/controller-runtime/pkg/manager"
+	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
+	"sigs.k8s.io/gateway-api-inference-extension/internal/runnable"
+	backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling"
+	runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+const (
+	defaultMetricsEndpoint = "/metrics"
+)
+
+var (
+	grpcPort = flag.Int(
+		"grpcPort",
+		runserver.DefaultGrpcPort,
+		"The gRPC port used for communicating with Envoy proxy")
+	grpcHealthPort = flag.Int(
+		"grpcHealthPort",
+		9003,
+		"The port used for gRPC liveness and readiness probes")
+	metricsPort = flag.Int(
+		"metricsPort", 9090, "The metrics port")
+	destinationEndpointHintKey = flag.String(
+		"destinationEndpointHintKey",
+		runserver.DefaultDestinationEndpointHintKey,
+		"Header and response metadata key used by Envoy to route to the appropriate pod. This must match Envoy configuration.")
+	destinationEndpointHintMetadataNamespace = flag.String(
+		"DestinationEndpointHintMetadataNamespace",
+		runserver.DefaultDestinationEndpointHintMetadataNamespace,
+		"The key for the outer namespace struct in the metadata field of the extproc response that is used to wrap the "+
+			"target endpoint. If not set, then an outer namespace struct should not be created.")
+	poolName = flag.String(
+		"poolName",
+		runserver.DefaultPoolName,
+		"Name of the InferencePool this Endpoint Picker is associated with.")
+	poolNamespace = flag.String(
+		"poolNamespace",
+		runserver.DefaultPoolNamespace,
+		"Namespace of the InferencePool this Endpoint Picker is associated with.")
+	refreshMetricsInterval = flag.Duration(
+		"refreshMetricsInterval",
+		runserver.DefaultRefreshMetricsInterval,
+		"interval to refresh metrics")
+	refreshPrometheusMetricsInterval = flag.Duration(
+		"refreshPrometheusMetricsInterval",
+		runserver.DefaultRefreshPrometheusMetricsInterval,
+		"interval to flush prometheus metrics")
+	logVerbosity = flag.Int("v", logging.DEFAULT, "number for the log level verbosity")
+	secureServing = flag.Bool(
+		"secureServing", runserver.DefaultSecureServing, "Enables secure serving. Defaults to true.")
+	certPath = flag.String(
+		"certPath", "", "The path to the certificate for secure serving. The certificate and private key files "+
+			"are assumed to be named tls.crt and tls.key, respectively.
If not set, and secureServing is enabled, "+ + "then a self-signed certificate is used.") + // metric flags + totalQueuedRequestsMetric = flag.String("totalQueuedRequestsMetric", + "vllm:num_requests_waiting", + "Prometheus metric for the number of queued requests.") + kvCacheUsagePercentageMetric = flag.String("kvCacheUsagePercentageMetric", + "vllm:gpu_cache_usage_perc", + "Prometheus metric for the fraction of KV-cache blocks currently in use (from 0 to 1).") + // LoRA metrics + loraInfoMetric = flag.String("loraInfoMetric", + "vllm:lora_requests_info", + "Prometheus metric for the LoRA info metrics (must be in vLLM label format).") + + setupLog = ctrl.Log.WithName("setup") +) + +func main() { + if err := run(); err != nil { + os.Exit(1) + } +} + +func run() error { + opts := zap.Options{ + Development: true, + } + opts.BindFlags(flag.CommandLine) + flag.Parse() + initLogging(&opts) + + // Validate flags + if err := validateFlags(); err != nil { + setupLog.Error(err, "Failed to validate flags") + return err + } + + // Print all flag values + flags := make(map[string]any) + flag.VisitAll(func(f *flag.Flag) { + flags[f.Name] = f.Value + }) + setupLog.Info("Flags processed", "flags", flags) + + // Init runtime. + cfg, err := ctrl.GetConfig() + if err != nil { + setupLog.Error(err, "Failed to get rest config") + return err + } + + poolNamespacedName := types.NamespacedName{ + Name: *poolName, + Namespace: *poolNamespace, + } + mgr, err := runserver.NewDefaultManager(poolNamespacedName, cfg) + if err != nil { + setupLog.Error(err, "Failed to create controller manager") + return err + } + + // Set up mapper for metric scraping. + mapping, err := backendmetrics.NewMetricMapping( + *totalQueuedRequestsMetric, + *kvCacheUsagePercentageMetric, + *loraInfoMetric, + ) + if err != nil { + setupLog.Error(err, "Failed to create metric mapping from flags.") + return err + } + verifyMetricMapping(*mapping, setupLog) + + pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.PodMetricsClientImpl{MetricMapping: mapping}, *refreshMetricsInterval) + // Setup runner. + ctx := ctrl.SetupSignalHandler() + + datastore := datastore.NewDatastore(ctx, pmf) + + scheduler := scheduling.NewScheduler(datastore) + serverRunner := &runserver.ExtProcServerRunner{ + GrpcPort: *grpcPort, + DestinationEndpointHintMetadataNamespace: *destinationEndpointHintMetadataNamespace, + DestinationEndpointHintKey: *destinationEndpointHintKey, + PoolNamespacedName: poolNamespacedName, + Datastore: datastore, + SecureServing: *secureServing, + CertPath: *certPath, + RefreshPrometheusMetricsInterval: *refreshPrometheusMetricsInterval, + Scheduler: scheduler, + } + if err := serverRunner.SetupWithManager(ctx, mgr); err != nil { + setupLog.Error(err, "Failed to setup ext-proc controllers") + return err + } + + // Register health server. + if err := registerHealthServer(mgr, ctrl.Log.WithName("health"), datastore, *grpcHealthPort); err != nil { + return err + } + + // Register ext-proc server. + if err := mgr.Add(serverRunner.AsRunnable(ctrl.Log.WithName("ext-proc"))); err != nil { + setupLog.Error(err, "Failed to register ext-proc gRPC server") + return err + } + + // Register metrics handler. + if err := registerMetricsHandler(mgr, *metricsPort, cfg); err != nil { + return err + } + + // Start the manager. This blocks until a signal is received. 
+ setupLog.Info("Controller manager starting") + if err := mgr.Start(ctx); err != nil { + setupLog.Error(err, "Error starting controller manager") + return err + } + setupLog.Info("Controller manager terminated") + return nil +} + +func initLogging(opts *zap.Options) { + // Unless -zap-log-level is explicitly set, use -v + useV := true + flag.Visit(func(f *flag.Flag) { + if f.Name == "zap-log-level" { + useV = false + } + }) + if useV { + // See https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/log/zap#Options.Level + lvl := -1 * (*logVerbosity) + opts.Level = uberzap.NewAtomicLevelAt(zapcore.Level(int8(lvl))) + } + + logger := zap.New(zap.UseFlagOptions(opts), zap.RawZapOpts(uberzap.AddCaller())) + ctrl.SetLogger(logger) +} + +// registerHealthServer adds the Health gRPC server as a Runnable to the given manager. +func registerHealthServer(mgr manager.Manager, logger logr.Logger, ds datastore.Datastore, port int) error { + srv := grpc.NewServer() + healthPb.RegisterHealthServer(srv, &healthServer{ + logger: logger, + datastore: ds, + }) + if err := mgr.Add( + runnable.NoLeaderElection(runnable.GRPCServer("health", srv, port))); err != nil { + setupLog.Error(err, "Failed to register health server") + return err + } + return nil +} + +// registerMetricsHandler adds the metrics HTTP handler as a Runnable to the given manager. +func registerMetricsHandler(mgr manager.Manager, port int, cfg *rest.Config) error { + metrics.Register() + + metrics.RecordInferenceExtensionInfo() + + // Init HTTP server. + h, err := metricsHandlerWithAuthenticationAndAuthorization(cfg) + if err != nil { + return err + } + + mux := http.NewServeMux() + mux.Handle(defaultMetricsEndpoint, h) + + srv := &http.Server{ + Addr: net.JoinHostPort("", strconv.Itoa(port)), + Handler: mux, + } + + if err := mgr.Add(&manager.Server{ + Name: "metrics", + Server: srv, + }); err != nil { + setupLog.Error(err, "Failed to register metrics HTTP handler") + return err + } + return nil +} + +func metricsHandlerWithAuthenticationAndAuthorization(cfg *rest.Config) (http.Handler, error) { + h := promhttp.HandlerFor( + legacyregistry.DefaultGatherer, + promhttp.HandlerOpts{}, + ) + httpClient, err := rest.HTTPClientFor(cfg) + if err != nil { + setupLog.Error(err, "Failed to create http client for metrics auth") + return nil, err + } + + filter, err := filters.WithAuthenticationAndAuthorization(cfg, httpClient) + if err != nil { + setupLog.Error(err, "Failed to create metrics filter for auth") + return nil, err + } + metricsLogger := ctrl.Log.WithName("metrics").WithValues("path", defaultMetricsEndpoint) + metricsAuthHandler, err := filter(metricsLogger, h) + if err != nil { + setupLog.Error(err, "Failed to create metrics auth handler") + return nil, err + } + return metricsAuthHandler, nil +} + +func validateFlags() error { + if *poolName == "" { + return fmt.Errorf("required %q flag not set", "poolName") + } + + return nil +} + +func verifyMetricMapping(mapping backendmetrics.MetricMapping, logger logr.Logger) { + if mapping.TotalQueuedRequests == nil { + logger.Info("Not scraping metric: TotalQueuedRequests") + } + if mapping.KVCacheUtilization == nil { + logger.Info("Not scraping metric: KVCacheUtilization") + } + if mapping.LoraRequestInfo == nil { + logger.Info("Not scraping metric: LoraRequestInfo") + } + +} diff --git a/config/charts/body-based-routing/.helmignore b/config/charts/body-based-routing/.helmignore new file mode 100644 index 00000000..0e8a0eb3 --- /dev/null +++ b/config/charts/body-based-routing/.helmignore @@ -0,0 
+1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/
diff --git a/config/charts/body-based-routing/Chart.yaml b/config/charts/body-based-routing/Chart.yaml
new file mode 100644
index 00000000..952a84f0
--- /dev/null
+++ b/config/charts/body-based-routing/Chart.yaml
@@ -0,0 +1,9 @@
+apiVersion: v2
+name: body-based-routing
+description: A Helm chart for the body-based routing extension
+
+type: application
+
+version: 0.1.0
+
+appVersion: "0.2.0"
diff --git a/config/charts/body-based-routing/README.md b/config/charts/body-based-routing/README.md
new file mode 100644
index 00000000..d311b8c3
--- /dev/null
+++ b/config/charts/body-based-routing/README.md
@@ -0,0 +1,54 @@
+# Body-based routing
+
+A chart to deploy the body-based routing deployment and service.
+
+
+## Install
+
+To install a body-based router named `body-based-router`, you can run the following command:
+
+```txt
+$ helm install body-based-router ./config/charts/body-based-routing \
+    --set provider.name=[gke|istio] \
+    --set inferenceGateway.name=inference-gateway
+```
+
+Note that the provider name is needed to ensure provider-specific manifests are also applied. If no provider is specified, then only
+the deployment and service are deployed.
+
+To install via the latest published chart in staging (--version v0 indicates latest dev version), you can run the following command:
+
+```txt
+$ helm install body-based-router oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/body-based-routing \
+    --version v0 \
+    --set provider.name=[gke|istio]
+```
+
+## Uninstall
+
+Run the following command to uninstall the chart:
+
+```txt
+$ helm uninstall body-based-router
+```
+
+## Configuration
+
+The following table lists the configurable parameters of the chart.
+
+| **Parameter Name**                          | **Description**                                                                                      |
+|---------------------------------------------|----------------------------------------------------------------------------------------------------|
+| `bbr.name`                                  | Name for the deployment and service.                                                                |
+| `bbr.replicas`                              | Number of replicas for the deployment. Defaults to `1`.                                             |
+| `bbr.port`                                  | Port serving ext_proc. Defaults to `9004`.                                                          |
+| `bbr.healthCheckPort`                       | Port for health checks. Defaults to `9005`.                                                         |
+| `bbr.image.name`                            | Name of the container image used.                                                                   |
+| `bbr.image.hub`                             | Registry URL where the image is hosted.                                                             |
+| `bbr.image.tag`                             | Image tag.                                                                                          |
+| `bbr.image.pullPolicy`                      | Image pull policy for the container. Possible values: `Always`, `IfNotPresent`, or `Never`. Defaults to `Always`. |
+| `provider.name`                             | Name of the Inference Gateway implementation being used. Possible values: `istio`, `gke`. Defaults to `none`. |
+| `inferenceGateway.name`                     | The name of the Gateway. Defaults to `inference-gateway`. |
+
+## Notes
+
+This chart should only be deployed once per Gateway.
diff --git a/config/charts/body-based-routing/templates/NOTES.txt b/config/charts/body-based-routing/templates/NOTES.txt
new file mode 100644
index 00000000..0a382009
--- /dev/null
+++ b/config/charts/body-based-routing/templates/NOTES.txt
@@ -0,0 +1 @@
+Body-based routing extension deployed.
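As a concrete invocation of the README above (a sketch only; the values are illustrative, and each `--set` key corresponds to a parameter in the chart's `values.yaml` shown further below), an Istio install that pins the image tag and runs two replicas might look like:

```shell
helm install body-based-router ./config/charts/body-based-routing \
  --set provider.name=istio \
  --set inferenceGateway.name=inference-gateway \
  --set bbr.image.tag=main \
  --set bbr.replicas=2
```

Setting `provider.name=istio` additionally renders the Istio-specific `EnvoyFilter` and `DestinationRule` templates that follow.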
diff --git a/config/charts/body-based-routing/templates/bbr.yaml b/config/charts/body-based-routing/templates/bbr.yaml new file mode 100644 index 00000000..ef08ae49 --- /dev/null +++ b/config/charts/body-based-routing/templates/bbr.yaml @@ -0,0 +1,42 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +spec: + replicas: {{ .Values.bbr.replicas | default 1 }} + selector: + matchLabels: + app: {{ .Values.bbr.name }} + template: + metadata: + labels: + app: {{ .Values.bbr.name }} + spec: + containers: + - name: bbr + image: {{ .Values.bbr.image.hub }}/{{ .Values.bbr.image.name }}:{{ .Values.bbr.image.tag }} + imagePullPolicy: {{ .Values.bbr.image.pullPolicy | default "Always" }} + args: + - "-streaming" + - "-v" + - "3" + ports: + - containerPort: {{ .Values.bbr.port }} + # health check + - containerPort: {{ .Values.bbr.healthCheckPort }} +--- +apiVersion: v1 +kind: Service +metadata: + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +spec: + selector: + app: {{ .Values.bbr.name }} + ports: + - protocol: TCP + port: {{ .Values.bbr.port }} + targetPort: {{ .Values.bbr.port }} + appProtocol: HTTP2 + type: ClusterIP diff --git a/config/charts/body-based-routing/templates/gke.yaml b/config/charts/body-based-routing/templates/gke.yaml new file mode 100644 index 00000000..77b776a4 --- /dev/null +++ b/config/charts/body-based-routing/templates/gke.yaml @@ -0,0 +1,49 @@ +{{- if eq .Values.provider.name "gke" }} +--- +kind: GCPRoutingExtension +apiVersion: networking.gke.io/v1 +metadata: + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +spec: + targetRefs: + - group: "gateway.networking.k8s.io" + kind: Gateway + name: {{ .Values.inferenceGateway.name }} + extensionChains: + - name: chain1 + extensions: + - name: ext1 + authority: "myext.com" + timeout: 1s + supportedEvents: + - RequestHeaders + - RequestBody + - RequestTrailers + requestBodySendMode: "FullDuplexStreamed" + backendRef: + group: "" + kind: Service + name: {{ .Values.bbr.name }} + port: {{ .Values.bbr.port }} +--- +apiVersion: networking.gke.io/v1 +kind: HealthCheckPolicy +metadata: + name: bbr-healthcheck + namespace: {{ .Release.Namespace }} +spec: + default: + logConfig: + enabled: true + config: + type: "GRPC" + grpcHealthCheck: + portSpecification: "USE_FIXED_PORT" + port: {{ .Values.bbr.healthCheckPort }} + targetRef: + group: "" + kind: Service + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +{{- end }} diff --git a/config/charts/body-based-routing/templates/istio.yaml b/config/charts/body-based-routing/templates/istio.yaml new file mode 100644 index 00000000..6d4535cc --- /dev/null +++ b/config/charts/body-based-routing/templates/istio.yaml @@ -0,0 +1,47 @@ +{{- if eq .Values.provider.name "istio" }} +--- +apiVersion: networking.istio.io/v1alpha3 +kind: EnvoyFilter +metadata: + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +spec: + configPatches: + - applyTo: HTTP_FILTER + match: + # context omitted so that this applies to both sidecars and gateways + listener: + filterChain: + filter: + name: "envoy.filters.network.http_connection_manager" + patch: + operation: INSERT_FIRST + value: + name: envoy.filters.http.ext_proc + typed_config: + "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor + failure_mode_allow: false + allow_mode_override: true + processing_mode: + request_header_mode: "SEND" + response_header_mode: "SKIP" + request_body_mode: 
"FULL_DUPLEX_STREAMED" + response_body_mode: "NONE" + request_trailer_mode: "SEND" + response_trailer_mode: "SKIP" + grpc_service: + envoy_grpc: + cluster_name: outbound|{{ .Values.bbr.port }}||{{ .Values.bbr.name }}.default.svc.cluster.local +--- +apiVersion: networking.istio.io/v1 +kind: DestinationRule +metadata: + name: {{ .Values.bbr.name }} + namespace: {{ .Release.Namespace }} +spec: + host: {{ .Values.bbr.name }}.default.svc.cluster.local + trafficPolicy: + tls: + mode: SIMPLE + insecureSkipVerify: true +{{- end }} diff --git a/config/charts/body-based-routing/values.yaml b/config/charts/body-based-routing/values.yaml new file mode 100644 index 00000000..0b88dc43 --- /dev/null +++ b/config/charts/body-based-routing/values.yaml @@ -0,0 +1,16 @@ +bbr: + name: body-based-router + replicas: 1 + image: + name: bbr + hub: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension + tag: main + pullPolicy: Always + port: 9004 + healthCheckPort: 9005 + +provider: + name: none + +inferenceGateway: + name: inference-gateway diff --git a/config/charts/inferencepool/.helmignore b/config/charts/inferencepool/.helmignore new file mode 100644 index 00000000..0e8a0eb3 --- /dev/null +++ b/config/charts/inferencepool/.helmignore @@ -0,0 +1,23 @@ +# Patterns to ignore when building packages. +# This supports shell glob matching, relative path matching, and +# negation (prefixed with !). Only one pattern per line. +.DS_Store +# Common VCS dirs +.git/ +.gitignore +.bzr/ +.bzrignore +.hg/ +.hgignore +.svn/ +# Common backup files +*.swp +*.bak +*.tmp +*.orig +*~ +# Various IDEs +.project +.idea/ +*.tmproj +.vscode/ diff --git a/config/charts/inferencepool/Chart.yaml b/config/charts/inferencepool/Chart.yaml new file mode 100644 index 00000000..f98153c5 --- /dev/null +++ b/config/charts/inferencepool/Chart.yaml @@ -0,0 +1,9 @@ +apiVersion: v2 +name: inferencepool +description: A Helm chart for InferencePool + +type: application + +version: 0.0.0 + +appVersion: "0.0.0" diff --git a/config/charts/inferencepool/README.md b/config/charts/inferencepool/README.md new file mode 100644 index 00000000..301e3d9c --- /dev/null +++ b/config/charts/inferencepool/README.md @@ -0,0 +1,64 @@ +# InferencePool + +A chart to deploy an InferencePool and a corresponding EndpointPicker (epp) deployment. + +## Install + +To install an InferencePool named `vllm-llama3-8b-instruct` that selects from endpoints with label `app: vllm-llama3-8b-instruct` and listening on port `8000`, you can run the following command: + +```txt +$ helm install vllm-llama3-8b-instruct ./config/charts/inferencepool \ + --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \ +``` + +To install via the latest published chart in staging (--version v0 indicates latest dev version), you can run the following command: + +```txt +$ helm install vllm-llama3-8b-instruct \ + --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \ + --set provider.name=[none|gke] \ + oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0 +``` + +Note that the provider name is needed to deploy provider-specific resources. If no provider is specified, then only the InferencePool object and the EPP are deployed. 
+
+### Install for Triton TensorRT-LLM
+
+Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install for Triton TensorRT-LLM, e.g.,
+
+```txt
+$ helm install triton-llama3-8b-instruct \
+  --set inferencePool.modelServers.matchLabels.app=triton-llama3-8b-instruct \
+  --set inferencePool.modelServerType=triton-tensorrt-llm \
+  --set provider.name=[none|gke] \
+  oci://us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/charts/inferencepool --version v0
+```
+
+## Uninstall
+
+Run the following command to uninstall the chart, using the release name you chose at install time (e.g. `vllm-llama3-8b-instruct`):
+
+```txt
+$ helm uninstall vllm-llama3-8b-instruct
+```
+
+## Configuration
+
+The following table lists the configurable parameters of the chart.
+
+| **Parameter Name**                          | **Description**                                                                                                          |
+|---------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|
+| `inferencePool.targetPortNumber`            | Target port number for the vllm backends; used by the inference extension to scrape metrics. Defaults to `8000`.         |
+| `inferencePool.modelServerType`             | Type of the model servers in the pool. Valid options are [vllm, triton-tensorrt-llm]; the default is vllm.               |
+| `inferencePool.modelServers.matchLabels`    | Label selector to match vllm backends managed by the inference pool.                                                     |
+| `inferenceExtension.replicas`               | Number of replicas for the endpoint picker extension service. Defaults to `1`.                                           |
+| `inferenceExtension.image.name`             | Name of the container image used for the endpoint picker.                                                                |
+| `inferenceExtension.image.hub`              | Registry URL where the endpoint picker image is hosted.                                                                  |
+| `inferenceExtension.image.tag`              | Image tag of the endpoint picker.                                                                                        |
+| `inferenceExtension.image.pullPolicy`       | Image pull policy for the container. Possible values: `Always`, `IfNotPresent`, or `Never`. Defaults to `Always`.        |
+| `inferenceExtension.extProcPort`            | Port where the endpoint picker service is served for external processing. Defaults to `9002`.                            |
+| `provider.name`                             | Name of the Inference Gateway implementation being used. Possible values: `gke`. Defaults to `none`.                     |
+
+## Notes
+
+This chart will only deploy an InferencePool and its corresponding EndpointPicker extension. Before installing the chart, please make sure that the inference extension CRDs are installed in the cluster. For more details, please refer to the [getting started guide](https://gateway-api-inference-extension.sigs.k8s.io/guides/).
diff --git a/config/charts/inferencepool/templates/NOTES.txt b/config/charts/inferencepool/templates/NOTES.txt
new file mode 100644
index 00000000..22e5c0e1
--- /dev/null
+++ b/config/charts/inferencepool/templates/NOTES.txt
@@ -0,0 +1 @@
+InferencePool {{ .Release.Name }} deployed.
diff --git a/config/charts/inferencepool/templates/_helpers.tpl b/config/charts/inferencepool/templates/_helpers.tpl
new file mode 100644
index 00000000..e011bb7c
--- /dev/null
+++ b/config/charts/inferencepool/templates/_helpers.tpl
@@ -0,0 +1,24 @@
+{{/*
+Common labels
+*/}}
+{{- define "gateway-api-inference-extension.labels" -}}
+app.kubernetes.io/name: {{ include "gateway-api-inference-extension.name" .
}} +{{- if .Chart.AppVersion }} +app.kubernetes.io/version: {{ .Chart.AppVersion | quote }} +{{- end }} +{{- end }} + +{{/* +Inference extension name +*/}} +{{- define "gateway-api-inference-extension.name" -}} +{{- $base := .Release.Name | default "default-pool" | lower | trim | trunc 40 -}} +{{ $base }}-epp +{{- end -}} + +{{/* +Selector labels +*/}} +{{- define "gateway-api-inference-extension.selectorLabels" -}} +inferencepool: {{ include "gateway-api-inference-extension.name" . }} +{{- end -}} diff --git a/config/charts/inferencepool/templates/_validations.tpl b/config/charts/inferencepool/templates/_validations.tpl new file mode 100644 index 00000000..65c743b6 --- /dev/null +++ b/config/charts/inferencepool/templates/_validations.tpl @@ -0,0 +1,8 @@ +{{/* +common validations +*/}} +{{- define "gateway-api-inference-extension.validations.inferencepool.common" -}} +{{- if or (empty $.Values.inferencePool.modelServers) (not $.Values.inferencePool.modelServers.matchLabels) }} +{{- fail ".Values.inferencePool.modelServers.matchLabels is required" }} +{{- end }} +{{- end -}} diff --git a/config/charts/inferencepool/templates/epp-deployment.yaml b/config/charts/inferencepool/templates/epp-deployment.yaml new file mode 100644 index 00000000..fc490210 --- /dev/null +++ b/config/charts/inferencepool/templates/epp-deployment.yaml @@ -0,0 +1,64 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "gateway-api-inference-extension.name" . }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + replicas: {{ .Values.inferenceExtension.replicas | default 1 }} + selector: + matchLabels: + {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 6 }} + template: + metadata: + labels: + {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 8 }} + spec: + serviceAccountName: {{ include "gateway-api-inference-extension.name" . }} + # Conservatively, this timeout should mirror the longest grace period of the pods within the pool + terminationGracePeriodSeconds: 130 + containers: + - name: epp + image: {{ .Values.inferenceExtension.image.hub }}/{{ .Values.inferenceExtension.image.name }}:{{ .Values.inferenceExtension.image.tag }} + imagePullPolicy: {{ .Values.inferenceExtension.image.pullPolicy | default "Always" }} + args: + - -poolName + - {{ .Release.Name }} + - -poolNamespace + - {{ .Release.Namespace }} + - -v + - "3" + - -grpcPort + - "9002" + - -grpcHealthPort + - "9003" + - -metricsPort + - "9090" + {{- if eq (.Values.inferencePool.modelServerType | default "vllm") "triton-tensorrt-llm" }} + - -totalQueuedRequestsMetric + - "nv_trt_llm_request_metrics{request_type=waiting}" + - -kvCacheUsagePercentageMetric + - "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}" + - -loraInfoMetric + - "" # Set an empty metric to disable LoRA metric scraping as they are not supported by Triton yet. 
+ {{- end }} + ports: + - name: grpc + containerPort: 9002 + - name: grpc-health + containerPort: 9003 + - name: metrics + containerPort: 9090 + livenessProbe: + grpc: + port: 9003 + service: inference-extension + initialDelaySeconds: 5 + periodSeconds: 10 + readinessProbe: + grpc: + port: 9003 + service: inference-extension + initialDelaySeconds: 5 + periodSeconds: 10 diff --git a/config/charts/inferencepool/templates/epp-service.yaml b/config/charts/inferencepool/templates/epp-service.yaml new file mode 100644 index 00000000..ed23db17 --- /dev/null +++ b/config/charts/inferencepool/templates/epp-service.yaml @@ -0,0 +1,18 @@ +apiVersion: v1 +kind: Service +metadata: + name: {{ include "gateway-api-inference-extension.name" . }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + selector: + {{- include "gateway-api-inference-extension.selectorLabels" . | nindent 4 }} + ports: + - name: grpc-ext-proc + protocol: TCP + port: {{ .Values.inferenceExtension.extProcPort | default 9002 }} + - name: http-metrics + protocol: TCP + port: {{ .Values.inferenceExtension.metricsPort | default 9090 }} + type: ClusterIP diff --git a/config/charts/inferencepool/templates/gke.yaml b/config/charts/inferencepool/templates/gke.yaml new file mode 100644 index 00000000..70e05b56 --- /dev/null +++ b/config/charts/inferencepool/templates/gke.yaml @@ -0,0 +1,61 @@ +{{- if eq .Values.provider.name "gke" }} +--- +kind: HealthCheckPolicy +apiVersion: networking.gke.io/v1 +metadata: + name: {{ .Release.Name }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + targetRef: + group: "inference.networking.x-k8s.io" + kind: InferencePool + name: {{ .Release.Name }} + default: + config: + type: HTTP + httpHealthCheck: + requestPath: /health + port: {{ .Values.inferencePool.targetPortNumber }} +--- +apiVersion: networking.gke.io/v1 +kind: GCPBackendPolicy +metadata: + name: {{ .Release.Name }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + targetRef: + group: "inference.networking.x-k8s.io" + kind: InferencePool + name: {{ .Release.Name }} + default: + timeoutSec: 300 # 5-minute timeout (adjust as needed) + logging: + enabled: true # log all requests by default +--- +apiVersion: monitoring.googleapis.com/v1 +kind: ClusterPodMonitoring +metadata: + name: {{ .Release.Namespace }}-{{ .Release.Name }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + endpoints: + - port: metrics + scheme: http + interval: 5s + path: /metrics + authorization: + type: Bearer + credentials: + secret: + name: {{ .Values.gke.monitoringSecret.name }} + key: token + namespace: {{ .Values.gke.monitoringSecret.namespace }} + selector: + matchLabels: + {{- include "gateway-api-inference-extension.selectorLabels" . 
| nindent 8 }} +{{- end }} diff --git a/config/charts/inferencepool/templates/inferencepool.yaml b/config/charts/inferencepool/templates/inferencepool.yaml new file mode 100644 index 00000000..4b279cbd --- /dev/null +++ b/config/charts/inferencepool/templates/inferencepool.yaml @@ -0,0 +1,18 @@ +{{ include "gateway-api-inference-extension.validations.inferencepool.common" $ }} +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + name: {{ .Release.Name }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +spec: + targetPortNumber: {{ .Values.inferencePool.targetPortNumber }} + selector: + {{- if .Values.inferencePool.modelServers.matchLabels }} + {{- range $key, $value := .Values.inferencePool.modelServers.matchLabels }} + {{ $key }}: {{ quote $value }} + {{- end }} + {{- end }} + extensionRef: + name: {{ include "gateway-api-inference-extension.name" . }} diff --git a/config/charts/inferencepool/templates/rbac.yaml b/config/charts/inferencepool/templates/rbac.yaml new file mode 100644 index 00000000..cdd50c6a --- /dev/null +++ b/config/charts/inferencepool/templates/rbac.yaml @@ -0,0 +1,45 @@ +kind: ClusterRole +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: {{ include "gateway-api-inference-extension.name" . }} + labels: + {{- include "gateway-api-inference-extension.labels" . | nindent 4 }} +rules: +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferencemodels", "inferencepools"] + verbs: ["get", "watch", "list"] +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "watch", "list"] +- apiGroups: + - authentication.k8s.io + resources: + - tokenreviews + verbs: + - create +- apiGroups: + - authorization.k8s.io + resources: + - subjectaccessreviews + verbs: + - create +--- +kind: ClusterRoleBinding +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: {{ include "gateway-api-inference-extension.name" . }} +subjects: +- kind: ServiceAccount + name: {{ include "gateway-api-inference-extension.name" . }} + namespace: {{ .Release.Namespace }} +roleRef: + kind: ClusterRole + name: {{ include "gateway-api-inference-extension.name" . }} +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: {{ include "gateway-api-inference-extension.name" . }} + namespace: {{ .Release.Namespace }} + labels: + {{- include "gateway-api-inference-extension.labels" . 
| nindent 4 }} diff --git a/config/charts/inferencepool/values.yaml b/config/charts/inferencepool/values.yaml new file mode 100644 index 00000000..bd48f37e --- /dev/null +++ b/config/charts/inferencepool/values.yaml @@ -0,0 +1,23 @@ +inferenceExtension: + replicas: 1 + image: + name: epp + hub: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension + tag: main + pullPolicy: Always + extProcPort: 9002 + +inferencePool: + targetPortNumber: 8000 + modelServerType: vllm # vllm, triton-tensorrt-llm + # modelServers: # REQUIRED + # matchLabels: + # app: vllm-llama3-8b-instruct + +provider: + name: none + +gke: + monitoringSecret: + name: inference-gateway-sa-metrics-reader-secret + namespace: default diff --git a/config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml b/config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml index bca19605..28805096 100644 --- a/config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml +++ b/config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml @@ -14,7 +14,20 @@ spec: singular: inferencemodel scope: Namespaced versions: - - name: v1alpha1 + - additionalPrinterColumns: + - jsonPath: .spec.modelName + name: Model Name + type: string + - jsonPath: .spec.poolRef.name + name: Inference Pool + type: string + - jsonPath: .spec.criticality + name: Criticality + type: string + - jsonPath: .metadata.creationTimestamp + name: Age + type: date + name: v1alpha2 schema: openAPIV3Schema: description: InferenceModel is the Schema for the InferenceModels API. @@ -82,6 +95,9 @@ spec: an error will be returned specifying that no valid target model is found. maxLength: 256 type: string + x-kubernetes-validations: + - message: modelName is immutable + rule: self == oldSelf poolRef: description: PoolRef is a reference to the inference pool, the pool must exist in the same namespace. @@ -143,7 +159,7 @@ spec: Conversely weights are optional, so long as ALL targetModels do not specify a weight. format: int32 maximum: 1000000 - minimum: 0 + minimum: 1 type: integer required: - name diff --git a/config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml b/config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml index 9e6473b9..8386db82 100644 --- a/config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml +++ b/config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml @@ -14,7 +14,7 @@ spec: singular: inferencepool scope: Namespaced versions: - - name: v1alpha1 + - name: v1alpha2 schema: openAPIV3Schema: description: InferencePool is the Schema for the InferencePools API. @@ -56,7 +56,9 @@ spec: default: "" description: |- Group is the group of the referent. - When unspecified or empty string, core API group is inferred. + The default value is "", representing the Core API group. + maxLength: 253 + pattern: ^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$ type: string kind: default: Service @@ -71,14 +73,20 @@ spec: terms of conformance. They also may not be safe to forward to (see CVE-2021-25740 for more information). Implementations MUST NOT support ExternalName Services. + maxLength: 63 + minLength: 1 + pattern: ^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$ type: string name: description: Name is the name of the referent. + maxLength: 253 + minLength: 1 type: string - targetPortNumber: + portNumber: description: |- - The port number on the pods running the extension. When unspecified, implementations SHOULD infer a - default value of 9002 when the Kind is Service. 
+          The port number on the service running the extension. When unspecified,
+          implementations SHOULD infer a default value of 9002 when the Kind is
+          Service.
        format: int32
        maximum: 65535
        minimum: 1
@@ -109,6 +117,8 @@
          that should be included in the InferencePool. In some cases, implementations
          may translate this field to a Service selector, so this matches the simple
          map used for Service selectors instead of the full Kubernetes LabelSelector type.
+          If specified, it will be applied to match the model server pods in the same namespace as the InferencePool.
+          Cross-namespace selectors are not supported.
        type: object
      targetPortNumber:
        description: |-
@@ -126,78 +136,141 @@
    status:
      description: InferencePoolStatus defines the observed state of InferencePool
      properties:
-        conditions:
-          default:
-          - lastTransitionTime: "1970-01-01T00:00:00Z"
-            message: Waiting for controller
-            reason: Pending
-            status: Unknown
-            type: Ready
+        parent:
          description: |-
-            Conditions track the state of the InferencePool.
+            Parents is a list of parent resources (usually Gateways) that are
+            associated with the InferencePool, and the status of the InferencePool with respect to
+            each parent.

-            Known condition types are:
-
-            * "Ready"
+            A maximum of 32 Gateways will be represented in this list. An empty list
+            means the InferencePool has not been attached to any Gateway.
          items:
-            description: Condition contains details for one aspect of the current
-              state of this API Resource.
+            description: PoolStatus defines the observed state of InferencePool
+              from a Gateway.
            properties:
-              lastTransitionTime:
-                description: |-
-                  lastTransitionTime is the last time the condition transitioned from one status to another.
-                  This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
-                format: date-time
-                type: string
-              message:
-                description: |-
-                  message is a human readable message indicating details about the transition.
-                  This may be an empty string.
-                maxLength: 32768
-                type: string
-              observedGeneration:
+              conditions:
+                default:
+                - lastTransitionTime: "1970-01-01T00:00:00Z"
+                  message: Waiting for controller
+                  reason: Pending
+                  status: Unknown
+                  type: Accepted
                description: |-
-                  observedGeneration represents the .metadata.generation that the condition was set based upon.
-                  For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
-                  with respect to the current state of the instance.
-                format: int64
-                minimum: 0
-                type: integer
-              reason:
-                description: |-
-                  reason contains a programmatic identifier indicating the reason for the condition's last transition.
-                  Producers of specific condition types may define expected values and meanings for this field,
-                  and whether the values are considered a guaranteed API.
-                  The value should be a CamelCase string.
-                  This field may not be empty.
-                maxLength: 1024
-                minLength: 1
-                pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
-                type: string
-              status:
-                description: status of the condition, one of True, False, Unknown.
-                enum:
-                - "True"
-                - "False"
-                - Unknown
-                type: string
-              type:
-                description: type of condition in CamelCase or in foo.example.com/CamelCase.
-                maxLength: 316
-                pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
-                type: string
+                  Conditions track the state of the InferencePool.
+
+                  Known condition types are:
+
+                  * "Accepted"
+                  * "ResolvedRefs"
+                items:
+                  description: Condition contains details for one aspect of
+                    the current state of this API Resource.
+                  properties:
+                    lastTransitionTime:
+                      description: |-
+                        lastTransitionTime is the last time the condition transitioned from one status to another.
+                        This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable.
+                      format: date-time
+                      type: string
+                    message:
+                      description: |-
+                        message is a human readable message indicating details about the transition.
+                        This may be an empty string.
+                      maxLength: 32768
+                      type: string
+                    observedGeneration:
+                      description: |-
+                        observedGeneration represents the .metadata.generation that the condition was set based upon.
+                        For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date
+                        with respect to the current state of the instance.
+                      format: int64
+                      minimum: 0
+                      type: integer
+                    reason:
+                      description: |-
+                        reason contains a programmatic identifier indicating the reason for the condition's last transition.
+                        Producers of specific condition types may define expected values and meanings for this field,
+                        and whether the values are considered a guaranteed API.
+                        The value should be a CamelCase string.
+                        This field may not be empty.
+                      maxLength: 1024
+                      minLength: 1
+                      pattern: ^[A-Za-z]([A-Za-z0-9_,:]*[A-Za-z0-9_])?$
+                      type: string
+                    status:
+                      description: status of the condition, one of True, False,
+                        Unknown.
+                      enum:
+                      - "True"
+                      - "False"
+                      - Unknown
+                      type: string
+                    type:
+                      description: type of condition in CamelCase or in foo.example.com/CamelCase.
+                      maxLength: 316
+                      pattern: ^([a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*/)?(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])$
+                      type: string
+                  required:
+                  - lastTransitionTime
+                  - message
+                  - reason
+                  - status
+                  - type
+                  type: object
+                maxItems: 8
+                type: array
+                x-kubernetes-list-map-keys:
+                - type
+                x-kubernetes-list-type: map
+              parentRef:
+                description: ParentRef indicates the Gateway that has observed
+                  the state of the InferencePool.
+                properties:
+                  apiVersion:
+                    description: API version of the referent.
+                    type: string
+                  fieldPath:
+                    description: |-
+                      If referring to a piece of an object instead of an entire object, this string
+                      should contain a valid JSON/Go field access statement, such as desiredState.manifest.containers[2].
+                      For example, if the object reference is to a container within a pod, this would take on a value like:
+                      "spec.containers{name}" (where "name" refers to the name of the container that triggered
+                      the event) or if no container name is specified "spec.containers[2]" (container with
+                      index 2 in this pod). This syntax is chosen only to have some well-defined way of
+                      referencing a part of an object.
+                    type: string
+                  kind:
+                    description: |-
+                      Kind of the referent.
+                      More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
+                    type: string
+                  name:
+                    description: |-
+                      Name of the referent.
+                      More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#names
+                    type: string
+                  namespace:
+                    description: |-
+                      Namespace of the referent.
+                      More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
+                    type: string
+                  resourceVersion:
+                    description: |-
+                      Specific resourceVersion to which this reference is made, if any.
+ More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency + type: string + uid: + description: |- + UID of the referent. + More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#uids + type: string + type: object + x-kubernetes-map-type: atomic required: - - lastTransitionTime - - message - - reason - - status - - type + - parentRef type: object - maxItems: 8 + maxItems: 32 type: array - x-kubernetes-list-map-keys: - - type - x-kubernetes-list-type: map type: object type: object served: true diff --git a/config/default/kustomization.yaml b/config/default/kustomization.yaml deleted file mode 100644 index 1fd9939f..00000000 --- a/config/default/kustomization.yaml +++ /dev/null @@ -1,151 +0,0 @@ -# Adds namespace to all resources. -namespace: api-system - -# Value of this field is prepended to the -# names of all resources, e.g. a deployment named -# "wordpress" becomes "alices-wordpress". -# Note that it should also match with the prefix (text before '-') of the namespace -# field above. -namePrefix: api- - -# Labels to add to all resources and selectors. -#labels: -#- includeSelectors: true -# pairs: -# someName: someValue - -resources: -- ../crd -- ../rbac -- ../manager -# [WEBHOOK] To enable webhook, uncomment all the sections with [WEBHOOK] prefix including the one in -# crd/kustomization.yaml -#- ../webhook -# [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. 'WEBHOOK' components are required. -#- ../certmanager -# [PROMETHEUS] To enable prometheus monitor, uncomment all sections with 'PROMETHEUS'. -#- ../prometheus -# [METRICS] Expose the controller manager metrics service. -- metrics_service.yaml -# [NETWORK POLICY] Protect the /metrics endpoint and Webhook Server with NetworkPolicy. -# Only Pod(s) running a namespace labeled with 'metrics: enabled' will be able to gather the metrics. -# Only CR(s) which requires webhooks and are applied on namespaces labeled with 'webhooks: enabled' will -# be able to communicate with the Webhook Server. -#- ../network-policy - -# Uncomment the patches line if you enable Metrics, and/or are using webhooks and cert-manager -patches: -# [METRICS] The following patch will enable the metrics endpoint using HTTPS and the port :8443. -# More info: https://book.kubebuilder.io/reference/metrics -- path: manager_metrics_patch.yaml - target: - kind: Deployment - -# [WEBHOOK] To enable webhook, uncomment all the sections with [WEBHOOK] prefix including the one in -# crd/kustomization.yaml -#- path: manager_webhook_patch.yaml - -# [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER'. -# Uncomment 'CERTMANAGER' sections in crd/kustomization.yaml to enable the CA injection in the admission webhooks. -# 'CERTMANAGER' needs to be enabled to use ca injection -#- path: webhookcainjection_patch.yaml - -# [CERTMANAGER] To enable cert-manager, uncomment all sections with 'CERTMANAGER' prefix. 
-# Uncomment the following replacements to add the cert-manager CA injection annotations -#replacements: -# - source: # Add cert-manager annotation to ValidatingWebhookConfiguration, MutatingWebhookConfiguration and CRDs -# kind: Certificate -# group: cert-manager.io -# version: v1 -# name: serving-cert # this name should match the one in certificate.yaml -# fieldPath: .metadata.namespace # namespace of the certificate CR -# targets: -# - select: -# kind: ValidatingWebhookConfiguration -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 0 -# create: true -# - select: -# kind: MutatingWebhookConfiguration -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 0 -# create: true -# - select: -# kind: CustomResourceDefinition -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 0 -# create: true -# - source: -# kind: Certificate -# group: cert-manager.io -# version: v1 -# name: serving-cert # this name should match the one in certificate.yaml -# fieldPath: .metadata.name -# targets: -# - select: -# kind: ValidatingWebhookConfiguration -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 1 -# create: true -# - select: -# kind: MutatingWebhookConfiguration -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 1 -# create: true -# - select: -# kind: CustomResourceDefinition -# fieldPaths: -# - .metadata.annotations.[cert-manager.io/inject-ca-from] -# options: -# delimiter: '/' -# index: 1 -# create: true -# - source: # Add cert-manager annotation to the webhook Service -# kind: Service -# version: v1 -# name: webhook-service -# fieldPath: .metadata.name # namespace of the service -# targets: -# - select: -# kind: Certificate -# group: cert-manager.io -# version: v1 -# fieldPaths: -# - .spec.dnsNames.0 -# - .spec.dnsNames.1 -# options: -# delimiter: '.' -# index: 0 -# create: true -# - source: -# kind: Service -# version: v1 -# name: webhook-service -# fieldPath: .metadata.namespace # namespace of the service -# targets: -# - select: -# kind: Certificate -# group: cert-manager.io -# version: v1 -# fieldPaths: -# - .spec.dnsNames.0 -# - .spec.dnsNames.1 -# options: -# delimiter: '.' 
-# index: 1 -# create: true diff --git a/config/default/manager_metrics_patch.yaml b/config/default/manager_metrics_patch.yaml deleted file mode 100644 index 2aaef653..00000000 --- a/config/default/manager_metrics_patch.yaml +++ /dev/null @@ -1,4 +0,0 @@ -# This patch adds the args to allow exposing the metrics endpoint using HTTPS -- op: add - path: /spec/template/spec/containers/0/args/0 - value: --metrics-bind-address=:8443 diff --git a/config/default/metrics_service.yaml b/config/default/metrics_service.yaml deleted file mode 100644 index 140d4943..00000000 --- a/config/default/metrics_service.yaml +++ /dev/null @@ -1,17 +0,0 @@ -apiVersion: v1 -kind: Service -metadata: - labels: - control-plane: controller-manager - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: controller-manager-metrics-service - namespace: system -spec: - ports: - - name: https - port: 8443 - protocol: TCP - targetPort: 8443 - selector: - control-plane: controller-manager diff --git a/config/manifests/benchmark/benchmark.yaml b/config/manifests/benchmark/benchmark.yaml new file mode 100644 index 00000000..c784730e --- /dev/null +++ b/config/manifests/benchmark/benchmark.yaml @@ -0,0 +1,60 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + labels: + app: benchmark-tool + name: benchmark-tool +spec: + replicas: 1 + selector: + matchLabels: + app: benchmark-tool + template: + metadata: + labels: + app: benchmark-tool + spec: + containers: + # The following image was built from this source https://github.com/AI-Hypercomputer/inference-benchmark/tree/07628c9fe01b748f5a4cc9e5c2ee4234aaf47699 + - image: 'us-docker.pkg.dev/cloud-tpu-images/inference/inference-benchmark@sha256:1c100b0cc949c7df7a2db814ae349c790f034b4b373aaad145e77e815e838438' + imagePullPolicy: Always + name: benchmark-tool + command: + - bash + - -c + - ./latency_throughput_curve.sh + env: + - name: IP + value: '' + - name: REQUEST_RATES + value: '10,20,30' + - name: BENCHMARK_TIME_SECONDS + value: '60' + - name: TOKENIZER + value: 'meta-llama/Llama-3.1-8B-Instruct' + - name: MODELS + value: 'meta-llama/Llama-3.1-8B-Instruct' + - name: BACKEND + value: vllm + - name: PORT + value: "8081" + - name: INPUT_LENGTH + value: "1024" + - name: OUTPUT_LENGTH + value: '2048' + - name: FILE_PREFIX + value: benchmark + - name: PROMPT_DATASET_FILE + value: ShareGPT_V3_unfiltered_cleaned_split.json + - name: HF_TOKEN + valueFrom: + secretKeyRef: + key: token + name: hf-token + resources: + limits: + cpu: "2" + memory: 20Gi + requests: + cpu: "2" + memory: 20Gi diff --git a/config/manifests/benchmark/model-server-service.yaml b/config/manifests/benchmark/model-server-service.yaml new file mode 100644 index 00000000..014054cf --- /dev/null +++ b/config/manifests/benchmark/model-server-service.yaml @@ -0,0 +1,12 @@ +apiVersion: v1 +kind: Service +metadata: + name: my-pool-service +spec: + ports: + - port: 8081 + protocol: TCP + targetPort: 8000 + selector: + app: my-pool + type: LoadBalancer diff --git a/config/manifests/gateway/gke/gateway.yaml b/config/manifests/gateway/gke/gateway.yaml new file mode 100644 index 00000000..942cde5c --- /dev/null +++ b/config/manifests/gateway/gke/gateway.yaml @@ -0,0 +1,10 @@ +kind: Gateway +apiVersion: gateway.networking.k8s.io/v1 +metadata: + name: inference-gateway +spec: + gatewayClassName: gke-l7-regional-external-managed + listeners: + - name: http + port: 80 + protocol: HTTP diff --git a/config/manifests/gateway/gke/gcp-backend-policy.yaml b/config/manifests/gateway/gke/gcp-backend-policy.yaml new 
file mode 100644
index 00000000..7b294304
--- /dev/null
+++ b/config/manifests/gateway/gke/gcp-backend-policy.yaml
@@ -0,0 +1,13 @@
+apiVersion: networking.gke.io/v1
+kind: GCPBackendPolicy
+metadata:
+  name: inferencepool-backend-policy
+spec:
+  targetRef:
+    group: "inference.networking.x-k8s.io"
+    kind: InferencePool
+    name: vllm-llama3-8b-instruct
+  default:
+    timeoutSec: 300
+    logging:
+      enabled: true
diff --git a/config/manifests/gateway/gke/healthcheck.yaml b/config/manifests/gateway/gke/healthcheck.yaml
new file mode 100644
index 00000000..93b6cd7f
--- /dev/null
+++ b/config/manifests/gateway/gke/healthcheck.yaml
@@ -0,0 +1,16 @@
+kind: HealthCheckPolicy
+apiVersion: networking.gke.io/v1
+metadata:
+  name: health-check-policy
+  namespace: default
+spec:
+  targetRef:
+    group: "inference.networking.x-k8s.io"
+    kind: InferencePool
+    name: vllm-llama3-8b-instruct
+  default:
+    config:
+      type: HTTP
+      httpHealthCheck:
+        requestPath: /health
+        port: 8000
diff --git a/config/manifests/gateway/gke/httproute.yaml b/config/manifests/gateway/gke/httproute.yaml
new file mode 100644
index 00000000..6ea90891
--- /dev/null
+++ b/config/manifests/gateway/gke/httproute.yaml
@@ -0,0 +1,18 @@
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: llm-route
+spec:
+  parentRefs:
+  - group: gateway.networking.k8s.io
+    kind: Gateway
+    name: inference-gateway
+  rules:
+  - backendRefs:
+    - group: inference.networking.x-k8s.io
+      kind: InferencePool
+      name: vllm-llama3-8b-instruct
+    matches:
+    - path:
+        type: PathPrefix
+        value: /
diff --git a/config/manifests/gateway/istio/destination-rule.yaml b/config/manifests/gateway/istio/destination-rule.yaml
new file mode 100644
index 00000000..f9cd0c3c
--- /dev/null
+++ b/config/manifests/gateway/istio/destination-rule.yaml
@@ -0,0 +1,10 @@
+apiVersion: networking.istio.io/v1
+kind: DestinationRule
+metadata:
+  name: epp-insecure-tls
+spec:
+  host: vllm-llama3-8b-instruct-epp
+  trafficPolicy:
+    tls:
+      mode: SIMPLE
+      insecureSkipVerify: true
diff --git a/config/manifests/gateway/istio/gateway.yaml b/config/manifests/gateway/istio/gateway.yaml
new file mode 100644
index 00000000..dd762678
--- /dev/null
+++ b/config/manifests/gateway/istio/gateway.yaml
@@ -0,0 +1,10 @@
+apiVersion: gateway.networking.k8s.io/v1
+kind: Gateway
+metadata:
+  name: inference-gateway
+spec:
+  gatewayClassName: istio
+  listeners:
+  - name: http
+    port: 80
+    protocol: HTTP
diff --git a/config/manifests/gateway/istio/httproute.yaml b/config/manifests/gateway/istio/httproute.yaml
new file mode 100644
index 00000000..18e90ced
--- /dev/null
+++ b/config/manifests/gateway/istio/httproute.yaml
@@ -0,0 +1,20 @@
+apiVersion: gateway.networking.k8s.io/v1
+kind: HTTPRoute
+metadata:
+  name: llm-route
+spec:
+  parentRefs:
+  - group: gateway.networking.k8s.io
+    kind: Gateway
+    name: inference-gateway
+  rules:
+  - backendRefs:
+    - group: inference.networking.x-k8s.io
+      kind: InferencePool
+      name: vllm-llama3-8b-instruct
+    matches:
+    - path:
+        type: PathPrefix
+        value: /
+    timeouts:
+      request: 300s
diff --git a/config/manifests/gateway/kgateway/gateway.yaml b/config/manifests/gateway/kgateway/gateway.yaml
new file mode 100644
index 00000000..7bcd08a6
--- /dev/null
+++ b/config/manifests/gateway/kgateway/gateway.yaml
@@ -0,0 +1,10 @@
+apiVersion: gateway.networking.k8s.io/v1
+kind: Gateway
+metadata:
+  name: inference-gateway
+spec:
+  gatewayClassName: kgateway
+  listeners:
+  - name: http
+    port: 80
+    protocol: HTTP
diff --git a/config/manifests/gateway/kgateway/httproute.yaml
b/config/manifests/gateway/kgateway/httproute.yaml new file mode 100644 index 00000000..03967729 --- /dev/null +++ b/config/manifests/gateway/kgateway/httproute.yaml @@ -0,0 +1,21 @@ +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.x-k8s.io + kind: InferencePool + name: vllm-llama3-8b-instruct + port: 8000 # Remove when https://github.com/kgateway-dev/kgateway/issues/10987 is fixed. + matches: + - path: + type: PathPrefix + value: / + timeouts: + request: 300s diff --git a/config/manifests/inferencemodel.yaml b/config/manifests/inferencemodel.yaml new file mode 100644 index 00000000..67c91d0e --- /dev/null +++ b/config/manifests/inferencemodel.yaml @@ -0,0 +1,32 @@ +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: food-review +spec: + modelName: food-review + criticality: Standard + poolRef: + name: vllm-llama3-8b-instruct + targetModels: + - name: food-review-1 + weight: 100 +--- +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: base-model +spec: + modelName: meta-llama/Llama-3.1-8B-Instruct + criticality: Critical + poolRef: + name: vllm-llama3-8b-instruct +--- +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: base-model-cpu +spec: + modelName: Qwen/Qwen2.5-1.5B-Instruct + criticality: Critical + poolRef: + name: vllm-llama3-8b-instruct diff --git a/config/manifests/inferencepool-resources.yaml b/config/manifests/inferencepool-resources.yaml new file mode 100644 index 00000000..3d978292 --- /dev/null +++ b/config/manifests/inferencepool-resources.yaml @@ -0,0 +1,124 @@ +# Note: If you change this file, please also change the file used for e2e tests! 
+# +# https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/test/testdata/inferencepool-e2e.yaml +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + name: vllm-llama3-8b-instruct +spec: + targetPortNumber: 8000 + selector: + app: vllm-llama3-8b-instruct + extensionRef: + name: vllm-llama3-8b-instruct-epp +--- +apiVersion: v1 +kind: Service +metadata: + name: vllm-llama3-8b-instruct-epp + namespace: default +spec: + selector: + app: vllm-llama3-8b-instruct-epp + ports: + - protocol: TCP + port: 9002 + targetPort: 9002 + appProtocol: http2 + type: ClusterIP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-llama3-8b-instruct-epp + namespace: default + labels: + app: vllm-llama3-8b-instruct-epp +spec: + replicas: 1 + selector: + matchLabels: + app: vllm-llama3-8b-instruct-epp + template: + metadata: + labels: + app: vllm-llama3-8b-instruct-epp + spec: + # Conservatively, this timeout should mirror the longest grace period of the pods within the pool + terminationGracePeriodSeconds: 130 + containers: + - name: epp + image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main + imagePullPolicy: Always + args: + - -poolName + - "vllm-llama3-8b-instruct" + - "-poolNamespace" + - "default" + - -v + - "4" + - --zap-encoder + - "json" + - -grpcPort + - "9002" + - -grpcHealthPort + - "9003" + ports: + - containerPort: 9002 + - containerPort: 9003 + - name: metrics + containerPort: 9090 + livenessProbe: + grpc: + port: 9003 + service: inference-extension + initialDelaySeconds: 5 + periodSeconds: 10 + readinessProbe: + grpc: + port: 9003 + service: inference-extension + initialDelaySeconds: 5 + periodSeconds: 10 +--- +kind: ClusterRole +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: pod-read +rules: +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferencemodels"] + verbs: ["get", "watch", "list"] +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "watch", "list"] +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferencepools"] + verbs: ["get", "watch", "list"] +- apiGroups: ["discovery.k8s.io"] + resources: ["endpointslices"] + verbs: ["get", "watch", "list"] +- apiGroups: + - authentication.k8s.io + resources: + - tokenreviews + verbs: + - create +- apiGroups: + - authorization.k8s.io + resources: + - subjectaccessreviews + verbs: + - create +--- +kind: ClusterRoleBinding +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: pod-read-binding +subjects: +- kind: ServiceAccount + name: default + namespace: default +roleRef: + kind: ClusterRole + name: pod-read diff --git a/config/manifests/vllm/cpu-deployment.yaml b/config/manifests/vllm/cpu-deployment.yaml new file mode 100644 index 00000000..827f2156 --- /dev/null +++ b/config/manifests/vllm/cpu-deployment.yaml @@ -0,0 +1,120 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-llama3-8b-instruct +spec: + replicas: 3 + selector: + matchLabels: + app: vllm-llama3-8b-instruct + template: + metadata: + labels: + app: vllm-llama3-8b-instruct + spec: + containers: + - name: lora + image: "public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:v0.8.0" # formal images can be found in https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo + imagePullPolicy: Always + command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args: + - "--model" + - "Qwen/Qwen2.5-1.5B-Instruct" + - "--port" + - "8000" + - "--enable-lora" + - "--max-loras" + - "4" + - "--lora-modules" + - '{"name": 
"food-review-0", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}' + - '{"name": "food-review-1", "path": "SriSanth2345/Qwen-1.5B-Tweet-Generations", "base_model_name": "Qwen/Qwen2.5-1.5B"}' + env: + - name: PORT + value: "8000" + - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING + value: "true" + - name: VLLM_CPU_KVCACHE_SPACE + value: "4" + ports: + - containerPort: 8000 + name: http + protocol: TCP + livenessProbe: + failureThreshold: 240 + httpGet: + path: /health + port: http + scheme: HTTP + initialDelaySeconds: 5 + periodSeconds: 5 + successThreshold: 1 + timeoutSeconds: 1 + readinessProbe: + failureThreshold: 600 + httpGet: + path: /health + port: http + scheme: HTTP + initialDelaySeconds: 5 + periodSeconds: 5 + successThreshold: 1 + timeoutSeconds: 1 + resources: + limits: + cpu: "12" + memory: "9000Mi" + requests: + cpu: "12" + memory: "9000Mi" + volumeMounts: + - mountPath: /data + name: data + - mountPath: /dev/shm + name: shm + - name: adapters + mountPath: "/adapters" + initContainers: + - name: lora-adapter-syncer + tty: true + stdin: true + image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main + restartPolicy: Always + imagePullPolicy: Always + env: + - name: DYNAMIC_LORA_ROLLOUT_CONFIG + value: "/config/configmap.yaml" + volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths + - name: config-volume + mountPath: /config + restartPolicy: Always + schedulerName: default-scheduler + terminationGracePeriodSeconds: 30 + volumes: + - name: data + emptyDir: {} + - name: shm + emptyDir: + medium: Memory + - name: adapters + emptyDir: {} + - name: config-volume + configMap: + name: vllm-qwen-adapters +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: vllm-qwen-adapters +data: + configmap.yaml: | + vLLMLoRAConfig: + name: vllm-llama3-8b-instruct + port: 8000 + ensureExist: + models: + - base-model: Qwen/Qwen2.5-1.5B + id: food-review + source: SriSanth2345/Qwen-1.5B-Tweet-Generations + - base-model: Qwen/Qwen2.5-1.5B + id: cad-fabricator + source: SriSanth2345/Qwen-1.5B-Tweet-Generations \ No newline at end of file diff --git a/config/manifests/vllm/gpu-deployment.yaml b/config/manifests/vllm/gpu-deployment.yaml new file mode 100644 index 00000000..16f93882 --- /dev/null +++ b/config/manifests/vllm/gpu-deployment.yaml @@ -0,0 +1,258 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: vllm-llama3-8b-instruct +spec: + replicas: 3 + selector: + matchLabels: + app: vllm-llama3-8b-instruct + template: + metadata: + labels: + app: vllm-llama3-8b-instruct + spec: + containers: + - name: vllm + image: "vllm/vllm-openai:latest" + imagePullPolicy: Always + command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] + args: + - "--model" + - "meta-llama/Llama-3.1-8B-Instruct" + - "--tensor-parallel-size" + - "1" + - "--port" + - "8000" + - "--max-num-seq" + - "1024" + - "--compilation-config" + - "3" + - "--enable-lora" + - "--max-loras" + - "2" + - "--max-lora-rank" + - "8" + - "--max-cpu-loras" + - "12" + env: + # Enabling LoRA support temporarily disables automatic v1, we want to force it on + # until 0.8.3 vLLM is released. 
+        - name: VLLM_USE_V1
+          value: "1"
+        - name: PORT
+          value: "8000"
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token
+              key: token
+        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
+          value: "true"
+        ports:
+        - containerPort: 8000
+          name: http
+          protocol: TCP
+        lifecycle:
+          preStop:
+            # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
+            # to give upstream gateways a chance to take us out of rotation. The time we wait
+            # is dependent on the time it takes for all upstreams to completely remove us from
+            # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
+            # our deployment to run behind a modern gateway like Envoy which is designed to
+            # probe for readiness aggressively.
+            sleep:
+              # Upstream gateway probers for health should be set on a low period, such as 5s,
+              # and the tighter we can make that bound, the faster we release
+              # accelerators during controlled shutdowns. However, we should expect variance,
+              # as load balancers may have internal delays, and we don't want to drop requests
+              # normally, so we're often aiming to set this value to a p99 propagation latency
+              # of readiness -> load balancer taking backend out of rotation, not the average.
+              #
+              # This value is generally stable and must often be experimentally determined
+              # for a given load balancer and health check period. We set the value here to
+              # the highest value we observe on a supported load balancer, and we recommend
+              # tuning this value down and verifying no requests are dropped.
+              #
+              # If this value is updated, be sure to update terminationGracePeriodSeconds.
+              #
+              seconds: 30
+              #
+              # IMPORTANT: preStop.sleep is beta as of Kubernetes 1.30 - for older versions
+              # replace with this exec action.
+              #exec:
+              #  command:
+              #  - /usr/bin/sleep
+              #  - "30"
+        livenessProbe:
+          httpGet:
+            path: /health
+            port: http
+            scheme: HTTP
+          # vLLM's health check is simple, so we can more aggressively probe it. Liveness
+          # check endpoints should always be suitable for aggressive probing.
+          periodSeconds: 1
+          successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant. However, any liveness triggered restart requires the very
+          # large core model to be reloaded, and so we should bias towards ensuring the
+          # server is definitely unhealthy vs immediately restarting. Use 5 attempts as
+          # evidence of a serious problem.
+          failureThreshold: 5
+          timeoutSeconds: 1
+        readinessProbe:
+          httpGet:
+            path: /health
+            port: http
+            scheme: HTTP
+          # vLLM's health check is simple, so we can more aggressively probe it. Readiness
+          # check endpoints should always be suitable for aggressive probing, but may be
+          # slightly more expensive than liveness probes.
+          periodSeconds: 1
+          successThreshold: 1
+          # vLLM has a very simple health implementation, which means that any failure is
+          # likely significant.
+          failureThreshold: 1
+          timeoutSeconds: 1
+        # We set a startup probe so that we don't begin directing traffic or checking
+        # liveness to this instance until the model is loaded.
+        startupProbe:
+          # Failure threshold is when we believe startup will not happen at all, and is set
+          # to the maximum possible time we believe loading a model will take. In our
+          # default configuration we are downloading a model from HuggingFace, which may
+          # take a long time, then the model must load into the accelerator. We choose
+          # 10 minutes as a reasonable maximum startup time before giving up and attempting
+          # to restart the pod.
+          #
+          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
+          # loop forever. Be sure to set this appropriately.
+          failureThreshold: 600
+          # Set delay to start low so that if the base model changes to something smaller
+          # or an optimization is deployed, we don't wait unnecessarily.
+          initialDelaySeconds: 2
+          # As a startup probe, this stops running and so we can more aggressively probe
+          # even a moderately complex startup - this is a very important workload.
+          periodSeconds: 1
+          httpGet:
+            # vLLM does not start the OpenAI server (and hence make /health available)
+            # until models are loaded. This may not be true for all model servers.
+            path: /health
+            port: http
+            scheme: HTTP
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+          requests:
+            nvidia.com/gpu: 1
+        volumeMounts:
+        - mountPath: /data
+          name: data
+        - mountPath: /dev/shm
+          name: shm
+        - name: adapters
+          mountPath: "/adapters"
+      initContainers:
+      - name: lora-adapter-syncer
+        tty: true
+        stdin: true
+        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
+        restartPolicy: Always
+        imagePullPolicy: Always
+        env:
+        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
+          value: "/config/configmap.yaml"
+        volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
+        - name: config-volume
+          mountPath: /config
+      restartPolicy: Always
+
+      # vLLM allows VLLM_PORT to be specified as an environment variable, but a user might
+      # create a 'vllm' service in their namespace. That auto-injects VLLM_PORT in docker
+      # compatible form as `tcp://:` instead of the numeric value vLLM accepts,
+      # causing CrashLoopBackoff. Set service environment injection off by default.
+      enableServiceLinks: false
+
+      # Generally, the termination grace period needs to last longer than the slowest request
+      # we expect to serve plus any extra time spent waiting for load balancers to take the
+      # model server out of rotation.
+      #
+      # An easy starting point is the p99 or max request latency measured for your workload,
+      # although LLM request latencies vary significantly if clients send longer inputs or
+      # trigger longer outputs. Since steady state p99 will be higher than the latency
+      # to drain a server, you may wish to slightly lower this value either experimentally or
+      # via the calculation below.
+      #
+      # For most models you can derive an upper bound for the maximum drain latency as
+      # follows:
+      #
+      # 1. Identify the maximum context length the model was trained on, or the maximum
+      #    allowed length of output tokens configured on vLLM (llama2-7b was trained to
+      #    4k context length, while llama3-8b was trained to 128k).
+      # 2. Output tokens are the more compute intensive to calculate and the accelerator
+      #    will have a maximum concurrency (batch size) - the time per output token at
+      #    maximum batch with no prompt tokens being processed is the slowest an output
+      #    token can be generated (for this model it would be about 100ms TPOT at a max
+      #    batch size around 50)
+      # 3. Calculate the worst case request duration if a request starts immediately
+      #    before the server stops accepting new connections - generally when it receives
+      #    SIGTERM (for this model that is about 4096 / 10 ~ 40s)
+      # 4. If there are any requests generating prompt tokens, they will delay when those
If there are any requests generating prompt tokens, that will delay when those + # output tokens start; prompt token generation is roughly 6x faster than + # compute-bound output token generation, so add 40% to the time from above (40s + + # 16s ~ 55s) + # + # Thus we think it will take us at worst about 55s to complete the longest possible + # request the model is likely to receive at maximum concurrency (highest latency) + # once requests stop being sent. + # + # NOTE: This number will be lower than steady state p99 latency since we stop receiving + # new requests which require continuous prompt token computation. + # NOTE: The max timeout for backend connections from gateway to model servers should + # be configured based on steady state p99 latency, not drain p99 latency. + # + # 5. Add the time the pod takes in its preStop hook to allow the load balancers to + # stop sending us new requests (55s + 30s ~ 85s) + # + # Because termination grace period controls when the Kubelet forcibly terminates a + # stuck or hung process (a possibility due to a GPU crash), there is operational safety + # in keeping the value roughly proportional to the time to finish serving. There is also + # value in adding a bit of extra time to deal with unexpectedly long workloads. + # + # 6. Add a 50% safety buffer to this time since the operational impact should be low + # (85s * 1.5 ~ 130s) + # + # One additional source of drain latency is that some workloads may run close to + # saturation and have queued requests on each server. Since traffic in excess of the + # max sustainable QPS will result in timeouts as the queues grow, we assume that failure + # to drain in time due to excess queues at the time of shutdown is an expected failure + # mode of server overload. If your workload occasionally experiences high queue depths + # due to periodic traffic, consider increasing the safety margin above to account for + # time to drain queued requests. + terminationGracePeriodSeconds: 130 + + volumes: + - name: data + emptyDir: {} + - name: shm + emptyDir: + medium: Memory + - name: adapters + emptyDir: {} + - name: config-volume + configMap: + name: vllm-llama3-8b-instruct-adapters +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: vllm-llama3-8b-instruct-adapters +data: + configmap.yaml: | + vLLMLoRAConfig: + name: vllm-llama3-8b-instruct-adapters + port: 8000 + defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct + ensureExist: + models: + - id: food-review-1 + source: Kawon/llama3.1-food-finetune_v14_r8 diff --git a/config/network-policy/allow-metrics-traffic.yaml b/config/network-policy/allow-metrics-traffic.yaml deleted file mode 100644 index aae53668..00000000 --- a/config/network-policy/allow-metrics-traffic.yaml +++ /dev/null @@ -1,26 +0,0 @@ -# This NetworkPolicy allows ingress traffic -# with Pods running on namespaces labeled with 'metrics: enabled'. Only Pods on those -# namespaces are able to gathering data from the metrics endpoint.
-apiVersion: networking.k8s.io/v1 -kind: NetworkPolicy -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: allow-metrics-traffic - namespace: system -spec: - podSelector: - matchLabels: - control-plane: controller-manager - policyTypes: - - Ingress - ingress: - # This allows ingress traffic from any namespace with the label metrics: enabled - - from: - - namespaceSelector: - matchLabels: - metrics: enabled # Only from namespaces with this label - ports: - - port: 8443 - protocol: TCP diff --git a/config/network-policy/kustomization.yaml b/config/network-policy/kustomization.yaml deleted file mode 100644 index ec0fb5e5..00000000 --- a/config/network-policy/kustomization.yaml +++ /dev/null @@ -1,2 +0,0 @@ -resources: -- allow-metrics-traffic.yaml diff --git a/config/prometheus/kustomization.yaml b/config/prometheus/kustomization.yaml deleted file mode 100644 index ed137168..00000000 --- a/config/prometheus/kustomization.yaml +++ /dev/null @@ -1,2 +0,0 @@ -resources: -- monitor.yaml diff --git a/config/prometheus/monitor.yaml b/config/prometheus/monitor.yaml deleted file mode 100644 index aac24ef3..00000000 --- a/config/prometheus/monitor.yaml +++ /dev/null @@ -1,30 +0,0 @@ -# Prometheus Monitor Service (Metrics) -apiVersion: monitoring.coreos.com/v1 -kind: ServiceMonitor -metadata: - labels: - control-plane: controller-manager - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: controller-manager-metrics-monitor - namespace: system -spec: - endpoints: - - path: /metrics - port: https # Ensure this is the name of the port that exposes HTTPS metrics - scheme: https - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token - tlsConfig: - # TODO(user): The option insecureSkipVerify: true is not recommended for production since it disables - # certificate verification. This poses a significant security risk by making the system vulnerable to - # man-in-the-middle attacks, where an attacker could intercept and manipulate the communication between - # Prometheus and the monitored services. This could lead to unauthorized access to sensitive metrics data, - # compromising the integrity and confidentiality of the information. - # Please use the following options for secure configurations: - # caFile: /etc/metrics-certs/ca.crt - # certFile: /etc/metrics-certs/tls.crt - # keyFile: /etc/metrics-certs/tls.key - insecureSkipVerify: true - selector: - matchLabels: - control-plane: controller-manager diff --git a/config/rbac/inferencemodel_editor_role.yaml b/config/rbac/inferencemodel_editor_role.yaml deleted file mode 100644 index b175a9a3..00000000 --- a/config/rbac/inferencemodel_editor_role.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# permissions for end users to edit inferencemodels. -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencemodel-editor-role -rules: -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencemodels - verbs: - - create - - delete - - get - - list - - patch - - update - - watch -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencemodels/status - verbs: - - get diff --git a/config/rbac/inferencemodel_viewer_role.yaml b/config/rbac/inferencemodel_viewer_role.yaml deleted file mode 100644 index 3b3e67f6..00000000 --- a/config/rbac/inferencemodel_viewer_role.yaml +++ /dev/null @@ -1,23 +0,0 @@ -# permissions for end users to view inferencemodels. 
-apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencemodel-viewer-role -rules: -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencemodels - verbs: - - get - - list - - watch -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencemodels/status - verbs: - - get diff --git a/config/rbac/inferencepool_editor_role.yaml b/config/rbac/inferencepool_editor_role.yaml deleted file mode 100644 index cc1f7c35..00000000 --- a/config/rbac/inferencepool_editor_role.yaml +++ /dev/null @@ -1,27 +0,0 @@ -# permissions for end users to edit inferencepools. -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencepool-editor-role -rules: -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencepools - verbs: - - create - - delete - - get - - list - - patch - - update - - watch -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencepools/status - verbs: - - get diff --git a/config/rbac/inferencepool_viewer_role.yaml b/config/rbac/inferencepool_viewer_role.yaml deleted file mode 100644 index 828e0022..00000000 --- a/config/rbac/inferencepool_viewer_role.yaml +++ /dev/null @@ -1,23 +0,0 @@ -# permissions for end users to view inferencepools. -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencepool-viewer-role -rules: -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencepools - verbs: - - get - - list - - watch -- apiGroups: - - inference.networking.x-k8s.io - resources: - - inferencepools/status - verbs: - - get diff --git a/config/rbac/kustomization.yaml b/config/rbac/kustomization.yaml deleted file mode 100644 index c3a52137..00000000 --- a/config/rbac/kustomization.yaml +++ /dev/null @@ -1,29 +0,0 @@ -resources: -# All RBAC will be applied under this service account in -# the deployment namespace. You may comment out this resource -# if your manager will use a service account that exists at -# runtime. Be sure to update RoleBinding and ClusterRoleBinding -# subjects if changing service account names. -- service_account.yaml -- role.yaml -- role_binding.yaml -- leader_election_role.yaml -- leader_election_role_binding.yaml -# The following RBAC configurations are used to protect -# the metrics endpoint with authn/authz. These configurations -# ensure that only authorized users and service accounts -# can access the metrics endpoint. Comment the following -# permissions if you want to disable this protection. -# More info: https://book.kubebuilder.io/reference/metrics.html -- metrics_auth_role.yaml -- metrics_auth_role_binding.yaml -- metrics_reader_role.yaml -# For each CRD, "Editor" and "Viewer" roles are scaffolded by -# default, aiding admins in cluster management. Those roles are -# not used by the Project itself. You can comment the following lines -# if you do not want those helpers be installed with your Project. 
-- inferencemodel_editor_role.yaml -- inferencemodel_viewer_role.yaml -- inferencepool_editor_role.yaml -- inferencepool_viewer_role.yaml - diff --git a/config/rbac/leader_election_role.yaml b/config/rbac/leader_election_role.yaml deleted file mode 100644 index e2f8551b..00000000 --- a/config/rbac/leader_election_role.yaml +++ /dev/null @@ -1,40 +0,0 @@ -# permissions to do leader election. -apiVersion: rbac.authorization.k8s.io/v1 -kind: Role -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: leader-election-role -rules: -- apiGroups: - - "" - resources: - - configmaps - verbs: - - get - - list - - watch - - create - - update - - patch - - delete -- apiGroups: - - coordination.k8s.io - resources: - - leases - verbs: - - get - - list - - watch - - create - - update - - patch - - delete -- apiGroups: - - "" - resources: - - events - verbs: - - create - - patch diff --git a/config/rbac/leader_election_role_binding.yaml b/config/rbac/leader_election_role_binding.yaml deleted file mode 100644 index fb71a122..00000000 --- a/config/rbac/leader_election_role_binding.yaml +++ /dev/null @@ -1,15 +0,0 @@ -apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: leader-election-rolebinding -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: Role - name: leader-election-role -subjects: -- kind: ServiceAccount - name: controller-manager - namespace: system diff --git a/config/rbac/metrics_auth_role.yaml b/config/rbac/metrics_auth_role.yaml deleted file mode 100644 index 32d2e4ec..00000000 --- a/config/rbac/metrics_auth_role.yaml +++ /dev/null @@ -1,17 +0,0 @@ -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: metrics-auth-role -rules: -- apiGroups: - - authentication.k8s.io - resources: - - tokenreviews - verbs: - - create -- apiGroups: - - authorization.k8s.io - resources: - - subjectaccessreviews - verbs: - - create diff --git a/config/rbac/metrics_auth_role_binding.yaml b/config/rbac/metrics_auth_role_binding.yaml deleted file mode 100644 index e775d67f..00000000 --- a/config/rbac/metrics_auth_role_binding.yaml +++ /dev/null @@ -1,12 +0,0 @@ -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - name: metrics-auth-rolebinding -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: metrics-auth-role -subjects: -- kind: ServiceAccount - name: controller-manager - namespace: system diff --git a/config/rbac/metrics_reader_role.yaml b/config/rbac/metrics_reader_role.yaml deleted file mode 100644 index 51a75db4..00000000 --- a/config/rbac/metrics_reader_role.yaml +++ /dev/null @@ -1,9 +0,0 @@ -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: metrics-reader -rules: -- nonResourceURLs: - - "/metrics" - verbs: - - get diff --git a/config/rbac/role.yaml b/config/rbac/role.yaml deleted file mode 100644 index 9d6247eb..00000000 --- a/config/rbac/role.yaml +++ /dev/null @@ -1,11 +0,0 @@ -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: manager-role -rules: -- apiGroups: [""] - resources: ["pods"] - verbs: ["get", "list", "watch"] diff --git a/config/rbac/role_binding.yaml b/config/rbac/role_binding.yaml deleted file mode 100644 index c66b66bf..00000000 --- a/config/rbac/role_binding.yaml +++ /dev/null @@ -1,15 +0,0 @@ -apiVersion: 
rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: manager-rolebinding -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: ClusterRole - name: manager-role -subjects: -- kind: ServiceAccount - name: controller-manager - namespace: system diff --git a/config/rbac/service_account.yaml b/config/rbac/service_account.yaml deleted file mode 100644 index 9286120f..00000000 --- a/config/rbac/service_account.yaml +++ /dev/null @@ -1,8 +0,0 @@ -apiVersion: v1 -kind: ServiceAccount -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: controller-manager - namespace: system diff --git a/config/samples/gateway_v1alpha1_inferencemodel.yaml b/config/samples/gateway_v1alpha1_inferencemodel.yaml deleted file mode 100644 index f1f46a2f..00000000 --- a/config/samples/gateway_v1alpha1_inferencemodel.yaml +++ /dev/null @@ -1,17 +0,0 @@ -apiVersion: inference.networking.x-k8s.io/v1alpha1 -kind: InferenceModel -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencemodel-sample -spec: - criticality: Critical - modelName: sql-code-assist - poolRef: - name: inferencepool-sample - targetModels: - - name: npc-bot-v1 - weight: 50 - - name: npc-bot-v2 - weight: 50 diff --git a/config/samples/gateway_v1alpha1_inferencepool.yaml b/config/samples/gateway_v1alpha1_inferencepool.yaml deleted file mode 100644 index 42ac6296..00000000 --- a/config/samples/gateway_v1alpha1_inferencepool.yaml +++ /dev/null @@ -1,11 +0,0 @@ -apiVersion: inference.networking.x-k8s.io/v1alpha1 -kind: InferencePool -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencepool-sample -spec: - selector: - app: npc-bot - targetPortNumber: 8000 diff --git a/config/samples/kustomization.yaml b/config/samples/kustomization.yaml deleted file mode 100644 index e4b9f2e8..00000000 --- a/config/samples/kustomization.yaml +++ /dev/null @@ -1,5 +0,0 @@ -## Append samples of your project ## -resources: -- gateway_v1alpha1_inferencepool.yaml -- gateway_v1alpha1_inferencemodel.yaml -# +kubebuilder:scaffold:manifestskustomizesamples diff --git a/conformance/conformance.go b/conformance/conformance.go new file mode 100644 index 00000000..20d80fde --- /dev/null +++ b/conformance/conformance.go @@ -0,0 +1,230 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package conformance contains the core setup and execution logic +// for the Gateway API Inference Extension conformance test suite. 
+package conformance + +import ( + "fmt" + "io/fs" + "os" + "testing" + + "github.com/stretchr/testify/require" + apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1" + clientset "k8s.io/client-go/kubernetes" + + // Import runtime package for scheme creation + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/util/sets" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/client/config" + "sigs.k8s.io/yaml" + + // Import necessary types and utilities from the core Gateway API conformance suite. + // Assumes sigs.k8s.io/gateway-api is a dependency in the go.mod. + gatewayv1 "sigs.k8s.io/gateway-api/apis/v1" // Import core Gateway API types + confapis "sigs.k8s.io/gateway-api/conformance/apis/v1" // Report struct definition + confconfig "sigs.k8s.io/gateway-api/conformance/utils/config" + confflags "sigs.k8s.io/gateway-api/conformance/utils/flags" + confsuite "sigs.k8s.io/gateway-api/conformance/utils/suite" + "sigs.k8s.io/gateway-api/pkg/features" // Using core features definitions if applicable + + // Import the test definitions package to access the ConformanceTests slice + "sigs.k8s.io/gateway-api-inference-extension/conformance/tests" + + // Import test packages using blank identifier + // This triggers the init() functions in these packages, which register the tests + // by appending them to the tests.ConformanceTests slice. + _ "sigs.k8s.io/gateway-api-inference-extension/conformance/tests/basic" + // TODO: Add blank imports for other test categories as they are created. + // _ "sigs.k8s.io/gateway-api-inference-extension/conformance/tests/model_routing" + + // Import the Inference Extension API types + inferencev1alpha2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" +) + +// GatewayLayerProfileName defines the name for the conformance profile that tests +// the Gateway API layer aspects of the Inference Extension (e.g., InferencePool, InferenceModel CRDs). +// Future profiles will cover EPP and ModelServer layers. +const GatewayLayerProfileName confsuite.ConformanceProfileName = "Gateway" + +var InferenceCoreFeatures = sets.New[features.FeatureName]() // Placeholder - Populate with actual features specific to this profile or manage features per profile + +// GatewayLayerProfile defines the conformance profile for the Gateway API layer +// of the Inference Extension. +// In future iterations, we will add constants and ConformanceProfile structs for +// EPPProfileName ("EPP") and ModelServerProfileName ("ModelServer") +// to cover their respective conformance layers. +var GatewayLayerProfile = confsuite.ConformanceProfile{ + Name: GatewayLayerProfileName, + CoreFeatures: InferenceCoreFeatures, +} + +// DefaultOptions parses command line flags and sets up the suite options. +// Adapted from the core Gateway API conformance suite. +func DefaultOptions(t *testing.T) confsuite.ConformanceOptions { + t.Helper() + + cfg, err := config.GetConfig() + require.NoError(t, err, "error loading Kubernetes config") + + // Initialize client options. The scheme must include Gateway API types + // and the Inference Extension types. + clientOptions := client.Options{} + scheme := clientOptions.Scheme + if scheme == nil { + // If default options don't provide a scheme, create one using runtime.NewScheme(). 
+ scheme = runtime.NewScheme() + clientOptions.Scheme = scheme + } + + // Register necessary API Types + require.NoError(t, gatewayv1.Install(scheme)) // Add core Gateway API types + // Add the Inference Extension API types to the scheme using the correct import alias + require.NoError(t, inferencev1alpha2.Install(scheme)) + require.NoError(t, apiextensionsv1.AddToScheme(scheme)) // Needed for CRD checks + + // Create the Kubernetes clients + c, err := client.New(cfg, clientOptions) + require.NoError(t, err, "error initializing Kubernetes client") + cs, err := clientset.NewForConfig(cfg) + require.NoError(t, err, "error initializing Kubernetes clientset") + + exemptFeatures := confsuite.ParseSupportedFeatures(*confflags.ExemptFeatures) + skipTests := confsuite.ParseSkipTests(*confflags.SkipTests) + // Initially, run the GatewayLayerProfile. This will expand as other profiles + // (EPP, ModelServer) are added and can be selected via flags in future iterations. + conformanceProfiles := sets.New(GatewayLayerProfileName) + + // Implementation details from flags + implementation := confsuite.ParseImplementation( + *confflags.ImplementationOrganization, + *confflags.ImplementationProject, + *confflags.ImplementationURL, + *confflags.ImplementationVersion, + *confflags.ImplementationContact, + ) + + // Inference Extension Specific Report Fields + inferenceExtensionVersion := "v0.3.0" + _ = inferenceExtensionVersion // Avoid unused variable error until implemented + + // Create ConformanceOptions + opts := confsuite.ConformanceOptions{ + Client: c, + Clientset: cs, + RestConfig: cfg, + GatewayClassName: *confflags.GatewayClassName, + Debug: *confflags.ShowDebug, + CleanupBaseResources: *confflags.CleanupBaseResources, + SupportedFeatures: sets.New[features.FeatureName](), // Initialize empty, will be populated below + TimeoutConfig: confconfig.DefaultTimeoutConfig(), + SkipTests: skipTests, + ExemptFeatures: exemptFeatures, + RunTest: *confflags.RunTest, + Mode: *confflags.Mode, + Implementation: implementation, + ConformanceProfiles: conformanceProfiles, + ManifestFS: []fs.FS{&Manifests}, // Assumes embed.go defines `Manifests` + ReportOutputPath: *confflags.ReportOutput, + SkipProvisionalTests: *confflags.SkipProvisionalTests, + // TODO: Add the inference extension specific fields to ConformanceOptions struct if needed, + // or handle them during report generation. + // GatewayAPIInferenceExtensionChannel: inferenceExtensionChannel, + // GatewayAPIInferenceExtensionVersion: inferenceExtensionVersion, + } + + // Populate SupportedFeatures based on the GatewayLayerProfile. + // Since all features are mandatory for this profile, add all defined core features. + if opts.ConformanceProfiles.Has(GatewayLayerProfileName) { + for feature := range GatewayLayerProfile.CoreFeatures { + opts.SupportedFeatures.Insert(feature) + } + } + + // Remove any features explicitly exempted via flags. + for feature := range opts.ExemptFeatures { + opts.SupportedFeatures.Delete(feature) + } + + return opts +} + +// RunConformance runs the Inference Extension conformance tests using default options. +func RunConformance(t *testing.T) { + RunConformanceWithOptions(t, DefaultOptions(t)) +} + +// RunConformanceWithOptions runs the Inference Extension conformance tests with specific options. 
+func RunConformanceWithOptions(t *testing.T, opts confsuite.ConformanceOptions) { + t.Logf("Running Inference Extension conformance tests with GatewayClass %s", opts.GatewayClassName) + + // Register the GatewayLayerProfile with the suite runner. + // In the future, other profiles (EPP, ModelServer) will also be registered here, + // and the suite runner will execute tests based on the selected profiles. + confsuite.RegisterConformanceProfile(GatewayLayerProfile) + + // Initialize the test suite. + cSuite, err := confsuite.NewConformanceTestSuite(opts) + require.NoError(t, err, "error initializing conformance suite") + + t.Log("Setting up Inference Extension conformance tests") + // Setup requires the list of tests, which is populated by the init() functions + // triggered by the blank imports at the top of this file. + cSuite.Setup(t, tests.ConformanceTests) + + t.Log("Running Inference Extension conformance tests") + // Run the tests. + err = cSuite.Run(t, tests.ConformanceTests) + require.NoError(t, err, "error running conformance tests") + + // Generate and write the report if requested. + if opts.ReportOutputPath != "" { + t.Log("Generating Inference Extension conformance report") + report, err := cSuite.Report() // Use the existing report generation logic. + require.NoError(t, err, "error generating conformance report") + + // TODO: Modify the report struct here if channel, version need to be modified. + // Example (requires adding fields to confapis.ConformanceReport): + // report.GatewayAPIInferenceExtensionChannel = opts.GatewayAPIInferenceExtensionChannel + // report.GatewayAPIInferenceExtensionVersion = opts.GatewayAPIInferenceExtensionVersion + + err = writeReport(t.Logf, *report, opts.ReportOutputPath) + require.NoError(t, err, "error writing conformance report") + } +} + +// writeReport writes the generated conformance report to the specified output file or logs it. +// Adapted from the core Gateway API suite. +func writeReport(logf func(string, ...any), report confapis.ConformanceReport, output string) error { + rawReport, err := yaml.Marshal(report) + if err != nil { + return fmt.Errorf("error marshaling report: %w", err) + } + + if output != "" { + if err = os.WriteFile(output, rawReport, 0o600); err != nil { + return fmt.Errorf("error writing report file %s: %w", output, err) + } + logf("Conformance report written to %s", output) + } else { + // Log the report YAML to stdout if no output file is specified. + logf("Conformance report:\n%s", string(rawReport)) + } + return nil +} diff --git a/conformance/conformance_test.go b/conformance/conformance_test.go new file mode 100644 index 00000000..de82d5ec --- /dev/null +++ b/conformance/conformance_test.go @@ -0,0 +1,29 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package conformance + +import ( + "testing" +) + +// TestConformance is the top-level function that runs the conformance tests. +// It calls the RunConformance function which sets up the suite and executes +// the registered tests. 
+func TestConformance(t *testing.T) { + // RunConformance is defined in conformance.go + RunConformance(t) +} diff --git a/conformance/embed.go b/conformance/embed.go new file mode 100644 index 00000000..f7fa64c9 --- /dev/null +++ b/conformance/embed.go @@ -0,0 +1,25 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package conformance + +import "embed" + +// Manifests embeds the contents of the conformance/resources directory making +// the YAML files within them available to the test suite at runtime. +// +//go:embed resources/* tests/* +var Manifests embed.FS diff --git a/conformance/reports/README.md b/conformance/reports/README.md new file mode 100644 index 00000000..81652b1c --- /dev/null +++ b/conformance/reports/README.md @@ -0,0 +1,93 @@ +# Conformance Reports for Gateway API Inference Extension + +This directory stores conformance reports submitted by various implementations of the Gateway API Inference Extension. This structure closely follows the [kubernetes-sigs/gateway-api/conformance/reports](https://github.com/kubernetes-sigs/gateway-api/blob/main/conformance/reports/README.md). + +## How this folder is structured + +This folder stores conformance reports organized first by the version of the Gateway API Inference Extension specification they were tested against, and then by the specific conformance profile (e.g., Gateway, EPP, Model Server): + +|-- conformance/reports +| |-- v0.3.0 # Example extension version +| | |-- gateway # Conformance profile/category +| | | |-- my-inference-gateway +| | | | |-- README.md +| | | | |-- experimental-v1.2.3-default-gateway-report.yaml # Example report file +| | | |-- another-implementation +| | | | |-- README.md +| | | | |-- ... +| | |-- epp # Future conformance profile/category +| | | |-- my-epp-implementation +| | | | |-- ... +| | |-- model-server # Future conformance profile/category +| | | |-- ... +| |-- v0.4.0 # Future extension version +| | |-- ... + +## Implementation Submissions + +Each implementation conformant with a specific profile of a specific version of the Gateway API Inference Extension should have its own folder within the corresponding version and profile directory (e.g., `/conformance/reports/v0.3.0/Gateway/my-implementation/`). + +The implementation is the owner of its folder and is responsible for: + +1. Uploading one or more conformance reports (YAML files). +2. Maintaining a mandatory `README.md` file within their folder, structured as follows: + + # My Inference Gateway Implementation (Gateway Profile Conformance) + + General information about the My/Implementation project. + + ## Table of Contents + +| Extension Version Tested | Profile Tested | Implementation Version | Mode | Report | +|--------------------------|----------------|------------------------|---------|----------------------------------------------------------------------------| +| v0.3.0 | Gateway | v1.2.3 | default | [v1.2.3 Gateway report](./experimental-v1.2.3-default-gateway-report.yaml) | +| ... | ... | ... | ... 
| ... | + + ## Reproduce + + Instructions on how to reproduce the claimed report(s). + +### Table of Contents (within Implementation README) + +The table of contents within an implementation's `README.md` should contain one row for each submitted report and include the following columns: + +* **Extension Version Tested**: The version of the Gateway API Inference Extension specification tested against (e.g., `v0.3.0`). Must correspond to the `gatewayAPIInferenceExtensionVersion` field in the report. +* **Profile Tested**: The specific conformance profile tested (e.g., `Gateway`, `EPP`, `ModelServer`). Must correspond to the `name` of the profile in the `profiles` list within the report. +* **Implementation Version**: A link to the GitHub/website page for the specific release/commit of the implementation tested. The version value MUST correspond to the `implementation.version` field in the report. +* **Mode**: The operating mode of the implementation used for the test run (default is `default`). Must correspond to the `mode` field in the report. If a mode other than `default` is used, the "Reproduce" section must explain how to configure it. +* **Report**: A link to the corresponding report YAML file. Reports MUST be named according to the pattern: `<channel>-<implementation version>-<mode>-<profile>-report.yaml` (e.g., `experimental-v1.2.3-default-gateway-report.yaml`). + +### Reproduce Section (within Implementation README) + +This section MUST exist and contain the manual or automatic steps required to reproduce the results claimed by the uploaded conformance reports for that specific implementation. If reproduction steps differ significantly between implementation versions, use sub-sections. + +## Report Files + +Conformance reports MUST be uploaded exactly as generated by the official Gateway API Inference Extension conformance test suite, without any modifications. The "Reproduce" section allows for verification of the submitted report against a fresh run. + +### Report Rules + +To be accepted, submitted conformance reports must comply with the following rules: + +1. **Implementation Details:** All fields within the `implementation` block must have meaningful values: + * `organization`: The entity maintaining the implementation (company, open source org, individual). + * `project`: The name of the implementation project, unique within the organization. + * `url`: A valid URL for the project (e.g., GitHub repository, product page). + * `version`: A specific, reproducible snapshot of the implementation (e.g., tag, commit hash, release version). Branch names are not acceptable. + * `contact`: A list of contact points (GitHub handles like `@maintainer`, team handles like `@org/team`, email addresses, or support URLs like an issue tracker). +2. **Inference Extension Versioning:** The report MUST include: + * `gatewayAPIInferenceExtensionVersion`: The specific version of the Gateway API Inference Extension specification tested against (e.g., `v0.3.0`). +3. **Mode:** The `mode` field indicates the implementation's operating mode during the test run. +4. **Test Profile & Result:** + * The report MUST contain exactly one profile result under the `profiles` list for the specific conformance category being submitted (e.g., a report for "Gateway" conformance should only contain the "Gateway" profile result). + * The profile's `name` MUST match the conformance category (e.g., `Gateway`, `EPP`, `ModelServer`). + * The profile's `result` field MUST be `success`.
A `success` result indicates that **all** tests defined within the Gateway API Inference Extension conformance suite for that specific profile and version passed. + +## Submission Process + +Conformance reports demonstrating a `success` result for a specific profile (e.g., `Gateway`) should be submitted via Pull Request directly to this repository (`kubernetes-sigs/gateway-api-inference-extension`). + +1. Create a new folder structure under `/conformance/reports/<version>/<profile>/` named after your implementation (e.g., `/conformance/reports/v0.3.0/Gateway/my-implementation/`). +2. Add your implementation's `README.md` to this folder, following the structure described above. +3. Add your generated conformance report YAML file(s) to this folder, ensuring they follow the naming convention `<channel>-<implementation version>-<mode>-<profile>-report.yaml`. +4. Submit the Pull Request. diff --git a/conformance/resources/manifests/manifests.yaml b/conformance/resources/manifests/manifests.yaml new file mode 100644 index 00000000..7b43b784 --- /dev/null +++ b/conformance/resources/manifests/manifests.yaml @@ -0,0 +1,49 @@ +# Base Kubernetes resources for the Gateway API Inference Extension conformance tests. +# This includes namespaces and a minimal set of resources (Gateway, Backend) +# required by many tests. More specific resources should be defined within +# individual test files or other resource directories (e.g., sample_backends). + +--- +# Namespace for core infrastructure like Gateways. +apiVersion: v1 +kind: Namespace +metadata: + name: gateway-conformance-infra + labels: + gateway-conformance: infra + +--- +# Namespace for application backends (potentially simulating model servers +# or where InferencePools might reside in some tests). +apiVersion: v1 +kind: Namespace +metadata: + name: gateway-conformance-app-backend + labels: + gateway-conformance: backend + +--- +# A basic Gateway resource that allows HTTPRoutes from the same namespace. +# Tests can use this as a parent reference for routes that target InferencePools. +# Using a simple echo server instead of an actual model server simplifies the test +# execution; this design may need to be revised based on the test case needs. +apiVersion: gateway.networking.k8s.io/v1 # Using v1 as per latest Gateway API standard +kind: Gateway +metadata: + name: same-namespace + namespace: gateway-conformance-infra +spec: + # The conformance suite runner will replace this placeholder + # with the actual GatewayClass name provided via flags. + gatewayClassName: "{GATEWAY_CLASS_NAME}" + listeners: + - name: http # Standard listener name + port: 80 + protocol: HTTP + allowedRoutes: + namespaces: + from: Same # Restrict to same namespace initially for simplicity + kinds: + # Allows HTTPRoutes to attach, which can then reference InferencePools. + - group: gateway.networking.k8s.io + kind: HTTPRoute diff --git a/conformance/tests/basic/inferencepool_accepted.go b/conformance/tests/basic/inferencepool_accepted.go new file mode 100644 index 00000000..eae59404 --- /dev/null +++ b/conformance/tests/basic/inferencepool_accepted.go @@ -0,0 +1,60 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and +limitations under the License. +*/ + +package basic + +import ( + "testing" + + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" + gatewayv1 "sigs.k8s.io/gateway-api/apis/v1" // For standard condition types + "sigs.k8s.io/gateway-api/conformance/utils/suite" + "sigs.k8s.io/gateway-api/pkg/features" // For standard feature names + + // Import the tests package to append to ConformanceTests + "sigs.k8s.io/gateway-api-inference-extension/conformance/tests" + infrakubernetes "sigs.k8s.io/gateway-api-inference-extension/conformance/utils/kubernetes" +) + +func init() { + // Register the InferencePoolAccepted test case with the conformance suite. + // This ensures it will be discovered and run by the test runner. + tests.ConformanceTests = append(tests.ConformanceTests, InferencePoolAccepted) +} + +// InferencePoolAccepted defines the test case for verifying basic InferencePool acceptance. +var InferencePoolAccepted = suite.ConformanceTest{ + ShortName: "InferencePoolAccepted", + Description: "A minimal InferencePool resource should be accepted by the controller and report an Accepted condition", + Manifests: []string{"tests/basic/inferencepool_accepted.yaml"}, + Features: []features.FeatureName{}, + Test: func(t *testing.T, s *suite.ConformanceTestSuite) { + // The namespaced name of the InferencePool created by the associated manifest file. + poolNN := types.NamespacedName{Name: "inferencepool-basic-accepted", Namespace: "gateway-conformance-app-backend"} + + t.Run("InferencePool should have Accepted condition set to True", func(t *testing.T) { + // Define the expected status condition. We use the standard "Accepted" + // condition type from the Gateway API for consistency. + acceptedCondition := metav1.Condition{ + Type: string(gatewayv1.GatewayConditionAccepted), // Standard condition type + Status: metav1.ConditionTrue, + Reason: "", // "" means we don't strictly check the Reason for this basic test. + } + infrakubernetes.InferencePoolMustHaveCondition(t, s.Client, s.TimeoutConfig, poolNN, acceptedCondition) + }) + }, +} diff --git a/conformance/tests/basic/inferencepool_accepted.yaml b/conformance/tests/basic/inferencepool_accepted.yaml new file mode 100644 index 00000000..8ae327d8 --- /dev/null +++ b/conformance/tests/basic/inferencepool_accepted.yaml @@ -0,0 +1,27 @@ +# Basic InferencePool for acceptance testing. +# This manifest defines the minimal required fields to create a valid +# InferencePool resource, which the InferencePoolAccepted test will use +# to verify that the controller recognizes and accepts the resource. + +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + # This name must match the 'poolNN' variable defined in the + # conformance/tests/basic/inferencepool_accepted.go test file. + name: inferencepool-basic-accepted + # This namespace should be one created by the base manifests. + namespace: gateway-conformance-app-backend +spec: + # --- Selector (Required) --- + # Selects the Pods belonging to this pool. + selector: + app: "infra-backend-v1" + + # --- Target Port (Required) --- + # The port the model server container listens on. + targetPortNumber: 3000 + + # --- Extension Reference --- + # References the endpoint picker (EPP) service for this pool.
+ extensionRef: + name: infra-backend-v1-epp diff --git a/conformance/tests/main.go b/conformance/tests/main.go new file mode 100644 index 00000000..fc66c765 --- /dev/null +++ b/conformance/tests/main.go @@ -0,0 +1,35 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package tests is the root package for all Gateway API Inference Extension +// conformance test implementations. +package tests + +import ( + // Importing the suite package to access the ConformanceTest struct definition. + // For initial version directly importing from the core gateway-api repo. + // This may be adjusted in the future if we have need to create a copy of + // the suite utilities. + "sigs.k8s.io/gateway-api/conformance/utils/suite" + // Do NOT add blank imports for specific test packages here. + // They should be added to the main conformance package instead + // to avoid import cycles. +) + +// ConformanceTests holds all the conformance tests definitions for the +// Gateway API Inference Extension suite. Tests are registered from other packages +// using init() functions like the one in the basic package. +var ConformanceTests []suite.ConformanceTest diff --git a/conformance/utils/assertions.go b/conformance/utils/assertions.go new file mode 100644 index 00000000..c77d0fc5 --- /dev/null +++ b/conformance/utils/assertions.go @@ -0,0 +1,25 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package assertions contains custom assertion helper functions used within +// the Gateway API Inference Extension conformance test suite. +package assertions + +// TODO: Implement custom assertion functions specific to Inference Extension testing. +// Examples might include: +// - Asserting specific fields or structures within an inference API response body. +// - Asserting specific metrics reported by mock model servers or EPPs. +// - Asserting specific conditions or status fields unique to InferencePool or InferenceModel. diff --git a/conformance/utils/kubernetes/helpers.go b/conformance/utils/kubernetes/helpers.go new file mode 100644 index 00000000..3d517863 --- /dev/null +++ b/conformance/utils/kubernetes/helpers.go @@ -0,0 +1,49 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package kubernetes contains helper functions for interacting with +// Kubernetes objects within the conformance test suite. +package kubernetes + +import ( + "testing" + + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" + "sigs.k8s.io/controller-runtime/pkg/client" + + // Import necessary utilities from the core Gateway API conformance suite + "sigs.k8s.io/gateway-api/conformance/utils/config" +) + +// InferencePoolMustHaveCondition waits for the specified InferencePool resource +// to exist and report the expected status condition. +// This is a placeholder and needs full implementation. +// +// TODO: Implement the actual logic for this helper function. +// It should fetch the InferencePool using the provided client and check its +// Status.Conditions field, polling until the condition is met or a timeout occurs, +// similar to HTTPRouteMustHaveCondition in the core suite. +func InferencePoolMustHaveCondition(t *testing.T, c client.Client, timeoutConfig config.TimeoutConfig, poolNN types.NamespacedName, expectedCondition metav1.Condition) { + t.Helper() // Marks this function as a test helper + + // Placeholder implementation: Log and skip the check. + t.Logf("Verification for InferencePool condition (%s=%s) on %s - Placeholder: Skipping check.", + expectedCondition.Type, expectedCondition.Status, poolNN.String()) + + // Skip the test using this helper until it's fully implemented. + t.Skip("InferencePoolMustHaveCondition helper not yet implemented") +} diff --git a/conformance/utils/traffic/traffic.go b/conformance/utils/traffic/traffic.go new file mode 100644 index 00000000..4f13f980 --- /dev/null +++ b/conformance/utils/traffic/traffic.go @@ -0,0 +1,22 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package traffic contains helper functions specifically for generating, +// sending, and validating network traffic related to inference workloads +// within the Gateway API Inference Extension conformance tests. +package traffic + +// TODO: Add helpers for specific inference protocols or request patterns as needed. diff --git a/docs/dev.md b/docs/dev.md index efd2023a..d223ed6a 100644 --- a/docs/dev.md +++ b/docs/dev.md @@ -1,27 +1,33 @@ - ## Logging +We use the `logr.Logger` interface for logging everywhere. +The logger instance is loaded from `context.Context` or passed around as an argument directly. +This is aligned with contextual logging as explained in [k8s instrumentation logging guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md).
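+For illustration, here is a minimal sketch of loading the logger from a `context.Context` inside a reconciler, assuming the `controller-runtime` log helpers; the reconciler type and log message are hypothetical, not code from this repo:
+
+```go
+package example
+
+import (
+	"context"
+
+	ctrl "sigs.k8s.io/controller-runtime"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+)
+
+type modelReconciler struct{}
+
+// Reconcile loads the logr.Logger carried by the request context instead of
+// calling klog global logging functions.
+func (r *modelReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
+	logger := log.FromContext(ctx)
+	logger.V(1).Info("reconciling", "request", req.NamespacedName)
+	return ctrl.Result{}, nil
+}
+```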
+ +In other words, we explicitly don't use `klog` global logging calls. +Using `klog` log value helpers like `klog.KObj` is just fine. + ### Change log verbosity -We use the `k8s.io/klog/v2` package to manage logging. We generally follow the [k8s instrumentation logging guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md), which states "the practical default level is V(2). Developers and QE environments may wish to run at V(3) or V(4)". -To configure logging verbosity, specify the `v` flag such as `--v=2`. +To configure logging verbosity, specify the `v` flag such as `--v=2`. ### Add logs The [k8s instrumentation logging guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md) has the following definitions: -* `klog.V(0).InfoS` = `klog.InfoS` - Generally useful for this to **always** be visible to a cluster operator -* `klog.V(1).InfoS` - A reasonable default log level if you don't want verbosity. -* `klog.V(2).InfoS` - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems. -* `klog.V(3).InfoS` - Extended information about changes -* `klog.V(4).InfoS` - Debug level verbosity -* `klog.V(5).InfoS` - Trace level verbosity +- `logger.V(0).Info` = `logger.Info` - Generally useful for this to **always** be visible to a cluster operator +- `logger.V(1).Info` - A reasonable default log level if you don't want verbosity. +- `logger.V(2).Info` - Useful steady state information about the service and important log messages that may correlate to significant changes in the system. This is the recommended default log level for most systems. +- `logger.V(3).Info` - Extended information about changes +- `logger.V(4).Info` - Debug level verbosity +- `logger.V(5).Info` - Trace level verbosity We choose to simplify to the following 3 common levels. + ``` const( DEFAULT=2 @@ -31,36 +37,48 @@ const( ) ``` -The guidelines are written in the context of a k8s controller. Our [ext-proc](../pkg/ext-proc/) does more things such as handling requests and scraping metrics, therefore we adapt the guidelines as follows: +The guidelines are written in the context of a k8s controller. Our [epp](../pkg/epp/) does more things such as handling requests and scraping metrics, therefore we adapt the guidelines as follows: + +1. The server startup process and configuration. -1. The server startup process and configuration. - * `klog.InfoS` Logging at the `V(0)` verbosity level is generally welcome here as this is only logged once at startup, and provides useful info for debugging. + - `logger.Info` Logging at the `V(0)` verbosity level is generally welcome here as this is only logged once at startup, and provides useful info for debugging. 2. Reconciler loops. The reconciler loops watch for CR changes such as the `InferenceModel` CR. And given changes in these CRs significantly affect the behavior of the extension, we recommend using v=1 verbosity level as default, and sparsely use higher verbosity levels. - - * `klog.V(DEFAULT).InfoS` - * Default log level in the reconcilers. 
- * Information about config (listening on X, watching Y) - * Errors that repeat frequently that relate to conditions that can be corrected (e.g., inference model not initialized yet) - * System state changing (adding/removing objects in the data store) - * `V(VERBOSE)` and above: Use your best judgement. + + - `logger.V(DEFAULT)` + - Default log level in the reconcilers. + - Information about config (listening on X, watching Y) + - Errors that repeat frequently that relate to conditions that can be corrected (e.g., inference model not initialized yet) + - System state changing (adding/removing objects in the data store) + - `logger.V(VERBOSE)` and above: Use your best judgement. 3. Inference request handling. These requests are expected to be much higher volume than the control flow in the reconcilers and therefore we should be mindful of log spamming. We recommend using v=2 to log important info about a request, such as the HTTP response code, and higher verbosity levels for less important info. - * `klog.V(DEFAULT).InfoS` - * Logging the status code of an HTTP request - * Important decision making such as picking the target model, target pod - * `klog.V(VERBOSE).InfoS` - * Detailed request scheduling algorithm operations, such as running the filtering logic - * `V(DEBUG)` and above: Use your best judgement. + - `logger.V(DEFAULT)` + - Logging the status code of an HTTP request + - Important decision making such as picking the target model, target pod + - `logger.V(VERBOSE)` + - Detailed request scheduling algorithm operations, such as running the filtering logic + - `logger.V(DEBUG)` and above: Use your best judgement. 4. Metric scraping loops. These loops run at a very high frequency, and logs can be very spammy if not handled properly. - * `klog.V(TRACE).InfoS` - * Transient errors/warnings, such as failure to get response from a pod. - * Important state changes, such as updating a metric. -5. Misc + - `logger.V(TRACE)` + - Transient errors/warnings, such as failure to get response from a pod. + - Important state changes, such as updating a metric. + +5. Misc 1. Periodic (every 5s) debug loop which prints the current pods and metrics. - * `klog.WarningS` If the metrics are not fresh enough, which indicates an error occurred during the metric scraping loop. - * `klog.V(DEBUG).InfoS` - * This is very important to debug the request scheduling algorithm, and yet not spammy compared to the metric scraping loop logs. \ No newline at end of file + - `logger.V(DEFAULT).Error` If the metrics are not fresh enough, which indicates an error occurred during the metric scraping loop. + - `logger.V(DEBUG)` + - This is very important to debug the request scheduling algorithm, and yet not spammy compared to the metric scraping loop logs. + +### Passing Logger Around + +You can pass around a `context.Context` that contains a logger or a `logr.Logger` instance directly. +You need to decide which one to use. Passing a `context.Context` is more standard; +on the other hand, you then need to call `log.FromContext` everywhere. + +As `logger.V` calls are cumulative, i.e. `logger.V(2).V(3)` results in `logger.V(5)`, +a logger should be passed around with no verbosity level set so that `logger.V(DEFAULT)` +actually uses `DEFAULT` verbosity level.
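+
+As a concrete sketch of this cumulative behavior (assuming the `go-logr/logr` API; the helper name is illustrative only):
+
+```go
+package example
+
+import "github.com/go-logr/logr"
+
+const DEFAULT = 2
+
+// handle logs at V(DEFAULT) relative to whatever verbosity the passed-in
+// logger already carries.
+func handle(logger logr.Logger) {
+	logger.V(DEFAULT).Info("handling request")
+}
+
+// If a caller passes logger.V(1) instead of an unleveled logger, the line
+// above is effectively emitted at V(3), one level quieter than intended.
+```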
diff --git a/docs/endpoint-picker.svg b/docs/endpoint-picker.svg new file mode 100644 index 00000000..3ec8eed4 --- /dev/null +++ b/docs/endpoint-picker.svg @@ -0,0 +1,3 @@ +[SVG diagram: "Inference Gateway" - an Endpoint Picker service and Model Server behind an L7 Proxy/Gateway; the InferencePool API selects the model servers (the endpoints) and the endpoint picker service; the InferenceModel API defines the model/adapter to serve and the serving objectives for the model; a Gateway Controller configures the proxy and watches both APIs; observability via metrics scraping and dashboards; legend distinguishes standard Gateway elements from Inference Extension elements] \ No newline at end of file diff --git a/docs/inference-gateway-architecture.svg b/docs/inference-gateway-architecture.svg new file mode 100644 index 00000000..6c887ebe --- /dev/null +++ b/docs/inference-gateway-architecture.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/proposals/003-endpoint-picker-protocol/README.md b/docs/proposals/003-model-server-protocol/README.md similarity index 65% rename from docs/proposals/003-endpoint-picker-protocol/README.md rename to docs/proposals/003-model-server-protocol/README.md index 8e96a630..02efbe5c 100644 --- a/docs/proposals/003-endpoint-picker-protocol/README.md +++ b/docs/proposals/003-model-server-protocol/README.md @@ -1,21 +1,4 @@ -# Endpoint Picker Protocol - -The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's -responsible for picking an endpoint from the `InferencePool`. A reference implementation can be -found [here](../../../pkg/ext-proc/). - -## Proxy Protocol - -This is the protocol between the EPP and the proxy (e.g, Envoy). - -The EPP MUST implement the Envoy -[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor)protocol. - -For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint, via -adding the `x-gateway-destination-endpoint` HTTP header in the request and as an unstructured entry in the [dynamic_metadata](https://github.com/envoyproxy/go-control-plane/blob/c19bf63a811c90bf9e02f8e0dc1dcef94931ebb4/envoy/service/ext_proc/v3/external_processor.pb.go#L320) field of the ext-proc response, or otherwise return an error. The EPP MUST not set two different values in the header and the response metadata. -Setting different value leads to unpredictable behavior because proxies aren't guaranteed to support both paths, and so this protocol does not define what takes precedence. - -## Model Server Protocol +# Model Server Protocol This is the protocol between the EPP and the model servers. @@ -60,7 +43,8 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro * Metric value: The last updated timestamp (so the EPP can find the latest). * Metric labels: * `max_lora`: The maximum number of adapters that can be loaded to GPU memory to serve a batch. - Requests will be queued if the model server has reached MaxActiveAdapter and canno load the + Requests will be queued if the model server has reached MaxActiveAdapter and cannot load the requested adapter. Example: `"max_lora": "8"`. * `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"` + * `waiting_lora_adapters`: A comma separated list of adapters that are waiting to be served.
+      Example: `"waiting_lora_adapters": "adapter1, adapter2"`
diff --git a/docs/proposals/004-endpoint-picker-protocol/README.md b/docs/proposals/004-endpoint-picker-protocol/README.md
new file mode 100644
index 00000000..5280e05c
--- /dev/null
+++ b/docs/proposals/004-endpoint-picker-protocol/README.md
@@ -0,0 +1,65 @@
+# Endpoint Picker Protocol
+
+The Endpoint Picker, or EPP, is a core component of the inference extension. Ultimately it's
+responsible for picking an endpoint from the `InferencePool`. A reference implementation can be
+found [here](../../../pkg/epp/).
+
+This doc defines the protocol between the EPP and the proxy (e.g., Envoy).
+
+The EPP MUST implement the Envoy
+[external processing service](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor) protocol.
+
+## Endpoint Subset
+For each HTTP request, the proxy CAN communicate the subset of endpoints the EPP MUST pick from by setting an unstructured entry in the [filter metadata](https://github.com/envoyproxy/go-control-plane/blob/63a55395d7a39a8d43dcc7acc3d05e4cae7eb7a2/envoy/config/core/v3/base.pb.go#L819) field of the ext-proc request. The metadata entry for the subset list MUST be wrapped with an outer key (which represents the metadata namespace) with a default of `envoy.lb.subset_hint`.
+
+```go
+filterMetadata: {
+  "envoy.lb.subset_hint" {
+      "x-gateway-destination-endpoint-subset": [<ip:port>, <ip:port>, ...]
+  }
+}
+```
+
+If the key `x-gateway-destination-endpoint-subset` is set, the EPP MUST only select endpoints from the specified list. If none of the endpoints in the list is eligible or the list is empty, then the EPP MUST return an [ImmediateResponse](https://github.com/envoyproxy/envoy/blob/f2023ef77bdb4abaf9feef963c9a0c291f55568f/api/envoy/service/ext_proc/v3/external_processor.proto#L195) with a 503 (Service Unavailable) HTTP status code. If the EPP does not select from the list, then this leads to unpredictable behavior.
+
+If the key `x-gateway-destination-endpoint-subset` is not set, then the EPP MUST select from the set defined by the `InferencePool` selector.
+
+## Destination Endpoint
+For each HTTP request, the EPP MUST communicate to the proxy the picked model server endpoint via:
+
+1. Setting the `x-gateway-destination-endpoint` HTTP header to the selected endpoint in `<ip:port>` format.
+
+2. Setting an unstructured entry in the [dynamic_metadata](https://github.com/envoyproxy/go-control-plane/blob/c19bf63a811c90bf9e02f8e0dc1dcef94931ebb4/envoy/service/ext_proc/v3/external_processor.pb.go#L320) field of the ext-proc response. The metadata entry for the picked endpoint MUST be wrapped with an outer key (which represents the metadata namespace) with a default of `envoy.lb`.
+
+The primary endpoint MUST be set using the key `x-gateway-destination-endpoint` as follows:
+```go
+dynamicMetadata: {
+  "envoy.lb": {
+    "x-gateway-destination-endpoint": <ip:port>
+  }
+}
+```
+
+Constraints:
+- If the EPP did not communicate the server endpoint via these two methods, it MUST return an error as follows:
+  - [ImmediateResponse](https://github.com/envoyproxy/envoy/blob/f2023ef77bdb4abaf9feef963c9a0c291f55568f/api/envoy/service/ext_proc/v3/external_processor.proto#L195) with a 503 (Service Unavailable) HTTP status code if there are no ready endpoints.
+  - [ImmediateResponse](https://github.com/envoyproxy/envoy/blob/f2023ef77bdb4abaf9feef963c9a0c291f55568f/api/envoy/service/ext_proc/v3/external_processor.proto#L195) with a 429 (Too Many Requests) HTTP status code if the request should be dropped (e.g., a Sheddable request while the servers are under heavy load).
+- The EPP MUST NOT set two different values in the header and the inner response metadata value.
+- Setting different values leads to unpredictable behavior because proxies aren't guaranteed to support both paths, and so this protocol does not define what takes precedence.
+
+### Destination endpoint fallback
+A single fallback endpoint CAN be set using the key `x-gateway-destination-endpoint-fallback` in the same metadata namespace as the one used for `x-gateway-destination-endpoint` as follows:
+
+```go
+dynamicMetadata: {
+  "envoy.lb" {
+    "x-gateway-destination-endpoint-fallback": <ip:port>
+  }
+}
+```
+
+### Why envoy.lb namespace as a default?
+The `envoy.lb` namespace is a predefined namespace. One common way to use the selected endpoint returned from the server is [envoy subsets](https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/subsets), where host metadata for subset load balancing must be placed under `envoy.lb`. Note that this is not related to the subsetting feature discussed above; this is an envoy implementation detail.
+
+## Matching An InferenceModel
+The model name of a request MUST match the `Spec.ModelName` parameter of one of the `InferenceModels` referencing the `InferencePool` managed by the EPP. Otherwise, the EPP MUST return a 404 status code.
diff --git a/docs/proposals/0683-epp-architecture-proposal/README.md b/docs/proposals/0683-epp-architecture-proposal/README.md
new file mode 100644
index 00000000..48c7720f
--- /dev/null
+++ b/docs/proposals/0683-epp-architecture-proposal/README.md
@@ -0,0 +1,99 @@
+# Gateway API Inference Extension
+
+Author(s): @kfswain
+## Proposal Status
+ ***Draft***
+
+## Table of Contents
+
+<!-- toc -->
+
+- [Summary](#summary)
+- [Goals](#goals)
+- [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Personas](#personas)
+    - [Inference Platform Admin](#inference-platform-admin)
+    - [Inference Workload Owner](#workload-owner)
+  - [Axioms](#axioms)
+  - [InferencePool](#inferencepool)
+  - [InferenceModel](#inferencemodel)
+  - [Spec](#spec)
+  - [Diagrams](#diagrams)
+  - [Alternatives](#alternatives)
+- [Open Questions](#open-questions)
+
+<!-- /toc -->
+
+## Summary
+
+This proposal seeks to standardize the implementation of an EPP (End-point Picker) for the Inference Gateway extension (also known as Gateway API Inference Extension). Additionally, this proposes to restructure the current implementation of the EPP to be more modular and approachable.
+
+## Goals
+
+- Set a standard on how the EPP & APIs interact
+- Settle on common nomenclature for clearer communication
+- Allow for modularization of the EPP, to be extended to a user's specific needs
+
+## Non-Goals
+
+- Reshaping the current API
+- A change in scope of the current project
+
+## Proposal
+
+This proposal is not proposing any net new features; instead, we are refactoring our current implementation to better handle more devs, more features, etc. At the time of writing, GIE is currently at v0.3, and that stronger experimental context (along with external feedback) made clear the need for this restructure. The image below gives a high-level view of how our components work together.
+
+![Scheduling Algorithm](./images/epp_arch.svg)
+
+## Overview
+At a quick glance, the EPP is being broken into specific layers. The `Data Layer` is of note, as it is a vertical that will be accessed by all the others. The data layer manages the k8s data, metric & usage data, as well as processing of the above data to determine resource scarcity regimes.
+
+The other layers are handled in a sequential process, starting with the **Ext-Proc** call. The request is buffered and then sent to the **Routing Layer**, which first processes any user-defined per-InferenceModel routing rules & request enrichment (at the time of writing, that is just translating the InferenceModel name to a weight-split actual model). Then _all_ requests pass through the to-be-implemented [**Flow Controller**](https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/674) to ensure that any request entering the pool adheres to the guidelines set by the Priority, Fairness, & Queueing configuration. And finally, the **Scheduling Layer** is the load balancing algorithm that intelligently routes requests based on the current state of the InferencePool.
+
+## Components
+
+To further expand upon these component layers, we will first break them into `extensible` and `non-extensible` layers. `Non-extensible` layers are intended to be static, and handled on behalf of the user, typically implementing low-opinion infrastructure.
+
+The `Extensible` layers are:
+- Data Layer
+- Routing Layer
+- Flow Controller
+- Scheduling Layer
+
+The `Non-Extensible` layer(s) are:
+- The Ext-Proc Server
+
+### `Extensible`
+
+#### Data Layer
+
+The data layer will consume and store: the InferencePool/InferenceModel config and the pre-defined [Model Server Protocol](../003-model-server-protocol/README.md). Additionally, the data fed from the model servers will be processed and digested to provide resource scarcity regime hints, and autoscaling recommendations.
+
+Many extensions to scheduling will require changes to ingested metrics; as such, the data layer will be built to be extended, but extenders accept that the Model Server Protocol will no longer provide guarantees on portability of a model server out of the box.
+
+#### Routing Layer
+
+The routing layer is likely to be the most opinion-heavy section, as the scope of what constitutes a 'Route Rule' is somewhat broad. The current examples we expect would be:
+
+- System Prompt injection
+- RAG callout
+- Per-InferenceModel request validation (such as safety/on-topic, etc)
+
+Due to the possibility of this becoming a bit of a dumping ground, the API will keep a _very_ tight scope on which of these route rules are included in the spec. A standard method of extension will be provided if the need to define a custom rule arises.
+
+#### Flow Controller (WIP - implementation tracked in [#674](https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/674))
+
+The flow controller will consume resource regime data, and enforce proper resource sharing between workloads. This will primarily be done through a queuing mechanism [as described here](https://docs.google.com/document/d/1VZL7opFWuwgWquvgiOzLlXAJ633qZ9U-A0ZixGjBgaI/edit?usp=sharing).
+
+#### Scheduling Layer
+
+As the Scheduling Layer is the final interface to the entirety of the pool, all configuration will be at the _pool_ level. The default scheduling layer will be an experimentally-backed LB algorithm, with exposed config values.
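+
+As a purely illustrative sketch (not the project's actual interface; all type and
+method names here are hypothetical), a pluggable scheduler boils down to an interface
+that picks one endpoint from the candidates surfaced by the data layer:
+
+```go
+package scheduling
+
+import (
+	"context"
+	"fmt"
+)
+
+// Endpoint is a hypothetical snapshot of a model server from the data layer.
+type Endpoint struct {
+	Address            string  // ip:port of the model server
+	QueueDepth         int     // number of queued requests
+	KVCacheUtilization float64 // fraction of KV cache in use
+}
+
+// Scheduler is the assumed plug-in point: implementations pick one endpoint
+// from the candidate set for a given target model.
+type Scheduler interface {
+	Schedule(ctx context.Context, targetModel string, candidates []Endpoint) (Endpoint, error)
+}
+
+// LeastQueue is a toy implementation that picks the endpoint with the
+// shortest request queue; a real algorithm would weigh more signals.
+type LeastQueue struct{}
+
+func (LeastQueue) Schedule(_ context.Context, targetModel string, candidates []Endpoint) (Endpoint, error) {
+	if len(candidates) == 0 {
+		return Endpoint{}, fmt.Errorf("no candidate endpoints for model %q", targetModel)
+	}
+	best := candidates[0]
+	for _, e := range candidates[1:] {
+		if e.QueueDepth < best.QueueDepth {
+			best = e
+		}
+	}
+	return best, nil
+}
+```
+
+An alternative implementation could sit behind the same interface and be dark-launched, receiving mirrored traffic while only the default scheduler's decisions take effect.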
+
+The Scheduler will define a strong interface API, so that new scheduling algos may be plugged in & dark-launched to test in production traffic without impacting said traffic. Extension is expected to adhere to the [Scheduler Subsystem definition](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/603).
+
+### `Non-extensible`
+
+#### Ext-Proc Server
+
+The Ext-Proc Server protocol is very well defined & specific; deviation could cause the EPP to become unusable or unstable. Extension is ill-advised.
diff --git a/docs/proposals/0683-epp-architecture-proposal/images/epp_arch.svg b/docs/proposals/0683-epp-architecture-proposal/images/epp_arch.svg
new file mode 100644
index 00000000..4c585728
--- /dev/null
+++ b/docs/proposals/0683-epp-architecture-proposal/images/epp_arch.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/proposals/README.md b/docs/proposals/README.md
new file mode 100644
index 00000000..2b0408d3
--- /dev/null
+++ b/docs/proposals/README.md
@@ -0,0 +1,5 @@
+# Proposals Best Practices
+
+
+## Naming
+The directory of the proposal should lead with a 4-digit PR number (will move to 5,6,... should our PR count get that high), followed by a kebab-cased title. The PR number is not known until the PR is cut, so development can use a placeholder, ex. XXXX-my-proposal. The PR number is used b/c it is unique & chronological, allowing the default ordering of proposals to follow the timeline of development.
\ No newline at end of file
diff --git a/docs/schedular-flowchart.png b/docs/scheduler-flowchart.png
similarity index 100%
rename from docs/schedular-flowchart.png
rename to docs/scheduler-flowchart.png
diff --git a/go.mod b/go.mod
index 8dd59e3e..30d0487e 100644
--- a/go.mod
+++ b/go.mod
@@ -1,70 +1,62 @@
-module inference.networking.x-k8s.io/gateway-api-inference-extension
+module sigs.k8s.io/gateway-api-inference-extension
 
-go 1.23.0
-
-toolchain go1.23.2
+go 1.24.0
 
 require (
-	github.com/bojand/ghz v0.120.0
 	github.com/elastic/crd-ref-docs v0.1.0
 	github.com/envoyproxy/go-control-plane/envoy v1.32.4
-	github.com/google/go-cmp v0.6.0
-	github.com/jhump/protoreflect v1.17.0
-	github.com/onsi/ginkgo/v2 v2.22.2
-	github.com/onsi/gomega v1.36.2
-	github.com/prometheus/client_golang v1.20.5
-	github.com/prometheus/client_model v0.6.1
-	github.com/prometheus/common v0.62.0
+	github.com/go-logr/logr v1.4.2
+	github.com/google/go-cmp v0.7.0
+	github.com/onsi/ginkgo/v2 v2.23.4
+	github.com/onsi/gomega v1.37.0
+	github.com/prometheus/client_golang v1.22.0
+	github.com/prometheus/client_model v0.6.2
+	github.com/prometheus/common v0.63.0
 	github.com/stretchr/testify v1.10.0
 	go.uber.org/multierr v1.11.0
-	google.golang.org/grpc v1.70.0
-	google.golang.org/protobuf v1.36.4
-	k8s.io/api v0.32.1
-	k8s.io/apiextensions-apiserver v0.32.1
-	k8s.io/apimachinery v0.32.1
-	k8s.io/client-go v0.32.1
-	k8s.io/code-generator v0.32.1
-	k8s.io/component-base v0.32.1
-	k8s.io/klog/v2 v2.130.1
+	go.uber.org/zap v1.27.0
+	google.golang.org/grpc v1.71.1
+	google.golang.org/protobuf v1.36.6
+	k8s.io/api v0.32.4
+	k8s.io/apiextensions-apiserver v0.32.4
+	k8s.io/apimachinery v0.32.4
+	k8s.io/client-go v0.32.4
+	k8s.io/code-generator v0.32.4
+	k8s.io/component-base v0.32.4
 	k8s.io/utils v0.0.0-20241210054802-24370beab758
-	sigs.k8s.io/controller-runtime v0.20.1
-	sigs.k8s.io/structured-merge-diff/v4 v4.5.0
+	sigs.k8s.io/controller-runtime v0.20.4
+	sigs.k8s.io/gateway-api v1.2.1
+	sigs.k8s.io/structured-merge-diff/v4 v4.6.0
 	sigs.k8s.io/yaml v1.4.0
 )
 
 require (
-	cel.dev/expr v0.19.0 //
indirect - cloud.google.com/go/compute/metadata v0.5.2 // indirect - github.com/BurntSushi/toml v1.1.0 // indirect + cel.dev/expr v0.19.1 // indirect github.com/Masterminds/goutils v1.1.1 // indirect github.com/Masterminds/semver v1.5.0 // indirect - github.com/Masterminds/semver/v3 v3.2.0 // indirect github.com/Masterminds/sprig v2.22.0+incompatible // indirect - github.com/Masterminds/sprig/v3 v3.2.3 // indirect - github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 // indirect github.com/antlr4-go/antlr/v4 v4.13.0 // indirect github.com/asaskevich/govalidator v0.0.0-20190424111038-f61b66f89f4a // indirect github.com/beorn7/perks v1.0.1 // indirect github.com/blang/semver/v4 v4.0.0 // indirect - github.com/bufbuild/protocompile v0.14.1 // indirect github.com/cenkalti/backoff/v4 v4.3.0 // indirect github.com/cespare/xxhash/v2 v2.3.0 // indirect - github.com/cncf/xds/go v0.0.0-20240905190251-b4127c9b8d78 // indirect + github.com/cncf/xds/go v0.0.0-20241223141626-cff3c89139a3 // indirect github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect - github.com/dustin/go-humanize v1.0.1 // indirect - github.com/emicklei/go-restful/v3 v3.11.0 // indirect - github.com/envoyproxy/go-control-plane/ratelimit v0.1.0 // indirect + github.com/emicklei/go-restful/v3 v3.12.0 // indirect github.com/envoyproxy/protoc-gen-validate v1.2.1 // indirect - github.com/evanphx/json-patch/v5 v5.9.0 // indirect - github.com/fatih/color v1.16.0 // indirect + github.com/evanphx/json-patch/v5 v5.9.11 // indirect + github.com/fatih/color v1.17.0 // indirect github.com/felixge/httpsnoop v1.0.4 // indirect github.com/fsnotify/fsnotify v1.7.0 // indirect github.com/fxamacker/cbor/v2 v2.7.0 // indirect - github.com/go-logr/logr v1.4.2 // indirect github.com/go-logr/stdr v1.2.2 // indirect + github.com/go-logr/zapr v1.3.0 // indirect github.com/go-openapi/jsonpointer v0.21.0 // indirect - github.com/go-openapi/jsonreference v0.20.2 // indirect + github.com/go-openapi/jsonreference v0.21.0 // indirect github.com/go-openapi/swag v0.23.0 // indirect + github.com/go-playground/locales v0.14.1 // indirect + github.com/go-playground/universal-translator v0.18.0 // indirect github.com/go-task/slim-sprig/v3 v3.0.0 // indirect github.com/gobuffalo/flect v1.0.2 // indirect github.com/goccy/go-yaml v1.11.3 // indirect @@ -74,18 +66,17 @@ require ( github.com/google/cel-go v0.22.0 // indirect github.com/google/gnostic-models v0.6.8 // indirect github.com/google/gofuzz v1.2.0 // indirect - github.com/google/pprof v0.0.0-20241210010833-40e02aabc2ad // indirect + github.com/google/pprof v0.0.0-20250403155104-27863c87afa6 // indirect github.com/google/uuid v1.6.0 // indirect - github.com/gorilla/websocket v1.5.0 // indirect + github.com/gorilla/websocket v1.5.1 // indirect github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect github.com/huandu/xstrings v1.3.3 // indirect - github.com/imdario/mergo v0.3.11 // indirect + github.com/imdario/mergo v0.3.16 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect - github.com/jinzhu/configor v1.2.1 // indirect github.com/josharian/intern v1.0.0 // indirect github.com/json-iterator/go v1.1.12 // indirect - github.com/klauspost/compress v1.17.9 // indirect github.com/kylelemons/godebug v1.1.0 // indirect + github.com/leodido/go-urn v1.2.1 // indirect github.com/mailru/easyjson v0.7.7 // indirect github.com/mattn/go-colorable v0.1.13 // indirect github.com/mattn/go-isatty v0.0.20 // indirect @@ -100,44 +91,44 @@ require ( 
github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10 // indirect github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect github.com/prometheus/procfs v0.15.1 // indirect - github.com/shopspring/decimal v1.2.0 // indirect - github.com/spf13/cast v1.4.1 // indirect github.com/spf13/cobra v1.8.1 // indirect github.com/spf13/pflag v1.0.5 // indirect github.com/stoewer/go-strcase v1.3.0 // indirect github.com/x448/float16 v0.8.4 // indirect + go.opentelemetry.io/auto/sdk v1.1.0 // indirect go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.53.0 // indirect - go.opentelemetry.io/otel v1.32.0 // indirect + go.opentelemetry.io/otel v1.34.0 // indirect go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.28.0 // indirect go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.27.0 // indirect - go.opentelemetry.io/otel/metric v1.32.0 // indirect - go.opentelemetry.io/otel/sdk v1.32.0 // indirect - go.opentelemetry.io/otel/trace v1.32.0 // indirect + go.opentelemetry.io/otel/metric v1.34.0 // indirect + go.opentelemetry.io/otel/sdk v1.34.0 // indirect + go.opentelemetry.io/otel/trace v1.34.0 // indirect go.opentelemetry.io/proto/otlp v1.3.1 // indirect - go.uber.org/zap v1.27.0 // indirect - golang.org/x/crypto v0.32.0 // indirect + go.uber.org/automaxprocs v1.6.0 // indirect + golang.org/x/crypto v0.36.0 // indirect golang.org/x/exp v0.0.0-20240719175910-8a7402abbf56 // indirect - golang.org/x/mod v0.22.0 // indirect - golang.org/x/net v0.34.0 // indirect - golang.org/x/oauth2 v0.24.0 // indirect - golang.org/x/sync v0.10.0 // indirect - golang.org/x/sys v0.29.0 // indirect - golang.org/x/term v0.28.0 // indirect - golang.org/x/text v0.21.0 // indirect + golang.org/x/mod v0.24.0 // indirect + golang.org/x/net v0.37.0 // indirect + golang.org/x/oauth2 v0.25.0 // indirect + golang.org/x/sync v0.12.0 // indirect + golang.org/x/sys v0.32.0 // indirect + golang.org/x/term v0.30.0 // indirect + golang.org/x/text v0.23.0 // indirect golang.org/x/time v0.7.0 // indirect - golang.org/x/tools v0.28.0 // indirect + golang.org/x/tools v0.31.0 // indirect golang.org/x/xerrors v0.0.0-20231012003039-104605ab7028 // indirect gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect - google.golang.org/genproto/googleapis/api v0.0.0-20241202173237-19429a94021a // indirect - google.golang.org/genproto/googleapis/rpc v0.0.0-20241202173237-19429a94021a // indirect + google.golang.org/genproto/googleapis/api v0.0.0-20250106144421-5f5ef82da422 // indirect + google.golang.org/genproto/googleapis/rpc v0.0.0-20250115164207-1a7da9e5054f // indirect gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect gopkg.in/inf.v0 v0.9.1 // indirect gopkg.in/yaml.v2 v2.4.0 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect - k8s.io/apiserver v0.32.1 // indirect + k8s.io/apiserver v0.32.4 // indirect k8s.io/gengo/v2 v2.0.0-20240911193312-2b36238f13e9 // indirect + k8s.io/klog/v2 v2.130.1 // indirect k8s.io/kube-openapi v0.0.0-20241105132330-32ad38e42d3f // indirect sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.0 // indirect - sigs.k8s.io/controller-tools v0.14.0 // indirect + sigs.k8s.io/controller-tools v0.16.3 // indirect sigs.k8s.io/json v0.0.0-20241010143419-9aa6b5e7a4b3 // indirect ) diff --git a/go.sum b/go.sum index 6d1cd8bd..6688c578 100644 --- a/go.sum +++ b/go.sum @@ -1,22 +1,11 @@ -cel.dev/expr v0.19.0 h1:lXuo+nDhpyJSpWxpPVi5cPUwzKb+dsdOiw6IreM5yt0= -cel.dev/expr v0.19.0/go.mod h1:MrpN08Q+lEBs+bGYdLxxHkZoUSsCp0nSKTs0nTymJgw= -cloud.google.com/go/compute/metadata 
v0.5.2 h1:UxK4uu/Tn+I3p2dYWTfiX4wva7aYlKixAHn3fyqngqo= -cloud.google.com/go/compute/metadata v0.5.2/go.mod h1:C66sj2AluDcIqakBq/M8lw8/ybHgOZqin2obFxa/E5k= -github.com/BurntSushi/toml v0.3.1/go.mod h1:xHWCNGjB5oqiDr8zfno3MHue2Ht5sIBksp03qcyfWMU= -github.com/BurntSushi/toml v1.1.0 h1:ksErzDEI1khOiGPgpwuI7x2ebx/uXQNw7xJpn9Eq1+I= -github.com/BurntSushi/toml v1.1.0/go.mod h1:CxXYINrC8qIiEnFrOxCa7Jy5BFHlXnUU2pbicEuybxQ= +cel.dev/expr v0.19.1 h1:NciYrtDRIR0lNCnH1LFJegdjspNx9fI59O7TWcua/W4= +cel.dev/expr v0.19.1/go.mod h1:MrpN08Q+lEBs+bGYdLxxHkZoUSsCp0nSKTs0nTymJgw= github.com/Masterminds/goutils v1.1.1 h1:5nUrii3FMTL5diU80unEVvNevw1nH4+ZV4DSLVJLSYI= github.com/Masterminds/goutils v1.1.1/go.mod h1:8cTjp+g8YejhMuvIA5y2vz3BpJxksy863GQaJW2MFNU= github.com/Masterminds/semver v1.5.0 h1:H65muMkzWKEuNDnfl9d70GUjFniHKHRbFPGBuZ3QEww= github.com/Masterminds/semver v1.5.0/go.mod h1:MB6lktGJrhw8PrUyiEoblNEGEQ+RzHPF078ddwwvV3Y= -github.com/Masterminds/semver/v3 v3.2.0 h1:3MEsd0SM6jqZojhjLWWeBY+Kcjy9i6MQAeY7YgDP83g= -github.com/Masterminds/semver/v3 v3.2.0/go.mod h1:qvl/7zhW3nngYb5+80sSMF+FG2BjYrf8m9wsX0PNOMQ= github.com/Masterminds/sprig v2.22.0+incompatible h1:z4yfnGrZ7netVz+0EDJ0Wi+5VZCSYp4Z0m2dk6cEM60= github.com/Masterminds/sprig v2.22.0+incompatible/go.mod h1:y6hNFY5UBTIWBxnzTeuNhlNS5hqE0NB0E6fgfo2Br3o= -github.com/Masterminds/sprig/v3 v3.2.3 h1:eL2fZNezLomi0uOLqjQoN6BfsDD+fyLtgbJMAj9n6YA= -github.com/Masterminds/sprig/v3 v3.2.3/go.mod h1:rXcFaZ2zZbLRJv/xSysmlgIM1u11eBaRMhvYXJNkGuM= -github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751 h1:JYp7IbQjafoB+tBA3gMyHYHrpOtNuDiK/uB5uXxq5wM= -github.com/alecthomas/template v0.0.0-20190718012654-fb15b899a751/go.mod h1:LOuyumcjzFXgccqObfd/Ljyb9UuFJ6TxHnclSeseNhc= github.com/antlr4-go/antlr/v4 v4.13.0 h1:lxCg3LAv+EUK6t1i0y1V6/SLeUi0eKEKdhQAlS8TVTI= github.com/antlr4-go/antlr/v4 v4.13.0/go.mod h1:pfChB/xh/Unjila75QW7+VU4TSnWnnk9UTnmpPaOR2g= github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPdPJAN/hZIm0C4OItdklCFmMRWYpio= @@ -27,42 +16,31 @@ github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM= github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw= github.com/blang/semver/v4 v4.0.0 h1:1PFHFE6yCCTv8C1TeyNNarDzntLi7wMI5i/pzqYIsAM= github.com/blang/semver/v4 v4.0.0/go.mod h1:IbckMUScFkM3pff0VJDNKRiT6TG/YpiHIM2yvyW5YoQ= -github.com/bojand/ghz v0.120.0 h1:6F4wsmZVwFg5UnD+/R+IABWk6sKE/0OKIBdUQUZnOdo= -github.com/bojand/ghz v0.120.0/go.mod h1:HfECuBZj1v02XObGnRuoZgyB1PR24/25dIYiJIMjJnE= -github.com/bufbuild/protocompile v0.14.1 h1:iA73zAf/fyljNjQKwYzUHD6AD4R8KMasmwa/FBatYVw= -github.com/bufbuild/protocompile v0.14.1/go.mod h1:ppVdAIhbr2H8asPk6k4pY7t9zB1OU5DoEw9xY/FUi1c= github.com/cenkalti/backoff/v4 v4.3.0 h1:MyRJ/UdXutAwSAT+s3wNd7MfTIcy71VQueUuFK343L8= github.com/cenkalti/backoff/v4 v4.3.0/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE= github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs= github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs= -github.com/cncf/xds/go v0.0.0-20240905190251-b4127c9b8d78 h1:QVw89YDxXxEe+l8gU8ETbOasdwEV+avkR75ZzsVV9WI= -github.com/cncf/xds/go v0.0.0-20240905190251-b4127c9b8d78/go.mod h1:W+zGtBO5Y1IgJhy4+A9GOqVhqLpfZi+vwmdNXUehLA8= +github.com/cncf/xds/go v0.0.0-20241223141626-cff3c89139a3 h1:boJj011Hh+874zpIySeApCX4GeOjPl9qhRF3QuIZq+Q= +github.com/cncf/xds/go v0.0.0-20241223141626-cff3c89139a3/go.mod h1:W+zGtBO5Y1IgJhy4+A9GOqVhqLpfZi+vwmdNXUehLA8= 
github.com/cpuguy83/go-md2man/v2 v2.0.4/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o= -github.com/creack/pty v1.1.9/go.mod h1:oKZEueFk5CKHvIhNR5MUki03XCEU+Q6VDXinZuGJ33E= github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1VwoXQT9A3Wy9MM3WgvqSxFWenqJduM= github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= -github.com/dustin/go-humanize v1.0.1 h1:GzkhY7T5VNhEkwH0PVJgjz+fX1rhBrR7pRT3mDkpeCY= -github.com/dustin/go-humanize v1.0.1/go.mod h1:Mu1zIs6XwVuF/gI1OepvI0qD18qycQx+mFykh5fBlto= github.com/elastic/crd-ref-docs v0.1.0 h1:Cr5kz89QB3Iuuj7dhAfLMApCrChEGAaIBTxGk/xuRKw= github.com/elastic/crd-ref-docs v0.1.0/go.mod h1:X83mMBdJt05heJUYiS3T0yJ/JkCuliuhSUNav5Gjo/U= -github.com/emicklei/go-restful/v3 v3.11.0 h1:rAQeMHw1c7zTmncogyy8VvRZwtkmkZ4FxERmMY4rD+g= -github.com/emicklei/go-restful/v3 v3.11.0/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc= -github.com/envoyproxy/go-control-plane v0.13.4 h1:zEqyPVyku6IvWCFwux4x9RxkLOMUL+1vC9xUFv5l2/M= -github.com/envoyproxy/go-control-plane v0.13.4/go.mod h1:kDfuBlDVsSj2MjrLEtRWtHlsWIFcGyB2RMO44Dc5GZA= +github.com/emicklei/go-restful/v3 v3.12.0 h1:y2DdzBAURM29NFF94q6RaY4vjIH1rtwDapwQtU84iWk= +github.com/emicklei/go-restful/v3 v3.12.0/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc= github.com/envoyproxy/go-control-plane/envoy v1.32.4 h1:jb83lalDRZSpPWW2Z7Mck/8kXZ5CQAFYVjQcdVIr83A= github.com/envoyproxy/go-control-plane/envoy v1.32.4/go.mod h1:Gzjc5k8JcJswLjAx1Zm+wSYE20UrLtt7JZMWiWQXQEw= -github.com/envoyproxy/go-control-plane/ratelimit v0.1.0 h1:/G9QYbddjL25KvtKTv3an9lx6VBE2cnb8wp1vEGNYGI= -github.com/envoyproxy/go-control-plane/ratelimit v0.1.0/go.mod h1:Wk+tMFAFbCXaJPzVVHnPgRKdUdwW/KdbRt94AzgRee4= github.com/envoyproxy/protoc-gen-validate v1.2.1 h1:DEo3O99U8j4hBFwbJfrz9VtgcDfUKS7KJ7spH3d86P8= github.com/envoyproxy/protoc-gen-validate v1.2.1/go.mod h1:d/C80l/jxXLdfEIhX1W2TmLfsJ31lvEjwamM4DxlWXU= -github.com/evanphx/json-patch v0.5.2 h1:xVCHIVMUu1wtM/VkR9jVZ45N3FhZfYMMYGorLCR8P3k= -github.com/evanphx/json-patch v0.5.2/go.mod h1:ZWS5hhDbVDyob71nXKNL0+PWn6ToqBHMikGIFbs31qQ= -github.com/evanphx/json-patch/v5 v5.9.0 h1:kcBlZQbplgElYIlo/n1hJbls2z/1awpXxpRi0/FOJfg= -github.com/evanphx/json-patch/v5 v5.9.0/go.mod h1:VNkHZ/282BpEyt/tObQO8s5CMPmYYq14uClGH4abBuQ= -github.com/fatih/color v1.16.0 h1:zmkK9Ngbjj+K0yRhTVONQh1p/HknKYSlNT+vZCzyokM= -github.com/fatih/color v1.16.0/go.mod h1:fL2Sau1YI5c0pdGEVCbKQbLXB6edEj1ZgiY4NijnWvE= +github.com/evanphx/json-patch v5.7.0+incompatible h1:vgGkfT/9f8zE6tvSCe74nfpAVDQ2tG6yudJd8LBksgI= +github.com/evanphx/json-patch v5.7.0+incompatible/go.mod h1:50XU6AFN0ol/bzJsmQLiYLvXMP4fmwYFNcr97nuDLSk= +github.com/evanphx/json-patch/v5 v5.9.11 h1:/8HVnzMq13/3x9TPvjG08wUGqBTmZBsCWzjTM0wiaDU= +github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/XEtnUf6OZxqIQTM= +github.com/fatih/color v1.17.0 h1:GlRw1BRJxkpqUCBKzKOw098ed57fEsKeNjpTe3cSjK4= +github.com/fatih/color v1.17.0/go.mod h1:YZ7TlrGPkiz6ku9fK3TLD/pl3CpsiFyu8N92HLgmosI= github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg= github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U= github.com/fsnotify/fsnotify v1.7.0 h1:8JEhPFa5W2WU7YfeZzPNqzMP6Lwt7L2715Ggo0nosvA= @@ -76,19 +54,17 @@ github.com/go-logr/stdr v1.2.2 
h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag= github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE= github.com/go-logr/zapr v1.3.0 h1:XGdV8XW8zdwFiwOA2Dryh1gj2KRQyOOoNmBy4EplIcQ= github.com/go-logr/zapr v1.3.0/go.mod h1:YKepepNBd1u/oyhd/yQmtjVXmm9uML4IXUgMOwR8/Gg= -github.com/go-openapi/jsonpointer v0.19.6/go.mod h1:osyAmYz/mB/C3I+WsTTSgw1ONzaLJoLCyoi6/zppojs= github.com/go-openapi/jsonpointer v0.21.0 h1:YgdVicSA9vH5RiHs9TZW5oyafXZFc6+2Vc1rr/O9oNQ= github.com/go-openapi/jsonpointer v0.21.0/go.mod h1:IUyH9l/+uyhIYQ/PXVA41Rexl+kOkAPDdXEYns6fzUY= -github.com/go-openapi/jsonreference v0.20.2 h1:3sVjiK66+uXK/6oQ8xgcRKcFgQ5KXa2KvnJRumpMGbE= -github.com/go-openapi/jsonreference v0.20.2/go.mod h1:Bl1zwGIM8/wsvqjsOQLJ/SH+En5Ap4rVB5KVcIDZG2k= -github.com/go-openapi/swag v0.22.3/go.mod h1:UzaqsxGiab7freDnrUUra0MwWfN/q7tE4j+VcZ0yl14= +github.com/go-openapi/jsonreference v0.21.0 h1:Rs+Y7hSXT83Jacb7kFyjn4ijOuVGSvOdF2+tg1TRrwQ= +github.com/go-openapi/jsonreference v0.21.0/go.mod h1:LmZmgsrTkVg9LG4EaHeY8cBDslNPMo06cago5JNLkm4= github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE= github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ= +github.com/go-playground/locales v0.14.0/go.mod h1:sawfccIbzZTqEDETgFXqTho0QybSa7l++s0DH+LDiLs= github.com/go-playground/locales v0.14.1 h1:EWaQ/wswjilfKLTECiXz7Rh+3BjFhfDFKv/oXslEjJA= github.com/go-playground/locales v0.14.1/go.mod h1:hxrqLVvrK65+Rwrd5Fc6F2O76J/NuW9t0sjnWqG1slY= github.com/go-playground/universal-translator v0.18.0 h1:82dyy6p4OuJq4/CByFNOn/jYrnRPArHwAcmLoJZxyho= github.com/go-playground/universal-translator v0.18.0/go.mod h1:UvRDBj+xPUEGrFYl+lu/H90nyDXpg0fqeB/AQUGNTVA= -github.com/go-playground/validator v9.31.0+incompatible h1:UA72EPEogEnq76ehGdEDp4Mit+3FDh548oRqwVgNsHA= github.com/go-playground/validator/v10 v10.4.1 h1:pH2c5ADXtd66mxoE0Zm9SUhxE20r7aM3F26W0hOn+GE= github.com/go-playground/validator/v10 v10.4.1/go.mod h1:nlOn6nFhuKACm19sB/8EGNn9GlaMV7XkbRSipzJ0Ii4= github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= @@ -108,43 +84,35 @@ github.com/google/cel-go v0.22.0/go.mod h1:BuznPXXfQDpXKWQ9sPW3TzlAJN5zzFe+i9tIs github.com/google/gnostic-models v0.6.8 h1:yo/ABAfM5IMRsS1VnXjTBvUb61tFIHozhlYvRgGre9I= github.com/google/gnostic-models v0.6.8/go.mod h1:5n7qKqH0f5wFt+aWF8CW6pZLLNOfYuF5OpfBSENuI8U= github.com/google/go-cmp v0.5.9/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= -github.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI= -github.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY= +github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8= +github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU= github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= github.com/google/gofuzz v1.2.0 h1:xRy4A+RhZaiKjJ1bPfwQ8sedCA+YS2YcCHW6ec7JMi0= github.com/google/gofuzz v1.2.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg= -github.com/google/pprof v0.0.0-20241210010833-40e02aabc2ad h1:a6HEuzUHeKH6hwfN/ZoQgRgVIWFJljSWa/zetS2WTvg= -github.com/google/pprof v0.0.0-20241210010833-40e02aabc2ad/go.mod h1:vavhavw2zAxS5dIdcRluK6cSGGPlZynqzFM8NdvU144= -github.com/google/uuid v1.1.1/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= +github.com/google/pprof v0.0.0-20250403155104-27863c87afa6 h1:BHT72Gu3keYf3ZEu2J0b1vyeLSOYI8bm5wbJM/8yDe8= +github.com/google/pprof 
v0.0.0-20250403155104-27863c87afa6/go.mod h1:boTsfXsheKC2y+lKOCMpSfarhxDeIzfZG1jqGcPl3cA= github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= -github.com/gorilla/websocket v1.5.0 h1:PPwGk2jz7EePpoHN/+ClbZu8SPxiqlu12wZP/3sWmnc= -github.com/gorilla/websocket v1.5.0/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= +github.com/gorilla/websocket v1.5.1 h1:gmztn0JnHVt9JZquRuzLw3g4wouNVzKL15iLr/zn/QY= +github.com/gorilla/websocket v1.5.1/go.mod h1:x3kM2JMyaluk02fnUJpQuwD2dCS5NDG2ZHL0uE0tcaY= github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 h1:bkypFPDjIYGfCYD5mRBvpqxfYX1YCS1PXdKYWi8FsN0= github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0/go.mod h1:P+Lt/0by1T8bfcF3z737NnSbmxQAppXMRziHUxPOC8k= github.com/huandu/xstrings v1.3.3 h1:/Gcsuc1x8JVbJ9/rlye4xZnVAbEkGauT8lbebqcQws4= github.com/huandu/xstrings v1.3.3/go.mod h1:y5/lhBue+AyNmUVz9RLU9xbLR0o4KIIExikq4ovT0aE= -github.com/imdario/mergo v0.3.11 h1:3tnifQM4i+fbajXKBHXWEH+KvNHqojZ778UH75j3bGA= -github.com/imdario/mergo v0.3.11/go.mod h1:jmQim1M+e3UYxmgPu/WyfjB3N3VflVyUjjjwH0dnCYA= +github.com/imdario/mergo v0.3.16 h1:wwQJbIsHYGMUyLSPrEq1CT16AhnhNJQ51+4fdHUnCl4= +github.com/imdario/mergo v0.3.16/go.mod h1:WBLT9ZmE3lPoWsEzCh9LPo3TiwVN+ZKEjmz+hD27ysY= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= -github.com/jhump/protoreflect v1.17.0 h1:qOEr613fac2lOuTgWN4tPAtLL7fUSbuJL5X5XumQh94= -github.com/jhump/protoreflect v1.17.0/go.mod h1:h9+vUUL38jiBzck8ck+6G/aeMX8Z4QUY/NiJPwPNi+8= -github.com/jinzhu/configor v1.2.1 h1:OKk9dsR8i6HPOCZR8BcMtcEImAFjIhbJFZNyn5GCZko= -github.com/jinzhu/configor v1.2.1/go.mod h1:nX89/MOmDba7ZX7GCyU/VIaQ2Ar2aizBl2d3JLF/rDc= github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY= github.com/josharian/intern v1.0.0/go.mod h1:5DoeVV0s6jJacbCEi61lwdGj/aVlrQvzHFFd8Hwg//Y= github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM= github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo= github.com/kisielk/errcheck v1.5.0/go.mod h1:pFxgyoBC7bSaBwPgfKdkLd5X25qrDl4LWUI2bnpBCr8= github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck= -github.com/klauspost/compress v1.17.9 h1:6KIumPrER1LHsvBVuDa0r5xaG0Es51mhhB9BQB2qeMA= -github.com/klauspost/compress v1.17.9/go.mod h1:Di0epgTjJY877eYKx5yC51cX2A2Vl2ibi7bDH9ttBbw= -github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI= +github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo= +github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ= github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= -github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= -github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI= github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc= @@ -158,10 +126,8 @@ github.com/mattn/go-colorable v0.1.13/go.mod h1:7S9/ev0klgBDR4GtXTXX8a3vIGJpMovk github.com/mattn/go-isatty v0.0.16/go.mod 
h1:kYGgaQfpe5nmfYZH+SKPsOc2e4SrIfOl2e/yFXSvRLM= github.com/mattn/go-isatty v0.0.20 h1:xfD0iDuEKnDkl03q4limB+vH+GxLEtL/jb4xVJSWWEY= github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y= -github.com/mitchellh/copystructure v1.0.0/go.mod h1:SNtv71yrdKgLRyLFxmLdkAbkKEFWgYaq1OVrnRcwhnw= github.com/mitchellh/copystructure v1.2.0 h1:vpKXTN4ewci03Vljg/q9QvCGUDttBOGBIa15WveJJGw= github.com/mitchellh/copystructure v1.2.0/go.mod h1:qLl+cE2AmVv+CoeAwDPye/v+N2HKCj9FbZEVFJRxO9s= -github.com/mitchellh/reflectwalk v1.0.0/go.mod h1:mSTlrgnPZtwu0c4WaC2kGObEpuNDbx0jmZXqmk4esnw= github.com/mitchellh/reflectwalk v1.0.2 h1:G2LzWKi524PWgd3mLHV8Y5k7s6XUvT0Gef6zxSIeXaQ= github.com/mitchellh/reflectwalk v1.0.2/go.mod h1:mSTlrgnPZtwu0c4WaC2kGObEpuNDbx0jmZXqmk4esnw= github.com/moby/spdystream v0.5.0 h1:7r0J1Si3QO/kjRitvSLVVFUjxMEb/YLj6S9FF62JBCU= @@ -179,10 +145,10 @@ github.com/nxadm/tail v1.4.8 h1:nPr65rt6Y5JFSKQO7qToXr7pePgD6Gwiw05lkbyAQTE= github.com/nxadm/tail v1.4.8/go.mod h1:+ncqLTQzXmGhMZNUePPaPqPvBxHAIsmXswZKocGu+AU= github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE= github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU= -github.com/onsi/ginkgo/v2 v2.22.2 h1:/3X8Panh8/WwhU/3Ssa6rCKqPLuAkVY2I0RoyDLySlU= -github.com/onsi/ginkgo/v2 v2.22.2/go.mod h1:oeMosUL+8LtarXBHu/c0bx2D/K9zyQ6uX3cTyztHwsk= -github.com/onsi/gomega v1.36.2 h1:koNYke6TVk6ZmnyHrCXba/T/MoLBXFjeC1PtvYgw0A8= -github.com/onsi/gomega v1.36.2/go.mod h1:DdwyADRjrc825LhMEkD76cHR5+pUnjhUN8GlHlRPHzY= +github.com/onsi/ginkgo/v2 v2.23.4 h1:ktYTpKJAVZnDT4VjxSbiBenUjmlL/5QkBEocaWXiQus= +github.com/onsi/ginkgo/v2 v2.23.4/go.mod h1:Bt66ApGPBFzHyR+JO10Zbt0Gsp4uWxu5mIOTusL46e8= +github.com/onsi/gomega v1.37.0 h1:CdEG8g0S133B4OswTDC/5XPSzE1OeP29QOioj2PID2Y= +github.com/onsi/gomega v1.37.0/go.mod h1:8D9+Txp43QWKhM24yyOBEdpkzN8FvJyAwecBgsU4KU0= github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10 h1:GFCKgmp0tecUJ0sJuv4pzYCqS9+RGSn52M3FUwPs+uo= @@ -190,22 +156,19 @@ github.com/planetscale/vtprotobuf v0.6.1-0.20240319094008-0393e58bdf10/go.mod h1 github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U= github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= -github.com/prometheus/client_golang v1.20.5 h1:cxppBPuYhUnsO6yo/aoRol4L7q7UFfdm+bR9r+8l63Y= -github.com/prometheus/client_golang v1.20.5/go.mod h1:PIEt8X02hGcP8JWbeHyeZ53Y/jReSnHgO035n//V5WE= -github.com/prometheus/client_model v0.6.1 h1:ZKSh/rekM+n3CeS952MLRAdFwIKqeY8b62p8ais2e9E= -github.com/prometheus/client_model v0.6.1/go.mod h1:OrxVMOVHjw3lKMa8+x6HeMGkHMQyHDk9E3jmP2AmGiY= -github.com/prometheus/common v0.62.0 h1:xasJaQlnWAeyHdUBeGjXmutelfJHWMRr+Fg4QszZ2Io= -github.com/prometheus/common v0.62.0/go.mod h1:vyBcEuLSvWos9B1+CyL7JZ2up+uFzXhkqml0W5zIY1I= +github.com/prashantv/gostub v1.1.0 h1:BTyx3RfQjRHnUWaGF9oQos79AlQ5k8WNktv7VGvVH4g= +github.com/prashantv/gostub v1.1.0/go.mod h1:A5zLQHz7ieHGG7is6LLXLz7I8+3LZzsrV0P1IAHhP5U= +github.com/prometheus/client_golang v1.22.0 h1:rb93p9lokFEsctTys46VnV1kLCDpVZ0a/Y92Vm0Zc6Q= +github.com/prometheus/client_golang v1.22.0/go.mod h1:R7ljNsLXhuQXYZYtw6GAE9AZg8Y7vEW5scdCXrWRXC0= 
+github.com/prometheus/client_model v0.6.2 h1:oBsgwpGs7iVziMvrGhE53c/GrLUsZdHnqNwqPLxwZyk= +github.com/prometheus/client_model v0.6.2/go.mod h1:y3m2F6Gdpfy6Ut/GBsUqTWZqCUvMVzSfMLjcu6wAwpE= +github.com/prometheus/common v0.63.0 h1:YR/EIY1o3mEFP/kZCD7iDMnLPlGyuU2Gb3HIcXnA98k= +github.com/prometheus/common v0.63.0/go.mod h1:VVFF/fBIoToEnWRVkYoXEkq3R3paCoxG9PXP74SnV18= github.com/prometheus/procfs v0.15.1 h1:YagwOFzUgYfKKHX6Dr+sHT7km/hxC76UB0learggepc= github.com/prometheus/procfs v0.15.1/go.mod h1:fB45yRUv8NstnjriLhBQLuOUt+WW4BsoGhij/e3PBqk= -github.com/rogpeppe/go-internal v1.12.0 h1:exVL4IDcn6na9z1rAb56Vxr+CgyK3nn3O+epU5NdKM8= -github.com/rogpeppe/go-internal v1.12.0/go.mod h1:E+RYuTGaKKdloAfM02xzb0FW3Paa99yedzYV+kq4uf4= +github.com/rogpeppe/go-internal v1.13.1 h1:KvO1DLK/DRN07sQ1LQKScxyZJuNnedQ5/wKSR38lUII= +github.com/rogpeppe/go-internal v1.13.1/go.mod h1:uMEvuHeurkdAXX61udpOXGD/AzZDWNMNyH2VO9fmH0o= github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM= -github.com/shopspring/decimal v1.2.0 h1:abSATXmQEYyShuxI4/vyW3tV1MrKAJzCZ/0zLUXYbsQ= -github.com/shopspring/decimal v1.2.0/go.mod h1:DKyhrW/HYNuLGql+MJL6WCR6knT2jwCFRcu2hWCYk4o= -github.com/spf13/cast v1.3.1/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkUJE= -github.com/spf13/cast v1.4.1 h1:s0hze+J0196ZfEMTs80N7UlFt0BDuQ7Q+JDnHiMWKdA= -github.com/spf13/cast v1.4.1/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkUJE= github.com/spf13/cobra v1.8.1 h1:e5/vxKd/rZsfSJMUX1agtjeTDf+qv1/JdBF8gg5k9ZM= github.com/spf13/cobra v1.8.1/go.mod h1:wHxEcudfqmLYa8iTfL+OuZPbBZkmvliBWKIezN3kD9Y= github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA= @@ -215,9 +178,8 @@ github.com/stoewer/go-strcase v1.3.0/go.mod h1:fAH5hQ5pehh+j3nZfvwdk2RgEgQjAoM8w github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw= github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpEOglKo= -github.com/stretchr/testify v1.2.2/go.mod h1:a8OnRcib4nhh0OaRAV+Yts87kKdq0PP7pXfy6kDkUVs= github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= -github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA= +github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU= github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= @@ -227,25 +189,28 @@ github.com/x448/float16 v0.8.4 h1:qLwI1I70+NjRFUR3zs1JPUCgaCXSh3SW62uAKT1mSBM= github.com/x448/float16 v0.8.4/go.mod h1:14CWIYCyZA/cWjXOioeEpHeN/83MdbZDRQHoFcYsOfg= github.com/yuin/goldmark v1.1.27/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= github.com/yuin/goldmark v1.2.1/go.mod h1:3hX8gzYuyVAZsxl0MRgGTJEmQBFcNTphYh9decYSb74= -github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY= +go.opentelemetry.io/auto/sdk v1.1.0 h1:cH53jehLUN6UFLY71z+NDOiNJqDdPRaXzTel0sJySYA= +go.opentelemetry.io/auto/sdk v1.1.0/go.mod h1:3wSPjt5PWp2RhlCcmmOial7AvC4DQqZb7a7wCow3W8A= go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.53.0 h1:4K4tsIXefpVJtvA/8srF4V4y0akAoPHkIslgAkjixJA= go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.53.0/go.mod h1:jjdQuTGVsXV4vSs+CJ2qYDeDPf9yIJV23qlIzBm73Vg= 
-go.opentelemetry.io/otel v1.32.0 h1:WnBN+Xjcteh0zdk01SVqV55d/m62NJLJdIyb4y/WO5U= -go.opentelemetry.io/otel v1.32.0/go.mod h1:00DCVSB0RQcnzlwyTfqtxSm+DRr9hpYrHjNGiBHVQIg= +go.opentelemetry.io/otel v1.34.0 h1:zRLXxLCgL1WyKsPVrgbSdMN4c0FMkDAskSTQP+0hdUY= +go.opentelemetry.io/otel v1.34.0/go.mod h1:OWFPOQ+h4G8xpyjgqo4SxJYdDQ/qmRH+wivy7zzx9oI= go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.28.0 h1:3Q/xZUyC1BBkualc9ROb4G8qkH90LXEIICcs5zv1OYY= go.opentelemetry.io/otel/exporters/otlp/otlptrace v1.28.0/go.mod h1:s75jGIWA9OfCMzF0xr+ZgfrB5FEbbV7UuYo32ahUiFI= go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.27.0 h1:qFffATk0X+HD+f1Z8lswGiOQYKHRlzfmdJm0wEaVrFA= go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc v1.27.0/go.mod h1:MOiCmryaYtc+V0Ei+Tx9o5S1ZjA7kzLucuVuyzBZloQ= -go.opentelemetry.io/otel/metric v1.32.0 h1:xV2umtmNcThh2/a/aCP+h64Xx5wsj8qqnkYZktzNa0M= -go.opentelemetry.io/otel/metric v1.32.0/go.mod h1:jH7CIbbK6SH2V2wE16W05BHCtIDzauciCRLoc/SyMv8= -go.opentelemetry.io/otel/sdk v1.32.0 h1:RNxepc9vK59A8XsgZQouW8ue8Gkb4jpWtJm9ge5lEG4= -go.opentelemetry.io/otel/sdk v1.32.0/go.mod h1:LqgegDBjKMmb2GC6/PrTnteJG39I8/vJCAP9LlJXEjU= -go.opentelemetry.io/otel/sdk/metric v1.32.0 h1:rZvFnvmvawYb0alrYkjraqJq0Z4ZUJAiyYCU9snn1CU= -go.opentelemetry.io/otel/sdk/metric v1.32.0/go.mod h1:PWeZlq0zt9YkYAp3gjKZ0eicRYvOh1Gd+X99x6GHpCQ= -go.opentelemetry.io/otel/trace v1.32.0 h1:WIC9mYrXf8TmY/EXuULKc8hR17vE+Hjv2cssQDe03fM= -go.opentelemetry.io/otel/trace v1.32.0/go.mod h1:+i4rkvCraA+tG6AzwloGaCtkx53Fa+L+V8e9a7YvhT8= +go.opentelemetry.io/otel/metric v1.34.0 h1:+eTR3U0MyfWjRDhmFMxe2SsW64QrZ84AOhvqS7Y+PoQ= +go.opentelemetry.io/otel/metric v1.34.0/go.mod h1:CEDrp0fy2D0MvkXE+dPV7cMi8tWZwX3dmaIhwPOaqHE= +go.opentelemetry.io/otel/sdk v1.34.0 h1:95zS4k/2GOy069d321O8jWgYsW3MzVV+KuSPKp7Wr1A= +go.opentelemetry.io/otel/sdk v1.34.0/go.mod h1:0e/pNiaMAqaykJGKbi+tSjWfNNHMTxoC9qANsCzbyxU= +go.opentelemetry.io/otel/sdk/metric v1.34.0 h1:5CeK9ujjbFVL5c1PhLuStg1wxA7vQv7ce1EK0Gyvahk= +go.opentelemetry.io/otel/sdk/metric v1.34.0/go.mod h1:jQ/r8Ze28zRKoNRdkjCZxfs6YvBTG1+YIqyFVFYec5w= +go.opentelemetry.io/otel/trace v1.34.0 h1:+ouXS2V8Rd4hp4580a8q23bg0azF2nI8cqLYnC8mh/k= +go.opentelemetry.io/otel/trace v1.34.0/go.mod h1:Svm7lSjQD7kG7KJ/MUHPVXSDGz2OX4h0M2jHBhmSfRE= go.opentelemetry.io/proto/otlp v1.3.1 h1:TrMUixzpM0yuc/znrFTP9MMRh8trP93mkCiDVeXrui0= go.opentelemetry.io/proto/otlp v1.3.1/go.mod h1:0X1WI4de4ZsLrrJNLAQbFeLCm3T7yBkR0XqQ7niQU+8= +go.uber.org/automaxprocs v1.6.0 h1:O3y2/QNTOdbF+e/dpXNNW7Rx2hZ4sTIPyybbxyNqTUs= +go.uber.org/automaxprocs v1.6.0/go.mod h1:ifeIMSnPZuznNm6jmdzmU3/bfk01Fe2fotchwEFJ8r8= go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto= go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE= go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0= @@ -255,66 +220,49 @@ go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto= -golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc= -golang.org/x/crypto v0.3.0/go.mod h1:hebNnKkNXi2UzZN1eVRvBB7co0a+JxK6XbPiWVs/3J4= -golang.org/x/crypto v0.32.0 
h1:euUpcYgM8WcP71gNpTqQCn6rC2t6ULUPiOzfWaXVVfc= -golang.org/x/crypto v0.32.0/go.mod h1:ZnnJkOaASj8g0AjIduWNlq2NRxL0PlBrbKVyZ6V/Ugc= +golang.org/x/crypto v0.36.0 h1:AnAEvhDddvBdpY+uR+MyHmuZzzNqXSe/GvuDeob5L34= +golang.org/x/crypto v0.36.0/go.mod h1:Y4J0ReaxCR1IMaabaSMugxJES1EpwhBHhv2bDHklZvc= golang.org/x/exp v0.0.0-20240719175910-8a7402abbf56 h1:2dVuKD2vS7b0QIHQbpyTISPd0LeHDbnYEryqj5Q1ug8= golang.org/x/exp v0.0.0-20240719175910-8a7402abbf56/go.mod h1:M4RDyNAINzryxdtnbRXRL/OHtkFuWGRjvuhBJpk2IlY= golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= -golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4= -golang.org/x/mod v0.22.0 h1:D4nJWe9zXqHOmWqj4VMOJhvzj7bEZg4wEYa759z1pH4= -golang.org/x/mod v0.22.0/go.mod h1:6SkKJ3Xj0I0BrPOZoBy3bdMptDDU9oJrpohJ3eWZ1fY= +golang.org/x/mod v0.24.0 h1:ZfthKaKaT4NrhGVZHO1/WDTwGES4De8KtWO0SIbNJMU= +golang.org/x/mod v0.24.0/go.mod h1:IXM97Txy2VM4PJ3gI61r1YEk/gAj6zAHN3AdZt6S9Ww= golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= -golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= -golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= -golang.org/x/net v0.2.0/go.mod h1:KqCZLdyyvdV855qA2rE3GC2aiw5xGR5TEjj8smXukLY= -golang.org/x/net v0.34.0 h1:Mb7Mrk043xzHgnRM88suvJFwzVrRfHEHJEl5/71CKw0= -golang.org/x/net v0.34.0/go.mod h1:di0qlW3YNM5oh6GqDGQr92MyTozJPmybPK4Ev/Gm31k= -golang.org/x/oauth2 v0.24.0 h1:KTBBxWqUa0ykRPLtV69rRto9TLXcqYkeswu48x/gvNE= -golang.org/x/oauth2 v0.24.0/go.mod h1:XYTD2NtWslqkgxebSiOHnXEap4TF09sJSc7H1sXbhtI= +golang.org/x/net v0.37.0 h1:1zLorHbz+LYj7MQlSf1+2tPIIgibq2eL5xkrGk6f+2c= +golang.org/x/net v0.37.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8= +golang.org/x/oauth2 v0.25.0 h1:CY4y7XT9v0cRI9oupztF8AgiIu99L/ksR/Xp/6jrZ70= +golang.org/x/oauth2 v0.25.0/go.mod h1:XYTD2NtWslqkgxebSiOHnXEap4TF09sJSc7H1sXbhtI= golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.10.0 h1:3NQrjDixjgGwUOCaF8w2+VYHv0Ve/vGYSbdkTa98gmQ= -golang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk= +golang.org/x/sync v0.12.0 h1:MHc5BpPuC30uJk597Ri8TV3CNZcTLu6B6z4lJy+g6Jw= +golang.org/x/sync v0.12.0/go.mod h1:1dzgHSNfp02xaA81J2MS99Qcpr2w7fw1gpm99rleRqA= golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys 
v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.0.0-20220811171246-fbc7d0a398ab/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.2.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= -golang.org/x/sys v0.29.0 h1:TPYlXGxvx1MGTn2GiZDhnjPA9wZzZeGKHHmKhHYvgaU= -golang.org/x/sys v0.29.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= -golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= -golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8= -golang.org/x/term v0.2.0/go.mod h1:TVmDHMZPmdnySmBfhjOoOdhjzdE1h4u1VwSiw2l1Nuc= -golang.org/x/term v0.28.0 h1:/Ts8HFuMR2E6IP/jlo7QVLZHggjKQbhu/7H0LJFr3Gg= -golang.org/x/term v0.28.0/go.mod h1:Sw/lC2IAUZ92udQNf3WodGtn4k/XoLyZoh8v/8uiwek= +golang.org/x/sys v0.32.0 h1:s77OFDvIQeibCmezSnk/q6iAfkdiQaJi4VzroCFrN20= +golang.org/x/sys v0.32.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k= +golang.org/x/term v0.30.0 h1:PQ39fJZ+mfadBm0y5WlL4vlM7Sx1Hgf13sMIY2+QS9Y= +golang.org/x/term v0.30.0/go.mod h1:NYYFdzHoI5wRh/h5tDMdMqCqPJZEuNqVR5xJLd/n67g= golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= -golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ= -golang.org/x/text v0.4.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8= -golang.org/x/text v0.21.0 h1:zyQAAkrwaneQ066sspRyJaG9VNi/YJ1NfzcGB3hZ/qo= -golang.org/x/text v0.21.0/go.mod h1:4IBbMaMmOPCJ8SecivzSH54+73PCFmPWxNTLm+vZkEQ= +golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= +golang.org/x/text v0.23.0 h1:D71I7dUrlY+VX0gQShAThNGHFxZ13dGLBHQLVl1mJlY= +golang.org/x/text v0.23.0/go.mod h1:/BLNzu4aZCJ1+kcD0DNRotWKage4q2rGVAg4o22unh4= golang.org/x/time v0.7.0 h1:ntUhktv3OPE6TgYxXWv9vKvUSJyIFJlyohwbkEwPrKQ= golang.org/x/time v0.7.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM= golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= -golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= -golang.org/x/tools v0.28.0 h1:WuB6qZ4RPCQo5aP3WdKZS7i595EdWqWR8vqJTlwTVK8= -golang.org/x/tools v0.28.0/go.mod h1:dcIOrVd3mfQKTgrDVQHqCPMWy6lnhfhtX3hLXYVLfRw= +golang.org/x/tools v0.31.0 h1:0EedkvKDbh+qistFTd0Bcwe/YLh4vHwWEkiI0toFIBU= +golang.org/x/tools v0.31.0/go.mod h1:naFTU+Cev749tSJRXJlna0T3WxKvb1kWEx15xA4SdmQ= golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors 
v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= @@ -323,14 +271,14 @@ golang.org/x/xerrors v0.0.0-20231012003039-104605ab7028 h1:+cNy6SZtPcJQH3LJVLOSm golang.org/x/xerrors v0.0.0-20231012003039-104605ab7028/go.mod h1:NDW/Ps6MPRej6fsCIbMTohpP40sJ/P/vI1MoTEGwX90= gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY= -google.golang.org/genproto/googleapis/api v0.0.0-20241202173237-19429a94021a h1:OAiGFfOiA0v9MRYsSidp3ubZaBnteRUyn3xB2ZQ5G/E= -google.golang.org/genproto/googleapis/api v0.0.0-20241202173237-19429a94021a/go.mod h1:jehYqy3+AhJU9ve55aNOaSml7wUXjF9x6z2LcCfpAhY= -google.golang.org/genproto/googleapis/rpc v0.0.0-20241202173237-19429a94021a h1:hgh8P4EuoxpsuKMXX/To36nOFD7vixReXgn8lPGnt+o= -google.golang.org/genproto/googleapis/rpc v0.0.0-20241202173237-19429a94021a/go.mod h1:5uTbfoYQed2U9p3KIj2/Zzm02PYhndfdmML0qC3q3FU= -google.golang.org/grpc v1.70.0 h1:pWFv03aZoHzlRKHWicjsZytKAiYCtNS0dHbXnIdq7jQ= -google.golang.org/grpc v1.70.0/go.mod h1:ofIJqVKDXx/JiXrwr2IG4/zwdH9txy3IlF40RmcJSQw= -google.golang.org/protobuf v1.36.4 h1:6A3ZDJHn/eNqc1i+IdefRzy/9PokBTPvcqMySR7NNIM= -google.golang.org/protobuf v1.36.4/go.mod h1:9fA7Ob0pmnwhb644+1+CVWFRbNajQ6iRojtC/QF5bRE= +google.golang.org/genproto/googleapis/api v0.0.0-20250106144421-5f5ef82da422 h1:GVIKPyP/kLIyVOgOnTwFOrvQaQUzOzGMCxgFUOEmm24= +google.golang.org/genproto/googleapis/api v0.0.0-20250106144421-5f5ef82da422/go.mod h1:b6h1vNKhxaSoEI+5jc3PJUCustfli/mRab7295pY7rw= +google.golang.org/genproto/googleapis/rpc v0.0.0-20250115164207-1a7da9e5054f h1:OxYkA3wjPsZyBylwymxSHa7ViiW1Sml4ToBrncvFehI= +google.golang.org/genproto/googleapis/rpc v0.0.0-20250115164207-1a7da9e5054f/go.mod h1:+2Yz8+CLJbIfL9z73EW45avw8Lmge3xVElCP9zEKi50= +google.golang.org/grpc v1.71.1 h1:ffsFWr7ygTUscGPI0KKK6TLrGz0476KUvvsbqWK0rPI= +google.golang.org/grpc v1.71.1/go.mod h1:H0GRtasmQOh9LkFoCPDu3ZrwUtD1YGE+b2vYBYd/8Ec= +google.golang.org/protobuf v1.36.6 h1:z1NpPI8ku2WgiWnf+t9wTPsn6eP1L7ksHUlkfLvd9xY= +google.golang.org/protobuf v1.36.6/go.mod h1:jduwjTPXsFjZGTmRluh+L6NjiWu7pchiJ2/5YcXBHnY= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= @@ -340,27 +288,25 @@ gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 h1:uRGJdciOHaEIrze2W8Q3AKkepLTh2hOroT7a+7czfdQ= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw= -gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= -gopkg.in/yaml.v2 v2.3.0/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY= gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ= gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= -k8s.io/api v0.32.1 h1:f562zw9cy+GvXzXf0CKlVQ7yHJVYzLfL6JAS4kOAaOc= -k8s.io/api v0.32.1/go.mod 
h1:/Yi/BqkuueW1BgpoePYBRdDYfjPF5sgTr5+YqDZra5k= -k8s.io/apiextensions-apiserver v0.32.1 h1:hjkALhRUeCariC8DiVmb5jj0VjIc1N0DREP32+6UXZw= -k8s.io/apiextensions-apiserver v0.32.1/go.mod h1:sxWIGuGiYov7Io1fAS2X06NjMIk5CbRHc2StSmbaQto= -k8s.io/apimachinery v0.32.1 h1:683ENpaCBjma4CYqsmZyhEzrGz6cjn1MY/X2jB2hkZs= -k8s.io/apimachinery v0.32.1/go.mod h1:GpHVgxoKlTxClKcteaeuF1Ul/lDVb74KpZcxcmLDElE= -k8s.io/apiserver v0.32.1 h1:oo0OozRos66WFq87Zc5tclUX2r0mymoVHRq8JmR7Aak= -k8s.io/apiserver v0.32.1/go.mod h1:UcB9tWjBY7aryeI5zAgzVJB/6k7E97bkr1RgqDz0jPw= -k8s.io/client-go v0.32.1 h1:otM0AxdhdBIaQh7l1Q0jQpmo7WOFIk5FFa4bg6YMdUU= -k8s.io/client-go v0.32.1/go.mod h1:aTTKZY7MdxUaJ/KiUs8D+GssR9zJZi77ZqtzcGXIiDg= -k8s.io/code-generator v0.32.1 h1:4lw1kFNDuFYXquTkB7Sl5EwPMUP2yyW9hh6BnFfRZFY= -k8s.io/code-generator v0.32.1/go.mod h1:zaILfm00CVyP/6/pJMJ3zxRepXkxyDfUV5SNG4CjZI4= -k8s.io/component-base v0.32.1 h1:/5IfJ0dHIKBWysGV0yKTFfacZ5yNV1sulPh3ilJjRZk= -k8s.io/component-base v0.32.1/go.mod h1:j1iMMHi/sqAHeG5z+O9BFNCF698a1u0186zkjMZQ28w= +k8s.io/api v0.32.4 h1:kw8Y/G8E7EpNy7gjB8gJZl3KJkNz8HM2YHrZPtAZsF4= +k8s.io/api v0.32.4/go.mod h1:5MYFvLvweRhyKylM3Es/6uh/5hGp0dg82vP34KifX4g= +k8s.io/apiextensions-apiserver v0.32.4 h1:IA+CoR63UDOijR/vEpow6wQnX4V6iVpzazJBskHrpHE= +k8s.io/apiextensions-apiserver v0.32.4/go.mod h1:Y06XO/b92H8ymOdG1HlA1submf7gIhbEDc3RjriqZOs= +k8s.io/apimachinery v0.32.4 h1:8EEksaxA7nd7xWJkkwLDN4SvWS5ot9g6Z/VZb3ju25I= +k8s.io/apimachinery v0.32.4/go.mod h1:GpHVgxoKlTxClKcteaeuF1Ul/lDVb74KpZcxcmLDElE= +k8s.io/apiserver v0.32.4 h1:Yf7sd/y+GOQKH1Qf6wUeayZrYXe2SKZ17Bcq7VQM5HQ= +k8s.io/apiserver v0.32.4/go.mod h1:JFUMNtE2M5yqLZpIsgCb06SkVSW1YcxW1oyLSTfjXR8= +k8s.io/client-go v0.32.4 h1:zaGJS7xoYOYumoWIFXlcVrsiYioRPrXGO7dBfVC5R6M= +k8s.io/client-go v0.32.4/go.mod h1:k0jftcyYnEtwlFW92xC7MTtFv5BNcZBr+zn9jPlT9Ic= +k8s.io/code-generator v0.32.4 h1:d4dm/43RD6xhPBX22JgJw9JUpwTKzVR6tAxJD7pz83o= +k8s.io/code-generator v0.32.4/go.mod h1:R0bKdIg1smtvsKvj9q7SxTeKq5X9ko6PuICCGt4yqxg= +k8s.io/component-base v0.32.4 h1:HuF+2JVLbFS5GODLIfPCb1Td6b+G2HszJoArcWOSr5I= +k8s.io/component-base v0.32.4/go.mod h1:10KloJEYw1keU/Xmjfy9TKJqUq7J2mYdiD1VDXoco4o= k8s.io/gengo/v2 v2.0.0-20240911193312-2b36238f13e9 h1:si3PfKm8dDYxgfbeA6orqrtLkvvIeH8UqffFJDl0bz4= k8s.io/gengo/v2 v2.0.0-20240911193312-2b36238f13e9/go.mod h1:EJykeLsmFC60UQbYJezXkEsG2FLrt0GPNkU5iK5GWxU= k8s.io/klog/v2 v2.130.1 h1:n9Xl7H1Xvksem4KFG4PYbdQCQxqc/tTUyrgXaOhHSzk= @@ -371,13 +317,17 @@ k8s.io/utils v0.0.0-20241210054802-24370beab758 h1:sdbE21q2nlQtFh65saZY+rRM6x6aJ k8s.io/utils v0.0.0-20241210054802-24370beab758/go.mod h1:OLgZIPagt7ERELqWJFomSt595RzquPNLL48iOWgYOg0= sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.0 h1:CPT0ExVicCzcpeN4baWEV2ko2Z/AsiZgEdwgcfwLgMo= sigs.k8s.io/apiserver-network-proxy/konnectivity-client v0.31.0/go.mod h1:Ve9uj1L+deCXFrPOk1LpFXqTg7LCFzFso6PA48q/XZw= -sigs.k8s.io/controller-runtime v0.20.1 h1:JbGMAG/X94NeM3xvjenVUaBjy6Ui4Ogd/J5ZtjZnHaE= -sigs.k8s.io/controller-runtime v0.20.1/go.mod h1:BrP3w158MwvB3ZbNpaAcIKkHQ7YGpYnzpoSTZ8E14WU= -sigs.k8s.io/controller-tools v0.14.0 h1:rnNoCC5wSXlrNoBKKzL70LNJKIQKEzT6lloG6/LF73A= -sigs.k8s.io/controller-tools v0.14.0/go.mod h1:TV7uOtNNnnR72SpzhStvPkoS/U5ir0nMudrkrC4M9Sc= +sigs.k8s.io/controller-runtime v0.20.4 h1:X3c+Odnxz+iPTRobG4tp092+CvBU9UK0t/bRf+n0DGU= +sigs.k8s.io/controller-runtime v0.20.4/go.mod h1:xg2XB0K5ShQzAgsoujxuKN4LNXR2LfwwHsPj7Iaw+XY= +sigs.k8s.io/controller-tools v0.16.3 h1:z48C5/d4jCVQQvtiSBL5MYyZ3EO2eFIOXrIKMgHVhFY= +sigs.k8s.io/controller-tools v0.16.3/go.mod 
h1:AEj6k+w1kYpLZv2einOH3mj52ips4W/6FUjnB5tkJGs= +sigs.k8s.io/gateway-api v1.2.1 h1:fZZ/+RyRb+Y5tGkwxFKuYuSRQHu9dZtbjenblleOLHM= +sigs.k8s.io/gateway-api v1.2.1/go.mod h1:EpNfEXNjiYfUJypf0eZ0P5iXA9ekSGWaS1WgPaM42X0= sigs.k8s.io/json v0.0.0-20241010143419-9aa6b5e7a4b3 h1:/Rv+M11QRah1itp8VhT6HoVx1Ray9eB4DBr+K+/sCJ8= sigs.k8s.io/json v0.0.0-20241010143419-9aa6b5e7a4b3/go.mod h1:18nIHnGi6636UCz6m8i4DhaJ65T6EruyzmoQqI2BVDo= -sigs.k8s.io/structured-merge-diff/v4 v4.5.0 h1:nbCitCK2hfnhyiKo6uf2HxUPTCodY6Qaf85SbDIaMBk= -sigs.k8s.io/structured-merge-diff/v4 v4.5.0/go.mod h1:N8f93tFZh9U6vpxwRArLiikrE5/2tiu1w1AGfACIGE4= +sigs.k8s.io/randfill v0.0.0-20250304075658-069ef1bbf016 h1:kXv6kKdoEtedwuqMmkqhbkgvYKeycVbC8+iPCP9j5kQ= +sigs.k8s.io/randfill v0.0.0-20250304075658-069ef1bbf016/go.mod h1:XeLlZ/jmk4i1HRopwe7/aU3H5n1zNUcX6TM94b3QxOY= +sigs.k8s.io/structured-merge-diff/v4 v4.6.0 h1:IUA9nvMmnKWcj5jl84xn+T5MnlZKThmUW1TdblaLVAc= +sigs.k8s.io/structured-merge-diff/v4 v4.6.0/go.mod h1:dDy58f92j70zLsuZVuUX5Wp9vtxXpaZnkPGWeqDfCps= sigs.k8s.io/yaml v1.4.0 h1:Mk1wCc2gy/F0THH0TAp1QYyJNzRm2KCLy3o5ASXVI5E= sigs.k8s.io/yaml v1.4.0/go.mod h1:Ejl7/uTz7PSA4eKMyQCUTnhZYNmLIl+5c2lQPGR2BPY= diff --git a/hack/boilerplate.go.txt b/hack/boilerplate.go.txt index 4ad43857..8057371b 100644 --- a/hack/boilerplate.go.txt +++ b/hack/boilerplate.go.txt @@ -1,5 +1,5 @@ /* -Copyright 2024 The Kubernetes Authors. +Copyright 2025 The Kubernetes Authors. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. diff --git a/hack/push-chart.sh b/hack/push-chart.sh new file mode 100755 index 00000000..36ed92cd --- /dev/null +++ b/hack/push-chart.sh @@ -0,0 +1,47 @@ +#!/usr/bin/env bash + +# Copyright 2025 The Kubernetes Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
+set -o errexit
+set -o nounset
+set -o pipefail
+
+DEST_CHART_DIR=${DEST_CHART_DIR:-bin/}
+
+EXTRA_TAG=${EXTRA_TAG:-$(git branch --show-current)}
+CHART_VERSION=${CHART_VERSION:-"v0"}
+
+STAGING_IMAGE_REGISTRY=${STAGING_IMAGE_REGISTRY:-us-central1-docker.pkg.dev/k8s-staging-images}
+IMAGE_REGISTRY=${IMAGE_REGISTRY:-${STAGING_IMAGE_REGISTRY}/gateway-api-inference-extension}
+HELM_CHART_REPO=${HELM_CHART_REPO:-${STAGING_IMAGE_REGISTRY}/gateway-api-inference-extension/charts}
+CHART=${CHART:-inferencepool}
+
+HELM=${HELM:-./bin/helm}
+# Default mirrors HELM above; a default is required because this script runs with nounset.
+YQ=${YQ:-./bin/yq}
+
+readonly semver_regex='^v([0-9]+)(\.[0-9]+){1,2}(-rc.[0-9]+)?$'
+
+chart_version=${CHART_VERSION}
+if [[ ${EXTRA_TAG} =~ ${semver_regex} ]]
+then
+  ${YQ} -i '.inferenceExtension.image.tag=strenv(EXTRA_TAG)' config/charts/${CHART}/values.yaml
+  chart_version=${EXTRA_TAG}
+fi
+
+# Create the package
+${HELM} package --version "${chart_version}" --app-version "${chart_version}" "config/charts/${CHART}" -d "${DEST_CHART_DIR}"
+
+# Push the package
+echo "pushing chart to ${HELM_CHART_REPO}"
+${HELM} push "${DEST_CHART_DIR}/${CHART}-${chart_version}.tgz" "oci://${HELM_CHART_REPO}"
diff --git a/hack/release-quickstart.sh b/hack/release-quickstart.sh
index b156b160..c2c0f74d 100755
--- a/hack/release-quickstart.sh
+++ b/hack/release-quickstart.sh
@@ -15,8 +15,8 @@ else
   RELEASE_TAG="v${MAJOR}.${MINOR}.0-rc.${RC}"
 fi
 
-# vLLM image version (default to 0.7.1 if not defined)
-VLLM="${VLLM:-0.7.1}"
+# vLLM image version (default to 0.7.2 if not defined)
+VLLM="${VLLM:-0.7.2}"
 
 echo "Using release tag: ${RELEASE_TAG}"
 echo "Using vLLM image version: ${VLLM}"
@@ -36,34 +36,44 @@ sed -i.bak -E "s|(releases/download/)v[0-9]+\.[0-9]+\.0-rc\.?[0-9]+|\1${RELEASE_
 sed -i.bak "s|kubectl apply -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd|kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/${RELEASE_TAG}/manifests.yaml|g" "$README"
 
 # -----------------------------------------------------------------------------
-# Update pkg/manifests/ext_proc.yaml
+# Update image references
 # -----------------------------------------------------------------------------
-EXT_PROC="pkg/manifests/ext_proc.yaml"
-echo "Updating ${EXT_PROC} ..."
+EPP="config/manifests/inferencepool-resources.yaml"
+# TODO: Put all helm values files into an array to loop over
+EPP_HELM="config/charts/inferencepool/values.yaml"
+BBR_HELM="config/charts/body-based-routing/values.yaml"
+echo "Updating ${EPP} & ${EPP_HELM} ..."
 
-# Update any image reference for the EPP container.
-# For images from registry.k8s.io:
-sed -i.bak -E "s|(registry\.k8s\.io/gateway-api-inference-extension/epp:)[^\"[:space:]]+|\1${RELEASE_TAG}|g" "$EXT_PROC"
-# In case there is still any reference from us-central1-docker.pkg.dev:
-sed -i.bak -E "s|(us-central1-docker\.pkg\.dev/k8s-staging-images/gateway-api-inference-extension/epp:)[^\"[:space:]]+|\1${RELEASE_TAG}|g" "$EXT_PROC"
+# Update the container tag.
+sed -i.bak -E "s|(us-central1-docker\.pkg\.dev/k8s-staging-images/gateway-api-inference-extension/epp:)[^\"[:space:]]+|\1${RELEASE_TAG}|g" "$EPP"
+sed -i.bak -E "s|(tag: )[^\"[:space:]]+|\1${RELEASE_TAG}|g" "$EPP_HELM"
+sed -i.bak -E "s|(tag: )[^\"[:space:]]+|\1${RELEASE_TAG}|g" "$BBR_HELM"
+
+# Update the container image pull policy.
+sed -i.bak '/us-central1-docker.pkg.dev\/k8s-staging-images\/gateway-api-inference-extension\/epp/ { n; s/Always/IfNotPresent/ }' "$EPP"
+
+# Update the container registry.
+sed -i.bak -E "s|us-central1-docker\.pkg\.dev/k8s-staging-images|registry.k8s.io|g" "$EPP"
+sed -i.bak -E "s|us-central1-docker\.pkg\.dev/k8s-staging-images|registry.k8s.io|g" "$EPP_HELM"
+sed -i.bak -E "s|us-central1-docker\.pkg\.dev/k8s-staging-images|registry.k8s.io|g" "$BBR_HELM"
 
 # -----------------------------------------------------------------------------
-# Update pkg/manifests/vllm/deployment.yaml
+# Update config/manifests/vllm/gpu-deployment.yaml
 # -----------------------------------------------------------------------------
-VLLM_DEPLOY="pkg/manifests/vllm/deployment.yaml"
+VLLM_DEPLOY="config/manifests/vllm/gpu-deployment.yaml"
 echo "Updating ${VLLM_DEPLOY} ..."
 
 # Update the vLLM image version
-sed -i.bak -E "s|(vllm/vllm-openai:)[^\"[:space:]]+|\1${VLLM}|g" "$VLLM_DEPLOY"
+sed -i.bak -E "s|(vllm/vllm-openai:)[^\"[:space:]]+|\1v${VLLM}|g" "$VLLM_DEPLOY"
 
 # Also change the imagePullPolicy from Always to IfNotPresent on lines containing the vLLM image.
-sed -i.bak "/vllm\/vllm-openai/ s/Always/IfNotPresent/g" "$VLLM_DEPLOY"
+sed -i.bak '/vllm\/vllm-openai/ { n; s/Always/IfNotPresent/ }' "$VLLM_DEPLOY"
 
 # -----------------------------------------------------------------------------
 # Stage the changes
 # -----------------------------------------------------------------------------
-echo "Staging $README $EXT_PROC $VLLM_DEPLOY files..."
-git add $README $EXT_PROC $VLLM_DEPLOY
+echo "Staging $README $EPP $EPP_HELM $BBR_HELM $VLLM_DEPLOY files..."
+git add $README $EPP $EPP_HELM $BBR_HELM $VLLM_DEPLOY
 
 # -----------------------------------------------------------------------------
 # Cleanup backup files and finish
diff --git a/hack/test-e2e.sh b/hack/test-e2e.sh
new file mode 100755
index 00000000..0d6bdfc0
--- /dev/null
+++ b/hack/test-e2e.sh
@@ -0,0 +1,137 @@
+#!/bin/bash
+#
+# This script verifies end-to-end connectivity for an example inference extension test environment based on
+# resources from the quickstart guide or e2e test framework. It can optionally launch a "curl" client pod to
+# run these tests within the cluster.
+#
+# USAGE: ./hack/test-e2e.sh
+#
+# OPTIONAL ENVIRONMENT VARIABLES:
+# - TIME: The duration (in seconds) for which the test will run. Defaults to 1 second.
+# - CURL_POD: If set to "true", the script will use a Kubernetes pod named "curl" for making requests.
+# - IP: Override the detected IP address. If not provided, the script attempts to use a Gateway based on
+#       the quickstart guide or an Envoy service IP based on the e2e test framework.
+# - PORT: Override the detected port. If not provided, the script attempts to use a Gateway based on the
+#       quickstart guide or an Envoy service port based on the e2e test framework.
+#
+# WHAT THE SCRIPT DOES:
+# 1. Determines if there is a Gateway named "inference-gateway" in the "default" namespace. If found, it extracts the IP
+#    address and port from the Gateway's "llm-gw" listener. Otherwise, it falls back to the Envoy service in the "default" namespace.
+# 2. Optionally checks for (or creates) a "curl" pod, ensuring it is ready to execute requests.
+# 3. Loops for $TIME seconds, sending requests every 5 seconds to the /v1/completions endpoint to confirm successful connectivity.
+
+set -euo pipefail
+
+# Determine the directory of this script and build an absolute path to client.yaml.
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
+CLIENT_YAML="$SCRIPT_DIR/../test/testdata/client.yaml"
+
+# TIME is the amount of time, in seconds, to run the test.
+TIME=${TIME:-1} +# Optionally use a client curl pod for executing the curl command. +CURL_POD=${CURL_POD:-false} + +check_resource_exists() { + local type=$1 + local name=$2 + local namespace=$3 + + if kubectl get "$type" "$name" -n "$namespace" &>/dev/null; then + return 0 + else + return 1 + fi +} + +check_pod_ready() { + local pod_name=$1 + local namespace=$2 + # Check the Ready condition using jsonpath. Default to False if not found. + local ready_status + ready_status=$(kubectl get pod "$pod_name" -n "$namespace" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "False") + if [[ "$ready_status" == "True" ]]; then + return 0 + else + return 1 + fi +} + +# Try to get the Gateway's IP and the port from the listener named "llm-gw" if it exists. +if check_resource_exists "gateway" "inference-gateway" "default"; then + GATEWAY_IP=$(kubectl get gateway inference-gateway -n default -o jsonpath='{.status.addresses[0].value}') + # Use JSONPath to select the port from the listener with name "llm-gw" + GATEWAY_PORT=$(kubectl get gateway inference-gateway -n default -o jsonpath='{.spec.listeners[?(@.name=="llm-gw")].port}') +else + GATEWAY_IP="" + GATEWAY_PORT="" +fi + +if [[ -n "$GATEWAY_IP" && -n "$GATEWAY_PORT" ]]; then + echo "Using Gateway inference-gateway IP and port from listener 'llm-gw'." + IP=${IP:-$GATEWAY_IP} + PORT=${PORT:-$GATEWAY_PORT} +else + echo "Gateway inference-gateway not found or missing IP/port. Falling back to Envoy service." + # Ensure the Envoy service exists. + if ! check_resource_exists "svc" "envoy" "default"; then + echo "Error: Envoy service not found in namespace 'default'." + exit 1 + fi + IP=${IP:-$(kubectl get svc envoy -n default -o jsonpath='{.spec.clusterIP}')} + PORT=${PORT:-$(kubectl get svc envoy -n default -o jsonpath='{.spec.ports[0].port}')} +fi + +# Optionally verify that the curl pod exists and is ready. +if [[ "$CURL_POD" == "true" ]]; then + if ! check_resource_exists "pod" "curl" "default"; then + echo "Pod 'curl' not found in namespace 'default'. Applying client.yaml from $CLIENT_YAML..." + kubectl apply -f "$CLIENT_YAML" + fi + echo "Waiting for pod 'curl' to be ready..." + # Retry every 5 seconds for up to 30 seconds (6 attempts) + for i in {1..6}; do + if check_pod_ready "curl" "default"; then + echo "Pod 'curl' is now ready." + break + fi + echo "Retry attempt $i: Pod 'curl' not ready; waiting 5 seconds..." + sleep 5 + done + + if ! check_pod_ready "curl" "default"; then + echo "Error: Pod 'curl' is still not ready in namespace 'default' after 30 seconds." + exit 1 + fi +fi + +# Validate that we have a non-empty IP and PORT. +if [[ -z "$IP" ]]; then + echo "Error: Unable to determine a valid IP from either Gateway or Envoy service." + exit 1 +fi + +if [[ -z "$PORT" ]]; then + echo "Error: Unable to determine a valid port from either Gateway or Envoy service." + exit 1 +fi + +echo "Using IP: $IP" +echo "Using PORT: $PORT" + +# Run the test for the specified duration. 
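+# SECONDS is a Bash builtin that counts seconds since the shell started; the
+# loops below send one request every 5 seconds until TIME seconds have elapsed.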
+end=$((SECONDS + TIME)) +if [[ "$CURL_POD" == "true" ]]; then + while [ $SECONDS -lt $end ]; do + kubectl exec po/curl -- curl -i "$IP:$PORT/v1/completions" \ + -H 'Content-Type: application/json' \ + -d '{"model": "food-review","prompt": "Write as if you were a critic: San Francisco","max_tokens": 100,"temperature": 0}' + sleep 5 + done +else + while [ $SECONDS -lt $end ]; do + curl -i "$IP:$PORT/v1/completions" \ + -H 'Content-Type: application/json' \ + -d '{"model": "food-review","prompt": "Write as if you were a critic: San Francisco","max_tokens": 100,"temperature": 0}' + sleep 5 + done +fi diff --git a/hack/update-codegen.sh b/hack/update-codegen.sh index cfe75f81..ab5818fa 100755 --- a/hack/update-codegen.sh +++ b/hack/update-codegen.sh @@ -23,13 +23,17 @@ echo "$SCRIPT_ROOT script" CODEGEN_PKG=${2:-bin} echo $CODEGEN_PKG source "${CODEGEN_PKG}/kube_codegen.sh" -THIS_PKG="inference.networking.x-k8s.io/gateway-api-inference-extension" +THIS_PKG="sigs.k8s.io/gateway-api-inference-extension" kube::codegen::gen_helpers \ --boilerplate "${SCRIPT_ROOT}/hack/boilerplate.go.txt" \ "${SCRIPT_ROOT}" +kube::codegen::gen_register \ + --boilerplate "${SCRIPT_ROOT}/hack/boilerplate.go.txt" \ + "${SCRIPT_ROOT}" + kube::codegen::gen_client \ --with-watch \ --with-applyconfig \ diff --git a/internal/runnable/grpc.go b/internal/runnable/grpc.go new file mode 100644 index 00000000..a619f788 --- /dev/null +++ b/internal/runnable/grpc.go @@ -0,0 +1,52 @@ +package runnable + +import ( + "context" + "fmt" + "net" + + "google.golang.org/grpc" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/manager" +) + +// GRPCServer converts the given gRPC server into a runnable. +// The server name is just being used for logging. +func GRPCServer(name string, srv *grpc.Server, port int) manager.Runnable { + return manager.RunnableFunc(func(ctx context.Context) error { + // Use "name" key as that is what manager.Server does as well. + log := ctrl.Log.WithValues("name", name) + log.Info("gRPC server starting") + + // Start listening. + lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) + if err != nil { + log.Error(err, "gRPC server failed to listen") + return err + } + + log.Info("gRPC server listening", "port", port) + + // Shutdown on context closed. + // Terminate the server on context closed. + // Make sure the goroutine does not leak. + doneCh := make(chan struct{}) + defer close(doneCh) + go func() { + select { + case <-ctx.Done(): + log.Info("gRPC server shutting down") + srv.GracefulStop() + case <-doneCh: + } + }() + + // Keep serving until terminated. + if err := srv.Serve(lis); err != nil && err != grpc.ErrServerStopped { + log.Error(err, "gRPC server failed") + return err + } + log.Info("gRPC server terminated") + return nil + }) +} diff --git a/internal/runnable/leader_election.go b/internal/runnable/leader_election.go new file mode 100644 index 00000000..00dfc782 --- /dev/null +++ b/internal/runnable/leader_election.go @@ -0,0 +1,31 @@ +package runnable + +import "sigs.k8s.io/controller-runtime/pkg/manager" + +type leaderElection struct { + manager.Runnable + needsLeaderElection bool +} + +// LeaderElection wraps the given runnable to implement manager.LeaderElectionRunnable. +func LeaderElection(runnable manager.Runnable, needsLeaderElection bool) manager.Runnable { + return &leaderElection{ + Runnable: runnable, + needsLeaderElection: needsLeaderElection, + } +} + +// RequireLeaderElection wraps the given runnable, marking it as requiring leader election. 
+func RequireLeaderElection(runnable manager.Runnable) manager.Runnable {
+	return LeaderElection(runnable, true)
+}
+
+// NoLeaderElection wraps the given runnable, marking it as not requiring leader election.
+func NoLeaderElection(runnable manager.Runnable) manager.Runnable {
+	return LeaderElection(runnable, false)
+}
+
+// NeedLeaderElection implements the manager.LeaderElectionRunnable interface.
+func (r *leaderElection) NeedLeaderElection() bool {
+	return r.needsLeaderElection
+}
diff --git a/internal/tls/tls.go b/internal/tls/tls.go
new file mode 100644
index 00000000..fb8092c6
--- /dev/null
+++ b/internal/tls/tls.go
@@ -0,0 +1,73 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package tls
+
+import (
+	"crypto/rand"
+	"crypto/rsa"
+	"crypto/tls"
+	"crypto/x509"
+	"crypto/x509/pkix"
+	"encoding/pem"
+	"fmt"
+	"math/big"
+	"time"
+
+	"github.com/go-logr/logr"
+)
+
+// CreateSelfSignedTLSCertificate creates a self-signed cert the server can use to serve TLS.
+func CreateSelfSignedTLSCertificate(logger logr.Logger) (tls.Certificate, error) {
+	serialNumberLimit := new(big.Int).Lsh(big.NewInt(1), 128)
+	serialNumber, err := rand.Int(rand.Reader, serialNumberLimit)
+	if err != nil {
+		return tls.Certificate{}, fmt.Errorf("error creating serial number: %v", err)
+	}
+	now := time.Now()
+	notBefore := now.UTC()
+	template := x509.Certificate{
+		SerialNumber: serialNumber,
+		Subject: pkix.Name{
+			Organization: []string{"Inference Ext"},
+		},
+		NotBefore:             notBefore,
+		NotAfter:              now.Add(time.Hour * 24 * 365 * 10).UTC(), // 10 years
+		KeyUsage:              x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature,
+		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
+		BasicConstraintsValid: true,
+	}
+
+	priv, err := rsa.GenerateKey(rand.Reader, 4096)
+	if err != nil {
+		return tls.Certificate{}, fmt.Errorf("error generating key: %v", err)
+	}
+
+	derBytes, err := x509.CreateCertificate(rand.Reader, &template, &template, &priv.PublicKey, priv)
+	if err != nil {
+		return tls.Certificate{}, fmt.Errorf("error creating certificate: %v", err)
+	}
+
+	certBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: derBytes})
+
+	privBytes, err := x509.MarshalPKCS8PrivateKey(priv)
+	if err != nil {
+		return tls.Certificate{}, fmt.Errorf("error marshalling private key: %v", err)
+	}
+	keyBytes := pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: privBytes})
+
+	return tls.X509KeyPair(certBytes, keyBytes)
+}
diff --git a/mkdocs.yml b/mkdocs.yml
index c9bc30e0..e5927ed5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -10,7 +10,7 @@ theme:
   icon:
     repo: fontawesome/brands/git-alt
   logo: images/logo/logo-text-large-horizontal-white.png
-  favicon: images/k8s-favicon.png
+  favicon: images/favicon-64.png
   features:
     - search.highlight
     - navigation.tabs
@@ -44,6 +44,9 @@ markdown_extensions:
   - toc:
       permalink: true
   - tables
+  - pymdownx.superfences
+  - pymdownx.tabbed:
+      alternate_style: true
 nav:
 - Overview:
   - Introduction: index.md
     API Overview: 
concepts/api-overview.md Conformance: concepts/conformance.md Roles and Personas: concepts/roles-and-personas.md - - Implementations: implementations.md + - Implementations: + - Gateways: implementations/gateways.md + - Model Servers: implementations/model-servers.md - FAQ: faq.md - Guides: - User Guides: - Getting started: guides/index.md + - Adapter Rollout: guides/adapter-rollout.md + - Metrics: guides/metrics.md + - Replacing an Inference Pool: guides/replacing-inference-pool.md - Implementer's Guide: guides/implementers.md + - Performance: + - Benchmark: performance/benchmark/index.md - Reference: - API Reference: reference/spec.md - API Types: diff --git a/pkg/README.md b/pkg/README.md index 04ebfde2..b53ef777 100644 --- a/pkg/README.md +++ b/pkg/README.md @@ -1,96 +1,3 @@ ## Quickstart -This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get a first, single InferencePool up and running! - -### Requirements - - Envoy Gateway [v1.2.1](https://gateway.envoyproxy.io/docs/install/install-yaml/#install-with-yaml) or higher - - A cluster with: - - Support for Services of type `LoadBalancer`. (This can be validated by ensuring your Envoy Gateway is up and running). For example, with Kind, - you can follow [these steps](https://kind.sigs.k8s.io/docs/user/loadbalancer). - - 3 GPUs to run the sample model server. Adjust the number of replicas in `./manifests/vllm/deployment.yaml` as needed. - -### Steps - -1. **Deploy Sample Model Server** - - Create a Hugging Face secret to download the model [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf). Ensure that the token grants access to this model. - Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway. - ```bash - kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to Llama2 - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/vllm/deployment.yaml - ``` - -1. **Install the Inference Extension CRDs:** - - ```sh - kubectl apply -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd - ``` - -1. **Deploy InferenceModel** - - Deploy the sample InferenceModel which is configured to load balance traffic between the `tweet-summary-0` and `tweet-summary-1` - [LoRA adapters](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server. - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/inferencemodel.yaml - ``` - -1. **Update Envoy Gateway Config to enable Patch Policy** - - Our custom LLM Gateway ext-proc is patched into the existing envoy gateway via `EnvoyPatchPolicy`. To enable this feature, we must extend the Envoy Gateway config map. To do this, simply run: - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/enable_patch_policy.yaml - kubectl rollout restart deployment envoy-gateway -n envoy-gateway-system - ``` - Additionally, if you would like to enable the admin interface, you can uncomment the admin lines and run this again. - -1. 
**Deploy Gateway** - - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/gateway.yaml - ``` - > **_NOTE:_** This file couples together the gateway infra and the HTTPRoute infra for a convenient, quick startup. Creating additional/different InferencePools on the same gateway will require an additional set of: `Backend`, `HTTPRoute`, the resources included in the `./manifests/gateway/ext-proc.yaml` file, and an additional `./manifests/gateway/patch_policy.yaml` file. ***Should you choose to experiment, familiarity with xDS and Envoy are very useful.*** - - Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status: - ```bash - $ kubectl get gateway inference-gateway - NAME CLASS ADDRESS PROGRAMMED AGE - inference-gateway inference-gateway True 22s - ``` - -1. **Deploy the Inference Extension and InferencePool** - - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/ext_proc.yaml - ``` - -1. **Deploy Envoy Gateway Custom Policies** - - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/extension_policy.yaml - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/patch_policy.yaml - ``` - > **_NOTE:_** This is also per InferencePool, and will need to be configured to support the new pool should you wish to experiment further. - -1. **OPTIONALLY**: Apply Traffic Policy - - For high-traffic benchmarking you can apply this manifest to avoid any defaults that can cause timeouts/errors. - - ```bash - kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/pkg/manifests/gateway/traffic_policy.yaml - ``` - -1. **Try it out** - - Wait until the gateway is ready. - - ```bash - IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}') - PORT=8081 - - curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ - "model": "tweet-summary", - "prompt": "Write as if you were a critic: San Francisco", - "max_tokens": 100, - "temperature": 0 - }' - ``` \ No newline at end of file +Please refer to our Getting started guide here: https://gateway-api-inference-extension.sigs.k8s.io/guides/ \ No newline at end of file diff --git a/pkg/bbr/README.md b/pkg/bbr/README.md new file mode 100644 index 00000000..b5b6f770 --- /dev/null +++ b/pkg/bbr/README.md @@ -0,0 +1,14 @@ +# Body-Based Routing +This package provides an extension that can be deployed to write the `model` +HTTP body parameter as a header (X-Gateway-Model-Name) so as to enable routing capabilities on the +model name. + +As per OpenAI spec, it is standard for the model name to be included in the +body of the HTTP request. However, most implementations do not support routing +based on the request body. This extension helps bridge that gap for clients. +This extension works by parsing the request body. If it finds a `model` parameter in the +request body, it will copy the value of that parameter into a request header. + +This extension is intended to be paired with an `ext_proc` capable Gateway. There is not +a standard way to represent this kind of extension in Gateway API yet, so we recommend +referring to implementation-specific documentation for how to deploy this extension. 
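+
+For illustration, here is a minimal, hypothetical sketch of the body-to-header
+mapping described above (standalone Go, not part of this package; the real
+implementation in `pkg/bbr/handlers` also handles streamed bodies and records metrics):
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+func main() {
+	body := []byte(`{"model": "food-review", "prompt": "Write a haiku"}`)
+	var payload map[string]any
+	if err := json.Unmarshal(body, &payload); err != nil {
+		panic(err)
+	}
+	// When the body carries a string "model" field, surface it as a header
+	// that the gateway can match routes on.
+	if model, ok := payload["model"].(string); ok {
+		fmt.Printf("X-Gateway-Model-Name: %s\n", model)
+	}
+}
+```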
diff --git a/pkg/bbr/handlers/request.go b/pkg/bbr/handlers/request.go new file mode 100644 index 00000000..32fffc02 --- /dev/null +++ b/pkg/bbr/handlers/request.go @@ -0,0 +1,162 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "context" + "encoding/json" + "fmt" + + basepb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" + eppb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/gateway-api-inference-extension/pkg/bbr/metrics" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const modelHeader = "X-Gateway-Model-Name" + +// HandleRequestBody handles request bodies. +func (s *Server) HandleRequestBody(ctx context.Context, data map[string]any) ([]*eppb.ProcessingResponse, error) { + logger := log.FromContext(ctx) + var ret []*eppb.ProcessingResponse + + requestBodyBytes, err := json.Marshal(data) + if err != nil { + return nil, err + } + + modelVal, ok := data["model"] + if !ok { + metrics.RecordModelNotInBodyCounter() + logger.V(logutil.DEFAULT).Info("Request body does not contain model parameter") + if s.streaming { + ret = append(ret, &eppb.ProcessingResponse{ + Response: &eppb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &eppb.HeadersResponse{}, + }, + }) + ret = addStreamedBodyResponse(ret, requestBodyBytes) + return ret, nil + } else { + ret = append(ret, &eppb.ProcessingResponse{ + Response: &eppb.ProcessingResponse_RequestBody{ + RequestBody: &eppb.BodyResponse{}, + }, + }) + } + return ret, nil + } + + modelStr, ok := modelVal.(string) + if !ok { + metrics.RecordModelNotParsedCounter() + logger.V(logutil.DEFAULT).Info("Model parameter value is not a string") + return nil, fmt.Errorf("the model parameter value %v is not a string", modelVal) + } + + metrics.RecordSuccessCounter() + + if s.streaming { + ret = append(ret, &eppb.ProcessingResponse{ + Response: &eppb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &eppb.HeadersResponse{ + Response: &eppb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &eppb.HeaderMutation{ + SetHeaders: []*basepb.HeaderValueOption{ + { + Header: &basepb.HeaderValue{ + Key: modelHeader, + RawValue: []byte(modelStr), + }, + }, + }, + }, + }, + }, + }, + }) + ret = addStreamedBodyResponse(ret, requestBodyBytes) + return ret, nil + } + + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_RequestBody{ + RequestBody: &eppb.BodyResponse{ + Response: &eppb.CommonResponse{ + // Necessary so that the new headers are used in the routing decision. 
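+						// Envoy caches the route selected from the original headers, so without
+						// this the mutated header would not influence the routing decision.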
+ ClearRouteCache: true, + HeaderMutation: &eppb.HeaderMutation{ + SetHeaders: []*basepb.HeaderValueOption{ + { + Header: &basepb.HeaderValue{ + Key: modelHeader, + RawValue: []byte(modelStr), + }, + }, + }, + }, + }, + }, + }, + }, + }, nil +} + +func addStreamedBodyResponse(responses []*eppb.ProcessingResponse, requestBodyBytes []byte) []*eppb.ProcessingResponse { + return append(responses, &extProcPb.ProcessingResponse{ + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: requestBodyBytes, + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }) +} + +// HandleRequestHeaders handles request headers. +func (s *Server) HandleRequestHeaders(headers *eppb.HttpHeaders) ([]*eppb.ProcessingResponse, error) { + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &eppb.HeadersResponse{}, + }, + }, + }, nil +} + +// HandleRequestTrailers handles request trailers. +func (s *Server) HandleRequestTrailers(trailers *eppb.HttpTrailers) ([]*eppb.ProcessingResponse, error) { + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_RequestTrailers{ + RequestTrailers: &eppb.TrailersResponse{}, + }, + }, + }, nil +} diff --git a/pkg/bbr/handlers/request_test.go b/pkg/bbr/handlers/request_test.go new file mode 100644 index 00000000..55c42a21 --- /dev/null +++ b/pkg/bbr/handlers/request_test.go @@ -0,0 +1,219 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package handlers + +import ( + "context" + "encoding/json" + "strings" + "testing" + + basepb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "github.com/google/go-cmp/cmp" + "google.golang.org/protobuf/testing/protocmp" + "k8s.io/component-base/metrics/legacyregistry" + metricsutils "k8s.io/component-base/metrics/testutil" + "sigs.k8s.io/gateway-api-inference-extension/pkg/bbr/metrics" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +func TestHandleRequestBody(t *testing.T) { + metrics.Register() + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + + tests := []struct { + name string + body map[string]any + streaming bool + want []*extProcPb.ProcessingResponse + wantErr bool + }{ + { + name: "model not found", + body: map[string]any{ + "prompt": "Tell me a joke", + }, + want: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{}, + }, + }, + }, + }, + { + name: "model not found with streaming", + body: map[string]any{ + "prompt": "Tell me a joke", + }, + streaming: true, + want: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{}, + }, + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: mapToBytes(t, map[string]any{ + "prompt": "Tell me a joke", + }), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "model is not string", + body: map[string]any{ + "model": 1, + "prompt": "Tell me a joke", + }, + wantErr: true, + }, + { + name: "success", + body: map[string]any{ + "model": "foo", + "prompt": "Tell me a joke", + }, + want: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + // Necessary so that the new headers are used in the routing decision. 
+								ClearRouteCache: true,
+								HeaderMutation: &extProcPb.HeaderMutation{
+									SetHeaders: []*basepb.HeaderValueOption{
+										{
+											Header: &basepb.HeaderValue{
+												Key:      "X-Gateway-Model-Name",
+												RawValue: []byte("foo"),
+											},
+										},
+									},
+								},
+							},
+						},
+					},
+				},
+			},
+		},
+		{
+			name: "success-with-streaming",
+			body: map[string]any{
+				"model":  "foo",
+				"prompt": "Tell me a joke",
+			},
+			streaming: true,
+			want: []*extProcPb.ProcessingResponse{
+				{
+					Response: &extProcPb.ProcessingResponse_RequestHeaders{
+						RequestHeaders: &extProcPb.HeadersResponse{
+							Response: &extProcPb.CommonResponse{
+								ClearRouteCache: true,
+								HeaderMutation: &extProcPb.HeaderMutation{
+									SetHeaders: []*basepb.HeaderValueOption{
+										{
+											Header: &basepb.HeaderValue{
+												Key:      "X-Gateway-Model-Name",
+												RawValue: []byte("foo"),
+											},
+										},
+									},
+								},
+							},
+						},
+					},
+				},
+				{
+					Response: &extProcPb.ProcessingResponse_RequestBody{
+						RequestBody: &extProcPb.BodyResponse{
+							Response: &extProcPb.CommonResponse{
+								BodyMutation: &extProcPb.BodyMutation{
+									Mutation: &extProcPb.BodyMutation_StreamedResponse{
+										StreamedResponse: &extProcPb.StreamedBodyResponse{
+											Body: mapToBytes(t, map[string]any{
+												"model":  "foo",
+												"prompt": "Tell me a joke",
+											}),
+											EndOfStream: true,
+										},
+									},
+								},
+							},
+						},
+					},
+				},
+			},
+		},
+	}
+
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			server := &Server{streaming: test.streaming}
+			resp, err := server.HandleRequestBody(ctx, test.body)
+			if err != nil {
+				if !test.wantErr {
+					t.Fatalf("HandleRequestBody returned unexpected error: %v, want %v", err, test.wantErr)
+				}
+				return
+			}
+
+			if diff := cmp.Diff(test.want, resp, protocmp.Transform()); diff != "" {
+				t.Errorf("HandleRequestBody returned unexpected response, diff(-want, +got): %v", diff)
+			}
+		})
+	}
+
+	wantMetrics := `
+	# HELP bbr_model_not_in_body_total [ALPHA] Count of times the model was not present in the request body.
+	# TYPE bbr_model_not_in_body_total counter
+	bbr_model_not_in_body_total{} 1
+	# HELP bbr_model_not_parsed_total [ALPHA] Count of times the model was in the request body but we could not parse it.
+	# TYPE bbr_model_not_parsed_total counter
+	bbr_model_not_parsed_total{} 1
+	# HELP bbr_success_total [ALPHA] Count of successes pulling model name from body and injecting it in the request headers.
+	# TYPE bbr_success_total counter
+	bbr_success_total{} 1
+	`
+
+	// Compare against the bbr metrics this package registers (not the EPP metric names).
+	if err := metricsutils.GatherAndCompare(legacyregistry.DefaultGatherer, strings.NewReader(wantMetrics), "bbr_success_total", "bbr_model_not_in_body_total", "bbr_model_not_parsed_total"); err != nil {
+		t.Error(err)
+	}
+}
+
+func mapToBytes(t *testing.T, m map[string]any) []byte {
+	// Convert map to JSON byte array
+	bytes, err := json.Marshal(m)
+	if err != nil {
+		t.Fatalf("Marshal(): %v", err)
+	}
+	return bytes
+}
diff --git a/pkg/bbr/handlers/response.go b/pkg/bbr/handlers/response.go
new file mode 100644
index 00000000..fbcb75d6
--- /dev/null
+++ b/pkg/bbr/handlers/response.go
@@ -0,0 +1,54 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/ + +package handlers + +import ( + eppb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" +) + +// HandleResponseHeaders handles response headers. +func (s *Server) HandleResponseHeaders(headers *eppb.HttpHeaders) ([]*eppb.ProcessingResponse, error) { + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_ResponseHeaders{ + ResponseHeaders: &eppb.HeadersResponse{}, + }, + }, + }, nil +} + +// HandleResponseBody handles response bodies. +func (s *Server) HandleResponseBody(body *eppb.HttpBody) ([]*eppb.ProcessingResponse, error) { + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_ResponseBody{ + ResponseBody: &eppb.BodyResponse{}, + }, + }, + }, nil +} + +// HandleResponseTrailers handles response trailers. +func (s *Server) HandleResponseTrailers(trailers *eppb.HttpTrailers) ([]*eppb.ProcessingResponse, error) { + return []*eppb.ProcessingResponse{ + { + Response: &eppb.ProcessingResponse_ResponseTrailers{ + ResponseTrailers: &eppb.TrailersResponse{}, + }, + }, + }, nil +} diff --git a/pkg/bbr/handlers/server.go b/pkg/bbr/handlers/server.go new file mode 100644 index 00000000..484b3318 --- /dev/null +++ b/pkg/bbr/handlers/server.go @@ -0,0 +1,140 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "context" + "encoding/json" + "errors" + "io" + + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "github.com/go-logr/logr" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" + "sigs.k8s.io/controller-runtime/pkg/log" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +func NewServer(streaming bool) *Server { + return &Server{streaming: streaming} +} + +// Server implements the Envoy external processing server. +// https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor.proto +type Server struct { + streaming bool +} + +func (s *Server) Process(srv extProcPb.ExternalProcessor_ProcessServer) error { + ctx := srv.Context() + logger := log.FromContext(ctx) + loggerVerbose := logger.V(logutil.VERBOSE) + loggerVerbose.Info("Processing") + + streamedBody := &streamedBody{} + + for { + select { + case <-ctx.Done(): + return ctx.Err() + default: + } + + req, recvErr := srv.Recv() + if recvErr == io.EOF || errors.Is(recvErr, context.Canceled) { + return nil + } + if recvErr != nil { + // This error occurs very frequently, though it doesn't seem to have any impact. + // TODO Figure out if we can remove this noise. 
+ loggerVerbose.Error(recvErr, "Cannot receive stream request") + return status.Errorf(codes.Unknown, "cannot receive stream request: %v", recvErr) + } + + var responses []*extProcPb.ProcessingResponse + var err error + switch v := req.Request.(type) { + case *extProcPb.ProcessingRequest_RequestHeaders: + if s.streaming && !req.GetRequestHeaders().GetEndOfStream() { + // If streaming and the body is not empty, then headers are handled when processing request body. + loggerVerbose.Info("Received headers, passing off header processing until body arrives...") + } else { + responses, err = s.HandleRequestHeaders(req.GetRequestHeaders()) + } + case *extProcPb.ProcessingRequest_RequestBody: + loggerVerbose.Info("Incoming body chunk", "body", string(v.RequestBody.Body), "EoS", v.RequestBody.EndOfStream) + responses, err = s.processRequestBody(ctx, req.GetRequestBody(), streamedBody, logger) + case *extProcPb.ProcessingRequest_RequestTrailers: + responses, err = s.HandleRequestTrailers(req.GetRequestTrailers()) + case *extProcPb.ProcessingRequest_ResponseHeaders: + responses, err = s.HandleResponseHeaders(req.GetResponseHeaders()) + case *extProcPb.ProcessingRequest_ResponseBody: + responses, err = s.HandleResponseBody(req.GetResponseBody()) + default: + logger.V(logutil.DEFAULT).Error(nil, "Unknown Request type", "request", v) + return status.Error(codes.Unknown, "unknown request type") + } + + if err != nil { + logger.V(logutil.DEFAULT).Error(err, "Failed to process request", "request", req) + return status.Errorf(status.Code(err), "failed to handle request: %v", err) + } + + for _, resp := range responses { + loggerVerbose.Info("Response generated", "response", resp) + if err := srv.Send(resp); err != nil { + logger.V(logutil.DEFAULT).Error(err, "Send failed") + return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err) + } + } + } +} + +type streamedBody struct { + body []byte +} + +func (s *Server) processRequestBody(ctx context.Context, body *extProcPb.HttpBody, streamedBody *streamedBody, logger logr.Logger) ([]*extProcPb.ProcessingResponse, error) { + loggerVerbose := logger.V(logutil.VERBOSE) + + var requestBody map[string]interface{} + if s.streaming { + streamedBody.body = append(streamedBody.body, body.Body...) + // In the stream case, we can receive multiple request bodies. + if body.EndOfStream { + loggerVerbose.Info("Flushing stream buffer") + err := json.Unmarshal(streamedBody.body, &requestBody) + if err != nil { + logger.V(logutil.DEFAULT).Error(err, "Error unmarshaling request body") + } + } else { + return nil, nil + } + } else { + if err := json.Unmarshal(body.GetBody(), &requestBody); err != nil { + return nil, err + } + } + + requestBodyResp, err := s.HandleRequestBody(ctx, requestBody) + if err != nil { + return nil, err + } + + return requestBodyResp, nil +} diff --git a/pkg/bbr/handlers/server_test.go b/pkg/bbr/handlers/server_test.go new file mode 100644 index 00000000..f4e8e254 --- /dev/null +++ b/pkg/bbr/handlers/server_test.go @@ -0,0 +1,145 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "context" + "testing" + + basepb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "github.com/google/go-cmp/cmp" + "google.golang.org/protobuf/testing/protocmp" + "sigs.k8s.io/controller-runtime/pkg/log" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +func TestProcessRequestBody(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + + cases := []struct { + desc string + streaming bool + bodys []*extProcPb.HttpBody + want []*extProcPb.ProcessingResponse + }{ + { + desc: "no-streaming", + bodys: []*extProcPb.HttpBody{ + { + Body: mapToBytes(t, map[string]any{ + "model": "foo", + }), + }, + }, + want: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + // Necessary so that the new headers are used in the routing decision. + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*basepb.HeaderValueOption{ + { + Header: &basepb.HeaderValue{ + Key: modelHeader, + RawValue: []byte("foo"), + }, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + desc: "streaming", + streaming: true, + bodys: []*extProcPb.HttpBody{ + { + Body: mapToBytes(t, map[string]any{ + "model": "foo", + }), + }, + { + EndOfStream: true, + }, + }, + want: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*basepb.HeaderValueOption{ + { + Header: &basepb.HeaderValue{ + Key: modelHeader, + RawValue: []byte("foo"), + }, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: mapToBytes(t, map[string]any{ + "model": "foo", + }), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + } + + for _, tc := range cases { + t.Run(tc.desc, func(t *testing.T) { + srv := NewServer(tc.streaming) + streamedBody := &streamedBody{} + for i, body := range tc.bodys { + got, err := srv.processRequestBody(context.Background(), body, streamedBody, log.FromContext(ctx)) + if err != nil { + t.Fatalf("processRequestBody(): %v", err) + } + + if i == len(tc.bodys)-1 { + if diff := cmp.Diff(tc.want, got, protocmp.Transform()); diff != "" { + t.Errorf("processRequestBody returned unexpected response, diff(-want, +got): %v", diff) + } + } + } + }) + } +} diff --git a/pkg/bbr/metrics/metrics.go b/pkg/bbr/metrics/metrics.go new file mode 100644 index 00000000..fc3538fb --- /dev/null +++ b/pkg/bbr/metrics/metrics.go @@ -0,0 +1,103 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package metrics + +import ( + "sync" + + compbasemetrics "k8s.io/component-base/metrics" + "k8s.io/component-base/metrics/legacyregistry" +) + +const component = "bbr" + +var ( + successCounter = compbasemetrics.NewCounterVec( + &compbasemetrics.CounterOpts{ + Subsystem: component, + Name: "success_total", + Help: "Count of successes pulling model name from body and injecting it in the request headers.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{}, + ) + modelNotInBodyCounter = compbasemetrics.NewCounterVec( + &compbasemetrics.CounterOpts{ + Subsystem: component, + Name: "model_not_in_body_total", + Help: "Count of times the model was not present in the request body.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{}, + ) + modelNotParsedCounter = compbasemetrics.NewCounterVec( + &compbasemetrics.CounterOpts{ + Subsystem: component, + Name: "model_not_parsed_total", + Help: "Count of times the model was in the request body but we could not parse it.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{}, + ) + + // TODO: Uncomment and use this metrics once the core server implementation has handling to skip body parsing if header exists. + /* + modelAlreadyPresentInHeaderCounter = compbasemetrics.NewCounterVec( + &compbasemetrics.CounterOpts{ + Subsystem: component, + Name: "model_already_present_in_header_total", + Help: "Count of times the model was already present in request headers.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{}, + ) + */ +) + +var registerMetrics sync.Once + +// Register all metrics. +func Register() { + registerMetrics.Do(func() { + legacyregistry.MustRegister(successCounter) + legacyregistry.MustRegister(modelNotInBodyCounter) + legacyregistry.MustRegister(modelNotParsedCounter) + // legacyregistry.MustRegister(modelAlreadyPresentInHeaderCounter) + }) +} + +// RecordSuccessCounter records the number of successful requests to inject the model name into request headers. +func RecordSuccessCounter() { + successCounter.WithLabelValues().Inc() +} + +// RecordModelNotInBodyCounter records the number of times the model was not found in the request body. +func RecordModelNotInBodyCounter() { + modelNotInBodyCounter.WithLabelValues().Inc() +} + +// RecordModelNotParsedCounter records the number of times the model was found in the body but it could not be parsed. +func RecordModelNotParsedCounter() { + modelNotParsedCounter.WithLabelValues().Inc() +} + +/* +// RecordModelAlreadyInHeaderCounter records the number of times the model was already found in the request headers. +func RecordModelAlreadyInHeaderCounter() { + modelAlreadyPresentInHeaderCounter.WithLabelValues().Inc() +} +*/ diff --git a/pkg/bbr/server/runserver.go b/pkg/bbr/server/runserver.go new file mode 100644 index 00000000..2001b7ff --- /dev/null +++ b/pkg/bbr/server/runserver.go @@ -0,0 +1,73 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package server
+
+import (
+	"context"
+	"crypto/tls"
+
+	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
+	"github.com/go-logr/logr"
+	"google.golang.org/grpc"
+	"google.golang.org/grpc/credentials"
+	"sigs.k8s.io/controller-runtime/pkg/manager"
+	"sigs.k8s.io/gateway-api-inference-extension/internal/runnable"
+	tlsutil "sigs.k8s.io/gateway-api-inference-extension/internal/tls"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/bbr/handlers"
+)
+
+// ExtProcServerRunner provides methods to manage an external process server.
+type ExtProcServerRunner struct {
+	GrpcPort      int
+	SecureServing bool
+	Streaming     bool
+}
+
+func NewDefaultExtProcServerRunner(port int, streaming bool) *ExtProcServerRunner {
+	return &ExtProcServerRunner{
+		GrpcPort:      port,
+		SecureServing: true,
+		Streaming:     streaming,
+	}
+}
+
+// AsRunnable returns a Runnable that can be used to start the ext-proc gRPC server.
+// The runnable implements LeaderElectionRunnable with leader election disabled.
+func (r *ExtProcServerRunner) AsRunnable(logger logr.Logger) manager.Runnable {
+	return runnable.NoLeaderElection(manager.RunnableFunc(func(ctx context.Context) error {
+		var srv *grpc.Server
+		if r.SecureServing {
+			cert, err := tlsutil.CreateSelfSignedTLSCertificate(logger)
+			if err != nil {
+				logger.Error(err, "Failed to create self-signed certificate")
+				return err
+			}
+			creds := credentials.NewTLS(&tls.Config{Certificates: []tls.Certificate{cert}})
+			srv = grpc.NewServer(grpc.Creds(creds))
+		} else {
+			srv = grpc.NewServer()
+		}
+
+		extProcPb.RegisterExternalProcessorServer(
+			srv,
+			handlers.NewServer(r.Streaming),
+		)
+
+		// Forward to the gRPC runnable.
+		return runnable.GRPCServer("ext-proc", srv, r.GrpcPort).Start(ctx)
+	}))
+}
diff --git a/pkg/epp/README.md b/pkg/epp/README.md
new file mode 100644
index 00000000..99d1bf06
--- /dev/null
+++ b/pkg/epp/README.md
@@ -0,0 +1,28 @@
+# The EndPoint Picker (EPP)
+This package provides the reference implementation for the Endpoint Picker (EPP). As demonstrated in the diagram below, it implements the [extension protocol](../../docs/proposals/004-endpoint-picker-protocol), enabling a proxy or gateway to request endpoint hints from an extension, and interacts with the model servers through the defined [model server protocol](../../docs/proposals/003-model-server-protocol).
+
+![Architecture Diagram](../../docs/endpoint-picker.svg)
+
+
+## Core Functions
+
+An EPP instance handles a single `InferencePool` (and so for each `InferencePool`, one must create a dedicated EPP deployment). It performs the following core functions:
+
+- Endpoint Selection
+  - The EPP determines the appropriate Pod endpoint for the load balancer (LB) to route requests to.
+  - It selects from the pool of ready Pods designated by the assigned InferencePool's [Selector](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7e3cd457cdcd01339b65861c8e472cf27e6b6e80/api/v1alpha1/inferencepool_types.go#L53) field.
+  - Endpoint selection is contingent on the request's ModelName matching an `InferenceModel` that references the `InferencePool`.
+  - Requests with unmatched ModelName values trigger an error response to the proxy.
+- Traffic Splitting and ModelName Rewriting
+  - The EPP facilitates controlled rollouts of new adapter versions by implementing traffic splitting between adapters within the same `InferencePool`, as defined by the `InferenceModel`.
+  - EPP rewrites the model name in the request to the [target model name](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/7e3cd457cdcd01339b65861c8e472cf27e6b6e80/api/v1alpha1/inferencemodel_types.go#L161) as defined on the `InferenceModel` object.
+- Observability
+  - The EPP generates metrics to enhance observability.
+  - It reports InferenceModel-level metrics, further broken down by target model.
+  - Detailed information regarding metrics can be found on the [website](https://gateway-api-inference-extension.sigs.k8s.io/guides/metrics/).
+
+
+## Scheduling Algorithm
+The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request. The following flow chart summarizes the current scheduling algorithm.
+
+Scheduling Algorithm
diff --git a/pkg/epp/backend/metrics/fake.go b/pkg/epp/backend/metrics/fake.go
new file mode 100644
index 00000000..58d05026
--- /dev/null
+++ b/pkg/epp/backend/metrics/fake.go
@@ -0,0 +1,86 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package metrics
+
+import (
+	"context"
+	"fmt"
+	"sync"
+
+	corev1 "k8s.io/api/core/v1"
+	"k8s.io/apimachinery/pkg/types"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+// FakePodMetrics is an implementation of PodMetrics that doesn't run the async refresh loop.
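+// Tests can seed the Pod and Metrics fields directly instead of scraping a live model server.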
+type FakePodMetrics struct { + Pod *backend.Pod + Metrics *Metrics +} + +func (fpm *FakePodMetrics) String() string { + return fmt.Sprintf("Pod: %v; Metrics: %v", fpm.GetPod(), fpm.GetMetrics()) +} + +func (fpm *FakePodMetrics) GetPod() *backend.Pod { + return fpm.Pod +} +func (fpm *FakePodMetrics) GetMetrics() *Metrics { + return fpm.Metrics +} +func (fpm *FakePodMetrics) UpdatePod(pod *corev1.Pod) { + fpm.Pod = toInternalPod(pod) +} +func (fpm *FakePodMetrics) StopRefreshLoop() {} // noop + +type FakePodMetricsClient struct { + errMu sync.RWMutex + Err map[types.NamespacedName]error + resMu sync.RWMutex + Res map[types.NamespacedName]*Metrics +} + +func (f *FakePodMetricsClient) FetchMetrics(ctx context.Context, pod *backend.Pod, existing *Metrics, port int32) (*Metrics, error) { + f.errMu.RLock() + err, ok := f.Err[pod.NamespacedName] + f.errMu.RUnlock() + if ok { + return nil, err + } + f.resMu.RLock() + res, ok := f.Res[pod.NamespacedName] + f.resMu.RUnlock() + if !ok { + return nil, fmt.Errorf("no pod found: %v", pod.NamespacedName) + } + log.FromContext(ctx).V(logutil.VERBOSE).Info("Fetching metrics for pod", "existing", existing, "new", res) + return res.Clone(), nil +} + +func (f *FakePodMetricsClient) SetRes(new map[types.NamespacedName]*Metrics) { + f.resMu.Lock() + defer f.resMu.Unlock() + f.Res = new +} + +func (f *FakePodMetricsClient) SetErr(new map[types.NamespacedName]error) { + f.errMu.Lock() + defer f.errMu.Unlock() + f.Err = new +} diff --git a/pkg/epp/backend/metrics/logger.go b/pkg/epp/backend/metrics/logger.go new file mode 100644 index 00000000..7dc1a8b8 --- /dev/null +++ b/pkg/epp/backend/metrics/logger.go @@ -0,0 +1,115 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package metrics + +import ( + "context" + "fmt" + "time" + + "github.com/go-logr/logr" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const ( + // Note currently the EPP treats stale metrics same as fresh. + // TODO: https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/336 + metricsValidityPeriod = 5 * time.Second + debugPrintInterval = 5 * time.Second +) + +type Datastore interface { + PoolGet() (*v1alpha2.InferencePool, error) + // PodMetrics operations + // PodGetAll returns all pods and metrics, including fresh and stale. + PodGetAll() []PodMetrics + PodList(func(PodMetrics) bool) []PodMetrics +} + +// StartMetricsLogger starts goroutines to 1) Print metrics debug logs if the DEBUG log level is +// enabled; 2) flushes Prometheus metrics about the backend servers. 
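+// refreshPrometheusMetricsInterval controls how often the flush described in (2) runs.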
+func StartMetricsLogger(ctx context.Context, datastore Datastore, refreshPrometheusMetricsInterval time.Duration) {
+	logger := log.FromContext(ctx)
+	ticker := time.NewTicker(refreshPrometheusMetricsInterval)
+	go func() {
+		defer ticker.Stop()
+		for {
+			select {
+			case <-ctx.Done():
+				logger.V(logutil.DEFAULT).Info("Shutting down prometheus metrics thread")
+				return
+			case <-ticker.C: // Periodically refresh prometheus metrics for inference pool
+				refreshPrometheusMetrics(logger, datastore)
+			}
+		}
+	}()
+
+	// Periodically print out the pods and metrics for DEBUGGING.
+	if logger := logger.V(logutil.DEBUG); logger.Enabled() {
+		go func() {
+			ticker := time.NewTicker(debugPrintInterval)
+			defer ticker.Stop()
+			for {
+				select {
+				case <-ctx.Done():
+					logger.V(logutil.DEFAULT).Info("Shutting down metrics logger thread")
+					return
+				case <-ticker.C:
+					podsWithFreshMetrics := datastore.PodList(func(pm PodMetrics) bool {
+						return time.Since(pm.GetMetrics().UpdateTime) <= metricsValidityPeriod
+					})
+					podsWithStaleMetrics := datastore.PodList(func(pm PodMetrics) bool {
+						return time.Since(pm.GetMetrics().UpdateTime) > metricsValidityPeriod
+					})
+					s := fmt.Sprintf("Current Pods and metrics gathered. Fresh metrics: %+v, Stale metrics: %+v", podsWithFreshMetrics, podsWithStaleMetrics)
+					logger.V(logutil.VERBOSE).Info(s)
+				}
+			}
+		}()
+	}
+}
+
+func refreshPrometheusMetrics(logger logr.Logger, datastore Datastore) {
+	pool, err := datastore.PoolGet()
+	if err != nil {
+		// No inference pool, or the pool is not initialized yet.
+		logger.V(logutil.DEFAULT).Info("Pool is not initialized, skipping refreshing metrics")
+		return
+	}
+
+	var kvCacheTotal float64
+	var queueTotal int
+
+	podMetrics := datastore.PodGetAll()
+	logger.V(logutil.TRACE).Info("Refreshing Prometheus Metrics", "ReadyPods", len(podMetrics))
+	if len(podMetrics) == 0 {
+		return
+	}
+
+	for _, pod := range podMetrics {
+		kvCacheTotal += pod.GetMetrics().KVCacheUsagePercent
+		queueTotal += pod.GetMetrics().WaitingQueueSize
+	}
+
+	podTotalCount := len(podMetrics)
+	metrics.RecordInferencePoolAvgKVCache(pool.Name, kvCacheTotal/float64(podTotalCount))
+	// Use floating-point division so the average queue size is not truncated.
+	metrics.RecordInferencePoolAvgQueueSize(pool.Name, float64(queueTotal)/float64(podTotalCount))
+	metrics.RecordinferencePoolReadyPods(pool.Name, float64(podTotalCount))
+}
diff --git a/pkg/epp/backend/metrics/metrics.go b/pkg/epp/backend/metrics/metrics.go
new file mode 100644
index 00000000..4cf56179
--- /dev/null
+++ b/pkg/epp/backend/metrics/metrics.go
@@ -0,0 +1,241 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+	http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/ + +package metrics + +import ( + "context" + "fmt" + "net/http" + "strconv" + "strings" + + dto "github.com/prometheus/client_model/go" + "github.com/prometheus/common/expfmt" + "go.uber.org/multierr" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" +) + +const ( + // LoRA metrics based on protocol + LoraInfoRunningAdaptersMetricName = "running_lora_adapters" + LoraInfoWaitingAdaptersMetricName = "waiting_lora_adapters" + LoraInfoMaxAdaptersMetricName = "max_lora" +) + +type PodMetricsClientImpl struct { + MetricMapping *MetricMapping +} + +// FetchMetrics fetches metrics from a given pod, clones the existing metrics object and returns an updated one. +func (p *PodMetricsClientImpl) FetchMetrics(ctx context.Context, pod *backend.Pod, existing *Metrics, port int32) (*Metrics, error) { + // Currently the metrics endpoint is hard-coded, which works with vLLM. + // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config. + url := "http://" + pod.Address + ":" + strconv.Itoa(int(port)) + "/metrics" + + req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) + if err != nil { + return nil, fmt.Errorf("failed to create request: %v", err) + } + resp, err := http.DefaultClient.Do(req) + if err != nil { + return nil, fmt.Errorf("failed to fetch metrics from %s: %w", pod.NamespacedName, err) + } + defer func() { + _ = resp.Body.Close() + }() + + if resp.StatusCode != http.StatusOK { + return nil, fmt.Errorf("unexpected status code from %s: %v", pod.NamespacedName, resp.StatusCode) + } + + parser := expfmt.TextParser{} + metricFamilies, err := parser.TextToMetricFamilies(resp.Body) + if err != nil { + return nil, err + } + return p.promToPodMetrics(metricFamilies, existing) +} + +// promToPodMetrics updates internal pod metrics with scraped Prometheus metrics. 
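+// Per-metric errors are aggregated with multierr and returned together with the partially
+// updated clone, so a single missing metric family does not discard the other updates.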
+func (p *PodMetricsClientImpl) promToPodMetrics(
+	metricFamilies map[string]*dto.MetricFamily,
+	existing *Metrics,
+) (*Metrics, error) {
+	var errs error
+	updated := existing.Clone()
+
+	if p.MetricMapping.TotalQueuedRequests != nil {
+		queued, err := p.getMetric(metricFamilies, *p.MetricMapping.TotalQueuedRequests)
+		if err == nil {
+			updated.WaitingQueueSize = int(queued.GetGauge().GetValue())
+		} else {
+			errs = multierr.Append(errs, err)
+		}
+	}
+
+	if p.MetricMapping.KVCacheUtilization != nil {
+		usage, err := p.getMetric(metricFamilies, *p.MetricMapping.KVCacheUtilization)
+		if err == nil {
+			updated.KVCacheUsagePercent = usage.GetGauge().GetValue()
+		} else {
+			errs = multierr.Append(errs, err)
+		}
+	}
+
+	// Handle LoRA metrics (only if all LoRA MetricSpecs are present)
+	if p.MetricMapping.LoraRequestInfo != nil {
+		loraMetrics, err := p.getLatestLoraMetric(metricFamilies)
+		errs = multierr.Append(errs, err)
+
+		if loraMetrics != nil {
+			updated.ActiveModels = make(map[string]int)
+			updated.WaitingModels = make(map[string]int)
+			for _, label := range loraMetrics.GetLabel() {
+				if label.GetName() == LoraInfoRunningAdaptersMetricName {
+					if label.GetValue() != "" {
+						adapterList := strings.Split(label.GetValue(), ",")
+						for _, adapter := range adapterList {
+							updated.ActiveModels[adapter] = 0
+						}
+					}
+				}
+				if label.GetName() == LoraInfoWaitingAdaptersMetricName {
+					if label.GetValue() != "" {
+						adapterList := strings.Split(label.GetValue(), ",")
+						for _, adapter := range adapterList {
+							updated.WaitingModels[adapter] = 0
+						}
+					}
+				}
+				if label.GetName() == LoraInfoMaxAdaptersMetricName {
+					if label.GetValue() != "" {
+						updated.MaxActiveModels, err = strconv.Atoi(label.GetValue())
+						if err != nil {
+							errs = multierr.Append(errs, err)
+						}
+					}
+				}
+			}
+		}
+	}
+
+	return updated, errs
+}
+
+// getLatestLoraMetric gets the latest LoRA metric series in the gauge metric family
+// `vllm:lora_requests_info`. It is fetched specially because each label key/value pair
+// permutation generates a new series, and only the most recent one is useful. The value of
+// each series is its creation timestamp, so the latest can be retrieved by comparing values.
+func (p *PodMetricsClientImpl) getLatestLoraMetric(metricFamilies map[string]*dto.MetricFamily) (*dto.Metric, error) {
+	if p.MetricMapping.LoraRequestInfo == nil {
+		return nil, nil // No LoRA metrics configured
+	}
+
+	loraRequests, ok := metricFamilies[p.MetricMapping.LoraRequestInfo.MetricName]
+	if !ok {
+		return nil, fmt.Errorf("metric family %q not found", p.MetricMapping.LoraRequestInfo.MetricName)
+	}
+
+	var latest *dto.Metric
+	var latestTs float64 // Use float64, as Gauge.Value is float64
+
+	// Iterate over all metrics in the family.
+	for _, m := range loraRequests.GetMetric() {
+		running := ""
+		waiting := ""
+		// Check if the metric has the expected LoRA labels.
+		for _, lp := range m.GetLabel() {
+			switch lp.GetName() {
+			case LoraInfoRunningAdaptersMetricName:
+				running = lp.GetValue()
+			case LoraInfoWaitingAdaptersMetricName:
+				waiting = lp.GetValue()
+			}
+		}
+		// Ignore metrics with both labels empty.
+		if running == "" && waiting == "" {
+			continue
+		}
+
+		// Select the metric with the *largest Gauge Value* (which represents the timestamp).
+		if m.GetGauge().GetValue() > latestTs {
+			latestTs = m.GetGauge().GetValue()
+			latest = m
+		}
+	}
+	if latest == nil {
+		return nil, nil
+	}
+
+	return latest, nil
+}
+
+// getMetric retrieves a specific metric based on MetricSpec.
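+// For example (names are illustrative), a spec of
+// {MetricName: "vllm_usage", Labels: map[string]string{"model": "m1"}}
+// selects the most recently stamped "vllm_usage" series whose labels include model="m1".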
+func (p *PodMetricsClientImpl) getMetric(metricFamilies map[string]*dto.MetricFamily, spec MetricSpec) (*dto.Metric, error) {
+	mf, ok := metricFamilies[spec.MetricName]
+	if !ok {
+		return nil, fmt.Errorf("metric family %q not found", spec.MetricName)
+	}
+
+	if len(mf.GetMetric()) == 0 {
+		return nil, fmt.Errorf("no metrics available for %q", spec.MetricName)
+	}
+
+	return getLatestMetric(mf, &spec)
+}
+
+// getLatestMetric gets the latest metric with matching labels.
+func getLatestMetric(mf *dto.MetricFamily, spec *MetricSpec) (*dto.Metric, error) {
+	var latestMetric *dto.Metric
+	var latestTimestamp int64 = -1 // Initialize to -1 so any timestamp is greater
+
+	for _, m := range mf.GetMetric() {
+		if spec.Labels == nil || labelsMatch(m.GetLabel(), spec.Labels) {
+			if m.GetTimestampMs() > latestTimestamp {
+				latestTimestamp = m.GetTimestampMs()
+				latestMetric = m
+			}
+		}
+	}
+
+	if latestMetric != nil {
+		return latestMetric, nil
+	}
+
+	return nil, fmt.Errorf("no matching metric found for %q with labels %+v", spec.MetricName, spec.Labels)
+}
+
+// labelsMatch checks if a metric's labels contain all the labels in the spec.
+func labelsMatch(metricLabels []*dto.LabelPair, specLabels map[string]string) bool {
+	if len(specLabels) == 0 {
+		return true // No specific labels required
+	}
+
+	for specName, specValue := range specLabels {
+		found := false
+		for _, label := range metricLabels {
+			if label.GetName() == specName && label.GetValue() == specValue {
+				found = true
+				break
+			}
+		}
+		if !found {
+			return false // A required label is missing
+		}
+	}
+	return true // All required labels are present
+}
diff --git a/pkg/epp/backend/metrics/metrics_spec.go b/pkg/epp/backend/metrics/metrics_spec.go
new file mode 100644
index 00000000..f6f904a9
--- /dev/null
+++ b/pkg/epp/backend/metrics/metrics_spec.go
@@ -0,0 +1,116 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+	http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package metrics
+
+import (
+	"fmt"
+	"strings"
+)
+
+// MetricSpec represents a single metric's specification.
+type MetricSpec struct {
+	MetricName string
+	Labels     map[string]string // Label name -> Label value
+}
+
+// MetricMapping holds named MetricSpecs.
+type MetricMapping struct {
+	TotalQueuedRequests *MetricSpec
+	KVCacheUtilization  *MetricSpec
+	LoraRequestInfo     *MetricSpec
+}
+
+// stringToMetricSpec converts a string to a MetricSpec.
+// Example inputs:
+//
+//	"metric_name"
+//	"metric_name{label1=value1}"
+//	"metric_name{label1=value1,label2=value2}"
+func stringToMetricSpec(specStr string) (*MetricSpec, error) {
+	if specStr == "" {
+		return nil, nil // Allow empty strings to represent nil MetricSpecs
+	}
+	specStr = strings.TrimSpace(specStr)
+	metricName := specStr
+	labels := make(map[string]string)
+
+	// Check for labels enclosed in curly braces
+	start := strings.Index(specStr, "{")
+	end := strings.Index(specStr, "}")
+
+	if start != -1 || end != -1 { // If *either* brace is present...
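+		// For example, "metric{a=b" and "metric a=b}" are rejected below, and "metric{}"
+		// fails the end <= start+1 guard because its label block is empty.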
+ if start == -1 || end == -1 || end <= start+1 { // ...check that *both* are present and correctly placed. + return nil, fmt.Errorf("invalid metric spec string: %q, missing or malformed label block", specStr) + } + + metricName = strings.TrimSpace(specStr[:start]) + labelStr := specStr[start+1 : end] + + // Split into individual label pairs + labelPairs := strings.Split(labelStr, ",") + for _, pair := range labelPairs { + pair = strings.TrimSpace(pair) + parts := strings.Split(pair, "=") + if len(parts) != 2 { + return nil, fmt.Errorf("invalid label pair: %q in metric spec: %q", pair, specStr) + } + labelName := strings.TrimSpace(parts[0]) + labelValue := strings.TrimSpace(parts[1]) + if labelName == "" || labelValue == "" { + return nil, fmt.Errorf("empty label name or value in pair: %q in metric spec: %q", pair, specStr) + } + labels[labelName] = labelValue + } + // Check for extra characters after labels + if end != len(specStr)-1 { + return nil, fmt.Errorf("invalid characters after label section in: %q", specStr) + } + + } + + if metricName == "" { // Metric name cannot be empty + return nil, fmt.Errorf("empty metric name in spec: %q", specStr) + } + + return &MetricSpec{ + MetricName: metricName, + Labels: labels, + }, nil +} + +// NewMetricMapping creates a MetricMapping from string values. +func NewMetricMapping(queuedStr, kvUsageStr, loraReqInfoStr string) (*MetricMapping, error) { + queuedSpec, err := stringToMetricSpec(queuedStr) + if err != nil { + return nil, fmt.Errorf("error parsing WaitingRequests: %w", err) + } + kvUsageSpec, err := stringToMetricSpec(kvUsageStr) + if err != nil { + return nil, fmt.Errorf("error parsing KVCacheUsage: %w", err) + } + loraReqInfoSpec, err := stringToMetricSpec(loraReqInfoStr) + if err != nil { + return nil, fmt.Errorf("error parsing loraReqInfoStr: %w", err) + } + mapping := &MetricMapping{ + TotalQueuedRequests: queuedSpec, + KVCacheUtilization: kvUsageSpec, + LoraRequestInfo: loraReqInfoSpec, + } + + return mapping, nil +} diff --git a/pkg/epp/backend/metrics/metrics_spec_test.go b/pkg/epp/backend/metrics/metrics_spec_test.go new file mode 100644 index 00000000..e62bc5ff --- /dev/null +++ b/pkg/epp/backend/metrics/metrics_spec_test.go @@ -0,0 +1,170 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package metrics + +import ( + "reflect" + "testing" +) + +func TestStringToMetricSpec(t *testing.T) { + tests := []struct { + name string + input string + want *MetricSpec + wantErr bool + }{ + { + name: "empty string", + input: "", + want: nil, + wantErr: false, + }, + { + name: "no labels", + input: "my_metric", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{}, + }, + wantErr: false, + }, + { + name: "one label", + input: "my_metric{label1=value1}", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{ + "label1": "value1", + }, + }, + wantErr: false, + }, + { + name: "multiple labels", + input: "my_metric{label1=value1,label2=value2}", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{ + "label1": "value1", + "label2": "value2", + }, + }, + wantErr: false, + }, + { + name: "extra whitespace", + input: " my_metric { label1 = value1 , label2 = value2 } ", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{ + "label1": "value1", + "label2": "value2", + }, + }, + wantErr: false, + }, + { + name: "missing closing brace", + input: "my_metric{label1=value1", + want: nil, + wantErr: true, + }, + { + name: "missing opening brace", + input: "my_metriclabel1=value1}", + want: nil, // Corrected expected value + wantErr: true, + }, + { + name: "invalid label pair", + input: "my_metric{label1}", + want: nil, + wantErr: true, + }, + { + name: "empty label name", + input: "my_metric{=value1}", + want: nil, + wantErr: true, + }, + { + name: "empty label value", + input: "my_metric{label1=}", + want: nil, + wantErr: true, + }, + { + name: "empty label name and value with spaces", + input: "my_metric{ = }", + want: nil, + wantErr: true, + }, + { + name: "characters after closing brace", + input: "my_metric{label=val}extra", + want: nil, + wantErr: true, + }, + { + name: "empty metric name", + input: "{label=val}", + want: nil, + wantErr: true, + }, + { + name: "no labels and just metric name with space", + input: "my_metric ", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{}, + }, + wantErr: false, + }, + { + name: "no labels and just metric name with space before and after", + input: " my_metric ", + want: &MetricSpec{ + MetricName: "my_metric", + Labels: map[string]string{}, + }, + wantErr: false, + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + got, err := stringToMetricSpec(tt.input) + if (err != nil) != tt.wantErr { + t.Errorf("stringToMetricSpec() error = %v, wantErr %v", err, tt.wantErr) + return + } + if tt.want != nil && got != nil { // compare maps directly + if tt.want.Labels == nil { + tt.want.Labels = make(map[string]string) + } + if !reflect.DeepEqual(got.MetricName, tt.want.MetricName) { + t.Errorf("stringToMetricSpec() got MetricName = %v, want %v", got.MetricName, tt.want.MetricName) + } + if !reflect.DeepEqual(got.Labels, tt.want.Labels) { + t.Errorf("stringToMetricSpec() got Labels = %v, want %v", got.Labels, tt.want.Labels) + } + } else if tt.want != got { // handles if one is nil and the other isn't + t.Errorf("stringToMetricSpec() = %v, want %v", got, tt.want) + } + }) + } +} diff --git a/pkg/epp/backend/metrics/metrics_test.go b/pkg/epp/backend/metrics/metrics_test.go new file mode 100644 index 00000000..53127010 --- /dev/null +++ b/pkg/epp/backend/metrics/metrics_test.go @@ -0,0 +1,509 @@ +/* +Copyright 2025 The Kubernetes Authors. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package metrics + +import ( + "context" + "errors" + "reflect" + "strconv" + "strings" + "testing" + + dto "github.com/prometheus/client_model/go" + "github.com/stretchr/testify/assert" + "go.uber.org/multierr" + "google.golang.org/protobuf/proto" + "k8s.io/apimachinery/pkg/types" + + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +// --- Test Helpers --- + +func makeMetric(labels map[string]string, value float64, timestampMs int64) *dto.Metric { + labelPairs := []*dto.LabelPair{} + for k, v := range labels { + labelPairs = append(labelPairs, &dto.LabelPair{Name: proto.String(k), Value: proto.String(v)}) + } + return &dto.Metric{ + Label: labelPairs, + Gauge: &dto.Gauge{Value: &value}, + TimestampMs: ×tampMs, + } +} + +func makeMetricFamily(name string, metrics ...*dto.Metric) *dto.MetricFamily { + return &dto.MetricFamily{ + Name: &name, + Type: dto.MetricType_GAUGE.Enum(), + Metric: metrics, + } +} + +// --- Tests --- + +func TestGetMetric(t *testing.T) { + + metricFamilies := map[string]*dto.MetricFamily{ + "metric1": makeMetricFamily("metric1", + makeMetric(map[string]string{"label1": "value1"}, 1.0, 1000), + makeMetric(map[string]string{"label1": "value2"}, 2.0, 2000), + ), + "metric2": makeMetricFamily("metric2", + makeMetric(map[string]string{"labelA": "A1", "labelB": "B1"}, 3.0, 1500), + makeMetric(map[string]string{"labelA": "A2", "labelB": "B2"}, 4.0, 2500), + ), + "metric3": makeMetricFamily("metric3", + makeMetric(map[string]string{}, 5.0, 3000), + makeMetric(map[string]string{}, 6.0, 1000), + ), + } + + tests := []struct { + name string + spec MetricSpec + wantGaugeValue float64 + wantError bool + }{ + { + name: "get labeled metric, exists", + spec: MetricSpec{ + MetricName: "metric1", + Labels: map[string]string{"label1": "value1"}, + }, + wantGaugeValue: 1.0, + wantError: false, + }, + { + name: "get labeled metric, wrong value", + spec: MetricSpec{ + MetricName: "metric1", + Labels: map[string]string{"label1": "value3"}, + }, + wantGaugeValue: -1, // Expect an error, not a specific value + wantError: true, + }, + { + name: "get labeled metric, missing label", + spec: MetricSpec{ + MetricName: "metric1", + Labels: map[string]string{"label2": "value2"}, + }, + wantGaugeValue: -1, + wantError: true, + }, + { + name: "get labeled metric, extra label present", + spec: MetricSpec{ + MetricName: "metric2", + Labels: map[string]string{"labelA": "A1"}, + }, + wantGaugeValue: 3.0, + wantError: false, + }, + { + name: "get unlabeled metric, exists", + spec: MetricSpec{ + MetricName: "metric3", + Labels: nil, // Explicitly nil + }, + wantGaugeValue: 5.0, // latest metric, which occurs first in our test data + wantError: false, + }, + { + name: "get unlabeled metric, metric family not found", + spec: MetricSpec{ + MetricName: "metric4", + Labels: nil, + }, + wantGaugeValue: -1, + wantError: true, + }, + { + name: "get labeled metric, metric family not found", + 
spec: MetricSpec{ + MetricName: "metric4", + Labels: map[string]string{"label1": "value1"}, + }, + wantGaugeValue: -1, + wantError: true, + }, + { + name: "get metric, no metrics available", + spec: MetricSpec{ + MetricName: "empty_metric", + }, + wantGaugeValue: -1, + wantError: true, + }, + { + name: "get latest metric", + spec: MetricSpec{ + MetricName: "metric3", + Labels: map[string]string{}, // Empty map, not nil + }, + wantGaugeValue: 5.0, + wantError: false, + }, + } + + p := &PodMetricsClientImpl{} // No need for MetricMapping here + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + + gotMetric, err := p.getMetric(metricFamilies, tt.spec) + + if tt.wantError { + if err == nil { + t.Errorf("getMetric() expected error, got nil") + } + } else { + if err != nil { + t.Fatalf("getMetric() unexpected error: %v", err) + } + if gotMetric.GetGauge().GetValue() != tt.wantGaugeValue { + t.Errorf("getMetric() got value %v, want %v", gotMetric.GetGauge().GetValue(), tt.wantGaugeValue) + } + } + }) + } +} + +func TestLabelsMatch(t *testing.T) { + tests := []struct { + name string + metricLabels []*dto.LabelPair + specLabels map[string]string + want bool + }{ + { + name: "empty spec labels, should match", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}}, + specLabels: map[string]string{}, + want: true, + }, + { + name: "nil spec labels, should match", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}}, + specLabels: nil, + want: true, + }, + { + name: "exact match", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}}, + specLabels: map[string]string{"a": "b"}, + want: true, + }, + { + name: "extra labels in metric", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}, {Name: proto.String("c"), Value: proto.String("d")}}, + specLabels: map[string]string{"a": "b"}, + want: true, + }, + { + name: "missing label in metric", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}}, + specLabels: map[string]string{"a": "b", "c": "d"}, + want: false, + }, + { + name: "value mismatch", + metricLabels: []*dto.LabelPair{{Name: proto.String("a"), Value: proto.String("b")}}, + specLabels: map[string]string{"a": "c"}, + want: false, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + if got := labelsMatch(tt.metricLabels, tt.specLabels); got != tt.want { + t.Errorf("labelsMatch() = %v, want %v", got, tt.want) + } + }) + } +} + +func TestGetLatestLoraMetric(t *testing.T) { + + testCases := []struct { + name string + metricFamilies map[string]*dto.MetricFamily + expectedAdapters map[string]int + expectedMax int + expectedErr error + mapping *MetricMapping + }{ + { + name: "no lora metrics", + metricFamilies: map[string]*dto.MetricFamily{ + "some_other_metric": makeMetricFamily("some_other_metric", + makeMetric(nil, 1.0, 1000), + ), + }, + expectedAdapters: nil, + expectedMax: 0, + expectedErr: errors.New("metric family \"vllm:lora_requests_info\" not found"), // Expect an error because the family is missing + mapping: &MetricMapping{ + LoraRequestInfo: &MetricSpec{MetricName: "vllm:lora_requests_info"}, + }, + }, + { + name: "basic lora metrics", + metricFamilies: map[string]*dto.MetricFamily{ + "vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info", + makeMetric(map[string]string{"running_lora_adapters": "lora1", "max_lora": "2"}, 3000.0, 1000), // Newer + 
makeMetric(map[string]string{"running_lora_adapters": "lora2,lora3", "max_lora": "4"}, 1000.0, 1000), // Older + + ), + }, + expectedAdapters: map[string]int{"lora1": 0}, + expectedMax: 2, + expectedErr: nil, + mapping: &MetricMapping{ + LoraRequestInfo: &MetricSpec{MetricName: "vllm:lora_requests_info"}, + }, + }, + { + name: "no matching lora metrics", + metricFamilies: map[string]*dto.MetricFamily{ + "vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info", + makeMetric(map[string]string{"other_label": "value"}, 5.0, 3000), + ), + }, + expectedAdapters: nil, + expectedMax: 0, + expectedErr: nil, // Expect *no* error; just no adapters found + mapping: &MetricMapping{ + LoraRequestInfo: &MetricSpec{MetricName: "vllm:lora_requests_info"}, + }, + }, + { + name: "no lora metrics if not in MetricMapping", + metricFamilies: map[string]*dto.MetricFamily{ + "vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info", + makeMetric(map[string]string{"running_lora_adapters": "lora1", "max_lora": "2"}, 5.0, 3000), + makeMetric(map[string]string{"running_lora_adapters": "lora2,lora3", "max_lora": "4"}, 6.0, 1000), + ), + }, + expectedAdapters: nil, + expectedMax: 0, + expectedErr: nil, + mapping: &MetricMapping{ // No LoRA metrics defined + }, + }, + } + + for _, tc := range testCases { + t.Run(tc.name, func(t *testing.T) { + p := &PodMetricsClientImpl{MetricMapping: tc.mapping} + loraMetric, err := p.getLatestLoraMetric(tc.metricFamilies) + + if tc.expectedErr != nil { + if err == nil || err.Error() != tc.expectedErr.Error() { + t.Errorf("getLatestLoraMetric() error = %v, wantErr %v", err, tc.expectedErr) + } + return // Stop here if an error was expected + } else if err != nil { + t.Fatalf("getLatestLoraMetric() unexpected error: %v", err) + } + + if tc.mapping.LoraRequestInfo == nil { + if loraMetric != nil { + t.Errorf("getLatestLoraMetric() expected nil metric, got %v", loraMetric) + } + return // Stop if no Lora metrics are expected. + } + + if tc.expectedAdapters == nil && loraMetric == nil { + return // Both nil, as expected + } + + if tc.expectedAdapters != nil && loraMetric != nil { // proceed with checks + + adaptersFound := make(map[string]int) + maxLora := 0 + for _, label := range loraMetric.GetLabel() { + if label.GetName() == "running_lora_adapters" && label.GetValue() != "" { + for _, adapter := range strings.Split(label.GetValue(), ",") { + adaptersFound[adapter] = 0 + } + } + if label.GetName() == "waiting_lora_adapters" && label.GetValue() != "" { + for _, adapter := range strings.Split(label.GetValue(), ",") { + adaptersFound[adapter] = 0 // Overwrite if already present + } + } + if label.GetName() == "max_lora" { + var converr error // define err in this scope. 
+								maxLora, converr = strconv.Atoi(label.GetValue())
+								if converr != nil && tc.expectedErr == nil { // only report if we don't expect any other errors
+									t.Errorf("getLatestLoraMetric() could not parse max_lora: %v", converr)
+								}
+							}
+						}
+
+						if !reflect.DeepEqual(adaptersFound, tc.expectedAdapters) {
+							t.Errorf("getLatestLoraMetric() adapters = %v, want %v", adaptersFound, tc.expectedAdapters)
+						}
+						if maxLora != tc.expectedMax {
+							t.Errorf("getLatestLoraMetric() maxLora = %v, want %v", maxLora, tc.expectedMax)
+						}
+					} else { // one is nil and the other is not
+						t.Errorf("getLatestLoraMetric(): one of expectedAdapters/loraMetric is nil and the other is not, expected %v, got %v", tc.expectedAdapters, loraMetric)
+					}
+		})
+	}
+}
+
+func TestPromToPodMetrics(t *testing.T) {
+	tests := []struct {
+		name            string
+		metricFamilies  map[string]*dto.MetricFamily
+		mapping         *MetricMapping
+		existingMetrics *Metrics
+		expectedMetrics *Metrics
+		expectedErr     error // Expected aggregated error, if any
+	}{
+		{
+			name: "vllm metrics",
+			metricFamilies: map[string]*dto.MetricFamily{
+				"vllm_waiting": makeMetricFamily("vllm_waiting",
+					makeMetric(nil, 5.0, 1000),
+					makeMetric(nil, 7.0, 2000), // Newer
+				),
+				"vllm_usage": makeMetricFamily("vllm_usage",
+					makeMetric(nil, 0.8, 2000),
+					makeMetric(nil, 0.7, 500),
+				),
+				"vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info",
+					makeMetric(map[string]string{"running_lora_adapters": "lora1,lora2", "waiting_lora_adapters": "lora3", "max_lora": "3"}, 3000.0, 1000),
+				),
+			},
+			mapping: &MetricMapping{
+				TotalQueuedRequests: &MetricSpec{MetricName: "vllm_waiting"},
+				KVCacheUtilization:  &MetricSpec{MetricName: "vllm_usage"},
+				LoraRequestInfo:     &MetricSpec{MetricName: "vllm:lora_requests_info"},
+			},
+			existingMetrics: &Metrics{},
+			expectedMetrics: &Metrics{
+				WaitingQueueSize:    7,
+				KVCacheUsagePercent: 0.8,
+				ActiveModels:        map[string]int{"lora1": 0, "lora2": 0},
+				WaitingModels:       map[string]int{"lora3": 0},
+				MaxActiveModels:     3,
+			},
+		},
+		{
+			name:           "missing metrics",
+			metricFamilies: map[string]*dto.MetricFamily{}, // No metrics
+			mapping: &MetricMapping{
+				TotalQueuedRequests: &MetricSpec{MetricName: "vllm_waiting"},
+				KVCacheUtilization:  &MetricSpec{MetricName: "vllm_usage"},
+				LoraRequestInfo:     &MetricSpec{MetricName: "vllm:lora_requests_info"},
+			},
+			existingMetrics: &Metrics{ActiveModels: map[string]int{}, WaitingModels: map[string]int{}},
+			expectedMetrics: &Metrics{ActiveModels: map[string]int{}, WaitingModels: map[string]int{}},
+			expectedErr:     multierr.Combine(errors.New("metric family \"vllm_waiting\" not found"), errors.New("metric family \"vllm_usage\" not found"), errors.New("metric family \"vllm:lora_requests_info\" not found")),
+		},
+		{
+			name: "partial metrics available + LoRA",
+			metricFamilies: map[string]*dto.MetricFamily{
+				"vllm_usage": makeMetricFamily("vllm_usage",
+					makeMetric(nil, 0.8, 2000), // Only usage is present
+				),
+				"vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info",
+					makeMetric(map[string]string{"running_lora_adapters": "lora1,lora2", "waiting_lora_adapters": "lora3", "max_lora": "3"}, 3000.0, 1000),
+				),
+			},
+			mapping: &MetricMapping{
+				TotalQueuedRequests: &MetricSpec{MetricName: "vllm_waiting"}, // Not Present
+				KVCacheUtilization:  &MetricSpec{MetricName: "vllm_usage"},
+				LoraRequestInfo:     &MetricSpec{MetricName: "vllm:lora_requests_info"},
+			},
+			existingMetrics: &Metrics{},
+			expectedMetrics: &Metrics{
+				WaitingQueueSize:    0,
+				KVCacheUsagePercent: 0.8,
+				ActiveModels:        map[string]int{"lora1": 0, "lora2":
0}, + WaitingModels: map[string]int{"lora3": 0}, + MaxActiveModels: 3, + }, + expectedErr: errors.New("metric family \"vllm_waiting\" not found"), + }, + { + name: "invalid max lora", + metricFamilies: map[string]*dto.MetricFamily{ + "vllm:lora_requests_info": makeMetricFamily("vllm:lora_requests_info", + makeMetric(map[string]string{"running_lora_adapters": "lora1", "max_lora": "invalid"}, 3000.0, 1000), + ), + }, + mapping: &MetricMapping{ + LoraRequestInfo: &MetricSpec{MetricName: "vllm:lora_requests_info"}, + }, + existingMetrics: &Metrics{}, + expectedMetrics: &Metrics{ + ActiveModels: map[string]int{"lora1": 0}, + WaitingModels: map[string]int{}, + MaxActiveModels: 0, // Should still default to 0. + + }, + expectedErr: errors.New("strconv.Atoi: parsing \"invalid\": invalid syntax"), + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + p := &PodMetricsClientImpl{MetricMapping: tc.mapping} + updated, err := p.promToPodMetrics(tc.metricFamilies, tc.existingMetrics) + if tc.expectedErr != nil { + assert.Error(t, err) + assert.EqualError(t, err, tc.expectedErr.Error()) + } else { + assert.NoError(t, err) + assert.Equal(t, tc.expectedMetrics, updated) + } + }) + } +} + +// TestFetchMetrics is a basic integration test. It assumes +// there's no server running on the specified port. +func TestFetchMetrics(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + pod := &backend.Pod{ + Address: "127.0.0.1", + NamespacedName: types.NamespacedName{ + Namespace: "test", + Name: "pod", + }, + } + existing := &Metrics{} + p := &PodMetricsClientImpl{} // No MetricMapping needed for this basic test + + _, err := p.FetchMetrics(ctx, pod, existing, 9999) // Use a port that's unlikely to be in use. + if err == nil { + t.Errorf("FetchMetrics() expected error, got nil") + } + // Check for a specific error message (fragile, but OK for this example) + expectedSubstr := "connection refused" + if err != nil && !strings.Contains(err.Error(), expectedSubstr) { + t.Errorf("FetchMetrics() error = %v, want error containing %q", err, expectedSubstr) + } +} diff --git a/pkg/epp/backend/metrics/pod_metrics.go b/pkg/epp/backend/metrics/pod_metrics.go new file mode 100644 index 00000000..bdeb28ba --- /dev/null +++ b/pkg/epp/backend/metrics/pod_metrics.go @@ -0,0 +1,137 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package metrics + +import ( + "context" + "fmt" + "sync" + "sync/atomic" + "time" + + "github.com/go-logr/logr" + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/types" + + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const ( + fetchMetricsTimeout = 5 * time.Second +) + +type podMetrics struct { + pod atomic.Pointer[backend.Pod] + metrics atomic.Pointer[Metrics] + pmc PodMetricsClient + ds Datastore + interval time.Duration + + once sync.Once // ensure the StartRefreshLoop is only called once. 
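+	// done signals the refresh goroutine to exit; it is closed by StopRefreshLoop.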
+	done chan struct{}
+
+	logger logr.Logger
+}
+
+type PodMetricsClient interface {
+	FetchMetrics(ctx context.Context, pod *backend.Pod, existing *Metrics, port int32) (*Metrics, error)
+}
+
+func (pm *podMetrics) String() string {
+	return fmt.Sprintf("Pod: %v; Metrics: %v", pm.GetPod(), pm.GetMetrics())
+}
+
+func (pm *podMetrics) GetPod() *backend.Pod {
+	return pm.pod.Load()
+}
+
+func (pm *podMetrics) GetMetrics() *Metrics {
+	return pm.metrics.Load()
+}
+
+func (pm *podMetrics) UpdatePod(in *corev1.Pod) {
+	pm.pod.Store(toInternalPod(in))
+}
+
+func toInternalPod(in *corev1.Pod) *backend.Pod {
+	return &backend.Pod{
+		NamespacedName: types.NamespacedName{
+			Name:      in.Name,
+			Namespace: in.Namespace,
+		},
+		Address: in.Status.PodIP,
+	}
+}
+
+// startRefreshLoop starts a goroutine exactly once to periodically update metrics. The goroutine
+// is stopped either when StopRefreshLoop() is called, or when the given ctx is cancelled.
+func (pm *podMetrics) startRefreshLoop(ctx context.Context) {
+	pm.once.Do(func() {
+		go func() {
+			pm.logger.V(logutil.DEFAULT).Info("Starting refresher", "pod", pm.GetPod())
+			ticker := time.NewTicker(pm.interval)
+			defer ticker.Stop()
+			for {
+				select {
+				case <-pm.done:
+					return
+				case <-ctx.Done():
+					return
+				case <-ticker.C: // refresh metrics periodically
+					if err := pm.refreshMetrics(); err != nil {
+						pm.logger.V(logutil.TRACE).Error(err, "Failed to refresh metrics", "pod", pm.GetPod())
+					}
+				}
+			}
+		}()
+	})
+}
+
+func (pm *podMetrics) refreshMetrics() error {
+	pool, err := pm.ds.PoolGet()
+	if err != nil {
+		// No inference pool, or the pool is not initialized yet.
+		return err
+	}
+	ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout)
+	defer cancel()
+	updated, err := pm.pmc.FetchMetrics(ctx, pm.GetPod(), pm.GetMetrics(), pool.Spec.TargetPortNumber)
+	if err != nil {
+		pm.logger.V(logutil.TRACE).Info("Failed to refresh metrics", "err", err)
+	}
+	// Optimistically update metrics even if there was an error.
+	// The FetchMetrics can return an error for the following reasons:
+	// 1. As refresher is running in the background, it's possible that the pod is deleted but
+	// the refresh goroutine doesn't read the done channel yet. In this case, the updated
+	// metrics object will be nil. And the refresher will soon be stopped.
+	// 2. The FetchMetrics call can partially fail. For example, due to one metric missing. In
+	// this case, the updated metrics object will have partial updates. A partial update is
+	// considered better than no updates.
+	if updated != nil {
+		updated.UpdateTime = time.Now()
+		pm.logger.V(logutil.TRACE).Info("Refreshed metrics", "updated", updated)
+		pm.metrics.Store(updated)
+	}
+
+	return nil
+}
+
+func (pm *podMetrics) StopRefreshLoop() {
+	pm.logger.V(logutil.DEFAULT).Info("Stopping refresher", "pod", pm.GetPod())
+	close(pm.done)
+}
diff --git a/pkg/epp/backend/metrics/pod_metrics_test.go b/pkg/epp/backend/metrics/pod_metrics_test.go
new file mode 100644
index 00000000..e79c1bf0
--- /dev/null
+++ b/pkg/epp/backend/metrics/pod_metrics_test.go
@@ -0,0 +1,98 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+	http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and +limitations under the License. +*/ +package metrics + +import ( + "context" + "testing" + "time" + + "github.com/google/go-cmp/cmp" + "github.com/google/go-cmp/cmp/cmpopts" + "github.com/stretchr/testify/assert" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/types" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" +) + +var ( + pod1 = &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: "pod1", + Namespace: "default", + }, + } + initial = &Metrics{ + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.2, + MaxActiveModels: 2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + }, + WaitingModels: map[string]int{}, + } + updated = &Metrics{ + WaitingQueueSize: 9999, + KVCacheUsagePercent: 0.99, + MaxActiveModels: 99, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + }, + WaitingModels: map[string]int{}, + } +) + +func TestMetricsRefresh(t *testing.T) { + ctx := context.Background() + pmc := &FakePodMetricsClient{} + pmf := NewPodMetricsFactory(pmc, time.Millisecond) + + // The refresher is initialized with empty metrics. + pm := pmf.NewPodMetrics(ctx, pod1, &fakeDataStore{}) + + namespacedName := types.NamespacedName{Name: pod1.Name, Namespace: pod1.Namespace} + // Use SetRes to simulate an update of metrics from the pod. + // Verify that the metrics are updated. + pmc.SetRes(map[types.NamespacedName]*Metrics{namespacedName: initial}) + condition := func(collect *assert.CollectT) { + assert.True(collect, cmp.Equal(pm.GetMetrics(), initial, cmpopts.IgnoreFields(Metrics{}, "UpdateTime"))) + } + assert.EventuallyWithT(t, condition, time.Second, time.Millisecond) + + // Stop the loop, and simulate metric update again, this time the PodMetrics won't get the + // new update. + pm.StopRefreshLoop() + pmc.SetRes(map[types.NamespacedName]*Metrics{namespacedName: updated}) + // Still expect the same condition (no metrics update). + assert.EventuallyWithT(t, condition, time.Second, time.Millisecond) +} + +type fakeDataStore struct{} + +func (f *fakeDataStore) PoolGet() (*v1alpha2.InferencePool, error) { + return &v1alpha2.InferencePool{Spec: v1alpha2.InferencePoolSpec{TargetPortNumber: 8000}}, nil +} +func (f *fakeDataStore) PodGetAll() []PodMetrics { + // Not implemented. + return nil +} +func (f *fakeDataStore) PodList(func(PodMetrics) bool) []PodMetrics { + // Not implemented. + return nil +} diff --git a/pkg/epp/backend/metrics/types.go b/pkg/epp/backend/metrics/types.go new file mode 100644 index 00000000..4932e3ac --- /dev/null +++ b/pkg/epp/backend/metrics/types.go @@ -0,0 +1,120 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package metrics is a library to interact with backend metrics. 
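+// It exposes PodMetricsFactory, which starts a per-pod refresh loop that periodically
+// scrapes the model server's metrics endpoint and caches the parsed result.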
+package metrics
+
+import (
+	"context"
+	"fmt"
+	"sync"
+	"time"
+
+	corev1 "k8s.io/api/core/v1"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend"
+)
+
+func NewPodMetricsFactory(pmc PodMetricsClient, refreshMetricsInterval time.Duration) *PodMetricsFactory {
+	return &PodMetricsFactory{
+		pmc:                    pmc,
+		refreshMetricsInterval: refreshMetricsInterval,
+	}
+}
+
+type PodMetricsFactory struct {
+	pmc                    PodMetricsClient
+	refreshMetricsInterval time.Duration
+}
+
+func (f *PodMetricsFactory) NewPodMetrics(parentCtx context.Context, in *corev1.Pod, ds Datastore) PodMetrics {
+	pod := toInternalPod(in)
+	pm := &podMetrics{
+		pmc:      f.pmc,
+		ds:       ds,
+		interval: f.refreshMetricsInterval,
+		once:     sync.Once{},
+		done:     make(chan struct{}),
+		logger:   log.FromContext(parentCtx).WithValues("pod", pod.NamespacedName),
+	}
+	pm.pod.Store(pod)
+	pm.metrics.Store(newMetrics())
+
+	pm.startRefreshLoop(parentCtx)
+	return pm
+}
+
+type PodMetrics interface {
+	GetPod() *backend.Pod
+	GetMetrics() *Metrics
+	UpdatePod(*corev1.Pod)
+	StopRefreshLoop()
+	String() string
+}
+
+type Metrics struct {
+	// ActiveModels is a set of models (including LoRA adapters) that are currently cached to GPU.
+	ActiveModels  map[string]int
+	WaitingModels map[string]int
+	// MaxActiveModels is the maximum number of models that can be loaded to GPU.
+	MaxActiveModels         int
+	RunningQueueSize        int
+	WaitingQueueSize        int
+	KVCacheUsagePercent     float64
+	KvCacheMaxTokenCapacity int
+
+	// UpdateTime records the last time when the metrics were updated.
+	UpdateTime time.Time
+}
+
+func newMetrics() *Metrics {
+	return &Metrics{
+		ActiveModels:  make(map[string]int),
+		WaitingModels: make(map[string]int),
+	}
+}
+
+func (m *Metrics) String() string {
+	if m == nil {
+		return ""
+	}
+	return fmt.Sprintf("%+v", *m)
+}
+
+func (m *Metrics) Clone() *Metrics {
+	if m == nil {
+		return nil
+	}
+	cm := make(map[string]int, len(m.ActiveModels))
+	for k, v := range m.ActiveModels {
+		cm[k] = v
+	}
+	wm := make(map[string]int, len(m.WaitingModels))
+	for k, v := range m.WaitingModels {
+		wm[k] = v
+	}
+	clone := &Metrics{
+		ActiveModels:            cm,
+		WaitingModels:           wm,
+		MaxActiveModels:         m.MaxActiveModels,
+		RunningQueueSize:        m.RunningQueueSize,
+		WaitingQueueSize:        m.WaitingQueueSize,
+		KVCacheUsagePercent:     m.KVCacheUsagePercent,
+		KvCacheMaxTokenCapacity: m.KvCacheMaxTokenCapacity,
+		UpdateTime:              m.UpdateTime,
+	}
+	return clone
+}
diff --git a/pkg/epp/backend/pod.go b/pkg/epp/backend/pod.go
new file mode 100644
index 00000000..a63a0a83
--- /dev/null
+++ b/pkg/epp/backend/pod.go
@@ -0,0 +1,45 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+	http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/ + +package backend + +import ( + "fmt" + + "k8s.io/apimachinery/pkg/types" +) + +type Pod struct { + NamespacedName types.NamespacedName + Address string +} + +func (p *Pod) String() string { + if p == nil { + return "" + } + return fmt.Sprintf("%+v", *p) +} + +func (p *Pod) Clone() *Pod { + if p == nil { + return nil + } + return &Pod{ + NamespacedName: types.NamespacedName{ + Name: p.NamespacedName.Name, + Namespace: p.NamespacedName.Namespace, + }, + Address: p.Address, + } +} diff --git a/pkg/epp/controller/inferencemodel_reconciler.go b/pkg/epp/controller/inferencemodel_reconciler.go new file mode 100644 index 00000000..a7f365b7 --- /dev/null +++ b/pkg/epp/controller/inferencemodel_reconciler.go @@ -0,0 +1,130 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package controller + +import ( + "context" + "fmt" + + "k8s.io/apimachinery/pkg/api/errors" + "k8s.io/apimachinery/pkg/types" + "k8s.io/client-go/tools/record" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/event" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/predicate" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +type InferenceModelReconciler struct { + client.Client + Record record.EventRecorder + Datastore datastore.Datastore + PoolNamespacedName types.NamespacedName +} + +func (c *InferenceModelReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + logger := log.FromContext(ctx).V(logutil.DEFAULT).WithValues("inferenceModel", req.NamespacedName) + ctx = ctrl.LoggerInto(ctx, logger) + + logger.Info("Reconciling InferenceModel") + + infModel := &v1alpha2.InferenceModel{} + notFound := false + if err := c.Get(ctx, req.NamespacedName, infModel); err != nil { + if !errors.IsNotFound(err) { + logger.Error(err, "Unable to get InferenceModel") + return ctrl.Result{}, err + } + notFound = true + } + + if notFound || !infModel.DeletionTimestamp.IsZero() || infModel.Spec.PoolRef.Name != v1alpha2.ObjectName(c.PoolNamespacedName.Name) { + // InferenceModel object got deleted or changed the referenced pool. + err := c.handleModelDeleted(ctx, req.NamespacedName) + return ctrl.Result{}, err + } + + // Add or update if the InferenceModel instance has a creation timestamp older than the existing entry of the model. 
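+	// ModelSetIfOlder keeps the entry whose creation timestamp is oldest, so re-applying
+	// a newer duplicate of the same model name does not displace the active entry.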
+	logger = logger.WithValues("poolRef", infModel.Spec.PoolRef).WithValues("modelName", infModel.Spec.ModelName)
+	if !c.Datastore.ModelSetIfOlder(infModel) {
+		logger.Info("Skipping InferenceModel, existing instance has older creation timestamp")
+	} else {
+		logger.Info("Added/Updated InferenceModel")
+	}
+
+	return ctrl.Result{}, nil
+}
+
+func (c *InferenceModelReconciler) handleModelDeleted(ctx context.Context, req types.NamespacedName) error {
+	logger := log.FromContext(ctx)
+
+	// Look up and delete the modelName associated with this object, then search for
+	// other instances referencing the same modelName, if any exist, and store the oldest in
+	// its place. This ensures that the InferenceModel with the oldest creation
+	// timestamp is active.
+	existing := c.Datastore.ModelDelete(req)
+	if existing == nil {
+		// No entry exists in the first place, nothing to do.
+		return nil
+	}
+	logger.Info("InferenceModel removed from datastore", "poolRef", existing.Spec.PoolRef, "modelName", existing.Spec.ModelName)
+
+	// TODO(#409): replace this backfill logic with one that is based on InferenceModel Ready conditions once those are set by an external controller.
+	updated, err := c.Datastore.ModelResync(ctx, c.Client, existing.Spec.ModelName)
+	if err != nil {
+		return err
+	}
+	if updated {
+		logger.Info("Model replaced.", "modelName", existing.Spec.ModelName)
+	}
+	return nil
+}
+
+func indexInferenceModelsByModelName(obj client.Object) []string {
+	m, ok := obj.(*v1alpha2.InferenceModel)
+	if !ok {
+		return nil
+	}
+	return []string{m.Spec.ModelName}
+}
+
+func (c *InferenceModelReconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager) error {
+	// Create an index on ModelName for InferenceModel objects.
+	indexer := mgr.GetFieldIndexer()
+	if err := indexer.IndexField(ctx, &v1alpha2.InferenceModel{}, datastore.ModelNameIndexKey, indexInferenceModelsByModelName); err != nil {
+		return fmt.Errorf("setting index on ModelName for InferenceModel: %w", err)
+	}
+	return ctrl.NewControllerManagedBy(mgr).
+		For(&v1alpha2.InferenceModel{}).
+		WithEventFilter(predicate.Funcs{
+			CreateFunc: func(e event.CreateEvent) bool { return c.eventPredicate(e.Object.(*v1alpha2.InferenceModel)) },
+			UpdateFunc: func(e event.UpdateEvent) bool {
+				return c.eventPredicate(e.ObjectOld.(*v1alpha2.InferenceModel)) || c.eventPredicate(e.ObjectNew.(*v1alpha2.InferenceModel))
+			},
+			DeleteFunc:  func(e event.DeleteEvent) bool { return c.eventPredicate(e.Object.(*v1alpha2.InferenceModel)) },
+			GenericFunc: func(e event.GenericEvent) bool { return c.eventPredicate(e.Object.(*v1alpha2.InferenceModel)) },
+		}).
+		Complete(c)
+}
+
+func (c *InferenceModelReconciler) eventPredicate(infModel *v1alpha2.InferenceModel) bool {
+	return string(infModel.Spec.PoolRef.Name) == c.PoolNamespacedName.Name
+}
diff --git a/pkg/epp/controller/inferencemodel_reconciler_test.go b/pkg/epp/controller/inferencemodel_reconciler_test.go
new file mode 100644
index 00000000..80c30e19
--- /dev/null
+++ b/pkg/epp/controller/inferencemodel_reconciler_test.go
@@ -0,0 +1,233 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package controller + +import ( + "context" + "testing" + "time" + + "github.com/google/go-cmp/cmp" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + "k8s.io/client-go/tools/record" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/client/fake" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + utiltest "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/testing" +) + +var ( + pool = utiltest.MakeInferencePool("test-pool1").Namespace("ns1").ObjRef() + infModel1 = utiltest.MakeInferenceModel("model1"). + Namespace(pool.Namespace). + ModelName("fake model1"). + Criticality(v1alpha2.Standard). + CreationTimestamp(metav1.Unix(1000, 0)). + PoolName(pool.Name).ObjRef() + infModel1Pool2 = utiltest.MakeInferenceModel(infModel1.Name). + Namespace(infModel1.Namespace). + ModelName(infModel1.Spec.ModelName). + Criticality(*infModel1.Spec.Criticality). + CreationTimestamp(metav1.Unix(1001, 0)). + PoolName("test-pool2").ObjRef() + infModel1NS2 = utiltest.MakeInferenceModel(infModel1.Name). + Namespace("ns2"). + ModelName(infModel1.Spec.ModelName). + Criticality(*infModel1.Spec.Criticality). + CreationTimestamp(metav1.Unix(1002, 0)). + PoolName(pool.Name).ObjRef() + infModel1Critical = utiltest.MakeInferenceModel(infModel1.Name). + Namespace(infModel1.Namespace). + ModelName(infModel1.Spec.ModelName). + Criticality(v1alpha2.Critical). + CreationTimestamp(metav1.Unix(1003, 0)). + PoolName(pool.Name).ObjRef() + infModel1Deleted = utiltest.MakeInferenceModel(infModel1.Name). + Namespace(infModel1.Namespace). + ModelName(infModel1.Spec.ModelName). + CreationTimestamp(metav1.Unix(1004, 0)). + DeletionTimestamp(). + PoolName(pool.Name).ObjRef() + // Same ModelName, different object with newer creation timestamp + infModel1Newer = utiltest.MakeInferenceModel("model1-newer"). + Namespace(pool.Namespace). + ModelName("fake model1"). + Criticality(v1alpha2.Standard). + CreationTimestamp(metav1.Unix(1005, 0)). + PoolName(pool.Name).ObjRef() + // Same ModelName, different object with older creation timestamp + infModel1Older = utiltest.MakeInferenceModel("model1-older"). + Namespace(pool.Namespace). + ModelName("fake model1"). + Criticality(v1alpha2.Standard). + CreationTimestamp(metav1.Unix(999, 0)). + PoolName(pool.Name).ObjRef() + + infModel2 = utiltest.MakeInferenceModel("model2"). + Namespace(pool.Namespace). + ModelName("fake model2"). + CreationTimestamp(metav1.Unix(1000, 0)). 
+		PoolName(pool.Name).ObjRef()
+)
+
+func TestInferenceModelReconciler(t *testing.T) {
+	tests := []struct {
+		name              string
+		modelsInStore     []*v1alpha2.InferenceModel
+		modelsInAPIServer []*v1alpha2.InferenceModel
+		model             *v1alpha2.InferenceModel
+		incomingReq       *types.NamespacedName
+		wantModels        []*v1alpha2.InferenceModel
+		wantResult        ctrl.Result
+	}{
+		{
+			name:       "Empty store, add new model",
+			model:      infModel1,
+			wantModels: []*v1alpha2.InferenceModel{infModel1},
+		},
+		{
+			name:          "Existing model changed pools",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel1Pool2,
+			wantModels:    []*v1alpha2.InferenceModel{},
+		},
+		{
+			name:          "Not found, delete existing model",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			incomingReq:   &types.NamespacedName{Name: infModel1.Name, Namespace: infModel1.Namespace},
+			wantModels:    []*v1alpha2.InferenceModel{},
+		},
+		{
+			name:          "Deletion timestamp set, delete existing model",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel1Deleted,
+			wantModels:    []*v1alpha2.InferenceModel{},
+		},
+		{
+			name:          "Model referencing a different pool, same pool name but different namespace",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel1NS2,
+			wantModels:    []*v1alpha2.InferenceModel{infModel1},
+		},
+		{
+			name:              "Existing model changed pools, replaced with another",
+			modelsInStore:     []*v1alpha2.InferenceModel{infModel1},
+			model:             infModel1Pool2,
+			modelsInAPIServer: []*v1alpha2.InferenceModel{infModel1Newer},
+			wantModels:        []*v1alpha2.InferenceModel{infModel1Newer},
+		},
+		{
+			name:              "Not found, delete existing model, replaced with another",
+			modelsInStore:     []*v1alpha2.InferenceModel{infModel1},
+			incomingReq:       &types.NamespacedName{Name: infModel1.Name, Namespace: infModel1.Namespace},
+			modelsInAPIServer: []*v1alpha2.InferenceModel{infModel1Newer},
+			wantModels:        []*v1alpha2.InferenceModel{infModel1Newer},
+		},
+		{
+			name:              "Deletion timestamp set, delete existing model, replaced with another",
+			modelsInStore:     []*v1alpha2.InferenceModel{infModel1},
+			model:             infModel1Deleted,
+			modelsInAPIServer: []*v1alpha2.InferenceModel{infModel1Newer},
+			wantModels:        []*v1alpha2.InferenceModel{infModel1Newer},
+		},
+		{
+			name:          "Older instance of the model observed",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel1Older,
+			wantModels:    []*v1alpha2.InferenceModel{infModel1Older},
+		},
+		{
+			name:          "Model changed criticality",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel1Critical,
+			wantModels:    []*v1alpha2.InferenceModel{infModel1Critical},
+		},
+		{
+			name:          "Model not found, no matching existing model to delete",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			incomingReq:   &types.NamespacedName{Name: "non-existent-model", Namespace: pool.Namespace},
+			wantModels:    []*v1alpha2.InferenceModel{infModel1},
+		},
+		{
+			name:          "Add to existing",
+			modelsInStore: []*v1alpha2.InferenceModel{infModel1},
+			model:         infModel2,
+			wantModels:    []*v1alpha2.InferenceModel{infModel1, infModel2},
+		},
+	}
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			// Create a fake client with no InferenceModel objects.
+			scheme := runtime.NewScheme()
+			_ = clientgoscheme.AddToScheme(scheme)
+			_ = v1alpha2.Install(scheme)
+			initObjs := []client.Object{}
+			if test.model != nil {
+				initObjs = append(initObjs, test.model)
+			}
+			for _, m := range test.modelsInAPIServer {
+				initObjs = append(initObjs, m)
+			}
+
+			fakeClient := fake.NewClientBuilder().
+				WithScheme(scheme).
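+				// Mirror the ModelName field index registered in SetupWithManager so that
+				// handleModelDeleted's ModelResync lookup works against the fake client.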
+ WithObjects(initObjs...). + WithIndex(&v1alpha2.InferenceModel{}, datastore.ModelNameIndexKey, indexInferenceModelsByModelName). + Build() + pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.FakePodMetricsClient{}, time.Second) + ds := datastore.NewDatastore(t.Context(), pmf) + for _, m := range test.modelsInStore { + ds.ModelSetIfOlder(m) + } + _ = ds.PoolSet(context.Background(), fakeClient, pool) + reconciler := &InferenceModelReconciler{ + Client: fakeClient, + Record: record.NewFakeRecorder(10), + Datastore: ds, + PoolNamespacedName: types.NamespacedName{Name: pool.Name, Namespace: pool.Namespace}, + } + if test.incomingReq == nil { + test.incomingReq = &types.NamespacedName{Name: test.model.Name, Namespace: test.model.Namespace} + } + + // Call Reconcile. + result, err := reconciler.Reconcile(context.Background(), ctrl.Request{NamespacedName: *test.incomingReq}) + if err != nil { + t.Fatalf("expected no error when resource is not found, got %v", err) + } + + if diff := cmp.Diff(result, test.wantResult); diff != "" { + t.Errorf("Unexpected result diff (+got/-want): %s", diff) + } + + if len(test.wantModels) != len(ds.ModelGetAll()) { + t.Errorf("Unexpected; want: %d, got:%d", len(test.wantModels), len(ds.ModelGetAll())) + } + + if diff := diffStore(ds, diffStoreParams{wantPool: pool, wantModels: test.wantModels}); diff != "" { + t.Errorf("Unexpected diff (+got/-want): %s", diff) + } + + }) + } +} diff --git a/pkg/epp/controller/inferencepool_reconciler.go b/pkg/epp/controller/inferencepool_reconciler.go new file mode 100644 index 00000000..fb7d7727 --- /dev/null +++ b/pkg/epp/controller/inferencepool_reconciler.go @@ -0,0 +1,75 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package controller + +import ( + "context" + + "k8s.io/apimachinery/pkg/api/errors" + "k8s.io/client-go/tools/record" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +// InferencePoolReconciler utilizes the controller runtime to reconcile Instance Gateway resources +// This implementation is just used for reading & maintaining data sync. The Gateway implementation +// will have the proper controller that will create/manage objects on behalf of the server pool. 
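+// Note that this reconciler never writes back to the API server; it only refreshes the local
+// datastore that the endpoint picker reads from.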
+type InferencePoolReconciler struct { + client.Client + Record record.EventRecorder + Datastore datastore.Datastore +} + +func (c *InferencePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + logger := log.FromContext(ctx).WithValues("inferencePool", req.NamespacedName).V(logutil.DEFAULT) + ctx = ctrl.LoggerInto(ctx, logger) + + logger.Info("Reconciling InferencePool") + + infPool := &v1alpha2.InferencePool{} + + if err := c.Get(ctx, req.NamespacedName, infPool); err != nil { + if errors.IsNotFound(err) { + logger.Info("InferencePool not found. Clearing the datastore") + c.Datastore.Clear() + return ctrl.Result{}, nil + } + logger.Error(err, "Unable to get InferencePool") + return ctrl.Result{}, err + } else if !infPool.DeletionTimestamp.IsZero() { + logger.Info("InferencePool is marked for deletion. Clearing the datastore") + c.Datastore.Clear() + return ctrl.Result{}, nil + } + // update pool in datastore + if err := c.Datastore.PoolSet(ctx, c.Client, infPool); err != nil { + logger.Error(err, "Failed to update datastore") + return ctrl.Result{}, err + } + + return ctrl.Result{}, nil +} + +func (c *InferencePoolReconciler) SetupWithManager(mgr ctrl.Manager) error { + return ctrl.NewControllerManagedBy(mgr). + For(&v1alpha2.InferencePool{}). + Complete(c) +} diff --git a/pkg/epp/controller/inferencepool_reconciler_test.go b/pkg/epp/controller/inferencepool_reconciler_test.go new file mode 100644 index 00000000..b7e28334 --- /dev/null +++ b/pkg/epp/controller/inferencepool_reconciler_test.go @@ -0,0 +1,188 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package controller + +import ( + "context" + "testing" + "time" + + "github.com/google/go-cmp/cmp" + "github.com/google/go-cmp/cmp/cmpopts" + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/client/fake" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + utiltest "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/testing" +) + +var ( + selector_v1 = map[string]string{"app": "vllm_v1"} + selector_v2 = map[string]string{"app": "vllm_v2"} + pool1 = utiltest.MakeInferencePool("pool1"). + Namespace("pool1-ns"). + Selector(selector_v1). + TargetPortNumber(8080).ObjRef() + pool2 = utiltest.MakeInferencePool("pool2").Namespace("pool2-ns").ObjRef() + pods = []*corev1.Pod{ + // Two ready pods matching pool1 + utiltest.MakePod("pod1"). + Namespace("pool1-ns"). + Labels(selector_v1).ReadyCondition().ObjRef(), + utiltest.MakePod("pod2"). + Namespace("pool1-ns"). + Labels(selector_v1). + ReadyCondition().ObjRef(), + // A not ready pod matching pool1 + utiltest.MakePod("pod3"). 
+			Namespace("pool1-ns").
+			Labels(selector_v1).ObjRef(),
+		// A pod not matching pool1 namespace
+		utiltest.MakePod("pod4").
+			Namespace("pool2-ns").
+			Labels(selector_v1).
+			ReadyCondition().ObjRef(),
+		// A ready pod matching pool1 with a new selector
+		utiltest.MakePod("pod5").
+			Namespace("pool1-ns").
+			Labels(selector_v2).
+			ReadyCondition().ObjRef(),
+	}
+)
+
+func TestInferencePoolReconciler(t *testing.T) {
+	// The best practice is to use table-driven tests; however, in this scenario it seems
+	// more logical to do a single test with steps that depend on each other.
+
+	// Set up the scheme.
+	scheme := runtime.NewScheme()
+	_ = clientgoscheme.AddToScheme(scheme)
+	_ = v1alpha2.Install(scheme)
+
+	// Create a fake client with the pool and the pods.
+	initialObjects := []client.Object{pool1, pool2}
+	for i := range pods {
+		initialObjects = append(initialObjects, pods[i])
+	}
+	fakeClient := fake.NewClientBuilder().
+		WithScheme(scheme).
+		WithObjects(initialObjects...).
+		Build()
+
+	// Create a request for the existing resource.
+	namespacedName := types.NamespacedName{Name: pool1.Name, Namespace: pool1.Namespace}
+	req := ctrl.Request{NamespacedName: namespacedName}
+	ctx := context.Background()
+
+	pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.FakePodMetricsClient{}, time.Second)
+	datastore := datastore.NewDatastore(ctx, pmf)
+	inferencePoolReconciler := &InferencePoolReconciler{Client: fakeClient, Datastore: datastore}
+
+	// Step 1: Inception, only ready pods matching pool1 are added to the store.
+	if _, err := inferencePoolReconciler.Reconcile(ctx, req); err != nil {
+		t.Errorf("Unexpected InferencePool reconcile error: %v", err)
+	}
+	if diff := diffStore(datastore, diffStoreParams{wantPool: pool1, wantPods: []string{"pod1", "pod2"}}); diff != "" {
+		t.Errorf("Unexpected diff (+got/-want): %s", diff)
+	}
+
+	// Step 2: update the pool selector; only pod5, which is ready and matches the new selector, should remain.
+	newPool1 := &v1alpha2.InferencePool{}
+	if err := fakeClient.Get(ctx, req.NamespacedName, newPool1); err != nil {
+		t.Errorf("Unexpected pool get error: %v", err)
+	}
+	newPool1.Spec.Selector = map[v1alpha2.LabelKey]v1alpha2.LabelValue{"app": "vllm_v2"}
+	if err := fakeClient.Update(ctx, newPool1, &client.UpdateOptions{}); err != nil {
+		t.Errorf("Unexpected pool update error: %v", err)
+	}
+
+	if _, err := inferencePoolReconciler.Reconcile(ctx, req); err != nil {
+		t.Errorf("Unexpected InferencePool reconcile error: %v", err)
+	}
+	if diff := diffStore(datastore, diffStoreParams{wantPool: newPool1, wantPods: []string{"pod5"}}); diff != "" {
+		t.Errorf("Unexpected diff (+got/-want): %s", diff)
+	}
+
+	// Step 3: update the pool port
+	if err := fakeClient.Get(ctx, req.NamespacedName, newPool1); err != nil {
+		t.Errorf("Unexpected pool get error: %v", err)
+	}
+	newPool1.Spec.TargetPortNumber = 9090
+	if err := fakeClient.Update(ctx, newPool1, &client.UpdateOptions{}); err != nil {
+		t.Errorf("Unexpected pool update error: %v", err)
+	}
+	if _, err := inferencePoolReconciler.Reconcile(ctx, req); err != nil {
+		t.Errorf("Unexpected InferencePool reconcile error: %v", err)
+	}
+	if diff := diffStore(datastore, diffStoreParams{wantPool: newPool1, wantPods: []string{"pod5"}}); diff != "" {
+		t.Errorf("Unexpected diff (+got/-want): %s", diff)
+	}
+
+	// Step 4: delete the pool to trigger a datastore clear
+	if err := fakeClient.Get(ctx, req.NamespacedName, newPool1); err != nil {
+		t.Errorf("Unexpected pool get error: %v", err)
+	}
+	if err := fakeClient.Delete(ctx, newPool1, &client.DeleteOptions{}); err != nil {
+		t.Errorf("Unexpected pool delete error: %v",
err) + } + if _, err := inferencePoolReconciler.Reconcile(ctx, req); err != nil { + t.Errorf("Unexpected InferencePool reconcile error: %v", err) + } + if diff := diffStore(datastore, diffStoreParams{wantPods: []string{}}); diff != "" { + t.Errorf("Unexpected diff (+got/-want): %s", diff) + } +} + +type diffStoreParams struct { + wantPool *v1alpha2.InferencePool + wantPods []string + wantModels []*v1alpha2.InferenceModel +} + +func diffStore(datastore datastore.Datastore, params diffStoreParams) string { + gotPool, _ := datastore.PoolGet() + if diff := cmp.Diff(params.wantPool, gotPool); diff != "" { + return "pool:" + diff + } + + // Default wantPods if not set because PodGetAll returns an empty slice when empty. + if params.wantPods == nil { + params.wantPods = []string{} + } + gotPods := []string{} + for _, pm := range datastore.PodGetAll() { + gotPods = append(gotPods, pm.GetPod().NamespacedName.Name) + } + if diff := cmp.Diff(params.wantPods, gotPods, cmpopts.SortSlices(func(a, b string) bool { return a < b })); diff != "" { + return "pods:" + diff + } + + // Default wantModels if not set because ModelGetAll returns an empty slice when empty. + if params.wantModels == nil { + params.wantModels = []*v1alpha2.InferenceModel{} + } + gotModels := datastore.ModelGetAll() + if diff := utiltest.DiffModelLists(params.wantModels, gotModels); diff != "" { + return "models:" + diff + } + return "" +} diff --git a/pkg/epp/controller/pod_reconciler.go b/pkg/epp/controller/pod_reconciler.go new file mode 100644 index 00000000..5f1df10d --- /dev/null +++ b/pkg/epp/controller/pod_reconciler.go @@ -0,0 +1,105 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package controller + +import ( + "context" + + "github.com/go-logr/logr" + corev1 "k8s.io/api/core/v1" + apierrors "k8s.io/apimachinery/pkg/api/errors" + "k8s.io/apimachinery/pkg/types" + "k8s.io/client-go/tools/record" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/event" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/predicate" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" + podutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/pod" +) + +type PodReconciler struct { + client.Client + Datastore datastore.Datastore + Record record.EventRecorder +} + +func (c *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + logger := log.FromContext(ctx) + if !c.Datastore.PoolHasSynced() { + logger.V(logutil.TRACE).Info("Skipping reconciling Pod because the InferencePool is not available yet") + // When the inferencePool is initialized it lists the appropriate pods and populates the datastore, so no need to requeue. 
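+		// (That initial listing is performed by datastore.PoolSet, which resyncs all pods for the pool.)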
+ return ctrl.Result{}, nil + } + + logger.V(logutil.VERBOSE).Info("Pod being reconciled", "name", req.NamespacedName) + + pod := &corev1.Pod{} + if err := c.Get(ctx, req.NamespacedName, pod); err != nil { + if apierrors.IsNotFound(err) { + c.Datastore.PodDelete(req.NamespacedName) + return ctrl.Result{}, nil + } + logger.V(logutil.DEFAULT).Error(err, "Unable to get pod", "name", req.NamespacedName) + return ctrl.Result{}, err + } + + c.updateDatastore(logger, pod) + return ctrl.Result{}, nil +} + +func (c *PodReconciler) SetupWithManager(mgr ctrl.Manager) error { + filter := predicate.Funcs{ + CreateFunc: func(ce event.CreateEvent) bool { + pod := ce.Object.(*corev1.Pod) + return c.Datastore.PoolLabelsMatch(pod.GetLabels()) + }, + UpdateFunc: func(ue event.UpdateEvent) bool { + oldPod := ue.ObjectOld.(*corev1.Pod) + newPod := ue.ObjectNew.(*corev1.Pod) + return c.Datastore.PoolLabelsMatch(oldPod.GetLabels()) || c.Datastore.PoolLabelsMatch(newPod.GetLabels()) + }, + DeleteFunc: func(de event.DeleteEvent) bool { + pod := de.Object.(*corev1.Pod) + return c.Datastore.PoolLabelsMatch(pod.GetLabels()) + }, + GenericFunc: func(ge event.GenericEvent) bool { + pod := ge.Object.(*corev1.Pod) + return c.Datastore.PoolLabelsMatch(pod.GetLabels()) + }, + } + return ctrl.NewControllerManagedBy(mgr). + For(&corev1.Pod{}). + WithEventFilter(filter). + Complete(c) +} + +func (c *PodReconciler) updateDatastore(logger logr.Logger, pod *corev1.Pod) { + namespacedName := types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace} + if !podutil.IsPodReady(pod) || !c.Datastore.PoolLabelsMatch(pod.Labels) { + logger.V(logutil.DEBUG).Info("Pod removed or not added", "name", namespacedName) + c.Datastore.PodDelete(namespacedName) + } else { + if c.Datastore.PodUpdateOrAddIfNotExist(pod) { + logger.V(logutil.DEFAULT).Info("Pod added", "name", namespacedName) + } else { + logger.V(logutil.DEFAULT).Info("Pod already exists", "name", namespacedName) + } + } +} diff --git a/pkg/epp/controller/pod_reconciler_test.go b/pkg/epp/controller/pod_reconciler_test.go new file mode 100644 index 00000000..d2bdd5d0 --- /dev/null +++ b/pkg/epp/controller/pod_reconciler_test.go @@ -0,0 +1,209 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package controller + +import ( + "context" + "testing" + "time" + + "github.com/google/go-cmp/cmp" + "github.com/google/go-cmp/cmp/cmpopts" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/client/fake" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + utiltest "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/testing" +) + +var ( + basePod1 = &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod1"}, Status: corev1.PodStatus{PodIP: "address-1"}} + basePod2 = &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod2"}, Status: corev1.PodStatus{PodIP: "address-2"}} + basePod3 = &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod3"}, Status: corev1.PodStatus{PodIP: "address-3"}} + basePod11 = &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "pod1"}, Status: corev1.PodStatus{PodIP: "address-11"}} + pmc = &backendmetrics.FakePodMetricsClient{} + pmf = backendmetrics.NewPodMetricsFactory(pmc, time.Second) +) + +func TestPodReconciler(t *testing.T) { + tests := []struct { + name string + pool *v1alpha2.InferencePool + existingPods []*corev1.Pod + incomingPod *corev1.Pod + wantPods []*corev1.Pod + req *ctrl.Request + }{ + { + name: "Add new pod", + existingPods: []*corev1.Pod{basePod1, basePod2}, + pool: &v1alpha2.InferencePool{ + Spec: v1alpha2.InferencePoolSpec{ + TargetPortNumber: int32(8000), + Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{ + "some-key": "some-val", + }, + }, + }, + incomingPod: utiltest.FromBase(basePod3). + Labels(map[string]string{"some-key": "some-val"}). + ReadyCondition().ObjRef(), + wantPods: []*corev1.Pod{basePod1, basePod2, basePod3}, + }, + { + name: "Update pod1 address", + existingPods: []*corev1.Pod{basePod1, basePod2}, + pool: &v1alpha2.InferencePool{ + Spec: v1alpha2.InferencePoolSpec{ + TargetPortNumber: int32(8000), + Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{ + "some-key": "some-val", + }, + }, + }, + incomingPod: utiltest.FromBase(basePod11). + Labels(map[string]string{"some-key": "some-val"}). + ReadyCondition().ObjRef(), + wantPods: []*corev1.Pod{basePod11, basePod2}, + }, + { + name: "Delete pod with DeletionTimestamp", + existingPods: []*corev1.Pod{basePod1, basePod2}, + pool: &v1alpha2.InferencePool{ + Spec: v1alpha2.InferencePoolSpec{ + TargetPortNumber: int32(8000), + Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{ + "some-key": "some-val", + }, + }, + }, + incomingPod: utiltest.FromBase(basePod1). + Labels(map[string]string{"some-key": "some-val"}). + DeletionTimestamp(). 
+				ReadyCondition().ObjRef(),
+			wantPods: []*corev1.Pod{basePod2},
+		},
+		{
+			name:         "Delete not-found pod",
+			existingPods: []*corev1.Pod{basePod1, basePod2},
+			pool: &v1alpha2.InferencePool{
+				Spec: v1alpha2.InferencePoolSpec{
+					TargetPortNumber: int32(8000),
+					Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{
+						"some-key": "some-val",
+					},
+				},
+			},
+			req:      &ctrl.Request{NamespacedName: types.NamespacedName{Name: "pod1"}},
+			wantPods: []*corev1.Pod{basePod2},
+		},
+		{
+			name:         "New pod, not ready, valid selector",
+			existingPods: []*corev1.Pod{basePod1, basePod2},
+			pool: &v1alpha2.InferencePool{
+				Spec: v1alpha2.InferencePoolSpec{
+					TargetPortNumber: int32(8000),
+					Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{
+						"some-key": "some-val",
+					},
+				},
+			},
+			incomingPod: utiltest.FromBase(basePod3).
+				Labels(map[string]string{"some-key": "some-val"}).ObjRef(),
+			wantPods: []*corev1.Pod{basePod1, basePod2},
+		},
+		{
+			name:         "Remove pod that does not match selector",
+			existingPods: []*corev1.Pod{basePod1, basePod2},
+			pool: &v1alpha2.InferencePool{
+				Spec: v1alpha2.InferencePoolSpec{
+					TargetPortNumber: int32(8000),
+					Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{
+						"some-key": "some-val",
+					},
+				},
+			},
+			incomingPod: utiltest.FromBase(basePod1).
+				Labels(map[string]string{"some-wrong-key": "some-val"}).
+				ReadyCondition().ObjRef(),
+			wantPods: []*corev1.Pod{basePod2},
+		},
+		{
+			name:         "Remove pod that is not ready",
+			existingPods: []*corev1.Pod{basePod1, basePod2},
+			pool: &v1alpha2.InferencePool{
+				Spec: v1alpha2.InferencePoolSpec{
+					TargetPortNumber: int32(8000),
+					Selector: map[v1alpha2.LabelKey]v1alpha2.LabelValue{
+						"some-key": "some-val",
+					},
+				},
+			},
+			// Matching labels, but no ready condition, so the pod must be removed from the store.
+			incomingPod: utiltest.FromBase(basePod1).
+				Labels(map[string]string{"some-key": "some-val"}).ObjRef(),
+			wantPods: []*corev1.Pod{basePod2},
+		},
+	}
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			// Set up the scheme.
+			scheme := runtime.NewScheme()
+			_ = clientgoscheme.AddToScheme(scheme)
+			initialObjects := []client.Object{}
+			if test.incomingPod != nil {
+				initialObjects = append(initialObjects, test.incomingPod)
+			}
+			fakeClient := fake.NewClientBuilder().
+				WithScheme(scheme).
+				WithObjects(initialObjects...).
+				Build()
+
+			// Configure the initial state of the datastore.
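+			// PoolSet is given the client because setting a pool with a new selector lists pods to resync the store.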
+ store := datastore.NewDatastore(t.Context(), pmf) + _ = store.PoolSet(t.Context(), fakeClient, test.pool) + for _, pod := range test.existingPods { + store.PodUpdateOrAddIfNotExist(pod) + } + + podReconciler := &PodReconciler{Client: fakeClient, Datastore: store} + if test.req == nil { + namespacedName := types.NamespacedName{Name: test.incomingPod.Name, Namespace: test.incomingPod.Namespace} + test.req = &ctrl.Request{NamespacedName: namespacedName} + } + if _, err := podReconciler.Reconcile(context.Background(), *test.req); err != nil { + t.Errorf("Unexpected InferencePool reconcile error: %v", err) + } + + var gotPods []*corev1.Pod + for _, pm := range store.PodGetAll() { + pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: pm.GetPod().NamespacedName.Name, Namespace: pm.GetPod().NamespacedName.Namespace}, Status: corev1.PodStatus{PodIP: pm.GetPod().Address}} + gotPods = append(gotPods, pod) + } + if !cmp.Equal(gotPods, test.wantPods, cmpopts.SortSlices(func(a, b *corev1.Pod) bool { return a.Name < b.Name })) { + t.Errorf("got (%v) != want (%v);", gotPods, test.wantPods) + } + }) + } +} diff --git a/pkg/epp/datastore/datastore.go b/pkg/epp/datastore/datastore.go new file mode 100644 index 00000000..22c50022 --- /dev/null +++ b/pkg/epp/datastore/datastore.go @@ -0,0 +1,333 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package datastore + +import ( + "context" + "errors" + "fmt" + "reflect" + "sync" + + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/labels" + "k8s.io/apimachinery/pkg/types" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" + podutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/pod" +) + +const ( + ModelNameIndexKey = "spec.modelName" +) + +var ( + errPoolNotSynced = errors.New("InferencePool is not initialized in data store") +) + +// The datastore is a local cache of relevant data for the given InferencePool (currently all pulled from k8s-api) +type Datastore interface { + // InferencePool operations + // PoolSet sets the given pool in datastore. If the given pool has different label selector than the previous pool + // that was stored, the function triggers a resync of the pods to keep the datastore updated. If the given pool + // is nil, this call triggers the datastore.Clear() function. 
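+	// PoolSet returns an error only if the pod resync against the API server fails.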
+	PoolSet(ctx context.Context, client client.Client, pool *v1alpha2.InferencePool) error
+	PoolGet() (*v1alpha2.InferencePool, error)
+	PoolHasSynced() bool
+	PoolLabelsMatch(podLabels map[string]string) bool
+
+	// InferenceModel operations
+	ModelSetIfOlder(infModel *v1alpha2.InferenceModel) bool
+	ModelGet(modelName string) *v1alpha2.InferenceModel
+	ModelDelete(namespacedName types.NamespacedName) *v1alpha2.InferenceModel
+	ModelResync(ctx context.Context, ctrlClient client.Client, modelName string) (bool, error)
+	ModelGetAll() []*v1alpha2.InferenceModel
+
+	// PodMetrics operations
+	// PodGetAll returns all pods and metrics, including fresh and stale.
+	PodGetAll() []backendmetrics.PodMetrics
+	// PodList lists pods matching the given predicate.
+	PodList(predicate func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics
+	PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
+	PodDelete(namespacedName types.NamespacedName)
+
+	// Clear clears the store state; this happens when the pool gets deleted.
+	Clear()
+}
+
+func NewDatastore(parentCtx context.Context, pmf *backendmetrics.PodMetricsFactory) Datastore {
+	store := &datastore{
+		parentCtx:       parentCtx,
+		poolAndModelsMu: sync.RWMutex{},
+		models:          make(map[string]*v1alpha2.InferenceModel),
+		pods:            &sync.Map{},
+		pmf:             pmf,
+	}
+	return store
+}
+
+type datastore struct {
+	// parentCtx controls the lifecycle of the background metrics goroutines spawned by the datastore.
+	parentCtx context.Context
+	// poolAndModelsMu is used to synchronize access to the pool and the models map.
+	poolAndModelsMu sync.RWMutex
+	pool            *v1alpha2.InferencePool
+	// key: InferenceModel.Spec.ModelName, value: *InferenceModel
+	models map[string]*v1alpha2.InferenceModel
+	// key: types.NamespacedName, value: backendmetrics.PodMetrics
+	pods *sync.Map
+	pmf  *backendmetrics.PodMetricsFactory
+}
+
+func (ds *datastore) Clear() {
+	ds.poolAndModelsMu.Lock()
+	defer ds.poolAndModelsMu.Unlock()
+	ds.pool = nil
+	ds.models = make(map[string]*v1alpha2.InferenceModel)
+	ds.pods.Clear()
+}
+
+// /// InferencePool APIs ///
+func (ds *datastore) PoolSet(ctx context.Context, client client.Client, pool *v1alpha2.InferencePool) error {
+	if pool == nil {
+		ds.Clear()
+		return nil
+	}
+	logger := log.FromContext(ctx)
+	ds.poolAndModelsMu.Lock()
+	defer ds.poolAndModelsMu.Unlock()
+
+	oldPool := ds.pool
+	ds.pool = pool
+	if oldPool == nil || !reflect.DeepEqual(pool.Spec.Selector, oldPool.Spec.Selector) {
+		logger.V(logutil.DEFAULT).Info("Updating inference pool endpoints", "selector", pool.Spec.Selector)
+		// A full resync is required to address two cases:
+		// 1) At startup, the pod events may get processed before the pool is synced with the datastore,
+		//    and hence they will not be added to the store since the pool selector is not known yet.
+		// 2) If the selector on the pool was updated, then we will not get any pod events, and so we need
+		//    to resync the whole pool: remove pods in the store that don't match the new selector and add
+		//    the ones that may have existed already to the store.
+ if err := ds.podResyncAll(ctx, client); err != nil { + return fmt.Errorf("failed to update pods according to the pool selector - %w", err) + } + } + + return nil +} + +func (ds *datastore) PoolGet() (*v1alpha2.InferencePool, error) { + ds.poolAndModelsMu.RLock() + defer ds.poolAndModelsMu.RUnlock() + if !ds.PoolHasSynced() { + return nil, errPoolNotSynced + } + return ds.pool, nil +} + +func (ds *datastore) PoolHasSynced() bool { + ds.poolAndModelsMu.RLock() + defer ds.poolAndModelsMu.RUnlock() + return ds.pool != nil +} + +func (ds *datastore) PoolLabelsMatch(podLabels map[string]string) bool { + ds.poolAndModelsMu.RLock() + defer ds.poolAndModelsMu.RUnlock() + if ds.pool == nil { + return false + } + poolSelector := selectorFromInferencePoolSelector(ds.pool.Spec.Selector) + podSet := labels.Set(podLabels) + return poolSelector.Matches(podSet) +} + +func (ds *datastore) ModelSetIfOlder(infModel *v1alpha2.InferenceModel) bool { + ds.poolAndModelsMu.Lock() + defer ds.poolAndModelsMu.Unlock() + + // Check first if the existing model is older. + // One exception is if the incoming model object is the same, in which case, we should not + // check for creation timestamp since that means the object was re-created, and so we should override. + existing, exists := ds.models[infModel.Spec.ModelName] + if exists { + diffObj := infModel.Name != existing.Name || infModel.Namespace != existing.Namespace + if diffObj && existing.ObjectMeta.CreationTimestamp.Before(&infModel.ObjectMeta.CreationTimestamp) { + return false + } + } + // Set the model. + ds.models[infModel.Spec.ModelName] = infModel + return true +} + +func (ds *datastore) ModelResync(ctx context.Context, c client.Client, modelName string) (bool, error) { + ds.poolAndModelsMu.Lock() + defer ds.poolAndModelsMu.Unlock() + + var models v1alpha2.InferenceModelList + if err := c.List(ctx, &models, client.MatchingFields{ModelNameIndexKey: modelName}, client.InNamespace(ds.pool.Namespace)); err != nil { + return false, fmt.Errorf("listing models that match the modelName %s: %w", modelName, err) + } + if len(models.Items) == 0 { + // No other instances of InferenceModels with this ModelName exists. + return false, nil + } + + var oldest *v1alpha2.InferenceModel + for i := range models.Items { + m := &models.Items[i] + if m.Spec.ModelName != modelName || // The index should filter those out, but just in case! + m.Spec.PoolRef.Name != v1alpha2.ObjectName(ds.pool.Name) || // We don't care about other pools, we could setup an index on this too! 
+ !m.DeletionTimestamp.IsZero() { // ignore objects marked for deletion + continue + } + if oldest == nil || m.ObjectMeta.CreationTimestamp.Before(&oldest.ObjectMeta.CreationTimestamp) { + oldest = m + } + } + if oldest == nil { + return false, nil + } + ds.models[modelName] = oldest + return true, nil +} + +func (ds *datastore) ModelGet(modelName string) *v1alpha2.InferenceModel { + ds.poolAndModelsMu.RLock() + defer ds.poolAndModelsMu.RUnlock() + return ds.models[modelName] +} + +func (ds *datastore) ModelDelete(namespacedName types.NamespacedName) *v1alpha2.InferenceModel { + ds.poolAndModelsMu.Lock() + defer ds.poolAndModelsMu.Unlock() + for _, m := range ds.models { + if m.Name == namespacedName.Name && m.Namespace == namespacedName.Namespace { + delete(ds.models, m.Spec.ModelName) + return m + } + } + return nil +} + +func (ds *datastore) ModelGetAll() []*v1alpha2.InferenceModel { + ds.poolAndModelsMu.RLock() + defer ds.poolAndModelsMu.RUnlock() + res := []*v1alpha2.InferenceModel{} + for _, v := range ds.models { + res = append(res, v) + } + return res +} + +// /// Pods/endpoints APIs /// + +func (ds *datastore) PodGetAll() []backendmetrics.PodMetrics { + return ds.PodList(func(backendmetrics.PodMetrics) bool { return true }) +} + +func (ds *datastore) PodList(predicate func(backendmetrics.PodMetrics) bool) []backendmetrics.PodMetrics { + res := []backendmetrics.PodMetrics{} + fn := func(k, v any) bool { + pm := v.(backendmetrics.PodMetrics) + if predicate(pm) { + res = append(res, pm) + } + return true + } + ds.pods.Range(fn) + return res +} + +func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool { + namespacedName := types.NamespacedName{ + Name: pod.Name, + Namespace: pod.Namespace, + } + var pm backendmetrics.PodMetrics + existing, ok := ds.pods.Load(namespacedName) + if !ok { + pm = ds.pmf.NewPodMetrics(ds.parentCtx, pod, ds) + ds.pods.Store(namespacedName, pm) + } else { + pm = existing.(backendmetrics.PodMetrics) + } + // Update pod properties if anything changed. + pm.UpdatePod(pod) + return ok +} + +func (ds *datastore) PodDelete(namespacedName types.NamespacedName) { + v, ok := ds.pods.LoadAndDelete(namespacedName) + if ok { + pmr := v.(backendmetrics.PodMetrics) + pmr.StopRefreshLoop() + } +} + +func (ds *datastore) podResyncAll(ctx context.Context, ctrlClient client.Client) error { + logger := log.FromContext(ctx) + podList := &corev1.PodList{} + if err := ctrlClient.List(ctx, podList, &client.ListOptions{ + LabelSelector: selectorFromInferencePoolSelector(ds.pool.Spec.Selector), + Namespace: ds.pool.Namespace, + }); err != nil { + return fmt.Errorf("failed to list pods - %w", err) + } + + activePods := make(map[string]bool) + for _, pod := range podList.Items { + if !podutil.IsPodReady(&pod) { + continue + } + namespacedName := types.NamespacedName{Name: pod.Name, Namespace: pod.Namespace} + activePods[pod.Name] = true + if ds.PodUpdateOrAddIfNotExist(&pod) { + logger.V(logutil.DEFAULT).Info("Pod added", "name", namespacedName) + } else { + logger.V(logutil.DEFAULT).Info("Pod already exists", "name", namespacedName) + } + } + + // Remove pods that don't belong to the pool or not ready any more. 
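+	// activePods was built from the fresh List above, so any cached entry missing from it is stale.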
+ deleteFn := func(k, v any) bool { + pm := v.(backendmetrics.PodMetrics) + if exist := activePods[pm.GetPod().NamespacedName.Name]; !exist { + logger.V(logutil.VERBOSE).Info("Removing pod", "pod", pm.GetPod()) + ds.PodDelete(pm.GetPod().NamespacedName) + } + return true + } + ds.pods.Range(deleteFn) + + return nil +} + +func selectorFromInferencePoolSelector(selector map[v1alpha2.LabelKey]v1alpha2.LabelValue) labels.Selector { + return labels.SelectorFromSet(stripLabelKeyAliasFromLabelMap(selector)) +} + +func stripLabelKeyAliasFromLabelMap(labels map[v1alpha2.LabelKey]v1alpha2.LabelValue) map[string]string { + outMap := make(map[string]string) + for k, v := range labels { + outMap[string(k)] = string(v) + } + return outMap +} diff --git a/pkg/epp/datastore/datastore_test.go b/pkg/epp/datastore/datastore_test.go new file mode 100644 index 00000000..b6466e6b --- /dev/null +++ b/pkg/epp/datastore/datastore_test.go @@ -0,0 +1,448 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package datastore + +import ( + "context" + "errors" + "testing" + "time" + + "github.com/google/go-cmp/cmp" + "github.com/google/go-cmp/cmp/cmpopts" + "github.com/stretchr/testify/assert" + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + "sigs.k8s.io/controller-runtime/pkg/client/fake" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + testutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/testing" +) + +func TestPool(t *testing.T) { + pool1Selector := map[string]string{"app": "vllm_v1"} + pool1 := testutil.MakeInferencePool("pool1"). + Namespace("default"). + Selector(pool1Selector).ObjRef() + tests := []struct { + name string + inferencePool *v1alpha2.InferencePool + labels map[string]string + wantSynced bool + wantPool *v1alpha2.InferencePool + wantErr error + wantLabelsMatch bool + }{ + { + name: "Ready when InferencePool exists in data store", + inferencePool: pool1, + labels: pool1Selector, + wantSynced: true, + wantPool: pool1, + wantLabelsMatch: true, + }, + { + name: "Labels not matched", + inferencePool: pool1, + labels: map[string]string{"app": "vllm_v2"}, + wantSynced: true, + wantPool: pool1, + wantLabelsMatch: false, + }, + { + name: "Not ready when InferencePool is nil in data store", + wantErr: errPoolNotSynced, + wantSynced: false, + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + // Set up the scheme. + scheme := runtime.NewScheme() + _ = clientgoscheme.AddToScheme(scheme) + fakeClient := fake.NewClientBuilder(). + WithScheme(scheme). 
+ Build() + pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.FakePodMetricsClient{}, time.Second) + datastore := NewDatastore(context.Background(), pmf) + _ = datastore.PoolSet(context.Background(), fakeClient, tt.inferencePool) + gotPool, gotErr := datastore.PoolGet() + if diff := cmp.Diff(tt.wantErr, gotErr, cmpopts.EquateErrors()); diff != "" { + t.Errorf("Unexpected error diff (+got/-want): %s", diff) + } + if diff := cmp.Diff(tt.wantPool, gotPool); diff != "" { + t.Errorf("Unexpected pool diff (+got/-want): %s", diff) + } + gotSynced := datastore.PoolHasSynced() + if diff := cmp.Diff(tt.wantSynced, gotSynced); diff != "" { + t.Errorf("Unexpected synced diff (+got/-want): %s", diff) + } + if tt.labels != nil { + gotLabelsMatch := datastore.PoolLabelsMatch(tt.labels) + if diff := cmp.Diff(tt.wantLabelsMatch, gotLabelsMatch); diff != "" { + t.Errorf("Unexpected labels match diff (+got/-want): %s", diff) + } + } + }) + } +} + +func TestModel(t *testing.T) { + chatModel := "chat" + tsModel := "food-review" + model1ts := testutil.MakeInferenceModel("model1"). + CreationTimestamp(metav1.Unix(1000, 0)). + ModelName(tsModel).ObjRef() + // Same model name as model1ts, different object name. + model2ts := testutil.MakeInferenceModel("model2"). + CreationTimestamp(metav1.Unix(1001, 0)). + ModelName(tsModel).ObjRef() + // Same model name as model1ts, newer timestamp + model1tsNewer := testutil.MakeInferenceModel("model1"). + CreationTimestamp(metav1.Unix(1002, 0)). + Criticality(v1alpha2.Critical). + ModelName(tsModel).ObjRef() + model2tsNewer := testutil.MakeInferenceModel("model2"). + CreationTimestamp(metav1.Unix(1003, 0)). + ModelName(tsModel).ObjRef() + // Same object name as model2ts, different model name. + model2chat := testutil.MakeInferenceModel(model2ts.Name). + CreationTimestamp(metav1.Unix(1005, 0)). 
+ ModelName(chatModel).ObjRef() + + tests := []struct { + name string + existingModels []*v1alpha2.InferenceModel + op func(ds Datastore) bool + wantOpResult bool + wantModels []*v1alpha2.InferenceModel + }{ + { + name: "Add model1 with food-review as modelName", + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model1ts) + }, + wantModels: []*v1alpha2.InferenceModel{model1ts}, + wantOpResult: true, + }, + { + name: "Set model1 with the same modelName, but with diff criticality and newer creation timestamp, should update.", + existingModels: []*v1alpha2.InferenceModel{model1ts}, + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model1tsNewer) + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model1tsNewer}, + }, + { + name: "set model2 with the same modelName, but newer creation timestamp, should not update.", + existingModels: []*v1alpha2.InferenceModel{model1tsNewer}, + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model2tsNewer) + }, + wantOpResult: false, + wantModels: []*v1alpha2.InferenceModel{model1tsNewer}, + }, + { + name: "Set model2 with the same modelName, but older creation timestamp, should update", + existingModels: []*v1alpha2.InferenceModel{model1tsNewer}, + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model2ts) + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model2ts}, + }, + { + name: "Set model1 with the food-review modelName, both models should exist", + existingModels: []*v1alpha2.InferenceModel{model2chat}, + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model1ts) + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + }, + { + name: "Set model1 with the food-review modelName, both models should exist", + existingModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + op: func(ds Datastore) bool { + return ds.ModelSetIfOlder(model1ts) + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + }, + { + name: "Getting by model name, chat -> model2", + existingModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + op: func(ds Datastore) bool { + gotChat := ds.ModelGet(chatModel) + return gotChat != nil && cmp.Diff(model2chat, gotChat) == "" + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + }, + { + name: "Delete the model", + existingModels: []*v1alpha2.InferenceModel{model2chat, model1ts}, + op: func(ds Datastore) bool { + existing := ds.ModelDelete(types.NamespacedName{Name: model1ts.Name, Namespace: model1ts.Namespace}) + got := ds.ModelGet(tsModel) + return existing != nil && got == nil + + }, + wantOpResult: true, + wantModels: []*v1alpha2.InferenceModel{model2chat}, + }, + } + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.FakePodMetricsClient{}, time.Second) + ds := NewDatastore(t.Context(), pmf) + for _, m := range test.existingModels { + ds.ModelSetIfOlder(m) + } + + gotOpResult := test.op(ds) + if gotOpResult != test.wantOpResult { + t.Errorf("Unexpected operation result, want: %v, got: %v", test.wantOpResult, gotOpResult) + } + + if diff := testutil.DiffModelLists(test.wantModels, ds.ModelGetAll()); diff != "" { + t.Errorf("Unexpected models diff: %s", diff) + } + + }) + } +} + +var ( + pod1 = &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: "pod1", + }, + } + pod1Metrics = &backendmetrics.Metrics{ + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.2, + 
MaxActiveModels:     2,
+		ActiveModels: map[string]int{
+			"foo": 1,
+			"bar": 1,
+		},
+		WaitingModels: map[string]int{},
+	}
+	pod2 = &corev1.Pod{
+		ObjectMeta: metav1.ObjectMeta{
+			Name: "pod2",
+		},
+	}
+	pod2Metrics = &backendmetrics.Metrics{
+		WaitingQueueSize:    1,
+		KVCacheUsagePercent: 0.2,
+		MaxActiveModels:     2,
+		ActiveModels: map[string]int{
+			"foo1": 1,
+			"bar1": 1,
+		},
+		WaitingModels: map[string]int{},
+	}
+	pod1NamespacedName = types.NamespacedName{Name: pod1.Name, Namespace: pod1.Namespace}
+	pod2NamespacedName = types.NamespacedName{Name: pod2.Name, Namespace: pod2.Namespace}
+	inferencePool      = &v1alpha2.InferencePool{
+		Spec: v1alpha2.InferencePoolSpec{
+			TargetPortNumber: 8000,
+		},
+	}
+)
+
+func TestMetrics(t *testing.T) {
+	tests := []struct {
+		name      string
+		pmc       backendmetrics.PodMetricsClient
+		storePods []*corev1.Pod
+		want      []*backendmetrics.Metrics
+	}{
+		{
+			name: "Probing metrics success",
+			pmc: &backendmetrics.FakePodMetricsClient{
+				Res: map[types.NamespacedName]*backendmetrics.Metrics{
+					pod1NamespacedName: pod1Metrics,
+					pod2NamespacedName: pod2Metrics,
+				},
+			},
+			storePods: []*corev1.Pod{pod1, pod2},
+			want:      []*backendmetrics.Metrics{pod1Metrics, pod2Metrics},
+		},
+		{
+			name: "Only pods in the store are probed",
+			pmc: &backendmetrics.FakePodMetricsClient{
+				Res: map[types.NamespacedName]*backendmetrics.Metrics{
+					pod1NamespacedName: pod1Metrics,
+					pod2NamespacedName: pod2Metrics,
+				},
+			},
+			storePods: []*corev1.Pod{pod1},
+			want:      []*backendmetrics.Metrics{pod1Metrics},
+		},
+		{
+			name: "Probing metrics error",
+			pmc: &backendmetrics.FakePodMetricsClient{
+				Err: map[types.NamespacedName]error{
+					pod2NamespacedName: errors.New("injected error"),
+				},
+				Res: map[types.NamespacedName]*backendmetrics.Metrics{
+					pod1NamespacedName: pod1Metrics,
+				},
+			},
+			storePods: []*corev1.Pod{pod1, pod2},
+			want: []*backendmetrics.Metrics{
+				pod1Metrics,
+				// Failed to fetch pod2 metrics so it remains the default values.
+				{
+					ActiveModels:        map[string]int{},
+					WaitingModels:       map[string]int{},
+					WaitingQueueSize:    0,
+					KVCacheUsagePercent: 0,
+					MaxActiveModels:     0,
+				},
+			},
+		},
+	}
+
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			ctx, cancel := context.WithCancel(context.Background())
+			defer cancel()
+			// Set up the scheme.
+			scheme := runtime.NewScheme()
+			_ = clientgoscheme.AddToScheme(scheme)
+			fakeClient := fake.NewClientBuilder().
+				WithScheme(scheme).
+ Build() + pmf := backendmetrics.NewPodMetricsFactory(test.pmc, time.Millisecond) + ds := NewDatastore(ctx, pmf) + _ = ds.PoolSet(ctx, fakeClient, inferencePool) + for _, pod := range test.storePods { + ds.PodUpdateOrAddIfNotExist(pod) + } + assert.EventuallyWithT(t, func(t *assert.CollectT) { + got := ds.PodGetAll() + metrics := []*backendmetrics.Metrics{} + for _, one := range got { + metrics = append(metrics, one.GetMetrics()) + } + diff := cmp.Diff(test.want, metrics, cmpopts.IgnoreFields(backendmetrics.Metrics{}, "UpdateTime"), cmpopts.SortSlices(func(a, b *backendmetrics.Metrics) bool { + return a.String() < b.String() + })) + assert.Equal(t, "", diff, "Unexpected diff (+got/-want)") + }, 5*time.Second, time.Millisecond) + }) + } +} + +func TestPods(t *testing.T) { + updatedPod := &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: "pod1", + }, + Spec: corev1.PodSpec{ + NodeName: "node-1", + }, + } + tests := []struct { + name string + op func(ctx context.Context, ds Datastore) + existingPods []*corev1.Pod + wantPods []*corev1.Pod + }{ + { + name: "Add new pod, no existing pods, should add", + existingPods: []*corev1.Pod{}, + wantPods: []*corev1.Pod{pod1}, + op: func(ctx context.Context, ds Datastore) { + ds.PodUpdateOrAddIfNotExist(pod1) + }, + }, + { + name: "Add new pod, with existing pods, should add", + existingPods: []*corev1.Pod{pod1}, + wantPods: []*corev1.Pod{pod1, pod2}, + op: func(ctx context.Context, ds Datastore) { + ds.PodUpdateOrAddIfNotExist(pod2) + }, + }, + { + name: "Update existing pod, new field, should update", + existingPods: []*corev1.Pod{pod1}, + wantPods: []*corev1.Pod{updatedPod}, + op: func(ctx context.Context, ds Datastore) { + ds.PodUpdateOrAddIfNotExist(updatedPod) + }, + }, + { + name: "Update existing pod, no new fields, should not update", + existingPods: []*corev1.Pod{pod1}, + wantPods: []*corev1.Pod{pod1}, + op: func(ctx context.Context, ds Datastore) { + incoming := &corev1.Pod{ + ObjectMeta: metav1.ObjectMeta{ + Name: "pod1", + Namespace: "default", + }, + } + ds.PodUpdateOrAddIfNotExist(incoming) + }, + }, + { + name: "Delete the pod", + wantPods: []*corev1.Pod{pod1}, + op: func(ctx context.Context, ds Datastore) { + ds.PodDelete(pod2NamespacedName) + }, + }, + { + name: "Delete the pod that doesn't exist", + existingPods: []*corev1.Pod{pod1}, + wantPods: []*corev1.Pod{pod1}, + op: func(ctx context.Context, ds Datastore) { + ds.PodDelete(pod2NamespacedName) + }, + }, + } + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + ctx := context.Background() + pmf := backendmetrics.NewPodMetricsFactory(&backendmetrics.FakePodMetricsClient{}, time.Second) + ds := NewDatastore(t.Context(), pmf) + for _, pod := range test.existingPods { + ds.PodUpdateOrAddIfNotExist(pod) + } + + test.op(ctx, ds) + var gotPods []*corev1.Pod + for _, pm := range ds.PodGetAll() { + pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: pm.GetPod().NamespacedName.Name, Namespace: pm.GetPod().NamespacedName.Namespace}, Status: corev1.PodStatus{PodIP: pm.GetPod().Address}} + gotPods = append(gotPods, pod) + } + if !cmp.Equal(gotPods, test.wantPods, cmpopts.SortSlices(func(a, b *corev1.Pod) bool { return a.Name < b.Name })) { + t.Logf("got (%v) != want (%v);", gotPods, test.wantPods) + } + }) + } +} diff --git a/pkg/epp/handlers/request.go b/pkg/epp/handlers/request.go new file mode 100644 index 00000000..65d082c8 --- /dev/null +++ b/pkg/epp/handlers/request.go @@ -0,0 +1,165 @@ +/* +Copyright 2025 The Kubernetes Authors. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "context" + "encoding/json" + "fmt" + "strconv" + "time" + + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + schedulingtypes "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" + errutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/error" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +// HandleRequestBody always returns the requestContext even in the error case, as the request context is used in error handling. +func (s *StreamingServer) HandleRequestBody( + ctx context.Context, + reqCtx *RequestContext, +) (*RequestContext, error) { + var requestBodyBytes []byte + logger := log.FromContext(ctx) + requestBodyMap := reqCtx.Request.Body + + // Resolve target models. + model, ok := requestBodyMap["model"].(string) + if !ok { + return reqCtx, errutil.Error{Code: errutil.BadRequest, Msg: "model not found in request"} + } + prompt, ok := requestBodyMap["prompt"].(string) + if !ok { + return reqCtx, errutil.Error{Code: errutil.BadRequest, Msg: "prompt not found in request"} + } + + modelName := model + + // NOTE: The nil checking for the modelObject means that we DO allow passthrough currently. + // This might be a security risk in the future where adapters not registered in the InferenceModel + // are able to be requested by using their distinct name. + modelObj := s.datastore.ModelGet(model) + if modelObj == nil { + return reqCtx, errutil.Error{Code: errutil.BadConfiguration, Msg: fmt.Sprintf("error finding a model object in InferenceModel for input %v", model)} + } + if len(modelObj.Spec.TargetModels) > 0 { + modelName = RandomWeightedDraw(logger, modelObj, 0) + if modelName == "" { + return reqCtx, errutil.Error{Code: errutil.BadConfiguration, Msg: fmt.Sprintf("error getting target model name for model %v", modelObj.Name)} + } + } + llmReq := &schedulingtypes.LLMRequest{ + Model: model, + ResolvedTargetModel: modelName, + Critical: modelObj.Spec.Criticality != nil && *modelObj.Spec.Criticality == v1alpha2.Critical, + Prompt: prompt, + } + logger.V(logutil.DEBUG).Info("LLM request assembled", "request", llmReq) + + var err error + // Update target models in the body. 
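+	// The body rewrite below makes the backend serve the resolved target (e.g. a specific adapter)
+	// rather than the client-facing model name.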
+	if llmReq.Model != llmReq.ResolvedTargetModel {
+		requestBodyMap["model"] = llmReq.ResolvedTargetModel
+	}
+
+	requestBodyBytes, err = json.Marshal(requestBodyMap)
+	if err != nil {
+		logger.V(logutil.DEFAULT).Error(err, "Error marshaling request body")
+		return reqCtx, errutil.Error{Code: errutil.Internal, Msg: fmt.Sprintf("error marshaling request body: %v", err)}
+	}
+
+	res, err := s.scheduler.Schedule(ctx, llmReq)
+	if err != nil {
+		return reqCtx, errutil.Error{Code: errutil.InferencePoolResourceExhausted, Msg: fmt.Errorf("failed to find target pod: %w", err).Error()}
+	}
+	targetPod := res.TargetPod.GetPod()
+
+	// Insert the target endpoint to instruct Envoy to route requests to the specified target pod,
+	// and attach the pool's target port number.
+	pool, err := s.datastore.PoolGet()
+	if err != nil {
+		return reqCtx, err
+	}
+	endpoint := targetPod.Address + ":" + strconv.Itoa(int(pool.Spec.TargetPortNumber))
+
+	logger.V(logutil.DEFAULT).Info("Request handled",
+		"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod)
+
+	reqCtx.Model = llmReq.Model
+	reqCtx.ResolvedTargetModel = llmReq.ResolvedTargetModel
+	reqCtx.RequestSize = len(requestBodyBytes)
+	reqCtx.TargetPod = targetPod.NamespacedName.String()
+	reqCtx.TargetEndpoint = endpoint
+
+	s.populateRequestHeaderResponse(reqCtx, endpoint, len(requestBodyBytes))
+
+	reqCtx.reqBodyResp = &extProcPb.ProcessingResponse{
+		// The Endpoint Picker supports two approaches to communicating the target endpoint, as a request header
+		// and as an unstructured ext-proc response metadata key/value pair. This enables different integration
+		// options for gateway providers.
+		Response: &extProcPb.ProcessingResponse_RequestBody{
+			RequestBody: &extProcPb.BodyResponse{
+				Response: &extProcPb.CommonResponse{
+					BodyMutation: &extProcPb.BodyMutation{
+						Mutation: &extProcPb.BodyMutation_StreamedResponse{
+							StreamedResponse: &extProcPb.StreamedBodyResponse{
+								Body:        requestBodyBytes,
+								EndOfStream: true,
+							},
+						},
+					},
+				},
+			},
+		},
+	}
+	return reqCtx, nil
+}
+
+func (s *StreamingServer) HandleRequestHeaders(ctx context.Context, reqCtx *RequestContext, req *extProcPb.ProcessingRequest_RequestHeaders) error {
+	reqCtx.RequestReceivedTimestamp = time.Now()
+
+	// An EoS in the request headers means this request has no body or trailers.
+	if req.RequestHeaders.EndOfStream {
+		// We will route this request to a random pod as this is assumed to just be a GET
+		// More context: https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/526
+		// The above PR will address endpoint admission, but currently any request without a body will be
+		// routed to a random upstream pod.
+		pod := GetRandomPod(s.datastore)
+		if pod == nil {
+			return errutil.Error{Code: errutil.Internal, Msg: "no pods available in datastore"}
+		}
+		pool, err := s.datastore.PoolGet()
+		if err != nil {
+			return err
+		}
+		endpoint := pod.Address + ":" + strconv.Itoa(int(pool.Spec.TargetPortNumber))
+		s.populateRequestHeaderResponse(reqCtx, endpoint, 0)
+		return nil
+	}
+
+	for _, header := range req.RequestHeaders.Headers.Headers {
+		if header.RawValue != nil {
+			reqCtx.Request.Headers[header.Key] = string(header.RawValue)
+		} else {
+			reqCtx.Request.Headers[header.Key] = header.Value
+		}
+	}
+	return nil
+}
diff --git a/pkg/epp/handlers/response.go b/pkg/epp/handlers/response.go
new file mode 100644
index 00000000..04c7a5e9
--- /dev/null
+++ b/pkg/epp/handlers/response.go
@@ -0,0 +1,147 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package handlers
+
+import (
+	"context"
+	"encoding/json"
+	"strings"
+
+	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+const (
+	streamingRespPrefix = "data: "
+	streamingEndMsg     = "data: [DONE]"
+)
+
+// HandleResponseBody always returns the requestContext even in the error case, as the request context is used in error handling.
+func (s *StreamingServer) HandleResponseBody(
+	ctx context.Context,
+	reqCtx *RequestContext,
+	response map[string]interface{},
+) (*RequestContext, error) {
+	logger := log.FromContext(ctx)
+	responseBytes, err := json.Marshal(response)
+	if err != nil {
+		logger.V(logutil.DEFAULT).Error(err, "error marshalling responseBody")
+		return reqCtx, err
+	}
+	if response["usage"] != nil {
+		usg := response["usage"].(map[string]interface{})
+		usage := Usage{
+			PromptTokens:     int(usg["prompt_tokens"].(float64)),
+			CompletionTokens: int(usg["completion_tokens"].(float64)),
+			TotalTokens:      int(usg["total_tokens"].(float64)),
+		}
+		reqCtx.Usage = usage
+		logger.V(logutil.VERBOSE).Info("Response generated", "usage", reqCtx.Usage)
+	}
+	reqCtx.ResponseSize = len(responseBytes)
+	// ResponseComplete indicates whether the response is complete. In the non-streaming
+	// case, it is set to true once the response is processed; in the streaming case,
+	// it is set to true once the last chunk is processed.
+	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/178)
+	// will add the processing for streaming case.
+	reqCtx.ResponseComplete = true
+
+	reqCtx.respBodyResp = &extProcPb.ProcessingResponse{
+		// The Endpoint Picker supports two approaches to communicating the target endpoint, as a request header
+		// and as an unstructured ext-proc response metadata key/value pair. This enables different integration
+		// options for gateway providers.
+		Response: &extProcPb.ProcessingResponse_ResponseBody{
+			ResponseBody: &extProcPb.BodyResponse{
+				Response: &extProcPb.CommonResponse{
+					BodyMutation: &extProcPb.BodyMutation{
+						Mutation: &extProcPb.BodyMutation_StreamedResponse{
+							StreamedResponse: &extProcPb.StreamedBodyResponse{
+								Body:        responseBytes,
+								EndOfStream: true,
+							},
+						},
+					},
+				},
+			},
+		},
+	}
+	return reqCtx, nil
+}
+
+// HandleResponseBodyModelStreaming handles the response body when the model server is streaming.
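+// Usage metrics are recorded only once the final `data: [DONE]` message has been observed.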
+func (s *StreamingServer) HandleResponseBodyModelStreaming( + ctx context.Context, + reqCtx *RequestContext, + responseText string, +) { + if strings.Contains(responseText, streamingEndMsg) { + resp := parseRespForUsage(ctx, responseText) + reqCtx.Usage = resp.Usage + metrics.RecordInputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, resp.Usage.PromptTokens) + metrics.RecordOutputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, resp.Usage.CompletionTokens) + } +} + +// Example message if "stream_options": {"include_usage": "true"} is included in the request: +// data: {"id":"...","object":"text_completion","created":1739400043,"model":"food-review-0","choices":[], +// "usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}} +// +// data: [DONE] +// +// Noticed that vLLM returns two entries in one response. +// We need to strip the `data:` prefix and next Data: [DONE] from the message to fetch response data. +// +// If include_usage is not included in the request, `data: [DONE]` is returned separately, which +// indicates end of streaming. +func parseRespForUsage( + ctx context.Context, + responseText string, +) Response { + response := Response{} + logger := log.FromContext(ctx) + + lines := strings.Split(responseText, "\n") + for _, line := range lines { + if !strings.HasPrefix(line, streamingRespPrefix) { + continue + } + content := strings.TrimPrefix(line, streamingRespPrefix) + if content == "[DONE]" { + continue + } + + byteSlice := []byte(content) + if err := json.Unmarshal(byteSlice, &response); err != nil { + logger.Error(err, "unmarshaling response body") + continue + } + } + + return response +} + +type Response struct { + Usage Usage `json:"usage"` +} + +type Usage struct { + PromptTokens int `json:"prompt_tokens"` + CompletionTokens int `json:"completion_tokens"` + TotalTokens int `json:"total_tokens"` +} diff --git a/pkg/epp/handlers/response_test.go b/pkg/epp/handlers/response_test.go new file mode 100644 index 00000000..bfe5a629 --- /dev/null +++ b/pkg/epp/handlers/response_test.go @@ -0,0 +1,156 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "context" + "encoding/json" + "testing" + + "github.com/google/go-cmp/cmp" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const ( + body = ` + { + "id": "cmpl-573498d260f2423f9e42817bbba3743a", + "object": "text_completion", + "created": 1732563765, + "model": "meta-llama/Llama-3.1-8B-Instruct", + "choices": [ + { + "index": 0, + "text": " Chronicle\nThe San Francisco Chronicle has a new book review section, and it's a good one. The reviews are short, but they're well-written and well-informed. The Chronicle's book review section is a good place to start if you're looking for a good book review.\nThe Chronicle's book review section is a good place to start if you're looking for a good book review. 
The Chronicle's book review section", + "logprobs": null, + "finish_reason": "length", + "stop_reason": null, + "prompt_logprobs": null + } + ], + "usage": { + "prompt_tokens": 11, + "total_tokens": 111, + "completion_tokens": 100 + } + } + ` + + streamingBodyWithoutUsage = `data: {"id":"cmpl-41764c93-f9d2-4f31-be08-3ba04fa25394","object":"text_completion","created":1740002445,"model":"food-review-0","choices":[],"usage":null} + ` + + streamingBodyWithUsage = `data: {"id":"cmpl-41764c93-f9d2-4f31-be08-3ba04fa25394","object":"text_completion","created":1740002445,"model":"food-review-0","choices":[],"usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}} +data: [DONE] + ` +) + +func TestHandleResponseBody(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + + tests := []struct { + name string + body []byte + reqCtx *RequestContext + want Usage + wantErr bool + }{ + { + name: "success", + body: []byte(body), + want: Usage{ + PromptTokens: 11, + TotalTokens: 111, + CompletionTokens: 100, + }, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + server := &StreamingServer{} + reqCtx := test.reqCtx + if reqCtx == nil { + reqCtx = &RequestContext{} + } + var responseMap map[string]interface{} + marshalErr := json.Unmarshal(test.body, &responseMap) + if marshalErr != nil { + t.Error(marshalErr, "Error unmarshaling request body") + } + _, err := server.HandleResponseBody(ctx, reqCtx, responseMap) + if err != nil { + if !test.wantErr { + t.Fatalf("HandleResponseBody returned unexpected error: %v, want %v", err, test.wantErr) + } + return + } + + if diff := cmp.Diff(test.want, reqCtx.Usage); diff != "" { + t.Errorf("HandleResponseBody returned unexpected response, diff(-want, +got): %v", diff) + } + }) + } +} + +func TestHandleStreamedResponseBody(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + tests := []struct { + name string + body string + reqCtx *RequestContext + want Usage + wantErr bool + }{ + { + name: "streaming request without usage", + body: streamingBodyWithoutUsage, + reqCtx: &RequestContext{ + modelServerStreaming: true, + }, + wantErr: false, + // In the middle of streaming response, so request context response is not set yet. + }, + { + name: "streaming request with usage", + body: streamingBodyWithUsage, + reqCtx: &RequestContext{ + modelServerStreaming: true, + }, + wantErr: false, + want: Usage{ + PromptTokens: 7, + TotalTokens: 17, + CompletionTokens: 10, + }, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + server := &StreamingServer{} + reqCtx := test.reqCtx + if reqCtx == nil { + reqCtx = &RequestContext{} + } + server.HandleResponseBodyModelStreaming(ctx, reqCtx, test.body) + + if diff := cmp.Diff(test.want, reqCtx.Usage); diff != "" { + t.Errorf("HandleResponseBody returned unexpected response, diff(-want, +got): %v", diff) + } + }) + } +} diff --git a/pkg/epp/handlers/server.go b/pkg/epp/handlers/server.go new file mode 100644 index 00000000..646d6fee --- /dev/null +++ b/pkg/epp/handlers/server.go @@ -0,0 +1,523 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. 
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package handlers
+
+import (
+	"context"
+	"encoding/json"
+	"io"
+	"math/rand"
+	"strconv"
+	"strings"
+	"time"
+
+	configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
+	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
+	envoyTypePb "github.com/envoyproxy/go-control-plane/envoy/type/v3"
+	"github.com/go-logr/logr"
+	"google.golang.org/grpc/codes"
+	"google.golang.org/grpc/status"
+	"google.golang.org/protobuf/types/known/structpb"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics"
+	schedulingtypes "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
+	errutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/error"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+func NewStreamingServer(scheduler Scheduler, destinationEndpointHintMetadataNamespace, destinationEndpointHintKey string, datastore datastore.Datastore) *StreamingServer {
+	return &StreamingServer{
+		scheduler: scheduler,
+		destinationEndpointHintMetadataNamespace: destinationEndpointHintMetadataNamespace,
+		destinationEndpointHintKey:               destinationEndpointHintKey,
+		datastore:                                datastore,
+	}
+}
+
+// StreamingServer implements the Envoy external processing server.
+// https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor.proto
+type StreamingServer struct {
+	scheduler Scheduler
+	// The key of the header to specify the target pod address. This value needs to match Envoy
+	// configuration.
+	destinationEndpointHintKey string
+	// The key acting as the outer namespace struct in the metadata extproc response to communicate
+	// back the picked endpoints.
+	destinationEndpointHintMetadataNamespace string
+	datastore                                datastore.Datastore
+}
+
+type Scheduler interface {
+	Schedule(ctx context.Context, b *schedulingtypes.LLMRequest) (result *schedulingtypes.Result, err error)
+}
+
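// A minimal sketch of a Scheduler implementation (hypothetical, for tests or
// illustration only; the real scheduler lives under pkg/epp/scheduling):
// anything that maps an LLMRequest to a scheduling Result satisfies the
// interface, which is what lets the server be exercised with a fake.
//
//	type fixedScheduler struct {
//		result *schedulingtypes.Result
//	}
//
//	func (s *fixedScheduler) Schedule(ctx context.Context, b *schedulingtypes.LLMRequest) (*schedulingtypes.Result, error) {
//		return s.result, nil // always pick the same precomputed result
//	}
+// RequestContext stores context information during the lifetime of an HTTP request.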
+type RequestContext struct {
+	TargetPod                 string
+	TargetEndpoint            string
+	Model                     string
+	ResolvedTargetModel       string
+	RequestReceivedTimestamp  time.Time
+	ResponseCompleteTimestamp time.Time
+	RequestSize               int
+	Usage                     Usage
+	ResponseSize              int
+	ResponseComplete          bool
+	ResponseStatusCode        string
+	RequestRunning            bool
+	Request                   *Request
+
+	RequestState         StreamRequestState
+	modelServerStreaming bool
+
+	reqHeaderResp  *extProcPb.ProcessingResponse
+	reqBodyResp    *extProcPb.ProcessingResponse
+	reqTrailerResp *extProcPb.ProcessingResponse
+
+	respHeaderResp  *extProcPb.ProcessingResponse
+	respBodyResp    *extProcPb.ProcessingResponse
+	respTrailerResp *extProcPb.ProcessingResponse
+}
+
+type Request struct {
+	Headers map[string]string
+	Body    map[string]interface{}
+}
+type StreamRequestState int
+
+const (
+	RequestReceived                  StreamRequestState = 0
+	HeaderRequestResponseComplete    StreamRequestState = 1
+	BodyRequestResponsesComplete     StreamRequestState = 2
+	TrailerRequestResponsesComplete  StreamRequestState = 3
+	ResponseReceived                 StreamRequestState = 4
+	HeaderResponseResponseComplete   StreamRequestState = 5
+	BodyResponseResponsesComplete    StreamRequestState = 6
+	TrailerResponseResponsesComplete StreamRequestState = 7
+)
+
+func (s *StreamingServer) Process(srv extProcPb.ExternalProcessor_ProcessServer) error {
+	ctx := srv.Context()
+	logger := log.FromContext(ctx)
+	loggerTrace := logger.V(logutil.TRACE)
+	loggerTrace.Info("Processing")
+
+	// Create request context to share states during the lifetime of an HTTP request.
+	// See https://github.com/envoyproxy/envoy/issues/17540.
+	reqCtx := &RequestContext{
+		RequestState: RequestReceived,
+		Request: &Request{
+			Headers: make(map[string]string),
+			Body:    make(map[string]interface{}),
+		},
+	}
+
+	var body []byte
+	var responseBody map[string]interface{}
+
+	// Create error handling var as each request should only report once for
+	// error metrics. This doesn't cover the error "Cannot receive stream request" because
+	// such errors might happen even though the response is processed.
+	var err error
+	defer func() {
+		if reqCtx.ResponseStatusCode != "" {
+			metrics.RecordRequestErrCounter(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.ResponseStatusCode)
+		} else if err != nil {
+			metrics.RecordRequestErrCounter(reqCtx.Model, reqCtx.ResolvedTargetModel, errutil.CanonicalCode(err))
+		}
+		if reqCtx.RequestRunning {
+			metrics.DecRunningRequests(reqCtx.Model)
+		}
+	}()
+
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		default:
+		}
+
+		req, recvErr := srv.Recv()
+		if recvErr == io.EOF || status.Code(recvErr) == codes.Canceled {
+			return nil
+		}
+		if recvErr != nil {
+			// This error occurs very frequently, though it doesn't seem to have any impact.
+			// TODO: figure out if we can remove this noise.
+			logger.V(logutil.DEFAULT).Error(recvErr, "Cannot receive stream request")
+			return status.Errorf(codes.Unknown, "cannot receive stream request: %v", recvErr)
+		}
+
+		switch v := req.Request.(type) {
+		case *extProcPb.ProcessingRequest_RequestHeaders:
+			err = s.HandleRequestHeaders(ctx, reqCtx, v)
+		case *extProcPb.ProcessingRequest_RequestBody:
+			loggerTrace.Info("Incoming body chunk", "EoS", v.RequestBody.EndOfStream)
+			// In the stream case, we can receive multiple request bodies.
+			body = append(body, v.RequestBody.Body...)
+
+			// The message is buffered, so we can read and decode once the stream ends.
+			if v.RequestBody.EndOfStream {
+				loggerTrace.Info("decoding")
+				err = json.Unmarshal(body, &reqCtx.Request.Body)
+				if err != nil {
+					logger.V(logutil.DEFAULT).Error(err, "Error unmarshaling request body")
+					// TODO: short circuit and send the body back as is (this could be an envoy error); currently we drop
+					// whatever the request body would have been and send our immediate response instead.
+				}
+
+				// Body stream complete. Allocate empty slice for response to use.
+				body = []byte{}
+
+				reqCtx, err = s.HandleRequestBody(ctx, reqCtx)
+				if err != nil {
+					logger.V(logutil.DEFAULT).Error(err, "Error handling body")
+				} else {
+					metrics.RecordRequestCounter(reqCtx.Model, reqCtx.ResolvedTargetModel)
+					metrics.RecordRequestSizes(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.RequestSize)
+				}
+			}
+		case *extProcPb.ProcessingRequest_RequestTrailers:
+			// This is currently unused.
+		case *extProcPb.ProcessingRequest_ResponseHeaders:
+			for _, header := range v.ResponseHeaders.Headers.GetHeaders() {
+				value := string(header.RawValue)
+
+				loggerTrace.Info("header", "key", header.Key, "value", value)
+				if header.Key == "status" && value != "200" {
+					reqCtx.ResponseStatusCode = errutil.ModelServerError
+				} else if header.Key == "content-type" && strings.Contains(value, "text/event-stream") {
+					reqCtx.modelServerStreaming = true
+					loggerTrace.Info("model server is streaming response")
+				}
+			}
+			reqCtx.RequestState = ResponseReceived
+			reqCtx.respHeaderResp = &extProcPb.ProcessingResponse{
+				Response: &extProcPb.ProcessingResponse_ResponseHeaders{
+					ResponseHeaders: &extProcPb.HeadersResponse{
+						Response: &extProcPb.CommonResponse{
+							HeaderMutation: &extProcPb.HeaderMutation{
+								SetHeaders: []*configPb.HeaderValueOption{
+									{
+										Header: &configPb.HeaderValue{
+											// This is for debugging purposes only.
+											Key:      "x-went-into-resp-headers",
+											RawValue: []byte("true"),
+										},
+									},
+								},
+							},
+						},
+					},
+				},
+			}
+
+		case *extProcPb.ProcessingRequest_ResponseBody:
+			if reqCtx.modelServerStreaming {
+				// Currently we punt on response parsing if the model server is streaming; we just pass the body through.
+
+				responseText := string(v.ResponseBody.Body)
+				s.HandleResponseBodyModelStreaming(ctx, reqCtx, responseText)
+				if v.ResponseBody.EndOfStream {
+					loggerTrace.Info("stream completed")
+
+					reqCtx.ResponseCompleteTimestamp = time.Now()
+					metrics.RecordRequestLatencies(ctx, reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.RequestReceivedTimestamp, reqCtx.ResponseCompleteTimestamp)
+					metrics.RecordResponseSizes(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.ResponseSize)
+				}
+
+				reqCtx.respBodyResp = &extProcPb.ProcessingResponse{
+					Response: &extProcPb.ProcessingResponse_ResponseBody{
+						ResponseBody: &extProcPb.BodyResponse{
+							Response: &extProcPb.CommonResponse{
+								BodyMutation: &extProcPb.BodyMutation{
+									Mutation: &extProcPb.BodyMutation_StreamedResponse{
+										StreamedResponse: &extProcPb.StreamedBodyResponse{
+											Body:        v.ResponseBody.Body,
+											EndOfStream: v.ResponseBody.EndOfStream,
+										},
+									},
+								},
+							},
+						},
+					},
+				}
+			} else {
+				body = append(body, v.ResponseBody.Body...)
+
+				// The message is buffered, so we can read and decode once the stream ends.
+				if v.ResponseBody.EndOfStream {
+					loggerTrace.Info("stream completed")
+					// Don't send a 500 on a response error. Just let the message pass through and log our error for debugging purposes.
+					// We assume the body is valid JSON; error messages are not guaranteed to be JSON, so capturing the failure and sending a 500 would obfuscate the response message.
+					// Using the standard 'err' var will send an immediate error response back to the caller.
+					var responseErr error
+					responseErr = json.Unmarshal(body, &responseBody)
+					if responseErr != nil {
+						logger.V(logutil.DEFAULT).Error(responseErr, "Error unmarshaling response body")
+					}
+
+					reqCtx, responseErr = s.HandleResponseBody(ctx, reqCtx, responseBody)
+					if responseErr != nil {
+						logger.V(logutil.DEFAULT).Error(responseErr, "Failed to process response body", "request", req)
+					} else if reqCtx.ResponseComplete {
+						reqCtx.ResponseCompleteTimestamp = time.Now()
+						metrics.RecordRequestLatencies(ctx, reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.RequestReceivedTimestamp, reqCtx.ResponseCompleteTimestamp)
+						metrics.RecordResponseSizes(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.ResponseSize)
+						metrics.RecordInputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.Usage.PromptTokens)
+						metrics.RecordOutputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.Usage.CompletionTokens)
+					}
+				}
+			}
+		case *extProcPb.ProcessingRequest_ResponseTrailers:
+			// This is currently unused.
+		}
+
+		// Handle the err and fire an immediate response.
+		if err != nil {
+			logger.V(logutil.DEFAULT).Error(err, "Failed to process request", "request", req)
+			resp, err := BuildErrResponse(err)
+			if err != nil {
+				return err
+			}
+			if err := srv.Send(resp); err != nil {
+				logger.V(logutil.DEFAULT).Error(err, "Send failed")
+				return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+			}
+			return nil
+		}
+		loggerTrace.Info("checking", "request state", reqCtx.RequestState)
+		if err := reqCtx.updateStateAndSendIfNeeded(srv, logger); err != nil {
+			return err
+		}
+	}
+}
+
+// updateStateAndSendIfNeeded checks state and can send multiple responses in a single pass, but only if ordered properly.
+// The order of messages matters in FULL_DUPLEX_STREAMING. For both the request and the response, the order of responses sent back MUST be: Header->Body->Trailer, with the trailer being optional.
+func (r *RequestContext) updateStateAndSendIfNeeded(srv extProcPb.ExternalProcessor_ProcessServer, logger logr.Logger) error {
+	loggerTrace := logger.V(logutil.TRACE)
+	// No switch statement as we could send multiple responses in one pass.
+	if r.RequestState == RequestReceived && r.reqHeaderResp != nil {
+		loggerTrace.Info("Sending request header response", "obj", r.reqHeaderResp)
+		if err := srv.Send(r.reqHeaderResp); err != nil {
+			logger.V(logutil.DEFAULT).Error(err, "error sending response")
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+		r.RequestState = HeaderRequestResponseComplete
+	}
+	if r.RequestState == HeaderRequestResponseComplete && r.reqBodyResp != nil {
+		loggerTrace.Info("Sending request body response")
+		if err := srv.Send(r.reqBodyResp); err != nil {
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+		r.RequestState = BodyRequestResponsesComplete
+		metrics.IncRunningRequests(r.Model)
+		r.RequestRunning = true
+		// Dump the response so a new stream message can begin
+		r.reqBodyResp = nil
+	}
+	if r.RequestState == BodyRequestResponsesComplete && r.reqTrailerResp != nil {
+		// Trailers in requests are not guaranteed
+		if err := srv.Send(r.reqTrailerResp); err != nil {
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+	}
+	if r.RequestState == ResponseReceived && r.respHeaderResp != nil {
+		loggerTrace.Info("Sending response header response", "obj", r.respHeaderResp)
+		if err := srv.Send(r.respHeaderResp); err != nil {
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+		r.RequestState = HeaderResponseResponseComplete
+	}
+	if r.RequestState == HeaderResponseResponseComplete && r.respBodyResp != nil {
+		loggerTrace.Info("Sending response body response")
+		if err := srv.Send(r.respBodyResp); err != nil {
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+
+		body := r.respBodyResp.Response.(*extProcPb.ProcessingResponse_ResponseBody)
+		if body.ResponseBody.Response.GetBodyMutation().GetStreamedResponse().GetEndOfStream() {
+			r.RequestState = BodyResponseResponsesComplete
+		}
+		// Dump the response so a new stream message can begin
+		r.respBodyResp = nil
+	}
+	if r.RequestState == BodyResponseResponsesComplete && r.respTrailerResp != nil {
+		// Trailers in responses are not guaranteed
+		if err := srv.Send(r.respTrailerResp); err != nil {
+			return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err)
+		}
+	}
+	return nil
+}
+
+func (s *StreamingServer) populateRequestHeaderResponse(reqCtx *RequestContext, endpoint string, requestBodyLength int) {
+	headers := []*configPb.HeaderValueOption{
+		{
+			Header: &configPb.HeaderValue{
+				Key:      s.destinationEndpointHintKey,
+				RawValue: []byte(endpoint),
+			},
+		},
+	}
+	if requestBodyLength > 0 {
+		// We need to update the content length header if the body is mutated, see Envoy doc:
+		// https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ext_proc/v3/processing_mode.proto
+		headers = append(headers, &configPb.HeaderValueOption{
+			Header: &configPb.HeaderValue{
+				Key:      "Content-Length",
+				RawValue: []byte(strconv.Itoa(requestBodyLength)),
+			},
+		})
+	}
+
+	targetEndpointValue := &structpb.Struct{
+		Fields: map[string]*structpb.Value{
+			s.destinationEndpointHintKey: {
+				Kind: &structpb.Value_StringValue{
+					StringValue: endpoint,
+				},
+			},
+		},
+	}
+	dynamicMetadata := targetEndpointValue
+	if s.destinationEndpointHintMetadataNamespace != "" {
+		// If a namespace is defined, wrap the selected endpoint with that.
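		// Conceptually, the dynamic metadata attached below takes one of these
		// two shapes (a sketch; the actual key names come from the flags that
		// configure destinationEndpointHintKey and
		// destinationEndpointHintMetadataNamespace):
		//
		//	{"<hint-key>": "<pod-ip>:<target-port>"}                   // no namespace configured
		//	{"<namespace>": {"<hint-key>": "<pod-ip>:<target-port>"}}  // namespace configured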
+		dynamicMetadata = &structpb.Struct{
+			Fields: map[string]*structpb.Value{
+				s.destinationEndpointHintMetadataNamespace: {
+					Kind: &structpb.Value_StructValue{
+						StructValue: targetEndpointValue,
+					},
+				},
+			},
+		}
+	}
+
+	reqCtx.reqHeaderResp = &extProcPb.ProcessingResponse{
+		Response: &extProcPb.ProcessingResponse_RequestHeaders{
+			RequestHeaders: &extProcPb.HeadersResponse{
+				Response: &extProcPb.CommonResponse{
+					ClearRouteCache: true,
+					HeaderMutation: &extProcPb.HeaderMutation{
+						SetHeaders: headers,
+					},
+				},
+			},
+		},
+		DynamicMetadata: dynamicMetadata,
+	}
+}
+
+func RandomWeightedDraw(logger logr.Logger, model *v1alpha2.InferenceModel, seed int64) string {
+	// TODO: after we are down to 1 server implementation, make these methods a part of the struct
+	// and handle random seeding on the struct.
+	source := rand.NewSource(rand.Int63())
+	if seed > 0 {
+		source = rand.NewSource(seed)
+	}
+	r := rand.New(source)
+
+	// If all the weight values are nil (we assume either all or none are set), return a random model name.
+	if model.Spec.TargetModels[0].Weight == nil {
+		index := r.Int31n(int32(len(model.Spec.TargetModels)))
+		return model.Spec.TargetModels[index].Name
+	}
+
+	var weights int32
+	for _, model := range model.Spec.TargetModels {
+		weights += *model.Weight
+	}
+	logger.V(logutil.TRACE).Info("Weights for model computed", "model", model.Name, "weights", weights)
+	randomVal := r.Int31n(weights)
+	// TODO: optimize this without using a loop
+	for _, model := range model.Spec.TargetModels {
+		if randomVal < *model.Weight {
+			return model.Name
+		}
+		randomVal -= *model.Weight
+	}
+	return ""
+}
+
+func GetRandomPod(ds datastore.Datastore) *backend.Pod {
+	pods := ds.PodGetAll()
+	if len(pods) == 0 {
+		return nil
+	}
+	number := rand.Intn(len(pods))
+	pod := pods[number]
+	return pod.GetPod()
+}
+
+func BuildErrResponse(err error) (*extProcPb.ProcessingResponse, error) {
+	var resp *extProcPb.ProcessingResponse
+
+	switch errutil.CanonicalCode(err) {
+	// This code can be returned by the scheduler when there is no capacity for sheddable
+	// requests.
+	case errutil.InferencePoolResourceExhausted:
+		resp = &extProcPb.ProcessingResponse{
+			Response: &extProcPb.ProcessingResponse_ImmediateResponse{
+				ImmediateResponse: &extProcPb.ImmediateResponse{
+					Status: &envoyTypePb.HttpStatus{
+						Code: envoyTypePb.StatusCode_TooManyRequests,
+					},
+				},
+			},
+		}
+	// This code can be returned when the EPP processes the request and runs into server-side errors.
+	case errutil.Internal:
+		resp = &extProcPb.ProcessingResponse{
+			Response: &extProcPb.ProcessingResponse_ImmediateResponse{
+				ImmediateResponse: &extProcPb.ImmediateResponse{
+					Status: &envoyTypePb.HttpStatus{
+						Code: envoyTypePb.StatusCode_InternalServerError,
+					},
+				},
+			},
+		}
+	// This code can be returned when users provide an invalid JSON request.
+ case errutil.BadRequest: + resp = &extProcPb.ProcessingResponse{ + Response: &extProcPb.ProcessingResponse_ImmediateResponse{ + ImmediateResponse: &extProcPb.ImmediateResponse{ + Status: &envoyTypePb.HttpStatus{ + Code: envoyTypePb.StatusCode_BadRequest, + }, + }, + }, + } + case errutil.BadConfiguration: + resp = &extProcPb.ProcessingResponse{ + Response: &extProcPb.ProcessingResponse_ImmediateResponse{ + ImmediateResponse: &extProcPb.ImmediateResponse{ + Status: &envoyTypePb.HttpStatus{ + Code: envoyTypePb.StatusCode_NotFound, + }, + }, + }, + } + default: + return nil, status.Errorf(status.Code(err), "failed to handle request: %v", err) + } + return resp, nil +} diff --git a/pkg/epp/handlers/server_test.go b/pkg/epp/handlers/server_test.go new file mode 100644 index 00000000..23d2b68f --- /dev/null +++ b/pkg/epp/handlers/server_test.go @@ -0,0 +1,186 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package handlers + +import ( + "testing" + "time" + + corev1 "k8s.io/api/core/v1" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +func TestRandomWeightedDraw(t *testing.T) { + logger := logutil.NewTestLogger() + tests := []struct { + name string + model *v1alpha2.InferenceModel + want string + }{ + { + name: "'random' distribution", + model: &v1alpha2.InferenceModel{ + Spec: v1alpha2.InferenceModelSpec{ + TargetModels: []v1alpha2.TargetModel{ + { + Name: "canary", + Weight: pointer(50), + }, + { + Name: "v1", + Weight: pointer(50), + }, + }, + }, + }, + want: "canary", + }, + { + name: "'random' distribution", + model: &v1alpha2.InferenceModel{ + Spec: v1alpha2.InferenceModelSpec{ + TargetModels: []v1alpha2.TargetModel{ + { + Name: "canary", + Weight: pointer(25), + }, + { + Name: "v1.1", + Weight: pointer(55), + }, + { + Name: "v1", + Weight: pointer(50), + }, + }, + }, + }, + want: "v1", + }, + { + name: "'random' distribution", + model: &v1alpha2.InferenceModel{ + Spec: v1alpha2.InferenceModelSpec{ + TargetModels: []v1alpha2.TargetModel{ + { + Name: "canary", + Weight: pointer(20), + }, + { + Name: "v1.1", + Weight: pointer(20), + }, + { + Name: "v1", + Weight: pointer(10), + }, + }, + }, + }, + want: "v1.1", + }, + { + name: "weighted distribution with weight unset", + model: &v1alpha2.InferenceModel{ + Spec: v1alpha2.InferenceModelSpec{ + TargetModels: []v1alpha2.TargetModel{ + { + Name: "canary", + }, + { + Name: "v1.1", + }, + { + Name: "v1", + }, + }, + }, + }, + want: "canary", + }, + } + var seedVal int64 = 420 + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + for range 10000 { + model := RandomWeightedDraw(logger, test.model, seedVal) + if model != test.want { + t.Errorf("Model returned: %v != %v", model, test.want) + break + } + } + }) + } +} + 
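// To make the weighted draw above concrete: with weights canary=25, v1.1=55,
// v1=50 the total is 130, so a randomVal in [0, 25) selects canary, [25, 80)
// selects v1.1, and [80, 130) selects v1; the loop subtracts each weight until
// randomVal falls below the current one. A hypothetical standalone sketch of
// the same walk:
//
//	func weightedPick(randomVal int32, names []string, weights []int32) string {
//		for i, w := range weights {
//			if randomVal < w {
//				return names[i]
//			}
//			randomVal -= w
//		}
//		return "" // unreachable when randomVal < sum(weights)
//	}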
+func TestGetRandomPod(t *testing.T) { + tests := []struct { + name string + storePods []*corev1.Pod + expectNil bool + }{ + { + name: "No pods available", + storePods: []*corev1.Pod{}, + expectNil: true, + }, + { + name: "Single pod available", + storePods: []*corev1.Pod{ + {ObjectMeta: metav1.ObjectMeta{Name: "pod1"}}, + }, + expectNil: false, + }, + { + name: "Multiple pods available", + storePods: []*corev1.Pod{ + {ObjectMeta: metav1.ObjectMeta{Name: "pod1"}}, + {ObjectMeta: metav1.ObjectMeta{Name: "pod2"}}, + {ObjectMeta: metav1.ObjectMeta{Name: "pod3"}}, + }, + expectNil: false, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + pmf := metrics.NewPodMetricsFactory(&metrics.FakePodMetricsClient{}, time.Millisecond) + ds := datastore.NewDatastore(t.Context(), pmf) + for _, pod := range test.storePods { + ds.PodUpdateOrAddIfNotExist(pod) + } + + gotPod := GetRandomPod(ds) + + if test.expectNil && gotPod != nil { + t.Errorf("expected nil pod, got: %v", gotPod) + } + if !test.expectNil && gotPod == nil { + t.Errorf("expected non-nil pod, got nil") + } + }) + } +} + +func pointer(v int32) *int32 { + return &v +} diff --git a/pkg/epp/metrics/metrics.go b/pkg/epp/metrics/metrics.go new file mode 100644 index 00000000..6cc0cdb8 --- /dev/null +++ b/pkg/epp/metrics/metrics.go @@ -0,0 +1,359 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package metrics + +import ( + "context" + "sync" + "time" + + compbasemetrics "k8s.io/component-base/metrics" + "k8s.io/component-base/metrics/legacyregistry" + "sigs.k8s.io/controller-runtime/pkg/log" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const ( + InferenceModelComponent = "inference_model" + InferencePoolComponent = "inference_pool" + InferenceExtension = "inference_extension" +) + +var ( + // The git hash of the latest commit in the build. 
+	CommitSHA string
+)
+
+var (
+	// Inference Model Metrics
+	requestCounter = compbasemetrics.NewCounterVec(
+		&compbasemetrics.CounterOpts{
+			Subsystem:      InferenceModelComponent,
+			Name:           "request_total",
+			Help:           "Counter of inference model requests broken out for each model and target model.",
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name"},
+	)
+
+	requestErrCounter = compbasemetrics.NewCounterVec(
+		&compbasemetrics.CounterOpts{
+			Subsystem:      InferenceModelComponent,
+			Name:           "request_error_total",
+			Help:           "Counter of inference model requests errors broken out for each model and target model.",
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name", "error_code"},
+	)
+
+	requestLatencies = compbasemetrics.NewHistogramVec(
+		&compbasemetrics.HistogramOpts{
+			Subsystem: InferenceModelComponent,
+			Name:      "request_duration_seconds",
+			Help:      "Inference model response latency distribution in seconds for each model and target model.",
+			Buckets: []float64{
+				0.005, 0.025, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5, 2, 3,
+				4, 5, 6, 8, 10, 15, 20, 30, 45, 60, 120, 180, 240, 300, 360, 480, 600, 900, 1200, 1800, 2700, 3600,
+			},
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name"},
+	)
+
+	requestSizes = compbasemetrics.NewHistogramVec(
+		&compbasemetrics.HistogramOpts{
+			Subsystem: InferenceModelComponent,
+			Name:      "request_sizes",
+			Help:      "Inference model requests size distribution in bytes for each model and target model.",
+			// Use buckets ranging from 64 bytes to 2^30 bytes (1GB).
+			Buckets: []float64{
+				64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, // More fine-grained up to 64KB
+				131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, // Exponential up to 8MB
+				16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, // Exponential up to 1GB
+			},
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name"},
+	)
+
+	responseSizes = compbasemetrics.NewHistogramVec(
+		&compbasemetrics.HistogramOpts{
+			Subsystem: InferenceModelComponent,
+			Name:      "response_sizes",
+			Help:      "Inference model responses size distribution in bytes for each model and target model.",
+			// Most responses are under 8192 tokens, and each token averages about 4 characters:
+			// 8192 * 4 = 32768.
+			Buckets:        []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32778, 65536},
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name"},
+	)
+
+	inputTokens = compbasemetrics.NewHistogramVec(
+		&compbasemetrics.HistogramOpts{
+			Subsystem: InferenceModelComponent,
+			Name:      "input_tokens",
+			Help:      "Inference model input token count distribution for requests in each model.",
+			// Most models have an input context window of less than 1 million tokens.
+			Buckets:        []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32778, 65536, 131072, 262144, 524288, 1048576},
+			StabilityLevel: compbasemetrics.ALPHA,
+		},
+		[]string{"model_name", "target_model_name"},
+	)
+
+	outputTokens = compbasemetrics.NewHistogramVec(
+		&compbasemetrics.HistogramOpts{
+			Subsystem: InferenceModelComponent,
+			Name:      "output_tokens",
+			Help:      "Inference model output token count distribution for requests in each model.",
+			// Most models generate output of less than 8192 tokens.
+ Buckets: []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192}, + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"model_name", "target_model_name"}, + ) + + runningRequests = compbasemetrics.NewGaugeVec( + &compbasemetrics.GaugeOpts{ + Subsystem: InferenceModelComponent, + Name: "running_requests", + Help: "Inference model number of running requests in each model.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"model_name"}, + ) + + // NTPOT - Normalized Time Per Output Token + NormalizedTimePerOutputToken = compbasemetrics.NewHistogramVec( + &compbasemetrics.HistogramOpts{ + Subsystem: InferenceModelComponent, + Name: "normalized_time_per_output_token_seconds", + Help: "Inference model latency divided by number of output tokens in seconds for each model and target model.", + // From few milliseconds per token to multiple seconds per token + Buckets: []float64{ + 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0, + }, + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"model_name", "target_model_name"}, + ) + + // Inference Pool Metrics + inferencePoolAvgKVCache = compbasemetrics.NewGaugeVec( + &compbasemetrics.GaugeOpts{ + Subsystem: InferencePoolComponent, + Name: "average_kv_cache_utilization", + Help: "The average kv cache utilization for an inference server pool.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"name"}, + ) + + inferencePoolAvgQueueSize = compbasemetrics.NewGaugeVec( + &compbasemetrics.GaugeOpts{ + Subsystem: InferencePoolComponent, + Name: "average_queue_size", + Help: "The average number of requests pending in the model server queue.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"name"}, + ) + + inferencePoolReadyPods = compbasemetrics.NewGaugeVec( + &compbasemetrics.GaugeOpts{ + Subsystem: InferencePoolComponent, + Name: "ready_pods", + Help: "The number of ready pods in the inference server pool.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"name"}, + ) + + // Scheduler Metrics + SchedulerE2ELatency = compbasemetrics.NewHistogramVec( + &compbasemetrics.HistogramOpts{ + Subsystem: InferenceExtension, + Name: "scheduler_e2e_duration_seconds", + Help: "End-to-end scheduling latency distribution in seconds.", + Buckets: []float64{ + 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, + }, + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{}, + ) + SchedulerPluginProcessingLatencies = compbasemetrics.NewHistogramVec( + &compbasemetrics.HistogramOpts{ + Subsystem: InferenceExtension, + Name: "scheduler_plugin_duration_seconds", + Help: "Scheduler plugin processing latency distribution in seconds for each plugin type and plugin name.", + Buckets: []float64{ + 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, + }, + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"plugin_type", "plugin_name"}, + ) + + // Info Metrics + InferenceExtensionInfo = compbasemetrics.NewGaugeVec( + &compbasemetrics.GaugeOpts{ + Subsystem: InferenceExtension, + Name: "info", + Help: "General information of the current build of Inference Extension.", + StabilityLevel: compbasemetrics.ALPHA, + }, + []string{"commit"}, + ) +) + +var registerMetrics sync.Once + +// Register all metrics. 
+func Register() {
+	registerMetrics.Do(func() {
+		legacyregistry.MustRegister(requestCounter)
+		legacyregistry.MustRegister(requestErrCounter)
+		legacyregistry.MustRegister(requestLatencies)
+		legacyregistry.MustRegister(requestSizes)
+		legacyregistry.MustRegister(responseSizes)
+		legacyregistry.MustRegister(inputTokens)
+		legacyregistry.MustRegister(outputTokens)
+		legacyregistry.MustRegister(runningRequests)
+		legacyregistry.MustRegister(NormalizedTimePerOutputToken)
+
+		legacyregistry.MustRegister(inferencePoolAvgKVCache)
+		legacyregistry.MustRegister(inferencePoolAvgQueueSize)
+		legacyregistry.MustRegister(inferencePoolReadyPods)
+
+		legacyregistry.MustRegister(SchedulerPluginProcessingLatencies)
+		legacyregistry.MustRegister(SchedulerE2ELatency)
+
+		legacyregistry.MustRegister(InferenceExtensionInfo)
+	})
+}
+
+// RecordRequestCounter records the number of requests.
+func RecordRequestCounter(modelName, targetModelName string) {
+	requestCounter.WithLabelValues(modelName, targetModelName).Inc()
+}
+
+// RecordRequestErrCounter records the number of error requests.
+func RecordRequestErrCounter(modelName, targetModelName string, code string) {
+	if code != "" {
+		requestErrCounter.WithLabelValues(modelName, targetModelName, code).Inc()
+	}
+}
+
+// RecordRequestSizes records the request sizes.
+func RecordRequestSizes(modelName, targetModelName string, reqSize int) {
+	requestSizes.WithLabelValues(modelName, targetModelName).Observe(float64(reqSize))
+}
+
+// RecordRequestLatencies records the duration of a request.
+func RecordRequestLatencies(ctx context.Context, modelName, targetModelName string, received time.Time, complete time.Time) bool {
+	if !complete.After(received) {
+		log.FromContext(ctx).V(logutil.DEFAULT).Error(nil, "Request latency values are invalid",
+			"modelName", modelName, "targetModelName", targetModelName, "completeTime", complete, "receivedTime", received)
+		return false
+	}
+	elapsedSeconds := complete.Sub(received).Seconds()
+	requestLatencies.WithLabelValues(modelName, targetModelName).Observe(elapsedSeconds)
+	return true
+}
+
+// RecordResponseSizes records the response sizes.
+func RecordResponseSizes(modelName, targetModelName string, size int) {
+	responseSizes.WithLabelValues(modelName, targetModelName).Observe(float64(size))
+}
+
+// RecordInputTokens records the input token count.
+func RecordInputTokens(modelName, targetModelName string, size int) {
+	if size > 0 {
+		inputTokens.WithLabelValues(modelName, targetModelName).Observe(float64(size))
+	}
+}
+
+// RecordOutputTokens records the output token count.
+func RecordOutputTokens(modelName, targetModelName string, size int) {
+	if size > 0 {
+		outputTokens.WithLabelValues(modelName, targetModelName).Observe(float64(size))
+	}
+}
+
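// The normalized time per output token recorded below is a plain ratio: for
// example, a request that completes 1.6s after it was received and produces 80
// output tokens is recorded as 1.6 / 80 = 0.02 seconds (20ms) per token, which
// matches the scenarios exercised in metrics_test.go:
//
//	elapsedSeconds := complete.Sub(received).Seconds()             // e.g. 1.6
//	secondsPerToken := elapsedSeconds / float64(outputTokenCount)  // 1.6 / 80 = 0.02
+// RecordNormalizedTimePerOutputToken (NTPOT) records the normalized time per output token.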
+func RecordNormalizedTimePerOutputToken(ctx context.Context, modelName, targetModelName string, received time.Time, complete time.Time, outputTokenCount int) bool { + if !complete.After(received) { + log.FromContext(ctx).Error(nil, "Request latency values are invalid for NTPOT calculation", + "modelName", modelName, "targetModelName", targetModelName, "completeTime", complete, "receivedTime", received) + return false + } + + if outputTokenCount <= 0 { + log.FromContext(ctx).Error(nil, "Output token count must be positive for NTPOT calculation", + "modelName", modelName, "targetModelName", targetModelName, "outputTokenCount", outputTokenCount) + return false + } + + elapsedSeconds := complete.Sub(received).Seconds() + secondsPerToken := elapsedSeconds / float64(outputTokenCount) + + NormalizedTimePerOutputToken.WithLabelValues(modelName, targetModelName).Observe(secondsPerToken) + return true +} + +// IncRunningRequests increases the current running requests. +func IncRunningRequests(modelName string) { + if modelName != "" { + runningRequests.WithLabelValues(modelName).Inc() + } +} + +// DecRunningRequests decreases the current running requests. +func DecRunningRequests(modelName string) { + if modelName != "" { + runningRequests.WithLabelValues(modelName).Dec() + } +} + +func RecordInferencePoolAvgKVCache(name string, utilization float64) { + inferencePoolAvgKVCache.WithLabelValues(name).Set(utilization) +} + +func RecordInferencePoolAvgQueueSize(name string, queueSize float64) { + inferencePoolAvgQueueSize.WithLabelValues(name).Set(queueSize) +} + +func RecordinferencePoolReadyPods(name string, runningPods float64) { + inferencePoolReadyPods.WithLabelValues(name).Set(runningPods) +} + +// RecordSchedulerPluginProcessingLatency records the processing latency for a scheduler plugin. +func RecordSchedulerPluginProcessingLatency(pluginType, pluginName string, duration time.Duration) { + SchedulerPluginProcessingLatencies.WithLabelValues(pluginType, pluginName).Observe(duration.Seconds()) +} + +// RecordSchedulerE2ELatency records the end-to-end scheduling latency. +func RecordSchedulerE2ELatency(duration time.Duration) { + SchedulerE2ELatency.WithLabelValues().Observe(duration.Seconds()) +} + +func RecordInferenceExtensionInfo() { + if CommitSHA != "" { + InferenceExtensionInfo.WithLabelValues(CommitSHA).Set(1) + } +} diff --git a/pkg/epp/metrics/metrics_test.go b/pkg/epp/metrics/metrics_test.go new file mode 100644 index 00000000..a2311517 --- /dev/null +++ b/pkg/epp/metrics/metrics_test.go @@ -0,0 +1,665 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package metrics + +import ( + "context" + "os" + "testing" + "time" + + "k8s.io/component-base/metrics/legacyregistry" + "k8s.io/component-base/metrics/testutil" + errutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/error" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +const ( + RequestTotalMetric = InferenceModelComponent + "_request_total" + RequestErrorTotalMetric = InferenceModelComponent + "_request_error_total" + RequestLatenciesMetric = InferenceModelComponent + "_request_duration_seconds" + RequestSizesMetric = InferenceModelComponent + "_request_sizes" + ResponseSizesMetric = InferenceModelComponent + "_response_sizes" + InputTokensMetric = InferenceModelComponent + "_input_tokens" + OutputTokensMetric = InferenceModelComponent + "_output_tokens" + NormalizedTimePerOutputTokenMetric = InferenceModelComponent + "_normalized_time_per_output_token_seconds" + RunningRequestsMetric = InferenceModelComponent + "_running_requests" + KVCacheAvgUsageMetric = InferencePoolComponent + "_average_kv_cache_utilization" + QueueAvgSizeMetric = InferencePoolComponent + "_average_queue_size" +) + +func TestRecordRequestCounterandSizes(t *testing.T) { + type requests struct { + modelName string + targetModelName string + reqSize int + } + scenarios := []struct { + name string + reqs []requests + }{{ + name: "multiple requests", + reqs: []requests{ + { + modelName: "m10", + targetModelName: "t10", + reqSize: 1200, + }, + { + modelName: "m10", + targetModelName: "t10", + reqSize: 500, + }, + { + modelName: "m10", + targetModelName: "t11", + reqSize: 2480, + }, + { + modelName: "m20", + targetModelName: "t20", + reqSize: 80, + }, + }, + }} + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, req := range scenario.reqs { + RecordRequestCounter(req.modelName, req.targetModelName) + RecordRequestSizes(req.modelName, req.targetModelName, req.reqSize) + } + wantRequestTotal, err := os.Open("testdata/request_total_metric") + defer func() { + if err := wantRequestTotal.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestTotal, RequestTotalMetric); err != nil { + t.Error(err) + } + wantRequestSizes, err := os.Open("testdata/request_sizes_metric") + defer func() { + if err := wantRequestSizes.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestSizes, RequestSizesMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestRecordRequestErrorCounter(t *testing.T) { + type requests struct { + modelName string + targetModelName string + error string + } + scenarios := []struct { + name string + reqs []requests + invalid bool + }{ + { + name: "multiple requests", + reqs: []requests{ + { + modelName: "m10", + targetModelName: "t10", + error: errutil.Internal, + }, + { + modelName: "m10", + targetModelName: "t10", + error: errutil.Internal, + }, + { + modelName: "m10", + targetModelName: "t11", + error: errutil.ModelServerError, + }, + { + modelName: "m20", + targetModelName: "t20", + error: errutil.InferencePoolResourceExhausted, + }, + }, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, req := range scenario.reqs { + RecordRequestErrCounter(req.modelName, req.targetModelName, req.error) + } + + wantRequestErrorCounter, 
err := os.Open("testdata/request_error_total_metric") + defer func() { + if err := wantRequestErrorCounter.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestErrorCounter, RequestErrorTotalMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestRecordRequestLatencies(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + timeBaseline := time.Now() + type requests struct { + modelName string + targetModelName string + receivedTime time.Time + completeTime time.Time + } + scenarios := []struct { + name string + reqs []requests + invalid bool + }{ + { + name: "multiple requests", + reqs: []requests{ + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 10), + }, + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 1600), + }, + { + modelName: "m10", + targetModelName: "t11", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 60), + }, + { + modelName: "m20", + targetModelName: "t20", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 120), + }, + }, + }, + { + name: "invalid elapsed time", + reqs: []requests{ + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline.Add(time.Millisecond * 10), + completeTime: timeBaseline, + }, + }, + invalid: true, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, req := range scenario.reqs { + success := RecordRequestLatencies(ctx, req.modelName, req.targetModelName, req.receivedTime, req.completeTime) + if success == scenario.invalid { + t.Errorf("got record success(%v), but the request expects invalid(%v)", success, scenario.invalid) + } + } + + wantRequestLatencies, err := os.Open("testdata/request_duration_seconds_metric") + defer func() { + if err := wantRequestLatencies.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestLatencies, RequestLatenciesMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestRecordNormalizedTimePerOutputToken(t *testing.T) { + ctx := logutil.NewTestLoggerIntoContext(context.Background()) + timeBaseline := time.Now() + type tokenRequests struct { + modelName string + targetModelName string + receivedTime time.Time + completeTime time.Time + outputTokens int + } + scenarios := []struct { + name string + reqs []tokenRequests + invalid bool + }{ + { + name: "multiple requests", + reqs: []tokenRequests{ + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 1000), + outputTokens: 100, // 10ms per token + }, + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 1600), + outputTokens: 80, // 20ms per token + }, + { + modelName: "m10", + targetModelName: "t11", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 6000), + outputTokens: 300, // 20ms per token + }, + { + modelName: "m20", + targetModelName: "t20", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 2400), + outputTokens: 400, // 6ms per token + }, + }, + }, + { + name: "invalid elapsed time", 
+ reqs: []tokenRequests{ + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline.Add(time.Millisecond * 10), + completeTime: timeBaseline, + outputTokens: 100, + }, + }, + invalid: true, + }, + { + name: "invalid token count", + reqs: []tokenRequests{ + { + modelName: "m10", + targetModelName: "t10", + receivedTime: timeBaseline, + completeTime: timeBaseline.Add(time.Millisecond * 1000), + outputTokens: 0, // Invalid: zero tokens + }, + }, + invalid: true, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, req := range scenario.reqs { + success := RecordNormalizedTimePerOutputToken(ctx, req.modelName, req.targetModelName, req.receivedTime, req.completeTime, req.outputTokens) + if success == scenario.invalid { + t.Errorf("got record success(%v), but the request expects invalid(%v)", success, scenario.invalid) + } + } + + wantLatencyPerToken, err := os.Open("testdata/normalized_time_per_output_token_seconds_metric") + defer func() { + if err := wantLatencyPerToken.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantLatencyPerToken, NormalizedTimePerOutputTokenMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestRecordResponseMetrics(t *testing.T) { + type responses struct { + modelName string + targetModelName string + inputToken int + outputToken int + respSize int + } + scenarios := []struct { + name string + resp []responses + }{{ + name: "multiple requests", + resp: []responses{ + { + modelName: "m10", + targetModelName: "t10", + respSize: 1200, + inputToken: 10, + outputToken: 100, + }, + { + modelName: "m10", + targetModelName: "t10", + respSize: 500, + inputToken: 20, + outputToken: 200, + }, + { + modelName: "m10", + targetModelName: "t11", + respSize: 2480, + inputToken: 30, + outputToken: 300, + }, + { + modelName: "m20", + targetModelName: "t20", + respSize: 80, + inputToken: 40, + outputToken: 400, + }, + }, + }} + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, resp := range scenario.resp { + RecordInputTokens(resp.modelName, resp.targetModelName, resp.inputToken) + RecordOutputTokens(resp.modelName, resp.targetModelName, resp.outputToken) + RecordResponseSizes(resp.modelName, resp.targetModelName, resp.respSize) + } + wantResponseSize, err := os.Open("testdata/response_sizes_metric") + defer func() { + if err := wantResponseSize.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantResponseSize, ResponseSizesMetric); err != nil { + t.Error(err) + } + + wantInputToken, err := os.Open("testdata/input_tokens_metric") + defer func() { + if err := wantInputToken.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantInputToken, InputTokensMetric); err != nil { + t.Error(err) + } + + wantOutputToken, err := os.Open("testdata/output_tokens_metric") + defer func() { + if err := wantOutputToken.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantOutputToken, OutputTokensMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestRunningRequestsMetrics(t *testing.T) { + type request struct { + modelName string + complete 
bool // true -> request is completed, false -> running request + } + + scenarios := []struct { + name string + requests []request + }{ + { + name: "basic test", + requests: []request{ + { + modelName: "m1", + complete: false, + }, + { + modelName: "m1", + complete: false, + }, + { + modelName: "m1", + complete: true, + }, + { + modelName: "m2", + complete: false, + }, + }, + }, + } + + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, req := range scenario.requests { + if req.complete { + DecRunningRequests(req.modelName) + } else { + IncRunningRequests(req.modelName) + } + } + + wantRunningRequests, err := os.Open("testdata/running_requests_metrics") + defer func() { + if err := wantRunningRequests.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRunningRequests, RunningRequestsMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestInferencePoolMetrics(t *testing.T) { + scenarios := []struct { + name string + poolName string + kvCacheAvg float64 + queueSizeAvg float64 + }{ + { + name: "basic test", + poolName: "p1", + kvCacheAvg: 0.3, + queueSizeAvg: 0.4, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + RecordInferencePoolAvgKVCache(scenario.poolName, scenario.kvCacheAvg) + RecordInferencePoolAvgQueueSize(scenario.poolName, scenario.queueSizeAvg) + + wantKVCache, err := os.Open("testdata/kv_cache_avg_metrics") + defer func() { + if err := wantKVCache.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantKVCache, KVCacheAvgUsageMetric); err != nil { + t.Error(err) + } + + wantQueueSize, err := os.Open("testdata/queue_avg_size_metrics") + defer func() { + if err := wantQueueSize.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantQueueSize, QueueAvgSizeMetric); err != nil { + t.Error(err) + } + }) + } +} + +func TestSchedulerPluginProcessingLatencies(t *testing.T) { + type pluginLatency struct { + pluginType string + pluginName string + duration time.Duration + } + scenarios := []struct { + name string + latencies []pluginLatency + }{ + { + name: "multiple plugins", + latencies: []pluginLatency{ + { + pluginType: "PreSchedule", + pluginName: "PluginA", + duration: 100 * time.Millisecond, + }, + { + pluginType: "PostSchedule", + pluginName: "PluginB", + duration: 200 * time.Millisecond, + }, + { + pluginType: "Filter", + pluginName: "PluginC", + duration: 50 * time.Millisecond, + }, + { + pluginType: "Scorer", + pluginName: "PluginD", + duration: 10 * time.Millisecond, + }, + { + pluginType: "Picker", + pluginName: "PluginE", + duration: 10 * time.Microsecond, + }, + }, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, latency := range scenario.latencies { + RecordSchedulerPluginProcessingLatency(latency.pluginType, latency.pluginName, latency.duration) + } + + wantPluginLatencies, err := os.Open("testdata/scheduler_plugin_processing_latencies_metric") + defer func() { + if err := wantPluginLatencies.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantPluginLatencies, 
"inference_extension_scheduler_plugin_duration_seconds"); err != nil { + t.Error(err) + } + }) + } +} + +func TestSchedulerE2ELatency(t *testing.T) { + scenarios := []struct { + name string + durations []time.Duration + }{ + { + name: "multiple scheduling latencies", + durations: []time.Duration{ + 200 * time.Microsecond, // 0.00014s - should go in the 0.0002 bucket + 800 * time.Microsecond, // 0.0008s - should go in the 0.001 bucket + 1500 * time.Microsecond, // 0.0015s - should go in the 0.002 bucket + 3 * time.Millisecond, // 0.003s - should go in the 0.005 bucket + 8 * time.Millisecond, // 0.008s - should go in the 0.01 bucket + 15 * time.Millisecond, // 0.015s - should go in the 0.02 bucket + 30 * time.Millisecond, // 0.03s - should go in the 0.05 bucket + 75 * time.Millisecond, // 0.075s - should go in the 0.1 bucket + 150 * time.Millisecond, // 0.15s - should go in the +Inf bucket + }, + }, + } + Register() + for _, scenario := range scenarios { + t.Run(scenario.name, func(t *testing.T) { + for _, duration := range scenario.durations { + RecordSchedulerE2ELatency(duration) + } + + wantE2ELatency, err := os.Open("testdata/scheduler_e2e_duration_seconds_metric") + defer func() { + if err := wantE2ELatency.Close(); err != nil { + t.Error(err) + } + }() + if err != nil { + t.Fatal(err) + } + if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantE2ELatency, "inference_extension_scheduler_e2e_duration_seconds"); err != nil { + t.Error(err) + } + }) + } +} diff --git a/pkg/ext-proc/metrics/testdata/input_tokens_metric b/pkg/epp/metrics/testdata/input_tokens_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/input_tokens_metric rename to pkg/epp/metrics/testdata/input_tokens_metric diff --git a/pkg/epp/metrics/testdata/kv_cache_avg_metrics b/pkg/epp/metrics/testdata/kv_cache_avg_metrics new file mode 100644 index 00000000..99d1a93a --- /dev/null +++ b/pkg/epp/metrics/testdata/kv_cache_avg_metrics @@ -0,0 +1,3 @@ +# HELP inference_pool_average_kv_cache_utilization [ALPHA] The average kv cache utilization for an inference server pool. +# TYPE inference_pool_average_kv_cache_utilization gauge +inference_pool_average_kv_cache_utilization{name="p1"} 0.3 diff --git a/pkg/epp/metrics/testdata/normalized_time_per_output_token_seconds_metric b/pkg/epp/metrics/testdata/normalized_time_per_output_token_seconds_metric new file mode 100644 index 00000000..bb6e9373 --- /dev/null +++ b/pkg/epp/metrics/testdata/normalized_time_per_output_token_seconds_metric @@ -0,0 +1,50 @@ +# HELP inference_model_normalized_time_per_output_token_seconds [ALPHA] Inference model latency divided by number of output tokens in seconds for each model and target model. 
+# TYPE inference_model_normalized_time_per_output_token_seconds histogram +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.001"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.002"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.005"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.01"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.02"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.05"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.1"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.2"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="0.5"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="1.0"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="2.0"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="5.0"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="10.0"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t10", le="+Inf"} 2 +inference_model_normalized_time_per_output_token_seconds_sum{model_name="m10", target_model_name="t10"} 0.03 +inference_model_normalized_time_per_output_token_seconds_count{model_name="m10", target_model_name="t10"} 2 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.001"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.002"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.005"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.01"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.02"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.05"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.1"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.2"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="0.5"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="1.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="2.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="5.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="10.0"} 1 
+inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m10", target_model_name="t11", le="+Inf"} 1 +inference_model_normalized_time_per_output_token_seconds_sum{model_name="m10", target_model_name="t11"} 0.02 +inference_model_normalized_time_per_output_token_seconds_count{model_name="m10", target_model_name="t11"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.001"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.002"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.005"} 0 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.01"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.02"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.05"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.1"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.2"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="0.5"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="1.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="2.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="5.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="10.0"} 1 +inference_model_normalized_time_per_output_token_seconds_bucket{model_name="m20", target_model_name="t20", le="+Inf"} 1 +inference_model_normalized_time_per_output_token_seconds_sum{model_name="m20", target_model_name="t20"} 0.006 +inference_model_normalized_time_per_output_token_seconds_count{model_name="m20", target_model_name="t20"} 1 diff --git a/pkg/ext-proc/metrics/testdata/output_tokens_metric b/pkg/epp/metrics/testdata/output_tokens_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/output_tokens_metric rename to pkg/epp/metrics/testdata/output_tokens_metric diff --git a/pkg/epp/metrics/testdata/queue_avg_size_metrics b/pkg/epp/metrics/testdata/queue_avg_size_metrics new file mode 100644 index 00000000..3605740c --- /dev/null +++ b/pkg/epp/metrics/testdata/queue_avg_size_metrics @@ -0,0 +1,3 @@ +# HELP inference_pool_average_queue_size [ALPHA] The average number of requests pending in the model server queue. 
+# TYPE inference_pool_average_queue_size gauge +inference_pool_average_queue_size{name="p1"} 0.4 diff --git a/pkg/ext-proc/metrics/testdata/request_duration_seconds_metric b/pkg/epp/metrics/testdata/request_duration_seconds_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/request_duration_seconds_metric rename to pkg/epp/metrics/testdata/request_duration_seconds_metric diff --git a/pkg/epp/metrics/testdata/request_error_total_metric b/pkg/epp/metrics/testdata/request_error_total_metric new file mode 100644 index 00000000..31036eb6 --- /dev/null +++ b/pkg/epp/metrics/testdata/request_error_total_metric @@ -0,0 +1,5 @@ +# HELP inference_model_request_error_total [ALPHA] Counter of inference model requests errors broken out for each model and target model. +# TYPE inference_model_request_error_total counter +inference_model_request_error_total{error_code="Internal", model_name="m10",target_model_name="t10"} 2 +inference_model_request_error_total{error_code="ModelServerError", model_name="m10",target_model_name="t11"} 1 +inference_model_request_error_total{error_code="InferencePoolResourceExhausted", model_name="m20",target_model_name="t20"} 1 diff --git a/pkg/ext-proc/metrics/testdata/request_sizes_metric b/pkg/epp/metrics/testdata/request_sizes_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/request_sizes_metric rename to pkg/epp/metrics/testdata/request_sizes_metric diff --git a/pkg/ext-proc/metrics/testdata/request_total_metric b/pkg/epp/metrics/testdata/request_total_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/request_total_metric rename to pkg/epp/metrics/testdata/request_total_metric diff --git a/pkg/ext-proc/metrics/testdata/response_sizes_metric b/pkg/epp/metrics/testdata/response_sizes_metric similarity index 100% rename from pkg/ext-proc/metrics/testdata/response_sizes_metric rename to pkg/epp/metrics/testdata/response_sizes_metric diff --git a/pkg/epp/metrics/testdata/running_requests_metrics b/pkg/epp/metrics/testdata/running_requests_metrics new file mode 100644 index 00000000..a880e499 --- /dev/null +++ b/pkg/epp/metrics/testdata/running_requests_metrics @@ -0,0 +1,4 @@ +# HELP inference_model_running_requests [ALPHA] Inference model number of running requests in each model. +# TYPE inference_model_running_requests gauge +inference_model_running_requests{model_name="m1"} 1 +inference_model_running_requests{model_name="m2"} 1 diff --git a/pkg/epp/metrics/testdata/scheduler_e2e_duration_seconds_metric b/pkg/epp/metrics/testdata/scheduler_e2e_duration_seconds_metric new file mode 100644 index 00000000..0bbb35b1 --- /dev/null +++ b/pkg/epp/metrics/testdata/scheduler_e2e_duration_seconds_metric @@ -0,0 +1,15 @@ +# HELP inference_extension_scheduler_e2e_duration_seconds [ALPHA] End-to-end scheduling latency distribution in seconds. 
+# TYPE inference_extension_scheduler_e2e_duration_seconds histogram +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0001"} 0 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0002"} 1 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.0005"} 1 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.001"} 2 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.002"} 3 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.005"} 4 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.01"} 5 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.02"} 6 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.05"} 7 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="0.1"} 8 +inference_extension_scheduler_e2e_duration_seconds_bucket{le="+Inf"} 9 +inference_extension_scheduler_e2e_duration_seconds_sum{} 0.2835 +inference_extension_scheduler_e2e_duration_seconds_count{} 9 diff --git a/pkg/epp/metrics/testdata/scheduler_plugin_processing_latencies_metric b/pkg/epp/metrics/testdata/scheduler_plugin_processing_latencies_metric new file mode 100644 index 00000000..669d64da --- /dev/null +++ b/pkg/epp/metrics/testdata/scheduler_plugin_processing_latencies_metric @@ -0,0 +1,67 @@ +# HELP inference_extension_scheduler_plugin_duration_seconds [ALPHA] Scheduler plugin processing latency distribution in seconds for each plugin type and plugin name. +# TYPE inference_extension_scheduler_plugin_duration_seconds histogram +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.0001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.0002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.0005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.01"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.02"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.05"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="0.1"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginA",plugin_type="PreSchedule",le="+Inf"} 1 +inference_extension_scheduler_plugin_duration_seconds_sum{plugin_name="PluginA",plugin_type="PreSchedule"} 0.1 +inference_extension_scheduler_plugin_duration_seconds_count{plugin_name="PluginA",plugin_type="PreSchedule"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.0001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.0002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.0005"} 0 
+inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.01"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.02"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.05"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="0.1"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginB",plugin_type="PostSchedule",le="+Inf"} 1 +inference_extension_scheduler_plugin_duration_seconds_sum{plugin_name="PluginB",plugin_type="PostSchedule"} 0.2 +inference_extension_scheduler_plugin_duration_seconds_count{plugin_name="PluginB",plugin_type="PostSchedule"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.0001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.0002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.0005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.01"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.02"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.05"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="0.1"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginC",plugin_type="Filter",le="+Inf"} 1 +inference_extension_scheduler_plugin_duration_seconds_sum{plugin_name="PluginC",plugin_type="Filter"} 0.05 +inference_extension_scheduler_plugin_duration_seconds_count{plugin_name="PluginC",plugin_type="Filter"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.0001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.0002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.0005"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.001"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.002"} 0 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.005"} 0 
+inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.01"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.02"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.05"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="0.1"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginD",plugin_type="Scorer",le="+Inf"} 1 +inference_extension_scheduler_plugin_duration_seconds_sum{plugin_name="PluginD",plugin_type="Scorer"} 0.01 +inference_extension_scheduler_plugin_duration_seconds_count{plugin_name="PluginD",plugin_type="Scorer"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.0001"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.0002"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.0005"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.001"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.002"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.005"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.01"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.02"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.05"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="0.1"} 1 +inference_extension_scheduler_plugin_duration_seconds_bucket{plugin_name="PluginE",plugin_type="Picker",le="+Inf"} 1 +inference_extension_scheduler_plugin_duration_seconds_sum{plugin_name="PluginE",plugin_type="Picker"} 1e-05 +inference_extension_scheduler_plugin_duration_seconds_count{plugin_name="PluginE",plugin_type="Picker"} 1 diff --git a/pkg/epp/scheduling/config.go b/pkg/epp/scheduling/config.go new file mode 100644 index 00000000..0c33088b --- /dev/null +++ b/pkg/epp/scheduling/config.go @@ -0,0 +1,47 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package scheduling + +import "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins" + +// SchedulerConfig provides a configuration for the scheduler which includes +// items like filters, scorers, etc that influence routing decisions. +// +// This is not threadsafe and the machinery here does not support dynamically +// changing this at runtime, so this should be set once on startup and not +// changed thereafter. 
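+//
+// A minimal, illustrative sketch of supplying a custom config (here `ds` stands
+// for any Datastore implementation, and the plugin choices are examples only,
+// not a recommended set):
+//
+//	cfg := &SchedulerConfig{
+//		Filters: []plugins.Filter{filter.LowQueueFilter},
+//		Scorers: map[plugins.Scorer]int{&scorer.QueueScorer{}: 1},
+//		Picker:  picker.NewMaxScorePicker(),
+//	}
+//	sched := NewSchedulerWithConfig(ds, cfg)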
+type SchedulerConfig struct {
+	PreSchedulePlugins  []plugins.PreSchedule
+	Filters             []plugins.Filter
+	Scorers             map[plugins.Scorer]int // map from scorer to weight
+	Picker              plugins.Picker
+	PostSchedulePlugins []plugins.PostSchedule
+}
+
+var defPlugin = &defaultPlugin{}
+
+// defaultConfig is used when the scheduler is initialized via NewScheduler; callers that
+// need different behavior can pass their own config through NewSchedulerWithConfig.
+// To change the built-in plugin set at build time, edit the defaultConfig variable in this file.
+var defaultConfig = &SchedulerConfig{
+	PreSchedulePlugins:  []plugins.PreSchedule{},
+	Filters:             []plugins.Filter{defPlugin},
+	Scorers:             map[plugins.Scorer]int{},
+	Picker:              defPlugin,
+	PostSchedulePlugins: []plugins.PostSchedule{},
+}
diff --git a/pkg/epp/scheduling/config/config.go b/pkg/epp/scheduling/config/config.go
new file mode 100644
index 00000000..e00b82ae
--- /dev/null
+++ b/pkg/epp/scheduling/config/config.go
@@ -0,0 +1,58 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package config

+import (
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	envutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/env"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+// Config holds all the configuration values for the scheduler.
+type Config struct {
+	KVCacheThreshold       float64
+	QueueThresholdCritical int
+	QueueingThresholdLoRA  int
+	LoraAffinityThreshold  float64
+}
+
+const (
+	// Default values to use if environment variables are not set
+	defaultKVCacheThreshold       = 0.8
+	defaultQueueThresholdCritical = 5
+	defaultQueueingThresholdLoRA  = 128
+	defaultLoraAffinityThreshold  = 0.999
+)
+
+// LoadConfig loads configuration from environment variables.
+func LoadConfig() Config {
+	// Use a default logger for initial configuration loading
+	baseLogger := log.Log.WithName("scheduling-config")
+
+	config := Config{
+		KVCacheThreshold:       envutil.GetEnvFloat("KV_CACHE_THRESHOLD", defaultKVCacheThreshold, baseLogger),
+		QueueThresholdCritical: envutil.GetEnvInt("QUEUE_THRESHOLD_CRITICAL", defaultQueueThresholdCritical, baseLogger),
+		QueueingThresholdLoRA:  envutil.GetEnvInt("QUEUING_THRESHOLD_LORA", defaultQueueingThresholdLoRA, baseLogger),
+		LoraAffinityThreshold:  envutil.GetEnvFloat("LORA_AFFINITY_THRESHOLD", defaultLoraAffinityThreshold, baseLogger),
+	}
+
+	baseLogger.V(logutil.DEFAULT).Info("Scheduler configuration loaded", "config", config)
+
+	return config
+}
+
+var Conf = LoadConfig()
diff --git a/pkg/epp/scheduling/plugins/filter/filter.go b/pkg/epp/scheduling/plugins/filter/filter.go
new file mode 100644
index 00000000..86620aa9
--- /dev/null
+++ b/pkg/epp/scheduling/plugins/filter/filter.go
@@ -0,0 +1,278 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package filter
+
+import (
+	"math"
+	"math/rand"
+	"time"
+
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/config"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+type baseFilter struct {
+	name   string
+	filter filterFunc
+}
+
+func (f *baseFilter) Name() string {
+	if f == nil {
+		return "nil"
+	}
+	return f.name
+}
+
+func (f *baseFilter) Filter(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	loggerTrace := ctx.Logger.V(logutil.TRACE)
+	loggerTrace.Info("Running a filter", "name", f.Name(), "podCount", len(pods))
+
+	return f.filter(ctx, pods)
+}
+
+// DecisionTreeFilter applies the current filter, and then recursively applies the next
+// filters depending on the success or failure of the current filter.
+// It can be used to construct a flow-chart style algorithm.
+type DecisionTreeFilter struct {
+	Current plugins.Filter
+	// NextOnSuccess filter will be applied after successfully applying the current filter.
+	// The filtered results will be passed to the next filter.
+	NextOnSuccess plugins.Filter
+	// NextOnFailure filter will be applied if the current filter fails.
+	// The original input will be passed to the next filter.
+	NextOnFailure plugins.Filter
+	// NextOnSuccessOrFailure is a convenience field to configure the next filter regardless of the
+	// success or failure of the current filter.
+	// NOTE: When using NextOnSuccessOrFailure, both NextOnSuccess and NextOnFailure SHOULD be nil.
+	// However, if that's not the case, NextOnSuccess and NextOnFailure will be used, instead of
+	// NextOnSuccessOrFailure, in the success and failure scenarios, respectively.
+	NextOnSuccessOrFailure plugins.Filter
+}
+
+func (f *DecisionTreeFilter) Name() string {
+	if f == nil {
+		return "nil"
+	}
+	return f.Current.Name()
+}
+
+func (f *DecisionTreeFilter) Filter(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	loggerTrace := ctx.Logger.V(logutil.TRACE)
+	filtered := f.Current.Filter(ctx, pods)
+
+	next := f.NextOnSuccessOrFailure
+	if len(filtered) > 0 {
+		if f.NextOnSuccess == nil && f.NextOnSuccessOrFailure == nil {
+			// No succeeding filters to run, return.
+			return filtered
+		}
+		if f.NextOnSuccess != nil {
+			next = f.NextOnSuccess
+		}
+		loggerTrace.Info("Filter succeeded", "filter", f.Name(), "next", next.Name(), "filteredPodCount", len(filtered))
+		// On success, pass the filtered result to the next filter.
+		return next.Filter(ctx, filtered)
+	} else {
+		if f.NextOnFailure == nil && f.NextOnSuccessOrFailure == nil {
+			// No succeeding filters to run, return.
+			return filtered
+		}
+		if f.NextOnFailure != nil {
+			next = f.NextOnFailure
+		}
+		loggerTrace.Info("Filter failed", "filter", f.Name(), "next", next.Name())
+		// On failure, pass the initial set of pods to the next filter.
+		return next.Filter(ctx, pods)
+	}
+}
+
+// filterFunc filters a set of input pods to a subset.
+type filterFunc func(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod
+
+// toFilterFunc is a helper function to convert a per-pod predicate to a filterFunc.
+func toFilterFunc(pp podPredicate) filterFunc {
+	return func(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+		filtered := []types.Pod{}
+		for _, pod := range pods {
+			pass := pp(ctx.Req, pod)
+			if pass {
+				filtered = append(filtered, pod)
+			}
+		}
+
+		return filtered
+	}
+}
+
+var LeastQueueFilter = &baseFilter{
+	name:   "least queuing",
+	filter: leastQueuingFilterFunc,
+}
+
+// leastQueuingFilterFunc finds the max and min queue size of all pods, divides the whole range
+// (max-min) by the number of pods, and finds the pods that fall into the first range.
+// The intuition is that if there are multiple pods that share similar queue size in the low range,
+// we should consider them all instead of the absolute minimum one. This worked better than picking
+// the least one as it gives more choices for the next filter, which on aggregate gave better
+// results.
+// TODO: Compare this strategy with other strategies such as top K.
+func leastQueuingFilterFunc(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	min := math.MaxInt
+	max := 0
+	filtered := []types.Pod{}
+
+	for _, pod := range pods {
+		if pod.GetMetrics().WaitingQueueSize <= min {
+			min = pod.GetMetrics().WaitingQueueSize
+		}
+		if pod.GetMetrics().WaitingQueueSize >= max {
+			max = pod.GetMetrics().WaitingQueueSize
+		}
+	}
+
+	for _, pod := range pods {
+		if pod.GetMetrics().WaitingQueueSize >= min && pod.GetMetrics().WaitingQueueSize <= min+(max-min)/len(pods) {
+			filtered = append(filtered, pod)
+		}
+	}
+	return filtered
+}
+
+var LowQueueFilter = &baseFilter{
+	name:   "low queueing filter",
+	filter: toFilterFunc(queueThresholdPredicate(config.Conf.QueueingThresholdLoRA)),
+}
+
+var LeastKVCacheFilter = &baseFilter{
+	name:   "least KV cache percent",
+	filter: leastKVCacheFilterFunc,
+}
+
+// leastKVCacheFilterFunc finds the max and min KV cache of all pods, divides the whole range
+// (max-min) by the number of pods, and finds the pods that fall into the first range.
+// The intuition is that if there are multiple pods that share similar KV cache in the low range, we
+// should consider them all instead of the absolute minimum one. This worked better than picking the
+// least one as it gives more choices for the next filter, which on aggregate gave better results.
+// TODO: Compare this strategy with other strategies such as top K.
+func leastKVCacheFilterFunc(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	min := math.MaxFloat64
+	var max float64 = 0
+	filtered := []types.Pod{}
+
+	for _, pod := range pods {
+		if pod.GetMetrics().KVCacheUsagePercent <= min {
+			min = pod.GetMetrics().KVCacheUsagePercent
+		}
+		if pod.GetMetrics().KVCacheUsagePercent >= max {
+			max = pod.GetMetrics().KVCacheUsagePercent
+		}
+	}
+
+	for _, pod := range pods {
+		if pod.GetMetrics().KVCacheUsagePercent >= min && pod.GetMetrics().KVCacheUsagePercent <= min+(max-min)/float64(len(pods)) {
+			filtered = append(filtered, pod)
+		}
+	}
+	return filtered
+}
+
+var LoRAAffinityFilter = &baseFilter{
+	name:   "affinity LoRA",
+	filter: loRASoftAffinityFilterFunc,
+}
+
+// loRASoftAffinityFilterFunc implements a pod selection strategy that prioritizes pods
+// with existing LoRA model affinity while allowing for load balancing through randomization.
+//
+// The function works by:
+// 1. Separating pods into two groups: those with target model affinity and those with available capacity
+// 2. Using a probability threshold to sometimes select from non-affinity pods to enable load balancing
+// 3. Falling back to whichever group has pods if one group is empty
+//
+// It takes the scheduling context (which carries the resolved target model) and the candidate
+// pods, and returns the subset of pods selected based on affinity and availability.
func loRASoftAffinityFilterFunc(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	// Pre-allocate slices with estimated capacity
+	filtered_affinity := make([]types.Pod, 0, len(pods))
+	filtered_available := make([]types.Pod, 0, len(pods))
+
+	// Categorize pods based on affinity and availability
+	for _, pod := range pods {
+		_, active := pod.GetMetrics().ActiveModels[ctx.Req.ResolvedTargetModel]
+		_, waiting := pod.GetMetrics().WaitingModels[ctx.Req.ResolvedTargetModel]
+
+		if active || waiting {
+			filtered_affinity = append(filtered_affinity, pod)
+		} else if len(pod.GetMetrics().ActiveModels)+len(pod.GetMetrics().WaitingModels) < pod.GetMetrics().MaxActiveModels {
+			filtered_available = append(filtered_available, pod)
+		}
+	}
+
+	// Seed a local math/rand generator for the probabilistic split; the selection is
+	// load balancing, not security sensitive, so a non-cryptographic source suffices.
+	randSource := rand.NewSource(time.Now().UnixNano())
+	randGen := rand.New(randSource)
+
+	// If both groups have pods, use probability to select which group to return
+	if len(filtered_affinity) > 0 && len(filtered_available) > 0 {
+		if randGen.Float64() < config.Conf.LoraAffinityThreshold {
+			return filtered_affinity
+		}
+		return filtered_available
+	}
+
+	// Return whichever group has pods
+	if len(filtered_affinity) > 0 {
+		return filtered_affinity
+	}
+
+	return filtered_available
+}
+
+var HasCapacityFilter = &baseFilter{
+	name:   "has capacity for sheddable requests",
+	filter: toFilterFunc(queueThresholdPredicate(config.Conf.QueueThresholdCritical).and(kvCacheThresholdPredicate(config.Conf.KVCacheThreshold))),
+}
+
+// podPredicate is a filter function to check whether a pod is desired.
+type podPredicate func(req *types.LLMRequest, pod types.Pod) bool
+
+func queueThresholdPredicate(queueThreshold int) podPredicate {
+	return func(req *types.LLMRequest, pod types.Pod) bool {
+		return pod.GetMetrics().WaitingQueueSize <= queueThreshold
+	}
+}
+
+func kvCacheThresholdPredicate(kvCacheThreshold float64) podPredicate {
+	return func(req *types.LLMRequest, pod types.Pod) bool {
+		return pod.GetMetrics().KVCacheUsagePercent <= kvCacheThreshold
+	}
+}
+
+func (pp podPredicate) and(another podPredicate) podPredicate {
+	return func(req *types.LLMRequest, pod types.Pod) bool {
+		return pp(req, pod) && another(req, pod)
+	}
+}
diff --git a/pkg/epp/scheduling/plugins/filter/filter_test.go b/pkg/epp/scheduling/plugins/filter/filter_test.go
new file mode 100644
index 00000000..2354c3ef
--- /dev/null
+++ b/pkg/epp/scheduling/plugins/filter/filter_test.go
@@ -0,0 +1,298 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package filter + +import ( + "context" + "testing" + + "github.com/google/go-cmp/cmp" + k8stypes "k8s.io/apimachinery/pkg/types" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/config" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +func TestFilter(t *testing.T) { + tests := []struct { + name string + req *types.LLMRequest + input []types.Pod + output []types.Pod + filter *DecisionTreeFilter + }{ + { + name: "simple filter without available pods", + filter: &DecisionTreeFilter{ + Current: &baseFilter{ + name: "filter all", + filter: func(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod { + return []types.Pod{} + }, + }, + }, + output: []types.Pod{}, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + ctx := types.NewSchedulingContext(context.Background(), test.req, test.input) + got := test.filter.Filter(ctx, test.input) + + if diff := cmp.Diff(test.output, got); diff != "" { + t.Errorf("Unexpected output (-want +got): %v", diff) + } + }) + } +} + +func TestFilterFunc(t *testing.T) { + tests := []struct { + name string + f filterFunc + req *types.LLMRequest + input []types.Pod + output []types.Pod + }{ + { + name: "least queuing empty input", + f: leastQueuingFilterFunc, + input: []types.Pod{}, + output: []types.Pod{}, + }, + { + name: "least queuing", + f: leastQueuingFilterFunc, + input: []types.Pod{ + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 0, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 3, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 10, + }, + }, + }, + output: []types.Pod{ + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 0, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 3, + }, + }, + }, + }, + { + name: "least kv cache empty input", + f: leastKVCacheFilterFunc, + input: []types.Pod{}, + output: []types.Pod{}, + }, + { + name: "least kv cache", + f: leastKVCacheFilterFunc, + input: []types.Pod{ + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + KVCacheUsagePercent: 0, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + KVCacheUsagePercent: 0.3, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + KVCacheUsagePercent: 1.0, + }, + }, + }, + output: []types.Pod{ + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + KVCacheUsagePercent: 0, + }, + }, + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + KVCacheUsagePercent: 0.3, + }, + }, + }, + }, + { + name: "lowQueueAndLessThanKVCacheThresholdPredicate", + f: toFilterFunc(queueThresholdPredicate(0).and(kvCacheThresholdPredicate(0.8))), + input: []types.Pod{ + &types.PodMetrics{ + // This pod should be returned. 
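+					// (With queueThresholdPredicate(0), only a pod whose queue size is exactly 0 can pass.)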
+ Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 0, + KVCacheUsagePercent: 0, + }, + }, + &types.PodMetrics{ + // Queue is non zero, despite low kv cache, should not return. + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 1, + KVCacheUsagePercent: 0.3, + }, + }, + &types.PodMetrics{ + // High kv cache despite zero queue, should not return + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 0, + KVCacheUsagePercent: 1.0, + }, + }, + }, + output: []types.Pod{ + &types.PodMetrics{ + Metrics: &backendmetrics.Metrics{ + WaitingQueueSize: 0, + KVCacheUsagePercent: 0, + }, + }, + }, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + ctx := types.NewSchedulingContext(context.Background(), test.req, test.input) + got := test.f(ctx, test.input) + + if diff := cmp.Diff(test.output, got); diff != "" { + t.Errorf("Unexpected output (-want +got): %v", diff) + } + }) + } +} + +// TestLoRASoftAffinityDistribution tests that the loRASoftAffinityFilter function +// properly distributes requests according to the loraAffinityThreshold +func TestLoRASoftAffinityDistribution(t *testing.T) { + const ( + testModelName = "test-model" + testAffinityModel = "test-affinity-model" + numIterations = 10000 + tolerancePercent = 5.0 // Allow 5% tolerance from expected distribution + ) + + // Save original config value to restore later + originalThreshold := config.Conf.LoraAffinityThreshold + + // Set a specific test value for this test + testThreshold := 0.75 // 75% + config.Conf.LoraAffinityThreshold = testThreshold + + // Ensure we restore the original threshold when test completes + defer func() { + config.Conf.LoraAffinityThreshold = originalThreshold + }() + + // Create a test request and pods + req := &types.LLMRequest{ + Model: testAffinityModel, + ResolvedTargetModel: testAffinityModel, + } + + // Test setup: One affinity pod and one available pod + pods := []types.Pod{ + &types.PodMetrics{ + Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "affinity-pod"}}, + Metrics: &backendmetrics.Metrics{ + MaxActiveModels: 2, + ActiveModels: map[string]int{ + testAffinityModel: 1, + }, + }, + }, + &types.PodMetrics{ + Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "available-pod"}}, + Metrics: &backendmetrics.Metrics{ + MaxActiveModels: 2, + ActiveModels: map[string]int{}, + }, + }, + } + ctx := types.NewSchedulingContext(context.Background(), req, pods) + + // Run the filter function multiple times and count the results + affinityCount := 0 + availableCount := 0 + + // Use the test threshold value + expectedAffinityPercent := config.Conf.LoraAffinityThreshold * 100 + expectedAvailabilityPercent := 100 - expectedAffinityPercent + + for i := 0; i < numIterations; i++ { + result := loRASoftAffinityFilterFunc(ctx, pods) + + // Check which type of pod was returned + if len(result) != 1 { + t.Fatalf("Expected exactly one pod in result, got %d", len(result)) + } + + // Identify if the returned pod is the affinity pod or available pod + if _, exists := result[0].GetMetrics().ActiveModels[testAffinityModel]; exists { + affinityCount++ + } else { + availableCount++ + } + } + + // Calculate the actual percentages + actualAffinityPercent := float64(affinityCount) / float64(numIterations) * 100 + actualAvailablePercent := float64(availableCount) / float64(numIterations) * 100 + + // Check if the distribution matches expected threshold within tolerance + affinityLowerBound := expectedAffinityPercent - tolerancePercent + affinityUpperBound := 
expectedAffinityPercent + tolerancePercent + + availableLowerBound := expectedAvailabilityPercent - tolerancePercent + availableUpperBound := expectedAvailabilityPercent + tolerancePercent + + t.Logf("Distribution results over %d iterations:", numIterations) + t.Logf("Expected affinity percent: %.2f%% (threshold: %.2f)", expectedAffinityPercent, config.Conf.LoraAffinityThreshold) + t.Logf("Expected availability percent: %.2f%% (threshold: %.2f)", expectedAvailabilityPercent, config.Conf.LoraAffinityThreshold) + t.Logf("Actual affinity percent: %.2f%% (%d out of %d)", actualAffinityPercent, affinityCount, numIterations) + t.Logf("Actual available percent: %.2f%% (%d out of %d)", actualAvailablePercent, availableCount, numIterations) + + if actualAffinityPercent < affinityLowerBound || actualAffinityPercent > affinityUpperBound { + t.Errorf("Affinity selection percent %.2f%% outside expected range %.2f%% to %.2f%%", + actualAffinityPercent, affinityLowerBound, affinityUpperBound) + } + if actualAvailablePercent < availableLowerBound || actualAvailablePercent > availableUpperBound { + t.Errorf("Availability selection percent %.2f%% outside expected range %.2f%% to %.2f%%", + actualAvailablePercent, availableLowerBound, availableUpperBound) + } +} diff --git a/pkg/epp/scheduling/plugins/picker/max_score_picker.go b/pkg/epp/scheduling/plugins/picker/max_score_picker.go new file mode 100644 index 00000000..a6d7b397 --- /dev/null +++ b/pkg/epp/scheduling/plugins/picker/max_score_picker.go @@ -0,0 +1,65 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package picker + +import ( + "fmt" + + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +var _ plugins.Picker = &MaxScorePicker{} + +func NewMaxScorePicker() plugins.Picker { + return &MaxScorePicker{ + random: &RandomPicker{}, + } +} + +// MaxScorePicker picks the pod with the maximum score from the list of candidates. +type MaxScorePicker struct { + random *RandomPicker +} + +// Name returns the name of the picker. +func (p *MaxScorePicker) Name() string { + return "max_score" +} + +// Pick selects the pod with the maximum score from the list of candidates. 
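+// Ties are broken by delegating to the embedded RandomPicker: with scores such as
+// {A: 0.8, B: 0.8, C: 0.2}, one of A or B is returned, chosen uniformly at random.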
+func (p *MaxScorePicker) Pick(ctx *types.SchedulingContext, scoredPods []*types.ScoredPod) *types.Result { + ctx.Logger.V(logutil.DEBUG).Info(fmt.Sprintf("Selecting a pod with the max score from %d candidates: %+v", len(scoredPods), scoredPods)) + + highestScorePods := []*types.ScoredPod{} + maxScore := -1.0 // pods min score is 0, putting value lower than 0 in order to find at least one pod as highest + for _, pod := range scoredPods { + if pod.Score > maxScore { + maxScore = pod.Score + highestScorePods = []*types.ScoredPod{pod} + } else if pod.Score == maxScore { + highestScorePods = append(highestScorePods, pod) + } + } + + if len(highestScorePods) > 1 { + return p.random.Pick(ctx, highestScorePods) // pick randomly from the highest score pods + } + + return &types.Result{TargetPod: highestScorePods[0]} +} diff --git a/pkg/epp/scheduling/plugins/picker/random_picker.go b/pkg/epp/scheduling/plugins/picker/random_picker.go new file mode 100644 index 00000000..fb9f9a29 --- /dev/null +++ b/pkg/epp/scheduling/plugins/picker/random_picker.go @@ -0,0 +1,41 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package picker + +import ( + "fmt" + "math/rand" + + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +var _ plugins.Picker = &RandomPicker{} + +// RandomPicker picks a random pod from the list of candidates. +type RandomPicker struct{} + +func (p *RandomPicker) Name() string { + return "random" +} + +func (p *RandomPicker) Pick(ctx *types.SchedulingContext, scoredPods []*types.ScoredPod) *types.Result { + ctx.Logger.V(logutil.DEBUG).Info(fmt.Sprintf("Selecting a random pod from %d candidates: %+v", len(scoredPods), scoredPods)) + i := rand.Intn(len(scoredPods)) + return &types.Result{TargetPod: scoredPods[i]} +} diff --git a/pkg/epp/scheduling/plugins/plugins.go b/pkg/epp/scheduling/plugins/plugins.go new file mode 100644 index 00000000..f3412ab7 --- /dev/null +++ b/pkg/epp/scheduling/plugins/plugins.go @@ -0,0 +1,76 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package plugins + +import ( + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +const ( + PreSchedulerPluginType = "PreSchedule" + FilterPluginType = "Filter" + ScorerPluginType = "Scorer" + PostSchedulePluginType = "PostSchedule" + PickerPluginType = "Picker" + PostResponsePluginType = "PostResponse" +) + +// Plugin defines the interface for scheduler plugins, combining scoring, filtering, +// and event handling capabilities. +type Plugin interface { + // Name returns the name of the plugin. + Name() string +} + +// PreSchedule is called when the scheduler receives a new request. It can be used for various +// initialization work. +type PreSchedule interface { + Plugin + PreSchedule(ctx *types.SchedulingContext) +} + +// Filter defines the interface for filtering a list of pods based on context. +type Filter interface { + Plugin + Filter(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod +} + +// Scorer defines the interface for scoring a list of pods based on context. +// Scorers must score pods with a value within the range of [0,1] where 1 is the highest score. +type Scorer interface { + Plugin + Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 +} + +// Picker picks the final pod(s) to send the request to. +type Picker interface { + Plugin + Pick(ctx *types.SchedulingContext, scoredPods []*types.ScoredPod) *types.Result +} + +// PostSchedule is called by the scheduler after it selects a targetPod for the request. +type PostSchedule interface { + Plugin + PostSchedule(ctx *types.SchedulingContext, res *types.Result) +} + +// PostResponse is called by the scheduler after a successful response was sent. +// The given pod argument is the pod that served the request. +type PostResponse interface { + Plugin + PostResponse(ctx *types.SchedulingContext, pod types.Pod) +} diff --git a/pkg/epp/scheduling/plugins/scorer/kvcache.go b/pkg/epp/scheduling/plugins/scorer/kvcache.go new file mode 100644 index 00000000..0877691d --- /dev/null +++ b/pkg/epp/scheduling/plugins/scorer/kvcache.go @@ -0,0 +1,35 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package scorer + +import ( + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +type KVCacheScorer struct{} + +func (ss *KVCacheScorer) Name() string { + return "kv-cache" +} + +func (ss *KVCacheScorer) Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 { + scores := make(map[types.Pod]float64, len(pods)) + for _, pod := range pods { + scores[pod] = 1 - pod.GetMetrics().KVCacheUsagePercent + } + return scores +} diff --git a/pkg/epp/scheduling/plugins/scorer/kvcache_test.go b/pkg/epp/scheduling/plugins/scorer/kvcache_test.go new file mode 100644 index 00000000..257a58c1 --- /dev/null +++ b/pkg/epp/scheduling/plugins/scorer/kvcache_test.go @@ -0,0 +1,95 @@ +/* +Copyright 2025 The Kubernetes Authors. 
+ +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package scorer + +import ( + "context" + "testing" + + "github.com/stretchr/testify/assert" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +func TestKvCacheScorer(t *testing.T) { + tests := []struct { + name string + pods []types.Pod + expectedScoresPod map[int]float64 // Map of pod index to expected score + }{ + { + name: "Different KV cache utilization", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.8}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.5}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.0}}, + }, + expectedScoresPod: map[int]float64{ + 0: 0.2, // Highest KV cache usage (0.8) gets lowest score (1-0.8=0.2) + 1: 0.5, // Medium KV cache usage (0.5) gets medium score (1-0.5=0.5) + 2: 1.0, // No KV cache usage (0.0) gets highest score (1-0=1.0) + }, + }, + { + name: "Same KV cache utilization", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.6}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.6}}, + }, + expectedScoresPod: map[int]float64{ + 0: 0.4, // Both get same score (1-0.6=0.4) + 1: 0.4, + }, + }, + { + name: "Zero KV cache utilization", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.0}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.0}}, + }, + expectedScoresPod: map[int]float64{ + 0: 1.0, // No KV cache usage gets highest score + 1: 1.0, + }, + }, + { + name: "Full KV cache utilization", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 1.0}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{KVCacheUsagePercent: 0.5}}, + }, + expectedScoresPod: map[int]float64{ + 0: 0.0, // Full KV cache (1.0) gets lowest score (1-1=0) + 1: 0.5, // Half KV cache (0.5) gets medium score (1-0.5=0.5) + }, + }, + } + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + ctx := types.NewSchedulingContext(context.Background(), &types.LLMRequest{}, tt.pods) + scorer := &KVCacheScorer{} + scores := scorer.Score(ctx, tt.pods) + + for i, pod := range tt.pods { + expectedScore := tt.expectedScoresPod[i] + assert.InDelta(t, expectedScore, scores[pod], 0.0001, "Pod %d should have score %f", i, expectedScore) + } + }) + } +} diff --git a/pkg/epp/scheduling/plugins/scorer/queue.go b/pkg/epp/scheduling/plugins/scorer/queue.go new file mode 100644 index 00000000..3df9d414 --- /dev/null +++ b/pkg/epp/scheduling/plugins/scorer/queue.go @@ -0,0 +1,61 @@ +/* +Copyright 2025 The Kubernetes 
Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package scorer + +import ( + "math" + + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +type QueueScorer struct{} + +func (q *QueueScorer) Name() string { + return "queue" +} + +func (q *QueueScorer) Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 { + minQueueSize := math.MaxInt + maxQueueSize := math.MinInt + + // Iterate through the remaining pods to find min and max + for _, pod := range pods { + queueSize := pod.GetMetrics().WaitingQueueSize + if queueSize < minQueueSize { + minQueueSize = queueSize + } + if queueSize > maxQueueSize { + maxQueueSize = queueSize + } + } + + // podScoreFunc calculates the score based on the queue size of each pod. Longer queue gets a lower score. + podScoreFunc := func(pod types.Pod) float64 { + if maxQueueSize == minQueueSize { + // If all pods have the same queue size, return a neutral score + return 1.0 + } + return float64(maxQueueSize-pod.GetMetrics().WaitingQueueSize) / float64(maxQueueSize-minQueueSize) + } + + // Create a map to hold the scores for each pod + scores := make(map[types.Pod]float64, len(pods)) + for _, pod := range pods { + scores[pod] = podScoreFunc(pod) + } + return scores +} diff --git a/pkg/epp/scheduling/plugins/scorer/queue_test.go b/pkg/epp/scheduling/plugins/scorer/queue_test.go new file mode 100644 index 00000000..907681b2 --- /dev/null +++ b/pkg/epp/scheduling/plugins/scorer/queue_test.go @@ -0,0 +1,85 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package scorer + +import ( + "context" + "testing" + + "github.com/stretchr/testify/assert" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" +) + +func TestQueueScorer(t *testing.T) { + tests := []struct { + name string + pods []types.Pod + expectedScoresPod map[int]float64 // Map of pod index to expected score + }{ + { + name: "Different queue sizes", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 10}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 5}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 0}}, + }, + expectedScoresPod: map[int]float64{ + 0: 0.0, // Longest queue (10) gets lowest score + 1: 0.5, // Medium queue (5) gets medium score + 2: 1.0, // Shortest queue (0) gets highest score + }, + }, + { + name: "Same queue sizes", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 5}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 5}}, + }, + expectedScoresPod: map[int]float64{ + 0: 1.0, // When all pods have the same queue size, they get the same neutral score + 1: 1.0, + }, + }, + { + name: "Zero queue sizes", + pods: []types.Pod{ + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 0}}, + &types.PodMetrics{Pod: &backend.Pod{}, Metrics: &backendmetrics.Metrics{WaitingQueueSize: 0}}, + }, + expectedScoresPod: map[int]float64{ + 0: 1.0, + 1: 1.0, + }, + }, + } + + scorer := &QueueScorer{} + + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + ctx := types.NewSchedulingContext(context.Background(), &types.LLMRequest{}, tt.pods) + scores := scorer.Score(ctx, tt.pods) + + for i, pod := range tt.pods { + expectedScore := tt.expectedScoresPod[i] + assert.InDelta(t, expectedScore, scores[pod], 0.0001, "Pod %d should have score %f", i, expectedScore) + } + }) + } +} diff --git a/pkg/epp/scheduling/scheduler.go b/pkg/epp/scheduling/scheduler.go new file mode 100644 index 00000000..245d0a5d --- /dev/null +++ b/pkg/epp/scheduling/scheduler.go @@ -0,0 +1,224 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package scheduling implements request scheduling algorithms. 
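+// A Scheduler runs pre-schedule plugins, filters, weighted scorers, a picker,
+// and post-schedule plugins in sequence to select a target pod for a request.
+//
+// A minimal usage sketch, assuming ds implements this package's Datastore
+// interface:
+//
+//	scheduler := NewScheduler(ds)
+//	result, err := scheduler.Schedule(ctx, &types.LLMRequest{Model: "m", Critical: true})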
+package scheduling
+
+import (
+	"context"
+	"fmt"
+	"time"
+
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins/filter"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins/picker"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
+	errutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/error"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+var (
+	lowLatencyFilter = &filter.DecisionTreeFilter{
+		Current: filter.LowQueueFilter,
+		NextOnSuccess: &filter.DecisionTreeFilter{
+			Current: filter.LoRAAffinityFilter,
+			NextOnSuccessOrFailure: &filter.DecisionTreeFilter{
+				Current: filter.LeastQueueFilter,
+				NextOnSuccessOrFailure: &filter.DecisionTreeFilter{
+					Current: filter.LeastKVCacheFilter,
+				},
+			},
+		},
+		NextOnFailure: &filter.DecisionTreeFilter{
+			Current: filter.LeastQueueFilter,
+			NextOnSuccessOrFailure: &filter.DecisionTreeFilter{
+				Current: filter.LoRAAffinityFilter,
+				NextOnSuccessOrFailure: &filter.DecisionTreeFilter{
+					Current: filter.LeastKVCacheFilter,
+				},
+			},
+		},
+	}
+
+	sheddableRequestFilter = &filter.DecisionTreeFilter{
+		// When there is at least one model server that's not queuing requests, and still has KV
+		// cache below a certain threshold, we consider this model server to have capacity to handle
+		// a sheddable request without impacting critical requests.
+		Current:       filter.HasCapacityFilter,
+		NextOnSuccess: lowLatencyFilter,
+		// If all pods are queuing or running above the KVCache threshold, we drop the sheddable
+		// request to make room for critical requests. For this reason, we don't define nextOnFailure.
+	}
+)
+
+func NewScheduler(datastore Datastore) *Scheduler {
+	return NewSchedulerWithConfig(datastore, defaultConfig)
+}
+
+func NewSchedulerWithConfig(datastore Datastore, config *SchedulerConfig) *Scheduler {
+	return &Scheduler{
+		datastore:           datastore,
+		preSchedulePlugins:  config.PreSchedulePlugins,
+		filters:             config.Filters,
+		scorers:             config.Scorers,
+		picker:              config.Picker,
+		postSchedulePlugins: config.PostSchedulePlugins,
+	}
+}
+
+type Scheduler struct {
+	datastore           Datastore
+	preSchedulePlugins  []plugins.PreSchedule
+	filters             []plugins.Filter
+	scorers             map[plugins.Scorer]int // map from scorer to its weight
+	picker              plugins.Picker
+	postSchedulePlugins []plugins.PostSchedule
+}
+
+type Datastore interface {
+	PodGetAll() []backendmetrics.PodMetrics
+}
+
+// Schedule finds the target pod based on metrics and the requested LoRA adapter.
+func (s *Scheduler) Schedule(ctx context.Context, req *types.LLMRequest) (*types.Result, error) {
+	logger := log.FromContext(ctx).WithValues("request", req)
+	loggerDebug := logger.V(logutil.DEBUG)
+
+	scheduleStart := time.Now()
+	defer func() {
+		metrics.RecordSchedulerE2ELatency(time.Since(scheduleStart))
+	}()
+
+	// Snapshot pod metrics from the datastore to:
+	// 1. Reduce concurrent access to the datastore.
+	// 2. Ensure consistent data during the scheduling operation of a request.
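+	// Plugins therefore operate on an immutable view: metric updates that land
+	// while a request is being scheduled only become visible to later requests.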
+ sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll())) + loggerDebug.Info(fmt.Sprintf("Scheduling a request, Metrics: %+v", sCtx.PodsSnapshot)) + + s.runPreSchedulePlugins(sCtx) + + pods := s.runFilterPlugins(sCtx) + if len(pods) == 0 { + return nil, errutil.Error{Code: errutil.Internal, Msg: "no pods available for the given request"} + } + // if we got here, there is at least one pod to score + weightedScorePerPod := s.runScorerPlugins(sCtx, pods) + + result := s.runPickerPlugin(sCtx, weightedScorePerPod) + + s.runPostSchedulePlugins(sCtx, result) + + return result, nil +} + +func (s *Scheduler) runPreSchedulePlugins(ctx *types.SchedulingContext) { + for _, plugin := range s.preSchedulePlugins { + ctx.Logger.V(logutil.DEBUG).Info("Running pre-schedule plugin", "plugin", plugin.Name()) + before := time.Now() + plugin.PreSchedule(ctx) + metrics.RecordSchedulerPluginProcessingLatency(plugins.PreSchedulerPluginType, plugin.Name(), time.Since(before)) + } +} + +func (s *Scheduler) runFilterPlugins(ctx *types.SchedulingContext) []types.Pod { + loggerDebug := ctx.Logger.V(logutil.DEBUG) + filteredPods := ctx.PodsSnapshot + loggerDebug.Info("Before running filter plugins", "pods", filteredPods) + + for _, filter := range s.filters { + loggerDebug.Info("Running filter plugin", "plugin", filter.Name()) + before := time.Now() + filteredPods = filter.Filter(ctx, filteredPods) + metrics.RecordSchedulerPluginProcessingLatency(plugins.FilterPluginType, filter.Name(), time.Since(before)) + loggerDebug.Info("Filter plugin result", "plugin", filter.Name(), "pods", filteredPods) + if len(filteredPods) == 0 { + break + } + } + loggerDebug.Info("After running filter plugins") + + return filteredPods +} + +func (s *Scheduler) runScorerPlugins(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 { + loggerDebug := ctx.Logger.V(logutil.DEBUG) + loggerDebug.Info("Before running scorer plugins", "pods", pods) + + weightedScorePerPod := make(map[types.Pod]float64, len(pods)) + for _, pod := range pods { + weightedScorePerPod[pod] = float64(0) // initialize weighted score per pod with 0 value + } + // Iterate through each scorer in the chain and accumulate the weighted scores. 
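+	// For example, a pod scored 0.5 by a scorer with weight 2 and 1.0 by a
+	// scorer with weight 1 accumulates a weighted score of 2*0.5 + 1*1.0 = 2.0.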
+	for scorer, weight := range s.scorers {
+		loggerDebug.Info("Running scorer", "scorer", scorer.Name())
+		before := time.Now()
+		scores := scorer.Score(ctx, pods)
+		metrics.RecordSchedulerPluginProcessingLatency(plugins.ScorerPluginType, scorer.Name(), time.Since(before))
+		for pod, score := range scores { // weight is relative to the sum of weights
+			weightedScorePerPod[pod] += score * float64(weight) // TODO: normalize scores before multiplying by their weights
+		}
+		loggerDebug.Info("After running scorer", "scorer", scorer.Name())
+	}
+	loggerDebug.Info("After running scorer plugins")
+
+	return weightedScorePerPod
+}
+
+func (s *Scheduler) runPickerPlugin(ctx *types.SchedulingContext, weightedScorePerPod map[types.Pod]float64) *types.Result {
+	loggerDebug := ctx.Logger.V(logutil.DEBUG)
+	scoredPods := make([]*types.ScoredPod, len(weightedScorePerPod))
+	i := 0
+	for pod, score := range weightedScorePerPod {
+		scoredPods[i] = &types.ScoredPod{Pod: pod, Score: score}
+		i++
+	}
+
+	loggerDebug.Info("Before running picker plugin", "pods", weightedScorePerPod)
+	before := time.Now()
+	result := s.picker.Pick(ctx, scoredPods)
+	metrics.RecordSchedulerPluginProcessingLatency(plugins.PickerPluginType, s.picker.Name(), time.Since(before))
+	loggerDebug.Info("After running picker plugin", "result", result)
+
+	return result
+}
+
+func (s *Scheduler) runPostSchedulePlugins(ctx *types.SchedulingContext, res *types.Result) {
+	for _, plugin := range s.postSchedulePlugins {
+		ctx.Logger.V(logutil.DEBUG).Info("Running post-schedule plugin", "plugin", plugin.Name())
+		before := time.Now()
+		plugin.PostSchedule(ctx, res)
+		metrics.RecordSchedulerPluginProcessingLatency(plugins.PostSchedulePluginType, plugin.Name(), time.Since(before))
+	}
+}
+
+type defaultPlugin struct {
+	picker.RandomPicker
+}
+
+func (p *defaultPlugin) Name() string {
+	return "DefaultPlugin"
+}
+
+func (p *defaultPlugin) Filter(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	if ctx.Req.Critical {
+		return lowLatencyFilter.Filter(ctx, pods)
+	}
+
+	return sheddableRequestFilter.Filter(ctx, pods)
+}
diff --git a/pkg/epp/scheduling/scheduler_test.go b/pkg/epp/scheduling/scheduler_test.go
new file mode 100644
index 00000000..2d773283
--- /dev/null
+++ b/pkg/epp/scheduling/scheduler_test.go
@@ -0,0 +1,519 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package scheduling
+
+import (
+	"context"
+	"testing"
+
+	"github.com/google/go-cmp/cmp"
+	k8stypes "k8s.io/apimachinery/pkg/types"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend"
+	backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/plugins"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
+)
+
+// Tests the default scheduler configuration and expected behavior.
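+// Each case feeds fake pod metrics into the scheduler and asserts which pod
+// the default filter chain picks.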
+func TestSchedule(t *testing.T) {
+	tests := []struct {
+		name    string
+		req     *types.LLMRequest
+		input   []*backendmetrics.FakePodMetrics
+		wantRes *types.Result
+		err     bool
+	}{
+		{
+			name: "no pods in datastore",
+			req: &types.LLMRequest{
+				Model:               "any-model",
+				ResolvedTargetModel: "any-model",
+				Critical:            true,
+			},
+			input: []*backendmetrics.FakePodMetrics{},
+			err:   true,
+		},
+		{
+			name: "critical request",
+			req: &types.LLMRequest{
+				Model:               "critical",
+				ResolvedTargetModel: "critical",
+				Critical:            true,
+			},
+			// pod2 will be picked because it has a relatively short queue, the
+			// requested model already active, and low KV cache usage.
+			input: []*backendmetrics.FakePodMetrics{
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    0,
+						KVCacheUsagePercent: 0.2,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+							"bar": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    3,
+						KVCacheUsagePercent: 0.1,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo":      1,
+							"critical": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    10,
+						KVCacheUsagePercent: 0.2,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+						},
+					},
+				},
+			},
+			wantRes: &types.Result{
+				TargetPod: &types.ScoredPod{
+					Pod: &types.PodMetrics{
+						Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}},
+						Metrics: &backendmetrics.Metrics{
+							WaitingQueueSize:    3,
+							KVCacheUsagePercent: 0.1,
+							MaxActiveModels:     2,
+							ActiveModels: map[string]int{
+								"foo":      1,
+								"critical": 1,
+							},
+							WaitingModels: map[string]int{},
+						},
+					},
+				},
+			},
+		},
+		{
+			name: "sheddable request, accepted",
+			req: &types.LLMRequest{
+				Model:               "sheddable",
+				ResolvedTargetModel: "sheddable",
+				Critical:            false,
+			},
+			// pod1 will be picked because it has capacity for the sheddable request.
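+			// (pod1 has an empty queue and only 20% KV cache usage, so the
+			// HasCapacityFilter keeps it as a candidate.)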
+			input: []*backendmetrics.FakePodMetrics{
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    0,
+						KVCacheUsagePercent: 0.2,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+							"bar": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    3,
+						KVCacheUsagePercent: 0.1,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo":      1,
+							"critical": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    10,
+						KVCacheUsagePercent: 0.2,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+						},
+					},
+				},
+			},
+			wantRes: &types.Result{
+				TargetPod: &types.ScoredPod{
+					Pod: &types.PodMetrics{
+						Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}},
+						Metrics: &backendmetrics.Metrics{
+							WaitingQueueSize:    0,
+							KVCacheUsagePercent: 0.2,
+							MaxActiveModels:     2,
+							ActiveModels: map[string]int{
+								"foo": 1,
+								"bar": 1,
+							},
+							WaitingModels: map[string]int{},
+						},
+					},
+				},
+			},
+		},
+		{
+			name: "sheddable request, dropped",
+			req: &types.LLMRequest{
+				Model:               "sheddable",
+				ResolvedTargetModel: "sheddable",
+				Critical:            false,
+			},
+			// All pods have higher KV cache usage than the threshold, so the sheddable
+			// request will be dropped.
+			input: []*backendmetrics.FakePodMetrics{
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    10,
+						KVCacheUsagePercent: 0.9,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+							"bar": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    3,
+						KVCacheUsagePercent: 0.85,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo":      1,
+							"critical": 1,
+						},
+					},
+				},
+				{
+					Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}},
+					Metrics: &backendmetrics.Metrics{
+						WaitingQueueSize:    10,
+						KVCacheUsagePercent: 0.85,
+						MaxActiveModels:     2,
+						ActiveModels: map[string]int{
+							"foo": 1,
+						},
+					},
+				},
+			},
+			wantRes: nil,
+			err:     true,
+		},
+	}
+
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			scheduler := NewScheduler(&fakeDataStore{pods: test.input})
+			got, err := scheduler.Schedule(context.Background(), test.req)
+			if test.err != (err != nil) {
+				t.Errorf("Unexpected error, got %v, want %v", err, test.err)
+			}
+
+			if diff := cmp.Diff(test.wantRes, got); diff != "" {
+				t.Errorf("Unexpected output (-want +got): %v", diff)
+			}
+		})
+	}
+}
+
+func TestSchedulePlugins(t *testing.T) {
+	tp1 := &TestPlugin{
+		NameRes:   "test1",
+		ScoreRes:  0.3,
+		FilterRes: []k8stypes.NamespacedName{{Name: "pod1"}, {Name: "pod2"}, {Name: "pod3"}},
+	}
+	tp2 := &TestPlugin{
+		NameRes:   "test2",
+		ScoreRes:  0.8,
+		FilterRes: []k8stypes.NamespacedName{{Name: "pod1"}, {Name: "pod2"}},
+	}
+	tp_filterAll := &TestPlugin{
+		NameRes:   "filter all",
+		FilterRes: []k8stypes.NamespacedName{},
+	}
+	pickerPlugin := &TestPlugin{
+		NameRes: "picker",
+		PickRes: k8stypes.NamespacedName{Name: "pod1"},
+	}
+
+	tests := []struct {
+		name           string
+		config         SchedulerConfig
+		input          []*backendmetrics.FakePodMetrics
+		wantTargetPod  k8stypes.NamespacedName
+		targetPodScore float64
+		// Number of expected pods to score (after filter)
+		numPodsToScore int
+		err            bool
+	}{
+		{
+			name: "all plugins executed successfully, all scorers with same weight",
+			config: SchedulerConfig{
+				PreSchedulePlugins: []plugins.PreSchedule{tp1, tp2},
+				Filters:            []plugins.Filter{tp1, tp2},
+				Scorers: map[plugins.Scorer]int{
+					tp1: 1,
+					tp2: 1,
+				},
+				Picker:              pickerPlugin,
+				PostSchedulePlugins: []plugins.PostSchedule{tp1, tp2},
+			},
+			input: []*backendmetrics.FakePodMetrics{
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}}},
+			},
+			wantTargetPod:  k8stypes.NamespacedName{Name: "pod1"},
+			targetPodScore: 1.1,
+			numPodsToScore: 2,
+			err:            false,
+		},
+		{
+			name: "all plugins executed successfully, different scorers weights",
+			config: SchedulerConfig{
+				PreSchedulePlugins: []plugins.PreSchedule{tp1, tp2},
+				Filters:            []plugins.Filter{tp1, tp2},
+				Scorers: map[plugins.Scorer]int{
+					tp1: 60,
+					tp2: 40,
+				},
+				Picker:              pickerPlugin,
+				PostSchedulePlugins: []plugins.PostSchedule{tp1, tp2},
+			},
+			input: []*backendmetrics.FakePodMetrics{
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}}},
+			},
+			wantTargetPod:  k8stypes.NamespacedName{Name: "pod1"},
+			targetPodScore: 50,
+			numPodsToScore: 2,
+			err:            false,
+		},
+		{
+			name: "filter all",
+			config: SchedulerConfig{
+				PreSchedulePlugins: []plugins.PreSchedule{tp1, tp2},
+				Filters:            []plugins.Filter{tp1, tp_filterAll},
+				Scorers: map[plugins.Scorer]int{
+					tp1: 1,
+					tp2: 1,
+				},
+				Picker:              pickerPlugin,
+				PostSchedulePlugins: []plugins.PostSchedule{tp1, tp2},
+			},
+			input: []*backendmetrics.FakePodMetrics{
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod1"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod2"}}},
+				{Pod: &backend.Pod{NamespacedName: k8stypes.NamespacedName{Name: "pod3"}}},
+			},
+			numPodsToScore: 0,
+			err:            true, // no pods available to serve after all are filtered out
+		},
+	}
+
+	for _, test := range tests {
+		t.Run(test.name, func(t *testing.T) {
+			// Reset all plugins before each new test case.
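+			// The TestPlugin instances are shared across test cases, so stale
+			// call counts would otherwise leak between subtests.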
+			for _, plugin := range test.config.PreSchedulePlugins {
+				plugin.(*TestPlugin).reset()
+			}
+			for _, plugin := range test.config.Filters {
+				plugin.(*TestPlugin).reset()
+			}
+			for plugin := range test.config.Scorers {
+				plugin.(*TestPlugin).reset()
+			}
+			test.config.Picker.(*TestPlugin).reset()
+			for _, plugin := range test.config.PostSchedulePlugins {
+				plugin.(*TestPlugin).reset()
+			}
+
+			// Initialize the scheduler
+			scheduler := NewSchedulerWithConfig(&fakeDataStore{pods: test.input}, &test.config)
+
+			req := &types.LLMRequest{Model: "test-model"}
+			got, err := scheduler.Schedule(context.Background(), req)
+
+			// Validate error state
+			if test.err != (err != nil) {
+				t.Fatalf("Unexpected error, got %v, want %v", err, test.err)
+			}
+
+			if err != nil {
+				return
+			}
+
+			// Validate output
+			wantPod := &types.PodMetrics{
+				Pod: &backend.Pod{NamespacedName: test.wantTargetPod},
+			}
+			wantRes := &types.Result{TargetPod: wantPod}
+			if diff := cmp.Diff(wantRes, got); diff != "" {
+				t.Errorf("Unexpected output (-want +got): %v", diff)
+			}
+
+			// Validate plugin execution counts dynamically
+			for _, plugin := range test.config.PreSchedulePlugins {
+				tp, _ := plugin.(*TestPlugin)
+				if tp.PreScheduleCallCount != 1 {
+					t.Errorf("Plugin %s PreSchedule() called %d times, expected 1", plugin.Name(), tp.PreScheduleCallCount)
+				}
+			}
+
+			for _, plugin := range test.config.Filters {
+				tp, _ := plugin.(*TestPlugin)
+				if tp.FilterCallCount != 1 {
+					t.Errorf("Plugin %s Filter() called %d times, expected 1", plugin.Name(), tp.FilterCallCount)
+				}
+			}
+
+			for plugin := range test.config.Scorers {
+				tp, _ := plugin.(*TestPlugin)
+				if tp.ScoreCallCount != 1 {
+					t.Errorf("Plugin %s Score() called %d times, expected 1", plugin.Name(), tp.ScoreCallCount)
+				}
+				if test.numPodsToScore != tp.NumOfScoredPods {
+					t.Errorf("Plugin %s Score() called with %d pods, expected %d", plugin.Name(), tp.NumOfScoredPods, test.numPodsToScore)
+				}
+			}
+
+			tp, _ := test.config.Picker.(*TestPlugin)
+			if tp.NumOfPickerCandidates != test.numPodsToScore {
+				t.Errorf("Picker plugin %s Pick() called with %d candidates, expected %d", tp.Name(), tp.NumOfPickerCandidates, test.numPodsToScore)
+			}
+			if tp.PickCallCount != 1 {
+				t.Errorf("Picker plugin %s Pick() called %d times, expected 1", tp.Name(), tp.PickCallCount)
+			}
+			if tp.WinnerPodScore != test.targetPodScore {
+				t.Errorf("winner pod score %v, expected %v", tp.WinnerPodScore, test.targetPodScore)
+			}
+
+			for _, plugin := range test.config.PostSchedulePlugins {
+				tp, _ := plugin.(*TestPlugin)
+				if tp.PostScheduleCallCount != 1 {
+					t.Errorf("Plugin %s PostSchedule() called %d times, expected 1", plugin.Name(), tp.PostScheduleCallCount)
+				}
+			}
+		})
+	}
+}
+
+type fakeDataStore struct {
+	pods []*backendmetrics.FakePodMetrics
+}
+
+func (fds *fakeDataStore) PodGetAll() []backendmetrics.PodMetrics {
+	pm := make([]backendmetrics.PodMetrics, 0, len(fds.pods))
+	for _, pod := range fds.pods {
+		pm = append(pm, pod)
+	}
+	return pm
+}
+
+// TestPlugin is an implementation useful in unit tests.
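+// It records how many times each plugin hook is invoked and returns canned
+// filter, score, and pick results.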
+type TestPlugin struct {
+	NameRes               string
+	ScoreCallCount        int
+	NumOfScoredPods       int
+	ScoreRes              float64
+	FilterCallCount       int
+	FilterRes             []k8stypes.NamespacedName
+	PreScheduleCallCount  int
+	PostScheduleCallCount int
+	PickCallCount         int
+	NumOfPickerCandidates int
+	PickRes               k8stypes.NamespacedName
+	WinnerPodScore        float64
+}
+
+func (tp *TestPlugin) Name() string { return tp.NameRes }
+
+func (tp *TestPlugin) PreSchedule(ctx *types.SchedulingContext) {
+	tp.PreScheduleCallCount++
+}
+
+func (tp *TestPlugin) Filter(ctx *types.SchedulingContext, pods []types.Pod) []types.Pod {
+	tp.FilterCallCount++
+	return findPods(ctx, tp.FilterRes...)
+}
+
+func (tp *TestPlugin) Score(ctx *types.SchedulingContext, pods []types.Pod) map[types.Pod]float64 {
+	tp.ScoreCallCount++
+	scoredPods := make(map[types.Pod]float64, len(pods))
+	for _, pod := range pods {
+		scoredPods[pod] += tp.ScoreRes
+	}
+	tp.NumOfScoredPods = len(scoredPods)
+	return scoredPods
+}
+
+func (tp *TestPlugin) Pick(ctx *types.SchedulingContext, scoredPods []*types.ScoredPod) *types.Result {
+	tp.PickCallCount++
+	tp.NumOfPickerCandidates = len(scoredPods)
+	pod := findPods(ctx, tp.PickRes)[0]
+	tp.WinnerPodScore = getPodScore(scoredPods, pod)
+	return &types.Result{TargetPod: pod}
+}
+
+func (tp *TestPlugin) PostSchedule(ctx *types.SchedulingContext, res *types.Result) {
+	tp.PostScheduleCallCount++
+}
+
+func (tp *TestPlugin) reset() {
+	tp.PreScheduleCallCount = 0
+	tp.FilterCallCount = 0
+	tp.ScoreCallCount = 0
+	tp.NumOfScoredPods = 0
+	tp.PostScheduleCallCount = 0
+	tp.PickCallCount = 0
+	tp.NumOfPickerCandidates = 0
+}
+
+func findPods(ctx *types.SchedulingContext, names ...k8stypes.NamespacedName) []types.Pod {
+	res := []types.Pod{}
+	for _, pod := range ctx.PodsSnapshot {
+		for _, name := range names {
+			if pod.GetPod().NamespacedName.String() == name.String() {
+				res = append(res, pod)
+			}
+		}
+	}
+	return res
+}
+
+func getPodScore(scoredPods []*types.ScoredPod, selectedPod types.Pod) float64 {
+	finalScore := 0.0
+	for _, scoredPod := range scoredPods {
+		if scoredPod.GetPod().NamespacedName.String() == selectedPod.GetPod().NamespacedName.String() {
+			finalScore = scoredPod.Score
+			break
+		}
+	}
+	return finalScore
+}
diff --git a/pkg/epp/scheduling/types/types.go b/pkg/epp/scheduling/types/types.go
new file mode 100644
index 00000000..4f69fae0
--- /dev/null
+++ b/pkg/epp/scheduling/types/types.go
@@ -0,0 +1,104 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package types
+
+import (
+	"context"
+	"fmt"
+
+	"github.com/go-logr/logr"
+	"sigs.k8s.io/controller-runtime/pkg/log"
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend"
+	backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics"
+)
+
+// LLMRequest is a structured representation of the fields we parse out of the LLMRequest body.
+type LLMRequest struct {
+	Model string
+	// TargetModels maps a target model name to its weight.
+ TargetModels map[string]int + Prompt string + // Resolved target model is the final target model after traffic split. + ResolvedTargetModel string + Critical bool +} + +func (r *LLMRequest) String() string { + return fmt.Sprintf("Model: %s, TargetModels: %v, ResolvedTargetModel: %s, Critical: %t, PromptLength: %v", r.Model, r.TargetModels, r.ResolvedTargetModel, r.Critical, len(r.Prompt)) +} + +type Pod interface { + GetPod() *backend.Pod + GetMetrics() *backendmetrics.Metrics + String() string +} + +type ScoredPod struct { + Pod + Score float64 +} + +// SchedulingContext holds contextual information during a scheduling operation. +type SchedulingContext struct { + context.Context + Logger logr.Logger + Req *LLMRequest + PodsSnapshot []Pod +} + +func (pm *PodMetrics) String() string { + if pm == nil { + return "" + } + return fmt.Sprintf("%+v", *pm) +} + +func (pm *PodMetrics) GetPod() *backend.Pod { + return pm.Pod +} + +func (pm *PodMetrics) GetMetrics() *backendmetrics.Metrics { + return pm.Metrics +} + +type PodMetrics struct { + *backend.Pod + *backendmetrics.Metrics +} + +func NewSchedulingContext(ctx context.Context, req *LLMRequest, pods []Pod) *SchedulingContext { + logger := log.FromContext(ctx).WithValues("request", req) + return &SchedulingContext{ + Context: ctx, + Logger: logger, + Req: req, + PodsSnapshot: pods, + } +} + +func ToSchedulerPodMetrics(pods []backendmetrics.PodMetrics) []Pod { + pm := make([]Pod, 0, len(pods)) + for _, pod := range pods { + pm = append(pm, &PodMetrics{Pod: pod.GetPod().Clone(), Metrics: pod.GetMetrics().Clone()}) + } + return pm +} + +// Result captures the scheduler result. +type Result struct { + TargetPod Pod +} diff --git a/pkg/epp/server/controller_manager.go b/pkg/epp/server/controller_manager.go new file mode 100644 index 00000000..e5668210 --- /dev/null +++ b/pkg/epp/server/controller_manager.go @@ -0,0 +1,89 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package server + +import ( + "fmt" + + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/fields" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + utilruntime "k8s.io/apimachinery/pkg/util/runtime" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + "k8s.io/client-go/rest" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/cache" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/manager" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" +) + +var scheme = runtime.NewScheme() + +func init() { + utilruntime.Must(clientgoscheme.AddToScheme(scheme)) + utilruntime.Must(v1alpha2.Install(scheme)) +} + +// defaultManagerOptions returns the default options used to create the manager. 
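+// The cache is scoped to the pool's namespace, and the InferencePool watch is
+// further narrowed to the named pool object via a metadata.name field selector.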
+func defaultManagerOptions(namespacedName types.NamespacedName) ctrl.Options { + return ctrl.Options{ + Scheme: scheme, + Cache: cache.Options{ + ByObject: map[client.Object]cache.ByObject{ + &corev1.Pod{}: { + Namespaces: map[string]cache.Config{ + namespacedName.Namespace: {}, + }, + }, + &v1alpha2.InferencePool{}: { + Namespaces: map[string]cache.Config{ + namespacedName.Namespace: { + FieldSelector: fields.SelectorFromSet(fields.Set{ + "metadata.name": namespacedName.Name, + }), + }, + }, + }, + &v1alpha2.InferenceModel{}: { + Namespaces: map[string]cache.Config{ + namespacedName.Namespace: {}, + }, + }, + }, + }, + } +} + +// NewDefaultManager creates a new controller manager with default configuration. +func NewDefaultManager(namespacedName types.NamespacedName, restConfig *rest.Config) (ctrl.Manager, error) { + manager, err := ctrl.NewManager(restConfig, defaultManagerOptions(namespacedName)) + if err != nil { + return nil, fmt.Errorf("failed to create controller manager: %v", err) + } + return manager, nil +} + +// NewManagerWithOptions creates a new controller manager with injectable options. +func NewManagerWithOptions(restConfig *rest.Config, opts manager.Options) (ctrl.Manager, error) { + manager, err := ctrl.NewManager(restConfig, opts) + if err != nil { + return nil, fmt.Errorf("failed to create controller manager: %v", err) + } + return manager, nil +} diff --git a/pkg/epp/server/runserver.go b/pkg/epp/server/runserver.go new file mode 100644 index 00000000..687a555c --- /dev/null +++ b/pkg/epp/server/runserver.go @@ -0,0 +1,149 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package server + +import ( + "context" + "crypto/tls" + "fmt" + "time" + + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "github.com/go-logr/logr" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials" + "k8s.io/apimachinery/pkg/types" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/manager" + "sigs.k8s.io/gateway-api-inference-extension/internal/runnable" + tlsutil "sigs.k8s.io/gateway-api-inference-extension/internal/tls" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/controller" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/handlers" +) + +// ExtProcServerRunner provides methods to manage an external process server. +type ExtProcServerRunner struct { + GrpcPort int + DestinationEndpointHintMetadataNamespace string + DestinationEndpointHintKey string + PoolNamespacedName types.NamespacedName + Datastore datastore.Datastore + SecureServing bool + CertPath string + UseStreaming bool + RefreshPrometheusMetricsInterval time.Duration + Scheduler handlers.Scheduler + + // This should only be used in tests. We won't need this once we don't inject metrics in the tests. 
+	// TODO:(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/432) Cleanup
+	TestPodMetricsClient *backendmetrics.FakePodMetricsClient
+}
+
+// Default values for CLI flags in main
+const (
+	DefaultGrpcPort                                 = 9002                             // default for --grpcPort
+	DefaultDestinationEndpointHintMetadataNamespace = "envoy.lb"                       // default for --destinationEndpointHintMetadataNamespace
+	DefaultDestinationEndpointHintKey               = "x-gateway-destination-endpoint" // default for --destinationEndpointHintKey
+	DefaultPoolName                                 = ""                               // required but no default
+	DefaultPoolNamespace                            = "default"                        // default for --poolNamespace
+	DefaultRefreshMetricsInterval                   = 50 * time.Millisecond            // default for --refreshMetricsInterval
+	DefaultRefreshPrometheusMetricsInterval         = 5 * time.Second                  // default for --refreshPrometheusMetricsInterval
+	DefaultSecureServing                            = true                             // default for --secureServing
+)
+
+func NewDefaultExtProcServerRunner() *ExtProcServerRunner {
+	return &ExtProcServerRunner{
+		GrpcPort:                   DefaultGrpcPort,
+		DestinationEndpointHintKey: DefaultDestinationEndpointHintKey,
+		DestinationEndpointHintMetadataNamespace: DefaultDestinationEndpointHintMetadataNamespace,
+		PoolNamespacedName:               types.NamespacedName{Name: DefaultPoolName, Namespace: DefaultPoolNamespace},
+		SecureServing:                    DefaultSecureServing,
+		RefreshPrometheusMetricsInterval: DefaultRefreshPrometheusMetricsInterval,
+		// Datastore can be assigned later.
+	}
+}
+
+// SetupWithManager sets up the runner with the given manager.
+func (r *ExtProcServerRunner) SetupWithManager(ctx context.Context, mgr ctrl.Manager) error {
+	// Create the controllers and register them with the manager
+	if err := (&controller.InferencePoolReconciler{
+		Datastore: r.Datastore,
+		Client:    mgr.GetClient(),
+		Record:    mgr.GetEventRecorderFor("InferencePool"),
+	}).SetupWithManager(mgr); err != nil {
+		return fmt.Errorf("failed setting up InferencePoolReconciler: %w", err)
+	}
+
+	if err := (&controller.InferenceModelReconciler{
+		Datastore:          r.Datastore,
+		Client:             mgr.GetClient(),
+		PoolNamespacedName: r.PoolNamespacedName,
+		Record:             mgr.GetEventRecorderFor("InferenceModel"),
+	}).SetupWithManager(ctx, mgr); err != nil {
+		return fmt.Errorf("failed setting up InferenceModelReconciler: %w", err)
+	}
+
+	if err := (&controller.PodReconciler{
+		Datastore: r.Datastore,
+		Client:    mgr.GetClient(),
+		Record:    mgr.GetEventRecorderFor("pod"),
+	}).SetupWithManager(mgr); err != nil {
+		return fmt.Errorf("failed setting up PodReconciler: %w", err)
+	}
+	return nil
+}
+
+// AsRunnable returns a Runnable that can be used to start the ext-proc gRPC server.
+// The runnable implements LeaderElectionRunnable with leader election disabled.
+func (r *ExtProcServerRunner) AsRunnable(logger logr.Logger) manager.Runnable {
+	return runnable.NoLeaderElection(manager.RunnableFunc(func(ctx context.Context) error {
+		backendmetrics.StartMetricsLogger(ctx, r.Datastore, r.RefreshPrometheusMetricsInterval)
+		var srv *grpc.Server
+		if r.SecureServing {
+			var cert tls.Certificate
+			var err error
+			if r.CertPath != "" {
+				cert, err = tls.LoadX509KeyPair(r.CertPath+"/tls.crt", r.CertPath+"/tls.key")
+			} else {
+				// Create a self-signed TLS certificate.
+				cert, err = tlsutil.CreateSelfSignedTLSCertificate(logger)
+			}
+			if err != nil {
+				logger.Error(err, "Failed to load or create TLS certificate")
+				return err
+			}
+
+			creds := credentials.NewTLS(&tls.Config{
+				Certificates: []tls.Certificate{cert},
+			})
+			// Init the server.
+			srv = grpc.NewServer(grpc.Creds(creds))
+		} else {
+			srv = grpc.NewServer()
+		}
+		extProcServer := handlers.NewStreamingServer(r.Scheduler, r.DestinationEndpointHintMetadataNamespace, r.DestinationEndpointHintKey, r.Datastore)
+		extProcPb.RegisterExternalProcessorServer(
+			srv,
+			extProcServer,
+		)
+
+		// Forward to the gRPC runnable.
+		return runnable.GRPCServer("ext-proc", srv, r.GrpcPort).Start(ctx)
+	}))
+}
diff --git a/pkg/epp/server/runserver_test.go b/pkg/epp/server/runserver_test.go
new file mode 100644
index 00000000..b02688c5
--- /dev/null
+++ b/pkg/epp/server/runserver_test.go
@@ -0,0 +1,38 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package server_test
+
+import (
+	"testing"
+
+	"sigs.k8s.io/controller-runtime/pkg/manager"
+
+	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+func TestRunnable(t *testing.T) {
+	// Make sure AsRunnable() does not use leader election.
+	runner := server.NewDefaultExtProcServerRunner().AsRunnable(logutil.NewTestLogger())
+	r, ok := runner.(manager.LeaderElectionRunnable)
+	if !ok {
+		t.Fatal("runner is not LeaderElectionRunnable")
+	}
+	if r.NeedLeaderElection() {
+		t.Error("runner returned NeedLeaderElection = true, expected false")
+	}
+}
diff --git a/pkg/epp/util/env/env.go b/pkg/epp/util/env/env.go
new file mode 100644
index 00000000..0c6d1c6d
--- /dev/null
+++ b/pkg/epp/util/env/env.go
@@ -0,0 +1,61 @@
+package env
+
+import (
+	"os"
+	"strconv"
+
+	"github.com/go-logr/logr"
+)
+
+// GetEnvFloat gets a float64 from an environment variable with a default value.
+func GetEnvFloat(key string, defaultVal float64, logger logr.Logger) float64 {
+	val, exists := os.LookupEnv(key)
+	if !exists {
+		logger.Info("Environment variable not set, using default value",
+			"key", key, "defaultValue", defaultVal)
+		return defaultVal
+	}
+
+	floatVal, err := strconv.ParseFloat(val, 64)
+	if err != nil {
+		logger.Info("Failed to parse environment variable as float, using default value",
+			"key", key, "value", val, "error", err, "defaultValue", defaultVal)
+		return defaultVal
+	}
+
+	logger.Info("Successfully loaded environment variable",
+		"key", key, "value", floatVal)
+	return floatVal
+}
+
+// GetEnvInt gets an int from an environment variable with a default value.
+func GetEnvInt(key string, defaultVal int, logger logr.Logger) int {
+	val, exists := os.LookupEnv(key)
+	if !exists {
+		logger.Info("Environment variable not set, using default value",
+			"key", key, "defaultValue", defaultVal)
+		return defaultVal
+	}
+
+	intVal, err := strconv.Atoi(val)
+	if err != nil {
+		logger.Info("Failed to parse environment variable as int, using default value",
+			"key", key, "value", val, "error", err, "defaultValue", defaultVal)
+		return defaultVal
+	}
+
+	logger.Info("Successfully loaded environment variable",
+		"key", key, "value", intVal)
+	return intVal
+}
+
+// GetEnvString gets a string from an environment variable with a default value.
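+// Unlike the numeric helpers, a variable that is set to the empty string is
+// returned as-is rather than falling back to the default. For example
+// (POOL_NAMESPACE is an illustrative key):
+//
+//	ns := GetEnvString("POOL_NAMESPACE", "default", logger)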
+func GetEnvString(key string, defaultVal string, logger logr.Logger) string { + val, exists := os.LookupEnv(key) + if !exists { + logger.Info("Environment variable not set, using default value", + "key", key, "defaultValue", defaultVal) + return defaultVal + } + return val +} diff --git a/pkg/epp/util/env/env_test.go b/pkg/epp/util/env/env_test.go new file mode 100644 index 00000000..105beb28 --- /dev/null +++ b/pkg/epp/util/env/env_test.go @@ -0,0 +1,205 @@ +package env + +import ( + "os" + "testing" + + "github.com/go-logr/logr/testr" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" +) + +func TestGetEnvFloat(t *testing.T) { + logger := testr.New(t) + + tests := []struct { + name string + key string + value string + defaultVal float64 + expected float64 + setup func() + teardown func() + }{ + { + name: "env variable exists and is valid", + key: "TEST_FLOAT", + value: "123.456", + defaultVal: 0.0, + expected: 123.456, + setup: func() { + os.Setenv("TEST_FLOAT", "123.456") + }, + teardown: func() { + os.Unsetenv("TEST_FLOAT") + }, + }, + { + name: "env variable exists but is invalid", + key: "TEST_FLOAT", + value: "invalid", + defaultVal: 99.9, + expected: 99.9, + setup: func() { + os.Setenv("TEST_FLOAT", "invalid") + }, + teardown: func() { + os.Unsetenv("TEST_FLOAT") + }, + }, + { + name: "env variable does not exist", + key: "TEST_FLOAT_MISSING", + defaultVal: 42.42, + expected: 42.42, + setup: func() {}, + teardown: func() {}, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + tc.setup() + defer tc.teardown() + + result := GetEnvFloat(tc.key, tc.defaultVal, logger.V(logutil.VERBOSE)) + if result != tc.expected { + t.Errorf("GetEnvFloat(%s, %f) = %f, expected %f", tc.key, tc.defaultVal, result, tc.expected) + } + }) + } +} + +func TestGetEnvInt(t *testing.T) { + logger := testr.New(t) + + tests := []struct { + name string + key string + value string + defaultVal int + expected int + setup func() + teardown func() + }{ + { + name: "env variable exists and is valid", + key: "TEST_INT", + value: "123", + defaultVal: 0, + expected: 123, + setup: func() { + os.Setenv("TEST_INT", "123") + }, + teardown: func() { + os.Unsetenv("TEST_INT") + }, + }, + { + name: "env variable exists but is invalid", + key: "TEST_INT", + value: "invalid", + defaultVal: 99, + expected: 99, + setup: func() { + os.Setenv("TEST_INT", "invalid") + }, + teardown: func() { + os.Unsetenv("TEST_INT") + }, + }, + { + name: "env variable does not exist", + key: "TEST_INT_MISSING", + defaultVal: 42, + expected: 42, + setup: func() {}, + teardown: func() {}, + }, + { + name: "env variable is empty string", + key: "TEST_INT_EMPTY", + value: "", + defaultVal: 77, + expected: 77, + setup: func() { + os.Setenv("TEST_INT_EMPTY", "") + }, + teardown: func() { + os.Unsetenv("TEST_INT_EMPTY") + }, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + tc.setup() + defer tc.teardown() + + result := GetEnvInt(tc.key, tc.defaultVal, logger.V(logutil.VERBOSE)) + if result != tc.expected { + t.Errorf("GetEnvInt(%s, %d) = %d, expected %d", tc.key, tc.defaultVal, result, tc.expected) + } + }) + } +} + +func TestGetEnvString(t *testing.T) { + logger := testr.New(t) + + tests := []struct { + name string + key string + value string + defaultVal string + expected string + setup func() + teardown func() + }{ + { + name: "env variable exists and is valid", + key: "TEST_STR", + value: "123", + defaultVal: "default", + expected: "123", + setup: func() { + 
os.Setenv("TEST_STR", "123") + }, + teardown: func() { + os.Unsetenv("TEST_STR") + }, + }, + { + name: "env variable does not exist", + key: "TEST_STR_MISSING", + defaultVal: "default", + expected: "default", + setup: func() {}, + teardown: func() {}, + }, + { + name: "env variable is empty string", + key: "TEST_STR_EMPTY", + value: "", + defaultVal: "default", + expected: "", + setup: func() { + os.Setenv("TEST_STR_EMPTY", "") + }, + teardown: func() { + os.Unsetenv("TEST_STR_EMPTY") + }, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + tc.setup() + defer tc.teardown() + + result := GetEnvString(tc.key, tc.defaultVal, logger.V(logutil.VERBOSE)) + if result != tc.expected { + t.Errorf("GetEnvString(%s, %s) = %s, expected %s", tc.key, tc.defaultVal, result, tc.expected) + } + }) + } +} diff --git a/pkg/epp/util/error/error.go b/pkg/epp/util/error/error.go new file mode 100644 index 00000000..2f9c992c --- /dev/null +++ b/pkg/epp/util/error/error.go @@ -0,0 +1,34 @@ +package error + +import ( + "fmt" +) + +// Error is an error struct for errors returned by the epp server. +type Error struct { + Code string + Msg string +} + +const ( + Unknown = "Unknown" + BadRequest = "BadRequest" + Internal = "Internal" + ModelServerError = "ModelServerError" + BadConfiguration = "BadConfiguration" + InferencePoolResourceExhausted = "InferencePoolResourceExhausted" +) + +// Error returns a string version of the error. +func (e Error) Error() string { + return fmt.Sprintf("inference gateway: %s - %s", e.Code, e.Msg) +} + +// CanonicalCode returns the error's ErrorCode. +func CanonicalCode(err error) string { + e, ok := err.(Error) + if ok { + return e.Code + } + return Unknown +} diff --git a/pkg/epp/util/logging/fatal.go b/pkg/epp/util/logging/fatal.go new file mode 100644 index 00000000..d8a9a937 --- /dev/null +++ b/pkg/epp/util/logging/fatal.go @@ -0,0 +1,31 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package logging + +import ( + "os" + + "github.com/go-logr/logr" +) + +// Fatal calls logger.Error followed by os.Exit(1). +// +// This is a utility function and should not be used in production code! +func Fatal(logger logr.Logger, err error, msg string, keysAndValues ...interface{}) { + logger.Error(err, msg, keysAndValues...) + os.Exit(1) +} diff --git a/pkg/epp/util/logging/logger.go b/pkg/epp/util/logging/logger.go new file mode 100644 index 00000000..5e6ed88d --- /dev/null +++ b/pkg/epp/util/logging/logger.go @@ -0,0 +1,36 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. +*/ + +package logging + +import ( + "context" + + "github.com/go-logr/logr" + uberzap "go.uber.org/zap" + "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/log/zap" +) + +// NewTestLogger creates a new Zap logger using the dev mode. +func NewTestLogger() logr.Logger { + return zap.New(zap.UseDevMode(true), zap.RawZapOpts(uberzap.AddCaller())) +} + +// NewTestLoggerIntoContext creates a new Zap logger using the dev mode and inserts it into the given context. +func NewTestLoggerIntoContext(ctx context.Context) context.Context { + return log.IntoContext(ctx, zap.New(zap.UseDevMode(true), zap.RawZapOpts(uberzap.AddCaller()))) +} diff --git a/pkg/epp/util/logging/logging_const.go b/pkg/epp/util/logging/logging_const.go new file mode 100644 index 00000000..823ab28b --- /dev/null +++ b/pkg/epp/util/logging/logging_const.go @@ -0,0 +1,24 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package logging + +const ( + DEFAULT = 2 + VERBOSE = 3 + DEBUG = 4 + TRACE = 5 +) diff --git a/pkg/epp/util/pod/pod.go b/pkg/epp/util/pod/pod.go new file mode 100644 index 00000000..4fcb948f --- /dev/null +++ b/pkg/epp/util/pod/pod.go @@ -0,0 +1,36 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +package pod + +import ( + corev1 "k8s.io/api/core/v1" +) + +func IsPodReady(pod *corev1.Pod) bool { + if !pod.DeletionTimestamp.IsZero() { + return false + } + for _, condition := range pod.Status.Conditions { + if condition.Type == corev1.PodReady { + if condition.Status == corev1.ConditionTrue { + return true + } + break + } + } + return false +} diff --git a/pkg/epp/util/testing/diff.go b/pkg/epp/util/testing/diff.go new file mode 100644 index 00000000..34b0b8ca --- /dev/null +++ b/pkg/epp/util/testing/diff.go @@ -0,0 +1,27 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/
+
+package testing
+
+import (
+	"github.com/google/go-cmp/cmp"
+	"github.com/google/go-cmp/cmp/cmpopts"
+	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
+)
+
+func DiffModelLists(want, got []*v1alpha2.InferenceModel) string {
+	return cmp.Diff(want, got, cmpopts.SortSlices(func(a, b *v1alpha2.InferenceModel) bool { return a.Name < b.Name }))
+}
diff --git a/pkg/epp/util/testing/wrappers.go b/pkg/epp/util/testing/wrappers.go
new file mode 100644
index 00000000..130f017e
--- /dev/null
+++ b/pkg/epp/util/testing/wrappers.go
@@ -0,0 +1,214 @@
+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
+
+package testing
+
+import (
+	corev1 "k8s.io/api/core/v1"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
+)
+
+// PodWrapper wraps a Pod.
+type PodWrapper struct {
+	corev1.Pod
+}
+
+func FromBase(pod *corev1.Pod) *PodWrapper {
+	return &PodWrapper{
+		Pod: *pod,
+	}
+}
+
+// MakePod creates a wrapper for a Pod.
+func MakePod(podName string) *PodWrapper {
+	return &PodWrapper{
+		corev1.Pod{
+			ObjectMeta: metav1.ObjectMeta{
+				Name: podName,
+			},
+			Spec:   corev1.PodSpec{},
+			Status: corev1.PodStatus{},
+		},
+	}
+}
+
+// Complete sets the fields required for the Pod to be accepted by the apiserver.
+func (p *PodWrapper) Complete() *PodWrapper {
+	if p.Pod.Namespace == "" {
+		p.Namespace("default")
+	}
+	p.Spec.Containers = []corev1.Container{
+		{
+			Name:  "mock-vllm",
+			Image: "mock-vllm:latest",
+		},
+	}
+	return p
+}
+
+func (p *PodWrapper) Namespace(ns string) *PodWrapper {
+	p.ObjectMeta.Namespace = ns
+	return p
+}
+
+// Labels sets the pod labels.
+func (p *PodWrapper) Labels(labels map[string]string) *PodWrapper {
+	p.ObjectMeta.Labels = labels
+	return p
+}
+
+// LabelsFromPoolSelector sets the pod labels from an InferencePool selector.
+func (p *PodWrapper) LabelsFromPoolSelector(selector map[v1alpha2.LabelKey]v1alpha2.LabelValue) *PodWrapper {
+	if p.ObjectMeta.Labels == nil {
+		p.ObjectMeta.Labels = map[string]string{}
+	}
+	for k, v := range selector {
+		p.ObjectMeta.Labels[string(k)] = string(v)
+	}
+	return p
+}
+
+// ReadyCondition sets a PodReady=true condition.
+func (p *PodWrapper) ReadyCondition() *PodWrapper {
+	p.Status.Conditions = []corev1.PodCondition{{
+		Type:   corev1.PodReady,
+		Status: corev1.ConditionTrue,
+	}}
+	return p
+}
+
+func (p *PodWrapper) IP(ip string) *PodWrapper {
+	p.Status.PodIP = ip
+	return p
+}
+
+func (p *PodWrapper) DeletionTimestamp() *PodWrapper {
+	now := metav1.Now()
+	p.ObjectMeta.DeletionTimestamp = &now
+	p.ObjectMeta.Finalizers = []string{"finalizer"}
+	return p
+}
+
+// ObjRef returns the wrapped Pod.
+func (p *PodWrapper) ObjRef() *corev1.Pod {
+	return &p.Pod
+}
+
+// InferenceModelWrapper wraps an InferenceModel.
+type InferenceModelWrapper struct {
+	v1alpha2.InferenceModel
+}
+
+// MakeInferenceModel creates a wrapper for an InferenceModel.
+func MakeInferenceModel(name string) *InferenceModelWrapper {
+	return &InferenceModelWrapper{
+		v1alpha2.InferenceModel{
+			ObjectMeta: metav1.ObjectMeta{
+				Name: name,
+			},
+			Spec: v1alpha2.InferenceModelSpec{},
+		},
+	}
+}
+
+func (m *InferenceModelWrapper) Namespace(ns string) *InferenceModelWrapper {
+	m.ObjectMeta.Namespace = ns
+	return m
+}
+
+// ObjRef returns the wrapped InferenceModel.
+func (m *InferenceModelWrapper) ObjRef() *v1alpha2.InferenceModel {
+	return &m.InferenceModel
+}
+
+func (m *InferenceModelWrapper) ModelName(modelName string) *InferenceModelWrapper {
+	m.Spec.ModelName = modelName
+	return m
+}
+
+func (m *InferenceModelWrapper) TargetModel(modelName string) *InferenceModelWrapper {
+	m.Spec.TargetModels = append(m.Spec.TargetModels, v1alpha2.TargetModel{Name: modelName})
+	return m
+}
+
+func (m *InferenceModelWrapper) PoolName(poolName string) *InferenceModelWrapper {
+	m.Spec.PoolRef = v1alpha2.PoolObjectReference{Name: v1alpha2.ObjectName(poolName)}
+	return m
+}
+
+func (m *InferenceModelWrapper) Criticality(criticality v1alpha2.Criticality) *InferenceModelWrapper {
+	m.Spec.Criticality = &criticality
+	return m
+}
+
+func (m *InferenceModelWrapper) DeletionTimestamp() *InferenceModelWrapper {
+	now := metav1.Now()
+	m.ObjectMeta.DeletionTimestamp = &now
+	m.ObjectMeta.Finalizers = []string{"finalizer"}
+	return m
+}
+
+func (m *InferenceModelWrapper) CreationTimestamp(t metav1.Time) *InferenceModelWrapper {
+	m.ObjectMeta.CreationTimestamp = t
+	return m
+}
+
+// InferencePoolWrapper wraps an InferencePool.
+type InferencePoolWrapper struct {
+	v1alpha2.InferencePool
+}
+
+// MakeInferencePool creates a wrapper for an InferencePool.
+func MakeInferencePool(name string) *InferencePoolWrapper {
+	return &InferencePoolWrapper{
+		v1alpha2.InferencePool{
+			ObjectMeta: metav1.ObjectMeta{
+				Name: name,
+			},
+			Spec: v1alpha2.InferencePoolSpec{},
+		},
+	}
+}
+
+func (m *InferencePoolWrapper) Namespace(ns string) *InferencePoolWrapper {
+	m.ObjectMeta.Namespace = ns
+	return m
+}
+
+func (m *InferencePoolWrapper) Selector(selector map[string]string) *InferencePoolWrapper {
+	s := make(map[v1alpha2.LabelKey]v1alpha2.LabelValue)
+	for k, v := range selector {
+		s[v1alpha2.LabelKey(k)] = v1alpha2.LabelValue(v)
+	}
+	m.Spec.Selector = s
+	return m
+}
+
+func (m *InferencePoolWrapper) TargetPortNumber(p int32) *InferencePoolWrapper {
+	m.Spec.TargetPortNumber = p
+	return m
+}
+
+func (m *InferencePoolWrapper) ExtensionRef(name string) *InferencePoolWrapper {
+	m.Spec.ExtensionRef = &v1alpha2.Extension{ExtensionReference: v1alpha2.ExtensionReference{Name: v1alpha2.ObjectName(name)}}
+	return m
+}
+
+// ObjRef returns the wrapped InferencePool.
+func (m *InferencePoolWrapper) ObjRef() *v1alpha2.InferencePool { + return &m.InferencePool +} diff --git a/pkg/ext-proc/backend/datastore.go b/pkg/ext-proc/backend/datastore.go deleted file mode 100644 index 627ddbe5..00000000 --- a/pkg/ext-proc/backend/datastore.go +++ /dev/null @@ -1,211 +0,0 @@ -package backend - -import ( - "context" - "errors" - "math/rand" - "sync" - "time" - - "github.com/google/go-cmp/cmp" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - corev1 "k8s.io/api/core/v1" - v1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/api/meta" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" - "k8s.io/apimachinery/pkg/labels" - "k8s.io/client-go/informers" - informersv1 "k8s.io/client-go/informers/core/v1" - "k8s.io/client-go/kubernetes" - clientset "k8s.io/client-go/kubernetes" - listersv1 "k8s.io/client-go/listers/core/v1" - "k8s.io/client-go/tools/cache" - "k8s.io/klog/v2" -) - -func NewK8sDataStore(options ...K8sDatastoreOption) *K8sDatastore { - store := &K8sDatastore{ - poolMu: sync.RWMutex{}, - InferenceModels: &sync.Map{}, - } - - store.podListerFactory = store.createPodLister - for _, opt := range options { - opt(store) - } - return store -} - -// The datastore is a local cache of relevant data for the given InferencePool (currently all pulled from k8s-api) -type K8sDatastore struct { - client kubernetes.Interface - // poolMu is used to synchronize access to the inferencePool. - poolMu sync.RWMutex - inferencePool *v1alpha1.InferencePool - podListerFactory PodListerFactory - podLister *PodLister - InferenceModels *sync.Map -} - -type K8sDatastoreOption func(*K8sDatastore) -type PodListerFactory func(*v1alpha1.InferencePool) *PodLister - -// WithPods can be used in tests to override the pods. -func WithPodListerFactory(factory PodListerFactory) K8sDatastoreOption { - return func(store *K8sDatastore) { - store.podListerFactory = factory - } -} - -type PodLister struct { - Lister listersv1.PodLister - sharedInformer informers.SharedInformerFactory -} - -func (l *PodLister) listEverything() ([]*corev1.Pod, error) { - return l.Lister.List(labels.Everything()) - -} - -func (ds *K8sDatastore) SetClient(client kubernetes.Interface) { - ds.client = client -} - -func (ds *K8sDatastore) setInferencePool(pool *v1alpha1.InferencePool) { - ds.poolMu.Lock() - defer ds.poolMu.Unlock() - - if ds.inferencePool != nil && cmp.Equal(ds.inferencePool.Spec.Selector, pool.Spec.Selector) { - // Pool updated, but the selector stayed the same, so no need to change the informer. - ds.inferencePool = pool - return - } - - // New pool or selector updated. - ds.inferencePool = pool - - if ds.podLister != nil && ds.podLister.sharedInformer != nil { - // Shutdown the old informer async since this takes a few seconds. - go func() { - ds.podLister.sharedInformer.Shutdown() - }() - } - - if ds.podListerFactory != nil { - // Create a new informer with the new selector. 
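An aside on the informer handling here: the deleted datastore recreates its pod informer whenever the pool selector changes, because a running informer's list/watch filter cannot be changed in place. The same scoping that `createPodLister` (below) assembles by hand via `NewFilteredPodInformer` can be sketched with client-go's factory options. The helper name is hypothetical; only standard client-go calls are used.

```go
package backend

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	listersv1 "k8s.io/client-go/listers/core/v1"
)

// newFilteredPodLister (hypothetical) builds a pod lister scoped to one
// namespace and label selector, with resync disabled (0) as in the datastore.
func newFilteredPodLister(ctx context.Context, client kubernetes.Interface, namespace string, selector map[string]string) listersv1.PodLister {
	factory := informers.NewSharedInformerFactoryWithOptions(client, 0,
		informers.WithNamespace(namespace),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = labels.SelectorFromSet(labels.Set(selector)).String()
		}),
	)
	lister := factory.Core().V1().Pods().Lister()
	factory.Start(ctx.Done())            // non-blocking
	factory.WaitForCacheSync(ctx.Done()) // block until the pod cache is warm
	return lister
}
```

The deleted implementation continues below, swapping `ds.podLister` to the informer built for the new selector.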
- ds.podLister = ds.podListerFactory(ds.inferencePool) - if ds.podLister != nil && ds.podLister.sharedInformer != nil { - ctx := context.Background() - ds.podLister.sharedInformer.Start(ctx.Done()) - ds.podLister.sharedInformer.WaitForCacheSync(ctx.Done()) - } - } -} - -func (ds *K8sDatastore) getInferencePool() (*v1alpha1.InferencePool, error) { - ds.poolMu.RLock() - defer ds.poolMu.RUnlock() - if !ds.HasSynced() { - return nil, errors.New("InferencePool is not initialized in data store") - } - return ds.inferencePool, nil -} - -func (ds *K8sDatastore) createPodLister(pool *v1alpha1.InferencePool) *PodLister { - if ds.client == nil { - return nil - } - klog.V(logutil.DEFAULT).Infof("Creating informer for pool %v", pool.Name) - selectorSet := make(map[string]string) - for k, v := range pool.Spec.Selector { - selectorSet[string(k)] = string(v) - } - - newPodInformer := func(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer { - informer := informersv1.NewFilteredPodInformer(cs, pool.Namespace, resyncPeriod, cache.Indexers{}, func(options *metav1.ListOptions) { - options.LabelSelector = labels.SelectorFromSet(selectorSet).String() - }) - err := informer.SetTransform(func(obj interface{}) (interface{}, error) { - // Remove unnecessary fields to improve memory footprint. - if accessor, err := meta.Accessor(obj); err == nil { - if accessor.GetManagedFields() != nil { - accessor.SetManagedFields(nil) - } - } - return obj, nil - }) - if err != nil { - klog.Errorf("Failed to set pod transformer: %v", err) - } - return informer - } - // 0 means we disable resyncing, it is not really useful to resync every hour (the controller-runtime default), - // if things go wrong in the watch, no one will wait for an hour for things to get fixed. - // As precedence, kube-scheduler also disables this since it is expensive to list all pods from the api-server regularly. - resyncPeriod := time.Duration(0) - sharedInformer := informers.NewSharedInformerFactory(ds.client, resyncPeriod) - sharedInformer.InformerFor(&v1.Pod{}, newPodInformer) - - return &PodLister{ - Lister: sharedInformer.Core().V1().Pods().Lister(), - sharedInformer: sharedInformer, - } -} - -func (ds *K8sDatastore) getPods() ([]*corev1.Pod, error) { - ds.poolMu.RLock() - defer ds.poolMu.RUnlock() - if !ds.HasSynced() { - return nil, errors.New("InferencePool is not initialized in datastore") - } - pods, err := ds.podLister.listEverything() - if err != nil { - return nil, err - } - return pods, nil -} - -func (s *K8sDatastore) FetchModelData(modelName string) (returnModel *v1alpha1.InferenceModel) { - infModel, ok := s.InferenceModels.Load(modelName) - if ok { - returnModel = infModel.(*v1alpha1.InferenceModel) - } - return -} - -// HasSynced returns true if InferencePool is set in the data store. 
-func (ds *K8sDatastore) HasSynced() bool { - ds.poolMu.RLock() - defer ds.poolMu.RUnlock() - return ds.inferencePool != nil -} - -func RandomWeightedDraw(model *v1alpha1.InferenceModel, seed int64) string { - var weights int32 - - source := rand.NewSource(rand.Int63()) - if seed > 0 { - source = rand.NewSource(seed) - } - r := rand.New(source) - for _, model := range model.Spec.TargetModels { - weights += *model.Weight - } - klog.V(logutil.VERBOSE).Infof("Weights for Model(%v) total to: %v", model.Name, weights) - randomVal := r.Int31n(weights) - for _, model := range model.Spec.TargetModels { - if randomVal < *model.Weight { - return model.Name - } - randomVal -= *model.Weight - } - return "" -} - -func IsCritical(model *v1alpha1.InferenceModel) bool { - if model.Spec.Criticality != nil && *model.Spec.Criticality == v1alpha1.Critical { - return true - } - return false -} diff --git a/pkg/ext-proc/backend/datastore_test.go b/pkg/ext-proc/backend/datastore_test.go deleted file mode 100644 index 323b3bb0..00000000 --- a/pkg/ext-proc/backend/datastore_test.go +++ /dev/null @@ -1,133 +0,0 @@ -package backend - -import ( - "testing" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - v1 "k8s.io/apimachinery/pkg/apis/meta/v1" -) - -func TestHasSynced(t *testing.T) { - tests := []struct { - name string - inferencePool *v1alpha1.InferencePool - hasSynced bool - }{ - { - name: "Ready when InferencePool exists in data store", - inferencePool: &v1alpha1.InferencePool{ - ObjectMeta: v1.ObjectMeta{ - Name: "test-pool", - Namespace: "default", - }, - }, - hasSynced: true, - }, - { - name: "Not ready when InferencePool is nil in data store", - inferencePool: nil, - hasSynced: false, - }, - } - for _, tt := range tests { - t.Run(tt.name, func(t *testing.T) { - datastore := NewK8sDataStore() - // Set the inference pool - if tt.inferencePool != nil { - datastore.setInferencePool(tt.inferencePool) - } - // Check if the data store has been initialized - hasSynced := datastore.HasSynced() - if hasSynced != tt.hasSynced { - t.Errorf("IsInitialized() = %v, want %v", hasSynced, tt.hasSynced) - } - }) - } -} - -func TestRandomWeightedDraw(t *testing.T) { - tests := []struct { - name string - model *v1alpha1.InferenceModel - want string - }{ - { - name: "'random' distribution", - model: &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - TargetModels: []v1alpha1.TargetModel{ - { - Name: "canary", - Weight: pointer(50), - }, - { - Name: "v1", - Weight: pointer(50), - }, - }, - }, - }, - want: "canary", - }, - { - name: "'random' distribution", - model: &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - TargetModels: []v1alpha1.TargetModel{ - { - Name: "canary", - Weight: pointer(25), - }, - { - Name: "v1.1", - Weight: pointer(55), - }, - { - Name: "v1", - Weight: pointer(50), - }, - }, - }, - }, - want: "v1", - }, - { - name: "'random' distribution", - model: &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - TargetModels: []v1alpha1.TargetModel{ - { - Name: "canary", - Weight: pointer(20), - }, - { - Name: "v1.1", - Weight: pointer(20), - }, - { - Name: "v1", - Weight: pointer(10), - }, - }, - }, - }, - want: "v1.1", - }, - } - var seedVal int64 = 420 - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - for range 10000 { - model := RandomWeightedDraw(test.model, seedVal) - if model != test.want { - t.Errorf("Model returned!: %v", model) - break - } - } - }) - } -} - -func pointer(v int32) *int32 { - return &v -} diff --git 
a/pkg/ext-proc/backend/fake.go b/pkg/ext-proc/backend/fake.go deleted file mode 100644 index 63f20db6..00000000 --- a/pkg/ext-proc/backend/fake.go +++ /dev/null @@ -1,29 +0,0 @@ -package backend - -import ( - "context" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - klog "k8s.io/klog/v2" -) - -type FakePodMetricsClient struct { - Err map[string]error - Res map[string]*PodMetrics -} - -func (f *FakePodMetricsClient) FetchMetrics(ctx context.Context, pod Pod, existing *PodMetrics) (*PodMetrics, error) { - if err, ok := f.Err[pod.Name]; ok { - return nil, err - } - klog.V(1).Infof("pod: %+v\n existing: %+v \n new: %+v \n", pod, existing, f.Res[pod.Name]) - return f.Res[pod.Name], nil -} - -type FakeDataStore struct { - Res map[string]*v1alpha1.InferenceModel -} - -func (fds *FakeDataStore) FetchModelData(modelName string) (returnModel *v1alpha1.InferenceModel) { - return fds.Res[modelName] -} diff --git a/pkg/ext-proc/backend/inferencemodel_reconciler.go b/pkg/ext-proc/backend/inferencemodel_reconciler.go deleted file mode 100644 index 3164e098..00000000 --- a/pkg/ext-proc/backend/inferencemodel_reconciler.go +++ /dev/null @@ -1,56 +0,0 @@ -package backend - -import ( - "context" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - "k8s.io/apimachinery/pkg/runtime" - "k8s.io/apimachinery/pkg/types" - "k8s.io/client-go/tools/record" - "k8s.io/klog/v2" - ctrl "sigs.k8s.io/controller-runtime" - "sigs.k8s.io/controller-runtime/pkg/client" -) - -type InferenceModelReconciler struct { - client.Client - Scheme *runtime.Scheme - Record record.EventRecorder - Datastore *K8sDatastore - PoolNamespacedName types.NamespacedName -} - -func (c *InferenceModelReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { - if req.Namespace != c.PoolNamespacedName.Namespace { - return ctrl.Result{}, nil - } - klog.V(1).Infof("reconciling InferenceModel %v", req.NamespacedName) - - service := &v1alpha1.InferenceModel{} - if err := c.Get(ctx, req.NamespacedName, service); err != nil { - klog.Error(err, "unable to get InferencePool") - return ctrl.Result{}, err - } - - c.updateDatastore(service) - return ctrl.Result{}, nil -} - -func (c *InferenceModelReconciler) SetupWithManager(mgr ctrl.Manager) error { - return ctrl.NewControllerManagedBy(mgr). - For(&v1alpha1.InferenceModel{}). - Complete(c) -} - -func (c *InferenceModelReconciler) updateDatastore(infModel *v1alpha1.InferenceModel) { - if infModel.Spec.PoolRef.Name == c.PoolNamespacedName.Name { - klog.V(1).Infof("Incoming pool ref %v, server pool name: %v", infModel.Spec.PoolRef, c.PoolNamespacedName.Name) - klog.V(1).Infof("Adding/Updating inference model: %v", infModel.Spec.ModelName) - c.Datastore.InferenceModels.Store(infModel.Spec.ModelName, infModel) - return - } - klog.V(logutil.DEFAULT).Infof("Removing/Not adding inference model: %v", infModel.Spec.ModelName) - // If we get here. The model is not relevant to this pool, remove. 
- c.Datastore.InferenceModels.Delete(infModel.Spec.ModelName) -} diff --git a/pkg/ext-proc/backend/inferencemodel_reconciler_test.go b/pkg/ext-proc/backend/inferencemodel_reconciler_test.go deleted file mode 100644 index 117766b9..00000000 --- a/pkg/ext-proc/backend/inferencemodel_reconciler_test.go +++ /dev/null @@ -1,169 +0,0 @@ -package backend - -import ( - "sync" - "testing" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" - "k8s.io/apimachinery/pkg/types" -) - -var ( - service1 = &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - ModelName: "fake model1", - PoolRef: v1alpha1.PoolObjectReference{Name: "test-pool"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-service", - }, - } - service1Modified = &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - ModelName: "fake model1", - PoolRef: v1alpha1.PoolObjectReference{Name: "test-poolio"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-service", - }, - } - service2 = &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - ModelName: "fake model", - PoolRef: v1alpha1.PoolObjectReference{Name: "test-pool"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-service-2", - }, - } -) - -func TestUpdateDatastore_InferenceModelReconciler(t *testing.T) { - tests := []struct { - name string - datastore *K8sDatastore - incomingService *v1alpha1.InferenceModel - wantInferenceModels *sync.Map - }{ - { - name: "No Services registered; valid, new service incoming.", - datastore: &K8sDatastore{ - inferencePool: &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool", - ResourceVersion: "Old and boring", - }, - }, - InferenceModels: &sync.Map{}, - }, - incomingService: service1, - wantInferenceModels: populateServiceMap(service1), - }, - { - name: "Removing existing service.", - datastore: &K8sDatastore{ - inferencePool: &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool", - ResourceVersion: "Old and boring", - }, - }, - InferenceModels: populateServiceMap(service1), - }, - incomingService: service1Modified, - wantInferenceModels: populateServiceMap(), - }, - { - name: "Unrelated service, do nothing.", - datastore: &K8sDatastore{ - inferencePool: &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool", - ResourceVersion: "Old and boring", - }, - }, - InferenceModels: populateServiceMap(service1), - }, - incomingService: &v1alpha1.InferenceModel{ - Spec: v1alpha1.InferenceModelSpec{ - ModelName: "fake model", - PoolRef: v1alpha1.PoolObjectReference{Name: "test-poolio"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "unrelated-service", - }, - }, - wantInferenceModels: populateServiceMap(service1), - }, - { - name: "Add to existing", - datastore: &K8sDatastore{ - inferencePool: &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool", - ResourceVersion: "Old and boring", - }, - }, - InferenceModels: populateServiceMap(service1), - }, - incomingService: service2, - wantInferenceModels: 
populateServiceMap(service1, service2), - }, - } - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - InferenceModelReconciler := &InferenceModelReconciler{ - Datastore: test.datastore, - PoolNamespacedName: types.NamespacedName{Name: test.datastore.inferencePool.Name}, - } - InferenceModelReconciler.updateDatastore(test.incomingService) - - if ok := mapsEqual(InferenceModelReconciler.Datastore.InferenceModels, test.wantInferenceModels); !ok { - t.Error("Maps are not equal") - } - }) - } -} - -func populateServiceMap(services ...*v1alpha1.InferenceModel) *sync.Map { - returnVal := &sync.Map{} - - for _, service := range services { - returnVal.Store(service.Spec.ModelName, service) - } - return returnVal -} - -func mapsEqual(map1, map2 *sync.Map) bool { - equal := true - - map1.Range(func(k, v any) bool { - if _, ok := map2.Load(k); !ok { - equal = false - return false - } - return true - }) - map2.Range(func(k, v any) bool { - if _, ok := map1.Load(k); !ok { - equal = false - return false - } - return true - }) - - return equal -} diff --git a/pkg/ext-proc/backend/inferencepool_reconciler.go b/pkg/ext-proc/backend/inferencepool_reconciler.go deleted file mode 100644 index 0c2ae75f..00000000 --- a/pkg/ext-proc/backend/inferencepool_reconciler.go +++ /dev/null @@ -1,56 +0,0 @@ -package backend - -import ( - "context" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "k8s.io/apimachinery/pkg/runtime" - "k8s.io/apimachinery/pkg/types" - "k8s.io/client-go/tools/record" - klog "k8s.io/klog/v2" - ctrl "sigs.k8s.io/controller-runtime" - "sigs.k8s.io/controller-runtime/pkg/client" -) - -// InferencePoolReconciler utilizes the controller runtime to reconcile Instance Gateway resources -// This implementation is just used for reading & maintaining data sync. The Gateway implementation -// will have the proper controller that will create/manage objects on behalf of the server pool. -type InferencePoolReconciler struct { - client.Client - Scheme *runtime.Scheme - Record record.EventRecorder - PoolNamespacedName types.NamespacedName - Datastore *K8sDatastore -} - -func (c *InferencePoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { - if req.NamespacedName.Name != c.PoolNamespacedName.Name || req.NamespacedName.Namespace != c.PoolNamespacedName.Namespace { - return ctrl.Result{}, nil - } - klog.V(1).Info("reconciling InferencePool", req.NamespacedName) - - serverPool := &v1alpha1.InferencePool{} - if err := c.Get(ctx, req.NamespacedName, serverPool); err != nil { - klog.Error(err, "unable to get InferencePool") - return ctrl.Result{}, err - } - - c.updateDatastore(serverPool) - - return ctrl.Result{}, nil -} - -func (c *InferencePoolReconciler) updateDatastore(serverPool *v1alpha1.InferencePool) { - pool, _ := c.Datastore.getInferencePool() - if pool == nil || - serverPool.ObjectMeta.ResourceVersion != pool.ObjectMeta.ResourceVersion { - klog.Infof("Updating inference pool to %v/%v", serverPool.ObjectMeta.Namespace, serverPool.ObjectMeta.Name) - c.Datastore.setInferencePool(serverPool) - } -} - -func (c *InferencePoolReconciler) SetupWithManager(mgr ctrl.Manager) error { - return ctrl.NewControllerManagedBy(mgr). - For(&v1alpha1.InferencePool{}). 
- Complete(c) -} diff --git a/pkg/ext-proc/backend/inferencepool_reconciler_test.go b/pkg/ext-proc/backend/inferencepool_reconciler_test.go deleted file mode 100644 index f03c31cb..00000000 --- a/pkg/ext-proc/backend/inferencepool_reconciler_test.go +++ /dev/null @@ -1,85 +0,0 @@ -package backend - -import ( - "reflect" - "testing" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" -) - -var ( - pool1 = &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool", - ResourceVersion: "50", - }, - } - // Different name, same RV doesn't really make sense, but helps with testing the - // updateStore impl which relies on the equality of RVs alone. - modPool1SameRV = &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool-mod", - ResourceVersion: "50", - }, - } - modPool1DiffRV = &v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{ - Selector: map[v1alpha1.LabelKey]v1alpha1.LabelValue{"app": "vllm"}, - }, - ObjectMeta: metav1.ObjectMeta{ - Name: "test-pool-mod", - ResourceVersion: "51", - }, - } -) - -func TestUpdateDatastore_InferencePoolReconciler(t *testing.T) { - tests := []struct { - name string - datastore *K8sDatastore - incomingPool *v1alpha1.InferencePool - wantPool *v1alpha1.InferencePool - }{ - { - name: "InferencePool not set, should set InferencePool", - datastore: &K8sDatastore{}, - incomingPool: pool1.DeepCopy(), - wantPool: pool1, - }, - { - name: "InferencePool set, matching RVs, do nothing", - datastore: &K8sDatastore{ - inferencePool: pool1.DeepCopy(), - }, - incomingPool: modPool1SameRV.DeepCopy(), - wantPool: pool1, - }, - { - name: "InferencePool set, differing RVs, re-set InferencePool", - datastore: &K8sDatastore{ - inferencePool: pool1.DeepCopy(), - }, - incomingPool: modPool1DiffRV.DeepCopy(), - wantPool: modPool1DiffRV, - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - inferencePoolReconciler := &InferencePoolReconciler{Datastore: test.datastore} - inferencePoolReconciler.updateDatastore(test.incomingPool) - - gotPool := inferencePoolReconciler.Datastore.inferencePool - if !reflect.DeepEqual(gotPool, test.wantPool) { - t.Errorf("Unexpected InferencePool: want %#v, got: %#v", test.wantPool, gotPool) - } - }) - } -} diff --git a/pkg/ext-proc/backend/provider.go b/pkg/ext-proc/backend/provider.go deleted file mode 100644 index d6ccf85f..00000000 --- a/pkg/ext-proc/backend/provider.go +++ /dev/null @@ -1,219 +0,0 @@ -package backend - -import ( - "context" - "fmt" - "math/rand" - "strconv" - "sync" - "time" - - "go.uber.org/multierr" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - corev1 "k8s.io/api/core/v1" - klog "k8s.io/klog/v2" -) - -const ( - fetchMetricsTimeout = 5 * time.Second -) - -func NewProvider(pmc PodMetricsClient, datastore *K8sDatastore) *Provider { - p := &Provider{ - podMetrics: sync.Map{}, - pmc: pmc, - datastore: datastore, - } - return p -} - -// Provider provides backend pods and information such as metrics. 
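Before the `Provider` struct definition below, a sketch of how these deleted pieces were wired together: `NewProvider` pairs a metrics client with the datastore, and `Init` starts the periodic refresh loops. The helper and the intervals are illustrative, not the shipped defaults.

```go
package main

import (
	"time"

	"inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend"
	"inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend/vllm"
	klog "k8s.io/klog/v2"
)

// startProvider (hypothetical) re-lists pods every 10s and probes metrics
// every 50ms; both intervals are arbitrary choices for the sketch.
func startProvider(datastore *backend.K8sDatastore) *backend.Provider {
	provider := backend.NewProvider(&vllm.PodMetricsClientImpl{}, datastore)
	if err := provider.Init(10*time.Second, 50*time.Millisecond); err != nil {
		klog.Errorf("initializing provider: %v", err)
	}
	for _, pm := range provider.AllPodMetrics() {
		klog.Infof("pod %s: waiting=%d kv-cache=%.2f", pm.Name, pm.WaitingQueueSize, pm.KVCacheUsagePercent)
	}
	return provider
}
```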
-type Provider struct { - // key: PodName, value: *PodMetrics - // TODO: change to use NamespacedName once we support multi-tenant inferencePools - podMetrics sync.Map - pmc PodMetricsClient - datastore *K8sDatastore -} - -type PodMetricsClient interface { - FetchMetrics(ctx context.Context, pod Pod, existing *PodMetrics) (*PodMetrics, error) -} - -func (p *Provider) AllPodMetrics() []*PodMetrics { - res := []*PodMetrics{} - fn := func(k, v any) bool { - res = append(res, v.(*PodMetrics)) - return true - } - p.podMetrics.Range(fn) - return res -} - -func (p *Provider) UpdatePodMetrics(pod Pod, pm *PodMetrics) { - p.podMetrics.Store(pod.Name, pm) -} - -func (p *Provider) GetPodMetrics(pod Pod) (*PodMetrics, bool) { - val, ok := p.podMetrics.Load(pod.Name) - if ok { - return val.(*PodMetrics), true - } - return nil, false -} - -func (p *Provider) Init(refreshPodsInterval, refreshMetricsInterval time.Duration) error { - p.refreshPodsOnce() - - if err := p.refreshMetricsOnce(); err != nil { - klog.Errorf("Failed to init metrics: %v", err) - } - - klog.Infof("Initialized pods and metrics: %+v", p.AllPodMetrics()) - - // periodically refresh pods - go func() { - for { - time.Sleep(refreshPodsInterval) - p.refreshPodsOnce() - } - }() - - // periodically refresh metrics - go func() { - for { - time.Sleep(refreshMetricsInterval) - if err := p.refreshMetricsOnce(); err != nil { - klog.V(logutil.TRACE).Infof("Failed to refresh metrics: %v", err) - } - } - }() - - // Periodically print out the pods and metrics for DEBUGGING. - if klog.V(logutil.DEBUG).Enabled() { - go func() { - for { - time.Sleep(5 * time.Second) - klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetrics()) - } - }() - } - - return nil -} - -// refreshPodsOnce lists pods and updates keys in the podMetrics map. -// Note this function doesn't update the PodMetrics value, it's done separately. -func (p *Provider) refreshPodsOnce() { - pods, err := p.datastore.getPods() - if err != nil { - klog.V(logutil.DEFAULT).Infof("Couldn't list pods: %v", err) - p.podMetrics.Clear() - return - } - pool, _ := p.datastore.getInferencePool() - // revision is used to track which entries we need to remove in the next iteration that removes - // metrics for pods that don't exist anymore. Otherwise we have to build a map of the listed pods, - // which is not efficient. Revision can be any random id as long as it is different from the last - // refresh, so it should be very reliable (as reliable as the probability of randomly picking two - // different numbers from range 0 - maxInt). - revision := rand.Int() - ready := 0 - for _, pod := range pods { - if !podIsReady(pod) { - continue - } - // a ready pod - ready++ - if val, ok := p.podMetrics.Load(pod.Name); ok { - // pod already exists - pm := val.(*PodMetrics) - pm.revision = revision - continue - } - // new pod, add to the store for probing - new := &PodMetrics{ - Pod: Pod{ - Name: pod.Name, - Address: pod.Status.PodIP + ":" + strconv.Itoa(int(pool.Spec.TargetPortNumber)), - }, - Metrics: Metrics{ - ActiveModels: make(map[string]int), - }, - revision: revision, - } - p.podMetrics.Store(pod.Name, new) - } - - klog.V(logutil.DEFAULT).Infof("Pods in pool %s/%s with selector %v: total=%v ready=%v", - pool.Namespace, pool.Name, pool.Spec.Selector, len(pods), ready) - - // remove pods that don't exist any more. 
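The sweep that follows is the second half of the mark-and-sweep described at the top of `refreshPodsOnce`. Distilled on its own (an illustrative, in-package sketch of the same pattern, not the shipped code):

```go
// markAndSweep tags every listed pod with this refresh's revision, then
// deletes entries whose revision was not updated, i.e. pods that are gone.
func markAndSweep(podMetrics *sync.Map, listed []*corev1.Pod) {
	revision := rand.Int() // any value differing from the previous refresh
	for _, pod := range listed { // mark
		if v, ok := podMetrics.Load(pod.Name); ok {
			v.(*PodMetrics).revision = revision
		} else {
			podMetrics.Store(pod.Name, &PodMetrics{revision: revision})
		}
	}
	podMetrics.Range(func(k, v any) bool { // sweep
		if v.(*PodMetrics).revision != revision {
			podMetrics.Delete(k)
		}
		return true
	})
}
```

The deleted code's `mergeFn` below is exactly this sweep phase.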
- mergeFn := func(k, v any) bool { - pm := v.(*PodMetrics) - if pm.revision != revision { - p.podMetrics.Delete(pm.Pod.Name) - } - return true - } - p.podMetrics.Range(mergeFn) -} - -func podIsReady(pod *corev1.Pod) bool { - if pod.DeletionTimestamp != nil { - return false - } - for _, condition := range pod.Status.Conditions { - if condition.Type == corev1.PodReady { - return condition.Status == corev1.ConditionTrue - } - } - return false -} - -func (p *Provider) refreshMetricsOnce() error { - ctx, cancel := context.WithTimeout(context.Background(), fetchMetricsTimeout) - defer cancel() - start := time.Now() - defer func() { - d := time.Since(start) - // TODO: add a metric instead of logging - klog.V(logutil.TRACE).Infof("Refreshed metrics in %v", d) - }() - var wg sync.WaitGroup - errCh := make(chan error) - processOnePod := func(key, value any) bool { - klog.V(logutil.TRACE).Infof("Processing pod %v and metric %v", key, value) - existing := value.(*PodMetrics) - pod := existing.Pod - wg.Add(1) - go func() { - defer wg.Done() - updated, err := p.pmc.FetchMetrics(ctx, pod, existing) - if err != nil { - errCh <- fmt.Errorf("failed to parse metrics from %s: %v", pod, err) - return - } - p.UpdatePodMetrics(pod, updated) - klog.V(logutil.TRACE).Infof("Updated metrics for pod %s: %v", pod, updated.Metrics) - }() - return true - } - p.podMetrics.Range(processOnePod) - - // Wait for metric collection for all pods to complete and close the error channel in a - // goroutine so this is unblocking, allowing the code to proceed to the error collection code - // below. - // Note we couldn't use a buffered error channel with a size because the size of the podMetrics - // sync.Map is unknown beforehand. - go func() { - wg.Wait() - close(errCh) - }() - - var errs error - for err := range errCh { - errs = multierr.Append(errs, err) - } - return errs -} diff --git a/pkg/ext-proc/backend/provider_test.go b/pkg/ext-proc/backend/provider_test.go deleted file mode 100644 index 9159ba48..00000000 --- a/pkg/ext-proc/backend/provider_test.go +++ /dev/null @@ -1,181 +0,0 @@ -package backend - -import ( - "errors" - "testing" - - "github.com/google/go-cmp/cmp" - "github.com/google/go-cmp/cmp/cmpopts" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - testingutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/testing" - corev1 "k8s.io/api/core/v1" -) - -var ( - pod1 = &PodMetrics{ - Pod: Pod{Name: "pod1", Address: "address1:9009"}, - Metrics: Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - } - pod2 = &PodMetrics{ - Pod: Pod{Name: "pod2", Address: "address2:9009"}, - Metrics: Metrics{ - WaitingQueueSize: 1, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo1": 1, - "bar1": 1, - }, - }, - } -) - -func TestProvider(t *testing.T) { - allPodsLister := &testingutil.FakePodLister{ - PodsList: []*corev1.Pod{ - testingutil.MakePod(pod1.Pod.Name).SetReady().SetPodIP("address1").Obj(), - testingutil.MakePod(pod2.Pod.Name).SetReady().SetPodIP("address2").Obj(), - }, - } - allPodsMetricsClient := &FakePodMetricsClient{ - Res: map[string]*PodMetrics{ - pod1.Pod.Name: pod1, - pod2.Pod.Name: pod2, - }, - } - - tests := []struct { - name string - initPodMetrics []*PodMetrics - lister *testingutil.FakePodLister - pmc PodMetricsClient - step func(*Provider) - want []*PodMetrics - }{ - { - name: "Init without refreshing pods", - 
initPodMetrics: []*PodMetrics{pod1, pod2}, - lister: allPodsLister, - pmc: allPodsMetricsClient, - step: func(p *Provider) { - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{pod1, pod2}, - }, - { - name: "Fetching all success", - lister: allPodsLister, - pmc: allPodsMetricsClient, - step: func(p *Provider) { - p.refreshPodsOnce() - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{pod1, pod2}, - }, - { - name: "Fetch metrics error", - lister: allPodsLister, - pmc: &FakePodMetricsClient{ - Err: map[string]error{ - pod2.Pod.Name: errors.New("injected error"), - }, - Res: map[string]*PodMetrics{ - pod1.Pod.Name: pod1, - }, - }, - step: func(p *Provider) { - p.refreshPodsOnce() - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{ - pod1, - // Failed to fetch pod2 metrics so it remains the default values. - { - Pod: pod2.Pod, - Metrics: Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0, - MaxActiveModels: 0, - ActiveModels: map[string]int{}, - }, - }, - }, - }, - { - name: "A new pod added", - initPodMetrics: []*PodMetrics{pod2}, - lister: allPodsLister, - pmc: allPodsMetricsClient, - step: func(p *Provider) { - p.refreshPodsOnce() - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{pod1, pod2}, - }, - { - name: "A pod removed", - initPodMetrics: []*PodMetrics{pod1, pod2}, - lister: &testingutil.FakePodLister{ - PodsList: []*corev1.Pod{ - testingutil.MakePod(pod2.Pod.Name).SetReady().SetPodIP("address2").Obj(), - }, - }, - pmc: allPodsMetricsClient, - step: func(p *Provider) { - p.refreshPodsOnce() - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{pod2}, - }, - { - name: "A pod removed, another added", - initPodMetrics: []*PodMetrics{pod1}, - lister: &testingutil.FakePodLister{ - PodsList: []*corev1.Pod{ - testingutil.MakePod(pod1.Pod.Name).SetReady().SetPodIP("address1").Obj(), - }, - }, - pmc: allPodsMetricsClient, - step: func(p *Provider) { - p.refreshPodsOnce() - _ = p.refreshMetricsOnce() - }, - want: []*PodMetrics{pod1}, - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - datastore := NewK8sDataStore(WithPodListerFactory( - func(pool *v1alpha1.InferencePool) *PodLister { - return &PodLister{ - Lister: test.lister, - } - })) - datastore.setInferencePool(&v1alpha1.InferencePool{ - Spec: v1alpha1.InferencePoolSpec{TargetPortNumber: 9009}, - }) - p := NewProvider(test.pmc, datastore) - for _, m := range test.initPodMetrics { - p.UpdatePodMetrics(m.Pod, m) - } - test.step(p) - metrics := p.AllPodMetrics() - lessFunc := func(a, b *PodMetrics) bool { - return a.String() < b.String() - } - if diff := cmp.Diff(test.want, metrics, cmpopts.SortSlices(lessFunc), - cmpopts.IgnoreFields(PodMetrics{}, "revision")); diff != "" { - t.Errorf("Unexpected output (-want +got): %v", diff) - } - }) - } -} diff --git a/pkg/ext-proc/backend/types.go b/pkg/ext-proc/backend/types.go deleted file mode 100644 index d375e4ec..00000000 --- a/pkg/ext-proc/backend/types.go +++ /dev/null @@ -1,54 +0,0 @@ -// Package backend is a library to interact with backend model servers such as probing metrics. -package backend - -import "fmt" - -type PodSet map[Pod]bool - -type Pod struct { - Name string - Address string -} - -func (p Pod) String() string { - return p.Name + ":" + p.Address -} - -type Metrics struct { - // ActiveModels is a set of models(including LoRA adapters) that are currently cached to GPU. - ActiveModels map[string]int - // MaxActiveModels is the maximum number of models that can be loaded to GPU. 
- MaxActiveModels int - RunningQueueSize int - WaitingQueueSize int - KVCacheUsagePercent float64 - KvCacheMaxTokenCapacity int -} - -type PodMetrics struct { - Pod - Metrics - revision int -} - -func (pm *PodMetrics) String() string { - return fmt.Sprintf("Pod: %+v; Metrics: %+v", pm.Pod, pm.Metrics) -} - -func (pm *PodMetrics) Clone() *PodMetrics { - cm := make(map[string]int, len(pm.ActiveModels)) - for k, v := range pm.ActiveModels { - cm[k] = v - } - clone := &PodMetrics{ - Pod: pm.Pod, - Metrics: Metrics{ - ActiveModels: cm, - RunningQueueSize: pm.RunningQueueSize, - WaitingQueueSize: pm.WaitingQueueSize, - KVCacheUsagePercent: pm.KVCacheUsagePercent, - KvCacheMaxTokenCapacity: pm.KvCacheMaxTokenCapacity, - }, - } - return clone -} diff --git a/pkg/ext-proc/backend/vllm/metrics.go b/pkg/ext-proc/backend/vllm/metrics.go deleted file mode 100644 index 8800868a..00000000 --- a/pkg/ext-proc/backend/vllm/metrics.go +++ /dev/null @@ -1,176 +0,0 @@ -// Package vllm provides vllm specific pod metrics implementation. -package vllm - -import ( - "context" - "fmt" - "net/http" - "strconv" - "strings" - "time" - - dto "github.com/prometheus/client_model/go" - "github.com/prometheus/common/expfmt" - "go.uber.org/multierr" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -const ( - LoraRequestInfoMetricName = "vllm:lora_requests_info" - LoraRequestInfoRunningAdaptersMetricName = "running_lora_adapters" - LoraRequestInfoMaxAdaptersMetricName = "max_lora" - // TODO: Replace these with the num_tokens_running/waiting below once we add those to the fork. - RunningQueueSizeMetricName = "vllm:num_requests_running" - WaitingQueueSizeMetricName = "vllm:num_requests_waiting" - /* TODO: Uncomment this once the following are added to the fork. - RunningQueueSizeMetricName = "vllm:num_tokens_running" - WaitingQueueSizeMetricName = "vllm:num_tokens_waiting" - */ - KVCacheUsagePercentMetricName = "vllm:gpu_cache_usage_perc" - KvCacheMaxTokenCapacityMetricName = "vllm:gpu_cache_max_token_capacity" -) - -type PodMetricsClientImpl struct { -} - -// FetchMetrics fetches metrics from a given pod. -func (p *PodMetricsClientImpl) FetchMetrics( - ctx context.Context, - pod backend.Pod, - existing *backend.PodMetrics, -) (*backend.PodMetrics, error) { - // Currently the metrics endpoint is hard-coded, which works with vLLM. - // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config. 
- url := fmt.Sprintf("http://%s/metrics", pod.Address) - req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil) - if err != nil { - return nil, fmt.Errorf("failed to create request: %v", err) - } - resp, err := http.DefaultClient.Do(req) - if err != nil { - klog.Errorf("failed to fetch metrics from %s: %v", pod, err) - return nil, fmt.Errorf("failed to fetch metrics from %s: %w", pod, err) - } - defer func() { - _ = resp.Body.Close() - }() - - if resp.StatusCode != http.StatusOK { - klog.Errorf("unexpected status code from %s: %v", pod, resp.StatusCode) - return nil, fmt.Errorf("unexpected status code from %s: %v", pod, resp.StatusCode) - } - - parser := expfmt.TextParser{} - metricFamilies, err := parser.TextToMetricFamilies(resp.Body) - if err != nil { - return nil, err - } - return promToPodMetrics(metricFamilies, existing) -} - -// promToPodMetrics updates internal pod metrics with scraped prometheus metrics. -// A combined error is returned if errors occur in one or more metric processing. -// it returns a new PodMetrics pointer which can be used to atomically update the pod metrics map. -func promToPodMetrics( - metricFamilies map[string]*dto.MetricFamily, - existing *backend.PodMetrics, -) (*backend.PodMetrics, error) { - var errs error - updated := existing.Clone() - runningQueueSize, err := getLatestMetric(metricFamilies, RunningQueueSizeMetricName) - errs = multierr.Append(errs, err) - if err == nil { - updated.RunningQueueSize = int(runningQueueSize.GetGauge().GetValue()) - } - waitingQueueSize, err := getLatestMetric(metricFamilies, WaitingQueueSizeMetricName) - errs = multierr.Append(errs, err) - if err == nil { - updated.WaitingQueueSize = int(waitingQueueSize.GetGauge().GetValue()) - } - cachePercent, err := getLatestMetric(metricFamilies, KVCacheUsagePercentMetricName) - errs = multierr.Append(errs, err) - if err == nil { - updated.KVCacheUsagePercent = cachePercent.GetGauge().GetValue() - } - - loraMetrics, _, err := getLatestLoraMetric(metricFamilies) - errs = multierr.Append(errs, err) - /* TODO: uncomment once this is available in vllm. - kvCap, _, err := getGaugeLatestValue(metricFamilies, KvCacheMaxTokenCapacityMetricName) - errs = multierr.Append(errs, err) - if err != nil { - updated.KvCacheMaxTokenCapacity = int(kvCap) - } - */ - - if loraMetrics != nil { - updated.ActiveModels = make(map[string]int) - for _, label := range loraMetrics.GetLabel() { - if label.GetName() == LoraRequestInfoRunningAdaptersMetricName { - if label.GetValue() != "" { - adapterList := strings.Split(label.GetValue(), ",") - for _, adapter := range adapterList { - updated.ActiveModels[adapter] = 0 - } - } - } - if label.GetName() == LoraRequestInfoMaxAdaptersMetricName { - if label.GetValue() != "" { - updated.MaxActiveModels, err = strconv.Atoi(label.GetValue()) - if err != nil { - errs = multierr.Append(errs, err) - } - } - } - } - - } - - return updated, errs -} - -// getLatestLoraMetric gets latest lora metric series in gauge metric family `vllm:lora_requests_info` -// reason its specially fetched is because each label key value pair permutation generates new series -// and only most recent is useful. The value of each series is the creation timestamp so we can -// retrieve the latest by sorting the value. 
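To make the comment above concrete, here is an illustrative, hand-written scrape in Prometheus text format (not captured output): each adapter-set permutation is its own series, and the gauge value is a creation timestamp, so `getLatestLoraMetric`, which follows, picks the series with the largest value.

```
# Hypothetical scrape: two series for the same family; the second was
# created later (larger value), so it reflects the current adapter set.
vllm:lora_requests_info{max_lora="2",running_lora_adapters="lora1"} 1.739e+09
vllm:lora_requests_info{max_lora="2",running_lora_adapters="lora1,lora2"} 1.740e+09
```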
-func getLatestLoraMetric(metricFamilies map[string]*dto.MetricFamily) (*dto.Metric, time.Time, error) { - loraRequests, ok := metricFamilies[LoraRequestInfoMetricName] - if !ok { - klog.Warningf("metric family %q not found", LoraRequestInfoMetricName) - return nil, time.Time{}, fmt.Errorf("metric family %q not found", LoraRequestInfoMetricName) - } - var latestTs float64 - var latest *dto.Metric - for _, m := range loraRequests.GetMetric() { - if m.GetGauge().GetValue() > latestTs { - latestTs = m.GetGauge().GetValue() - latest = m - } - } - return latest, time.Unix(0, int64(latestTs*1000)), nil -} - -// getLatestMetric gets the latest metric of a family. This should be used to get the latest Gauge metric. -// Since vllm doesn't set the timestamp in metric, this metric essentially gets the first metric. -func getLatestMetric(metricFamilies map[string]*dto.MetricFamily, metricName string) (*dto.Metric, error) { - mf, ok := metricFamilies[metricName] - if !ok { - klog.Warningf("metric family %q not found", metricName) - return nil, fmt.Errorf("metric family %q not found", metricName) - } - if len(mf.GetMetric()) == 0 { - return nil, fmt.Errorf("no metrics available for %q", metricName) - } - var latestTs int64 - var latest *dto.Metric - for _, m := range mf.GetMetric() { - if m.GetTimestampMs() >= latestTs { - latestTs = m.GetTimestampMs() - latest = m - } - } - klog.V(logutil.TRACE).Infof("Got metric value %+v for metric %v", latest, metricName) - return latest, nil -} diff --git a/pkg/ext-proc/backend/vllm/metrics_test.go b/pkg/ext-proc/backend/vllm/metrics_test.go deleted file mode 100644 index e3c1449d..00000000 --- a/pkg/ext-proc/backend/vllm/metrics_test.go +++ /dev/null @@ -1,231 +0,0 @@ -package vllm - -import ( - "errors" - "testing" - - dto "github.com/prometheus/client_model/go" - "github.com/stretchr/testify/assert" - "google.golang.org/protobuf/proto" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" -) - -func TestPromToPodMetrics(t *testing.T) { - testCases := []struct { - name string - metricFamilies map[string]*dto.MetricFamily - expectedMetrics *backend.Metrics - expectedErr error - initialPodMetrics *backend.PodMetrics - }{ - { - name: "all metrics available", - metricFamilies: map[string]*dto.MetricFamily{ - RunningQueueSizeMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(10), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(15), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - WaitingQueueSizeMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(20), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(25), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - KVCacheUsagePercentMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(0.8), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(0.9), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - LoraRequestInfoMetricName: { - Metric: []*dto.Metric{ - { - Label: []*dto.LabelPair{ - { - Name: proto.String(LoraRequestInfoRunningAdaptersMetricName), - Value: proto.String("lora3,lora4"), - }, - { - Name: proto.String(LoraRequestInfoMaxAdaptersMetricName), - Value: proto.String("2"), - }, - }, - Gauge: &dto.Gauge{ - Value: proto.Float64(100), - }, - }, - { - Label: []*dto.LabelPair{ - { - Name: 
proto.String(LoraRequestInfoRunningAdaptersMetricName), - Value: proto.String("lora2"), - }, - { - Name: proto.String(LoraRequestInfoMaxAdaptersMetricName), - Value: proto.String("2"), - }, - }, - Gauge: &dto.Gauge{ - Value: proto.Float64(90), - }, - }, - }, - }, - }, - expectedMetrics: &backend.Metrics{ - RunningQueueSize: 15, - WaitingQueueSize: 25, - KVCacheUsagePercent: 0.9, - ActiveModels: map[string]int{ - "lora3": 0, - "lora4": 0, - }, - MaxActiveModels: 2, - }, - initialPodMetrics: &backend.PodMetrics{}, - expectedErr: nil, - }, - { - name: "invalid max lora", - metricFamilies: map[string]*dto.MetricFamily{ - RunningQueueSizeMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(10), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(15), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - WaitingQueueSizeMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(20), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(25), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - KVCacheUsagePercentMetricName: { - Metric: []*dto.Metric{ - { - Gauge: &dto.Gauge{ - Value: proto.Float64(0.8), - }, - TimestampMs: proto.Int64(100), - }, - { - Gauge: &dto.Gauge{ - Value: proto.Float64(0.9), - }, - TimestampMs: proto.Int64(200), // This is the latest - }, - }, - }, - LoraRequestInfoMetricName: { - Metric: []*dto.Metric{ - { - Label: []*dto.LabelPair{ - { - Name: proto.String(LoraRequestInfoRunningAdaptersMetricName), - Value: proto.String("lora3,lora4"), - }, - { - Name: proto.String(LoraRequestInfoMaxAdaptersMetricName), - Value: proto.String("2a"), - }, - }, - Gauge: &dto.Gauge{ - Value: proto.Float64(100), - }, - }, - { - Label: []*dto.LabelPair{ - { - Name: proto.String(LoraRequestInfoRunningAdaptersMetricName), - Value: proto.String("lora2"), - }, - { - Name: proto.String(LoraRequestInfoMaxAdaptersMetricName), - Value: proto.String("2"), - }, - }, - Gauge: &dto.Gauge{ - Value: proto.Float64(90), - }, - }, - }, - }, - }, - expectedMetrics: &backend.Metrics{ - RunningQueueSize: 15, - WaitingQueueSize: 25, - KVCacheUsagePercent: 0.9, - ActiveModels: map[string]int{ - "lora3": 0, - "lora4": 0, - }, - MaxActiveModels: 0, - }, - initialPodMetrics: &backend.PodMetrics{}, - expectedErr: errors.New("strconv.Atoi: parsing '2a': invalid syntax"), - }, - } - for _, tc := range testCases { - t.Run(tc.name, func(t *testing.T) { - updated, err := promToPodMetrics(tc.metricFamilies, tc.initialPodMetrics) - if tc.expectedErr != nil { - assert.Error(t, err) - } else { - assert.NoError(t, err) - assert.Equal(t, tc.expectedMetrics, &updated.Metrics) - } - }) - } -} diff --git a/pkg/ext-proc/handlers/request.go b/pkg/ext-proc/handlers/request.go deleted file mode 100644 index d98f4602..00000000 --- a/pkg/ext-proc/handlers/request.go +++ /dev/null @@ -1,158 +0,0 @@ -package handlers - -import ( - "encoding/json" - "errors" - "fmt" - "strconv" - - configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - "google.golang.org/protobuf/types/known/structpb" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/scheduling" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog 
"k8s.io/klog/v2" -) - -// HandleRequestBody handles body of the request to the backend server, such as parsing the "model" -// parameter. -// Envoy sends the request body to ext proc before sending the request to the backend server. -func (s *Server) HandleRequestBody(reqCtx *RequestContext, req *extProcPb.ProcessingRequest) (*extProcPb.ProcessingResponse, error) { - klog.V(logutil.VERBOSE).Infof("Handling request body") - - // Unmarshal request body (must be JSON). - v := req.Request.(*extProcPb.ProcessingRequest_RequestBody) - var rb map[string]interface{} - if err := json.Unmarshal(v.RequestBody.Body, &rb); err != nil { - klog.Errorf("Error unmarshaling request body: %v", err) - return nil, fmt.Errorf("error unmarshaling request body: %v", err) - } - klog.V(logutil.VERBOSE).Infof("Request body: %v", rb) - - // Resolve target models. - model, ok := rb["model"].(string) - if !ok { - return nil, errors.New("model not found in request") - } - klog.V(logutil.VERBOSE).Infof("Model requested: %v", model) - modelName := model - - // NOTE: The nil checking for the modelObject means that we DO allow passthrough currently. - // This might be a security risk in the future where adapters not registered in the InferenceModel - // are able to be requested by using their distinct name. - modelObj := s.datastore.FetchModelData(model) - if modelObj == nil { - return nil, fmt.Errorf("error finding a model object in InferenceModel for input %v", model) - } - if len(modelObj.Spec.TargetModels) > 0 { - modelName = backend.RandomWeightedDraw(modelObj, 0) - if modelName == "" { - return nil, fmt.Errorf("error getting target model name for model %v", modelObj.Name) - } - } - llmReq := &scheduling.LLMRequest{ - Model: model, - ResolvedTargetModel: modelName, - Critical: backend.IsCritical(modelObj), - } - klog.V(logutil.VERBOSE).Infof("LLM Request: %+v", llmReq) - - requestBody := v.RequestBody.Body - var err error - // Update target models in the body. - if llmReq.Model != llmReq.ResolvedTargetModel { - rb["model"] = llmReq.ResolvedTargetModel - requestBody, err = json.Marshal(rb) - if err != nil { - klog.Errorf("Error marshaling request body: %v", err) - return nil, fmt.Errorf("error marshaling request body: %v", err) - } - klog.V(logutil.VERBOSE).Infof("Updated body: %v", string(requestBody)) - } - - targetPod, err := s.scheduler.Schedule(llmReq) - if err != nil { - return nil, fmt.Errorf("failed to find target pod: %w", err) - } - klog.V(logutil.VERBOSE).Infof("Selected target model %v in target pod: %v\n", llmReq.ResolvedTargetModel, targetPod) - - reqCtx.Model = llmReq.Model - reqCtx.ResolvedTargetModel = llmReq.ResolvedTargetModel - reqCtx.RequestSize = len(v.RequestBody.Body) - reqCtx.TargetPod = targetPod - - // Insert target endpoint to instruct Envoy to route requests to the specified target pod. 
- headers := []*configPb.HeaderValueOption{ - { - Header: &configPb.HeaderValue{ - Key: s.targetEndpointKey, - RawValue: []byte(targetPod.Address), - }, - }, - // We need to update the content length header if the body is mutated, see Envoy doc: - // https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ext_proc/v3/processing_mode.proto - { - Header: &configPb.HeaderValue{ - Key: "Content-Length", - RawValue: []byte(strconv.Itoa(len(requestBody))), - }, - }, - } - // Print headers for debugging - for _, header := range headers { - klog.V(logutil.VERBOSE).Infof("[request_body] Header Key: %s, Header Value: %s\n", header.Header.Key, header.Header.RawValue) - } - - resp := &extProcPb.ProcessingResponse{ - // The Endpoint Picker supports two approaches to communicating the target endpoint, as a request header - // and as an unstructure ext-proc response metadata key/value pair. This enables different integration - // options for gateway providers. - Response: &extProcPb.ProcessingResponse_RequestBody{ - RequestBody: &extProcPb.BodyResponse{ - Response: &extProcPb.CommonResponse{ - HeaderMutation: &extProcPb.HeaderMutation{ - SetHeaders: headers, - }, - BodyMutation: &extProcPb.BodyMutation{ - Mutation: &extProcPb.BodyMutation_Body{ - Body: requestBody, - }, - }, - }, - }, - }, - DynamicMetadata: &structpb.Struct{ - Fields: map[string]*structpb.Value{ - s.targetEndpointKey: { - Kind: &structpb.Value_StringValue{ - StringValue: targetPod.Address, - }, - }, - }, - }, - } - return resp, nil -} - -func HandleRequestHeaders(reqCtx *RequestContext, req *extProcPb.ProcessingRequest) *extProcPb.ProcessingResponse { - klog.V(logutil.VERBOSE).Info("Handling request headers ...") - r := req.Request - h := r.(*extProcPb.ProcessingRequest_RequestHeaders) - klog.V(logutil.VERBOSE).Infof("Headers: %+v\n", h) - - resp := &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_RequestHeaders{ - RequestHeaders: &extProcPb.HeadersResponse{ - Response: &extProcPb.CommonResponse{ - // Set `clear_route_cache = true` to force Envoy to recompute the target cluster - // based on the new "target-pod" header. - // See https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor.proto#service-ext-proc-v3-commonresponse. - ClearRouteCache: true, - }, - }, - }, - } - - return resp -} diff --git a/pkg/ext-proc/handlers/response.go b/pkg/ext-proc/handlers/response.go deleted file mode 100644 index 3b8a9946..00000000 --- a/pkg/ext-proc/handlers/response.go +++ /dev/null @@ -1,104 +0,0 @@ -package handlers - -import ( - "encoding/json" - "fmt" - - configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -// HandleResponseHeaders processes response headers from the backend model server. 
-func (s *Server) HandleResponseHeaders(reqCtx *RequestContext, req *extProcPb.ProcessingRequest) (*extProcPb.ProcessingResponse, error) { - klog.V(logutil.VERBOSE).Info("Processing ResponseHeaders") - h := req.Request.(*extProcPb.ProcessingRequest_ResponseHeaders) - klog.V(logutil.VERBOSE).Infof("Headers before: %+v\n", h) - - resp := &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_ResponseHeaders{ - ResponseHeaders: &extProcPb.HeadersResponse{ - Response: &extProcPb.CommonResponse{ - HeaderMutation: &extProcPb.HeaderMutation{ - SetHeaders: []*configPb.HeaderValueOption{ - { - Header: &configPb.HeaderValue{ - // This is for debugging purpose only. - Key: "x-went-into-resp-headers", - RawValue: []byte("true"), - }, - }, - }, - }, - }, - }, - }, - } - return resp, nil -} - -// HandleResponseBody parses response body to update information such as number of completion tokens. -// NOTE: The current implementation only supports Buffered mode, which is not enabled by default. To -// use it, you need to configure EnvoyExtensionPolicy to have response body in Buffered mode. -// https://www.envoyproxy.io/docs/envoy/latest/api-v3/extensions/filters/http/ext_proc/v3/processing_mode.proto#envoy-v3-api-msg-extensions-filters-http-ext-proc-v3-processingmode -// Example response -/* -{ - "id": "cmpl-573498d260f2423f9e42817bbba3743a", - "object": "text_completion", - "created": 1732563765, - "model": "meta-llama/Llama-2-7b-hf", - "choices": [ - { - "index": 0, - "text": " Chronicle\nThe San Francisco Chronicle has a new book review section, and it's a good one. The reviews are short, but they're well-written and well-informed. The Chronicle's book review section is a good place to start if you're looking for a good book review.\nThe Chronicle's book review section is a good place to start if you're looking for a good book review. The Chronicle's book review section", - "logprobs": null, - "finish_reason": "length", - "stop_reason": null, - "prompt_logprobs": null - } - ], - "usage": { - "prompt_tokens": 11, - "total_tokens": 111, - "completion_tokens": 100 - } -}*/ -func (s *Server) HandleResponseBody(reqCtx *RequestContext, req *extProcPb.ProcessingRequest) (*extProcPb.ProcessingResponse, error) { - klog.V(logutil.VERBOSE).Info("Processing HandleResponseBody") - body := req.Request.(*extProcPb.ProcessingRequest_ResponseBody) - - res := Response{} - if err := json.Unmarshal(body.ResponseBody.Body, &res); err != nil { - return nil, fmt.Errorf("unmarshaling response body: %v", err) - } - reqCtx.Response = res - reqCtx.ResponseSize = len(body.ResponseBody.Body) - // ResponseComplete is to indicate the response is complete. In non-streaming - // case, it will be set to be true once the response is processed; in - // streaming case, it will be set to be true once the last chunk is processed. - // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/178) - // will add the processing for streaming case. 
- reqCtx.ResponseComplete = true - klog.V(logutil.VERBOSE).Infof("Response: %+v", res) - - resp := &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_ResponseBody{ - ResponseBody: &extProcPb.BodyResponse{ - Response: &extProcPb.CommonResponse{}, - }, - }, - } - return resp, nil -} - -type Response struct { - Usage Usage `json:"usage"` -} - -type Usage struct { - PromptTokens int `json:"prompt_tokens"` - CompletionTokens int `json:"completion_tokens"` - TotalTokens int `json:"total_tokens"` -} diff --git a/pkg/ext-proc/handlers/response_test.go b/pkg/ext-proc/handlers/response_test.go deleted file mode 100644 index df338066..00000000 --- a/pkg/ext-proc/handlers/response_test.go +++ /dev/null @@ -1,87 +0,0 @@ -package handlers - -import ( - "testing" - - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - "github.com/google/go-cmp/cmp" -) - -const ( - body = ` - { - "id": "cmpl-573498d260f2423f9e42817bbba3743a", - "object": "text_completion", - "created": 1732563765, - "model": "meta-llama/Llama-2-7b-hf", - "choices": [ - { - "index": 0, - "text": " Chronicle\nThe San Francisco Chronicle has a new book review section, and it's a good one. The reviews are short, but they're well-written and well-informed. The Chronicle's book review section is a good place to start if you're looking for a good book review.\nThe Chronicle's book review section is a good place to start if you're looking for a good book review. The Chronicle's book review section", - "logprobs": null, - "finish_reason": "length", - "stop_reason": null, - "prompt_logprobs": null - } - ], - "usage": { - "prompt_tokens": 11, - "total_tokens": 111, - "completion_tokens": 100 - } - } - ` -) - -func TestHandleResponseBody(t *testing.T) { - tests := []struct { - name string - req *extProcPb.ProcessingRequest_ResponseBody - want Response - wantErr bool - }{ - { - name: "success", - req: &extProcPb.ProcessingRequest_ResponseBody{ - ResponseBody: &extProcPb.HttpBody{ - Body: []byte(body), - }, - }, - want: Response{ - Usage: Usage{ - PromptTokens: 11, - TotalTokens: 111, - CompletionTokens: 100, - }, - }, - }, - { - name: "malformed response", - req: &extProcPb.ProcessingRequest_ResponseBody{ - ResponseBody: &extProcPb.HttpBody{ - Body: []byte("malformed json"), - }, - }, - wantErr: true, - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - server := &Server{} - reqCtx := &RequestContext{} - _, err := server.HandleResponseBody(reqCtx, &extProcPb.ProcessingRequest{Request: test.req}) - - if err != nil { - if !test.wantErr { - t.Fatalf("HandleResponseBody returned unexpected error: %v, want %v", err, test.wantErr) - } - return - } - - if diff := cmp.Diff(test.want, reqCtx.Response); diff != "" { - t.Errorf("HandleResponseBody returned unexpected response, diff(-want, +got): %v", diff) - } - }) - } -} diff --git a/pkg/ext-proc/handlers/server.go b/pkg/ext-proc/handlers/server.go deleted file mode 100644 index 172249b6..00000000 --- a/pkg/ext-proc/handlers/server.go +++ /dev/null @@ -1,147 +0,0 @@ -package handlers - -import ( - "io" - "time" - - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - envoyTypePb "github.com/envoyproxy/go-control-plane/envoy/type/v3" - "google.golang.org/grpc/codes" - "google.golang.org/grpc/status" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - 
"inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/metrics" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/scheduling" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -func NewServer(pp PodProvider, scheduler Scheduler, targetEndpointKey string, datastore ModelDataStore) *Server { - return &Server{ - scheduler: scheduler, - podProvider: pp, - targetEndpointKey: targetEndpointKey, - datastore: datastore, - } -} - -// Server implements the Envoy external processing server. -// https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor.proto -type Server struct { - scheduler Scheduler - podProvider PodProvider - // The key of the header to specify the target pod address. This value needs to match Envoy - // configuration. - targetEndpointKey string - datastore ModelDataStore -} - -type Scheduler interface { - Schedule(b *scheduling.LLMRequest) (targetPod backend.Pod, err error) -} - -// PodProvider is an interface to provide set of pods in the backend and information such as metrics. -type PodProvider interface { - GetPodMetrics(pod backend.Pod) (*backend.PodMetrics, bool) - UpdatePodMetrics(pod backend.Pod, pm *backend.PodMetrics) -} - -type ModelDataStore interface { - FetchModelData(modelName string) (returnModel *v1alpha1.InferenceModel) -} - -func (s *Server) Process(srv extProcPb.ExternalProcessor_ProcessServer) error { - klog.V(logutil.VERBOSE).Info("Processing") - ctx := srv.Context() - // Create request context to share states during life time of an HTTP request. - // See https://github.com/envoyproxy/envoy/issues/17540. - reqCtx := &RequestContext{} - - for { - select { - case <-ctx.Done(): - return ctx.Err() - default: - } - - req, err := srv.Recv() - if err == io.EOF { - return nil - } - if err != nil { - // This error occurs very frequently, though it doesn't seem to have any impact. - // TODO Figure out if we can remove this noise. 
- klog.V(logutil.VERBOSE).Infof("cannot receive stream request: %v", err) - return status.Errorf(codes.Unknown, "cannot receive stream request: %v", err) - } - - var resp *extProcPb.ProcessingResponse - switch v := req.Request.(type) { - case *extProcPb.ProcessingRequest_RequestHeaders: - reqCtx.RequestReceivedTimestamp = time.Now() - resp = HandleRequestHeaders(reqCtx, req) - klog.V(logutil.VERBOSE).Infof("Request context after HandleRequestHeaders: %+v", reqCtx) - case *extProcPb.ProcessingRequest_RequestBody: - resp, err = s.HandleRequestBody(reqCtx, req) - if err == nil { - metrics.RecordRequestCounter(reqCtx.Model, reqCtx.ResolvedTargetModel) - metrics.RecordRequestSizes(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.RequestSize) - } - klog.V(logutil.VERBOSE).Infof("Request context after HandleRequestBody: %+v", reqCtx) - case *extProcPb.ProcessingRequest_ResponseHeaders: - resp, err = s.HandleResponseHeaders(reqCtx, req) - klog.V(logutil.VERBOSE).Infof("Request context after HandleResponseHeaders: %+v", reqCtx) - case *extProcPb.ProcessingRequest_ResponseBody: - resp, err = s.HandleResponseBody(reqCtx, req) - if err == nil && reqCtx.ResponseComplete { - reqCtx.ResponseCompleteTimestamp = time.Now() - metrics.RecordRequestLatencies(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.RequestReceivedTimestamp, reqCtx.ResponseCompleteTimestamp) - metrics.RecordResponseSizes(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.ResponseSize) - metrics.RecordInputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.Response.Usage.PromptTokens) - metrics.RecordOutputTokens(reqCtx.Model, reqCtx.ResolvedTargetModel, reqCtx.Response.Usage.CompletionTokens) - } - klog.V(logutil.VERBOSE).Infof("Request context after HandleResponseBody: %+v", reqCtx) - default: - klog.Errorf("Unknown Request type %+v", v) - return status.Error(codes.Unknown, "unknown request type") - } - if err != nil { - klog.Errorf("failed to process request: %v", err) - switch status.Code(err) { - // This code can be returned by scheduler when there is no capacity for sheddable - // requests. - case codes.ResourceExhausted: - resp = &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_ImmediateResponse{ - ImmediateResponse: &extProcPb.ImmediateResponse{ - Status: &envoyTypePb.HttpStatus{ - Code: envoyTypePb.StatusCode_TooManyRequests, - }, - }, - }, - } - default: - return status.Errorf(status.Code(err), "failed to handle request: %v", err) - } - } - - klog.V(logutil.VERBOSE).Infof("response: %v", resp) - if err := srv.Send(resp); err != nil { - klog.Errorf("send error %v", err) - return status.Errorf(codes.Unknown, "failed to send response back to Envoy: %v", err) - } - } -} - -// RequestContext stores context information during the life time of an HTTP request. 
-type RequestContext struct { - TargetPod backend.Pod - Model string - ResolvedTargetModel string - RequestReceivedTimestamp time.Time - ResponseCompleteTimestamp time.Time - RequestSize int - Response Response - ResponseSize int - ResponseComplete bool -} diff --git a/pkg/ext-proc/health.go b/pkg/ext-proc/health.go deleted file mode 100644 index 62527d06..00000000 --- a/pkg/ext-proc/health.go +++ /dev/null @@ -1,29 +0,0 @@ -package main - -import ( - "context" - - "google.golang.org/grpc/codes" - healthPb "google.golang.org/grpc/health/grpc_health_v1" - "google.golang.org/grpc/status" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -type healthServer struct { - datastore *backend.K8sDatastore -} - -func (s *healthServer) Check(ctx context.Context, in *healthPb.HealthCheckRequest) (*healthPb.HealthCheckResponse, error) { - if !s.datastore.HasSynced() { - klog.Infof("gRPC health check not serving: %s", in.String()) - return &healthPb.HealthCheckResponse{Status: healthPb.HealthCheckResponse_NOT_SERVING}, nil - } - klog.V(logutil.DEBUG).Infof("gRPC health check serving: %s", in.String()) - return &healthPb.HealthCheckResponse{Status: healthPb.HealthCheckResponse_SERVING}, nil -} - -func (s *healthServer) Watch(in *healthPb.HealthCheckRequest, srv healthPb.Health_WatchServer) error { - return status.Error(codes.Unimplemented, "Watch is not implemented") -} diff --git a/pkg/ext-proc/main.go b/pkg/ext-proc/main.go deleted file mode 100644 index 98b7e6ca..00000000 --- a/pkg/ext-proc/main.go +++ /dev/null @@ -1,214 +0,0 @@ -package main - -import ( - "context" - "flag" - "fmt" - "net" - "net/http" - "strconv" - - "github.com/prometheus/client_golang/prometheus/promhttp" - "google.golang.org/grpc" - healthPb "google.golang.org/grpc/health/grpc_health_v1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend/vllm" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/metrics" - runserver "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/server" - "k8s.io/apimachinery/pkg/runtime" - utilruntime "k8s.io/apimachinery/pkg/util/runtime" - "k8s.io/client-go/kubernetes" - clientgoscheme "k8s.io/client-go/kubernetes/scheme" - "k8s.io/client-go/rest" - "k8s.io/component-base/metrics/legacyregistry" - klog "k8s.io/klog/v2" - ctrl "sigs.k8s.io/controller-runtime" - "sigs.k8s.io/controller-runtime/pkg/metrics/filters" -) - -const ( - defaultMetricsEndpoint = "/metrics" -) - -var ( - grpcPort = flag.Int( - "grpcPort", - runserver.DefaultGrpcPort, - "The gRPC port used for communicating with Envoy proxy") - grpcHealthPort = flag.Int( - "grpcHealthPort", - 9003, - "The port used for gRPC liveness and readiness probes") - metricsPort = flag.Int( - "metricsPort", 9090, "The metrics port") - targetEndpointKey = flag.String( - "targetEndpointKey", - runserver.DefaultTargetEndpointKey, - "Header key used by Envoy to route to the appropriate pod. 
This must match Envoy configuration.") - poolName = flag.String( - "poolName", - runserver.DefaultPoolName, - "Name of the InferencePool this Endpoint Picker is associated with.") - poolNamespace = flag.String( - "poolNamespace", - runserver.DefaultPoolNamespace, - "Namespace of the InferencePool this Endpoint Picker is associated with.") - refreshPodsInterval = flag.Duration( - "refreshPodsInterval", - runserver.DefaultRefreshPodsInterval, - "interval to refresh pods") - refreshMetricsInterval = flag.Duration( - "refreshMetricsInterval", - runserver.DefaultRefreshMetricsInterval, - "interval to refresh metrics") - - scheme = runtime.NewScheme() -) - -func init() { - utilruntime.Must(clientgoscheme.AddToScheme(scheme)) - utilruntime.Must(v1alpha1.AddToScheme(scheme)) -} - -func main() { - klog.InitFlags(nil) - flag.Parse() - - ctrl.SetLogger(klog.TODO()) - cfg, err := ctrl.GetConfig() - if err != nil { - klog.Fatalf("Failed to get rest config: %v", err) - } - // Validate flags - if err := validateFlags(); err != nil { - klog.Fatalf("Failed to validate flags: %v", err) - } - - // Print all flag values - flags := "Flags: " - flag.VisitAll(func(f *flag.Flag) { - flags += fmt.Sprintf("%s=%v; ", f.Name, f.Value) - }) - klog.Info(flags) - - datastore := backend.NewK8sDataStore() - - serverRunner := &runserver.ExtProcServerRunner{ - GrpcPort: *grpcPort, - TargetEndpointKey: *targetEndpointKey, - PoolName: *poolName, - PoolNamespace: *poolNamespace, - RefreshPodsInterval: *refreshPodsInterval, - RefreshMetricsInterval: *refreshMetricsInterval, - Scheme: scheme, - Config: ctrl.GetConfigOrDie(), - Datastore: datastore, - } - serverRunner.Setup() - - k8sClient, err := kubernetes.NewForConfigAndClient(cfg, serverRunner.Manager.GetHTTPClient()) - if err != nil { - klog.Fatalf("Failed to create client: %v", err) - } - datastore.SetClient(k8sClient) - - // Start health and ext-proc servers in goroutines - healthSvr := startHealthServer(datastore, *grpcHealthPort) - extProcSvr := serverRunner.Start(&vllm.PodMetricsClientImpl{}) - // Start metrics handler - metricsSvr := startMetricsHandler(*metricsPort, cfg) - - // Start manager, blocking - serverRunner.StartManager() - - // Gracefully shutdown servers - if healthSvr != nil { - klog.Info("Health server shutting down") - healthSvr.GracefulStop() - } - if extProcSvr != nil { - klog.Info("Ext-proc server shutting down") - extProcSvr.GracefulStop() - } - if metricsSvr != nil { - klog.Info("Metrics server shutting down") - if err := metricsSvr.Shutdown(context.Background()); err != nil { - klog.Infof("Metrics server Shutdown: %v", err) - } - } - - klog.Info("All components shutdown") -} - -// startHealthServer starts the gRPC health probe server in a goroutine. -func startHealthServer(ds *backend.K8sDatastore, port int) *grpc.Server { - svr := grpc.NewServer() - healthPb.RegisterHealthServer(svr, &healthServer{datastore: ds}) - - go func() { - lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) - if err != nil { - klog.Fatalf("Health server failed to listen: %v", err) - } - klog.Infof("Health server listening on port: %d", port) - - // Blocking and will return when shutdown is complete. 
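The server-startup pattern used here (bind and serve inside a goroutine, fail fast on listen errors, hand the `*grpc.Server` back so `main` can call `GracefulStop`) is easy to miss amid the flag plumbing. A compressed, runnable sketch, with grpc-go's stock `health` service standing in for the project's datastore-backed `healthServer`:

```go
package main

import (
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthPb "google.golang.org/grpc/health/grpc_health_v1"
)

// startServer mirrors startHealthServer: bind and serve inside a
// goroutine, fail fast on listen errors, and return the *grpc.Server
// so the caller can GracefulStop it during shutdown.
func startServer(port int) *grpc.Server {
	svr := grpc.NewServer()
	// Stock health service; the deleted code registers a
	// datastore-backed implementation instead.
	healthPb.RegisterHealthServer(svr, health.NewServer())

	go func() {
		lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
		if err != nil {
			panic(err) // the original uses klog.Fatalf here
		}
		// Serve blocks until GracefulStop (or Stop) is called.
		if err := svr.Serve(lis); err != nil && err != grpc.ErrServerStopped {
			panic(err)
		}
	}()
	return svr
}

func main() {
	svr := startServer(9003)
	time.Sleep(100 * time.Millisecond) // give the goroutine time to bind
	svr.GracefulStop()                 // drains in-flight RPCs, then returns
}
```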
- if err := svr.Serve(lis); err != nil && err != grpc.ErrServerStopped { - klog.Fatalf("Health server failed: %v", err) - } - klog.Info("Health server shutting down") - }() - return svr -} - -func startMetricsHandler(port int, cfg *rest.Config) *http.Server { - metrics.Register() - - var svr *http.Server - go func() { - klog.Info("Starting metrics HTTP handler ...") - - mux := http.NewServeMux() - mux.Handle(defaultMetricsEndpoint, metricsHandlerWithAuthenticationAndAuthorization(cfg)) - - svr = &http.Server{ - Addr: net.JoinHostPort("", strconv.Itoa(port)), - Handler: mux, - } - if err := svr.ListenAndServe(); err != http.ErrServerClosed { - klog.Fatalf("failed to start metrics HTTP handler: %v", err) - } - }() - return svr -} - -func metricsHandlerWithAuthenticationAndAuthorization(cfg *rest.Config) http.Handler { - h := promhttp.HandlerFor( - legacyregistry.DefaultGatherer, - promhttp.HandlerOpts{}, - ) - httpClient, err := rest.HTTPClientFor(cfg) - if err != nil { - klog.Fatalf("failed to create http client for metrics auth: %v", err) - } - - filter, err := filters.WithAuthenticationAndAuthorization(cfg, httpClient) - if err != nil { - klog.Fatalf("failed to create metrics filter for auth: %v", err) - } - metricsLogger := klog.LoggerWithValues(klog.NewKlogr(), "path", defaultMetricsEndpoint) - metricsAuthHandler, err := filter(metricsLogger, h) - if err != nil { - klog.Fatalf("failed to create metrics auth handler: %v", err) - } - return metricsAuthHandler -} - -func validateFlags() error { - if *poolName == "" { - return fmt.Errorf("required %q flag not set", "poolName") - } - - return nil -} diff --git a/pkg/ext-proc/metrics/README.md b/pkg/ext-proc/metrics/README.md deleted file mode 100644 index 1094bc23..00000000 --- a/pkg/ext-proc/metrics/README.md +++ /dev/null @@ -1,104 +0,0 @@ -# Documentation - -This documentation is the current state of exposed metrics. - -## Table of Contents -* [Exposed Metrics](#exposed-metrics) -* [Scrape Metrics](#scrape-metrics) - -## Requirements - -Response metrics are only supported in non-streaming mode, with the follow up [issue](https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/178) to address streaming mode. - -Currently there are two options: -- If requests don't use response streaming, then you can enable `Buffered` mode for response in `EnvoyExtensionPolicy`, this will buffer the response body at the proxy and forward it to the endpoint picker, which allows the endpoint picker to report response metrics. - -- If requests use response streaming, then it is not recommended to enable `Buffered` mode, the response body processing mode should be left empty in the `EnvoyExtensionPolicy` (default). In this case response bodies will not be forwarded to the endpoint picker, and therefore response metrics will not be reported. - - -``` -apiVersion: gateway.envoyproxy.io/v1alpha1 -kind: EnvoyExtensionPolicy -metadata: - name: ext-proc-policy - namespace: default -spec: - extProc: - - backendRefs: - - group: "" - kind: Service - name: inference-gateway-ext-proc - port: 9002 - processingMode: - request: - body: Buffered - response: - body: Buffered -``` - -## Exposed metrics - -| Metric name | Metric Type | Description | Labels | Status | -| ------------|--------------| ----------- | ------ | ------ | -| inference_model_request_total | Counter | The counter of requests broken out for each model. | `model_name`=<model-name>
`target_model_name`=<target-model-name> | ALPHA |
-| inference_model_request_duration_seconds | Distribution | Distribution of response latency. | `model_name`=<model-name> <br/> `target_model_name`=<target-model-name> | ALPHA |
-| inference_model_request_sizes | Distribution | Distribution of request size in bytes. | `model_name`=<model-name> <br/> `target_model_name`=<target-model-name> | ALPHA |
-| inference_model_response_sizes | Distribution | Distribution of response size in bytes. | `model_name`=<model-name> <br/> `target_model_name`=<target-model-name> | ALPHA |
-| inference_model_input_tokens | Distribution | Distribution of input token count. | `model_name`=<model-name> <br/> `target_model_name`=<target-model-name> | ALPHA |
-| inference_model_output_tokens | Distribution | Distribution of output token count. | `model_name`=<model-name>
`target_model_name`=<target-model-name> | ALPHA | - -## Scrape Metrics - -Metrics endpoint is exposed at port 9090 by default. To scrape metrics, the client needs a ClusterRole with the following rule: -`nonResourceURLs: "/metrics", verbs: get`. - -Here is one example if the client needs to mound the secret to act as the service account -``` ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: inference-gateway-metrics-reader -rules: -- nonResourceURLs: - - /metrics - verbs: - - get ---- -apiVersion: v1 -kind: ServiceAccount -metadata: - name: inference-gateway-sa-metrics-reader - namespace: default ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRoleBinding -metadata: - name: inference-gateway-sa-metrics-reader-role-binding - namespace: default -subjects: -- kind: ServiceAccount - name: inference-gateway-sa-metrics-reader - namespace: default -roleRef: - kind: ClusterRole - name: inference-gateway-metrics-reader - apiGroup: rbac.authorization.k8s.io ---- -apiVersion: v1 -kind: Secret -metadata: - name: inference-gateway-sa-metrics-reader-secret - namespace: default - annotations: - kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader -type: kubernetes.io/service-account-token -``` -Then, you can curl the 9090 port like following -``` -TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret -o jsonpath='{.secrets[0].name}' -o jsonpath='{.data.token}' | base64 --decode) - -kubectl -n default port-forward inference-gateway-ext-proc-pod-name 9090 - -curl -H "Authorization: Bearer $TOKEN" localhost:9090/metrics -``` \ No newline at end of file diff --git a/pkg/ext-proc/metrics/metrics.go b/pkg/ext-proc/metrics/metrics.go deleted file mode 100644 index 8cb7bd27..00000000 --- a/pkg/ext-proc/metrics/metrics.go +++ /dev/null @@ -1,145 +0,0 @@ -package metrics - -import ( - "sync" - "time" - - compbasemetrics "k8s.io/component-base/metrics" - "k8s.io/component-base/metrics/legacyregistry" - klog "k8s.io/klog/v2" -) - -const ( - InferenceModelComponent = "inference_model" -) - -var ( - requestCounter = compbasemetrics.NewCounterVec( - &compbasemetrics.CounterOpts{ - Subsystem: InferenceModelComponent, - Name: "request_total", - Help: "Counter of inference model requests broken out for each model and target model.", - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) - - requestLatencies = compbasemetrics.NewHistogramVec( - &compbasemetrics.HistogramOpts{ - Subsystem: InferenceModelComponent, - Name: "request_duration_seconds", - Help: "Inference model response latency distribution in seconds for each model and target model.", - Buckets: []float64{0.005, 0.025, 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.25, 1.5, 2, 3, - 4, 5, 6, 8, 10, 15, 20, 30, 45, 60, 120, 180, 240, 300, 360, 480, 600, 900, 1200, 1800, 2700, 3600}, - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) - - requestSizes = compbasemetrics.NewHistogramVec( - &compbasemetrics.HistogramOpts{ - Subsystem: InferenceModelComponent, - Name: "request_sizes", - Help: "Inference model requests size distribution in bytes for each model and target model.", - // Use buckets ranging from 1000 bytes (1KB) to 10^9 bytes (1GB). 
- Buckets: []float64{ - 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, // More fine-grained up to 64KB - 131072, 262144, 524288, 1048576, 2097152, 4194304, 8388608, // Exponential up to 8MB - 16777216, 33554432, 67108864, 134217728, 268435456, 536870912, 1073741824, // Exponential up to 1GB - }, - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) - - responseSizes = compbasemetrics.NewHistogramVec( - &compbasemetrics.HistogramOpts{ - Subsystem: InferenceModelComponent, - Name: "response_sizes", - Help: "Inference model responses size distribution in bytes for each model and target model.", - // Most models have a response token < 8192 tokens. Each token, in average, has 4 characters. - // 8192 * 4 = 32768. - Buckets: []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32778, 65536}, - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) - - inputTokens = compbasemetrics.NewHistogramVec( - &compbasemetrics.HistogramOpts{ - Subsystem: InferenceModelComponent, - Name: "input_tokens", - Help: "Inference model input token count distribution for requests in each model.", - // Most models have a input context window less than 1 million tokens. - Buckets: []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32778, 65536, 131072, 262144, 524288, 1048576}, - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) - - outputTokens = compbasemetrics.NewHistogramVec( - &compbasemetrics.HistogramOpts{ - Subsystem: InferenceModelComponent, - Name: "output_tokens", - Help: "Inference model output token count distribution for requests in each model.", - // Most models generates output less than 8192 tokens. - Buckets: []float64{1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192}, - StabilityLevel: compbasemetrics.ALPHA, - }, - []string{"model_name", "target_model_name"}, - ) -) - -var registerMetrics sync.Once - -// Register all metrics. -func Register() { - registerMetrics.Do(func() { - legacyregistry.MustRegister(requestCounter) - legacyregistry.MustRegister(requestLatencies) - legacyregistry.MustRegister(requestSizes) - legacyregistry.MustRegister(responseSizes) - legacyregistry.MustRegister(inputTokens) - legacyregistry.MustRegister(outputTokens) - }) -} - -// RecordRequstCounter records the number of requests. -func RecordRequestCounter(modelName, targetModelName string) { - requestCounter.WithLabelValues(modelName, targetModelName).Inc() -} - -// RecordRequestSizes records the request sizes. -func RecordRequestSizes(modelName, targetModelName string, reqSize int) { - requestSizes.WithLabelValues(modelName, targetModelName).Observe(float64(reqSize)) -} - -// RecordRequstLatencies records duration of request. -func RecordRequestLatencies(modelName, targetModelName string, received time.Time, complete time.Time) bool { - if !complete.After(received) { - klog.Errorf("request latency value error for model name %v, target model name %v: complete time %v is before received time %v", modelName, targetModelName, complete, received) - return false - } - elapsedSeconds := complete.Sub(received).Seconds() - requestLatencies.WithLabelValues(modelName, targetModelName).Observe(elapsedSeconds) - return true -} - -// RecordResponseSizes records the response sizes. 
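A condensed sketch of the register-once / guarded-observe pattern this file uses throughout. The histogram, its name, and `RecordTokens` are illustrative stand-ins rather than the real metrics; the `component-base` calls mirror the ones above:

```go
package metrics

import (
	"sync"

	compbasemetrics "k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// tokenHistogram is an illustrative histogram in the same style as
// inputTokens/outputTokens above.
var tokenHistogram = compbasemetrics.NewHistogramVec(
	&compbasemetrics.HistogramOpts{
		Subsystem:      "inference_model",
		Name:           "example_tokens",
		Help:           "Illustrative token count distribution.",
		Buckets:        []float64{1, 8, 64, 512, 4096},
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"model_name", "target_model_name"},
)

var registerOnce sync.Once

// Register is idempotent: sync.Once makes repeated calls safe, which
// matters because both main and the tests call it.
func Register() {
	registerOnce.Do(func() {
		legacyregistry.MustRegister(tokenHistogram)
	})
}

// RecordTokens skips non-positive counts, mirroring RecordInputTokens
// and RecordOutputTokens: a zero usage block usually means the model
// server did not report usage, not that zero tokens were processed.
func RecordTokens(modelName, targetModelName string, n int) {
	if n > 0 {
		tokenHistogram.WithLabelValues(modelName, targetModelName).Observe(float64(n))
	}
}
```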
-func RecordResponseSizes(modelName, targetModelName string, size int) { - responseSizes.WithLabelValues(modelName, targetModelName).Observe(float64(size)) -} - -// RecordInputTokens records input tokens count. -func RecordInputTokens(modelName, targetModelName string, size int) { - if size > 0 { - inputTokens.WithLabelValues(modelName, targetModelName).Observe(float64(size)) - } -} - -// RecordOutputTokens records output tokens count. -func RecordOutputTokens(modelName, targetModelName string, size int) { - if size > 0 { - outputTokens.WithLabelValues(modelName, targetModelName).Observe(float64(size)) - } -} diff --git a/pkg/ext-proc/metrics/metrics_test.go b/pkg/ext-proc/metrics/metrics_test.go deleted file mode 100644 index 57774b11..00000000 --- a/pkg/ext-proc/metrics/metrics_test.go +++ /dev/null @@ -1,259 +0,0 @@ -package metrics - -import ( - "os" - "testing" - "time" - - "k8s.io/component-base/metrics/legacyregistry" - "k8s.io/component-base/metrics/testutil" -) - -const RequestTotalMetric = InferenceModelComponent + "_request_total" -const RequestLatenciesMetric = InferenceModelComponent + "_request_duration_seconds" -const RequestSizesMetric = InferenceModelComponent + "_request_sizes" -const ResponseSizesMetric = InferenceModelComponent + "_response_sizes" -const InputTokensMetric = InferenceModelComponent + "_input_tokens" -const OutputTokensMetric = InferenceModelComponent + "_output_tokens" - -func TestRecordRequestCounterandSizes(t *testing.T) { - type requests struct { - modelName string - targetModelName string - reqSize int - } - scenarios := []struct { - name string - reqs []requests - }{{ - name: "multiple requests", - reqs: []requests{ - { - modelName: "m10", - targetModelName: "t10", - reqSize: 1200, - }, - { - modelName: "m10", - targetModelName: "t10", - reqSize: 500, - }, - { - modelName: "m10", - targetModelName: "t11", - reqSize: 2480, - }, - { - modelName: "m20", - targetModelName: "t20", - reqSize: 80, - }, - }, - }} - Register() - for _, scenario := range scenarios { - t.Run(scenario.name, func(t *testing.T) { - for _, req := range scenario.reqs { - RecordRequestCounter(req.modelName, req.targetModelName) - RecordRequestSizes(req.modelName, req.targetModelName, req.reqSize) - } - wantRequestTotal, err := os.Open("testdata/request_total_metric") - defer func() { - if err := wantRequestTotal.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestTotal, RequestTotalMetric); err != nil { - t.Error(err) - } - wantRequestSizes, err := os.Open("testdata/request_sizes_metric") - defer func() { - if err := wantRequestSizes.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestSizes, RequestSizesMetric); err != nil { - t.Error(err) - } - - }) - } -} - -func TestRecordRequestLatencies(t *testing.T) { - timeBaseline := time.Now() - type requests struct { - modelName string - targetModelName string - receivedTime time.Time - completeTime time.Time - } - scenarios := []struct { - name string - reqs []requests - invalid bool - }{{ - name: "multiple requests", - reqs: []requests{ - { - modelName: "m10", - targetModelName: "t10", - receivedTime: timeBaseline, - completeTime: timeBaseline.Add(time.Millisecond * 10), - }, - { - modelName: "m10", - targetModelName: "t10", - receivedTime: timeBaseline, - completeTime: timeBaseline.Add(time.Millisecond * 1600), - 
}, - { - modelName: "m10", - targetModelName: "t11", - receivedTime: timeBaseline, - completeTime: timeBaseline.Add(time.Millisecond * 60), - }, - { - modelName: "m20", - targetModelName: "t20", - receivedTime: timeBaseline, - completeTime: timeBaseline.Add(time.Millisecond * 120), - }, - }, - }, - { - name: "invalid elapsed time", - reqs: []requests{ - { - modelName: "m10", - targetModelName: "t10", - receivedTime: timeBaseline.Add(time.Millisecond * 10), - completeTime: timeBaseline, - }}, - invalid: true, - }} - Register() - for _, scenario := range scenarios { - t.Run(scenario.name, func(t *testing.T) { - for _, req := range scenario.reqs { - success := RecordRequestLatencies(req.modelName, req.targetModelName, req.receivedTime, req.completeTime) - if success == scenario.invalid { - t.Errorf("got record success(%v), but the request expects invalid(%v)", success, scenario.invalid) - } - } - - wantRequestLatencies, err := os.Open("testdata/request_duration_seconds_metric") - defer func() { - if err := wantRequestLatencies.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantRequestLatencies, RequestLatenciesMetric); err != nil { - t.Error(err) - } - }) - } -} - -func TestRecordResponseMetrics(t *testing.T) { - type responses struct { - modelName string - targetModelName string - inputToken int - outputToken int - respSize int - } - scenarios := []struct { - name string - resp []responses - }{{ - name: "multiple requests", - resp: []responses{ - { - modelName: "m10", - targetModelName: "t10", - respSize: 1200, - inputToken: 10, - outputToken: 100, - }, - { - modelName: "m10", - targetModelName: "t10", - respSize: 500, - inputToken: 20, - outputToken: 200, - }, - { - modelName: "m10", - targetModelName: "t11", - respSize: 2480, - inputToken: 30, - outputToken: 300, - }, - { - modelName: "m20", - targetModelName: "t20", - respSize: 80, - inputToken: 40, - outputToken: 400, - }, - }, - }} - Register() - for _, scenario := range scenarios { - t.Run(scenario.name, func(t *testing.T) { - for _, resp := range scenario.resp { - RecordInputTokens(resp.modelName, resp.targetModelName, resp.inputToken) - RecordOutputTokens(resp.modelName, resp.targetModelName, resp.outputToken) - RecordResponseSizes(resp.modelName, resp.targetModelName, resp.respSize) - } - wantResponseSize, err := os.Open("testdata/response_sizes_metric") - defer func() { - if err := wantResponseSize.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantResponseSize, ResponseSizesMetric); err != nil { - t.Error(err) - } - - wantInputToken, err := os.Open("testdata/input_tokens_metric") - defer func() { - if err := wantInputToken.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantInputToken, InputTokensMetric); err != nil { - t.Error(err) - } - - wantOutputToken, err := os.Open("testdata/output_tokens_metric") - defer func() { - if err := wantOutputToken.Close(); err != nil { - t.Error(err) - } - }() - if err != nil { - t.Fatal(err) - } - if err := testutil.GatherAndCompare(legacyregistry.DefaultGatherer, wantOutputToken, OutputTokensMetric); err != nil { - t.Error(err) - } - }) - } -} diff --git a/pkg/ext-proc/scheduling/filter.go b/pkg/ext-proc/scheduling/filter.go deleted file mode 100644 index d431b076..00000000 --- 
a/pkg/ext-proc/scheduling/filter.go +++ /dev/null @@ -1,188 +0,0 @@ -package scheduling - -import ( - "errors" - "math" - - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -type Filter interface { - Name() string - Filter(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) -} - -// filter applies current filterFunc, and then recursively applies next filters depending success or -// failure of the current filterFunc. -// It can be used to construct a flow chart algorithm. -type filter struct { - name string - filter filterFunc - // nextOnSuccess filter will be applied after successfully applying the current filter. - // The filtered results will be passed to the next filter. - nextOnSuccess *filter - // nextOnFailure filter will be applied if current filter fails. - // The original input will be passed to the next filter. - nextOnFailure *filter - // nextOnSuccessOrFailure is a convenience field to configure the next filter regardless of the - // success or failure of the current filter. - // NOTE: When using nextOnSuccessOrFailure, both nextOnSuccess and nextOnFailure SHOULD be nil. - // However if that's not the case, nextOnSuccess and nextOnFailure will be used, instead of - // nextOnSuccessOrFailure, in the success and failure scenarios, respectively. - nextOnSuccessOrFailure *filter -} - -func (f *filter) Name() string { - if f == nil { - return "nil" - } - return f.name -} - -func (f *filter) Filter(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - klog.V(logutil.VERBOSE).Infof("Running filter %q on request %v with %v pods", f.name, req, len(pods)) - - filtered, err := f.filter(req, pods) - - next := f.nextOnSuccessOrFailure - if err == nil && len(filtered) > 0 { - if f.nextOnSuccess == nil && f.nextOnSuccessOrFailure == nil { - // No succeeding filters to run, return. - return filtered, err - } - if f.nextOnSuccess != nil { - next = f.nextOnSuccess - } - klog.V(logutil.VERBOSE).Infof("onSuccess %q -> %q, filtered: %v", f.name, next.Name(), len(filtered)) - // On success, pass the filtered result to the next filter. - return next.Filter(req, filtered) - } else { - if f.nextOnFailure == nil && f.nextOnSuccessOrFailure == nil { - // No succeeding filters to run, return. - return filtered, err - } - if f.nextOnFailure != nil { - next = f.nextOnFailure - } - klog.V(logutil.VERBOSE).Infof("onFailure %q -> %q", f.name, next.Name()) - // On failure, pass the initial set of pods to the next filter. - return next.Filter(req, pods) - } -} - -// filterFunc filters a set of input pods to a subset. -type filterFunc func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) - -// toFilterFunc is a helper function to convert a per pod filter func to the FilterFunc. -func toFilterFunc(pp podPredicate) filterFunc { - return func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - filtered := []*backend.PodMetrics{} - for _, pod := range pods { - pass := pp(req, pod) - if pass { - filtered = append(filtered, pod) - } - } - if len(filtered) == 0 { - return nil, errors.New("no pods left") - } - return filtered, nil - } -} - -// leastQueuingFilterFunc finds the max and min queue size of all pods, divides the whole range -// (max-min) by the number of pods, and finds the pods that fall into the first range. 
-// The intuition is that if there are multiple pods that share similar queue size in the low range, -// we should consider them all instead of the absolute minimum one. This worked better than picking -// the least one as it gives more choices for the next filter, which on aggregate gave better -// results. -// TODO: Compare this strategy with other strategies such as top K. -func leastQueuingFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - min := math.MaxInt - max := 0 - filtered := []*backend.PodMetrics{} - - for _, pod := range pods { - if pod.WaitingQueueSize <= min { - min = pod.WaitingQueueSize - } - if pod.WaitingQueueSize >= max { - max = pod.WaitingQueueSize - } - } - - for _, pod := range pods { - if pod.WaitingQueueSize >= min && pod.WaitingQueueSize <= min+(max-min)/len(pods) { - filtered = append(filtered, pod) - } - } - return filtered, nil -} - -func lowQueueingPodPredicate(_ *LLMRequest, pod *backend.PodMetrics) bool { - return pod.WaitingQueueSize < queueingThresholdLoRA -} - -// leastKVCacheFilterFunc finds the max and min KV cache of all pods, divides the whole range -// (max-min) by the number of pods, and finds the pods that fall into the first range. -// The intuition is that if there are multiple pods that share similar KV cache in the low range, we -// should consider them all instead of the absolute minimum one. This worked better than picking the -// least one as it gives more choices for the next filter, which on aggregate gave better results. -// TODO: Compare this strategy with other strategies such as top K. -func leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - min := math.MaxFloat64 - var max float64 = 0 - filtered := []*backend.PodMetrics{} - - for _, pod := range pods { - if pod.KVCacheUsagePercent <= min { - min = pod.KVCacheUsagePercent - } - if pod.KVCacheUsagePercent >= max { - max = pod.KVCacheUsagePercent - } - } - - for _, pod := range pods { - if pod.KVCacheUsagePercent >= min && pod.KVCacheUsagePercent <= min+(max-min)/float64(len(pods)) { - filtered = append(filtered, pod) - } - } - return filtered, nil -} - -// podPredicate is a filter function to check whether a pod is desired. -type podPredicate func(req *LLMRequest, pod *backend.PodMetrics) bool - -// We consider serving an adapter low cost it the adapter is active in the model server, or the -// model server has room to load the adapter. The lowLoRACostPredicate ensures weak affinity by -// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to -// a single pod. This gave good performance in our initial benchmarking results in the scenario -// where # of lora slots > # of lora adapters. -func lowLoRACostPredicate(req *LLMRequest, pod *backend.PodMetrics) bool { - _, ok := pod.ActiveModels[req.ResolvedTargetModel] - return ok || len(pod.ActiveModels) < pod.MaxActiveModels -} - -// loRAAffinityPredicate is a filter function to check whether a pod has affinity to the lora requested. -func loRAAffinityPredicate(req *LLMRequest, pod *backend.PodMetrics) bool { - _, ok := pod.ActiveModels[req.ResolvedTargetModel] - return ok -} - -// canAcceptNewLoraPredicate is a filter function to check whether a pod has room to load the adapter. 
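The predicate helpers in this file all flow through `toFilterFunc`. A self-contained restatement of that lift, with simplified stand-ins for `LLMRequest` and `backend.PodMetrics` so the sketch compiles on its own:

```go
package scheduling

import "errors"

// Simplified stand-ins for backend.PodMetrics and LLMRequest.
type podMetrics struct {
	name             string
	waitingQueueSize int
}

type llmRequest struct{ critical bool }

type podPredicate func(req *llmRequest, pod *podMetrics) bool

// toFilterFunc lifts a per-pod predicate into a whole-list filter,
// exactly as the deleted filter.go does: keep matching pods, and treat
// an empty result as a failure so the chain can fall through to
// nextOnFailure.
func toFilterFunc(pp podPredicate) func(req *llmRequest, pods []*podMetrics) ([]*podMetrics, error) {
	return func(req *llmRequest, pods []*podMetrics) ([]*podMetrics, error) {
		filtered := []*podMetrics{}
		for _, pod := range pods {
			if pp(req, pod) {
				filtered = append(filtered, pod)
			}
		}
		if len(filtered) == 0 {
			return nil, errors.New("no pods left")
		}
		return filtered, nil
	}
}
```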
-func canAcceptNewLoraPredicate(req *LLMRequest, pod *backend.PodMetrics) bool { - return len(pod.ActiveModels) < pod.MaxActiveModels -} - -func criticalRequestPredicate(req *LLMRequest, pod *backend.PodMetrics) bool { - return req.Critical -} - -func noQueueAndLessThanKVCacheThresholdPredicate(queueThreshold int, kvCacheThreshold float64) podPredicate { - return func(req *LLMRequest, pod *backend.PodMetrics) bool { - return pod.WaitingQueueSize <= queueThreshold && pod.KVCacheUsagePercent <= kvCacheThreshold - } -} diff --git a/pkg/ext-proc/scheduling/filter_test.go b/pkg/ext-proc/scheduling/filter_test.go deleted file mode 100644 index 34731d15..00000000 --- a/pkg/ext-proc/scheduling/filter_test.go +++ /dev/null @@ -1,409 +0,0 @@ -package scheduling - -import ( - "errors" - "testing" - - "github.com/google/go-cmp/cmp" - "github.com/google/go-cmp/cmp/cmpopts" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" -) - -func TestFilter(t *testing.T) { - tests := []struct { - name string - req *LLMRequest - input []*backend.PodMetrics - output []*backend.PodMetrics - err bool - filter *filter - }{ - { - name: "simple filter without successor, failure", - filter: &filter{filter: func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - return nil, errors.New("filter error") - }}, - err: true, - }, - { - name: "default filter, critical request", - filter: defaultFilter, - req: &LLMRequest{ - Model: "critical", - ResolvedTargetModel: "critical", - Critical: true, - }, - // pod2 will be picked because it has relatively low queue size, with the requested - // model being active, and has low KV cache. - input: []*backend.PodMetrics{ - { - Pod: backend.Pod{Name: "pod1"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod2"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - KVCacheUsagePercent: 0.1, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "critical": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod3"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - }, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Pod: backend.Pod{Name: "pod2"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - KVCacheUsagePercent: 0.1, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "critical": 1, - }, - }, - }, - }, - }, - { - name: "default filter, sheddable request, accepted", - filter: defaultFilter, - req: &LLMRequest{ - Model: "sheddable", - ResolvedTargetModel: "sheddable", - Critical: false, - }, - // pod1 will be picked because it has capacity for the sheddable request. 
- input: []*backend.PodMetrics{ - { - Pod: backend.Pod{Name: "pod1"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod2"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - KVCacheUsagePercent: 0.1, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "critical": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod3"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - }, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Pod: backend.Pod{Name: "pod1"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - }, - }, - { - name: "default filter, sheddable request, dropped", - filter: defaultFilter, - req: &LLMRequest{ - Model: "sheddable", - ResolvedTargetModel: "sheddable", - Critical: false, - }, - // All pods have higher KV cache thant the threshold, so the sheddable request will be - // dropped. - input: []*backend.PodMetrics{ - { - Pod: backend.Pod{Name: "pod1"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.9, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod2"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - KVCacheUsagePercent: 0.85, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "critical": 1, - }, - }, - }, - { - Pod: backend.Pod{Name: "pod3"}, - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.85, - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - }, - }, - }, - }, - output: []*backend.PodMetrics{}, - err: true, - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - got, err := test.filter.Filter(test.req, test.input) - if test.err != (err != nil) { - t.Errorf("Unexpected error, got %v, want %v", err, test.err) - } - - if diff := cmp.Diff(test.output, got, cmpopts.IgnoreFields(backend.PodMetrics{}, "revision")); diff != "" { - t.Errorf("Unexpected output (-want +got): %v", diff) - } - }) - } -} - -func TestFilterFunc(t *testing.T) { - tests := []struct { - name string - f filterFunc - req *LLMRequest - input []*backend.PodMetrics - output []*backend.PodMetrics - err bool - }{ - { - name: "least queuing empty input", - f: leastQueuingFilterFunc, - input: []*backend.PodMetrics{}, - output: []*backend.PodMetrics{}, - }, - { - name: "least queuing", - f: leastQueuingFilterFunc, - input: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - }, - }, - { - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - }, - }, - { - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - }, - }, - { - Metrics: backend.Metrics{ - WaitingQueueSize: 3, - }, - }, - }, - }, - { - name: "least kv cache empty input", - f: leastKVCacheFilterFunc, - input: []*backend.PodMetrics{}, - output: []*backend.PodMetrics{}, - }, - { - name: "least kv cache", - f: leastKVCacheFilterFunc, - input: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - KVCacheUsagePercent: 0, - }, - }, - { - Metrics: backend.Metrics{ - KVCacheUsagePercent: 0.3, - }, - }, - { - Metrics: backend.Metrics{ - KVCacheUsagePercent: 
1.0, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - KVCacheUsagePercent: 0, - }, - }, - { - Metrics: backend.Metrics{ - KVCacheUsagePercent: 0.3, - }, - }, - }, - }, - { - name: "noQueueAndLessThanKVCacheThresholdPredicate", - f: toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(0, 0.8)), - input: []*backend.PodMetrics{ - { - // This pod should be returned. - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0, - }, - }, - { - // Queue is non zero, despite low kv cache, should not return. - Metrics: backend.Metrics{ - WaitingQueueSize: 1, - KVCacheUsagePercent: 0.3, - }, - }, - { - // High kv cache despite zero queue, should not return - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 1.0, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0, - }, - }, - }, - }, - { - name: "low LoRA cost", - f: toFilterFunc(lowLoRACostPredicate), - req: &LLMRequest{ - Model: "model", - ResolvedTargetModel: "model", - }, - input: []*backend.PodMetrics{ - // ActiveModels include input model, should be returned. - { - Metrics: backend.Metrics{ - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "model": 1, - }, - }, - }, - // Input model is not active, however the server has room to load another adapter. - { - Metrics: backend.Metrics{ - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "another-model": 1, - }, - }, - }, - // Input is not active, and the server has reached max active models. - { - Metrics: backend.Metrics{ - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - }, - output: []*backend.PodMetrics{ - { - Metrics: backend.Metrics{ - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "model": 1, - }, - }, - }, - { - Metrics: backend.Metrics{ - MaxActiveModels: 2, - ActiveModels: map[string]int{ - "another-model": 1, - }, - }, - }, - }, - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - got, err := test.f(test.req, test.input) - if test.err != (err != nil) { - t.Errorf("Unexpected error, got %v, want %v", err, test.err) - } - - if diff := cmp.Diff(test.output, got, cmpopts.IgnoreFields(backend.PodMetrics{}, "revision")); diff != "" { - t.Errorf("Unexpected output (-want +got): %v", diff) - } - }) - } -} diff --git a/pkg/ext-proc/scheduling/scheduler.go b/pkg/ext-proc/scheduling/scheduler.go deleted file mode 100644 index 9fc3e663..00000000 --- a/pkg/ext-proc/scheduling/scheduler.go +++ /dev/null @@ -1,124 +0,0 @@ -// Package scheduling implements request scheduling algorithms. -package scheduling - -import ( - "fmt" - "math/rand" - - "google.golang.org/grpc/codes" - "google.golang.org/grpc/status" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - logutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/logging" - klog "k8s.io/klog/v2" -) - -const ( - // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable. - kvCacheThreshold = 0.8 - // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable. - queueThresholdCritical = 5 - // TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable. - // the threshold for queued requests to be considered low below which we can prioritize LoRA affinity. 
- // The value of 50 is arrived heuristicically based on experiments. - queueingThresholdLoRA = 50 -) - -var ( - defaultFilter = &filter{ - name: "critical request", - filter: toFilterFunc(criticalRequestPredicate), - nextOnSuccess: lowLatencyFilter, - nextOnFailure: sheddableRequestFilter, - } - - // queueLoRAAndKVCacheFilter applied least queue -> low cost lora -> least KV Cache filter - queueLoRAAndKVCacheFilter = &filter{ - name: "least queuing", - filter: leastQueuingFilterFunc, - nextOnSuccessOrFailure: &filter{ - name: "low cost LoRA", - filter: toFilterFunc(lowLoRACostPredicate), - nextOnSuccessOrFailure: &filter{ - name: "least KV cache percent", - filter: leastKVCacheFilterFunc, - }, - }, - } - - // queueAndKVCacheFilter applies least queue followed by least KV Cache filter - queueAndKVCacheFilter = &filter{ - name: "least queuing", - filter: leastQueuingFilterFunc, - nextOnSuccessOrFailure: &filter{ - name: "least KV cache percent", - filter: leastKVCacheFilterFunc, - }, - } - - lowLatencyFilter = &filter{ - name: "low queueing filter", - filter: toFilterFunc((lowQueueingPodPredicate)), - nextOnSuccess: &filter{ - name: "affinity LoRA", - filter: toFilterFunc(loRAAffinityPredicate), - nextOnSuccess: queueAndKVCacheFilter, - nextOnFailure: &filter{ - name: "can accept LoRA Adapter", - filter: toFilterFunc(canAcceptNewLoraPredicate), - nextOnSuccessOrFailure: queueAndKVCacheFilter, - }, - }, - nextOnFailure: queueLoRAAndKVCacheFilter, - } - - sheddableRequestFilter = &filter{ - // When there is at least one model server that's not queuing requests, and still has KV - // cache below a certain threshold, we consider this model server has capacity to handle - // a sheddable request without impacting critical requests. - name: "has capacity for sheddable requests", - filter: toFilterFunc(noQueueAndLessThanKVCacheThresholdPredicate(queueThresholdCritical, kvCacheThreshold)), - nextOnSuccess: queueLoRAAndKVCacheFilter, - // If all pods are queuing or running above the KVCache threshold, we drop the sheddable - // request to make room for critical requests. - nextOnFailure: &filter{ - name: "drop request", - filter: func(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) { - klog.Infof("Dropping request %v", req) - return []*backend.PodMetrics{}, status.Errorf( - codes.ResourceExhausted, "dropping request due to limited backend resources") - }, - }, - } -) - -func NewScheduler(pmp PodMetricsProvider) *Scheduler { - - return &Scheduler{ - podMetricsProvider: pmp, - filter: defaultFilter, - } -} - -type Scheduler struct { - podMetricsProvider PodMetricsProvider - filter Filter -} - -// PodMetricsProvider is an interface to provide set of pods in the backend and information such as -// metrics. -type PodMetricsProvider interface { - AllPodMetrics() []*backend.PodMetrics -} - -// Schedule finds the target pod based on metrics and the requested lora adapter. 
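The `nextOnSuccess`/`nextOnFailure` chaining that `Schedule` relies on is easiest to see in miniature. A runnable sketch of the same decision-tree walk, with string pod names standing in for `*backend.PodMetrics`; note the failure path re-runs the next node on the original candidate set, as `filter.Filter` does:

```go
package main

import "fmt"

// node is a stripped-down version of the filter struct above: apply
// the current step, then branch on success or failure.
type node struct {
	name      string
	apply     func(pods []string) []string // empty result means failure
	onSuccess *node
	onFailure *node
}

func (n *node) run(pods []string) []string {
	filtered := n.apply(pods)
	if len(filtered) > 0 {
		if n.onSuccess == nil {
			return filtered
		}
		return n.onSuccess.run(filtered) // success: pass the narrowed set on
	}
	if n.onFailure == nil {
		return nil
	}
	return n.onFailure.run(pods) // failure: retry with the ORIGINAL set
}

func main() {
	// "affinity" keeps only pods already serving the adapter; if none
	// qualify, fall back to accepting any pod, echoing the
	// loRAAffinityPredicate -> canAcceptNewLoraPredicate fallback.
	affinity := &node{
		name: "affinity",
		apply: func(pods []string) []string {
			var out []string
			for _, p := range pods {
				if p == "pod-with-adapter" {
					out = append(out, p)
				}
			}
			return out
		},
		onFailure: &node{
			name:  "any",
			apply: func(pods []string) []string { return pods },
		},
	}

	fmt.Println(affinity.run([]string{"pod-with-adapter", "cold-pod"})) // [pod-with-adapter]
	fmt.Println(affinity.run([]string{"cold-pod-1", "cold-pod-2"}))    // both pods, via onFailure
}
```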
-func (s *Scheduler) Schedule(req *LLMRequest) (targetPod backend.Pod, err error) { - klog.V(logutil.VERBOSE).Infof("request: %v; metrics: %+v", req, s.podMetricsProvider.AllPodMetrics()) - pods, err := s.filter.Filter(req, s.podMetricsProvider.AllPodMetrics()) - if err != nil || len(pods) == 0 { - return backend.Pod{}, fmt.Errorf( - "failed to apply filter, resulted %v pods, this should never happen: %w", len(pods), err) - } - klog.V(logutil.VERBOSE).Infof("Going to randomly select a pod from the candidates: %+v", pods) - i := rand.Intn(len(pods)) - return pods[i].Pod, nil -} diff --git a/pkg/ext-proc/scheduling/types.go b/pkg/ext-proc/scheduling/types.go deleted file mode 100644 index cfb9d3b8..00000000 --- a/pkg/ext-proc/scheduling/types.go +++ /dev/null @@ -1,11 +0,0 @@ -package scheduling - -// LLMRequest is a structured representation of the fields we parse out of the LLMRequest body. -type LLMRequest struct { - Model string - // Target models is a map of target model name to weight. - TargetModels map[string]int - // Resolved target model is the final target model after traffic split. - ResolvedTargetModel string - Critical bool -} diff --git a/pkg/ext-proc/server/runserver.go b/pkg/ext-proc/server/runserver.go deleted file mode 100644 index 981dab11..00000000 --- a/pkg/ext-proc/server/runserver.go +++ /dev/null @@ -1,137 +0,0 @@ -package server - -import ( - "fmt" - "net" - "time" - - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - "google.golang.org/grpc" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/handlers" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/scheduling" - "k8s.io/apimachinery/pkg/runtime" - "k8s.io/apimachinery/pkg/types" - "k8s.io/client-go/rest" - klog "k8s.io/klog/v2" - ctrl "sigs.k8s.io/controller-runtime" -) - -// ExtProcServerRunner provides methods to manage an external process server. -type ExtProcServerRunner struct { - GrpcPort int - TargetEndpointKey string - PoolName string - PoolNamespace string - RefreshPodsInterval time.Duration - RefreshMetricsInterval time.Duration - Scheme *runtime.Scheme - Config *rest.Config - Datastore *backend.K8sDatastore - Manager ctrl.Manager -} - -// Default values for CLI flags in main -const ( - DefaultGrpcPort = 9002 // default for --grpcPort - DefaultTargetEndpointKey = "x-gateway-destination-endpoint" // default for --targetEndpointKey - DefaultPoolName = "" // required but no default - DefaultPoolNamespace = "default" // default for --poolNamespace - DefaultRefreshPodsInterval = 10 * time.Second // default for --refreshPodsInterval - DefaultRefreshMetricsInterval = 50 * time.Millisecond // default for --refreshMetricsInterval -) - -func NewDefaultExtProcServerRunner() *ExtProcServerRunner { - return &ExtProcServerRunner{ - GrpcPort: DefaultGrpcPort, - TargetEndpointKey: DefaultTargetEndpointKey, - PoolName: DefaultPoolName, - PoolNamespace: DefaultPoolNamespace, - RefreshPodsInterval: DefaultRefreshPodsInterval, - RefreshMetricsInterval: DefaultRefreshMetricsInterval, - // Scheme, Config, and Datastore can be assigned later. - } -} - -// Setup creates the reconcilers for pools and models and starts the manager. 
-func (r *ExtProcServerRunner) Setup() { - // Create a new manager to manage controllers - mgr, err := ctrl.NewManager(r.Config, ctrl.Options{Scheme: r.Scheme}) - if err != nil { - klog.Fatalf("Failed to create controller manager: %v", err) - } - r.Manager = mgr - - // Create the controllers and register them with the manager - if err := (&backend.InferencePoolReconciler{ - Datastore: r.Datastore, - Scheme: mgr.GetScheme(), - Client: mgr.GetClient(), - PoolNamespacedName: types.NamespacedName{ - Name: r.PoolName, - Namespace: r.PoolNamespace, - }, - Record: mgr.GetEventRecorderFor("InferencePool"), - }).SetupWithManager(mgr); err != nil { - klog.Fatalf("Failed setting up InferencePoolReconciler: %v", err) - } - - if err := (&backend.InferenceModelReconciler{ - Datastore: r.Datastore, - Scheme: mgr.GetScheme(), - Client: mgr.GetClient(), - PoolNamespacedName: types.NamespacedName{ - Name: r.PoolName, - Namespace: r.PoolNamespace, - }, - Record: mgr.GetEventRecorderFor("InferenceModel"), - }).SetupWithManager(mgr); err != nil { - klog.Fatalf("Failed setting up InferenceModelReconciler: %v", err) - } -} - -// Start starts the Envoy external processor server in a goroutine. -func (r *ExtProcServerRunner) Start( - podMetricsClient backend.PodMetricsClient, -) *grpc.Server { - svr := grpc.NewServer() - - go func() { - lis, err := net.Listen("tcp", fmt.Sprintf(":%d", r.GrpcPort)) - if err != nil { - klog.Fatalf("Ext-proc server failed to listen: %v", err) - } - klog.Infof("Ext-proc server listening on port: %d", r.GrpcPort) - - // Initialize backend provider - pp := backend.NewProvider(podMetricsClient, r.Datastore) - if err := pp.Init(r.RefreshPodsInterval, r.RefreshMetricsInterval); err != nil { - klog.Fatalf("Failed to initialize backend provider: %v", err) - } - - // Register ext_proc handlers - extProcPb.RegisterExternalProcessorServer( - svr, - handlers.NewServer(pp, scheduling.NewScheduler(pp), r.TargetEndpointKey, r.Datastore), - ) - - // Blocking and will return when shutdown is complete. - if err := svr.Serve(lis); err != nil && err != grpc.ErrServerStopped { - klog.Fatalf("Ext-proc server failed: %v", err) - } - klog.Info("Ext-proc server shutting down") - }() - return svr -} - -func (r *ExtProcServerRunner) StartManager() { - if r.Manager == nil { - klog.Fatalf("Runner has no manager setup to run: %v", r) - } - // Start the controller manager. Blocking and will return when shutdown is complete. 
- klog.Infof("Starting controller manager") - if err := r.Manager.Start(ctrl.SetupSignalHandler()); err != nil { - klog.Fatalf("Error starting controller manager: %v", err) - } - klog.Info("Controller manager shutting down") -} diff --git a/pkg/ext-proc/test/benchmark/benchmark.go b/pkg/ext-proc/test/benchmark/benchmark.go deleted file mode 100644 index 9ff61d8b..00000000 --- a/pkg/ext-proc/test/benchmark/benchmark.go +++ /dev/null @@ -1,110 +0,0 @@ -package main - -import ( - "flag" - "fmt" - "os" - "time" - - "github.com/bojand/ghz/printer" - "github.com/bojand/ghz/runner" - "github.com/jhump/protoreflect/desc" - "google.golang.org/protobuf/proto" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - runserver "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/server" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/test" - klog "k8s.io/klog/v2" -) - -var ( - svrAddr = flag.String("server_address", fmt.Sprintf("localhost:%d", runserver.DefaultGrpcPort), "Address of the ext proc server") - totalRequests = flag.Int("total_requests", 100000, "number of requests to be sent for load test") - // Flags when running a local ext proc server. - numFakePods = flag.Int("num_fake_pods", 200, "number of fake pods when running a local ext proc server") - numModelsPerPod = flag.Int("num_models_per_pod", 5, "number of fake models per pod when running a local ext proc server") - localServer = flag.Bool("local_server", true, "whether to start a local ext proc server") - refreshPodsInterval = flag.Duration("refreshPodsInterval", 10*time.Second, "interval to refresh pods") - refreshMetricsInterval = flag.Duration("refreshMetricsInterval", 50*time.Millisecond, "interval to refresh metrics") -) - -const ( - port = runserver.DefaultGrpcPort -) - -func main() { - klog.InitFlags(nil) - flag.Parse() - - if *localServer { - test.StartExtProc(port, *refreshPodsInterval, *refreshMetricsInterval, fakePods(), fakeModels()) - time.Sleep(time.Second) // wait until server is up - klog.Info("Server started") - } - - report, err := runner.Run( - "envoy.service.ext_proc.v3.ExternalProcessor.Process", - *svrAddr, - runner.WithInsecure(true), - runner.WithBinaryDataFunc(generateRequest), - runner.WithTotalRequests(uint(*totalRequests)), - ) - if err != nil { - klog.Fatal(err) - } - - printer := printer.ReportPrinter{ - Out: os.Stdout, - Report: report, - } - - printer.Print("summary") -} - -func generateRequest(mtd *desc.MethodDescriptor, callData *runner.CallData) []byte { - numModels := *numFakePods * (*numModelsPerPod) - req := test.GenerateRequest(modelName(int(callData.RequestNumber) % numModels)) - data, err := proto.Marshal(req) - if err != nil { - klog.Fatal("marshaling error: ", err) - } - return data -} - -func fakeModels() map[string]*v1alpha1.InferenceModel { - models := map[string]*v1alpha1.InferenceModel{} - for i := range *numFakePods { - for j := range *numModelsPerPod { - m := modelName(i*(*numModelsPerPod) + j) - models[m] = &v1alpha1.InferenceModel{Spec: v1alpha1.InferenceModelSpec{ModelName: m}} - } - } - - return models -} - -func fakePods() []*backend.PodMetrics { - pms := make([]*backend.PodMetrics, 0, *numFakePods) - for i := 0; i < *numFakePods; i++ { - metrics := fakeMetrics(i) - pod := test.FakePod(i) - pms = append(pms, &backend.PodMetrics{Pod: pod, Metrics: metrics}) - } - - return pms -} - -// fakeMetrics adds numModelsPerPod number 
of adapters to the pod metrics. -func fakeMetrics(podNumber int) backend.Metrics { - metrics := backend.Metrics{ - ActiveModels: make(map[string]int), - } - for i := 0; i < *numModelsPerPod; i++ { - metrics.ActiveModels[modelName(podNumber*(*numModelsPerPod)+i)] = 0 - } - return metrics -} - -func modelName(i int) string { - return fmt.Sprintf("adapter-%v", i) -} diff --git a/pkg/ext-proc/test/utils.go b/pkg/ext-proc/test/utils.go deleted file mode 100644 index a9dc4efa..00000000 --- a/pkg/ext-proc/test/utils.go +++ /dev/null @@ -1,83 +0,0 @@ -package test - -import ( - "encoding/json" - "fmt" - "net" - "time" - - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - "google.golang.org/grpc" - "google.golang.org/grpc/reflection" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/handlers" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/scheduling" - klog "k8s.io/klog/v2" -) - -func StartExtProc(port int, refreshPodsInterval, refreshMetricsInterval time.Duration, pods []*backend.PodMetrics, models map[string]*v1alpha1.InferenceModel) *grpc.Server { - ps := make(backend.PodSet) - pms := make(map[string]*backend.PodMetrics) - for _, pod := range pods { - ps[pod.Pod] = true - pms[pod.Pod.Name] = pod - } - pmc := &backend.FakePodMetricsClient{Res: pms} - pp := backend.NewProvider(pmc, backend.NewK8sDataStore()) - if err := pp.Init(refreshPodsInterval, refreshMetricsInterval); err != nil { - klog.Fatalf("failed to initialize: %v", err) - } - return startExtProc(port, pp, models) -} - -// startExtProc starts an extProc server with fake pods. 
-func startExtProc(port int, pp *backend.Provider, models map[string]*v1alpha1.InferenceModel) *grpc.Server { - lis, err := net.Listen("tcp", fmt.Sprintf(":%d", port)) - if err != nil { - klog.Fatalf("failed to listen: %v", err) - } - - s := grpc.NewServer() - - extProcPb.RegisterExternalProcessorServer(s, handlers.NewServer(pp, scheduling.NewScheduler(pp), "target-pod", &backend.FakeDataStore{Res: models})) - - klog.Infof("Starting gRPC server on port :%v", port) - reflection.Register(s) - go func() { - err := s.Serve(lis) - if err != nil { - klog.Fatalf("Ext-proc failed with the err: %v", err) - } - }() - return s -} - -func GenerateRequest(model string) *extProcPb.ProcessingRequest { - j := map[string]interface{}{ - "model": model, - "prompt": "hello", - "max_tokens": 100, - "temperature": 0, - } - - llmReq, err := json.Marshal(j) - if err != nil { - klog.Fatal(err) - } - req := &extProcPb.ProcessingRequest{ - Request: &extProcPb.ProcessingRequest_RequestBody{ - RequestBody: &extProcPb.HttpBody{Body: llmReq}, - }, - } - return req -} - -func FakePod(index int) backend.Pod { - address := fmt.Sprintf("address-%v", index) - pod := backend.Pod{ - Name: fmt.Sprintf("pod-%v", index), - Address: address, - } - return pod -} diff --git a/pkg/ext-proc/util/logging/logging_const.go b/pkg/ext-proc/util/logging/logging_const.go deleted file mode 100644 index a6131d18..00000000 --- a/pkg/ext-proc/util/logging/logging_const.go +++ /dev/null @@ -1,8 +0,0 @@ -package logging - -const ( - DEFAULT = 2 - VERBOSE = 3 - DEBUG = 4 - TRACE = 5 -) diff --git a/pkg/ext-proc/util/testing/lister.go b/pkg/ext-proc/util/testing/lister.go deleted file mode 100644 index 023f30a1..00000000 --- a/pkg/ext-proc/util/testing/lister.go +++ /dev/null @@ -1,19 +0,0 @@ -package testing - -import ( - v1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/labels" - listersv1 "k8s.io/client-go/listers/core/v1" -) - -type FakePodLister struct { - PodsList []*v1.Pod -} - -func (l *FakePodLister) List(selector labels.Selector) (ret []*v1.Pod, err error) { - return l.PodsList, nil -} - -func (l *FakePodLister) Pods(namespace string) listersv1.PodNamespaceLister { - panic("not implemented") -} diff --git a/pkg/ext-proc/util/testing/wrappers.go b/pkg/ext-proc/util/testing/wrappers.go deleted file mode 100644 index 7b593bbd..00000000 --- a/pkg/ext-proc/util/testing/wrappers.go +++ /dev/null @@ -1,38 +0,0 @@ -package testing - -import ( - corev1 "k8s.io/api/core/v1" - metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" -) - -// PodWrapper wraps a Pod inside. -type PodWrapper struct{ corev1.Pod } - -// MakePod creates a Pod wrapper. -func MakePod(name string) *PodWrapper { - return &PodWrapper{ - corev1.Pod{ - ObjectMeta: metav1.ObjectMeta{ - Name: name, - }, - }, - } -} - -// Obj returns the inner Pod. 
-func (p *PodWrapper) Obj() *corev1.Pod { - return &p.Pod -} - -func (p *PodWrapper) SetReady() *PodWrapper { - p.Status.Conditions = []corev1.PodCondition{{ - Type: corev1.PodReady, - Status: corev1.ConditionTrue, - }} - return p -} - -func (p *PodWrapper) SetPodIP(podIP string) *PodWrapper { - p.Status.PodIP = podIP - return p -} diff --git a/pkg/manifests/gateway/enable_patch_policy.yaml b/pkg/manifests/gateway/enable_patch_policy.yaml deleted file mode 100644 index 1e9818a1..00000000 --- a/pkg/manifests/gateway/enable_patch_policy.yaml +++ /dev/null @@ -1,27 +0,0 @@ -apiVersion: v1 -kind: ConfigMap -metadata: - name: envoy-gateway-config - namespace: envoy-gateway-system -data: -# This manifest's main purpose is to set `enabledEnvoyPatchPolicy` to `true`. -# This only needs to be ran once on your cluster (unless you'd like to change anything. i.e. enabling the admin dash) -# Any field under `admin` is optional, and only for enabling the admin endpoints, for debugging. -# Admin Interface: https://www.envoyproxy.io/docs/envoy/latest/operations/admin -# PatchPolicy docs: https://gateway.envoyproxy.io/docs/tasks/extensibility/envoy-patch-policy/#enable-envoypatchpolicy - envoy-gateway.yaml: | - apiVersion: gateway.envoyproxy.io/v1alpha1 - kind: EnvoyGateway - provider: - type: Kubernetes - gateway: - controllerName: gateway.envoyproxy.io/gatewayclass-controller - extensionApis: - enableEnvoyPatchPolicy: true - enableBackend: true -# admin: -# enablePprof: true -# address: -# host: 127.0.0.1 -# port: 19000 -# enabledDumpConfig: true diff --git a/pkg/manifests/gateway/extension_policy.yaml b/pkg/manifests/gateway/extension_policy.yaml deleted file mode 100644 index a8105d6d..00000000 --- a/pkg/manifests/gateway/extension_policy.yaml +++ /dev/null @@ -1,31 +0,0 @@ -apiVersion: gateway.envoyproxy.io/v1alpha1 -kind: EnvoyExtensionPolicy -metadata: - name: ext-proc-policy - namespace: default -spec: - extProc: - - backendRefs: - - group: "" - kind: Service - name: inference-gateway-ext-proc - port: 9002 - processingMode: - request: - body: Buffered - response: - # The timeouts are likely not needed here. We can experiment with removing/tuning them slowly. - # The connection limits are more important and will cause the opaque: ext_proc_gRPC_error_14 error in Envoy GW if not configured correctly. - messageTimeout: 1000s - backendSettings: - circuitBreaker: - maxConnections: 40000 - maxPendingRequests: 40000 - maxParallelRequests: 40000 - timeout: - tcp: - connectTimeout: 24h - targetRef: - group: gateway.networking.k8s.io - kind: HTTPRoute - name: llm-route diff --git a/pkg/manifests/gateway/gateway.yaml b/pkg/manifests/gateway/gateway.yaml deleted file mode 100644 index 32f5d484..00000000 --- a/pkg/manifests/gateway/gateway.yaml +++ /dev/null @@ -1,50 +0,0 @@ - ---- -apiVersion: gateway.networking.k8s.io/v1 -kind: Gateway -metadata: - name: inference-gateway -spec: - gatewayClassName: inference-gateway - listeners: - - name: http - protocol: HTTP - port: 8080 - - name: llm-gw - protocol: HTTP - port: 8081 ---- -apiVersion: gateway.networking.k8s.io/v1 -kind: GatewayClass -metadata: - name: inference-gateway -spec: - controllerName: gateway.envoyproxy.io/gatewayclass-controller ---- -apiVersion: gateway.envoyproxy.io/v1alpha1 -kind: Backend -metadata: - name: backend-dummy -spec: - endpoints: - - fqdn: - # Both these values are arbitrary and unused as the PatchPolicy redirects requests. 
- hostname: 'foo.bar.com' - port: 8080 ---- -apiVersion: gateway.networking.k8s.io/v1 -kind: HTTPRoute -metadata: - name: llm-route -spec: - parentRefs: - - name: inference-gateway - sectionName: llm-gw - rules: - - backendRefs: - - group: gateway.envoyproxy.io - kind: Backend - name: backend-dummy - timeouts: - request: "24h" - backendRequest: "24h" diff --git a/pkg/manifests/gateway/patch_policy.yaml b/pkg/manifests/gateway/patch_policy.yaml deleted file mode 100644 index 4a556b44..00000000 --- a/pkg/manifests/gateway/patch_policy.yaml +++ /dev/null @@ -1,43 +0,0 @@ -apiVersion: gateway.envoyproxy.io/v1alpha1 -kind: EnvoyPatchPolicy -metadata: - name: custom-response-patch-policy - namespace: default -spec: - targetRef: - group: gateway.networking.k8s.io - kind: Gateway - name: inference-gateway - type: JSONPatch - jsonPatches: - # Necessary to create a cluster of the type: ORIGINAL_DST to allow for - # direct pod scheduling. Which is heavily utilized in our scheduling. - # Specifically the field `original_dst_lb_config` allows us to enable - # `use_http_header` and `http_header_name`. - # Source: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto - - type: "type.googleapis.com/envoy.config.cluster.v3.Cluster" - name: original_destination_cluster - operation: - op: add - path: "" - value: - name: original_destination_cluster - type: ORIGINAL_DST - original_dst_lb_config: - use_http_header: true - http_header_name: "x-gateway-destination-endpoint" - connect_timeout: 1000s - lb_policy: CLUSTER_PROVIDED - dns_lookup_family: V4_ONLY - circuit_breakers: - thresholds: - - max_connections: 40000 - max_pending_requests: 40000 - max_requests: 40000 - - - type: "type.googleapis.com/envoy.config.route.v3.RouteConfiguration" - name: default/inference-gateway/llm-gw - operation: - op: replace - path: "/virtual_hosts/0/routes/0/route/cluster" - value: original_destination_cluster diff --git a/pkg/manifests/gateway/traffic_policy.yaml b/pkg/manifests/gateway/traffic_policy.yaml deleted file mode 100644 index e110f173..00000000 --- a/pkg/manifests/gateway/traffic_policy.yaml +++ /dev/null @@ -1,16 +0,0 @@ -apiVersion: gateway.envoyproxy.io/v1alpha1 -kind: BackendTrafficPolicy -metadata: - name: high-connection-route-policy -spec: - targetRefs: - - group: gateway.networking.k8s.io - kind: HTTPRoute - name: llm-route - circuitBreaker: - maxConnections: 40000 - maxPendingRequests: 40000 - maxParallelRequests: 40000 - timeout: - tcp: - connectTimeout: 24h \ No newline at end of file diff --git a/pkg/manifests/inferencemodel.yaml b/pkg/manifests/inferencemodel.yaml deleted file mode 100644 index 0085a89d..00000000 --- a/pkg/manifests/inferencemodel.yaml +++ /dev/null @@ -1,21 +0,0 @@ -apiVersion: inference.networking.x-k8s.io/v1alpha1 -kind: InferenceModel -metadata: - labels: - app.kubernetes.io/name: api - app.kubernetes.io/managed-by: kustomize - name: inferencemodel-sample -spec: - modelName: tweet-summary - criticality: Critical - poolRef: - # this is the default val: - group: inference.networking.x-k8s.io - # this is the default val: - kind: InferencePool - name: vllm-llama2-7b-pool - targetModels: - - name: tweet-summary-0 - weight: 50 - - name: tweet-summary-1 - weight: 50 diff --git a/pkg/manifests/vllm/deployment.yaml b/pkg/manifests/vllm/deployment.yaml deleted file mode 100644 index 1f5073e9..00000000 --- a/pkg/manifests/vllm/deployment.yaml +++ /dev/null @@ -1,122 +0,0 @@ -apiVersion: apps/v1 -kind: Deployment -metadata: - name: vllm-llama2-7b-pool -spec: - 
replicas: 3 - selector: - matchLabels: - app: vllm-llama2-7b-pool - template: - metadata: - labels: - app: vllm-llama2-7b-pool - spec: - containers: - - name: lora - image: "vllm/vllm-openai:latest" - imagePullPolicy: Always - command: ["python3", "-m", "vllm.entrypoints.openai.api_server"] - args: - - "--model" - - "meta-llama/Llama-2-7b-hf" - - "--tensor-parallel-size" - - "1" - - "--port" - - "8000" - - "--enable-lora" - - "--max-loras" - - "4" - - "--max-cpu-loras" - - "12" - - "--lora-modules" - - "sql-lora=/adapters/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/" - - "tweet-summary=/adapters/hub/models--vineetsharma--qlora-adapter-Llama-2-7b-hf-TweetSumm/snapshots/796337d8e866318c59e38f16416e3ecd11fe5403" - - 'sql-lora-0=/adapters/yard1/llama-2-7b-sql-lora-test_0' - - 'sql-lora-1=/adapters/yard1/llama-2-7b-sql-lora-test_1' - - 'sql-lora-2=/adapters/yard1/llama-2-7b-sql-lora-test_2' - - 'sql-lora-3=/adapters/yard1/llama-2-7b-sql-lora-test_3' - - 'sql-lora-4=/adapters/yard1/llama-2-7b-sql-lora-test_4' - - 'tweet-summary-0=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_0' - - 'tweet-summary-1=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_1' - - 'tweet-summary-2=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_2' - - 'tweet-summary-3=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_3' - - 'tweet-summary-4=/adapters/vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm_4' - env: - - name: PORT - value: "8000" - - name: HUGGING_FACE_HUB_TOKEN - valueFrom: - secretKeyRef: - name: hf-token - key: token - ports: - - containerPort: 8000 - name: http - protocol: TCP - livenessProbe: - failureThreshold: 240 - httpGet: - path: /health - port: http - scheme: HTTP - initialDelaySeconds: 5 - periodSeconds: 5 - successThreshold: 1 - timeoutSeconds: 1 - readinessProbe: - failureThreshold: 600 - httpGet: - path: /health - port: http - scheme: HTTP - initialDelaySeconds: 5 - periodSeconds: 5 - successThreshold: 1 - timeoutSeconds: 1 - resources: - limits: - nvidia.com/gpu: 1 - requests: - nvidia.com/gpu: 1 - volumeMounts: - - mountPath: /data - name: data - - mountPath: /dev/shm - name: shm - - name: adapters - mountPath: "/adapters" - initContainers: - - name: adapter-loader - image: ghcr.io/tomatillo-and-multiverse/adapter-puller:demo - command: ["python"] - args: - - ./pull_adapters.py - - --adapter - - yard1/llama-2-7b-sql-lora-test - - --adapter - - vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm - - --duplicate-count - - "5" - env: - - name: HF_TOKEN - valueFrom: - secretKeyRef: - name: hf-token - key: token - - name: HF_HOME - value: /adapters - volumeMounts: - - name: adapters - mountPath: "/adapters" - restartPolicy: Always - schedulerName: default-scheduler - terminationGracePeriodSeconds: 30 - volumes: - - name: data - emptyDir: {} - - name: shm - emptyDir: - medium: Memory - - name: adapters - emptyDir: {} diff --git a/pkg/scheduling.md b/pkg/scheduling.md deleted file mode 100644 index 99223ad2..00000000 --- a/pkg/scheduling.md +++ /dev/null @@ -1,5 +0,0 @@ -## Scheduling Package in Ext Proc -The scheduling package implements request scheduling algorithms for load balancing requests across backend pods in an inference gateway. The scheduler ensures efficient resource utilization while maintaining low latency and prioritizing critical requests. It applies a series of filters based on metrics and heuristics to select the best pod for a given request. 
- -# Flowchart -Scheduling Algorithm \ No newline at end of file diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md index baa604b6..1494d314 100644 --- a/site-src/api-types/inferencepool.md +++ b/site-src/api-types/inferencepool.md @@ -7,28 +7,56 @@ ## Background -The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin. +The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins. -It is expected for the InferencePool to: +An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) extension. This extension is responsible for tracking key metrics on each model server (e.g. KV-cache utilization, queue length of pending requests, and active LoRA adapters) and routing incoming inference requests to the optimal model server replica based on these metrics. An EPP can only be associated with a single InferencePool. The associated InferencePool is specified by the [poolName](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L54) and [poolNamespace](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L56) flags. An HTTPRoute can have multiple backendRefs that reference the same InferencePool and therefore route to the same EPP, or backendRefs that reference different InferencePools and therefore route to different EPPs. - - Enforce fair consumption of resources across competing workloads - - Efficiently route requests across shared compute (as displayed by the PoC) - -It is _not_ expected for the InferencePool to: +Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests. - - Enforce any common set of adapters or base models are available on the Pods - - Manage Deployments of Pods within the Pool - - Manage Pod lifecycle of pods within the pool +## How to Configure an InferencePool -Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests. +The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool). -`InferencePool` has some small overlap with `Service`, displayed here: +In summary, the InferencePoolSpec consists of 3 major parts: + +- The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods. +- The `targetPortNumber` field defines the port number that the Inference Gateway should route to on model server Pods that belong to this pool. 
+- The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions. + +### Example Configuration + +Here is an example InferencePool configuration: + +``` +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + name: vllm-llama3-8b-instruct +spec: + targetPortNumber: 8000 + selector: + app: vllm-llama3-8b-instruct + extensionRef: + name: vllm-llama3-8b-instruct-epp + port: 9002 + failureMode: FailClose +``` + +In this example: + +- An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace. +- It will select Pods that have the label `app: vllm-llama3-8b-instruct`. +- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If the EPP fails to pick an endpoint, or is unresponsive, the request will be dropped. +- Traffic routed to this InferencePool will be forwarded to port `8000` on the selected Pods. + +## Overlap with Service + +**InferencePool** has some small overlap with **Service**, displayed here: Comparing InferencePool with Service -The InferencePool is _not_ intended to be a mask of the Service object, simply exposing the absolute bare minimum required to allow the Platform Admin to focus less on networking, and more on Pool management. - -## Spec +The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details. -The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool). \ No newline at end of file +## Replacing an InferencePool +Please refer to the [Replacing an InferencePool](/guides/replacing-inference-pool) guide for details on use cases and how to replace an InferencePool. diff --git a/site-src/concepts/api-overview.md b/site-src/concepts/api-overview.md index 94e76251..9c5c0416 100644 --- a/site-src/concepts/api-overview.md +++ b/site-src/concepts/api-overview.md @@ -1,7 +1,7 @@ # API Overview -## Bakcground -The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models. +## Background +The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models. Overview of API integration @@ -9,8 +9,8 @@ The Gateway API Inference Extension project is an extension of the Kubernetes Ga ### InferencePool -InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. 
Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool.md) or go directly to the [InferencePool spec](/reference/spec/#inferencepool). +InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool) or go directly to the [InferencePool spec](/reference/spec/#inferencepool). ### InferenceModel -An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel.md) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel). +An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel). diff --git a/site-src/concepts/roles-and-personas.md b/site-src/concepts/roles-and-personas.md index b11f43eb..0746adbf 100644 --- a/site-src/concepts/roles-and-personas.md +++ b/site-src/concepts/roles-and-personas.md @@ -1,10 +1,10 @@ # Roles and Personas -Before diving into the details of the API, decriptions of the personas these APIs were designed for will help convey the thought process of the API design. +Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process of the API design. ## Inference Platform Admin -The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads. Including handling Ops for: +The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for: - Hardware - Model Server @@ -15,7 +15,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to ## Inference Workload Owner -An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). 
This includes: +An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM focused *currently*). This includes: - Defining criticality - Managing fine-tunes diff --git a/site-src/guides/adapter-rollout.md b/site-src/guides/adapter-rollout.md new file mode 100644 index 00000000..4e7a3667 --- /dev/null +++ b/site-src/guides/adapter-rollout.md @@ -0,0 +1,137 @@ +# Adapter Rollout + +The goal of this guide is to demonstrate how to roll out a new adapter version. + +## **Prerequisites** + +Follow the steps in the [main guide](index.md). + + +## **Safely roll out a v2 adapter** + +### Load the new adapter version to the model servers + +This guide leverages the LoRA syncer sidecar to dynamically manage adapters within a vLLM deployment, enabling users to add or remove them through a shared ConfigMap. + + +Modify the LoRA syncer ConfigMap to initiate loading of the new adapter version. + + +```bash +kubectl edit configmap vllm-llama3-8b-instruct-adapters +``` + +Change the ConfigMap to match the following (note the new entry under models): + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: vllm-llama3-8b-instruct-adapters +data: + configmap.yaml: | + vLLMLoRAConfig: + name: vllm-llama3-8b-instruct-adapters + port: 8000 + defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct + ensureExist: + models: + - id: food-review-1 + source: Kawon/llama3.1-food-finetune_v14_r8 + - id: food-review-2 + source: Kawon/llama3.1-food-finetune_v14_r8 +``` + +The new adapter version is applied to the model servers live, without requiring a restart. + + +### Direct traffic to the new adapter version + +Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for the `food-review` model will be sent to the new ***food-review-2*** adapter. + + +```bash +kubectl edit inferencemodel food-review +``` + +Change the targetModels list in the InferenceModel to match the following: + + +```yaml +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: food-review +spec: + modelName: food-review + criticality: Standard + poolRef: + name: vllm-llama3-8b-instruct + targetModels: + - name: food-review-1 + weight: 90 + - name: food-review-2 + weight: 10 +``` + +The above configuration means that, on average, one in every ten requests is sent to the new version. Try it out: + +1. Get the gateway IP: +```bash +IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80 +``` + +2. Send a few requests as follows: +```bash +curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ +"model": "food-review", +"prompt": "Write as if you were a critic: San Francisco", +"max_tokens": 100, +"temperature": 0 +}' +``` + +### Finish the rollout + + +Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter. 
+ +```yaml +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: food-review +spec: + modelName: food-review + criticality: Standard + poolRef: + name: vllm-llama3-8b-instruct + targetModels: + - name: food-review-2 + weight: 100 +``` + +Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: vllm-llama3-8b-instruct-adapters +data: + configmap.yaml: | + vLLMLoRAConfig: + name: vllm-llama3-8b-instruct-adapters + port: 8000 + defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct + ensureExist: + models: + - id: food-review-2 + source: Kawon/llama3.1-food-finetune_v14_r8 + ensureNotExist: + models: + - id: food-review-1 + source: Kawon/llama3.1-food-finetune_v14_r8 +``` + +With this, all requests should be served by the new adapter version. diff --git a/site-src/guides/implementers.md b/site-src/guides/implementers.md index 5d1c6267..7bfd536a 100644 --- a/site-src/guides/implementers.md +++ b/site-src/guides/implementers.md @@ -1,3 +1,113 @@ # Implementer's Guide -TODO \ No newline at end of file +This guide is intended for developers looking to implement support for the InferencePool custom resource within their Gateway API controller. It outlines how InferencePool fits into the existing resource model, discusses implementation options, explains how to interact with extensions, and provides guidance on testing. + +## InferencePool as a Gateway Backend +Before we dive into the implementation, let’s recap how an InferencePool works. + +Overview of API integration + +**InferencePool** represents a set of Inference-focused Pods and an extension that will be used to route to them. The InferencePool introduces a new type of backend within the Gateway API resource model. Instead of targeting Services, a Gateway can route traffic to an InferencePool. This InferencePool then becomes responsible for intelligent routing to the underlying model server pods based on the associated InferenceModel configurations. + +Here is an example of how to route traffic to an InferencePool using an HTTPRoute: +``` +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: llm-route +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.x-k8s.io + kind: InferencePool + name: base-model + matches: + - path: + type: PathPrefix + value: / +``` + +Note that `rules.backendRefs` describes which InferencePool should receive the forwarded traffic when the path matches the corresponding path prefix. This is very similar to how we configure a Gateway with an HTTPRoute that directs traffic to a Service (a way to select Pods and specify a port). The InferencePool provides an abstraction over a set of compute resources (model server pods) and allows the controller to implement specialized routing strategies for these inference workloads. + +## Building the Gateway controller +The general idea of implementing a Gateway controller supporting the InferencePool involves two major steps: + +1. Tracking the endpoints for InferencePool backends +2. 
Callout to an extension to make intelligent routing decisions + +### Endpoint Tracking +Consider a simple inference pool like this: +``` +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferencePool +metadata: + name: vllm-llama3-8b-instruct +spec: + targetPortNumber: 8000 + selector: + app: vllm-llama3-8b-instruct + extensionRef: + name: vllm-llama3-8b-instruct-epp +``` + +There are two main options for how to treat the InferencePool in your controller. + +**Option 1: Shadow Service Creation** + +If your Gateway controller already handles Service as a backend, you can choose to create a headless Service that mirrors the endpoints defined by the InferencePool, like this: + +``` +apiVersion: v1 +kind: Service +metadata: + name: vllm-llama3-8b-instruct-shadow-service +spec: + ports: + - port: 54321 + protocol: TCP + targetPort: 8000 + selector: + app: vllm-llama3-8b-instruct + type: ClusterIP + clusterIP: None +``` + +The gateway controller would then treat this shadow service just like any other backend service it routes traffic to. + +This approach likely allows you to leverage existing service discovery, healthcheck infrastructure, and load balancing mechanisms that your controller already supports. However, it does come with the overhead of managing additional Service objects, and may therefore increase the latency of Gateway reconciliation. + +**Option 2: Tracking InferencePool Endpoints Separately** + +You can also choose to directly select and monitor the endpoints belonging to the InferencePool. For the simple InferencePool example above, the controller would use the label `app: vllm-llama3-8b-instruct` to discover the pods matching the criteria, and get their endpoints (i.e. IP and port number). It would then need to monitor these pods for health and availability. + +With this approach, you can tailor the endpoint tracking and routing logic specifically to the characteristics and requirements of your InferencePool. + +### Callout Extension + +The [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp), or EPP, is a core component of the inference extension. The primary interaction for routing requests is defined between the proxy (e.g., Envoy) and the EPP using the Envoy [external processing service protocol](https://www.envoyproxy.io/docs/envoy/latest/api-v3/service/ext_proc/v3/external_processor.proto). See the [Endpoint Picker Protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/004-endpoint-picker-protocol) for more information. + +#### How to Callout to EPP + +For each HTTP request, the proxy CAN communicate the subset of endpoints the EPP MUST pick from by setting the `x-gateway-destination-endpoint-subset` key in the filter metadata field of the ext-proc request. If this key is set, the EPP must select from this endpoint list. If the list is empty or no endpoints are eligible, it should return a 503 error. If the key isn't set, the EPP selects from the endpoints defined by the InferencePool selector. + +#### Response from the extension + +The EPP communicates the chosen endpoint to the proxy via the `x-gateway-destination-endpoint` HTTP header and the `dynamic_metadata` field of the ext-proc response. Failure to communicate the endpoint using both methods results in a 503 error if no endpoints are ready, or a 429 error if the request should be dropped. The header and metadata values must match. In addition to the chosen endpoint, a single fallback endpoint CAN be set using the key `x-gateway-destination-endpoint-fallback` in the same metadata namespace as the one used for `x-gateway-destination-endpoint`. 
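To make this exchange concrete, the sketch below shows roughly what an EPP response can carry, rendered as YAML purely for readability. This is an illustrative sketch only: the field layout follows the Envoy ext-proc `ProcessingResponse` shape, the `envoy.lb` metadata namespace and the `10.0.0.7:8000` endpoint are assumed example values, and the authoritative field definitions live in the Endpoint Picker Protocol linked above.

```yaml
# Hypothetical YAML rendering of an EPP ext-proc response (not a k8s manifest).
# The picked endpoint is communicated twice, and both values must match:
#   1. as a response header mutation (x-gateway-destination-endpoint), and
#   2. as dynamic metadata (namespace/key shown here are assumed defaults).
response:
  request_headers:
    response:
      header_mutation:
        set_headers:
          - header:
              key: x-gateway-destination-endpoint
              value: 10.0.0.7:8000          # example Pod IP:port from the pool
dynamic_metadata:
  envoy.lb:                                 # assumed metadata namespace
    x-gateway-destination-endpoint: 10.0.0.7:8000
```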
+ +## Testing Tips + +Here are some tips for testing your controller end-to-end: + +- **Focus on Key Scenarios**: Cover common scenarios such as creating, updating, and deleting InferencePool resources, as well as different routing rules that target InferencePool backends. +- **Verify Routing Behaviors**: Design more complex routing scenarios and verify that requests are correctly routed to the appropriate model server pods within the InferencePool based on the InferenceModel configuration. +- **Test Error Handling**: Verify that the controller correctly handles scenarios like unsupported model names or resource constraints (if criticality-based shedding is implemented). Test with state transitions (such as a constant request load while the Pods behind the EPP or the InferencePool are being replaced) to ensure that the system is resilient to failures and can automatically recover by redirecting traffic to healthy Pods. +- **Using Reference EPP Implementation + Echoserver**: You can use the [reference EPP implementation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) for testing your controller end-to-end. Instead of a full-fledged model server, a simple mock server (like the [echoserver](https://github.com/kubernetes-sigs/ingress-controller-conformance/tree/master/images/echoserver)) can be very useful for verifying routing to ensure the correct pod received the request. +- **Performance Test**: Run end-to-end [benchmarks](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/) to make sure that your inference gateway achieves your desired latency targets. + +### Conformance Tests + +A set of conformance tests will be developed soon to help verify that a controller is working as expected. This guide will be updated once we have more information. Stay tuned! diff --git a/site-src/guides/index.md b/site-src/guides/index.md index 92f6412a..bcd1068d 100644 --- a/site-src/guides/index.md +++ b/site-src/guides/index.md @@ -1,3 +1,335 @@ # Getting started with Gateway API Inference Extension -TODO \ No newline at end of file +??? example "Experimental" + + This project is still in an alpha state and breaking changes may occur in the future. + +This quickstart guide is intended for engineers familiar with k8s and model servers (vLLM in this instance). The goal of this guide is to get an Inference Gateway up and running! + +## **Prerequisites** + +- A cluster with: + - Support for services of type `LoadBalancer`. For kind clusters, follow [this guide](https://kind.sigs.k8s.io/docs/user/loadbalancer) + to get services of type LoadBalancer working. + - Support for [sidecar containers](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) (enabled by default since Kubernetes v1.29) + to run the model server deployment. + +## **Steps** + +### Deploy Sample Model Server + + Two options are supported for running the model server: + + 1. GPU-based model server. + Requirements: a Hugging Face access token that grants access to the model [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). + + 1. CPU-based model server (not using GPUs). + The sample uses the model [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct). 
+ + Choose one of these options and follow the steps below. Please do not deploy both, as the deployments have the same name and will overwrite each other. + +=== "GPU-Based Model Server" + + For this setup, you will need 3 GPUs to run the sample model server. Adjust the number of replicas in `./config/manifests/vllm/gpu-deployment.yaml` as needed. + Create a Hugging Face secret to download the model [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct). Ensure that the token grants access to this model. + + Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway. + ```bash + kubectl create secret generic hf-token --from-literal=token=$HF_TOKEN # Your Hugging Face Token with access to the set of Llama models + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml + ``` + +=== "CPU-Based Model Server" + + This setup uses the official `vllm-cpu` image, which, according to the documentation, can run vLLM on the x86 CPU platform. + For this setup, we use approximately 9.5GB of memory and 12 CPUs for each replica. + + While it is possible to deploy the model server with fewer resources, this is not recommended. For example, in our tests, loading the model using 8GB of memory and 1 CPU was possible but took almost 3.5 minutes and inference requests took an unreasonably long time. In general, there is a tradeoff between the memory and CPU we allocate to our pods and the performance. The more memory and CPU we allocate, the better the performance we can get. + + After running multiple configurations of these values we decided in this sample to use 9.5GB of memory and 12 CPUs for each replica, which gives reasonable response times. You can increase those numbers and may get even better response times. For modifying the allocated resources, adjust the numbers in [cpu-deployment.yaml](https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml) as needed. + + Deploy a sample vLLM deployment with the proper protocol to work with the LLM Instance Gateway. + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml + ``` + +### Install the Inference Extension CRDs + +=== "Latest Release" + + ```bash + VERSION=v0.3.0 + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/$VERSION/manifests.yaml + ``` + +=== "Dev Version" + + ```bash + kubectl apply -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd + ``` + +### Deploy InferenceModel + + Deploy the sample InferenceModel which is configured to forward traffic to the `food-review-1` [LoRA adapter](https://docs.vllm.ai/en/latest/features/lora.html) of the sample model server. + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml + ``` + +### Deploy the InferencePool and Extension + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml + ``` + +### Deploy Inference Gateway + + Choose one of the following options to deploy an Inference Gateway. + +=== "GKE" + + 1. Enable the Gateway API and configure proxy-only subnets when necessary. 
See [Deploy Gateways](https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways) + for detailed instructions. + + 1. Deploy Gateway and HealthCheckPolicy resources + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/gateway.yaml + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/healthcheck.yaml + ``` + + Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status: + ```bash + $ kubectl get gateway inference-gateway + NAME CLASS ADDRESS PROGRAMMED AGE + inference-gateway inference-gateway True 22s + ``` + + 3. Deploy the HTTPRoute + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/httproute.yaml + ``` + + 4. Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True`: + + ```bash + kubectl get httproute llm-route -o yaml + ``` + + 5. Given that the default connection timeout may be insufficient for most inference workloads, it is recommended to configure a timeout appropriate for your intended use case. + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/gcp-backend-policy.yaml + ``` + +=== "Istio" + + Please note that this feature is currently in an experimental phase and is not intended for production use. + The implementation and user experience are subject to changes as we continue to iterate on this project. + + 1. Requirements + + - Gateway API [CRDs](https://gateway-api.sigs.k8s.io/guides/#installing-gateway-api) installed. + + 2. Install Istio + + ``` + TAG=1.26-alpha.80c74f7f43482c226f4f4b10b4dda6261b67a71f + # on Linux + wget https://storage.googleapis.com/istio-build/dev/$TAG/istioctl-$TAG-linux-amd64.tar.gz + tar -xvf istioctl-$TAG-linux-amd64.tar.gz + # on macOS + wget https://storage.googleapis.com/istio-build/dev/$TAG/istioctl-$TAG-osx.tar.gz + tar -xvf istioctl-$TAG-osx.tar.gz + # on Windows + wget https://storage.googleapis.com/istio-build/dev/$TAG/istioctl-$TAG-win.zip + unzip istioctl-$TAG-win.zip + + ./istioctl install --set tag=$TAG --set hub=gcr.io/istio-testing + ``` + + 3. If you run the Endpoint Picker (EPP) with the `--secureServing` flag set to `true` (the default mode), it is currently using a self-signed certificate. As a security measure, Istio does not trust self-signed certificates by default. As a temporary workaround, you can apply the destination rule to bypass TLS verification for EPP. A more secure TLS implementation in EPP is being discussed in [Issue 582](https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/582). + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/destination-rule.yaml + ``` + + 4. Deploy Gateway + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/gateway.yaml + ``` + + 5. Label the gateway + + ```bash + kubectl label gateway llm-gateway istio.io/enable-inference-extproc=true + ``` + + Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status: + ```bash + $ kubectl get gateway inference-gateway + NAME CLASS ADDRESS PROGRAMMED AGE + inference-gateway inference-gateway True 22s + ``` + + 6. 
Deploy the HTTPRoute + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/httproute.yaml + ``` + + 7. Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True`: + + ```bash + kubectl get httproute llm-route -o yaml + ``` + +=== "Kgateway" + + [Kgateway](https://kgateway.dev/) recently added support for inference extension as a **technical preview**. This means do not + run Kgateway with inference extension in production environments. Refer to [Issue 10411](https://github.com/kgateway-dev/kgateway/issues/10411) + for the list of caveats, supported features, etc. + + 1. Requirements + + - [Helm](https://helm.sh/docs/intro/install/) installed. + - Gateway API [CRDs](https://gateway-api.sigs.k8s.io/guides/#installing-gateway-api) installed. + + 2. Set the Kgateway version and install the Kgateway CRDs. + + ```bash + KGTW_VERSION=v2.0.0 + helm upgrade -i --create-namespace --namespace kgateway-system --version $KGTW_VERSION kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds + ``` + + 3. Install Kgateway + + ```bash + helm upgrade -i --namespace kgateway-system --version $KGTW_VERSION kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway --set inferenceExtension.enabled=true + ``` + + 4. Deploy the Gateway + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml + ``` + + Confirm that the Gateway was assigned an IP address and reports a `Programmed=True` status: + ```bash + $ kubectl get gateway inference-gateway + NAME CLASS ADDRESS PROGRAMMED AGE + inference-gateway kgateway True 22s + ``` + + 5. Deploy the HTTPRoute + + ```bash + kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/httproute.yaml + ``` + + 6. Confirm that the HTTPRoute status conditions include `Accepted=True` and `ResolvedRefs=True`: + + ```bash + kubectl get httproute llm-route -o yaml + ``` + +### Try it out + + Wait until the gateway is ready. + +=== "GPU-Based Model Server" + + ```bash + IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}') + PORT=80 + + curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ + "model": "food-review", + "prompt": "Write as if you were a critic: San Francisco", + "max_tokens": 100, + "temperature": 0 + }' + ``` + +=== "CPU-Based Model Server" + + ```bash + IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}') + PORT=80 + + curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ + "model": "Qwen/Qwen2.5-1.5B-Instruct", + "prompt": "Write as if you were a critic: San Francisco", + "max_tokens": 100, + "temperature": 0 + }' + ``` + +### Cleanup + + The following instructions assume you would like to cleanup ALL resources that were created in this quickstart guide. + Please be careful not to delete resources you'd like to keep. + + 1. 
Uninstall the InferencePool, InferenceModel, and model server resources + + ```bash + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencepool-resources.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/inferencemodel.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/cpu-deployment.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --ignore-not-found + kubectl delete secret hf-token --ignore-not-found + ``` + + 1. Uninstall the Gateway API resources + + ```bash + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/gateway.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/healthcheck.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/gcp-backend-policy.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/gke/httproute.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/gateway.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/destination-rule.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/httproute.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/gateway.yaml --ignore-not-found + kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/kgateway/httproute.yaml --ignore-not-found + ``` + + 1. Uninstall the Gateway API Inference Extension CRDs + + ```bash + kubectl delete -k https://github.com/kubernetes-sigs/gateway-api-inference-extension/config/crd --ignore-not-found + ``` + + 1. Choose one of the following options to cleanup the Inference Gateway. + +=== "GKE" + + **TODO** + +=== "Istio" + + **TODO** + +=== "Kgateway" + + The following instructions assume you would like to cleanup ALL Kgateway resources that were created in this quickstart guide. + + 1. Uninstall Kgateway + + ```bash + helm uninstall kgateway -n kgateway-system + ``` + + 1. Uninstall the Kgateway CRDs. + + ```bash + helm uninstall kgateway-crds -n kgateway-system + ``` + + 1. Remove the Kgateway namespace. + + ```bash + kubectl delete ns kgateway-system + ``` diff --git a/site-src/guides/metrics.md b/site-src/guides/metrics.md new file mode 100644 index 00000000..ab3ba3fd --- /dev/null +++ b/site-src/guides/metrics.md @@ -0,0 +1,95 @@ +# Metrics + +This guide describes the current state of exposed metrics and how to scrape them. + +## Requirements + +To have response metrics, ensure the body mode is set to `Buffered` or `Streamed` (this should be the default behavior for all implementations). 
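How the body mode is set depends on your gateway implementation. As one hedged illustration, with Envoy Gateway the mode is part of the `EnvoyExtensionPolicy` that wires the EPP in as an ext-proc service; a minimal sketch follows, in which the resource name, Service name, port, and route name are assumptions for illustration:

```yaml
# Illustrative sketch for an Envoy Gateway based setup: buffer the request and
# response bodies so the extension can compute request/response metrics.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyExtensionPolicy
metadata:
  name: ext-proc-policy
  namespace: default
spec:
  extProc:
  - backendRefs:
    - group: ""
      kind: Service
      name: inference-gateway-ext-proc   # assumed EPP Service name
      port: 9002
    processingMode:
      request:
        body: Buffered
      response:
        body: Buffered                   # Buffered or Streamed enables response metrics
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-route                      # assumed route name
```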
+ +If you want to include usage metrics for vLLM model server streaming requests, send the request with `include_usage`: + +``` +curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ +"model": "food-review", +"prompt": "whats your fav movie?", +"max_tokens": 10, +"temperature": 0, +"stream": true, +"stream_options": {"include_usage": true} +}' +``` + +## Exposed metrics + 
+| **Metric name** | **Metric Type** | **Description** | **Labels** | **Status** |
+|:----------------|:----------------|:----------------|:-----------|:-----------|
+| inference_model_request_total | Counter | The counter of requests broken out for each model. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_request_error_total | Counter | The counter of request errors broken out for each model. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_request_duration_seconds | Distribution | Distribution of response latency. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| normalized_time_per_output_token_seconds | Distribution | Distribution of NTPOT (response latency per output token). | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_request_sizes | Distribution | Distribution of request size in bytes. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_response_sizes | Distribution | Distribution of response size in bytes. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_input_tokens | Distribution | Distribution of input token count. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_output_tokens | Distribution | Distribution of output token count. | `model_name`=<model-name> <br> `target_model_name`=<target-model-name> | ALPHA |
+| inference_model_running_requests | Gauge | Number of running requests for each model. | `model_name`=<model-name> | ALPHA |
+| inference_pool_average_kv_cache_utilization | Gauge | The average kv cache utilization for an inference server pool. | `name`=<inference-pool-name> | ALPHA |
+| inference_pool_average_queue_size | Gauge | The average number of requests pending in the model server queue. | `name`=<inference-pool-name> | ALPHA |
+| inference_pool_ready_pods | Gauge | The number of ready pods for an inference server pool. | `name`=<inference-pool-name> | ALPHA |
+| inference_extension_info | Gauge | General information about the current build. | `commit`=<hash-of-the-build> | ALPHA |
+ +## Scrape Metrics + +The metrics endpoint is exposed on port 9090 by default. To scrape metrics, the client needs a ClusterRole with the following rule: +`nonResourceURLs: "/metrics", verbs: get`. + +Here is one example, if the client needs to mount the secret to act as the service account: +``` +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: inference-gateway-metrics-reader +rules: +- nonResourceURLs: + - /metrics + verbs: + - get +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: inference-gateway-sa-metrics-reader + namespace: default +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: inference-gateway-sa-metrics-reader-role-binding + namespace: default +subjects: +- kind: ServiceAccount + name: inference-gateway-sa-metrics-reader + namespace: default +roleRef: + kind: ClusterRole + name: inference-gateway-metrics-reader + apiGroup: rbac.authorization.k8s.io +--- +apiVersion: v1 +kind: Secret +metadata: + name: inference-gateway-sa-metrics-reader-secret + namespace: default + annotations: + kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader +type: kubernetes.io/service-account-token +``` +Then you can curl port 9090 as follows: +``` +TOKEN=$(kubectl -n default get secret inference-gateway-sa-metrics-reader-secret -o jsonpath='{.data.token}' | base64 --decode) + +kubectl -n default port-forward inference-gateway-ext-proc-pod-name 9090 + +curl -H "Authorization: Bearer $TOKEN" localhost:9090/metrics +``` \ No newline at end of file diff --git a/site-src/guides/replacing-inference-pool.md b/site-src/guides/replacing-inference-pool.md new file mode 100644 index 00000000..21294570 --- /dev/null +++ b/site-src/guides/replacing-inference-pool.md @@ -0,0 +1,59 @@ +# Replacing an InferencePool + +## Background + +Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary. + +## Use Cases +Use cases for replacing an InferencePool include: + +- Upgrading or replacing your model server framework +- Upgrading or replacing your base model +- Transitioning to new hardware + +## How to replace an InferencePool + +To replace an InferencePool: + +1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose. +1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. 
The `backendRefs.weight` field controls the traffic percentage allocated to each pool. +1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions. +1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary. + +### Example + +You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and the new `llm-pool-v2`. + +1. Save the following sample manifest as `httproute.yaml`: + + ```yaml + apiVersion: gateway.networking.k8s.io/v1 + kind: HTTPRoute + metadata: + name: llm-route + spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - group: inference.networking.x-k8s.io + kind: InferencePool + name: llm-pool-v1 + weight: 90 + - group: inference.networking.x-k8s.io + kind: InferencePool + name: llm-pool-v2 + weight: 10 + ``` + +1. Apply the sample manifest to your cluster: + + ``` + kubectl apply -f httproute.yaml + ``` + + The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest. + +1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the new InferencePool rollout. diff --git a/site-src/images/favicon-64.png b/site-src/images/favicon-64.png new file mode 100644 index 00000000..f2bd3d64 Binary files /dev/null and b/site-src/images/favicon-64.png differ diff --git a/site-src/images/logo/logo-text-xl-dark.png b/site-src/images/logo/logo-text-xl-dark.png new file mode 100644 index 00000000..4d878e5c Binary files /dev/null and b/site-src/images/logo/logo-text-xl-dark.png differ diff --git a/site-src/images/request-flow.png b/site-src/images/request-flow.png index ee2bf226..a010038a 100644 Binary files a/site-src/images/request-flow.png and b/site-src/images/request-flow.png differ diff --git a/site-src/implementations.md b/site-src/implementations.md deleted file mode 100644 index e2238827..00000000 --- a/site-src/implementations.md +++ /dev/null @@ -1,56 +0,0 @@ -# Implementations - -This project has several implementations that are planned or in progress: - -* [Envoy Gateway][1] -* [Gloo k8sgateway][2] -* [Google Kubernetes Engine][3] - -[1]:#envoy-gateway -[2]:#gloo-k8sgateway -[3]:#google-kubernetes-engine - -## Envoy Gateway -[Envoy Gateway][eg-home] is an [Envoy][envoy-org] subproject for managing -Envoy-based application gateways. The supported APIs and fields of the Gateway -API are outlined [here][eg-supported]. Use the [quickstart][eg-quickstart] to -get Envoy Gateway running with Gateway API in a few simple steps. - -Progress towards supporting this project is tracked with a [GitHub -Issue](https://github.com/envoyproxy/gateway/issues/4423). - -[eg-home]:https://gateway.envoyproxy.io/ -[envoy-org]:https://github.com/envoyproxy -[eg-supported]:https://gateway.envoyproxy.io/docs/tasks/quickstart/ -[eg-quickstart]:https://gateway.envoyproxy.io/docs/tasks/quickstart - -## Gloo k8sgateway - -[Gloo k8sgateway](https://k8sgateway.io/) is a feature-rich, Kubernetes-native -ingress controller and next-generation API gateway. 
Gloo k8sgateway brings the -full power and community support of Gateway API to its existing control-plane -implementation. - -Progress towards supporting this project is tracked with a [GitHub -Issue](https://github.com/k8sgateway/k8sgateway/issues/10411). - -## Google Kubernetes Engine - -[Google Kubernetes Engine (GKE)][gke] is a managed Kubernetes platform offered -by Google Cloud. GKE's implementation of the Gateway API is through the [GKE -Gateway controller][gke-gateway] which provisions Google Cloud Load Balancers -for Pods in GKE clusters. - -The GKE Gateway controller supports weighted traffic splitting, mirroring, -advanced routing, multi-cluster load balancing and more. See the docs to deploy -[private or public Gateways][gke-gateway-deploy] and also [multi-cluster -Gateways][gke-multi-cluster-gateway]. - -Progress towards supporting this project is tracked with a [GitHub -Issue](https://github.com/GoogleCloudPlatform/gke-gateway-api/issues/20). - -[gke]:https://cloud.google.com/kubernetes-engine -[gke-gateway]:https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api -[gke-gateway-deploy]:https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways -[gke-multi-cluster-gateway]:https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-multi-cluster-gateways - diff --git a/site-src/implementations/gateways.md b/site-src/implementations/gateways.md new file mode 100644 index 00000000..950c0833 --- /dev/null +++ b/site-src/implementations/gateways.md @@ -0,0 +1,88 @@ +# Gateway Implementations + +This project has several implementations that are planned or in progress: + +* [Envoy AI Gateway][1] +* [Kgateway][2] +* [Google Kubernetes Engine][3] +* [Istio][4] +* [Alibaba Cloud Container Service for Kubernetes][5] + +[1]:#envoy-ai-gateway +[2]:#kgateway +[3]:#google-kubernetes-engine +[4]:#istio +[5]:#alibaba-cloud-container-service-for-kubernetes + +## Envoy AI Gateway + +[Envoy AI Gateway][aigw-home] is an open source project built on top of +[Envoy][envoy-org] and [Envoy Gateway][envoy-gateway] to handle request traffic +from application clients to GenAI services. The features and capabilities are outlined [here][aigw-capabilities]. Use the [quickstart][aigw-quickstart] to get Envoy AI Gateway running with Gateway API in a few simple steps. + +Progress towards supporting this project is tracked with a [GitHub +Issue](https://github.com/envoyproxy/ai-gateway/issues/423). + +[aigw-home]:https://aigateway.envoyproxy.io/ +[envoy-org]:https://github.com/envoyproxy +[envoy-gateway]: https://gateway.envoyproxy.io/ +[aigw-capabilities]:https://aigateway.envoyproxy.io/docs/capabilities/ +[aigw-quickstart]:https://aigateway.envoyproxy.io/docs/capabilities/gateway-api-inference-extension + +## Kgateway + +[Kgateway](https://kgateway.dev/) is a feature-rich, Kubernetes-native +ingress controller and next-generation API gateway. Kgateway brings the +full power and community support of Gateway API to its existing control-plane +implementation. + +Progress towards supporting this project is tracked with a [GitHub +Issue](https://github.com/kgateway-dev/kgateway/issues/10411). + +## Google Kubernetes Engine + +[Google Kubernetes Engine (GKE)][gke] is a managed Kubernetes platform offered +by Google Cloud. GKE's implementation of the Gateway API is through the [GKE +Gateway controller][gke-gateway] which provisions Google Cloud Load Balancers +for Pods in GKE clusters. 
+ +The GKE Gateway controller supports weighted traffic splitting, mirroring, +advanced routing, multi-cluster load balancing and more. See the docs to deploy +[private or public Gateways][gke-gateway-deploy] and also [multi-cluster +Gateways][gke-multi-cluster-gateway]. + +Progress towards supporting this project is tracked with a [GitHub +Issue](https://github.com/GoogleCloudPlatform/gke-gateway-api/issues/20). + +[gke]:https://cloud.google.com/kubernetes-engine +[gke-gateway]:https://cloud.google.com/kubernetes-engine/docs/concepts/gateway-api +[gke-gateway-deploy]:https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-gateways +[gke-multi-cluster-gateway]:https://cloud.google.com/kubernetes-engine/docs/how-to/deploying-multi-cluster-gateways + +## Istio + +[Istio](https://istio.io/) is an open source service mesh and gateway implementation. +It provides a fully compliant implementation of the Kubernetes Gateway API for cluster ingress traffic control. +For service mesh users, Istio also fully supports east-west (including [GAMMA](https://gateway-api.sigs.k8s.io/mesh/)) traffic management within the mesh. + +Gateway API Inference Extension support is being tracked by this [GitHub +Issue](https://github.com/istio/istio/issues/55768). + +## Alibaba Cloud Container Service for Kubernetes + +[Alibaba Cloud Container Service for Kubernetes (ACK)][ack] is a managed Kubernetes platform +offered by Alibaba Cloud. The implementation of the Gateway API in ACK is through the +[ACK Gateway with Inference Extension][ack-gie] component, which introduces model-aware, +GPU-efficient load balancing for AI workloads beyond basic HTTP routing. + +The ACK Gateway with Inference Extension implements the Gateway API Inference Extension +and provides optimized routing for serving generative AI workloads, +including weighted traffic splitting, mirroring, advanced routing, etc. +See the [usage docs][ack-gie-usage] for more details. + +Progress towards supporting Gateway API Inference Extension is being tracked +by [this Issue](https://github.com/AliyunContainerService/ack-gateway-api/issues/1). + +[ack]:https://www.alibabacloud.com/help/en/ack +[ack-gie]:https://www.alibabacloud.com/help/en/ack/product-overview/ack-gateway-with-inference-extension +[ack-gie-usage]:https://www.alibabacloud.com/help/en/ack/ack-managed-and-ack-dedicated/user-guide/intelligent-routing-and-traffic-management-with-ack-gateway-inference-extension \ No newline at end of file diff --git a/site-src/implementations/model-servers.md b/site-src/implementations/model-servers.md new file mode 100644 index 00000000..3d475aaa --- /dev/null +++ b/site-src/implementations/model-servers.md @@ -0,0 +1,38 @@ + + +# Supported Model Servers + +Any model server that conforms to the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol) is supported by the inference extension. 
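+ +As a quick conformance check (a sketch, not part of the official guides: it assumes a vLLM-based deployment serving on port 8000, and the pod name is a placeholder), you can port-forward one model server pod and confirm that the queue-length and KV-cache-utilization metrics scraped by the endpoint picker are exposed: + +```bash +# Placeholder pod name; substitute one from your own deployment. +kubectl port-forward pod/<model-server-pod> 8000:8000 & + +# vLLM reports these as vllm:num_requests_waiting and vllm:gpu_cache_usage_perc; +# other model servers expose their own equivalents per the protocol. +curl -s localhost:8000/metrics | grep -E 'num_requests_waiting|gpu_cache_usage_perc' +```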
+ +## Compatible Model Server Versions + +| Model Server | Version | Commit | Notes | +| -------------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | +| vLLM V0 | v0.6.4 and above | [commit 0ad216f](https://github.com/vllm-project/vllm/commit/0ad216f5750742115c686723bf38698372d483fd) | | +| vLLM V1 | v0.8.0 and above | [commit bc32bc7](https://github.com/vllm-project/vllm/commit/bc32bc73aad076849ac88565cff745b01b17d89c) | | +| Triton (TensorRT-LLM) | [25.03](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-03.html#rel-25-03) and above | [commit 15cb989](https://github.com/triton-inference-server/tensorrtllm_backend/commit/15cb989b00523d8e92dce5165b9b9846c047a70d) | The LoRA affinity feature is not available because the required LoRA metrics haven't been implemented in Triton yet. | + +## vLLM + +vLLM is configured as the default in the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp). No further configuration is required. + +## Triton with TensorRT-LLM Backend + +Triton-specific metric names need to be specified when starting the EPP. + +### Option 1: Use Helm + +Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the [`inferencepool` via helm](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool). See the [`inferencepool` helm guide](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/charts/inferencepool/README.md) for more details. + +### Option 2: Edit EPP deployment yaml + + Add the following to the `args` of the [EPP deployment](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/42eb5ff1c5af1275df43ac384df0ddf20da95134/config/manifests/inferencepool-resources.yaml#L32): + + ``` +- -totalQueuedRequestsMetric +- "nv_trt_llm_request_metrics{request_type=waiting}" +- -kvCacheUsagePercentageMetric +- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}" +- -loraInfoMetric +- "" # Set an empty metric to disable LoRA metric scraping, as LoRA metrics are not supported by Triton yet. +``` \ No newline at end of file diff --git a/site-src/index.md b/site-src/index.md index 04d1fadb..61bece27 100644 --- a/site-src/index.md +++ b/site-src/index.md @@ -91,7 +91,7 @@ This project is being driven by [WG-Serving](https://github.com/kubernetes/community/tree/master/wg-serving) [SIG-Network](https://github.com/kubernetes/community/tree/master/sig-network) to improve and standardize routing to inference workloads in Kubernetes. Check -out the [implementations reference](implementations.md) to see the latest +out the [implementations reference](implementations/gateways.md) to see the latest projects & products that support this project. 
If you are interested in contributing to or building an implementation using Gateway API then don’t hesitate to [get involved!](/contributing) diff --git a/site-src/performance/benchmark/example-bar-chart.png b/site-src/performance/benchmark/example-bar-chart.png new file mode 100644 index 00000000..ae48f7eb Binary files /dev/null and b/site-src/performance/benchmark/example-bar-chart.png differ diff --git a/site-src/performance/benchmark/index.md b/site-src/performance/benchmark/index.md new file mode 100644 index 00000000..39457bf6 --- /dev/null +++ b/site-src/performance/benchmark/index.md @@ -0,0 +1,97 @@ +# Benchmark + +This user guide shows how to run benchmarks against a vLLM deployment, using both the Gateway API +inference extension and a Kubernetes service as the load balancing strategy. The +benchmark uses the [Latency Profile Generator](https://github.com/AI-Hypercomputer/inference-benchmark) (LPG) +tool to generate load and collect results. + +## Prerequisites + +### Deploy the inference extension and sample model server + +Follow this user guide https://gateway-api-inference-extension.sigs.k8s.io/guides/ to deploy the +sample vLLM application and the inference extension. + +### [Optional] Scale the sample vLLM deployment + +You are more likely to see the benefits of the inference extension when there are a decent number of replicas across which to make optimal routing decisions. + +```bash +kubectl scale --replicas=8 -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml +``` + +### Expose the model server via a k8s service + +As the baseline, let's also expose the vLLM deployment as a k8s service: + +```bash +kubectl expose -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml --port=8081 --target-port=8000 --type=LoadBalancer +``` + +## Run benchmark + +The LPG benchmark tool works by sending traffic to the specified target IP and port, and collecting results. Follow the steps below to run a single benchmark. You can deploy multiple LPG instances if you want to run benchmarks in parallel against different targets. + +1. Check out the repo. + + ```bash + git clone https://github.com/kubernetes-sigs/gateway-api-inference-extension + cd gateway-api-inference-extension + ``` + +1. Get the target IP. Examples below show how to get the IP of a gateway or a LoadBalancer k8s service. + + ```bash + # Get gateway IP + GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}') + # Get LoadBalancer k8s service IP + SVC_IP=$(kubectl get service/vllm-llama2-7b -o jsonpath='{.status.loadBalancer.ingress[0].ip}') + + echo $GW_IP + echo $SVC_IP + ``` + +1. Then update the target IP in `./config/manifests/benchmark/benchmark.yaml`. Feel free to adjust other parameters such as `request_rates` as well. For a complete list of LPG configurations, please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark). + +1. Start the benchmark tool. `kubectl apply -f ./config/manifests/benchmark/benchmark.yaml` + +1. Wait for the benchmark to finish and download the results. Use the `benchmark_id` environment variable +to specify what this benchmark is for. For instance, `inference-extension` or `k8s-svc`. When the LPG tool finishes benchmarking, it will print a log line `LPG_FINISHED`; +the script below will watch for that log line and then start downloading results. 
+ + ```bash + benchmark_id='my-benchmark' ./tools/benchmark/download-benchmark-results.bash + ``` + 1. After the script finishes, you should see benchmark results under the `./tools/benchmark/output/default-run/my-benchmark/results/json` folder. Here is a [sample json file](./sample.json). + +### Tips + +* You can specify the `run_id="runX"` environment variable when running the `./download-benchmark-results.bash` script. +This is useful when you run benchmarks multiple times to get more statistically meaningful results and want to group the results accordingly. +* Update `request_rates` to values that best suit your benchmark environment. + +### Advanced Benchmark Configurations + +Please refer to the [LPG user guide](https://github.com/AI-Hypercomputer/inference-benchmark?tab=readme-ov-file#configuring-the-benchmark) for a detailed list of configuration knobs. + +## Analyze the results + +This guide shows how to run the Jupyter notebook using VS Code. + +1. Create a Python virtual environment. + + ```bash + python3 -m venv .venv + source .venv/bin/activate + ``` + +1. Install the dependencies. + + ```bash + pip install -r ./tools/benchmark/requirements.txt + ``` + +1. Open the notebook `./tools/benchmark/benchmark.ipynb`, and run each cell. At the end you should + see a bar chart like the one below, where **"ie"** represents the inference extension. This chart is generated using this benchmarking tool with 6 vLLM (v1) model servers (H100 80 GB), [llama2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) and the [ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json). + + ![alt text](example-bar-chart.png) \ No newline at end of file diff --git a/site-src/performance/benchmark/sample.json b/site-src/performance/benchmark/sample.json new file mode 100644 index 00000000..fc1fe5f1 --- /dev/null +++ b/site-src/performance/benchmark/sample.json @@ -0,0 +1 @@ +{"metrics": {"num_prompts_attempted": 59999, "num_prompts_succeeded": 59999, "request_rate": 200.0, "server_metrics": {}, "benchmark_time": 377.69680404663086, "throughput_rps": 158.85757929948576, "throughput": 35786.07723228514, "total_output_token": 13516287, "output_tokens_per_min": 2147164.6339371083, "total_input_tokens": 15092072, "input_tokens_per_min": 2397490.024533549, "total_tokens": 28608359, "tokens_per_min": 4544654.658470658, "avg_per_token_latency": 0.038136584066158385, "median_per_token_latency": 0.03260710797991071, "sd_per_token_latency": 0.039995399094383204, "min_per_token_latency": 0.00010268625128206123, "max_per_token_latency": 0.8718070238828659, "p90_per_token_latency": 0.07052694590421603, "p99_per_token_latency": 0.19175863699585777, "avg_latency": 13490.14784723948, "median_latency": 10904.660940170288, "sd_latency": 10759.461472867813, "min_latency": 53.10511589050293, "max_latency": 55610.99076271057, "p90_latency": 28706.796979904175, "p99_latency": 45658.41965198513, "avg_per_output_token_latency": 148.97623456610614, "median_per_output_token_latency": 60.334928053662296, "sd_per_output_token_latency": 232.28505133364948, "min_per_output_token_latency": 7.44791825612386, "max_per_output_token_latency": 3108.849883079529, "p90_per_output_token_latency": 393.8944477023501, "p99_per_output_token_latency": 1193.081065813697, "avg_input_len": 251.53872564542743, "median_input_len": 109.0, "sd_input_len": 281.6475735479433, "min_input_len": 4.0, "max_input_len": 1024.0, "p90_input_len": 714.0, "p99_input_len": 987.0, "avg_output_len": 
225.27520458674311, "median_output_len": 144.0, "sd_output_len": 234.48900674005114, "min_output_len": 3.0, "max_output_len": 1025.0, "p90_output_len": 564.0, "p99_output_len": 948.0, "ClientConnectorError": 0, "TimeoutError": 0, "ContentTypeError": 1, "ClientOSError": 0, "ServerDisconnectedError": 0, "unknown_error": 0}, "dimensions": {"date": "20250328-043623", "backend": "vllm", "model_id": "meta-llama/Llama-2-7b-hf", "tokenizer_id": "meta-llama/Llama-2-7b-hf"}, "config": {"model": "meta-llama/Llama-2-7b-hf", "num_models": 1, "model_server": "vllm", "start_time": {"seconds": 1743136583, "nanos": 238149000}}, "summary_stats": {"stats": [{"request_rate": 200.0, "request_latency": {"mean": 13490.14784723948, "median": 10904.660940170288, "sd": 10759.461472867813, "min": 53.10511589050293, "max": 55610.99076271057, "p90": 28706.796979904175, "p99": 45658.41965198513}, "throughput": {"mean": 35786.07723228514}, "input_length": {"mean": 251.53872564542743, "median": 109.0, "sd": 281.6475735479433, "min": 4.0, "max": 1024.0, "p90": 714.0, "p99": 987.0}, "output_length": {"mean": 225.27520458674311, "median": 144.0, "sd": 234.48900674005114, "min": 3.0, "max": 1025.0, "p90": 564.0, "p99": 948.0}, "tpot": {"mean": 148.97623456610614, "median": 60.334928053662296, "sd": 232.28505133364948, "min": 7.44791825612386, "max": 3108.849883079529, "p90": 393.8944477023501, "p99": 1193.081065813697}, "model_server_metrics": []}]}} \ No newline at end of file diff --git a/site-src/reference/spec.md b/site-src/reference/spec.md index e16c113c..d8e0c95b 100644 --- a/site-src/reference/spec.md +++ b/site-src/reference/spec.md @@ -1,12 +1,14 @@ # API Reference ## Packages -- [inference.networking.x-k8s.io/v1alpha1](#inferencenetworkingx-k8siov1alpha1) +- [inference.networking.x-k8s.io/v1alpha2](#inferencenetworkingx-k8siov1alpha2) -## inference.networking.x-k8s.io/v1alpha1 +## inference.networking.x-k8s.io/v1alpha2 + +Package v1alpha2 contains API Schema definitions for the +inference.networking.x-k8s.io API group. -Package v1alpha1 contains API Schema definitions for the gateway v1alpha1 API group ### Resource Types - [InferenceModel](#inferencemodel) @@ -18,26 +20,152 @@ Package v1alpha1 contains API Schema definitions for the gateway v1alpha1 API gr _Underlying type:_ _string_ -Defines how important it is to serve the model compared to other models. +Criticality defines how important it is to serve the model compared to other models. +Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default. +This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior. _Validation:_ -- Enum: [Critical Default Sheddable] +- Enum: [Critical Standard Sheddable] _Appears in:_ - [InferenceModelSpec](#inferencemodelspec) | Field | Description | | --- | --- | -| `Critical` | Most important. Requests to this band will be shed last.
| -| `Default` | More important than Sheddable, less important than Critical.
Requests in this band will be shed before critical traffic.
+kubebuilder:default=Default
| -| `Sheddable` | Least important. Requests to this band will be shed before all other bands.
| +| `Critical` | Critical defines the highest level of criticality. Requests to this band will be shed last.
| +| `Standard` | Standard defines the base criticality level and is more important than Sheddable but less
important than Critical. Requests in this band will be shed before critical traffic.
Most models are expected to fall within this band.
| +| `Sheddable` | Sheddable defines the lowest level of criticality. Requests to this band will be shed before
all other bands.
| + + +#### EndpointPickerConfig + + + +EndpointPickerConfig specifies the configuration needed by the proxy to discover and connect to the endpoint picker extension. +This type is intended to be a union of mutually exclusive configuration options that we may add in the future. + + + +_Appears in:_ +- [InferencePoolSpec](#inferencepoolspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `extensionRef` _[Extension](#extension)_ | Extension configures an endpoint picker as an extension service. | | Required: \{\}
| + + +#### Extension + + + +Extension specifies how to configure an extension that runs the endpoint picker. + + + +_Appears in:_ +- [EndpointPickerConfig](#endpointpickerconfig) +- [InferencePoolSpec](#inferencepoolspec) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `group` _[Group](#group)_ | Group is the group of the referent.
The default value is "", representing the Core API group. | | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is the Kubernetes resource kind of the referent. For example
"Service".
Defaults to "Service" when not specified.
ExternalName services can refer to CNAME DNS records that may live
outside of the cluster and as such are difficult to reason about in
terms of conformance. They also may not be safe to forward to (see
CVE-2021-25740 for more information). Implementations MUST NOT
support ExternalName Services. | Service | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `portNumber` _[PortNumber](#portnumber)_ | The port number on the service running the extension. When unspecified,
implementations SHOULD infer a default value of 9002 when the Kind is
Service. | | Maximum: 65535
Minimum: 1
| +| `failureMode` _[ExtensionFailureMode](#extensionfailuremode)_ | Configures how the gateway handles the case when the extension is not responsive.
Defaults to failClose. | FailClose | Enum: [FailOpen FailClose]
| + + +#### ExtensionConnection + + + +ExtensionConnection encapsulates options that configure the connection to the extension. + + + +_Appears in:_ +- [Extension](#extension) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `failureMode` _[ExtensionFailureMode](#extensionfailuremode)_ | Configures how the gateway handles the case when the extension is not responsive.<br />
Defaults to failClose. | FailClose | Enum: [FailOpen FailClose]
| + + +#### ExtensionFailureMode + +_Underlying type:_ _string_ + +ExtensionFailureMode defines the options for how the gateway handles the case when the extension is not +responsive. + +_Validation:_ +- Enum: [FailOpen FailClose] + +_Appears in:_ +- [Extension](#extension) +- [ExtensionConnection](#extensionconnection) + +| Field | Description | +| --- | --- | +| `FailOpen` | FailOpen specifies that the proxy should not drop the request; instead, it should forward the request to an endpoint of its picking.<br />
| +| `FailClose` | FailClose specifies that the proxy should drop the request.
| + + +#### ExtensionReference + + + +ExtensionReference is a reference to the extension deployment. + + + +_Appears in:_ +- [Extension](#extension) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `group` _[Group](#group)_ | Group is the group of the referent.
The default value is "", representing the Core API group. | | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is the Kubernetes resource kind of the referent. For example
"Service".
Defaults to "Service" when not specified.
ExternalName services can refer to CNAME DNS records that may live
outside of the cluster and as such are difficult to reason about in
terms of conformance. They also may not be safe to forward to (see
CVE-2021-25740 for more information). Implementations MUST NOT
support ExternalName Services. | Service | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `portNumber` _[PortNumber](#portnumber)_ | The port number on the service running the extension. When unspecified,
implementations SHOULD infer a default value of 9002 when the Kind is
Service. | | Maximum: 65535
Minimum: 1
| + + +#### Group + +_Underlying type:_ _string_ + +Group refers to a Kubernetes Group. It must either be an empty string or a +RFC 1123 subdomain. + +This validation is based off of the corresponding Kubernetes validation: +https://github.com/kubernetes/apimachinery/blob/02cfb53916346d085a6c6c7c66f882e3c6b0eca6/pkg/util/validation/validation.go#L208 + +Valid values include: + +* "" - empty string implies core Kubernetes API group +* "gateway.networking.k8s.io" +* "foo.example.com" + +Invalid values include: + +* "example.com/bar" - "/" is an invalid character + +_Validation:_ +- MaxLength: 253 +- Pattern: `^$|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$` + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) + #### InferenceModel -InferenceModel is the Schema for the InferenceModels API +InferenceModel is the Schema for the InferenceModels API. @@ -45,29 +173,31 @@ InferenceModel is the Schema for the InferenceModels API | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha1` | | | +| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha2` | | | | `kind` _string_ | `InferenceModel` | | | | `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | | `spec` _[InferenceModelSpec](#inferencemodelspec)_ | | | | | `status` _[InferenceModelStatus](#inferencemodelstatus)_ | | | | + + + + #### InferenceModelSpec -InferenceModelSpec represents a specific model use case. This resource is +InferenceModelSpec represents the desired state of a specific model use case. This resource is managed by the "Inference Workload Owner" persona. - -The Inference Workload Owner persona is: a team that trains, verifies, and +The Inference Workload Owner persona is someone that trains, verifies, and leverages a large language model from a model frontend, drives the lifecycle and rollout of new versions of those models, and defines the specific performance and latency goals for the model. These workloads are expected to operate within an InferencePool sharing compute capacity with other InferenceModels, defined by the Inference Platform Admin. - InferenceModel's modelName (not the ObjectMeta name) is unique for a given InferencePool, if the name is reused, an error will be shown on the status of a InferenceModel that attempted to reuse. The oldest InferenceModel, based on @@ -81,10 +211,10 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `modelName` _string_ | The name of the model as the users set in the "model" parameter in the requests.
The name should be unique among the workloads that reference the same backend pool.
This is the parameter that will be used to match the request with. In the future, we may
allow to match on other request parameters. The other approach to support matching on
on other request parameters is to use a different ModelName per HTTPFilter.
Names can be reserved without implementing an actual model in the pool.
This can be done by specifying a target model and setting the weight to zero,
an error will be returned specifying that no valid target model is found. | | MaxLength: 253
| -| `criticality` _[Criticality](#criticality)_ | Defines how important it is to serve the model compared to other models referencing the same pool. | Default | Enum: [Critical Default Sheddable]
| -| `targetModels` _[TargetModel](#targetmodel) array_ | Allow multiple versions of a model for traffic splitting.
If not specified, the target model name is defaulted to the modelName parameter.
modelName is often in reference to a LoRA adapter. | | MaxItems: 10
| -| `poolRef` _[PoolObjectReference](#poolobjectreference)_ | Reference to the inference pool, the pool must exist in the same namespace. | | Required: \{\}
| +| `modelName` _string_ | ModelName is the name of the model as it will be set in the "model" parameter for an incoming request.
ModelNames must be unique for a referencing InferencePool
(names can be reused for a different pool in the same cluster).
The modelName with the oldest creation timestamp is retained, and the incoming
InferenceModel's Ready status is set to false with a corresponding reason.<br />
In the rare case of a race condition, one Model will be selected randomly to be considered valid, and the other rejected.
Names can be reserved without an underlying model configured in the pool.
This can be done by specifying a target model and setting the weight to zero;<br />
an error will be returned specifying that no valid target model is found. | | MaxLength: 256
Required: \{\}
| +| `criticality` _[Criticality](#criticality)_ | Criticality defines how important it is to serve the model compared to other models referencing the same pool.
Criticality impacts how traffic is handled in resource-constrained situations. It handles this by<br />
queuing or rejecting requests of lower criticality. InferenceModels of an equivalent Criticality will
fairly share resources over throughput of tokens. In the future, the metric used to calculate fairness,
and the proportionality of fairness will be configurable.
Default values for this field will not be set, to allow for future additions of new fields that may 'one of' with this field.<br />
Any implementations that may consume this field may treat an unset value as the 'Standard' range. | | Enum: [Critical Standard Sheddable]
| +| `targetModels` _[TargetModel](#targetmodel) array_ | TargetModels allow multiple versions of a model for traffic splitting.
If not specified, the target model name is defaulted to the modelName parameter.
modelName is often in reference to a LoRA adapter. | | MaxItems: 10
| +| `poolRef` _[PoolObjectReference](#poolobjectreference)_ | PoolRef is a reference to the inference pool, the pool must exist in the same namespace. | | Required: \{\}
| #### InferenceModelStatus @@ -100,14 +230,14 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool. | | | +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferenceModel.
Known condition types are:
* "Accepted" | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Ready]] | MaxItems: 8
| #### InferencePool -InferencePool is the Schema for the Inferencepools API +InferencePool is the Schema for the InferencePools API. @@ -115,13 +245,17 @@ InferencePool is the Schema for the Inferencepools API | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha1` | | | +| `apiVersion` _string_ | `inference.networking.x-k8s.io/v1alpha2` | | | | `kind` _string_ | `InferencePool` | | | | `metadata` _[ObjectMeta](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectmeta-v1-meta)_ | Refer to Kubernetes API documentation for fields of `metadata`. | | | | `spec` _[InferencePoolSpec](#inferencepoolspec)_ | | | | | `status` _[InferencePoolStatus](#inferencepoolstatus)_ | | | | + + + + #### InferencePoolSpec @@ -135,8 +269,9 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector uses a map of label to watch model server pods
that should be included in the InferencePool. ModelServers should not
be with any other Service or InferencePool, that behavior is not supported
and will result in sub-optimal utilization.
In some cases, implementations may translate this to a Service selector, so this matches the simple
map used for Service selectors instead of the full Kubernetes LabelSelector type. | | Required: \{\}
| -| `targetPortNumber` _integer_ | TargetPortNumber is the port number that the model servers within the pool expect
to receive traffic from.
This maps to the TargetPort in: https://pkg.go.dev/k8s.io/api/core/v1#ServicePort | | Maximum: 65535
Minimum: 0
Required: \{\}
| +| `selector` _object (keys:[LabelKey](#labelkey), values:[LabelValue](#labelvalue))_ | Selector defines a map of labels to watch model server pods
that should be included in the InferencePool.
In some cases, implementations may translate this field to a Service selector, so this matches the simple
map used for Service selectors instead of the full Kubernetes LabelSelector type.
If specified, it will be applied to match the model server pods in the same namespace as the InferencePool.<br />
Cross-namespace selectors are not supported. | | Required: \{\}<br />
| +| `targetPortNumber` _integer_ | TargetPortNumber defines the port number to access the selected model servers.
The number must be in the range 1 to 65535. | | Maximum: 65535
Minimum: 1
Required: \{\}
| +| `extensionRef` _[Extension](#extension)_ | Extension configures an endpoint picker as an extension service. | | Required: \{\}
| #### InferencePoolStatus @@ -152,33 +287,56 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool. | | | +| `parent` _[PoolStatus](#poolstatus) array_ | Parents is a list of parent resources (usually Gateways) that are
associated with the InferencePool, and the status of the InferencePool with respect to<br />
each parent.
A maximum of 32 Gateways will be represented in this list. An empty list
means the InferencePool has not been attached to any Gateway. | | MaxItems: 32<br />
| + + +#### Kind + +_Underlying type:_ _string_ + +Kind refers to a Kubernetes Kind. + +Valid values include: + +* "Service" +* "HTTPRoute" + +Invalid values include: + +* "invalid/kind" - "/" is an invalid character + +_Validation:_ +- MaxLength: 63 +- MinLength: 1 +- Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$` + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) + #### LabelKey _Underlying type:_ _string_ -Originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731 +LabelKey was originally copied from: https://github.com/kubernetes-sigs/gateway-api/blob/99a3934c6bc1ce0874f3a4c5f20cafd8977ffcb4/apis/v1/shared_types.go#L694-L731 Duplicated as to not take an unexpected dependency on gw's API. - LabelKey is the key of a label. This is used for validation of maps. This matches the Kubernetes "qualified name" validation that is used for labels. - +Labels are case sensitive, so: my-label and My-Label are considered distinct. Valid values include: - * example * example.com * example.com/path * example.com/path.html - Invalid values include: - * example~ - "~" is an invalid character * example.com. - can not start or end with "." @@ -202,10 +360,8 @@ of maps. This matches the Kubernetes label validation rules: * unless empty, must begin and end with an alphanumeric character ([a-z0-9A-Z]), * could contain dashes (-), underscores (_), dots (.), and alphanumerics between. - Valid values include: - * MyValue * my.name * 123-my-value @@ -220,6 +376,25 @@ _Appears in:_ +#### ObjectName + +_Underlying type:_ _string_ + +ObjectName refers to the name of a Kubernetes object. +Object names can have a variety of forms, including RFC 1123 subdomains, +RFC 1123 labels, or RFC 1035 labels. + +_Validation:_ +- MaxLength: 253 +- MinLength: 1 + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) +- [PoolObjectReference](#poolobjectreference) + + + #### PoolObjectReference @@ -234,9 +409,42 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `group` _string_ | Group is the group of the referent. | inference.networking.x-k8s.io | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| -| `kind` _string_ | Kind is kind of the referent. For example "InferencePool". | InferencePool | MaxLength: 63
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| -| `name` _string_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| +| `group` _[Group](#group)_ | Group is the group of the referent. | inference.networking.x-k8s.io | MaxLength: 253
Pattern: `^$\|^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`
| +| `kind` _[Kind](#kind)_ | Kind is the kind of the referent. For example "InferencePool". | InferencePool | MaxLength: 63<br />
MinLength: 1
Pattern: `^[a-zA-Z]([-a-zA-Z0-9]*[a-zA-Z0-9])?$`
| +| `name` _[ObjectName](#objectname)_ | Name is the name of the referent. | | MaxLength: 253
MinLength: 1
Required: \{\}
| + +#### PoolStatus + + + +PoolStatus defines the observed state of InferencePool from a Gateway. + + + +_Appears in:_ +- [InferencePoolStatus](#inferencepoolstatus) + +| Field | Description | Default | Validation | +| --- | --- | --- | --- | +| `parentRef` _[ObjectReference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#objectreference-v1-core)_ | GatewayRef indicates the gateway that observed the state of the InferencePool. | | | +| `conditions` _[Condition](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#condition-v1-meta) array_ | Conditions track the state of the InferencePool.<br />
Known condition types are:
* "Accepted"
* "ResolvedRefs" | [map[lastTransitionTime:1970-01-01T00:00:00Z message:Waiting for controller reason:Pending status:Unknown type:Accepted]] | MaxItems: 8
| + + +#### PortNumber + +_Underlying type:_ _integer_ + +PortNumber defines a network port. + +_Validation:_ +- Maximum: 65535 +- Minimum: 1 + +_Appears in:_ +- [Extension](#extension) +- [ExtensionReference](#extensionreference) + #### TargetModel @@ -246,10 +454,10 @@ _Appears in:_ TargetModel represents a deployed model or a LoRA adapter. The Name field is expected to match the name of the LoRA adapter (or base model) as it is registered within the model server. Inference -Gateway assumes that the model exists on the model server and is the +Gateway assumes that the model exists on the model server and it's the responsibility of the user to validate a correct match. Should a model fail -to exist at request time, the error is processed by the Instance Gateway, -and then emitted on the appropriate InferenceModel object. +to exist at request time, the error is processed by the Inference Gateway +and emitted on the appropriate InferenceModel object. @@ -258,7 +466,7 @@ _Appears in:_ | Field | Description | Default | Validation | | --- | --- | --- | --- | -| `name` _string_ | The name of the adapter as expected by the ModelServer. | | MaxLength: 253
| -| `weight` _integer_ | Weight is used to determine the proportion of traffic that should be
sent to this target model when multiple versions of the model are specified. | 1 | Maximum: 1e+06
Minimum: 0
| +| `name` _string_ | Name is the name of the adapter or base model, as expected by the ModelServer. | | MaxLength: 253
Required: \{\}
| +| `weight` _integer_ | Weight is used to determine the proportion of traffic that should be
sent to this model when multiple target models are specified.
Weight defines the proportion of requests forwarded to the specified
model. This is computed as weight/(sum of all weights in this
TargetModels list). For non-zero values, there may be some epsilon from
the exact proportion defined here depending on the precision an
implementation supports. Weight is not a percentage and the sum of
weights does not need to equal 100.
If a weight is set for any targetModel, it must be set for all targetModels.
Conversely weights are optional, so long as ALL targetModels do not specify a weight. | | Maximum: 1e+06
Minimum: 1
| diff --git a/test/e2e/README.md b/test/e2e/epp/README.md similarity index 77% rename from test/e2e/README.md rename to test/e2e/epp/README.md index 584d8914..fcc974b8 100644 --- a/test/e2e/README.md +++ b/test/e2e/epp/README.md @@ -10,7 +10,7 @@ The end-to-end tests are designed to validate end-to-end Gateway API Inference E - [Go](https://golang.org/doc/install) installed on your machine. - [Make](https://www.gnu.org/software/make/manual/make.html) installed to run the end-to-end test target. -- A Hugging Face Hub token with access to the [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) model. +- A Hugging Face Hub token with access to the [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) model. ## Running the End-to-End Tests @@ -28,11 +28,18 @@ Follow these steps to run the end-to-end tests: export HF_TOKEN= ``` +1. **(Optional): Set the test namespace**: By default, the e2e test creates resources in the `inf-ext-e2e` namespace. + If you would like to change this namespace, set the following environment variable: + + ```sh + export E2E_NS= + ``` + 1. **Run the Tests**: Run the `test-e2e` target: ```sh make test-e2e ``` - The test suite prints details for each step. Note that the `vllm-llama2-7b-pool` model server deployment + The test suite prints details for each step. Note that the `vllm-llama3-8b-instruct-pool` model server deployment may take several minutes to report an `Available=True` status due to the time required for bootstraping. diff --git a/test/e2e/e2e_suite_test.go b/test/e2e/epp/e2e_suite_test.go similarity index 75% rename from test/e2e/e2e_suite_test.go rename to test/e2e/epp/e2e_suite_test.go index 019e858a..01ed639d 100644 --- a/test/e2e/e2e_suite_test.go +++ b/test/e2e/epp/e2e_suite_test.go @@ -14,7 +14,7 @@ See the License for the specific language governing permissions and limitations under the License. */ -package e2e +package epp import ( "context" @@ -26,12 +26,11 @@ import ( "github.com/onsi/ginkgo/v2" "github.com/onsi/gomega" - infextv1a1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - testutils "inference.networking.x-k8s.io/gateway-api-inference-extension/test/utils" appsv1 "k8s.io/api/apps/v1" corev1 "k8s.io/api/core/v1" rbacv1 "k8s.io/api/rbac/v1" apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1" + v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "k8s.io/apimachinery/pkg/runtime" "k8s.io/apimachinery/pkg/runtime/serializer" @@ -40,6 +39,8 @@ import ( clientgoscheme "k8s.io/client-go/kubernetes/scheme" "sigs.k8s.io/controller-runtime/pkg/client" "sigs.k8s.io/controller-runtime/pkg/client/config" + infextv1a2 "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + testutils "sigs.k8s.io/gateway-api-inference-extension/test/utils" ) const ( @@ -49,37 +50,38 @@ const ( defaultReadyTimeout = 3 * time.Minute // defaultModelReadyTimeout is the default timeout for the model server deployment to report a ready state. defaultModelReadyTimeout = 10 * time.Minute + // defaultCurlTimeout is the default timeout for the curl command to get a response. + defaultCurlTimeout = 30 * time.Second // defaultInterval is the default interval to check if a resource exists or ready conditions. defaultInterval = time.Millisecond * 250 // defaultCurlInterval is the default interval to run the test curl command. defaultCurlInterval = time.Second * 5 - // nsName is the name of the Namespace used for tests. 
- // TODO [danehans]: Must be "default" until https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/227 is fixed - nsName = "default" + // defaultNsName is the default name of the Namespace used for tests. It can be overridden using the E2E_NS environment variable. + defaultNsName = "inf-ext-e2e" // modelServerName is the name of the model server test resources. - modelServerName = "vllm-llama2-7b-pool" + modelServerName = "vllm-llama3-8b-instruct" // modelName is the test model name. - modelName = "tweet-summary" + modelName = "food-review" // envoyName is the name of the envoy proxy test resources. envoyName = "envoy" // envoyPort is the listener port number of the test envoy proxy. envoyPort = "8081" // inferExtName is the name of the inference extension test resources. - inferExtName = "inference-gateway-ext-proc" + inferExtName = "vllm-llama3-8b-instruct-epp" // clientManifest is the manifest for the client test resources. - clientManifest = "../testdata/client.yaml" - // modelServerManifest is the manifest for the model server test resources. - modelServerManifest = "../../pkg/manifests/vllm/deployment.yaml" + clientManifest = "../../testdata/client.yaml" // modelServerSecretManifest is the manifest for the model server secret resource. - modelServerSecretManifest = "../testdata/model-secret.yaml" + modelServerSecretManifest = "../../testdata/model-secret.yaml" // inferPoolManifest is the manifest for the inference pool CRD. - inferPoolManifest = "../../config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml" + inferPoolManifest = "../../../config/crd/bases/inference.networking.x-k8s.io_inferencepools.yaml" // inferModelManifest is the manifest for the inference model CRD. - inferModelManifest = "../../config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml" + inferModelManifest = "../../../config/crd/bases/inference.networking.x-k8s.io_inferencemodels.yaml" // inferExtManifest is the manifest for the inference extension test resources. - inferExtManifest = "../../pkg/manifests/ext_proc.yaml" + inferExtManifest = "../../testdata/inferencepool-e2e.yaml" // envoyManifest is the manifest for the envoy proxy test resources. - envoyManifest = "../testdata/envoy.yaml" + envoyManifest = "../../testdata/envoy.yaml" + // modelServerManifestFilepathEnvVar is the env var that holds the absolute path to the manifest for the model server test resource. 
+ modelServerManifestFilepathEnvVar = "MANIFEST_PATH" ) var ( @@ -89,6 +91,7 @@ var ( kubeCli *kubernetes.Clientset scheme = runtime.NewScheme() cfg = config.GetConfigOrDie() + nsName string ) func TestAPIs(t *testing.T) { @@ -99,6 +102,11 @@ func TestAPIs(t *testing.T) { } var _ = ginkgo.BeforeSuite(func() { + nsName = os.Getenv("E2E_NS") + if nsName == "" { + nsName = defaultNsName + } + ginkgo.By("Setting up the test suite") setupSuite() @@ -107,16 +115,24 @@ var _ = ginkgo.BeforeSuite(func() { }) func setupInfra() { + createNamespace(cli, nsName) + + modelServerManifestPath := readModelServerManifestPath() + modelServerManifestArray := getYamlsFromModelServerManifest(modelServerManifestPath) + if strings.Contains(modelServerManifestArray[0], "hf-token") { + createHfSecret(cli, modelServerSecretManifest) + } crds := map[string]string{ "inferencepools.inference.networking.x-k8s.io": inferPoolManifest, "inferencemodels.inference.networking.x-k8s.io": inferModelManifest, } + createCRDs(cli, crds) createInferExt(cli, inferExtManifest) createClient(cli, clientManifest) createEnvoy(cli, envoyManifest) // Run this step last, as it requires additional time for the model server to become ready. - createModelServer(cli, modelServerSecretManifest, modelServerManifest) + createModelServer(cli, modelServerManifestArray, modelServerManifestPath) } var _ = ginkgo.AfterSuite(func() { @@ -136,7 +152,7 @@ func setupSuite() { err = apiextv1.AddToScheme(scheme) gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred()) - err = infextv1a1.AddToScheme(scheme) + err = infextv1a2.Install(scheme) gomega.ExpectWithOffset(1, err).NotTo(gomega.HaveOccurred()) cli, err = client.New(cfg, client.Options{Scheme: scheme}) @@ -145,6 +161,7 @@ func setupSuite() { kubeCli, err = kubernetes.NewForConfig(cfg) gomega.Expect(err).NotTo(gomega.HaveOccurred()) + gomega.Expect(kubeCli).NotTo(gomega.BeNil()) } func cleanupResources() { @@ -169,10 +186,22 @@ var ( existsTimeout = getTimeout("EXISTS_TIMEOUT", defaultExistsTimeout) readyTimeout = getTimeout("READY_TIMEOUT", defaultReadyTimeout) modelReadyTimeout = getTimeout("MODEL_READY_TIMEOUT", defaultModelReadyTimeout) + curlTimeout = getTimeout("CURL_TIMEOUT", defaultCurlTimeout) interval = defaultInterval curlInterval = defaultCurlInterval ) +func createNamespace(k8sClient client.Client, ns string) { + ginkgo.By("Creating e2e namespace: " + ns) + obj := &corev1.Namespace{ + ObjectMeta: v1.ObjectMeta{ + Name: ns, + }, + } + err := k8sClient.Create(ctx, obj) + gomega.Expect(err).NotTo(gomega.HaveOccurred(), "Failed to create e2e test namespace") +} + // namespaceExists ensures that a specified namespace exists and is ready for use. func namespaceExists(k8sClient client.Client, ns string) { ginkgo.By("Ensuring namespace exists: " + ns) @@ -181,6 +210,21 @@ func namespaceExists(k8sClient client.Client, ns string) { }, existsTimeout, interval) } +// readModelServerManifestPath reads from env var the absolute filepath to model server deployment for testing. 
+func readModelServerManifestPath() string { + ginkgo.By(fmt.Sprintf("Ensuring %s environment variable is set", modelServerManifestFilepathEnvVar)) + modelServerManifestFilepath := os.Getenv(modelServerManifestFilepathEnvVar) + gomega.Expect(modelServerManifestFilepath).NotTo(gomega.BeEmpty(), modelServerManifestFilepathEnvVar+" is not set") + return modelServerManifestFilepath +} + +func getYamlsFromModelServerManifest(modelServerManifestPath string) []string { + ginkgo.By("Ensuring the model server manifest points to an existing file") + modelServerManifestArray := readYaml(modelServerManifestPath) + gomega.Expect(modelServerManifestArray).NotTo(gomega.BeEmpty()) + return modelServerManifestArray +} + // createCRDs creates the Inference Extension CRDs used for testing. func createCRDs(k8sClient client.Client, crds map[string]string) { for name, path := range crds { @@ -214,7 +258,22 @@ func createClient(k8sClient client.Client, filePath string) { } // createModelServer creates the model server resources used for testing from the given filePaths. -func createModelServer(k8sClient client.Client, secretPath, deployPath string) { +func createModelServer(k8sClient client.Client, modelServerManifestArray []string, deployPath string) { + ginkgo.By("Creating model server resources from manifest: " + deployPath) + createObjsFromYaml(k8sClient, modelServerManifestArray) + + // Wait for the deployment to exist. + deploy := &appsv1.Deployment{} + testutils.EventuallyExists(ctx, func() error { + return k8sClient.Get(ctx, types.NamespacedName{Namespace: nsName, Name: modelServerName}, deploy) + }, existsTimeout, interval) + + // Wait for the deployment to be available. + testutils.DeploymentAvailable(ctx, k8sClient, deploy, modelReadyTimeout, interval) +} + +// createHfSecret read HF_TOKEN from env var and creates a secret that contains the access token. +func createHfSecret(k8sClient client.Client, secretPath string) { ginkgo.By("Ensuring the HF_TOKEN environment variable is set") token := os.Getenv("HF_TOKEN") gomega.Expect(token).NotTo(gomega.BeEmpty(), "HF_TOKEN is not set") @@ -226,31 +285,26 @@ func createModelServer(k8sClient client.Client, secretPath, deployPath string) { outManifests = append(outManifests, strings.Replace(m, "$HF_TOKEN", token, 1)) } - ginkgo.By("Creating model server secret resource from manifest: " + deployPath) + ginkgo.By("Creating model server secret resource") createObjsFromYaml(k8sClient, outManifests) // Wait for the secret to exist before proceeding with test. testutils.EventuallyExists(ctx, func() error { return k8sClient.Get(ctx, types.NamespacedName{Namespace: nsName, Name: "hf-token"}, &corev1.Secret{}) }, existsTimeout, interval) - - ginkgo.By("Creating model server resources from manifest: " + deployPath) - applyYAMLFile(k8sClient, deployPath) - - // Wait for the deployment to exist. - deploy := &appsv1.Deployment{} - testutils.EventuallyExists(ctx, func() error { - return k8sClient.Get(ctx, types.NamespacedName{Namespace: nsName, Name: modelServerName}, deploy) - }, existsTimeout, interval) - - // Wait for the deployment to be available. - testutils.DeploymentAvailable(ctx, k8sClient, deploy, modelReadyTimeout, interval) } // createEnvoy creates the envoy proxy resources used for testing from the given filePath. 
func createEnvoy(k8sClient client.Client, filePath string) { + inManifests := readYaml(filePath) + ginkgo.By("Replacing placeholder namespace with E2E_NS environment variable") + outManifests := []string{} + for _, m := range inManifests { + outManifests = append(outManifests, strings.ReplaceAll(m, "$E2E_NS", nsName)) + } + ginkgo.By("Creating envoy proxy resources from manifest: " + filePath) - applyYAMLFile(k8sClient, filePath) + createObjsFromYaml(k8sClient, outManifests) // Wait for the configmap to exist before proceeding with test. cfgMap := &corev1.ConfigMap{} @@ -275,8 +329,15 @@ func createEnvoy(k8sClient client.Client, filePath string) { // createInferExt creates the inference extension resources used for testing from the given filePath. func createInferExt(k8sClient client.Client, filePath string) { + inManifests := readYaml(filePath) + ginkgo.By("Replacing placeholder namespace with E2E_NS environment variable") + outManifests := []string{} + for _, m := range inManifests { + outManifests = append(outManifests, strings.ReplaceAll(m, "$E2E_NS", nsName)) + } + ginkgo.By("Creating inference extension resources from manifest: " + filePath) - applyYAMLFile(k8sClient, filePath) + createObjsFromYaml(k8sClient, outManifests) // Wait for the clusterrole to exist. testutils.EventuallyExists(ctx, func() error { diff --git a/test/e2e/e2e_test.go b/test/e2e/epp/e2e_test.go similarity index 85% rename from test/e2e/e2e_test.go rename to test/e2e/epp/e2e_test.go index 8e5968fc..7240cebc 100644 --- a/test/e2e/e2e_test.go +++ b/test/e2e/epp/e2e_test.go @@ -14,20 +14,22 @@ See the License for the specific language governing permissions and limitations under the License. */ -package e2e +package epp import ( "fmt" + "strconv" "strings" + "time" "github.com/google/go-cmp/cmp" "github.com/google/go-cmp/cmp/cmpopts" "github.com/onsi/ginkgo/v2" "github.com/onsi/gomega" - infextv1a1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - testutils "inference.networking.x-k8s.io/gateway-api-inference-extension/test/utils" "k8s.io/apimachinery/pkg/types" "k8s.io/utils/ptr" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + testutils "sigs.k8s.io/gateway-api-inference-extension/test/utils" ) var _ = ginkgo.Describe("InferencePool", func() { @@ -49,15 +51,11 @@ var _ = ginkgo.Describe("InferencePool", func() { ginkgo.By("Ensuring the InferenceModel resource exists in the namespace") gomega.Eventually(func() error { - err := cli.Get(ctx, types.NamespacedName{Namespace: infModel.Namespace, Name: infModel.Name}, infModel) - if err != nil { - return err - } - return nil + return cli.Get(ctx, types.NamespacedName{Namespace: infModel.Namespace, Name: infModel.Name}, infModel) }, existsTimeout, interval).Should(gomega.Succeed()) ginkgo.By("Verifying connectivity through the inference extension") - curlCmd := getCurlCommand(envoyName, nsName, envoyPort, modelName) + curlCmd := getCurlCommand(envoyName, nsName, envoyPort, modelName, curlTimeout) // Ensure the expected responses include the inferencemodel target model names. var expected []string @@ -95,19 +93,19 @@ var _ = ginkgo.Describe("InferencePool", func() { }) // newInferenceModel creates an InferenceModel in the given namespace for testutils. 
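+// The two target models are weighted 50/50, so the connectivity assertions in
+// the test accept a response from either target.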
-func newInferenceModel(ns string) *infextv1a1.InferenceModel { - targets := []infextv1a1.TargetModel{ +func newInferenceModel(ns string) *v1alpha2.InferenceModel { + targets := []v1alpha2.TargetModel{ { - Name: modelName + "-0", + Name: modelName, Weight: ptr.To(int32(50)), }, { - Name: modelName + "-1", + Name: "cad-fabricator", Weight: ptr.To(int32(50)), }, } return testutils.MakeModelWrapper("inferencemodel-sample", ns). - SetCriticality(infextv1a1.Critical). + SetCriticality(v1alpha2.Critical). SetModelName(modelName). SetPoolRef(modelServerName). SetTargetModels(targets). @@ -116,10 +114,12 @@ func newInferenceModel(ns string) *infextv1a1.InferenceModel { // getCurlCommand returns the command, as a slice of strings, for curl'ing // the test model server at the given name, namespace, port, and model name. -func getCurlCommand(name, ns, port, model string) []string { +func getCurlCommand(name, ns, port, model string, timeout time.Duration) []string { return []string{ "curl", "-i", + "--max-time", + strconv.Itoa((int)(timeout.Seconds())), fmt.Sprintf("%s.%s.svc:%s/v1/completions", name, ns, port), "-H", "Content-Type: application/json", diff --git a/test/integration/bbr/hermetic_test.go b/test/integration/bbr/hermetic_test.go new file mode 100644 index 00000000..b99186db --- /dev/null +++ b/test/integration/bbr/hermetic_test.go @@ -0,0 +1,293 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package bbr contains integration tests for the body-based routing extension. 
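+// These tests start a hermetic ext-proc server and verify that the model name
+// parsed from the request body is surfaced in the X-Gateway-Model-Name header.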
+package bbr + +import ( + "context" + "fmt" + "testing" + "time" + + configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + "github.com/google/go-cmp/cmp" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials/insecure" + "google.golang.org/protobuf/testing/protocmp" + runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/bbr/server" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" + integrationutils "sigs.k8s.io/gateway-api-inference-extension/test/integration" +) + +var logger = logutil.NewTestLogger().V(logutil.VERBOSE) + +func TestBodyBasedRouting(t *testing.T) { + tests := []struct { + name string + req *extProcPb.ProcessingRequest + wantHeaders []*configPb.HeaderValueOption + wantErr bool + }{ + { + name: "success adding model parameter to header", + req: integrationutils.GenerateRequest(logger, "test", "llama"), + wantHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "X-Gateway-Model-Name", + RawValue: []byte("llama"), + }, + }, + }, + wantErr: false, + }, + { + name: "no model parameter", + req: integrationutils.GenerateRequest(logger, "test1", ""), + wantHeaders: []*configPb.HeaderValueOption{}, + wantErr: false, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + client, cleanup := setUpHermeticServer(false) + t.Cleanup(cleanup) + + want := &extProcPb.ProcessingResponse{} + if len(test.wantHeaders) > 0 { + want.Response = &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: test.wantHeaders, + }, + ClearRouteCache: true, + }, + }, + } + } else { + want.Response = &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{}, + } + } + + res, err := integrationutils.SendRequest(t, client, test.req) + if err != nil && !test.wantErr { + t.Errorf("Unexpected error, got: %v, want error: %v", err, test.wantErr) + } + if diff := cmp.Diff(want, res, protocmp.Transform()); diff != "" { + t.Errorf("Unexpected response, (-want +got): %v", diff) + } + }) + } +} + +func TestFullDuplexStreamed_BodyBasedRouting(t *testing.T) { + tests := []struct { + name string + reqs []*extProcPb.ProcessingRequest + wantResponses []*extProcPb.ProcessingResponse + wantErr bool + }{ + { + name: "success adding model parameter to header", + reqs: integrationutils.GenerateStreamedRequestSet(logger, "test", "foo"), + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "X-Gateway-Model-Name", + RawValue: []byte("foo"), + }, + }, + }}, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"foo\",\"prompt\":\"test\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "success adding model parameter to header with multiple body 
chunks", + reqs: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_RequestHeaders{ + RequestHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "hi", + Value: "mom", + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lo"), EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("ra-sheddable\",\"prompt\":\"test\",\"temperature\":0}"), EndOfStream: true}, + }, + }, + }, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "X-Gateway-Model-Name", + RawValue: []byte("sql-lora-sheddable"), + }, + }, + }}, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-sheddable\",\"prompt\":\"test\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "no model parameter", + reqs: integrationutils.GenerateStreamedRequestSet(logger, "test", ""), + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{}, + }, + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"prompt\":\"test\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + client, cleanup := setUpHermeticServer(true) + t.Cleanup(cleanup) + + responses, err := integrationutils.StreamedRequest(t, client, test.reqs, len(test.wantResponses)) + if err != nil && !test.wantErr { + t.Errorf("Unexpected error, got: %v, want error: %v", err, test.wantErr) + } + + if diff := cmp.Diff(test.wantResponses, responses, protocmp.Transform()); diff != "" { + t.Errorf("Unexpected response, (-want +got): %v", diff) + } + }) + } +} + +func setUpHermeticServer(streaming bool) (client extProcPb.ExternalProcessor_ProcessClient, cleanup func()) { + port := 9004 + + serverCtx, stopServer := context.WithCancel(context.Background()) + serverRunner := runserver.NewDefaultExtProcServerRunner(port, false) + serverRunner.SecureServing = false + serverRunner.Streaming = streaming + + go func() { + if err := serverRunner.AsRunnable(logger.WithName("ext-proc")).Start(serverCtx); err != nil { + logutil.Fatal(logger, err, "Failed to start ext-proc server") + } + }() + + address := fmt.Sprintf("localhost:%v", port) + // Create a grpc connection + conn, err := grpc.NewClient(address, grpc.WithTransportCredentials(insecure.NewCredentials())) + if err != nil { + 
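+		// grpc.NewClient does not dial eagerly, so an error here indicates an
+		// invalid target or option rather than an unreachable server.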
logutil.Fatal(logger, err, "Failed to connect", "address", address) + } + + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + client, err = extProcPb.NewExternalProcessorClient(conn).Process(ctx) + if err != nil { + logutil.Fatal(logger, err, "Failed to create client") + } + return client, func() { + cancel() + conn.Close() + stopServer() + + // wait a little until the goroutines actually exit + time.Sleep(5 * time.Second) + } +} diff --git a/test/integration/epp/hermetic_test.go b/test/integration/epp/hermetic_test.go new file mode 100644 index 00000000..c63fd017 --- /dev/null +++ b/test/integration/epp/hermetic_test.go @@ -0,0 +1,1509 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ + +// Package epp contains integration tests for the ext proc while faking the backend pods. +package epp + +import ( + "bufio" + "bytes" + "context" + "errors" + "fmt" + "io" + "net" + "net/http" + "os" + "path/filepath" + "strconv" + "strings" + "testing" + "time" + + configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" + extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" + envoyTypePb "github.com/envoyproxy/go-control-plane/envoy/type/v3" + "github.com/google/go-cmp/cmp" + "github.com/prometheus/client_golang/prometheus/promhttp" + "github.com/stretchr/testify/assert" + "google.golang.org/grpc" + "google.golang.org/grpc/credentials/insecure" + "google.golang.org/protobuf/testing/protocmp" + "google.golang.org/protobuf/types/known/structpb" + corev1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" + "k8s.io/apimachinery/pkg/fields" + "k8s.io/apimachinery/pkg/runtime" + "k8s.io/apimachinery/pkg/types" + utilruntime "k8s.io/apimachinery/pkg/util/runtime" + k8syaml "k8s.io/apimachinery/pkg/util/yaml" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + "k8s.io/component-base/metrics/legacyregistry" + metricsutils "k8s.io/component-base/metrics/testutil" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/cache" + "sigs.k8s.io/controller-runtime/pkg/client" + k8sclient "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/config" + "sigs.k8s.io/controller-runtime/pkg/envtest" + "sigs.k8s.io/controller-runtime/pkg/manager" + "sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend" + backendmetrics "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/backend/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/datastore" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/metrics" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling" + "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server" + runserver "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/server" + logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging" + epptestutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/testing" + integrationutils 
"sigs.k8s.io/gateway-api-inference-extension/test/integration" + "sigs.k8s.io/yaml" +) + +const ( + port = runserver.DefaultGrpcPort + metricsPort = 8889 +) + +var ( + serverRunner *runserver.ExtProcServerRunner + k8sClient k8sclient.Client + testEnv *envtest.Environment + scheme = runtime.NewScheme() + logger = logutil.NewTestLogger().V(logutil.VERBOSE) +) + +func TestMain(m *testing.M) { + cleanup := BeforeSuite() + code := m.Run() + cleanup() + os.Exit(code) +} + +func TestFullDuplexStreamed_KubeInferenceModelRequest(t *testing.T) { + tests := []struct { + name string + requests []*extProcPb.ProcessingRequest + pods map[backend.Pod]*backendmetrics.Metrics + wantResponses []*extProcPb.ProcessingResponse + wantMetrics map[string]string + wantErr bool + immediateResponse *extProcPb.ImmediateResponse + }{ + // Request flow tests + { + name: "select lower queue and kv cache, no active lora", + requests: integrationutils.GenerateStreamedRequestSet(logger, "test1", "my-model"), + // pod-1 will be picked because it has relatively low queue size and low KV cache. + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 3, + KVCacheUsagePercent: 0.2, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.1, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.2, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. + # TYPE inference_model_request_total counter + inference_model_request_total{model_name="my-model",target_model_name="my-model-12345"} 1 + `, + `inference_pool_ready_pods`: ` + # HELP inference_pool_ready_pods [ALPHA] The number of ready pods in the inference server pool. + # TYPE inference_pool_ready_pods gauge + inference_pool_ready_pods{name="vllm-llama3-8b-instruct-pool"} 3 + `, + }, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.2:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(76)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.2:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"my-model-12345\",\"prompt\":\"test1\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "select active lora, low queue", + requests: integrationutils.GenerateStreamedRequestSet(logger, "test2", "sql-lora"), + // pod-1 will be picked because it has relatively low queue size, with the requested + // model being active, and has low KV cache. 
+ pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.1, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg2": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. + # TYPE inference_model_request_total counter + inference_model_request_total{model_name="sql-lora",target_model_name="sql-lora-1fdg2"} 1 + `}, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.2:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(76)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.2:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-1fdg2\",\"prompt\":\"test2\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "select no lora despite active model, avoid excessive queue size", + requests: integrationutils.GenerateStreamedRequestSet(logger, "test3", "sql-lora"), + // pod-2 will be picked despite it NOT having the requested model being active + // as it's above the affinity for queue size. Also is critical, so we should + // still honor request despite all queues > 5 + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 200, + KVCacheUsagePercent: 0.1, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg2": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 6, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. 
+ # TYPE inference_model_request_total counter + inference_model_request_total{model_name="sql-lora",target_model_name="sql-lora-1fdg2"} 1 + `}, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.3:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(76)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.3:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-1fdg2\",\"prompt\":\"test3\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "noncritical and all models past threshold, shed request", + requests: integrationutils.GenerateStreamedRequestSet(logger, "test4", "sql-lora-sheddable"), + // no pods will be picked as all models are either above kv threshold, + // queue threshold, or both. + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 6, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantErr: false, + wantMetrics: map[string]string{}, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_ImmediateResponse{ + ImmediateResponse: &extProcPb.ImmediateResponse{ + Status: &envoyTypePb.HttpStatus{ + Code: envoyTypePb.StatusCode_TooManyRequests, + }, + }, + }, + }, + }, + }, + { + name: "noncritical, but one server has capacity, do not shed", + requests: integrationutils.GenerateStreamedRequestSet(logger, "test5", "sql-lora-sheddable"), + // pod 0 will be picked as all other models are above threshold + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. 
+ # TYPE inference_model_request_total counter + inference_model_request_total{model_name="sql-lora-sheddable",target_model_name="sql-lora-1fdg3"} 1 + `}, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.1:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(76)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.1:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-1fdg3\",\"prompt\":\"test5\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "body sent over multiple requests, noncritical, but one server has capacity, do not shed", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_RequestHeaders{ + RequestHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "hi", + Value: "mom", + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lo"), EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("ra-sheddable\",\"prompt\":\"test6\",\"temperature\":0}"), EndOfStream: true}, + }, + }, + }, + + // + // pod 0 will be picked as all other models are above threshold + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. 
+ # TYPE inference_model_request_total counter + inference_model_request_total{model_name="sql-lora-sheddable",target_model_name="sql-lora-1fdg3"} 1 + `}, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.1:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(76)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.1:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-1fdg3\",\"prompt\":\"test6\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "inferencemodel's modelName is not translated, passthrough", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_RequestHeaders{ + RequestHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "hi", + Value: "mom", + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("{\"max_tokens\":100,\"model\":\"direct-"), EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_RequestBody{ + RequestBody: &extProcPb.HttpBody{Body: []byte("model\",\"prompt\":\"test6\",\"temperature\":0}"), EndOfStream: true}, + }, + }, + }, + + // + // pod 0 will be picked as all other models are above threshold + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_model_request_total`: ` + # HELP inference_model_request_total [ALPHA] Counter of inference model requests broken out for each model and target model. 
+ # TYPE inference_model_request_total counter + inference_model_request_total{model_name="direct-model",target_model_name="direct-model"} 1 + `}, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.2:8000"), + }, + }, + { + Header: &configPb.HeaderValue{ + Key: "Content-Length", + RawValue: []byte(strconv.Itoa(74)), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.2:8000"), + }, + { + Response: &extProcPb.ProcessingResponse_RequestBody{ + RequestBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"direct-model\",\"prompt\":\"test6\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + // Response flow tests + { + name: "responsebody sent over multiple requests, content-type is json, buffer", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_ResponseHeaders{ + ResponseHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "content-type", + Value: "application/json", + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lo"), EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{Body: []byte("ra-sheddable\",\"prompt\":\"test6\",\"temperature\":0}"), EndOfStream: true}, + }, + }, + }, + + // + // pod 0 will be picked as all other models are above threshold + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_ResponseHeaders{ + ResponseHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-went-into-resp-headers", + RawValue: []byte("true"), + }, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: 
[]byte("{\"max_tokens\":100,\"model\":\"sql-lora-sheddable\",\"prompt\":\"test6\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "responsebody sent over a single request, but empty body with EndOfStream in the second request(this is how envoy operates); content-type is json, buffer", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_ResponseHeaders{ + ResponseHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "content-type", + Value: "application/json", + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-sheddable\",\"prompt\":\"test6\",\"temperature\":0}"), EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{Body: []byte(""), EndOfStream: true}, + }, + }, + }, + + // + // pod 0 will be picked as all other models are above threshold + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(1): { + WaitingQueueSize: 0, + KVCacheUsagePercent: 0.85, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + fakePod(2): { + WaitingQueueSize: 10, + KVCacheUsagePercent: 0.9, + ActiveModels: map[string]int{ + "foo": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantErr: false, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_ResponseHeaders{ + ResponseHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-went-into-resp-headers", + RawValue: []byte("true"), + }, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-sheddable\",\"prompt\":\"test6\",\"temperature\":0}"), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + { + name: "responsebody sent over a single request, but empty body with EndOfStream in the second request(this is how envoy operates); content-type is json, buffer", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_ResponseHeaders{ + ResponseHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "content-type", + RawValue: []byte("text/event-stream"), + }, + { + Key: "status", + RawValue: []byte("200"), + }, + }, + }, + }, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"NEVER","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false}, + }, + }, + { + Request: 
&extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"GONNA","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"GIVE","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"YOU","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"UP","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[],"usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}} + data: [DONE]`, + ), + EndOfStream: false}, + }, + }, + { + Request: &extProcPb.ProcessingRequest_ResponseBody{ + ResponseBody: &extProcPb.HttpBody{ + Body: []byte(""), + EndOfStream: true}, + }, + }, + }, + wantErr: false, + wantMetrics: map[string]string{`inference_model_input_tokens`: ` + # HELP inference_model_input_tokens [ALPHA] Inference model input token count distribution for requests in each model. 
+ # TYPE inference_model_input_tokens histogram + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="1"} 0 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="8"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="16"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="32"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="64"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="128"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="256"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="512"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="1024"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="2048"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="4096"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="8192"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="16384"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="32778"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="65536"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="131072"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="262144"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="524288"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="1.048576e+06"} 1 + inference_model_input_tokens_bucket{model_name="",target_model_name="",le="+Inf"} 1 + inference_model_input_tokens_sum{model_name="",target_model_name=""} 7 + inference_model_input_tokens_count{model_name="",target_model_name=""} 1 + `}, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_ResponseHeaders{ + ResponseHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: &configPb.HeaderValue{ + Key: "x-went-into-resp-headers", + RawValue: []byte("true"), + }, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"NEVER","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"GONNA","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, 
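+				// Each intermediate SSE chunk is expected to pass through unchanged;
+				// only the final empty body below carries EndOfStream.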
+ { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"GIVE","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"YOU","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[{"index":0,"text":"UP","logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":null}`), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(`data: {"id":"cmpl-0fee233f-7d56-404a-acd3-4dad775d03d9","object":"text_completion","created":1741379018,"model":"food-review-1","choices":[],"usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}} + data: [DONE]`, + ), + EndOfStream: false, + }, + }, + }, + }, + }, + }, + }, + { + Response: &extProcPb.ProcessingResponse_ResponseBody{ + ResponseBody: &extProcPb.BodyResponse{ + Response: &extProcPb.CommonResponse{ + BodyMutation: &extProcPb.BodyMutation{ + Mutation: &extProcPb.BodyMutation_StreamedResponse{ + StreamedResponse: &extProcPb.StreamedBodyResponse{ + Body: []byte(""), + EndOfStream: true, + }, + }, + }, + }, + }, + }, + }, + }, + }, + // Bodyless Request test + { + name: "simple GET Request", + requests: []*extProcPb.ProcessingRequest{ + { + Request: &extProcPb.ProcessingRequest_RequestHeaders{ + RequestHeaders: &extProcPb.HttpHeaders{ + Headers: &configPb.HeaderMap{ + Headers: []*configPb.HeaderValue{ + { + Key: "content-type", + RawValue: []byte("text/event-stream"), + }, + { + Key: "status", + RawValue: []byte("200"), + }, + }, + }, + EndOfStream: true, + }, + }, + }, + }, + wantResponses: []*extProcPb.ProcessingResponse{ + { + Response: &extProcPb.ProcessingResponse_RequestHeaders{ + RequestHeaders: &extProcPb.HeadersResponse{ + Response: &extProcPb.CommonResponse{ + ClearRouteCache: true, + HeaderMutation: &extProcPb.HeaderMutation{ + SetHeaders: []*configPb.HeaderValueOption{ + { + Header: 
&configPb.HeaderValue{ + Key: "x-gateway-destination-endpoint", + RawValue: []byte("192.168.1.1:8000"), + }, + }, + }}, + }, + }, + }, + DynamicMetadata: makeMetadata("192.168.1.1:8000"), + }, + }, + pods: map[backend.Pod]*backendmetrics.Metrics{ + fakePod(0): { + WaitingQueueSize: 4, + KVCacheUsagePercent: 0.2, + ActiveModels: map[string]int{ + "foo": 1, + "bar": 1, + "sql-lora-1fdg3": 1, + }, + WaitingModels: map[string]int{}, + }, + }, + wantMetrics: map[string]string{`inference_pool_ready_pods`: ` + # HELP inference_pool_ready_pods [ALPHA] The number of ready pods in the inference server pool. + # TYPE inference_pool_ready_pods gauge + inference_pool_ready_pods{name="vllm-llama3-8b-instruct-pool"} 1 + `}, + }, + } + + for _, test := range tests { + t.Run(test.name, func(t *testing.T) { + client, cleanup := setUpHermeticServer(t, test.pods, true) + t.Cleanup(cleanup) + responses, err := integrationutils.StreamedRequest(t, client, test.requests, len(test.wantResponses)) + + if err != nil && !test.wantErr { + t.Errorf("Unexpected error, got: %v, want error: %v", err, test.wantErr) + } + if diff := cmp.Diff(test.wantResponses, responses, protocmp.Transform()); diff != "" { + t.Errorf("Unexpected response, (-want +got): %v", diff) + } + + if len(test.wantMetrics) != 0 { + for metricName, value := range test.wantMetrics { + if err := metricsutils.GatherAndCompare(legacyregistry.DefaultGatherer, strings.NewReader(value), metricName); err != nil { + t.Error(err) + } + } + } + + legacyregistry.Reset() + }) + } +} + +func setUpHermeticServer(t *testing.T, podAndMetrics map[backend.Pod]*backendmetrics.Metrics, streamed bool) (client extProcPb.ExternalProcessor_ProcessClient, cleanup func()) { + // Reconfigure the TestPodMetricsClient. + res := map[types.NamespacedName]*backendmetrics.Metrics{} + for pod, metrics := range podAndMetrics { + res[pod.NamespacedName] = metrics + } + serverRunner.TestPodMetricsClient.SetRes(res) + serverRunner.UseStreaming = streamed + + serverCtx, stopServer := context.WithCancel(context.Background()) + + // TODO: this should be consistent with the inference pool + podLabels := map[string]string{ + "app": "vllm-llama3-8b-instruct-pool", + } + + for pod := range podAndMetrics { + pod := epptestutil.MakePod(pod.NamespacedName.Name). + Namespace(pod.NamespacedName.Namespace). + ReadyCondition(). + Labels(podLabels). + IP(pod.Address). + Complete(). 
+				ObjRef()
+
+		copy := pod.DeepCopy()
+		if err := k8sClient.Create(context.Background(), copy); err != nil {
+			logutil.Fatal(logger, err, "Failed to create pod", "pod", pod)
+		}
+
+		// Since no pod controllers are deployed in the fake environment, update the pod status manually.
+		copy.Status = pod.Status
+		if err := k8sClient.Status().Update(context.Background(), copy); err != nil {
+			logutil.Fatal(logger, err, "Failed to update pod status", "pod", pod)
+		}
+	}
+	go func() {
+		if err := serverRunner.AsRunnable(logger.WithName("ext-proc")).Start(serverCtx); err != nil {
+			logutil.Fatal(logger, err, "Failed to start ext-proc server")
+		}
+	}()
+
+	time.Sleep(serverRunner.RefreshPrometheusMetricsInterval) // wait for metrics to become available before running tests that rely on them
+
+	// Check that all pods are synced to the datastore.
+	assert.EventuallyWithT(t, func(t *assert.CollectT) {
+		assert.Len(t, serverRunner.Datastore.PodGetAll(), len(podAndMetrics), "Datastore not synced")
+	}, 10*time.Second, time.Second)
+
+	address := fmt.Sprintf("localhost:%v", port)
+	// Create a gRPC connection.
+	conn, err := grpc.NewClient(address, grpc.WithTransportCredentials(insecure.NewCredentials()))
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to connect", "address", address)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
+	client, err = extProcPb.NewExternalProcessorClient(conn).Process(ctx)
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to create client")
+	}
+	return client, func() {
+		cancel()
+		conn.Close()
+		stopServer()
+
+		// Clear the created pods.
+		for pod := range podAndMetrics {
+			pod := epptestutil.MakePod(pod.NamespacedName.Name).
+				Namespace(pod.NamespacedName.Namespace).Complete().ObjRef()
+
+			if err := k8sClient.Delete(context.Background(), pod); err != nil {
+				logutil.Fatal(logger, err, "Failed to delete pod", "pod", pod)
+			}
+		}
+	}
+}
+
+func fakePod(index int) backend.Pod {
+	return backend.Pod{
+		NamespacedName: types.NamespacedName{Name: fmt.Sprintf("pod-%v", index), Namespace: "default"},
+		Address:        fmt.Sprintf("192.168.1.%d", index+1),
+	}
+}
+
+// BeforeSuite sets up the test environment and returns a cleanup function.
+func BeforeSuite() func() {
+	// Set up the mock k8s API client.
+	testEnv = &envtest.Environment{
+		CRDDirectoryPaths:     []string{filepath.Join("..", "..", "..", "config", "crd", "bases")},
+		ErrorIfCRDPathMissing: true,
+	}
+	cfg, err := testEnv.Start()
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to start test environment", "config", cfg)
+	}
+
+	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
+	utilruntime.Must(v1alpha2.Install(scheme))
+
+	k8sClient, err = k8sclient.New(cfg, k8sclient.Options{Scheme: scheme})
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to start k8s Client")
+	} else if k8sClient == nil {
+		logutil.Fatal(logger, nil, "No error, but returned kubernetes client is nil", "config", cfg)
+	}
+
+	// Init runtime.
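+	// The manager cache is scoped to the test namespace and pool name via
+	// managerTestOptions, defined at the bottom of this file.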
+	ctrl.SetLogger(logger)
+
+	mgr, err := server.NewManagerWithOptions(cfg, managerTestOptions("default", "vllm-llama3-8b-instruct-pool"))
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to create controller manager")
+	}
+
+	if err := registerMetricsHandler(mgr, metricsPort); err != nil {
+		logutil.Fatal(logger, err, "Failed to register metrics handler")
+	}
+
+	serverRunner = runserver.NewDefaultExtProcServerRunner()
+	serverRunner.TestPodMetricsClient = &backendmetrics.FakePodMetricsClient{}
+	pmf := backendmetrics.NewPodMetricsFactory(serverRunner.TestPodMetricsClient, 10*time.Millisecond)
+	// Adjust from defaults
+	serverRunner.PoolNamespacedName = types.NamespacedName{Name: "vllm-llama3-8b-instruct-pool", Namespace: "default"}
+	serverRunner.Datastore = datastore.NewDatastore(context.Background(), pmf)
+	serverRunner.Scheduler = scheduling.NewScheduler(serverRunner.Datastore)
+	serverRunner.SecureServing = false
+
+	if err := serverRunner.SetupWithManager(context.Background(), mgr); err != nil {
+		logutil.Fatal(logger, err, "Failed to setup server runner")
+	}
+
+	// Start the controller manager in a goroutine; this does not block.
+	go func() {
+		if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
+			logutil.Fatal(logger, err, "Failed to start manager")
+		}
+	}()
+
+	logger.Info("Setting up hermetic ExtProc server")
+
+	// Unmarshal the test manifests from file into unstructured objects.
+	manifestsPath := filepath.Join("..", "..", "testdata", "inferencepool-with-model-hermetic.yaml")
+	docs, err := readDocuments(manifestsPath)
+	if err != nil {
+		logutil.Fatal(logger, err, "Can't read object manifests", "path", manifestsPath)
+	}
+
+	for _, doc := range docs {
+		obj := &unstructured.Unstructured{}
+		if err = yaml.Unmarshal(doc, obj); err != nil {
+			logutil.Fatal(logger, err, "Can't unmarshal object", "document", doc)
+		}
+		logger.Info("Creating object", "kind", obj.GetKind(), "object", obj)
+		if err := k8sClient.Create(context.Background(), obj); err != nil {
+			logutil.Fatal(logger, err, "Unable to create object", "object", obj.GetName())
+		}
+	}
+
+	assert.Eventually(nil, func() bool {
+		modelExist := serverRunner.Datastore.ModelGet("my-model")
+		synced := serverRunner.Datastore.PoolHasSynced() && modelExist != nil
+		return synced
+	}, 10*time.Second, 10*time.Millisecond)
+
+	return func() {
+		_ = testEnv.Stop()
+		_ = k8sClient.DeleteAllOf(context.Background(), &v1alpha2.InferencePool{})
+		_ = k8sClient.DeleteAllOf(context.Background(), &v1alpha2.InferenceModel{})
+	}
+}
+
+// readDocuments reads YAML documents from the file at the given path.
+func readDocuments(fp string) ([][]byte, error) {
+	b, err := os.ReadFile(fp)
+	if err != nil {
+		return nil, err
+	}
+
+	docs := [][]byte{}
+	reader := k8syaml.NewYAMLReader(bufio.NewReader(bytes.NewReader(b)))
+	for {
+		// Read document
+		doc, err := reader.Read()
+		if err != nil {
+			if errors.Is(err, io.EOF) {
+				break
+			}
+			return nil, err
+		}
+		docs = append(docs, doc)
+	}
+	return docs, nil
+}
+
+func makeMetadata(endpoint string) *structpb.Struct {
+	return &structpb.Struct{
+		Fields: map[string]*structpb.Value{
+			runserver.DefaultDestinationEndpointHintMetadataNamespace: {
+				Kind: &structpb.Value_StructValue{
+					StructValue: &structpb.Struct{
+						Fields: map[string]*structpb.Value{
+							runserver.DefaultDestinationEndpointHintKey: {
+								Kind: &structpb.Value_StringValue{
+									StringValue: endpoint,
+								},
+							},
+						},
+					},
+				},
+			},
+		},
+	}
+}
+
+// registerMetricsHandler is a simplified version of the metrics endpoint handler,
+// without authentication, for use in integration tests.
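+// It serves promhttp from the same legacyregistry gatherer that the metric
+// assertions in this file scrape with GatherAndCompare.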
+func registerMetricsHandler(mgr manager.Manager, port int) error { + metrics.Register() + + // Init HTTP server. + h := promhttp.HandlerFor( + legacyregistry.DefaultGatherer, + promhttp.HandlerOpts{}, + ) + + mux := http.NewServeMux() + mux.Handle("/metrics", h) + + srv := &http.Server{ + Addr: net.JoinHostPort("", strconv.Itoa(port)), + Handler: mux, + } + + if err := mgr.Add(&manager.Server{ + Name: "metrics", + Server: srv, + }); err != nil { + return err + } + return nil +} + +// inject options that allow multiple test runs to run +// https://github.com/kubernetes-sigs/controller-runtime/issues/2937 +func managerTestOptions(namespace, name string) ctrl.Options { + return ctrl.Options{ + Scheme: scheme, + Cache: cache.Options{ + ByObject: map[client.Object]cache.ByObject{ + &corev1.Pod{}: { + Namespaces: map[string]cache.Config{ + namespace: {}, + }, + }, + &v1alpha2.InferencePool{}: { + Namespaces: map[string]cache.Config{ + namespace: { + FieldSelector: fields.SelectorFromSet(fields.Set{ + "metadata.name": name, + }), + }, + }, + }, + &v1alpha2.InferenceModel{}: { + Namespaces: map[string]cache.Config{ + namespace: {}, + }, + }, + }, + }, + Controller: config.Controller{ + SkipNameValidation: boolPointer(true), + }, + } +} + +func boolPointer(b bool) *bool { + return &b +} diff --git a/test/integration/hermetic_test.go b/test/integration/hermetic_test.go deleted file mode 100644 index 3dfe28f7..00000000 --- a/test/integration/hermetic_test.go +++ /dev/null @@ -1,475 +0,0 @@ -// Package test contains e2e tests for the ext proc while faking the backend pods. -package integration - -import ( - "bufio" - "bytes" - "context" - "errors" - "flag" - "fmt" - "io" - "log" - "os" - "path/filepath" - "strconv" - "testing" - "time" - - configPb "github.com/envoyproxy/go-control-plane/envoy/config/core/v3" - extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3" - "github.com/google/go-cmp/cmp" - "google.golang.org/grpc" - "google.golang.org/grpc/credentials/insecure" - "google.golang.org/protobuf/testing/protocmp" - "google.golang.org/protobuf/types/known/structpb" - "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1" - "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/backend" - runserver "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/server" - extprocutils "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/test" - testingutil "inference.networking.x-k8s.io/gateway-api-inference-extension/pkg/ext-proc/util/testing" - corev1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/runtime" - utilruntime "k8s.io/apimachinery/pkg/util/runtime" - k8syaml "k8s.io/apimachinery/pkg/util/yaml" - clientgoscheme "k8s.io/client-go/kubernetes/scheme" - klog "k8s.io/klog/v2" - k8sclient "sigs.k8s.io/controller-runtime/pkg/client" - "sigs.k8s.io/controller-runtime/pkg/envtest" - "sigs.k8s.io/yaml" -) - -const ( - port = runserver.DefaultGrpcPort -) - -var ( - serverRunner *runserver.ExtProcServerRunner - k8sClient k8sclient.Client - testEnv *envtest.Environment - scheme = runtime.NewScheme() -) - -func SKIPTestHandleRequestBody(t *testing.T) { - tests := []struct { - name string - req *extProcPb.ProcessingRequest - pods []*backend.PodMetrics - models map[string]*v1alpha1.InferenceModel - wantHeaders []*configPb.HeaderValueOption - wantBody []byte - wantErr bool - }{ - { - name: "success", - req: extprocutils.GenerateRequest("my-model"), - models: map[string]*v1alpha1.InferenceModel{ - 
"my-model": { - Spec: v1alpha1.InferenceModelSpec{ - ModelName: "my-model", - TargetModels: []v1alpha1.TargetModel{ - { - Name: "my-model-v1", - Weight: pointer(100), - }, - }, - }, - }, - }, - // pod-1 will be picked because it has relatively low queue size, with the requested - // model being active, and has low KV cache. - pods: []*backend.PodMetrics{ - { - Pod: extprocutils.FakePod(0), - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - }, - { - Pod: extprocutils.FakePod(1), - Metrics: backend.Metrics{ - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.1, - ActiveModels: map[string]int{ - "foo": 1, - "my-model-v1": 1, - }, - }, - }, - { - Pod: extprocutils.FakePod(2), - Metrics: backend.Metrics{ - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.2, - ActiveModels: map[string]int{ - "foo": 1, - }, - }, - }, - }, - wantHeaders: []*configPb.HeaderValueOption{ - { - Header: &configPb.HeaderValue{ - Key: runserver.DefaultTargetEndpointKey, - RawValue: []byte("pod-1:8000"), - }, - }, - { - Header: &configPb.HeaderValue{ - Key: "Content-Length", - RawValue: []byte("73"), - }, - }, - }, - wantBody: []byte("{\"max_tokens\":100,\"model\":\"my-model-v1\",\"prompt\":\"hello\",\"temperature\":0}"), - }, - } - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - client, cleanup := setUpServer(t, test.pods, test.models) - t.Cleanup(cleanup) - want := &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_RequestBody{ - RequestBody: &extProcPb.BodyResponse{ - Response: &extProcPb.CommonResponse{ - HeaderMutation: &extProcPb.HeaderMutation{ - SetHeaders: test.wantHeaders, - }, - BodyMutation: &extProcPb.BodyMutation{ - Mutation: &extProcPb.BodyMutation_Body{ - Body: test.wantBody, - }, - }, - }, - }, - }, - } - res, err := sendRequest(t, client, test.req) - - if (err != nil) != test.wantErr { - t.Fatalf("Unexpected error, got %v, want %v", err, test.wantErr) - } - - if diff := cmp.Diff(want, res, protocmp.Transform()); diff != "" { - t.Errorf("Unexpected response, (-want +got): %v", diff) - } - }) - } - -} - -func TestKubeInferenceModelRequest(t *testing.T) { - tests := []struct { - name string - req *extProcPb.ProcessingRequest - wantHeaders []*configPb.HeaderValueOption - wantMetadata *structpb.Struct - wantBody []byte - wantErr bool - }{ - { - name: "success", - req: extprocutils.GenerateRequest("sql-lora"), - // pod-1 will be picked because it has relatively low queue size, with the requested - // model being active, and has low KV cache. 
- wantHeaders: []*configPb.HeaderValueOption{ - { - Header: &configPb.HeaderValue{ - Key: runserver.DefaultTargetEndpointKey, - RawValue: []byte("pod-1:8000"), - }, - }, - { - Header: &configPb.HeaderValue{ - Key: "Content-Length", - RawValue: []byte("76"), - }, - }, - }, - wantMetadata: &structpb.Struct{ - Fields: map[string]*structpb.Value{ - runserver.DefaultTargetEndpointKey: { - Kind: &structpb.Value_StringValue{ - StringValue: "pod-1:8000", - }, - }, - }, - }, - wantBody: []byte("{\"max_tokens\":100,\"model\":\"sql-lora-1fdg2\",\"prompt\":\"hello\",\"temperature\":0}"), - wantErr: false, - }, - } - - metrics := []*backend.Metrics{ - { - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.2, - ActiveModels: map[string]int{ - "foo": 1, - "bar": 1, - }, - }, - { - WaitingQueueSize: 0, - KVCacheUsagePercent: 0.1, - ActiveModels: map[string]int{ - "foo": 1, - "sql-lora-1fdg2": 1, - }, - }, - { - WaitingQueueSize: 10, - KVCacheUsagePercent: 0.2, - ActiveModels: map[string]int{ - "foo": 1, - }, - }, - } - - // Set up global k8sclient and extproc server runner with test environment config - podMetrics := BeforeSuit(metrics) - - for _, test := range tests { - t.Run(test.name, func(t *testing.T) { - client, cleanup := setUpHermeticServer(t, podMetrics) - t.Cleanup(cleanup) - want := &extProcPb.ProcessingResponse{ - Response: &extProcPb.ProcessingResponse_RequestBody{ - RequestBody: &extProcPb.BodyResponse{ - Response: &extProcPb.CommonResponse{ - HeaderMutation: &extProcPb.HeaderMutation{ - SetHeaders: test.wantHeaders, - }, - BodyMutation: &extProcPb.BodyMutation{ - Mutation: &extProcPb.BodyMutation_Body{ - Body: test.wantBody, - }, - }, - }, - }, - }, - DynamicMetadata: test.wantMetadata, - } - res, err := sendRequest(t, client, test.req) - - if err != nil { - if !test.wantErr { - t.Errorf("Unexpected error, got: %v, want error: %v", err, test.wantErr) - } - } else if diff := cmp.Diff(want, res, protocmp.Transform()); diff != "" { - t.Errorf("Unexpected response, (-want +got): %v", diff) - } - }) - } -} - -func setUpServer(t *testing.T, pods []*backend.PodMetrics, models map[string]*v1alpha1.InferenceModel) (client extProcPb.ExternalProcessor_ProcessClient, cleanup func()) { - t.Logf("Setting up ExtProc server") - server := extprocutils.StartExtProc(port, time.Second, time.Second, pods, models) - - address := fmt.Sprintf("localhost:%v", port) - // Create a grpc connection - conn, err := grpc.NewClient(address, grpc.WithTransportCredentials(insecure.NewCredentials())) - if err != nil { - log.Fatalf("Failed to connect to %v: %v", address, err) - } - - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) - client, err = extProcPb.NewExternalProcessorClient(conn).Process(ctx) - if err != nil { - log.Fatalf("Failed to create client: %v", err) - } - return client, func() { - cancel() - conn.Close() - server.GracefulStop() - } -} - -func setUpHermeticServer(t *testing.T, pods []*backend.PodMetrics) (client extProcPb.ExternalProcessor_ProcessClient, cleanup func()) { - t.Logf("Setting up hermetic ExtProc server") - klog.InitFlags(nil) - flag.Parse() - // Configure klog verbosity levels to print ext proc logs. 
- _ = flag.Lookup("v").Value.Set("3") - - // Unmarshal CRDs from file into structs - manifestsPath := filepath.Join("..", "testdata", "inferencepool-with-model-hermetic.yaml") - docs, err := readDocuments(manifestsPath) - if err != nil { - log.Fatalf("Can't read object manifests at path %v, %v", manifestsPath, err) - } - - for _, doc := range docs { - inferenceModel := &v1alpha1.InferenceModel{} - if err = yaml.Unmarshal(doc, inferenceModel); err != nil { - log.Fatalf("Can't unmarshal object: %v", doc) - } - if inferenceModel.Kind == "InferenceModel" { - t.Logf("Creating inference model: %+v", inferenceModel) - if err := k8sClient.Create(context.Background(), inferenceModel); err != nil { - log.Fatalf("unable to create inferenceModel %v: %v", inferenceModel.Name, err) - } - } - } - inferencePool := &v1alpha1.InferencePool{} - for _, doc := range docs { - if err = yaml.Unmarshal(doc, inferencePool); err != nil { - log.Fatalf("Can't unmarshal object: %v", doc) - } - if inferencePool.Kind == "InferencePool" { - t.Logf("Creating inference pool: %+v", inferencePool) - if err := k8sClient.Create(context.Background(), inferencePool); err != nil { - log.Fatalf("unable to create inferencePool %v: %v", inferencePool.Name, err) - } - // expecting a single inferencepool - break - } - } - - ps := make(backend.PodSet) - pms := make(map[string]*backend.PodMetrics) - for _, pod := range pods { - ps[pod.Pod] = true - pms[pod.Pod.Name] = pod - } - pmc := &backend.FakePodMetricsClient{Res: pms} - server := serverRunner.Start(pmc) - if err != nil { - log.Fatalf("Ext-proc failed with the err: %v", err) - } - - // Wait the reconciler to populate the datastore. - time.Sleep(10 * time.Second) - - address := fmt.Sprintf("localhost:%v", port) - // Create a grpc connection - conn, err := grpc.NewClient(address, grpc.WithTransportCredentials(insecure.NewCredentials())) - if err != nil { - log.Fatalf("Failed to connect to %v: %v", address, err) - } - - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) - client, err = extProcPb.NewExternalProcessorClient(conn).Process(ctx) - if err != nil { - log.Fatalf("Failed to create client: %v", err) - } - return client, func() { - cancel() - conn.Close() - server.GracefulStop() - } -} - -// Sets up a test environment and returns the runner struct -func BeforeSuit(metrics []*backend.Metrics) []*backend.PodMetrics { - // Set up mock k8s API Client - testEnv = &envtest.Environment{ - CRDDirectoryPaths: []string{filepath.Join("..", "..", "config", "crd", "bases")}, - ErrorIfCRDPathMissing: true, - } - cfg, err := testEnv.Start() - - if err != nil { - log.Fatalf("Failed to start test environment, cfg: %v error: %v", cfg, err) - } - - utilruntime.Must(clientgoscheme.AddToScheme(scheme)) - utilruntime.Must(v1alpha1.AddToScheme(scheme)) - - k8sClient, err = k8sclient.New(cfg, k8sclient.Options{Scheme: scheme}) - if err != nil { - log.Fatalf("Failed to start k8s Client: %v", err) - } else if k8sClient == nil { - log.Fatalf("No error, but returned kubernetes client is nil, cfg: %v", cfg) - } - - podMetrics := []*backend.PodMetrics{} - fakeLister := &testingutil.FakePodLister{ - PodsList: []*corev1.Pod{}, - } - for i, m := range metrics { - podName := "pod-" + strconv.Itoa(i) - pod := testingutil.MakePod(podName).SetReady().SetPodIP(podName).Obj() - fakeLister.PodsList = append(fakeLister.PodsList, pod) - podMetrics = append(podMetrics, &backend.PodMetrics{ - Pod: backend.Pod{ - Name: pod.Name, - Address: pod.Status.PodIP + ":8000", - }, - Metrics: *m, - }) - } - - 
serverRunner = runserver.NewDefaultExtProcServerRunner() - // Adjust from defaults - serverRunner.PoolName = "vllm-llama2-7b-pool" - serverRunner.Scheme = scheme - serverRunner.Config = cfg - serverRunner.Datastore = backend.NewK8sDataStore(backend.WithPodListerFactory( - func(pool *v1alpha1.InferencePool) *backend.PodLister { - klog.V(1).Infof("Setting the fake lister %v", len(fakeLister.PodsList)) - return &backend.PodLister{ - Lister: fakeLister, - } - })) - - serverRunner.Setup() - - // Start the controller manager in go routine, not blocking - go func() { - serverRunner.StartManager() - }() - - // Wait the reconcilers to populate the datastore. - time.Sleep(5 * time.Second) - return podMetrics -} - -func sendRequest(t *testing.T, client extProcPb.ExternalProcessor_ProcessClient, req *extProcPb.ProcessingRequest) (*extProcPb.ProcessingResponse, error) { - t.Logf("Sending request: %v", req) - if err := client.Send(req); err != nil { - t.Logf("Failed to send request %+v: %v", req, err) - return nil, err - } - - res, err := client.Recv() - if err != nil { - t.Logf("Failed to receive: %v", err) - return nil, err - } - t.Logf("Received request %+v", res) - return res, err -} - -// readDocuments reads documents from file. -func readDocuments(fp string) ([][]byte, error) { - b, err := os.ReadFile(fp) - if err != nil { - return nil, err - } - - docs := [][]byte{} - reader := k8syaml.NewYAMLReader(bufio.NewReader(bytes.NewReader(b))) - for { - // Read document - doc, err := reader.Read() - if err != nil { - if errors.Is(err, io.EOF) { - break - } - return nil, err - } - docs = append(docs, doc) - } - return docs, nil -} -func pointer(v int32) *int32 { - return &v -} diff --git a/test/integration/util.go b/test/integration/util.go new file mode 100644 index 00000000..5fcc9d18 --- /dev/null +++ b/test/integration/util.go @@ -0,0 +1,121 @@ +/* +Copyright 2025 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/
+
+package integration
+
+import (
+	"encoding/json"
+	"io"
+	"testing"
+	"time"
+
+	envoyCorev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
+	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
+	"github.com/go-logr/logr"
+	logutil "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/util/logging"
+)
+
+func SendRequest(t *testing.T, client extProcPb.ExternalProcessor_ProcessClient, req *extProcPb.ProcessingRequest) (*extProcPb.ProcessingResponse, error) {
+	t.Logf("Sending request: %v", req)
+	if err := client.Send(req); err != nil {
+		t.Logf("Failed to send request %+v: %v", req, err)
+		return nil, err
+	}
+
+	res, err := client.Recv()
+	if err != nil {
+		t.Logf("Failed to receive: %v", err)
+		return nil, err
+	}
+	t.Logf("Received response %+v", res)
+	return res, err
+}
+
+func StreamedRequest(t *testing.T, client extProcPb.ExternalProcessor_ProcessClient, requests []*extProcPb.ProcessingRequest, expectedResponses int) ([]*extProcPb.ProcessingResponse, error) {
+	for _, req := range requests {
+		t.Logf("Sending request: %v", req)
+		if err := client.Send(req); err != nil {
+			t.Logf("Failed to send request %+v: %v", req, err)
+			return nil, err
+		}
+	}
+	responses := []*extProcPb.ProcessingResponse{}
+
+	// Use a deliberately simple timeout: if fewer than the expected number of
+	// responses arrive before it fires, bail out and fail.
+	var simpleTimeout bool
+	go func() {
+		time.Sleep(10 * time.Second)
+		simpleTimeout = true
+	}()
+
+	for range expectedResponses {
+		if simpleTimeout {
+			break
+		}
+		res, err := client.Recv()
+		if err != nil && err != io.EOF {
+			t.Logf("Failed to receive: %v", err)
+			return nil, err
+		}
+		t.Logf("Received response %+v", res)
+		responses = append(responses, res)
+	}
+	return responses, nil
+}
+
+func GenerateRequest(logger logr.Logger, prompt, model string) *extProcPb.ProcessingRequest {
+	j := map[string]interface{}{
+		"prompt":      prompt,
+		"max_tokens":  100,
+		"temperature": 0,
+	}
+	if model != "" {
+		j["model"] = model
+	}
+
+	llmReq, err := json.Marshal(j)
+	if err != nil {
+		logutil.Fatal(logger, err, "Failed to marshal LLM request")
+	}
+	req := &extProcPb.ProcessingRequest{
+		Request: &extProcPb.ProcessingRequest_RequestBody{
+			RequestBody: &extProcPb.HttpBody{Body: llmReq, EndOfStream: true},
+		},
+	}
+	return req
+}
+
+func GenerateStreamedRequestSet(logger logr.Logger, prompt, model string) []*extProcPb.ProcessingRequest {
+	requests := []*extProcPb.ProcessingRequest{}
+	headerReq := &extProcPb.ProcessingRequest{
+		Request: &extProcPb.ProcessingRequest_RequestHeaders{
+			RequestHeaders: &extProcPb.HttpHeaders{
+				Headers: &envoyCorev3.HeaderMap{
+					Headers: []*envoyCorev3.HeaderValue{
+						{
+							Key:   "hi",
+							Value: "mom",
+						},
+					},
+				},
+			},
+		},
+	}
+	requests = append(requests, headerReq)
+	requests = append(requests, GenerateRequest(logger, prompt, model))
+	return requests
+}
diff --git a/test/testdata/envoy.yaml b/test/testdata/envoy.yaml
index 700eb24c..3fff8598 100644
--- a/test/testdata/envoy.yaml
+++ b/test/testdata/envoy.yaml
@@ -100,14 +100,15 @@ data:
               grpc_service:
                 envoy_grpc:
                   cluster_name: ext_proc
-                  authority: inference-gateway-ext-proc.default:9002
+                  authority: vllm-llama3-8b-instruct-epp.$E2E_NS:9002
                 timeout: 10s
               processing_mode:
                 request_header_mode: SEND
-                response_header_mode: SKIP
-                request_body_mode: BUFFERED
-                request_trailer_mode: SKIP
-                response_trailer_mode: SKIP
+                response_header_mode: SEND
+                request_body_mode: FULL_DUPLEX_STREAMED
+                response_body_mode: FULL_DUPLEX_STREAMED
+                request_trailer_mode: SEND
+                response_trailer_mode: SEND
               message_timeout: 1000s
               # Mark it as disabled if needed for troubleshooting:
               # disabled: true
@@ -169,6 +170,15 @@
             max_pending_requests: 40000
             max_requests: 40000
             max_retries: 1024
+          # This ensures that Envoy accepts untrusted certificates. We tried explicitly
+          # setting TrustChainVerification to ACCEPT_UNTRUSTED, but that did not work;
+          # what worked was leaving the common_tls_context empty.
+          transport_socket:
+            name: "envoy.transport_sockets.tls"
+            typed_config:
+              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
+              common_tls_context:
+                validation_context:
           typed_extension_protocol_options:
             envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
               "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
@@ -185,7 +195,7 @@
           - endpoint:
               address:
                 socket_address:
-                  address: inference-gateway-ext-proc.default
+                  address: vllm-llama3-8b-instruct-epp.$E2E_NS
                   port_value: 9002
             health_status: HEALTHY
             load_balancing_weight: 1
@@ -212,14 +222,14 @@
     spec:
       containers:
       - name: envoy
-        image: docker.io/envoyproxy/envoy:distroless-v1.32.2
+        image: docker.io/envoyproxy/envoy:distroless-v1.33.2
         args:
         - "--service-cluster"
-        - "default/inference-gateway"
+        - "$E2E_NS/inference-gateway"
         - "--service-node"
         - "$(ENVOY_POD_NAME)"
         - "--log-level"
-        - "debug"
+        - "trace"
         - "--cpuset-threads"
         - "--drain-strategy"
        - "immediate"
diff --git a/pkg/manifests/ext_proc.yaml b/test/testdata/inferencepool-e2e.yaml
similarity index 69%
rename from pkg/manifests/ext_proc.yaml
rename to test/testdata/inferencepool-e2e.yaml
index b9b860dc..79339c5b 100644
--- a/pkg/manifests/ext_proc.yaml
+++ b/test/testdata/inferencepool-e2e.yaml
@@ -1,86 +1,70 @@
-kind: ClusterRole
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
-  name: pod-read
-rules:
-- apiGroups: ["inference.networking.x-k8s.io"]
-  resources: ["inferencemodels"]
-  verbs: ["get", "watch", "list"]
-- apiGroups: [""]
-  resources: ["pods"]
-  verbs: ["get", "watch", "list"]
-- apiGroups: ["inference.networking.x-k8s.io"]
-  resources: ["inferencepools"]
-  verbs: ["get", "watch", "list"]
-- apiGroups: ["discovery.k8s.io"]
-  resources: ["endpointslices"]
-  verbs: ["get", "watch", "list"]
-- apiGroups:
-  - authentication.k8s.io
-  resources:
-  - tokenreviews
-  verbs:
-  - create
-- apiGroups:
-  - authorization.k8s.io
-  resources:
-  - subjectaccessreviews
-  verbs:
-  - create
----
-kind: ClusterRoleBinding
-apiVersion: rbac.authorization.k8s.io/v1
-metadata:
-  name: pod-read-binding
-subjects:
-- kind: ServiceAccount
-  name: default
-  namespace: default
-roleRef:
-  kind: ClusterRole
-  name: pod-read
----
-apiVersion: inference.networking.x-k8s.io/v1alpha1
+apiVersion: inference.networking.x-k8s.io/v1alpha2
 kind: InferencePool
 metadata:
   labels:
-  name: vllm-llama2-7b-pool
+  name: vllm-llama3-8b-instruct
 spec:
   targetPortNumber: 8000
   selector:
-    app: vllm-llama2-7b-pool
+    app: vllm-llama3-8b-instruct
   extensionRef:
-    name: inference-gateway-ext-proc
+    name: vllm-llama3-8b-instruct-epp
+    namespace: $E2E_NS
+---
+apiVersion: v1
+kind: Service
+metadata:
+  name: vllm-llama3-8b-instruct-epp
+  namespace: $E2E_NS
+spec:
+  selector:
+    app: vllm-llama3-8b-instruct-epp
+  ports:
+    - protocol: TCP
+      port: 9002
+      targetPort: 9002
+      appProtocol: http2
+  type: ClusterIP
 ---
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: inference-gateway-ext-proc
-  namespace: default
+  name: vllm-llama3-8b-instruct-epp
+  namespace: $E2E_NS
   labels:
-    app: 
inference-gateway-ext-proc + app: vllm-llama3-8b-instruct-epp spec: replicas: 1 selector: matchLabels: - app: inference-gateway-ext-proc + app: vllm-llama3-8b-instruct-epp template: metadata: labels: - app: inference-gateway-ext-proc + app: vllm-llama3-8b-instruct-epp spec: + # Conservatively, this timeout should mirror the longest grace period of the pods within the pool + terminationGracePeriodSeconds: 130 containers: - - name: inference-gateway-ext-proc + - name: epp image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/epp:main + imagePullPolicy: Always args: - -poolName - - "vllm-llama2-7b-pool" + - "vllm-llama3-8b-instruct" + - -poolNamespace + - "$E2E_NS" - -v - - "3" + - "4" + - --zap-encoder + - "json" - -grpcPort - "9002" - -grpcHealthPort - "9003" + env: + - name: USE_STREAMING + value: "true" ports: - containerPort: 9002 - containerPort: 9003 @@ -99,16 +83,44 @@ spec: initialDelaySeconds: 5 periodSeconds: 10 --- -apiVersion: v1 -kind: Service +kind: ClusterRole +apiVersion: rbac.authorization.k8s.io/v1 metadata: - name: inference-gateway-ext-proc - namespace: default -spec: - selector: - app: inference-gateway-ext-proc - ports: - - protocol: TCP - port: 9002 - targetPort: 9002 - type: ClusterIP + name: pod-read +rules: +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferencemodels"] + verbs: ["get", "watch", "list"] +- apiGroups: [""] + resources: ["pods"] + verbs: ["get", "watch", "list"] +- apiGroups: ["inference.networking.x-k8s.io"] + resources: ["inferencepools"] + verbs: ["get", "watch", "list"] +- apiGroups: ["discovery.k8s.io"] + resources: ["endpointslices"] + verbs: ["get", "watch", "list"] +- apiGroups: + - authentication.k8s.io + resources: + - tokenreviews + verbs: + - create +- apiGroups: + - authorization.k8s.io + resources: + - subjectaccessreviews + verbs: + - create +--- +kind: ClusterRoleBinding +apiVersion: rbac.authorization.k8s.io/v1 +metadata: + name: pod-read-binding +subjects: +- kind: ServiceAccount + name: default + namespace: $E2E_NS +roleRef: + kind: ClusterRole + name: pod-read diff --git a/test/testdata/inferencepool-with-model-hermetic.yaml b/test/testdata/inferencepool-with-model-hermetic.yaml index a07e0f35..0c1e518f 100644 --- a/test/testdata/inferencepool-with-model-hermetic.yaml +++ b/test/testdata/inferencepool-with-model-hermetic.yaml @@ -1,25 +1,63 @@ -apiVersion: inference.networking.x-k8s.io/v1alpha1 +apiVersion: inference.networking.x-k8s.io/v1alpha2 kind: InferencePool metadata: - name: vllm-llama2-7b-pool + name: vllm-llama3-8b-instruct-pool namespace: default spec: targetPortNumber: 8000 selector: - app: vllm-llama2-7b-pool + app: vllm-llama3-8b-instruct-pool extensionRef: name: epp --- -apiVersion: inference.networking.x-k8s.io/v1alpha1 +apiVersion: inference.networking.x-k8s.io/v1alpha2 kind: InferenceModel metadata: - name: inferencemodel-sample + name: sample namespace: default spec: modelName: sql-lora criticality: Critical poolRef: - name: vllm-llama2-7b-pool + name: vllm-llama3-8b-instruct-pool targetModels: - name: sql-lora-1fdg2 weight: 100 +--- +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: sheddable + namespace: default +spec: + modelName: sql-lora-sheddable + poolRef: + name: vllm-llama3-8b-instruct-pool + targetModels: + - name: sql-lora-1fdg3 + weight: 100 +--- +apiVersion: inference.networking.x-k8s.io/v1alpha2 +kind: InferenceModel +metadata: + name: generic + namespace: default +spec: + modelName: my-model + criticality: 
Critical
+  poolRef:
+    name: vllm-llama3-8b-instruct-pool
+  targetModels:
+  - name: my-model-12345
+    weight: 100
+---
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceModel
+metadata:
+  name: direct-model-name
+  namespace: default
+spec:
+  modelName: direct-model
+  criticality: Critical
+  poolRef:
+    name: vllm-llama3-8b-instruct-pool
\ No newline at end of file
diff --git a/test/utils/utils.go b/test/utils/utils.go
index 337599c3..1ec0fbaa 100644
--- a/test/utils/utils.go
+++ b/test/utils/utils.go
@@ -24,7 +24,6 @@ import (
 
 	"github.com/onsi/ginkgo/v2"
 	"github.com/onsi/gomega"
-	infextv1a1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1"
 	appsv1 "k8s.io/api/apps/v1"
 	corev1 "k8s.io/api/core/v1"
 	rbacv1 "k8s.io/api/rbac/v1"
@@ -37,6 +36,7 @@ import (
 	"k8s.io/client-go/rest"
 	"k8s.io/client-go/tools/remotecommand"
 	"sigs.k8s.io/controller-runtime/pkg/client"
+	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
 )
 
 // DeleteClusterResources deletes all cluster-scoped objects the tests typically create.
@@ -106,11 +106,11 @@ func DeleteNamespacedResources(ctx context.Context, cli client.Client, ns string
 	if err != nil && !apierrors.IsNotFound(err) {
 		return err
 	}
-	err = cli.DeleteAllOf(ctx, &infextv1a1.InferencePool{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
+	err = cli.DeleteAllOf(ctx, &v1alpha2.InferencePool{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
 	if err != nil && !apierrors.IsNotFound(err) {
 		return err
 	}
-	err = cli.DeleteAllOf(ctx, &infextv1a1.InferenceModel{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
+	err = cli.DeleteAllOf(ctx, &v1alpha2.InferenceModel{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
 	if err != nil && !apierrors.IsNotFound(err) {
 		return err
 	}
@@ -132,7 +132,7 @@ func DeleteInferenceModelResources(ctx context.Context, cli client.Client, ns st
 	if ns == "" {
 		return nil
 	}
-	err := cli.DeleteAllOf(ctx, &infextv1a1.InferenceModel{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
+	err := cli.DeleteAllOf(ctx, &v1alpha2.InferenceModel{}, client.InNamespace(ns), client.PropagationPolicy(metav1.DeletePropagationForeground))
 	if err != nil && !apierrors.IsNotFound(err) {
 		return err
 	}
diff --git a/test/utils/wrappers.go b/test/utils/wrappers.go
index 12ff856a..867118c1 100644
--- a/test/utils/wrappers.go
+++ b/test/utils/wrappers.go
@@ -17,26 +17,26 @@ limitations under the License.
 package utils
 
 import (
-	infextv1a1 "inference.networking.x-k8s.io/gateway-api-inference-extension/api/v1alpha1"
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha2"
 )
 
 // InferenceModelWrapper wraps an InferenceModel.
 type InferenceModelWrapper struct {
-	infextv1a1.InferenceModel
+	v1alpha2.InferenceModel
 }
 
 // MakeModelWrapper creates a wrapper for an InferenceModel.
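 // A typical chained usage in a test might look like (hypothetical values):
 //
 //	model := MakeModelWrapper("sample", "default").SetModelName("sql-lora").SetPoolRef("pool").Obj()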
 func MakeModelWrapper(name, ns string) *InferenceModelWrapper {
 	return &InferenceModelWrapper{
-		infextv1a1.InferenceModel{
+		v1alpha2.InferenceModel{
 			ObjectMeta: metav1.ObjectMeta{
 				Name:      name,
 				Namespace: ns,
 			},
-			Spec: infextv1a1.InferenceModelSpec{
+			Spec: v1alpha2.InferenceModelSpec{
 				ModelName: "",
-				PoolRef:   infextv1a1.PoolObjectReference{},
+				PoolRef:   v1alpha2.PoolObjectReference{},
 			},
 		},
 	}
@@ -49,7 +49,7 @@ func (m *InferenceModelWrapper) SetModelName(name string) *InferenceModelWrapper
 }
 
 // SetCriticality sets the value of the inferenceModel.spec.criticality.
-func (m *InferenceModelWrapper) SetCriticality(level infextv1a1.Criticality) *InferenceModelWrapper {
+func (m *InferenceModelWrapper) SetCriticality(level v1alpha2.Criticality) *InferenceModelWrapper {
 	m.Spec.Criticality = &level
 	return m
 }
@@ -57,22 +57,22 @@ func (m *InferenceModelWrapper) SetCriticality(level infextv1a1.Criticality) *In
 // SetPoolRef sets the value of the inferenceModel.spec.poolRef using defaults
 // for group/kind and name as the PoolObjectReference name.
 func (m *InferenceModelWrapper) SetPoolRef(name string) *InferenceModelWrapper {
-	ref := infextv1a1.PoolObjectReference{
-		Group: infextv1a1.GroupVersion.Group,
+	ref := v1alpha2.PoolObjectReference{
+		Group: v1alpha2.Group(v1alpha2.GroupVersion.Group),
 		Kind:  "inferencepools",
-		Name:  name,
+		Name:  v1alpha2.ObjectName(name),
 	}
 	m.Spec.PoolRef = ref
 	return m
 }
 
 // SetTargetModels sets the value of the inferenceModel.spec.targetModels.
-func (m *InferenceModelWrapper) SetTargetModels(models []infextv1a1.TargetModel) *InferenceModelWrapper {
+func (m *InferenceModelWrapper) SetTargetModels(models []v1alpha2.TargetModel) *InferenceModelWrapper {
 	m.Spec.TargetModels = models
 	return m
 }
 
 // Obj returns the inner InferenceModel.
-func (m *InferenceModelWrapper) Obj() *infextv1a1.InferenceModel {
+func (m *InferenceModelWrapper) Obj() *v1alpha2.InferenceModel {
 	return &m.InferenceModel
 }
diff --git a/tools/benchmark/README.md b/tools/benchmark/README.md
new file mode 100644
index 00000000..ffd3ee7b
--- /dev/null
+++ b/tools/benchmark/README.md
@@ -0,0 +1 @@
+This folder contains resources to run performance benchmarks. Please follow the benchmark guide at https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark.
\ No newline at end of file
diff --git a/tools/benchmark/benchmark.ipynb b/tools/benchmark/benchmark.ipynb
new file mode 100644
index 00000000..993279cb
--- /dev/null
+++ b/tools/benchmark/benchmark.ipynb
@@ -0,0 +1,358 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {
+    "executionInfo": {
+     "elapsed": 391,
+     "status": "ok",
+     "timestamp": 1741734317446,
+     "user": {
+      "displayName": "Cong Liu",
+      "userId": "18222691451061354557"
+     },
+     "user_tz": 420
+    },
+    "id": "ziJD5zt0c1Rt"
+   },
+   "outputs": [],
+   "source": [
+    "#@title Configuration. 
Edit this before running the rest.\n", + "\n", + "OUTPUT_DIR='output'\n", + "RUN_ID='example-run'\n", + "# Path to the benchmark dir under `gateway-api-inference-extension/benchmark`\n", + "BENCHMARK_DIR =\"./\"\n", + "# A regex to match the model name, which matches the output file name.\n", + "MODEL_MATCHER='.*llama.*'" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "executionInfo": { + "elapsed": 33, + "status": "ok", + "timestamp": 1741735749209, + "user": { + "displayName": "Cong Liu", + "userId": "18222691451061354557" + }, + "user_tz": 420 + }, + "id": "dB7xALgLawN-" + }, + "outputs": [], + "source": [ + "#@title Plot Helper\n", + "import os\n", + "import pandas as pd\n", + "import re\n", + "import json\n", + "from collections import OrderedDict\n", + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import math\n", + "import logging\n", + "level = logging.INFO\n", + "logger = logging.getLogger(__name__)\n", + "logger.setLevel(level)\n", + "handler = logging.StreamHandler() # This sends output to the console\n", + "handler.setLevel(level) # Set handler level\n", + "logger.addHandler(handler)\n", + "\n", + "title_fontsize = 18\n", + "axis_label_fontsize = 18\n", + "legend_fontsize = 16\n", + "tick_label_fontsize = 14\n", + "\n", + "# Encapsulates some basic information needed to plot metrics.\n", + "class XY:\n", + " def __init__(self, x: str, y: str, x_label=None, y_label=None):\n", + " self.x = x\n", + " self.y = y\n", + " self.x_label = x if x_label is None else x_label\n", + " self.y_label = y if y_label is None else y_label\n", + "\n", + "NUM_PLOTS_PER_ROW = 4\n", + "# The arguments need to match the metric name fields generated by the benchmark tool.\n", + "CORE_METRICS = [\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'output_tokens_per_min'),\n", + " XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_per_output_token_latency\"),\n", + " XY(x = \"request_rate\", x_label = 'QPS', y = \"p90_latency\"),\n", + "]\n", + "SANITY_CHECK_METRICS = [\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'benchmark_time'),\n", + " XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_attempted\"),\n", + " XY(x = \"request_rate\", x_label = 'QPS', y=\"num_prompts_succeeded\"),\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'throughput_rps'),\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'total_input_tokens'),\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'total_output_token'),\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'avg_input_len'),\n", + " XY(x = 'request_rate', x_label = 'QPS', y = 'avg_output_len'),\n", + "]\n", + "\n", + "class Label:\n", + " def __init__(self, name, alias=None):\n", + " self.name = name\n", + " self.alias = name if alias is None else alias\n", + "\n", + "ALL_METRICS = CORE_METRICS + SANITY_CHECK_METRICS\n", + "\n", + "class Plotter:\n", + " def __init__(self, run_id, labels=None, metrics=CORE_METRICS, num_plots_per_row=5, interactive=False, annotate=False, output_dir=OUTPUT_DIR):\n", + " self.run_id = run_id\n", + " self.labels = labels\n", + " self.metrics = metrics\n", + " self.num_plots_per_row = num_plots_per_row\n", + " self.interactive = interactive\n", + " self.annotate = annotate\n", + " self.output_dir = output_dir\n", + "\n", + " def withRunId(self, run_id):\n", + " return Plotter(run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, self.output_dir)\n", + "\n", + " def withLabels(self, labels):\n", + " return Plotter(self.run_id, 
labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, self.output_dir)\n", + "\n", + " def withMetrics(self, metrics):\n", + " return Plotter(self.run_id, self.labels, metrics, self.num_plots_per_row, self.interactive, self.annotate, self.output_dir)\n", + "\n", + " def withOutputDir(self, output_dir):\n", + " return Plotter(self.run_id, self.labels, self.metrics, self.num_plots_per_row, self.interactive, self.annotate, output_dir)\n", + "\n", + " def plot_bar(self):\n", + " data = load_data(self.labels, self.run_id, self.output_dir)\n", + " groups = group_data(data, self.metrics)\n", + " logger.debug(\"Plotting run id...\")\n", + " plot_bar(self.labels, groups, self.metrics, self.num_plots_per_row, self.interactive, annotate=self.annotate)\n", + "\n", + "def filepaths(root_dir):\n", + " \"\"\"\n", + " Recursively reads files within a directory and returns a list of file paths.\n", + " \"\"\"\n", + "\n", + " filepaths = []\n", + " for dirpath, dirnames, filenames in os.walk(root_dir):\n", + " for filename in filenames:\n", + " filepath = os.path.join(dirpath, filename)\n", + " filepaths.append(filepath)\n", + " return filepaths\n", + "\n", + "def flatten_server_metrics(server_metrics):\n", + " \"\"\"\n", + " Flattens the server metrics json to a single level.\n", + " \"\"\"\n", + " flattend = {}\n", + " for k, v in server_metrics.items():\n", + " if isinstance(v, dict):\n", + " for k2, v2 in v.items():\n", + " flattend[k + \".\" + k2] = v2\n", + "\n", + " return flattend\n", + "\n", + "def load_data(labels, run_id, output_dir=OUTPUT_DIR):\n", + " data_path =f\"{BENCHMARK_DIR}/{output_dir}/{run_id}\"\n", + " records = []\n", + " logger.debug(f\"Loading data for {data_path}\")\n", + " for file in filepaths(data_path):\n", + " for label in labels:\n", + " regex = f\".*\\/{label.name}\\/results/json/{MODEL_MATCHER}.json\"\n", + " logger.debug(f\"matching file {file} for regex {regex} and label {label}\")\n", + " if re.match(regex, file):\n", + " logger.debug(f\"found match file {file} for regex {regex} and label {label}\")\n", + " with open(file, 'r') as f:\n", + " raw_data = json.load(f)\n", + " sample_data = {\n", + " 'file_name': f.name,\n", + " 'label': label.alias,\n", + " **raw_data.get(\"metrics\",{}),\n", + " **flatten_server_metrics(raw_data.get(\"metrics\",{}).get(\"server_metrics\", {})),\n", + " }\n", + " sample_data['request_rate'] = sample_data['request_rate'] * raw_data['config']['num_models']\n", + " records.append(sample_data)\n", + " all_data = pd.DataFrame.from_records(records, index='file_name') if len(records) > 0 else pd.DataFrame()\n", + " return all_data\n", + "\n", + "def group_data(all_data, metrics=CORE_METRICS):\n", + " try:\n", + " data = all_data.sort_values(by=['request_rate'], ascending=True).copy().dropna()\n", + " except:\n", + " # print(\"No data found\")\n", + " return None\n", + "\n", + " # Ensure there is exactly one benchmark result per label and x-axis for each\n", + " # metric.\n", + " x_axes = set()\n", + " for m in metrics:\n", + " x_axes.add(m.x)\n", + "\n", + " for x in x_axes:\n", + " sizes = data.groupby(by=['label', x], dropna=True).size()\n", + " for index, v in sizes.items():\n", + " if v > 1:\n", + " label, _ = index\n", + " # print(f\"Multiple benchmark results for the same label ({label}), and x-axis ({x}). {index}: {v}. Please use more selective file filters.\")\n", + " # raise ValueError(f\"Multiple benchmark results for the same label ({label}), and x-axis ({x}). 
Please use more selective file filters.\")\n",
+    "\n",
+    "  # Group by label.\n",
+    "  groups = data.groupby(by=['label'],sort=True)\n",
+    "  return groups\n",
+    "\n",
+    "def init_plot(metrics, num_plots_per_row=NUM_PLOTS_PER_ROW):\n",
+    "  num_plots_per_row = min(num_plots_per_row, len(metrics))\n",
+    "  row_count = math.ceil(len(metrics) / num_plots_per_row)\n",
+    "  fig, axes = plt.subplots(nrows=row_count, ncols=num_plots_per_row, figsize=(20, 5*row_count), tight_layout=True)\n",
+    "  if row_count == 1 and num_plots_per_row == 1:\n",
+    "    axes = [axes]\n",
+    "  return fig, axes\n",
+    "\n",
+    "def plot_metrics(metrics, plot_func, num_plots_per_row=NUM_PLOTS_PER_ROW, fig=None, axes=None):\n",
+    "  \"\"\"\n",
+    "  plot_func: a function in the form of def plot_func(ax:~matplotlib.axes.Axes , m: XY):\n",
+    "  \"\"\"\n",
+    "  logger.debug(f'Plotting metrics: {metrics}')\n",
+    "  num_plots_per_row = min(num_plots_per_row, len(metrics))\n",
+    "  if fig is None or axes is None:\n",
+    "    logger.debug(f'Creating new figure and axes')\n",
+    "    fig, axes = init_plot(metrics, num_plots_per_row)\n",
+    "  row_count = math.ceil(len(metrics) / num_plots_per_row)\n",
+    "  for i, m in enumerate(metrics):\n",
+    "    row = math.floor(i/num_plots_per_row)\n",
+    "    col = i%num_plots_per_row\n",
+    "    if row_count == 1:\n",
+    "      curAx = axes[col]\n",
+    "    else:\n",
+    "      curAx = axes[row, col]\n",
+    "    plot_func(curAx, m)\n",
+    "  return fig, axes\n",
+    "\n",
+    "def plot_bar(labels, groups, metrics=CORE_METRICS, num_plots_per_row=NUM_PLOTS_PER_ROW, interactive=False, annotate=False):\n",
+    "  labels = [label.alias for label in labels]\n",
+    "  logger.debug(f'Printing bar chart for {labels}')\n",
+    "  logger.debug(f'groups: {groups}')\n",
+    "  dataframes = []\n",
+    "  for label in labels:\n",
+    "    try:\n",
+    "      dataframes.append(groups.get_group((label,)))\n",
+    "    except:\n",
+    "      logger.debug(f\"No data found for label {label}\")\n",
+    "      continue\n",
+    "  y_columns = [m.y for m in metrics]\n",
+    "  logger.debug(f'y_columns: {y_columns}')\n",
+    "  logger.debug(f'dataframes: {dataframes}')\n",
+    "\n",
+    "  # 1. Combine all request rates\n",
+    "  all_request_rates = set()\n",
+    "  for df in dataframes:\n",
+    "    all_request_rates.update(df['request_rate'].astype(int))\n",
+    "  all_request_rates = sorted(list(all_request_rates))\n",
+    "\n",
+    "  # 2. Prepare data for plotting: Create a nested dictionary\n",
+    "  plot_data = {y_col: {label: {} for label in labels} for y_col in y_columns}\n",
+    "\n",
+    "  for i, df in enumerate(dataframes):\n",
+    "    label = labels[i]\n",
+    "    df_dict = df.set_index('request_rate').to_dict()\n",
+    "    for y_col in y_columns:\n",
+    "      for request_rate in all_request_rates:\n",
+    "        plot_data[y_col][label][request_rate] = df_dict.get(y_col, {}).get(request_rate, np.nan)\n",
+    "\n",
+    "  logger.debug(f'Plot_data: {plot_data}')\n",
+    "\n",
+    "  # 3. 
Plotting\n", + " def plot_func(curAx, m):\n", + " num_request_rates = len(all_request_rates)\n", + " num_labels = len(labels)\n", + " x = np.arange(num_request_rates) # the label locations (x-axis positions)\n", + " width = 0.4 / num_labels # width of the bars\n", + "\n", + " for i, label in enumerate(labels):\n", + " bar_x = x - (width*num_labels)/2 + i*width + width/2\n", + " #Extract y-values to plot\n", + " y_values = [plot_data[m.y][label][rr] for rr in all_request_rates]\n", + "\n", + " rects = curAx.bar(bar_x, y_values, width, label=label)\n", + " if annotate:\n", + " for rect, val in zip(rects, y_values):\n", + " if not np.isnan(val):\n", + " height = rect.get_height()\n", + " curAx.annotate(f'{val:.2f}',\n", + " xy=(rect.get_x() + rect.get_width() / 2, height),\n", + " xytext=(0, 3), # 3 points vertical offset\n", + " textcoords=\"offset points\",\n", + " ha='center', va='bottom')\n", + " # Add labels, title, and legend\n", + " curAx.set_xlabel(m.x_label, fontsize=axis_label_fontsize)\n", + " curAx.set_ylabel(m.y_label, fontsize=axis_label_fontsize)\n", + " curAx.set_xticks(x)\n", + " curAx.set_xticklabels(all_request_rates)\n", + " curAx.tick_params(axis='both', labelsize=tick_label_fontsize)\n", + " curAx.legend(fontsize=legend_fontsize, loc='upper left', frameon=True, framealpha=0.8, edgecolor='black')\n", + " fig, axes = plot_metrics(metrics, plot_func, num_plots_per_row)\n", + " fig.tight_layout(rect=[0, 0.03, 1, 0.95])\n", + " plt.show()\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "height": 1000 + }, + "executionInfo": { + "elapsed": 2232, + "status": "ok", + "timestamp": 1741735855456, + "user": { + "displayName": "Cong Liu", + "userId": "18222691451061354557" + }, + "user_tz": 420 + }, + "id": "HbGEAOucb_Jn", + "outputId": "faf0304b-92f4-4fa7-ae71-83b8bd987e70" + }, + "outputs": [], + "source": [ + "#@title Plot Result\n", + "\n", + "pl = Plotter(run_id=RUN_ID, labels=[Label('inference-extension'),Label('k8s-svc')], output_dir=OUTPUT_DIR)\n", + "pl.plot_bar()" + ] + } + ], + "metadata": { + "colab": { + "last_runtime": { + "build_target": "", + "kind": "local" + }, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/tools/benchmark/download-benchmark-results.bash b/tools/benchmark/download-benchmark-results.bash new file mode 100755 index 00000000..6b9ca505 --- /dev/null +++ b/tools/benchmark/download-benchmark-results.bash @@ -0,0 +1,30 @@ +#!/bin/bash + +# Downloads the benchmark result files from the benchmark tool pod. 
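+# It assumes the benchmark tool signals completion by logging the sentinel string
+# "LPG_FINISHED", and that the results are the *.json files in the pod's working directory.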
+download_benchmark_results() {
+  until echo $(kubectl logs deployment/benchmark-tool -n ${namespace}) | grep -q -m 1 "LPG_FINISHED"; do sleep 30 ; done;
+  benchmark_pod=$(kubectl get pods -l app=benchmark-tool -n ${namespace} -o jsonpath="{.items[0].metadata.name}")
+  echo "Downloading JSON results from pod ${benchmark_pod}"
+  kubectl exec ${benchmark_pod} -n ${namespace} -- rm -f ShareGPT_V3_unfiltered_cleaned_split.json
+  for f in $(kubectl exec ${benchmark_pod} -n ${namespace} -- /bin/sh -c 'ls -1' | grep json); do
+    echo "Downloading json file ${f}"
+    kubectl cp -n ${namespace} ${benchmark_pod}:$f ${benchmark_output_dir}/results/json/$f;
+  done
+}
+
+# Env vars to be passed when calling this script.
+# The id of the benchmark. This is needed to identify what the benchmark is for.
+# It determines the filepath where results are saved, which the Jupyter notebook later uses
+# to assign the benchmark_id as data labels for plotting.
+benchmark_id=${benchmark_id:-"inference-extension"}
+# run_id can be used to group different runs of the same benchmarks for comparison.
+run_id=${run_id:-"default-run"}
+namespace=${namespace:-"default"}
+output_dir=${output_dir:-'output'}
+
+SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+benchmark_output_dir=${SCRIPT_DIR}/${output_dir}/${run_id}/${benchmark_id}
+
+echo "Saving benchmark results to ${benchmark_output_dir}/results/json/"
+download_benchmark_results
+kubectl delete -f ${SCRIPT_DIR}/../../config/manifests/benchmark/benchmark.yaml
\ No newline at end of file
diff --git a/tools/benchmark/requirements.txt b/tools/benchmark/requirements.txt
new file mode 100644
index 00000000..44974cf4
--- /dev/null
+++ b/tools/benchmark/requirements.txt
@@ -0,0 +1,3 @@
+pandas
+numpy
+matplotlib
\ No newline at end of file
diff --git a/tools/dashboards/README.md b/tools/dashboards/README.md
index c8258b63..7be2a5b8 100644
--- a/tools/dashboards/README.md
+++ b/tools/dashboards/README.md
@@ -4,7 +4,7 @@ This documentation provides instructions for setting up grafana dashboards to se
 
 ## Requirements
 
-Please follow [metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/ext-proc/metrics) page to configure the proxy to enable all metrics.
+Please follow the [metrics](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp/metrics) page to configure the proxy to enable all metrics.
 
 ## Load Inference Extension dashboard into Grafana
 
@@ -21,6 +21,7 @@ If you run the inference gateway with [Google Managed Prometheus](https://cloud.g
 
 Please configure the `scrape_interval` of your Prometheus configuration to be lower than `15s`; the `rate` function returns an empty result if data points are spaced too far apart. See https://www.robustperception.io/what-range-should-i-use-with-rate/ for more details.
Example: + ``` global: scrape_interval: 5s diff --git a/tools/dashboards/inference_gateway.json b/tools/dashboards/inference_gateway.json index 3af66703..cf00420d 100644 --- a/tools/dashboards/inference_gateway.json +++ b/tools/dashboards/inference_gateway.json @@ -28,7 +28,7 @@ }, "gridPos": { "h": 3, - "w": 23, + "w": 20, "x": 0, "y": 0 }, @@ -39,10 +39,10 @@ "showLineNumbers": false, "showMiniMap": false }, - "content": "# Inferece Gateway Dashboard\n\nPlease see https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/ext-proc/metrics for more details of underlying metrics used in the dashboard.", + "content": "# Inferece Gateway Dashboard\n\nPlease see https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp/metrics for more details of underlying metrics used in the dashboard.", "mode": "markdown" }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "title": "", "type": "text" }, @@ -54,15 +54,15 @@ "x": 0, "y": 3 }, - "id": 3, + "id": 15, "panels": [], - "title": "Inference Model", + "title": "Inference Pool", "type": "row" }, { "datasource": { "type": "prometheus", - "uid": "${DS_PROMETHEUS}" + "uid": "deap2an4eadc0d" }, "fieldConfig": { "defaults": { @@ -125,7 +125,7 @@ "x": 0, "y": 4 }, - "id": 1, + "id": 16, "options": { "legend": { "calcs": [], @@ -139,33 +139,27 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" - }, "disableTextWrap": false, "editorMode": "builder", - "exemplar": false, - "expr": "sum by(model_name, target_model_name) (rate(inference_model_request_total{}[$__rate_interval]))", + "expr": "sum by(name) (inference_pool_average_kv_cache_utilization)", "fullMetaSearch": false, "includeNullMetadata": true, - "interval": "", "legendFormat": "__auto", "range": true, "refId": "A", "useBackend": false } ], - "title": "Request / s", + "title": "Average KV Cache Utilization", "type": "timeseries" }, { "datasource": { "type": "prometheus", - "uid": "${DS_PROMETHEUS}" + "uid": "deap2an4eadc0d" }, "fieldConfig": { "defaults": { @@ -228,7 +222,7 @@ "x": 10, "y": 4 }, - "id": 2, + "id": 17, "options": { "legend": { "calcs": [], @@ -242,55 +236,36 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", + "expr": "sum by(name) (inference_pool_average_queue_size)", "fullMetaSearch": false, - "includeNullMetadata": false, - "legendFormat": "95%", + "includeNullMetadata": true, + "legendFormat": "__auto", "range": true, "refId": "A", "useBackend": false - }, - { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" - }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": false, - "legendFormat": "90%", - "range": true, - "refId": "B", - "useBackend": false - }, - { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" - }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": 
false, - "legendFormat": "50%", - "range": true, - "refId": "C", - "useBackend": false } ], - "title": "E2E Request Latency", + "title": "Average Queue Size", "type": "timeseries" }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 12 + }, + "id": 3, + "panels": [], + "title": "Inference Model", + "type": "row" + }, { "datasource": { "type": "prometheus", @@ -353,11 +328,11 @@ }, "gridPos": { "h": 8, - "w": 10, + "w": 20, "x": 0, - "y": 12 + "y": 13 }, - "id": 6, + "id": 2, "options": { "legend": { "calcs": [], @@ -371,12 +346,12 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "includeNullMetadata": false, "legendFormat": "95%", @@ -391,7 +366,7 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": false, @@ -407,7 +382,7 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_duration_seconds_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": false, @@ -417,7 +392,7 @@ "useBackend": false } ], - "title": "Request Size", + "title": "E2E Request Latency", "type": "timeseries" }, { @@ -483,10 +458,10 @@ "gridPos": { "h": 8, "w": 10, - "x": 10, - "y": 12 + "x": 0, + "y": 21 }, - "id": 7, + "id": 1, "options": { "legend": { "calcs": [], @@ -500,35 +475,8 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ - { - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "includeNullMetadata": false, - "legendFormat": "95%", - "range": true, - "refId": "A", - "useBackend": false - }, - { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" - }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": false, - "legendFormat": "90%", - "range": true, - "refId": "B", - "useBackend": false - }, { "datasource": { "type": "prometheus", @@ -536,17 +484,18 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", + "exemplar": false, + "expr": "sum by(model_name, target_model_name) (rate(inference_model_request_total{}[$__rate_interval]))", "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": false, - "legendFormat": "50%", + "includeNullMetadata": true, + "interval": "", + "legendFormat": "__auto", "range": true, - "refId": "C", + "refId": "A", "useBackend": false } ], - "title": 
"Response Size", + "title": "Request / s", "type": "timeseries" }, { @@ -612,10 +561,10 @@ "gridPos": { "h": 8, "w": 10, - "x": 0, - "y": 20 + "x": 10, + "y": 21 }, - "id": 8, + "id": 18, "options": { "legend": { "calcs": [], @@ -629,19 +578,8 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ - { - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "includeNullMetadata": false, - "legendFormat": "95%", - "range": true, - "refId": "A", - "useBackend": false - }, { "datasource": { "type": "prometheus", @@ -649,33 +587,18 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": false, - "legendFormat": "90%", - "range": true, - "refId": "B", - "useBackend": false - }, - { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" - }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", + "exemplar": false, + "expr": "sum by(error_code, model_name, target_model_name) (rate(inference_model_request_error_total[$__rate_interval]))", "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": false, - "legendFormat": "50%", + "includeNullMetadata": true, + "interval": "", + "legendFormat": "__auto", "range": true, - "refId": "C", + "refId": "A", "useBackend": false } ], - "title": "Input Token Count", + "title": "Request Error / s", "type": "timeseries" }, { @@ -741,10 +664,10 @@ "gridPos": { "h": 8, "w": 10, - "x": 10, - "y": 20 + "x": 0, + "y": 29 }, - "id": 9, + "id": 6, "options": { "legend": { "calcs": [], @@ -758,12 +681,12 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "includeNullMetadata": false, "legendFormat": "95%", @@ -778,7 +701,7 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": false, @@ -794,7 +717,7 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", + "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_request_sizes_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, "includeNullMetadata": false, @@ -804,22 +727,9 @@ "useBackend": false } ], - "title": "Output Token Count", + "title": "Request Size", "type": "timeseries" }, - { - "collapsed": false, - "gridPos": { - "h": 1, - "w": 24, - "x": 0, - "y": 28 - }, - "id": 10, - "panels": [], - "title": "vLLM", - "type": "row" - }, { "datasource": { "type": "prometheus", @@ -881,12 +791,12 @@ "overrides": [] }, "gridPos": { - "h": 7, + "h": 8, "w": 
10, - "x": 0, + "x": 10, "y": 29 }, - "id": 14, + "id": 7, "options": { "legend": { "calcs": [], @@ -900,15 +810,15 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "sum by(model_name) (rate(vllm:prompt_tokens_total[$__rate_interval]))", + "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", "fullMetaSearch": false, - "includeNullMetadata": true, - "legendFormat": "Prompt Tokens/Sec", + "includeNullMetadata": false, + "legendFormat": "95%", "range": true, "refId": "A", "useBackend": false @@ -920,17 +830,33 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "sum by(model_name) (rate(vllm:generation_tokens_total[$__rate_interval]))", + "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, - "includeNullMetadata": true, - "legendFormat": "Generation Tokens/Sec", + "includeNullMetadata": false, + "legendFormat": "90%", "range": true, "refId": "B", "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_response_sizes_bucket{}[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": false, + "legendFormat": "50%", + "range": true, + "refId": "C", + "useBackend": false } ], - "title": "Token Throughput", + "title": "Response Size", "type": "timeseries" }, { @@ -994,12 +920,12 @@ "overrides": [] }, "gridPos": { - "h": 7, + "h": 8, "w": 10, - "x": 10, - "y": 29 + "x": 0, + "y": 37 }, - "id": 11, + "id": 8, "options": { "legend": { "calcs": [], @@ -1013,14 +939,14 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "95%", "range": true, "refId": "A", @@ -1033,10 +959,10 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "90%", "range": true, "refId": "B", @@ -1049,17 +975,17 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_input_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "50%", "range": true, "refId": "C", "useBackend": false } ], - "title": "E2E Request Latency", + "title": "Input Token Count", "type": "timeseries" }, { @@ -1123,12 +1049,12 @@ "overrides": [] }, "gridPos": { - "h": 7, + "h": 8, "w": 10, - "x": 0, - "y": 36 + "x": 10, + "y": 37 }, - 
"id": 13, + "id": 9, "options": { "legend": { "calcs": [], @@ -1142,14 +1068,14 @@ "sort": "none" } }, - "pluginVersion": "11.5.0", + "pluginVersion": "11.5.2", "targets": [ { "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.95, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "95%", "range": true, "refId": "A", @@ -1162,10 +1088,10 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.9, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "90%", "range": true, "refId": "B", @@ -1178,147 +1104,532 @@ }, "disableTextWrap": false, "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "expr": "histogram_quantile(0.5, sum by(le) (rate(inference_model_output_tokens_bucket{}[$__rate_interval])))", "fullMetaSearch": false, "hide": false, - "includeNullMetadata": true, + "includeNullMetadata": false, "legendFormat": "50%", "range": true, "refId": "C", "useBackend": false } ], - "title": "Time Per Output Token Latency", + "title": "Output Token Count", "type": "timeseries" }, { - "datasource": { - "type": "prometheus", - "uid": "${DS_PROMETHEUS}" + "collapsed": true, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 45 }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "palette-classic" + "id": 10, + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" }, - "custom": { - "axisBorderShow": false, - "axisCenteredZero": false, - "axisColorMode": "text", - "axisLabel": "", - "axisPlacement": "auto", - "barAlignment": 0, - "barWidthFactor": 0.6, - "drawStyle": "line", - "fillOpacity": 0, - "gradientMode": "none", - "hideFrom": { - "legend": false, - "tooltip": false, - "viz": false - }, - "insertNulls": false, - "lineInterpolation": "linear", - "lineWidth": 1, - "pointSize": 5, - "scaleDistribution": { - "type": "linear" + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } }, - "showPoints": "auto", - "spanNulls": false, - "stacking": { - "group": "A", - "mode": "none" + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 10, + "x": 0, + "y": 52 + }, + "id": 14, + "options": { + "legend": { + 
"calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true }, - "thresholdsStyle": { - "mode": "off" + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" } }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null + "pluginVersion": "11.5.2", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "sum by(model_name) (rate(vllm:prompt_tokens_total[$__rate_interval]))", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "Prompt Tokens/Sec", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 7, - "w": 10, - "x": 10, - "y": 36 - }, - "id": 12, - "options": { - "legend": { - "calcs": [], - "displayMode": "list", - "placement": "bottom", - "showLegend": true + "disableTextWrap": false, + "editorMode": "builder", + "expr": "sum by(model_name) (rate(vllm:generation_tokens_total[$__rate_interval]))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "Generation Tokens/Sec", + "range": true, + "refId": "B", + "useBackend": false + } + ], + "title": "Token Throughput", + "type": "timeseries" }, - "tooltip": { - "hideZeros": false, - "mode": "single", - "sort": "none" - } - }, - "pluginVersion": "11.5.0", - "targets": [ { - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", - "fullMetaSearch": false, - "includeNullMetadata": true, - "legendFormat": "95%", - "range": true, - "refId": "A", - "useBackend": false + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 10, + "x": 10, + "y": 52 + }, + "id": 11, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "11.5.2", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "95%", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": 
"builder", + "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "90%", + "range": true, + "refId": "B", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:e2e_request_latency_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "50%", + "range": true, + "refId": "C", + "useBackend": false + } + ], + "title": "E2E Request Latency", + "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": true, - "legendFormat": "90%", - "range": true, - "refId": "B", - "useBackend": false + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 10, + "x": 0, + "y": 59 + }, + "id": 13, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "11.5.2", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "95%", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "90%", + "range": true, + "refId": "B", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_per_output_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "50%", + "range": true, + "refId": "C", + "useBackend": false + } + ], + "title": "Time Per Output Token 
Latency", + "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "${DS_PROMETHEUS}" }, - "disableTextWrap": false, - "editorMode": "builder", - "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", - "fullMetaSearch": false, - "hide": false, - "includeNullMetadata": true, - "legendFormat": "50%", - "range": true, - "refId": "C", - "useBackend": false + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 10, + "x": 10, + "y": 59 + }, + "id": 12, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "11.5.2", + "targets": [ + { + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.95, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "includeNullMetadata": true, + "legendFormat": "95%", + "range": true, + "refId": "A", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.9, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "90%", + "range": true, + "refId": "B", + "useBackend": false + }, + { + "datasource": { + "type": "prometheus", + "uid": "${DS_PROMETHEUS}" + }, + "disableTextWrap": false, + "editorMode": "builder", + "expr": "histogram_quantile(0.5, sum by(le) (rate(vllm:time_to_first_token_seconds_bucket[$__rate_interval])))", + "fullMetaSearch": false, + "hide": false, + "includeNullMetadata": true, + "legendFormat": "50%", + "range": true, + "refId": "C", + "useBackend": false + } + ], + "title": "Time To First Token Latency", + "type": "timeseries" } ], - "title": "Time To First Token Latency", - "type": "timeseries" + "title": "vLLM", + "type": "row" } ], "preload": false, @@ -1350,6 +1661,6 @@ "timezone": "browser", "title": "Inference Gateway", "uid": "aeap3g4ujefb4b", - "version": 16, + "version": 20, "weekStart": "" } diff --git a/tools/dynamic-lora-sidecar/Dockerfile b/tools/dynamic-lora-sidecar/Dockerfile index 4f6c743e..4faf360c 100644 --- a/tools/dynamic-lora-sidecar/Dockerfile +++ b/tools/dynamic-lora-sidecar/Dockerfile @@ -2,7 +2,7 @@ FROM python:3.9-slim-buster AS test WORKDIR /dynamic-lora-reconciler-test COPY requirements.txt . -COPY sidecar/* . 
+COPY sidecar/* ./

 RUN pip install -r requirements.txt
 RUN python -m unittest discover || exit 1

@@ -18,6 +18,6 @@ RUN pip install --upgrade pip
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt

-COPY sidecar/* .
+COPY sidecar/* ./

 CMD ["python", "sidecar.py"]
\ No newline at end of file
diff --git a/tools/dynamic-lora-sidecar/README.md b/tools/dynamic-lora-sidecar/README.md
index be05f9e9..65dc0d78 100644
--- a/tools/dynamic-lora-sidecar/README.md
+++ b/tools/dynamic-lora-sidecar/README.md
@@ -29,43 +29,110 @@ The sidecar uses the vLLM server's API to load or unload adapters based on the c

 ## Usage

+
 1. **Build the Docker Image:**

    ```bash
    docker build -t <image-name> .
+   ```
+
 2. **Create a configmap:**

-   ```bash
-   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
+   ```bash
+   kubectl create configmap name-of-your-configmap --from-file=your-file.yaml
+   ```
+
 3. **Mount the configmap and configure the sidecar in your pod**

-   ```yaml
-   volumeMounts: # DO NOT USE subPath
-   - name: config-volume
-     mountPath: /config
-   ```
-   Do not use subPath, since configmap updates are not reflected in the file
+   ```yaml
+   volumeMounts: # DO NOT USE subPath
+   - name: config-volume
+     mountPath: /config
+   ```
+   Do not use subPath, since ConfigMap updates are not reflected in the mounted file.
+
+## Command Line Arguments

-[deployment]: deployment.yaml it uses [sidecar](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/)(`initContainer` with `restartPolicy` set to `always`) which is beta feature enabled by default since k8s version 1.29. They need to be enabled in 1.28 and prior to 1.28 sidecar are not officially supported.
+The sidecar supports the following command-line arguments:
+
+- `--health-check-timeout`: Maximum time in seconds to wait for the vLLM server health check (default: 300)
+- `--health-check-interval`: Interval in seconds between health check attempts (default: 2)
+- `--reconcile-trigger`: Time in seconds between forced reconciliation runs (default: 5)
+- `--config`: Path to the config map file (default: value from the DYNAMIC_LORA_ROLLOUT_CONFIG env var, or "/config/configmap.yaml")
+- `--config-validation` / `--no-config-validation`: Enable or disable config validation (default: enabled)

 ## Configuration Fields
 - `vLLMLoRAConfig` [**required**] Base key
-- `host` [*optional*]Model server's host. defaults to localhost
+- `host` [*optional*] Model server's host. Defaults to `localhost`
 - `port` [*optional*] Model server's port. Defaults to `8000`
-- `name`[*optional*] Name of this config
-- `ensureExist`[*optional*] List of models to ensure existence on specified model server.
-  - `models`[**required**] [list]
-   - `base-model`[*optional*] Base model for lora adapter
-   - `id`[**required**] unique id of lora adapter
-   - `source`[**required**] path (remote or local) to lora adapter
+- `name` [*optional*] Name of this config
+- `defaultBaseModel` [*optional*] Default base model to use for all adapters when not specified individually
+- `ensureExist` [*optional*] List of models that must exist on the specified model server.
+  - `models` [**required**] [list]
+    - `id` [**required**] Unique id of the LoRA adapter
+    - `source` [**required**] Path (remote or local) to the LoRA adapter
+    - `base-model` [*optional*] Base model for the LoRA adapter (overrides defaultBaseModel)
 - `ensureNotExist` [*optional*] List of models that must not exist on the specified model server.
-  - `models`[**required**] [list]
-   - `id`[**required**] unique id of lora adapter
-   - `source`[**required**] path (remote or local) to lora adapter
-   - `base-model`[*optional*] Base model for lora adapter
-
-
+  - `models` [**required**] [list]
+    - `id` [**required**] Unique id of the LoRA adapter
+    - `source` [**required**] Path (remote or local) to the LoRA adapter
+    - `base-model` [*optional*] Base model for the LoRA adapter (overrides defaultBaseModel)
+
+## Example Configuration
+
+In this example, both adapters use `meta-llama/Llama-3.1-8B-Instruct` as their base model:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: vllm-llama3-8b-instruct-adapters
+data:
+  configmap.yaml: |
+    vLLMLoRAConfig:
+      name: vllm-llama3-8b
+      port: 8000
+      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
+      ensureExist:
+        models:
+        - id: food-review-1
+          source: Kawon/llama3.1-food-finetune_v14_r8
+        - id: food-review-2
+          source: Kawon/llama3.1-food-finetune_v14_r8
+```
+
+## Example Deployment
+
+The [deployment.yaml](deployment.yaml) file shows an example of deploying the sidecar with custom parameters:
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: dynamic-lora-reconciler
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: dynamic-lora-reconciler
+  template:
+    metadata:
+      labels:
+        app: dynamic-lora-reconciler
+    spec:
+      containers:
+      - name: reconciler
+        image: your-image:tag
+        command: ["python", "sidecar.py", "--health-check-timeout", "600", "--health-check-interval", "5", "--reconcile-trigger", "10"] # optional: set only to override the defaults
+        volumeMounts:
+        - name: config-volume
+          mountPath: /config
+      volumes:
+      - name: config-volume
+        configMap:
+          name: name-of-your-configmap
+```
+
+Note: This uses a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) (an `initContainer` with `restartPolicy` set to `Always`), a beta feature enabled by default since Kubernetes 1.29. The feature gate must be enabled explicitly in 1.28, and sidecar containers are not officially supported before 1.28.

 ## Screenshots & Testing
 The sidecar was tested with the Deployment and ConfigMap specified in this repo. Below are screen grabs of the logs from the sidecar and the vLLM server. You can verify that the adapters were loaded by querying `v1/models` and checking the vLLM logs.
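 For example, a minimal check along these lines (a sketch that assumes the server is reachable at `localhost:8000`; adjust the host and port for your deployment) lists everything currently registered:

 ```python
 # Sketch: list the models and LoRA adapters registered on a vLLM server.
 import requests

 resp = requests.get("http://localhost:8000/v1/models", timeout=10)
 resp.raise_for_status()
 for model in resp.json().get("data", []):
     # "id" is the served model or adapter name; "root" points at its source.
     print(f"{model['id']} (root: {model.get('root')})")
 ```

 Adapters from `ensureExist` should appear in the output, and adapters from `ensureNotExist` should be absent.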
-![lora-adapter-syncer](screenshots/lora-syncer-sidecar.png) -![config map change](screenshots/configmap-change.png) +![lora-adapter-syncer](screenshots/lora-syncer-logs.png) ![vllm-logs](screenshots/vllm-logs.png) diff --git a/tools/dynamic-lora-sidecar/deployment.yaml b/tools/dynamic-lora-sidecar/deployment.yaml index 9e9fc130..0c0c1781 100644 --- a/tools/dynamic-lora-sidecar/deployment.yaml +++ b/tools/dynamic-lora-sidecar/deployment.yaml @@ -32,7 +32,7 @@ spec: nvidia.com/gpu : 1 command: ["/bin/sh", "-c"] args: - - vllm serve meta-llama/Llama-2-7b-hf + - vllm serve meta-llama/Llama-3.1-8B-Instruct - --host=0.0.0.0 - --port=8000 - --tensor-parallel-size=1 @@ -66,7 +66,7 @@ spec: - name: lora-adapter-syncer tty: true stdin: true - image: + image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main restartPolicy: Always imagePullPolicy: Always env: @@ -106,22 +106,17 @@ metadata: data: configmap.yaml: | vLLMLoRAConfig: - host: modelServerHost name: sql-loras-llama - port: modelServerPort + defaultBaseModel: meta-llama/Llama-2-7b-hf ensureExist: models: - - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v1 + - id: sql-lora-v1 source: yard1/llama-2-7b-sql-lora-test - - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v3 + - id: sql-lora-v3 source: yard1/llama-2-7b-sql-lora-test - - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v4 + - id: sql-lora-v4 source: yard1/llama-2-7b-sql-lora-test ensureNotExist: models: - - base-model: meta-llama/Llama-2-7b-hf - id: sql-lora-v2 + - id: sql-lora-v2 source: yard1/llama-2-7b-sql-lora-test \ No newline at end of file diff --git a/tools/dynamic-lora-sidecar/screenshots/configmap-change.png b/tools/dynamic-lora-sidecar/screenshots/configmap-change.png deleted file mode 100644 index e17f060b..00000000 Binary files a/tools/dynamic-lora-sidecar/screenshots/configmap-change.png and /dev/null differ diff --git a/tools/dynamic-lora-sidecar/screenshots/lora-syncer-logs.png b/tools/dynamic-lora-sidecar/screenshots/lora-syncer-logs.png new file mode 100644 index 00000000..2cd7a4d6 Binary files /dev/null and b/tools/dynamic-lora-sidecar/screenshots/lora-syncer-logs.png differ diff --git a/tools/dynamic-lora-sidecar/screenshots/lora-syncer-sidecar.png b/tools/dynamic-lora-sidecar/screenshots/lora-syncer-sidecar.png deleted file mode 100644 index c7b90253..00000000 Binary files a/tools/dynamic-lora-sidecar/screenshots/lora-syncer-sidecar.png and /dev/null differ diff --git a/tools/dynamic-lora-sidecar/sidecar/sidecar.py b/tools/dynamic-lora-sidecar/sidecar/sidecar.py index 02070f3f..30724478 100644 --- a/tools/dynamic-lora-sidecar/sidecar/sidecar.py +++ b/tools/dynamic-lora-sidecar/sidecar/sidecar.py @@ -1,6 +1,7 @@ import requests import yaml import time +import argparse from jsonschema import validate from watchfiles import awatch from dataclasses import dataclass @@ -30,18 +31,35 @@ def current_time_human() -> str: return now.strftime("%Y-%m-%d %H:%M:%S %Z%z") +def parse_arguments(): + """Parse command line arguments.""" + parser = argparse.ArgumentParser(description='vLLM LoRA Adapter Reconciler') + parser.add_argument('--health-check-timeout', type=int, default=300, + help='Health check timeout in seconds (default: 300)') + parser.add_argument('--health-check-interval', type=int, default=2, + help='Health check interval in seconds (default: 2)') + parser.add_argument('--reconcile-trigger', type=int, default=5, + help='Reconciliation trigger interval in seconds (default: 5)') + 
parser.add_argument('--config', type=str, default=CONFIG_MAP_FILE,
+                        help=f'Path to config map file (default: {CONFIG_MAP_FILE})')
+    parser.add_argument('--config-validation', action=argparse.BooleanOptionalAction, default=True,
+                        help='Enable or disable config validation (default: enabled)')
+    return parser.parse_args()
+
+
 class FileChangeHandler(FileSystemEventHandler):
     """Custom event handler that handles file modifications."""

-    def __init__(self, reconciler):
+    def __init__(self, reconciler, config_file):
         super().__init__()
         self.reconciler = reconciler
+        self.config_file = config_file

     def on_modified(self, event):
         logging.info("modified!")
-        logging.info(f"Config '{CONFIG_MAP_FILE}' modified!")
+        logging.info(f"Config '{self.config_file}' modified!")
         self.reconciler.reconcile()
-        logging.info(f"model server reconcile to Config '{CONFIG_MAP_FILE}' !")
+        logging.info(f"model server reconciled to config '{self.config_file}'")


 @dataclass
@@ -65,10 +83,17 @@ class LoraReconciler:
     Reconciles adapters registered on vllm server with adapters listed in configmap in current state
     """

-    def __init__(self, config_validation=True):
-        self.health_check_timeout = datetime.timedelta(seconds=300)
-        self.health_check_interval = datetime.timedelta(seconds=15)
+    def __init__(self, config_file, health_check_timeout, health_check_interval,
+                 reconcile_trigger_seconds, config_validation=True):
+        self.config_file = config_file
         self.config_validation = config_validation
+        self.health_check_timeout = datetime.timedelta(seconds=health_check_timeout)
+        self.health_check_interval = datetime.timedelta(seconds=health_check_interval)
+        self.reconcile_trigger_seconds = reconcile_trigger_seconds
+
+        logging.info(f"Settings initialized: health check timeout={health_check_timeout}s, "
+                     f"interval={health_check_interval}s, "
+                     f"reconcile trigger={self.reconcile_trigger_seconds}s")

     def validate_config(self, c) -> bool:
         try:
@@ -77,14 +102,14 @@ def validate_config(self, c) -> bool:
             validate(instance=c, schema=schema)
             return True
         except Exception as e:
-            logging.error(f"Cannot load config {CONFIG_MAP_FILE} validation error: {e}")
+            logging.error(f"Cannot load config {self.config_file} validation error: {e}")
             return False

     @property
     def config(self):
         """Load configmap into memory"""
         try:
-            with open(CONFIG_MAP_FILE, "r") as f:
+            with open(self.config_file, "r") as f:
                 c = yaml.safe_load(f)
                 if self.config_validation and not self.validate_config(c):
                     return {}
@@ -93,7 +118,7 @@ def config(self):
                 c = c.get("vLLMLoRAConfig", {})
             return c
         except Exception as e:
-            logging.error(f"cannot load config {CONFIG_MAP_FILE} {e}")
+            logging.error(f"cannot load config {self.config_file} {e}")
             return {}

     @property
@@ -110,15 +135,24 @@ def port(self):
     def model_server(self):
         """Model server {host}:{port}"""
         return f"{self.host}:{self.port}"
+
+    @property
+    def default_base_model(self):
+        """Default base model to use when not specified at adapter level"""
+        return self.config.get("defaultBaseModel", "")

     @property
     def ensure_exist_adapters(self):
         """Lora adapters in config under key `ensureExist` in set"""
         adapters = self.config.get("ensureExist", {}).get("models", set())
+        default_model = self.default_base_model
+
         return set(
             [
                 LoraAdapter(
-                    adapter["id"], adapter["source"], adapter.get("base-model", "")
+                    adapter["id"],
+                    adapter["source"],
+                    adapter.get("base-model", default_model)
                 )
                 for adapter in adapters
             ]
@@ -128,10 +162,14 @@ def ensure_exist_adapters(self):
     def ensure_not_exist_adapters(self):
         """Lora adapters in config under key `ensureNotExist` in set"""
         adapters = self.config.get("ensureNotExist", {}).get("models", set())
+        default_model = self.default_base_model
+
         return set(
             [
                 LoraAdapter(
-                    adapter["id"], adapter["source"], adapter.get("base-model", "")
+                    adapter["id"],
+                    adapter["source"],
+                    adapter.get("base-model", default_model)
                 )
                 for adapter in adapters
             ]
@@ -215,8 +253,9 @@ def unload_adapter(self, adapter: LoraAdapter):
     def reconcile(self):
         """Reconciles model server with current version of configmap"""
         logging.info(
-            f"reconciling model server {self.model_server} with config stored at {CONFIG_MAP_FILE}"
+            f"reconciling model server {self.model_server} with config stored at {self.config_file}"
         )
+
         if not self.is_server_healthy:
             logging.error(f"vllm server at {self.model_server} not healthy")
             return
@@ -240,21 +279,40 @@ def reconcile(self):


 async def main():
-    reconciler_instance = LoraReconciler()
-    logging.info(f"Running initial reconcile for config map {CONFIG_MAP_FILE}")
+    args = parse_arguments()
+
+    # Resolve the config file path from the CLI args (falls back to CONFIG_MAP_FILE)
+    config_file = args.config
+
+    reconciler_instance = LoraReconciler(
+        config_file=config_file,
+        health_check_timeout=args.health_check_timeout,
+        health_check_interval=args.health_check_interval,
+        reconcile_trigger_seconds=args.reconcile_trigger,
+        config_validation=args.config_validation
+    )
+
+    logging.info(f"Running initial reconcile for config map {config_file}")
     reconciler_instance.reconcile()

-    event_handler = FileChangeHandler(reconciler_instance)
+    event_handler = FileChangeHandler(reconciler_instance, config_file)
     observer = Observer()
     observer.schedule(
-        event_handler, path=os.path.dirname(CONFIG_MAP_FILE), recursive=False
+        event_handler, path=os.path.dirname(config_file), recursive=False
     )
     observer.start()

     try:
-        logging.info(f"Starting to watch {CONFIG_MAP_FILE} for changes...")
+        logging.info(f"Starting to watch {config_file} for changes and performing periodic reconciliation...")
         while True:
-            await asyncio.sleep(1)
+            # Get current trigger interval from reconciler
+            trigger_seconds = reconciler_instance.reconcile_trigger_seconds
+            logging.info(f"Waiting {trigger_seconds}s before next reconciliation...")
+            # Wait for configured trigger interval
+            await asyncio.sleep(trigger_seconds)
+            # Force trigger reconciliation
+            logging.info("Periodic reconciliation triggered")
+            reconciler_instance.reconcile()
     except KeyboardInterrupt:
         logging.info("Stopped by user.")
         observer.stop()
@@ -262,4 +320,4 @@ async def main():


 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main())
\ No newline at end of file
diff --git a/tools/dynamic-lora-sidecar/sidecar/test_sidecar.py b/tools/dynamic-lora-sidecar/sidecar/test_sidecar.py
index 738c7449..59a60e6b 100644
--- a/tools/dynamic-lora-sidecar/sidecar/test_sidecar.py
+++ b/tools/dynamic-lora-sidecar/sidecar/test_sidecar.py
@@ -2,8 +2,10 @@ from unittest.mock import patch, Mock, mock_open, call
 import yaml
 import os
-from sidecar import LoraReconciler, CONFIG_MAP_FILE, BASE_FIELD, LoraAdapter
+import datetime
+from sidecar import LoraReconciler, LoraAdapter, CONFIG_MAP_FILE, BASE_FIELD

+# Shared fixture config used by the reconciler tests below
 TEST_CONFIG_DATA = {
     BASE_FIELD: {
         "host": "localhost",
@@ -12,17 +14,17 @@
         "ensureExist": {
             "models": [
                 {
-                    "base-model": "meta-llama/Llama-2-7b-hf",
+                    "base-model": "meta-llama/Llama-3.1-8B-Instruct",
                     "id": "sql-lora-v1",
                     "source": "yard1/llama-2-7b-sql-lora-test",
                 },
                 {
-                    "base-model": "meta-llama/Llama-2-7b-hf",
+                    "base-model": "meta-llama/Llama-3.1-8B-Instruct",
                     "id":
"sql-lora-v3", "source": "yard1/llama-2-7b-sql-lora-test", }, { - "base-model": "meta-llama/Llama-2-7b-hf", + "base-model": "meta-llama/Llama-3.1-8B-Instruct", "id": "already_exists", "source": "yard1/llama-2-7b-sql-lora-test", }, @@ -31,17 +33,17 @@ "ensureNotExist": { "models": [ { - "base-model": "meta-llama/Llama-2-7b-hf", + "base-model": "meta-llama/Llama-3.1-8B-Instruct", "id": "sql-lora-v2", "source": "yard1/llama-2-7b-sql-lora-test", }, { - "base-model": "meta-llama/Llama-2-7b-hf", + "base-model": "meta-llama/Llama-3.1-8B-Instruct", "id": "sql-lora-v3", "source": "yard1/llama-2-7b-sql-lora-test", }, { - "base-model": "meta-llama/Llama-2-7b-hf", + "base-model": "meta-llama/Llama-3.1-8B-Instruct", "id": "to_remove", "source": "yard1/llama-2-7b-sql-lora-test", }, @@ -49,13 +51,14 @@ }, } } + EXIST_ADAPTERS = [ - LoraAdapter(a["id"], a["base-model"], a["source"]) + LoraAdapter(a["id"], a["source"], a["base-model"]) for a in TEST_CONFIG_DATA[BASE_FIELD]["ensureExist"]["models"] ] NOT_EXIST_ADAPTERS = [ - LoraAdapter(a["id"], a["base-model"], a["source"]) + LoraAdapter(a["id"], a["source"], a["base-model"]) for a in TEST_CONFIG_DATA[BASE_FIELD]["ensureNotExist"]["models"] ] RESPONSES = { @@ -67,7 +70,7 @@ "object": "model", "created": 1729693000, "owned_by": "vllm", - "root": "meta-llama/Llama-2-7b-hf", + "root": "meta-llama/Llama-3.1-8B-Instruct", "parent": None, "max_model_len": 4096, }, @@ -101,7 +104,15 @@ def setUp(self, mock_get, mock_file): mock_response = getMockResponse() mock_response.json.return_value = RESPONSES["v1/models"] mock_get.return_value = mock_response - self.reconciler = LoraReconciler(False) + + # Create reconciler with command line argument values instead of config file values + self.reconciler = LoraReconciler( + config_file=CONFIG_MAP_FILE, + health_check_timeout=180, + health_check_interval=10, + reconcile_trigger_seconds=30, + config_validation=False + ) self.maxDiff = None @patch("sidecar.requests.get") @@ -167,20 +178,47 @@ def test_reconcile(self, mock_post, mock_get, mock_file): mock_get_response.json.return_value = RESPONSES["v1/models"] mock_get.return_value = mock_get_response mock_post.return_value = getMockResponse() - self.reconciler = LoraReconciler() - self.reconciler.reconcile() - # 1 adapter is in both exist and not exist list, only 2 are expected to be loaded - mock_load.assert_has_calls( - calls=[call(EXIST_ADAPTERS[0]), call(EXIST_ADAPTERS[2])] + # Create reconciler with command line argument values + self.reconciler = LoraReconciler( + config_file=CONFIG_MAP_FILE, + health_check_timeout=180, + health_check_interval=10, + reconcile_trigger_seconds=30, + config_validation=False ) - assert mock_load.call_count == 2 + self.reconciler.reconcile() - # 1 adapter is in both exist and not exist list, only 2 are expected to be unloaded - mock_unload.assert_has_calls( - calls=[call(NOT_EXIST_ADAPTERS[0]), call(NOT_EXIST_ADAPTERS[2])] - ) - assert mock_unload.call_count == 2 + # First check the call count + self.assertEqual(mock_load.call_count, 2, "Expected 2 load adapter calls") + self.assertEqual(mock_unload.call_count, 2, "Expected 2 unload adapter calls") + + # Check that the adapters with the correct IDs were loaded + loaded_ids = [call.args[0].id for call in mock_load.call_args_list] + self.assertIn("sql-lora-v1", loaded_ids, "sql-lora-v1 should have been loaded") + self.assertIn("already_exists", loaded_ids, "already_exists should have been loaded") + + # Check that the adapters with the correct IDs were unloaded + unloaded_ids = 
[call.args[0].id for call in mock_unload.call_args_list] + self.assertIn("sql-lora-v2", unloaded_ids, "sql-lora-v2 should have been unloaded") + self.assertIn("to_remove", unloaded_ids, "to_remove should have been unloaded") + + def test_health_check_settings(self): + """Test that health check settings are properly initialized from command line args""" + # Create reconciler with specific values + reconciler = LoraReconciler( + config_file=CONFIG_MAP_FILE, + health_check_timeout=240, + health_check_interval=15, + reconcile_trigger_seconds=45, + config_validation=False + ) + + # Check that values are properly set + self.assertEqual(reconciler.health_check_timeout, datetime.timedelta(seconds=240)) + self.assertEqual(reconciler.health_check_interval, datetime.timedelta(seconds=15)) + self.assertEqual(reconciler.reconcile_trigger_seconds, 45) + if __name__ == "__main__": - unittest.main() + unittest.main() \ No newline at end of file diff --git a/tools/dynamic-lora-sidecar/sidecar/validation.yaml b/tools/dynamic-lora-sidecar/sidecar/validation.yaml index 9dd98f87..30d23b7f 100644 --- a/tools/dynamic-lora-sidecar/sidecar/validation.yaml +++ b/tools/dynamic-lora-sidecar/sidecar/validation.yaml @@ -16,6 +16,9 @@ properties: name: type: string description: Name of this config + defaultBaseModel: + type: string + description: Default base model to use when not specified at adapter level ensureExist: type: object description: List of models to ensure existence on specified model server @@ -26,9 +29,9 @@ properties: items: type: object properties: - base_model: + base-model: type: string - description: Base model for LoRA adapter + description: Base model for LoRA adapter (overrides defaultBaseModel) id: type: string description: Unique ID of LoRA adapter @@ -50,9 +53,9 @@ properties: items: type: object properties: - base_model: + base-model: type: string - description: Base model for LoRA adapter + description: Base model for LoRA adapter (overrides defaultBaseModel) id: type: string description: Unique ID of LoRA adapter
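As a quick sanity check, the schema above can be exercised directly with `jsonschema`, the same library the sidecar uses. The snippet below is a sketch: it assumes `validation.yaml` is readable from the working directory, and the config it validates is illustrative.

```python
# Sketch: validate a dynamic LoRA config against the sidecar's schema,
# mirroring what LoraReconciler.validate_config does.
import yaml
from jsonschema import validate, ValidationError

with open("validation.yaml") as f:
    schema = yaml.safe_load(f)

config = {
    "vLLMLoRAConfig": {
        "name": "sql-loras-llama",
        "defaultBaseModel": "meta-llama/Llama-2-7b-hf",
        "ensureExist": {
            "models": [
                # No base-model here, so defaultBaseModel applies.
                {"id": "sql-lora-v1", "source": "yard1/llama-2-7b-sql-lora-test"},
                # An explicit base-model (hyphenated, per the corrected schema)
                # overrides the default.
                {
                    "id": "sql-lora-v3",
                    "source": "yard1/llama-2-7b-sql-lora-test",
                    "base-model": "meta-llama/Llama-3.1-8B-Instruct",
                },
            ]
        },
    }
}

try:
    validate(instance=config, schema=schema)
    print("config is valid")
except ValidationError as e:
    print(f"config rejected: {e.message}")
```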