Reduce the set of metrics exposed by the kubelet

### Background
In 1.12, the kubelet exposes a number of sources for metrics directly from [cAdvisor](https://github.com/google/cadvisor#cadvisor).  This includes:
 * [cAdvisor prometheus metrics](https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md#prometheus-metrics) at [`/metrics/cadvisor`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/server.go#L277)
 * [cAdvisor v1 JSON API](https://github.com/google/cadvisor/blob/master/info/v1/container.go#L126) at [`/stats/`, `/stats/container`, `/stats/{podName}/{containerName}`, and `/stats/{namespace}/{podName}/{uid}/{containerName}`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/stats/handler.go#L111)
 * [cAdvisor machine info](https://github.com/google/cadvisor/blob/master/info/v1/machine.go#L159) at [`/spec`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/server.go#L291)

The kubelet also exposes the [summary API](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/stats/v1alpha1/types.go#L24), which is not exposed directly by cAdvisor, but queries cAdvisor as one of its sources for metrics.
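For readers unfamiliar with the summary API, here is a minimal sketch of pulling the core CPU and memory figures out of a response. The JSON below is a hand-written sample with made-up values, trimmed to a few fields; `core_usage` is a hypothetical helper, not part of any Kubernetes client.

```python
import json

# Hand-written sample shaped like a summary API response (values are made up,
# and most fields are omitted for brevity).
SAMPLE_SUMMARY = """
{
  "node": {
    "nodeName": "node-1",
    "cpu": {"usageNanoCores": 250000000},
    "memory": {"workingSetBytes": 1073741824}
  },
  "pods": [
    {
      "podRef": {"name": "web-0", "namespace": "default"},
      "cpu": {"usageNanoCores": 50000000},
      "memory": {"workingSetBytes": 134217728}
    }
  ]
}
"""


def _cpu_and_memory(stats):
    """Convert raw stats to (CPU cores, working set in MiB); missing fields count as 0."""
    cores = stats.get("cpu", {}).get("usageNanoCores", 0) / 1e9
    mib = stats.get("memory", {}).get("workingSetBytes", 0) / (1024 * 1024)
    return (cores, mib)


def core_usage(summary):
    """Return {label: (cpu_cores, working_set_mib)} for the node and each pod."""
    result = {"node/" + summary["node"]["nodeName"]: _cpu_and_memory(summary["node"])}
    for pod in summary.get("pods", []):
        ref = pod["podRef"]
        label = "pod/{}/{}".format(ref["namespace"], ref["name"])
        result[label] = _cpu_and_memory(pod)
    return result


usage_by_object = core_usage(json.loads(SAMPLE_SUMMARY))
for label, (cores, mib) in sorted(usage_by_object.items()):
    print(label, cores, mib)
```

Everything else in a real response (network, filesystem, rlimit, and so on) falls under the non-core metrics discussed below.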

The [Monitoring Architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/monitoring_architecture.md) documentation describes the path for "core" metrics, and for "monitoring" metrics.  The [Core Metrics](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/core-metrics-pipeline.md#core-metrics-in-kubelet) proposal describes the set of metrics that we consider core, and their uses.  The motivation for the split architecture is:
 * To minimize the performance impact of stats collection for core metrics, allowing these to be collected more frequently
 * To make the monitoring pipeline replaceable and extensible.

### Current kubelet metrics that are not included in core metrics
 * Pod and Node-level Network Metrics
 * Persistent Volume Metrics
 * Container-level (Nvidia) GPU Metrics
 * Node-Level RLimit Metrics
 * Misc Memory Metrics (e.g. PageFaults)
 * Container, Pod, and Node-level Inode metrics (for ephemeral storage)
 * Container, Pod, and Node-level DiskIO metrics (from cAdvisor)

Deprecating and removing the Summary API will require out-of-tree sources for each of these metrics.  "Direct" cAdvisor endpoints are not often used, and have even been broken for multiple releases (https://github.com/kubernetes/kubernetes/pull/62544) without anyone raising an issue.

### Working Items
 * [x] [1.13] Introduce Kubelet `pod-resources` grpc endpoint; KEP: https://github.com/kubernetes/community/pull/2454
 * [x] [1.14] Introduce Kubelet Resource Metrics API
 * [x] [1.15] Deprecate the "direct" cAdvisor API endpoints by adding and deprecating a `--enable-cadvisor-json-endpoints` flag
 * [x] [1.18] Default the `--enable-cadvisor-json-endpoints` flag to disabled
 * [ ] [1.21] Remove the `--enable-cadvisor-json-endpoints` flag
 * [ ] [1.21] Transition Monitoring Server to Kubelet Resource Metrics API ([requires 3 versions skew](https://github.com/kubernetes/kubernetes/pull/67829#issuecomment-416873857))
 * [ ] [TBD] Propose out-of-tree replacements for kubelet monitoring endpoints
 * [ ] [TBD] Deprecate the Summary API and cAdvisor prometheus endpoints by adding and deprecating a `--enable-container-monitoring-endpoints` flag
 * [ ] [TBD+2] Remove "direct" cAdvisor API endpoints
 * [ ] [TBD+2] Default the `--enable-container-monitoring-endpoints` flag to disabled
 * [ ] [TBD+4] Remove the Summary API, cAdvisor prometheus metrics and remove the `--enable-container-monitoring-endpoints` flag.
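To illustrate how the two flags in the plan above would partition the endpoints, here is a rough model. It is only a sketch: `served_endpoints` is a hypothetical function, the endpoint sets are simplified from the Background list (the templated `/stats/{...}` paths are omitted, and `/stats/summary` is the path the Summary API is served on), and `--enable-container-monitoring-endpoints` is the flag proposed in this issue, not an existing one.

```python
# Simplified endpoint groups, assumed for illustration only.
CADVISOR_JSON_ENDPOINTS = frozenset({"/stats/", "/stats/container", "/spec"})
CONTAINER_MONITORING_ENDPOINTS = frozenset({"/stats/summary", "/metrics/cadvisor"})


def served_endpoints(enable_cadvisor_json_endpoints=False,
                     enable_container_monitoring_endpoints=True):
    """Return the set of monitoring endpoints served for a given flag combination.

    The defaults mirror the plan between 1.18 and the [TBD+2] step: the
    "direct" cAdvisor JSON endpoints are off, the proposed monitoring flag is on.
    """
    endpoints = set()
    if enable_cadvisor_json_endpoints:
        endpoints |= CADVISOR_JSON_ENDPOINTS
    if enable_container_monitoring_endpoints:
        endpoints |= CONTAINER_MONITORING_ENDPOINTS
    return endpoints


print(sorted(served_endpoints()))
print(sorted(served_endpoints(enable_container_monitoring_endpoints=False)))
```

Once both flags default to disabled and are then removed, only endpoints outside these two groups would remain.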

### Open Questions
 * Should the kubelet be a source for any monitoring metrics?
   * For example, metrics about the kubelet itself, or DiskIO metrics for empty-dir volumes (which are "owned" by the kubelet).
 * What will provide the metrics listed above once the kubelet no longer does?
   * cAdvisor can provide Network, RLimit, Misc Memory metrics, Inode metrics, and DiskIO metrics.
     * cAdvisor only works for some runtimes, but is a drop-in replacement for "direct" cAdvisor API endpoints
   * Container Runtimes can be a source for container-level Memory, Inode, Network and DiskIO metrics.
   * NVIDIA GPU metrics can be provided by a daemonset published by NVIDIA.
   * No source for Persistent Volume metrics?
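The coverage question above can be checked mechanically. The following sketch encodes the non-core metric list and the candidate sources exactly as named in this issue (the short labels and the `uncovered` helper are made up for illustration) and reports which metrics have no proposed replacement.

```python
# Non-core metrics from the list earlier in this issue, using short labels.
NON_CORE_METRICS = [
    "Network", "Persistent Volume", "GPU", "RLimit",
    "Misc Memory", "Inode", "DiskIO",
]

# Candidate out-of-tree sources, as suggested in the open questions.
CANDIDATE_SOURCES = {
    "cAdvisor": {"Network", "RLimit", "Misc Memory", "Inode", "DiskIO"},
    "container runtime": {"Memory", "Inode", "Network", "DiskIO"},
    "NVIDIA daemonset": {"GPU"},
}


def uncovered(metrics, sources):
    """Return the metrics (in input order) that no candidate source provides."""
    covered = set().union(*sources.values())
    return [m for m in metrics if m not in covered]


missing = uncovered(NON_CORE_METRICS, CANDIDATE_SOURCES)
print(missing)
```

With the sources listed so far, Persistent Volume metrics are the only gap.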

/sig node
/sig instrumentation
/kind feature
/priority important-longterm
cc @kubernetes/sig-node-proposals @kubernetes/sig-instrumentation-misc 
