Background
In 1.12, the kubelet exposes a number of sources for metrics directly from cAdvisor. This includes:
- cAdvisor Prometheus metrics at /metrics/cadvisor
- cAdvisor v1 JSON API at /stats/, /stats/container, /stats/{podName}/{containerName}, and /stats/{namespace}/{podName}/{uid}/{containerName}
- cAdvisor machine info at /spec
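As a rough illustration of the most specific v1 JSON path in the list above, the following sketch builds the request path from its four components. The helper name containerStatsPath and the example values are hypothetical and only serve the illustration; nothing here is taken from the kubelet source.

```go
package main

import (
	"fmt"
	"net/url"
)

// containerStatsPath is a hypothetical helper that builds the path
// /stats/{namespace}/{podName}/{uid}/{containerName} from its parts.
// Each segment is escaped so that unusual names cannot change the path shape.
func containerStatsPath(namespace, podName, uid, containerName string) string {
	return fmt.Sprintf("/stats/%s/%s/%s/%s",
		url.PathEscape(namespace),
		url.PathEscape(podName),
		url.PathEscape(uid),
		url.PathEscape(containerName))
}

func main() {
	// Example values are made up for illustration.
	fmt.Println(containerStatsPath("kube-system", "coredns-abc", "1234", "coredns"))
}
```

A client would append this path to the kubelet's base URL; authentication and TLS setup are deliberately left out of the sketch.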
The kubelet also exposes the summary API, which is not exposed directly by cAdvisor, but queries cAdvisor as one of its sources for metrics.
The Monitoring Architecture documentation describes the path for "core" metrics, and for "monitoring" metrics. The Core Metrics proposal describes the set of metrics that we consider core, and their uses. The motivation for the split architecture is:
- To minimize the performance impact of stats collection for core metrics, allowing these to be collected more frequently
- To make the monitoring pipeline replaceable and extensible.
Current kubelet metrics that are not included in core metrics
- Pod and Node-level Network Metrics
- Persistent Volume Metrics
- Container-level (Nvidia) GPU Metrics
- Node-Level RLimit Metrics
- Misc Memory Metrics (e.g. PageFaults)
- Container, Pod, and Node-level Inode metrics (for ephemeral storage)
- Container, Pod, and Node-level DiskIO metrics (from cAdvisor)
Deprecating and removing the Summary API will require out-of-tree sources for each of these metrics. "Direct" cAdvisor endpoints are not often used, and have even been broken for multiple releases (#62544) without anyone raising an issue.
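To make the dependency concrete, here is a minimal sketch of a consumer reading a few of the non-core node-level metrics listed above (network, page faults, inodes, rlimit) from a Summary API style response. The JSON field names are an assumed subset of the kubelet stats types and should be checked against the API version in use; the sample payload and its values are invented for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary mirrors a small, ASSUMED subset of a Summary API response.
// Missing fields simply decode to zero values.
type summary struct {
	Node struct {
		NodeName string `json:"nodeName"`
		Network  struct {
			RxBytes uint64 `json:"rxBytes"`
			TxBytes uint64 `json:"txBytes"`
		} `json:"network"`
		Memory struct {
			PageFaults uint64 `json:"pageFaults"`
		} `json:"memory"`
		Fs struct {
			InodesUsed uint64 `json:"inodesUsed"`
		} `json:"fs"`
		Rlimit struct {
			MaxPID  int64 `json:"maxpid"`
			CurProc int64 `json:"curproc"`
		} `json:"rlimit"`
	} `json:"node"`
}

// sample is a made-up payload, not real kubelet output.
const sample = `{
  "node": {
    "nodeName": "node-a",
    "network": {"rxBytes": 1024, "txBytes": 2048},
    "memory": {"pageFaults": 17},
    "fs": {"inodesUsed": 300},
    "rlimit": {"maxpid": 32768, "curproc": 412}
  }
}`

// parseSummary decodes the assumed subset from raw JSON bytes.
func parseSummary(data []byte) (summary, error) {
	var s summary
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := parseSummary([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s rx=%d tx=%d pageFaults=%d inodesUsed=%d procs=%d/%d\n",
		s.Node.NodeName, s.Node.Network.RxBytes, s.Node.Network.TxBytes,
		s.Node.Memory.PageFaults, s.Node.Fs.InodesUsed,
		s.Node.Rlimit.CurProc, s.Node.Rlimit.MaxPID)
}
```

Any consumer shaped like this would need an out-of-tree source for the same fields once the Summary API is removed.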
Working Items
- [1.13] Introduce Kubelet pod-resources grpc endpoint; KEP: Support Device Monitoring community#2454
- [1.14] Introduce Kubelet Resource Metrics API
- [1.15] Deprecate the "direct" cAdvisor API endpoints by adding and deprecating a --enable-cadvisor-json-endpoints flag
- [1.18] Default the --enable-cadvisor-json-endpoints flag to disabled
- [1.21] Remove the --enable-cadvisor-json-endpoints flag
- [1.21] Transition Monitoring Server to Kubelet Resource Metrics API (requires 3 versions skew)
- [TBD] Propose out-of-tree replacements for kubelet monitoring endpoints
- [TBD] Deprecate the Summary API and cAdvisor Prometheus endpoints by adding and deprecating a --enable-container-monitoring-endpoints flag
- [TBD+2] Remove "direct" cAdvisor API endpoints
- [TBD+2] Default the --enable-container-monitoring-endpoints flag to disabled
- [TBD+4] Remove the Summary API and cAdvisor Prometheus metrics, and remove the --enable-container-monitoring-endpoints flag.
Open Questions
- Should the kubelet be a source for any monitoring metrics?
  - For example, metrics about the kubelet itself, or DiskIO metrics for empty-dir volumes (which are "owned" by the kubelet).
- What will provide the metrics listed above, now that the kubelet no longer does?
  - cAdvisor can provide Network, RLimit, Misc Memory metrics, Inode metrics, and DiskIO metrics.
    - cAdvisor only works for some runtimes, but is a drop-in replacement for "direct" cAdvisor API endpoints
  - Container Runtimes can be a source for container-level Memory, Inode, Network and DiskIO metrics.
  - NVidia GPU metrics provided by a daemonset published by NVidia
  - No source for Persistent Volume metrics?
/sig node
/sig instrumentation
/kind feature
/priority important-longterm
cc @kubernetes/sig-node-proposals @kubernetes/sig-instrumentation-misc