Commit 3b73576

Documentation: driver-api: PM: Add cpuidle document
Replace the remaining documents under Documentation/cpuidle/ with one more complete governor and driver API document for cpuidle under Documentation/driver-api/pm/.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
1 parent b26bf6a commit 3b73576

File tree

5 files changed

+287
-68
lines changed

Documentation/cpuidle/driver.txt

Lines changed: 0 additions & 37 deletions
This file was deleted.

Documentation/cpuidle/governor.txt

Lines changed: 0 additions & 28 deletions
This file was deleted.

Documentation/driver-api/pm/cpuidle.rst

Lines changed: 282 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,282 @@
1+
.. |struct cpuidle_governor| replace:: :c:type:`struct cpuidle_governor <cpuidle_governor>`
2+
.. |struct cpuidle_device| replace:: :c:type:`struct cpuidle_device <cpuidle_device>`
3+
.. |struct cpuidle_driver| replace:: :c:type:`struct cpuidle_driver <cpuidle_driver>`
4+
.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>`
5+
6+
========================
7+
CPU Idle Time Management
8+
========================
9+
10+
::
11+
12+
Copyright (c) 2019 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
13+
14+
15+
CPU Idle Time Management Subsystem
16+
==================================
17+
18+
Every time one of the logical CPUs in the system (the entities that appear to
19+
fetch and execute instructions: hardware threads, if present, or processor
20+
cores) is idle after an interrupt or equivalent wakeup event, which means that
21+
there are no tasks to run on it except for the special "idle" task associated
22+
with it, there is an opportunity to save energy for the processor that it
23+
belongs to. That can be done by making the idle logical CPU stop fetching
24+
instructions from memory and putting some of the processor's functional units
25+
depended on by it into an idle state in which they will draw less power.
26+
27+
However, there may be multiple different idle states that can be used in such a
28+
situation in principle, so it may be necessary to find the most suitable one
29+
(from the kernel perspective) and ask the processor to use (or "enter") that
30+
particular idle state. That is the role of the CPU idle time management
31+
subsystem in the kernel, called ``CPUIdle``.
32+
33+
The design of ``CPUIdle`` is modular and based on the code duplication avoidance
34+
principle, so the generic code that in principle need not depend on the hardware
35+
or platform design details in it is separate from the code that interacts with
36+
the hardware. It generally is divided into three categories of functional
37+
units: *governors* responsible for selecting idle states to ask the processor
38+
to enter, *drivers* that pass the governors' decisions on to the hardware and
39+
the *core* providing a common framework for them.
40+
41+
42+
CPU Idle Time Governors
43+
=======================
44+
45+
A CPU idle time (``CPUIdle``) governor is a bundle of policy code invoked when
46+
one of the logical CPUs in the system turns out to be idle. Its role is to
47+
select an idle state to ask the processor to enter in order to save some energy.
48+
49+
``CPUIdle`` governors are generic and each of them can be used on any hardware
50+
platform that the Linux kernel can run on. For this reason, data structures
51+
operated on by them cannot depend on any hardware architecture or platform
52+
design details either.
53+
54+
The governor itself is represented by a |struct cpuidle_governor| object
55+
containing four callback pointers, :c:member:`enable`, :c:member:`disable`,
56+
:c:member:`select`, :c:member:`reflect`, a :c:member:`rating` field described
57+
below, and a name (string) used for identifying it.
58+
59+
For the governor to be available at all, that object needs to be registered
60+
with the ``CPUIdle`` core by calling :c:func:`cpuidle_register_governor()` with
61+
a pointer to it passed as the argument. If successful, that causes the core to
62+
add the governor to the global list of available governors and, if it is the
63+
only one in the list (that is, the list was empty before) or the value of its
64+
:c:member:`rating` field is greater than the value of that field for the
65+
governor currently in use, or the name of the new governor was passed to the
66+
kernel as the value of the ``cpuidle.governor=`` command line parameter, the new
67+
governor will be used from that point on (there can be only one ``CPUIdle``
68+
governor in use at a time). Also, if ``cpuidle_sysfs_switch`` is passed to the
69+
kernel in the command line, user space can choose the ``CPUIdle`` governor to
70+
use at run time via ``sysfs``.
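The replacement rule described above can be sketched in plain C. This is a user-space model and not kernel code: ``struct mock_governor``, ``mock_register_governor()`` and the ``param_name`` argument (standing in for the value of ``cpuidle.governor=``) are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the struct cpuidle_governor fields used here. */
struct mock_governor {
	const char *name;
	unsigned int rating;
};

/* The single governor in use; NULL models an empty list of governors. */
static const struct mock_governor *mock_current;

/*
 * Model of the decision made at registration time. param_name stands in
 * for the cpuidle.governor= value (NULL if the parameter was not given).
 * Returns 1 if the new governor becomes the one in use, 0 otherwise.
 */
static int mock_register_governor(const struct mock_governor *gov,
				  const char *param_name)
{
	int first = (mock_current == NULL);
	int better = !first && gov->rating > mock_current->rating;
	int requested = param_name && strcmp(param_name, gov->name) == 0;

	if (first || better || requested) {
		mock_current = gov;
		return 1;
	}
	return 0;
}
```

In this model a governor with a lower rating only takes over when its name matches the requested one, which mirrors the three conditions listed in the paragraph above.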
71+
72+
Once registered, ``CPUIdle`` governors cannot be unregistered, so it is not
73+
practical to put them into loadable kernel modules.
74+
75+
The interface between ``CPUIdle`` governors and the core consists of four
76+
callbacks:
77+
78+
:c:member:`enable`
79+
::
80+
81+
int (*enable) (struct cpuidle_driver *drv, struct cpuidle_device *dev);
82+
83+
The role of this callback is to prepare the governor for handling the
84+
(logical) CPU represented by the |struct cpuidle_device| object pointed
85+
to by the ``dev`` argument. The |struct cpuidle_driver| object pointed
86+
to by the ``drv`` argument represents the ``CPUIdle`` driver to be used
87+
with that CPU (among other things, it should contain the list of
88+
|struct cpuidle_state| objects representing idle states that the
89+
processor holding the given CPU can be asked to enter).
90+
91+
It may fail, in which case it is expected to return a negative error
92+
code, and that causes the kernel to run the architecture-specific
93+
default code for idle CPUs on the CPU in question instead of ``CPUIdle``
94+
until the ``->enable()`` governor callback is invoked for that CPU
95+
again.
96+
97+
:c:member:`disable`
98+
::
99+
100+
void (*disable) (struct cpuidle_driver *drv, struct cpuidle_device *dev);
101+
102+
Called to make the governor stop handling the (logical) CPU represented
103+
by the |struct cpuidle_device| object pointed to by the ``dev``
104+
argument.
105+
106+
It is expected to reverse any changes made by the ``->enable()``
107+
callback when it was last invoked for the target CPU, free all memory
108+
allocated by that callback and so on.
109+
110+
:c:member:`select`
111+
::
112+
113+
int (*select) (struct cpuidle_driver *drv, struct cpuidle_device *dev,
114+
bool *stop_tick);
115+
116+
Called to select an idle state for the processor holding the (logical)
117+
CPU represented by the |struct cpuidle_device| object pointed to by the
118+
``dev`` argument.
119+
120+
The list of idle states to take into consideration is represented by the
121+
:c:member:`states` array of |struct cpuidle_state| objects held by the
122+
|struct cpuidle_driver| object pointed to by the ``drv`` argument (which
123+
represents the ``CPUIdle`` driver to be used with the CPU at hand). The
124+
value returned by this callback is interpreted as an index into that
125+
array (unless it is a negative error code).
126+
127+
The ``stop_tick`` argument is used to indicate whether or not to stop
128+
the scheduler tick before asking the processor to enter the selected
129+
idle state. When the ``bool`` variable pointed to by it (which is set
130+
to ``true`` before invoking this callback) is cleared to ``false``, the
131+
processor will be asked to enter the selected idle state without
132+
stopping the scheduler tick on the given CPU (if the tick has been
133+
stopped on that CPU already, however, it will not be restarted before
134+
asking the processor to enter the idle state).
135+
136+
This callback is mandatory (i.e. the :c:member:`select` callback pointer
137+
in |struct cpuidle_governor| must not be ``NULL`` for the registration
138+
of the governor to succeed).
139+
140+
:c:member:`reflect`
141+
::
142+
143+
void (*reflect) (struct cpuidle_device *dev, int index);
144+
145+
Called to allow the governor to evaluate the accuracy of the idle state
146+
selection made by the ``->select()`` callback (when it was invoked last
147+
time) and possibly use the result of that to improve the accuracy of
148+
idle state selections in the future.
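The life cycle of the four callbacks can be illustrated with a small user-space model. It is only a sketch under stated assumptions: the ``mock_`` types and functions, the ``MOCK_NR_CPUS`` limit and the per-CPU ``last_index`` bookkeeping are hypothetical and merely follow the callback signatures quoted above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MOCK_NR_CPUS 4	/* hypothetical limit used by this model only */

/* Hypothetical stand-ins for the kernel objects named in the text. */
struct mock_device { unsigned int cpu; };
struct mock_driver { int state_count; };

/* Per-CPU governor data, allocated by ->enable(), freed by ->disable(). */
struct mock_cpu_data { int last_index; };
static struct mock_cpu_data *mock_data[MOCK_NR_CPUS];

static int mock_enable(struct mock_driver *drv, struct mock_device *dev)
{
	struct mock_cpu_data *d;

	if (dev->cpu >= MOCK_NR_CPUS || drv->state_count < 1)
		return -EINVAL;	/* failure is a negative error code */

	d = calloc(1, sizeof(*d));
	if (!d)
		return -ENOMEM;
	d->last_index = -1;
	mock_data[dev->cpu] = d;
	return 0;
}

static void mock_disable(struct mock_driver *drv, struct mock_device *dev)
{
	(void)drv;
	free(mock_data[dev->cpu]);	/* reverse what ->enable() did */
	mock_data[dev->cpu] = NULL;
}

static int mock_select(struct mock_driver *drv, struct mock_device *dev,
		       bool *stop_tick)
{
	(void)drv;
	(void)dev;
	*stop_tick = false;	/* ask for the tick to be left running */
	return 0;		/* index into the driver's states array */
}

static void mock_reflect(struct mock_device *dev, int index)
{
	/* Remember the outcome so later selections could use it. */
	mock_data[dev->cpu]->last_index = index;
}
```

The trivial ``mock_select()`` always picks index 0. A real governor would base the choice on the data it collects in its ``->reflect()`` callback and on the properties of the available idle states.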
149+
150+
In addition, ``CPUIdle`` governors are required to take power management
151+
quality of service (PM QoS) constraints on the processor wakeup latency into
152+
account when selecting idle states. In order to obtain the current effective
153+
PM QoS wakeup latency constraint for a given CPU, a ``CPUIdle`` governor is
154+
expected to pass the number of the CPU to
155+
:c:func:`cpuidle_governor_latency_req()`. Then, the governor's ``->select()``
156+
callback must not return the index of an idle state whose
157+
:c:member:`exit_latency` value is greater than the number returned by that
158+
function.
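The latency constraint can be expressed as a simple filter over the states array. The following user-space sketch assumes that ``latency_req`` holds the value obtained from the PM QoS query described above. ``struct mock_state`` and ``mock_deepest_allowed()`` are hypothetical names, and falling back to index 0 when no state fits is a choice made by this sketch.

```c
/* Hypothetical stand-in for the one struct cpuidle_state field used here. */
struct mock_state {
	unsigned int exit_latency;	/* microseconds */
};

/*
 * Return the highest index whose exit latency does not exceed latency_req.
 * Index 0 is returned if no state satisfies the constraint (an assumption
 * of this sketch, not a rule taken from the text).
 */
static int mock_deepest_allowed(const struct mock_state *states, int count,
				unsigned int latency_req)
{
	int i, best = 0;

	for (i = 0; i < count; i++)
		if (states[i].exit_latency <= latency_req)
			best = i;
	return best;
}
```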
159+
160+
161+
CPU Idle Time Management Drivers
162+
================================
163+
164+
CPU idle time management (``CPUIdle``) drivers provide an interface between the
165+
other parts of ``CPUIdle`` and the hardware.
166+
167+
First of all, a ``CPUIdle`` driver has to populate the :c:member:`states` array
168+
of |struct cpuidle_state| objects included in the |struct cpuidle_driver| object
169+
representing it. Going forward this array will represent the list of available
170+
idle states that the processor hardware can be asked to enter shared by all of
171+
the logical CPUs handled by the given driver.
172+
173+
The entries in the :c:member:`states` array are expected to be sorted by the
174+
value of the :c:member:`target_residency` field in |struct cpuidle_state| in
175+
ascending order (that is, index 0 should correspond to the idle state with
176+
the minimum value of :c:member:`target_residency`). [Since the
177+
:c:member:`target_residency` value is expected to reflect the "depth" of the
178+
idle state represented by the |struct cpuidle_state| object holding it, this
179+
sorting order should be the same as the ascending sorting order by the idle
180+
state "depth".]
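A driver author may want to check the expected ordering while developing a states table. The helper below is a minimal user-space sketch. ``struct mock_state`` and ``mock_states_sorted()`` are hypothetical and are not part of the ``CPUIdle`` API.

```c
/* Hypothetical stand-in for the one struct cpuidle_state field used here. */
struct mock_state {
	unsigned int target_residency;	/* microseconds */
};

/* Return 1 if the entries are in ascending target_residency order. */
static int mock_states_sorted(const struct mock_state *states, int count)
{
	int i;

	for (i = 1; i < count; i++)
		if (states[i].target_residency < states[i - 1].target_residency)
			return 0;
	return 1;
}
```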
181+
182+
Three fields in |struct cpuidle_state| are used by the existing ``CPUIdle``
183+
governors for computations related to idle state selection:
184+
185+
:c:member:`target_residency`
186+
Minimum time to spend in this idle state including the time needed to
187+
enter it (which may be substantial) to save more energy than could
188+
be saved by staying in a shallower idle state for the same amount of
189+
time, in microseconds.
190+
191+
:c:member:`exit_latency`
192+
Maximum time it will take a CPU asking the processor to enter this idle
193+
state to start executing the first instruction after a wakeup from it,
194+
in microseconds.
195+
196+
:c:member:`flags`
197+
Flags representing idle state properties. Currently, governors only use
198+
the ``CPUIDLE_FLAG_POLLING`` flag which is set if the given object
199+
does not represent a real idle state, but an interface to a software
200+
"loop" that can be used in order to avoid asking the processor to enter
201+
any idle state at all. [There are other flags used by the ``CPUIdle``
202+
core in special situations.]
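As an illustration of how :c:member:`target_residency` and :c:member:`flags` may enter a governor's computations, the user-space sketch below picks the deepest state that is worth entering for a predicted idle duration. All ``mock_`` names, the ``MOCK_FLAG_POLLING`` value and the ``predicted_us`` input are hypothetical. Existing governors use considerably more elaborate logic.

```c
#define MOCK_FLAG_POLLING 0x1	/* hypothetical stand-in for CPUIDLE_FLAG_POLLING */

/* Hypothetical stand-in for the struct cpuidle_state fields used here. */
struct mock_state {
	unsigned int target_residency;	/* microseconds */
	unsigned int flags;
};

/*
 * Pick the deepest state whose target residency fits into predicted_us.
 * The loop may stop early because the array is sorted by target_residency.
 */
static int mock_pick(const struct mock_state *s, int count,
		     unsigned int predicted_us)
{
	int i, best = 0;

	for (i = 1; i < count; i++) {
		if (s[i].target_residency > predicted_us)
			break;
		best = i;
	}
	return best;
}

/* True if the entry is a polling loop rather than a real idle state. */
static int mock_is_polling(const struct mock_state *s)
{
	return (s->flags & MOCK_FLAG_POLLING) != 0;
}
```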
203+
204+
The :c:member:`enter` callback pointer in |struct cpuidle_state|, which must not
205+
be ``NULL``, points to the routine to execute in order to ask the processor to
206+
enter this particular idle state:
207+
208+
::
209+
210+
int (*enter) (struct cpuidle_device *dev, struct cpuidle_driver *drv,
211+
int index);
212+
213+
The first two arguments of it point to the |struct cpuidle_device| object
214+
representing the logical CPU running this callback and the
215+
|struct cpuidle_driver| object representing the driver itself, respectively,
216+
and the last one is an index of the |struct cpuidle_state| entry in the driver's
217+
:c:member:`states` array representing the idle state to ask the processor to
218+
enter.
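The way the index argument ties a call to one entry of the states array can be modelled in user space as follows. This is a sketch only: the ``mock_`` types, the ``last_entered`` field and the convention that the callback returns the index it entered are assumptions made for this example.

```c
#include <errno.h>
#include <stddef.h>

struct mock_device;
struct mock_driver;

/* Hypothetical stand-ins for the kernel objects named in the text. */
struct mock_state {
	const char *name;
	int (*enter)(struct mock_device *dev, struct mock_driver *drv,
		     int index);
};

struct mock_device {
	unsigned int cpu;
	int last_entered;
};

struct mock_driver {
	struct mock_state states[4];
	int state_count;
};

/* A real driver would run hardware-specific idle entry code here. */
static int mock_enter_idle(struct mock_device *dev, struct mock_driver *drv,
			   int index)
{
	(void)drv;
	dev->last_entered = index;
	return index;
}

/* Validate the index, then call the ->enter() callback of that entry. */
static int mock_enter_state(struct mock_device *dev, struct mock_driver *drv,
			    int index)
{
	if (index < 0 || index >= drv->state_count ||
	    drv->states[index].enter == NULL)
		return -EINVAL;
	return drv->states[index].enter(dev, drv, index);
}
```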
219+
220+
The analogous ``->enter_s2idle()`` callback in |struct cpuidle_state| is used
221+
only for implementing the suspend-to-idle system-wide power management feature.
222+
The difference between it and ``->enter()`` is that it must not re-enable
223+
interrupts at any point (even temporarily) or attempt to change the states of
224+
clock event devices, which the ``->enter()`` callback may do sometimes.
225+
226+
Once the :c:member:`states` array has been populated, the number of valid
227+
entries in it has to be stored in the :c:member:`state_count` field of the
228+
|struct cpuidle_driver| object representing the driver. Moreover, if any
229+
entries in the :c:member:`states` array represent "coupled" idle states (that
230+
is, idle states that can only be asked for if multiple related logical CPUs are
231+
idle), the :c:member:`safe_state_index` field in |struct cpuidle_driver| needs
232+
to be the index of an idle state that is not "coupled" (that is, one that can be
233+
asked for if only one logical CPU is idle).
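The requirement on :c:member:`safe_state_index` can be written down as a small consistency check. The sketch below is a user-space model. ``MOCK_FLAG_COUPLED``, the ``mock_`` types and ``mock_safe_state_ok()`` are hypothetical names used only to illustrate the rule.

```c
#define MOCK_FLAG_COUPLED 0x2	/* hypothetical marker for "coupled" states */

/* Hypothetical stand-ins for the kernel objects named in the text. */
struct mock_state {
	unsigned int flags;
};

struct mock_driver {
	struct mock_state states[8];
	int state_count;
	int safe_state_index;
};

/* Return 1 if safe_state_index names a valid, non-coupled entry. */
static int mock_safe_state_ok(const struct mock_driver *drv)
{
	int i = drv->safe_state_index;

	if (i < 0 || i >= drv->state_count)
		return 0;
	return !(drv->states[i].flags & MOCK_FLAG_COUPLED);
}
```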
234+
235+
In addition to that, if the given ``CPUIdle`` driver is only going to handle a
236+
subset of logical CPUs in the system, the :c:member:`cpumask` field in its
237+
|struct cpuidle_driver| object must point to the set (mask) of CPUs that will be
238+
handled by it.
239+
240+
A ``CPUIdle`` driver can only be used after it has been registered. If there
241+
are no "coupled" idle state entries in the driver's :c:member:`states` array,
242+
that can be accomplished by passing the driver's |struct cpuidle_driver| object
243+
to :c:func:`cpuidle_register_driver()`. Otherwise, :c:func:`cpuidle_register()`
244+
should be used for this purpose.
245+
246+
However, it also is necessary to register |struct cpuidle_device| objects for
247+
all of the logical CPUs to be handled by the given ``CPUIdle`` driver with the
248+
help of :c:func:`cpuidle_register_device()` after the driver has been registered
249+
and :c:func:`cpuidle_register_driver()`, unlike :c:func:`cpuidle_register()`,
250+
does not do that automatically. For this reason, the drivers that use
251+
:c:func:`cpuidle_register_driver()` to register themselves must also take care
252+
of registering the |struct cpuidle_device| objects as needed, so it is generally
253+
recommended to use :c:func:`cpuidle_register()` for ``CPUIdle`` driver
254+
registration in all cases.
255+
256+
The registration of a |struct cpuidle_device| object causes the ``CPUIdle``
257+
``sysfs`` interface to be created and the governor's ``->enable()`` callback to
258+
be invoked for the logical CPU represented by it, so it must take place after
259+
registering the driver that will handle the CPU in question.
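The ordering constraint between driver and device registration can be modelled in user space as shown below. This is a sketch and not the kernel implementation: the ``mock_`` functions only imitate the roles of the registration calls named above, and the error codes they return are choices made for this example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects named in the text. */
struct mock_driver {
	int registered;
	int nr_devices;
};

struct mock_device {
	struct mock_driver *drv;
	int registered;
};

static int mock_register_driver(struct mock_driver *drv)
{
	if (drv->registered)
		return -EBUSY;
	drv->registered = 1;
	return 0;
}

/* A device can only be registered once its driver has been registered. */
static int mock_register_device(struct mock_device *dev,
				struct mock_driver *drv)
{
	if (!drv->registered)
		return -ENODEV;
	dev->drv = drv;
	dev->registered = 1;
	drv->nr_devices++;
	return 0;
}

/* Combined helper: the driver first, then one device per CPU. */
static int mock_register_all(struct mock_driver *drv,
			     struct mock_device *devs, int n)
{
	int i, ret = mock_register_driver(drv);

	for (i = 0; !ret && i < n; i++)
		ret = mock_register_device(&devs[i], drv);
	return ret;
}
```

The combined helper plays the part that the text attributes to :c:func:`cpuidle_register()`, which is why using a single call for both steps is the less error-prone option.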
260+
261+
``CPUIdle`` drivers and |struct cpuidle_device| objects can be unregistered
262+
when they are no longer needed, which allows some resources associated with
263+
them to be released. Due to dependencies between them, all of the
264+
|struct cpuidle_device| objects representing CPUs handled by the given
265+
``CPUIdle`` driver must be unregistered, with the help of
266+
:c:func:`cpuidle_unregister_device()`, before calling
267+
:c:func:`cpuidle_unregister_driver()` to unregister the driver. Alternatively,
268+
:c:func:`cpuidle_unregister()` can be called to unregister a ``CPUIdle`` driver
269+
along with all of the |struct cpuidle_device| objects representing CPUs handled
270+
by it.
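The reverse ordering required for unregistration can be modelled in the same way. In the user-space sketch below the ``mock_`` functions return error codes so that a wrong ordering is visible. That is a modelling choice of this example and is not meant to describe the return types of the kernel functions named above.

```c
#include <errno.h>

/* Hypothetical stand-in for the driver bookkeeping used by this model. */
struct mock_driver {
	int registered;
	int nr_devices;
};

static int mock_unregister_device(struct mock_driver *drv)
{
	if (drv->nr_devices == 0)
		return -ENODEV;
	drv->nr_devices--;
	return 0;
}

/* The driver may only go away after all of its devices are gone. */
static int mock_unregister_driver(struct mock_driver *drv)
{
	if (drv->nr_devices > 0)
		return -EBUSY;
	drv->registered = 0;
	return 0;
}

/* Combined helper: all devices first, then the driver itself. */
static int mock_unregister_all(struct mock_driver *drv)
{
	while (drv->nr_devices > 0)
		mock_unregister_device(drv);
	return mock_unregister_driver(drv);
}
```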
271+
272+
``CPUIdle`` drivers can respond to runtime system configuration changes that
273+
lead to modifications of the list of available processor idle states (which can
274+
happen, for example, when the system's power source is switched from AC to
275+
battery or the other way around). Upon a notification of such a change,
276+
a ``CPUIdle`` driver is expected to call :c:func:`cpuidle_pause_and_lock()` to
277+
turn ``CPUIdle`` off temporarily and then :c:func:`cpuidle_disable_device()` for
278+
all of the |struct cpuidle_device| objects representing CPUs affected by that
279+
change. Next, it can update its :c:member:`states` array in accordance with
280+
the new configuration of the system, call :c:func:`cpuidle_enable_device()` for
281+
all of the relevant |struct cpuidle_device| objects and invoke
282+
:c:func:`cpuidle_resume_and_unlock()` to allow ``CPUIdle`` to be used again.
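The update sequence described above can be summarized by the following user-space sketch. The ``mock_`` functions only imitate the roles of the pause, disable, enable and resume calls named in the text. Their signatures, the ``MOCK_MAX_STATES`` limit and the ``mock_paused`` flag are assumptions made for this example.

```c
#define MOCK_MAX_STATES 4	/* hypothetical array size for this model */

/* Hypothetical stand-ins for the kernel objects named in the text. */
struct mock_state { unsigned int target_residency; };
struct mock_device { int enabled; };
struct mock_driver {
	struct mock_state states[MOCK_MAX_STATES];
	int state_count;
};

static int mock_paused;

static void mock_pause_and_lock(void) { mock_paused = 1; }
static void mock_resume_and_unlock(void) { mock_paused = 0; }
static void mock_disable_device(struct mock_device *dev) { dev->enabled = 0; }
static void mock_enable_device(struct mock_device *dev) { dev->enabled = 1; }

/* Replace the states table; returns 0 on success, -1 on a bad count. */
static int mock_update_states(struct mock_driver *drv,
			      struct mock_device *devs, int ndevs,
			      const struct mock_state *new_states,
			      int new_count)
{
	int i;

	if (new_count < 1 || new_count > MOCK_MAX_STATES)
		return -1;

	mock_pause_and_lock();			/* 1. turn idle handling off */
	for (i = 0; i < ndevs; i++)
		mock_disable_device(&devs[i]);	/* 2. disable affected CPUs */

	for (i = 0; i < new_count; i++)
		drv->states[i] = new_states[i];	/* 3. update the table */
	drv->state_count = new_count;

	for (i = 0; i < ndevs; i++)
		mock_enable_device(&devs[i]);	/* 4. enable them again */
	mock_resume_and_unlock();		/* 5. allow idle handling */
	return 0;
}
```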

Documentation/driver-api/pm/index.rst

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,10 @@
1-
=======================
2-
Device Power Management
3-
=======================
1+
===============================
2+
CPU and Device Power Management
3+
===============================
44

55
.. toctree::
66

7+
cpuidle
78
devices
89
notifiers
910
types

MAINTAINERS

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4016,6 +4016,7 @@ S: Maintained
40164016
T: git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git
40174017
B: https://bugzilla.kernel.org
40184018
F: Documentation/admin-guide/pm/cpuidle.rst
4019+
F: Documentation/driver-api/pm/cpuidle.rst
40194020
F: drivers/cpuidle/*
40204021
F: include/linux/cpuidle.h
40214022
