|
| 1 | +.. |struct cpuidle_governor| replace:: :c:type:`struct cpuidle_governor <cpuidle_governor>` |
| 2 | +.. |struct cpuidle_device| replace:: :c:type:`struct cpuidle_device <cpuidle_device>` |
| 3 | +.. |struct cpuidle_driver| replace:: :c:type:`struct cpuidle_driver <cpuidle_driver>` |
| 4 | +.. |struct cpuidle_state| replace:: :c:type:`struct cpuidle_state <cpuidle_state>` |
| 5 | + |
| 6 | +======================== |
| 7 | +CPU Idle Time Management |
| 8 | +======================== |
| 9 | + |
| 10 | +:: |
| 11 | + |
| 12 | + Copyright (c) 2019 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> |
| 13 | + |
| 14 | + |
| 15 | +CPU Idle Time Management Subsystem |
| 16 | +================================== |
| 17 | + |
| 18 | +Every time one of the logical CPUs in the system (the entities that appear to |
| 19 | +fetch and execute instructions: hardware threads, if present, or processor |
| 20 | +cores) is idle after an interrupt or equivalent wakeup event, which means that |
| 21 | +there are no tasks to run on it except for the special "idle" task associated |
| 22 | +with it, there is an opportunity to save energy for the processor that it |
| 23 | +belongs to. That can be done by making the idle logical CPU stop fetching |
| 24 | +instructions from memory and putting some of the processor's functional units |
| 25 | +depended on by it into an idle state in which they will draw less power. |
| 26 | + |
| 27 | +However, there may be multiple different idle states that can be used in such a |
| 28 | +situation in principle, so it may be necessary to find the most suitable one |
| 29 | +(from the kernel perspective) and ask the processor to use (or "enter") that |
| 30 | +particular idle state. That is the role of the CPU idle time management |
| 31 | +subsystem in the kernel, called ``CPUIdle``. |
| 32 | + |
| 33 | +The design of ``CPUIdle`` is modular and based on the code duplication avoidance |
| 34 | +principle, so the generic code that, in principle, need not depend on hardware |
| 35 | +or platform design details is kept separate from the code that interacts with |
| 36 | +the hardware. It generally is divided into three categories of functional |
| 37 | +units: *governors* responsible for selecting idle states to ask the processor |
| 38 | +to enter, *drivers* that pass the governors' decisions on to the hardware and |
| 39 | +the *core* providing a common framework for them. |
| 40 | + |
| 41 | + |
| 42 | +CPU Idle Time Governors |
| 43 | +======================= |
| 44 | + |
| 45 | +A CPU idle time (``CPUIdle``) governor is a bundle of policy code invoked when |
| 46 | +one of the logical CPUs in the system turns out to be idle. Its role is to |
| 47 | +select an idle state to ask the processor to enter in order to save some energy. |
| 48 | + |
| 49 | +``CPUIdle`` governors are generic and each of them can be used on any hardware |
| 50 | +platform that the Linux kernel can run on. For this reason, data structures |
| 51 | +operated on by them cannot depend on any hardware architecture or platform |
| 52 | +design details either. |
| 53 | + |
| 54 | +The governor itself is represented by a |struct cpuidle_governor| object |
| 55 | +containing four callback pointers (:c:member:`enable`, :c:member:`disable`, |
| 56 | +:c:member:`select` and :c:member:`reflect`), a :c:member:`rating` field described |
| 57 | +below, and a name (string) used for identifying it. |
| 58 | + |
| 59 | +For the governor to be available at all, that object needs to be registered |
| 60 | +with the ``CPUIdle`` core by calling :c:func:`cpuidle_register_governor()` with |
| 61 | +a pointer to it passed as the argument. If successful, that causes the core to |
| 62 | +add the governor to the global list of available governors and, if it is the |
| 63 | +only one in the list (that is, the list was empty before) or the value of its |
| 64 | +:c:member:`rating` field is greater than the value of that field for the |
| 65 | +governor currently in use, or the name of the new governor was passed to the |
| 66 | +kernel as the value of the ``cpuidle.governor=`` command line parameter, the new |
| 67 | +governor will be used from that point on (there can be only one ``CPUIdle`` |
| 68 | +governor in use at a time). Also, if ``cpuidle_sysfs_switch`` is passed to the |
| 69 | +kernel in the command line, user space can choose the ``CPUIdle`` governor to |
| 70 | +use at run time via ``sysfs``. |
| 71 | + |
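The three conditions under which a newly registered governor takes over can be sketched in user-space C as below. This is only an illustration of the rules stated above, not the kernel's code: ``struct mock_governor``, ``should_switch()`` and the ``param_name`` argument (standing in for the value of ``cpuidle.governor=``) are hypothetical names, and the sketch does not model how the core resolves a conflict between a command line choice and a higher rating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct cpuidle_governor (name and rating only). */
struct mock_governor {
	const char *name;
	unsigned int rating;
};

/*
 * Decide whether new_gov should replace cur_gov, following the text:
 *  - no governor is in use yet (cur_gov == NULL), or
 *  - the name of new_gov was requested on the command line, or
 *  - new_gov has a greater rating than the governor in use.
 * param_name models cpuidle.governor= and is NULL if it was not passed.
 */
static bool should_switch(const struct mock_governor *cur_gov,
			  const struct mock_governor *new_gov,
			  const char *param_name)
{
	assert(new_gov != NULL && new_gov->name != NULL);

	if (cur_gov == NULL)
		return true;
	if (param_name != NULL && strcmp(param_name, new_gov->name) == 0)
		return true;
	return new_gov->rating > cur_gov->rating;
}
```

With two mock governors rated 10 and 20, the second replaces the first on registration, while the first replaces the second only if its name is passed as ``param_name``.
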
| 72 | +Once registered, ``CPUIdle`` governors cannot be unregistered, so it is not |
| 73 | +practical to put them into loadable kernel modules. |
| 74 | + |
| 75 | +The interface between ``CPUIdle`` governors and the core consists of four |
| 76 | +callbacks: |
| 77 | + |
| 78 | +:c:member:`enable` |
| 79 | + :: |
| 80 | + |
| 81 | + int (*enable) (struct cpuidle_driver *drv, struct cpuidle_device *dev); |
| 82 | + |
| 83 | + The role of this callback is to prepare the governor for handling the |
| 84 | + (logical) CPU represented by the |struct cpuidle_device| object pointed |
| 85 | + to by the ``dev`` argument. The |struct cpuidle_driver| object pointed |
| 86 | + to by the ``drv`` argument represents the ``CPUIdle`` driver to be used |
| 87 | + with that CPU (among other things, it should contain the list of |
| 88 | + |struct cpuidle_state| objects representing idle states that the |
| 89 | + processor holding the given CPU can be asked to enter). |
| 90 | + |
| 91 | + It may fail, in which case it is expected to return a negative error |
| 92 | + code, and that causes the kernel to run the architecture-specific |
| 93 | + default code for idle CPUs on the CPU in question instead of ``CPUIdle`` |
| 94 | + until the ``->enable()`` governor callback is invoked for that CPU |
| 95 | + again. |
| 96 | + |
| 97 | +:c:member:`disable` |
| 98 | + :: |
| 99 | + |
| 100 | + void (*disable) (struct cpuidle_driver *drv, struct cpuidle_device *dev); |
| 101 | + |
| 102 | + Called to make the governor stop handling the (logical) CPU represented |
| 103 | + by the |struct cpuidle_device| object pointed to by the ``dev`` |
| 104 | + argument. |
| 105 | + |
| 106 | + It is expected to reverse any changes made by the ``->enable()`` |
| 107 | + callback when it was last invoked for the target CPU, free all memory |
| 108 | + allocated by that callback and so on. |
| 109 | + |
| 110 | +:c:member:`select` |
| 111 | + :: |
| 112 | + |
| 113 | + int (*select) (struct cpuidle_driver *drv, struct cpuidle_device *dev, |
| 114 | + bool *stop_tick); |
| 115 | + |
| 116 | + Called to select an idle state for the processor holding the (logical) |
| 117 | + CPU represented by the |struct cpuidle_device| object pointed to by the |
| 118 | + ``dev`` argument. |
| 119 | + |
| 120 | + The list of idle states to take into consideration is represented by the |
| 121 | + :c:member:`states` array of |struct cpuidle_state| objects held by the |
| 122 | + |struct cpuidle_driver| object pointed to by the ``drv`` argument (which |
| 123 | + represents the ``CPUIdle`` driver to be used with the CPU at hand). The |
| 124 | + value returned by this callback is interpreted as an index into that |
| 125 | + array (unless it is a negative error code). |
| 126 | + |
| 127 | + The ``stop_tick`` argument is used to indicate whether or not to stop |
| 128 | + the scheduler tick before asking the processor to enter the selected |
| 129 | + idle state. When the ``bool`` variable pointed to by it (which is set |
| 130 | + to ``true`` before invoking this callback) is cleared to ``false``, the |
| 131 | + processor will be asked to enter the selected idle state without |
| 132 | + stopping the scheduler tick on the given CPU (if the tick has been |
| 133 | + stopped on that CPU already, however, it will not be restarted before |
| 134 | + asking the processor to enter the idle state). |
| 135 | + |
| 136 | + This callback is mandatory (i.e. the :c:member:`select` callback pointer |
| 137 | + in |struct cpuidle_governor| must not be ``NULL`` for the registration |
| 138 | + of the governor to succeed). |
| 139 | + |
| 140 | +:c:member:`reflect` |
| 141 | + :: |
| 142 | + |
| 143 | + void (*reflect) (struct cpuidle_device *dev, int index); |
| 144 | + |
| 145 | + Called to allow the governor to evaluate the accuracy of the idle state |
| 146 | + selection made by the ``->select()`` callback (when it was last invoked) |
| 147 | + and possibly use the result of that to improve the accuracy of |
| 148 | + idle state selections in the future. |
| 149 | + |
| 150 | +In addition, ``CPUIdle`` governors are required to take power management |
| 151 | +quality of service (PM QoS) constraints on the processor wakeup latency into |
| 152 | +account when selecting idle states. In order to obtain the current effective |
| 153 | +PM QoS wakeup latency constraint for a given CPU, a ``CPUIdle`` governor is |
| 154 | +expected to pass the number of the CPU to |
| 155 | +:c:func:`cpuidle_governor_latency_req()`. Then, the governor's ``->select()`` |
| 156 | +callback must not return the index of an idle state whose |
| 157 | +:c:member:`exit_latency` value is greater than the number returned by that |
| 158 | +function. |
| 159 | + |
| 160 | + |
| 161 | +CPU Idle Time Management Drivers |
| 162 | +================================ |
| 163 | + |
| 164 | +CPU idle time management (``CPUIdle``) drivers provide an interface between the |
| 165 | +other parts of ``CPUIdle`` and the hardware. |
| 166 | + |
| 167 | +First of all, a ``CPUIdle`` driver has to populate the :c:member:`states` array |
| 168 | +of |struct cpuidle_state| objects included in the |struct cpuidle_driver| object |
| 169 | +representing it. Going forward, this array will represent the list of available |
| 170 | +idle states that the processor hardware can be asked to enter, shared by all of |
| 171 | +the logical CPUs handled by the given driver. |
| 172 | + |
| 173 | +The entries in the :c:member:`states` array are expected to be sorted by the |
| 174 | +value of the :c:member:`target_residency` field in |struct cpuidle_state| in |
| 175 | +the ascending order (that is, index 0 should correspond to the idle state with |
| 176 | +the minimum value of :c:member:`target_residency`). [Since the |
| 177 | +:c:member:`target_residency` value is expected to reflect the "depth" of the |
| 178 | +idle state represented by the |struct cpuidle_state| object holding it, this |
| 179 | +sorting order should be the same as the ascending sorting order by the idle |
| 180 | +state "depth".] |
| 181 | + |
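A driver author could check the expected ordering with a helper along the following lines. This is a user-space sketch under stated assumptions: ``struct mock_state`` and ``states_sorted()`` are hypothetical names, and the text above does not say that the ``CPUIdle`` core performs such a check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the field that defines the order. */
struct mock_state {
	unsigned int target_residency;	/* microseconds */
};

/*
 * Return true if states[] is sorted by target_residency in ascending
 * order, so that index 0 holds the minimum value.  Equal neighbours are
 * accepted, and an empty array counts as sorted.
 */
static bool states_sorted(const struct mock_state *states, int state_count)
{
	int i;

	assert(states != NULL || state_count == 0);

	for (i = 1; i < state_count; i++) {
		if (states[i].target_residency < states[i - 1].target_residency)
			return false;
	}
	return true;
}
```
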
| 182 | +Three fields in |struct cpuidle_state| are used by the existing ``CPUIdle`` |
| 183 | +governors for computations related to idle state selection: |
| 184 | + |
| 185 | +:c:member:`target_residency` |
| 186 | + Minimum time to spend in this idle state including the time needed to |
| 187 | + enter it (which may be substantial) to save more energy than could |
| 188 | + be saved by staying in a shallower idle state for the same amount of |
| 189 | + time, in microseconds. |
| 190 | + |
| 191 | +:c:member:`exit_latency` |
| 192 | + Maximum time it will take a CPU asking the processor to enter this idle |
| 193 | + state to start executing the first instruction after a wakeup from it, |
| 194 | + in microseconds. |
| 195 | + |
| 196 | +:c:member:`flags` |
| 197 | + Flags representing idle state properties. Currently, governors only use |
| 198 | + the ``CPUIDLE_FLAG_POLLING`` flag which is set if the given object |
| 199 | + does not represent a real idle state, but an interface to a software |
| 200 | + "loop" that can be used in order to avoid asking the processor to enter |
| 201 | + any idle state at all. [There are other flags used by the ``CPUIdle`` |
| 202 | + core in special situations.] |
| 203 | + |
| 204 | +The :c:member:`enter` callback pointer in |struct cpuidle_state|, which must not |
| 205 | +be ``NULL``, points to the routine to execute in order to ask the processor to |
| 206 | +enter this particular idle state: |
| 207 | + |
| 208 | +:: |
| 209 | + |
| 210 | + int (*enter) (struct cpuidle_device *dev, struct cpuidle_driver *drv, |
| 211 | + int index); |
| 212 | + |
| 213 | +The first two arguments of it point to the |struct cpuidle_device| object |
| 214 | +representing the logical CPU running this callback and the |
| 215 | +|struct cpuidle_driver| object representing the driver itself, respectively, |
| 216 | +and the last one is an index of the |struct cpuidle_state| entry in the driver's |
| 217 | +:c:member:`states` array representing the idle state to ask the processor to |
| 218 | +enter. |
| 219 | + |
| 220 | +The analogous ``->enter_s2idle()`` callback in |struct cpuidle_state| is used |
| 221 | +only for implementing the suspend-to-idle system-wide power management feature. |
| 222 | +The difference between it and ``->enter()`` is that it must not re-enable |
| 223 | +interrupts at any point (even temporarily) or attempt to change the states of |
| 224 | +clock event devices, which the ``->enter()`` callback may do sometimes. |
| 225 | + |
| 226 | +Once the :c:member:`states` array has been populated, the number of valid |
| 227 | +entries in it has to be stored in the :c:member:`state_count` field of the |
| 228 | +|struct cpuidle_driver| object representing the driver. Moreover, if any |
| 229 | +entries in the :c:member:`states` array represent "coupled" idle states (that |
| 230 | +is, idle states that can only be asked for if multiple related logical CPUs are |
| 231 | +idle), the :c:member:`safe_state_index` field in |struct cpuidle_driver| needs |
| 232 | +to be the index of an idle state that is not "coupled" (that is, one that can be |
| 233 | +asked for if only one logical CPU is idle). |
| 234 | + |
| 235 | +In addition to that, if the given ``CPUIdle`` driver is only going to handle a |
| 236 | +subset of logical CPUs in the system, the :c:member:`cpumask` field in its |
| 237 | +|struct cpuidle_driver| object must point to the set (mask) of CPUs that will be |
| 238 | +handled by it. |
| 239 | + |
| 240 | +A ``CPUIdle`` driver can only be used after it has been registered. If there |
| 241 | +are no "coupled" idle state entries in the driver's :c:member:`states` array, |
| 242 | +that can be accomplished by passing the driver's |struct cpuidle_driver| object |
| 243 | +to :c:func:`cpuidle_register_driver()`. Otherwise, :c:func:`cpuidle_register()` |
| 244 | +should be used for this purpose. |
| 245 | + |
| 246 | +However, it also is necessary to register |struct cpuidle_device| objects for |
| 247 | +all of the logical CPUs to be handled by the given ``CPUIdle`` driver with the |
| 248 | +help of :c:func:`cpuidle_register_device()` after the driver has been registered, |
| 249 | +and :c:func:`cpuidle_register_driver()`, unlike :c:func:`cpuidle_register()`, |
| 250 | +does not do that automatically. For this reason, the drivers that use |
| 251 | +:c:func:`cpuidle_register_driver()` to register themselves must also take care |
| 252 | +of registering the |struct cpuidle_device| objects as needed, so it is generally |
| 253 | +recommended to use :c:func:`cpuidle_register()` for ``CPUIdle`` driver |
| 254 | +registration in all cases. |
| 255 | + |
| 256 | +The registration of a |struct cpuidle_device| object causes the ``CPUIdle`` |
| 257 | +``sysfs`` interface to be created and the governor's ``->enable()`` callback to |
| 258 | +be invoked for the logical CPU represented by it, so it must take place after |
| 259 | +registering the driver that will handle the CPU in question. |
| 260 | + |
| 261 | +``CPUIdle`` drivers and |struct cpuidle_device| objects can be unregistered |
| 262 | +when they are no longer necessary, which allows some resources associated with |
| 263 | +them to be released. Due to dependencies between them, all of the |
| 264 | +|struct cpuidle_device| objects representing CPUs handled by the given |
| 265 | +``CPUIdle`` driver must be unregistered, with the help of |
| 266 | +:c:func:`cpuidle_unregister_device()`, before calling |
| 267 | +:c:func:`cpuidle_unregister_driver()` to unregister the driver. Alternatively, |
| 268 | +:c:func:`cpuidle_unregister()` can be called to unregister a ``CPUIdle`` driver |
| 269 | +along with all of the |struct cpuidle_device| objects representing CPUs handled |
| 270 | +by it. |
| 271 | + |
| 272 | +``CPUIdle`` drivers can respond to runtime system configuration changes that |
| 273 | +lead to modifications of the list of available processor idle states (which can |
| 274 | +happen, for example, when the system's power source is switched from AC to |
| 275 | +battery or the other way around). Upon a notification of such a change, |
| 276 | +a ``CPUIdle`` driver is expected to call :c:func:`cpuidle_pause_and_lock()` to |
| 277 | +turn ``CPUIdle`` off temporarily and then :c:func:`cpuidle_disable_device()` for |
| 278 | +all of the |struct cpuidle_device| objects representing CPUs affected by that |
| 279 | +change. Next, it can update its :c:member:`states` array in accordance with |
| 280 | +the new configuration of the system, call :c:func:`cpuidle_enable_device()` for |
| 281 | +all of the relevant |struct cpuidle_device| objects and invoke |
| 282 | +:c:func:`cpuidle_resume_and_unlock()` to allow ``CPUIdle`` to be used again. |