Commit 45802da

Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - refcount conversions
   - Solve the rq->leaf_cfs_rq_list can of worms for real.
   - improve power-aware scheduling
   - add sysctl knob for Energy Aware Scheduling
   - documentation updates
   - misc other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
  kthread: Do not use TIMER_IRQSAFE
  kthread: Convert worker lock to raw spinlock
  sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
  sched/fair: Remove unused 'sd' parameter from select_idle_smt()
  sched/wait: Use freezable_schedule() when possible
  sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
  sched/fair: Explain LLC nohz kick condition
  sched/fair: Simplify nohz_balancer_kick()
  sched/topology: Fix percpu data types in struct sd_data & struct s_data
  sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
  sched/fair: Fix O(nr_cgroups) in the load balancing path
  sched/fair: Optimize update_blocked_averages()
  sched/fair: Fix insertion in rq->leaf_cfs_rq_list
  sched/fair: Add tmp_alone_branch assertion
  sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
  sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
  sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
  sched/fair: Update scale invariance of PELT
  sched/fair: Move the rq_of() helper function
  sched/core: Convert task_struct.stack_refcount to refcount_t
  ...
2 parents 203b660 + ad01423 commit 45802da

29 files changed: 1165 additions & 324 deletions

Documentation/power/energy-model.txt

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
====================
Energy Model of CPUs
====================

1. Overview
-----------

The Energy Model (EM) framework serves as an interface between drivers that
know the power consumed by CPUs at various performance levels, and the kernel
subsystems willing to use that information to make energy-aware decisions.

The source of the information about the power consumed by CPUs can vary greatly
from one platform to another. These power costs can be estimated using
devicetree data in some cases. In others, the firmware will know better.
Alternatively, userspace might be best positioned. And so on. To avoid every
client subsystem having to re-implement support for every possible source of
information on its own, the EM framework acts as an abstraction layer which
standardizes the format of power cost tables in the kernel, thereby avoiding
redundant work.

The figure below depicts an example of drivers (Arm-specific here, but the
approach is applicable to any architecture) providing power costs to the EM
framework, and interested clients reading the data from it.

       +---------------+  +-----------------+  +---------------+
       | Thermal (IPA) |  | Scheduler (EAS) |  |     Other     |
       +---------------+  +-----------------+  +---------------+
               |                   | em_pd_energy()    |
               |                   | em_cpu_get()      |
               +---------+         |         +---------+
                         |         |         |
                         v         v         v
                        +---------------------+
                        |    Energy Model     |
                        |      Framework      |
                        +---------------------+
                         ^         ^         ^
                         |         |         | em_register_perf_domain()
                +--------+         |         +--------+
                |                  |                  |
        +---------------+  +---------------+  +---------------+
        |  cpufreq-dt   |  |   arm_scmi    |  |     Other     |
        +---------------+  +---------------+  +---------------+
                ^                  ^                  ^
                |                  |                  |
        +---------------+  +---------------+  +---------------+
        |  Device Tree  |  |   Firmware    |  |       ?       |
        +---------------+  +---------------+  +---------------+

The EM framework manages power cost tables per 'performance domain' in the
system. A performance domain is a group of CPUs whose performance is scaled
together. Performance domains generally have a 1-to-1 mapping with CPUFreq
policies. All CPUs in a performance domain are required to have the same
micro-architecture. CPUs in different performance domains can have different
micro-architectures.


2. Core APIs
------------

2.1 Config options

CONFIG_ENERGY_MODEL must be enabled to use the EM framework.


2.2 Registration of performance domains

Drivers are expected to register performance domains into the EM framework by
calling the following API:

	int em_register_perf_domain(cpumask_t *span, unsigned int nr_states,
				    struct em_data_callback *cb);

Drivers must specify the CPUs of the performance domain using the cpumask
argument, give the number of capacity states in nr_states, and provide a
callback function returning <frequency, power> tuples for each capacity state.
The callback function provided by the driver is free to fetch data from any
relevant location (DT, firmware, ...), and by any means deemed necessary. See
Section 3 for an example of a driver implementing this callback, and
kernel/power/energy_model.c for further documentation on this API.


2.3 Accessing performance domains

Subsystems interested in the energy model of a CPU can retrieve it using the
em_cpu_get() API. The energy model tables are allocated once upon creation of
the performance domains, and kept in memory untouched.

The energy consumed by a performance domain can be estimated using the
em_pd_energy() API. The estimation is performed assuming that the schedutil
CPUfreq governor is in use.

More details about the above APIs can be found in include/linux/energy_model.h.


3. Example driver
-----------------

This section provides a simple example of a CPUFreq driver registering a
performance domain in the Energy Model framework using the (fake) 'foo'
protocol. The driver implements an est_power() function to be provided to the
EM framework.

 -> drivers/cpufreq/foo_cpufreq.c

	static int est_power(unsigned long *mW, unsigned long *KHz, int cpu)
	{
		long freq, power;

		/* Use the 'foo' protocol to round the frequency up to an OPP */
		freq = foo_get_freq_ceil(cpu, *KHz);
		if (freq < 0)
			return freq;

		/* Estimate the power cost for the CPU at the relevant freq. */
		power = foo_estimate_power(cpu, freq);
		if (power < 0)
			return power;

		/* Return the values to the EM framework */
		*mW = power;
		*KHz = freq;

		return 0;
	}

	static int foo_cpufreq_init(struct cpufreq_policy *policy)
	{
		struct em_data_callback em_cb = EM_DATA_CB(est_power);
		int nr_opp, ret;

		/* Do the actual CPUFreq init work ... */
		ret = do_foo_cpufreq_init(policy);
		if (ret)
			return ret;

		/* Find the number of OPPs for this policy */
		nr_opp = foo_get_nr_opp(policy);

		/* And register the new performance domain */
		em_register_perf_domain(policy->cpus, nr_opp, &em_cb);

		return 0;
	}
