
Commit ab20fd0

Merge branch 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cache resource controller updates from Thomas Gleixner:
 "An update for the Intel Resource Director Technology (RDT) which adds
  a feedback driven software controller to adjust the bandwidth
  allocation MSRs at runtime. This makes the allocations more accurate
  and allows the use of bandwidth values in understandable units (MB/s)
  instead of the percentage based allocations of the original, still
  available, interface. The software controller can be enabled with a
  new mount option for the resctrl filesystem"

* 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/intel_rdt/mba_sc: Feedback loop to dynamically update mem bandwidth
  x86/intel_rdt/mba_sc: Prepare for feedback loop
  x86/intel_rdt/mba_sc: Add schemata support
  x86/intel_rdt/mba_sc: Add initialization support
  x86/intel_rdt/mba_sc: Enable/disable MBA software controller
  x86/intel_rdt/mba_sc: Documentation for MBA software controller(mba_sc)
2 parents ba252f1 + de73f38 commit ab20fd0
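As a rough illustration of what the new software controller does (a
hedged userspace sketch, not the kernel code: the helper name
mba_sc_step(), the step constants and the sample values are all
hypothetical), the feedback loop compares the bandwidth measured via
MBM counters against the user's MBps target and nudges the percentage
that is actually programmed into the allocation MSRs:

#include <stdio.h>

/* Hypothetical step constants; the kernel would use the hardware's
 * min_bw/bw_gran values instead. */
#define MAX_MBA_BW 100u
#define MIN_MBA_BW 10u
#define BW_GRAN 10u

/* One pass of an mba_sc-style feedback loop: throttle harder when the
 * measured bandwidth exceeds the target, relax when below it. */
static unsigned int mba_sc_step(unsigned int cur_pct,
				unsigned int measured_mbps,
				unsigned int target_mbps)
{
	if (measured_mbps > target_mbps && cur_pct > MIN_MBA_BW)
		cur_pct -= BW_GRAN;
	else if (measured_mbps < target_mbps && cur_pct < MAX_MBA_BW)
		cur_pct += BW_GRAN;
	return cur_pct;
}

int main(void)
{
	/* Pretend MBM reported these MBps samples against a 500 MBps target. */
	unsigned int samples[] = { 900, 780, 640, 510, 480, 495 };
	unsigned int pct = MAX_MBA_BW;
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		pct = mba_sc_step(pct, samples[i], 500);
		printf("sample %u MBps -> %u%%\n", samples[i], pct);
	}
	return 0;
}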

File tree

6 files changed (+337, -33 lines)

Documentation/x86/intel_rdt_ui.txt

Lines changed: 67 additions & 8 deletions
@@ -17,12 +17,14 @@ MBA (Memory Bandwidth Allocation) - "mba"

 To use the feature mount the file system:

- # mount -t resctrl resctrl [-o cdp[,cdpl2]] /sys/fs/resctrl
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps]] /sys/fs/resctrl

 mount options are:

 "cdp": Enable code/data prioritization in L3 cache allocations.
 "cdpl2": Enable code/data prioritization in L2 cache allocations.
+"mba_MBps": Enable the MBA Software Controller (mba_sc) to specify MBA
+ bandwidth in MBps

 L2 and L3 CDP are controlled separately.

@@ -270,10 +272,11 @@ and 0xA are not. On a system with a 20-bit mask each bit represents 5%
 of the capacity of the cache. You could partition the cache into four
 equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.

-Memory bandwidth(b/w) percentage
---------------------------------
-For Memory b/w resource, user controls the resource by indicating the
-percentage of total memory b/w.
+Memory bandwidth Allocation and monitoring
+------------------------------------------
+
+For the memory bandwidth resource, by default the user controls the
+resource by indicating the percentage of total memory bandwidth.

 The minimum bandwidth percentage value for each cpu model is predefined
 and can be looked up through "info/MB/min_bandwidth". The bandwidth

@@ -285,7 +288,47 @@ to the next control step available on the hardware.
 The bandwidth throttling is a core specific mechanism on some of Intel
 SKUs. Using a high bandwidth and a low bandwidth setting on two threads
 sharing a core will result in both threads being throttled to use the
-low bandwidth.
+low bandwidth. The fact that memory bandwidth allocation (MBA) is a core
+specific mechanism, whereas memory bandwidth monitoring (MBM) is done at
+the package level, may lead to confusion when users try to apply control
+via the MBA and then monitor the bandwidth to see if the controls are
+effective. Below are such scenarios:
+
+1. The user may *not* see an increase in actual bandwidth when the
+percentage values are increased:
+
+This can occur when the aggregate L2 external bandwidth is more than the
+L3 external bandwidth. Consider an SKL SKU with 24 cores on a package,
+where the L2 external bandwidth is 10GBps (hence the aggregate L2
+external bandwidth is 240GBps) and the L3 external bandwidth is 100GBps.
+Now a workload with '20 threads, having 50% bandwidth, each consuming
+5GBps' consumes the max L3 bandwidth of 100GBps although the percentage
+value specified is only 50% << 100%. Hence increasing the bandwidth
+percentage will not yield any more bandwidth. This is because although
+the L2 external bandwidth still has capacity, the L3 external bandwidth
+is fully used. Also note that this is dependent on the number of cores
+the benchmark is run on.
+
+2. The same bandwidth percentage may mean different actual bandwidth
+depending on the number of threads:
+
+For the same SKU as in #1, a 'single thread, with 10% bandwidth' and a
+'4 thread, with 10% bandwidth' workload can consume up to 10GBps and
+40GBps respectively, although both have the same percentage bandwidth of
+10%. This is simply because as threads start using more cores in an
+rdtgroup, the actual bandwidth may increase or vary although the user
+specified bandwidth percentage is the same.
+
+In order to mitigate this and make the interface more user friendly,
+resctrl added support for specifying the bandwidth in MBps as well. The
+kernel underneath would use a software feedback mechanism or a "Software
+Controller (mba_sc)" which reads the actual bandwidth using MBM counters
+and adjusts the memory bandwidth percentages to ensure:
+
+	"actual bandwidth < user specified bandwidth".
+
+By default, the schemata would take the bandwidth percentage values,
+whereas the user can switch to the "MBA software controller" mode using
+the mount option 'mba_MBps'. The schemata format is specified in the
+sections below.

@@ -308,13 +351,20 @@ schemata format is always:

 	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...

-Memory b/w Allocation details
------------------------------
+Memory bandwidth Allocation (default mode)
+------------------------------------------

 Memory b/w domain is L3 cache.

 	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...

+Memory bandwidth Allocation specified in MBps
+---------------------------------------------
+
+Memory bandwidth domain is L3 cache.
+
+	MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+
 Reading/writing the schemata file
 ---------------------------------
 Reading the schemata file will show the state of all resources

@@ -358,6 +408,15 @@ allocations can overlap or not. The allocations specifies the maximum
 b/w that the group may be able to use and the system admin can configure
 the b/w accordingly.

+If the MBA is specified in MB (megabytes) then the user can enter the
+max b/w in MB rather than the percentage values.
+
+# echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3\nMB:0=1024;1=500" > /sys/fs/resctrl/p1/schemata
+
+In the above example the tasks in "p1" and "p0" on socket 0 would use a
+max b/w of 1024MB, whereas on socket 1 they would use 500MB.
+
 Example 2
 ---------
 Again two sockets, but this time with a more realistic 20-bit mask.
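To make the two documentation scenarios above concrete, here is a small
standalone sketch (the 100GBps package figure is the hypothetical SKL
number from the text, and aggregate_gbps() is an illustrative helper,
not kernel code): the aggregate bandwidth is the per-thread demand
summed over threads, capped by the shared L3 external bandwidth, which
is why raising the percentage in scenario #1 adds nothing and why the
same 10% means different bandwidth in scenario #2.

#include <stdio.h>

/* Hypothetical SKL figure from the documentation text above. */
#define L3_EXTERNAL_GBPS 100u

/* Sum the per-thread demand, then apply the shared L3 cap. */
static unsigned int aggregate_gbps(unsigned int threads,
				   unsigned int per_thread_gbps)
{
	unsigned int demand = threads * per_thread_gbps;

	return demand < L3_EXTERNAL_GBPS ? demand : L3_EXTERNAL_GBPS;
}

int main(void)
{
	/* Scenario #1: 20 threads at 5GBps each already saturate the L3,
	 * so a higher MBA percentage cannot yield more bandwidth. */
	printf("20 threads x 5GBps -> %uGBps (L3 cap %uGBps)\n",
	       aggregate_gbps(20, 5), L3_EXTERNAL_GBPS);

	/* Scenario #2: at the same 10%% setting, the aggregate scales
	 * with the thread count (up to 10GBps per thread). */
	printf("1 thread  at 10%% -> %uGBps\n", aggregate_gbps(1, 10));
	printf("4 threads at 10%% -> %uGBps\n", aggregate_gbps(4, 10));
	return 0;
}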

arch/x86/kernel/cpu/intel_rdt.c

Lines changed: 37 additions & 13 deletions
@@ -33,8 +33,8 @@
 #include <asm/intel_rdt_sched.h>
 #include "intel_rdt.h"

-#define MAX_MBA_BW	100u
 #define MBA_IS_LINEAR	0x4
+#define MBA_MAX_MBPS	U32_MAX

 /* Mutex to protect rdtgroup access. */
 DEFINE_MUTEX(rdtgroup_mutex);

@@ -178,7 +178,7 @@ struct rdt_resource rdt_resources_all[] = {
 		.msr_update		= mba_wrmsr,
 		.cache_level		= 3,
 		.parse_ctrlval		= parse_bw,
-		.format_str		= "%d=%*d",
+		.format_str		= "%d=%*u",
 		.fflags			= RFTYPE_RES_MB,
 	},
 };

@@ -230,6 +230,14 @@ static inline void cache_alloc_hsw_probe(void)
 	rdt_alloc_capable = true;
 }

+bool is_mba_sc(struct rdt_resource *r)
+{
+	if (!r)
+		return rdt_resources_all[RDT_RESOURCE_MBA].membw.mba_sc;
+
+	return r->membw.mba_sc;
+}
+
 /*
  * rdt_get_mb_table() - get a mapping of bandwidth(b/w) percentage values
  * exposed to user interface and the h/w understandable delay values.

@@ -341,7 +349,7 @@ static int get_cache_id(int cpu, int level)
  * that can be written to QOS_MSRs.
  * There are currently no SKUs which support non linear delay values.
  */
-static u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r)
 {
 	if (r->membw.delay_linear)
 		return MAX_MBA_BW - bw;

@@ -431,25 +439,40 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id,
 	return NULL;
 }

+void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm)
+{
+	int i;
+
+	/*
+	 * Initialize the Control MSRs to having no control.
+	 * For Cache Allocation: Set all bits in cbm
+	 * For Memory Allocation: Set b/w requested to 100%
+	 * and the bandwidth in MBps to U32_MAX
+	 */
+	for (i = 0; i < r->num_closid; i++, dc++, dm++) {
+		*dc = r->default_ctrl;
+		*dm = MBA_MAX_MBPS;
+	}
+}
+
 static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d)
 {
 	struct msr_param m;
-	u32 *dc;
-	int i;
+	u32 *dc, *dm;

 	dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL);
 	if (!dc)
 		return -ENOMEM;

-	d->ctrl_val = dc;
+	dm = kmalloc_array(r->num_closid, sizeof(*d->mbps_val), GFP_KERNEL);
+	if (!dm) {
+		kfree(dc);
+		return -ENOMEM;
+	}

-	/*
-	 * Initialize the Control MSRs to having no control.
-	 * For Cache Allocation: Set all bits in cbm
-	 * For Memory Allocation: Set b/w requested to 100
-	 */
-	for (i = 0; i < r->num_closid; i++, dc++)
-		*dc = r->default_ctrl;
+	d->ctrl_val = dc;
+	d->mbps_val = dm;
+	setup_default_ctrlval(r, dc, dm);

 	m.low = 0;
 	m.high = r->num_closid;

@@ -588,6 +611,7 @@ static void domain_remove_cpu(int cpu, struct rdt_resource *r)
 	}

 	kfree(d->ctrl_val);
+	kfree(d->mbps_val);
 	kfree(d->rmid_busy_llc);
 	kfree(d->mbm_total);
 	kfree(d->mbm_local);
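Since delay_bw_map() is now exported for the software controller, the
linear mapping it performs is worth seeing in isolation; a minimal
userspace rendition (assumptions: only the delay_linear branch is
shown, and MAX_MBA_BW is the 100u value the header now defines):

#include <stdio.h>

#define MAX_MBA_BW 100u

/* The MSR takes a throttle delay, not a bandwidth: on linear-scale
 * SKUs a requested 90% bandwidth becomes a delay value of 10. */
static unsigned int linear_delay(unsigned long bw)
{
	return MAX_MBA_BW - bw;
}

int main(void)
{
	unsigned long bw;

	for (bw = 10; bw <= 100; bw += 30)
		printf("bandwidth %3lu%% -> delay value %u\n",
		       bw, linear_delay(bw));
	return 0;
}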

arch/x86/kernel/cpu/intel_rdt.h

Lines changed: 18 additions & 0 deletions
@@ -28,6 +28,7 @@

 #define MBM_CNTR_WIDTH			24
 #define MBM_OVERFLOW_INTERVAL		1000
+#define MAX_MBA_BW			100u

 #define RMID_VAL_ERROR			BIT_ULL(63)
 #define RMID_VAL_UNAVAIL		BIT_ULL(62)

@@ -180,10 +181,20 @@ struct rftype {
  * struct mbm_state - status for each MBM counter in each domain
  * @chunks:	Total data moved (multiply by rdt_group.mon_scale to get bytes)
  * @prev_msr:	Value of IA32_QM_CTR for this RMID last time we read it
+ * @chunks_bw:	Total local data moved. Used for bandwidth calculation
+ * @prev_bw_msr: Value of previous IA32_QM_CTR for bandwidth counting
+ * @prev_bw:	The most recent bandwidth in MBps
+ * @delta_bw:	Difference between the current and previous bandwidth
+ * @delta_comp:	Indicates whether to compute the delta_bw
  */
 struct mbm_state {
 	u64	chunks;
 	u64	prev_msr;
+	u64	chunks_bw;
+	u64	prev_bw_msr;
+	u32	prev_bw;
+	u32	delta_bw;
+	bool	delta_comp;
 };

@@ -202,6 +213,7 @@ struct mbm_state {
  * @cqm_work_cpu:
  *		worker cpu for CQM h/w counters
  * @ctrl_val:	array of cache or mem ctrl values (indexed by CLOSID)
+ * @mbps_val:	When mba_sc is enabled, this holds the bandwidth in MBps
  * @new_ctrl:	new ctrl value to be loaded
  * @have_new_ctrl: did user provide new_ctrl for this domain
  */

@@ -217,6 +229,7 @@ struct rdt_domain {
 	int	mbm_work_cpu;
 	int	cqm_work_cpu;
 	u32	*ctrl_val;
+	u32	*mbps_val;
 	u32	new_ctrl;
 	bool	have_new_ctrl;
 };

@@ -259,13 +272,15 @@ struct rdt_cache {
  * @min_bw:	Minimum memory bandwidth percentage user can request
  * @bw_gran:	Granularity at which the memory bandwidth is allocated
  * @delay_linear:	True if memory B/W delay is in linear scale
+ * @mba_sc:	True if MBA software controller (mba_sc) is enabled
  * @mb_map:	Mapping of memory B/W percentage to memory B/W delay
  */
 struct rdt_membw {
 	u32		max_delay;
 	u32		min_bw;
 	u32		bw_gran;
 	u32		delay_linear;
+	bool		mba_sc;
 	u32		*mb_map;
 };

@@ -445,6 +460,9 @@ void mon_event_read(struct rmid_read *rr, struct rdt_domain *d,
 void mbm_setup_overflow_handler(struct rdt_domain *dom,
 				unsigned long delay_ms);
 void mbm_handle_overflow(struct work_struct *work);
+bool is_mba_sc(struct rdt_resource *r);
+void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm);
+u32 delay_bw_map(unsigned long bw, struct rdt_resource *r);
 void cqm_setup_limbo_handler(struct rdt_domain *dom, unsigned long delay_ms);
 void cqm_handle_limbo(struct work_struct *work);
 bool has_busy_rmid(struct rdt_resource *r, struct rdt_domain *d);
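The new mbm_state fields exist to turn raw IA32_QM_CTR readings into a
bandwidth figure the controller can compare against mbps_val. A hedged
sketch of that calculation (the field subset mirrors the header, but
MON_SCALE_BYTES and the fixed one second interval are illustrative
stand-ins for what the kernel actually derives from the hardware):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative subset of the new struct mbm_state fields. */
struct mbm_state_sketch {
	uint64_t prev_bw_msr;	/* previous IA32_QM_CTR reading */
	uint32_t prev_bw;	/* most recent bandwidth in MBps */
	uint32_t delta_bw;	/* difference vs. previous bandwidth */
	bool delta_comp;	/* compute delta_bw on this pass? */
};

/* Illustrative stand-ins: bytes per counter chunk, 1s sample interval. */
#define MON_SCALE_BYTES 65536u
#define INTERVAL_MS 1000u

/* One sampling pass: derive MBps from the counter delta and track the
 * change, as the prev_bw/delta_bw fields suggest the handler does. */
static void sample_bw(struct mbm_state_sketch *m, uint64_t cur_msr)
{
	uint64_t bytes = (cur_msr - m->prev_bw_msr) * MON_SCALE_BYTES;
	uint32_t cur_bw = bytes / (1024 * 1024) * 1000 / INTERVAL_MS;

	if (m->delta_comp)
		m->delta_bw = cur_bw > m->prev_bw ? cur_bw - m->prev_bw
						  : m->prev_bw - cur_bw;
	m->prev_bw = cur_bw;
	m->prev_bw_msr = cur_msr;
}

int main(void)
{
	struct mbm_state_sketch m = { .delta_comp = true };
	uint64_t readings[] = { 16000, 48000, 96000 };
	unsigned int i;

	for (i = 0; i < 3; i++) {
		sample_bw(&m, readings[i]);
		printf("QM_CTR %llu -> %u MBps (delta %u)\n",
		       (unsigned long long)readings[i], m.prev_bw,
		       m.delta_bw);
	}
	return 0;
}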

arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c

Lines changed: 19 additions & 5 deletions
@@ -53,7 +53,8 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r)
 		return false;
 	}

-	if (bw < r->membw.min_bw || bw > r->default_ctrl) {
+	if ((bw < r->membw.min_bw || bw > r->default_ctrl) &&
+	    !is_mba_sc(r)) {
 		rdt_last_cmd_printf("MB value %ld out of range [%d,%d]\n", bw,
 				    r->membw.min_bw, r->default_ctrl);
 		return false;

@@ -179,6 +180,8 @@ static int update_domains(struct rdt_resource *r, int closid)
 	struct msr_param msr_param;
 	cpumask_var_t cpu_mask;
 	struct rdt_domain *d;
+	bool mba_sc;
+	u32 *dc;
 	int cpu;

 	if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL))

@@ -188,13 +191,20 @@ static int update_domains(struct rdt_resource *r, int closid)
 	msr_param.high = msr_param.low + 1;
 	msr_param.res = r;

+	mba_sc = is_mba_sc(r);
 	list_for_each_entry(d, &r->domains, list) {
-		if (d->have_new_ctrl && d->new_ctrl != d->ctrl_val[closid]) {
+		dc = !mba_sc ? d->ctrl_val : d->mbps_val;
+		if (d->have_new_ctrl && d->new_ctrl != dc[closid]) {
 			cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask);
-			d->ctrl_val[closid] = d->new_ctrl;
+			dc[closid] = d->new_ctrl;
 		}
 	}
-	if (cpumask_empty(cpu_mask))
+
+	/*
+	 * Avoid writing the control msr with control values when
+	 * MBA software controller is enabled
+	 */
+	if (cpumask_empty(cpu_mask) || mba_sc)
 		goto done;
 	cpu = get_cpu();
 	/* Update CBM on this cpu if it's in cpu_mask. */

@@ -282,13 +292,17 @@ static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid)
 {
 	struct rdt_domain *dom;
 	bool sep = false;
+	u32 ctrl_val;

 	seq_printf(s, "%*s:", max_name_width, r->name);
 	list_for_each_entry(dom, &r->domains, list) {
 		if (sep)
 			seq_puts(s, ";");
+
+		ctrl_val = (!is_mba_sc(r) ? dom->ctrl_val[closid] :
+			    dom->mbps_val[closid]);
 		seq_printf(s, r->format_str, dom->id, max_data_width,
-			   dom->ctrl_val[closid]);
+			   ctrl_val);
 		sep = true;
 	}
 	seq_puts(s, "\n");
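The net effect of the show_doms() change is that one format string can
report either source array, depending on the mount mode. A standalone
sketch with stub data (struct dom_sketch and the values below are
hypothetical; the real function walks struct rdt_domain):

#include <stdbool.h>
#include <stdio.h>

/* Stub carrying both per-CLOSID value arrays, as struct rdt_domain
 * now does (one entry per domain shown for brevity). */
struct dom_sketch {
	int id;
	unsigned int ctrl_val;	/* percentage, default mode */
	unsigned int mbps_val;	/* MBps, mba_sc mode */
};

/* Mirrors the new selection in show_doms(): same format, two sources. */
static void show_line(const struct dom_sketch *doms, int n, bool mba_sc)
{
	int i;

	printf("MB:");
	for (i = 0; i < n; i++) {
		unsigned int v = !mba_sc ? doms[i].ctrl_val
					 : doms[i].mbps_val;

		if (i)
			printf(";");
		printf("%d=%u", doms[i].id, v);
	}
	printf("\n");
}

int main(void)
{
	struct dom_sketch doms[] = { { 0, 50, 1024 }, { 1, 100, 500 } };

	show_line(doms, 2, false);	/* default: MB:0=50;1=100 */
	show_line(doms, 2, true);	/* mba_sc:  MB:0=1024;1=500 */
	return 0;
}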
