Commit eb41468

hnaz authored and torvalds committed
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual job health or risking complete machine lockups, this patch implements a way to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays:

        cpu:    some tasks are runnable but not executing on a CPU
        memory: tasks are reclaiming, or waiting for swapin or thrashing cache
        io:     tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the load average).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
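The aggregation described above can be sketched roughly as follows. This is an illustrative sketch only, not the code this patch adds in kernel/sched/psi.c; the structure, function names, and decay constants are made up for the example. Each period, stall time is turned into a share of non-idle walltime, and that sample feeds one decaying average per window, much like the load average.

/*
 * Illustrative sketch of the aggregation step -- NOT code from this patch.
 * All names and constants are assumptions made for the example.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_WINDOWS 3

/* Assumed per-period decay weights for the 10s/1m/5m windows, 16.16 fixed point */
static const uint32_t decay[NR_WINDOWS] = { 45000, 11000, 2300 };

struct pressure_avg {
	uint32_t avg[NR_WINDOWS];	/* 16.16 fixed point, 1 << 16 == 100% */
};

/* One 2-second aggregation step; both times are in nanoseconds */
static void update_averages(struct pressure_avg *p,
			    uint64_t stall_ns, uint64_t nonidle_ns)
{
	uint32_t sample;
	int w;

	if (!nonidle_ns)
		return;	/* no CPU did any work this period */

	/* share of non-idle walltime spent stalled, 16.16 fixed point */
	sample = (uint32_t)((stall_ns << 16) / nonidle_ns);

	/* exponentially decaying average: avg += (sample - avg) * weight */
	for (w = 0; w < NR_WINDOWS; w++) {
		uint32_t a = p->avg[w];

		if (sample >= a)
			a += (uint32_t)(((uint64_t)(sample - a) * decay[w]) >> 16);
		else
			a -= (uint32_t)(((uint64_t)(a - sample) * decay[w]) >> 16);
		p->avg[w] = a;
	}
}

int main(void)
{
	struct pressure_avg mem = { { 0, 0, 0 } };

	/* e.g. 300ms of memory stall during a fully non-idle 2s period */
	update_averages(&mem, 300000000ULL, 2000000000ULL);
	printf("avg10 ~= %.2f%%\n", mem.avg[0] * 100.0 / (1 << 16));
	return 0;
}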
1 parent 246b3b3 commit eb41468

File tree

15 files changed: +1003 -6 lines changed

Documentation/accounting/psi.txt

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+================================
+PSI - Pressure Stall Information
+================================
+
+:Date: April, 2018
+:Author: Johannes Weiner <hannes@cmpxchg.org>
+
+When CPU, memory or IO devices are contended, workloads experience
+latency spikes, throughput losses, and run the risk of OOM kills.
+
+Without an accurate measure of such contention, users are forced to
+either play it safe and under-utilize their hardware resources, or
+roll the dice and frequently suffer the disruptions resulting from
+excessive overcommit.
+
+The psi feature identifies and quantifies the disruptions caused by
+such resource crunches and the time impact it has on complex workloads
+or even entire systems.
+
+Having an accurate measure of productivity losses caused by resource
+scarcity aids users in sizing workloads to hardware--or provisioning
+hardware according to workload demand.
+
+As psi aggregates this information in realtime, systems can be managed
+dynamically using techniques such as load shedding, migrating jobs to
+other systems or data centers, or strategically pausing or killing low
+priority or restartable batch jobs.
+
+This allows maximizing hardware utilization without sacrificing
+workload health or risking major disruptions such as OOM kills.
+
+Pressure interface
+==================
+
+Pressure information for each resource is exported through the
+respective file in /proc/pressure/ -- cpu, memory, and io.
+
+The format for CPU is as such:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+and for memory and IO:
+
+some avg10=0.00 avg60=0.00 avg300=0.00 total=0
+full avg10=0.00 avg60=0.00 avg300=0.00 total=0
+
+The "some" line indicates the share of time in which at least some
+tasks are stalled on a given resource.
+
+The "full" line indicates the share of time in which all non-idle
+tasks are stalled on a given resource simultaneously. In this state
+actual CPU cycles are going to waste, and a workload that spends
+extended time in this state is considered to be thrashing. This has
+severe impact on performance, and it's useful to distinguish this
+situation from a state where some tasks are stalled but the CPU is
+still doing productive work. As such, time spent in this subset of the
+stall state is tracked separately and exported in the "full" averages.
+
+The ratios are tracked as recent trends over ten, sixty, and three
+hundred second windows, which gives insight into short term events as
+well as medium and long term trends. The total absolute stall time is
+tracked and exported as well, to allow detection of latency spikes
+which wouldn't necessarily make a dent in the time averages, or to
+average trends over custom time frames.
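A minimal userspace reader for these files might look like the following sketch. It is not part of this patch and assumes only the line format documented above; the 10% threshold is an arbitrary example.

/*
 * Minimal sketch of a /proc/pressure reader -- illustrative only.
 * It relies solely on the documented line format:
 * "some|full avg10=X avg60=X avg300=X total=N".
 */
#include <stdio.h>

int main(void)
{
	char type[8];
	double avg10, avg60, avg300;
	unsigned long long total;
	FILE *f = fopen("/proc/pressure/memory", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}

	while (fscanf(f, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		      type, &avg10, &avg60, &avg300, &total) == 5) {
		/* e.g. warn when more than 10% of recent walltime was stalled */
		if (avg10 > 10.0)
			printf("%s memory pressure high: %.2f%% (10s avg)\n",
			       type, avg10);
	}

	fclose(f);
	return 0;
}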

include/linux/psi.h

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+#ifndef _LINUX_PSI_H
+#define _LINUX_PSI_H
+
+#include <linux/psi_types.h>
+#include <linux/sched.h>
+
+#ifdef CONFIG_PSI
+
+extern bool psi_disabled;
+
+void psi_init(void);
+
+void psi_task_change(struct task_struct *task, int clear, int set);
+
+void psi_memstall_tick(struct task_struct *task, int cpu);
+void psi_memstall_enter(unsigned long *flags);
+void psi_memstall_leave(unsigned long *flags);
+
+#else /* CONFIG_PSI */
+
+static inline void psi_init(void) {}
+
+static inline void psi_memstall_enter(unsigned long *flags) {}
+static inline void psi_memstall_leave(unsigned long *flags) {}
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_H */
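Judging by the declarations above, psi_memstall_enter()/psi_memstall_leave() are meant to bracket sections in which a task is stalled working on, or waiting for, memory (reclaim, swapin, thrashing). A hedged usage sketch; the surrounding function and wait_for_free_pages() are made up for illustration:

/*
 * Hedged usage sketch -- not code from this patch. It only relies on the
 * psi_memstall_enter()/psi_memstall_leave() declarations above.
 */
#include <linux/psi.h>

static void wait_for_free_pages(void)
{
	/* placeholder for reclaim/swapin/thrashing work the task blocks on */
}

static void example_memory_stall_section(void)
{
	unsigned long pflags;

	/* account the time below as a memory stall for the current task */
	psi_memstall_enter(&pflags);

	wait_for_free_pages();

	psi_memstall_leave(&pflags);
}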

include/linux/psi_types.h

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+#ifndef _LINUX_PSI_TYPES_H
+#define _LINUX_PSI_TYPES_H
+
+#include <linux/seqlock.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_PSI
+
+/* Tracked task states */
+enum psi_task_count {
+	NR_IOWAIT,
+	NR_MEMSTALL,
+	NR_RUNNING,
+	NR_PSI_TASK_COUNTS,
+};
+
+/* Task state bitmasks */
+#define TSK_IOWAIT	(1 << NR_IOWAIT)
+#define TSK_MEMSTALL	(1 << NR_MEMSTALL)
+#define TSK_RUNNING	(1 << NR_RUNNING)
+
+/* Resources that workloads could be stalled on */
+enum psi_res {
+	PSI_IO,
+	PSI_MEM,
+	PSI_CPU,
+	NR_PSI_RESOURCES,
+};
+
+/*
+ * Pressure states for each resource:
+ *
+ * SOME: Stalled tasks & working tasks
+ * FULL: Stalled tasks & no working tasks
+ */
+enum psi_states {
+	PSI_IO_SOME,
+	PSI_IO_FULL,
+	PSI_MEM_SOME,
+	PSI_MEM_FULL,
+	PSI_CPU_SOME,
+	/* Only per-CPU, to weigh the CPU in the global average: */
+	PSI_NONIDLE,
+	NR_PSI_STATES,
+};
+
+struct psi_group_cpu {
+	/* 1st cacheline updated by the scheduler */
+
+	/* Aggregator needs to know of concurrent changes */
+	seqcount_t seq ____cacheline_aligned_in_smp;
+
+	/* States of the tasks belonging to this group */
+	unsigned int tasks[NR_PSI_TASK_COUNTS];
+
+	/* Period time sampling buckets for each state of interest (ns) */
+	u32 times[NR_PSI_STATES];
+
+	/* Time of last task change in this group (rq_clock) */
+	u64 state_start;
+
+	/* 2nd cacheline updated by the aggregator */
+
+	/* Delta detection against the sampling buckets */
+	u32 times_prev[NR_PSI_STATES] ____cacheline_aligned_in_smp;
+};
+
+struct psi_group {
+	/* Protects data updated during an aggregation */
+	struct mutex stat_lock;
+
+	/* Per-cpu task state & time tracking */
+	struct psi_group_cpu __percpu *pcpu;
+
+	/* Periodic aggregation state */
+	u64 total_prev[NR_PSI_STATES - 1];
+	u64 last_update;
+	u64 next_update;
+	struct delayed_work clock_work;
+
+	/* Total stall times and sampled pressure averages */
+	u64 total[NR_PSI_STATES - 1];
+	unsigned long avg[NR_PSI_STATES - 1][3];
+};
+
+#else /* CONFIG_PSI */
+
+struct psi_group { };
+
+#endif /* CONFIG_PSI */
+
+#endif /* _LINUX_PSI_TYPES_H */
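The SOME and FULL states map directly onto the task counts above: SOME means at least one task on the CPU is stalled on the resource, FULL means tasks are stalled and none are productively running. A hedged sketch of that test for the memory resource; this is illustrative only, as the real aggregation lives in kernel/sched/psi.c, which is not shown on this page:

/*
 * Illustrative sketch only -- derives the memory SOME/FULL states from
 * the per-CPU task counts defined in psi_types.h above.
 */
#include <linux/psi_types.h>
#include <linux/types.h>

static bool mem_some(const unsigned int *tasks)
{
	/* at least one task on this CPU is stalled on memory */
	return tasks[NR_MEMSTALL] > 0;
}

static bool mem_full(const unsigned int *tasks)
{
	/* tasks are stalled on memory and nothing productive is running */
	return tasks[NR_MEMSTALL] > 0 && tasks[NR_RUNNING] == 0;
}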

include/linux/sched.h

Lines changed: 10 additions & 0 deletions
@@ -25,6 +25,7 @@
 #include <linux/latencytop.h>
 #include <linux/sched/prio.h>
 #include <linux/signal_types.h>
+#include <linux/psi_types.h>
 #include <linux/mm_types_task.h>
 #include <linux/task_io_accounting.h>
 #include <linux/rseq.h>
@@ -706,6 +707,10 @@ struct task_struct {
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
 	unsigned			sched_remote_wakeup:1;
+#ifdef CONFIG_PSI
+	unsigned			sched_psi_wake_requeue:1;
+#endif
+
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
 
@@ -965,6 +970,10 @@ struct task_struct {
 	kernel_siginfo_t		*last_siginfo;
 
 	struct task_io_accounting	ioac;
+#ifdef CONFIG_PSI
+	/* Pressure stall state */
+	unsigned int			psi_flags;
+#endif
 #ifdef CONFIG_TASK_XACCT
 	/* Accumulated RSS usage: */
 	u64				acct_rss_mem1;
@@ -1391,6 +1400,7 @@ extern struct pid *cad_pid;
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
 #define PF_SWAPWRITE		0x00800000	/* Allowed to write to swap */
+#define PF_MEMSTALL		0x01000000	/* Stalled due to lack of memory */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MUTEX_TESTER		0x20000000	/* Thread belongs to the rt mutex tester */

init/Kconfig

Lines changed: 15 additions & 0 deletions
@@ -490,6 +490,21 @@ config TASK_IO_ACCOUNTING
 
 	  Say N if unsure.
 
+config PSI
+	bool "Pressure stall information tracking"
+	help
+	  Collect metrics that indicate how overcommitted the CPU, memory,
+	  and IO capacity are in the system.
+
+	  If you say Y here, the kernel will create /proc/pressure/ with the
+	  pressure statistics files cpu, memory, and io. These will indicate
+	  the share of walltime in which some or all tasks in the system are
+	  delayed due to contention of the respective resource.
+
+	  For more details see Documentation/accounting/psi.txt.
+
+	  Say N if unsure.
+
 endmenu # "CPU/Task time and stats accounting"
 
 config CPU_ISOLATION

kernel/fork.c

Lines changed: 4 additions & 0 deletions
@@ -1822,6 +1822,10 @@ static __latent_entropy struct task_struct *copy_process(
 
 	p->default_timer_slack_ns = current->timer_slack_ns;
 
+#ifdef CONFIG_PSI
+	p->psi_flags = 0;
+#endif
+
 	task_io_accounting_init(&p->ioac);
 	acct_clear_integrals(p);
 

kernel/sched/Makefile

Lines changed: 1 addition & 0 deletions
@@ -29,3 +29,4 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+obj-$(CONFIG_PSI) += psi.o

kernel/sched/core.c

Lines changed: 10 additions & 2 deletions
@@ -722,8 +722,10 @@ static inline void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & ENQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
-	if (!(flags & ENQUEUE_RESTORE))
+	if (!(flags & ENQUEUE_RESTORE)) {
 		sched_info_queued(rq, p);
+		psi_enqueue(p, flags & ENQUEUE_WAKEUP);
+	}
 
 	p->sched_class->enqueue_task(rq, p, flags);
 }
@@ -733,8 +735,10 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 	if (!(flags & DEQUEUE_NOCLOCK))
 		update_rq_clock(rq);
 
-	if (!(flags & DEQUEUE_SAVE))
+	if (!(flags & DEQUEUE_SAVE)) {
 		sched_info_dequeued(rq, p);
+		psi_dequeue(p, flags & DEQUEUE_SLEEP);
+	}
 
 	p->sched_class->dequeue_task(rq, p, flags);
 }
@@ -2037,6 +2041,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;
+		psi_ttwu_dequeue(p);
 		set_task_cpu(p, cpu);
 	}
 
@@ -3051,6 +3056,7 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	cpu_load_update_active(rq);
 	calc_global_load_tick(rq);
+	psi_task_tick(rq);
 
 	rq_unlock(rq, &rf);
 
@@ -6067,6 +6073,8 @@ void __init sched_init(void)
 
 	init_schedstats();
 
+	psi_init();
+
 	scheduler_running = 1;
 }
 
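These hooks feed the per-CPU task counts that psi aggregates. Their implementations live in kernel/sched/psi.c and kernel/sched/stats.h, which are not shown on this page; below is a hedged sketch of the general shape, using only the psi_task_change() declaration and the TSK_* bits from the headers above. The wakeup handling is deliberately simplified and the function name is made up.

/*
 * Hedged sketch only -- not the real psi_enqueue(). The idea: scheduler
 * events are translated into set/clear operations on the TSK_* state
 * bits via psi_task_change().
 */
#include <linux/psi.h>
#include <linux/sched.h>

static inline void example_psi_enqueue(struct task_struct *p, bool wakeup)
{
	int clear = 0, set = TSK_RUNNING;

	if (psi_disabled)
		return;

	/* a task coming out of an iowait sleep stops counting as iowait */
	if (wakeup && (p->psi_flags & TSK_IOWAIT))
		clear |= TSK_IOWAIT;

	psi_task_change(p, clear, set);
}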
