Commit 1b69ac6

hnaz authored and torvalds committed
psi: fix aggregation idle shut-off
psi has provisions to shut off the periodic aggregation worker when there is a period of no task activity - and thus no data that needs aggregating. However, while developing psi monitoring, Suren noticed that the aggregation clock currently won't stay shut off for good.

Debugging this revealed a flaw in the idle design: an aggregation run will see no task activity and decide to go to sleep; shortly thereafter, the kworker thread that executed the aggregation will go idle and cause a scheduling change, during which the psi callback will kick the !pending worker again. This will ping-pong forever, and is equivalent to having no shut-off logic at all (but with more code!).

Fix this by exempting aggregation workers from psi's clock-waking logic when the state change is them going to sleep. To do this, tag workers with the last work function they executed, and if psi sees a worker going to sleep after aggregating psi data, it will not reschedule the aggregation work item.

What if the worker is also executing other items before or after?

Any psi state times that were incurred by work items preceding the aggregation work will have been collected from the per-cpu buckets during the aggregation itself. If there are work items following the aggregation work, the worker's last_func tag will be overwritten and the aggregator will be kept alive to process this genuine new activity.

If the aggregation work is the last thing the worker does, and we decide to go idle, the brief period of non-idle time incurred between the aggregation run and the kworker's dequeue will be stranded in the per-cpu buckets until the clock is woken by later activity. But that should not be a problem. The buckets can hold 4s worth of time, and future activity will wake the clock with a 2s delay, giving us 2s worth of data we can leave behind when disabling aggregation. If it takes a worker more than two seconds to go idle after it finishes its last work item, we likely have bigger problems in the system, and won't notice one sample that was averaged with a bogus per-CPU weight.

Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
Fixes: eb41468 ("psi: pressure stall information for CPU, memory, and IO")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent 24feb47 commit 1b69ac6

File tree: 3 files changed, 45 additions, 5 deletions

kernel/sched/psi.c

Lines changed: 17 additions & 4 deletions

@@ -124,6 +124,7 @@
  * sampling of the aggregate task states would be.
  */

+#include "../workqueue_internal.h"
 #include <linux/sched/loadavg.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
@@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		groupc->tasks[t]++;

 	write_seqcount_end(&groupc->seq);
-
-	if (!delayed_work_pending(&group->clock_work))
-		schedule_delayed_work(&group->clock_work, PSI_FREQ);
 }

 static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -513,6 +511,7 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
+	bool wake_clock = true;
 	void *iter = NULL;

 	if (!task->pid)
@@ -530,8 +529,22 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	task->psi_flags &= ~clear;
 	task->psi_flags |= set;

-	while ((group = iterate_groups(task, &iter)))
+	/*
+	 * Periodic aggregation shuts off if there is a period of no
+	 * task changes, so we wake it back up if necessary. However,
+	 * don't do this if the task change is the aggregation worker
+	 * itself going to sleep, or we'll ping-pong forever.
+	 */
+	if (unlikely((clear & TSK_RUNNING) &&
+		     (task->flags & PF_WQ_WORKER) &&
+		     wq_worker_last_func(task) == psi_update_work))
+		wake_clock = false;
+
+	while ((group = iterate_groups(task, &iter))) {
 		psi_group_change(group, cpu, clear, set);
+		if (wake_clock && !delayed_work_pending(&group->clock_work))
+			schedule_delayed_work(&group->clock_work, PSI_FREQ);
+	}
 }

 void psi_memstall_tick(struct task_struct *task, int cpu)

kernel/workqueue.c

Lines changed: 23 additions & 0 deletions

@@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
 	return to_wakeup ? to_wakeup->task : NULL;
 }

+/**
+ * wq_worker_last_func - retrieve worker's last work function
+ *
+ * Determine the last function a worker executed. This is called from
+ * the scheduler to get a worker's last known identity.
+ *
+ * CONTEXT:
+ * spin_lock_irq(rq->lock)
+ *
+ * Return:
+ * The last work function %current executed as a worker, NULL if it
+ * hasn't executed any work yet.
+ */
+work_func_t wq_worker_last_func(struct task_struct *task)
+{
+	struct worker *worker = kthread_data(task);
+
+	return worker->last_func;
+}
+
 /**
  * worker_set_flags - set worker flags and adjust nr_running accordingly
  * @worker: self
@@ -2184,6 +2204,9 @@ __acquires(&pool->lock)
 	if (unlikely(cpu_intensive))
 		worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

+	/* tag the worker for identification in schedule() */
+	worker->last_func = worker->current_func;
+
 	/* we're done with it, release */
 	hash_del(&worker->hentry);
 	worker->current_work = NULL;

kernel/workqueue_internal.h

Lines changed: 5 additions & 1 deletion

@@ -53,6 +53,9 @@ struct worker {

 	/* used only by rescuers to point to the target workqueue */
 	struct workqueue_struct	*rescue_wq;	/* I: the workqueue to rescue */
+
+	/* used by the scheduler to determine a worker's last known identity */
+	work_func_t		last_func;
 };

 /**
@@ -67,9 +70,10 @@ static inline struct worker *current_wq_worker(void)

 /*
  * Scheduler hooks for concurrency managed workqueue. Only to be used from
- * sched/core.c and workqueue.c.
+ * sched/ and workqueue.c.
  */
 void wq_worker_waking_up(struct task_struct *task, int cpu);
 struct task_struct *wq_worker_sleeping(struct task_struct *task);
+work_func_t wq_worker_last_func(struct task_struct *task);

 #endif /* _KERNEL_WORKQUEUE_INTERNAL_H */
