
Commit 789ba28

gormanm authored and Ingo Molnar committed
Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()"
This reverts commit 7347fc8.

Srikar Dronamraju pointed out that while the commit in question did show a performance improvement on ppc64, it did so at the cost of disabling active CPU migration by automatic NUMA balancing, which was not the intent. The issue was a serious flaw in the logic: active balancing would never happen if SD_WAKE_AFFINE was disabled on the scheduler domains, and even when it is enabled the logic is still bizarre and against the original intent.

Investigation showed that fixing the patch either in the way he suggested (using the correct comparison for jiffies values) or by introducing a new numa_migrate_deferred variable in task_struct performed similarly to a revert, with a mix of gains and losses depending on the workload, machine and socket count.

The original intent of the commit was to handle a problem whereby wake_affine, idle balancing and automatic NUMA balancing disagree on the appropriate placement for a task. This was particularly true for cases where a single task was a massive waker of tasks but where the wake_wide logic did not apply. It was especially noticeable when a futex (acting as a barrier) woke all worker threads and the scheduler tried pulling the wakees to the waker's node. In that specific case the problem could be handled by tuning MPI or OpenMP appropriately, but the behavior is not illogical and was worth attempting to fix. However, the approach was wrong. Given that we're at rc4 and a fix is not obvious, it's better to play it safe, revert this commit and retry later.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1 parent 94d7dbf commit 789ba28
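For context on the "correct comparison for jiffies values" mentioned in the commit message: jiffies is a wrapping unsigned counter, so the kernel compares such timestamps with the wrap-safe helpers in <linux/jiffies.h> (time_after() and friends) rather than a raw '>' like the reverted hunk used. The following is an illustrative userspace sketch only, not kernel code and not part of this commit; wrap_safe_after() is a stand-in that mimics time_after() to show why the raw comparison misbehaves near the wrap point.

/*
 * Illustrative sketch only -- not from this commit or the kernel tree.
 * wrap_safe_after() mimics the kernel's time_after(a, b), which compares
 * wrapping counters via a signed subtraction.
 */
#include <stdio.h>

static int wrap_safe_after(unsigned long a, unsigned long b)
{
	/* Same trick as time_after(a, b) in <linux/jiffies.h>. */
	return (long)(b - a) < 0;
}

int main(void)
{
	unsigned long now   = (unsigned long)-10;	/* counter about to wrap */
	unsigned long retry = now + 15;			/* 15 ticks later, after the wrap */

	/* Raw '>' wrongly claims the later timestamp is not after "now". */
	printf("raw '>'   : %d\n", retry > now);			/* prints 0 */

	/* Wrap-safe comparison correctly sees it as later. */
	printf("wrap-safe : %d\n", wrap_safe_after(retry, now));	/* prints 1 */

	return 0;
}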


kernel/sched/fair.c

Lines changed: 1 addition & 56 deletions
@@ -1854,26 +1854,14 @@ static int task_numa_migrate(struct task_struct *p)
 static void numa_migrate_preferred(struct task_struct *p)
 {
 	unsigned long interval = HZ;
-	unsigned long numa_migrate_retry;
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
 		return;
 
 	/* Periodically retry migrating the task to the preferred node */
 	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
-	numa_migrate_retry = jiffies + interval;
-
-	/*
-	 * Check that the new retry threshold is after the current one. If
-	 * the retry is in the future, it implies that wake_affine has
-	 * temporarily asked NUMA balancing to backoff from placement.
-	 */
-	if (numa_migrate_retry > p->numa_migrate_retry)
-		return;
-
-	/* Safe to try placing the task on the preferred node */
-	p->numa_migrate_retry = numa_migrate_retry;
+	p->numa_migrate_retry = jiffies + interval;
 
 	/* Success if task is already running on preferred CPU */
 	if (task_node(p) == p->numa_preferred_nid)
@@ -5922,48 +5910,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-static void
-update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
-{
-	unsigned long interval;
-
-	if (!static_branch_likely(&sched_numa_balancing))
-		return;
-
-	/* If balancing has no preference then continue gathering data */
-	if (p->numa_preferred_nid == -1)
-		return;
-
-	/*
-	 * If the wakeup is not affecting locality then it is neutral from
-	 * the perspective of NUMA balacing so continue gathering data.
-	 */
-	if (cpu_to_node(prev_cpu) == cpu_to_node(target))
-		return;
-
-	/*
-	 * Temporarily prevent NUMA balancing trying to place waker/wakee after
-	 * wakee has been moved by wake_affine. This will potentially allow
-	 * related tasks to converge and update their data placement. The
-	 * 4 * numa_scan_period is to allow the two-pass filter to migrate
-	 * hot data to the wakers node.
-	 */
-	interval = max(sysctl_numa_balancing_scan_delay,
-		       p->numa_scan_period << 2);
-	p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
-
-	interval = max(sysctl_numa_balancing_scan_delay,
-		       current->numa_scan_period << 2);
-	current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
-}
-#else
-static void
-update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
-{
-}
-#endif
-
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5979,7 +5925,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
-	update_wa_numa_placement(p, prev_cpu, target);
 	schedstat_inc(sd->ttwu_move_affine);
 	schedstat_inc(p->se.statistics.nr_wakeups_affine);
 	return target;
