
Commit 074c238

Mel Gorman authored and torvalds committed
mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   - 56.07% 56.07% [kernel] [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled. Normally, the scan rate is tuned on
a per-task basis depending on the locality of faults. However, if
migrations fail for any reason then the PTE scanner may scan faster if the
faults continue to be remote. This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen. This patch tracks when migration failures occur and slows
the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent b191f9b commit 074c238

4 files changed: +15 -8 lines changed


include/linux/sched.h

Lines changed: 5 additions & 4 deletions
@@ -1625,11 +1625,11 @@ struct task_struct {
 
 	/*
 	 * numa_faults_locality tracks if faults recorded during the last
-	 * scan window were remote/local. The task scan period is adapted
-	 * based on the locality of the faults with different weights
-	 * depending on whether they were shared or private faults
+	 * scan window were remote/local or failed to migrate. The task scan
+	 * period is adapted based on the locality of the faults with different
+	 * weights depending on whether they were shared or private faults
 	 */
-	unsigned long numa_faults_locality[2];
+	unsigned long numa_faults_locality[3];
 
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
@@ -1719,6 +1719,7 @@ struct task_struct {
 #define TNF_NO_GROUP 0x02
 #define TNF_SHARED 0x04
 #define TNF_FAULT_LOCAL 0x08
+#define TNF_MIGRATE_FAIL 0x10
 
 #ifdef CONFIG_NUMA_BALANCING
 extern void task_numa_fault(int last_node, int node, int pages, int flags);
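
The two sched.h hunks widen `numa_faults_locality` to three slots and add one flag bit. As a rough user-space sketch of how a flag word could be routed into those counters: the enum names, `struct numa_stats`, and `record_fault()` are hypothetical, and the mapping of slot 0 to remote faults and slot 1 to local faults is an assumption, since this diff only shows slot 2 being used for failed migrations.

```c
#include <assert.h>

/* Flag values copied from the diff context above. */
#define TNF_NO_GROUP     0x02
#define TNF_SHARED       0x04
#define TNF_FAULT_LOCAL  0x08
#define TNF_MIGRATE_FAIL 0x10

/* Hypothetical names for the three slots; the kernel uses bare indices. */
enum { LOC_REMOTE = 0, LOC_LOCAL = 1, LOC_MIGRATE_FAIL = 2, LOC_SLOTS = 3 };

struct numa_stats {
	unsigned long faults_locality[LOC_SLOTS];
};

/* Account `pages` faulted pages according to the fault flags. */
static void record_fault(struct numa_stats *s, int pages, int flags)
{
	int local = !!(flags & TNF_FAULT_LOCAL);

	assert(pages >= 0);
	/* Assumed layout: slot 0 counts remote faults, slot 1 local faults. */
	s->faults_locality[local] += pages;
	/* Slot 2 is the new counter for pages that failed to migrate. */
	if (flags & TNF_MIGRATE_FAIL)
		s->faults_locality[LOC_MIGRATE_FAIL] += pages;
}
```

Because `TNF_MIGRATE_FAIL` occupies its own bit, a failed migration is counted in addition to, not instead of, the remote/local classification.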

kernel/sched/fair.c

Lines changed: 6 additions & 2 deletions
@@ -1609,9 +1609,11 @@ static void update_task_scan_period(struct task_struct *p,
 	/*
 	 * If there were no record hinting faults then either the task is
 	 * completely idle or all activity is areas that are not of interest
-	 * to automatic numa balancing. Scan slower
+	 * to automatic numa balancing. Related to that, if there were failed
+	 * migration then it implies we are migrating too quickly or the local
+	 * node is overloaded. In either case, scan slower
 	 */
-	if (local + shared == 0) {
+	if (local + shared == 0 || p->numa_faults_locality[2]) {
 		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period << 1);
 

@@ -2080,6 +2082,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 
 	if (migrated)
 		p->numa_pages_migrated += pages;
+	if (flags & TNF_MIGRATE_FAIL)
+		p->numa_faults_locality[2] += pages;
 
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
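
The scan-period change in `update_task_scan_period()` can be modelled outside the kernel. The following is only a sketch: `struct scan_state` and `maybe_slow_scan()` are hypothetical stand-ins for the task fields and the branch shown in the hunk, and `local` and `shared` are passed in as plain arguments here instead of being derived from the task.

```c
#include <assert.h>

/* Hypothetical stand-in for the task fields touched by the hunk above. */
struct scan_state {
	unsigned int scan_period;		/* current scan period */
	unsigned int scan_period_max;		/* upper bound on the period */
	unsigned long faults_locality[3];	/* [2] counts failed migrations */
};

/*
 * Double the scan period (capped at the maximum) when no faults were
 * recorded or when any migration failed in the last window.
 * Returns 1 if the scanner was slowed, 0 otherwise.
 */
static int maybe_slow_scan(struct scan_state *st,
			   unsigned long local, unsigned long shared)
{
	assert(st->scan_period <= st->scan_period_max);

	if (local + shared == 0 || st->faults_locality[2]) {
		unsigned int doubled = st->scan_period << 1;

		st->scan_period = doubled < st->scan_period_max ?
				  doubled : st->scan_period_max;
		return 1;
	}
	return 0;
}
```

Starting from a period of 1000 with a maximum of 3000, a window with failed migrations moves the period to 2000, and a second such window clamps it at 3000. A window with faults and no failures leaves this branch untaken.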

mm/huge_memory.c

Lines changed: 2 additions & 1 deletion
@@ -1350,7 +1350,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		flags |= TNF_MIGRATED;
 		page_nid = target_nid;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 	goto out;
 clear_pmdnuma:

mm/memory.c

Lines changed: 2 additions & 1 deletion
@@ -3103,7 +3103,8 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (migrated) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	}
+	} else
+		flags |= TNF_MIGRATE_FAIL;
 
 out:
 	if (page_nid != -1)
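
Both fault handlers (`do_huge_pmd_numa_page()` and `do_numa_page()`) now end their migration attempt the same way: the flag word records either success or failure. A small sketch of that pattern follows. `flags_after_migration()` is a hypothetical helper, and the value used for `TNF_MIGRATED` is an assumption, since its definition does not appear in this diff.

```c
#include <assert.h>
#include <stdbool.h>

#define TNF_MIGRATED     0x01	/* assumed value, not shown in the diff */
#define TNF_MIGRATE_FAIL 0x10	/* added by this commit */

/*
 * Mirror the tail of the fault handlers: after a migration attempt,
 * exactly one of the two outcome flags is set on top of the others.
 */
static int flags_after_migration(int flags, bool migrated)
{
	/* Neither outcome flag should be set before the attempt. */
	assert(!(flags & (TNF_MIGRATED | TNF_MIGRATE_FAIL)));

	if (migrated)
		flags |= TNF_MIGRATED;
	else
		flags |= TNF_MIGRATE_FAIL;
	return flags;
}
```

The resulting flag word is what `task_numa_fault()` later inspects to bump the failed-migration counter.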
