Commit 5beb493

Rik van Riel authored and torvalds committed
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach into the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2.

The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time; there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1 parent 648bcc7 commit 5beb493
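
The heart of this change is the new anon_vma_chain object (see the include/linux/rmap.h hunk below), which forms a many-to-many link between VMAs and anon_vmas. As a reading aid, here is a minimal sketch of how one such link is wired up; the helper name anon_vma_chain_link is an assumption, since mm/rmap.c is not among the hunks shown on this page:

/*
 * Sketch only: wiring one anon_vma_chain between a VMA and an
 * anon_vma.  Field and list names match the rmap.h hunk below;
 * the helper itself is illustrative, not copied from this commit.
 */
static void anon_vma_chain_link(struct vm_area_struct *vma,
                                struct anon_vma_chain *avc,
                                struct anon_vma *anon_vma)
{
        avc->vma = vma;
        avc->anon_vma = anon_vma;

        /* VMA side: serialized by mmap_sem & page_table_lock. */
        list_add(&avc->same_vma, &vma->anon_vma_chain);

        /* anon_vma side: serialized by anon_vma->lock. */
        spin_lock(&anon_vma->lock);
        list_add_tail(&avc->same_anon_vma, &anon_vma->head);
        spin_unlock(&anon_vma->lock);
}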

14 files changed: +298 -85 lines changed


arch/ia64/kernel/perfmon.c

Lines changed: 1 addition & 0 deletions
@@ -2315,6 +2315,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
                 DPRINT(("Cannot allocate vma\n"));
                 goto error_kmem;
         }
+        INIT_LIST_HEAD(&vma->anon_vma_chain);
 
         /*
          * partially initialize the vma for the sampling buffer

arch/ia64/mm/init.c

Lines changed: 2 additions & 0 deletions
@@ -117,6 +117,7 @@ ia64_init_addr_space (void)
          */
         vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
         if (vma) {
+                INIT_LIST_HEAD(&vma->anon_vma_chain);
                 vma->vm_mm = current->mm;
                 vma->vm_start = current->thread.rbs_bot & PAGE_MASK;
                 vma->vm_end = vma->vm_start + PAGE_SIZE;
@@ -135,6 +136,7 @@ ia64_init_addr_space (void)
         if (!(current->personality & MMAP_PAGE_ZERO)) {
                 vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
                 if (vma) {
+                        INIT_LIST_HEAD(&vma->anon_vma_chain);
                         vma->vm_mm = current->mm;
                         vma->vm_end = PAGE_SIZE;
                         vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);

fs/exec.c

Lines changed: 4 additions & 2 deletions
@@ -246,6 +246,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
         vma->vm_start = vma->vm_end - PAGE_SIZE;
         vma->vm_flags = VM_STACK_FLAGS;
         vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
+        INIT_LIST_HEAD(&vma->anon_vma_chain);
         err = insert_vm_struct(mm, vma);
         if (err)
                 goto err;
@@ -516,7 +517,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
         /*
          * cover the whole range: [new_start, old_end)
          */
-        vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL);
+        if (vma_adjust(vma, new_start, old_end, vma->vm_pgoff, NULL))
+                return -ENOMEM;
 
         /*
          * move the page tables downwards, on failure we rely on
@@ -547,7 +549,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
         tlb_finish_mmu(tlb, new_end, old_end);
 
         /*
-         * shrink the vma to just the new range.
+         * Shrink the vma to just the new range.  Always succeeds.
          */
         vma_adjust(vma, new_start, new_end, vma->vm_pgoff, NULL);
 

include/linux/mm.h

Lines changed: 5 additions & 1 deletion
@@ -97,7 +97,11 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_NORESERVE    0x00200000      /* should the VM suppress accounting */
 #define VM_HUGETLB      0x00400000      /* Huge TLB Page VM */
 #define VM_NONLINEAR    0x00800000      /* Is non-linear (remap_file_pages) */
+#ifdef CONFIG_MMU
+#define VM_LOCK_RMAP    0x01000000      /* Do not follow this rmap (mmu mmap) */
+#else
 #define VM_MAPPED_COPY  0x01000000      /* T if mapped copy of data (nommu mmap) */
+#endif
 #define VM_INSERTPAGE   0x02000000      /* The vma has had "vm_insert_page()" done on it */
 #define VM_ALWAYSDUMP   0x04000000      /* Always include in core dumps */
 
@@ -1216,7 +1220,7 @@ static inline void vma_nonlinear_insert(struct vm_area_struct *vma,
 
 /* mmap.c */
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
-extern void vma_adjust(struct vm_area_struct *vma, unsigned long start,
+extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
         unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert);
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
         struct vm_area_struct *prev, unsigned long addr, unsigned long end,

include/linux/mm_types.h

Lines changed: 2 additions & 1 deletion
@@ -163,7 +163,8 @@ struct vm_area_struct {
          * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
          * or brk vma (with NULL file) can only be in an anon_vma list.
          */
-        struct list_head anon_vma_node; /* Serialized by anon_vma->lock */
+        struct list_head anon_vma_chain; /* Serialized by mmap_sem &
+                                          * page_table_lock */
         struct anon_vma *anon_vma;      /* Serialized by page_table_lock */
 
         /* Function pointers to deal with this struct. */

include/linux/rmap.h

Lines changed: 31 additions & 4 deletions
@@ -37,7 +37,27 @@ struct anon_vma {
          * is serialized by a system wide lock only visible to
          * mm_take_all_locks() (mm_all_locks_mutex).
          */
-        struct list_head head;  /* List of private "related" vmas */
+        struct list_head head;  /* Chain of private "related" vmas */
+};
+
+/*
+ * The copy-on-write semantics of fork mean that an anon_vma
+ * can become associated with multiple processes. Furthermore,
+ * each child process will have its own anon_vma, where new
+ * pages for that process are instantiated.
+ *
+ * This structure allows us to find the anon_vmas associated
+ * with a VMA, or the VMAs associated with an anon_vma.
+ * The "same_vma" list contains the anon_vma_chains linking
+ * all the anon_vmas associated with this VMA.
+ * The "same_anon_vma" list contains the anon_vma_chains
+ * which link all the VMAs associated with this anon_vma.
+ */
+struct anon_vma_chain {
+        struct vm_area_struct *vma;
+        struct anon_vma *anon_vma;
+        struct list_head same_vma;      /* locked by mmap_sem & page_table_lock */
+        struct list_head same_anon_vma; /* locked by anon_vma->lock */
 };
 
 #ifdef CONFIG_MMU
@@ -89,12 +109,19 @@ static inline void anon_vma_unlock(struct vm_area_struct *vma)
  */
 void anon_vma_init(void);       /* create anon_vma_cachep */
 int anon_vma_prepare(struct vm_area_struct *);
-void __anon_vma_merge(struct vm_area_struct *, struct vm_area_struct *);
-void anon_vma_unlink(struct vm_area_struct *);
-void anon_vma_link(struct vm_area_struct *);
+void unlink_anon_vmas(struct vm_area_struct *);
+int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
+int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
 void __anon_vma_link(struct vm_area_struct *);
 void anon_vma_free(struct anon_vma *);
 
+static inline void anon_vma_merge(struct vm_area_struct *vma,
+                                  struct vm_area_struct *next)
+{
+        VM_BUG_ON(vma->anon_vma != next->anon_vma);
+        unlink_anon_vmas(next);
+}
+
 /*
  * rmap interfaces called when adding or removing pte of page
  */
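
Given these declarations, anon_vma_clone() plausibly walks the source VMA's same_vma list and allocates one anon_vma_chain per attached anon_vma, unwinding on allocation failure. A hedged sketch follows; the real body lives in mm/rmap.c (not shown on this page), and anon_vma_chain_cachep plus the anon_vma_chain_link() helper are assumed names:

/*
 * Illustrative sketch of anon_vma_clone(), inferred from the
 * declarations above.  anon_vma_chain_cachep and
 * anon_vma_chain_link() are assumptions.
 */
int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
{
        struct anon_vma_chain *avc, *pavc;

        /* Attach dst to every anon_vma that src is attached to. */
        list_for_each_entry(pavc, &src->anon_vma_chain, same_vma) {
                avc = kmem_cache_alloc(anon_vma_chain_cachep, GFP_KERNEL);
                if (!avc)
                        goto enomem_failure;
                anon_vma_chain_link(dst, avc, pavc->anon_vma);
        }
        return 0;

 enomem_failure:
        unlink_anon_vmas(dst);  /* undo any partial linking */
        return -ENOMEM;
}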

kernel/fork.c

Lines changed: 5 additions & 1 deletion
@@ -329,15 +329,17 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
                 if (!tmp)
                         goto fail_nomem;
                 *tmp = *mpnt;
+                INIT_LIST_HEAD(&tmp->anon_vma_chain);
                 pol = mpol_dup(vma_policy(mpnt));
                 retval = PTR_ERR(pol);
                 if (IS_ERR(pol))
                         goto fail_nomem_policy;
                 vma_set_policy(tmp, pol);
+                if (anon_vma_fork(tmp, mpnt))
+                        goto fail_nomem_anon_vma_fork;
                 tmp->vm_flags &= ~VM_LOCKED;
                 tmp->vm_mm = mm;
                 tmp->vm_next = NULL;
-                anon_vma_link(tmp);
                 file = tmp->vm_file;
                 if (file) {
                         struct inode *inode = file->f_path.dentry->d_inode;
@@ -392,6 +394,8 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
         flush_tlb_mm(oldmm);
         up_write(&oldmm->mmap_sem);
         return retval;
+fail_nomem_anon_vma_fork:
+        mpol_put(pol);
 fail_nomem_policy:
         kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
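
The anon_vma_fork(tmp, mpnt) call above is where the commit message's fork-time behavior happens: the child VMA is first chained to all of the parent's anon_vmas (for not-yet-COWed pages), then given a fresh anon_vma of its own for pages it COWs later. A hedged sketch of that shape; anon_vma_alloc() and anon_vma_chain_cachep are assumed names, and the real code is in mm/rmap.c, which this page does not show:

/* Sketch of anon_vma_fork(), per the commit message's description. */
int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
{
        struct anon_vma_chain *avc;
        struct anon_vma *anon_vma;

        /* Parent has no anonymous pages yet: nothing to link. */
        if (!pvma->anon_vma)
                return 0;

        /* First, link the child to all of the parent's anon_vmas. */
        if (anon_vma_clone(vma, pvma))
                return -ENOMEM;

        /* Then create a fresh anon_vma for the child's own COWed pages. */
        anon_vma = anon_vma_alloc();                    /* assumed helper */
        if (!anon_vma)
                goto out_error;
        avc = kmem_cache_alloc(anon_vma_chain_cachep, GFP_KERNEL);
        if (!avc)
                goto out_error_free_anon_vma;

        vma->anon_vma = anon_vma;
        anon_vma_chain_link(vma, avc, anon_vma);
        return 0;

 out_error_free_anon_vma:
        anon_vma_free(anon_vma);
 out_error:
        unlink_anon_vmas(vma);
        return -ENOMEM;
}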

mm/ksm.c

Lines changed: 9 additions & 3 deletions
@@ -1563,10 +1563,12 @@ int page_referenced_ksm(struct page *page, struct mem_cgroup *memcg,
 again:
         hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
                 struct anon_vma *anon_vma = rmap_item->anon_vma;
+                struct anon_vma_chain *vmac;
                 struct vm_area_struct *vma;
 
                 spin_lock(&anon_vma->lock);
-                list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+                list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+                        vma = vmac->vma;
                         if (rmap_item->address < vma->vm_start ||
                             rmap_item->address >= vma->vm_end)
                                 continue;
@@ -1614,10 +1616,12 @@ int try_to_unmap_ksm(struct page *page, enum ttu_flags flags)
 again:
         hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
                 struct anon_vma *anon_vma = rmap_item->anon_vma;
+                struct anon_vma_chain *vmac;
                 struct vm_area_struct *vma;
 
                 spin_lock(&anon_vma->lock);
-                list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+                list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+                        vma = vmac->vma;
                         if (rmap_item->address < vma->vm_start ||
                             rmap_item->address >= vma->vm_end)
                                 continue;
@@ -1664,10 +1668,12 @@ int rmap_walk_ksm(struct page *page, int (*rmap_one)(struct page *,
 again:
         hlist_for_each_entry(rmap_item, hlist, &stable_node->hlist, hlist) {
                 struct anon_vma *anon_vma = rmap_item->anon_vma;
+                struct anon_vma_chain *vmac;
                 struct vm_area_struct *vma;
 
                 spin_lock(&anon_vma->lock);
-                list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
+                list_for_each_entry(vmac, &anon_vma->head, same_anon_vma) {
+                        vma = vmac->vma;
                         if (rmap_item->address < vma->vm_start ||
                             rmap_item->address >= vma->vm_end)
                                 continue;

mm/memory-failure.c

Lines changed: 4 additions & 1 deletion
@@ -383,9 +383,12 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
         if (av == NULL) /* Not actually mapped anymore */
                 goto out;
         for_each_process (tsk) {
+                struct anon_vma_chain *vmac;
+
                 if (!task_early_kill(tsk))
                         continue;
-                list_for_each_entry (vma, &av->head, anon_vma_node) {
+                list_for_each_entry(vmac, &av->head, same_anon_vma) {
+                        vma = vmac->vma;
                         if (!page_mapped_in_vma(page, vma))
                                 continue;
                         if (vma->vm_mm == tsk->mm)

mm/memory.c

Lines changed: 2 additions & 2 deletions
@@ -374,7 +374,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
          * Hide vma from rmap and truncate_pagecache before freeing
          * pgtables
          */
-        anon_vma_unlink(vma);
+        unlink_anon_vmas(vma);
         unlink_file_vma(vma);
 
         if (is_vm_hugetlb_page(vma)) {
@@ -388,7 +388,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
             && !is_vm_hugetlb_page(next)) {
                         vma = next;
                         next = vma->vm_next;
-                        anon_vma_unlink(vma);
+                        unlink_anon_vmas(vma);
                         unlink_file_vma(vma);
                 }
                 free_pgd_range(tlb, addr, vma->vm_end,
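
unlink_anon_vmas() replaces anon_vma_unlink() here because a VMA may now hang off several anon_vmas, so teardown has to walk the whole chain rather than drop a single list node. A hedged sketch of that walk; the per-entry unlink and free helpers are assumed names, and mm/rmap.c is not among the hunks shown:

/* Sketch of unlink_anon_vmas(): tear down every chain entry. */
void unlink_anon_vmas(struct vm_area_struct *vma)
{
        struct anon_vma_chain *avc, *next;

        /* _safe variant, because each entry is freed as we go. */
        list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
                anon_vma_unlink(avc);           /* assumed: drops same_anon_vma link */
                list_del(&avc->same_vma);
                anon_vma_chain_free(avc);       /* assumed free helper */
        }
}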
