Commit c4843a7

gthelen authored and axboe committed
memcg: add per cgroup dirty page accounting
When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
per memcg memory.stat cgroupfs file.

The most recent past attempt at this was
http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback.  It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter.  The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.  Typical update operation:

	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
  rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
  rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat(), which returns the memcg later
needed by mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
  __mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
  __delete_from_page_cache(), replace_page_cache_page(),
  invalidate_complete_page2(), and __remove_mapping().

   text    data     bss      dec    hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                        +192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                        +773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952.  Lower is better
for all metrics; they are all wall clock or cycle counts.  The read
and write fault benchmarks just measure fault time, they do not
include I/O time.
* CONFIG_MEMCG not set:
                        baseline                        patched
  kbuild                1m25.030000 (+-0.088% 3 samples)  1m25.426667 (+-0.120% 3 samples)
  dd write 100 MiB      0.859211561 +-15.10%              0.874162885 +-15.03%
  dd write 200 MiB      1.670653105 +-17.87%              1.669384764 +-11.99%
  dd write 1000 MiB     8.434691190 +-14.15%              8.474733215 +-14.77%
  read fault cycles     254.0 (+-0.000% 10 samples)       253.0 (+-0.000% 10 samples)
  write fault cycles    2021.2 (+-3.070% 10 samples)      1984.5 (+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
                        baseline                        patched
  kbuild                1m25.716667 (+-0.105% 3 samples)  1m25.686667 (+-0.153% 3 samples)
  dd write 100 MiB      0.855650830 +-14.90%              0.887557919 +-14.90%
  dd write 200 MiB      1.688322953 +-12.72%              1.667682724 +-13.33%
  dd write 1000 MiB     8.418601605 +-14.30%              8.673532299 +-15.00%
  read fault cycles     266.0 (+-0.000% 10 samples)       266.0 (+-0.000% 10 samples)
  write fault cycles    2051.7 (+-1.349% 10 samples)      2049.6 (+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
                        baseline                        patched
  kbuild                1m26.120000 (+-0.273% 3 samples)  1m25.763333 (+-0.127% 3 samples)
  dd write 100 MiB      0.861723964 +-15.25%              0.818129350 +-14.82%
  dd write 200 MiB      1.669887569 +-13.30%              1.698645885 +-13.27%
  dd write 1000 MiB     8.383191730 +-14.65%              8.351742280 +-14.52%
  read fault cycles     265.7 (+-0.172% 10 samples)       267.0 (+-0.000% 10 samples)
  write fault cycles    2070.6 (+-1.512% 10 samples)      2084.4 (+-2.148% 10 samples)

As expected, anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
1 parent 11f81be commit c4843a7
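To make the update protocol from the commit message concrete, here is a
minimal sketch of a set-page-dirty style path under the new scheme.  It is
an illustration only, not a hunk from this commit: the bookkeeping done for
a newly dirty page (account_page_dirtied() and friends) is richer than
shown, and mem_cgroup_inc_page_stat() is assumed to be the increment
wrapper around mem_cgroup_update_page_stat().

	static int example_set_page_dirty(struct page *page,
					  struct address_space *mapping)
	{
		struct mem_cgroup *memcg;
		int newly_dirty;

		/* Pin the page's memcg and serialize against charge moves. */
		memcg = mem_cgroup_begin_page_stat(page);

		newly_dirty = !TestSetPageDirty(page);
		if (newly_dirty) {
			/*
			 * Global NR_FILE_DIRTY etc. accounting happens here
			 * (via account_page_dirtied()); the new per-memcg
			 * counter is bumped alongside it.
			 */
			mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
		}

		mem_cgroup_end_page_stat(memcg);

		/*
		 * __mark_inode_dirty() runs outside the page-stat section
		 * because mem_cgroup_begin_page_stat() may disable interrupts.
		 */
		if (newly_dirty)
			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

		return newly_dirty;
	}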

File tree: 12 files changed (+156, -39 lines)

Documentation/cgroups/memory.txt (1 addition, 0 deletions)

@@ -493,6 +493,7 @@ pgpgin - # of charging events to the memory cgroup. The charging
 pgpgout - # of uncharging events to the memory cgroup. The uncharging
	event happens each time a page is unaccounted from the cgroup.
 swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
 writeback - # of bytes of file/anon cache that are queued for syncing to
	disk.
 inactive_anon - # of bytes of anonymous and swap cache memory on inactive
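With this change the per-cgroup dirty byte count can be read from memory.stat
like any other field.  A small userspace sketch; the legacy (v1) memory
controller mount point and the group name "mygroup" are assumptions, adjust
them for the local setup:

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/sys/fs/cgroup/memory/mygroup/memory.stat", "r");
		char line[256];

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "dirty ", 6))
				fputs(line, stdout);	/* e.g. "dirty <bytes>" */
		fclose(f);
		return 0;
	}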

fs/buffer.c (27 additions, 7 deletions)

@@ -623,21 +623,22 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  *
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
+ *
+ * The caller must hold mem_cgroup_begin_page_stat() lock.
  */
-static void __set_page_dirty(struct page *page,
-			struct address_space *mapping, int warn)
+static void __set_page_dirty(struct page *page, struct address_space *mapping,
+			     struct mem_cgroup *memcg, int warn)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&mapping->tree_lock, flags);
 	if (page->mapping) { /* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
-		account_page_dirtied(page, mapping);
+		account_page_dirtied(page, mapping, memcg);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
 	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 }
 
 /*
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
 int __set_page_dirty_buffers(struct page *page)
 {
 	int newly_dirty;
+	struct mem_cgroup *memcg;
 	struct address_space *mapping = page_mapping(page);
 
 	if (unlikely(!mapping))
@@ -683,11 +685,22 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
+	/*
+	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
+	 * per-memcg dirty page counters.
+	 */
+	memcg = mem_cgroup_begin_page_stat(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
 	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
+		__set_page_dirty(page, mapping, memcg, 1);
+
+	mem_cgroup_end_page_stat(memcg);
+
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1158,11 +1171,18 @@ void mark_buffer_dirty(struct buffer_head *bh)
 
 	if (!test_set_buffer_dirty(bh)) {
 		struct page *page = bh->b_page;
+		struct address_space *mapping = NULL;
+		struct mem_cgroup *memcg;
+
+		memcg = mem_cgroup_begin_page_stat(page);
 		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
+			mapping = page_mapping(page);
 			if (mapping)
-				__set_page_dirty(page, mapping, 0);
+				__set_page_dirty(page, mapping, memcg, 0);
 		}
+		mem_cgroup_end_page_stat(memcg);
+		if (mapping)
+			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
 }
 EXPORT_SYMBOL(mark_buffer_dirty);

fs/xfs/xfs_aops.c (10 additions, 2 deletions)

@@ -1873,6 +1873,7 @@ xfs_vm_set_page_dirty(
 	loff_t			end_offset;
 	loff_t			offset;
 	int			newly_dirty;
+	struct mem_cgroup	*memcg;
 
 	if (unlikely(!mapping))
 		return !TestSetPageDirty(page);
@@ -1892,6 +1893,11 @@
 			offset += 1 << inode->i_blkbits;
 		} while (bh != head);
 	}
+	/*
+	 * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
+	 * per-memcg dirty page counters.
+	 */
+	memcg = mem_cgroup_begin_page_stat(page);
 	newly_dirty = !TestSetPageDirty(page);
 	spin_unlock(&mapping->private_lock);
 
@@ -1902,13 +1908,15 @@
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		if (page->mapping) { /* Race with truncate? */
 			WARN_ON_ONCE(!PageUptodate(page));
-			account_page_dirtied(page, mapping);
+			account_page_dirtied(page, mapping, memcg);
 			radix_tree_tag_set(&mapping->page_tree,
 					page_index(page), PAGECACHE_TAG_DIRTY);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	}
+	mem_cgroup_end_page_stat(memcg);
+	if (newly_dirty)
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return newly_dirty;
 }

include/linux/memcontrol.h (1 addition, 0 deletions)

@@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_RSS,		/* # of pages charged as anon rss */
 	MEM_CGROUP_STAT_RSS_HUGE,	/* # of pages charged as anon huge */
 	MEM_CGROUP_STAT_FILE_MAPPED,	/* # of pages charged as file rss */
+	MEM_CGROUP_STAT_DIRTY,		/* # of dirty pages in page cache */
 	MEM_CGROUP_STAT_WRITEBACK,	/* # of pages under writeback */
 	MEM_CGROUP_STAT_SWAP,		/* # of pages, swapped out */
 	MEM_CGROUP_STAT_NSTATS,

include/linux/mm.h (4 additions, 2 deletions)

@@ -1211,8 +1211,10 @@ int __set_page_dirty_nobuffers(struct page *page);
 int __set_page_dirty_no_writeback(struct page *page);
 int redirty_page_for_writepage(struct writeback_control *wbc,
 				struct page *page);
-void account_page_dirtied(struct page *page, struct address_space *mapping);
-void account_page_cleaned(struct page *page, struct address_space *mapping);
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+			  struct mem_cgroup *memcg);
+void account_page_cleaned(struct page *page, struct address_space *mapping,
+			  struct mem_cgroup *memcg);
 int set_page_dirty(struct page *page);
 int set_page_dirty_lock(struct page *page);
 void cancel_dirty_page(struct page *page);
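The mm/page-writeback.c side that consumes the new memcg argument is not part
of this excerpt.  Roughly, account_page_dirtied() bumps the per-memcg counter
next to the existing global/zone counters, along the lines of the sketch
below (simplified; mem_cgroup_inc_page_stat() and the exact set of global
counters updated are assumptions about the surrounding code, not hunks shown
on this page):

	void account_page_dirtied(struct page *page, struct address_space *mapping,
				  struct mem_cgroup *memcg)
	{
		if (mapping_cap_account_dirty(mapping)) {
			/* new: per-memcg dirty page accounting */
			mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
			/* existing global accounting, kept in the same place */
			__inc_zone_page_state(page, NR_FILE_DIRTY);
			__inc_zone_page_state(page, NR_DIRTIED);
			task_io_account_write(PAGE_CACHE_SIZE);
		}
	}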

include/linux/pagemap.h (2 additions, 1 deletion)

@@ -651,7 +651,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
 int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
 				pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern void __delete_from_page_cache(struct page *page, void *shadow,
+				     struct mem_cgroup *memcg);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 
 /*

mm/filemap.c (22 additions, 9 deletions)

@@ -100,6 +100,7 @@
  *    ->tree_lock		(page_remove_rmap->set_page_dirty)
  *    bdi.wb->list_lock		(page_remove_rmap->set_page_dirty)
  *    ->inode->i_lock		(page_remove_rmap->set_page_dirty)
+ *    ->memcg->move_lock	(page_remove_rmap->mem_cgroup_begin_page_stat)
  *    bdi.wb->list_lock		(zap_pte_range->set_page_dirty)
  *    ->inode->i_lock		(zap_pte_range->set_page_dirty)
  *    ->private_lock		(zap_pte_range->__set_page_dirty_buffers)
@@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping,
 /*
  * Delete a page from the page cache and free it. Caller has to make
  * sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin_page_stat().
  */
-void __delete_from_page_cache(struct page *page, void *shadow)
+void __delete_from_page_cache(struct page *page, void *shadow,
+			      struct mem_cgroup *memcg)
 {
 	struct address_space *mapping = page->mapping;
 
@@ -210,7 +213,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 	 * anyway will be cleared before returning page into buddy allocator.
 	 */
 	if (WARN_ON_ONCE(PageDirty(page)))
-		account_page_cleaned(page, mapping);
+		account_page_cleaned(page, mapping, memcg);
 }
 
 /**
@@ -224,14 +227,20 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 void delete_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
 	void (*freepage)(struct page *);
 
 	BUG_ON(!PageLocked(page));
 
 	freepage = mapping->a_ops->freepage;
-	spin_lock_irq(&mapping->tree_lock);
-	__delete_from_page_cache(page, NULL);
-	spin_unlock_irq(&mapping->tree_lock);
+
+	memcg = mem_cgroup_begin_page_stat(page);
+	spin_lock_irqsave(&mapping->tree_lock, flags);
+	__delete_from_page_cache(page, NULL, memcg);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+	mem_cgroup_end_page_stat(memcg);
 
 	if (freepage)
 		freepage(page);
@@ -470,6 +479,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 	if (!error) {
 		struct address_space *mapping = old->mapping;
 		void (*freepage)(struct page *);
+		struct mem_cgroup *memcg;
+		unsigned long flags;
 
 		pgoff_t offset = old->index;
 		freepage = mapping->a_ops->freepage;
@@ -478,15 +489,17 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
 		new->mapping = mapping;
 		new->index = offset;
 
-		spin_lock_irq(&mapping->tree_lock);
-		__delete_from_page_cache(old, NULL);
+		memcg = mem_cgroup_begin_page_stat(old);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		__delete_from_page_cache(old, NULL, memcg);
 		error = radix_tree_insert(&mapping->page_tree, offset, new);
 		BUG_ON(error);
 		mapping->nrpages++;
 		__inc_zone_page_state(new, NR_FILE_PAGES);
 		if (PageSwapBacked(new))
 			__inc_zone_page_state(new, NR_SHMEM);
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		mem_cgroup_end_page_stat(memcg);
 		mem_cgroup_migrate(old, new, true);
 		radix_tree_preload_end();
 		if (freepage)

mm/memcontrol.c (23 additions, 1 deletion)

@@ -90,6 +90,7 @@ static const char * const mem_cgroup_stat_names[] = {
 	"rss",
 	"rss_huge",
 	"mapped_file",
+	"dirty",
 	"writeback",
 	"swap",
 };
@@ -2011,6 +2012,7 @@ struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)
 
 	return memcg;
 }
+EXPORT_SYMBOL(mem_cgroup_begin_page_stat);
 
 /**
  * mem_cgroup_end_page_stat - finish a page state statistics transaction
@@ -2029,6 +2031,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
 
 	rcu_read_unlock();
 }
+EXPORT_SYMBOL(mem_cgroup_end_page_stat);
 
 /**
  * mem_cgroup_update_page_stat - update page state statistics
@@ -4746,6 +4749,7 @@ static int mem_cgroup_move_account(struct page *page,
 {
 	unsigned long flags;
 	int ret;
+	bool anon;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -4771,15 +4775,33 @@
 	if (page->mem_cgroup != from)
 		goto out_unlock;
 
+	anon = PageAnon(page);
+
 	spin_lock_irqsave(&from->move_lock, flags);
 
-	if (!PageAnon(page) && page_mapped(page)) {
+	if (!anon && page_mapped(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 		__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
 			       nr_pages);
 	}
 
+	/*
+	 * move_lock grabbed above and caller set from->moving_account, so
+	 * mem_cgroup_update_page_stat() will serialize updates to PageDirty.
+	 * So mapping should be stable for dirty pages.
+	 */
+	if (!anon && PageDirty(page)) {
+		struct address_space *mapping = page_mapping(page);
+
+		if (mapping_cap_account_dirty(mapping)) {
+			__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
+				       nr_pages);
+			__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
+				       nr_pages);
+		}
+	}
+
 	if (PageWriteback(page)) {
 		__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
 			       nr_pages);
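The stat update that callers issue between mem_cgroup_begin_page_stat() and
mem_cgroup_end_page_stat() is just a per-CPU add on the memcg's counter
array, which is why the overhead summary in the commit message is dominated
by the begin/end locking rather than by the update itself.  A sketch of the
update helper as it exists around this kernel version, shown for orientation
only (it is not one of the hunks on this page):

	void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
					 enum mem_cgroup_stat_index idx, int val)
	{
		/*
		 * Safe because the caller holds the RCU read lock (and, while
		 * a charge move is in flight, the memcg's move_lock) via
		 * mem_cgroup_begin_page_stat().
		 */
		if (memcg)
			this_cpu_add(memcg->stat->count[idx], val);
	}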
