Commit 1c30844

gormanm authored and torvalds committed
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation-causing event occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken, if the calling context allows, to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by this patch whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation-causing events, a big reduction in the huge page fault latency, and a higher allocation success rate.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1 parent 0a79cda commit 1c30844

6 files changed, +202 -15 lines changed

Documentation/sysctl/vm.txt

Lines changed: 21 additions & 0 deletions
@@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -856,6 +857,26 @@ ten times more freeable objects than there are.
 
 =============================================================
 
+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15,000 means
+that up to 150% of the high watermark will be reclaimed in the event of
+a pageblock being mixed due to fragmentation. The level of reclaim is
+determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
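
The unit described in the watermark_boost_factor text above can be checked with a little arithmetic. The following is a minimal user-space sketch, not kernel code: the helper names (`example_mult_frac`, `example_max_boost_pages`) are hypothetical, and the 512-page pageblock is an assumption that matches 4 KiB pages with order-9 pageblocks on 64-bit x86.

```c
#include <assert.h>

/* Assumption for illustration: order-9 pageblocks of 4 KiB pages (2 MiB). */
#define EXAMPLE_PAGEBLOCK_NR_PAGES 512UL

/*
 * Hypothetical helper: x * numer / denom, split into quotient and remainder
 * parts so that the intermediate product is less likely to overflow.
 */
static unsigned long example_mult_frac(unsigned long x, unsigned long numer,
				       unsigned long denom)
{
	assert(denom != 0);
	return (x / denom) * numer + ((x % denom) * numer) / denom;
}

/*
 * Hypothetical helper: upper bound on the boost, in pages, for a zone whose
 * unboosted high watermark is high_wmark_pages. boost_factor is expressed
 * in fractions of 10,000, so 15000 means 150%.
 */
static unsigned long example_max_boost_pages(unsigned long high_wmark_pages,
					     unsigned long boost_factor)
{
	unsigned long max_boost;

	if (boost_factor == 0)
		return 0;	/* a factor of 0 disables the feature */

	max_boost = example_mult_frac(high_wmark_pages, boost_factor, 10000);

	/* Never less than one pageblock's worth of pages. */
	if (max_boost < EXAMPLE_PAGEBLOCK_NR_PAGES)
		max_boost = EXAMPLE_PAGEBLOCK_NR_PAGES;

	return max_boost;
}
```

Under these assumptions, a high watermark of 10,000 pages with the default factor of 15,000 gives a ceiling of 15,000 pages (150% of the high watermark), while a zone with a 100-page high watermark is raised to the 512-page pageblock floor.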

include/linux/mm.h

Lines changed: 1 addition & 0 deletions
@@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 
 /* nommu.c */

include/linux/mmzone.h

Lines changed: 7 additions & 4 deletions
@@ -269,10 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
 	int count; /* number of pages in the list */
@@ -364,6 +364,7 @@ struct zone {
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
 
@@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
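
Because every `*_wmark_pages()` accessor in this file now adds `watermark_boost`, any caller that compares free pages against a watermark sees a higher threshold while a boost is active. The sketch below models that effect in user space. The `example_` types and functions are hypothetical stand-ins, and the comparison is a deliberate simplification of the kernel's real watermark checks, which take more inputs into account.

```c
#include <assert.h>
#include <stdbool.h>

enum example_wmark { EX_WMARK_MIN, EX_WMARK_LOW, EX_WMARK_HIGH, EX_NR_WMARK };

/* Hypothetical, reduced stand-in for the watermark fields of struct zone. */
struct example_zone {
	unsigned long watermark[EX_NR_WMARK];	/* unboosted values, in pages */
	unsigned long watermark_boost;		/* temporary boost, in pages */
};

/* Mirrors the shape of wmark_pages(z, i): base watermark plus the boost. */
static unsigned long example_wmark_pages(const struct example_zone *z,
					 enum example_wmark i)
{
	assert(i < EX_NR_WMARK);
	return z->watermark[i] + z->watermark_boost;
}

/* Simplified check: are free pages still above the (boosted) low watermark? */
static bool example_above_low_wmark(const struct example_zone *z,
				    unsigned long free_pages)
{
	return free_pages > example_wmark_pages(z, EX_WMARK_LOW);
}
```

With watermarks of 1000/1250/1500 pages and 2000 free pages, the zone passes the low-watermark check when unboosted. After a boost of 1024 pages the effective low watermark becomes 2274 pages, so the same 2000 free pages no longer pass.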

kernel/sysctl.c

Lines changed: 8 additions & 0 deletions
@@ -1462,6 +1462,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler = min_free_kbytes_sysctl_handler,
 		.extra1 = &zero,
 	},
+	{
+		.procname = "watermark_boost_factor",
+		.data = &watermark_boost_factor,
+		.maxlen = sizeof(watermark_boost_factor),
+		.mode = 0644,
+		.proc_handler = watermark_boost_factor_sysctl_handler,
+		.extra1 = &zero,
+	},
 	{
 		.procname = "watermark_scale_factor",
 		.data = &watermark_scale_factor,
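
The table entry above exposes the tunable as an integer file with mode 0644 and a lower bound of zero. As a rough illustration of how a user-space tool might read or set it, here is a small sketch; the helper names are hypothetical, and the path `/proc/sys/vm/watermark_boost_factor` is inferred from the procname and the documentation's statement that these files live in /proc/sys/vm.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: read one integer from a file. Returns 0 on success. */
static int example_read_int_file(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer to a file. Returns 0 on success. */
static int example_write_int_file(const char *path, int value)
{
	FILE *f;
	int ok;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", value) > 0);
	/* fclose() flushes; a rejected write may only surface here. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

Calling `example_read_int_file("/proc/sys/vm/watermark_boost_factor", &v)` on a kernel that carries this patch would be expected to yield the current factor. Writing needs sufficient privileges given the 0644 mode, and per the documentation a value of 0 disables the feature.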

mm/page_alloc.c

Lines changed: 41 additions & 2 deletions
@@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
  * itself, so pages freed in the future will be put on the correct free list.
  */
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
-					int start_type, bool whole_block)
+		unsigned int alloc_flags, int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
 	struct free_area *area;
@@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}
 
+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	if (alloc_flags & ALLOC_KSWAPD)
+		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -2443,7 +2468,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 	page = list_first_entry(&area->free_list[fallback_mt],
 							struct page, lru);
 
-	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
+	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
+								can_steal);
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
@@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void)
 
 		zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
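
The `boost_watermark()` hunk in this file grows the boost by one pageblock per fragmentation event and clamps it at a ceiling derived from the high watermark. The sketch below models only that accumulation step in user space. The `example_` names are hypothetical, the ceiling is passed in as a precomputed argument, and the 512-page pageblock is an assumption for 64-bit x86. How the boost is later cleared by kswapd is not shown in the hunks above and is not modelled here.

```c
#include <assert.h>

/* Assumption for illustration: order-9 pageblocks of 4 KiB pages. */
#define EXAMPLE_PAGEBLOCK_NR_PAGES 512UL

/* One fragmentation event: add a pageblock of boost, clamped to max_boost. */
static unsigned long example_boost_after_event(unsigned long boost,
					       unsigned long max_boost)
{
	unsigned long next = boost + EXAMPLE_PAGEBLOCK_NR_PAGES;

	return next < max_boost ? next : max_boost;
}

/* Count how many events it takes to reach the ceiling from no boost. */
static unsigned int example_events_to_saturate(unsigned long max_boost)
{
	unsigned long boost = 0;
	unsigned int events = 0;

	/* The patch floors max_boost at one pageblock; require the same here. */
	assert(max_boost >= EXAMPLE_PAGEBLOCK_NR_PAGES);

	while (boost < max_boost) {
		boost = example_boost_after_event(boost, max_boost);
		events++;
	}
	return events;
}
```

For a ceiling of 15,000 pages, the first event boosts the zone by 512 pages and the boost saturates after 30 events under these assumptions. This matches the documentation's statement that the level of reclaim depends on the number of recent fragmentation events.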
