Commit d1c3fb1

Nishanth Aravamudan authored and Linus Torvalds committed
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition

        nr_overcommit_hugepages > 0

indicates the same administrative setting as

        hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.

A few caveats:

1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could
then try to grow the pool, fail to convert a surplus huge page to a
normal huge page, and instead allocate a fresh huge page. I believe this
is benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
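For illustration only (not part of the commit): with this patch applied,
nr_hugepages sets the persistent low watermark and, per the code below,
nr_overcommit_hugepages bounds how many additional surplus pages may be
allocated on demand. A minimal userspace sketch of setting both, using
arbitrary example values:

#include <stdio.h>
#include <stdlib.h>

static void write_sysctl(const char *path, unsigned long val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(EXIT_FAILURE);
        }
        fprintf(f, "%lu\n", val);
        fclose(f);
}

int main(void)
{
        /* Low watermark: keep 64 huge pages in the static pool. */
        write_sysctl("/proc/sys/vm/nr_hugepages", 64);

        /* High watermark: allow up to 32 surplus pages beyond that. */
        write_sysctl("/proc/sys/vm/nr_overcommit_hugepages", 32);

        return 0;
}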
1 parent 7a3f595

File tree

3 files changed: +70 -6 lines changed

include/linux/hugetlb.h

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@ void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
 extern unsigned long max_huge_pages;
 extern unsigned long hugepages_treat_as_movable;
 extern int hugetlb_dynamic_pool;
+extern unsigned long nr_overcommit_huge_pages;
 extern const unsigned long hugetlb_zero, hugetlb_infinity;
 extern int sysctl_hugetlb_shm_group;

kernel/sysctl.c

Lines changed: 8 additions & 0 deletions
@@ -912,6 +912,14 @@ static struct ctl_table vm_table[] = {
                .mode           = 0644,
                .proc_handler   = &proc_dointvec,
        },
+       {
+               .ctl_name       = CTL_UNNUMBERED,
+               .procname       = "nr_overcommit_hugepages",
+               .data           = &nr_overcommit_huge_pages,
+               .maxlen         = sizeof(nr_overcommit_huge_pages),
+               .mode           = 0644,
+               .proc_handler   = &proc_doulongvec_minmax,
+       },
 #endif
        {
                .ctl_name       = VM_LOWMEM_RESERVE_RATIO,
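One note on the entry above: because .data points at an unsigned long,
the matching handler is proc_doulongvec_minmax (proc_dointvec would read
the wrong width on 64-bit). As an illustration only, not part of this
patch, the same kind of entry could also enforce bounds through
extra1/extra2, which proc_doulongvec_minmax interprets as unsigned long
minimum and maximum; the bounds below are made up:

/*
 * Hypothetical kernel-context sketch of a bounded variant of the entry
 * added above. The limits are illustrative, not from the patch.
 */
static unsigned long hugetlb_overcommit_min;            /* 0 */
static unsigned long hugetlb_overcommit_max = 1024;     /* arbitrary cap */

static struct ctl_table nr_overcommit_entry = {
        .ctl_name       = CTL_UNNUMBERED,       /* no binary sysctl number */
        .procname       = "nr_overcommit_hugepages",
        .data           = &nr_overcommit_huge_pages,
        .maxlen         = sizeof(nr_overcommit_huge_pages),
        .mode           = 0644,
        .proc_handler   = &proc_doulongvec_minmax,
        .extra1         = &hugetlb_overcommit_min,      /* lower bound */
        .extra2         = &hugetlb_overcommit_max,      /* upper bound */
};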

mm/hugetlb.c

Lines changed: 61 additions & 6 deletions
@@ -32,6 +32,7 @@ static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 int hugetlb_dynamic_pool;
+unsigned long nr_overcommit_huge_pages;
 static int hugetlb_next_nid;

 /*
@@ -227,22 +228,62 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
                                                unsigned long address)
 {
        struct page *page;
+       unsigned int nid;

        /* Check if the dynamic pool is enabled */
        if (!hugetlb_dynamic_pool)
                return NULL;

+       /*
+        * Assume we will successfully allocate the surplus page to
+        * prevent racing processes from causing the surplus to exceed
+        * overcommit
+        *
+        * This however introduces a different race, where a process B
+        * tries to grow the static hugepage pool while alloc_pages() is
+        * called by process A. B will only examine the per-node
+        * counters in determining if surplus huge pages can be
+        * converted to normal huge pages in adjust_pool_surplus(). A
+        * won't be able to increment the per-node counter, until the
+        * lock is dropped by B, but B doesn't drop hugetlb_lock until
+        * no more huge pages can be converted from surplus to normal
+        * state (and doesn't try to convert again). Thus, we have a
+        * case where a surplus huge page exists, the pool is grown, and
+        * the surplus huge page still exists after, even though it
+        * should just have been converted to a normal huge page. This
+        * does not leak memory, though, as the hugepage will be freed
+        * once it is out of use. It also does not allow the counters to
+        * go out of whack in adjust_pool_surplus() as we don't modify
+        * the node values until we've gotten the hugepage and only the
+        * per-node value is checked there.
+        */
+       spin_lock(&hugetlb_lock);
+       if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+               spin_unlock(&hugetlb_lock);
+               return NULL;
+       } else {
+               nr_huge_pages++;
+               surplus_huge_pages++;
+       }
+       spin_unlock(&hugetlb_lock);
+
        page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
                                        HUGETLB_PAGE_ORDER);
+
+       spin_lock(&hugetlb_lock);
        if (page) {
+               nid = page_to_nid(page);
                set_compound_page_dtor(page, free_huge_page);
-               spin_lock(&hugetlb_lock);
-               nr_huge_pages++;
-               nr_huge_pages_node[page_to_nid(page)]++;
-               surplus_huge_pages++;
-               surplus_huge_pages_node[page_to_nid(page)]++;
-               spin_unlock(&hugetlb_lock);
+               /*
+                * We incremented the global counters already
+                */
+               nr_huge_pages_node[nid]++;
+               surplus_huge_pages_node[nid]++;
+       } else {
+               nr_huge_pages--;
+               surplus_huge_pages--;
        }
+       spin_unlock(&hugetlb_lock);

        return page;
 }
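The rewritten alloc_buddy_huge_page() above follows a charge-then-allocate
pattern: optimistically charge the global counters under the lock, drop the
lock across the (potentially sleeping) allocation, then roll the charge
back if the allocation fails. A stand-alone userspace sketch of the same
technique, with a pthread mutex standing in for hugetlb_lock and malloc()
for alloc_pages() (all names hypothetical):

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long surplus_count;             /* like surplus_huge_pages */
static unsigned long overcommit_limit = 32;     /* like the new sysctl */

void *alloc_surplus(size_t size)
{
        void *p;

        /* Charge the counter first so racing callers see the limit. */
        pthread_mutex_lock(&pool_lock);
        if (surplus_count >= overcommit_limit) {
                pthread_mutex_unlock(&pool_lock);
                return NULL;                    /* high watermark reached */
        }
        surplus_count++;                        /* optimistic charge */
        pthread_mutex_unlock(&pool_lock);

        /* Allocate without holding the lock, as the kernel code does. */
        p = malloc(size);

        if (!p) {
                /* Allocation failed: roll the charge back. */
                pthread_mutex_lock(&pool_lock);
                surplus_count--;
                pthread_mutex_unlock(&pool_lock);
        }
        return p;
}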
@@ -481,6 +522,12 @@ static unsigned long set_max_huge_pages(unsigned long count)
         * Increase the pool size
         * First take pages out of surplus state.  Then make up the
         * remaining difference by allocating fresh huge pages.
+        *
+        * We might race with alloc_buddy_huge_page() here and be unable
+        * to convert a surplus huge page to a normal huge page. That is
+        * not critical, though, it just means the overall size of the
+        * pool might be one hugepage larger than it needs to be, but
+        * within all the constraints specified by the sysctls.
         */
        spin_lock(&hugetlb_lock);
        while (surplus_huge_pages && count > persistent_huge_pages) {
@@ -509,6 +556,14 @@ static unsigned long set_max_huge_pages(unsigned long count)
         * to keep enough around to satisfy reservations).  Then place
         * pages into surplus state as needed so the pool will shrink
         * to the desired size as pages become free.
+        *
+        * By placing pages into the surplus state independent of the
+        * overcommit value, we are allowing the surplus pool size to
+        * exceed overcommit. There are few sane options here. Since
+        * alloc_buddy_huge_page() is checking the global counter,
+        * though, we'll note that we're not allowed to exceed surplus
+        * and won't grow the pool anywhere else. Not until one of the
+        * sysctls are changed, or the surplus pages go out of use.
         */
        min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
        min_count = max(count, min_count);
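To make the shrink path concrete: min_count is the floor below which the
pool cannot shrink immediately, since in-use and reserved pages cannot be
released. A stand-alone sketch mirroring the computation above, with
made-up pool numbers (not taken from the patch):

#include <stdio.h>

int main(void)
{
        /* Hypothetical pool state, chosen only for illustration. */
        unsigned long nr_huge_pages = 10;       /* total pages in pool   */
        unsigned long free_huge_pages = 4;      /* currently unused      */
        unsigned long resv_huge_pages = 2;      /* reserved for mappings */
        unsigned long count = 5;                /* requested pool size   */

        /* Mirrors the computation in set_max_huge_pages(). */
        unsigned long min_count = resv_huge_pages + nr_huge_pages -
                                  free_huge_pages;
        if (min_count < count)
                min_count = count;

        /*
         * 6 in-use + 2 reserved pages pin the floor at 8, not 5; the 3
         * pages above the target become surplus and are freed as they
         * fall idle.
         */
        printf("immediate floor: %lu, surplus: %lu\n",
               min_count, min_count - count);
        return 0;
}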
