Introducing maple trees
Seen from outside, the internals of the Linux kernel appear to be stable, especially in subsystems like the memory-management subsystem. However, from time to time, developers need to replace an internal interface to solve a longstanding problem. One such issue is contention on the lock used to protect essential memory-management structures, including the page tables and virtual memory areas (VMAs). Liam Howlett and Matthew Wilcox have been developing a new data structure, called a "maple tree", to replace the data structures currently used for VMAs. This potentially big change in internal kernel structures has been recently posted for a review in a massive patch set.
Linux is a virtual-memory system. The address space for each process contains multiple virtual memory areas represented by vm_area_struct structures. Each VMA represents a contiguous block of address space and represents a range of memory of the same type, which may be anonymous memory (not backed by a file), a memory-mapped file, or even device memory. A virtual memory area, as seen from the process, is contiguous, while the supporting physical memory may not be. In addition, the address space contains holes between the VMAs; the kernel fills those empty blocks (leaving space for unmapped "guard" pages to catch buffer overflows) when it needs to add a new mapped area, for example when loading a library or in a response to an mmap() call.
Almost anything one can do in the system involves memory, so the operations on the structures representing the VMAs must be fast. These operations include lookups (finding out which VMA corresponds to a given virtual address, finding out if the memory is mapped, or locating an empty gap for a new memory area), and modifications (growing a stack, for example).
VMAs are currently stored in a modified red-black tree (rbtree) with an additional, doubly-linked list that allows the kernel to traverse all of the VMAs in a process's address space. Kernel developers have been unhappy with this data structure for some time, for a number of reasons: rbtrees do not support ranges well, they are difficult to handle in a lockless manner (the balancing operation of the rbtree affects multiple items at the same time), and rbtree traversal is inefficient, which is why the additional linked list exists.
Operations on VMAs are protected by a lock (specifically a reader/writer semaphore), which is contained in struct mm_struct. This lock was called mmap_sem until it was renamed to mmap_lock for the 5.8 release in June 2020. This renaming was a part of an effort to wrap the lock in an API, hoping to ease its replacement in the future.
Users, especially those with threaded applications in large systems, often experience contention on this lock. The problem has been discussed by kernel developers numerous times, with at least three sessions discussing it at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LFSMM). The core of the problem is that the lock is required for many operations, including almost anything involving page tables and VMAs. There are other related structures that are effectively protected by mmap_lock (with the additional difficulty of lack of documentation of this fact). In addition to splitting the unrelated structures out from under mmap_lock, the developers were considering either using a structure that would allow lockless access to the VMAs or using some type of range lock. The maple tree was proposed as one of the solutions at that time, but at that time it was in an early state of development and the code was not available yet.
Introducing maple trees
Maple trees (the name might refer to the shape of the maple leaf, leading in multiple directions) differ in important aspects from rbtrees. They belong to the B-tree family, so their nodes can contain more than two elements — up to 16 elements in leaf nodes, or ten elements in internal nodes. The use of B-trees also means that there is less need to create new nodes, as nodes may include empty slots that can be filled over time without additional allocations. Each node requires at most 256 bytes, which is a multiple of popular cache line sizes. The increased number of elements in a node and the cache-aligned size means fewer cache misses when traversing the tree.
The improved support for searching in maple trees also comes from the B-tree structure. In a B-tree, each node holds keys, called "pivots", that separate the node into subtrees. A subtree before a given key contains only values lower or equal to the key, while subtree after the key contains only values higher than the key (and lower than the next key).
Of course, maple trees were designed to work in a lockless manner, using read-copy-update (RCU). The maple tree is a generic structure that can be used in different kernel subsystems; the first usage is replacing the rbtrees and linked lists to manage VMAs. One of the authors, Liam Howlett, explained the reasons for the design in a blog post.
Maple trees offer two APIs: the simple one and the advanced one. The simple API uses the mtree_ prefix for its functions, with the main structure, struct maple_tree, defined as follows:
struct maple_tree { spinlock_t ma_lock; unsigned int ma_flags; void __rcu *ma_root; };
The static initializers are DEFINE_MTREE(name) and MTREE_INIT(name, flags), with the latter taking a bitmask of flags from the two defined currently. MAPLE_ALLOC_RANGE indicates that the tree will be used to allocate ranges and that it should manage gaps between the allocations; MAPLE_USE_RCU activates the RCU mode for the case of multiple readers. Dynamic initialization is possible with the same flags using mtree_init():
void mtree_init(struct maple_tree *mt, unsigned int ma_flags);
Developers can free the whole tree with:
void mtree_destroy(struct maple_tree *mt);
Three functions exist to add entries to the tree; mtree_insert(), mtree_insert_range(), and mtree_store_range(). The first two functions only add an entry if it does not previously exist, while the third function can overwrite an existing entry. They are defined as follows:
int mtree_insert(struct maple_tree *mt, unsigned long index, void *entry, gfp_t gfp); int mtree_insert_range(struct maple_tree *mt, unsigned long first, unsigned long last, void *entry, gfp_t gfp); int mtree_store_range(struct maple_tree *mt, unsigned long first, unsigned long last, void *entry, gfp_t gfp);
mtree_insert() takes a pointer to the tree mt, the integer key of the entry index, the pointer to the entry entry, and the memory allocation flags gfp for new tree nodes (if needed). mtree_insert_range() inserts a range from first to last with the given data entry. These functions return zero on success, and a negative value otherwise, including -EEXIST if the key already exists. Finally, mtree_store_range() takes the same arguments as mtree_insert_range(); the difference is that it replaces any existing entries for the corresponding keys.
Two functions exist to get an entry from the tree or remove an entry:
void *mtree_load(struct maple_tree *mt, unsigned long index); void *mtree_erase(struct maple_tree *mt, unsigned long index);
To read an entry, the developer should use mtree_load(), which takes a pointer to the tree mt and the key of the requested value index. The function returns a pointer to the entry, or NULL if the key was not found in the tree. The same syntax is used to remove an entry from the tree using mtree_erase(). It removes the given key from the tree and returns the associated entry, returning NULL if no such value was found.
There is more to the simple API than the above, including mtree_alloc_range() to allocate a range from the key space. The advanced API (which uses the mas_ prefix) adds iterator functions for the traversal of the whole tree, or to access next and previous elements using a state variable. With this fine-grained control, developers can resume a search as needed. The API also allows developers to find empty areas or duplicate the tree.
Replacing the VMA rbtree (and more)
The patch set contains more than just the introduction of maple trees. It is worth noting that a big part of the set adds and updates tests, a welcome addition given the impact of the changes and the future importance of the new data structure.
In the 70 patches in this set, maple trees are used to replace rbtrees in all operations on VMAs, and one of the patches removes the usage of rbtrees from VMAs. Another part of the patch set leads to the removal of the VMA linked list. This removal requires modifications in various places in the kernel code that were using the VMA list directly: the architecture code, core dump and program loading routines, some device drivers, and of course the memory-management code. The patch set also removes the VMA cache (which tracks the VMAs that are most recently accessed by each process to speed lookup), as the implementation with maple trees is faster and the cache is no longer needed.
The cover letter includes some performance results, which are somewhat hard to interpret. Some microbenchmarks show a performance increase, and some (a smaller number) a decrease. Kernel compilation time is similar to that with the vanilla 5.10 kernel, with a few more instructions executed (probably linked to the added code). Howlett requested insights from developers into the results.
Current status and next steps
Maple trees are currently at the RFC stage, with limitations; the implementation does not support 32-bit systems or non-MMU platforms, for example. However, the code is functional and kernel developers may look into it to decide if it is the direction they want to go on the quest of removing mmap_lock (as the lock was not removed in this patch set). Based on the size of the patch set, it may take time until the review is finished.
Index entries for this article | |
---|---|
Kernel | Maple trees |
GuestArticles | Rybczynska, Marta |
Posted Feb 12, 2021 20:41 UTC (Fri)
by zev (subscriber, #88455)
[Link] (21 responses)
Posted Feb 12, 2021 22:20 UTC (Fri)
by Funcan (subscriber, #44209)
[Link] (1 responses)
Posted Feb 15, 2021 9:08 UTC (Mon)
by metan (subscriber, #74107)
[Link]
Posted Feb 12, 2021 22:35 UTC (Fri)
by dave_malcolm (subscriber, #15013)
[Link]
Posted Feb 13, 2021 4:56 UTC (Sat)
by dw (guest, #12017)
[Link] (14 responses)
Posted Feb 13, 2021 14:11 UTC (Sat)
by ibukanov (guest, #3942)
[Link] (13 responses)
My ideal language will be one where within package/module all names should be unique including fields and methods and outside that all names should use a package prefix. Sadly modern language trends go to the opposite direction trying to minimize the number of keystrokes programmers use at the price of increased maintainability cost.
Posted Feb 13, 2021 14:48 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
Posted Feb 13, 2021 16:57 UTC (Sat)
by floppus (guest, #137245)
[Link] (6 responses)
Posted Feb 13, 2021 23:08 UTC (Sat)
by mathstuf (subscriber, #69389)
[Link] (5 responses)
Posted Feb 14, 2021 18:58 UTC (Sun)
by ibukanov (guest, #3942)
[Link] (4 responses)
What is problematic is things like x.size() versus, say hypothetical x.vector::size() or even x.std::size() where size() is not a part of any interface. I have no troubles with a few classes named Settings as long as all usages is like namespace::Settings or at least the source file containing "using namespace::Setting" without opening the whole namespace. And that can be enforced with a linter. Sadly C++ method/field names are heavily overloaded and there is no good way to namespace them except by using explicit prefixes.
Posted Feb 14, 2021 22:26 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Feb 15, 2021 8:15 UTC (Mon)
by ibukanov (guest, #3942)
[Link] (1 responses)
Posted Feb 15, 2021 14:31 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link]
Posted Oct 3, 2022 22:51 UTC (Mon)
by bartoc (guest, #124262)
[Link]
Posted Feb 16, 2021 22:03 UTC (Tue)
by JohnVonNeumann (guest, #131609)
[Link]
Posted Feb 18, 2021 22:17 UTC (Thu)
by Jandar (subscriber, #85683)
[Link] (3 responses)
Function name overloading is essential for writing generic template functions. In a template function I can call DoSomething(var) where the type of var depends on the template parameters. How should I do this without function name overloading? If one considers operator overloading as a kind of function name overloading, template functions simply can't work without it.
Posted Feb 19, 2021 13:38 UTC (Fri)
by farnz (subscriber, #17727)
[Link]
There are more principled ways to support overloading than C++ uses; for example, Haskell has typeclasses, and Rust has traits, which provide the same functionality, but restrict the name lookup problem.
Posted Feb 25, 2021 22:05 UTC (Thu)
by scientes (guest, #83068)
[Link]
Posted Oct 3, 2022 22:55 UTC (Mon)
by bartoc (guest, #124262)
[Link]
Just getting rid of this helps, and some languages also let you "open" or "close" symbols to control if they are looked up in the definition or instantiation context.
The other way people do this is to just group potential overloads under some named object (like rust traits).
Posted Feb 13, 2021 23:25 UTC (Sat)
by Paf (subscriber, #91811)
[Link] (1 responses)
This is very much considered good practice in at least large C code bases in my experience.
Posted Feb 14, 2021 1:14 UTC (Sun)
by mathstuf (subscriber, #69389)
[Link]
Posted Feb 22, 2021 1:51 UTC (Mon)
by ras (subscriber, #33059)
[Link]
grep.
Posted Feb 13, 2021 7:48 UTC (Sat)
by Otus (subscriber, #67685)
[Link] (3 responses)
The cover letter has these the other way around:
> has a branching factor of 16 for leaf/non-alloc nodes and 10 for internal alloc nodes
Posted Feb 13, 2021 19:34 UTC (Sat)
by jake (editor, #205)
[Link] (2 responses)
indeed it does, fixed now, thanks.
jake
Posted Feb 15, 2021 9:49 UTC (Mon)
by weberm (guest, #131630)
[Link] (1 responses)
"Leaves store up to 31 entries, which means the next/prev are almost always in the same node.
Non-leaves have a branching factor of 10 when tracking gaps and 16 when not tracking gaps, while leaves have a branching factor of 31."
Given the VMA use case would track gaps, that would mean 10 at non-leaves and .. 31 at leaves?
Posted Feb 15, 2021 12:43 UTC (Mon)
by willy (subscriber, #9762)
[Link]
Posted Feb 13, 2021 17:10 UTC (Sat)
by dougg (guest, #1894)
[Link] (1 responses)
Posted Feb 14, 2021 3:17 UTC (Sun)
by willy (subscriber, #9762)
[Link]
XArray was primarily a one-man effort while the Maple Tree is a collaboration (with Liam doing the majority of the work).
Posted Feb 22, 2021 4:18 UTC (Mon)
by ssmith32 (subscriber, #72404)
[Link] (3 responses)
Posted Feb 22, 2021 8:15 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (2 responses)
Cheers,
Posted Feb 25, 2021 16:58 UTC (Thu)
by jch (guest, #51929)
[Link] (1 responses)
I too would like to know whether "Maple trees" improve upon B-trees, and in what way.
Posted Mar 5, 2021 7:51 UTC (Fri)
by ssmith32 (subscriber, #72404)
[Link]
On the upside, thus article prodded me to pull my CLRS off the floor, where it had sat after my last interview forced me to revive my dynamic programming skills (as limited as they ever were)
Introducing maple trees
I've been under the impression that this style of struct-member naming (with a common prefix specific to the containing struct type) was an artifact of coding around the limitations of prehistoric C compilers that didn't put struct members in a separate namespace. Is there a particular reason to keep carrying the extra verbosity around in newly written code like this?
struct maple_tree {
spinlock_t ma_lock;
unsigned int ma_flags;
void __rcu *ma_root;
};
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
Introducing maple trees
The Maple Tree trades off a more complex implementation for a denser and more featureful representation than the XArray.
The APIs are deliberately similar, and I hope to merge the two eventually.
Introducing maple trees
Introducing maple trees
Wol
Introducing maple trees
Introducing maple trees