
Commit 99792e0

Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Lots of changes in this cycle:

   - Lots of CPA (change page attribute) optimizations and related
     cleanups (Thomas Gleixner, Peter Zijlstra)

   - Make lazy TLB mode even lazier (Rik van Riel)

   - Fault handler cleanups and improvements (Dave Hansen)

   - kdump, vmcore: Enable kdumping encrypted memory with AMD SME
     enabled (Lianbo Jiang)

   - Clean up VM layout documentation (Baoquan He, Ingo Molnar)

   - ... plus misc other fixes and enhancements"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
  x86/mm: Kill stray kernel fault handling comment
  x86/mm: Do not warn about PCI BIOS W+X mappings
  resource: Clean it up a bit
  resource: Fix find_next_iomem_res() iteration issue
  resource: Include resource end in walk_*() interfaces
  x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
  x86/mm: Remove spurious fault pkey check
  x86/mm/vsyscall: Consider vsyscall page part of user address space
  x86/mm: Add vsyscall address helper
  x86/mm: Fix exception table comments
  x86/mm: Add clarifying comments for user addr space
  x86/mm: Break out user address space handling
  x86/mm: Break out kernel address space handling
  x86/mm: Clarify hardware vs. software "error_code"
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Add freed_tables element to flush_tlb_info
  x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  smp,cpumask: introduce on_each_cpu_cond_mask
  smp: use __cpumask_set_cpu in on_each_cpu_cond
  ...
2 parents 382d72a + 977e4be commit 99792e0

File tree: 28 files changed (+1117, -619 lines)


Documentation/x86/x86_64/mm.txt

Lines changed: 120 additions & 51 deletions
@@ -1,55 +1,124 @@
+====================================================
+Complete virtual memory map with 4-level page tables
+====================================================
 
-Virtual memory map with 4 level page tables:
-
-0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
-hole caused by [47:63] sign extension
-ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
-ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
-ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
-ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
-ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
-ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
-... unused hole ...
-ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
-... unused hole ...
-				    vaddr_end for KASLR
-fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
-fffffe8000000000 - fffffeffffffffff (=39 bits) LDT remap for PTI
-ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
-... unused hole ...
-ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
-... unused hole ...
-ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0
-ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
-[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
-ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
-ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
-
-Virtual memory map with 5 level page tables:
-
-0000000000000000 - 00ffffffffffffff (=56 bits) user space, different per mm
-hole caused by [56:63] sign extension
-ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
-ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
-ff90000000000000 - ff9fffffffffffff (=52 bits) LDT remap for PTI
-ffa0000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space (12800 TB)
-ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
-ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
-... unused hole ...
-ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
-... unused hole ...
-				    vaddr_end for KASLR
-fffffe0000000000 - fffffe7fffffffff (=39 bits) cpu_entry_area mapping
-... unused hole ...
-ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
-... unused hole ...
-ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
-... unused hole ...
-ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0
-ffffffffa0000000 - fffffffffeffffff (1520 MB) module mapping space
-[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
-ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
-ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+Notes:
+
+ - Negative addresses such as "-23 TB" are absolute addresses in bytes, counted down
+   from the top of the 64-bit address space. It's easier to understand the layout
+   when seen both in absolute addresses and in distance-from-top notation.
+
+   For example 0xffffe90000000000 == -23 TB, it's 23 TB lower than the top of the
+   64-bit address space (ffffffffffffffff).
+
+   Note that as we get closer to the top of the address space, the notation changes
+   from TB to GB and then MB/KB.
+
+ - "16M TB" might look weird at first sight, but it's an easier to visualize size
+   notation than "16 EB", which few will recognize at first sight as 16 exabytes.
+   It also shows it nicely how incredibly large 64-bit address space is.
+
+========================================================================================================================
+    Start addr    |   Offset   |     End addr     |  Size   | VM area description
+========================================================================================================================
+                  |            |                  |         |
+ 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
+__________________|____________|__________________|_________|___________________________________________________________
+                  |            |                  |         |
+ 0000800000000000 |  +128   TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+                  |            |                  |         |     virtual memory addresses up to the -128 TB
+                  |            |                  |         |     starting offset of kernel mappings.
+__________________|____________|__________________|_________|___________________________________________________________
+                                                            |
+                                                            | Kernel-space virtual memory, shared between all processes:
+____________________________________________________________|___________________________________________________________
+                  |            |                  |         |
+ ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
+ ffff880000000000 | -120    TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
+ ffffc80000000000 |  -56    TB | ffffc8ffffffffff |    1 TB | ... unused hole
+ ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
+ ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
+ ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
+ ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
+ ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | LDT remap for PTI
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
+__________________|____________|__________________|_________|____________________________________________________________
+                                                            |
+                                                            | Identical layout to the 47-bit one from here on:
+____________________________________________________________|____________________________________________________________
+                  |            |                  |         |
+ ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
+ ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
+ ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
+ ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048    MB |                  |         |
+ ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 |  -16    MB |                  |         |
+    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
+ ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
+__________________|____________|__________________|_________|___________________________________________________________
+
+
+====================================================
+Complete virtual memory map with 5-level page tables
+====================================================
+
+Notes:
+
+ - With 56-bit addresses, user-space memory gets expanded by a factor of 512x,
+   from 0.125 PB to 64 PB. All kernel mappings shift down to the -64 PT starting
+   offset and many of the regions expand to support the much larger physical
+   memory supported.
+
+========================================================================================================================
+    Start addr    |   Offset   |     End addr     |  Size   | VM area description
+========================================================================================================================
+                  |            |                  |         |
+ 0000000000000000 |    0       | 00ffffffffffffff |   64 PB | user-space virtual memory, different per mm
+__________________|____________|__________________|_________|___________________________________________________________
+                  |            |                  |         |
+ 0000800000000000 |  +64    PB | ffff7fffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
+                  |            |                  |         |     virtual memory addresses up to the -128 TB
+                  |            |                  |         |     starting offset of kernel mappings.
+__________________|____________|__________________|_________|___________________________________________________________
+                                                            |
+                                                            | Kernel-space virtual memory, shared between all processes:
+____________________________________________________________|___________________________________________________________
+                  |            |                  |         |
+ ff00000000000000 |  -64    PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
+ ff10000000000000 |  -60    PB | ff8fffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
+ ff90000000000000 |  -28    PB | ff9fffffffffffff |    4 PB | LDT remap for PTI
+ ffa0000000000000 |  -24    PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
+ ffd2000000000000 |  -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
+ ffd4000000000000 |  -11    PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
+ ffd6000000000000 |  -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
+ ffdf000000000000 |   -8.25 PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
+ fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
+                  |            |                  |         | vaddr_end for KASLR
+ fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
+ fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
+ ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
+__________________|____________|__________________|_________|____________________________________________________________
+                                                            |
+                                                            | Identical layout to the 47-bit one from here on:
+____________________________________________________________|____________________________________________________________
+                  |            |                  |         |
+ ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
+ ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
+ ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
+ ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
+ ffffffff80000000 |-2048    MB |                  |         |
+ ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
+ ffffffffff000000 |  -16    MB |                  |         |
+    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
+ ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
+ ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
+__________________|____________|__________________|_________|___________________________________________________________
 
 Architecture defines a 64-bit virtual address. Implementations can support
 less. Currently supported are 48- and 57-bit virtual addresses. Bits 63

arch/x86/Kconfig

Lines changed: 8 additions & 0 deletions
@@ -1487,6 +1487,14 @@ config X86_DIRECT_GBPAGES
 	  supports them), so don't confuse the user by printing
 	  that we have them enabled.
 
+config X86_CPA_STATISTICS
+	bool "Enable statistic for Change Page Attribute"
+	depends on DEBUG_FS
+	---help---
+	  Expose statistics about the Change Page Attribute mechanims, which
+	  helps to determine the effectivness of preserving large and huge
+	  page mappings when mapping protections are changed.
+
 config ARCH_HAS_MEM_ENCRYPT
 	def_bool y

arch/x86/include/asm/io.h

Lines changed: 2 additions & 1 deletion
@@ -187,11 +187,12 @@ extern void __iomem *ioremap_nocache(resource_size_t offset, unsigned long size)
 #define ioremap_nocache ioremap_nocache
 extern void __iomem *ioremap_uc(resource_size_t offset, unsigned long size);
 #define ioremap_uc ioremap_uc
-
 extern void __iomem *ioremap_cache(resource_size_t offset, unsigned long size);
 #define ioremap_cache ioremap_cache
 extern void __iomem *ioremap_prot(resource_size_t offset, unsigned long size, unsigned long prot_val);
 #define ioremap_prot ioremap_prot
+extern void __iomem *ioremap_encrypted(resource_size_t phys_addr, unsigned long size);
+#define ioremap_encrypted ioremap_encrypted
 
 /**
  * ioremap - map bus memory into CPU space

arch/x86/include/asm/kexec.h

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ struct kimage;
 
 /* Memory to backup during crash kdump */
 #define KEXEC_BACKUP_SRC_START	(0UL)
-#define KEXEC_BACKUP_SRC_END	(640 * 1024UL)		/* 640K */
+#define KEXEC_BACKUP_SRC_END	(640 * 1024UL - 1)	/* 640K */
 
 /*
  * CPU does not save ss and sp on stack if execution is already

arch/x86/include/asm/page_64_types.h

Lines changed: 9 additions & 6 deletions
@@ -59,13 +59,16 @@
 #endif
 
 /*
- * Kernel image size is limited to 1GiB due to the fixmap living in the
- * next 1GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S). Use
- * 512MiB by default, leaving 1.5GiB for modules once the page tables
- * are fully set up. If kernel ASLR is configured, it can extend the
- * kernel page table mapping, reducing the size of the modules area.
+ * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
+ * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
+ *
+ * On KASLR use 1 GiB by default, leaving 1 GiB for modules once the
+ * page tables are fully set up.
+ *
+ * If KASLR is disabled we can shrink it to 0.5 GiB and increase the size
+ * of the modules area to 1.5 GiB.
 */
-#if defined(CONFIG_RANDOMIZE_BASE)
+#ifdef CONFIG_RANDOMIZE_BASE
 #define KERNEL_IMAGE_SIZE	(1024 * 1024 * 1024)
 #else
 #define KERNEL_IMAGE_SIZE	(512 * 1024 * 1024)

arch/x86/include/asm/tlb.h

Lines changed: 14 additions & 7 deletions
@@ -6,16 +6,23 @@
 #define tlb_end_vma(tlb, vma) do { } while (0)
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 
-#define tlb_flush(tlb)							\
-{									\
-	if (!tlb->fullmm && !tlb->need_flush_all)			\
-		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end, 0UL);	\
-	else								\
-		flush_tlb_mm_range(tlb->mm, 0UL, TLB_FLUSH_ALL, 0UL);	\
-}
+static inline void tlb_flush(struct mmu_gather *tlb);
 
 #include <asm-generic/tlb.h>
 
+static inline void tlb_flush(struct mmu_gather *tlb)
+{
+	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
+	unsigned int stride_shift = tlb_get_unmap_shift(tlb);
+
+	if (!tlb->fullmm && !tlb->need_flush_all) {
+		start = tlb->start;
+		end = tlb->end;
+	}
+
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+}
+
 /*
  * While x86 architecture in general requires an IPI to perform TLB
  * shootdown, enablement code for several hypervisors overrides

arch/x86/include/asm/tlbflush.h

Lines changed: 12 additions & 21 deletions
@@ -148,22 +148,6 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 #define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
 #endif
 
-static inline bool tlb_defer_switch_to_init_mm(void)
-{
-	/*
-	 * If we have PCID, then switching to init_mm is reasonably
-	 * fast. If we don't have PCID, then switching to init_mm is
-	 * quite slow, so we try to defer it in the hopes that we can
-	 * avoid it entirely. The latter approach runs the risk of
-	 * receiving otherwise unnecessary IPIs.
-	 *
-	 * This choice is just a heuristic. The tlb code can handle this
-	 * function returning true or false regardless of whether we have
-	 * PCID.
-	 */
-	return !static_cpu_has(X86_FEATURE_PCID);
-}
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -547,23 +531,30 @@ struct flush_tlb_info {
 	unsigned long		start;
 	unsigned long		end;
 	u64			new_tlb_gen;
+	unsigned int		stride_shift;
+	bool			freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
 
-#define flush_tlb_mm(mm)	flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL)
+#define flush_tlb_mm(mm)						\
+		flush_tlb_mm_range(mm, 0UL, TLB_FLUSH_ALL, 0UL, true)
 
-#define flush_tlb_range(vma, start, end)	\
-		flush_tlb_mm_range(vma->vm_mm, start, end, vma->vm_flags)
+#define flush_tlb_range(vma, start, end)				\
+	flush_tlb_mm_range((vma)->vm_mm, start, end,			\
+			   ((vma)->vm_flags & VM_HUGETLB)		\
+				? huge_page_shift(hstate_vma(vma))	\
+				: PAGE_SHIFT, false)
 
 extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned long vmflag);
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
-	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, VM_NONE);
+	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
 void native_flush_tlb_others(const struct cpumask *cpumask,
