
Commit fd7e9a8

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
 "4.11 is going to be a relatively large release for KVM, with a little
  over 200 commits and noteworthy changes for most architectures.

  ARM:
   - GICv3 save/restore
   - cache flushing fixes
   - working MSI injection for GICv3 ITS
   - physical timer emulation

  MIPS:
   - various improvements under the hood
   - support for SMP guests
   - a large rewrite of MMU emulation. KVM MIPS can now use MMU
     notifiers to support copy-on-write, KSM, idle page tracking,
     swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
     also supported, so that writes to some memory regions can be
     treated as MMIO. The new MMU also paves the way for hardware
     virtualization support.

  PPC:
   - support for POWER9 using the radix-tree MMU for host and guest
   - resizable hashed page table
   - bugfixes.

  s390:
   - expose more features to the guest
   - more SIMD extensions
   - instruction execution protection
   - ESOP2

  x86:
   - improved hashing in the MMU
   - faster PageLRU tracking for Intel CPUs without EPT A/D bits
   - some refactoring of nested VMX entry/exit code, preparing for live
     migration support of nested hypervisors
   - expose yet another AVX512 CPUID bit
   - host-to-guest PTP support
   - refactoring of interrupt injection, with some optimizations thrown
     in and some duct tape removed.
   - remove lazy FPU handling
   - optimizations of user-mode exits
   - optimizations of vcpu_is_preempted() for KVM guests

  generic:
   - alternative signaling mechanism that doesn't pound on
     tsk->sighand->siglock"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
  x86/paravirt: Change vcp_is_preempted() arg type to long
  KVM: VMX: use correct vmcs_read/write for guest segment selector/base
  x86/kvm/vmx: Defer TR reload after VM exit
  x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
  x86/kvm/vmx: Simplify segment_base()
  x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
  x86/kvm/vmx: Don't fetch the TSS base from the GDT
  x86/asm: Define the kernel TSS limit in a macro
  kvm: fix page struct leak in handle_vmon
  KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
  KVM: Return an error code only as a constant in kvm_get_dirty_log()
  KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
  KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
  KVM: x86: remove code for lazy FPU handling
  KVM: race-free exit from KVM_RUN without POSIX signals
  KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
  KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
  KVM: Support vCPU-based gfn->hva cache
  KVM: use separate generations for each address space
  ...
2 parents 5066e4a + dd0fd8b commit fd7e9a8


110 files changed: 7277 additions & 2968 deletions

Documentation/virtual/kvm/api.txt

Lines changed: 128 additions & 10 deletions
@@ -2061,6 +2061,8 @@ registers, find a list below:
   MIPS | KVM_REG_MIPS_LO | 64
   MIPS | KVM_REG_MIPS_PC | 64
   MIPS | KVM_REG_MIPS_CP0_INDEX | 32
+  MIPS | KVM_REG_MIPS_CP0_ENTRYLO0 | 64
+  MIPS | KVM_REG_MIPS_CP0_ENTRYLO1 | 64
   MIPS | KVM_REG_MIPS_CP0_CONTEXT | 64
   MIPS | KVM_REG_MIPS_CP0_USERLOCAL | 64
   MIPS | KVM_REG_MIPS_CP0_PAGEMASK | 32
@@ -2071,9 +2073,11 @@ registers, find a list below:
   MIPS | KVM_REG_MIPS_CP0_ENTRYHI | 64
   MIPS | KVM_REG_MIPS_CP0_COMPARE | 32
   MIPS | KVM_REG_MIPS_CP0_STATUS | 32
+  MIPS | KVM_REG_MIPS_CP0_INTCTL | 32
   MIPS | KVM_REG_MIPS_CP0_CAUSE | 32
   MIPS | KVM_REG_MIPS_CP0_EPC | 64
   MIPS | KVM_REG_MIPS_CP0_PRID | 32
+  MIPS | KVM_REG_MIPS_CP0_EBASE | 64
   MIPS | KVM_REG_MIPS_CP0_CONFIG | 32
   MIPS | KVM_REG_MIPS_CP0_CONFIG1 | 32
   MIPS | KVM_REG_MIPS_CP0_CONFIG2 | 32
@@ -2148,6 +2152,12 @@ patterns depending on whether they're 32-bit or 64-bit registers:
   0x7020 0000 0001 00 <reg:5> <sel:3>   (32-bit)
   0x7030 0000 0001 00 <reg:5> <sel:3>   (64-bit)

+Note: KVM_REG_MIPS_CP0_ENTRYLO0 and KVM_REG_MIPS_CP0_ENTRYLO1 are the MIPS64
+versions of the EntryLo registers regardless of the word size of the host
+hardware, host kernel, guest, and whether XPA is present in the guest, i.e.
+with the RI and XI bits (if they exist) in bits 63 and 62 respectively, and
+the PFNX field starting at bit 30.
+
 MIPS KVM control registers (see above) have the following id bit patterns:
   0x7030 0000 0002 <reg:16>
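To make the id bit patterns above concrete, the sketch below assembles a CP0 register id from the documented layout. It is an illustration only: the macro names are hypothetical (the real definitions live in the MIPS KVM uapi headers), and it relies on the architectural fact that EntryLo0 is CP0 register 2, select 0.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helpers mirroring the documented layouts:
     *   0x7020 0000 0001 00 <reg:5> <sel:3>   (32-bit)
     *   0x7030 0000 0001 00 <reg:5> <sel:3>   (64-bit)
     */
    #define MIPS_CP0_ID_32(reg, sel) (0x7020000000010000ULL | ((uint64_t)(reg) << 3) | (uint64_t)(sel))
    #define MIPS_CP0_ID_64(reg, sel) (0x7030000000010000ULL | ((uint64_t)(reg) << 3) | (uint64_t)(sel))

    int main(void)
    {
        /* EntryLo0 is CP0 register 2, select 0, exposed as a 64-bit register. */
        printf("KVM_REG_MIPS_CP0_ENTRYLO0 id: 0x%016llx\n",
               (unsigned long long)MIPS_CP0_ID_64(2, 0));
        return 0;
    }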

@@ -2443,18 +2453,20 @@ are, it will do nothing and return an EBUSY error.

 The parameter is a pointer to a 32-bit unsigned integer variable
 containing the order (log base 2) of the desired size of the hash
 table, which must be between 18 and 46. On successful return from the
-ioctl, it will have been updated with the order of the hash table that
-was allocated.
+ioctl, the value will not be changed by the kernel.

 If no hash table has been allocated when any vcpu is asked to run
 (with the KVM_RUN ioctl), the host kernel will allocate a
 default-sized hash table (16 MB).

 If this ioctl is called when a hash table has already been allocated,
-the kernel will clear out the existing hash table (zero all HPTEs) and
-return the hash table order in the parameter. (If the guest is using
-the virtualized real-mode area (VRMA) facility, the kernel will
-re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
+with a different order from the existing hash table, the existing hash
+table will be freed and a new one allocated. If this is ioctl is
+called when a hash table has already been allocated of the same order
+as specified, the kernel will clear out the existing hash table (zero
+all HPTEs). In either case, if the guest is using the virtualized
+real-mode area (VRMA) facility, the kernel will re-create the VMRA
+HPTEs on the next KVM_RUN of any vcpu.

 4.77 KVM_S390_INTERRUPT
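To make the order parameter concrete: the order is the base-2 log of the HPT size in bytes, so the 16 MB default corresponds to order 24 (2^24 bytes). A minimal usage sketch follows; it assumes vm_fd is an already-created KVM VM file descriptor and that this part of api.txt is the KVM_PPC_ALLOCATE_HTAB section (the heading itself lies outside this hunk).

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Request a hash table of 2^order bytes; per the updated text above, the
     * kernel no longer writes the allocated order back into the variable. */
    static int allocate_htab(int vm_fd, uint32_t order)
    {
        return ioctl(vm_fd, KVM_PPC_ALLOCATE_HTAB, &order);  /* e.g. order = 24 for 16 MB */
    }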

@@ -3177,7 +3189,7 @@ of IOMMU pages.

 The rest of functionality is identical to KVM_CREATE_SPAPR_TCE.

-4.98 KVM_REINJECT_CONTROL
+4.99 KVM_REINJECT_CONTROL

 Capability: KVM_CAP_REINJECT_CONTROL
 Architectures: x86
@@ -3201,7 +3213,7 @@ struct kvm_reinject_control {
 pit_reinject = 0 (!reinject mode) is recommended, unless running an old
 operating system that uses the PIT for timing (e.g. Linux 2.4.x).

-4.99 KVM_PPC_CONFIGURE_V3_MMU
+4.100 KVM_PPC_CONFIGURE_V3_MMU

 Capability: KVM_CAP_PPC_RADIX_MMU or KVM_CAP_PPC_HASH_MMU_V3
 Architectures: ppc
@@ -3232,7 +3244,7 @@ process table, which is in the guest's space. This field is formatted
 as the second doubleword of the partition table entry, as defined in
 the Power ISA V3.00, Book III section 5.7.6.1.

-4.100 KVM_PPC_GET_RMMU_INFO
+4.101 KVM_PPC_GET_RMMU_INFO

 Capability: KVM_CAP_PPC_RADIX_MMU
 Architectures: ppc
@@ -3266,6 +3278,101 @@ The ap_encodings gives the supported page sizes and their AP field
 encodings, encoded with the AP value in the top 3 bits and the log
 base 2 of the page size in the bottom 6 bits.

+4.102 KVM_PPC_RESIZE_HPT_PREPARE
+
+Capability: KVM_CAP_SPAPR_RESIZE_HPT
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_ppc_resize_hpt (in)
+Returns: 0 on successful completion,
+	 >0 if a new HPT is being prepared, the value is an estimated
+	    number of milliseconds until preparation is complete
+	 -EFAULT if struct kvm_reinject_control cannot be read,
+	 -EINVAL if the supplied shift or flags are invalid
+	 -ENOMEM if unable to allocate the new HPT
+	 -ENOSPC if there was a hash collision when moving existing
+	    HPT entries to the new HPT
+	 -EIO on other error conditions
+
+Used to implement the PAPR extension for runtime resizing of a guest's
+Hashed Page Table (HPT). Specifically this starts, stops or monitors
+the preparation of a new potential HPT for the guest, essentially
+implementing the H_RESIZE_HPT_PREPARE hypercall.
+
+If called with shift > 0 when there is no pending HPT for the guest,
+this begins preparation of a new pending HPT of size 2^(shift) bytes.
+It then returns a positive integer with the estimated number of
+milliseconds until preparation is complete.
+
+If called when there is a pending HPT whose size does not match that
+requested in the parameters, discards the existing pending HPT and
+creates a new one as above.
+
+If called when there is a pending HPT of the size requested, will:
+  * If preparation of the pending HPT is already complete, return 0
+  * If preparation of the pending HPT has failed, return an error
+    code, then discard the pending HPT.
+  * If preparation of the pending HPT is still in progress, return an
+    estimated number of milliseconds until preparation is complete.
+
+If called with shift == 0, discards any currently pending HPT and
+returns 0 (i.e. cancels any in-progress preparation).
+
+flags is reserved for future expansion, currently setting any bits in
+flags will result in an -EINVAL.
+
+Normally this will be called repeatedly with the same parameters until
+it returns <= 0. The first call will initiate preparation, subsequent
+ones will monitor preparation until it completes or fails.
+
+struct kvm_ppc_resize_hpt {
+	__u64 flags;
+	__u32 shift;
+	__u32 pad;
+};
+
+4.103 KVM_PPC_RESIZE_HPT_COMMIT
+
+Capability: KVM_CAP_SPAPR_RESIZE_HPT
+Architectures: powerpc
+Type: vm ioctl
+Parameters: struct kvm_ppc_resize_hpt (in)
+Returns: 0 on successful completion,
+	 -EFAULT if struct kvm_reinject_control cannot be read,
+	 -EINVAL if the supplied shift or flags are invalid
+	 -ENXIO is there is no pending HPT, or the pending HPT doesn't
+	    have the requested size
+	 -EBUSY if the pending HPT is not fully prepared
+	 -ENOSPC if there was a hash collision when moving existing
+	    HPT entries to the new HPT
+	 -EIO on other error conditions
+
+Used to implement the PAPR extension for runtime resizing of a guest's
+Hashed Page Table (HPT). Specifically this requests that the guest be
+transferred to working with the new HPT, essentially implementing the
+H_RESIZE_HPT_COMMIT hypercall.
+
+This should only be called after KVM_PPC_RESIZE_HPT_PREPARE has
+returned 0 with the same parameters. In other cases
+KVM_PPC_RESIZE_HPT_COMMIT will return an error (usually -ENXIO or
+-EBUSY, though others may be possible if the preparation was started,
+but failed).
+
+This will have undefined effects on the guest if it has not already
+placed itself in a quiescent state where no vcpu will make MMU enabled
+memory accesses.
+
+On succsful completion, the pending HPT will become the guest's active
+HPT and the previous HPT will be discarded.
+
+On failure, the guest will still be operating on its previous HPT.
+
+struct kvm_ppc_resize_hpt {
+	__u64 flags;
+	__u32 shift;
+	__u32 pad;
+};
+
 5. The kvm_run structure
 ------------------------
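The two ioctls documented above are meant to be used together: poll PREPARE until it stops returning a positive millisecond estimate, then COMMIT once the guest is quiescent. The sketch below illustrates that flow from userspace; it is not code from this commit, and it assumes the matching uapi definitions (struct kvm_ppc_resize_hpt, KVM_PPC_RESIZE_HPT_PREPARE/COMMIT) are available in <linux/kvm.h> and that vm_fd is an existing KVM VM file descriptor.

    #include <errno.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Resize the guest HPT to 2^shift bytes. */
    static int resize_hpt(int vm_fd, uint32_t shift)
    {
        struct kvm_ppc_resize_hpt rhpt = {
            .flags = 0,          /* reserved, must be zero */
            .shift = shift,
        };
        int ret;

        /* Repeatedly call PREPARE with the same parameters; a positive return
         * value is the estimated number of milliseconds until the pending HPT
         * is ready. */
        do {
            ret = ioctl(vm_fd, KVM_PPC_RESIZE_HPT_PREPARE, &rhpt);
            if (ret > 0)
                usleep((useconds_t)ret * 1000);
        } while (ret > 0);

        if (ret < 0)
            return -errno;       /* preparation failed */

        /* The guest must already be quiescent (no MMU-enabled accesses). */
        ret = ioctl(vm_fd, KVM_PPC_RESIZE_HPT_COMMIT, &rhpt);
        return ret < 0 ? -errno : 0;
    }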

@@ -3282,7 +3389,18 @@ struct kvm_run {
 Request that KVM_RUN return when it becomes possible to inject external
 interrupts into the guest. Useful in conjunction with KVM_INTERRUPT.

-	__u8 padding1[7];
+	__u8 immediate_exit;
+
+This field is polled once when KVM_RUN starts; if non-zero, KVM_RUN
+exits immediately, returning -EINTR. In the common scenario where a
+signal is used to "kick" a VCPU out of KVM_RUN, this field can be used
+to avoid usage of KVM_SET_SIGNAL_MASK, which has worse scalability.
+Rather than blocking the signal outside KVM_RUN, userspace can set up
+a signal handler that sets run->immediate_exit to a non-zero value.
+
+This field is ignored if KVM_CAP_IMMEDIATE_EXIT is not available.
+
+	__u8 padding1[6];

 	/* out */
 	__u32 exit_reason;
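A minimal sketch of the signal-handler pattern the new text describes: the handler only sets run->immediate_exit, so a kick that lands just before or during KVM_RUN still forces an immediate exit. The setup of the vcpu fd and the mmap()ed kvm_run area is not shown, and names such as kvm_run_ptr are placeholders rather than anything defined by this commit.

    #include <signal.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static struct kvm_run *kvm_run_ptr;        /* mmap()ed kvm_run area (setup not shown) */

    static void kick_handler(int sig)
    {
        (void)sig;
        kvm_run_ptr->immediate_exit = 1;       /* polled once when KVM_RUN starts */
    }

    static int run_vcpu(int vcpu_fd)
    {
        signal(SIGUSR1, kick_handler);         /* another thread kicks via pthread_kill() */

        kvm_run_ptr->immediate_exit = 0;       /* re-arm before entering the guest */
        return ioctl(vcpu_fd, KVM_RUN, 0);     /* returns -1 with errno == EINTR if kicked */
    }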

Documentation/virtual/kvm/devices/arm-vgic-v3.txt

Lines changed: 8 additions & 3 deletions
@@ -118,7 +118,7 @@ Groups:
     -EBUSY: One or more VCPUs are running


-  KVM_DEV_ARM_VGIC_CPU_SYSREGS
+  KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS
   Attributes:
     The attr field of kvm_device_attr encodes two values:
     bits: | 63 .... 32 | 31 .... 16 | 15 .... 0 |
@@ -139,13 +139,15 @@ Groups:
   All system regs accessed through this API are (rw, 64-bit) and
   kvm_device_attr.addr points to a __u64 value.

-  KVM_DEV_ARM_VGIC_CPU_SYSREGS accesses the CPU interface registers for the
+  KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the
   CPU specified by the mpidr field.

+  CPU interface registers access is not implemented for AArch32 mode.
+  Error -ENXIO is returned when accessed in AArch32 mode.
   Errors:
     -ENXIO: Getting or setting this register is not yet supported
     -EBUSY: VCPU is running
-    -EINVAL: Invalid mpidr supplied
+    -EINVAL: Invalid mpidr or register value supplied


   KVM_DEV_ARM_VGIC_GRP_NR_IRQS
@@ -204,3 +206,6 @@ Groups:
   architecture defined MPIDR, and the field is encoded as follows:
     | 63 .... 56 | 55 .... 48 | 47 .... 40 | 39 .... 32 |
     |    Aff3    |    Aff2    |    Aff1    |    Aff0    |
+  Errors:
+    -EINVAL: vINTID is not multiple of 32 or
+       info field is not VGIC_LEVEL_INFO_LINE_LEVEL
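For illustration, the mpidr portion of the attr value can be packed as shown below, following the Aff3..Aff0 layout in the table above. The helper name is made up, and how the low 32 bits (the register or line identifier) are filled is outside this excerpt.

    #include <stdint.h>

    /* Pack the affinity fields into bits 63..32 of kvm_device_attr.attr,
     * leaving the low 32 bits for the caller to fill in. */
    static uint64_t vgic_attr_mpidr(uint8_t aff3, uint8_t aff2, uint8_t aff1, uint8_t aff0)
    {
        return ((uint64_t)aff3 << 56) |
               ((uint64_t)aff2 << 48) |
               ((uint64_t)aff1 << 40) |
               ((uint64_t)aff0 << 32);
    }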

Documentation/virtual/kvm/hypercalls.txt

Lines changed: 35 additions & 0 deletions
@@ -81,3 +81,38 @@ the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the
 same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall,
 specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0)
 is used in the hypercall for future use.
+
+
+6. KVM_HC_CLOCK_PAIRING
+------------------------
+Architecture: x86
+Status: active
+Purpose: Hypercall used to synchronize host and guest clocks.
+Usage:
+
+a0: guest physical address where host copies
+"struct kvm_clock_offset" structure.
+
+a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0)
+is supported (corresponding to the host's CLOCK_REALTIME clock).
+
+	struct kvm_clock_pairing {
+		__s64 sec;
+		__s64 nsec;
+		__u64 tsc;
+		__u32 flags;
+		__u32 pad[9];
+	};
+
+Where:
+	* sec: seconds from clock_type clock.
+	* nsec: nanoseconds from clock_type clock.
+	* tsc: guest TSC value used to calculate sec/nsec pair
+	* flags: flags, unused (0) at the moment.
+
+The hypercall lets a guest compute a precise timestamp across
+host and guest. The guest can use the returned TSC value to
+compute the CLOCK_REALTIME for its clock, at the same instant.
+
+Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource,
+or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK.
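The computation hinted at in the last two paragraphs can be sketched as follows. This is an illustration only: it assumes the guest already knows its TSC frequency (tsc_khz) and has issued the hypercall so that the pairing structure holds valid data, and the helper names are not part of the documented ABI.

    #include <stdint.h>

    struct kvm_clock_pairing {
        int64_t  sec;
        int64_t  nsec;
        uint64_t tsc;
        uint32_t flags;
        uint32_t pad[9];
    };

    /* Convert a TSC delta to nanoseconds for a known TSC frequency in kHz
     * (overflow of the intermediate product is ignored in this sketch). */
    static uint64_t tsc_delta_to_ns(uint64_t delta, uint64_t tsc_khz)
    {
        return delta * 1000000ULL / tsc_khz;
    }

    /* Estimate the host's CLOCK_REALTIME "now" in nanoseconds from a pairing
     * sample and the guest's current TSC reading. */
    static uint64_t realtime_now_ns(const struct kvm_clock_pairing *p,
                                    uint64_t guest_tsc_now, uint64_t tsc_khz)
    {
        uint64_t base = (uint64_t)p->sec * 1000000000ULL + (uint64_t)p->nsec;

        return base + tsc_delta_to_ns(guest_tsc_now - p->tsc, tsc_khz);
    }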

Documentation/virtual/kvm/locking.txt

Lines changed: 27 additions & 4 deletions
@@ -26,9 +26,16 @@ sections.
 Fast page fault:

 Fast page fault is the fast path which fixes the guest page fault out of
-the mmu-lock on x86. Currently, the page fault can be fast only if the
-shadow page table is present and it is caused by write-protect, that means
-we just need change the W bit of the spte.
+the mmu-lock on x86. Currently, the page fault can be fast in one of the
+following two cases:
+
+1. Access Tracking: The SPTE is not present, but it is marked for access
+   tracking i.e. the SPTE_SPECIAL_MASK is set. That means we need to
+   restore the saved R/X bits. This is described in more detail later below.
+
+2. Write-Protection: The SPTE is present and the fault is
+   caused by write-protect. That means we just need to change the W bit of the
+   spte.

 What we use to avoid all the race is the SPTE_HOST_WRITEABLE bit and
 SPTE_MMU_WRITEABLE bit on the spte:
@@ -38,7 +45,8 @@ SPTE_MMU_WRITEABLE bit on the spte:
   page write-protection.

 On fast page fault path, we will use cmpxchg to atomically set the spte W
-bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, this
+bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
+restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This
 is safe because whenever changing these bits can be detected by cmpxchg.

 But we need carefully check these cases:
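As a generic illustration of the cmpxchg pattern described above (not the kernel's actual fast-page-fault helper, and with a placeholder bit layout), the lockless update can look like this:

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder bit position; the real spte layout is defined by the KVM MMU. */
    #define SPTE_W_BIT (1ULL << 1)

    /* Try to make the spte writable without holding mmu-lock: the store only
     * happens if the spte still holds the value that was read and checked
     * earlier, so any concurrent change is detected and the caller can retry
     * or fall back to the slow path. */
    static bool fast_set_spte_writable(uint64_t *sptep, uint64_t old_spte)
    {
        return __sync_bool_compare_and_swap(sptep, old_spte, old_spte | SPTE_W_BIT);
    }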
@@ -142,6 +150,21 @@ Since the spte is "volatile" if it can be updated out of mmu-lock, we always
 atomically update the spte, the race caused by fast page fault can be avoided,
 See the comments in spte_has_volatile_bits() and mmu_spte_update().

+Lockless Access Tracking:
+
+This is used for Intel CPUs that are using EPT but do not support the EPT A/D
+bits. In this case, when the KVM MMU notifier is called to track accesses to a
+page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
+by clearing the RWX bits in the PTE and storing the original R & X bits in
+some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
+PTE (using the ignored bit 62). When the VM tries to access the page later on,
+a fault is generated and the fast page fault mechanism described above is used
+to atomically restore the PTE to a Present state. The W bit is not saved when
+the PTE is marked for access tracking and during restoration to the Present
+state, the W bit is set depending on whether or not it was a write access. If
+it wasn't, then the W bit will remain clear until a write access happens, at
+which time it will be set using the Dirty tracking mechanism described above.
+
 3. Reference
 ------------
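A simplified sketch of the marking and restoration steps described in the Lockless Access Tracking paragraph above. The bit positions are placeholders chosen for illustration; the real shadow-PTE layout lives in the KVM MMU code.

    #include <stdint.h>

    #define PTE_R              (1ULL << 0)    /* placeholder permission bits */
    #define PTE_W              (1ULL << 1)
    #define PTE_X              (1ULL << 2)
    #define SPTE_SPECIAL_MASK  (1ULL << 62)   /* ignored bit used as the marker */
    #define SAVED_BITS_SHIFT   52             /* assumed ignored-bit range for saved R/X */

    /* Mark a PTE not-present for access tracking: stash the original R and X
     * bits in ignored bits, clear RWX so the next access faults, set the marker. */
    static uint64_t mark_for_access_tracking(uint64_t pte)
    {
        uint64_t saved_rx = pte & (PTE_R | PTE_X);

        pte &= ~(PTE_R | PTE_W | PTE_X);
        pte |= saved_rx << SAVED_BITS_SHIFT;
        return pte | SPTE_SPECIAL_MASK;
    }

    /* Restore the PTE on a fast page fault; the W bit was not saved, so it is
     * only set when the faulting access was a write. */
    static uint64_t restore_from_access_tracking(uint64_t pte, int write_fault)
    {
        uint64_t saved_rx = (pte >> SAVED_BITS_SHIFT) & (PTE_R | PTE_X);

        pte &= ~(SPTE_SPECIAL_MASK | ((PTE_R | PTE_X) << SAVED_BITS_SHIFT));
        pte |= saved_rx;
        if (write_fault)
            pte |= PTE_W;
        return pte;
    }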

arch/arm/include/asm/kvm_host.h

Lines changed: 0 additions & 3 deletions
@@ -60,9 +60,6 @@ struct kvm_arch {
 	/* The last vcpu id that ran on each physical CPU */
 	int __percpu *last_vcpu_ran;

-	/* Timer */
-	struct arch_timer_kvm timer;
-
 	/*
 	 * Anything that is not used directly from assembly code goes
 	 * here.

arch/arm/include/asm/kvm_mmu.h

Lines changed: 2 additions & 10 deletions
@@ -129,8 +129,7 @@ static inline bool vcpu_has_cache_enabled(struct kvm_vcpu *vcpu)

 static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
                                                kvm_pfn_t pfn,
-                                               unsigned long size,
-                                               bool ipa_uncached)
+                                               unsigned long size)
 {
 	/*
 	 * If we are going to insert an instruction page and the icache is
@@ -150,18 +149,12 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
 	 * and iterate over the range.
 	 */

-	bool need_flush = !vcpu_has_cache_enabled(vcpu) || ipa_uncached;
-
 	VM_BUG_ON(size & ~PAGE_MASK);

-	if (!need_flush && !icache_is_pipt())
-		goto vipt_cache;
-
 	while (size) {
 		void *va = kmap_atomic_pfn(pfn);

-		if (need_flush)
-			kvm_flush_dcache_to_poc(va, PAGE_SIZE);
+		kvm_flush_dcache_to_poc(va, PAGE_SIZE);

 		if (icache_is_pipt())
 			__cpuc_coherent_user_range((unsigned long)va,
@@ -173,7 +166,6 @@ static inline void __coherent_cache_guest_page(struct kvm_vcpu *vcpu,
 		kunmap_atomic(va);
 	}

-vipt_cache:
 	if (!icache_is_pipt() && !icache_is_vivt_asid_tagged()) {
 		/* any kind of VIPT cache */
 		__flush_icache_all();
