
Commit 5201760

Sean Christopherson authored and bonzini committed
KVM: nVMX: add option to perform early consistency checks via H/W
KVM defers many VMX consistency checks to the CPU, ostensibly for performance reasons[1], including checks that result in VMFail (as opposed to VMExit). This behavior may be undesirable for some users since this means KVM detects certain classes of VMFail only after it has processed guest state, e.g. emulated MSR load-on-entry.

Because there is a strict ordering between checks that cause VMFail and those that cause VMExit, i.e. all VMFail checks are performed before any checks that cause VMExit, we can detect (almost) all VMFail conditions via a dry run of sorts. The "almost" qualifier exists because some state in vmcs02 comes from L0, e.g. VPID, which means that hardware will never detect an invalid VPID in vmcs12 because it never sees said value. Software must (continue to) explicitly check such fields.

After preparing vmcs02 with all state needed to pass the VMFail consistency checks, optionally do a "test" VMEnter with an invalid GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest state), then we can safely say that the nested VMEnter should not VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must be due to an L0 bug.

GUEST_RFLAGS is used to induce VMExit as it is unconditionally loaded on all implementations of VMX, has an invalid value that is writable on a 32-bit system, and its consistency check is performed relatively early in all implementations (the exact order of consistency checks is micro-architectural).

Unfortunately, since the "passing" case causes a VMExit, KVM must be extra diligent to ensure that host state is restored, e.g. DR7 and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is particularly fatal.

And of course the extra VMEnter and VMExit impacts performance. The raw overhead of the early consistency checks is ~6% on modern hardware (though this could easily vary based on configuration), while the added latency observed from the L1 VMM is ~10%. The early consistency checks do not occur in a vacuum, e.g. spending more time in L0 can lead to more interrupts being serviced while emulating VMEnter, thereby increasing the latency observed by L1.

Add a module param, early_consistency_checks, to provide control over whether or not VMX performs the early consistency checks. In addition to standard on/off behavior, the param accepts a value of -1, which is essentially an "auto" setting whereby KVM does the early checks only when it thinks it's running on bare metal. When running nested, doing early checks is of dubious value since the resulting behavior is heavily dependent on L0. In the future, the "auto" setting could also be used to default to skipping the early hardware checks for certain configurations/platforms if KVM reaches a state where it has 100% coverage of VMFail conditions.

[1] To my knowledge no one has implemented and tested full software emulation of the VMFail consistency checks. Until that happens, one can only speculate about the actual performance overhead of doing all VMFail consistency checks in software. Obviously any code is slower than no code, but in the grand scheme of nested virtualization it's entirely possible the overhead is negligible.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
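The dry-run trick above hinges on one architectural invariant: RFLAGS bit 1 is reserved to '1' on every VMX implementation, so a GUEST_RFLAGS value of 0 is always invalid guest state and, once the VMFail-class checks have been satisfied, is guaranteed to take the "failed VMEntry" VMExit path rather than actually run L2. A minimal sketch of that invariant follows; the helper name and the standalone userspace framing are illustrative assumptions, not code from this patch.

/* Illustrative sketch only, not part of this commit. */
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_FIXED (1ULL << 1) /* RFLAGS bit 1, architecturally reserved to '1' */

/*
 * A GUEST_RFLAGS value with bit 1 clear (e.g. 0) can never be valid guest
 * state, which is what the "test" VMEnter exploits to force a consistency
 * check VMExit instead of entering the nested guest.
 */
static bool guest_rflags_can_pass_checks(uint64_t rflags)
{
        return rflags & X86_EFLAGS_FIXED;
}

With the patch applied the behavior is opt-in; assuming the usual kvm_intel module name for vmx.c, something along the lines of "modprobe kvm_intel nested=1 nested_early_check=1" would enable it, and the parameter is read-only (S_IRUGO) thereafter.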
1 parent 5a5e8a1 commit 5201760

File tree

1 file changed: +137 −5 lines


arch/x86/kvm/vmx.c

Lines changed: 137 additions & 5 deletions
@@ -110,6 +110,9 @@ module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO);
 static bool __read_mostly nested = 0;
 module_param(nested, bool, S_IRUGO);
 
+static bool __read_mostly nested_early_check = 0;
+module_param(nested_early_check, bool, S_IRUGO);
+
 static u64 __read_mostly host_xss;
 
 static bool __read_mostly enable_pml = 1;
@@ -187,6 +190,7 @@ static unsigned int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 module_param(ple_window_max, uint, 0444);
 
 extern const ulong vmx_return;
+extern const ulong vmx_early_consistency_check_return;
 
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_should_flush);
 static DEFINE_STATIC_KEY_FALSE(vmx_l1d_flush_cond);
@@ -11953,6 +11957,14 @@ static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
                 return;
         vmx->nested.vmcs02_initialized = true;
 
+        /*
+         * We don't care what the EPTP value is we just need to guarantee
+         * it's valid so we don't get a false positive when doing early
+         * consistency checks.
+         */
+        if (enable_ept && nested_early_check)
+                vmcs_write64(EPT_POINTER, construct_eptp(&vmx->vcpu, 0));
+
         /* All VMFUNCs are currently emulated through L0 vmexits. */
         if (cpu_has_vmx_vmfunc())
                 vmcs_write64(VM_FUNCTION_CONTROL, 0);
@@ -12006,7 +12018,9 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
          * entry, but only if the current (host) sp changed from the value
          * we wrote last (vmx->host_rsp). This cache is no longer relevant
          * if we switch vmcs, and rather than hold a separate cache per vmcs,
-         * here we just force the write to happen on entry.
+         * here we just force the write to happen on entry. host_rsp will
+         * also be written unconditionally by nested_vmx_check_vmentry_hw()
+         * if we are doing early consistency checks via hardware.
          */
         vmx->host_rsp = 0;
 
@@ -12634,12 +12648,124 @@ static int check_vmentry_postreqs(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
         return 0;
 }
 
+static int __noclone nested_vmx_check_vmentry_hw(struct kvm_vcpu *vcpu)
+{
+        struct vcpu_vmx *vmx = to_vmx(vcpu);
+        unsigned long cr3, cr4;
+
+        if (!nested_early_check)
+                return 0;
+
+        if (vmx->msr_autoload.host.nr)
+                vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, 0);
+        if (vmx->msr_autoload.guest.nr)
+                vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 0);
+
+        preempt_disable();
+
+        vmx_prepare_switch_to_guest(vcpu);
+
+        /*
+         * Induce a consistency check VMExit by clearing bit 1 in GUEST_RFLAGS,
+         * which is reserved to '1' by hardware. GUEST_RFLAGS is guaranteed to
+         * be written (by prepare_vmcs02()) before the "real" VMEnter, i.e.
+         * there is no need to preserve other bits or save/restore the field.
+         */
+        vmcs_writel(GUEST_RFLAGS, 0);
+
+        vmcs_writel(HOST_RIP, vmx_early_consistency_check_return);
+
+        cr3 = __get_current_cr3_fast();
+        if (unlikely(cr3 != vmx->loaded_vmcs->host_state.cr3)) {
+                vmcs_writel(HOST_CR3, cr3);
+                vmx->loaded_vmcs->host_state.cr3 = cr3;
+        }
+
+        cr4 = cr4_read_shadow();
+        if (unlikely(cr4 != vmx->loaded_vmcs->host_state.cr4)) {
+                vmcs_writel(HOST_CR4, cr4);
+                vmx->loaded_vmcs->host_state.cr4 = cr4;
+        }
+
+        vmx->__launched = vmx->loaded_vmcs->launched;
+
+        asm(
+                /* Set HOST_RSP */
+                __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"
+                "mov %%" _ASM_SP ", %c[host_rsp](%0)\n\t"
+
+                /* Check if vmlaunch or vmresume is needed */
+                "cmpl $0, %c[launched](%0)\n\t"
+                "je 1f\n\t"
+                __ex(ASM_VMX_VMRESUME) "\n\t"
+                "jmp 2f\n\t"
+                "1: " __ex(ASM_VMX_VMLAUNCH) "\n\t"
+                "jmp 2f\n\t"
+                "2: "
+
+                /* Set vmx->fail accordingly */
+                "setbe %c[fail](%0)\n\t"
+
+                ".pushsection .rodata\n\t"
+                ".global vmx_early_consistency_check_return\n\t"
+                "vmx_early_consistency_check_return: " _ASM_PTR " 2b\n\t"
+                ".popsection"
+              :
+              : "c"(vmx), "d"((unsigned long)HOST_RSP),
+                [launched]"i"(offsetof(struct vcpu_vmx, __launched)),
+                [fail]"i"(offsetof(struct vcpu_vmx, fail)),
+                [host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp))
+              : "rax", "cc", "memory"
+        );
+
+        vmcs_writel(HOST_RIP, vmx_return);
+
+        preempt_enable();
+
+        if (vmx->msr_autoload.host.nr)
+                vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
+        if (vmx->msr_autoload.guest.nr)
+                vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
+
+        if (vmx->fail) {
+                WARN_ON_ONCE(vmcs_read32(VM_INSTRUCTION_ERROR) !=
+                             VMXERR_ENTRY_INVALID_CONTROL_FIELD);
+                vmx->fail = 0;
+                return 1;
+        }
+
+        /*
+         * VMExit clears RFLAGS.IF and DR7, even on a consistency check.
+         */
+        local_irq_enable();
+        if (hw_breakpoint_active())
+                set_debugreg(__this_cpu_read(cpu_dr7), 7);
+
+        /*
+         * A non-failing VMEntry means we somehow entered guest mode with
+         * an illegal RIP, and that's just the tip of the iceberg. There
+         * is no telling what memory has been modified or what state has
+         * been exposed to unknown code. Hitting this all but guarantees
+         * a (very critical) hardware issue.
+         */
+        WARN_ON(!(vmcs_read32(VM_EXIT_REASON) &
+                VMX_EXIT_REASONS_FAILED_VMENTRY));
+
+        return 0;
+}
+STACK_FRAME_NON_STANDARD(nested_vmx_check_vmentry_hw);
+
 static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
                                    struct vmcs12 *vmcs12);
 
 /*
  * If from_vmentry is false, this is being called from state restore (either RSM
  * or KVM_SET_NESTED_STATE). Otherwise it's called from vmlaunch/vmresume.
+ *
+ * Returns:
+ *      0 - success, i.e. proceed with actual VMEnter
+ *      1 - consistency check VMExit
+ *      -1 - consistency check VMFail
  */
 static int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
                                           bool from_vmentry)
@@ -12668,6 +12794,11 @@ static int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
         if (from_vmentry) {
                 nested_get_vmcs12_pages(vcpu);
 
+                if (nested_vmx_check_vmentry_hw(vcpu)) {
+                        vmx_switch_vmcs(vcpu, &vmx->vmcs01);
+                        return -1;
+                }
+
                 if (check_vmentry_postreqs(vcpu, vmcs12, &exit_qual))
                         goto vmentry_fail_vmexit;
         }
@@ -12804,13 +12935,14 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
          * We're finally done with prerequisite checking, and can start with
          * the nested entry.
          */
-
         vmx->nested.nested_run_pending = 1;
         ret = nested_vmx_enter_non_root_mode(vcpu, true);
-        if (ret) {
-                vmx->nested.nested_run_pending = 0;
+        vmx->nested.nested_run_pending = !ret;
+        if (ret > 0)
                 return 1;
-        }
+        else if (ret)
+                return nested_vmx_failValid(vcpu,
+                        VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 
         /* Hide L1D cache contents from the nested guest. */
         vmx->vcpu.arch.l1tf_flush_l1d = true;
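Because the new nested_early_check parameter is registered with module_param(..., S_IRUGO), its state can at least be inspected from userspace once the module is loaded. A small hypothetical helper, shown only as a usage sketch (the sysfs path assumes the kvm_intel module; none of this is part of the commit):

#include <stdio.h>

int main(void)
{
        /* Bool module params are exported to sysfs as a single 'Y' or 'N'. */
        const char *path = "/sys/module/kvm_intel/parameters/nested_early_check";
        FILE *f = fopen(path, "r");
        int val;

        if (!f) {
                perror("fopen"); /* module not loaded, or kernel without this patch */
                return 1;
        }
        val = fgetc(f);
        fclose(f);

        printf("nested_early_check: %c\n", val);
        return 0;
}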
