Intel® Scalable I/O Virtualization
Technical Specification
September 2020
By downloading or using this document, Intel Corporation hereby grants you a fully-paid, non-exclusive, non-transferable, worldwide,
limited license (without the right to sublicense), under its copyrights to view, download, and reproduce the Intel Scalable I/O
Virtualization Specification ("Specification").
You are not granted any other rights or licenses, by implication, estoppel, or otherwise, and you may not create any derivative works
of the Specification.
The Specification is provided "as is," and Intel makes no representations or warranties, express or implied, including warranties of
merchantability, fitness for a particular purpose, non-infringement, or title. Intel is not liable for any direct, indirect, special, incidental,
or consequential damages arising out of any use of the Specification, or its performance or implementation.
Intel retains the right to make changes to the Specification at any time.
If you provide feedback or suggestions (“Feedback”) on the Specification, you grant Intel a perpetual, irrevocable, fully-paid,
nonexclusive, worldwide license, with the right to sublicense, under all applicable intellectual property rights to use the Feedback,
without any notice, consent, or accounting. You represent and warrant that you own or have sufficient rights from the owner of the
Feedback, and the intellectual property rights in them, to grant the Feedback license.
This agreement is governed by Delaware law, without reference to choice of law principles. Any disputes relating to this agreement
must be resolved in the federal or state courts in Delaware and you consent to the exclusive personal jurisdiction of the courts in
Delaware.
This agreement is the entire agreement of the parties regarding the Specification and supersedes all prior agreements or
representations.
Table of Contents
Introduction
  Document Organization
  Audience
  Reference Documents
  Revision History
  Terms and Abbreviations
1 Overview
  1.1 Virtualization Background
  1.2 Intel® Scalable I/O Virtualization
    1.2.1 Separation of Direct-path and Intercepted-path Operations
    1.2.2 Assignable Device Interfaces
    1.2.3 Platform Scalability Using PASIDs
    1.2.4 Virtual Device Composition
2 Device Support
  2.1 Organizing Device Resources for ADIs
  2.2 Identifying ADI Upstream Requests
  2.3 ADIs Using Shared Work Queues
  2.4 ADI Memory Mapped Registers
  2.5 ADI Interrupts
    2.5.1 ADI Interrupt Message Storage (IMS)
    2.5.2 ADI Interrupt Isolation
  2.6 ADI Isolation, Access Control, and QoS
  2.7 ADI Reset
  2.8 Capability Enumeration
3 Platform Support
  3.1 Address Space Isolation
  3.2 Interrupt Isolation
  3.3 PASID Translation
4 Reference Software Model
  4.1 Host Driver
  4.2 Virtual Device Composition Module
  4.3 Guest Driver
  4.4 Virtual Device
Table of Figures
Figure 3-1: Scalable Mode DMA Remapping Architecture for Intel® Scalable I/O Virtualization
List of Tables
Table 4-1: Host Driver Interfaces for Intel® Scalable I/O Virtualization
Introduction
Intel® Scalable I/O Virtualization (Intel® Scalable IOV) is a scalable and flexible approach to hardware-
assisted I/O virtualization. Intel Scalable IOV builds on existing PCI Express* capabilities, enabling it to be
easily supported by compliant PCI Express endpoint device designs and the software ecosystem.
This document specifies the Intel Scalable IOV architecture, including host platform and endpoint device
capabilities required to support it, and describes a high-level reference software architecture.
Document Organization
Chapter 1 provides an architectural overview of Intel Scalable IOV and its key components.
Chapter 2 specifies endpoint device blueprint and requirements.
Chapter 3 describes the required host platform Root Complex (RC) support.
Chapter 4 describes the reference software architecture.
Audience
This document is for endpoint device developers implementing scalable hardware support for I/O
virtualization and sharing, for driver developers for such devices, and for Operating System and Virtual
Machine Monitor developers who are enabling hardware-assisted I/O virtualization.
Reference Documents
Intel® Virtualization Technology for Directed I/O Specification, Rev 3.1
Intel® Architecture Instruction Set Extensions Programming Reference
PCI Express* Base Specification, Revision 4.0, Version 1.0
PCI Express* ECN - Deferrable Memory Write (DMWr) and Device 3 Extended Capability
Revision History
Date Revision Description
June 2018 1.0 Technical preview release
September 2020 1.1 Specification update
Terms and Abbreviations
SR-IOV (Single Root I/O Virtualization): SR-IOV as specified by the PCI Express Base Specification, Revision 4.0, Version 1.0.
Intel Scalable IOV (Intel Scalable I/O Virtualization): Software-composable and scalable I/O virtualization as specified by this document.
PF (Physical Function): PCI Express Physical Function as specified by SR-IOV.
VF (Virtual Function): PCI Express Virtual Function as specified by SR-IOV.
ADI (Assignable Device Interface): The unit of assignment for a device.
DWQ (Dedicated Work Queue): A work queue that can be assigned to a single address domain at a time.
SWQ (Shared Work Queue): A work queue that can be assigned to multiple address domains simultaneously.
PASID (Process Address Space Identifier): Process Address Space ID and its TLP prefix as specified by the PCI Express Base Specification.
RID (Requester ID): Bus/Device/Function number identity for a PCI Express function (PF or VF).
IMS (Interrupt Message Storage): Device-specific interrupt message storage for ADIs.
MSI-X (Message Signaled Interrupts Extended): MSI-X capability as defined by the PCI Express Base Specification.
FLR (Function Level Reset): Function Level Reset as defined by the PCI Express Base Specification.
VMM (Virtual Machine Monitor): System software that creates and manages virtual machines. Also known as a hypervisor.
VM (Virtual Machine): An isolated execution environment constructed by a VMM, in which a guest OS runs.
Host OS (Host Operating System): The privileged OS that works with the VMM to virtualize the platform.
SVM (Shared Virtual Memory): A memory model that enables I/O devices to operate in a shared virtual address space with the CPU.
ATS (Address Translation Services): The ability of a device to request and cache address translations. Refer to the PCI Express specification.
DMWr (Deferrable Memory Write): Refer to the PCI Express ECN for Deferrable Memory Write (DMWr).
1 Overview
This chapter provides background on I/O virtualization and introduces the key concepts and components
of Intel Scalable I/O Virtualization.
Containers are another type of isolated environment, used to package, deploy, and run applications in isolation. Containers may be constructed either as bare-metal containers instantiated as OS process groups or as machine containers that utilize the increased isolation properties of hardware support for virtualization. Containers are lighter weight than VMs and can be deployed at much higher density, potentially increasing the number of isolated environments on a system by an order of magnitude. This document primarily refers to isolated domains as VMs, but the principles also apply to other domain abstractions such as containers.
Modern processors provide features to reduce virtualization overhead that may be utilized by VMMs to
allow VMs direct access to hardware resources. Intel® Virtualization Technology (Intel® VT-x) defines the
Intel® processor hardware capabilities to reduce overheads for processor and memory virtualization. Intel®
Virtualization Technology for Directed I/O (Intel® VT-d) defines the platform hardware features for direct
memory access (DMA) and interrupt remapping and isolation that can be utilized to minimize overheads of
I/O virtualization.
I/O virtualization refers to the virtualization and sharing of I/O devices across multiple VMs or container
instances. There are multiple existing approaches for I/O virtualization that may be broadly classified as
either software-based or hardware-assisted.
With software-based I/O virtualization, the VMM exposes a virtual device, such as a Network Interface
Controller (NIC), to a VM. A software device model in the VMM emulates the behavior of the virtual device.
The device model translates from virtual device commands to physical device commands that are
forwarded to a physical device. Such software emulation of devices can provide good compatibility to
software running within VMs but incurs significant performance overhead, especially for high performance
devices. In addition to the performance limitations, emulating virtual devices in software can be complex
for programmable devices such as Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays
(FPGAs) because these devices perform a variety of complex functions. Variants of software-based I/O
virtualization such as ‘device paravirtualization’ and ‘mediated pass-through’ can mitigate some of the
performance and complexity disadvantages of device emulation.
To avoid the overheads of software-based I/O virtualization, VMMs may make use of platform support for
DMA and interrupt remapping (such as Intel VT-d) to support ‘direct device assignment’, which allows guest
software to directly access an assigned device. Direct device assignment provides the best I/O virtualization
performance since the hypervisor is no longer in the path of most guest software accesses to the device.
However, this approach requires the device to be exclusively assigned to a VM and does not support sharing
of the device across multiple VMs.
Single Root I/O Virtualization (SR-IOV) is a PCI-SIG* defined specification for hardware-assisted I/O
virtualization that defines a standard way for partitioning endpoint devices for direct sharing across multiple
VMs or containers. An SR-IOV capable endpoint device supports a Physical Function (PF) and multiple
Virtual Functions (VFs). The PF provides resource management for the device and is managed by the host
driver running in the host OS. Each VF can be assigned to a VM or container for direct access. SR-IOV is
supported by high performance I/O devices such as network and storage controller devices as well as
programmable or reconfigurable devices such as GPUs, FPGAs, and other accelerators.
Figure 1-1 illustrates two example approaches to Intel Scalable IOV, showing how it enables flexible
composition of virtual devices for device sharing. Accesses between a VM and a virtual device are defined
as either ‘direct path’ or ‘intercepted path’. Direct-path operations on the virtual device are mapped directly
to the underlying device hardware for performance, while intercepted-path operations are emulated by the
Virtual Device Composition Module (VDCM) for greater flexibility.
The exact mechanism for virtual device composition is implementation specific. For example, Figure 1-1 (a)
shows a system that implements VDCM in host OS or VMM software, whereas Figure 1-1 (b) shows a system
that implements VDCM in an embedded controller on the platform. VDCM configures the device through
the host driver. VDCM and the host driver may be co-located. For simplicity, this specification primarily uses
the example where VDCM is implemented in the host OS or VMM software, but the architecture can be
applied to other mechanisms of virtual device composition.
Figure 1-2 illustrates the main benefits of Intel Scalable IOV. Device resources shown as “Q” can be directly
mapped to VMs. A VDEV is a virtual device instance that is exposed to a VM. Virtual device composition
enables increased sharing scalability and flexibility at lower hardware cost and complexity. It provides
system software the flexibility to share device resources with different address domains using different
abstractions. For example, application processes may access a device using system calls and VMs may
access a device using virtual device interfaces. Virtual device composition can also enable dynamic mapping
of VDEVs to device resources, allowing a VMM to over-provision device resources to VMs.
In a data-center with physical machines containing different generations (versions) of the same I/O device,
a VMM can use the virtual device composition to present the same VDEV capabilities irrespective of the
different generations of physical I/O devices. This ensures that the same guest OS image with a VDEV driver
can be deployed or migrated to any of the physical machines.
The Intel Scalable IOV architecture is composed of the following elements:
- Endpoint device support: PCI Express endpoint device requirements and capabilities, covered in Chapter 2.
- Platform support: Host platform (Root Complex) requirements, including enhancements to DMA remapping hardware. These requirements are implemented on Intel® platforms as part of Intel Virtualization Technology for Directed I/O, Rev 3.1 or higher. This is covered in Chapter 3.
- Virtual Device Composition Module support: Virtual device composition architecture. This specification describes the software-based virtual device composition architecture in detail, including host system software enabling and device-specific software components such as the host driver, guest driver, and virtual device composition module (VDCM). This is covered in Chapter 4.
PCI Express endpoint devices may be designed to operate with either Intel Scalable IOV or SR-IOV. Device
implementations that already support SR-IOV can maintain it for backwards compatibility while adding the
new capabilities to support Intel Scalable IOV. A device capable of both methods should allow software to
enable it to operate in one mode or the other. Devices may support both methods concurrently or support
Intel Scalable IOV operation in a hierarchical manner over SR-IOV VFs, but these modes of operation are beyond
the scope of this document.
Intel Scalable IOV distinguishes intercepted-path and direct-path accesses. Intercepted-path accesses from
VMs go through the virtual device composition module, while direct-path accesses are mapped directly to
the device. Which operations and accesses are distinguished as intercepted path versus direct path is
controlled by the device implementation. Typically, slow-path operations are treated as intercepted-path
accesses and fast-path operations are treated as direct-path accesses. For example, intercepted-path
accesses typically include initialization, control, configuration, management, QoS, error processing, and
reset, whereas direct-path accesses typically include data-path operations involving work submission and
work completion processing.
Figure 1-3 illustrates an example software architecture where VDCM is implemented in host software. The
figure calls out key components to describe the architecture and is not intended to illustrate all virtualization
software or specific implementation choices. Software responsibilities are abstracted between system
software (OS/VMM) and device-specific driver software components. The VMM maps direct-path accesses
from the guest directly onto the provisioned ADIs for the VDEV. The VMM traps intercepted-path accesses
from the guest and forwards them to VDCM for emulation. VDCM emulates the intercepted accesses to the
VDEV. If required, it may access the physical device (for example, to read ADI status or configure the ADI’s
PASID).
Virtualization management software may make use of VDCM interfaces for virtual device resource and state
management, enabling capabilities such as suspend, resume, reset, and migration of virtual devices.
Depending on the specific VMM implementation, VDCM may be instantiated as a separate user or kernel
module or may be packaged as part of the host driver. Chapter 4 further describes the high-level software
architecture.
2 Device Support
This chapter describes the key set of requirements and capabilities for an endpoint device to support Intel
Scalable IOV. The requirements apply to both Root-Complex Integrated Endpoint and PCI Express Endpoint
devices.
As described in the previous chapter, the construct for fine-grained sharing on endpoint devices is the Assignable
Device Interfaces (ADIs). ADIs form the unit of assignment and isolation for devices and are composed by
software to form virtual devices. This chapter describes the requirements for endpoint devices for
enumeration, allocation, configuration, management and isolation of ADIs.
An Assignable Device Interface (ADI) refers to the set of device backend resources that are allocated,
configured and organized as an isolated unit, forming the unit of device sharing. The type and number of
backend resources grouped to compose an ADI is device specific. An ADI may be associated with a device
context, rather than with specific device resources. ADIs using shared work queues (SWQ) for work
submission may have little to no state or resources associated with them on the device. (See Section 2.3 for
more details.)
Figure 2-1 illustrates a logical view of ADIs with varying number of device backend resources, and
virtualization software composing virtual device instances with one or more ADIs. ADI 1 and ADI 2 are
composed of single backend resource 1 and 2 respectively, whereas ADI 3 is composed of multiple backend
resources 3, 4, and 5. Virtual device 1 instance (VDEV1) is composed of two ADIs (ADI 1 and ADI 2) whereas
VDEV 2 and VDEV k instances are composed of single ADIs (ADI 3 and ADI m respectively). Due to different
ADI composition of ADI 3 and ADI m, VDEV 2 gets 3 backend resources whereas VDEV k gets one backend
resource.
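The composition described above can be sketched as a toy software model; the class names (Adi, Vdev) and helper are illustrative, not part of the specification.

```python
# Toy model of Figure 2-1: backend resources grouped into ADIs,
# ADIs composed by virtualization software into virtual devices (VDEVs).

class Adi:
    def __init__(self, name, backend_resources):
        self.name = name
        self.backend_resources = list(backend_resources)

class Vdev:
    def __init__(self, name, adis):
        self.name = name
        self.adis = list(adis)

    def backend_resource_count(self):
        # A VDEV's capacity is the sum of its constituent ADIs' resources.
        return sum(len(a.backend_resources) for a in self.adis)

# ADI 1 and ADI 2 each wrap a single backend resource; ADI 3 wraps three.
adi1, adi2 = Adi("ADI 1", [1]), Adi("ADI 2", [2])
adi3 = Adi("ADI 3", [3, 4, 5])

vdev1 = Vdev("VDEV 1", [adi1, adi2])   # two ADIs, two backend resources
vdev2 = Vdev("VDEV 2", [adi3])         # one ADI, three backend resources

print(vdev1.backend_resource_count())  # 2
print(vdev2.backend_resource_count())  # 3
```

The point of the model is that the ADI, not the backend resource, is the unit of assignment: a VDEV's resource count follows entirely from which ADIs compose it.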
Endpoint devices must support the PASID capability as defined by the PCI Express specification and comply
with all associated requirements. Before enabling ADIs, the PASID capability of the device must be enabled.
Before an ADI is activated, it must be configured with a PASID value. All upstream memory requests and
ATS Translation Requests generated by any ADI must be tagged with the assigned PASID value using the
PASID TLP Prefix. ATS Translated Requests by an ADI may be generated without PASID or with the assigned
PASID. Refer to the PCI Express specification and related ECNs for details on usage of the PASID TLP Prefix
on Translated Requests. Interrupts generated by ADIs are not tagged with the PASID TLP Prefix. Refer to
Section 2.5 for identifying ADI interrupts.
Each ADI must have a primary PASID associated with it, which is used for direct-path operations. ADIs may
have optional secondary PASIDs whose usage is device dependent. For example, an ADI may be configured
to access meta-data, commands, and completions with a secondary PASID that represents a restricted
control domain, while data accesses are associated with the primary PASID corresponding to the domain to
which the ADI is assigned.
When assigning an ADI to an address domain (e.g., VM, container, or process), the ADI is configured with the
unique PASID of the address domain and its memory requests are tagged with the PASID value in the PASID
TLP Prefix. If multiple ADIs are assigned to the same address domain, they may be assigned the same PASID.
If ADIs belonging to a VDEV assigned to a VM are further mapped to secondary address domains (e.g.,
application processes) within the VM, each such ADI is assigned a unique PASID corresponding to the
secondary address domain. This enables usages such as Shared Virtual Memory within a VM, where a guest
application process is assigned an ADI and requests from the ADI are subject to nested translation (GVA to
GPA to HPA) by the DMA remapping hardware, which is similar to the nested address translation for CPU
accesses by a guest application.
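The PASID assignment rules above (one PASID per address domain, reused across ADIs assigned to the same domain, distinct for secondary domains) can be sketched with a hypothetical allocator; all names here are illustrative.

```python
import itertools

class PasidAllocator:
    """Assigns one PASID per address domain; ADIs assigned to the
    same domain share that PASID."""
    def __init__(self):
        self._next = itertools.count(1)   # PASID 0 commonly reserved
        self._by_domain = {}

    def pasid_for(self, domain):
        if domain not in self._by_domain:
            self._by_domain[domain] = next(self._next)
        return self._by_domain[domain]

alloc = PasidAllocator()
vm1 = alloc.pasid_for("VM1")
# Two ADIs assigned to the same VM may carry the same PASID ...
assert alloc.pasid_for("VM1") == vm1
# ... while an ADI mapped to a guest process (a secondary address
# domain, e.g. for Shared Virtual Memory) gets its own PASID.
assert alloc.pasid_for("VM1/process") != vm1
```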
Software submits work to an SWQ on Intel® 64 processors using the ENQCMD (Enqueue Command) or
ENQCMDS (Enqueue Command as Supervisor) instructions. ENQCMD/S instructions carry a PASID value in
the work descriptor to identify the software entity that is submitting the work descriptor. ENQCMD/S
instructions return a Success or Retry (Deferred) indication. Success indicates the work was accepted into
the SWQ, while Retry indicates it was not accepted due to SWQ capacity, QoS, or other reasons. On a Retry
status, the work submitter may back-off and retry later. Refer to the Intel® Architecture Instruction Set
Extensions Programming Reference for more details.
Because work submissions to an SWQ contain a PASID value in the work descriptor, the PASID may not
need to be preconfigured in the device for ADIs that use an SWQ.
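The Success/Retry protocol can be illustrated with a toy software model; this only mimics the back-off behavior of a submitter against a capacity-limited SWQ and is not actual ENQCMD instruction usage.

```python
import collections
import time

class SharedWorkQueue:
    """Toy SWQ: accepts descriptors up to a capacity, else returns Retry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = collections.deque()

    def enqcmd(self, pasid, descriptor):
        # Real ENQCMD carries the PASID inside the work descriptor itself,
        # identifying the submitting software entity.
        if len(self.entries) >= self.capacity:
            return "Retry"              # deferred: capacity, QoS, etc.
        self.entries.append((pasid, descriptor))
        return "Success"

def submit_with_backoff(swq, pasid, descriptor, max_tries=5):
    delay = 0.0
    for _ in range(max_tries):
        if swq.enqcmd(pasid, descriptor) == "Success":
            return True
        time.sleep(delay)               # back off and retry later
        delay = delay * 2 or 1e-6
        swq.entries.popleft()           # pretend the device drained an entry
    return False

swq = SharedWorkQueue(capacity=1)
assert submit_with_backoff(swq, pasid=7, descriptor="desc-a")
assert submit_with_backoff(swq, pasid=9, descriptor="desc-b")  # retries once
```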
Figure: Direct-path mappings from host/VMM software to device hardware
Devices must partition their ADI MMIO registers into two categories: (a) MMIO registers accessed for direct-
path operations; and (b) MMIO registers accessed for intercepted-path operations. The definition of what
operations are designated as intercepted path versus direct path is device-specific. The device must
segregate registers in these two categories into distinct system page size regions, to allow the VMM to
directly map direct-path operations to one or more constituent ADIs while emulating intercepted-path
operations in the VDCM.
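One way to read this requirement: no system page may mix direct-path and intercepted-path registers, since the VMM maps MMIO into the guest at page granularity. A hypothetical layout checker:

```python
def check_mmio_partitioning(registers, page_size=4096):
    """registers: iterable of (offset, kind) pairs with kind in
    {"direct", "intercepted"}. Returns True iff no system page
    contains registers of both kinds, so the VMM can map the
    direct-path pages into the guest and trap all the rest."""
    kinds_per_page = {}
    for offset, kind in registers:
        kinds_per_page.setdefault(offset // page_size, set()).add(kind)
    return all(len(kinds) == 1 for kinds in kinds_per_page.values())

ok_layout = [(0x0000, "intercepted"), (0x0008, "intercepted"),
             (0x1000, "direct"), (0x1040, "direct")]
bad_layout = [(0x0000, "intercepted"), (0x0008, "direct")]  # same 4 KB page

assert check_mmio_partitioning(ok_layout)
assert not check_mmio_partitioning(bad_layout)
```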
Devices should implement prefetchable 64-bit BARs so that address space above 4GB can be used for
scaling ADI MMIO resources.
IMS entries store and generate interrupts using the same interrupt message address and data values as PCI
Express MSI-X table entries. Interrupt messages stored in IMS are composed of a DWORD size data payload
and a 64-bit address. IMS implementations must allow for dynamic allocation and release of IMS entries as
ADIs are dynamically instantiated/revoked to create/destroy virtual devices. IMS must support per-message
masking and pending bit status, similar to the per-vector mask and pending bit array in the PCI Express
MSI-X capability.¹

¹ IMS may be supported by devices independent of Intel Scalable IOV. Such usages are outside the scope
of this document.
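The per-message mask and pending semantics can be modeled as follows. This is a behavioral sketch mirroring MSI-X vector control, not a storage format, which the specification deliberately leaves device-specific.

```python
class ImsEntry:
    """Behavioral model of one IMS entry: a 64-bit message address,
    a DWORD data payload, and per-message mask/pending bits."""
    def __init__(self, address, data):
        self.address = address & (2**64 - 1)
        self.data = data & 0xFFFFFFFF
        self.masked = True              # entries start masked
        self.pending = False
        self.sent = []                  # delivered (address, data) messages

    def raise_interrupt(self):
        if self.masked:
            self.pending = True         # latch; deliver later on unmask
        else:
            self.sent.append((self.address, self.data))

    def unmask(self):
        self.masked = False
        if self.pending:
            self.pending = False
            self.sent.append((self.address, self.data))

entry = ImsEntry(address=0xFEE0_1000, data=0x41)
entry.raise_interrupt()         # masked: latched as pending, not delivered
assert entry.pending and not entry.sent
entry.unmask()                  # pending message delivered on unmask
assert entry.sent == [(0xFEE0_1000, 0x41)]
```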
The size, location, and storage format for IMS is device specific. For example, a device may implement IMS
as on-device storage. A device that maintains ADI contexts in memory may implement IMS as part of the
context privileged state. In either approach, the device may implement IMS as either one unified storage
structure or as de-centralized per-ADI storage structures. If IMS is implemented in host memory, the device
may cache IMS entries within the device. If the device implements IMS caching, it must also implement
device specific interfaces for the device driver to invalidate the IMS cache entries. Programming of IMS is
done by the host driver.
Devices should support IMS for better scalability and dynamic allocation of ADI interrupts. Interrupts
generated by ADIs should use the IMS. Interrupts generated by the base function should use the MSI or
MSI-X capability. With appropriate device and system software support, ADI interrupts may use MSI-X and
base-function interrupts may use IMS.
On Intel 64 architecture platforms, message signaled interrupts are issued as DWORD size untranslated
memory writes without a PASID TLP Prefix, to address range 0xFEExxxxx. Since all memory requests
generated by ADIs include a PASID TLP Prefix, it is not possible for an ADI to generate a DMA write that
would be interpreted by the platform as an interrupt message.
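This isolation property can be phrased as a simple predicate; a hedged sketch, with the 0xFEExxxxx range check written out explicitly.

```python
def is_interrupt_message(address, has_pasid_prefix):
    """On Intel 64 platforms, an upstream DWORD write is treated as an
    interrupt message only if it carries no PASID TLP Prefix and targets
    the 0xFEExxxxx interrupt address range."""
    in_interrupt_range = (address >> 20) == 0xFEE
    return in_interrupt_range and not has_pasid_prefix

# A base-function MSI/MSI-X write (no PASID prefix) is an interrupt:
assert is_interrupt_message(0xFEE0_0000, has_pasid_prefix=False)
# An ADI memory request always carries a PASID prefix, so even a write
# targeting 0xFEExxxxx cannot be interpreted as an interrupt message:
assert not is_interrupt_message(0xFEE0_0000, has_pasid_prefix=True)
# Writes outside the range are ordinary DMA either way:
assert not is_interrupt_message(0x1000_0000, has_pasid_prefix=False)
```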
The PCI Express Access Control Service capability is not applicable for isolation between ADIs. Devices must
not allow peer-to-peer access between ADIs or between ADIs and the base function (either internal to the
device or at I/O fabric egress). Independent of Intel Scalable IOV support, a device may support ACS
guidelines for isolation across endpoint functions or devices, per the PCI Express specification.
Although ADIs are functionally isolated, they may have performance effects on each other and on the base
function. Devices may define Quality of Service (QoS) controls for ADIs to manage these effects. The
definition of QoS for ADIs is device specific and is outside the scope of this specification.
ADI specific errors are errors that can be attributed to a particular ADI, such as malformed commands or
address translation errors. Such errors must not impact functioning of other ADIs or the base function.
Handling of ADI specific errors can be implemented in device-specific ways; such errors should be reported
directly to the guest that the ADI is assigned to, when possible.
A VDEV may expose a virtual FLR capability that may be emulated by the VDCM by requesting the device
to perform ADI resets for each of the constituent ADIs of the virtual device.
An ADI reset must ensure that the reset is not reported as complete until all of the following conditions are
satisfied:
- All DMA write operations by the ADI are drained or aborted
- All DMA read operations by the ADI have completed or aborted
- All interrupts from the ADI have been generated
- If ADI is capable of Address Translation Service (ATS), all ATS requests by the ADI have completed or
aborted, and
- If ADI is capable of Page Request Service (PRS), no more page requests will be generated by the ADI.
Additionally, either page responses have been received for all page requests generated by the ADI or
the ADI will discard page responses for any outstanding page requests by the ADI.
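The completion conditions above amount to a conjunction over the ADI's outstanding transactions. A sketch with a hypothetical AdiState record; field names are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class AdiState:
    """Hypothetical snapshot of an ADI's outstanding activity."""
    dma_writes_outstanding: int = 0
    dma_reads_outstanding: int = 0
    interrupts_pending: int = 0
    ats_capable: bool = False
    ats_requests_outstanding: int = 0
    prs_capable: bool = False
    page_requests_unresolved: int = 0   # neither responded to nor discarded

def adi_reset_complete(s: AdiState) -> bool:
    # Reset may be reported complete only when every condition holds;
    # ATS/PRS conditions apply only if the ADI supports those services.
    return (s.dma_writes_outstanding == 0
            and s.dma_reads_outstanding == 0
            and s.interrupts_pending == 0
            and (not s.ats_capable or s.ats_requests_outstanding == 0)
            and (not s.prs_capable or s.page_requests_unresolved == 0))

assert adi_reset_complete(AdiState())
assert not adi_reset_complete(AdiState(dma_reads_outstanding=2))
assert not adi_reset_complete(AdiState(prs_capable=True,
                                       page_requests_unresolved=1))
```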
Devices supporting Intel Scalable IOV should support Function Level Reset (FLR) and may support
additional device-specific global reset controls. A global reset operation or FLR resets all ADIs and returns
the device to a state where no ADIs are configured. A device may also support a device-specific global reset
that resets all ADIs but leaves them configured.
Devices may optionally support saving and restoring ADI state, to facilitate operations such as live migration
and suspend/resume of virtual devices composed of ADIs. For example, to support ADI suspend, a device
may implement an interface to drain (complete) all operations submitted to the ADI.
Figure: Intel Scalable IOV DVSEC capability register layout
The fields up to offset 0xa are the standard DVSEC capability header. Refer to the PCI Express DVSEC
header for a detailed description of these fields. The remaining fields are described below.
7:0 The programming model for a device may have vendor-specific dependencies RO
between sets of Functions. The Function Dependency Link field is used to
describe these dependencies.
If the H flag is not Set, then different Functions in the FDL can be in different
modes.
7:1 Reserved RO
31:0 This field indicates the page sizes supported. A page size of 2n+12 is supported RO
if bit n is Set. For example, bit 0 indicates support for 4 KB pages. The page
size indicates the minimum alignment requirement for ADI MMIO pages so
that they can be independently assigned to different address domains.
Support for 4 KB pages is required. Devices may support additional page sizes
for compatibility with a variety of host platform architectures.
31:0 System Page Size (RW): This field defines the page size the system uses to map the ADIs’ MMIO pages.
Software must set the value of System Page Size to one of the page sizes set
in the Supported Page Sizes field. As with Supported Page Sizes, if bit n is Set
in System Page Size, a page size of 2^(n+12) is used. For example, if bit 1 is Set, the
device uses an 8 KB page size. The behavior is undefined if System Page Size
is zero, if more than one bit is set, or if a bit is set in System Page Size that is not
set in Supported Page Sizes.
When System Page Size is written, all ADI MMIO resources are aligned on
system page size boundaries. System Page Size must be configured before
setting the Memory Space Enable bit in the PCI command register. The
behavior is undefined if System Page Size is modified after Memory Space
Enable is Set.
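The encoding of the Supported Page Sizes and System Page Size fields can be captured in a few lines. The helper names below are illustrative, not part of the specification.

```python
def decode_page_sizes(mask: int) -> list[int]:
    """Return the page sizes (in bytes) encoded by a Supported Page Sizes mask.

    Bit n set means a page size of 2^(n+12) is supported (bit 0 -> 4 KB).
    """
    return [1 << (n + 12) for n in range(32) if mask & (1 << n)]


def system_page_size_valid(system_ps: int, supported_ps: int) -> bool:
    """A System Page Size value is well-defined only if exactly one bit is set
    and that bit is also set in Supported Page Sizes."""
    exactly_one_bit = system_ps != 0 and (system_ps & (system_ps - 1)) == 0
    return exactly_one_bit and bool(system_ps & supported_ps)
```

For example, a mask of 0b11 decodes to 4 KB and 8 KB pages, and a System Page Size of 0b10 is valid against it, while 0b11 (two bits set) is not.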
0 IMS Support (RO): This bit indicates support for Interrupt Message Storage
(IMS) in the device.
If virtualization software supports IMS use only for ADIs and not by the base
function, then when the base function is directly assigned to a domain,
virtualization software may expose a virtual Intel Scalable IOV DVSEC
Capability to the domain with the IMS Support bit reported as 0.
31:1 Reserved (RO)
3 Platform Support
The following platform level capabilities are required to support Intel Scalable IOV:
• Support for the PCI Express PASID TLP Prefix in Root Ports and the Root Complex. Refer to the PCI
Express Revision 4.0 specification or higher for details on PASID TLP Prefix support.
• PASID-granular address translation in Root Complex.
• Interrupt remapping support in Root Complex.
• PASID translation support in CPUs supporting PCIe Deferrable Memory Write (DMWr).
For example, Intel platforms support Intel Scalable IOV through the Scalable Mode Translation Support
capability of the Intel® Virtualization Technology for Directed I/O, Rev 3.1 or higher. Figure 3-1 illustrates
the high-level translation structure organization for Intel scalable mode address translation.
Figure 3-1: Scalable Mode DMA Remapping Architecture for Intel® Scalable I/O Virtualization
(The figure depicts a root table indexed by Bus (0–255), lower and upper scalable-mode context tables indexed by Device and Function (Dev 0–31, Func 0–7), a PASID directory indexed by PASID[19:6] (up to 2^14 entries), and PASID tables indexed by PASID[5:0] (64 entries), whose entries reference first-level and second-level page-table structures.)
1. The Requester ID (Bus/Device/Function number) in the upstream request is used to consult the Root
and Context structures that specify translation behavior at Requester ID granularity. The context
structure refers to PASID structures.
2. If the request includes a PASID TLP Prefix, the PASID value from the TLP prefix is used to consult the
PASID structures that specify translation behavior at PASID granularity. If the request is without a PASID
TLP Prefix, the PASID value programmed by software in the Context structure is used instead. For each
PASID, the PASID structure entry can be programmed to specify first-level, second-level, pass-through
or nested translation, along with references to first-level and second-level page-table structures.
3. Finally, the address in the request is subject to address translation using the first-level, second-level or
both page-table structures, depending on the type of translation function.
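The three-step lookup above can be sketched as a walk over nested tables. The dictionary-based structure below is a simplified illustration of the scalable-mode organization, not the in-memory format defined by the Intel VT-d specification.

```python
def lookup_translation(root, bus, dev, func, pasid=None):
    """Resolve the translation behavior for an upstream request.

    root: {bus: {(dev, func): context}}, where context is a dict with a
    'default_pasid' (used when the request carries no PASID TLP Prefix)
    and a 'pasid_table' mapping PASID -> translation descriptor
    (e.g. 'first-level', 'second-level', 'nested', 'pass-through').
    """
    context = root[bus][(dev, func)]      # Step 1: Requester ID lookup
    if pasid is None:                     # Step 2: PASID selection
        pasid = context["default_pasid"]  # request without PASID TLP Prefix
    return context["pasid_table"][pasid]  # Step 3: per-PASID translation
```

A request without a PASID TLP Prefix uses the software-programmed default PASID; a request with a prefix uses the PASID it carries.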
The host platform should support direct delivery of virtual interrupts to VMs without hypervisor processing
overheads. This also enables virtual interrupts to operate in guest interrupt vector space without consuming
host processor interrupt vectors.
Refer to the Intel® VT-d specification for details on interrupt remapping and posted interrupts.
This chapter describes an example software architecture in which Intel Scalable IOV is enabled through
VMM software as shown in Figure 1-3. This chapter is not intended to be prescriptive, and instead covers
an example description of system software and device-specific software roles and interactions to compose
hardware-assisted virtual devices and manage device operation. Specific OS or VMM implementations may
choose other methods to enable Intel Scalable IOV.
The software architecture described in this chapter focuses on I/O virtualization for virtual machines and
machine containers. However, the principles can be applied to other domains such as I/O sharing across
bare-metal containers or application processes. Figure 1-3 illustrates the high-level software architecture.
The logical components of the reference software architecture are described below.
Table 4-1 illustrates a high-level set of operations that the host driver supports for managing ADIs. These
operations are invoked through suitable software interfaces defined by specific system software
implementations.
Table 4-1: Host Driver Interfaces for Intel® Scalable I/O Virtualization
A device may support multiple types of ADIs, both in terms of number of backend resources (see Figure
2-1) and in terms of functionality. Similarly, a VDCM may support more than one type of VDEV composition,
with respect to the number of backing ADIs, functionality of ADIs, etc., enabling the virtual machine resource
manager to request different types of VDEV instances for assigning to virtual machines. The VDCM uses the
host OS and VMM defined interfaces to allocate and configure resources needed to compose a VDEV.
A VDEV may be composed of a static number of ADIs that are pre-allocated at the time of VDEV instantiation
or composed dynamically by the VDCM in response to guest driver requests to allocate/free resources. An
example of statically allocated ADIs is a virtual NIC with a fixed number of RX/TX queues. An example of
dynamically allocated ADIs is a virtual accelerator device, where context allocation requests are virtualized
by the VDCM to dynamically create accelerator contexts as ADIs.
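The two composition models can be contrasted with a small sketch. The VDCM interface shown here is hypothetical; actual allocation interfaces are defined by the host OS and VMM.

```python
class Vdcm:
    """Illustrative VDCM resource model (hypothetical interface)."""

    def __init__(self, free_adis: int):
        self.free_adis = free_adis  # backend resources available on the device

    def compose_static_vdev(self, num_adis: int) -> list[int]:
        """Pre-allocate a fixed set of ADIs at VDEV instantiation
        (e.g., a virtual NIC with a fixed number of RX/TX queues)."""
        return [self._alloc_adi() for _ in range(num_adis)]

    def alloc_adi_on_demand(self) -> int:
        """Allocate one ADI in response to a guest driver request
        (e.g., an accelerator context created dynamically as an ADI)."""
        return self._alloc_adi()

    def _alloc_adi(self) -> int:
        if self.free_adis == 0:
            raise RuntimeError("no backend resources available")
        self.free_adis -= 1
        return self.free_adis  # hypothetical ADI handle
```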
• Direct Mapped to ADI MMIO: Direct-path registers of the VDEV virtual MMIO space are mapped directly
to the physical MMIO space of the device. The VDCM requests the hypervisor to set up GPA to HPA
mappings for these regions in the CPU virtualization page tables, enabling direct access by the guest
driver to the ADI.
• VDEV MMIO Intercepted and Emulated by VDCM: Intercepted-path registers of the VDEV are virtualized
by the VDCM by requesting the hypervisor to not map these MMIO regions in the host processor
virtualization page-tables, thus forcing host intercepts when the guest driver accesses these registers.
The intercepted accesses are provided to the VDCM to virtualize, either by itself or through interactions
with the host driver.
VDEV registers that are read frequently and have no read side-effects, but require VDCM intercept and
emulation on write accesses, may be mapped as read-only to backing memory pages provided by the
VDCM. This supports high-performance read accesses to these registers while virtualizing their
write side-effects through intercepts on guest write accesses. ‘Write intercept only’ registers must be
contained in system page size regions separate from the ‘read-write intercept’ registers in the VDEV
MMIO layout.
• VDEV MMIO Mapped to Memory: VDEV registers that have no read or write side effects may be mapped
to memory with read and write access. These registers may contain parameters or data for a subsequent
operation performed by writing to an intercepted register. Device implementations may also use this
approach to define virtual registers for VDEV-specific communication channel between the guest driver
and the VDCM. The guest driver writes data to the memory backed virtual registers without host
intercepts, followed by a mailbox register access that is intercepted by the VDCM. This optimization
reduces host intercept and instruction emulation costs for passing data between guest and host. Such
an approach may enable guest drivers to implement these channels with the VDCM more generally than
hardware-based communication doorbells (as often implemented between SR-IOV VFs and the PF) and
without depending on guest OS or hypervisor-specific para-virtualized software interfaces.
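Because mapping granularity is the system page size, registers with different mapping treatments must not share a page. A layout check along those lines might look as follows; the region descriptors are hypothetical.

```python
def layout_ok(regions, system_page_size):
    """Verify that no two VDEV MMIO regions with different mapping types
    ('direct', 'intercept', 'memory-backed', 'write-intercept-only')
    fall within the same system-page-size region."""
    page_types = {}
    for start, length, mapping in regions:
        first_page = start // system_page_size
        last_page = (start + length - 1) // system_page_size
        for page in range(first_page, last_page + 1):
            if page_types.setdefault(page, mapping) != mapping:
                return False  # two mapping types share one system page
    return True
```

With a 4 KB system page size, placing a direct-mapped region and an intercepted region in the same 4 KB page fails this check.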
One source of VDEV interrupts is the intercepted-path registers emulated by the VDCM; the other source of interrupts is the ADI instances on the device that are used to support direct-path operations
of the VDEV.
When the guest OS programs the virtual MSI or MSI-X register, the operation is intercepted and virtualized
by the VDCM. For intercepted-path virtual interrupts, the VDCM requests virtual interrupt injection to the
guest through the VMM software interfaces. For direct-path interrupts from ADIs, the VDCM invokes the
host driver to allocate and configure required interrupt message address and data in the IMS.
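The virtualization of a guest write to a virtual MSI or MSI-X entry can be sketched as a dispatcher over the interrupt's source. The callback names below are hypothetical stand-ins for VMM and host-driver interfaces.

```python
def on_guest_msix_write(entry, request_virtual_irq, program_ims):
    """Handle an intercepted guest write to a virtual MSI-X entry.

    entry: dict with 'source' ('intercepted-path' or 'direct-path'),
           plus the guest-programmed 'addr' and 'data'.
    request_virtual_irq: VMM interface for virtual interrupt injection.
    program_ims: host-driver interface to program an IMS entry for an ADI.
    """
    if entry["source"] == "intercepted-path":
        # Virtual interrupt generated by the VDCM itself.
        return request_virtual_irq(entry["addr"], entry["data"])
    # Direct-path interrupt: the host driver programs the message
    # address and data into the device's IMS.
    return program_ims(entry["addr"], entry["data"])
```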
• Software emulated communication channel: This type of channel is implemented by the VDCM by
setting up one or more system page size regions in VDEV MMIO space as fully memory-backed, to be
used to share data between the guest and the host. The VDCM also sets up an intercepted-path register
in VDEV MMIO space to be used by the guest to signal an action to the host. A virtual interrupt may be
used by the VDCM to signal the guest about completion of asynchronous communication channel
actions.
• Hardware mailbox-based communication channel: If the communication between the guest driver and
the host driver is frequent and the software emulation-based communication channel overhead is
significant, the device may implement communication channels based on hardware mailboxes. This is
similar to communication channels between SR-IOV VFs and PF in some existing designs.
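The software-emulated channel amounts to a shared-memory-plus-doorbell protocol, sketched below. This is illustrative only; no specific guest/host interface is implied by the specification.

```python
class EmulatedChannel:
    """Memory-backed shared page plus an intercepted doorbell register."""

    def __init__(self, handler):
        self.shared_page = bytearray(4096)  # memory-backed VDEV MMIO region
        self.handler = handler              # VDCM-side action on doorbell

    def guest_write(self, offset, data: bytes):
        # No host intercept: plain writes to the memory-backed page.
        self.shared_page[offset:offset + len(data)] = data

    def guest_ring_doorbell(self, length):
        # Intercepted-path register access: traps to the VDCM, which reads
        # the request out of the shared page and acts on it.
        return self.handler(bytes(self.shared_page[:length]))
```

The guest stages its request in the shared page with ordinary memory writes, then triggers exactly one intercept via the doorbell, which is what reduces emulation cost relative to intercepting every register access.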
Like Intel Scalable IOV, devices supporting SVM use PASIDs to distinguish different application virtual
address spaces. A device that supports both SVM and Intel Scalable IOV will support SVM both for ADIs
assigned to host applications and for ADIs assigned to guest applications. The distinction between host and
guest SVM usages is transparent to the device. The only difference is in the address translation function
programming of the DMA remapping hardware for each PASID. The address translation function
programmed for a PASID representing host SVM usage refers to the CPU virtual address to physical address
translation, while the address translation function programmed for a PASID representing guest SVM usage
refers to nested address translation (guest virtual address to guest physical address and then to host
physical address).
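The distinction reduces to which translation function the DMA-remapping driver programs for the PASID. A minimal sketch, with a hypothetical function name:

```python
def translation_for_pasid(usage: str) -> str:
    """Select the address translation function for a PASID.

    Host SVM: CPU virtual address -> physical address (first-level only).
    Guest SVM: guest VA -> guest PA -> host PA (nested translation).
    """
    if usage == "host-svm":
        return "first-level"
    if usage == "guest-svm":
        return "nested"
    raise ValueError("unknown PASID usage: " + usage)
```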