CAPI: Coherent Accelerator Processor Interface

J. Stuecheli
B. Blaner
C. R. Johns
M. S. Siegel

©Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
IBM J. RES. & DEV. VOL. 59 NO. 1 PAPER 7 JANUARY/FEBRUARY 2015 J. STUECHELI ET AL. 7:1
The emergence of "Big Data" problem spaces, which require analysis of exabytes (10^18 bytes) of data, has further forced a rethinking of computational system design [4]. These data-centric problems differ from traditional problems in several aspects. First, the volume of data is much greater, potentially thousands of times greater than traditional workloads. Beyond the size of the data, inputs are unstructured, as the vast amounts of data to be mined are inherently disorganized [5, 6]. These features drive the need to scan and restructure large volumes of data, which are fed into more manageable data structures for further processing.

If the processing of this large amount of data is visualized as a computational tree, many parallel tasks may process subsets, or "leafs," of the full dataset at once. As the leafs are processed, intermediate results are formed, passed up the tree, and used in the next iterative step in processing the data until the root node, or final result, of this computational tree is reached. The combination of computation on vast unstructured data at the leaf operations and the serially structured final result generation motivates the use of heterogeneous computation systems. The initial stages are inherently parallel, as the massive volume of data is scanned. These raw data typically consist of irregular data types and are poorly suited to traditional processor register types. In contrast to the initial, leaf-processing stages, once the data has been filtered and formed into structured data, parallelism is greatly reduced [7].

To address these inefficiencies and the needs of emerging big data workloads, the POWER8* platform introduces the Coherent Accelerator Processor Interface (CAPI). This new interface, described in more detail herein, provides the capability for off-chip accelerators to be plugged into PCIe** (Peripheral Component Interconnect Express**) [8] slots and participate in the system memory coherence protocol as peers of other caches in the system. Additionally, CAPI enables the use of effective addresses to reference data structures in the same manner as applications running on the cores. These PCIe-card-based accelerators can be implemented in FPGAs for development flexibility or hardened in ASIC chips, depending on user requirements.

The PCIe attachment point provides for simple integration of a range of easily developed PCIe-based designs; however, the requirements of a scalable and robust attachment point introduce several challenges. The PCIe protocol does not follow the highly optimized coherent and resilient protocol used among the POWER8 modules in the system. This protocol mismatch motivated the creation of a proxy unit, resident on the POWER8 chip, to isolate the two divergent protocols, enable coherent traffic, and provide failure isolation. These are mandatory requirements for an attached accelerator to act as a peer of other CPUs in the system.

CAPI system description
A block diagram of the CAPI hardware is shown in Figure 1. Each POWER8 processor chip contains a symmetric multi-processor (SMP) bus interconnection fabric which enables the various units to communicate and coherently share system memory. These units are twelve general-purpose cores, two memory controller (MC) blocks, and units to bridge multiple chips in an SMP system. On the POWER8 processor chip, the PCIe Host Bridge (PHB) provides connectivity to PCIe Gen3 I/O links. The Coherent Accelerator Processor Proxy (CAPP) unit, in conjunction with the PHB, acts as the memory coherence, data transfer, interrupt, and address translation agent on the SMP interconnect fabric [9] for PCIe-attached accelerators. These accelerators comprise a POWER Service Layer (PSL) and Accelerator Function Units (AFUs) that reside in an FPGA or ASIC connected to the processor chip by the PCIe Gen3 link. Up to sixteen PCIe lanes per direction are supported. The combination of PSL, PCIe link, PHB, and CAPP provides AFUs with several capabilities. AFUs may operate on data in memory, coherently, as peers of other caches in the system. AFUs further use effective addresses to reference memory, with address translation provided by a memory management unit (MMU) in the PSL. The PSL may also generate interrupts on behalf of AFUs to signal AFU completion, or to signal a system service when a translation fault occurs.

Coherence
In order to provide coherent access to system memory, the CAPP and the PSL each contain a directory of the cache lines used by the AFUs. The CAPP snoops the fabric on behalf of the PSL, accesses its local directory, and responds to the fabric with the same latency as the other caches on the chip. In this way, the insertion of an off-chip coherent accelerator does not affect critical system performance parameters such as cache snoop latency. Snoops that hit in the CAPP directory may generate messages that are sent to the PSL by means of the PHB and PCIe link. The PSL may then respond to the message in a variety of ways depending on its contents.

The PSL may master operations on the SMP interconnect fabric using the combination of the PCIe link, the PHB, and master read and write finite state machines (FSMs) in the CAPP. For example, to store into a line on behalf of an AFU, the PSL must first have ownership of the line. The PSL first checks for the presence of the line in its cache directory. If the line is present (a directory hit) and in the modified state, the PSL allows the store from the AFU to proceed. However, if the access misses in the PSL directory, the PSL initiates a fabric master operation to gain ownership of the line, and may further request the cache line data. This is accomplished by sending a command to a CAPP master read FSM.
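The ownership check that gates an AFU store can be sketched as a toy software model. This is purely illustrative: the class names, states, and return values below are invented for this sketch and are not part of the CAPI hardware or any CAPI API.

```python
# Toy model of the PSL store path: directory hit in the modified state lets the
# store proceed; a miss first gains ownership through a CAPP master read FSM.

INVALID, MODIFIED = "invalid", "modified"

class CappMasterReadFSM:
    """Stands in for a CAPP master read FSM that gains line ownership on the fabric."""
    def acquire_ownership(self, address):
        # In hardware this issues a fabric command, waits for the combined
        # response, and may also return the cache line data.
        return MODIFIED  # assume ownership was granted

class PSL:
    def __init__(self):
        self.directory = {}                      # address -> coherence state
        self.capp_read_fsm = CappMasterReadFSM()

    def store(self, address, data):
        state = self.directory.get(address, INVALID)
        if state != MODIFIED:
            # Directory miss (or insufficient rights): master a fabric
            # operation through the CAPP to gain ownership of the line.
            self.directory[address] = self.capp_read_fsm.acquire_ownership(address)
        # Ownership held: the AFU store may now proceed.
        return f"stored {data!r} at {address:#x}"

psl = PSL()
print(psl.store(0x1000, "payload"))   # miss path: ownership acquired first
print(psl.store(0x1000, "payload2"))  # hit path: line already modified
```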
Figure 1. CAPI system block diagram.
The CAPP master FSM performs the access on the fabric, ultimately gains ownership of the line, and sends a message to the PSL that it has obtained such. If the data was also requested, it is returned directly by the source, which could be a memory controller or another cache in the system, to the PHB, where it is transferred across the PCIe link to the PSL and installed in its cache. The store from the AFU is then allowed to complete.

To push a line from the PSL cache to memory, which may occur, for example, when a line owned by the PSL needs to be evicted to make space for another line in the cache, the PSL issues a write command to a CAPP master write FSM. The PSL also pushes the modified data to the PHB for write-back to memory, and updates the state for the line in its directory to indicate that it no longer owns the line. The master write FSM obtains routing information for the destination of the write data and passes it to the PHB via sideband signals. The PHB then pushes the data onto the fabric to the destination. Additionally, the master write FSM updates the CAPP directory to reflect that the line is now invalid.

In the previous examples, the combination of evicting a line to make room for a new line and reading the new line, with or without intent to modify it, was illustrated as separate operations. This common combination between the PSL and the CAPP is optimized by providing a single compound operation that both evicts a directory entry, possibly with a data push to memory, and loads a new entry into the CAPP directory, possibly with read data provided back to the PSL. A compound command concurrently activates both write and read FSMs in the CAPP to perform the operation. This saves two crossings of the PCIe link compared to the discrete operations.

The PSL is further provisioned with the capability to master reads and writes on the fabric to copy lines to outside of the coherence domain, as would be the case for an I/O device operating with a checkout model of memory. This provision allows AFUs that have no need to maintain coherent copies of lines to bypass the PSL and CAPP caches entirely.

Address translation
To enable AFUs to reference memory with effective addresses, as would an application running on a core, the PSL contains an MMU comprising table-walk machines, which perform address translations, and caches of recent translations, which frequently avoid table walks. The table-walk machines use the mechanisms described above to read and update tables in memory during the translation process.

Since the PSL contains a translation cache, it must participate in translation invalidation (tlbi) [10] operations on the fabric. The CAPP snoops tlbi operations on behalf of the PSL and sends them in messages to the PSL, either one at a time or bundled into groups. The PSL looks up the address presented by the tlbi in its caches. If the address misses, it responds immediately back to the CAPP tlbi snooper that the operation is complete.
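The PSL side of the tlbi handling described above can be sketched as follows. The class, method, and response strings are invented for illustration; in particular, the drain of in-flight storage operations on a hit is reduced here to simply dropping the cache entry.

```python
# Minimal sketch of PSL tlbi handling: a translation-cache miss completes
# immediately; a hit must first drain operations tied to that translation.

class PslTranslationCache:
    def __init__(self, entries):
        self.entries = set(entries)   # cached translated page addresses

    def handle_tlbi(self, address):
        """Return the response the PSL sends back to the CAPP tlbi snooper."""
        if address not in self.entries:
            # Miss: respond immediately that the operation is complete.
            return "complete"
        # Hit: ensure storage operations associated with this translation
        # cache entry finish first (modeled by discarding the entry),
        # then send the completion message.
        self.entries.discard(address)
        return "complete after drain"

psl_tc = PslTranslationCache({0x7F000})
print(psl_tc.handle_tlbi(0x12000))  # complete
print(psl_tc.handle_tlbi(0x7F000))  # complete after drain
```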
Figure 2. CAPP unit components.
If the tlbi hits, the PSL follows a protocol to ensure that all storage operations associated with that translation cache entry are completed before sending a completion message to the CAPP tlbi snooper.

Address translations may generate faults requiring notification of system software to repair the fault. For this and other needs, the PSL provides a means to signal interrupts to software. This is accomplished by using the message signaled interrupt (MSI) mechanism provided by the PHB [8]. The PSL sends a command to the PHB using a particular address and data value indicative of the particular interrupt being asserted. The PHB responds as it would to an MSI from any I/O device; the details may be found in [8].

CAPP hardware description
This section considers the hardware structures internal to the CAPP that are required to enable attached accelerators to participate in the distributed cache coherence protocol provided by the SMP interconnect fabric as peers of the other caches in the system. The CAPP structures and machines parallel those of the L2 cache directory described in [9], while the data portion of the cache is maintained by the PSL.

Figure 2 shows the CAPP hardware in greater detail. The CAPP is divided into three areas: machines and transport, snoop pipeline, and SMP interconnect fabric interface. The SMP interconnect fabric interface provides snooper, master, and data interfaces to the fabric. The snooper interface comprises the reflected command (rcmd) bus and partial response (presp) buses. A command issued by a master is broadcast to the fabric on a command/address (cmd/addr) bus and enters the CAPP snoop pipeline on its rcmd bus. The snooped reflected command is decoded, and if it is not one supported by the CAPI, it proceeds no further down the pipeline. If the snooped reflected command is supported, has an address, and requires a CAPP directory lookup, arbitration for read access to the directory occurs in the next pipeline phase. Master FSMs, snoop FSMs, and snooped reflected commands arbitrate for read access to the directory (the arb block shown in Figure 2). Having won arbitration, the snooped reflected command reads the directory, and the result may be a cache hit or miss. The address is also compared to the addresses held by master and snoop FSMs to see if any are already performing an action on the address. Depending on the outcome, the snoop control logic determines the next action the hardware will take. This may include dispatching to one of the 16 snoop FSMs when, for example, the CAPI owns the line in a modified state and another master is requesting ownership of the line. In this case, the PSL must provide the line as described earlier. A snoop FSM may be required to change the CAPP directory state, in which case it must arbitrate for write access to the directory, as shown in the figure.

Generally, a snooped reflected command that proceeds to this point requires a partial response (presp) on the SMP fabric back to the fabric controller.
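The snoop-pipeline stages described above can be condensed into a toy model. The supported-command set, state names, and dispatch policy here are simplified stand-ins, not the actual POWER8 fabric command encodings.

```python
# Sketch of the CAPP snoop pipeline: decode the reflected command, read the
# directory, and decide whether to dispatch a snoop FSM.

SUPPORTED = {"read", "rwitm", "dclaim"}   # hypothetical supported command set

class CappSnoopPipeline:
    def __init__(self, directory):
        self.directory = directory        # address -> state; stands in for the CAPP directory

    def snoop(self, reflected_cmd, address):
        # Stage 1: decode; unsupported commands proceed no further.
        if reflected_cmd not in SUPPORTED:
            return "dropped"
        # Stage 2: directory read (arbitration against master/snoop FSMs
        # and the address cross-check are omitted in this sketch).
        state = self.directory.get(address, "invalid")
        if state == "invalid":
            return "miss"
        # Stage 3: a hit on a modified line that another master wants
        # dispatches one of the snoop FSMs; the PSL must supply the line.
        if state == "modified" and reflected_cmd == "rwitm":
            return "dispatch snoop FSM; PSL must provide the line"
        return "hit"

pipe = CappSnoopPipeline({0x80: "modified"})
print(pipe.snoop("fancy_op", 0x80))  # dropped
print(pipe.snoop("rwitm", 0x80))     # dispatch snoop FSM; PSL must provide the line
```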
A presp appropriate to the reflected command and the state of the cache line in the CAPP is formed by the presp logic and issued on the presp bus. The fabric controller combines all presps and returns a combined response (cresp) to all agents on the bus so they may see the final result of the operation and act accordingly.

The action may also include sending a message to the PSL that is descriptive of the snooped reflected command, the CAPP state, and any actions the CAPP took on behalf of the PSL. The PSL may then take further actions in response to the message, as in the line-push example, where data needs to be written back to memory. Messages to the PSL from both master and snoop FSMs are queued and packed into fabric data packets by the command/message transport block and pushed onto the fabric data_out bus to the PHB. The PHB performs a PCIe write to transmit the message packet to the PSL.

To master a command on the fabric cmd/addr bus, the PSL selects one of 32 master read FSMs or 32 master write FSMs, or a pair of FSMs in the case of compound operations, to master the command. It forms a command packet containing details of the operation for the FSM to perform. Multiple commands to multiple FSMs may be packed into a single command packet. The PSL issues a PCIe write packet to transmit the command packet to the PHB. The PHB decodes address bits in the packet to learn that it is a command packet to be pushed toward the CAPP on its fabric data_out bus. The packet arrives on the CAPP fabric data_in bus, is received and unpacked by the command/message transport logic, and is distributed to the appropriate master FSMs.

Upon receiving a command, a master machine sequences through steps that may include a CAPP directory look-up, cross-checking an address against snoop FSMs, issuing the command on the fabric cmd/addr bus, receiving and acting on a cresp, updating the directory state, and sending a message to the PSL. Consider the line-push example described previously. The line is held in the PSL and CAPP directories in the modified state. The PSL issues a command to CAPP master write FSM 0 to evict the line from the directory, i.e., move the line from the modified to the invalid state. Master write FSM 0 activates, arbitrates for the snoop pipeline, looks the line up in the CAPP directory, obtains the memory address of the line from the directory entry, and enters a line protection state in which any snoops that hit the line are retried, i.e., a retry response is issued on the presp bus. The master machine issues a "push" command and address on the cmd/addr bus and waits for the cresp. Assume a particular memory controller responds as owning the memory address of the line. The cresp contains information for routing the data to the memory controller [9]. Master FSM 0 sends this routing information to the PHB via the PHB sideband interface so that, when the data packet containing the modified cache line arrives from the PSL, the PHB may push the line on its data_out bus directly to that particular memory controller. Master FSM 0 also arbitrates to update the CAPP directory entry state to invalid, and finally sends a message to the PSL containing the requisite information so that the PSL may update its directory properly and push out the modified data.

Master read operations proceed similarly, but in the case of reads, data from a source (a memory controller or another cache in the system) is to be returned to the PSL. The CAPP master read FSM selected for the operation provides routing information so that the data may be returned directly from the source to the PHB and on to the PSL over the PCIe link.

The tlbi operations discussed previously are another form of reflected command that the CAPP snoops. A snooped tlbi generates a message to be sent to the PSL, and after performing the actions described previously, the PSL returns a response to the CAPP. The command/message transport logic sends tlbi responses to the tlbi snoop logic, where appropriate action is taken.

Reliability, availability, serviceability
POWER processors have a long-standing tradition of providing world-leading reliability, availability, and serviceability (RAS) [11]. The addition of an off-chip device that participates in cache coherence protocols and address translations must fulfill expectations with respect to that high standard, and the CAPI system incorporates a variety of measures to achieve this. Single-bit error correction and double-bit error detection error correction codes (ECC) are used on all memory arrays in the CAPP, the PHB, and the PSL. All temporal operations between the CAPP and the PSL, such as a directory state that temporarily protects an entry from other snoopers, are timed. This provides protection against errors on the FPGA that manifest themselves as the PSL ceasing communication with the CAPP. FSMs in the CAPP and the PSL use parity to protect against invalid-state errors. Configuration registers are parity protected. The most common errors, correctable errors on memory arrays, are handled (corrected) with minimal disruption of ongoing CAPI activity. For other, more severe errors, for example when a timer expires on a temporal operation because the PCIe link went down, the CAPI system is able to go off-line gracefully. The CAPP accomplishes this by severing the connection to the PSL, quiescing its various FSMs, walking its copy of the directory, and sending poison data to the address of any lines held in the various forms of modified state. ("Poison" data contains a special error-checking code, detectable by all data consumers in the system, that marks the data as unusable.) When all this is accomplished, the CAPP enters a quiescent state from which it is ready to be restarted once the error condition is cleared by appropriate system actions.
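The graceful off-lining sequence just described can be summarized in a short sketch. All names here are invented, and the sever/quiesce steps are reduced to comments; only the directory walk and poisoning of modified lines are modeled.

```python
# Toy sketch of CAPP graceful off-lining: walk the directory copy and poison
# any line held modified, so no consumer can use its now-unrecoverable data.

POISON = object()  # stands in for data carrying the special error-checking code

def go_offline(capp_directory, memory):
    """Sever, quiesce, and poison modified lines, then report the quiescent state."""
    # 1. Sever the connection to the PSL (modeled as a no-op here).
    # 2. Quiesce the various master and snoop FSMs (also a no-op here).
    # 3. Walk the CAPP's copy of the directory; modified lines are
    #    unrecoverable, so poison their addresses in memory.
    for address, state in capp_directory.items():
        if state == "modified":
            memory[address] = POISON
        capp_directory[address] = "invalid"
    return "quiescent"  # ready to be restarted once the error is cleared

mem = {0x40: "stale", 0x80: "clean"}
capp = {0x40: "modified", 0x80: "shared"}
print(go_offline(capp, mem))           # quiescent
print(mem[0x40] is POISON, mem[0x80])  # True clean
```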
Only when an error threatens to cause a data integrity problem is a more severe error signaled and the CAPI halted, as is the case with the other caches in the system.

User visible PSL interface
The interface provided to user-created AFUs is designed to isolate the complexities of cache coherence and address translation. User-designed accelerators access system memory through load and store requests to user-space effective addresses. AFUs can select between cacheable and write-through requests. Write-through requests are for data manipulated outside the coherence domain and provide for reduced PCIe bus overhead due to reduced message overhead. Coherent operations are typically utilized for control information, where multiple processes must communicate to make data transfer decisions, while write-through provides for large block transfers into and out of the coherence domain. Address translation is generally hidden from the accelerator, with the exception of page faults. In these cases, the AFU is notified of the fault, giving it the opportunity to reschedule operations to hide the latency of the fault.

Conclusion
Modern microprocessors contain inefficiencies when executing workloads that exhibit little instruction-level or data-level parallelism. Emerging big data workloads have exacerbated the problem. Direct hardware implementations of algorithms in FPGAs and ASICs can be far more efficient, but integrating them into an SMP leads to different inefficiencies, such as the software overhead required to share data with software threads running on CPUs in the SMP. The CAPI interface addresses these inefficiencies by providing a coherent, user-address-based interface to enable low-overhead integration of PCIe-based accelerators into the POWER8 ecosystem. This efficient combination of customized parallel accelerators and faster serial processors enables applications to target heterogeneous systems, which was previously not possible with I/O-based attachments. The CAPI interface achieves this while maintaining the high standards for reliability, availability, and serviceability of POWER systems.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of PCI-SIG or Sony Computer Entertainment Corporation in the United States, other countries, or both.

References
2. Stratix V Device Overview (SV51001), Altera, San Jose, CA, USA, Jan. 2014. [Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf
3. 7 Series FPGAs Overview (DS180 v1.15), Xilinx, San Jose, CA, USA, Feb. 2014. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
4. M. Adrian, "Big data," Teradata Magazine. [Online]. Available: http://www.teradatamagazine.com/v11n01/Features/Big-Data/
5. D. A. Ferrucci, "Introduction to 'This is Watson'," IBM J. Res. Dev., vol. 56, no. 3, pp. 1:1–1:15, May 2012.
6. H. P. Hofstee, G.-C. Chen, F. H. Gebara, K. Hall, J. Herring, D. A. Jamsek, J. Li, Y. Li, J. Shi, and P. W. Y. Wong, "Understanding system design for big data workloads," IBM J. Res. Dev., vol. 57, no. 3/4, pp. 3:1–3:10, May/Jul. 2013.
7. R. Polig, K. Atasu, L. Chiticariu, C. Hagleitner, H. P. Hofstee, F. R. Reiss, E. Sitaridi, and H. Zhu, "Giving text analytics a boost," IEEE Micro, vol. 34, no. 4, pp. 6–14, Jul./Aug. 2014.
8. PCI Express Base Specification, Revision 3.0, PCI-SIG, Beaverton, OR, USA, Nov. 2010. [Online]. Available: http://www.pcisig.com/specifications/pciexpress/base3/
9. W. J. Starke, J. Stuecheli, D. Daly, J. S. Dodson, F. Auernhammer, P. Sagmeister, G. Guthrie, C. F. Marino, M. Siegel, and B. Blaner, "The cache and memory subsystems of the IBM POWER8 processor," IBM J. Res. Dev., vol. 59, no. 1, Paper 3, pp. 3:1–3:13, 2015.
10. Power ISA Version 2.06 Revision B, IBM, Armonk, NY, USA, Jul. 23, 2010. [Online]. Available: https://www.power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
11. D. Henderson and J. Mitchell, "POWER7 System RAS: Key Aspects of Power Systems Reliability, Availability, Serviceability," IBM Syst. Technol. Group, Somers, NY, USA, Dec. 9, 2012. [Online]. Available: http://www-03.ibm.com/systems/power/hardware/whitepapers/ras7.html

Received March 24, 2014; accepted for publication April 17, 2014

Jeffrey Stuecheli IBM Systems and Technology Group, Austin, TX 78758 USA (jeffas@us.ibm.com). Dr. Stuecheli is a Senior Technical Staff Member in the Systems and Technology Group. He works in the area of server hardware architecture. His most recent work includes advanced memory architectures, cache coherence, and accelerator design. He has contributed to the development of numerous IBM products in the POWER* architecture family, most recently the POWER8 design. He has been appointed an IBM Master Inventor, authoring about 100 patents. He received B.S., M.S., and Ph.D. degrees in Electrical Engineering from The University of Texas at Austin.

Bart Blaner IBM Systems and Technology Group, Essex Junction, VT 05452 USA (blaner@us.ibm.com). Mr. Blaner earned a B.S.E.E. degree from Clarkson University. He is a Senior Technical Staff Member in the POWER development team of the Systems and Technology Group. He joined IBM in 1984 and has held a variety of design and leadership positions in processor and ASIC development. Recently, he has led accelerator designs for POWER7+* and POWER8 technologies, including the Coherent Accelerator Processor Proxy design. He is presently focused on the architecture and implementation of hardware acceleration technologies spanning a variety of applications for future POWER processors. He is an IBM Master Inventor, a Senior Member of the IEEE, and holds more than 30 patents.
communications, and graphics adapters for the IBM Personal
Computer. From 1988 until 2000, he was part of the Graphics
Organization and was responsible for the architecture and development
of entry and midrange 3D graphics adapters and GPUs (graphics
processing units). From 2000 to 2010, Mr. Johns was part of the
STI (Sony, Toshiba, IBM) Project responsible for the Cell Broadband
Engine Architecture** (CBEA) and participated in the development
of the Cell Broadband Engine** (the first implementation of the
CBEA). Currently Mr. Johns is working on hybrid computing solutions
for the POWER processors. He is directly responsible for the Coherent
Accelerator Interface Architecture (CAIA) and Chief Engineer of
FPGA acceleration using the Coherent Accelerator Processor Interface
(CAPI). Mr. Johns is an IBM Master Inventor with over 100 patents.