CAPI: Coherent Accelerator Processor Interface

J. Stuecheli
B. Blaner
C. R. Johns
M. S. Siegel

©Copyright 2015 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied by any means or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
IBM J. RES. & DEV. VOL. 59 NO. 1 PAPER 7 JANUARY/FEBRUARY 2015 J. STUECHELI ET AL. 7:1
The emergence of "Big Data" problem spaces, which require analysis of exabytes (10^18 bytes) of data, has further forced a rethinking of computational system design [4]. These data-centric problems differ from traditional problems in several aspects. First, the volume of data is much greater, potentially thousands of times greater than traditional workloads. Beyond the size of the data, inputs are unstructured, as the vast amounts of data to be mined are inherently disorganized [5, 6]. These features drive the need to scan and restructure large volumes of data, which are fed into more manageable data structures for further processing.

If the processing of this large amount of data is visualized as a computational tree, many parallel tasks may process subsets, or "leafs," of the full dataset at once. As the leafs are processed, intermediate results are formed, passed up the tree, and used in the next iterative step in processing the data until the root node, or final result, of this computational tree is reached. The combination of computation on vast unstructured data at the leaf operations and the serially structured final result generation motivates the use of heterogeneous computation systems. The initial stages are inherently parallel, as the massive volume of data is scanned. These raw data typically consist of irregular data types and are poorly suited to traditional processor register types. In contrast to the initial, leaf-processing stages, once the data has been filtered and formed into structured data, parallelism is greatly reduced [7].

To address these inefficiencies and the needs of emerging big data workloads, the POWER8* platform introduces the Coherent Accelerator Processor Interface (CAPI). This new interface, described in more detail herein, provides the capability for off-chip accelerators to be plugged into PCIe** (Peripheral Component Interconnect Express**) [8] slots and participate in the system memory coherence protocol as peers of other caches in the system. Additionally, CAPI enables the use of effective addresses to reference data structures in the same manner as applications running on the cores. These PCIe-card-based accelerators can be implemented in FPGAs for development flexibility or hardened in ASIC chips, depending on user requirements.

The PCIe attachment point provides for simple integration of a range of easily developed PCIe-based designs; however, the requirements of a scalable and robust attachment point introduce several challenges. The PCIe protocol does not follow the highly optimized coherent and resilient protocol used among the POWER8 modules in the system. This protocol mismatch motivated the creation of a proxy unit, resident on the POWER8 chip, to isolate the two divergent protocols, enable coherent traffic, and provide failure isolation. These are mandatory requirements for an attached accelerator to act as a peer of other CPUs in the system.

CAPI system description
A block diagram of the CAPI hardware is shown in Figure 1. Each POWER8 processor chip contains a symmetric multi-processor (SMP) bus interconnection fabric which enables the various units to communicate and coherently share system memory. These units are twelve general-purpose cores, two memory controller (MC) blocks, and units to bridge multiple chips in an SMP system. On the POWER8 processor chip, the PCIe Host Bridge (PHB) provides connectivity to PCIe Gen3 I/O links. The Coherent Accelerator Processor Proxy (CAPP) unit, in conjunction with the PHB, acts as the memory coherence, data transfer, interrupt, and address translation agent on the SMP interconnect fabric [9] for PCIe-attached accelerators. These accelerators comprise a POWER Service Layer (PSL) and Accelerator Function Units (AFUs) that reside in an FPGA or ASIC connected to the processor chip by the PCIe Gen3 link. Up to sixteen PCIe lanes per direction are supported. The combination of PSL, PCIe link, PHB, and CAPP provides AFUs with several capabilities. AFUs may operate on data in memory, coherently, as peers of other caches in the system. AFUs further use effective addresses to reference memory, with address translation provided by a memory management unit (MMU) in the PSL. The PSL may also generate interrupts on behalf of AFUs to signal AFU completion, or to signal a system service when a translation fault occurs.

Coherence
In order to provide coherent access to system memory, the CAPP and the PSL each contain a directory of the cache lines used by the AFUs. The CAPP snoops the fabric on behalf of the PSL, accesses its local directory, and responds to the fabric with the same latency as the other caches on the chip. In this way, the insertion of an off-chip coherent accelerator does not affect critical system performance parameters such as cache snoop latency. Snoops that hit in the CAPP directory may generate messages that are sent to the PSL by means of the PHB and PCIe link. The PSL may then respond to the message in a variety of ways depending on its contents.

The PSL may master operations on the SMP interconnect fabric using the combination of the PCIe link, the PHB, and master read and write finite state machines (FSMs) in the CAPP. For example, to store into a line on behalf of an AFU, the PSL must first have ownership of the line. The PSL first checks for the presence of the line in its cache directory. If the line is present (a directory hit) and in the modified state, the PSL allows the store from the AFU to proceed. However, if the access misses in the PSL directory, the PSL initiates a fabric master operation to gain ownership of the line, and may further request the cache line data. This is accomplished by sending a command to a CAPP master read FSM.
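The ownership check that gates an AFU store can be sketched as a toy software model. This is purely illustrative: the class names, states, and return values below are invented for this sketch and are not part of the CAPI hardware or any CAPI API.

```python
# Toy model of the PSL store path: directory hit in the modified state lets the
# store proceed; a miss first gains ownership through a CAPP master read FSM.

INVALID, MODIFIED = "invalid", "modified"

class CappMasterReadFSM:
    """Stands in for a CAPP master read FSM that gains line ownership on the fabric."""
    def acquire_ownership(self, address):
        # In hardware this issues a fabric command, waits for the combined
        # response, and may also return the cache line data.
        return MODIFIED  # assume ownership was granted

class PSL:
    def __init__(self):
        self.directory = {}                      # address -> coherence state
        self.capp_read_fsm = CappMasterReadFSM()

    def store(self, address, data):
        state = self.directory.get(address, INVALID)
        if state != MODIFIED:
            # Directory miss (or insufficient rights): master a fabric
            # operation through the CAPP to gain ownership of the line.
            self.directory[address] = self.capp_read_fsm.acquire_ownership(address)
        # Ownership held: the AFU store may now proceed.
        return f"stored {data!r} at {address:#x}"

psl = PSL()
print(psl.store(0x1000, "payload"))   # miss path: ownership acquired first
print(psl.store(0x1000, "payload2"))  # hit path: line already modified
```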
Figure 1. CAPI system block diagram.
The CAPP master FSM performs the access on the fabric, ultimately gains ownership of the line, and sends a message to the PSL that it has obtained such. If the data was also requested, it is returned directly by the source, which could be a memory controller or another cache in the system, to the PHB, where it is transferred across the PCIe link to the PSL and installed in its cache. The store from the AFU is then allowed to complete.

To push a line from the PSL cache to memory, which may occur, for example, when a line owned by the PSL needs to be evicted to make space for another line in the cache, the PSL issues a write command to a CAPP master write FSM. The PSL also pushes the modified data to the PHB for write-back to memory, and updates the state for the line in its directory to indicate that it no longer owns the line. The master write FSM obtains routing information for the destination of the write data and passes it to the PHB via sideband signals. The PHB then pushes the data onto the fabric to the destination. Additionally, the master write FSM updates the CAPP directory to reflect that the line is now invalid.

In the previous examples, the combination of evicting a line to make room for a new line and reading the new line, with or without intent to modify it, was illustrated as separate operations. This common combination between the PSL and the CAPP is optimized by providing a single compound operation that both evicts a directory entry, possibly with a data push to memory, and loads a new entry into the CAPP directory, possibly with read data provided back to the PSL. A compound command concurrently activates both write and read FSMs in the CAPP to perform the operation. This saves two crossings of the PCIe link compared to the discrete operations.

The PSL is further provisioned with the capability to master reads and writes on the fabric to copy lines to outside of the coherence domain, as would be the case for an I/O device operating with a checkout model of memory. This provision allows AFUs that have no need to maintain coherent copies of lines to bypass the PSL and CAPP caches entirely.

Address translation
To enable AFUs to reference memory with effective addresses, as would an application running on a core, the PSL contains an MMU comprising table-walk machines, which perform address translations, and caches of recent translations, which frequently avoid table walks. The table-walk machines use the mechanisms described above to read and update tables in memory during the translation process.

Since the PSL contains a translation cache, it must participate in translation invalidation (tlbi) [10] operations on the fabric. The CAPP snoops tlbi operations on behalf of the PSL and sends them in messages to the PSL, either one at a time or bundled into groups. The PSL looks up the address presented by the tlbi in its caches. If the address misses, it responds immediately back to the CAPP tlbi snooper that the operation is complete.
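The PSL side of the tlbi handling described above can be sketched as follows. The class, method, and response strings are invented for illustration; in particular, the drain of in-flight storage operations on a hit is reduced here to simply dropping the cache entry.

```python
# Minimal sketch of PSL tlbi handling: a translation-cache miss completes
# immediately; a hit must first drain operations tied to that translation.

class PslTranslationCache:
    def __init__(self, entries):
        self.entries = set(entries)   # cached translated page addresses

    def handle_tlbi(self, address):
        """Return the response the PSL sends back to the CAPP tlbi snooper."""
        if address not in self.entries:
            # Miss: respond immediately that the operation is complete.
            return "complete"
        # Hit: ensure storage operations associated with this translation
        # cache entry finish first (modeled by discarding the entry),
        # then send the completion message.
        self.entries.discard(address)
        return "complete after drain"

psl_tc = PslTranslationCache({0x7F000})
print(psl_tc.handle_tlbi(0x12000))  # complete
print(psl_tc.handle_tlbi(0x7F000))  # complete after drain
```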
Figure 2. CAPP unit components.
If the tlbi hits, the PSL follows a protocol to ensure that all storage operations associated with that translation cache entry are completed before sending a completion message to the CAPP tlbi snooper.

Address translations may generate faults requiring notification of system software to repair the fault. For this and other needs, the PSL provides a means to signal interrupts to software. This is accomplished by using the message signaled interrupt (MSI) mechanism provided by the PHB [8]. The PSL sends a command to the PHB using a particular address and data value indicative of the particular interrupt being asserted. The PHB responds as it would to an MSI from any I/O device; the details may be found in [8].

CAPP hardware description
This section considers the hardware structures internal to the CAPP that are required to enable attached accelerators to participate in the distributed cache coherence protocol provided by the SMP interconnect fabric as peers of the other caches in the system. The CAPP structures and machines parallel those of the L2 cache directory described in [9], while the data portion of the cache is maintained by the PSL.

Figure 2 shows the CAPP hardware in greater detail. The CAPP is divided into three areas: machines and transport, snoop pipeline, and SMP interconnect fabric interface. The SMP interconnect fabric interface provides snooper, master, and data interfaces to the fabric. The snooper interface comprises the reflected command (rcmd) bus and partial response (presp) buses. A command issued by a master is broadcast to the fabric on a command/address (cmd/addr) bus and enters the CAPP snoop pipeline on its rcmd bus. The snooped reflected command is decoded, and if it is not one supported by the CAPI, it proceeds no further down the pipeline. If the snooped reflected command is supported, has an address, and requires a CAPP directory lookup, arbitration for read access to the directory occurs in the next pipeline phase. Master FSMs, snoop FSMs, and snooped reflected commands arbitrate for read access to the directory (the arb block shown in Figure 2). Having won arbitration, the snooped reflected command reads the directory, and the result may be a cache hit or miss. The address is also compared to the addresses held by master and snoop FSMs to see if any are already performing an action on the address. Depending on the outcome, the snoop control logic determines the next action the hardware will take. This may include dispatching to one of the 16 snoop FSMs when, for example, the CAPI owns the line in a modified state and another master is requesting ownership of the line. In this case, the PSL must provide the line as described earlier. A snoop FSM may be required to change the CAPP directory state, in which case it must arbitrate for write access to the directory, as shown in the figure.

Generally, a snooped reflected command that proceeds to this point requires a partial response (presp) on the SMP fabric back to the fabric controller.
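The snoop-pipeline stages described above can be condensed into a toy model. The supported-command set, state names, and dispatch policy here are simplified stand-ins, not the actual POWER8 fabric command encodings.

```python
# Sketch of the CAPP snoop pipeline: decode the reflected command, read the
# directory, and decide whether to dispatch a snoop FSM.

SUPPORTED = {"read", "rwitm", "dclaim"}   # hypothetical supported command set

class CappSnoopPipeline:
    def __init__(self, directory):
        self.directory = directory        # address -> state; stands in for the CAPP directory

    def snoop(self, reflected_cmd, address):
        # Stage 1: decode; unsupported commands proceed no further.
        if reflected_cmd not in SUPPORTED:
            return "dropped"
        # Stage 2: directory read (arbitration against master/snoop FSMs
        # and the address cross-check are omitted in this sketch).
        state = self.directory.get(address, "invalid")
        if state == "invalid":
            return "miss"
        # Stage 3: a hit on a modified line that another master wants
        # dispatches one of the snoop FSMs; the PSL must supply the line.
        if state == "modified" and reflected_cmd == "rwitm":
            return "dispatch snoop FSM; PSL must provide the line"
        return "hit"

pipe = CappSnoopPipeline({0x80: "modified"})
print(pipe.snoop("fancy_op", 0x80))  # dropped
print(pipe.snoop("rwitm", 0x80))     # dispatch snoop FSM; PSL must provide the line
```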
A presp appropriate to the reflected command and the state of the cache line in the CAPP is formed by the presp logic and issued on the presp bus. The fabric controller combines all presps and returns a combined response (cresp) to all agents on the bus so they may see the final result of the operation and act accordingly.

The action may also include sending a message to the PSL that is descriptive of the snooped reflected command, the CAPP state, and any actions the CAPP took on behalf of the PSL. The PSL may then take further actions in response to the message, as in the line-push example, where data needs to be written back to memory. Messages to the PSL from both master and snoop FSMs are queued and packed into fabric data packets by the command/message transport block and pushed onto the fabric data_out bus to the PHB. The PHB performs a PCIe write to transmit the message packet to the PSL.

To master a command on the fabric cmd/addr bus, the PSL selects one of 32 master read FSMs or 32 master write FSMs, or a pair of FSMs in the case of compound operations, to master the command. It forms a command packet containing details of the operation for the FSM to perform. Multiple commands to multiple FSMs may be packed into a single command packet. The PSL issues a PCIe write packet to transmit the command packet to the PHB. The PHB decodes address bits in the packet to learn that it is a command packet to be pushed toward the CAPP on its fabric data_out bus. The packet arrives on the CAPP fabric data_in bus, is received and unpacked by the command/message transport logic, and is distributed to the appropriate master FSMs.

Upon receiving a command, a master machine sequences through steps that may include a CAPP directory look-up, cross-checking an address against snoop FSMs, issuing the command on the fabric cmd/addr bus, receiving and acting on a cresp, updating the directory state, and sending a message to the PSL. Consider the line-push example described previously. The line is held in the PSL and CAPP directories in the modified state. The PSL issues a command to CAPP master write FSM 0 to evict the line from the directory, i.e., move the line from the modified to the invalid state. Master write FSM 0 activates, arbitrates for the snoop pipeline, looks the line up in the CAPP directory, obtains the memory address of the line from the directory entry, and enters a line protection state in which any snoops that hit the line are retried, i.e., a retry response is issued on the presp bus. The master machine issues a "push" command and address on the cmd/addr bus and waits for the cresp. Assume a particular memory controller responds as owning the memory address of the line. The cresp contains information for routing the data to the memory controller [9]. Master FSM 0 sends this routing information to the PHB via the PHB sideband interface so that, when the data packet containing the modified cache line arrives from the PSL, the PHB may push the line on its data_out bus directly to that particular memory controller. Master FSM 0 also arbitrates to update the CAPP directory entry state to invalid, and finally sends a message to the PSL containing the requisite information so that the PSL may update its directory properly and push out the modified data.

Master read operations proceed similarly, but in the case of reads, data from a source (a memory controller or another cache in the system) is to be returned to the PSL. The CAPP master read FSM selected for the operation provides routing information so that the data may be returned directly from the source to the PHB and on to the PSL over the PCIe link.

The tlbi operations discussed previously are another form of reflected command that the CAPP snoops. A snooped tlbi generates a message to be sent to the PSL, and after performing the actions described previously, the PSL returns a response to the CAPP. The command/message transport logic sends tlbi responses to the tlbi snoop logic, where appropriate action is taken.

Reliability, availability, serviceability
POWER processors have a long-standing tradition of providing world-leading reliability, availability, and serviceability (RAS) [11]. The addition of an off-chip device that participates in cache coherence protocols and address translations must fulfill expectations with respect to that high standard, and the CAPI system incorporates a variety of measures to achieve this. Single-bit error correction and double-bit error detection error correction codes (ECC) are used on all memory arrays in the CAPP, the PHB, and the PSL. All temporal operations between the CAPP and the PSL, such as a directory state that temporarily protects an entry from other snoopers, are timed. This provides protection against errors on the FPGA that manifest themselves as the PSL ceasing communication with the CAPP. FSMs in the CAPP and the PSL use parity to protect against invalid-state errors. Configuration registers are parity protected. The most common errors, correctable errors on memory arrays, are handled (corrected) with minimal disruption of ongoing CAPI activity. For other, more severe errors, for example when a timer expires on a temporal operation because the PCIe link went down, the CAPI system is able to go off-line gracefully. The CAPP accomplishes this by severing the connection to the PSL, quiescing its various FSMs, walking its copy of the directory, and sending poison data to the address of any lines held in the various forms of modified state. ("Poison" data contains a special error-checking code, detectable by all data consumers in the system, that marks the data as unusable.) When all this is accomplished, the CAPP enters a quiescent state from which it is ready to be restarted once the error condition is cleared by appropriate system actions.
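The graceful off-lining sequence just described can be summarized in a short sketch. All names here are invented, and the sever/quiesce steps are reduced to comments; only the directory walk and poisoning of modified lines are modeled.

```python
# Toy sketch of CAPP graceful off-lining: walk the directory copy and poison
# any line held modified, so no consumer can use its now-unrecoverable data.

POISON = object()  # stands in for data carrying the special error-checking code

def go_offline(capp_directory, memory):
    """Sever, quiesce, and poison modified lines, then report the quiescent state."""
    # 1. Sever the connection to the PSL (modeled as a no-op here).
    # 2. Quiesce the various master and snoop FSMs (also a no-op here).
    # 3. Walk the CAPP's copy of the directory; modified lines are
    #    unrecoverable, so poison their addresses in memory.
    for address, state in capp_directory.items():
        if state == "modified":
            memory[address] = POISON
        capp_directory[address] = "invalid"
    return "quiescent"  # ready to be restarted once the error is cleared

mem = {0x40: "stale", 0x80: "clean"}
capp = {0x40: "modified", 0x80: "shared"}
print(go_offline(capp, mem))           # quiescent
print(mem[0x40] is POISON, mem[0x80])  # True clean
```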
Only when an error threatens to cause a data integrity problem is a more severe error signaled and the CAPI halted, as is the case with the other caches in the system.

User visible PSL interface
The interface provided to user-created AFUs is designed to isolate the complexities of cache coherence and address translation. User-designed accelerators access system memory through load and store requests to user-space effective addresses. AFUs can select between cacheable and write-through requests. Write-through requests are for data manipulated outside the coherence domain and provide for reduced PCIe bus overhead due to reduced message overhead. Coherent operations are typically utilized for control information, where multiple processes must communicate to make data transfer decisions, while write-through provides for large block transfers into and out of the coherence domain. Address translation is generally hidden from the accelerator, with the exception of page faults. In these cases, the AFU is notified of the fault, giving it the opportunity to reschedule operations to hide the latency of the fault.

Conclusion
Modern microprocessors contain inefficiencies when executing workloads that exhibit little instruction-level or data-level parallelism. Emerging big data workloads have exacerbated the problem. Direct hardware implementations of algorithms in FPGAs and ASICs can be far more efficient, but integrating them into an SMP leads to different inefficiencies, such as the software overhead required to share data with software threads running on CPUs in the SMP. The CAPI interface addresses these inefficiencies by providing a coherent, user-address-based interface to enable low-overhead integration of PCIe-based accelerators into the POWER8 ecosystem. This efficient combination of customized parallel accelerators and faster serial processors enables applications to target heterogeneous systems, which was previously not possible with I/O-based attachments. The CAPI interface achieves this while maintaining the high standards for reliability, availability, and serviceability of POWER systems.

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.

**Trademark, service mark, or registered trademark of PCI-SIG or Sony Computer Entertainment Corporation in the United States, other countries, or both.

References
2. Stratix V Device Overview (SV51001), Altera, San Jose, CA, USA, Jan. 2014. [Online]. Available: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf
3. 7 Series FPGAs Overview (DS180 v1.15), Xilinx, San Jose, CA, USA, Feb. 2014. [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds180_7Series_Overview.pdf
4. M. Adrian, "Big data," Teradata Magazine. [Online]. Available: http://www.teradatamagazine.com/v11n01/Features/Big-Data/
5. D. A. Ferrucci, "Introduction to 'This is Watson'," IBM J. Res. Dev., vol. 56, no. 3, pp. 1:1–1:15, May 2012.
6. H. P. Hofstee, G.-C. Chen, F. H. Gebara, K. Hall, J. Herring, D. A. Jamsek, J. Li, Y. Li, J. Shi, and P. W. Y. Wong, "Understanding system design for big data workloads," IBM J. Res. Dev., vol. 57, no. 3/4, pp. 3:1–3:10, May/Jul. 2013.
7. R. Polig, K. Atasu, L. Chiticariu, C. Hagleitner, H. P. Hofstee, F. R. Reiss, E. Sitaridi, and H. Zhu, "Giving text analytics a boost," IEEE Micro, vol. 34, no. 4, pp. 6–14, Jul./Aug. 2014.
8. PCI Express Base Specification, Revision 3.0, PCI-SIG, Beaverton, OR, USA, Nov. 2010. [Online]. Available: http://www.pcisig.com/specifications/pciexpress/base3/
9. W. J. Starke, J. Stuecheli, D. Daly, J. S. Dodson, F. Auernhammer, P. Sagmeister, G. Guthrie, C. F. Marino, M. Siegel, and B. Blaner, "The cache and memory subsystems of the IBM POWER8 processor," IBM J. Res. Dev., vol. 59, no. 1, Paper 3, pp. 3:1–3:13, 2015.
10. Power ISA Version 2.06 Revision B, IBM, Armonk, NY, USA, Jul. 23, 2010. [Online]. Available: https://www.power.org/wp-content/uploads/2012/07/PowerISA_V2.06B_V2_PUBLIC.pdf
11. D. Henderson and J. Mitchell, "POWER7 System RAS: Key Aspects of Power Systems Reliability, Availability, Serviceability," IBM Syst. Technol. Group, Somers, NY, USA, Dec. 9, 2012. [Online]. Available: http://www-03.ibm.com/systems/power/hardware/whitepapers/ras7.html

Received March 24, 2014; accepted for publication April 17, 2014

Jeffrey Stuecheli IBM Systems and Technology Group, Austin, TX 78758 USA (jeffas@us.ibm.com). Dr. Stuecheli is a Senior Technical Staff Member in the Systems and Technology Group. He works in the area of server hardware architecture. His most recent work includes advanced memory architectures, cache coherence, and accelerator design. He has contributed to the development of numerous IBM products in the POWER* architecture family, most recently the POWER8 design. He has been appointed an IBM Master Inventor, authoring about 100 patents. He received B.S., M.S., and Ph.D. degrees in Electrical Engineering from The University of Texas at Austin.

Bart Blaner IBM Systems and Technology Group, Essex Junction, VT 05452 USA (blaner@us.ibm.com). Mr. Blaner earned a B.S.E.E. degree from Clarkson University. He is a Senior Technical Staff Member in the POWER development team of the Systems and Technology Group. He joined IBM in 1984 and has held a variety of design and leadership positions in processor and ASIC development. Recently, he has led accelerator designs for POWER7+* and POWER8 technologies, including the Coherent Accelerator Processor Proxy design. He is presently focused on the architecture and implementation of hardware acceleration technologies spanning a variety of applications for future POWER processors. He is an IBM Master Inventor, a Senior Member of the IEEE, and holds more than 30 patents.
communications, and graphics adapters for the IBM Personal
Computer. From 1988 until 2000, he was part of the Graphics
Organization and was responsible for the architecture and development
of entry and midrange 3D graphics adapters and GPUs (graphics
processing units). From 2000 to 2010, Mr. Johns was part of the
STI (Sony, Toshiba, IBM) Project responsible for the Cell Broadband
Engine Architecture** (CBEA) and participated in the development
of the Cell Broadband Engine** (the first implementation of the
CBEA). Currently Mr. Johns is working on hybrid computing solutions
for the POWER processors. He is directly responsible for the Coherent
Accelerator Interface Architecture (CAIA) and Chief Engineer of
FPGA acceleration using the Coherent Accelerator Processor Interface
(CAPI). Mr. Johns is an IBM Master Inventor with over 100 patents.