Energy-Efficient System Design Through Adaptive Voltage Scaling
Ben Keller
Borivoje Nikolic, Ed.
Krste Asanović, Ed.
Duncan Callaway, Ed.
December 1, 2019
Copyright © 2019, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Energy-Efficient System Design Through Adaptive Voltage Scaling

by

Benjamin Andrew Keller

A dissertation submitted in partial satisfaction of the
requirements for the degree of

Doctor of Philosophy

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Borivoje Nikolić
Professor Krste Asanović
Professor Duncan Callaway

Fall 2017
Energy-Efficient System Design Through Adaptive Voltage Scaling
Copyright 2017
by
Benjamin Andrew Keller
Abstract
be used for power management. The Hurricane-2 testchip features finer spatial partitioning
and more effective instrumentation for power management; simulation results show up to
13.3% energy savings for an algorithm exercising FG-AVS in time and 46.0% energy savings
for an algorithm using FG-AVS in space. Together, these testchip implementations show the
potential of FG-AVS to save energy in production SoCs.
To Helen.
Contents
Contents ii
List of Figures v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Energy Efficiency in Datacenters . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Energy Efficiency in Mobile Devices . . . . . . . . . . . . . . . . . . . 4
1.1.3 Energy Efficiency in IoT Devices . . . . . . . . . . . . . . . . . . . . 5
1.2 Fundamentals of Fine-Grained Adaptive Voltage Scaling . . . . . . . . . . . 5
1.2.1 Voltage Scaling Granularity . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Globally Asynchronous, Locally Synchronous Design . . . . . . . . . 7
1.2.3 Challenges of Fine-Grained Adaptive Voltage Scaling . . . . . . . . . 8
1.3 Technology Trends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Process Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.2 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.3 SoC Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
7 Conclusion 124
7.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Bibliography 127
List of Figures
1.1 Results of a model estimating total U.S. electricity consumption by 2030 [1]. The
frozen efficiency curve assumes no efficiency improvements past 2010, causing
electricity consumption to grow with the economy. Scenario 1 shows the effect
of implementing existing policies to improve energy efficiency. Scenario 2 shows
the further reductions possible by the full deployment of existing technological
efficiency advances. Scenario 3 shows the more aggressive reductions possible
when all possible semiconductor-enabled efficiency improvements are included. . 2
1.2 Estimated energy savings from current trends in US datacenter efficiency im-
provements [2]. Energy efficiency improvements are predicted to save a total of
620 billion kWh over a decade. . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Predictions of silicon power density in 2001 as first summarized by Shekhar Borkar of Intel [3] (© 2001 IEEE). As explained by Pat Gelsinger, then CTO of Intel, “If scaling continues at present pace, by 2005, high-speed processors would have power density of nuclear reactor, by 2010, a rocket nozzle, and by 2015, the surface of the sun.” . . . . . . . . . . . . . 4
1.4 An example showing the additional energy savings of fine-grained AVS in time
(b) compared to the coarse-grained AVS baseline (a). . . . . . . . . . . . . . . . 7
1.5 An example showing the additional energy savings of fine-grained AVS in space.
(a) shows the wasted energy resulting from a coarse-grained AVS implementation,
while (b) shows how energy can be saved with a fine-grained approach. . . . . . 8
3.7 The state machine used in the SS-SC controller and example waveforms of its operation [10] (© 2016 IEEE). If the current demand increases as the converter is toggling, the voltage may not rise above Vref, so an additional counter triggers further switching until the voltage increases. . . . . . . . . . . . . . 34
3.8 A block diagram of the adaptive clock generator [68] (© 2017 IEEE). . . . . . . 35
3.9 Timing diagrams demonstrating the effects of clock insertion delay under a chang-
ing supply voltage and adaptive clock. In this example, the adaptive clock gener-
ator is assumed to perfectly track the voltage-delay characteristics of the digital
logic. The numbers are arbitrary, representative delays. In each case, the absence of insertion delay results in logic arrival times that match the arrival of the next clock edge, while the presence of insertion delay causes a mismatch.
In (a), the voltage is decreasing, so insertion delay causes setup time violations
because the clock edges propagate through the clock tree more quickly than the
slowed logic propagation times. These violations require additional clock margin
to eliminate, increasing energy cost. In (b), the voltage is increasing, so insertion
delay causes the logic propagation to complete early, meeting timing but resulting
in wasted energy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.10 A sample SS-SC waveform showing the impact on achievable Vmin . The SS-SC
regulator operates in a 2:1 step-down mode from the fixed 1 V supply; it has a
100 mV ripple around the average voltage Vavg of 500 mV. In Process A, Vmin =
400 mV, but the SS-SC cannot achieve this output voltage because it is limited
to simple ratios of the input for high-efficiency conversion. In Process B, Vmin =
450 mV. The SS-SC regulator is operating at the lowest possible voltage without
violating Vmin , but its average voltage remains 50 mV higher than a non-rippling
regulating technique. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.11 An example floorplan showing four independent voltage areas supplied by SS-SC
converters and adaptive clock generators. Each generated output voltage need
only be distributed to the local voltage area, but blocks shaded green require a
fixed 1V supply and blocks shaded red require a fixed 1.8V supply. The SS-SC
unit cells surround the voltage area they supply to minimize IR drop, and the
SS-SC controller and adaptive clock generator are placed centrally to each voltage
area to help meet clock constraints. The central cutout allows routing of digital
connections between the domains. . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1 A typical feedback loop for SoC power management. Only a single varying voltage
domain is shown for simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 The basis for measuring power consumption in a system with SS-SC converters and a rippling supply voltage [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . 44
4.3 The Z-scale processor pipeline [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . 48
4.4 Sample power consumption plots showing the benefit of race-to-halt power man-
agement. In (a), the CPU voltage and frequency is reduced so the computation
completes more slowly. This reduces the total energy used by the CPU, but the
platform consumes a large amount of energy because it remains active for the
duration of the computation. In (b), the CPU operates in a faster, less energy-
efficient mode, allowing the platform and CPU to be put in a lower-power idle
state more quickly and reducing overall energy consumption. . . . . . . . . . . . 51
4.5 A plot showing the minimum-energy point achieved when accounting for both platform power and CPU power [98] (© 2014 IEEE). . . . . . . . . . . . . . . . 52
4.6 Applying fine-grained AVS to save energy. The coarse-grained AVS system (in
red) has reduced the voltage and frequency as much as possible while still meeting
the fixed deadline, but it cannot track rapid changes in workload. The fine-
grained AVS system (in blue) can reduce voltage and frequency during periods
of low activity, saving significant energy (shaded red) with minimal impact on
execution time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.1 A waveform illustrating the danger of naive synchronization of data words via
series flip-flops. In this example, the data word transitions from 00 to 11 near
the RX clock edge, resulting in metastability. Each bit of the word resolves in
the opposite direction, resulting in a cycle for which the output word is neither
00 nor 11. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2 The standard bisynchronous FIFO. . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 A single-stage synchronizing FIFO based on a carefully timed flip-flop [116]. . . 60
5.4 The even-odd synchronizer [117] (© 2010 IEEE). . . . . . . . . . . . . . . . . . 61
5.5 Local clock generators. The standard circuit (a) can be modified with an addi-
tional input (b) to enable pausible clocking. . . . . . . . . . . . . . . . . . . . . 62
5.6 A mutual exclusion (mutex) circuit. Any metastability in the SR latch is blocked
from the output by the metastability filter. . . . . . . . . . . . . . . . . . . . . . 63
5.7 A mutex used to enable safe asynchronous boundary crossing via pausible clocks [120]. 63
5.8 Waveforms showing the response of the pausible clock circuit to three different
data arrival times. The red region near the clock edge represents the period
during which it would be dangerous for the output Sync OK signal to transition.
If data arrives during the “OK” phase when r2 is low, then it is passed through
(a). If data arrives during the “delay” phase, then it is delayed until after the
next clock edge (b). If the data arrives at the boundary between these two phases,
the mutex circuit can go metastable, potentially delaying the next clock edge (c). 64
5.9 A pausible clock circuit with an additional gating input. . . . . . . . . . . . . . 65
5.10 A pausible asynchronous FIFO [121] (© 2002 IEEE). A fully asynchronous queue buffers data in the FIFO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.11 A pausible bisynchronous FIFO [120] (© 2015 IEEE). The lettered circles show the sequence of operations to move data through the FIFO. . . . . . . . . . . . 67
5.12 The key delays in the pausible clock circuit [120] (© 2015 IEEE). . . . . . . . . 68
5.13 Waveforms showing the clock period constraint in the pausible clock circuit [120] (© 2015 IEEE). If this constraint is violated, the next clock edge will be frequently delayed, slowing the average clock rate. . . . . . . . . . . . . . . . . . . . . . . 69
5.14 Waveforms illustrating the worst-case setup time for the combinational logic in the pausible interface [120] (© 2015 IEEE). . . . . . . . . . . . . . . . . . . . . 69
5.15 Waveforms showing the average latency through the pausible interface, as determined by taking the mean of the average latency during the transparent phase (a) and the average latency during the opaque phase (b) [120] (© 2015 IEEE). . 70
5.16 Waveforms illustrating the effect of insertion delay on the pausible clock cir-
cuit [120]. Insertion delay can misalign the phases of the circuit, resulting in
unsafe transmission of data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.17 The pausible synchronizer with an additional latch to guard against the effects of insertion delay [120] (© 2015 IEEE). . . . . . . . . . . . . . . . . . . . . . . . 71
5.18 Two layout options for the local clock generator and synchronizer logic [120] (© 2015 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.19 Simulated mutex delay for different input arrival times [120]. . . . . . . . . . . . 73
5.20 The effect of added tm on average clock period perturbation. . . . . . . . . . . . 74
5.21 Sample simulation result. TX and RX clock period are randomly varied over
time. The spikes in TX clock period are clock pauses triggered by simulated
metastability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.22 Simulated latency through the pausible FIFO with varying TX and RX clock periods [120] (© 2015 IEEE). BFSync is the standard bisynchronous FIFO. . . 75
6.12 A plot demonstrating the system conversion efficiency of the three SS-SC switching modes [10] (© 2016 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.13 Measurements of the adaptive clock generator frequency over different voltages and replica path delay settings [10] (© 2016 IEEE). . . . . . . . . . . . . . . . 86
6.14 Plots showing the measured energy, power, and frequency of the processor core in bypass mode and with the switching regulators enabled [10] (© 2016 IEEE). 86
6.15 A block diagram of the Raven-4 testchip [68] (© 2017 IEEE). . . . . . . . . . . 88
6.16 Annotated floorplan of the Raven-4 testchip [68] (© 2017 IEEE). . . . . . . . . 89
6.17 Annotated die micrograph of the Raven-4 testchip [135] (© 2016 IEEE). . . . . 90
6.18 The Raven-4 die wirebonded to the daughterboard. . . . . . . . . . . . . . . . . 91
6.19 Oscilloscope traces of the core voltage and clock during an SS-SC mode transition [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.20 A shmoo chart showing processor performance while executing a matrix-multiply
benchmark under a wide range of operating modes [68] (© 2017 IEEE). The
number in each box represents the energy efficiency of the application core as
measured in double-precision GFLOPS/W. . . . . . . . . . . . . . . . . . . . . . 93
6.21 Plots showing the effects of FBB on core frequency and energy [68] (© 2017 IEEE). 93
6.22 The effect of FBB on different benchmarks with a supply voltage of 0.6 V (bypass mode) [135] (© 2016 IEEE). The total energy consumed by each benchmark
has been normalized so the relative effects of body bias can be compared. The
minimum-energy point is highlighted for each benchmark. . . . . . . . . . . . . 94
6.23 Measured comparison of the two clock generators implemented in the Raven-4 testchip [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.24 Measurement results showing the relative difference in voltage-dependent frequency behavior of the four delay banks in the adaptive clock generator [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.25 Oscilloscope traces showing the rippling core supply voltage and the SS-SC toggle clock [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.26 Measurement results showing the correlation between SS-SC toggle frequency and measured core power [68] (© 2017 IEEE). . . . . . . . . . . . . . . . . . . . . . 96
6.27 Oscilloscope traces showing the core voltage as the frequency hopping algorithm is applied with two different frequency targets [68] (© 2017 IEEE). . . . . . . . 97
6.28 Plots showing the effect of voltage dithering on system conversion efficiency [68] (© 2017 IEEE). (a) compares the energy cost of dithering to the bypass mode
baseline (in blue), which represents 100% efficient regulation. The dithering op-
erating points linearly interpolate completion time and energy between the fixed
SS-SC modes. (b) shows the measured system conversion efficiencies under volt-
age dithering. The results in green show the benefit of re-tuning the replica timing
circuit in the adaptive clock generator after each SS-SC mode transition. . . . . 98
List of Tables
6.1 A summary of key results and details from the Raven-3 testchip [10]. . . . . . . 84
6.2 Comparison of the two Raven-4 processors [68]. . . . . . . . . . . . . . . . . . . 89
6.3 A summary of key results and details from the Raven-4 testchip [68]. . . . . . . 91
6.4 Measured system conversion efficiencies achieved by the Raven-4 system [68]. . . 92
6.5 The effects of AVS Algorithm A on runtime and energy, as well as the proportion
of runtime spent in the lower-voltage 1.8V 1/2 mode. Data from the fixed 1V
and 1.8V 1/2 modes are included for comparison. . . . . . . . . . . . . . . . . . 110
6.6 The effects of AVS Algorithm B on runtime and energy, as well as the proportion
of runtime spent in the lower-voltage 1.8V 1/2 mode. . . . . . . . . . . . . . . . 111
6.7 Parameters used in the energy model from [141] to calculate simulated energy
savings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.8 Energy savings for the Rocket voltage domain resulting from the implementation
of FG-AVS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.9 Energy savings for the Rocket voltage domain resulting from the implementation
of FG-AVS with multiple simulated voltage levels. . . . . . . . . . . . . . . . . . 120
6.10 Energy savings for each voltage algorithm resulting from FG-AVS as the appli-
cation core executes a matrix-multiply benchmark. . . . . . . . . . . . . . . . . 123
Acknowledgments
I owe a tremendous debt of gratitude to those who made this dissertation possible. While
all research stands on prior work, I think it’s fair to say that to an extraordinary extent, my
research would not have been feasible without all of those who contributed to the big ideas
and massive projects that made this work happen.
I would be remiss if I did not begin by thanking my advisors, Bora Nikolić and Krste
Asanović. The first aspects of this project began well before I arrived at Berkeley, and it is
their research vision and tireless persistence that allowed my research to progress. Bora has
never been afraid to think big, in terms of our research goals, our potential achievements,
and the extent of our broad project collaborations. Without his leadership, this project
would have surely foundered. Krste oversaw the development of an entire new architecture
research infrastructure over just a few years, all while taking pains to convince every computer
architect in earshot that tapeouts matter. The unique confluence of these two visions was
the foundation for my work. Thanks for all of this, and for the million things I haven’t space
to mention.
Mixed-signal tapeouts are fantastically challenging; I am extremely fortunate to have
worked with such a talented group of colleagues throughout my graduate career. Most im-
portantly for my own work, Alberto Puggelli, Ruzica Jevtic, and Hanh-Phuc Le invented
and implemented the voltage regulation that was the lynchpin of our chips and an absolute
requirement for my own research. I cannot overstate the extent to which my own experimen-
tation relies on their outstanding work. Jaehwa Kwak designed the clock generators that
were critical to the high efficiency of our systems; his layout prowess remains unmatched.
Brian Zimmer designed the custom memories that allowed our chips to operate at low volt-
ages. He also served as a tapeout guru, circuit expert, and indefatigable mentor throughout
his time at Berkeley. Yunsup Lee was responsible for many of the digital blocks that made up
our tapeouts, including the vector processors and the first power management unit, but his
most important contribution was his unhesitating commitment to dreaming big and achiev-
ing that vision. Martin Cochet designed the power monitoring hardware and temperature
sensors, and Milovan Blagojević designed the integrated body bias generator; these two visit-
ing students also provided invaluable tapeout help in understanding the STMicroelectronics
process and working with our French collaborators. Pi-Feng Chiu and Stevo Bailey were my
partners in crime through every tapeout. System integration is a thankless task, but their
tapeout expertise remains invaluable, and I am in their debt for the many, many hours they
devoted to this project. John Wright designed the serial links in the Hurricane tapeouts and
integrated them into the memory system. Jarno Salonaa made improvements to the inte-
grated voltage regulators in Hurricane-2. To these students, as well as Alon Amid, Keertana
Settaluri, Jessica Iwamoto, and everyone else who made it possible to teach sand to think,
thank you.
In addition to circuits, our tapeouts relied heavily on RISC-V, Chisel, and the Rocket
Chip Generator, all technologies for which I have Krste Asanović’s research group to thank.
Andrew Waterman and Yunsup Lee, along with Krste and Dave Patterson, invented and
promoted the RISC-V ISA, and contributed a tremendous amount of software infrastructure
that has proved invaluable to our efforts. The Chisel hardware description language made
it possible to achieve the design productivity that we did, and I’m grateful to Jonathan
Bachrach for conceiving and promoting the effort. The Rocket Chip Generator made it
possible for us to tape out real processors; thanks to Yunsup, Andrew, and Henry Cook
for remaining committed to open-source hardware generators. Palmer Dabbelt is a systems
wizard; he made many an impossible problem disappear without breaking a sweat. Colin
Schmidt made building and programming vector machines possible. Howie Mao has writ-
ten countless hardware blocks that just happened to do exactly what we needed. Stephen
Twigg spent many hours helping me understand the capabilities of Chisel. And my other
architecture groupmates, Eric Love, Martin Maas, David Biancolin, Jack Koenig, Adam Is-
raelevitz, Albert Magyar, Scott Beamer, Donggyu Kim, Albert Ou, Sagar Karandikar, and
Rimas Avizienis, have contributed in numerous ways to our research ecosystem, as well as
teaching me to reason about computer architecture. Thanks to the entire team, as well as
everyone who has contributed to the open-source RISC-V, Rocket, and Chisel ecosystems.
In addition to those with whom I worked directly, many students at the Berkeley Wireless
Research Center have made my research possible. My ignorance of analog design runs deep,
and so I’m indebted to Sameet Ramakrishnan, Nathan Narevsky, Lucas Calderin, and Greg
Lacaille for their patient explanations and for showing me around the lab. Thanks to all
of the students in Bora’s research group, including Angie Wang, Rachel Hochman, Antonio
Puglielli, Paul Rigge, Amy Whitcombe, Nick Sutardja, Mira Videnović-Mišić, Katerina Pa-
padopoulou, Milos Jorgovanovic, Sharon Xiao, Matt Weiner, Amanda Pratt, and Charles
Wu, who provided useful research feedback and copious grad school advice while generally
keeping things sane. Undergraduate students Gary Choi and Vighnesh Iyer respectively
contributed a power virus and FPGA expertise to the project. The BWRC is a great place
to work and learn, and it’s my colleagues that make it so.
Faculty and staff have greatly contributed to my Berkeley experience. Thanks to Vladimir
Stojanovic and Duncan Callaway for serving on my qualifying exam committee and provid-
ing crucial feedback on my research. In addition to overseeing the initial work on voltage
regulation, Elad Alon has never hesitated to answer questions and provide much-needed
guidance and perspective. Brian Richards has the unglamorous role of negotiating myriad
CAD tools, foundries, license agreements, and everything in between; his tireless work makes
it possible for us to do our research. James Dunn contributed many of the board designs so
we could test our chips. Many thanks to Ubirata Coelho and Kostadin Ilov for keeping the
servers running, and to Fred Burghardt for maintaining the lab. Thanks also to the front-
office staff at BWRC and the ASPIRE Lab. Olivia Nolan, Candy Corpus, Leslie Nishiyama,
Erin Hancock, Yessica Bravo, Roxana Infante, and Tami Chouteau kept the lights on and
handled so much behind the scenes so that we could do our work.
Many people outside of Berkeley also contributed to this research. I’m particularly in-
debted to Brucek Khailany and Matt Fojtlik, who served as my mentors at NVIDIA Re-
search. Their support and direction provided the foundation for Chapter 5 of this disser-
tation. Thanks also to Bill Dally, Yan Zhang, John Poulton, Tezaswi Raja, Stephen Tell,
and John Edmonson for their invaluable discussions and feedback at NVIDIA. I would also
like to thank Alon Naveh and Mark Reynolds of Intel, Tom Burd of AMD, Luke Shin and
Frankie Liu of Oracle, and Dave Ditzel of Esperanto for taking an early interest in my work
and providing valuable feedback as I continued it. My experiences and conversations with
industry members yielded valuable guidance and grounding for the project.
My research was funded by the National Science Foundation Graduate Research Fellow-
ship Program, the NVIDIA Graduate Fellowship, Intel Research, and GRC under Grant
SRC/TxACE 1836.136. Aspects of the project were also funded by DARPA PERFECT
Award Number HR0011-12-2-0016, AMD, and the Marie Curie FP7. Fabrication of all four
testchips was donated by STMicroelectronics; I’m particularly grateful to Andreia Cathelin
and Phillippe Flatresse at ST, whose tireless advocacy of our projects and our collaboration
made our chip development possible. Thanks to all of these organizations, as well as all of
the sponsors of BWRC and the ASPIRE Lab.
Finally, I would like to thank my family, without whom I certainly would not be where
I am today. My parents, to whom I am eternally grateful, supported my decision to pur-
sue twenty-four straight years of schooling without batting an eye. My brother has stuck
with me through it all, a dependable, insightful sounding board for all of my frustrations
and triumphs. And my wife Helen, who made it through problem sets, prelims, evening
meetings, paper deadlines, tapeouts, qualifying exams, all-nighters, conference travel, and
the inevitable existential uncertainty, somehow still wanted to marry me afterwards. Thank
you.
Chapter 1
Introduction
1.1 Motivation
Improving energy efficiency both motivates and enables considerable innovation in the de-
sign of semiconductor circuits. The cost savings enabled by Moore’s Law have led to the
production and deployment of a huge number of SoCs, each containing millions or billions
of transistors. The cloud services provided by datacenters have quickly become essential
to businesses and consumers, while embedded devices are ubiquitous in every sector of the
economy. The energy required to power these devices therefore has considerable impact on
the total energy consumption of the entire economy. As shown in Figure 1.1, one estimate
finds that a halt to energy efficiency improvements would result in a near-doubling of U.S.
[Figure 1.1 appears here: a plot of U.S. electricity consumption from 2006 to 2030, comparing the frozen-efficiency baseline against the efficiency-improvement scenarios.]
Figure 1.1: Results of a model estimating total U.S. electricity consumption by 2030 [1].
The frozen efficiency curve assumes no efficiency improvements past 2010, causing electric-
ity consumption to grow with the economy. Scenario 1 shows the effect of implementing
existing policies to improve energy efficiency. Scenario 2 shows the further reductions pos-
sible by the full deployment of existing technological efficiency advances. Scenario 3 shows
the more aggressive reductions possible when all possible semiconductor-enabled efficiency
improvements are included.
electricity demand from 2010 to 2030, while aggressive improvements to efficiency enabled by
semiconductor innovation and deployment will instead decrease overall demand. Efficiency
improvements therefore stand to dramatically reduce the need for additional electrical gen-
erating capacity and to mitigate the related impacts of greenhouse gas emissions on global
climate change.
Notwithstanding the global impacts, improvements in the energy efficiency of future SoCs are required for continued improvements to their performance and utility. Datacenters
are constrained in both cost and performance by the energy efficiency of the deployed silicon,
while battery-powered devices must limit their energy consumption to retain utility. The
following sections detail the impact of energy efficiency on each of these key use cases in
more detail.
[Figure 1.2 appears here: a plot of estimated annual energy savings from 2000 to 2020.]
Figure 1.2: Estimated energy savings from current trends in US datacenter efficiency im-
provements [2]. Energy efficiency improvements are predicted to save a total of 620 billion kWh over a decade.
of massive cloud providers, the need for energy-efficient compute is driven by increased
demand and competitive cost-cutting. In 2014, datacenters in the US consumed electricity
equivalent to 6.4 million American homes [2]. Improvements in energy efficiency are the
key to keeping up with the demand for these services while avoiding a massive increase in
electricity consumption that would be required by a fixed-efficiency scenario (see Figure 1.2).
Faster computation in datacenters is fundamentally limited by SoC energy efficiency because package thermal limits have blocked naive performance scaling for more than a decade. Twenty years ago, desktop computer performance could be rapidly increased by
taking advantage of Dennard scaling to increase clock rates and deepen processor pipelines.
In the early 2000s, these efforts hit a “power wall” that was famously described in a 2001
presentation [3]. As shown in Figure 1.3, SoC power densities were rapidly becoming un-
tenable; silicon chips would certainly melt and cease to function far before reaching power
densities equivalent to those of the surface of the sun. These thermal limits forced funda-
mental changes in SoC design that have persisted even as server chips have replaced desktops
as the main drivers of high-end processor development. While servers may be able to de-
ploy more sophisticated and exotic cooling technologies, the cost is often prohibitive, and
so improvements to processor energy efficiency remain the primary means of increasing the
performance of datacenter SoCs within their power budget.
In addition to demands for performance, datacenter designers have become increasingly
attuned to energy costs as a significant operating cost in themselves. As datacenters have
ballooned into their own industry, designers looking to reduce costs must consider the total
Figure 1.3: Predictions of silicon power density in 2001 as first summarized by Shekhar Borkar of Intel [3] (© 2001 IEEE). As explained by Pat Gelsinger, then CTO of Intel, “If
scaling continues at present pace, by 2005, high-speed processors would have power density
of nuclear reactor, by 2010, a rocket nozzle, and by 2015, the surface of the sun.”
cost of ownership (TCO) to remain competitive. The entire operating cost over the lifetime
of the datacenter, from the cost of land to the hiring of building security, must be consid-
ered together. While TCO estimates can vary considerably based on model assumptions,
several studies show that electricity and cooling costs can be 10-20% of typical TCO [4, 5].
Furthermore, new datacenter designs focused on energy efficiency can reduce TCO, even if
the hardware itself may be more expensive [6]. Both improved performance and overall cost
reduction demand more energy-efficient silicon for datacenters.
the battery of their device has lost charge. Surveys consistently show that long battery life
is one of the most desired phone features; one report indicates that 92% of those surveyed
say that battery life is important when considering a new smartphone purchase [7]. These
consumer preferences easily align with the goals of smartphone and tablet vendors to make
their devices more useful. Accordingly, SoC energy efficiency improvements remain critical
drivers of utility improvements for mobile form factors.
The switching power consumption $P_{dyn}$ of digital gates operating in saturation is given by the relationship

$$P_{dyn} = \alpha C V_{DD}^2 f \qquad (1.1)$$

where $\alpha$ is the proportion of gates switching each cycle, $C$ is the total switching capacitance, $V_{DD}$ is the supply voltage, and $f$ is the switching frequency. Therefore, the total dynamic energy $E_{dyn}$ per cycle, which is independent of the switching frequency, is given by

$$E_{dyn} = P_{dyn} \cdot 1/f = \alpha C V_{DD}^2 \qquad (1.2)$$

So a reduction in supply voltage corresponds to a quadratic reduction in dynamic energy.
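As a concrete illustration of Equation 1.2, the following sketch evaluates the quadratic savings numerically; the parameter values are arbitrary examples chosen for illustration, not measurements from this work.

```python
# Illustrative evaluation of Equation 1.2 (parameter values are
# arbitrary examples, not measured data from this work).
alpha = 0.1   # fraction of gates switching each cycle
C = 1e-9      # total switching capacitance (farads)

def dynamic_energy(vdd):
    """Dynamic energy per cycle: E_dyn = alpha * C * VDD^2 (joules)."""
    return alpha * C * vdd ** 2

# Scaling VDD from 1.0 V to 0.7 V cuts dynamic energy per cycle by
# (0.7 / 1.0)^2 = 0.49, i.e. a roughly 51% reduction.
for vdd in (1.0, 0.9, 0.8, 0.7):
    print(f"VDD = {vdd:.1f} V: {dynamic_energy(vdd) * 1e12:.0f} pJ/cycle")
```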
Static power consumption cannot be accurately predicted by supply voltage with a simple
analytical model, but more complex models and published data both yield a super-linear
relation; that is, a reduction in supply voltage results in a proportionally greater reduction
in leakage power. The energy savings of this effect are mitigated because the best achievable
operating frequency also decreases with lower supply voltage, so leakage energy must be
summed over a longer cycle time. The operating frequency of digital logic is proportional
to the on-current $I_{ON}$ of the devices used in the digital gates. While accurate models of the dependence of frequency on supply voltage can be quite complex, the relationship can
be approximated as linear, an approximation that fits well with measured data in modern
processes [9, 10]. This means that even though devices leak for longer, the super-linear
relationship of voltage to static power consumption implies that total leakage energy per
cycle will nonetheless decrease with voltage.
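The interplay of these effects can be seen in a toy model; the constants below are purely illustrative stand-ins (not fitted to any process), with frequency modeled as linear in voltage and leakage current as exponential in voltage.

```python
import math

# Toy model of total energy per cycle (all constants illustrative).
alpha, C = 0.1, 1e-9    # activity factor, switched capacitance (F)
k, vt = 2e9, 0.3        # linear frequency model: f = k * (V - Vt)
i0, v0 = 1e-4, 0.15     # super-linear leakage: I_leak = i0 * exp(V / v0)

def energy_per_cycle(v):
    f = k * (v - vt)                         # frequency ~ linear in V
    e_dyn = alpha * C * v ** 2               # quadratic dynamic energy
    e_leak = v * i0 * math.exp(v / v0) / f   # leakage power x cycle time
    return e_dyn + e_leak

# Despite the longer cycle time at low voltage, total energy per cycle
# falls as voltage is reduced (until leakage dominates near Vt).
for v in (1.0, 0.8, 0.6, 0.4):
    print(f"V = {v:.1f} V: {energy_per_cycle(v) * 1e12:.1f} pJ/cycle")
```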
To summarize these effects, when voltage is decreased, performance decreases linearly, but
energy consumption drops even more. This fundamental relationship has enabled decades
of voltage-scaling innovation [11]. Nearly all modern SoCs employ some form of voltage
scaling, in which reduced performance is traded for increased energy efficiency some or all
of the time. A common application of AVS is to reduce voltage to track workload demand;
when peak performance is not needed, supply voltage can be reduced to save energy. AVS
already represents a powerful tool to improve energy efficiency in many classes of devices.
Figure 1.4: An example showing the additional energy savings of fine-grained AVS in time
(b) compared to the coarse-grained AVS baseline (a).
A system with coarse temporal granularity wastes energy, with periods of operation during which the system is operating at a higher voltage than necessary to meet the performance requirement. A system with finer temporal granularity,
on the other hand, is able to rapidly adjust the voltage to track rapid workload changes,
spending very little time at a higher voltage than necessary. A control loop with bandwidth
that exceeds the frequency of workload changes can achieve the best energy savings.
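A back-of-the-envelope comparison, with made-up numbers purely for illustration, shows why control-loop bandwidth matters:

```python
# Toy comparison of temporal granularity (illustrative numbers).
# A workload alternates phases, spending half its time needing 1.0 V
# and half needing only 0.6 V; dynamic energy per cycle scales as V^2.
e = lambda v: v ** 2

coarse = e(1.0)                        # slow loop: pinned at worst case
fine = 0.5 * e(1.0) + 0.5 * e(0.6)     # fast loop: tracks each phase
print(f"fine-grained AVS saves {100 * (1 - fine / coarse):.0f}%")  # ~32%
```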
The second type of granularity is spatial granularity, which refers to the number and size of
voltage domains on a chip. Fine spatial granularity requires many small voltage areas, each of
which can adjust their voltage independently, while a system with coarse spatial granularity
has just one or a few independent voltage areas. The benefits of finer spatial granularity
are shown in Figure 1.5. The system with coarse spatial granularity has a single voltage
domain, and so it cannot reduce its operating voltage below the level required by its most demanding block, even though other portions of the system could meet their performance needs at a lower voltage and frequency. In contrast, the system with fine spatial
granularity can adjust the voltage of each domain independently to match each particular
performance target, saving energy without degrading performance. In addition, finer spatial
granularity permits better tracking of fine-grained temporal changes, because independent
voltage domains may exhibit time-varying behavior that would not be discernible if they
were grouped into a single domain. Spatial partitioning corresponding to the smallest spatial
variation in workload can achieve the best energy savings.
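A similar sketch, again with illustrative numbers rather than measured data, quantifies the spatial case:

```python
# Toy comparison of spatial granularity (illustrative numbers).
# Four equal-size domains each have a different minimum voltage that
# meets their performance target; energy per domain scales as V^2.
demands = [1.0, 0.7, 0.6, 0.5]   # per-domain required voltage (V)

coarse = len(demands) * max(demands) ** 2   # one shared rail at the max
fine = sum(v ** 2 for v in demands)         # independently scaled rails
print(f"fine spatial partitioning saves {100 * (1 - fine / coarse):.0f}%")
```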
Modern commercial systems typically implement only coarse-grained AVS in both time
and space. Typical SoCs have a handful of large voltage domains, with adaptive feedback
to change operating conditions acting at millisecond timescales. This dissertation intends to
demonstrate the benefits of fine-grained AVS to motivate broader adoption of such designs.
Figure 1.5: An example showing the additional energy savings of fine-grained AVS in space.
(a) shows the wasted energy resulting from a coarse-grained AVS implementation, while (b)
shows how energy can be saved with a fine-grained approach.
other. The clocks are therefore asynchronous to one another, an architecture known as Globally
Asynchronous, Locally Synchronous (GALS) [12]. GALS as originally proposed consists of
synchronous “islands” connected by a fully asynchronous network that is not clocked at all,
but it is now common to refer to designs that consist of many internally synchronous but
independent blocks as GALS designs as well [13]. GALS designs need not implement fine-
grained voltage scaling, but the GALS design style is well-suited to the system requirements
for FG-AVS.
1.3.2 Workloads
All processor workloads exhibit phases, portions of the program with different instruction
mixes and operating characteristics. FG-AVS is best suited to saving energy compared to
its coarse-grained alternative when workload phases vary rapidly and frequently, presenting
opportunities for fine-grained tracking to which a less aggressive system could not adapt.
In recent years, bursty workloads of this type have become much more common, driven by
the move to cloud services, mobile devices, and new IoT implementations [17–20]. Cloud
service providers often support workloads like search or database lookups, which may hap-
pen infrequently, triggered by user request. Similarly, mobile devices such as tablets and
phones are frequently idle, even when actively being used; durations of relative inactivity
are punctuated by key presses or other external stimuli that demand a fast response. IoT
devices are rarely triggered directly by users, instead relying on stimulation from the en-
vironment to actuate computation. The sensor inputs that cause execution are frequently
bursty as well, activating upon intermittent and unpredictable changes in the environment,
while devices that activate at regular intervals tend to operate for very short durations. In
all of these computing environments, frequently varying workloads are ideally suited for the
energy-saving benefits of FG-AVS.
Chapter 2

Challenges of Adaptive Voltage Scaling
Fine-grained adaptive voltage scaling is a powerful technique to save energy in SoCs, but
commercial adoption has been limited because of the significant challenges and overheads
that occur when implementing a system capable of FG-AVS. Generating many independent
voltages, quickly switching between different voltage modes, and supplying these voltages to
the independent voltage domains of the FG-AVS system requires integrated voltage regula-
tors, circuits that are difficult to design with high efficiency. Implementing the GALS design
style required for FG-AVS poses challenges in clock generation for each voltage domain and
data movement between asynchronous domains. Addressing these issues to build a robust
design imposes area overheads and increases design complexity. This chapter details these
challenges with consideration of prior work in each of these areas and discussion of various
design alternatives that could allow effective implementation of FG-AVS.
operating modes.
Most commercial SoCs employ discrete voltage regulators on separate dice adjacent to
the supplied SoC. These discrete regulators are generally very efficient, achieving conversion
efficiencies of 90% or greater, and can often generate many different output voltages. How-
ever, discrete regulators are fundamentally unsuited to FG-AVS. The use of large passives
to achieve high efficiencies and reduce output noise results in slow transition times between
voltage modes due to the large time constants of these devices, making fine temporal granu-
larities impossible to achieve. Furthermore, most SoC designs are IO limited, and have only
a fixed number of IOs available for power supplies. This limits the spatial granularity of
AVS when using discrete regulation, because only a small number of independent supplies
can be connected from external regulators to the voltage domains in the SoC. Finally, the
proliferation of discrete regulators to generate many independent voltages would increase
system costs, which could potentially force other design compromises that would reduce the
overall benefit of FG-AVS.
Integrated voltage regulation is therefore a de facto requirement for FG-AVS systems.
By building regulators into the same die as the SoC they are supplying, system costs are
reduced, transition times are shortened due to the smaller passives, and many voltages can
be generated from just a few external supplies, ameliorating any issues with IO count. The
remainder of this section details various alternatives for integrated regulators and implemen-
tation considerations in systems that employ them.
[Figure 2.1 appears here: a linear regulator schematic with input Vin, reference Vref feeding an error amplifier, output Vout, feedback resistors R1 and R2, and load Rload.]
Buck Converters
Switching regulators are able to achieve higher conversion efficiencies than linear regulators
because power-delivery transistors are switched on and off instead of constantly incurring resistive losses in a pass device.
Figure 2.2: The two operating phases of a buck converter circuit [31]. The arrows denote
the direction of current flow.
Switched-Capacitor Regulators
Switched-capacitor regulators are an alternative to buck converters that use capacitors, not
inductors, as the primary energy-storing element in the design. Since the quality of inte-
grated capacitors is intrinsically better than that of inductors using the same area resources,
switched-capacitor designs are able to achieve higher efficiencies than their fully integrated
buck converter counterparts [35]. A schematic of a simple two-phase switched-capacitor cir-
cuit is shown in Figure 2.3. By toggling the transistors in two phases as marked in the figure,
charge is transferred onto the flying capacitor and then to the output at a stepped-down ratio
from the input voltage.
Switched-capacitor regulators achieve best efficiency when they downconvert the input
voltage at a fixed ratio determined by their topology [36]. Accordingly, implementations
typically generate only a small number of discrete output voltages, though recent designs
can produce many output levels at the cost of increased control complexity [37, 38]. The
design of switched-capacitor regulators also involves a tradeoff between energy efficiency
and power density, a measure of power delivered for a given capacitive area [39]. Since
[Figure 2.3 appears here: a two-phase switched-capacitor circuit in which the flying capacitor Cfly is charged from Vin in phase ɸ1 and discharged into the load Rload at Vout in phase ɸ2.]
lower power densities lead to large area overheads, integrated implementations of switched-
capacitor regulators often have low conversion efficiencies (although better than those of LDO
implementations) [40–43]. These challenges have prevented adoption of switched-capacitor
regulators in commercial SoCs.
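The fixed-ratio constraint can be made concrete with the standard first-order efficiency bound for a switched-capacitor converter: delivering an output below the ideal ratio voltage dissipates the difference as charge-sharing loss. The sketch below applies this bound to an illustrative 2:1 step-down from 1.8 V; it is a first-order model, not a description of any specific design.

```python
# First-order efficiency bound for a fixed-ratio switched-capacitor
# converter: eta <= Vout / (Vin / n) for an n:1 step-down topology,
# since output voltage below the ideal ratio is lost to charge sharing.
def sc_efficiency_bound(vin, vout, n):
    ideal = vin / n
    assert vout <= ideal, "output cannot exceed the ideal ratio voltage"
    return vout / ideal

vin = 1.8
for vout in (0.9, 0.8, 0.7, 0.6):
    eta = sc_efficiency_bound(vin, vout, 2)
    print(f"Vout = {vout:.1f} V from a {vin} V 2:1 mode: eta <= {eta:.0%}")
```

This is why fixed-ratio designs are operated at voltages near their ideal ratios, with techniques such as dithering used to reach intermediate operating points.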
supplies and distribute them throughout the die. Each voltage area connects to each dis-
tributed supply by power header transistors, and a controller turns on one header at a time
to set the supply voltage. Switching a domain between voltage modes is then achieved by
toggling the relevant headers to that domain.
Multi-rail switching avoids entirely the complexities and tradeoffs of integrated regula-
tion. Although there may be a small amount of resistive loss through the header transistors,
there is otherwise no efficiency penalty as would be the case for regulated supplies. Further-
more, switching between different voltage modes can take place very quickly, and because
large passive elements are not required for regulation, the area overhead is relatively small.
These characteristics make multi-rail switching a design technique well-suited to enable FG-
AVS.
This approach, however, has its own trade-offs. First, the nature of multi-rail switching
allows for only a small number of discrete voltages to be supplied (most implementations
use only two voltages), limiting the range of voltage scaling. While there are no explicit area
overheads for large passive elements, the routing resources required to distribute multiple
supplies throughout the chip can complicate the design of both the package and the power
grid, potentially leaving less metal available for other purposes and impacting area budgets.
Other complications arise from the switching events themselves, when a voltage domain is
transitioned from one rail to another. If this switching is done too quickly, short-circuit
current can temporarily flow between the two rails, causing wasted power consumption and
supply integrity problems. The sudden addition or subtraction of current load from a rail as
an entire voltage domain is connected or disconnected can result in very large di/dt droop.
Mitigating this droop can require large amounts of on-die decoupling capacitance, adding
area overhead.
Several prior research efforts have implemented multi-rail switching, using different tech-
niques to address these concerns. The authors of [44] implement dual-rail supplies that
power 167 independent voltage domains, with a small processor core in each domain. They
avoid di/dt events by halting the core during mode transitions, improving supply integrity
but increasing transition costs. The authors of [45, 46] propose a “shortstop” technique
that mitigates supply noise by making use of additional dirty supply rails used only during
voltage transitions. This technique is effective if only a single domain is switching at a time,
but the additional routing resources required for more rails may be prohibitive. The authors
of [47] use an integrated linear regulator to slowly change the voltage level during transitions,
avoiding short-circuit current and di/dt issues but greatly slowing the transition time. None
of these approaches fully addresses the drawbacks of multi-rail switching, which may be why
the technique has not yet seen commercial adoption.
Figure 2.4: The energy cost of voltage dithering. Dithering at various duty cycles results in a linear interpolation between two points on the energy-frequency curve, resulting in marginally higher energy consumption than a regulator with arbitrary output voltage levels [48] (© 2006 IEEE).
possible. A technique known as voltage dithering can be used to overcome this issue. Dither-
ing involves rapidly switching between two discrete voltage levels, with the duty cycle of this
toggling determining an intermediate “virtual” voltage and frequency midway between the
two actual operating conditions. By varying this duty cycle, voltage dithering provides a
simulacrum of continuous voltage scaling. As shown in Figure 2.4, dithering linearly inter-
polates the virtual voltage and frequency between the two discrete points, enabling effective
energy consumption only marginally higher than the case of continuous levels of regulation.
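As a worked example (with illustrative mode frequencies and relative energies, not measured values), the duty cycle needed to synthesize a virtual operating point can be computed directly:

```python
# Voltage dithering as linear interpolation (illustrative numbers).
f_lo, f_hi = 0.5, 1.5   # clock frequency in each discrete mode (GHz)
e_lo, e_hi = 0.3, 1.0   # relative energy per cycle in each mode

def dither(d):
    """Spend time fraction d in the high mode; return (f_avg, e_avg)."""
    f_avg = d * f_hi + (1 - d) * f_lo
    # cycles executed in each mode are weighted by that mode's frequency
    e_avg = (d * f_hi * e_hi + (1 - d) * f_lo * e_lo) / f_avg
    return f_avg, e_avg

d = (1.0 - f_lo) / (f_hi - f_lo)   # duty cycle for a virtual 1.0 GHz
f_avg, e_avg = dither(d)
print(f"duty cycle {d:.2f}: f_avg = {f_avg:.2f} GHz, "
      f"e_avg = {e_avg:.2f} (vs {e_hi} in the high mode alone)")
```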
The disadvantages of voltage dithering are threefold. First, there is some efficiency loss
compared to the regulated case; in situations in which the two discrete voltage levels are far
apart, the effective conversion efficiency can be substantially lower than efficiency at the dis-
crete operating points. Second, switching between different voltage modes incurs transition
costs, and frequent switching increases these costs substantially. Finally, rapid dithering may
simulate operation at an intermediate voltage and frequency, but prolonged operation at the
discrete modes can nonetheless cause rate-balancing mismatches that degrade performance.
For example, consider the system shown in Figure 2.5a, in which a voltage-scaled block is
attempting to meet a clock frequency target f to rate-balance with its neighboring supplier
and consumer blocks operating at this rate. The only voltages available, however, correspond
to operating frequencies of 0.5f or 1.5f , so the voltage is dithered with a 50% duty cycle
to achieve a virtual frequency of f . If the dithering can take place at a high frequency as
shown in Figure 2.5c, then the system correctly behaves as if it were operating at frequency
f . However, due to transition costs of voltage mode switches or limitations in the transi-
tion time of the voltage regulators, this behavior may not be feasible. If instead voltage is
dithered more slowly as shown in Figure 2.5d, the producer and consumer occasionally stall
Figure 2.5: An example showing a drawback of dithering caused by rate imbalance. Con-
sider the system shown in (a), in which a variable-voltage Block B communicates with two
fixed-voltage Blocks A and C. If Block B is able to operate at the same frequency as its
neighbors as shown in (b), then data flows through the system without issue. In some cases,
this operating condition may not be available or efficient, requiring dithering between two
adjacent voltage/frequency pairs. If the dithering takes place rapidly as in (c), then the in-
put and output queues do not fill, and Block B correctly simulates a virtual operating mode
that matches rates with its source and sink. However, if dithering is slower as in (d), then
the pipeline stalls as the queues fill and empty during the rate mismatch. Note that these
examples do not account for the latency penalty of the asynchronous crossings between the
blocks, which would further exacerbate rate-balancing issues.
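A toy discrete-time simulation of the structure in Figure 2.5 reproduces this behavior; the queue depth, rates, and dither periods below are illustrative, not taken from any measured system. Fast dithering causes no stalls, while slow dithering lets the queues fill and drain:

```python
# Toy cycle-level sketch of Figure 2.5 (all parameters illustrative).
# Blocks A and C run at rate 1.0; Block B dithers between 1.5 and 0.5,
# coupled to its neighbors through queues of the given depth.
def simulate(dither_period, steps=10_000, depth=4):
    in_q, out_q = depth / 2, depth / 2
    stalls = 0
    for t in range(steps):
        rate_b = 1.5 if (t // dither_period) % 2 == 0 else 0.5
        if in_q + 1 <= depth:   # Block A pushes one item per step
            in_q += 1
        else:
            stalls += 1         # A stalls: input queue full
        moved = min(rate_b, in_q, depth - out_q)   # Block B throughput
        in_q -= moved
        out_q += moved
        if out_q >= 1:          # Block C pops one item per step
            out_q -= 1
        else:
            stalls += 1         # C stalls: output queue empty
    return stalls

for period in (1, 2, 8, 32):
    print(f"dither period {period:2d}: {simulate(period)} stall events")
```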
Figure 2.6: Waveforms demonstrating the energy cost of different clocking schemes during
rising and falling voltage transitions. In (a), the clock is halted during a voltage transition,
wasting considerable energy. In (b), the clock operates at the lower frequency during the
transition. In (c), an adaptive clocking scheme continuously adjusts the clock period during
the voltage transition.
Adaptive clock generators, widely deployed for droop detection in large SoCs [61–64], have the advantage of mitigating noise
and variability as well as responding to voltage changes. Furthermore, they can vary their
output frequency continuously during a mode transition, allowing uninterrupted operation
with little transition overhead as shown in Figure 2.6c. Fine-grained spatial partitioning
further increases the noise resilience of these systems [65], and commercial implementations
have used hybrid adaptive clock generators that respond to local voltage fluctuations while
still tracking a fixed external reference [66]. Local adaptive clock generators are ideally suited
to FG-AVS systems.
Flip-flops rely on feedback inverters to store data, which settle into a stable condition when storing
a 1 at one of the inverter outputs and a 0 at the other. A metastable condition also exists,
however, when the stored voltage level at each output is midway between VDD and ground.
In this case, the feedback holds these invalid values until some noise event unpredictably
perturbs the circuit towards one of the stable states.
Flip-flops are only susceptible to metastability when their input data changes near the
rising clock edge. Synchronous designs guard against metastability by timing closure at
design time. By ensuring that no register-to-register propagation delay is too slow, digital
designers guarantee that data will stop transitioning before the setup time of the flip-flop,
preventing the metastable condition. In a clock domain crossing, however, the input data
signal is fully asynchronous to the clock of the flip-flop, and so no design-time guarantee
can be made about the relative arrival time of these two signals. Metastability can lead
to incorrect operation by resolving to an incorrect value, or causing timing failures that
propagate metastability into other parts of the logic. Because of this possibility, circuits
that reduce or eliminate the danger of metastability must be used at every clock-domain
crossing.
The most common way to address metastability at asynchronous boundary crossings
is to add several series flip-flops on a signal line that crosses clock domains, clocking these extra flip-flops with the clock of the receiving domain, as shown in Figure 2.7.

[Figure 2.7 appears here: series synchronizing flip-flops in the receiving domain, clocked by the RX clock.]

The probability that a flip-flop has not resolved from metastability after a time $t$ has been empirically shown to follow the exponential relation

$$P(t) = e^{-t/\tau}$$

where $\tau$ is the resolution time constant of the circuit.
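This relation yields the standard first-order mean time between failures (MTBF) of a synchronizer, $\mathrm{MTBF} = e^{t_r/\tau}/(T_w f_{clk} f_{data})$, where $t_r$ is the time available for resolution and $T_w$ is the metastability capture window. The sketch below uses illustrative constants (not values from this work) to show why each added flip-flop stage multiplies the MTBF enormously:

```python
import math

# First-order synchronizer MTBF (all constants illustrative):
# MTBF = exp(t_r / tau) / (T_w * f_clk * f_data)
tau = 50e-12      # metastability resolution time constant (s)
t_w = 30e-12      # metastability capture window (s)
f_clk = 1e9       # receiving-domain clock frequency (Hz)
f_data = 100e6    # data transition rate (Hz)

def mtbf(t_resolve):
    return math.exp(t_resolve / tau) / (t_w * f_clk * f_data)

# Each extra series flip-flop adds roughly one clock period of
# resolution time, multiplying MTBF by exp(T_clk / tau).
for n_ff in (1, 2, 3):
    t_r = n_ff / f_clk - 100e-12   # n cycles minus setup time
    print(f"{n_ff} flip-flop(s): MTBF = {mtbf(t_r):.2e} s")
```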
In a synchronous pipeline, data advances one stage per cycle; a valid bit may be added so that the pipeline advances only when valid data is
supplied. In an asynchronous implementation, however, each pipeline stage may operate on
a different clock, and the relative frequencies of those clocks are not known. Accordingly,
a stage may not have completed its computation by the time the next stage has advanced
one cycle; conversely, a stage may complete its computation early, before the next stage is
ready to accept the data. In a system employing FG-AVS, these relative frequencies can vary
arbitrarily over time. To properly control the flow of the data through this asynchronous
pipeline, each stage must be able to exert backpressure on the previous stage. A ready-valid
interface at each stage ensures that data only advances both when it is valid and when the
next stage is ready for it. This logic is necessary in asynchronous communication, but it can
add substantial overhead compared to the simple valid-only approach.
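The handshake rule itself is simple: data moves across an interface only in cycles where the upstream asserts valid and the downstream asserts ready. The minimal behavioral sketch below uses hypothetical names and structure for illustration; it is not the implementation described in this work.

```python
# Minimal behavioral model of a ready-valid interface between two
# pipeline stages (names and structure are hypothetical).
from collections import deque

class Stage:
    def __init__(self, capacity=1):
        self.buf = deque()
        self.capacity = capacity
    def ready(self):   # can accept an item this cycle
        return len(self.buf) < self.capacity
    def valid(self):   # has an item to offer downstream
        return len(self.buf) > 0

def transfer(src, dst):
    """Data advances only when src.valid AND dst.ready (backpressure)."""
    if src.valid() and dst.ready():
        dst.buf.append(src.buf.popleft())

a, b = Stage(), Stage()
a.buf.append("data0")
transfer(a, b)          # fires: b was ready
a.buf.append("data1")
transfer(a, b)          # does not fire: b is still full,
print(list(b.buf))      # so data1 waits in a; prints ['data0']
```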
The combination of increased interface latency and the requirement for backpressure can
impose significant performance overhead as designs are rearchitected into the fine-grained
GALS paradigm. This retrofit is particularly challenging for large, tightly coupled designs,
such as multiple-issue out-of-order cores with many pipeline stages and frequent bypass paths
between them that rely on fixed, deterministic pipeline latencies to achieve high performance.
Other types of designs, such as signal processing pipelines, are less sensitive to latency and
have less internal feedback, making them more amenable to such partitioning. Nonetheless,
the performance degradation of these fully decoupled designs compared to their synchronous
counterparts must be an important consideration in FG-AVS adoption.
Figure 2.8: A level-shifter circuit that converts signal A in voltage domain VDDLO to buffered
signal Z in voltage domain VDDHI .
shown in Figure 2.8. While each level shifter is small, many may be required in FG-AVS
designs with many voltage domains and signal crossings, so the addition of these gates can
impose substantial area overhead.
inserted between them. These additional constraints increase the effort needed to correctly
synthesize the design.
Physical design also becomes more complicated in an FG-AVS design. In a synchronous
design, there is considerable flexibility in floorplanning the design to satisfy timing and other
constraints. In an FG-AVS design, on the other hand, the area available for the logic and
macros inside each voltage area is much smaller, and so the placement of these blocks is
highly constrained. Irregular macros can become difficult to place because oblong voltage
areas may have poor power connectivity and suffer from IR drop or other issues. The relative
placement of each voltage area to its neighbors must be carefully considered, both so that
communication paths are kept short and so that the power grid remains robust. Additionally,
the presence of many integrated voltage regulators and local clock generators requires much
more effort to realize their legal physical placement and correct integration into the design.
All of these constraints require considerable engineering effort.
Another aspect of the increased design effort is system verification. Because each voltage
domain can operate in several different modes, independent of other domains on the chip,
the number of potential operating conditions grows exponentially as the number of voltage
domains increases. This results in dramatically increased effort to ensure power grid relia-
bility and thermal safety as well as adequate performance. Simulating power consumption
becomes much more complicated when the power state of each voltage area is not static, but
instead changing frequently over time. Furthermore, verifying the correct behavior of asyn-
chronous boundary crossings is a difficult and error-prone endeavor, and requires entirely new
methodologies beyond those used for standard synchronous RTL verification. This added
verification effort to ensure correct operation may be the most challenging aspect of FG-AVS.
Chapter 3

Integrated Voltage Regulation
Integrated voltage regulation is a key enabling technology for FG-AVS. This chapter describes
an integrated switched-capacitor voltage regulator design appropriate for integration in an
FG-AVS system. To achieve high conversion efficiencies, the integrated regulator must be
paired with a responsive adaptive clock generator circuit, which is also well-suited for the
fast clock frequency transitions required for FG-AVS. These circuits impose constraints on
the design of the FG-AVS system; these system implications are discussed in detail, and the
tradeoffs of this approach to voltage generation and clock distribution are analyzed.
[Figure 3.1: Unit-cell switching in an integrated switched-capacitor regulator, contrasting interleaved and simultaneous-switching operation. Recoverable labels from the original diagram include banks of unit DC-DC cells with their toggle clocks, comparators against Vref, an FSM controller, and output voltage waveforms Videal, Vhigh, Vlow, and Vout, overlaid on processor floorplan labels (64-bit integer multipliers, single- and double-precision FMAs, a vector memory unit, crossbars, shared functional units, 8T SRAM macros, and 24 switched-capacitor unit cells of 0.19 mm²).]
Because each unit cell switches out of phase with the others, the relative amount of charge delivered onto the supply with each switching event is small, and the output voltage ripple is minimized as shown in Figure 3.1a. High
interleaving phase counts of 16, 32, or even greater can be used to suppress the ripple and
produce a steady output voltage [39, 73, 74]. However, this interleaved approach suffers
from charge-sharing losses as each unit cell shares charge across the flying capacitance of
the others when it switches. These intrinsic charge-sharing losses comprise up to 40% of the
overall energy losses in the voltage conversion [75].
The conversion losses of switched-capacitor voltage regulators are composed of four parts, three of which depend on the switching frequency of the design [39]. The intrinsic charge-sharing loss P_cfly of switched-capacitor designs is inversely proportional to the switching frequency, because a slower switching frequency means that more charge is transferred in each switching event. The bottom-plate loss P_bottom, caused by the parasitic capacitance from the flying capacitor to ground, and the gate-capacitance loss P_gate of the switches are both directly proportional to the switching frequency. The conduction loss P_cond through the switches does not depend on switching frequency. Switched-capacitor designs typically operate at the switching frequency that minimizes the total losses in the system, switching fast enough to reduce charge-sharing losses but not so fast that the parasitic losses dominate.
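This loss breakdown implies a simple optimization. The sketch below evaluates a total-loss model of the form P(f) = k_cfly/f + k_par·f + P_cond and its closed-form optimum; the coefficients are illustrative placeholders, not values from [39].

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double k_cfly = 4.0e6;   /* charge-sharing term, W*Hz (assumed)      */
        double k_par  = 1.0e-8;  /* bottom-plate + gate term, W/Hz (assumed) */
        double P_cond = 0.02;    /* frequency-independent conduction loss, W */

        /* dP/df = -k_cfly/f^2 + k_par = 0  =>  f_opt = sqrt(k_cfly / k_par) */
        double f_opt = sqrt(k_cfly / k_par);
        double P_min = k_cfly / f_opt + k_par * f_opt + P_cond;
        printf("f_opt = %.1f MHz, minimum loss = %.2f W\n", f_opt / 1e6, P_min);
        return 0;
    }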
Figure 3.2: Analytical results showing the efficiency improvement of the simultaneous-switching switched-capacitor voltage regulator [10] (© 2016 IEEE).
Figure 3.3: Simulation results showing the impact of load capacitance on converter efficiency and switching frequency [10] (© 2016 IEEE).
The rippling output of an SS-SC converter would be problematic for a system operating at a fixed clock frequency, because the digital clock would have to run at a slower frequency corresponding to the lowest voltage of the ripple in order to guarantee safe operation. This would result in wasted energy as the rippling voltage exceeds its minimum, as shown in Figure 3.4a, and these over-voltage losses would exceed the savings from the
elimination of charge sharing. Systems employing SS-SC regulators therefore require the
clock supplied to the digital load to adapt to the changing voltage in order to achieve
reasonable efficiencies [75]. As shown in Figure 3.4b, by changing the clock on a cycle-
by-cycle basis to match the instantaneous operating conditions of the load, the over-voltage
losses can be eliminated.
Figure 3.4: Sample voltage and clock waveforms of SS-SC systems. In (a), the clock does not
adapt to the voltage ripple, so it must be margined for the lowest supply voltage, resulting
in wasted energy at higher voltages. In (b), the clock adapts to the voltage ripple, allowing
the circuit to speed up when the voltage is higher than the minimum.
Figure 3.5: The four topologies of the reconfigurable SS-SC converter [10] (© 2016 IEEE).
All four modes share the same flying capacitance and switches, reconfiguring the switching pattern between each topology to avoid area overhead (see Figure 3.5).
The phases and configurations of the switches in the SS-SC regulator are set by a finite-
state-machine controller. This logic is responsible for configuring and reconfiguring the
topology of the regulator, as well as switching between the two operating phases to pump
charge onto the output of the converter. The regulator topology is determined by setting a
control register that, when changed, triggers a reconfiguration between voltage modes. To
generate the toggle clock that is distributed to the switches in the regulator, a comparator
circuit acting on a 2 GHz clock detects when the regulated output voltage drops below a
fixed external reference. When the comparator triggers, control logic produces the next
toggle clock edge, switching the SS-SC toggle clock phase and causing more charge to be
supplied, boosting the voltage. As a different reference voltage is needed for each mode,
different comparators must be implemented to function in the appropriate voltage ranges
(see Figure 3.6). In the event that one switching event does not sufficiently increase the
generated voltage to bring it above the reference voltage, additional logic triggers further
switching events after a delay, ensuring that the voltage will eventually be boosted back up
to nominal levels. The state machine controller is diagrammed in Figure 3.7.
Figure 3.6: A diagram of the comparators used in the SS-SC controller [10] (© 2016 IEEE). Different comparators are needed to accommodate the voltage ranges of the references for the three different switching modes.
Figure 3.7: The state machine used in the SS-SC controller and example waveforms of its operation [10] (© 2016 IEEE). If the current demand increases as the converter is toggling, the voltage may not rise above Vref, so an additional counter triggers further switching until the voltage increases.
[Figure 3.8: The adaptive clock generator. Tunable delay banks (Delay Bank #1 through #4) built from selectable delay cells feed a pulse generator and clock buffer to produce CLKOUT.]
The delay banks are built from cells with a variety of pMOS/nMOS ratios and gate lengths, as these characteristics affect the voltage-frequency
relationship of the cells. By selecting different mux settings, the delay paths can be tuned
until they match the delay characteristics of the critical paths in the digital logic supplied by
the generated clock, ensuring that the clock edges are generated at the appropriate frequency
for the instantaneous voltage produced by the regulator.
3.3.1 Clocking
Both the design of the SS-SC regulator and the adaptive clock generator have implications
for the design of the various clocks that must be distributed in the design. In the case of
the integrated voltage regulator, several fixed-frequency clocks must be provided to each
independent regulator in the system, to supply the finite state machine controller and lower-
bound comparator. In addition, the generated toggle clock that is used to alternate between
operating phases must be distributed to the switches in the design. To keep the size of the
individual switches and capacitors reasonable, and to ease physical design, it is common
to partition the entire flying capacitance into several unit cells, even in the simultaneous-
switching implementation. In order to ensure that all of these unit cells do indeed switch
simultaneously, eliminating charge-sharing losses between them, the toggle clock must be
distributed to every unit cell in a balanced clock tree that matches delays throughout the
design. This is well within the capabilities of standard place-and-route tools, but clock
constraints and physical placement may need to be adjusted to achieve an appropriately
balanced tree.
The adaptive clock generator design also has implications for the full system. Most
importantly, the cycle-by-cycle adjustment of each clock edge to track the voltage ripple
means that clock domains must be fully asynchronous with one another; there can be no
deterministic frequency or phase relationship between neighboring domains, even if they are
operating at the same nominal voltage. This eliminates some options for low-latency clock
domain crossings (see Chapter 5), and requires that synchronization circuitry between clock
domains be able to tolerate fully asynchronous arrival times. The tunable delay paths also
impose requirements for system bringup and testing, as the mux settings must be adjusted on
a per-core and per-domain basis to achieve the best tracking of the local critical paths in the
presence of random and spatially varying process variations. This tuning regime may require extensive one-time or boot-time calibration or even the addition of dedicated
built-in self-test (BIST) circuitry to be practical in mass-production scenarios.
In a conventional synchronous system, the presence of insertion delay actually reduces the impact of voltage changes because the propagation
delay through the clock tree changes the clock duration, an implicit adaptation known as the
clock-data compensation effect [76–78]. In adaptive clock circuits, on the other hand, the
different voltage-delay relationship of the clock tree compared to design logic causes harmful
timing effects that are exacerbated as insertion delay increases.
The deleterious effects of insertion delay on the timing constraints of an adaptive clocking
system under voltage changes are illustrated in Figure 3.9. The diagram is labeled with
units of arbitrary delay. In this example, the replica paths in the adaptive clock generator
perfectly track those in the critical path over the voltage range of interest, but the clock
insertion delay varies less with changes in voltage than either of these paths because the
clock tree tends to be wire-dominated relative to logic paths in the design. (The relative
voltage-delay differences between the clock tree and the logic paths are exaggerated in the
example.)
Figure 3.9a shows the impact of a decreasing supply voltage on the setup time constraint
of the critical path. This represents the common case, as the SS-SC supply is slowly de-
creasing for most of its operation. In the ideal case where no insertion delay is present, the
replica path can match the critical path exactly. In the presence of insertion delay, however,
the lesser delay sensitivity of the clock tree speeds up the arrival of the clock edges relative
to their delayed generation, causing setup time violations. The replica timing paths in the
adaptive clock generator would have to be slowed to guard against this failure, adding margin
that reduces system efficiency.
Figure 3.9b shows the impact of an increasing supply voltage on the timing of the system.
This type of sharp voltage increase occurs at each phase transition of the SS-SC regulator.
Here, the reduced voltage sensitivity of the clock tree unnecessarily delays the clock edges.
This results in wasted energy as the logic operates faster than necessary.
Insertion delay has an adverse effect on the energy efficiency of the SS-SC system under
adaptive clocking. Aside from the time-consuming design of custom clock trees, the only
way to ensure small insertion delays is to design voltage areas to be relatively small and
compact. This reduces both the number of clock sinks (and therefore the depth of the clock tree) and the propagation time through long clock wires. Insertion delay is an important limitation on adaptive clocking: the technique is well suited to FG-AVS but poorly suited to implementations with coarse spatial granularity.
Figure 3.9: Timing diagrams demonstrating the effects of clock insertion delay under a changing supply voltage and adaptive clock. In this example, the adaptive clock generator is assumed to perfectly track the voltage-delay characteristics of the digital logic, and the numbers are arbitrary, representative delays. In each case, the absence of insertion delay results in logic arrival times that match the arrival of the next clock edge, while the presence of insertion delay causes a mismatch. In (a), the voltage is decreasing, so insertion delay causes setup time violations because the clock edges propagate through the clock tree more quickly than the slowed logic propagation times. These violations require additional clock margin to eliminate, increasing energy cost. In (b), the voltage is increasing, so insertion delay causes the logic propagation to complete early, meeting timing but resulting in wasted energy.
Figure 3.10: A sample SS-SC waveform showing the impact on achievable Vmin . The SS-SC
regulator operates in a 2:1 step-down mode from the fixed 1 V supply; it has a 100 mV ripple
around the average voltage Vavg of 500 mV. In Process A, Vmin = 400 mV, but the SS-SC
cannot achieve this output voltage because it is limited to simple ratios of the input for
high-efficiency conversion. In Process B, Vmin = 450 mV. The SS-SC regulator is operating
at the lowest possible voltage without violating Vmin , but its average voltage remains 50 mV
higher than a non-rippling regulating technique.
An SS-SC system is also generally prevented from operating at Vmin for two reasons. First, it is unlikely that the particular voltages corresponding to simple ratios of the input supplies correspond precisely to Vmin. As ratios with
denominators greater than two or three require increasingly complex switching topologies
that decrease conversion efficiency, the lowest operating ratio will likely be above Vmin by
some unneeded margin. Even if this is not the case, however, the rippling voltage prevents
continuous operation at Vmin because the entire ripple cannot drop below this lower bound.
These effects are illustrated in Figure 3.10. A rippling supply generated with a simple down-
conversion ratio cannot achieve as low an average operating voltage as a continuous fixed
supply. If operation at Vmin is desired for improved efficiency, this discrepancy can result
in a comparative loss of energy efficiency. Note that this effect would also be seen at the
highest, most performant operating voltage Vmax , but the SS-SC regulator design presented
in the previous section operates in a non-rippling bypass mode at its highest voltage.
Figure 3.11: An example floorplan showing four independent voltage areas supplied by SS-
SC converters and adaptive clock generators. Each generated output voltage need only be
distributed to the local voltage area, but blocks shaded green require a fixed 1V supply and
blocks shaded red require a fixed 1.8V supply. The SS-SC unit cells surround the voltage
area they supply to minimize IR drop, and the SS-SC controller and adaptive clock generator
are placed centrally to each voltage area to help meet clock constraints. The central cutout
allows routing of digital connections between the domains.
The SS-SC unit cells must be placed close to the logic they supply. For the most robust power supply, the unit cells may need to surround
the voltage domain or even be interspersed with the logic inside it. The controller for the
SS-SC regulators should ideally be centrally located among the unit cells so as to most easily
balance skew from the generated toggle clock (see Figure 3.11). These layout goals can pose
challenges for the design of the power grid, which must provide the input supplies to the
regulators, deliver the generated voltage, and supply both fixed and generated supplies to
the controller. The input supplies to the SS-SC converters must be robust with minimal
inductance. This may require special consideration in package design as well as integrated
decoupling capacitance.
For the most accurate tracking of the critical paths in the digital logic, the adaptive
clock generator should be placed within the voltage domain. To reduce insertion delay, the
CHAPTER 3. INTEGRATED VOLTAGE REGULATION 41
generator circuit should be placed as centrally as possible. To improve the accuracy of the
replica paths, the generator must be located near the actual critical paths of the logic so that
it senses similar PVT conditions as shown in Figure 3.11. Alternatively, the replica timing
circuits can be distributed to the critical parts of the core, with the clock generator circuit
located centrally. This can improve tracking, but has the undesirable side effect of adding
additional wire delays to the clock generation as the signals are aggregated in the central
generator.
Chapter 4

Integrated Power Management
The execution of FG-AVS requires active power management to adjust the voltage and fre-
quency of each domain to track its instantaneous workload requirement. Power-management
software and hardware can take advantage of voltage generation, clocking, and synchroniza-
tion schemes designed for FG-AVS to save energy. This chapter explores the design space of
integrated power-management strategies that can enable FG-AVS.
The domain of power management encompasses a wide range of techniques and goals.
Platform power management may involve controlling dozens of different components, in-
cluding memory, discrete graphics cards, disks, and cooling fans. Within a single die,
power-management schemes can be responsible for maintaining power and thermal limits,
compensating for variation, and mitigating aging. This chapter restricts consideration of
power management to schemes that attempt to save energy by modulating the voltage and
frequency of a single SoC. These power-management implementations generally follow the
feedback loop shown in Figure 4.1. One or more sensors provide information about the
state of the digital logic of interest on the SoC. The data from these sensors is fed into a
power-management unit (PMU), a controller that implements some algorithm intended to
use this state information to actuate changes in operating mode and save energy. When this
algorithm determines that a change in operating condition is beneficial, the PMU updates
voltage and frequency settings accordingly. This chapter considers each part of this feedback
loop in turn to survey the design space for FG-AVS power management.
Figure 4.1: A typical feedback loop for SoC power management. Only a single varying voltage domain is shown for simplicity.
Figure 4.2: The basis for measuring power consumption in a system with SS-SC converters and a rippling supply voltage [68] (© 2017 IEEE).
Furthermore, these designs can only measure the power of voltages supplied from off-chip, so
they cannot measure the power of small voltage domains supplied by on-chip regulators. For
these reasons, this technique is no longer favored in commercial SoCs. The sense resistors can
instead be integrated on-die per voltage domain [81], but large resistive areas may be required
to achieve the necessary precision. Implementing analog-to-digital converters to measure the
sensed voltage is also expensive in area, power, and design effort. These intrusive techniques
are poorly suited to the needs of FG-AVS systems.
An alternative, non-invasive power measurement technique can be implemented in sys-
tems with SS-SC converters as described in Section 3.1.2. These regulators switch phases
when the generated voltage drops below a fixed reference. The total switching capacitance
of the system is also fixed, and so the amount of energy supplied at each switching event is
a constant that can be used to estimate the instantaneous power consumption of the supplied
domain. The basis for this measurement technique is illustrated in Figure 4.2. When the
power consumption of the supplied domain is low, the supplied voltage takes longer to drop
to the lower bound, and so the switching frequency is reduced. High power consumption in-
creases the steepness of the output voltage drop and increases the switching frequency. The
SS-SC regulator toggle clock that triggers these switching events is already generated by the
controller for use by the DC-DC unit cells. The instrumentation of this toggle clock with a
simple digital counter can therefore provide a lightweight, non-invasive estimation of power
consumption by the supplied domain [82]. This technique has little overhead, and can deter-
mine a core power measurement by counting just a handful of SS-SC clock toggles, allowing
fast sensing of core activity. Unlike prior regulator-based measurement schemes, this mea-
surement requires no off-chip components [83] and the output voltage is not perturbed [84],
making it ideally suited for FG-AVS power management.
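A sketch of the arithmetic makes the technique concrete: because each switching event delivers an approximately fixed charge, power is proportional to the toggle rate. The per-event energy constant below is a hypothetical calibration value, not a measured one.

    #include <stdio.h>

    #define E_EVENT 2.0e-10  /* joules per SS-SC switching event (assumed;
                                set by total flying capacitance and voltages) */

    /* Estimate average power from a toggle count over a sampling window. */
    static double estimate_power(unsigned long toggles, double window_s) {
        return toggles * E_EVENT / window_s;
    }

    int main(void) {
        /* e.g., 5000 toggles counted during a 10 microsecond window */
        printf("estimated power: %.3f W\n", estimate_power(5000, 10e-6));
        return 0;
    }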
Queue Counters
As noted in Section 2.3, adjacent voltage domains must have queues between them that allow
backpressure for flow control. These queues can be instrumented as indicators of activity by
measuring the fullness of each queue over time. If the queues flowing into a voltage domain
are full, and those directed out of the domain are empty, this indicates that the block has a
higher activity demand than it is currently able to meet. Conversely, if the outgoing queues
are full and the incoming queues empty, then the block can be slowed to save energy without
impacting the performance of its neighbors. Queue-based counter implementations are well-
suited for local feedback control in systems with many independent voltage domains [44, 90].
Queue counters are not restricted to processor implementations, but can be employed in any
FG-AVS system.
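A minimal sketch of this heuristic, with invented thresholds and type names, illustrates how queue fullness might map to an operating-point decision:

    #include <stdio.h>

    typedef struct {
        double in_full;   /* average fullness of incoming queues, 0..1 */
        double out_full;  /* average fullness of outgoing queues, 0..1 */
    } queue_stats;

    /* Returns +1 to raise voltage/frequency, -1 to lower, 0 to hold. */
    static int dvfs_step(queue_stats q) {
        const double HI = 0.75, LO = 0.25;  /* assumed thresholds */
        if (q.in_full > HI && q.out_full < LO)
            return +1;  /* work is backing up: the domain is too slow   */
        if (q.out_full > HI && q.in_full < LO)
            return -1;  /* neighbors cannot keep up: slow to save energy */
        return 0;
    }

    int main(void) {
        queue_stats q = { 0.9, 0.1 };
        printf("dvfs step: %+d\n", dvfs_step(q));
        return 0;
    }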
[Figure 4.3: Pipeline diagram of the PMU core, showing the Fetch, Decode/Execute, and Commit stages, with the Mul/Div unit and memory request and response ports attached to the final stage.]
The PMU is implemented as a three-stage, single-issue, in-order RISC-V processor based on Z-scale [95], as shown in Figure 4.3. The core forgoes caches in favor of an 8 KiB scratchpad memory, which is 128 bits wide and mapped into a portion of the physical memory space of the system. The PMU implements the RV32IM instruction set and is fully programmable via the RISC-V software toolchain. It is
designed to reside in a fixed-voltage “always-on” domain for use in controlling the operating
states of the remaining domains in the system.
The three-stage design minimizes gate count while enabling sufficient performance to
enable fine-grained power management. The first stage of the pipeline fetches an instruction
out of a 128-bit line buffer that reduces read-port contention on the single-ported scratchpad
by storing four consecutive RISC-V instructions (the value of the program counter is calcu-
lated in the previous cycle). The second stage of the pipeline decodes a RISC-V instruction,
reads the register file, and executes an ALU instruction. Branches are resolved in the second
stage, so the instruction in the fetch stage is flushed when a branch is taken. Writeback
is isolated into the third stage of the pipeline, reducing the fanout delay on the write port
of the register file. The result in the third stage is bypassed to the second stage, eliminat-
ing the need for stalls in some cases. The memory stage and the multiplication/division
pipeline stages are also in the third pipeline stage, although only one of these three sub-
systems will be active at a time, as their use triggers a stall in the second stage until the
result of the instruction is written back to the register file. The multiply and divide units
minimize hardware resources such that only one bit of the operation is completed each cycle,
and so 32 cycles are required to compute any multiplication or division result. Loads and
stores directly access the scratchpad; since the scratchpad is 128 bits wide, the pipeline must
properly swizzle the load data and store data. An extra pipeline register was added in the
arbiter between instruction and data memory requests to eliminate a long critical path and
speed the achievable cycle time of the design. The entire design (excluding the scratchpad
memories) uses just 18K gates, making the implementation overhead small relative to large
application processors, accelerators, and caches.
For fast power-management feedback loops, it is critical that the PMU be able to read
counters and actuate changes in voltage and frequency with low latency. The Z-scale PMU
accomplishes this by mapping all system control registers, including both counters used to
sense activity changes and control registers used to change the operating condition of a
voltage domain, into its control status register (CSR) space, so that reading or writing from
these registers accesses the appropriate system registers directly in hardware. CSR reads
and writes are natively supported in the RISC-V ISA, so standard instructions can be used to access these registers.
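The access pattern looks like ordinary CSR reads and writes from the PMU's software. The sketch below uses GCC-style RISC-V inline assembly; the CSR addresses 0x7C0 and 0x7C1 are hypothetical stand-ins in the custom CSR range, not the actual register map of this design.

    #include <stdint.h>

    /* Read an activity counter mapped into the custom CSR space. */
    static inline uint32_t read_activity_counter(void) {
        uint32_t v;
        __asm__ volatile ("csrr %0, 0x7C0" : "=r"(v));  /* hypothetical CSR */
        return v;
    }

    /* Write a voltage-mode control register to change a domain's mode. */
    static inline void set_voltage_mode(uint32_t mode) {
        __asm__ volatile ("csrw 0x7C1, %0" : : "r"(mode));  /* hypothetical */
    }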
In principle, energy consumption is minimized by always operating at the most energy-efficient low voltage, simply accepting longer execution times. However, this approach does not account for several important constraints that tend to favor operation at higher voltages, complicating the power-management optimization.
Continuous operation at Vmin is usually not practical because of the loss of performance
that low-voltage operation entails. Most systems operate under some performance con-
straint, which can be considered as a deadline by which some fixed amount of work must
be completed. This constraint can result from a variety of scenarios. Some systems process
workloads with a fixed, predictable latency requirement, such as the decoding of compressed
video that must achieve a constant framerate. Other systems may have softer constraints,
but a common goal is to complete user workloads as quickly as possible to present the most
performant experience. Alternatively, a deadline may be established to guarantee apparent
responsiveness (such as responding to a touch or keypress before the human brain perceives
the latency). In multicore systems running distributed multithreaded workloads that occa-
sionally synchronize to a barrier, the deadline is effectively established by the thread with
the most work, and other threads can slow as long as they do not become the slowest in the
system. These varied performance constraints are often determined at the operating system
level, with deadlines passed down to power management hardware so that the voltage and
frequency can be adjusted to meet them. Such deadlines often require execution at higher,
less energy-efficient voltages.
Even in scenarios without an explicit performance constraint, it may still save energy to
complete workloads as fast as possible, a technique known as “race to halt”. Given that dig-
ital logic operates less efficiently at high frequencies, speeding execution to save energy may
seem counterintuitive. In many systems, however, a large amount of static power is consumed
not just by leakage and always-on systems in the SoC, but by other platform components
such as memory, disk, fans, and other peripherals. As shown in Figure 4.4, even though the
SoC is less efficient while operating at a high voltage, completing the workload as soon as
possible allows large portions of the system to be powered down, saving energy in aggregate.
The effectiveness of race-to-halt power management has been confirmed in several empirical
studies of commercial systems [99–101]. A more recent evaluation shows that even in plat-
forms with power-hungry CPUs, a tradeoff nonetheless exists between the static platform
power consumed at low frequencies and the active CPU power consumed at high frequen-
cies, resulting in a minimum energy point at some frequency greater than the minimum (see
Figure 4.5) [98]. These platform considerations impose an implicit performance constraint
on nearly all SoC workloads, encouraging fast completion both for improved performance
and to save energy.
The goal of FG-AVS is therefore to save energy while still meeting some performance
constraint. Unlike traditional voltage scaling as currently implemented in commercial sys-
tems, FG-AVS algorithms are generally not focused on the slow feedback loops involved
in adjusting to changing deadlines as mandated by the operating system. FG-AVS control
operates at a finer granularity, attempting to reduce voltage when possible to save energy
with minimal impact on overall performance. This is illustrated in Figure 4.6. While coarse-
grained voltage control has already reduced the voltage from its maximum because of a
relaxed performance constraint, fine-grained control can further reduce this voltage during periods of low activity.
Figure 4.4: Sample power consumption plots showing the benefit of race-to-halt power management. In (a), the CPU voltage and frequency are reduced so the computation completes
more slowly. This reduces the total energy used by the CPU, but the platform consumes a
large amount of energy because it remains active for the duration of the computation. In
(b), the CPU operates in a faster, less energy-efficient mode, allowing the platform and CPU
to be put in a lower-power idle state more quickly and reducing overall energy consumption.
Figure 4.5: A plot showing the minimum-energy point achieved when accounting for both platform power and CPU power [98] (© 2014 IEEE).
Figure 4.6: Applying fine-grained AVS to save energy. The coarse-grained AVS system (in
red) has reduced the voltage and frequency as much as possible while still meeting the fixed
deadline, but it cannot track rapid changes in workload. The fine-grained AVS system (in
blue) can reduce voltage and frequency during periods of low activity, saving significant
energy (shaded red) with minimal impact on execution time.
A more generalized approach performs feedback based on average queue depth observed
over some interval. The authors of [109] propose a proportional-integral feedback controller
based on queue depths in a generalized multi-voltage-domain system, changing voltage in
response to queue utilization into and out of each domain.
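A minimal sketch of such a proportional-integral rule, with invented gains and setpoint, shows the structure of this kind of controller:

    /* Drive average queue occupancy toward a setpoint by scaling the
     * frequency target; occupancy is the measured fullness (0..1).   */
    static double pi_frequency_target(double occupancy, double f_current) {
        static double integral = 0.0;
        const double SETPOINT = 0.5;      /* desired fullness (assumed) */
        const double KP = 0.8, KI = 0.1;  /* controller gains (assumed) */
        double error = occupancy - SETPOINT;  /* fuller input => speed up */
        integral += error;
        return f_current * (1.0 + KP * error + KI * integral);
    }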
Interval-based approaches are predictable and easy to understand. However, much of
the prior analysis of these approaches assumes long interval durations of thousands or even
millions of cycles over which to collect and average information. These longer intervals
provide reliable information, but are likely to contain many different program phases and so
are unsuitable for the rapid changes desired in FG-AVS. Depending on workload, interval-
based monitoring with short interval durations may be less effective because activity can vary
considerably at smaller timescales, so activity in one interval may not accurately predict the
next.
Chapter 5

Fast Asynchronous Interfaces
A key barrier to FG-AVS in space is the difficulty of safely transferring data between two
asynchronous clock domains. This chapter describes the advantages and drawbacks of the
traditional solution to synchronization. Several alternatives to this traditional approach
that can reduce the latency of this interface crossing are presented, including a novel bisyn-
chronous FIFO based on the concept of pausible clocks.
Figure 5.1: A waveform illustrating the danger of naive synchronization of data words via
series flip-flops. In this example, the data word transitions from 00 to 11 near the RX clock
edge, resulting in metastability. Each bit of the word resolves in the opposite direction,
resulting in a cycle for which the output word is neither 00 nor 11.
[Figure 5.2: Block diagram of the standard bisynchronous FIFO. Ready-valid interfaces on the write and read sides feed pointer logic, and the Gray-coded write and read pointers are synchronized across the asynchronous boundary.]

The standard bisynchronous FIFO is a modification of the synchronous FIFO queue with logic added for the asynchronous
boundary. Communication with the transmitting (TX) and receiving (RX) domains is via
standard ready-valid interfaces. The FIFO memory acts as a circular buffer, with read and
write pointers tracking the location of valid data in the queue. The ready and valid signals
at the interfaces are calculated based on the relative positions of the two pointers. Figure 5.2
shows a block diagram of the bisynchronous FIFO.
The safe transmission of data words across the interface is guaranteed by the logic that
tracks the pointer positions. Data is stored in the FIFO by the transmitting domain only
when it is known not to be full, and data is read from the FIFO in the receiving domain only
when at least one valid word is known to be present. The synchronization challenge is not
in the data words themselves, but instead in the read and write pointers. When a word is
written to the FIFO and the write pointer incremented in the TX domain, the logic in the RX
domain cannot simply access that state, as metastability could result. A similar issue occurs
when the read pointer is incremented in the RX domain. Accordingly, the TX write pointer
must be synchronized into the RX domain, and the RX read pointer must be synchronized
into the TX domain. Because these pointers change only by incrementing, the Gray coding
scheme described in the previous section can be employed to safely synchronize the pointer
words into the neighboring domain via series flip-flops. These delayed representations of the
pointers, once decoded, can then be used to update the logic in the opposite side of the
FIFO.
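The codec itself is two bitwise operations, sketched below. The single-bit-change property is what makes synchronization safe: a pointer sampled mid-transition is either its old or its new value, never a corrupted mixture.

    #include <stdint.h>

    /* Standard binary-to-Gray mapping: consecutive values differ in one bit. */
    static inline uint32_t bin2gray(uint32_t b) {
        return b ^ (b >> 1);
    }

    /* Invert the mapping by folding the prefix parity back in. */
    static inline uint32_t gray2bin(uint32_t g) {
        for (uint32_t s = 1; s < 32; s <<= 1)
            g ^= g >> s;
        return g;
    }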
The standard bisynchronous FIFO is widely used in academia and industry. It is a well-
understood design comprising only minor modifications from synchronous queues, and it can
be synthesized from standard cells and memories without any exotic circuits. It is robust to
large differences in clock rate between the two domains, and the addition of series flip-flops
beyond the two required at minimum can reduce the probability of metastability-induced
failure to infinitesimal levels. However, the extra cycles required for pointer updates to
be synchronized to the neighboring domain directly increase the latency of communication
through the FIFO. In particular, when a word is written into the FIFO and the write pointer
incremented, several cycles (equal to the synchronization latency of the pointer word) will
elapse before that pointer increment is known to the receiving domain and the data can
be safely read from the FIFO. In addition to the area overhead of the synchronizing flip-
flops, therefore, the queue may need to be deeper than it otherwise would be to provide proper
flow control and allow for full bandwidth through the interface. (A shallower queue might
prematurely fill and stall while pointer updates are still being synchronized so data can be
read.) These drawbacks limit the performance of systems with many clock-domain crossings,
and have spurred the development of alternative synchronization techniques.
Figure 5.3: A single-stage synchronizing FIFO based on a carefully timed flip-flop [116].
An arbitrary pair of clock frequencies can then be approximated as a ratio, so that this scheme can be used. Any inaccuracy
in the approximation will manifest itself as phase drift between the two clocks, which can
be guarded against by additional detector logic that skips a cycle of transmission when
the relative phases approach an unsafe difference. This approach, however, is robust only
when the clock frequencies in neighboring domains are stable over time. Frequent dynamic
frequency scaling would invoke large overheads as the frequency multipliers at the interface
were adjusted, and the use of adaptive generation would make it difficult to guarantee that
the requisite timing constraints at the interface could be met under all possible operating
conditions.
After a frequency change, the interface must pause until the new frequency and phase are detected. Margining to ensure correctness in the case of
changes in frequency or phase can slow the interface and require increasingly sophisticated
phase detection logic, which can impose significant area overhead compared to a standard
bisynchronous FIFO. Rapid, numerous frequency changes, such as may be the case in a
system with adaptive clock generation, may render the even-odd synchronization scheme
impractical.
Figure 5.5: Local clock generators. The standard circuit (a) can be modified with an additional input (b) to enable pausible clocking.
The sync input, which must toggle after each clock edge only when it is safe to generate
the next one, can be generated in several ways. One common technique uses a mutual
exclusion (mutex) circuit as shown in Figure 5.6 [119]. The mutex is an asynchronous
element with two inputs and two outputs. Each output is high only if the corresponding
input is high, but the circuit prevents more than one output from going high at a time.
The mutex can be used in feedback to serve as the synchronizer in the pausible clock circuit
(see Figure 5.7). The generated clock is inverted and sent to the r1 input of the mutex,
while a Data Sync signal indicating that new data has arrived from a neighboring clock
domain serves as the other input. (This latter signal is asynchronous in this clock domain
and can toggle at any time.) The mutex guards against a data request propagating into the
circuit near the rising edge of the clock, which is when metastability can occur. If a request
arrives near the rising edge of the inverted clock input, one of three outcomes is possible
(see Figure 5.8). If the request arrives slightly later than the inverted clock edge, then the
request will be delayed until the next clock transition, when it is safe for data to pass through
without danger of metastability. If the request arrives slightly earlier than the inverted clock
edge, then the clock transition will be delayed until the data has been safely captured. If
the signals arrive at the same time, the mutex itself can go metastable, delaying both the
clock and the data transition until the metastability resolves into one of the above cases.
Even so, there is no danger of metastability in the data itself, and the circuit is guaranteed
to eventually achieve correct operation. This pausible circuit can also be extended with an
additional gating input to enter a “synchronous mode” that can be useful for test or scan as
shown in Figure 5.9.
This basic pausible clocking scheme can be incorporated into an asynchronous FIFO as
shown in Figure 5.10 [121]. This circuit uses two-phase (or non-return-to-zero) signaling
to indicate the presence of new data and to acknowledge that this data has been written
or read from the FIFO. An XOR gate therefore toggles high when there is a new request
that has not yet been received by the FIFO.
Figure 5.6: A mutual exclusion (mutex) circuit. Any metastability in the SR latch is blocked
from the output by the metastability filter.
Figure 5.7: A mutex used to enable safe asynchronous boundary crossing via pausible
clocks [120].
The output of this XOR gate is used as the synchronization guard for the mutex circuit in the pausible clock generator. This circuit uses
a fully asynchronous FIFO, which can be implemented using micropipelines [122], GasP [123],
Mousetrap [124], or some other style of asynchronous FIFO. The circuit can achieve full
throughput with average latency that is limited only by the propagation delay through the
asynchronous FIFO.
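The two-phase discipline reduces to a single XOR, as this sketch shows; the type and helper names are invented for illustration.

    #include <stdbool.h>

    typedef struct { bool req; bool ack; } two_phase;

    /* In two-phase (NRZ) signaling, every transition of req is one new
     * request; a request is outstanding whenever req and ack differ.  */
    static inline bool request_pending(two_phase s) { return s.req ^ s.ack; }
    static inline void send_request(two_phase *s)   { s->req = !s->req;     }
    static inline void acknowledge(two_phase *s)    { s->ack = !s->ack;     }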
Figure 5.8: Waveforms showing the response of the pausible clock circuit to three different
data arrival times. The red region near the clock edge represents the period during which
it would be dangerous for the output Sync OK signal to transition. If data arrives during
the “OK” phase when r2 is low, then it is passed through (a). If data arrives during the
“delay” phase, then it is delayed until after the next clock edge (b). If the data arrives at
the boundary between these two phases, the mutex circuit can go metastable, potentially
delaying the next clock edge (c).
These asynchronous FIFO designs expend energy and latency moving data words through long pipelines. Circular buffers, on the other hand, do not move data, instead reading and writing based on pointer locations. Furthermore, asynchronous FIFOs often require exotic non-standard logic gates or complex timing constraints, neither of which is well suited to standard VLSI synthesis or verification toolflows.
[Figure 5.11: Block diagram of the pausible bisynchronous FIFO. Ready-valid write and read interfaces feed pointer-increment logic; increment and acknowledge signals (labeled A through G in the original figure) pass through latches and mutex-based pausible synchronizers attached to the TX and RX clock generators.]
(A mirror-image sequence synchronizes read-pointer updates into the TX domain; that sequence is not described in further detail.) Meanwhile, the pointer increment
signal must be transmitted back to the TX domain as an acknowledge signal before the
increment line can be freed for future reuse (E). This signal is synchronized through the
same method in the TX domain (F), and is received by the write pointer logic after the next
TX clock (G).
Figure 5.12: The key delays in the pausible clock circuit [120] (© 2015 IEEE).
t_r2 is the delay from the output of the C-element to the r2 input of the mutex. t_fb is the delay from the r2 input through the mutex and around the feedback loop to the r1 input. t_g2 is the delay from the r1 input through the mutex, AND tree, and C-element. The most important timing constraint in the system is that the sum of these three delays cannot exceed the desired time between clock edges, that is, half the cycle time T of the block:

T/2 ≥ t_r2 + t_fb + t_g2    (5.1)
If this constraint is violated, then any signal arrival immediately before the clock edge will result in a clock pause, forcing the average clock period lower than desired as shown in Figure 5.13. If the constraint is met, then only metastability that increases the propagation time through the mutex will result in a clock pause. If the constraint is met with room to spare, then some additional margin t_m will guard against a clock pause:

t_m = T/2 − (t_r2 + t_fb + t_g2)    (5.2)

Only metastability resulting in an increase in t_fb greater than t_m can cause a clock pause.
A second constraint concerns the setup time of the combinational pointer logic. The low latency of the FIFO depends on the ability of the pointer logic to safely update the ready or valid signals in response to a pointer increment in the same cycle that the increment is received. The worst-case timing constraint for this logic is shown in Figure 5.14. If a metastability event resolves in favor of r2, then there is only a limited amount of time t_CL in which to combinationally evaluate the pointer update before the next clock edge is generated:

t_CL = t_fb + t_g2    (5.3)

If this constraint cannot be met, an additional cycle of pipelining is required, increasing the latency through the FIFO by one cycle.
Figure 5.13: Waveforms showing the clock period constraint in the pausible clock circuit [120] (© 2015 IEEE). If this constraint is violated, the next clock edge will be frequently delayed, slowing the average clock rate.
Figure 5.14: Waveforms illustrating the worst-case setup time for the combinational logic in the pausible interface [120] (© 2015 IEEE).
If the cycle-time constraint in Equation 5.1 is met with additional margin t_m, that margin can be traded off in favor of deliberately increased t_fb, which increases t_CL at the cost of increased odds of a clock pause.
The average latency through the interface can also be calculated from the above parameters. As shown in Figure 5.15, the latency through the interface averages 0.75T − t_r2 if the request arrives while the mutex is transparent, and 1.25T − t_r2 if the request arrives while the mutex is opaque. The average of these expressions therefore provides an overall average latency t_L:

t_L = T − t_r2    (5.4)

Increasing t_r2 therefore decreases average latency through the interface because it shifts the transparent phase of the mutex closer to the next clock edge. Any excess t_m can be traded off in favor of deliberately increased t_r2 to reduce average latency.
Figure 5.15: Waveforms showing the average latency through the pausible interface, as determined by taking the mean of the average latency during the transparent phase (a) and the average latency during the opaque phase (b) [120] (© 2015 IEEE).
Insertion Delay
The previous analysis assumed instantaneous propagation of the generated clock from the output of the C-element to all sinks in the design, but real implementations have some non-zero insertion delay t_ins. This insertion delay misaligns the mutex transparent phase, which can lead to unsafe value propagation near the rising clock edge as shown in Figure 5.16 [125]. If the insertion delay is smaller than t_m, then t_r2 can be increased to match t_ins, realigning the mutex transparent phase. To guard against larger insertion delays, a lockup latch guarded by r2 can be added at the output of the synchronized request signal (see Figure 5.17). The latch guards against races while the clock propagates to its sink registers, and it does not increase the latency of the system because signals that do not race the clock would still have to wait for the next clock edge to be synchronized [126]. This latch therefore permits an extra half-cycle of insertion delay, although t_CL is decreased slightly because the propagation delay through the transparent latch must be subtracted from the combinational logic setup time.
Assuming the presence of the lockup latch, the maximum tolerable insertion delay through the system is therefore

t_ins ≤ T − t_fb − t_g2    (5.5)
Figure 5.16: Waveforms illustrating the effect of insertion delay on the pausible clock cir-
cuit [120]. Insertion delay can misalign the phases of the circuit, resulting in unsafe trans-
mission of data.
Figure 5.17: The pausible synchronizer with an additional latch to guard against the effects of insertion delay [120] (© 2015 IEEE).
Figure 5.18: Two layout options for the local clock generator and synchronizer logic [120] (© 2015 IEEE).
The previous expressions for t_CL and t_L must also be adjusted for the presence of insertion delay:

t_L = T + t_ins − t_r2    (5.6)
t_CL = T/2 + t_ins − t_r2    (5.7)

Increased insertion delay increases the average latency through the interface because the next clock edge is delayed relative to the transparent phase of the mutex, but it relaxes the t_CL constraint because of this same delay.
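For concreteness, the sketch below evaluates Equations 5.1 through 5.7 for one illustrative set of delays; all numbers are placeholders rather than measurements from [120].

    #include <stdio.h>

    int main(void) {
        double T = 1000.0;  /* clock period in ps (assumed) */
        double t_r2 = 150.0, t_fb = 120.0, t_g2 = 180.0;  /* ps (assumed) */
        double t_ins = 250.0;                             /* ps (assumed) */

        /* Eq. 5.2: margin left in the half-period constraint of Eq. 5.1 */
        printf("t_m   = %6.0f ps (must be >= 0)\n",
               T / 2 - (t_r2 + t_fb + t_g2));
        /* Eq. 5.5: maximum tolerable insertion delay with a lockup latch */
        printf("t_ins <= %5.0f ps\n", T - t_fb - t_g2);
        /* Eq. 5.6 and 5.7: average latency and logic setup window        */
        printf("t_L   = %6.0f ps\n", T + t_ins - t_r2);
        printf("t_CL  = %6.0f ps\n", T / 2 + t_ins - t_r2);
        return 0;
    }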
Wire Delay
The previous analyses assumed no additional wire delay between the location of the clock
domain interface and the clock generation circuit. In practice, these two parts of the circuit
may not be able to be physically co-located, leading to the two possible physical design
options shown in Figure 5.18. If the synchronizer logic is located far from the clock generator,
then some additional wire delay must be added to t_r2 and t_g2, decreasing the maximum clock
frequency and the maximum allowable insertion delay. If the synchronizer logic is placed
near the clock generation circuit but far from the interface, then these constraints are not
impacted, but latency equal to the wire delay will be added to the average latency of the
interface. This second approach has the added benefit that the synchronization logic could
be hardened along with the clock generator into a custom circuit, which could likely achieve
better performance than a synthesized standard-cell-based design.
Figure 5.19: Simulated mutex delay for different input arrival times [120].
[Figure 5.20: Plot of the increase in average clock period (ps) versus metastability margin (ps).]
Figure 5.21: Sample simulation result. TX and RX clock period are randomly varied over
time. The spikes in TX clock period are clock pauses triggered by simulated metastability.
Table 5.1: Pausible bisynchronous FIFO synthesis results and comparison [120].
[Figure 5.22: Average latency in RX cycles through the pausible FIFO and the BFSync interface as the TX/RX clock ratio is varied.]
Figure 5.22 shows the average latency through each interface as the ratio of the TX and
RX clock is varied. The pausible FIFO achieves an average latency of just 1.34 cycles, which
includes both the insertion delay and the wire delay described in the previous section, as estimated by the synthesis tool.
Chapter 6

Energy-Efficient SoC Design
The potential energy savings of FG-AVS systems are strongly dependent on the design
choices and implementation overheads required to realize them. This chapter describes
the implementations and measurement results of four testchips that implement aspects of
fine-grained adaptive voltage scaling. Rather than testing individual components of FG-AVS
systems, these testchips are fully-featured SoCs that can perform integrated adaptive control
while executing realistic workloads. Demonstrating energy savings in realistic systems and
use cases in silicon shows the benefit of FG-AVS without obscuring the costs and design
tradeoffs of implementing such systems.
[Figure: Block diagram of the Raven-3 testchip, showing the core voltage domain (1.19 mm2) containing the RISC-V scalar core and vector accelerator (whose 16KB vector register file uses eight custom 8T SRAM macros), the 24 switched-capacitor DC-DC unit cells, the adaptive clock generator, asynchronous FIFOs between domains, and the fixed 1.0 V uncore; the chip is wire-bonded chip-on-board with the regulated output observable on an off-chip oscilloscope.]
The L1 instruction and data caches are virtually indexed and physically tagged, and have separate TLBs that are accessed in parallel
with cache accesses. The core has an IEEE 754-2008-compliant floating-point unit that
executes single- and double-precision floating-point operations, including fused multiply-add
(FMA) operations, with hardware support for subnormal numbers.
To reduce design complexity, the microprocessor is implemented as a tethered system.
Unlike a standalone system, a tethered system depends on a host machine to boot, and
lacks I/O devices such as a console, mass storage, frame buffer, and network card. The host
(e.g., an x86 laptop) is connected to the target tethered system via the host-target interface
(HTIF), a simple protocol that lets the host machine read and write target memory and
[Figure 6.3: Block diagram of the Hwacha vector accelerator, showing the RISC-V Rocket scalar core, the vector execution unit (VXU), the vector instruction cache (VI$), and the vector runahead unit (VRU).]
control registers. All I/O-related system calls are forwarded to the host machine using
HTIF, where they are executed on behalf of the target. Programs that run on the scalar
core are downloaded into the target’s memory via HTIF. The resulting system is able to boot
modern operating systems such as Linux utilizing I/O devices residing on the host machine,
and can run standard applications such as the Python interpreter.
The Hwacha vector accelerator, shown in Figure 6.3, is a decoupled single-lane vector
unit tightly coupled with the Rocket core. Hwacha executes vector operations temporally
(split across subsequent cycles) rather than spatially (split across parallel datapaths), and
has a vector length register that simplifies vector code generation and keeps the binary code
compatible across different vector microarchitectures with different numbers of execution
resources. The Rocket scalar core sends vector memory instructions and vector fetch in-
structions to the vector accelerator. A vector fetch instruction initiates execution of a block
of vector arithmetic instructions. The vector execution unit (VXU) fetches instructions
from the private 8 KiB vector instruction cache (VI$), decodes instructions, clears hazards,
and sequences vector instruction execution by sending multiple micro-ops down the vector
lane. The vector lane consists of a banked vector register file built out of two-ported SRAM
macros, operand registers, per-bank integer ALUs, and long-latency functional units. Multi-
ple operands per cycle are read from the banked register file by exploiting the regular access
pattern with operand registers used as temporary space [128]. The long-latency functional
units such as the integer multiplier and FMA units are shared between the Rocket core and
the Hwacha accelerator. The vector memory unit (VMU) supports unit-strided, constant-
strided, and gather/scatter vector memory operations to the shared L1 data cache. Vector
memory instructions are also sent to the vector runahead unit (VRU) by the scalar core.
The VRU prefetches data blocks from memory and places them in the L1 data cache ahead
of time to increase performance of vector memory operations executed by the VXU. The
resulting vector accelerator is more similar to traditional Cray-style vector pipelines [129]
Figure 6.4: Annotated layout of the custom 8T SRAM macro used in the core voltage
domain [68] (© 2017 IEEE).
than SIMD units such as those that execute ARM’s NEON or Intel’s SSE/AVX instruction
sets, and delivers high performance and energy efficiency while remaining area efficient.
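The portability provided by the vector length register can be illustrated with a stripmined loop. The sketch below uses a hypothetical setvl() helper standing in for the hardware vector-length request, so the same loop runs unchanged on implementations with different maximum vector lengths; it is an illustration of the programming model rather than actual Hwacha code.

#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the hardware vector-length request: the machine
// returns how many elements it will process this strip, bounded by its own
// maximum vector length. Real vector code issues an instruction instead.
std::size_t setvl(std::size_t remaining, std::size_t hw_max_vl = 64) {
  return std::min(remaining, hw_max_vl);
}

// Stripmined DAXPY: the binary never encodes the machine's vector length,
// so it runs unchanged on machines with different execution resources.
void daxpy(std::size_t n, double a, const double* x, double* y) {
  std::size_t i = 0;
  while (i < n) {
    std::size_t vl = setvl(n - i);        // hardware-chosen strip size
    for (std::size_t j = 0; j < vl; ++j)  // would execute as vector ops
      y[i + j] += a * x[i + j];
    i += vl;
  }
}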
The Rocket core and Hwacha accelerator together comprise the variable core voltage
domain that is supplied by the integrated voltage regulators and adaptive clock. The core
voltage area was also placed under deep n-well to allow forward body bias (FBB) to be
applied. Because SRAMs typically limit the minimum operating voltage of digital blocks
due to the susceptibility of the small transistors in SRAM bitcells to process variation, all
SRAM arrays in the core voltage domain use the same custom 8 KiB 8T-based SRAM macro
shown in Figure 6.4. The macro is logically organized as 512 entries of 72 bits (64 bits +
8 possible error-correcting code bits) and physically organized as two arrays of 128 rows by
144 columns with two-to-one physical interleaving. Low-voltage operation is enabled by the
8T bitcell, in which each transistor is larger than in the equivalent high-density 6T bitcell. While
the arrays also implemented a negative-bitline write assist, the assist was not necessary to
achieve minimum-voltage operation.
The core voltage is supplied by an integrated SS-SC converter (see Section 3.1.2). 1.0 V
and 1.8 V supplies are downconverted to rippling output voltages averaging 0.9 V, 0.67 V,
and 0.5 V depending on the reconfigurable converter topology. The 1.0 V supply can also
be passed directly to the core in a bypass mode. All 1 V input switches are implemented
as low-Vt devices to reduce their on resistance, while the larger 1.8 V input switches are
implemented as regular-Vt devices with FBB applied to further reduce their leakage when
active. The converter is subdivided into twenty-four 90 µm × 90 µm unit cells that are
located near the core voltage area. To achieve simultaneous switching, the DCDC toggle
clock generated by the controller must arrive at each unit cell at the same time. The place-
and-route tool was therefore directed to construct a clock tree for this signal, balancing
arrival times. The flying capacitor is implemented using MOS capacitors with two layers
of MOM capacitors above. Parasitic bottom-plate capacitance is reduced by using a series
connection of the BOX, well, and substrate capacitances [130]. The converter achieves a total
capacitance of 2.1 nF and a capacitive density of 11.0 fF/µm2 .
As noted in Section 3.1.2, simultaneous-switching regulators require adaptive clocking
to achieve high system efficiencies. Raven-3 implements an early version of the adaptive
clock generator described in Section 3.2. The clock generator selects clock edges from a
16-phase DLL output as determined by the delay through a replica critical path supplied by
the core voltage [77]. The replica path is made up only of inverters, with the depth of the
path programmable via selectable muxes. The generated clock is supplied to all register and
SRAM sinks in the core voltage domain.
The IP blocks and their control registers, IO cells, and HTIF logic comprise the uncore
voltage domain, which operates at a fixed 1 V and fixed frequency. Because the uncore
and core may be asynchronous depending on the operating mode, standard bisynchronous
FIFOs guard all communication between them (see Section 5.1). Level shifters are inserted
on signals crossing the domains to ensure correct operation as the core voltage varies.
A multivoltage and multiclock design flow was used to construct the processor. Figure 6.5
shows the processor floorplan, with the larger core voltage domain in red separated from the
smaller uncore voltage domain to the right of the chip. The custom SRAMs were manually
placed within the core voltage domain. The SS-SC unit cells surround the core to minimize
voltage drop. Two layers of thick upper-layer metal were dedicated to a power grid, where
the core voltage and ground each utilize 25% of the chip area in each layer. Outside the
core, these core voltage rails are not necessary, so the input voltages to the converters use
the majority of the power routing resources to connect power coming from the pad frame
to the converters. An annotated die photo of the chip, which was fabricated in 28 nm ultra-
thin-body-and-BOX fully depleted silicon-on-insulator (UTBB FD-SOI) technology [131], is
shown in Figure 6.6.
[Figure 6.6: Annotated die photo of the Raven-3 testchip (1.3 mm × 1.8 mm), showing the scalar core and vector accelerator, instruction and data caches, vector register file, SC-DCDC unit cells and controller, BIST, and adaptive clock generator.]
Figure 6.7: A block diagram showing the test setup of the Raven-3 testchip [10] (© 2016 IEEE).
Figure 6.8: The Raven-3 testchip and associated infrastructure. The chip itself is obscured
by the white protective covering.
Figure 6.9: Oscilloscope traces of the voltages generated in the four SS-SC operating
modes [10] (© 2016 IEEE).
Figure 6.10: Figures showing the effect of different lower-bound reference voltages on system
operation [10] (© 2016 IEEE). (a) shows the system conversion efficiency as the voltage is
swept, and (b) shows the effect of different Vref on average output voltage and frequency.
Technology 28 nm FDSOI
Die Area 1305 µm × 1818 µm (2.37 mm2 )
Core Area 880 µm × 1350 µm (1.19 mm2 )
Converter Area 24 × 90 µm × 90 µm (0.19 mm2 )
Voltage 0.45 V to 1 V (1V FBB)
Frequency 93 MHz to 961 MHz (1V FBB)
Power 8 mW to 173 mW (1V FBB)
SC density 11.0 fF/µm2
SC power density 0.35 W/mm2 @ 88% efficiency
Table 6.1: A summary of key results and details from the Raven-3 testchip [10].
Figure 6.11: An oscilloscope trace showing the generated core voltage as the SS-SC converter
rapidly cycles through each of the four operating modes [10] (© 2016 IEEE).
For all possible converter topologies with adaptive clocking, the processor successfully
boots Linux and runs user applications, demonstrating that complex digital logic operates
reliably with an intentionally rippling supply voltage. Figure 6.11 shows the rapid (<20 ns)
transitions between different voltage modes. The application core can continue to operate
through these mode transitions, demonstrating the utility of integrated regulators for FG-
AVS.
The conversion efficiency of a voltage converter is generally computed by measuring the
current and voltage at both the input and output of the converter to determine the ratio
of power delivered to power supplied. The regulator in the Raven-3 testchip cannot be
measured in this way because it is difficult to measure on-chip voltage and current while
the output voltage ripples rapidly. Furthermore, even if the power output of the converter could
be measured, this metric would ignore the impact of imperfect adaptive clock tracking, which
is an important loss component. Therefore, a different method is required to measure the
efficiency of the implemented system.
Figure 6.12: A plot demonstrating the system conversion efficiency of the three SS-SC switching
modes [10] (© 2016 IEEE).
We define system conversion efficiency as the ratio of energy required to finish the same
workload in the same amount of time under the regulated system as compared to a sys-
tem without voltage conversion losses. In this metric, 100% efficiency represents a lossless
regulator supplying the core as it operates at the maximum frequency achievable at that
voltage. To determine this 100%-efficiency baseline, the bypass mode is used to directly
supply the core with an ideal off-chip voltage source. A self-checking benchmark is run for a
fixed number of cycles at different voltages, and a binary search is performed at each voltage
point to find the maximum frequency. At the maximum frequency, the total elapsed time
and total energy to run the fixed-length benchmark are measured; the energy is computed
from the current drawn from the off-chip supply and the voltage measured at sense points
on the core supply (removing the voltage drop across the on-chip bypass-mode power gates
from the efficiency calculation). This baseline provides the blue curve in Figure 6.12 and
represents a 100%-efficient off-chip regulator. Then, for each SS-SC mode, the same benchmark
is run for the same number of cycles, and the total elapsed time and energy are measured.
Due to non-idealities of the converter, more energy is required to perform the same task in
the same amount of time; the ratio of these energies gives the system conversion efficiency.
This metric includes all sources of overhead, including non-idealities in the adaptive
clock. As shown in Figure 6.12, the measured system conversion efficiency ranges from 80%
to 86% for the different output voltage modes.
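Restating the metric concretely, the short sketch below computes system conversion efficiency from the two bench measurements described above; the structure and names are illustrative, not part of the measurement software.

// Energy and time measured for the same fixed-cycle-count benchmark.
struct Run {
  double joules;
  double seconds;
};

// System conversion efficiency: energy under the ideal (bypass-mode) baseline
// divided by energy with the SS-SC converter active, for runs that complete
// the same workload in the same amount of time.
double system_conversion_efficiency(const Run& bypass_baseline,
                                    const Run& ssc_mode) {
  return bypass_baseline.joules / ssc_mode.joules;  // e.g., 0.80 to 0.86 here
}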
Figure 6.13 shows measured average frequency for different delay settings for the tunable
replica circuit across a range of operating voltages (supplied using the bypass mode). Anno-
tations above the plot indicate the approximate voltage ranges seen in each SS-SC voltage
mode. Because the inverter-based replica path delay characteristics do not match the critical
paths of the processor, a single delay setting poorly tracks the processor critical path over
the entire voltage range. At higher voltages, a shorter replica path delay best tracks the
Figure 6.13: Measurements of the adaptive clock generator frequency over different voltages
and replica path delay settings [10] (© 2016 IEEE).
Figure 6.14: Plots showing the measured energy, power, and frequency of the processor core
in bypass mode and with the switching regulators enabled [10] (© 2016 IEEE).
processor critical path, while at lower voltages, a longer replica path delay best tracks the
processor critical path. However, manual calibration of specific delay settings for each SS-SC
voltage mode allows reasonably accurate tracking within the relatively small ripple of that
mode.
Figure 6.14 shows various energy-delay curves for the application processor. Energy
efficiency is determined by measuring total core energy consumption while executing a fixed-
length double-precision floating-point matrix multiplication kernel on the vector accelerator,
and is shown both under bypass mode and when accounting for the losses of the switching
regulation modes. By using the on-chip converter to generate the lowest output voltage, the
system achieves a peak efficiency of 26.2 GFLOPS/W. The FD-SOI technology of the testchip
enables up to 1.8 V of FBB to be safely applied, trading increased leakage for improved
performance [133]. Figures 6.14b and 6.14d show the impact of FBB on frequency and
energy. In each of the plots, each point represents the best achievable operating frequency
at a given voltage. The integrated system is able to achieve high energy efficiencies under
the proposed voltage regulation and clocking scheme, demonstrating the potential utility of
FG-AVS.
[Figure: Block diagram of the Raven-4 testchip core voltage domain (1.07 mm2), containing the Rocket core with branch prediction and the Hwacha vector accelerator, whose 16KB vector register file uses eight custom 8T SRAM macros; the SS-SC output and DCDC toggle clock are observable off-chip.]
The application core and the PMU can communicate directly via inter-processor in-
terrupts, and each has a register mapped directly into the CSR space of the other system,
allowing arbitrary data to be communicated between the two cores. Table 6.2 compares key
features of the two processors. The processor maps the control registers for the SS-SC con-
verters, adaptive clock generator, and other IP into its CSR space, which enables programs
to directly manipulate the voltage and frequency of the chip. In addition, the core clock and
the SS-SC toggle clock are read by counters, and the counter values are synchronized into
the uncore domain. As described in Section 4.1.1, successive reads to the second counter
enable a rapid estimate of core power consumption for active power management.
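A minimal sketch of this estimation is shown below; the counter addresses and the scaling coefficient are hypothetical, standing in for the chip's actual memory map and a pre-characterized fit of toggle rate to power.

#include <cstdint>

// Hypothetical memory-mapped counter addresses; the real register map of the
// testchip is not reproduced here.
volatile uint64_t* const SSC_TOGGLE_COUNT =
    reinterpret_cast<volatile uint64_t*>(0x4000);
volatile uint64_t* const UNCORE_CYCLES =
    reinterpret_cast<volatile uint64_t*>(0x4008);

// Two successive counter reads bound a sampling interval; the toggle rate over
// that interval tracks core load, so a pre-characterized coefficient maps it
// to power (the correlation is approximately linear within a conversion mode).
double estimate_core_power_mw(double uncore_mhz, double mw_per_toggle_mhz) {
  uint64_t t0 = *SSC_TOGGLE_COUNT, c0 = *UNCORE_CYCLES;
  // ... useful work or a short delay elapses between the two samples ...
  uint64_t t1 = *SSC_TOGGLE_COUNT, c1 = *UNCORE_CYCLES;
  double toggle_mhz = double(t1 - t0) / double(c1 - c0) * uncore_mhz;
  return mw_per_toggle_mhz * toggle_mhz;
}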
Several other circuits further the power management and measurement capabilities of
the SoC. The threshold voltage of the logic in the core voltage domain can be manipulated
by an integrated body bias generator that can supply up to 1.8 V of FBB [134]. Fine tuning
resolution and fast response allow adaptive body bias to be incorporated into power man-
agement techniques. Because the core voltage waveform is difficult to accurately observe
off-chip due to parasitics on the measurement path, a waveform measurement circuit was
implemented that uses a lightweight sampling approach to statistically reconstruct the wave-
form [82]. This approach allows core power consumption to be measured with high accuracy
via numerical integration. As the current load of the application core can vary over time
and has a limited range, a programmable current-mirror load connected to the core voltage
domain was added to allow straightforward characterization and measurement of the SS-SC
converters and power monitoring circuitry. The core clock and SS-SC toggle clock are also
connected to output drivers for direct observation.
Figure 6.16 shows the floorplan of the SoC. The design is partitioned into two voltage
areas, with the core voltage area supplied by the SS-SC converters placed centrally. Numerous
additional voltages and clocks are defined to supply both the core and the various analog
and mixed-signal blocks that make up the SoC. To reduce core insertion delay, the clock
generator itself was placed near the center of the core area, and a “peninsula” of the uncore
voltage domain was extended to allow the routing of control signals from the block to the
top level of the design hierarchy. The location of the core clock multiplexer, which allowed
the selection of different core clock sources for test, was specified explicitly and placed near
the center of the core area. These improvements combined to reduce core insertion delay by
several hundred picoseconds. An annotated die photo of the testchip, which was fabricated
in 28 nm UTBB FD-SOI, is shown in Figure 6.17.
Technology 28 nm FDSOI
Die Area 1665 µm × 1818 µm (3.03 mm2 )
Core Area 895 µm × 1193 µm (1.07 mm2 )
Standard Cells 568K
Converter Area 48 × 90 µm × 90 µm (0.39 mm2 )
Core Voltage 0.48 V to 1 V (bypass mode)
Core Power 1.2 mW to 231 mW (bypass mode)
Core Frequency 20 MHz to 797 MHz (bypass mode)
Peak Energy Efficiency 41.8 GFLOPS/W (1/2 1V mode)
Conversion Efficiency 82-89%
AVS Transition Time <1 µs
Peak AVS Energy Savings 39.8%
Table 6.3: A summary of key results and details from the Raven-4 testchip [68].
Figure 6.19: Oscilloscope traces of the core voltage and clock during an SS-SC mode transi-
tion [68] (© 2017 IEEE).
Table 6.4: Measured system conversion efficiencies achieved by the Raven-4 system [68].
Table 6.4 lists the peak system conversion efficiency of each of the three switching SS-SC modes. The adap-
tive clock is tuned at each voltage setting by sweeping the settings of its replica delay path
and choosing the fastest setting that still results in correct core functionality. The adaptive
clocking system provides a large improvement in system conversion efficiency because the
core is able to operate at a higher average frequency as the supply voltage ripples, reducing
the amount of energy required to complete the same amount of work. When supplied by
the SS-SC converter, the processor achieves a peak energy efficiency of 41.8 double-precision
GFLOPS/W running an FMA microbenchmark on the vector coprocessor in 1/2 1 V mode.
The processor is able to boot Linux and run user programs while powered by the rippling
supply voltage and adaptive clock.
Figure 6.20 shows the processor functionality across a wide range of voltages and fre-
quencies. The SS-SC converter is placed into bypass mode for characterization, allowing the
measurement of processor performance under fixed voltage and frequency. The best energy
efficiency in bypass mode of 54.0 double-precision GFLOPS/W is achieved at 500 mV and
40 MHz. Figure 6.21 shows the best frequency achievable at each operating point and the to-
tal energy consumed by a fixed-duration matrix-multiply benchmark at that operating point.
The application of FBB increases performance but results in higher leakage power. The FBB
voltage that achieves minimum energy depends on the proportion of switching power to leak-
age power and is therefore benchmark-dependent. Figure 6.22 shows that the energy of more
switching-intensive benchmarks is minimized at higher FBB voltages, while benchmarks with
a larger leakage proportion reach minimum energy with less FBB applied.
Figure 6.20: A shmoo chart showing processor performance while executing a matrix-
multiply benchmark under a wide range of operating modes [68] (© 2017 IEEE). The
number in each box represents the energy efficiency of the application core as measured
in double-precision GFLOPS/W.
Figure 6.21: Plots showing the effects of FBB on core frequency and energy [68] (© 2017 IEEE).
Figure 6.22: The effect of FBB on different benchmarks with a supply voltage of 0.6 V
(bypass mode) [135] (© 2016 IEEE). The total energy consumed by each benchmark has
been normalized so the relative effects of body bias can be compared. The minimum-energy
point is highlighted for each benchmark.
Figure 6.23: Measured comparison of the two clock generators implemented in the Raven-4
testchip [68] (© 2017 IEEE).
Figure 6.24: Measurement results showing the relative difference in voltage-dependent fre-
quency behavior of the four delay banks in the adaptive clock generator [68] (© 2017 IEEE).
Figure 6.25: Oscilloscope traces showing the rippling core supply voltage and the SS-SC
toggle clock [68] (© 2017 IEEE).
Figure 6.24 compares the voltage-dependent delay characteristics of the four delay banks,
normalized to the delay of the custom buffer cells in Bank 3. The result for each bank was
measured by recording the frequency of the generated clock after selecting the maximum
delay through that bank and the minimum delay through the remaining banks. The cells
with small pMOS/nMOS ratios and larger gate lengths have larger delays at lower voltages.
The wide variation in voltage-dependent delays between the different delay banks (up to
18% at 0.45 V) validates the need for multiple different standard cells to achieve accurate
critical path tracking.
Figure 6.25 shows the measured core voltage and SS-SC toggle clock, and Figure 6.26
compares the core power measured by the bench equipment with the SS-SC toggle frequency
Figure 6.26: Measurement results showing the correlation between SS-SC toggle frequency
and measured core power [68] (© 2017 IEEE).
measured by the integrated counter. The correlation is monotonic and approximately lin-
ear for each of the three conversion modes, confirming the practicality of using this toggle
frequency to estimate core power and system load.
The programmable PMU allows the implementation of a wide variety of power man-
agement algorithms to improve energy efficiency. Several different experiments demonstrate
the flexibility of the system in implementing common energy-saving techniques. In one ex-
periment, dithering between two voltages to achieve an arbitrary target core frequency is
implemented. To implement this algorithm, the PMU first calibrates the system by using the
core clock counter to measure the average operating frequency at each voltage mode. Then a
target frequency is provided to the PMU, which polls the core clock counter and dithers the
voltage setting to achieve the target frequency in aggregate. The results of the experiment
are shown in Figure 6.27. Without dithering, the processor would need to operate only in
the higher mode to guarantee that the performance target is met, which would consume up
to 40% more energy than the dithered approach.
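A sketch of the dithering loop is shown below. The helper functions are hypothetical stand-ins for the memory-mapped register accesses, and the calibration step described above supplies the target, expressed here as core cycles per uncore cycle.

#include <cstdint>

// Hypothetical hooks into the testchip's control registers and counters.
void     set_voltage_mode(int mode);    // 0 = lower mode, 1 = higher mode
uint64_t read_core_clock_counter();     // counts core clock edges
uint64_t read_uncore_clock_counter();   // fixed-frequency time base

// Dither between two voltage modes so the aggregate core frequency meets a
// target, expressed as core cycles per uncore cycle.
[[noreturn]] void dither_to_target(double target_ratio) {
  const uint64_t core0 = read_core_clock_counter();
  const uint64_t wall0 = read_uncore_clock_counter();
  for (;;) {
    double core = double(read_core_clock_counter() - core0);
    double wall = double(read_uncore_clock_counter() - wall0);
    // Below target: hop to the higher mode; at or above: hop to the lower one.
    set_voltage_mode(core / wall < target_ratio ? 1 : 0);
  }
}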
The choice of hopping frequency presents a tradeoff between increased fidelity to the
target effective frequency and the more frequent occurrence of transition overheads, which
can increase energy consumption. In the testchip SS-SC implementation, the energy cost
associated with transitions between voltage modes is small because the processor continues
to operate as the clock frequency adjusts during the mode transition. In high-to-low mode
transitions, no charge is wasted, but in some low-to-high transitions, the flying capacitance
must be recharged to the new operating point, incurring a small energy cost.
Figure 6.27: Oscilloscope traces showing the core voltage as the frequency hopping algorithm
is applied with two different frequency targets [68] (© 2017 IEEE).
Figure 6.28: Plots showing the effect of voltage dithering on system conversion efficiency [68]
(© 2017 IEEE). (a) compares the energy cost of dithering to the bypass mode baseline (in
blue), which represents 100% efficient regulation. The dithering operating points linearly
interpolate completion time and energy between the fixed SS-SC modes. (b) shows the
measured system conversion efficiencies under voltage dithering. The results in green show
the benefit of re-tuning the replica timing circuit in the adaptive clock generator after each
SS-SC mode transition.
Figure 6.28b quantifies the effects of dithering on system conversion efficiency.
Two different dithering programs were run on the PMU. The first program simply switches
the voltage mode setting of the SS-SC converters, without changing any other system set-
tings; the delay settings of the replica paths in the adaptive clock generator were tuned to
the best setting that could function across both operating modes. The results are shown by
the red points in the figure. Because the best setting of the replica paths changes according
to operating mode, the conversion efficiencies of this approach are less than optimal for part
of the dithering range. The second program switches both the voltage mode and the delay
settings of the replica paths according to a pre-characterization of the best adaptive clock
setting associated with each voltage mode. This program is able to speed the generated
clock at the higher voltage settings, leading to higher conversion efficiencies. In all, the
efficiencies of the second program range from 70% to 100%, depending on the voltage mode
and dithering ratio. Conversion efficiencies while dithering are not as high as the efficiencies
of the fixed operating modes in Table 6.4 because the external voltage reference used by the
comparator cannot be tuned for a particular SS-SC mode. This implies that in some cases
it is more efficient to operate at a single mode than to dither, even if the performance target
is somewhat exceeded by the average operating frequency at the fixed mode.
A second experiment demonstrates power envelope tracking, a common requirement for
systems that must operate within a user-specified power budget. Figure 6.29 shows the
results of a power management algorithm executed on the PMU that maximizes core per-
formance within a user-specified power budget. The power management program polls an
externally writeable control register that stores the absolute power limit for the program.
The PMU core then monitors the SS-SC toggle counter to continuously estimate core power
using the quadratic model described in [82] with pre-characterized coefficients. If the es-
timated core power is above the specified limit, core frequency is decreased, and if it is
below the limit, core frequency is increased. In this way, the best possible performance is
automatically obtained while the user-specified power budget is respected.
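A sketch of this control loop appears below; the helper functions and the quadratic coefficients are hypothetical stand-ins for the chip's register map and the pre-characterized model of [82].

// Hypothetical platform hooks; names are illustrative.
double read_ssc_toggle_mhz();                // from the integrated counter
double read_power_limit_mw();                // externally writeable register
void   nudge_core_frequency(int direction);  // +1 = faster, -1 = slower

// Quadratic toggle-frequency-to-power model with pre-characterized
// coefficients, in the spirit of [82]: P ~ a*f^2 + b*f + c.
struct PowerModel { double a, b, c; };

double estimate_power_mw(const PowerModel& m, double f) {
  return (m.a * f + m.b) * f + m.c;
}

// Run as fast as possible without exceeding the user-specified budget.
[[noreturn]] void track_power_envelope(const PowerModel& m) {
  for (;;) {
    double p = estimate_power_mw(m, read_ssc_toggle_mhz());
    nudge_core_frequency(p > read_power_limit_mw() ? -1 : +1);
  }
}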
The PMU can also use the integrated counters to coordinate fine-grained adaptive volt-
age scaling (AVS) on-chip without any explicit guidance from the programs running on the
processor. In the algorithm used in this experiment, core power is used as a marker of pro-
gram phase. When core power is higher, the core is likely executing a compute-intensive
program region, and a high voltage is best suited to a race-to-halt strategy. When the core
power is lower, the core is likely waiting for off-chip communication in a memory-bound
program region, and energy can be saved with minimal performance impact by reducing the
voltage. In this experiment, the application core runs a synthetic benchmark that alter-
nates between the compute-intensive and idle phases at a timescale of tens of microseconds.
Figure 6.30 shows the core voltage measured during the execution of the benchmark and
the AVS power-management algorithm. The algorithm switches the core voltage between
the 1.8 V 1/2 mode and the 1 V 2/3 mode, actuated by core power estimates determined
by continuously polling the SS-SC toggle counter. When the core voltage is high and the
toggle rate drops below a threshold, this corresponds to an idle program period, so the PMU
reduces the core voltage to save energy. When the core voltage is low and the toggle rate
Figure 6.29: Measurement results showing a power envelope tracking program executing on
the PMU [68] (© 2017 IEEE). The frequency of the core is adjusted in response to measured
changes in core power so that the core operates as quickly as possible without exceeding a
time-varying, user-specified power budget.
exceeds a threshold, the workload has increased and the PMU increases the core voltage.
The system is able to detect changes in workload in less than 1 µs and adjust the core volt-
age in response. Without integrated voltage regulators and power management, the system
would not be able to respond within the timescales of the workload variation. The results
of the power-management algorithm are therefore compared against continuous operation in
the higher voltage mode, which would otherwise be required to meet the same performance
target. The power-management algorithm reduces the energy consumed by 39.8%, and the
fast response incurs negligible (<0.2%) performance penalty compared with this baseline
because the fast response minimizes the time spent in the lower voltage mode during a
compute-bound region. This experiment demonstrates the efficacy of fine-grained AVS at
improving energy efficiency with fast workload tracking.
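The threshold logic of this algorithm can be sketched as follows; the mode names and helpers are hypothetical, and the two thresholds would come from calibration of the toggle-rate power proxy described above.

// Hypothetical hooks; the two modes follow the text: High = 1.8 V 1/2 mode,
// Low = 1 V 2/3 mode.
enum class Mode { High, Low };
void   set_mode(Mode m);
double read_ssc_toggle_mhz();  // proxy for core power

// Threshold-based FG-AVS: a falling toggle rate at high voltage indicates an
// idle, memory-bound phase; a rising rate at low voltage indicates renewed
// compute demand that favors race-to-halt.
[[noreturn]] void avs_track(double idle_threshold, double busy_threshold) {
  Mode mode = Mode::High;
  set_mode(mode);
  for (;;) {
    double f = read_ssc_toggle_mhz();
    if (mode == Mode::High && f < idle_threshold) {
      mode = Mode::Low;             // save energy during the idle phase
      set_mode(mode);
    } else if (mode == Mode::Low && f > busy_threshold) {
      mode = Mode::High;            // race-to-halt on the compute phase
      set_mode(mode);
    }
  }
}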
Figure 6.30: Oscilloscope traces showing an FG-AVS algorithm running on the PMU [68]
(© 2017 IEEE).
[Figure: Hurricane-1 testchip diagram showing the deep n-well (NWELL/PWELL) structure, the back-bias generator, and the eight high-speed serial links (RX/TX) connecting to off-chip FPGA GTX ports and DRAM.]
The floating-point pipeline depth is increased to four to better balance the timing with the
memory pipeline stages. Several improvements have been made to the Hwacha vector accel-
erator that is also part of each core. The machine has been redesigned to target the Hwacha
v4 ISA [136] as shown in Figure 6.32. The new implementation supports optimized mixed-
precision data packing and instructions that increase efficiency for certain workloads. Full
predication supports arbitrary control flow, allowing Hwacha to be targeted as a backend of
an OpenCL compiler for improved programmability. The private vector instruction cache
has been increased in size to 16 KiB. Vector loads and stores are serviced by a dedicated
port to the outer memory system, increasing memory bandwidth. Additional floating-point
hardware resources have been added, increasing the peak performance of the system to 8
double-precision or 16 single-precision FLOPs per cycle, and the floating-point units in both
Rocket and Hwacha are updated to support floating-point division and square roots. All
core SRAMs are implemented using the custom 8T macros first deployed in Raven-3, but
the aspect ratio of the L1 cache memories has been adjusted to improve the utilization of
these macros, which are only available in a single size. Each core is supplied via the in-
tegrated voltage regulator first implemented in Raven-3 with 24 SS-SC unit cells per core.
Clocks are generated per-core by adaptive clock generators as implemented in Raven-4.
The two variable-voltage core domains communicate with the fixed-voltage uncore domain
via level shifters and standard bisynchronous FIFOs. The uncore contains a 256 KiB, 8-bank
L2 cache that serves as a shared, coherent backing store for both cores. Counters track load
and store misses from the L2 cache, providing useful information for FG-AVS feedback.
The system memory map has also been unified such that all system control registers and
counters can be read by either core via a simple memory access. This allows the two cores
to communicate, execute shared-memory programs, and perform per-core or global power
management. While a dedicated power management controller is not included in the design,
Figure 6.33: A feedback diagram showing how FG-AVS algorithms can be implemented on
the Hurricane-1 testchip.
either core can act as a PMU while application programs execute on the other core as shown
in Figure 6.33.
In addition to sending memory requests over the slow HTIF link, the system also im-
plements eight high-speed serial lanes so that higher bandwidth can be achieved by the
memory system. The links implement a tunneled AXI-over-serial interface that can com-
municate with a deserializer on an FPGA so that requests can be handled by the FPGA
memory system. The physical implementation of the TX and RX blocks is compatible with
the GTX interface used by the FPGA. Each link is designed to operate at up to 10 Gbps
DDR, for a peak bandwidth of 80 Gbps. The integrated memory system can be configured
at boot time to direct memory traffic over the slow HTIF link, a single SERDES lane, or all
eight lanes. In the latter case, each of the eight L2 banks sends requests to a dedicated serial
link.
The uncore domain also contains additional IP blocks to round out system functionality.
The integrated body bias generator, power monitors, and waveform reconstruction sensors
are reimplemented from the Raven-4 design. (Both cores share the same deep n-well and so
their body bias voltage is adjusted together.) An all-digital bang-bang PLL can generate
a wide range of clock frequencies from a fixed reference. Because temperature can have
a dramatic effect on device leakage and total system power, distributed thermal sensors
were implemented that can measure the temperature at different locations on the die. The
temperature probes consist of a simple nMOS-only ring oscillator for which both the supply
and back bias voltage are provided by an LDO (see Figure 6.34). The frequency of the
oscillator varies predictably with temperature, and after a one-point calibration procedure
the sensors can measure local temperature for use in power management algorithms. Eleven
of these sensors are implemented in total (four in each core and three in the uncore), but
their simple design makes the area overhead negligible (225 µm2 ). Digital control signals
Figure 6.34: The distributed thermal sensor circuit. (a) shows the nMOS-only standard cells
that make up the ring oscillator shown in (b).
and readout are distributed to each sensor from a central logic block that aggregates the
data from each sensor and applies the calibration based on a lookup table. A much larger
vendor-supplied temperature sensor IP block based on a more traditional band-gap reference
is placed near one of the uncore sensors for a baseline comparison.
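The calibration step can be sketched as below, assuming an approximately linear frequency-temperature relationship with a slope known from characterization (the on-chip logic applies a lookup table instead); the structure and values are illustrative.

// One-point calibration of the ring-oscillator thermal sensor: a single
// (frequency, temperature) pair anchors a pre-characterized slope.
struct ThermalSensorCal {
  double f_cal_mhz;   // oscillator frequency at calibration
  double t_cal_c;     // known die temperature at calibration
  double mhz_per_c;   // characterized slope (typically negative)
};

double temperature_c(const ThermalSensorCal& cal, double f_meas_mhz) {
  return cal.t_cal_c + (f_meas_mhz - cal.f_cal_mhz) / cal.mhz_per_c;
}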
Figure 6.35 shows the floorplan of the SoC. The two cores are tiled in the upper left and
upper right along with the associated per-core IP. The shared memory system and serial
links are in the lower part of the chip, with the serial links distributed because each macro
requires four data and two supply pads. The large chip necessitated the use of a hierarchical,
multiply-instantiated-module place-and-route flow in which some parts of the design were
hardened before integration at the top level. The core design was placed and routed only
once, and then two copies were placed side-by-side in the top-level floorplan. The chip was
fabricated in 28 nm UTBB FD-SOI. An annotated die micrograph is shown in Figure 6.36.
The chip area totals 7.84 mm2, with each core occupying 1.49 mm2.
For the AVS experiments, one core acts as the power management unit as described in
Figure 6.33: Core 1 operates in fixed 1V mode and is clocked synchronously with the
fixed-voltage uncore. Acting
as a PMU, Core 1 polls the L2 load miss counters integrated into the uncore and responds
by actuating a change in the Core 0 voltage via writes to the appropriate memory-mapped
register. The software for both cores is compiled from C++ using the RISC-V toolchain man-
aged by the Linux operating system. After booting Linux, the PMU program is scheduled
onto Core 1 via the taskset command, while application programs are executed on Core 0.
The memory system is configured to use the HTIF interface, with the uncore operating at
170 MHz.
Unfortunately, several issues restrict the possible range of AVS experimentation in the
Hurricane-1 testchip. Due to substantial IR drop in the DCDC power supplies, the cores do
not operate robustly in the 1V 2/3 and 1V 1/2 modes, leaving only the 1.8V 1/2 mode and
the 1V fixed mode as reliable targets for AVS. Furthermore, synchronization issues between
Core 0 and the memory system prevent robust system operation when using the adaptive
clock generator. As a workaround, frequency scaling is achieved by switching the Core 0
clock mux during operation. When Core 0 is operating in the 1V fixed mode, its clock mux
[Figure 6.35: Floorplan of the Hurricane-1 testchip, with two tiles (each containing the Rocket instruction and data caches, vector instruction cache, vector register file, SC-DCDC unit cells, and thermal sensors) above the digital PLL, resiliency test site, and L2 cache.]
is set so that it operates synchronously with the uncore. When the voltage mode changes to
the 1.8V 1/2 mode, the clock mux select is adjusted to use the divided-by-two version of the
uncore clock. Because these two clock sources have a known, fixed phase relation, switching
dynamically between them cannot result in short edges that could cause timing violations.
These limitations and workarounds greatly reduce the possible energy savings of AVS in the
system.
Figure 6.37 shows pseudocode representing two algorithms run on Core 1. Each algorithm
polls the system L2 load miss counter and sets the operating condition of the application
core based on whether L2 misses have been detected. Since Core 1 executes only this
small inner loop, L2 misses are assumed to originate from Core 0, and so indicate a period
of idleness during which the voltage and frequency can be decreased to save energy with
minimal impact on runtime. Algorithm A is event-driven, scaling voltage as soon as possible
after an L2 miss is detected and then holding it low for some fixed duration. Algorithm B
instead polls the counters, waits for a fixed interval, and then polls the counters again to
determine if an L2 miss occurred during this interval. Each algorithm was run while Core 0
executed the Perlbench program from the SPEC2006 benchmark suite [138].
Figure 6.37: Pseudocode for two power-management algorithms run on Hurricane-1. The
ncycles variable, which specifies different interval durations in each algorithm, was swept to
find the greatest energy savings for each algorithm.
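Since the original pseudocode listing is not reproduced in this text, the following C++ sketch reconstructs the two algorithms from the description above; the counter and mode-setting helpers are hypothetical stand-ins for the Hurricane-1 memory-mapped accesses.

#include <cstdint>

// Hypothetical stand-ins for the Hurricane-1 memory-mapped accesses.
uint64_t read_l2_load_miss_counter();
void     set_core0_low();   // 1.8V 1/2 mode with the divided clock
void     set_core0_high();  // fixed 1V mode, synchronous with the uncore
void     wait_cycles(uint64_t n);

// Algorithm A (event-driven): scale down as soon as an L2 miss is seen,
// hold the low mode for ncycles, then restore.
[[noreturn]] void algorithm_a(uint64_t ncycles) {
  uint64_t last = read_l2_load_miss_counter();
  for (;;) {
    uint64_t now = read_l2_load_miss_counter();
    if (now != last) {
      last = now;
      set_core0_low();
      wait_cycles(ncycles);
      set_core0_high();
    }
  }
}

// Algorithm B (interval polling): sample at fixed intervals and run the next
// interval in the low mode iff a miss occurred during the previous one.
[[noreturn]] void algorithm_b(uint64_t ncycles) {
  uint64_t last = read_l2_load_miss_counter();
  for (;;) {
    wait_cycles(ncycles);
    uint64_t now = read_l2_load_miss_counter();
    if (now != last) set_core0_low(); else set_core0_high();
    last = now;
  }
}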
Figure 6.38: Measurement of the Core 0 voltage during AVS algorithm execution.
Figure 6.39: Measurement of the Core 0 voltage after an L2 cache miss triggers voltage
scaling.
Figures 6.38 and 6.39 show the measured Core 0 voltage while executing Algorithm A with
ncycles = 0 so that the shortest possible response could be observed. In Figure 6.38, the PMU
responds to multiple sequential L2 load misses by repeatedly lowering the operating voltage
for a short time. Figure 6.39 shows a single voltage-scaling event. The overall response
time of the feedback loop cannot be known, because the exact time of the L2 cache miss
that triggers Core 0 idleness cannot be measured externally. However, the overall scaling
event duration is just 702 ns, showing the ability of the PMU to execute multiple changes in
operating mode in rapid succession.
Figure 6.40 and Tables 6.5 and 6.6 show the measured results of the AVS algorithms.
Application core energy was determined by measuring program runtime and the voltage and
current of the input DCDC supplies and subtracting the estimated power consumed by the
PMU core. (Core 1 power cannot be measured directly because all of the voltage regulators
on the chip share input supply rails.) Overall, Algorithm A is better able to save energy
with minimal impact on runtime, while the energy savings of Algorithm B are achieved only
as runtime increases. In the best case of Algorithm A compiled with ncycles = 300, AVS is
able to reduce application core energy by 6% relative to the baseline with a small (0.5%)
impact on performance.
The energy savings of these AVS algorithms could be greatly improved with more robust
chip functionality. Faster PMU and uncore operating frequencies would allow more respon-
sive execution of the algorithms. Operating the core at lower voltage modes during idle
Figure 6.40: Measured application core energy and benchmark execution time for the fixed
operating modes, ideal dithering, and the two AVS algorithms. Energy measurements are
normalized to the fixed-1V operating mode.
Table 6.5: The effects of AVS Algorithm A on runtime and energy, as well as the proportion
of runtime spent in the lower-voltage 1.8V 1/2 mode. Data from the fixed 1V and 1.8V 1/2
modes are included for comparison.
periods would save additional energy. The use of adaptive clock generation would greatly
improve the conversion efficiency of the SS-SC converters, reducing the effective conversion
losses that erode energy savings at the lower voltage modes. Furthermore, the relative com-
pletion time of the benchmark at each fixed operating mode suggests that the core is idle
for additional periods of time not captured by the L2 load miss counters. More robust
counter instrumentation could allow for more sophisticated algorithms that better estimate
Table 6.6: The effects of AVS Algorithm B on runtime and energy, as well as the proportion
of runtime spent in the lower-voltage 1.8V 1/2 mode.
[Figure: Block diagrams of the Hurricane-2 Rocket voltage domain (including the Rocket CPU, DMA engine, adaptive clock generator, clock counters, and asynchronous FIFOs with level shifters to the L1-to-L2 crossbar) and of the memory copy accelerator (frontend with TLB and control registers, reader and writer backends with trackers, buffer, and arbiter).]

The Rocket voltage domain also contains a memory copy accelerator, which can move data
from one memory location to another at high bandwidth. The copy accelerator improves on traditional designs because it
can perform its own virtual address translation and can access arbitrary memory addresses,
rather than being restricted to the range associated with a particular device. Figure 6.42
shows a block diagram of the memory copy accelerator.
The Hwacha voltage domain contains a version of the vector accelerator. The most
significant implementation change from Hurricane-1 is the implementation of a second vector
lane, which allows increased data-level parallelism without requiring code recompilation. The
second lane doubles the available floating-point resources, allowing a peak throughput of 16
double-precision FLOPs per cycle for the entire vector unit. Each lane has its own port
into the shared memory system, increasing available bandwidth of data into and out of the
accelerator so that data movement can keep pace with its arithmetic capacity. The vector
instruction cache was reduced in size to 4 KiB as simulations showed no performance penalty
from this reduction. Unlike in previous designs, Rocket and Hwacha use vendor-supplied
SRAM macros, reducing area but possibly limiting functionality at low voltages.
The Rocket and Hwacha domains are supplied by the integrated voltage regulators first
implemented in Raven-3, and their clocks are provided by the adaptive clock generator first
implemented in Raven-4. The flying capacitance was divided between the two varying-
voltage domains according to their area, so 8 unit cells were assigned to the Rocket domain
and 28 unit cells were assigned to the Hwacha domain. As noted in Section 6.2.2, the necessity
of supplying a single external reference voltage limited efficiency in the case of rapid toggling
between different voltage modes. The Vref that achieves peak energy efficiency also varies
with workload as shown by the results in Section 6.1.2. The SS-SC controller in Hurricane-2
implements an integrated digital-to-analog converter (DAC) that can generate the necessary
reference voltage for use by the comparator. The DAC can rapidly switch between different
output voltages following SS-SC mode changes, and can also be adjusted by the PMU to
optimize converter efficiency as the workloads of the application cores vary over time.
The fixed-voltage uncore domain contains the shared memory system, PMU processor,
and other IP blocks. The L2 cache consists of four banks with a total capacity of 256 KiB
that is shared and coherent between the two processors via a broadcast-based MESI pro-
tocol. Each bank can have up to four non-conflicting transactions in flight at once using a
hardware construct called a tracker. A shared memory-mapped IO router directs memory-
mapped requests throughout the system. In contrast to the PMU in the Raven-4 design, the
Hurricane-2 PMU is a minimal version of the five-stage Rocket processor. The PMU proces-
sor implements the RV64IM ISA, and includes a slow integer multiplier, a 4 KiB instruction
cache, and a 4 KiB data scratchpad. A JTAG debug interface is also implemented to access
a debug port in the Rocket core.
The HTIF interface was redesigned in Hurricane-2 to provide more flexibility and robust-
ness in the operation of the memory system. The digital interface to the FPGA now tunnels
serialized memory traffic using the TileLink protocol [139]. The interface is implemented so
that memory requests are serviced but the FPGA can still issue its own memory requests
to read and write memory-mapped registers on chip. Eight high-speed SERDES lanes are
rearchitected from Hurricane-1 to improve performance and to implement the TileLink pro-
tocol. A fully configurable memory traffic switcher can direct memory traffic to this slow
digital interface, over one or more high-speed serial links, or to the DDR controller for
eventual consumption by dedicated DRAM.
The DDR controller and PHY are third-party IP implemented to interface with a 512 MB
DDR4 SDRAM to provide a realistic memory system for large workloads. The TileLink
connection from the memory switcher is converted to AXI for use by the memory controller,
and an AHB port is used to configure the controller and PHY for operation. Because the
DDR PHY and companion DRAM must operate at specific frequencies, the AXI connection
implements bisynchronous queues, allowing the DDR controller and PHY to operate in their
own clock domain.
The Hurricane-2 testchip implements many of the counters described in Section 4.1.2 for
use in integrated power management. The Rocket voltage domain includes L1 data cache
hit counters, load miss counters, store miss counters, and AMO miss counters, as well as
L1 instruction cache miss counters and a wait-for-interrupt counter. The Hwacha voltage
domain has counters that track the number of memory, ALU, and predication operations
active in the master sequencer, the number of outstanding memory operations in the vector
register unit, and the depth of each of the RoCC request and response queues. The uncore
counters track the hit and miss counts of each client for each L2 bank, as well as the number
of active trackers in each bank. Counters also track the frequency of each clock in the design,
including the SS-SC toggle clocks that can be used to estimate power for each voltage domain.
All of these counters are memory-mapped and accessible by the PMU in just a few cycles.
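Because the counters are memory-mapped, sampling one from the PMU amounts to a single load; a sketch with a hypothetical base address (the real Hurricane-2 memory map is not reproduced here) is shown below.

#include <cstdint>

// Hypothetical base address for the counter bank.
constexpr uintptr_t COUNTER_BASE = 0x60000000u;

// Reading a counter is one load, so the PMU can sample it in a few cycles
// and compute event rates over its polling interval.
inline uint64_t read_counter(unsigned index) {
  auto* regs = reinterpret_cast<volatile uint64_t*>(COUNTER_BASE);
  return regs[index];
}

inline uint64_t delta_since_last(unsigned index, uint64_t& last) {
  uint64_t now = read_counter(index);
  uint64_t d = now - last;
  last = now;
  return d;
}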
The floorplan of the Hurricane-2 testchip is shown in Figure 6.43. The top-level layout
was completely redesigned to accommodate the large DDR PHY IP, which has a large size
and irregular aspect ratio. The Rocket and Hwacha layouts also had to be changed to fit
within the new floorplan and to place the heterogeneous SRAM macros which now form their
caches. The uncore voltage domain is overprovisioned in area, partly due to the irregular
aspect ratio of the PHY that made cross-chip routing difficult. The PHY macro includes
its own power and signal bumps, and the remaining area implements its own bumps in a
three-sided ring. Wirebond pads are also included as a backup packaging method, although
the DDR PHY will not function if the chip is packaged in this way. The chip is being
fabricated in 28 nm UTBB FD-SOI. The chip area totals 17.3 mm2, with the Rocket
domain occupying 0.51 mm2, the Hwacha domain taking up 2.14 mm2, and the DDR PHY
using 5.61 mm2.
Figure 6.44: Pseudocode for a PMU program to demonstrate FG-AVS in time on Hurricane-
2.
Figure 6.45 shows time-series results of the AVS program as the application core executes
three short benchmarks. The instructions retired counter indicates forward progress by the
Rocket core; there is strong correlation between an increment of the L2 miss counter and a
decrease in the rate of instructions retired. The PMU is able to sense the L2 miss counter
increment and actuate a change in the Rocket operating mode in tens of cycles or less,
demonstrating the ability of the Hurricane-2 system to achieve fine temporal granularity for
AVS. The algorithm saves energy in each benchmark by setting the Rocket domain to the
simulated low-voltage operating mode for part of the operation while negligibly impacting
benchmark runtime.
The energy model presented in [141] can be used to estimate the energy savings of the
PMU algorithm. This model assumes that voltage can be reduced during periods of inactivity
without impacting runtime, as is the case here, and accounts for leakage and dynamic energy
savings as well as regulator conversion efficiencies. Table 6.7 shows the parameter values used
to inform the model, many of which are based on measurements from prior testchips. The
energy savings for each benchmark are proportional to the length of time spent in the low-
voltage operating mode. Table 6.8 shows the energy savings for the Rocket domain calculated
from the model for each benchmark. The benchmarks attain an average energy savings of
10.1%.
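The exact model of [141] is not reproduced here, but a simplified duty-cycle form conveys the idea: dynamic energy scales with V^2 f, leakage power scales roughly with voltage, and the voltage can be lowered for a fraction of the runtime without affecting completion time. The sketch below omits the regulator conversion-efficiency terms for brevity, and all parameter names are illustrative.

struct EnergyParams {
  double v_hi, v_lo;   // high and low operating voltages
  double f_hi, f_lo;   // corresponding clock frequencies
  double p_dyn_hi;     // dynamic power at (v_hi, f_hi)
  double p_leak_hi;    // leakage power at v_hi
};

// Fraction of energy saved when a fraction d of the runtime is spent in the
// low-voltage mode (assumed not to affect completion time).
double energy_savings_fraction(const EnergyParams& p, double d) {
  double vr = p.v_lo / p.v_hi;
  double p_dyn_lo  = p.p_dyn_hi * vr * vr * (p.f_lo / p.f_hi);  // ~ C V^2 f
  double p_leak_lo = p.p_leak_hi * vr;   // crude linear leakage scaling
  double p_base = p.p_dyn_hi + p.p_leak_hi;
  double p_avs  = (1.0 - d) * p_base + d * (p_dyn_lo + p_leak_lo);
  return 1.0 - p_avs / p_base;
}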
Figure 6.45: Execution traces with counter values for median (a), vector-vector add (b), and
sparse matrix-vector multiplication (c) benchmarks executing on the application core while
running the AVS algorithm on the PMU core. The cycle count is measured in uncore cycles,
which are invariant to core clock frequency changes.
Table 6.7: Parameters used in the energy model from [141] to calculate simulated energy
savings.
Table 6.8: Energy savings for the Rocket voltage domain resulting from the implementation
of FG-AVS.
Two additional power-management algorithms take advantage of a third simulated operating point at an intermediate voltage. The first algorithm polls both the L2 cache misses sourced from the
Rocket data cache and L1 data cache misses directly. The latter may indicate a forthcoming
L2 miss, and it likely indicates a brief period of processor idleness even in the case of an
L2 hit. The core clock frequency is reduced to one-fourth the uncore clock frequency when
an L2 miss is detected and one-half the base frequency if an L1 data cache miss is detected
but an L2 miss is not. If neither event occurs within the polling interval, the frequency is
increased to match the uncore clock rate. This algorithm is intended to behave similarly to
the algorithm shown in Figure 6.44 when an L2 miss is detected, but it will spend additional
time in the medium-voltage state that may save additional energy.
The second algorithm to take advantage of multiple voltage levels is actuated only by
L2 cache misses sourced from the Rocket data cache. It modifies the algorithm shown in
Figure 6.44 by only reducing the core frequency to one-fourth the uncore frequency when
multiple L2 misses are detected within the polling interval. If a single cache miss is detected,
the core frequency is instead reduced to one-half the uncore frequency. This algorithm will
spend less time in the low-voltage state, but it may better track the performance impact of
a single L2 cache miss, which may not warrant the full reduction to the lower frequency.
Figure 6.47 and Table 6.9 show the results of these algorithms running while the Rocket
core executes the SPMV microbenchmark, including the energy savings of the programs as
estimated by the energy model.
Figure 6.46: Pseudocode for two power-management algorithms that take advantage of mul-
tiple AVS levels.
Table 6.9: Energy savings for the Rocket voltage domain resulting from the implementation
of FG-AVS with multiple simulated voltage levels.
Figure 6.47: Execution traces with counter values for the two multi-level PMU algorithms.
The first algorithm is actuated by L1 data cache load misses (a), and the second by counting
multiple L2 cache misses (b).
These results suggest that designs implementing many voltage levels (or a continuous voltage
range) for FG-AVS are likely overdesigned.
Figure 6.48: Pseudocode for a PMU program to demonstrate FG-AVS in space on Hurricane-2.
Figure 6.49: An execution trace showing the RoCC queue counters used for feedback in the
AVS algorithm.
Table 6.10: Energy savings for each voltage algorithm resulting from FG-AVS as the appli-
cation core executes a matrix-multiply benchmark.
The benchmark alternates between two phases: one in which only the Rocket core is doing
useful work, and one in which Hwacha is active and Rocket is only feeding instructions to
Hwacha. The PMU algorithm correctly detects these phases and performs the appropriate
scaling, with a response time of tens of cycles and a negligible performance penalty.
Table 6.10 shows the energy savings of the PMU program as estimated by the model in [141].
The relative power of the two domains as reported by the place-and-route tool can be used to
estimate an overall energy savings of 46.0% for the full system. Most of these savings result
from reducing the operating voltage and frequency of Hwacha when it is not in use. These
results demonstrate the utility of fine-grained spatial AVS in saving energy when different
parts of the SoC transition between different levels of activity.
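A sketch of such a spatial policy follows; Figure 6.48 gives the actual pseudocode, and the accessor names here are hypothetical placeholders for the RoCC queue counters of Figure 6.49 and the per-domain operating-mode controls.

#include <stdint.h>

extern uint64_t read_rocc_cmd_counter(void);  /* hypothetical */
extern void set_hwacha_low_power(int enable); /* hypothetical */
extern void set_rocket_low_power(int enable); /* hypothetical */

void spatial_avs_loop(void) {
    uint64_t cmds = read_rocc_cmd_counter();
    for (;;) {
        uint64_t now = read_rocc_cmd_counter();
        if (now != cmds) {
            /* Vector commands are flowing: Hwacha is doing the useful
               work, and Rocket mostly feeds it instructions. */
            set_hwacha_low_power(0);
            set_rocket_low_power(1);
        } else {
            /* No vector activity: run Rocket at full speed and drop
               the idle Hwacha domain to the low-voltage mode. */
            set_hwacha_low_power(1);
            set_rocket_low_power(0);
        }
        cmds = now;
    }
}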
Chapter 7
Conclusion
Fine-grained adaptive voltage scaling is key to improving energy efficiency in modern SoCs,
but numerous technical hurdles have impeded the widespread adoption of this technique.
Integrated voltage regulators, necessary to supply the many voltages required for FG-AVS,
are challenging to implement with high efficiency and power density. Clock generation and
data synchronization must be carefully considered to ensure functional correctness and op-
erating efficiency. All of these design decisions impose overheads that offset some of the
energy savings gained by the approach.
This work has presented circuit designs and system implementations that demonstrate the
feasibility of FG-AVS. Because the testchips are fully featured integrated systems, all
overheads of FG-AVS are necessarily accounted for; the 28 nm SoC testchips presented here
nonetheless show that FG-AVS is practical to implement and can save energy in silicon. The
contributions of this work include:
• A thorough overview of the challenges of adaptive voltage scaling, with particular focus
on barriers to widespread industry adoption (Chapter 2).
• A survey of integrated power management options for FG-AVS, including proposals for
counter-based power and activity measurement, an integrated control processor, and
power-management algorithms to perform adaptive feedback (Chapter 4).
• The Hurricane-1 testchip implementation (Section 6.3), which proved the ability to
supply multiple independent voltage domains via integrated regulation and showed
the feasibility of counter-based power management.
Several directions for future work remain:
• The Hurricane-2 testchip is still in fabrication at the time of writing. Once the chip is
fabricated and packaged, the algorithms for FG-AVS simulated in Section 6.4.2 should
be executed in silicon to measure the actual performance and energy implications.
FG-AVS can also be evaluated in the context of more realistic workloads such as
convolutional neural nets and graph traversal.
• This work does not consider the alternative power-reduction approach of power gating,
in which the voltage of blocks is reduced to zero when they are totally inactive. Power
gating requires additional hardware support, such as isolation cells and state retention
logic, to safely implement. Silicon implementations could compare the energy savings
and implementation costs of power gating to those of FG-AVS. Circuit designs capable
of both power gating and FG-AVS, as well as the appropriate algorithms to take
advantage of them, could also be explored.
• The testchips presented in this work do not methodically evaluate the di/dt implica-
tions of integrated voltage regulation. The digital loads supplied by the regulators are
likely too small to exhibit significant di/dt noise, and additional circuitry would be
required to measure the di/dt effects across the power grid. Dedicated “di/dt virus”
circuitry and the appropriate instrumentation could be added to a future testchip to
determine the extent to which integrated regulators improve resilience to di/dt droop.
Reconfigurable power delivery has also been proposed in the literature and evaluated in
simulation [112], but silicon evaluation could prove the feasibility of the concept.
• The pausible bisynchronous FIFO presented in Section 5.3 was evaluated only in simu-
lation. A silicon implementation would confirm the utility of the approach in reducing
asynchronous interface latencies.
• This work does not consider the use of fully asynchronous logic, such as bundled-
data or quasi-delay-insensitive logic styles. However, in many ways, asynchronous
logic is ideally suited to FG-AVS. With no clock, the challenges of clock generation
and synchronization are implicitly solved; without these constraints, the number of
independent voltage domains can be increased arbitrarily, limited only by the number of
independent voltage regulators that can be implemented. Asynchronous logic imposes
its own design overheads, so a future silicon implementation could compare the GALS
approach considered in this work with a fully asynchronous design.
• The simulations presented in Section 6.4.2 are limited in duration by the slow speed
of RTL simulation. Faster simulation can be accomplished with the use
of FPGAs, but care must be taken to design the simulation environment so that the
performance tradeoffs of FG-AVS are correctly simulated. FPGA-based architecture
simulators such as RAMP Gold [142] could be extended to model FG-AVS, allowing
longer, more realistic benchmarking of FG-AVS algorithms.
The study of FG-AVS and its associated enabling technologies will continue to drive
significant circuit and architecture innovation in the years to come.
Bibliography
[25] E. Burton et al., “FIVR — Fully integrated voltage regulators on 4th generation Intel
Core SoCs,” in Proceedings of the Applied Power Electronics Conference, Mar. 2014,
pp. 432–439.
[26] A. Nalamalpu et al., “Broadwell: A family of IA 14nm processors,” in Proceedings of
the Symposium on VLSI Circuits, Jun. 2015, pp. C314–C315.
[27] A. Grenat et al., “Increasing the performance of a 28nm x86-64 microprocessor
through system power management,” in IEEE International Solid-State Circuits Con-
ference Digest of Technical Papers, Jan. 2016, pp. 74–75.
[28] C. Gonzalez et al., “Power9: A processor family optimized for cognitive computing
with 25Gb/s accelerator links and 16Gb/s PCIe Gen4,” in IEEE International Solid-
State Circuits Conference Digest of Technical Papers, Feb. 2017, pp. 50–51.
[29] T. Singh et al., “Zen: A next-generation high-performance x86 core,” in IEEE In-
ternational Solid-State Circuits Conference Digest of Technical Papers, Feb. 2017,
pp. 52–53.
[30] Z. Toprak-Deniz et al., “Distributed system of digitally controlled microregulators
enabling per-core DVFS for the POWER8 microprocessor,” in IEEE International
Solid-State Circuits Conference Digest of Technical Papers, Feb. 2014, pp. 98–99.
[31] “Switching regulator fundamentals,” Texas Instruments, Tech. Rep. SNVA559A, Sep.
2016.
[32] D. S. Gardner et al., “Integrated on-chip inductors with magnetic films,” in Proceed-
ings of the International Electron Devices Meeting, Dec. 2006.
[33] J. Lee, G. Hatcher, L. Vandenberghe, and C. K. K. Yang, “Evaluation of fully-
integrated switching regulators for CMOS process technologies,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 9, pp. 1017–1027, Sep.
2007.
[34] H. Krishnamurthy et al., “A 500 MHz, 68% efficient, fully on-die digitally controlled
buck voltage regulator on 22nm tri-gate CMOS,” in Proceedings of the Symposium on
VLSI Circuits, Jun. 2014, pp. 167–168.
[35] M. Seeman, V. Ng, H.-P. Le, M. John, E. Alon, and S. Sanders, “A comparative
analysis of switched-capacitor and inductor-based DC-DC conversion technologies,”
in Proceedings of the IEEE Workshop on Control and Modeling for Power Electronics,
Jun. 2010.
[36] M. D. Seeman, “A design methodology for switched-capacitor DC-DC converters,”
PhD thesis, Department of Electrical Engineering and Computer Sciences, University
of California, Berkeley, May 2009.
[37] L. G. Salem and P. P. Mercier, “A battery-connected 24-ratio switched capacitor
PMIC achieving 95.5%-efficiency,” in Proceedings of the Symposium on VLSI Circuits,
Jun. 2015, pp. C340–C341.
[51] A. Sai, S. Kondo, T. T. Ta, H. Okuni, M. Furuta, and T. Itakura, “A 65nm CMOS
ADPLL with 360uW 1.6ps-INL SS-ADC-based period-detection-free TDC,” in IEEE
International Solid-State Circuits Conference Digest of Technical Papers, Jan. 2016,
pp. 336–337.
[52] T. Jang, S. Jeong, D. Jeon, K. D. Choo, D. Sylvester, and D. Blaauw, “A 2.5ps
0.8-to-3.2GHz bang-bang phase- and frequency-detector-based all-digital PLL with
noise self-adjustment,” in IEEE International Solid-State Circuits Conference Digest
of Technical Papers, Feb. 2017, pp. 148–149.
[53] H. Cho et al., “A 0.0047mm2 highly synthesizable TDC- and DCO-less fractional-N
PLL with a seamless lock range of fREF to 1GHz,” in IEEE International Solid-State
Circuits Conference Digest of Technical Papers, Feb. 2017, pp. 154–155.
[54] A. Elkholy, A. Elmallah, M. Elzeftawi, K. Chang, and P. K. Hanumolu, “A 6.75-to-
8.25GHz, 250fsrms-integrated-jitter 3.25mW rapid on/off PVT-insensitive fractional-N
injection-locked clock multiplier in 65nm CMOS,” in IEEE International Solid-State
Circuits Conference Digest of Technical Papers, Jan. 2016, pp. 192–193.
[55] S. Choi, S. Yoo, and J. Choi, “A 185fsrms-integrated-jitter and -245dB FOM PVT-
robust ring-VCO-based injection-locked clock multiplier with a continuous frequency-
tracking loop using a replica-delay cell and a dual-edge phase detector,” in IEEE
International Solid-State Circuits Conference Digest of Technical Papers, Jan. 2016,
pp. 194–195.
[56] S. Kundu, B. Kim, and C. H. Kim, “A 0.2-to-1.45GHz subsampling fractional-N all-
digital MDLL with zero-offset aperture PD-based spur cancellation and in-situ timing
mismatch detection,” in IEEE International Solid-State Circuits Conference Digest
of Technical Papers, Jan. 2016, pp. 326–327.
[57] H. Kim, Y. Kim, T. Kim, H. Park, and S. Cho, “A 2.4GHz 1.5mW digital MDLL
using pulse-width comparator and double injection technique in 28nm CMOS,” in
IEEE International Solid-State Circuits Conference Digest of Technical Papers, Jan.
2016, pp. 328–329.
[58] H. C. Ngo, K. Nakata, T. Yoshioka, Y. Terashima, K. Okada, and A. Matsuzawa, “A
0.42ps-jitter -241.7dB-FOM synthesizable injection-locked PLL with noise-isolation
LDO,” in IEEE International Solid-State Circuits Conference Digest of Technical
Papers, Feb. 2017, pp. 150–151.
[59] D. Coombs, A. Elkholy, R. K. Nandwana, A. Elmallah, and P. K. Hanumolu, “A
2.5-to-5.75GHz 5mW 0.3psrms-jitter cascaded ring-based digital injection-locked clock
multiplier in 65nm CMOS,” in IEEE International Solid-State Circuits Conference
Digest of Technical Papers, Feb. 2017, pp. 152–153.
[60] B. Stackhouse et al., “A 65 nm 2-billion transistor quad-core Itanium processor,”
IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 18–31, Jan. 2009.
[84] Y. Sinangil et al., “A self-aware processor SoC using energy monitors integrated into
power converters for self-adaptation,” in Proceedings of the Symposium on VLSI Cir-
cuits, Jun. 2014, pp. 139–140.
[85] A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, “Power management
architecture of the 2nd generation Intel Core microarchitecture, formerly codenamed
Sandy Bridge,” in Hot Chips, Aug. 2011.
[86] D. C. Snowdon, S. M. Petters, and G. Heiser, “Accurate on-line prediction of pro-
cessor and memory energy usage under voltage scaling,” Proceedings of the IEEE
International Conference on Embedded Software, pp. 84–93, Sep. 2007.
[87] T. Webel et al., “Robust power management in the IBM z13,” IBM Journal of Re-
search and Development, vol. 59, no. 4/5, 16:1–16:12, Jul. 2015.
[88] S. Bird, “Software knows best: A case for hardware transparency and measurabil-
ity,” Master’s thesis, Department of Electrical Engineering and Computer Sciences,
University of California, Berkeley, May 2010.
[89] D. Horner, RISC V - low power instructions, RISC-V Hardware Development Mailing
List, Feb. 2017. [Online]. Available: https://groups.google.com/a/groups.riscv.org/d/msg/hw-dev/fmn3ux_XLs0/kkPkJnStAwAJ.
[90] P. Juang, Q. Wu, L.-S. Peh, M. Martonosi, and D. Clark, “Coordinated, distributed,
formal energy management of chip multiprocessors,” in Proceedings of the ACM/IEEE
International Symposium on Low Power Electronics and Design, Aug. 2005, pp. 127–130.
[91] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic voltage scaling on a low-power
microprocessor,” in Proceedings of the International Conference on Mobile Computing
and Networking, Jul. 2001, pp. 251–259.
[92] F. Xie, M. Martonosi, and S. Malik, “Compile-time dynamic voltage scaling settings:
Opportunities and limits,” in Proceedings of the ACM Conference on Programming
Language Design and Implementation, Jun. 2003, pp. 49–62.
[93] C.-H. Hsu and U. Kremer, “The design, implementation, and evaluation of a compiler
algorithm for CPU energy reduction,” in Proceedings of the ACM Conference on
Programming Language Design and Implementation, Jun. 2003, pp. 38–48.
[94] Q. Wu et al., “A dynamic compilation framework for controlling microprocessor en-
ergy and performance,” in Proceedings of the IEEE/ACM International Symposium
on Microarchitecture, Nov. 2005, pp. 271–282.
[95] Y. Lee, “Z-scale: Tiny 32-bit RISC-V systems,” in OpenRISC Conference, Oct. 2015.
[96] J. Sharkey, A. Buyuktosunoglu, and P. Bose, “Evaluating design tradeoffs in on-
chip power management for CMPs,” in Proceedings of the ACM/IEEE International
Symposium on Low Power Electronics and Design, Aug. 2007, pp. 44–49.
[97] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark, “Voltage and frequency control with
adaptive reaction time in multiple-clock-domain processors,” in Proceedings of the
IEEE Symposium on High-Performance Computer Architecture, Feb. 2005, pp. 178–
189.
[98] R. Efraim, R. Ginosar, C. Weiser, and A. Mendelson, “Energy aware race to halt:
A down to EARtH approach for platform energy management,” IEEE Computer
Architecture Letters, vol. 13, no. 1, pp. 25–28, Jan. 2014.
[99] T. Yuki and S. Rajopadhye, “Folklore confirmed: Compiling for speed = compiling for
energy,” in Proceedings of the International Workshop on Languages and Compilers
for Parallel Computing, Sep. 2014, pp. 169–184.
[100] D. Snowdon, S. Ruocco, and G. Heiser, Power management and dynamic voltage
scaling: Myths and facts, 2005.
[101] G. Dhiman, K. K. Pusukuri, and T. Rosing, “Analysis of dynamic voltage scaling for
system level energy management,” in Proceedings of the Workshop on Power Aware
Computing and Systems, Dec. 2008.
[102] A. Bhattacharjee and M. Martonosi, “Thread criticality predictors for dynamic per-
formance, power, and resource management in chip multiprocessors,” in Proceedings
of the ACM/IEEE International Symposium on Computer Architecture, Jun. 2009,
pp. 290–301.
[103] G. Keramidas, V. Spiliopoulos, and S. Kaxiras, “Interval-based models for run-time
DVFS orchestration in superscalar processors,” in Proceedings of the ACM Interna-
tional Conference on Computing Frontiers, May 2010, pp. 287–296.
[104] E. Talpes and D. Marculescu, “Toward a multiple clock/voltage island design style for
power-aware processors,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 13, no. 5, pp. 591–603, May 2005.
[105] G. Dhiman and T. S. Rosing, “Dynamic voltage frequency scaling for multi-tasking
systems using online learning,” in Proceedings of the ACM/IEEE International Sym-
posium on Low Power Electronics and Design, Aug. 2007, pp. 207–212.
[106] K. Rajamani, H. Hanson, J. Rubio, S. Ghiasi, and F. Rawson, “Application-aware
power management,” in Proceedings of the IEEE International Symposium on Work-
load Characterization, Oct. 2006, pp. 39–48.
[107] K. Choi, R. Soma, and M. Pedram, “Fine-grained dynamic voltage and frequency
scaling for precise energy and performance tradeoff based on the ratio of off-chip
access to on-chip computation times,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 24, no. 1, pp. 18–28, Jan. 2005.
[108] L. Guang, E. Nigussie, L. Koskinen, and H. Tenhunen, “Autonomous DVFS on supply
islands for energy-constrained NoC communication,” in Proceedings of the Interna-
tional Conference on Architecture of Computing Systems, Mar. 2009, pp. 183–194.
[109] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark, “Formal online methods for volt-
age/frequency control in multiple clock domain microprocessors,” in Proceedings of
the International Conference on Architectural Support for Programming Languages
and Operating Systems, Oct. 2004, pp. 248–259.
[110] D. Marculescu, “On the use of microarchitecture-driven dynamic voltage scaling,” in
Proceedings of the Workshop on Complexity-Effective Design, Jun. 2000.
[111] H. Li, C. Y. Cher, K. Roy, and T. N. Vijaykumar, “Combined circuit and architectural
level variable supply-voltage scaling for low power,” IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 13, no. 5, pp. 564–575, May 2005.
[112] W. Godycki, C. Torng, I. Bukreyev, A. Apsel, and C. Batten, “Enabling realistic fine-
grain voltage scaling with reconfigurable power distribution networks,” in Proceedings
of the IEEE/ACM International Symposium on Microarchitecture, Dec. 2014, pp. 381–
393.
[113] P. Stanley-Marbell, M. S. Hsiao, and U. Kremer, “A hardware architecture for dy-
namic performance and energy adaptation,” in Proceedings of the International Con-
ference on Power-aware Computer Systems, Feb. 2003, pp. 33–52.
[114] R. Ginosar, “Fourteen ways to fool your synchronizer,” in Proceedings of the IEEE
International Symposium on Asynchronous Circuits and Systems, Mar. 2003, pp. 89–
96.
[115] C. E. Cummings, “Simulation and synthesis techniques for asynchronous FIFO de-
sign,” in Synopsys Users Group Conference User Papers, 2002.
[116] A. Chakraborty and M. Greenstreet, “Efficient self-timed interfaces for crossing clock
domains,” in Proceedings of the IEEE International Symposium on Asynchronous
Circuits and Systems, Mar. 2003, pp. 78–88.
[117] W. J. Dally and S. G. Tell, “The even/odd synchronizer: A fast, all-digital, periodic
synchronizer,” in Proceedings of the IEEE Symposium on Asynchronous Circuits and
Systems, May 2010, pp. 75–84.
[118] K. Yun and R. Donohue, “Pausible clocking: A first step toward heterogeneous sys-
tems,” in Proceedings of the IEEE International Conference on Computer Design,
Oct. 1996, pp. 118–123.
[119] R. Mullins and S. Moore, “Demystifying data-driven and pausible clocking schemes,”
in Proceedings of the IEEE International Symposium on Asynchronous Circuits and
Systems, Mar. 2007, pp. 175–185.
[120] B. Keller, M. Fojtik, and B. Khailany, “A pausible bisynchronous FIFO for GALS
systems,” in Proceedings of the IEEE International Symposium on Asynchronous Cir-
cuits and Systems, May 2015, pp. 1–8.
[121] S. Moore, G. Taylor, R. Mullins, and P. Robinson, “Point to point GALS intercon-
nect,” in Proceedings of the IEEE International Symposium on Asynchronous Circuits
and Systems, Apr. 2002, pp. 69–75.
[122] I. E. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6,
pp. 720–738, Jun. 1989.
[123] I. Sutherland and S. Fairbanks, “GasP: A minimal FIFO control,” in Proceedings
of the IEEE International Symposium on Asynchronous Circuits and Systems, Mar.
2001, pp. 46–53.
[124] M. Singh and S. Nowick, “MOUSETRAP: High-speed transition-signaling asynchronous
pipelines,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15,
no. 6, pp. 684–698, Jun. 2007.
[125] A. E. Sjogren and C. J. Myers, “Interfacing synchronous and asynchronous modules
within a high-speed pipeline,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 8, no. 5, pp. 573–583, Oct. 2000.
[126] X. Fan, M. Krstić, and E. Grass, “Analysis and optimization of pausible clocking based
GALS design,” in Proceedings of the IEEE International Conference on Computer
Design, Oct. 2009, pp. 358–365.
[127] K. Asanović et al., “The Rocket Chip Generator,” Department of Electrical Engineer-
ing and Computer Sciences, University of California, Berkeley, Tech. Rep. UCB/EECS-
2016-17, Apr. 2016.
[128] Y. Lee et al., “A 45nm 1.3GHz 16.7 double-precision GFLOPS/W RISC-V processor
with vector accelerators,” in Proceedings of the European Solid-State Circuits Con-
ference, Sep. 2014, pp. 199–202.
[129] R. M. Russell, “The CRAY-1 computer system,” Communications of the ACM, vol. 21,
no. 1, pp. 63–72, Jan. 1978.
[130] H. P. Le, J. Crossley, S. R. Sanders, and E. Alon, “A sub-ns response fully inte-
grated battery-connected switched-capacitor voltage regulator delivering 0.19W/mm2
at 73% efficiency,” in IEEE International Solid-State Circuits Conference Digest of
Technical Papers, Feb. 2013, pp. 372–373.
[131] P. Flatresse et al., “Ultra-wide body-bias range LDPC decoder in 28nm UTBB FD-
SOI technology,” in IEEE International Solid-State Circuits Conference Digest of
Technical Papers, Feb. 2013, pp. 424–425.
[132] Zedboard. [Online]. Available: http://zedboard.org/product/zedboard (visited on
08/10/2017).
[133] D. Jacquet et al., “A 3 GHz dual core processor ARM Cortex-A9 in 28 nm UTBB
FD-SOI CMOS with ultra-wide voltage range and energy efficiency optimization,”
IEEE Journal of Solid-State Circuits, vol. 49, no. 4, pp. 812–826, Apr. 2014.