Design Space Exploration of Deep Learning
Accelerators
A journey from single MAC units to multi-core accelerator systems
Linyan MEI
August 2023
© 2023 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Linyan Mei, ESAT-MICAS, Kasteelpark Arenberg 10, B-3001 Leuven (Belgium)
Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden
door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande
schriftelijke toestemming van de uitgever.
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm,
electronic or any other means without written permission from the publisher.
Acknowledgements
As seasons shifted, years passed swiftly. I’m approaching the end of my seventh
year in this charming and delightful town of Leuven, which is also the fifth year
of my Ph.D. journey, and now it’s time to say goodbye.
During these years in a foreign land, a kaleidoscope of experiences unfolded.
There were joys and sorrows, meetings and farewells, and moments of
determination and doubt. Countless times I have felt how lucky I am. Along
the winding and lengthy path of the journey, many great people have passed by
me, singing with me when I was happy, lending me a hand when I felt powerless,
and illuminating my way when I was lost.
This experience made me grow so much: gaining professional knowledge,
encountering like-minded people, knowing how to work with others, and
enhancing my understanding of the world and of myself. I am truly grateful for
it and firmly believe that no matter how many years pass, I will forever cherish
this land, this period of time, and all of you who walked alongside me.
There is so much I want to say. Let me take the opportunity to reflect upon
our time together and express my heartfelt gratitude to you.
First of all, I would like to thank my Ph.D. supervisor, Prof. Marian Verhelst.
Marian, you have brought me so much amazement over the years. Your passion
for work and your attitude toward life have deeply impacted me. Like the
sun, you are always full of energy, warming and brightening the lives of people
around you. There are so many moments we shared together that I now recall
as if they happened just yesterday. Our first encounter was in 2016 when I
was pursuing my master’s degree. At that time, you were teaching DDP in the
first semester and CA in the second semester. I liked your lectures very much,
and often sat in the first row, and would record your class to listen over and
over again. I liked to come to you to ask questions after class, which steered
me towards the trajectory of my doctoral studies, and my eventual choice of
you as my supervisor. I vividly recall the moment when I received your email
saying that you welcomed me on board for a Ph.D. (I was doing my master’s
thesis at imec at that time), I tapped the table loudly and yelled "Nice!", which
scared everyone around me. In 2018, before I started my Ph.D., you lectured at
imec on Machine Learning Hardware and introduced lots of research progress
from your group. After the lecture, when we chatted, you said to me that you
could present the work I would be doing in a few years. What you said became
my dream. A year later in 2019, again in the imec lecture, witnessing you
introducing my work to others, I felt so proud that the dream had come true.
When I had just started my doctoral studies, we attended a meeting at Ugent
together. On the train, we discussed career and job choices. You told me how happy it is to have a job you love, to do things you're passionate about, and to get paid for it. My first conference paper, after your revisions, turned
mostly green on Overleaf. I thought, "Oh no, I must have written too poorly".
My colleague Sebastian, who was sitting opposite to me and had two more years
of Ph.D. experience than me, comforted me and said "Don’t worry, the papers
corrected by Marian are always all green". Every time I read your edited papers,
I am amazed by how you can always explain things in a clear and concise way,
and I learned a lot. Before your first presentation about ZigZag to external
parties, you woke up at 4 or 5 in the morning to derive formula details and
make slides and later sought my opinion. I asked you why not just ask me to
make the slides for you so that you don’t need to push yourself so much. You
said that only when you have gone through it yourself once can you really master it. In
numerous meetings with you, I’ve been amazed by how swiftly you grasp the
key to a problem and propose ingenious solutions. Discussions with you always
bring me a lot of inspiration and progress. In our team meetings, I particularly
enjoyed the presentation rehearsal segment, appreciating your and colleagues’
suggestions for improving slides, and I have learned valuable presentation skills
every time. When I just began my doctoral studies, I would get very nervous
during meetings with external collaborators. But having you around always
made me feel at ease. I knew that when I couldn’t explain things clearly, you
would help me convey the message to others. Over time, with your help, my oral explanation skills have grown a lot, and finally, I am confident enough to explain things clearly to external collaborators myself.
You cared not only about our academic progress but also about our emotional
well-being. You always provided timely guidance and emotional support when
needed. I am also always amazed at how well you have balanced your work
and life, and how you can leave enough time to spend with your family in your
busy schedule. Marian, thank you for all the things you’ve taught me over
these years — in every meeting, discussion, revised paper, and team meeting
and activity. Marian, you are a genuine role model. Meeting you made me a
better person. I’m leaving now, taking with me all that you have taught me. In
the last few years, I have witnessed you become a full professor, be named an
IEEE fellow, and win numerous prizes. I have unwavering faith that your career
will continue to flourish in the years ahead. I just hope that in the meantime,
don’t forget to take care of yourself, eat and sleep well, and don’t put too many
burdens on yourself. Marian, having you as my Ph.D. advisor has been my
greatest fortune and honor.
I would also like to sincerely thank my doctoral examination committee,
chair Prof. Patrick Wollants, Prof. Francky Catthoor, Prof. Wim Dehaene,
Prof. Joni Dambre, and Dr. Huichu Liu for taking time and effort out of your
busy schedules to review my thesis and join my defense. I am thankful for
the insightful discussions and feedback you provided, which helped me identify
areas for improvement in the thesis. Special gratitude to Prof. Joni and Prof.
Francky for accompanying me throughout my doctoral journey, providing advice
and feedback, and ensuring I reached all milestones.
Prof. Francky was also part of my master’s thesis committee. The academic
year from 2017 to 2018 held a special place in my memory when I was working
on my master’s thesis at imec. This marked my initial interaction with you,
which occurred during the rehearsal before the defense at imec. I recall vividly
how you wrote an entire page of feedback, and patiently discussed it with
me afterward. Throughout my doctoral journey, after each milestone, you
consistently reserved time for one-on-one meetings, offering valuable feedback
and suggestions. Your insights often introduced new angles to my research,
prompting me to think more expansively. Francky, your tireless commitment
and enthusiasm for the field truly left a lasting impression on me.
Prof. Wim, thank you for helping organize my private defense, allowing me
to proceed with my defense on time even when Marian could not be present.
I’ve admired your engaging personality since attending your class during my
master’s studies. You consistently radiate a sense of care, humor, and reliability.
I also want to especially thank you for the original Ph.D. invitation you offered
me during my master’s studies. It provided me with significant confidence and
encouragement during my moments of doubt and uncertainty. Although I didn’t
choose your research topic in the end, I advertised it for you and successfully
attracted Mohit's interest in joining your team.
A heartfelt thank you to Dr. Huichu from Meta. Huichu was my advisor during
my 2021 internship at Meta and has been a collaborator over the past three to
four years. I truly appreciate all the discussions, slide creations, paper writing,
and patent filings we’ve worked on together during this time. Huichu, your
dedication and perseverance toward work, your kindness and thoughtfulness for
others have deeply impressed me. Beyond professional growth, you have also
taught me many useful skills in making slides and illustrations.
Following that, I would also like to thank Meta, Edith's group with Ekin, Tony, Lita, and Huichu, for the monthly discussions and feedback over the last several years.
These discussions have allowed me to better understand the industry’s needs,
broaden my perspective, and better adjust my research focus and direction.
Then, I would like to sincerely thank all my internal and external
collaborators. As everyone can see, my work is filled with collaborations;
without them, this thesis wouldn’t exist. In chronological order, first, I want
to thank Mohit. We worked together at imec during our master's thesis period,
sharing a desk and computer. During that time, you offered me a lot of help and
advice, and we had many interesting discussions. The idea of "Sum-together
precision-scalable MAC" was born during this time, paving the way for my
doctoral journey and our first conference paper. Your passion for the field
deeply impressed me over all these years. Next, I want to deeply thank Vincent
for the precision MAC array work we accomplished together. Working with
you is amazing. Your dedication and organization were truly inspiring. You are
also so helpful. Whenever I needed help over the years, you were always there,
patiently and eagerly addressing my doubts. In the early days of my doctoral
journey, your guidance was instrumental in getting me started.
Then, I would like to thank Vikram very much for helping me lay a solid
foundation for the cost model of ZigZag with many discussions and hand
calculations performed together before ZigZag took shape. Later on, I want to
give special thanks to Pouya. Pouya, you joined the ZigZag journey shortly
after you joined the team. I’ll never forget the times we fought together, coding
and debugging late into the night, writing the first version of the paper, rushing
to meet deadlines, adding new experiments, handling rebuttals, and feeling down
when papers were rejected, yet finding new opportunities and achieving the final
paper publications. I’ll also never forget that we named ZigZag on the train to
Antwerp together. Thank you for your incredible effort and perseverance in
turning ZigZag from an idea into reality.
Next, I’d like to thank Ehab for continuing the work that Vincent and I started,
delving further into the precision-scalable MAC array, and gaining new insightful
design perspectives together. I also very much enjoyed the process of drawing
diagrams and writing papers with you. Following that, I want to express my
utmost gratitude to Arne for his contributions to the ZigZag-project. With
Arne, we enhanced the efficiency of the mapping search engine and together
updated the old ZigZag into a new, cleaner, more concise version. Alongside
Pouya and Steven, we removed the limitations of single-core architecture and
entered into the multi-core domain, proposing the follow-up framework Stream.
Arne, working with you has been truly enjoyable. In my eyes, you are a tall
sunshine boy. You work efficiently, stay focused and structured, and always have a bunch
of ideas. Later, I want to thank Koen for the joint accomplishment in modeling
the depth-first scheduling space on top of ZigZag. Koen, working with you is
highly efficient; you think fast and act swiftly, which I truly admire. Also, with
your substantial assistance, along with Arne, the coding structure of ZigZag
has been significantly improved, becoming more sustainable and reflecting its
current form. Finally, I want to express my sincere gratitude to all those who
have worked alongside me, on various papers, projects, theses, and courses:
Amanda, Steven, Guilherme, Jun, Jiacong, Leondra, Priya, Victor, Sebastian K.,
Nathan, Jasper, Sebastian P. G. , Yuanyang, Yunzun, Weijie, Ruben, Josse,
Lucca, Thibaut, and everyone else. Thank you for the collaboration and the
work we’ve accomplished together!
Also, I would like to thank the MICAS secretaries, Danielle and Ann, for helping
me set up things for this defense and for the constant support throughout the
years. Thanks to the MICAS IT manager, Ben, for always solving the computer
problems I encountered on time and providing me with private servers to conduct
massive experiments. Thanks to Ellen, the financial secretary of MICAS, for
always being very nice and helping me to get reimbursements. Once, when I lost my original bills and had already given up, your insistence and help got my money back.
I would like to thank all my MICAS and MNS colleagues for your sincerity
and enthusiasm in making my Ph.D. life more vibrant and colorful. Thank
you to my office mates: Vikram, Josse, and Sebastian. Because of you, the
office is full of joy. I am also truly thankful to all the people in Marian’s group
over the years, for helping and supporting each other and creating this warm
research family: Josse, Arne, Koen, Steven, Thomas, Jaro, Sander, Laura, Tom,
Sebastian, Pouya, Giuseppe, Amanda, Peishuo, Shirui, Jun, Jiacong, Ryan,
Guilherme, Nimish, Vikram, Ninad, Nitish, Ehab, Kodai, and everyone else.
I would like to give a special thanks to Vikram, who had been sitting in the
same office with me for five years and has become a great friend to whom I can
talk about everything. Over the five years, we witnessed each other’s growth.
All the company, encouragement, and support we had for each other and all
the talks, meals, and trips together are very precious memories of my Ph.D.
Also, a special thank you goes to Mohit. You are the one who, in the beginning, motivated me to apply for a Ph.D. In my eyes, you are a very knowledgeable
and far-sighted person, whom I really respect. Many times when I was hesitant
about something in my research or life, you appeared in my office just in time
to provide me with good advice. I want to write down here the two things you
once said to me that deeply touched me. One was during our master's time at imec, when one day you explained some math theory to me and afterward said something like "How beautiful it is!". Another was during a hard period in my Ph.D., when I found out that what I wanted to do, and was already halfway through, had already been done and published by other research teams. You comforted me,
saying that a Ph.D. is about making a small contribution to one point rather than contributing to everything; as long as you find that one point, you are good, and there must be something they have overlooked that you can do better.
Besides that, I’d like to thank Vikram, Nimish, Mohit, Rico, Ninad, Ciana, and
Nitish for all the gatherings, badminton playing, cycling, and alma trips. I also
want to extend a heartfelt thank you to my roommate, Amanda. Thank you
for the meals you cooked for me and for the insulated water bottle you gifted
me when I went to Northern Europe. I really appreciate the two years we lived
together, and also the three months with Yifan joining us. Yifan, thank you for
taking me to the Sunday Heverlee market and recommending ribs and cherries
to me. Thank you to Weijie, Zongyuan, Xinfa, and Peishuo for sharing watermelon with me and inviting me to dinner multiple times. Xinfa, the beef you
cooked and gifted to Amanda and me was the best beef I ever had. Thank you
to Jun F., Hui, Kaizhe, Thanos, Chen, Jhon, Jonah, Clara, Jiaqi, Kaicang, and
many many more people for all the chatting and joking in MICAS.
Also thank you to my friends in the TELEMIC group, Liu Hao, Liu Bin,
Zhensheng, Yang Jie, Zhang Meng, and Xuezhi for all the talks, meals, and trips
we had together. Thank you to our initial math study group – Mingxiao, Zehao,
Tingyu, Wan Bo, Zhou Han, and Yang Jie. Together, we rediscovered the beauty
of linear algebra and formed genuine friendships. Thank you to the moving
team – Vikram, Mingxiao, Tingyu, Shirui, Wan Bo, Peishuo, Yangyang, Xinfa,
and Aojie – for helping Amanda and me move places, disassemble and assemble
furniture multiple times. Thank you to Aojie, Chenxi, and Linlin for all the
meal invitations and shopping trips. Thank you to Linlin for accompanying me
to have my wisdom teeth removed. Thank you to Shirui, Jun, and Bowen for
countless wonderful experiences of dining out, boating, hiking, and traveling.
All these moments have led to numerous cherished memories, which have added
more depth and meaning to my Ph.D. journey.
Finally, I would like to express my deepest gratitude to my dear mom and dad.
Your gift of a warm and joyous family has been a treasure beyond measure to me.
Thank you for your unconditional emotional and material support, allowing me
the freedom to pursue my passions. Bathed in your boundless love, I’ve grown
with the courage to explore new realms and welcome the world with open arms.
You are my forever role models. Your kindness to people, your positive attitude
toward life, and your perseverance and focus in working and doing other things
have profoundly shaped me. Your love, guidance, care, and unshakable belief
in me form the foundation of every accomplishment I’ve achieved.
From the time I left home at 18 to attend a university thousands of miles away, 11 years have flown by. Every year I tried to go home once or twice, though during the Covid period I could only go home after two years. Each
time I go back, I can't help but notice the passage of time reflected in
your aging. Now I have almost finished my Ph.D. study and am returning to
you to start a new life. Mom and Dad, you accompanied me as I grew up; let me accompany you as you grow old. Beyond Mom and Dad, my heart brims with
gratitude for my two grandmas, aunts, uncles, and cousins. The embrace of my
close-knit family provides solace for my soul. After 11 years of venturing through
the vastness of the world and encountering numerous people and landscapes,
I’ve come to further truly appreciate the warmth and value of family to me.
Thank you for everything!
A final word
A thousand words cannot fully express the friendship of these years;
however high the sky and long the road, nothing can cut the ties between us;
take good care of yourselves, and may you all have splendid futures ahead;
the days to come are long, and may we meet again.
Abstract
Over the past decade, deep learning has reshaped the landscape of Artificial
Intelligence, thanks to the evolution of hardware computing capability, algorithm
improvement, and the ever-growing volume of data. Driven by a wide range
of use cases in various industries, such as computer vision, natural language
processing, healthcare, finance, manufacturing, robotics, etc., deep learning
gained enormous momentum in its development and deployment.
In the deployment, processing deep learning models fast and efficiently is
challenging due to their computationally intensive, data-intensive and diverse
nature. Yet, at the same time, this efficiency is critical across many use cases,
especially in resource-constrained scenarios, like mobile and IoT devices. As a
result, numerous specialized hardware accelerators have been built, taking advantage
of the intrinsic highly-parallelizable computing pattern of deep learning models.
These accelerators usually consist of an array of multiply-and-accumulate units
for parallel computing, a memory hierarchy for feeding/storing data, and a
pre-defined scheduling controller for orchestrating the computation and data
movement. As the processing efficiency is tightly coupled to these design
components, carefully constructing them is crucial. However, as the design
spaces of these components are vast and intertwined, and the traditional digital
design flow is too slow to evaluate every single design option, it is difficult to
jump out of the ad-hoc design paradigm to pursue a globally optimal solution.
To address this limitation, early-phase design space exploration is required. This
thesis contributes to this goal by, firstly, the systematic identification of the
design space of deep learning accelerators at different levels of abstraction and,
secondly, the insightful exploration of these single and joint design spaces. At
different abstraction levels, this thesis focuses on different exploration
parameters:
At the multiply-and-accumulate unit and array level, this thesis focuses
on studying the low-precision and variable-precision computing datapath, which
has been proven to bring significant latency and energy benefits for resource-
constrained systems (with no to minor algorithmic accuracy loss). This
work constructs systematic taxonomies for precision-scalable multiply-and-
accumulate units and arrays after identifying the different design options. These
taxonomies depict the skeleton of the design spaces, not only covering the existing
state-of-the-art precision-scalable designs but also uncovering new unexplored
architectural options. These different design options are then thoroughly bench-
marked with the traditional digital synthesis flow, and interesting tradeoffs and
design insights are discovered.
Moving one step higher in the abstraction to the single-core accelerator
level, we combine the multiply-and-accumulate array and a memory hierarchy,
together with various mapping/scheduling possibilities. This thesis builds
two high-level fast architecture-mapping design space exploration frameworks,
ZigZag and DeFiNES. ZigZag focuses on single-layer mapping, while DeFiNES
extends ZigZag to support depth-first scheduling. Thanks to deep learning’s
deterministic computing pattern, the built-in analytical cost models enable
these frameworks to estimate energy and latency breakdown of processing a
deep learning model on a customized accelerator in milliseconds to seconds,
paving the way toward fast architecture/mapping search and optimization. In
this thesis, several model validation experiments and multiple case studies
demonstrate the reliability and capabilities of these frameworks.
Recently, the ever-growing model diversity and size are driving deep learning
accelerator design to the multi-core level, which combines several accelerator
cores and a network-on-chip, together with massive scheduling and layer-core
allocation possibilities. At this level, the complexity of the design space is
further increased, yet this thesis provides a framework, Stream, to systematically
tackle it. Stream is a high-level multi-core accelerator modeling and design
space exploration framework, built upon ZigZag. It can explore different core
architectures, core-to-core communication topologies, layer-core allocation, and
fine-grained layer-fused scheduling, supporting various deep learning models.
Stream paves the way for fast and systematic multi-core deep learning accelerator
design and workload deployment.
It is important to note that the creation of these different frameworks all followed
a similar three-step methodology: firstly, identify different design options and
construct a unified design representation that covers these options; secondly,
based on this unified representation, build the cost models; lastly, automatically
generate different design candidates and feed them to the cost models. In this
way, the loop is closed, and the design space exploration can be conducted
automatically.
In summary, this thesis aims to clearly introduce the vast design space of deep
learning accelerators at the different abstraction levels and thoroughly explain
how the high-level design space exploration frameworks can be built to rapidly
offer design insights and guidelines. By providing the developed frameworks in
open source, we pass on the taxonomy, modeling, and exploration methodologies
applied in this thesis to future researchers.
Beknopte samenvatting
List of Abbreviations and Symbols

BG bit-group
BS bit-serial
BW bandwidth
DF depth-first
DL deep learning
DNN deep neural networks
DSE design space exploration
DVFS dynamic voltage-frequency scaling
FC fully-connected
IS input-sharing
ISA instruction set architecture
LBL layer-by-layer
LPF loop prime factor
LSB least significant bit
LUT look-up table
MAC multiply-and-accumulate
ML machine learning
MSB most significant bit
OS output-sharing
PE processing element
pJ/op picojoule per operation
PSMA precision-scalable MAC array
PW pointwise
SA sum-apart
sec second
SIMD single instruction multiple data
SWU subword-unrolled
Contents

Abstract
Contents
1 Introduction
1.1 Multi-level Deep Learning Acceleration in Post-Moore's-Law Era
1.2 Open Research Questions
1.2.1 Q1: How to efficiently execute variable-precision DNNs on hardware?
1.2.2 Q2: How to realize fast design space exploration for DNN accelerators?
1.3 Thesis Contributions and Organization
1.3.1 Taxonomy and benchmarking of precision-scalable datapaths (Q1)
1.3.2 Frameworks for fast DNN accelerator DSE (Q2)
2 Background
2.1 Deep Learning
2.1.1 Basic concepts
2.1.2 A common misconception
2.1.3 Computation and data patterns
2.2 Domain-specific Hardware for Deep Learning
2.2.1 Hardware performance metrics
Biography
Bibliography
List of Figures

2.1 An example DNN and a deep dive into one neuron of it.
2.2 Multiple types of DNN layers.
2.3 Snippets of different DNNs from (a) VGG-19 [149], (b) ResNet-18/34 [57], (c) ResNet-50/101/152, (d) ResNeXt [179], (e) MobileNet-v1 [65], (f) MobileNet-v2 [138], and (g) Inception-ResNet-v2 [155].
2.4 A great number of DNNs targeting image classification on ImageNet were developed [15].
2.5 The relationship between FLOPs, number of parameters, and inference latency of VGG-16 pruned models [100].
2.6 An overview of DNN accelerator comparison [85].
2.7 Introduction to DNN accelerator.
2.8 Introduction to MAC / PE array interconnection.
2.9 Energy per 16-bit access with various RF and SRAM sizes, and for a MAC operation and a DRAM access [184].
2.10 Single-core and multi-core DNN accelerator.
2.11 Introduction to single-core cross-layer scheduling and intra-layer mapping.
2.12 An example continuing from Figure 2.11(c) Mapping 3 to show temporal mapping's impact on high-level memory access. Mapping 3a and 3b have the same spatial mapping but different temporal mappings.
2.13 Examples of the nested for-loop based mapping representation, continuing from Figure 2.11 and Figure 2.12.
2.14 Introduction to multi-core layer-core allocation and scheduling.
3.1 SA/ST loop identification for (a) an FC and (b) a Conv 2D layer.
3.2 Three types of two-dimensional PE array.
3.3 Data-gated conventional MAC for either one full-precision 8b×8b, one symmetric 4b×4b, or one weight-only 2b×8b operation.
3.4 Weight-only precision scaling in a 1D FU SA MAC configured for either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle.
3.5 Weight-only precision scaling in a 1D FU ST MAC configured for either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle.
3.6 Precision scaling in a 2D FU SA MAC configured for either one 8b×8b, four 4b×4b, four 2b×8b, or sixteen 2b×2b operations per cycle.
3.7 Precision scaling in a 2D FU ST MAC configured for one 8b×8b, four 4b×4b, four 2b×8b, or sixteen 2b×2b operations per cycle.
3.8 Symmetric precision scaling in a SWU SA MAC configured for either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle.
3.9 Symmetric precision scaling in a SWU ST MAC configured for either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle.
3.10 Weight-only precision scaling in a bit-serial MAC configured for either 8b×8b, 4b×8b, or 2b×8b operations.
3.11 Weight-only precision scaling in a 4-bit serial MAC configured for either 8b×8b, 4b×8b, or 2b×8b (by gating the 4b×8b) operations.
3.12 Symmetric precision scaling in a 2D serial MAC configured for either 8b×8b, 4b×4b, or 2b×2b operations.
3.13 A bit-wise feed-in schedule for 2D bit-serial MAC in 4b×4b mode.
3.14 Symmetric precision scaling in a 2D 4-bit serial MAC configured for either 8b×8b, 4b×4b, or 2b×2b (by gating the 4b×4b) operations.
3.15 Register settings for (a) SWU SA and (b) SWU ST MAC units.
3.16 Legend of all the bar charts in Section 3.5. 19 designs in total.
3.17 Bar charts of the bandwidth per clock cycle of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.
3.18 Bar charts of the bandwidth per operation of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.
3.19 Bar charts of the normalized circuit throughput of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.
3.20 Bar chart of the normalized area of precision-scalable MAC units.
3.21 Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 1D FU SA MAC (DNPU) [146].
Chapter 1
Introduction
Over the past decade, the field of Artificial Intelligence (AI) has advanced
at an unprecedented pace. Humanity has witnessed numerous remarkable
achievements of AI that have already found a place in the history books, such
as the examples listed below, also shown in Figure 1.1:
(a) Alexnet [90] winning the ImageNet large-scale visual recognition challenge
in 2012;
(b) AlphaGo [148] beating world champion Lee Se-dol at the game of Go in
2016;
(c) Google’s Duplex [96] making a reservation at a restaurant and scheduling
a haircut over the phone in 2018;
(d) AlphaFold [83] making significant breakthroughs in predicting the 3D
structure of proteins in 2021;
(e) DALL-E [131] showing its ability to generate high-quality, photorealistic
images of non-existent objects, as well as to turn your creative visions
into AI-generated art in 2021;
(f) ChatGPT [123] and GPT-4 [124] triggering a worldwide revolution in
natural language processing and related fields in 2023.
These AI achievements are just the tip of the iceberg, which would not have
been possible without the exponentially increased hardware compute power
[Figure 1.1 panels: (a) AlexNet for image classification; (b) AlphaGo vs. a Go world champion; (c) Google Duplex performing real-world tasks over the phone; (e) a DALL-E example, "An oil painting in Vincent van Gogh style: dumplings and samosas are flying together on a starry night".]
[Figure: the abstraction levels of deep learning acceleration, from Application and Algorithm, through Mapping/Scheduling/Compiling, down to Hardware architecture and Hardware technology, illustrated with example pictures (f)-(v).]
The first level is the application level. As discussed in the beginning, DL has
demonstrated its promising utility in a wide range of applications, such as the
ones shown in the figure: object detection, autonomous driving, predictions
of protein structure, recommendation systems, and the metaverse. Below the
application level comes the algorithm level, in which different types and shapes
of DNNs are applied for specific tasks of the applications. After that, to deploy
1 The picture sources: (a):[7], (b):[106], (c):[83], (d):[6], (e):[112], (f)-(j):[165], (k):[110],
(l):[108], (m):[152], (n):[62], (o):[78], (p):[54], (q):[151], (r):[26], (s):[137], (t):[60], (u):[170],
(v):[99]
Note that this does not mean that per-level design is a single, monolithic step. In fact, each level has a huge design
space, and the design procedure can usually be organized as multiple design stages (to avoid
complexity explosion). For example, at the mapping/scheduling/compiling level, a unified
multi-stage meta flow, proposed back in the 1990s [23], already indicated the immensity and
complexity of the scheduling design space on its own.
• T. Austin in his keynote "Preparing for a Post Moore’s Law World" [10]
emphasized that to overcome the scaling challenges, it is key to reduce
the cost of bringing a customized heterogeneous design to life so as
to flourish the innovation. Proposed actions include embracing open-
source hardware to encourage cooperation, assembling different specialized
processors to widen the applicability, auto-generating code and hardware
for highly reusable accelerators, developing tools for efficient design
space exploration and well putting together the benchmark suites. For
ensembles of specialized processors, he raised open questions about what
the components/processors are and how they should be connected.
The chiplet approach breaks different pieces of functionality into tiles which are then stitched together into a mosaic
by bonding them to a common silicon substrate. Chiplets can rapidly serve diverse specialized
applications at a much lower cost and much faster turn-around [140].
Background
As discussed earlier, there is a big gap between the required and actual
performance/efficiency in DL acceleration systems, and algorithm-hardware
co-design is an important ingredient to the solution.
One typical example of algorithm-hardware co-design is neural network
quantization and custom precision-scalable accelerator architecture. Numerous
researches have shown that computing at a reduced data precision using
appropriate per-layer, or per-channel mixed-precision quantization of activations
and weights can largely improve the system’s performance/efficiency [94, 116]
with no or minor algorithmic accuracy loss [115, 174, 52].
To efficiently execute variable-precision DNNs, the key computation component
of an accelerator, the multiply-and-accumulate (MAC) unit, needs to be revised
accordingly compared to the original fixed-precision computation scenario.
Furthermore, usually a DNN accelerator contains not only one MAC unit but
multiple of them, which are connected in certain patterns into a MAC array.
We therefore split the first research question into the following two more specific
sub-questions, regarding the MAC unit and MAC array respectively.
Q1.1: What is the best MAC unit architecture for variable-precision DNN
execution?
any work evaluate their performances against a baseline design which is a MAC
unit without scalability; 3) it is hard to gain a systematic view of the design
space of different scalability techniques, as many designs are based on ad-hoc
scalability methods and lack a clear analysis of different scalability principles.
To answer this question, Chapter 3 of the thesis extensively reviews the SotA
precision-scalable MAC unit architectures, unifies them in a new taxonomy, and
conducts a thorough benchmark analysis under the same process technology
across a wide range of performance targets.
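To give a first flavor of what such precision scalability means at the arithmetic level, the following minimal Python sketch (an illustration constructed for this text, not circuitry or code from Chapter 3) contrasts two ways a MAC datapath can reuse an 8-bit weight port for two 4-bit weights: keeping the two subword products apart as separate outputs (sum-apart, SA) versus accumulating them into one shared output (sum-together, ST).

def split_4b(x8):
    """Split an unsigned 8-bit value into two unsigned 4-bit subwords (hi, lo)."""
    return (x8 >> 4) & 0xF, x8 & 0xF

def sum_apart(w8, a0, a1):
    """SA: each 4-bit weight subword multiplies its own activation;
    the two products stay separate, feeding two independent outputs."""
    w_hi, w_lo = split_4b(w8)
    return w_hi * a0, w_lo * a1

def sum_together(w8, a0, a1, acc=0):
    """ST: the two 4b x 8b products are added into one shared accumulator,
    so they contribute to the same output."""
    w_hi, w_lo = split_4b(w8)
    return acc + w_hi * a0 + w_lo * a1

if __name__ == "__main__":
    w8, a0, a1 = 0xB7, 3, 5           # 0xB7 packs the two 4-bit weights 0xB and 0x7
    print(sum_apart(w8, a0, a1))      # (33, 35): two separate partial outputs
    print(sum_together(w8, a0, a1))   # 68: one accumulated output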
Q1.2: What is the best MAC array architecture for variable-precision DNN
execution?
Basic concept: Is knowing the best MAC unit architecture equal to knowing
the best precision-scalable MAC array architecture and eventually the best
accelerator? The answer is ‘No’. This is mainly because studying the optimality
of a single precision-scalable MAC unit overlooks the possibility of amortizing
the scalability overhead across multiple MAC units at the MAC array level,
which can significantly impact energy and area. This overhead amortization
is feasible due to the structured parallel MAC computation and data reuse
opportunities present in DNNs, which allow for batch behavior to reduce the
average operation costs.
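The following toy calculation (with invented numbers, purely to illustrate the amortization argument) shows how a scalability overhead that can be shared across an array may flip the ranking obtained when comparing single MAC units in isolation.

def cost_per_mac(unit_cost, per_unit_overhead, shared_overhead, n_units):
    """Average cost per MAC when part of the scalability overhead can be
    shared (amortized) across all MAC units of the array."""
    return unit_cost + per_unit_overhead + shared_overhead / n_units

# Design A: small overhead, but replicated in every MAC unit.
# Design B: larger overhead, but shareable by the whole array.
for n in (1, 256):
    a = cost_per_mac(1.0, 0.30, 0.0, n)
    b = cost_per_mac(1.0, 0.05, 10.0, n)
    print(f"{n:4d} units -> A: {a:.3f}, B: {b:.3f}")
# With 1 unit ("retail"), A (1.30) beats B (11.05);
# with 256 units ("wholesale"), B (about 1.09) beats A (1.30).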
A simple analogy that illustrates this concept is buying products in retail versus
wholesale. Assume products A and B by themselves have similar values and
product B has a fancier packaging box than A. Product A thus has a lower
price than product B when purchased individually in retail, as each product is
bought with a packaging box.
In this analogy, you can think of products A and B as two types of precision-
scalable MAC units. The previous question (Q1.1) compares their prices in
retail, and the current question (Q1.2) moves one step forward, comparing their
price in wholesale. The large number of structured parallel MAC operations
in DNN are the foundation for this wholesale batching, and how well products
the trade-offs within/between each design, and to insightfully find the optimal
architectures under different circumstances.
Q1.1 and Q1.2 mainly question the impact of the logic topology of the precision-scalable MAC datapath (Q1.1 on a single MAC unit, Q1.2 on an array formed by multiple MAC units), whereas when moving to more advanced CMOS technology nodes, the hardware cost pressure will shift from the logic part more toward the wire7 and memory parts. Further research questions regarding the impact of
wire and memory on a precision-scalable DNN accelerator can be built on top
of Q1, and are left for future work.8
Background
In light of the previous Post-Moore’s-Law Era discussion, there are two major
considerations: On the one hand, people are suggesting to create customized
and heterogeneous hardware architectures to keep up with the ever-growing
performance and efficiency demands, resulting in vast design spaces which come
from greater hardware design flexibility and more scheduling/mapping options;
On the other hand, there is a strong desire to accelerate the traditional hardware
design flow, streamline the algorithm-to-hardware deployment (mapping and
scheduling), and minimize the implementation costs. In short, the goal is to
identify, implement, and deploy the optimal (or good enough) design points out
of the vast design spaces in a quick and efficient manner.
To achieve these goals, the first step is to be able to identify the optimal (or a
good enough) design from numerous design options. So, how to assess the quality
of a design idea? The most straightforward way is to implement and fabricate
it, deploy the target algorithms onto it, and measure the hardware performance
7 The wire-dominated cost at the technology level can be translated into communication
network cost and memory organization cost at the architecture level.
8 This study primarily focuses on the logic topology, prioritizing it over the memory or
the wire. There are two primary reasons for this choice. Firstly, the design space is vast,
and attempting to consider everything together from the beginning would be overwhelming.
Secondly, the precision-scalable MAC unit/array logic topology serves as the cornerstone of
precision-scalable computing, significantly influencing data movement and sharing patterns,
and consequently impacting memory behavior and wiring cost. Thus, the decision to start
with the logic topology stems from its central importance in the precision-scalable system.
and efficiency. However, the long fabrication cycle (months) would largely slow
down the optimization iteration, and is moreover very costly. One could also
implement the design without fabricating it, assessing the effectiveness through
simulation. While this is possible, the long cycle-accurate simulation time
(hours to days) again hinders the design iterations. Additionally, implementing
and testing one design can also take long (weeks).
So, in order to find the optimal design point from numerous design options
within reasonable time and cost, the traditional methods discussed above are
infeasible. We need a novel DSE tool that is fast enough (seconds) to estimate the hardware cost of deploying a DNN workload onto an accelerator; accurate enough to reflect the strengths and weaknesses of the actual design; general enough to cover a wide range of design options for the hardware architecture, the workload, and the algorithm-to-hardware deployment strategy; adaptable enough to effortlessly alter these design options from one to another for iterative optimization; and, ideally, intelligent enough to automatically adopt the best algorithm-to-hardware deployment strategy so that the cost comparison between different algorithm-hardware design pairs is relevant and fair.
Rome wasn’t built in a day. Developing such a tool for the DSE of customized
heterogeneous accelerator systems is not done in one go. Following the different
development stages in the journey, the general question (Q2) is broken down
into three sub-questions (Q2.1/2.2/2.3). Q2.1 focuses on fast hardware cost
(energy and latency) estimation of executing a single DNN layer on a single-core
accelerator; Q2.2 seeks out more cross-layer scheduling possibilities to improve
design metrics; Q2.3 moves one step further towards our final goal from single-
core to heterogeneous multi-core accelerator cost estimation supporting flexible
layer-core allocation and layer-fused scheduling.
Basic concept: For a fast DSE framework that can assess a wide range of
accelerator architectures, support different DNN layer types and sizes, auto-
search the best mapping strategy for algorithm-to-hardware deployment, and
rapidly estimate the hardware cost (energy and latency), there are three key
factors. They are the accelerator architecture, DNN workload, and mapping,
each with a large design space (thoroughly discussed in Chapter 2).
To enable fast exploration in the joint-design space of these three factors, it
is essential to apply a high-level abstraction to represent them and capture
their key attributes. This high-level abstracted design representation should
be flexible enough to cover all the design options (i.e., different DNN layers,
accelerator architectures, and mappings) and unified to share a common data
structure. Based on this abstracted, flexible, and unified design representation,
hardware cost estimation can be performed in a fast and structured way for all
designs under exploration. Finally, linking an automatic design generator to
the hardware cost estimator (both adopting the same design representation)
and closing them in a loop can help the DSE converge to the optimal (good
enough) design points.
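As a hedged illustration of this closed loop (a deliberately simplified sketch, not the interface of the frameworks presented later), the snippet below couples a toy design generator to a placeholder analytical cost model and keeps the cheapest design point; all parameter names and cost numbers are invented.

import itertools

def candidate_designs():
    """Enumerate a toy joint design space: PE-array sizes x on-chip buffer sizes (KB)."""
    for pes, buf_kb in itertools.product([64, 256, 1024], [32, 128, 512]):
        yield {"pe_array": pes, "buffer_kb": buf_kb}

def analytical_cost(design, workload_macs=1e9):
    """Placeholder cost model: latency scales with MACs / parallelism, and energy with
    MACs times (MAC energy + a memory term that shrinks with a larger buffer)."""
    latency = workload_macs / design["pe_array"]
    energy = workload_macs * (1.0 + 20.0 / design["buffer_kb"])
    return energy * latency   # energy-delay product as a single figure of merit

best = min(candidate_designs(), key=analytical_cost)
print("Best design point:", best)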
Existing research: Several DSE frameworks and framework components have
emerged over the last few years, targeting single-core DNN accelerators and
single-layer mapping9 DSE, such as [166, 184, 91, 177, 126]. MAGNet [166]
proposes a modular DNN accelerator generator for lowering the design cost
and provides a DSE framework encompassing a designer, a mapper, and
a DL framework to enable co-optimization of architecture and application.
Interstellar [184] applies Halide’s scheduling language [129] as the high-level
abstraction to describe the design space of accelerators and proposes a formal
dataflow taxonomy to represent different spatial loop unrollings. MAESTRO [91]
introduces a set of data-centric directives to concisely specify the DNN dataflow
space and analytically estimate the execution time and energy efficiency of
the accelerator. Accelergy [177] presents a generally applicable methodology
for performing architecture-level energy estimation on accelerator designs,
and proposes a configuration language that helps the designers to describe
their systems. Timeloop [126] introduces an infrastructure for evaluating and
exploring the architecture design space of DNN accelerators, using a concise and
unified representation of the key architecture and implementation attributes of
DNN accelerators to describe a broad space of hardware topologies.
Remaining challenge: As discussed earlier, an ideal DSE tool should be fast,
accurate, general, adaptable, and intelligent. Existing research usually overlooks
some of these criteria or sacrifices one for another. For example, MAGNet [166]
trades off speed for accuracy, as its cost estimation is based on post-synthesis
results (accurate but slow), MAESTRO [91] and Accelergy [177] rely on users
to define mapping strategies (adaptable but not intelligent); Interstellar [184]
and Timeloop [126] put strict constraints on the mapping search space for faster
search speed (fast but not general enough).
Chapters 5 and 6 of this thesis focus on this challenge and build a DNN
accelerator DSE framework aimed at meeting these five criteria.
9 Detailed explanation/definition of single-core DNN accelerators and single-layer (a.k.a
Basic concept: Depth-first (DF) scheduling (also called deep layer fusion) is a
new algorithm-to-hardware deployment strategy that can significantly decrease
the hardware cost for accessing off-chip data. For single-layer mapping in
Q2.1, one layer has to be fully executed before its successive layer(s) can start
execution. In this way, if this one layer’s output size is too large to fit into
on-chip memory, the output data will be first pushed to off-chip DRAM and later
be fetched back on-chip as the successive layer(s)’ input for further operations,
resulting in a significant hardware cost penalty. In comparison, DF scheduling
breaks each layer into multiple smaller tiles and processes cross layers tile-wisely,
e.g., finishing all layers’ first tile before starting to process all layers’ second
tile. In this way, the amount of intermediate results to be stored at a specific
instant in time is greatly reduced and thus is more likely to be able to fit into
on-chip memory or even lower-level on-chip memory (smaller capacity, more
efficient), effectively reducing the data movement hardware cost.
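To make this scheduling difference concrete, the minimal sketch below (illustrative only) enumerates the execution order of layer-by-layer versus depth-first scheduling for a toy workload; with two layers and three tiles it reproduces the (A1, B1, A2, B2, A3, B3) depth-first order used as an example in the footnote further below.

def layer_by_layer(n_layers, n_tiles):
    """Finish all tiles of one layer before moving to the next layer."""
    return [(layer, tile) for layer in range(n_layers) for tile in range(n_tiles)]

def depth_first(n_layers, n_tiles):
    """Push each tile through all layers before starting the next tile, so only
    one tile's intermediate results need to be kept on-chip at a time."""
    return [(layer, tile) for tile in range(n_tiles) for layer in range(n_layers)]

# With layers A=0, B=1 and tiles 1..3 (0-indexed here):
print("LBL:", layer_by_layer(2, 3))   # A1 A2 A3 B1 B2 B3
print("DF :", depth_first(2, 3))      # A1 B1 A2 B2 A3 B3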
The unique mechanisms of DF scheduling and the expanded design space it
presents (e.g., tile size, fused depth, etc.) demand additional support in the
DSE framework, to model the more dedicated data movement and explore the
new and enlarged scheduling design space.
Existing research: Several cost models and mappers/schedulers supporting
DF scheduling are deployed for DSE frameworks to quickly explore and develop
optimal DF systems, such as [86, 180, 189, 172, 19]. DNNFuser [86] focuses
on the layer-fusion mapping space search and proposes a one-shot inference-
based DF mapper (i.e., no search procedure). DNNVM [180] transforms CNN
models into a directed acyclic graph (DAG) and enumerates all potentially
profitable fusion opportunities by a heuristic subgraph isomorphism algorithm.
EfficientS [189] studies the DF scheduling of irregular network structures on DNN
accelerators and proposes their subgraph scheduling and memory-allocating
techniques. ConvFusion [172] builds a mathematical cost model for DNN
schedule supporting loop fusion and other loop transformations. Optimus [19]
presents a DAG-based DNN operator fusion algorithm, driven by a memory
cost model, to capture the achievable minimum off-chip memory overhead for
the fused operator groups.
Remaining challenge: A fast DSE framework includes two key parts:
modeling and exploration. In the exploration part, many innovative searching
algorithms have been introduced to handle the DF enlarged scheduling space,
such as the heuristic subgraph isomorphism algorithm in DNNVM [180] and a
transformer-based mapper in DNNFuser [86]. However, in the modeling part,
these existing frameworks all have missed some important factors. [172] and [19]
focus on optimizing off-chip DRAM access while ignoring the data movement
within the multi-level on-chip memory hierarchy; [180] and [189] pursue latency
improvement solely while ignoring the energy impact; [189] and [145] count
memory accesses for feature maps (a.k.a. activation) while ignoring the accesses
from weights, despite the fact that the data movement of activations and weights is one of the major trade-off pairs in the DNN DF scheduling space. Ultimately, these missing
factors in the modeling part could bias the exploration direction and cause
substantial optimality losses in the final design.
Chapter 7 of the thesis fills in these missing factors and creates a comprehensive
DSE framework, built on top of Q2.1, for DF execution of DNN accelerators.
fusion (a subset of layer fusion). To give an example, assume there are 2 consecutive layers (A, B) to be scheduled, each layer having 3 tiles (1, 2, 3), and layers A and B are tile-wise dependent, i.e., Bi depends on Ai and thus Bi cannot be computed before Ai (i=1,2,3). The depth-first
schedule strictly follows the order (A1, B1, A2, B2, A3, B3) to compute, while layer fusion
has more options besides this depth-first one, such as (A1, A2, B1, B2, A3, B3), (A1, A2, B1,
A3, B2, B3), (A1, B1, A2, A3, B2, B3), etc.
EfficientA [159] incorporates a very coarse hardware model with only the total
memory capacity and inter-core connection modeled for each core, making
it unsuitable for performing DSE for accelerator architectures. ISO [43]
targets homogeneous GPU systems, which is not applicable to heterogeneous
accelerators. Herald [92] models heterogeneous multi-core DNN accelerators but
without supporting layer fusion. Rammer [105], a DNN compiler that targets
commercial homogeneous GPU and IPU systems, relies on the actual hardware
measurement to exploit DNN workload parallelism and optimize scheduling.
This prevents using Rammer to explore new and undeveloped heterogeneous
architecture ideas. In addition, most of the works [159, 43, 105] only optimize
system latency without studying energy efficiency impact.
Chapter 8 of the thesis works towards completing these gaps by proposing a
general modeling and DSE framework, also built on top of Q2.1, to systematically
tackle the more complex multi-core DNN acceleration design space.
introduced at this point. More detailed future work discussions are at the end of the thesis.
After this chapter and Chapter 2, which provide the necessary background
knowledge for the rest of the text, this thesis focuses on addressing the
aforementioned open research questions (sub-questions) through the following
contributions, summarized in Figure 1.4.
Chapters 5 & 6 (Q2.1) Chapter 5 presents ZigZag, a general and fast DNN accelerator architecture-and-mapping DSE framework with an enlarged mapping search space; Chapter 6 zooms in on the ZigZag cost model's latency analysis,
explaining the underlying calculus for latency estimation. In ZigZag, three
modules cooperate in synergy to enable the exploration of a much broader
space of solutions with respect to other SotA DSE frameworks. Firstly, at
the kernel level, the memory-centric design space representation and the loop
relevance principle are introduced, based on which an analytical hardware cost
estimator (with detailed energy and latency analysis) is built. Secondly, on
top of the cost estimator, two mapping search engines are created to rapidly
locate the optimal spatial and temporal mappings (even/uneven) by means
of innovative search methods. Thirdly, to enable fast hardware architecture
search, an architecture generation wrapper is added to generate all valid memory
hierarchies (balanced/unbalanced, shared/separate) given a set of high-level
hardware constraints. Benchmarking experiments against published works, in-
house accelerators, and existing DSE frameworks show the reliability of ZigZag.
Multiple case studies exploring the vast DNN accelerator design space from
different perspectives demonstrate its capability.
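As a hedged, self-contained illustration of the loop-relevance idea behind such an analytical cost estimator (a toy model written for this text, not ZigZag's actual code or API), the sketch below counts how often each operand of a tiny fully-connected layer must be fetched from the memory level above a single-entry register, for two different temporal loop orders.

import itertools

# Toy fully-connected layer: K output neurons, C input channels.
K, C = 4, 8
operand_index = {
    "W": lambda k, c: (k, c),  # weights depend on both loops
    "I": lambda k, c: (c,),    # inputs are reused across k
    "O": lambda k, c: (k,),    # outputs are reused across c
}

def fetches(loop_order):
    """Count operand-index changes (i.e., required fetches from the level above
    a single-entry register) for a given temporal loop order."""
    loops = {"k": range(K), "c": range(C)}
    counts = {op: 0 for op in operand_index}
    last = {op: None for op in operand_index}
    for outer, inner in itertools.product(loops[loop_order[0]], loops[loop_order[1]]):
        point = {loop_order[0]: outer, loop_order[1]: inner}
        for op, idx in operand_index.items():
            i = idx(point["k"], point["c"])
            if i != last[op]:
                counts[op] += 1
                last[op] = i
    return counts

print("k outer, c inner:", fetches(("k", "c")))  # outputs stay put (4 fetches), inputs refetched for every k (32)
print("c outer, k inner:", fetches(("c", "k")))  # inputs stay put (8 fetches), outputs refetched for every c (32)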
Chapter 7 (Q2.2) builds DeFiNES, a DNN accelerator DSE framework
extending ZigZag with cross-layer DF scheduling support. This work first
formally defines the DF design space across three design axes: tile size, overlap
storage, and fused depth. Then, a novel cost model capable of handling this
entire design space is proposed and validated against a taped-out DF-style
accelerator. This cost model resolves the challenges listed in Q2.2 by not only
considering DRAM access or memory access due to activations, but also the full
on-chip memory hierarchy and memory accesses caused by weight traffic. Large
gains might be missed when not doing so (up to 10.2× in energy efficiency).
Three case studies are conducted based on the model, studying the trade-offs
between different DF schedules, and the impact of workload and hardware
architecture on the best DF strategy. In summary, DeFiNES allows quickly
examining the complex design space of different combinations of DF strategies
and hardware architectures.
Chapter 8 (Q2.3) introduces Stream, an architecture-scheduling DSE
framework built for multi-core DNN accelerators with support for fine-grained
layer-fused scheduling. Exploration is enabled through the combination of a
unified modeling representation (mix of DAG and nested for-loops), a rapid
fine-grained data dependency extractor, a genetic algorithm-based layer-core
allocator, and a heuristics-based fast scheduler. Validation of Stream with
three SotA hardware accelerators (single-core and homogeneous/heterogeneous
[Figure 1.4: thesis organization, plotting the chapters along a mapping/scheduling-exploration axis (none, intra-layer spatial and temporal mapping, cross-layer layer-by-layer and depth-first/layer-fused scheduling) and a hardware-exploration axis. Chapter 3: variable-precision MAC unit (AICAS 2019, JETCAS 2019); Chapter 4: variable-precision MAC array (TCAS-I 2022); Chapters 5 & 6: ZigZag, an analytical cost model plus mapping-hardware DSE (TComp 2021, AICAS 2021, DATE 2022); Chapter 7: DeFiNES, exploring the depth-first scheduling space for DNN accelerators (ISPASS 2023); Chapter 8: Stream, modeling fine-grained layer fusion on multi-core DNN accelerators (HPCA 2023).]
Chapter 2
Background
This chapter provides background information on the three key factors in the
DSE of DL hardware acceleration: DNN algorithm (Section 2.1), hardware
(Section 2.2), and algorithm-to-hardware deployment (Section 2.3), which are
prerequisites to understanding the rest of the thesis.
Section 2.1 explains the DNN algorithm. It first introduces DNN's basic concepts and building blocks (from a single neuron, a layer, and inter-layer connections to a complete DNN) in Section 2.1.1. Then, it points out a common misconception about the correlation between DNN model size and hardware processing cost in Section 2.1.2. To explain this misconception, the computation and data patterns of DNNs are discussed in Section 2.1.3.
Section 2.2 introduces domain-specific hardware for DL. It starts by discussing the commonly used hardware performance metrics in Section 2.2.1, including quantitative ones like latency and energy efficiency, and qualitative ones like flexibility and programmability. Then, it compares different types of hardware platforms used for DL acceleration in Section 2.2.2. Afterwards, in Section 2.2.3, it zooms in on the ASIC DNN accelerator and explicitly explains each hardware component and the shift from single-core to multi-core architectures.
Section 2.3 elaborates on the deployment of DNN workloads to hardware accelerators. It breaks the general concept of deployment down into mapping, scheduling, and allocation. For single-core accelerator systems, it first explains cross-layer scheduling in Section 2.3.1 and then intra-layer mapping (decoupled into spatial and temporal mapping) in Section 2.3.2; for multi-core systems, it highlights the impact of layer-core allocation and multi-core scheduling in Section 2.3.3.
Such multi-layer artificial neural networks are also called Deep Neural Networks
(DNNs). As shown in Figure 2.1, a DNN consists of multiple interconnected
layers of neurons. In a bottom-up manner, let’s discuss its basic concepts one
by one: a neuron, a layer, the connection between layers, and the complete
neural network.
Figure 2.1: An example DNN and a deep dive into one neuron of it.
1) A neuron

As zoomed in on in Figure 2.1, a single neuron computes a weighted sum of its inputs plus a bias (a multiply-accumulate, MAC, operation) and passes the result through an activation function:

O = f\left(\sum_{c=0}^{C-1} W_c \times I_c + b\right) \quad (2.1)

in which the small c is the input index and the big C is its upper bound. The same rule will be used for describing other DNN operations in the thesis.
The values of weights and biases are determined in the training phase. The
weights represent the strength of the connections between the input and the
neuron, and the bias is a scalar value added to the weighted sum of the inputs,
which provides an additional degree of freedom for network learning.
The values of the inputs are dynamically determined by the outputs of the preceding neuron(s) in the inference phase. The different operands (inputs, weights, bias, outputs) can have the same or different data precisions in the computation, the implications of which will be discussed further in the thesis.
The activation function introduces nonlinearity into the neuron computation,
which allows the DNN to model complex relationships between inputs and
outputs. Common activation functions include the sigmoid function, the rectified
linear unit (ReLU) function, the hyperbolic tangent (tanh) function, etc [122].
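To make this computation concrete, here is a minimal Python sketch of a single neuron (illustrative only; the input values, weights, and the choice of ReLU are arbitrary assumptions, not taken from the thesis):

def relu(x):
    # Rectified Linear Unit: a common activation function, f(x) = max(0, x)
    return max(0.0, x)

def neuron(inputs, weights, bias, activation=relu):
    # Multiply-accumulate over all C inputs, then add the bias ...
    acc = bias
    for i, w in zip(inputs, weights):
        acc += w * i
    # ... and apply the nonlinear activation function
    return activation(acc)

# Example with C = 3 inputs, mirroring the neuron zoomed in on in Figure 2.1
print(neuron(inputs=[0.5, -1.0, 2.0], weights=[0.8, 0.1, -0.3], bias=0.2))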
2) A layer

[Figure 2.2 (residue): illustrations of the DNN layer types discussed below, including (a) FC, (b) Conv, (c) DW Conv, (d) Pointwise (PW) Conv, (e) Group Conv, and (g) element-wise sum layers.]
• FC layers are often used for classification and regression tasks. They
connect every neuron in one layer to every neuron in the next layer,
allowing the network to integrate the overall information.
A C-input, K-output FC layer can be described as:
O_k = f\left(\sum_{c=0}^{C-1} W_{k,c} \times I_c + b_k\right) \quad (2.2)

in which

k = \{0, 1, ..., K-1\} \quad (2.3)

In the equation, k and c correspond to the output and input index respectively, b_k is the bias of the k-th output, and f(x) is the activation function.
• Conv layers are commonly used in image and video processing tasks
and can be either 1D, 2D, or 3D, depending on the nature of the input
data. They apply a set of filters (a.k.a. weight kernel) to the input data,
allowing the network to learn local patterns and features.
Assuming C input channels and K output channels, a 1D Conv layer can
be written as:
O_{k,ox} = f\left(\sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} W_{k,c,fx} \times I_{c,ix} + b_k\right) \quad (2.4)

and

ix = SX \times (ox - 1) + SFX \times (fx - 1) + 1
iy = SY \times (oy - 1) + SFY \times (fy - 1) + 1 \quad (2.8)

in which

g = \{0, 1, ..., G-1\}
ox = \{0, 1, ..., OX-1\} \quad (2.10)
oy = \{0, 1, ..., OY-1\}

Note that index g replaces the previous Conv 2D layer's k and c indexes, and accordingly, the dimensionality of the weight tensor reduces by 1 dimension. ix and iy follow Equation 2.8.
For a PW Conv layer, assuming C input channels and K output channels,
its equation can be written as:
O_{k,ox,oy} = f\left(\sum_{c=0}^{C-1} W_{k,c} \times I_{c,ix,iy} + b_k\right) \quad (2.11)

in which

k = \{0, 1, ..., K-1\}
ox = \{0, 1, ..., OX-1\} \quad (2.12)
oy = \{0, 1, ..., OY-1\}

and

ix = ox
iy = oy \quad (2.13)
O_{g,k,ox,oy} = f\left(\sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} \sum_{fy=0}^{FY-1} W_{g,k,c,fx,fy} \times I_{g,c,ix,iy} + b_{g,k}\right) \quad (2.14)

in which

g = \{0, 1, ..., G-1\}
k = \{0, 1, ..., K-1\}
ox = \{0, 1, ..., OX-1\} \quad (2.15)
oy = \{0, 1, ..., OY-1\}
O_{g,ox,oy} = \mathrm{MAX}_{fx=0}^{FX-1}\, \mathrm{MAX}_{fy=0}^{FY-1} \big(I_{g,ix,iy}\big) \quad (2.16)

in which

g = \{0, 1, ..., G-1\}
ox = \{0, 1, ..., OX-1\} \quad (2.17)
oy = \{0, 1, ..., OY-1\}
in which

g = \{0, 1, ..., G-1\}
ox = \{0, 1, ..., OX-1\} \quad (2.19)
oy = \{0, 1, ..., OY-1\}

and

ix_1 = ix_2 = ox
iy_1 = iy_2 = oy \quad (2.20)
These layer types are frequently used in a wide range of popular DNNs.
It is worth pointing out some interesting interrelations between these different types of DNN layers (also summarized in the loop-nest sketch after this list). In Figure 2.2, the (e) Group Conv layer can be seen as a superset of layer types (a)-(d):

(a) an FC layer is a Group Conv layer with only the C and K dimensions, i.e., all other dimensions equal to 1;

(b) a Conv layer is a Group Conv layer with group dimension G = 1;

(c) a DW Conv layer is a Group Conv layer with input/output channel dimensions C & K = 1;

(d) a PW Conv layer is a Group Conv layer with group dimension G = 1 and weight kernel sizes FX & FY = 1.
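As a rough illustration of these interrelations (a sketch under simplifying assumptions — unit stride, no padding, ReLU activation — and not code from any of the thesis frameworks), the generic Group Conv loop nest below specializes to the other layer types by fixing the corresponding dimensions to 1:

import numpy as np

def group_conv(I, W, b):
    """Generic Group Conv: I[G][C][IX][IY], W[G][K][C][FX][FY], b[G][K] -> O[G][K][OX][OY].
    With G = 1 it reduces to a Conv layer, with C = K = 1 to a DW Conv layer,
    with G = 1 and FX = FY = 1 to a PW Conv layer, and with all spatial
    dimensions equal to 1 to an FC layer."""
    G, K, C, FX, FY = W.shape
    _, _, IX, IY = I.shape
    OX, OY = IX - FX + 1, IY - FY + 1              # unit stride, no padding (assumption)
    O = np.zeros((G, K, OX, OY))
    for g in range(G):
        for k in range(K):
            for ox in range(OX):
                for oy in range(OY):
                    acc = b[g][k]
                    for c in range(C):
                        for fx in range(FX):
                            for fy in range(FY):
                                acc += W[g, k, c, fx, fy] * I[g, c, ox + fx, oy + fy]
                    O[g, k, ox, oy] = max(acc, 0.0)  # ReLU as an example activation
    return O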
3) Layer connection
Besides these various layer types, there are also different layer connections,
which can be grouped into one-to-one, one-to-more, and more-to-one categories.
Figure 2.3 shows several neural network snippets with previously introduced
layer types and different layer connection patterns.
One-to-one layer connections are straightforward: a single layer's output becomes the input of one and only one succeeding layer.
One-to-more and more-to-one layer connections often appear together in a
DNN with residual or branching network structures, such as ResNet [57] and
Inception-ResNet [155]. In the one-to-more connection, one layer’s output is
used as input by multiple succeeding layers, while in the more-to-one connection,
multiple layers’ outputs jointly become the next layer’s input.
Figure 2.3: Snippets of different DNNs from (a) VGG-19 [149], (b) ResNet-18/34 [57], (c) ResNet-50/101/152, (d) ResNeXt [179], (e) MobileNet-v1 [65], (f) MobileNet-v2 [138], and (g) Inception-ResNet-v2 [155].
4) DNNs
Multiple parallel neurons form a layer, and multiple layers connected together
form a DNN. A great number of DNNs have been proposed in the past decade targeting various tasks, such as image classification/segmentation/super-resolution, object detection, speech recognition, keyword spotting, natural language processing, etc.

Taking image classification on ImageNet [42] as an example, Figure 2.4 from [15] collects multiple famous DNNs, from which we see that different DNNs can have very different parameter counts (ones to hundreds of millions), operation counts (0.1 to tens of giga-operations), and accordingly different model accuracies (Top-1 accuracy from 63.3% to 82.5%). Generally speaking, more accurate DNN models tend to be larger. More recent networks, like CoAtNet-7 [37] and BASIC-L (Lion, fine-tuned) [29], can reach over 90% Top-1 accuracy, at the cost of billions of parameters and tera-scale operation counts.
Figure 2.5: The relationship between FLOPs, number of parameters, and inference latency of VGG-16 pruned models [100].
Most DL models are deterministic as the model parameters are fixed once they
are trained, and the model follows a well-defined set of mathematical rules to
process the input data. Under this determinism, the computation patterns of
DNNs have several similarities and differences at each level, as summarized in
Table 2.1. Across neurons, although the MAC operation is the common core operation, different neurons' MAC operations can still vary in precision and size; across layers, although most DNN layers perform many parallel MAC operations, their layer types and sizes can still differ; across layer connections, there are also variations.
Side by side with the computation pattern, the data pattern, i.e., the operands' data sizes and data reuse (the extent to which the same data element is used in multiple different operations), is also an important DNN characteristic. As computation and data are tightly coupled, differences in computation patterns directly correspond to differences in data patterns.
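As a back-of-the-envelope illustration of data reuse (a hypothetical sketch that ignores padding, strides, and boundary effects), the snippet below estimates how many MAC operations each weight, input, and output element of a Conv 2D layer participates in:

def conv2d_reuse(B, K, C, OX, OY, FX, FY):
    # Total number of MAC operations in the layer
    macs = B * K * C * OX * OY * FX * FY
    # Number of unique elements per operand (unit stride, no padding: IX = OX + FX - 1)
    weights = K * C * FX * FY
    inputs  = B * C * (OX + FX - 1) * (OY + FY - 1)
    outputs = B * K * OX * OY
    # Average reuse factor = how many MACs each element participates in
    return {
        "W reuse": macs / weights,   # each weight is reused across B, OX, OY
        "I reuse": macs / inputs,    # each input is reused across K, FX, FY (roughly)
        "O reuse": macs / outputs,   # each output accumulates C * FX * FY partial products
    }

# Two layers with comparable MAC counts can still show very different reuse patterns
print(conv2d_reuse(B=1, K=64, C=64, OX=56, OY=56, FX=3, FY=3))
print(conv2d_reuse(B=1, K=512, C=512, OX=7, OY=7, FX=3, FY=3))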
From the discussion above, we can comprehend that even when DNNs have a
similar operation count and/or parameter size, they can still display significant
differences in their computation and data patterns internally. This partially
clarifies the earlier common misconception.
When connecting the characteristics of DNNs we discussed above to the domain-
specific hardware design we will discuss later, two observations can be made.
On the one hand, the deterministic nature and similarities among DNNs present
good opportunities for domain-specific hardware design. On the other hand, the
diversity in the computation and data patterns among DNNs poses challenges
to achieving the optimal hardware architecture.
Many metrics are used to objectively assess the quality of different DNN hardware options, of which the most common ones are:

Some works try to quantify these criteria; in [88], for instance, the authors measure DNN accelerator flexibility by first identifying multiple design axes and then assigning each axis a binary value so as to create multiple flexibility classes.
[Figure 2.6 (residue): a scatter plot of speed (GOP/s) versus power (W) for surveyed CPU, GPU, FPGA, and digital/mixed-signal ASIC DNN accelerators (simulation, test, and product), with iso-efficiency lines from 10 GOPs/W up to 1000 TOPs/W. Peak performance is calculated with TDP, while real performance is calculated with the measured power.]

Figure 2.6: An overview of DNN accelerator comparison [85].
range of DNN models. As can be seen from Figure 2.6, ASIC/ASIP3 DNN
accelerators show significant variability across all metrics and achieve the
highest performance and energy efficiency compared to other hardware
platforms. This is because ASIC/ASIP designs are relatively less flexible
(compared to previously mentioned CPU, FPGA, GPU and CGRA) and
thus more easily allow for targeted optimization of specific metrics. For
instance, the Google TPU [81] (ASIP) and Tesla NPU [157] (ASIP)
are optimized for high performance, while BinarEye [114] (ASIC) and
TinyVers [78] (ASIP) target ultra-low power consumption. Figure 2.6
shows two groups of ASIC/ASIP accelerator designs which are digital and
mixed-signal. Digital DNN accelerators are designed using only digital
components to perform DNN computations, whereas mixed-signal ones
combine both digital and analog components for DNN computations, such
as analog-in-memory computing [125].
[Figure residue: a typical DNN accelerator architecture, consisting of off-chip DRAM and, on-chip, a Global Buffer, Local Buffers for the inputs (I) and weights (W), and a MAC/PE array in which each PE contains I/W/O registers and a MAC unit (a multiplier plus an accumulator).]

1) MAC unit

2) PE

[Figure residue: an example of how output operands (data coming out) are collected from a PE array through adders that combine partial results.]
4) Memory hierarchy
• from a memory instance point of view: the memory size, bandwidth, port
number, port type, whether the memory is double-buffered or not, etc;
• from operands’ memory usage point of view: different operands can share
or not share a memory instance, and if shared, how to dynamically allocate
the memory resources (capacity/port/bandwidth) to different operands
when processing different DNN workloads;
• from the whole memory hierarchy point of view: how many memory levels there are in the hierarchy, and whether to allow certain runtime flexibility, such as memory bypassing (e.g., bypassing the weight LB memory by sometimes allowing the GB to talk directly to the weight RF level) or memory operand swapping (e.g., the weight memory used when executing one neural network layer can become the input memory when executing another layer).
[Figure residue: a single-core accelerator system versus a multi-core accelerator system; in the multi-core case, several accelerator cores share the DRAM and exchange data over a communication bus.]
Up until now, we have delved into the DNN workload and accelerator
architecture. It’s time to cover the third but equally important factor in
this DNN acceleration story: mapping, scheduling and layer-core allocation (for
multi-core architectures only), which define how a workload is processed on the
hardware, spatially and temporally.
This subsection first explains the concepts of scheduling and mapping, using the
example in Figure 2.11 and assuming a single-core architecture (subsections 2.3.1,
2.3.2). Then, it discusses the layer-core allocation and scheduling using the
example in Figure 2.14 and assuming a multi-core architecture (subsection 2.3.3).
(A topological ordering of a directed graph is an ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.)
[Figure residue: (top) two cross-layer schedules of a five-layer network on a single-core accelerator, annotating along the processing timeline which layers' data are held in the on-chip memory while computing each layer, and where data must be re-fetched from off-chip; (bottom) part of Figure 2.11, in which the sixteen multiplications W_{k,c} × I_c of an FC layer are mapped spatially and temporally onto four MAC units.]
MAC array and a memory hierarchy. The MAC array takes charge of spatial
parallel computation and the memory hierarchy handles the temporal data
storage. The mapping is correspondingly split into two parts: spatial mapping
and temporal mapping.
In the example of Figure 2.11(a), Layer 2 is an FC layer with 4 inputs, 4 outputs
and 16 weights. Its 16 multiplication operations are highlighted with different
coloured blocks, with which we will explain the spatial and temporal mapping
concepts and their hardware implications.
1) Spatial mapping
• Mapping 1 spatially maps one input and four outputs on the array, i.e.,
all MAC units receive the same input, and each MAC unit generates one
output at a time;
(→ spatial input reuse)
• Mapping 2 spatially maps four inputs and one output on the array, i.e.,
each MAC unit receives a different input, and all MAC units’ outputs are
summed up to generate one output at a time;
(→ spatial output reuse)
• Mapping 3 spatially maps two inputs and two outputs on the array, i.e.,
MAC units 1 & 3, 2 & 4 receive the same inputs, and MAC units 1 & 2’s,
3 & 4’s outputs are summed up to generate two outputs at a time.
(→ hybrid spatial input, output reuse)
• Different spatial mappings imply different spatial data reuse patterns, and thus different low-level memory accesses. For example, in Figure 2.11(c) Mapping 1, the input is spatially reused across the four MAC units, and thus each input is only read from the local memory once, i.e., across the 4 clock cycles the input local memory is read 4 times in total, while there are in total 16 output updates (output local memory reads and writes). Mapping 2 is the opposite: it requires in total 16 input reads and 4 output write-backs. The hardware cost can thus differ due to the different access counts and access costs of the input and output local buffers (memory size/bandwidth, input and output data precision, etc.).
• Different spatial mappings lead to different MAC array utilization for different DNN layer types and sizes. For example, Layer 2 in Figure 2.11(a) has four inputs and four outputs, which enables all 3 spatial mappings in (c) to fully utilize the MAC array. If we change the layer size to one input and four outputs (still an FC layer), the 3 spatial mappings lead to a MAC array utilization of 100%, 25%, and 50%, respectively. The hardware cost can thus differ greatly due to the different array utilization, as illustrated in the sketch after this list.
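The sketch below (an illustrative simplified model, not the cost model used later in the thesis) reproduces the access counts and utilization numbers of the bullet points above for a spatial mapping that unrolls su_k outputs and su_c inputs over a MAC array:

import math

def spatial_mapping_cost(K, C, su_k, su_c):
    """Estimate per-layer costs of spatially unrolling su_k outputs and su_c inputs
    on a su_k x su_c MAC array, for an FC layer with K outputs and C inputs.
    Simplified model: inputs are broadcast over the output dimension and partial
    products are spatially summed over the input dimension."""
    cycles = math.ceil(K / su_k) * math.ceil(C / su_c)
    input_lb_reads = su_c * cycles      # distinct inputs fetched each cycle, shared over su_k MACs
    output_lb_updates = su_k * cycles   # outputs updated each cycle, after spatial summation over su_c
    utilization = (min(K, su_k) * min(C, su_c)) / (su_k * su_c)
    return cycles, input_lb_reads, output_lb_updates, utilization

# The three spatial mappings of Figure 2.11(c) on the 4-input, 4-output FC layer ...
for name, (su_k, su_c) in {"Mapping 1": (4, 1), "Mapping 2": (1, 4), "Mapping 3": (2, 2)}.items():
    print(name, spatial_mapping_cost(K=4, C=4, su_k=su_k, su_c=su_c))
# ... and on a 1-input, 4-output FC layer, where utilization drops to 100%, 25%, and 50%
for name, (su_k, su_c) in {"Mapping 1": (4, 1), "Mapping 2": (1, 4), "Mapping 3": (2, 2)}.items():
    print(name, spatial_mapping_cost(K=4, C=1, su_k=su_k, su_c=su_c))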
2) Temporal mapping
[Figure residue: two temporal mappings (3a and 3b) of spatial Mapping 3, showing for the input and output operands the Global Buffer and Local Buffer contents and access counts along the processing timeline (O': partial output; O: final output). Both mappings execute the same 16 MAC operations with 8 Local Buffer accesses per operand, but Mapping 3a needs 4 input and 8 output Global Buffer accesses, whereas Mapping 3b needs 8 input and 4 output Global Buffer accesses, illustrating how the temporal loop order trades off temporal data reuse between operands.]
Unlike single-core accelerator mappings, which are represented with nested for-loops, multi-core workload deployments frequently use a DAG as the high-level uniform representation for capturing layer-core allocation and scheduling information [43, 19, 180, 152].
Figure 2.15 takes three allocation-schedule cases from Figure 2.14 and shows the DAG representation for each of them. In the example, each node in the graph denotes a layer of the neural network in Figure 2.14(a) and carries the attributes of processing that layer on a specific core; the edges linking the nodes indicate the data dependencies between these layers. With this graph representation, one can readily allocate and schedule the workload, effortlessly switch from one allocation-schedule choice to another, and easily verify whether a certain deployment scheme is valid, i.e., that no core is occupied by more than one workload at any time and that all data dependency relations are respected.
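A minimal sketch of such a validity check (with hypothetical data structures and an assumed dependency graph for the example; this is not Stream's actual implementation):

from dataclasses import dataclass

@dataclass
class ScheduledLayer:
    layer: int
    core: int
    start: int   # start time in clock cycles
    end: int     # end time in clock cycles

def is_valid_schedule(nodes, edges):
    """nodes: list of ScheduledLayer; edges: list of (producer_layer, consumer_layer)."""
    by_layer = {n.layer: n for n in nodes}
    # 1) No core may be occupied by more than one layer at any point in time
    for a in nodes:
        for b in nodes:
            if a.layer < b.layer and a.core == b.core:
                if a.start < b.end and b.start < a.end:   # time intervals overlap
                    return False
    # 2) All data dependencies must be respected
    for producer, consumer in edges:
        if by_layer[consumer].start < by_layer[producer].end:
            return False
    return True

# Allocation 1 / Schedule 1 of Figure 2.15, with an assumed (illustrative) dependency graph
nodes = [ScheduledLayer(1, 1, 0, 50), ScheduledLayer(4, 1, 50, 120),
         ScheduledLayer(2, 1, 120, 200), ScheduledLayer(3, 2, 200, 240),
         ScheduledLayer(5, 2, 240, 270)]
edges = [(1, 2), (1, 4), (2, 3), (3, 5), (4, 5)]   # hypothetical dependencies
print(is_valid_schedule(nodes, edges))             # True for this example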
[Figure 2.14 (residue): an example five-layer DNN deployed onto a two-core accelerator system in which the cores share a DRAM and a communication bus. The latency of each layer on each core is:

            Latency (cc)   Core 1   Core 2
  Layer 1                    50      100
  Layer 2                    80      120
  Layer 3                   100       40
  Layer 4                    70      100
  Layer 5                    40       30

Different layer-core allocations and schedules result in total latencies ranging from 270 cc down to 220 cc.]

[Figure 2.15 (residue): the DAG representations of three allocation-schedule cases from Figure 2.14. Each node is annotated with its layer, its allocated core, and its start and end times on the processing timeline (e.g., in Allocation 1, Layer 1 runs on Core 1 from cycle 0 to 50).]
2.4 Conclusion
• Section 3.2 proposes a new taxonomy, which categorizes all the existing
precision-scalable MAC architectures, uncovers their design patterns, and
links MAC-level design decisions to array-level and algorithm-level choices.
• Section 3.3 presents an exhaustive and illustrated survey of SotA
precision-configurable MAC architectures, to help understand their circuit
topologies and functional differences.
• Section 3.4 explains the design and benchmark methodology, highlighting the construction of a synthesis design space that optimizes for different timing conditions across precision modes.
• Section 3.5 conducts, for each MAC architecture, a detailed analysis of its precision scaling and the associated overheads in terms of energy, area, bandwidth, and throughput.
This section first introduces two dataflow scalability options, Sum Apart (SA)
and Sum Together (ST), with implications at algorithmic, PE array, and MAC
unit level, showing that run-time precision scalability is tightly interwoven with
DNN dataflow considerations. It then maps the SotA precision-scalable MAC
architectures into these categories and proposes a general MAC taxonomy.
The concepts of SA and ST are related to the spatial input and output reuse
described in Chapter 2.3.2. They were originally introduced at MAC level to
qualify two opposite ways of accumulating subword-parallel computations [107]:
SA keeps the parallel-generated products separately, while ST sums them
together to form one single output. These concepts can be applied to differentiate
algorithm-level characteristics of neural-network workloads.
Figure 3.1 illustrates a DNN FC and Conv 2D layer in a nested-loop format.
Indexes appearing on the left-hand side of the MAC operation (b, k, oy and
ox) indicate that all their related partial results are accumulated and stored
into distinct output results, corresponding to SA-type accumulations. On the
contrary, indexes that do not appear in the output (c, fy and fx) indicate
that those partial results are accumulated together along these dimensions,
corresponding to ST-type accumulations. All the loops are divided into either
SA or ST types.
Figure 3.1: SA/ST loop identification for (a) an FC and (b) a Conv 2D layer.
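A tiny sketch of this classification rule (illustrative only): a loop index is SA-type if it appears in the output tensor's indices, and ST-type otherwise.

def classify_loops(loop_dims, output_dims):
    """Split loop dimensions into Sum-Apart (SA) and Sum-Together (ST) types.
    A dimension is SA if it indexes the output (its partial results go to distinct
    outputs), and ST if it does not (its partial results are accumulated together)."""
    sa = [d for d in loop_dims if d in output_dims]
    st = [d for d in loop_dims if d not in output_dims]
    return sa, st

# FC layer:   O[b][k]         += W[k][c]          * I[b][c]
print(classify_loops(["b", "k", "c"], {"b", "k"}))                  # SA: b, k      ST: c
# Conv 2D:    O[b][k][oy][ox] += W[k][c][fy][fx]  * I[b][c][oy+fy][ox+fx]
print(classify_loops(["b", "k", "oy", "ox", "c", "fy", "fx"],
                     {"b", "k", "oy", "ox"}))                       # SA: b,k,oy,ox ST: c,fy,fx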
[Figure residue: PE arrays with different accumulation-island configurations, combining SA and ST sharing along the array dimensions.]
becomes a PE (or MAC) array in itself, which offers more spatial loop-unrolling
opportunities along SA or ST dimensions. Many precision-scalable MAC
architectures have been presented in the literature, which can be categorized
along these SA and ST concepts.
In full-precision mode, SA and ST MACs behave identically: only one result
is generated every clock cycle. In scaled-precision mode, several low-precision
results are generated in parallel. The major difference between SA and ST
MACs is that SA MACs keep these sub-products separately, thus generating
several results in multiple output registers, while ST MACs sum them together
to obtain a single accumulated result, requiring only one register.
In existing neural-network computing patterns, there is an implicit rule that if multiplications share an input operand (either a weight or an input activation), their products cannot be added together. In other words, only products that belong to different output values have the opportunity to use the same weight or input activation. This algorithmic constraint contributes to the fundamental input-output bandwidth trade-off between SA and ST MACs. This will be covered in more detail in Section 3.3.
3.2.4 Taxonomy
Building upon these concepts of SA and ST, Table 3.1 presents the complete
taxonomy of precision-scalable MAC architectures.
Vertically, all precision-scalable MAC architectures are first organized into
spatial and temporal precision-scaling techniques. In the category of spatial
precision configurability, the MACs are spatially split into several lower-precision
arithmetic operators in reduced-precision modes. They are further categorized
using the SA and ST principles. The category of temporal precision-scalable
MACs has a totally different working principle. They perform multiplication
through temporal iteration of repeated add-shift operations. Reduced-precision
Figure 3.3: Data-gated conventional MAC for either one full-precision 8b×8b, one symmetric 4b×4b, or one weight-only 2b×8b operation.
Figure 3.4 shows the MAC architecture of the Deep Neural Processing Unit
(DNPU), introduced by Shin et al. [146] and characterized as 1D Fully-Unrolled
(FU) SA in the new taxonomy.
Its 8b×8b multiplier is built out of four 2b×8b sub-multipliers followed by
configurable shift-and-add logic blocks. Only one of its inputs is functionally
scalable, hence it is 1D scalability. More specifically, the top input is subword
scalable, while the left 8b input is treated as one operand across all modes and
[Figure 3.4 (residue): the 1D Fully-Unrolled SA MAC of the DNPU, whose 8b×8b multiplier is split into four 2b×8b sub-multipliers followed by configurable shift-and-add logic.]
3.3.3 1D Fully-Unrolled ST
[Figure residue: the 1D Fully-Unrolled ST MAC architecture in 8b, 4b, and 2b modes.]
3.3.4 2D Fully-Unrolled SA
[Figure residue: 2D Fully-Unrolled SA and ST MAC architectures and the SWU SA MAC, each shown in 8b, 4b, and 2b precision modes with their shift-and-add structures and gated regions.]
The ST version of the SWU MAC unit [107] is also a 2D symmetric scalable architecture based on an array multiplier. But unlike SWU SA, SWU ST adds all subword results together by activating the array multiplier in an opposite diagonal pattern, as shown in Figure 3.9. Such a configuration saves addition logic, as it uses the multiplier array cells to implicitly perform the addition. The propagating region that cannot be fully gated shrinks as precision scales down, while the critical path remains roughly the same, which is just the opposite of SWU SA.

It is worth noting that, compared to the previous FU ST architectures, the input bandwidth of SWU ST remains the same across precision modes, at the cost of partly gating its arithmetic blocks when precision scales down. So, there is a trade-off for ST-type MACs between input bandwidth consistency and high hardware utilization.
[Figure 3.9 (residue): the SWU ST MAC, which activates the array multiplier in an opposite diagonal pattern in reduced-precision modes, gating the remaining cells.]
Bit-serial designs have recently gained attention with both the Unified Neural
Processing Unit (UNPU) by Lee et al. [94] and the QUEST log-quantized
3D-stacked inference engine by Ueyoshi et al. [163]. Indeed, bit-serial operand
feeding implicitly allows fully-variable bit precision.
Considered in this study, the UNPU bit-serial MAC receives weights through
1-bit iterations while input activations are kept at full precision and sent in a
[Figure residue: the 1D bit-serial MAC of the UNPU, in which weights are fed in 1-bit iterations and the accumulated result is right-shifted each cycle.]
LOOM, designed by Sharify et al. [142], has extended the 1D bit-serial approach
to a 2D scalable architecture. This 2D scalable design feeds both input operands
bit-serially, as illustrated in Figure 3.12. By doing so, the bit-wise AND logic
from Figure 3.10 is further simplified to a single AND gate, but the number
of clock cycles to complete one MAC operation is increased to the product of
input activation and weight’s precisions.
Since both input operands are fetched bit-by-bit and all input bit combinations
need to be traversed, a more complex input feed-in schedule and control logic
are required. Figure 3.13 presents the 4b×4b scheduling used in this work ("a"
for input activation and "w" for weight).
Note that the input fetching pattern is not regular and contains repetitions.
For example, in this 4b×4b case, the bit-by-bit weight fetching sequence is:
w0 , w1 , w0 , w2 , w1 , w0 , w3 , w2 , w1 , w0 ... In addition, unlike 1D bit-serial units
that perform a right shift at each clock cycle, 2D bit-serial units shift partial
products irregularly, as indicated by the red arrows in Figure 3.13.
2D bit-serial designs can also be implemented in a multi-bit processing fashion. Figure 3.14 exhibits a 2D 4-bit serial design under three different precision modes. By feeding in 4 bits at a time for both operands, the number of compute cycles is drastically reduced. Note that since the basic arithmetic block is a 4b multiplier, the precision scaling by design can only be pushed down to 4 bits; scaling below 4 bits is again enabled through LSB data gating. This survey includes both 2D 2-bit and 4-bit serial MACs.
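As a quick illustration of this cycle-count trade-off (a sketch based only on the descriptions above, ignoring pipelining and control overheads):

import math

def serial_mac_cycles(prec_w, prec_i, bits_w_per_cycle, bits_i_per_cycle=None):
    """Clock cycles for one MAC on a (multi-)bit-serial unit.
    1D serial: only the weight is fed serially (bits_i_per_cycle=None keeps inputs at full precision).
    2D serial: both operands are fed serially, so the cycle count is the product of both iteration counts."""
    w_iters = math.ceil(prec_w / bits_w_per_cycle)
    if bits_i_per_cycle is None:
        return w_iters
    return w_iters * math.ceil(prec_i / bits_i_per_cycle)

print(serial_mac_cycles(8, 8, 1))       # 1D 1-bit serial, 8b weights:  8 cycles
print(serial_mac_cycles(8, 8, 1, 1))    # 2D 1-bit serial, 8b x 8b:    64 cycles
print(serial_mac_cycles(8, 8, 4, 4))    # 2D 4-bit serial, 8b x 8b:     4 cycles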
Figure 3.13: A bit-wise feed-in schedule for 2D bit-serial MAC in 4b×4b mode.
The remainder of this chapter puts together individual evaluations and comparative studies of all the aforementioned precision-scalable MAC architectures. To this end, the present section describes the methodology used for implementing and benchmarking these circuits in terms of area, throughput, bandwidth, and energy per operation for 5 precision modes: 8b×8b full precision, 2b×2b and 4b×4b symmetric scaling, and 2b×8b and 4b×8b weight-only scaling.
To ensure the fairness and coherence of this study, all MAC architectures are
designed with equivalent features and identical circuit optimizations.
They are all built with a two-stage pipeline structure for the 8b full-precision
baseline. To support the different precision modes, their accumulation registers
are partitioned and gated as shown in Figure 3.15. The accumulation headroom
is 4b for any precision mode. Outputs are updated and stored separately for
SA MACs (Figure 3.15(a)), and into a single unpartitioned register for ST and
temporal-based scalable MACs (Figure 3.15(b)).
Besides, all temporal-based designs in this study assume a right-shifting
sequential multiplier as it requires a smaller first-stage adder than a left-shifting
design, preventing long carry propagation and sign-bit extension.
Input registers and overheads that can be shared among multiple PEs (e.g.,
control logic or FSMs) are excluded from area and power reporting.
Figure 3.15: Register settings for (a) SWU SA and (b) SWU ST MAC units.
In total, 19 designs are made (listed in Figure 3.16), with multiple implementations of each architecture, varying the level of scalability, i.e., the scalability granularity. For instance, a 1-level scalable design allows scaling from 8b down to 4b by design, with the 2b mode carried out by data gating on top of the 4b mode. Data gating is applied from LSB to MSB to prevent unnecessary energy being spent on carry propagation and sign-bit extension. A 2-level scalable design directly allows scaling down to both 4b and 2b by design. As they require different circuit and neural-network implementations, binary computations (1b precision) are not considered in this study.
Since the various MAC architectures are expected to have very different optimal operating frequencies, this study explores a broad range of clock targets with frequencies from 625 MHz to 5 GHz. To allow each precision mode to find its optimum, constraint sweeps are applied individually to each mode (clock periods between 0.2 and 1.6 ns, with a sweeping step of 0.05 ns). Overall, over 5000 circuits are synthesized and characterized for each specific MAC design.
3.4.5 DVFS
For each MAC architecture, this section evaluates the characteristics of one representative synthesized instance: the one with the lowest average energy-delay-area product across the precision modes (8b, 4b, and 2b), which represents a balanced and coherent design choice.
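A sketch of this selection criterion with made-up numbers (the per-mode energies, delays, and areas below are purely illustrative):

def select_representative(circuits):
    """circuits: list of dicts with per-mode (energy [pJ/op], delay [ns], area [um^2]) tuples.
    Returns the circuit with the lowest energy-delay-area product averaged over the modes."""
    def avg_edap(c):
        modes = ["8b", "4b", "2b"]
        return sum(c[m][0] * c[m][1] * c[m][2] for m in modes) / len(modes)
    return min(circuits, key=avg_edap)

# Two hypothetical synthesis points of the same MAC design
candidates = [
    {"name": "fast", "8b": (1.2, 0.25, 900), "4b": (0.6, 0.25, 900), "2b": (0.4, 0.25, 900)},
    {"name": "slow", "8b": (0.9, 0.80, 600), "4b": (0.5, 0.80, 600), "2b": (0.3, 0.80, 600)},
]
print(select_representative(candidates)["name"])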
Figure 3.16: Legend of all the bar charts in Section 3.5. 19 designs in total.
3.5.1 Bandwidth
Figures 3.17(a)-(b) present the required memory bandwidth per clock cycle of
the different MAC units over each precision mode.
This shows the internal trade-off between input and output bandwidths. SA
MACs produce multiple independent results in parallel, hence increasing the
output bandwidth when precision scales down by design (i.e. down to 4b for
1-level scalable circuits and down to 2b for 2-level ones). However, since neural
networks allow input data reuse over different outputs, inputs can be shared
among the sub-computations, leading to a lower input bandwidth.
Inversely, ST MACs only store one result, which is the sum of multiple
low-precision multiplications, leading to a relatively small output bandwidth.
However, since those products are parts of one output element, they cannot
share the same input. This results in a largely increased input bandwidth. For
example, in order to keep its 16 arithmetic blocks busy in 2b×2b mode, the
input bandwidth of 2D FU ST explodes. Such highly-unsteady bandwidth may
complicate its memory interface and its integration into a chip.
Finally, the much smaller bandwidth of temporal-based MAC units could be a
huge advantage over spatial-based designs for narrow memory-band systems.
Figure 3.17: Bar charts of the bandwidth per clock cycle of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.
Figures 3.18(a)-(b) therefore show the bandwidth per operation of each MAC,
derived from the above-mentioned bandwidth (bit/clock cycle) and computing
capability (operation/clock cycle).
As expected, at full precision, all MACs have the same bandwidth per operation.
With precision scaling down, ST-type MACs show superiority over others,
especially 1D and 2D FU ST. This is because their arithmetic block remains
fully utilized, which keeps their computing capability high, and their output
bandwidth remains very low, which balances the increased input bandwidth.
For symmetric scaling, SWU MACs lose their advantage because of their unused
arithmetic logic, leading to a relatively low computing capability. SWU ST is
slightly better than SWU SA because, with the same computing capability, it
[Figure 3.18 (residue): bar charts of the bandwidth per operation (bit/op) of the precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.]
has lower bandwidth requirements. For asymmetric scaling, SWU MACs only
benefit from data gating, behaving like the baseline circuit.
In temporally scaled MAC units, the low bandwidth and low compute capacity
compensate for each other, resulting in similar bandwidth per operation as the
baseline circuit.
It is easy to notice that the differences between designs are less pronounced in Figure 3.18 (bandwidth/op) than in Figure 3.17 (bandwidth). The reason is that designs with a small bandwidth generally also perform few operations per cycle (especially the serial designs, and the SWU designs when performing weight-only scaling). When designs are compared by the ratio of the two quantities (bandwidth/op), the resulting differences are therefore less significant than when comparing bandwidth alone.
[Figure residue: bar charts of the throughput of the precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.]
for different objectives as the critical path changes from one mode to the other.
Another reason for its great speed is its SA accumulation stage which only
contains short and therefore faster adders.
Figure 3.20 shows the area breakdown of the different MAC architectures
synthesized in a 28 nm CMOS process normalized to the area of the data-gated
conventional MAC unit.
Not surprisingly, all scalable units based on sub-computation parallelism require
overhead for configurability. Among them, 1D and 2D FU approaches are the
most area-consuming due to customized internal structures, with 2D FU designs
requiring up to 4.4× the area of a conventional MAC (without including the
input registers as mentioned in 3.4.1).
On the contrary, SWU MACs mitigate the overhead by reusing arithmetic cells
for subword-parallel computations. SWU ST circuits are particularly optimal,
with only 10% to 18% overhead for 1 and 2-level scalability, respectively.
Finally, serial designs are the only type requiring less area than the conventional
multiplier, allowing area savings up to 40% on the MAC circuit for area-
constrained systems.
Concerning the sequential area (bottom darker bars), FU SA designs require far
more registers than others. This is due to their wide asymmetric sub-products
(as one input operand is kept full-precision) for 1D FU SA, as well as the
quadratic increase of sub-products for 2D FU SA. Although SA-type, SWU SA
keeps a low sequential area at the cost of a sacrificed throughput.
Figure 3.20: Bar chart of the normalized area of precision-scalable MAC units.
Figures 3.21-3.28 show the breakdown of energy per operation when scaling
precision for each type of MAC architecture. The left subfigures show symmetric
scaling scenarios while the right ones show weight-only scaling. All energy values
are normalized to the same full-precision data-gated conventional MAC drawn
with a solid black line.
Processing at full precision with scalable designs always comes with some energy
penalty. For 1D FU and SWU MACs (Figures 3.21-3.22, 3.25-3.26), energy
overheads for 8b computations are in the order of 20% to 40% for 1 and 2-level
scalability, respectively. For 2D FU architectures (Figures 3.23-3.24), these
costs increase to 52% and 94%.
Serial designs (Figures 3.27-3.28) require much more energy at full precision
due to their need for several clock cycles per computation, diluting the power
into the clock tree. At full precision, the 1D bit-serial MAC consumes 3.3× as
much energy per operation as data gating, and the 2D bit-serial 14× as much.
Reassuringly, 1D multi-bit serial designs (Figure 3.27) come at a lower energy
penalty: the 2-bit serial MAC consumes 2.2× baseline energy, while it is also
able to scale precision down to 2 bits by design, and the 4-bit serial MAC
reduces the cost to 1.5× as much as data gating.
For nearly all designs, the 2-level MAC consumes more at full precision than its 1-level version. Remarkably, this is not the case for the SWU ST architecture (Figure 3.26), for which the second level of scalability costs barely more than the first. This is connected to its simpler ST-style accumulation coupled with its efficient reuse of multiplication logic.
Figure 3.21: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 1D FU SA MAC (DNPU) [146].

Figure 3.22: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 1D FU ST MAC.

Figure 3.23: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 2D FU SA MAC.

Figure 3.24: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 2D FU ST MAC (BitFusion) [143].

Figure 3.25: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a SWU SA MAC (DVAFS) [117].

Figure 3.26: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a SWU ST MAC (ST) [107].

Figure 3.27: Normalized energy/op for (a) symmetric and (b) weight-only scaling in 1D serial (UNPU) [94] and multibit-serial MACs [21]. Beware of the scale.

Figure 3.28: Normalized energy/op for (a) symmetric and (b) weight-only scaling in 2D serial and multibit-serial MACs (LOOM) [142]. Beware of the scale.
regarding design synthesis variation. In this figure, each bar has an upper bound and a lower bound (when not visible, the two bounds overlap). The upper bound depicts the energy efficiency across precision modes of a single circuit of each design, this circuit being selected as a balanced design choice with the minimum energy-delay-area product averaged across the precision modes (as described at the start of Section 3.5). To evaluate this design selection, the lower bound of each bar shows the minimum energy point measured for each individual precision mode. In other words, comparing the upper bound and the lower bound compares the "best-balance" design choice we took with the "best-per-mode" design among the entire design synthesis space (> 5000 circuits, explained in Section 3.4.2). The results confirm that the "best-balance" design remains close to the best achievable energy efficiency and does not change the observed trends.
[Figure residue: bar charts of the energy per operation of the precision-scalable MAC units for symmetric and weight-only scaling, in which each bar carries the upper and lower bounds described above.]
Figure 3.30 displays the feasible implementations and Pareto frontiers for each
precision-scalable MAC design at each precision scenario, sweeping across a
broad range of frequency constraints at nominal supply voltage. Circuits are
compared in terms of energy per operation and throughput per area. The best
circuits are towards the bottom-right corners of subfigures.
At 8b precision (Figure 3.30(a)), data gating is undeniably the most efficient,
capable of the highest throughput per area, followed first by 1-level (bright
colored dashed lines) and then by 2-level (dark colored dotted lines) scalable
designs which suffer from the scalability overhead.
When scaling precision symmetrically (Figures 3.30(b)-(c)), the 2D FU ST and SWU ST architectures largely outperform the other architectures in terms of energy per operation. 2D FU SA circuits are by far the best in throughput-area performance, but not in energy efficiency, given their large non-shared registers. Note that 1-level scalable circuits remain the best compromise at 4b precision and above, but this trend reverses at 2b precision, which is visible from the inversion of the bright and dark curves between the left and right subfigures.

When only reducing the weight precision down to 2b (Figure 3.30(e)), the 2-level 1D FU ST architecture is optimal for energy, while its 1D FU SA companion shines in throughput-area efficiency. At 4b weight precision (Figure 3.30(d)), 1-level 1D FU ST and 1D 2-bit serial are best for energy; however, the gains over the baseline are limited (20% at most). By bypassing internal additions, 1D FU SA is advantaged in throughput, while 1D 4-bit serial benefits from a smaller area.
[Figures residue: two sets of energy-per-operation versus throughput-per-area Pareto plots of all precision-scalable MAC designs, each with subplots for (a) 8-bit full precision, (b) 4-bit symmetric, (c) 2-bit symmetric, (d) 4-bit weight-only, and (e) 2-bit weight-only scaling, and a legend listing the 19 designs (data gating, 1D/2D FU SA/ST, SWU SA/ST, and 1D/2D bit- and multibit-serial MACs).]
The best precision-scalable MAC architecture is neither the one with the lowest consumption at reduced precision, nor the one with the lowest overhead, nor the one with the steepest energy scaling. The best architecture is the one that best matches the needs of its application. Hence, this section offers
comparative study for practical use cases of the MAC units, rather than for
each precision mode individually. Such a study is vital as it makes a direct
link to the application level, and it simplifies the analysis by combining many
results into a single trade-off.
Three case studies are assessed by defining the percentage of operations the
MAC unit does under each precision mode. For simplicity, only the fraction of
full-precision computations (8b) is set, while the ratios of 4b and 2b operations
are assumed equal. Energy and throughput-area efficiencies are weighted
considering these ratios to redraw a global Pareto frontier.
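One plausible way to perform such a weighting is sketched below (illustrative only; the exact weighting behind Figures 3.32-3.33 may differ):

def use_case_metrics(frac_8b, energy_per_op, throughput):
    """frac_8b: fraction of operations executed at 8b; 4b and 2b split the rest equally.
    energy_per_op: dict mode -> pJ/op; throughput: dict mode -> ops per unit time."""
    fractions = {"8b": frac_8b, "4b": (1 - frac_8b) / 2, "2b": (1 - frac_8b) / 2}
    # Energy: operation-count-weighted average of energy per operation
    energy = sum(fractions[m] * energy_per_op[m] for m in fractions)
    # Throughput: total ops divided by total time, i.e., a harmonic-style mean over the modes
    time = sum(fractions[m] / throughput[m] for m in fractions)
    return energy, 1.0 / time

# Hypothetical numbers for one MAC design: scaled modes are cheaper and faster than 8b
print(use_case_metrics(0.05, {"8b": 1.0, "4b": 0.5, "2b": 0.3}, {"8b": 1.0, "4b": 2.0, "2b": 4.0}))

The harmonic-style throughput term also reflects the observation below that faster scaled-precision operations occupy the circuit for much less time than the 8b operations do.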
Figures 3.32-3.33 show the case studies with full-precision operations representing 33%, 20%, and 5% of the computations, respectively (2D 1-bit serial MACs have been omitted for clarity due to their sub-optimality). Only the nominal voltage is considered here, since the 1 V comparison results remain valid with DVFS.

First and foremost, when using all the precision modes uniformly (Figures 3.32(a), 3.33(a)), the conventional MAC with simple data gating comes close to the most energy-efficient architecture, while also being by far the best in terms of throughput and area. Indeed, no matter how efficient the scaled operations are, the much slower and more energy-consuming 8b computations largely dominate the energy budget.
Note that even if 2b and 4b operations represent 66% of the computations in this use case, since these operations are often 2× to 15× faster in scaled precision modes (cf. Section 3.5.3), the circuits spend far less time in these precision modes than on the 8b operations.
For symmetric scaling scenarios (Figure 3.32(a)), the conventional MAC unit is slightly outperformed by the 1-level SWU SA architecture and by both the 1-level and 2-level SWU ST circuits. On the other hand, for weight-only scaling (Figure 3.33(a)), the conventional unit stays more energy efficient, ahead of the 1-level 1D FU SA and ST MAC circuits.
Across real-world DNNs, some low-complexity applications can use even fewer full-precision operations [143], e.g., only around 5%. This 8b-operation ratio is used for the third use case (Figures 3.32(c), 3.33(c)). At that proportion of 8b operations, SWU ST and 2D FU ST largely outperform data gating in the symmetric scenarios, by 32% and 22%, respectively.

For the weight-only scaling scenarios (Figure 3.33(c)), some scalable MAC units now also become beneficial, but the energy gains are lower: at most 15% for 2-level 1D FU ST, followed by 1-level 1D FU ST and 1D 4-bit serial MACs (8%).
[Figures 3.32 and 3.33 (residue): use-case-weighted energy per operation versus throughput per area of the precision-scalable MAC units, for symmetric and weight-only scaling respectively, with subplots for (a) 33%, (b) 20%, and (c) 5% 8b operations.]
3.8 Conclusion
Moving one abstraction level up from the previous chapter, this chapter aims to
offer a clear view of the design space of precision-scalable MAC arrays (PSMAs),
and provides answers to the consecutive open question Q1.2: What is the best
MAC array architecture for variable-precision DNN execution?
To this end, this chapter first introduces a precision-aware nested for-loop
representation for DNN mappings. Next, based on this new representation, it
proposes a comprehensive PSMA taxonomy, capable of systematically covering
the most prominent SotA PSMAs, as well as uncovering new PSMA architectures.
Following that, a highly parameterized PSMA template is built that can be design-time configured into a huge subset of the design space spanned by the taxonomy. This allows 72 different PSMA architectures to be benchmarked fairly and thoroughly. This benchmarking study is performed in 28 nm technology
targeting run-time precision scalability from 8 to 2 bits, operating at 200 MHz
and 1 GHz. Analyzing resulting energy and area breakdowns and comparing
the array-level optimality to the previous chapter’s MAC-level optimality reveal
key design insights for PSMA architectures.
This chapter is based on publication [73], and contains large fractions of it. The author's contributions include (but are not limited to) the enhanced loop representation, PSMA design space identification, design methodology, and paper writing.
• Section 4.2 first extends the traditional nested for-loop based mapping representation of a Conv 2D layer with 2 additional precision-aware bit-group loops. Then, it discusses the implications of the two newly added bit-group loops on the underlying hardware and their spatial/temporal mapping options.
• Section 4.3 presents a new PSMA taxonomy, based on the newly proposed precision-aware for-loop representation. A wide range of SotA PSMA architectures are then mapped to the taxonomy, and thanks to the taxonomy, many new PSMA topologies are also identified.
• Section 4.4 introduces a uniform and highly parameterized PSMA
template, which can be design-time configured to cover a large subset of
the PSMA design space spanned by the taxonomy, laying the foundation
for a fair comparison and fast exploration in PSMA’s design space.
• Section 4.5 benchmarks this design space in terms of area and energy,
resulting in PSMA design insights.
• Section 4.6 concludes this chapter.
4.2.1 Motivation
[Figure 4.1 (residue): a 4-bit × 4-bit multiplication decomposed into 2b × 2b bit-group (BG) products.]

O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx][2bi+1 : 2bi] · 2^(2bi) × W[k][c][fy][fx][2bw+1 : 2bw] · 2^(2bw)

Figure 4.1 illustrates a MAC unit that performs spatial unrolling at the BG level at full precision, and enables
additional spatial unrolling possibilities at lower precisions.
This concept can now be introduced in the DNN mapping representation to
unify all precision scalability techniques. This representation is traditionally
characterized as 7 nested for-loops. To include the impact of precision scalability,
2 additional BG for-loops are added to it, as shown in Figure 4.2. In the previous
example of Figure 4.1, at high precisions, the 2 BG loops are spatially unrolled
inside the MAC unit. In this mode, the MAC unit is hence unable to unroll
any other loop spatially. At lower precisions, in contrast, the BG loop dimensions of the workload shrink, making room for other for-loops to be spatially unrolled inside the MAC unit.
Treating the BG loops similarly to any other Conv layer loop enables the
flexibility of mapping them spatially and also temporally, as done in bit-serial
(BS) architectures. This allows us to uniformly characterize the behaviour of a
wide variety of PSMAs. Note that besides Conv layers, other DNN layers like
FC, DW and PW can also be extended following the same method.
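As a concrete illustration (a Python sketch with my own naming; the full 7-loop Conv nest is collapsed into one reduction loop for brevity, and operands are assumed unsigned), the two BG loops decompose a multiplication into 2b×2b bit-group products exactly as in the formula of Figure 4.1:

```python
BG = 2                 # bits per bit group
PREC = 4               # operand precision (4b I and W, as in the Figure 4.9 example)
N_BG = PREC // BG      # number of bit groups per operand

def bit_group(x: int, g: int) -> int:
    # Extract bit group g, i.e. bits [BG*g + BG - 1 : BG*g] of x.
    return (x >> (BG * g)) & ((1 << BG) - 1)

def mac_bg(inputs, weights):
    # Reduction loop over n, plus the two added BG loops bi and bw.
    o = 0
    for n in range(len(inputs)):
        for bi in range(N_BG):
            for bw in range(N_BG):
                o += (bit_group(inputs[n], bi) << (BG * bi)) * \
                     (bit_group(weights[n], bw) << (BG * bw))
    return o

I = [3, 7, 12, 15]     # unsigned 4b example operands
W = [5, 9, 2, 11]
assert mac_bg(I, W) == sum(i * w for i, w in zip(I, W))
```

At the lowest precision the two BG loops collapse to a single iteration, which is exactly what frees them up for other loop dimensions.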
Figure 4.3: L4: 4×4 L3 units; L3: 4×4 L2 units; L2: 4×4 L1 units; L1: a 2b×2b multiplier.
The hierarchy is thus built from L1 units that can go down to 2b precision, which is the lowest precision supported in
most precision-scalable accelerators [143, 133, 51, 116]. Furthermore, this multi-level architecture will later give a clear perspective of where each computation loop gets unrolled/mapped to.
Figure 4.4: Input, Hybrid, and Output Sharing (or Spatial Reuse, as explained in Chapter 2.3.2) at Level n (Ln). Results from one accumulation island are added together spatially to form one output.
Besides WS/IS/OS loops, the BG loops can also be spatially unrolled on the
MAC array, and dictate where the shift & add tree needs to be inserted. In
the proposed taxonomy, BG loop unrolling can either occur spatially at one
of the two lowest levels of the MAC array (so either at the L2 or at the L3
level; note that L1 is a 2b multiplier, which is not an array level), or can occur
temporally, a.k.a. bit-serial processing (BS). For cases in which BG loops are
unrolled temporally, internal registers are required in order to hold intermediate
results towards performing shift-add operation temporally. BS designs can be
further sub-categorized based on the level in the PSMA at which the internal
registers are located.
Note that BG loops are only unrolled at precisions higher than the minimal precision mode, because at the lowest precision there are no BG loops to unroll, i.e., all BG for-loop dimensions equal 1.
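As a tiny illustration (helper name mine; a 2b bit group is assumed, as in the examples of this chapter):

```python
def bg_loop_dim(operand_precision: int, bg_precision: int = 2) -> int:
    # Number of bit groups in one operand, i.e. the size of its BG for-loop.
    return operand_precision // bg_precision

assert bg_loop_dim(8) == 4   # at 8b there are 4 bit groups to unroll
assert bg_loop_dim(2) == 1   # at the minimal precision the BG loop disappears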
[Figure residue: FU and SWU PSMA instances with run-time configurable shift-add trees and accumulation islands across the 8b/4b/2b precision modes, with panels showing No Sharing, Input Sharing (IS), Hybrid Sharing (HS), and Output Sharing (OS) at L2; the SWU variants gate part of the L1 units at lower precisions.]
These designs scale down in precision by gating part of the L1 units, keeping the input bandwidth fixed. This family of MACs was first introduced in [117, 107], and extended to a broader class of designs in [22]. As in Chapter 3, we here again refer to them as Subword-Unrolled (SWU) designs, in contrast to the Fully-Unrolled (FU) topologies.
2. BG: Level at which the BG loops are unrolled, selected from spatially at
L2/L3 or temporally in a bit-serial (BS) way. If BS, we also identify at
which level the internal registers are located, denoted as BS-Lx.
3. Config: I/O bandwidth and hardware utilization trade-off configuration,
which can be FU or SWU.
4. Mode: Precision scalability, which can be 1D, 2D asymmetric, or 2D
symmetric.
This newly introduced taxonomy makes it possible to map all PSMA techniques
presented in literature, as summarized in Table 4.1. Since BG unrolling is only
relevant when the precision mode is above the minimal precision, a separate
row per design is added for both high and low precisions.
For DNPU [146] and BitFusion [143], each L2 unit produces a full product every
clock cycle, thus their BG are unrolled spatially at L2. The difference between
them is that BitFusion is 2D asymmetric scalable, while DNPU is only 1D
scalable. Additionally, at low precisions, BitFusion’s L2 is output-shared, while
DNPU’s L2 is input/weight-shared. BitBlade [133] and Ghodrati [51] have a
very similar PSMA architecture, and thus are mapped the same way in the new
taxonomy. In their works, the shifters are shared between L2 units at L3, thus
their BG is unrolled spatially at L3.
Stripes [82], UNPU [94], and Loom [142] are BS designs, meaning the BG
loops are unrolled temporally rather than spatially. Stripes and UNPU are 1D
scalable, while Loom is 2D scalable. Additionally, UNPU is IS at L2 with its
internal shift-add registers located at L1, while Stripes and Loom are OS at L2, with their internal registers located at L2.

Table 4.1: SotA PSMAs mapped onto the proposed taxonomy (L4 / L3 / L2 sharing at high and low precision, BG unrolling level, configuration, and precision-scalability mode).

  Design          | Prec. | L4  L3  L2 | BG    | Config | Mode
  DNPU [146]      | High  | IS  OS  BG | L2    | FU     | 1D
                  | Low   | IS  OS  IS |       |        |
  BitFusion [143] | High  | IS  OS  BG | L2    | FU     | 2D-A
                  | Low   | IS  OS  OS |       |        |
  BitBlade [133]  | High  | IS  BG  OS | L3    | FU     | 2D-A
                  | Low   | IS  OS  OS |       |        |
  Ghodrati [51]   | High  | IS  BG  OS | L3    | FU     | 2D-A
                  | Low   | IS  OS  OS |       |        |
  Stripes [82]    | High  | IS  IS  OS | BS-L2 | FU     | 1D
                  | Low   | IS  IS  OS |       |        |
  UNPU [94]       | High  | IS  OS  IS | BS-L1 | FU     | 1D
                  | Low   | IS  OS  IS |       |        |
  Loom [142]      | High  | IS  IS  OS | BS-L2 | FU     | 2D-A
                  | Low   | IS  IS  OS |       |        |
  Envision [116]  | High  | IS  IS  BG | L2    | SWU    | 2D-S
                  | Low   | IS  IS  No |       |        |
  ST [107]        | High  | IS  OS  BG | L2    | SWU    | 2D-S
                  | Low   | IS  OS  OS |       |        |
The final batch includes Envision [116] and ST [107], which are both SWU designs. Fundamentally, the difference between them is their behaviour at L2 at low precisions, where Envision has no sharing while ST is OS. Moreover, Envision opts for an IS scheme at L3, while ST goes for an OS scheme.
With the introduction of the new taxonomy, the road is paved towards
benchmarking the full PSMA design space under the same operating
circumstances and technology, leading to better insights on the pros and
cons of each PSMA architecture. To achieve this goal, a uniform and highly parameterizable MAC array template is built, which makes it possible to efficiently map the different design points covered by the taxonomy.
The SotA implementations presented in Table 4.1 are not the only possible
PSMA instantiations. The taxonomy makes it possible to identify a broader range of possible architectures that were not previously covered in literature. To be
able to quickly implement and benchmark all different configurations in the
introduced design space, a uniform and highly parameterized PSMA template
is developed. Based on its user-defined parameters, this flexible PSMA can
be design-time configured into any array architecture in a subset of the design
space spanned by the introduced taxonomy, i.e., it supports different L2, L3,
L4, BG, and FU/SWU settings.
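For illustration, a hypothetical design-point record could capture these taxonomy parameters as follows (a sketch only; the field names follow the taxonomy, not the actual template's code):

```python
from dataclasses import dataclass
from typing import Literal

Sharing = Literal["IS", "HS", "OS", "NO"]

@dataclass(frozen=True)
class PSMADesignPoint:
    l4: Sharing                          # sharing mode of the 4x4 L3 units
    l3: Sharing                          # sharing mode of the 4x4 L2 units
    l2: Sharing                          # sharing mode of the 4x4 L1 units
    bg: Literal["L2", "L3", "BS"]        # where the bit-group loops are unrolled
    config: Literal["FU", "SWU"]         # fully-unrolled vs. subword-unrolled
    mode: Literal["1D", "2D-A", "2D-S"]  # precision-scalability mode

# A BitBlade-like point: L4: IS / L3: OS / L2: OS, BG unrolled at L3, FU, 2D asymmetric.
bitblade_like = PSMADesignPoint(l4="IS", l3="OS", l2="OS",
                                bg="L3", config="FU", mode="2D-A")
```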
Figure 4.9: BG is unrolled spatially at L2, at L3, or temporally at L3 (BS-L3). Assume OS at L2 and L3. (Assume 4-bit I and W; 12-bit O; 2 bits as a Bit Group.)

Three equivalent loop representations of the same 16-element MAC reduction:

  for n in [0:16):
      O += I[n] × W[n]

  for n2 in [0:4):
      for n1 in [0:4):
          O += I[4n2+n1] × W[4n2+n1]

  for n2 in [0:4):
      for n1 in [0:4):
          for bi in [0:2):
              for bw in [0:2):
                  O += I[4n2+n1][2bi+1:2bi] · 2^(2bi) × W[4n2+n1][2bw+1:2bw] · 2^(2bw)

  BG unrolled at | Loop implication                  | Hardware implication (datapath precision)
  L2             | for n2 in [0:4)    -> temporal    | L1 multiplier: 2b×2b
                 | unroll n1 in [0:4) -> spatial-L3  | L2 shift-adder: 4b+6b+6b+8b
                 | unroll bi in [0:2) -> spatial-L2  | L3 adder: 8b+8b+8b+8b
                 | unroll bw in [0:2) -> spatial-L2  | Temporal adder: 10b+12b
  L3             | for n2 in [0:4)    -> temporal    | L1 multiplier: 2b×2b
                 | unroll bi in [0:2) -> spatial-L3  | L2 adder: 4b+4b+4b+4b
                 | unroll bw in [0:2) -> spatial-L3  | L3 shift-adder: 6b+8b+8b+10b
                 | unroll n1 in [0:4) -> spatial-L2  | Temporal adder: 10b+12b
  BS-L3          | for bi in [0:2)    -> temporal    | L1 multiplier: 2b×2b
                 | for bw in [0:2)    -> temporal    | L2 adder: 4b+4b+4b+4b
                 | unroll n2 in [0:4) -> spatial-L3  | L3 adder: 6b+6b+6b+6b
                 | unroll n1 in [0:4) -> spatial-L2  | Temporal shift-adder: 8/10/10/12b+12b
In Figure 4.9, BG-level operations that need to be shifted by the same amount are grouped in the same L2 unit.
For BS designs, the BGs are unrolled temporally rather than spatially. This
means that the adders and shifters are decoupled, and the shifting operation is
performed over multiple clock cycles, depending on the precision. Additional
circuitry is required to ensure correct functionality of BS designs, such as timer
logic for scheduling and internal registers. It’s worth noting that by default, only
BS designs include internal registers between array levels. In the BS example
of Figure 4.9, the internal accumulation register is located at L3, i.e., BS-L3.
To gain a better understanding of the internal architecture of BS array designs
and how the BG-level scheduling is handled, refer to Figure 4.10. Assuming
L2 units are OS, then each L2 would produce one 8b result (16 2b×2b results
sum together). The scheduling is done in two phases. First, the weights are
stationary while the inputs are shifted right by 2b each clock cycle (Phase 1).
During that time, the intermediate results are stored in the first internal register.
After all the input bits are depleted, the weights are shifted right by 2b (i.e.,
the BG precision), the data stored in the first internal register is transferred
to the second internal register (Phase 2), and then Phase 1 is repeated again.
This cycle repeats itself until all the weight bits are depleted, then finally the
accumulator at the end of the array (not shown in Figure 4.10) gets activated,
and accumulates the full product.
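The arithmetic behind this two-phase schedule can be sketched as follows (a behavioural Python model with my own naming and unsigned operands; it mimics the phase structure, not the exact register-transfer behaviour of Figure 4.10). The inner loop plays the role of Phase 1, the outer loop that of Phase 2, and the two accumulation variables stand in for the two internal registers:

```python
import random

BG = 2        # bit-group width
N_BG = 4      # 8b operands -> 4 bit groups
N_L1 = 16     # one OS L2 unit sums 16 2b x 2b products per cycle

def bit_group(x: int, g: int) -> int:
    return (x >> (BG * g)) & ((1 << BG) - 1)

def bs_l2_dot_product(inputs, weights):
    reg2 = 0                                             # second internal register
    for bw in range(N_BG):                               # Phase 2: next weight bit group
        reg1 = 0                                         # first internal register
        for bi in range(N_BG):                           # Phase 1: stream input bit groups
            # One OS L2 result: sum of 16 2b x 2b products (fits in 8b).
            partial = sum(bit_group(i, bi) * bit_group(w, bw)
                          for i, w in zip(inputs, weights))
            reg1 += partial << (BG * bi)                 # shift-add over input bit groups
        reg2 += reg1 << (BG * bw)                        # shift-add over weight bit groups
    return reg2                                          # full-precision dot product

I = [random.randrange(256) for _ in range(N_L1)]
W = [random.randrange(256) for _ in range(N_L1)]
assert bs_l2_dot_product(I, W) == sum(i * w for i, w in zip(I, W))
```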
As you may have noticed in Figure 4.9, BS designs reduce the complexity of the
L2 and L3 adder trees, as they don’t require configurable shifters anymore, and
they can mostly amortize the shift-add logic overhead. On the other hand, BS
designs have their own limitations. One is that they have to find enough other
for-loops (other than BG loops) in the algorithm that can be spatially unrolled
on the array to ensure high hardware utilization. Secondly, BS designs require
additional scheduling control logic and internal circuitry for each partial output.
It is hence beneficial for such designs to reduce the number of outputs generated, especially at the lower levels of the array. With that in mind, lots of inefficient designs can already be pruned out of the full design space.
Based on the taxonomy introduced in Section 4.3, the combination of the design
parameters would lead to a huge variety of array configurations. As such, some
constraints had to be put in place to reduce the design space to make it more
manageable, while still maintaining the interesting exploration regions. Based on the observations of the previous subsections, the list below summarizes the constraints that were imposed on the different parameters, together with their justification. The constrained design space is also summarized in Table 4.2.

[Figure 4.10 (residue): internal-register organization and two-phase bit-serial scheduling for (a) Input Sharing (IS) at L3, (b) Hybrid Sharing (HS) at L3 (also BS-L3), and (c) Output Sharing (OS) at L3 (also BS-L3), together with the weight/input bit-group timelines for the 8b×8b, 8b×4b, 8b×2b, 4b×4b, and 2b×2b modes. Per 4×4 L3 array: IS @ L3 consumes W0..63 / I0..63 and produces O0..15; HS @ L3 consumes W0..255 / I0..63 and produces O0..3; OS @ L3 consumes W0..255 / I0..255 and produces O0.]
It’s worth noting that, given the design constraints, UNPU [94] is the only
accelerator in the SotA mapping that is not supported by our uniform PSMA
template, given that it’s a BS design with L2 being IS (BS-L1).
[Figure 4.11 (residue): required output precision after each level for an example PSMA (L4: IS / L3: OS / FU / BG: L2 / L2: OS), at 8b×8b and 2b×2b. For instance, at 8b×8b the OS L2 unit produces one 16b result (one 8b×8b product), while at 2b×2b it produces an 8b result (the sum of sixteen 2b×2b products).]
To ensure proper timing of all designs assessed using the uniform PSMA
template, we include input and output registers at the periphery of the array
template. Only the BS designs do include additional internal registers within
the array to assist the periodic shift-add process. Depending on the array
L4/L3/L2 configurations (namely, the IS/HS/OS configuration as well as the
BG unrolling, as discussed in Section 4.3), each design requires a different
maximum input/output bandwidth per clock cycle. As a result, the required
amount of registers at the array’s inputs/output varies. IS designs typically
require fewer input registers, at the expense of more output registers. The
opposite is true for OS schemes.
Practically, each OS / IS sharing dimension along each hierarchical level
L2/L3/L4 multiplies the required number of input / output register words
by 4 (since each level Ln is a 2-dimensional 4×4 Ln−1 array). It is hereby
important to note that the number of bits per input word is fixed (8b at full
precision), while the number of bits per output word depends both on the input
word precision and the maximal expected temporal accumulation time. Here,
the largest required bit width across all precision modes is computed (i.e., the worst-case scenario), with an extra 4b of headroom for temporal accumulation for each expected output (same as in Chapter 3). An example of how the final output register size is computed is given in Figure 4.11, in which, for an example PSMA architecture, the required output precision after each level (from L2 to L4, and further to the accumulator) is given and explained.
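A minimal sketch of this sizing rule (helper names mine, unsigned arithmetic assumed) reproduces the Figure 4.11 numbers for an OS L2 unit:

```python
import math

def product_bits(i_prec, w_prec):
    # A full i_prec x w_prec product occupies i_prec + w_prec bits.
    return i_prec + w_prec

def os_sum_bits(term_bits, num_terms):
    # Spatially summing num_terms terms adds ceil(log2(num_terms)) carry bits.
    return term_bits + math.ceil(math.log2(num_terms))

assert product_bits(8, 8) == 16                  # 8b x 8b: one full product per L2 unit
assert os_sum_bits(product_bits(2, 2), 16) == 8  # 2b x 2b: sum of sixteen 2b x 2b products

def output_register_bits(bits_per_precision_mode, headroom=4):
    # Worst case over all precision modes, plus 4b headroom for temporal accumulation.
    return max(bits_per_precision_mode) + headroom
```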
Finally, in BS designs, two-stage accumulation registers are needed. Since the design space has been constrained in Subsection 4.4.2 to only include BS-L2, the BS register layout is always the same. As shown in Figure 4.10(a), the L2 units in BS designs always produce 8b results (across all precision modes), as each accumulates 16 2b×2b partial products. At full precision, the internal registers clear their stored value every 4 iterations, and each register's stored value needs to be shifted right by 2 bits three times; therefore, a 6b headroom is needed for each BS accumulation register, visualized as "000000" in the register logic of Figure 4.10(a).
4.5 Experiments
In this section, each design in the design space is evaluated in terms of energy
per operation and area, both in a low frequency (200 MHz) and high frequency
(1 GHz) context. Additionally, breakdowns are performed to gain more insights
into the trade-off between hardware components of systems with different
design-time and run-time configurations.
4.5.1 Methodology
4.5.2 Workload
* Workload that guarantees full hardware utilization for all designs across all precisions.
4.5.3 Throughput
All simulations execute an optimized workload for which all designs have a spatial utilization of 100% at full precision. Hence, both FU and SWU designs have the same throughput (number of operations per clock cycle) at full precision. As we scale down to lower precisions, the FU designs still utilize the full hardware, while SWU designs gate part of the PSMA in order to maintain a fixed input bandwidth across all precisions. As a result, SWU has a reduced throughput compared to FU, specifically half and a quarter of the FU throughput at 4b and 2b precision, respectively.
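Expressed as a trivial helper (name mine), the SWU-to-FU throughput ratio is simply the precision ratio:

```python
def swu_to_fu_throughput_ratio(precision, full_precision=8):
    # SWU keeps the input bandwidth fixed by gating L1 units, losing a factor
    # full_precision/precision in operations per cycle relative to FU.
    return precision / full_precision

assert swu_to_fu_throughput_ratio(8) == 1.0    # identical at full precision
assert swu_to_fu_throughput_ratio(4) == 0.5    # half the FU throughput at 4b
assert swu_to_fu_throughput_ratio(2) == 0.25   # a quarter of the FU throughput at 2b
```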
All supported designs are compared in terms of energy per operation, with one
operation defined as one full multiplication or one addition, i.e., one full MAC
operation is defined as two operations. The resulting energy-efficiency heatmaps are shown in Figure 4.12 (200 MHz) and Figure 4.13 (1 GHz). Additionally,
the designs are compared with symmetric (8b×8b, 4b×4b, 2b×2b) and one
asymmetric (8b×4b) precisions. Each row denotes an L4/L3 mode, while each
column denotes a configuration/BG/L2 combination. Note that SWU designs
do not support asymmetric precision scalability.
At 200 MHz (Figure 4.12), the most energy-efficient columns across different
precisions are the (FU / BG: L3 / L2: OS), (FU / BG: BS / L2: OS), and (SWU
/ BG: L2 / L2: OS) configurations. At full precision, the (SWU / BG: L2 / L2:
OS) designs are more efficient. This can be attributed to the simpler design and
non-configurable shifters, while still maintaining the same throughput as FU.
At lower precisions, SWU designs perform fewer operations and thus become more in line with the FU designs. Intuitively, one can understand that (FU / BG:
L3) and (FU / BG: BS) designs should be more efficient than (FU / BG: L2)
designs, due to the fact that the shifter’s overhead is shared across the L2 units
in this case. BS designs have an additional switching overhead of the internal
registers, and since here the frequency is relatively low, there is not much gain
in return.
The key factor for good energy efficiency is to make L2 OS, as it produces only
one partial product regardless of precision, thus simplifying L3 and L4 designs.
At 200 MHz, L3 and L4 unrollings don’t have as much impact as long as L2 is
OS. The best L4/L3 configurations in that case are IS/IS, IS/OS, OS/IS. If
both L4 and L3 swing towards the OS side, the critical path gets longer due
to the adder trees, and as a result the energy efficiency is negatively impacted,
though not by much since this is still at a low clock frequency.
Figure 4.13 shows a similar trend at 1 GHz clock frequency, with some small
differences. (L2: OS) is still a key factor for good energy efficiency, and (BG:
L3) and (BG: BS) are still more efficient than (BG: L2). However, (BG: BS)
has a slight edge over (BG: L3) due to the reduced critical path, as BS designs’
internal registers start to show their benefit at higher clock frequencies. In
line with the 200 MHz results, the best L4/L3 configurations are IS/IS, IS/OS,
OS/IS. Having all level unrollings as OS yields a detrimental effect on the
energy efficiency, as the critical path gets longer, conflicting with the tight
timing requirement.
The stellar performance of BS designs in this benchmark is in stark contrast to the experimental results of Chapter 3, which showed BS MAC units as the worst performer. The primary reason for the improved
energy efficiency of BS designs is that, in this chapter, we assume “L2 is OS” for
all BS designs in the PSMA template. As such, the hardware overhead of the
internal registers and shift-add logic is amortized across the L1 units within each
L2 unit (i.e., BS-L2), which ultimately reduces the energy/operation. In contrast, in the previous chapter the study was done at the single-MAC-unit level, and thus the amortization opportunity was ignored (i.e., BS-L1). This result highlights the
importance of hardware resource sharing in array-level BS precision-scalable
MAC engine designs.
[Figure 4.12: Energy per Operation (fJ) at 200 MHz. Rows: L4/L3 modes; columns: Config/BG/L2 combinations; panels: 8b×8b, 8b×4b, 4b×4b, and 2b×2b (the asymmetric 8b×4b panel marks the SWU columns as not supported).]
To get a better assessment of the most efficient array, the benchmarked designs
are compared in terms of energy per operation vs. area. Figures 4.14-4.15 show
scatter plots of energy/operation vs. area for 200 MHz and 1 GHz, at different
symmetric and asymmetric precisions. Each colour represents a different (Config / BG / L2) combination, while each shape represents a different (L4 / L3) mode.

[Figure 4.13: Energy per Operation (fJ) at 1 GHz. Rows: L4/L3 modes; columns: Config/BG/L2 combinations; panels: 8b×8b, 8b×4b, 4b×4b, and 2b×2b.]
The most optimal designs lie in the bottom-left corner.
At 200 MHz, the optimal designs are the purple and orange ones, which represent
(SWU / L2: OS) and (FU / BG: L3 / L2: OS). At 1 GHz, the cyan-coloured
markers (BS) have better energy efficiency, while the purple and orange ones
have smaller areas. As will be shown later, this can be attributed to the lack of
internal registers in the (SWU / L2: OS) and (FU / BG: L3 / L2: OS) designs
compared to BS designs. In most cases, the blue markers (FU / BG: L2 / L2:
IS) are on the upper right corner of the graph, which makes them one of the
least efficient configurations. This can be attributed to the fact that having IS
at the lowest level of the array produces lots of independent partial products
at lower precisions, which cannot be added together. Generally, it can be seen
that the top left and bottom right corners are mostly empty, which indicates
that in most cases, there’s a strong correlation between area and energy per
operation in this benchmark.
Figure 4.14: Energy/Op (fJ) vs. Area (mm²) at 200 MHz. [Scatter plots for 8b×8b, 8b×4b, 4b×4b, and 2b×2b; colours denote Config/BG/L2 combinations, shapes denote L4/L3 modes, and SotA-like design points (DNPU, BitFusion, BitBlade, Loom, Envision, ST) are annotated. The preferred corner is bottom-left.]
Figure 4.15: Energy/Op (fJ) vs. Area (mm²) at 1 GHz. [Scatter plots for 8b×8b, 8b×4b, 4b×4b, and 2b×2b; colours denote Config/BG/L2 combinations, shapes denote L4/L3 modes, and SotA-like design points (DNPU, BitFusion, BitBlade, Loom, Envision, ST) are annotated.]
4.5.6 Breakdown
To gain more insights about why some array configurations perform better
than others, breakdowns of the energy per operation and area for all PSMAs are summarized in Figures 4.17-4.19 for 200 MHz and 1 GHz. Figure 4.16 visualizes the legend of Figures 4.17-4.18, showing that the energy breakdowns
include the contribution of L1 multipliers, adder trees (in L2, L3, and L4), final
accumulation adders, and registers (input, output, and BS internal).
The L1 2b multipliers' energy consumption is nearly identical in all cases, and only contributes a small fraction of the total energy consumption. L2 adder trees consume more energy if BG is unrolled at L2, since in this case the adder tree includes both adders and shifters. Also, when BG is unrolled at L2, the L2 adder tree energy of SWU designs is a bit lower than that of FU designs, thanks to the non-configurable shifters. The complexity of the L2 adder trees goes down significantly when moving to BG unrolled at L3 or to BS designs. Additionally,
note that if L3 or L4 are IS (the first bar in each group of 9 bars), it means that
none of the partial results from the previous level gets added together, thus
there would be no L3/L4 adder trees in this case.
Output registers and accumulation adders dominate energy consumption as
more levels become IS, while the input register’s energy is much lower, and vice
versa. Internal registers (denoted as pipeline registers in the plots) only exist in BS designs, and they consume a sizeable share of the total energy consumption of BS designs. At 1 GHz, the arrays where most levels are OS yield higher energy consumption than other designs. The reason is that OS arrays
have long critical paths, since the partial results have to go through complex
adder trees in most of the levels. To meet the timing constraint, the energy
consumption gets negatively impacted.
Area breakdowns are shown in Figure 4.19. As shown in the figures, the area is
dominated by the combinational logic, with the sequential logic only having a
small contribution to the total area. Additionally, all designs where L2 is OS
have a lower area than their “L2: IS" and “L2: HS" counterparts. That is because, as L2 generates more partial products, the L3 and L4 adder trees become more complicated in order to keep the L2 partial results from being added together. In general,
a larger area is observed as expected for 1 GHz frequency, and this can be
attributed to larger cell sizes in order to meet the timing constraints, especially
for arrays where most levels are OS. For the BS designs and the designs where most levels are IS, in contrast, the area increase when going from 200 MHz to 1 GHz is not significant.
Figure 4.16: Legend illustration for Figure 4.17 and Figure 4.18. Input Registers contain input activations and
weights.
Figure 4.17: Energy efficiency at 200 MHz: Energy/Op (fJ) breakdowns for (a) 8b×8b, (b) 8b×4b, (c) 4b×4b, and (d) 2b×2b, per Config/BG/L2 group. The 9 parallel bars within each configuration group correspond to the different L4/L3 settings, listed in Figure 4.16.
Figure 4.18: Energy efficiency at 1 GHz: Energy/Op (fJ) breakdowns for (a) 8b×8b, (b) 8b×4b, (c) 4b×4b, and (d) 2b×2b, per Config/BG/L2 group. The 9 parallel bars within each configuration group correspond to the different L4/L3 settings, listed in Figure 4.16.
[Figure 4.19: Area breakdowns into combinational and sequential logic for all configurations, at 200 MHz and 1 GHz.]
4.6 Conclusions
The proposed uniform PSMA template covers a large subset of the design space spanned by the new taxonomy, and supports symmetric and asymmetric precision scaling of inputs and weights.
All the architectures in the constrained design space are synthesized under a
commercial 28 nm technology, and benchmarked extensively in terms of energy
efficiency and area. From the conducted benchmarks, it is shown that BG unrolling at L2 is the least ideal case (e.g., BitFusion [143], DNPU [146]); it is better to have BG unrolling at L3 for lower frequencies (e.g., BitBlade [133], Ghodrati [51]) or to unroll BGs temporally for higher frequencies (e.g., Loom [142], Stripes [82]). The benefit of having BG unrolled at L3 or temporally is
that the (configurable) shifters are amortized across different L2 units, making
the adder trees less complex. It’s generally a good idea to have a mixture of
IS/WS and OS loops throughout the array levels, and from the energy vs. area
study, we can comprehend that FU designs are better suited for workloads
where lower precisions are common, whereas SWU designs are better suited
for higher-precision-dominated workloads. This observation is aligned with the
precision ratio study results in Chapter 3.
The conclusion that is distinct from the last chapter, is the behavior of the BS
designs. In Chapter 3, we observed that BS designs were rather inadequate.
However, in this context, they exhibit favorable qualities. The reason is that,
in this chapter for BS designs (BS-L2), the hardware overhead of the internal
registers and shift-add logic is shared across the 16 L1 units within each L2 unit
(enabled by L2: OS), which ultimately reduces the energy/operation. In Chapter 3, by contrast, each BS MAC unit can be seen as an L1 unit equipped with its own registers and shift-add logic (BS-L1). The same array-level benefits can be
witnessed for spatial designs when transitioning from BG: L2 (without scalability
overhead amortization) to BG: L3 (with).
To summarize the results from the exploration, (L2: OS) is the key factor
for energy efficient PSMAs as it facilitates amortization for both spatial and
temporal designs. At 200 MHz, (BG: L3) is slightly better than (BG: BS) as
the internal registers of BS impose an overhead with no extra benefit at lower
frequencies. At 1 GHz, BS designs have a slight edge over (BG: L3) in terms of
energy efficiency, while (BG: L3) is better in terms of area. The good L4/L3 unrollings at both frequencies are IS/IS (Loom-like [142]), IS/OS (BitBlade-like [133]), and OS/IS. The best-performing SWU design across 200 MHz and 1 GHz
is (L4: IS, L3: OS, L2: OS), which is an ST-like architecture [107].
In the end, there are two things to be noted. Firstly, the best PSMAs were
selected based on an ideal workload in this chapter’s benchmarks, ensuring
full hardware utilization across all precision modes and designs. However, real-
world workloads may vary, impacting the utilization and performance of the
benchmarked designs. Analyzing the effects of different non-ideal workloads on
various PSMAs is an area for future research. Secondly, all benchmark results in
this study were obtained under a relatively old technology node (28 nm), which
could yield significantly different energy profiles compared to newer nodes (e.g.,
7 nm and 5 nm). As a result, the performance landscapes of different PSMA
designs might shift when transitioning to more advanced technology nodes.
For instance, BS designs, with their high proportion of wires and registers (due to clocking), might lose their advantage, as wires and registers account for an increasing share of the energy consumption in advanced technology nodes.
Chapter 3 and Chapter 4 have depicted the design spaces of DNN accelerator’s
MAC unit and MAC array for variable-precision execution. They thoroughly
benchmarked different design options by first implementing these circuits in
HDL, then validating their functionality, and finally synthesizing them to get
accurate performance estimation. As can be imagined, the whole flow requires
a lot of manual effort and is very time-consuming. Although this is manageable for benchmarking Chapter 3's 19 MAC-unit-level designs and Chapter 4's 72 MAC-array-level designs (both assuming an ideal mapping scenario, i.e., without mapping optimization), it becomes infeasible for exploring hundreds or thousands of different DNN accelerator architectures while also optimizing the mapping/scheduling from millions of options.
To deal with the large design space at the complete DNN accelerator level,
Chapter 5 will build a general and fast architecture and mapping DSE framework,
ZigZag, based on an analytical cost model. Thanks to the deterministic
computing pattern of DL models, the built-in analytical cost models enable
ZigZag to estimate energy and latency breakdown of processing a DNN layer
on a customized accelerator in milliseconds, paving the way toward fast
architecture/mapping search and optimization.
Chapter 5

ZigZag: Enabling Fast DNN Accelerator-Mapping Design Space Exploration through Analytical Modeling
From this chapter on, we move the abstraction level up from components in a
DNN accelerator to the complete accelerator architecture. Factors not considered in the previous two chapters, such as the non-ideality of the actual DNN workload (e.g., the mismatch between workload sizes and hardware dimensions), the data movement in the memory hierarchy, and mapping/scheduling optimizations, are now taken into account.
However, this also greatly increases the design space complexity. So, in order to deal with the vast architectural design space and countless mapping and
scheduling possibilities, a series of high-level DNN accelerator DSE frameworks
are developed, targeting all five principles proposed in Chapter 1.2.2: fast,
accurate, general, adaptable, and intelligent.
This chapter thoroughly explains the first, fundamental framework in this series: ZigZag, which performs high-level DSE for single-core DNN accelerators, allowing single-layer spatial and temporal mapping optimization based on an analytical cost model.

This chapter is based on publication [109], and contains large fractions of it. The author's contributions include (but are not limited to) the design space identification, design point representation, cost model, case studies, framework implementation, and paper writing. ZigZag is open source at https://github.com/KULeuven-MICAS/zigzag.
• Section 5.2 provides a clear view into several SotA DSE frameworks and
highlights the uniqueness of ZigZag.
• Section 5.3 gives an overview of the ZigZag framework, its major
components and various utilities.
• Section 5.5 builds the Loop Relevance Principle on top of this represen-
tation, enabling the framework to extract, in a systematic and insightful
way, the key information (such as memory accesses, required memory
bandwidth, etc.), from which the Hardware Cost Estimator derives the
system’s energy and performance values.
• Section 5.6 designs Mapping Search Engines to cope with the enlarged
mapping space (both even and uneven mappings). With heuristics and
iterative search strategies, these engines rapidly locate optimal (on energy
or/and latency) spatial and temporal mapping points, supporting mapping
on a wider variety of memory hierarchies and MAC array topologies.
Secondly, most of the SotA frameworks only support even mappings, which usually lead to sub-optimality, as will be shown in the results. Even and uneven mapping will be discussed in detail in Section 5.4 of this chapter.
Thirdly, each DSE tool typically encompasses a mapping search engine, a.k.a.
auto-scheduler or mapper, to find the optimal temporal/spatial mappings for
deploying a certain neural network layer onto a specific accelerator architecture.
Most of the DSE frameworks perform a constraint-driven search to narrow
down the space and speed up the search procedure, like dMazeRunner [39];
some formulate the scheduling process into integer constraint problems and
utilize existing optimizers to solve it, like Dory [17]; the others use a partially-
predefined mapping as a hint to generate valid mappings, like MAESTRO [91].
Commonly used scheduling constraints and strategies include setting thresholds
for memory/PE array utilization and data reuse factors, putting an optimization
goal such as minimizing the DRAM access or the overall memory traffic, and
random sampling to avoid being trapped in a local optimum.
Finally, the last column in Table 5.1 lists the hardware cost estimation
approach adopted by each framework, in which three main categories can
be identified: 1) slow but very accurate cost estimations based on High-Level
Synthesis (HLS) [166], 2) medium-speed and accurate cycle-accurate system
simulators [178], and 3) fast and relatively accurate analytical models. Moreover,
there are different granularity levels for analytical models [177]. Fine-grained models, like the one embedded in ZigZag, are more accurate than most of
the other coarse-grained models, by distinguishing memory write cost from
read cost, considering the memory word length’s impact on access energy (e.g.,
same-sized memories with different aspect ratio/IO bandwidth have different
data access energy), and taking data fetching pattern/stationarity into account
in the memory access cost and unit MAC cost.
Back to the five DSE framework principles to conclude on ZigZag’s uniqueness:
1) the fully flexible hardware design space and even/uneven mapping space
contribute to ZigZag’s generality; 2) the analytical model-based hardware cost
estimation guarantees its fast speed; 3) the delicate calculus behind the analytical
model captures the intrinsic behavior of for-loop operations, ensuring the accuracy of the model (shown in later sections and also in the next chapter); 4) the unified design point representation facilitates easy adaptation between different design options; 5) heuristic-/iterative-based auto-mapping search strategies pave
the way towards the intelligent DSE of DNN accelerator.
[Figure/table residue: the 7 loop dimensions of common DNN workloads.]

  Workload                | B (batch, I/O) | K (O chan.) | C (I/W chan.) | OY (O row) | OX (O col.) | FY (W row) | FX (W col.)
  Conv 2D (below fig.)    | B | K | C | OY | OX | FY | FX
  Conv 1D                 | B | K | C | 1  | OX | 1  | FX
  Depthwise Conv 2D*      | B | 1 | 1 | OY | OX | FY | FX
  Pointwise Conv 2D       | B | K | C | OY | OX | 1  | 1
  Matrix-Vector Multi.    | 1 | K | C | 1  | 1  | 1  | 1
  Matrix-Matrix Multi.    | B | K | C | 1  | 1  | 1  | 1
  * Repeated Group (G) times, not shown here.
Figure 5.3: Using the Memory-Centric Design Space Representation to distinguish between balanced and unbalanced
memory hierarchy, even and uneven mapping. The memory unrolling is not depicted in the left two memory hierarchy
sketches for clarity.
Here we use Conv 2D layer to explain the Loop Relevance Principle. As shown
in Figure 5.2, a Conv 2D layer is based on a 7D computing space (i.e., 7 nested
for-loops) with three 4D operands (i.e., Weight, Input, and Output are all 4D
tensor), which implies not all 7 loop dimensions are relevant to each operand.
Figure 5.4 shows the Loop Relevance Principle foundation, in which all 7 loop
dimensions are categorized as relevant (r), irrelevant (ir), or partially relevant
(pr) to each operand. Looping through the r loops indicates that new data needs to be fetched (for W and I) or generated (for O), while looping
and Output, this is straightforward since all 7 computing space dimensions are
either parallel (relevant) or orthogonal (irrelevant) to their own 4D data space.
Figure 5.4: Loop relevance of the 7 Conv 2D loop dimensions to each operand (✓ = relevant (r), ✕ = irrelevant (ir), ?_IX / ?_IY = partially relevant (pr) through IX / IY).

       B    K    C    OY    OX    FY    FX
  W    ✕    ✓    ✓    ✕     ✕     ✓     ✓
  I    ✓    ✕    ✓    ?_IY  ?_IX  ?_IY  ?_IX
  O    ✓    ✓    ✕    ✓     ✓     ✕     ✕
Input, however, has pr loops besides the r and ir loops. As presented in the
Conv 2D example of Figure 5.2, Input’s dimensions IX and IY do not show up
in the convolution formula directly, instead, they are indirectly present through
OX and FX (for IX); OY and FY (for IY). As such, OX, FX, OY, FY are denoted as
partially relevant (pr) loops for Input. OX, FX (resp. OY, FY) form a pr loop
pair. The data reuse opportunities for Input that come from a pr loop pair are less straightforward and are explained below:
The complete relation between IX and its pr loop pair OX, FX is IX = SX ·
(OX − 1) + SF X · (F X − 1) + 1, in which SX is the stride on input feature
map IX dimension; SFX is the stride in filter FX dimension (dilated convolution).
Input data reuse happens when the value of IX remains constant while the
computation is looping through its data space, spatially or temporally. A simple
case is that when SX and SFX are 1, the equation becomes IX = OX + F X − 1,
in which case, for a pr loop pair, data reuse opportunities arise when the sum
of their indices (OX+FX) remains constant. ZigZag can analyze various Input
data reuse cases considering stride and filter dilation. For clarity, the rest of
the chapter will mainly focus on the simple case.
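A one-line helper (name mine) makes this relation explicit:

```python
def input_dim(ox: int, fx: int, sx: int = 1, sfx: int = 1) -> int:
    # IX = SX * (OX - 1) + SFX * (FX - 1) + 1
    return sx * (ox - 1) + sfx * (fx - 1) + 1

# With unit stride and no dilation this reduces to IX = OX + FX - 1:
assert input_dim(ox=26, fx=5) == 30
```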
Figure 5.5: pr-loop patterns that trigger special Input data reuse.
The benefit of the Loop Relevance Principle is the simplification and unification
of the procedure for extracting key information from the W/I/O mapping loop
sets towards estimating system energy and performance. To show the key ideas
of this procedure, a summary of the major equations is provided in Table 5.2 and
a detailed demonstration is given in Figure 5.6, in which the Output loop set of
Figure 5.3(d) is analyzed (similar procedure would be repeated for Weight/Input,
not shown). An in-depth discussion of each metric in Table 5.2 is provided
below. Note that the word 'level' in this section refers to the architectural level introduced in Section 5.4, unless explicitly mentioned otherwise.
1.) Data Size in the individual memory unit at current level can be derived by
multiplying together the dimensionality of all the r loops at the current level
and all levels below, together with all ru loops (spatially unrolled r loop) at
all levels below. This can be seen in the first equation of Table 5.2, in which
Li means the current architectural level, L(i − 1) means one level below, and
Lmin means the lowest architectural level, usually the MAC level.
Let us apply this to a specific example given in Figure 5.6. The required Output
data storage inside the register file of each PE (16) is calculated by multiplying
the dimensionality of Level-1 r loops (the ‘K 8’ and ‘OX 2’ loops with resp.
loop index 1, 4); the Data Size of the Output inside of Global Buffer (5408) is
calculated by multiplying the dimensionality of Level-1&2 r loops (loop index 1,
4, 7) and Level-1 ru loops (loop index 6.2, 6.3). Note that in the given example,
no loop is assigned to the MAC level, thus we see the MAC level (Lmin, or L-0 in Figure 5.6) r and ir loop dimensionalities both as 1. Later, for the other metrics' calculations, readers can always refer to the practical case in Figure 5.6.

[Table 5.2 (residue): summary of the key extraction equations per architectural level, e.g., Total Active Unit Count @ Level i = $\prod_{L_i}^{L_{max}} ru \cdot \prod_{L_i}^{L_{max}} iru$, and (read) Access Count @ Level i (↔ Level i+1) = Total MAC Op / $\prod_{L_{min}}^{L_i}$ (Total Data Reuse Factor).]
2.) Data Size in total at current level can be easily calculated by multiplying
the individual Data Size in each memory unit with the dimensionality of all ru
loops at the current level. Notice that the unit of the Data Size is the number
of data elements. In order to obtain the number of bits, the precision of each
operand needs to be considered. Generally speaking, partial outputs have a
higher data precision than weights, inputs, and final outputs.
The ability to distinguish partial outputs from final outputs is critical for
accurate hardware cost estimation. ZigZag can easily handle this through its
r vs. ir loop representation. The final output is generated at the level of the
uppermost ir loop, e.g., the ‘C 12’ loop (index 8) in Figure 5.6, which in this
example makes the Level-2 Global Buffer the watershed for (higher precision)
partial and (lower precision) final output: the output data traffic between Level-
1 and Level-2 is bidirectional with partial output data precision (except for few
final iterations that final outputs are generated), whereas between Level-2 and
Level-3, it is unidirectional with final output data precision.
3.) The number of MAC Operation supported by the current level data size is
calculated by multiplying together all the loops’ dimensionality (r, ir, ru, and
iru) from the lowest architectural level up to the current level.
4.) Turnaround Cycles are the number of cycles a certain memory level can keep operating with the data it contains, which is an important metric for computing the required memory bandwidth. It can be calculated by multiplying together
all the temporal loops’ dimensionality (r and ir) from the lowest level up to
the current level.
5.) The Total Data Reuse Factor at the current level indicates how many times each data element, after it reaches the current level and before it is overwritten by other data, gets reused to support different computations. It can be calculated by
multiplying all the irrelevant loops’ dimensionality (ir and iru) at the current
level. The product of only ir loops is the temporal data reuse factor, while the
product of only iru loops is the spatial data reuse factor.
6.) Total Active Unit Count is a metric that captures how many hardware
components are active at a certain level, which is only related to spatially
unrolled loops. It can be computed as the product of all the spatial loops’
dimensionality (ru and iru) from the current level up to the highest level
(usually the DRAM level), and it is an important metric for computing MAC
array spatial utilization.
7.) Memory Access Count, as the core metric for later memory energy estimation,
can be extracted by dividing the total MAC operation of the neural network layer
by the current-and-below levels’ total data reuse factor. Figure 5.7 visualizes the
individual loop’s impact on the memory access count of the case of Figure 5.3(d)
based on this approach. The circle markers indicate the boundary of the memory
levels, showing the actual number of memory accesses at each memory level
for each operand. The (1)(2)(3) points marked in Figure 5.7 (in both left and
right subfigures) correspond to the (1)(2)(3) data access arrow locations in
Figure 5.6.
8.) The required memory bandwidth is the minimum bandwidth that ensures the computation proceeds without stalling. It depends on both the mapping and the
memory settings (single-port/dual-port, with/without double buffering, etc.).
Without double-buffering, writing only happens after a specific data item is
fully used, resulting in a small time window. With double buffering, writing
can happen all the time (in parallel with data loading), leaving a large writing
time window, and thus lowering the required instantaneous memory bandwidth. The bandwidth difference between these two cases is the product of all the top ir loop values at each memory level.

Figure 5.6: A demonstration: extracting loop information from the Output loop set of Figure 5.3(d) based on the Loop Relevance Principle.

Mapping information (AlexNet layer 2): K = 256; C = 48; OX = 26; OY = 26; FX = 5; FY = 5.
  Total MAC Op:       K · C · OX · OY · FX · FY = 207,667,200
  Ideal total cycles: 207,667,200 / 130 active MAC units = 1,597,440
  Output size:        K · OX · OY = 173,056
  Output reuse:       C · FX · FY = 1,200

Output loop set (loop index: loop, relevance), grouped per architectural level:
  L-0 (MAC):               0: MAC (r)
  Level-1 (inner-PE RF):   1: K 8 (r), 2: C 2 (ir), 3: FX 5 (ir), 4: OX 2 (r), 5: C 2 (ir),
                           6.1: FYu 5 (iru), 6.2: OYu 13 (ru), 6.3: OYu 2 (ru)
  Level-2 (Global Buffer): 7: OX 13 (r), 8: C 12 (ir)
  Level-3 (DRAM):          9: K 32 (r)

Extracted Output metrics (MAC / Level-1 / Level-2 / Level-3):
  Data size (elements):   1 / 16 per PE (416 over all PEs) / 5,408 / 173,056
  MAC operations:         1 / 1,600 per PE (41,600 over all PEs) / 6,489,600 / 207,667,200
  Data reuse factor:      1 / 100 (spatial 5 × temporal 20) / 12 (temporal) / 1
  Write access count:     207,667,200 (Total MAC Op) → ÷100 → 2,076,672 → ÷12 → 173,056 (Output size)
  Partial-sum read count: 207,494,144 (Total MAC Op − Output size) → 1,903,616 → 0
  Turnaround cycles:      1 / 320 / 49,920 / 1,597,440
  Required average memory bandwidth (elements/cc): partial outputs flow bidirectionally, final
  outputs unidirectionally; 16/320 and 416/320 at the Level-1 boundaries, 5,408/49,920 towards DRAM.

[Figure 5.7 (residue): data access counts (up to ~10^8) per operand (W, I, partial O, final O) across the loop indices of Figure 5.3(d); circle markers indicate the memory-level boundaries.]
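The procedure of this section can be sketched in a few lines of Python (data structures and function names are mine, not ZigZag's API); it reproduces several of the Output numbers of Figure 5.6 directly from the loop set and its r/ir/ru/iru labels:

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Loop:
    dim: str      # e.g. "K", "C", "OX"
    size: int
    rel: str      # relevance to the operand: "r", "ir", "ru", "iru"
    level: int    # architectural level: 1 = inner-PE RF, 2 = GB, 3 = DRAM

# Output loop set of Figure 5.6 (AlexNet layer 2).
loops = [
    Loop("K", 8, "r", 1), Loop("C", 2, "ir", 1), Loop("FX", 5, "ir", 1),
    Loop("OX", 2, "r", 1), Loop("C", 2, "ir", 1),
    Loop("FY", 5, "iru", 1), Loop("OY", 13, "ru", 1), Loop("OY", 2, "ru", 1),
    Loop("OX", 13, "r", 2), Loop("C", 12, "ir", 2),
    Loop("K", 32, "r", 3),
]

def data_size(level):
    # r loops at the current level and below, plus ru loops strictly below.
    return prod(l.size for l in loops
                if (l.rel == "r" and l.level <= level)
                or (l.rel == "ru" and l.level < level))

def reuse_factor(level):
    # ir and iru loops at the current level.
    return prod(l.size for l in loops if l.level == level and l.rel in ("ir", "iru"))

def access_count(total_mac_op, level):
    # Total MAC count divided by the data reuse accumulated up to this level.
    return total_mac_op // prod(reuse_factor(lv) for lv in range(1, level + 1))

total_mac_op = 256 * 48 * 26 * 26 * 5 * 5            # 207,667,200
assert data_size(1) == 16 and data_size(2) == 5408
assert reuse_factor(1) == 100 and reuse_factor(2) == 12
assert access_count(total_mac_op, 1) == 2_076_672
assert access_count(total_mac_op, 2) == 173_056
```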
Note that due to the pr loops, some changes are needed for handling Input
correctly. The most important modification is the following two substitutions.
One is to correctly handle data size (assuming stride is 1):
$$\prod_{L_{min}}^{L_i} r \;\rightarrow\; \prod_{L_{min}}^{L_i} r \cdot \Bigl(\prod_{L_{min}}^{L_i} pr_1 + \prod_{L_{min}}^{L_i} pr_1' - 1\Bigr) \cdot \Bigl(\prod_{L_{min}}^{L_i} pr_2 + \prod_{L_{min}}^{L_i} pr_2' - 1\Bigr)$$
in which pr1 (pr2 ) and pr1′ (pr2′ ) are a pr loop pair, like OX and FX. Another
substitution is to correctly handle special Input data reuse cases like the "diagonal
multi-cast" and "FIFO Effect":
$$\text{Total Data Reuse Factor @ } L_i \;\rightarrow\; \frac{\text{Total MAC Op @ } L_i(+pr)}{\text{Total Data Size @ } L_i(+pr)}$$
For example, in the "FIFO Effect" setting with FX 3 above and OXu 4 below the architecture level boundary (the horizontal line, as in Figure 5.5), the lower-level Input data reuse factor should equal (4×3) MAC Op / (4+3−1) data = 2 instead of (4×1) MAC Op / (4+1−1) data = 1, by taking the "FIFO effect"-triggering pr loop (FX 3) into account. It means that every input data element fetched from the above level can theoretically be used to support two MAC operations in the below level (on average).
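The following minimal sketch (an illustration under the stride-1, single pr-pair assumption, not ZigZag's code) applies both substitutions to the "FIFO Effect" example above.

def input_data_size(r_product, pr1_product, pr1p_product):
    # r -> r * (pr1 + pr1' - 1), e.g. with OX and FX as the pr loop pair (stride 1)
    return r_product * (pr1_product + pr1p_product - 1)

def input_reuse_factor(total_mac_op, total_input_size):
    # Total data reuse factor @ level = Total MAC Op / Total Input data size, which
    # automatically captures "diagonal multi-cast" and the "FIFO Effect"
    return total_mac_op / total_input_size

mac_op = 4 * 3                            # MAC operations covered by OXu 4 and FX 3
data   = input_data_size(1, 4, 3)         # (4 + 3 - 1) = 6 input elements
print(input_reuse_factor(mac_op, data))   # 2.0: each fetched input serves two MACs below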
2.) Energy: MAC energy is split into the active MAC energy and the idle MAC energy. Both are estimated by first calculating the total active MAC count (determined by the workload) and the idle MAC count (determined by the PE array's under-utilization), and subsequently multiplying the corresponding number of MAC operations with the corresponding average single-MAC-operation energy (active and idle).
Memory access energy is calculated by multiplying the memory access count
(computed previously) with the corresponding memory per-data-access energy,
taking into account the memory size, the potential memory bitwidth mismatch
overhead, operand precision, and data stationarity.
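As a rough illustration of this energy decomposition (with hypothetical unit energies, not the calibrated values used by ZigZag):

def mac_energy(active_macs, idle_macs, e_active, e_idle):
    # total MAC energy = active part + idle part
    return active_macs * e_active + idle_macs * e_idle

def memory_energy(access_counts, e_per_access):
    # access_counts / e_per_access keyed by (memory level, operand), e.g. from step 7 above
    return sum(access_counts[key] * e_per_access[key] for key in access_counts)

# Hypothetical unit energies (arbitrary units), purely for illustration.
total = mac_energy(active_macs=207_667_200, idle_macs=10_000_000, e_active=1.0, e_idle=0.1) \
        + memory_energy({("GB", "O"): 2_076_672, ("DRAM", "O"): 173_056},
                        {("GB", "O"): 10.0, ("DRAM", "O"): 200.0})
print(total)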
3.) Latency/Throughput: PE array utilization, throughput, and latency are
tightly related and can be deduced from each other. A PE array’s under-
utilization comes from two sources: spatial under-utilization and temporal
under-utilization. Spatial under-utilization results from the mismatch between
the spatial mapping and MAC array size and interconnection. Temporal
under-utilization mainly comes from memory port contention and bandwidth
bottlenecks during computation.
As the latency model involves sophisticated calculations and requires a lengthy explanation, we defer it to the next chapter: Chapter 6 will thoroughly explain the latency modeling approach of ZigZag.
Besides estimating latency, another feature of ZigZag is its ability to detect
the minimal required memory size, named "effective memory size", for each
memory level when executing a network layer. The memory part that exceeds
the "effective memory size" can theoretically be gated to save power.
Spatial mapping defines the parallel operations and operands’ data flow in
the spatial dimensions across the PE array. Depending on the mapping, a
considerable amount of data reuse of each operand can be obtained spatially,
which results in a reduced number of accesses to the memory level outside the
array. In order to efficiently explore the spatial mapping space, three search
methods have been developed: exhaustive search, heuristic search v1 based on
data reuse symmetry pruning, and heuristic search v2 based on the maximum
data reuse Pareto surface.
The exhaustive search generates all valid spatial mappings above a user-defined
spatial utilization threshold, based on the layer dimensions and PE array size.
It is capable of generating multiple dimensions of loop unrolling across each
PE array dimension, e.g., unrolling OY|(FY|K) 16|(3|4) indicates unrolling the OY loop 16 times on one PE array dimension, and unrolling FY and K by 3 and 4 times respectively (together activating 12 PE units) on the other PE array dimension.
While the exhaustive search can generate tens of thousands of valid mappings, adopting the proposed heuristic v1/v2 methods reduces the number of spatial mappings to be evaluated by 3-69× depending on the layer characteristics, according to our experiments, as depicted in Figure 5.8. Neither heuristic search v1 nor v2 introduces any optimality loss regarding energy or latency.
[Figure 5.8: number of valid spatial mappings generated by the exhaustive search versus the heuristic searches v1/v2 for different layers.]
Before any mapping selection, a neural network layer inherently has a specific
maximum data reuse factor for each of its 3 operands, depending on the layer
shape and size. A specific spatial mapping exploits, for each operand, part of
these reuse opportunities in the spatial dimension. The residual data reuse
opportunities remain available for temporal mapping. Yet, multiple spatial
mappings can result in an equal data reuse distribution between the spatial
and temporal domains for each operand. As a result, their temporal mapping
opportunities are identical. Such a group with equivalent or symmetrical spatial-
temporal data reuse distribution should only be considered once for further
temporal mapping search.
As an example, for a neural network layer with OX=OY and FX=FY, unrolling OX|FX 7|3 and unrolling OY|FY 7|3 are equivalent (symmetrical), so one of them can be skipped in the subsequent temporal mapping search. For a neural network layer with OX≠OY or FX≠FY, however, the two unrollings are no longer equivalent, since they now play different roles in the overall data reuse distribution, and thus both should be considered.
In light of the previous heuristic v1 search, in the next step, the temporal
mapping search engine still has to generate valid temporal mappings for each
non-symmetrical spatial mapping found.
In order to further prune away sub-optimal spatial mappings, a Pareto surface
of the spatial data reuse for each spatial mapping is identified in the operand
space (W/I/O): only those spatial mappings which demonstrate Pareto-optimal
data reuse along weight/input/output are processed for further evaluation.
Those mappings that are not on this Pareto surface correspond to dataflows that
would require a large amount of data reuse along at least one of the operands to
be handled temporally, which in turn would correspond to a larger amount of
memory accesses to the upper levels in the memory hierarchy. As a consequence,
they can be safely ignored without losing optimality.
To illustrate it better, let’s create an example. Assume for a pointwise Conv
layer (i.e., FX=FY=1, no need to consider pr loop effect), there are four spatial
mappings left after the heuristic v1 search method: (1) C|K 16|8, (2) K|C 12|8,
(3) (B|C)|K (8|2)|8, and (4) OX|K 8|8. With Figure 5.4, we can easily get
these mappings’ data reuse factors (for each operand): (1) {W: 1, I: 8, O: 16},
(2) {W: 1, I: 12, O: 8}, (3) {W: 8, I: 8, O: 2}, and (4) {W: 8, I: 8, O: 1}. Based
on the data reuse list, heuristic v2 prunes away spatial mapping (4), as it is surpassed on all fronts by (3). The remaining three, (1)(2)(3), are kept, as they are all on the data reuse Pareto surface.
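A compact sketch of this Pareto-surface pruning (illustrative only; ZigZag's own implementation differs in structure) applied to the four example mappings:

def pareto_prune(mappings):
    # mappings: {name: (reuse_W, reuse_I, reuse_O)}
    def dominated(a, b):   # True if b is at least as good everywhere and strictly better somewhere
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
    return {n: r for n, r in mappings.items()
            if not any(dominated(r, other) for m, other in mappings.items() if m != n)}

# The example from the text: mapping (4) is dominated by (3) and is pruned.
candidates = {"(1) C|K 16|8":        (1, 8, 16),
              "(2) K|C 12|8":        (1, 12, 8),
              "(3) (B|C)|K (8|2)|8": (8, 8, 2),
              "(4) OX|K 8|8":        (8, 8, 1)}
print(list(pareto_prune(candidates)))   # keeps (1), (2), (3)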
Given that each spatial mapping is then individually sent to the temporal
mapping search engine, the speed-up in the overall search procedure is roughly
proportional to the number of spatial mappings pruned away.
In the temporal domain, each neural network layer is expressed as three sets
of nested for-loops (for W/I/O resp.) that determine the order of the overall
MAC execution. These three sets of for-loops must follow the same loop order
but can be differently distributed across the memory hierarchy (each loop is
mapped to a memory level), as shown in the previous Figure 5.3(d).
By adopting the enhanced Memory-Centric Design Space Representation and
using the concept of virtual memory level, the proposed temporal mapping
search engine efficiently supports producing even and uneven mapping schemes
on balanced and unbalanced memory hierarchies that present shared and/or
non-shared memory levels between different operands. Note that ZigZag has
significantly enlarged the mapping design space compared to all the previous
works, which enables users to obtain better design points, as will be shown in
Section 5.8. The smart search methods reduce the required mapping evaluation
by several orders of magnitude with respect to exhaustive search, as illustrated
in Figure 5.9, with a maximum 5% loss in optimality.
Exhaustive search
The exhaustive search consists of two steps, shown in Figure 5.10(left): 1) the
loop blocking step, in which all valid loop combinations are assigned to the
memory levels, and 2) the loop ordering step, in which all valid permutations
of the assigned loops within each level are generated. In order to explore all
possible schedules, we treat loop prime factors (LPFs) as the smallest units into which a loop can be split. These LPFs correspond to the prime factorization of each layer dimension (i.e., B, C, K, ...) and are the basic building blocks of the search algorithm.
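The LPF decomposition itself is a plain prime factorization per layer dimension, as the following sketch (not ZigZag's code) illustrates:

def loop_prime_factors(dim_sizes):
    # dim_sizes: {"K": 256, "C": 48, ...} -> list of (dimension, prime factor) LPFs
    lpfs = []
    for dim, n in dim_sizes.items():
        p = 2
        while n > 1:
            while n % p == 0:
                lpfs.append((dim, p))
                n //= p
            p += 1
        # dimensions of size 1 contribute no LPF
    return lpfs

# AlexNet layer 2 dimensions used earlier in this chapter.
print(loop_prime_factors({"K": 256, "C": 48, "OX": 26, "OY": 26, "FX": 5, "FY": 5}))
# e.g. K contributes eight (K, 2) factors, OX contributes (OX, 2) and (OX, 13), ...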
In the procedure, an important concept, virtual memory level, is used to help
the mapping algorithm find valid loop-blocking schemes and guarantee their
validness in the following loop-ordering step. The virtual memory levels provide
fake boundaries that allow gradually and uniformly fitting data of different operands into actual (physical) memory levels in an uneven manner, as indicated by the gray wavy lines in Figure 5.10(left). Then, after the loop blocking, the loop
ordering is performed within each virtual memory level. These two steps are
explained in detail below.
The loop-blocking step consists of exhaustively exploring how temporal loops
can be assigned in the memory hierarchy. For each virtual memory level, the
core procedure of loop blocking includes four recursive steps: generate, calculate,
assign, and update.
• First, generate all possible LPF combinations from the unassigned LPFs;
• Second, calculate for each operand which of these combinations fit in the
virtual memory level;
• Third, assign each fitting LPF combination to the virtual memory level, with each assignment case forming a new loop-blocking basis;
• Fourth, update the virtual memory level based on the rule described below.
The virtual memory level at which the LPF combinations are allocated is
initially set as the lowest level in the hierarchy (i.e., the one close to MAC).
After each assignment, the virtual memory level to which the combinations will
be assigned in the next iteration is updated. The update procedure checks for
each operand if, considering the remaining LPFs to be assigned, the current
physical memory level can still fit any additional loop assignment; if not, then
the virtual memory level for the next assignment iteration (for that operand)
will become the following upper physical memory level in the hierarchy, as
visualized in Figure 5.10(left). The algorithm will continue these actions until
all LPFs are assigned to virtual memory levels.
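The sketch below gives a heavily simplified, single-operand flavor of this generate/calculate/assign/update recursion; the real engine tracks all three operands, their loop relevance, and the physical-to-virtual level update rule described above.

from itertools import combinations

def blockings(remaining, capacities, assigned=()):
    # remaining: list of (dim, prime_factor) LPFs still to be assigned
    # capacities: max data size (in elements) of the current and higher memory levels
    if not remaining:                      # all LPFs assigned -> one valid loop blocking
        yield assigned
        return
    if not capacities:                     # ran out of memory levels
        return
    cap = capacities[0]
    n = len(remaining)
    for k in range(n + 1):                 # generate: all LPF combinations (incl. empty)
        for idx in combinations(range(n), k):
            size = 1
            for i in idx:                  # calculate: data size of this combination
                size *= remaining[i][1]
            if size <= cap:                # assign if it fits this (virtual) level...
                rest = [remaining[i] for i in range(n) if i not in idx]
                picked = tuple(remaining[i] for i in idx)
                # ...and update: continue with the next memory level
                yield from blockings(rest, capacities[1:], assigned + (picked,))

for scheme in blockings([("K", 2), ("K", 2), ("C", 3)], capacities=[4, 64]):
    print(scheme)    # identical LPFs lead to duplicate schemes; a real engine deduplicates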
After the loop assignment, the set of valid mapping schemes is fed into the loop
ordering step. In this step, within each virtual memory level, all permutations of
the loops are considered, and each becomes a different valid temporal mapping.
The number of mapping schemes generated in this way grows exponentially
with the number of LPFs and the number of virtual memory levels. To reduce
this mapping explosion, smarter search methods are provided.
In order to prune away sub-optimal mappings before the cost evaluation stage,
two heuristic principles are applied: data stationarity maximization (heuristic
search v1) and data reuse pruning (heuristic search v2).
Heuristic search v1 is applied at the loop ordering stage, as depicted in
Figure 5.10 (right) Loop Ordering and Figure 5.11 column 2: for each
virtual memory level, instead of doing complete loop permutations, only those
permutations that maximize data stationarity for each operand at the below
memory level are generated, which, in the loop format, is equivalent to putting
all the ir loops of each operand close to the lower level boundary. By doing so,
the number of valid temporal mappings to be evaluated drops by an order of
magnitude, as shown in Figure 5.9, without losing optimality.
On top of heuristic search v1, heuristic search v2 is applied after the LPF
assignment and before the loop ordering optimization, as depicted in Figure 5.11
column 3: the data reuse factor for each operand at each level in the memory
hierarchy can be extracted at this point since the loop types and their size
are known (even though their loop ordering at each level is unknown). If a loop-blocking scheme offers clearly sub-optimal data reuse for the operands at this point, it is pruned away before the loop-ordering step.
Figure 5.11: Temporal mapping search engine algorithms comparison. While exhaustive search generates all valid
mapping schemes (∼1s-100s million), heuristics are required to prune away sub-optimal mappings. Heuristic search v1
prunes mappings at the loop-ordering stage, and heuristic search v2 prunes at both the loop-blocking and loop-ordering
stages. Iterative search prunes at each loop assignment iteration besides applying the previous heuristics.
While the two mapping search engines are able to efficiently scan the mapping
space for optimal points, the architecture on which the mapping is to be applied
may not be the optimal one. For example, having bigger register files within
each PE might have positive consequences due to higher data reuse closer to
the MAC level, but it might limit the size of the buffer levels out of the array
because of the limited area available on-chip and therefore limit the reduction of
off-chip accesses. As a consequence, depending on the neural network workload,
the memory size, the cost of data access, the word length per access, etc., an
optimal point exists in the architecture space as well.
In order to also explore different hardware architectures, Architecture Generator
is able to exhaustively generate all architectures that fit within a user-defined
area constraint, as shown in Figure 5.12, as well as other user-defined design
guidelines, such as having at least a size ratio of 8 between consecutive memory
levels. To carry out this task, it first draws from a memory pool different combinations of memories (considering memory unrolling: a small memory equipped to each MAC unit or to a group of MAC units), and if the combination fits within the area constraint and meets the design guidelines, it then proceeds to assign operand(s) to each memory instance, taking operands' memory sharing into account.
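A minimal sketch of this enumeration idea, with a hypothetical two-level memory pool and invented area numbers (not the pools or constraints used in the thesis):

from itertools import product

# Hypothetical (size_in_bytes, area_in_mm2) entries; illustrative only.
inner_pool = [(8, 0.001), (64, 0.004), (256, 0.01)]       # unrolled per PE
outer_pool = [(8_192, 0.05), (65_536, 0.3), (524_288, 1.8)]

def generate_architectures(num_pes, area_budget, min_ratio=8):
    for (in_size, in_area), (out_size, out_area) in product(inner_pool, outer_pool):
        total_area = num_pes * in_area + out_area         # inner memory is replicated per PE
        if total_area <= area_budget and out_size // in_size >= min_ratio:
            yield {"inner_per_PE_B": in_size, "outer_B": out_size, "area_mm2": total_area}

for arch in generate_architectures(num_pes=168, area_budget=2.0):
    print(arch)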
The output of this stage is a list of valid hardware architectures, each with a
different memory hierarchy, which is sequentially fed to the mapping search
engines to find the optimal spatial and temporal mappings. When all hierarchies
are analyzed, the optimal memory hierarchy and its optimal mapping for energy
and latency will be identified.
ZigZag allows running the architecture exploration for a single neural network
layer, a complete neural network, as well as multiple complete neural networks,
as will be demonstrated in Section 5.9 case studies.
5.8 Validation
The hardware cost model and the mapping search engines are validated with
three methodologies: 1) against measured results of published chips; 2) against
in-house post-synthesis extracted energy and performance data; 3) against other
DNN accelerator DSE frameworks.
Firstly, we model the mappings and hardware architectures of both Eyeriss [30]
and ENVISION [116] and compare the ZigZag cost model estimated energy
with their reported values, as depicted in Figure 5.13. The resulting energy
values, normalized with respect to the cost of a single MAC, are shown for full
precision operation without voltage scaling or sparsity reduction. The estimated
values are within an acceptable 5%, resp. 7.5% error margin. Secondly, to
also validate the throughput model, validation is performed against a complete
in-house accelerator post-synthesis simulation. The results in Figure 5.14 show a maximum error of 6% on energy and 4% on PE array utilization. Finally, the
validation of the cost model as well as the mapping search engines is carried
out against a SotA DSE framework, Timeloop [126] + Accelergy [177].
ZigZag and Timeloop are compared on Eyeriss architecture with AlexNet [90]
and ResNet-34 [57], and the validation results are shown in Figure 5.15. The
experiment is carried out in three steps. Step 1, we let Timeloop do a free search
to find its optimal spatial and temporal mappings for each layer under test
(Figure 5.15 all the left bars in each group of three, labeled "TL"). Step 2, all the
optimal mappings found by Timeloop are validated in ZigZag, demonstrating
that the two cost models match well with an average error <5% (the middle
bars, labeled "TL/ZZ"). In step 3, we let ZigZag do a free search to find its own optimal mappings, whose results are shown as the right bars in each group of three.
Figure 5.13: Cost model validation of AlexNet [90] Conv layers on Eyeriss [30]
(left) and ENVISION [116] (right).
[Figure 5.14: energy (uJ) per component (MAC+output register, activation memory, weight memory, input buffer) and PE array utilization, comparing the ZigZag model against post-synthesis results for AlexNet CONV1-CONV3 and FC4-FC5.]
Figure 5.15: Cost model and mapping search engines validation against
Timeloop [126]+Accelergy [177] on AlexNet [90] (left) and ResNet34 [57] (right).
5.9 Case Studies
To show the strength of ZigZag and use it to extract insights from the vast design
space, three different case studies are conducted. In these studies, all memory
access energies are extracted from CACTI7 [12] in 65nm CMOS technology,
assuming 8-bit fixed point precision for Weight/Input/final-Output, 16-bit fixed
point precision for partial Output.
Case study 1 focuses on answering two questions: 1) How big is the impact
of temporal and spatial mapping on energy and throughput when fixing both
the neural network workload and the hardware architecture (memory hierarchy
and PE array size)?; 2) How large is the benefit of uneven mapping over even
mapping in terms of energy and performance?
In the experiment, the neural network workload and hardware architecture
are fixed to AlexNet Conv layer 2, and an Eyeriss-like architecture with a
memory bandwidth assumption of 16-bit/cycle for the inner-PE register file
and 64-bit/cycle for the global buffer.
This experiment consists of two steps. Firstly, the Spatial Mapping Search
Engine finds 16 spatial mappings by means of the heuristic v2 method explained
in Section 5.6.1. Secondly, for each spatial mapping found, the Temporal
Mapping Search Engine searches for all Pareto-optimal schedules considering
energy and PE array utilization (throughput). The results are shown in
Figure 5.16 and 5.17.
Figure 5.16: Even/uneven temporal mapping's impact on energy and PE array utilization (throughput) for different spatial mappings: OYu|FYu|Ku 13|5|2 (left) and OYu|OYu|Cu 13|2|6 (right).
[Figure 5.17: Pareto-optimal temporal mapping points (PE array utilization versus energy) of even and uneven mappings, collected over all 16 spatial mappings.]
In Figure 5.16, two temporal mapping spaces, each with a different spatial mapping, are shown. Let us first analyze the impact of even and uneven temporal mapping schemes separately. In the left/right figure, for even mappings,
in total 73184/26944 points are found, within which up to 4.8×/3.3× energy
variance and 3.2×/2.1× utilization variance are observed; for uneven mappings,
in total 1296364/146940 points are found, up to 10.3×/9.7× energy variance and
7×/42× utilization variance are observed. It is clear that the temporal mapping
space for an uneven scheme is much larger than for an even scheme. This results
in significantly lower energy solutions found for uneven mappings, with up to
30%/28.6% lower total energy consumption compared to even mappings. In
terms of PE array utilization, both the even and uneven mappings find the same optimal point in the left figure, while in the right figure, the uneven mapping
achieves 27% higher PE array utilization, and hence 27% higher throughput.
Figure 5.17 is the collection of the Pareto-optimal temporal mapping points
(in terms of energy and utilization) for all 16 spatial mappings. These results
further confirm the ability of uneven mappings to locate better design points
compared to even mappings. Here, an overall gain of up to 32.6% in terms of
energy consumption or 12% in terms of throughput is achieved.
Knowing that ZigZag can help designers to find the optimal spatial and temporal
mapping points for a user-defined hardware architecture and neural network
workload, the next challenge is determining the optimal hardware architecture
given a neural network workload. This should not just be done for a single
neural network layer, but for a network consisting of multiple layers. Case
study 2 demonstrates this ability within ZigZag.
In this case study, all architectures are equipped with an off-chip DRAM and a
MAC array of size 12 × 14 (same size as Eyeriss [30]). ZigZag will search for
the best on-chip memory hierarchy to process multiple neural network layers
of DarkNet19 [132]. The available memory modules in the memory pool are 8
Byte, 64 Byte, 256 Byte, 8KB, 64KB, and 512KB, in which the first three form a group that is unrolled together with each MAC unit, and the last three form a group without memory unrolling. ZigZag picks for each operand (W/I/O) one memory
from the first memory group and picks one or zero from the second memory
group (i.e., we simulate single-level and dual-level on-chip memory hierarchy).
Figure 5.18 summarizes the complete results. In total, 240 different memory
hierarchies are found and evaluated. All four figures share the same X-axis,
which is the memory hierarchy index, from 0 to 239. The top three figures are
the visualization of all memory hierarchies from each operand's perspective (e.g., a single-level/dual-level on-chip memory hierarchy corresponds to resp. one/two dot(s) along the vertical direction, for each operand). The bottom figure
shows the resulting total energy consumption for executing all 8 non-repetitive
DarkNet19 layers on each architecture, as well as its area.
Figure 5.18: Memory hierarchy search for multiple layers of DarkNet19[132].
is fixed, the memory size of the inner-PE register file does not influence the overall energy that much, as one can clearly distinguish three energy regions, corresponding to an 8KB, 64KB, and 512KB upper memory; 3) the trade-off between energy and area is quantified, such that designers can make clear decisions, like giving up 5% of energy to gain 40% of area saving (comparing the 64KB and 512KB designs).
One step beyond the single-DNN study of case study 2 is the application of ZigZag to explore hardware implications across a wide variety of neural networks.
Case study 3 aims to extract the optimal mappings’ energy, resp. performance
for 12 different DNNs executing on 720 different accelerator architectures. For
this, 12 popular DNNs targeting the ImageNet dataset [42] have been selected,
with their characteristics summarized in Table 5.4. The 720 selected hardware
architectures of this study all have the same PE array size (14×16) and the
same spatial unrolling (OXu|Ku 14|16), but they are different in the memory
hierarchy. As listed in Table 5.3, these 720 memory hierarchies vary in memory
size (Mem. Size Option), number of memory levels (Mem. Bypass Option), and
memory sharing among W/I/O operands (Mem. Share Option).
For each DNN workload-accelerator pair, ZigZag searches the lowest energy,
resp. lowest-latency mapping. This is performed by running ZigZag’s temporal
search engine for each layer in a particular DNN and accumulating energy,
resp. latency values. Thus, this results in 2 dots for each workload-accelerator
combination: one for the minimal latency and one for the minimal energy,
plotted in Figure 5.19. Throughout the study, greedy spatial mapping1 is
applied to all layers in order to maximize spatial PE array utilization. Heuristic
search v1 is adopted for the ZigZag temporal mapping search engine.
The resulting Figure 5.19 reveals the global trade-off between workload accuracy,
energy efficiency, latency, and hardware area across these 12 popular DNNs,
with their Pareto optimal energy and latency achievements summarized in
Table 5.4. An analysis of Figure 5.19 and Table 5.4 allows researchers to quickly
1 Greedy spatial mapping is applied to handle the case that the layer size is not a perfect
fit to the array dimension. It maximally uses the PE array’s spatial dimension for all loop
iterations except for the last one. For example, if we unroll a loop of dimension 20 onto a 1D
array with a size of 8, the mapper based on prime factorization would run for 4 iterations of a
5-way spatial mapping, resulting in a spatial utilization of (5+5+5+5)/(8+8+8+8) = 62.5%.
The greedy mapper, on the other hand, would propose a solution that runs for 2 iterations
with 8-way spatial parallelism, and the last iteration exploits the array 4-way, whose spatial
utilization is (8 + 8 + 4)/(8 + 8 + 8) = 83.3%.
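The footnote's arithmetic can be reproduced with the following small sketch (one possible reading of the two mappers, for a 1D array):

import math

def greedy_utilization(loop_size, array_size):
    iterations = math.ceil(loop_size / array_size)
    return loop_size / (iterations * array_size)            # (8+8+4)/(8+8+8) = 83.3%

def prime_factor_utilization(loop_size, array_size):
    # Largest divisor of loop_size that still fits the array, repeated loop_size/divisor times.
    factor = max(f for f in range(1, array_size + 1) if loop_size % f == 0)
    return factor / array_size                              # (5+5+5+5)/(8+8+8+8) = 62.5%

print(prime_factor_utilization(20, 8), greedy_utilization(20, 8))   # 0.625 0.8333...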
is dominant in a DNN layer, it usually indicates that this operand has fewer data reuse opportunities, and thus benefits less from the multi-level memory architecture.
Figure 5.19: Energy-Latency-Area comparison for mapping 12 NNs on 720 accelerator architectures each. Every NN-accelerator pair corresponds to two points in the figure: one with the min-energy mapping and one with the min-latency mapping. The Pareto-optimal accelerator architectures for each NN are highlighted and connected.
Table 5.4: Comparison on 12 Neural Networks’ Algorithm Attribute and Hardware Performance. Weight/Input/Output
Size is the accumulated size across all layers, assuming 8-bit precision on ImageNet data.
‘(#)’ indicates value order, from high (#1) to low (#12), across all 12 NNs.
Neural Network: AlexNet [90], MBV3 Small [64], MBV1 [65], MBV2 [138], NASNet Small [191], MBV3 Large [64], ResNet50 [57], DenseNet201 [69], Xception [31], SEResNeXt50 [67], IncepRes V2 [155], NASNet Large [191]
Top-1 Accuracy (%) 56.5 (#12) 67.4 (#11) 70.6 (#10) 72 (#9) 74 (#8) 75.2 (#7) 75.3 (#6) 77.42 (#5) 79 (#4) 79.3 (#3) 80.1 (#2) 82.7 (#1)
Total MAC (GOPs) 1.07 (#7) 0.06 (#12) 0.57 (#8) 0.30 (#10) 0.56 (#9) 0.22 (#11) 3.86 (#6) 4.29 (#4) 9.48 (#3) 4.23 (#5) 13.16 (#2) 23.74 (#1)
Weight Size (MB) 24.48 (#4) 4.08 (#10) 4.01 (#11) 3.31 (#12) 5.01 (#9) 9.50 (#8) 24.32 (#6) 18.87 (#7) 24.15 (#5) 26.20 (#3) 53.15 (#2) 84.45 (#1)
Input Size (MB) 0.46 (#12) 1.90 (#11) 5.21 (#9) 6.85 (#8) 12.83 (#6) 4.92 (#10) 9.75 (#7) 23.67 (#4) 36.22 (#3) 13.71 (#5) 39.80 (#2) 137.09 (#1)
Output Size (MB) 0.63 (#12) 1.55 (#11) 4.81 (#9) 6.37 (#8) 7.57 (#6) 4.40 (#10) 10.10 (#5) 7.49 (#7) 34.17 (#2) 13.75 (#4) 23.90 (#3) 86.37 (#1)
Total Data Size (MB) 25.57 (#7) 7.53 (#12) 14.03 (#11) 16.53 (#10) 25.41 (#8) 18.82 (#9) 44.17 (#6) 50.03 (#5) 94.54 (#3) 53.66 (#4) 116.85 (#2) 307.90 (#1)
Best Energy (uJ) 20.72 (#7) 5.37 (#12) 11.03 (#11) 11.93 (#10) 19.40 (#8) 13.61 (#9) 42.05 (#6) 44.14 (#5) 90.40 (#3) 46.81 (#4) 110.30 (#2) 271.92 (#1)
Best Latency (Mcycles) 8.04 (#8) 1.75 (#12) 5.54 (#9) 4.93 (#10) 10.83 (#7) 4.63 (#11) 22.76 (#6) 23.72 (#5) 79.53 (#3) 25.59 (#4) 96.37 (#2) 209.96 (#1)
Our vision for a high-level, fast, and versatile DSE framework for DNN
accelerators began in late 2018. After dedicating time to contemplation and
prototyping, we initiated the development of ZigZag in mid-2019, completed
the implementation of its initial version by early 2020, and open-sourced the
first stable version in September of the same year. Since then, we have been
consistently updating and enhancing ZigZag from various perspectives.
The ZigZag framework has been applied in different DSE use cases.
Houshmand et al. in "Opportunities and Limitations of Emerging Analog in-
Memory Compute DNN Architectures" [61] extended the ZigZag cost model with
Analog In-Memory Computing (AIMC) support. With the enhanced cost model,
they utilized the ZigZag DSE framework to assess the benefits and pitfalls of
AIMC solutions from the complete accelerator level (instead of the AIMC array
itself), and compared them with conventional digital accelerator solutions. This
study showed that AIMC can improve efficiency significantly, yet only when
the AIMC array topology and the memory technology are co-optimized with
the memory hierarchy and system architecture.
In the successive work, "Benchmarking and Modeling of Analog and Digital
SRAM in-Memory Computing Architectures" [63], Houshmand et al. further
extended the ZigZag cost model with Digital In-Memory Computing (DIMC) support and unified it with the original AIMC cost model, so as to fairly compare different AIMC and DIMC hardware design options and mapping preferences.
Results highlighted that 1) peak performance does not correspond to actual performance; instead, architectures should be compared on real workloads with their mappings to prove their effectiveness; and 2) the exploration done on the MLPerf Tiny Benchmark [13] shows the good potential of DIMC to replace AIMC in certain cases, thanks to its higher flexibility in the mapping space, full-precision operations, and better area efficiency (TOPS/mm2).
Besides being used to compare these emerging technology options, ZigZag
can also be applied to analyze certain architectural attributes. Liu et al.
in "Bandwidth-aware Flexible-Scheduling Machine Learning Accelerator" [99]
focused on studying the impact of on-chip memory bandwidth and its interaction
with other architectural-level optimization possibilities. This work deployed
ZigZag to explore, under different on-chip memory bandwidths, the system-level
benefits that dynamic memory allocation and flexible PE interconnection can bring. Observations indicated that these techniques are promising for future 3D-stacked chip designs (with potentially high on-chip bandwidth).
In addition, ZigZag has also been utilized to provide hardware insights for various
DNN workloads. Colleman et al. in "Processor Architecture Optimization for
Spatially Dynamic Neural Networks" [33] made use of ZigZag to compare novel
accelerator architectures and dataflows enabling latency improvements for this
new type of neural network: Spatially Dynamic Neural Networks (SDyNNs).
SDyNNs adjust network execution based on the input data, saving computations
by skipping non-important image regions. How to translate this irregular and
dynamic algorithmic-level operation saving into actual hardware cost benefit, is
what this work was trying to answer. They reformulated the Conv layer loop
format in ZigZag to express the irregular spatial parallelism of SDyNNs and,
based on this, assessed the required hardware flexibility in spatial and temporal
mapping to support all relevant layer types.
Afterwards, Colleman et al. in "Optimizing Accelerator Configurability for Mobile
Transformer Networks" [35] further employed ZigZag to explore, for Transformer-type neural networks, the optimal accelerator architectures (number of PEs, memory hierarchy, PE array configurability, etc.) and the corresponding best deployment strategies (spatial and temporal mappings). Experimental results highlighted the importance of supporting more than one spatial mapping in the PE array (especially for large PE arrays) to guarantee good hardware utilization across a wide variety of layer topologies in Transformer networks.
The ZigZag data representation has been enhanced along three perspectives: workload, hardware, and mapping.
Workload: The initial version of ZigZag supported all the neural network layers that can be represented by the 7 nested for-loops, as depicted in Figure 5.2.
The 7 loop dimensions (B/K/C/OX/OY/FX/FY), 3 operands (W/I/O), and
the loop relevancy in between, were all hard-coded. Although this could already
cover a lot of DNN layer topologies, it still limited ZigZag’s ability to support
layers with more diverse loop types (e.g., 3D Conv) or layers with different
operands than W/I/O (e.g., no weight in element-wise layers). Moreover, given
the fast-evolving speed in the DNN algorithm community, it is challenging to
anticipate what new types of layers will emerge in the future.
Seeing this, we have fundamentally upgraded the workload representation of
ZigZag for generality. The new representation has the following main attributes:
• It allows defining, for a layer, any number of loop dimensions and three operands (two input operands and one output operand), each with customized data precision(s)5.
• It allows customizing the loop relevancy (r/ir/pr) between computation
dimensions and data dimensions. For pr, all linear affine transformations
are supported.
• For the two input operands, it distinguishes constant and non-constant
operands. A constant operand is like Weight, whose data are ready from
the beginning; A non-constant operand is like Activation, whose data are
generated by a previous layer at runtime.
• Utilizing the constant/non-constant principle, it can consider the cross-
layer data dependency by defining the data-producing layer for each
non-constant operand with its data dimension transformation clarified.
E.g., Layer 2’s non-constant input operand I[c][ix][iy] comes from Layer 1’s
output operand O[k][ox][oy], with k → c, ox → ix, oy → iy.
• It also lets users define layer types (e.g., Conv, FC, DW, Add, Pooling, or any customized keyword) and later use these layer types to guide the mapping in the user mapping definition (explained later); e.g., Conv and Pooling layers are likely to be mapped on accelerators in different ways.

5 Note 1: this representation can be easily extended to support more than two input operands if needed.
Note 2: the output operand can have two data precisions: the partial output and the final output precisions. The partial output precision is usually higher due to the reserved accumulation headroom bits.
Note 3: for layers that only have one input operand, like pooling, ZigZag models them by setting the data precision of one (out of the two) input operands to 0 bit.
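As an illustration of this generalized representation, a standard Conv layer could be hand-written roughly as follows (an illustrative sketch only, not ZigZag's actual input schema):

conv_layer = {
    "layer_type": "Conv",
    "loop_dims": {"B": 1, "K": 256, "C": 48, "OX": 26, "OY": 26, "FX": 5, "FY": 5},
    "operands": {
        "O": {"precision": (16, 8)},                     # partial / final output precision
        "W": {"precision": 8, "constant": True},         # constant operand
        "I": {"precision": 8, "constant": False,         # non-constant: produced at runtime
              "source_layer": "layer1",                  # cross-layer data dependency
              "dim_link": {"k": "c", "ox": "ix", "oy": "iy"}},
    },
    # loop relevancy per operand: r / ir / pr (pr expressed as a linear relation)
    "relevancy": {
        "W": {"r": ["K", "C", "FX", "FY"], "ir": ["B", "OX", "OY"]},
        "O": {"r": ["B", "K", "OX", "OY"], "ir": ["C", "FX", "FY"]},
        "I": {"r": ["B", "C"], "pr": {"IX": "OX + FX - 1", "IY": "OY + FY - 1"}},
    },
}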
In addition to the manual workload definition, which gives users easy adaptability
of all layers’ attributes, ZigZag now also supports directly importing a DNN
workload from an onnx [11] model (a widely used open source format for AI
models). As lots of well-established DNN workloads have their onnx models
available online, users can directly download and use them in ZigZag, without
manual encoding effort.
Hardware: The hardware representation is improved in the new version of
ZigZag for two purposes: 1) to enable more fine-grained and thus more accurate
hardware cost modeling, and 2) to reflect more practical hardware constraints.
Firstly, for more fine-grained modeling, the main update is the explicit definition
of the memory port and memory read/write granularity. Initially by default,
ZigZag assumed all memories are single-port, and all memory read/write
behaviors happen at the full wordlength granularity. Now, ZigZag allows
defining any number of ports for each memory instance, the type of each port
(e.g., read-only, write-only, read-write), and the served operand movement of
each port6 .
Meanwhile, as multi-banked memories are frequently used in DNN accelerators and some of them support different memory read/write granularities (e.g., one word, half a word, a quarter of a word), this is also modeled in the new ZigZag. During hardware cost estimation, a data read/write that goes below the finest granularity of a memory is charged the per-access cost of the finest granularity. If it goes above the finest granularity but still below the full memory wordlength, the cost is a multiple of the finest granularity's cost, depending on the number of finest-granularity accesses required.
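A minimal sketch of this per-access cost rule, assuming a hypothetical memory with a given wordlength and finest read/write granularity:

import math

def per_access_cost(request_bits, min_bits, word_bits, cost_min_granularity):
    request_bits = min(request_bits, word_bits)   # a single access never exceeds the wordlength
    # below the finest granularity -> pay the finest-granularity cost;
    # above it -> pay a multiple of that cost, one per required finest-granularity chunk.
    chunks = max(1, math.ceil(request_bits / min_bits))
    return chunks * cost_min_granularity

# e.g. a 128-bit word with 32-bit banks: a 16-bit access costs 1 chunk, a 100-bit access 4.
print(per_access_cost(16, 32, 128, 1.0), per_access_cost(100, 32, 128, 1.0))   # 1.0 4.0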
Secondly, for embedding more practical hardware design constraints, the major
updates include a.) the decoupling of the spatial mapping from the PE array
interconnection definition, and b.) the decoupling of the layer operands from
the memory hierarchy definition.
In the old ZigZag, the PE array interconnection was implicitly defined by its supported spatial mapping, while in the new version it is defined through the memory-served dimension (or data-broadcasting dimension), which specifies, for each memory instance in the hierarchy, the PE array dimension(s) whose units that instance serves. Likewise, the memory hierarchy is no longer defined directly in terms of layer operands: each memory instance is assigned generic memory operands, which are later linked to the layer operand name that the user defined. It enables the usage of, for instance, the same 1 MB SRAM to serve different operands in different layers.

6 Two data movement actions for input operands: write-in-by-high and read-out-to-low; two for output operands: write-in-by-low and read-out-to-high, exchanging data with the above/below levels in the memory hierarchy.
To summarize, we have decoupled the algorithmic and mapping factors from
the hardware definition, which greatly enhances the generality of the ZigZag
framework.
User Mapping Definition: In the new version of ZigZag, the mapping
representation itself does not change a lot, still following the uneven loop format
introduced in Section 5.4. Yet the user mapping definition is newly provided as
an input to the framework: in the definition, for each layer type of a workload, users can provide hints to guide the mapping in a desired way. The mapping hints include (per user-defined layer type) 1) the core allocation11 (e.g., we can perform a Conv layer on one core and an element-wise sum on another core), 2) the layer operand to memory operand link (as explained in the Hardware part), 3) the spatial mapping candidates, and 4) the temporal ordering candidates. Among these hints, 1) and 2) are mandatory (multiple options can be entered); 3) and 4) are optional. When the spatial/temporal mapping candidates are not defined, the framework will search for them automatically.
These enhanced data representation formats together enable ZigZag users to have
greater flexibility in customizing their DSE experiments, which can potentially
lead to discovering better designs and deployment options.
Firstly, each pr loop is decoupled into equivalent r and ir loops before the mapping info extraction starts. The rationale for doing so is that, as explained in Section 5.5.1, r loops are data dimensions, ir loops indicate data reuse, and pr loops include both of them.
The implementation for the pr decoupling is straightforward. After the ZigZag
cost model receives a mapping (either user-defined or provided by mapping
search engines), it will pre-process the mapping before feeding it to the mapping
information extraction functions. In the pre-processing, for operand(s) that
have pr loops (e.g., the Input operand in Conv), it analyses for all its pr loops
in a bottom-up manner (i.e., from MAC to DRAM) the equivalent data size
and data reuse factor, and replaces each pr loop with its equivalent data-size r
loop and data-reuse-factor ir loop.
For example12, the Input mapping with FX 3 above the architecture level boundary and OXu 4 below it can be replaced by r 1.5 above the boundary and ir 2, ru 4 below it: OXu 4 is replaced by ru 4, and FX 3 is replaced by r 1.5 and ir 2 13, with the ir 2 merged down to the architectural level below14, indicating that on average every data element can be reused 2 times in the below level.

12 The same example used to explain the "FIFO Effect" in Section 5.5.2. The horizontal line is the architecture level boundary.
13 The equivalent Input data size is 6 (4+3−1), while the MAC operation count is 12 (3×4), thus the data reuse factor is 2, meaning ir is 2. In order to make the product of ir and r equal 3, r is 1.5.
14 The ir loops just above an architecture level boundary can always be merged into the architecture level below, as they do not increase the data size but provide more data reuse to the below level.
Secondly, for managing the fine-grained hardware definition (e.g., memory
port), a parameter called four-way data movement is introduced for uniformly
representing each operand’s data movement at each memory level. This
parameter collects data movement information (e.g., number of moved data
elements, data precision for the movement, data moving rate and period, etc.) of
four directions: write-in-by-high, read-out-to-low, write-in-by-low, read-out-to-
high. In the mapping information extraction step, all four-way data movement
attributes are calculated. Then, in the hardware cost estimation step, they are
combined with the fine-grained hardware information (e.g., link the responsible
memory port to each way of data movement) to calculate the final hardware
impact.
Finally, as the ZigZag framework grows, more and more people are using it for
customized DSE experiments and contributing to it from different perspectives.
It is important to provide a sustainable program structure for all the users and
contributors. If not, the whole framework will gradually become spaghetti with
people’s code all mixed together, difficult to understand, to use, and to further
expand.
So, to facilitate long-term usage and development, we have ultimately upgraded
ZigZag’s program structure with the concept of stages. The general idea of the
stage-based structure is to divide the whole DSE flow of ZigZag into multiple
execution stages, each performing a specific task with clearly defined inputs
and outputs. Currently ZigZag provides a pool of stages, which can be grouped
into 7 main categories. Once all the required information (accelerator, layer, spatial and temporal mappings)
reaches the bottom CostModelStage, cost evaluation is conducted and results
are generated. After that, the cost model evaluation results flow backward through all the stages, with forward-flow stages doing nothing (just passing the results on) and backward-flow stages performing certain tasks on the results they receive (e.g., result pruning, saving, post-processing, etc.). Note that the whole flow is not necessarily a one-time event but can include many (local) iterations back and forth; e.g., iterating through each layer in a workload, or through each temporal mapping that LOMA generates, leads to multiple sub-flows.
The pseudocode in Figure 5.21 further demonstrates the inter-stage data passing and calling, using the bottom four stages of Figure 5.20 as a partial flow for the explanation.
This partial flow searches for the best-energy temporal mapping solution for each spatial mapping found. Firstly, on lines 11-12, the SpatialMappingGeneratorStage generates one spatial mapping (sm) at a time and passes it to the next stage, together with the accelerator and layer information. Its next stage, MinimalEnergyStage, is a backward-flow stage, and thus directly passes the received information to the next stage
(line 19).

(In the pseudocode of Figure 5.21, cme stands for cost model evaluation result, sm for spatial mapping, and tm for temporal mapping.)

Figure 5.21: Example pseudocode showing the data passing and calling between the last four stages in Figure 5.20.
LomaStage receives the information (accelerator, layer, sm) and
performs the temporal mapping generation (line 29), generating one tm at a
time, and passing it together with all the other information to the next stage
(line 30). CostModelStage receives the information (accelerator, layer, sm,
tm) and generates one cost model evaluation result (cme) (line 37). This one
cme then flows backward, passing through the LomaStage and reaches the
MinimalEnergyStage (lines 19-21). Then the for-loop on line 19 will go into
the next iteration, which will trigger the stages below it to generate another cme,
i.e., the LomaStage will generate another tm and CostModelStage will produce
another cme for this tm, and so on. This iteration of lines 19-21 will go on and on
(while keeping the best-energy tm point it has encountered so far) until all below
stages no longer produce new cme, meaning that the LomaStage has generated
all tm. At this point, the best-energy tm is found by the MinimalEnergyStage.
After that, a new sm will be generated and start the whole iteration again
(lines 11-12).
It is worth pointing out that the Python programming keyword "yield" is
extensively used in the multi-stage implementation of ZigZag. For example, in
Figure 5.21, "yield" is used explicitly on lines 13, 22, 31, 38, and implicitly
on lines 11 and 29. In Python programming, "yield" is used within generator
functions to produce a sequence of values. When "yield" is encountered, it
temporarily suspends the function’s execution and returns a value to the caller.
This allows for lazy evaluation, where values are generated on-demand rather
than being computed all at once, making it efficient for working with large
sequences and a perfect fit for our multi-stage implementation.
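The following stripped-down sketch (illustrative stage and function names, not ZigZag's actual classes) shows how chained generators reproduce the lazy, one-result-at-a-time flow described above:

def cost_model_stage(accelerator, layer, sm, tm):
    # dummy cost model: one evaluation result (cme) per (sm, tm) pair
    yield {"sm": sm, "tm": tm, "energy": ord(sm[-1]) + ord(tm[-1])}

def loma_stage(accelerator, layer, sm):
    for tm in ("tm_a", "tm_b", "tm_c"):                           # temporal mapping generation
        yield from cost_model_stage(accelerator, layer, sm, tm)   # pass each cme back, one at a time

def minimal_energy_stage(accelerator, layer, sm):
    # keep iterating the stages below until they are exhausted, retaining the best-energy cme
    yield min(loma_stage(accelerator, layer, sm), key=lambda cme: cme["energy"])

def spatial_mapping_generator_stage(accelerator, layer):
    for sm in ("sm_1", "sm_2"):                                   # one spatial mapping at a time
        yield from minimal_energy_stage(accelerator, layer, sm)

for cme in spatial_mapping_generator_stage("accelerator", "layer"):
    print(cme)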
This multi-stage DSE configuration gives users high flexibility to adapt the
functionality of the framework and customize their experiments. For example,
one can simply replace MinimalEnergyStage with MinimalLatencyStage or
MinimalEDPStage to change the optimization target; one can also replace
MinimalEnergyStage and LomaStage by the TemporalMappingConversionStage
to change the function of the framework from auto-mapping search and
optimization to a pre-defined mapping cost evaluation.
In addition to users, ZigZag developers also reap significant advantages from
this modular flow. For instance, if a developer wishes to create another temporal
mapping search engine, they can develop it as a separate stage, such as
LomaStage or SalsaStage, and seamlessly integrate it into ZigZag without
concerns about disrupting other parts of the implementation. This allows for a
quick and hassle-free plug-and-play experience.
Besides making progress within ZigZag, a DSE framework for single-core DNN
accelerators supporting single-layer mapping, we want to continue broadening
the design space for both mapping and hardware, to seek more system-level
improvement possibilities. Following this idea, two successive DSE frameworks
are built based on ZigZag: DeFiNES and Stream. DeFiNES broadens the design
space of ZigZag with depth-first mapping; Stream further enlarges the design
space with multi-core accelerators. Further information about DeFiNES and
Stream will be discussed in Chapter 7 and Chapter 8, respectively.
All the frameworks are open source and have been employed by users in other organizations, e.g., Meta, Sony, Nokia, IMEC, NXP, Bosch, TU Delft, TU Eindhoven, Stanford University, Ghent University, University of Antwerp, etc.
5.11 Conclusion
This chapter is based on publication [110], and contains large fractions of it. The author
contributed to all aspects of the latency model, in collaboration with Meta.
Detailed implementation is at: https://github.com/KULeuven-MICAS/zigzag/blob/master/
zigzag/classes/cost_model/cost_model.py.
Many prior arts have proposed uniform energy models for DNN accelerator
design [183, 126, 177, 91, 184, 109, 158]. The common basis is an analytical
model which counts the operations of each hardware component (e.g., memory
read and write at each level, MAC, data transfer in NoCs, etc.), and multiplies
these with the corresponding unit energy to obtain the total system energy.
Unlike the well-explored energy models, analytical latency models are, however,
less systematically developed or explained for DNN accelerators. In this chapter, we refer to "latency" as the clock cycle count (CC) for completing a workload. From the physical level to system abstraction, the existing SotA latency estimation methods can be categorized as: 1) measurement on physical devices, 2) FPGA emulation [28], 3) RTL simulation [166], 4) cycle-accurate simulation [136, 119], 5) regression or ML-based methods [36, 127], and 6) analytical modeling [91, 126, 187]. Among these, analytical models are preferred
for early-phase DSE, thanks to their fast run-time and transparency (compared
to black-box regression or ML-based methods). Moreover, since most of the
DNNs are deterministic with pre-known hardware architectures and mappings,
high accuracy in latency is analytically achievable.
However, most existing analytical latency models rely on ideal assumptions, such as: 1) all memories at different levels are double-buffered, assuming that double-buffering can avoid all temporal stalls; 2) memories that are shared by multiple operands always have multiple read/write ports to avoid interference among different operands' data accesses. Although these assumptions simplify the modeling, two issues are introduced: a.) for fixed architectures, modeling accuracy degrades when these assumptions are not met; b.) for architecture search, a memory system overhead is introduced by default that excludes a large part of the design space, leading to sub-optimality. Some other latency models were delicately built for a specific design case or for a small group of design variants within a hardware template [127]. Although accuracy is preserved, the limited generality prevents their usage for novel architecture searches.
This work aims to bridge these gaps with a uniform analytical latency modeling
approach for algorithm-hardware-mapping (AHM) co-optimization, targeting
dense intra-layer cases. This chapter is organized as follows:
• Section 6.3 introduces the 3-step approach that can uniformly address the different stall scenarios induced by the multi-level memory system.
• Section 6.4 demonstrates good model accuracy with test chip validation.
• Section 6.5 assesses AHM-latency co-optimization through three case
studies to drive design insights.
• Section 6.6 concludes the chapter.
Algorithm, hardware, and mapping can each largely impact system latency. The following briefly recaps what has been discussed in Chapter 2.
Algorithm (A) includes DNN layer attributes, such as layer type (e.g., Conv,
FC, DW and PW), layer loop dimensions, data attributes of the layer operands
(e.g., total data size and data precision), etc.
Hardware (H): A DNN accelerator is usually equipped with a MAC array
and a multi-level memory system, connected via an on-chip network. Its
performance roofline is determined by hardware parameters, such as MAC array
size, interconnectivity, and memory hierarchy (e.g., memory levels, capacity
/ bandwidth (BW) / number of read/write ports, and memory allocation for
different operands).
Mapping (M) determines how the algorithm is spatially and temporally
mapped on the hardware. Spatial mapping defines how to parallelize DNN
loops across the MAC array, while temporal mapping defines in what order
the MAC array processes the non-spatially-unrolled DNN loops. Ideal spatial
mapping fully utilizes the MAC array, while ideal temporal mapping maximizes
operands’ data reuse at lower memory levels. Mapping optimization can help
to minimize compute cycle count and communication stalls.
These three factors are strongly interrelated and form a gigantic AHM design
space, in which each point corresponds to a specific algorithm-hardware-mapping
scenario with a resulting deterministic latency value.
The challenges for building an analytical latency model that can be applied to
the vast AHM design space are twofold:
First, the concurrency and interference of data transfers between different memory levels for different operands, i.e., weight (W) / input (I) / output (O), need to be captured. Such interference comes from hardware constraints (e.g., insufficient memory BW / ports, lack of double buffering, system interrupts) and mapping choices (i.e., optimized mappings can alleviate the interference, while non-optimized mappings may aggravate it). Note that this is specific for
analytical latency estimation, but is usually less impactful on analytical energy
modeling. This is because the analytical energy model (dynamic energy) only
relies on the total operation count of each hardware component, while latency
also depends on when these operations happen and how they interfere with
each other.
The second challenge stems from the generality requirement, since the latency
model needs to be applicable for not only few pre-defined hardware architectures
with a fixed dataflow, but also for every valid AHM point in the design space.
Our proposed latency modeling approach aims to solve these two major challenges: 1) to capture the concurrency and interdependencies of data transfers, we start by dividing this complex, intertwined problem into multiple single-component data movement events, analyzing each event separately, and then combining them based on physical hardware constraints; 2) to ensure generality, we adopt a uniform AHM representation and implement a standard 3-step memory-type / bandwidth / sharing-aware latency modeling methodology that covers all valid design points, as detailed in Section 6.3.
* Reason: the MAC array / interconnects / spatial mapping and the DNN layer dimensions do not match, etc.
** Reason: memory BW and # of ports are insufficient; the mapping is non-ideal; control overhead; etc.
*** Define: Spatial stall = CCspatial − CCideal; Temporal stall = CCtot. − CCideal (= SSoverall).
Figure 6.1: (a) A timeline illustration of DNN layer operation phases. (b)
Four scenarios of latency and utilization modeling in the computation phase.
The computation phase usually dominates the overall processing time, of which
the latency is strongly impacted by AHM. Figure 6.1(b) shows the 4 computation
scenarios based on the spatial and temporal mapping rate of the MAC array.
The latency modeling challenges mainly come from the scenarios with a temporally under-utilized MAC array (③ and ④), due to the complexity of modeling the stall induced by non-ideal data movement, i.e., the temporal stall (SSoverall).
In the remainder of this section, we first introduce prerequisite concepts, and then describe in detail the 3-step latency modeling methodology that addresses the SSoverall modeling
challenge. The key terminologies and the model steps are illustrated in Figure 6.2.
Please refer to it for all the abbreviations used in this section.
[Figure 6.2(a) defines the key terms used in each modeling step:
SSu — stall (+) or slack (−) of a Unit Mem read/write port, with respect to computation; a non-zero value indicates a mismatch between ReqBWu and RealBW (more details in Figure 6.3).
Step 2 (combining the DTLs' attributes): ReqBWcomb — ReqBW of a physical memory read/write port; the sum of all ReqBWu of the share-port DTLs. MUWcomb — total allowed memory updating window of a physical memory read/write port; the union of all MUWu of the share-port DTLs. SScomb — SS of a physical memory read/write port; already existing stalls plus stalls combined from slacks, Eq. (6.1).
Step 3: SSoverall — overall temporal stall cycle count; integrates all SScomb based on the different memories' coherency.]
Figure 6.2: (a) Descriptions of the terminologies used by each step in the
latency model and (b) an illustration of 3-step latency modeling methodology.
As a brief recap of the adopted representation: 1) a DNN layer is described by nested for-loops over batch size (B), output/input channels (K/C), output x-y dimension size (OX/OY), and filter x-y dimension size (FX/FY); 2) the three major operands (W/I/O) each have their relevant (r) and irrelevant (ir) for-loops, where r / ir loops contribute to that operand's data size / data reuse respectively. In addition, we introduce the following terms, described in the table of Figure 6.2(a), and refer to them in the modeling: Unit Mem, DTL, MemDATA, MemCC, and RealBW.
In the first step, we divide the problem of extracting the total temporal stall
cycles SSoverall of the entire memory system into deriving the stall/slack cycles
(SSu ) induced by a single operand (W/I/O) accessing a Unit Mem (e.g., Mem1-9
in Figure 6.2(b)). To analyze these Unit Mem levels, we decouple the read and
write operations on the interface between two Unit Mem levels, treating each as
a DTL (e.g., DTLs ① to ⑱ in Figure 6.2(b)).
Step 1 consists of three sub-steps: 1) compute ReqBWu; 2) derive the Unit Mem's
operation pattern and memory updating window (MUWu); 3) extract SSu.
For each DTL, SSu measures the relative cycle difference between "memory
updating" and "computation" (or "data consuming"), and is computed as
SSu = (XREAL − XREQ) × Z.
(Figure 6.3 annotations: cases (a)-(c) assume a double-buffered (db) memory, or a non-double-buffered (non-db) memory with a relevant (r) loop scheduled on top, so memory update M(n+1) and computation C(n) can start together; cases (d)-(f) assume a non-db memory with an irrelevant (ir) loop scheduled on top, so M(n+1) can only start after the corresponding data block is fully (re)used in C(n), which inserts a "memory update keep-out zone".)
Figure 6.3: Six different timeline cases of memory updating and computation,
showing memory-induced stall/slack for a single DTL.
Figure 6.3 visualizes SSu with six cases of memory updating and data consuming
timelines. For example, Figure 6.3(a) and (d) have SSu = 0 since XREAL = XREQ,
despite their different memory types; the same principle yields the negative SSu
(slack) of (b) and (e) and the positive SSu (stall) of (c) and (f).
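To make this per-DTL analysis concrete, the short Python sketch below computes SSu under an illustrative reading of the quantities involved: XREQ is taken as the number of cycles the computation allows for one memory update (set by MUWu), XREAL as the number of cycles that update actually needs at the port's RealBW, and Z as the number of such updates in the layer. The function name and these interpretations are assumptions for illustration, not the thesis' implementation.

    import math

    def stall_or_slack_per_dtl(bits_per_update, x_req_cc, real_bw_bits_per_cc, num_updates):
        # Illustrative per-DTL stall/slack: SSu = (X_REAL - X_REQ) * Z.
        # bits_per_update      : data that must cross this DTL per memory update
        # x_req_cc             : cycles the computation allows for one update (assumed from MUWu)
        # real_bw_bits_per_cc  : RealBW of the memory port serving this DTL
        # num_updates          : Z, the number of memory updates in the layer
        x_real_cc = math.ceil(bits_per_update / real_bw_bits_per_cc)
        return (x_real_cc - x_req_cc) * num_updates  # > 0: stall, < 0: slack

    # Example: 128-bit updates through a 32 bit/cycle port, 2 allowed cycles per update
    print(stall_or_slack_per_dtl(128, 2, 32, 10))  # (4 - 2) * 10 = 20 stall cycles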
With the basic attributes ReqBWu, MUWu and SSu obtained for each DTL in
Step 1, we can derive ReqBWcomb, MUWcomb and SScomb for DTLs that share
the same physical memory port and for DTLs that serve the same memory.
For n DTLs that share one physical memory port, ReqBWcomb is the sum of all
n DTLs' ReqBWu on that port, with read and write distinguished, and MUWcomb
is the union of their MUWu.
Derive SScomb
SScomb = Σ_{i=1}^{m−1} SSu(i) + max{ 0, Σ_{i=1}^{n} [ MUWu(i) + f(SSu(i)) ] − MUWcomb }        (6.1)
where SSu(i) is the SSu of the ith DTL, MUWu(i) is the MUWu of the ith DTL,
and f(x) = x when x ≤ 0, otherwise f(x) = 0. Eq. (6.1) states that the combined
stall is the sum of the stalls already introduced by individual DTLs (positive SSu),
plus the stall that can arise from combining the slacks of the other DTLs due to
hardware resource contention. Regarding the latter term: share-port DTLs that do
not introduce stall individually can still introduce stall when combined. This
possible combined stall is the sum of all share-port DTLs' actual working cycles
within each of their MUWu, minus the maximal allowed memory updating window
(MUWcomb). This value can be positive (stall is introduced) or non-positive (no
stall). Only if it is positive is it added to the sum of the positive SSu to obtain
SScomb. This ensures that the stall (+) induced by individual DTLs is not
cancelled by other DTLs' slack (−) during combination.
The next step is to further combine the SS of the DTLs serving the same
memory. The final SScomb is the maximal value either out of their SSu (e.g.,
max(SSu ⑫, SSu ⑪) in Figure 6.2(b)) or out of the already combined SScomb
values (e.g., max(SScomb ①⑥, SScomb ②⑦) in Figure 6.2(b)).
A detailed example
SSoverall accounts for the parallel memory operation as well as multiple stall
sources across all memory levels. For the memory operations that can be
overlapped, SSoverall takes the maximum of their SScomb, i.e., the shorter stall of
one memory can be hidden under the longer stall of another; otherwise, SSoverall
is the sum of all stalls, indicating that one memory's stall blocks the operation of
the other memories, regardless of whether their data are ready. Users can
customize this memory parallel operation constraint based on the design.
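A Python sketch of this final integration step is given below. It assumes, as one possible user-defined constraint, that the memories are grouped such that stalls within a group can overlap while stalls of different groups cannot; the function name and the grouping representation are illustrative.

    def integrate_overall_stall(ss_comb_groups):
        # ss_comb_groups: list of groups; each group holds the SScomb values of
        # memories whose stalls can be hidden under each other (combine with max);
        # groups that cannot overlap block each other, so their stalls add up (sum).
        return sum(max(group) for group in ss_comb_groups)

    # Example: two memories with overlappable stalls (8 and 3 cycles) plus one
    # memory whose stall (5 cycles) cannot be hidden.
    print(integrate_overall_stall([[8, 3], [5]]))  # max(8, 3) + 5 = 13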
(Figure 6.4 panels: (a) an example hardware (partial) with W-/I-/O-Reg and a shared local buffer, and an example mapping (partial) with for-loops over C', OX, K and C; (b) the Divide step, extracting each DTL's MemCC and MemDATA; (c) the Combine step, computing SScomb = (20−8)+(8−0)+(24−12)−24 = 8 cycles for the shared local-buffer port from the per-DTL attributes.)
Figure 6.4: An example demonstration of (a)-(b) Step 1 (Divide) and (b)-(c) Step 2 (Combine) for deriving the
intermediate modeling parameters.
If the calculated SSoverall ≤ 0, we take zero as its final value since there is no
temporal stall; otherwise, SSoverall > 0 indicates that temporal stalls exist during
computation.
Based on the three steps and the data loading analysis, the system's overall latency
(CC) can be derived by summing up the ideal computation cycles (CCideal), the
data loading cycles, the spatial stall and the temporal stall (SSoverall) (refer to
Figure 6.1). The overall MAC array utilization (U) is then CCideal/CC.
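In code form this final bookkeeping is trivial; the Python sketch below assumes the data-loading cycles and the spatial stall have already been obtained from the earlier analysis, and the function name is illustrative.

    def overall_latency_and_utilization(cc_ideal, cc_loading, ss_spatial, ss_overall):
        # CC = ideal compute cycles + data loading + spatial stall + temporal stall,
        # with a negative SSoverall clipped to zero (no temporal stall).
        cc = cc_ideal + cc_loading + ss_spatial + max(0, ss_overall)
        return cc, cc_ideal / cc

    cc, u = overall_latency_and_utilization(cc_ideal=1000, cc_loading=50, ss_spatial=100, ss_overall=150)
    print(cc, round(u, 3))  # 1300 0.769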
This uniform latency modeling enables DSE for various memory systems and
mappings. It can also provide insights into identifying performance bottlenecks
and optimization opportunities, as demonstrated in Section 6.5.
6.4 Validation
We validate the proposed latency model using an in-house DNN accelerator [151]
implemented in TSMC 7nm technology [176], designed for INT8-based inference
tasks. For convolution layers, the Im2Col operation (unrolling convolutions into
matrix-matrix multiplications) is performed by a RISC-V core before processing
on the accelerator. As shown in Figure 6.5(a), this accelerator employs a systolic-
array-based design with 1K MAC units in a 16×32 PE array (2 MACs per PE)
and one 24b Output register per PE. Each MAC connects to one 8b Weight
and one 8b Input register. Local buffers (LB) of 32KB and 64KB, with 256b and
512b bus connections to the PE array, are used for temporary storage of Weights
and Inputs, respectively. A 1MB global buffer (GB), tiled as 16 64KB SRAM
macros, is used. The mapping schemes are shown in Figure 6.5(b).
We feed the latency model with the above hardware configuration. Figure 6.5(c)
shows the comparison of the modeled results with the hardware simulation,
running NN layers (with different parameter sizes) of a hand-tracking
workload [169]. For all evaluated NN layers, an average of 94.3% latency
estimation accuracy is achieved.
6.5 Case studies
In this section, we present three case studies to demonstrate how the enhanced
latency model can be used to optimize the AHM design space. We integrate our
model with ZigZag [109], a DNN accelerator architecture-and-mapping DSE
(Figure 6.5 panels: (a) block diagram of the in-house DNN accelerator; (b) the temporal and spatial mapping after Im2Col — C first for O-Reg data reuse, B second for W-LB data reuse, K third for I-GB data reuse; (c) per-layer latency [cycles] and estimation accuracy [%] over the 10 NN layer indices, with each layer's dimensions listed.)
Figure 6.5: (a) Block diagram of the in-house DNN accelerator. (b) Temporal
mapping and spatial unrolling schemes after Im2Col. (c) Model validation
against the hardware RTL simulation running NN layers of different sizes.
framework, to generate various design points. For Case 1 and 2, the hardware
architecture is fixed to a scaled-down version of the in-house accelerator with an
8×16 PE array (2 MACs per PE, i.e., 16×16 MACs), a 16KB Weight local buffer
(W-LB), an 8KB Input local buffer (I-LB), a 1MB global buffer (GB) with
128 bit/cycle read/write BW, and a spatial loop unrolling of K 16 | B 8 | C 2. For
Case 3, we perform architecture DSE by varying the hardware parameters. The
Im2Col layer transformation is applied in all the case studies.
Different mappings lead to distinct latencies for the same DNN layer processed
on the same hardware. Figure 6.6 compares two different temporal mapping
schemes: Mapping A and Mapping B, out of 30240 valid mappings obtained
(Figure 6.6 panels: (a)-(e) comparison of Mapping A and Mapping B in terms of latency (−30%), energy [pJ] (+5%), and I-LB/GB vs. O-Reg/GB traffic; (f) memory BW, physical RealBW [bit/cc] vs. mapping-required ReqBW for Mapping A and B — ReqBW largely exceeding RealBW together with frequent GB accesses induces a large stall.)
Figure 6.6: Case study 1: Mapping’s difference analysis and its impact on
latency.
of O’s data reuse loops (C loops) to the GB level (i.e., besides final Outputs,
Partial Sums also need to be transferred between O-Reg and GB), as shown in
Figure 6.6(e). Note that W’s data reuse distribution across memory levels in
these two mappings are the same.
These differences in I and O data transfer cause different temporal stalls due to
the insufficient GB BW compared to the required ones, shown in Figure 6.6(f).
Although Mapping A and B both exceed the hardware's RealBW for GB write
(e.g., 3072 vs. 128 bit/cycle), the stall is smaller for Mapping B since its GB
writes are less frequent (Figure 6.6(e)). In addition, the Partial Sum transfer in
Mapping A requires a much higher GB read BW that cannot be met by the
hardware's RealBW. Thus, our memory-BW-aware latency model reveals the high
SSoverall in Mapping A that deteriorates MAC array utilization and system
performance.
This analysis clearly shows that a good latency model is instrumental in helping
a DNN accelerator DSE mapper minimize SSoverall by 1) matching ReqBW
(mapping-dependent) with RealBW (hardware-dependent), or 2) if RealBW is too
low to match, reducing frequent accesses over the low-BW link (e.g., reducing
partial sum transfers in this case study).
DNN layer parameters largely impact execution time. In this case study, we use
the same hardware parameters as in Case 1 and analyze the latency impact of
the layer attributes (e.g., total MAC operation count, operand sizes). Figure 6.7(a)
shows each operand's percentage (W/I/O) and the total MAC operation count
when varying the DNN layer dimensions B/K/C from 8 to 512. Figure 6.7(b)
shows the corresponding modeled Real latency and its breakdown into data
pre-loading, ideal compute cycles, spatial stall, and temporal stall (SSoverall), as
defined in Figure 6.1.
Comparing Figure 6.7(a) and (b), the Ideal latency tracks the total MAC operation
count, whereas the Real latency follows the total data size. The former is intuitive
since it assumes 100% MAC array utilization with zero stall, whereas the latter
reveals the data movement bottleneck. Since the evaluated hardware has limited
GB BW (Figure 6.6(f)), a fully output-stationary dataflow at the O-Reg level was
always selected to minimize stall by reducing GB accesses. When the layer is
Output-dominant (large B and K, small C), its total data size increases compared
to other layers with the same total MAC operation count due to the 24-bit O
precision (vs. 8-bit I and W), while at the same time the lower input channel
count C results in less output stationarity. This increases the pressure on the GB
write BW, causing the Real latency to deviate much more from the Ideal latency.
For larger layer sizes (large C), the Ideal computation cycles (green bars) dominate,
and the deviation between Ideal and Real latency reduces.
Note that without including temporal stalls (i.e., the cyan dotted line), large
discrepancies in latency estimation occur (e.g., 7.4× for layer (128,128,8) and
9.2× for layer (512,512,8)) (Figure 6.7(b)), especially for layers with fewer input
channels C.
Figure 6.7: Case study 2: Workload’s impact on latency and latency breakdown.
Previous case studies have shown the importance of the proposed latency
model for evaluating the mapping and algorithm impact on a fixed hardware
architecture. In this case study, we further take advantage of the model’s
generality and memory-BW-awareness to assess the impact of different hardware
architecture parameters (e.g., MAC array size, memory capacity and memory
BW), and showcase how the design space changes with the presence of temporal
stalls (SSoverall). We choose the following MAC array sizes and scale the spatial
mapping accordingly: 16×16 (spatial mapping K 16 | B 8 | C 2), 32×32
(K 32 | B 16 | C 2), and 64×64 (K 64 | B 32 | C 2). We construct a memory
pool containing tens of register/memory candidates with different capacities to
replace the W-/I-/O-Reg, W-/I-LB in the design space search. The GB size is
1MB for all the cases, where GB BW varies from 128 to 1024 bit/cycle. The
area of GB is not included in the comparison.
Figure 6.8 illustrates the latency-area design space for 4,176 hardware designs.
Different MAC array sizes are shown in different colors, while dots that share
the same color vary in memory hierarchy. For each design point, mapping
optimization for lowest latency is performed.
Figure 6.8(a) first shows the results using a memory-BW-unaware latency model.
Since the SSoverall induced by limited memory BW is ignored, all the architectures
with the same array size achieve similar latency. Hence the minimum-area design
(i.e., with less memory) could be considered optimal (closest to the preferred
corner), since larger memory capacity does not offer latency benefits but adds
area cost.
However, the conclusion changes once the memory BW impact is included.
Figure 6.8(b) and (c) show the design spaces obtained with our proposed model
for a GB BW of 128 bit/cycle (low BW) and 1024 bit/cycle (high BW), respectively,
with the optimal design points highlighted. For both high and low GB BW,
different memory size combinations at the Register and Local Buffer levels impact
the area-latency trade-off for a fixed MAC array size (i.e., the same theoretical
peak performance). For example, the 16×16, 32×32 and 64×64 arrays each achieve
their lowest latency at a moderate memory area cost for 128 bit/cycle GB BW.
Only when the GB BW is high do the design points of the same array size cluster
around a similar latency, indicating a smaller latency impact from improving local
memory storage and data reuse. This reveals the impact of SSoverall in BW-limited
systems, where the memory hierarchy needs to be optimized to maximize data
reuse below the bottleneck memory level for stall reduction.
6.6 Conclusion
7 DeFiNES: Exploring the Depth-First Scheduling Space for DNN Accelerators
After formalizing this design space, this chapter proposes a unified modeling
framework, DeFiNES, for layer-by-layer and DF scheduling to fill the gaps.
DeFiNES enables analytically estimating the hardware cost for possible schedules
in terms of both energy and latency, while considering data access at every
memory level. This is done for each schedule and hardware architecture under
study by optimally choosing the active part of the memory hierarchy per unique
combination of operand, layer, and feature map tile. The hardware costs are
estimated, taking into account both data computation and data copy phases.
The analytical cost model is validated against measured data from a taped-out
DF DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-
end neural network level. A comparison with generalized SotA demonstrates
up to 10× better solutions found with DeFiNES.
(Figure 7.1(c), depth-first (DF) / layer fusion: each layer is split into small tiles T1..Tn that are processed across layers depth-first — for Tile in T1 to Tn: for Layer in L1 to L3: compute Layer-Tile — so that a layer-tile's outputs can stay on-chip in the GB or LB instead of going to DRAM. In this Figure 7.1 example, all memory levels are assumed to be shared by W/I/O.)
While prior frameworks that explore DF scheduling exist [86, 180, 189, 172, 19],
they are all limited in one or more of the following aspects:
• They model only part of the hardware cost, e.g., only latency or only DRAM
accesses, and ignore other relevant costs;
• They do not consider an on-chip multi-level memory hierarchy, and only
distinguish between on-chip and off-chip memory;
• Section 7.2 identifies the full design space of DF scheduling, which also
includes SL and LBL by regarding them as two extreme points in the DF
design space.
• Section 7.3 presents a Unified Analytical Cost Model that has none of
the aforementioned limitations.
• Section 7.4 validates the proposed cost model against a taped-out DF-
style accelerator.
• Section 7.5 conducts three case studies based on the model, studying
the trade-offs between different DF schedules, and the impact of workload
and hardware architecture on the best DF strategy.
• Section 7.6 compares DeFiNES against SotA frameworks, showing up
to 10× better results by including the cost of on-chip memory accesses and
weight-induced accesses in the exploration.
• Section 7.7 concludes the chapter.
7.2 Depth-first design space identification
This section describes the DF design space with three axes, using the well-
understood LBL inference as a starting point.
Consider processing multiple layers of a network, as in Figure 7.2(a)&(b). One
can calculate the final output feature map in one go, for which the complete
input of the last layer is required. This in turn requires the complete output
of the second to last layer, and so on. Ultimately, this leads to LBL inference,
which completely executes each of the layers one at a time starting from the
first layer.
Alternatively, one can target computing only a part of the final output
feature map. In this case, only parts of the input feature maps are needed, as
(Figure 7.2 panels: (a) workload and legend — Layer 1: K=3, C=1, OX=OY=8, FX=FY=3; Layer 3: K=9, C=6, OX=OY=4, FX=FY=3; the legend marks data being used/generated, data that will be used/generated, fully used/generated data, new to-cache data, and data cached for H or V reuse; (b) LBL (1 tile per layer); (c) tile size 2×2; (d) tile size 1×1; tile size decreasing from big to small.)
Figure 7.2: DF design space’s first axis: Tile size. For layer dimension
notation in (a): K is for output channel; C is for input channel; OX and OY
are feature map spatial dimensions; FX and FY are weight spatial dimensions.
(Figure 7.3 panels (a)-(c): for each overlap storing mode, the 1st tile and a regime tile of Layers 2 and 3 are shown.)
Figure 7.3: DF design space’s second axis: Overlap storing mode. Workload
is Layer 2 and 3 in Figure 7.2(a); Legend is shared with Figure 7.2(a).
shown in Figure 7.2(c). Inference starts at the first layer, yet only that tile of
its output feature map that contributes to the target tile in the final output
feature map is calculated. It is then propagated throughout the other layers to
compute the target tile in the final feature map.
(Figure 7.4 panels: (a) SL (1 layer per stack); (b) fuse shallower, tile coarser; (c) fuse shallower, tile finer; (d) fuse deeper, tile coarser; (e) fuse deeper, tile finer. Trade-off: tiling finer means less per-stack activation (+) but less local-memory weight reuse (−); fusing deeper means more per-stack weight (−) but less between-stack activation traffic (+).)
Figure 7.4: Impact of tile size (first axis) and fuse depth (third axis). ST: fused-layer STack.
This illustrates the first axis in the design space: the choice of tile size, by
which we mean the size of the last layer’s portion that we want to compute
atomically. The general trade-off of tile size selection is given in Figure 7.4
(subfigures (b)↔(c) or (d)↔(e)). Choosing a larger/coarser tile size enhances
local weight reuse but requires more features to be passed between layers at
once, which may require a higher level memory.
Note that in this chapter, 1) we assume the computation order over tiles is
left-to-right, then top-to-bottom and 2) cross-layer tiling is only done across
the spatial dimensions (horizontal and vertical dimensions) of the feature maps.
It is not done across the channel dimensions because in most convolution layers
all input channels are required to calculate any output feature, which makes
cross-layer tiling across the channel dimensions impossible. However, intra-tile
temporal mappings can still have loop tiling over all the dimensions within that
tile, including the channel dimensions.
Because neighboring tiles of the output feature map can require overlapping parts
of earlier feature maps, one can choose either to recompute those overlapped
features, or to cache them in some memory in order to reuse them across
tiles, as shown in Figure 7.3. This choice can be made separately for both
spatial dimensions and forms the second axis. It has four modes: fully-recompute
(Figure 7.3(a)), horizontally-cached with vertical recompute (Figure 7.3(b)),
vertically-cached with horizontal recompute, and fully-cached (Figure 7.3(c)). In
this chapter, we do not further consider vertically-cached with horizontal
recompute: since transposing both the feature maps and, correspondingly, the
weights results in the same, yet transposed, outputs, vertically-cached with
horizontal recompute and horizontally-cached with vertical recompute are
fundamentally the same. Choosing caching over recompute requires extra memory
space to store the cached data. However, it decreases the recomputation overhead
and the tile size in earlier layers, as Figure 7.3 shows.
So far, this section discussed the scheduling options within one stack of fused
layers. The final, third axis is the choice of which layers are fused into
a stack. Fusing more layers generally requires more low-level weight memory
capacity but saves accesses to higher-level memories for activations. This can be
seen in Figure 7.4 by comparing subfigures (b) vs. (d), or (c) vs. (e). Because
increasing the capacity of the lower-level memories decreases their access
efficiency, the benefit of the lower-level memory can vanish if one fuses too many
layers.
Note that LBL inference and SL can be positioned in this design space. On the
first axis, the tile size can be set equal to the DNN’s final output feature map
(Figure 7.2(a)) to get a schedule that is effectively LBL. There is only one stack
and it executes each layer completely before moving on to the next. One can also
choose to have only one layer in every stack (the third axis) (Figure 7.4(a)),
which leads to a SL schedule as we assume features are passed between stacks
through the highest memory level. The second axis has no impact in these cases
(LBL and SL) as there is only one tile and thus no overlap between tiles.
7.3 Unified analytical cost model
This section describes the Unified Analytical Cost Model presented in this
chapter, capable of predicting the inference cost (energy and latency) of DNNs on
a given hardware architecture, with support for the full design space of Section
7.2. An overview of the model is depicted in Figure 7.5.
The basic idea is to use an existing single-layer mapping search engine and cost
model that optimize and predict costs for a single layer (step 5 below). Because
of their single-layer limitation, these tools assume that every layer's input and
output feature maps need to come from and go to the highest-level input and
output memories, respectively. DeFiNES therefore provides the Unified Analytical
Cost Model as a layer on top of this to provide DF compatibility, which it
achieves with the following steps:
Inputs: The inputs consist of the workload, the hardware architecture and the
DF parameters. The workload is a neural network which may have convolution
layers, branches, pooling layers, strides, depthwise layers, etc. The hardware
architecture consists of an array of Processing Elements (PEs) and a memory
hierarchy. The latter can have memories that are shared between operands
(inputs, outputs and weights), different numbers of levels for different operands,
and memories that are unrolled over one or more dimensions of the PE array.
The final input consists of the DF parameters, which identify a point in the
design space of Section 7.2, dubbed the 'DF strategy'. The fuse depth, i.e., the
number of layers to fuse together for each stack (third axis), can be given
manually or determined automatically. In the latter case, layers are added to
the fused stack as long as the total weight size of the stack fits in the highest
on-chip memory level that holds weights. In the presence of branching, either all
layers between two branch-free points are added to a stack, or none of them. If
such a set of layers by itself does not fit in the highest on-chip weight memory
level, none of the layers in this set are fused; in other words, each of them forms
a 1-layer stack.
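A minimal Python sketch of this automatic fuse-depth heuristic is given below. It assumes each branch-free segment of the network is given as a list of per-layer weight sizes in bytes and that w_capacity is the capacity of the highest on-chip memory level holding weights; the function name and data representation are illustrative, not DeFiNES' actual code.

    def build_stacks(segments, w_capacity):
        # Greedily fuse whole branch-free segments into the current stack as long as
        # the stack's total weight size still fits in the on-chip weight memory.
        stacks, current, current_w = [], [], 0
        for seg in segments:
            seg_w = sum(seg)
            if seg_w > w_capacity:                 # segment alone does not fit:
                if current:
                    stacks.append(current)
                    current, current_w = [], 0
                stacks.extend([[w] for w in seg])  # each of its layers becomes a 1-layer stack
            elif current_w + seg_w <= w_capacity:
                current, current_w = current + seg, current_w + seg_w
            else:                                  # close the current stack, start a new one
                stacks.append(current)
                current, current_w = list(seg), seg_w
        if current:
            stacks.append(current)
        return stacks

    # Three branch-free segments (per-layer weight sizes in bytes), 32KB weight buffer:
    print(build_stacks([[4000, 6000], [10000], [40000, 8000]], 32 * 1024))
    # [[4000, 6000, 10000], [40000], [8000]]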
With the stacks of fused layers derived from the workload, hardware, and DF
parameters, steps 1)-6) are done per stack.
1) Tile the stack's output (for each stack): Given a stack, the output feature
map of its last layer is divided into tiles according to the selected tile size and
overlap storing mode.
(Figure 7.5 content: the inputs are the workload, the hardware architecture — any-dimensional PE array of any size, optional inner-PE register file, different inter-PE interconnect (data sharing) patterns, and a multi-level memory hierarchy with different operand memory sharing schemes — and the DF parameters (tile size, overlap storing mode, fuse depth*). The depth-first cost model repeats, for all stacks: 1) tile the stack's output; 2) backcalculate tile sizes and calculate the size of data to be cached; 3) determine and set top-level memories per data type (current layer I, current layer O, cached data for H/V reuse); 4)-5) invoke the data copy action cost model and the single-layer cost model (ZigZag). The outputs are the energy and latency for each stack.)
Figure 7.5: DeFiNES’ overview. (*: optional input, can be set automatically.)
(Figure 7.6 content: with a tile size of (60,72), the 960×540 final output feature map (960 = 60×16, 540 = 72×7 + 36) results in 3 tile types: tile type 1 occurs once, tile type 2 occurs 15 times, and tile type 3 repeats 112 times.)
Figure 7.6: Tile type count for different tile sizes and overlap storing modes.
The workload used in this example is FSRCNN [44], whose final output feature
map's spatial dimension is 960×540. The 3-tile-type example is further used in
Figure 7.9 and Figure 7.10.
(Model run times for the Figure 7.6 example: the fully-recompute / H-cached V-recompute / fully-cached cases with a tile size of (60,72) took 23 / 34 / 84 seconds, respectively, on 1 thread of an Intel Xeon Processor E3-1270 v5.)
2) Backcalculate tile sizes and calculate the size of data to be cached (for each
tile in each layer): From the tile size of the last output feature map in
the stack, the required tile size of the last layer’s input is calculated. Next, the
‘to-compute’ tile size of the previous layer is calculated. Without caching for
reuse, this simply equals the required tile size of the last layer’s input. However,
with caching for reuse across tiles, not all these features need to be calculated as
some can be fetched from the cached data, as can be seen in Figure 7.3(b)&(c).
This process is repeated for all layers in the stack, and as such the input tile
size and the to-compute output tile size of each layer in the stack are determined.
During this process of backcalculation, the algorithm also keeps track of how
much data (from earlier or for future overlapping tiles) of each type in Figure 7.7
should be cached. In case of branching, this is handled as in Figure 7.8.
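For a standard convolution, a Python sketch of this back-calculation along one spatial dimension could look as follows; it assumes unit dilation and ignores feature-map border effects, and the parameter `cached` (the number of already-cached input rows or columns reusable for the current tile) as well as the function names are illustrative.

    def required_input_tile(out_tile, kernel, stride):
        # Input extent needed to produce out_tile outputs along one spatial dimension.
        return (out_tile - 1) * stride + kernel

    def to_compute_tile(out_tile, kernel, stride, cached=0):
        # 'To-compute' tile after subtracting the features fetched from the cache.
        return max(0, required_input_tile(out_tile, kernel, stride) - cached)

    # Back-calculate through a stack of two 3x3, stride-1 layers for a 4-wide output tile:
    mid_tile = required_input_tile(4, kernel=3, stride=1)        # 6 features of the middle map
    in_tile = required_input_tile(mid_tile, kernel=3, stride=1)  # 8 features of the input map
    print(mid_tile, in_tile)                                     # 6 8
    print(to_compute_tile(4, kernel=3, stride=1, cached=2))      # 4 features actually computed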
In the shown example, the left and right branch need cached features from
different places in the feature map. In such a case, the overall region of features
to be cached is set by combining all outermost edges of the to-cache regions,
so that all branches always have the cached features they need to operate in
overlap caching mode, as can be seen in the middle, combined visualization of
FM1 in Figure 7.8.
3) Determine and set top level memories (for each tile in each layer):
Given the data sizes calculated in step 2, step 3 determines the highest memory
level each type of data (layer inputs, layer outputs, and cached data for H-
cached and/or V-cached modes) should be stored in. In these decisions, data is
prioritized as in Figure 7.5(3), with higher-priority data assigned to the lower,
more efficient memory levels.
Note that the top memory level assigned to different data types can differ
between tiles and layers. Figure 7.9 gives an example of this for a stack in
fully-recompute mode, based on the data sizes from Figure 7.10.
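A Python sketch of this assignment is shown below, simplified to the layer's input and output data and assuming the priority order and memory capacities are given; the real model additionally handles cached data and weights as in Figure 7.5, and the function name and level names are illustrative.

    def assign_top_levels(data_sizes, levels):
        # data_sizes: dict data_type -> size in bytes, ordered from high to low priority
        # levels: list of (name, capacity in bytes), ordered from the lowest (most
        #         efficient) to the highest memory level; the last one is unbounded.
        used = {name: 0 for name, _ in levels}
        assignment = {}
        for dtype, size in data_sizes.items():
            for name, cap in levels:
                if used[name] + size <= cap:      # lowest level that still has room
                    used[name] += size
                    assignment[dtype] = name
                    break
        return assignment

    # Example loosely following Figure 7.10: LB = 64KB, GB = 1MB.
    levels = [("LB", 64 * 1024), ("GB", 1024 * 1024), ("DRAM", float("inf"))]
    print(assign_top_levels({"I": 40 * 1024, "O": 48 * 1024}, levels))
    # {'I': 'LB', 'O': 'GB'}: I+O does not fit in the LB, so only I keeps the LB as its top level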
To be able to use the single-layer mapper and cost model, which assume that
inputs and outputs come from and go to the top-level memory, we remove the
memory levels above the determined top level from the hardware architecture
definition of the corresponding operands. We then give this adjusted hardware
architecture to the mapper and cost model as input, preventing them from
fetching data from or storing data to unnecessarily high memory levels.
4) Collect inputs at determined memory level (for each tile in each layer):
A single layer-tile combination can have input feature data that is located in
different memory levels, for instance, the newly created output of a previous
layer can be in a lower memory level than cached data from a previous tile.
Therefore, before calling the single-layer mapper and cost model, we model the
action of collecting these data into the single memory level that was decided to
serve as the top-level memory for inputs in step 3.
Each such data collecting action is defined as a data copy action, which is
modeled by its to-move data type and amount, the source memory level, and
the destination memory level. The cost of a data copy action is calculated by
the data copy action cost model. This model takes in 1) a list of data copy
actions (which can, in theory, happen in parallel) and 2) the hardware
architecture (with all memory port types, port connections, word lengths, and
per-word access costs defined) to analyze the energy and latency this bundle of
data copy actions costs, taking into account possible memory port conflicts
among the concurrent actions.
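A minimal Python sketch of such a data copy action and its cost is given below. It assumes, purely for illustration, known per-bit read/write energies and per-port bandwidths, and it ignores port conflicts, which the actual model resolves for concurrent actions; the class and function names are illustrative.

    from dataclasses import dataclass

    @dataclass
    class DataCopyAction:
        data_type: str   # 'W', 'I', 'O' or cached data
        amount_bit: int  # amount of data to move
        src_level: str   # source memory level
        dst_level: str   # destination memory level

    def copy_action_cost(action, rd_pj_per_bit, wr_pj_per_bit, rd_bw, wr_bw):
        # Energy: read every bit at the source and write it at the destination.
        energy = action.amount_bit * (rd_pj_per_bit[action.src_level] +
                                      wr_pj_per_bit[action.dst_level])
        # Latency: limited by the slower of the two ports involved.
        latency = max(action.amount_bit / rd_bw[action.src_level],
                      action.amount_bit / wr_bw[action.dst_level])
        return energy, latency

    # Example: copy 16 Kbit of cached inputs from the GB down to the LB.
    act = DataCopyAction("I", 16 * 1024, "GB", "LB")
    print(copy_action_cost(act, rd_pj_per_bit={"GB": 0.5}, wr_pj_per_bit={"LB": 0.1},
                           rd_bw={"GB": 256}, wr_bw={"LB": 512}))  # about 9830 pJ and 64 cycles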
5) Call single-layer mapper and cost model (for each tile in each layer):
At this point, the single layer temporal mapping search engine and cost model
are used to get the cost for a single layer-tile-combination. For this work,
(Figure 7.9 axes: tile type 1 (1 time), tile type 2 (15 times) and tile type 3 (112 times), each over layers L1-L8, with the determined top memory level (Reg / LB / GB / DRAM) shown per operand W, I and O.)
Figure 7.9: A visualization of the determined top memory level of each unique layer-
tile combination for operands W, I, and O. The DF schedule is taken from the 3-tile-type
example in Figure 7.6; the hardware architecture is Idx 2 in Table 7.1(a). It is worth noting
that: 1) for weights, all the layers of the first tile take weights from DRAM, and the other
layer-tile combinations take weights from the LB; 2) for inputs and outputs, each tile's first
layer gets its input from DRAM, each tile's last layer writes its output back to DRAM, and in
between either the GB or the LB serves as their top memory level.
(Figure 7.10 axes: activation data size [Byte] (4K to 1M, with the LB capacity of 64KB and the GB capacity of 1MB marked) per layer L1-L8 for tile types 2 and 3, showing I, O and I+O.)
Figure 7.10: A visualization of the activation data size in tile types 2 and 3 of the example
in Figure 7.9. The capacities of the LB and GB are marked on the y-axis. Figure 7.9 and
Figure 7.10 together show that 1) when the total activation size (I+O) fits into the LB
(e.g., tile type 2, L6), the LB is the top memory level for both I and O; 2) when the total
activation size (I+O) does not fit into the LB while either I or O alone does (e.g., tile type 3, L6),
I is prioritized to use the LB as its top memory level while O is pushed to the GB.
7.4 Validation
7.5 Case studies
Table 7.1 summarizes the key attributes of the different hardware architectures
and DNN workloads used in the case studies.
For hardware, five DNN accelerators are selected as the architecture baselines for
the case studies: Meta-prototype [151], TPU [81], Edge TPU [139], Ascend [97],
and Tesla NPU [157]. To make a fair and relevant comparison, we normalized
all of them to have 1024 MACs and maximally 2MB global buffer (GB) but kept
their spatial unrolling and local buffer settings (Table 7.1(a) Idx 1/3/5/7/9).
In addition, since all these architectures were originally designed for SL/LBL
processing and may thus not benefit from DF schedules as-is, we manually
constructed a DF-friendly version of each architecture, denoted with 'DF' at the
end of its name (Table 7.1(a) Idx
2/4/6/8/10). The guidelines that were followed to construct a DF-friendly
version from a SL/LBL architecture are: 1) spatial unrolling is unchanged; 2)
the total on-chip memory capacity is unchanged; 3) Input and Output activation
are preferably shared in a lower level memory; and 4) Weights should have an
Table 7.1: The 10 hardware architectures and 5 DNN workloads used in the case studies.
(a) 10 HW architectures (5 baseline designs and their DF-friendly variants), all with 1024 MACs and a global buffer (GB) of at most 2MB:
Idx 1, Meta-proto-like: spatial unrolling K 32 | C 2 | OX 4 | OY 4; Reg. per MAC W: 1B, O: 2B; LB W: 64KB, I: 32KB; GB W: 1MB, I&O: 1MB.
Idx 2, Meta-proto-like DF: same unrolling and registers; LB W: 32KB, I&O: 64KB; GB W: 1MB, I&O: 1MB.
Idx 3, TPU-like: K 32 | C 32; Reg. per MAC group W: 128B, O: 1KB; no LB; GB I&O: 2MB.
Idx 4, TPU-like DF: K 32 | C 32; Reg. W: 64B, O: 1KB; LB I&O: 64KB; GB W: 1MB, I&O: 1MB.
Idx 5, Edge-TPU-like: K 8 | C 8 | OX 4 | OY 4; Reg. W: 1B, O: 2B; LB W: 32KB; GB I&O: 2MB.
Idx 6, Edge-TPU-like DF: same unrolling and registers; LB W: 16KB, I&O: 16KB; GB W: 1MB, I&O: 1MB.
Idx 7, Ascend-like: K 16 | C 16 | OX 2 | OY 2; Reg. W: 1B, O: 2B; LB W: 64KB, I: 64KB, O: 256KB; GB W: 1MB, I&O: 1MB.
Idx 8, Ascend-like DF: same unrolling and registers; LB W: 64KB, I&O: 64KB, plus a 2nd-level I&O LB of 256KB; GB W: 1MB, I&O: 1MB.
Idx 9, Tesla-NPU-like: K 32 | OX 8 | OY 4; Reg. W: 1B, O: 4B; LB W: 1KB, I: 1KB; GB W: 1MB, I&O: 1MB.
Idx 10, Tesla-NPU-like DF: same unrolling and registers; LB W: 1KB, I: 1KB, plus a 2nd-level LB W: 64KB, I&O: 64KB; GB W: 1MB, I&O: 896KB.
(b) 5 DNN workloads (average / maximum total feature map size; total weight size):
Idx 1, FSRCNN: 10.9 MB / 28.5 MB; 15.6 KB.
Idx 2, DMCNN-VD: 24.1 MB / 26.7 MB; 651.3 KB.
Idx 3, MCCNN: 21.8 MB / 29.1 MB; 108.6 KB.
Idx 4, MobileNetV1: 760 KB / 3.8 MB; 4 MB.
Idx 5, ResNet18: 895 KB / 5.9 MB; 11 MB.
on-chip global buffer. These guidelines are heuristic-based, and we leave the
DF hardware architecture optimization problem to future work.
For hardware modeling, CACTI7 [12] is used to extract all the SRAM costs
(pJ/word access). Other hardware costs, such as Unit MAC, register, and
DRAM access costs, are scaled accordingly based on the SRAM cost, following
the scaling factors reported in [184]. All the on-chip memories' banking and
bandwidth (bit/cycle) are selected such that the PE array can get enough
data to work at full speed for an ideal workload, while the DRAM bandwidth
is fixed to 64 bit/cycle to mimic the on-/off-chip communication bottleneck.
For workload, five DNN workloads are used in the case studies: FSRCNN [44],
DMCNN-VD [154], MCCNN [171], MobileNetV1 [65] and ResNet18 [58].
Table 7.1(b) shows that FSRCNN, DMCNN-VD, and MCCNN are activation-
dominant (all the layers have large feature maps), whereas MobileNetV1 and
ResNet18 are weight-dominant (feature maps are smaller and gradually decrease
across layers).
Note that the hardware architectures picked for these case studies are not DF-
specific. This enables exploring whether or not these non-DF-specific hardware
architectures (and their variants with some memory size/sharing adjusting) can
benefit from DF scheduling on both activation- and weight-dominant DNNs.
Another thing to point out is that in DeFiNES, users can define their own
optimization target (energy, latency, EDP, any memory access count, a
combination of them, etc.). For the case studies, we prioritized energy.
This case study discusses how much DF strategies impact results when mapping
a DNN onto an accelerator, exemplified by FSRCNN and Meta-proto-like DF
(Idx 2 in Table 7.1(a)) as the targeted workload and hardware architecture.
Of the DF scheduling space's three axes (tile size, overlap storing mode, and
fuse depth), this case study focuses on exploring the first two. The third
axis, fuse depth, is fixed to the whole DNN since the total weight size of
FSRCNN is small (15.6KB, as shown in Table 7.1(b) Idx 1) and thus all weights
fit in the Meta-proto-like DF architecture's on-chip weight local buffer (32KB,
as shown in Table 7.1(a) Idx 2). So, according to the trade-off introduced in
Figure 7.4, there is no benefit in not fusing the whole DNN into one stack.
For the first two axes, we swept 110 tile sizes (different spatial dimension tile
size (Tx,Ty) combinations) for each of the three overlap storing modes. A subset
of the results is shown in Figure 7.12, in which the total energy and latency of
(Figure 7.12 panels: energy [mJ] and latency [million cycles] heatmaps over the X-dim tile size Tx ∈ {1, 4, 16, 60, 240, 960} and the Y-dim tile size Ty ∈ {1, 4, 18, 72, 270, 540}, for the fully-recompute, H-cached V-recompute, and fully-cached overlap storing modes.)
Figure 7.12: The total energy and latency for meta-proto-like DF architecture
processing FSRCNN with different DF strategies.
(Figure 7.13 axes: MAC operation count (log scale, 10^10-10^11) vs. tile size (Tx,Ty) ∈ {(1,1), (4,4), (16,18), (60,72), (240,270), (960,540)} for the fully-recompute, H-cached V-recompute, and fully-cached modes.)
Figure 7.13: MAC operation count for different DF strategies.
three overlap storing modes with different tile sizes are visualized. Note that
the feature map size of the last layer of FSRCNN is 960×540, thus all the bottom-
right blocks in each heatmap (with Tx=960 and Ty=540) correspond to LBL
processing. Their energy and latency numbers (19.1 and 29, respectively) are the
same because different overlap storing modes make no difference for LBL, as
discussed in Section 7.2.
The rest of this subsection firstly summarizes the main messages delivered
by Figure 7.12, and then uncovers the causes by using the memory access
breakdown of the different types of data of Figure 7.14.
Four major observations can be extracted from Figure 7.12: 1) Considering
different tile sizes under the same overlap storing mode, both too small and
too large tile sizes are sub-optimal. The best point is always somewhere in the
middle. 2) Considering the same tile size across different overlap storing modes,
the order of energy consumption is for most cases: fully-cached < H-cached
V-recompute < fully-recompute. 3) Different tile sizes and modes heavily impact
energy and latency (up to 26× difference for energy and 57× for latency). 4)
Fully-recompute prefers larger tile sizes than fully-cached.
To understand the reasons behind these observations, Figure 7.13 and Figure 7.14
take all the diagonal scheduling points from Figure 7.12 and plot, respectively,
their MAC operation count and their memory access count (in number of data
elements) for each memory level in the hierarchy (LB, GB, and DRAM), broken
down into the contributions of the layers' activations, weights, and data copy
actions. Figure 7.15 further shows the total energy and latency of these diagonal
scheduling points.
For layers’ activation, Figure 7.14(a) presents two clear trends. Firstly, DRAM
and GB access do not depend much on the used mode. When the tile size is small,
like (1,1), (4,4) or (16,18), there is little GB and LB memory access because
all the activations per tile can fit into LB. When the tile size is increased to a
certain point, like (60,72), the GB access suddenly increases due to activations
no longer fitting in LB and thus GB being the top activation memory level.
Further increasing the tile size until it reaches LBL (960, 540) also increases the
DRAM access, as the activations eventually no longer fit on-chip.
Figure 7.14: Memory access of different data types at different memory levels
for meta-proto-like DF architecture processing FSRCNN with different DF
strategies.
(Figure 7.15: total energy and latency (log scale) of the diagonal scheduling points for the fully-recompute, H-cached V-recompute, and fully-cached modes.)
This case study shows that different DF strategies vary widely in energy and
latency, and that DeFiNES can analyze and reason about them, taking advantage
of the unified analytical model.
This case study examines how different workloads prefer different DF strategies.
To this end, we map all five workloads of Table 7.1(b) on the meta-proto-like
hardware and compare five different inference strategies:
• Single layer: layers are completely evaluated one at a time, feature maps
are always stored to and fetched from DRAM in between layers;
• Layer-by-layer: layers are completely evaluated one at a time,
intermediate feature maps are passed on to the next layer in the lowest
memory level they fit in;
• Fully-cached DF with 4×72 tiles, which is the best found in case
study 1;
• The best strategy found when a single strategy is used for all fused
layer stacks;
• The best combination, where different stacks can use different DF
strategies.
Figure 7.16 visualizes the results, which show some noteworthy findings. Firstly,
for the workloads with spatially large feature maps (FSRCNN, DMCNN-VD
and MCCNN), their individual best solutions (purple) are not significantly
better than the best solution found in case study 1 (green). The latter is thus a
very good solution across a range of workloads similar to the one it was found
for, with a gain of 10× compared to SL.
Secondly, this solution does not perform as well on MobileNetV1 and ResNet18,
which operate on spatially smaller feature maps with more channels. On
MobileNetV1 for instance, it is 2.0× worse than the best found result. In these
workloads, the deeper layers are more weight-dominant, which impedes fusing
them into one stack. Hence, the combined best solution applies DF to the first,
activation-dominant layers and LBL to the last, weight-dominant layers. This
combination achieves a gain of 5.7× over SL on MobileNetV1.
Figure 7.16: Case study 2: Different workloads lead to different best solutions
(all results on meta-proto-like DF hardware).
This case study examines the effect of the accelerator architecture on the
optimal inference strategy. In particular, it compares the default accelerator
architectures, which were designed with LBL inference in mind, against the
manually adjusted DF-friendly variants, by looking at the geometric average of
performance across all five workloads of Table 7.1(b), using both LBL and the
best single DF strategy (optimized for energy).
The results in Figure 7.17 show that DF outperforms LBL on all accelerator
architectures except for TPU-like, including the unadjusted default accelerators,
on which the maximum gain was 4.1×. TPU-like has poor support for DF
schedules due to the absence of on-chip weight buffers. With such a buffer
added in the DF-friendly variant, DF significantly outperforms LBL, indicating
the importance of designing with DF compatibility in mind. This finding is
further backed by the overall comparison between the DF-friendly and default
variants, which shows that the DF-friendly variants are at least as good as the
defaults when using DF, with large gains of 6.0× and 4.3× for the TPU-like and
Edge-TPU-like hardware respectively, and are at most 1.2% worse when using LBL.
Overall the biggest difference (in geometric mean over the five workloads)
between LBL inference on default hardware variants and DF on DF-friendly
variants is found for the Edge-TPU-like hardware and equals 4.9×.
Secondly, from the hardware modeling point of view, they only focus on
modeling/optimizing the DRAM access while ignoring the data movement
within the potential multi-level on-chip memory hierarchy. In other words,
they are agnostic of the on-chip memory hierarchy. This can cause substantial
losses, as shown by Figure 7.18(a), which presents the results of mapping FSRCNN
onto two hardware platforms in three ways: 1) Single-Layer (SL), 2) DF optimized
only for DRAM traffic, and 3) DF optimized for the overall energy (our work).
The DRAM energy contribution is highlighted by the diagonal hatching, which
shows that DRAM energy dominates in the SL case. Using DF and only optimizing
for DRAM traffic, the DRAM energy can indeed be largely reduced, but omitting
the on-chip energy (non-hatched part of the red bar) from the optimization can
make the latter dominant. Only when considering the whole system can the best
DF solutions (orange bars) be achieved. The parameters of the found solutions
(Figure 7.18, right) show
that when optimizing for the overall energy (orange), the framework found a
smaller tile size compared to optimizing for DRAM only (red). This can be
explained: 1) When optimizing for DRAM only, the tool will randomly pick
one DF schedule that makes sure all the intermediate data fit on chip, and
thus DRAM access is minimized. However, after achieving the minimal DRAM
access, there is still a lot of room for on-chip data traffic optimization, which is
overlooked in this case. 2) When optimizing for the overall energy, it benefits
from smaller tile sizes since at a certain point, not only can all the data of
(Figure 7.18 panels (a)-(d): energy (and latency) comparisons on the Meta-proto-like DF and Edge-TPU-like DF hardware, with the tile sizes of the found solutions (e.g., 4×72, 60×135, 30×135, 4×18, 2×2, 14×28, 28×28) and relative gains (e.g., 5.64×, 5.46×, 2.34×, −18%, −17%, +9%, +6%) annotated; panel (b) considers one stack and panel (d) all stacks; horizontally, (c) and (a) ((d) and (b)) share the same y-axis label and scale.)
intermediate tiles fit in the on-chip GB, but they also fit in the LB. In this case,
the activations can be fully reused in the LB, and the GB access is minimized
(on top of the already minimized DRAM access), resulting in a 5.64× energy
gain for FSRCNN on the Meta-proto-like DF hardware.
Thirdly, on top of modeling on-chip data traffic, we further evaluated the
benefit of multi-level memory skipping over DRAM-only skipping, i.e., skipping
(multiple) upper (on-chip) memory level(s) when writing back the outputs of
intermediate tiles if they fully fit in lower-level memories. Around 17%-18%
energy gain is observed for the tested workload-hardware combinations, as shown
in Figure 7.18(b). Because this step targets optimizing the on-chip memory
energy, the gain is not very significant when the MAC energy and the (already
minimized) DRAM energy are dominant, which is the case here.
This technique can bring larger gains for systems with more dominant on-chip
data traffic.
Fourthly, most DF hardware implementations and exploration frameworks
show the energy, latency, and/or DRAM access gains that come from activation
tiling, but do not say much about the potentially higher weight energy cost
due to the loss of local weight data reuse. This can be harmful for the overall
system efficiency, as shown by the example of Figure 7.18(c). The energy portion
caused by memory access for activations, highlighted with square hatching,
contributes most of the energy in the SL case. However, just blindly optimizing
for activations while ignoring the weights results in the green bars. While
these indeed have minimal energy caused by activations, the energy caused by
weights’ memory accesses dominates and causes a large penalty (non-hatched
part in the green bars). This is because the tool found very small tile sizes
as its best solution when only optimizing for activations. This lets activations
skip higher-level memories as much as possible, but at the same time largely
reduces the low-level memories' weight data reuse, thus triggering more accesses
to higher-level weight memories. So, only when both the benefits and
drawbacks of tiling are considered can the best DF solution (orange bars)
be achieved. For the given example, taking weights into account achieves a
solution that has 2.34× and 10.2× less energy than the solution found by
only considering activations for the meta-proto-like DF and Edge-TPU-like DF
hardware architectures, respectively.
Lastly, different frameworks have different optimizing targets, as shown in the
last column of Table 7.2: some of the frameworks only evaluate latency while
ignoring energy, whereas some only care about optimizing the DRAM access.
As DRAM-only optimization’s downsides have been explained, here we focus on
comparing latency- and energy-optimized solutions. Figure 7.18(d)
shows the results: pink/orange bars are the energy (and the corresponding dots
are latency) of our latency-/energy-optimized DF schedules respectively. In
this example, a clear latency-energy trade-off is presented and the best found
DF solution shows that the energy-optimized DF schedule prefers a smaller tile
size than the latency-optimized one. This is because smaller tile sizes, on the
one hand, help reduce energy by enabling the skipping of more memory levels
while, on the other hand, they increase the data preparation (loading and
offloading) cycle overhead.
To summarize, our work models the complete DF design space with support for
detailed analysis of both activations and weights across the on- and off-chip memory hierarchy, so as
to better capture the trade-offs between different DF strategies and optimizing
targets. These properties enable DeFiNES to make the overall best choices
without neglecting factors that may turn out to be important otherwise. This
makes DeFiNES a good addition to the previously mentioned optimization-
oriented frameworks [180, 189, 172, 19, 86]. Together with those, we can better
design and schedule DNN accelerators.
7.7 Conclusion
This chapter first presented a definition of the DF design space, and then a cost
model capable of handling this whole design space. Furthermore, the cost model
does not restrict itself to DRAM accesses or activation-related memory accesses,
but also covers the full on-chip memory hierarchy and the memory accesses caused
by weight traffic. Large gains might be missed when not doing so (up to 10.2× in the
shown examples; Figure 7.18(c)).
Using this model, the case studies showed that DF strategies can significantly
outperform layer-by-layer execution, even when the workload is not activation-
dominant (MobileNetV1 and ResNet18), and even when the hardware is not
designed for it: DF strategies outperformed layer-by-layer on four of the
five tested hardware architectures with gains of up to 4.1×. However, some
architectures may be ill-suited for DF, in which case small adjustments to their
design can lead to large improvements. For instance, reassigning some of the
on-chip memory capacity of the TPU-like architecture enabled it to greatly
benefit from DF strategies, outperforming its default variant by 6×. These
examples show how DeFiNES allows us to quickly examine the complex design
space of different combinations of DF strategies and hardware architectures.
Although DeFiNES enables a fast evaluation of combinations of DNNs, DF
strategies, and hardware architectures, its capabilities could still be expanded
in future work. Given the huge DF scheduling space (different tile sizes, fused
depth, data storage modes, etc.)² and the complex trade-offs in between (e.g.,
energy vs. latency), it is hardly possible to intuitively pinpoint the optimal
schedule, nor to exhaustively try out all schedules to locate the optimal one, even
with DeFiNES’ fast cost estimation. Thus, a clever search engine capable of
efficiently exploring the DF scheduling space and the hardware architecture
design space would be a good future addition to this work.
So far, we have discussed two DSE frameworks, ZigZag and DeFiNES, both
targeting single-core accelerator systems with purely analytical modeling
approaches. In the next chapter, we will move to multi-core accelerator modeling
and DSE with the third framework, Stream.
² Note that, for the purpose of unifying the scheduling space, we also regard the layer-by-layer
schedule as part of the DF scheduling space, with a tile size equal to the complete feature map
and a fused stack equal to the complete neural network, as explained in Section 7.2.
Chapter 8
Stream: Modeling
Fine-grained Layer Fusion on
Multi-core DNN Accelerators
This chapter explains Stream, a DSE framework that supports exploring multi-core
DNN accelerators with fine-grained layer-fusion scheduling.
To keep up with the ever-growing performance demand of DNN processing while
accommodating the increasing model diversity, specialized hardware accelerators
are shifting towards multi-core architectures. Stream is the first open-source
DSE framework for co-optimization of hardware architecture and fine-grained
scheduling of such multi-core DNN accelerators. Stream supports fine-grained
layer fusion to optimally trade off energy, latency, and/or on-chip memory
footprint for constrained edge devices.
Validation against three SotA chips, together with a case study on seven
hardware architectures with different scheduling granularities, demonstrates
the reliability and capabilities of Stream. Results show that high-level
architectural decisions greatly impact hardware efficiency under the fine-grained
scheduling paradigm, reducing the energy-delay product from 2.4× for single-
This chapter is based on [152] and contains large fractions of it. The author’s contributions
include (but not limited to) the computation node and graph representation, part of the
implementation, SotA comparison, and paper writing.
Stream is open source at https://github.com/KULeuven-MICAS/stream.
[Figure 8.1: Scheduling taxonomy: (1) layer-by-layer on a single core (SotA frameworks: Timeloop, ZigZag, …), (2) layer-fused on a single core (SotA frameworks: TVM-Cascade, DeFiNES, …), (3) layer-by-layer on a multi-core system (SotA frameworks: Kwon et al., Stream), and (4) layer-fused on a multi-core system (Stream).]
To overcome the previous drawbacks, several works have investigated more fine-
grained scheduling strategies of deeply fused DNN layers [5, 162, 53]. From these
works, it is clear that such "layer-fused" (a.k.a. "depth-first") scheduling can
bring significant advantages for latency- and resource-constrained inference at the
edge: reducing the memory footprint of intermediate results, alleviating costly
off-chip accesses, and providing more parallelization opportunities. However, the
SotA layer-fused schedulers work solely for their specialized hardware platform
and dataflows [54, 164, 103, 34]. This makes the general assessment of high-level
architectural and scheduling decisions difficult. Furthermore, the fine-grained
scheduling of layer-fused DNNs onto multi-core hardware architectures has been
largely unexplored (Figure 8.1(c)(4)).
This chapter, therefore, provides Stream, a general exploration framework of
heterogeneous multi-core hardware architectures with fine-grained scheduling of
layer-fused DNNs. This chapter is organized as follows:
Layer Allocation
The recent trend towards multi-core systems enlarges the traditional mapping
space. Each layer must first be allocated to a (set of) core(s). Allocating a
modern DNN, which can easily consist of 50 layers, onto e.g. a quad-core
architecture yields O(10^30) possible layer allocations. Kwon et al. [92] explore
this allocation space through a heuristics-based allocator, but do not account
for the inter-core communication cost.
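As a quick back-of-the-envelope check of that order of magnitude (assuming each of roughly 50 layers can be assigned independently to any of the four cores):

    4^50 = 2^100 ≈ 1.3 × 10^30 possible layer-core allocations.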
Figure 8.2: (a) Multi-core architecture model (cores C0–C3, DRAM port, communication bus, DRAM). (b) Example core (a TPU-like dataflow accelerator) with its specific dataflow, PE array, local buffers, global buffer and accumulator, connected to the off-chip memory port and the bus for inter-core communication. All memories, ports, and the bus have a limited bandwidth.
Scheduling
Each allocated layer must be scheduled. This determines the execution order
of the layers (or their fine-grained parts). In the traditional layer-by-layer
processing, the only scheduling flexibility comes from branches in the DNN.
A more fine-grained scheduling, referred to as layer-fused [5], depth-first [53]
or cascaded [162] processing, and in this work referred to as layer fusion, has
been introduced. Instead of processing an entire layer at once, a smaller part of
each layer is processed and its outputs are immediately consumed to process
parts of the subsequent layers (Figure 8.1(2,4)). In this work, we refer to such
a layer part as a computation node (CN), whose size determines the scheduling
granularity. Layer fusion has two benefits compared to classical layer-by-layer
scheduling: 1.) The produced and consumed activations are (depending on
the layer) smaller, reducing the memory footprint, which in turn decreases the
off-chip memory accesses; 2.) In a multi-core system, the computation nodes
of subsequent layers can be processed in parallel if the CN data dependencies
allow it, improving parallelism. However, rapidly extracting these dependencies
is non-trivial for modern DNNs under fine scheduling granularity, detailed in
Section 8.3.2.
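As a toy illustration of the first benefit (not Stream's cost model; the feature-map size, 1-byte activations and per-row fusion granularity are arbitrary assumptions), fusing two element-wise layers at row granularity shrinks the live intermediate activation from a full feature map to a single row:

    # Intermediate-activation footprint: layer-by-layer vs. row-wise layer fusion
    # for two back-to-back 1:1 (element-wise) layers on an H x W feature map.
    H, W, BYTES_PER_ACT = 128, 128, 1

    # Layer-by-layer: the entire output of layer 0 is buffered before layer 1 starts.
    peak_layer_by_layer = H * W * BYTES_PER_ACT

    # Row-wise fusion: each produced row of layer 0 is consumed immediately by
    # layer 1, so only one intermediate row is alive at any time.
    peak_layer_fused = W * BYTES_PER_ACT

    print(peak_layer_by_layer, peak_layer_fused)   # 16384 bytes vs. 128 bytes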
Current SotAs have shown the efficacy of layer fusion for a single-core
accelerator [54], homogeneous multi-cores [43, 80, 188] and heterogeneous
systems [164]. However, these works only consider a limited set of DNNs
with fixed scheduling granularity on specific hardware architectures. TVM [162]
Mapping
Figure 8.3 shows an overview of Stream: Given a DNN workload graph and
a high-level multi-core accelerator architecture description, it determines an
optimal compute schedule with the resulting memory usage, energy and latency.
First, every layer is split into fine-grained computation nodes (CNs), taking
into account the dataflows supported by the different accelerator cores (Step 1).
Next, the data dependencies between CNs are rapidly generated and the fine-
grained CN graph is formed through an R-tree [55] (Step 2). In parallel, the
intra-core mapping cost of all unique CN-core combinations is optimized and
extracted using a single-core DSE framework (Step 3). A genetic algorithm
(GA) is subsequently deployed to explore the vast layer-core allocation space
(Step 4). The GA queries a latency- and memory-prioritized scheduler that
schedules the CNs onto the cores taking into account inter-core communication
contention and off-chip memory access contention (Step 5).
The first step of representing a layer-fused DNN is splitting each layer into
multiple individually schedulable parts, referred to as computation nodes (CN).
[Figure 8.3: Overview of Stream. Inputs are the DNN workload graph (layers L0–L4) and a high-level multi-core accelerator description (cores C0–C3 and DRAM). Step 4 performs genetic-algorithm-based automatic layer-core allocation (NSGA-II-based selection, mutation, …), driven by the intra-core mapping cost (energy, latency, memory utilization, PE utilization) of each CN-core combination.]

[Figure: Splitting layer i and layer i+1 (nested for-loops over OX, OY, K and C) into 1, 2 or 4 computation nodes each; a finer CN granularity enlarges the number of possible schedules on the timeline from 1 to 2 to 14.]
The types of layers in the DNN, together with the layer interconnections, impose
constraints on the optimal granularity. For example, a fully connected layer (i.e.
a matrix-vector multiplication) requires all inputs to compute a single output,
and the CN therefore contains all layer loops (i.e. the layer only contains one
CN). This automatically breaks the fused layer stack. When layers do have
spatial locality (e.g. convolutional layers and matrix-matrix multiplications),
the spatially local loop dimensions become outer-CN loop dimensions. The outer-CN
for-loops are synchronized across layers. Moreover, the outer-CN loop order
determines the scheduling order of CNs that belong to the same layer.
Each CN is further annotated with two attributes:

1. The number of inputs exclusively used by this CN, which can hence be
discarded when the CN finishes;
2. The number of final outputs newly generated by each CN, which could be
sent out when the CN finishes.
Because of potential input data overlap and reduction loops across CNs, not
all CNs have the same number of discardable inputs and newly generated final
outputs, as shown in Figure 8.5. Stream’s CN attribute extraction is compatible
with all layer types, strides, and padding supported by ONNX [11].
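A minimal sketch of what such a computation node could carry (an illustrative data structure of our own, not Stream's class definitions): its layer, the output loop ranges it covers, and the two attributes listed above.

    from dataclasses import dataclass, field

    @dataclass
    class ComputationNode:
        layer_id: int
        # Half-open output loop ranges covered by this CN, e.g. {"OX": (0, 14)}.
        loop_ranges: dict[str, tuple[int, int]]
        # Attribute 1: inputs used exclusively by this CN, freed once it finishes.
        num_discardable_inputs: int = 0
        # Attribute 2: final outputs newly generated by this CN, ready to be sent out.
        num_new_outputs: int = 0
        predecessors: list["ComputationNode"] = field(default_factory=list)

    # Due to input overlap (e.g. halos of a 3x3 convolution) and reduction loops,
    # these two attributes generally differ from CN to CN within the same layer.
    cn = ComputationNode(layer_id=3,
                         loop_ranges={"OX": (0, 14), "OY": (0, 28)},
                         num_discardable_inputs=32,
                         num_new_outputs=4)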
After the identification of CNs of each layer and their attributes, the data
dependencies between all CNs must be generated in order to correctly schedule
the CNs in Step 5. This process is split into two parts.
Intra-layer: First, the intra-layer CN dependency edges are inserted based on
the outer-CN loop order, determined in Step 1. This ensures that the required
tensor accesses of CNs within a layer are structured and easily implementable
using loop counters.

[Figure 8.5: Example of CN attribute extraction for a layer with loops over OY,
OX, C and a 3×3 filter (FY, FX): each CN is annotated with the pair (# of
discarded inputs, # of generated outputs), e.g. (-32, +4).]
Inter-layer: Next, the inter-layer CN dependencies are determined based on
the loop ranges of each CN. Specifically, the overlap in data generated by
CNs of one layer, and required by CNs of the next layer(s), defines the data
dependency between these CNs. Because this work targets a fine-grained
scheduling granularity, the number of CNs could grow up to 10^6 or even larger
for modern DNNs. Exhaustively checking each CN pair for overlap in multi-
dimensional data tensors would require 10^12 checks, which is not feasible. A fast
inter-layer CN dependency generator is thus required, for which an algorithm
based on R-trees [55] is developed.
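The sketch below illustrates the idea with the `rtree` Python package (one possible R-tree implementation; the 2-D (OX, OY) ranges and CN numbering are our own toy example, not Stream's internal data layout):

    from rtree import index  # pip install rtree (wraps libspatialindex)

    # Input data range each consumer CN needs, as (ox_min, oy_min, ox_max, oy_max).
    consumer_ranges = {
        0: (0, 0, 13, 13),
        1: (14, 0, 27, 13),
        2: (0, 14, 13, 27),
        3: (14, 14, 27, 27),
    }

    # Step 1: create an R-tree over all consumer CNs of this producer/consumer pair.
    tree = index.Index()
    for cn_id, box in consumer_ranges.items():
        tree.insert(cn_id, box)

    # Step 2: query the tree with the output range of every producer CN; each hit
    # becomes a data-dependency edge producer -> consumer in the CN graph.
    producer_ranges = {10: (0, 0, 27, 13), 11: (0, 14, 27, 27)}
    edges = [(prod, cons)
             for prod, box in producer_ranges.items()
             for cons in tree.intersection(box)]
    print(sorted(edges))   # [(10, 0), (10, 1), (11, 2), (11, 3)]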
[Figure 8.6: Inter-layer CN dependency generation: ① an R-tree is created over the input data ranges (R1, R2, …) of the consumer layer's CNs; ② the R-tree is then queried with the output range of each producer CN to identify the overlapping consumer CNs.]
Figure 8.6 shows the inter-layer dependency generation for a simplified example.
This process is repeated for each producer & consumer layer pair. First, an
R-tree representation of all CNs of the consumer layer is created (1), which
stores the encapsulated loop ranges of each consumer CN in a tree structure. In
the second step (2), the R-tree is queried for intersection with each CN of the
producer layer. The R-tree returns all consumer CNs whose ranges overlap with
the range of the queried producer CN. Note that in this simplified example, the 4
consumer CNs span two dimensions and are non-overlapping. In practice, they
can have more dimensions with overlapping loop ranges, which is supported by
Stream.
Compared with a baseline implementation of inter-layer CN dependency
generation which checks every producer-consumer CN pair one-by-one, our R-
tree-based algorithm is much more efficient. For a case with 448×448 producer
CNs and 448×448 consumer CNs, the baseline implementation would take over
9 hours, whereas the R-tree-based generation takes 6 seconds (a 10^3× speedup).
This step targets the allocation of the CNs of each layer to the different
accelerator cores in the multi-core system. For large networks with varying
layer types, figuring out a performant layer-core allocation can be difficult. For
example, because of the fine CN granularity, it is not straightforward which
CNs can execute in parallel and should hence be executed on different cores.
To this end, a genetic algorithm (GA) is developed, as shown in Figure 8.3, in
which the layer-core allocation is automatically optimized through the evolution
of different generations of a population of layer-core allocations. We choose
a GA for this allocation problem as it is modular in its optimization metrics,
which can be any linear combination of latency, energy, memory footprint, or
their derivatives, such as energy-delay-product (EDP). Each individual in the
population receives a fitness score based on the desired metrics. The surviving
individuals of a population are selected through an NSGA-II process [41],
which employs advanced mechanisms to spread out the individuals over the
Pareto-front. After the selection, an ordered crossover operation is performed
to generate new offspring with a probability of 30%. Finally, the genome of
an individual is randomly mutated through a bit flip (allocating a layer to a
different core) or a position flip (swapping two layers’ core allocations) with a
probability of 70%. The randomness enables the GA to escape local minima.
The GA ends after a predefined number of generations, or after the desired
optimization metric saturates. A Pareto front of optimal layer-core allocations
is returned.
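A stripped-down sketch of the allocation genome and the two mutation operators described above (illustrative only: the toy fitness stands in for the Step-5 scheduler, selection here is plain elitism rather than NSGA-II, and crossover is omitted):

    import random

    NUM_CORES, NUM_LAYERS = 4, 12

    def random_allocation() -> list[int]:
        # Genome: genome[i] is the core that layer i is allocated to.
        return [random.randrange(NUM_CORES) for _ in range(NUM_LAYERS)]

    def mutate(genome: list[int], p: float = 0.7) -> list[int]:
        g = genome.copy()
        if random.random() < p:
            if random.random() < 0.5:      # "bit flip": move one layer to another core
                i = random.randrange(NUM_LAYERS)
                g[i] = random.choice([c for c in range(NUM_CORES) if c != g[i]])
            else:                          # "position flip": swap two layers' cores
                i, j = random.sample(range(NUM_LAYERS), 2)
                g[i], g[j] = g[j], g[i]
        return g

    def fitness(genome: list[int]) -> float:
        # Placeholder cost model: pretend energy grows with the core id and
        # latency with the most-loaded core, and minimize their product (EDP).
        energy = sum(1.0 + 0.1 * core for core in genome)
        latency = max(genome.count(core) for core in range(NUM_CORES))
        return energy * latency

    population = [random_allocation() for _ in range(20)]
    for _ in range(50):                    # generations
        population.sort(key=fitness)
        parents = population[:10]          # elitist selection (NSGA-II in Stream)
        population = parents + [mutate(random.choice(parents)) for _ in range(10)]

    print(min(population, key=fitness))    # best layer-core allocation found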
The scheduling step aims to derive the optimal start time for executing
each CN, given the fine-grained CN graph, the CNs' mapping costs and the
Memory is prioritized by picking the CN from the candidate pool that has the
highest layer index. As this is the schedulable CN from the deepest layer in the
fused layer stack, it stimulates the immediate consumption of data deeper into
the fused stack for early discarding and efficient memory use. This can result
in idle time in the core as it waits for other cores to finish the predecessors of a
CN with a higher layer index, hence resulting in larger execution latency.
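In code, the prioritization boils down to the rule used to pick the next CN from the pool of ready CNs (a sketch with assumed attribute names; the latency rule shown, favoring the shallowest ready CN, is our simplification, not necessarily Stream's exact latency-prioritized policy):

    from collections import namedtuple

    CN = namedtuple("CN", "layer_id")  # stand-in for the ComputationNode sketched earlier

    def pick_next_cn(ready_pool: list, priority: str):
        if priority == "memory":
            # Deepest ready layer first: push data through the fused stack so that
            # inputs can be discarded as early as possible, at the cost of idle time.
            return max(ready_pool, key=lambda cn: cn.layer_id)
        # Latency priority (simplified): keep shallow layers flowing so cores stay busy.
        return min(ready_pool, key=lambda cn: cn.layer_id)

    pool = [CN(0), CN(2), CN(1)]
    print(pick_next_cn(pool, "memory").layer_id)    # 2
    print(pick_next_cn(pool, "latency").layer_id)   # 0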
Figure 8.8 demonstrates the impact of the two prioritization strategies and the
communication bus and DRAM port overhead, both in terms of latency and
memory footprint of a three-core system onto which four layers are mapped. The
blocks at the top of Figure 8.8(c)(d) represent the CNs and the lines represent their
fine-grained dependencies. The bottom figures show the memory utilization
trace across time, explained next.
[Figure 8.8: Latency- versus memory-prioritized scheduling (including communication bus and DRAM port contention) of four layers of 6 CNs each onto three cores (Layers 0 and 3 on Core 0, Layer 1 on Core 1, Layer 2 on Core 2), together with the resulting memory utilization traces: 40164 cycles at a 617 KB peak memory footprint versus 51414 cycles at a 313 KB peak footprint.]
Once the start and end times of all CNs are known, the activation memory
utilization can be traced through time based on the number of discarded inputs
and the number of generated outputs per CN (cf. Section 8.3.1). When a CN
finishes, the inputs that are no longer required are freed from the memory space.
When a CN starts, space is allocated in the memory for the to-be-generated
outputs. In case data is transferred between two cores, the output data of
the producer CN remains in the producing core until the communication is
concluded. Memory space is allocated in the consuming core as soon as the
communication starts. Figure 8.8 shows the total memory usage trace of all
three cores, of which the maximum is the peak memory usage.
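The trace itself can be reproduced with a simple event sweep over the CN start and end times (a sketch; the event list and byte counts below are made-up numbers, and inter-core transfers would add matching allocate/free events on the producing and consuming cores):

    # Each event is (cycle, delta_bytes): +allocation when a CN starts (space for its
    # outputs), -deallocation when a CN ends (its exclusively-used inputs are freed).
    events = [
        (0,   +4096),   # CN0 starts
        (100, -2048),   # CN0 ends: discardable inputs freed
        (100, +4096),   # CN1 starts
        (220, -4096),   # CN1 ends
    ]

    def memory_trace(events):
        usage, trace = 0, []
        for cycle, delta in sorted(events):
            usage += delta
            trace.append((cycle, usage))
        return trace

    trace = memory_trace(events)
    print(trace)                          # activation-memory utilization over time
    print(max(u for _, u in trace))       # peak memory usage: 6144 bytes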
8.4 Validation
The validation results are summarized in Table 8.1 and the schedules generated by
Stream are shown in Figure 8.10.
Table 8.1: Validation of Stream's modeled latency and memory usage against measurements of three SotA chips.

Latency validation
Architecture      Measured (cc)   Stream (cc)   Accuracy (%)
DepFiN [54]       6.18 × 10^6     5.65 × 10^6   91
4×4 AiMC [80]     3.66 × 10^5     3.68 × 10^5   99
DIANA [164]       8.12 × 10^5     7.83 × 10^5   96

Memory usage validation
Architecture      Measured (KB)   Stream (KB)   Accuracy (%)
DepFiN [54]       238             244           97
4×4 AiMC [80]     N/A             16.5          N/A
DIANA [164]       134             137           98

DepFiN results
and the modeled latency and memory usage are 91% and 97% accurate, respectively,
compared to the measurements.
The DNN workload deployed by Jia et al. on their multi-core AiMC architecture
consists of ResNet-50 segments [57]. Their work has no memory usage data available.
Stream predicts a memory usage of 16.5 KB due to the tight activation balance
observed in Figure 8.10(b). Stream’s runtime was 3 seconds, and the modeled
latency is 99% accurate compared to the measurement.
Figure 8.10: Schedule visualization of Stream for the three validation targets.
DIANA results
8.5 Exploration
[Figure: The explored accelerator cores and multi-core configurations (PE-array cores with 256 KB local memory, an additional SIMD core, and a 128-bit communication bus), together with relative reductions of -27 %, -52 %, -56 % and -59 % observed for the homogeneous TPU-like (MC: Hom. TPU) and heterogeneous (MC: Hetero) multi-core systems.]
All further experiments, both for coarse and fine granularities, are executed using
the GA-based allocation with the latency scheduling priority.
Next, we explore the capabilities of Stream to assess the benefits and drawbacks
of various hardware architectures for fine-grained layer-fused DNN execution.
A diverse set of neural network models is used for this study: ResNet18 [57],
MobileNetV2 [138], SqueezeNet [72], Tiny-YOLO [3] and FSRCNN [45].
Stream optimally allocates each dense computational layer to one of the
accelerator cores using its genetic algorithm, while the other layers such as
pooling and residual addition layers are assigned to the additional SIMD core.
To demonstrate Stream’s optimization flexibility, this study targets energy-
delay-product (EDP) as the allocation’s optimization criterion.
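For reference (a simple restatement, not a new metric), the fitness minimized here is just the product of the total energy and latency reported by the scheduler, in the same J×cc unit used on the Y-axis of Figure 8.13:

    EDP = E_total [J] × L [cc]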
The results of all experiments are summarized in Figure 8.13. For each
combination of workload and architecture, the EDP is optimized both for a
layer-by-layer scheduling granularity, as used by the SotA scheduler of Kwon et
al. [92], and for a fine-grained scheduling granularity. We also show the impact of
these design choices on latency and energy individually in Figure 8.14, for
the optimal EDP point.
The analysis below is broken down into the three modeled architecture classes:
Single-core architectures
[Figure 8.13: The best EDP point found by Stream over 5 DNNs (ResNet18, MobileNetV2, SqueezeNet, TinyYOLO, FSRCNN) and their geometric mean, for 7 hardware architectures, under traditional layer-by-layer scheduling (like Kwon et al. [92]) and fine-grained layer fusion (Stream). Y-axis: energy-delay product [J×cc]. The architecture abbreviations are shown in Figure 8.11(b). For the geometric mean, the EDP reduction from layer-by-layer to layer-fused is annotated (4.2×, 4.7×, 10.7×, 18.6× and 19.0×).]
[Figure 8.14: Latency (a) and energy breakdown (b) for the best EDP points of the exploration, comparing traditional layer-by-layer scheduling with Stream's fine-grained layer fusion for the 5 DNNs and their geometric mean; the energy is split into on-chip and off-chip contributions. Units: latency in cycles (cc), energy in J.]
scheduling granularity. This is because each individual, smaller
core has a lower hardware efficiency for workloads with a coarse scheduling
granularity. For fine-grained layer fusion, homogeneous multi-core architectures
do consistently outperform single-core architectures, as they bring more
parallelization opportunities, increasing the temporal utilization (i.e. the
percentage of time the core is active). Moreover, the off-chip DRAM energy
is further reduced. For the three homogeneous multi-core architectures, layer
fusion has between 10× and 19× better EDP than layer-by-layer scheduling.
8.6 Conclusion
Finally, the journey, which started from the "post-Moore's-Law" vision and
discussion of Chapter 1, comes to an end. In this post-Moore's-Law era, we have
made our efforts to facilitate DNN accelerator design and workload deployment
through cross-level design space exploration, helping to continue the exponential
growth in computing performance and efficiency.
In this chapter, we will first conclude the thesis by revisiting the key ideas of
each chapter and answering the open research questions posed in Section 1.2.
Afterwards, we will look into the future research directions, building upon the
groundwork laid by this thesis.
efficiency. This chapter also revealed a crucial fact: when full-precision
operations account for a significant share of the overall computation, SWUs tend
to perform better than FUs, and when the portion of full-precision operations
increases further, none of the precision-scalable MAC architectures beats
the conventional MAC unit with data gating.
Chapter 4 continued the study on precision-scalable datapath and moved
one architectural level up to the MAC array level. As unified design
characterization is the foundation for systematic DSE, with the enlarged
design space (from MAC unit to array), a more powerful way to characterize
different PSMAs is essential. To this end, this chapter introduced a precision-aware
nested for-loop representation for DNN mappings, and based on this new
representation, proposed a comprehensive PSMA taxonomy. Following that,
a highly parameterized PSMA template was built that can be design-time
configured into a huge subset (72 designs) of the design space spanned by
the taxonomy. In the end, the 72 PSMAs are thoroughly benchmarked and
compared, disclosing design insights.
Again, results showed that there is no one-size-fits-all answer to the question "Q1.2:
What is the best MAC array architecture for variable-precision DNN execution?",
but there are several insightful design guidelines observed. First, it is shown that
BG unrolling in L2 is the least ideal case, and it’s better to have BG unrolling
at L3 for lower frequencies or unrolling temporally for higher frequencies. This
is due to the (configurable) shifters being amortized across different L2 units,
making the adder trees less complex. Second, it’s generally a good idea to have
a mixture of IS/WS and OS loops unrolled throughout the array levels, so as
to balance the data distributing and collecting burdens of input and output
operands. Third, similar to the previous chapter, FU designs are better suited
for workloads where lower precisions are common, whereas SWU designs are
better suited for higher-precision workloads. Fourth, contrary to the previous
chapter, the BS designs exhibit favorable qualities at the array level. This is
because, for the BS designs in this chapter (BS-L2), the hardware overhead of the
internal registers and shift-add logic is shared across the 16 L1 units within each
L2 unit (enabled by L2: OS), which ultimately reduces the energy per operation.
In Chapter 3, by contrast, the BS designs can be seen as BS-L1: each L1 unit is
equipped with its own registers and shift-add logic.
Chapters 5 & 6 moved the abstraction level further up from components in a
DNN accelerator (MAC units and array) to the complete accelerator system,
and provided a complete answer to the question "Q2.1: How to design a DSE
framework for single-core DNN accelerator exploring single-layer mapping?"
by introducing the ZigZag framework. ZigZag is a DSE framework for DNN
accelerators supporting single-layer mapping optimization. It is equipped
with an analytical cost model for rapid energy/latency/area estimation and
a layerwise coarse-grained DNN model can be easily broken down into a more
fine-grained graph with each node as a nested for-loop based CN and each edge
between the nodes indicating the fine-grained data dependencies between CNs.
Afterward, a heuristic-based scheduler is constructed to rapidly schedule each
CN to the corresponding core, respecting the data dependencies and hardware
constraints, and following that, the single-layer mapper will make sure each
CN is mapped onto its core in an optimal manner. The Stream framework
is validated against three hardware implementations employing layer-fused
scheduling, showing tight matching with measured hardware efficiencies. Case
studies showed that high-level architectural decisions greatly impact hardware
efficiency under the fine-grained scheduling paradigm. To conclude, Stream
allows us to quantitatively co-explore the scheduling granularity with architectural
decisions, providing insights into the hardware performance of fine-grained layer
fusion on a broad range of multi-core architectures.
As the saying goes, the more you know, the more you realize you don’t know.
By answering one question, you usually find more follow-up questions. This
section proposes some future work directions, building upon the foundational
contributions of this thesis.
All in all, this thesis has clearly introduced the vast design space of deep learning
accelerators at the different abstraction levels, thoroughly explained how the
high-level design space exploration frameworks can be built to rapidly offer
design insights and guidelines, and eventually passed on the taxonomy, modeling,
and exploration methodologies applied in this thesis to future researchers.
Biography
Linyan Mei was born in Hunan, China in 1994. She received the B. Sc. degree
in electrical engineering from Beijing Institute of Technology (BIT), China
in 2016, and the M. Sc. degree in electrical engineering from KU Leuven,
Belgium, in 2018. The subject of her master's thesis was "Digital Design
of Flexible Deep Learning Kernels", in cooperation with the SLD (system-
level design) team in IMEC, Leuven, advised by Dr. Dimitrios Rodopoulos
and Dr. Jeremy Constantine. Later in 2018, she joined the ESAT-MICAS
laboratories of KU Leuven as a Ph.D. student, under the guidance of
Prof. Marian Verhelst. In 2021, she temporarily joined Meta, CA, US as
a research intern, advised by Dr. Huichu Liu. Currently, she is working towards
the Ph.D. degree on design space exploration for embedded deep neural network
accelerators.
List of publications
Patent
Masters’ theses:
Interns’ projects:
Teaching
⋄ Teaching assistant for a master course – Computer Architecture (B-KUL-
H05D3A), Exercise Session: building a 5-stage pipelined RISC-V processor,
2018-2023
Bibliography
[4] Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused-
layer CNN accelerators. In 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO) (2016), pp. 1–12. (Cited on
202, 226)
[5] Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused-
layer cnn accelerators. In 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO) (2016), pp. 1–12. (Cited on
233, 235)
[6] Amazon. Books, 1996-2023, Amazon.com. https://www.amazon.com/
books-used-books-textbooks/b?ie=UTF8&node=283155. (Cited on 3)
[7] Ambolt AI. Blog: Computer Vision – image classification and object
detection, 2016. https://ambolt.io/en/image-classification-and-
object-detection/. (Cited on 3)
[8] Amdahl, G. M. Validity of the single processor approach to achieving
large scale computing capabilities. In 1967 AFIPS Spring Joint Computer
Conference (1967), pp. 483–485. (Cited on 96)
[10] Austin, T. Keynote talk: Preparing for a Post Moore’s Law World. In The
48th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO) (2015). (Cited on 5)
[11] Bai, J., Lu, F., Zhang, K., et al. ONNX: Open Neural Network
Exchange. https://github.com/onnx/onnx, 2019. (Cited on 174, 239)
[17] Burrello, A., Garofalo, A., Bruschi, N., Tagliavini, G., Rossi,
D., and Conti, F. DORY: Automatic End-to-End Deployment of Real-
World DNNs on Low-Cost IoT MCUs. IEEE Transactions on Computers
70, 8 (2021), 1253–1268. (Cited on 133, 134, 135)
[18] Zhang, C., et al. Optimizing FPGA-based accelerator design for deep
convolutional neural networks. In 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (2015), pp. 161–170.
(Cited on 60)
[19] Cai, X., Wang, Y., and Zhang, L. Optimus: An Operator Fusion
Framework for Deep Neural Networks. ACM Trans. Embed. Comput. Syst.
(feb 2022). (Cited on 14, 15, 52, 202, 203, 226, 230)
[20] Camus, V. Design of approximate and precision-scalable circuits for
embedded multimedia and neural-network processing. Ph.D. dissertation,
EPFL, Lausanne, 2019. http://infoscience.epfl.ch/record/264984.
(Cited on 58)
[21] Camus, V., Enz, C., and Verhelst, M. Survey of Precision-Scalable
Multiply-Accumulate Units for Neural-Network Processing. In 2019 IEEE
1st International Conference on Artificial Intelligence Circuits and Systems
(AICAS) (March 2019). (Cited on xxix, 58, 62, 71, 87, 96, 97, 258)
[22] Camus, V., Mei, L., Enz, C., and Verhelst, M. Review and
benchmarking of precision-scalable multiply-accumulate unit architectures
for embedded neural-network processing. IEEE Journal on Emerging and
Selected Topics in Circuits and Systems 9, 4 (2019), 697–711. (Cited on
57, 109)
[23] Catthoor, F. Unified low-power design flow for data-dominated multi-
media and telecom applications. Springer, 2000. (Cited on 4, 262)
[24] Catthoor, F., and Danckaert, K. Data access and storage
management for embedded programmable processors. Springer Science &
Business Media, 2002. (Cited on 262)
[25] Catthoor, F., Wuytack, S., De Greef, G., Banica, F.,
Nachtergaele, L., and Vandecappelle, A. Custom memory
management methodology: Exploration of memory organisation for
embedded multimedia system design. Springer Science & Business Media,
1998. (Cited on 262)
[26] Cavalcante, M., Riedel, S., Pullini, A., and Benini, L.
MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency
Interconnect. In 2021 Design, Automation and Test in Europe Conference
and Exhibition (DATE) (2021), pp. 701–706. (Cited on 3)
[27] Cavigelli, L., and Benini, L. Origami: A 803-GOp/s/W Convolutional
Network Accelerator. In IEEE Transactions on Circuits and Systems for
Video Technology (TCSVT) (Nov 2017), pp. 2461–2475. (Cited on 58)
[28] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan,
M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and
Krishnamurthy, A. TVM: An Automated End-to-End Optimizing
Compiler for Deep Learning. In Proceedings of the 13th USENIX
[38] Dave, S., Baghdadi, R., Nowatzki, T., Avancha, S., Shrivastava,
A., and Li, B. Hardware acceleration of sparse and irregular tensor
computations of ml models: A survey and insights. Proceedings of the
IEEE 109, 10 (2021), 1706–1752. (Cited on 55)
[39] Dave, S., Kim, Y., Avancha, S., Lee, K., and Shrivastava, A.
Dmazerunner: Executing perfectly nested loops on dataflow accelerators.
ACM Trans. Embed. Comput. Syst. 18, 5s (2019). (Cited on 50, 133, 134,
135)
[40] De Greef, E., Catthoor, F., and De Man, H. Program
transformation strategies for memory size and power reduction of
pseudoregular multimedia subsystems. IEEE Transactions on Circuits
and Systems for Video Technology 8, 6 (1998), 719–733. (Cited on 262)
[41] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. A fast and
elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on
Evolutionary Computation 6, 2 (2002), 182–197. (Cited on 242)
[42] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE
conference on computer vision and pattern recognition (2009), IEEE, pp. 248–
255. (Cited on 32, 164)
[43] Ding, Y., Zhu, L., Jia, Z., Pekhimenko, G., and Han, S. IOS:
Inter-operator scheduler for cnn acceleration. Proceedings of Machine
Learning and Systems 3 (2021), 167–180. (Cited on 16, 17, 52, 235)
[44] Dong, C., Loy, C. C., and Tang, X. Accelerating the super-resolution
convolutional neural network. In European conference on computer vision
(2016), Springer, pp. 391–407. (Cited on xxxiii, 210, 215, 218, 247)
[45] Dong, C., Loy, C. C., and Tang, X. Accelerating the super-resolution
convolutional neural network. In European conference on computer vision
(2016), Springer, pp. 391–407. (Cited on 251)
[46] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng,
X., Chen, Y., and Temam, O. Shidiannao: Shifting vision processing
closer to the sensor. In 2015 ACM/IEEE 42nd Annual International
Symposium on Computer Architecture (ISCA) (June 2015), pp. 92–104.
(Cited on 60)
[47] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng,
X., Chen, Y., and Temam, O. Shidiannao: Shifting vision processing
closer to the sensor. In Proceedings of the 42nd Annual International
Symposium on Computer Architecture (2015), pp. 92–104. (Cited on 232)
[48] Gajski, D., Vahid, F., Narayan, S., and Gong, J. Specsyn:
an environment supporting the specify-explore-refine paradigm for
hardware/software system design. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems 6, 1 (1998), 84–100. (Cited on 262)
[49] Garofalo, A., Ottavi, G., Conti, F., Karunaratne, G., Boybat,
I., Benini, L., and Rossi, D. A heterogeneous in-memory computing
cluster for flexible end-to-end inference of real-world deep neural networks.
arXiv preprint arXiv:2201.01089 (2022). (Cited on 232)
[50] Ghodrati, S., Ahn, B. H., Kyung Kim, J., Kinzer, S., Yatham,
B. R., Alla, N., Sharma, H., Alian, M., Ebrahimi, E., Kim,
N. S., Young, C., and Esmaeilzadeh, H. Planaria: Dynamic
architecture fission for spatial multi-tenant acceleration of deep neural
networks. In 2020 53rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO) (2020), pp. 681–697. (Cited on 232, 234)
[51] Ghodrati, S., Sharma, H., Young, C., Kim, N. S., and
Esmaeilzadeh, H. Bit-parallel vector composability for neural
acceleration. In 2020 57th ACM/IEEE Design Automation Conference
(DAC) (2020), IEEE, pp. 1–6. (Cited on 100, 104, 109, 110, 129)
[52] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and
Keutzer, K. A Survey of Quantization Methods for Efficient Neural
Network Inference. CoRR abs/2103.13630 (2021). (Cited on 7)
[53] Goetschalckx, K., and Verhelst, M. Breaking High-Resolution
CNN Bandwidth Barriers With Enhanced Depth-First Execution. IEEE
Journal on Emerging and Selected Topics in Circuits and Systems 9, 2
(2019), 323–331. (Cited on 202, 233, 235)
[54] Goetschalckx, K., and Verhelst, M. DepFiN: A 12nm, 3.8TOPs
depth-first CNN processor for high res. image processing. In 2021
Symposium on VLSI Circuits (2021), pp. 1–2. (Cited on 3, 202, 215,
226, 233, 235, 246, 247)
[55] Guttman, A. R-trees: A dynamic index structure for spatial searching.
In Proceedings of the 1984 ACM SIGMOD International Conference
on Management of Data (New York, NY, USA, 1984), SIGMOD ’84,
Association for Computing Machinery, p. 47–57. (Cited on xxxiv, 236,
240, 241)
[56] Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang,
Y. Dynamic neural networks: A survey. IEEE Transactions on Pattern
Analysis and Machine Intelligence 44, 11 (nov 2022), 7436–7456. (Cited
on 55)
[57] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for
Image Recognition. arXiv e-prints (Dec. 2015), arXiv:1512.03385. (Cited
on xxvii, xxxi, 29, 30, 31, 158, 159, 167, 248, 249, 251)
[58] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for
Image Recognition. arXiv e-prints (Dec. 2015), arXiv:1512.03385. (Cited
on 218)
[59] Hegde, K., Tsai, P.-A., Huang, S., Chandra, V., Parashar,
A., and Fletcher, C. W. Mind Mappings: Enabling Efficient
Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th
ACM International Conference on Architectural Support for Programming
Languages and Operating Systems (New York, NY, USA, 2021), ASPLOS
’21, Association for Computing Machinery, p. 943–958. (Cited on 214,
236, 241)
[60] Hirtzlin, T., Bocquet, M., Penkovsky, B., Klein, J.-O., Nowak,
E., Vianello, E., Portal, J.-M., and Querlioz, D. Digital
biologically plausible implementation of binarized neural networks
with differential hafnium oxide resistive memory arrays. Frontiers in
Neuroscience 13 (2020). (Cited on 3)
[61] Houshmand, P., Cosemans, S., Mei, L., Papistas, I., Bhattachar-
jee, D., Debacker, P., Mallik, A., Verkest, D., and Verhelst, M.
Opportunities and Limitations of Emerging Analog in-Memory Compute
DNN Architectures. In 2020 IEEE International Electron Devices Meeting
(IEDM) (2020), pp. 29.1.1–29.1.4. (Cited on 170)
[62] Houshmand, P., Sarda, G. M., Jain, V., Ueyoshi, K., Papistas,
I. A., Shi, M., Zheng, Q., Bhattacharjee, D., Mallik, A.,
Debacker, P., Verkest, D., and Verhelst, M. DIANA: An End-
to-End Hybrid DIgital and ANAlog Neural Network SoC for the Edge.
IEEE Journal of Solid-State Circuits 58, 1 (2023), 203–215. (Cited on 3,
39)
[63] Houshmand, P., Sun, J., and Verhelst, M. Benchmarking and
modeling of analog and digital SRAM in-memory computing architectures.
In 2023 tinyML Research Symposium (2023). (Cited on 170)
[64] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L., Tan,
M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., and
[82] Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and
Moshovos, A. Stripes: Bit-serial deep neural network computing. In 2016
49th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO) (2016), IEEE, pp. 1–12. (Cited on 8, 10, 109, 110, 129)
[83] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M.,
Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A.,
Potapenko, A., et al. Highly accurate protein structure prediction
with alphafold. Nature 596, 7873 (2021), 583–589. (Cited on 1, 3)
[84] Jung, V. J. B., Symons, A., Mei, L., Verhelst, M., and Benini,
L. SALSA: Simulated Annealing based Loop-Ordering Scheduler for
DNN Accelerators. In 2023 IEEE International Conference on Artificial
Intelligence Circuits and Systems (AICAS) (2023). (Cited on 169)
[91] Kwon, H., Chatarasi, P., Pellauer, M., Parashar, A., Sarkar,
V., and Krishna, T. Understanding reuse, performance, and hardware
cost of dnn dataflow: A data-centric approach. In Proceedings of the
52nd Annual IEEE/ACM International Symposium on Microarchitecture
(New York, NY, USA, 2019), MICRO ’52, Association for Computing
Machinery, p. 754–768. (Cited on 13, 133, 134, 135, 184, 202, 214, 236)
[92] Kwon, H., Lai, L., Pellauer, M., Krishna, T., hsin Chen, Y.,
and Chandra, V. Heterogeneous dataflow accelerators for multi-dnn
workloads. 2021 IEEE International Symposium on High-Performance
Computer Architecture (HPCA) (2021), 71–83. (Cited on 16, 17, 232, 234,
251)
[93] Lee, E., Han, T., Seo, D., Shin, G., Kim, J., Kim, S., Jeong,
S., Rhe, J., Park, J., Ko, J. H., and Lee, Y. A charge-domain
scalable-weight in-memory computing macro with dual-sram architecture
for precision-scalable dnn accelerators. IEEE Transactions on Circuits
and Systems I: Regular Papers 68, 8 (2021), 3305–3316. (Cited on 100)
[94] Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., and Yoo, H. J.
UNPU: A 50.6TOPS/W unified deep neural network accelerator with
1b-to-16b fully-variable weight bit-precision. In 2018 IEEE International
Solid-State Circuits Conference (ISSCC) (Feb. 2018), pp. 218–220. (Cited
on xxix, 7, 8, 10, 40, 58, 62, 70, 87, 97, 100, 109, 110, 115, 258)
[95] Lee, J., Shin, D., Lee, J., Lee, J., Kang, S., and Yoo, H.-J. A Full
HD 60 fps CNN Super Resolution Processor with Selective Caching based
Layer Fusion for Mobile Devices. In 2019 Symposium on VLSI Circuits
(2019), pp. C302–C303. (Cited on 202, 226)
[96] Leviathan, Y., and Matias, Y. Google Duplex: An AI system for
accomplishing real-world tasks over the phone. Google AI Blog (2018). (Cited on 1)
[97] Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H., and Hu, Y.
Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural
Network Computing : Industry Track Paper. In 2021 IEEE International
Symposium on High-Performance Computer Architecture (HPCA) (2021),
pp. 789–801. (Cited on 216)
[98] Lin, C.-H., Cheng, C.-C., Tsai, Y.-M., Hung, S.-J., Kuo, Y.-T.,
Wang, P. H., Tsung, P.-K., Hsu, J.-Y., Lai, W.-C., Liu, C.-H.,
Wang, S.-Y., Kuo, C.-H., Chang, C.-Y., Lee, M.-H., Lin, T.-Y.,
[100] Liu, J., Sun, J., Xu, Z., and Sun, G. Latency-aware automatic cnn
channel pruning with gpu runtime analysis. BenchCouncil Transactions
on Benchmarks, Standards and Evaluations 1, 1 (2021), 100009. (Cited
on xxvii, 32, 33)
[101] Liu, L., Zhu, J., Li, Z., Lu, Y., Deng, Y., Han, J., Yin, S., and
Wei, S. A survey of coarse-grained reconfigurable architecture and design:
Taxonomy, challenges, and applications. ACM Comput. Surv. 52, 6 (oct
2019). (Cited on 38)
[102] Liu, X., Ounifi, H.-A., Gherbi, A., Li, W., and Cheriet, M. A
hybrid gpu-fpga based design methodology for enhancing machine learning
applications performance. Journal of Ambient Intelligence and Humanized
Computing 11 (2020), 2309–2323. (Cited on 39)
[103] Liu, Z., Leng, J., Chen, Q., Li, C., Zheng, W., Li, L., and
Guo, M. DLFusion: An Auto-Tuning Compiler for Layer Fusion
on Deep Neural Network Accelerator. In 2020 IEEE Intl Conf on
Parallel & Distributed Processing with Applications, Big Data & Cloud
Computing, Sustainable Computing & Communications, Social Computing
& Networking (ISPA/BDCloud/SocialCom/SustainCom) (2020), IEEE,
pp. 118–127. (Cited on 233)
[104] Louis, M. S., Azad, Z., Delshadtehrani, L., Gupta, S., Warden,
P., Reddi, V. J., and Joshi, A. Towards Deep Learning using
TensorFlow Lite on RISC-V, 2019. (Cited on 37)
[105] Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W.,
Yang, F., Zhang, L., and Zhou, L. Rammer: Enabling holistic deep
learning compiler optimizations with rTasks. In 14th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 20) (Nov. 2020),
USENIX Association, pp. 881–897. (Cited on 16, 17)
[113] Mo, H., Zhu, W., Hu, W., Li, Q., Li, A., Yin, S., Wei, S., and
Liu, L. A 12.1 TOPS/W Quantized Network Acceleration Processor
With Effective-Weight-Based Convolution and Error-Compensation-Based
Prediction. IEEE Journal of Solid-State Circuits 57, 5 (2022), 1542–1557.
(Cited on 202, 226)
[114] Moons, B., Bankman, D., Yang, L., Murmann, B., and Verhelst,
M. Binareye: An always-on energy-accuracy-scalable binary cnn processor
with all memory on chip in 28nm cmos. In 2018 IEEE Custom Integrated
Circuits Conference (CICC) (2018), IEEE, pp. 1–4. (Cited on 39)
[131] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford,
A., Chen, M., and Sutskever, I. Zero-Shot Text-to-Image Generation.
arXiv e-prints (Feb. 2021), arXiv:2102.12092. (Cited on 1)
[132] Redmon, J., and Farhadi, A. YOLO9000: Better, Faster, Stronger.
arXiv e-prints (Dec. 2016), arXiv:1612.08242. (Cited on xxxii, 163)
[133] Ryu, S., Kim, H., Yi, W., and Kim, J.-J. Bitblade: Area and
energy-efficient precision-scalable neural network accelerator with bitwise
summation. In Proceedings of the 56th Annual Design Automation
Conference 2019 (2019), pp. 1–6. (Cited on 8, 10, 58, 68, 100, 104,
106, 109, 110, 129)
[150] Song, M., Zhang, J., Chen, H., and Li, T. Towards efficient
microarchitectural design for accelerating unsupervised gan-based deep
learning. In 2018 IEEE International Symposium on High Performance
Computer Architecture (HPCA) (Feb 2018), pp. 66–77. (Cited on 60)
[151] Sumbul, H. E., Wu, T. F., Li, Y., Sarwar, S. S., Koven, W.,
Murphy-Trotzky, E., Cai, X., Ansari, E., Morris, D. H., Liu,
H., Kim, D., Beigne, E., Labs, R., and Meta. System-Level Design
and Integration of a Prototype AR/VR Hardware Featuring a Custom
Low-Power DNN Accelerator Chip in 7nm Technology for Codec Avatars.
In 2022 IEEE Custom Integrated Circuits Conference (CICC) (2022),
pp. 01–08. (Cited on 3, 40, 183, 194, 200, 215, 216)
[152] Symons, A., Mei, L., Colleman, S., Houshmand, P., Karl, S.,
and Verhelst, M. Towards Heterogeneous Multi-core Accelerators
Exploiting Fine-grained Scheduling of Layer-Fused Deep Neural Networks.
arXiv e-prints (Dec. 2022), arXiv:2212.10612. (Cited on 3, 52, 231)
[153] Symons, A., Mei, L., and Verhelst, M. LOMA: Fast Auto-Scheduling
on DNN Accelerators through Loop-Order-based Memory Allocation. In
2021 IEEE International Conference on Artificial Intelligence Circuits
and Systems (AICAS) (2021), pp. 1–4. (Cited on 169, 214, 242)
[154] Syu, N.-S., Chen, Y.-S., and Chuang, Y.-Y. Learning Deep
Convolutional Networks for Demosaicing. arXiv e-prints (Feb. 2018),
arXiv:1802.03769. (Cited on 218)
[155] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4,
inception-resnet and the impact of residual connections on learning. In
Proc. of AAAI Conf. on Artificial Intelligence, 2017 (2017). (Cited on
xxvii, 30, 31, 167)
[156] Chen, T., et al. DianNao: A small-footprint high-throughput accelerator
for ubiquitous machine-learning. SIGPLAN Not. 49, 4 (Feb. 2014), 269–
284. (Cited on 60)
[157] Talpes, E., Sarma, D. D., Venkataramanan, G., Bannon, P.,
McGee, B., Floering, B., Jalote, A., Hsiong, C., Arora, S.,
Gorti, A., and Sachdev, G. S. Compute Solution for Tesla’s Full
Self-Driving Computer. IEEE Micro 40, 2 (2020), 25–35. (Cited on 39,
216)
[158] Tang, T., et al. NeuroMeter: An Integrated Power, Area, and Timing
Modeling Framework for Machine Learning Accelerators Industry Track
Paper. In IEEE HPCA (2021). (Cited on 184)
[159] Tarnawski, J. M., Phanishayee, A., Devanur, N., Mahajan, D.,
and Nina Paravecino, F. Efficient algorithms for device placement of
dnn graph operators. Advances in Neural Information Processing Systems
33 (2020), 15451–15463. (Cited on 16, 17)
[160] Theis, T. N., and Wong, H.-S. P. The End of Moore’s Law: A
New Beginning for Information Technology. Computing in Science &
Engineering 19, 2 (2017), 41–50. (Cited on 4)
[161] Tronçon, R., Bruynooghe, M., Janssens, G., and Catthoor, F.
Storage size reduction by in-place mapping of arrays. In International
Workshop on Verification, Model Checking, and Abstract Interpretation
(2002), Springer, pp. 167–181. (Cited on 55)
[162] TVM RFCs. Arm® Ethos™-U cascading scheduler. (Cited on 202, 233, 235)
[163] Ueyoshi, K., Ando, K., Hirose, K., Takamaeda-Yamazaki, S.,
Kadomoto, J., Miyata, T., Hamada, M., Kuroda, T., and
Motomura, M. QUEST: A 7.49TOPS multi-purpose log-quantized DNN
inference engine stacked on 96MB 3D SRAM using inductive-coupling
technology in 40nm CMOS. In 2018 IEEE International Solid-State
Circuits Conference (ISSCC) (Feb. 2018), pp. 216–218. (Cited on 58, 70,
97, 258)
[164] Ueyoshi, K., Papistas, I. A., Houshmand, P., Sarda, G. M., Jain,
V., Shi, M., Zheng, Q., Giraldo, S., Vrancx, P., Doevenspeck,
J., Bhattacharjee, D., Cosemans, S., Mallik, A., Debacker, P.,
Verkest, D., and Verhelst, M. DIANA: An End-to-End Energy-
Efficient Digital and ANAlog Hybrid Neural Network SoC. In 2022 IEEE
International Solid- State Circuits Conference (ISSCC) (2022), vol. 65,
pp. 1–3. (Cited on 232, 233, 234, 235, 246, 247, 249)
[165] Veen, F. V. The Neural Network Zoo, 2016. https://www.
asimovinstitute.org/neural-network-zoo/. (Cited on 3)
[166] Venkatesan, R., Shao, Y. S., Wang, M., Clemons, J., Dai, S.,
Fojtik, M., Keller, B., Klinefelter, A., Pinckney, N., Raina,
P., Zhang, Y., Zimmer, B., Dally, W. J., Emer, J., Keckler,
S. W., and Khailany, B. Magnet: A modular accelerator generator
for neural networks. In 2019 IEEE/ACM International Conference on
Computer-Aided Design (ICCAD) (2019), pp. 1–8. (Cited on 13, 133, 134,
135, 184)
[167] Verdoolaege, S. isl: An integer set library for the polyhedral model.
In International Congress on Mathematical Software (2010), Springer,
pp. 299–302. (Cited on 55, 262)
[168] Verhelst, M., Shi, M., and Mei, L. ML Processors Are Going Multi-
Core: A performance dream or a scheduling nightmare? IEEE Solid-State
Circuits Magazine 14, 4 (2022), 18–27. (Cited on 5, 43)
[169] Victor, D. HandTrack: A Library For Prototyping Real-time
Hand TrackingInterfaces using Convolutional Neural Networks. GitHub
repository (2017). (Cited on 194)
[170] Vivet, P., Guthmuller, E., Thonnart, Y., Pillonnet, G.,
Fuguet, C., Miro-Panadès, I., Moritz, G., Durupt, J., Bernard,
C., Varreau, D., Pontes, J. J. H., Thuries, S., Coriat,
D., Harrand, M., Dutoit, D., Lattard, D., Arnaud, L.,
Charbonnier, J., Coudrain, P., Garnier, A., Berger, F.,
Gueugnot, A., Greiner, A., Meunier, Q. L., Farcy, A.,
Arriordaz, A., Chéramy, S., and Clermidy, F. Intact: A 96-
core processor with six chiplets 3d-stacked on an active interposer with
distributed interconnects and integrated power management. IEEE
Journal of Solid-State Circuits 56 (2021), 79–97. (Cited on 3)
[171] Žbontar, J., and LeCun, Y. Stereo Matching by Training a
Convolutional Neural Network to Compare Image Patches. arXiv e-prints
(Oct. 2015), arXiv:1510.05970. (Cited on 215, 218)
[172] Waeijen, L., Sioutas, S., Peemen, M., Lindwer, M., and
Corporaal, H. ConvFusion: A Model for Layer Fusion in Convolutional
Neural Networks. IEEE Access 9 (2021), 168245–168267. (Cited on 14,
15, 202, 203, 226, 230)
[173] Wang, C., and Luo, Z. A review of the optimal design of neural
networks based on fpga. Applied Sciences 12, 21 (2022). (Cited on 37)
[174] Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-
Aware Automated Quantization With Mixed Precision. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) (June 2019). (Cited on 7)
[175] Winding, M., Pedigo, B. D., Barnes, C. L., Patsolic, H. G., Park,
Y., Kazimiers, T., Fushiki, A., Andrade, I. V., Khandelwal, A.,
Valdes-Aleman, J., Li, F., Randel, N., Barsotti, E., Correia,
A., Fetter, R. D., Hartenstein, V., Priebe, C. E., Vogelstein,
J. T., Cardona, A., and Zlatic, M. The connectome of an insect
brain. Science 379, 6636 (2023), eadd9330. (Cited on 31)
[176] Wu, S.-Y., et al. A 7nm cmos platform technology featuring 4th
generation finfet transistors with a 0.027um2 high density 6-t sram cell
for mobile soc applications. In IEDM (2016). (Cited on 194)
[177] Wu, Y. N., Emer, J. S., and Sze, V. Accelergy: An architecture-
level energy estimation methodology for accelerator designs. In
2019 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD) (2019), pp. 1–8. (Cited on xxxi, 13, 133, 134, 135, 158, 159, 184,
202, 215)
[178] Xi, S. L., Yao, Y., Bhardwaj, K., Whatmough, P., Wei, G.-
Y., and Brooks, D. SMAUG: End-to-End Full-Stack Simulation
Infrastructure for Deep Learning Workloads, 2019. (Cited on 133, 134,
135)
[179] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated
Residual Transformations for Deep Neural Networks. arXiv e-prints (Nov.
2016), arXiv:1611.05431. (Cited on xxvii, 31)
[180] Xing, Y., Liang, S., Sui, L., Zhang, Z., Qiu, J., Jia, X., Liu, X.,
Wang, Y., Shan, Y., and Wang, Y. DNNVM: End-to-End Compiler
Leveraging Operation Fusion on FPGA-Based CNN Accelerators. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (New York, NY, USA, 2019), FPGA ’19,
Association for Computing Machinery, pp. 187–188. (Cited on 14, 15, 52,
203, 226, 230)
[181] Yamazaki, K., Vo-Ho, V.-K., Bulsara, D., and Le, N. Spiking
neural networks and their applications: A review. Brain Sciences 12, 7
(2022), 863. (Cited on 55)
[182] Yang, Q., and Li, H. BitSystolic: A 26.7 TOPS/W 2b∼8b NPU with
configurable data flows for edge devices. IEEE Transactions on Circuits
and Systems I: Regular Papers 68, 3 (2021), 1134–1145. (Cited on 100)
[183] Yang, T.-J., et al. A method to estimate the energy consumption of
deep neural networks. In ACSSC (2017). (Cited on 184)
[184] Yang, X., Gao, M., Liu, Q., Setter, J., Pu, J., Nayak, A., Bell,
S., Cao, K., Ha, H., Raina, P., Kozyrakis, C., and Horowitz, M.
Interstellar: Using Halide’s scheduling language to analyze DNN accelerators.
In Proceedings of the 25th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS) (2020),
pp. 369–383. (Cited on xxvii, 13, 43, 50, 133, 134, 138, 184, 218, 236, 241,
242)
[185] Yang, X., Gao, M., Pu, J., Nayak, A., Liu, Q., Bell, S. E., Setter,
J. O., Cao, K., Ha, H., Kozyrakis, C., et al. DNN dataflow choice
is overrated. arXiv preprint arXiv:1809.04070 6 (2018). (Cited on 60)
[186] Yang, X., Pu, J., Rister, B. B., Bhagdikar, N., Richardson, S.,
Kvatinsky, S., Ragan-Kelley, J., Pedram, A., and Horowitz,
M. A Systematic Approach to Blocking Convolutional Neural Networks,
2016. (Cited on 133)
[187] Zhao, Y., Li, C., Wang, Y., Xu, P., Zhang, Y., and Lin, Y. DNN-Chip
Predictor: An analytical performance predictor for DNN accelerators
with various dataflows and hardware architectures. In ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2020), pp. 1593–1597. (Cited on 184)
[188] Zheng, S., Liang, Y., Wang, S., Chen, R., and Sheng, K.
FlexTensor: An Automatic Schedule Exploration and Optimization
Framework for Tensor Computation on Heterogeneous System. Association
for Computing Machinery, New York, NY, USA, 2020, pp. 859–873. (Cited
on 235)
[189] Zheng, S., Zhang, X., Ou, D., Tang, S., Liu, L., Wei, S., and
Yin, S. Efficient Scheduling of Irregular Network Structures on CNN
Accelerators. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 39, 11 (2020), 3408–3419. (Cited on 14, 15, 203, 226,
230)
[190] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L.,
Li, C., and Sun, M. Graph neural networks: A review of methods and
applications. AI Open 1 (2020), 57–81. (Cited on 55)
[191] Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning
transferable architectures for scalable image recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 8697–8710.
(Cited on 167)
FACULTY OF ENGINEERING SCIENCE
DEPARTMENT OF ELECTRICAL ENGINEERING
ESAT-MICAS
ESAT-MICAS, Kasteelpark Arenberg 10
B-3001 Leuven
linyan.mei.ee@gmail.com
https://www.esat.kuleuven.be/micas/