
ARENBERG DOCTORAL SCHOOL

Faculty of Engineering Science

Design Space Exploration of


Deep Learning Accelerators
A journey from single MAC units to multi-core
accelerator systems

Linyan Mei

Supervisor:
Prof. dr. ir. Marian Verhelst

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Electrical Engineering

August 2023
Design Space Exploration of Deep Learning
Accelerators
A journey from single MAC units to multi-core accelerator systems

Linyan MEI

Examination committee:
Prof. dr. ir. Patrick Wollants, chair
Prof. dr. ir. Marian Verhelst, supervisor
Prof. dr. ir. Francky Catthoor
Prof. dr. ir. Wim Dehaene
Prof. dr. ir. Joni Dambre (Ghent University, Belgium)
Dr. Huichu Liu (Meta, California, USA)

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Electrical Engineering

August 2023
© 2023 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Linyan Mei, ESAT-MICAS, Kasteelpark Arenberg 10, B-3001 Leuven (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden
door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande
schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm,
electronic or any other means without written permission from the publisher.
Acknowledgements

As seasons shifted, years passed swiftly. I’m approaching the end of my seventh
year in this charming and delightful town of Leuven, which is also the fifth year
of my Ph.D. journey, and now it’s time to say goodbye.
During these years in a foreign land, a kaleidoscope of experiences unfolded.
There were joys and sorrows, meetings and farewells, and moments of
determination and doubt. Countless times I have felt how lucky I am. Along
the winding and lengthy journey, many great people have crossed my path,
singing with me when I was happy, lending me a hand when I felt powerless,
and illuminating my way when I was lost.
This experience made me grow so much: gaining professional knowledge,
encountering like-minded people, learning how to work with others, and
enhancing my understanding of the world and of myself. I am truly grateful for
it and firmly believe that no matter how many years pass, I will forever cherish
this land, this period of time, and all of you who walked alongside me.
There is so much I want to say. Let me take the opportunity to reflect upon
our time together and express my heartfelt gratitude to you.
First of all, I would like to thank my Ph.D. supervisor, Prof. Marian Verhelst.
Marian, you have brought me so much amazement over the years. Your passion
for work and your attitude toward life have deeply impacted me. Like the
sun, you are always full of energy, warming and brightening the lives of people
around you. There are so many moments we shared together that I now recall
as if they happened just yesterday. Our first encounter was in 2016 when I
was pursuing my master’s degree. At that time, you were teaching DDP in the
first semester and CA in the second semester. I liked your lectures very much,
often sat in the first row, and would record your classes to listen to over and
over again. I liked to come to you to ask questions after class, which steered
me towards the trajectory of my doctoral studies, and my eventual choice of
you as my supervisor. I vividly recall the moment when I received your email
saying that you welcomed me on board for a Ph.D. (I was doing my master’s
thesis at imec at that time), I tapped the table loudly and yelled "Nice!", which
scared everyone around me. In 2018, before I started my Ph.D., you lectured at
imec on Machine Learning Hardware and introduced lots of research progress
from your group. After the lecture, when we chatted, you said to me that you
could present the work I would be doing in a few years. What you said became
my dream. A year later, in 2019, again at the imec lecture, witnessing you
introduce my work to others, I felt so proud that the dream had come true.
When I had just started my doctoral studies, we attended a meeting at UGent
together. On the train, we discussed career and job choices. You told me how
happy it is to have a job you love, to do things you're passionate about, and to
get paid for it. My first conference paper, after your revisions, turned
mostly green on Overleaf. I thought, "Oh no, I must have written too poorly".
My colleague Sebastian, who was sitting opposite me and had two more years
of Ph.D. experience than me, comforted me and said "Don’t worry, the papers
corrected by Marian are always all green". Every time I read your edited papers,
I am amazed by how you can always explain things in a clear and concise way,
and I learned a lot. Before your first presentation about ZigZag to external
parties, you woke up at 4 or 5 in the morning to derive formula details and
make slides, and later sought my opinion. I asked why you didn't just ask me to
make the slides for you, so that you wouldn't need to push yourself so much. You
said that only by going through it yourself once can you really master it. In
numerous meetings with you, I’ve been amazed by how swiftly you grasp the
key to a problem and propose ingenious solutions. Discussions with you always
bring me a lot of inspiration and progress. In our team meetings, I particularly
enjoyed the presentation rehearsal segment, appreciating your and colleagues’
suggestions for improving slides, and I have learned valuable presentation skills
every time. When I had just begun my doctoral studies, I would get very nervous
during meetings with external collaborators. But having you around always
made me feel at ease. I knew that when I couldn’t explain things clearly, you
would help me convey the message to others. Over time, with your help, I have
grown a lot in my oral explanation skills, and finally, I am confident enough
to explain things clearly to external collaborators myself.
You cared not only about our academic progress but also about our emotional
well-being. You always provided timely guidance and emotional support when
needed. I am also always amazed at how well you have balanced your work
and life, and how you can leave enough time to spend with your family in your
busy schedule. Marian, thank you for all the things you’ve taught me over
these years — in every meeting, discussion, revised paper, and team meeting
and activity. Marian, you are a genuine role model. Meeting you made me a
better person. I’m leaving now, taking with me all that you have taught me. In
the last few years, I have witnessed you become a full professor, be named an
IEEE Fellow, and win numerous prizes. I have unwavering faith that your career
will continue to flourish in the years ahead. I just hope that, in the meantime,
you don't forget to take care of yourself, eat and sleep well, and don't put too many
burdens on yourself. Marian, having you as my Ph.D. advisor has been my
greatest fortune and honor.
I would also like to sincerely thank my doctoral examination committee,
chair Prof. Patrick Wollants, Prof. Francky Catthoor, Prof. Wim Dehaene,
Prof. Joni Dambre, and Dr. Huichu Liu for taking time and effort out of your
busy schedules to review my thesis and join my defense. I am thankful for
the insightful discussions and feedback you provided, which helped me identify
areas for improvement in the thesis. Special gratitude to Prof. Joni and Prof.
Francky for accompanying me throughout my doctoral journey, providing advice
and feedback, and ensuring I reached all milestones.
Prof. Francky was also part of my master’s thesis committee. The academic
year from 2017 to 2018 held a special place in my memory when I was working
on my master’s thesis at imec. This marked my initial interaction with you,
which occurred during the rehearsal before the defense at imec. I recall vividly
how you wrote an entire page of feedback, and patiently discussed it with
me afterward. Throughout my doctoral journey, after each milestone, you
consistently reserved time for one-on-one meetings, offering valuable feedback
and suggestions. Your insights often introduced new angles to my research,
prompting me to think more expansively. Francky, your tireless commitment
and enthusiasm for the field truly left a lasting impression on me.
Prof. Wim, thank you for helping organize my private defense, allowing me
to proceed with my defense on time even when Marian could not be present.
I’ve admired your engaging personality since attending your class during my
master’s studies. You consistently radiate a sense of care, humor, and reliability.
I also want to especially thank you for the original Ph.D. invitation you offered
me during my master’s studies. It provided me with significant confidence and
encouragement during my moments of doubt and uncertainty. Although I didn’t
choose your research topic in the end, I advertised it for you and successfully
attracted Mohit’s interest to join your team.
A heartfelt thank you to Dr. Huichu from Meta. Huichu was my advisor during
my 2021 internship at Meta and has been a collaborator over the past three to
four years. I truly appreciate all the discussions, slide creations, paper writing,
and patent filings we’ve worked on together during this time. Huichu, your
dedication and perseverance toward work, your kindness and thoughtfulness for
others have deeply impressed me. Beyond professional growth, you have also
taught me many useful skills in making slides and illustrations.

Following that, I would also like to thank Meta, in particular Edith's group with Ekin, Tony, Lita,
and Huichu, for the monthly discussions and feedback over the last several years.
These discussions have allowed me to better understand the industry’s needs,
broaden my perspective, and better adjust my research focus and direction.
Then, I would like to sincerely thank all my internal and external
collaborators. As everyone can see, my work is filled with collaborations;
without them, this thesis wouldn’t exist. In chronological order, first, I want
to thank Mohit. We worked together at imec during our master's thesis time,
sharing a desk and computer. During that time, you offered me a lot of help and
advice, and we had many interesting discussions. The idea of "Sum-together
precision-scalable MAC" was born during this time, paving the way for my
doctoral journey and our first conference paper. Your passion for the field
deeply impressed me over all these years. Next, I want to deeply thank Vincent
for the precision MAC array work we accomplished together. Working with
you is amazing. Your dedication and organization were truly inspiring. You are
also so helpful. Whenever I needed help over the years, you were always there,
patiently and eagerly addressing my doubts. In the early days of my doctoral
journey, your guidance was instrumental in getting me started.
Then, I would like to thank Vikram very much for helping me lay a solid
foundation for the cost model of ZigZag with many discussions and hand
calculations performed together before ZigZag took shape. Later on, I want to
give a special thanks to Pouya. Pouya, you joined the ZigZag journey shortly
after you joined the team. I’ll never forget the times we fought together, coding
and debugging late into the night, writing the first version of the paper, rushing
to meet deadlines, adding new experiments, handling rebuttals, and feeling down
when papers were rejected, yet finding new opportunities and achieving the final
paper publications. I’ll also never forget that we named ZigZag on the train to
Antwerp together. Thank you for your incredible effort and perseverance in
turning ZigZag from an idea into reality.
Next, I’d like to thank Ehab for continuing the work that Vincent and I started,
delving further into the precision-scalable MAC array, and gaining new insightful
design perspectives together. I also very much enjoyed the process of drawing
diagrams and writing papers with you. Following that, I want to express my
utmost gratitude to Arne for his contributions to the ZigZag project. With
Arne, we enhanced the efficiency of the mapping search engine and together
updated the old ZigZag into a new, cleaner, more concise version. Alongside
Pouya and Steven, we removed the limitations of single-core architecture and
entered into the multi-core domain, proposing the follow-up framework Stream.
Arne, working with you has been truly enjoyable. In my eyes, you are a tall
sunshine boy. You work efficiently, in a focused and structured way, and always with a bunch
of ideas. Later, I want to thank Koen for the joint accomplishment in modeling
the depth-first scheduling space on top of ZigZag. Koen, working with you is
highly efficient; you think fast and act swiftly, which I truly admire. Also, with
your substantial assistance, along with Arne's, the coding structure of ZigZag
has been significantly improved, becoming more sustainable and reaching its
current form. Finally, I want to express my sincere gratitude to all those who
have worked alongside me, on various papers, projects, theses, and courses:
Amanda, Steven, Guilherme, Jun, Jiacong, Leondra, Priya, Victor, Sebastian K.,
Nathan, Jasper, Sebastian P. G. , Yuanyang, Yunzun, Weijie, Ruben, Josse,
Lucca, Thibaut, and everyone else. Thank you for the collaboration and the
work we’ve accomplished together!
Also, I would like to thank the MICAS secretaries, Danielle and Ann, for helping
me set up things for this defense and for the constant support throughout the
years. Thanks to the MICAS IT manager, Ben, for always solving the computer
problems I encountered promptly and for providing me with private servers to conduct
massive experiments. Thanks to Ellen, the financial secretary of MICAS, for
always being very nice and helping me to get reimbursements. Once, when I
lost my original bills and had already given up, you insisted on helping, and
I got my money back.
I would like to thank all my MICAS and MNS colleagues for your sincerity
and enthusiasm in making my Ph.D. life more vibrant and colorful. Thank
you to my office mates: Vikram, Josse, and Sebastian. Because of you, the
office is full of joy. I am also truly thankful to all the people in Marian’s group
over the years, for helping and supporting each other and creating this warm
research family: Josse, Arne, Koen, Steven, Thomas, Jaro, Sander, Laura, Tom,
Sebastian, Pouya, Giuseppe, Amanda, Peishuo, Shirui, Jun, Jiacong, Ryan,
Guilherme, Nimish, Vikram, Ninad, Nitish, Ehab, Kodai, and everyone else.
I would like to give a special thanks to Vikram, who had been sitting in the
same office with me for five years and has become a great friend to whom I can
talk about everything. Over the five years, we witnessed each other’s growth.
All the company, encouragement, and support we had for each other and all
the talks, meals, and trips together are very precious memories of my Ph.D.
Also, a special thank you goes to Mohit. You are the one who, at the very beginning,
motivated me to apply for a Ph.D. In my eyes, you are a very knowledgeable
and far-sighted person whom I really respect. Many times when I was hesitant
about something in my research or life, you appeared in my office just in time
to provide me with good advice. I want to write down here the two things you
once said to me that deeply touched me. One was during our master's time
at imec: one day you explained some math theory to me, and afterward you said
something like "How beautiful it is!". Another was during a hard period in my
Ph.D., when I found out that what I wanted to do, and was already halfway done
with, had already been done and published by other research teams. You comforted me,
saying that a Ph.D. is about making a small contribution to one point rather than
contributing to everything; as long as you find that one point, you are good,
and there must be something they have overlooked that you can do better.
Besides that, I’d like to thank Vikram, Nimish, Mohit, Rico, Ninad, Ciana, and
Nitish for all the gatherings, badminton games, cycling, and Alma trips. I also
want to extend a heartfelt thank you to my roommate, Amanda. Thank you
for the meals you cooked for me and for the insulated water bottle you gifted
me when I went to Northern Europe. I really appreciate the two years we lived
together, and also the three months with Yifan joining us. Yifan, thank you for
taking me to the Sunday Heverlee market and recommending ribs and cherries
to me. Thank you to Weijie, Zongyuan, Xinfa, and Peishuo for multiple times
sharing watermelon with me and inviting me to dinner. Xinfa, the beef you
cooked and gifted to Amanda and me was the best beef I ever had. Thank you
to Jun F., Hui, Kaizhe, Thanos, Chen, Jhon, Jonah, Clara, Jiaqi, Kaicang, and
many many more people for all the chatting and joking in MICAS.
Also thank you to my friends in the TELEMIC group, Liu Hao, Liu Bin,
Zhensheng, Yang Jie, Zhang Meng, and Xuezhi for all the talks, meals, and trips
we had together. Thank you to our initial math study group – Mingxiao, Zehao,
Tingyu, Wan Bo, Zhou Han, and Yang Jie. Together, we rediscovered the beauty
of linear algebra and formed genuine friendships. Thank you to the moving
team – Vikram, Mingxiao, Tingyu, Shirui, Wan Bo, Peishuo, Yangyang, Xinfa,
and Aojie – for helping Amanda and me move house and disassemble and
reassemble furniture multiple times. Thank you to Aojie, Chenxi, and Linlin for all the
meal invitations and shopping trips. Thank you to Linlin for accompanying me
to have my wisdom teeth removed. Thank you to Shirui, Jun, and Bowen for
countless wonderful experiences of dining out, boating, hiking, and traveling.
All these moments have led to numerous cherished memories, which have added
more depth and meaning to my Ph.D. journey.
Finally, I would like to express my deepest gratitude to my dear mom and dad.
Your gift of a warm and joyous family has been a treasure beyond measure to me.
Thank you for your unconditional emotional and material support, allowing me
the freedom to pursue my passions. Bathed in your boundless love, I’ve grown
with the courage to explore new realms and welcome the world with open arms.
You are my forever role models. Your kindness to people, your positive attitude
toward life, and your perseverance and focus in working and doing other things
have profoundly shaped me. Your love, guidance, care, and unshakable belief
in me form the foundation of every accomplishment I’ve achieved.
From the time I left home at 18 to attend a university thousands of miles
away, 11 years have flown by. Every year I tried to go home once
or twice, though during the Covid period I could only return after two years. Each
time I go back, I can't help but notice the passage of time reflected in
your aging. Now I have almost finished my Ph.D. studies and am returning to
you to start a new life. Mom and dad, you accompanied me as I grew up; let
me accompany you as you grow old. Beyond mom and dad, my heart brims with
gratitude for my two grandmas, aunts, uncles, and cousins. The embrace of my
close-knit family provides solace for my soul. After 11 years of venturing through
the vastness of the world and encountering numerous people and landscapes,
I've come to truly appreciate the warmth and value of family.
Thank you for everything!

A few closing words:
A thousand words could not capture the friendship of these years;
However high the sky and far the road, our bond will not be severed;
Take good care, and may your futures be bright;
The days ahead are long, and I hope we will meet again.

Linyan Mei, August 2023, Leuven, Belgium


梅琳焱,二零二三年八月,比利时·鲁汶
Abstract

Over the past decade, deep learning has reshaped the skyline of Artificial
Intelligence, thanks to the evolution of hardware computing capability, algorithm
improvement, and the ever-growing volume of data. Driven by a wide range
of use cases in various industries, such as computer vision, natural language
processing, healthcare, finance, manufacturing, robotics, etc., deep learning
gained enormous momentum in its development and deployment.
In deployment, processing deep learning models fast and efficiently is
challenging due to their computationally intensive, data-intensive and diverse
nature. Yet, at the same time, this efficiency is critical across many use cases,
especially in resource-constrained scenarios, like mobile and IoT devices. As a
result, numerous specialized hardware accelerators have been built, taking advantage
of the intrinsic highly-parallelizable computing pattern of deep learning models.
These accelerators usually consist of an array of multiply-and-accumulate units
for parallel computing, a memory hierarchy for feeding/storing data, and a
pre-defined scheduling controller for orchestrating the computation and data
movement. As the processing efficiency is tightly coupled to these design
components, carefully constructing them is crucial. However, as the design
spaces of these components are vast and intertwined, and the traditional digital
design flow is too slow to evaluate every single design option, it is difficult to
jump out of the ad-hoc design paradigm to pursue a globally optimal solution.
To address this limitation, early-phase design space exploration is required. This
thesis contributes to this goal by, firstly, the systematic identification of the
design space of deep learning accelerators at different levels of abstraction and,
secondly, the insightful exploration of these single and joint design spaces. At
different abstraction levels, this thesis focuses on different exploration
parameters:
At the multiply-and-accumulate unit and array level, this thesis focuses
on studying the low-precision and variable-precision computing datapath, which
has been proven to bring significant latency and energy benefits for resource-
constrained systems (with no to minor algorithmic accuracy loss). This
work constructs systematic taxonomies for precision-scalable multiply-and-
accumulate units and arrays after identifying the different design options. These
taxonomies depict the skeleton of the design spaces, not only covering the existing
state-of-the-art precision-scalable designs but also uncovering new unexplored
architectural options. These different design options are then thoroughly
benchmarked with the traditional digital synthesis flow, and interesting tradeoffs and
design insights are discovered.
Moving one step higher in the abstraction to the single-core accelerator
level, we combine the multiply-and-accumulate array and a memory hierarchy,
together with various mapping/scheduling possibilities. This thesis builds
two high-level fast architecture-mapping design space exploration frameworks,
ZigZag and DeFiNES. ZigZag focuses on single-layer mapping, while DeFiNES
extends ZigZag to support depth-first scheduling. Thanks to deep learning’s
deterministic computing pattern, the built-in analytical cost models enable
these frameworks to estimate the energy and latency breakdown of processing a
deep learning model on a customized accelerator in milliseconds to seconds,
paving the way toward fast architecture/mapping search and optimization. In
this thesis, several model validation experiments and multiple case studies
demonstrate the reliability and capabilities of these frameworks.
Recently, the ever-growing model diversity and size are driving deep learning
accelerator design to the multi-core level, which combines several accelerator
cores and a network-on-chip, together with massive scheduling and layer-core
allocation possibilities. At this level, the complexity of the design space is
further increased, yet this thesis provides a framework, Stream, to systematically
tackle it. Stream is a high-level multi-core accelerator modeling and design
space exploration framework, built upon ZigZag. It can explore different core
architectures, core-to-core communication topologies, layer-core allocation, and
fine-grained layer-fused scheduling, supporting various deep learning models.
Stream paves the way for fast and systematic multi-core deep learning accelerator
design and workload deployment.
It is important to note that the creation of these different frameworks followed
a similar three-step methodology: firstly, identify different design options and
construct a unified design representation that covers these options; secondly,
based on this unified representation, build the cost models; lastly, automatically
generate different design candidates and feed them to the cost models. In this
way, the loop is closed, and the design space exploration can be conducted
automatically.

In summary, this thesis aims to clearly introduce the vast design space of deep
learning accelerators at the different abstraction levels and thoroughly explain
how the high-level design space exploration frameworks can be built to rapidly
offer design insights and guidelines. By providing the developed frameworks in
open source, we pass on the taxonomy, modeling, and exploration methodologies
applied in this thesis to future researchers.
Beknopte samenvatting

In de afgelopen tien jaar heeft deep learning de kunstmatige intelligentie een
nieuwe vorm gegeven, dankzij de evolutie van de rekencapaciteit van de hardware,
de verbetering van de algoritmen en de steeds grotere hoeveelheid gegevens.
Onder impuls van een breed scala aan toepassingen in diverse sectoren, zoals
computervisie, verwerking van natuurlijke taal, gezondheidszorg, financiën,
productie, robotica, enz. heeft deep learning een enorme impuls gekregen in
zijn ontwikkeling en toepassingen.
De snelle en efficiënte verwerking van deep learning-modellen is echter een
uitdaging vanwege hun rekenintensieve, gegevensintensieve en diverse aard.
Tegelijkertijd is deze efficiëntie cruciaal voor veel gebruikssituaties, vooral
in scenario's met beperkte middelen, zoals mobiele en IoT-apparaten. Als
gevolg daarvan worden talrijke gespecialiseerde hardwareversnellers gebouwd,
die profiteren van het intrinsieke, zeer parallelle rekenpatroon van deep learning-
modellen.
Deze versnellers bestaan meestal uit een array van vermenigvuldig-en-
accumuleer-eenheden voor parallelle berekeningen, een geheugenhiërarchie voor
het voeden/opslaan van gegevens, en een vooraf gedefinieerde controller voor het
orkestreren van de berekeningen en de verplaatsing van gegevens. Aangezien de
verwerkingsefficiëntie nauw samenhangt met het ontwerp van deze componenten,
is een zorgvuldige constructie ervan van cruciaal belang. Maar omdat de
ontwerpruimten van deze componenten enorm groot en met elkaar verweven
zijn, en de traditionele digitale ontwerpflow te traag is om elke ontwerpoptie te
evalueren, is het moeilijk om uit het ad hoc ontwerpparadigma te stappen en
een globaal optimale oplossing na te streven.
Om deze hindernis aan te pakken is exploratie van de ontwerpruimte
in een vroeg stadium nodig. Dit proefschrift draagt hiertoe bij door,
ten eerste, de systematische identificatie van de ontwerpruimte van deep
learning-versnellers op verschillende abstractieniveaus en, ten tweede, de
inzichtelijke exploratie van deze afzonderlijke en gezamenlijke ontwerpruimten.


Op verschillende abstractieniveaus richt dit proefschrift zich op
verschillende exploratieparameters:
Op het niveau van multiply-and-accumulate eenheden en arrays
richt dit proefschrift zich op het bestuderen van het datapad voor lage-
precisie en variabele-precisie rekenen, waarvan bewezen is dat het grote
voordelen biedt op het gebied van rekentijd en energie voor systemen met
beperkte middelen (zonder significant verlies van algoritmische nauwkeurigheid).
Dit werk construeert systematische taxonomieën voor precisie-schaalbare
vermenigvuldig-en-accumuleer eenheden en arrays na identificatie van de
verschillende ontwerpopties. Deze taxonomieën tonen het skelet van de
ontwerpruimten, die niet alleen de bestaande state-of-the-art precisieschaalbare
ontwerpen omvatten, maar ook nieuwe onontgonnen architecturale opties
blootleggen. Deze verschillende ontwerpopties worden vervolgens grondig
vergeleken, en er worden interessante compromissen en ontwerpinzichten ontdekt.
Een stap hoger in de abstractie op het niveau van de single-core versneller,
combineren we de multiply-and-accumulate array en een geheugenhiërarchie,
samen met verschillende mapping/scheduling mogelijkheden. Dit proefschrift
bouwt de snelle hoog-niveau architecture-mapping design space exploration
frameworks, ZigZag en DeFiNES. ZigZag richt zich op single-layer mapping,
terwijl DeFiNES ZigZag uitbreidt om depth-first scheduling te ondersteunen.
Dankzij het deterministische rekenpatroon van deep learning, stellen de
ingebouwde analytische kostenmodellen deze frameworks in staat om de energie-
en rekentijd van de verwerking van een deep learning-model op een aangepaste
versneller in milliseconden tot seconden te schatten, wat de weg vrijmaakt voor
snelle architectuur/mapping-exploratie en optimalisatie. In dit proefschrift
tonen verschillende modelvalidatie-experimenten en meerdere case studies de
betrouwbaarheid en mogelijkheden van deze frameworks aan.
Door de steeds grotere diversiteit en omvang van de modellen wordt het ontwerp
van deep learning-versnellers de laatste tijd gedreven naar het multi-core
niveau, dat meerdere versnellingscores en een netwerk-op-chip combineert,
samen met massale mogelijkheden voor scheduling en layer-core toewijzing.
Op dit niveau neemt de complexiteit van de ontwerpruimte verder toe. Dit
proefschrift biedt een raamwerk, Stream, om dit systematisch aan te pakken.
Stream is een hoog-niveau framework voor multi-core versneller modellering
en ontwerpruimte exploratie, gebouwd op ZigZag. Het kan verschillende core-
architecturen, core-to-core communicatietopologieën, layer-core toewijzingen
en fijnmazige layer-fused scheduling onderzoeken, waarbij verschillende deep
learning-modellen worden ondersteund. Stream baant de weg voor snel en
systematisch multi-core deep learning-acceleratorontwerp.

Het is belangrijk op te merken dat de totstandbrenging van deze verschillende
raamwerken telkens een vergelijkbare methodologie in drie stappen
volgde: ten eerste, verschillende ontwerpopties identificeren en een uniforme
ontwerprepresentatie construeren die deze opties omvat; ten tweede, op basis
van deze uniforme representatie de kostenmodellen construeren; ten slotte,
automatisch verschillende ontwerpkandidaten genereren en deze evalueren met de
kostenmodellen. Op deze manier wordt de lus gesloten en kan de ontwerpruimte
automatisch worden verkend.
Samengevat heeft dit proefschrift tot doel de enorme ontwerpruimte van
deep learning versnellers op deze verschillende abstractieniveaus duidelijk te
introduceren, en grondig uit te leggen hoe de hoog-niveau ontwerpruimte
exploratie-frameworks kunnen worden gebouwd om snel ontwerpinzichten te
krijgen. Door het delen van de ontwikkelde frameworks in open source, word-
en taxonomie, modellering en exploratie-methodologieën van dit proefschrift
doorgegeven aan toekomstige onderzoekers.
List of Abbreviations and
Symbols

AHM algorithm-hardware-mapping. 184–187, 194


AI artificial intelligence. xxvii, 1, 2, 174
ASIC application-specific integrated circuit. 23, 37–39
ASIP application-specific instruction-set processor. 38, 39

BG bit-group. xxix, xxx, 18, 101–109, 111–116, 119, 121, 124, 128, 129, 259,
261
BS bit-serial. xxx, 102, 105, 109, 111–113, 115–117, 119–121, 124, 128–130,
259
BW bandwidth. xxxv, 185, 186, 189–192, 195, 197–200

CC cycle count. 184


cc clock cycle. 36
CGRA coarse-grained reconfigurable array. 38, 39
CMOS complementary metal-oxide semiconductor. 4, 11, 57
CN computation node. xxxiv, 235, 236, 238–246, 249, 251, 261
Conv convolutional. xxviii, xxix, xxxi, 25, 27–30, 34, 35, 59, 60, 66, 100–103,
140, 141, 150, 158, 160, 171, 173, 174, 176, 177, 185
CPU central processing unit. 37, 39, 41, 156

DAG directed acyclic graph. 14, 19, 52, 53


DF depth-first. xxxiii–xxxv, 14, 15, 19, 201, 202, 204, 208, 210, 214–216,
218–221, 223–230, 260

DL deep learning. 2, 3, 7, 23, 24, 34, 36, 37, 39, 43, 57, 59, 128, 130
DNN deep neural networks. xxvii, xxxii, xxxiv, xxxv, 2, 3, 7, 11–17, 19, 23–26,
30–45, 48, 52, 53, 55, 57–59, 76, 97, 99–102, 128, 130–133, 135, 137, 138,
158, 164, 165, 168, 171–174, 178, 182–187, 194, 195, 197, 200, 217, 230–236,
238, 240, 246–249, 251, 252, 255, 257–259, 261, 262

DSE design space exploration. xxxii, 6, 12–17, 19, 23, 50, 131–133, 135–137,
140, 156, 158, 168, 170, 171, 176–179, 181, 182, 184, 194, 195, 197, 200,
201, 230, 231, 236, 241, 242, 254, 255, 259–262
DVFS dynamic voltage-frequency scaling. xxix, 59, 76, 89, 91, 92

DW depthwise. 25, 28–30, 102, 173, 185

FC fully-connected. xxviii, 25, 27, 30, 34, 35, 47, 59, 60, 66, 102, 173, 185

FIFO first in, first out. 42, 142, 156


FLOPs floating point operations. 32
FPGA field programmable gate array. 37–39, 184

FSM finite state machine. 50, 74


FU fully-unrolled. xxviii–xxx, 62–68, 70, 77, 79, 80, 82–86, 89, 93, 96, 97, 106,
107, 109, 111, 118, 119, 121, 124, 129, 258, 259

GA genetic algorithm. xxxiv, 236, 242, 249–251


GOP/s giga operations per second. 36, 37
GOPS giga operations per second. 36

GPU graphics processing unit. 37–39

HDL hardware description language. 117, 130


HLS high-level synthesis. 135
HS hybrid-sharing. xxix, 107–109, 116, 124

IoT internet of things. 2, 4, 24, 57, 58



IS input-sharing. 104, 105, 108, 109, 111, 115, 116, 118, 119, 121, 124, 128,
129, 259
ISA instruction set architecture. 37

LBL layer-by-layer. 202, 204, 207, 208, 216, 220, 224, 225
LPF loop prime factor. 151–153, 156, 169, 170
LSB least significant bit. 64, 72, 75
LUT look-up table. 52, 75

MAC multiply-and-accumulate. xxvii–xxix, xxxv, 7–9, 11, 18, 25, 34–36, 39–
43, 47, 48, 50, 55, 57–97, 99–106, 109, 111, 118–120, 128, 130, 136, 138,
142, 144, 147, 148, 151, 153, 157, 158, 163, 174, 184, 185, 187, 194–200,
255, 258, 259
ML machine learning. 5, 184
MSB most significant bit. 64, 75

NN neural network. xxxii, 194, 195


NoC network-on-chip. 38, 44, 184
NVM non-volatile memory. 42

OS output-sharing. xxx, 104, 105, 108, 109, 111–116, 118–121, 124, 128, 129,
259

PE processing element. xxvii, xxviii, 39, 41, 42, 48, 59–61, 140, 142, 148, 149,
157, 158, 160, 162, 164, 168, 174, 232, 234, 241, 246, 258
pJ/op pico joule per operation. 36
PSMA precision-scalable MAC array. xxxv, 18, 99, 100, 102, 103, 105, 108,
109, 111, 115, 116, 118, 120, 124, 128–130, 259, 261
PW pointwise. 25, 28, 30, 102, 185

ReLU rectified linear unit. 25


RTL register transfer level. 10, 57, 100, 184

SA sum-apart. xxviii, xxix, 59–67, 69, 70, 74, 77, 79–82, 84–86, 89, 93, 96, 258

sec second. 36
SIMD single instruction multiple data. 37

SL single-layer. 202, 204, 207, 208, 216, 224, 227, 229


SoC system-on-chip. 39
SotA state of the art. 7, 9, 10, 18, 19, 57–59, 63, 96, 99, 100, 103, 105, 111,
115, 128, 132, 133, 135, 136, 158, 182, 184, 201, 202, 204, 215, 231, 233,
235, 246, 251
ST sum-together. xxviii, xxix, 59–63, 65, 66, 68, 70, 74, 77, 79, 82–86, 89, 92,
93, 96, 97, 258
STA static timing analysis. 75

SWU subword-unrolled. xxviii–xxx, 62, 63, 69, 70, 74, 79, 80, 82–84, 86, 89,
92, 93, 96, 97, 108, 109, 111, 115, 117–119, 121, 124, 129, 258, 259

tanh hyperbolic tangent. 25


TOPS tera operations per second. 170
TOPS/W tera operations per second per watt. 36, 37

VWR very wide register. 42

WS weight-sharing. 104, 105, 118, 128, 129, 259


Contents

Abstract ix

Beknopte samenvatting xiii

List of Abbreviations and Symbols xx

Contents xxi

List of Figures xxvii

List of Tables xxxv

1 Introduction 1
1.1 Multi-level Deep Learning Acceleration in Post-Moore’s-Law Era 3
1.2 Open Research Questions . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Q1: How to efficiently execute variable-precision DNNs
on hardware? . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Q2: How to realize fast design space exploration for DNN
accelerators? . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3 Thesis Contributions and Organization . . . . . . . . . . . . . . 18
1.3.1 Taxonomy and benchmarking of precision-scalable data-
paths (Q1) . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.2 Frameworks for fast DNN accelerator DSE (Q2) . . . . 19

2 Background 23
2.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.1 Basic concepts . . . . . . . . . . . . . . . . . . . . . . . 24
2.1.2 A common misconception . . . . . . . . . . . . . . . . . 32
2.1.3 Computation and data patterns . . . . . . . . . . . . . . 34
2.2 Domain-specific Hardware for Deep Learning . . . . . . . . . . 36
2.2.1 Hardware performance metrics . . . . . . . . . . . . . . 36


2.2.2 Hardware platform types . . . . . . . . . . . . . . . . . 37


2.2.3 DNN accelerator architecture . . . . . . . . . . . . . . . 39
2.3 Mapping, Scheduling and Allocation . . . . . . . . . . . . . . . 44
2.3.1 Single-core cross-layer scheduling . . . . . . . . . . . . . 44
2.3.2 Single-core intra-layer mapping . . . . . . . . . . . . . . 45
2.3.3 Multi-core layer-core allocation and scheduling . . . . . 52
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3 Precision-Scalable MAC Unit Design Space Exploration 57


3.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 58
3.2 Dataflow Implications of Precision Scalability . . . . . . . . . . 59
3.2.1 SA and ST at algorithm level . . . . . . . . . . . . . . . 59
3.2.2 SA and ST at PE-array level . . . . . . . . . . . . . . . 59
3.2.3 SA and ST at precision-scalable MAC-unit level . . . . 60
3.2.4 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 Survey of Scalable MAC Architectures . . . . . . . . . . . . . . 63
3.3.1 Data-gated conventional MAC . . . . . . . . . . . . . . 64
3.3.2 1D Fully-Unrolled SA (DNPU) . . . . . . . . . . . . . . 64
3.3.3 1D Fully-Unrolled ST . . . . . . . . . . . . . . . . . . . 65
3.3.4 2D Fully-Unrolled SA . . . . . . . . . . . . . . . . . . . 66
3.3.5 2D Fully-Unrolled ST (BitFusion) . . . . . . . . . . . . 66
3.3.6 Subword-Unrolled SA (DVAFS) . . . . . . . . . . . . . . 69
3.3.7 Subword-Unrolled ST (ST) . . . . . . . . . . . . . . . . 70
3.3.8 1D bit-serial designs (UNPU) . . . . . . . . . . . . . . . 70
3.3.9 2D bit-serial designs (LOOM) . . . . . . . . . . . . . . . 72
3.4 Design and Benchmark Methodology . . . . . . . . . . . . . . . 74
3.4.1 Design considerations and assumptions . . . . . . . . . . 74
3.4.2 Design space . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.3 Physical implementation and timing analysis . . . . . . 75
3.4.4 Power estimation . . . . . . . . . . . . . . . . . . . . . . 76
3.4.5 DVFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5 Detailed Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.5.1 Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.5.2 Bandwidth per operation . . . . . . . . . . . . . . . . . 78
3.5.3 Throughput evaluation . . . . . . . . . . . . . . . . . . . 80
3.5.4 Area breakdown . . . . . . . . . . . . . . . . . . . . . . 82
3.5.5 Energy overhead at full precision . . . . . . . . . . . . . 83
3.5.6 Energy scaling . . . . . . . . . . . . . . . . . . . . . . . 83
3.6 Comparative Study . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.6.1 Comparison of scalable MACs at nominal voltage . . . . 89
3.6.2 Comparison of scalable MACs with DVFS . . . . . . . . 89
3.7 Comparative Study in the function of Use-Case Ratios . . . . . 92
3.7.1 Introduction and methodology . . . . . . . . . . . . . . 92

3.7.2 33% full-precision computations (equal usage) . . . . . . 92


3.7.3 20% full-precision computations . . . . . . . . . . . . . . 93
3.7.4 5% full-precision computations . . . . . . . . . . . . . . 93
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4 Precision-Scalable MAC Array Design Space Exploration 99


4.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 100
4.2 Precision-Enhanced DNN Loop Representation . . . . . . . . . 101
4.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.2 Implications on the MAC array hardware . . . . . . . . 103
4.3 Precision-Scalable MAC Array Taxonomy . . . . . . . . . . . . 103
4.3.1 Highly parameterized PSMA template . . . . . . . . . . 103
4.3.2 Spatial unrolling . . . . . . . . . . . . . . . . . . . . . . 104
4.3.3 Bit-Group unrolling . . . . . . . . . . . . . . . . . . . . 105
4.3.4 Precision scalability modes . . . . . . . . . . . . . . . . 105
4.3.5 Fully-Unrolled vs. Subword-Unrolled designs . . . . . . 106
4.3.6 Complete taxonomy and SotA mapping . . . . . . . . . 109
4.4 Uniform and Parameterized PSMA Template . . . . . . . . . . 111
4.4.1 Bit-Group configurations . . . . . . . . . . . . . . . . . 111
4.4.2 Design space constraints . . . . . . . . . . . . . . . . . . 113
4.4.3 Register layout . . . . . . . . . . . . . . . . . . . . . . . 116
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.2 Workload . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.5.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.5.4 Energy efficiency . . . . . . . . . . . . . . . . . . . . . . 119
4.5.5 Energy vs. Area . . . . . . . . . . . . . . . . . . . . . . 120
4.5.6 Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5 ZigZag: Enabling Fast DNN Accelerator-Mapping Design Space


Exploration through Analytical Modeling 131
5.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 132
5.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.3 ZigZag Framework Overview . . . . . . . . . . . . . . . . . . . 136
5.4 Design Space Representation . . . . . . . . . . . . . . . . . . . 137
5.5 Hardware Cost Estimator . . . . . . . . . . . . . . . . . . . . . 140
5.5.1 Loop Relevance Principle . . . . . . . . . . . . . . . . . 140
5.5.2 Mapping information extraction . . . . . . . . . . . . . 142
5.5.3 Hardware cost integrator . . . . . . . . . . . . . . . . . 147
5.6 Mapping Search Engines . . . . . . . . . . . . . . . . . . . . . . 148
5.6.1 Spatial mapping search engine . . . . . . . . . . . . . . 149
5.6.2 Temporal mapping search engine . . . . . . . . . . . . . 151

5.7 Architecture Generator . . . . . . . . . . . . . . . . . . . . . . . 157


5.8 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.9 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
5.9.1 Case study 1: Impact of even / uneven mapping . . . . 160
5.9.2 Case study 2: Memory hierarchy search . . . . . . . . . 162
5.9.3 Case study 3: DNN workload comparison . . . . . . . . 164
5.10 Further Improvements . . . . . . . . . . . . . . . . . . . . . . . 169
5.10.1 Faster temporal mapping search engines . . . . . . . . . 169
5.10.2 More use cases . . . . . . . . . . . . . . . . . . . . . . . 170
5.10.3 More practical considerations . . . . . . . . . . . . . . . 171
5.10.4 Upgraded framework implementation . . . . . . . . . . . 172
5.10.5 Successive frameworks . . . . . . . . . . . . . . . . . . . 182
5.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

6 An Analytical Latency Model for DNN Accelerator 183


6.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 184
6.2 Algorithm/hardware/mapping and Latency . . . . . . . . . . . 185
6.2.1 Latency impact factors . . . . . . . . . . . . . . . . . . . 185
6.2.2 Challenges in uniform latency modeling . . . . . . . . . 186
6.2.3 Proposed modeling philosophy . . . . . . . . . . . . . . 186
6.3 A Uniform Intra-Layer Latency Model . . . . . . . . . . . . . . 186
6.3.1 Prerequisite concepts and terminology . . . . . . . . . . 187
6.3.2 Step 1: Divide memory system into multiple Unit
Memories by operand and compute each DTL’s attributes 189
6.3.3 Step 2: Combine the attributes on DTLs that share same
physical memory port and serve same memory module . 191
6.3.4 Step 3: Integrate SScomb across all memory levels to
derive total temporal stall SSoverall . . . . . . . . . . . 192
6.3.5 System’s overall latency and MAC array utilization . . . 194
6.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.5 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.5.1 Case study 1: Mapping v.s. latency . . . . . . . . . . . 195
6.5.2 Case study 2: Workload size v.s. latency . . . . . . . . . 197
6.5.3 Case study 3: Hardware architecture v.s. latency . . . . 198
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

7 DeFiNES: Exploring the Depth-first Scheduling Space for DNN


Accelerators 201
7.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 202
7.2 Depth-first Design Space Identification . . . . . . . . . . . . . . 204
7.3 Unified Analytical Cost Model . . . . . . . . . . . . . . . . . . 208
7.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
7.5 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

7.5.1 An overview of experiment settings . . . . . . . . . . . . 216


7.5.2 Case study 1: Impact of depth-first strategy . . . . . . . 218
7.5.3 Case study 2: Applying depth-first to multiple workloads 223
7.5.4 Case study 3: A joint DSE of accelerator architecture
and scheduling for multiple workloads . . . . . . . . . . 225
7.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
7.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230

8 Stream: Modeling Fine-grained Layer Fusion on Multi-core DNN


Accelerators 231
8.1 Motivation and Chapter Organization . . . . . . . . . . . . . . 232
8.2 Background & Related Works . . . . . . . . . . . . . . . . . . . 233
8.2.1 Dataflow hardware architectures . . . . . . . . . . . . . 234
8.2.2 Allocation, scheduling & mapping . . . . . . . . . . . . 234
8.3 Stream framework . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.3.1 Step 1: CN identification & attribute extraction . . . . 236
8.3.2 Step 2: Fine-grained graph generation . . . . . . . . . . 239
8.3.3 Step 3: Intra-core mapping cost extraction . . . . . . . 241
8.3.4 Step 4: Layer – core allocation . . . . . . . . . . . . . . 242
8.3.5 Step 5.1: Multi-core CN scheduling . . . . . . . . . . . . 242
8.3.6 Step 5.2: Memory usage tracing . . . . . . . . . . . . . 246
8.4 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
8.5 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
8.5.1 Automated layer-core allocation impact . . . . . . . . . 249
8.5.2 Architecture impact . . . . . . . . . . . . . . . . . . . . 251
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

9 Conclusion and Future Work 257


9.1 Contributions and Conclusions . . . . . . . . . . . . . . . . . . 257
9.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

Biography 265

List of publications 267

Bibliography 271
List of Figures

1.1 Some examples of remarkable AI achievements. . . . . . . . . . 2


1.2 The overview of the multi-level deep learning acceleration. . . . 3
1.3 The word cloud of the post-Moore’s-Law era discussion. . . . . 6
1.4 Overview of the thesis contributions and organization. . . . . . . 21

2.1 An example DNN and a deep dive into one neuron of it. . . . . 24
2.2 Multiple types of DNN layers. . . . . . . . . . . . . . . . . . . . 26
2.3 Snippets of different DNNs from (a) VGG-19 [149], (b) ResNet-
18/34 [57], (c) ResNet-50/101/152, (d) ResNeXt [179], (e)
MobileNet-v1 [65], (f) MobileNet-v2 [138], and (g) Inception-
ResNet-v2 [155]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 A great number of DNNs targeting image classification on
ImageNet were developed [15]. . . . . . . . . . . . . . . . . . . 33
2.5 The relationship between FLOPs, number of parameters, and
inference latency of VGG-16 pruned models [100]. . . . . . . . . 33
2.6 An overview of DNN accelerator comparison [85]. . . . . . . . . 38
2.7 Introduction to DNN accelerator. . . . . . . . . . . . . . . . . . 40
2.8 Introduction to MAC / PE array interconnection. . . . . . . . . . 41
2.9 Energy per 16-bit access with various RF and SRAM sizes, and
for a MAC operation and a DRAM access [184]. . . . . . . . . 43
2.10 Single-core and multi-core DNN accelerator. . . . . . . . . . . . 44
2.11 Introduction to single-core cross-layer scheduling and intra-layer
mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.12 An example continuing from Figure 2.11(c) Mapping 3 to
show temporal mapping’s impact on high-level memory access.
Mapping 3a and 3b have the same spatial mapping but different
temporal mappings. . . . . . . . . . . . . . . . . . . . . . . . . 49
2.13 Examples of the nested for-loop based mapping representation,
continuing from Figure 2.11 and Figure 2.12. . . . . . . . . . . . 51
2.14 Introduction to multi-core layer-core allocation and scheduling. 53


2.15 Examples of the graph-based multi-core schedule representation,


continuing from Figure 2.14. . . . . . . . . . . . . . . . . . . . . 54

3.1 SA/ST loop identification for (a) an FC and (b) a Conv 2D layer. 60
3.2 Three types of two-dimensional PE array. . . . . . . . . . . . . . 61
3.3 Data-gated conventional MAC for either one full-precision 8b×8b,
one symmetric 4b×4b, or one weight-only 2b×8b operation. . . 64
3.4 Weight-only precision scaling in a 1D FU SA MAC configured for
either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle. 65
3.5 Weight-only precision scaling in a 1D FU ST MAC configured for
either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle. 66
3.6 Precision scaling in a 2D FU SA MAC configured for either one
8b×8b, four 4b×4b, four 2b×8b, or sixteen 2b×2b operations
per cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Precision scaling in a 2D FU ST MAC configured for one 8b×8b,
four 4b×4b, four 2b×8b, or sixteen 2b×2b operations per cycle. 68
3.8 Symmetric precision scaling in a SWU SA MAC configured for
either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle. 69
3.9 Symmetric precision scaling in a SWU ST MAC configured for
either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle. 70
3.10 Weight-only precision scaling in a bit-serial MAC configured for
either 8b×8b, 4b×8b, or 2b×8b operations. . . . . . . . . . . . . 71
3.11 Weight-only precision scaling in a 4-bit serial MAC configured for
either 8b×8b, 4b×8b, or 2b×8b (by gating the 4b×8b) operations. 71
3.12 Symmetric precision scaling in a 2D serial MAC configured for
either 8b×8b, 4b×4b, or 2b×2b operations. . . . . . . . . . . . 72
3.13 A bit-wise feed-in schedule for 2D bit-serial MAC in 4b×4b mode. 73
3.14 Symmetric precision scaling in a 2D 4-bit serial MAC configured
for either 8b×8b, 4b×4b, or 2b×2b (by gating the 4b×4b)
operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.15 Register settings for (a) SWU SA and (b) SWU ST MAC units. 74
3.16 Legend of all the bar charts in Section 3.5. 19 designs in total. 77
3.17 Bar charts of the bandwidth per clock cycle of precision-scalable
MAC units for (a) symmetric and (b) weight-only scaling scenarios. 78
3.18 Bar charts of the bandwidth per operation of precision-scalable
MAC units for (a) symmetric and (b) weight-only scaling scenarios. 79
3.19 Bar charts of the normalized circuit throughput of precision-
scalable MAC units for (a) symmetric and (b) weight-only scaling
scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.20 Bar chart of the normalized area of precision-scalable MAC units. 82
3.21 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a 1D FU SA MAC (DNPU) [146]. . . . . . . . . . . . 84

3.22 Normalized energy/op for (a) symmetric and (b) weight-only


scaling in a 1D FU ST MAC. . . . . . . . . . . . . . . . . . . . 84
3.23 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a 2D FU SA MAC. . . . . . . . . . . . . . . . . . . . 85
3.24 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a 2D FU ST MAC (BitFusion) [143]. . . . . . . . . . 85
3.25 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a SWU SA MAC (DVAFS) [117]. . . . . . . . . . . . 86
3.26 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a SWU ST MAC (ST) [107]. . . . . . . . . . . . . . . 86
3.27 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in 1D serial (UNPU) [94] and multibit-serial MACs [21].
Beware of the scale. . . . . . . . . . . . . . . . . . . . . . . . . 87
3.28 Normalized energy/op for (a) symmetric and (b) weight-only
scaling in 2D serial and multibit-serial MACs (LOOM) [142].
Beware of the scale. . . . . . . . . . . . . . . . . . . . . . . . . 87
3.29 Bar charts of the normalized energy of precision-scalable MAC
units for (a) symmetric and (b) weight-only scaling scenarios. . 88
3.30 Comparison of MAC architectures synthesized in a 28 nm
CMOS process at 1 V supply voltage in terms of energy/op
and throughput/area. . . . . . . . . . . . . . . . . . . . . . . . 90
3.31 Comparison of MAC architectures synthesized in a 28 nm CMOS
with DVFS in terms of energy/op and throughput/area. . . . . . 91
3.32 Symmetric scaling: Overall energy/op and throughput/area
of MAC architectures utilized with 33%, 20% and 5% of 8b
computations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.33 Weight-only scaling: Overall energy/op and throughput/area
of MAC architectures utilized with 33%, 20% and 5% of 8b
computations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4.1 A simple 4b precision scalable multiplier. . . . . . . . . . . . . . 101


4.2 Introducing 2 extra bit-group (BG) for-loops to the traditional
for-loop representation (introduced in Chapter 2.3.2) of a Conv
2D layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 L4: 4×4 L3 units; L3: 4×4 L2 units; L2: 4×4 L1 units. . . . . 104
4.4 Input, Hybrid, and Output Sharing (or Spatial Reuse as explained
in Chapter 2.3.2) at Level n (Ln). Results from one accumulation
island are added together spatially to form one output. . . . . . 105
4.5 L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and
input sharing (IS) in lower precision modes. . . . . . . . . . . . 106
4.6 L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and
hybrid sharing (HS) in lower precision modes. . . . . . . . . . . 107

4.7 L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and


output sharing (OS) in lower precision modes. . . . . . . . . . . 107
4.8 L2 in Subword-Unrolled (SWU) designs. BG is unrolled in L2
and (a) no sharing or (b) output sharing (OS) in lower precision
modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.9 BG is unrolled spatially at L2, L3, or temporally at L3 (BS-L3).
Assume OS at L2 and L3; Assume each level Ln contains 2×2
Ln−1 for simplicity. . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.10 IS, HS and OS at L3 for different precisions when BG is unrolled
temporally. Assume OS at L2. . . . . . . . . . . . . . . . . . . 114
4.11 An example of output accumulator register size calculation and
its run-time configurations over full/low precision modes. . . . 116
4.12 Energy per Operation (fJ) at 200 MHz. . . . . . . . . . . . . . 120
4.13 Energy per Operation (fJ) at 1 GHz. . . . . . . . . . . . . . . . 121
4.14 Energy/Op (fJ) vs. Area (mm²) at 200 MHz. . . . . . . . . . . 122
4.15 Energy/Op (fJ) vs. Area (mm²) at 1 GHz. . . . . . . . . . . 123
4.16 Legend illustration for Figure 4.17 and Figure 4.18. Input
Registers contain input activations and weights. . . . . . . . . . 125
4.17 Energy efficiency for 200 MHz. The 9 parallel bars within each
configuration group are with different L4/L3 settings, listed out
in Figure 4.16. . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.18 Energy efficiency for 1 GHz. The 9 parallel bars within each
configuration group are with different L4/L3 settings, listed out
in Figure 4.16. . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.19 Area breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.1 ZigZag framework diagram. . . . . . . . . . . . . . . . . . . . . 136


5.2 Workload summary of common neural network layers. Loop
notations (B, K, C, OY, OX, FY, FX) are used in the chapter. 137
5.3 Using the Memory-Centric Design Space Representation to
distinguish between balanced and unbalanced memory hierarchy,
even and uneven mapping. The memory unrolling is not depicted
in the left two memory hierarchy sketches for clarity. . . . . . . 139
5.4 Loop type categorized by relevance. . . . . . . . . . . . . . . . . 141
5.5 pr-loop patterns that trigger special Input data reuse. . . . . . . 141
5.6 A demonstration: extract loop information from Output loop set
of Figure 5.3(d) based on the Loop Relevance Principle. . . . . 145
5.7 (Left) Visualization of the impact of individual loops of
Figure 5.3(d) on the data access count, and (right) on the energy
consumed by different memory levels. Each colored block (right)
represents a memory level for a certain operand; the area of the
block indicates the total energy consumption at that memory
level (data access count× per data access energy). . . . . . . . 146
5.8 Comparison between different spatial mapping search methods
with 4 different neural networks. Both heuristic v1 and v2 can
find the global optimal spatial mapping points as an exhaustive
search does. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5.9 Comparison between different temporal mapping search methods
with 4 different neural networks on (row 1) number of temporal
mappings evaluated, (row 2) elapsed time in CPU hours, (row 3)
peak CPU memory usage, and (row 4) minimal mapping energy
found. This experiment is carried out on an Intel Xeon Gold
6126 CPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.10 Loop blocking assignment (left); loop ordering based on data
stationarity maximization (right). With the increasing of virtual
memory levels and the number of nested loops at each virtual
memory level, loop ordering possibility increases exponentially.
By applying data stationarity optimization, the reduction in loop
orders is significant. (c2) and (c1) orderings are equivalent for
Input since C and OY both are relevant loops that do not change
the Input’s stationarity at the below memory level that comes
from K. (d2) is sub-optimal compared to (d1) since for Output,
OY is a relevant loop and FX is an irrelevant loop, swapping them
breaks the Output’s stationarity loop chain formed by FX-FY-C. 154
5.11 Temporal mapping search engine algorithms comparison. While
exhaustive search generates all valid mapping schemes (∼1s-
100s million), heuristics are required to prune away sub-optimal
mappings. Heuristic search v1 prunes mappings at the loop-
ordering stage, and heuristic search v2 prunes at both the loop-
blocking and loop-ordering stages. Iterative search prunes at each
loop assignment iteration besides applying the previous heuristics. 155
5.12 Memory hierarchy generator overview. . . . . . . . . . . . . . . 157
5.13 Cost model validation of AlexNet [90] Conv layers on Eyeriss [30]
(left) and ENVISION [116] (right). . . . . . . . . . . . . . . . . 158
5.14 Cost model validation against an in-house accelerator’s post-
synthesis results with a voice recognition workload on energy
(left) and PE array utilization (right). . . . . . . . . . . . . . . 159
5.15 Cost model and mapping search engines validation against
Timeloop [126]+Accelergy [177] on AlexNet [90] (left) and
ResNet34 [57] (right). . . . . . . . . . . . . . . . . . . . . . . . 159
5.16 Even/uneven temporal mapping’s impact on energy and PE
array utilization (throughput) for different spatial mappings:
OYu|FYu|Ku 13|5|2 (left) and OYu|OYu|Cu 13|2|6 (right). . . . 161
5.17 The collection of Pareto-optimum temporal mappings of all 16
different spatial mappings. . . . . . . . . . . . . . . . . . . . . . 162
5.18 Memory hierarchy search for multiple layers of DarkNet19[132]. 163


5.19 Energy-Latency-Area comparison for mapping 12 NNs on 720
accelerator architectures each. Every NN-accelerator pair is
corresponding to two points in the figure, one with min-energy
mapping, one with min-latency mapping. The Pareto-optimal
accelerator architectures for each NN are highlighted and connected. 166
5.20 An example code snippet of multi-stage configuration for
performing a DSE experiment in ZigZag. . . . . . . . . . . . . . 179
5.21 Example pseudocode for showing the data passing and calling in
between the last four stages in Figure 5.20. . . . . . . . . . . . 180

6.1 (a) A timeline illustration of DNN layer operation phases.
(b) Four scenarios of latency and utilization modeling in the
computation phase. . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.2 (a) Descriptions of the terminologies used by each step in the
latency model and (b) an illustration of 3-step latency modeling
methodology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.3 Six different timeline cases of memory updating and computation,
showing memory-induced stall/slack for a single DTL. . . . . . 190
6.4 An example demonstration of (a)-(b) Step 1 (Divide) and (b)-
(c) Step 2 (Combine) for deriving the intermittent modeling
parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
6.5 (a) Block diagram of the in-house DNN accelerator. (b) Temporal
mapping and spatial unrolling schemes after Im2Col. (c) Model
validation against the hardware RTL simulation running NN
layers of different sizes. . . . . . . . . . . . . . . . . . . . . . . . 195
6.6 Case study 1: Mapping’s difference analysis and its impact on
latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
6.7 Case study 2: Workload’s impact on latency and latency breakdown.198
6.8 Case study 3: Hardware architecture’s impact on latency-area
trade-off. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

7.1 Going from (a) Single-layer-at-a-time scheduling to (b) Layer-
by-layer scheduling and to (c) Depth-first scheduling to keep
activations in lower memory levels. “L": neural network Layer;
“T": Tile; “LB": Local buffer (small on-chip memory); “GB":
Global Buffer (larger on-chip memory). . . . . . . . . . . . . . . 203
7.2 DF design space’s first axis: Tile size. For layer dimension
notation in (a): K is for output channel; C is for input channel;
OX and OY are feature map spatial dimensions; FX and FY are
weight spatial dimensions. . . . . . . . . . . . . . . . . . . . . . 205
7.3 DF design space’s second axis: Overlap storing mode. Workload
is Layer 2 and 3 in Figure 7.2(a); Legend is shared with
Figure 7.2(a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.4 Impact of tile size (first axis) and fuse depth (third axis). ST:
fused-layer STack. . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.5 DeFiNES’ overview. (*: optional input, can be set automatically.) 209
7.6 Tile type count of different tile sizes and overlap storing modes.
The workload used in this example is FSRCNN [44], whose final
output feature map’s spatial dimension is 960×540. The 3-tile-
type example is further used in Figure 7.9 and Figure 7.10. . . 210
7.7 The required data storage for different overlap storing modes.
ST: fused-layer STack. . . . . . . . . . . . . . . . . . . . . . . . 212
7.8 DeFiNES’ handling of branching. Legend is shared with
Figure 7.2(a). The grey pixels do not contribute to the right
branch. ‘FM’: feature map. . . . . . . . . . . . . . . . . . . . . 213
7.9 A visualization of the determined top memory level of each unique layer-
tile-combination for operands W, I, and O. The DF schedule is taken from
the 3-tile-type example in Fig 7.6. The hardware architecture is the Idx 2
in Table 7.1. It is worth noting that: 1) for weights, all the layers of the
first tile take weights from DRAM, and the other layer-tile-combinations
take weights from LB; 2) for input and output, all the tiles’ first layer gets
input from DRAM, all the tiles’ last layer writes output back to DRAM, and
in between either GB or LB is taken as each of their top memory level. . 214
7.10 A visualization of activation data size in tile type 2 and 3 of the example
in Figure 7.9. The capacities of LB and GB are marked out on y-axis.
Figure 7.9 and Figure 7.10 together show that 1) when the total activation
size (I+O) can fit into LB (e.g., Tile type 2 - L6), the LB is the top
memory for both I and O; 2) when the total activation size (I+O) cannot
fit into LB while either I or O can fit (e.g., Tile type 3 - L6), I is prioritized
to use LB as its top memory level while O is pushed to GB. . . . . . . 214
7.11 Compare DeFiNES’ results against DepFiN chip measurements. 215
7.12 The total energy and latency for meta-proto-like DF architecture
processing FSRCNN with different DF strategies. . . . . . . . . 219
7.13 MAC operation count for different DF strategies. . . . . . . . . 220
7.14 Memory access of different data types at different memory levels
for meta-proto-like DF architecture processing FSRCNN with
different DF strategies. . . . . . . . . . . . . . . . . . . . . . . . . 221
7.15 The total energy and latency for design points in Figure 7.14. . 223
7.16 Case study 2: Different workloads lead to different best solutions
(all results on meta-proto-like DF hardware). . . . . . . . . . . 224
7.17 Case study 3: Different hardware architectures’ energy and
latency (geometric mean across the 5 workloads) when applying
layer-by-layer or best DF scheduling strategies. . . . . . . . . . 225
7.18 Experiments to evaluate different factors in Table 7.2. . . . . . 228

8.1 A conceptual example showing different ways of scheduling a
deep neural network workload onto different hardware accelerators. 232
8.2 (a) Multi-core architecture model. (b) Example core with specific
dataflow (in red), connected to the off-chip memory port and bus
for inter-core communication. All memories, ports, and the bus,
have a limited bandwidth. . . . . . . . . . . . . . . . . . . . . . 235
8.3 Overview of the Stream framework. . . . . . . . . . . . . . . . . 237
8.4 Computation node (CN) granularity impacts scheduling flexibility. 238
8.5 Computation node (CN) attribute extraction example. Attributes
are the number of discarded inputs (red) and number of generated
outputs (green). . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
8.6 Inter-layer CN dependency generation example using R-trees [55]. 241
8.7 Working principle of latency- and memory-prioritized CN Scheduler. 244
8.8 Step 5 - An example of fine-grained CN graph scheduling
including inter-core bus communication, DRAM accesses, and
memory usage trace. . . . . . . . . . . . . . . . . . . . . . . . 245
8.9 Hardware architecture targets for the validation of Stream. . . 247
8.10 Schedule visualization of Stream for the three validation targets. 248
8.11 Seven hardware architectures used for architecture exploration. 250
8.12 Impact of the automatic layer-core allocation. The GA solution
has significantly better latency and memory requirement than
manual allocation. . . . . . . . . . . . . . . . . . . . . . . . . . 250
8.13 The best EDP point found by Stream over 5 DNNs for 7 hardware
architectures under traditional layer-by-layer scheduling and fine-
grained layer fusion. The architecture abbreviations are shown
in Figure 8.11(b). For the geometric mean, the EDP reduction
from layer-by-layer to layer-fused is shown. . . . . . . . . . . . 252
8.14 Latency and energy breakdown for the best EDP points of the
exploration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
List of Tables

2.1 A high-level DNN computation pattern summary . . . . . . . . 34

3.1 Taxonomy of precision-scalable MAC unit architectures . . . . 62

4.1 SotA Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 110


4.2 Constrained PSMA Design Space . . . . . . . . . . . . . . . . . 115
4.3 Ideal workload’s minimal size* . . . . . . . . . . . . . . . . . . 118

5.1 DNN Accelerator DSE Framework Comparison . . . . . . . . . 134


5.2 Equations for mapping information extraction . . . . . . . . . . 143
5.3 Memory Hierarchy Options for Case Study 3 . . . . . . . . . . 165
5.4 Comparison on 12 Neural Networks’ Algorithm Attribute and
Hardware Performance. Weight/Input/Output Size is the
accumulated size across all layers, assuming 8-bit precision on
ImageNet data. . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

6.1 ReqBW determined by both memory type and mapping. . . . 189

7.1 The 10 hardware architectures and 5 DNN workloads used in the
case studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
7.2 Related DF modeling framework comparison . . . . . . . . . . 227

8.1 Validation results for three targeted hardware architectures. . . 247

Chapter 1

Introduction

Over the past decade, the field of Artificial Intelligence (AI) has advanced
at an unprecedented pace. Humanity has witnessed numerous remarkable
achievements of AI that have already found a place in the history books, such
as the examples listed below, also shown in Figure 1.1:

(a) Alexnet [90] winning the ImageNet large-scale visual recognition challenge
in 2012;
(b) AlphaGo [148] beating world champion Lee Se-dol at the game of Go in
2016;
(c) Google’s Duplex [96] making a reservation at a restaurant and scheduling
a haircut over the phone in 2018;
(d) AlphaFold [83] making significant breakthroughs in predicting the 3D
structure of proteins in 2021;
(e) DALL-E [131] showing its ability to generate high-quality, photorealistic
images of non-existent objects, as well as to turn your creative visions
into AI-generated art in 2021;
(f) ChatGPT [123] and GPT-4 [124] triggering a worldwide revolution in
natural language processing and related fields in 2023.

These AI achievements are just the tip of the iceberg, which would not have
been possible without the exponentially increased hardware compute power

available, the ever-growing volume of available data, and the development of AI
algorithms, especially Deep Learning (DL).

Figure 1.1: Some examples of remarkable AI achievements. ((a) AlexNet for image
classification; (b) AlphaGo vs. a Go world champion; (c) Google Duplex performs
real-world tasks over the phone; (d) AlphaFold predicts protein structures; (e)
DALL-E turns your creative visions into AI-generated art, e.g., "An oil painting in
Vincent van Gogh style: dumplings and samosas are flying together on a starry
night."; (f) Chat with ChatGPT.)
To enable the further growth of deep learning systems, especially in resource-
constrained scenarios like mobile and Internet of Things (IoT), this thesis focuses
on exploring and optimizing the DL hardware acceleration.
Nowadays DL hardware acceleration involves sophisticated multi-level collabo-
rative design and optimization of domain-specific deep neural network (DNN)
accelerators, in which still a lot of open research questions exist. This chapter
will first introduce this multi-level concept and then elaborate on the design
pattern shift in the post-Moore’s-Law era, followed by a discussion of two
main (five sub-) research questions that the thesis aims to address. The thesis
contributions and organization are summarized at the end.

1.1 Multi-level Deep Learning Acceleration in Post-Moore’s-Law Era

Accelerating a DL application refers to improving the performance (e.g., increasing
processing speed, reducing power consumption, prolonging battery life, etc.) of
running the application on domain-specific computing platforms, which includes
multi-level design aspects, as visualized in Figure 1.2¹.

Figure 1.2: The overview of the multi-level deep learning acceleration. (The five
levels from top to bottom: Application, Algorithm, Mapping/Scheduling/Compiling,
Hardware architecture, and Hardware technology, each illustrated in the original
figure with example pictures (a)-(v).)

The first level is the application level. As discussed in the beginning, DL has
demonstrated its promising utility in a wide range of applications, such as the
ones shown in the figure: object detection, autonomous driving, predictions
of protein structure, recommendation systems, and the metaverse. Below the
application level comes the algorithm level, in which different types and shapes
of DNNs are applied for specific tasks of the applications. After that, to deploy
the DNNs on domain-specific computing platforms, the mapping and scheduling
need to be carefully planned based on the hardware architecture, after which
the compile flow transfers this information into machine code for guiding the
hardware to operate. Lying at the bottom is the hardware technology level,
including - but not limited to - CMOS technology node scaling, emerging
in-memory-computing devices, and the 2.5D and 3D chip stacking.

1 The picture sources: (a):[7], (b):[106], (c):[83], (d):[6], (e):[112], (f)-(j):[165], (k):[110],
(l):[108], (m):[152], (n):[62], (o):[78], (p):[54], (q):[151], (r):[26], (s):[137], (t):[60], (u):[170],
(v):[99]
Each level includes diverse design possibilities and its own concerns². In
the traditional design paradigm, there is usually a clear design boundary
between any two levels, and levels primarily focus on their own concerns.
This approach worked well because it was largely driven by the fundamental
scaling of CMOS technology, which allowed each level to enjoy performance and
efficiency improvements with every successive silicon generation.
However, in recent years as transistors approach their physical limits and
fabrication costs continue to rise, Moore’s Law is slowing down. Thus, it has
become more and more difficult to meet the exponential growth in computing
performance through technology scaling. Meanwhile, exciting new application
areas, from self-driving cars to IoT devices and health informatics, will
demand computation with performance and/or efficiency orders of magnitude
higher than the state of the art today [160].
Clearly, there is a gap between required and actual performance and efficiency in
contemporary hardware platforms, and with more and more new AI applications
popping up, the gap can only grow larger, preventing AI’s productivity from
being fully unleashed. So, the question to ask next is: if we can no longer count
on technology scaling, how can we continue the exponential growth in computing
performance and efficiency? Numerous researchers have been actively seeking
solutions and have
proposed their ideas:

• T. N. Theis et al. in "The End of Moore’s Law: A New Beginning for
Information Technology" [160] pointed out that the gradual end of Moore’s
law will open a new era as the focus shifts from the miniaturization of long-
established technologies to new devices, new integration technologies, and
new architectures for computing. Regarding the new architectures, they
envisioned future energy-efficient systems consisting of a large number of
accelerators executing specific operations or algorithms, their interactions
orchestrated to perform larger tasks and turned on and off as needed,
although, at risk of complicating the programming model. They also
called for greater innovation in the architecture of memory hierarchies to
reduce data movement, and believed that integrating ever larger amounts
of memory on-chip with the processor cores will continue to be a priority.

2 Note that although in Figure 1.2, each level is represented by a single rectangle, it never
means that per-level design is a single, monolithic step. In fact, each level has a huge design
space, and the design procedure can usually be organized as multiple design stages (to avoid
complexity explosion). For example, at the mapping/scheduling/compiling level, a unified
multi-stage meta flow, proposed back in the 1990s [23], already indicated the immensity and
complexity of the scheduling design space on its own.
• J. Shalf in "The Future of Computing Beyond Moore’s Law" [140] claimed
that architectural specialization and extreme heterogeneity, together with
advanced packaging technologies, are anticipated to be the near-term
response to the end of classical technology scaling. The author stressed
that in an era where specializing hardware to the application is the
only means of performance improvement, co-developing hardware and
algorithms is important. Moreover, this algorithm-hardware co-design
pattern needs to be changed dramatically to lower design and verification
costs for developing new hardware. Towards this, more agile hardware
production methods, such as using chiplets³, need to be adopted.

• T. Austin in his keynote "Preparing for a Post Moore’s Law World" [10]
emphasized that to overcome the scaling challenges, it is key to reduce
the cost of bringing a customized heterogeneous design to life so as
to let innovation flourish. Proposed actions include embracing open-
source hardware to encourage cooperation, assembling different specialized
processors to widen the applicability, auto-generating code and hardware
for highly reusable accelerators, developing tools for efficient design
space exploration, and putting together solid benchmark suites. For
ensembles of specialized processors, he raised open questions about what
the components/processors are and how they should be connected.

• M. Verhelst et al. in "ML Processors Are Going Multi-Core: A performance
dream or a scheduling nightmare?" [168] believed that to accommodate the
needs for further performance and efficiency increases, machine learning
(ML) acceleration’s future belongs to multi-core accelerator systems, which
foregoes the desire to develop a giant single core that fits all needs. They
discussed the challenges and opportunities in the enlarged hardware-
scheduling design space for multi-core accelerator systems and affirmed
the strong need for design space exploration tools. In the end, they
compared the homogeneous and heterogeneous multi-core accelerator
systems and raised open questions: it is not yet clear for which networks
and layer types heterogeneous cores are beneficial, nor what the optimum
core combination would look like; answering these requires the co-exploration
of algorithm, hardware, and scheduling.
3 Chiplets are small, modular chips that can be combined to form a complete system. They
break different pieces of functionality into tiles which are then stitched together into a mosaic
by bonding them to a common silicon substrate. Chiplets can rapidly serve diverse specialized
applications at a much lower cost and much faster turn-around [140].

Figure 1.3: The word cloud of the post-Moore’s-Law era discussion.

The different researchers’ opinions above reveal numerous consensuses:
1) Hardware architectural specialization and hardware-algorithm co-development
are ways to continue performance and efficiency gain for specific applications;
2) Customized heterogeneous design consisting of different accelerators helps
to widen the applicability of specified hardware platforms; 3) To attract more
people and let innovation flourish in this domain, it is critical to lower the cost
and speed up the design cycle of bringing such a heterogeneous system to life.
Central in this are more agile hardware design and production methods, such as
open-source hardware, hardware auto-generation, chiplets, etc; 4) Customized
heterogeneous design with versatile algorithm support further enlarges the
hardware design space, complicates the programming model, and enriches
the scheduling possibilities; 5) Design space exploration (DSE) tools play
an important role in co-exploring design spaces across different design levels,
speeding up design decision-making and algorithm-to-hardware deployment
optimization. The word cloud in Figure 1.3 summarizes the keywords from the
above discussions.

1.2 Open Research Questions

To realize the above vision: "Developing customized performant and efficient
accelerator systems that target particular applications and algorithms (i.e., DNN
in our case) in a faster and more cost-effective manner, while simultaneously
addressing the complexity of programming models and scheduling optimization",
a lot of open research questions throughout the multi-level design stack
(illustrated in Figure 1.2) are still to be answered.
This thesis aims to answer two representative questions among them. These
two questions focus on the co-development and exploration of accelerator
architecture, algorithm, and scheduling/mapping, covering a wide range of
hardware micro-architectures/architectures, versatile algorithm-to-hardware
deployment strategies, and a variety of DNN workloads.
In the upcoming subsections, the general background of each of the two targeted
questions is first introduced. Then the question is decomposed into more specific
sub-questions, followed by the basic concept explanations, the state-of-the-
art (SotA) discussions, and the remaining challenge elaborations. Lastly, the
follow-up research questions left for future work are suggested.

1.2.1 Q1: How to efficiently execute variable-precision DNNs on hardware?

Background

As discussed earlier, there is a big gap between the required and actual
performance/efficiency in DL acceleration systems, and algorithm-hardware
co-design is an important ingredient to the solution.
One typical example of algorithm-hardware co-design is neural network
quantization and custom precision-scalable accelerator architecture. Numerous
studies have shown that computing at a reduced data precision using
appropriate per-layer, or per-channel mixed-precision quantization of activations
and weights can largely improve the system’s performance/efficiency [94, 116]
with no or minor algorithmic accuracy loss [115, 174, 52].
To efficiently execute variable-precision DNNs, the key computation component
of an accelerator, the multiply-and-accumulate (MAC) unit, needs to be revised
accordingly compared to the original fixed-precision computation scenario.
Furthermore, usually a DNN accelerator contains not only one MAC unit but
multiple of them, which are connected in certain patterns into a MAC array⁴.
We therefore split the first research question into the following two more specific
sub-questions, regarding the MAC unit and MAC array respectively.

Q1.1: What is the best MAC unit architecture for variable-precision DNN
execution?

Basic concept: In a conventional fixed-precision MAC unit, two operands are
expected to be at their full precision to perform the operation. If either of the
operands has reduced precision, the unused bit positions are filled with zeros.
This results in a decrease in the hardware utilization of the MAC unit, making
it difficult to achieve substantial hardware gain from the quantized algorithm.
Variable-precision MAC units are designed to overcome this issue. Compared
to fixed-precision MAC units, variable-precision MAC units incorporate specific
control and reconfigurability to support different precision computing modes,
so as to offer greater hardware throughput and efficiency when operating in
lower-bit precision.
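To make the principle concrete, the minimal Python sketch below (an illustration of the general idea, not code from any of the cited designs) decomposes an 8-bit multiplication into 2-bit groups and accumulates the shifted partial products. A temporal-based MAC would process these bit-groups over successive cycles, whereas a spatial-based MAC would compute them on parallel low-precision sub-multipliers and merge them with shifters and adders; either way, fewer bit-groups are needed at reduced precision, which is where the extra throughput and efficiency come from.

    # Minimal sketch: building an 8b x 8b unsigned product from 2b bit-groups.
    # Fewer bit-groups are needed at lower precision, freeing up cycles
    # (temporal designs) or sub-multipliers (spatial designs).
    def split_into_bit_groups(value, total_bits=8, group_bits=2):
        """Split an unsigned integer into bit-groups, least-significant group first."""
        mask = (1 << group_bits) - 1
        return [(value >> shift) & mask for shift in range(0, total_bits, group_bits)]

    def bit_group_multiply(a, b, total_bits=8, group_bits=2):
        """Accumulate the shifted partial products of all bit-group pairs."""
        result = 0
        for i, a_bg in enumerate(split_into_bit_groups(a, total_bits, group_bits)):
            for j, b_bg in enumerate(split_into_bit_groups(b, total_bits, group_bits)):
                result += (a_bg * b_bg) << (group_bits * (i + j))
        return result

    assert bit_group_multiply(173, 94) == 173 * 94  # matches a full-precision multiply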
Existing research: There are already several MAC unit architectures proposed
to support run-time precision adjustment. In general, we can divide precision-
configurable MAC units into two categories, temporal-based and spatial-based
topologies. Temporal-based structures realize precision tuning through iterative
sequencing of operations, adding more resolution in each step, as in [94, 142,
82]; spatial-based methods, on the other hand, implement variable computing
precision via 1) an aggregation of low-precision multiplier units connected
together via a network of adders/shifters, as in [146, 143, 133], or 2) architectural
submapping and rearrangement of a full-precision MAC unit, changing signal
flow patterns and gating/activating certain circuit parts when precision changes,
as has been explored in [116] and [107]. To summarize, when operating at
lower precision, temporal-based structures finish the computation faster while
spatial-based structures do more computation in parallel. In the end, they all
achieved the goal of improving system throughput and efficiency.
Remaining challenge: If you want to design a DNN accelerator that supports
variable data precision, facing these many existing design options (with the
possibility of more undiscovered options), you may wonder which MAC unit
architecture is the most suitable one for your accelerator. This is a challenging
question to answer based on the existing literature because 1) it is hard to
compare these different designs apples to apples, as existing designs were
implemented in different technologies and have been integrated into entirely
different systems; 2) it is hard to estimate the configurability overhead, as barely
any work evaluates its performance against a baseline design, which is a MAC
unit without scalability; 3) it is hard to gain a systematic view of the design
space of different scalability techniques, as many designs are based on ad-hoc
scalability methods and lack a clear analysis of different scalability principles.

4 More background knowledge of DNN accelerators is introduced in Section 2.2.3.
To answer this question, Chapter 3 of the thesis extensively reviews the SotA
precision-scalable MAC unit architectures, unifies them in a new taxonomy, and
conducts a thorough benchmark analysis under the same process technology
across a wide range of performance targets.

Q1.2: What is the best MAC array architecture for variable-precision DNN
execution?

Basic concept: Is knowing the best MAC unit architecture equal to knowing
the best precision-scalable MAC array architecture and eventually the best
accelerator? The answer is ‘No’. This is mainly because studying the optimality
of a single precision-scalable MAC unit overlooks the possibility of amortizing
the scalability overhead across multiple MAC units at the MAC array level,
which can significantly impact energy and area. This overhead amortization
is feasible due to the structured parallel MAC computation and data reuse
opportunities present in DNNs, which allow for batch behavior to reduce the
average operation costs.
A simple analogy that illustrates this concept is buying products in retail versus
wholesale. Assume products A and B by themselves have similar values and
product B has a fancier packaging box than A. Product A thus has a lower
price than product B when purchased individually in retail, as each product is
bought with a packaging box, i.e.,

A + plain box < B + fancier box    (1.1)

But in wholesale, the situation can be reversed if multiple products B form a
batch (assume 100) and can share one packaging box, while product A cannot
share one. Thus buying 100 items of product A can be more costly as it comes
with 100 packaging boxes, i.e.,

100 × (A + plain box) > 100 × B + fancier box    (1.2)

In this analogy, you can think of products A and B as two types of precision-
scalable MAC units. The previous question (Q1.1) compares their prices in
retail, and the current question (Q1.2) moves one step forward, comparing their
price in wholesale. The large number of structured parallel MAC operations
in DNNs is the foundation for this wholesale batching, and how well products
A and B can take advantage of it depends on their architectural design. The
design factors include the co-configuration of precision-scalable MAC unit’s
topologies, MAC array’s interconnections, and different operands’ data-sharing
strategies at different precision modes.
Besides the scalability overhead amortization discussed above, another key
MAC-array-level aspect to consider is loop unrolling. A DNN layer, like
convolutional 2D layers or fully-connected layers⁵, usually consists of many
nested for-loops of MAC operations. For a fixed-precision MAC array, which
loops are spatially unrolled onto it is rather fixed, depending on the array
interconnection. For a precision-scalable MAC array, going to a lower precision
mode exposes more computation parallelism. It thus introduces concepts of
array-level interconnection reconfigurability and dynamic loop unrolling under
different precision modes, which further increase the design options compared
to a single MAC unit.
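For reference, the sketch below writes out this nested for-loop view of a Conv 2D layer in plain Python, using the loop notation (B, K, C, OY, OX, FY, FX) adopted throughout this thesis; it assumes unit stride and no padding, and is only meant to make the loop structure explicit (the representation itself is introduced formally in Chapter 2.3.2).

    # Illustrative nested for-loops of a Conv 2D layer (unit stride, no padding),
    # with loop notation B (batch), K (output channels), C (input channels),
    # OY/OX (output rows/columns), FY/FX (filter rows/columns). A MAC array
    # spatially unrolls some of these loops; a precision-scalable array can, at
    # reduced precision, additionally unroll the exposed extra parallelism.
    def conv2d(I, W, O, B, K, C, OY, OX, FY, FX):
        for b in range(B):
            for k in range(K):
                for c in range(C):
                    for oy in range(OY):
                        for ox in range(OX):
                            for fy in range(FY):
                                for fx in range(FX):
                                    O[b][k][oy][ox] += (
                                        I[b][c][oy + fy][ox + fx] * W[k][c][fy][fx]
                                    )
        return O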
Existing research: The existing research listed in Q1.1 [94, 142, 82, 146,
143, 133, 116, 107] involves system-level designs that integrate various kinds of
precision-scalable MAC units and utilize them to build widely varying types of
MAC arrays. Note that even consisting of the same type of MAC units, at the
array level these designs can still differ from each other in the interconnection
logic and correspondingly the spatial operand-sharing patterns⁶. For example,
both [94] and [82] have a temporal-based MAC unit architecture, but they share
different operands across neighboring MAC units: [94] shares input data, while
[82] shares (i.e., adds together) output data; Both [143] and [146] are with the
same type of spatial-based MAC unit architecture and operate the same at
the full-precision mode, but when the precision reduces, [143] utilizes the extra
parallelism in the array for output sharing while [146] uses it for input sharing.
Remaining challenge: The challenges of Q1.2 are similar to Q1.1’s: by just
going through the existing literature, it is hard for a designer to pinpoint the
best MAC array architecture to use for a new system, as all these published
MAC array designs have been demonstrated in different silicon technologies,
embedded in different system configurations, benchmarked with different DNN
workloads, and written with different register transfer level (RTL) coding styles.
It is thus hard to compare their relative drawbacks and merits,
preventing fast early-design-phase decision-making. In addition, compared to
Q1.1, Q1.2 has a much larger design space to explore considering the factors of
overhead amortization and loop unrolling we have discussed above.
Chapter 4 of the thesis resolves the challenge and enables us to better categorize
current SotA precision-scalable MAC array architectures, to clearly understand
the trade-offs within/between each design, and to insightfully find the optimal
architectures under different circumstances.

5 More background of DNN layers is introduced in Chapter 2.1.1.
6 More background of spatial mapping and operand sharing is introduced in Section 2.3.2.

Q1’s future work questions

Q1.1 and Q1.2 mostly address the logic topology of the precision-scalable MAC
datapath (Q1.1 on a single MAC unit, and Q1.2 on an array formed of multiple
MAC units), whereas when moving to more advanced CMOS technology nodes,
the hardware cost pressure will shift from the logic part more to the wire⁷ and
memory parts. Further research questions regarding the impact of wire and
memory on a precision-scalable DNN accelerator can be built on top of Q1, and
are left for future work.⁸

1.2.2 Q2: How to realize fast design space exploration for DNN accelerators?

Background

In light of the previous Post-Moore’s-Law Era discussion, there are two major
considerations: on the one hand, researchers suggest creating customized
and heterogeneous hardware architectures to keep up with the ever-growing
performance and efficiency demands, resulting in vast design spaces that come
from greater hardware design flexibility and more scheduling/mapping options;
on the other hand, there is a strong desire to accelerate the traditional hardware
design flow, streamline the algorithm-to-hardware deployment (mapping and
scheduling), and minimize the implementation costs. In short, the goal is to
identify, implement, and deploy the optimal (or good enough) design points out
of the vast design spaces in a quick and efficient manner.
To achieve these goals, the first step is to be able to identify the optimal (or a
good enough) design from numerous design options. So, how to assess the quality
of a design idea? The most straightforward way is to implement and fabricate
it, deploy the target algorithms onto it, and measure the hardware performance
and efficiency. However, the long fabrication cycle (months) would largely slow
down the optimization iteration, and is moreover very costly. One could also
implement the design without fabricating it, assessing the effectiveness through
simulation. While this is possible, the long cycle-accurate simulation time
(hours to days) again hinders the design iterations. Additionally, implementing
and testing one design can also take long (weeks).

7 The wire-dominated cost at the technology level can be translated into communication
network cost and memory organization cost at the architecture level.
8 This study primarily focuses on the logic topology, prioritizing it over the memory or
the wire. There are two primary reasons for this choice. Firstly, the design space is vast,
and attempting to consider everything together from the beginning would be overwhelming.
Secondly, the precision-scalable MAC unit/array logic topology serves as the cornerstone of
precision-scalable computing, significantly influencing data movement and sharing patterns,
and consequently impacting memory behavior and wiring cost. Thus, the decision to start
with the logic topology stems from its central importance in the precision-scalable system.
So, in order to find the optimal design point from numerous design options
within reasonable time and cost, the traditional methods discussed above are
infeasible. We need a novel DSE tool that can be fast enough (seconds) to
estimate the hardware cost of deploying a DNN workload onto an accelerator,
can be accurate enough to reflect the strengths and weaknesses of the actual
design, can be general enough to cover a wide range of design options of
hardware architecture, workload, and the algorithm-to-hardware deployment
strategy, can be adaptable enough to effortlessly alter these design options
from one to another for iterative optimization, and, ideally, intelligent
enough to auto-adopt the best algorithm-to-hardware deployment strategy to
guarantee the cost comparison between different algorithm-hardware design
pairs is relevant and fair.
Rome wasn’t built in a day. Developing such a tool for the DSE of customized
heterogeneous accelerator systems is not done in one go. Following the different
development stages in the journey, the general question (Q2) is broken down
into three sub-questions (Q2.1/2.2/2.3). Q2.1 focuses on fast hardware cost
(energy and latency) estimation of executing a single DNN layer on a single-core
accelerator; Q2.2 seeks out more cross-layer scheduling possibilities to improve
design metrics; Q2.3 moves one step further towards our final goal from single-
core to heterogeneous multi-core accelerator cost estimation supporting flexible
layer-core allocation and layer-fused scheduling.

Q2.1: How to design a DSE framework for a single-core DNN accelerator exploring single-layer mapping?

Basic concept: For a fast DSE framework that can assess a wide range of
accelerator architectures, support different DNN layer types and sizes, auto-
search the best mapping strategy for algorithm-to-hardware deployment, and
rapidly estimate the hardware cost (energy and latency), there are three key
factors. They are the accelerator architecture, DNN workload, and mapping,
each with a large design space (thoroughly discussed in Chapter 2).
To enable fast exploration in the joint-design space of these three factors, it
is essential to apply a high-level abstraction to represent them and capture
their key attributes. This high-level abstracted design representation should
be flexible enough to cover all the design options (i.e., different DNN layers,
accelerator architectures, and mappings) and unified to share a common data
structure. Based on this abstracted, flexible, and unified design representation,
hardware cost estimation can be performed in a fast and structured way for all
designs under exploration. Finally, linking an automatic design generator to
the hardware cost estimator (both adopting the same design representation)
and closing them in a loop can help the DSE converge to the optimal (good
enough) design points.
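The snippet below sketches this closed loop in schematic Python; every function name here is a hypothetical placeholder (it is not the interface of any particular framework), and the only point it illustrates is that a generator and an analytical cost estimator sharing one design representation can jointly retain the Pareto-optimal (energy, latency) design points.

    # Schematic DSE loop (all function names are hypothetical placeholders):
    # an architecture generator and an analytical cost estimator share a unified
    # design representation, and only Pareto-optimal (energy, latency) points survive.
    def explore(workload, hw_constraints, generate_architectures,
                search_best_mapping, estimate_cost):
        pareto = []  # list of (energy, latency, architecture, mapping) tuples
        for arch in generate_architectures(hw_constraints):
            mapping = search_best_mapping(workload, arch)   # spatial + temporal mapping
            energy, latency = estimate_cost(workload, arch, mapping)
            # keep the candidate only if no existing point dominates it
            if not any(e <= energy and l <= latency for e, l, _, _ in pareto):
                # drop existing points that the new candidate dominates
                pareto = [p for p in pareto if not (energy <= p[0] and latency <= p[1])]
                pareto.append((energy, latency, arch, mapping))
        return pareto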
Existing research: Several DSE frameworks and framework components have
emerged over the last few years, targeting single-core DNN accelerators and
single-layer mapping⁹ DSE, such as [166, 184, 91, 177, 126]. MAGNet [166]
proposes a modular DNN accelerator generator for lowering the design cost
and provides a DSE framework encompassing a designer, a mapper, and
a DL framework to enable co-optimization of architecture and application.
Interstellar [184] applies Halide’s scheduling language [129] as the high-level
abstraction to describe the design space of accelerators and proposes a formal
dataflow taxonomy to represent different spatial loop unrollings. MAESTRO [91]
introduces a set of data-centric directives to concisely specify the DNN dataflow
space and analytically estimate the execution time and energy efficiency of
the accelerator. Accelergy [177] presents a generally applicable methodology
for performing architecture-level energy estimation on accelerator designs,
and proposes a configuration language that helps the designers to describe
their systems. Timeloop [126] introduces an infrastructure for evaluating and
exploring the architecture design space of DNN accelerators, using a concise and
unified representation of the key architecture and implementation attributes of
DNN accelerators to describe a broad space of hardware topologies.
Remaining challenge: As discussed earlier, an ideal DSE tool should be fast,
accurate, general, adaptable, and intelligent. Existing research usually overlooks
some of these criteria or sacrifices one for another. For example, MAGNet [166]
trades off speed for accuracy, as its cost estimation is based on post-synthesis
results (accurate but slow); MAESTRO [91] and Accelergy [177] rely on users
to define mapping strategies (adaptable but not intelligent); Interstellar [184]
and Timeloop [126] put strict constraints on the mapping search space for faster
search speed (fast but not general enough).
Chapters 5 and 6 of this thesis focus on this challenge and build a DNN
accelerator DSE framework aimed at meeting these five criteria.
9 Detailed explanation/definition of single-core DNN accelerators and single-layer (a.k.a.
intra-layer) mapping can be found in Section 2.2.3 and 2.3.2 respectively.


Q2.2: How can the DSE framework be improved to enable depth-first scheduling in a single-core DNN accelerator?

Basic concept: Depth-first (DF) scheduling (also called deep layer fusion) is a
new algorithm-to-hardware deployment strategy that can significantly decrease
the hardware cost for accessing off-chip data. For single-layer mapping in
Q2.1, one layer has to be fully executed before its successive layer(s) can start
execution. In this way, if this one layer’s output size is too large to fit into
on-chip memory, the output data will be first pushed to off-chip DRAM and later
be fetched back on-chip as the successive layer(s)’ input for further operations,
resulting in a significant hardware cost penalty. In comparison, DF scheduling
breaks each layer into multiple smaller tiles and processes them across layers tile by tile,
e.g., finishing all layers’ first tile before starting to process all layers’ second
tile. In this way, the amount of intermediate results to be stored at a specific
instant in time is greatly reduced and thus is more likely to be able to fit into
on-chip memory or even lower-level on-chip memory (smaller capacity, more
efficient), effectively reducing the data movement hardware cost.
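As a purely illustrative sketch of the resulting execution orders, assume two fused layers A and B, each split into three tiles, where tile i of layer B depends only on tile i of layer A:

    # Illustrative only: execution order of layer-by-layer vs. depth-first (DF)
    # scheduling for two layers A and B, each split into 3 tiles, where tile i
    # of layer B depends only on tile i of layer A.
    layers, num_tiles = ["A", "B"], 3

    layer_by_layer = [(layer, t) for layer in layers for t in range(1, num_tiles + 1)]
    depth_first = [(layer, t) for t in range(1, num_tiles + 1) for layer in layers]

    print(layer_by_layer)  # [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2), ('B', 3)]
    print(depth_first)     # [('A', 1), ('B', 1), ('A', 2), ('B', 2), ('A', 3), ('B', 3)]
    # With DF, only one tile of layer A's output has to be kept alive at a time,
    # instead of layer A's complete output feature map.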
The unique mechanisms of DF scheduling and the expanded design space it
presents (e.g., tile size, fused depth, etc.) demand additional support in the
DSE framework, to model the more dedicated data movement and explore the
new and enlarged scheduling design space.
Existing research: Several cost models and mappers/schedulers supporting
DF scheduling have been proposed for DSE frameworks to quickly explore and develop
optimal DF systems, such as [86, 180, 189, 172, 19]. DNNFuser [86] focuses
on the layer-fusion mapping space search and proposes a one-shot inference-
based DF mapper (i.e., no search procedure). DNNVM [180] transforms CNN
models into the directed acyclic graph (DAG) and enumerates all potentially
profitable fusion opportunities by a heuristic subgraph isomorphism algorithm.
EfficientS [189] studies the DF scheduling of irregular network structures on DNN
accelerators and proposes their subgraph scheduling and memory-allocating
techniques. ConvFusion [172] builds a mathematical cost model for DNN
schedule supporting loop fusion and other loop transformations. Optimus [19]
presents a DAG-based DNN operator fusion algorithm, driven by a memory
cost model, to capture the achievable minimum off-chip memory overhead for
the fused operator groups.
Remaining challenge: A fast DSE framework includes two key parts:
modeling and exploration. In the exploration part, many innovative searching
algorithms have been introduced to handle the DF enlarged scheduling space,
such as the heuristic subgraph isomorphism algorithm in DNNVM [180] and a
transformer-based mapper in DNNFuser [86]. However, in the modeling part,
these existing frameworks all have missed some important factors. [172] and [19]
focus on optimizing off-chip DRAM access while ignoring the data movement
within the multi-level on-chip memory hierarchy; [180] and [189] pursue latency
improvement solely while ignoring the energy impact; [189] and [145] count
memory accesses for feature maps (a.k.a. activation) while ignoring the accesses
from weights, noting that data movement of activation and weight is one of the
major trade-off pairs in the DNN DF scheduling space. Ultimately, these missing
factors in the modeling part could bias the exploration direction and cause
substantial optimality losses in the final design.
Chapter 7 of the thesis fills in these missing factors and creates a comprehensive
DSE framework, built on top of Q2.1, for DF execution of DNN accelerators.

Q2.3: How can the DSE framework be extended to multi-core DNN accelerators and support layer-fused multi-accelerator execution?

Basic concept: To meet the ever-growing demands for hardware performance
and efficiency while accommodating the increasing model diversity and size,
DNN accelerators are shifting towards multi-core designs. This shift enlarges
the design space for both hardware architecture and workload-to-hardware
deployment. Regarding hardware architecture, new design factors, such as the
number of cores, core combination, and core-to-core communication, are added
to the consideration list. Regarding the workload-to-hardware deployment,
adding more cores also means adding more freedom for workload deployment;
new deployment possibilities, like different layer-core allocations and fine-grained
cross-core layer fusion, are to be considered.
These new design options, on the one hand, provide additional optimization
opportunities to the multi-core accelerator systems, while on the other hand,
also add complexity to the modeling and DSE frameworks.
In single-core processing, the processing flow is naturally "single-thread", i.e.,
all the hardware resources serve one computation "thread" at a time (it can be
a layer or a tile of a layer as in the DF scheduling case), making it manageable
to analytically foresee the whole processing flow and quickly estimate the cost.
In contrast, multi-core processing prefers to have many computation "threads"
working in parallel (they can be computations from different DNN workloads,
layers, or tiles) across cores. These "threads" may share and compete for
the same hardware resources (e.g., memory bandwidth and capacity); these
"threads" may also hold data dependencies and block other "threads". Scheduling
them differently changes these behaviors and thus changes the overall hardware
cost, making it hard to foresee the whole flow without actually scheduling it.
As such, it is difficult to estimate the cost purely analytically, as was possible
for the single-core cases. To handle this, dependency analysis between these
different computation "threads" and record-keeping of the shared hardware
resources are required to be added to the modeling and DSE frameworks.
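The toy sketch below (conceptual only, not the scheduler of any real framework) illustrates this record-keeping: a tile can only start once its data dependencies have finished and its allocated core is free, so the resulting cost can only be obtained by actually playing out the schedule.

    # Conceptual sketch of dependency-aware multi-core scheduling: each tile starts
    # only after its predecessors finish and its allocated core becomes free.
    # (Shared memory bandwidth and capacity tracking are omitted for brevity.)
    def schedule(tiles, deps, runtime, core_of):
        """tiles: ids in a valid topological order; deps: tile -> predecessor ids;
        runtime: tile -> duration; core_of: tile -> allocated core id."""
        finish, core_free = {}, {}
        for t in tiles:
            ready = max((finish[d] for d in deps[t]), default=0)  # data dependencies
            start = max(ready, core_free.get(core_of[t], 0))      # core availability
            finish[t] = start + runtime[t]
            core_free[core_of[t]] = finish[t]
        return finish

    # Two layers (A, B) of two tiles each; B's tile i depends on A's tile i,
    # layer A runs on core 0 and layer B on core 1:
    deps = {"A1": [], "A2": [], "B1": ["A1"], "B2": ["A2"]}
    print(schedule(["A1", "B1", "A2", "B2"], deps,
                   runtime={t: 10 for t in deps},
                   core_of={"A1": 0, "A2": 0, "B1": 1, "B2": 1}))
    # -> {'A1': 10, 'B1': 20, 'A2': 20, 'B2': 30}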
Besides these new modeling and DSE considerations discussed, it is worth
stressing the importance of supporting layer fusion for a multi-core DNN
accelerator system. In Q2.2, depth-first scheduling¹⁰ helps single-core
accelerators reduce off-chip activation data access; in Q2.3, layer fusion
helps multi-core accelerators not only relieve off-chip traffic but also expose
additional cross-layer computation parallelism to keep more cores busy (i.e.,
increase hardware utilization), which is the key for ensuring good performance
and efficiency for a multi-core system.
Existing research: Recently, a couple of works have focused on facilitating
multi-core DNN acceleration. EfficientA [159] builds a model for the scenario of
deploying a DNN model onto a set of accelerators, each with its memory
and interconnect constraints. It provides algorithms based on Integer
Programming and Dynamic Programming for model partitioning to minimize
system latency. ISO [43] claims a single DNN operator (a layer) can no longer
fully utilize the available hardware parallelism. Thus, it proposes an inter-
operator scheduler that considers both intra- and inter-operator parallelism of
a DNN model and adapts dynamic programming to find an efficient schedule
that better utilizes the hardware. Herald [92] builds a DSE framework for
co-optimizing hardware partitioning and layer scheduling for heterogeneous
dataflow accelerators on heterogeneous multi-DNN workloads. Rammer [105]
proposes a DNN compiler that optimizes the execution of DNN workloads on
massively parallel accelerators, and it also maximizes hardware utilization by
exploiting parallelism through inter- and intra- operator co-scheduling.
Remaining challenge: The goal of Q2.3 is to have a general modeling and DSE
framework for multi-core DNN accelerators that supports different multi-core
architectures, flexible layer fusion, and versatile DNN workloads. For different
design points (whether implemented designs or just design ideas), we should
be able to easily and rapidly compare and analyze system performance and
efficiency at a high level. Although the works discussed above have progressed
in related directions, none of them can yet fully meet this goal.
10 To clarify, in this thesis, we see depth-first scheduling as regular and structured layer
fusion (a subset of layer fusion). To give an example, assume there are two consecutive layers (A,
B) to be scheduled, each layer having 3 tiles (1, 2, 3), and layer A, B are tile-wisely dependent,
i.e., Bi depends on Ai and thus Bi cannot be computed before Ai (i=1,2,3). The depth-first
schedule strictly follows the order (A1, B1, A2, B2, A3, B3) to compute, while layer fusion
has more options besides this depth-first one, such as (A1, A2, B1, B2, A3, B3), (A1, A2, B1,
A3, B2, B3), (A1, B1, A2, A3, B2, B3), etc.

EfficientA [159] incorporates a very coarse hardware model with only the total
memory capacity and inter-core connection modeled for each core, making
it unsuitable for performing DSE for accelerator architectures. ISO [43]
targets homogeneous GPU systems, which is not applicable to heterogeneous
accelerators. Herald [92] models heterogeneous multi-core DNN accelerators but
without supporting layer fusion. Rammer [105], a DNN compiler that targets
commercial homogeneous GPU and IPU systems, relies on the actual hardware
measurement to exploit DNN workload parallelism and optimize scheduling.
This prevents using Rammer to explore new and undeveloped heterogeneous
architecture ideas. In addition, most of the works [159, 43, 105] only optimize
system latency without studying energy efficiency impact.
Chapter 8 of the thesis works towards closing these gaps by proposing a
general modeling and DSE framework, also built on top of Q2.1, to systematically
tackle the more complex multi-core DNN acceleration design space.

Q2’s future work questions

Ultimately, Q2 targets optimizing the DNN acceleration system through fast
DSE. However, owing to the massive design space, it is barely possible to
conduct the overall design space search and optimization at once. Thus, Q2 is
organized as three sub-questions (Q2.1/2.2/2.3) to study in sequence: Q2.1 firstly
studies single-core single-layer mapping attributes; Q2.2 builds up the scheduling
complexity and extends Q2.1’s single-layer-at-a-time scheduling to various depth-
first scheduling; Q2.3 further enlarges the hardware architecture space by moving
from Q2.1/2.2’s single-core accelerators to multi-core accelerators, and expands
the scheduling space from Q2.2’s depth-first scheduling to layer fusion (can be
seen as from "regular" layer fusion to more "free-style" layer fusion).
From the overall research methodology point of view, two major potential
caveats during this problem-solving should be paid attention to¹¹. Firstly, Q2 is
a giant question: even though it is divided into three sub-questions, each of the
three still encompasses a large design space. Within each sub-question, the thesis
provides one possible solution or direction, focusing on the parts of the design
space optimization that are believed to be the most important or fundamental
to the problem-solving while ignoring some other parts; those ignored parts may
(or may not) also have important impacts on the overall problem-solving and
are left for future research. Secondly, since Q2 contains a huge design space and
certainly needs to be divided to conquer, how to split it can also impact the final
DSE result. This thesis splits it into three parts and considers them in sequence.
The inter-dependencies between the three parts can be studied in more detail
in future research, to see what is the best interface/split position and also the
best sequence/ordering of tackling them.

11 Here things are discussed at a high level, as many DSE-related concepts have not been
introduced at this point. More detailed future work discussions are at the end of the thesis.

1.3 Thesis Contributions and Organization

After this chapter and Chapter 2, which provide the necessary background
knowledge for the rest of the text, this thesis focuses on addressing the
aforementioned open research questions (sub-questions) through the following
contributions, summarized in Figure 1.4.

1.3.1 Taxonomy and benchmarking of precision-scalable datapaths (Q1)

Chapter 3 (Q1.1) introduces a new taxonomy of precision-scalable MAC
architectures, categorizing them along multiple criteria: the type of unrolling
(spatial or temporal), the dimensions they unroll (1D, 2D or 2D symmetric)
and, for spatial-based designs, the type of accumulation (Sum Together or Sum
Apart). This new classification has not only categorized all existing architectures
but has also uncovered new design patterns that can give designers array-level
and algorithmic-level insights to choose the right type of processing elements for
their system. Along with this taxonomy, an exhaustive benchmarking of SotAs
and further architectures has been carried out in order to clearly understand
their ground principles, features, connections, and differences.
Chapter 4 (Q1.2) proposes a new loop representation extending the traditional
nested for-loops of a convolutional layer with additional Bit-Group (BG) loops
for capturing precision scalability. A new taxonomy for the precision-scalable
MAC array (PSMA) is built on top of this new loop representation, which
exploits the flexibility offered by the BG loops: The BGs can be unrolled
spatially at different hierarchical levels of the MAC array or temporally across
cycles to form bit(group)-serial architecture. Existing SotA PSMAs are mapped
to the taxonomy and implemented with a new uniform and highly parameterized
PSMA template. This template covers a large number of representative designs
in the design space spanned by the new taxonomy, and these designs are
thoroughly benchmarked. This work enables a better categorization of current
SotA PSMAs to clearly understand the trade-offs within/between each design,
and insightfully find the optimal PSMA under different circumstances.

1.3.2 Frameworks for fast DNN accelerator DSE (Q2)

Chapters 5 & 6 (Q2.1) Chapter 5 presents ZigZag, a general and fast DNN accelerator architecture-and-mapping DSE framework with an enlarged mapping search space; Chapter 6 zooms into the ZigZag cost model's latency analysis,
explaining the underlying calculus for latency estimation. In ZigZag, three
modules cooperate in synergy to enable the exploration of a much broader
space of solutions with respect to other SotA DSE frameworks. Firstly, at
the kernel level, the memory-centric design space representation and the loop
relevance principle are introduced, based on which an analytical hardware cost
estimator (with detailed energy and latency analysis) is built. Secondly, on
top of the cost estimator, two mapping search engines are created to rapidly
locate the optimal spatial and temporal mappings (even/uneven) by means
of innovative search methods. Thirdly, to enable fast hardware architecture
search, an architecture generation wrapper is added to generate all valid memory
hierarchies (balanced/unbalanced, shared/separate) given a set of high-level
hardware constraints. Benchmarking experiments against published works, in-
house accelerators, and existing DSE frameworks show the reliability of ZigZag.
Multiple case studies exploring the vast DNN accelerator design space from
different perspectives demonstrate its capability.
Chapter 7 (Q2.2) builds DeFiNES, a DNN accelerator DSE framework
extending ZigZag with cross-layer DF scheduling support. This work first
formally defines the DF design space across three design axes: tile size, overlap
storage, and fused depth. Then, a novel cost model capable of handling this
entire design space is proposed and validated against a taped-out DF-style
accelerator. This cost model resolves the challenges listed in Q2.2 by considering not only DRAM accesses and activation-induced memory accesses, but also the full on-chip memory hierarchy and the memory accesses caused by weight traffic; neglecting these can leave large gains unexploited (up to 10.2× in energy efficiency).
Three case studies are conducted based on the model, studying the trade-offs
between different DF schedules, and the impact of workload and hardware
architecture on the best DF strategy. In summary, DeFiNES allows quickly
examining the complex design space of different combinations of DF strategies
and hardware architectures.
Chapter 8 (Q2.3) introduces Stream, an architecture-scheduling DSE
framework built for multi-core DNN accelerators with support for fine-grained
layer-fused scheduling. Exploration is enabled through the combination of a
unified modeling representation (mix of DAG and nested for-loops), a rapid
fine-grained data dependency extractor, a genetic algorithm-based layer-core
allocator, and a heuristics-based fast scheduler. Validation of Stream with
three SotA hardware accelerators (single-core and homogeneous/heterogeneous
multi-core) employing layer fusion, demonstrates accurate modeling across a
wide range of scheduling granularities and hardware architectures. Case studies
of deploying modern DNNs onto a broad range of single-/multi-core accelerator
architectures highlight the unique capabilities of Stream.
Chapter 9, as the final chapter, concludes the thesis and proposes future works.
[Figure 1.4 arranges the thesis contributions along two axes: hardware exploration (MAC unit, MAC array, single-core accelerator, multi-core accelerator) and mapping/scheduling exploration (spatial and temporal intra-layer mapping; layer-by-layer, depth-first, and layer-fused cross-layer scheduling). Chapter 3: variable-precision MAC units (AICAS 2019, JETCAS 2019). Chapter 4: variable-precision MAC arrays (TCAS-I 2022). Chapters 5 & 6: ZigZag, an analytical cost model plus mapping-hardware DSE (TComp 2021, AICAS 2021, DATE 2022). Chapter 7: DeFiNES, exploring the depth-first scheduling space for DNN accelerators (HPCA 2023). Chapter 8: Stream, modeling fine-grained layer fusion on multi-core DNN accelerators (ISPASS 2023).]
Figure 1.4: Overview of the thesis contributions and organization.


Chapter 2

Background

This chapter provides background information on the three key factors in the
DSE of DL hardware acceleration: DNN algorithm (Section 2.1), hardware
(Section 2.2), and algorithm-to-hardware deployment (Section 2.3), which are
prerequisites to understanding the rest of the thesis.
Section 2.1 explains the DNN algorithm. It first introduces DNNs' basic concepts and building blocks (from a single neuron, a layer, and inter-layer interconnection to a complete DNN) in Section 2.1.1. Then, it points out a common misconception about the correlation between DNN model size and hardware processing cost in Section 2.1.2. To explain the misconception, the computation and data patterns of DNNs are discussed in Section 2.1.3.
Section 2.2 introduces domain-specific hardware for DL. It starts with discussing
the commonly used hardware performance metrics in Section 2.2.1, including
quantitative ones like latency and energy efficiency, and qualitative ones like
flexibility and programmability. Then, it compares different types of hardware
platforms used for DL acceleration in Section 2.2.2. Afterwards, in Section 2.2.3,
it zooms into the ASIC DNN accelerator and explains each hardware
component and the shift from single-core to multi-core architecture.
Section 2.3 elaborates on the deployment of DNN workloads to hardware
accelerators. It breaks down the general concept of the deployment into mapping,
scheduling, and allocation. For single-core accelerator systems, it first explains
the cross-layer scheduling in Section 2.3.1 and then the intra-layer mapping
(decoupled into spatial and temporal mappings) in Section 2.3.2; For multi-
core systems, it highlights the impact of layer-core allocation and multi-core
scheduling in Section 2.3.3.


2.1 Deep Learning

DL is a subclass of machine learning that uses artificial neural networks to enable computers to learn directly from large amounts of data without manual feature
extraction. These artificial neural networks usually consist of many layers, which
correspond to the name "deep", and after training, they can model complex
relationships between inputs and outputs during inference, which corresponds
to the name "learning".
Training and inference are two phases of developing and deploying neural networks, and this thesis focuses on the inference phase. This is because training requires large amounts of data and compute power and thus mostly happens in the cloud, while inference sits closer to the application end and is (or will be) widely deployed in edge and IoT devices, on which this thesis mostly concentrates.

2.1.1 Basic concepts

Such multi-layer artificial neural networks are also called Deep Neural Networks
(DNNs). As shown in Figure 2.1, a DNN consists of multiple interconnected
layers of neurons. In a bottom-up manner, let’s discuss its basic concepts one
by one: a neuron, a layer, the connection between layers, and the complete
neural network.

[(a) An example DNN: an input layer, hidden layers of different sizes and types (e.g., FC, Conv1D), and an output layer; (b) one neuron of the DNN: a multiply-accumulate of inputs I0..I2 with weights W0..W2 plus a bias b, followed by an activation function f(x), producing the output O.]
Figure 2.1: An example DNN and a deep dive into one neuron of it.

1) A neuron

A neuron is the basic computation unit of a DNN. As shown in Figure 2.1(b), it receives multiple inputs (I_0, I_1, I_2), performs multiply-and-accumulate (MAC) operations between the inputs and the weights (W_0, W_1, W_2), adds a bias (b), and in the end passes the result through a non-linear activation function (f(x)) to generate the final output (O). In summary, a C-input neuron performs

O = f\left( \sum_{c=0}^{C-1} W_c \times I_c + b \right)    (2.1)

in which the small c is the input index and the big C is its upper bound. The
same rule will be used for describing other DNN operations in the thesis.
The values of weights and biases are determined in the training phase. The
weights represent the strength of the connections between the input and the
neuron, and the bias is a scalar value added to the weighted sum of the inputs,
which provides an additional degree of freedom for network learning.
The values of the inputs are dynamically determined by the previous neuron(s)’s
outputs in the inference phase. Different operands (inputs, weights, bias,
outputs) can have the same or different data precision in the computation,
whose implication will be discussed further in the thesis.
The activation function introduces nonlinearity into the neuron computation,
which allows the DNN to model complex relationships between inputs and
outputs. Common activation functions include the sigmoid function, the rectified
linear unit (ReLU) function, the hyperbolic tangent (tanh) function, etc [122].
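As a concrete illustration of Equation 2.1, the following minimal Python sketch, added in this text purely for illustration (it is not code from the thesis), evaluates one C-input neuron with a ReLU activation; the input and weight values are made up:

import numpy as np

def neuron(inputs, weights, bias, activation=lambda x: np.maximum(x, 0.0)):
    """Equation 2.1: O = f(sum_c W_c * I_c + b), here with ReLU as the activation f."""
    acc = bias
    for c in range(len(inputs)):          # a C-input neuron performs C MAC operations
        acc += weights[c] * inputs[c]     # multiply-and-accumulate
    return activation(acc)

# A 3-input neuron (C = 3) with made-up values
print(neuron(np.array([1.0, 2.0, 3.0]), np.array([0.5, -0.25, 0.1]), bias=0.2))  # -> 0.5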

2) A layer

A layer is a container of neurons that takes information from the previous


layer(s)’s neurons, processes it, and then passes it to the next layer(s)’s neurons.
A layer is usually uniform, i.e., it only contains neurons of the same types of
operations, so it can easily be described and categorized.
There are multiple types of DNN layers, Figure 2.2 collects the major
ones: fully-connected (FC) layers, 2D convolutional (Conv) layers,
depthwise (DW) /pointwise (PW) Conv layers, grouped Conv layers,
average/max-pooling (Pooling) layers and element-wise sum layers,
each with their own specific purpose and characteristics.
[(a) Fully-connected (FC) layer, (b) 2D Convolutional (Conv) layer, (c) Depthwise (DW) Conv layer, (d) Pointwise (PW) Conv layer, (e) Group Conv layer, (f) Max Pooling layer, and (g) Element-wise sum layer, each sketched in terms of its input (I), weight (W), and output (O) tensors and their dimensions (C, K, G, IX/IY, OX/OY, FX/FY).]
Figure 2.2: Multiple types of DNN layers.



• FC layers are often used for classification and regression tasks. They
connect every neuron in one layer to every neuron in the next layer,
allowing the network to integrate the overall information.
A C-input, K-output FC layer can be described as:
O_k = f\left( \sum_{c=0}^{C-1} W_{k,c} \times I_c + b_k \right)    (2.2)

in which

k = \{0, 1, ..., K-1\}    (2.3)

In the equation, k/c correspond to the output/input index respectively, b_k is the bias of the k-th output, and f(x) is the activation function.
• Conv layers are commonly used in image and video processing tasks
and can be either 1D, 2D, or 3D, depending on the nature of the input
data. They apply a set of filters (a.k.a. weight kernel) to the input data,
allowing the network to learn local patterns and features.
Assuming C input channels and K output channels, a 1D Conv layer can be written as:

O_{k,ox} = f\left( \sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} W_{k,c,fx} \times I_{c,ix} + b_k \right)    (2.4)

A 2D Conv layer can be written as:

O_{k,ox,oy} = f\left( \sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} \sum_{fy=0}^{FY-1} W_{k,c,fx,fy} \times I_{c,ix,iy} + b_k \right)    (2.5)

A 3D Conv layer can be written as:

O_{k,ox,oy,oz} = f\left( \sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} \sum_{fy=0}^{FY-1} \sum_{fz=0}^{FZ-1} W_{k,c,fx,fy,fz} \times I_{c,ix,iy,iz} + b_k \right)    (2.6)

in which (taking the 2D Conv layer as an example for the rest)

k = \{0, 1, ..., K-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.7)

and

ix = SX \times (ox-1) + SFX \times (fx-1) + 1,  iy = SY \times (oy-1) + SFY \times (fy-1) + 1    (2.8)
In the equations, k/c corresponds to the output/input channel index, fx/fy corresponds to the weight kernel's length/width index, ox/oy corresponds to the output's length/width index, and ix/iy corresponds to the input's length/width index, which can be deduced from the output's and weight's positional indices ox/oy and fx/fy using Equation 2.8. SX and SY are the strides along each convolved direction, and SFX and SFY are the filter dilation rates.
• DW and PW Conv layers are used in MobileNet [66]-like architectures
for efficient feature extraction. DW layers separate the input channels
and apply a single filter to each channel, while PW layers combine all the
channels of the DW layer using a 1×1 2D Conv layer. This reduces the
computational cost while preserving its algorithmic accuracy.
For a DW Conv layer, input and output have the same number of channels,
expressed as the number of groups G. The equation can be written as:

O_{g,ox,oy} = f\left( \sum_{fx=0}^{FX-1} \sum_{fy=0}^{FY-1} W_{g,fx,fy} \times I_{g,ix,iy} + b_g \right)    (2.9)

in which

g = \{0, 1, ..., G-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.10)
Note that index g replaces the previous Conv 2D layer’s k and c indexes,
and accordingly, the dimensionality of the weight tensor reduces by 1
dimension. ix and iy follow Equation 2.8.
For a PW Conv layer, assuming C input channels and K output channels,
its equation can be written as:
O_{k,ox,oy} = f\left( \sum_{c=0}^{C-1} W_{k,c} \times I_{c,ix,iy} + b_k \right)    (2.11)

in which

k = \{0, 1, ..., K-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.12)

and

ix = ox,  iy = oy    (2.13)

PW Conv layers can be seen as normal Conv layers with FX, FY, and stride all equal to 1.

• Grouped Conv layer is another variant of the standard Conv layer to


reduce the computational cost. Here, the filters are divided into groups,
and each group only convolves with the input channels within its own
group. The output of each filter group is then concatenated along the
channel dimension to form the output feature maps. The previously
discussed DW layer can be seen as a Grouped Conv layer whose input
and output channel count inside of each group are both 1.
For a Grouped Conv layer, assuming G groups in total, and C input
channels and K output channels per group, its equation can be written as:

O_{g,k,ox,oy} = f\left( \sum_{c=0}^{C-1} \sum_{fx=0}^{FX-1} \sum_{fy=0}^{FY-1} W_{g,k,c,fx,fy} \times I_{g,c,ix,iy} + b_{g,k} \right)    (2.14)

in which

g = \{0, 1, ..., G-1\},  k = \{0, 1, ..., K-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.15)
ix and iy follow Equation 2.8.


• Pooling layers downsample the output of the previous layer by
aggregating groups of outputs into a single output. Max-pooling is
commonly used in convolutional networks to extract the most important
features, while average pooling can be used to smooth out noisy data.
For Pooling layers, the input and output have the same number of channels,
here defined by G. Taking the max pooling for example, its equation can
be written as:

O_{g,ox,oy} = \max_{fx=0}^{FX-1} \max_{fy=0}^{FY-1} \left( I_{g,ix,iy} \right)    (2.16)

in which

g = \{0, 1, ..., G-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.17)
ix and iy follow Equation 2.8.

• Element-wise layers serve the purpose of combining or fusing


information from different layers or pathways, which are often used in
skip-connection DNNs, such as ResNet [57], in order to help overcome the
problem of vanishing gradients during training.

For an element-wise sum layer, its equation can be written as:

O_{g,ox,oy} = I^{branch1}_{g,ix1,iy1} + I^{branch2}_{g,ix2,iy2}    (2.18)

in which

g = \{0, 1, ..., G-1\},  ox = \{0, 1, ..., OX-1\},  oy = \{0, 1, ..., OY-1\}    (2.19)

and

ix1 = ix2 = ox,  iy1 = iy2 = oy    (2.20)

These layer types are frequently used in a wide range of popular DNNs.
It is worth pointing out some interesting interrelations between these different types of DNN layers. In Figure 2.2, the (e) Group Conv layer can be seen as a superset of layer types (a)-(d), as also illustrated in the loop-nest sketch after this list:

(a) FC layer is Group Conv layer with only C, K dimensions, i.e., all the
other dimensions = 1;
(b) Conv layer is Group Conv layer with group dimension G = 1;
(c) DW Conv layer is Group Conv layer with input/output channel dimensions
C & K = 1;
(d) PW Conv layer is Group Conv with group dimension G = 1 and weight
kernel sizes FX & FY = 1.
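To make these interrelations concrete, the following minimal Python sketch, an illustration added in this text rather than code from the thesis frameworks, implements Equation 2.14 as a plain nested loop. It assumes zero-based indices, stride SX/SY, no padding, dilation 1, and omits the activation function; the closing comments list the parameter settings that reduce the Group Conv layer to the other layer types.

import numpy as np

def grouped_conv2d(I, W, b, SX=1, SY=1):
    """Plain loop nest of Equation 2.14 (activation omitted, no padding, dilation = 1).
    Shapes: I (G, C, IY, IX), W (G, K, C, FY, FX), b (G, K) -> O (G, K, OY, OX)."""
    G, C, IY, IX = I.shape
    _, K, _, FY, FX = W.shape
    OY, OX = (IY - FY) // SY + 1, (IX - FX) // SX + 1
    O = np.zeros((G, K, OY, OX))
    for g in range(G):                          # group loop
        for k in range(K):                      # output-channel loop
            for oy in range(OY):                # output spatial loops
                for ox in range(OX):
                    acc = b[g, k]
                    for c in range(C):          # input-channel loop
                        for fy in range(FY):    # filter loops
                            for fx in range(FX):
                                acc += W[g, k, c, fy, fx] * I[g, c, SY*oy + fy, SX*ox + fx]
                    O[g, k, oy, ox] = acc
    return O

# Reductions to the other layer types of Figure 2.2:
#   FC    : G = 1, FX = FY = 1, OX = OY = 1      Conv2D: G = 1
#   DW    : C = K = 1, G = number of channels    PW    : G = 1, FX = FY = 1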

3) Layer connection

Besides these various layer types, there are also different layer connections,
which can be grouped into one-to-one, one-to-more, and more-to-one categories.
Figure 2.3 shows several neural network snippets with previously introduced
layer types and different layer connection patterns.
One-to-one layer connections are straightforward: a single layer's output becomes the input of one and only one succeeding layer.
One-to-more and more-to-one layer connections often appear together in a
DNN with residual or branching network structures, such as ResNet [57] and
Inception-ResNet [155]. In the one-to-more connection, one layer’s output is
used as input by multiple succeeding layers, while in the more-to-one connection,
multiple layers’ outputs jointly become the next layer’s input.
Figure 2.3: Snippets of different DNNs from (a) VGG-19 [149], (b) ResNet-
18/34 [57], (c) ResNet-50/101/152, (d) ResNeXt [179], (e) MobileNet-v1 [65],
(f) MobileNet-v2 [138], and (g) Inception-ResNet-v2 [155].

Interestingly, recent research released a complete reconstruction and analysis


of a larval fruit fly's brain, showing that its architectural features include
multilayer shortcuts and prominent nested recurrent loops, which have certain
similarities to the residual DNN layer connection [175].

4) DNNs

Multiple parallel neurons form a layer, and multiple layers connected together
form a DNN. A great number of DNNs have been proposed in the past decade, targeting various tasks, such as image classification/segmentation/super-
resolution, object detection, speech recognition, keyword spotting, natural
language processing, etc.
Taking image classification on ImageNet [42] for example, Figure 2.4 from [15]
collects multiple famous DNNs, from which we see that different DNNs can
have very different parameter sizes (1s - 100s of Mega), operation counts (0.1s -
10s of Giga), and accordingly different model accuracies (Top-1 accuracy from
63.3% to 82.5%). Generally speaking, more accurate DNN models tend to have
larger model sizes. More recent networks, like CoAtNet-7 [37] and BASIC-L
(Lion, fine-tuned) [29], can reach over 90% Top-1 accuracy with Giga-level
parameters and Tera-level operations.

2.1.2 A common misconception

These high-level network attributes (accuracy, parameter size, operation


count) very well help to distinguish one network from another and provide a
straightforward impression of how good/large a network is. However, when it
comes to processing these DNNs on domain-specific hardware platforms, only
counting on these high-level network attributes to conclude which network is
more hardware-friendly (e.g., less latency or energy consumption) or which
hardware is more suitable for a certain size network, can be misleading. This is
because these high-level network attributes do not necessarily have a linear or
even positive correlation with the network’s hardware processing cost.
Figure 2.5 from [100] well demonstrates this by plotting the relationship between
total network operations (FLOPs: floating point operations), model size (number
of parameters), and inference latency of VGG-16 [149] and its random channel
pruning variants. The findings indicate that reducing FLOPs or parameters may not always lead to a reduction in latency, e.g., although model A has fewer FLOPs than model B, the pruned model A displays higher inference latency.
Similarly, the smaller model C has higher inference latency than model D.
This non-linearity can come from many workload/mapping/hardware-dependent
factors1 . In order to understand the reason behind this, the computation and
data patterns of DNNs, the basic concepts of the domain-specific hardware
1 For example, the intermediate storage cost, the spatial mapping's mismatching cost, the advanced technology node's extensive wire cost, etc.


Figure 2.4: A great number of DNNs targeting image classification on ImageNet were developed [15].

Figure 2.5: The relationship between FLOPs, number of parameters, and inference latency of VGG-16 pruned models [100].

architecture, and the workload deployment’s impact, need to be understood,


which will be explained in the following Section 2.1.3, Section 2.2.3, and
Section 2.3 respectively.

2.1.3 Computation and data patterns

Most DL models are deterministic as the model parameters are fixed once they
are trained, and the model follows a well-defined set of mathematical rules to
process the input data. Under this determinism, the computation patterns of
DNNs have several similarities and differences at each level, as summarized in
Table 2.1. Across neurons, although MAC operation is the core operation in
common, different neurons’ MAC operations can still vary in MAC precisions
and sizes; Across layers, although most of the DNN layers perform lots of parallel
MAC operations, their layer types and sizes can still be different; Across layer
connections, there are also variations.

Table 2.1: A high-level DNN computation pattern summary

Neuron level
  Similarity: MAC as the core operation, with W/I/O as the three major operands (except for Pooling and Elem-wise layers).
  Difference: MAC precision (operand W/I/O's precision), e.g., 2-bit W, 8-bit I & O; MAC size (neuron size), i.e., the number of products to be added together.

Layer level
  Similarity: Lots of parallel MAC (or other) operations in a layer.
  Difference: Layer type (neuron connectivity), e.g., Conv vs. FC; Layer size (neuron count), i.e., the number of outputs to be generated (or the number of neurons in the layer).

Layer connection level
  Similarity: Sequential (& parallel) layers in a DNN.
  Difference: Layer connectivity, e.g., one-to-one, more-to-one, one-to-more.

Side by side with the computation pattern, the data pattern, i.e., the operands’
data size and data reuse (i.e., the same data is reused in multiple different
operations), is also an important DNN characteristic. As computation and data are tightly coupled, differences in computation patterns directly correspond to differences in data patterns, as the following points and the small sizing sketch after them illustrate:

• MAC precision directly impacts data size, e.g., quantizing a DNN


(weight and activation) from 32-bit floating point to 8-bit fixed point
can largely reduce the data footprint, which can lead to a significant
reduction in the overall hardware cost (e.g., logic cost, memory cost, and
communication (wire) cost).
• Different layer types can lead to different operand distributions and data
reuse, e.g., Conv layer usually has fewer weights and more activations
with lots of weight data reuse, while FC layer is the opposite case, without
any weight data reuse.
• Different MAC/layer sizes also lead to different data patterns, e.g., even
within the same Conv layer category and under the same MAC precision,
the data size and operand distributions can still vary largely according to
MAC and layer size changes.

• Layer connectivity also impacts the operand dimensions (e.g., data


alignment), cross-layer data reuse, locality, and lifetime.
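To quantify the Conv-versus-FC contrast from the second bullet, the small Python sketch below (purely illustrative; the layer sizes are hypothetical, loosely ResNet/VGG-like) counts MACs, weights, outputs, and the resulting per-weight reuse for a 2D Conv layer (Equation 2.5, stride 1) and an FC layer (Equation 2.2):

def conv2d_stats(C, K, OX, OY, FX, FY):
    """Operand counts and weight reuse of a 2D Conv layer (Equation 2.5, stride 1)."""
    macs    = K * C * OX * OY * FX * FY
    weights = K * C * FX * FY
    outputs = K * OX * OY
    return {"MACs": macs, "weights": weights, "outputs": outputs,
            "weight_reuse": macs // weights}   # how many MACs each weight takes part in

def fc_stats(C, K):
    """The same statistics for an FC layer (Equation 2.2)."""
    macs, weights, outputs = K * C, K * C, K
    return {"MACs": macs, "weights": weights, "outputs": outputs,
            "weight_reuse": macs // weights}   # = 1: no weight reuse within one inference

print(conv2d_stats(C=64, K=64, OX=56, OY=56, FX=3, FY=3))   # weight_reuse = 3136
print(fc_stats(C=4096, K=4096))                             # weight_reuse = 1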

From the discussion above, we can comprehend that even when DNNs have a
similar operation count and/or parameter size, they can still display significant
differences in their computation and data patterns internally. This partially
clarifies the earlier common misconception.
When connecting the characteristics of DNNs we discussed above to the domain-
specific hardware design we will discuss later, two observations can be made.
On the one hand, the deterministic nature and similarities among DNNs present
good opportunities for domain-specific hardware design. On the other hand, the
diversity in the computation and data patterns among DNNs poses challenges
to achieving the optimal hardware architecture.

2.2 Domain-specific Hardware for Deep Learning

As introduced in the previous section, DNNs demand significant amounts of


computation and data, and the similarity in the computation patterns among
them lays a good foundation for building domain-specific hardware. In addition,
as transistor size approaches its physical limit, Moore’s Law begins to slow down,
making it challenging to keep the exponential growth in computing performance
through technology scaling. As a result, building domain-specific hardware for
DL has become an increasingly popular approach to continue improving the
performance and efficiency of processing DNN models.
Before delving into how to build a hardware platform for DL (Section 2.2.3),
let’s first get an overview of its design metrics (Section 2.2.1) and the various
types of DL hardware platforms that already exist (Section 2.2.2).

2.2.1 Hardware performance metrics

Many metrics are used to objectively assess the quality of different DNN hardware options, of which the most common ones are listed below (a small conversion sketch follows the list):

• Power consumption in watt (W);


• Chip or core area in millimetre square (mm2 );
• Throughput or speed in Giga operation per second2 (GOPS or GOP/s);
• Latency in second (sec) or in terms of the number of clock cycles (cc);
• Energy efficiency which combines power and throughput in Tera operation
per second per watt (TOPS/W) or in its reciprocal, pico joule per operation
(pJ/op);
• Area efficiency which combines area and throughput in Giga operation
per second per millimetre square (GOPS/mm2 );
• Hardware utilization in percentage (%), which captures the ratio between
the ideal working state of a certain hardware resource and its actual
working state. This state can be many things, like processing latency,
active on-chip area, occupied memory capacity, etc.
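The quantitative metrics above convert straightforwardly into one another. The short Python sketch below, added here as an illustration with made-up numbers for a hypothetical 2000 GOPS, 0.5 W, 4 mm2 accelerator, derives the combined metrics from speed, power, and area:

def combined_metrics(gops, power_w, area_mm2):
    """Derive the combined metrics from speed (GOPS), power (W), and area (mm^2).
    Note: x TOPS/W corresponds to 1/x pJ per operation (reciprocal relation)."""
    tops_per_w   = (gops / 1e3) / power_w       # energy efficiency in TOPS/W
    pj_per_op    = 1.0 / tops_per_w             # the same quantity expressed in pJ/op
    gops_per_mm2 = gops / area_mm2              # area efficiency in GOPS/mm^2
    return tops_per_w, pj_per_op, gops_per_mm2

print(combined_metrics(gops=2000, power_w=0.5, area_mm2=4.0))   # -> (4.0, 0.25, 500.0)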

In addition to these measurable indicators, there are also certain qualitative


criteria that hold significant importance in evaluating a DNN accelerator, like:
2 Usually operation refers to MAC operation.

• Flexibility/Reconfigurability: the ability of the hardware to be dynamically


reconfigured at runtime to enable different styles of processing flow or
data movement, so as to support various DNN workloads;
• Programmability: the level of control that a compiler has over the hardware
and the ease with which it can be programmed to perform specific tasks.

There are works trying to quantify these criteria; e.g., in [88], the authors measure
DNN accelerator flexibility by firstly identifying multiple design axes and then
providing each axis with a binary value so as to create multiple flexibility classes.

2.2.2 Hardware platform types

Multiple types of hardware platforms have been deployed to perform DNN


processing, and each has its strategies to better support DNN execution.
Figure 2.6 offers a comprehensive comparison of different types of DNN hardware
platforms, with each dot in the figure representing a published/released hardware
design. Different hardware platforms are distinguished based on the three critical
hardware performance metrics discussed earlier: speed (GOP/s), power (W)
and energy efficiency (TOPS/W).

• CPU: A Central Processing Unit (CPU) is a general-purpose processor


used in most computers. While CPUs are not as efficient as other hardware
platforms in processing DNN models, there are specific strategies that can
be used to improve their performance, like vectorization, single instruction
multiple data (SIMD), multi-threading, instruction set architecture (ISA)
extensions customized for DL kernels and designing a custom in-pipeline
execution unit for these specialized instructions [104, 75], etc.
• FPGA: A Field Programmable Gate Array (FPGA) is a programmable
hardware device that can be configured to implement different digital
circuits. On the one hand, FPGAs are more effective for processing DNN
models compared to CPU thanks to the configurability, parallelism, and
optimizations on operation, memory bandwidth and data precision [173];
On the other hand, as shown in Figure 2.6, they are less performant/efficient compared to the GPU/ASIC DNN accelerators, which is due to
their reconfigurability overhead, leading to slower processing speed and
higher power consumption.
• GPU: A Graphics Processing Unit (GPU) is a specialized processor used to
accelerate graphics processing. GPUs are widely used for DNN processing,
especially in high-performance computing scenarios, thanks to their highly
parallel architecture and high memory bandwidth. Strategies adopted by GPUs to better support DNN execution include massive parallelism (executing thousands of operations simultaneously), pipelining, memory and data precision optimization, specifically designed tensor cores (for matrix-matrix multiplication), Network-on-Chip (NoC) optimization, software optimization (specialized libraries), and more [9].

[Figure 2.6 plots speed (GOP/s) against power (W) for published CPU, GPU, FPGA, and digital/mixed-signal ASIC designs (simulation, test, and product), with iso-efficiency diagonals from 1 GOPS/W to 1000 TOPS/W; peak performance is calculated from TDP, real performance from measured power. Data collection: K. Guo et al., https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/]

Figure 2.6: An overview of DNN accelerator comparison [85].
• CGRA: Although not shown in the figure, a Coarse-Grained Reconfigurable Array (CGRA) is another important type of hardware platform designed to accelerate compute-intensive applications. This reconfigurable hardware architecture usually consists of an array of versatile processing elements that are connected through a highly reconfigurable interconnect. Common strategies applied to improve CGRA's DNN processing efficiency include exploiting spatial parallelism and leveraging dynamic reconfiguration to optimize data movement and resource sharing to the currently targeted workload. Generally speaking, CGRA's hardware performance, energy efficiency, and flexibility are all in between FPGA and ASIC [101].
• ASIC and ASIP: Application-Specific Integrated Circuit (ASIC) and
Application-Specific Instruction set Processor (ASIP) are specialized
hardware devices custom-built for particular applications. Compared
with each other, ASICs offer the highest performance and power efficiency
for a specific DNN workload but come with high development costs and
lack flexibility for changes; ASIPs offer a more adaptable and cost-effective
solution with competitive performance, making them suitable for a broader
range of DNN models. As can be seen from Figure 2.6, ASIC/ASIP3 DNN
accelerators show significant variability across all metrics and achieve the
highest performance and energy efficiency compared to other hardware
platforms. This is because ASIC/ASIP designs are relatively less flexible
(compared to previously mentioned CPU, FPGA, GPU and CGRA) and
thus more easily allow for targeted optimization of specific metrics. For
instance, the Google TPU [81] (ASIP) and Tesla NPU [157] (ASIP)
are optimized for high performance, while BinarEye [114] (ASIC) and
TinyVers [78] (ASIP) target ultra-low power consumption. Figure 2.6
shows two groups of ASIC/ASIP accelerator designs which are digital and
mixed-signal. Digital DNN accelerators are designed using only digital
components to perform DNN computations, whereas mixed-signal ones
combine both digital and analog components for DNN computations, such
as analog-in-memory computing [125].

Besides these homogeneous hardware platforms, heterogeneous systems are also


becoming more and more popular in DNN acceleration. E.g., [102] connects a GPU to an FPGA to cover end-to-end DL training and inference, and Diana [62]
system-on-chip (SoC) combines a RISC-V CPU, a digital DNN accelerator
core and an analog-in-memory computing core together to improve system’s
performance and flexibility.
This thesis is primarily concerned with ASIC/ASIP DNN accelerators, which
offer a high degree of customization and a vast design space among all the
domain-specific hardware platforms for DL. In the upcoming section, we will
delve into the general architecture and basic building blocks of ASIC/ASIP
DNN accelerators to gain a better understanding of their unique capabilities.

2.2.3 DNN accelerator architecture

Generally speaking, to build a hardware accelerator for DNN computations (data-


and compute-intensive with lots of data reuse opportunities and independent
parallel operations), it is a natural thought to put together an array of computing
elements for parallel computations, a memory system for supporting data reuse,
and interconnect them in a certain way to let the data flow smoothly in between.
This intuitive and fundamental idea is behind most of the DNN accelerators
that have been built so far.
Figure 2.7 depicts an example DNN accelerator following this idea, with its
basic building blocks highlighted: the MAC unit, the processing element (PE),
the MAC/PE array, and the memory hierarchy. Let’s discuss them one by one.
3 Figure 2.6 doesn’t distinguish ASIC and ASIP and calls all of them ASIC.
[An off-chip DRAM connects through an on-chip global buffer to per-operand local buffers (I, W, O) surrounding the MAC / PE array; each PE contains a MAC unit together with I-, W-, and O-register files. The memory hierarchy and the MAC / PE array are highlighted as the basic building blocks.]
Figure 2.7: Introduction to DNN accelerator.

1) MAC unit

A MAC unit is the fundamental building block of a DNN accelerator, designed


to perform the two most common operations in DNNs: multiplication and
addition (or accumulation). A standard MAC unit multiplies two input values
and then adds the product to an accumulator register, which stores the result
of previous MAC operations.
Besides the standard one, MAC units can have several variants, such as versions
that support different data precision [94] and those where multiple multipliers
share one accumulator, which sums all the products [151]. Note that in the
latter case, we consider it as multiple MAC units, as we count the number of
MAC units as the number of multipliers in our definition, i.e., each multiplier
corresponds to one MAC unit (with or without its exclusive accumulator).

2) PE

Different papers have different definitions for processing element/engine (PE).


For example, Eyeriss [30] defines a PE as a single MAC unit with its surrounding
local buffers, while Edge TPU [139] defines it as each SIMD island with thousands
of MACs and hundreds of Kilobyte SRAM memory. In this thesis, we follow the
PE definition from Eyeriss to call every MAC unit with its exclusive operands
(W/I/O)’s local storage (usually register files) a PE, as indicated in Figure 2.7.
PEs can hence have several variants, such as with or without local storage for
certain operands, or with different local storage sizes.

3) MAC / PE array and interconnection

As a key component of DNN accelerators, the MAC / PE array is essentially a


large array of MAC units / PEs that are interconnected together and work in
parallel. It is designed to take advantage of the highly parallel nature of DNN
computation, where large quantities of data can be processed simultaneously.
By breaking down the overall DNN computation into many small groups
of independent operations, and executing them in parallel across the array,
the DNN accelerators’ performance can be significantly boosted compared to
traditional CPU-based systems.
In addition to the PEs themselves, an array also includes an interconnection
system aiming at transferring and sharing data efficiently between different PEs
and between memory hierarchy and PEs. The typical interconnection patterns
include uni-cast, multi-cast, broadcast, and systolic, as shown in Figure 2.8.
For input operands (e.g., W and I), uni-cast means each data element is uniquely transferred from the memory to only one PE; multi-cast/broadcast means one data element is transferred from the memory to multiple/all PEs at once; systolic means one data element is transferred from one PE to its neighbour PEs one step at a time. For
output operands (e.g., O), the multi-cast/broadcast indicates data reduction
actions, i.e., results from multiple/all PEs being added together to produce
fewer outputs.
Different operands (e.g., W/I/O) can adopt different interconnection patterns
collaboratively. For example, in Figure 2.7, operand W is multi-cast to each
row of the array, operand I is multi-cast to each column, and operand O is systolically transferred between rows.

[(a) Uni-cast, (b) multi-cast, (c) broadcast, and (d) systolic patterns, shown both for input operands (data moving in) and for output operands (data coming out, where multi-cast/broadcast corresponds to reduction).]

Figure 2.8: Introduction to MAC / PE array interconnection.


Like the MAC unit and PE, which can have numerous design variations, the array
level also presents a wide range of design options, e.g., the number of dimensions
of the array, the array size along each dimension, the interconnection patterns
adopted by each operand at each array dimension, etc. These interconnection
patterns are tightly coupled to the spatial data reuse and spatial mapping,
which will be discussed in the later section (Section 2.3.2).

4) Memory hierarchy

The memory hierarchy in a DNN accelerator refers to the multiple levels of


memory used to store data required by DNN computation, which usually
includes the input operands W and I, and the output operand O produced by
the accelerator.
Taking the accelerator depicted in Figure 2.7 for example, it has a multi-level
memory hierarchy, and from the outermost memory level to the innermost
memory level, there are DRAM, global buffer (GB), different operands’ local
buffer (LB) and inner-PE register files (RF). The DRAM is the external memory,
which provides high capacity but with low bandwidth and high access cost. The
GB is the largest on-chip memory, which is usually a multi-bank scratchpad
memory and is implemented with SRAM or recently also with non-volatile
memory (NVM), like magnetoresistive RAM (MRAM) [78]. The LB is a small
on-chip memory, which can be implemented with SRAM, First-in-First-out buffer
(FIFO), Very Wide Register (VWR) [130], etc., directly talking to the GB storage
on one side and to the PE array on another side, providing/receiving/organizing
large amounts of ordered data in parallel. The RF is the smallest and also the
closest memory level to the MAC unit, which takes inputs from the LB, supports
MAC operations, accumulates the local results, and writes these results back to
LB.
At the system architecture level, the design of the memory hierarchy also creates
a vast design space with many tuning knobs to consider. These include factors (a hypothetical configuration sketch follows the list):

• from a memory instance point of view: the memory size, bandwidth, port
number, port type, whether the memory is double-buffered or not, etc;
• from operands’ memory usage point of view: different operands can share
or not share a memory instance, and if shared, how to dynamically allocate
the memory resources (capacity/port/bandwidth) to different operands
when processing different DNN workloads;
• from the whole memory hierarchy point of view: how many memory levels there are in the hierarchy, and whether to allow certain runtime flexibility, such as memory bypassing (e.g., bypassing the weight LB by allowing the GB to sometimes talk directly to the weight RF level) or memory operand swapping (e.g., the weight memory when executing one neural network layer can become the input memory when executing another layer).
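As an illustration only, not the input format of any of the thesis frameworks and with made-up sizes and bandwidths, the hierarchy of Figure 2.7 with the above tuning knobs could be described as follows:

# Hypothetical description of the memory hierarchy of Figure 2.7; every number is illustrative.
memory_hierarchy = [
    {"name": "DRAM",         "size_kB": 1 << 20, "bw_bit_per_cc": 64,  "ports": 1,
     "double_buffered": False, "operands": ["W", "I", "O"]},   # off-chip, shared
    {"name": "GlobalBuffer", "size_kB": 256,     "bw_bit_per_cc": 256, "ports": 2,
     "double_buffered": True,  "operands": ["W", "I", "O"]},   # on-chip, shared by all operands
    {"name": "W_LB",         "size_kB": 16,      "bw_bit_per_cc": 128, "ports": 2,
     "double_buffered": True,  "operands": ["W"]},             # per-operand local buffers
    {"name": "I_LB",         "size_kB": 16,      "bw_bit_per_cc": 128, "ports": 2,
     "double_buffered": True,  "operands": ["I"]},
    {"name": "O_LB",         "size_kB": 16,      "bw_bit_per_cc": 128, "ports": 2,
     "double_buffered": True,  "operands": ["O"]},
    {"name": "RF",           "size_kB": 0.125,   "bw_bit_per_cc": 48,  "ports": 3,
     "double_buffered": False, "operands": ["W", "I", "O"]},   # inner-PE register files
]
# Knobs like memory bypassing or operand swapping would show up as alternative
# "operands" assignments or as extra connections between non-adjacent levels.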

Memory hierarchy design and data movement optimization are critically important for a DNN accelerator system, as a memory access is usually more costly than a MAC operation, and different memory design options can lead to large differences in data access cost, as shown in Figure 2.9 [184].

Figure 2.9: Energy per 16-bit access with various RF and SRAM sizes, and for a MAC operation and a DRAM access [184]. (RF: 0.03 pJ at 16 B up to 0.96 pJ at 512 B; SRAM: 6 pJ at 32 KB up to 30.375 pJ at 512 KB; MAC operation: 0.075 pJ; one-hop inter-PE communication: 0.035 pJ; DRAM access: 200 pJ.)
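To make these per-access costs tangible, the toy Python sketch below combines the energies transcribed from Figure 2.9 with a made-up access profile (the profile is purely illustrative, not a measured workload, and this is not the cost model of the thesis frameworks):

# Per-access / per-operation energies (pJ, 16-bit) transcribed from Figure 2.9 [184]
ENERGY_PJ = {"RF_64B": 0.12, "SRAM_128KB": 13.5, "DRAM": 200.0, "MAC": 0.075}

def total_energy_uj(counts):
    """Sum energy (in uJ) over a dict of {resource: number of accesses or MACs}."""
    return sum(ENERGY_PJ[r] * n for r, n in counts.items()) / 1e6

# Made-up profile: 1M MACs whose operands mostly hit the RF, occasionally SRAM/DRAM
profile = {"MAC": 1_000_000, "RF_64B": 3_000_000, "SRAM_128KB": 60_000, "DRAM": 3_000}
print(total_energy_uj(profile))   # ~1.85 uJ, of which only ~0.075 uJ is spent on the MACs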
5) From single-core to multi-core accelerator

So far, we have seen the major building blocks of a single-core DNN accelerator, as shown in Figure 2.7, and understood that the many possible design variations for each building block collectively create a vast design space. Moving one step forward, to further improve hardware performance and flexibility in handling increasingly large and complex DL workloads, multi-core DNN accelerators are gaining popularity in recent years [168].

[(a) Single-core accelerator: an off-chip DRAM connected over a bus to one accelerator core; (b) multi-core accelerator: an off-chip DRAM connected over a communication bus to multiple accelerator cores.]

Figure 2.10: Single-core and multi-core DNN accelerator.

Going from a single-core to a multi-core DNN accelerator, as shown in Figure 2.10, the main difference from the architecture point of view is the inter-core communication mechanism design, and from the control flow point of view, the workload-core allocation & balancing and the mapping & scheduling. On the one hand, adding more cores to the accelerator can potentially improve

performance by allowing more computations to be performed simultaneously,


more scheduling flexibility, and more data to be shared/produced/consumed
timely; On the other hand, it also presents new challenges and opens up new
design spaces, such as the number of cores, the structure of each core, the
combination of different cores, the topology of the NoC that connects these
cores, the computation tiling, the data flows in between, etc.

2.3 Mapping, Scheduling and Allocation

Up until now, we have delved into the DNN workload and accelerator
architecture. It’s time to cover the third but equally important factor in
this DNN acceleration story: mapping, scheduling and layer-core allocation (for
multi-core architectures only), which define how a workload is processed on the
hardware, spatially and temporally.
This subsection first explains the concepts of scheduling and mapping, using the
example in Figure 2.11 and assuming a single-core architecture (subsections 2.3.1,
2.3.2). Then, it discusses the layer-core allocation and scheduling using the
example in Figure 2.14 and assuming a multi-core architecture (subsection 2.3.3).

2.3.1 Single-core cross-layer scheduling

In the single-core accelerator layer-by-layer processing context, the decision on


the order of execution among the different layers is referred to as cross-layer
scheduling. A valid cross-layer schedule should respect consecutive layers’ data-
producing-and-consuming relationship, such that the consumer layer(s) can only


be scheduled after the producer layer(s).
For DNNs that only have one-to-one layer connections, like VGG [149], cross-
layer scheduling is straightforward as there is only one valid path that traverses
all the layers in this topological order4 . While for DNNs with branches or residual
links, one-to-more and more-to-one layer connections bring more cross-layer
scheduling possibilities. These different schedules can have different hardware
implications and performance impacts.
In the example of Figure 2.11, the scheduling possibilities to compute a neural
network with five layers and a branch connection (subfigure(a)) are visualized.
There are three cross-layer scheduling possibilities: layer(L) 1-2-3-4-5, 1-2-4-3-5
and 1-4-2-3-5, in which the first two schedules are demonstrated in Figure 2.11(b)
with their data movement as an example hardware implication. The difference
between the two schedules is whether to process L3 or L4 first. In the example,
we assume the on-chip memory has a limited capacity that cannot hold more
than two layers of data for any combination of {L1, L2, L3, L4}, which leads to
unavoidable useful data overwritten during computation.
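The three valid orders can also be found mechanically by enumerating the topological orderings of the layer dependency graph. The brute-force Python sketch below (not the scheduler used later in this thesis; the edge list is inferred from the description of Figure 2.11(a)) does exactly that:

from itertools import permutations

# Layer dependencies of Figure 2.11(a): L1 feeds L2 and L4, L2 feeds L3, L3 and L4 feed L5
edges = {1: [2, 4], 2: [3], 3: [5], 4: [5], 5: []}

def valid_schedules(edges):
    """Yield every layer order in which each producer comes before all its consumers."""
    for order in permutations(edges):
        pos = {layer: i for i, layer in enumerate(order)}
        if all(pos[p] < pos[c] for p in edges for c in edges[p]):
            yield order

print(list(valid_schedules(edges)))
# -> [(1, 2, 3, 4, 5), (1, 2, 4, 3, 5), (1, 4, 2, 3, 5)]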
As shown in Figure 2.11(b), Schedule 1 processes L3 before L4, so that the
outputs from L2 can be directly reused for L3 without off-chip data access.
However, the data from L3 will have to overwrite the data of L1 given our on-chip
memory capacity assumption. This brings the overhead of re-fetching L1 data
from off-chip when computing L4. Then, again, during the L4 computation, the
data from L1 will have to overwrite the data of L3 which leads to the re-fetching
of L3 data when computing L5. Schedule 2 processes L4 before L3, resulting
in a different data reuse and overwriting pattern, and thus can end up with a
different hardware cost.
From the example, we see that processing one neural network in a layer-by-layer
fashion on a single-core accelerator can have different scheduling possibilities
with different hardware implications. Later in the thesis, we will see that the
scheduling problem presents an even larger design space when considering not
only layer-by-layer scheduling but also depth-first or layer-fused scheduling.

2.3.2 Single-core intra-layer mapping

After determining the cross-layer schedule, we need to decide how to process


each layer on the accelerator, and this processing procedure is referred to as
intra-layer mapping. We already know that an accelerator usually contains a
4 A topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.
[Figure 2.11: (a) an example NN with five layers and a branch, and a zoomed-in view of Layer 2, a 4-input, 4-output FC layer whose outputs O0..O3 are each a weighted sum of inputs I0..I3; (b) cross-layer scheduling and its HW implication: Schedule 1 (Layers 1-2-3-4-5) and Schedule 2 (Layers 1-2-4-3-5), showing per step which layers' data are held on-chip, reused, overwritten, or re-fetched from off-chip; (c) intra-layer mapping and its HW implication: Mappings 1, 2, and 3 with their spatial/temporal mapping, the implied MAC array interconnection, and the local memory accesses per operand, W(16)/I(4)/O(16) for Mapping 1, W(16)/I(16)/O(4) for Mapping 2, and W(16)/I(8)/O(8) for Mapping 3.]

Figure 2.11: Introduction to single-core cross-layer scheduling and intra-layer mapping.

MAC array and a memory hierarchy. The MAC array takes charge of spatial
parallel computation and the memory hierarchy handles the temporal data
storage. The mapping is correspondingly split into two parts: spatial mapping
and temporal mapping.
In the example of Figure 2.11(a), Layer 2 is an FC layer with 4 inputs, 4 outputs
and 16 weights. Its 16 multiplication operations are highlighted with different
coloured blocks, with which we will explain the spatial and temporal mapping
concepts and their hardware implications.

1) Spatial mapping

Spatial mapping (also called spatial unrolling or dataflow) determines what


MAC operations are processed in parallel, which is tightly coupled to the MAC
array interconnection. In Figure 2.11(c), assuming the accelerator’s MAC array
has four MAC units, Mapping 1, 2, and 3 have different spatial mappings:

• Mapping 1 spatially maps one input and four outputs on the array, i.e.,
all MAC units receive the same input, and each MAC unit generates one
output at a time;
(→ spatial input reuse)
• Mapping 2 spatially maps four inputs and one output on the array, i.e.,
each MAC unit receives a different input, and all MAC units’ outputs are
summed up to generate one output at a time;
(→ spatial output reuse)
• Mapping 3 spatially maps two inputs and two outputs on the array, i.e.,
MAC units 1 & 3, 2 & 4 receive the same inputs, and MAC units 1 & 2’s,
3 & 4’s outputs are summed up to generate two outputs at a time.
(→ hybrid spatial input, output reuse)

Different spatial mappings can require different MAC array interconnections,


and conversely, a certain type of MAC array interconnections may only support
a few types of spatial mappings. Besides the MAC array, the bandwidths of the
memory instances that have direct communication with the array also need to
be selected accordingly.
After knowing what spatial mapping is and its relation to the MAC array topology and memory hierarchy, let's discuss its hardware cost impact. Spatial mapping impacts hardware cost mainly from two perspectives (a small counting sketch follows the list below):
• Different spatial mappings imply different spatial data reuse patterns, and
thus different low-level memory accesses. For example, in Figure 2.11(c)
Mapping 1, the input is spatially reused across the four MAC units, and
thus each input is only read from the local memory once. I.e., in total
across 4 clock cycles, the input local memory is read 4 times, while there
are in total 16 output updates (output local memory read and write). In
Mapping 2, it is the opposite: in which there are in total 16 input reads
and 4 output write backs. The hardware cost can thus be different in
terms of the different access counts and access costs of the input and
output local buffers (memory size/bandwidth, input and output data
precision, etc.).
• Different spatial mappings lead to different MAC array utilization for
various DNN layer types and sizes. For example, Layer 2 in Figure 2.11(a) has four inputs and four outputs, which enables all 3 spatial mappings in (c) to fully utilize the MAC array. If we change the layer size to one input
and four outputs (still for an FC layer type), the 3 spatial mappings will
lead to a MAC array utilization of 100%, 25%, and 50%, respectively. The
hardware cost can thus be very different in terms of the different array
utilization.
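The access counts quoted in the first bullet above can be reproduced with the small counting sketch below (an idealization assumed in this text: spatially shared data is fetched from or written to the local buffer once per cycle; weight reads are always 16, since every weight is used exactly once):

def local_buffer_accesses(par_c, par_k, C=4, K=4):
    """Input-LB reads and output-LB updates for the 4-input, 4-output FC layer of
    Figure 2.11(a), when par_c inputs and par_k outputs are spatially unrolled."""
    cycles = (C // par_c) * (K // par_k)   # temporal iterations left after spatial unrolling
    input_reads    = cycles * par_c        # par_c inputs fetched per cycle, then shared spatially
    output_updates = cycles * par_k        # par_k (partial) outputs updated per cycle
    return input_reads, output_updates

print(local_buffer_accesses(par_c=1, par_k=4))   # Mapping 1 -> (4, 16)
print(local_buffer_accesses(par_c=4, par_k=1))   # Mapping 2 -> (16, 4)
print(local_buffer_accesses(par_c=2, par_k=2))   # Mapping 3 -> (8, 8)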

2) Temporal mapping

Temporal mapping is a concept built on top of spatial mapping. Spatial mapping


divides the overall to-be-performed operations into multiple computation groups,
and within each group, all the operations are computed in parallel. Temporal
mapping defines the sequential order in which these computation groups are
being processed. Therefore, one spatial mapping can be combined with many
temporal mappings. For example, in Figure 2.11(c) Mapping 3, swapping the
operation orders of the red and blue dashed boxes will create a new temporal
mapping, which shares the same spatial mapping with the original mapping.
It is worth pointing out that, usually, a mapping consists of both a spatial
mapping and a temporal mapping, except in two extreme cases: 1) if the PE
array is large enough to spatially perform all the operations in parallel at one
time, then there is only spatial mapping, no temporal mapping; 2) if the PE
array is reduced to one PE that only performs one operation at a time, then
there is no spatial mapping, only temporal mapping.
Just as we previously explained the hardware architecture implications and
hardware cost impacts for spatial mapping, here we will also discuss these two
crucial factors for temporal mapping.
[Figure 2.12: An example continuing from Figure 2.11(c) Mapping 3 to show temporal mapping's impact on high-level memory access. Mapping 3a and 3b have the same spatial mapping but different temporal mappings: Mapping 3a incurs 4 Global Buffer accesses for the Input operand and 8 for the Output operand, whereas Mapping 3b incurs 8 and 4, respectively; the Local Buffer accesses (8 per operand) and the 16 MAC operations are identical. (O': partial output; O: final output.)]

In terms of hardware architecture, temporal mapping is less hardware-constrained than spatial mapping. Whereas different spatial mappings require distinct MAC array and interconnection support, different temporal mappings rely on programmable finite state machines (FSMs) for flexible memory address generation, and can thus be realized more easily on the same datapath.
In terms of hardware cost, different temporal mappings imply different temporal data reuse patterns in the memory hierarchy and thus lead to different memory access counts, resulting in distinct total energy and latency. Figure 2.12 provides an example, in which the memory access patterns of Figure 2.11 Mapping 3 (as Mapping 3a) and its temporally adjusted variant Mapping 3b are analyzed. In this toy example, the memory capacity is assumed to be such that the Global Buffer can hold all the data, while the Local Buffer can only hold two data items, for both the Input and Output operands. The example shows that in Mappings 3a and 3b, the data traffic between the Global Buffer and the Local Buffer for the Input and Output operands is exactly reversed, due to the different temporal data reuse at the Local Buffer level. Given that the Global Buffer read costs and Local Buffer write costs differ between operands (due to different memory attributes, operand precision, etc.), the overall hardware cost can thus be very different. A small counting sketch is given below.
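The following minimal sketch (a simplified counting model, assuming the 2-entry Local Buffers of the toy example and not the thesis's full cost model) reproduces the reversed Global Buffer traffic of Mappings 3a and 3b:

```python
# Simplified counting model (assumption: 2-entry Local Buffers for inputs and
# partial outputs, ku=2 / cu=2 spatial unrolling as in Mapping 3): count the
# Global Buffer <-> Local Buffer transfers for the two temporal loop orders.

def gb_lb_traffic(outer, inner):
    """outer/inner are ('c', 2) or ('k', 2) temporal loops."""
    input_accesses = output_accesses = 0
    held_inputs = held_outputs = None
    for i in range(outer[1]):
        for j in range(inner[1]):
            idx = {outer[0]: i, inner[0]: j}
            in_tile, out_tile = idx["c"], idx["k"]   # needed input pair / output pair
            if in_tile != held_inputs:               # refill 2 inputs from the Global Buffer
                input_accesses += 2
                held_inputs = in_tile
            if out_tile != held_outputs:             # switch to another output pair
                if held_outputs is not None:
                    output_accesses += 2             # spill the evicted (partial) outputs
                held_outputs = out_tile
    return input_accesses, output_accesses + 2       # + final output write-back

print("Mapping 3a:", gb_lb_traffic(("c", 2), ("k", 2)))   # (4, 8): inputs reused locally
print("Mapping 3b:", gb_lb_traffic(("k", 2), ("c", 2)))   # (8, 4): outputs reused locally
```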

For-loop based mapping representation

After understanding the principles of spatial and temporal mapping, it is time to discuss how to represent them in a concise and uniform way. This is crucial for facilitating systematic mapping analysis and establishing the groundwork for fast DSE of vast mapping design spaces.
Given the regular and iterative computation flow a mapping contains, a set
of nested for-loops is a good option to capture all the important information
of a mapping, which is widely adopted in related works [184, 39, 126, 109].
Figure 2.13 provides examples of representing the different mappings of
Figure 2.11 and Figure 2.12 in the commonly used nested for-loop format.
There are two loops ("k = 0 to 3" and "c = 0 to 3") in the original algorithmic loop representation. These algorithmic loops then get (split and) assigned to spatial and temporal loops in the mapping representation, as done in the mapping examples Mapping 1 to 3b. In these examples, spatial loops are denoted with "unroll" in front and their loop indices carry the subscript "u", while temporal loops keep "for" as the indicator and maintain the normal indices.
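As a concrete illustration of this notation, the following minimal Python sketch (dummy data, not taken from the thesis) executes Mapping 1, with the temporal loop iterated over clock cycles and the spatial loop standing for the four MAC units firing in parallel:

```python
# Minimal sketch (assumed 4x4 FC layer from Figure 2.11): executing Mapping 1,
# "for c (temporal) / unroll ku (spatial)", with the spatial loop written as a
# parallel-in-one-cycle body.

W = [[k * 4 + c for c in range(4)] for k in range(4)]  # dummy 4x4 weights
I = [1, 2, 3, 4]                                       # dummy 4-entry input
O = [0, 0, 0, 0]

for c in range(4):              # temporal loop: one iteration per clock cycle
    for ku in range(4):         # spatial loop: all 4 MAC units fire in the same cycle
        O[ku] += W[ku][c] * I[c]

# Reference result computed directly from the algorithmic loops.
ref = [sum(W[k][c] * I[c] for c in range(4)) for k in range(4)]
assert O == ref
```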
[Figure 2.13: Examples of the nested for-loop based mapping representation, continuing from Figure 2.11 and Figure 2.12. The example FC layer is described algorithmically as: for k = 0 to 3 (O index), for c = 0 to 3 (I index): O[k] += W[k][c] × I[c]. Mapping 1: for c = 0 to 3 (T); unroll ku = 0 to 3 (S); O[ku] += W[ku][c] × I[c]. Mapping 2: for k = 0 to 3 (T); unroll cu = 0 to 3 (S); O[k] += W[k][cu] × I[cu]. Mapping 3a: for c = 0 to 1, for k = 0 to 1 (T); unroll cu = 0 to 1, unroll ku = 0 to 1 (S); O[2k+ku] += W[2k+ku][2c+cu] × I[2c+cu]. Mapping 3b: identical to Mapping 3a, but with the two temporal loops swapped (k outer, c inner).]

Under the for-loop-based mapping representation, the sequencing of the temporal
mapping loops is important, as it impacts the data fetching pattern (e.g., Mapping 3a vs. Mapping 3b), while the order of the spatial mapping loops makes no difference (e.g., in Mapping 3a/b, swapping the positions of the spatial loops cu and ku does not change the mapping itself), since they operate in parallel.
This for-loop-based workload and mapping representation is of central importance in this thesis and will be frequently referred to in the subsequent chapters.

2.3.3 Multi-core layer-core allocation and scheduling

For multi-core DNN accelerators, in addition to scheduling and mapping, layer-core allocation is another design axis to consider, which refers to determining which core processes which DNN layers (or layer tiles). Different layer-core allocations can lead to distinct hardware costs due to several factors, such as the intra-core architecture, the inter-core communication, and the homogeneity or heterogeneity of the cores.
With the same example neural network workload as before, Figure 2.14 shows a
toy example of processing it on a heterogeneous dual-core accelerator. Based on
the latency look-up table (LUT) of running specific layers on each core, we list
several examples of layer-core allocation, demonstrating that different layer-core
allocations and schedules can lead to different processing latencies.

Graph-based multi-core schedule representation

Unlike the nested for-loops used to represent single-core accelerator mappings, for multi-core workload deployment a directed acyclic graph (DAG) is frequently used as the high-level uniform representation for capturing layer-core allocation and scheduling information [43, 19, 180, 152].
Figure 2.15 takes three allocation-schedule cases from Figure 2.14 and shows the DAG representation for each of them. In the example, each node in the graph denotes a layer of the neural network in Figure 2.14(a) and carries the attributes of processing that layer on a specific core; the edges linking the nodes indicate the data dependencies between these layers. With this graph representation, one can readily allocate and schedule the workload, effortlessly alter one allocation-schedule choice into another, and easily verify whether a certain deployment scheme is valid, i.e., that no core is occupied by more than one workload at any time and that all data dependency relations are respected.

This DAG-based representation for multi-core DNN accelerator scheduling is used in Chapter 8 of the thesis.
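As a minimal illustration of how such a DAG can be checked, the sketch below encodes the Allocation 3 schedule of Figure 2.15 as (core, start, end) node attributes and verifies the dependency and core-occupancy constraints; the layer dependency edges are assumptions inferred from the example network, not a specification from the thesis.

```python
# Minimal sketch: nodes carry (core, start, end) attributes; edges are the
# assumed data dependencies of the example network in Figure 2.14(a).

schedule = {  # layer: (core, start, end) in clock cycles (Allocation 3 of Figure 2.15)
    1: ("Core 1", 0,   50),
    2: ("Core 1", 50,  130),
    3: ("Core 2", 150, 190),
    4: ("Core 2", 50,  150),
    5: ("Core 2", 190, 220),
}
edges = [(1, 2), (1, 4), (2, 3), (3, 5), (4, 5)]   # assumed layer dependencies

def is_valid(schedule, edges):
    # 1) every producer finishes before its consumer starts
    deps_ok = all(schedule[a][2] <= schedule[b][1] for a, b in edges)
    # 2) no core is occupied by two layers at the same time
    nodes = list(schedule.values())
    no_overlap = all(
        not (ca == cb and sa < eb and sb < ea)
        for i, (ca, sa, ea) in enumerate(nodes)
        for (cb, sb, eb) in nodes[i + 1:]
    )
    return deps_ok and no_overlap

makespan = max(end for _, _, end in schedule.values())
print(is_valid(schedule, edges), makespan)   # True, 220 (matching the 220 cc of Figure 2.14)
```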

[Figure 2.14: Introduction to multi-core layer-core allocation and scheduling. (a) An example NN of five layers, a heterogeneous dual-core accelerator (two cores, a DRAM, and communication/DRAM buses), and the latency (in clock cycles) of processing each layer on each core: Layer 1: 50 (Core 1) / 100 (Core 2); Layer 2: 80 / 120; Layer 3: 100 / 40; Layer 4: 70 / 100; Layer 5: 40 / 30. (b) Several layer-core allocations and schedules, with end-to-end latencies ranging from 270 cc down to 220 cc.]

[Figure 2.15: Examples of the graph-based multi-core schedule representation, continuing from Figure 2.14. Each node records a layer, its assigned core, and its start/end times; for instance, Allocation 3 is captured as Layer 1 (Core 1, 0-50), Layer 2 (Core 1, 50-130), Layer 4 (Core 2, 50-150), Layer 3 (Core 2, 150-190), and Layer 5 (Core 2, 190-220), for a total latency of 220 cc.]

2.4 Conclusion

This chapter thoroughly introduced the background information of DNN algorithms (from the fundamental concepts of a neuron, a layer, and layer
connection to its overall attributes like computation and data patterns),
DNN accelerator architectures (hardware metrics, hardware categories,
accelerator building blocks, and the shift from single-core to multi-core design
paradigm), and algorithm-to-hardware deployment (mapping, scheduling
and allocation). These basic concepts will greatly assist readers in better
understanding the rest of the thesis.
Note that this chapter explained the basic concepts that are most relevant to the thesis. There are more workload forms (such as dynamic
neural networks [56], spiking neural networks (SNN) [181], graph neural
networks (GNN) [190], etc.), more mapping/scheduling techniques and
considerations (such as polyhedral model [167], in-place mapping [161], data
lifetime [89], data regularity [79], etc.), and more accelerator architectures (such
as sparse DNN accelerators [38], SNN accelerators [16], GNN accelerators [2],
etc.), which are not the focus of the thesis and thus are not discussed here.
In the next chapter, we will delve into the first open research question - How to
efficiently execute variable-precision DNNs on hardware? - and focus on the
design space exploration of a single MAC architecture.
Chapter 3

Precision-Scalable MAC Unit


Design Space Exploration

This chapter studies different types of precision-scalable MAC unit architectures and provides answers to the open research question Q1.1: What is the best MAC unit architecture for variable-precision DNN execution?
The current trend for DL has come with an enormous computational need for
billions of MAC operations per inference. Fortunately, reduced precision has
demonstrated large benefits with low impact on accuracy, paving the way towards
processing in mobile devices and IoT nodes. To this end, various precision-
scalable MAC architectures optimized for neural networks have recently been
proposed. Yet, it has been hard to comprehend their differences and make a
fair judgment of their relative benefits as they have been implemented with
different technologies and performance targets. To overcome this, this chapter
exhaustively reviews the SotA precision-scalable MAC architectures and unifies
them in a new taxonomy. Subsequently, these various topologies are thoroughly
benchmarked in a 28 nm commercial CMOS process, across a wide range of
performance targets, and with precisions ranging from 2 to 8 bits. All MAC units
are analyzed for each precision as well as joint precisions that mimic practical
use cases. This chapter highlights the trade-off originating from architectures,
scalability, and practical use cases, in terms of energy, throughput, area, and
This chapter is related to publications [22] and [107], and contains large fractions of [22].
The author’s contributions include (but not limited to) the new taxonomy, MAC units’ RTL
implementations and verification, and paper writing.

57
58 PRECISION-SCALABLE MAC UNIT DESIGN SPACE EXPLORATION

bandwidth, aiming to understand the key trends in the precision-scalable MAC


unit design space.

3.1 Motivation and Chapter Organization

Reduced-precision computing has demonstrated large benefits with low or negligible impact on network accuracy [71, 27, 20]. It has been shown that
the optimal precision not only varies from one neural network to another, but
it also can vary within a neural network itself [115, 134], from layer to layer
or even from channel to channel. This has led to a new trend of ultra-efficient
run-time precision-scalable neural processors for embedded DNN processing in
mobile devices and IoT nodes.
Accordingly, many run-time precision-scalable MAC architectures have been introduced in recent years, built either with high parallelization capabilities [146, 143, 117, 107] or with serial approaches [94, 21, 142, 163, 133]. However, it has been difficult to assess their efficiency or to decide on a topology, for several reasons. Firstly, they have been implemented in various process technologies, with non-identical bitwidths or scalability levels, or have been integrated into entirely different systems. Secondly, few works evaluate their performance against a baseline design, i.e., a MAC unit without scalability, to clearly show the configurability overheads. Finally, many designs are based on ad-hoc scalability techniques, lacking a systematic approach and a clear analysis of which scalability principles are shared and which are distinct across implementations.
This chapter overcomes these issues through the following contributions:

• Section 3.2 proposes a new taxonomy, which categorizes all the existing
precision-scalable MAC architectures, uncovers their design patterns, and
links MAC-level design decisions to array-level and algorithm-level choices.
• Section 3.3 presents an exhaustive and illustrated survey of SotA
precision-configurable MAC architectures, to help understand their circuit
topologies and functional differences.
• Section 3.4 explains the design and benchmark methodology, highlighting the construction of a synthesis design space that optimizes for different timing conditions across precision modes.
• Section 3.5 conducts a detailed analysis, for each MAC architecture, of the precision scaling and its overheads in terms of energy, area, bandwidth, and throughput.

• Section 3.6 makes a comparative study of all precision-scalable MAC units across a wide range of performance targets, both at a nominal supply voltage and under Dynamic Voltage-Frequency Scaling (DVFS).
• Section 3.7 investigates three practical utilization scenarios, assuming different proportions of reduced-precision operations, and explores their impact on the optimal architecture selection.
• Section 3.8 concludes this chapter.

3.2 Dataflow Implications of Precision Scalability

This section first introduces two dataflow scalability options, Sum Apart (SA)
and Sum Together (ST), with implications at algorithmic, PE array, and MAC
unit level, showing that run-time precision scalability is tightly interwoven with
DNN dataflow considerations. It then maps the SotA precision-scalable MAC
architectures into these categories and proposes a general MAC taxonomy.

3.2.1 SA and ST at algorithm level

The concepts of SA and ST are related to the spatial input and output reuse described in Section 2.3.2. They were originally introduced at the MAC level to
qualify two opposite ways of accumulating subword-parallel computations [107]:
SA keeps the parallel-generated products separately, while ST sums them
together to form one single output. These concepts can be applied to differentiate
algorithm-level characteristics of neural-network workloads.
Figure 3.1 illustrates a DNN FC and Conv 2D layer in a nested-loop format.
Indices appearing on the left-hand side of the MAC operation, i.e., in the output indexing (b, k, oy and ox), indicate that the related partial results are accumulated into distinct output elements, corresponding to SA-type loops. On the contrary, indices that do not appear in the output (c, fy and fx) indicate that the partial results along these dimensions are accumulated into the same output, corresponding to ST-type loops. All loops thus belong to either the SA or the ST type.

3.2.2 SA and ST at PE-array level

Many SotA DL accelerators rely on a multi-dimensional PE array to support convolutions or matrix multiplications. The architectures of such arrays are

[Figure 3.1: SA/ST loop identification for (a) an FC and (b) a Conv 2D layer (I: input, O: output, W: weight).
(a) FC: for b = 0 to B-1 (I/O batch size, SA); for k = 0 to K-1 (O vector size / W matrix row, SA); for c = 0 to C-1 (I vector size / W matrix column, ST); O[b][k] += I[b][c] × W[k][c].
(b) Conv 2D: for b (I/O batch size, SA); for k (O channel / W kernel, SA); for c (I/W channel, ST); for oy/ox (O row/column, SA); for fy/fx (W kernel row/column, ST); O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx] × W[k][c][fy][fx].]
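The SA/ST classification of a loop therefore follows mechanically from the indexing of the output operand. A minimal sketch of this rule (dimension names as in Figure 3.1; the encoding as Python strings is an assumption for illustration) is:

```python
# Minimal sketch: classify loop dimensions of a layer as SA or ST, based on
# whether the loop index appears in the output indexing.

def classify_loops(all_dims, output_dims):
    return {d: ("SA" if d in output_dims else "ST") for d in all_dims}

# FC layer: O[b][k] += I[b][c] * W[k][c]
print(classify_loops(["b", "k", "c"], {"b", "k"}))
# {'b': 'SA', 'k': 'SA', 'c': 'ST'}

# Conv 2D layer: O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx] * W[k][c][fy][fx]
print(classify_loops(["b", "k", "c", "oy", "ox", "fy", "fx"],
                     {"b", "k", "oy", "ox"}))
# {'b': 'SA', 'k': 'SA', 'c': 'ST', 'oy': 'SA', 'ox': 'SA', 'fy': 'ST', 'fx': 'ST'}
```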

tightly connected to the targeted dataflow (a.k.a. the spatial mapping or spatial-unrolling scheme) [185], and can once again be categorized into SA and ST behaviours along each array dimension.
For example, a 2D spatial-unrolling PE array can be categorized into three types, SA-SA, SA-ST, or ST-ST, as shown in Figure 3.2. An SA-SA array supports the spatial unrolling of two SA-type loops along its two geometric dimensions, an SA-ST array unrolls one loop of each type, and an ST-ST array unrolls two ST-type loops. The choice of SA or ST array structure determines the output accumulation islands in the array.
This categorizes PE arrays into three different classes: SA-SA arrays contain a single PE per accumulation island; these are also called Output-Stationary arrays and typically accumulate results across clock cycles [46, 116]. SA-ST arrays contain a 1D vector of PEs per accumulation island whose results are added together, either in a systolic way [120, 30] or through a single-cycle accumulator [156, 18]. ST-ST arrays integrate all the PEs into a single accumulation island, e.g., with a 2D-accumulated adder tree [76, 150], a systolic array, or a 1D-accumulated adder tree plus a systolic array.

3.2.3 SA and ST at precision-scalable MAC-unit level

When scaling computational precision down to a fraction of the full-precision operation, spatial precision-scalable MACs typically subdivide the PE into several parallel smaller-resolution units. As such, each nominal PE then

[Figure 3.2: Three types of two-dimensional PE array (SA-SA, SA-ST, and ST-ST), differing in how the PEs are grouped into accumulation islands.]

becomes a PE (or MAC) array in itself, which offers additional spatial loop-unrolling opportunities along SA or ST dimensions. Many precision-scalable MAC architectures have been presented in the literature, and they can be categorized along these SA and ST concepts.
In full-precision mode, SA and ST MACs behave identically: only one result
is generated every clock cycle. In scaled-precision mode, several low-precision
results are generated in parallel. The major difference between SA and ST
MACs is that SA MACs keep these sub-products separately, thus generating
several results in multiple output registers, while ST MACs sum them together
to obtain a single accumulated result, requiring only one register.
In existing neural-network computing patterns, there is an implicit rule that if multiplications share an input operand (either a weight or an input activation), their products cannot be added together. Put differently, only products that belong to different output values have the opportunity to use the same weight or input activation. This algorithmic constraint causes the fundamental input-output bandwidth trade-off between SA and ST MACs, which will be covered in more detail in Section 3.3.
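A minimal sketch of this trade-off at the MAC level is given below (illustrative unsigned values; the actual designs are described in Section 3.3): in a 2b-scaled mode built from four 2b×8b sub-multipliers, an SA configuration can share one activation across four separate accumulators, whereas an ST configuration needs four distinct activations but produces a single accumulated result.

```python
# Minimal sketch: SA vs. ST accumulation in a 2b-scaled MAC built from four
# 2b x 8b sub-multipliers (unsigned toy values, for illustration only).

w_sub = [0b10, 0b01, 0b11, 0b00]   # four independent 2-bit weights

# SA mode: the four sub-products belong to four *different* outputs, so they
# may share one 8-bit activation and are kept in four separate accumulators.
a_shared = 200
sa_acc = [0, 0, 0, 0]
for i in range(4):
    sa_acc[i] += w_sub[i] * a_shared

# ST mode: the four sub-products belong to the *same* output, so each needs a
# distinct 8-bit activation and they are summed into a single accumulator.
a_distinct = [200, 13, 7, 91]
st_acc = sum(w * a for w, a in zip(w_sub, a_distinct))

print(sa_acc, st_acc)   # 4 results per cycle vs. 1 result per cycle
```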

3.2.4 Taxonomy

Building upon these concepts of SA and ST, Table 3.1 presents the complete
taxonomy of precision-scalable MAC architectures.
Vertically, all precision-scalable MAC architectures are first organized into
spatial and temporal precision-scaling techniques. In the category of spatial
precision configurability, the MACs are spatially split into several lower-precision
arithmetic operators in reduced-precision modes. They are further categorized
using the SA and ST principles. The category of temporal precision-scalable
MACs has a totally different working principle. They perform multiplication
through temporal iteration of repeated add-shift operations. Reduced-precision operation is achieved by simply reducing the number of temporal iterations. Temporal-based MACs are further grouped into Bit Serial and Multibit Serial, based on the bitwidth of the input data that is serially fed in at every iteration. In summary, reducing the precision of the input operands allows spatially-reconfigurable MACs to carry out more lower-precision MAC operations in parallel, while it allows temporally-reconfigurable MACs to reduce the total number of clock cycles needed to finish one MAC operation.

Table 3.1: Taxonomy of precision-scalable MAC unit architectures

Architecture types          | 1D scalable                               | 2D scalable                     | 2D symmetric scalable
Spatial  - Sum Apart        | 1D FU SA (DNPU [146])                     | 2D FU SA                        | SWU SA (DVAFS [117])
Spatial  - Sum Together     | 1D FU ST                                  | 2D FU ST (BitFusion [143])      | SWU ST (ST [107])
Temporal - Bit Serial       | 1D bit serial (UNPU [94])                 | 2D bit serial (LOOM [142])      | -
Temporal - Multibit Serial  | 1D multibit serial (Multibit serial [21]) | 2D multibit serial (LOOM [142]) | -
Horizontally, architectures are categorized into three sub-types, depending on the scaling characteristics of their input operands (weight and input activation). "1D scalable" indicates that only one of the input operands can scale to a lower bitwidth, while the other always stays at full precision. We refer to this as "weight-only scaling" in this chapter. "2D scalable" MACs break this constraint so that both input operands can scale independently. Finally, "2D symmetric scalable" MACs only allow the two input operands to scale to the same precision. Note that in 2D temporal-based MACs, the two operands are naturally independent, hence the last column is empty.
Table 3.1 also maps the existing SotA precision-scalable MAC architectures onto
this new taxonomy, and proposes new names for unification. As will become clear
in the next section, all Fully-Unrolled (FU) architectures use a bottom-up design
methodology: their multiplier logic is formed by combining several individual
and identical sub-blocks, which are always active. Their configurability is in
the interconnection and shift-add logic. Conversely, Subword-Unrolled (SWU)
architectures start from a top-down design philosophy: their single-block full-
precision multiplier is selectively gated to reuse existing arithmetic cells in
reduced-precision modes.
Note that a MAC unit is intrinsically a 2D architecture (with two input operands, input activation and weight). By selecting a different precision scalability method from Table 3.1 for each dimension, many hybrid precision-scalable MAC units can be created, for example SA-ST MAC architectures or mixed temporal-spatial MAC architectures. This chapter, however, focuses on discussing and comparing the pure architectural categories, as listed in Table 3.1.

3.3 Survey of Scalable MAC Architectures

This section surveys all precision-scalable MAC architectures listed in Table 3.1. Their working principles, similarities and differences are discussed in detail, and figures illustrate their computing states under the different modes. For all architectures and precision modes, we assume a 4-bit register headroom for output accumulation.

3.3.1 Data-gated conventional MAC

The baseline of this study is a data-gated conventional MAC unit as in [115]. When scaled precision is applied, only the most significant bits (MSBs) are used for computation, while the least significant bits (LSBs) are kept at zero. Hence, the switching activity is reduced. As the critical path, which now only goes through the MSBs, is shorter, the frequency can be increased dynamically, or the supply voltage can be lowered at equal throughput.
This is illustrated in Figure 3.3, showing from left to right the use with 8b ("b" for "bit") full precision, 4b symmetric scaling, and 2b weight-only scaling. When computing in reduced precision modes, some parts of the multiplier and accumulator logic are gated by zeroing the input data, as shown by the shaded areas. However, as this MAC does not embed any configurability feature, such as selective clock gating, all registers remain clocked despite the fact that no data reaches their LSBs.
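Functionally, data gating simply zeroes the unused LSBs of the operands before the full-precision multiplication; a minimal sketch (unsigned toy values, for illustration only) is:

```python
# Minimal sketch: "data gating" a conventional 8b x 8b MAC by zeroing the LSBs
# of the operands, which reduces switching activity without any extra hardware.

def gate_lsbs(x, kept_bits, full_bits=8):
    """Keep the top kept_bits MSBs of x and zero the remaining LSBs."""
    return x & (((1 << kept_bits) - 1) << (full_bits - kept_bits))

act, w = 0b1011_0110, 0b0110_1011
full    = act * w                                   # 8b x 8b full precision
sym_4b  = gate_lsbs(act, 4) * gate_lsbs(w, 4)       # 4b x 4b symmetric scaling
wonly2b = act * gate_lsbs(w, 2)                     # 2b x 8b weight-only scaling
print(full, sym_4b, wonly2b)                        # results keep the MSB alignment
```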

[Figure 3.3: Data-gated conventional MAC for either one full-precision 8b×8b, one symmetric 4b×4b, or one weight-only 2b×8b operation.]

3.3.2 1D Fully-Unrolled SA (DNPU)

Figure 3.4 shows the MAC architecture of the Deep Neural Processing Unit
(DNPU), introduced by Shin et al. [146] and characterized as 1D Fully-Unrolled
(FU) SA in the new taxonomy.
Its 8b×8b multiplier is built out of four 2b×8b sub-multipliers followed by configurable shift-and-add logic blocks. Only one of its inputs is functionally scalable, hence the 1D classification. More specifically, the top input is subword scalable, while the left 8b input is treated as one operand across all modes and

shared by all sub-multiplier blocks. By gating the shift-and-add logic in reduced precision modes, 2 or 4 results are generated in parallel and kept separately, which explains its SA nature.
Note that like other SA-type MACs, each subword output register requires its
own accumulation headroom (4 bits in our assumption). For this reason, SA
MACs will suffer more from a large headroom than temporal and ST MACs,
which only require this headroom once.

[Figure 3.4: Weight-only precision scaling in a 1D FU SA MAC configured for either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle.]

3.3.3 1D Fully-Unrolled ST

A 1D FU ST MAC is uncovered by the new taxonomy. This MAC is also only scalable along one dimension. However, at reduced precision, instead of storing the parallel-generated results separately, the 1D FU ST accumulates them. As
shown in Figure 3.5, the intermediate adders in the configurable shift-add logic
are not bypassed but reused for the ST accumulation in low precisions.
Although the output bandwidth is reduced by only producing one result, the
input bandwidth dramatically increases when precision scales down as it requires
distinct 8-bit input activations. This stems from the fact that neural-network
computational kernels can share inputs across computations for different outputs,
but cannot share multiplicands across results accumulated into the same output.

[Figure 3.5: Weight-only precision scaling in a 1D FU ST MAC configured for either one 8b×8b, two 4b×8b, or four 2b×8b operations per cycle.]

3.3.4 2D Fully-Unrolled SA

A 2D FU SA architecture emerges from the newly-introduced taxonomy. Its baseline 8b×8b multiplier can be implemented out of 16 2b×2b multipliers together with configurable shift-and-add logic blocks, which together enable it to perform either one 8b×8b, two 4b×8b, four 4b×4b, four 2b×8b, eight 2b×4b, or sixteen 2b×2b multiplications in one cycle. Figure 3.6 demonstrates four of its working states.
In reduced precision modes, the SA principle again results in an output bandwidth explosion, e.g., visible in the 2b×2b mode of Figure 3.6. A second disadvantage of this architecture is its limited applicability: the 2D FU SA unit only supports Conv workloads or batched workloads, as batch-1 FC layers do not allow data sharing across both input operands.

3.3.5 2D Fully-Unrolled ST (BitFusion)

BitFusion, proposed by Sharma et al. [143], is a 2D FU ST unit. Similar to the previous 2D FU SA architecture, it is capable of executing one 8b×8b, two 4b×8b, four 4b×4b, four 2b×8b, eight 2b×4b, or sixteen 2b×2b multiplications in one clock cycle. Figure 3.7 presents four of its computing modes. Instead of keeping the parallel-generated products apart, the 2D FU ST sums them together, thus avoiding an output bandwidth explosion. Yet, as a consequence, input sharing is not possible, which boosts the required input bandwidth in reduced precision modes.

[Figure 3.6: Precision scaling in a 2D FU SA MAC configured for either one 8b×8b, four 4b×4b, four 2b×8b, or sixteen 2b×2b operations per cycle.]

[Figure 3.7: Precision scaling in a 2D FU ST MAC configured for one 8b×8b, four 4b×4b, four 2b×8b, or sixteen 2b×2b operations per cycle.]

In BitBlade, a variant of BitFusion introduced by Ryu et al. [133], the shifting logic is pulled out of each MAC and shared across the sub-multipliers of different MACs in the array to reduce the configurability overhead. However, BitBlade is not included in this study, as this chapter focuses on MAC-unit-level techniques. MAC-array-level optimization strategies are thoroughly analyzed and benchmarked in Chapter 4 of the thesis.

3.3.6 Subword-Unrolled SA (DVAFS)

The so-called Dynamic Voltage-Accuracy-Frequency Scaling (DVAFS) MAC, developed by Moons et al. [116], can be categorized as Subword-Unrolled (SWU) SA in the proposed taxonomy.
Unlike all previously discussed precision-scalable MAC types, SWU SA is a 2D symmetric scalable architecture, which follows a top-down design methodology. Its functional 8b multiplier is no longer realized by fusing several smaller multipliers with a configurable interconnect. Instead, it exploits the interconnection pattern of an array multiplier (a.k.a. Baugh-Wooley multiplier [14]) and selectively gates/activates parts of the arithmetic logic to perform one 8b×8b, two 4b×4b, or four 2b×2b MAC operation(s) within one clock cycle, as shown in Figure 3.8. Large output registers are required due to the SA principle and its duplicated register headroom.
Note that in reduced precision modes, three types of regions appear in the array multiplier, depicted by differently shaded areas. The blue blocks marked with an "X" are active multiplication logic. The gray stripes represent fully-gated logic. Most importantly, the light-blue stripes mark cells that are not used for arithmetic but remain active to propagate the intermediate results from the active multiplication logic to the bottom-right corner of the array multiplier, where the outputs are collected, hence preventing complete gating of these regions.

[Figure 3.8: Symmetric precision scaling in a SWU SA MAC configured for either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle.]

3.3.7 Subword-Unrolled ST (ST)

The ST version of the SWU MAC unit [107] is also a 2D symmetric scalable architecture based on an array multiplier. But unlike SWU SA, SWU ST adds all subword results together by activating the array multiplier in an opposite diagonal pattern, as shown in Figure 3.9. Such a configuration saves addition logic, as it uses the multiplier array cells to implicitly perform the addition. The propagating region that cannot be fully gated shrinks as the precision scales down, while the critical path remains roughly the same, which is just the opposite of SWU SA.
It is worth noting that, compared to the previous FU ST architectures, the input bandwidth of SWU ST remains the same across precision modes, at the cost of partly gating its arithmetic blocks when precision scales down. There is thus a trade-off for ST-type MACs between input bandwidth consistency and high hardware utilization.

[Figure 3.9: Symmetric precision scaling in a SWU ST MAC configured for either one 8b×8b, two 4b×4b, or four 2b×2b operations per cycle.]

3.3.8 1D bit-serial designs (UNPU)

Bit-serial designs have recently gained attention with both the Unified Neural
Processing Unit (UNPU) by Lee et al. [94] and the QUEST log-quantized
3D-stacked inference engine by Ueyoshi et al. [163]. Indeed, bit-serial operand
feeding implicitly allows fully-variable bit precision.
Considered in this study, the UNPU bit-serial MAC receives weights through
1-bit iterations while input activations are kept at full precision and sent in a

bit-parallel manner. Illustrated in Figure 3.10, this design is thus categorized as a 1D bit-serial architecture. Scaling the input activations is still possible by data gating.
In [21], the original 1D bit-serial concept has been extended to multi-bit serial designs. Figure 3.11 shows the example of a 1D 4-bit serial MAC where weights are fed in 4 bits at a time. This scheme requires only 2 clock cycles for an 8-bit computation, hence reducing the energy consumed in the clock tree and registers. Lower precision can be obtained by gating the unnecessary bits, as with the 2-bit weight on the right of Figure 3.11. This study includes both 1D 2-bit and 4-bit serial MACs.
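A minimal behavioural sketch of such a right-shifting bit-serial MAC is given below (unsigned arithmetic for clarity; the actual designs handle signed operands and the accumulation headroom):

```python
# Minimal behavioural sketch of a right-shifting bit-serial multiplier
# (unsigned, for illustration): the weight is fed LSB-first, one bit per cycle,
# and the product register is shifted right every cycle.

def bit_serial_multiply(act, w, w_bits=8):
    acc = 0
    for cycle in range(w_bits):
        w_bit = (w >> cycle) & 1            # serial weight bit of this cycle
        acc += (w_bit * act) << w_bits      # add the activation into the upper part
        acc >>= 1                           # right-shift the whole product register
    return acc

assert bit_serial_multiply(173, 59) == 173 * 59                      # 8 cycles
assert bit_serial_multiply(173, 0b1011, w_bits=4) == 173 * 0b1011    # 4 cycles
# A multi-bit serial variant would feed several weight bits per cycle instead,
# reducing the cycle count at the cost of a wider first-stage adder.
```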

[Figure 3.10: Weight-only precision scaling in a bit-serial MAC configured for either 8b×8b (8 clock cycles), 4b×8b (4 cycles), or 2b×8b (2 cycles) operations.]

[Figure 3.11: Weight-only precision scaling in a 4-bit serial MAC configured for either 8b×8b (2 clock cycles), 4b×8b (1 cycle), or 2b×8b (1 cycle, by gating the 4b×8b mode) operations.]

3.3.9 2D bit-serial designs (LOOM)

LOOM, designed by Sharify et al. [142], extends the 1D bit-serial approach to a 2D scalable architecture. This 2D scalable design feeds both input operands bit-serially, as illustrated in Figure 3.12. By doing so, the bit-wise AND logic from Figure 3.10 is further simplified to a single AND gate, but the number of clock cycles to complete one MAC operation increases to the product of the input activation and weight precisions.
Since both input operands are fetched bit-by-bit and all input bit combinations
need to be traversed, a more complex input feed-in schedule and control logic
are required. Figure 3.13 presents the 4b×4b scheduling used in this work ("a"
for input activation and "w" for weight).
Note that the input fetching pattern is not regular and contains repetitions. For example, in this 4b×4b case, the bit-by-bit weight fetching sequence is: w0, w1, w0, w2, w1, w0, w3, w2, w1, w0, ... In addition, unlike 1D bit-serial units that perform a right shift at each clock cycle, 2D bit-serial units shift their partial products irregularly, as indicated by the red arrows in Figure 3.13.
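The diagonal structure of this schedule can be made explicit with a small sketch; the anti-diagonal ordering assumed below is consistent with the weight sequence quoted above, but the exact in-diagonal order of the real design is not specified here.

```python
# Minimal sketch (assumed anti-diagonal ordering): generate the bit-pair
# feed-in schedule and the resulting product of a 2D bit-serial 4b x 4b MAC.

def loom_schedule(n=4):
    """Yield (activation_bit_index, weight_bit_index) pairs, one per cycle."""
    for s in range(2 * n - 1):                        # one anti-diagonal per shift amount
        for i in range(max(0, s - n + 1), min(s, n - 1) + 1):
            yield i, s - i                            # all pairs with i + j == s

def loom_multiply(a, w, n=4):
    product = 0
    for i, j in loom_schedule(n):                     # n*n single-bit AND operations
        a_bit = (a >> i) & 1
        w_bit = (w >> j) & 1
        product += (a_bit & w_bit) << (i + j)         # shift amount is fixed per diagonal
    return product

print([f"a{i}w{j}" for i, j in loom_schedule(4)][:10])
# ['a0w0', 'a0w1', 'a1w0', 'a0w2', 'a1w1', 'a2w0', 'a0w3', 'a1w2', 'a2w1', 'a3w0']
assert loom_multiply(0b1011, 0b1101) == 0b1011 * 0b1101
```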
The 2D bit-serial approach can also be implemented in a multi-bit processing fashion. Figure 3.14 exhibits a 2D 4-bit serial design under three different precision modes. By feeding in 4 bits at a time for both operands, the number of compute cycles is drastically reduced. Note that since the basic arithmetic block is a 4b multiplier, the precision scaling can only be pushed down to 4 bits by design; scaling below 4 bits is again enabled through LSB data gating. This survey includes both 2D 2-bit and 4-bit serial MACs.

[Figure 3.12: Symmetric precision scaling in a 2D serial MAC configured for either 8b×8b (64 clock cycles), 4b×4b (16 cycles), or 2b×2b (4 cycles) operations.]

[Figure 3.13: A bit-wise feed-in schedule for the 2D bit-serial MAC in 4b×4b mode, covering all 16 partial-product bits a_i·w_j of the 4b×4b multiplication.]

[Figure 3.14: Symmetric precision scaling in a 2D 4-bit serial MAC configured for either 8b×8b (4 clock cycles), 4b×4b (1 cycle), or 2b×2b (1 cycle, by gating the 4b×4b mode) operations.]

3.4 Design and Benchmark Methodology

The remainder of this chapter presents individual evaluations and comparative studies of all the aforementioned precision-scalable MAC architectures. To this end, the present section describes the methodology used for implementing and benchmarking these circuits in terms of area, throughput, bandwidth and energy per operation for 5 precision modes: 8b×8b full precision, 2b×2b and 4b×4b symmetric scaling, and 2b×8b and 4b×8b weight-only scaling.

3.4.1 Design considerations and assumptions

To ensure the fairness and coherence of this study, all MAC architectures are
designed with equivalent features and identical circuit optimizations.
They are all built with a two-stage pipeline structure for the 8b full-precision
baseline. To support the different precision modes, their accumulation registers
are partitioned and gated as shown in Figure 3.15. The accumulation headroom
is 4b for any precision mode. Outputs are updated and stored separately for
SA MACs (Figure 3.15(a)), and into a single unpartitioned register for ST and
temporal-based scalable MACs (Figure 3.15(b)).
Besides, all temporal-based designs in this study assume a right-shifting
sequential multiplier as it requires a smaller first-stage adder than a left-shifting
design, preventing long carry propagation and sign-bit extension.
Input registers and overheads that can be shared among multiple PEs (e.g.,
control logic or FSMs) are excluded from area and power reporting.

[Figure 3.15: Register settings for (a) SWU SA and (b) SWU ST MAC units in the 8b-8b, 4b-4b, and 2b-2b modes, indicating the 4b accumulation headroom, the multiplication-result fields, and the gated parts.]

To keep this architectural study as general as possible, no design-specific optimization techniques or features are considered for any design. For instance, LUT acceleration [146] and sparse-operation guarding [118] are not implemented. These orthogonal methods can generally benefit many architectures.

3.4.2 Design space

In total, 19 designs are made (listed in Figure 3.16), with multiple implementations of each architecture, varying the level of scalability, i.e. the scalability granularity. For instance, a 1-level scalable design can scale from 8b down to 4b by design, the 2b mode being carried out by data gating on top of the 4b mode. Data gating is applied from LSB to MSB to prevent unnecessary energy spent on carry propagation and sign-bit extension. A 2-level scalable design can directly scale down to 4b and 2b by design. Requiring different circuit and neural-network implementations, binary computations (1b precision) are not considered in this study.
Since the various MAC architectures are expected to have very different optimal operating frequencies, this study explores a broad range of clock targets with frequencies from 625 MHz to 5 GHz. To allow each precision mode to find its optimum, constraint sweeps are applied individually to each mode (clock periods between 0.2 and 1.6 ns, with a sweeping step of 0.05 ns). Overall, over 5000 circuits are synthesized and characterized for each specific MAC design.

3.4.3 Physical implementation and timing analysis

The performance of the MAC architectures is evaluated through synthesis and simulation in a commercially-available 28 nm low-leakage process, in the typical corner and with a nominal supply voltage of 1 V. Circuits are synthesized from abstract-level SystemVerilog descriptions using Cadence Genus with high-effort compilation options. Conservative power models are used for synthesis and power estimation.
To enforce the scaling of the critical path at reduced precision, a multi-mode
timing optimization is carried out to optimize the speed for each scaling scenario.
Circuit delay is thus constrained and measured independently in each mode
(even if the unit does not allow scaling by design, e.g., the conventional MAC
still receives multiple constraints to optimize the shorter critical paths when
data-gating inputs). For each precision mode, known static signals and unused
registers are declared as such to prevent Static Timing Analysis (STA), which is
workload-agnostic, from unnecessarily spending resources on them. The delay

of the mode-control signals is not optimized, as it is assumed that the same precision is used for many cycles in a row (e.g., for processing a full DNN layer).

3.4.4 Power estimation

Power consumption is estimated through post-synthesis simulations at the maximum operating frequency of each mode, with the ModelSim simulator and Cadence power extraction. This switching-activity-based estimation is very important, as the gated or unused logic, which depends on the precision mode, can lead to large differences in dynamic power.
The simulation for each mode consists of 10,000 MAC operations with an accumulator-register reset every 50 operations. A set of Gaussian-distributed random numbers, representative of quantized DNNs trained on CIFAR-10, is used for the simulations. Real DNN workload-based power simulation is left for future work.
Input registers and overhead shareable among several MACs within an array
are not included in the reported power results.

3.4.5 DVFS

Dynamic Voltage-Frequency Scaling (DVFS) provides an additional dimension for power-frequency optimization, allowing the operating point to be moved dynamically closer to a mode's optimum and the energy efficiency to be further increased. This study models the impact of DVFS on the circuits, assessing throughput and energy for each mode while sweeping the supply voltage from 1 V down to 0.8 V. To this end, the delay between the 0.8 V and 1 V characterized library points is estimated using the Sakurai-Newton model [135], while the energy is quadratically interpolated from the gate-level simulations.
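For reference, a minimal sketch of this scaling model is given below; the threshold voltage and alpha exponent are illustrative assumptions, not the characterized library values used in this work.

```python
# Minimal sketch (illustrative parameters only): alpha-power-law (Sakurai-
# Newton) delay scaling and quadratic dynamic-energy scaling with supply voltage.

V_NOM, V_TH, ALPHA = 1.0, 0.35, 1.5   # assumed threshold voltage and alpha exponent

def scaled_delay(delay_nom, vdd):
    """Delay ~ Vdd / (Vdd - Vth)^alpha, normalized to the 1 V point."""
    sn = lambda v: v / (v - V_TH) ** ALPHA
    return delay_nom * sn(vdd) / sn(V_NOM)

def scaled_energy(energy_nom, vdd):
    """Dynamic energy scales quadratically with the supply voltage."""
    return energy_nom * (vdd / V_NOM) ** 2

for vdd in (1.0, 0.9, 0.8):
    print(vdd, round(scaled_delay(1.0, vdd), 2), round(scaled_energy(1.0, vdd), 2))
```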

3.5 Detailed Analysis

For each MAC architecture, this section evaluates the characteristics of one representative synthesized instance: the one with the lowest average energy-delay-area product across the precision modes (8b, 4b, and 2b), which represents a balanced and coherent design choice.

Figures 3.17-3.20 show the breakdown of memory bandwidth, energy per operation, and silicon area of the circuits. The upper subfigures display symmetric scaling scenarios, while the bottom ones display weight-only scaling. Triplets of bars differentiate the precision scenarios (8, 4 or 2 bits) for each instance. Figure 3.16 shows the legend for Figures 3.17-3.20.

[Figure 3.16: Legend of all the bar charts in Section 3.5, 19 designs in total: data gating; 1-level and 2-level SWU SA (DVAFS); 1-level and 2-level 1D FU SA (DNPU); 1-level and 2-level SWU ST (ST); 1-level and 2-level 1D FU ST; 1-level and 2-level 2D FU SA; 1-level and 2-level 2D FU ST (BitFusion); 1D 1-bit serial (UNPU); 1D 2-bit and 4-bit serial (multibit serial); 2D 1-bit, 2-bit and 4-bit serial (LOOM).]

3.5.1 Bandwidth

Figures 3.17(a)-(b) present the required memory bandwidth per clock cycle of
the different MAC units over each precision mode.
This shows the internal trade-off between input and output bandwidths. SA
MACs produce multiple independent results in parallel, hence increasing the
output bandwidth when precision scales down by design (i.e. down to 4b for
1-level scalable circuits and down to 2b for 2-level ones). However, since neural
networks allow input data reuse over different outputs, inputs can be shared
among the sub-computations, leading to a lower input bandwidth.
Inversely, ST MACs only store one result, which is the sum of multiple
low-precision multiplications, leading to a relatively small output bandwidth.
However, since those products are parts of one output element, they cannot
share the same input. This results in a largely increased input bandwidth. For
example, in order to keep its 16 arithmetic blocks busy in 2b×2b mode, the
input bandwidth of 2D FU ST explodes. Such highly-unsteady bandwidth may
complicate its memory interface and its integration into a chip.
Finally, the much smaller bandwidth of temporal-based MAC units could be a
huge advantage over spatial-based designs for narrow memory-band systems.
[Figure 3.17: Bar charts of the bandwidth per clock cycle of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.]

However, as their computing capability is also lower, this result should be considered relative to the computational throughput.

3.5.2 Bandwidth per operation

Figures 3.18(a)-(b) therefore show the bandwidth per operation of each MAC,
derived from the above-mentioned bandwidth (bit/clock cycle) and computing
capability (operation/clock cycle).

[Figure 3.18: Bar charts of the bandwidth per operation of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.]

As expected, at full precision, all MACs have the same bandwidth per operation. With precision scaling down, ST-type MACs show superiority over the others, especially the 1D and 2D FU ST. This is because their arithmetic blocks remain fully utilized, which keeps their computing capability high, and their output bandwidth remains very low, which balances the increased input bandwidth. For symmetric scaling, SWU MACs lose their advantage because of their unused arithmetic logic, leading to a relatively low computing capability. SWU ST is slightly better than SWU SA because, with the same computing capability, it
has lower bandwidth requirements. For asymmetric scaling, SWU MACs only benefit from data gating, behaving like the baseline circuit.
In temporally scaled MAC units, the low bandwidth and the low compute capacity compensate for each other, resulting in a bandwidth per operation similar to the baseline circuit.
Note that the differences between designs are less pronounced in Figure 3.18 (bandwidth/op) than in Figure 3.17 (bandwidth). The reason is that designs with a small bandwidth generally also perform few operations per cycle (especially the serial designs, and the SWU designs under weight-only scaling). Dividing the two quantities therefore partially cancels their individual variations, so the resulting differences are smaller than when comparing the bandwidth alone.

3.5.3 Throughput evaluation

Figures 3.19(a)-(b) compare the measured throughput of the different architectures, normalized to the data-gated conventional MAC in full-precision mode.
At 8b and 4b precisions, most designs display similar throughputs in their 1-level and 2-level variants. This result demonstrates that throughput is not affected much by the level of scalability, contrary to area and energy (shown later). Hence, systems targeting high performance can fearlessly consider embedding many precision modes.
The speed gain obtained by data gating the 1-level designs from 4b to 2b is negligible despite the multi-mode optimization. Conversely, designed-in scalability levels bring a near-quadratic performance boost for all designs. 2-level scalable designs reach impressively high throughput factors in 2b precision. Above all, 2D FU SA achieves 14.5× the base throughput for symmetric scaling, thanks to its 16 parallel sub-computations. Its compatibility with both symmetric and asymmetric scaling even allows it to reach a good second place in weight-only scaling, with 3.1× the base throughput.
Interestingly, the 2-level 1D FU SA appears versatile for both weight-only and symmetric scaling, in spite of not being optimized for symmetric scaling. It reaches an excellent 4.6× and 3.5× the base throughput for the symmetric and asymmetric scenarios, respectively. This is due to its first-stage adder, which remains in the critical path in all modes, so that its optimization jointly benefits all modes.
[Figure 3.19: Bar charts of the normalized circuit throughput of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.]

For SWU designs, in contrast, the multiplier hardware has to be co-optimized for different objectives, as the critical path changes from one mode to the other. Another reason for the 1D FU SA's high speed is its SA accumulation stage, which only contains short, and therefore faster, adders.

3.5.4 Area breakdown

Figure 3.20 shows the area breakdown of the different MAC architectures synthesized in a 28 nm CMOS process, normalized to the area of the data-gated conventional MAC unit.
Not surprisingly, all scalable units based on sub-computation parallelism require an overhead for configurability. Among them, the 1D and 2D FU approaches are the most area-consuming due to their customized internal structures, with 2D FU designs requiring up to 4.4× the area of a conventional MAC (excluding the input registers, as mentioned in Section 3.4.1).
On the contrary, SWU MACs mitigate the overhead by reusing arithmetic cells for subword-parallel computations. SWU ST circuits are particularly compact, with only 10% to 18% overhead for 1 and 2-level scalability, respectively.
Finally, serial designs are the only type requiring less area than the conventional multiplier, allowing area savings of up to 40% on the MAC circuit for area-constrained systems.
Concerning the sequential area (bottom darker bars), FU SA designs require far more registers than the others. This is due to the wide asymmetric sub-products of 1D FU SA (as one input operand is kept at full precision), as well as the quadratically increasing number of sub-products of 2D FU SA. Although SA-type, SWU SA keeps a low sequential area, at the cost of sacrificed throughput.

[Figure 3.20: Bar chart of the normalized area of precision-scalable MAC units, split into combinational logic and sequential area.]

3.5.5 Energy overhead at full precision

Figures 3.21-3.28 show the energy per operation when scaling precision for each type of MAC architecture. The left subfigures show symmetric scaling scenarios, while the right ones show weight-only scaling. All energy values
are normalized to the same full-precision data-gated conventional MAC drawn
with a solid black line.
Processing at full precision with scalable designs always comes with some energy
penalty. For 1D FU and SWU MACs (Figures 3.21-3.22, 3.25-3.26), energy
overheads for 8b computations are in the order of 20% to 40% for 1 and 2-level
scalability, respectively. For 2D FU architectures (Figures 3.23-3.24), these
costs increase to 52% and 94%.
Serial designs (Figures 3.27-3.28) require much more energy at full precision
because they need several clock cycles per computation, during which the clock
tree keeps toggling and comes to dominate the energy per operation. At full
precision, the 1D bit-serial MAC consumes 3.3× as much energy per operation
as data gating, and the 2D bit-serial one 14× as much.
Reassuringly, 1D multi-bit serial designs (Figure 3.27) come at a lower energy
penalty: the 2-bit serial MAC consumes 2.2× baseline energy, while it is also
able to scale precision down to 2 bits by design, and the 4-bit serial MAC
reduces the cost to 1.5× as much as data gating.
For nearly all designs, the 2-level MAC consumes more at full precision than
its 1-level version. Notably, this is not the case for the SWU ST (Figure 3.26)
architecture, for which the second level of scalability costs barely more than
the first. This is connected to its simpler ST-style accumulation coupled with
its efficient reuse of multiplication logic.

3.5.6 Energy scaling

Figures 3.21-3.28 allow evaluating the scaling efficiency of each architecture.


In general, reducing the precision of both weights and input activations (left
subfigures) with simple data gating leads to linear energy savings with respect
to bit precision. In comparison, all scalable MACs show a steeper slope between
8b and 4b precision, meaning that they save energy in a superlinear way with
bit precision. Below 4 bits, 1-level scalable circuits (including 1D and 2D 4-bit
serial MACs) can only scale precision through data gating, returning to the
slope of the baseline. Despite being less scalable, these designs exhibit decent
energy savings compared to 2-level scalable MACs thanks to lower overheads.
Figure 3.21: Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a 1D FU SA MAC (DNPU) [146].


Figure 3.22: Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a 1D FU ST MAC.

When preserving full-precision input activations (right subfigures), savings are
without exception far lower. The two SWU architectures (Figures 3.25-3.26) can
only scale energy through data gating, hence they are unsuitable for weight-only
scaling. Even though their architecture is built for both scenarios, 2D FU
(Figures 3.23-3.24) and 2D serial (Figure 3.28) MACs also seem ineffective for
weight-only scaling. Only the 1D FU ST MAC (Figure 3.22) proves to be highly
efficient for both scaling scenarios, distantly followed by the 1D 4-bit serial
(Figure 3.27) MAC architecture.

Figure 3.23: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 2D FU SA MAC.

Figure 3.24: Normalized energy/op for (a) symmetric and (b) weight-only scaling in a 2D FU ST MAC (BitFusion) [143].

The slope of 1D FU SA units (Figure 3.21) is one of the lowest among spatial-unrolling
architectures. In spite of being built for weight-only scaling, the 2-level
1D FU SA circuit consumes more energy in 4b mode than data gating and saves
only as much energy in 2b mode (9%) as its 1-level variant. This demonstrates that
the 1D FU SA architecture scales poorly in terms of energy efficiency. On the
contrary, the 1D FU ST architecture (Figure 3.22) is highly efficient and proves
to sustain its efficiency with extra scalability levels.

Figure 3.25: Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a SWU SA MAC (DVAFS) [117].

2.0 2.0

1.0 1.0
Energy/op

Energy/op

0.5 0.5

Data gating
1-level SWU ST
2-level SWU ST
0.1 0.1
8 4 2 8 4 2
Precision Precision
(a) symmetric scaling (b) weight-only scaling

Figure 3.26: Normalized energy/op for (a) symmetric and (b) weight-only
scaling in a SWU ST MAC (ST) [107].

Among spatially-unrolled architectures, the 2D FU ST MAC (Figure 3.24) and
the SWU ST MAC (Figure 3.26) have the sharpest energy drop when reducing
precision symmetrically. They lead to comparable savings at 2b precision (48%
for 1-level and 68% for 2-level scalable designs). SWU ST MAC is preferable
when using 8b precision thanks to lower overheads, while 2D FU ST MAC
should be favored when requiring asymmetric scaling.
In the end, an overall energy comparison is shown in Figure 3.29, which
summarizes the information of Figures 3.21-3.28 and completes the information
regarding design synthesis variation. In this figure, the height of each bar has
an upper bound and a lower bound (when only one is visible, the two overlap).
The upper bound depicts the energy efficiency across precision modes for a single
circuit of each design, with this circuit being selected as a balanced design choice
with the minimum energy-delay-area product averaged across precision modes
(described at the start of Section 3.5). To evaluate this design selection, the
lower bound of each bar shows the minimum energy point measured for each single
precision mode. In other words, comparing the upper bound and the lower bound is
comparing the "best-balance" design choice we took with the "best-per-mode"
design among the entire design synthesis space (> 5000 circuits, explained in
Section 3.4.2). Results confirm that the "best-balance" design remains close to
the best achievable energy efficiency and does not change the observed trends.

Figure 3.27: Normalized energy/op for (a) symmetric and (b) weight-only scaling in 1D serial (UNPU) [94] and multibit-serial MACs [21]. Beware of the scale.

Figure 3.28: Normalized energy/op for (a) symmetric and (b) weight-only scaling in 2D serial and multibit-serial MACs (LOOM) [142]. Beware of the scale.

Figure 3.29: Bar charts of the normalized energy of precision-scalable MAC units for (a) symmetric and (b) weight-only scaling scenarios.

3.6 Comparative Study

This section presents an overall comparison of all MAC architectures in terms
of energy efficiency (energy per operation) and area efficiency (throughput per
area), for 1V nominal supply as well as under DVFS.

3.6.1 Comparison of scalable MACs at nominal voltage

Figure 3.30 displays the feasible implementations and Pareto frontiers for each
precision-scalable MAC design at each precision scenario, sweeping across a
broad range of frequency constraints at nominal supply voltage. Circuits are
compared in terms of energy per operation and throughput per area. The best
circuits are towards the bottom-right corners of subfigures.
At 8b precision (Figure 3.30(a)), data gating is undeniably the most efficient,
capable of the highest throughput per area, followed first by 1-level (bright
colored dashed lines) and then by 2-level (dark colored dotted lines) scalable
designs which suffer from the scalability overhead.
When scaling precision symmetrically (Figures 3.30(b)-(c)), the 2D FU ST and
SWU ST architectures largely outperform other architectures in terms of energy
per operation. 2D FU SA circuits are by far the best for throughput-area
performance, but not energy efficiency, given their large non-shared registers.
Note that 1-level scalable circuits stay the best compromise at 4b precision and
above, but this trend reverses at 2b precision, which is visible by the inversion
of bright and dark curves between the left and right subfigures.
When only reducing weight precision down to 2b (Figure 3.30(e)), the 2-level
1D FU ST architecture is optimal for energy while its 1D FU SA companion
stands out for throughput-area efficiency. At 4b precision (Figure 3.30(d)), 1-level
1D FU ST and 1D 2-bit serial are best for energy; however, the gains over the
baseline are limited (20% at most). Bypassing internal additions, 1D FU SA is
advantaged for throughput, while 1D 4-bit serial benefits from a smaller area.

3.6.2 Comparison of scalable MACs with DVFS

Figure 3.31 displays the comparison and Pareto frontiers of precision-scalable
designs with DVFS between 0.8 and 1 V.
The relative comparison between designs is fairly similar to the case at nominal
voltage. However, the Pareto frontiers for all designs are perceptibly larger than
at nominal voltage, especially regarding the range of achievable throughput-area
solutions. This is manifest for the SWU ST architecture, for example. DVFS
offers one smooth degree of freedom to utilize the circuits at the optimal energy
point within a target throughput or energy budget.

Figure 3.30: Comparison of MAC architectures synthesized in a 28 nm CMOS process at 1 V supply voltage in terms of energy/op and throughput/area.

Figure 3.31: Comparison of MAC architectures synthesized in a 28 nm CMOS process with DVFS in terms of energy/op and throughput/area.

3.7 Comparative Study as a Function of Use-Case Ratios

3.7.1 Introduction and methodology

The best precision-scalable MAC architecture is neither the one with the lowest
consumption at reduced precision, nor the one with the lowest overhead, nor
the one with the steepest energy scaling. The best architecture is the one
that best serves the needs of its target application. Hence, this section offers a
comparative study for practical use cases of the MAC units, rather than for
each precision mode individually. Such a study is vital as it makes a direct
link to the application level, and it simplifies the analysis by combining many
results into a single trade-off.
Three case studies are assessed by defining the percentage of operations the
MAC unit does under each precision mode. For simplicity, only the fraction of
full-precision computations (8b) is set, while the ratios of 4b and 2b operations
are assumed equal. Energy and throughput-area efficiencies are weighted
considering these ratios to redraw a global Pareto frontier.
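To make this weighting concrete, a small sketch is given below. It is only one plausible way to combine per-mode numbers (energies are weighted per operation, while throughputs are combined through the time spent per operation, since faster scaled-precision modes occupy the unit for a shorter duration); the helper name and the example numbers are illustrative assumptions, not values taken from this thesis.

    # Hedged sketch of combining per-precision-mode results into one use-case
    # point. The function name and the numbers below are illustrative only.

    def use_case_metrics(per_mode, ratios):
        """per_mode: {mode: (energy_per_op, throughput_per_area)};
           ratios:   {mode: fraction of operations executed in that mode}."""
        energy = sum(ratios[m] * per_mode[m][0] for m in per_mode)
        # throughput is combined through time: low-precision ops finish faster
        time_per_area = sum(ratios[m] / per_mode[m][1] for m in per_mode)
        return energy, 1.0 / time_per_area

    # Hypothetical scalable MAC with 20% 8b / 40% 4b / 40% 2b operations
    per_mode = {"8b": (1.20, 2.0), "4b": (0.60, 4.0), "2b": (0.35, 8.0)}
    ratios   = {"8b": 0.20, "4b": 0.40, "2b": 0.40}
    print(use_case_metrics(per_mode, ratios))  # weighted energy/op, throughput/area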
Figures 3.32-3.33 show the case studies with full-precision operations representing
33%, 20%, and 5% of the computations, respectively (2D 1-bit serial MACs have
been omitted for clarity due to their sub-optimality). Only nominal voltage is
considered here since the 1 V comparison results remain valid with DVFS.

3.7.2 33% full-precision computations (equal usage)

First and foremost, when using all the precision modes uniformly (Figures 3.32(a),
3.33(a)), the conventional MAC with simple data gating is close to the most
efficient architecture, but is also by far the best in terms of throughput and
area. Indeed, no matter how efficient the scaled operations are, the much slower
and more energy-consuming 8b computations largely dominate the energy budget.

Note that even if 2b and 4b operations represent 66% of the computations in this
use case, since these are often 2× to 15× faster in scaled precisions (cf. Section 3.5.3),
the circuits stay in these precision modes for a much shorter duration than in
the 8b mode.
For symmetric scaling scenarios (Figure 3.32(a)), the conventional MAC unit
is slightly outperformed by the 1-level SWU SA architecture and by both 1-level
and 2-level SWU ST circuits. On the other hand, for weight-only scaling
(Figure 3.33(a)), the conventional unit stays more energy efficient, ahead of the
1-level 1D FU SA and ST MAC circuits.

3.7.3 20% full-precision computations

Fortunately, DNNs have proven resilient to a higher percentage of scaled-precision
computations. Lowering the amount of full-precision operations down
to 20% (Figures 3.32(b), 3.33(b)) makes scalable MACs worth the effort to
lower energy consumption for symmetric scaling scenarios. At this precision
ratio, SWU SA and ST designs excel in energy efficiency, with up to 20%
energy saving compared to data gating.
1-level SWU SA and ST circuits continue to surpass the conventional MAC for
symmetric scaling (Figure 3.32(b)). In general, 1-level scalable designs perform
better than their 2-level equivalents, except for the 2-level SWU ST circuits,
which stand out.
However, for the weight-only scaling scenario (Figure 3.33(b)), scalable MACs
remain mostly inefficient in terms of energy. The 2-level 1D FU ST architecture
is hardly better than the conventional MAC, at the cost of considerably worse
area-throughput capability.

3.7.4 5% full-precision computations

Across real-world DNNs, some low-complexity applications can use even fewer
full-precision operations [143], e.g., only around 5%. This 8b-operation ratio
is used for the third use case (Figures 3.32(c), 3.33(c)). At that proportion
of 8b operations, SWU ST and 2D FU ST largely outperform data gating for
symmetric scenarios, by 32% and 22%, respectively.
Finally, some scalable MAC units are now beneficial for weight-only scaling
scenarios (Figure 3.33(c)), but energy gains are lower: at most 15% for 2-level
1D FU ST, followed by 1-level 1D FU ST and 1D 4-bit serial MACs (8%).
Figure 3.32: Symmetric scaling: Overall energy/op and throughput/area of MAC architectures utilized with 33%, 20% and 5% of 8b computations.
Figure 3.33: Weight-only scaling: Overall energy/op and throughput/area of MAC architectures utilized with 33%, 20% and 5% of 8b computations.

While 1D FU SA is consistently equivalent to data gating in energy for both
symmetric and weight-only scenarios, it is the best architecture with respect to
throughput per area, bringing 10% to 30% gain compared to data gating.
Even at that level, with the remaining 95% of operations split between 2b and
4b precision, no other MAC unit can show a benefit compared to a simple
data-gated MAC. In particular, despite the recent trend for bit-serial designs,
all of them but the 1D 4-bit serial turn out inefficient in all scenarios.
These results confirm the importance for the designer to focus resources on what
can bring the largest benefits rather than falling into the pitfall of implementing
and optimizing down to the lowest level, which can yield limited or diminishing
returns for the overall system. This strategy in the design trade-offs, recalling,
at circuit level, Gene Amdahl's law [8], should be one of the most pervasive
principles for designers.

3.8 Conclusion

This chapter has introduced a new taxonomy for precision-scalable MAC
architectures, categorizing them along multiple criteria: the type of unrolling
(spatial or temporal), the dimensions they unroll (1D, 2D or 2D symmetric)
and, for spatial-based designs, their type of accumulation (Sum Together or
Sum Apart). This new classification has not only categorized all existing
architectures, but it has also uncovered new design patterns that can give
designers array-level and algorithmic-level insights to choose the right type of
processing elements for their system. Along with this taxonomy, an exhaustive
survey of SotA and further architectures has been carried out in order to clearly
understand their ground principles, features, connections, and differences.
A benchmarking and comparison of these architectures has then been conducted.
Therein, all circuits have been implemented in a 28 nm commercial CMOS
process across a wide range of performance targets, with precision ranging from
2 to 8 bits. This study has thoroughly analyzed, theoretically and quantitatively,
the different scalable MAC units in terms of energy, throughput, bandwidth
and area, aiming to understand the key trends to reduce computation costs in
neural-network processing via precision scaling.
The results of this comparative study have highlighted that 2D FU ST
(BitFusion) [143] and SWU ST (ST) [107] have the highest energy efficiency
for symmetric scaling scenarios, while 1D FU ST and 1D 4-bit serial [21] are
best for weight-scaling scenarios. In addition to that, 1D FU SA (DNPU) [146]
and 2D FU SA excel with high throughput for all scaling scenarios, but suffer,
together with 2D FU ST (BitFusion), from large and varying bandwidth requirements.
Despite the recent trend for 1D [94, 21] and 2D [163, 142] serial designs, these
are strongly penalized for both throughput and energy efficiency.
This comparison has been concluded by an exploration of three practical case
studies of usage under fixed proportions of operations at each precision, revealing
the large impact of full-precision computations, even at a small proportion, in
the overall energy budget. For symmetrical scaling scenarios, although being
more efficient on low-precision computations individually, 2D FU ST (BitFusion)
is overtaken by SWU ST (ST) in any use case due to its higher overheads in
full-precision mode. For weight-only scenarios however, precision-configurable
designs are found to be mostly ineffective, where 2-level 1D FU ST becomes
beneficial only at a very low proportion of full-precision operations.
This chapter has presented a detailed comparison and analysis of a large number
of different precision-scalable MAC architectures, offering many enlightening
design guidelines. However, only having the MAC-unit-level design insights
is insufficient to build a good precision-scalable DNN accelerator system, as
array-level techniques can amortize the single-MAC unit’s scalability overhead
in different ways and thus, impact the final design’s optimality.
In the next chapter, we move one abstraction level higher in the DNN
accelerator design, to the MAC array level. What MAC array architecture
can most efficiently support variable-precision DNN execution is thoroughly
and systematically studied, incorporating the MAC-unit-level understandings
learned from this chapter. Interestingly, different trade-offs at the array level,
compared to the unit level, are observed, offering more insights towards system-
level integration.
Chapter 4

Precision-Scalable MAC Array Design Space Exploration

Moving one abstraction level up from the previous chapter, this chapter aims to
offer a clear view of the design space of precision-scalable MAC arrays (PSMAs),
and provides answers to the consecutive open question Q1.2: What is the best
MAC array architecture for variable-precision DNN execution?
To this end, this chapter first introduces a precision-aware nested for-loop
representation for DNN mappings. Next, based on this new representation, it
proposes a comprehensive PSMA taxonomy, capable of systematically covering
the most prominent SotA PSMAs, as well as uncovering new PSMA architectures.
Following that, a highly parameterized PSMA template is built that can be
design-time configured into a huge subset of the design space spanned by the
taxonomy. This allows us to fairly and thoroughly benchmark 72 different PSMA
architectures. This benchmarking study is performed in 28 nm technology
targeting run-time precision scalability from 8 to 2 bits, operating at 200 MHz
and 1 GHz. Analyzing resulting energy and area breakdowns and comparing
the array-level optimality to the previous chapter’s MAC-level optimality reveal
key design insights for PSMA architectures.

This chapter is based on publication [73], and contains large fractions of it. The author's
contributions include (but are not limited to) the enhanced loop representation, PSMA design
space identification, the design methodology, and paper writing.


4.1 Motivation and Chapter Organization

Many precision-scalable accelerators were proposed in recent years, some
of which support weight-only (1D) precision scalability [146, 94, 93], or
input/weight (2D) precision scalability [143, 133, 142, 51, 182]. As all
these precision-scalable designs have been demonstrated in different silicon
technologies, embedded in different system configurations, benchmarked with
different DNN workloads, and written with different RTL coding styles, it is
hard to compare them against each other in terms of their drawbacks and merits.
The previous chapter has conducted a comprehensive survey and comparison
of precision-scalable MAC architectures. However, it only focused on a single
MAC, and did not study the implications of using such MACs in an array. This
hence omitted two important efficiency factors when integrating such MACs in
a larger MAC array: 1) Depending on the precision-scalable MAC architecture,
the scalability overhead can, or cannot, be amortized across all MACs of the
MAC array, strongly impacting the resulting energy and area overheads per
operation; 2) The precision-scalable MAC architecture has an impact on which
additional spatial unrolling dimensions can be exploited across the MAC array in
low precision modes, again resulting in array-level energy efficiency implications.
To fill in these gaps and enable a fair comparison of PSMA, this chapter makes
the following contributions:

• Section 4.2 firstly extends the traditional nested for-loop based mapping
representation of a Conv 2D layer with 2 additional precision-aware bit-
group loops. Then, it discusses the implications of the two newly added
bit-group loops on the underlying hardware and their spatial/temporal
mapping options.
• Section 4.3 presents a new PSMA taxonomy, based on the newly proposed
precision-aware for-loop representation. A wide range of SotA PSMA
architectures are then mapped to the taxonomy, and thanks to the
taxonomy, lots of new PSMA topologies are also identified.
• Section 4.4 introduces a uniform and highly parameterized PSMA
template, which can be design-time configured to cover a large subset of
the PSMA design space spanned by the taxonomy, laying the foundation
for a fair comparison and fast exploration in PSMA’s design space.
• Section 4.5 benchmarks this design space in terms of area and energy,
resulting in PSMA design insights.
• Section 4.6 concludes this chapter.

4.2 Precision-Enhanced DNN Loop Representation

This section introduces a precision-driven extension to the traditional for-loop
format of a Conv 2D layer, a widely used basis to represent different
DNN mappings on accelerators (introduced in Chapter 2.3.2). First, the
motivation behind this extension is explained, after which different loops in the
representation are categorized based on their implications on the underlying
hardware.

4.2.1 Motivation

To fully understand the motivation of extending the traditional for-loop
representation, the concept of bit-groups (BGs) is first explained. Precision-
scalable accelerators typically compose the underlying MAC units out of reduced
precision sub-multipliers [147]. The intermediate partial products are then either
shifted and added together to achieve the full precision result, or treated as
separate results when the MAC operates in its low precision mode.
To illustrate this, a simple 4b precision-scalable multiplier is shown in Figure 4.1,
which has a similar architecture to the BitBricks found in BitFusion [143]. In
the shown example, the 4×4b multiplier consists of 4 sub-units, each being a
2×2b multiplier. In full 4b precision, both inputs are divided into 2 BGs of 2b
each, which together form the full 4b input resolution. These BGs are multiplied
in the different sub-units of the multiplier, and get shifted and aggregated to
obtain the final 4×4b multiplication result. In the low precision 2b mode, 4
inputs of 2b are each used independently, hence each consisting of only a single
BG. In this case, each sub-unit acts as a standalone multiplier, producing an
independent result. In a sense, the multiplier is hence considered as a mini-array
that performs spatial unrolling at the BG level at full precision, and enables
additional spatial unrolling possibilities at lower precisions.

Figure 4.1: A simple 4b precision scalable multiplier.

Figure 4.2: Introducing 2 extra bit-group (BG) for-loops to the traditional for-loop representation (introduced in Chapter 2.3.2) of a Conv 2D layer:

    Traditional 7 CNN loops:
      for b  = 0 to B-1     (batch size)
      for k  = 0 to K-1     (output channel / weight kernel)
      for c  = 0 to C-1     (input/weight channel)
      for oy = 0 to OY-1    (output row)
      for ox = 0 to OX-1    (output column)
      for fy = 0 to FY-1    (weight kernel row)
      for fx = 0 to FX-1    (weight kernel column)
    Extra 2 BG loops (assuming 2 bits as a group):
      for bi = 0 to Bi-1    (input bit-groups)
      for bw = 0 to Bw-1    (weight bit-groups)
        O[b][k][oy][ox] += I[b][c][oy+fy][ox+fx][2bi+1:2bi] · 2^(2bi)
                         × W[k][c][fy][fx][2bw+1:2bw] · 2^(2bw)
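As a concrete illustration of this decomposition, the following behavioural sketch mimics the multiplier of Figure 4.1 in its 4b and 2b modes (the asymmetric 4b×2b mode is omitted for brevity). It is a functional model only, not the thesis RTL, and the function names are illustrative assumptions.

    # Behavioural sketch of a 4b precision-scalable multiplier built from four
    # 2b x 2b sub-multipliers (illustrative names, unsigned operands assumed).

    def split_bit_groups(x, precision, bg=2):
        """Split an unsigned `precision`-bit operand into 2b bit-groups, LSB first."""
        return [(x >> (bg * i)) & ((1 << bg) - 1) for i in range(precision // bg)]

    def scalable_mul(a, w, precision):
        if precision == 4:
            # full 4b mode: shift & add the four partial products
            a_bgs, w_bgs = split_bit_groups(a, 4), split_bit_groups(w, 4)
            return sum((a_bgs[bi] * w_bgs[bw]) << (2 * (bi + bw))
                       for bi in range(2) for bw in range(2))
        elif precision == 2:
            # low 2b mode: the four sub-units act as independent multipliers,
            # so a and w are lists of four independent 2b inputs
            return [ai * wi for ai, wi in zip(a, w)]

    assert scalable_mul(11, 13, precision=4) == 11 * 13
    assert scalable_mul([1, 2, 3, 1], [3, 3, 2, 1], precision=2) == [3, 6, 6, 1]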
This concept can now be introduced in the DNN mapping representation to
unify all precision scalability techniques. This representation is traditionally
characterized as 7 nested for-loops. To include the impact of precision scalability,
2 additional BG for-loops are added to it, as shown in Figure 4.2. In the previous
example of Figure 4.1, at high precisions, the 2 BG loops are spatially unrolled
inside the MAC unit. In this mode, the MAC unit is hence unable to unroll
any other loop spatially. At lower precisions, in contrast, the BG loop sizes of
the workload shrink, making room for other for-loops to be spatially unrolled
inside the MAC unit.
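To make the extended representation concrete, the runnable sketch below computes one output element of a tiny Conv layer once with the traditional loops and once with the two extra BG loops, and checks that both give the same result. The sizes, precisions, and helper names are illustrative assumptions and not taken from the thesis.

    # Sketch of the precision-enhanced loop representation for one output element.
    import random

    random.seed(0)
    C, FY, FX = 2, 3, 3              # reduction loop sizes for one output pixel
    PREC, BG = 8, 2                  # operand precision and bit-group width
    N_BG = PREC // BG                # number of bit-groups per operand

    I = [random.randrange(1 << PREC) for _ in range(C * FY * FX)]
    W = [random.randrange(1 << PREC) for _ in range(C * FY * FX)]

    def bg(x, b):                    # bit-group b of x (bits [BG*b+BG-1 : BG*b])
        return (x >> (BG * b)) & ((1 << BG) - 1)

    # Traditional loops only: full-precision MACs
    o_ref = sum(i * w for i, w in zip(I, W))

    # Same computation with the two extra BG loops: each MAC becomes
    # N_BG x N_BG reduced-precision MACs that are shifted and accumulated.
    o_bg = 0
    for n in range(C * FY * FX):     # stands in for the c/fy/fx reduction loops
        for bi in range(N_BG):
            for bw in range(N_BG):
                o_bg += (bg(I[n], bi) * bg(W[n], bw)) << (BG * (bi + bw))

    assert o_bg == o_ref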
Treating the BG loops similarly to any other Conv layer loop enables the
flexibility of mapping them spatially and also temporally, as done in bit-serial
(BS) architectures. This allows us to uniformly characterize the behaviour of a
wide variety of PSMAs. Note that besides Conv layers, other DNN layers like
FC, DW and PW can also be extended following the same method.

4.2.2 Implications on the MAC array hardware

In addition to BG loops, it is essential to understand how each loop can be
mapped to a MAC array. As shown in Figure 4.2, the 7 traditional Conv
loops are categorized based on input, weight, or output sharing. For example,
‘K’ (output channel dimension) is an input-sharing loop dimension in Conv
computation because when looping through ‘K’ (spatially or temporally), the
same input elements are required and thus shareable. The same concept applies
to weight- and output-sharing loops.
When input- or weight-sharing loops are spatially unrolled across a dimension of
a MAC array, the inputs/weights can be broadcast to all the MAC units along
that dimension of the MAC array. On the other hand, when an output-sharing
loop is spatially unrolled, the outputs of each MAC unit are added together
along that dimension. Additionally, spatial or temporal unrolling of BG loops
has hardware implications, as it would require the partial results to go through
a shift & add tree or get accumulated in a shift-add register, respectively.

4.3 Precision-Scalable MAC Array Taxonomy

Making use of the new loop representation, a taxonomy is introduced in this
section, which can cover all SotA precision-scalability techniques. This new
taxonomy will serve as a definition for the full design space of PSMAs, and will
later be explored and benchmarked extensively in Section 4.5.

4.3.1 Highly parameterized PSMA template

A uniform and highly parameterized MAC-array architecture template, which
is compatible with the precision-scalability concepts introduced in previous
sections, is built up from basic 2b multiplier building blocks, denoted here as
"Level 1 (L1)" units. These basic L1 units are subsequently combined in different
hierarchical levels to form the complete MAC array. We choose to combine 4×4
units in each step, hence resulting in 16 L1 units forming one L2 unit, 16 L2
units forming one L3 unit, and 16 L3 units forming one L4 unit. An illustration
of the proposed template is shown in Figure 4.3.
Flexibility is the main motive for going with such a hierarchical architecture.
With such a template, one can stop at L3 for a smaller MAC array size, go
for L4 for a standard array, or even extend to higher levels for a larger array.
Additionally, having a small L1 unit allows us to have precision-scalable MAC
units that can go down to 2b precision, which is the lowest precision supported in
most precision-scalable accelerators [143, 133, 51, 116]. Furthermore, this multi-level
architecture will later give a clear perspective of where each computation
loop gets unrolled/mapped to.

Figure 4.3: L4: 4×4 L3 units; L3: 4×4 L2 units; L2: 4×4 L1 units.

4.3.2 Spatial unrolling

The loop categorization of the previous section (weight-sharing (WS), input-sharing
(IS), output-sharing (OS), and BG loops) can now be linked to the
hierarchical array template to assess the hardware effects. We define that every
level Ln of the MAC array template can support a specific spatial unrolling
along its horizontal dimension, and one along its vertical dimension (which can
be identical to or different from the horizontal one).
Specifically, if IS/WS loops are spatially unrolled along a dimension of a certain
level Ln , it means that the inputs/weights are multi-cast across the different Ln−1
units of this dimension. This also implies that the outputs of the Ln−1 units in
this dimension can not be added together, and need separate accumulators. On
the other hand, if an OS loop is spatially unrolled along a dimension of Ln , the
partial sums of the Ln−1 units across that dimension are added together with
an adder tree. This then implies that the inputs and weights of all these Ln−1
units on that dimension are unique.
The resulting three unrolling possibilities are illustrated in Figure 4.4. One
level in the MAC array template hierarchy will either: (a) share inputs along
both dimensions (input activations and weights), resulting in 16 independent
outputs; (b) share inputs along one dimension (input activations or weights)
and outputs along another one, resulting in 4 aggregated outputs; (c) share
outputs along both dimensions, resulting in 1 joint sum. This operand sharing
concept is aligned with the operand spatial reuse explained in Chapter 2.3.2.
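A minimal behavioural sketch of these three options is given below (illustrative only; the `combine` helper is not part of the thesis template). It only shows how the 16 partial results of one level are turned into 16, 4, or 1 output(s).

    # Combining the 16 partial sums of the 4x4 L(n-1) units inside one L(n) unit
    # under the three sharing schemes of Figure 4.4 (illustrative sketch).

    def combine(partials, scheme):
        if scheme == "IS":   # inputs shared along both dims -> 16 independent outputs
            return [p for row in partials for p in row]
        if scheme == "HS":   # inputs shared along one dim, outputs along the other
            return [sum(row) for row in partials]                 # 4 aggregated outputs
        if scheme == "OS":   # outputs shared along both dims -> 1 joint sum
            return [sum(p for row in partials for p in row)]

    partials = [[r * 4 + c for c in range(4)] for r in range(4)]  # dummy L(n-1) outputs
    assert len(combine(partials, "IS")) == 16
    assert len(combine(partials, "HS")) == 4
    assert len(combine(partials, "OS")) == 1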
Figure 4.4: Input, Hybrid, and Output Sharing (or Spatial Reuse as explained
in Chapter 2.3.2) at Level n (Ln). Results from one accumulation island are
added together spatially to form one output.

4.3.3 Bit-Group unrolling

Besides WS/IS/OS loops, the BG loops can also be spatially unrolled on the
MAC array, and dictate where the shift & add tree needs to be inserted. In
the proposed taxonomy, BG loop unrolling can either occur spatially at one
of the two lowest levels of the MAC array (so either at the L2 or at the L3
level; note that L1 is a 2b multiplier, which is not an array level), or can occur
temporally, a.k.a. bit-serial processing (BS). For cases in which BG loops are
unrolled temporally, internal registers are required to hold the intermediate
results of the shift-add operation performed over time. BS designs can be
further sub-categorized based on the level in the PSMA at which the internal
registers are located.
Note that BG loops are only unrolled at precisions higher than the minimal
precision mode, because at the lowest precision there are no BG loops to unroll,
i.e., all BG for-loop dimension sizes equal 1.
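The temporal alternative can be illustrated with a few lines of behavioural code: a hedged sketch assuming 1D (weight-only) serialization with an LSB-first schedule, where the weight is fed one 2b bit-group per cycle into a shift-add register. Names are illustrative and this is not the thesis hardware.

    # Temporal BG unrolling (bit-serial processing) of the weight operand.
    BG = 2

    def bit_serial_mac(act, weight, w_prec):
        acc = 0
        for cycle in range(w_prec // BG):              # one weight BG per cycle
            w_bg = (weight >> (BG * cycle)) & ((1 << BG) - 1)
            acc += (act * w_bg) << (BG * cycle)        # shift-add register update
        return acc

    assert bit_serial_mac(act=200, weight=57, w_prec=8) == 200 * 57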

4.3.4 Precision scalability modes

Existing SotA PSMA architectures can support different precision scalability
modes. In the newly-introduced taxonomy, if the PSMA supports weight-only
scalability, while the input activations’ precision is fixed at the highest precision,
then the design is referred to as 1D precision scalable (e.g. 8b×8b, 8b×4b
and 8b×2b). On the other hand, if both weight and activation precision are
scalable, we denote this as 2D precision scalability. 2D scalability can be further
classified into 2D symmetric (2D-S) scalability in case the precision of weights and
activations is the same (e.g. 8b×8b, 4b×4b and 2b×2b), or 2D asymmetric
(2D-A) scalability in case the design supports different precisions for weights
and activations (e.g. 8b×8b, 8b×4b, 8b×2b, 4b×4b and 2b×2b).

4.3.5 Fully-Unrolled vs. Subword-Unrolled designs

As we move to lower precisions, a trade-off arises in terms of memory bandwidth
vs. hardware utilization. To ensure full hardware utilization, the MAC array
will require increased input/weight or output bandwidth when replacing the
BG-loop unrolling with another loop unrolling in the lower precisions. For
instance, this can be seen in BitFusion [143] and BitBlade [133] architectures,
where the input/weight bandwidth each is increased by a factor of 2x and 4x
when scaling the precision down from 8b to 4b and 2b respectively. To illustrate
this effect, Figures 4.5-4.7 showcase how the introduced MAC array would
behave at various precisions for different L2 unrolling schemes. For simplicity,
only BG unrolling at L2 is shown, however, the same concepts and trade-offs
apply for other BG configurations.
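The 2× and 4× factors can be reproduced with a back-of-the-envelope sketch. It is illustrative only: it assumes full utilization of a FU design with BG unrolled at L2 (as in the BitFusion/BitBlade-style examples above), and the constants and names are mine, not taken from the thesis.

    # Relative input/weight bandwidth of one fully-utilized L2 unit vs. precision.
    FULL_PREC = 8
    L1_PER_L2 = 16          # 4x4 2b sub-multipliers per L2 unit

    def input_bandwidth_bits(prec):
        """Bits of unique input (or weight) data one L2 unit consumes per cycle."""
        bgs_per_operand = prec // 2                  # BG loops shrink at lower precision
        products_per_l2 = L1_PER_L2 // (bgs_per_operand ** 2)
        return products_per_l2 * prec                # distinct operands x their width

    base = input_bandwidth_bits(FULL_PREC)
    print([input_bandwidth_bits(p) / base for p in (8, 4, 2)])   # -> [1.0, 2.0, 4.0]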

Figure 4.5: L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and
input sharing (IS) in lower precision modes.
Figure 4.6: L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and
hybrid sharing (HS) in lower precision modes.
Figure 4.7: L2 in a Fully-Unrolled (FU) design. BG is unrolled in L2 and
output sharing (OS) in lower precision modes.
At full precision (8b×8b), all unrolling schemes of Figures 4.5-4.7 behave in
the same way. That is because the BG loops are directly unrolled in L2, and
there’s no room to unroll any other loops. As we go down to lower precisions,
the BG loop unrolling size reduces, thus enabling more unrolling of other
for-loops. When unrolling an additional IS loop, the weights and activations
can be broadcast across the vertical and horizontal dimensions. As a result,
the input bandwidth is preserved. Yet, the outputs cannot be added together,
thus increasing the output bandwidth. On the other hand, when an OS loop is
unrolled instead, all the partial results can be added together, preserving output
bandwidth. Yet, all units should be fed with different inputs, hence requiring a
higher input bandwidth. HS is the middle-ground between the two in terms of
input/output bandwidth. As can be seen, the I/O bandwidth increases with
lower precision operation for all designs.
Some alternative PSMA topologies opt for a fixed I/O bandwidth across different
precisions, as illustrated in Figure 4.8, to simplify the memory interface design.
In exchange, they bear a lower hardware utilization at low precision, effectively
gating part of the L1 units when scaling down in precision. This family of MACs
was first introduced in [117, 107], and extended to a broader class of designs
in [22]. Same as in Chapter 3, we here again refer to them as Subword-Unrolled
(SWU) designs, in contrast to the Fully-Unrolled (FU) topologies.

Figure 4.8: L2 in Subword-Unrolled (SWU) designs. BG is unrolled in L2 and (a) no sharing or (b) output sharing (OS) in lower precision modes.

4.3.6 Complete taxonomy and SotA mapping

Based on the concepts of previous subsections, we can now introduce the
complete PSMA taxonomy. The new taxonomy is governed by 6 different
parameters that fully define each architecture:

1. L4/L3/L2: Spatial unrolling at each level, selected from BG, IS, HS or
OS for high precision modes; selected from IS, HS or OS for the lowest
precision mode (as there are no BG loops in this case).

2. BG: Level at which the BG loops are unrolled, selected from spatially at
L2/L3 or temporally in a bit-serial (BS) way. If BS, we also identify at
which level the internal registers are located, denoted as BS-Lx.
3. Config: I/O bandwidth and hardware utilization trade-off configuration,
which can be FU or SWU.
4. Mode: Precision scalability, which can be 1D, 2D asymmetric, or 2D
symmetric.

This newly introduced taxonomy now allows mapping all PSMA techniques
presented in literature, as summarized in Table 4.1. Since BG unrolling is only
relevant when the precision mode is above the minimal precision, a separate
row per design is added for both high and low precisions.
For DNPU [146] and BitFusion [143], each L2 unit produces a full product every
clock cycle, thus their BG are unrolled spatially at L2. The difference between
them is that BitFusion is 2D asymmetric scalable, while DNPU is only 1D
scalable. Additionally, at low precisions, BitFusion’s L2 is output-shared, while
DNPU’s L2 is input/weight-shared. BitBlade [133] and Ghodrati [51] have a
very similar PSMA architecture, and thus are mapped the same way in the new
taxonomy. In their works, the shifters are shared between L2 units at L3, thus
their BG is unrolled spatially at L3.
Stripes [82], UNPU [94], and Loom [142] are BS designs, meaning the BG
loops are unrolled temporally rather than spatially. Stripes and UNPU are 1D
scalable, while Loom is 2D scalable. Additionally, UNPU is IS at L2 with its
internal shift-add registers located at L1, while Stripes and Loom are OS at L2
with their internal registers located at L2.

Table 4.1: SotA Mapping

    SotA              Prec.   L4   L3   L2   BG      Config   Mode
    DNPU [146]        High    IS   OS   BG   L2      FU       1D
                      Low     IS   OS   IS
    BitFusion [143]   High    IS   OS   BG   L2      FU       2D-A
                      Low     IS   OS   OS
    BitBlade [133]    High    IS   BG   OS   L3      FU       2D-A
                      Low     IS   OS   OS
    Ghodrati [51]     High    IS   BG   OS   L3      FU       2D-A
                      Low     IS   OS   OS
    Stripes [82]      High    IS   IS   OS   BS-L2   FU       1D
                      Low     IS   IS   OS
    UNPU [94]         High    IS   OS   IS   BS-L1   FU       1D
                      Low     IS   OS   IS
    Loom [142]        High    IS   IS   OS   BS-L2   FU       2D-A
                      Low     IS   IS   OS
    Envision [116]    High    IS   IS   BG   L2      SWU      2D-S
                      Low     IS   IS   No
    ST [107]          High    IS   OS   BG   L2      SWU      2D-S
                      Low     IS   OS   OS

    IS: Input & Weight Sharing, OS: Output Sharing, HS: Hybrid Sharing,
    BS-Lx: Bit-Serial with internal shift-add registers located at level Lx,
    SWU: Subword-Unrolled, FU: Fully-Unrolled,
    1D: 1D scalability, 2D-A/S: 2D Asymmetric/Symmetric scalability.

The final batch includes Envision [116] and ST [107], which are both SWU
designs. Fundamentally, the difference between them is their behaviour at L2 in
low precision modes, where Envision has no sharing and ST is OS. Moreover,
Envision opts for an IS scheme at L3, while ST goes for an OS scheme.
With the introduction of the new taxonomy, the road is paved towards
benchmarking the full PSMA design space under the same operating
circumstances and technology, leading to better insights on the pros and
cons of each PSMA architecture. To achieve this goal, a uniform and highly
parameterizable MAC array template is built, which allows efficiently mapping
different design points covered by the taxonomy.

4.4 Uniform and Parameterized PSMA Template

The SotA implementations presented in Table 4.1 are not the only possible
PSMA instantiations. The taxonomy allows identifying a broader range of
possible architectures that were not previously covered in literature. To be
able to quickly implement and benchmark all different configurations in the
introduced design space, a uniform and highly parameterized PSMA template
is developed. Based on its user-defined parameters, this flexible PSMA can
be design-time configured into any array architecture in a subset of the design
space spanned by the introduced taxonomy, i.e., it supports different L2, L3,
L4, BG, and FU/SWU settings.

4.4.1 Bit-Group configurations

The uniform parameterized PSMA template supports different BG unrollings.
The hardware implication of this is illustrated in Figure 4.9 in a simplified way.
At full precision, if BG is unrolled spatially at L2, it means that each L2 unit
produces a full product. It also means that the shift & add tree lives inside each
one of the L2 units, and thus the shifters are not amortized across L1 results as
in the case of L3 and BS BG unrollings.
On the other hand, if BG loops are spatially unrolled at L3, it means that the
shift & add tree lives in L3 instead of L2, and is shared between aggregated
L1 results. To guarantee correct functionality, each L2 unit only multiplies
bit-groups that share the same significance, as each L2 product will be shifted
by the same amount. To illustrate, in Figure 4.9, if BG loops are unrolled at L3,
you can find that all the BG-level operations with the same colour are shifted
by the same amount, and thus grouped in the same L2 unit.

Figure 4.9: BG is unrolled spatially at L2, L3, or temporally at L3 (BS-L3), for the example workload O = Σ_{n=0}^{15} I[n] × W[n]. Assume OS at L2 and L3; assume each level Ln contains 2×2 Ln−1 units for simplicity.
For BS designs, the BGs are unrolled temporally rather than spatially. This
means that the adders and shifters are decoupled, and the shifting operation is
performed over multiple clock cycles, depending on the precision. Additional
circuitry is required to ensure correct functionality of BS designs, such as timer
logic for scheduling and internal registers. It’s worth noting that by default, only
BS designs include internal registers between array levels. In the BS example
of Figure 4.9, the internal accumulation register is located at L3, i.e., BS-L3.
To gain a better understanding of the internal architecture of BS array designs
and how the BG-level scheduling is handled, refer to Figure 4.10. Assuming
L2 units are OS, each L2 produces one 8b result (16 2b×2b results summed
together). The scheduling is done in two phases. First, the weights are
stationary while the inputs are shifted right by 2b each clock cycle (Phase 1).
During that time, the intermediate results are stored in the first internal register.
After all the input bits are depleted, the weights are shifted right by 2b (i.e.,
the BG precision), the data stored in the first internal register is transferred
to the second internal register (Phase 2), and then Phase 1 is repeated again.
This cycle repeats itself until all the weight bits are depleted, then finally the
accumulator at the end of the array (not shown in Figure 4.10) gets activated,
and accumulates the full product.
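The two-phase schedule can be captured in a short behavioural sketch for one output-sharing L2 unit. This is a functional model under the stated 2b bit-group and 8b precision assumptions: it reproduces the loop order (inner loop over input bit-groups with stationary weights, outer loop over weight bit-groups) and the final value, while abstracting away the exact shift direction and register transfers of the real hardware. Names are illustrative, not the thesis RTL.

    # Two-phase bit-serial processing of one L2 unit with 16 parallel 2b lanes.
    BG, PREC = 2, 8
    N_BGS = PREC // BG

    def bg_of(x, i):
        return (x >> (BG * i)) & ((1 << BG) - 1)

    def l2_bit_serial(acts, weights):          # 16 activations, 16 weights (8b each)
        reg2 = 0                               # second internal register
        for bw in range(N_BGS):                # Phase 2: advance to the next weight BG
            reg1 = 0                           # first internal register
            for bi in range(N_BGS):            # Phase 1: stream input BGs, weights stationary
                psum = sum(bg_of(a, bi) * bg_of(w, bw) for a, w in zip(acts, weights))
                reg1 += psum << (BG * bi)      # shift-add over input bit-groups
            reg2 += reg1 << (BG * bw)          # shift-add over weight bit-groups
        return reg2

    acts    = list(range(16, 32))
    weights = list(range(100, 116))
    assert l2_bit_serial(acts, weights) == sum(a * w for a, w in zip(acts, weights))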
As you may have noticed in Figure 4.9, BS designs reduce the complexity of the
L2 and L3 adder trees, as they don’t require configurable shifters anymore, and
they can mostly amortize the shift-add logic overhead. On the other hand, BS
designs have their own limitations. One is that they have to find enough other
for-loops (other than BG loops) in the algorithm that can be spatially unrolled
on the array to ensure high hardware utilization. Secondly, BS designs require
additional scheduling control logic and internal circuitry for each partial output.
It is hence beneficial to reduce the number of outputs generated when designed
in this pattern, especially at lower levels of the array. With that in mind, lots
of inefficient designs can already be pruned out from the full design space.

4.4.2 Design space constraints

Based on the taxonomy introduced in Section 4.3, the combination of the design
parameters would lead to a huge variety of array configurations. As such, some
constraints had to be put in place to reduce the design space to make it more
manageable, while still maintaining the interesting exploration regions. With the
observations of previous subsections, the list below summarizes the constraints
that were imposed on the different parameters, and their justification. The
constrained design space is also summarized in Table 4.2.

Figure 4.10: IS, HS and OS at L3 for different precisions when BG is unrolled temporally. Assume OS at L2.

Table 4.2: Constrained PSMA Design Space

    Layer    Supported configurations
    L4       IS / HS / OS
    L3       IS / HS / OS
    Config   FU                                         SWU
    BG       L2             L3         BS-L2            L2
    L2       IS / HS / OS   HS / OS    OS               No sharing / OS

    Number of supported designs: 3 (L4) × 3 (L3) × 8 (Config / BG / L2) = 72 designs


1. If BG is unrolled at L3, L2 cannot be IS. The benefit of unrolling BG at L3 is
that all the L1 units within an L2 can share shifting logic. If the outputs at L2
are not shared, we end up with the same number of shifters as if BG were
unrolled at L2.

2. If BG is temporally unrolled, L2 is OS with BS-L2. BS designs require
internal registers to support the temporal shifting of each output. In
this chapter, we fix the location of these BS internal registers to L2,
to reduce the template complexity and balance the critical path. To
minimize their overhead, it is best if only a single internal register per L2
unit is needed, resulting in a total of 256 internal registers. Benchmarking
BS-L3/L4 can be interesting as well, but would lead to many more
input registers and requires large OS in the workload. This study is left
to future work.

3. If it is a SWU design, L2 can either be OS or have no sharing. Due to SWU
designs' nature, L2 has to be symmetrical. Thus, the outputs are either
shared or not shared.

It is worth noting that, given the design constraints, UNPU [94] is the only
accelerator in the SotA mapping that is not supported by our uniform PSMA
template, since it is a BS design with L2 being IS (BS-L1).
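For concreteness, the constrained design space of Table 4.2 can also be enumerated programmatically. The short Python sketch below (illustrative only, not part of the benchmarking flow) lists the eight supported (Config / BG / L2) combinations and confirms the count of 72 designs.

from itertools import product

# Enumerate the constrained design space of Table 4.2 (illustrative sketch):
# 3 L4 modes x 3 L3 modes x 8 (Config / BG / L2) combinations = 72 designs.
L4_MODES = ['IS', 'HS', 'OS']
L3_MODES = ['IS', 'HS', 'OS']
CONFIG_BG_L2 = [
    ('FU',  'BG: L2', 'L2: IS'), ('FU', 'BG: L2', 'L2: HS'), ('FU', 'BG: L2', 'L2: OS'),
    ('FU',  'BG: L3', 'L2: HS'), ('FU', 'BG: L3', 'L2: OS'),
    ('FU',  'BG: BS', 'L2: OS'),
    ('SWU', 'BG: L2', 'L2: NO'), ('SWU', 'BG: L2', 'L2: OS'),
]
designs = list(product(L4_MODES, L3_MODES, CONFIG_BG_L2))
assert len(designs) == 72   # matches the design count reported in Table 4.2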

Figure 4.11: An example of output accumulator register size calculation and its run-time configurations over full/low precision modes. For the example PSMA (L4: IS / L3: OS, FU / BG: L2 / L2: OS): at 8b×8b, L2 (OS) produces 16b results (one 8b×8b product), L3 (OS) sums 16 of them into 20b, and L4 (IS) keeps 16 separate 20b results (320b); with 4b of accumulation headroom per result, the required output register size is 384b and the register is fully used. At 2b×2b, L2 produces 8b results (sum of 16 2b×2b products), L3 12b, L4 192b; the required size is 256b, so the 384b register is partially gated.

4.4.3 Register layout

To ensure proper timing of all designs assessed using the uniform PSMA
template, we include input and output registers at the periphery of the array
template. Only the BS designs include additional internal registers within
the array to assist the periodic shift-add process. Depending on the array
L4/L3/L2 configurations (namely, the IS/HS/OS configuration as well as the
BG unrolling, as discussed in Section 4.3), each design requires a different
maximum input/output bandwidth per clock cycle. As a result, the required
amount of registers at the array’s inputs/output varies. IS designs typically
require fewer input registers, at the expense of more output registers. The
opposite is true for OS schemes.
Practically, each OS / IS sharing dimension along each hierarchical level
L2/L3/L4 multiplies the required number of input / output register words
by 4 (since each level Ln is a 2-dimensional 4×4 Ln−1 array). It is hereby
important to note that the number of bits per input word is fixed (8b at full
precision), while the number of bits per output word depends both on the input
word precision and the maximal expected temporal accumulation time. Here,
the largest required bit width across all precision modes is computed (i.e., the
worst-case scenario), with an extra 4b of headroom for temporal accumulation
per expected output (same as in Chapter 3). An example of how the final output
register size is computed is given in Figure 4.11, in which, for an example PSMA
architecture, the required output precision after each level (from L2 to L4, and
further to the accumulator) is given and explained.
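As an illustration of this sizing rule, the following Python sketch (a simplified model that assumes BG unrolled at L2 and L2: OS, and does not model HS modes or BS internal registers) reproduces the register sizes of the Figure 4.11 example.

import math

def output_register_bits(ip, wp, l3_mode, l4_mode, headroom=4):
    # Required output accumulator register size (in bits) for a PSMA with BG
    # unrolled at L2 and L2: OS. ip/wp = input/weight precision (2, 4 or 8 bit).
    # The 16 2b x 2b multipliers of an L2 unit form this many independent
    # full-precision products, which the L2 (OS) adder tree sums together:
    products_per_l2 = 16 // ((ip // 2) * (wp // 2))
    bits = ip + wp + int(math.log2(products_per_l2))
    words = 1
    for mode in (l3_mode, l4_mode):      # walk up L3 and L4 (16 units each)
        if mode == 'OS':
            bits += 4                    # 16 partial sums merged into one word
        else:                            # 'IS': 16 independent words kept apart
            words *= 16
    return words * (bits + headroom)     # + 4b temporal accumulation headroom

# Example PSMA of Figure 4.11 (L4: IS / L3: OS, FU / BG: L2 / L2: OS):
print(output_register_bits(8, 8, 'OS', 'IS'))   # 16 x (20b + 4b) = 384 bits
print(output_register_bits(2, 2, 'OS', 'IS'))   # 16 x (12b + 4b) = 256 bits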
Finally, in BS designs, two-stage accumulation registers are needed. Since the
design space has been constrained in Subsection 4.4.2 to only include BS-L2, the
BS register layout is always the same. As shown in Figure 4.10(a), an L2 unit in BS
designs always produces 8b results (across all precision modes), as it accumulates
16 2b×2b partial products. At full precision, the internal registers clear their
stored value every 4 iterations, and each register's stored value needs to be
shifted right by 2 bits 3 times; therefore, a 6b headroom is needed for each BS
accumulation register, visualized as "000000" in the register logic of Figure 4.10(a).

4.5 Experiments

In this section, each design in the design space is evaluated in terms of energy
per operation and area, both in a low frequency (200 MHz) and high frequency
(1 GHz) context. Additionally, breakdowns are performed to gain more insight
into the trade-offs between hardware components for systems with different
design-time and run-time configurations.

4.5.1 Methodology

To ensure a fair comparison, all 72 benchmarked designs are synthesized from


the same parameterizable hardware description language (HDL) template with
the same coding style and set of optimizations. For the precision configuration,
most designs can handle both symmetrical and weight-only scaling, with the
exception of SWU designs that only support symmetrical precision scalability.
Supported precision scaling values are 8b, 4b, and 2b for both input activations
and weights.
All designs are synthesized from SystemVerilog using Cadence Genus v19.11
with high-effort compilation, once at 200 MHz, and once at 1 GHz. A 28 nm
commercial CMOS process is used, with a nominal supply voltage of 1V. Power
estimations are conducted through post-synthesis simulations, using Questa Sim
v10.6c for simulating switching activity and Cadence Genus for power extraction.
The simulations were conducted for 4,096 clock cycles with an ideal workload,
one that ensures full hardware utilization for all designs at all precision modes.
BS designs benchmarked in this section are all BS-L2. For simplicity, we refer
to them directly as BS.

4.5.2 Workload

To further elaborate on the nature of the ideal workload, it can be characterized
in terms of for-loops. At the lowest precision (2b×2b), the PSMA maximally
consists of 4,096 functional MACs. So, the workload should have at least 4,096
OS loop iterations and 4,096 combined IS/WS loop iterations to guarantee full
hardware utilization for all
designs at all precisions. Table 4.3 shows a breakdown of how the number of
for-loops was chosen for each loop category, both for all-level IS and all-level OS
corner cases. Basically, all-level IS designs will spatially unroll IS/WS loops of
the workload, while OS loops will be unrolled temporally, thus providing more
output stationarity. The opposite is true for all-level OS designs. Ultimately,
all designs will run the same workload with randomly generated inputs and
weights, but with different spatial and temporal mappings.

Table 4.3: Ideal workload's minimal size*

Corner cases at 2b×2b        IS loops   WS loops   OS loops
L4, L3, L2: IS               64         64         1
L4, L3, L2: OS               1          1          4096
Overall required loop size   64         64         4096

* Workload that guarantees full hardware utilization for all designs across all precisions.
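For illustration, the ideal workload of Table 4.3 can be sketched as the following matrix multiplication (illustrative Python only, not the exact simulation stimulus used in the benchmarks).

import numpy as np

# Illustrative 64 x 64 x 4096 matrix multiplication with random 2b operands,
# i.e. the loop nest
#   for m in range(64):            # weight-reuse (WS) loop
#       for n in range(64):        # input-reuse (IS) loop
#           for k in range(4096):  # accumulation (OS) loop, output stays put
#               out[m, n] += inp[m, k] * wgt[n, k]
M, N, K = 64, 64, 4096                                  # loop sizes of Table 4.3
inp = np.random.randint(0, 4, (M, K), dtype=np.int64)   # 2b input activations
wgt = np.random.randint(0, 4, (N, K), dtype=np.int64)   # 2b weights
out = inp @ wgt.T                                       # same result as the loop nest

# 64 * 64 * 4096 = 16.8M MACs: enough parallel work along every loop category
# so that any design, whichever loops it unrolls spatially, can keep all 4,096
# 2b x 2b multipliers fully utilized at every precision mode.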

4.5.3 Throughput

All simulations execute an optimized workload for which all designs have
a spatial utilization of 100% at full precision. Hence, both FU and SWU
designs have the same throughput (number of operations per clock cycle) at
full precision. As we scale down to lower precisions, the FU designs still utilize
the full hardware, while SWU designs gate part of the PSMA in order to maintain
a fixed input bandwidth across all precisions. As a result, SWU has a reduced
throughput compared to FU: specifically, half and a quarter of the throughput at
4b and 2b precision, respectively.
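This throughput scaling can be summarized in a small sketch (illustrative Python, assuming a 4,096-multiplier PSMA and symmetric precision scaling).

def macs_per_cycle(config, prec):
    # MAC throughput of a 4,096-multiplier PSMA at symmetric precision
    # prec x prec (8, 4 or 2 bit); a sketch of the scaling described above.
    full = 4096 // ((prec // 2) ** 2)     # FU: all 2b x 2b multipliers stay busy
    if config == 'FU':
        return full
    # SWU: the input bandwidth is fixed at its full-precision value, so part of
    # the array is gated: 1x, 1/2x and 1/4x of FU at 8b, 4b and 2b respectively.
    return full * (prec // 2) // 4

for prec in (8, 4, 2):
    print(prec, macs_per_cycle('FU', prec), macs_per_cycle('SWU', prec))
# 8b: 256 / 256   4b: 1024 / 512   2b: 4096 / 1024  MACs per cycle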

4.5.4 Energy efficiency

All supported designs are compared in terms of energy per operation, with one
operation defined as one full multiplication or one addition, i.e., one full MAC
operation is defined as two operations. The resulting energy efficiency heatmaps
are shown in Figure 4.12 for 200 MHz and in Figure 4.13 for 1 GHz. Additionally,
the designs are compared with symmetric (8b×8b, 4b×4b, 2b×2b) and one
asymmetric (8b×4b) precisions. Each row denotes an L4/L3 mode, while each
column denotes a configuration/BG/L2 combination. Note that SWU designs
do not support asymmetric precision scalability.
At 200 MHz (Figure 4.12), the most energy-efficient columns across different
precisions are the (FU / BG: L3 / L2: OS), (FU / BG: BS / L2: OS), and (SWU
/ BG: L2 / L2: OS) configurations. At full precision, the (SWU / BG: L2 / L2:
OS) designs are more efficient. This can be attributed to the simpler design and
non-configurable shifters, while still maintaining the same throughput as FU.
At lower precisions, SWU designs perform fewer operations per cycle, so their energy
per operation becomes more in line with that of FU designs. Intuitively, one can understand that (FU / BG:
L3) and (FU / BG: BS) designs should be more efficient than (FU / BG: L2)
designs, due to the fact that the shifter’s overhead is shared across the L2 units
in this case. BS designs have an additional switching overhead of the internal
registers, and since here the frequency is relatively low, there is not much gain
in return.
The key factor for good energy efficiency is to make L2 OS, as it produces only
one partial product regardless of precision, thus simplifying L3 and L4 designs.
At 200 MHz, L3 and L4 unrollings don’t have as much impact as long as L2 is
OS. The best L4/L3 configurations in that case are IS/IS, IS/OS, OS/IS. If
both L4 and L3 swing towards the OS side, the critical path gets longer due
to the adder trees, and as a result the energy efficiency is negatively impacted,
though not by much since this is still at a low clock frequency.
Figure 4.13 shows a similar trend at 1 GHz clock frequency, with some small
differences. (L2: OS) is still a key factor for good energy efficiency, and (BG:
L3) and (BG: BS) are still more efficient than (BG: L2). However, (BG: BS)
has a slight edge over (BG: L3) due to the reduced critical path, as BS designs’
internal registers start to show their benefit at higher clock frequencies. In
line with the 200 MHz results, the best L4/L3 configurations are IS/IS, IS/OS,
OS/IS. Having all level unrollings as OS yields a detrimental effect on the
energy efficiency, as the critical path gets longer, conflicting with the tight
timing requirement.
The stellar performance of BS designs in this benchmark is in stark
contrast to the experimental results of Chapter 3, which showed BS

MAC units as the worst performer. The primary reason for the improved
energy efficiency of BS designs is that, in this chapter, we assume “L2 is OS” for
all BS designs in the PSMA template. As such, the hardware overhead of the
internal registers and shift-add logic is amortized across the L1 units within each
L2 unit (i.e., BS-L2), which ultimately reduces the energy/operation. In contrast,
the study in the previous chapter was done at the single-MAC-unit level, and thus
the amortization opportunity was ignored (i.e., BS-L1). This result highlights the
importance of hardware resource sharing in array-level BS precision-scalable
MAC engine designs.

Figure 4.12: Energy per Operation (fJ) at 200 MHz. Heatmaps for the 8b×8b, 8b×4b, 4b×4b and 2b×2b precision modes; each row is an L4/L3 mode (all IS/HS/OS combinations) and each column is one of the eight (Config / BG / L2) combinations: FU / BG: L2 / L2: IS, FU / BG: L2 / L2: HS, FU / BG: L2 / L2: OS, FU / BG: L3 / L2: HS, FU / BG: L3 / L2: OS, FU / BG: BS / L2: OS, SWU / BG: L2 / L2: NO, and SWU / BG: L2 / L2: OS. The asymmetric 8b×4b mode is not supported by the SWU designs.

4.5.5 Energy vs. Area

To get a better assessment of the most efficient array, the benchmarked designs
are compared in terms of energy per operation vs. area. Figures 4.14-4.15 show
scatter plots of energy/operation vs. area for 200 MHz and 1 GHz, at different
symmetric and asymmetric precisions. Each colour represents a different (config

Figure 4.13: Energy per Operation (fJ) at 1 GHz. Same layout as Figure 4.12: heatmaps for the 8b×8b, 8b×4b, 4b×4b and 2b×2b precision modes, with the L4/L3 modes as rows and the eight (Config / BG / L2) combinations as columns.

/ BG / L2) combination, while each shape represents a different (L4 / L3) mode.
The most optimal designs lie in the bottom-left corner.
At 200 MHz, the optimal designs are the purple and orange ones, which represent
(SWU / L2: OS) and (FU / BG: L3 / L2: OS). At 1 GHz, the cyan-coloured
markers (BS) have better energy efficiency, while the purple and orange ones
have smaller areas. As will be shown later, this can be attributed to the lack of
internal registers in the (SWU / L2: OS) and (FU / BG: L3 / L2: OS) designs
compared to BS designs. In most cases, the blue markers (FU / BG: L2 / L2:
IS) are on the upper right corner of the graph, which makes them one of the
least efficient configurations. This can be attributed to the fact that having IS
at the lowest level of the array produces lots of independent partial products
at lower precisions, which cannot be added together. Generally, it can be seen
that the top left and bottom right corners are mostly empty, which indicates
that in most cases, there’s a strong correlation between area and energy per
operation in this benchmark.

Figure 4.14: Energy/Op (fJ) vs. Area (mm2) at 200 MHz. Scatter plots for the 8b×8b, 8b×4b, 4b×4b and 2b×2b precision modes; colours denote the (Config / BG / L2) combinations, shapes denote the (L4 / L3) modes, and the SotA designs DNPU, BitFusion (BF), BitBlade (BB), Loom, Envision (Env) and ST are marked. The preferred corner is the bottom-left one.

Figure 4.15: Energy/Op (fJ) vs. Area (mm2) at 1 GHz. Same layout as Figure 4.14: colours denote the (Config / BG / L2) combinations, shapes denote the (L4 / L3) modes, and the SotA designs are marked.

4.5.6 Breakdown

To gain more insight into why some array configurations perform better
than others, breakdowns of the energy per operation and area for all PSMAs
are summarized in Figures 4.17-4.19 for 200 MHz and 1 GHz. Figure 4.16
visualizes the legend of Figures 4.17-4.18, showing that the energy breakdowns
include the contribution of L1 multipliers, adder trees (in L2, L3, and L4), final
accumulation adders, and registers (input, output, and BS internal).
The L1 2b multipliers' energy consumption is nearly identical in all cases and
only contributes a small fraction of the total energy consumption. L2 adder
trees consume more energy if BG is unrolled at L2, since in this case the adder
tree includes both adders and shifters. Also, when BG is unrolled at L2, L2
adder tree’s energy for SWU designs is a bit lower than for FU designs due
to the non-configurable shifters. The complexity of the L2 adder trees goes
down significantly when moving to BG unrolled at L3 or to BS designs. Additionally,
note that if L3 or L4 are IS (the first bar in each group of 9 bars), it means that
none of the partial results from the previous level gets added together, thus
there would be no L3/L4 adder trees in this case.
Output registers and accumulation adders dominate the energy consumption as
more levels become IS, while the input registers' energy is much lower, and vice
versa. Internal registers (denoted as pipeline registers in the plots) only exist in
BS designs, and they consume a sizeable fraction of the total energy consumption
of BS designs. At 1 GHz, the arrays where most levels are OS yield higher
energy consumption than the other designs. The reason is that OS arrays
have long critical paths, since the partial results have to go through complex
adder trees in most of the levels. To meet the timing constraint, the energy
consumption gets negatively impacted.
Area breakdowns are shown in Figure 4.19. As shown in the figures, the area is
dominated by the combinational logic, with the sequential logic only having a
small contribution to the total area. Additionally, all designs where L2 is OS
have a lower area than their “L2: IS" and “L2: HS" counterparts. That is because,
as L2 generates more partial products, the L3 and L4 adder trees become more
complex to keep the L2 partial results from being added together. In general,
a larger area is observed as expected for 1 GHz frequency, and this can be
attributed to larger cell sizes in order to meet the timing constraints, especially
for arrays where most levels are OS. In contrast, for the BS designs and the designs
where most levels are IS, the area increase when going from 200 MHz to 1 GHz is
not significant.

Figure 4.16: Legend illustration for Figure 4.17 and Figure 4.18. Input Registers contain input activations and weights. The breakdowns distinguish the L1 2b×2b multipliers, the L2 adder tree (16 × L1), the L3 adder tree (16 × L2), the L4 adder tree (16 × L3), the input registers, the output registers and output accumulators, and the internal registers (BS designs only). In each group of 9 bars, from left to right, the L4/L3 configurations are: IS/IS, IS/HS, IS/OS, HS/IS, HS/HS, HS/OS, OS/IS, OS/HS, OS/OS.

Figure 4.17: Energy efficiency at 200 MHz, broken down into L1 multipliers, L2/L3/L4 adder trees, output registers, input registers, pipeline registers and output accumulators: (a) Energy/Op (fJ) for 8b×8b, (b) for 8b×4b (not supported by the SWU designs), (c) for 4b×4b, (d) for 2b×2b. The 9 parallel bars within each configuration group correspond to the different L4/L3 settings listed in Figure 4.16.

Figure 4.18: Energy efficiency at 1 GHz, with the same breakdown and layout as Figure 4.17: (a) Energy/Op (fJ) for 8b×8b, (b) for 8b×4b, (c) for 4b×4b, (d) for 2b×2b. The 9 parallel bars within each configuration group correspond to the different L4/L3 settings listed in Figure 4.16.

Figure 4.19: Area breakdown into combinational and sequential logic for all (Config / BG / L2) groups: (a) Area (mm2) at 200 MHz, (b) Area (mm2) at 1 GHz.

4.6 Conclusions

All in all, DL models have proven to be resilient to lower precisions, which
in turn has propelled the research in PSMA design for DNN accelerators. This
chapter enabled a better categorization of current SotA PSMAs, a clear understanding
of the trade-offs within and between designs, and an insightful identification of the
optimal PSMA under different circumstances.
This chapter proposed a new loop representation extending the traditional
DNN layer loop format with additional Bit-Group loops for precision scalability.
Additionally, all the loops in this representation are categorized into three
different categories (IS, WS, and OS loops). This allows the introduction of
a new taxonomy that exploits the flexibility offered by the newly proposed
loop categories, where the BGs can be unrolled spatially or temporally (BS) at
different hierarchical levels of the MAC array. Afterward, existing SotA PSMAs
are mapped to the introduced taxonomy. The SotA mapping is accompanied by
a deep discussion on the effects of each taxonomy parameter on the underlying
hardware, and how they affect the input and output bandwidths of the PSMAs.
A new uniform and highly parameterized PSMA template is introduced that
covers a large subset of the design space spanned by the new taxonomy, and
supports symmetrical and asymmetrical precision scaling of inputs and weights.
All the architectures in the constrained design space are synthesized under a
commercial 28 nm technology, and benchmarked extensively in terms of energy
efficiency and area. From the conducted benchmarks, it is shown that BG
unrolling at L2 is the least ideal case (e.g., Bitfusion [143], DNPU [146]), and
that it is better to unroll BG at L3 for lower frequencies (e.g., BitBlade [133],
Ghodrati [51]) or to unroll BGs temporally for higher frequencies (e.g., Loom
[142], Stripes [82]). The benefit of having BG unrolled at L3 or temporally is
that the (configurable) shifters are amortized across different L2 units, making
the adder trees less complex. It’s generally a good idea to have a mixture of
IS/WS and OS loops throughout the array levels, and from the energy vs. area
study, we can comprehend that FU designs are better suited for workloads
where lower precisions are common, whereas SWU designs are better suited
for higher-precision-dominated workloads. This observation is aligned with the
precision ratio study results in Chapter 3.
The conclusion that differs from the previous chapter concerns the behavior of the BS
designs. In Chapter 3, we observed that BS designs were rather inadequate.
However, in this context, they exhibit favorable qualities. The reason is that,
in this chapter for BS designs (BS-L2), the hardware overhead of the internal
registers and shift-add logic is shared across the 16 L1 units within each L2 unit
(enabled by L2: OS), which ultimately reduces the energy/operation. In Chapter 3,
by contrast, each BS design can be seen as an L1 unit equipped with its own
registers and shift-add logic (BS-L1). The same array-level benefits can be
witnessed for spatial designs when transitioning from BG: L2 (without scalability
overhead amortization) to BG: L3 (with).
To summarize the results from the exploration, (L2: OS) is the key factor
for energy efficient PSMAs as it facilitates amortization for both spatial and
temporal designs. At 200 MHz, (BG: L3) is slightly better than (BG: BS) as
the internal registers of BS impose an overhead with no extra benefit at lower
frequencies. At 1 GHz, BS designs have a slight edge over (BG: L3) in terms of
energy efficiency, while (BG: L3) is better in terms of area. The good L4/L3
unrollings at both frequencies are IS/IS (Loom-like [142]), IS/OS (BitBlade-like
[133]), and OS/IS. The best performing SWU design across 200 MHz and 1 GHz
is (L4: IS, L3: OS, L2: OS), which is an ST-like architecture [107].
In the end, there are two things to be noted. Firstly, the best PSMAs were
selected based on an ideal workload in this chapter’s benchmarks, ensuring
full hardware utilization across all precision modes and designs. However, real-
world workloads may vary, impacting the utilization and performance of the
benchmarked designs. Analyzing the effects of different non-ideal workloads on

various PSMAs is an area for future research. Secondly, all benchmark results in
this study were obtained under a relatively old technology node (28 nm), which
could yield significantly different energy profiles compared to newer nodes (e.g.,
7 nm and 5 nm). As a result, the performance landscapes of different PSMA
designs might shift when transitioning to more advanced technology nodes.
For instance, BS designs, with their high proportion of wires and registers (due to
clocking), might lose their advantage, as wires and registers account for an increasing
share of the energy consumption in advanced technology nodes.
Chapter 3 and Chapter 4 have depicted the design spaces of DNN accelerator’s
MAC unit and MAC array for variable-precision execution. They thoroughly
benchmarked different design options by first implementing these circuits in
HDL, then validating their functionality, and finally synthesizing them to get
accurate performance estimation. As can be imagined, the whole flow requires
a lot of manual effort and is very time-consuming. Although it is manageable
for benchmarking Chapter 3's 19 MAC-unit-level designs and Chapter 4's
72 MAC-array-level designs (both assuming an ideal mapping scenario, i.e.,
without mapping optimization), it becomes infeasible for exploring hundreds or
thousands of different DNN accelerator architectures, together with optimizing
the mapping/scheduling out of millions of options.
To deal with the large design space at the complete DNN accelerator level,
Chapter 5 will build a general and fast architecture and mapping DSE framework,
ZigZag, based on an analytical cost model. Thanks to the deterministic
computing pattern of DL models, the built-in analytical cost models enable
ZigZag to estimate energy and latency breakdown of processing a DNN layer
on a customized accelerator in milliseconds, paving the way toward fast
architecture/mapping search and optimization.
Chapter 5

ZigZag: Enabling Fast DNN Accelerator-Mapping Design Space Exploration through Analytical Modeling

From this chapter on, we move the abstraction level up from components in a
DNN accelerator to the complete accelerator architecture. Factors not considered
in the previous two chapters, such as the non-ideality of the actual DNN workload
(e.g., the mismatch between workload sizes and hardware dimensions), the data
movement in the memory hierarchy, and mapping/scheduling optimizations,
are now taken into account.
However, this also greatly increases the design space complexity. So, in order
to deal with the vast architectural design space and countless mapping and
scheduling possibilities, a series of high-level DNN accelerator DSE frameworks
are developed, targeting all five principles proposed in Chapter 1.2.2: fast,
accurate, general, adaptable, and intelligent.
This chapter thoroughly explains the first, fundamental framework in this
This chapter is based on publication [109] and contains large fractions of it. The
author's contributions include (but are not limited to) the design space identification, design
point representation, cost model, case studies, framework implementation, and paper writing.
ZigZag is open source at https://github.com/KULeuven-MICAS/zigzag.


series: ZigZag, which performs high-level DSE for single-core DNN accelerators,
allowing single-layer spatial and temporal mapping optimization, based on an
analytical cost model.

5.1 Motivation and Chapter Organization

Building efficient embedded DNN accelerator systems requires a tight co-


design between DNN algorithms, hardware, and algorithm-to-hardware mapping.
However, owing to the large joint design space, finding an optimal solution
through physical implementation becomes infeasible. To tackle this problem,
we developed ZigZag, a rapid DNN accelerator joint architecture-and-mapping
DSE framework. ZigZag's main innovation lies in systematically broadening the
architecture and mapping search space, enabling even/uneven auto-mapping
with smart search methods; it is thus able to discover better design points
than other frameworks.
This chapter is organized as follows:

• Section 5.2 provides a clear view into several SotA DSE frameworks and
highlights the uniqueness of ZigZag.
• Section 5.3 gives an overview of the ZigZag framework, its major
components and various utilities.

• Section 5.4 defines the Memory-Centric Design Space Representation,


which is based on an enhanced nested for-loop format, for uniformly
representing each design point. It opens up a whole new mapping space
for the DSE by decoupling the operands (W/I/O), the memory hierarchy,
and the mapping scenarios.

• Section 5.5 builds the Loop Relevance Principle on top of this represen-
tation, enabling the framework to extract, in a systematic and insightful
way, the key information (such as memory accesses, required memory
bandwidth, etc.), from which the Hardware Cost Estimator derives the
system’s energy and performance values.

• Section 5.6 designs Mapping Search Engines to cope with the enlarged
mapping space (both even and uneven mappings). With heuristics and
iterative search strategies, these engines rapidly locate optimal (on energy
or/and latency) spatial and temporal mapping points, supporting mapping
on a wider variety of memory hierarchies and MAC array topologies.

• Section 5.7 equips ZigZag with an Architecture Generator to construct


different DNN accelerator architectures, especially focusing on auto-
generating all valid memory hierarchies under given constraints.

• Section 5.8 validates ZigZag against published accelerators, an in-house


accelerator, and existing DSE frameworks to assess the accuracy of the
Hardware Cost Estimator.
• Section 5.9 performs three case studies from mapping, hardware
architecture, and DNN workload perspectives resp., demonstrating the
strength of the Mapping Search Engines and providing insights into the
vast embedded DNN accelerator design space.
• Section 5.10 summarizes the further improvements added to ZigZag in
the course of the years 2021-2023 after its initial construction in 2019-2020.

• Section 5.11 concludes the chapter.

5.2 Related Works

Several DSE frameworks have emerged, targeting hardware-software co-


optimization by exploring the large design space available in the DNN accelerator
system design. Recent DSE frameworks in the literature include
Interstellar [184][186], SMAUG [178], Accelergy [177], Dory [17], Timeloop
[126], dMazeRunner [39], MAESTRO [91], and MAGnet [166].
To provide a clear view into the variety of DSE frameworks available in the
SotA and understand the uniqueness of ZigZag, Table 5.1 compares many SotA
DSE frameworks from four perspectives.
Firstly, concerning the hardware design space, SotA frameworks can be
distinguished based on whether they support a fully flexible hardware
configuration on both MAC array and memory hierarchy, like Timeloop [126];
resp. pre-define a hardware template with certain tunable parameters, like
MAGnet [166]; or make other specific assumptions, such as the sharing of all
memory levels for the operands W/I/O and only explore within these constraints,
like Interstellar [184].
Secondly, in the algorithm-to-hardware mapping space, two sub-spaces are
included, the spatial mapping space, which defines the parallelism (how neural
network loop dimensions are unrolled upon the MAC array); and the temporal
mapping space, which defines the data fetching sequence (in which order the
neural network layer is processed). Together, they form the mapping space. All

Table 5.1: DNN Accelerator DSE Framework Comparison


Framework Hardware Design Space Mapping Space Mapping Search Strategies Cost Estimation
Timeloop+Accelergy[126, 177] Fully flexible Even mappings Constraint-driven random sampling Highly fine-grained analy. model
MAESTRO[91] Parametrizable HW template Even mappings Tool-predefined dataflows Coarse-grained analytical model
Interstellar[184] All mem. levels shared Even mappings Exhaustive loop blocking/ordering Coarse-grained analytical model
dMazeRunner[39] All mem. levels shared Even mappings Constraint-driven + data reuse pruning Coarse-grained analytical model
MAGnet[166] Parametrizable HW template Even mappings Bayesian optimiz. + random sampling HLS hardware implementation
Dory[17] Fixed architecture Even mappings Constraint programming solver [1] Coarse-grained analytical model
SMAUG[178] Fixed architecture Even mappings User-predefined fixed mapping set Cycle-accurate SoC estimator
Ours: ZigZag Fully flexible Even and uneven Exhaustive/heuristic/iterative search Fine-grained analytical model

of the SotAs only support even mappings, which usually lead to sub-optimality
as will be shown in the results. Even and uneven mapping will be discussed in
detail in Section 5.4 of the chapter.
Thirdly, each DSE tool typically encompasses a mapping search engine, a.k.a.
auto-scheduler or mapper, to find the optimal temporal/spatial mappings for
deploying a certain neural network layer onto a specific accelerator architecture.
Most of the DSE frameworks perform a constraint-driven search to narrow
down the space and speed up the search procedure, like dMazeRunner [39];
some formulate the scheduling process into integer constraint problems and
utilize existing optimizers to solve it, like Dory [17]; the others use a partially-
predefined mapping as a hint to generate valid mappings, like MAESTRO [91].
Commonly used scheduling constraints and strategies include setting thresholds
for memory/PE array utilization and data reuse factors, putting an optimization
goal such as minimizing the DRAM access or the overall memory traffic, and
random sampling to avoid being trapped in a local optimum.
Finally, the last column in Table 5.1 lists the hardware cost estimation
approach adopted by each framework, in which three main categories can
be identified: 1) slow but very accurate cost estimations based on High-Level
Synthesis (HLS) [166], 2) medium-speed and accurate cycle-accurate system
simulators [178], and 3) fast and relatively accurate analytical models. Moreover,
there are different granularity levels for analytical models [177]. Fine-grained
models, like the one embedded in ZigZag, are more accurate than most of
the other coarse-grained models, by distinguishing memory write cost from
read cost, considering the memory word length’s impact on access energy (e.g.,
same-sized memories with different aspect ratio/IO bandwidth have different
data access energy), and taking data fetching pattern/stationarity into account
in the memory access cost and unit MAC cost.
Coming back to the five DSE framework principles to conclude on ZigZag's uniqueness:
1) the fully flexible hardware design space and even/uneven mapping space
contribute to ZigZag’s generality; 2) the analytical model-based hardware cost
estimation guarantees its fast speed; 3) the delicate calculus behind the analytical
model captures the intrinsic behavior of for-loop operation, ensuring the accuracy
of the model (shown in later sections and also next chapter); 4) the unified
design point representation facilitates the easy adaptability between different
design options; 5) heuristic-/iterative-based auto-mapping search strategies pave
the way towards the intelligent DSE of DNN accelerator.

5.3 ZigZag Framework Overview

ZigZag consists of three key components, as shown in Figure 5.1: 1) an analytical


energy-performance-area Hardware Cost Estimator, 2) two Mapping Search
Engines that support spatial and temporal even/uneven mapping on a wide
range of architectures, and 3) an Architecture Generator that auto-explores the
wide memory hierarchy design space and generates all accelerator architectures
(memory hierarchy plus MAC array) that meet certain constraints. These three
components cooperate in synergy to enable the exploration of a much broader
design space compared to other SotAs.
The complete DSE flow in ZigZag includes four steps: 1) the Architecture
Generator takes in workloads, hardware constraints, and a user-defined memory
pool to generate all valid accelerator architectures and feed the architecture
information into the Hardware Cost Estimator and Mapping Search Engines;
2) the Hardware Cost Estimator computes required area; the Mapping Search
Engines search for the optimal spatial and temporal mapping candidates; 3)
for each architecture, the Hardware Cost Estimator evaluates all the mapping
candidates found by the Mapping Search Engines in terms of energy and/or
performance (i.e., latency/throughput/MAC array utilization); 4) finally, by
comparing each architecture’s best mappings’ energy/performance/area, the
optimal architectures together with their optimal mappings are found.

Figure 5.1: ZigZag framework diagram. Inputs: the neural network workload (e.g., Conv2D, DepthwiseConv2D, Dense), hardware constraints (e.g., PE/memory/area utilization, MAC precision, total area), a memory pool (memory sizes, word lengths, word access costs, memory areas), and technology-dependent costs (e.g., MAC cost, interconnection cost). Components: 3) the Architecture Generator (auto-memory-hierarchy search), 2) the Mapping Search Engines (auto-spatial- and auto-temporal-mapping search), and 1) the Hardware Cost Estimator (Memory-Centric Design Space Representation, Loop Relevance Principle, loop information extraction). Output: the Pareto-optimal solutions, i.e., the optimal accelerator architectures, their optimal spatial/temporal mappings, and the corresponding energy, performance, and area.



Besides working in this complete-DSE mode, in which the design space of


both architecture and mapping are explored, ZigZag can also work in several
partial-DSE modes, in which architecture and/or mapping can be partially or
fully pre-defined, leaving only the rest of the design space open for ZigZag to explore.
For example, one can use ZigZag to search for the best temporal mapping for a
fixed hardware architecture, or use it as a pure hardware cost estimator to assess
different DNN workloads' hardware implications on a fixed-mapping datapath.

5.4 Design Space Representation

An enhanced data representation format is the foundation for systematically


exploring the enlarged design space in ZigZag. Inherited from the basic mapping
representation introduced in Chapter 2.3.2, the newly proposed Memory-Centric

Workload               Batch   O channel   I/W channel   O row   O column   W row   W column
Conv 2D                B       K           C             OY      OX         FY      FX
Conv 1D                B       K           C             1       OX         1       FX
Depthwise Conv 2D*     B       1           1             OY      OX         FY      FX
Pointwise Conv 2D      B       K           C             OY      OX         1       1
Matrix-Vector Multi.   1       K           C             1       1          1       1
Matrix-Matrix Multi.   B       K           C             1       1          1       1
* Repeat Group (G) times, not shown here.

Figure 5.2: Workload summary of common neural network layers. Loop notations (B, K, C, OY, OX, FY, FX) are used in the chapter.

Design Space Representation integrates three aspects of information: 1) the


DNN algorithm, 2) the hardware architecture (memory hierarchy and MAC
array), and 3) the algorithm-to-hardware mappings (spatial and temporal). It
especially excels in capturing all forms of memory sharing, loop blocking, loop
ordering, and spatial loop unrolling, for each operand (W/I/O) individually
and at each memory level.
Before discussing the representation, the DNN layer for-loop format and loop
notations used in this chapter are depicted in Figure 5.2, which cover a wide
range of layer types.
Figure 5.3 illustrates this proposed representation, as well as its ability to
represent balanced/unbalanced memory hierarchy, and even/uneven mapping
schemes. In Figure 5.3(a), the representation defines the mapping information
of the three operands separately, one per column, using three sets of nested
for-loops. Inside of each set, the architectural levels are represented from bottom
to top (divided by the horizontal bold lines), starting from the MAC level, over
Register File and Global Buffer levels, all the way up to the DRAM level. This
architectural level concept will be used intensively in the next section. The closed
vertical bold lines indicate the boundaries between different physical memory
modules, as shown in Figure 5.3(b)(d). Levels not separated by a closed
vertical bold line share a certain memory for the corresponding operands, e.g.,
the DRAM level in all (a-d).
For every operand, each alphanumeric pair indicates a for-loop, e.g., the first
term "K 4" is equivalent to "for K = 0 to 4-1", which together with all the
other for-loops constitute the temporal mapping. Assigning these for-loops into
different architectural levels is called loop blocking; swapping the order of all
the for-loops inside one level is called loop ordering.
The "u" suffix after a loop name indicates loop spatial unrolling (i.e., the
unroll loops introduced in Chapter 2.3.2), such as "FYu". The format "Au|Bu"
is inherited from [184], meaning that both the A and B loops are spatially
unrolled, which constitute the spatial mapping.
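To make the representation concrete, the sketch below encodes one design point in a simplified, hypothetical Python structure (illustrative values only, not ZigZag's internal format): each operand holds a list of architectural levels from bottom to top, each level an ordered list of (loop, size) pairs, with a 'u' suffix marking spatially unrolled loops.

# Hypothetical, simplified encoding of one design point (illustrative values,
# not ZigZag's internal format nor a design point taken from Figure 5.3).
mapping = {
    'W': [[('FXu', 3), ('FYu', 3)],               # MAC-level spatial unrolling
          [('C', 4), ('K', 2)],                   # register-file level
          [('OX', 16), ('OY', 16), ('K', 8)]],    # DRAM level
    'I': [[('FXu', 3), ('FYu', 3)],
          [('C', 4), ('K', 2)],
          [('OX', 16), ('OY', 16), ('K', 8)]],
    'O': [[('FXu', 3), ('FYu', 3)],
          [('C', 4), ('K', 2), ('OX', 16)],       # uneven: extra block kept locally
          [('OY', 16), ('K', 8)]],
}
# ('K', 2) is shorthand for "for K = 0 to 2-1". Assigning loops to levels is
# loop blocking, their order within a level is loop ordering, and since every
# operand carries its own level list, uneven mappings and unbalanced memory
# hierarchies are expressed directly; the shared temporal loop order
# (C, K, OX, OY, K) is kept identical across the three operands.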
Comparing the four design points in Figure 5.3 reveals the power of this
representation: 1) it enables operands to have a different number of memory
levels each, e.g., in design points (b)(d), Weight has two memory levels while
Input and Output have three; 2) it allows operands that have the same number
of memory levels to share or not share physical memory, e.g., the Register File
of Input/Output is shared in (a)(c) and separated in (b)(d); 3) it allows
operands that share one physical memory to have the same (even) or different
(uneven) loop blocking, e.g., all memory levels in (c) hold different loop sets for
each operand.
Figure 5.3: Using the Memory-Centric Design Space Representation to distinguish between balanced and unbalanced memory hierarchies, and even and uneven mappings. The four design points are: (a) balanced hierarchy with even mapping, (b) unbalanced hierarchy with even mapping, (c) balanced hierarchy with uneven mapping, (d) unbalanced hierarchy with uneven mapping. A balanced memory hierarchy shares W/I/O at all memory levels, whereas an unbalanced hierarchy decouples the memory levels and W/I/O; an even mapping uses the same W/I/O loop blocking at all shared memory levels (excluding the top memory), whereas an uneven mapping decouples the memory level and the W/I/O loop blocking. The memory unrolling is not depicted in the left two memory hierarchy sketches for clarity.

Towards later automated mapping optimizations, it should be noted that a valid


mapping needs to comply with several rules: 1) temporal loops of all operands
should follow the same order to maintain the functional equivalence; 2) together,
the spatial loops of all operands should have the same unrolled loop types and
dimensions (yet, there is no need for spatial loops of individual operands to
follow the same order or appear at the same level); 3) spatial and temporal loops
have to be assigned such that the given memory size at each level is respected;
4) in DSE runs with a fixed hardware architecture (i.e., without architecture
search), the spatial mapping needs to be consistent with the predefined memory
hierarchy and PE array interconnection.
ZigZag’s loop representation is the first one that is capable of capturing all
these hardware-mapping opportunities in one common structure. This paves
the way towards systematically and automatically extracting and analyzing all
feasible design points, and estimating their hardware cost, as will be discussed
in the next section.

5.5 Hardware Cost Estimator

ZigZag Hardware Cost Estimator targets the estimation of energy, performance


(PE array utilization, throughput, and latency), and area of mapping certain
neural network workloads onto a certain accelerator architecture.
It innovates on 1) the Loop Relevance Principle (Section 5.5.1), to extract
basic technology-independent hardware and data attributes in a systematic and
insightful way (Section 5.5.2); 2) a technology- and memory-bandwidth-aware
hardware cost integrator (Section 5.5.3), capable of extracting energy and
performance.

5.5.1 Loop Relevance Principle

Here we use a Conv 2D layer to explain the Loop Relevance Principle. As shown
in Figure 5.2, a Conv 2D layer is based on a 7D computing space (i.e., 7 nested
for-loops) with three 4D operands (i.e., Weight, Input, and Output are all 4D
tensors), which implies that not all 7 loop dimensions are relevant to each operand.
Figure 5.4 shows the Loop Relevance Principle foundation, in which all 7 loop
dimensions are categorized as relevant (r), irrelevant (ir), or partially relevant
(pr) to each operand. Looping through r loops indicates that new data needs to be
fetched (for W and I) or generated (for O), while looping through ir loops indicates
data reuse opportunities. For Weight

and Output, this is straightforward since all 7 computing space dimensions are
either parallel (relevant) or orthogonal (irrelevant) to their own 4D data space.

       B    K    C    OY    OX    FY    FX
W      ✕    ✓    ✓    ✕     ✕     ✓     ✓
I      ✓    ✕    ✓    ?IY   ?IX   ?IY   ?IX
O      ✓    ✓    ✕    ✓     ✓     ✕     ✕
(✓ relevant (r); ✕ irrelevant (ir); ?IX/IY partially relevant (pr) to IX/IY)

Figure 5.4: Loop type categorized by relevance.
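As a minimal illustration, the relevance table of Figure 5.4 can be encoded as a small dictionary (a sketch of how the principle could be expressed, not ZigZag's internal data structure).

# The relevance table of Figure 5.4 ('r' = relevant, 'ir' = irrelevant,
# 'pr' = partially relevant).
RELEVANCE = {
    'W': {'B': 'ir', 'K': 'r',  'C': 'r',  'OY': 'ir', 'OX': 'ir', 'FY': 'r',  'FX': 'r'},
    'I': {'B': 'r',  'K': 'ir', 'C': 'r',  'OY': 'pr', 'OX': 'pr', 'FY': 'pr', 'FX': 'pr'},
    'O': {'B': 'r',  'K': 'r',  'C': 'ir', 'OY': 'r',  'OX': 'r',  'FY': 'ir', 'FX': 'ir'},
}

def fetches_or_generates_new_data(operand, loop):
    # r loops fetch (W, I) or generate (O) new data; ir loops reuse data in place.
    return RELEVANCE[operand][loop] == 'r'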

Input, however, has pr loops besides the r and ir loops. As presented in the
Conv 2D example of Figure 5.2, Input's dimensions IX and IY do not show up
in the convolution formula directly; instead, they are indirectly present through
OX and FX (for IX) and through OY and FY (for IY). As such, OX, FX, OY, and FY are
denoted as partially relevant (pr) loops for Input, where OX, FX (resp. OY, FY) form
a pr loop pair. The data reuse opportunities for Input that come from a pr loop pair
are less straightforward and are explained below:
The complete relation between IX and its pr loop pair OX, FX is IX = SX ·
(OX − 1) + SFX · (FX − 1) + 1, in which SX is the stride in the input feature
map IX dimension and SFX is the stride in the filter FX dimension (dilated convolution).
Input data reuse happens when the value of IX remains constant while the
computation is looping through its data space, spatially or temporally. A simple
case occurs when SX and SFX are 1: the equation becomes IX = OX + FX − 1,
in which case, for a pr loop pair, data reuse opportunities arise when the sum
of their indices (OX+FX) remains constant. ZigZag can analyze various Input
data reuse cases considering stride and filter dilation. For clarity, the rest of
the chapter will mainly focus on the simple case.

Figure 5.5: pr-loop patterns that trigger special Input data reuse, covering spatial, spatio-temporal, and temporal unrolling of the FX/OX and FY/OY pr loop pairs: the spatial patterns (e.g., FYu | OYu and FXu | OXu) enable diagonal multi-casting, while the spatio-temporal and temporal patterns enable the "FIFO effect".

Figure 5.5 provides a summary of pr-loop-pair-triggered Input data reuse. Such
pr loop pairs create alternative data reuse opportunities for spatially, temporally, or
spatio-temporally unrolled loops. For spatial unrolling, inputs can be multi-cast
diagonally in a PE array, as done in Eyeriss [30], where FY and OY are spatially
unrolled onto the 2D PE array, as in Figure 5.5(1). This is because the sum
of FY and OY remains constant along the diagonal direction. Note, however, that
things become more complicated when stride or dilation is not 1, in which case
the broadcast angle of Inputs may no longer follow a 45° line. For temporal
and spatio-temporal unrollings, data reuse is possible through a FIFO buffer
(or a FIFO-like memory reading/writing pattern) which shifts the input data
in/out over consecutive (or periodic) clock cycles. An example of this can be
found in Envision [116], where OX is spatially unrolled and FX is the innermost
temporal loop in the memory level above, as in Figure 5.5(4), making the
sum of FX and OX a constant in neighboring PE locations across consecutive
cycles, enabling the reuse of Inputs in a FIFO manner.
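A minimal sketch of this FIFO effect, with illustrative sizes (OXu = 4 PEs and FX = 3) and stride and dilation of 1, is given below; it simply enumerates which input columns each cycle needs and shows that only one new element must be fetched per cycle.

# OXu = 4 PEs unrolled spatially, FX = 3 as the innermost temporal loop,
# stride and dilation 1, so the needed input column is ix = ox + fx
# (the 0-based form of IX = OX + FX - 1).
OXu, FX = 4, 3
needed_per_cycle = [{ox + fx for ox in range(OXu)} for fx in range(FX)]
print(needed_per_cycle)       # [{0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}]

new_per_cycle = [len(cur - prev)
                 for prev, cur in zip(needed_per_cycle, needed_per_cycle[1:])]
print(new_per_cycle)          # [1, 1]: after the first cycle, only one new input
                              # element is fetched; the rest shifts through a
                              # FIFO-like buffer and is reused.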

5.5.2 Mapping information extraction

The benefit of the Loop Relevance Principle is the simplification and unification
of the procedure for extracting key information from the W/I/O mapping loop
sets towards estimating system energy and performance. To show the key ideas
of this procedure, a summary of the major equations is provided in Table 5.2 and
a detailed demonstration is given in Figure 5.6, in which the Output loop set of
Figure 5.3(d) is analyzed (similar procedure would be repeated for Weight/Input,
not shown). An in-depth discussion of each metric in Table 5.2 is provided
below. Note that the word ‘level’ in this section refers to the architectural level
introduced in Section 5.4, unless explicitly mentioned otherwise.
1.) Data Size in the individual memory unit at current level can be derived by
multiplying together the dimensionality of all the r loops at the current level
and all levels below, together with all ru loops (spatially unrolled r loop) at
all levels below. This can be seen in the first equation of Table 5.2, in which
Li means the current architectural level, L(i − 1) means one level below, and
Lmin means the lowest architectural level, usually the MAC level.
Let us apply this to a specific example given in Figure 5.6. The required Output
data storage inside the register file of each PE (16) is calculated by multiplying
the dimensionality of Level-1 r loops (the ‘K 8’ and ‘OX 2’ loops with resp.
loop index 1, 4); the Data Size of the Output inside of Global Buffer (5408) is
calculated by multiplying the dimensionality of Level-1&2 r loops (loop index 1,
4, 7) and Level-1 ru loops (loop index 6.2, 6.3). Note that in the given example,
no loop is assigned to the MAC level; thus, the MAC level (Lmin, or L-0 in

Table 5.2: Equations for mapping information extraction

Data Size @ Level i, in an individual unit:  $\prod_{Lmin}^{Li} r \cdot \prod_{Lmin}^{L(i-1)} ru$
Data Size @ Level i, in total:  $\prod_{Lmin}^{Li} r \cdot \prod_{Lmin}^{Li} ru$
MAC Operation @ Level i (supported by its data size):  $\prod_{Lmin}^{Li} r \cdot \prod_{Lmin}^{Li} ru \cdot \prod_{Lmin}^{Li} ir \cdot \prod_{Lmin}^{Li} iru$
Turnaround Cycles @ Level i (supported by its data size):  $\prod_{Lmin}^{Li} r \cdot \prod_{Lmin}^{Li} ir$
Data Reuse Factor @ Level i (total, spatial & temporal):  $\prod_{Li} ir \cdot \prod_{Li} iru$
Unit Count @ Level i (total active unit count):  $\prod_{Li}^{Lmax} ru \cdot \prod_{Li}^{Lmax} iru$
Memory Access Count @ Level i (↔ Level i+1; write access for W and I, read access for O):  $\mathrm{Total\ MAC\ Operation} \;/\; \prod_{Lmin}^{Li} \mathrm{Total\ Data\ Reuse\ Factor}$
Required Memory Bandwidth @ Level i (↔ Level i+1; write bandwidth for W/I, read bandwidth for O):
  with double-buffering:  $\mathrm{Total\ Data\ Size\ @\ Level}\ i \;/\; \mathrm{Turnaround\ Cycles\ @\ Level}\ i$
  without double-buffering:  $(\mathrm{Total\ Data\ Size\ @\ Level}\ i \;/\; \mathrm{Turnaround\ Cycles\ @\ Level}\ i) \cdot \prod_{Li} ir\_top$

Figure 5.6) has both its r and ir loop dimensionality equal to 1. For the other metrics'
calculations below, readers can always refer to the practical case in Figure 5.6.
2.) Data Size in total at current level can be easily calculated by multiplying
the individual Data Size in each memory unit with the dimensionality of all ru
loops at the current level. Notice that the unit of the Data Size is the number
of data elements. In order to obtain the number of bits, the precision of each
operand needs to be considered. Generally speaking, partial outputs have a
higher data precision than weights, inputs, and final outputs.
The ability to distinguish partial outputs from final outputs is critical for
accurate hardware cost estimation. ZigZag can easily handle this through its
r vs. ir loop representation. The final output is generated at the level of the
uppermost ir loop, e.g., the ‘C 12’ loop (index 8) in Figure 5.6, which in this
example makes the Level-2 Global Buffer the watershed for (higher precision)

partial and (lower precision) final output: the output data traffic between Level-
1 and Level-2 is bidirectional with partial output data precision (except for the few
final iterations in which final outputs are generated), whereas between Level-2 and
Level-3, it is unidirectional with final output data precision.
3.) The number of MAC Operation supported by the current level data size is
calculated by multiplying together all the loops’ dimensionality (r, ir, ru, and
iru) from the lowest architectural level up to the current level.
4.) Turnaround Cycles is the number of cycles a memory level can keep the computation running with the data it currently contains; it is an important metric for computing the required memory bandwidth. It is calculated by multiplying together all the temporal loops' dimensionality (r and ir) from the lowest level up to the current level.
5.) Total Data Reuse Factor at the current level indicates how many times each data element, after it reaches the current level and before it is overwritten by other data, is reused to support different computations. It is calculated by multiplying all the irrelevant loops' dimensionality (ir and iru) at the current level. The product of only the ir loops is the temporal data reuse factor, while the product of only the iru loops is the spatial data reuse factor.
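Continuing the Output example of Figure 5.6, a minimal check of this rule at Level-1 (the ir/iru lists below are read directly from the loop table; the structure is illustrative):

```python
from math import prod

# Level-1 register-file loops for the Output operand of Figure 5.6:
# ir loops C 2, FX 5, C 2 give temporal reuse; the iru loop FYu 5 gives spatial reuse.
level1 = {"ir": [2, 5, 2], "iru": [5]}

temporal_reuse = prod(level1["ir"])    # 20
spatial_reuse = prod(level1["iru"])    # 5
print(temporal_reuse * spatial_reuse)  # 100, matching the reuse factor reported in Figure 5.6
```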
6.) Total Active Unit Count is a metric that captures how many hardware
components are active at a certain level, which is only related to spatially
unrolled loops. It can be computed as the product of all the spatial loops’
dimensionality (ru and iru) from the current level up to the highest level
(usually the DRAM level), and it is an important metric for computing MAC
array spatial utilization.
7.) Memory Access Count, the core metric for the later memory energy estimation, is extracted by dividing the total MAC operation count of the neural network layer by the product of the total data reuse factors of the current level and all levels below it. Figure 5.7 visualizes the
individual loop’s impact on the memory access count of the case of Figure 5.3(d)
based on this approach. The circle markers indicate the boundary of the memory
levels, showing the actual number of memory accesses at each memory level
for each operand. The (1)(2)(3) points marked in Figure 5.7 (in both left and
right subfigures) correspond to the (1)(2)(3) data access arrow locations in
Figure 5.6.
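As a small numeric sanity check of this rule (all names below are illustrative, not ZigZag's API), the Output access counts of Figure 5.6 can be reproduced as follows:

```python
from math import prod

total_mac_op = 256 * 48 * 26 * 26 * 5 * 5   # 207,667,200 MAC ops for AlexNet layer 2
reuse_per_level = [1, 100, 12, 1]           # total data reuse factor of Output at L-0..L-3

def access_count(i):
    """Accesses at the boundary between level i and level i+1: total MAC ops divided by the
    cumulative data reuse of level i and all levels below it."""
    return total_mac_op // prod(reuse_per_level[: i + 1])

print(access_count(1))  # 2,076,672 accesses between register file and global buffer
print(access_count(2))  # 173,056 accesses between global buffer and DRAM (= final Output size)
```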
8.) Required Memory Bandwidth is the minimum bandwidth that ensures the computation proceeds without stalling. It depends on both the mapping and the memory settings (single-port/dual-port, with/without double buffering, etc.). Without double buffering, writing can only happen after a specific data item has been fully used, resulting in a small writing time window. With double buffering, writing can happen at any time (in parallel with the ongoing computation), leaving a large writing time window and thus lowering the required instantaneous memory bandwidth. The bandwidth difference between these two cases is the product of all the top ir loop values at each memory level.

Figure 5.6: A demonstration: extracting loop information from the Output loop set of Figure 5.3(d) based on the Loop Relevance Principle. Mapping information (AlexNet layer 2): overall loop dimensions K=256, C=48, OX=26, OY=26, FX=5, FY=5; total MAC ops = K·C·OX·OY·FX·FY = 207,667,200; active MAC units = FYu·OYu·OYu = 5·13·2 = 130; ideal total cycles = 207,667,200/130 = 1,597,440; Output size = K·OX·OY = 173,056; Output reuse = C·FX·FY = 1,200. Hardware information (architectural levels): Level-0 MAC, Level-1 inner-PE register file, Level-2 global buffer, Level-3 DRAM. Output loop set (loop indices 0-9): Level-1 holds K 8 (r), C 2 (ir), FX 5 (ir), OX 2 (r), C 2 (ir), FYu 5 (iru), OYu 13 (ru), OYu 2 (ru); Level-2 holds OX 13 (r), C 12 (ir); Level-3 holds K 32 (r). Extracted Output metrics per level (L-0 / L-1 / L-2 / L-3): data size 1 / 16 per PE (416 in total) / 5,408 / 173,056 elements; MAC operations 1 / 1,600 per PE (41,600 in total) / 6,489,600 / 207,667,200; data reuse factor 1 / 100 (spatial 5 × temporal 20) / 12 (temporal) / 1; data access count at the (1)/(2)/(3) boundaries: 207,667,200 (total MAC ops), 2,076,672 (÷100), and 173,056 (÷12, equal to the Output size) in one direction, and 207,494,144 (total MAC ops − Output size), 1,903,616, and 0 in the other; turnaround cycles 1 / 320 / 49,920 / 1,597,440; required average memory bandwidth 1/1, 16/320 per PE (416/320 in total), and 5,408/49,920 elements per cycle, where partial Outputs flow bidirectionally and final Outputs flow unidirectionally.

Figure 5.7: (Left) Visualization of the impact of the individual loops of Figure 5.3(d) on the data access count, and (right) on the energy consumed by the different memory levels. Each colored block (right) represents a memory level for a certain operand; the area of the block indicates the total energy consumed at that memory level (data access count × per-data-access energy).
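Returning to the bandwidth rule in the last row of Table 5.2, a minimal numeric sketch using the Level-2 Output numbers of Figure 5.6 (variable names are illustrative):

```python
from math import prod

# Level-2 (global buffer) Output numbers taken from Figure 5.6.
total_data_size = 5408      # elements held across the level
turnaround_cycles = 49920   # cycles the level can keep computing on that data
top_ir_loops = [12]         # ir loop(s) at the top of this level (the 'C 12' loop)

bw_with_double_buffering = total_data_size / turnaround_cycles
bw_without_double_buffering = bw_with_double_buffering * prod(top_ir_loops)

print(round(bw_with_double_buffering, 3))     # ~0.108 element/cycle
print(round(bw_without_double_buffering, 3))  # ~1.3 element/cycle: a 12x tighter requirement
```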
Note that, due to the pr loops, some changes are needed to handle the Input operand correctly. The most important modifications are the following two substitutions. The first one correctly handles the Input data size (assuming a stride of 1):

$$\prod_{L_{min}}^{L_i} r \;\rightarrow\; \prod_{L_{min}}^{L_i} r \cdot \Big(\prod_{L_{min}}^{L_i} pr_1 + \prod_{L_{min}}^{L_i} pr_1' - 1\Big) \cdot \Big(\prod_{L_{min}}^{L_i} pr_2 + \prod_{L_{min}}^{L_i} pr_2' - 1\Big)$$

in which $pr_1$ ($pr_2$) and $pr_1'$ ($pr_2'$) form a pr loop pair, like OX and FX. The second substitution correctly handles special Input data reuse cases like the "diagonal multi-cast" and the "FIFO effect":

$$\text{Total Data Reuse Factor @ } L_i \;\rightarrow\; \frac{\text{Total MAC Op @ } L_i(+pr)}{\text{Total Data Size @ } L_i(+pr)}$$

For example, in the "FIFO effect" setting with an FX 3 loop placed directly above a spatially unrolled OXu 4 loop (the horizontal line between them being the architectural level boundary, as in Figure 5.5), the lower-level Input data reuse factor should equal $\frac{(4\times3)\ \text{MAC Op}}{(4+3-1)\ \text{data}} = 2$ instead of $\frac{(4\times1)\ \text{MAC Op}}{(4+1-1)\ \text{data}} = 1$ when the "FIFO effect"-triggering pr loop (FX 3) is taken into account. This means that, on average, every input element fetched from the level above can support two MAC operations in the level below.
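A minimal sketch of these two Input-specific substitutions, assuming a stride of 1 (the helper names are hypothetical):

```python
# Hypothetical helpers sketching the two Input-specific substitutions (stride 1 assumed).

def input_data_size(r_prod, ox_prod, fx_prod, oy_prod, fy_prod):
    """Data-size substitution: each pr loop pair, e.g. (OX, FX), combines as (OX + FX - 1)."""
    return r_prod * (ox_prod + fx_prod - 1) * (oy_prod + fy_prod - 1)

def input_reuse_factor(mac_ops_incl_pr, data_size_incl_pr):
    """Reuse substitution covering diagonal multi-cast and the FIFO effect."""
    return mac_ops_incl_pr / data_size_incl_pr

# The "FIFO effect" example from the text: an FX 3 loop placed above a spatially unrolled OXu 4.
print(input_reuse_factor(4 * 3, 4 + 3 - 1))  # 2.0: each fetched input feeds two MACs on average
print(input_data_size(1, 4, 3, 1, 1))        # 6 input elements cover the 4x3 MAC operations
```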

5.5.3 Hardware cost integrator

The hardware cost integrator combines the previously extracted technology-independent loop information with technology-dependent characteristics to estimate the final hardware cost and performance. It is worth pointing out the big advantage of decoupling the technology-independent and technology-dependent information extraction steps in the cost model: it allows ZigZag to be easily extended to support different technologies and design paradigms, such as analog and digital in-memory computing, by updating only the technology-dependent part, without modifying the underlying loop analysis logic, as will be shown in Section 5.10.2.
1.) Area: Area estimation is straightforward by summing up all the on-chip
memory and datapath area. ZigZag also provides an active/dark silicon area
report based on spatial mapping and memory unrolling.
2.) Energy: MAC computation energy and memory access energy are taken
into account in ZigZag. MAC computation energy is divided into two parts:
the active MAC energy and the idle MAC energy. Both are estimated by
first calculating the total active MAC count (determined by the workload)
and idle MAC count (determined by the PE array’s under-utilization) and
subsequently multiplying the corresponding number of MAC operations with
the corresponding averaged single-MAC-operation energy (active and idle).
Memory access energy is calculated by multiplying the memory access count
(computed previously) with the corresponding memory per-data-access energy,
taking into account the memory size, the potential memory bitwidth mismatch
overhead, operand precision, and data stationarity.
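A hedged sketch of how these two energy contributions combine; every number and name below is a placeholder, not a calibrated value:

```python
# A hedged sketch of the two energy contributions; all numbers below are placeholders.

def total_energy(active_macs, idle_macs, e_mac_active, e_mac_idle, access_counts, e_per_access):
    """MAC energy (active + idle counts times their average per-op energy) plus memory energy
    (per level/operand: access count times per-data-access energy)."""
    mac_energy = active_macs * e_mac_active + idle_macs * e_mac_idle
    mem_energy = sum(access_counts[k] * e_per_access[k] for k in access_counts)
    return mac_energy + mem_energy

energy = total_energy(
    active_macs=207_667_200, idle_macs=5_000_000,        # idle count from PE under-utilization
    e_mac_active=1.0, e_mac_idle=0.05,                   # normalized to one active MAC op
    access_counts={("O", "L1"): 2_076_672, ("O", "L2"): 173_056},
    e_per_access={("O", "L1"): 2.0, ("O", "L2"): 50.0},  # per-access costs, e.g. from CACTI
)
print(energy)
```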
3.) Latency/Throughput: PE array utilization, throughput, and latency are
tightly related and can be deduced from each other. A PE array’s under-
utilization comes from two sources: spatial under-utilization and temporal
under-utilization. Spatial under-utilization results from the mismatch between
the spatial mapping and MAC array size and interconnection. Temporal
under-utilization mainly comes from memory port contention and bandwidth
bottlenecks during computation.
As the latency model involves sophisticated calculations and requires a lengthy explanation, we defer it to the next chapter: Chapter 6 will thoroughly explain ZigZag's latency modeling approach.
Besides estimating latency, another feature of ZigZag is its ability to detect
the minimal required memory size, named "effective memory size", for each
memory level when executing a network layer. The memory part that exceeds
the "effective memory size" can theoretically be gated to save power.

5.6 Mapping Search Engines

Mapping search engines generate valid spatial and temporal mappings to be fed into the cost estimator. Given that the mapping space is extremely large, with millions of even/uneven mapping possibilities, exhaustively evaluating every single mapping point to locate the optimal one would be very time-consuming. Note that this holds for mapping just one neural network layer onto one fixed architecture, let alone exploring hundreds or thousands of different neural network and hardware architecture combinations. To tackle this, ZigZag provides, besides an exhaustive search, novel search methods that efficiently prune away sub-optimal mappings, reducing the number of mappings to be evaluated by orders of magnitude with little or no optimality loss across the different search methods.

5.6.1 Spatial mapping search engine

Spatial mapping defines the parallel operations and operands’ data flow in
the spatial dimensions across the PE array. Depending on the mapping, a
considerable amount of data reuse of each operand can be obtained spatially,
which results in a reduced number of accesses to the memory level outside the
array. In order to efficiently explore the spatial mapping space, three search
methods have been developed: exhaustive search, heuristic search v1 based on
data reuse symmetry pruning, and heuristic search v2 based on the maximum
data reuse Pareto surface.
The exhaustive search generates all valid spatial mappings above a user-defined
spatial utilization threshold, based on the layer dimensions and PE array size.
It is capable of unrolling multiple loop dimensions along each PE array dimension; e.g., unrolling OY|(FY|K) 16|(3|4) indicates unrolling the OY loop 16 times along one PE array dimension and unrolling FY and K 3 and 4 times (together activating 12 PE units) along the other PE array dimension.
While the exhaustive search can generate tens of thousands of valid mappings, adopting the proposed heuristic v1/v2 methods reduces the number of spatial mappings to be evaluated by 3-69×, depending on the layer characteristics, according to our experiments, as depicted in Figure 5.8. Neither heuristic search v1 nor v2 introduces any optimality loss in terms of energy or latency.

Figure 5.8: Comparison between different spatial mapping search methods (exhaustive, heuristic v1, heuristic v2) on 4 different neural networks (AlexNet, VGG16, ResNet34, MobileNetV2). Both heuristic v1 and v2 find the same globally optimal spatial mapping points as the exhaustive search does.

Heuristic search v1 based on data reuse symmetry pruning

Before any mapping selection, a neural network layer inherently has a specific
maximum data reuse factor for each of its 3 operands, depending on the layer
shape and size. A specific spatial mapping exploits, for each operand, part of
these reuse opportunities in the spatial dimension. The residual data reuse
opportunities remain available for temporal mapping. Yet, multiple spatial
mappings can result in an equal data reuse distribution between the spatial
and temporal domains for each operand. As a result, their temporal mapping
opportunities are identical. Such a group with equivalent or symmetrical spatial-
temporal data reuse distribution should only be considered once for further
temporal mapping search.
For example, for a neural network layer with OX=OY and FX=FY, unrolling OX|FX 7|3 and unrolling OY|FY 7|3 are equivalent (or symmetrical), and thus one of them can be skipped in the subsequent temporal mapping search. For a neural network layer with OX≠OY or FX≠FY, however, unrolling OX|FX 7|3 and unrolling OY|FY 7|3 are no longer equivalent, since they now play different roles in the overall data reuse distribution, and thus both should be considered.

Heuristic search v2 based on the data reuse Pareto surface

Even after the heuristic v1 search, the temporal mapping search engine still has to generate, in the next step, valid temporal mappings for each non-symmetrical spatial mapping found.
In order to further prune away sub-optimal spatial mappings, a Pareto surface
of the spatial data reuse for each spatial mapping is identified in the operand
space (W/I/O): only those spatial mappings which demonstrate Pareto-optimal
data reuse along weight/input/output are processed for further evaluation.
Those mappings that are not on this Pareto surface correspond to dataflows that
would require a large amount of data reuse along at least one of the operands to
be handled temporally, which in turn would correspond to a larger amount of
memory accesses to the upper levels in the memory hierarchy. As a consequence,
they can be safely ignored without losing optimality.
To illustrate this better, consider an example. Assume that for a pointwise Conv layer (i.e., FX=FY=1, so no pr loop effects need to be considered), four spatial mappings remain after the heuristic v1 search: (1) C|K 16|8, (2) K|C 12|8, (3) (B|C)|K (8|2)|8, and (4) OX|K 8|8. With Figure 5.4, we can easily derive these mappings' data reuse factors (for each operand): (1) {W: 1, I: 8, O: 16}, (2) {W: 1, I: 12, O: 8}, (3) {W: 8, I: 8, O: 2}, and (4) {W: 8, I: 8, O: 1}. Based on this data reuse list, heuristic v2 prunes away spatial mapping (4), as it is surpassed on all fronts by (3). The remaining three mappings (1)(2)(3) are kept, as they all lie on the data reuse Pareto surface.
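A small sketch of this Pareto pruning step on the worked example above (the dominance helper is hypothetical):

```python
# Spatial data reuse per operand for the four candidate mappings of the example above.
mappings = {
    "C|K 16|8":        {"W": 1, "I": 8,  "O": 16},
    "K|C 12|8":        {"W": 1, "I": 12, "O": 8},
    "(B|C)|K (8|2)|8": {"W": 8, "I": 8,  "O": 2},
    "OX|K 8|8":        {"W": 8, "I": 8,  "O": 1},
}

def dominated(a, b):
    """True if mapping b reuses every operand at least as much as a, and strictly more for one."""
    return all(b[k] >= a[k] for k in a) and any(b[k] > a[k] for k in a)

pareto = [name for name, r in mappings.items()
          if not any(dominated(r, other) for n, other in mappings.items() if n != name)]
print(pareto)  # 'OX|K 8|8' is pruned; the other three lie on the data reuse Pareto surface
```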
Given that each spatial mapping is then individually sent to the temporal
mapping search engine, the speed-up in the overall search procedure is roughly
proportional to the number of spatial mappings pruned away.

5.6.2 Temporal mapping search engine

In the temporal domain, each neural network layer is expressed as three sets
of nested for-loops (for W/I/O resp.) that determine the order of the overall
MAC execution. These three sets of for-loops must follow the same loop order
but can be differently distributed across the memory hierarchy (each loop is
mapped to a memory level), as shown in the previous Figure 5.3(d).
By adopting the enhanced Memory-Centric Design Space Representation and
using the concept of virtual memory level, the proposed temporal mapping
search engine efficiently supports producing even and uneven mapping schemes
on balanced and unbalanced memory hierarchies that present shared and/or
non-shared memory levels between different operands. Note that ZigZag has
significantly enlarged the mapping design space compared to all the previous
works, which enables users to obtain better design points, as will be shown in
Section 5.8. The smart search methods reduce the number of required mapping evaluations by several orders of magnitude with respect to an exhaustive search, as illustrated in Figure 5.9, with at most a 5% loss in optimality.

Exhaustive search

The exhaustive search consists of two steps, shown in Figure 5.10(left): 1) the
loop blocking step, in which all valid loop combinations are assigned to the
memory levels, and 2) the loop ordering step, in which all valid permutations
of the assigned loops within each level are generated. In order to explore all
possible schedules, we treat loop prime factors (LPFs) as the smallest units into which a loop can be split. These LPFs are the result of the prime factorization of each layer dimension (i.e., B, C, K, ...) and form the basic building blocks of the search algorithm.
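A minimal sketch of splitting layer dimensions into LPFs, applied to the AlexNet layer-2 dimensions of Figure 5.6 (the helper name is illustrative):

```python
# Illustrative helper: split each layer dimension into its loop prime factors (LPFs).
def lpfs(n):
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

layer = {"K": 256, "C": 48, "OX": 26, "OY": 26, "FX": 5, "FY": 5}  # AlexNet layer 2
print({dim: lpfs(size) for dim, size in layer.items()})
# K -> [2]*8, C -> [2, 2, 2, 2, 3], OX/OY -> [2, 13], FX/FY -> [5]
```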
In this procedure, an important concept, the virtual memory level, is used to help the mapping algorithm find valid loop-blocking schemes and guarantee their validity in the subsequent loop-ordering step. The virtual memory levels provide artificial boundaries that allow gradually and uniformly fitting the data of different operands into the actual (physical) memory levels in an uneven manner, as indicated by the gray wavy lines in Figure 5.10 (left). Then, after the loop blocking, the loop ordering is performed within each virtual memory level. These two steps are explained in detail below.

Figure 5.9: Comparison between different temporal mapping search methods (exhaustive, heuristic v1, heuristic v2, and iterative search with partial/final results) with 4 different neural networks on (row 1) the number of temporal mappings evaluated, (row 2) elapsed time in CPU hours, (row 3) peak CPU memory usage, and (row 4) the minimal mapping energy found. This experiment is carried out on an Intel Xeon Gold 6126 CPU.
The loop-blocking step consists of exhaustively exploring how temporal loops
can be assigned in the memory hierarchy. For each virtual memory level, the
core procedure of loop blocking includes four recursive steps: generate, calculate,
assign, and update.

• First, generate all possible LPF combinations from the unassigned LPFs;
• Second, calculate for each operand which of these combinations fit in the
virtual memory level;

• Third, assign the fitting LPF combinations to the virtual memory level
with each assignment case as a new loop blocking basis;

• Fourth, update the virtual memory level based on the rule described below.

The virtual memory level at which the LPF combinations are allocated is
initially set as the lowest level in the hierarchy (i.e., the one close to MAC).
After each assignment, the virtual memory level to which the combinations will
be assigned in the next iteration is updated. The update procedure checks for
each operand if, considering the remaining LPFs to be assigned, the current
physical memory level can still fit any additional loop assignment; if not, then
the virtual memory level for the next assignment iteration (for that operand)
will become the following upper physical memory level in the hierarchy, as
visualized in Figure 5.10(left). The algorithm will continue these actions until
all LPFs are assigned to virtual memory levels.
After the loop assignment, the set of valid mapping schemes is fed into the loop
ordering step. In this step, within each virtual memory level, all permutations of
the loops are considered, and each becomes a different valid temporal mapping.
The number of mapping schemes generated in this way grows exponentially
with the number of LPFs and the number of virtual memory levels. To reduce
this mapping explosion, smarter search methods are provided.

Heuristic search v1 and v2

In order to prune away sub-optimal mappings before the cost evaluation stage,
two heuristic principles are applied: data stationarity maximization (heuristic
search v1) and data reuse pruning (heuristic search v2).
Heuristic search v1 is applied at the loop ordering stage, as depicted in
Figure 5.10 (right) Loop Ordering and Figure 5.11 column 2: for each
virtual memory level, instead of doing complete loop permutations, only those
permutations that maximize data stationarity for each operand at the below
memory level are generated, which, in the loop format, is equivalent to putting
all the ir loops of each operand close to the lower level boundary. By doing so,
the number of valid temporal mappings to be evaluated drops by an order of
magnitude, as shown in Figure 5.9, without losing optimality.
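A minimal sketch of this ordering rule for the Output operand within one virtual memory level; the loop relevance is hard-coded here purely for illustration:

```python
# Loops assigned to one virtual memory level, and the dims that are irrelevant (ir) for Output.
level_loops = [("K", 8), ("C", 2), ("FX", 5), ("OX", 2)]
output_ir_dims = {"C", "FX", "FY"}

def maximize_output_stationarity(loops):
    """Return the loops ordered from innermost to outermost, with all Output-irrelevant (ir)
    loops placed innermost, i.e. closest to the lower-level boundary."""
    ir = [l for l in loops if l[0] in output_ir_dims]
    r = [l for l in loops if l[0] not in output_ir_dims]
    return ir + r

print(maximize_output_stationarity(level_loops))
# [('C', 2), ('FX', 5), ('K', 8), ('OX', 2)]: the partial sum below is accumulated 10x in a row
```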
Figure 5.10: Loop blocking assignment (left); loop ordering based on data stationarity maximization (right). As the number of virtual memory levels and the number of nested loops per virtual memory level increase, the number of possible loop orderings grows exponentially; applying the data stationarity optimization reduces the number of loop orders significantly. Orderings (c2) and (c1) are equivalent for Input, since C and OY are both relevant loops that do not change the Input stationarity at the memory level below, which comes from K. Ordering (d2) is sub-optimal compared to (d1), since for Output, OY is a relevant loop and FX is an irrelevant loop; swapping them breaks the Output stationarity loop chain formed by FX-FY-C.

Figure 5.11: Comparison of the temporal mapping search engine algorithms. While the exhaustive search generates all valid mapping schemes (∼1-100s of millions), heuristics are required to prune away sub-optimal mappings. Heuristic search v1 prunes mappings at the loop-ordering stage, and heuristic search v2 prunes at both the loop-blocking and loop-ordering stages. The iterative search prunes at each loop assignment iteration, besides applying the previous heuristics.

On top of heuristic search v1, heuristic search v2 is applied after the LPF assignment and before the loop ordering optimization, as depicted in Figure 5.11, column 3: the data reuse factor for each operand at each level in the memory hierarchy can already be extracted at this point, since the loop types and their sizes are known (even though their ordering at each level is still unknown). If a particular combination of LPFs causes the data reuse factor to be 1 at a specific
level for one or more operand(s) (excluding the top memory level, e.g., DRAM), it indicates that at that memory level every stored data element of that operand is accessed only once (i.e., no data reuse); therefore, having or not having that memory level does not reduce the number of accesses to the higher memory levels in the hierarchy, which makes it a suboptimal scenario.
It is important to note that this rule does not hold for Input data, as even for
certain mapping schemes with data reuse equal to 1 at a memory level, this
level may exhibit the FIFO effect opportunity (cross-level data reuse), explained
in Section 5.5.1, and this mapping may still be an optimal mapping.
This second pruning step further reduces the number of temporal mappings to be evaluated by up to 40%, as seen in the results reported in Figure 5.9, typically without introducing optimality loss. However, we do occasionally observe that the data reuse pruning method introduces sub-optimality when the memory hierarchy is over-dimensioned for mapping a tiny neural network layer or a layer with very few LPFs, such that some memory levels never offer data reuse opportunities, no matter how the loop assignment is done.

Iterative search based on early-stage cost evaluation

Instead of first generating an exhaustive list of loop-blocking schemes and then pruning away the sub-optimal ones, the last proposed search strategy explores the mapping space in an iterative way. Starting from the lowest level
in the hierarchy, an iteration step consists of finding the set of LPFs (that
constitute a virtual level) that causes the largest amount of energy savings. The
saving value is derived by estimating the energy of the partial mapping on the
whole memory hierarchy. After each iteration, the best loop assignment and
loop order found for the current virtual memory level is stacked upon those
previously found, as portrayed in Figure 5.11, last column.
The iterative search requires much less CPU time and memory than the other three search methods, as shown in Figure 5.9. But since this search method analyzes partial schemes and ignores the influence of the upper levels in the hierarchy when making decisions at a lower level, it might reach a sub-optimal point (within 5% more energy). Given these pros and cons, the iterative search remains an attractive trade-off option for many DSEs.
Besides the previously discussed four temporal mapping search engines
(exhaustive, heuristic v1/v2, iterative), which were equipped in the initial
ZigZag version, more advanced search approaches have been developed over the
years, and will be introduced in Section 5.10.1.

5.7 Architecture Generator

While the two mapping search engines are able to efficiently scan the mapping
space for optimal points, the architecture on which the mapping is to be applied
may not be the optimal one. For example, having bigger register files within each PE might be beneficial thanks to higher data reuse close to the MAC level, but it might limit the size of the buffer levels outside the array because of the limited on-chip area, and therefore limit the reduction of off-chip accesses. As a consequence, depending on the neural network workload, the memory size, the cost of data access, the word length per access, etc., an optimal point exists in the architecture space as well.
In order to also explore different hardware architectures, the Architecture Generator can exhaustively generate all architectures that fit within a user-defined area constraint, as shown in Figure 5.12, as well as other user-defined design guidelines, such as requiring at least a size ratio of 8 between consecutive memory levels. To carry out this task, it first draws different combinations of memories from a memory pool (considering memory unrolling, i.e., a small memory attached to each MAC unit or to a group of MAC units), and if a combination fits within the area constraint and meets the design guidelines, it then proceeds to assign operand(s) to each memory instance, taking operand memory sharing into account.
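A hedged sketch of this enumeration idea; the memory pool, area numbers, and size-ratio rule below are made-up placeholders, not ZigZag's built-in values:

```python
from itertools import product

# Hypothetical memory pool and cost numbers; the structure, not the values, is the point.
local_pool = [8, 64, 256]                        # bytes, unrolled with each PE
shared_pool = [None, 8_192, 65_536, 524_288]     # bytes, single shared on-chip level or bypass
area_per_byte = {8: 0.004, 64: 0.003, 256: 0.002, 8_192: 0.001, 65_536: 0.0008, 524_288: 0.0006}
num_pes = 168
area_budget = 2.0                                # arbitrary area unit

def area(local, shared):
    a = num_pes * local * area_per_byte[local]
    if shared is not None:
        a += shared * area_per_byte[shared]
    return a / 1e3

valid = [(l, s) for l, s in product(local_pool, shared_pool)
         if area(l, s) <= area_budget
         and (s is None or s >= 8 * l * num_pes)]   # design guideline: size ratio >= 8
print(len(valid), "candidate hierarchies survive the constraints")
```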
The output of this stage is a list of valid hardware architectures, each with a
different memory hierarchy, which is sequentially fed to the mapping search
engines to find the optimal spatial and temporal mappings. When all hierarchies
are analyzed, the optimal memory hierarchy and its optimal mapping for energy
and latency will be identified.
ZigZag allows running the architecture exploration for a single neural network
layer, a complete neural network, as well as multiple complete neural networks,
as will be demonstrated in Section 5.9 case studies.

Figure 5.12: Memory hierarchy generator overview: starting from a MAC array scheme and a memory pool (memories of different sizes, or of the same size with different bitwidths), valid memory hierarchies are automatically generated under an area constraint; the depicted architecture is one example out of thousands of valid ones.



5.8 Validation

The hardware cost model and the mapping search engines are validated with
three methodologies: 1) against measured results of published chips; 2) against
in-house post-synthesis extracted energy and performance data; 3) against other
DNN accelerator DSE frameworks.
Firstly, we model the mappings and hardware architectures of both Eyeriss [30]
and ENVISION [116] and compare the ZigZag cost model estimated energy
with their reported values, as depicted in Figure 5.13. The resulting energy
values, normalized with respect to the cost of a single MAC, are shown for full
precision operation without voltage scaling or sparsity reduction. The estimated
values are within an acceptable 5%, resp. 7.5% error margin. Secondly, to
also validate the throughput model, validation is performed against a complete
in-house accelerator post-synthesis simulation. The results in Figure 5.14 show
a maximum error of 6% of energy and 4% of PE array utilization. Finally, the
validation of the cost model as well as the mapping search engines is carried
out against a SotA DSE framework, Timeloop [126] + Accelergy [177].
ZigZag and Timeloop are compared on Eyeriss architecture with AlexNet [90]
and ResNet-34 [57], and the validation results are shown in Figure 5.15. The
experiment is carried out in three steps. Step 1, we let Timeloop do a free search
to find its optimal spatial and temporal mappings for each layer under test
(Figure 5.15 all the left bars in each group of three, labeled "TL"). Step 2, all the
optimal mappings found by Timeloop are validated in ZigZag, demonstrating
that the two cost models match well with an average error <5% (the middle
bars, labeled "TL/ZZ"). In step 3, we let ZigZag do a free search to find the

1e10 1e9
2.00 MAC MAC
DRAM 1.4 Buffer
1.75 Buffer RF
RF 1.2
1.50
1.0
Normalized Energy

Normalized Energy

1.25
0.8
1.00
0.75 0.6

0.50 0.4

0.25 0.2

0.00 0.0
ION

ION
ss

ION

ION
ss

ION
s

s
ss

l
l

l
l
el

Ey l

Ey l
Ey l

de
de
ris

ris

de

de

de
de
e

de
de
eri

eri
eri
d

VIS

VIS

VIS

VIS
VIS
e

Mo
Mo
Mo

Mo

Mo

Mo
Mo

Mo

Mo
Mo

Ey

Ey

CONV1 CONV2 CONV3 CONV4 CONV5 CONV1 CONV2 CONV3 CONV4 CONV5
EN

EN

EN

EN
EN

Figure 5.13: Cost model validation of AlexNet [90] Conv layers on Eyeriss [30]
(left) and ENVISION [116] (right).
VALIDATION 159

MAC+Oreg Model
2.5 Activ Mem Post-syn
Weig Mem 0.8
Input Buf
2.0

PE array utilization
0.6
Energy (uJ)

1.5
0.4
1.0

0.5 0.2

0.0 0.0 CONV1 CONV2 CONV3 FC4 FC5


syn

syn

syn

syn

syn
Po el
Po el

el
Po el

Po l
e
d
d

d
d

d
st-

st-

st-

st-
st-
Mo
Mo

Mo
Mo
Mo

Po
CONV1 CONV2 CONV3 FC4 FC5

Figure 5.14: Cost model validation against an in-house accelerator’s post-


synthesis results with a voice recognition workload on energy (left) and PE array
utilization (right).

AlexNet

Memory Energy Saving (%)


64% (Excluding MAC)

18%

ResNet-34 (All the unique layers)

35%
55%
62% 60%
45%

54%
37%

Figure 5.15: Cost model and mapping search engines validation against
Timeloop [126]+Accelergy [177] on AlexNet [90] (left) and ResNet34 [57] (right).
160 ZIGZAG: ENABLING FAST DNN ACCELERATOR-MAPPING DESIGN SPACE EXPLORATION
THROUGH ANALYTICAL MODELING

optimal mappings based on the spatial/temporal heuristic search v2 methods


(the right bars, labeled "ZZ"). Theoretically, there should be one last step to
verify the optimal mappings found by ZigZag within the Timeloop framework,
but since the mapping representation space of Timeloop does not support
uneven mapping, it is not possible to carry out this reverse verification.
From the results of Figure 5.15 we can draw two major conclusions: 1) ZigZag’s
cost model and Timeloop’s cost model (Accelergy) match well (comparing
the left and middle bars); 2) ZigZag’s mapping search engine outperforms
Timeloop’s, because of the uneven mapping support (comparing the middle and
right bars). Up to 64% of memory energy savings (DRAM + Buffer + RF) can
be achieved by these improved mapping schemes.

5.9 Case Studies

To show the strength of ZigZag and use it to extract insights from the vast design
space, three different case studies are conducted. In these studies, all memory
access energies are extracted from CACTI7 [12] in 65nm CMOS technology,
assuming 8-bit fixed point precision for Weight/Input/final-Output, 16-bit fixed
point precision for partial Output.

5.9.1 Case study 1: Impact of even / uneven mapping

Case study 1 focuses on answering two questions: 1) How big is the impact
of temporal and spatial mapping on energy and throughput when fixing both
the neural network workload and the hardware architecture (memory hierarchy
and PE array size)?; 2) How large is the benefit of uneven mapping over even
mapping in terms of energy and performance?
In the experiment, the neural network workload and hardware architecture
are fixed to AlexNet Conv layer 2, and an Eyeriss-like architecture with a
memory bandwidth assumption of 16-bit/cycle for the inner-PE register file
and 64-bit/cycle for the global buffer.
This experiment consists of two steps. Firstly, the Spatial Mapping Search
Engine finds 16 spatial mappings by means of the heuristic v2 method explained
in Section 5.6.1. Secondly, for each spatial mapping found, the Temporal
Mapping Search Engine searches for all Pareto-optimal schedules considering
energy and PE array utilization (throughput). The results are shown in
Figure 5.16 and 5.17.

Figure 5.16: Even/uneven temporal mapping's impact on energy and PE array utilization (throughput) for different spatial mappings: OYu|FYu|Ku 13|5|2 (left) and OYu|OYu|Cu 13|2|6 (right).

Figure 5.17: The collection of Pareto-optimal temporal mappings (PE array utilization versus energy) of all 16 different spatial mappings, for even and uneven mapping.

In Figure 5.16, two temporal mapping spaces, each for a different spatial mapping, are shown. Let us first analyze the temporal mapping's impact for even and uneven schemes separately. In the left/right figure, for even mappings,
in total 73184/26944 points are found, within which up to 4.8×/3.3× energy
variance and 3.2×/2.1× utilization variance are observed; for uneven mappings,
in total 1296364/146940 points are found, up to 10.3×/9.7× energy variance and
7×/42× utilization variance are observed. It is clear that the temporal mapping
space for an uneven scheme is much larger than for an even scheme. This results
in significantly lower energy solutions found for uneven mappings, with up to
30%/28.6% lower total energy consumption compared to even mappings. In
terms of PE array utilization, both the even and uneven mapping find the same
optimal point in the left figure, while in the right figure, the uneven mapping
achieves 27% higher PE array utilization, and hence 27% higher throughput.
Figure 5.17 is the collection of the Pareto-optimal temporal mapping points
(in terms of energy and utilization) for all 16 spatial mappings. These results
further confirm the ability of uneven mappings to locate better design points
compared to even mappings. Here, an overall gain of up to 32.6% in terms of
energy consumption or 12% in terms of throughput is achieved.

5.9.2 Case study 2: Memory hierarchy search

Knowing that ZigZag can help designers to find the optimal spatial and temporal
mapping points for a user-defined hardware architecture and neural network
workload, the next challenge is determining the optimal hardware architecture
given a neural network workload. This should not just be done for a single
neural network layer, but for a network consisting of multiple layers. Case
study 2 demonstrates this ability within ZigZag.

In this case study, all architectures are equipped with an off-chip DRAM and a
MAC array of size 12 × 14 (same size as Eyeriss [30]). ZigZag will search for
the best on-chip memory hierarchy to process multiple neural network layers
of DarkNet19 [132]. The available memory modules in the memory pool are 8
Byte, 64 Byte, 256 Byte, 8KB, 64KB, and 512KB, in which the first 3 are a
group that unrolled together with each MAC unit and the last 3 are a group
without memory unrolling. ZigZag picks for each operand (W/I/O) one memory
from the first memory group and picks one or zero from the second memory
group (i.e., we simulate single-level and dual-level on-chip memory hierarchy).
Figure 5.18 summarizes the complete results. In total, 240 different memory
hierarchies are found and evaluated. All four figures share the same X-axis,
which is the memory hierarchy index, from 0 to 239. The top three figures are
the visualization of all memory hierarchies from each operand's perspective (e.g., a single-level/dual-level on-chip memory hierarchy corresponds to resp. one/two dot(s) along the vertical direction, for each operand). The bottom figure
shows the resulting total energy consumption for executing all 8 non-repetitive
DarkNet19 layers on each architecture, as well as its area.
Figure 5.18: Memory hierarchy search for multiple layers of DarkNet19 [132]. The top three panels show the W/I/O memory sizes (8 B to 512 KB) of each of the 240 evaluated hierarchies; the bottom panel shows, per memory hierarchy index, the energy of the 8 non-repetitive layers (L1-L4, L6, L7, L9, L10) and the total area [mm2].

Three main observations can be summarized: 1) generally speaking, the low-energy architectures occupy a large area, holding the largest on-chip memory (512KB), which largely facilitates on-chip data reuse and reduces the external data transfer from DRAM; 2) when the size of the largest on-chip memory is fixed, the memory size of the inner-PE register file does not influence the overall energy much, as three energy regions can clearly be distinguished, corresponding to an 8KB, 64KB, and 512KB upper memory; 3) the trade-off between energy and area is quantified, such that designers can make clear decisions, such as giving up 5% of energy to gain 40% of area savings (comparing the 64KB and 512KB designs).

5.9.3 Case study 3: DNN workload comparison

One step beyond the single-DNN study of case study 2 is the application of ZigZag to explore hardware implications across a wide variety of neural networks. Case study 3 aims to extract the optimal mappings' energy, resp. performance,
for 12 different DNNs executing on 720 different accelerator architectures. For
this, 12 popular DNNs targeting the ImageNet dataset [42] have been selected,
with their characteristics summarized in Table 5.4. The 720 selected hardware
architectures of this study all have the same PE array size (14×16) and the
same spatial unrolling (OXu|Ku 14|16), but they are different in the memory
hierarchy. As listed in Table 5.3, these 720 memory hierarchies vary in memory
size (Mem. Size Option), number of memory levels (Mem. Bypass Option), and
memory sharing among W/I/O operands (Mem. Share Option).
For each DNN workload-accelerator pair, ZigZag searches the lowest energy,
resp. lowest-latency mapping. This is performed by running ZigZag’s temporal
search engine for each layer in a particular DNN and accumulating energy,
resp. latency values. Thus, this results in 2 dots for each workload-accelerator
combination: one for the minimal latency and one for the minimal energy,
plotted in Figure 5.19. Throughout the study, greedy spatial mapping1 is
applied to all layers in order to maximize spatial PE array utilization. Heuristic
search v1 is adopted for the ZigZag temporal mapping search engine.
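A minimal sketch of the greedy-mapping arithmetic detailed in footnote 1 (helper names are hypothetical):

```python
from math import ceil

def greedy_utilization(loop_size, array_dim):
    """Spatial utilization when greedily folding a loop of `loop_size` onto a 1-D array."""
    iterations = ceil(loop_size / array_dim)
    return loop_size / (iterations * array_dim)

def fixed_split_utilization(split, array_dim, iterations):
    """Utilization when each iteration maps the same prime-factor-based split (e.g. 5-way)."""
    return (split * iterations) / (array_dim * iterations)

print(fixed_split_utilization(5, 8, 4))     # 0.625 -> 62.5%, as in footnote 1
print(round(greedy_utilization(20, 8), 3))  # 0.833 -> 83.3% with the greedy mapper
```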
The resulting Figure 5.19 reveals the global trade-off between workload accuracy, energy efficiency, latency, and hardware area across these 12 popular DNNs, with their Pareto-optimal energy and latency achievements summarized in Table 5.4. An analysis of Figure 5.19 and Table 5.4 allows researchers to quickly select specific DNNs based on accuracy, energy, latency, or area constraints, or a combination thereof.

1 Greedy spatial mapping is applied to handle the case in which the layer size is not a perfect fit to the array dimensions. It maximally uses the PE array's spatial dimension for all loop iterations except the last one. For example, if we unroll a loop of dimension 20 onto a 1-D array of size 8, a mapper based on prime factorization would run 4 iterations of a 5-way spatial mapping, resulting in a spatial utilization of (5+5+5+5)/(8+8+8+8) = 62.5%. The greedy mapper, on the other hand, proposes a solution that runs 2 iterations with 8-way spatial parallelism, while the last iteration exploits the array 4-way, giving a spatial utilization of (8+8+4)/(8+8+8) = 83.3%.

Table 5.3: Memory hierarchy options for case study 3

Arch. level: Inner-PE Reg | On-chip L1 | On-chip L2 | Off-chip
Mem. size option: 2 B; 32 B; 128 B | 8 KB; 32 KB | 0.5 MB; 2 MB | DRAM
Mem. bandwidth: 16 bit/cycle (read/write) for the inner-PE register; 128 bit/cycle (read/write) for the other levels
Mem. share option (i.e., 1/2/3 operand(s) share the same memory): all separate | all separate, two shared, or all shared | all shared | all shared
Mem. bypass option: no bypass | can bypass | can bypass | no bypass
Using the DNNs' accuracy, the obtained Pareto-optimal solutions, and the rank order ‘(#)’ in Table 5.4, we can divide these 12 DNNs into three groups: 1) DNNs whose order
of accuracy is smaller than the order of best energy and latency (i.e., achieve
relatively high accuracy with relatively low hardware cost) are promising for
embedded systems (colored green); 2) DNNs whose order of accuracy is equal to
the order of best energy and latency (i.e., the accuracy achieved lives up to the
hardware cost) are good for some applications (colored yellow), and 3) DNNs
whose order of accuracy is larger than the order of best energy and latency
(i.e., achieve relatively low accuracy with relatively high hardware cost) are
sub-optimal, and hence should be avoided (colored red).
From a hardware perspective, we observe that the optimal memory hierarchy
found for each DNN varies depending on the optimization goal (energy or
latency), the DNN’s total data size, and layer/data distribution. There are
some general rules that appear.
First, optimizing for latency always leads to a shallower memory hierarchy (2-3
levels) than optimizing for energy (3-4 levels).
Second, the dominant operand usually prefers a shallower memory hierarchy
than the non-dominant ones2 . E.g., for AlexNet, Weight is the dominant
operand, and the two optimal memory hierarchies found (within 5% energy
loss) are a.) 2 levels for W (128Byte, DRAM) + 3 levels for I (128Byte, 32KB,
DRAM) + 3 levels for O (2Byte, 32KB, DRAM), in which the 32KB is shared
by I and O; and b.) 3 levels for W (128Byte, 0.5MB, DRAM) + 4 levels for
I (128Byte, 32KB, 0.5MB, DRAM) + 4 levels for O (2Byte, 32KB, 0.5MB, DRAM), in which the 0.5MB is shared by W, I, and O.

2 This may sound counter-intuitive, but it is explainable: when one operand is dominant in a DNN layer, this usually indicates that the operand has fewer data reuse opportunities, and thus benefits less from a multi-level memory architecture.

Figure 5.19: Energy-latency-area comparison for mapping 12 NNs on 720 accelerator architectures each. Every NN-accelerator pair corresponds to two points in the figure, one with the min-energy mapping and one with the min-latency mapping. The Pareto-optimal accelerator architectures for each NN are highlighted and connected.

Table 5.4: Comparison of the 12 neural networks' algorithm attributes and hardware performance. Weight/Input/Output size is the accumulated size across all layers, assuming 8-bit precision on ImageNet data. ‘(#)’ indicates the value order, from high (#1) to low (#12), across all 12 NNs.

Neural networks (in column order): AlexNet [90], MBV3 Small [64], MBV1 [65], MBV2 [138], NASNet Small [191], MBV3 Large [64], ResNet50 [57], DenseNet201 [69], Xception [31], SEResNeXt50 [67], IncepRes V2 [155], NASNet Large [191].
Top-1 accuracy (%): 56.5 (#12), 67.4 (#11), 70.6 (#10), 72 (#9), 74 (#8), 75.2 (#7), 75.3 (#6), 77.42 (#5), 79 (#4), 79.3 (#3), 80.1 (#2), 82.7 (#1).
Total MAC (GOPs): 1.07 (#7), 0.06 (#12), 0.57 (#8), 0.30 (#10), 0.56 (#9), 0.22 (#11), 3.86 (#6), 4.29 (#4), 9.48 (#3), 4.23 (#5), 13.16 (#2), 23.74 (#1).
Weight size (MB): 24.48 (#4), 4.08 (#10), 4.01 (#11), 3.31 (#12), 5.01 (#9), 9.50 (#8), 24.32 (#6), 18.87 (#7), 24.15 (#5), 26.20 (#3), 53.15 (#2), 84.45 (#1).
Input size (MB): 0.46 (#12), 1.90 (#11), 5.21 (#9), 6.85 (#8), 12.83 (#6), 4.92 (#10), 9.75 (#7), 23.67 (#4), 36.22 (#3), 13.71 (#5), 39.80 (#2), 137.09 (#1).
Output size (MB): 0.63 (#12), 1.55 (#11), 4.81 (#9), 6.37 (#8), 7.57 (#6), 4.40 (#10), 10.10 (#5), 7.49 (#7), 34.17 (#2), 13.75 (#4), 23.90 (#3), 86.37 (#1).
Total data size (MB): 25.57 (#7), 7.53 (#12), 14.03 (#11), 16.53 (#10), 25.41 (#8), 18.82 (#9), 44.17 (#6), 50.03 (#5), 94.54 (#3), 53.66 (#4), 116.85 (#2), 307.90 (#1).
Best energy (uJ): 20.72 (#7), 5.37 (#12), 11.03 (#11), 11.93 (#10), 19.40 (#8), 13.61 (#9), 42.05 (#6), 44.14 (#5), 90.40 (#3), 46.81 (#4), 110.30 (#2), 271.92 (#1).
Best latency (Mcycles): 8.04 (#8), 1.75 (#12), 5.54 (#9), 4.93 (#10), 10.83 (#7), 4.63 (#11), 22.76 (#6), 23.72 (#5), 79.53 (#3), 25.59 (#4), 96.37 (#2), 209.96 (#1).
Thirdly, different optimal memory schemes appear when optimizing for a single
layer vs. a complete DNN. Specifically, single-layer optimizations tend to
favor separate memory schemes (i.e., W/I/O don’t share memory levels), while
memory hierarchies optimized for a complete DNN seem to make extensive use
of the shared memory scheme.
Fourthly, modern DNNs, such as MobileNetV1-V3, which have a lot of
group/depthwise convolution, prefer a shallow memory hierarchy for Weight3 .
We observed that the previously introduced optimal memory hierarchy ‘a.)’ for
AlexNet is also among the 5% optimal ones for MobileNetV2, V3 small/large.
Notice that all the results are based on a fixed PE array size with fixed spatial
unrolling, CACTI7 [12]-based memory cost model, and predefined memory
bandwidth. Any change in the input information could impact the final
conclusion. ZigZag enables users to plug in their own design templates and unit
cost values for customized DSE.

3 The underlying reason is, again, reduced weight data reuse.



5.10 Further Improvements

Our vision for a high-level, fast, and versatile DSE framework for DNN
accelerators began in late 2018. After dedicating time to contemplation and
prototyping, we initiated the development of ZigZag in mid-2019, completed
the implementation of its initial version by early 2020, and open-sourced the
first stable version in September of the same year. Since then, we have been
consistently updating and enhancing ZigZag from various perspectives.

5.10.1 Faster temporal mapping search engines

One of the major improvements is the development of more powerful temporal mapping search engines: LOMA [153] from Symons et al. and SALSA [84] from Jung et al. Both SALSA and LOMA have been integrated into the open-source ZigZag framework.
LOMA outperformed the original ZigZag temporal mapping search engines
from three perspectives: First, LOMA is on average 10× faster than ZigZag’s
heuristic search v2 at locating the optimal mapping, with 100× less required
CPU memory resources. This is achieved by 1) reversing loop blocking and
loop ordering steps to skip the original recursive virtual memory level-per-level
loop operations, 2) applying a fast and lightweight loop ordering generation
algorithm to avoid producing any repetitive loop orders (even with repetitive
LPFs4 ), and 3) making one memory level allocation for one loop order based
on the data reuse maximization heuristic (similar to the loop ordering pruning
technique applied in ZigZag) to avoid branching out multiple sub-optimal cases.
Second, LOMA does not require a user-defined memory utilization threshold
to guide (or constrain) the mapping search procedure. The best mapping and
the best memory utilization scheme are found in one go (even with operand
memory sharing).
Third, the search time of LOMA is predictable, and it allows trading off search time against mapping optimality in a controlled manner by tuning the number of LPFs. For example, a loop of size 24 can be factorized into four LPFs (2, 2, 2, 3) or kept as two coarser factors (4, 6). The fewer factors we use, the fewer unique loop orders the loop permutation step generates, and thus the faster the overall search, potentially at the cost of some optimality in the final result.
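A minimal sketch of this granularity trade-off, counting the unique orderings for the two factorizations of 24 mentioned above:

```python
from itertools import permutations

def unique_orderings(factors):
    """Number of distinct loop orders a flat list of factors can produce."""
    return len(set(permutations(factors)))

print(unique_orderings([2, 2, 2, 3]))  # 4 orders for the fine-grained LPF split of 24
print(unique_orderings([4, 6]))        # 2 orders for the coarser split
```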
4 LPF stands for loop prime factor, introduced in Section 5.6.2.

SALSA continues the idea from LOMA of separating loop ordering and loop memory allocation into two independent processes, and it shares the same memory allocation heuristic as LOMA. However, in the loop ordering step,
instead of exhaustively generating all unique loop orderings as LOMA does,
SALSA applies simulated annealing to sample and evaluate fewer loop orderings.
Experiments show that SALSA can be 1.7× faster than LOMA in converging to the same optimal loop ordering. In addition, SALSA's runtime is also predictable and does not grow exponentially with an increasing number of LPFs, as LOMA's does.

5.10.2 More use cases

The ZigZag framework has been applied in different DSE use cases.
Houshmand et al. in "Opportunities and Limitations of Emerging Analog in-
Memory Compute DNN Architectures" [61] extended the ZigZag cost model with
Analog In-Memory Computing (AIMC) support. With the enhanced cost model,
they utilized the ZigZag DSE framework to assess the benefits and pitfalls of
AIMC solutions from the complete accelerator level (instead of the AIMC array
itself), and compared them with conventional digital accelerator solutions. This
study showed that AIMC can improve efficiency significantly, yet only when
the AIMC array topology and the memory technology are co-optimized with
the memory hierarchy and system architecture.
In the subsequent work, "Benchmarking and Modeling of Analog and Digital SRAM in-Memory Computing Architectures" [63], Houshmand et al. further extended the ZigZag cost model with Digital In-Memory Computing (DIMC) support and unified it with the original AIMC cost model, so as to fairly compare different AIMC and DIMC hardware design options and mapping preferences. The results highlighted that 1) peak performance does not correspond to actual performance; instead, architectures should be compared on real workloads, together with their mapping, to prove their effectiveness; and 2) the exploration done on the MLPerf Tiny Benchmark [13] shows good potential for DIMC to replace AIMC in certain cases, thanks to its higher flexibility in the mapping space, full-precision operations, and better area efficiency (TOPS/mm2).
Besides being used to compare these emerging technology options, ZigZag
can also be applied to analyze certain architectural attributes. Liu et al.
in "Bandwidth-aware Flexible-Scheduling Machine Learning Accelerator" [99]
focused on studying the impact of on-chip memory bandwidth and its interaction
with other architectural-level optimization possibilities. This work deployed
ZigZag to explore, under different on-chip memory bandwidths, the system-level
benefits that dynamic memory allocation and flexible PE interconnection can
bring. Observations indicated these techniques are promising to be used in
future 3D stacking chip designs (with potentially high on-chip bandwidth).

In addition, ZigZag has also been utilized to provide hardware insights for various
DNN workloads. Colleman et al. in "Processor Architecture Optimization for
Spatially Dynamic Neural Networks" [33] made use of ZigZag to compare novel
accelerator architectures and dataflows enabling latency improvements for this
new type of neural network: Spatially Dynamic Neural Networks (SDyNNs).
SDyNNs adjust network execution based on the input data, saving computations
by skipping non-important image regions. How to translate these irregular and dynamic algorithm-level operation savings into actual hardware cost benefits is the question this work tried to answer. They reformulated the Conv layer loop
format in ZigZag to express the irregular spatial parallelism of SDyNNs and,
based on this, assessed the required hardware flexibility in spatial and temporal
mapping to support all relevant layer types.
Afterwards, Colleman et al. in "Optimizing Accelerator Configurability for Mobile Transformer Networks" [35] further employed ZigZag to explore, for Transformer-type neural networks, the optimal accelerator architectures (number of PEs, memory hierarchy, PE array configurability, etc.) and the corresponding best deployment strategies (spatial and temporal mappings). The experimental results highlighted the importance of supporting more than one spatial mapping in the PE array (especially for large PE arrays) to guarantee good hardware utilization across the wide variety of layer topologies in Transformer networks.

5.10.3 More practical considerations

ZigZag, as a high-level architecture-mapping DSE tool, starts by modeling the first-order system effects (i.e., memory and compute), and makes many idealized assumptions about the second and subsequent-order effects (e.g., interconnection, control, etc.). Depending on the use case, this may (or may not) become an issue and introduce inaccuracies in the model that bias the conclusions. To study and embed more practical considerations in ZigZag, several efforts have been undertaken.
For example, ZigZag assumes an ideal memory data layout (i.e., memory is
treated as a black box, and data are always assumed to be stored in the right
order), hence neglecting the data dimension mismatch and data reshuffling
overhead that may come from the consecutive layer executions.
To understand the impact of memory data layout, Shi et al. in "CMDS:
Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank
Memories" [144] enhanced the ZigZag cost model by taking data dependency
and data layout reshuffling overheads into account, and deployed ZigZag for
cross-layer dataflow optimization. This work built a compact representation
for data layout in multi-banked memories, proposed an analytical modeling
approach for reshuffling buffers, and exploited the multi-level parallelism of
the stored data in multi-bank memory. Results showed that, by utilizing this
multi-level data parallelism and optimizing it with the cross-layer dataflow
while accounting for the possible penalty from data dimension mismatches, the overhead
of hardware data reshuffling can be kept low while increasing dataflow flexibility.
Another example is that, as discussed earlier, neural networks with a wide variety
of layer topologies (like Transformers) may prefer a reconfigurable PE array
to support different dataflows. ZigZag supports exploring multiple dataflows
within one PE array, but it ignores the hardware overhead that comes from the
configurability, such as the flexible interconnection and adder tree.
To fill in this gap, Colleman et al. in "COAC: Cross-layer Optimization
of Accelerator Configurability for Efficient CNN Processing" [32] provided a
systematic analysis of the architectural overhead as a function of the different
supported spatial unrollings. They summarized the hardware configurability
overhead into three groups and modeled the cost of each of them: flexible
data assignment (interconnection), flexible data aggregation (add tree), and a
reshuffling buffer. Together with the new hardware configurability cost model,
they built an automated flow around ZigZag to find the best to-be-supported
spatial unrolling combination(s) for efficient end-to-end DNN inference with
limited hardware overhead.

5.10.4 Upgraded framework implementation

If we envision the entirety of ZigZag as an iceberg, the enhancements and
deployments we have previously discussed and published in papers represent
the visible tip. Below the water's surface lies the foundation sustaining these
advancements: the continually evolving ZigZag implementation, which tries to
keep up with the new trends in this fast-growing domain, facilitates developers
in making contributions, and provides users with more DSE customization freedom
and better features.
Here, we briefly discuss the implementation advances from three perspectives:
data representation, mapping information extraction, and the program structure.
For more implementation details, please refer to our open-source GitHub
pages: https://github.com/KULeuven-MICAS/zigzag and https://github.
com/KULeuven-MICAS/zigzag-demo.
In addition to the author, Arne Symons and Koen Goetschalckx are also significant
contributors to Section 5.10.4.
More general data representations

The ZigZag data representation has been enhanced along three perspectives:
workload, hardware, and mapping.
Workload: The initial version of ZigZag supported all the neural network
layers that can be represented by the 7 nested for-loops, as depicted in Figure 5.2.
The 7 loop dimensions (B/K/C/OX/OY/FX/FY), 3 operands (W/I/O), and
the loop relevancy in between, were all hard-coded. Although this could already
cover a lot of DNN layer topologies, it still limited ZigZag’s ability to support
layers with more diverse loop types (e.g., 3D Conv) or layers with different
operands than W/I/O (e.g., no weight in element-wise layers). Moreover, given
the fast-evolving speed in the DNN algorithm community, it is challenging to
anticipate what new types of layers will emerge in the future.
Seeing this, we have fundamentally upgraded the workload representation of
ZigZag for generality. The new representation has the following main attributes:

• It allows defining any number of loop dimensions with three operands: two
input operands, and one output operand for a layer, each with customized
data precision(s)5 .
• It allows customizing the loop relevancy (r/ir/pr) between computation
dimensions and data dimensions. For pr, all linear affine transformations
are supported.
• For the two input operands, it distinguishes constant and non-constant
operands. A constant operand is like Weight, whose data are ready from
the beginning; A non-constant operand is like Activation, whose data are
generated by a previous layer at runtime.
• Utilizing the constant/non-constant principle, it can consider the cross-
layer data dependency by defining the data-producing layer for each
non-constant operand with its data dimension transformation clarified.
E.g., Layer 2’s non-constant input operand I[c][ix][iy] comes from Layer 1’s
output operand O[k][ox][oy], with k → c, ox → ix, oy → iy.
• It also lets users define layer types (e.g., Conv, FC, DW, Add, Pooling,
or any customized keyword) and later use these layer types to guide the
mapping in the user mapping definition (explained later) (e.g., Conv
and Pooling are likely to be mapped on accelerators in different ways);
an illustrative sketch of such a layer definition follows below.

5 Note 1: this representation can be easily extended to support more than two input
operands if needed.
Note 2: the output operand can have two data precisions: the partial output and the
final output precision. The partial output precision is usually higher due to the reserved
accumulation headroom bits.
Note 3: for layers that only have one input operand, like pooling, ZigZag models them by
setting the data precision of one (out of the two) input operands to 0 bit.
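To make these attributes concrete, the sketch below shows what a manually defined layer could look like. This is a hypothetical, simplified Python structure for illustration only; the key names and exact format of ZigZag's real workload definition (see the GitHub repository) may differ.

# Hypothetical, simplified layer definition (illustrative only; not ZigZag's exact format)
layer_2 = {
    "operator_type": "Conv",
    # computation: O[k][oy][ox] += W[k][c][fy][fx] * I[c][iy][ix]
    "loop_dims": {"K": 64, "C": 32, "OY": 56, "OX": 56, "FY": 3, "FX": 3},
    "pr_relations": {"IX": "OX + FX", "IY": "OY + FY"},           # linear affine pr loops
    "operand_precision": {"W": 8, "I": 8, "O": 24, "O_final": 8},  # partial vs. final output
    "constant_operands": ["W"],                                    # I is produced at runtime
    "operand_source": {"I": "layer_1"},                            # cross-layer data dependency
}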

In addition to the manual workload definition, which gives users easy adaptability
of all layers’ attributes, ZigZag now also supports directly importing a DNN
workload from an onnx [11] model (a widely used open source format for AI
models). As lots of well-established DNN workloads have their onnx models
available online, users can directly download and use them in ZigZag, without
manual encoding effort.
Hardware: The hardware representation is improved in the new version of
ZigZag for two purposes: 1) to enable more fine-grained and thus more accurate
hardware cost modeling, and 2) to reflect more practical hardware constraints.
Firstly, for more fine-grained modeling, the main update is the explicit definition
of the memory port and memory read/write granularity. Initially by default,
ZigZag assumed all memories are single-port, and all memory read/write
behaviors happen at the full wordlength granularity. Now, ZigZag allows
defining any number of ports for each memory instance, the type of each port
(e.g., read-only, write-only, read-write), and the served operand movement of
each port6.
6 Two data movement actions exist for input operands: write-in-by-high and read-out-to-low;
four for the output operand: write-in-by-high, read-out-to-low, write-in-by-low, and read-out-to-high.
High/low indicates the memory level in the hierarchy: low means the memory level closer to
the MAC unit; high, the memory level closer to DRAM.
Meanwhile, as multi-banked memories are frequently used in DNN accelerators
and some of them support different memory read/write granularities (e.g., one
word, half a word, a quarter of a word), this is also modeled in the new ZigZag.
During hardware cost estimation, for a data read/write that goes below the
finest granularity of a memory, the per-memory-access cost is that of the finest
granularity. If the access goes above the finest granularity but stays below the full
memory wordlength, the cost is a multiple of the finest granularity's cost,
depending on how many finest-granularity accesses are required.
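As a minimal sketch of this granularity rule (the function and parameter names below are hypothetical, chosen only for illustration):

import math

def per_access_cost(access_bits, finest_granularity_bits, cost_per_finest_access):
    # Charge the access in multiples of the finest read/write granularity,
    # as described in the paragraph above.
    n_chunks = max(1, math.ceil(access_bits / finest_granularity_bits))
    return n_chunks * cost_per_finest_access

# e.g., a 96-bit access to a memory with 64-bit finest granularity is charged as 2 accesses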
Secondly, for embedding more practical hardware design constraints, the major
updates include a.) the decoupling of the spatial mapping from the PE array
interconnection definition, and b.) the decoupling of the layer operands from
the memory hierarchy definition.
In the old ZigZag, PE array interconnection was implicitly defined by its
supported spatial mapping, while in the new version, it is defined through the
memory-served dimension (or data broadcasting dimension). The memory-served
dimension defines, for each memory instance in the hierarchy and given the
pre-defined multi-dimensional PE array, which array dimension(s) the memory
serves data to (or collects data from). The non-served array dimension(s) is
where the memory is unrolled.
For example, assume we have a 2D PE array of 3×4 elements (D1×D27 ). If a
memory’s served dimension is D1, it means this memory serves data to all 3
PEs along D1, and in total there are 4 such memory instances on-chip, unrolled
along D2. Two other extreme examples help to further clarify this concept: The
memory that serves data only to one PE, like the inner-PE register, has a served
dimension of ‘None’ and is unrolled along all PE dimensions (number of memory
instances equals the number of PEs); The memory that serves data to all PEs,
like an on-chip global buffer or off-chip DRAM, has a served dimension of ‘All’
and is not unrolled (the number of memory instances is 1). With the concept of
the memory-served dimension, the interconnection from the PE array up through
the whole memory system is clearly defined8.
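The relation between served dimensions and the number of memory instances can be summarized in a small sketch (hypothetical function and variable names, not ZigZag's actual API):

from math import prod

def memory_instance_count(pe_dims, served_dims):
    # A memory is unrolled along every PE array dimension it does not serve.
    return prod(size for dim, size in pe_dims.items() if dim not in served_dims)

pe_array = {"D1": 3, "D2": 4}
memory_instance_count(pe_array, {"D1"})          # 4: serves D1, unrolled along D2
memory_instance_count(pe_array, set())           # 12: e.g., an inner-PE register
memory_instance_count(pe_array, {"D1", "D2"})    # 1: e.g., a global buffer or DRAM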
Besides the interconnection re-definition, action is also taken to decouple layer
operand allocation (e.g., W/I/O) from the memory definition. In the old ZigZag,
when defining a memory hierarchy, each memory was pre-defined to serve one
or more fixed layer operand(s). This is not an issue for single-layer processing,
but it can become a bottleneck for a complete network execution (with multiple
different layer types), as not all layers should follow the same memory allocation,
and more importantly, not all layers necessarily have the same layer operands.
Seeing this, the new ZigZag creates the concept of memory operand to unbind
the memory hierarchy definition from the layers it is going to execute. Instead,
it links layer operands to memory operands across layers in the user mapping
definition (explained next).
Three memory operands are pre-defined: I1, I2, and O, which stand for Input 1,
Input 2, and Output operands9 . Instead of using layer operands W/I/O in
the memory instantiation10 , we apply the memory operands I1/I2/O in it. For
example, if we want to instantiate a 1 MB SRAM memory as the second-level
on-chip buffer, the old ZigZag would define it together with its served layer
operand(s), e.g., marking it as a W memory and fixing this setting for all the layers
that are going to be executed on it. The new ZigZag, on the other hand, defines it
with the memory operand(s), e.g., as an I1 memory, and for different
layers, I1 can be linked to different layer operands, such as W, I, A1, A2, or any
layer operand name that the user defined. This enables that 1 MB
SRAM to serve different operands in different layers.
7 Assume 3 is on dimension 1 (D1), and 4 is on dimension 2 (D2).
8 All the interconnection patterns introduced in Chapter 2.2.3 are covered. Systolic
data movement is treated as multi-cast with additional inner-PE register read/write cost and
latency compensation.
9 As mentioned in the previous Workload representation enhancement, we assume a layer
has two input operands and one output operand.
10 Memory instantiation means assigning a memory module to a position in the memory
hierarchy.
To summarize, we have decoupled the algorithmic and mapping factors from
the hardware definition, which greatly enhances the generality of the ZigZag
framework.
User Mapping Definition: In the new version of ZigZag, the mapping
representation itself does not change much; it still follows the uneven loop format
introduced in Section 5.4. However, a user mapping definition is newly provided as
an input to the framework: in this definition, users can provide hints, per layer type
of a workload, to guide the mapping in a desired way. The mapping
hints include (per user-defined layer type) 1) the core allocation11 (e.g., Conv
layers can be executed on one core and element-wise sums on another core), 2) the
layer operand to memory operand link (as explained in the Hardware part), 3)
the spatial mapping candidates, and 4) the temporal order candidates. Hints
1) and 2) are mandatory (multiple options can be entered); 3) and 4)
are optional. When the spatial/temporal mapping candidates are not defined,
the framework auto-searches them.
11 This feature is more frequently used in the follow-up multi-core DSE framework, Stream,
which shares the same infrastructure as ZigZag, to be introduced in Chapter 8.
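As a concrete illustration, such a user mapping definition could look like the hypothetical sketch below (key names are illustrative; ZigZag's actual input format may differ):

mapping = {
    "Conv": {
        "core_allocation": [1],                                     # mandatory
        "memory_operand_links": {"W": "I1", "I": "I2", "O": "O"},   # mandatory
        "spatial_mapping": {"D1": ("K", 32), "D2": ("C", 2)},       # optional hint
        # temporal ordering not given -> auto-searched by the framework
    },
    "Add": {
        "core_allocation": [2],                                     # element-wise sums on another core
        "memory_operand_links": {"A1": "I1", "A2": "I2", "O": "O"}, # no weight operand
    },
}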
These enhanced data representation formats together enable ZigZag users to have
greater flexibility in customizing their DSE experiments, which can potentially
lead to discovering better designs and deployment options.

More unified mapping info extraction

As the fundamental data representation changes, so must the structure built
on top of it. Here we explain two important upgrades made in the mapping
info extraction step, targeting the more general workload definition and the more
fine-grained hardware definition, respectively.
Firstly, when moving from the old workload representation (with hard-coded
loop dimensions and relevancy) to the fully flexible new workload definition, the
major challenge comes from the pr-loop-related info extraction. As mentioned
earlier, we want to support all linear affine transformations in the pr loops. It
would become very cumbersome to analyze loop-per-loop and, whenever
a pr loop is met, look for its pr pair, check whether the pair can trigger the
FIFO effect, and based on that calculate its data size and data reuse factor.
To avoid the complications and unify the mapping info extraction for all loop
types (r/ir/pr), a new idea is formed: let’s decouple all the pr loops into r
and ir loops before the mapping info extraction starts. The rationale for doing
so is that, as explained in Section 5.5.1, r loops are data dimensions, ir loops
indicate data reuse, and pr loops include both of them.
The implementation for the pr decoupling is straightforward. After the ZigZag
cost model receives a mapping (either user-defined or provided by mapping
search engines), it will pre-process the mapping before feeding it to the mapping
information extraction functions. In the pre-processing, for operand(s) that
have pr loops (e.g., the Input operand in Conv), it analyses for all its pr loops
in a bottom-up manner (i.e., from MAC to DRAM) the equivalent data size
and data reuse factor, and replaces each pr loop with its equivalent data-size r
loop and data-reuse-factor ir loop.
For example12, consider the mapping with FX 3 as a temporal loop just above an
architecture level boundary and OXu 4 spatially unrolled below it. For the Input
operand, it can be replaced by r 1.5 above the boundary and (ir 2, ru 4) below it:
OXu 4 becomes ru 4, and FX 3 is replaced by r 1.5 and ir 213, with the ir 2 merged
down to the architectural level below14, indicating that on average every data element
can be reused 2 times in the level below.
12 The same example used to explain the "FIFO Effect" in Section 5.5.2; u indicates spatial unrolling.
13 The reason for r 1.5 and ir 2 is that the total data size is 6 (3+4-1) and the total operation
count is 12 (3×4), thus the data reuse factor is 2, meaning ir is 2. To make the
product of ir and r equal 3, r is 1.5.
14 The ir loops just above an architecture level boundary can always be merged into the
architecture level below, as they do not increase the data size but provide more data reuse to the
level below.
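The arithmetic behind this decoupling is simple; a minimal sketch for the FIFO-effect example above, assuming a unit stride (the function name is hypothetical):

def decouple_pr(fx, ox_u):
    data_size = ox_u + fx - 1      # Input elements touched by the two pr loops (FIFO effect)
    operations = fx * ox_u         # operations covered by the two loops
    ir = operations / data_size    # average reuse per Input element -> ir loop
    r = fx / ir                    # keep the loop product: r * ir == fx
    return r, ir, ox_u             # temporal FX 3 -> (r 1.5, ir 2); spatial OXu 4 -> ru 4

decouple_pr(3, 4)  # (1.5, 2.0, 4)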
Secondly, for managing the fine-grained hardware definition (e.g., memory
port), a parameter called four-way data movement is introduced for uniformly
representing each operand’s data movement at each memory level. This
parameter collects data movement information (e.g., number of moved data
elements, data precision for the movement, data moving rate and period, etc.) of
four directions: write-in-by-high, read-out-to-low, write-in-by-low, read-out-to-
high. In the mapping information extraction step, all four-way data movement
attributes are calculated. Then, in the hardware cost estimation step, they are
combined with the fine-grained hardware information (e.g., link the responsible
memory port to each way of data movement) to calculate the final hardware
impact.
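A minimal sketch of such a record, showing only the element counts (field names are illustrative, not ZigZag's exact class):

from dataclasses import dataclass

@dataclass
class FourWayDataMovement:
    # Data-movement bookkeeping of one operand at one memory level.
    wr_in_by_high: int = 0    # elements written in from the level above
    rd_out_to_low: int = 0    # elements read out towards the MAC side
    wr_in_by_low: int = 0     # (partial) outputs written back from below
    rd_out_to_high: int = 0   # elements sent up towards DRAM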

More sustainable program structure

Finally, as the ZigZag framework grows, more and more people are using it for
customized DSE experiments and contributing to it from different perspectives.
It is important to provide a sustainable program structure for all the users and
contributors. If not, the whole framework will gradually become spaghetti with
people’s code all mixed together, difficult to understand, to use, and to further
expand.
So, to facilitate long-term usage and development, we have ultimately upgraded
ZigZag’s program structure with the concept of stages. The general idea of the
stage-based structure is to divide the whole DSE flow of ZigZag into multiple
execution stages, each performing a specific task with clearly defined inputs
and outputs. Currently ZigZag provides a pool of stages, which can be grouped
into 7 main categories:

1. user-defined input parsing stages (e.g., WorkloadParserStage,
ONNXModelParserStage, AcceleratorParserStage),
2. iteration stages (e.g., WorkloadStage, GeneralParameterIteratorStage),
3. mapping generation/conversion stages (e.g., LomaStage, SalsaStage15 ,
SpatialMappingConversionStage, TemporalMappingConversionStage),
4. cost model stages (e.g., CostModelStage),
5. result manipulating stages (e.g., SumStage),
6. result pruning stages (e.g., MinimalLatencyStage, MinimalEnergyStage),
7. result saving stages (e.g., SimpleSaveStage, CompleteSaveStage,
DumpStage).

Performing DSE experiments with ZigZag implies selecting a specific set of
desired stages and connecting them in sequence. The selected stages and their
order of execution determine the functionality of a framework run.
To better illustrate this multi-stage concept, an example DSE configuration is
exhibited in Figure 5.20, which performs both spatial and temporal mapping
optimization for a multi-layer DNN workload. It is worth pointing out that
among the 7 aforementioned stage categories: a.) stages in groups 1/2/3 take
in (and add on) forward-flow information (e.g., workload/hardware/mapping),
indicated by ↓; b.) stages of groups 5/6/7 process backward-flow results,
indicated by ↑; c.) CostModelStage in group 4 is the bottom stage, receiving
all the forward-flow information and producing backward-flow results.
The overall DSE flow first goes through all the stages top-down, with forward-
flow stages (↓) performing certain tasks and adding on required information
for the bottom CostModelStage, and meanwhile, backward-flow stages doing
nothing (just transparently passing the information to the next stage). Then,
15 LOMA and SALSA temporal mapping search engines (explained in Section 5.10.1).
dseflow = MainStage([          # (↓ forward flow, ↑ backward flow)
    WorkloadParserStage,       # ↓ parse workload
    AcceleratorParserStage,    # ↓ parse hardware
    SimpleSaveStage,           # ↑ save total cost
    SumStage,                  # ↑ sum up all layers' cost
    WorkloadStage,             # ↓ iterate through each layer
    CompleteSaveStage,         # ↑ save each layer's cost
    MinimalEnergyStage,        # ↑ only return min-energy SM
    SpatialMappingGenStage,    # ↓ generate spatial mappings (SM)
    MinimalEnergyStage,        # ↑ only return min-energy TM
    LomaStage,                 # ↓ generate temporal mappings (TM)
    CostModelStage             # ↓ ↑ evaluate cost
    ],
    ...  # required arguments
)

# Launch the DSE flow
results = dseflow.run()

Figure 5.20: An example code snippet of a multi-stage configuration for
performing a DSE experiment in ZigZag.

all the required information (accelerator, layer, spatial and temporal mappings)
reaches the bottom CostModelStage, where the cost evaluation is conducted and results
are generated. After that, the cost model evaluation results flow backward in the
flow through all the stages, with forward-flow stages doing nothing (just passing
the results to the next stage), and backward-flow stages performing certain
tasks on the results it receives (e.g., result pruning, saving, post-processing,
etc.). Note that the whole flow is not necessarily a one-time event, but can
include many (local) iterations back and forth: e.g., iterating through each
layer in a workload, or through each temporal mapping that LOMA generated,
can lead to multiple sub-flows.
The pseudocode in Figure 5.21 further demonstrates the inter-stage data passing
and calling, using the bottom four stages of Figure 5.20 as a partial
flow for the explanation.
What this partial flow is doing is searching for the best-energy temporal
mapping solution for each spatial mapping found. Firstly, on lines 11-12,
the SpatialMappingGeneratorStage generates spatial mapping (sm) one at
a time and passes it to the next stage, together with the accelerator and
layer information. Its next stage, MinimalEnergyStage, is a backward-flow
stage, and thus directly passes the received information to the next stage
1 """
2 cme: cost model evaluation result
3 sm: spatial mapping
4 tm: temporal mapping
5 """
6

7 class SpatialMappingGeneratorStage(Stage): # forward-flow stage


8 ...
9 def run(self, accelerator, layer):
10 next_stage = next(self.stages) # next_stage is
,→ MinimalEnergyStage
11 for sm in self.generate_spatial_mappings.run(accelerator,
,→ layer):
12 for cme in next_stage.run(accelerator, layer, sm):
13 yield cme
14

15 class MinimalEnergyStage(Stage): # backward-flow stage


16 ...
17 def run(self, accelerator, layer, sm):
18 next_stage = next(self.stages) # next_stage is LomaStage
19 for cme in next_stage.run(accelerator, layer, sm):
20 if (cme.energy_total < self.best_cme.energy_total):
21 self.best_cme = cme
22 yield self.best_cme
23

24

25 class LomaStage(Stage): # forward-flow stage


26 ...
27 def run(self, accelerator, layer, sm):
28 next_stage = next(self.stages) # next_stage is
,→ CostModelStage
29 for tm in self.generate_temporal_mappings.run(accelerator,
,→ layer, sm):
30 for cme in next_stage.run(accelerator, layer, sm, tm):
31 yield cme
32

33

34 class CostModelStage(Stage): # bottom stage


35 ...
36 def run(self, accelerator, layer, sm, tm):
37 cme = CostModelEvaluation(accelerator, layer, sm, tm)
38 yield cme

Figure 5.21: Example pseudocode for showing the data passing and calling in
between the last four stages in Figure 5.20.
(line 19). LomaStage receives the information (accelerator, layer, sm) and
performs the temporal mapping generation (line 29), generating one tm at a
time, and passing it together with all the other information to the next stage
(line 30). CostModelStage receives the information (accelerator, layer, sm,
tm) and generates one cost model evaluation result (cme) (line 37). This one
cme then flows backward, passing through the LomaStage and reaches the
MinimalEnergyStage (lines 19-21). Then the for-loop on line 19 will go into
the next iteration, which will trigger the stages below it to generate another cme,
i.e., the LomaStage will generate another tm and CostModelStage will produce
another cme for this tm, and so on. This iteration of lines 19-21 goes on
(while keeping the best-energy tm point encountered so far) until the stages below
no longer produce a new cme, meaning that the LomaStage has generated
all tm. At this point, the best-energy tm is found by the MinimalEnergyStage.
After that, a new sm will be generated and start the whole iteration again
(lines 11-12).
It is worth pointing out that the Python programming keyword "yield" is
extensively used in the multi-stage implementation of ZigZag. For example, in
Figure 5.21, "yield" is used explicitly on lines 13, 22, 31, 38, and implicitly
on lines 11 and 29. In Python programming, "yield" is used within generator
functions to produce a sequence of values. When "yield" is encountered, it
temporarily suspends the function’s execution and returns a value to the caller.
This allows for lazy evaluation, where values are generated on-demand rather
than being computed all at once, making it efficient for working with large
sequences and a perfect fit for our multi-stage implementation.
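For readers less familiar with this mechanism, here is a minimal stand-alone generator example (independent of ZigZag, purely for illustration):

def squares(n):
    for i in range(n):
        yield i * i    # suspend here, hand one value to the caller, resume on the next request

gen = squares(3)
next(gen)    # 0, computed on demand
list(gen)    # [1, 4], the remaining values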
This multi-stage DSE configuration gives users high flexibility to adapt the
functionality of the framework and customize their experiments. For example,
one can simply replace MinimalEnergyStage with MinimalLatencyStage or
MinimalEDPStage to change the optimization target; one can also replace
MinimalEnergyStage and LomaStage by the TemporalMappingConversionStage
to change the function of the framework from auto-mapping search and
optimization to a pre-defined mapping cost evaluation.
In addition to users, ZigZag developers also reap significant advantages from
this modular flow. For instance, if a developer wishes to create another temporal
mapping search engine, they can develop it as a separate stage, such as
LomaStage or SalsaStage, and seamlessly integrate it into ZigZag without
concerns about disrupting other parts of the implementation. This allows for a
quick and hassle-free plug-and-play experience.
5.10.5 Successive frameworks

Besides making progress within ZigZag, a DSE framework for single-core DNN
accelerators supporting single-layer mapping, we want to continue broadening
the design space for both mapping and hardware, to seek more system-level
improvement possibilities. Following this idea, two successive DSE frameworks
are built based on ZigZag: DeFiNES and Stream. DeFiNES broadens the design
space of ZigZag with depth-first mapping; Stream further enlarges the design
space with multi-core accelerators. Further information about DeFiNES and
Stream will be discussed in Chapter 7 and Chapter 8, respectively.
All the frameworks are open source and have been employed by users in other
organizations, e.g., Meta, Sony, Nokia, IMEC, NXP, Bosch, TU Delft, TU
Eindhoven, Stanford University, Ghent University, University of Antwerp, etc.

5.11 Conclusion

This chapter presented ZigZag, a DNN accelerator architecture and mapping
DSE framework with an enlarged even/uneven mapping search space.
Three modules cooperate in synergy to enable the exploration of a much broader
space of solutions with respect to other SotA DSE frameworks. Firstly, the
Architecture Generator is capable of generating all valid memory hierarchies
(balanced/unbalanced, shared/separate) given a set of high-level hardware
constraints. Secondly, the Mapping Search Engines can rapidly locate the
optimal spatial and temporal mappings (even/uneven) by means of innovative
search methods for any type of memory hierarchy provided by the Architecture
Generator. Thirdly, with the Memory-Centric Design Space Representation and
the Loop Relevance Principle, the Hardware Cost Estimator can analytically
calculate energy and latency for the mappings generated by the Mapping Search
Engines. Three case studies explore the vast DNN accelerator design space from
different perspectives and present the capability of ZigZag.
To maintain a seamless reading experience, the explanation of the latency
modeling part, which involves intricate calculations, was intentionally omitted
in Section 5.5.3. In the upcoming chapter, we will exclusively concentrate on
the latency modeling approach and thoroughly explore its details, enabling a
comprehensive understanding of the underlying principles.
Chapter 6

An Analytical Latency Model for DNN Accelerator

As a sequel to Chapter 5, this chapter focuses on the latency modeling approach
of the ZigZag framework.
In this chapter, we thoroughly explain the uniform intra-layer analytical latency
model for DNN accelerators that can be used to evaluate diverse architectures
and dataflows. It employs a 3-step approach to systematically estimate the
latency breakdown of different system components, capture the operation
state of each memory component, and identify stall-induced performance
bottlenecks. To achieve high accuracy, different memory attributes, operands’
memory-sharing scenarios, as well as dataflow implications have been taken
into account. Validation against an in-house taped-out accelerator [151] across
various DNN layers has exhibited an average latency model accuracy of 94.3%.
To showcase the capability of the proposed model, we carry out 3 case studies
to assess respectively the impact of mapping, workloads, and diverse hardware
architectures on latency, driving design insights for algorithm-hardware-mapping
co-optimization.

This chapter is based on publication [110], and contains large fractions of it. The author
contributed to all aspects of the latency model, in collaboration with Meta.
Detailed implementation is at: https://github.com/KULeuven-MICAS/zigzag/blob/master/
zigzag/classes/cost_model/cost_model.py.

6.1 Motivation and Chapter Organization

Many prior arts have proposed uniform energy models for DNN accelerator
design [183, 126, 177, 91, 184, 109, 158]. The common basis is an analytical
model which counts the operations of each hardware component (e.g., memory
read and write at each level, MAC, data transfer in NoCs, etc.), and multiplies
these with the corresponding unit energy to obtain the total system energy.
Unlike the well-explored energy models, analytical latency models are, however,
less systematically developed or explained for DNN accelerators. In this
chapter, we refer to "latency" as the clock cycle count (CC) for completing a
workload. From the physical level up to the system abstraction level, the existing SotA
latency estimation methods can be categorized as: 1) measurement on physical
devices, 2) FPGA emulation [28], 3) RTL simulation [166], 4) cycle-accurate
simulation [136, 119], 5) regression or ML-based methods [36, 127], and 6)
analytical modeling [91, 126, 187]. Among these, analytical models are preferred
for early-phase DSE, thanks to their fast run-time and transparency (compared
to black-box regression or ML-based methods). Moreover, since most of the
DNNs are deterministic with pre-known hardware architectures and mappings,
high accuracy in latency is analytically achievable.
However, most existing analytical latency models rely on ideal assumptions,
such as: 1) all memories at different levels are double-buffered, assuming that
double-buffering can avoid all temporal stall; 2) memories that are shared by
multiple operands always have multiple read/write ports to avoid interference
among different operands’ data accesses. Although these assumptions simplify
the modeling, two issues are introduced: a.) for fixed architectures, the modeling
accuracy degrades if these assumptions are not met; b.) for architecture search,
memory system overhead is introduced by default, which excludes a large part
of the design space, leading to sub-optimality. Some other latency models were
delicately built for a specific design case or for a small group of design variants
with a hardware template [127]. Although accuracy is preserved, the limited
generality prevents their usage for novel architecture searches.
This work aims to bridge these gaps with a uniform analytical latency modeling
approach for algorithm-hardware-mapping (AHM) co-optimization, targeting
dense intra-layer cases. This chapter is organized as follows:

• Section 6.2 provides a comprehensive overview of the latency impact
factors and introduces our modeling philosophy for solving the modeling
challenges.
• Section 6.3 presents our analytical latency model in detail, highlighting
the 3-step approach that can uniformly address the different stall scenarios
induced by the multi-level memory system.

• Section 6.4 demonstrates good model accuracy with test chip validation.
• Section 6.5 assesses AHM-latency co-optimization through three case
studies to drive design insights.
• Section 6.6 concludes the chapter.

6.2 Algorithm/hardware/mapping and Latency

6.2.1 Latency impact factors

Algorithm, hardware, and mapping can largely impact the system latency. The
following briefly recaps what has been discussed in Chapter 2.
Algorithm (A) includes DNN layer attributes, such as layer type (e.g., Conv,
FC, DW and PW), layer loop dimensions, data attributes of the layer operands
(e.g., total data size and data precision), etc.
Hardware (H): A DNN accelerator is usually equipped with a MAC array
and a multi-level memory system, connected via an on-chip network. Its
performance roofline is determined by hardware parameters, such as MAC array
size, interconnectivity, and memory hierarchy (e.g., memory levels, capacity
/ bandwidth (BW) / number of read/write ports, and memory allocation for
different operands).
Mapping (M) determines how the algorithm is spatially and temporally
mapped on the hardware. Spatial mapping defines how to parallelize DNN
loops across the MAC array, while temporal mapping defines in what order
the MAC array processes the non-spatially-unrolled DNN loops. Ideal spatial
mapping fully utilizes the MAC array, while ideal temporal mapping maximizes
operands’ data reuse at lower memory levels. Mapping optimization can help
to minimize compute cycle count and communication stalls.
These three factors are strongly interrelated and form a gigantic AHM design
space, in which each point corresponds to a specific algorithm-hardware-mapping
scenario with a resulting deterministic latency value.
6.2.2 Challenges in uniform latency modeling

The challenges for building an analytical latency model that can be applied to
the vast AHM design space are twofold:
First, the concurrency and interference of data transfers between different memory
levels and for different operands, i.e., weight (W) / input (I) / output (O), need to be
captured. Such interference comes from hardware constraints (e.g., insufficient
memory BW / ports, lack of double buffering, system interrupts), and mapping
choices (i.e., optimized mappings can alleviate the interference while non-
optimized mappings may aggravate such effect). Note that this is specific for
analytical latency estimation, but is usually less impactful on analytical energy
modeling. This is because the analytical energy model (dynamic energy) only
relies on the total operation count of each hardware component, while latency
also depends on when these operations happen and how they interfere with
each other.
The second challenge stems from the generality requirement, since the latency
model needs to be applicable for not only few pre-defined hardware architectures
with a fixed dataflow, but also for every valid AHM point in the design space.

6.2.3 Proposed modeling philosophy

Our proposed latency modeling approach aims to solve these two major
challenges: 1) To capture the concurrency and interdependencies of
data transfers, we start from dividing this complex intertwining problem into
multiple single-component data movement events, analyze each event separately,
and then combine them based on physical hardware constraints; 2) To ensure
generality, we adopt a uniform AHM representation, and implement a standard
3-step memory-type / bandwidth / sharing-aware latency modeling methodology
which can cover all valid design points, as detailed in Section 6.3.

6.3 A Uniform Intra-Layer Latency Model

The processing of each DNN layer consists of 3 phases: data pre-loading,
computation, and data offloading, as shown in Figure 6.1(a). We define the
data pre-loading as the data initialization step before computation starts, and
the data offloading as the final round of outputs writing back to memory after
computation finishes. We can derive their latency based on the required data
transfer amount and the related memories’ BW.
[Figure 6.1: (a) A timeline illustration of the DNN layer operation phases (data pre-loading, computation, data offloading). Notation: "CC" for cycle count; "U" for MAC array utilization; "SS" for data transfer stall (>0) / slack (<0); "N" for the number of nested for-loop types in the DNN layer (e.g., N = 7 for Conv2D with batch). (b) Four scenarios of latency and utilization modeling in the computation phase, depending on whether the MAC array is temporally and spatially fully mapped:
1. temporally and spatially fully mapped: CCideal = total MAC operations / MAC array size; Uideal = 100%;
2. temporally fully mapped, spatially under-utilized: CCspatial = product over the N loops of (loop dim. size / loop unrolled size); Uspatial = CCideal / CCspatial;
3. spatially fully mapped, temporally under-utilized: CCtemp. = CCideal + SSoverall (SSoverall > 0); Utemp. = CCideal / CCtemp.;
4. temporally and spatially under-utilized: CC = CCspatial + SSoverall (SSoverall > 0); U = CCideal / CC.
Spatial under-utilization stems from mismatches between the MAC array, interconnects, spatial mapping, and the DNN layer dimensions; temporal under-utilization stems from insufficient memory BW or ports, non-ideal mapping, control overhead, etc. Spatial stall = CCspatial − CCideal; temporal stall = CCtemp. − CCideal (= SSoverall).]

The computation phase usually dominates the overall processing time, of which
the latency is strongly impacted by AHM. Figure 6.1(b) shows the 4 computation
scenarios based on the spatial and temporal mapping rate of the MAC array.
The latency modeling’s challenges mainly come from the ones with temporally
under-utilized MAC array (⃝ 3 ⃝),
4 due to the complexity to model the stall
induced by the non-ideal data movement, i.e., temporal stall (SSoverall ).
In the remainder of this section, we first introduce prerequisite concepts, then describe
in detail the 3-step latency modeling methodology to address the SSoverall modeling
challenge. The key terminologies and the model steps are illustrated in Figure 6.2.
Please refer to it for all the abbreviations used in this section.

[Figure 6.2: (a) Descriptions of the terminologies used by each step in the latency model, and (b) an illustration of the 3-step latency modeling methodology. The terminology of panel (a):
• Unit Mem: a memory that holds only a single operand (W or I or O); a memory physically shared by 2 (or 3) operands is virtually divided into 2 (or 3) Unit Mems by operand.
• DTL: data transfer link, i.e., the separate read and write links at each Unit Mem.
• MemDATA: the data size of the corresponding operand (W/I/O) in that Unit Mem, i.e., the product of all r loop sizes (temporal and spatial) of that operand at the current and lower memory levels.
• MemCC: turnaround cycles (the cycles the data reside in their Unit Mem before being updated), i.e., the product of all temporal loop sizes at the current and lower memory levels, for W/I/O individually.
• RealBW: the actual read/write BW of a memory module, obtained from the hardware design parameters.
• ReqBWu: the required memory BW to prevent computation stall at the Unit Mem level, determined by the memory type and the mapping together (Table 6.1).
• XREQ, XREAL, Z: parameters modeling a Unit Mem's periodic behavior; XREQ/XREAL is the allowed/actual memory updating time within one MemCC, respectively, and Z is the memory update count.
• MUWu: the total allowed memory updating window for Unit Mem read/write without introducing computation temporal stall; each DTL's MUWu is modeled as a finite periodic function supporting union and intersection operations.
• SSu: the stall (+) or slack (−) of a Unit Mem read/write port with respect to computation; a non-zero value indicates a mismatch between ReqBWu and RealBW.
• ReqBWcomb: the ReqBW of a physical memory read/write port, i.e., the sum of all ReqBWu of the share-port DTLs.
• MUWcomb: the total allowed memory updating window of a physical memory read/write port, i.e., the union of all MUWu of the share-port DTLs.
• SScomb: the SS of a physical memory read/write port: the already existing stalls plus the stalls that result from combining slacks, Eq. (6.1).
• SSoverall: the overall temporal stall cycle count, integrating all SScomb based on the concurrency of the different memory modules (max for concurrent operation, sum for sequential operation).
Panel (b) shows an example accelerator with global buffer (GB), local buffer (LB), and register (Reg) levels: Step 1 divides each shared memory into separate Unit Mems and computes ReqBWu, MUWu, and SSu for every DTL; Steps 2 and 3 then combine the DTLs' attributes based on memory port sharing and memory level concurrency.]

6.3.1 Prerequisite concepts and terminology

To simplify the explanation, we adopt the data representation and loop
characterization from Chapter 5: 1) a DNN layer is presented as a 7-dimensional
nested for-loop format, namely, batch (B), output channel (K), input channel (C),
output x-y dimension size (OX/OY), and filter x-y dimension size (FX/FY);
2) three major operands (W/I/O) each have their relevant (r) and irrelevant
(ir) for-loops, where r / ir loops contribute to that operand’s data size / data
reuse respectively. In addition, we introduce the following terms described in
the table of Figure 6.2(a) and refer to them in the modeling: Unit Mem, DTL,
MemDATA, MemCC, and RealBW.

6.3.2 Step 1: Divide the memory system into multiple Unit Memories by operand and compute each DTL's attributes

In the first step, we divide the problem of extracting the total temporal stall
cycles SSoverall of the entire memory system into deriving the stall/slack cycles
(SSu ) induced by a single operand (W/I/O) accessing a Unit Mem (e.g., Mem1-9
in Figure 6.2(b)). To analyze these Unit Mem levels, we decouple read and
write operations on the interface between two Unit Mem levels, each as a DTL
(e.g., from DTL 1 to DTL 18 in Figure 6.2(b)).

Compute ReqBWu

For each DTL, ReqBWu is defined as the minimum memory BW to allow
computation to proceed without stall. Based on the memory type (single- or
double-buffered) and the top temporal loop type allocated to that memory level,
this parameter can be derived as shown in Table 6.1. It is worth noting that
for non-double-buffered memory levels, if an ir loop is scheduled as the upper
for-loop of that level, the required data cannot be overwritten during processing
that top ir loop due to data reuse (an example to visualize this is given later in
Figure 6.4). Thus, to avoid stall, this minimum BW requirement needs to be
scaled up by all top ir loop sizes. Accordingly, ReqBWu can be obtained for
each DTL.

Table 6.1: ReqBW determined by both memory type and mapping.

Memory type              | DB* mem. | Non-DB dual-port mem.
Top temporal loop type   | r or ir  | r    | ir
Physical mem capacity    | A        | A    | A
Mapper-seen capacity     | ½×A      | A    | A
ReqBW                    | BW0**    | BW0  | BW0 × top-ir loop size

* DB = Double-Buffered.  ** BW0 = MemDATA / MemCC.
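This rule can be summarized in a minimal sketch (hypothetical helper, following Table 6.1 with BW0 = MemDATA/MemCC):

def req_bw_u(mem_data, mem_cc, double_buffered, top_loop_is_ir, top_ir_loop_size=1):
    bw0 = mem_data / mem_cc
    if (not double_buffered) and top_loop_is_ir:
        # resident data cannot be overwritten while the top ir loop reuses it,
        # so the update window shrinks and the required BW scales up
        return bw0 * top_ir_loop_size
    return bw0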
Derive the Unit Mem’s operation pattern and memory updating window
(M U Wu )

When no interference occurs, the periodic pattern of memory operation can be


modeled by a periodic function with 4 parameters as shown in Figure 6.2(a):
period (M emcc ), active cycle count in one period (X), active cycle starting point
in one period (S), and total number of periods (Z). Here XREQ is the maximum
allowed memory updating window for no-stall scenario within one M emcc , which
equals to M emDAT A /ReqBW . Accordingly, M U Wu = XREQ × Z. On the
other hand, the actual memory BW (RealBW ) is hardware design specific
(Figure 6.2(b)), hence we use XREAL to represent the actual memory updating
window within one M emcc , which equals to M emDAT A /RealBW .

Extract SSu

For each DTL, SSu measures the relative cycle difference between "memory
updating" and "computation" (or "data consuming"), and is computed by SSu =
(XREAL − XREQ) × Z.

[Figure 6.3: Six different timeline cases of memory updating and computation, showing the memory-induced stall/slack for a single DTL. In all cases, memory update Mn serves computation Cn. Cases (a)(b)(c) assume a double-buffered memory, or a non-double-buffered memory with an r loop scheduled on top, so that Mn+1 and Cn can start together (XREQ = MemCC). Cases (d)(e)(f) assume a non-double-buffered memory with an ir loop scheduled on top, so that Mn+1 can only start after the corresponding data block has been fully (re)used in Cn, inserting a "memory update keep-out zone" (XREQ < MemCC).]

Figure 6.3 shows the SSu visualization with six cases of
memory updating and data consuming timelines. E.g., Figure 6.3(a)(d) have
SSu = 0 since XREAL = XREQ, despite their different memory types; the same
principle can be applied to obtain the negative SSu (slack) of (b)(e) and the
positive SSu (stall) of (c)(f).
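Putting the definitions of XREQ, XREAL, and Z together, the per-DTL stall/slack can be sketched as follows (hypothetical function name):

def ss_u(mem_data, req_bw, real_bw, z):
    x_req = mem_data / req_bw      # allowed updating time per MemCC (no-stall bound)
    x_real = mem_data / real_bw    # actual updating time per MemCC
    return (x_real - x_req) * z    # > 0: stall, < 0: slack, accumulated over Z updates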

6.3.3 Step 2: Combine the attributes of DTLs that share the same physical memory port and serve the same memory module

With the basic attributes ReqBWu, MUWu, and SSu obtained for each DTL in
Step 1, we can derive ReqBWcomb, MUWcomb, and SScomb for DTLs that share
the same physical memory port and for DTLs that serve the same memory.

Derive ReqBWcomb

For n DTLs that share one physical memory port, ReqBWcomb is the sum
of all n DTLs' ReqBWu on that port, with read and write distinguished.

Derive SScomb

First, compute the union of all n MUWu of the share-port DTLs as MUWcomb.
Then calculate SScomb based on the polarity (+/−) of SSu. Assume SSu(i) > 0
when 1 ≤ i < m, and SSu(i) ≤ 0 when m ≤ i ≤ n:

SScomb = Σ_{i=1}^{m−1} SSu(i) + max{ 0, Σ_{i=1}^{n} [ MUWu(i) + f(SSu(i)) ] − MUWcomb }          (6.1)

where SSu(i) is the SSu of the i-th DTL, MUWu(i) is the MUWu of the i-th
DTL, and f(x) = x when x ≤ 0, otherwise f(x) = 0. Eq. (6.1) indicates that the
combined stall is the sum of the stalls already introduced by some DTLs (positive
SSu), plus the stalls that could be introduced by combining the slacks of the other
DTLs, due to hardware resource contention. In more detail, regarding the
latter term: for the share-port DTLs that do not introduce stall individually,
they could still introduce stall when combined. The possible combined stall
is the sum of all share-port DTLs’ actual working cycles within each of their
MUWu, minus the maximal allowed memory updating window (MUWcomb).
This value can be positive (introduce stall) or non-positive (no stall). Only if
this combined result is positive, it is then added to the sum of the positive SSu
to obtain SScomb . This ensures the stall(+) induced by individual DTLs is not
cancelled by other DTLs’ slack(-) during combination.
The next step is to further combine the SS of the DTLs serving the same
memory. This final SScomb is the maximal value either out of their SSu (e.g.,
max(SSu of DTL 12, SSu of DTL 11) in Figure 6.2(b)) or out of the already combined SScomb
(e.g., max(SScomb of DTLs 1&6, SScomb of DTLs 2&7) in Figure 6.2(b)).

A detailed example

To visualize Steps 1 and 2, an example is given in Figure 6.4. It puts 5 design
factors together: 1) hardware architecture, 2) mapping, 3) loop iteration, 4)
data access pattern, and 5) memory-compute timeline, and illustrates how these
factors interact with each other and impact latency. The goal in this example
is to derive the SScomb of the local buffer’s read port, shared by W/I/O-Reg
(non-double-buffered).
Figure 6.4(a) to (b) shows Step 1 "Divide" that divides the shared memory port
by operand and analyzes each operand’s DTL without interference; Figure 6.4(b)
to (c) shows Step 2 "Combine" that combines unit DTLs to deduce the SScomb
of the shared memory port, considering interference. Based on the mapping
and RealBW (assuming 1 data element per cycle between LB and Reg in this example),
the periodic data movement of each DTL is visualized in (b), from which the
attributes for each DTL are derived in (c). Finally, SScomb is calculated based
on Eq. (6.1).

6.3.4 Step 3: Integrate SScomb across all memory levels to derive the total temporal stall SSoverall

SSoverall accounts for the parallel memory operation as well as multiple stall
sources across all memory levels. For the memory operations that can be
overlapped, SSoverall takes the maximum of SScomb , i.e., the shorter stall of one
memory can be hidden under the longer stall of the other; otherwise, SSoverall
is the sum of all stalls, indicating one memory stall blocks the operation of other
memories, regardless of whether the data in other memories are ready. Users
can customize this memory parallel operation constraint based on the design.
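A minimal sketch of this integration step (hypothetical function name; the concurrency flag mirrors the user-customizable constraint mentioned above):

def ss_overall(ss_comb_list, concurrent=True):
    # concurrent: stalls overlap, the longest one dominates; otherwise they add up
    return max(ss_comb_list) if concurrent else sum(ss_comb_list)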
[Figure 6.4: An example demonstration of (a)-(b) Step 1 (Divide) and (b)-(c) Step 2 (Combine) for deriving the intermediate modeling parameters. (a) shows an example hardware (a W/I/O-shared local buffer feeding W-Reg, I-Reg, and O-Reg) together with a partial mapping; the goal is to analyze the overall stall/slack at the local buffer's shared read port. (b) divides this shared port into per-operand DTLs and visualizes, per DTL, the for-loops, the data movement, and the computation timeline (assuming 1 data element per cycle between LB and Reg). (c) combines the per-DTL attributes (XREQ, XREAL, MemCC, Z, MUWu, SSu) according to Eq. (6.1), yielding a total computation stall of 8 cycles for every 24 computation cycles in this example.]

If the calculated SSoverall ≤ 0, we take zero as its final value since there is no temporal stall;
otherwise, SSoverall > 0 indicates that temporal stalls exist during computation.

6.3.5 System’s overall latency and MAC array utilization

Based on the 3 steps and the data loading analysis, the system's overall latency (CC)
can be derived by summing up the ideal computation cycles (CCideal), the data
loading cycles, the spatial stall, and the temporal stall (SSoverall) (refer to Figure 6.1).
The overall MAC array utilization (U) can be deduced as CCideal/CC.
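A minimal sketch of how these terms combine (hypothetical function name, following Figure 6.1):

def overall_latency(cc_ideal, spatial_stall, ss_overall, cc_preload, cc_offload):
    cc = cc_ideal + spatial_stall + max(0, ss_overall) + cc_preload + cc_offload
    utilization = cc_ideal / cc      # overall MAC array utilization U
    return cc, utilization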
This uniform latency modeling enables DSE for various memory systems and
mappings. It can also provide insights into identifying performance bottlenecks
and optimization opportunities, as demonstrated in Section 6.5.

6.4 Validation

We validate the proposed latency model using an in-house DNN accelerator [151]
implemented in TSMC 7nm technology [176], designed for INT8-based inference
tasks. For convolution layers, Im2Col operation (unrolling convolution into
matrix-matrix-multiplication) is performed by a RISC-V core before processing
on the accelerator. As shown in Figure 6.5(a), this accelerator employs a systolic
array-based design with 1K MAC units in a 16×32 PE array (2 MACs per PE)
and one 24b Output register per PE. Each MAC connects to one 8b Weight
and one 8b Input register. A total of 32KB and 64KB local buffer (LB) with
256b and 512b bus connections to PE array are used for temporal storage of
Weight and Input, respectively. 1 MB global buffer (GB) tiled with 16 64KB
SRAM macros is used. The mapping schemes are shown in Figure 6.5(b).
We feed the latency model with the above hardware configuration. Figure 6.5(c)
shows the comparison of the modeled results with the hardware simulation,
running NN layers (with different parameter sizes) of a hand-tracking
workload [169]. For all evaluated NN layers, an average of 94.3% latency
estimation accuracy is achieved.

[Figure 6.5: (a) Block diagram of the in-house DNN accelerator. (b) Temporal mapping and spatial unrolling schemes after Im2Col (C innermost for O-Reg data reuse, then B for W-LB data reuse, then K for I-GB data reuse):
for k2 in [0, K/Ku): for b2 in [0, B/Bu): for c2 in [0, C/Cu):
    unroll k1 in [0, Ku(32)), b1 in [0, Bu(16)), c1 in [0, Cu(2)):
        O[Bu·b2+b1][Ku·k2+k1] += W[Ku·k2+k1][Cu·c2+c1] × I[Bu·b2+b1][Cu·c2+c1]
(c) Model validation against the hardware RTL simulation running NN layers of different sizes. The layers used for validation (B/K/C): Layer 1: 16/32/1024; Layer 2: 4/160/378; Layer 3: 48/160/480; Layer 4: 900/16/27; Layer 5: 36/160/960; Layer 6: 1024/42/189; Layer 7: 576/192/288; Layer 8: 48/160/3480; Layer 9: 576/192/1728; Layer 10: 144/384/3456.]

6.5 Case Studies

In this section, we present three case studies to demonstrate how the enhanced
latency model can be used to optimize the AHM design space. We integrate our
model with ZigZag [109], a DNN accelerator architecture-and-mapping DSE
For Cases 1 and 2, the hardware architecture is fixed to a scaled-down version of the in-house accelerator with an 8×16 PE array (2 MACs per PE, i.e., 16×16 MACs), a 16KB Weight local buffer (W-LB), an 8KB Input local buffer (I-LB), a 1MB global buffer (GB) with 128 bit/cycle read/write BW, and a loop spatial unrolling of K 16 | B 8 | C 2. For Case 3, we perform architecture DSE by varying these hardware parameters. The Im2Col layer transformation is applied in all case studies.
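For reference, the fixed architecture used in Cases 1 and 2 could be captured in a small configuration dictionary like the sketch below; the field names are hypothetical and only illustrate how such a design point is described to a DSE tool, not ZigZag's actual input schema.

    case12_architecture = {
        "pe_array": (8, 16),                              # 8x16 PEs, 2 MACs per PE -> 16x16 MACs
        "spatial_unrolling": {"K": 16, "B": 8, "C": 2},   # K 16 | B 8 | C 2
        "W_LB_bytes": 16 * 1024,                          # 16KB Weight local buffer
        "I_LB_bytes": 8 * 1024,                           # 8KB Input local buffer
        "GB_bytes": 1024 * 1024,                          # 1MB global buffer
        "GB_rw_bw_bits_per_cycle": 128,                   # 128 bit/cycle read/write bandwidth
    }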

6.5.1 Case study 1: Mapping vs. latency

Different mappings lead to distinct latencies for the same DNN layer processed on the same hardware. Figure 6.6 compares two different temporal mapping schemes, Mapping A and Mapping B, out of the 30240 valid mappings obtained

by the ZigZag mapper. As shown in Figure 6.6(c)(d), both mappings result in an identical ideal latency (CCideal) of 38400 clock cycles (cc), while Mapping A has 5% energy savings over Mapping B. Hence, without considering the temporal stall (SSoverall), Mapping A would be preferred. However, our latency model indicates that Mapping B actually has a 30% lower latency and 26% better MAC utilization (both spatial and temporal utilization included) than Mapping A, owing to its lower SSoverall.

The main difference between Mappings A and B is whether the C loop is split, highlighted by the blue boxes in Figure 6.6(a)(b). This leads to different data reuse trade-offs between I and O at the I-LB and GB levels: Mapping B adopts a fully output-stationary dataflow at the O-Reg level (i.e., only final outputs are written to GB) by scheduling all of O's data reuse loops (C loops) at the O-Reg level. In contrast, Mapping A has all of I's data reuse loops (K loops) at the I-LB level to reduce Input's data movement from GB to I-LB, at the cost of pushing part of O's data reuse loops (C loops) to the GB level (i.e., besides the final Outputs, Partial Sums also need to be transferred between O-Reg and GB), as shown in Figure 6.6(e). Note that W's data reuse distribution across memory levels is the same in both mappings.

Figure 6.6: Case study 1: Mapping difference analysis and its impact on latency. Layer size: B=64, K=160, C=960; precision: I/W 8 bits, O 24 bits. Panels: (a) Mapping A and (b) Mapping B loop orderings, (c) latency and MAC array utilization, (d) energy [pJ], (e) memory access counts for I (Inputs) and O (Outputs & Partial Sums), and (f) memory BW: physical (RealBW) vs. mapping-required (ReqBW) [bit/cc].
These differences in I and O data transfer cause different temporal stalls due to the insufficient GB BW compared to the required BW, as shown in Figure 6.6(f). Although Mappings A and B both exceed the hardware's RealBW for GB writes (e.g., 3072 vs. 128 bit/cycle), the stall is smaller for Mapping B since its GB writes are less frequent (Figure 6.6(e)). In addition, the Partial Sum transfer in Mapping A requires a much higher GB read BW that cannot be met by the hardware's RealBW. Thus, our memory-BW-aware latency model reveals the high SSoverall in Mapping A that deteriorates MAC array utilization and system performance.

This analysis clearly shows that a good latency model is instrumental in helping a DNN accelerator DSE mapper minimize SSoverall by 1) matching ReqBW (mapping-dependent) with RealBW (hardware-dependent), or 2) if RealBW is too low to be matched, reducing frequent accesses to the low-BW link (e.g., reducing the partial sum transfer in this case study).
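The bandwidth-matching argument can be illustrated with a first-order sketch (not the exact stall model of this chapter): the stall on a memory link grows both with the ReqBW/RealBW ratio and with how often that link is used.

    def link_stall_cycles(req_bw, real_bw, busy_cycles):
        """First-order extra cycles on one memory link when ReqBW exceeds RealBW
        during `busy_cycles` of traffic (illustrative, not the Chapter 6 model)."""
        if req_bw <= real_bw:
            return 0
        return int(busy_cycles * (req_bw / real_bw - 1))

    # Same BW gap (3072 vs. 128 bit/cycle on the GB port), but far less frequent traffic in the second case:
    print(link_stall_cycles(req_bw=3072, real_bw=128, busy_cycles=1000))  # Mapping-A-like: frequent GB access
    print(link_stall_cycles(req_bw=3072, real_bw=128, busy_cycles=100))   # Mapping-B-like: infrequent GB access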

6.5.2 Case study 2: Workload size vs. latency

DNN layer parameters largely impact execution time. In this case study, we use the same hardware parameters as in Case 1 and analyze the latency impact of the layer attributes (e.g., the total number of MAC operations and the operand sizes). Figure 6.7(a) shows the operands' percentages (W/I/O) and the total MAC operation count when varying the DNN layer dimensions B/K/C from 8 to 512. Figure 6.7(b) shows the corresponding modeled Real latency and the latency breakdown in terms of data pre-loading, ideal compute cycles, spatial stall, and temporal stall (SSoverall), as defined in Figure 6.1.

Comparing Figure 6.7(a) and (b), the Ideal latency tracks the total MAC operation count, whereas the Real latency follows the total data size. The former is intuitive since it assumes 100% MAC array utilization with zero stall, while the latter reveals the data movement bottleneck. Since the existing hardware has limited BW for the GB (Figure 6.6(f)), a fully output-stationary dataflow at the O-Reg level was always selected to minimize stalls by reducing GB accesses. When a layer is Output-dominant (large B and K, few C), its total data size increases compared to other layers with the same total MAC operations due to the 24-bit O precision (vs. 8-bit I and W), while at the same time the lower input channel count C results in less output stationarity. This increases the pressure on the GB write BW, causing the Real latency to deviate much more from the Ideal

latency. For larger layer sizes (large C), the Ideal computation cycles (green bars) dominate, and the deviation between the Ideal and Real latency shrinks. Note that without including temporal stalls (i.e., the cyan dotted line), large discrepancies in latency estimation occur (e.g., 7.4× for layer (128,128,8) and 9.2× for layer (512,512,8)), as shown in Figure 6.7(b), especially for layers with fewer input channels C.
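The output-dominance effect can be checked with a small worked example, assuming the GEMM-style layer dimensions (B, K, C) and the precisions used in this case study (8-bit I/W, 24-bit O).

    def layer_stats(B, K, C):
        total_macs = B * K * C
        data_bytes = {"W": K * C, "I": B * C, "O": 3 * B * K}  # 8-bit W/I, 24-bit O
        return total_macs, data_bytes, sum(data_bytes.values())

    # Two layers with the same MAC count: the Output-dominant one carries far more data.
    print(layer_stats(128, 128, 8))    # output-dominant: large B and K, few C
    print(layer_stats(8, 128, 128))    # C-heavy: much smaller total data size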

Figure 6.7: Case study 2: Workload's impact on latency and latency breakdown. (a) Operand percentages and total MAC operation count; (b) modeled latency and latency breakdown, with the 7.4× and 9.2× discrepancies of the temporal-stall-unaware estimate marked.

6.5.3 Case study 3: Hardware architecture vs. latency

Previous case studies have shown the importance of the proposed latency model for evaluating the mapping and algorithm impact on a fixed hardware architecture. In this case study, we further take advantage of the model's generality and memory-BW-awareness to assess the impact of different hardware architecture parameters (e.g., MAC array size, memory capacity, and memory BW), and showcase how the design space changes in the presence of temporal stalls (SSoverall). We choose the following MAC array sizes and scale the spatial mapping accordingly: 16×16 (spatial mapping K 16 | B 8 | C 2), 32×32 (K 32 | B 16 | C 2), and 64×64 (K 64 | B 32 | C 2). We construct a memory pool containing tens of register/memory candidates with different capacities to replace the W-/I-/O-Reg and W-/I-LB in the design space search. The GB size is 1MB for all cases, while the GB BW varies from 128 to 1024 bit/cycle. The area of the GB is not included in the comparison.
Figure 6.8: Case study 3: Hardware architecture's impact on the latency-area trade-off, for (a) a memory-BW-unaware latency model, (b) the proposed model at low GB BW, and (c) the proposed model at high GB BW, with the optimal design points and the preferred corner marked.
Figure 6.8 illustrates the latency-area design space for 4,176 hardware designs.
Different MAC array sizes are shown in different colors, while dots that share
the same color vary in memory hierarchy. For each design point, mapping
optimization for lowest latency is performed.
Figure 6.8(a) first shows the results using a memory-BW-unaware latency model. Since the SSoverall induced by limited memory BW is ignored, all the architectures with the same array size achieve a similar latency. Hence, the minimum-area design (i.e., with less memory) could be considered optimal (closest to the preferred corner), since a larger memory capacity does not offer latency benefits but adds area cost.
However, the conclusion changes once the memory BW impact is included. Figure 6.8(b) and (c) show the design spaces obtained with our proposed model for a GB BW of 128 bit/cycle (low BW) and 1024 bit/cycle (high BW), respectively, with the optimal design points highlighted. For both high and low GB BW, different memory size combinations at the Register and Local Buffer levels impact the area-latency trade-off for a fixed MAC array size (i.e., the same theoretical peak performance). For example, the 16×16, 32×32 and 64×64 arrays each achieve their lowest latency with a moderate memory area cost at 128 bit/cycle GB BW. Only when the GB BW is high do the design points of the same array size cluster around a similar latency, indicating less latency impact from improving local memory storage and data reuse. This reveals the impact of SSoverall in BW-limited systems, where the memory hierarchy needs to be optimized to maximize data reuse below the bottleneck memory level for stall reduction.

Another observation from the memory-BW-aware latency modeling is that the MAC array size preference can change for different memory BWs (Figure 6.8(b)(c)). At low GB BW, the optimal latency of the 32×32 array can outperform that of the 64×64 array. Only at high GB BW can the 64×64 array further improve the latency Pareto front.
In summary, BW-awareness is important when optimizing hardware design parameters for latency. Recent technology advancements, such as 3D IC technology with fine-pitch SRAM-on-logic stacking, can offer energy-efficient, high-BW interconnects (e.g., >1024 bit/cycle) compared to conventional 2D designs. The proposed BW-aware latency model can aid in evaluating the impact of such new technologies on the design space.

6.6 Conclusion

This chapter thoroughly explained the unified analytical intra-layer latency model for DNN accelerators, supporting diverse architectures and mappings. Following a 3-step approach, our model overcomes prior challenges by systematically estimating the system's temporal stalls, capturing the periodic operation of hardware components, and identifying performance bottlenecks. The model is verified against an in-house DNN accelerator [151] on various DNN layers with different mappings, achieving 94.3% latency estimation accuracy on average. Three case studies from a mapping, workload, and hardware perspective reveal the advantages of using the proposed model for DSE, as well as the importance of temporal stall modeling in exploring co-design opportunities for latency optimization. This intra-layer latency model lays a solid foundation for the later work on modeling and optimizing latency in cross-layer and multi-core DNN mapping scenarios, and is built into ZigZag.
With Chapters 5 and 6, we have fully understood the working principles of ZigZag for single-layer mapping modeling and optimization. It is time to broaden the mapping space with cross-layer mapping and scheduling options to seek more optimization possibilities.
In the next chapter, we will dive into DeFiNES, a successor DSE framework built on top of ZigZag, which systematically supports the modeling and exploration of the cross-layer depth-first mapping space.
Chapter 7

DeFiNES: Exploring the Depth-first Scheduling Space for DNN Accelerators

This chapter explains DeFiNES, a successor DSE framework of ZigZag, which enables fast exploration of the depth-first (DF) scheduling space for DNN accelerators through analytical modeling.
DNN workloads can be scheduled onto DNN accelerators in many different
ways: from layer-by-layer scheduling to cross-layer DF scheduling (a.k.a. layer
fusion, or cascaded execution). This results in a very broad scheduling space,
with each schedule leading to varying hardware costs in terms of energy and
latency. To rapidly explore this vast space for a wide variety of hardware
architectures, analytical cost models are crucial to estimate scheduling effects
on the hardware level. However, SotA cost models are lacking support for
exploring the complete DF scheduling space, for instance focusing only on
activations while ignoring weights, or modeling only DRAM accesses while
overlooking on-chip data movements. These limitations prevent researchers
from systematically and accurately understanding the DF scheduling space.
This chapter is based on [108] and contains large fractions of it. The author's contributions include (but are not limited to) the DF design space identification, the DF cost modeling methodology, part of the implementation, the case studies, the SotA comparison, and paper writing.
DeFiNES is open source at https://github.com/KULeuven-MICAS/defines.


After formalizing this design space, this chapter proposes a unified modeling
framework, DeFiNES, for layer-by-layer and DF scheduling to fill the gaps.
DeFiNES enables analytically estimating the hardware cost for possible schedules
in terms of both energy and latency, while considering data access at every
memory level. This is done for each schedule and hardware architecture under
study by optimally choosing the active part of the memory hierarchy per unique
combination of operand, layer, and feature map tile. The hardware costs are
estimated, taking into account both data computation and data copy phases.
The analytical cost model is validated against measured data from a taped-out
DF DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-
end neural network level. A comparison with generalized SotA demonstrates
up to 10× better solutions found with DeFiNES.

7.1 Motivation and Chapter Organization

Analytical models have already been developed to predict the performance of a single layer of a DNN running on an accelerator [91, 177, 109]. However, these ignore cross-layer scheduling possibilities, which can lead to very sub-optimal DNN-level solutions because passing data between layers can have a big impact on the overall system performance [172, 19]. A high-level example is shown in Figure 7.1. In subfigure (a), representing single-layer scheduling, intermediate feature maps are always written to and read from the highest memory level.
feature maps are always written to and read from the highest memory level.
We call this ‘Single-Layer’ (SL) in this chapter. However, if the feature maps
are small enough, it is possible to keep them in lower, more efficient, memory
levels, as shown in subfigure (b). This is dubbed ‘Layer-By-Layer’ (LBL) in
this chapter. Furthermore, if the feature maps are too big for this optimization,
one can explore ‘Depth-First-like’ (Depth-First [53]; a.k.a. layer fusion [4], or
cascaded execution [162]) scheduling, which means only parts of the intermediate
feature maps instead of the whole feature maps are computed at a time and
passed between layers. This decreases the size of data to be passed between
layers in a single transaction, which in turn enables the use of an even smaller
and more efficient memory level to pass this data (subfigure (c)).
Some accelerators [4, 95, 111, 113, 74, 54] already used some forms of such DF
scheduling. However, without a method to quickly explore different DNN
accelerators and DF scheduling options, it is hard to say how well these
solutions approximate optimality and to quickly estimate the performance
of an accelerator in development.
Figure 7.1: Going from (a) Single-layer-at-a-time scheduling to (b) Layer-by-layer scheduling and to (c) Depth-first scheduling to keep activations in lower memory levels. "L": neural network Layer; "T": Tile; "LB": Local Buffer (small on-chip memory); "GB": Global Buffer (larger on-chip memory). In (a), each layer is seen as a detached workload: all layers fetch Inputs and Weights from DRAM and write Outputs to DRAM. In (b), the transition between layers is considered: except for the last layer, a layer's outputs can stay on-chip to become the next layer's inputs if they fit (skipping DRAM), while weights still come from DRAM. In (c), each layer is split into small tiles that are processed across layers depth-first (for Tile in T1 to Tn: for Layer in L1 to L3: compute Layer-Tile); a layer-tile's outputs can stay on-chip at an even lower level to become the next layer-tile's input if they fit (skipping DRAM and GB), and weights of the first tile (T1) come from DRAM.

Analytical cost models with support for DF scheduling are thus required to quickly explore and develop optimal systems. Although such models already exist [86, 180, 189, 172, 19], they are all limited in one or more of the following aspects:

• Model only partial hardware cost, like only latency or only DRAM-access,
and ignore other relevant costs;
• Do not consider an on-chip multi-level memory hierarchy, only distinguish
between on-chip and off-chip memory;

• Do not study the full DF space (defined in Section 7.2);


• Only consider memory accesses for feature maps (a.k.a. activations) while ignoring the impact of weights.

This chapter proposes a unified modeling and cost estimation framework, DeFiNES, for LBL as well as various forms of DF scheduling, in order to systematically understand the enlarged scheduling space and achieve greatly improved energy and latency. This chapter is organized as follows:

• Section 7.2 identifies the full design space of DF scheduling, which also
includes SL and LBL by regarding them as two extreme points in the DF
design space.
• Section 7.3 presents a Unified Analytical Cost Model that has none of
the aforementioned limitations.
• Section 7.4 validates the proposed cost model against a taped-out DF-
style accelerator.
• Section 7.5 conducts three case studies based on the model, studying
the trade-offs between different DF schedules, and the impact of workload
and hardware architecture on the best DF strategy.
• Section 7.6 compares DeFiNES against SotA frameworks, showing up to 10× better results by including the cost of on-chip memory accesses and accesses caused by weights in the exploration.
• Section 7.7 concludes the chapter.

7.2 Depth-first Design Space Identification

This section describes the DF design space with three axes, using the well-
understood LBL inference as a starting point.
Consider processing multiple layers of a network, as in Figure 7.2(a)&(b). One
can calculate the final output feature map in one go, for which the complete
input of the last layer is required. This in turn requires the complete output
of the second to last layer, and so on. Ultimately, this leads to LBL inference,
which completely executes each of the layers one at a time starting from the
first layer.
Alternatively, one can target to compute only a part of the final output feature map. In this case, only parts of the input feature maps are needed, as shown in Figure 7.2(c).
Figure 7.2: DF design space's first axis: tile size, illustrated on a three-layer workload for (b) LBL (1 tile per layer), (c) a 2×2 tile size, and (d) a 1×1 tile size. For the layer dimension notation in (a): K is the output channel dimension; C is the input channel dimension; OX and OY are the feature map spatial dimensions; FX and FY are the weight spatial dimensions.

Figure 7.3: DF design space's second axis: overlap storing mode, with (a) fully-recompute, (b) H-cached V-recompute, and (c) fully-cached. The workload is Layers 2 and 3 in Figure 7.2(a); the legend is shared with Figure 7.2(a).

Inference starts at the first layer, yet only the tile of its output feature map that contributes to the target tile in the final output feature map is calculated. That tile is then propagated through the other layers to compute the target tile in the final feature map.
Figure 7.4: Impact of the tile size (first axis) and fuse depth (third axis), comparing (a) SL (1 layer per stack) with fusing shallower or deeper and tiling coarser or finer. ST: fused-layer STack. Trade-off: a finer tile size means less per-stack activation data (+) but less local-memory weight reuse (-); fusing deeper means more per-stack weight data (-) but less between-stack activation traffic (+).

This illustrates the first axis in the design space: the choice of tile size, by
which we mean the size of the last layer’s portion that we want to compute
atomically. The general trade-off of tile size selection is given in Figure 7.4
(subfigures (b)↔(c) or (d)↔(e)). Choosing a larger/coarser tile size enhances
local weight reuse but requires more features to be passed between layers at
once, which may require a higher level memory.
Note that in this chapter, 1) we assume the computation order over tiles is
left-to-right, then top-to-bottom and 2) cross-layer tiling is only done across
the spatial dimensions (horizontal and vertical dimensions) of the feature maps.
It is not done across the channel dimensions because in most convolution layers
all input channels are required to calculate any output feature, which makes
cross-layer tiling across the channel dimensions impossible. However, intra-tile
temporal mappings can still have loop tiling over all the dimensions within that
tile, including the channel dimensions.
Because neighboring tiles of the output feature map can require overlapping parts of earlier feature maps, one can choose either to recompute those overlapped features or to cache them in some memory in order to reuse them across tiles, as shown in Figure 7.3. This choice can be made separately for both spatial dimensions and is considered the second axis. It has four modes: fully-recompute (Figure 7.3(a)), horizontally-cached with vertical recompute (Figure 7.3(b)), vertically-cached with horizontal recompute, and fully-cached (Figure 7.3(c)). In this chapter, we do not further consider vertically-cached with horizontal recompute: since transposing both the feature maps and, correspondingly, the weights results in the same, yet transposed, outputs, vertically-cached with horizontal recompute and horizontally-cached with vertical recompute are fundamentally the same. Choosing caching over recompute requires extra memory space to store the cached data. However, it decreases the recomputation overhead and the tile size in earlier layers, as Figure 7.3 shows.
So far, this section discussed the scheduling options within one stack of fused
layers. The final and third axis is the choice of which layers are fused into
a stack. Fusing more layers generally requires more low level weight memory
capacity but saves accesses to higher level memories for activations. This can be
seen in Figure 7.4 by comparing subfigures (b) vs. (d), or (c) vs. (e). Because
increasing the memory capacity of the lower-level memories decreases their efficiency, the lower-level memory can become fruitless if one fuses too many layers.
Note that LBL inference and SL can be positioned in this design space. On the first axis, the tile size can be set equal to the DNN's final output feature map (Figure 7.2(a)) to get a schedule that is effectively LBL. There is then only one stack, and it executes each layer completely before moving on to the next. One can also

choose to have only one layer in every stack (the third axis) (Figure 7.4(a)), which leads to an SL schedule, as we assume features are passed between stacks through the highest memory level. The second axis has no impact in these cases (LBL and SL) as there is only one tile and thus no overlap between tiles.
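As a summary of this section, the sketch below enumerates the three-axis DF scheduling space; the concrete option lists are illustrative, and SL and LBL appear as the two extreme points discussed above.

    from itertools import product

    tile_sizes = [(4, 4), (16, 18), (60, 72), (240, 270), (960, 540)]            # axis 1: tile size (Tx, Ty)
    overlap_modes = ["fully-recompute", "H-cached V-recompute", "fully-cached"]  # axis 2: overlap storing mode
    fuse_depths = [1, 2, 4, 8]                                                   # axis 3: layers per stack (illustrative)

    design_points = list(product(tile_sizes, overlap_modes, fuse_depths))
    print(len(design_points), "candidate DF schedules")
    # A fuse depth of 1 corresponds to SL; a tile size equal to the final output
    # feature map (here 960x540) corresponds to LBL, where axis 2 has no effect.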

7.3 Unified Analytical Cost Model

This section describes the Unified Analytical Cost Model presented in this
chapter, capable of predicting inference costs (energy and latency) of DNNs on
a given hardware architecture, with support for the full design space of Section
7.2. An overview of the model is depicted in Figure 7.5.
The basic idea is to use an existing mapping search engine and a cost model that optimize and predict costs for a single layer (step 5 below). However, because of their single-layer limitation, these tools assume every single layer's input and output feature maps need to come from and go to the highest-level input and output memories, respectively. DeFiNES then provides the Unified Analytical Cost Model as a layer on top of this to provide DF compatibility, which it achieves with the following steps:
Inputs: The inputs consist of the workload, the hardware architecture and the
DF parameters. The workload is a neural network which may have convolution
layers, branches, pooling layers, strides, depthwise layers, etc. The hardware
architecture consists of an array of Processing Elements (PEs) and a memory
hierarchy. The latter can have memories that are shared between operands
(inputs, outputs and weights), different number of levels for different operands,
and memories that are unrolled over one or more dimensions of the PE array.
The final input consists of the DF parameters, which identify a point in the
design space of Section 7.2, dubbed the ‘DF strategy’. The fuse depth, i.e. the
numbers of layers to fuse together for each stack (third axis), can be given
manually or determined automatically. In the latter case, layers are added to
the fused stack as long as the total number of weights in the stack fit in the
highest on-chip memory level that holds weights. In the presence of branching,
either all layers between two points where there are no branches are added to
a stack, or none of them. If such a set of layers by itself does not fit in the
highest on-chip memory level of weights, none of the layers in this set are fused.
In other words, each of them is in a 1-layer stack.
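A simplified sketch of this automatic fuse-depth selection is given below: stacks are grown greedily as long as the summed weight size still fits in the highest on-chip weight memory (the branch handling described above is omitted).

    def form_stacks(layer_weight_bytes, weight_mem_capacity):
        stacks, current, current_size = [], [], 0
        for idx, w in enumerate(layer_weight_bytes):
            if current and current_size + w > weight_mem_capacity:
                stacks.append(current)          # close the current stack
                current, current_size = [], 0
            current.append(idx)                 # a too-large layer still forms its own 1-layer stack
            current_size += w
        if current:
            stacks.append(current)
        return stacks

    # Example: per-layer weight sizes (bytes) against a 32KB on-chip weight buffer
    print(form_stacks([4000, 6000, 20000, 30000, 8000], 32 * 1024))  # -> [[0, 1, 2], [3], [4]]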
With the stacks of fused layers from the workload, hardware, and DF parameters defined, steps 1)-6) are done per stack.
Figure 7.5: DeFiNES' overview (*: optional input, can be set automatically). Inputs: the workload (with branch support), the hardware architecture (any-dimensional PE array of any size, optional inner-PE register file, different inter-PE interconnect/data sharing patterns, and a multi-level memory hierarchy with different operand sharing schemes), and the DF parameters (tile size, overlap storing mode, fuse depth*). The depth-first cost model repeats, for all stacks: 1) tile the stack's output, 2) backcalculate tile sizes and the size of data to be cached, 3) determine and set top-level memories, 4) collect inputs at the determined memory level, 5) get the optimal cost for the adjusted layers using the single-layer mapper (LOMA), the single-layer cost model (ZigZag), and the data copy action cost model, and 6) accumulate all results. Outputs: energy and latency for each stack.
Figure 7.6: Tile type count for different tile sizes and overlap storing modes. The workload used in this example is FSRCNN [44], whose final output feature map's spatial dimensions are 960×540. The shown cases lead to 9, 6, and 3 tile types; in the 3-tile-type case with a tile size of (60,72) (960 = 60×16, 540 = 72×7+36), tile type 1 occurs once, tile type 2 occurs 15 times, and tile type 3 repeats 112 times. The 3-tile-type example is further used in Figure 7.9 and Figure 7.10.

1) Tile the stack's output (for each stack): Given a stack, the output feature map is partitioned into tiles of the size given by the DF parameters. As shown in Figure 7.6, the tile size does not have to be a divider of the total feature map size. Because of this, and because tiles in the first row/column do not have cached data available yet – and, similarly, tiles in the last column/row do not have to store overlap for their neighbors – not all tiles are identical. Therefore, DeFiNES identifies which tiles are completely identical and which are different, a process that leads to different 'tile types'. For each tile type, steps 2-6 need to be executed only once, as the results can simply be replicated for identical copies of the tile, leading to a significant decrease in DeFiNES' runtime.¹ The number of different tile types also reflects the code and control complexity of implementing the solution, as each tile type can have a different set of parameters and temporal mapping, which all need to be programmed into the accelerator.
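A minimal sketch of this tiling step is shown below: the output feature map is cut into (Tx, Ty) tiles, and tiles with identical size and neighbor situation are grouped. DeFiNES' actual tile-type grouping is more detailed, so the grouping here is only an approximation.

    from collections import Counter

    def tile_types(out_w, out_h, tx, ty):
        types = Counter()
        for y0 in range(0, out_h, ty):
            for x0 in range(0, out_w, tx):
                w, h = min(tx, out_w - x0), min(ty, out_h - y0)   # edge tiles can be smaller
                types[(w, h, x0 > 0, y0 > 0)] += 1                # neighbor flags: can cached data exist?
        return types

    # FSRCNN's 960x540 output with (Tx, Ty) = (60, 72): 16 x 8 = 128 tiles in total
    print(sum(tile_types(960, 540, 60, 72).values()))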
¹ To give a rough idea of how fast DeFiNES (written in Python) ran when we submitted the paper: the fully-recompute / H-cached V-recompute / fully-cached cases of Figure 7.6 with a tile size of (60,72) took 23 / 34 / 84 seconds, respectively, on one thread of an Intel Xeon Processor E3-1270 v5.

2) Backcalculate tile size and calculate the size of data to be cached
(for each tile in each layer): From the tile size of the last output feature map in the stack, the required tile size of the last layer's input is calculated. Next, the 'to-compute' tile size of the previous layer is calculated. Without caching for reuse, this simply equals the required tile size of the last layer's input. However, with caching for reuse across tiles, not all these features need to be calculated, as some can be fetched from the cached data, as can be seen in Figure 7.3(b)&(c). This process is repeated for all layers in the stack, and as such the input tile size and to-compute output tile size for each layer in the stack are determined. During this backcalculation, the algorithm also keeps track of how much data (from earlier or for future overlapping tiles) of each type in Figure 7.7 should be cached. In case of branching, this is handled as in Figure 7.8.
In the shown example, the left and right branch need cached features from
different places in the feature map. In such a case, the overall region of features
to be cached is set by combining all outermost edges of the to-cache regions,
so that all branches always have the cached features they need to operate in
overlap caching mode, as can be seen in the middle, combined visualization of
FM1 in Figure 7.8.
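For a single spatial dimension, the backcalculation boils down to the sketch below, assuming a standard convolution without padding at the tile border; DeFiNES handles strides, padding, and branches more generally.

    def required_input_size(out_tile, kernel, stride):
        # input columns needed to produce `out_tile` output columns
        return (out_tile - 1) * stride + kernel

    def to_compute_size(required_input, cached):
        # columns of the previous layer's output that still have to be computed
        # when `cached` columns are already available from an earlier tile
        return max(required_input - cached, 0)

    # Example: a 3x3 layer with stride 1 and a 60-column output tile needs 62 input
    # columns; with H-caching, kernel - stride = 2 of them are already cached.
    req = required_input_size(60, kernel=3, stride=1)
    print(req, to_compute_size(req, cached=2))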
3) Determine and set top level memories (for each tile in each layer):
Given the data sizes calculated in step 2, step 3 determines the highest memory
level each type of data (layer inputs, layer outputs, and cached data for H-
cached and/or V-cached modes) should be stored in. In these decisions, data is
prioritized as in Figure 7.5(3), with higher-priority data assigned to the lower,
more efficient memory levels.
Note that the top memory level assigned to different data types can differ
between tiles and layers. Figure 7.9 gives an example of this for a stack in
fully-recompute mode, based on the data sizes from Figure 7.10.
To be able to use the single-layer mapper and cost model, which assume inputs and outputs come from and go to the top-level memory, we remove the assignment of operands to memory levels above the determined top level from the hardware architecture's definition. We then give this mapper and cost model the adjusted hardware architecture as input, to prevent them from fetching data from, or storing data to, unnecessarily high memory levels.
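The assignment logic of this step can be sketched as a simple greedy walk over the memory hierarchy, filling the lowest (most efficient) level first in priority order; the capacities, data sizes, and priority list below are illustrative only.

    def assign_top_levels(data_sizes_by_priority, level_capacities):
        remaining = dict(level_capacities)          # levels ordered low (efficient) to high
        assignment = {}
        for name, size in data_sizes_by_priority:   # higher-priority data first
            for level in level_capacities:
                if remaining[level] >= size:
                    assignment[name] = level
                    remaining[level] -= size
                    break
        return assignment

    print(assign_top_levels(
        [("I", 40_000), ("O", 50_000), ("cached_H", 30_000), ("W", 15_600)],
        {"LB": 64 * 1024, "GB": 1024 * 1024, "DRAM": float("inf")},
    ))
    # -> I and W fit in LB; O and the cached data are pushed to GB (cf. Figure 7.10)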
4) Collect inputs at determined memory level (for each tile in each layer):
A single layer-tile combination can have input feature data that is located in
different memory levels, for instance, the newly created output of a previous
layer can be in a lower memory level than cached data from a previous tile.
Therefore, before calling the single-layer mapper and cost model, we model the
action of collecting these data into the single memory level that was decided to
serve as the top-level memory for inputs in step 3.
Figure 7.7: The required data storage for different overlap storing modes. ST: fused-layer STack. Fully-recompute only stores the computing data of the two largest consecutive tiles; H-cached V-recompute additionally stores all cached data for H reuse (per ST); fully-cached additionally stores all cached data for V reuse (per ST).

Figure 7.8: DeFiNES’ handling of branching. Legend is shared with


Figure 7.2(a). The grey pixels do not contribute to the right branch. ‘FM’:
feature map.

Each such data collecting action is defined as a data copy action, which is modeled by its to-move data type and amount, the source memory level, and the destination memory level. The cost of a data copy action is calculated by the data copy action cost model. This model takes in 1) a list of data copy actions (these actions can theoretically happen in parallel) and 2) the hardware architecture (with all memory port types, port connections, word lengths, and per-word access costs defined) to analyze the energy and latency this bundle of data copy actions costs, taking into account possible memory port conflicts between the concurrent actions.
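A minimal sketch of such a data copy action cost model is given below: latency is limited by the narrower of the two memory ports, and energy is the sum of the read and write costs. Port conflicts between parallel actions are ignored here, and all numbers are illustrative.

    def data_copy_cost(bits_to_move, src, dst):
        # src/dst: dicts with 'bw' in bit/cycle and 'energy_per_bit' in pJ/bit
        latency_cc = bits_to_move / min(src["bw"], dst["bw"])
        energy_pj = bits_to_move * (src["energy_per_bit"] + dst["energy_per_bit"])  # read + write
        return latency_cc, energy_pj

    GB = {"bw": 256, "energy_per_bit": 1.0}
    LB = {"bw": 512, "energy_per_bit": 0.2}
    # Copy 64KB of cached activations from GB down to LB:
    print(data_copy_cost(64 * 1024 * 8, src=GB, dst=LB))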
5) Call single-layer mapper and cost model (for each tile in each layer):
At this point, the single layer temporal mapping search engine and cost model
are used to get the cost for a single layer-tile-combination. For this work,

we used LOMA [153] as the mapping search engine and ZigZag [109, 110] to extract the cost. Note that other single-layer mappers and cost models (such as [126, 91, 121, 59, 87]) can also be plugged in to serve this purpose.
6) Accumulate results (for each stack, across all tiles and layers): Finally,
the results of all cost models evaluated in steps 4 and 5 are summed together
to get the final energy and latency cost for the stack.
Figure 7.9: A visualization of the determined top memory level of each unique layer-tile-combination for operands W, I, and O. The DF schedule is taken from the 3-tile-type example in Figure 7.6. The hardware architecture is Idx 2 in Table 7.1. It is worth noting that: 1) for weights, all layers of the first tile take weights from DRAM, while the other layer-tile-combinations take weights from LB; 2) for input and output, each tile's first layer gets its input from DRAM, each tile's last layer writes its output back to DRAM, and in between either GB or LB serves as the top memory level.

Figure 7.10: A visualization of the activation data size in tile types 2 and 3 of the example in Figure 7.9. The capacities of LB and GB are marked on the y-axis. Figure 7.9 and Figure 7.10 together show that 1) when the total activation size (I+O) fits into LB (e.g., tile type 2, L6), the LB is the top memory for both I and O; 2) when the total activation size (I+O) does not fit into LB while either I or O alone does (e.g., tile type 3, L6), I is prioritized to use LB as its top memory level while O is pushed to GB.

7.4 Validation

To extract costs for a single layer-tile-combination, DeFiNES makes use of the ZigZag framework, which is already well validated against several measured hardware designs [30, 77, 151], as well as against another SotA cost model [177], for single-layer execution. To also ensure good cost modeling of DF/layer-fused execution of complete networks, DeFiNES is validated at the end-to-end network level. We validate DeFiNES' full-network performance predictions by comparing them against hardware measurements of DepFiN [54], one of the few existing DF neural network processors. For this comparison, we describe DepFiN's core and memory hierarchy in DeFiNES' terminology and fix the full temporal mapping to match DepFiN's. We use three neural networks for the validation: 1) FSRCNN [44], 2) MC-CNN fast [171], and 3) a simple custom reference network that consists of 10 layers with K=32 and Fx=Fy=3 followed by a final layer with K=16 and Fx=Fy=1, operating on 1280×720×3 inputs.
First, Figure 7.11(a) gives the validation results for latency, which show that DeFiNES' predictions match within 3% for the second and third networks. For the first network, FSRCNN, the error is slightly higher at 10%. This is because of stalls caused by DepFiN's controlling microprocessor, which cannot fully keep up with the frequent layer switching due to the very small kernels found in FSRCNN. This control flow limitation is not modeled in DeFiNES.
Second, energy is more challenging to match end-to-end, as it is very sensitive to several fine-grain design and layout aspects, such as: 1) sparsity, which is used by DepFiN to gate off logic activity to save power; 2) Place-and-Route effects, which cause data transfers to be more expensive than just the memory read/write costs and which also include a sparsity-dependent effect; and 3) Process, Voltage, and Temperature (PVT) variations. Although these aspects hinder accurately predicting absolute energy consumption, we argue that for the purpose of scheduling optimization it is relative modeling accuracy that matters most, in order to be able to choose the best option.
Figure 7.11: Comparison of DeFiNES' results against DepFiN chip measurements.

Figure 7.11(b) shows the relative energy per inference of the three networks, normalized to the reference network's inference energy to cancel out the impact of the PVT aspects, while aspects 1 and 2 are lumped into DeFiNES' unit MAC cost and energy per memory access. As can be seen in Figure 7.11(b), the model matches within 6% of the measurements, building confidence to use DeFiNES for further scheduling optimizations.

7.5 Case Studies

Empowered by DeFiNES, three case studies are conducted in order to answer three key DF scheduling questions: CS.1) Given a hardware architecture and a DNN workload, how do different DF strategies impact the overall energy and
DNN workload, how do different DF strategies impact the overall energy and
latency? CS.2) Given a hardware architecture and multiple DNN workloads
(some are activation-dominant while some are weight-dominant), how does the
scheduling choice change across different workloads? CS.3) Given multiple
hardware architectures (some are designed for LBL processing while some are
manually tuned to be more DF-friendly) and the DNN workloads from CS.2,
how do different architectures behave on their optimal scheduling strategies?

7.5.1 An overview of experiment settings

Table 7.1 summarizes the key attributes of different hardware architectures and
DNN workloads used in the case studies.
For hardware, five DNN accelerators are selected as the architecture baselines for
the case studies: Meta-prototype [151], TPU [81], Edge TPU [139], Ascend [97],
and Tesla NPU [157]. To make a fair and relevant comparison, we normalized
all of them to have 1024 MACs and maximally 2MB global buffer (GB) but kept
their spatial unrolling and local buffer settings (Table 7.1(a) Idx 1/3/5/7/9).
Besides, since all these architectures were originally designed for SL/LBL processing, and it thus may or may not be very beneficial to apply DF schedules on them, we manually constructed a DF-friendly version of each architecture, denoted with 'DF' at the end of the name (Table 7.1(a) Idx 2/4/6/8/10). The guidelines followed to construct a DF-friendly version from an SL/LBL architecture are: 1) the spatial unrolling is unchanged; 2) the total on-chip memory capacity is unchanged; 3) Input and Output activations are preferably shared in a lower-level memory; and 4) Weights should have an on-chip global buffer.
are preferably shared in a lower level memory; and 4) Weights should have an
Table 7.1: The 10 hardware architectures and 5 DNN workloads used in the case studies. (a) The 10 HW architectures comprise 5 baseline designs (Meta-proto-like, TPU-like, Edge-TPU-like, Ascend-like, and Tesla-NPU-like), each normalized to 1024 MACs and at most a 2MB global buffer, together with their DF-friendly variants (suffix 'DF'), listed with their spatial unrolling, per-MAC registers, local buffers, and global buffer configurations. (b) The 5 DNN workloads are FSRCNN, DMCNN-VD, MCCNN, MobileNetV1, and ResNet18, listed with their average/maximum feature map sizes and total weight sizes (e.g., FSRCNN: 15.6 KB of weights; ResNet18: 11 MB of weights).
These guidelines are heuristic-based, and we leave the DF hardware architecture optimization problem to future work.
For hardware modeling, CACTI7 [12] is used to extract all SRAM costs (pJ/word access). Other hardware costs, such as the unit MAC, register, and DRAM access costs, are scaled accordingly based on the SRAM cost, following the scaling factors reported in [184]. All on-chip memory banking and bandwidths (bit/cycle) are selected such that the PE array can get enough data to work at full speed for an ideal workload, while the DRAM bandwidth is fixed to 64 bit/cycle to mimic the on-/off-chip communication bottleneck.
For workload, five DNN workloads are used in the case studies: FSRCNN [44],
DMCNN-VD [154], MCCNN [171], MobileNetV1 [65] and ResNet18 [58].
Table 7.1(b) shows that FSRCNN, DMCNN-VD, and MCCNN are activation-
dominant (all the layers have large feature maps), whereas MobileNetV1 and
ResNet18 are weight-dominant (feature maps are smaller and gradually decrease
across layers).
Note that the hardware architectures picked for these case studies are not DF-specific. This enables exploring whether or not these non-DF-specific hardware architectures (and their variants with adjusted memory sizes/sharing) can benefit from DF scheduling on both activation- and weight-dominant DNNs. Another thing to point out is that in DeFiNES, users can define their own optimization target (energy, latency, EDP, any memory access, a combination of them, etc.). For the case studies, we prioritized energy.

7.5.2 Case study 1: Impact of depth-first strategy

This case study discusses how much DF strategies impact results when mapping a DNN onto an accelerator, exemplified by FSRCNN and Meta-proto-like DF (Idx 2 in Table 7.1(a)) as the targeted workload and hardware architecture.
Of the DF scheduling space's three axes (tile size, overlap storing mode, and fuse depth), this case study focuses on exploring the first two. The third axis, fuse depth, is fixed to the whole DNN since the total weight size of FSRCNN is small (15.6KB, as shown in Table 7.1(b) Idx 1) and thus all weights fit in the Meta-proto-like DF architecture's on-chip weight local buffer (32KB, as shown in Table 7.1(a) Idx 2). So, there is no benefit in not fusing the whole DNN into one stack, according to the trade-off introduced in Figure 7.4.
For the first two axes, we swept 110 tile sizes (different spatial tile size (Tx,Ty) combinations) for each of the three overlap storing modes. A subset of the results is shown in Figure 7.12, in which the total energy and latency of the three overlap storing modes with different tile sizes are visualized.
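The sweep behind Figure 7.12 amounts to the loop sketched below; evaluate_df_schedule is a hypothetical stand-in for a call into DeFiNES, and the optimization target here is energy, as used in these case studies.

    def sweep(tile_sizes, modes, evaluate_df_schedule):
        results = {}
        for mode in modes:
            for tx, ty in tile_sizes:
                energy, latency = evaluate_df_schedule(tile=(tx, ty), mode=mode)
                results[(mode, tx, ty)] = (energy, latency)
        # return the schedule with minimum energy
        return min(results.items(), key=lambda kv: kv[1][0])

    best = sweep(
        tile_sizes=[(16, 18), (60, 72), (240, 270)],
        modes=["fully-recompute", "H-cached V-recompute", "fully-cached"],
        evaluate_df_schedule=lambda tile, mode: (tile[0] * tile[1] * 1e-3, 1.0),  # dummy cost function
    )
    print(best)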
Figure 7.12: The total energy and latency for the Meta-proto-like DF architecture processing FSRCNN with different DF strategies: heatmaps of energy (mJ) and latency (million cycles) over the X-dim tile size (Tx) and Y-dim tile size (Ty) for the fully-recompute, H-cached V-recompute, and fully-cached modes.

Figure 7.13: MAC operation count for different DF strategies (fully-recompute, H-cached V-recompute, and fully-cached) across tile sizes (Tx,Ty) from (1,1) to (960,540).

Note that the feature map size of the last layer of FSRCNN is 960×540; thus all the bottom-right blocks in each heatmap (with Tx=960 and Ty=540) correspond to LBL processing. Their energy and latency numbers (19.1 and 29, respectively) are the same because different overlap storing modes do not make a difference for LBL, as discussed in Section 7.2.
The rest of this subsection first summarizes the main messages delivered by Figure 7.12, and then uncovers the underlying causes using the memory access breakdown of the different data types in Figure 7.14.
Four major observations can be extracted from Figure 7.12: 1) Considering
different tile sizes under the same overlap storing mode, both too small and
too large tile sizes are sub-optimal. The best point is always somewhere in the
middle. 2) Considering the same tile size across different overlap storing modes,
the order of energy consumption is for most cases: fully-cached < H-cached
V-recompute < fully-recompute. 3) Different tile sizes and modes heavily impact
energy and latency (up to 26× difference for energy and 57× for latency). 4)
Fully-recompute prefers larger tile sizes than fully-cached.
To understand the reasons behind these observations, Figure 7.13 and Figure 7.14 take all the diagonal scheduling points from Figure 7.12 and respectively plot their MAC operation count and their memory access count (in number of data elements) for each memory level in the hierarchy (LB, GB, and DRAM), broken down into the contributions of the layers' activations, weights, and data copy actions. Figure 7.15 further shows the total energy and latency of these diagonal scheduling points.
For layers’ activation, Figure 7.14(a) presents two clear trends. Firstly, DRAM
and GB access do not depend much on the used mode. When the tile size is small,
like (1,1), (4,4) or (16,18), there is little GB and LB memory access because
all the activations per tile can fit into LB. When the tile size is increased to a
certain point, like (60,72), the GB access suddenly increases due to activations
no longer fitting in LB and thus GB being the top activation memory level.
Further increasing the tile size until it reaches LBL (960, 540), the DRAM

(a) Layer’s Activation (I, O) (b) Layer’s Weight

(c) Data copy action (d) Total mem. access: (a)+(b)+(c)

Figure 7.14: Memory access of different data types at different memory levels
for meta-proto-like DF architecture processing FSRCNN with different DF
strategies.

access catches up as a consequence of the intermediate activations no longer
being able to fit on-chip. Secondly, LB access is very sensitive to the used
mode for small tile sizes, with the order of access always: fully-recompute >
H-cached V-recompute > fully-cached. This is because more MAC operations
are performed when doing re-computation, especially in small tile sizes, as
shown in Figure 7.13, which requires more LB access.
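To make this relationship concrete, the sketch below estimates, for a given tile size, which memory level ends up being the top activation level. The channel depth and the LB/GB capacities are illustrative assumptions, not the exact parameters of the meta-proto-like DF architecture.

def top_activation_level(tx, ty, channels=32, bytes_per_elem=1,
                         lb_bytes=64 * 1024, gb_bytes=2 * 1024 * 1024):
    """Return the highest memory level touched by the activations of one tile."""
    tile_bytes = tx * ty * channels * bytes_per_elem
    if tile_bytes <= lb_bytes:
        return "LB"    # small tiles: activations stay in the local buffer
    if tile_bytes <= gb_bytes:
        return "GB"    # medium tiles: the global buffer becomes the top level
    return "DRAM"      # very large tiles / LBL: off-chip activation traffic appears

for tile in [(1, 1), (4, 4), (16, 18), (60, 72), (240, 270), (960, 540)]:
    print(tile, top_activation_level(*tile))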
For layers’ weight, Figure 7.14(b) shows that all the tile sizes have the same
DRAM and GB access, which is reasonable because all the weights of FSRCNN
can fit into weight LB. However, for the fully-cached mode, LB weight access
is much higher for (1,1) than all other tile sizes. This is because the spatial
unrolling of the hardware architecture includes OX 4 | OY 4 (Table 7.1(a) Idx
2), and thus tile size (1,1) causes a severe under-utilization of the PE array.
This in turn reduces the spatial data reuse of the weight’s LB. In other words,
when the tile size (Tx,Ty) ≥ (4,4), the spatial unrolling OX 4 | OY 4 can be
fulfilled and thus one data read out from weight LB can serve 16 MACs. In
contrast, it can only serve 1 MAC unit per access when the tile size is (1,1).
For the other modes, it is high mainly for the same reason as for activations:
there is a relatively large recompute overhead.
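The spatial-reuse argument can be summarized in a few lines: with an OX 4 | OY 4 spatial unrolling, one weight read from the LB can serve up to 16 MACs, but only if the tile offers at least four output pixels in each dimension. This is a hedged illustration, not part of DeFiNES itself.

def weight_lb_reuse(tx, ty, unroll_ox=4, unroll_oy=4):
    """MACs served per weight read from the LB, limited by the tile size."""
    return min(tx, unroll_ox) * min(ty, unroll_oy)

print(weight_lb_reuse(1, 1))   # 1: each LB read feeds only a single MAC
print(weight_lb_reuse(4, 4))   # 16: the OX 4 | OY 4 unrolling is fully exploited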
Figure 7.14(c) uncovers memory access contributed by data copy actions. As
discussed in Section 7.3, data copy actions happen when the required input
data of current tile are not all in its lowest-fitting memory level, which could
be because the previous layer’s output and/or the cached data for reuse have
a different lowest-fitting memory level. With this in mind, Figure 7.14(c) is
explainable. Firstly, for small tile sizes ((1,1)-(16,18)): 1) fully-recompute mode
has large memory access at all memory levels due to the large overlap re-fetching
across different tiles of the first DNN layer; 2) fully-cached mode has large
memory access at GB and LB memory levels due to the cached data being
located in GB while the input’s lowest-fitting memory level is LB. Secondly, in
middle tile size region ((60,72)-(240,270)), different modes’ behaviors converge
and the data copy actions mainly come from moving previous layer outputs
down to the top memory level of the next layer's input. Lastly, in the large tile
size region, no data copy action is needed as all input, output, and cached data
are located in DRAM.
Figure 7.14(d) shows the total memory access, and Figure 7.15 visualizes the
overall energy and latency, which together with the memory access breakdown
discussed earlier help us to better understand the heatmaps in Figure 7.12:
for fully-recompute mode, the small tile sizes’ sub-optimality comes from data
re-fetching and MAC re-computation of the large overlap region; for fully-cached
mode, the small tile sizes’ sub-optimality comes from the large weight access
and cached data movement; for all the modes, large tile sizes’ sub-optimality is
due to large DRAM access of activation.
[Plot: total energy (mJ) and latency (million cycles) of the selected design points versus tile size (Tx,Ty) from (1,1) to (960,540), for the fully-recompute, H-cached V-recompute, and fully-cached modes.]
Figure 7.15: The total energy and latency for design points in Figure 7.14.

This case study shows that different DF strategies differ widely in energy and
latency, and that DeFiNES can analyze and reason about them thanks to its
unified analytical model.

7.5.3 Case study 2: Applying depth-first to multiple workloads

This case study examines how different workloads prefer different DF strategies.
To this end, we map all five workloads of Table 7.1(b) on the meta-proto-like
hardware and compare five different inference strategies:

• Single layer: layers are completely evaluated one at a time, feature maps
are always stored to and fetched from DRAM in between layers;
• Layer-by-layer: layers are completely evaluated one at a time,
intermediate feature maps are passed on to the next layer in the lowest
memory level they fit in;
• Fully-cached DF with 4×72 tiles, which is the best found in case
study 1;
• The best strategy found when a single strategy is used for all fused
layer stacks;
• The best combination, where different stacks can use different DF
strategies.

Figure 7.16 visualizes the results, which show some noteworthy findings. Firstly,
for the workloads with spatially large feature maps (FSRCNN, DMCNN-VD

and MCCNN), their individual best solutions (purple) are not significantly
better than the best solution found in case study 1 (green). The latter is thus a
very good solution across a range of workloads similar to the one it was found
for, with a gain of 10× compared to SL.
Secondly, this solution does not perform as well on MobileNetV1 and ResNet18,
which operate on spatially smaller feature maps with more channels. On
MobileNetV1 for instance, it is 2.0× worse than the best found result. In these
workloads, the deeper layers are more weight-dominant, which impedes fusing
them into one stack. Hence, the combined best solution applies DF to the first,
activation-dominant layers and LBL to the last, weight-dominant layers. This
combination achieves a gain of 5.7× over SL on MobileNetV1.

Figure 7.16: Case study 2: Different workloads lead to different best solutions
(all results on meta-proto-like DF hardware).

7.5.4 Case study 3: A joint DSE of accelerator architecture and scheduling
for multiple workloads

This case study examines the effect of the accelerator's architecture on the
optimal inference strategy. In particular, it compares the default accelerator
architectures, which were designed with LBL inference in mind, against manually
adjusted DF-friendly variants, by looking at the geometric average of performance
across all five workloads of Table 7.1 for both LBL and the best single DF strategy
(optimized for energy).
The results in Figure 7.17 show that DF outperforms LBL on all accelerator
architectures except for TPU-like, including the unadjusted default accelerators,
on which the maximum gain was 4.1×. TPU-like has poor support for DF
schedules due to the absence of on-chip weight buffers. With such a buffer
added in the DF-friendly variant, DF significantly outperforms LBL, indicating
the importance of designing with DF compatibility in mind. This finding is
further backed by the overall comparison between the DF-friendly and default
variants, which shows that the DF-friendly variants are at least as good as the
defaults when using DF, with large gains of 6.0× and 4.3× for TPU-like and
Edge-TPU-like hardware resp., and maximally 1.2% worse when using LBL.
Overall the biggest difference (in geometric mean over the five workloads)
between LBL inference on default hardware variants and DF on DF-friendly
variants is found for the Edge-TPU-like hardware and equals 4.9×.

Figure 7.17: Case study 3: Different hardware architectures' energy and latency
(geometric mean across the 5 workloads) when applying layer-by-layer or best DF
scheduling strategies.

7.6 Related Works

Previous works on DNN DF processing can be split into two categories:
DF-supporting hardware implementations, and DF modeling and exploration
frameworks.
For DF-supporting hardware implementations, designs [4, 95, 111, 113, 74, 54,
98, 68] have demonstrated their DF solutions and shown large benefits in terms
of energy and/or latency compared to the traditional single-layer/layer-by-layer
DNN accelerators for targeted workloads. These DF-supporting designs each
have a default DF processing pattern, with manually selected tile size, overlap
storing mode, and fuse depth. For example, regarding the overlap storing mode,
[68] fully recomputed the intermediate overlapping data while [95] assumed
a horizontally-cached with vertical recompute mode and [54] applied a fully-
cached mode. Regarding the depth of the stack of fused layers, [98] chose to
fuse just 2 layers at a time while [95, 54] preferred deeper fused stacks (8-20
layers). On the tile size, [111, 113] always treated one row of an image as a tile,
whereas [4, 74] adopted a square tile size and [54] has a preferred tile size of 128
pixels along an image row. All these hardware implementations are optimized
for one or a few types of predefined DF strategies, and it is unclear if there are
remaining combinations of hardware architectures and DF strategies that would
perform better for a targeted workload. Researching this would preferably be
done at a high abstraction level to save the time of fully simulating and/or
developing the hardware architectures.
Therefore, several DF modeling and exploration frameworks, such as DNNVM
[180], Efficient-S [189], LBDF [145], ConvFusion [172], Optimus [19], and
DNNFuser [86], have been proposed. These frameworks, listed in Table 7.2, help
to model and optimize the DF schedule given hardware architectures and DNN
workloads. In the optimizing part, many innovative searching algorithms are
introduced, such as the heuristic subgraph isomorphism algorithm in DNNVM,
the DAG-based hardware-aware operator fusion algorithm in Optimus, and a
transformer-based mapper in DNNFuser. However, they all have some important
factors missing in the modeling part.
The rest of this section discusses each missing factor (as listed in Table 7.2) and
some of their impacts (as shown in Figure 7.18).
Firstly, from the DF scheduling space point of view, most of these frameworks do
not explicitly support exploring the trade-offs between different overlap storing
modes. As shown in Figure 7.6 and Section 7.5.2 (Case Study 1), this can have
a big impact on tile type count (related to code and control complexity) and
system energy/latency.

Table 7.2: Related DF modeling framework comparison

DF Modeling          Overlap storing mode¹   Model on-chip   Support multi-level   Model weight   Optimizing
Framework             ①     ②     ③          data traffic²   mem. skipping³        traffic⁴       target⁵
DNNVM [180]           ✘     ✔     ✘              ✔                ✘                   ✔           La
Efficient-S [189]     ✔     ✘     ✘              ✔                ✘                   ✘           La
LBDF [145]            ✔     ✘     ✔              ✘                ✘                   ✘           DRAM
ConvFusion [172]      ✔     ✘     ✔              ✘                ✘                   ✔           DRAM
Optimus [19]          ✔     ✘     ✔              ✘                ✘                   ✔           DRAM
DNNFuser [86]         ✔     ✘     ✘              ✔                ✘                   ✔           DRAM, Mem
DeFiNES (ours)        ✔     ✔     ✔              ✔                ✔                   ✔           En, La
Visualize each        Figure 7.6 &               Figure           Figure              Figure       Figure
factor's impact       Case Study 1               7.18 (a)         7.18 (b)            7.18 (c)     7.18 (d)

1 Overlap storing modes (✔ support / ✘ no support):
  ① Fully-recompute; ② H-cached, V-recompute; ③ Fully-cached
2 ✔ Model on-chip data traffic ⇔ ✘ Only model/optimize DRAM access
3 ✔ Support multi-level mem. skip. ⇔ ✘ Only support DRAM skipping
4 ✔ Model weight traffic ⇔ ✘ Only model/optimize activation traffic
5 Optimizing targets: DRAM: DRAM access; Mem: on-chip mem. usage;
  La: Overall latency; En: Overall energy

Secondly, from the hardware modeling point of view, they only focus on
modeling/optimizing the DRAM access while ignoring the data movement
within the potential multi-level on-chip memory hierarchy. In other words,
they are agnostic of on-chip memory hierarchy. This could cause substantial
losses, as proven by Figure 7.18(a), which shows the experiment results of
mapping FSRCNN onto two hardware platforms in three ways: 1) Single-Layer
(SL), 2) DF but only optimize for DRAM traffic, 3) DF and optimized for the
overall energy (our work). The DRAM energy contribution is highlighted by
the diagonal hatching, which shows that DRAM energy dominates in the SL
case. Using DF and only optimizing for DRAM traffic, the DRAM energy can
indeed be largely reduced, but omitting the on-chip energy (non-hatched part
in the red bar) from the optimization can make the latter dominant. Only
when considering the whole system, the best DF solutions (orange bars) can
be achieved. The parameters of the found solutions (Figure 7.18 right) show
that when optimizing for the overall energy (orange), the framework found a
smaller tile size compared to optimizing for DRAM only (red). This can be
explained: 1) When optimizing for DRAM only, the tool will randomly pick
one DF schedule that makes sure all the intermediate data fit on chip, and
thus DRAM access is minimized. However, after achieving the minimal DRAM
access, there is still a lot of room for on-chip data traffic optimization, which is
overlooked in this case. 2) When optimizing for the overall energy, it benefits
from smaller tile sizes since at a certain point, not only can all the data of

[Figure 7.18 compares, for different scenarios (single layer (SL); only DRAM access;
only activations; only DRAM skipping; latency-optimized; and ours, considering all
listed factors and optimizing energy), the energy breakdown (DRAM, MAC, on-chip
memories; activation vs. weight) and latency, together with the best found
fully-cached DF solution (stack, tile size) on the meta-proto-like DF and
Edge-TPU-like DF architectures. Panels: (a) On-chip data traffic's impact (with
FSRCNN); (b) On-chip memory skipping's impact (with FSRCNN); (c) Weight's
impact (with ResNet18); (d) Optimizing target's impact (with ResNet18).]
Figure 7.18: Experiments to evaluate different factors in Table 7.2.

intermediate tiles fit in on-chip GB, but also fit in the LB. In this case, the
activation can be fully reused in LB, and GB access is minimized (on top of
the already minimized DRAM access), resulting in a 5.64× energy gain for
FSRCNN on the meta-proto-like DF hardware.
Thirdly, on top of modeling on-chip data traffic, we further evaluated the
benefit of performing multi-level memory skipping over DRAM-only skipping,
i.e. skipping (multiple) upper (on-chip) memory level(s) when writing back
the outputs of intermediate tiles if they fully fit in lower-level memories.
Around 17%-18% energy gain is observed for the tested workload-hardware
combination, as shown in Figure 7.18(b). Due to this step targeting optimizing
on-chip memory energy, the gain is not very significant if the MAC energy and
the (already minimized) DRAM energy are dominant, which is the case here.

This technique can bring larger gains for systems with more dominant on-chip
data traffic.
Fourthly, most DF hardware implementations and exploration frameworks
show the energy, latency, and/or DRAM access gains that come from activation
tiling, but say little about the potentially higher weight energy costs
due to the loss of local weight data reuse. This can be harmful for the overall
system efficiency, as shown by the example of Figure 7.18(c). The energy portion
caused by memory access for activations, highlighted with square hatching,
contributes most of the energy in the SL case. However, just blindly optimizing
for activations while ignoring the weights ends up in the green bars. While
these indeed have minimal energy caused by activations, the energy caused by
weights’ memory accesses dominates and causes a large penalty (non-hatched
part in the green bars). This is because the tool found very small tile sizes
as its best solution when only optimizing for activation. This lets activations
skip higher level memories as much as possible, but at the same time largely
reduces the low-level memory’s weight data reuse, thus triggering more access
to higher level weight memories. So, only when considering both the benefits
and the drawbacks that tiling can bring can the best DF solution (orange bars)
be achieved. For the given example, taking weights into account achieves a
solution that has 2.34× and 10.2× less energy than the solution found by
only considering activations for the meta-proto-like DF and Edge-TPU-like DF
hardware architectures, respectively.
Lastly, different frameworks have different optimizing targets, as shown in the
last column of Table 7.2: some of the frameworks only evaluate latency while
ignoring energy, whereas some only care about optimizing the DRAM access.
As DRAM-only optimization’s downsides have been explained, here we focus on
discussing latency- and energy-optimized solution comparison. Figure 7.18(d)
shows the results: pink/orange bars are the energy (and the corresponding dots
are latency) of our latency-/energy-optimized DF schedules respectively. In
this example, a clear latency-energy trade-off is presented and the best found
DF solution shows that the energy-optimized DF schedule prefers a smaller tile
size than the latency-optimized one. This is because smaller tile sizes on the one
hand help reduce energy by enabling the skipping of more memory levels, while on
the other hand they increase the data preparation cycle (loading and offloading)
overhead.
To summarize, our work models the complete DF design space with support for
detailed activation and weight, on- and off-chip memory hierarchy analysis so as
to better capture the trade-offs between different DF strategies and optimizing
targets. These properties enable DeFiNES to make the overall best choices
without neglecting factors that may turn out to be important otherwise. This
makes DeFiNES a good addition to the previously mentioned optimization-
oriented frameworks [180, 189, 172, 19, 86]. Together with those, we can better
design and schedule DNN accelerators.

7.7 Conclusion

This chapter first presented a definition of the DF design space, and then a cost
model capable of handling this whole design space. Furthermore, the cost model
considers not only DRAM access or only memory access due to activations, but
also the full on-chip memory hierarchy and memory access caused by weight
traffic. Large gains might be missed when not doing so (up to 10.2× in the
shown examples; Figure 7.18(c)).
Using this model, the case studies showed that DF strategies can significantly
outperform layer-by-layer execution, even when the workload is not activation-
dominant (MobileNetV1 and ResNet18), and even when the hardware is not
designed for it: DF strategies outperformed layer-by-layer on four of the
five tested hardware architectures with gains of up to 4.1×. However, some
architectures may be ill-suited for DF, in which case small adjustments to their
design can lead to large improvements. For instance, reassigning some of the
on-chip memory capacity of the TPU-like architecture enabled it to greatly
benefit from DF strategies, outperforming its default variant by 6×. These
examples show how DeFiNES allows us to quickly examine the complex design
space of different combinations of DF strategies and hardware architectures.
Although DeFiNES enables a fast evaluation of combinations of DNNs, DF
strategies, and hardware architectures, its capabilities could still be expanded
in future work. Given the huge DF scheduling space (different tile sizes, fused
depth, data storage modes, etc.)2 and the complex trade-offs in between (e.g.,
energy vs. latency), it is barely possible to intuitively pinpoint the optimal
schedule, or to exhaustively try out all schedules to locate the optimal one, even
with DeFiNES’ fast cost estimation. Thus, a clever search engine capable of
efficiently exploring the DF scheduling space and the hardware architecture
design space would be a good future addition to this work.
So far, we have discussed two DSE frameworks, ZigZag and DeFiNES, both
targeting single-core accelerator systems with purely analytical modeling
approaches. In the next chapter, we will move to multi-core accelerator modeling
and DSE with the third framework, Stream.

2 Note that, for the purpose of unifying the scheduling space, we also consider the layer-by-layer
schedule to be part of the DF scheduling space: its tile size equals the complete feature map size
and its fused stack equals the complete neural network, as explained in Section 7.2.
Chapter 8

Stream: Modeling
Fine-grained Layer Fusion on
Multi-core DNN Accelerators

This chapter explains Stream, a DSE framework that supports exploring multi-
core DNN accelerator with fine-grained layer fusion scheduling.
To keep up with the ever-growing performance demand of DNN processing while
accommodating the increasing model diversity, specialized hardware accelerators
are shifting towards multi-core architectures. Stream is the first open-source
DSE framework for co-optimization of hardware architecture and fine-grained
scheduling of such multi-core DNN accelerators. Stream supports fine-grained
layer fusion, to optimally trade-off energy, latency, and/or on-chip memory
footprint for constrained edge devices.
Validation against three SotA chips, together with a case study on seven
hardware architectures with different scheduling granularity, demonstrates
the reliability and capabilities of Stream. Results show that high-level
architectural decisions greatly impact hardware efficiency under the fine-grained
scheduling paradigm, reducing the energy-delay product by 2.4× for single-core
architectures and by up to 30× for heterogeneous multi-core architectures,
compared to traditional scheduling at layer granularity.
This chapter is based on [152] and contains large fractions of it. The author's contributions
include (but are not limited to) the computation node and graph representation, part of the
implementation, the SotA comparison, and paper writing.
Stream is open source at https://github.com/KULeuven-MICAS/stream.

8.1 Motivation and Chapter Organization

Specialized hardware architectures have co-evolved with the ever-growing
DNNs to efficiently accelerate their inference. Traditionally, these accelerators
are comprised of a single spatially-unrolled array of PEs, which embeds a
specific dataflow [47, 30, 116, 81]. Recent architectures are shifting from
a single specialized core toward multi-core designs with enhanced dataflow
flexibility [50, 141, 128, 80, 146, 92, 164, 49]. This increase in computing
parallelism is so far exploited by mapping different DNN layers onto different
cores and pipelining multiple batched inputs for increased throughput [128, 141]
(Figure 8.1(c)(3)). However, such layer-by-layer processing incurs coarse
data dependencies, resulting in frequent and expensive off-chip accesses, and
worsening energy efficiency and latency.

[Figure 8.1 panels: (a) An example workload (Layers 0-4); (b) Tiled for layer-fused
scheduling; (c) Schedule the workload to hardware accelerators: layer-by-layer vs.
layer-fused scheduling on single-core and multi-core accelerators, with the
corresponding SotA frameworks noted per case (Timeloop, ZigZag, Kwon et al.,
TVM-Cascade, DeFiNES, Stream).]

Figure 8.1: A conceptual example showing different ways of scheduling a deep
neural network workload onto different hardware accelerators.

To overcome the previous drawbacks, several works have investigated more fine-
grained scheduling strategies of deeply fused DNN layers [5, 162, 53]. From these
works, it is clear that such "layer-fused" (a.k.a. "depth-first") scheduling can
bring significant advantages for latency- and resource-constraint inference at the
edge: reducing the memory footprint of intermediate results, alleviating costly
off-chip accesses, and providing more parallelization opportunities. However, the
SotA layer-fused schedulers work solely for their specialized hardware platform
and dataflows [54, 164, 103, 34]. This makes the general assessment of high-level
architectural and scheduling decisions difficult. Furthermore, the fine-grained
scheduling of layer-fused DNNs onto multi-core hardware architectures has been
largely unexplored (Figure 8.1(c)(4)).
This chapter, therefore, provides Stream, a general exploration framework of
heterogeneous multi-core hardware architectures with fine-grained scheduling of
layer-fused DNNs. This chapter is organized as follows:

• Section 8.2 briefly recaps the background of multi-core architectures and
workload deployment, and discusses the related works.
• Section 8.3 thoroughly explains Stream, from its overview to detailed
implementation, highlighting the unified modeling representation, a rapid
fine-grained data dependency generator, a genetic algorithm-based layer-
core allocator, and a heuristics-based scheduler.
• Section 8.4 validates Stream with three SotA implementations of
hardware accelerators employing layer fusion, demonstrating the modeling
accuracy across a wide range of scheduling granularities and hardware
architectures.
• Section 8.5 performs exploration of fine-grained scheduling of modern
layer-fused DNNs onto a broad range of architectures, demonstrating up
to 30× EDP reduction compared to traditional layer-by-layer scheduling.
• Section 8.6 concludes the chapter.

8.2 Background & Related Works

In this section, we briefly review the background on DNN acceleration with
specialized dataflow architectures and the deployment techniques of DNNs onto
these architectures, together with discussions of related works.

8.2.1 Dataflow hardware architectures

Figure 8.2(b) shows a traditional accelerator core architecture, constructed of
an array of spatially unrolled PEs. The array allows multiple computations to
happen in parallel. In the specific architecture example in the figure, the input
channels (C) are reused across the rows, and multiple output channels (K) are
computed by accumulating across each column.
To increase the available compute parallelism while maintaining good
utilization, the trend is to design multi-core or multi-chiplet-based architectures
(Figure 8.2(a)). Planaria [50] devises a systolic array that can be partitioned to
support an omni-directional dataflow. Illusion [128] implements an 8-chip design
with minimally sized on-chip memories, connected through a sparsely-used inter-
chip network. Simba [141] deploys a 36-chiplet based design, where each chiplet
consists of a 4×4 digital PE array. A 4×4 cluster of analog in-memory compute
(AiMC) cores, each housing a 1152×256 capacitor-based IMC bit-cell array is
prototyped in [80].
Moreover, some works shift this trend further from homogeneous to heteroge-
neous core combinations: a new class of heterogeneous dataflow accelerators
(HDAs) that incorporate multiple sub-accelerators with varying dataflow
specialization is investigated in [92]. The chip realization of DIANA [164]
consists of both a digital core and an AiMC core that are interconnected
through a shared on-chip memory. These heterogeneous cores provide more
specialization, which can result in more efficient processing for a wider range of
DNN layers.

8.2.2 Allocation, scheduling & mapping

Mapping a DNN onto a multi-core hardware architecture includes three stages:

Layer Allocation

The recent trend towards multi-core systems enlarges the traditional mapping
space. Each layer must first be allocated to a (set of) core(s). Allocating a
modern DNN, which can easily consist of 50 layers, onto e.g. a quad-core
architecture yields O(10^30) possible layer allocations. Kwon et al. [92] explore
this allocation space through a heuristics-based allocator, but do not account
for the inter-core communication cost.
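The size of this allocation space follows directly from counting: with 50 layers and a quad-core architecture, each layer independently picks one of four cores, as the small check below illustrates.

# 50 layers, each allocated to one of 4 cores: 4^50 possible allocations.
print(4 ** 50)   # 1267650600228229401496703205376, i.e. on the order of 10^30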

[Figure 8.2 diagram: (a) Multi-core model: cores C0-C3 connected to DRAM and
to the other cores; (b) An example core (TPU-like dataflow accelerator): a PE
array with weight register files, input and output local buffers, a global buffer, an
accumulator, a DRAM port, and the communication bus.]

Figure 8.2: (a) Multi-core architecture model. (b) Example core with specific
dataflow (in red), connected to the off-chip memory port and bus for inter-core
communication. All memories, ports, and the bus, have a limited bandwidth.

Scheduling

Each allocated layer must be scheduled. This determines the execution order
of the layers (or their fine-grained parts). In the traditional layer-by-layer
processing, the only scheduling flexibility comes from branches in the DNN.
A more fine-grained scheduling approach, referred to as layer-fused [5], depth-first [53],
or cascaded [162] processing, and in this work referred to as layer fusion, has
been introduced. Instead of processing an entire layer at once, a smaller part of
each layer is processed and its outputs are immediately consumed to process
parts of the subsequent layers (Figure 8.1(2,4)). In this work, we refer to such
a layer part as a computation node (CN), whose size determines the scheduling
granularity. Layer fusion has two benefits compared to classical layer-by-layer
scheduling: 1.) The produced and consumed activations are (depending on
the layer) smaller, reducing the memory footprint, which in turn decreases the
off-chip memory accesses; 2.) In a multi-core system, the computation nodes
of subsequent layers can be processed in parallel if the CN data dependencies
allow it, improving parallelism. However, rapidly extracting these dependencies
is non-trivial for modern DNNs under fine scheduling granularity, detailed in
Section 8.3.2.
Current SotAs have shown the efficacy of layer fusion for a single-core
accelerator [54], homogeneous multi-cores [43, 80, 188] and heterogeneous
systems [164]. However, these works only consider a limited set of DNNs
with fixed scheduling granularity on specific hardware architectures. TVM [162]
includes a cascading scheduler that does explore different granularities, and
DeFiNES [108] introduces an analytical cost model enabling fast depth-first
scheduling design space exploration, but both are limited to single-core
architectures.

Mapping

Lastly, the computations of each CN, described by a set of for-loops, must
be efficiently mapped onto each core, respecting its supported dataflow and
memory resources. The dataflow plays a key role in efficiency, as a mismatch
between the core’s dataflow and the CN’s computations will cause spatial
under-utilization. The core’s memory hierarchy exploits the CN’s data reuse
across time. Many DSE frameworks have arisen [184, 91, 126, 109, 70, 59,
87] to analytically estimate the hardware cost and optimize the efficiency of
the mapping through loop optimizations like unrolling, tiling and ordering.
However, these frameworks do not model multi-core systems, and only estimate
the hardware performance of a single layer at a time.

8.3 Stream framework

Figure 8.3 shows an overview of Stream: Given a DNN workload graph and
a high-level multi-core accelerator architecture description, it determines an
optimal compute schedule with the resulting memory usage, energy and latency.
First, every layer is split into fine-grained computation nodes (CNs), taking
into account the dataflows supported by the different accelerator cores (Step 1).
Next, the data dependencies between CNs are rapidly generated and the fine-
grained CN graph is formed through an R-tree [55] (Step 2). In parallel, the
intra-core mapping cost of all unique CN-core combinations is optimized and
extracted using a single-core DSE framework (Step 3). A genetic algorithm
(GA) is subsequently deployed to explore the vast layer-core allocation space
(Step 4). The GA queries a latency- and memory-prioritized scheduler that
schedules the CNs onto the cores taking into account inter-core communication
contention and off-chip memory access contention (Step 5).

8.3.1 Step 1: CN identification & attribute extraction

The first step of representing a layer-fused DNN is splitting each layer into
multiple individually schedulable parts, referred to as computation nodes (CN).

[Figure 8.3 content: inputs are the workload (DNN layer graph) and a high-level
multi-core accelerator description. Step 1: CN granularity identification and
attribute extraction (outer-CN vs. inner-CN loops; per-CN loop/data ranges,
number of discarded inputs, number of generated outputs). Step 2: R-tree-based
rapid CN dependency generation, yielding the fine-grained CN graph. Step 3:
analytical intra-core mapping of all unique CN-core combinations (energy, latency,
memory utilization, PE utilization). Step 4: genetic algorithm-based automatic
layer-core allocation (NSGA-II-based selection, mutation, ...). Step 5: memory-
and latency-prioritized heuristic CN scheduling and memory usage tracing.
Outputs: compute schedule, memory usage, energy, and latency.]
Figure 8.3: Overview of the Stream framework.




[Figure 8.4 panels, for two consecutive layers i and i+1 each with loops OX 2,
OY 2, K 2, C 2: (a) Coarse: 1 CN per layer, 1 schedule possibility; (b) Medium:
2 CNs per layer, 2 schedule possibilities; (c) Fine: 4 CNs per layer, 14 schedule
possibilities.]
Figure 8.4: Computation node (CN) granularity impacts scheduling flexibility.

A CN is defined by isolating a subset of inner for-loops of a layer. Different
subsets lead to different scheduling granularities, as shown in Figure 8.4. The
remaining outer for-loops of a layer, called outer-CN loops, determine the CNs’
relative execution order. Stream identifies the best CN granularity using two
principles:

Layer topology awareness

The types of layers in the DNN, together with the layer interconnections impose
constraints on the optimal granularity. For example, a fully connected (i.e.
matrix-vector multiplication) requires all inputs to compute a single output,
and the CN, therefore, contains all layer loops (i.e. the layer only contains one

CN). This automatically breaks the fused layer stack. When layers do have
spatial locality (e.g. convolutional layers and matrix-matrix multiplications),
these loop dimensions are outer-CN loop dimensions. The outer-CN for-loops
are synchronized across layers. Moreover, the out-CN loop order determines
the scheduling order of CNs that belong to the same layer.

Hardware dataflow awareness

When deploying a CN onto an accelerator core, the CN mapping efficiency
depends on the compatibility of the spatial unrolling of the accelerator core
with the dimensions of for-loops encapsulated by the CN. If the CN loop
dimensions are smaller than the spatial unrolling dimensions of the targeted
core, the effective hardware utilization drops. To avoid such losses, the
CNs are constrained to contain at least the for-loop dimensions which are
spatially unrolled in the core. In the case of heterogeneous multi-core systems
with multiple spatial dataflows, the CN identification constrains the CNs to
minimally encompass all for-loops that are spatially unrolled in any of the cores.
This ensures the minimal CN granularity with good hardware utilization.
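A hedged sketch of this rule is given below: the minimal CN loop sizes are obtained by merging the spatial unrollings of all cores, taking the maximum size per loop dimension. The dictionary-based core descriptions are illustrative and do not reflect Stream's exact interface.

def minimal_cn_loops(core_dataflows):
    """Merge the spatial unrollings of all cores into minimal CN loop sizes."""
    cn_loops = {}
    for dataflow in core_dataflows:              # e.g. {"OX": 64, "FX": 4, "FY": 4}
        for dim, size in dataflow.items():
            cn_loops[dim] = max(cn_loops.get(dim, 1), size)
    return cn_loops

cores = [{"C": 32, "K": 32},                     # TPU-like core
         {"OX": 64, "FX": 4, "FY": 4}]           # Eyeriss-like core
print(minimal_cn_loops(cores))                   # {'C': 32, 'K': 32, 'OX': 64, 'FX': 4, 'FY': 4}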
Once each layer is broken down into CNs using these principles, Stream extracts
two attributes for each CN based on the ranges of the encapsulated for-loops:

1. The number of inputs exclusively used by this CN, which can hence be
discarded when the CN finishes;
2. The number of final outputs newly generated by each CN, which could be
sent out when the CN finishes.

Because of potential input data overlap and reduction loops across CNs, not
all CNs have the same number of discardable inputs and newly generated final
outputs, as shown in Figure 8.5. Stream’s CN attribute extraction is compatible
with all layer types, strides, and padding supported by ONNX [11].

8.3.2 Step 2: Fine-grained graph generation

After the identification of CNs of each layer and their attributes, the data
dependencies between all CNs must be generated in order to correctly schedule
the CNs in Step 5. This process is split into two parts.
Intra-layer: First, the intra-layer CN dependency edges are inserted based on
the outer-CN loop order, determined in Step 1. This ensures that the required

[Figure 8.5 content: Layer i with outer-CN loops OY 2, OX 2, C 2 and inner-CN
loops FY 3, FX 3; the layer as a whole discards 32 inputs and generates 4 outputs
(-32, +4). Per-CN attributes (# of discarded inputs, # of generated outputs):
CN0 (-1, +0), CN1 (-1, +1), CN2 (-3, +0), CN3 (-3, +1), CN4 (-3, +0),
CN5 (-3, +1), CN6 (-9, +0), CN7 (-9, +1).]

Figure 8.5: Computation node (CN) attribute extraction example. Attributes are
the number of discarded inputs (red) and the number of generated outputs (green).

tensor accesses of CNs within a layer are structured and easily implementable
using loop counters.
Inter-layer: Next, the inter-layer CN dependencies are determined based on
the loop ranges of each CN. Specifically, the overlap in data generated by
CNs of one layer, and required by CNs of the next layer(s), defines the data
dependency between these CNs. Because this work targets a fine-grained
scheduling granularity, the number of CNs could grow up to 10^6 or even larger
for modern DNNs. Exhaustively checking each CN pair for overlap in multi-
dimensional data tensors would require 10^12 checks, which is not feasible. A fast
inter-layer CN dependency generator is thus required, for which an algorithm
based on R-trees [55] is developed.
[Figure 8.6 content: for a producer layer's output and a consumer layer's input,
① create an R-tree over the loop ranges of the consumer CNs (CN0-CN3), and
② query the R-tree with the output range of each producer CN (CN0-CN8) to
identify the dependent consumer CNs.]
Figure 8.6: Inter-layer CN dependency generation example using R-trees [55].

Figure 8.6 shows the inter-layer dependency generation for a simplified example.
This process is repeated for each producer & consumer layer pair. First, an
R-tree representation of all CNs of the consumer layer is created (1), which
stores the encapsulated loop ranges of each consumer CN in a tree structure. In
the second step (2), the R-tree is queried for intersection with each CN of the
producer layer. The R-tree returns all consumer CNs whose ranges overlap with
the range of queried producer CN. Note that in this simplified example, the 4
consumer CNs span two dimensions and are non-overlapping. In practice, they
can have more dimensions with overlapping loop ranges, which is supported by
Stream.
Compared with a baseline implementation of inter-layer CN dependency
generation which checks every producer-consumer CN pair one-by-one, our R-
tree-based algorithm is much more efficient. For a case with 448×448 producer
CNs and 448×448 consumer CNs, the baseline implementation would take over
9 hours, whereas the R-tree-based generation takes 6 seconds (a more than 10^3× speedup).
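The idea can be illustrated with the Python rtree package (a wrapper around libspatialindex); Stream's internal implementation may differ, and the CN ranges below are made up for the example. Each consumer CN is inserted with the 2-D range of input data it requires, and each producer CN then queries the tree with the range of data it produces.

from rtree import index

# (1) Build the R-tree once over the consumer CNs' required input ranges.
consumer_cns = {0: (0, 0, 3, 3), 1: (4, 0, 7, 3),   # (x_min, y_min, x_max, y_max)
                2: (0, 4, 3, 7), 3: (4, 4, 7, 7)}
tree = index.Index()
for cn_id, box in consumer_cns.items():
    tree.insert(cn_id, box)

# (2) Query the tree with each producer CN's generated output range.
producer_cns = {0: (0, 0, 7, 1), 1: (0, 2, 7, 5), 2: (0, 6, 7, 7)}
for cn_id, box in producer_cns.items():
    print(f"producer CN{cn_id} -> consumer CNs {sorted(tree.intersection(box))}")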

8.3.3 Step 3: Intra-core mapping cost extraction

In this step, energy, latency, memory utilization and PE utilization of executing
individual CNs on each accelerator core are extracted. As discussed earlier,
multiple DSE frameworks already exist for optimizing the mapping of layer-by-
layer workloads onto single-core hardware architectures [184, 126, 109, 87, 70, 59].
These frameworks model the dataflow in the core and can optimize the temporal
data reuse for various hardware costs such as energy, latency, EDP, etc.
In this step, the CN loops are therefore fed into such a single-core mapping

optimization framework, which returns the optimal intra-core mapping and
associated hardware costs. As a main objective of this work is to reduce end-
to-end latency, an accurate latency estimation is crucial. We achieve this by
interfacing with the ZigZag framework [109, 153], which includes an accurate
latency modeling of on- and off-loading of data, as well as data stalls due to an
insufficient memory bandwidth in the core [110].
The CN mapping cost extraction is modular through a hardware model parser,
such that other single-layer single-core DSE frameworks [184, 126, 87] can also
be integrated into Stream.

8.3.4 Step 4: Layer – core allocation

This step targets the allocation of the CNs of each layer to the different
accelerator cores in the multi-core system. For large networks with varying
layer types, figuring out a performant layer-core allocation can be difficult. For
example, because of the fine CN granularity, it is not straightforward which
CNs can execute in parallel and should hence be executed on different cores.
To this extent, a genetic algorithm (GA) is developed, as shown in Figure 8.3, in
which the layer-core allocation is automatically optimized through the evolution
of different generations of a population of layer-core allocations. We choose
a GA for this allocation problem as it is modular in its optimization metrics,
which can be any linear combination of latency, energy, memory footprint, or
their derivatives, such as energy-delay-product (EDP). Each individual in the
population receives a fitness score based on the desired metrics. The surviving
individuals of a population are selected through an NSGA-II process [41],
which employs advanced mechanisms to spread out the individuals over the
Pareto-front. After the selection, an ordered crossover operation is performed
to generate new offspring with a probability of 30%. Finally, the genome of
an individual is randomly mutated through a bit flip (allocating a layer to a
different core) or a position flip (swapping two layers’ core allocations) with a
probability of 70%. The randomness enables the GA to escape local minima.
The GA ends after a predefined number of generations, or after the desired
optimization metric saturates. A Pareto front of optimal layer-core allocations
is returned.
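A deliberately simplified, single-objective sketch of this allocation loop is shown below; Stream's actual GA additionally uses NSGA-II selection, ordered crossover, and a fitness derived from the Step-5 scheduler, so the toy fitness function and all numbers here are illustrative only.

import random

NUM_LAYERS, NUM_CORES = 10, 4

def fitness(allocation):
    # Placeholder cost: a real fitness would run the Step-5 scheduler and return
    # e.g. latency, energy, or EDP for this layer-core allocation.
    counts = [allocation.count(c) for c in range(NUM_CORES)]
    return max(counts) - min(counts)          # toy objective: balance the cores

def mutate(allocation):
    child = list(allocation)
    if random.random() < 0.7:                 # bit flip: move one layer to another core
        child[random.randrange(NUM_LAYERS)] = random.randrange(NUM_CORES)
    else:                                     # position flip: swap two layers' cores
        i, j = random.sample(range(NUM_LAYERS), 2)
        child[i], child[j] = child[j], child[i]
    return child

population = [[random.randrange(NUM_CORES) for _ in range(NUM_LAYERS)]
              for _ in range(32)]
for _generation in range(50):                 # evolve a fixed number of generations
    population.sort(key=fitness)
    survivors = population[:16]               # selection (NSGA-II-based in Stream)
    population = survivors + [mutate(random.choice(survivors)) for _ in range(16)]

best = min(population, key=fitness)
print("best layer-core allocation:", best)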

8.3.5 Step 5.1: Multi-core CN scheduling

The scheduling step targets to derive the most optimal start time to execute
each CN, given the fine-grained CN graph, the CN’s mapping costs and the
individual’s layer-core allocation. Performing such fine-grained scheduling on a


multi-core hardware architecture comes with two major challenges: 1.) Correctly
incorporating the inter-core communication cost stemming from the producer
and consumer CNs and 2.) taking into account the cost due to off-chip fetching
of the first layer(s) input activations, and of weights which do not fit in the
limited on-core weight memory:

Modeling inter-core communication

Whereas modeling the communication cost of transferring activations from core
to core is relatively simple for traditional layer-by-layer execution [128], this
modeling complexity grows as the computation time shrinks for finer CNs and
more parallel executions appear. To model this behavior, a communication node
is inserted between producer-consumer CNs mapped onto different cores. The
runtime and energy cost of a communication node is calculated based on the
amount of data to be transferred and the bandwidth of the data communication
bus. Moreover, the bus models communication contention by scheduling all
communication nodes in a first-come-first-serve manner.
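A minimal sketch of this bus model, under the assumption of a 128 bit/cycle shared bus with first-come-first-serve arbitration (the data sizes in the usage example are illustrative):

import math

class SharedBus:
    """First-come-first-serve model of the shared inter-core communication bus."""
    def __init__(self, bw_bits_per_cc=128):
        self.bw = bw_bits_per_cc
        self.free_at = 0                       # cycle at which the bus is idle again

    def schedule_transfer(self, data_bits, ready_cycle):
        start = max(ready_cycle, self.free_at) # wait for the producer data and the bus
        end = start + math.ceil(data_bits / self.bw)
        self.free_at = end                     # later transfers queue behind this one
        return start, end

bus = SharedBus()
print(bus.schedule_transfer(8 * 4096, ready_cycle=100))  # (100, 356)
print(bus.schedule_transfer(8 * 1024, ready_cycle=120))  # (356, 420): bus contention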

Modeling off-chip fetching

For multi-core architectures targeting edge applications, the on-chip memory
resources are precious. Ideally, all weights of fused layers are stored on-chip so
no off-chip weight accesses are required. However, for limited on-chip weight
memory this might not be the case. Stream models both scenarios in a unified
way by tracking the weights contained in each core’s on-chip memory. If a
CN for which the weights are not on-chip is scheduled, the off-chip access is
accounted for through the insertion of an off-chip access node for which energy
and latency are modeled through an additional limited-bandwidth DRAM port.
If the on-chip weight memory capacity is too small to store the new set of
weights, weights are evicted from the memory in a first-in-first-out manner. At
the same time, the DRAM port is used to model fetches of the input activations
of the first layer.
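The weight-tracking logic can be sketched as follows; the capacities and layer sizes are illustrative, and the class is a simplification of Stream's actual bookkeeping.

from collections import OrderedDict

class WeightMemory:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = OrderedDict()            # layer_id -> weight size, in FIFO order

    def ensure(self, layer_id, size_bytes):
        """Return True if an off-chip access (DRAM) node must be inserted."""
        if layer_id in self.resident:
            return False                         # weights already on-chip: no fetch
        while self.resident and sum(self.resident.values()) + size_bytes > self.capacity:
            self.resident.popitem(last=False)    # evict the first-in weights
        self.resident[layer_id] = size_bytes
        return True

mem = WeightMemory(capacity_bytes=256 * 1024)
print(mem.ensure(0, 200 * 1024))   # True: first fetch of layer 0 weights
print(mem.ensure(0, 200 * 1024))   # False: still resident
print(mem.ensure(1, 100 * 1024))   # True: layer 0 is evicted to make room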
Once we have the fine-grained CN graph including the additional on-chip
communication nodes and off-chip access nodes, the CNs are actually scheduled
onto the cores they are allocated to. Different execution schedules will lead
to different overall latencies due to the fine-grained dependencies restricting
which nodes are eligible for execution. The schedule also strongly determines
the memory usage: if we postpone the scheduling of a CN’s successor(s), the
produced data will have to be stored in memory for a longer time. This leads
to a trade-off between latency and memory footprint, as shown in Figure 8.8,
for which Stream deploys two different scheduling optimization functions: one
prioritizing minimal latency and the other prioritizing low memory usage. Their
working principle is detailed in Figure 8.7. The scheduler keeps a pool of CN
candidates, from which it selects the best candidate to schedule next based on
the user-defined priority.
Latency is prioritized with the heuristic that picks the candidate whose
predecessors have finished the earliest (whose data has been stored in memory
the longest). This maximizes the cores’ utilization and thus benefits latency.
[Figure 8.7 content. Inputs: 1) the fine-grained CN graph, 2) the layer-core
allocation, 3) the intra-core mapping cost of each CN, and 4) the priority (latency
or memory). Output: the fine-grained CN schedule. Step A: initialize three internal
tables: the core idle timetable (T), the candidate CN table (C) containing the
source CNs (no predecessors), and the scheduled CN table (S), initially empty.
Step B: get the best CN candidate: latency-prioritized picks the CN whose
predecessors finished earliest; memory-prioritized picks the CN with the largest
layer index. Step C: map the chosen CN candidate and calculate its start and end
time. Step D: update the three tables: add the CN's runtime to the idle time of the
allocated core (T), remove the CN from the candidates (C), and add it to the
scheduled CNs (S). Step E: add to the candidates the chosen CN's successor(s)
that have all their predecessor(s) in the S table.]

Figure 8.7: Working principle of the latency- and memory-prioritized CN
scheduler.

Memory is prioritized by picking the CN from the candidate pool that has the
highest layer index. As this is the schedulable CN from the deepest layer in the
fused layer stack, it stimulates the immediate consumption of data deeper into
the fused stack for early discarding and efficient memory use. This can result
in idle time in the core as it waits for other cores to finish the predecessors of a
CN with a higher layer index, hence resulting in larger execution latency.
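The two selection heuristics of Step B boil down to a single comparison each, as the sketch below shows (the candidate fields are illustrative, not Stream's internal data structure):

def pick_candidate(candidates, priority):
    """candidates: list of dicts with 'layer_idx' and 'preds_finished_at' (cycle)."""
    if priority == "latency":
        # Data waiting longest in memory -> schedule first to keep the cores busy.
        return min(candidates, key=lambda cn: cn["preds_finished_at"])
    if priority == "memory":
        # Deepest layer first -> consume and discard intermediate data early.
        return max(candidates, key=lambda cn: cn["layer_idx"])
    raise ValueError(priority)

pool = [{"layer_idx": 0, "preds_finished_at": 10},
        {"layer_idx": 2, "preds_finished_at": 40}]
print(pick_candidate(pool, "latency"))   # the layer-0 CN
print(pick_candidate(pool, "memory"))    # the layer-2 CN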
Figure 8.8 demonstrates the impact of the two prioritization strategies and the
communication bus and DRAM port overhead, both in terms of latency and
memory footprint of a three-core system onto which four layers are mapped. The
blocks in top Figure 8.8 (c)(d) represent the CNs and the lines represent their
fine-grained dependencies. The bottom figures show the memory utilization
trace across time, explained next.

[Figure 8.8 panels: (a) an example neural network (Layers 0-3); (b) the
fine-grained CN graph and layer-core allocation (each layer split into 6 CNs;
Layers 0 and 3 on Core 0, Layer 1 on Core 1, Layer 2 on Core 2); (c) the
latency-prioritized schedule: 40164 cycles with a 617 KB peak memory usage;
(d) the memory-prioritized schedule: 51414 cycles with a 313 KB peak memory
usage.]

Figure 8.8: Step 5 - An example of fine-grained CN graph scheduling including
inter-core bus communication, DRAM accesses, and memory usage trace.

8.3.6 Step 5.2: Memory usage tracing

Once the start and end times of all CNs are known, the activation memory
utilization can be traced through time based on the number of discarded inputs
and the number of generated outputs per CN (cfr. Section 8.3.1). When a CN
finishes, the inputs that are no longer required are freed from the memory space.
When a CN starts, space is allocated in the memory for the to-be generated
outputs. In case data is transferred between two cores, the output data of
the producer CN remains in the producing core until the communication is
concluded. Memory space is allocated in the consuming core as soon as the
communication starts. Figure 8.8 shows the total memory usage trace of all
three cores, of which the maximum is the peak memory usage.
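A minimal sketch of such an event-based trace, assuming each scheduled CN is summarized by its start/end cycle, the bytes it allocates for its generated outputs, and the bytes of exclusive inputs it discards (all values illustrative):

def memory_trace(scheduled_cns):
    """scheduled_cns: list of (start, end, bytes_allocated, bytes_discarded)."""
    events = []
    for start, end, alloc, discard in scheduled_cns:
        events.append((start, +alloc))     # outputs reserved when the CN starts
        events.append((end, -discard))     # exclusive inputs freed when the CN ends
    usage, trace = 0, []
    for time, delta in sorted(events):
        usage += delta
        trace.append((time, usage))
    return trace, max(u for _, u in trace)

trace, peak = memory_trace([(0, 10, 512, 0), (10, 20, 512, 256), (20, 30, 0, 768)])
print(trace)
print("peak memory usage:", peak, "bytes")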

8.4 Validation

Goals. Stream is deployed to model the behavior of SotA taped-out architectures
in order to demonstrate its modeling flexibility and quantify its modeling
accuracy compared to the targets' performance measurements.
Hardware Targets. Figure 8.9 shows the three architectures considered
for this validation. They are diverse in their core and memory architectures
and supported scheduling granularities: 1.) DepFiN [54], a single core DNN
accelerator designed for high-resolution pixel processing workloads (e.g. super-
resolution, denoising) which deploys layer fusion with line-based CNs to mitigate
on-chip buffer requirements; 2.) Jia et al.’s [80] multi-core architecture consisting
of a 4×4 array of analog in-memory-compute (AiMC) cores, enabling high
throughput and energy efficiency through pipelined execution; 3.) DIANA
[164], a heterogeneous multi-core AiMC + digital hybrid DNN accelerator SoC
targeting efficient end-to-end inference for edge applications.
Each architecture’s specification is modeled in Stream through the: a.) intra-
core characteristics for each core (operand precision, PE array size, supported
dataflow, memory hierarchy) and b.) inter-core characteristics (inter-core
communication protocol: bus-like or through a shared memory, inter-core
constellation).
Workload Targets. Each hardware target has reported the measurements
of their accelerator performance for different DNNs. Each measured DNN is
modeled in Stream at the scheduling granularity supported by the hardware.
The mapping of the workload onto the cores is fixed and in accordance with the
respective measurements, i.e. the intra-core dataflow and the core allocation
used in their measurements. The latency-prioritized scheduler is applied. The

Figure 8.9: Hardware architecture targets for the validation of Stream.

validation results are summarized in Table 8.1 and the schedules generated by
Stream are shown in Figure 8.10.

DepFiN results

The DNN used in DepFiN’s measurements is FSRCNN [44], a super-resolution


CNN with large activation feature maps (560×960 pixels) across the network.
The large activation sizes require a peak memory usage of 28.3 MB in the
traditional layer-by-layer scheduling. The modeled memory usage (244 KB) is
118× lower due to the line-buffered scheduling. Stream’s runtime was 5 seconds,

Table 8.1: Validation results for three targeted hardware architectures.

Latency Validation
Architecture       Measured (cc)    Stream (cc)    Accuracy (%)
DepFiN [54]        6.18 × 10^6      5.65 × 10^6    91
4×4 AiMC [80]      3.66 × 10^5      3.68 × 10^5    99
DIANA [164]        8.12 × 10^5      7.83 × 10^5    96

Memory Usage Validation
Architecture       Measured (KB)    Stream (KB)    Accuracy (%)
DepFiN [54]        238              244            97
4×4 AiMC [80]      N/A              16.5           N/A
DIANA [164]        134              137            98

and the modeled latency and memory usage are 91%, resp. 97% accurate
compared to the measurement.

Multi-core AiMC accelerator results

The DNN models deployed by Jia et al. on their multi-core AiMC architecture
are ResNet-50 segments [57]. Their work has no memory usage data available.
Stream predicts a memory usage of 16.5 KB due to the tight activation balance
observed in Figure 8.10(b). Stream’s runtime was 3 seconds, and the modeled
latency is 99% accurate compared to the measurement.

(a) FSRCNN on single-core DepFiN.

(b) ResNet-50 segment on 4×4 AiMC multi-core.

(c) ResNet-18 segment on heterogeneous multi-core DIANA.

Figure 8.10: Schedule visualization of Stream for the three validation targets.

DIANA results

Lastly, we validate against DIANA’s [164] measurements of the first segment of


ResNet-18 [57]. The framework is able to accurately model the fine-grained data
dependencies between the varying workload operators (convolutional, pooling,
element-wise sum). The different operators are mapped to the different cores,
as shown in Figure 8.10c. Data is shared through a 256 KB L1 memory. This
would be insufficient for layer-by-layer execution, requiring frequent L2 and
off-chip memory accesses. Stream’s runtime was 2 seconds, and modeled latency
and memory usage are 96%, resp. 98% accurate compared to the measurement.

8.5 Exploration

In this section, we deploy Stream to co-explore the optimal allocation, scheduling
and mapping in combination with architectural decisions across different DNNs.
For this exploration, 5 DNN workloads and 7 hardware architectures with
identical area footprint are modeled. Figure 8.11(a) summarizes the hardware
configurations, in which the dataflow of each core is indicated in Figure 8.11(b).
Each architecture houses a communication bus with a bandwidth of 128 bit/cc
that is used for inter-core communication. Each core also can read/write data
from/to the off-chip DRAM memory using the DRAM port, which has a shared
bandwidth of 64 bit/cc. A total of 1 MB of activation and weight memory is
spread across the cores and all memory read and write costs are automatically
extracted through CACTI 7 [12].

8.5.1 Automated layer-core allocation impact

First, we explore the impact of the genetic algorithm-based automatic layer-core
allocation by comparing it to a manual allocation. While for classical layer-by-
layer scheduling allocation can be done manually with simple heuristics, the
fine-grained scheduling space is too large for manual allocation.
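
The principle of such a GA-based allocation is sketched below (a self-contained,
heavily simplified Python example: the chromosome encoding, the genetic
operators, and in particular the placeholder fitness function are illustrative
stand-ins; in Stream, the fitness of a candidate allocation is obtained by
running the actual fine-grained scheduler and cost model):

    import random

    NUM_LAYERS, NUM_CORES = 18, 4

    def evaluate_latency(allocation):
        # Placeholder fitness: penalize load imbalance across cores. The real
        # fitness would be the latency reported by the multi-core scheduler.
        load = [allocation.count(c) for c in range(NUM_CORES)]
        return (max(load) - min(load)) + 0.1 * random.random()

    def ga_allocate(pop_size=32, generations=50, mutation_rate=0.1):
        # Chromosome: one core index per layer.
        pop = [[random.randrange(NUM_CORES) for _ in range(NUM_LAYERS)]
               for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=evaluate_latency)
            parents = pop[: pop_size // 2]                 # elitist selection
            children = []
            while len(children) < pop_size - len(parents):
                p1, p2 = random.sample(parents, 2)
                cut = random.randrange(1, NUM_LAYERS)      # single-point crossover
                child = p1[:cut] + p2[cut:]
                if random.random() < mutation_rate:        # mutate one gene
                    child[random.randrange(NUM_LAYERS)] = random.randrange(NUM_CORES)
                children.append(child)
            pop = parents + children
        return min(pop, key=evaluate_latency)

    best_allocation = ga_allocate()
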
We compare the automatic GA-based allocation to a manual allocation
for ResNet-18 onto both a homogeneous (HomTPU in Figure 8.11) and a
heterogeneous (Hetero in Figure 8.11) quad-core architecture. The manual
assignment for the homogeneous architecture allocates layers to subsequent
cores in a ping-pong fashion, while for the heterogeneous architecture allocation
is done by assigning CNs to the core whose dataflow best fits that layer
(best spatial utilization). The multi-core scheduler is run with both the latency
priority and memory priority (cfr. Section 8.3.5). The results in Figure 8.12
show that the automatic allocation provides a significant reduction in both
latency and memory requirements. Moreover, the different priority schedules
show the latency-memory trade-off: the GA's memory leader has 56% lower
memory usage at a 54% higher latency on the heterogeneous architecture. All
further experiments, both for coarse and fine granularities, are executed using
the GA-based allocation with the latency scheduling priority.

Figure 8.11: Seven hardware architectures used for architecture exploration.
(a) Three design paradigms used for the exploration: single-core, homogeneous
quad-core, and heterogeneous quad-core, each with a communication bus
(128 bit), a DRAM port (64 bit), and a SIMD core. (b) The seven architectures
constructed from these paradigms:

SC: TPU      Single-Core               C 64 | K 64 (TPU-like)
SC: Eye      Single-Core               OX 256 | FX 4 | FY 4 (Eyeriss-like)
SC: Env      Single-Core               OX 64 | K 64 (Envision-like)
MC: HomTPU   Homogeneous Quad-Core     each core: C 32 | K 32 (TPU-like)
MC: HomEye   Homogeneous Quad-Core     each core: OX 64 | FX 4 | FY 4 (Eyeriss-like)
MC: HomEnv   Homogeneous Quad-Core     each core: OX 32 | K 32 (Envision-like)
MC: Hetero   Heterogeneous Quad-Core   core 0: OX 64 | FX 4 | FY 4;
                                       core 1: OX 32 | K 32; cores 2/3: C 32 | K 32 (hybrid)

Figure 8.12: Impact of the automatic layer-core allocation. The GA solution
has significantly better latency and memory requirements than manual
allocation.

8.5.2 Architecture impact

Next, we explore the capabilities of Stream to assess the benefits and drawbacks
of various hardware architectures for fine-grained layer-fused DNN execution.
A diverse set of neural network models is used for this study: ResNet18 [57],
MobileNetV2 [138], SqueezeNet [72], Tiny-YOLO [3] and FSRCNN [45].
Stream optimally allocates each dense computational layer to one of the
accelerator cores using its genetic algorithm, while the other layers such as
pooling and residual addition layers are assigned to the additional SIMD core.
To demonstrate Stream’s optimization flexibility, this study targets energy-
delay-product (EDP) as the allocation’s optimization criterion.
The results of all experiments are summarized in Figure 8.13. For each
combination of workload and architecture, the EDP is optimized for both a
layer-by-layer scheduling granularity, as used by the SotA scheduler of Kwon et
al. [92], and a fine-grained scheduling granularity. We also show the impact of
the design choices on the latency and energy individually in Figure 8.14, for
the optimal EDP point.
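
The EDP criterion itself is simply the product of the modeled energy and
latency; the best design point is the candidate minimizing this product, as in
the following toy example (the numbers are made up):

    candidates = [  # schedule, modeled energy [J], modeled latency [cc]
        {"schedule": "layer-by-layer", "energy_J": 0.12, "latency_cc": 9.1e6},
        {"schedule": "layer-fused",    "energy_J": 0.05, "latency_cc": 3.0e6},
    ]
    best = min(candidates, key=lambda c: c["energy_J"] * c["latency_cc"])
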
The analysis below is broken down into the three modeled architecture classes:

Single-core architectures

Layer fusion consistently outperforms layer-by-layer processing. First, there


are latency benefits from fusion because pooling layer CNs can be executed in
the SIMD core in parallel to the convolutional layer CNs. Another reason for
improved latency and energy is the reduced size of the activations, enabling
on-chip storage and preventing all features from being sent to DRAM and back.
This is observed in Figure 8.14(b), where the layer-fused scheduling of Stream
results in lower off-chip energy compared to traditional layer-by-layer
scheduling. For the three single-core architectures, the EDP gain of layer
fusion over layer-by-layer scheduling, taken as the geometric mean over the
networks, lies between 2.4× and 4.7×.

Homogeneous multi-core architectures

The homogeneous multi-core architecture results demonstrate that multi-core
is not always better than single-core, at least not under the layer-by-layer
scheduling granularity. This is because each individual, smaller core has a
lower hardware efficiency for workloads with a coarse scheduling granularity.
For fine-grained layer fusion, homogeneous multi-core architectures do
consistently outperform single-core architectures, as they bring more
parallelization opportunities, increasing the temporal utilization (i.e., the
percentage of time a core is active). Moreover, the off-chip DRAM energy
is further reduced. For the three homogeneous multi-core architectures, layer
fusion has between 10× and 19× better EDP than layer-by-layer scheduling.

Figure 8.13: The best EDP point found by Stream over 5 DNNs for 7 hardware
architectures under traditional layer-by-layer scheduling and fine-grained layer
fusion. The architecture abbreviations are shown in Figure 8.11(b). For the
geometric mean, the EDP reduction from layer-by-layer to layer-fused scheduling
is shown.

Figure 8.14: Latency and energy breakdown for the best EDP points of the
exploration. (a) Latency. (b) Energy breakdown.

Heterogeneous multi-core architecture

In the heterogeneous multi-core architecture, multiple dataflows provide
specialization to serve a broader range of networks with widely varying layer
topologies at higher spatial utilization, improving the latency in comparison
with the best-performing homogeneous architecture. This clearly shows that
both layer-fused and multi-core execution improve latency. Uniform networks,
like SqueezeNet or FSRCNN, already experience a good match between their
layer topologies and the dataflow of the homogeneous multi-core designs. As a
result, these networks benefit less from heterogeneous multi-core mappings, or
even suffer decreased latency, because the heterogeneous system includes cores
with dataflows that are less optimal for these networks. Other networks benefit
more from heterogeneity due to their variety in layer types: the wider the
variety in layers, the larger the benefit from heterogeneity. Averaged across
all studied networks, the explored heterogeneous architecture has an overall
1.6× better EDP than the best-performing homogeneous architecture under
fine-grained layer fusion. Moreover, layer fusion has 30.4× better EDP than
layer-by-layer execution.
In summary, Stream allows us to quantitatively and automatically co-explore
the optimal scheduling granularity with architectural decisions, providing insight
into the hardware performance of fine-grained layer fusion on a broad range of
single- & multi-core architectures.

8.6 Conclusion

This work presented Stream, a modeling and DSE framework capable of


representing layer-fused deep neural networks across a wide range of scheduling
granularities and heterogeneous multi-core architectures. Stream optimizes
execution schedules towards minimal energy, minimal latency and/or minimal
memory footprint for constrained edge devices. The framework is validated

against three hardware implementations employing layer-fused scheduling,


showing tight matching with measured hardware efficiencies. Moreover, we
demonstrated that high-level architectural decisions greatly impact hardware
efficiency under the fine-grained scheduling paradigm, reducing the energy-delay
product by 2.4× for single-core architectures and by up to 30× for heterogeneous
multi-core architectures compared to traditional scheduling at layer granularity.
At this point, all the main contents of the thesis have been covered: a DSE
journey from single MAC units to multi-core DNN accelerators. In the
upcoming chapter, we will wrap up the thesis and put forth future research
directions, building upon the existing foundation.
Chapter 9

Conclusion and Future Work

Finally, the journey that started from the vision of Chapter 1 and its
"post-Moore's-Law" discussion comes to an end. In this post-Moore's-Law era,
we have made our efforts to facilitate DNN accelerator design and workload
deployment through cross-level design space exploration, assisting in continuing
the exponential growth in computing performance and efficiency.
In this chapter, we will first conclude the thesis by revisiting the key ideas of
each chapter and answering the open research questions posed in Section 1.2.
Afterwards, we will look into the future research directions, building upon the
groundwork laid by this thesis.

9.1 Contributions and Conclusions

Chapter 1 firstly introduced the multi-level design concept of deep learning


acceleration. It pointed out that in the post-Moore’s-Law era, we need new
design paradigms which allow fast cross-level customization and optimization
to continue improving computing performance and efficiency. Following that,
this chapter elaborated on multiple open research questions (across different
abstraction levels) this thesis aims to address, with the goal of facilitating this
emerging design tendency through design space exploration.
Chapter 2 provided the background information on the three key factors that
are prevalent throughout the thesis: DNN algorithm, hardware, and algorithm-
to-hardware deployment. Besides presenting their basic concepts, this chapter
also delivers the following key ideas: 1) A DNN model's parameter size or
operation count is not a good indicator of its hardware processing cost, as
the data/computation pattern of the model, the hardware architecture, and
the workload deployment all have a big impact on the final cost; 2) A DNN
accelerator consists of multiple components, from a MAC unit, MAC/PE
array, and memory hierarchy to the overall single- and multi-core architecture,
each embracing a large design space; 3) Deploying a DNN model onto an
accelerator involves a multi-level decision-making process: spatial and temporal
mapping, cross-layer scheduling, and layer-core allocation. Each level presents
numerous design options and is intricately linked to both the hardware
architecture and the workload. In the chapter, all the concepts and ideas are
thoroughly explained, incorporating ample visualizations and illustrative
examples to enhance understanding.
Chapter 3 studied MAC units, and explored the design space of MAC unit
architecture for variable-precision DNN execution. The proposed taxonomy,
horizontally, unified all the existing and newly discovered precision-scalable MAC
architectures under the same naming convention, and vertically, linked the MAC-
level design decisions to MAC-array-level and algorithm-level choices. Along
with this taxonomy, an exhaustive benchmark of different MAC architectures
has been carried out in order to clearly understand their ground principles,
features, connections, and differences. The answer to "Q1.1: What is the best
MAC unit architecture for variable-precision DNN execution?" is that there is
not a single optimal MAC architecture, because several trade-offs (bandwidth,
throughput, area, energy, scalability, etc.) exist among these MAC architectures,
and this chapter exhibited these trade-offs to readers.
These trade-offs result in optima across 1) different FU designs: the input
bandwidth saving of SAs vs. the output bandwidth saving of STs; 2) FU vs.
SWU designs: under-utilized bandwidth with fully utilized arithmetic logic for
FUs vs. fully utilized bandwidth with under-utilized arithmetic logic for
SWUs; 3) scaling modes: weight-only vs. two-dimensional scaling, and
one-level vs. two-level scaling; 4) spatial vs. temporal architectures: finishing
more operations within the same time vs. finishing the same amount of
operations in a shorter time; 5) precision ratios: e.g., (33% 8b, 33% 4b, 33% 2b)
vs. (5% 8b, 47% 4b, 47% 2b), and so on.
The results of the comparative study have highlighted that 2D FU ST
(BitFusion) [143] and SWU ST (ST) [107] have the highest energy efficiency
for symmetric scaling scenarios, while 1D FU ST and 1D 4-bit serial [21] are
best for weight-scaling scenarios. In addition, 1D FU SA (DNPU) [146]
and 2D FU SA excel with high throughput for all scaling scenarios, but suffer,
together with 2D FU ST (BitFusion), from large varying bandwidth requirements.
Despite the recent trend towards 1D [94, 21] and 2D [163, 142] serial designs,
these are strongly penalized at the MAC level in terms of both throughput and
energy efficiency. This chapter also revealed a crucial fact: when full-precision
operations take a significant share of the overall computation, SWUs tend to
perform better than FUs, and when the portion of full-precision operations
increases further, none of the precision-scalable MAC architectures beats the
conventional MAC unit with data gating.
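
The common principle underlying all these precision-scalable designs is that a
full-precision product can be decomposed into low-precision bit-group products
that are recombined with shift-adds; spatial designs compute these partial
products in parallel on small multipliers, while temporal (serial) designs iterate
over them in time. The following minimal Python sketch illustrates the
decomposition for unsigned operands only (signed operands require additional
correction logic, which is omitted here):

    def bitgroup_product(a: int, b: int, bg_bits: int = 2, width: int = 8) -> int:
        # Rebuild an unsigned width-bit x width-bit product from bg_bits x bg_bits
        # bit-group partial products, recombined with shift-adds.
        groups = width // bg_bits
        mask = (1 << bg_bits) - 1
        acc = 0
        for i in range(groups):          # bit groups of operand a
            for j in range(groups):      # bit groups of operand b
                pa = (a >> (i * bg_bits)) & mask
                pb = (b >> (j * bg_bits)) & mask
                acc += (pa * pb) << ((i + j) * bg_bits)
        return acc

    assert bitgroup_product(173, 99) == 173 * 99

In lower-precision modes, the same small multipliers each compute an
independent low-precision product instead of contributing to one recombined
result, which is where the utilization and bandwidth trade-offs discussed above
originate.
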
Chapter 4 continued the study on precision-scalable datapaths and moved
one architectural level up, to the MAC array level. As unified design
characterization is the foundation for systematic DSE, the enlarged
design space (from MAC unit to array) calls for a more powerful way to
characterize different PSMAs. To this end, this chapter introduced a precision-aware
nested for-loop representation for DNN mappings, and based on this new
representation, proposed a comprehensive PSMA taxonomy. Following that,
a highly parameterized PSMA template was built that can be design-time
configured into a huge subset (72 designs) of the design space spanned by
the taxonomy. In the end, the 72 PSMAs are thoroughly benchmarked and
compared, disclosing design insights.
Again, results showed that there is no one-fits-all answer to the question "Q1.2:
What is the best MAC array architecture for variable-precision DNN execution?",
but there are several insightful design guidelines observed. First, it is shown that
BG unrolling in L2 is the least ideal case, and it’s better to have BG unrolling
at L3 for lower frequencies or unrolling temporally for higher frequencies. This
is due to the (configurable) shifters being amortized across different L2 units,
making the adder trees less complex. Second, it’s generally a good idea to have
a mixture of IS/WS and OS loops unrolled throughout the array levels, so as
to balance the data distributing and collecting burdens of input and output
operands. Third, similar to the previous chapter, FU designs are better suited
for workloads where lower precisions are common, whereas SWU designs are
better suited for higher-precision workloads. Fourth, contrary to the previous
chapter, the BS designs exhibit favorable qualities at the array level. This is
because, for the BS designs in this chapter (BS-L2), the hardware overhead of
the internal registers and shift-add logic is shared across the 16 L1 units within
each L2 unit (enabled by L2: OS), which ultimately reduces the energy per
operation. In Chapter 3, by contrast, the BS designs effectively equip each L1
unit with its own registers and shift-add logic (BS-L1).
Chapters 5 & 6 moved the abstraction level further up from components in a
DNN accelerator (MAC units and array) to the complete accelerator system,
and provided a complete answer to the question "Q2.1: How to design a DSE
framework for single-core DNN accelerator exploring single-layer mapping?"
by introducing the ZigZag framework. ZigZag is a DSE framework for DNN
accelerators supporting single-layer mapping optimization. It is equipped
with an analytical cost model for rapid energy/latency/area estimation and

clever mapping and architecture search engines/generators for design space


exploration. ZigZag is developed following the five DSE framework principles
we proposed in Section 1.2.2: fast, accurate, general, adaptable, and intelligent.
To answer question Q2.1, building such a DSE framework follows a three-step
methodology: firstly, identify different design options and construct a unified
design representation that covers these options; secondly, based on this unified
representation, build the cost models; lastly, automatically generate different
design candidates in this representation and feed them to the cost models. In
this way, the loop is closed, and the design space exploration can be conducted
automatically. At the end of Chapter 5, the latest advancements of ZigZag are
introduced from various perspectives, demonstrating the immense potential of
the framework.
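
Schematically, this closed loop can be captured as in the sketch below (all
functions are heavily simplified stand-ins rather than ZigZag's actual API; the
real framework enumerates complete spatial/temporal mappings and memory
hierarchies and estimates energy, latency, and area analytically):

    from itertools import product

    def generate_mappings(workload, arch):
        # Step 1: design candidates in a unified representation (here reduced
        # to a pair of spatial unrolling factors).
        return [{"spatial_K": k, "spatial_C": c}
                for k, c in product([8, 16, 32], repeat=2) if k * c <= arch["n_pe"]]

    def cost_model(workload, arch, mapping):
        # Step 2: analytical cost estimate (placeholder: lower spatial
        # utilization leads to proportionally higher cost).
        utilization = mapping["spatial_K"] * mapping["spatial_C"] / arch["n_pe"]
        return workload["macs"] / utilization

    def explore(workload, archs):
        # Step 3: generate candidates, feed them to the cost model, keep the best.
        return min(((a, m, cost_model(workload, a, m))
                    for a in archs for m in generate_mappings(workload, a)),
                   key=lambda t: t[2])

    best_arch, best_mapping, best_cost = explore({"macs": 1e9},
                                                 [{"n_pe": 256}, {"n_pe": 1024}])
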
Chapter 7 introduced DeFiNES, a successor DSE framework of ZigZag that
broadens the workload deployment space from single-layer mapping to also
considering cross-layer depth-first scheduling. The answer to the question "Q2.2:
How can the DSE framework be improved to enable depth-first scheduling in a
single-core DNN accelerator?" lies in the subsequent steps taken to construct the
DeFiNES framework. Firstly, the DF design space is systematically identified,
spanned by three DF strategy axes: tile size, overlap storing mode, and fuse
depth. After formalizing this design space, a unified cost model for both layer-
by-layer and DF scheduling is built. It enables analytically estimating the
hardware costs for possible schedules in terms of both energy and latency, while
considering data access at every memory level. This is done for each schedule
and hardware architecture under study by optimally choosing the active part of
the memory hierarchy per unique combination of operand, layer, and feature
map tile. The hardware costs are estimated, taking into account both data
computation and data copy phases. The analytical cost model is validated
against measured data from a taped-out DF DNN accelerator, DepFiN, showing
good modeling accuracy at the end-to-end neural network level. Using this
model, the case studies showed that DF strategies can significantly outperform
layer-by-layer execution, across different DNN workloads and a wide range of
hardware architectures.
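
The resulting depth-first design space can be viewed as the Cartesian product
of these three axes, as in the sketch below (the listed values are illustrative
examples rather than DeFiNES' complete option set):

    from itertools import product

    tile_sizes    = [(1, 1920), (4, 1920), (8, 960)]   # (rows, columns) per tile
    overlap_modes = ["fully recompute", "partially cached", "fully cached"]
    fuse_depths   = [1, 2, 5, "full network"]

    # Every (tile size, overlap storing mode, fuse depth) triple is one
    # depth-first schedule candidate to be evaluated by the unified cost model.
    df_strategies = list(product(tile_sizes, overlap_modes, fuse_depths))
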
Chapter 8 elaborated on Stream, another successor DSE framework of ZigZag,
that moves from the single-core to multi-core hardware architecture with fine-
grained layer fusion support. The development process of Stream provides an
answer to the question "Q2.3: How can the DSE framework be extended to multi-
core DNN accelerators and support layer fused multi-accelerator execution?".
Firstly, an upgraded unified design point representation (using both DAG and
nested for-loops) is constructed to capture the more complex mapping and
scheduling options in the multi-core scenario. Then, a rapid fine-grained data
dependency extractor is built based on this unified representation, with which

a layerwise coarse-grained DNN model can be easily broken down into a more
fine-grained graph with each node as a nested for-loop based CN and each edge
between the nodes indicating the fine-grained data dependencies between CNs.
Afterward, a heuristic-based scheduler is constructed to rapidly schedule each
CN to the corresponding core, respecting the data dependencies and hardware
constraints, and following that, the single-layer mapper makes sure each
CN is mapped onto its core in an optimal manner. The Stream framework
is validated against three hardware implementations employing layer-fused
scheduling, showing tight matching with measured hardware efficiencies. Case
studies showed that high-level architectural decisions greatly impact hardware
efficiency under the fine-grained scheduling paradigm. To conclude, Stream
allows one to quantitatively co-explore the scheduling granularity with architectural
decisions, providing insights into the hardware performance of fine-grained layer
fusion on a broad range of multi-core architectures.
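
As a toy illustration of the dependency extraction, the sketch below splits each
layer into output-row tiles (CNs) and connects a producer CN to a consumer
CN when their data ranges overlap (a strongly simplified rule assuming stride-1
convolutions; the actual extractor reasons about the full multi-dimensional
input/output data ranges of each CN):

    def build_cn_graph(n_layers=3, rows=8, tile_rows=2, kernel=3):
        # CN = (layer index, first output row, last output row)
        cns = [(layer, r0, r0 + tile_rows - 1)
               for layer in range(n_layers)
               for r0 in range(0, rows, tile_rows)]
        halo = kernel // 2
        edges = []
        for (l2, s2, e2) in cns:                 # consumer CN
            for (l1, s1, e1) in cns:             # candidate producer CN
                if l1 == l2 - 1 and s1 <= e2 + halo and e1 >= s2 - halo:
                    edges.append(((l1, s1, e1), (l2, s2, e2)))
        return cns, edges

    cns, edges = build_cn_graph()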

9.2 Future Works

As the saying goes, the more you know, the more you realize you don’t know.
By answering one question, you usually find more follow-up questions. This
section proposes some future work directions, building upon the foundational
contributions of this thesis.

• For Chapters 3 and 4, besides the two direct improvements mentioned
earlier (real-NN-workload-based simulation and implementation in a more
advanced technology node), the studies can also move one step further to the
DNN accelerator level by modeling different PSMAs in a DSE framework (like
ZigZag), taking into account overlooked factors such as the non-ideality
of the actual DNN workload (e.g., the mismatch between workload sizes and
hardware dimensions), the data movement in the memory hierarchy, and
mapping/scheduling optimizations. A lot of interesting DSE experiments
can be conducted, such as letting the mapping search engines find the
best architecture level to unroll the BG loops, comparing the hardware
utilization and efficiency for temporal and spatial PSMA designs across
different precision modes, assessing the benefit and overhead of supporting
dynamic BG unrolling (i.e., unroll BG loop at different architectural levels
for various layers and/or precision modes), etc.
• For Chapter 5, 6, 7, and 8, in general, within all the DSE frameworks (ZigZag,
DeFiNES, Stream), the further improvements can follow the five proposed
principles: 1) run faster by speeding up the search engines; 2) be more accurate
by incorporating additional realistic implementation factors into the cost models
(like interconnection cost and chip layout impact); 3) be more general by


supporting more new machine learning workloads, hardware architectures
(together with their specific configurations, like sparsity support), and
workload-to-hardware deployment strategies; 4) be more adaptable by
adopting more concise data point representation; 5) be more intelligent
by upgrading the optimization algorithm (like one can introduce a hardware
feedback loop to optimize the search procedure of hardware architectures, or
let the frameworks have "memory" and learn from what they have already
seen to speed up future optimization).
• Regarding the mapping and scheduling part of DSE, a large body of work
has been built up across multiple decades. Although most of it was
originally developed for other domains, such as telecommunications,
digital signal processing, and multimedia, its methodologies can
still be learned from and adapted to deep learning acceleration. Future works
towards this direction include things like incorporating the well-established
Polyhedral model [167] into the toolchain to handle more complicated for-
loop transformations, building more dedicated models for data reuse, locality,
lifetime, and regularity analysis [89, 24, 79, 25] to expose more memory-data
optimization possibilities, automating the computation granularity search in
layer fused scheduling for discovering better fusion patterns [40], decoupling
the huge DSE into flows with multiple more fine-grained stages to manage
the escalation of complexity that arises with the inclusion of additional design
considerations [48, 23], etc.
• Besides improving the frameworks themselves (as a developer), one can
also apply them to perform more interesting DSE experiments (as a user),
such as studying the impact of different DNN workload attributes (e.g.,
the model size, operation count, layer interconnection, etc.) on hardware
cost, exploring the inter-relation between certain hardware modifications
and scheduling preferences (e.g., the on-chip memory resource distribution
between activation and weight can largely influence the best depth-first
scheduling selection, and thus the hardware cost), evaluating the impact of
technology-, device-, and packaging-level innovations at the system level, etc.
• In addition to all these, one can also think about how to extend the
frameworks to broader use scenarios than just performing high-level design
space exploration, such as embedding it into a compiler toolchain for
generating better machine code or utilizing the DSE frameworks for real
hardware generation of the optimal design point found.

All in all, this thesis has clearly introduced the vast design space of deep learning
accelerators at the different abstraction levels, thoroughly explained how the
high-level design space exploration frameworks can be built to rapidly offer
design insights and guidelines, and eventually passed on the taxonomy, modeling,
and exploration methodologies applied in this thesis to future researchers.
Biography

Linyan Mei was born in Hunan, China in 1994. She received the B. Sc. degree
in electrical engineering from Beijing Institute of Technology (BIT), China
in 2016, and the M. Sc. degree in electrical engineering from KU Leuven,
Belgium, in 2018. The subject of her master's thesis was "Digital Design
of Flexible Deep Learning Kernels", in cooperation with the SLD (system-
level design) team in IMEC, Leuven, advised by Dr. Dimitrios Rodopoulos
and Dr. Jeremy Constantine. Later in 2018, she joined the ESAT-MICAS
laboratories of KU Leuven as a Ph.D. student, under the guidance of
Prof. Marian Verhelst. In 2021, she temporarily joined Meta, CA, US as
a research intern, advised by Dr. Huichu Liu. Currently, she is working towards
the Ph.D. degree on design space exploration for embedded deep neural network
accelerators.

List of publications

Articles in international journals

(* for joint first authorship.)

⋄ L. Mei*, P. Houshmand*, V. Jain, S. Giraldo, M. Verhelst, "ZigZag:


Enlarging Joint Architecture-Mapping Design Space Exploration for DNN
Accelerators," in IEEE Transactions on Computers (TC), vol. 70, no. 8, pp.
1160-1174, 1 Aug. 2021
⋄ V. Camus*, L. Mei*, C. Enz, M. Verhelst, "Review and Benchmarking of
Precision- Scalable Multiply-Accumulate Unit Architectures for Embedded
Neural-Network Processing," in IEEE Journal on Emerging and Selected
Topics in Circuits and Systems (JETCAS), vol. 9, no. 4, pp. 697-711, Dec.
2019
⋄ E. M. Ibrahim*, L. Mei*, M. Verhelst, "Taxonomy and Benchmarking
of Precision- Scalable MAC Arrays Under Enhanced DNN Dataflow
Representation," in IEEE Transactions on Circuits and Systems I: Regular
Papers (TCAS-I), vol. 69, no. 5, pp. 2013-2024, May 2022
⋄ M. Shi, P. Houshmand, L. Mei, M. Verhelst, "Hardware-Efficient Residual
Neural Network Execution in Line-Buffer Depth-First Processing," in IEEE
Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS),
vol. 11, no. 4, pp. 690-700, Dec. 2021
⋄ M. Verhelst, M. Shi, L. Mei, "ML Processors Are Going Multi-Core: A
performance dream or a scheduling nightmare?," in IEEE Solid-State Circuits
Magazine (SSCM), vol. 14, no. 4, pp. 18-27, Fall 2022
⋄ V. Jain, S. Giraldo, J. D. Roose, L. Mei, B. Boons, M. Verhelst, "TinyVers:
A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML


Inference at the Extreme Edge," in IEEE Journal of Solid-State Circuits


(JSSC), 2023

Articles in international conference proceedings


⋄ L. Mei*, K. Goetschalckx*, A. Symons, M. Verhelst, "DeFiNES: Enabling
Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators
through Analytical Modeling," 2023 IEEE International Symposium on High-
Performance Computer Architecture (HPCA), 2023
⋄ L. Mei, H. Liu, T. Wu, H. E. Sumbul, M. Verhelst, E. Beigne, "A
Uniform Latency Model for DNN Accelerators with Diverse Architectures
and Dataflows," 2022 Design, Automation & Test in Europe Conference &
Exhibition (DATE), 2022
⋄ L. Mei, M. Dandekar, D. Rodopoulos, Constantin, P. Debacker, R.
Lauwereins, M. Verhelst, "Sub-Word Parallel Precision-Scalable MAC Engines
for Efficient Embedded DNN Inference," 2019 IEEE International Conference
on Artificial Intelligence Circuits and Systems (AICAS), 2019
⋄ A. Symons, L. Mei, S. Colleman, P. Houshmand, S. M. Karl, M. Verhelst,
"Stream: A Modeling Framework for Fine-grained Layer Fusion on Multi-core
DNN Accelerators," 2023 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS), 2023
⋄ A. Symons, L. Mei, M. Verhelst, "LOMA: Fast Auto-Scheduling on DNN
Accelerators through Loop-Order-based Memory Allocation," 2021 IEEE
3rd International Conference on Artificial Intelligence Circuits and Systems
(AICAS), 2021
⋄ V. Jain, L. Mei, M. Verhelst, "Analyzing the Energy-Latency-Area-
Accuracy Trade-off Across Contemporary Neural Networks," 2021 IEEE
3rd International Conference on Artificial Intelligence Circuits and Systems
(AICAS), 2021
⋄ P. Houshmand, S. Cosemans, L. Mei, I. Papistas, D. Bhattacharjee, P.
Debacker, A. Mallik, D. Verkest, M. Verhelst, "Opportunities and Limitations
of Emerging Analog in-Memory Compute DNN Architectures," 2020 IEEE
International Electron Devices Meeting (IEDM), 2020
⋄ S. Colleman, T. Verelst, L. Mei, T. Tuytelaars, M. Verhelst, "Processor
Architecture Optimization for Spatially Dynamic Neural Networks," 2021
IFIP/IEEE 29th International Conference on Very Large Scale Integration
(VLSI-SOC), 2021

⋄ V. Jung, A. Symons, L. Mei, M. Verhelst, L. Benini, "SALSA: Simulated


Annealing based Loop-Ordering Scheduler for DNN Accelerators," 2023 IEEE
5th International Conference on Artificial Intelligence Circuits and Systems
(AICAS), 2023
⋄ V. Jain, S. Giraldo, J. D. Roose, B. Boons, L. Mei, M. Verhelst, "TinyVers:
A 0.8-17 TOPS/W, 1.7 uW-20 mW, Tiny Versatile System-on-chip with
State-Retentive eMRAM for Machine Learning Inference at the Extreme
Edge," 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI
Technology and Circuits), 2022
⋄ M. Gomony, F Putter, A. Gebregiorgis, G. Paulin, L. Mei, V. Jain, S.
Hamdioui, V. Sanchez, T. Grosser, M. Geilen, M. Verhelst, F. Zenke, F.
Gurkaynak, B Bruin, S. Stuijk, S. Davidson, S. De, M. Ghogho, A Jimborean,
S. Eissa, L. Benini, D. Soudris, R. Bishnoi, S. Ainsworth, F. Corradi, O
Karrakchou, T. Güneysu, H. Corporaal, " CONVOLVE: Smart and seamless
design of smart edge processors," 2023 Design, Automation & Test in Europe
Conference & Exhibition (DATE), 2023

Patent

H. Liu, F. Wu, E. Dallard, L. Mei, H. E. Sumbul. "Bandwidth-aware flexible-


scheduling machine learning accelerator." U.S. Patent Application 17/553,726,
filed December 1, 2022.

Poster, seminar, and tutorial


⋄ L. Mei, P. Houshmand, A. Symons, M. Verhelst, "ZigZag: An Architecture-
Mapping Design Space Exploration Framework for Deep Learning Accelera-
tors," Flanders AI Research Days, 26 & 27 April 2021.
⋄ L. Mei, M. Verhelst, "Deep Neural Network Accelerator Design Space
Exploration for Embedded Device," MICAS Seminar, 3 April 2020.

⋄ L. Mei, M. Verhelst, "How to Build a Framework for Deep Learning


Accelerator Design Space Exploration," MICAS Seminar, 3 December 2021.
⋄ L. Mei, A. Symons, G. Paim, M. Verhelst, "ZigZag Framework and its
Extensions DeFiNES & Stream," 3-hour tutorial on EU CONVOLVE project
gathering, 9 & 10 February 2023.

⋄ A. Symons, L. Mei, G. Paim, M. Verhelst, "ZigZag Project," 4-hour tutorial


in 2023 IEEE International Symposium on Performance Analysis of Systems
and Software (ISPASS), 23 April 2023.

Supervised masters’ theses and interns’ projects

Masters’ theses:

⋄ W. Jiang, L. Mei, V. Jain, M. Verhelst, "CNN accelerator exploring


structural sparsity", 2019-2020.
⋄ J. Van Delm, V. Jain, L. Mei, M. Verhelst, "Deep learning compiler for
hardware accelerators and embedded devices", 2020-2021.
⋄ R. Heyrman, L. Mei, P. Houshmand, A. Symons, M. Verhelst, "Master
the fast hardware generation and evaluation tool chain for DL accelerator",
2020-2021.
⋄ L. Roofthooft, S. Colleman, L. Mei, M. Verhelst, "Hardware and mapping
design space exploration for accelerating dynamic spatial sparse neural
networks", 2021-2022.

⋄ T. Smeets, M. Shi, L. Mei, M. Verhelst, "On-chip compression of feature


map for embedded deep learning", 2022-2023.

Interns’ projects:

⋄ W. Li, L. Mei, V. Jain, M. Verhelst, "Design of hardware accelerator for


Deep Learning in Catapult HLS", 2020.
⋄ V. Jung, A. Symons, L. Mei, M. Verhelst, "Simulated annealing based
loop-ordering scheduler for DNN accelerators", 2021.

Teaching
⋄ Teaching assistant for a master course – Computer Architecture (B-KUL-
H05D3A), Exercise Session: building a 5-stage pipelined RISC-V processor,
2018-2023
Bibliography

[1] Google OR-Tools (2015). (Cited on 134)


[2] Abadal, S., Jain, A., Guirado, R., López-Alonso, J., and
Alarcón, E. Computing graph neural networks: A survey from
algorithms to accelerators. ACM Comput. Surv. 54, 9 (oct 2021). (Cited
on 55)
[3] Adarsh, P., Rathi, P., and Kumar, M. YOLO v3-Tiny: Object
Detection and Recognition using one stage improved model. In 2020 6th
International Conference on Advanced Computing and Communication
Systems (ICACCS) (2020), IEEE, pp. 687–694. (Cited on 251)

[4] Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused-
layer CNN accelerators. In 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO) (2016), pp. 1–12. (Cited on
202, 226)

[5] Alwani, M., Chen, H., Ferdman, M., and Milder, P. Fused-
layer cnn accelerators. In 2016 49th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO) (2016), pp. 1–12. (Cited on
233, 235)
[6] Amazon. Books, 1996-2023, Amazon.com. https://www.amazon.com/
books-used-books-textbooks/b?ie=UTF8&node=283155. (Cited on 3)
[7] Ambolt AI. Blog: Computer Vision – image classification and object
detection, 2016. https://ambolt.io/en/image-classification-and-
object-detection/. (Cited on 3)
[8] Amdahl, G. M. Validity of the single processor approach to achieving
large scale computing capabilities. In 1967 AFIPS Spring Joint Computer
Conference (1967), pp. 483–485. (Cited on 96)


[9] Ang, L. M., and Seng, K. P. GPU-Based Embedded Intelligence


Architectures and Applications. Electronics 10, 8 (2021). (Cited on 38)

[10] Austin, T. Keynote talk: Preparing for a Post Moore’s Law World. In The
48th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO) (2015). (Cited on 5)
[11] Bai, J., Lu, F., Zhang, K., et al. ONNX: Open Neural Network
Exchange. https://github.com/onnx/onnx, 2019. (Cited on 174, 239)

[12] Balasubramonian, R., Kahng, A. B., Muralimanohar, N.,


Shafiee, A., and Srinivas, V. Cacti 7: New tools for interconnect
exploration in innovative off-chip memories. ACM Trans. Archit. Code
Optim. 14, 2 (June 2017). (Cited on 160, 168, 218, 249)
[13] Banbury, C., Janapa Reddi, V., Torelli, P., Jeffries, N., Kiraly,
C., Holleman, J., Montino, P., Kanter, D., Warden, P., Pau, D.,
Thakker, U., torrini, a., cordaro, j., Di Guglielmo, G., Duarte,
J., Tran, H., Tran, N., wenxu, n., and xuesong, x. MLPerf Tiny
Benchmark. In Proceedings of the Neural Information Processing Systems
Track on Datasets and Benchmarks (2021), J. Vanschoren and S. Yeung,
Eds., vol. 1, Curran. (Cited on 170)

[14] Baugh, C., and Wooley, B. A two’s complement parallel array


multiplication algorithm. IEEE Transactions on Computers C-22, 12
(1973), 1045–1047. (Cited on 69)
[15] Bianco, S., Cadene, R., Celona, L., and Napoletano, P.
Benchmark analysis of representative deep neural network architectures.
IEEE Access 6 (2018), 64270–64277. (Cited on xxvii, 32, 33)
[16] Bouvier, M., Valentian, A., Mesquida, T., Rummens, F., Reyboz,
M., Vianello, E., and Beigne, E. Spiking neural networks hardware
implementations and challenges: A survey. J. Emerg. Technol. Comput.
Syst. 15, 2 (apr 2019). (Cited on 55)

[17] Burrello, A., Garofalo, A., Bruschi, N., Tagliavini, G., Rossi,
D., and Conti, F. DORY: Automatic End-to-End Deployment of Real-
World DNNs on Low-Cost IoT MCUs. IEEE Transactions on Computers
70, 8 (2021), 1253–1268. (Cited on 133, 134, 135)

[18] C. Zhang et. al. Optimizing FPGA-based accelerator design for deep
convolutional neural networks. In 2015 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (2015), pp. 161–170.
(Cited on 60)

[19] Cai, X., Wang, Y., and Zhang, L. Optimus: An Operator Fusion
Framework for Deep Neural Networks. ACM Trans. Embed. Comput. Syst.
(feb 2022). (Cited on 14, 15, 52, 202, 203, 226, 230)
[20] Camus, V. Design of approximate and precision-scalable circuits for
embedded multimedia and neural-network processing. Ph.D. dissertation,
EPFL, Lausanne, 2019. http://infoscience.epfl.ch/record/264984.
(Cited on 58)
[21] Camus, V., Enz, C., and Verhelst, M. Survey of Precision-Scalable
Multiply-Accumulate Units for Neural-Network Processing. In 2019 IEEE
1st International Conference on Artificial Intelligence Circuits and Systems
(AICAS) (March 2019). (Cited on xxix, 58, 62, 71, 87, 96, 97, 258)
[22] Camus, V., Mei, L., Enz, C., and Verhelst, M. Review and
benchmarking of precision-scalable multiply-accumulate unit architectures
for embedded neural-network processing. IEEE Journal on Emerging and
Selected Topics in Circuits and Systems 9, 4 (2019), 697–711. (Cited on
57, 109)
[23] Catthoor, F. Unified low-power design flow for data-dominated multi-
media and telecom applications. Springer, 2000. (Cited on 4, 262)
[24] Catthoor, F., and Danckaert, K. Data access and storage
management for embedded programmable processors. Springer Science &
Business Media, 2002. (Cited on 262)
[25] Catthoor, F., Wuytack, S., De Greef, G., Banica, F.,
Nachtergaele, L., and Vandecappelle, A. Custom memory
management methodology: Exploration of memory organisation for
embedded multimedia system design. Springer Science & Business Media,
1998. (Cited on 262)
[26] Cavalcante, M., Riedel, S., Pullini, A., and Benini, L.
MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency
Interconnect. In 2021 Design, Automation and Test in Europe Conference
and Exhibition (DATE) (2021), pp. 701–706. (Cited on 3)
[27] Cavigelli, L., and Benini, L. Origami: A 803-GOp/s/W Convolutional
Network Accelerator. In IEEE Transactions on Circuits and Systems for
Video Technology (TCSVT) (Nov 2017), pp. 2461–2475. (Cited on 58)
[28] Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan,
M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and
Krishnamurthy, A. TVM: An Automated End-to-End Optimizing
Compiler for Deep Learning. In Proceedings of the 13th USENIX

Conference on Operating Systems Design and Implementation (USA,


2018), OSDI’18, USENIX Association, p. 579–594. (Cited on 184)
[29] Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y.,
Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le,
Q. V. Symbolic Discovery of Optimization Algorithms. arXiv e-prints
(Feb. 2023), arXiv:2302.06675. (Cited on 32)
[30] Chen, Y., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An
Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
(Cited on xxxi, 40, 60, 142, 158, 163, 215, 232)
[31] Chollet, F. Xception: Deep learning with depthwise separable
convolutions. In 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2017), pp. 1800–1807. (Cited on 167)
[32] Colleman, S., Shi, M., and Verhelst, M. Coac: Cross-layer
optimization of accelerator configurability for efficient cnn processing.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
(2023), 1–14. (Cited on 172)
[33] Colleman, S., Verelst, T., Mei, L., Tuytelaars, T., and
Verhelst, M. Processor Architecture Optimization for Spatially
Dynamic Neural Networks. In 2021 IFIP/IEEE 29th International
Conference on Very Large Scale Integration (VLSI-SoC) (2021), pp. 1–6.
(Cited on 171)
[34] Colleman, S., and Verhelst, M. High-utilization, high-flexibility
depth-first CNN coprocessor for image pixel processing on FPGA. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 29, 3 (2021),
461–471. (Cited on 233)
[35] Colleman, S., Zhu, P., Sun, W., and Verhelst, M. Optimizing
accelerator configurability for mobile transformer networks. In 2022
IEEE 4th International Conference on Artificial Intelligence Circuits and
Systems (AICAS) (2022), pp. 142–145. (Cited on 171)
[36] Dai, S., Zhou, Y., Zhang, H., Ustun, E., Young, E. F., and
Zhang, Z. Fast and accurate estimation of quality of results in high-level
synthesis with machine learning. In 2018 IEEE 26th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM)
(2018), pp. 129–132. (Cited on 184)
[37] Dai, Z., Liu, H., Le, Q. V., and Tan, M. Coatnet: Marrying
convolution and attention for all data sizes. In Advances in Neural

Information Processing Systems (2021), M. Ranzato, A. Beygelzimer,


Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34, Curran Associates,
Inc., pp. 3965–3977. (Cited on 32)

[38] Dave, S., Baghdadi, R., Nowatzki, T., Avancha, S., Shrivastava,
A., and Li, B. Hardware acceleration of sparse and irregular tensor
computations of ml models: A survey and insights. Proceedings of the
IEEE 109, 10 (2021), 1706–1752. (Cited on 55)

[39] Dave, S., Kim, Y., Avancha, S., Lee, K., and Shrivastava, A.
Dmazerunner: Executing perfectly nested loops on dataflow accelerators.
ACM Trans. Embed. Comput. Syst. 18, 5s (2019). (Cited on 50, 133, 134,
135)
[40] De Greef, E., Catthoor, F., and De Man, H. Program
transformation strategies for memory size and power reduction of
pseudoregular multimedia subsystems. IEEE Transactions on Circuits
and Systems for Video Technology 8, 6 (1998), 719–733. (Cited on 262)
[41] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T. A fast and
elitist multiobjective genetic algorithm: Nsga-ii. IEEE transactions on
evolutionary computation 6, 2 (2002), 182–197. (Cited on 242)

[42] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE
conference on computer vision and pattern recognition (2009), Ieee, pp. 248–
255. (Cited on 32, 164)

[43] Ding, Y., Zhu, L., Jia, Z., Pekhimenko, G., and Han, S. IOS:
Inter-operator scheduler for cnn acceleration. Proceedings of Machine
Learning and Systems 3 (2021), 167–180. (Cited on 16, 17, 52, 235)
[44] Dong, C., Loy, C. C., and Tang, X. Accelerating the super-resolution
convolutional neural network. In European conference on computer vision
(2016), Springer, pp. 391–407. (Cited on xxxiii, 210, 215, 218, 247)

[45] Dong, C., Loy, C. C., and Tang, X. Accelerating the super-resolution
convolutional neural network. In European conference on computer vision
(2016), Springer, pp. 391–407. (Cited on 251)
[46] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng,
X., Chen, Y., and Temam, O. Shidiannao: Shifting vision processing
closer to the sensor. In 2015 ACM/IEEE 42nd Annual International
Symposium on Computer Architecture (ISCA) (June 2015), pp. 92–104.
(Cited on 60)

[47] Du, Z., Fasthuber, R., Chen, T., Ienne, P., Li, L., Luo, T., Feng,
X., Chen, Y., and Temam, O. Shidiannao: Shifting vision processing
closer to the sensor. In Proceedings of the 42nd Annual International
Symposium on Computer Architecture (2015), pp. 92–104. (Cited on 232)
[48] Gajski, D., Vahid, F., Narayan, S., and Gong, J. Specsyn:
an environment supporting the specify-explore-refine paradigm for
hardware/software system design. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems 6, 1 (1998), 84–100. (Cited on 262)
[49] Garofalo, A., Ottavi, G., Conti, F., Karunaratne, G., Boybat,
I., Benini, L., and Rossi, D. A heterogeneous in-memory computing
cluster for flexible end-to-end inference of real-world deep neural networks.
arXiv preprint arXiv:2201.01089 (2022). (Cited on 232)
[50] Ghodrati, S., Ahn, B. H., Kyung Kim, J., Kinzer, S., Yatham,
B. R., Alla, N., Sharma, H., Alian, M., Ebrahimi, E., Kim,
N. S., Young, C., and Esmaeilzadeh, H. Planaria: Dynamic
architecture fission for spatial multi-tenant acceleration of deep neural
networks. In 2020 53rd Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO) (2020), pp. 681–697. (Cited on 232, 234)
[51] Ghodrati, S., Sharma, H., Young, C., Kim, N. S., and
Esmaeilzadeh, H. Bit-parallel vector composability for neural
acceleration. In 2020 57th ACM/IEEE Design Automation Conference
(DAC) (2020), IEEE, pp. 1–6. (Cited on 100, 104, 109, 110, 129)
[52] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., and
Keutzer, K. A Survey of Quantization Methods for Efficient Neural
Network Inference. CoRR abs/2103.13630 (2021). (Cited on 7)
[53] Goetschalckx, K., and Verhelst, M. Breaking High-Resolution
CNN Bandwidth Barriers With Enhanced Depth-First Execution. IEEE
Journal on Emerging and Selected Topics in Circuits and Systems 9, 2
(2019), 323–331. (Cited on 202, 233, 235)
[54] Goetschalckx, K., and Verhelst, M. DepFiN: A 12nm, 3.8TOPs
depth-first CNN processor for high res. image processing. In 2021
Symposium on VLSI Circuits (2021), pp. 1–2. (Cited on 3, 202, 215,
226, 233, 235, 246, 247)
[55] Guttman, A. R-trees: A dynamic index structure for spatial searching.
In Proceedings of the 1984 ACM SIGMOD International Conference
on Management of Data (New York, NY, USA, 1984), SIGMOD ’84,
Association for Computing Machinery, p. 47–57. (Cited on xxxiv, 236,
240, 241)

[56] Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang,
Y. Dynamic neural networks: A survey. IEEE Transactions on Pattern
Analysis and Machine Intelligence 44, 11 (nov 2022), 7436–7456. (Cited
on 55)
[57] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for
Image Recognition. arXiv e-prints (Dec. 2015), arXiv:1512.03385. (Cited
on xxvii, xxxi, 29, 30, 31, 158, 159, 167, 248, 249, 251)
[58] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for
Image Recognition. arXiv e-prints (Dec. 2015), arXiv:1512.03385. (Cited
on 218)
[59] Hegde, K., Tsai, P.-A., Huang, S., Chandra, V., Parashar,
A., and Fletcher, C. W. Mind Mappings: Enabling Efficient
Algorithm-Accelerator Mapping Space Search. In Proceedings of the 26th
ACM International Conference on Architectural Support for Programming
Languages and Operating Systems (New York, NY, USA, 2021), ASPLOS
’21, Association for Computing Machinery, p. 943–958. (Cited on 214,
236, 241)
[60] Hirtzlin, T., Bocquet, M., Penkovsky, B., Klein, J.-O., Nowak,
E., Vianello, E., Portal, J.-M., and Querlioz, D. Digital
biologically plausible implementation of binarized neural networks
with differential hafnium oxide resistive memory arrays. Frontiers in
Neuroscience 13 (2020). (Cited on 3)
[61] Houshmand, P., Cosemans, S., Mei, L., Papistas, I., Bhattachar-
jee, D., Debacker, P., Mallik, A., Verkest, D., and Verhelst, M.
Opportunities and Limitations of Emerging Analog in-Memory Compute
DNN Architectures. In 2020 IEEE International Electron Devices Meeting
(IEDM) (2020), pp. 29.1.1–29.1.4. (Cited on 170)
[62] Houshmand, P., Sarda, G. M., Jain, V., Ueyoshi, K., Papistas,
I. A., Shi, M., Zheng, Q., Bhattacharjee, D., Mallik, A.,
Debacker, P., Verkest, D., and Verhelst, M. DIANA: An End-
to-End Hybrid DIgital and ANAlog Neural Network SoC for the Edge.
IEEE Journal of Solid-State Circuits 58, 1 (2023), 203–215. (Cited on 3,
39)
[63] Houshmand, P., Sun, J., and Verhelst, M. Benchmarking and
modeling of analog and digital SRAM in-memory computing architectures.
In 2023 tinyML Research Symposium (2023). (Cited on 170)
[64] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L., Tan,
M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., and

Le, Q. Searching for mobilenetv3. In 2019 IEEE/CVF International


Conf. on Computer Vision (ICCV) (2019), pp. 1314–1324. (Cited on 167)
[65] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W.,
Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications. arXiv
e-prints (Apr. 2017), arXiv:1704.04861. (Cited on xxvii, 31, 167, 218)
[66] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W.,
Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient
Convolutional Neural Networks for Mobile Vision Applications, 2017.
(Cited on 28)
[67] Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(2018), pp. 7132–7141. (Cited on 167)
[68] Huang, C.-T., Ding, Y.-C., Wang, H.-C., Weng, C.-W., Lin,
K.-P., Wang, L.-W., and Chen, L.-D. ECNN: A Block-Based and
Highly-Parallel CNN Accelerator for Edge Inference. In Proceedings of the
52nd Annual IEEE/ACM International Symposium on Microarchitecture
(New York, NY, USA, 2019), MICRO ’52, Association for Computing
Machinery, p. 182–195. (Cited on 226)
[69] Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten,
L. Densely connected convolutional networks. arxiv 2016. arXiv preprint
arXiv:1608.06993 1608 (2016). (Cited on 167)
[70] Huang, Q., Kalaiah, A., Kang, M., Demmel, J., Dinh, G.,
Wawrzynek, J., Norell, T., and Shao, Y. S. Cosa: Scheduling by
constrained optimization for spatial accelerators. In 2021 ACM/IEEE
48th Annual International Symposium on Computer Architecture (ISCA)
(2021), pp. 554–566. (Cited on 236, 241)
[71] Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and
Bengio, Y. Quantized neural networks: Training neural networks with
low precision weights and activations. The Journal of Machine Learning
Research 18, 1 (2017), 6869–6898. (Cited on 58)
[72] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K.,
Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy
with 50x fewer parameters and< 0.5 mb model size. arXiv preprint
arXiv:1602.07360 (2016). (Cited on 251)
[73] Ibrahim, E. M., Mei, L., and Verhelst, M. Taxonomy and
benchmarking of precision-scalable mac arrays under enhanced dnn

dataflow representation. IEEE Transactions on Circuits and Systems I:


Regular Papers 69, 5 (2022), 2013–2024. (Cited on 99)
[74] Im, D., Han, D., Choi, S., Kang, S., and Yoo, H.-J. DT-CNN: An
Energy-Efficient Dilated and Transposed Convolutional Neural Network
Processor for Region of Interest Based Image Segmentation. IEEE TCAS
I: Regular Papers 67, 10 (2020), 3471–3483. (Cited on 202, 226)
[75] Intel. Accelerate Artificial Intelligence (AI) Workloads
with Intel Advanced Matrix Extensions (Intel AMX), 2022.
https://www.intel.com/content/dam/www/central-libraries/
us/en/documents/2022-12/accelerate-ai-with-amx-sb.pdf. (Cited
on 37)
[76] J. Qiu et. al. Going deeper with embedded fpga platform for
convolutional neural network. In 2016 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (2016), FPGA ’16, pp. 26–
35. (Cited on 60)
[77] Jain, V., Giraldo, S., Roose, J. D., Boons, B., Mei, L., and
Verhelst, M. TinyVers: A 0.8-17 TOPS/W, 1.7 uW-20 mW, Tiny
Versatile System-on-chip with State-Retentive eMRAM for Machine
Learning Inference at the Extreme Edge. In 2022 IEEE Symposium
on VLSI Technology and Circuits (VLSI Technology and Circuits) (2022),
pp. 20–21. (Cited on 215)
[78] Jain, V., Giraldo, S., Roose, J. D., Mei, L., Boons, B., and
Verhelst, M. Tinyvers: A tiny versatile system-on-chip with state-
retentive emram for ml inference at the extreme edge. IEEE Journal of
Solid-State Circuits (2023), 1–12. (Cited on 3, 39, 42)
[79] Janssen, M., Catthoor, F., and de Man, H. A specification invariant
technique for regularity improvement between flow-graph clusters. In
Proceedings of the 1996 European Conference on Design and Test (USA,
1996), EDTC ’96, IEEE Computer Society, p. 138. (Cited on 55, 262)
[80] Jia, H., Ozatay, M., Tang, Y., Valavi, H., Pathak, R., Lee, J.,
and Verma, N. Scalable and programmable neural network inference
accelerator based on in-memory computing. IEEE Journal of Solid-State
Circuits 57, 1 (2022), 198–211. (Cited on 232, 234, 235, 246, 247)
[81] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal,
G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A.,
et al. In-datacenter performance analysis of a tensor processing unit. In
Proceedings of the 44th Annual International Symposium on Computer
Architecture (2017), pp. 1–12. (Cited on 39, 216, 232)

[82] Judd, P., Albericio, J., Hetherington, T., Aamodt, T. M., and
Moshovos, A. Stripes: Bit-serial deep neural network computing. In 2016
49th Annual IEEE/ACM International Symposium on Microarchitecture
(MICRO) (2016), IEEE, pp. 1–12. (Cited on 8, 10, 109, 110, 129)
[83] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M.,
Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A.,
Potapenko, A., et al. Highly accurate protein structure prediction
with alphafold. Nature 596, 7873 (2021), 583–589. (Cited on 1, 3)

[84] Jung, V. J. B., Symons, A., Mei, L., Verhelst, M., and Benini,
L. SALSA: Simulated Annealing based Loop-Ordering Scheduler for
DNN Accelerators. In 2023 IEEE International Conference on Artificial
Intelligence Circuits and Systems (AICAS) (2023). (Cited on 169)

[85] K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, S. Han, Y. Xie, P.


Debacker, M. Verhelst, Y. Wang. Neural Network Accelerator
Comparison, 2018-2023. [Online]. Available: https://nicsefc.ee.
tsinghua.edu.cn/projects/neural-network-accelerator/. (Cited
on xxvii, 38)
[86] Kao, S.-C., Huang, X., and Krishna, T. DNNFuser: Generative
Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in
DNN Accelerators. arXiv e-prints (Jan. 2022), arXiv:2201.11218. (Cited
on 14, 203, 226, 230)
[87] Kao, S.-C., and Krishna, T. GAMMA: Automating the HW Mapping
of DNN Models on Accelerators via Genetic Algorithm. In Proceedings
of the 39th International Conference on Computer-Aided Design (New
York, NY, USA, 2020), ICCAD ’20, Association for Computing Machinery.
(Cited on 214, 236, 241, 242)
[88] Kao, S.-C., Kwon, H., Pellauer, M., Parashar, A., and Krishna,
T. A formalism of dnn accelerator flexibility. Proc. ACM Meas. Anal.
Comput. Syst. 6, 2 (jun 2022). (Cited on 37)
[89] Kjeldsberg, P., Catthoor, F., and Aas, E. Detection of partially
simultaneously alive signals in storage requirement estimation for data
intensive applications. In Proceedings of the 38th Design Automation
Conference (IEEE Cat. No.01CH37232) (2001), pp. 365–370. (Cited on
55, 262)

[90] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
classification with deep convolutional neural networks. In Advances in
Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012,
pp. 1097–1105. (Cited on xxxi, 1, 158, 159, 167)

[91] Kwon, H., Chatarasi, P., Pellauer, M., Parashar, A., Sarkar,
V., and Krishna, T. Understanding reuse, performance, and hardware
cost of dnn dataflow: A data-centric approach. In Proceedings of the
52nd Annual IEEE/ACM International Symposium on Microarchitecture
(New York, NY, USA, 2019), MICRO ’52, Association for Computing
Machinery, pp. 754–768. (Cited on 13, 133, 134, 135, 184, 202, 214, 236)

[92] Kwon, H., Lai, L., Pellauer, M., Krishna, T., Chen, Y.-H.,
and Chandra, V. Heterogeneous dataflow accelerators for multi-dnn
workloads. In 2021 IEEE International Symposium on High-Performance
Computer Architecture (HPCA) (2021), pp. 71–83. (Cited on 16, 17, 232,
234, 251)

[93] Lee, E., Han, T., Seo, D., Shin, G., Kim, J., Kim, S., Jeong,
S., Rhe, J., Park, J., Ko, J. H., and Lee, Y. A charge-domain
scalable-weight in-memory computing macro with dual-sram architecture
for precision-scalable dnn accelerators. IEEE Transactions on Circuits
and Systems I: Regular Papers 68, 8 (2021), 3305–3316. (Cited on 100)

[94] Lee, J., Kim, C., Kang, S., Shin, D., Kim, S., and Yoo, H. J.
UNPU: A 50.6TOPS/W unified deep neural network accelerator with
1b-to-16b fully-variable weight bit-precision. In 2018 IEEE International
Solid-State Circuits Conference (ISSCC) (Feb. 2018), pp. 218–220. (Cited
on xxix, 7, 8, 10, 40, 58, 62, 70, 87, 97, 100, 109, 110, 115, 258)

[95] Lee, J., Shin, D., Lee, J., Lee, J., Kang, S., and Yoo, H.-J. A Full
HD 60 fps CNN Super Resolution Processor with Selective Caching based
Layer Fusion for Mobile Devices. In 2019 Symposium on VLSI Circuits
(2019), pp. C302–C303. (Cited on 202, 226)
[96] Leviathan, Y., and Matias, Y. Google Duplex: An AI system for
accomplishing real-world tasks over the phone. Google AI Blog, 2018.
(Cited on 1)
[97] Liao, H., Tu, J., Xia, J., Liu, H., Zhou, X., Yuan, H., and Hu, Y.
Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural
Network Computing : Industry Track Paper. In 2021 IEEE International
Symposium on High-Performance Computer Architecture (HPCA) (2021),
pp. 789–801. (Cited on 216)

[98] Lin, C.-H., Cheng, C.-C., Tsai, Y.-M., Hung, S.-J., Kuo, Y.-T.,
Wang, P. H., Tsung, P.-K., Hsu, J.-Y., Lai, W.-C., Liu, C.-H.,
Wang, S.-Y., Kuo, C.-H., Chang, C.-Y., Lee, M.-H., Lin, T.-Y.,
and Chen, C.-C. 7.1 A 3.4-to-13.3TOPS/W 3.6TOPS Dual-Core Deep-
Learning Accelerator for Versatile AI Applications in 7nm 5G Smartphone
SoC. In 2020 IEEE International Solid-State Circuits Conference
(ISSCC) (2020), pp. 134–136. (Cited on 226)
[99] Liu, H., Wu, F., Dallard, E., Mei, L., and Sumbul, H. E.
Bandwidth-aware flexible-scheduling machine learning accelerator, Dec. 1
2022. US Patent App. 17/553,726. (Cited on 3, 170)

[100] Liu, J., Sun, J., Xu, Z., and Sun, G. Latency-aware automatic cnn
channel pruning with gpu runtime analysis. BenchCouncil Transactions
on Benchmarks, Standards and Evaluations 1, 1 (2021), 100009. (Cited
on xxvii, 32, 33)
[101] Liu, L., Zhu, J., Li, Z., Lu, Y., Deng, Y., Han, J., Yin, S., and
Wei, S. A survey of coarse-grained reconfigurable architecture and design:
Taxonomy, challenges, and applications. ACM Comput. Surv. 52, 6 (oct
2019). (Cited on 38)
[102] Liu, X., Ounifi, H.-A., Gherbi, A., Li, W., and Cheriet, M. A
hybrid gpu-fpga based design methodology for enhancing machine learning
applications performance. Journal of Ambient Intelligence and Humanized
Computing 11 (2020), 2309–2323. (Cited on 39)
[103] Liu, Z., Leng, J., Chen, Q., Li, C., Zheng, W., Li, L., and
Guo, M. DLFusion: An Auto-Tuning Compiler for Layer Fusion
on Deep Neural Network Accelerator. In 2020 IEEE Intl Conf on
Parallel & Distributed Processing with Applications, Big Data & Cloud
Computing, Sustainable Computing & Communications, Social Computing
& Networking (ISPA/BDCloud/SocialCom/SustainCom) (2020), IEEE,
pp. 118–127. (Cited on 233)
[104] Louis, M. S., Azad, Z., Delshadtehrani, L., Gupta, S., Warden,
P., Reddi, V. J., and Joshi, A. Towards Deep Learning using
TensorFlow Lite on RISC-V, 2019. (Cited on 37)
[105] Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W.,
Yang, F., Zhang, L., and Zhou, L. Rammer: Enabling holistic deep
learning compiler optimizations with rTasks. In 14th USENIX Symposium
on Operating Systems Design and Implementation (OSDI 20) (Nov. 2020),
USENIX Association, pp. 881–897. (Cited on 16, 17)

[106] Anasuri, M. IoT and Intelligent Transport Systems, 2019.
https://www.linkedin.com/pulse/iot-intelligent-transport-
systems-mahesh-anasuri/. (Cited on 3)
[107] Mei, L., Dandekar, M., Rodopoulos, D., Constantin, J.,
Debacker, P., Lauwereins, R., and Verhelst, M. Sub-word parallel
precision-scalable MAC engines for efficient embedded DNN inference. In
2019 IEEE 1st International Conference on Artificial Intelligence Circuits
and Systems (AICAS) (March 2019). (Cited on xxix, 8, 10, 57, 58, 59, 62,
70, 86, 96, 109, 110, 111, 129, 258)
[108] Mei, L., Goetschalckx, K., Symons, A., and Verhelst, M.
DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space
for DNN Accelerators through Analytical Modeling. In 2023 IEEE
International Symposium on High-Performance Computer Architecture
(HPCA) (2023), pp. 570–583. (Cited on 3, 201, 236)
[109] Mei, L., Houshmand, P., Jain, V., Giraldo, S., and Verhelst, M.
ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration
for DNN Accelerators. IEEE Transactions on Computers (TComp) 70, 8
(2021), 1160–1174. (Cited on 50, 131, 184, 194, 202, 214, 236, 241, 242)
[110] Mei, L., Liu, H., Wu, T., Sumbul, H. E., Verhelst, M., and
Beigne, E. A uniform latency model for DNN accelerators with diverse
architectures and dataflows. In 2022 Design, Automation and Test in
Europe Conference and Exhibition (DATE) (2022), IEEE, pp. 220–225.
(Cited on 3, 183, 214, 242)
[111] Min, F., Xu, H., Wang, Y., Wang, Y., Li, J., Zou, X., Li, B.,
and Han, Y. Dadu-Eye: A 5.3 TOPS/W, 30 fps/1080p High Accuracy
Stereo Vision Accelerator. IEEE Transactions on Circuits and Systems I:
Regular Papers 68, 10 (2021), 4207–4220. (Cited on 202, 226)

[112] MIT Management Sloan School. Business Implications of Extended
Reality (XR): Harnessing the Value of AR, VR, Metaverse, and More,
2023. https://executive.mit.edu/course/business-implications-
of-extended-reality-xr/a054v00000rHDUAAA4.html. (Cited on 3)

[113] Mo, H., Zhu, W., Hu, W., Li, Q., Li, A., Yin, S., Wei, S., and
Liu, L. A 12.1 TOPS/W Quantized Network Acceleration Processor
With Effective-Weight-Based Convolution and Error-Compensation-Based
Prediction. IEEE Journal of Solid-State Circuits 57, 5 (2022), 1542–1557.
(Cited on 202, 226)
[114] Moons, B., Bankman, D., Yang, L., Murmann, B., and Verhelst,
M. Binareye: An always-on energy-accuracy-scalable binary cnn processor
with all memory on chip in 28nm cmos. In 2018 IEEE Custom Integrated
Circuits Conference (CICC) (2018), IEEE, pp. 1–4. (Cited on 39)
[115] Moons, B., Brabandere, B. D., Gool, L. V., and Verhelst, M.
Energy-efficient convnets through approximate computing. In 2016 IEEE
Winter Conference on Applications of Computer Vision (WACV) (March
2016). (Cited on 7, 58, 64)
[116] Moons, B., Uytterhoeven, R., Dehaene, W., and Verhelst, M.
14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-
accuracy-frequency-scalable Convolutional Neural Network processor in
28nm FDSOI. In 2017 IEEE International Solid-State Circuits Conference
(ISSCC) (2017), IEEE, pp. 246–247. (Cited on xxxi, 7, 8, 10, 60, 69, 104,
110, 111, 142, 158, 232)
[117] Moons, B., Uytterhoeven, R., Dehaene, W., and Verhelst, M.
DVAFS: Trading computational accuracy for energy through dynamic-
voltage-accuracy-frequency-scaling. In Design, Automation and Test in
Europe (DATE), 2017 IEEE Conference (March 2017), pp. 488–493.
(Cited on xxix, 58, 62, 86, 109)
[118] Moons, B., and Verhelst, M. A 0.3-2.6 TOPS/W precision-scalable
processor for real-time large-scale ConvNets. In 2016 IEEE Symposium
on VLSI Circuits (VLSI-Circuits) (June 2016), pp. 1–2. (Cited on 75)

[119] Muñoz-Martínez, F., Abellán, J. L., Acacio, M. E., and Krishna,
T. STONNE: Enabling Cycle-Level Microarchitectural Simulation for
DNN Inference Accelerators. In 2021 IEEE International Symposium on
Workload Characterization (IISWC) (2021), pp. 201–213. (Cited on 184)
[120] Jouppi, N. P., et al. In-datacenter performance analysis of a tensor
processing unit. SIGARCH Comput. Archit. News 45, 2 (June 2017), 1–12.
(Cited on 60)
[121] Wu, Y. N., Tsai, P.-A., Parashar, A., Sze, V., and Emer,
J. S. Sparseloop: An Analytical Approach To Sparse Tensor Accelerator
Modeling. arXiv e-prints (May 2022), arXiv:2205.05826. (Cited on 214)

[122] Nwankpa, C., Ijomah, W., Gachagan, A., and Marshall, S.
Activation functions: Comparison of trends in practice and research for
deep learning. arXiv preprint arXiv:1811.03378 (2018). (Cited on 25)
[123] OpenAI. GPT-3.5, 2021. https://openai.com/research/. (Cited on
1)

[124] OpenAI. GPT-4 Technical Report. arXiv e-prints (Mar. 2023),
arXiv:2303.08774. (Cited on 1)
[125] Papistas, I. A., Cosemans, S., Rooseleer, B., Doevenspeck, J.,
Na, M.-H., Mallik, A., Debacker, P., and Verkest, D. A 22 nm,
1540 TOP/s/W, 12.1 TOP/s/mm² in-memory analog matrix-vector-multiplier
for dnn acceleration. In 2021 IEEE Custom Integrated Circuits Conference
(CICC) (2021), pp. 1–2. (Cited on 39)
[126] Parashar, A., Raina, P., Shao, Y. S., Chen, Y., Ying, V. A.,
Mukkara, A., Venkatesan, R., Khailany, B., Keckler, S. W.,
and Emer, J. Timeloop: A systematic approach to dnn accelerator
evaluation. In 2019 IEEE International Symposium on Performance
Analysis of Systems and Software (ISPASS) (2019), pp. 304–315. (Cited
on xxxi, 13, 50, 133, 134, 158, 159, 184, 214, 236, 241, 242)
[127] Phothilimthana, M., et al. Learned TPU cost model for XLA tensor
programs. In Workshop on ML for Systems at NeurIPS (2019). (Cited
on 184)
[128] Radway, R., Bartolo, A., Jolly, P., Khan, Z., Le, B., Tandon,
P., Wu, T., Xin, Y., Vianello, E., Vivet, P., Nowak, E., Wong,
H.-S. P., Sabry, M., Beigne, E., Wootters, M., and Mitra, S.
Illusion of large on-chip memory by networked computing chips for neural
network inference. Nature Electronics 4 (01 2021). (Cited on 232, 234,
243)
[129] Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F.,
and Amarasinghe, S. Halide: A language and compiler for optimizing
parallelism, locality, and recomputation in image processing pipelines.
SIGPLAN Not. 48, 6 (jun 2013), 519–530. (Cited on 13)

[130] Raghavan, P., Lambrechts, A., Jayapala, M., Catthoor, F.,
Verkest, D., and Corporaal, H. Very Wide Register: An asymmetric
register file organization for low power embedded processors. In 2007
Design, Automation and Test in Europe Conference and Exhibition (2007),
pp. 1–6. (Cited on 42)

[131] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford,
A., Chen, M., and Sutskever, I. Zero-Shot Text-to-Image Generation.
arXiv e-prints (Feb. 2021), arXiv:2102.12092. (Cited on 1)
[132] Redmon, J., and Farhadi, A. YOLO9000: Better, Faster, Stronger.
arXiv e-prints (Dec. 2016), arXiv:1612.08242. (Cited on xxxii, 163)

[133] Ryu, S., Kim, H., Yi, W., and Kim, J.-J. Bitblade: Area and
energy-efficient precision-scalable neural network accelerator with bitwise
summation. In Proceedings of the 56th Annual Design Automation
Conference 2019 (2019), pp. 1–6. (Cited on 8, 10, 58, 68, 100, 104,
106, 109, 110, 129)

[134] Sakr, C., and Shanbhag, N. An analytical method to determine
minimum per-layer precision of deep neural networks. In 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (April 2018), pp. 1090–1094. (Cited on 58)
[135] Sakurai, T., and Newton, A. R. Alpha-power law MOSFET model
and its applications to CMOS inverter delay and other formulas. IEEE
Journal of Solid-State Circuits (JSSC) 25 (April 1990). (Cited on
76)
[136] Samajdar, A., Joseph, J. M., Zhu, Y., Whatmough, P., Mattina,
M., and Krishna, T. A Systematic Methodology for Characterizing
Scalability of DNN Accelerators using SCALE-Sim. In 2020 IEEE
International Symposium on Performance Analysis of Systems and
Software (ISPASS) (2020), IEEE, pp. 58–68. (Cited on 184)
[137] Samavedam, S. B., Ryckaert, J., Beyne, E., Ronse, K., Horiguchi,
N., Tokei, Z., Radu, I., Bardon, M. G., Na, M. H., Spessot, A.,
and Biesemans, S. Future logic scaling: Towards atomic channels and
deconstructed chips. In 2020 IEEE International Electron Devices Meeting
(IEDM) Opening Keynote (2020), pp. 1.1.1–1.1.10. (Cited on 3)
[138] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen,
L. Mobilenetv2: Inverted residuals and linear bottlenecks. In 2018
IEEE/CVF Conf. on Computer Vision and Pattern Recognition (2018),
pp. 4510–4520. (Cited on xxvii, 31, 167, 251)
[139] Seshadri, K., Akin, B., Laudon, J., Narayanaswami, R., and
Yazdanbakhsh, A. An evaluation of edge tpu accelerators for
convolutional neural networks. In 2022 IEEE International Symposium
on Workload Characterization (IISWC) (2022), pp. 79–91. (Cited on 40,
216)
[140] Shalf, J. The future of computing beyond Moore’s law. Philosophical
Transactions of the Royal Society A 378, 2166 (2020), 20190061. (Cited
on 5)
[141] Shao, Y. S., Clemons, J., Venkatesan, R., Zimmer, B., Fojtik,
M., Jiang, N., Keller, B., Klinefelter, A., Pinckney, N., Raina,
P., Tell, S. G., Zhang, Y., Dally, W. J., Emer, J., Gray, C. T.,
Khailany, B., and Keckler, S. W. Simba: Scaling deep-learning
inference with multi-chip-module-based architecture. In Proceedings of the
52nd Annual IEEE/ACM International Symposium on Microarchitecture
(New York, NY, USA, 2019), MICRO ’52, Association for Computing
Machinery, pp. 14–27. (Cited on 232, 234)
[142] Sharify, S., Lascorz, A. D., Siu, K., Judd, P., and Moshovos,
A. Loom: Exploiting weight and activation precisions to accelerate
convolutional neural networks. In 2018 55th ACM/ESDA/IEEE Design
Automation Conference (DAC) (2018), IEEE, pp. 1–6. (Cited on xxix, 8,
10, 58, 62, 72, 87, 97, 100, 109, 110, 129, 258)
[143] Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chandra, V.,
and Esmaeilzadeh, H. Bit fusion: Bit-level dynamically composable
architecture for accelerating deep neural network. In 2018 ACM/IEEE
45th Annual International Symposium on Computer Architecture (ISCA)
(2018), IEEE, pp. 764–775. (Cited on xxix, 8, 10, 58, 62, 66, 85, 93, 96,
100, 101, 104, 106, 109, 110, 129, 258)
[144] Shi, M., Colleman, S., VanDeMieroop, C., Joseph, A., Meijer,
M., Dehaene, W., and Verhelst, M. Cmds: Cross-layer dataflow
optimization for dnn accelerators exploiting multi-bank memories. In
2023 24th International Symposium on Quality Electronic Design (ISQED)
(2023), pp. 1–8. (Cited on 171)
[145] Shi, M., Houshmand, P., Mei, L., and Verhelst, M. Hardware-
Efficient Residual Neural Network Execution in Line-Buffer Depth-First
Processing. IEEE Journal on Emerging and Selected Topics in Circuits
and Systems 11, 4 (2021), 690–700. (Cited on 15, 226)
[146] Shin, D., Lee, J., Lee, J., and Yoo, H.-J. 14.2 DNPU: An
8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose
deep neural networks. In 2017 IEEE International Solid-State Circuits
Conference (ISSCC) (2017), IEEE, pp. 240–241. (Cited on xxviii, 8, 10,
58, 62, 64, 75, 84, 96, 100, 109, 110, 129, 232, 258)
[147] Shun, Z., Pfander, O. A., Pfleiderer, H.-J., and Bermak, A.
A vlsi architecture for a run-time multi-precision reconfigurable booth
multiplier. In 2007 14th IEEE International Conference on Electronics,
Circuits and Systems (2007), IEEE, pp. 975–978. (Cited on 101)
[148] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
van den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D.,
Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach,
M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering
the game of go with deep neural networks and tree search. Nature 529
(2016), 484–503. (Cited on 1)
[149] Simonyan, K., and Zisserman, A. Very Deep Convolutional
Networks for Large-Scale Image Recognition. arXiv e-prints (Sept. 2014),
arXiv:1409.1556. (Cited on xxvii, 31, 32, 45)

[150] Song, M., Zhang, J., Chen, H., and Li, T. Towards efficient
microarchitectural design for accelerating unsupervised gan-based deep
learning. In 2018 IEEE International Symposium on High Performance
Computer Architecture (HPCA) (Feb 2018), pp. 66–77. (Cited on 60)

[151] Sumbul, H. E., Wu, T. F., Li, Y., Sarwar, S. S., Koven, W.,
Murphy-Trotzky, E., Cai, X., Ansari, E., Morris, D. H., Liu,
H., Kim, D., and Beigne, E. System-Level Design
and Integration of a Prototype AR/VR Hardware Featuring a Custom
Low-Power DNN Accelerator Chip in 7nm Technology for Codec Avatars.
In 2022 IEEE Custom Integrated Circuits Conference (CICC) (2022),
pp. 01–08. (Cited on 3, 40, 183, 194, 200, 215, 216)
[152] Symons, A., Mei, L., Colleman, S., Houshmand, P., Karl, S.,
and Verhelst, M. Towards Heterogeneous Multi-core Accelerators
Exploiting Fine-grained Scheduling of Layer-Fused Deep Neural Networks.
arXiv e-prints (Dec. 2022), arXiv:2212.10612. (Cited on 3, 52, 231)

[153] Symons, A., Mei, L., and Verhelst, M. LOMA: Fast Auto-Scheduling
on DNN Accelerators through Loop-Order-based Memory Allocation. In
2021 IEEE International Conference on Artificial Intelligence Circuits
and Systems (AICAS) (2021), pp. 1–4. (Cited on 169, 214, 242)
[154] Syu, N.-S., Chen, Y.-S., and Chuang, Y.-Y. Learning Deep
Convolutional Networks for Demosaicing. arXiv e-prints (Feb. 2018),
arXiv:1802.03769. (Cited on 218)
[155] Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4,
inception-resnet and the impact of residual connections on learning. In
Proc. of AAAI Conf. on Artificial Intelligence, 2017 (2017). (Cited on
xxvii, 30, 31, 167)
[156] Chen, T., et al. DianNao: A small-footprint high-throughput accelerator
for ubiquitous machine-learning. SIGPLAN Not. 49, 4 (Feb. 2014), 269–
284. (Cited on 60)
[157] Talpes, E., Sarma, D. D., Venkataramanan, G., Bannon, P.,
McGee, B., Floering, B., Jalote, A., Hsiong, C., Arora, S.,
Gorti, A., and Sachdev, G. S. Compute Solution for Tesla’s Full
Self-Driving Computer. IEEE Micro 40, 2 (2020), 25–35. (Cited on 39,
216)
[158] Tang, T., et al. NeuroMeter: An Integrated Power, Area, and Timing
Modeling Framework for Machine Learning Accelerators Industry Track
Paper. In IEEE HPCA (2021). (Cited on 184)
[159] Tarnawski, J. M., Phanishayee, A., Devanur, N., Mahajan, D.,
and Nina Paravecino, F. Efficient algorithms for device placement of
dnn graph operators. Advances in Neural Information Processing Systems
33 (2020), 15451–15463. (Cited on 16, 17)
[160] Theis, T. N., and Wong, H.-S. P. The End of Moore’s Law: A
New Beginning for Information Technology. Computing in Science &
Engineering 19, 2 (2017), 41–50. (Cited on 4)
[161] Tronçon, R., Bruynooghe, M., Janssens, G., and Catthoor, F.
Storage size reduction by in-place mapping of arrays. In International
Workshop on Verification, Model Checking, and Abstract Interpretation
(2002), Springer, pp. 167–181. (Cited on 55)
[162] TVM RFCs. Arm® Ethos™-U Cascading Scheduler. (Cited on 202, 233, 235)
[163] Ueyoshi, K., Ando, K., Hirose, K., Takamaeda-Yamazaki, S.,
Kadomoto, J., Miyata, T., Hamada, M., Kuroda, T., and
Motomura, M. QUEST: A 7.49TOPS multi-purpose log-quantized DNN
inference engine stacked on 96MB 3D SRAM using inductive-coupling
technology in 40nm CMOS. In 2018 IEEE International Solid-State
Circuits Conference (ISSCC) (Feb. 2018), pp. 216–218. (Cited on 58, 70,
97, 258)
[164] Ueyoshi, K., Papistas, I. A., Houshmand, P., Sarda, G. M., Jain,
V., Shi, M., Zheng, Q., Giraldo, S., Vrancx, P., Doevenspeck,
J., Bhattacharjee, D., Cosemans, S., Mallik, A., Debacker, P.,
Verkest, D., and Verhelst, M. DIANA: An End-to-End Energy-
Efficient Digital and ANAlog Hybrid Neural Network SoC. In 2022 IEEE
International Solid- State Circuits Conference (ISSCC) (2022), vol. 65,
pp. 1–3. (Cited on 232, 233, 234, 235, 246, 247, 249)
[165] Veen, F. V. The Neural Network Zoo, 2016. https://www.
asimovinstitute.org/neural-network-zoo/. (Cited on 3)
[166] Venkatesan, R., Shao, Y. S., Wang, M., Clemons, J., Dai, S.,
Fojtik, M., Keller, B., Klinefelter, A., Pinckney, N., Raina,
P., Zhang, Y., Zimmer, B., Dally, W. J., Emer, J., Keckler,
S. W., and Khailany, B. Magnet: A modular accelerator generator
for neural networks. In 2019 IEEE/ACM International Conference on
Computer-Aided Design (ICCAD) (2019), pp. 1–8. (Cited on 13, 133, 134,
135, 184)
[167] Verdoolaege, S. isl: An integer set library for the polyhedral model.
In International Congress on Mathematical Software (2010), Springer,
pp. 299–302. (Cited on 55, 262)
[168] Verhelst, M., Shi, M., and Mei, L. ML Processors Are Going Multi-
Core: A performance dream or a scheduling nightmare? IEEE Solid-State
Circuits Magazine 14, 4 (2022), 18–27. (Cited on 5, 43)
[169] Victor, D. HandTrack: A Library For Prototyping Real-time
Hand Tracking Interfaces using Convolutional Neural Networks. GitHub
repository (2017). (Cited on 194)
[170] Vivet, P., Guthmuller, E., Thonnart, Y., Pillonnet, G.,
Fuguet, C., Miro-Panadès, I., Moritz, G., Durupt, J., Bernard,
C., Varreau, D., Pontes, J. J. H., Thuries, S., Coriat,
D., Harrand, M., Dutoit, D., Lattard, D., Arnaud, L.,
Charbonnier, J., Coudrain, P., Garnier, A., Berger, F.,
Gueugnot, A., Greiner, A., Meunier, Q. L., Farcy, A.,
Arriordaz, A., Chéramy, S., and Clermidy, F. Intact: A 96-
core processor with six chiplets 3d-stacked on an active interposer with
distributed interconnects and integrated power management. IEEE
Journal of Solid-State Circuits 56 (2021), 79–97. (Cited on 3)
[171] Žbontar, J., and LeCun, Y. Stereo Matching by Training a
Convolutional Neural Network to Compare Image Patches. arXiv e-prints
(Oct. 2015), arXiv:1510.05970. (Cited on 215, 218)
[172] Waeijen, L., Sioutas, S., Peemen, M., Lindwer, M., and
Corporaal, H. ConvFusion: A Model for Layer Fusion in Convolutional
Neural Networks. IEEE Access 9 (2021), 168245–168267. (Cited on 14,
15, 202, 203, 226, 230)
[173] Wang, C., and Luo, Z. A review of the optimal design of neural
networks based on fpga. Applied Sciences 12, 21 (2022). (Cited on 37)
[174] Wang, K., Liu, Z., Lin, Y., Lin, J., and Han, S. HAQ: Hardware-
Aware Automated Quantization With Mixed Precision. In Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) (June 2019). (Cited on 7)
[175] Winding, M., Pedigo, B. D., Barnes, C. L., Patsolic, H. G., Park,
Y., Kazimiers, T., Fushiki, A., Andrade, I. V., Khandelwal, A.,
Valdes-Aleman, J., Li, F., Randel, N., Barsotti, E., Correia,
A., Fetter, R. D., Hartenstein, V., Priebe, C. E., Vogelstein,
J. T., Cardona, A., and Zlatic, M. The connectome of an insect
brain. Science 379, 6636 (2023), eadd9330. (Cited on 31)
[176] Wu, S.-Y., et al. A 7nm cmos platform technology featuring 4th
generation finfet transistors with a 0.027um2 high density 6-t sram cell
for mobile soc applications. In IEDM (2016). (Cited on 194)
[177] Wu, Y. N., Emer, J. S., and Sze, V. Accelergy: An architecture-
level energy estimation methodology for accelerator designs. In
2019 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD) (2019), pp. 1–8. (Cited on xxxi, 13, 133, 134, 135, 158, 159, 184,
202, 215)
[178] Xi, S. L., Yao, Y., Bhardwaj, K., Whatmough, P., Wei, G.-
Y., and Brooks, D. SMAUG: End-to-End Full-Stack Simulation
Infrastructure for Deep Learning Workloads, 2019. (Cited on 133, 134,
135)
[179] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated
Residual Transformations for Deep Neural Networks. arXiv e-prints (Nov.
2016), arXiv:1611.05431. (Cited on xxvii, 31)
[180] Xing, Y., Liang, S., Sui, L., Zhang, Z., Qiu, J., Jia, X., Liu, X.,
Wang, Y., Shan, Y., and Wang, Y. DNNVM: End-to-End Compiler
Leveraging Operation Fusion on FPGA-Based CNN Accelerators. In
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (New York, NY, USA, 2019), FPGA ’19,
Association for Computing Machinery, pp. 187–188. (Cited on 14, 15, 52,
203, 226, 230)
[181] Yamazaki, K., Vo-Ho, V.-K., Bulsara, D., and Le, N. Spiking
neural networks and their applications: A review. Brain Sciences 12, 7
(2022), 863. (Cited on 55)
[182] Yang, Q., and Li, H. Bitsystolic: A 26.7 tops/w 2b 8b npu with
configurable data flows for edge devices. IEEE Transactions on Circuits
and Systems I: Regular Papers 68, 3 (2021), 1134–1145. (Cited on 100)
[183] Yang, T.-J., et al. A method to estimate the energy consumption of
deep neural networks. In ACSSC (2017). (Cited on 184)
[184] Yang, X., Gao, M., Liu, Q., Setter, J., Pu, J., Nayak, A., Bell,
S., Cao, K., Ha, H., Raina, P., Kozyrakis, C., and Horowitz, M.
Interstellar: Using halide’s scheduling language to analyze dnn accelerators.
In Proceedings of the 25th International Conference on Architectural Support
for Programming Languages and Operating Systems (ASPLOS) (2020),
pp. 369–383. (Cited on xxvii, 13, 43, 50, 133, 134, 138, 184, 218, 236, 241,
242)
[185] Yang, X., Gao, M., Pu, J., Nayak, A., Liu, Q., Bell, S. E., Setter,
J. O., Cao, K., Ha, H., Kozyrakis, C., et al. Dnn dataflow choice
is overrated. arXiv preprint arXiv:1809.04070 6 (2018). (Cited on 60)
[186] Yang, X., Pu, J., Rister, B. B., Bhagdikar, N., Richardson, S.,
Kvatinsky, S., Ragan-Kelley, J., Pedram, A., and Horowitz,
M. A Systematic Approach to Blocking Convolutional Neural Networks,
2016. (Cited on 133)
[187] Zhao, Y., Li, C., Wang, Y., Xu, P., Zhang, Y., and Lin, Y. Dnn-
chip predictor: An analytical performance predictor for dnn accelerators
with various dataflows and hardware architectures. In ICASSP 2020 -
2020 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2020), pp. 1593–1597. (Cited on 184)
[188] Zheng, S., Liang, Y., Wang, S., Chen, R., and Sheng, K.
FlexTensor: An Automatic Schedule Exploration and Optimization
Framework for Tensor Computation on Heterogeneous System. Association
for Computing Machinery, New York, NY, USA, 2020, pp. 859–873. (Cited
on 235)
[189] Zheng, S., Zhang, X., Ou, D., Tang, S., Liu, L., Wei, S., and
Yin, S. Efficient Scheduling of Irregular Network Structures on CNN
Accelerators. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 39, 11 (2020), 3408–3419. (Cited on 14, 15, 203, 226,
230)
[190] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L.,
Li, C., and Sun, M. Graph neural networks: A review of methods and
applications. AI Open 1 (2020), 57–81. (Cited on 55)
[191] Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning
transferable architectures for scalable image recognition. In Proc. of IEEE
conf. on computer vision and pattern recognition (2018), pp. 8697–8710.
(Cited on 167)
FACULTY OF ENGINEERING SCIENCE
DEPARTMENT OF ELECTRICAL ENGINEERING
ESAT-MICAS
ESAT-MICAS, Kasteelpark Arenberg 10
B-3001 Leuven
linyan.mei.ee@gmail.com
https://www.esat.kuleuven.be/micas/