Approximate Computing Part2
CCS Concepts: • General and reference → Surveys and overviews; • Software and its engineering →
Software notations and tools; • Hardware → Integrated circuits; • Computer systems organization
→ Architectures.
Additional Key Words and Phrases: Inexact Computing, Approximation Method, Approximate Programming,
Approximation Framework, Approximate Processor, Approximate Memory, Approximate Circuit, Approximate
Arithmetic, Error Resilience, Accuracy, Review, Classification
Authors’ addresses: Vasileios Leon, National Technical University of Athens, Athens 15780, Greece; Muhammad Abdullah
Hanif, New York University Abu Dhabi, Abu Dhabi 129188, UAE; Giorgos Armeniakos, National Technical University of
Athens, Athens 15780, Greece; Xun Jiao, Villanova University, Villanova, Pennsylvania 19085, USA; Muhammad Shafique,
New York University Abu Dhabi, Abu Dhabi 129188, UAE; Kiamal Pekmestzi, National Technical University of Athens,
Athens 15780, Greece; Dimitrios Soudris, National Technical University of Athens, Athens 15780, Greece.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Association for Computing Machinery.
0360-0300/2023/7-ART $15.00
https://doi.org/10.1145/XXXXXXX.XXXXXX
1 INTRODUCTION
The recent technological advancements in processing, storage, communication and sensing have
transformed the landscape of computing systems. With the emergence of Internet of Things (IoT),
a huge amount of data is generated, imposing technical challenges in the resource-constrained
embedded systems that are placed at the edge of the network. The ever-growing number of global
IoT connections, which grew by 18% in 2022 and is expected to reach 29.7 billion in 2027 [91], only
exacerbates this situation. At the same time, the already-stressed cloud data centers are drawing
worldwide concerns due to their enormous electricity demands. Indicatively, it is estimated that
the energy consumption of data centers in Europe increased by 43% from 2010 to 2020 [22],
while according to [30], data centers may generate up to 8% of the global carbon emissions in 2030.
The above issues become even more critical when considering the massive growth of demanding
applications from domains such as Digital Signal Processing (DSP), Artificial Intelligence (AI), and
Machine Learning (ML). More specifically, the emergence of compute- and memory-intensive DSP
and AI/ML applications marks a new era for computing, where traditional design approaches and
conventional processors are unable to meet the performance requirements [64, 166]. Therefore, it is
obvious that the proliferation of data and workloads, the increase in the complexity of applications,
and the urgent need for power efficiency and fast processing, all together, force the industry of
computing systems to examine alternative design solutions and computing schemes.
In this transition, Approximate Computing (AxC) is already considered a novel design par-
adigm that improves the energy efficiency and performance of computing systems [260]. Over
the past decade, AxC has been established as a vigorous approach to leverage the inherent ability
of numerous applications to produce results of acceptable quality, despite some inaccuracies in
computations [16]. The motivation behind AxC is further analyzed in our work in [129]. Driven
by its potential to address energy and performance demands in a favorable manner, AxC has been
gaining significant attention in recent years. A representative example of this trend
is depicted in Figure 1. This figure illustrates the growing number of works that apply any kind of
approximation (e.g., software, hardware, architectural), published in four major design automation
conferences. Furthermore, AxC has attracted significant interest from the industry. Companies
such as Google [101], IBM [5], and Samsung [188] exploit the AI/ML error resilience and design
accelerators for compressed networks with reduced precision and less computations, improving
the energy efficiency and performance in exchange for negligible or zero accuracy loss.
Considering the need to go beyond single optimizations or efficient processor design, which
solely do not suffice [49], prior research has focused on approximation techniques at different design
abstraction layers [260], involving algorithmic- or compiler-level approximations (software) [9, 10],
circuit-level approximations (hardware) [36, 96] and systematic design of approximate processors
(architectural) [47, 55]. Despite their shared objective of achieving energy/area gains and latency
reduction, these fundamental techniques are orthogonal and can complement each other. As com-
bining AxC techniques is expected to be highly beneficial (several works report energy gains over
90% [49]), a further adoption and understanding of cross-layer development (i.e., hardware/software
co-design), including tools & frameworks tailored to specific applications, are highly required to
maximize these benefits. In this context, motivated by the attractive outcomes, novel methods, and
future potential of AxC, we conduct a two-part survey that covers all its key aspects.
Fig. 1. Number of publications that apply any type of approximation. Four design automation conferences
over the past eight years have been considered.
Table 1. Qualitative comparison of Approximate Computing surveys on the entire computing stack.
Columns: Survey | Year | Pages #¹ | References # | AxC Terminology | SW Tech. | HW Tech. | Arch. Approx. (CPU/FPGA/GPU/ASIC, Memories) | AI/ML Coverage | Frameworks & Tools | Benchmarks | Metrics | Challenges
[73]            2013  6    65   ✓ ✓ ≈ ≈ ✗ ✗ ✓ ✗ ≈ ✗ ✗
[165]           2015  33¹  84   ✓ ≈ ✓ ≈ ✓ ✓ ✓ ✓ ✓ ✗ ✓
[276]           2015  15   59   ✓ ✓ ✓ ✗ ≈ ✗ ✗ ✗ ≈ ✗ ✗
[262]           2015  6    54   ✓ ✓ ✓ ✗ ✗ ✓ ✗ ✗ ≈ ✗ ✗
[227]           2016  6    47   ✓ ✓ ✓ ✗ ✗ ≈ ✗ ✗ ≈ ✗ ✗
[169]           2017  4    40   ✓ ✓ ✗ ≈ ≈ ✗ ✗ ✗ ≈ ✗ ✗
[15]            2017  6    72   ✓ ✓ ✓ ≈ ≈ ≈ ✗ ✓ ≈ ✗ ✓
[238]           2020  39¹  235  ≈ ✓ ✓ ✗ ✓ ≈ ✗ ✗ ≈ ✗ ✓
Our work Pt. 1  2023  34¹  221  ✓ ✓ ✓ ✓ ✓
Our work Pt. 2  2023  35¹  296  ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
¹ Single-column pages
application). Table 1 compares surveys that, like ours, study multiple areas of AxC. As shown,
our two-part survey includes all the areas examined in previous works, reviews them in greater
depth (e.g., AI/ML, which has received limited attention in prior surveys), and is up-to-date.
Regarding its contribution, our survey provides readers with a complete view of AxC, while
allowing them to focus on specific areas (e.g., architectural techniques or application
domains), as all of them are presented in detail. Furthermore, this survey can serve as a tutorial
on state-of-the-art approximation techniques, as we not only classify them but also briefly report
their technical details. Overall, the key contributions of our two-part survey are: 1) the
introduction of the AxC paradigm by discussing its background and application domains, 2) the
review of a plethora of state-of-the-art approximation techniques, their coarse- and fine-grained
classification, and the presentation of their technical details, 3) the study of application domains
where approximations are tolerated and the analysis of representative techniques and results per
domain, and 4) the discussion of open challenges in the design of approximate systems, as they
emerge from the comprehensive literature review.
Organization: Figure 2 presents the content of each part of our survey:
Fig. 2. Content of the two-part survey. Part I: terminology and principles; software, hardware,
cross-layer, and end-to-end approximations. Part II: approximate processors; approximate data
storage; AxC applications (application spectrum, application-driven analysis, benchmark suites,
error/quality metrics); challenges and future directions.
Part I: It is presented in [129], and it introduces the AxC paradigm (terminology and principles)
and reviews software & hardware approximation techniques.
Part II: It is presented in the current paper, and it reviews application-specific & architectural
approximation techniques and introduces the AxC applications (domains, quality metrics,
benchmarks).
The remainder of the article (Part II of survey) is organized as follows. Section 3 reports software-,
hardware- and cross-layer application-specific approximation techniques. Section 4 focuses on
the architecture level and classifies approximate processors and storage. Section 5 studies the
applications of AxC, while Section 6 discusses challenges and future directions. Finally, Section 7
draws the conclusions.
APPLICATION-SPECIFIC APPROXIMATION TECHNIQUES
Approximation Class — Technique/Approach:
Software-Level Approximation:
  Computation Skipping / Pruning [42, 75, 76, 84, 85, 88, 125, 136, 241, 244, 267, 273, 283–285]
  Precision Scaling [75, 89, 94, 206, 269, 295]
  Input-Adaptive Approximation [63, 148, 181, 183, 192, 246, 261, 263]
  Input Scaling [65, 245]
Hardware-Level Approximation:
  Functional Approximation & Approx. Datapaths [77, 110, 128, 130, 172–174, 213, 233, 236, 264]
  Approximation-based Novel Data Representation Formats [69, 78, 93, 94]
  Voltage Scaling [95, 111, 115, 184, 231, 292]
  Data Approximation [111, 115]
Cross-Layer & End-to-End Approximation:
  Approximation at Various Layers of the Computing Stack [18, 19, 57, 65, 68, 79, 82, 198, 199, 260]
  Full-System Perspective [65, 82, 198, 199]
  Approximation of Multiple Subsystems in a Synergistic Manner [65, 82, 198, 199]
3.1.1 Pruning. Computation skipping is one of the most effective software-level techniques for
approximating DNNs. Various techniques have been proposed under the umbrella of DNN pruning,
which exploits computation skipping to optimize the DNN inference process. In general, DNN
pruning refers to the process of removing non-essential (or less important) weights from a DNN to
reduce its computational complexity, memory footprint, and energy consumption during the
inference process. These techniques are generally divided into two main categories: (1) unstructured
pruning and (2) structured pruning.
Unstructured pruning (a.k.a. fine-grained pruning) refers to techniques that analyze the impor-
tance/saliency of each individual weight and remove it if its importance is below a predefined
threshold. These techniques are usually designed to be iterative, where in each iteration only a
small percentage of less significant weights are removed and then the network is retrained for a few
epochs to regain close to the baseline accuracy. For example, Han et al. [76] propose a three-step
method to prune DNNs. First, they train a DNN to learn the importance of connections. Then,
they remove non-essential (less important) connections from the DNN. Finally, they fine-tune
the remaining weights to regain the lost accuracy. The authors further propose that learning the
right connections is an iterative process. Therefore, the process of pruning followed by fine tuning
should be repeated multiple times (with a small pruning ratio) to achieve higher compression ratios.
Deep Compression [75] combines a similar method with weight sharing and Huffman coding to
achieve ultra-high compression ratios.
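The iterative prune-retrain loop described above can be sketched as follows. This is a minimal numpy illustration of one magnitude-based pruning step (the retraining step is model-specific and omitted); the function name and the toy 4x4 weight matrix are illustrative only:

```python
import numpy as np

def magnitude_prune(weights, prune_ratio):
    """One unstructured pruning step: zero out the prune_ratio fraction of
    weights with the smallest magnitudes, returning the pruned weights and
    the binary mask that a subsequent retraining pass would keep fixed."""
    flat = np.abs(weights).ravel()
    k = int(prune_ratio * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
pruned, mask = magnitude_prune(w, prune_ratio=0.5)
# Half of the 16 weights are now zero; iterating prune -> retrain with a
# small ratio per step is what enables high overall compression ratios.
```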
Although unstructured pruning significantly reduces the number of parameters and the memory
footprint of DNNs, it does not guarantee energy or latency benefits in all cases. This is mainly
because: (1) the weights of a pruned network are usually stored in a compressed format, such as the
Compressed Sparse Row (CSR) format, to achieve high compression ratios and, therefore, have to
be decompressed before the corresponding computations, which results in significant overheads;
and (2) the underlying hardware (in most use cases) is not designed to take advantage of unstructured
sparsity in DNNs, which results in underutilization of the hardware. Therefore, specialized hardware
accelerators, such as EIE [74] and SCNN [185], are required to efficiently process data using
unstructured sparse DNNs. Similarly, Yu et al. [283] highlight that it is essential to align the
pruning method to the underlying hardware architecture to achieve efficiency gains. A mismatch
between the type of sparsity in the network and the sparsity the hardware can support efficiently
usually leads to significant overheads. Based on this observation, the authors propose two novel
pruning methods, i.e., SIMD-aware weight pruning and node pruning. SIMD-aware weight pruning
maintains weights in aligned fixed-size groups to fully utilize the available SIMD units in the
underlying hardware, while node pruning removes less significant nodes from the network without
affecting the dense format of weight matrices.
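To illustrate the compression/decompression trade-off mentioned above, the following sketch builds the three CSR arrays for a small pruned weight matrix; it is a minimal illustration not tied to any particular accelerator:

```python
import numpy as np

def to_csr(dense):
    """Build the three CSR arrays (values, column indices, row pointers)
    for a 2-D matrix; only nonzero entries are stored."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

w = np.array([[0, 2, 0, 0],
              [0, 0, 0, 3],
              [0, 0, 0, 0],
              [1, 0, 0, 4]])
vals, cols, ptrs = to_csr(w)
# vals=[2 3 1 4], cols=[1 3 0 3], ptrs=[0 1 2 2 4]: 16 entries shrink to
# 4 values + 4 indices + 5 pointers, but a MAC array that consumes dense
# rows must first walk these arrays to reconstruct them.
```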
Various other techniques have also been proposed in the literature to achieve structured sparsity
in DNNs. These methods generally include weight-dependent pruning techniques [136, 284, 285],
activation-based techniques [84, 85, 88, 241, 244], and regularization-based techniques [42, 125,
267, 273]. The weight-dependent techniques evaluate the importance of filters solely based on the
weights of the filters, while activation-based techniques use activation maps to identify critical
filters in the network. Similarly, regularization-based techniques encourage sparsity during training
by adding regularization terms in the loss function or by exploiting batch normalization parameters.
3.1.2 Precision Scaling. It refers to techniques that map values from a high-precision range
to a lower-precision range with a smaller number of quantization levels. The number of quantization
levels in a range defines the number of bits required to represent each data value. Hence, by
using a lower-precision format, such as 4-bit or 8-bit fixed-point or floating-point format instead
of the standard 32-bit floating-point format, the memory footprint as well as the computational
requirements of DNNs can be significantly reduced.
Various techniques have been proposed to achieve precision scaling of DNNs without affecting
their application-level accuracy. The most commonly used technique is 8-bit range linear post-
training quantization, where floating-point weights and activations are converted to 8-bit fixed-
point format. The range linear quantization is further divided into two types: (1) asymmetric
quantization [92] and (2) symmetric quantization [116]. In asymmetric quantization, the minimum
and maximum observed values in the input range are mapped to the minimum and maximum
values in the output (integer/fixed-point) range, respectively. In symmetric quantization, by
contrast, the maximum absolute observed value defines a symmetric input range, i.e., both its
minimum and maximum. Significant efforts have been made to push the limits of
DNN precision scaling to ultra-low precision levels. Works like XNOR-Net [206], Binarized Neural
Network (BNN) [89] and DoReFa-Net [295] explore the potential of aggressive quantization, up
to 1-bit precision. However, as aggressive quantization leads to significant accuracy loss, these
techniques usually employ quantization-aware training to achieve reasonable accuracy, as in such
cases training also acts as an error-healing process.
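As an illustration of the two range-linear schemes, the following numpy sketch implements 8-bit asymmetric and symmetric post-training quantization. It is a simplified per-tensor version (real toolchains add calibration and often per-channel scales), and the helper names are illustrative:

```python
import numpy as np

def quantize_asymmetric(x, n_bits=8):
    """Map the observed range [min(x), max(x)] onto the full unsigned range."""
    qmin, qmax = 0, 2**n_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def quantize_symmetric(x, n_bits=8):
    """Use max|x| to define a symmetric range [-amax, +amax]; no zero-point."""
    qmax = 2**(n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q.astype(np.int8), scale

x = np.array([-0.4, 0.0, 0.9, 1.3])
qa, scale_a, zp = quantize_asymmetric(x)
qs, scale_s = quantize_symmetric(x)
# Dequantization, (qa - zp) * scale_a and qs * scale_s, recovers x to within
# half a quantization step for in-range values.
```

Note that for skewed ranges like the one above, the symmetric scheme wastes part of its integer range on values that never occur, which is why asymmetric quantization is often preferred for activations.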
Non-linear (or non-uniform) quantization techniques have also been explored in the litera-
ture [75, 94]. These techniques are inspired by the non-uniform probability distribution of DNN
data structures and propose to distribute the quantization levels accordingly. Apart from the
above-mentioned techniques, mixed-precision techniques are also commonly used, where different
precision formats can be used for different layers and filters/channels of a DNN [269]. We note
that mixed-precision quantization is only effective when the underlying hardware is capable of
supporting different precision formats.
3.1.3 Input-Adaptive Approximation. Besides pruning and precision scaling, input-adaptive ap-
proximations are also commonly used to approximate DNN workloads. In general, input-adaptive
approximations refer to the techniques that dynamically adapt the level of approximation based
on the characteristics of the input data. Various methods have been proposed in the literature to
realize input-adaptive approximations. These methods mainly include (1) early-exit classifiers, (2)
dynamic network architectures, and (3) selective attention mechanisms. Works such as [181, 263]
that fall under early-exit classifiers argue that most of the inputs can be classified with very low
effort. Towards this, the authors in [181] propose Conditional Deep Learning (CDL), where the
outputs of convolutional layers are used to estimate the difficulty of input samples (with the
help of linear classifiers) and conditionally activate the deeper layers of the DNN. Similarly, [183]
propose a hierarchical classification framework by exploiting semantic decomposition. Towards
dynamic network architectures, [63] presents an approach that adapts the structure of a DNN based
on the difficulty of the input sample. Along similar lines, [246] presents a method for building
runtime configurable DNNs and a score margin-based mechanism for dynamically adjusting the
DNN capability. Unlike the above techniques that involve network structure or size modification,
selective attention mechanisms focus on reducing computational requirements by processing only
the important/relevant regions of the input while ignoring all less relevant parts. A prominent work
in this direction includes selective tile processing [192], where a low cost attention mechanism is
used to identify the tiles that require processing.
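The early-exit idea behind CDL-style conditional activation can be sketched as a cascade of classifiers with a confidence threshold. The stage callables below are hypothetical stand-ins for the linear classifiers attached to intermediate layers:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def early_exit_inference(x, stages, threshold=0.9):
    """Evaluate a cascade of classifiers and return (prediction, exit depth).
    Each stage maps the input to class logits; easy inputs exit at a shallow
    stage, so the deeper (more expensive) stages are skipped entirely."""
    for depth, stage in enumerate(stages, start=1):
        probs = softmax(stage(x))
        if probs.max() >= threshold or depth == len(stages):
            return int(np.argmax(probs)), depth

stages = [lambda x: np.array([0.1, 0.2]),   # cheap stage, not confident here
          lambda x: np.array([5.0, 0.0])]   # deeper stage, confident
pred, depth = early_exit_inference(None, stages)  # exits at depth 2 with class 0
```

With a well-chosen threshold, most samples exit early and only hard samples pay for the full network, which is the effect these works exploit.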
3.1.4 Input Scaling. Given that neighboring pixels in images are not always statistically indepen-
dent and contain a significant amount of redundant information, input scaling can be exploited to
reduce the computational requirements of DNN inference. Towards this, Tan et al. in [245] propose
a compound scaling method that uniformly scales network width, depth, and input resolution to
design a complete class of EfficientNets. Moreover, techniques like AxIS [65] employ input scaling
as an approximation knob in their cross-layer approximations to achieve ultra-high efficiency gains.
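A minimal sketch of the compound scaling rule from [245]: depth, width, and input resolution are scaled jointly as α^φ, β^φ, γ^φ, with α·β²·γ² ≈ 2 so that FLOPs grow by roughly 2^φ. The base coefficients below (α=1.2, β=1.1, γ=1.15) are the values reported for EfficientNet:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return the depth/width/resolution multipliers and the resulting
    FLOPs growth factor for a given compound coefficient phi."""
    depth = alpha ** phi        # multiplier on the number of layers
    width = beta ** phi         # multiplier on channels per layer
    resolution = gamma ** phi   # multiplier on input image side length
    flops_factor = depth * width**2 * resolution**2
    return depth, width, resolution, flops_factor

d, w, r, f = compound_scale(phi=1)
# f = alpha * beta**2 * gamma**2 ~ 1.92, i.e., about 2x FLOPs per phi step
```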
multiplier design for the given application. However, approximation-aware retraining is not possi-
ble in all scenarios, e.g., due to limited computational resources or due to the lack of availability
of training data. Therefore, to address this limitation, ALWANN [174] proposes a layer-wise ap-
proximation technique to select an appropriate approximate multiplier from a given library (e.g.,
EvoApprox [171]) for each individual layer of the given DNN without involving any training. The
authors also present a computationally inexpensive method to fine-tune the DNN weights after
the approximate module selection process to reduce the accuracy loss due to approximations. In
the same context, the Max-DNN framework [130] relies on ALWANN’s functionalities to provide
fine-grained approximation of the multiplications at different DNN layers. In [173], the authors
present a methodology to generate libraries of more specific approximate circuits and show that
such techniques can produce better results than techniques like ALWANN [174]. Other promi-
nent approximate multiplier designs for DNN acceleration include Digital Approximate In-SRAM
Multiplier (DAISM) [233], positive/negative approximate multipliers [236] and approximate log-
multipliers [110, 213]. Moreover, towards mitigating the impact of approximation errors, CANN [77]
presents the concept of curable approximations for DNN accelerators, where approximation errors
can be internally compensated with some additional functionality. To facilitate the design and
exploration of approximate modules for efficient DNN inference, fast simulation techniques and
frameworks are also being developed and studied [52, 67, 242]. Finally, the use of approximate
multipliers provides resource saving and increased throughput in FPGA-based DNN accelerators
[128], while the same benefits are also delivered in classic DSP kernels [132].
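As a concrete example of the approximate log-multipliers cited above, the classic Mitchell scheme writes x = 2^k·(1+m) with m in [0, 1), approximates log2(x) by k+m, and turns multiplication into addition. A minimal floating-point sketch follows (hardware versions operate on integer bit fields instead):

```python
import math

def mitchell_multiply(a, b):
    """Mitchell's approximate multiplication: add the two piecewise-linear
    approximate logarithms, then apply the same linearization in reverse
    as an approximate antilogarithm."""
    def approx_log2(x):
        k = math.floor(math.log2(x))
        return k + (x / 2**k - 1)   # k + mantissa fraction
    s = approx_log2(a) + approx_log2(b)
    k = math.floor(s)
    return 2**k * (1 + (s - k))

approx = mitchell_multiply(13, 27)   # 336.0 vs. the exact 351
# The result always underestimates, with a worst-case relative error of
# about 11%, which many DNN workloads absorb gracefully.
```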
Another set of techniques that include works such as BiScaled-DNNs [94], CoNLoCNN [78],
Compensated-DNNs [93] and ANT [69] targets novel data representation formats to significantly
reduce the hardware complexity and energy requirements of DNN inference systems. BiScaled-
DNNs [94] argues that DNN data structures are long-tailed and, therefore, can effectively be
quantized using two different scale factors (namely, scale-fine and scale-wide). Scale-fine enables
the system to precisely capture small numbers using more fractional bits, while scale-wide allows
the system to cover the entire range of large numbers but at a coarser granularity. CoNLoCNN [78]
employs a modified version of power-of-two quantization to replace multiplication with shift
operations. Compensated-DNNs [93] propose the concept of dynamic compensation to mitigate
the impact of quantization errors caused due to aggressive quantization. On the other hand,
ANT [69] argues that most of the existing quantization techniques use fixed-point or floating-point
data representation formats, which offer limited benefits, as both are effective for specific data
distributions and thus require a reasonably large number of bits to maintain the DNN accuracy.
Therefore, [69] proposes a novel adaptive numerical data type that combines the benefits of float
and integer data formats and, thereby, it allows the system to adapt based on the importance of
each value in a tensor.
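To illustrate how power-of-two quantization (as in CoNLoCNN-style schemes) trades multipliers for shifters, here is a minimal numpy sketch; the nearest-power-of-two rounding rule and the integer-activation assumption are simplifications:

```python
import numpy as np

def pow2_quantize(w):
    """Round each weight to the nearest power of two, keeping its sign,
    so that multiplying by a weight reduces to an arithmetic shift."""
    sign = np.sign(w)
    exp = np.round(np.log2(np.abs(w) + 1e-12)).astype(int)
    return sign, exp

def shift_multiply(activation, sign, exp):
    """Compute activation * (sign * 2**exp) using only shifts (integer input)."""
    shifted = np.left_shift(activation, np.maximum(exp, 0)) >> np.maximum(-exp, 0)
    return sign * shifted

sign, exp = pow2_quantize(np.array([0.25, -2.0, 4.0]))
out = shift_multiply(6, sign, exp)   # [1., -12., 24.]  (6*0.25 truncates to 1)
```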
Apart from functional approximation of arithmetic modules in DNN accelerators, voltage-scaling
can also be employed to improve the energy efficiency of DNNs. ThunderVolt [292] presents
TE-Drop, a timing error recovery technique for DNN accelerators. It employs Razor flip-flops to
detect timing errors in the MAC units and then avoids re-execution penalty by dropping the MAC
operation subsequent to the erroneous MAC. [231] presents a similar approach but employs Razor
flip-flops in between the multiplier and the adder of each MAC unit. As dropping MAC operations
can result in significant performance degradation, the authors in [95] present a Compensation MAC
(CMAC) unit that compensates for the dropped multiplication using an additional adder in the
MAC subsequent to the erroneous MAC. Along similar lines, GreenTPU [184] presents a Timing
Error Control Unit (TECU) to keep track of the timing error causing input patterns and boost the
operating voltage of the subsequent MACs in the processing array. Apart from exploiting voltage
scaling for improving energy efficiency of computational units, works such as MATIC [111] explore
the potential of aggressive voltage scaling of on-chip weight memories to improve the energy
efficiency of DNN accelerators. Similarly, EDEN [115] presents a framework for using approximate
DRAM with reduced voltage and reduced latency to improve the energy and performance efficiency
of DNN inference systems. These techniques mainly employ memory-adaptive training to achieve
ultra-high efficiency gains without any significant loss in DNN accuracy.
Apart from the above methods, various other techniques have also been proposed that combine
DNN compression/approximation algorithms with hardware-level approximations to achieve
significant efficiency gains. Gong et al. [68] present a framework that combines dynamic layered
CNN structure, kernel shrinking and layer-by-layer quantization at the algorithm level and uses
an approximate computing-based reconfigurable architecture at the hardware level. Similarly,
[79] presents a cross-layer framework that combines network pruning and quantization with
hardware-level functional approximations of arithmetic units. Along similar lines, [57] integrates
activation pruning and voltage scaling with hardware-level functional approximation. To overcome
the accuracy loss incurred due to hardware approximations, the authors propose to incorporate
approximations in the training process, which results in an increase of about 5.32% in accuracy.
The authors report around 52.5% improvement in the energy efficiency of the system on average.
As the spectrum of AI-based products and services is expanding at a rapid pace, there is an in-
creased demand for computing systems that can efficiently handle various types of DNN workloads.
Towards this, [260] presents an AI system design, RaPID, that integrates different approximation
techniques developed at different layers of the computing stack. At the algorithm level, it employs
precision scaling and gradient compression to reduce the compute, memory, and data-transfer costs.
At the hardware level, it employs a reconfigurable approximate AI core that supports different levels
of approximations. Finally, it incorporates an approximation-aware compiler to map approximate
algorithms on approximate hardware. The system is designed to support various types of DNNs
and both DNN inference and training workloads.
Of particular relevance, the need for further hardware efficiency across multiple abstraction
levels also exists in resource-limited scenarios, suited to low-power ML, such as flexible and printed
applications [17, 175]. Printed classifiers are significantly less costly than conventional CMOS
technologies, paving the way for high circuit customizations, also known as Bespoke implementa-
tions [43]. Recent examples introduced Approximate Computing in printed ML classifiers tailored
to Bespoke architectures [18, 19]. [19] proposes an automated framework that applies a model-
to-circuit cross approximation and generates close-to-accuracy-optimal approximate ML circuits,
allowing the deployment of complex classifiers on battery-powered devices (< 30mW). On the other
hand, through a hardware-friendly retraining (at algorithmic level), [18] replaces coefficients with
more area-efficient ones, while an approximate neuron (at hardware-level) boosts the area/power
efficiency by discarding the least significant information among the least significant summands.
while a customized training algorithm reduces errors derived from analog range limitations. In
a similar approach [55], digital Neural Processing Units (NPUs) are tightly integrated with the
processor pipeline, resulting in low-power approximations for specific sections of general-purpose
code. Finally, approximate neural network processing along with high data-level parallelism has
been explored for GPUs [280] and FPGAs [170], as well.
Relax [51] is an architectural framework that comprises three key components: an ISA extension,
hardware support, and software support. The ISA extension allows the registration of a fault handler
for a code region, enabling software applications to incorporate try/catch behavior. The hardware
support for Relax simplifies hardware design and enhances energy efficiency by relaxing reliability
constraints and providing fault detection mechanisms, while the software support offers a language-
level recovery construct in C/C++, with relax blocks marking regions susceptible to hardware faults.
Similarly, ERSA [47] is a system architecture that targets applications with inherent error resilience,
and ensures high degrees of resilience at low cost. It achieves high error resilience to high-order
bit errors and control flow errors using asymmetric reliability (and the concept of “configurable
reliability” for general-purpose applications) in many-core architectures, error-resilient algorithms,
and intelligent software optimizations.
Prior research studies on the design of approximate processors also include the voltage over-
scaling paradigm as a potential energy-saving method with adjustable approximation level [3]. The
authors in [54] propose an ISA extension provided with approximate operations and storage, as
well as a microarchitecture design, called Truffle, which effectively supports these extensions. This
work utilizes dual-voltage operation to save power and allows the programmers to specify which
parts of a program can be computed approximately at a lower energy cost. Similar manipulations
in operational (timing) errors to decrease power consumption of processor architectures from
voltage/reliability trade-offs, have been studied and can be found in [104, 224].
Among processor architectures, RISC-V has been established as a steadily growing competitor
and alternative to proprietary CPU architectures, with funded projects and support from major
companies (e.g., Intel, Microsoft, STMicroelectronics [235]). The availability of open-source RISC-V tools and libraries
provides developers with greater flexibility in designing and optimizing approximate computing sys-
tems. More specifically, the authors in [177] extend prior works in variable bit-width approximate
arithmetic operators with configurable data memory units as well, while [112] leverages the inher-
ent “value similarity” characteristic of ML workloads and, using lightweight micro-architectural
ISA extensions, skips entire instruction sequences. Accordingly, it substitutes their results with
those of previously executed computations, improving both performance and energy efficiency. Furthermore,
[59] utilizes multiple hardware-level approximations and minimizes quality degradation of several
applications by adopting a hardware controller, which adjusts the degree of approximation by
selecting various approximate units. Lastly, RISC-V processors targeting DNN applications have
also been explored in [48, 180]. In [180], the authors exploit the efficacy of SIMD instructions and
propose mixed-precision Quantized Neural Network (QNN) execution that enables functional units
down to 2 bits, as opposed to commercial cores (e.g., ARM Cortex-M55) that cannot natively support
smaller than 8-bit SIMD instructions. On the other hand, [48] extends the base RISC-V ISA to enable
low bit-width operations tailored to DNN tasks, but this time by adopting a posit processing unit
that can be integrated in a full RISC-V core with minimal overheads.
[34, 61, 118, 208], dynamic RAMs (DRAMs) [102, 142, 208], and caches [135, 156, 157]. We note that
the literature also includes approximation techniques that are applied and evaluated in different
types of memory [203].
4.2.1 Approximate Non-Volatile Memories. Sampson et al. [223] introduce two mechanisms to
approximate PCMs, which can be used for persistent storage (e.g., for databases or filesystems)
or as main memory. The first mechanism reduces the number of programming pulses for writing
the multi-level cells (i.e., cells storing multiple bits of information), while the second one stores
approximate data in blocks with exhausted error correction resources. SoftPCM [58] integrates a
method that reduces the write traffic to improve energy efficiency and lifetime. More specifically,
a circuit compares the new data with stored data, and the writing operation is not performed if the
difference is below a threshold.
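The write-cancellation idea behind SoftPCM can be sketched as a simple threshold check. This is a behavioral illustration only (the actual design performs the comparison in circuitry on the write path), and the function name is hypothetical:

```python
import numpy as np

def approximate_write(stored, new, threshold):
    """Skip the costly PCM write when the new data is close enough to what
    is already stored; returns (cell contents, write_performed)."""
    if np.max(np.abs(new - stored)) <= threshold:
        return stored, False   # write cancelled: old value serves as approximation
    return new, True           # difference exceeds the quality bound: real write

kept, wrote = approximate_write(np.array([10.0]), np.array([10.3]), threshold=0.5)
# Write skipped: the cell keeps 10.0, trading a bounded error for reduced
# write traffic, energy, and cell wear.
```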
The authors of [217] provide approximate storage in embedded flash memories by lowering the
voltage below its nominal value. To deal with unpredictable behaviors and large error rates, they
employ three software-based methods, i.e., “in-place writes”, “multiple-place writes”, and “RS-Berger
codes”. The first method repeatedly writes the data to the same memory address, while the second
one writes the data to more than one address. The last method is an error detection & correction
code that recovers the originally stored data. In the same context, Tseng et al. [250] characterize
various multi-level cell flash memories when the supply voltage drops during read, program and
erase operations. Based on the characterization results, they propose a dynamic scaling mechanism
that adjusts the voltage supply of flash according to the executed operation. Retention relaxation
[141] is another interesting technique applied in multi-level cell flash memories. In particular, the
write operation for data with short lifetime (e.g., data from proxy and MapReduce workloads) is
accelerated based on modeling the relationship between raw bit error rates and retention time.
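A minimal sketch of the “multiple-place writes” idea: store several copies of a word and recover it by per-bit majority voting. The random bit-flip error model, the copy count, and the 8-bit word width are assumptions for illustration, not parameters from [217]:

```python
import random

def multiplace_write(storage, value, addresses, flip_prob=0.05):
    """Write the same 8-bit word to several flash addresses; each stored
    copy may suffer random bit flips under low-voltage operation
    (the per-bit flip model is an assumed stand-in for real flash errors)."""
    for addr in addresses:
        noisy = value
        for bit in range(8):
            if random.random() < flip_prob:
                noisy ^= 1 << bit
        storage[addr] = noisy

def multiplace_read(storage, addresses):
    """Recover the word by per-bit majority voting over all copies."""
    copies = [storage[a] for a in addresses]
    result = 0
    for bit in range(8):
        ones = sum((c >> bit) & 1 for c in copies)
        if ones > len(copies) // 2:
            result |= 1 << bit
    return result

storage = {}
multiplace_write(storage, 0b10110101, [0, 1, 2], flip_prob=0.0)
```

With three copies, any single corrupted copy is outvoted, which is why the scheme tolerates the elevated error rates of voltage-scaled flash.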
The literature also includes works on approximate STT-RAMs, which are a non-volatile alternative to SRAMs for on-chip scratchpad or cache memory. Sampaio et al. [220] propose an approximate
STT-RAM cache architecture that reduces the reliability overhead under a given error bound. The
application’s critical data are protected using a latency-aware double error correction module,
while approximation-aware read and write policies provide approximate storage. Moreover, the
architecture is equipped with a unit for monitoring the quality of the application’s output. To
provide an approximate scratchpad memory, Ranjan et al. [205] design a quality configurable
STT-RAM, in which read and write operations are executed at varying accuracy levels. Approximate reads are performed by either lowering the current used to sense the stored value, or lowering the sensing duration while increasing the current. Similarly, approximate writes are performed by lowering the current and/or duration of the write pulses. Furthermore, the authors of [286] propose a scheme
for progressively scaling the cells of STT-RAM arrays based on a scaling factor. In this approach,
the most significant bits of the data words are implemented to exhibit lower bit error rates than the least-significant ones. In [247], the authors propose AdAM, which provides adaptive approximation
management across the entire non-volatile memory hierarchy. In more detail, they model a system
with STT-RAM scratchpad and PCM main memory and apply various approximation knobs (e.g.,
read/write pulse magnitude/duration) to exchange data accuracy for improved STT-RAM access
delay and PCM lifetime.
4.2.2 Approximate Volatile Memories. Regarding SRAM memories, significant research efforts have
focused on lowering the supply voltage. In this context, Kumar et al. [118] apply system-level design
techniques to reduce the SRAM leakage power. Their framework is constrained by a data-reliability
factor, while they use a statistical/probabilistic setup to model soft errors and process variations. In
[34], the authors propose a hybrid memory array for video processors. Their policy is to store the
high-order bits of luminance pixels, i.e., those to which human vision is more sensitive, in robust 8T SRAM bit-cells, while the low-order bits are stored in 6T bit-cells. This approach allows the supply voltage to be lowered without significantly affecting the high-order bits. Frustaci et al. [61]
evaluate a wide range of approximation techniques (bit dropping, selective write assist, selective
error correcting code, voltage scaling) for dynamic management of the energy–quality trade-off in
SRAMs. Their analysis is based on measurements on a 28-nm SRAM testchip, and it also includes
combinations of the examined techniques and different array/word sizes.
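The bit-partitioning policy of the 6T/8T hybrid array can be sketched as follows; the number of protected bits and the flip probability are chosen for illustration only and are not parameters from [34]:

```python
import random

def store_hybrid(pixel, n_protected=4):
    """Split an 8-bit luminance pixel: high-order bits go to robust (8T)
    cells, low-order bits to voltage-scaled (6T) cells. Bit widths and
    the error model below are illustrative assumptions."""
    msb = pixel >> (8 - n_protected)                # kept error-free
    lsb = pixel & ((1 << (8 - n_protected)) - 1)    # may be corrupted
    return msb, lsb

def read_hybrid(msb, lsb, n_protected=4, flip_prob=0.2):
    """Low-order bits may flip under aggressive voltage scaling;
    the reconstructed pixel error is bounded by the unprotected bits."""
    for bit in range(8 - n_protected):
        if random.random() < flip_prob:
            lsb ^= 1 << bit
    return (msb << (8 - n_protected)) | lsb

random.seed(0)
msb, lsb = store_hybrid(200)          # 200 = 0b11001000
value = read_hybrid(msb, lsb)         # MSBs intact; |value - 200| < 16
```

Because only the four low-order bits can flip, the worst-case pixel error is bounded by 15, which is visually tolerable for luminance data.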
The authors of [208] use approximate SRAM and DRAM memories to implement sketches for
similarity estimation. In more detail, based on theoretical analysis and simulation, they examine the
error rates in unprotected and parity-protected memories and tune their voltage supply. Moreover,
the work in [142] applies lower refresh rates in the DRAM portions storing the non-critical data.
Similarly, Jung et al. [102] define reliable and unreliable memory regions; however, their strategy completely disables the DRAM’s auto-refresh feature.
Besides SRAMs and DRAMs, the literature also includes research works on approximate caches.
In [157], the authors propose the Doppelganger cache based on the approximate similarity among
the cache values. In particular, they associate the tags of multiple similar blocks with a single data
array entry, and hence, they reduce the cache memory footprint. In the same context, the authors of
[156] propose the Bunker cache based on the spatio-value similarity, where approximately similar
data exhibit spatial regularity in memory. Their approach is to map similar data to the same cache
location based only on their memory addresses. Finally, ASCache [135] is an approximate SSD
cache that allows bit errors within a tolerable threshold, while avoiding unnecessary cache misses
and guaranteeing end-to-end data integrity.
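A toy model of the Doppelganger idea: tags of approximately similar blocks point at one shared data entry. Naive per-byte quantization stands in here for the similarity detection of [157], which operates on real cache hardware structures:

```python
class DoppelgangerLikeCache:
    """Toy sketch: multiple tags share one data entry for similar blocks.
    The quantization-based similarity test is an illustrative assumption."""

    def __init__(self, quantum=16):
        self.quantum = quantum
        self.tags = {}   # address -> signature   (tag array)
        self.data = {}   # signature -> one representative block (data array)

    def _signature(self, block):
        # Blocks whose bytes fall in the same quanta get the same signature.
        return tuple(b // self.quantum for b in block)

    def insert(self, addr, block):
        sig = self._signature(block)
        self.tags[addr] = sig
        self.data.setdefault(sig, list(block))  # reuse existing entry if any

    def read(self, addr):
        # May return an approximately similar block, not the exact one.
        return self.data[self.tags[addr]]

cache = DoppelgangerLikeCache()
cache.insert(0x100, [100, 101, 102, 103])
cache.insert(0x200, [98, 100, 103, 100])   # similar block: shares the entry
```

Two addresses now consume a single data-array entry, which is the source of the footprint reduction; reads of the second address return the representative (approximate) block.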
computations come from trigonometric functions, which require a massive amount of floating-point computations, while at the same time they can tolerate errors [20].
Computer Vision: Computer vision is critical in many emerging applications such as aug-
mented/virtual reality, autonomous driving and robotics. These applications often require intensive
computation efforts on large volumes of data. One core component of a computer vision system is its computation kernels, known as meta-functions, each dedicated to executing one specific algorithm. As these meta-functions are typically inherently error resilient, voltage scaling techniques
are applied to obtain attractive quality–energy trade-offs [167]. Ray tracing, another common computer vision task, requires intensive floating-point arithmetic, whose precision selection can be formulated as a stochastic optimization problem for acceleration [225]. Some computer vision applications are deployed onto perpetual systems, i.e., embedded
systems that are not connected to power but rely on harvesting energy from the environment. Such
systems often have strict power constraints when executing demanding machine vision algorithms.
Therefore, specialized languages and runtimes are proposed to maximize QoS on such platforms
for tasks such as tracking and remote sensing [234].
From the software aspect, various computer vision applications, including motion estimation and body tracking, also exhibit tolerance against code perforation, offering opportunities to slightly compromise accuracy in order to increase performance by two to three times [86]. However, although the accuracy and quality degradation from approximation are negligible most of the time, computer
vision systems may still output critically inaccurate or erroneous results. The language Topaz
enables outlier task detection and re-execution, which can collaborate with existing approximate
solutions to ensure that tasks such as subject pose tracking produce acceptable results [1].
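Code perforation itself is easy to illustrate: skip a fraction of loop iterations and extrapolate from the rest. The sketch below is a generic illustration of the technique, not the system evaluated in [86]:

```python
def perforated_mean(pixels, perforation=2):
    """Loop perforation sketch: visit only every `perforation`-th element
    and extrapolate, trading accuracy for roughly a perforation-x
    reduction in work (generic illustration of the technique)."""
    sampled = pixels[::perforation]
    return sum(sampled) / len(sampled)

data = list(range(1000))
exact = sum(data) / len(data)        # 499.5
approx = perforated_mean(data, 2)    # 499.0, computed from half the elements
```

For smooth aggregations such as means, the perforated result stays close to the exact one, which is why vision kernels tolerate this transformation so well.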
Computer Graphics: Modern digital photography and filming as well as 3D modelling, ren-
dering and simulation depend on extensive support from computer graphics algorithms. These
workloads demand a significant amount of computation due to increasingly complex visual content. Although computationally intensive, computer graphics show opportunities
for Approximate Computing by exhibiting strong data locality, which indicates high redundancies that can be harvested to mitigate the computation costs [109]. By using a cache to memoize the outcomes of instructions, computer graphics applications such as OpenGL games can execute fewer instructions and thus be accelerated [109]. Additionally, by leveraging the
tolerance against potentially non-deterministic parallel execution of sequential computer graphics
programs (e.g., volume rendering), notable speed-ups can be delivered [160]. Another direction of
enhancing the parallel computing performance of OpenGL applications via Approximate Computing is to relax the program semantics, where code can briefly bypass strict semantic adherence to achieve energy savings with acceptable output quality loss [29].
Big Data Analysis: Data querying is the first and foremost task in the big data analysis domain. Targeting improved performance, the approximate query engine BlinkDB is proposed in [4]. BlinkDB uses a set of stratified samples, built via an optimization framework, and deploys a dynamic sampling strategy to select an appropriately sized sample, permitting users to trade accuracy for response time.
Other works are mainly based on big data frameworks such as Hadoop and Spark. The authors in [66] propose ApproxHadoop, a framework to run approximation-enabled MapReduce programs. ApproxHadoop leverages statistical theory to optimize MapReduce programs when approximating with data sampling and task dropping. On the other hand, the authors in [87]
present ApproxSpark, a data sampling framework that enables approximate computing with estimated error bounds in Spark. ApproxSpark employs a tree-based approach to compute approximate aggregate values and corresponding error bounds, which can dynamically balance the trade-off among sampling rate, execution time, and accuracy loss.
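The statistical machinery behind such engines can be sketched with uniform sampling and a central-limit-theorem error bound. This is a simplification: BlinkDB actually builds stratified samples and ApproxHadoop additionally drops tasks, so the sample design below is an assumption for illustration:

```python
import math
import random

def approx_mean(population, sample_rate=0.1, z=1.96):
    """Estimate a mean from a uniform random sample and return a 95%
    confidence half-width via the central limit theorem (simplified
    stand-in for the estimators used by approximate query engines)."""
    n = max(2, int(len(population) * sample_rate))
    sample = random.sample(population, n)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)
    return mean, half_width     # estimate +/- error bound

random.seed(1)
data = [random.gauss(50, 10) for _ in range(100_000)]
est, err = approx_mean(data, 0.05)   # reads only 5% of the data
```

The caller can tighten the bound by raising the sample rate, which is exactly the accuracy-versus-response-time knob these engines expose to users.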
computing community aims to develop more efficient, robust, and fast frameworks for scientific
computations to meet these demands.
As the mathematical models for scientific computing become increasingly complex and advanced sensors collect vast amounts of data, computing resources become the bottleneck.
Scientific computing entails a considerable amount of continuous computation, while its target result is, by nature, an approximation. Generally, increased computation precision will provide a more accurate result, but it will also consume more resources. Scientists usually employ double-precision computations to achieve a precise result, which is expensive. Many studies have examined the trade-off between precision and performance. One idea is to balance the two by tuning the floating-point computations to mixed precision. The implementation in [123] uses
binary instrumentation and modification to build mixed-precision configurations, whereas [27]
uses dynamic program information and temporal locality to customize floating-point precision for
scientific computing. Both of these approaches use the program’s floating-point variables as the search space and treat the program as a black box in order to find the best configuration with some loss in accuracy. By analyzing the dependence between floating-point variables, the authors in
[70] design a community structure of floating-point variables and devise a scalable hierarchy for
optimization. Furthermore, the tools presented in [45] focus on determining the input settings of a floating-point routine in order to understand how they affect the error of the result.
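A black-box precision-tuning search of this kind can be sketched as follows, emulating reduced precision by mantissa rounding. The bit widths, the toy kernel, and the greedy one-pass search order are assumptions for illustration, not the actual strategies of [123] or [27]:

```python
import math

def quantize(x, mantissa_bits):
    """Round x to a float with a reduced-width mantissa."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

def tune_precision(kernel, exact, error_bound, n_vars,
                   full_bits=52, low_bits=23):
    """Greedily try demoting each variable from double-like to single-like
    precision, keeping a demotion only if the kernel output stays within
    the error bound; the kernel is treated as a black box."""
    bits = [full_bits] * n_vars
    for i in range(n_vars):
        trial = list(bits)
        trial[i] = low_bits
        if abs(kernel(trial) - exact) <= error_bound:
            bits = trial              # demotion accepted
    return bits

def kernel(bits):
    # Hypothetical scientific kernel with two tunable variables.
    a = quantize(math.pi, bits[0])
    b = quantize(math.e, bits[1])
    return a * b + a / b

exact = math.pi * math.e + math.pi / math.e
config = tune_precision(kernel, exact, error_bound=1e-5, n_vars=2)
# config == [23, 23]: both variables tolerate single-like precision here
```

Real tuners explore far larger search spaces with delta-debugging or hierarchical strategies, but the accept-if-within-bound core is the same.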
Even though cluster computers are relatively reliable, minor software and hardware faults may
still occur, particularly during the execution of lengthy tasks, like those of scientific computing.
This makes reliability and robustness an important consideration when carrying out scientific
computing. [209] presents a methodology that estimates an acceptable accuracy bound based on
the impact of different computation errors and faults on the entire task. The technique is not only useful in helping to create robust scientific computing frameworks, but it can also be used to judiciously discard computations to save execution time. The work in [31] presents
a programming language that can determine the quantitative reliability of a specific computing
application running on unreliable hardware. Skipping data accesses serves not only precision tuning and robustness analysis but also the acceleration of computations. Similarly, special hardware designs can be viewed as an approximation technique: [97] devises a special divider and square-root circuit that uses adaptive approximation to increase the computation rate and save energy.
Financial Analysis: Financial analysis is another area in which approximate computations
can be useful. The internet and the digitalization of the financial industry have led to increased
importance of computation in financial analysis, especially when analyzing large amounts of data
and incorporating machine learning. With the advent of these technologies, quantitative finance
models continue to develop. The use of approximate computations increases the efficiency of data
analysis and enables real-time processing. For example, [251] demonstrates a compiler-directed, output-based approximate memoization technique that provides speedups in two different financial analysis applications for pricing portfolios of stock options and swaptions. Meanwhile, [100] presents AXPROF, a framework for algorithmic profiling that analyzes randomized approximate programs, which are important to the analysis of big financial data.
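Approximate memoization of this kind can be sketched as a decorator that collapses nearby inputs onto one cache key, in the spirit of fuzzy memoization [10]. The rounding granularity and the stand-in pricing kernel are illustrative assumptions, not the compiler-directed scheme of [251]:

```python
import math

def fuzzy_memoize(precision=2):
    """Approximate memoization: inputs that round to the same key reuse a
    previously computed result (the rounding granularity is an assumed
    quality knob, not a value from the literature)."""
    def decorator(fn):
        cache = {}
        def wrapper(*args):
            key = tuple(round(a, precision) for a in args)
            if key not in cache:
                cache[key] = fn(*args)      # compute once per key
            return cache[key]               # reuse for nearby inputs
        wrapper.cache = cache
        return wrapper
    return decorator

@fuzzy_memoize(precision=2)
def pricing_kernel(s):
    # Hypothetical stand-in for an expensive option-pricing computation.
    return math.exp(-s) * math.sqrt(s)

pricing_kernel(1.0001)   # computed
pricing_kernel(1.0003)   # reuses the cached result (both keys round to 1.0)
```

The coarser the rounding, the higher the hit rate and the larger the output error, which is precisely the accuracy-for-speed trade-off these works exploit.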
Physics Simulations: Apart from being a part of scientific computing, simulation of physical
phenomena involves some more specific areas, such as hydrodynamics simulation, thermodynamics
simulation, and optical simulation. Many of these tasks are very compute-intensive. Approximate
Computing can be applied to physics simulations to accelerate them. [164] presents a novel system called OPPROX for execution-phase-aware approximation of an application. OPPROX is demonstrated to be efficient at substantially reducing the workload in a simulation of the Sedov blast wave problem. Additionally, the authors in [29] seek to combine Approximate
Computing with automatic parallelization by allowing the user to select tuning knobs that can be
used to trade performance gains and/or energy savings for output distortion.
Table 4. Remarkable improvements among all examined works with respect to the application domain and
the different layers of the computing stack.
over the total number of results, and correspondingly, Bit Error Ratio (BER), which calculates the
number of erroneous bits over the total bits. Another family of error metrics is the Error Distance
(ED), including Hamming Distance (HD), Mean (Relative) ED (MED/MRED), Normalized MED
(NMED), and other numerical-based metrics. Another metric that is employed in the evaluation
of approximate designs is the Probability of (Relative) ED higher than 𝑋 (PED/PRED), where 𝑋 is a numerical value. Finally, Mean Squared Error (MSE) and Root MSE (RMSE) are also common
metrics to evaluate the output quality.
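These metrics can be computed directly from paired exact and approximate outputs. The sketch below follows their usual definitions in the approximate-arithmetic literature (zero exact outputs are excluded from MRED to avoid division by zero) and uses a deliberately crude truncated 4-bit multiplier as an illustrative workload:

```python
import math

def error_metrics(exact, approx, max_output):
    """Compute common error metrics for paired exact/approximate outputs."""
    n = len(exact)
    eds = [abs(e - a) for e, a in zip(exact, approx)]   # error distances
    er = sum(ed != 0 for ed in eds) / n                 # Error Rate
    med = sum(eds) / n                                  # Mean ED
    nonzero = [(ed, e) for ed, e in zip(eds, exact) if e != 0]
    mred = sum(ed / abs(e) for ed, e in nonzero) / len(nonzero)
    nmed = med / max_output                             # MED, normalized
    mse = sum(ed ** 2 for ed in eds) / n
    rmse = math.sqrt(mse)
    return {"ER": er, "MED": med, "MRED": mred, "NMED": nmed,
            "MSE": mse, "RMSE": rmse}

# Exact vs. a hypothetical approximate 4-bit multiplier that drops the LSB
# of one operand (an assumed example circuit, chosen for illustration).
exact_out = [x * y for x in range(16) for y in range(16)]
approx_out = [(x & 0b1110) * y for x in range(16) for y in range(16)]
metrics = error_metrics(exact_out, approx_out, max_output=15 * 15)
```

Exhaustive evaluation over the full input space, as done here, is standard practice for small approximate arithmetic units; larger designs are typically characterized by random or application-driven sampling instead.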
There are also application-specific error metrics to describe the efficiency of the approximations at
a higher level. For example, in telecommunications, Packet Error Ratio (PER) is usually considered to
evaluate the communication quality. In ML applications, such as pattern recognition and information
retrieval, classification accuracy is usually considered, as well as other metrics such as precision (the fraction of retrieved instances that are relevant) and recall (the fraction of relevant instances that are retrieved). These
metrics provide a systematic view on the assessment of the approximate application.
Post-Fabrication Testing: Process variations, which lead to permanent faults and variations
in the hardware characteristics of transistors, are a major concern in nano-scale devices. All the
fabricated devices are passed through post-fabrication testing to detect manufacturing-induced
defects and variations. However, approximations make this already challenging process even more challenging. Only a limited amount of work has been carried out on the testing of Approximate
Computing devices. Therefore, there is a dire need for specialized approximation-aware testing
methodologies that can enable efficient-yet-accurate testing of these devices and reduce the yield
loss, wherever possible, through binning.
Reproducibility for Fair Comparison and Future Developments: In most cases, it is chal-
lenging to replicate the exact evaluation methodology and setup reported in research works, making
it difficult (if not impossible) to reproduce the results. Therefore, it is essential to promote open-
source contributions to ensure the reproducibility of the results. The open-source contributions will
also facilitate further research and development in the area of Approximate Computing, resulting
in rapid progress towards uncovering the true potential of Approximate Computing for improving
the efficiency of computing systems.
7 CONCLUSION
In this article, we presented Part II of our survey on Approximate Computing, which reviews the
state-of-the-art application-specific and architectural approximation techniques. Regarding the
application-specific techniques, we focused on the emerging AI/ML applications and discussed software, hardware, cross-layer and end-to-end approximations. Regarding the architectural
techniques, we focused on approximate processors & accelerators and different memory types
for approximate data storage. Moreover, we performed an extensive analysis of the application
spectrum of Approximate Computing, i.e., we classified per domain all the works reviewed in our two-part survey, presented representative techniques and remarkable results, and reported well-established error metrics and benchmark suites for evaluating the quality-of-service
of approximate designs. Despite the notable progress demonstrated by Approximate Computing,
there remains a pressing need for ongoing and significant innovation to fully realize the potential of
approximations in the design of complex computing systems. These open challenges are discussed
along with future directions at the end of our survey.
ACKNOWLEDGEMENT
This research is supported in parts by ASPIRE, the technology program management pillar of Abu
Dhabi’s Advanced Technology Research Council (ATRC), via the ASPIRE Awards for Research
Excellence.
REFERENCES
[1] Sara Achour and Martin C. Rinard. 2015. Approximate Computation with Outlier Detection in Topaz. In ACM
SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA).
711–730.
[2] Elizabeth Adams, Suganthi Venkatachalam, and Seok-Bum Ko. 2020. Approximate Restoring Dividers Using Inexact
Cells and Estimation From Partial Remainders. IEEE Trans. Comput. 69, 4 (2020), 468–474.
[3] Hassan Afzali-Kusha and Massoud Pedram. 2023. X-NVDLA: Runtime Accuracy Configurable NVDLA Based on
Applying Voltage Overscaling to Computing and Memory Units. IEEE Transactions on Circuits and Systems I: Regular
Papers 70, 5 (2023), 1989–2002.
[4] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB:
Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM SIGOPS European Conference
on Computer Systems (EuroSys). 29–42.
[5] Ankur Agrawal et al. 2021. 9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4
Inference and Workload-Aware Throttling. In IEEE International Solid-State Circuits Conference (ISSCC), Vol. 64.
144–146.
[6] Ankur Agrawal, Jungwook Choi, Kailash Gopalakrishnan, Suyog Gupta, Ravi Nair, Jinwook Oh, Daniel A Prener, Sunil
Shukla, Vijayalakshmi Srinivasan, and Zehra Sura. 2016. Approximate Computing: Challenges and Opportunities. In
IEEE International Conference on Rebooting Computing (ICRC). 1–8.
[7] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2017. Dual-Quality 4:2 Compressors for Utilizing
in Dynamic Accuracy Configurable Multipliers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 4
(2017), 1352–1361.
[8] Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2018. RAP-CLA: A Reconfigurable Approximate
Carry Look-Ahead Adder. IEEE Transactions on Circuits and Systems II: Express Briefs 65, 8 (2018), 1089–1093.
[9] Vahideh Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi, Rajesh K. Gupta, and Hadi Esmaeilzadeh. 2018. SnaPEA:
Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. In ACM/IEEE
International Symposium on Computer Architecture (ISCA). 662–673.
[10] Carlos Alvarez, Jesus Corbal, and Mateo Valero. 2005. Fuzzy Memoization for Floating-Point Multimedia Applications.
IEEE Trans. Comput. 54, 7 (2005), 922–927.
[11] Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis
Ceze, and Doug Burger. 2014. General-Purpose Code Acceleration with Limited-Precision Analog Computation. In
ACM/IEEE International Symposium on Computer Architecture (ISCA). 505–516.
[12] Michael R. Anderson and Michael Cafarella. 2016. Input Selection for Fast Feature Engineering. In IEEE International
Conference on Data Engineering (ICDE). 577–588.
[13] Mohammad Saeed Ansari, Bruce F. Cockburn, and Jie Han. 2021. An Improved Logarithmic Multiplier for Energy-
Efficient Neural Computing. IEEE Trans. Comput. 70, 4 (2021), 614–625.
[14] Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, and Saman Amarasinghe. 2011. Language
and Compiler Support for Auto-Tuning Variable-Accuracy Algorithms. In IEEE/ACM International Symposium on
Code Generation and Optimization (CGO). 85–96.
[15] Alexander Aponte-Moreno, Alejandro Moncada, Felipe Restrepo-Calle, and Cesar Pedraza. 2018. A Review of
Approximate Computing Techniques Towards Fault Mitigation in HW/SW Systems. In IEEE Latin-American Test
Symposium (LATS). 1–6.
[16] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, and Jörg Henkel. 2022. Hardware Approximate Techniques
for Deep Neural Network Accelerators: A Survey. Comput. Surveys 55, 4 (2022), 1–36.
[17] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, and Jörg Henkel. 2022. Cross-Layer
Approximation For Printed Machine Learning Circuits. In Design, Automation & Test in Europe (DATE). 1–6.
[18] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, and Jörg Henkel. 2023. Co-Design of
Approximate Multilayer Perceptron for Ultra-Resource Constrained Printed Circuits. IEEE Trans. Comput. (2023),
1–8.
[19] Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Mehdi B. Tahoori, and Jörg Henkel. 2023. Model-to-Circuit
Cross-Approximation For Printed Machine Learning Classifiers. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems (2023), 1–13.
[20] Woongki Baek and Trishul M. Chilimbi. 2010. Green: A Framework for Supporting Energy-Conscious Programming
Using Controlled Approximation. In ACM SIGPLAN Conference on Programming Language Design and Implementation
(PLDI). 198–209.
[21] Farshad Baharvand and Seyed Ghassem Miremadi. 2020. LEXACT: Low Energy N-Modular Redundancy Using
Approximate Computing for Real-Time Multicore Processors. IEEE Transactions on Emerging Topics in Computing 8, 2
(2020), 431–441.
[22] Paolo Bertoldi, Maria Avgerinou, and Luca Castellazzi. 2017. Trends in Data Centre Energy Consumption under the
European Code of Conduct for Data Centre Energy Efficiency. European Comission – Joint Research Centre (JRC)
(2017), 1–43.
[23] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC Benchmark Suite: Characteriza-
tion and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques
(PACT). 72–81.
[24] James Bornholt, Todd Mytkowicz, and Kathryn S. McKinley. 2014. Uncertain<T>: A First-Order Type for Uncertain
Data. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems
(ASPLOS). 51–66.
[25] Brett Boston, Adrian Sampson, Dan Grossman, and Luis Ceze. 2015. Probability Type Inference for Flexible Approxi-
mate Programming. In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages,
and Applications (OOPSLA). 470–487.
[26] Iulian Brumar, Marc Casas, Miquel Moreto, Mateo Valero, and Gurindar S. Sohi. 2017. ATM: Approximate Task
Memoization in the Runtime System. In IEEE International Parallel and Distributed Processing Symposium (IPDPS).
1140–1150.
[27] Hugo Brunie, Costin Iancu, Khaled Z. Ibrahim, Philip Brisk, and Brandon Cook. 2020. Tuning Floating-Point Precision
Using Dynamic Program Information and Temporal Locality. In ACM/IEEE SC, International Conference for High
Performance Computing, Networking, Storage and Analysis. 1–14.
[28] Surendra Byna, Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Srihari Cadambi. 2010. Best-Effort
Semantic Document Search on GPUs. In Workshop on General-Purpose Computation on Graphics Processing Units
(GPGPU). 86–93.
[29] Simone Campanoni, Glenn Holloway, Gu-Yeon Wei, and David Brooks. 2015. HELIX-UP: Relaxing Program Semantics
to Unleash Parallelization. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
235–245.
[30] Zhiwei Cao, Xin Zhou, Han Hu, Zhi Wang, and Yonggang Wen. 2022. Toward a Systematic Survey for Carbon Neutral
Data Centers. IEEE Communications Surveys & Tutorials 24, 2 (2022), 895–936.
[31] Michael Carbin, Sasa Misailovic, and Martin C. Rinard. 2013. Verifying Quantitative Reliability for Programs That
Execute on Unreliable Hardware. In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems,
Languages, and Applications (OOPSLA). 33–52.
[32] Jorge Castro-Godínez, Julián Mateus-Vargas, Muhammad Shafique, and Jörg Henkel. 2020. AxHLS: Design Space Ex-
ploration and High-Level Synthesis of Approximate Accelerators using Approximate Functional Units and Analytical
Models. In International Conference On Computer Aided Design (ICCAD). 1–9.
[33] Srimat T. Chakradhar and Anand Raghunathan. 2010. Best-Effort Computing: Re-thinking Parallel Software and
Hardware. In Design Automation Conference (DAC). 865–870.
[34] Ik Joon Chang, Debabrata Mohapatra, and Kaushik Roy. 2011. A Priority-Based 6T/8T Hybrid SRAM Architecture for
Aggressive Voltage Scaling in Video Applications. IEEE Transactions on Circuits and Systems for Video Technology 21,
2 (2011), 101–112.
[35] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron.
2009. Rodinia: A Benchmark Suite for Heterogeneous Computing. In IEEE International Symposium on Workload
Characterization (IISWC). 44–54.
[36] Jienan Chen and Jianhao Hu. 2013. Energy-Efficient Digital Signal Processing via Voltage-Overscaling-Based Residue
Number System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21, 7 (2013), 1322–1332.
[37] Ke Chen, Linbin Chen, Pedro Reviriego, and Fabrizio Lombardi. 2019. Efficient Implementations of Reduced Precision
Redundancy (RPR) Multiply and Accumulate (MAC). IEEE Trans. Comput. 68, 5 (2019), 784–790.
[38] Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2015. Design of Approximate Unsigned Integer Non-
Restoring Divider for Inexact Computing. In Great Lakes Symposium on VLSI (GLSVLSI). 51–56.
[39] Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2016. On the Design of Approximate Restoring Dividers
for Error-Tolerant Applications. IEEE Trans. Comput. 65, 8 (2016), 2522–2533.
[40] Linbin Chen, Jie Han, Weiqiang Liu, and Fabrizio Lombardi. 2017. Algorithm and Design of a Fully Parallel Approximate
Coordinate Rotation Digital Computer (CORDIC). IEEE Transactions on Multi-Scale Computing Systems 3, 3 (2017),
139–151.
[41] Linbin Chen, Jie Han, Weiqiang Liu, Paolo Montuschi, and Fabrizio Lombardi. 2018. Design, Evaluation and Application
of Approximate High-Radix Dividers. IEEE Transactions on Multi-Scale Computing Systems 4, 3 (2018), 299–312.
[42] Tianyi Chen, Bo Ji, Tianyu Ding, Biyi Fang, Guanyi Wang, Zhihui Zhu, Luming Liang, Yixin Shi, Sheng Yi, and Xiao
Tu. 2021. Only Train Once: A One-Shot Neural Network Training And Pruning Framework. Advances in Neural
Information Processing Systems 34 (2021), 19637–19651.
[43] Hari Cherupalli, Henry Duwe, Weidong Ye, Rakesh Kumar, and John Sartori. 2017. Bespoke Processors for Applications
with Ultra-Low Area and Power Constraints. In ACM/IEEE International Symposium on Computer Architecture (ISCA).
41–54.
[44] Wei-Fan Chiang, Mark Baranowski, Ian Briggs, Alexey Solovyev, Ganesh Gopalakrishnan, and Zvonimir Rakamarić.
2017. Rigorous Floating-Point Mixed-Precision Tuning. In ACM SIGPLAN Symposium on Principles of Programming
Languages (POPL). 300–315.
[45] Wei-Fan Chiang, Ganesh Gopalakrishnan, Zvonimir Rakamaric, and Alexey Solovyev. 2014. Efficient Search for Inputs
Causing High Floating-Point Errors. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
(PPoPP). 43–52.
[46] Vinay Kumar Chippa, Debabrata Mohapatra, Kaushik Roy, Srimat T. Chakradhar, and Anand Raghunathan. 2014.
Scalable Effort Hardware Design. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 22, 9 (2014),
2004–2016.
[47] Hyungmin Cho, Larkhoon Leem, and Subhasish Mitra. 2012. ERSA: Error Resilient System Architecture for Prob-
abilistic Applications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 31, 4 (2012),
546–558.
[48] Marco Cococcioni, Federico Rossi, Emanuele Ruffaldi, and Sergio Saponara. 2022. A Lightweight Posit Processing
Unit for RISC-V Processors in Deep Neural Network Applications. IEEE Transactions on Emerging Topics in Computing
10, 4 (2022), 1898–1908.
[49] Hans Jakob Damsgaard, Aleksandr Ometov, and Jari Nurmi. 2023. Approximation Opportunities in Edge Computing
Hardware: A Systematic Literature Review. Comput. Surveys 55, 12 (2023), 1–49.
[50] Eva Darulova and Viktor Kuncak. 2017. Towards a Compiler for Reals. ACM Transactions on Programming Languages
and Systems 39, 2 (2017), 1–28.
[51] Marc de Kruijf, Shuou Nomura, and Karthikeyan Sankaralingam. 2010. Relax: An Architectural Framework for
Software Recovery of Hardware Faults. In ACM/IEEE International Symposium on Computer Architecture (ISCA).
497–508.
[52] Cecilia De la Parra, Andre Guntoro, and Akash Kumar. 2020. ProxSim: GPU-based Simulation Framework for
Cross-Layer Approximate DNN Optimization. In Design, Automation & Test in Europe (DATE). 1193–1198.
[53] Farhad Ebrahimi-Azandaryani, Omid Akbari, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2020. Block-Based
Carry Speculative Approximate Adder for Energy-Efficient Applications. IEEE Transactions on Circuits and Systems II:
Express Briefs 67, 1 (2020), 137–141.
[54] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture Support for Disciplined
Approximate Programming. In ACM International Conference on Architectural Support for Programming Languages
and Operating Systems (ASPLOS). 301–312.
[55] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Neural Acceleration for General-Purpose
Approximate Programs. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 449–460.
[56] Darjn Esposito, Antonio Giuseppe Maria Strollo, Ettore Napoli, Davide De Caro, and Nicola Petra. 2018. Approximate
Multipliers Based on New Approximate Compressors. IEEE Transactions on Circuits and Systems I: Regular Papers 65,
12 (2018), 4169–4182.
[57] Yinghui Fan, Xiaoxi Wu, Jiying Dong, and Zhi Qi. 2019. AxDNN: Towards the Cross-Layer Design of Approximate
DNNs. In Asia and South Pacific Design Automation Conference (ASP-DAC). 317–322.
[58] Yuntan Fang, Huawei Li, and Xiaowei Li. 2012. SoftPCM: Enhancing Energy Efficiency and Lifetime of Phase Change
Memory in Video Applications via Approximate Write. In IEEE Asian Test Symposium (ATS). 131–136.
[59] Isaías Felzmann, João Fabrício Filho, and Lucas Wanner. 2020. Risk-5: Controlled Approximations for RISC-V. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 39, 11 (2020), 4052–4063.
[60] Vimuth Fernando, Keyur Joshi, and Sasa Misailovic. 2019. Verifying Safety and Accuracy of Approximate Parallel
Programs via Canonical Sequentialization. Proceedings of the ACM on Programming Languages 3, OOPSLA (2019),
1–29.
[61] Fabio Frustaci, David Blaauw, Dennis Sylvester, and Massimo Alioto. 2016. Approximate SRAMs With Dynamic
Energy-Quality Management. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 24, 6 (2016), 2128–2141.
[62] Fabio Frustaci, Stefania Perri, Pasquale Corsonello, and Massimo Alioto. 2020. Approximate Multipliers With Dynamic
Truncation for Energy Reduction via Graceful Quality Degradation. IEEE Transactions on Circuits and Systems II:
Express Briefs 67, 12 (2020), 3427–3431.
[63] Sanjay Ganapathy, Swagath Venkataramani, Giridhur Sriraman, Balaraman Ravindran, and Anand Raghunathan.
2020. DyVEDeep: Dynamic Variable Effort Deep Neural Networks. ACM Transactions on Embedded Computing
Systems 19, 3 (2020), 1–24.
[64] Georgios Georgis, George Lentaris, and Dionysios Reisis. 2019. Acceleration Techniques and Evaluation on Multi-Core
CPU, GPU and FPGA for Image Processing and Super-Resolution. Springer Journal of Real-Time Image Processing 16,
4 (2019), 1207–1234.
[65] Soumendu Kumar Ghosh, Arnab Raha, and Vijay Raghunathan. 2020. Approximate Inference Systems (AxIS): End-to-End Approximations for Energy-Efficient Inference at the Edge. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED). 7–12.
[66] Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D. Nguyen. 2015. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 383–397.
[67] Jing Gong, Hassaan Saadat, Hasindu Gamaarachchi, Haris Javaid, Xiaobo Sharon Hu, and Sri Parameswaran. 2023.
ApproxTrain: Fast Simulation of Approximate Multipliers for DNN Training and Inference. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems (2023).
[68] Yu Gong, Bo Liu, Wei Ge, and Longxing Shi. 2019. ARA: Cross-Layer Approximate Computing Framework based
Reconfigurable Architecture for CNNs. Elsevier Microelectronics Journal 87 (2019), 33–44.
[69] Cong Guo, Chen Zhang, Jingwen Leng, Zihan Liu, Fan Yang, Yunxin Liu, Minyi Guo, and Yuhao Zhu. 2022. ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 1414–1433.
[70] Hui Guo and Cindy Rubio-González. 2018. Exploiting Community Structure for Floating-Point Precision Tuning. In
ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). 333–343.
[71] Vaibhav Gupta, Debabrata Mohapatra, Anand Raghunathan, and Kaushik Roy. 2013. Low-Power Digital Signal
Processing Using Approximate Adders. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
32, 1 (2013), 124–137.
[72] Matthew R. Guthaus, Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. 2001.
MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In IEEE International Workshop on
Workload Characterization (WWC). 3–14.
[73] Jie Han and Michael Orshansky. 2013. Approximate Computing: An Emerging Paradigm for Energy-Efficient Design.
In IEEE European Test Symposium (ETS). 1–6.
[74] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed Deep Neural Network. ACM SIGARCH Computer Architecture News 44, 3 (2016), 243–254.
[75] Song Han, Huizi Mao, and William J. Dally. 2015. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149 (2015), 1–14.
[76] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both Weights and Connections for Efficient Neural
Networks. Advances in Neural Information Processing Systems 28 (2015).
[77] Muhammad Abdullah Hanif, Faiq Khalid, and Muhammad Shafique. 2019. CANN: Curable Approximations for
High-Performance Deep Neural Network Accelerators. In Design Automation Conference (DAC). 1–6.
[78] Muhammad Abdullah Hanif, Giuseppe Maria Sarda, Alberto Marchisio, Guido Masera, Maurizio Martina, and
Muhammad Shafique. 2022. CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient
Low-precision Deep Convolutional Neural Networks. In International Joint Conference on Neural Networks (IJCNN).
1–8.
[79] Muhammad Abdullah Hanif and Muhammad Shafique. 2022. A Cross-Layer Approach Towards Developing Efficient
Embedded Deep Learning Systems. Elsevier Microprocessors and Microsystems 88 (2022), 103609.
[80] Soheil Hashemi, R. Iris Bahar, and Sherief Reda. 2015. DRUM: A Dynamic Range Unbiased Multiplier for Approximate
Applications. In International Conference on Computer-Aided Design (ICCAD). 418–425.
[81] Soheil Hashemi, R. Iris Bahar, and Sherief Reda. 2016. A Low-Power Dynamic Divider for Approximate Applications.
In Design Automation Conference (DAC). 1–6.
[82] Soheil Hashemi, Hokchhay Tann, Francesco Buttafuoco, and Sherief Reda. 2018. Approximate Computing for
Biometric Security Systems: A Case Study on Iris Scanning. In Design, Automation & Test in Europe (DATE). 319–324.
[83] Soheil Hashemi, Hokchhay Tann, and Sherief Reda. 2018. BLASYS: Approximate Logic Synthesis Using Boolean
Matrix Factorization. In Design Automation Conference (DAC). 1–6.
[84] Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel Pruning for Accelerating Very Deep Neural Networks. In
IEEE International Conference on Computer Vision (ICCV). 1389–1397.
[85] Zhiqiang He, Yaguan Qian, Yuqi Wang, Bin Wang, Xiaohui Guan, Zhaoquan Gu, Xiang Ling, Shaoning Zeng, Haijiang
Wang, and Wujie Zhou. 2022. Filter Pruning via Feature Discrimination in Deep Neural Networks. In European
Conference on Computer Vision (ECCV). 245–261.
[86] Henry Hoffmann, Sasa Misailovic, Stelios Sidiroglou, Anant Agarwal, and Martin C. Rinard. 2009. Using Code
Perforation to Improve Performance, Reduce Energy Consumption, and Respond to Failures. Massachusetts Institute
of Technology Technical Report (2009), 1–21.
[87] Guangyan Hu, Sandro Rigo, Desheng Zhang, and Thu Nguyen. 2019. Approximation with Error Bounds in Spark. In
IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
(MASCOTS). 61–73.
[88] Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network Trimming: A Data-Driven Neuron
Pruning Approach towards Efficient Deep Architectures. arXiv preprint arXiv:1607.03250 (2016), 1–9.
[89] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2016. Binarized Neural Networks.
Advances in Neural Information Processing Systems 29 (2016), 1–9.
[90] Mohsen Imani, Ricardo Garcia, Andrew Huang, and Tajana Rosing. 2019. CADE: Configurable Approximate Divider
for Energy Efficiency. In Design, Automation & Test in Europe (DATE). 586–589.
[91] IoT Analytics. 2023. State of IoT 2023. https://iot-analytics.com/number-connected-iot-devices/
[92] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only
Inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2704–2713.
[93] Shubham Jain, Swagath Venkataramani, Vijayalakshmi Srinivasan, Jungwook Choi, Pierce Chuang, and Leland Chang.
2018. Compensated-DNN: Energy Efficient Low-Precision Deep Neural Networks by Compensating Quantization
Errors. In Design Automation Conference (DAC). 1–6.
[94] Shubham Jain, Swagath Venkataramani, Vijayalakshmi Srinivasan, Jungwook Choi, Kailash Gopalakrishnan, and
Leland Chang. 2019. BiScaled-DNN: Quantizing Long-tailed Datastructures with Two Scale Factors for Deep Neural
Networks. In Design Automation Conference (DAC). 1–6.
[95] Daehan Ji, Dongyeob Shin, and Jongsun Park. 2020. An Error Compensation Technique for Low-Voltage DNN
Accelerators. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, 2 (2020), 397–408.
[96] Honglan Jiang, Jie Han, Fei Qiao, and Fabrizio Lombardi. 2016. Approximate Radix-8 Booth Multipliers for Low-Power
and High-Performance Operation. IEEE Trans. Comput. 65, 8 (2016), 2638–2644.
[97] Honglan Jiang, Leibo Liu, Fabrizio Lombardi, and Jie Han. 2019. Low-Power Unsigned Divider and Square Root
Circuit Designs Using Adaptive Approximation. IEEE Trans. Comput. 68, 11 (2019), 1635–1646.
[98] Xun Jiao, Dongning Ma, Wanli Chang, and Yu Jiang. 2020. LEVAX: An Input-Aware Learning-Based Error Model of
Voltage-Scaled Functional Units. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 39,
12 (2020), 5032–5041.
[99] Xun Jiao, Dongning Ma, Wanli Chang, and Yu Jiang. 2020. TEVoT: Timing Error Modeling of Functional Units under
Dynamic Voltage and Temperature Variations. In Design Automation Conference (DAC). 1–6.
[100] Keyur Joshi, Vimuth Fernando, and Sasa Misailovic. 2019. Statistical Algorithmic Profiling for Randomized Approxi-
mate Programs. In ACM/IEEE International Conference on Software Engineering (ICSE). 608–618.
[101] Norman P. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. In ACM/IEEE
International Symposium on Computer Architecture (ISCA). 1–12.
[102] Matthias Jung, Éder Zulian, Deepak M. Mathew, Matthias Herrmann, Christian Brugger, Christian Weis, and Norbert
Wehn. 2015. Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs. In International Symposium on
Memory Systems (MEMSYS). 85–91.
[103] Andrew B. Kahng and Seokhyeong Kang. 2012. Accuracy-Configurable Adder for Approximate Arithmetic Designs.
In Design Automation Conference (DAC). 820–825.
[104] Andrew B. Kahng, Seokhyeong Kang, Rakesh Kumar, and John Sartori. 2010. Designing a Processor from the Ground
up to Allow Voltage/Reliability Tradeoffs. In IEEE International Symposium on High Performance Computer Architecture
(HPCA). 1–11.
[105] Srikanth Kandula, Anil Shanbhag, Aleksandar Vitorovic, Matthaios Olma, Robert Grandl, Surajit Chaudhuri, and
Bolin Ding. 2016. Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters. In ACM SIGMOD
International Conference on Management of Data (MOD). 631–646.
[106] Anil Kanduri, Antonio Miele, Amir M. Rahmani, Pasi Liljeberg, Cristiana Bolchini, and Nikil Dutt. 2018. Approximation-
Aware Coordinated Power/Performance Management for Heterogeneous Multi-cores. In Design Automation Conference
(DAC). 1–6.
[107] Seokwon Kang, Kyunghwan Choi, and Yongjun Park. 2020. PreScaler: An Efficient System-Aware Precision Scaling
Framework on Heterogeneous Systems. In IEEE/ACM International Symposium on Code Generation and Optimization
(CGO). 280–292.
[108] Mustafa Karakoy, Orhan Kislal, Xulong Tang, Mahmut Taylan Kandemir, and Meenakshi Arunachalam. 2019.
Architecture-Aware Approximate Computing. Proceedings of the ACM on Measurement and Analysis of Computing
Systems 3, 2 (2019), 1–24.
[109] Georgios Keramidas, Chrysa Kokkala, and Iakovos Stamoulis. 2015. Clumsy Value Cache: An Approximate Memoiza-
tion Technique for Mobile GPU Fragment Shaders. In Workshop on Approximate Computing (WAPCO). 1–6.
[110] Min Soo Kim, Alberto A. Del Barrio, Leonardo Tavares Oliveira, Roman Hermida, and Nader Bagherzadeh. 2018. Efficient Mitchell’s Approximate Log Multipliers for Convolutional Neural Networks. IEEE Trans. Comput. 68, 5 (2018), 660–675.
[111] Sung Kim, Patrick Howe, Thierry Moreau, Armin Alaghi, Luis Ceze, and Visvesh Sathe. 2018. MATIC: Learning
Around Errors for Efficient Low-Voltage Neural Network Accelerators. In Design, Automation & Test in Europe (DATE).
1–6.
[112] Younghoon Kim, Swagath Venkataramani, Sanchari Sen, and Anand Raghunathan. 2021. Value Similarity Extensions
for Approximate Computing in General-Purpose Processors. In Design, Automation & Test in Europe (DATE). 481–486.
[113] Yongtae Kim, Yong Zhang, and Peng Li. 2013. An Energy Efficient Approximate Adder with Carry Skip for Error
Resilient Neuromorphic VLSI Systems. In International Conference on Computer-Aided Design (ICCAD). 130–137.
[114] Orhan Kislal and Mahmut T. Kandemir. 2018. Data Access Skipping for Recursive Partitioning Methods. Elsevier
Computer Languages, Systems & Structures 53 (2018), 143–162.
[115] Skanda Koppula, Lois Orosa, A. Giray Yağlıkçı, Roknoddin Azizi, Taha Shahroodi, Konstantinos Kanellopoulos, and Onur Mutlu. 2019. EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 166–181.
[116] Raghuraman Krishnamoorthi. 2018. Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper.
arXiv preprint arXiv:1806.08342 (2018), 1–36.
[117] Dhanya R. Krishnan, Do Le Quoc, Pramod Bhatotia, Christof Fetzer, and Rodrigo Rodrigues. 2016. IncApprox: A Data
Analytics System for Incremental Approximate Computing. In International Conference on World Wide Web (WWW).
1133–1144.
[118] Animesh Kumar, Jan Rabaey, and Kannan Ramchandran. 2009. SRAM Supply Voltage Scaling: A Reliability Perspective. In International Symposium on Quality Electronic Design (ISQED). 782–787.
[119] Fadi J. Kurdahi, Ahmed Eltawil, Kang Yi, Stanley Cheng, and Amin Khajeh. 2010. Low-Power Multimedia System
Design by Aggressive Voltage Scaling. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18, 5 (2010),
852–856.
[120] Christos Kyrkou, George Plastiras, Theocharis Theocharides, Stylianos I. Venieris, and Christos-Savvas Bouganis. 2018.
DroNet: Efficient Convolutional Neural Network Detector for Real-Time UAV Applications. In Design, Automation &
Test in Europe (DATE). 967–972.
[121] Ignacio Laguna, Paul C. Wood, Ranvijay Singh, and Saurabh Bagchi. 2019. GPUMixer: Performance-Driven Floating-
Point Tuning for GPU Scientific Applications. In ISC International Conference on High Performance Computing (HPC).
227–246.
[122] Michael O. Lam and Jeffrey K. Hollingsworth. 2018. Fine-Grained Floating-Point Precision Analysis. SAGE International
Journal of High Performance Computing Applications 32, 2 (2018), 231–245.
[123] Michael O. Lam, Jeffrey K. Hollingsworth, Bronis R. de Supinski, and Matthew P. Legendre. 2013. Automatically Adapt-
ing Programs for Mixed-Precision Floating-Point Computation. In ACM International Conference on Supercomputing
(ICS). 369–378.
[124] Nikolay Laptev, Kai Zeng, and Carlo Zaniolo. 2012. Early Accurate Results for Advanced Analytics on MapReduce.
Proceedings of the VLDB Endowment 5, 10 (2012), 1028–1039.
[125] Vadim Lebedev and Victor Lempitsky. 2016. Fast ConvNets Using Group-wise Brain Damage. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). 2554–2564.
[126] Chunho Lee, Miodrag Potkonjak, and William Henry Mangione-Smith. 1997. MediaBench: A Tool for Evaluating and
Synthesizing Multimedia and Communications Systems. In IEEE/ACM International Symposium on Microarchitecture
(MICRO). 330–335.
[127] Seogoo Lee, Lizy K. John, and Andreas Gerstlauer. 2017. High-Level Synthesis of Approximate Hardware under Joint
Precision and Voltage Scaling. In Design, Automation & Test in Europe (DATE). 187–192.
[128] George Lentaris, George Chatzitsompanis, Vasileios Leon, Kiamal Pekmestzi, and Dimitrios Soudris. 2020. Combin-
ing Arithmetic Approximation Techniques for Improved CNN Circuit Design. In IEEE International Conference on
Electronics, Circuits and Systems (ICECS). 1–4.
[129] Vasileios Leon, Muhammad Abdullah Hanif, Giorgos Armeniakos, Xun Jiao, Muhammad Shafique, Kiamal Pekmestzi,
and Dimitrios Soudris. 2023. Approximate Computing Survey, Part I: Terminology and Software & Hardware
Approximation Techniques. arXiv preprint (2023), 1–34.
[130] Vasileios Leon, Georgios Makris, Sotirios Xydis, Kiamal Pekmestzi, and Dimitrios Soudris. 2022. MAx-DNN: Multi-
Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators. In IEEE Latin America Symposium
on Circuits and System (LASCAS). 1–4.
[131] Vasileios Leon, Theodora Paparouni, Evangelos Petrongonas, Dimitrios Soudris, and Kiamal Pekmestzi. 2021. Improv-
ing Power of DSP and CNN Hardware Accelerators Using Approximate Floating-Point Multipliers. ACM Transactions
on Embedded Computing Systems 20, 5 (2021), 1–21.
[132] Vasileios Leon, Ioannis Stratakos, Giorgos Armeniakos, George Lentaris, and Dimitrios Soudris. 2021. ApproxQAM:
High-Order QAM Demodulation Circuits with Approximate Arithmetic. In International Conference on Modern Circuits
and Systems Technologies (MOCAST). 1–5.
[133] Vasileios Leon, Georgios Zervakis, Dimitrios Soudris, and Kiamal Pekmestzi. 2018. Approximate Hybrid High Radix
Encoding for Energy-Efficient Inexact Multipliers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26,
3 (2018), 421–430.
[134] Vasileios Leon, Georgios Zervakis, Sotirios Xydis, Dimitrios Soudris, and Kiamal Pekmestzi. 2018. Walking through
the Energy-Error Pareto Frontier of Approximate Multipliers. IEEE Micro 38, 4 (2018), 40–49.
[135] Fei Li, Youyou Lu, Zhongjie Wu, and Jiwu Shu. 2019. ASCache: An Approximate SSD Cache for Error-Tolerant
Applications. In Design Automation Conference (DAC).
[136] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. 2016. Pruning Filters for Efficient ConvNets.
arXiv preprint arXiv:1608.08710 (2016), 1–13.
[137] Shikai Li, Sunghyun Park, and Scott Mahlke. 2018. Sculptor: Flexible Approximation with Selective Dynamic Loop
Perforation. In ACM International Conference on Supercomputing (ICS). 341–351.
[138] Yingyan Lin, Charbel Sakr, Yongjune Kim, and Naresh Shanbhag. 2017. PredictiveNet: An Energy-Efficient Con-
volutional Neural Network via Zero Prediction. In IEEE International Symposium on Circuits and Systems (ISCAS).
1–4.
[139] Michael D. Linderman, Matthew Ho, David L. Dill, Teresa H. Meng, and Garry P. Nolan. 2010. Towards Program
Optimization through Automated Analysis of Numerical Precision. In IEEE/ACM International Symposium on Code
Generation and Optimization (CGO). 230–237.
[140] Gai Liu and Zhiru Zhang. 2017. Statistically Certified Approximate Logic Synthesis. In International Conference on
Computer-Aided Design (ICCAD). 344–351.
[141] Ren-Shuo Liu, Chia-Lin Yang, and Wei Wu. 2012. Optimizing NAND Flash-Based SSDs via Retention Relaxation. In
USENIX Conference on File and Storage Technologies (FAST). 1–14.
[142] Song Liu, Karthik Pattabiraman, Thomas Moscibroda, and Benjamin G. Zorn. 2011. Flikker: Saving DRAM Refresh-
Power through Critical Data Partitioning. In ACM International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS). 213–224.
[143] Weiqiang Liu, Jing Li, Tao Xu, Chenghua Wang, Paolo Montuschi, and Fabrizio Lombardi. 2018. Combining Restoring
Array and Logarithmic Dividers into an Approximate Hybrid Design. In IEEE Symposium on Computer Arithmetic
(ARITH). 92–98.
[144] Weiqiang Liu, Liangyu Qian, Chenghua Wang, Honglan Jiang, Jie Han, and Fabrizio Lombardi. 2017. Design of
Approximate Radix-4 Booth Multipliers for Error-Tolerant Computing. IEEE Trans. Comput. 66, 8 (2017), 1435–1441.
[145] Weiqiang Liu, Jiahua Xu, Danye Wang, Chenghua Wang, Paolo Montuschi, and Fabrizio Lombardi. 2018. Design and
Evaluation of Approximate Logarithmic Multipliers for Low Power Error-Tolerant Applications. IEEE Transactions on
Circuits and Systems I: Regular Papers 65, 9 (2018), 2856–2868.
[146] Yang Liu, Tong Zhang, and Keshab K. Parhi. 2010. Computation Error Analysis in Digital Signal Processing Systems
With Overscaled Supply Voltage. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18, 4 (2010), 517–526.
[147] Zhenhong Liu, Amir Yazdanbakhsh, Dong Kai Wang, Hadi Esmaeilzadeh, and Nam Sung Kim. 2019. AxMemo:
Hardware-Compiler Co-Design for Approximate Code Memoization. In ACM/IEEE International Symposium on
Computer Architecture (ISCA). 685–697.
[148] Wei Lou, Lei Xun, Amin Sabet, Jia Bi, Jonathon Hare, and Geoff V Merrett. 2021. Dynamic-OFA: Runtime DNN
Architecture Switching for Performance Scaling on Heterogeneous Embedded Platforms. In IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops (CVPRW). 3110–3118.
[149] Dongning Ma, Rahul Thapa, Xingjian Wang, Xun Jiao, and Cong Hao. 2021. Workload-Aware Approximate Computing
Configuration. In Design, Automation & Test in Europe (DATE). 920–925.
[150] Dongning Ma, Xinqiao Zhang, Ke Huang, Yu Jiang, Wanli Chang, and Xun Jiao. 2022. DEVoT: Dynamic Delay
Modeling of Functional Units Under Voltage and Temperature Variations. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 41, 4 (2022), 827–839.
[151] Divya Mahajan, Amir Yazdanbaksh, Jongse Park, Bradley Thwaites, and Hadi Esmaeilzadeh. 2016. Towards Statistical
Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration. In ACM/IEEE Annual International
Symposium on Computer Architecture (ISCA). 66–77.
[152] K. Manikantta Reddy, M. H. Vasantha, Y. B. Nithin Kumar, and Devesh Dwivedi. 2020. Design of Approximate Booth
Squarer for Error-Tolerant Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 5 (2020),
1230–1241.
[153] Jiayuan Meng, Srimat Chakradhar, and Anand Raghunathan. 2009. Best-Effort Parallel Execution Framework for
Recognition and Mining Applications. In IEEE International Symposium on Parallel Distributed Processing (IPDPS).
1–12.
[154] Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Surendra Byna. 2010. Exploiting the Forgiving Nature
of Applications for Scalable Parallel Execution. In IEEE International Symposium on Parallel Distributed Processing
(IPDPS). 1–12.
[155] Harshitha Menon, Michael O. Lam, Daniel Osei-Kuffuor, Markus Schordan, Scott Lloyd, Kathryn Mohror, and Jeffrey Hittinger. 2018. ADAPT: Algorithmic Differentiation Applied to Floating-Point Precision Tuning. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 614–626.
[156] Joshua San Miguel, Jorge Albericio, Natalie Enright Jerger, and Aamer Jaleel. 2016. The Bunker Cache for Spatio-Value
Approximation. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 1–12.
[157] Joshua San Miguel, Jorge Albericio, Andreas Moshovos, and Natalie Enright Jerger. 2015. Doppelgänger: A Cache for
Approximate Computing. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 50–61.
[158] Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. 2014. Load Value Approximation. In IEEE/ACM International
Symposium on Microarchitecture (MICRO). 127–139.
[159] Sasa Misailovic, Michael Carbin, Sara Achour, Zichao Qi, and Martin C. Rinard. 2014. Chisel: Reliability- and
Accuracy-Aware Optimization of Approximate Computational Kernels. In ACM SIGPLAN International Conference on
Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). 309–328.
[160] Sasa Misailovic, Deokhwan Kim, and Martin C. Rinard. 2013. Parallelizing Sequential Programs with Statistical
Accuracy Tests. ACM Transactions on Embedded Computing Systems 12, 2s (2013), 1–26.
[161] Sasa Misailovic, Stelios Sidiroglou, Henry Hoffmann, and Martin C. Rinard. 2010. Quality of Service Profiling. In
ACM/IEEE International Conference on Software Engineering (ICSE), Vol. 1. 25–34.
[162] Sasa Misailovic, Stelios Sidiroglou, and Martin C. Rinard. 2012. Dancing with Uncertainty. In ACM Workshop on
Relaxing Synchronization for Multicore and Manycore Scalability (RACES). 51–60.
[163] Asit K. Mishra, Rajkishore Barik, and Somnath Paul. 2014. iACT: A Software-Hardware Framework for Understanding
the Scope of Approximate Computing. In Workshop on Approximate Computing Across the System Stack (WACAS).
1–6.
[164] Subrata Mitra, Manish K. Gupta, Sasa Misailovic, and Saurabh Bagchi. 2017. Phase-Aware Optimization in Approximate
Computing. In IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 185–196.
[165] Sparsh Mittal. 2016. A Survey of Techniques for Approximate Computing. Comput. Surveys 48, 4 (2016), 1–33.
[166] Sparsh Mittal and Jeffrey S. Vetter. 2015. A Survey of CPU-GPU Heterogeneous Computing Techniques. Comput.
Surveys 47, 4 (2015), 1–35.
[167] Debabrata Mohapatra, Vinay K. Chippa, Anand Raghunathan, and Kaushik Roy. 2011. Design of Voltage-Scalable
Meta-Functions for Approximate Computing. In Design, Automation & Test in Europe (DATE). 1–6.
[168] Amir Momeni, Jie Han, Paolo Montuschi, and Fabrizio Lombardi. 2015. Design and Analysis of Approximate
Compressors for Multiplication. IEEE Trans. Comput. 64, 4 (2015), 984–994.
[169] Thierry Moreau, Joshua San Miguel, Mark Wyse, James Bornholt, Armin Alaghi, Luis Ceze, Natalie Enright Jerger,
and Adrian Sampson. 2018. A Taxonomy of General Purpose Approximate Computing Techniques. IEEE Embedded
Systems Letters 10, 1 (2018), 2–5.
[170] Thierry Moreau, Mark Wyse, Jacob Nelson, Adrian Sampson, Hadi Esmaeilzadeh, Luis Ceze, and Mark Oskin. 2015.
SNNAP: Approximate Computing on Programmable SoCs via Neural Acceleration. In IEEE International Symposium
on High Performance Computer Architecture (HPCA). 603–614.
[171] Vojtech Mrazek, Radek Hrbacek, Zdenek Vasicek, and Lukas Sekanina. 2017. EvoApprox8b: Library of Approximate
Adders and Multipliers for Circuit Design and Benchmarking of Approximation Methods. In Design, Automation &
Test in Europe (DATE). 258–261.
[172] Vojtech Mrazek, Syed Shakib Sarwar, Lukas Sekanina, Zdenek Vasicek, and Kaushik Roy. 2016. Design of Power-
Efficient Approximate Multipliers for Approximate Artificial Neural Networks. In IEEE/ACM International Conference
on Computer-Aided Design (ICCAD). 1–7.
[173] Vojtech Mrazek, Lukas Sekanina, and Zdenek Vasicek. 2020. Libraries of Approximate Circuits: Automated Design
and Application in CNN Accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10, 4
(2020), 406–418.
[174] Vojtech Mrazek, Zdenek Vasicek, Lukas Sekanina, Muhammad Abdullah Hanif, and Muhammad Shafique. 2019.
ALWANN: Automatic Layer-Wise Approximation of Deep Neural Network Accelerators without Retraining. In
IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 1–8.
[175] Muhammad Husnain Mubarik, Dennis D. Weller, Nathaniel Bleier, Matthew Tomei, Jasmin Aghassi-Hagmann, Mehdi B.
Tahoori, and Rakesh Kumar. 2020. Printed Machine Learning Classifiers. In IEEE/ACM International Symposium on
Microarchitecture (MICRO). 73–87.
[176] Ramanathan Narayanan, Berkin Ozisikyilmaz, Joseph Zambreno, Gokhan Memik, and Alok Choudhary. 2006.
MineBench: A Benchmark Suite for Data Mining Workloads. In IEEE International Symposium on Workload Character-
ization (IISWC). 182–188.
[177] Geneviève Ndour, Tiago Trevisan Jost, Anca Molnos, Yves Durand, and Arnaud Tisserand. 2019. Evaluation of
Variable Bit-Width Units in a RISC-V Processor for Approximate Computing. In ACM International Conference on
Computing Frontiers. 344–349.
[178] Kumud Nepal, Soheil Hashemi, Hokchhay Tann, R. Iris Bahar, and Sherief Reda. 2019. Automated High-Level
Generation of Low-Power Approximate Computing Circuits. IEEE Transactions on Emerging Topics in Computing 7, 1
(2019), 18–30.
[179] Kumud Nepal, Yueting Li, R. Iris Bahar, and Sherief Reda. 2014. ABACUS: A Technique for Automated Behavioral
Synthesis of Approximate Computing Circuits. In Design, Automation & Test in Europe (DATE). 1–6.
[180] Gianmarco Ottavi, Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, and Davide Rossi. 2020. A
Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference. In IEEE Computer Society Annual Symposium on
VLSI (ISVLSI). 512–517.
[181] Priyadarshini Panda, Abhronil Sengupta, and Kaushik Roy. 2017. Energy-Efficient and Improved Image Recognition
with Conditional Deep Learning. ACM Journal on Emerging Technologies in Computing Systems 13, 3 (2017), 1–21.
[182] Priyadarshini Panda, Abhronil Sengupta, Syed Shakib Sarwar, Gopalakrishnan Srinivasan, Swagath Venkataramani,
Anand Raghunathan, and Kaushik Roy. 2016. Cross-Layer Approximations for Neuromorphic Computing: From
Devices to Circuits and Systems. In Design Automation Conference (DAC). 1–6.
[183] Priyadarshini Panda, Swagath Venkataramani, Abhronil Sengupta, Anand Raghunathan, and Kaushik Roy. 2017.
Energy-Efficient Object Detection Using Semantic Decomposition. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 25, 9 (2017), 2673–2677.
[184] Pramesh Pandey, Prabal Basu, Koushik Chakraborty, and Sanghamitra Roy. 2019. GreenTPU: Improving Timing Error
Resilience of a Near-Threshold Tensor Processing Unit. In Design Automation Conference (DAC). 1–6.
[185] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. ACM SIGARCH Computer Architecture News 45, 2 (2017), 27–40.
[186] Jongse Park, Hadi Esmaeilzadeh, Xin Zhang, Mayur Naik, and William Harris. 2015. FlexJava: Language Support for
Safe and Modular Approximate Programming. In ACM SIGSOFT Symposium and European Conference on Foundations
of Software Engineering (FSE). 745–757.
[187] Jongse Park, Xin Zhang, Kangqi Ni, Hadi Esmaeilzadeh, and Mayur Naik. 2014. ExpAX: A Framework for Automating
Approximate Programming. Georgia Institute of Technology Technical Report (2014), 1–17.
[188] Jun-Seok Park et al. 2021. 9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC. In IEEE International Solid-State Circuits Conference (ISSCC), Vol. 64. 152–154.
[189] Yongjoo Park, Jingyi Qing, Xiaoyang Shen, and Barzan Mozafari. 2019. BlinkML: Efficient Maximum Likelihood
Estimation with Probabilistic Guarantees. In ACM SIGMOD International Conference on Management of Data (MOD).
1135–1152.
[190] Masoud Pashaeifar, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2018. Approximate Reverse Carry Propagate
Adder for Energy-Efficient DSP Applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26, 11
(2018), 2530–2541.
[191] Ratko Pilipović, Patricio Bulić, and Uroš Lotrič. 2021. A Two-Stage Operand Trimming Approximate Logarithmic
Multiplier. IEEE Transactions on Circuits and Systems I: Regular Papers 68, 6 (2021), 2535–2545.
[192] George Plastiras, Christos Kyrkou, and Theocharis Theocharides. 2018. Efficient ConvNet-based Object Detection for
Unmanned Aerial Vehicles by Selective Tile Processing. In International Conference on Distributed Smart Cameras
(ICDSC). 1–6.
[193] Louis-Noel Pouchet. 2015. PolyBench. http://sourceforge.net/projects/polybench/
[194] Roldan Pozo and Bruce Miller. 2004. SciMark 2.0. http://math.nist.gov/scimark2/
[195] Do Le Quoc, Martin Beck, Pramod Bhatotia, Ruichuan Chen, Christof Fetzer, and Thorsten Strufe. 2017. PrivApprox:
Privacy-Preserving Stream Analytics. In USENIX Annual Technical Conference (ATC). 659–672.
[196] Do Le Quoc, Ruichuan Chen, Pramod Bhatotia, Christof Fetzer, Volker Hilt, and Thorsten Strufe. 2017. StreamApprox:
Approximate Computing for Stream Analytics. In ACM/IFIP/USENIX International Middleware Conference. 185–197.
[197] Rengarajan Ragavan, Cedric Killian, and Olivier Sentieys. 2016. Adaptive Overclocking and Error Correction Based
on Dynamic Speculation Window. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 325–330.
[198] Arnab Raha and Vijay Raghunathan. 2017. Towards Full-System Energy-Accuracy Tradeoffs: A Case Study of an Approximate Smart Camera System. In Design Automation Conference (DAC). 1–6.
[199] Arnab Raha and Vijay Raghunathan. 2018. Approximating Beyond the Processor: Exploring Full-System Energy-
Accuracy Tradeoffs in a Smart Camera System. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 26, 12
(2018), 2884–2897.
[200] Arnab Raha, Swagath Venkataramani, Vijay Raghunathan, and Anand Raghunathan. 2015. Quality Configurable
Reduce-and-Rank for Energy Efficient Approximate Computing. In Design, Automation & Test in Europe (DATE).
665–670.
[201] Abbas Rahimi, Luca Benini, and Rajesh K. Gupta. 2013. Spatial Memoization: Concurrent Instruction Reuse to Correct
Timing Errors in SIMD Architectures. IEEE Transactions on Circuits and Systems II: Express Briefs 60, 12 (2013),
847–851.
[202] Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Adithya Parandhaman, and Anand Raghunathan. 2013.
Relax-and-Retime: A Methodology for Energy-Efficient Recovery Based Design. In Design Automation Conference
(DAC). 1–6.
[203] Ashish Ranjan, Arnab Raha, Vijay Raghunathan, and Anand Raghunathan. 2020. Approximate Memory Compression.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems 28, 4 (2020), 980–991.
[204] Ashish Ranjan, Arnab Raha, Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2014. ASLAN: Synthesis
of Approximate Sequential Circuits. In Design, Automation & Test in Europe (DATE). 1–6.
[205] Ashish Ranjan, Swagath Venkataramani, Xuanyao Fong, Kaushik Roy, and Anand Raghunathan. 2015. Approximate
Storage for Energy Efficient Spintronic Memories. In Design Automation Conference (DAC). 1–6.
[206] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. XNOR-Net: ImageNet Classification
Using Binary Convolutional Neural Networks. In European Conference on Computer Vision (ECCV). 525–542.
[207] Lakshminarayanan Renganarayana, Vijayalakshmi Srinivasan, Ravi Nair, and Daniel Prener. 2012. Programming
with Relaxed Synchronization. In ACM Workshop on Relaxing Synchronization for Multicore and Manycore Scalability
(RACES). 41–50.
[208] Pedro Reviriego, Shanshan Liu, Otmar Ertl, Farzad Niknia, and Fabrizio Lombardi. 2022. Computing the Similarity
Estimate Using Approximate Memory. IEEE Transactions on Emerging Topics in Computing 10, 3 (2022), 1593–1604.
[209] Martin C. Rinard. 2006. Probabilistic Accuracy Bounds for Fault-Tolerant Computations That Discard Tasks. In ACM
International Conference on Supercomputing (ICS). 324–334.
[210] Martin C. Rinard. 2007. Using Early Phase Termination to Eliminate Load Imbalances at Barrier Synchronization Points.
In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
(OOPSLA). 369–386.
[211] Cindy Rubio-González, Cuong Nguyen, Benjamin Mehne, Koushik Sen, James Demmel, William Kahan, Costin Iancu,
Wim Lavrijsen, David H. Bailey, and David Hough. 2016. Floating-Point Precision Tuning Using Blame Analysis. In
IEEE/ACM International Conference on Software Engineering (ICSE). 1074–1085.
[212] Cindy Rubio-González, Cuong Nguyen, Hong Diep Nguyen, James Demmel, William Kahan, Koushik Sen, David H.
Bailey, Costin Iancu, and David Hough. 2013. Precimonious: Tuning Assistant for Floating-Point Precision. In SC13:
International Conference on High Performance Computing, Networking, Storage and Analysis. 1–12.
[213] Hassaan Saadat, Haseeb Bokhari, and Sri Parameswaran. 2018. Minimally Biased Multipliers for Approximate Integer
and Floating-Point Multiplication. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37,
11 (2018), 2623–2635.
[214] Hassaan Saadat, Haris Javaid, Aleksandar Ignjatovic, and Sri Parameswaran. 2020. REALM: Reduced-Error Approxi-
mate Log-based Integer Multiplier. In Design, Automation & Test in Europe (DATE). 1366–1371.
[215] Hassaan Saadat, Haris Javaid, and Sri Parameswaran. 2019. Approximate Integer and Floating-Point Dividers with
Near-Zero Error Bias. In Design Automation Conference (DAC). 1–6.
[216] Farnaz Sabetzadeh, Mohammad Hossein Moaiyeri, and Mohammad Ahmadinejad. 2019. A Majority-Based Imprecise
Multiplier for Ultra-Efficient Approximate Image Multiplication. IEEE Transactions on Circuits and Systems I: Regular
Papers 66, 11 (2019), 4200–4208.
[217] Mastooreh Salajegheh, Yue Wang, Kevin Fu, Anxiao Jiang, and Erik Learned-Miller. 2011. Exploiting Half-Wits:
Smarter Storage for Low-Power Devices. In USENIX Conference on File and Storage Technologies (FAST). 1–14.
[218] Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-Based Ap-
proximation for Data Parallel Applications. In ACM International Conference on Architectural Support for Programming
Languages and Operating Systems (ASPLOS). 35–50.
[219] Mehrzad Samadi, Janghaeng Lee, D. Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. 2013. SAGE: Self-Tuning
Approximation for Graphics Engines. In IEEE/ACM International Symposium on Microarchitecture (MICRO). 13–24.
[220] Felipe Sampaio, Muhammad Shafique, Bruno Zatt, Sergio Bampi, and Jörg Henkel. 2015. Approximation-Aware
Multi-Level Cells STT-RAM Cache Architecture. In International Conference on Compilers, Architecture and Synthesis
for Embedded Systems (CASES). 79–88.
[221] Adrian Sampson, André Baixo, Benjamin Ransford, Thierry Moreau, Joshua Yip, Luis Ceze, and Mark Oskin. 2015. AC-
CEPT: A Programmer-Guided Compiler Framework for Practical Approximate Computing. University of Washington
Technical Report (2015), 1–14.
[222] Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. 2011.
EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI). 164–174.
[223] Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2014. Approximate Storage in Solid-State Memories.
ACM Transactions on Computer Systems 32, 3 (2014), 1–23.
[224] John Sartori and Rakesh Kumar. 2011. Architecting Processors to Allow Voltage/Reliability Tradeoffs. In International
Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). 115–124.
[225] Eric Schkufza, Rahul Sharma, and Alex Aiken. 2014. Stochastic Optimization of Floating-Point Programs with Tunable
Precision. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 53–64.
[226] Jeremy Schlachter, Vincent Camus, Krishna V. Palem, and Christian Enz. 2017. Design and Applications of Approximate
Circuits by Gate-Level Pruning. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 25, 5 (2017), 1694–
1702.
[227] Muhammad Shafique, Rehan Hafiz, Semeen Rehman, Walaa El-Harouni, and Jörg Henkel. 2016. Cross-Layer Approxi-
mate Computing: From Logic to Architectures. In Design Automation Conference (DAC). 1–6.
[228] Kan Shi, David Boland, and George A. Constantinides. 2013. Accuracy-Performance Tradeoffs on an FPGA through
Overclocking. In IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM). 29–36.
[229] Kan Shi, David Boland, Edward Stott, Samuel Bayliss, and George A. Constantinides. 2014. Datapath Synthesis for
Overclocking: Online Arithmetic for Latency-Accuracy Trade-offs. In Design Automation Conference (DAC). 1–6.
[230] Qingchuan Shi, Henry Hoffmann, and Omer Khan. 2015. A Cross-Layer Multicore Architecture to Tradeoff Program
Accuracy and Resilience Overheads. IEEE Computer Architecture Letters 14, 2 (2015), 85–89.
[231] Dongyeob Shin, Wonseok Choi, Jongsun Park, and Swaroop Ghosh. 2019. Sensitivity-Based Error Resilient Techniques
With Heterogeneous Multiply–Accumulate Unit for Voltage Scalable Deep Neural Network Accelerators. IEEE Journal
on Emerging and Selected Topics in Circuits and Systems 9, 3 (2019), 520–531.
[232] Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin C. Rinard. 2011. Managing Performance vs.
Accuracy Trade-Offs with Loop Perforation. In ACM SIGSOFT Symposium and European Conference on Foundations of
Software Engineering (FSE). 124–134.
[233] Lorenzo Sonnino, Shaswot Shresthamali, Yuan He, and Masaaki Kondo. 2023. DAISM: Digital Approximate In-SRAM
Multiplier-based Accelerator for DNN Training and Inference. arXiv preprint arXiv:2305.07376 (2023).
[234] Jacob Sorber, Alexander Kostadinov, Matthew Garber, Matthew Brennan, Mark D. Corner, and Emery D. Berger. 2007.
Eon: A Language and Runtime System for Perpetual Systems.
[235] The Open Source. 2023. History – RISC-V International. (2023). https://riscv.org/about/history/
[236] Ourania Spantidi, Georgios Zervakis, Iraklis Anagnostopoulos, Hussam Amrouch, and Jörg Henkel. 2021. Posi-
tive/Negative Approximate Multipliers for DNN Accelerators. In IEEE/ACM International Conference On Computer
Aided Design (ICCAD). 1–9.
[237] Jaswanth Sreeram and Santosh Pande. 2010. Exploiting Approximate Value Locality for Data Synchronization on
Multi-Core Processors. In IEEE International Symposium on Workload Characterization (IISWC). 1–10.
[238] Phillip Stanley-Marbell, Armin Alaghi, Michael Carbin, Eva Darulova, Lara Dolecek, Andreas Gerstlauer, Ghayoor
Gillani, Djordje Jevdjic, Thierry Moreau, Mattia Cacciotti, Alexandros Daglis, Natalie Enright Jerger, Babak Falsafi,
Sasa Misailovic, Adrian Sampson, and Damien Zufferey. 2020. Exploiting Errors for Efficiency: A Survey from Circuits
to Applications. Comput. Surveys 53, 3 (2020), 1–39.
[239] Greg Stitt and David Campbell. 2020. PANDORA: An Architecture-Independent Parallelizing Approximation-
Discovery Framework. ACM Transactions on Embedded Computing Systems 19, 5 (2020), 1–17.
[240] Antonio Giuseppe Maria Strollo, Ettore Napoli, Davide De Caro, Nicola Petra, and Gennaro Di Meo. 2020. Comparison
and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers. IEEE Transactions on
Circuits and Systems I: Regular Papers 67, 9 (2020), 3021–3034.
[241] Yang Sui, Miao Yin, Yi Xie, Huy Phan, Saman Aliari Zonouz, and Bo Yuan. 2021. CHIP: CHannel Independence-based
Pruning for Compact Neural Networks. Advances in Neural Information Processing Systems 34 (2021), 24604–24616.
[242] Mahdi Taheri, Mohammad Riazati, Mohammad Hasan Ahmadilivani, Maksim Jenihhin, Masoud Daneshtalab, Jaan
Raik, Mikael Sjodin, and Bjorn Lisper. 2023. DeepAxe: A Framework for Exploration of Approximation and Reliability
Trade-offs in DNN Accelerators. arXiv preprint arXiv:2303.08226 (2023), 1–8.
[243] Cheng Tan, Thannirmalai Somu Muthukaruppan, Tulika Mitra, and Ju Lei. 2015. Approximation-Aware Scheduling
on Heterogeneous Multi-Core Architectures. In Asia and South Pacific Design Automation Conference (ASP-DAC).
618–623.
[244] Chong Min John Tan and Mehul Motani. 2020. DropNet: Reducing Neural Network Complexity via Iterative Pruning.
In International Conference on Machine Learning (ICML). 9356–9366.
[245] Mingxing Tan and Quoc Le. 2019. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In
International Conference on Machine Learning (ICML). 6105–6114.
[246] Hokchhay Tann, Soheil Hashemi, R Iris Bahar, and Sherief Reda. 2016. Runtime Configurable Deep Neural Networks
for Energy-Accuracy Trade-off. In International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS). 1–10.
[247] Mohammad Taghi Teimoori, Muhammad Abdullah Hanif, Alireza Ejlali, and Muhammad Shafique. 2018. AdAM:
Adaptive Approximation Management for the Non-Volatile Memory Hierarchies. In Design, Automation & Test in
Europe Conference (DATE). 785–790.
[248] Ye Tian, Qian Zhang, Ting Wang, Feng Yuan, and Qiang Xu. 2015. ApproxMA: Approximate Memory Access for
Dynamic Precision Scaling. In Great Lakes Symposium on VLSI (GLSVLSI). 337–342.
[249] David Tolpin, Jan-Willem van de Meent, Hongseok Yang, and Frank Wood. 2016. Design and Implementation
of Probabilistic Programming Language Anglican. In Symposium on Implementation and Application of Functional
Programming Languages (IFL). 1–12.
[250] Hung-Wei Tseng, Laura M. Grupp, and Steven Swanson. 2013. Underpowering NAND Flash: Profits and Perils. In
Design Automation Conference (DAC). 1–6.
[251] Georgios Tziantzioulis, Nikos Hardavellas, and Simone Campanoni. 2018. Temporal Approximate Function Memoiza-
tion. IEEE Micro 38, 4 (2018), 60–70.
[252] Salim Ullah, Sanjeev Sripadraj Murthy, and Akash Kumar. 2018. SMApproxLib: Library of FPGA-based Approximate
Multipliers. In Design Automation Conference (DAC). 1–6.
[253] Shaghayegh Vahdat, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram. 2019. TOSAM: An Energy-Efficient
Truncation- and Rounding-Based Scalable Approximate Multiplier. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 27, 5 (2019), 1161–1173.
[254] Shaghayegh Vahdat, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram, and Zainalabedin Navabi. 2017. TruncApp:
A Truncation-Based Approximate Divider for Energy Efficient DSP Applications. In Design, Automation & Test in
Europe (DATE). 1635–1638.
[255] Zdenek Vasicek, Vojtech Mrazek, and Lukas Sekanina. 2019. Automated Circuit Approximation Method Driven by
Data Distribution. In Design, Automation & Test in Europe (DATE). 96–101.
[256] Zdenek Vasicek and Lukas Sekanina. 2015. Evolutionary Approach to Approximate Digital Circuits Design. IEEE
Transactions on Evolutionary Computation 19, 3 (2015), 432–444.
[257] Vassilis Vassiliadis, Konstantinos Parasyris, Charalambos Chalios, Christos D. Antonopoulos, Spyros Lalis, Nikolaos
Bellas, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2015. A Programming Model and Runtime System for
Significance-Aware Energy-Efficient Computing. In ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP). 275–276.
[258] Vassilis Vassiliadis, Jan Riehme, Jens Deussen, Konstantinos Parasyris, Christos D. Antonopoulos, Nikolaos Bellas,
Spyros Lalis, and Uwe Naumann. 2016. Towards Automatic Significance Analysis for Approximate Computing. In
IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 182–193.
[259] Suganthi Venkatachalam, Elizabeth Adams, Hyuk Jae Lee, and Seok-Bum Ko. 2019. Design and Analysis of Area and
Power Efficient Approximate Booth Multipliers. IEEE Trans. Comput. 68, 11 (2019), 1697–1703.
[260] Swagath Venkataramani et al. 2020. Efficient AI System Design With Cross-Layer Approximate Computing. Proc.
IEEE 108, 12 (2020), 2232–2250.
[261] Swagath Venkataramani, Victor Bahl, Xian-Sheng Hua, Jie Liu, Jin Li, Matthai Phillipose, Bodhi Priyantha, and
Mohammed Shoaib. 2015. SAPPHIRE: An Always-on Context-Aware Computer Vision System for Portable Devices.
In Design, Automation & Test in Europe (DATE). 1491–1496.
[262] Swagath Venkataramani, Srimat T. Chakradhar, Kaushik Roy, and Anand Raghunathan. 2015. Approximate Computing
and the Quest for Computing Efficiency. In Design Automation Conference (DAC). 1–6.
[263] Swagath Venkataramani, Anand Raghunathan, Jie Liu, and Mohammed Shoaib. 2015. Scalable-Effort Classifiers for
Energy-Efficient Machine Learning. In Design Automation Conference (DAC). 1–6.
[264] Swagath Venkataramani, Ashish Ranjan, Kaushik Roy, and Anand Raghunathan. 2014. AxNN: Energy-Efficient Neu-
romorphic Systems Using Approximate Computing. In ACM/IEEE International Symposium on Low Power Electronics
and Design (ISLPED). 27–32.
[265] Swagath Venkataramani, Kaushik Roy, and Anand Raghunathan. 2013. Substitute-and-Simplify: A Unified Design
Paradigm for Approximate and Quality Configurable Circuits. In Design, Automation & Test in Europe (DATE).
1367–1372.
[266] Swagath Venkataramani, Amit Sabne, Vivek Kozhikkottu, Kaushik Roy, and Anand Raghunathan. 2012. SALSA:
Systematic Logic Synthesis of Approximate Circuits. In Design Automation Conference (DAC). 796–801.
[267] Huan Wang, Can Qin, Yulun Zhang, and Yun Fu. 2020. Neural Pruning via Growing Regularization. arXiv preprint
arXiv:2012.09243 (2020), 1–16.
[268] Jing Wang, Xin Fu, Xu Wang, Shubo Liu, Lan Gao, and Weigong Zhang. 2020. Enabling Energy-Efficient and Reliable
Neural Network via Neuron-Level Voltage Scaling. IEEE Trans. Comput. 69, 10 (2020), 1460–1473.
[269] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. 2019. HAQ: Hardware-Aware Automated Quantization
with Mixed Precision. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8612–8620.
[270] Ying Wang, Jiachao Deng, Yuntan Fang, Huawei Li, and Xiaowei Li. 2017. Resilience-Aware Frequency Tuning for
Neural-Network-Based Approximate Computing Chips. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems 25, 10 (2017), 2736–2748.
[271] Haroon Waris, Chenghua Wang, and Weiqiang Liu. 2020. Hybrid Low Radix Encoding-Based Approximate Booth
Multipliers. IEEE Transactions on Circuits and Systems II: Express Briefs 67, 12 (2020), 3367–3371.
[272] Haroon Waris, Chenghua Wang, Weiqiang Liu, and Fabrizio Lombardi. 2021. AxBMs: Approximate Radix-8 Booth
Multipliers for High-Performance FPGA-Based Accelerators. IEEE Transactions on Circuits and Systems II: Express
Briefs 68, 5 (2021), 1566–1570.
[273] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning Structured Sparsity in Deep Neural
Networks. Advances in Neural Information Processing Systems 29 (2016), 1–10.
[274] Zhenyu Wen, Do Le Quoc, Pramod Bhatotia, Ruichuan Chen, and Myungjin Lee. 2018. ApproxIoT: Approximate
Analytics for Edge Computing. In IEEE International Conference on Distributed Computing Systems (ICDCS). 411–421.
[275] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2
Programs: Characterization and Methodological Considerations. In ACM/IEEE Annual International Symposium on
Computer Architecture (ISCA). 24–36.
[276] Qiang Xu, Todd Mytkowicz, and Nam Sung Kim. 2016. Approximate Computing: A Survey. IEEE Design & Test 33, 1
(2016), 8–22.
[277] Amir Yazdanbakhsh, Divya Mahajan, Hadi Esmaeilzadeh, and Pejman Lotfi-Kamran. 2017. AxBench: A Multiplatform
Benchmark Suite for Approximate Computing. IEEE Design & Test 34, 2 (2017), 60–68.
[278] Amir Yazdanbakhsh, Divya Mahajan, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh. 2016. AxBench: A Benchmark
Suite for Approximate Computing Across the System Stack. Technical Report. Georgia Institute of Technology.
[279] Amir Yazdanbakhsh, Divya Mahajan, Bradley Thwaites, Jongse Park, Anandhavel Nagendrakumar, Sindhuja Sethura-
man, Kartik Ramkrishnan, Nishanthi Ravindran, Rudra Jariwala, Abbas Rahimi, Hadi Esmaeilzadeh, and Kia Bazargan.
2015. Axilog: Language Support for Approximate Hardware Design. In Design, Automation & Test in Europe (DATE).
812–817.
[280] Amir Yazdanbakhsh, Jongse Park, Hardik Sharma, Pejman Lotfi-Kamran, and Hadi Esmaeilzadeh. 2015. Neural
Acceleration for GPU Throughput Processors. In IEEE/ACM International Symposium on Microarchitecture (MICRO).
482–493.
[281] Rong Ye, Ting Wang, Feng Yuan, Rakesh Kumar, and Qiang Xu. 2013. On Reconfiguration-Oriented Approximate
Adder Design and Its Application. In International Conference on Computer-Aided Design (ICCAD). 48–54.
[282] Serif Yesil, Ismail Akturk, and Ulya R. Karpuzcu. 2018. Toward Dynamic Precision Scaling. IEEE Micro 38, 4 (2018),
30–39.
[283] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel:
Customizing DNN Pruning to the Underlying Hardware Parallelism. ACM SIGARCH Computer Architecture News 45,
2 (2017), 548–560.
[284] Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2021. RED : Looking for Redundancies for
Data-Free Structured Compression of Deep Neural Networks. Advances in Neural Information Processing Systems 34
(2021), 20863–20873.
[285] Edouard Yvinec, Arnaud Dapogny, Matthieu Cord, and Kevin Bailly. 2022. RED++ : Data-Free Pruning of Deep Neural
Networks via Input Splitting and Output Merging. IEEE Transactions on Pattern Analysis and Machine Intelligence 45,
3 (2022), 3664–3676.
[286] Behzad Zeinali, Dimitrios Karsinos, and Farshad Moradi. 2018. Progressive Scaled STT-RAM for Approximate
Computing in Multimedia Applications. IEEE Transactions on Circuits and Systems II: Express Briefs 65, 7 (2018),
938–942.
[287] Reza Zendegani, Mehdi Kamal, Milad Bahadori, Ali Afzali-Kusha, and Massoud Pedram. 2017. RoBA Multiplier:
A Rounding-Based Approximate Multiplier for High-Speed yet Energy-Efficient Digital Signal Processing. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 25, 2 (2017), 393–401.
[288] Reza Zendegani, Mehdi Kamal, Arash Fayyazi, Ali Afzali-Kusha, Saeed Safari, and Massoud Pedram. 2016. SEERAD:
A High Speed yet Energy-Efficient Rounding-Based Approximate Divider. In Design, Automation & Test in Europe
(DATE). 1481–1484.
[289] Georgios Zervakis, Fotios Ntouskas, Sotirios Xydis, Dimitrios Soudris, and Kiamal Pekmestzi. 2018. VOSsim: A
Framework for Enabling Fast Voltage Overscaling Simulation for Approximate Computing Circuits. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems 26, 6 (2018), 1204–1208.
[290] Georgios Zervakis, Sotirios Xydis, Dimitrios Soudris, and Kiamal Pekmestzi. 2019. Multi-Level Approximate Acceler-
ator Synthesis Under Voltage Island Constraints. IEEE Transactions on Circuits and Systems II: Express Briefs 66, 4
(2019), 607–611.
[291] Guowei Zhang and Daniel Sanchez. 2018. Leveraging Hardware Caches for Memoization. IEEE Computer Architecture
Letters 17, 1 (2018), 59–63.
[292] Jeff Zhang, Kartheek Rangineni, Zahra Ghodsi, and Siddharth Garg. 2018. ThUnderVolt: Enabling Aggressive Voltage
Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators. In Design Automation
Conference (DAC). 1–6.
[293] Qian Zhang, Ting Wang, Ye Tian, Feng Yuan, and Qiang Xu. 2015. ApproxANN: An Approximate Computing
Framework for Artificial Neural Network. In Design, Automation & Test in Europe (DATE). 701–706.
[294] Xuhong Zhang, Jun Wang, and Jiangling Yin. 2016. Sapprox: Enabling Efficient and Accurate Approximations on
Sub-Datasets with Distribution-Aware Online Sampling. Proceedings of the VLDB Endowment 10, 3 (2016), 109–120.
[295] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training Low
Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv preprint arXiv:1606.06160 (2016), 1–13.
[296] Feiyu Zhu, Shaowei Zhen, Xilin Yi, Haoran Pei, Bowen Hou, and Yajuan He. 2022. Design of Approximate Radix-256
Booth Encoding for Error-Tolerant Computing. IEEE Transactions on Circuits and Systems II: Express Briefs 69, 4 (2022),
2286–2290.