BiomedBench: A benchmark suite of TinyML biomedical applications for low-power wearables
Dimitrios Samakovlis¹, Stefano Albini¹, Rubén Rodríguez Álvarez¹, Denisa-Andreea Constantinescu¹, Pasquale Davide Schiavone¹, Miguel Peón-Quirós², David Atienza¹,²
¹ Embedded Systems Laboratory, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
² EcoCloud Center, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
arXiv:2406.03886v2 [cs.LG] 29 Aug 2024

Abstract—The design of low-power wearables for the biomedical domain has received a lot of attention in recent decades, as technological advances in chip manufacturing have allowed real-time monitoring of patients using low-complexity ML within the mW range. Despite advances in application and hardware design research, the domain lacks a systematic approach to hardware evaluation. In this work, we propose BiomedBench, a new benchmark suite composed of complete end-to-end TinyML biomedical applications for real-time monitoring of patients using wearable devices. Each application presents different requirements during typical signal acquisition and processing phases, including varying computational workloads and relations between active and idle times. Furthermore, our evaluation of five state-of-the-art low-power platforms in terms of energy efficiency shows that modern platforms cannot effectively target all types of biomedical applications. BiomedBench is released as an open-source suite to standardize hardware evaluation and guide hardware and application design in the TinyML wearable domain.

Index Terms—Benchmarking, TinyML, biomedical applications, wearable, low-power, signal processing.

I. INTRODUCTION

Wearable devices promise to improve preventive medicine through continuous health monitoring of chronic diseases. To this end, we face the challenge of increasing their computational capability and energy efficiency to implement advanced biomedical algorithms on the edge, increase patient privacy, and reduce response latency while enabling seamless operation with small batteries and sparse recharging cycles. This paper explicitly focuses on the TinyML wearable domain where battery-powered devices operate in the mW range to meet tight energy budgets while employing lightweight machine learning (ML) models.

To increase research efficiency in the TinyML wearable domain, it is vital to facilitate a seamless synergy between software and hardware development efforts. First, architectural design must be aligned with the characteristics of state-of-the-art (SoA) applications to meet real-time and energy constraints. Second, application developers must be aware of SoA algorithms and platforms in the domain to minimize software development and deployment time. However, in the TinyML wearable domain, we are missing in the SoA a set of representative end-to-end applications to guide the co-design process by standardizing hardware evaluation and unveiling the critical hardware and software design points.

In response to these needs, we propose BiomedBench, a biomedical benchmark suite composed of end-to-end TinyML applications aimed at low-power wearable devices. These applications feature diverse requirements during processing, idle, and signal acquisition that effectively represent the challenges in the domain. Furthermore, we demonstrate how to utilize BiomedBench to systematically evaluate and compare SoA platforms. To our knowledge, this is the first benchmark suite explicitly targeting the low-power wearable TinyML domain, offering a systematic approach to software and hardware co-design. The contribution of BiomedBench is twofold:
• It standardizes hardware evaluation in the TinyML wearable domain, offering a set of complete end-to-end biomedical applications, including the idle, acquisition, and processing phases. The variety of requirements present in the applications is representative of the multi-dimensional challenges in the domain.
• It provides guidelines for future hardware and application design in the TinyML wearable domain. Utilizing BiomedBench to compare SoA platforms unveils the critical design features that impact performance for hardware designers and hints at the deployment platform selection for application designers. Overall, open-sourcing SoA applications accelerates future application development efforts.

We organize this work as follows. In Section II, we identify the need for BiomedBench by comparing it with existing biomedical benchmark suites. In Section III, we propose a set of SoA wearable applications along with a systematic characterization of their key features. In Sections IV and V, we describe the setup of our experiments and analyze the results, respectively. Finally, we summarize the key findings of this work in Section VI.

II. RELATED WORK

BCIBench [1] is a benchmark suite targeting electroencephalogram (EEG)-based kernels and applications in the low-power wearables domain. Hermit [2] targets biomedical workloads in the Internet-of-Medical-Things (IoMT). Hermit includes three kernels for monitoring common medical conditions (sleep apnea, heart-rate variability, and blood pressure), kernels for offline diagnosis using image processing, and encryption and compression algorithms. Finally, ImpBench [3] targets devices in the implantable biomedical domain. ImpBench features two synthetic benchmark applications for motion detection and drug delivery system simulation, and
six lightweight kernels for data compression, encryption, and integrity.

BiomedBench differs from existing biomedical benchmark suites in two ways. First, BiomedBench integrates the idle and acquisition phases in complete end-to-end applications and highlights the impact of these phases in the application and platform design. Second, BiomedBench includes a wider variety of processing kernels stemming from a larger set of input signals, like electrocardiogram (ECG), surface electromyography (sEMG), and photoplethysmography (PPG), thus covering a wider spectrum of workloads, as shown in Section III-B.

III. APPLICATIONS

Fig. 1 shows a typical biosignal monitoring application pipeline. Sensors capture the biosignal and send it to the processing device for analysis. Typically, the processing step consists of signal preprocessing (e.g., filtering), feature extraction (e.g., time or frequency characteristics), and inference (i.e., an ML model). However, applications can exhibit a wide range of workloads and computational requirements. For example, feature extraction can be implemented explicitly (i.e., features designed manually) or implicitly (e.g., convolutional neural network (CNN)). Similarly, the inference step can use a lightweight machine learning method, such as a random forest, or a computationally intensive deep neural network (DNN).

Fig. 1: Typical modules of biomedical applications [5]: signal preprocessing, feature extraction, and inference (normal or pathology).

From an implementation point of view, the MCU interchanges among idle, acquisition, and processing, as presented in Fig. 2. Assuming an external analog-to-digital converter (ADC) with a buffer, the MCU collects the data through a communication protocol such as serial peripheral interface (SPI) upon buffer filling. The acquisition is typically served by the direct memory access (DMA). The MCU is in low-power modes during idle and acquisition (i.e., deep sleep for idle and light sleep for acquisition). The duration of the low-power modes can vary significantly between applications and can dominate the system's energy consumption. Considering the variety of workloads, input bandwidths, and idle-to-active ratios present in low-power wearable applications, a benchmark suite that covers a wide range of applications is an essential tool for hardware evaluation in the domain.

Fig. 2: Micro-controller unit (MCU) operating phases: deep sleep, acquisition, and processing.

A. Metrics for application characterization

We propose an application characterization by metrics. We have selected the minimum number of metrics that efficiently describe the computing profile, impact of the idle period, acquisition intensity, and memory requirements of the applications. This information is critical for identifying the challenges posed across the three phases of a complete application cycle. To this end, we have selected the following five metrics: main operations, duty cycle, input bandwidth, dynamic data, and static data.

1) Main operations: Identifies whether the dominant operations are branches, logical operations, fixed-point (FxP) or floating-point (FP) computations, among others. This metric hints at the microarchitectural design and is vital to interpreting the performance variations among different architectures during processing.

2) Duty cycle: Represents the ratio between CPU active cycles and total cycles. A low duty cycle means that the idle phase dominates the energy footprint. We use the following scale: "very low" (less than 0.1 %), "low" (between 0.1 % and 1 %), "medium" (between 1 % and 15 %), "high" (between 15 % and 60 %), and "very high" (above 60 %). Since this metric is platform-dependent, we calculate it running on an ARM Cortex-M4.

3) Input bandwidth: Measured in B/sec, is the product of the sensor's sampling rate, the size per sample, and the number of channels used. It specifies the intensity and energy impact of the signal acquisition phase.

4) Static data: Measured in KiB, it quantifies the memory required for the code and read-only data, such as pre-trained parameters. Hence, it defines the amount of memory retained during idle phases, which can be critical for idle consumption.

5) Dynamic data: Measured in KiB, defines how much memory the application requires during runtime for the stack and the heap. Hence, it specifies the minimum amount of RAM needed for a deployment platform.

B. SoA wearable applications

We have selected eight biomedical wearable applications that offer representative workloads and varied profiles for the processing, idle, and acquisition phases. The applications are complementary and enable evaluation of different architectural parts (e.g., sleep mode, digital signal processing). BiomedBench will be launched with eight applications but is open to future additions that present new challenges in any of the three phases.

All applications are coded and optimized in C or C++ to ensure effortless portability across all platforms. Considering that modern C/C++ toolchains are capable of applying heavy optimizations and fully exploiting the underlying microarchitecture (e.g., DSP ISA extensions, SIMD instructions), we consider this approach practical, consistent, and fair for comparison of SoA platforms.

Application      Main operations                       Duty cycle  Input bandwidth (B/sec)  Static data (KiB)  Dynamic data (KiB)
HeartBeatClass   Branches (FxP min/max search)         Low         1536                     25                 30
SeizureDetSVM    32-bit FxP multiplications/divisions  Very Low    128                      40                 40
SeizureDetCNN    16-bit FxP MAC                        High        11776                    350                120
CognWorkMon      32-bit FxP multiplications            Medium      4096                     90                 50
GestureClass     32-bit FP MAC                         Very High   192000                   50                 110
CoughDet         32-bit FP multiplications             Very High   64400                    568                160
EmotionClass     Branches (FP sorting)                 Low         822                      16                 4
Bio-BPfree       32-bit FP MAC                         -           -                        1300               2600

TABLE I: Benchmark applications - A characterization by metrics
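
Several entries in the Main operations column of Table I are 32-bit FxP multiplications, which the application descriptions characterize as 32-bit integer multiplications with a 64-bit intermediate result followed by a shift. A minimal sketch of that pattern follows, written in Python for brevity even though the benchmarks themselves are coded in C/C++. The Q16.16 format (16 fractional bits) and the helper names are assumptions chosen for illustration; the paper does not state which fixed-point format the applications use.

```python
FRAC_BITS = 16  # assumption: Q16.16, i.e. 16 integer bits and 16 fractional bits


def to_fxp(x):
    """Convert a real number to its fixed-point integer representation."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(a):
    """Convert a fixed-point integer back to a real number."""
    return a / (1 << FRAC_BITS)


def fxp_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits and may not fit
    in 32 bits, so on an MCU it would be held in a 64-bit integer. Python
    integers are unbounded, so the widening is implicit here. The right
    shift restores the original number of fractional bits.
    """
    wide = a * b
    return wide >> FRAC_BITS


a = to_fxp(1.5)
b = to_fxp(2.0)
product = fxp_mul(a, b)
print(to_float(product))
print((a * b).bit_length())  # number of bits needed by the intermediate product
```

In this example the intermediate product of 1.5 and 2.0 already needs more than 32 bits, which is why the 64-bit intermediate result matters even for small operands.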

Table I summarizes the main metrics of the applications, illustrating the broad spectrum of computational workloads, active-to-idle ratios, and acquisition and memory requirements covered by the benchmark suite. This variety of requirements is vital to a complete evaluation of low-power platforms, as illustrated in Section V. The benchmarks are explained below.

1) Heartbeat classifier (HeartBeatClass): Detects abnormal heartbeat patterns in real time for common heart diseases using the ECG signal [4]. The input signal is sampled by three different ECG leads at 256 Hz with 16-bit accuracy for 15 s. The input signal is processed through a morphological filter (MF), and the root-mean-square (RMS) combines the three signal sources before enhancing the signal through relative energy (Rel-En). In feature extraction, the relative-energy-based wearable R-peak detection (REWARD) algorithm detects the R peaks before delineating the other fiducial points of ECG. Finally, a neuro-fuzzy classifier using random projections (RP) of the fiducial points classifies the heartbeats as abnormal or not.

The application uses 16-bit fixed-point arithmetic. MF is the dominant kernel, accounting for more than 80 % of the execution time. The MF implementation involves a queue to perform dilation and erosion, translating into data movements and min/max search. We also include the multicore version of the application [5], having improved the parallelization strategy for the delineation and classification phases with dynamic task partition instead of static.

2) Seizure detector support vector machine (SeizureDetSVM): Works on ECG input and recognizes epileptic episodes in real time [6]. The ECG signal is sampled from a single lead at 64 Hz with 16-bit accuracy for 60 s. The preprocessing phase consists of a simple moving average (MAVG) subtraction. For feature extraction, the R-peak interval (RRI) and ECG-derived respiration (EDR) time series are calculated from the ECG. From RRI, heart-rate variability (HRV) features and Lorenz plot features are extracted. From EDR, the linear predictive coefficients and the power spectral density of different frequency bands are calculated. For the frequency feature extraction (FFE) of RRI and HRV, the Lomb-Scargle periodogram (PLOMB) algorithm, which involves a fast Fourier transform (FFT), is used. For inference, a support vector machine (SVM) uses all the extracted features to classify the patient's state.

PLOMB is the dominant kernel, accounting for more than 75 % of the execution time. Since the implementation is in 32-bit FxP arithmetic, the main operations are 32-bit integer multiplications with a 64-bit intermediate result followed by a shift. This application originally contained a self-aware mechanism to determine the number of features and the complexity of the SVM. For this benchmark, we only use the full pipeline to avoid variability among executions and test the most complete version. Finally, we designed a parallel version of this application since it features a high degree of parallelism.

3) Seizure detector convolutional neural network (SeizureDetCNN): Based on EEG data, detects epileptic seizure episodes in real time [7]. The signal is sampled from 23 leads at 256 Hz with a 16-bit accuracy for 4 s. This application does not feature any signal preprocessing or feature extraction kernels. Instead, the input is sent directly to the input layer of a fully-convolutional network (FCN). The proposed FCN architecture has three 1D convolutional layers, each including batch normalization, pooling, and ReLU layers, and two fully connected layers. Most computations are 16-bit FxP multiply-accumulate (MAC) operations due to convolution, as 90 % of the execution is spent in the convolutional layers. We implemented the FCN in C from scratch for single-core and multicore platforms.

4) Cognitive workload monitor (CognWorkMon): Is designed for real-time monitoring of the cognitive workload state of a subject [8] and is based on EEG input. The EEG signal is sampled by four leads at 256 Hz with 32-bit accuracy. The input signal is processed in 14 batches of 4 s for a total of 56 s. Preprocessing and feature extraction are executed 14× per channel before the classification phase is executed. Preprocessing involves blink removal (BLR) and a band-pass filter (BPF) through infinite impulse response (IIR) filters. Feature extraction contains time-domain features (e.g., skewness/kurtosis, Hjorth activity), frequency-domain features (e.g., power spectral density), and entropy features. A random forest (RF) uses these features to classify the stress condition of the subject.

The extraction of frequency features, which contains the FFT, is the most demanding computational kernel, accounting for more than 80 % of the total computation time. The main operations are 32-bit integer multiplications with a 64-bit intermediate result followed by a shift since we transformed the original application into an FxP implementation with a negligible accuracy drop.

5) Gesture classifier (GestureClass): Aims to classify hand gestures by inspecting signals captured by sEMG of the forearm [9]. The idea is to extract the motor unit action potential train (MUAPT) and identify the motor neuron activity patterns to classify the hand gesture. The signal is sampled from 16
channels at 4 kHz with 24-bit accuracy for only 0.2 s. This application has no signal preprocessing. The authors apply a blind source separation (BSS) method to the input signal, namely independent component analysis (ICA), and classify the gesture using an SVM or a multilayer perceptron (MLP). We use the MLP for the inference stage to boost the variability of the kernels under test.

GestureClass is implemented in 32-bit FP arithmetic, and the dominant workload is the ICA, which features matrix multiplications. Hence, the main operations are 32-bit FP MACs. We have included the original parallel implementation of this application and converted it to run on a single core to make it available for single-core platforms.

6) Cough detector (CoughDet): Is a novel application [10] using non-invasive chest-worn biosensors to count the number of cough episodes people experience per day, thus providing a quantifiable means of evaluating the efficacy of chronic cough treatment. The device records audio data, sampled at 16 kHz with 32-bit precision, as well as 3-axis accelerometer and 3-axis gyroscope signals from an inertial measurement unit (IMU), each sampled at 100 Hz with 16-bit precision. Biosignals are processed every 0.3 s.

Feature extraction includes computations in the time and frequency domain. Time-domain computations include the extraction of statistical values (such as zero crossing rate, root-mean-square, and kurtosis) of the IMU signals. An FFT is used to extract spectral statistics (including standard deviation and dominant frequency), power spectral density, and mel-frequency cepstral coefficients (MFCC) of the audio signal. Features extracted from audio and IMU signals are forwarded to an RF classifier that computes the probability of a cough event.

The MFCC constitutes the most intensive kernel that requires the iterative computation of FFT and transcendental functions (i.e., the logarithm in the discrete cosine transform (DCT)). The application is implemented in 32-bit FP arithmetic, and the main operations include FP multiplications. We coded this application from scratch.

7) Emotion classifier (EmotionClass): Classifies the fear status of patients to prevent gender-based violence [11] based on three physiological signals, namely Galvanic skin response (GSR), PPG, and skin temperature (ST). PPG is sampled at 200 Hz with 32-bit precision, GSR is sampled at 5 Hz with 32-bit precision, and ST is sampled at 1 Hz with 16-bit precision. The acquisition window lasts 10 s and is divided into 10 batches of partial inference before the final classification is performed based on the 10 partial classifications.

EmotionClass has no signal preprocessing step. Feature extraction includes the average (AVG) of the three input signals over 1 s before forwarding them to a k-nearest neighbors (KNN) classifier. The classifier computes the distances of the new 3D tuple from the training points that have already been labeled as fear or no fear. Using n training points, we select the √n closest training points by running √n steps of selection sort before classifying the new tuple based on the percentage of neighboring fear-labeled points. We use 685 training points, a tradeoff between accuracy and complexity [11].

EmotionClass uses both 16-bit and 32-bit FP arithmetic, as it uses different representations for the three different signals. The dominant kernel is the KNN inference, which includes the 32-bit FP calculation and sorting of the Euclidean distances in 3D. Sorting includes multiple minimum search iterations over the array of distances.

8) Biological back-propagation-free (Bio-BPfree): Is a neural network training scheme for resource-constrained devices that is ideal for on-device training scenarios where personalized samples remain private and can increase the model's robustness [12]. The main notion of Bio-BPfree is to perform per-layer training by maximizing the distance between the intermediate outputs of different classes and minimizing the distance between the intermediate outputs of the same class. Bio-BPfree avoids the prohibitive memory cost of backpropagation, thus opening possibilities for on-device training in low-power devices.

We used Bio-BPfree to fine-tune the DNN presented in [7] for seizure detection. The model was originally trained on the server using a leave-one-out-patient on the CHB-MIT database. Later, we retrain the model on the device with Bio-BPfree by exploiting the personalized samples acquired from the patient under test. The on-device training yields a significant improvement in the F1 score, up to 25%, thanks to the personalized samples available on the device while ensuring data privacy.

The implementation of Bio-BPfree is based on the computation of the gradients of the loss function with respect to the trainable parameters. We define a custom loss function per layer [12] and then compute the gradients using the chain rule to account for the intermediate layers (i.e., ReLU, batch normalization, max pooling). The main operations are 32-bit FP MACs because of the convolution in the forward passes and the vector-matrix multiplications involved in the chain rule of the gradient computation. There is no acquisition phase. We assume that four pre-recorded input samples are already stored in the FLASH and expect the retraining to occur during the device charging phase. One epoch is executed for benchmark purposes.

We have published all the code in a public GitHub repository,¹ which contains detailed information about each implementation. We have also developed a website² to enhance the readability and communication of results across the community.

¹ https://github.com/esl-epfl/biomedbench
² https://biomedbench.epfl.ch/

IV. EXPERIMENTAL SETUP

In this section, we show the deployment and evaluation process of BiomedBench on a set of representative SoA low-power boards.

A. Considered low-power boards

We target low-power platforms with low-end MCUs, with clock frequencies in the range of MHz, and RAM in the range of some hundreds of KiB, as the application characterization in Table I suggests. Typically, such platforms
feature simple processor architectures with in-order execution, no instruction-level parallelism, simple memory hierarchies, deep-sleep modes for long idle periods, and SPI support for signal acquisition. Such platforms often meet the application requirements in the domain [5], [6], [8], [9]. Finally, we explicitly target variability in processor architectures (i.e., ARM, RISC-V).

We have selected five popular commercial low-power boards featuring five different MCUs and four different processors for our experiments. The selected boards are: Nucleo-L4R5ZI³ from ST Microelectronics, Ambiq Apollo 3 Blue⁴, Raspberry Pi Pico⁵, Gapuino v1.1⁶ and GAP9EVK⁷ from GreenWaves Technologies. We summarize the architecture and storage specifications of each MCU in Table II.

³ https://www.st.com/en/evaluation-tools/nucleo-l4r5zi.html
⁴ https://ambiq.com/apollo3-blue/
⁵ https://www.raspberrypi.com/products/raspberry-pi-pico/
⁶ https://greenwaves-technologies.com/product/gapuino/
⁷ https://greenwaves-technologies.com/gap9-store/

Board              Manufacturer             MCU            Cores                  FPU  RAM (KiB)  FLASH (MB)
Raspberry Pi Pico  Raspberry                RP2040         2x ARM Cortex-M0+      No   264        2 (off-chip)
Nucleo L4R5        STMicroelectronics       STM32L4R5ZI    1x ARM Cortex-M4       Yes  640        2 (on-chip)
Ambiq Apollo 3     Ambiq                    Apollo 3 Blue  1x ARM Cortex-M4       Yes  384        1 (on-chip)
Gapuino            GreenWaves Technologies  GAP8           1x CV32E40P (FC)       No   512        2 (off-chip)
                                                           8x CV32E40P (Cluster)  Yes  64
GAP9EVK            GreenWaves Technologies  GAP9           1x CV32E40P (FC)       Yes  1564       2 (off-chip)
                                                           9x CV32E40P (Cluster)  Yes  128

TABLE II: Selected boards - Summary of basic features

B. Sensor emulation and sleep modes

For signal acquisition, we emulate the sensor and ADC functionality using an external board that artificially produces data. We assume an external ADC with 768 bytes of RAM buffer⁸ since, typically, MCUs do not feature embedded ADCs or feature ADCs with insufficient bit precision. We perform per-batch acquisition by transferring data to the MCU when the buffer is full. We employ an SPI acquisition scheme using the DMA while the core is in sleep mode. Moreover, we assume that the sensors have no processing ability and that all the computations take place in the MCU.

⁸ AD4130-8, https://www.analog.com/en/products/ad4130-8.html

During the idle period, we set the MCU to deep-sleep mode with RAM retention to preserve the data needed for the next processing cycle. The MCUs support a wake-up interrupt mechanism to switch to active mode when the data are ready.

For the processing phase, we select the lowest operating voltage that allows the processing frequency to meet the real-time constraints of each application. For the selected voltage, we configure the highest available frequency for maximum energy efficiency as validated experimentally.

C. Energy measurements

We use the evaluation boards provided by the manufacturers to measure the energy consumption of each MCU executing BiomedBench applications. We do not consider the energy of the sensor and ADC in our experiments, as it is common for all platforms. We have measured the energy consumption of all boards at the power supply entry point of the integrated circuit (IC)⁹ as we target a fair comparison across all platforms. However, we highlight that the reported energy numbers include the energy drawn by the input-to-core step-down voltage converter embodied in the integrated circuit. Future platform energy measurements must comply with this procedure for the results to be considered valid.

⁹ We measure the total board energy for Raspberry Pi Pico with all peripherals disabled, since no test points for the MCU are provided.

We have selected the Otii Arc provided by Qoitech, which samples at 4 kHz, to obtain an energy profile over time and extract the energy and execution time per phase. However, due to the limited ±10 µA precision of the Otii device, we used the Fluke 8846A multimeter to measure the average current of the Nucleo and Apollo boards in deep-sleep mode. This device can achieve a precision of 0.03 µA.

D. Portability and software support

We use the boards' toolchain and software development kit (SDK) to compile with -O3 and load the C/C++ program on each board. With this approach, we ensure that digital signal processing (DSP) extensions are exploited when present. For the runtime, we use the portable FreeRTOS API, which all boards support, for dynamic memory management. Finally, we utilize the hardware abstraction layer (HAL) of each board's SDK to program the SPI peripheral communication and to configure the power management unit (PMU) for the sleep modes.

V. EXPERIMENTAL RESULTS

In this section, we evaluate SoA platforms running BiomedBench in terms of energy efficiency and processing capability. We showcase that BiomedBench stresses different architectural aspects of the platforms, hence making it an effective hardware evaluation tool. Table III reports the processing cycles and energy per application and board.

Application     MCU            Processor        Cycles (M)  Idle (mJ)  Acq. (mJ)  Proc. (mJ)  Total (mJ)
HeartBeatClass  RP2040         Arm Cortex-M0+   11.6        29.647     3.519      6.532       39.698
HeartBeatClass  STM32L4R5ZI    Arm Cortex-M4    7.4         0.118      0.002      2.604       2.724
HeartBeatClass  Apollo 3 Blue  Arm Cortex-M4    7.4         0.073      0.061      0.226       0.360
HeartBeatClass  GAP8           CV32E40P GAP8    5.1         9.386      0.042      0.416       9.844
HeartBeatClass  GAP9           CV32E40P GAP9    5.1         9.833      0.154      0.411       10.398
SeizureDetSVM   RP2040         Arm Cortex-M0+   4.3         118.805    2.402      2.420       123.627
SeizureDetSVM   STM32L4R5ZI    Arm Cortex-M4    2.3         0.476      0.001      1.313       1.790
SeizureDetSVM   Apollo 3 Blue  Arm Cortex-M4    2.3         0.294      0.042      0.137       0.473
SeizureDetSVM   GAP8           CV32E40P GAP8    2.8         37.724     0.029      0.353       38.106
SeizureDetSVM   GAP9           CV32E40P GAP9    2.5         39.50      0.037      0.090       39.627
SeizureDetCNN   RP2040         Arm Cortex-M0+   283.0       3.514      7.528      167.87      178.912
SeizureDetCNN   STM32L4R5ZI    Arm Cortex-M4    240.0       0.015      0.004      112.049     112.068
SeizureDetCNN   Apollo 3 Blue  Arm Cortex-M4    240.0       0.010      0.494      18.262      18.766
SeizureDetCNN   GAP8           CV32E40P GAP8    160.0       0.464      0.090      31.987      32.541
SeizureDetCNN   GAP9           CV32E40P GAP9    160.0       2.234      0.117      5.101       7.452
CognWorkMon     RP2040         Arm Cortex-M0+   346.0       104.902    35.876     195.910     336.688
CognWorkMon     STM32L4R5ZI    Arm Cortex-M4    138.0       0.432      0.017      70.629      71.078
CognWorkMon     Apollo 3 Blue  Arm Cortex-M4    138.0       0.325      0.620      3.887       4.832
CognWorkMon     GAP8           CV32E40P GAP8    165.0       33.930     0.431      16.008      50.369
CognWorkMon     GAP9           CV32E40P GAP9    92.0        36.303     0.557      3.711       40.571
CoughDet        RP2040         Arm Cortex-M0+   149.7       0          2.972      87.543      90.515
CoughDet        STM32L4R5ZI    Arm Cortex-M4    9.9         0.002      0.001      6.649       6.652
CoughDet        Apollo 3 Blue  Arm Cortex-M4    9.9         0.001      0.198      0.444       0.642
CoughDet        GAP8           CV32E40P GAP8    -           -          -          -           -
CoughDet        GAP9           CV32E40P GAP9    9.1         0.081      0.046      0.352       0.479
GestureClass    RP2040         Arm Cortex-M0+   571.6       0          8.008      347.104     355.112
GestureClass    STM32L4R5ZI    Arm Cortex-M4    23.0        0          0.004      13.425      13.429
GestureClass    Apollo 3 Blue  Arm Cortex-M4    23.0        0          0.525      2.500       3.025
GestureClass    GAP8           CV32E40P GAP8    635.8       0          0.096      220.933     221.029
GestureClass    GAP9           CV32E40P GAP9    20.2        0.027      0.124      0.604       0.755
EmotionClass    RP2040         Arm Cortex-M0+   15.3        19.651     1.259      8.760       29.670
EmotionClass    STM32L4R5ZI    Arm Cortex-M4    2.5         0.002      0.001      1.462       1.465
EmotionClass    Apollo 3 Blue  Arm Cortex-M4    2.5         0.061      0.083      0.110       0.254
EmotionClass    GAP8           CV32E40P GAP8    14.3        6.224      0.015      1.052       7.291
EmotionClass    GAP9           CV32E40P GAP9    1.6         6.572      0.019      0.061       6.652
Bio-BPfree      RP2040         Arm Cortex-M0+   16758.0     -          -          9374.500    9374.500
Bio-BPfree      STM32L4R5ZI    Arm Cortex-M4    662.0       -          -          432.227     432.227
Bio-BPfree      Apollo 3 Blue  Arm Cortex-M4    662.0       -          -          32.222      32.222
Bio-BPfree      GAP8           CV32E40P GAP8    18450.0     -          -          1453.368    1453.368
Bio-BPfree      GAP9           CV32E40P GAP9    633.0       -          -          24.970      24.970

TABLE III: Energy breakdown and processing cycles per application.

A. Processing cycles

The amount of processing cycles required to execute the computational phase of the applications is a critical metric for evaluating wearable devices. Fewer processing cycles translate to shorter active phases for the MCU and higher energy efficiency. Analyzing in depth the processing performance discrepancies of the SoA platforms is the key to comprehending the exact microarchitectural challenges in the domain and, hence, facilitating domain-specific hardware design.

We summarize our observations stemming from Table III in four key points. First, CV32E40P GAP9 consistently
scores the highest in all applications. Second, the relative performance of CV32E40P GAP8 and Arm Cortex-M4 varies significantly depending on the application type. In some applications, Arm Cortex-M4 outperforms CV32E40P GAP8 and matches CV32E40P GAP9, while in other applications, CV32E40P GAP8 outperforms Arm Cortex-M4 and matches CV32E40P GAP9. Third, Arm Cortex-M0+ cannot handle computations as efficiently as the other processors. Finally, Arm Cortex-M0+ and CV32E40P GAP8 lack a floating-point unit (FPU) and suffer in floating-point (FP) applications.

B. Energy

Understanding why some platforms are more energy-efficient than others, and how the energy profile changes among different applications and their phases, is vital to improving low-power platform design. Table III reports the total energy and the energy per phase for each application and platform. Interestingly, the impact of each phase on the total energy footprint varies with the application.

1) Idle: Low-duty-cycle applications highlight the importance of a well-designed deep-sleep mode. SeizureDetSVM, featuring the lowest duty cycle of all applications, illustrates that STM32L4R5ZI and Apollo 3 Blue have excellent deep-sleep modes and dominate their rivals in total energy. Similar observations apply to HeartBeatClass and EmotionClass. In contrast, idle energy is much less impactful in high-duty-cycle applications such as SeizureDetCNN, GestureClass, and CoughDet.

2) Acquisition: Applications with a high input bandwidth highlight the need for an energy-efficient acquisition mode, i.e., a sleep mode that allows DMA operation. CoughDet and GestureClass feature the highest input bandwidth and stress this requirement the most. Apart from Apollo 3, all platforms employ the DMA and spend negligible energy during acquisition in CoughDet and GestureClass. The acquisition energy is very low for all other applications.

3) Processing: High-duty-cycle applications, like SeizureDetCNN, CoughDet, and GestureClass, necessitate an energy-efficient processing mode. Each application features a different computational workload that impacts the duration of the processing phase per platform. The platforms' relative processing efficiency fluctuates per application, depending on their processor's performance across different workloads.

In general, Apollo 3 and GAP9 have the lowest processing energy. However, GAP9 manages to outperform Apollo 3 in applications that it can complete in significantly fewer cycles (e.g., SeizureDetCNN, EmotionClass). GAP8 spends more processing energy than GAP9 even when it matches GAP9's processing cycles (e.g., HeartBeatClass, SeizureDetCNN). Despite featuring the same processor, STM32L4R5ZI consumes approximately an order of magnitude more processing energy than Apollo 3 Blue in all applications. RP2040 spends, on average, two orders of magnitude more processing energy than the most efficient MCU.

4) Total: The experimental results show that no single platform is the most energy-efficient for every benchmark. GAP9 is the most energy-efficient in computationally intensive, high-duty-cycle applications but features an uncompetitive deep-sleep mode. Apollo 3 and STM have the best deep-sleep modes and perform the best in low-duty-cycle applications. Therefore, total energy consumption varies significantly for each platform according to the characteristics of each application. For instance, in SeizureDetSVM, a low-duty-cycle application where idle-mode efficiency matters most, STM is 22× more energy-efficient than GAP9. In SeizureDetCNN, a high-duty-cycle application that rewards the processing efficiency of GAP9, STM consumes 23.5× more energy than GAP9. The selected applications feature diverse requirements and pose different challenges for low-power platforms, making BiomedBench a representative benchmark for evaluating architectural designs in the TinyML wearable domain.
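The per-phase reading of Table III can be made concrete with a short Python sketch. It is an illustration only and not part of the BiomedBench code base: the dictionary layout and the helper names `total_energy`, `phase_shares`, and `dominant_phase` are hypothetical, and the only data used are four rows copied from Table III. The sketch sums the idle, acquisition, and processing energies into a total and reports which phase dominates for each platform.

```python
# Illustrative sketch; names and layout are assumptions, not BiomedBench code.
# Energy per phase in mJ, copied from Table III as (Idle, Acq., Proc.).
TABLE_III = {
    ("SeizureDetSVM", "STM32L4R5ZI"): (0.476, 0.001, 1.313),
    ("SeizureDetSVM", "GAP9"): (39.50, 0.037, 0.090),
    ("SeizureDetCNN", "STM32L4R5ZI"): (0.015, 0.004, 112.049),
    ("SeizureDetCNN", "GAP9"): (2.234, 0.117, 5.101),
}
PHASES = ("idle", "acquisition", "processing")


def total_energy(app, mcu):
    """Total energy in mJ, computed as the sum of the three phase energies."""
    return sum(TABLE_III[(app, mcu)])


def phase_shares(app, mcu):
    """Fraction of the total energy spent in each phase."""
    total = total_energy(app, mcu)
    return {phase: e / total for phase, e in zip(PHASES, TABLE_III[(app, mcu)])}


def dominant_phase(app, mcu):
    """Name of the phase with the largest share of the total energy."""
    shares = phase_shares(app, mcu)
    return max(shares, key=shares.get)


if __name__ == "__main__":
    for app, mcu in TABLE_III:
        print(f"{app:14s} {mcu:12s} total={total_energy(app, mcu):8.3f} mJ "
              f"dominant={dominant_phase(app, mcu)}")
    ratio = (total_energy("SeizureDetSVM", "GAP9")
             / total_energy("SeizureDetSVM", "STM32L4R5ZI"))
    print(f"GAP9 / STM32L4R5ZI total energy in SeizureDetSVM: {ratio:.1f}x")
```

With these rows, the idle phase dominates GAP9's energy in SeizureDetSVM, while processing dominates on both platforms in SeizureDetCNN. The SeizureDetSVM ratio comes out at roughly 22, in line with the figure quoted in the text.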

VI. CONCLUSION

In this paper, we have introduced BiomedBench, a new biomedical benchmark suite aiming to systematize hardware evaluation in the TinyML wearable domain. BiomedBench features end-to-end applications with diverse computational pipelines, active-to-idle ratios, and acquisition profiles. BiomedBench, boosted by a systematic application characterization, unveils the key hardware challenges of deploying modern biomedical applications during idle, acquisition, and processing phases on wearable platforms. By evaluating BiomedBench's impact on energy and performance for five state-of-the-art (SoA) low-power platforms, we have shown that no single MCU can efficiently handle the varying challenges of different benchmarks. To this end, BiomedBench will be released as an open-source suite to systematize platform evaluation, accelerate hardware design, and enable further advances in bioengineering systems and TinyML application design.

REFERENCES

[1] Jafari, R., Dehzangi, O., Zong, C. & Nathan, V. BCIBench: A benchmarking suite for EEG-based brain-computer interface. Proceedings of the 11th Workshop on Optimizations for DSP and Embedded Systems. pp. 19-24 (2014), https://doi.org/10.1145/2568326.2568330
[2] Limaye, A. & Adegbija, T. HERMIT: A Benchmark Suite for the Internet of Medical Things. IEEE Internet Of Things Journal. 5, 4212-4222 (2018)
[3] Strydis, C., Kachris, C. & Gaydadjiev, G. ImpBench: A novel benchmark suite for biomedical, microelectronic implants. 2008 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation. pp. 82-91 (2008)
[4] Braojos, R., Ansaloni, G. & Atienza, D. A methodology for the embedded classification of heartbeats using random projections. DATE. pp. 899-904 (2013)
[5] De Giovanni, E., Montagna, F., Denkinger, B., Machetti, S., Peón-Quirós, M., Benatti, S., Rossi, D., Benini, L. & Atienza, D. Modular Design and Optimization of Biomedical Applications for Ultra-Low Power Heterogeneous Platforms. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 39, 3821-3832 (2020)
[6] Forooghifar, F., Aminifar, A. & Atienza Alonso, D. Self-Aware Wearable Systems in Epileptic Seizure Detection. DSD 2018. pp. 426-432 (2018)
[7] Gomez, C., Arbelaez, P., Navarrete, M., Alvarado-Rojas, C., Le Van Quyen, M. & Valderrama, M. Automatic seizure detection based on imaged-EEG signals through fully convolutional networks. Scientific Reports. 10 (2020,12)
[8] Zanetti, R., Arza, A., Aminifar, A. & Atienza, D. Real-Time EEG-Based Cognitive Workload Monitoring on Wearable Devices. IEEE Trans. Biomed. Eng. 69, 265-277 (2022)
[9] Orlandi, M., Zanghieri, M., Morinigo, V., Conti, F., Schiavone, D., Benini, L. & Benatti, S. sEMG Neural Spikes Reconstruction for Gesture Recognition on a Low-Power Multicore Processor. 2022 IEEE Biomedical Circuits And Systems Conference (BioCAS). pp. 704-708 (2022)
[10] Orlandic, L., Thevenot, J., Teijeiro, T. & Atienza, D. A Multimodal Dataset for Automatic Edge-AI Cough Detection. 2023 45th Annual International Conference Of The IEEE Engineering In Medicine & Biology Society (EMBC). pp. 1-7 (2023)
[11] Miranda Calero, J., Marino, R., Lanza-Gutierrez, J., Riesgo, T., Garcia-Valderas, M. & Lopez-Ongil, C. Embedded Emotion Recognition within Cyber-Physical Systems using Physiological Signals. DCIS. pp. 1-6 (2018)
[12] Baghersalimi, S., Amirshahi, A., Teijeiro, T., Aminifar, A. & Atienza, D. Layer-Wise Learning Framework for Efficient DNN Deployment in Biomedical Wearable Systems. 2023 IEEE 19th International Conference On Body Sensor Networks (BSN). pp. 1-4 (2023)

Dimitrios Samakovlis is a Ph.D. student in the Embedded Systems Laboratory (ESL) at EPFL (Lausanne, Switzerland). He received his Master's in Electrical and Computer Engineering (2021) from the University of Thessaly, Volos, Greece. His research focuses on characterizing and deploying biomedical wearable applications in low-power platforms, on-device training on resource-constrained devices, and strategies for efficient synergy between Edge devices and servers.

Stefano Albini is a Ph.D. student in the Embedded Systems Laboratory (ESL) at EPFL (Lausanne, Switzerland). He received his Master's degree in Computer Engineering from the University of Pavia, Italy, in 2022. His research focuses on deploying biomedical applications on wearable nodes and optimizing AI techniques for constrained edge devices.

Rubén Rodríguez Álvarez is pursuing his PhD in the Embedded Systems Laboratory at EPFL in Switzerland. His research interest is in SW-HW exploration and co-design using domain-specific accelerators for edge and cloud computing. He completed his master's in industrial and electronics engineering from the Technical University of Madrid in 2021, after receiving his bachelor's in industrial engineering from Carlos III University in Madrid in 2019. He actively contributes to the development of an open-source hardware platform called X-HEEP and a CGRA accelerator for the edge.

Denisa-Andreea Constantinescu is a senior PostDoc at the Embedded Systems Laboratory at EPFL in Switzerland, where she is leading research on sustainable computing technologies for urban digital twins and the development of energy-aware hardware accelerators for the SKA Observatory. She obtained her Ph.D. in Mechatronics in 2022 and her Master's in Computer Engineering in 2017 from UMA, Spain. She completed two research stays: at the EPFL Embedded Systems Laboratory (Lausanne, Switzerland) in 2022 and at the Northeastern University Computer Architecture Research Laboratory (Boston, USA) in 2018. In recognition of her work, she received the Intel oneAPI Innovator award in 2020 for her research on "Efficiency and Productivity for Decision-making on Mobile SoCs" and the SCIE-ZONTA Award from the Scientific Society of Informatics in Spain in 2021.

Pasquale Davide Schiavone is a PostDoc at the EPFL and Director of Engineering of the OpenHW Group. He obtained the Ph.D. title at the Integrated Systems Laboratory of ETH Zurich in the Digital Systems group in 2020 and the BSc. and MSc. from "Politecnico di Torino" in computer engineering in 2013 and 2016, respectively. His main activities are the RISC-V CPU design and low-power energy-efficient computer architectures for smart embedded systems and edge-computing devices.

Miguel Peón-Quirós received a Ph.D. in Computer Architecture from UCM, Spain, in 2015. He collaborated as a Marie Curie scholar with IMEC (Leuven, Belgium) and as a postdoctoral researcher with IMDEA Networks (Madrid, Spain) and the Embedded Systems Laboratory (ESL) at EPFL (Lausanne, Switzerland). He has participated in several H2020, SNSF, and industrial projects and is currently part of EcoCloud, EPFL. His research focuses on energy-efficient computing.

David Atienza (M'05-SM'13-F'16) is a Professor of electrical and computer engineering, heads the Embedded Systems Laboratory (ESL), and is the Scientific Director of the EcoCloud Center for Sustainable Computing at the École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. He received his Ph.D. in computer science and engineering from UCM, Spain, and IMEC, Belgium, in 2005. His research interests include system-level design methodologies for high-performance multi-processor system-on-chip (MPSoC) and low-power Internet-of-Things (IoT) systems, including new 2-D/3-D thermal-aware design for MPSoCs and many-core servers, and edge AI architectures for wearable systems and smart consumer devices. He is a co-author of more than 400 papers in peer-reviewed international journals and conferences, one book, and 14 patents in these fields. Dr. Atienza is an IEEE Fellow and an ACM Fellow.
