Image Classification On Resource-Constrained Microcontrollers

1st Seungtae Hong
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
and University of Science and Technology
Daejeon, Korea
sthong@etri.re.kr

2nd Gunju Park
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
Daejeon, Korea
parkgj@etri.re.kr

3rd Jeong-Si Kim
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
Daejeon, Korea
sikim00@etri.re.kr
Abstract—Recently, as IoT devices have become popular, research on performing deep learning on small devices such as microcontrollers has been attempted. Microcontrollers have very limited resources compared to edge devices such as mobile phones. Therefore, in order to perform deep learning-based image classification on a microcontroller, an optimization technique that considers the hardware constraints is required. To this end, in this paper, we present a method for lightweighting a model so that it can be executed on a microcontroller, and a process for deploying the lightweight model to the microcontroller. Finally, we confirmed that image classification can be performed on an actual microcontroller, the STM32F746G-Discovery.

Keywords—Microcontrollers, Deep Learning, Image Classification
II. RELATED WORKS

TensorFlow Lite for Microcontrollers (TFLM) is a sub-component of TensorFlow Lite for performing machine learning on resource-constrained microcontrollers. TFLM takes as input a model transformed through TensorFlow Lite. TensorFlow Lite converts the pre-trained model to the FlatBuffer [5] format. FlatBuffer is a cross-platform serialization format proposed by Google. A FlatBuffer can be accessed directly without separate parsing or unpacking, and has the advantage of being usable on various platforms without any dependencies.

Since the microcontroller does not have a file system, the model converted to a FlatBuffer is included in the source code in the form of a C array. The model is then compiled along with the other source code, built into a binary, and stored in the flash memory of the microcontroller.

TFLM runs inference on the microcontroller through an interpreter API provided in C/C++. For initialization and control of the microcontroller, the API of the BSP (Board Support Package) must be used separately from the interpreter API of TFLM.
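The paper does not reproduce its inference code, but with the TFLM interpreter API the control flow is roughly as follows; this is a minimal sketch in which the array name g_model, the arena size, and the registered operator list are placeholders, and the exact MicroInterpreter constructor arguments vary slightly between TFLM versions.

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// FlatBuffer model compiled into the firmware as a C array (hypothetical name).
extern const unsigned char g_model[];

// Scratch memory for activations; the size must be tuned for the actual model.
constexpr int kTensorArenaSize = 100 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

void RunInference() {
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the operators the model actually uses to keep the binary small.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  interpreter.AllocateTensors();   // carve activation memory out of the arena

  TfLiteTensor* input = interpreter.input(0);
  // ... copy the quantized (int8) image into input->data.int8 here ...

  interpreter.Invoke();            // run the model

  TfLiteTensor* output = interpreter.output(0);
  // ... read the class scores from output->data.int8 here ...
}
```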
TFLM natively takes as input models from TensorFlow Lite that have been converted from TensorFlow. However, various model lightweighting techniques such as pruning and quantization are currently being released based on PyTorch. Therefore, the proposed technique proceeds with PyTorch-based training, pruning, and quantization-aware training. The proposed method uses TinyNeuralNetwork [6] to convert a PyTorch model that has completed quantization-aware training into a TensorFlow Lite model.
In general, quantization techniques are divided into post-training quantization, which can be performed without retraining, and quantization-aware training, which performs quantization while training. In this paper, we use quantization-aware training to mitigate the loss of accuracy caused by model pruning.

To deploy the quantized model to the microcontroller, we need to convert the model with TensorFlow Lite. To this end, in this paper, we transform the trained model using Alibaba's TinyNeuralNetwork. Typically, ONNX (Open Neural Network Exchange) is used when converting models trained with PyTorch to TensorFlow Lite. However, while the model conversion process using ONNX is cumbersome, the model conversion process using TinyNeuralNetwork is relatively easy.
When inference is performed on a microcontroller, the pre-processing of the input image must be minimized to speed up inference. To this end, when converting the PyTorch model to a TensorFlow Lite model, the input and output of the neural network use quantized (signed int8) values. That is, quantization is applied to all inputs and outputs as well as to the hidden layers of the neural network. Through this, each pixel value of the image obtained from the camera module on the microcontroller can be used directly as a quantized input.
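As a hedged illustration (the helper function and its preprocessing assumptions are ours, not the paper's), mapping a pixel into the quantized input range uses the scale and zero point that quantization-aware training records in the input tensor:

```cpp
#include <cmath>
#include <cstdint>

#include "tensorflow/lite/c/common.h"

// Map a preprocessed pixel value to the model's signed int8 input range using the
// scale and zero point stored in the input tensor. If the input was quantized so
// that scale = 1/255 and zero_point = -128, a raw [0, 255] camera pixel maps to
// int8 by simply subtracting 128, which is what allows the pixel data to be fed
// in without floating-point pre-processing.
int8_t QuantizeValue(float value, const TfLiteTensor* input) {
  const float scale = input->params.scale;
  const int32_t zero_point = input->params.zero_point;

  int32_t q = static_cast<int32_t>(lroundf(value / scale)) + zero_point;
  if (q < -128) q = -128;   // clamp to the int8 range
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}
```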
To deploy a TensorFlow Lite model to a microcontroller, it must be converted into an array of hexadecimal values using the xxd command. Figure 2 shows an example of converting a TensorFlow Lite model to a hexadecimal array with the xxd command.
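Figure 2 is not reproduced here, but the conversion it illustrates is typically done with a command of the form `xxd -i model.tflite > model_data.cc` (file names are placeholders), which yields a source file along these lines:

```cpp
/* Sketch of the output of `xxd -i model.tflite` -- the byte values shown are
   placeholders, not actual model contents. In practice the array is usually
   declared const (and aligned) so the linker keeps it in flash rather than
   copying it into SRAM. */
unsigned char model_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, /* ... remaining model bytes ... */
};
unsigned int model_tflite_len = 123456;  /* total length in bytes (placeholder) */
```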
Apart from TFLM, the BSP (Board Support Package) must be used to initialize and control the microcontroller. The BSP provides user APIs that can directly control hardware features such as the CPU clock setting and sensor initialization. In addition, on the microcontroller, the inference speed can be optimized using ...
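For illustration, board bring-up with the ST BSP on the STM32F746G-Discovery looks roughly like the sketch below; SystemClock_Config() is the usual CubeMX-generated clock setup, and the exact BSP function names and constants depend on the BSP package version, so treat them as indicative rather than exact.

```cpp
#include "stm32f7xx_hal.h"
#include "stm32746g_discovery_lcd.h"   // header name depends on the BSP layout

extern "C" void SystemClock_Config(void);  // CubeMX-generated clock setup (216 MHz)

void BoardInit(void) {
  HAL_Init();                 // reset peripherals, initialize the HAL tick
  SystemClock_Config();       // configure the Cortex-M7 core clock

  BSP_LCD_Init();                                      // initialize the 4.3" LCD
  BSP_LCD_LayerDefaultInit(0, LCD_FB_START_ADDRESS);   // frame-buffer layer
  BSP_LCD_SelectLayer(0);
  BSP_LCD_Clear(LCD_COLOR_WHITE);

  // SD card / FatFs mounting and user-button setup for the test images
  // would follow here.
}

void ShowResult(const char* label) {
  // Print the predicted class label on the LCD screen.
  BSP_LCD_DisplayStringAt(0, 100, (uint8_t*)label, CENTER_MODE);
}
```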
Table I shows the hardware specifications of the STM32F746G-Discovery board used in the evaluation.

TABLE I. HW SPECIFICATIONS OF STM32F746G-DISCOVERY

Type              | Specification
------------------|-----------------------------------------------------------
CPU               | STM32F746NG (ARM Cortex-M7), single core, 216 MHz
SRAM              | 320 KB (user SRAM: 256 KB)
Flash memory      | 1 MB
Power             | 3.3 V or 5 V
Supported devices | 4.3" LCD display; micro SD card; on-board ST-LINK/V2-1 debugger/programmer

B. Evaluation Results

The STM32F746G-Discovery supports an LCD screen and an external micro SD card. In this performance evaluation, the Cifar-10 test dataset is saved on a micro SD card, and inference is performed by loading the saved image data. Then, the inference result is output on the LCD screen.

Figure 3 shows the screen of the image classifier running on the STM32F746G-Discovery. Ten images randomly extracted from the Cifar-10 test dataset are stored on the micro SD card. Inference is performed by sequentially loading images by pressing the user button below the LCD screen.

Fig. 3. Screen for performing inference on STM32F746G-Discovery

Table II shows how the FLOPs (floating-point operations) and the validation accuracy change across the model conversion process. We trained for 300 epochs on the server and obtained a top-1 accuracy of 92.26%. After training, pruning and fine-tuning were performed to reduce the weight of the model, and the FLOPs were reduced by 40.04%. On the other hand, the accuracy after pruning decreased to 87.16%. Therefore, in order to compensate for the accuracy lost through pruning, quantization is performed through quantization-aware training. When quantization was performed through quantization-aware training, it was confirmed that the accuracy improved by 1.33% compared to the model with pruning and fine-tuning.

TABLE II. TRENDS IN FLOPS AND ACCURACY BY MODEL CONVERSION PROCESS

Step                        | FLOPs                 | Validation accuracy (Top-1)
----------------------------|-----------------------|----------------------------
Training                    | 15,448,512            | 92.26 %
Pruning + fine-tuning       | 9,262,416 (-40.04 %)  | 87.16 % (-5.1 %)
Quantization-aware training | 9,262,416 (-40.04 %)  | 88.49 % (-3.77 %)
Table III shows the difference in SRAM and flash usage depending on whether TFLM is included or not. "Without TFLM" means that the binary running on the microcontroller contains only the basic BSP and the libraries for loading image files (e.g., FatFs, LibJPEG). If TFLM is included in the binary, the binary also includes the model for inference.

TABLE III. SRAM AND FLASH USAGE WITH AND WITHOUT TFLM

Type                                    | SRAM (max: 320 KB)  | Flash (max: 1 MB)
----------------------------------------|---------------------|--------------------
Without TFLM (including FatFs, LibJPEG) | 4.18 KB (1.3 %)     | 88.89 KB (8.68 %)
With TFLM (including model)             | 144.94 KB (45.29 %) | 685.28 KB (66.92 %)

In order to perform inference using TFLM, activation memory for each operation is required. The activation memory is allocated in the SRAM of the microcontroller. In contrast, the model converted to hexadecimal is allocated in flash memory. In addition, TFLM uses an interpreter approach to perform inference on various microcontrollers. Therefore, TFLM requires additional memory for the interpreter along with the activation memory.
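Concretely, the activation memory in TFLM is a caller-provided "tensor arena" placed in SRAM, while the model array itself can remain in flash; a minimal sketch follows, in which the arena size is an assumption that must be tuned per model rather than a value from the paper.

```cpp
#include <cstdint>

// The model array stays in flash when declared const; only activations use SRAM.
extern const unsigned char model_tflite[];      // hexadecimal model array (in flash)

constexpr int kTensorArenaSize = 140 * 1024;    // assumed size within the 320 KB SRAM
static uint8_t tensor_arena[kTensorArenaSize];  // activation memory handed to TFLM

// After interpreter.AllocateTensors() succeeds, interpreter.arena_used_bytes()
// reports how much of the arena was actually consumed, which helps shrink
// kTensorArenaSize toward the minimum the model needs.
```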
V. CONCLUSION

In this paper, we proposed an optimization technique for image classification using deep learning on a microcontroller. Microcontrollers have very limited resources compared to portable edge devices such as mobile phones. Therefore, in order to perform deep learning-based image classification on a microcontroller, the resource limitations of the microcontroller must be considered.

In this paper, we presented the model conversion process for deploying the trained model to the microcontroller. In addition, it was confirmed that image classification can be performed through the proposed optimization technique by performing inference on the STM32F746G-Discovery.

We plan to conduct research on performing image classification on larger input data such as the ImageNet dataset.

ACKNOWLEDGMENT

This research was supported by the Challengeable Future Defense Technology Research and Development Program through the Agency for Defense Development (ADD) funded by the Defense Acquisition Program Administration (DAPA) in 2022 (No. 915062201).

REFERENCES

[1] H. H. Bu, N. C. Kim, and S. H. Kim, "Content-based image retrieval using a fusion of global and local features," ETRI Journal, vol. 45, no. 3, 2023.
[2] S. Seo and H. Jung, "A robust collision prediction and detection method based on neural network for autonomous delivery robots," ETRI Journal, vol. 45, no. 2, 2023.
[3] J. Lin, W. M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, "MCUNet: Tiny deep learning on IoT devices," Advances in Neural Information Processing Systems (NeurIPS), 2020.
[4] STMicroelectronics, Discovery kit with STM32F746NG MCU, https://www.st.com/en/evaluation-tools/32f746gdiscovery.html
[5] Google, FlatBuffers, https://github.com/google/flatbuffers
[6] Alibaba, TinyNeuralNetwork, https://github.com/alibaba/TinyNeuralNetwork
[7] chenhang98, mobileNet-v2_cifar10, https://github.com/chenhang98/mobileNet-v2_cifar10
[8] F. Yu, C. Han, P. Wang, X. Huang, and L. Cui, "Gate trimming: One-shot channel pruning for efficient convolutional neural networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.