Image Classification On Resource-Constrained Microcontrollers

1st Seungtae Hong
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
and University of Science and Technology
Daejeon, Korea
sthong@etri.re.kr

2nd Gunju Park
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
Daejeon, Korea
parkgj@etri.re.kr

3rd Jeong-Si Kim
Intelligent Device & Simulation Research Section
Electronics and Telecommunications Research Institute
Daejeon, Korea
sikim00@etri.re.kr
Abstract—Recently, as IoT devices have become popular, research on performing deep learning on small devices such as microcontrollers has been attempted. Microcontrollers have very limited resources compared to edge devices such as mobile phones. Therefore, in order to perform deep learning-based image classification on a microcontroller, an optimization technique that considers the hardware constraints is required. To this end, in this paper, we present a method for lightweighting a model so that it can be executed on a microcontroller, and a process for deploying the lightweight model to the microcontroller. Finally, we confirmed that image classification can be performed on an actual microcontroller, the STM32F746G-Discovery.

Keywords—Microcontrollers, Deep Learning, Image Classification
II. RELATED WORKS

TensorFlow Lite for Microcontrollers (TFLM) is a sub-component of TensorFlow Lite for performing machine learning on resource-constrained microcontrollers. TFLM takes as input a model transformed through TensorFlow Lite. TensorFlow Lite converts the pre-trained model to the FlatBuffer [5] format. FlatBuffer is a cross-platform serialization format proposed by Google. A FlatBuffer can be accessed directly without separate parsing or unpacking, and has the advantage of being usable on various platforms without any dependencies.

Since the microcontroller does not have a file system, the model converted to a FlatBuffer is included in the source code in the form of a C array. The model is then compiled along with the other source code, built into a binary, and stored in the flash memory of the microcontroller.

TFLM runs inference on the microcontroller through an interpreter API provided in C/C++. For initialization and control of the microcontroller, the API of the BSP (Board Support Package) must be used separately from the interpreter API of TFLM.
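The paper does not reproduce its inference code, but with the TFLM interpreter API the control flow is roughly as follows; this is a minimal sketch in which the array name g_model, the arena size, and the registered operator list are placeholders, and the exact MicroInterpreter constructor arguments vary slightly between TFLM versions.

```cpp
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// FlatBuffer model compiled into the firmware as a C array (hypothetical name).
extern const unsigned char g_model[];

// Scratch memory for activations; the size must be tuned for the actual model.
constexpr int kTensorArenaSize = 100 * 1024;
static uint8_t tensor_arena[kTensorArenaSize];

void RunInference() {
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the operators the model actually uses to keep the binary small.
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddDepthwiseConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  interpreter.AllocateTensors();   // carve activation memory out of the arena

  TfLiteTensor* input = interpreter.input(0);
  // ... copy the quantized (int8) image into input->data.int8 here ...

  interpreter.Invoke();            // run the model

  TfLiteTensor* output = interpreter.output(0);
  // ... read the class scores from output->data.int8 here ...
}
```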
TFLM natively takes as input models from TensorFlow Lite that have been converted from TensorFlow. However, various model lightweighting techniques such as pruning and quantization are currently being released based on PyTorch. Therefore, the proposed technique proceeds with PyTorch-based training, pruning, and quantization-aware training. The proposed method uses TinyNeuralNetwork [6] to convert a PyTorch model that has completed quantization-aware training into a TensorFlow Lite model.
In general, quantization techniques are divided into post-training quantization, which can be performed without retraining, and quantization-aware training, which performs quantization while training. In this paper, we use quantization-aware training to mitigate the loss of accuracy caused by model pruning.

To deploy the quantized model to the microcontroller, we need to convert the model with TensorFlow Lite. To this end, in this paper, we transform the trained model using Alibaba's TinyNeuralNetwork. Typically, ONNX (Open Neural Network Exchange) is used when converting models trained with PyTorch to TensorFlow Lite. However, while the model conversion process using ONNX is cumbersome, the model conversion process using TinyNeuralNetwork is relatively easy.
When inference is performed on a microcontroller, the pre-processing of the input image must be minimized to speed up inference. To this end, when converting the PyTorch model to a TensorFlow Lite model, the input and output of the neural network use quantized (signed int8) values. That is, quantization is applied to all inputs and outputs as well as to the hidden layers of the neural network. Through this, each pixel value of the image obtained from the camera module on the microcontroller can be used directly as a quantized input.
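As a hedged illustration (the helper function and its preprocessing assumptions are ours, not the paper's), mapping a pixel into the quantized input range uses the scale and zero point that quantization-aware training records in the input tensor:

```cpp
#include <cmath>
#include <cstdint>

#include "tensorflow/lite/c/common.h"

// Map a preprocessed pixel value to the model's signed int8 input range using the
// scale and zero point stored in the input tensor. If the input was quantized so
// that scale = 1/255 and zero_point = -128, a raw [0, 255] camera pixel maps to
// int8 by simply subtracting 128, which is what allows the pixel data to be fed
// in without floating-point pre-processing.
int8_t QuantizeValue(float value, const TfLiteTensor* input) {
  const float scale = input->params.scale;
  const int32_t zero_point = input->params.zero_point;

  int32_t q = static_cast<int32_t>(lroundf(value / scale)) + zero_point;
  if (q < -128) q = -128;   // clamp to the int8 range
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}
```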
To deploy a TensorFlow Lite model to a microcontroller, it must be converted into an array of hexadecimal values using the xxd command. Figure 2 shows an example of converting a TensorFlow Lite model to a hexadecimal array with the xxd command.
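Figure 2 is not reproduced here, but the conversion it illustrates is typically done with a command of the form `xxd -i model.tflite > model_data.cc` (file names are placeholders), which yields a source file along these lines:

```cpp
/* Sketch of the output of `xxd -i model.tflite` -- the byte values shown are
   placeholders, not actual model contents. In practice the array is usually
   declared const (and aligned) so the linker keeps it in flash rather than
   copying it into SRAM. */
unsigned char model_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, /* ... remaining model bytes ... */
};
unsigned int model_tflite_len = 123456;  /* total length in bytes (placeholder) */
```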
Apart from TFLM, the BSP (Board Support Package) must be used to initialize and control the microcontroller. The BSP provides user APIs that can directly control hardware features such as the CPU clock setting and sensor initialization. In addition, on the microcontroller, the inference speed can be optimized using ...
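For illustration, board bring-up with the ST BSP on the STM32F746G-Discovery looks roughly like the sketch below; SystemClock_Config() is the usual CubeMX-generated clock setup, and the exact BSP function names and constants depend on the BSP package version, so treat them as indicative rather than exact.

```cpp
#include "stm32f7xx_hal.h"
#include "stm32746g_discovery_lcd.h"   // header name depends on the BSP layout

extern "C" void SystemClock_Config(void);  // CubeMX-generated clock setup (216 MHz)

void BoardInit(void) {
  HAL_Init();                 // reset peripherals, initialize the HAL tick
  SystemClock_Config();       // configure the Cortex-M7 core clock

  BSP_LCD_Init();                                      // initialize the 4.3" LCD
  BSP_LCD_LayerDefaultInit(0, LCD_FB_START_ADDRESS);   // frame-buffer layer
  BSP_LCD_SelectLayer(0);
  BSP_LCD_Clear(LCD_COLOR_WHITE);

  // SD card / FatFs mounting and user-button setup for the test images
  // would follow here.
}

void ShowResult(const char* label) {
  // Print the predicted class label on the LCD screen.
  BSP_LCD_DisplayStringAt(0, 100, (uint8_t*)label, CENTER_MODE);
}
```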
Table I shows the hardware specifications of the STM32F746G-Discovery board used in the evaluation.

TABLE I. HW SPECIFICATIONS OF STM32F746G-DISCOVERY

Type              | Specification
------------------|-----------------------------------------------------------
CPU               | STM32F746NG (ARM Cortex-M7), single core, 216 MHz
SRAM              | 320 KB (user SRAM: 256 KB)
Flash memory      | 1 MB
Power             | 3.3 V or 5 V
Supported devices | 4.3" LCD display; micro SD card; on-board ST-LINK/V2-1 debugger/programmer

B. Evaluation Results

The STM32F746G-Discovery supports an LCD screen and an external micro SD card. In this performance evaluation, the Cifar-10 test dataset is saved on a micro SD card, and inference is performed by loading the saved image data. Then, the inference result is output on the LCD screen.

Figure 3 shows the screen of the image classifier running on the STM32F746G-Discovery. Ten images randomly extracted from the Cifar-10 test dataset are stored on the micro SD card. Inference is performed by sequentially loading images by pressing the user button below the LCD screen.

Fig. 3. Screen for performing inference on STM32F746G-Discovery

Table II shows how the FLOPs (floating-point operations) and the validation accuracy change across the model conversion process. We trained for 300 epochs on the server and obtained a top-1 accuracy of 92.26%. After training, pruning and fine-tuning were performed to reduce the weight of the model, and the FLOPs were reduced by 40.04%. On the other hand, the accuracy after pruning decreased to 87.16%. Therefore, in order to compensate for the accuracy lost through pruning, quantization is performed through quantization-aware training. When quantization was performed through quantization-aware training, it was confirmed that the accuracy improved by 1.33% compared to the model with pruning and fine-tuning.

TABLE II. TRENDS IN FLOPS AND ACCURACY BY MODEL CONVERSION PROCESS

Step                        | FLOPs                 | Validation accuracy (Top-1)
----------------------------|-----------------------|----------------------------
Training                    | 15,448,512            | 92.26 %
Pruning + fine-tuning       | 9,262,416 (-40.04 %)  | 87.16 % (-5.1 %)
Quantization-aware training | 9,262,416 (-40.04 %)  | 88.49 % (-3.77 %)
Table III shows the difference in SRAM and flash usage depending on whether TFLM is included or not. "Without TFLM" means that the binary running on the microcontroller contains only the basic BSP and the libraries for loading image files (e.g., FatFs, LibJPEG). If TFLM is included in the binary, the binary also includes the model for inference.

TABLE III. SRAM AND FLASH USAGE WITH AND WITHOUT TFLM

Type                                    | SRAM (max: 320 KB)  | Flash (max: 1 MB)
----------------------------------------|---------------------|--------------------
Without TFLM (including FatFs, LibJPEG) | 4.18 KB (1.3 %)     | 88.89 KB (8.68 %)
With TFLM (including model)             | 144.94 KB (45.29 %) | 685.28 KB (66.92 %)

In order to perform inference using TFLM, activation memory for each operation is required. The activation memory is allocated in the SRAM of the microcontroller. In contrast, the model converted to hexadecimal is allocated in flash memory. In addition, TFLM uses an interpreter approach to perform inference on various microcontrollers. Therefore, TFLM requires additional memory for the interpreter along with the activation memory.
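Concretely, the activation memory in TFLM is a caller-provided "tensor arena" placed in SRAM, while the model array itself can remain in flash; a minimal sketch follows, in which the arena size is an assumption that must be tuned per model rather than a value from the paper.

```cpp
#include <cstdint>

// The model array stays in flash when declared const; only activations use SRAM.
extern const unsigned char model_tflite[];      // hexadecimal model array (in flash)

constexpr int kTensorArenaSize = 140 * 1024;    // assumed size within the 320 KB SRAM
static uint8_t tensor_arena[kTensorArenaSize];  // activation memory handed to TFLM

// After interpreter.AllocateTensors() succeeds, interpreter.arena_used_bytes()
// reports how much of the arena was actually consumed, which helps shrink
// kTensorArenaSize toward the minimum the model needs.
```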
V. CONCLUSION

In this paper, we proposed an optimization technique for image classification using deep learning on a microcontroller. Microcontrollers have very limited resources compared to portable edge devices such as mobile phones. Therefore, in order to perform deep learning-based image classification on a microcontroller, the resource limitations of the microcontroller must be considered.

In this paper, we presented the model conversion process for deploying the trained model to the microcontroller. In addition, it was confirmed that image classification can be performed through the proposed optimization technique by performing inference on the STM32F746G-Discovery.

We plan to conduct research on performing image classification on larger input data such as the ImageNet dataset.

ACKNOWLEDGMENT

This research was supported by the Challengeable Future Defense Technology Research and Development Program through the Agency for Defense Development (ADD) funded by the Defense Acquisition Program Administration (DAPA) in 2022 (No. 915062201).

REFERENCES

[1] H. H. Bu, N. C. Kim, and S. H. Kim, "Content-based image retrieval using a fusion of global and local features," ETRI Journal, vol. 45, no. 3, 2023.
[2] S. Seo and H. Jung, "A robust collision prediction and detection method based on neural network for autonomous delivery robots," ETRI Journal, vol. 45, no. 2, 2023.
[3] J. Lin, W. M. Chen, Y. Lin, J. Cohn, C. Gan, and S. Han, "MCUNet: Tiny deep learning on IoT devices," Advances in Neural Information Processing Systems (NeurIPS), 2020.
[4] STMicroelectronics, Discovery kit with STM32F746NG MCU, https://www.st.com/en/evaluation-tools/32f746gdiscovery.html
[5] Google, FlatBuffers, https://github.com/google/flatbuffers
[6] Alibaba, TinyNeuralNetwork, https://github.com/alibaba/TinyNeuralNetwork
[7] chenhang98, mobileNet-v2_cifar10, https://github.com/chenhang98/mobileNet-v2_cifar10
[8] F. Yu, C. Han, P. Wang, X. Huang, and L. Cui, "Gate trimming: One-shot channel pruning for efficient convolutional neural networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.