FedOCR Arxiv2007.11462
1 Introduction
Text in scene images carries valuable semantic information for text reading and has long been one of the most popular research topics in academia and industry [8,1,32,20,15]. In practice, scene text recognition has been applied to various real-world scenarios, such as autonomous navigation, photo transcription, and scene understanding. With the development of deep learning and the emergence of public text datasets, significant progress has been made on scene text recognition in recent years.
However, most existing scene text recognition algorithms assume that a large-scale set of training images is easily accessible. As shown in Fig. 1(a), they may achieve sub-optimal performance and be unable to model the data
Fig. 1. An illustration of training scene text recognizers with (a) a single dataset, (b)
a centralized dataset from different devices, and (c) decentralized datasets distributed
on different local devices
2 Related Work
Scene text recognition has attracted great interest for a long time. According to Long et al. [18], representative methods can be roughly divided into two main streams, i.e., Connectionist Temporal Classification (CTC) based methods and attention-based methods. Generally, the CTC-based methods model scene text recognition as a sequence recognition task. For example, Shi et al. [29] combine a convolutional neural network (CNN) with a recurrent neural network (RNN) to extract sequence features from input images and decode the features with a CTC layer. Different from Shi et al. [29], Gao et al. [7] use stacked convolutional layers to extract contextual information from inputs without an RNN, showing the advantage of low computational cost. Meanwhile, attention-based methods extract features more effectively via the attention mechanism. For instance, Liu
In this section, we first introduce the pipeline of our federated scene text recogni-
tion framework. Then, we describe the details of local training and global aggre-
gation, which are the two main steps in federated learning. Finally, we elaborate
on how to improve communication efficiency and preserve data privacy in our
framework.
(Figure: the pipeline of FedOCR, with local clients and a global server; Step 4 denotes aggregation.)
(1) Before each round of local training, all participants start from the same parameters, which are initialized randomly in the first round and downloaded from the global server in subsequent rounds.
(2) Each participant trains the model on its own dataset for E_l epochs individually.
(3) All participants compute the parameter increments with respect to the parameters at the beginning of the round, and all parameter increments are sent to the global server.
(4) The global server aggregates all parameter increments by averaging and updates the global parameters accordingly, as sketched below. Before the next round of local training, the updated global parameters are downloaded for local model updating.
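The round structure in steps (1)-(4) can be summarized with a minimal sketch, assuming each participant's model parameters are flattened into a single NumPy vector; the participant object and its train_locally method are illustrative placeholders rather than the paper's API.

```python
import numpy as np

def federated_round(w_global, participants, local_epochs):
    """One communication round covering steps (1)-(4)."""
    increments = []
    for p in participants:
        w_local = w_global.copy()                          # (1) start from the global parameters
        w_local = p.train_locally(w_local, local_epochs)   # (2) E_l epochs of local training
        increments.append(w_local - w_global)              # (3) parameter increment
    # (4) aggregate the increments by averaging and update the global parameters
    return w_global + np.mean(increments, axis=0)
```

Only parameter increments travel between participants and the server; raw images never leave a participant's device.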
Local Training. In our FedOCR, each participant i and the global server maintain a set of local model parameters W_i and the global parameters W_global, respectively. Algorithm 1 describes the local training process of our framework. As shown, all participants first download the latest global parameters from the global server and overwrite their local parameters. Then, the participants train their local models on their own datasets independently for E_l epochs and send the parameter increments to the global server. During local training, no participant shares any image data with the others. To update the global parameters efficiently, all participants should train their models sufficiently before transmitting parameters. McMahan et al. [23] demonstrate that a sufficient number of local training epochs brings a dramatic increase in parameter-update efficiency. Detailed experiment settings of our FedOCR are provided in the next section.
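A minimal PyTorch-style sketch of this client-side step (overwrite the local parameters with the downloaded global ones, train for E_l epochs, and return the increment) is given below; the model, data loader, loss function, and the choice of the Adadelta optimizer are assumptions for illustration, not the paper's exact training recipe.

```python
import torch

def local_update(model, global_state, loader, loss_fn, num_epochs_El, lr=1.0):
    """Client-side update: local training for E_l epochs, returning the parameter increment."""
    model.load_state_dict(global_state)                 # overwrite local parameters with the global ones
    optimizer = torch.optim.Adadelta(model.parameters(), lr=lr)
    for _ in range(num_epochs_El):                      # E_l epochs on the participant's private data
        for images, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
    # increment with respect to the downloaded global parameters (sent to the server)
    return {k: model.state_dict()[k] - v for k, v in global_state.items()}
```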
Text Recognizer. Following the above methods, we can adapt any existing text recognition algorithm into a lightweight text recognizer. Specifically, in our experiments, we optimize a classical text recognizer, ASTER [30]. We replace the encoder in ASTER with ShuffleNetV2 [22] and apply the hashing technique to the entire set of model parameters. In particular, we do not compress the parameters of the batch normalization layers, since they account for only a small fraction of the parameters. Benefiting from the hashing technique and the lightweight network, we substantially reduce the communication cost of our federated learning framework.
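The weight sharing behind the hashing technique can be illustrated with a minimal NumPy sketch, in which each layer's virtual weight matrix V^l is backed by a much smaller real weight vector R^l through a fixed hash index; the seeded random index and the initialization below are illustrative simplifications of the hashing trick of Chen et al. [4], not the paper's exact implementation.

```python
import numpy as np

def build_hashed_layer(n_in, n_out, gamma, seed=0):
    """Create the shared real weight vector R^l and the fixed hash index Idx^l."""
    rng = np.random.default_rng(seed)
    k = max(1, int(gamma * n_in * n_out))            # size of R^l: gamma times the virtual weight count
    real = rng.standard_normal(k) * 0.01             # shared, trainable parameters (what gets transmitted)
    idx = rng.integers(0, k, size=(n_in, n_out))     # fixed hash index, reproducible from the seed
    return real, idx

def virtual_weights(real, idx):
    """Expand V^l, where V^l[i, j] = R^l[Idx^l[i, j]]."""
    return real[idx]
```

Because the index can be regenerated from a shared seed, only the real weight vector needs to be exchanged, which is roughly gamma times the size of the uncompressed weights.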
Moreover, we keep the network structure and experiment settings as consistent with ASTER as possible. We briefly introduce the scene text recognition method as follows. First, an input image is rectified by a rectification network before being fed into the recognition network. The rectification network, based on the Spatial Transformer Network (STN), aims to rectify perspective-distorted or curved text. Second, we use a lightweight neural network as the encoder to extract a feature sequence from the rectified image. Last, we use an attentional sequence-to-sequence model as the decoder to translate the feature sequence. During inference, we use a beam search algorithm that keeps the five candidates with the highest accumulated scores at every step.
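For the decoding step, a minimal sketch of beam search with a beam width of five is shown below; step_log_probs is an assumed callback that returns the decoder's per-character log-probabilities for the next position given a prefix, and eos_id marks the end-of-sequence symbol.

```python
def beam_search(step_log_probs, eos_id, max_len, beam_width=5):
    """Keep the `beam_width` candidates with the highest accumulated scores at every step."""
    beams = [([], 0.0)]                                    # (token sequence, accumulated log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos_id:                  # finished hypotheses are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in enumerate(step_log_probs(seq)):
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                                     # best-scoring hypothesis
```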
Based on the above equations, we can obtain any parameter’s gradient in the
real weight vector as follows:
\[
\frac{\partial L}{\partial R^{l}_{k}}
= \sum_{i}\sum_{j} \frac{\partial L}{\partial V^{l}_{i,j}} \cdot \frac{\partial V^{l}_{i,j}}{\partial R^{l}_{k}}
= \sum_{i}\sum_{j} g^{l}_{i,j} \cdot I\!\left(\mathrm{Idx}^{l}[i,j],\, k\right). \tag{3}
\]
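In implementation terms, Eq. (3) is a scatter-add of the virtual-weight gradients g^l_{i,j} into the shared real weight vector according to the hash index. A minimal NumPy sketch, reusing the illustrative grad_V (the gradient with respect to V^l) and idx from the sketch above:

```python
import numpy as np

def real_weight_grad(grad_V, idx, k):
    """Accumulate virtual-weight gradients into the real weight vector R^l, as in Eq. (3)."""
    grad_R = np.zeros(k)
    # grad_R[Idx[i, j]] += grad_V[i, j] for all (i, j); the hash index plays the role of the indicator
    np.add.at(grad_R, idx.reshape(-1), grad_V.reshape(-1))
    return grad_R
```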
4 Experiments
4.1 Experiment settings
Datasets. Two synthetic datasets [12,9] and six public real-world datasets
are used to train local models, and our models are evaluated on seven general
datasets. In our federated settings, we construct different local datasets with the
public real-world datasets. These datasets are briefly introduced as follows:
– Synth90k [12] contains 9 million images generated from a set of 90k English
words. Words are rendered onto natural images with random transformations
and effects.
– SynthText [9] contains 0.8 million images for end-to-end text detection and
recognition tasks. Therefore, we crop word images using the ground-truth
word bounding boxes.
– ICDAR 2003 (IC03) [19] contains 860 cropped word images for evaluation after discarding images that contain non-alphanumeric characters or have fewer than three characters, following [24]. For training, we use 1150 cropped images after filtering.
– ICDAR 2013 (IC13) [14], which inherits most images from IC03 and ex-
tends it with new images, contains 1015 cropped word images for evaluation
after filtering. For training, we use 848 cropped images after filtering.
– ICDAR 2015 (IC15) [13] contains images captured casually with Google Glass, many of which are severely distorted or blurred. For a fair comparison, we evaluate models on 1811 cropped word images after filtering. For training, we use 4426 cropped images after filtering.
– IIIT5K-Words (IIIT5K) [24] contains 3000 word images for evaluation and 2000 word images for training, most of which are horizontal text images.
– Street View Text (SVT) [35] is collected from Google Street View and contains 647 cropped word images, many of which are severely corrupted by noise, blur, or low resolution.
– Street View Text Perspective (SVTP) [26] is also collected from Google Street View and contains many perspective-distorted images; it provides 645 word images for evaluation.
– CUTE80 (CUTE) [28] contains 80 high-quality real-world curved text images. For evaluation, we crop 288 word images according to the ground-truth annotations.
– ArT [5] is a combination of Total-Text, SCUT-CTW1500, and Baidu Curved Scene Text, and contains images with arbitrary-shaped text. For training, we use 30271 word images after discarding images that contain non-alphanumeric characters or vertical text.
– COCO-Text [33] is based on the MS COCO dataset, which contains images of complex everyday scenes. For training, we use 31943 cropped images after discarding images that contain non-alphanumeric characters or vertical text.
Table 1. Parameter size and accuracy comparison between different models in our FedOCR. The accuracy is the average result over all testing datasets. The model size refers to the storage occupied on the hard disk. γ is the compression ratio of the hashing technique, and “γ = −” means that the hashing technique is not applied to the model. The reduction percentages of parameter size, model size, and accuracy compared with ASTER-FL are shown in parentheses respectively
Federated Learning for Scene Text Recognition. Table 2 shows the detailed results of ASTER-FL and the different FedOCR-Hashγ models on all testing datasets under three training manners. First, “single” training means that the model is trained with only one participant’s dataset. Second, “centralized” training means that the model is trained with a centralized set of image data. Third, “federated” training means that the model is trained with decentralized sets of image data in a federated manner. As shown in Table 2, the “federated” and “centralized” training results of all models are similar to each other and better than the “single” training results. In the “single” training manner, scene text recognition suffers from the practical problem that the training image data of a single participant is limited, which leads to poor performance. In contrast, in the “federated” training manner we successfully train a shared model collaboratively on decentralized sets of image data without exchanging or exposing any image data to other participants. As expected, our FedOCR achieves comparable results, which are very close to the results of the “centralized” training manner. Therefore, our
(Figure: accuracy (%) versus the total number of uploaded megabytes (MB) for FedOCR-Hash1, FedOCR-Hash1/2, FedOCR-Hash1/4, FedOCR-Hash1/8, and ASTER-FL.)
unstable data transmission network in the real world, our FedOCR has great
potential in practical application deployment.
References
1. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition
with embedded attributes. TPAMI 36(12), 2552–2566 (2014)
2. Bai, F., Cheng, Z., Niu, Y., Pu, S., Zhou, S.: Edit probability for scene text recog-
nition. In: CVPR (2018)
3. Bartz, C., Bethge, J., Yang, H., Meinel, C.: Kiss: Keeping it simple for scene text
recognition. arXiv preprint arXiv:1911.08400 (2019)
4. Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural
networks with the hashing trick. In: ICML (2015)
5. Chng, C.K., Liu, Y., Sun, Y., Ng, C.C., Luo, C., Ni, Z., Fang, C., Zhang, S., Han,
J., Ding, E., et al.: Icdar2019 robust reading challenge on arbitrary-shaped text
(rrc-art). arXiv preprint arXiv:1909.07145 (2019)
6. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit con-
fidence information and basic countermeasures. In: CCS (2015)
7. Gao, Y., Chen, Y., Wang, J., Tang, M., Lu, H.: Reading scene text with fully
convolutional sequence modeling. Neurocomputing 339, 161–170 (2019)
8. Goel, V., Mishra, A., Alahari, K., Jawahar, C.: Whole is greater than sum of parts:
Recognizing scene text words. In: ICDAR (2013)
9. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in nat-
ural images. In: CVPR (2016)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
11. Hitaj, B., Ateniese, G., Perez-Cruz, F.: Deep models under the gan: information
leakage from collaborative deep learning. In: CCS (2017)
12. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and
artificial neural networks for natural scene text recognition. arXiv preprint
arXiv:1406.2227 (2014)
13. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwa-
mura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: Icdar 2015
competition on robust reading. In: ICDAR (2015)
14. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R.,
Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: Icdar 2013 robust reading
competition. In: ICDAR (2013)
15. Li, H., Wang, P., Shen, C., Zhang, G.: Show, attend and read: A simple and strong
baseline for irregular text recognition. In: AAAI (2019)
16. Liu, W., Chen, C., Wong, K.Y.K.: Char-net: A character-aware neural network for
distorted scene text recognition. In: AAAI (2018)
17. Liu, Z., Li, Y., Ren, F., Goh, W.L., Yu, H.: Squeezedtext: A real-time scene text
recognition by binary convolutional encoder-decoder network. In: AAAI (2018)
18. Long, S., He, X., Yao, C.: Scene text detection and recognition: The deep learning
era. arXiv preprint arXiv:1811.04256 (2018)
19. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida,
K., Nagai, H., Okamoto, M., Yamamoto, H., et al.: Icdar 2003 robust reading
competitions: entries, results, and future directions. IJDAR 7(2-3), 105–122 (2005)
20. Luo, C., Jin, L., Sun, Z.: Moran: A multi-object rectified attention network for
scene text recognition. PR 90, 109–118 (2019)
21. Luo, J., Wu, X., Luo, Y., Huang, A., Huang, Y., Liu, Y., Yang, Q.: Real-world
image datasets for federated learning. arXiv preprint arXiv:1910.11089 (2019)
22. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: Shufflenet v2: Practical guidelines for
efficient cnn architecture design. In: ECCV. pp. 116–131 (2018)
23. McMahan, H.B., Moore, E., Ramage, D., Hampson, S., et al.: Communication-
efficient learning of deep networks from decentralized data. arXiv preprint
arXiv:1602.05629 (2016)
24. Mishra, A., Alahari, K., Jawahar, C.: Top-down and bottom-up cues for scene text
recognition. In: CVPR (2012)
25. Phong, L.T., Aono, Y., Hayashi, T., Wang, L., Moriai, S.: Privacy-preserving deep
learning via additively homomorphic encryption. TIFS 13(5), 1333–1345 (2018)
26. Quy Phan, T., Shivakumara, P., Tian, S., Lim Tan, C.: Recognizing text with
perspective distortion in natural scenes. In: ICCV (2013)
27. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., Pedarsani, R.: Fedpaq:
A communication-efficient federated learning method with periodic averaging and
quantization. arXiv preprint arXiv:1909.13014 (2019)
28. Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary
text detection system for natural scene images. Expert Systems with Applications
41(18), 8027–8048 (2014)
29. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. TPAMI 39(11),
2298–2304 (2016)
30. Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: Aster: An attentional scene
text recognizer with flexible rectification. TPAMI (2018)
31. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: CCS (2015)
32. Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network.
In: ACCV (2014)
33. Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: Coco-text: Dataset and
benchmark for text detection and recognition in natural images. arXiv preprint
arXiv:1601.07140 (2016)
34. Voigt, P., Von dem Bussche, A.: The eu general data protection regulation (gdpr).
A Practical Guide, 1st Ed., Cham: Springer International Publishing (2017)
35. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV
(2011)
36. Wei, K., Li, J., Ding, M., Ma, C., Yang, H.H., Farokhi, F., Jin, S., Quek, T.Q., Poor, H.V.: Federated learning with differential privacy: Algorithms and performance analysis. arXiv preprint arXiv:1911.00222 (2019)
37. Yang, M., Guan, Y., Liao, M., He, X., Bian, K., Bai, S., Yao, C., Bai, X.: Symmetry-
constrained rectification network for scene text recognition. In: ICCV (2019)
38. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and
applications. TIST 10(2), 12 (2019)
39. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701 (2012)
40. Zhan, F., Lu, S.: Esir: End-to-end scene text recognition via iterative image recti-
fication. In: CVPR (2019)
41. Zhan, F., Xue, C., Lu, S.: Ga-dan: Geometry-aware domain adaptation network
for scene text detection and recognition. In: ICCV (2019)
42. Zhang, Y., Nie, S., Liu, W., Xu, X., Zhang, D., Shen, H.T.: Sequence-to-sequence
domain adaptation network for robust text image recognition. In: CVPR (2019)
43. Zhu, W., Baust, M., Cheng, Y., Ourselin, S., Cardoso, M.J., Feng, A.: Privacy-
preserving federated brain tumour segmentation. In: Machine Learning in Medical
Imaging: 10th International Workshop (2019)