
FedOCR: Communication-Efficient Federated Learning for Scene Text Recognition

arXiv:2007.11462v1 [cs.CV] 22 Jul 2020

Wenqing Zhang1, Yang Qiu1, Song Bai1, Rui Zhang2, Xiaolin Wei2, and Xiang Bai1

1 Huazhong University of Science and Technology
{wenqingzhang,yqiu,xbai}@hust.edu.cn, songbai.site@gmail.com
2 Meituan-Dianping Group
{zhangrui36,weixiaolin02}@meituan.com

Abstract. While scene text recognition techniques have been widely used in commercial applications, data privacy has rarely been taken into account by this research community. Most existing algorithms have assumed a set of shared or centralized training data. In practice, however, data may be distributed on different local devices that cannot be centralized due to privacy restrictions. In this paper, we study how to use decentralized datasets to train a robust scene text recognizer while keeping the data on local devices. To the best of our knowledge, we propose the first framework leveraging federated learning for scene text recognition, trained collaboratively with decentralized datasets; hence we name it FedOCR. To make FedOCR suitable for deployment on end devices, we make two improvements: using lightweight models and hashing techniques. We argue that both are crucial to the communication efficiency of federated learning. Simulations on decentralized datasets show that the proposed FedOCR achieves results competitive with models trained on centralized data, with lower communication costs and stronger privacy preservation.

Keywords: Federated Learning, Scene Text Recognition

1 Introduction

Text in scene images contains valuable semantic information and has long been one of the most popular research topics in academia and industry [8,1,32,20,15]. In practice, scene text recognition has been applied to various real-world scenarios, such as autonomous navigation, photo transcription, and scene understanding. With the development of deep learning and the emergence of public text datasets, significant progress on scene text recognition has been made in recent years.
Fig. 1. An illustration of training scene text recognizers with (a) a single dataset, (b) a centralized dataset from different devices, and (c) decentralized datasets distributed on different local devices.

However, most existing scene text recognition algorithms assume that a large-scale set of training images is easily accessible. As shown in Fig. 1(a), they may achieve sub-optimal performance and fail to model data variations or diversity owing to the lack of sufficient images. To remedy this,


some works [3,40] merge different public datasets to build a more robust text recognizer, as illustrated in Fig. 1(b). However, centralizing data in this way is problematic in many real-world scenarios. For example, laws and regulations that strengthen data privacy, such as the General Data Protection Regulation (GDPR) [34], constrain the use of data stored on local devices. Besides, centralizing tremendous image data from different local devices incurs heavy communication loads. It is therefore intractable in practice to centralize large amounts of data for training scene text recognizers. Our solution, which works within the framework of federated learning, is illustrated in Fig. 1(c).
Federated Learning (FL), a concept first proposed by McMahan et al. [23], allows data owners to train a shared model collaboratively while keeping their data stored on different local devices. However, directly applying FL to scene text recognition faces two inevitable difficulties. First, most scene text recognition algorithms adopt a heavyweight backbone for the sake of better performance, which results in a heavy parameter-transmission burden during federated learning. Second, general federated learning frameworks incur the extra computational cost of a privacy-preserving module to handle privacy leakage caused by the honest-but-curious global server.
In this paper, to the best of our knowledge, we propose the first federated learning framework for scene text recognition, which we name FedOCR. In FedOCR (a schematic is given in Fig. 2), all participants train a shared model collaboratively without centralizing the training images. In this manner, datasets on different local devices have an indirect influence on the training of the global model, which leads to performance competitive with a model trained on a centralized set of data. To improve the communication efficiency between the global server and local clients, we emphasize two important aspects of FedOCR, i.e., lightweight models and hashing techniques. Moreover, benefiting from the hashing technique, we can prevent privacy leakage to the global server via a specific hashing function and random seeds, which saves the extra computational cost of a privacy-preserving module. As a consequence, FedOCR can readily be deployed in practical scene text recognition applications.
Compared with existing scene text recognition methods [20,15,3,40] without federated learning, the proposed framework has the following intriguing merits. First, FedOCR can make use of more abundant image data from different local devices. In particular, billions of end devices collect tremendous amounts of text-bearing image data that can benefit scene text recognition, so our framework has great potential in real-world applications of scene text reading. Second, by design, our framework offers a superior trade-off between parameter-transmission efficiency and performance. The proposed text recognizer has far fewer parameters than existing scene text recognition algorithms yet reaches comparable performance. Last, the hashing technique acts as an encryption-decryption mechanism, which provides stronger privacy preservation without extra computational cost.
In summary, the main contributions of this paper are three-fold.
– We reveal the problem of data privacy in scene text recognition, which has been largely overlooked by existing methods.
– We propose the first federated scene text recognition framework, called FedOCR, for training a recognizer with decentralized datasets distributed on different local devices.
– By incorporating lightweight backbones and hashing techniques, FedOCR is highly communication-efficient and privacy-preserving, making it suitable for deployment in privacy-sensitive applications and on edge devices.

2 Related Work
Scene text recognition has attracted great interest for a long time. According to Long et al. [18], representative methods can be roughly divided into two main streams, i.e., Connectionist Temporal Classification (CTC) based and attention-based methods. Generally, the CTC-based methods model scene text recognition
as a sequence recognition task. For example, Shi et al. [29] combine a convolutional neural network (CNN) with a recurrent neural network (RNN) to extract sequence features from input images, and decode the features with a CTC layer. Different from Shi et al. [29], Gao et al. [7] use stacked convolutional layers to extract contextual information from inputs without an RNN, and show advantages in low computational cost. Meanwhile, attention-based methods extract features more effectively via the attention mechanism. For instance, Liu et al. [17] propose a binary convolutional encoder-decoder network for real-time scene text recognition. Unlike other attention-based algorithms, Bai et al. [2] propose Edit Probability (EP) to handle the misalignment between the output sequence of probability distributions and the ground-truth sequence, which is caused by missing or superfluous characters in the output.
With the improvement of scene text recognition, researchers have started to focus on more difficult settings or scenarios, such as irregular text [37] and perspective distortion [16,40]. To improve irregular text recognition, Yang et al. [37] propose a symmetry-constrained rectification network that generates better rectification results than existing algorithms. Instead of using global rectification, Liu et al. [16] propose a character-aware neural network with a hierarchical attention mechanism, which adopts local transformations to rectify characters individually. Meanwhile, many works [12,9,15] exploit synthetic word images to remedy the insufficiency of training data.
Undoubtedly, large amounts of real-world data are needed in practical applications of these scene text recognition methods. However, tremendous image datasets are distributed on different local devices and cannot be centralized for sharing. To handle this problem, McMahan et al. [23] first propose the concept of Federated Learning (FL) to train deep networks collaboratively from decentralized data. Following McMahan et al. [23], many researchers have worked on improving federated learning with more efficient parameter transmission and stronger privacy preservation. To improve privacy security, Wei et al. [36] propose a federated learning framework based on differential privacy, in which artificial noise is added to the local parameters of participants before model aggregation. To improve communication efficiency, Reisizadeh et al. [27] propose a communication-efficient federated learning method with periodic averaging and quantization. Very recently, the computer vision community has started to pay attention to federated learning, giving rise to several pioneering works. For example, Luo et al. [21] implement object detection algorithms with federated learning and release a reliable benchmark framework. In the medical field, Zhu et al. [43] implement a privacy-preserving federated learning system with differential privacy for brain tumor segmentation. To the best of our knowledge, we propose the first federated scene text recognition framework, which is more communication-efficient and provides stronger privacy preservation.

3 Federated Scene Text Recognition Framework

In this section, we first introduce the pipeline of our federated scene text recognition framework. Then, we describe the details of local training and global aggregation, the two main steps in federated learning. Finally, we elaborate on how to improve communication efficiency and preserve data privacy in our framework.
Fig. 2. The pipeline of our federated scene text recognition framework: Step 1, global parameters downloading; Step 2, local training and update; Step 3, local parameters uploading; Step 4, aggregation on the global server.

3.1 Pipeline of FedOCR

According to Yang et al. [38], our framework is a kind of horizontal federated learning, where the datasets of different participants share the same feature space but differ in samples. Suppose we have C data owners with different sets of training images {D_1, ..., D_C}. We denote the accuracy of the text recognizer trained with the decentralized datasets {D_1, ..., D_C} as Acc_FED. Note that these decentralized datasets are not shared or transferred to other participants during the training procedures. We denote the accuracy of the text recognizer trained with the centralized dataset D = D_1 ∪ ... ∪ D_C as Acc_SUM. Basically, the objective of FedOCR is to minimize the difference between Acc_FED and Acc_SUM: a smaller difference means better performance of our federated learning for scene text recognition.
Fig. 2 illustrates the pipeline of our federated scene text recognition framework. There are C participants, each of which has a set of data containing cropped word images and their transcriptions, and a global server for aggregating local model parameters. We assume all participants agree in advance on the same network architecture and the same training objective but do not share their datasets. The whole learning process can be decomposed into four steps:

(1) Before each round of local training, all participants start from the same parameters, which are initialized randomly in the first round and downloaded from the global server in subsequent rounds.
(2) Each participant trains the model individually on its own dataset for E_l epochs.
(3) All participants compute parameter increments relative to the parameters at the start of the round, and all parameter increments are sent to the global server.
(4) The global server aggregates all parameter increments by averaging and updates the global parameters. Before the next round of local training, the updated global parameters are downloaded for local model updating.

Algorithm 1 Local Training

Input: Latest global parameters W_t^global in round t; local learning rates η^i, i ∈ [0, C−1]
Output: All local parameter increments {ΔW_t^i | i ∈ [0, C−1]}
1: for each i ∈ [0, C−1] do
2:    Overwrite local parameters: W_t^i = W_t^global
3: end for
4: for each local participant i ∈ {0, 1, ..., C−1} do
5:    for e ∈ [0, E_l − 1] do
6:       for step ∈ [0, step_max] do
7:          Sample a minibatch B_step
8:          Compute gradients: g_t^i = ∇Loss(B_step; W_t^i)
9:          Update local parameters: W_t^i = W_t^i − η^i · g_t^i
10:      end for
11:   end for
12:   Compute local parameter increments: ΔW_t^i = W_t^i − W_t^global
13: end for
14: Send {ΔW_t^i | i ∈ [0, C−1]} to the global server

Following this pipeline, our federated training continues until convergence.
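To make the four steps concrete, the sketch below walks through one communication round in plain Python. It is a minimal illustration under our assumptions: parameters are held as NumPy arrays keyed by name, and Participant.train is a hypothetical stand-in for the local training routine detailed in Algorithm 1.

import numpy as np

def federated_round(global_params, participants, local_epochs):
    """One federated round (Steps 1-4), as a minimal sketch."""
    increments = []
    for p in participants:
        # Step 1: download the latest global parameters.
        local_params = {k: v.copy() for k, v in global_params.items()}
        # Step 2: train locally for E_l epochs (hypothetical interface).
        local_params = p.train(local_params, epochs=local_epochs)
        # Step 3: upload only the parameter increments.
        increments.append({k: local_params[k] - global_params[k]
                           for k in global_params})
    # Step 4: the server averages the increments and updates the global model.
    for k in global_params:
        global_params[k] = global_params[k] + (
            sum(inc[k] for inc in increments) / len(increments))
    return global_params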

Local Training. In our FedOCR, each participant i maintains a set of local model parameters W^i, and the global server maintains the global parameters W^global. Algorithm 1 describes the local training process of our framework. As shown, all participants first download the latest global parameters from the global server and overwrite their local parameters. Then, participants train their local models on their own datasets independently for E_l epochs and send the parameter increments to the global server. During local training, no participant shares any image data with others. To update the global parameters efficiently, all participants should train their models sufficiently before parameter transmission: McMahan et al. [23] demonstrate that a sufficient number of local training epochs brings a dramatic increase in parameter-update efficiency. Detailed experiment settings of our FedOCR are provided in the next section.
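As a concrete illustration of Algorithm 1, the PyTorch sketch below performs one participant's local update. The names model, loader, and criterion are placeholders rather than the paper's released code; Adadelta matches the optimizer used in our experiments.

import torch

def local_training(model, global_state, loader, criterion, lr, local_epochs):
    # Overwrite local parameters with the global ones: W_t^i = W_t^global
    model.load_state_dict(global_state)
    optimizer = torch.optim.Adadelta(model.parameters(), lr=lr)
    for _ in range(local_epochs):              # E_l local epochs
        for images, targets in loader:         # sample a minibatch B_step
            loss = criterion(model(images), targets)
            optimizer.zero_grad()
            loss.backward()                    # g_t^i = gradient of the loss
            optimizer.step()                   # W_t^i <- W_t^i - eta^i * g_t^i
    # Local parameter increments: Delta W_t^i = W_t^i - W_t^global
    return {k: v - global_state[k] for k, v in model.state_dict().items()}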

Global Aggregation. To aggregate parameter increments from different local participants, McMahan et al. [23] propose a straightforward approach that averages the parameters of all local participants. Following the steps in Algorithm 2, we adapt this federated averaging method [23] to our federated scene text recognition framework: in the global aggregation step of FedOCR, we average all parameter increments and update the previous global parameters, which are then available for all participants to download.

Algorithm 2 Global Aggregation

Input: All local parameter increments {ΔW_t^i | i ∈ {0, 1, ..., C−1}} in round t; global parameters W_t^global
Output: Updated global parameters W_{t+1}^global for round t + 1
1: Compute global parameter increments: ΔW_t^global = (1/C) Σ_{i=0}^{C−1} ΔW_t^i
2: Update global parameters: W_{t+1}^global = W_t^global + ΔW_t^global
3: Send W_{t+1}^global to all participants
3.2 Communication Efficiency

Communication efficiency is an essential property of federated learning. For instance, if the size of one participant's model is one hundred megabytes, tens of gigabytes must be transmitted in each round when hundreds of clients participate in a federated learning framework. Under such circumstances, the sheer number of parameters incurs huge communication costs and becomes a training bottleneck. To reduce communication burdens, we replace the heavyweight backbone used for feature extraction in text recognizers, such as ResNet [10], with a lightweight neural network. To further decrease the parameter size, we extend a hashing technique [4] to compress the parameters of both the CNN and the RNN, which makes it applicable to any text recognizer. In this way, the text recognizer in our FedOCR has far fewer parameters than existing text recognition algorithms, which shows great potential for practical federated learning deployment.

Hashing Technique. In fact, any well-designed scene text recognition model can be applied in our federated learning framework. However, considering communication efficiency, a network with fewer parameters is more appropriate and practical. Therefore, we propose to compress model parameters with a hashing technique. Specifically, we compress network parameters in a weight-sharing manner, in which a random subset of the parameters in a layer share the same parameter. Following Algorithm 3, we compress nearly all parameters of a scene text recognition network, with a hyper-parameter γ controlling the compression ratio; the hashing technique can thus reduce the parameter size to a large extent. Note that ⌊e · γ⌋ in Algorithm 3 denotes the largest integer not greater than e · γ. Notably, the specific hashing function and the random seeds are shared among all local participants to keep the mapping between real weight vectors and virtual weight matrices identical across all local models.

Algorithm 3 Hashing Technique

Input: Compression ratio γ; hashing seeds {seed_l | l ∈ [0, L−1]}, where L is the number of network layers
Output: A compressed network
1: for each layer l in the network do
2:    Let T^l be the parameter size of the weight matrix W^l
3:    Generate a real weight vector R^l with parameter size T^l · γ
4:    Generate a random sort RS^l of the numbers 0 to T^l − 1 with a hashing function and the seed seed_l
5:    Generate an index vector: Idx^l = [⌊e · γ⌋ for e in RS^l]
6:    Reshape Idx^l to the shape of W^l
7:    Generate a virtual weight matrix: V^l = R^l[Idx^l]
8: end for
9: Initialize the text recognition network with R^l, l ∈ [0, L−1]
10: The actual parameter size of the compressed network is only γ · Σ_{l=0}^{L−1} T^l
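The NumPy sketch below illustrates one way to realize Algorithm 3 for a single layer. The function name and the use of a seeded permutation as the "hashing function" are our assumptions for illustration, not the released implementation.

import numpy as np

def compress_layer(weight_shape, gamma, seed):
    T = int(np.prod(weight_shape))               # parameter size T^l of W^l
    # Real weight vector R^l of size T^l * gamma (rounded up so every index is valid).
    real = 0.01 * np.random.randn(int(np.ceil(T * gamma))).astype(np.float32)
    rng = np.random.RandomState(seed)            # seeded "hashing function"
    rs = rng.permutation(T)                      # random sort RS^l of 0..T^l-1
    idx = np.floor(rs * gamma).astype(np.int64)  # Idx^l = [floor(e * gamma) for e in RS^l]
    idx = idx.reshape(weight_shape)              # reshape Idx^l to the shape of W^l
    virtual = real[idx]                          # virtual weight matrix V^l = R^l[Idx^l]
    return real, idx, virtual

Because the permutation is reproducible from seed_l, every participant derives exactly the same Idx^l, so their real weight vectors stay element-wise aligned for aggregation.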

Text Recognizer. Following the above methods, we can adapt any existing text recognition algorithm to construct a lightweight text recognizer. Specifically, in our experiments, we optimize a classical text recognizer, ASTER [30]. We replace the encoder in ASTER with ShuffleNetV2 [22] and apply the hashing technique to the entire set of model parameters, except for the batch normalization layers, which contain only a few parameters. Benefiting from hashing techniques and lightweight networks, we decrease the communication costs of our federated learning framework to a large extent.
Moreover, we keep the network structure and experiment settings the same as ASTER as much as possible. We briefly introduce the recognition method as follows. First, an input image is rectified by a rectification network before being sent into the recognition network; the rectification network, based on the Spatial Transformer Network (STN), aims to rectify perspective or curved text. Second, we use a lightweight neural network as the encoder to extract a feature sequence from the rectified image. Last, we use an attentional sequence-to-sequence model as the decoder to translate the feature sequence into characters. During inference, we use a beam search algorithm, holding the five candidates with the highest accumulative scores at every step.
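Schematically, the recognizer chains the three stages described above. The class below is a sketch under our assumptions: the component modules are placeholders for STN-based rectification, a ShuffleNetV2-style encoder, and an attentional decoder, not the released implementation.

import torch.nn as nn

class LightweightRecognizer(nn.Module):
    """Rectify -> encode -> decode, as sketched in the text."""
    def __init__(self, rectifier, encoder, decoder):
        super().__init__()
        self.rectifier = rectifier   # STN-based rectification network
        self.encoder = encoder       # lightweight (ShuffleNetV2-style) feature extractor
        self.decoder = decoder       # attentional sequence-to-sequence model

    def forward(self, images, targets=None):
        rectified = self.rectifier(images)     # undo perspective / curvature
        features = self.encoder(rectified)     # extract the feature sequence
        # At inference time the decoder would run beam search with width 5.
        return self.decoder(features, targets)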

Network Training. After neural network initialization, the mapping between real weight vectors and virtual weight matrices, as defined in Algorithm 3, is fixed. In the forward computation, it is the virtual weight matrices that participate in calculations with the input features. In the backward propagation, the gradients of all parameters in the real weight vectors are computed from the gradients of the virtual weight matrices' parameters.
Let $V^l_{i,j}$ denote the element in the $i$-th row and $j$-th column of the virtual weight matrix at layer $l$, and let $R^l_k$ denote the $k$-th element of the corresponding real weight vector. Assume that

$$\frac{\partial L}{\partial V^l_{i,j}} = g^l_{i,j}, \quad (1)$$

where $g^l_{i,j}$ is computed from the loss. Moreover,

$$\frac{\partial V^l_{i,j}}{\partial R^l_k} = I(\mathrm{Idx}^l[i,j],\, k), \quad \text{where } I(a,b) = \begin{cases} 1 & \text{if } a = b, \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$

Based on the above equations, we can obtain the gradient of any parameter in the real weight vector as follows:

$$\frac{\partial L}{\partial R^l_k} = \sum_i \sum_j \frac{\partial L}{\partial V^l_{i,j}} \cdot \frac{\partial V^l_{i,j}}{\partial R^l_k} = \sum_i \sum_j g^l_{i,j} \cdot I(\mathrm{Idx}^l[i,j],\, k). \quad (3)$$
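In an autograd framework, Eqs. (1)-(3) come essentially for free: indexing the real weight vector with Idx^l makes the backward pass scatter-add the virtual-weight gradients into the shared entries. The PyTorch toy below, our own illustration with arbitrary shapes, demonstrates this.

import torch

real = torch.randn(4, requires_grad=True)    # R^l, the real weight vector
idx = torch.tensor([[0, 1, 1], [2, 3, 0]])   # Idx^l, fixed after initialization
virtual = real[idx]                          # V^l = R^l[Idx^l]
x = torch.randn(3)
loss = (virtual @ x).sum()                   # any loss L built on V^l
loss.backward()
# real.grad[k] equals the sum of dL/dV[i, j] over all (i, j) with
# idx[i, j] == k, exactly Eq. (3).
print(real.grad)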

3.3 Privacy Preserving


Federated learning provides training procedures with a high level of security, but the global server still has chances to compromise data privacy, for example via model inversion [6] or GAN-based attacks [11]. Usually, local network parameters or their increments are sent to the global server in each communication round, which gives an honest-but-curious server a chance to spy on the local data. In recent works, Phong et al. [25] show that even a small portion of the gradients may reveal information about the training samples, and apply an additively homomorphic encryption scheme to their federated framework. Shokri et al. [31] propose to upload partial gradients with added noise to avoid information leakage, and apply differential privacy to the parameter updates for a higher level of security. However, these methods bring more computational cost or a dramatic decrease in accuracy because of the privacy-preserving module.
In our FedOCR, we adopt the hashing technique to compress the entire set of model parameters with a hashing function and random seeds, which are equivalent to an encryption-decryption module and its keys. For the parameter aggregation on the global server, we upload only the increments of the parameters in the real weight vectors, which cannot be used to reconstruct the complete network without the specific hashing function and the random seeds. All local participants share the same hashing function and random seeds, so the averaging operation in the global aggregation can be applied directly to these parameter increments. Therefore, the global server cannot compromise the private data, yet it can still finish its global aggregation task. In this way, we enhance privacy preservation in our FedOCR without introducing extra computational cost.
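The toy sketch below, our own illustration with hypothetical function names, shows this seed-as-key property: the server can average the uploaded increment vectors without the seeds, while only the participants, who hold the shared seed, can expand a real weight vector back into a virtual weight matrix.

import numpy as np

def server_aggregate(increments):
    # The server sees only flat increment vectors of R^l; no seed is needed.
    return np.mean(increments, axis=0)

def client_expand(real_vector, weight_shape, gamma, seed):
    # Only clients hold `seed`, so only they can rebuild V^l = R^l[Idx^l].
    T = int(np.prod(weight_shape))
    rng = np.random.RandomState(seed)
    idx = np.floor(rng.permutation(T) * gamma).astype(np.int64)
    return real_vector[idx.reshape(weight_shape)]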

4 Experiments
4.1 Experiment settings
Datasets. Two synthetic datasets [12,9] and six public real-world datasets
are used to train local models, and our models are evaluated on seven general
datasets. In our federated settings, we construct different local datasets with the
public real-world datasets. These datasets are briefly introduced as follows:
– Synth90k [12] contains 9 million images generated from a set of 90k English
words. Words are rendered onto natural images with random transformations
and effects.
– SynthText [9] contains 0.8 million images for end-to-end text detection and
recognition tasks. Therefore, we crop word images using the ground-truth
word bounding boxes.
– ICDAR 2003 (IC03) [19] contains 860 cropped word images for evaluation
after discarding images that contain non-alphanumeric characters or have
fewer than three characters, which follows [24]. For training, we use 1150
cropped images after filtering.
– ICDAR 2013 (IC13) [14], which inherits most of its images from IC03 and extends them with new images, contains 1015 cropped word images for evaluation after filtering. For training, we use 848 cropped images after filtering.
– ICDAR 2015 (IC15) [13] contains images captured casually with Google Glass, many of which are severely distorted or blurred. For a fair comparison, we evaluate models on 1811 cropped word images after filtering. For training, we use 4426 cropped images after filtering.
– IIIT5K-Words (IIIT5K) [24] contains 3000 word images collected for
evaluation and 2000 word images for training, which are mostly horizontal
text images.
– Street View Text (SVT) [35] is collected from the Google Street View,
and it contains 647 images of cropped words, many of which are severely
corrupted by noise, blur, or low resolution.
– Street View Text Perspective (SVTP) [26] is collected from Google Street View and contains many perspective-distorted images; 645 word images are used for evaluation.
– CUTE80 (CUTE) [28] contains 80 high-quality real-world curved text images. For evaluation, we crop 288 word images according to the ground-truth annotations.
– ArT [5] is a combination of Total-Text, SCUT-CTW1500, and Baidu Curved
Scene Text, which contains images with arbitrary-shaped texts. For train-
ing, we use 30271 word images after discarding images that contain non-
alphanumeric characters and vertical texts.
– COCO-Text [33] is based on the MS COCO dataset, which contains images
of complex everyday scenes. For training, we use 31943 cropped images af-
ter discarding images that contain non-alphanumeric characters and vertical
texts.

Decentralized Datasets for Federated Learning are constructed from the public real-world datasets in our experiment settings. We use the training word images from IC03 [19], IC13 [14], IC15 [13], IIIT5K [24], ArT [5], and COCO-Text [33]. As a result, we have 70638 real-world text images in total. To simulate the decentralized datasets distributed on local devices in federated learning, we split all the real-world text images randomly and uniformly into different sets of image data for the C participants. It should be mentioned that these different sets of image data are never shared or transferred to other participants during the training procedures.
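For reproducibility, the split can be implemented as below; this is a sketch of the uniform random partition described above, not the exact script we used.

import numpy as np

def split_uniformly(samples, num_participants, seed=0):
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(samples))              # shuffle once
    shards = np.array_split(order, num_participants)   # near-equal shard sizes
    return [[samples[i] for i in shard] for shard in shards]

# e.g., 70638 images for C = 5 participants gives shards of about 14128 images each.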
Federated Settings. Several hyper-parameters should be noted in our federated settings: C, the number of participants in our federated scene text recognition framework; γ, the compression ratio of the hashing technique; E_l, the number of epochs each local participant trains the model on its dataset before communicating with the global server; and B, the batch size in local training. In our experiments, we set C = 5, E_l = 3, B = 512, and γ ∈ {1/2, 1/4, 1/8}.

Baseline and FedOCR-Hash. In our experiments, we adopt ASTER [30] (using the implementation at https://github.com/ayumiymk/aster.pytorch) as the text recognition baseline in our FedOCR, denoted as ASTER-FL. Then, we replace the encoder in ASTER-FL with ShuffleNetV2 [22]; this variant of ASTER-FL in our FedOCR is denoted as FedOCR-Hash1. To further reduce the parameter size, we apply the hashing technique to compress FedOCR-Hash1 with different ratios γ ∈ {1/2, 1/4, 1/8}; these models are denoted as FedOCR-Hashγ in the remainder of the paper.

Implementation Details. Following the federated settings, we construct C = 5 participants in our FedOCR for the experiments. In local training, all models are trained via Adadelta [39] with an initial learning rate of 1.0, and each participant trains the scene text recognition model individually on its dataset for E_l = 3 epochs in each round. All word images are used directly without data augmentation. As for the complete federated training process of our FedOCR, each participant first trains its model on the two synthetic datasets for 4 rounds, then on its real-world dataset for 40 rounds.
The learning rate is decayed to 0.1 and 0.01 at the 5th and 30th rounds, respectively. Following Algorithm 2, in the global aggregation step, the global server aggregates the parameter increments from all participants by averaging. To simulate the communication procedure of federated learning, we replace the parameter transmission between participants and the global server with saving and restoring checkpoints on the hard disk.

Evaluation Metric. In our experiments, we use case-insensitive word accuracy for evaluation: a prediction is correct if it matches the ground truth after lowercasing both, and the recognition accuracy is the percentage of correct predictions over the total. Furthermore, the objective of FedOCR is to minimize the difference between the accuracy of the text recognizer trained with decentralized datasets and that trained with a centralized dataset; a smaller difference means better performance of our FedOCR.
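For clarity, the metric amounts to the following few lines (a sketch, assuming predictions and ground truths are given as parallel lists of strings).

def word_accuracy(predictions, ground_truths):
    # Case-insensitive word accuracy: lowercase both sides before comparison.
    correct = sum(p.lower() == g.lower()
                  for p, g in zip(predictions, ground_truths))
    return 100.0 * correct / len(ground_truths)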

4.2 Experiments on FedOCR


In this subsection, we first compare the parameter reduction and the accuracy decrease of the different models in our FedOCR. Then, we analyze the performance of our FedOCR compared with the other two training manners and show that our FedOCR achieves the objective of federated learning. Finally, we evaluate the two improvements in communication efficiency of our FedOCR.

Table 1. Parameter size and accuracy comparison between different models in our FedOCR. The accuracy is the average result over all testing datasets. The model size refers to the storage occupied on the hard disk. γ is the compression ratio of the hashing technique, and "γ = −" means the hashing technique is not applied to the model. The reduction percentages of parameter size, model size, and accuracy compared with ASTER-FL are shown in parentheses.

Models           Backbone      γ    Param. (M)        Model (MB)        Accuracy (%)
ASTER-FL         ResNet        −    20.99             80.52             91.94
FedOCR-Hash1     ShuffleNetV2  −    13.34 (↓36.45%)   51.37 (↓36.20%)   89.08 (↓3.11%)
FedOCR-Hash1/2   ShuffleNetV2  1/2  6.70 (↓68.08%)    26.05 (↓67.65%)   86.65 (↓5.75%)
FedOCR-Hash1/4   ShuffleNetV2  1/4  3.38 (↓83.90%)    13.38 (↓83.38%)   85.39 (↓7.12%)
FedOCR-Hash1/8   ShuffleNetV2  1/8  1.72 (↓91.81%)    7.05 (↓91.24%)    82.58 (↓10.18%)

Comparison of Parameter Size and Accuracy. Table 1 shows the parameter size and model size of the different models in our FedOCR, together with the average accuracy over all testing datasets. Compared with ASTER-FL, FedOCR-Hash1 reduces the parameter size by 36.45% with only a 3.11% decrease in accuracy. Among the different FedOCR-Hashγ models in our experiments, FedOCR-Hash1/4, with an appropriate compression ratio γ, achieves an 83.90% reduction in parameter size while dropping only 7.12% in accuracy. Thanks to the lightweight backbone and the hashing technique, the model sizes of the scene text recognizers in our FedOCR are reduced to a large extent, and these lightweight text recognizers still reach encouragingly comparable performance.

Federated Learning for Scene Text Recognition. Table 2 shows the detailed results on all testing datasets for ASTER-FL and the different FedOCR-Hashγ models under three training manners. First, "single" training means the model is trained with only one participant's dataset. Second, "centralized" training means the model is trained with a centralized set of image data. Third, "federated" training means the model is trained with decentralized sets of image data in a federated manner. As shown in Table 2, the "federated" and "centralized" results of all models are similar to each other and better than the "single" results. The "single" training manner reflects the practical problem that the image data available for training is limited, which causes poor recognition performance. In the "federated" training manner, however, we succeed in training a shared model collaboratively with decentralized sets of image data, without exchanging or exposing any image data to other participants. As expected, our FedOCR achieves comparable results, which are
very close to the results of the "centralized" training manner. Therefore, our FedOCR is effective for training a more robust model without centralizing the datasets on different local devices.

Table 2. Recognition accuracy in different training manners. "single": the model is trained with only one participant's dataset; "centralized": the model is trained with a centralized set of image data; "federated": the global model is trained with decentralized sets of image data in a federated manner. The detailed structures of the different FedOCR-Hash models are shown in Table 1.

Models           Training     IIIT5K  SVT   IC03  IC13  IC15  SVTP  CUTE
ASTER-FL         single       93.7    89.0  93.7  93.8  80.6  82.3  85.4
                 centralized  95.0    91.7  95.3  94.6  82.2  83.3  91.7
                 federated    95.0    90.7  94.8  94.0  82.0  82.3  91.0
FedOCR-Hash1     single       90.8    83.0  90.9  89.4  77.3  77.5  82.6
                 centralized  93.1    86.4  92.5  92.2  79.7  80.6  86.8
                 federated    92.9    86.9  92.0  91.7  79.4  80.8  86.5
FedOCR-Hash1/2   single       89.2    83.0  90.2  88.5  75.3  73.8  77.8
                 centralized  91.6    83.6  91.0  90.3  77.9  75.5  82.3
                 federated    91.2    84.2  91.6  90.7  77.5  76.0  82.6
FedOCR-Hash1/4   single       87.2    79.1  87.1  86.1  73.4  71.5  77.4
                 centralized  89.4    81.6  89.5  88.8  75.9  74.3  81.6
                 federated    89.0    81.8  89.3  89.2  76.3  75.2  81.6
FedOCR-Hash1/8   single       83.5    74.8  84.8  81.4  70.2  71.2  73.3
                 centralized  86.7    78.8  86.7  86.0  72.3  71.6  79.5
                 federated    86.6    80.1  87.1  85.4  72.4  71.6  79.9

Communication Efficiency Improvement. In Table 2, FedOCR-Hash1 shows accuracy comparable to ASTER-FL in the "federated" training manner. Owing to its lightweight backbone, FedOCR-Hash1 has fewer parameters than ASTER-FL, which benefits communication efficiency in federated learning. As shown in Fig. 3, FedOCR-Hash1 reaches a higher accuracy than ASTER-FL when only a small amount of communication bytes has been uploaded.
Fig. 3 illustrates the accuracy curves on IIIT5K of the different models versus the number of uploaded bytes during federated training. FedOCR-Hashγ with a smaller compression ratio γ achieves higher accuracy when limited communication bytes are uploaded, and thus shows greater advantages in communication efficiency. The advantage of our FedOCR-Hashγ becomes more distinctive as more local clients participate in FedOCR. Considering Tables 1 and 2 together, FedOCR-Hash1/4, with an appropriate compression ratio γ, achieves a strong overall trade-off between the communication efficiency and accuracy of federated learning: only 13.38 megabytes need to be transmitted by each participant, which results in faster parameter transmission under the same communication bandwidth.
Benefiting from lightweight models and hashing techniques, our federated scene text recognition framework shows comparable performance and clear advantages in communication efficiency. Considering the large number of participants and the unstable data transmission networks in the real world, our FedOCR has great potential for practical deployment.

Fig. 3. Accuracy on IIIT5K versus the number of uploaded megabytes for the different models with limited transmitted bytes in federated learning.

5 Conclusion and Future Work


In this paper, we reveal the problem of data privacy in scene text recognition and address the difficulty of utilizing decentralized datasets distributed on local devices with federated learning. To the best of our knowledge, we propose the first federated scene text recognition framework, named FedOCR. With FedOCR, we succeed in training a shared text recognizer collaboratively with decentralized datasets while avoiding violations of data privacy rules. Benefiting from lightweight models and hashing techniques, we reduce communication costs to a large extent and provide stronger privacy preservation against the honest-but-curious global server. In terms of taking advantage of the tremendous amount of decentralized real-world data available in practice, our communication-efficient federated learning framework for scene text recognition shows intriguing merits.
Recently, the domain shift in scene text recognition has attracted great interest in academia, and several methods have been proposed, such as GA-DAN [41] and SSDAN [42]. Notably, domain shift occurs in federated learning for scene text recognition as well, leading to a deterioration in global model accuracy. Hence, for future work, we are studying domain adaptation for decentralized datasets within the framework of FedOCR.
References

1. Almazán, J., Gordo, A., Fornés, A., Valveny, E.: Word spotting and recognition
with embedded attributes. TPAMI 36(12), 2552–2566 (2014)
2. Bai, F., Cheng, Z., Niu, Y., Pu, S., Zhou, S.: Edit probability for scene text recog-
nition. In: CVPR (2018)
3. Bartz, C., Bethge, J., Yang, H., Meinel, C.: Kiss: Keeping it simple for scene text
recognition. arXiv preprint arXiv:1911.08400 (2019)
4. Chen, W., Wilson, J., Tyree, S., Weinberger, K., Chen, Y.: Compressing neural
networks with the hashing trick. In: ICML (2015)
5. Chng, C.K., Liu, Y., Sun, Y., Ng, C.C., Luo, C., Ni, Z., Fang, C., Zhang, S., Han, J., Ding, E., et al.: ICDAR 2019 robust reading challenge on arbitrary-shaped text (RRC-ArT). arXiv preprint arXiv:1909.07145 (2019)
6. Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit con-
fidence information and basic countermeasures. In: CCS (2015)
7. Gao, Y., Chen, Y., Wang, J., Tang, M., Lu, H.: Reading scene text with fully
convolutional sequence modeling. Neurocomputing 339, 161–170 (2019)
8. Goel, V., Mishra, A., Alahari, K., Jawahar, C.: Whole is greater than sum of parts:
Recognizing scene text words. In: ICDAR (2013)
9. Gupta, A., Vedaldi, A., Zisserman, A.: Synthetic data for text localisation in nat-
ural images. In: CVPR (2016)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: CVPR (2016)
11. Hitaj, B., Ateniese, G., Perez-Cruz, F.: Deep models under the GAN: information leakage from collaborative deep learning. In: CCS (2017)
12. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and
artificial neural networks for natural scene text recognition. arXiv preprint
arXiv:1406.2227 (2014)
13. Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V.R., Lu, S., et al.: ICDAR 2015 competition on robust reading. In: ICDAR (2015)
14. Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L.G., Mestre, S.R., Mas, J., Mota, D.F., Almazan, J.A., De Las Heras, L.P.: ICDAR 2013 robust reading competition. In: ICDAR (2013)
15. Li, H., Wang, P., Shen, C., Zhang, G.: Show, attend and read: A simple and strong
baseline for irregular text recognition. In: AAAI (2019)
16. Liu, W., Chen, C., Wong, K.Y.K.: Char-net: A character-aware neural network for
distorted scene text recognition. In: AAAI (2018)
17. Liu, Z., Li, Y., Ren, F., Goh, W.L., Yu, H.: SqueezedText: A real-time scene text recognition by binary convolutional encoder-decoder network. In: AAAI (2018)
18. Long, S., He, X., Yao, C.: Scene text detection and recognition: The deep learning
era. arXiv preprint arXiv:1811.04256 (2018)
19. Lucas, S.M., Panaretos, A., Sosa, L., Tang, A., Wong, S., Young, R., Ashida, K., Nagai, H., Okamoto, M., Yamamoto, H., et al.: ICDAR 2003 robust reading competitions: entries, results, and future directions. IJDAR 7(2-3), 105–122 (2005)
20. Luo, C., Jin, L., Sun, Z.: MORAN: A multi-object rectified attention network for scene text recognition. PR 90, 109–118 (2019)
21. Luo, J., Wu, X., Luo, Y., Huang, A., Huang, Y., Liu, Y., Yang, Q.: Real-world
image datasets for federated learning. arXiv preprint arXiv:1910.11089 (2019)
22. Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: ECCV. pp. 116–131 (2018)
23. McMahan, H.B., Moore, E., Ramage, D., Hampson, S., et al.: Communication-
efficient learning of deep networks from decentralized data. arXiv preprint
arXiv:1602.05629 (2016)
24. Mishra, A., Alahari, K., Jawahar, C.: Top-down and bottom-up cues for scene text
recognition. In: CVPR (2012)
25. Phong, L.T., Aono, Y., Hayashi, T., Wang, L., Moriai, S.: Privacy-preserving deep
learning via additively homomorphic encryption. TIFS 13(5), 1333–1345 (2018)
26. Quy Phan, T., Shivakumara, P., Tian, S., Lim Tan, C.: Recognizing text with
perspective distortion in natural scenes. In: ICCV (2013)
27. Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., Pedarsani, R.: FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. arXiv preprint arXiv:1909.13014 (2019)
28. Risnumawan, A., Shivakumara, P., Chan, C.S., Tan, C.L.: A robust arbitrary
text detection system for natural scene images. Expert Systems with Applications
41(18), 8027–8048 (2014)
29. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based
sequence recognition and its application to scene text recognition. TPAMI 39(11),
2298–2304 (2016)
30. Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., Bai, X.: ASTER: An attentional scene text recognizer with flexible rectification. TPAMI (2018)
31. Shokri, R., Shmatikov, V.: Privacy-preserving deep learning. In: CCS (2015)
32. Su, B., Lu, S.: Accurate scene text recognition based on recurrent neural network.
In: ACCV (2014)
33. Veit, A., Matera, T., Neumann, L., Matas, J., Belongie, S.: COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140 (2016)
34. Voigt, P., Von dem Bussche, A.: The EU General Data Protection Regulation (GDPR). A Practical Guide, 1st Ed., Cham: Springer International Publishing (2017)
35. Wang, K., Babenko, B., Belongie, S.: End-to-end scene text recognition. In: ICCV
(2011)
36. Wei, K., Li, J., Ding, M., Ma, C., Yang, H.H., Farhad, F., Jin, S., Quek, T.Q., Poor,
H.V.: Federated learning with differential privacy: Algorithms and performance
analysis. arXiv preprint arXiv:1911.00222 (2019)
37. Yang, M., Guan, Y., Liao, M., He, X., Bian, K., Bai, S., Yao, C., Bai, X.: Symmetry-
constrained rectification network for scene text recognition. In: ICCV (2019)
38. Yang, Q., Liu, Y., Chen, T., Tong, Y.: Federated machine learning: Concept and
applications. TIST 10(2), 12 (2019)
39. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint
arXiv:1212.5701 (2012)
40. Zhan, F., Lu, S.: ESIR: End-to-end scene text recognition via iterative image rectification. In: CVPR (2019)
41. Zhan, F., Xue, C., Lu, S.: GA-DAN: Geometry-aware domain adaptation network for scene text detection and recognition. In: ICCV (2019)
42. Zhang, Y., Nie, S., Liu, W., Xu, X., Zhang, D., Shen, H.T.: Sequence-to-sequence
domain adaptation network for robust text image recognition. In: CVPR (2019)
43. Zhu, W., Baust, M., Cheng, Y., Ourselin, S., Cardoso, M.J., Feng, A.: Privacy-
preserving federated brain tumour segmentation. In: Machine Learning in Medical
Imaging: 10th International Workshop (2019)
