Fan Yang, Xiangtong Wang, Xuan Zhu, Binbin Liang, Wei Li

Neurocomputing 488 (2022) 424–435
Article history: Received 6 October 2021; Revised 26 February 2022; Accepted 10 March 2022; Available online 12 March 2022.

Keywords: Person re-identification; Relation learning; Global feature; Partial feature; Convolutional neural networks

Abstract

Video-based person re-identification (Re-ID) aims to match the same pedestrian across video sequences captured by non-overlapping cameras. Fully exploiting the abundant spatial and temporal information in the video frames is the key to video-based Re-ID. In this paper, a novel Relation-Based Global-Partial Feature Learning framework is proposed to explore discriminative spatiotemporal features with the help of the global and partial relationships between frames. Specifically, we propose a Relation-Based Global Feature Learning Module (RGL) that obtains global references for generating feature correlation maps between frames in the video sequence and determines the importance of frame-level features. As a complement to the global relation-based features, a Relation-Based Partial Feature Learning Module (RPL) is also proposed to capture the relationship between partial features at the same spatial position in different frames and enhance the frame-level partial representation. Moreover, we design a multi-level training scheme to deeply supervise our model. Extensive experiments are conducted on three public video-based person Re-ID datasets, and the results indicate that our framework achieves state-of-the-art performance on all three benchmarks.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction

Person re-identification (Re-ID) aims to match a specific person across different non-overlapping cameras. Since video data contain abundant spatiotemporal information about a person compared with image data, video-based person Re-ID has drawn increasing attention in recent years. Most video-based person Re-ID methods [1–3] extract frame-level features and obtain the video-level representation by temporal aggregation to compute similarity scores. Therefore, constructing a representation that contains discriminative spatiotemporal information is the key issue in video-based Re-ID. In recent years, some relation-based methods [4,5] have been proposed to learn meaningful features and suppress irrelevant ones by adopting the correlation between frames. In this kind of method, how to obtain the reference feature is crucial, because the discriminative features are explored through the correlation between the reference feature and the frame-level features. However, the common reference selection strategies, which either pick one frame-level feature vector or average all frame-level features as the reference, are deteriorated by poor-quality frames, which in turn degrades the relation-based mining of discriminative information. Moreover, previous relation-based methods mostly focus on exploring the relations of partial features on certain parts of an image. These methods may ignore the importance of the global relation of the video sequence for person identification. Some studies [11] stack the partial relations of position features to obtain the global relation. Although the correlation of partial features can capture fine-grained clues, it loses global information about the human body. Conversely, the correlation of global features strengthens the most salient information among all the frames, but non-obvious details may be easily ignored. These two types of relation-based features complement each other, so it is necessary to combine the global and partial relation-based features to enrich the information.

Therefore, to explore the relationship between video frames and exploit the complementarity between the global and partial relation-based features, in this paper we propose a novel Relation-Based Global-Partial Feature Learning model that fully learns the global and partial relationships of the whole video sequence. For effective spatiotemporal feature extraction, we utilize the relation between frames to explore the importance and relevance of characteristics from the global and partial views of the video sequence, respectively. To construct relationships of the video
sequence that contain discriminative cues and to reduce the impact of poor-quality frames, we exploit a temporal attention operation that assigns different weights to the frames and aggregates them by weighted averaging. In the global view, the video-level reference is generated by aggregating the whole frame-level feature maps with the help of the temporal attention operation. Besides, as an important complement to the global relation-based features, we utilize equally-divided frame-level parts to generate video-level partial references and explore meaningful fine-grained details in the partial view. Moreover, we design a multi-level training scheme to deeply supervise our model at both the frame level and the video level. The whole model is trained end-to-end, which makes it easy to train and implement.

Our primary contributions can be summarized as follows:

• We propose a novel RGL module that generates relation scores based on the global guidance of the video sequence to obtain discriminative features from the global view.
• We propose an RPL module to capture fine-grained details across frames from the partial view.
• We design a multi-level training scheme consisting of video-level and frame-level losses to deeply supervise the proposed model.
• We apply our proposed model to three popular video-based person Re-ID datasets. It achieves Rank-1 accuracy of 89.1% on MARS, 97.2% on DukeMTMC-VideoReID, and 88.7% on iLIDS-VID without re-ranking, showing superiority over the state-of-the-art.

The rest of the paper is organized as follows. We briefly review related Re-ID methods in Section 2. In Section 3, we present the overall video-based Re-ID pipeline and elaborate the RGL module, the RPL module, and the multi-level training scheme in detail. In Section 4, we show the results and comparisons of our experiments. The conclusion is given in Section 5.

2. Related work

2.1. Video-based person Re-ID

Recently, person Re-ID has made great progress with the development of deep learning. Existing studies mainly focus on two aspects: image-based and video-based person Re-ID. Video-based person Re-ID, which uses much richer spatial-temporal information, is an extension of image-based person Re-ID and has drawn more and more attention. Temporal average pooling is a common method to extract temporal information for video-based person Re-ID. Zheng et al. [1] first utilize temporal average pooling to fuse frame-level features into video-level features. The recurrent neural network (RNN) is also widely used to analyze video data in the video-based person Re-ID task. McLaughlin et al. [2] and Yan et al. [6] exploit RNNs to extract temporal information from video sequences. However, the above methods treat all frames equally, so poor-quality frames negatively affect the video-level representation. In light of this, some methods apply an attention mechanism to highlight important frames. For example, Zhou et al. [7] and Li et al. [8] adopt temporal attention to assign different weights to each frame and aggregate them with the weighted average method to obtain meaningful video-level feature representations. In this paper, we attempt to obtain relation maps for each frame from the global and partial aspects to learn the discriminative information in video sequences. Besides, a novel multi-level training scheme is proposed to deeply supervise our model.

2.2. Relation-based feature learning strategy

Due to the continuity of the video sequence, the frames have a strong correlation with each other, which helps to infer the importance of features. There have been some attempts to utilize relation-based feature learning to improve the performance of person Re-ID. Park et al. [9] introduce a part-based relation network for person Re-ID to extract discriminative features and propose a global contrastive pooling method to obtain better global features for a person image. Zhang et al. [10] make the first attempt to utilize a relation-aware attention module to explore partial relations and stack them to obtain the global relation. Although the above two methods extract relation information well in image-based Re-ID, they only consider the relations between features within a single image, and relation-aware attention between frames for video-based Re-ID remains under-explored. Lin et al. [38] propose a bilinear CNN model that generates translationally invariant correlations of different partial features by modeling local pairwise feature interactions in an end-to-end manner to improve fine-grained visual recognition. This method extracts the correlation between local features rather than learning the relation between global and partial features. In addition, the pairwise relation is generated by the outer product of the pairwise local feature vectors, which is inefficient. Simon et al. [43] present a fully-automatic information extraction tool named ViPER (Visual Perception-based Extraction of Records), which extracts and discriminates the relevance of different repetitive information contents with respect to the user's visual perception of the web page to identify the most relevant data region. Chen et al. [4] utilize region features of the first frame to guide other frames and obtain feature relationships between frames. Zhang et al. [11] propose a reference-aided attention feature aggregation module, which adopts the average of all frame-level features as the reference to extract the relation. However, poor-quality frames caused by blur, weak light, and so on can affect the relational calculation over the video sequence. Zhang et al. [5] introduce a quality evaluation block to select high-quality frames and segment the reference frame with human body pose information to learn relation information between frames. This method requires additional data annotation to train the body pose estimation model, and the Re-ID performance depends heavily on the accuracy of the pose estimator. Compared with existing methods, in order to make full use of the information of each frame and reduce the impact of poor-quality frames, we generate the reference with the help of a temporal attention operation to learn the relation-based features.

2.3. Part-based person Re-ID

Most methods only extract the global feature of an image to match the target. However, these methods lose some fine-grained details. To overcome this problem, some recent works learn partial features and achieve impressive performance. Sun et al. [12] split the person image into several stripes with a partitioning strategy to extract part-level features. Although this can improve the discriminative ability, it destroys the structural information of the human body. Cheng et al. [13] propose a multi-channel model to learn global and partial information simultaneously and extract discriminative features with the help of an improved triplet loss function. This method combines global and partial features to obtain the final representation of targets, but it ignores the correlation between these two types of information. Zhang et al. [45] introduce a part-based alignment method with shortest path programming and mutual learning to improve metric learning performance. Bai et al. [46] design a global and local feature learning
network that merges slices of part features with an LSTM network and combines them with global features learned by classification metric learning. Li et al. [47] propose a part-aligning architecture to search latent regions and extract partial information for person Re-ID. Wei et al. [14] introduce a global-local-alignment descriptor that leverages human body pose points to learn global-partial features and align them. Zhao et al. [15] propose a region proposal network to extract local body regions with the help of human body structure information for aligning partial features between images. These methods need extra pose annotations for the Re-ID datasets, and possible errors of the body pose estimator can directly affect the performance of the Re-ID model. Wang et al. [16] propose an extraction strategy combining global and partial features that divides the network architecture into three parallel global-local branches with different granularities to learn finer discriminative information. Inspired by [16], Mao et al. [44] propose a coarse granularity part-level person Re-ID network (CGPN) to extract robust regional features and integrate supervised global features for pedestrian images. However, the above two methods bring a huge computational burden while boosting the performance of the Re-ID model. Inspired by the above methods, our model combines the relation-based global and partial features to learn a more discriminative representation.

2.4. Attention mechanism

[...]

... where C, H, and W denote the number of channels, height, and width of the feature maps, respectively.

In the global branch, the frame-level features F_t are fed into the RGL module to explore the relationship between frames and enhance the relevant features of the target from the global view. A Temporal Aggregation Module (TAM) followed by a global average pooling operation is applied to aggregate the frame-level features F_t into the video-level global feature vector f_global ∈ R^C.

In the partial branch, we split the frame-level feature maps F_t into several stripes F^i_t ∈ R^{C×(H/p)×W} (i ∈ [1, p]) in the horizontal orientation, where p indicates the number of stripes. Then, we input the frame-level global feature F_t and the partial feature of each stripe F^i_t into the RPL module to obtain the video-level partial vector of each stripe, f^i ∈ R^C. The final video-level partial vector f_partial ∈ R^C is the combination of the p terms:

f_{partial} = \sum_{i=1}^{p} f^{i}   (1)

In addition, we introduce a multi-level training scheme that combines video-level supervision and frame-level supervision with the help of the softmax loss and the batch-hard triplet loss.

During the testing phase, we concatenate the video-level global and partial feature vectors to obtain the most discriminative final representation f ∈ R^{2C}:

f = [f_{global}; f_{partial}]   (2)
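To make the two-branch pipeline above concrete, the following PyTorch-style sketch wires together the stripe splitting, the per-stripe summation of Eq. (1), and the test-time concatenation of Eq. (2). The RGL, RPL, and TAM stand-ins here (identity, stripe max pooling, temporal mean) are simplifying assumptions, not the authors' modules, and the tensor layout [T, C, H, W] is likewise assumed.

```python
import torch

# Stand-ins for the modules described in the paper (assumptions, not the authors' code).
def tam(x):                       # temporal aggregation: [T, ...] -> [...]
    return x.mean(dim=0)

def rgl(frames):                  # relation-based global enhancement (placeholder: identity)
    return frames                 # the real module re-weights features via correlation maps

def rpl(stripe):                  # relation-based partial learning (placeholder)
    f_t = stripe.amax(dim=(2, 3))          # GMP over each frame's stripe -> [T, C]
    return tam(f_t)                        # video-level partial vector -> [C]

def video_representation(frames_feat, p=4):
    """frames_feat: frame-level maps of one tracklet, assumed layout [T, C, H, W]."""
    enhanced = rgl(frames_feat)                          # [T, C, H, W]
    f_global = tam(enhanced).mean(dim=(1, 2))            # GAP after TAM -> [C]
    stripes = torch.chunk(frames_feat, p, dim=2)         # p horizontal stripes
    f_partial = sum(rpl(s) for s in stripes)             # Eq. (1): sum of p partial vectors
    return torch.cat([f_global, f_partial], dim=0)       # Eq. (2): concatenation, [2C]

feats = torch.randn(4, 2048, 16, 8)        # e.g. T = 4 frames from a ResNet-50 backbone
print(video_representation(feats).shape)   # torch.Size([4096])
```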
[Figure: overall framework. The global branch passes the frame-level features through the RGL module and the TAM to obtain the relation-based global feature; the partial branch passes them through the RPL module followed by average pooling. Both branches are supervised by an ID loss and a triplet loss at the frame level and the video level.]
where BN and ReLU denote Batch Normalization and the ReLU activation, respectively.

The relation-based frame-level global features F^R_t ∈ R^{C×H×W} are calculated by:

F^{R}_{t} = F_{t} \odot A_{t}   (5)

where ⊙ represents element-wise multiplication.

3.3. Relation-based partial feature learning module

We utilize a partial feature relation calculating strategy to explore the fine-grained features of the pedestrian across different frames, as shown in Fig. 3. The spatial position of the same person captured by the same camera changes only slightly in consecutive frames, and the information in different frames is complementary to the representation of the target person. Thus, we propose an RPL module that enriches the partial information in the guided feature to extract the fine-grained features in different frames. We partition each frame-level feature F_t into p horizontal stripes F^i_t and aggregate all the stripes at the same spatial position of the frame sequence into a single part-level feature map F^i ∈ R^{C×(H/p)×W} with the help of the TAM. We then obtain the partial single vector f^i, which serves as the partial guide reference vector, with a global max-pooling layer.

In order to mine the fine-grained cues from each frame, we adopt the RGL module to generate the partial feature F^i_t, where regions with high similarity receive high weights. The above process can be formulated as follows:

F^{i}_{t} = \mathrm{RGL}(f^{i}, F_{t})   (6)

Through the global max-pooling layer, the elements in F^i_t with high response values are aggregated into a single vector f^i_t ∈ R^C, and these vectors are aggregated into the video-level partial vector f^i with the help of the TAM module. Therefore, we can calculate the final partial vector as follows:

f^{i}_{t} = \mathrm{GMP}(F^{i}_{t})   (7)

f^{i} = \mathrm{TAM}(f^{i}_{t})   (8)

where GMP is the global max pooling operation and TAM is the temporal aggregation module.
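A minimal sketch of the relation weighting in Eq. (5) and of the stripe-wise pipeline of Eqs. (6)–(8) is given below. How the relation map A_t is computed inside the RGL module is not shown in the excerpt above, so the cosine-similarity weighting used here is only an assumption; the GMP and the weighted temporal mean standing in for the TAM follow the definitions in the text.

```python
import torch
import torch.nn.functional as F

def relation_weight(frames, ref):
    """Eq. (5)-style weighting F_t^R = F_t * A_t. Here A_t is taken as the sigmoid of the
    cosine similarity between each spatial position and the reference vector `ref` in R^C
    (this similarity choice is an assumption)."""
    T, C, H, W = frames.shape
    sim = F.cosine_similarity(frames, ref.view(1, C, 1, 1), dim=1)   # [T, H, W]
    a = torch.sigmoid(sim).unsqueeze(1)                              # [T, 1, H, W]
    return frames * a                                                # element-wise product

def rpl_stripe(frames, stripe, temporal_attn=None):
    """Eqs. (6)-(8) for one stripe: build a partial reference, re-weight the frame features
    with it, then GMP per frame and temporal aggregation (TAM approximated by a weighted mean)."""
    T = frames.shape[0]
    if temporal_attn is None:                       # uniform weights stand in for the TAM scores
        temporal_attn = torch.full((T,), 1.0 / T)
    part_map = (stripe * temporal_attn.view(T, 1, 1, 1)).sum(dim=0)  # F^i, [C, H/p, W]
    f_ref = part_map.amax(dim=(1, 2))                                # partial reference f^i, [C]
    weighted = relation_weight(frames, f_ref)                        # Eq. (6)
    f_t = weighted.amax(dim=(2, 3))                                  # Eq. (7): GMP per frame, [T, C]
    return (f_t * temporal_attn.view(T, 1)).sum(dim=0)               # Eq. (8): TAM, [C]
```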
[Fig. 3: structure of the RPL module — frame-level features pass through the RGL module, a max pooling layer, and the TAM.]
3.4. Temporal attention module

In order to focus on the informative frame-level features, the Temporal Attention Module [17] is applied to weight the features of each frame. Thus, the video-level feature f_v is formulated as:

f_{v} = \frac{1}{T} \sum_{t=1}^{T} a^{t}_{v} f^{t}_{v}   (9)

where f^t_v is the frame-level feature of frame t and a^t_v is its attention score. The input of the TAM is a sequence of frame-level features [T, w, h, 2048], where the kernel size is w × h, the number of input channels is 2048, and the number of output channels is d_t. In order to generate the temporal attention s^t_v, we use a temporal convolutional layer {3, d_t, 1}. A softmax function is then applied to calculate the attention scores a^t_v of each frame-level feature.
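A PyTorch-style sketch of this temporal attention is given below, following the layout stated above ([T, w, h, 2048] inputs, a spatial reduction to d_t channels, a temporal convolution with kernel size 3, and a softmax over the T scores). Any layer detail beyond what the text states, including the default d_t and spatial size, is an assumption.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention over T frame-level features, in the spirit of Eq. (9)."""
    def __init__(self, in_channels=2048, d_t=256, spatial=(16, 8)):
        super().__init__()
        w, h = spatial
        # Spatial conv reduces each frame map to a d_t-dim descriptor (kernel = whole map)
        self.spatial_conv = nn.Conv2d(in_channels, d_t, kernel_size=(w, h))
        # Temporal conv over the frame axis with kernel size 3 gives one score per frame
        self.temporal_conv = nn.Conv1d(d_t, 1, kernel_size=3, padding=1)

    def forward(self, frames):                          # frames: [T, C, w, h]
        feats = frames.mean(dim=(2, 3))                 # frame-level vectors f_v^t, [T, C]
        s = self.spatial_conv(frames).flatten(1)        # [T, d_t]
        s = self.temporal_conv(s.t().unsqueeze(0))      # [1, 1, T] temporal scores s_v^t
        a = torch.softmax(s.squeeze(), dim=0)           # attention scores a_v^t, [T]
        # Eq. (9): weighted aggregation, including the 1/T factor of the equation
        return (a.unsqueeze(1) * feats).sum(dim=0) / frames.shape[0]

attn = TemporalAttention()
video_feat = attn(torch.randn(4, 2048, 16, 8))          # -> tensor of shape [2048]
```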
[...] where the margin is a parameter that controls the differences between the intra-class and inter-class distances. The softmax loss function L_softmax is used for discriminative learning as well:

L_{softmax} = -\frac{1}{PK} \sum_{i=1}^{P} \sum_{a=1}^{K} p_{i,a} \log q_{i,a}   (12)

where p_{i,a} and q_{i,a} are the ground-truth identity label and the predicted probability of sample {i, a}, respectively.

We propose a multi-level training scheme that combines video-level and frame-level supervision to improve the feature extraction capability of the network. Both levels of supervision include a global part and a partial part, and each part is supervised by the triplet loss and the softmax loss simultaneously. The video-level loss and the frame-level loss can be calculated by:

L_{v} = \frac{1}{2}\left(L^{vg}_{triplet} + L^{vp}_{triplet}\right) + \frac{1}{2}\left(L^{vg}_{softmax} + L^{vp}_{softmax}\right)   (13)

L_{f} = \frac{1}{N+1} \sum_{i=1}^{N}\left(L^{fg}_{triplet} + L^{fp_i}_{triplet}\right) + \frac{1}{N+1} \sum_{i=1}^{N}\left(L^{fg}_{softmax} + L^{fp_i}_{softmax}\right)   (14)
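As a concrete reference for how the two loss levels of Eqs. (13) and (14) can be combined, the sketch below pairs a cross-entropy (softmax) loss with a batch-hard triplet loss for the global and partial parts of one supervision level. The margin value and the uniform averaging used here are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    """Batch-hard triplet loss: hardest positive / hardest negative per anchor."""
    d = torch.cdist(feats, feats)                          # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (d * same.float()).max(dim=1).values     # farthest sample of the same identity
    hardest_neg = d.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(margin + hardest_pos - hardest_neg).mean()

def level_loss(global_feat, partial_feats, global_logits, partial_logits, labels):
    """One supervision level: triplet + softmax on the global part and on each partial part,
    averaged in the spirit of Eqs. (13)-(14) (normalization details assumed)."""
    parts = [global_feat] + partial_feats
    logits = [global_logits] + partial_logits
    tri = sum(batch_hard_triplet(f, labels) for f in parts) / len(parts)
    ce = sum(F.cross_entropy(l, labels) for l in logits) / len(logits)
    return tri + ce

# Total loss = weighted sum of the video-level and frame-level losses (cf. Eq. (16)):
# L = a * level_loss(<video-level inputs>) + b * level_loss(<frame-level inputs>)
```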
[...]

DukeMTMC-VideoReID is another large-scale video-based Re-ID dataset; it includes 1,812 identities and 4,832 tracklets captured by eight different cameras. The dataset is divided into 408, 702, and 702 identities for distraction, training, and testing, respectively. Each tracklet has 168 frames on average. The three datasets were captured by real-world cameras. However, privacy protection is very important, especially in the field of person Re-ID, which requires large amounts of data to train the model, so data acquisition is a crucial issue for researchers. Generated data, such as the DG-Market dataset [42], can alleviate this problem by providing large-scale training data. Thus, generated data will be an important topic for our further research.

4.1.2. Evaluation metrics

We utilize the Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP) score to evaluate the performance of our model. The CMC indicates the accuracy of person re-identification; we report the Rank-1, Rank-5, and Rank-20 scores of the CMC curve.
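For reference, a simplified sketch of how the Rank-k (CMC) and mAP metrics can be computed from query and gallery features is shown below. It ranks by cosine similarity and omits the per-camera filtering used in the standard MARS protocol, so it illustrates the metrics rather than reproducing the exact evaluation code.

```python
import numpy as np

def evaluate(q_feats, q_ids, g_feats, g_ids, ranks=(1, 5, 20)):
    """q_feats: [Nq, D], g_feats: [Ng, D] L2-normalized features; ids are identity labels."""
    sim = q_feats @ g_feats.T                        # cosine similarity (features normalized)
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(-sim[i])                  # gallery sorted by decreasing similarity
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                 # query identity absent from the gallery
        first_hit = np.argmax(matches)               # rank of the first correct match
        cmc[first_hit:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())    # average precision
    n = len(aps)
    return {f"Rank-{r}": cmc[r - 1] / n for r in ranks}, float(np.mean(aps))
```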
4.1.3. Implementation details

We employ the standard ResNet-50 [20] as the backbone network for frame-level feature extraction. The input frames are resized to 256 × 128, and the batch size is set to 32. For each batch, we randomly select P = 4 samples for each identity and set the sequence length to T = 4. We train our model for 500 epochs and use Adam [21] to optimize the network with a weight decay of 5 × 10^{-4}. The initial learning rate is 3.5 × 10^{-4}, with a decay factor of 0.1 every 100 epochs. In the test phase, we extract the feature for each video sequence, and the video-level features are finally used for retrieval with the cosine distance between the query and the gallery.
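The training configuration above can be set up roughly as follows; the sketch mirrors the stated hyper-parameters (Adam, initial learning rate 3.5e-4, weight decay 5e-4, decay factor 0.1 every 100 epochs), while the model object and the batch sampler are placeholders left to the reader.

```python
import torch

def build_optimizer(model):
    # Adam with the stated learning rate and weight decay
    optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=5e-4)
    # Learning rate decays by a factor of 0.1 every 100 epochs, for 500 epochs in total
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
    return optimizer, scheduler

# Batch construction (assumed PK-style sampling): 32 tracklets per batch, P = 4 tracklets
# per identity, each tracklet a sequence of T = 4 frames resized to 256 x 128.
# At test time, retrieval uses the cosine distance between video-level features.
```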
4.2. Results visualization

In order to evaluate the effect of the model more intuitively, we visualize the retrieval results and the feature maps generated by the baseline and our model. Since privacy protection of the data is important, we hide the faces of pedestrians in the visualizations to protect personal privacy.

4.2.1. Retrieval results analysis

As illustrated in Fig. 4, we visualize some video-based person Re-ID results achieved by the baseline and our proposed method on MARS, DukeMTMC-VideoReID, and iLIDS-VID. The first sequence is the query, and the remainder are the Rank-1 to Rank-5 (from left to right) retrieved results in the gallery. The first and second rows are the matching results of the baseline and our model, respectively, so the top-5 returned video sequences of the two methods for each query are shown in Fig. 4. Candidates with green boxes belong to the same identity as the query, while red boxes mark incorrectly matched sequences. From the visualization, we can observe that our model matches the correct pedestrians in the first ranking results, while the baseline method fails to find the same person at the top rank. For example, in the first query, our model is more discriminative in distinguishing targets with similar clothes under spatial misalignments, lighting changes, and viewing angle changes.

4.2.2. Visualization of feature maps

We visualize the feature maps generated by the baseline and our model on the MARS dataset in Fig. 5. The frames include varying conditions such as pose, scale, and partial occlusions. As shown in Fig. 5 (a), compared with the feature maps extracted by the baseline, the proposed model can mine the discriminative cues with less background information. In Fig. 5 (b), when the target occupies a small region of the frame, the features of the baseline spread over regions beyond the target pedestrian, whereas our model still focuses on the target and ignores irrelevant information such as lawns, trees, and lampposts. In addition, our model identifies the accessories (bag and umbrella) that belong to the target, which is significant information for discriminating different people. As shown in Fig. 5 (c), when the target pedestrian is partially occluded by environmental objects, the baseline simultaneously extracts the features of the target and of the occluding objects, which harms the representation of the target pedestrian. In addition, to compare the global and partial features, we show the feature maps extracted by the global and partial branches, respectively. Compared with the global features, the features generated by the partial branch capture fine-grained details, such as umbrellas or patterns on clothes. Meanwhile, we can observe some variation between the features from the global and partial branches; these two types of features complement each other to assemble a more discriminative representation of the target. The visualization of the feature maps further validates that our proposed method highlights more discriminative features and suppresses redundant information.

4.3. Comparison with the state-of-the-art methods

In this section, we compare our proposed approach with state-of-the-art methods on the MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets, respectively. None of the compared approaches use the re-ranking strategy in post-processing. The results are reported in Table 1.

4.3.1. MARS

Comparisons between our method and other approaches on the MARS dataset show that our model achieves 84.5% mAP and outperforms all previous methods by more than 1.5%. Even on the CMC curve, our proposed method achieves competitive results. It is worth noting that the recently proposed AdaptiveGraph [28] achieves impressive Rank-1 accuracy, but it utilizes extra pose information to generate the graph and incurs a huge computational burden. Compared with this work, our method achieves competitive performance without additional information. The experimental results validate the superiority and effectiveness of our model.

4.3.2. DukeMTMC-VideoReID

Because DukeMTMC-VideoReID is a newly proposed dataset for video-based person Re-ID, only a few works have released their performance on it. As shown in Table 1, our proposed method achieves the best performance, with 97.2% Rank-1 accuracy and 95.9% mAP. Note that the DPRM [32] method presents competitive performance, but it ignores the learning of fine-grained details, which is important for enhancing the discriminative features of the target person. Compared with those works, our method achieves state-of-the-art results.

4.3.3. iLIDS-VID

The comparisons with recent works on the iLIDS-VID dataset are shown in Table 1. From the table, we observe that our proposed method achieves 88.7% Rank-1 accuracy, outperforming the other published results. The reason lies in the use of both the fine-grained details and the relationship between frames in our model. Therefore, it is reasonable to believe that our proposed method achieves better performance than other state-of-the-art approaches.
Fig. 4. Visualization of video-based Re-ID results using the baseline model and the proposed model on the three datasets. For each query, we show the top-5 ranking match
results. The green and red bounding boxes respectively denote correct and incorrect matches. (For interpretation of the references to color in this figure legend, the reader is
referred to the web version of this article.)
4.4. Ablation studies

4.4.1. Effectiveness of each component

We evaluate the contribution of each component and show the ablation results in Table 2. In the training stage, we add each component to the baseline [17] separately with the same training settings. In the first row, the results of the baseline are obtained by performing temporal attention on the frame-level features to calculate the video-level representations. The standard baseline achieves 76.7% mAP and 83.2% Rank-1 accuracy on the MARS dataset. "+RGL" means that we add the RGL module to the global branch to enhance the frame-level global features. We can see that the RGL module increases the Rank-1 accuracy by 4.0% and the mAP by 7.0% on the MARS dataset. In order to explore the partial features, we design a partial branch that horizontally partitions the frame-level feature maps into several partial regions and utilizes the temporal attention operation to generate the video-level partial features. "+RPL" means that we add the RPL module to the partial branch to guide the feature from the partial view, which improves the Rank-1 accuracy by 1.4% and the mAP by 4.3% on the MARS dataset. We then combine the two branches and further incorporate the multi-level training scheme (MT). The Rank-1 accuracy and mAP are improved from 83.2% and 76.7% to 89.1% and 84.5% on the MARS dataset, respectively. In addition, we also evaluate the model parameters and computational complexity of each component. We observe that our proposed method improves the performance significantly with few additional parameters and little extra computational complexity.

4.4.2. Influences of different branches

We perform experiments to verify the effectiveness of the global and partial branches. The results on MARS are shown in Table 3. "with RGL" and "with RPL" mean that we add the RGL or RPL module to the corresponding branch, respectively. By comparing the performance of the different branches, we can observe that the combination of the global and partial branches indeed achieves better performance than any single branch. These results also indicate that the complementarity of global and partial information enhances the discriminative representation of the target.

4.4.3. Effectiveness of the multi-level training scheme

To validate the effectiveness of the multi-level training scheme, we perform experiments on the MARS dataset. The results are reported in Table 4. "VL" denotes the video-level loss, which is used to supervise the video-level feature; "FGL" and "FPL" denote the frame-level global loss and partial loss. We can observe that the combination of the video-level loss, the frame-level global loss, and the frame-level partial loss achieves the best performance. The results imply that the multi-level training scheme supervises our model to learn more discriminative features.

4.4.4. Evaluating the balance of the losses

To evaluate the balance of the video-level loss (VL) and the frame-level loss (FL), we introduce two weight parameters (a, b). The overall loss function L can then be calculated by:

L = a L_{v} + b L_{f}   (16)

We perform experiments on the MARS dataset by changing the values of the two parameters. As shown in Table 5, different weights have little effect on performance. The best result is obtained when a = 1 and b = 1, which indicates that these two losses are equally important for our model.

4.4.5. Comparison of different sequence lengths

To investigate how the sequence length influences the final performance, we conduct experiments with four different sequence lengths: T = 2, 4, 6, and 8.
[Fig. 5: visualization of feature maps on MARS — rows show the frame sequences, the baseline, and our model (holistic features, partial features, and combined holistic-partial features).]

Table 1
Comparison with the state-of-the-art video-based person Re-ID methods on the MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets. The 1st, 2nd, and 3rd best results are emphasized with bold, italic, and bold-italic, respectively.
The results are listed in Table 6. From the table, we can see that T = 4 achieves the best performance. One possible reason is that, as the length of the sequence increases, the model obtains more information and the performance improves accordingly. However, the confusion within each group of frames also increases with the sequence length, which makes it more difficult for the model to extract effective features.
Table 2
Comparison of the different proposed components on the MARS dataset.
Table 3
Comparison of the different branches on the MARS dataset.
Table 4
The benefit of using the multi-level training scheme on the MARS dataset.
Table 5
Evaluation of the balance of the losses on the MARS dataset.
Therefore, the model achieves the best performance when T = 4. When T = 2, the amount of information obtained by the model is not enough, and when T = 6 or 8, it becomes more difficult for the model to extract effective information.

4.4.6. Comparison of different numbers of split regions

Intuitively, the number of split regions determines the granularity of the partial features. However, accuracy does not always increase as the number of split regions increases. Considering the size of the feature map extracted by the backbone, we conduct experiments with three different numbers of split regions: N = 2, 4, and 8. We show the results in Table 7. In these experiments, we can observe that the model achieves the best results with 4 split regions. When we increase the number of split regions from 4 to 8, the performance degrades, since with N = 8 the split regions are too small to contain enough features for distinguishing different pedestrians.

4.4.7. Comparison of different generation methods for the reference vector

In order to explore the influence of different reference vector generation methods on the performance, we conduct experiments on the MARS dataset. Three reference vector generation methods are selected. From Table 8, we can observe that our proposed method, which utilizes temporal attention, achieves favorable accuracy. The reason is that the temporal attention method can effectively suppress the influence of poor-quality frames on the generated reference vector while aggregating the important information of each frame. The first-frame selection method fails to avoid poor-quality frames. Although the temporal average method can aggregate the information of all video frames, it treats all frames equally, so poor-quality frames still affect the final reference vector.
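The three reference-generation strategies compared in Table 8 can be summarized by the sketch below, where `attn` stands for the temporal attention scores produced by the TAM; the first two options are the baselines discussed above, and the attention-weighted version corresponds to the strategy adopted in this paper (implementation details assumed).

```python
import torch

def reference_vector(frame_feats, method="attention", attn=None):
    """frame_feats: [T, C] frame-level feature vectors of one tracklet."""
    if method == "first":          # first-frame reference: sensitive to a poor first frame
        return frame_feats[0]
    if method == "average":        # temporal average: all frames weighted equally
        return frame_feats.mean(dim=0)
    if method == "attention":      # attention-weighted: can down-weight poor-quality frames
        if attn is None:           # uniform fallback when no attention scores are supplied
            attn = torch.full((frame_feats.shape[0],), 1.0 / frame_feats.shape[0])
        return (attn.unsqueeze(1) * frame_feats).sum(dim=0)
    raise ValueError(method)
```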
Table 6
Comparison of different sequence lengths T = 2, 4, 6, and 8 on the MARS dataset.
Table 7
Comparison of different numbers of split regions N = 2, 4, and 8 on the MARS dataset.
Table 8
Comparison of different reference vector generation methods on the MARS dataset.
Table 9
Comparison of different attention methods applied to the baseline on the MARS dataset.
Table 10
Comparison of different loss functions on the MARS dataset.
4.4.8. Comparison with different attention methods

We compare various attention methods with our RGL module. For a fair comparison, all methods are applied to the baseline. As shown in Table 9, all the attention models bring some performance improvement over the baseline. It is worth noting that our method achieves the best performance, outperforming the baseline by 7.0% mAP and 4.0% Rank-1 accuracy.

4.4.9. Comparison with different loss functions

In order to explore the influence of different loss functions combined with the softmax loss on the performance, we conduct experiments on the MARS dataset. For a fair comparison, all loss functions are applied to our model. As shown in Table 10, our training method, which utilizes the hard triplet loss combined with the softmax loss, achieves the best performance.

5. Conclusion

In this paper, we propose a novel relation-based global-partial feature learning network for video-based person Re-ID, which effectively enhances the discriminative information and suppresses the redundant information. We design an RGL module and an RPL module to generate the correlation maps between the frames of the video sequence under global and partial reference features, exploring global relation-based features and more fine-grained partial relation-based details. Besides, we propose a multi-level training scheme that combines the frame-level loss and the video-level loss to deeply supervise our model and learn a more accurate and suitable representation. Extensive experiments and ablation studies conducted on three public video-based person Re-ID datasets confirm that our proposed method achieves state-of-the-art performance.

CRediT authorship contribution statement

Fan Yang: Conceptualization, Methodology, Validation, Investigation, Formal analysis, Writing – original draft, Visualization. Xiangtong Wang: Writing – review & editing. Xuan Zhu: Writing – review & editing. Binbin Liang: Writing – review & editing. Wei Li: Writing – review & editing, Funding acquisition, Supervision, Project administration, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Project of Sichuan Science and Technology Department (grant nos. 2020YFG0134 and 2022YFG0153), the Funding from Sichuan University (grant nos. GSJDJS2021010, 2020SCUNG205, and 2021SCUVS005), and the funding of the Civil Aircraft Fire Science and Safety Engineering Key Laboratory of Sichuan Province (grant no. MZ2022KF10).
References

[1] L. Zheng, Z. Bie, Y. Sun, J. Wang, Q. Tian, "MARS: A Video Benchmark for Large-Scale Person Re-Identification," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 868–884, doi: 10.1007/978-3-319-46466-4_52.
[2] N. McLaughlin, J. Martinez del Rincon, P. Miller, "Recurrent Convolutional Network for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1325–1334, doi: 10.1109/CVPR.2016.148.
[3] D. Chung, K. Tahboub, E.J. Delp, "A Two Stream Siamese Convolutional Neural Network for Person Re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1992–2000, doi: 10.1109/ICCV.2017.218.
[4] Z. Chen, Z. Zhou, J. Huang, P. Zhang, B. Li, "Frame-Guided Region-Aligned Representation for Video Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 10591–10598, doi: 10.1609/aaai.v34i07.6632.
[5] G. Zhang, Y. Chen, Y. Dai, Y. Zheng, Y. Wu, "Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428118.
[6] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, X. Yang, "Person Re-Identification via Recurrent Feature Aggregation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 701–716, doi: 10.1007/978-3-319-46466-4_42.
[7] Z. Zhou, Y. Huang, W. Wang, L. Wang, T. Tan, "See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6776–6785, doi: 10.1109/CVPR.2017.717.
[8] S. Li, S. Bak, P. Carr, X. Wang, "Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 369–378, doi: 10.1109/CVPR.2018.00046.
[9] H. Park, B. Ham, "Relation Network for Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 11839–11847, doi: 10.1609/aaai.v34i07.6857.
[10] Z. Zhang, C. Lan, W. Zeng, X. Jin, Z. Chen, "Relation-Aware Global Attention for Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3183–3192, doi: 10.1109/CVPR42600.2020.00325.
[11] Z. Zhang, C. Lan, W. Zeng, Z. Chen, "Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10404–10413, doi: 10.1109/CVPR42600.2020.01042.
[12] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 480–496, doi: 10.1007/978-3-030-01225-0_30.
[13] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, "Person Re-identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1335–1344, doi: 10.1109/CVPR.2016.149.
[14] L. Wei, S. Zhang, H. Yao, W. Gao, Q. Tian, "GLAD: Global-Local-Alignment Descriptor for Scalable Person Re-Identification," IEEE Trans. Multimedia 21 (4) (2019) 986–999, doi: 10.1109/TMM.2018.2870522.
[15] H. Zhao, M. Tian, S. Sun, S. Jing, X. Tang, "Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 907–915, doi: 10.1109/CVPR.2017.103.
[16] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, "Learning Discriminative Features with Multiple Granularities for Person Re-Identification," in Proc. ACM Multimedia Conf. (MM), 2018, pp. 274–282, doi: 10.1145/3240508.3240552.
[17] J. Gao, R. Nevatia, "Revisiting Temporal Modeling for Video-based Person ReID," arXiv:1805.02104, 2018.
[18] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, Y. Yang, "Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5177–5186, doi: 10.1109/CVPR.2018.00543.
[19] T. Wang, S. Gong, X. Zhu, S. Wang, "Person Re-identification by Video Ranking," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, doi: 10.1007/978-3-319-10593-2_45.
[20] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[21] D. Kingma, J. Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980, 2014.
[22] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, P. Zhou, "Jointly Attentive Spatial-Temporal Pooling Networks for Video-Based Person Re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4743–4752, doi: 10.1109/ICCV.2017.507.
[23] A. Subramaniam, A. Nambiar, A. Mittal, "Co-Segmentation Inspired Attention Networks for Video-Based Person Re-Identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 562–572, doi: 10.1109/ICCV.2019.00065.
[24] J. Li, S. Zhang, J. Wang, W. Gao, Q. Tian, "Global-Local Temporal Representations for Video Person Re-Identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3957–3966, doi: 10.1109/ICCV.2019.00406.
[25] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, X. Chen, "VRSTC: Occlusion-Free Video Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7176–7185, doi: 10.1109/CVPR.2019.00735.
[26] S. Li, H. Yu, H. Hu, "Appearance and Motion Enhancement for Video-Based Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 11394–11401, doi: 10.1609/aaai.v34i07.6802.
[27] Y. Fu, X. Wang, Y. Wei, T. Huang, "STA: Spatial-Temporal Attention for Large-Scale Video-Based Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 33 (2019) 8287–8294, doi: 10.1609/aaai.v33i01.33018287.
[28] Y. Wu, O. Bourahla, X. Li, F. Wu, X. Zhou, "Adaptive Graph Representation Learning for Video Person Re-Identification," IEEE Trans. Image Process. 29 (2020) 8821–8830, doi: 10.1109/TIP.2020.3001693.
[29] P. Li, P. Pan, P. Liu, M. Xu, Y. Yang, "Hierarchical Temporal Modeling With Mutual Distance Matching for Video Based Person Re-Identification," IEEE Trans. Circuits Syst. Video Technol. 31 (2) (2021) 503–511, doi: 10.1109/TCSVT.2020.2988034.
[30] Z. Wang et al., "Robust Video-based Person Re-Identification by Hierarchical Mining," IEEE Trans. Circuits Syst. Video Technol., 2021, doi: 10.1109/TCSVT.2021.3076097.
[31] D. Wu, M. Ye, G. Lin, X. Gao, J. Shen, "Person Re-Identification by Context-aware Part Attention and Multi-Head Collaborative Learning," IEEE Trans. Inf. Forensics Secur. (2021), doi: 10.1109/TIFS.2021.3075894.
[32] X. Yang, L. Liu, N. Wang, X. Gao, "A Two-Stream Dynamic Pyramid Representation Model for Video-Based Person Re-Identification," IEEE Trans. Image Process. 30 (2021) 6266–6276, doi: 10.1109/TIP.2021.3093759.
[33] M. Jiang, B. Leng, G. Song, Z. Meng, "Weighted triple-sequence loss for video-based person re-identification," Neurocomputing 381 (2020) 314–321.
[34] W. Gong, B. Yan, C. Lin, "Flow-guided feature enhancement network for video-based person re-identification," Neurocomputing 383 (2020) 295–302, doi: 10.1016/j.neucom.2019.11.050.
[35] G. Lin, S. Zhao, J. Shen, "Video person re-identification with global statistic pooling and self-attention distillation," Neurocomputing 381 (2021) 777–789, doi: 10.1016/j.neucom.2020.05.111.
[36] J. Si et al., "Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5363–5372, doi: 10.1109/CVPR.2018.00562.
[37] L. Zhang et al., "Ordered or Orderless: A Revisit for Video Based Person Re-Identification," IEEE Trans. Pattern Anal. Mach. Intell. 43 (4) (2021) 1460–1466, doi: 10.1109/TPAMI.2020.2976969.
[38] T. Lin, A. RoyChowdhury, S. Maji, "Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition," IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2018) 1309–1322, doi: 10.1109/TPAMI.2017.2723400.
[39] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, "Squeeze-and-Excitation Networks," IEEE Trans. Pattern Anal. Mach. Intell. 42 (8) (2020) 2011–2023, doi: 10.1109/TPAMI.2019.2913372.
[40] A. Vaswani et al., "Attention Is All You Need," in Proc. 31st Int. Conf. Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010, doi: 10.5555/3295222.3295349.
[41] S. Woo, J. Park, J.Y. Lee, "CBAM: Convolutional Block Attention Module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19, doi: 10.1007/978-3-030-01234-2_1.
[42] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, J. Kautz, "Joint Discriminative and Generative Learning for Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2133–2142, doi: 10.1109/CVPR.2019.00224.
[43] K. Simon, G. Lausen, "ViPER: Augmenting Automatic Information Extraction with Visual Perceptions," in Proc. ACM Int. Conf. Information & Knowledge Management (CIKM), Oct. 2005, pp. 381–388, doi: 10.1145/1099554.1099672.
[44] X. Mao et al., "Integrating Coarse Granularity Part-Level Features with Supervised Global-Level Features for Person Re-Identification," ZTE Communications 19 (1) (2021) 72–81, doi: 10.12142/ZTECOM.202101009.
[45] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, J. Sun, "AlignedReID: Surpassing Human-Level Performance in Person Re-Identification," arXiv:1711.08184, 2017.
[46] X. Bai, M. Yang, T. Huang, Z. Dou, R. Yu, Y. Xu, "Deep-Person: Learning Discriminative Deep Features for Person Re-Identification," arXiv:1711.10658, 2017.
[47] D. Li, X. Chen, Z. Zhang, K. Huang, "Learning Deep Context-Aware Features over Body and Latent Parts for Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7398–7407, doi: 10.1109/CVPR.2017.782.
[48] F. Wang, M. Jiang, Q. Chen et al., "Residual Attention Network for Image Classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6450–6458, doi: 10.1109/CVPR.2017.683.
[49] S. Li, S. Bak, P. Carr et al., "Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 369–378, doi: 10.1109/CVPR.2018.00046.
[50] D. Chen, H. Li, X. Tong et al., "Video Person Re-identification with Competitive Snippet-Similarity Aggregation and Co-attentive Snippet Embedding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 1169–1178, doi: 10.1109/CVPR.2018.00128.
Fan Yang received his B.S. degree from the School of Information and Communication Engineering, North University of China, Taiyuan, Shanxi, China, in 2018. He is currently pursuing the M.A.Sc. degree with the School of Aeronautics and Astronautics, Sichuan University. His research interests include computer vision and video analysis.

Xiangtong Wang received the B.S. degree from the School of Computer Science, Sichuan Normal University, in 2018, and the M.Eng. degree from the School of Aeronautics and Astronautics, Sichuan University, in 2021. He is currently pursuing the Ph.D. degree with the School of Aeronautics and Astronautics, Sichuan University. His research interests include computer vision and computer networks.

Binbin Liang received the B.S. and M.S. degrees from the College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, China, in 2012 and 2015, respectively. He is currently a Ph.D. candidate in the School of Aeronautics and Astronautics, Sichuan University. His research interests include multi-sensor data fusion and artificial intelligence.

Wei Li received the B.S. and Ph.D. degrees from the School of Mechatronic Engineering, Beijing Institute of Technology, China, in 2003 and 2008, respectively. He joined the University of Electronic Science and Technology of China in 2008. From 2010 to 2011, he was a postdoctoral member at the Center of Industrial Electronics, Polytechnic University of Madrid, Spain. In 2011, he became an associate professor at the University of Electronic Science and Technology of China. Currently, he is an associate professor at the School of Aeronautics and Astronautics, Sichuan University, China. His research interests include computer vision, video surveillance, and camera networks.