Fan Yang, Xiangtong Wang, Xuan Zhu, Binbin Liang, Wei Li

Neurocomputing 488 (2022) 424–435
Article history: Received 6 October 2021; Revised 26 February 2022; Accepted 10 March 2022; Available online 12 March 2022.

Keywords: Person re-identification; Relation learning; Global feature; Partial feature; Convolutional neural networks

Abstract

Video-based person re-identification (Re-ID) aims to match the same pedestrian across video sequences captured by non-overlapping cameras. Fully exploiting the abundant spatial and temporal information in the video frames is the key to video-based Re-ID. In this paper, a novel Relation-Based Global-Partial Feature Learning framework is proposed to explore discriminative spatiotemporal features with the help of the global and partial relationships between frames. Specifically, we propose a Relation-Based Global Feature Learning Module (RGL) that obtains global references for generating feature correlation maps between frames in the video sequence and determines the importance of frame-level features. As a complement to the global relation-based features, a Relation-Based Partial Feature Learning Module (RPL) is also proposed to capture the relationship between partial features at the same spatial position in different frames and enhance the frame-level partial representation. Moreover, we design a multi-level training scheme to deeply supervise our model. Extensive experiments are conducted on three public video-based person Re-ID datasets, and the results indicate that our framework achieves state-of-the-art performance on all three benchmarks.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction

Person re-identification (Re-ID) aims to match a specific person across different non-overlapping cameras. Since video data contain abundant spatiotemporal information about a person compared with image data, video-based person Re-ID has drawn increasing attention in recent years. Most video-based person Re-ID methods [1–3] extract frame-level features and obtain the video-level representation by temporal aggregation to compute similarity scores. Therefore, constructing a representation that contains discriminative spatiotemporal information is the key issue in video-based Re-ID. In recent years, some relation-based methods [4,5] have been proposed to learn meaningful features and suppress irrelevant ones by adopting the correlation between frames. In this kind of method, how to obtain the reference feature is crucial, because the discriminative features are explored through the correlation between the reference feature and the frame-level features. However, the common reference selection strategies, which either pick one frame-level feature vector or average all frame-level features as the reference, are deteriorated by poor-quality frames, which in turn degrades the relation-based mining of discriminative information. Moreover, previous relation-based methods mostly focus on exploring the relations of partial features on certain parts of an image. These methods may ignore the importance of the global relation of the video sequence for person identification. Some studies [11] stack the partial relations of position features to obtain the global relation. Although the correlation of partial features can capture fine-grained clues, it loses global information about the human body. Conversely, the correlation of global features strengthens the most salient information among all the frames, but non-obvious details may be easily ignored. These two types of relation-based features complement each other, so it is necessary to combine the global and partial relation-based features to enrich the information.

Therefore, to explore the relationship between video frames and exploit the complementarity between the global and partial relation-based features, in this paper we propose a novel Relation-Based Global-Partial Feature Learning model that fully learns the global and partial relationships of the whole video sequence. For effective spatiotemporal feature extraction, we utilize the relation between frames to explore the importance and relevance of characteristics from the global and partial views of the video sequence, respectively. To construct relationships of the video
sequence that contain discriminative cues and to reduce the impact of poor-quality frames, we exploit a temporal attention operation that assigns different weights to the frames and aggregates them by weighted averaging. In the global view, the video-level reference is generated by aggregating the whole frame-level feature maps with the help of the temporal attention operation. Besides, as an important complement to the global relation-based features, we utilize equally-divided frame-level parts to generate video-level partial references and explore meaningful fine-grained details in the partial view. Moreover, we design a multi-level training scheme to deeply supervise our model at both the frame level and the video level. The whole model is trained end-to-end, which makes it easy to train and implement.

Our primary contributions can be summarized as follows:

• We propose a novel RGL module that generates relation scores based on the global guidance of the video sequence to obtain discriminative features from the global view.
• We propose an RPL module to capture fine-grained details across frames from the partial view.
• We design a multi-level training scheme consisting of video-level and frame-level losses to deeply supervise the proposed model.
• We apply our proposed model to three popular video-based person Re-ID datasets. It achieves Rank-1 accuracy of 89.1% on MARS, 97.2% on DukeMTMC-VideoReID, and 88.7% on iLIDS-VID without re-ranking, showing superiority over the state-of-the-art.

The rest of the paper is organized as follows. We briefly review related Re-ID methods in Section 2. In Section 3, we present the overall video-based Re-ID pipeline and elaborate the RGL module, the RPL module, and the multi-level training scheme in detail. In Section 4, we show the results and comparisons of our experiments. The conclusion is given in Section 5.

2. Related work

2.1. Video-based person Re-ID

Recently, person Re-ID has made great progress with the development of deep learning. Existing studies mainly focus on two aspects: image-based and video-based person Re-ID. Video-based person Re-ID, which uses much richer spatial-temporal information, is an extension of image-based person Re-ID and has drawn more and more attention. Temporal average pooling is a common method to extract temporal information for video-based person Re-ID. Zheng et al. [1] first utilize temporal average pooling to fuse frame-level features into video-level features. The recurrent neural network (RNN) is also widely used to analyze video data in the video-based person Re-ID task. McLaughlin et al. [2] and Yan et al. [6] exploit RNNs to extract temporal information from video sequences. However, the above methods treat all frames equally, so poor-quality frames negatively affect the video-level representation. In light of this, some methods apply an attention mechanism to highlight important frames. For example, Zhou et al. [7] and Li et al. [8] adopt temporal attention to assign different weights to each frame and aggregate them with the weighted average method to obtain meaningful video-level feature representations. In this paper, we attempt to obtain relation maps for each frame from the global and partial aspects to learn the discriminative information in video sequences. Besides, a novel multi-level training scheme is proposed to deeply supervise our model.

2.2. Relation-based feature learning strategy

Due to the continuity of the video sequence, the frames have a strong correlation with each other, which helps to infer the importance of features. There have been some attempts to utilize relation-based feature learning to improve the performance of person Re-ID. Park et al. [9] introduce a part-based relation network for person Re-ID to extract discriminative features and propose a global contrastive pooling method to obtain better global features for a person image. Zhang et al. [10] make the first attempt to utilize a relation-aware attention module to explore partial relations and stack them to obtain the global relation. Although the above two methods extract relation information well in image-based Re-ID, they only consider the relations between features within a single image, and relation-aware attention between frames for video-based Re-ID remains under-explored. Lin et al. [38] propose a bilinear CNN model that generates translationally invariant correlations of different partial features by modeling local pairwise feature interactions in an end-to-end manner to improve fine-grained visual recognition. This method extracts the correlation between local features rather than learning the relation between global and partial features. In addition, the pairwise relation is generated by the outer product of the pairwise local feature vectors, which is inefficient. Simon et al. [43] present a fully-automatic information extraction tool named ViPER (Visual Perception-based Extraction of Records), which extracts and discriminates the relevance of different repetitive information contents with respect to the user's visual perception of the web page to identify the most relevant data region. Chen et al. [4] utilize region features of the first frame to guide other frames and obtain feature relationships between frames. Zhang et al. [11] propose a reference-aided attention feature aggregation module, which adopts the average of all frame-level features as the reference to extract the relation. However, poor-quality frames caused by blur, weak light, and so on can affect the relational calculation over the video sequence. Zhang et al. [5] introduce a quality evaluation block to select high-quality frames and segment the reference frame with human body pose information to learn relation information between frames. This method requires additional data annotation to train the body pose estimation model, and the Re-ID performance depends heavily on the accuracy of the pose estimator. Compared with existing methods, in order to make full use of the information of each frame and reduce the impact of poor-quality frames, we generate the reference with the help of a temporal attention operation to learn the relation-based features.

2.3. Part-based person Re-ID

Most methods only extract the global feature of an image to match the target. However, these methods lose some fine-grained details. To overcome this problem, some recent works learn partial features and achieve impressive performance. Sun et al. [12] split the person image into several stripes with a partitioning strategy to extract part-level features. Although this can improve the discriminative ability, it destroys the structural information of the human body. Cheng et al. [13] propose a multi-channel model to learn global and partial information simultaneously and extract discriminative features with the help of an improved triplet loss function. This method combines global and partial features to obtain the final representation of targets, but it ignores the correlation between these two types of information. Zhang et al. [45] introduce a part-based alignment method with shortest path programming and mutual learning to improve metric learning performance. Bai et al. [46] design a global and local feature learning
network that merges slices of part features with an LSTM network and combines them with global features learned by classification metric learning. Li et al. [47] propose a part-aligning architecture to search latent regions and extract partial information for person Re-ID. Wei et al. [14] introduce a global-local-alignment descriptor that leverages human body pose points to learn global-partial features and align them. Zhao et al. [15] propose a region proposal network to extract local body regions with the help of human body structure information for aligning partial features between images. These methods need extra pose annotations for the Re-ID datasets, and possible errors of the body pose estimator can directly affect the performance of the Re-ID model. Wang et al. [16] propose an extraction strategy combining global and partial features that divides the network architecture into three parallel global-local branches with different granularities to learn finer discriminative information. Inspired by [16], Mao et al. [44] propose a coarse granularity part-level person Re-ID network (CGPN) to extract robust regional features and integrate supervised global features for pedestrian images. However, the above two methods bring a huge computational burden while boosting the performance of the Re-ID model. Inspired by the above methods, our model combines the relation-based global and partial features to learn a more discriminative representation.

2.4. Attention mechanism

[...]

... where C, H, and W denote the number of channels, height, and width of the feature maps, respectively.

In the global branch, the frame-level features F_t are fed into the RGL module to explore the relationship between frames and enhance the relevant features of the target from the global view. A Temporal Aggregation Module (TAM) followed by a global average pooling operation is applied to aggregate the frame-level features F_t into the video-level global feature vector f_global ∈ R^C.

In the partial branch, we split the frame-level feature maps F_t into several stripes F^i_t ∈ R^{C×(H/p)×W} (i ∈ [1, p]) in the horizontal orientation, where p indicates the number of stripes. Then, we input the frame-level global feature F_t and the partial feature of each stripe F^i_t into the RPL module to obtain the video-level partial vector of each stripe, f^i ∈ R^C. The final video-level partial vector f_partial ∈ R^C is the combination of the p terms:

f_{partial} = \sum_{i=1}^{p} f^{i}   (1)

In addition, we introduce a multi-level training scheme that combines video-level supervision and frame-level supervision with the help of the softmax loss and the batch-hard triplet loss.

During the testing phase, we concatenate the video-level global and partial feature vectors to obtain the most discriminative final representation f ∈ R^{2C}:

f = [f_{global}; f_{partial}]   (2)
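To make the two-branch pipeline above concrete, the following PyTorch-style sketch wires together the stripe splitting, the per-stripe summation of Eq. (1), and the test-time concatenation of Eq. (2). The RGL, RPL, and TAM stand-ins here (identity, stripe max pooling, temporal mean) are simplifying assumptions, not the authors' modules, and the tensor layout [T, C, H, W] is likewise assumed.

```python
import torch

# Stand-ins for the modules described in the paper (assumptions, not the authors' code).
def tam(x):                       # temporal aggregation: [T, ...] -> [...]
    return x.mean(dim=0)

def rgl(frames):                  # relation-based global enhancement (placeholder: identity)
    return frames                 # the real module re-weights features via correlation maps

def rpl(stripe):                  # relation-based partial learning (placeholder)
    f_t = stripe.amax(dim=(2, 3))          # GMP over each frame's stripe -> [T, C]
    return tam(f_t)                        # video-level partial vector -> [C]

def video_representation(frames_feat, p=4):
    """frames_feat: frame-level maps of one tracklet, assumed layout [T, C, H, W]."""
    enhanced = rgl(frames_feat)                          # [T, C, H, W]
    f_global = tam(enhanced).mean(dim=(1, 2))            # GAP after TAM -> [C]
    stripes = torch.chunk(frames_feat, p, dim=2)         # p horizontal stripes
    f_partial = sum(rpl(s) for s in stripes)             # Eq. (1): sum of p partial vectors
    return torch.cat([f_global, f_partial], dim=0)       # Eq. (2): concatenation, [2C]

feats = torch.randn(4, 2048, 16, 8)        # e.g. T = 4 frames from a ResNet-50 backbone
print(video_representation(feats).shape)   # torch.Size([4096])
```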
[Figure: overall framework. The global branch passes the frame-level features through the RGL module and the TAM to obtain the relation-based global feature; the partial branch passes them through the RPL module followed by average pooling. Both branches are supervised by an ID loss and a triplet loss at the frame level and the video level.]
where BN and ReLU denote Batch Normalization and the ReLU activation, respectively.

The relation-based frame-level global features F^R_t ∈ R^{C×H×W} are calculated by:

F^{R}_{t} = F_{t} \odot A_{t}   (5)

where ⊙ represents element-wise multiplication.

3.3. Relation-based partial feature learning module

We utilize a partial feature relation calculating strategy to explore the fine-grained features of the pedestrian across different frames, as shown in Fig. 3. The spatial position of the same person captured by the same camera changes only slightly in consecutive frames, and the information in different frames is complementary to the representation of the target person. Thus, we propose an RPL module that enriches the partial information in the guided feature to extract the fine-grained features in different frames. We partition each frame-level feature F_t into p horizontal stripes F^i_t and aggregate all the stripes at the same spatial position of the frame sequence into a single part-level feature map F^i ∈ R^{C×(H/p)×W} with the help of the TAM. We then obtain the partial single vector f^i, which serves as the partial guide reference vector, with a global max-pooling layer.

In order to mine the fine-grained cues from each frame, we adopt the RGL module to generate the partial feature F^i_t, where regions with high similarity receive high weights. The above process can be formulated as follows:

F^{i}_{t} = \mathrm{RGL}(f^{i}, F_{t})   (6)

Through the global max-pooling layer, the elements in F^i_t with high response values are aggregated into a single vector f^i_t ∈ R^C, and these vectors are aggregated into the video-level partial vector f^i with the help of the TAM module. Therefore, we can calculate the final partial vector as follows:

f^{i}_{t} = \mathrm{GMP}(F^{i}_{t})   (7)

f^{i} = \mathrm{TAM}(f^{i}_{t})   (8)

where GMP is the global max pooling operation and TAM is the temporal aggregation module.
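A minimal sketch of the relation weighting in Eq. (5) and of the stripe-wise pipeline of Eqs. (6)–(8) is given below. How the relation map A_t is computed inside the RGL module is not shown in the excerpt above, so the cosine-similarity weighting used here is only an assumption; the GMP and the weighted temporal mean standing in for the TAM follow the definitions in the text.

```python
import torch
import torch.nn.functional as F

def relation_weight(frames, ref):
    """Eq. (5)-style weighting F_t^R = F_t * A_t. Here A_t is taken as the sigmoid of the
    cosine similarity between each spatial position and the reference vector `ref` in R^C
    (this similarity choice is an assumption)."""
    T, C, H, W = frames.shape
    sim = F.cosine_similarity(frames, ref.view(1, C, 1, 1), dim=1)   # [T, H, W]
    a = torch.sigmoid(sim).unsqueeze(1)                              # [T, 1, H, W]
    return frames * a                                                # element-wise product

def rpl_stripe(frames, stripe, temporal_attn=None):
    """Eqs. (6)-(8) for one stripe: build a partial reference, re-weight the frame features
    with it, then GMP per frame and temporal aggregation (TAM approximated by a weighted mean)."""
    T = frames.shape[0]
    if temporal_attn is None:                       # uniform weights stand in for the TAM scores
        temporal_attn = torch.full((T,), 1.0 / T)
    part_map = (stripe * temporal_attn.view(T, 1, 1, 1)).sum(dim=0)  # F^i, [C, H/p, W]
    f_ref = part_map.amax(dim=(1, 2))                                # partial reference f^i, [C]
    weighted = relation_weight(frames, f_ref)                        # Eq. (6)
    f_t = weighted.amax(dim=(2, 3))                                  # Eq. (7): GMP per frame, [T, C]
    return (f_t * temporal_attn.view(T, 1)).sum(dim=0)               # Eq. (8): TAM, [C]
```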
[Fig. 3: structure of the RPL module — frame-level features pass through the RGL module, a max pooling layer, and the TAM.]
3.4. Temporal attention module

In order to focus on the informative frame-level features, the Temporal Attention Module [17] is applied to weight the features of each frame. Thus, the video-level feature f_v is formulated as:

f_{v} = \frac{1}{T} \sum_{t=1}^{T} a^{t}_{v} f^{t}_{v}   (9)

where f^t_v is the frame-level feature of frame t and a^t_v is its attention score. The input of the TAM is a sequence of frame-level features [T, w, h, 2048], where the kernel size is w × h, the number of input channels is 2048, and the number of output channels is d_t. In order to generate the temporal attention s^t_v, we use a temporal convolutional layer {3, d_t, 1}. A softmax function is then applied to calculate the attention scores a^t_v of each frame-level feature.
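A PyTorch-style sketch of this temporal attention is given below, following the layout stated above ([T, w, h, 2048] inputs, a spatial reduction to d_t channels, a temporal convolution with kernel size 3, and a softmax over the T scores). Any layer detail beyond what the text states, including the default d_t and spatial size, is an assumption.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal attention over T frame-level features, in the spirit of Eq. (9)."""
    def __init__(self, in_channels=2048, d_t=256, spatial=(16, 8)):
        super().__init__()
        w, h = spatial
        # Spatial conv reduces each frame map to a d_t-dim descriptor (kernel = whole map)
        self.spatial_conv = nn.Conv2d(in_channels, d_t, kernel_size=(w, h))
        # Temporal conv over the frame axis with kernel size 3 gives one score per frame
        self.temporal_conv = nn.Conv1d(d_t, 1, kernel_size=3, padding=1)

    def forward(self, frames):                          # frames: [T, C, w, h]
        feats = frames.mean(dim=(2, 3))                 # frame-level vectors f_v^t, [T, C]
        s = self.spatial_conv(frames).flatten(1)        # [T, d_t]
        s = self.temporal_conv(s.t().unsqueeze(0))      # [1, 1, T] temporal scores s_v^t
        a = torch.softmax(s.squeeze(), dim=0)           # attention scores a_v^t, [T]
        # Eq. (9): weighted aggregation, including the 1/T factor of the equation
        return (a.unsqueeze(1) * feats).sum(dim=0) / frames.shape[0]

attn = TemporalAttention()
video_feat = attn(torch.randn(4, 2048, 16, 8))          # -> tensor of shape [2048]
```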
[...] where the margin is a parameter that controls the differences between the intra-class and inter-class distances. The softmax loss function L_softmax is used for discriminative learning as well:

L_{softmax} = -\frac{1}{PK} \sum_{i=1}^{P} \sum_{a=1}^{K} p_{i,a} \log q_{i,a}   (12)

where p_{i,a} and q_{i,a} are the ground-truth identity label and the predicted probability of sample {i, a}, respectively.

We propose a multi-level training scheme that combines video-level and frame-level supervision to improve the feature extraction capability of the network. Both levels of supervision include a global part and a partial part, and each part is supervised by the triplet loss and the softmax loss simultaneously. The video-level loss and the frame-level loss can be calculated by:

L_{v} = \frac{1}{2}\left(L^{vg}_{triplet} + L^{vp}_{triplet}\right) + \frac{1}{2}\left(L^{vg}_{softmax} + L^{vp}_{softmax}\right)   (13)

L_{f} = \frac{1}{N+1} \sum_{i=1}^{N}\left(L^{fg}_{triplet} + L^{fp_i}_{triplet}\right) + \frac{1}{N+1} \sum_{i=1}^{N}\left(L^{fg}_{softmax} + L^{fp_i}_{softmax}\right)   (14)
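As a concrete reference for how the two loss levels of Eqs. (13) and (14) can be combined, the sketch below pairs a cross-entropy (softmax) loss with a batch-hard triplet loss for the global and partial parts of one supervision level. The margin value and the uniform averaging used here are illustrative assumptions, not the authors' exact settings.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet(feats, labels, margin=0.3):
    """Batch-hard triplet loss: hardest positive / hardest negative per anchor."""
    d = torch.cdist(feats, feats)                          # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (d * same.float()).max(dim=1).values     # farthest sample of the same identity
    hardest_neg = d.masked_fill(same, float('inf')).min(dim=1).values
    return F.relu(margin + hardest_pos - hardest_neg).mean()

def level_loss(global_feat, partial_feats, global_logits, partial_logits, labels):
    """One supervision level: triplet + softmax on the global part and on each partial part,
    averaged in the spirit of Eqs. (13)-(14) (normalization details assumed)."""
    parts = [global_feat] + partial_feats
    logits = [global_logits] + partial_logits
    tri = sum(batch_hard_triplet(f, labels) for f in parts) / len(parts)
    ce = sum(F.cross_entropy(l, labels) for l in logits) / len(logits)
    return tri + ce

# Total loss = weighted sum of the video-level and frame-level losses (cf. Eq. (16)):
# L = a * level_loss(<video-level inputs>) + b * level_loss(<frame-level inputs>)
```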
[...]

DukeMTMC-VideoReID is another large-scale video-based Re-ID dataset; it includes 1,812 identities and 4,832 tracklets captured by eight different cameras. The dataset is divided into 408, 702, and 702 identities for distraction, training, and testing, respectively. Each tracklet has 168 frames on average. The three datasets were captured by real-world cameras. However, privacy protection is very important, especially in the field of person Re-ID, which requires large amounts of data to train the model, so data acquisition is a crucial issue for researchers. Generated data, such as the DG-Market dataset [42], can alleviate this problem by providing large-scale training data. Thus, generated data will be an important topic for our further research.

4.1.2. Evaluation metrics

We utilize the Cumulative Matching Characteristic (CMC) curve and the mean Average Precision (mAP) score to evaluate the performance of our model. The CMC indicates the accuracy of person re-identification; we report the Rank-1, Rank-5, and Rank-20 scores of the CMC curve.
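For reference, a simplified sketch of how the Rank-k (CMC) and mAP metrics can be computed from query and gallery features is shown below. It ranks by cosine similarity and omits the per-camera filtering used in the standard MARS protocol, so it illustrates the metrics rather than reproducing the exact evaluation code.

```python
import numpy as np

def evaluate(q_feats, q_ids, g_feats, g_ids, ranks=(1, 5, 20)):
    """q_feats: [Nq, D], g_feats: [Ng, D] L2-normalized features; ids are identity labels."""
    sim = q_feats @ g_feats.T                        # cosine similarity (features normalized)
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(-sim[i])                  # gallery sorted by decreasing similarity
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                 # query identity absent from the gallery
        first_hit = np.argmax(matches)               # rank of the first correct match
        cmc[first_hit:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())    # average precision
    n = len(aps)
    return {f"Rank-{r}": cmc[r - 1] / n for r in ranks}, float(np.mean(aps))
```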
4.1.3. Implementation details

We employ the standard ResNet-50 [20] as the backbone network for frame-level feature extraction. The input frames are resized to 256 × 128, and the batch size is set to 32. For each batch, we randomly select P = 4 samples for each identity and set the sequence length to T = 4. We train our model for 500 epochs and use Adam [21] to optimize the network with a weight decay of 5 × 10^{-4}. The initial learning rate is 3.5 × 10^{-4}, with a decay factor of 0.1 every 100 epochs. In the test phase, we extract the feature for each video sequence, and the video-level features are finally used for retrieval with the cosine distance between the query and the gallery.
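The training configuration above can be set up roughly as follows; the sketch mirrors the stated hyper-parameters (Adam, initial learning rate 3.5e-4, weight decay 5e-4, decay factor 0.1 every 100 epochs), while the model object and the batch sampler are placeholders left to the reader.

```python
import torch

def build_optimizer(model):
    # Adam with the stated learning rate and weight decay
    optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=5e-4)
    # Learning rate decays by a factor of 0.1 every 100 epochs, for 500 epochs in total
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
    return optimizer, scheduler

# Batch construction (assumed PK-style sampling): 32 tracklets per batch, P = 4 tracklets
# per identity, each tracklet a sequence of T = 4 frames resized to 256 x 128.
# At test time, retrieval uses the cosine distance between video-level features.
```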
4.2. Results visualization

In order to evaluate the effect of the model more intuitively, we visualize the retrieval results and the feature maps generated by the baseline and our model. Since privacy protection of the data is important, we hide the faces of pedestrians in the visualizations to protect personal privacy.

4.2.1. Retrieval results analysis

As illustrated in Fig. 4, we visualize some video-based person Re-ID results achieved by the baseline and our proposed method on MARS, DukeMTMC-VideoReID, and iLIDS-VID. The first sequence is the query, and the remainder are the Rank-1 to Rank-5 (from left to right) retrieved results in the gallery. The first and second rows are the matching results of the baseline and our model, respectively, so the top-5 returned video sequences of the two methods for each query are shown in Fig. 4. Candidates with green boxes belong to the same identity as the query, while red boxes mark incorrectly matched sequences. From the visualization, we can observe that our model matches the correct pedestrians in the first ranking results, while the baseline method fails to find the same person at the top rank. For example, in the first query, our model is more discriminative in distinguishing targets with similar clothes under spatial misalignments, lighting changes, and viewing angle changes.

4.2.2. Visualization of feature maps

We visualize the feature maps generated by the baseline and our model on the MARS dataset in Fig. 5. The frames include varying conditions such as pose, scale, and partial occlusions. As shown in Fig. 5 (a), compared with the feature maps extracted by the baseline, the proposed model can mine the discriminative cues with less background information. In Fig. 5 (b), when the target occupies a small region of the frame, the features of the baseline spread over regions beyond the target pedestrian, whereas our model still focuses on the target and ignores irrelevant information such as lawns, trees, and lampposts. In addition, our model identifies the accessories (bag and umbrella) that belong to the target, which is significant information for discriminating different people. As shown in Fig. 5 (c), when the target pedestrian is partially occluded by environmental objects, the baseline simultaneously extracts the features of the target and of the occluding objects, which harms the representation of the target pedestrian. In addition, to compare the global and partial features, we show the feature maps extracted by the global and partial branches, respectively. Compared with the global features, the features generated by the partial branch capture fine-grained details, such as umbrellas or patterns on clothes. Meanwhile, we can observe some variation between the features from the global and partial branches; these two types of features complement each other to assemble a more discriminative representation of the target. The visualization of the feature maps further validates that our proposed method highlights more discriminative features and suppresses redundant information.

4.3. Comparison with the state-of-the-art methods

In this section, we compare our proposed approach with state-of-the-art methods on the MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets, respectively. None of the compared approaches use the re-ranking strategy in post-processing. The results are reported in Table 1.

4.3.1. MARS

Comparisons between our method and other approaches on the MARS dataset show that our model achieves 84.5% mAP and outperforms all previous methods by more than 1.5%. Even on the CMC curve, our proposed method achieves competitive results. It is worth noting that the recently proposed AdaptiveGraph [28] achieves impressive Rank-1 accuracy, but it utilizes extra pose information to generate the graph and incurs a huge computational burden. Compared with this work, our method achieves competitive performance without additional information. The experimental results validate the superiority and effectiveness of our model.

4.3.2. DukeMTMC-VideoReID

Because DukeMTMC-VideoReID is a newly proposed dataset for video-based person Re-ID, only a few works have released their performance on it. As shown in Table 1, our proposed method achieves the best performance, with 97.2% Rank-1 accuracy and 95.9% mAP. Note that the DPRM [32] method presents competitive performance, but it ignores the learning of fine-grained details, which is important for enhancing the discriminative features of the target person. Compared with those works, our method achieves state-of-the-art results.

4.3.3. iLIDS-VID

The comparisons with recent works on the iLIDS-VID dataset are shown in Table 1. From the table, we observe that our proposed method achieves 88.7% Rank-1 accuracy, outperforming the other published results. The reason lies in the use of both the fine-grained details and the relationship between frames in our model. Therefore, it is reasonable to believe that our proposed method achieves better performance than other state-of-the-art approaches.
Fig. 4. Visualization of video-based Re-ID results using the baseline model and the proposed model on the three datasets. For each query, we show the top-5 ranking match
results. The green and red bounding boxes respectively denote correct and incorrect matches. (For interpretation of the references to color in this figure legend, the reader is
referred to the web version of this article.)
4.4. Ablation studies

4.4.1. Effectiveness of each component

We evaluate the contribution of each component and show the ablation results in Table 2. In the training stage, we add each component to the baseline [17] separately with the same training settings. In the first row, the results of the baseline are obtained by performing temporal attention on the frame-level features to calculate the video-level representations. The standard baseline achieves 76.7% mAP and 83.2% Rank-1 accuracy on the MARS dataset. "+RGL" means that we add the RGL module to the global branch to enhance the frame-level global features. We can see that the RGL module increases the Rank-1 accuracy by 4.0% and the mAP by 7.0% on the MARS dataset. In order to explore the partial features, we design a partial branch that horizontally partitions the frame-level feature maps into several partial regions and utilizes the temporal attention operation to generate the video-level partial features. "+RPL" means that we add the RPL module to the partial branch to guide the feature from the partial view, which improves the Rank-1 accuracy by 1.4% and the mAP by 4.3% on the MARS dataset. We then combine the two branches and further incorporate the multi-level training scheme (MT). The Rank-1 accuracy and mAP are improved from 83.2% and 76.7% to 89.1% and 84.5% on the MARS dataset, respectively. In addition, we also evaluate the model parameters and computational complexity of each component. We observe that our proposed method improves the performance significantly with few additional parameters and little extra computational complexity.

4.4.2. Influences of different branches

We perform experiments to verify the effectiveness of the global and partial branches. The results on MARS are shown in Table 3. "with RGL" and "with RPL" mean that we add the RGL or RPL module to the corresponding branch, respectively. By comparing the performance of the different branches, we can observe that the combination of the global and partial branches indeed achieves better performance than any single branch. These results also indicate that the complementarity of global and partial information enhances the discriminative representation of the target.

4.4.3. Effectiveness of the multi-level training scheme

To validate the effectiveness of the multi-level training scheme, we perform experiments on the MARS dataset. The results are reported in Table 4. "VL" denotes the video-level loss, which is used to supervise the video-level feature; "FGL" and "FPL" denote the frame-level global loss and partial loss. We can observe that the combination of the video-level loss, the frame-level global loss, and the frame-level partial loss achieves the best performance. The results imply that the multi-level training scheme supervises our model to learn more discriminative features.

4.4.4. Evaluating the balance of the losses

To evaluate the balance of the video-level loss (VL) and the frame-level loss (FL), we introduce two weight parameters (a, b). The overall loss function L can then be calculated by:

L = a L_{v} + b L_{f}   (16)

We perform experiments on the MARS dataset by changing the values of the two parameters. As shown in Table 5, different weights have little effect on performance. The best result is obtained when a = 1 and b = 1, which indicates that these two losses are equally important for our model.

4.4.5. Comparison of different sequence lengths

To investigate how the sequence length influences the final performance, we conduct experiments with four different sequence lengths: T = 2, 4, 6, and 8.
[Fig. 5: visualization of feature maps on MARS — rows show the frame sequences, the baseline, and our model (holistic features, partial features, and combined holistic-partial features).]

Table 1
Comparison with the state-of-the-art video-based person Re-ID methods on the MARS, DukeMTMC-VideoReID, and iLIDS-VID datasets. The 1st, 2nd, and 3rd best results are emphasized with bold, italic, and bold-italic, respectively.
The results are listed in Table 6. From the table, we can see that T = 4 achieves the best performance. One possible reason is that, as the length of the sequence increases, the model obtains more information and the performance improves accordingly. However, the confusion within each group of frames also increases with the sequence length, which makes it more difficult for the model to extract effective features.
Table 2
Comparison of the different proposed components on the MARS dataset.
Table 3
Comparison of the different branches on the MARS dataset.
Table 4
The benefit of using the multi-level training scheme on the MARS dataset.
Table 5
Evaluation of the balance of the losses on the MARS dataset.
Therefore, the model achieves the best performance when T = 4. When T = 2, the amount of information obtained by the model is not enough, and when T = 6 or 8, it becomes more difficult for the model to extract effective information.

4.4.6. Comparison of different numbers of split regions

Intuitively, the number of split regions determines the granularity of the partial features. However, accuracy does not always increase as the number of split regions increases. Considering the size of the feature map extracted by the backbone, we conduct experiments with three different numbers of split regions: N = 2, 4, and 8. We show the results in Table 7. In these experiments, we can observe that the model achieves the best results with 4 split regions. When we increase the number of split regions from 4 to 8, the performance degrades, since with N = 8 the split regions are too small to contain enough features for distinguishing different pedestrians.

4.4.7. Comparison of different generation methods for the reference vector

In order to explore the influence of different reference vector generation methods on the performance, we conduct experiments on the MARS dataset. Three reference vector generation methods are selected. From Table 8, we can observe that our proposed method, which utilizes temporal attention, achieves favorable accuracy. The reason is that the temporal attention method can effectively suppress the influence of poor-quality frames on the generated reference vector while aggregating the important information of each frame. The first-frame selection method fails to avoid poor-quality frames. Although the temporal average method can aggregate the information of all video frames, it treats all frames equally, so poor-quality frames still affect the final reference vector.
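The three reference-generation strategies compared in Table 8 can be summarized by the sketch below, where `attn` stands for the temporal attention scores produced by the TAM; the first two options are the baselines discussed above, and the attention-weighted version corresponds to the strategy adopted in this paper (implementation details assumed).

```python
import torch

def reference_vector(frame_feats, method="attention", attn=None):
    """frame_feats: [T, C] frame-level feature vectors of one tracklet."""
    if method == "first":          # first-frame reference: sensitive to a poor first frame
        return frame_feats[0]
    if method == "average":        # temporal average: all frames weighted equally
        return frame_feats.mean(dim=0)
    if method == "attention":      # attention-weighted: can down-weight poor-quality frames
        if attn is None:           # uniform fallback when no attention scores are supplied
            attn = torch.full((frame_feats.shape[0],), 1.0 / frame_feats.shape[0])
        return (attn.unsqueeze(1) * frame_feats).sum(dim=0)
    raise ValueError(method)
```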
Table 6
Comparison of different sequence lengths T = 2, 4, 6, and 8 on the MARS dataset.
Table 7
Comparison of different numbers of split regions N = 2, 4, and 8 on the MARS dataset.
Table 8
Comparison of different reference vector generation methods on the MARS dataset.
Table 9
Comparison of different attention methods applied to the baseline on the MARS dataset.
Table 10
Comparison of different loss functions on the MARS dataset.
4.4.8. Comparison with different attention methods

We compare various attention methods with our RGL module. For a fair comparison, all methods are applied to the baseline. As shown in Table 9, all the attention models bring some performance improvement over the baseline. It is worth noting that our method achieves the best performance, outperforming the baseline by 7.0% mAP and 4.0% Rank-1 accuracy.

4.4.9. Comparison with different loss functions

In order to explore the influence of different loss functions combined with the softmax loss on the performance, we conduct experiments on the MARS dataset. For a fair comparison, all loss functions are applied to our model. As shown in Table 10, our training method, which utilizes the hard triplet loss combined with the softmax loss, achieves the best performance.

5. Conclusion

In this paper, we propose a novel relation-based global-partial feature learning network for video-based person Re-ID, which effectively enhances the discriminative information and suppresses the redundant information. We design an RGL module and an RPL module to generate the correlation maps between the frames of the video sequence under global and partial reference features, exploring global relation-based features and more fine-grained partial relation-based details. Besides, we propose a multi-level training scheme that combines the frame-level loss and the video-level loss to deeply supervise our model and learn a more accurate and suitable representation. Extensive experiments and ablation studies conducted on three public video-based person Re-ID datasets confirm that our proposed method achieves state-of-the-art performance.

CRediT authorship contribution statement

Fan Yang: Conceptualization, Methodology, Validation, Investigation, Formal analysis, Writing – original draft, Visualization. Xiangtong Wang: Writing – review & editing. Xuan Zhu: Writing – review & editing. Binbin Liang: Writing – review & editing. Wei Li: Writing – review & editing, Funding acquisition, Supervision, Project administration, Resources.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the Project of Sichuan Science and Technology Department (grant nos. 2020YFG0134 and 2022YFG0153), the Funding from Sichuan University (grant nos. GSJDJS2021010, 2020SCUNG205, and 2021SCUVS005), and the funding of the Civil Aircraft Fire Science and Safety Engineering Key Laboratory of Sichuan Province (grant no. MZ2022KF10).
References

[1] L. Zheng, Z. Bie, Y. Sun, J. Wang, Q. Tian, "MARS: A Video Benchmark for Large-Scale Person Re-Identification," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 868–884, doi: 10.1007/978-3-319-46466-4_52.
[2] N. McLaughlin, J. Martinez del Rincon, P. Miller, "Recurrent Convolutional Network for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1325–1334, doi: 10.1109/CVPR.2016.148.
[3] D. Chung, K. Tahboub, E.J. Delp, "A Two Stream Siamese Convolutional Neural Network for Person Re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1992–2000, doi: 10.1109/ICCV.2017.218.
[4] Z. Chen, Z. Zhou, J. Huang, P. Zhang, B. Li, "Frame-Guided Region-Aligned Representation for Video Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 10591–10598, doi: 10.1609/aaai.v34i07.6632.
[5] G. Zhang, Y. Chen, Y. Dai, Y. Zheng, Y. Wu, "Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), 2021, pp. 1–6, doi: 10.1109/ICME51207.2021.9428118.
[6] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, X. Yang, "Person Re-Identification via Recurrent Feature Aggregation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2016, pp. 701–716, doi: 10.1007/978-3-319-46466-4_42.
[7] Z. Zhou, Y. Huang, W. Wang, L. Wang, T. Tan, "See the Forest for the Trees: Joint Spatial and Temporal Recurrent Neural Networks for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6776–6785, doi: 10.1109/CVPR.2017.717.
[8] S. Li, S. Bak, P. Carr, X. Wang, "Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 369–378, doi: 10.1109/CVPR.2018.00046.
[9] H. Park, B. Ham, "Relation Network for Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 11839–11847, doi: 10.1609/aaai.v34i07.6857.
[10] Z. Zhang, C. Lan, W. Zeng, X. Jin, Z. Chen, "Relation-Aware Global Attention for Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 3183–3192, doi: 10.1109/CVPR42600.2020.00325.
[11] Z. Zhang, C. Lan, W. Zeng, Z. Chen, "Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-Based Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10404–10413, doi: 10.1109/CVPR42600.2020.01042.
[12] Y. Sun, L. Zheng, Y. Yang, Q. Tian, S. Wang, "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)," in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 480–496, doi: 10.1007/978-3-030-01225-0_30.
[13] D. Cheng, Y. Gong, S. Zhou, J. Wang, N. Zheng, "Person Re-identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1335–1344, doi: 10.1109/CVPR.2016.149.
[14] L. Wei, S. Zhang, H. Yao, W. Gao, Q. Tian, "GLAD: Global-Local-Alignment Descriptor for Scalable Person Re-Identification," IEEE Trans. Multimedia 21 (4) (2019) 986–999, doi: 10.1109/TMM.2018.2870522.
[15] H. Zhao, M. Tian, S. Sun, S. Jing, X. Tang, "Spindle Net: Person Re-identification with Human Body Region Guided Feature Decomposition and Fusion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 907–915, doi: 10.1109/CVPR.2017.103.
[16] G. Wang, Y. Yuan, X. Chen, J. Li, X. Zhou, "Learning Discriminative Features with Multiple Granularities for Person Re-Identification," in Proc. ACM Multimedia Conf. (MM), 2018, pp. 274–282, doi: 10.1145/3240508.3240552.
[17] J. Gao, R. Nevatia, "Revisiting Temporal Modeling for Video-based Person ReID," arXiv:1805.02104, 2018.
[18] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, Y. Yang, "Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5177–5186, doi: 10.1109/CVPR.2018.00543.
[19] T. Wang, S. Gong, X. Zhu, S. Wang, "Person Re-identification by Video Ranking," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, doi: 10.1007/978-3-319-10593-2_45.
[20] K. He, X. Zhang, S. Ren, J. Sun, "Deep Residual Learning for Image Recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778, doi: 10.1109/CVPR.2016.90.
[21] D. Kingma, J. Ba, "Adam: A Method for Stochastic Optimization," arXiv:1412.6980, 2014.
[22] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, P. Zhou, "Jointly Attentive Spatial-Temporal Pooling Networks for Video-Based Person Re-identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4743–4752, doi: 10.1109/ICCV.2017.507.
[23] A. Subramaniam, A. Nambiar, A. Mittal, "Co-Segmentation Inspired Attention Networks for Video-Based Person Re-Identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 562–572, doi: 10.1109/ICCV.2019.00065.
[24] J. Li, S. Zhang, J. Wang, W. Gao, Q. Tian, "Global-Local Temporal Representations for Video Person Re-Identification," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3957–3966, doi: 10.1109/ICCV.2019.00406.
[25] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, X. Chen, "VRSTC: Occlusion-Free Video Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7176–7185, doi: 10.1109/CVPR.2019.00735.
[26] S. Li, H. Yu, H. Hu, "Appearance and Motion Enhancement for Video-Based Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 34 (7) (2020) 11394–11401, doi: 10.1609/aaai.v34i07.6802.
[27] Y. Fu, X. Wang, Y. Wei, T. Huang, "STA: Spatial-Temporal Attention for Large-Scale Video-Based Person Re-Identification," Proc. AAAI Conf. Artif. Intell. 33 (2019) 8287–8294, doi: 10.1609/aaai.v33i01.33018287.
[28] Y. Wu, O. Bourahla, X. Li, F. Wu, X. Zhou, "Adaptive Graph Representation Learning for Video Person Re-Identification," IEEE Trans. Image Process. 29 (2020) 8821–8830, doi: 10.1109/TIP.2020.3001693.
[29] P. Li, P. Pan, P. Liu, M. Xu, Y. Yang, "Hierarchical Temporal Modeling With Mutual Distance Matching for Video Based Person Re-Identification," IEEE Trans. Circuits Syst. Video Technol. 31 (2) (2021) 503–511, doi: 10.1109/TCSVT.2020.2988034.
[30] Z. Wang et al., "Robust Video-based Person Re-Identification by Hierarchical Mining," IEEE Trans. Circuits Syst. Video Technol., 2021, doi: 10.1109/TCSVT.2021.3076097.
[31] D. Wu, M. Ye, G. Lin, X. Gao, J. Shen, "Person Re-Identification by Context-aware Part Attention and Multi-Head Collaborative Learning," IEEE Trans. Inf. Forensics Secur. (2021), doi: 10.1109/TIFS.2021.3075894.
[32] X. Yang, L. Liu, N. Wang, X. Gao, "A Two-Stream Dynamic Pyramid Representation Model for Video-Based Person Re-Identification," IEEE Trans. Image Process. 30 (2021) 6266–6276, doi: 10.1109/TIP.2021.3093759.
[33] M. Jiang, B. Leng, G. Song, Z. Meng, "Weighted triple-sequence loss for video-based person re-identification," Neurocomputing 381 (2020) 314–321.
[34] W. Gong, B. Yan, C. Lin, "Flow-guided feature enhancement network for video-based person re-identification," Neurocomputing 383 (2020) 295–302, doi: 10.1016/j.neucom.2019.11.050.
[35] G. Lin, S. Zhao, J. Shen, "Video person re-identification with global statistic pooling and self-attention distillation," Neurocomputing 381 (2021) 777–789, doi: 10.1016/j.neucom.2020.05.111.
[36] J. Si et al., "Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 5363–5372, doi: 10.1109/CVPR.2018.00562.
[37] L. Zhang et al., "Ordered or Orderless: A Revisit for Video Based Person Re-Identification," IEEE Trans. Pattern Anal. Mach. Intell. 43 (4) (2021) 1460–1466, doi: 10.1109/TPAMI.2020.2976969.
[38] T. Lin, A. RoyChowdhury, S. Maji, "Bilinear Convolutional Neural Networks for Fine-Grained Visual Recognition," IEEE Trans. Pattern Anal. Mach. Intell. 40 (6) (2018) 1309–1322, doi: 10.1109/TPAMI.2017.2723400.
[39] J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, "Squeeze-and-Excitation Networks," IEEE Trans. Pattern Anal. Mach. Intell. 42 (8) (2020) 2011–2023, doi: 10.1109/TPAMI.2019.2913372.
[40] A. Vaswani et al., "Attention Is All You Need," in Proc. 31st Int. Conf. Neural Information Processing Systems (NIPS), 2017, pp. 6000–6010, doi: 10.5555/3295222.3295349.
[41] S. Woo, J. Park, J.Y. Lee, "CBAM: Convolutional Block Attention Module," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 3–19, doi: 10.1007/978-3-030-01234-2_1.
[42] Z. Zheng, X. Yang, Z. Yu, L. Zheng, Y. Yang, J. Kautz, "Joint Discriminative and Generative Learning for Person Re-Identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2133–2142, doi: 10.1109/CVPR.2019.00224.
[43] K. Simon, G. Lausen, "ViPER: Augmenting Automatic Information Extraction with Visual Perceptions," in Proc. ACM Int. Conf. Information & Knowledge Management (CIKM), Oct. 2005, pp. 381–388, doi: 10.1145/1099554.1099672.
[44] X. Mao et al., "Integrating Coarse Granularity Part-Level Features with Supervised Global-Level Features for Person Re-Identification," ZTE Communications 19 (1) (2021) 72–81, doi: 10.12142/ZTECOM.202101009.
[45] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, J. Sun, "AlignedReID: Surpassing Human-Level Performance in Person Re-Identification," arXiv:1711.08184, 2017.
[46] X. Bai, M. Yang, T. Huang, Z. Dou, R. Yu, Y. Xu, "Deep-Person: Learning Discriminative Deep Features for Person Re-Identification," arXiv:1711.10658, 2017.
[47] D. Li, X. Chen, Z. Zhang, K. Huang, "Learning Deep Context-Aware Features over Body and Latent Parts for Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7398–7407, doi: 10.1109/CVPR.2017.782.
[48] F. Wang, M. Jiang, Q. Chen et al., "Residual Attention Network for Image Classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6450–6458, doi: 10.1109/CVPR.2017.683.
[49] S. Li, S. Bak, P. Carr et al., "Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 369–378, doi: 10.1109/CVPR.2018.00046.
[50] D. Chen, H. Li, X. Tong et al., "Video Person Re-identification with Competitive Snippet-Similarity Aggregation and Co-attentive Snippet Embedding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 1169–1178, doi: 10.1109/CVPR.2018.00128.
Fan Yang received his B.S. degree from the School of Information and Communication Engineering, North University of China, Taiyuan, Shanxi, China, in 2018. He is currently pursuing the M.A.Sc. degree with the School of Aeronautics and Astronautics, Sichuan University. His research interests include computer vision and video analysis.

Xiangtong Wang received the B.S. degree from the School of Computer Science, Sichuan Normal University, in 2018, and the M.Eng. degree from the School of Aeronautics and Astronautics, Sichuan University, in 2021. He is currently pursuing the Ph.D. degree with the School of Aeronautics and Astronautics, Sichuan University. His research interests include computer vision and computer networks.

Binbin Liang received the B.S. and M.S. degrees from the College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, China, in 2012 and 2015, respectively. He is currently a Ph.D. candidate in the School of Aeronautics and Astronautics, Sichuan University. His research interests include multi-sensor data fusion and artificial intelligence.

Wei Li received the B.S. and Ph.D. degrees from the School of Mechatronic Engineering, Beijing Institute of Technology, China, in 2003 and 2008, respectively. He joined the University of Electronic Science and Technology of China in 2008. From 2010 to 2011, he was a postdoctoral member at the Center of Industrial Electronics, Polytechnic University of Madrid, Spain. In 2011, he became an associate professor at the University of Electronic Science and Technology of China. Currently, he is an associate professor at the School of Aeronautics and Astronautics, Sichuan University, China. His research interests include computer vision, video surveillance, and camera networks.