2307.05853v2

GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human
Pose Estimation from Monocular Video
Bruce X.B. Yu1 Zhi Zhang1 Yongxu Liu1

Sheng-hua Zhong2 Yan Liu1 Chang Wen Chen1
1 2
The Hong Kong Polytechnic University Shenzhen University
arXiv:2307.05853v2 [cs.CV] 22 Jul 2023
Abstract from advanced motion sensors such as motion capture sys-

tems, depth sensors, or stereotype cameras [52, 74]. The
3D human pose estimation has been researched for 3D HPE task can be performed under either multi-view or
decades with promising fruits. 3D human pose lifting is one monocular view settings. Although state-of-the-art multi-
of the promising research directions toward the task where view methods [31, 53, 79, 27] generally show superior
both estimated pose and ground truth pose data are used performance than monocular ones [29, 78], ordinary RGB
for training. Existing pose lifting works mainly focus on monocular cameras are much cheaper than these off-the-
improving the performance of estimated pose, but they usu- shelf motion sensors and more widely applied in real-world
ally underperform when testing on the ground truth pose surveillance scenarios. Hence, 3D HPE from a monocular
data. We observe that the performance of the estimated video is an important and challenging task, which has been
pose can be easily improved by preparing good quality 2D attracting increasing research interest. Recent monocular
pose, such as fine-tuning the 2D pose or using advanced view works can be grouped into model-based and model-
2D pose detectors. As such, we concentrate on improving free methods [14]. Model-based methods [16, 24] incorpo-
the 3D human pose lifting via ground truth data for the fu- rate parametric body models such as kinematic [68], pla-
ture improvement of more quality estimated pose data. To- nar [61], and volumetric models [2] for 3D HPE. Model-
wards this goal, a simple yet effective model called Global- free methods can be further grouped into single-stage and
local Adaptive Graph Convolutional Network (GLA-GCN) 2D to 3D lifting methods. Single-stage methods estimate
is proposed in this work. Our GLA-GCN globally mod- the 3D pose directly from images in an end-to-end manner
els the spatiotemporal structure via a graph representation [37, 47, 13, 66, 44, 82]. 2D to 3D lifting methods have
and backtraces local joint features for 3D human pose es- an intermediate 2D pose estimation layer [45, 51, 42, 62].
timation via individually connected layers. To validate our Among these methods, 2D to 3D lifting methods imple-
model design, we conduct extensive experiments on three mented with ground truth 2D poses achieved better perfor-
benchmark datasets: Human3.6M, HumanEva-I, and MPI- mance.
INF-3DHP. Experimental results show that our GLA-GCN1
The advantages of 2D to 3D lifting methods can be sum-
implemented with ground truth 2D poses significantly out-
marized as two main points: allowing make use of advances
performs state-of-the-art methods (e.g., up to 3%, 17%, and
in 2D human pose detection and exploiting temporal infor-
14% error reductions on Human3.6M, HumanEva-I, and
mation along multiple 2D pose frames [51, 32]. For the 2D
MPI-INF-3DHP, respectively).
human pose detection, it has achieved remarkable progress
via detectors such as Mask R-CNN (MRCNN) [25], Cas-
caded Pyramid Network (CPN) [15], Stacked Hourglass
1. Introduction (SH) detector [48], and HR-Net [59]. The intermediate 2D
3D Human Pose Estimation (HPE) in videos aims to pose estimation stage via these 2D pose detectors signifi-
predict the pose joint locations of the human body in 3D cantly reduces the data volume and complexity of the 3D
space, which can facilitate plenty of applications such as HPE task. For the temporal information, existing main-
video surveillance, human-robot interaction, and physio- stream methods [51, 42, 62, 29, 38, 78] gained notice-
therapy [54]. 3D human poses can be directly retrieved able improvements by feeding a long sequence of 2D pose
frames to their models, among which [78] achieved the
1 Code is available: https://github.com/bruceyo/GLA-GCN state-of-the-art performance via ground truth 2D poses. Re-
cent methods [78, 85] simply fine-tuned these 2D pose de- able margins e.g., up to 3% and 17% error reductions on
tectors on the target datasets and achieved great improve- Human3.6M [30] and HumanEva-I [58], respectively.
ments for the performance of estimated 2D pose data but
remain far behind the results of using ground truth 2D pose, 2. Related Work
which motivates us to concentrate on improving the 3D
HPE via ground truth 2D pose data for potential improve- 2D to 3D Lifting. 3D HPE is a traditional vision prob-
ments via future more quality estimated 2D pose data. lem that has been studied for decades [19, 76, 8, 64, 23,
33, 65, 67, 63, 43]. Existing works of 3D HPE from a
Given the promising performance and advantages of 2D
monocular view usually target two main scenarios: single
to 3D lifting methods, our work contributes the literature
person and multi-person [14]. This work aims to improve
along this direction. For 2D to 3D lifting approaches, since
the performance of single person 3D HPE. [34, 1, 71] rep-
[45] proposed Fully Connected Network (FCN), recent ad-
resent early efforts that attempt to infer 3D position from
vanced models have three main groups: Temporal Convolu-
2D projections. They usually rely on manually chosen pa-
tional Network (TCN)-based [51, 42], Graph Convolutional
rameters based on assumptions about pose joint mobility.
Network (GCN)-based [80, 62, 29], and Transformer-based
Methods [24, 77] estimating 3D pose from less frames or
ones [39, 38, 78]. On the one hand, we observe that exist-
even a single frame has shown great progress but can be
ing TCN- and Transformer-based methods can receive large
a lack of considering temporal information. Recent ad-
receptive fields (i.e., a long 2D pose sequence) with strided
vances in 2D human pose estimation [48, 25, 15] enable
convolutions. However, it can be difficult to make further
2D to 3D lifting approaches to achieve remarkable perfor-
intuitive designs to backtrace local joint features based on
mance over other counterparts. Inspired by [45], there has
the pose structure, since the 2D pose sequence is flattened
been more well-designed learning architectures being pro-
and fed to the model. Meanwhile, the estimation of dif-
posed to improve the performance, in particular, by utiliz-
ferent pose joints relies on the same fully connected layer,
ing temporal information. These methods are also known
which lacks considering the independent characteristic of
as 2D to 3D lifting, which can be grouped into three direc-
different pose joints. On the other hand, GCN-based mod-
tions: TCN-, GCN-, and Transformer-based architectures
els can explicitly reserve the structure of 2D and 3D human
[51, 42, 80, 62, 29, 38, 78].
pose during convolutional propagation. However, this ad-
TCN-based methods [51, 42] successfully push the per-
vantage of GCN remains under-explored. Existing GCN-
formance of 2D to 3D lifting forward with a strided de-
based methods [80, 62] also utilized a fully connected layer
sign for their learning architectures built upon 1D CNN
for the estimation of different 3D pose joints, which does
layers. The strided design is on the temporal dimension
not consider the structural features of GCN representations.
of the input, which allows the features to shrink from a
To this end, we propose Global-local Adaptive GCN 2D pose sequence to a feature embedding for the 3D pose
(GLA-GCN) for 2D to 3D human pose lifting. Our GLA- estimation via a final fully connected layer. The number
GCN contains two modules: global representation and lo- of channels for the fully connected layer is conventionally
cal 3D pose estimation. In the global representation, we set to 1024, which is shared to predict the 3D positions of
use an adaptive Graph Convolutional Network (GCN) to re- all pose joints. While varied numbers of input 2D pose
construct the global representation of an intermediate 3D frames have been extensively investigated, which shows in-
human pose sequence from its corresponding 2D sequence. put 2D pose frames with reasonable length can benefit the
For the local 3D pose joint estimation, we temporally shrink 3D pose reconstruction. The strided design can effectively
the global representation optimized by the reconstructed 3D reduce the feature size by shrinking the number of tempo-
pose sequence with a strided design. Then, an individual ral frames along propagation of several TCN blocks. Using
connected layer is proposed to locally estimate the 3D hu- this strided structure, Transformer-based methods [38, 78]
man pose joints from the shrunken global representation. show promising performance, especially [78] that takes ad-
Our contributions can be threefold as follows: vantage of weighted and temporal loss functions and helps
• We propose a global-local learning architecture that it outperform the GCN-based methods optimized with an
leverages the global spatialtemporal representation and lo- additional motion loss [62, 29]. The motion loss was shown
cal joint representation in the GCN-based model for 3D hu- not very effective in [78]. These observations compel us
man pose estimation. to explore effective models in the direction of GCN-based
• We are the first to introduce an individual connected models with the inspiring designs in mind but without rely-
layer that has two components to divide joint nodes and in- ing on various novel loss functions.
put the joint node representation for 3D pose joint estima- Graph Convolutional Network. A popular method repre-
tion instead of based on pooled features. senting the pose data with GCN is Spatial Temporal GCN
• Our GLA-GCN model performs better than corre- (ST-GCN) [70], which is originally proposed to model large
sponding state-of-the-art methods [38, 78] with consider- receptive fields for the skeleton-based action recognition.
F(96,3,17) F(96,1,17)
P: F(3,243,17)
P: F(2,243,17)
AGCN(96,3,1)
F(96,81,17) P: F(3,17)
F(96,243,17)
AGCN(96,96,3)
AGCN(96,96,1)
AGCN(96,96,1)
BatchNorm 1D
AGCN(2,96,1)
Reconstructed 3D Pose
Reconstruct 3D Pose Sequence Strided Learning Individual Connected Layer
2D Pose Sequence
Global Representation Local 3D Pose Joint Estimation
Figure 1. Learning architecture of our GLA-GCN. AGCN (Cin , Cout , S) represents AGCN blocks with the specific values of the input
channel, output channel, and stride length. F (C ′ , T ′ , N ′ ) represents the size of a feature map. The individual connected layer shows the
prediction process of four pose joint examples that use separate 1D CNN layers.
Following ST-GCN, advanced GCN models have been pro- lize the structured features of different pose joints to locally
posed to advance 3D HPE [18, 80, 17, 62]. predict their corresponding 3D pose locations.
Regarding GCN-based models for 3D HPE, Ci et al. [18]
proposed Locally Connected Network (LCN) that takes the 3. Method
advantages of FCN [45] and GCN [20]. LCN has the sim-
ilar design for the convolutional filters to ST-GCN [70], Given the temporal information of a 2D human pose se-
which defines a neighbor set for a node based on the dis- quence estimated from a video P = {pt,i ∈ R2 | t =
tance to perform convolutional operation. Zhao et al. [80] 1, ..., T ; i = 1, ..., N }, where T is the number of pose
proposed an architecture called SemGCN that stacks GCN frames and N is the number of pose joints, we aim to uti-
layers by flatten output to a fully connected layer. The op- lize this 2D pose sequence P to reconstruct 3D coordinates
timization of SemGCN is based on both joint positions and of pose joints P̄ = {p̄i ∈ R3 |i = 1, ..., N }. Figure 1
bone vectors. Choi et al. [17] also proposed to use GCN to shows the learning architecture of our GLA-GCN, which
recover 3D human pose and mesh from a 2D human pose. uses AGCN layers to globally represent the 2D pose se-
Liu et al. [41] investigated how weight sharing schemes quence and locally estimate the 3D pose via an individual
in GCNs affect the pose lifting task, which shows the pre- connected layer. In the following of this section, we intro-
aggregation method leads to relatively better performance. duce the detailed design of our GLA-GCN.
The architecture in [41] is similar with that of SemGCN.
The above mentioned GCN-based methods achieved good 3.1. Global Representation
performance via a single pose frame input but they did not Adaptive Graph Convolutional Network. An AGCN
take the advantage of temporal information in a 2D pose block [36, 57] is based on the GCN with an adaptive de-
sequence. sign that improves the flexibility of a typical ST-GCN block
Taking multiple 2D pose frames as input, U-shaped [70]. Let us represent the 2D pose sequence P as a spatial-
Graph Convolution Networks (UGCN) [62, 29] further im- temporal graph G = {υt , εt |t = 1, ..., T }, where υt =
proves the performance of GCN-based methods by paying {υt,i |i = 1, ..., N } represents the pose joints and εt repre-
attention to the temporal characteristics of a pose motion. sents the corresponding pose bones. To implement a basic
Specifically, UGCN utilizes spatial temporal GCN [70] to ST-GCN block, a neighbor set Bi is first defined to indi-
predict a 3D pose sequence from a 2D pose sequence for cate the spatial graph convolutional filter for a specific pose
the reconstruction of a single 3D pose frame. A motion joint υt,i . Specifically, for the graph convolutional filter of a
loss term that regulates the temporal trajectory of pose joints vertex node, we apply three distance neighbor subsets: the
based on the prediction of a 3D pose sequence and its cor- vertex itself, centripetal subset, and centrifugal subset. The
responding ground truth 3D pose sequence. Despite the im- definitions of centripetal and centrifugal subsets are based
provements grained with novel loss terms in works such as on the pose frame’s gravity center (i.e., the average coordi-
SemGCN and UGCN, we aim to contribute the literature nate of all pose joints). Centripetal and centrifugal subsets
of 2D-3D lifting by using the consistent loss term used in represent nodes that are closer and farther to the average
[51, 42]. In our model design, we propose to incorporate distance from the gravity center, respectively. Empirically,
the strided convolutions to a GCN-based model that rep- similar with 2D convolution, we set the kernel size K to 3,
resents global information of a 2D pose sequence. Based which will lead to 3 subsets in Bi . To implement the sub-
on the structure of GCN representation, we explicitly uti- sets, a mapping ht,i → {0, ..., K − 1} is used to index each
subset with a numeric label, where centripetal and centrifu- Sequence]). Here, each AGCN block has three key parame-
gal subsets are respectively labeled as 1 and 2. Subsets that ters: the number of input channels Cin , the number of out-
have the average distance to gravity center is indexed to 0. put channels Cout , and the stride S of the temporal convolu-
This graph convolutional operation can be written as tion, while the other parameters are kept consistent (e.g., the
temporal convolution kernel size is three). Given an input
X 1 Cin -dim pose representation F (Cin , Tin , N ), the AGCN
fout (υt,i ) = fin (υt,j ) W (ht,i (υt,j )) (1)
υt,j ∈Bi Zt,j block derives the output Cout -dim pose F (Cout , Tout , N )
via convolution on the pose structure sequence, where Tout
where fin : vt,j → R2 is a mapping that gets the attribute depends on Nin and S. To reconstruct the 3D pose se-
features of joint node vt,j and Zt,j is a normalization term quence, we first use AGCN (2, 96, 1) to convert the 2D
that equals to the subset’s cardinality. W (ht,i (vt,j )) is a pose sequence F (2, T, N ) into a 96D pose representation
weight function W (υt,i , υt,j ) : Bi → R2 implemented F (96, T, N ). Following the settings of related work, we
by indexing a (2, K) tensor. For a pose frame, the deter- set T to 243 and N to 17 for the Human3.6M dataset. That
mined graph convolution of a sampling strategy (e.g., cen- is, the input 2D pose sequence of F (2, 243, 17) is converted
tripetal and centrifugal subsets) can be implemented by an into a 96D pose sequence of F (96, 243, 17). Then, we stack
N × N adjacency Pmatrix. Specifically, with K spatial sam- iterative layers of AGCN (96, 96, 1) to construct the deep
K
pling strategies k=1 Ak and the adaptive design, Equa- spatiotemporal structural representation of the 96D pose se-
tion 1 can be transformed into quence. The output of the last AGCN block is fed into an
XK AGCN (96, 3, 1) to estimate the 3D pose sequence based
fout (υt ) = (Ak + Bk + Ck )fin Wk (2) on the 96D joint representation and derive F (3, 243, 17).
k=1 ...
Then, we let p t,i ∈ R3 be the 3D position of the i-th joint
−1 −1 at time t, and minimize the difference between the estimated
where Λk 2 Āk Λk 2 is a normalized adjacency matrix of
3D pose sequence and the ground truth 3D pose sequence:
Āk with its elements indicating whether a vertex υt,j is in-
ij
cluded in the neighbor set. Λii
P
k = j (Āk ) + α is a di- T N
1 1 X X ...
agonal matrix with α set to 0.001 to prevent empty rows. Lglobal = p t,i − pt,i 2
(4)
Wk denotes the weighting function of Equation 1, which is T N t=1 i=1
a weight tensor of the 1 × 1 convolutional operation. Unlike
Ak that represents the physical structure of a human pose, Strided Learning Architecture. Inspired by the TCN-
Bk represents learnable parameters that indicate the con- based approaches [51, 42], we further adapt the strided
nection strength between pose joints,which is implemented learning architecture to the AGCN model, using strided
with an N × N adjacency matrix initialized to 0. Ck per- convolution to reduce long time sequences and aggregate
forms the similar function of Bk , which is implemented by temporal information near time t for pose estimation. The
the dot product of two feature maps calculated by embed- gray module in Figure 1(Strided Learning) illustrates the
ding functions (i.e., θ and ϕ) to calculate the similarity be- design of the strided AGCN modules. Each strided AGCN
tween pose joints. Calculation of Ck can be represented module has two consecutive AGCN blocks, which are sur-
as rounded by residual connections [26]. We perform strided
T T convolutions at the second AGCN block of each strided
Ck = Sof tM ax(fin Wθk Wϕk fin ) (3)
AGCN module to gradually shrink the feature size at the
where Wθ and Wϕ are learnable parameter of the two em- temporal dimension. The input of the first strided AGCN
bedding functions, which are initialized as 0.0. Then an module is the intermediate output in 3D pose sequence
AGCN block is realized with a 1 × Γ classical 2D con- reconstruction, i.e., the extracted F (96, 243, 17). After
volutional layer (Γ is the temporal kernel size that we set the propagation through the first strided AGAN module,
to 9) and the defined adaptive graph convolution fout (υt ), the 96D pose sequence will be shrunken to F (96, 81, 17).
which are both followed by a batch normalization layer and Then, we repetitively perform subsequence AGCN layers
a ReLU layer and a dropout layer in between them. Mean- until the feature size is shrunken to the size of 96 × 1 × 17.
while, a residual connection [26] is added to the AGCN In this way, the pattern of the temporal neighbor in the
block. pose sequence will be aggregated for subsequent local 3D
Reconstruct 3D Pose Sequence. Taking the inspiration of pose joint estimation to estimate the 3D pose of the centric
recent works [62, 29, 39, 38], the introduced AGCN block timestep.
is then used to extract the spatiotemporal structural infor-
3.2. Local 3D Pose Joint Estimation
mation in the global graph representation, which is super-
vised by estimating the 3D pose sequence from the corre- Based on the above-mentioned strided AGCN modules,
sponding 2D sequence (see Figure 1 [Reconstruct 3D Pose the input 2D pose sequence represented as F (96, 243, 17)
can be transformed into a feature map F (96, 1, 17). The Finally, we wish to minimize the difference between the es-
next step is to estimate the 3D position of joint nodes based timated joint pose p̄i and the ground truth joint pose pi via
on the feature map. Llocal :
Individually Connected Layers. Existing TCN- and N
1 X
GCN-based methods [51, 42, 80, 62] usually flatten the de- Llocal = ∥p̄i − pi ∥2 (8)
N i=1
rived feature maps and use a global skeleton representa-
tion consisting of all joint nodes to estimate every single During the training process, we optimize Lglobal and
joint, neglecting the matching information between joints Llocal in two stages. In the first stage, we minimize
and corresponding vectors in feature maps. Unlike existing Lglobal + Llocal to optimize the model using globally su-
works, we believe the global knowledge of the temporal and pervised signal guidance. In the second stage, we minimize
spatial neighborhoods has been aggregated via the proposed Llocal to improve the 3D pose estimation performance.
global representation. Thus, it is crucial to scope at the spa-
tial information of the corresponding joint node to infer its 4. Experiments
3D position. Based on this idea, this paper first proposes
an individual connected layer to estimate the 3D position 4.1. Datasets and Evaluation
of every single joint based on the corresponding joint node Our experiments are based on three public datasets:
feature F (96, 1, 1), instead of the pooled representation of Human3.6M [30], HumanEva-I [58], and MPI-INF-3DHP
all joint nodes F (96, 1, 17). Mathematically, the individual [46]. With respect to Human3.6M, the data of subjects
connected layer can be denoted as: S1, S5, S6, S7, and S8 are applied for training, while that
(unshared) of S9 and S11 are used for testing, which is consistent
ṗi = vi Wi + bi (5) with the training and validation settings of existing works
[51, 42, 80, 62]. In terms of HumanEva-I, following [45]
where the estimated 3D position of joint i is denoted by ṗi
and [42], data for actions “walk” and “jog” from subjects
and vi represents the flattened features of F (96, 1, i) joint
S1, S2, and S3 are used for training and testing. For MPI-
node i. The weight parameters of the individual connected
INF-3DHP, we follow the experimental setting of the recent
layer is represented by Wi and Wi ∈ R96×3 , whose bias
state-of-the-art [55] for a fair comparison.
parameter is bi and bi ∈ R1×3 .
Standard evaluation protocols: Mean Per-Joint Position
Due to the weight Wi and bias bi are not shared between
Error (MPJPE) and Pose-aligned MPJPE (P-MPJPE), re-
joints, we name the above individually connected layers as
spectively known as P rotocol#1 and P rotocol#2, are
unshared individually connected layers. On top of that, we
used for both datasets. The calculation of MPJPE is based
find that individually connected layers in the unshared fash-
on the mean Euclidean distance between the predicted 3D
ion may ignore the shared rules between joints in 2D to 3D
pose joints aligned to root joints (i.e., pelvis) and the ground
pose lifting, resulting in overfitting joint-specific distribu-
truth 3D pose joints collected via motion capture, which fol-
tion. Therefore, we further designed shared individually
lows [84, 60, 50]. Comparing with MPJPE, P-MPJPE is
connected layers:
also based on the mean Euclidean distance but has an extra
(shared) post-processing step with rigid alignments (e.g., scale, ro-
ṗi = vi Ws + bs (6)
tation, and translation) to the predicted 3D pose. P-MPJPE
leads to smaller differences with the ground truth and it fol-
The weight parameters of the shared individual connected
lows [45, 28, 22].
layer is represented by Ws and Ws ∈ R96×3 , whose bias
parameter is bs and bs ∈ R1×3 . Then, the 3D pose estima- 4.2. Implementation Details
tion of each joint can be formulated as the weighted average
of the estimated results from the shared and unshared indi- We introduce the implementation detail of our GLA-
vidually connected layers: GCN from three main perspectives: 2D pose detections,
model setting, and hyperparameters for the training process.
(unshared) (shared) For fair comparison, we follow the 2D pose detections of
p̄i = λṗi + (1 − λ)ṗi (7)
Human3.6M [30] and HumanEva-I [58] used in [51, 42],
Here, λ is the parameter that weighs the shared individ- which are detected by CPN [15] and MRCNN [25], respec-
ual connected layer and the unshared individual connected tively. The CPN’s 2D pose detection has 17 joints while
layer. When λ is 0.0, the model uses only the shared in- the MRCNN’s 2D pose detection has 15 joints. Besides, we
dividual connected layer for estimation, and when λ is 1.0, also conduct experiments for the ground truth (GT) 2D pose
the model uses only the unshared individual connected layer detections of the two datasets.
for prediction. Especially, for convenience, the connected Based on the specific structure of 2D pose, we imple-
layers are implemented via a 1D CNN layer in this paper. ment the graph convolutional operation filters of AGCN
Method Dir. Disc. Eat. Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg.
Martinez et al. [45] (ICCV’17) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9
Fang et al. [21] (AAAI’18) 50.1 54.3 57.0 57.1 66.6 73.3 53.4 55.7 72.8 88.6 60.3 57.7 62.7 47.5 50.6 60.4
Pavlakos et al. [49] (CVPR’18) 48.5 54.4 54.4 52.0 59.4 65.3 49.9 52.9 65.8 71.1 56.6 52.9 60.9 44.7 47.8 56.2
Lee et al. [35] (ECCV’18) † 40.2 49.2 47.8 52.6 50.1 75.0 50.2 43.0 55.8 73.9 54.1 55.6 58.2 43.3 43.3 52.8
Zhao et al. [80] (CVPR’19) 47.3 60.7 51.4 60.5 61.1 49.9 47.3 68.1 86.2 55.0 67.8 61.0 42.1 60.6 45.3 57.6
Ci et al. [18] (ICCV’19) 46.8 52.3 44.7 50.4 52.9 68.9 49.6 46.4 60.2 78.9 51.2 50.0 54.8 40.4 43.3 52.7
Pavllo et al. [51] (CVPR’19) † 45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44.0 49.0 32.8 33.9 46.8
Cai et al. [9] (ICCV’19) † 44.6 47.4 45.6 48.8 50.8 59.0 47.2 43.9 57.9 61.9 49.7 46.6 51.3 37.1 39.4 48.8
Pavllo et al. [51] (CVPR’19) † 45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44.0 49.0 32.8 33.9 46.8
Xu et al. [68] (CVPR’20) † 37.4 43.5 42.7 42.7 46.6 59.7 41.3 45.1 52.7 60.2 45.8 43.1 47.7 33.7 37.1 45.6
Liu et al. [42] (CVPR’20) † 41.8 44.8 41.1 44.9 47.4 54.1 43.4 42.2 56.2 63.6 45.3 43.5 45.3 31.3 32.2 45.1
Zeng et al. [75] (ECCV’20) † 46.6 47.1 43.9 41.6 45.8 49.6 46.5 40.0 53.4 61.1 46.1 42.6 43.1 31.5 32.6 44.8
Xu and Takano [69] (CVPR’21) 45.2 49.9 47.5 50.9 54.9 66.1 48.5 46.3 59.7 71.5 51.4 48.6 53.9 39.9 44.1 51.9
Zhou et al. [83] (PAMI’21) † 38.5 45.8 40.3 54.9 39.5 45.9 39.2 43.1 49.2 71.1 41.0 53.6 44.5 33.2 34.1 45.1
Li et al. [39] (CVPR’22) † 39.2 43.1 40.1 40.9 44.9 51.2 40.6 41.3 53.5 60.3 43.7 41.1 43.8 29.8 30.6 43.0
Shan et al. [55] (ECCV’22) † 38.9 42.7 40.4 41.1 45.6 49.7 40.9 39.9 55.5 59.4 44.9 42.2 42.7 29.4 29.4 42.8
Our GLA-GCN (T=243, CPN) † 41.3 44.3 40.8 41.8 45.9 54.1 42.1 41.5 57.8 62.9 45.0 42.8 45.9 29.4 29.9 44.4
Lee et al. [35] (ECCV’18) † 32.1 36.6 34.3 37.8 44.5 49.9 40.9 36.2 44.1 45.6 35.3 35.9 30.3 37.6 35.5 38.4
Zhao et al. [80] (CVPR’19) 37.8 49.4 37.6 40.9 45.1 41.4 40.1 48.3 50.1 42.2 53.5 44.3 40.5 47.3 39.0 43.8
Ci et al. [18] (ICCV’19) 36.3 38.8 29.7 37.8 34.6 42.5 39.8 32.5 36.2 39.5 34.4 38.4 38.2 31.3 34.2 36.3
Liu et al. [42] (CVPR’20) † 34.5 37.1 33.6 34.2 32.9 37.1 39.6 35.8 40.7 41.4 33.0 33.8 33.0 26.6 26.9 34.7
Xu and Takano [69] (CVPR’21) 35.8 38.1 31.0 35.3 35.8 43.2 37.3 31.7 38.4 45.5 35.4 36.7 36.8 27.9 30.7 35.8
Zheng et al. [81] (ICCV’21) † 30.0 33.6 29.9 31.0 30.2 33.3 34.8 31.4 37.8 38.6 31.7 31.5 29.0 23.3 23.1 31.3
Li et al. [39] (CVPR’22) † 27.7 32.1 29.1 28.9 30.0 33.9 33.0 31.2 37.0 39.3 30.0 31.0 29.4 22.2 23.0 30.5
Shan et al. [55] (ECCV’22) † 28.5 30.1 28.6 27.9 29.8 33.2 31.3 27.8 36.0 37.4 29.7 29.5 28.1 21.0 21.0 29.3
Our GLA-GCN (T=243, GT) † 26.5 27.2 29.2 25.4 28.2 31.7 29.5 26.9 37.8 39.9 29.9 27.0 27.3 20.5 20.8 28.5
Wang et al. [62] (ECCV’20) †* 23.0 25.7 22.8 22.6 24.1 30.6 24.9 24.5 31.1 35.0 25.6 24.3 25.1 19.8 18.4 25.6
Li et al. [38] (TMM’22) †* 27.1 29.4 26.5 27.1 28.6 33.0 30.7 26.8 38.2 34.7 29.1 29.8 26.8 19.1 19.8 28.5
Hu et al. [29] (MM’22) †* - - - - - - - - - - - - - - - 22.7
Zhang et al. [78] (CVPR’22) †* 21.6 22.0 20.4 21.0 20.8 24.3 24.7 21.9 26.9 24.9 21.2 21.5 20.8 14.7 15.7 21.6
Our method (T=243, GT) † * 20.1 21.2 20.0 19.6 21.5 26.7 23.3 19.8 27.0 29.4 20.8 20.1 19.2 12.8 13.8 21.0
Table 1. P rotocol#1: Reconstruction error with MPJPE (mm) on Human3.6M. Top-table: input 2D pose sequences are detected by (CPN)
- cascaded pyramid network. Bottom-table: input 2D pose sequences with ground truth (GT). Best in bold, second best underlined, the
lower the better. † indicates using temporal information. * indicates reconstructing an intermediate 3D pose sequence.
Walk Jog Method PCK↑ AUC↑ MPJPE↓

Method Avg
S1 S2 S3 S1 S2 S3 Mehta et al. [46] (3DV’17, T=1) 75.7 39.3 117.6
Martinez et al. [45] (ICCV’17) 19.7 17.4 46.8 26.9 18.2 18.6 24.6 Pavllo et al. [51] (CVPR’19, T=81) † 86.0 51.9 84.0
Fang et al. [21] (AAAI’18) 19.4 16.8 37.4 30.4 17.6 16.3 23.0 Lin et al. [40] (BMVC’19, T=25) † 83.6 51.4 79.8
Pavlakos et al. [49] (CVPR’18) 18.8 12.7 29.2 23.5 15.4 14.5 19.0 Wang et al. [62] (ECCV’20, T=96) †* 86.9 62.1 68.1
Lee et al. [35] (ECCV’18) † 18.6 19.9 30.5 25.7 16.8 17.7 21.5 Zheng et al. [81] (ICCV’21, T=9) † 88.6 56.4 77.1
Pavllo et al. [51] (CVPR’19) † 13.9 10.2 46.6 20.9 13.1 13.8 19.8 Chen et al. [11] (TCSVT’21, T=81) † 87.9 54.0 78.8
Liu et al. [42] (CVPR’20) † 13.1 9.8 26.8 16.9 12.8 13.3 15.5
Hu et al. [29] (MM’22, T=96) †* 97.9 69.5 42.5
Zheng et al. [81] (ICCV’21) † 14.4 10.2 46.6 22.7 13.4 13.4 20.1
Li et al. [38] (TMM’22) †* 14.0 10.0 32.8 19.5 13.6 14.2 17.4
Shan et al. [55] (ECCV’22, T=81) † 97.9 75.8 32.2
Zhang et al. [78] (CVPR’22) †* 12.7 10.9 17.6 22.6 15.8 17.0 16.1 Our GLA-GCN (T=27) † 98.19 76.53 31.36
Ours (T=27, MRCNN) † 12.5 9.1 26.9 18.5 12.7 12.8 15.4 Our GLA-GCN (T=81) † 98.53 79.12 27.76
Li et al. [38] (TMM’22) †* 9.7 7.6 15.8 12.3 9.4 11.2 11.1 Table 3. Results of P rotocol#1 for MPI-INF-3DHP. † uses tem-
Ours (T=27, GT) † 8.7 6.8 11.5 10.1 8.2 9.9 9.2 poral information. Best in bold, second best underlined. * indi-
Table 2. Results of P rotocol#2 for HumanEva-I. † uses temporal cates reconstructing an intermediate 3D pose sequence.
information. Best in bold, second best underlined. * indicates
reconstructing an intermediate 3D pose sequence.
blocks, e.g., the sizes of Ak , Bk , Ck are set to 17 × 17, In terms of the hyperparameters, we respectively set
15 × 15, and 17 × 17 for Human3.6M, HumanEva-I, and the batch size to 512, 256, and 256 for Human3.6M,
MPI-INF-3DHP, respectively. The designed model has HumanEva-I, and MPI-INF-3DHP. Being consistent with
some key parameters that can be adjusted to get better per- [42], we adopt the ranger optimizer and train the model with
formance. For this part, we conduct ablation expariments the MPJPE loss for 80 and 1000 epochs for Human3.6M
with difference numbers of channels and 2D pose frames and HumanEva-I, respectively, using an initial learning rate
(i.e., Cout and T , respectively) on Human3.6M. To ver- of 0.01. Meanwhile, we set the dropout rate to 0.1. For both
ify the proper design of the proposed model regarding the training and testing phases, data augmentation is applied by
strided design and individual connected layer, we perform horizontally flipping the pose data. All experiments are con-
further ablation experiments on both datasets. ducted with two GeForce GTX 3090 GPUs.
Method Dir. Disc. Eat. Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg.
Fang et al. [21] (AAAI’18) 38.2 41.7 43.7 44.9 48.5 55.3 40.2 38.2 54.5 64.4 47.2 44.3 47.3 36.7 41.7 45.7
Pavlakos et al. [49] (CVPR’18) 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Lee et al. [35] (ECCV’18) † 34.9 35.2 43.2 42.6 46.2 55.0 37.6 38.8 50.9 67.3 48.9 35.2 31.0 50.7 34.6 43.4
Cai et al. [9] (ICCV’19) † 35.7 37.8 36.9 40.7 39.6 45.2 37.4 34.5 46.9 50.1 40.5 36.1 41.0 29.6 33.2 39.0
Pavllo et al. [51] (CVPR’19) † 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45.0 52.5 37.4 33.8 37.8 25.6 27.3 36.5
Xu et al. [68] (CVPR’20) † 31.0 34.8 34.7 34.4 36.2 43.9 31.6 33.5 42.3 49.0 37.1 33.0 39.1 26.9 31.9 36.2
Chen et al. [15] (ICCV’20) † 32.9 35.2 35.6 34.4 36.4 42.7 31.2 32.5 45.6 50.2 37.3 32.8 36.3 26.0 23.9 35.5
Liu et al. [42] (CVPR’20) † 32.3 35.2 33.3 35.8 35.9 41.5 33.2 32.7 44.6 50.9 37.0 32.4 37.0 25.2 27.2 35.6
Shan et al. [56] (MM’21) † 32.5 36.2 33.2 35.3 35.6 42.1 32.6 31.9 42.6 47.9 36.6 32.1 34.8 24.2 25.8 35.0
Shan et al. [55] (ECCV’22) † 31.3 35.2 32.9 33.9 35.4 39.3 32.5 31.5 44.6 48.2 36.3 32.9 34.4 23.8 23.9 34.4
Zhang et al. [78] (CVPR’22) †* 30.8 33.1 30.3 31.8 33.1 39.1 31.1 30.5 42.5 44.5 34.0 30.8 32.7 22.1 22.9 32.6
Our GLA-GCN (T=243, CPN) † 32.4 35.3 32.6 34.2 35.0 42.1 32.1 31.9 45.5 49.5 36.1 32.4 35.6 23.5 24.7 34.8
Martinez et al. [45] (ICCV’17) - - - - - - - - - - - - - - - 37.1
Ci et al. [18] (ICCV’19) 24.6 28.6 24.0 27.9 27.1 31.0 28.0 25.0 31.2 35.1 27.6 28.0 29.1 24.3 26.9 27.9
Our GLA-GCN (T=243, GT) † 20.2 21.9 21.7 19.9 21.6 24.7 22.5 20.8 28.6 33.1 22.7 20.6 20.3 15.9 16.2 22.0
Our GLA-GCN (T=243, GT) †* 16.6 18.1 16.2 17.0 17.0 22.2 19.0 17.1 22.4 25.9 17.5 16.4 16.3 10.8 11.6 17.6
Table 4. P rotocol#2: Reconstruction error after rigid alignment with P-MPJPE (mm) on Human3.6M. Top-table: input 2D joints are
acquired by detection (CPN) - cascaded pyramid network. Bottom-table: input 2D joints with (GT) - ground truth. † indicates using
temporal information. * indicates reconstructing an intermediate 3D pose sequence. Best in bold, second best underlined.
Input Zhang et al. [71] Ours Ground Truth Input Zhang et al. [71] Ours Ground Truth
S9 Eat. S11 Eat.
Input Zhang et al. [71] Ours Ground Truth Input Zhang et al. [71] Ours Ground Truth
S9 WalkT. S11 WalkT.
Figure 2. Qualitative comparison with Zhang et al. [78] for S9 and S11 on two actions of Human3.6M. Noticeable improvements are
enlarged.
0.40 40
36.31
32.37 Refine perior to state-of-the-art methods by just using the MPJPE
30.71 29.16
0.30 29.10 28.33 28.20 28.25 28.30 30 loss. We also conduct a qualitative comparison with the
MPJPE (mm)
21.10 21.11 21.16 state-of-the-art method that does not have a 3D pose se-
loss
0.20 20
21.30
quence reconstruction module [42]. Figure 2 shows the vi-
0.10 10 sualized improvements over [42]. For example, in the “S11
Training WalkT.” action, the visualizations of right-hip and right-
0.00 0
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29
hand joints estimated with our method and the ground truth
epoch 3D pose are clearly separate but those of [42] are connected
training loss test MPJPE
Figure 3. Loss on the training set and MPJPE on the test set. to each other. Moving on to the MPI-INF-3DHP dataset in
Table 3, we can see a significant decline in MPJPE with
4.3. Comparison with State-of-the-Art our model. Compared with the state-of-the-art method,
Tables 1 and 4 show the comparison on Human3.6M P-STMO[55], the MPJPE of our model decreases from
and HumanEva-I with state-of-the-art methods under 32.2mm to 27.76mm, representing an error reduction of ap-
P rotocol#1 and P rotocol#2, respectively. Based on the proximately 14%.
implementation via GT 2D pose respectively optimized Comparing the performance by using estimated 2D pose
with or without loss of reconstructing the intermediate 3D (i.e., CPN or HR-Net pose data) is also regarded as im-
pose sequence (defined in Equation 4), our GLA-GCN out- portant by most existing works. However, models such
performs the state-of-the-art method [78] in terms of aver- as [39, 55] can perform well on relatively low-quality esti-
aged results of two evaluation protocols. Figure 3 shows the mated 2D pose data but fail to well generalize the good per-
training process of our GLA-GCN on Human3.6M, which formance to high-quality 2D data (i.e., ground truth poses).
indicates our model converges quickly without observable We note that our method underperforms on relatively low-
overfitting. For the HumanEva-I dataset, the results of quality estimated 2D pose when compared with some recent
P rotocol#2 (see Table 2) also show that our method is su- methods [39, 55, 78]. In the following, we conduct an in-
Right Leg Right Arm Left Leg Left Arm Torso Arm Joints Leg Joints
Figure 4. Visualizations of inter-joint feature cosine similarity from actions: “eating” (first 3 columns) and “walking” (last three columns)
of Human3.6M. Upper row uses an individual connected layer; lower row uses a fully connected layer (please zoom in for a better view).
depth discussion on this issue. Medhod Frames No. of Parameters MPJPE (mm)
Pavllo et al. [51] (CVPR’19) † T = 27 8.56M 40.6
Liu et al. [42] (CVPR’20) † T = 27 5.69M 38.9
Discussion: the Effect of 2D Pose Quality. Tracing back Li et al. [39] (CVPR’22) †* T = 27 18.92M 34.3
to the first work on 3D Pose lifting, Martinez et al. [45] Our GLA-GCN † T = 27 0.84M 34.4
used the SH 2D pose detector fine-tuned on the Human3.6M Pavllo et al. [51] (CVPR’19) † T = 81 12.75M 38.7
Liu et al. [42] (CVPR’20) † T = 81 8.46M 36.2
dataset to improve the 3D HPE (significantly from 67.5mm Li et al. [39] (CVPR’22) †* T = 81 ≥ 18.92M 32.7
average MPJPE to 62.9mm), indicating that the quality of Our GLA-GCN † T = 81 1.10M 31.5
2D pose can be essential for the 3D HPE. Recent works Pavllo et al. [51] (CVPR’19) † T = 243 16.95M 37.8
Liu et al. [42] (CVPR’20) † T = 243 11.25M 34.7
[62, 39, 78] took advantage of advanced 2D pose detector Our GLA-GCN † T = 243 1.35M 28.5
HR-Net and achieved better performance (e.g., 39.8mm av-
Wang et al. [62] (ECCV’20) †* T = 96 1.69M 25.6
erage MPJPE). Zhu et al. [85] also successfully improved Our GLA-GCN (Cout=96 ) †* T = 243 1.88M 24.5
the result to 37.5mm average MPJPE by fine-tuning the SH Hu et al. [29] (MM’22) †* T = 96 3.42M 22.7
Li et al. [39] (CVPR’22) †* T = 351 ≥ 18.92M 30.5
network [48] on Human3.6M, which remains far behind the Zhang et al. [78] (CVPR’22) †* T = 243 33.70M 21.6
results implemented with GT 2D pose. Our GLA-GCN (Cout=512 ) †* T = 243 48.64M 21.0
Table 5. Comparison with state-of-the-art methods on Hu-
A similar observation is also applicable to the man3.6M implemented with different receptive fields of ground
HumanEva-I and MPI-INF-3DHP datasets. As shown in truth 2D pose. Results of P rotocol#1 are reported. * indicates
Table 2, our method yields a remarkable 40% drop in P- reconstructing an intermediate 3D pose sequence.
MPJPE on the HumanEva-I dataset. Given the GT 2D
pose, the P-MPJPE goes from 15.4mm to 9.2mm compared 4.4. Ablation Studies
with the best state-of-the-art algorithm. While on MPI-INF-
3DHP, the MPJPE goes from 32.2mm to 27.76mm. In the following, we ablate our model design gradients
(i.e., AGCN layers, strided design, and individual con-
Hence, improving the performance of the estimated pose nected layer). To validate the properness of using AGCN
purely relies on preparing quality 2D pose data, which can layers, we compare our model with the version imple-
be easily achieved by either using an advanced 2D pose de- mented with ST-GCN [70] blocks, which leads to the ab-
tector that can generate pose data similar to the GT 2D pose lation of AGCN. As shown in the row #1 of Table 6, the
or just arbitrarily fine-tuning the existing pose detectors. On results of P rotocol#2 on two datasets consistently indicate
the other hand, it remains unclear for what scenario the re- that using AGCN blocks can achieve better performances.
constructed 3D pose with advanced pose detectors can be For the ablation of strided design, we perform average
beneficial. One scenario is 3D human pose estimation in pooling on the second (i.e., temporal) dimension of the fea-
the wild, which is usually evaluated with qualitative visual- ture map. Results in row #2 of Table 6 indicates that it is
ization [38]. However, whether the 3D pose reconstructed not as effective as the strided design. Without the strided
from the estimated 2D pose can contribute to pose-based design, it will not only lead to a larger feature map represen-
tasks remains under-explored. Given that how to improve tation, i.e., increased from F (Cout , 1, N ) to F (Cout , T, N )
the performance of the estimated 2D pose is straightforward but also affects the 3D HPE.
and its usage remains lack of good applicable scenario, we To verify the design of our individual connected layer,
argue that the comparison based on the GT 2D pose can we compare it with the implementation of a fully connected
more properly reflect a model’s 3D HPE ability than com- layer that takes the expanded feature map as its input. The
parisons based on the estimated 2D pose. results in the row #3 of Table 6 indicates that our individual
Human3.6M HumanEva-I [6, 5, 4, 7, 3] and aim to improve the results of the esti-
# Method
CPN GT MRCNN GT mated 2D pose by preparing high-quality 2D pose data via
1 w/o AGCN 39.1 28.0 18.2 11.7 fine-tuned 2D pose detectors (e.g., SH detector [48], Open-
2 w/o strided design 41.2 30.6 22.6 12.9 Pose [10], and HR-Net [59]), abd investigate the effects of
3 w/o individual connected layer 39.0 27.7 17.6 12.4
4 Swap the structure of input 2D pose 38.3 28.1 16.4 12.4 other loss terms (e.g., based on bone features [12] and mo-
5 GLA-GCN (T=27) † 37.8 25.8 15.4 9.2 tion trajectory [62]).
Table 6. Ablation study on key designs of our GLA-GCN. The re-
sults are based on the average value of P rotocol#2 implemented References
with 27 receptive fields for various 2D pose detections of the Hu-
man3.6M and HumanEva-I datasets. [1] Mykhaylo Andriluka, Stefan Roth, and Bernt Schiele. Pic-
torial structures revisited: People detection and articulated
pose estimation. In 2009 IEEE conference on computer vi-
connected layer can make better use of the structured rep- sion and pattern recognition, pages 1014–1021. IEEE, 2009.
resentation of GCN and thus significantly improves the per- [2] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Ex-
formance. The differences of features before the prediction ploiting temporal context for 3d human pose estimation in
layers (i.e., individually and fully connected layers) are re- the wild. In Proceedings of the IEEE/CVF Conference
spectively visualized in the upper and lower rows of Figure on Computer Vision and Pattern Recognition, pages 3395–
4. The visualization indicates that our individual connected 3404, 2019.
layer can make prediction based on more interpretable fea- [3] XB Bruce, Yan Liu, and Keith CC Chan. Skeleton-based de-
tures that cannot be traced by using a fully connected layer. tection of abnormalities in human actions using graph convo-
For example, the arm and leg joints show relatively higher lutional networks. In 2020 Second International Conference
on Transdisciplinary AI (TransAI), pages 131–137. IEEE,
independence to other joints for actions “eating” and “walk-
2020.
ing”, respectively. Feeding these independent features of all
[4] XB Bruce, Yan Liu, and Keith CC Chan. Multimodal fu-
joints to a fully connected layer will interfere with the pre- sion via teacher-student network for indoor action recogni-
diction of a specific joint. tion. In Proceedings of the AAAI Conference on Artificial
We further verify the advantage of this structured rep- Intelligence, volume 35, pages 3199–3207, 2021.
resentation of GCN by swapping the left and right limbs [5] XB Bruce, Yan Liu, Keith CC Chan, Qintai Yang, and Xi-
of 2D pose input data, leading to the break of pose struc- aoying Wang. Skeleton-based human action evaluation us-
ture. Results in row #4 of Table 6 show that breaking the ing graph convolutional network for monitoring alzheimer’s
pose structure will affect the 3D pose estimation. This ob- progression. Pattern Recognition, 119:108095, 2021.
servation, in turn, further indicates the proper design of our [6] Yu Bruce X.B., Yan Liu, Xiang Zhang, Sheng-hua Zhong,
individual connected layer. and Keith CC Chan. Mmnet: A model-based multimodal
network for human action recognition in rgb-d videos. IEEE
Discussion: Limitation on Model Size. Similar to state-
Transactions on Pattern Analysis and Machine Intelligence,
of-the-art methods [29, 39, 78], we note that our method is
45(3):3522–3538, 2022.
faced with the issue of model size. Specifically, the lower [7] Yu Bruce X.B., Liu Yan, Zhang Xiang, Chen Gong, and
table of Table 5 shows that our model can achieve better Chan Keith C.C. Egcn: An ensemble-based learning frame-
performance than state-of-the-art methods [62, 78] but uses work for ex-ploring effective skeleton-based rehabilitation
slightly more model parameters. We aim to tackle this issue exercise assessment. EGCN: An Ensemble-based Learning
in the future by using techniques such as pruning. Framework for Exploring Effective Skeleton-based Rehabil-
itation Exercise Assessment, pages 3681–3687, 2022.
5. Conclusion [8] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham,
Junsong Yuan, and Nadia Magnenat Thalmann. Exploit-
This paper proposes a GCN-based method utilizing the ing spatial-temporal relationships for 3d pose estimation
structured representation for 3D HPE in the 2D to 3D lift- via graph convolutional networks. In Proceedings of the
ing paradigm. The proposed GLA-GCN globally represents IEEE/CVF International Conference on Computer Vision
the 2D pose sequence and locally estimates the 3D pose (ICCV), pages 2272–2281, 2019.
joints via an individual connected layer. Results show that [9] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham,
our GLA-GCN outperforms corresponding state-of-the-art Junsong Yuan, and Nadia Magnenat Thalmann. Exploit-
ing spatial-temporal relationships for 3d pose estimation
methods implemented with GT 2D poses on datasets Hu-
via graph convolutional networks. In Proceedings of the
man3.6M, HumanEva-I, and MPI-INF-3DHP. We verify the IEEE/CVF International Conference on Computer Vision,
properness of model design with extensive ablation studies pages 2272–2281, 2019.
and visualizations. In the future, we will tackle the issue [10] Zhe Cao, Gines Hidalgo Martinez, Tomas Simon, Shih-En
of parameter efficiency of our model via tuning techniques Wei, and Yaser A Sheikh. Openpose: realtime multi-person
[72, 73]. Meanwhile, we will consider its effect on ap- 2d pose estimation using part affinity fields. IEEE transac-
plication scenarios such as human behavior understanding tions on pattern analysis and machine intelligence, 2019.
[11] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, sion and Pattern Recognition (CVPR), pages 13106–13115,
Zhili Chen, and Jiebo Luo. Anatomy-aware 3d human 2022.
pose estimation with bone-based pose decomposition. IEEE [24] Kehong Gong, Jianfeng Zhang, and Jiashi Feng. Poseaug:
Transactions on Circuits and Systems for Video Technology, A differentiable pose augmentation framework for 3d hu-
32(1):198–209, 2021. man pose estimation. In Proceedings of the IEEE/CVF Con-
[12] Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili ference on Computer Vision and Pattern Recognition, pages
Chen, and Jiebo Luo. Anatomy-aware 3d human pose esti- 8575–8584, 2021.
mation with bone-based pose decomposition. IEEE Trans- [25] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Gir-
actions on Circuits and Systems for Video Technology, 2021. shick. Mask r-cnn. In Proceedings of the IEEE international
[13] Xipeng Chen, Pengxu Wei, and Liang Lin. Deductive learn- conference on computer vision, pages 2961–2969, 2017.
ing for weakly-supervised 3d human pose estimation via un- [26] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
calibrated cameras. In Proceedings of the AAAI Confer- Deep residual learning for image recognition. In Proceed-
ence on Artificial Intelligence, volume 35, pages 1089–1096, ings of the IEEE conference on computer vision and pattern
2021. recognition, pages 770–778, 2016.
[14] Yucheng Chen, Yingli Tian, and Mingyi He. Monocu- [27] Yihui He, Rui Yan, Katerina Fragkiadaki, and Shoou-I Yu.
lar human pose estimation: A survey of deep learning- Epipolar transformers. In Proceedings of the ieee/cvf con-
based methods. Computer Vision and Image Understanding, ference on computer vision and pattern recognition, pages
192:102897, 2020. 7779–7788, 2020.
[15] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang [28] Mir Rayat Imtiaz Hossain and James J Little. Exploiting
Zhang, Gang Yu, and Jian Sun. Cascaded pyramid net- temporal information for 3d human pose estimation. In Pro-
work for multi-person pose estimation. In Proceedings of ceedings of the European Conference on Computer Vision
the IEEE conference on computer vision and pattern recog- (ECCV), pages 68–84, 2018.
nition, pages 7103–7112, 2018. [29] Wenbo Hu, Changgong Zhang, Fangneng Zhan, Lei Zhang,
[16] Yu Cheng, Bo Yang, Bo Wang, and Robby T Tan. 3d human and Tien-Tsin Wong. Conditional directed graph convolution
pose estimation using spatio-temporal networks with explicit for 3d human pose estimation. In Proceedings of the 29th
occlusion training. In Proceedings of the AAAI Conference ACM International Conference on Multimedia, pages 602–
on Artificial Intelligence, volume 34, pages 10631–10638, 611, 2021.
2020. [30] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian
[17] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Sminchisescu. Human3.6m: Large scale datasets and predic-
Pose2mesh: Graph convolutional network for 3d human pose tive methods for 3d human sensing in natural environments.
and mesh recovery from a 2d human pose. In European Con- IEEE transactions on pattern analysis and machine intelli-
ference on Computer Vision, pages 769–787. Springer, 2020. gence, 36(7):1325–1339, 2013.
[18] Hai Ci, Chunyu Wang, Xiaoxuan Ma, and Yizhou Wang. Op- [31] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury
timizing network structure for 3d human pose estimation. In Malkov. Learnable triangulation of human pose. In Proceed-
Proceedings of the IEEE/CVF International Conference on ings of the IEEE/CVF International Conference on Com-
Computer Vision, pages 2262–2271, 2019. puter Vision, pages 7718–7727, 2019.
[19] Qi Dang, Jianqin Yin, Bin Wang, and Wenqing Zheng. Deep [32] Xiaopeng Ji, Qi Fang, Junting Dong, Qing Shuai, Wen Jiang,
learning based 2d human pose estimation: A survey. Ts- and Xiaowei Zhou. A survey on monocular 3d human
inghua Science and Technology, 24(6):663–676, 2019. pose estimation. Virtual Reality & Intelligent Hardware,
[20] Michaël Defferrard, Xavier Bresson, and Pierre Van- 2(6):471–500, 2020.
dergheynst. Convolutional neural networks on graphs with [33] Jogendra Nath Kundu, Siddharth Seth, Pradyumna YM,
fast localized spectral filtering. Advances in neural informa- Varun Jampani, Anirban Chakraborty, and R Venkatesh
tion processing systems, 29:3844–3852, 2016. Babu. Uncertainty-aware adaptation for self-supervised 3d
[21] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, human pose estimation. In Proceedings of the IEEE/CVF
and Song-Chun Zhu. Learning pose grammar to encode hu- Conference on Computer Vision and Pattern Recognition
man body configuration for 3d pose estimation. In Proceed- (CVPR), pages 20448–20459, 2022.
ings of the AAAI Conference on Artificial Intelligence, vol- [34] Hsi-Jian Lee and Zen Chen. Determination of 3d human
ume 32, 2018. body postures from a single view. Computer Vision, Graph-
[22] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, ics, and Image Processing, 30(2):148–168, 1985.
and Song-Chun Zhu. Learning pose grammar to encode hu- [35] Kyoungoh Lee, Inwoong Lee, and Sanghoon Lee. Propagat-
man body configuration for 3d pose estimation. In Proceed- ing lstm: 3d pose estimation based on joint interdependency.
ings of the AAAI Conference on Artificial Intelligence, vol- In Proceedings of the European Conference on Computer Vi-
ume 32, 2018. sion (ECCV), pages 119–135, 2018.
[23] Erik Gärtner, Mykhaylo Andriluka, Hongyi Xu, and Cristian [36] Ruoyu Li, Sheng Wang, Feiyun Zhu, and Junzhou Huang.
Sminchisescu. Trajectory optimization for physics-based re- Adaptive graph convolutional neural networks. In Proceed-
construction of 3d human pose from monocular video. In ings of the AAAI Conference on Artificial Intelligence, vol-
Proceedings of the IEEE/CVF Conference on Computer Vi- ume 32, 2018.
[37] Sijin Li, Weichen Zhang, and Antoni B Chan. Maximum- [50] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpa-
margin structured learning with deep networks for 3d human nis, and Kostas Daniilidis. Coarse-to-fine volumetric pre-
pose estimation. In Proceedings of the IEEE international diction for single-image 3d human pose. In Proceedings of
conference on computer vision, pages 2848–2856, 2015. the IEEE conference on computer vision and pattern recog-
[38] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao nition, pages 7025–7034, 2017.
Wang, and Wenming Yang. Exploiting temporal contexts [51] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and
with strided transformer for 3d human pose estimation. IEEE Michael Auli. 3d human pose estimation in video with tem-
Transactions on Multimedia, 2022. poral convolutions and semi-supervised training. In Proceed-
[39] Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, and Luc ings of the IEEE Conference on Computer Vision and Pattern
Van Gool. Mhformer: Multi-hypothesis transformer for 3d Recognition, pages 7753–7762, 2019.
human pose estimation. In Proceedings of the IEEE/CVF [52] Liliana Lo Presti and Marco La Cascia. 3d skeleton-based
Conference on Computer Vision and Pattern Recognition, human action classification: A survey. Pattern Recognition,
pages 13147–13156, 2022. 53:130–147, 2016.
[40] Jiahao Lin and Gim Hee Lee. Trajectory space factorization [53] N Dinesh Reddy, Laurent Guigues, Leonid Pishchulin, Jayan
for deep video-based 3d human pose estimation. In Proceed- Eledath, and Srinivasa G Narasimhan. Tessetrack: End-to-
ings of the British Machine Vision Conference, 2019. end learnable multi-person articulated 3d pose tracking. In
[41] Kenkun Liu, Rongqi Ding, Zhiming Zou, Le Wang, and Wei Proceedings of the IEEE/CVF Conference on Computer Vi-
Tang. A comprehensive study of weight sharing in graph sion and Pattern Recognition, pages 15190–15200, 2021.
networks for 3d human pose estimation. In European Con- [54] Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and
ference on Computer Vision, pages 318–334. Springer, 2020. Ioannis A Kakadiaris. 3d human pose estimation: A review
[42] Ruixu Liu, Ju Shen, He Wang, Chen Chen, Sen-ching Che- of the literature and analysis of covariates. Computer Vision
ung, and Vijayan Asari. Attention mechanism exploits tem- and Image Understanding, 152:1–20, 2016.
poral contexts: Real-time 3d human pose reconstruction. In [55] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe
Proceedings of the IEEE/CVF Conference on Computer Vi- Wang, Siwei Ma, and Wen Gao. P-stmo: Pre-trained spa-
sion and Pattern Recognition, pages 5064–5073, 2020. tial temporal many-to-one model for 3d human pose estima-
[43] Zhiwei Liu, Xiangyu Zhu, Lu Yang, Xiang Yan, Ming Tang, tion. In Computer Vision–ECCV 2022: 17th European Con-
Zhen Lei, Guibo Zhu, Xuetao Feng, Yan Wang, and Jinqiao ference, Tel Aviv, Israel, October 23–27, 2022, Proceedings,
Wang. Multi-initialization optimization network for accurate Part V, pages 461–478. Springer, 2022.
3d human pose and shape estimation. In Proceedings of the [56] Wenkang Shan, Haopeng Lu, Shanshe Wang, Xinfeng
ACM International Conference on Multimedia, pages 1976– Zhang, and Wen Gao. Improving robustness and accuracy
1984, 2021. via relative information encoding in 3d human pose estima-
[44] Xiaoxuan Ma, Jiajun Su, Chunyu Wang, Hai Ci, and Yizhou tion. In Proceedings of the 29th ACM International Confer-
Wang. Context modeling in 3d human pose estimation: A ence on Multimedia, pages 3446–3454, 2021.
unified perspective. In Proceedings of the IEEE/CVF Con- [57] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-
ference on Computer Vision and Pattern Recognition, pages stream adaptive graph convolutional networks for skeleton-
6238–6247, 2021. based action recognition. In Proceedings of the IEEE Con-
[45] Julieta Martinez, Rayat Hossain, Javier Romero, and James J ference on Computer Vision and Pattern Recognition, pages
Little. A simple yet effective baseline for 3d human pose es- 12026–12035, 2019.
timation. In Proceedings of the IEEE International Confer- [58] Leonid Sigal, Alexandru O Balan, and Michael J Black. Hu-
ence on Computer Vision, pages 2640–2649, 2017. maneva: Synchronized video and motion capture dataset and
[46] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal baseline algorithm for evaluation of articulated human mo-
Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian tion. International journal of computer vision, 87(1-2):4,
Theobalt. Monocular 3d human pose estimation in the wild 2010.
using improved cnn supervision. In International Confer- [59] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep
ence on 3D Vision, pages 506–516, 2017. high-resolution representation learning for human pose es-
[47] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image- timation. In Proceedings of the IEEE/CVF Conference
to-lixel prediction network for accurate 3d human pose and on Computer Vision and Pattern Recognition, pages 5693–
mesh estimation from a single rgb image. In Computer 5703, 2019.
Vision–ECCV 2020: 16th European Conference, Glasgow, [60] Bugra Tekin, Artem Rozantsev, Vincent Lepetit, and Pas-
UK, August 23–28, 2020, Proceedings, Part VII 16, pages cal Fua. Direct prediction of 3d body poses from motion
752–768. Springer, 2020. compensated sequences. In Proceedings of the IEEE Con-
[48] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hour- ference on Computer Vision and Pattern Recognition, pages
glass networks for human pose estimation. In European con- 991–1000, 2016.
ference on computer vision, pages 483–499. Springer, 2016. [61] Raquel Urtasun and Pascal Fua. 3d human body tracking us-
[49] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. ing deterministic temporal motion models. In European con-
Ordinal depth supervision for 3d human pose estimation. In ference on computer vision, pages 92–106. Springer, 2004.
Proceedings of the IEEE Conference on Computer Vision [62] Jingbo Wang, Sijie Yan, Yuanjun Xiong, and Dahua Lin.
and Pattern Recognition, pages 7307–7316, 2018. Motion guided 3d pose estimation from videos. In European
Conference on Computer Vision, pages 764–780. Springer, human pose estimation with a split-and-recombine approach.
2020. In Computer Vision–ECCV 2020: 16th European Confer-
[63] Junjie Wang, Zhenbo Yu, Zhengyan Tong, Hang Wang, ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part
Jinxian Liu, Wenjun Zhang, and Xiaoyan Wu. Ocr-pose: XIV 16, pages 507–523. Springer, 2020.
Occlusion-aware contrastive representation for unsupervised [76] Ailing Zeng, Xiao Sun, Lei Yang, Nanxuan Zhao, Minhao
3d human pose estimation. In Proceedings of the ACM In- Liu, and Qiang Xu. Learning skeletal graph neural networks
ternational Conference on Multimedia, pages 5477–5485, for hard 3d pose estimation. In Proceedings of the IEEE/CVF
2022. International Conference on Computer Vision (ICCV), pages
[64] Luyang Wang, Yan Chen, Zhenhua Guo, Keyuan Qian, 11436–11445, 2021.
Mude Lin, Hongsheng Li, and Jimmy S Ren. Generaliz- [77] Yu Zhan, Fenghai Li, Renliang Weng, and Wongun Choi.
ing monocular 3d human pose estimation in the wild. In Ray3d: ray-based 3d human pose estimation for monocular
Proceedings of the IEEE/CVF International Conference on absolute 3d localization. In Proceedings of the IEEE/CVF
Computer Vision (ICCV), pages 0–0, 2019. Conference on Computer Vision and Pattern Recognition,
[65] Zitian Wang, Xuecheng Nie, Xiaochao Qu, Yunpeng Chen, pages 13116–13125, 2022.
and Si Liu. Distribution-aware single-stage models for multi- [78] Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen, and Jun-
person 3d pose estimation. In Proceedings of the IEEE/CVF song Yuan. Mixste: Seq2seq mixed spatio-temporal encoder
Conference on Computer Vision and Pattern Recognition for 3d human pose estimation in video. In Proceedings of
(CVPR), pages 13096–13105, 2022. the IEEE/CVF Conference on Computer Vision and Pattern
[66] Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, and Bas- Recognition, pages 13232–13242, 2022.
tian Wandt. Probabilistic monocular 3d human pose es- [79] Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, and
timation with normalizing flows. In Proceedings of the Wenjun Zeng. Adafuse: Adaptive multiview fusion for accu-
IEEE/CVF International Conference on Computer Vision rate human pose estimation in the wild. International Jour-
(ICCV), pages 11199–11208, 2021. nal of Computer Vision, 129(3):703–718, 2021.
[67] Wen-Li Wei, Jen-Chun Lin, Tyng-Luh Liu, and Hong- [80] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dim-
Yuan Mark Liao. Capturing humans in motion: temporal- itris N Metaxas. Semantic graph convolutional networks for
attentive 3d human pose and shape estimation from monoc- 3d human pose regression. In Proceedings of the IEEE/CVF
ular video. In Proceedings of the IEEE/CVF Conference Conference on Computer Vision and Pattern Recognition,
on Computer Vision and Pattern Recognition (CVPR), pages pages 3425–3435, 2019.
13211–13220, 2022. [81] Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang,
[68] Jingwei Xu, Zhenbo Yu, Bingbing Ni, Jiancheng Yang, Xi- Chen Chen, and Zhengming Ding. 3d human pose estima-
aokang Yang, and Wenjun Zhang. Deep kinematics analysis tion with spatial and temporal transformers. In Proceedings
for monocular 3d human pose estimation. In Proceedings of of the IEEE/CVF International Conference on Computer Vi-
the IEEE/CVF Conference on Computer Vision and Pattern sion, pages 11656–11665, 2021.
Recognition, pages 899–908, 2020.
[82] Kun Zhou, Xiaoguang Han, Nianjuan Jiang, Kui Jia, and
[69] Tianhan Xu and Wataru Takano. Graph stacked hourglass
Jiangbo Lu. Hemlets pose: Learning part-centric heatmap
networks for 3d human pose estimation. In Proceedings of
triplets for accurate 3d human pose estimation. In Proceed-
the IEEE/CVF Conference on Computer Vision and Pattern
ings of the IEEE/CVF International Conference on Com-
Recognition, pages 16105–16114, 2021.
puter Vision, pages 2344–2353, 2019.
[70] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial tempo-
[83] Kun Zhou, Xiaoguang Han, Nianjuan Jiang, Kui Jia, and
ral graph convolutional networks for skeleton-based action
Jiangbo Lu. Hemlets posh: Learning part-centric heatmap
recognition. In 32nd AAAI conference on artificial intelli-
triplets for 3d human pose and shape estimation. IEEE
gence, 2018.
Transactions on Pattern Analysis and Machine Intelligence,
[71] Yi Yang and Deva Ramanan. Articulated pose estimation
2021.
with flexible mixtures-of-parts. In CVPR 2011, pages 1385–
[84] Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, Kon-
1392. IEEE, 2011.
stantinos G Derpanis, and Kostas Daniilidis. Sparseness
[72] Bruce XB Yu, Jianlong Chang, Lingbo Liu, Qi Tian,
meets deepness: 3d human pose estimation from monocular
and Chang Wen Chen. Towards a unified view on vi-
video. In Proceedings of the IEEE conference on computer
sual parameter-efficient transfer learning. arXiv preprint
vision and pattern recognition, pages 4966–4975, 2016.
arXiv:2210.00788, 2022.
[85] Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne
[73] Bruce XB Yu, Jianlong Chang, Haixin Wang, Lingbo Liu,
Wu, and Yizhou Wang. Motionbert: Unified pretraining for
Shijie Wang, Zhiyu Wang, Junfan Lin, Lingxi Xie, Hao-
human motion analysis. arXiv preprint arXiv:2210.06551,
jie Li, Zhouchen Lin, et al. Visual tuning. arXiv preprint
2022.
arXiv:2305.06061, 2023.
[74] Bruce XB Yu, Yan Liu, and Keith CC Chan. A survey of
sensor modalities for human activity recognition. 1:282–294,
2020.
[75] Ailing Zeng, Xiao Sun, Fuyang Huang, Minhao Liu, Qiang
Xu, and Stephen Lin. Srnet: Improving generalization in 3d

2307.05853v2

Uploaded by

Copyright:

Available Formats

2307.05853v2

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2307.05853v2

Uploaded by

Copyright:

Available Formats

GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human

Pose Estimation from Monocular Video

Bruce X.B. Yu1 Zhi Zhang1 Yongxu Liu1

Abstract from advanced motion sensors such as motion capture sys-

Walk Jog Method PCK↑ AUC↑ MPJPE↓

S9 Eat. S11 Eat.

S9 WalkT. S11 WalkT.

You might also like