
MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras
Jinrui Zhang1†, Huan Yang1, Ju Ren2, Deyu Zhang1∗, Bangwen He1, Ting Cao3, Yuanchun Li4, Yaoxue Zhang2, Yunxin Liu4∗
1 School of Computer Science and Engineering, Central South University
2 Department of Computer Science and Technology, Tsinghua University
3 Microsoft Research
4 Institute for AI Industry Research (AIR), Tsinghua University
1 {zhangjinrui, yanghuan9812, zdy876, hebangwen}@csu.edu.cn
2 {renju, zhangyx}@tsinghua.edu.cn, 3 ting.cao@microsoft.com
4 {liyuanchun, liuyunxin}@air.tsinghua.edu.cn

ABSTRACT

Real-time depth estimation is critical for the increasingly popular augmented reality and virtual reality applications on mobile devices. Yet existing solutions are insufficient as they require expensive depth sensors or motion of the device, or have a high latency. We propose MobiDepth, a real-time depth estimation system using the widely-available on-device dual cameras. While binocular depth estimation is a mature technique, it is challenging to realize the technique on commodity mobile devices due to the different focal lengths and unsynchronized frame flows of the on-device dual cameras and the heavy stereo-matching algorithm.

To address these challenges, MobiDepth integrates three novel techniques: 1) iterative field-of-view cropping, which crops the fields-of-view of the dual cameras to achieve equivalent focal lengths for accurate epipolar rectification; 2) heterogeneous camera synchronization, which synchronizes the frame flows captured by the dual cameras to avoid the displacement of moving objects across the frames in the same pair; and 3) mobile GPU-friendly stereo matching, which effectively reduces the latency of stereo matching on a mobile GPU. We implement MobiDepth on multiple commodity mobile devices and conduct comprehensive evaluations. Experimental results show that MobiDepth achieves real-time depth estimation of 22 frames per second with a significantly reduced depth-estimation error compared with the baselines. Using MobiDepth, we further build an example application of 3D pose estimation, which significantly outperforms the state-of-the-art 3D pose-estimation method, reducing the pose-estimation latency and error by up to 57.1% and 29.5%, respectively.

CCS CONCEPTS
• Computer systems organization → Real-time systems; • Human-centered computing → Ubiquitous and mobile computing.

KEYWORDS
Real Time, Depth Estimation, Dual Camera, Mobile Device, OpenCL

ACM Reference Format:
Jinrui Zhang, Huan Yang, Ju Ren, Deyu Zhang, Bangwen He, Ting Cao, Yuanchun Li, Yaoxue Zhang, Yunxin Liu. 2022. MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras. In The 28th Annual International Conference On Mobile Computing And Networking (ACM MobiCom '22), October 17–21, 2022, Sydney, NSW, Australia. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3495243.3560517

† This work was done during an internship at the Institute for AI Industry Research (AIR), Tsinghua University.
∗ Corresponding author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ACM MobiCom '22, October 17–21, 2022, Sydney, NSW, Australia
© 2022 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-9181-8/22/10.
https://doi.org/10.1145/3495243.3560517

1 INTRODUCTION
In recent years, the mobile industry and research community have invested heavily in augmented reality (AR) and virtual reality (VR) applications for mobile devices [8, 11, 20, 37, 53, 54]. Statistics show that the AR/VR market reached 30.7 billion dollars in 2021 and will rise to about 300 billion dollars by 2024 [50]. Among the many technologies that enable diverse AR/VR applications on mobile devices, real-time depth estimation is a fundamental building block that connects the physical world to its 3D digital representation. For example, a key feature in AR applications is to render virtual objects on the digital surface of physical objects, and that surface is computed through depth estimation.

Currently, there are mainly three types of solutions for depth estimation on mobile devices: 1) Dedicated depth sensors. Some devices are equipped with dedicated depth sensors such as LiDAR, ToF cameras, and structured-light sensors. These sensors work by emitting light in a specific spectrum and calculating the depth based on the light reflected back. Although they can achieve precise depth estimation, they are only available on a few high-end mobile devices due to their high cost. 2) Learning-based depth prediction. Machine learning models, such as convolutional neural networks (CNNs), can learn to predict depth by training on labeled data [48, 52]. However, the capability of learning-based approaches relies heavily on the training dataset. They are usually unable to achieve satisfactory accuracy for new scenes and new objects that are not included in the dataset, as shown in Figure 1 (c)(d)(e). Furthermore, the models are generally heavy and difficult to run in real time on computation-limited mobile devices.
Figure 1: Example depth maps generated by MobiDepth, AnyNet, MADNet with and without online adaptation (named MADNet-MAD and MADNet-No, respectively), and ARCore, with the person sitting, walking and standing. (a) raw images. (b) our approach estimates accurate depth with crisp edges. (c) and (d) the learning-based depth estimation models, i.e., AnyNet and MADNet-No, perform poorly in scenarios different from the training dataset. (e) the performance of MADNet-MAD is still unsatisfactory, even with the extremely time-consuming online adaptation module. (f) ARCore barely estimates depth in these images.
The latency of the state-of-the-art (SOTA) models [48, 52] is as high as 80ms to 550ms on the high-end Huawei Mate40Pro smartphone, as shown in Section 8. 3) Depth from motion. The most common solution adopted by existing mobile systems (including ARKit in iOS [12] and ARCore in Android [13]) is the depth-from-motion algorithm [49]. The algorithm uses visual-inertial odometry (VIO) to combine information from the inertial measurement unit (IMU) with computer vision analysis of the scene visible to the camera to obtain the depth, i.e., it selects keyframes during the motion of the camera and estimates depth based on stereo matching between the most recent image and a past keyframe. Although this solution does not rely on dedicated sensors or large-scale training data, it requires the camera to be moving and expects the target object to be stationary, which significantly restricts its usage scenarios. Figure 1(f) shows the performance of ARCore when estimating the depth of the moving person; the accuracy of its depth estimation is clearly poor in this scenario.

Inspired by the success of binocular depth estimation techniques, we find that the arrangement of the rear-facing cameras on mobile devices brings a great opportunity for depth estimation. Ideally, the disparity can be readily obtained by comparing the pair of frames captured by the dual cameras. The displacement of the dual cameras provides a stable baseline, compared to the depth-from-motion solution. However, our in-depth analysis reveals several challenges in using the dual rear-facing cameras for depth estimation, as follows: 1) How to reduce the impact of the diverse focal lengths of the dual cameras. The rear-facing cameras are originally designed to serve various application scenarios, such as macro shooting or wide-angle shooting. Their focal lengths are thus quite diverse. This greatly impacts the accuracy of the epipolar rectification, which serves as the basis for depth estimation. 2) How to synchronize the frame flows captured by the dual cameras. The frame flows are highly out of sync, due to the different frame periods of the cameras as well as frequent garbage collection (GC). Suffering from out-of-sync frame flows, estimating the depth of objects in motion becomes impossible, since the objects have displacement within the pair of frames. 3) How to accelerate stereo matching on computation-limited mobile devices. The state-of-the-art stereo matching algorithms and CNN models, e.g., Semi-Global Block Matching (SGBM), MADNet, and HITNet, are computation-heavy, leading to long runtimes on mobile devices. The stereo matching needs to run in an online manner to find the correspondence between the points in the pair of frames. The long runtime leads to a low refresh rate of depth estimation applications.

Addressing the above challenges, we propose MobiDepth to leverage the rear-facing dual cameras to estimate depth in real time on mobile devices. MobiDepth resolves all the issues of the three existing solutions, i.e., it does not rely on any dedicated sensors or pre-training, and works well for target objects in motion. MobiDepth integrates several new techniques, each addressing one of the above challenges. Albeit simple, we are the first to apply them for efficient depth estimation using heterogeneous dual cameras on commodity mobile devices. 1) Iterative field-of-view cropping. It iteratively crops the field-of-view (FoV) of one camera until it matches that of the other camera. As such, the dual cameras achieve equivalent focal lengths, which improves the accuracy of epipolar rectification. 2) Heterogeneous camera synchronization. It filters out frames that are generated at prominently different times. Moreover, it timely releases the metadata created by Android to avoid frequent garbage collection (GC). As such, the frame flows from the dual cameras are synchronized to avoid the displacement of moving objects across the frames in the same pair. 3) Mobile GPU-friendly stereo matching. We use SGBM for stereo matching due to its relatively high accuracy and efficiency. However, it still cannot achieve real-time performance on mobile devices. Our insight is that the limited memory bandwidth of the mobile GPU lowers the computation efficiency. As such, we fuse the calculations in SGBM to reduce the overhead of accessing the global memory. Furthermore, we carefully enlarge the data slicing to decrease the number of concurrent threads, which reduces access contention on the shared memory and the synchronization cost among threads.

Based on these techniques, we build the end-to-end MobiDepth system. MobiDepth first crops the FoVs of the dual cameras and synchronizes the frame flows. Then, it runs the stereo matching to estimate the depth in a real-time manner. We implement MobiDepth on mainstream mobile devices equipped with multiple cameras, and show a running demo in Figure 1(b). We also conduct extensive experiments under various object distances and motion conditions. Evaluation results show that MobiDepth achieves high performance in terms of both latency and accuracy. For example, it achieves real-time depth estimation of 22 frames per second (FPS) on the Huawei Mate40Pro, i.e., an average latency of 45ms, with a small mean depth-estimation error of 1.1%~10.4% for stationary objects at distances ranging from 0.5m to 5m, without requiring the motion of the device. MobiDepth significantly outperforms the learning-based depth models, e.g., AnyNet [52] and MADNet [48], as well as the state-of-the-art depth estimation system ARCore [13]. For example, MobiDepth achieves a speedup of 1.66× and 12.13× compared to AnyNet and MADNet without online adaptation on the Huawei Mate40Pro, respectively. With the device moving, MobiDepth achieves a small mean error of 8.7% for objects moving at a speed of 30-80 cm/s (at a distance of 100cm on the Huawei P30), while the mean error of ARCore is as high as 43.5% under the same settings. Note that ARCore does not work at all without the motion of the device. Furthermore, we build an example application of 3D pose estimation based on MobiDepth, and our application significantly outperforms the state-of-the-art 3D pose estimation method, i.e., MobileHumanPose [8], reducing the pose estimation latency and error by up to 57.1% and 29.5%, respectively.

In summary, the main contributions are as follows:

• Conduct an in-depth analysis of the performance bottlenecks of on-device dual-camera-based depth estimation;
• Propose iterative FoV cropping and heterogeneous camera synchronization to achieve equivalent focal lengths and synchronized frame flows for the dual cameras, respectively;
• Propose mobile GPU-friendly stereo matching based on SGBM, which significantly reduces the memory access overhead and the synchronization cost among threads;
• Implement the MobiDepth system and a MobiDepth-based 3D pose estimation application on commodity mobile devices to demonstrate the effectiveness of MobiDepth.

2 BACKGROUND AND CHALLENGES
We first introduce the background of using a dual-camera system to estimate depth in an ideal case. Then, we elaborate on the challenges of implementing such a dual-camera system on commodity mobile devices.

2.1 Background
Depth estimation in a dual-camera system. The dual-camera system simulates the principle of human vision for estimating depth in the 3D world, as shown in Figure 2.

Figure 2: Illustration of the imaging process in an ideal binocular system.

The dual cameras have the same orientation at different locations separated by distance B. To get the accurate depth Z of a point P(x_w, y_w, z_w), there are several prerequisites: 1) the focal length f of the dual cameras should be equivalent. This guarantees that the cameras have the same FoV, so that the imaging points, i.e., P_L(u_l, v_l) and P_R(u_r, v_r), lie on the same epipolar lines on the two image planes; 2) the dual cameras need to capture frames simultaneously; 3) the disparity d_L − d_R between the two points P_L and P_R should be accurately estimated.

Given that the above conditions are met, the depth Z can be derived by Equation 1:

Z = (B · f) / (d_L − d_R)    (1)

The equation indicates that the depth can be readily derived based on the accurate disparity, i.e., d_L − d_R, and the equivalent focal length f of the dual cameras.

Semi-Global Block Matching (SGBM). SGBM is a commonly used computer vision algorithm in binocular camera systems for depth estimation. It compares the similarity of pixels in the two images captured by the binocular cameras to calculate the disparity of a pixel [19]. Figure 3 shows a block diagram of SGBM. It takes a pair of rectified images as input and outputs the disparity map. Specifically, SGBM consists of four steps: 1) Cost Computation. It measures the similarity between the pixels to be matched and candidate pixels through three operations, i.e., Census Transform [41], Hamming distance computation for the cost of pixels [51], and cost optimization with a sliding window. 2) Cost Aggregation. It aggregates the cost values of pixels on the same row and column, respectively. 3) Disparity Computation. It uses the Winner-Takes-All algorithm [31] to determine the optimal disparity value for each pixel according to the aggregated cost. 4) Disparity Refinement. It refines the quality of the disparity map by filtering out peaks.
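To make the relation in Equation 1 concrete, the following is a minimal C++ sketch (not the paper's optimized mobile-GPU implementation) that computes a disparity map with OpenCV's SGBM-style matcher on a rectified pair and converts it to metric depth; the baseline and focal-length values are illustrative placeholders.

```cpp
// Minimal sketch of Equation (1): disparity from an SGBM-style matcher -> metric depth.
// Baseline/focal values are placeholders; this is a plain CPU illustration only.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat left  = cv::imread("left.png",  cv::IMREAD_GRAYSCALE);
    cv::Mat right = cv::imread("right.png", cv::IMREAD_GRAYSCALE);

    auto sgbm = cv::StereoSGBM::create(/*minDisparity=*/0, /*numDisparities=*/64,
                                       /*blockSize=*/9);
    cv::Mat disp16;                                  // fixed-point disparity, scaled by 16
    sgbm->compute(left, right, disp16);

    cv::Mat disparity;
    disp16.convertTo(disparity, CV_32F, 1.0 / 16.0);

    const float baselineCm = 2.1f;                   // B: dual-camera baseline (placeholder)
    const float focalPx    = 800.0f;                 // f: focal length in pixels (placeholder)

    cv::Mat depthCm(disparity.size(), CV_32F, cv::Scalar(0));
    for (int y = 0; y < disparity.rows; ++y)
        for (int x = 0; x < disparity.cols; ++x) {
            float d = disparity.at<float>(y, x);     // d = d_L - d_R
            if (d > 0.0f) depthCm.at<float>(y, x) = baselineCm * focalPx / d;  // Eq. (1)
        }
    // depthCm now holds per-pixel depth in centimeters (0 where disparity is invalid).
    return 0;
}
```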
Table 1: The inference time of stereo matching solutions on Huawei Mate40Pro (Kirin 9000 SoC).

Deep Learning (DL) Methods               | Traditional Methods
Method-DL†       Latency (s)   FPS       | Method-Trad‡       Latency (s)   FPS
StereoNet [22]   3.23          0.31      | AD-Census∗ [32]    1100          0.001
MADNet⋄ [48]     0.55          1.81      | PMS∗ [6]           1500          0.0006
HITNet [44]      6.98          0.14      | OpenCV-SGBM [19]   0.135         7.4

† These stereo matching models were tested by converting their open-source PyTorch models into TFLite models.
‡ The input resolution of these algorithms is 640 × 480; the max disparity in SGBM is 64.
∗ AD-Census and PMS are implemented with the open-source code from GitHub [1].
⋄ The latency of MADNet without online adaptation.

Figure 3: The process of SGBM. It takes a pair of rectified images as input and outputs the disparity map. It consists of four steps, i.e., Cost Computation, Cost Aggregation, Disparity Computation, and Disparity Refinement.

Figure 4: The raw images with different FoVs from the dual cameras are on the left. Through calibration and rectification, the images are severely distorted, as shown on the right.

Figure 5: Example of out-of-sync frames. The same key-point, i.e., the left hand circled in red, is not aligned horizontally.

2.2 Design Challenges
Using the dual cameras on commodity mobile devices to estimate accurate depth in real time is difficult. We expose the key issues that hinder accurate and real-time on-device depth estimation: 1) diverse camera focal lengths; 2) out-of-sync frame flows of the dual cameras; 3) computation-heavy stereo matching.

The focal lengths of the dual cameras are quite diverse, leading to inaccurate epipolar rectification. Off-the-shelf commercial mobile devices have cameras with different focal lengths. The focal length determines the camera's field of view (FoV). For instance, a primary camera with a short focal length has a wide FoV (WFoV). In contrast, a secondary camera with a long focal length has a tele FoV (TFoV). If we directly calibrate and rectify the captured images of the dual cameras, the considerable difference in focal lengths between the dual cameras distorts the epipolar rectification¹ disastrously, as shown in Figure 4. As shown in Equation 1, having equivalent focal lengths for the dual cameras is necessary for depth estimation. However, this requirement is not met on commodity mobile devices.

¹ Epipolar rectification of a stereo pair is the process of re-sampling a pair of stereo images so that the apparent motion of corresponding points is horizontal, which is an important preliminary step in depth estimation [10].

The frame flows from the on-device dual cameras are highly out-of-sync. Synchronization between the two captured frames of the dual cameras is crucial for the accuracy of depth estimation. As shown in Figure 5, when the two frames are captured at different times, the same key-point, such as the left hand in the two frames, can move dozens of pixels between the two frames in scenes with motion, making epipolar rectification extremely hard. The reason is that the frame periods, i.e., the time intervals between consecutive frames, of the dual cameras are not exactly equal even if we set them to use the same frame rate. For 30 frames per second (FPS), we find that the average frame period of the left camera is 33.33ms, while that of the right camera is 33.31ms, i.e., a 0.02ms time difference for every frame on average. Importantly, this difference accumulates as the system runs, making the frame flows from the dual cameras highly out-of-sync. Moreover, we observe that frequent garbage collection (GC) can also make the frames out-of-sync.

State-of-the-art stereo matching solutions are computation-heavy, leading to intolerable runtime latency. We test the latency of several state-of-the-art CNN models and traditional algorithms for stereo matching running on the CPU² of the Huawei Mate40 Pro with the Kirin 9000 SoC (System on a Chip). Table 1 shows that these stereo-matching solutions cannot achieve real-time performance on mobile devices. Even SGBM [19] still takes 135ms, limiting the frame rate to only 7 FPS. What's worse, the latency shown in Table 1 does not include the time for data copying and image rendering, which are also time-consuming.

² We use the CPU rather than the GPU because the algorithms/models either cannot run or run slower on the mobile GPU.

Next, we describe how MobiDepth addresses these challenges with novel techniques.

3 MOBIDEPTH SYSTEM OVERVIEW
To enable dual-camera-based real-time depth estimation on mobile devices, we need to achieve three goals: 1) matching the views of the dual cameras with different focal lengths; 2) synchronizing the frame flows captured by the dual cameras; 3) boosting the stereo matching algorithm on mobile devices. We design the MobiDepth system to achieve these three goals. It consists of two phases, i.e., the offline phase and the online phase, as shown in Figure 6.

Figure 6: The system overview and workflow of MobiDepth.
In the offline phase, we design an iterative FoV cropping technique to determine how to match the FoVs of the dual cameras (Section 4). It takes several chessboard patterns as the input for camera calibration, and crops the WFoV iteratively until the focal lengths of the dual cameras are equivalent. Based on the WFoV cropped by factor ᾱ and the TFoV, the following epipolar rectification module calculates the values of the baseline B and the focal length f. The offline phase runs only once during the initialization of the MobiDepth system.

In the online phase, we introduce a heterogeneous camera synchronization technique (Section 5) to align the frame flows from the dual cameras, such that the time difference between two frames from the dual cameras does not exceed a certain threshold. The image rectification module takes a pair of cropped and synchronized images as input. It determines the correspondence between the epipolar lines of the pair of images, based on the parameters obtained in the offline phase. After that, the mobile GPU-friendly stereo matching technique (Section 6) efficiently finds the corresponding points on the epipolar lines to calculate the disparity. The final depth map can then be readily estimated based on the disparity, baseline, and focal length.

The details of the proposed techniques can be found in the following sections.

4 ITERATIVE FOV CROPPING
The focal lengths of the dual cameras on mobile devices are usually quite diverse, leading to different FoVs of the dual cameras. Generally, device manufacturers provide the equivalent focal length of each camera. Ideally, we could calculate the crop factor that makes the FoVs of the dual cameras equal. Yet, the provided equivalent focal length is not accurate enough. For example, the Honor V30Pro officially reports an equivalent focal length of 16mm, while the measured value is close to 17mm. Cropping the images using the provided equivalent focal length therefore leads to significant error. In addition, existing approaches typically use image matching algorithms to crop the FoVs, such as SIFT [30], SURF [5], and ORB [36]. However, the accuracy of these approaches is not satisfactory since they only perform a homography transformation, as shown by the experimental results in Section 8.3.1. In the following, we first analyze the feasibility of finding the equivalent focal lengths through cropping the FoVs. Then we iteratively crop the FoVs until the focal lengths of the dual cameras are equivalent.

FoV cropping analysis. According to the lens imaging rule [35], the field-of-view (FoV) of a camera is determined by its focal length. Specifically, the camera with the smaller focal length f^W has a WFoV, and the camera with the longer focal length f^T has a TFoV, as shown in Figure 7. By cropping the WFoV to be equal to the TFoV, we can make the focal lengths of the dual cameras both equal to f^T. As such, the same FoVs bring equivalent focal lengths for the dual cameras.

Figure 7: The relationship between focal length and FoV size.

Based on this fact, MobiDepth proposes iterative FoV cropping to iteratively crop the WFoV until the focal lengths of the dual cameras are equivalent. Since the image captured by the camera is not square, i.e., the image size is 640×480, we use ᾱ = (α_x, α_y) to denote the crop factor on the width and height, respectively. We formulate Equation 2 as the loss function that quantifies the difference of focal lengths between the dual cameras:

J(ᾱ) = (1/2) (f^W_{ᾱ_i} − f^T)²    (2)

where f^T denotes the focal length of the TFoV camera, and f^W_{ᾱ_i} denotes the focal length of the WFoV camera after i rounds of cropping. The optimal crop factor can be found by iteratively minimizing the loss.

Figure 8: The process of the iterative FoV cropping technique.

Figure 8 illustrates the workflow of iterative FoV cropping: 1) First, we initialize the crop factor ᾱ_0 to 1, and get the focal lengths of the dual cameras with Zhang's calibration method [58]³, i.e., f^T and f^W_{ᾱ_0}. To obtain more accurate values, we capture over 15 images of the calibration pattern (i.e., the chessboard) at different positions and filter out the image pairs with re-projection error over φ (φ is set to 0.05 in our implementation). 2) Then, we crop the WFoV images from all sides according to ᾱ_i with the center of the FoV as the basis point, and calibrate the camera after cropping to get the updated focal length f^W_{ᾱ_i}. 3) Next, we update ᾱ_i to ᾱ_{i+1} using the following rule: if f^W_{ᾱ_i} − f^T > 0, then ᾱ_{i+1} = ᾱ_i − step; otherwise, ᾱ_{i+1} = ᾱ_i + step, where step is set to 0.01 in our implementation. 4) If the value of Equation 2 stays below Δ (0.01 according to our evaluation) for five consecutive rounds, we obtain the final crop factor ᾱ, based on which the focal lengths of the dual cameras are equivalent.

³ Zhang's method is a camera calibration method that uses a calibration pattern, e.g., a chessboard, to obtain camera parameters such as the focal lengths and lens distortion coefficients.
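The iterative search above can be sketched in a few lines of C++. This is a simplified illustration, not MobiDepth's code: it assumes a caller-supplied focalOfCrop() callback (a hypothetical helper) that re-runs Zhang's calibration on the WFoV images cropped by the given factor and returns the resulting focal length.

```cpp
// Sketch of the iterative FoV cropping search (Section 4). The focalOfCrop()
// callback is a hypothetical stand-in for "crop the WFoV calibration images by
// (ax, ay) and re-calibrate with Zhang's method, returning the focal length".
#include <functional>
#include <utility>

std::pair<double, double> searchCropFactor(
        double fT,  // focal length of the TFoV camera
        const std::function<double(double ax, double ay)>& focalOfCrop) {
    double ax = 1.0, ay = 1.0;               // crop factor, initialized to 1 (no crop)
    const double step = 0.01, delta = 0.01;  // update step and convergence threshold
    int stableRounds = 0;

    while (stableRounds < 5) {               // stop after 5 consecutive converged rounds
        double fW   = focalOfCrop(ax, ay);   // calibrate the cropped WFoV camera
        double loss = 0.5 * (fW - fT) * (fW - fT);    // Equation (2)
        if (loss < delta) { ++stableRounds; continue; }
        stableRounds = 0;
        double dir = (fW - fT > 0.0) ? -step : step;  // update rule from step 3)
        ax += dir;
        ay += dir;
    }
    return {ax, ay};                          // final crop factor (alpha_x, alpha_y)
}
```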
5 HETEROGENEOUS CAMERA SYNCHRONIZATION
Accurate depth estimation of objects relies heavily on the synchronization of the frame flows of the heterogeneous dual cameras. In case of out-of-sync flows, the target object can move dozens of pixels between the pair of rectified images, leading to disastrous stereo-matching performance.

To obtain time-aligned frames, previous works often use additional external hardware signals to simultaneously trigger the cameras to capture images [21, 40]. However, there is no such dedicated hardware to trigger the cameras on most mobile devices. As discussed in Section 2.2, the actual frame rates of the dual cameras are slightly different even under the same setting. The time difference of the frames captured by the heterogeneous cameras would accumulate at runtime if they are not carefully synchronized.

In this section, we introduce a simple yet effective technique to reduce the time difference between the frames from the two heterogeneous cameras to under a threshold. Figure 9 illustrates the process of the technique. T_L and T_R denote the frame periods of the left camera and the right camera, respectively. The small circles in the figure represent frames, and the number above each circle represents the order of the frame in the frame flow. Based on the recorded timestamp of each frame captured by both cameras, the camera synchronization works as follows: 1) Compare the timestamps of two frames with the same order in the frame flows. 2) If the difference between the timestamps of a frame pair is lower than the threshold θ, regard the two frames as a matched pair, denoted by the solid line in Figure 9. The matched frame pair is used as the input of the next step (image rectification). 3) If the time difference exceeds the threshold θ, discard the frame in the faster flow and go back to step 1 to compare its next frame with the frame in the other flow. For example, in Figure 9, the n-th frames in the two flows are not matched, so the n-th frame from the left camera is discarded, and the (n+1)-th frame from the left camera is compared with the n-th frame from the right camera.

Figure 9: The illustration of heterogeneous camera synchronization.

By fine-tuning θ, we trade off the synchronization of the frame flows against the loss of frames. With a larger value of θ, we can retain more frames, but the frames may have a larger time difference and lead to inaccurate depth estimation due to object displacement, and vice versa. The value of θ should be determined according to the application scenario. For example, in a scenario where the target object is mostly static, we can use a larger θ to tolerate frames that are slightly out-of-sync, while if the target objects move fast, the value of θ should be smaller to ensure higher accuracy.

In addition, when implementing the synchronization technique, we observed that retrieving frame flows from the dual cameras frequently triggered Android garbage collection (GC) events, which caused the frame periods of the dual cameras to fluctuate. By tracing the memory usage of MobiDepth, we found that the frequent GCs were due to un-recycled metadata, such as the CameraMetadata objects which contain the settings of the camera [14]. After running for a while, the un-recycled metadata would take up all the pre-allocated memory. To solve this issue, we use Java reflection to timely release the metadata once it is no longer useful.
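The timestamp-matching logic of steps 1)–3) can be expressed as the small C++ sketch below. The queue types and the Frame structure are illustrative rather than MobiDepth's actual data structures; timestamps are assumed to be the per-frame capture timestamps described above.

```cpp
// Sketch of the timestamp-based frame matching (Section 5). Frames carry the
// capture timestamp in nanoseconds; thetaNs is the matching threshold (e.g., 16 ms).
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

struct Frame { int64_t timestampNs; /* image data omitted for brevity */ };

std::optional<std::pair<Frame, Frame>> matchNextPair(
        std::deque<Frame>& leftFlow, std::deque<Frame>& rightFlow, int64_t thetaNs) {
    while (!leftFlow.empty() && !rightFlow.empty()) {
        const Frame l = leftFlow.front();
        const Frame r = rightFlow.front();
        int64_t diff = l.timestampNs - r.timestampNs;
        int64_t absDiff = diff < 0 ? -diff : diff;
        if (absDiff <= thetaNs) {                 // step 2): close enough -> matched pair
            leftFlow.pop_front();
            rightFlow.pop_front();
            return std::make_pair(l, r);
        }
        // step 3): too far apart -> drop the frame in the faster (earlier) flow
        if (diff < 0) leftFlow.pop_front(); else rightFlow.pop_front();
    }
    return std::nullopt;                          // wait for more frames
}
```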
Table 2: The latency of each step of SGBM using the existing parallel optimization strategy on the GPU of Huawei Mate40Pro (Kirin 9000 SoC).

Steps                    GPU (ms)
Cost Computation         30
Cost Aggregation         150
Disparity Computation    10
Disparity Refinement     9
Total latency            199

6 MOBILE GPU-FRIENDLY STEREO MATCHING
We use SGBM as the stereo matching algorithm in MobiDepth due to its relatively low latency (e.g., 135ms on the CPU of the Kirin 9000 SoC) and satisfactory accuracy (e.g., 4.41% erroneous pixels in total with an error threshold of 5 pixels on KITTI 2012 [16]).

Although SGBM is efficient, it still cannot achieve real-time performance on mobile devices. Using the GPU is a promising direction to further optimize its performance. However, existing works on SGBM acceleration target desktop GPUs [4, 18]. We adopted similar optimizations to implement SGBM on the mobile GPU, but the performance was even worse than running on the CPU, as shown in Table 2. This is due to the following limitations of mobile GPUs.
Limited memory bandwidth. The most time-consuming steps in SGBM are Cost Computation and Cost Aggregation, since they require frequent memory read/write operations to calculate the disparity of each pixel. However, the memory bandwidth of mobile GPUs is quite limited compared with desktop GPUs [27, 55]. For instance, the Mali-G78 GPU in the Kirin 9000 SoC has 25.98 GB/s of memory bandwidth shared with the CPU, which is only 1/24 of the memory bandwidth of the NVIDIA RTX 2080Ti GPU (i.e., 616 GB/s). As a result, the frequent memory read/write operations in SGBM lead to a long latency on mobile GPUs.

Limited memory architecture support. In the Cost Aggregation of SGBM, we need thread synchronization to get the minimum aggregated cost of a pixel over all disparities. On a desktop GPU, thread synchronization is performed in the fast on-chip shared memory, while mobile GPUs such as the Mali GPU only have off-chip shared memory, which is much slower.

To tackle these challenges, we propose multiple techniques, including calculation fusion and data-merged memory write to reduce the cost of memory reads and writes in SGBM, and enlarged data slicing to reduce the overhead of thread synchronization.

6.1 Reducing the Memory Read and Write Overhead
Memory read and write overhead analysis. In the SGBM algorithm, each step uses the results of the previous step, which incurs a large number of read and write operations to the global memory. 1) In Cost Computation, after the Census Transform converts each pixel of the pair of stereo images to a binary string, SGBM calculates the Hamming distance between each pixel in the left image and the corresponding disparity-range pixels in the right image and writes the data to memory, and then reads the data back out for cost computation with a sliding window. Figure 10 shows the original process of the Cost Computation step. We denote the image resolution as W × H and the max disparity range as D. Calculating the Hamming distance for one disparity requires reading two pixels from memory and writing one result back, i.e., W × H × D × 3 memory reads and writes in total. After the Hamming distances of all pixels are computed, SGBM reads the two Hamming distance results of two disparities from memory and feeds them to the cost computation operation to derive the disparity cost. Thus, the number of memory reads and writes to obtain the whole image's disparity cost is W × H × D × (3 + 3). 2) Cost Aggregation adopts four-path aggregation, i.e., left-right and up-down. The existing approach calculates each pixel's aggregation cost for each path individually, which also causes a considerable number of memory read and write operations. 3) In Cost Aggregation, SGBM needs the aggregation cost of each pixel in each of the four paths. It has to write these values back to memory, which requires 2 × H × D + 2 × W × D writes.

Figure 10: The process of Cost Computation in SGBM.

Calculation fusion. To reduce the memory read and write overhead, we fuse some calculations in Cost Computation and Cost Aggregation. In Cost Computation, we compute the two Hamming distances with four pixels each and then use the results directly to get the disparity cost. In this case, the number of memory reads and writes is W × H × D × (4 + 1), which saves W × H × D accesses compared with the original approach. In Cost Aggregation, the left and right aggregations operate on the same data. Therefore, they can be done simultaneously. For instance, we may calculate the I_d-th pixel in the left aggregation and the I_{W−1−d}-th pixel in the right aggregation at the same time. In this way, we reduce the number of memory accesses by W × D + H × D. The same applies to the up and down aggregations.

Data-merged memory write. Merging data before writing it back to memory can further reduce the memory access overhead. In Cost Aggregation, we combine the two values obtained from the left and right aggregations, or the up and down aggregations, into one array for the memory write. As such, the number of memory writes is halved. As shown in Figure 11, we combine the two 16-bit values obtained from the left and right aggregations into a 32-bit value, which is written into memory only once.

Figure 11: The process of data-merged memory write in Cost Aggregation (taking the left-right aggregation as an example).
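The packing idea behind the merged write can be shown with a small host-side C++ sketch; the real implementation is an OpenCL kernel, but the bit-level packing is the same: two 16-bit aggregation results are combined into one 32-bit word so that only one store is issued per pixel.

```cpp
// Illustration of data-merged memory write: pack the 16-bit left-to-right and
// right-to-left aggregation costs of a pixel into one 32-bit word, so the two
// results are written back with a single store (mirrors Figure 11).
#include <cstdint>
#include <vector>

inline uint32_t packCosts(uint16_t leftAgg, uint16_t rightAgg) {
    return (static_cast<uint32_t>(rightAgg) << 16) | leftAgg;
}

// One merged write per pixel instead of two separate writes.
void writeMergedRow(const std::vector<uint16_t>& leftAgg,
                    const std::vector<uint16_t>& rightAgg,
                    std::vector<uint32_t>& merged) {
    for (size_t i = 0; i < leftAgg.size(); ++i)
        merged[i] = packCosts(leftAgg[i], rightAgg[i]);
}

// Later stages unpack the two values without any extra store:
inline uint16_t unpackLeft (uint32_t v) { return static_cast<uint16_t>(v & 0xFFFF); }
inline uint16_t unpackRight(uint32_t v) { return static_cast<uint16_t>(v >> 16); }
```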
6.2 Reduced Data Synchronization Overhead
In Cost Aggregation, one of the critical steps is to calculate the minimal aggregated cost for each pixel, which requires all the disparities of each pixel to be calculated and synchronized. The existing implementation of SGBM typically creates a number of threads equal to the maximum number of disparities (D) to calculate the minimum value of all aggregated costs for a pixel. However, this leads to extensive synchronization cost and contention on memory access. On a desktop GPU the synchronization is done in on-chip shared memory with a low execution time, while on mobile GPUs, i.e., Mali GPUs, this operation is executed in off-chip shared memory, which is time-consuming.

Enlarged data slicing. To address this issue, we take 2^n (we set n=3 according to our evaluation) disparities at once through vectorized memory access and put them on a single thread for processing, as shown in Figure 12.

Figure 12: The schematic diagram of data slicing.
With our enlarged data slicing method, each thread calculates 2^n disparity values and obtains the minimum disparity cost on that thread. Therefore, we only need to synchronize D/2^n threads in the mobile shared memory to get the minimum value over all disparities, which significantly reduces the data synchronization cost on mobile GPUs. Furthermore, our method also reduces the number of shared-memory reads and writes. We read 2^n disparity values with a single vectorized memory access, which decreases the number of memory reads by D − D/2^n, and we only need to write D/2^n values to shared memory to get the minimum disparity cost. Besides, we increase the amount of data handled by each thread, improving the data throughput and bandwidth utilization.
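A scalar C++ rendering of the per-thread work after enlarging the slice is given below: each "thread" reduces 2^n consecutive disparity costs locally, so only D/2^n partial minima need to be combined and synchronized. In the real system this is an OpenCL kernel and the 2^n values are fetched with a vectorized load (vloadn); the plain loop here is only a stand-in for that access.

```cpp
// Sketch of enlarged data slicing (Section 6.2): each slice of 2^n = 8 disparity
// costs is reduced locally (no synchronization), then only D/8 partial minima
// are combined in the winner-takes-all step.
#include <cstdint>
#include <limits>
#include <vector>

struct BestDisparity { uint16_t cost; int disparity; };

BestDisparity reduceSlice(const std::vector<uint16_t>& aggCost, int begin, int len) {
    BestDisparity best{std::numeric_limits<uint16_t>::max(), -1};
    for (int d = begin; d < begin + len; ++d)       // local, per-thread reduction
        if (aggCost[d] < best.cost) best = {aggCost[d], d};
    return best;
}

int winnerTakesAll(const std::vector<uint16_t>& aggCost /* size D */) {
    const int n = 3, sliceLen = 1 << n;             // 2^n = 8 disparities per thread
    BestDisparity best{std::numeric_limits<uint16_t>::max(), -1};
    for (int s = 0; s < static_cast<int>(aggCost.size()); s += sliceLen) {
        BestDisparity local = reduceSlice(aggCost, s, sliceLen);  // per-thread work
        if (local.cost < best.cost) best = local;   // only D/2^n values to combine
    }
    return best.disparity;
}
```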
Furthermore, the Cost Computation of one frame can start right after the Disparity Computation of the previous frame completes. To further reduce the latency of SGBM on mobile devices, we offload the Disparity Refinement of each frame onto the CPU, as shown in Figure 13. The other three steps, i.e., Cost Computation, Cost Aggregation, and Disparity Computation, still run on the mobile GPU.

Figure 13: The latency hiding in our SGBM algorithm.

7 IMPLEMENTATION
We implement MobiDepth on commodity Android mobile devices in Java [3] and C++ [43] with 5,457 lines of code excluding the testing code, as counted by the Android Studio Statistics tool. Specifically, we use the official multi-camera API [14] to obtain the images captured by each camera at 30 FPS.

To obtain the focal length of each camera, we adopt the calibration tool in OpenCV [9], which is based on Zhang's method [58]. With this method, we capture 20 images of the 12 × 9 chessboard pattern, whose squares are 20mm × 20mm, placed on a plate, and then filter out the image pairs with a re-projection error exceeding 0.2 pixels to improve the calibration accuracy. After we align the FoVs of the dual cameras based on the iterative FoV cropping method, we pass the processed pairs of images to the cvStereoRectify function in OpenCV for epipolar rectification.
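For reference, the rectification setup around OpenCV's stereo rectification routine can be sketched as below. The intrinsics, distortion coefficients, and extrinsics are assumed to come from the chessboard calibration described above; this is a minimal sketch of the offline step, not MobiDepth's exact code.

```cpp
// Minimal sketch of the offline epipolar-rectification setup with OpenCV.
// K1/K2 (intrinsics), d1/d2 (distortion), R/T (extrinsics) are assumed to come
// from the chessboard calibration described above.
#include <opencv2/opencv.hpp>

void buildRectifyMaps(const cv::Mat& K1, const cv::Mat& d1,
                      const cv::Mat& K2, const cv::Mat& d2,
                      const cv::Mat& R,  const cv::Mat& T,
                      cv::Size imageSize, cv::Mat maps[4]) {
    cv::Mat R1, R2, P1, P2, Q;
    cv::stereoRectify(K1, d1, K2, d2, imageSize, R, T, R1, R2, P1, P2, Q);

    // Per-camera lookup maps, later used by cv::remap() on each synchronized pair.
    cv::initUndistortRectifyMap(K1, d1, R1, P1, imageSize, CV_32FC1, maps[0], maps[1]);
    cv::initUndistortRectifyMap(K2, d2, R2, P2, imageSize, CV_32FC1, maps[2], maps[3]);

    // The rectified focal length f and baseline B needed by Equation (1) can be
    // read from the projection matrices P1/P2 (or from Q) if required.
}
```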
We implement the SGBM in MobiDepth on mobile GPUs with OpenCL 2.0 [42], which is supported by most mobile devices. We use the vloadn/vstoren functions to vectorize the read-buffer format to improve bandwidth utilization, and use the select function to reduce branch operations. To ensure stable performance of our system, we set the input image resolution of the SGBM algorithm to 640 × 480, the disparity level to 64, and use four path directions for cost aggregation.

In addition, we implement an example 3D pose-estimation application based on MobiDepth. We use a lightweight 2D pose-estimation method [57] based on TensorFlow Lite [28] to get the 2D coordinates of human keypoints. Then, we combine the coordinates of each keypoint and the depth of the corresponding position to form the 3D coordinates.

8 EVALUATION
In this section, we evaluate the overall performance of MobiDepth and its key components. We also evaluate the 3D pose-estimation application and the system overhead of MobiDepth, including the latency on mid-to-low-end mobile devices, the power consumption, and the memory usage.

8.1 Experimental Setup
Hardware platform. We built a hardware platform to evaluate the depth estimation performance of MobiDepth. As shown in Figure 14, we placed a mobile device and a depth camera (Intel RealSense D435i) horizontally on a bracket, facing the same active area. We installed MobiDepth and the baseline solutions on the mobile device to calculate the depth of objects in the active area, and the depth camera was used to produce the ground-truth depth values. We tested three mobile devices that cover different mobile SoCs and diverse computing capabilities, as shown in Table 3.

Figure 14: The hardware platform used to evaluate MobiDepth. It aligns the lenses horizontally to guarantee the accuracy of the evaluation.

Table 3: Hardware configurations of the mobile devices used in the experiments.

Device               SoC            CPU          GPU
HWM40P∗              Kirin 9000     Cortex-A77   Mali G78
Huawei P30 (HWP30)   Kirin 980      Cortex-A76   Mali G76
Google Pixel6Pro     Google Tensor  Cortex-X1    Mali G78
∗ HWM40P is the abbreviation of Huawei Mate40 Pro.

Operating conditions. We considered a wide range of operating conditions to evaluate the depth estimation systems, including the situations when the mobile device is stationary or moving and the target object is stationary or moving at different distances. Specifically, we considered different statuses of the target object, including stationary, moving slowly, and moving quickly. For each object status, we considered different distances of the target object, including 50cm, 100cm, 300cm, and 500cm. The target object is a box with a flat surface for the convenience of accuracy computation.
We also included the cases where the depth-estimation device is stationary or moving.

Baselines. The primary baselines against which we compared MobiDepth are AnyNet [52] and MADNet with and without online adaptation⁴ [48], which are the state-of-the-art depth estimation models on mobile devices, and ARCore [13], which is the official framework for depth estimation on Android. For simplicity, we name MADNet with and without online adaptation MADNet-MAD and MADNet-No, respectively. We also considered several other baselines for evaluating different components of MobiDepth. For example, we compared with the SIFT method [30] on the performance of FoV matching. When evaluating the example application (mobile 3D pose estimation) based on MobiDepth, we selected a SOTA 3D human pose estimation model, MobileHumanPose [8], as the baseline.

⁴ We use an average of 1280 frames for online adaptation.

Metrics. We evaluated the performance of MobiDepth in terms of accuracy, latency, energy cost, and memory usage. For depth-estimation accuracy, we used the mean distance error (mD_err), the average percentage error between the estimated depth of each pixel p in the target object (Depth_est) and the ground-truth depth (Depth_gt), as follows:

mD_err = mean_{p ∈ Object} |Depth_gt(p) − Depth_est(p)| / Depth_gt(p)    (3)

When evaluating the example application, we computed the mean per-joint position error (MPJPE), which is the mean Euclidean distance between the positions of the predicted joints and the ground truth.
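For completeness, the mean distance error of Equation 3 can be computed as in this small sketch; the object mask selecting the pixels p ∈ Object is assumed to be given.

```cpp
// Sketch of the mean distance error (mD_err, Equation 3): average relative error
// over the pixels of the target object (the object mask is assumed to be given).
#include <cmath>
#include <opencv2/opencv.hpp>

double meanDistanceError(const cv::Mat& depthGt,    // CV_32F ground-truth depth
                         const cv::Mat& depthEst,   // CV_32F estimated depth
                         const cv::Mat& objectMask) // CV_8U, non-zero on the object
{
    double sum = 0.0;
    int count = 0;
    for (int y = 0; y < depthGt.rows; ++y)
        for (int x = 0; x < depthGt.cols; ++x) {
            if (!objectMask.at<uchar>(y, x)) continue;
            float gt = depthGt.at<float>(y, x);
            if (gt <= 0.0f) continue;                            // skip invalid ground truth
            sum += std::abs(gt - depthEst.at<float>(y, x)) / gt; // per-pixel relative error
            ++count;
        }
    return count ? sum / count : 0.0;                            // mD_err as a fraction
}
```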
8.2 Overall Depth Estimation Accuracy
We first evaluate the end-to-end performance of MobiDepth on depth estimation. Table 4 shows the accuracy of MobiDepth, AnyNet, MADNet, and ARCore on different mobile devices. Among the three devices, HWP30 supports all methods. ARCore is not supported on HWM40P since it cannot run Google Play services for AR. As a third-party application, MobiDepth is not authorized to access the libopencl.so library on the Pixel series phones, since Google does not publicly support the OpenCL library. Therefore, we only evaluate the performance of ARCore on the Google Pixel6Pro.

As shown in Table 4, MobiDepth outperforms the deep learning model baselines, i.e., AnyNet, MADNet-MAD and MADNet-No, in most cases. For instance, when the device and the target object are stationary, the estimation error of MobiDepth is 2.8%, 1.1%, 5.1% and 10.4% at the various distances, whereas AnyNet achieves 3.5%, 2.4%, 6.5%, 12.0% and MADNet-No achieves 10.6%, 12.4%, 47.6%, 61.5% at the corresponding distances. Only on HWP30 with the object at a distance of 50cm is the estimation error of MobiDepth slightly higher than that of AnyNet, i.e., 3.8% compared to 3.4%. One reason for the higher error of the deep-learning-based models is that the pairs of images used for model training have been perfectly rectified. However, the rectification of images captured by the on-device dual cameras suffers from inevitable error due to the diverse settings of the cameras, e.g., a normal lens and a wide-angle lens. Another reason is the limited scenes in the training dataset. In comparison, MobiDepth tolerates slight image rectification error, as SGBM uses a block matching method to calculate the disparity. More importantly, MobiDepth does not require any pre-training. This not only enables MobiDepth to work in new scenes, but also saves a lot of human effort in collecting and labeling a training dataset.

From Table 4, MobiDepth also always outperforms ARCore. For example, when the two systems are tested under the exact same condition (moving device and stationary target) on HWP30, MobiDepth achieves average errors of 5.0%, 4.2%, 8.9% and 23.2% at the different distances, while those of ARCore are 14.3%, 7.1%, 11.7% and 25.5%, respectively.

The advantage of MobiDepth is greater when the target object is moving. Due to its algorithmic basis, ARCore cannot work well on moving targets, because the motion of the target object corrupts the relative displacement between frames measured by the IMU. However, this is not an issue for MobiDepth since our depth is estimated in a per-frame manner. As a result, we observe a significantly superior performance of MobiDepth (e.g., 8.7% vs. 43.5% on HWP30 at distance=100cm) when the target is moving slowly (30−80cm/s).

In addition, the accuracy of MobiDepth is even better when the device is stationary. For example, on HWM40P, the errors with the device stationary and moving are 1.1% and 3.2%, respectively, for a stationary target at distance=100cm. On the contrary, ARCore is unable to obtain depth information without motion, which may limit its usage scenarios, such as live streaming with a fixed mobile device camera.

The accuracy of MobiDepth may be affected by the device. We can notice that the performance of MobiDepth on HWM40P is better than on the other devices, as the quality of the cameras and the placement of the dual cameras on HWM40P are better. The performance of MobiDepth may get worse if the secondary camera on the device is too poor or lies too close to the main camera. Nevertheless, MobiDepth can still achieve a reasonable accuracy on most devices since most mobile devices today are equipped with powerful dual cameras.

8.3 Performance Breakdown
We next evaluate the performance of the key components of the MobiDepth system in detail.

8.3.1 FoV Matching. As introduced in Section 4, MobiDepth uses an iterative FoV cropping method to match the FoVs of the dual cameras. An alternative to our method is the SIFT-based method, and thus we tested the performance of MobiDepth when the FoV matching part is implemented with SIFT. The result is shown in Table 5. We can see that the SIFT baseline leads to higher mean distance errors than our original method on depth estimation, which are 7.3% vs. 2.8%, 6.6% vs. 1.1%, 8.2% vs. 5.1%, and 13.3% vs. 10.4% when the target object distance is 50cm, 100cm, 300cm and 500cm, respectively. This demonstrates the effectiveness of our iterative cropping-based FoV matching method.

8.3.2 Frame Synchronization. We introduced a frame synchronization technique (Section 5) to reduce the time difference between the image streams of the dual cameras. Figure 15 shows that, without frame synchronization, the time difference between the two frames accumulates as time goes on; it exceeds 80ms after about 4000 frames. With our frame alignment method, we can keep the time difference of the dual cameras within 16ms (as we set θ to 16ms, half of the frame period).
Table 4: The accuracy of MobiDepth, AnyNet, MADNet and ARCore on depth estimation under different operating conditions. Each cell is the average mean distance error (mD_err). Columns: Obj-Stationary† | Obj-Moving† (30~80cm/s) | Obj-Moving (80~150cm/s), each at 50cm / 100cm / 300cm / 500cm.

HWM40P:
  AnyNet, Dev-Stationary‡:      3.5% / 2.4% / 6.5% / 12.0%   | 4.0% / 5.2% / 11.1% / 21.4%   | 13.9% / 19.8% / 20.4% / 31.4%
  MADNet-No, Dev-Stationary:    10.6% / 12.4% / 47.8% / 61.7% | 13.5% / 19.2% / 50.3% / 60.9% | 33.3% / 28.8% / 55.8% / 64.9%
  MADNet-MAD, Dev-Stationary:   5.8% / 4.9% / 11.5% / 15.3%  | 5.9% / 8.8% / 19.6% / 23.7%   | 14.7% / 15.8% / 28.3% / 38.7%
  MobiDepth, Dev-Stationary:    2.8% / 1.1% / 5.1% / 10.4%   | 3.8% / 3.1% / 9.3% / 19.2%    | 9.8% / 9.5% / 15.1% / 29.9%
  MobiDepth, Dev-Moving‡:       4.5% / 3.2% / 7.4% / 14.3%   | 5.6% / 5.8% / 13.4% / 23.1%   | 13.3% / 15.7% / 23.4% / 31.1%
HWP30:
  AnyNet, Dev-Stationary:       3.4% / 5.2% / 13.3% / 25.5%  | 4.4% / 7.6% / 23.8% / 30.9%   | 11.1% / 16.0% / 28.3% / 39.3%
  MADNet-No, Dev-Stationary:    18.8% / 15.9% / 58.9% / 74.4% | 26.8% / 16.8% / 62.7% / 74.6% | 24.5% / 17.6% / 68.9% / 74.6%
  MADNet-MAD, Dev-Stationary:   8.4% / 9.1% / 26.2% / 49.6%  | 9.4% / 10.9% / 30.6% / 54.8%  | 14.7% / 13.1% / 47.4% / 54.7%
  MobiDepth, Dev-Stationary:    3.8% / 4.1% / 7.1% / 21.4%   | 4.3% / 6.3% / 14.2% / 28.5%   | 10.1% / 13.7% / 21.5% / 35.6%
  MobiDepth, Dev-Moving:        5.0% / 4.2% / 8.9% / 23.2%   | 7.4% / 8.7% / 24.4% / 35.2%   | 15.3% / 20.1% / 30.5% / 40.3%
  ARCore∗, Dev-Moving:          14.3% / 7.1% / 11.7% / 25.5% | 56.6% / 43.5% / 54.8% / 57.9% | 69.9% / 66.6% / 58.7% / 61.9%

∗ Since ARCore cannot obtain depth information while the device is stationary, we only tested ARCore with the device moving.
† Obj-Stationary and Obj-Moving denote that the status of the target object is static and moving, respectively.
‡ Dev-Stationary and Dev-Moving indicate that the state of the mobile device is static and moving, respectively.

Figure 15: Time difference with and without frame synchronization.
Figure 16: Latency reduction of our optimized SGBM for the mobile GPU.
Figure 17: Latency of MobiDepthPose and MobileHumanPose.

Table 5: The average mean distance error (mD_err) of SIFT and our method for FoV matching. The target object is stationary at different distances.

Distance   MobiDepth-SIFT   MobiDepth-Ours
50cm       7.3%             2.8%
100cm      6.6%             1.1%
300cm      8.2%             5.1%
500cm      13.3%            10.4%

Table 6: The average mean distance error (mD_err) of MobiDepth when the two cameras are synchronized (Sync) or not synchronized (Unsync). Unsync1 and Unsync2 have a 33ms (one-frame) and 66ms (two-frame) time difference between the two frames, respectively. The target object is moving at different distances and speeds.

Speed: Moving (30~80cm/s)
Distance   Original   Unsync1   Unsync2
50cm       3.8%       17.2%     41.2%
100cm      3.1%       12.1%     31.1%
300cm      9.3%       25.6%     42.2%
500cm      19.2%      36.3%     50.7%

To test the effectiveness of this technique, we compared the accuracy of MobiDepth with and without frame synchronization, as shown in Table 6. In Table 6, Unsync represents the cases where the two frames used for stereo matching in MobiDepth are not synchronized (by one frame or two frames). As can be seen from the table, our frame synchronization method allows MobiDepth to achieve obviously higher accuracy than without synchronization. The advantage of using frame synchronization is more significant when the velocity of the target object is higher.

8.3.3 Stereo Matching Optimizations. In the stereo matching of MobiDepth, we adopt several techniques to optimize the performance of SGBM on mobile devices. Figure 16 shows the latency reduction of our customized SGBM implementation, compared with the existing SGBM implementation that was originally developed for desktop GPUs (denoted as D-SGBM). The results show that almost all parts of the SGBM algorithm are significantly optimized, reducing the latency of SGBM by 81.4% on HWM40P and 72.8% on HWP30.

8.4 Case Study: 3D Pose Estimation
MobiDepth can enable various 3D applications on mobile devices. We take 3D pose estimation (PE) as an example case study to show
Table 7: Accuracy of MobileHumanPose and MobiDepthPose in terms of MPJPE (mm).

Methods                 Walk    Kick    Sit     Hit     Avg.
MobileHumanPose [8]     290.3   231.4   213.8   229.9   241.4
MobiDepthPose (Ours)    195.3   164.8   141.6   178.5   170.1

Figure 18: Visualization of some 3D poses predicted by MobiDepthPose and MobileHumanPose.

the effectiveness of MobiDepth. To do so, we implement a mobile 3D PE application based on MobiDepth, named MobiDepthPose. The current SOTA mobile 3D PE system, MobileHumanPose [8], adopts an end-to-end neural network to predict the 3D keypoint coordinates (x, y, z) of human joints. Instead, our MobiDepthPose directly utilizes MobiDepth to obtain accurate depth information, so that the model only needs to predict the 2D keypoint coordinates (x, y). In MobiDepthPose, we use an accurate and lightweight 2D PE model [56, 57] in combination with MobiDepth to obtain the 3D keypoint coordinates (x, y, z).
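The composition step can be sketched as follows: the 2D keypoints from the pose model are combined with the depth sampled from the MobiDepth map and back-projected with a pinhole model. The intrinsics are placeholders and the back-projection convention is an assumption for illustration, not necessarily MobiDepthPose's exact formulation.

```cpp
// Sketch of lifting 2D keypoints to 3D using the MobiDepth depth map (Section 8.4).
// Intrinsics (fx, fy, cx, cy) are placeholders; the pinhole back-projection here is
// an illustrative convention, not necessarily MobiDepthPose's exact one.
#include <opencv2/opencv.hpp>
#include <vector>

struct Keypoint2D { float x, y; };
struct Keypoint3D { float x, y, z; };

std::vector<Keypoint3D> liftKeypoints(const std::vector<Keypoint2D>& kps2d,
                                      const cv::Mat& depth,   // CV_32F, metric depth
                                      float fx, float fy, float cx, float cy) {
    std::vector<Keypoint3D> kps3d;
    for (const auto& kp : kps2d) {
        int u = static_cast<int>(kp.x);
        int v = static_cast<int>(kp.y);
        float z = depth.at<float>(v, u);            // depth at the 2D keypoint position
        kps3d.push_back({(kp.x - cx) * z / fx,      // X
                         (kp.y - cy) * z / fy,      // Y
                         z});                       // Z
    }
    return kps3d;
}
```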
We evaluated the accuracy of MobileHumanPose and MobiDepth-
Stereo matching. To achieve the accurate depth, many works have
Pose under different settings when a person was walking, sitting,
focused on stereo matching algorithm [17, 47]. Traditional stereo
kicking, or hitting something in the camera view. In each setting,
matching methods usually utilize the low-level features of image
we collected 300 samples, and the ground truth was produced by
patches around the pixel to measure the dissimilarity, which can
the combination of a full-featured depth camera (Intel RealSense
be grouped into three categories: 1) Local method. Both Lazaros et
D435i) and a SOTA 2D PE model (HigherHRNet-w48 [7]). As we
al. [33] and Kristina et al. [2] exploit the region-based local algo-
can see from Table 7, MobiDepthPose is able to achieve a much bet-
rithms, which is based on feature vectors extracted in a window
ter accuracy than MobileHumanPose in all cases, with the overall
for matching. 2) Global method. Vladimir et al. [26] and Andreas et
mean per joint position error (MPJPE) significantly reduced from
al. [24] select the disparity with the minimal global energy function.
241.4mm to 170.1mm, i.e., a reduction of 29.5%.
Moreover, since the neural network can be much simplified in MobiDepthPose, the latency of 3D PE is reduced as well. As shown in Figure 17, the latency of MobiDepthPose is reduced by 45.4% to 57.1% compared with MobileHumanPose on different devices. Some examples of the estimated 3D poses can be found in Figure 18.

Figure 18: Visualization of some 3D poses predicted by MobiDepthPose and MobileHumanPose.
8.5 System Overhead

Finally, we evaluate the system overhead of MobiDepth in terms of latency, power consumption, and memory usage. Figure 19(a) shows the latency of MobiDepth, AnyNet, MADNet-No,⁵ and ARCore.

⁵ We do not evaluate the latency of MADNet-MAD on mobile devices since the model with the online adaptation module is extremely time-consuming; its latency is much higher than that of MADNet-No.

9 RELATED WORK

Stereo matching. To achieve accurate depth, many works have focused on stereo matching algorithms [17, 47]. Traditional stereo matching methods usually utilize the low-level features of image patches around a pixel to measure dissimilarity, and can be grouped into three categories: 1) Local methods. Lazaros et al. [33] and Kristian et al. [2] exploit region-based local algorithms, which match feature vectors extracted in a window around each pixel. 2) Global methods. Vladimir et al. [26] and Andreas et al. [24] select the disparity that minimizes a global energy function. 3) Semi-global methods. Heiko [19] optimizes a path-wise form of the energy function along multiple directions (sketched below).
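As a textbook-style illustration of these formulations (standard definitions, not specific to any cited system): a local method scores a candidate disparity d at pixel p = (x, y) by aggregating a per-pixel dissimilarity over a window W(p), e.g., the sum of absolute differences, while semi-global matching [19] adds smoothness penalties P_1 and P_2 accumulated along several one-dimensional paths r and picks the disparity by winner-take-all:

C_W(p, d) = \sum_{(x, y) \in W(p)} \bigl| I_L(x, y) - I_R(x - d, y) \bigr|

L_r(p, d) = C(p, d) + \min\bigl( L_r(p - r, d),\ L_r(p - r, d \pm 1) + P_1,\ \min_k L_r(p - r, k) + P_2 \bigr) - \min_k L_r(p - r, k)

S(p, d) = \sum_r L_r(p, d), \qquad d^*(p) = \arg\min_d S(p, d)

Aggregation of this kind consists almost entirely of additions and comparisons, which is consistent with the 4% to 6% GPU ALU utilization reported for MobiDepth above.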
With the development of deep learning, many stereo matching works use CNN models to improve the accuracy of depth estimation [25, 38, 44]. Akihito et al. [38] propose SGM-Nets, which improve the accuracy of SGM by learning its penalty terms. Patrick et al. [25] learn smoothness penalties through a conditional random field (CRF) and combine them with a correlation matching cost
predicted by a CNN to integrate long-range interactions. Vladimir et al. [44] design a CNN model that propagates information across different resolution levels. However, these works are only suitable for identical dual cameras. Besides, none of these works considers the specific architecture of mobile SoCs. MobiDepth aims to estimate depth through a dual-camera system with diverse settings, and makes full use of the memory architecture of the mobile GPU for acceleration.

Depth estimation on mobile devices. Current methods of obtaining depth on mobile devices can be divided into three categories: 1) Dedicated depth sensors. Kim et al. [23] and Tian et al. [46] use the on-device time-of-flight (ToF) camera to obtain 3D images, and Shih et al. [39] and Stefano et al. [45] use LiDAR instead. Depth sensors are currently only available on a few high-end mobile devices due to their high cost. 2) Learning-based depth prediction. Liu et al. [29] present a CNN field model to estimate depth from single monocular images, aiming to jointly exploit the capacity of CNN models and continuous CRFs. David et al. [15] propose a two-scale CNN model trained on images and the corresponding depth maps. However, such CNN models are compute-heavy and cannot achieve real-time performance on mobile devices. Furthermore, these methods suffer from limited scalability, i.e., they cannot estimate the depth of new objects that are unseen in the training dataset. 3) Using a monocular camera on mobile devices. Yang et al. [54] propose a keyframe-based real-time surface mesh generation approach to reconstruct 3D objects from a single RGB image. ARCore [49], the well-known AR framework, obtains depth from motion, i.e., it uses a monocular camera combined with an inertial measurement unit (IMU) to estimate depth. However, the limitation of these works is that they cannot estimate the accurate depth of objects in motion. In comparison, MobiDepth obtains the disparity from the dual cameras instead of moving a single camera. Therefore, MobiDepth can accurately estimate the depth of objects in motion. Furthermore, the stereo matching algorithm used in MobiDepth, i.e., SGBM, does not need to be pre-trained on any 3D dataset, and thus achieves better scalability.

10 DISCUSSION

The current MobiDepth system has several limitations. 1) MobiDepth does not take into consideration the impact of auto-focus. In auto-focus, a motor moves the camera lens backward and forward to adjust the image distance, making it hard for MobiDepth to perform the FoV cropping. However, if auto-focus can be well handled, it may help improve the depth estimation as it improves the captured image quality. 2) Due to the short distance between the dual cameras on mobile devices, the effective range of MobiDepth is typically limited to 0.5 to 5 meters, e.g., the distance between the dual cameras on HWM40P is about 2.1cm (the range varies slightly across mobile devices); a worked example after this paragraph makes this sensitivity concrete. If the target object lies outside the effective distance range, the system may not accurately estimate the depth. However, we believe the effective distance range can already cover most AR/VR application scenarios on mobile devices. 3) MobiDepth may not work well on devices with black-and-white cameras and relies on the computing power of the mobile GPU. For mobile devices with low-end GPUs, the efficiency of stereo matching in MobiDepth may not be guaranteed. This is not a severe issue since most devices today are equipped with powerful cameras and GPUs. These limitations can also be mitigated by incorporating more advanced algorithmic and system optimizations, which we leave for future work.
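The range limitation follows from rectified stereo geometry. A rough worked example, using the reported 2.1cm baseline B and an assumed focal length of f = 1500 pixels (illustrative, not a value measured on the devices):

Z = \frac{f \cdot B}{d} \ \Rightarrow\ d = \frac{f \cdot B}{Z} \approx \frac{1500 \times 0.021\,\mathrm{m}}{5\,\mathrm{m}} \approx 6.3\ \text{pixels at}\ Z = 5\,\mathrm{m}, \qquad \approx 63\ \text{pixels at}\ Z = 0.5\,\mathrm{m}

\Delta Z \approx \frac{Z^2}{f \cdot B}\,\Delta d \approx \frac{(5\,\mathrm{m})^2}{31.5\,\mathrm{m}} \approx 0.8\,\mathrm{m}\ \text{per pixel of disparity error at}\ Z = 5\,\mathrm{m}

That is, beyond a few meters even a one-pixel matching error corresponds to a large depth error under such a small baseline, which is why the usable range is bounded.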
11 CONCLUSION

In this paper, we propose MobiDepth, the first system to use the dual cameras on commodity mobile devices for real-time depth estimation. MobiDepth does not require dedicated sensors or large-scale data, and works for a wide range of scenarios, where both the device and the target objects can be stationary or moving. To do so, MobiDepth employs three key techniques, including iterative FoV cropping, heterogeneous camera synchronization, and mobile GPU-friendly stereo matching. Extensive experiments have demonstrated that MobiDepth can achieve high accuracy and low overhead. The accuracy remains high on moving objects and moving devices, significantly outperforming ARCore, the state-of-the-art framework for depth estimation on Android.

12 ACKNOWLEDGMENTS

We sincerely appreciate the anonymous shepherd and reviewers for their valuable comments. This work is supported by the National Science Foundation of China (62172439, 62122095, 62072472), the National Key R&D Program of China (2019YFA0706403), the Natural Science Foundation Major Project of Hunan Science and Technology Innovation Program (S2021JJZDXM0022), the Natural Science Foundation of Hunan Province (2020JJ5774, 2020JJ2050) and U19A2067, and a grant from the Guoqiang Institute, Tsinghua University.

REFERENCES
[1] 2022. https://github.com/ethan-li-coding.
[2] Kristian Ambrosch and Wilfried Kubinger. 2010. Accurate hardware-based stereo vision. Computer Vision and Image Understanding 114, 11 (2010), 1303–1316.
[3] Ken Arnold, James Gosling, and David Holmes. 2005. The Java Programming Language. Addison Wesley Professional.
[4] Christian Banz, Holger Blume, and Peter Pirsch. 2011. Real-time semi-global matching disparity estimation on the GPU. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 514–521.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In European Conference on Computer Vision. Springer, 404–417.
[6] Michael Bleyer, Christoph Rhemann, and Carsten Rother. 2011. PatchMatch stereo: Stereo matching with slanted support windows. In BMVC, Vol. 11. 1–11.
[7] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. 2020. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In CVPR.
[8] Sangbum Choi, Seokeon Choi, and Changick Kim. 2021. MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2328–2338.
[9] Ivan Culjak, David Abram, Tomislav Pribanic, Hrvoje Dzapo, and Mario Cifrek. 2012. A brief introduction to OpenCV. In 2012 Proceedings of the 35th International Convention MIPRO. IEEE, 1725–1730.
[10] François Darmon and Pascal Monasse. 2021. The Polar Epipolar Rectification. Image Processing On Line 11 (2021), 56–75.
[11] Pei-Huang Diao and Naai-Jung Shih. 2018. MARINS: A mobile smartphone AR system for pathfinding in a dark environment. Sensors 18, 10 (2018), 3442.
[12] ARKit Developers Documentation. 2018. https://developer.apple.com/documentation/arkit.
[13] ARCore Developers Documentation. 2018. https://developers.google.com/ar.
[14] Android Developers Documentation. 2021. https://developer.android.com/training/camera2/multi-camera.
[15] David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27 (2014).
[16] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3354–3361.
[17] Rostam Affendi Hamzah and Haidi Ibrahim. 2016. Literature survey on stereo vision disparity map algorithms. Journal of Sensors 2016 (2016).
[18] Daniel Hernandez-Juarez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure, and Antonio M. López. 2016. Embedded real-time stereo estimation via semi-global matching on the GPU. Procedia Computer Science 80 (2016), 143–153.
[19] Heiko Hirschmuller. 2007. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2007), 328–341.
[20] Dong-Hyun Hwang, Suntae Kim, Nicolas Monet, Hideki Koike, and Soonmin Bae. 2020. Lightweight 3D human pose estimation network training using teacher-student learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 479–488.
[21] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2015. Panoptic Studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision. 3334–3342.
[22] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. 2018. StereoNet: Guided hierarchical refinement for edge-aware depth prediction. (2018).
[23] Hyun Myung Kim, Min Seok Kim, Gil Ju Lee, Hyuk Jae Jang, and Young Min Song. 2020. Miniaturized 3D depth sensing-based smartphone light field camera. Sensors 20, 7 (2020), 2129.
[24] Andreas Klaus, Mario Sormann, and Konrad Karner. 2006. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In 18th International Conference on Pattern Recognition (ICPR'06), Vol. 3. IEEE, 15–18.
[25] Patrick Knobelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. 2017. End-to-end training of hybrid CNN-CRF models for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2339–2348.
[26] Vladimir Kolmogorov and Ramin Zabih. 2001. Computing visual correspondence with occlusions using graph cuts. In Proceedings Eighth IEEE International Conference on Computer Vision, ICCV 2001, Vol. 2. IEEE, 508–515.
[27] Rendong Liang, Ting Cao, Jicheng Wen, Manni Wang, Yang Wang, Jianhua Zou, and Yunxin Liu. 2022. Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. https://doi.org/10.1145/3495243.3517020
[28] TensorFlow Lite. 2021. https://www.tensorflow.org/lite/.
[29] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. 2015. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (2015), 2024–2039.
[30] David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[31] Wolfgang Maass. 2000. On the computational power of winner-take-all. Neural Computation 12, 11 (2000), 2519–2535.
[32] Xing Mei, Xun Sun, Mingcai Zhou, Shaohui Jiao, Haitao Wang, and Xiaopeng Zhang. 2011. On building an accurate stereo matching system on graphics hardware. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 467–474.
[33] Lazaros Nalpantidis and Antonios Gasteratos. 2010. Stereo vision for robotic applications in the presence of non-ideal lighting conditions. Image and Vision Computing 28, 6 (2010), 940–951.
[34] PerfDog. 2022. https://perfdog.qq.com/.
[35] Todd B. Pittman, Y. H. Shih, D. V. Strekalov, and Alexander V. Sergienko. 1995. Optical imaging by means of two-photon quantum entanglement. Physical Review A 52, 5 (1995), R3429.
[36] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[37] Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. 2017. Large-scale outdoor 3D reconstruction on a mobile device. Computer Vision and Image Understanding 157 (2017), 151–166.
[38] Akihito Seki and Marc Pollefeys. 2017. SGM-Nets: Semi-global matching with neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 231–240.
[39] Naai-Jung Shih, Pei-Huang Diao, Yi-Ting Qiu, and Tzu-Yu Chen. 2020. Situated AR simulations of a lantern festival using a smartphone and LiDAR-based 3D models. Applied Sciences 11, 1 (2020), 12.
[40] Prarthana Shrstha, Mauro Barbieri, and Hans Weda. 2007. Synchronization of multi-camera video recordings based on audio. In Proceedings of the 15th ACM International Conference on Multimedia. 545–548.
[41] Robert Spangenberg, Tobias Langner, and Raúl Rojas. 2013. Weighted semi-global matching and center-symmetric census transform for robust driver assistance. In International Conference on Computer Analysis of Images and Patterns. Springer, 34–41.
[42] John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66.
[43] Bjarne Stroustrup. 2013. The C++ Programming Language. Pearson Education.
[44] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. 2021. HITNet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14362–14372.
[45] Stefano Tavani, Andrea Billi, Amerigo Corradetti, Marco Mercuri, Alessandro Bosman, Marco Cuffaro, Thomas Seers, and Eugenio Carminati. 2022. Smartphone assisted fieldwork: Towards the digital transition of geoscience fieldwork using LiDAR-equipped iPhones. Earth-Science Reviews (2022), 103969.
[46] Yuan Tian, Yuxin Ma, Shuxue Quan, and Yi Xu. 2019. Occlusion and collision aware smartphone AR using time-of-flight camera. In International Symposium on Visual Computing. Springer, 141–153.
[47] Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James Archibald. 2016. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing 11, 1 (2016), 5–25.
[48] Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. 2019. Real-time self-adaptive deep stereo. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Julien Valentin, Adarsh Kowdle, Jonathan T. Barron, Neal Wadhwa, Max Dzitsiuk, Michael Schoenberg, Vivek Verma, Ambrus Csaszar, Eric Turner, Ivan Dryanovski, et al. 2018. Depth from motion for smartphone AR. ACM Transactions on Graphics (ToG) 37, 6 (2018), 1–19.
[50] VR and AR market size 2024. 2021. https://www.statista.com/statistics/591181/global-augmented-virtual-reality-market-size/.
[51] Bill Waggener, William N. Waggener, and William M. Waggener. 1995. Pulse Code Modulation Techniques. Springer Science & Business Media.
[52] Yan Wang, Zihang Lai, Gao Huang, Brian H. Wang, Laurens Van Der Maaten, Mark Campbell, and Kilian Q. Weinberger. 2018. Anytime Stereo Image Depth Estimation on Mobile Devices. arXiv preprint arXiv:1810.11408 (2018).
[53] Jingao Xu, Guoxuan Chi, Zheng Yang, Danyang Li, Qian Zhang, Qiang Ma, and Xin Miao. 2021. FollowUpAR: Enabling follow-up effects in mobile AR applications. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. 1–13.
[54] Xingbin Yang, Liyang Zhou, Hanqing Jiang, Zhongliang Tang, Yuanbo Wang, Hujun Bao, and Guofeng Zhang. 2020. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics 26, 12 (2020), 3446–3456.
[55] Juheon Yi and Youngki Lee. 2020. Heimdall: Mobile GPU coordination platform for augmented reality applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking. 1–14.
[56] Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. 2021. Lite-HRNet: A Lightweight High-Resolution Network. In CVPR.
[57] Jinrui Zhang, Deyu Zhang, Xiaohui Xu, Fucheng Jia, Yunxin Liu, Xuanzhe Liu, Ju Ren, and Yaoxue Zhang. 2020. MobiPose: Real-time multi-person pose estimation on mobile devices. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems. 136–149.
[58] Zhengyou Zhang. 1999. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 1. IEEE, 666–673.
