MobiDepth: Real-Time Depth Estimation Using On-Device Dual Cameras
Jinrui Zhang1† , Huan Yang1 , Ju Ren2 , Deyu Zhang1∗ , Bangwen He1 , Ting Cao3 , Yuanchun Li4 ,
Yaoxue Zhang2 , Yunxin Liu4∗
1 School of Computer Science and Engineering, Central South University
2 Department of Computer Science and Technology, Tsinghua University 3 Microsoft Research
4 Institute for AI Industry Research (AIR), Tsinghua University
1 {zhangjinrui, yanghuan9812, zdy876, hebangwen}@csu.edu.cn
2 {renju, zhangyx}@tsinghua.edu.cn, 3 ting.cao@microsoft.com
4 {liyuanchun, liuyunxin}@air.tsinghua.edu.cn
Figure 1: Example depth maps generated by MobiDepth, AnyNet, MADNet with and without online adaptation (named MADNet-MAD and MADNet-No, respectively), and ARCore, with the person sitting, walking and standing. (a) raw images. (b) our approach estimates accurate depth with crisp edges. (c) and (d) the learning-based depth estimation models, i.e., AnyNet and MADNet-No, perform poorly in scenarios different from the training dataset. (e) the performance of MADNet-MAD is still unsatisfactory, even with the extremely time-consuming online adaptation module. (f) ARCore barely estimates depth in these images.
The latency of the state-of-the-art (SOTA) models [48, 52] is as high as 80ms to 550ms on the high-end Huawei Mate40Pro smartphone, as shown in Section 8. 3) Depth from motion. The most common solution adopted by existing mobile systems (including ARKit in iOS [12] and ARCore in Android [13]) is the depth-from-motion algorithm [49]. The algorithm uses visual-inertial odometry (VIO) to combine information from the inertial measurement unit (IMU) with computer vision analysis of the scene visible to the camera to obtain depth, i.e., it selects keyframes during the motion of the camera and estimates depth based on stereo matching between the most recent image and a past keyframe. Although this solution does not rely on dedicated sensors or large-scale training data, it requires the camera to be moving and expects the target object to be stationary, which significantly restricts its usage scenarios. Figure 1(f) shows the performance of ARCore when estimating the depth of the moving person; the accuracy of ARCore's depth estimation is clearly poor in this scenario.

Inspired by the success of binocular depth estimation techniques, we find that the placement of the rear-facing cameras on mobile devices brings a great opportunity for depth estimation. Ideally, the disparity can be readily obtained by comparing the pair of frames captured by the dual cameras. The displacement of the dual cameras provides a stable baseline, compared to the depth-from-motion solution.
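For reference, the binocular relation the system relies on is the standard rectified-stereo equation (our addition; the paper states the dependency on disparity, baseline, and focal length without writing it out):

```latex
% Depth of a point from a rectified stereo pair: f is the (equivalent)
% focal length, B the baseline between the two cameras, and d the
% disparity in pixels found by stereo matching.
Z = \frac{f \cdot B}{d}
```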
However, our in-depth analysis reveals several challenges in using the dual rear-facing cameras for depth estimation, as follows: 1) How to reduce the impact of the diverse focal lengths of the dual cameras. The rear-facing cameras are originally designed to serve various application scenarios, such as macro shooting or wide-angle shooting. Their focal lengths are thus quite diverse. This greatly impacts the accuracy of the epipolar rectification, which serves as the basis for depth estimation. 2) How to synchronize the frame flows captured by the dual cameras. The frame flows are highly out of sync, due to the different frame periods of the cameras as well as frequent garbage collection (GC). With out-of-sync frame flows, estimating the depth of objects in motion becomes impossible, since the objects have displacement between the frames of a pair. 3) How to accelerate the stereo matching on computation-limited mobile devices. The state-of-the-art stereo matching algorithms and CNN models, e.g., Semi-Global Block Matching (SGBM), MADNet, and HITNet, are computation-heavy, leading to long runtimes on mobile devices. The stereo matching needs to run in an online manner to find the correspondence between the points in the pair of frames. The long runtime leads to a low refresh rate for depth estimation applications.

Addressing the above challenges, we propose MobiDepth to leverage the rear-facing dual cameras to estimate depth in real time on mobile devices. MobiDepth resolves all the issues of the three existing solutions, i.e., it does not rely on any dedicated sensors or pre-training, and works well for target objects in motion. MobiDepth integrates several new techniques, each addressing one of the above challenges. Albeit simple, we are the first to apply them for efficient depth estimation using heterogeneous dual cameras on commodity mobile devices. 1) Iterative field-of-view cropping. It iteratively crops the field-of-view (FoV) of one camera until it matches that of the other camera. As such, the dual cameras achieve equivalent focal lengths, which improves the accuracy of epipolar rectification. 2) Heterogeneous camera synchronization. It filters out the frames that are generated at prominently different times. Moreover, it releases the metadata created by Android in a timely manner to avoid frequent garbage collection (GC). As such, the frame flows from
camera calibration, and crops the WFoV iteratively until the focal lengths of the dual cameras are equivalent. Based on the WFoV cropped by factor ᾱ and the TFoV, the following epipolar rectification module calculates the value of the baseline B and the focal length f. The offline phase runs only once, during the initialization of the MobiDepth system.

In the online phase, we introduce a heterogeneous camera synchronization technique (Section 5) to align the frame flows from the dual cameras, such that the time difference between two frames from the dual cameras does not exceed a certain threshold. The image rectification module takes a pair of cropped and synchronized images as input. It determines the correspondence between the epipolar lines of the pair of images, based on the parameters obtained in the offline phase. After that, the mobile GPU-friendly stereo matching technique (Section 6) efficiently finds the corresponding points on the epipolar lines to calculate the disparity. The final depth map can be readily estimated based on the disparity, baseline, and focal length.

The details of the proposed techniques can be found in the following sections.

4 ITERATIVE FOV CROPPING

The focal lengths of the dual cameras on mobile devices are usually quite diverse, leading to different FoVs of the dual cameras. Generally, device manufacturers provide the equivalent focal length of each camera. Ideally, we could calculate the crop factor that makes the FoVs of the dual cameras equal. Yet, the provided equivalent focal length is not accurate enough. For example, the Honor V30Pro officially reports an equivalent focal length of 16mm while the measured value is close to 17mm. Cropping the images using the provided equivalent focal length leads to significant error. In addition, the existing approaches typically use image matching algorithms to crop the FoVs, such as SIFT [30], SURF [5], and ORB [36]. However, the accuracy of these approaches is not satisfactory since they only perform a homography transformation, with the experimental results shown in Section 8.3.1. In the following, we first analyze the feasibility of finding the equivalent focal lengths through cropping the FoVs. Then we iteratively crop the FoVs until the focal lengths are equivalent for the dual cameras.

FoV cropping analysis. According to the lens imaging rule [35], the field-of-view (FoV) of a camera is determined by its focal length. Specifically, the camera with the smaller focal length f^W has a WFoV, and the camera with the longer focal length f^T has a TFoV, as shown in Figure 7. By cropping the WFoV to be equal to the TFoV, we can make the focal lengths of the dual cameras both equal to f^T. In other words, equal FoVs yield equivalent focal lengths for the dual cameras.
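As a concrete form of this rule (our addition, assuming the usual pinhole approximation with sensor width w), the FoV shrinks as the focal length grows, so matching the FoVs also matches the focal lengths:

```latex
% Pinhole-camera relation between field of view and focal length f for
% a sensor of width w: a longer focal length gives a narrower FoV.
\mathrm{FoV} = 2\arctan\!\left(\frac{w}{2f}\right)
```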
Figure 7: The relationship between focal length and FoV size.

Based on this fact, MobiDepth proposes iterative FoV cropping to iteratively crop the WFoV until the focal lengths of the dual cameras are equivalent. Since the image captured by the camera is not square, i.e., the image size is 640×480, we use ᾱ = (α_x, α_y) to denote the crop factors on the width and height, respectively. We formulate Equation 2 as the loss function which quantifies the difference of focal lengths between the dual cameras:

    J(ᾱ) = (1/2) (f^W_{ᾱ_i} − f^T)²    (2)

where f^T denotes the focal length of the TFoV camera and f^W_{ᾱ_i} denotes the focal length of the WFoV camera after i rounds of cropping. The optimal crop factor can be found by iteratively minimizing this loss.

Figure 8 illustrates the workflow of iterative FoV cropping: 1) First, we initialize the crop factor ᾱ_0 to 1 and obtain the focal lengths of the dual cameras with Zhang's calibration method [58] (a camera calibration method that uses a calibration pattern, e.g., a chessboard, to obtain camera parameters such as focal lengths and lens distortion coefficients), i.e., f^T and f^W_{ᾱ_0}. To obtain more accurate values, we capture over 15 images of the calibration pattern (i.e., the chessboard) at different
positions, and filter out the image pairs with re-projection error over φ (φ is set to 0.05 in our implementation). 2) Then, we crop the WFoV images from all sides according to ᾱ_i, with the center of the FoV as the basis point, and calibrate the camera again after cropping to get the updated focal length f^W_{ᾱ_i}. 3) Next, we update the value of ᾱ_i to ᾱ_{i+1} using the following rule: if f^W_{ᾱ_i} − f^T > 0, then ᾱ_{i+1} = ᾱ_i − step; otherwise, ᾱ_{i+1} = ᾱ_i + step, where step is set to 0.01 in our implementation. 4) Once the value of Equation 2 stays below Δ (0.01 according to our evaluation) for five consecutive rounds, we obtain the final crop factor ᾱ, based on which the focal lengths of the dual cameras are equivalent.
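A minimal sketch of this search loop is given below (our illustration, not the authors' implementation). For simplicity the crop factor is treated as a single scalar applied to both axes, and calibrateWideFocalLength() is a hypothetical helper standing in for re-running Zhang's calibration on the cropped wide-FoV images.

```cpp
// Hypothetical helper: re-calibrates the wide-FoV camera on images cropped
// by factor alpha and returns the resulting focal length (in pixels).
double calibrateWideFocalLength(double alpha);

// Iterative FoV cropping (sketch): adjust the crop factor by a fixed step
// until the loss J = 0.5 * (fW - fT)^2 stays below delta for five rounds.
double findCropFactor(double fT) {
    const double step = 0.01;   // step size from step 3) of the workflow
    const double delta = 0.01;  // convergence threshold from step 4)
    double alpha = 1.0;         // alpha_0 = 1, i.e., start uncropped
    int stableRounds = 0;
    while (stableRounds < 5) {
        double fW = calibrateWideFocalLength(alpha);
        double loss = 0.5 * (fW - fT) * (fW - fT);     // Equation 2
        if (loss < delta) {
            ++stableRounds;                            // consecutive converged rounds
        } else {
            stableRounds = 0;
            alpha += (fW - fT > 0.0) ? -step : step;   // update rule from step 3)
        }
    }
    return alpha;  // crop factor yielding equivalent focal lengths
}
```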
5 HETEROGENEOUS CAMERA SYNCHRONIZATION

Accurate depth estimation of objects highly relies on the synchronization of the frame flows of the heterogeneous dual cameras. In case of out-of-sync flows, the target object moves a dozen pixels between the pair of rectified images, leading to disastrous stereo matching performance.

To obtain time-aligned frames, previous works often use additional external hardware signals to simultaneously trigger the cameras to capture images [21, 40]. However, there is no such dedicated hardware to trigger cameras on most mobile devices. As discussed in Section 2.2, the actual frame rates of the dual cameras are slightly different even under the same setting. The time difference of the frames captured by the heterogeneous cameras would accumulate at runtime if they are not carefully synchronized.

In this section, we introduce a simple yet effective technique to reduce the time difference between the frames from the two heterogeneous cameras to under a threshold. Figure 9 illustrates the process of the technique. T_L and T_R denote the frame periods of the left camera and the right camera, respectively. The small circles in the figure represent frames, and the number above each circle represents the order of the frame in the frame flow. Based on the recorded timestamp of each frame captured by both cameras, the camera synchronization works as follows: 1) Compare the timestamps of two frames with the same order in the frame flows. 2) If the difference between the timestamps of a frame pair is lower than the threshold θ, regard the two frames as a matched pair, denoted by the solid line in Figure 9. The matched frame pair is used as the input to the next step (image rectification). 3) If the time difference exceeds the threshold θ, discard the frame in the faster flow and go back to step 1 to compare its next frame with the frame in the other flow. For example, in Figure 9, the n-th frames in the two flows are not matched, so the n-th frame from the left camera is discarded, and the (n+1)-th frame in the left camera is compared with the n-th frame in the right camera.
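The matching rule can be summarized by the sketch below (our illustration; the queue-based framing and the struct are ours, not the paper's). In the evaluation (Section 8.3.2), θ is set to 16ms, half of the frame period.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

struct Frame { int64_t timestampNs; /* image buffer omitted */ };

// Return the next matched pair whose timestamps differ by less than theta.
// When the gap is too large, the frame of the flow running ahead is dropped
// and the comparison restarts with its next frame (steps 1-3 above).
std::optional<std::pair<Frame, Frame>>
nextMatchedPair(std::deque<Frame>& left, std::deque<Frame>& right, int64_t thetaNs) {
    while (!left.empty() && !right.empty()) {
        Frame l = left.front();
        Frame r = right.front();
        int64_t diff = l.timestampNs - r.timestampNs;
        int64_t gap = diff < 0 ? -diff : diff;
        if (gap < thetaNs) {                // step 2): a matched pair
            left.pop_front();
            right.pop_front();
            return std::make_pair(l, r);
        }
        if (diff < 0) left.pop_front();     // step 3): left flow is running ahead
        else          right.pop_front();    //          right flow is running ahead
    }
    return std::nullopt;                    // wait for more frames
}
```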
By fine-tuning θ, we trade off the synchronization of the frame flows against the loss of frames. With a larger value of θ, we can retain more frames, but the frames may have a larger time difference and lead to inaccurate depth estimation due to object displacement, and vice versa. The value of θ should be determined according to the application scenario. For example, in a scenario where the target object is mostly static, we can use a larger θ to tolerate frames that are slightly out of sync, whereas if the target objects move fast, the value of θ should be smaller to ensure higher accuracy.

In addition, when implementing the synchronization technique, we observed that retrieving frame flows from the dual cameras would frequently trigger Android garbage collection (GC) events, which caused the frame periods of the dual cameras to fluctuate. By tracing the memory usage of MobiDepth, we found that the frequent GCs were due to un-recycled metadata, such as the CameraMetadata objects which contain the settings of the camera [14]. After running for a while, the un-recycled metadata would occupy all the pre-allocated memory. To solve this issue, we use the Java reflection method to release the metadata in a timely manner once it is no longer useful.
6 MOBILE GPU-FRIENDLY STEREO MATCHING

We use SGBM as the stereo matching algorithm in MobiDepth due to its relatively low latency (e.g., 135ms on the CPU of the Kirin 9000 SoC) and satisfactory accuracy (e.g., 4.41% erroneous pixels in total with an error threshold of 5 pixels on KITTI 2012 [16]).

Although SGBM is efficient, it still cannot achieve real-time performance on mobile devices. Using the GPU may be a promising direction to further optimize its performance. However, existing works on SGBM acceleration target desktop GPUs [4, 18]. We adopted similar optimizations to implement SGBM on the mobile GPU, but the performance is even worse than running on the CPU, as shown in Table 2. This is due to the following limitations of mobile GPUs.

Table 2:
Steps                   GPU (ms)
Cost Computation        30
Cost Aggregation        150
Disparity Computation   10
Disparity Refinement    9
Total latency           199
Limited memory bandwidth. The most time-consuming steps in SGBM are Cost Computation and Cost Aggregation, since they require frequent memory read/write operations to calculate the disparity of each pixel. However, the memory bandwidth of mobile GPUs is quite limited compared with desktop GPUs [27, 55]. For instance, the Mali-G78 GPU in the Kirin 9000 SoC has 25.98 GB/s of memory bandwidth shared with the CPU, which is only 1/24 of the memory bandwidth of the NVIDIA RTX 2080Ti GPU (i.e., 616 GB/s). As a result, the frequent memory read/write operations in SGBM lead to a long latency on mobile GPUs.

Limited memory architecture support. In the Cost Aggregation step of SGBM, we need to use thread synchronization to get the minimum aggregated cost of a pixel over all disparities. On a desktop GPU, thread synchronization is performed in the fast on-chip shared memory, while mobile GPUs such as the Mali GPUs only have off-chip shared memory, which is much slower.

To tackle these challenges, we propose multiple techniques, including calculation fusion and data-merged memory write to reduce the cost of memory reads and writes in SGBM, and enlarged data slicing to reduce the overhead of thread synchronization.
6.1 Reducing the Memory Read and Write Overhead

Memory read and write overhead analysis. In the SGBM algorithm, each step utilizes the results of the previous step, which incurs a large number of read and write operations to the global memory. 1) In Cost Computation, after the Census Transform converts each pixel of the pair of stereo images into a binary string, SGBM calculates the Hamming distance between each pixel in the left image and the corresponding pixel in the disparity range of the right image, writes the result to memory, and then reads the data out for cost computation with a sliding window. Figure 10 shows the original process of the Cost Computation step. Let the image resolution be W × H and the maximum disparity range be D. Calculating the Hamming distance for one disparity requires reading two pixels from memory and writing one result back, i.e., W × H × D × 3 memory reads and writes in total. After the Hamming distances of all pixels are calculated, SGBM reads the two Hamming distance results of two disparities from memory and inputs them to the cost computation operation to derive the disparity cost. Thus, the number of memory reads and writes to obtain the disparity cost of the whole image is W × H × D × (3 + 3). 2) Cost Aggregation adopts four-path aggregation, i.e., left-right and up-down. The existing approach calculates the aggregation cost of the pixels for each path individually, which also causes a considerable amount of memory read and write operations. 3) In Cost Aggregation, SGBM needs to get the aggregation cost of each pixel in each of the four paths. It requires writing these values back to memory, which requires 2 × H × D + 2 × W × D writes.
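To make step 1) concrete, the sketch below (ours; the 5×5 window size is an assumption, since the paper does not state the census parameters) shows the census transform and the Hamming-distance cost whose reads and writes the analysis counts.

```cpp
#include <algorithm>
#include <bitset>
#include <cstdint>
#include <vector>

// Census transform of one pixel: compare a 5x5 neighborhood against the
// center pixel and pack the 24 comparison bits into an integer.
uint32_t censusAt(const std::vector<uint8_t>& img, int W, int H, int x, int y) {
    uint32_t bits = 0;
    uint8_t center = img[y * W + x];
    for (int dy = -2; dy <= 2; ++dy) {
        for (int dx = -2; dx <= 2; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = std::clamp(x + dx, 0, W - 1);
            int ny = std::clamp(y + dy, 0, H - 1);
            bits = (bits << 1) | (img[ny * W + nx] < center ? 1u : 0u);
        }
    }
    return bits;
}

// Matching cost of one pixel at one disparity: Hamming distance between
// the census strings of the left pixel and the disparity-shifted right pixel.
int hammingCost(uint32_t censusLeft, uint32_t censusRight) {
    return static_cast<int>(std::bitset<32>(censusLeft ^ censusRight).count());
}
```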
Calculation fusion. To reduce the memory read and write overhead, we fuse some calculations in Cost Computation and Cost Aggregation. In Cost Computation, we compute the two Hamming distances, with four pixels each, and then use the results directly to get the disparity cost. In this case, the number of memory reads and writes is W × H × D × (4 + 1), which saves W × H × D operations compared with the original approach. In Cost Aggregation, the left and right aggregations operate on the same data. Therefore, they can be done simultaneously. For instance, we may calculate the I_d-th pixel in the left aggregation and the I_{W−1−d}-th pixel in the right aggregation at the same time. In this way, we reduce the number of memory accesses by W × D + H × D. The same holds for the up and down aggregations.

Data merged memory write. Merging data before writing it back to memory can further reduce the memory access overhead. In Cost Aggregation, we combine the two values obtained from the left and right aggregations, or from the up and down aggregations, into one array for the memory write. As such, the number of memory writes can be reduced by half. As shown in Figure 11, we combine the two 16-bit values obtained from the left and right aggregations into a single 32-bit value, which is written into memory only once.
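The merged write amounts to packing the two 16-bit aggregated costs into one 32-bit word before the store, roughly as follows (our illustration; the names are ours):

```cpp
#include <cstdint>

// Pack the 16-bit costs produced by the left->right and right->left
// aggregation passes into one 32-bit word, so a single memory write
// stores both values (Figure 11).
inline uint32_t packCosts(uint16_t costLeftPass, uint16_t costRightPass) {
    return (static_cast<uint32_t>(costRightPass) << 16) |
           static_cast<uint32_t>(costLeftPass);
}

// Unpack when the aggregated costs are read back for the final
// winner-take-all disparity selection.
inline void unpackCosts(uint32_t packed, uint16_t& costLeftPass, uint16_t& costRightPass) {
    costLeftPass  = static_cast<uint16_t>(packed & 0xFFFFu);
    costRightPass = static_cast<uint16_t>(packed >> 16);
}
```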
6.2 Reduced Data Synchronization Overhead

In Cost Aggregation, one of the critical steps is to calculate the minimal aggregated cost for each pixel, which requires all the disparities of each pixel to be calculated and synchronized. The existing implementation of SGBM typically creates a number of threads equal to the maximum number of disparities (D) to calculate the minimum of all aggregated costs for a pixel. However, this leads to extensive synchronization cost and contention on memory access. The synchronization is done in on-chip shared memory on desktop GPUs with low execution time, while on mobile GPUs, i.e., Mali GPUs, this operation is executed in off-chip shared memory, which is time-consuming.

Enlarged data slicing. To address this issue, we take 2^n (we set n=3 according to our evaluation) disparities at once via vectorized memory access and put them on a single thread for processing, as shown in Figure 12. With our enlarged data slicing method, each thread calculates 2^n disparity values and obtains the minimum disparity cost on that thread. Therefore, we only need
We also included the cases where the depth estimation device is stationary or moving.

Baselines. The primary baselines against which we compared MobiDepth are AnyNet [52] and MADNet with and without online adaptation [48] (for online adaptation we use an average of 1280 frames), which are the state-of-the-art depth estimation models on mobile devices, and ARCore [13], which is the official framework for depth estimation on Android. For simplicity, we name MADNet with and without online adaptation as MADNet-MAD and MADNet-No, respectively. We also considered several other baselines for evaluating different components of MobiDepth. For example, we compared against the SIFT method [30] on the performance of FoV matching. When evaluating the example application (mobile 3D pose estimation) based on MobiDepth, we selected a SOTA 3D human pose estimation model, MobileHumanPose [8], as the baseline.

Metrics. We evaluated the performance of MobiDepth in terms of accuracy, latency, energy cost, and memory usage. For depth-estimation accuracy, we used the mean distance error (mDerr), the average percentage error between the estimated depth of each pixel p in the target object (Depth_est) and the ground-truth depth (Depth_gt), as follows:

    mDerr = mean_{p ∈ Object} |Depth_gt(p) − Depth_est(p)| / Depth_gt(p)    (3)

When evaluating the example application, we computed the mean per-joint position error (MPJPE), which is the mean Euclidean distance between the positions of the predicted joints and the ground truth.
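Equation 3 corresponds to the following straightforward computation (our sketch; the depth maps are assumed to be flat float arrays with a boolean object mask):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean distance error (Equation 3): average relative depth error over the
// pixels that belong to the target object.
double meanDistanceError(const std::vector<float>& depthGt,
                         const std::vector<float>& depthEst,
                         const std::vector<bool>& objectMask) {
    double sum = 0.0;
    std::size_t count = 0;
    for (std::size_t p = 0; p < depthGt.size(); ++p) {
        if (!objectMask[p] || depthGt[p] <= 0.0f) continue;  // skip non-object / invalid pixels
        sum += std::fabs(depthGt[p] - depthEst[p]) / depthGt[p];
        count += 1;
    }
    return count > 0 ? sum / count : 0.0;
}
```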
8.2 Overall Depth Estimation Accuracy

We first evaluate the end-to-end performance of MobiDepth on depth estimation. Table 4 shows the accuracy of MobiDepth, AnyNet, MADNet and ARCore on different mobile devices. Among the three devices, HWP30 supports all methods. ARCore is not supported on HWM40P since it cannot run Google Play Services for AR. As a third-party application, MobiDepth is not authorized to access the libopencl.so library on the Pixel series phones, since Google does not publicly support the OpenCL library. Therefore, we only evaluate the performance of ARCore on the Google Pixel6Pro.

As shown in Table 4, MobiDepth outperforms the deep learning model baselines, i.e., AnyNet, MADNet-MAD and MADNet-No, in most cases. For instance, when the device and the target object are stationary, the estimation error of MobiDepth is 2.8%, 1.1%, 5.1% and 10.4% at the various distances, whereas AnyNet achieves 3.5%, 2.4%, 6.5%, 12.0% and MADNet-No 10.6%, 12.4%, 47.6%, 61.5% at the corresponding distances. Only on HWP30 with the object at a distance of 50cm is the estimation error of MobiDepth slightly higher than AnyNet's, i.e., 3.8% compared to 3.4%. One reason for the higher error of the deep-learning-based models is that the pairs of images used for model training have been perfectly rectified. However, the rectification of images captured by the on-device dual cameras suffers from inevitable error due to the diverse settings of the cameras, e.g., a normal lens and a wide-angle lens. Another reason is the limited scenes in the training dataset. In comparison, MobiDepth tolerates slight image rectification error, as SGBM uses a block matching method to calculate the disparity. More importantly, MobiDepth does not require any pre-training. This not only enables MobiDepth to work in new scenes, but also saves a lot of human effort in collecting and labeling a training dataset.

From Table 4, MobiDepth also consistently outperforms ARCore. For example, when the two systems are tested under the exact same condition (moving device and stationary target) on HWP30, MobiDepth achieves an average accuracy of 5.0%, 4.2%, 8.9% and 23.2% at the different distances, while the accuracy of ARCore is 14.3%, 7.1%, 11.7% and 25.5%, respectively.

The advantage of MobiDepth is greater when the target object is moving. Due to its algorithmic basis, ARCore cannot work well on moving targets, because the motion of the target object corrupts the relative displacement between frames measured by the IMU. This is not an issue in MobiDepth since our depth is estimated in a per-frame manner. As a result, we observe a significantly superior performance of MobiDepth (e.g., 8.7% vs. 43.5% on HWP30 at distance=100cm) when the target is moving slowly (30−80cm/s).

In addition, the accuracy of MobiDepth is even better when the device is stationary. For example, on HWM40P, the accuracy with the device stationary and moving is 1.1% and 3.2%, respectively, for a stationary target at distance=100cm. In contrast, ARCore is unable to obtain depth information without motion, which may limit its usage scenarios, such as live streaming with a fixed mobile device camera.

The accuracy of MobiDepth may be affected by the device. We can notice that the performance of MobiDepth on HWM40P is better than on the other devices, as the quality of the cameras and the placement of the dual cameras on HWM40P are better. The performance of MobiDepth may get worse if the secondary camera on the device is too poor or lies too close to the main camera. Nevertheless, MobiDepth can still achieve reasonable accuracy on most devices, since most mobile devices today are equipped with powerful dual cameras.

8.3 Performance Breakdown

We next evaluate the performance of the key components of the MobiDepth system in detail.

8.3.1 FoV Matching. As introduced in Section 4, MobiDepth uses an iterative FoV cropping method to match the FoVs of the dual cameras. An alternative to our method is the SIFT-based method, and thus we tested the performance of MobiDepth when the FoV matching part is implemented with SIFT. The result is shown in Table 5. We can see that the SIFT baseline leads to higher mean distance errors on depth estimation than our original method, namely 7.3% vs. 2.8%, 6.6% vs. 1.1%, 8.2% vs. 5.1%, and 13.3% vs. 10.4% when the target object distance is 50cm, 100cm, 300cm and 500cm, respectively. This demonstrates the effectiveness of our iterative cropping-based FoV matching method.

8.3.2 Frame Synchronization. We introduced a frame synchronization technique (Section 5) to reduce the time difference between the image streams of the dual cameras. Figure 15 shows that, without frame synchronization, the time difference between two frames accumulates as time goes on. The time difference exceeds 80ms after about 4000 frames. With our frame alignment method, we can keep the time difference of the dual cameras within 16ms (as we set the
Table 4: The accuracy of MobiDepth, AnyNet, MADNet and ARCore on depth estimation under different operating conditions. Each cell is the average mean distance error (mDerr).

Object status:                 Obj-Stationary†                  Obj-Moving† (30~80cm/s)          Obj-Moving (80~150cm/s)
Methods                        50cm   100cm  300cm  500cm       50cm   100cm  300cm  500cm       50cm   100cm  300cm  500cm

HWM40P
AnyNet       Dev-Stationary‡   3.5%   2.4%   6.5%   12.0%       4.0%   5.2%   11.1%  21.4%       13.9%  19.8%  20.4%  31.4%
MADNet-No    Dev-Stationary    10.6%  12.4%  47.8%  61.7%       13.5%  19.2%  50.3%  60.9%       33.3%  28.8%  55.8%  64.9%
MADNet-MAD   Dev-Stationary    5.8%   4.9%   11.5%  15.3%       5.9%   8.8%   19.6%  23.7%       14.7%  15.8%  28.3%  38.7%
MobiDepth    Dev-Stationary    2.8%   1.1%   5.1%   10.4%       3.8%   3.1%   9.3%   19.2%       9.8%   9.5%   15.1%  29.9%
MobiDepth    Dev-Moving‡       4.5%   3.2%   7.4%   14.3%       5.6%   5.8%   13.4%  23.1%       13.3%  15.7%  23.4%  31.1%

HWP30
AnyNet       Dev-Stationary    3.4%   5.2%   13.3%  25.5%       4.4%   7.6%   23.8%  30.9%       11.1%  16.0%  28.3%  39.3%
MADNet-No    Dev-Stationary    18.8%  15.9%  58.9%  74.4%       26.8%  16.8%  62.7%  74.6%       24.5%  17.6%  68.9%  74.6%
MADNet-MAD   Dev-Stationary    8.4%   9.1%   26.2%  49.6%       9.4%   10.9%  30.6%  54.8%       14.7%  13.1%  47.4%  54.7%
MobiDepth    Dev-Stationary    3.8%   4.1%   7.1%   21.4%       4.3%   6.3%   14.2%  28.5%       10.1%  13.7%  21.5%  35.6%
MobiDepth    Dev-Moving        5.0%   4.2%   8.9%   23.2%       7.4%   8.7%   24.4%  35.2%       15.3%  20.1%  30.5%  40.3%
ARCore*      Dev-Moving        14.3%  7.1%   11.7%  25.5%       56.6%  43.5%  54.8%  57.9%       69.9%  66.6%  58.7%  61.9%

* Since ARCore cannot obtain depth information while the device is stationary, we only tested ARCore with the device moving.
† Obj-Stationary and Obj-Moving denote that the target object is static and moving, respectively.
‡ Dev-Stationary and Dev-Moving indicate that the mobile device is static and moving, respectively.
Figure 15: Time difference with and without frame synchronization.
Figure 16: Latency reduction of our optimized SGBM for mobile GPU.
Figure 17: Latency of MobiDepthPose and MobileHumanPose.
Table 5: The average mean distance error (mDerr) of SIFT and our method for FoV matching. The target object is stationary at different distances.

Distance    MobiDepth-SIFT    MobiDepth-Ours
50cm        7.3%              2.8%
100cm       6.6%              1.1%
300cm       8.2%              5.1%
500cm       13.3%             10.4%

Table 6: The average mean distance error (mDerr) of MobiDepth when the two cameras are synchronized (Sync) or not synchronized (Unsync). Unsync1 and Unsync2 have a 33ms (one frame) and 66ms (two frames) time difference between two frames, respectively. The target object is moving at different distances and speeds.

Speed: Moving (30~80cm/s)
Distance    Original    Unsync1    Unsync2
50cm        3.8%        17.2%      41.2%
100cm       3.1%        12.1%      31.1%
300cm       9.3%        25.6%      42.2%
500cm       19.2%       36.3%      50.7%
θ to 16ms, half of the frame period). To test the effectiveness of this technique, we compared the accuracy of MobiDepth with and without frame synchronization, as shown in Table 6.

In Table 6, Unsync represents the cases where the two frames used for stereo matching in MobiDepth are not synchronized (by one frame or two frames). As can be seen from the table, our frame synchronization method allows MobiDepth to achieve obviously higher accuracy than without synchronization. The advantage of using frame synchronization is more significant when the velocity of the target object is higher.

8.3.3 Stereo Matching Optimizations. In the stereo matching of MobiDepth, we adopted several techniques to optimize the performance of SGBM on mobile devices. Figure 16 shows the latency reduction of our customized SGBM implementation compared to the existing SGBM implementation that was originally developed for desktop GPUs (denoted as D-SGBM). The result shows that almost all parts of the SGBM algorithm are significantly optimized, reducing the latency of SGBM by 81.4% on HWM40P and 72.8% on HWP30.

8.4 Case Study: 3D Pose Estimation

MobiDepth may enable various 3D applications on mobile devices. We take 3D pose estimation (PE) as an example case study to show
the effectiveness of MobiDepth. To do so, we implement a mobile 3D PE application based on MobiDepth, named MobiDepthPose. The current SOTA mobile 3D PE system, MobileHumanPose [8], adopts an end-to-end neural network to predict the 3D keypoint coordinates (x, y, z) of human joints. Instead, our MobiDepthPose can directly utilize MobiDepth to obtain accurate depth information, so that the model only needs to predict the 2D keypoint coordinates (x, y). In MobiDepthPose, we use an accurate and lightweight 2D PE model [56, 57] in combination with MobiDepth to obtain the 3D keypoint coordinates (x, y, z).
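One plausible way to combine the two (our sketch, not the authors' exact formulation) is to back-project each 2D keypoint with the MobiDepth depth at that pixel through a pinhole camera model, using the intrinsics obtained from the calibration in Section 4:

```cpp
#include <vector>

struct Keypoint2D { float u, v; };     // pixel coordinates from the 2D PE model
struct Keypoint3D { float x, y, z; };  // camera-space coordinates (same unit as depth)

// Lift 2D joints to 3D using the per-pixel depth from MobiDepth and a
// pinhole model with focal lengths (fx, fy) and principal point (cx, cy).
// Keypoints are assumed to lie inside the depth map.
std::vector<Keypoint3D> liftTo3D(const std::vector<Keypoint2D>& joints,
                                 const std::vector<float>& depthMap, int width,
                                 float fx, float fy, float cx, float cy) {
    std::vector<Keypoint3D> out;
    out.reserve(joints.size());
    for (const auto& j : joints) {
        float z = depthMap[static_cast<int>(j.v) * width + static_cast<int>(j.u)];
        out.push_back({(j.u - cx) * z / fx, (j.v - cy) * z / fy, z});
    }
    return out;
}
```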
We evaluated the accuracy of MobileHumanPose and MobiDepthPose under different settings in which a person was walking, sitting, kicking, or hitting something in the camera view. In each setting, we collected 300 samples, and the ground truth was produced by the combination of a full-featured depth camera (Intel RealSense D435i) and a SOTA 2D PE model (HigherHRNet-w48 [7]). As we can see from Table 7, MobiDepthPose achieves much better accuracy than MobileHumanPose in all cases, with the overall mean per-joint position error (MPJPE) significantly reduced from 241.4mm to 170.1mm, i.e., a reduction of 29.5%.

Table 7: Accuracy of MobileHumanPose and MobiDepthPose in terms of MPJPE (mm).

Methods                 Walk    Kick    Sit     Hit     Avg.
MobileHumanPose [8]     290.3   231.4   213.8   229.9   241.4
MobiDepthPose (Ours)    195.3   164.8   141.6   178.5   170.1

Moreover, since the neural network can be much simplified in MobiDepthPose, the latency of 3D PE is reduced as well. As shown in Figure 17, the latency of MobiDepthPose is reduced by 45.4% to 57.1% compared with MobileHumanPose on different devices. Some examples of the estimated 3D poses can be found in Figure 18.

Figure 18: Visualization of some 3D poses predicted by MobiDepthPose and MobileHumanPose.

8.5 System Overhead

Finally, we evaluate the system overhead of MobiDepth in terms of latency, power consumption, and memory usage. Figure 19(a) shows the latency of MobiDepth, AnyNet, MADNet-No and ARCore on the three mobile devices (we do not evaluate the latency of MADNet-MAD on mobile devices since the model with the online adaptation module is extremely time-consuming; its latency is much higher than that of MADNet-No). Compared to AnyNet and MADNet-No, MobiDepth reduces latency by 40.02% and 91.81% on HWM40P, and by 24.04% and 88.34% on HWP30, respectively. Yet, as the official framework adopted by Android, ARCore achieves better latency than MobiDepth on both high-end and low-end devices. In particular, the latency gap between ARCore and MobiDepth is large (about 40ms) on HWP30. This is because ARCore utilizes keyframes and the IMU to reduce computation, while MobiDepth computes depth in a per-frame manner. Nevertheless, the result also shows that the latency of MobiDepth is greatly reduced, to around 45ms, on higher-end devices (e.g., HWM40P), which shows the ability of MobiDepth to achieve a real-time experience.

The results for power consumption are shown in Figure 19(b). We use PerfDog [34], developed by Tencent, to collect the power consumption of the different systems. Since MobiDepth exploits the CPU, the GPU and the two cameras, its average power consumption is about 50% higher than ARCore's. However, MobiDepth reduces power consumption by 7.14% and 8.08% compared with AnyNet and MADNet-No, respectively, on HWM40P. This is because most operations of MobiDepth are simple additions and comparisons, which have a low utilization of the ALU on the GPU, i.e., 4% to 6% according to our test.

The memory usage of MobiDepth, AnyNet, MADNet-No and ARCore is obtained using the Monitors tool in Android Studio. As shown in Figure 19(c), the average memory usage of MobiDepth exceeds AnyNet's by 17.9% and ARCore's by 35%, since it computes the disparity of each pixel on both the CPU and the GPU simultaneously. However, the total memory usage of MobiDepth is less than 450MB, which is relatively small as mainstream mobile devices have multiple gigabytes of memory.

To sum up, MobiDepth introduces a higher overhead than ARCore. However, it reduces the latency and power consumption compared with AnyNet and MADNet. We believe that the overhead is tolerable given the great advantages offered by MobiDepth.

9 RELATED WORK

Stereo matching. To achieve accurate depth, many works have focused on stereo matching algorithms [17, 47]. Traditional stereo matching methods usually utilize the low-level features of image patches around a pixel to measure dissimilarity, and can be grouped into three categories: 1) Local methods. Both Lazaros et al. [33] and Kristian et al. [2] exploit region-based local algorithms, which match based on feature vectors extracted in a window. 2) Global methods. Vladimir et al. [26] and Andreas et al. [24] select the disparity that minimizes a global energy function. 3) Semi-global methods. Heiko [19] optimizes a path-wise form of the energy function in multiple directions. With the development of deep learning, many stereo matching works use CNN models to improve the accuracy of depth estimation [25, 38, 44]. Akihito et al. [38] propose the SGM-Nets model to improve accuracy by providing a learned penalty to SGM. Patrick et al. [25] learn smoothness penalties through a conditional random field (CRF) and combine them with a correlation matching cost
predicted by a CNN to integrate long-range interactions. Vladimir et al. [44] design a CNN model that propagates information across different resolution levels. However, these works are only suitable for identical dual cameras. Besides, none of these works consider the specific architecture of mobile SoCs. MobiDepth aims to estimate depth through a dual-camera system with diverse settings, and makes good use of the memory architecture of the mobile GPU for acceleration.

Depth estimation on mobile devices. Current methods for obtaining depth on mobile devices can be divided into three categories: 1) Dedicated depth sensors. Kim et al. [23] and Tian et al. [46] use the on-device time-of-flight (ToF) camera to obtain 3D images, and Shih et al. [39] and Stefano et al. [45] use LiDAR instead. Depth sensors are currently only available on a few high-end mobile devices due to their high cost. 2) Learning-based depth prediction. Liu et al. [29] present a CNN field model to estimate depth from single monocular images, aiming to jointly exploit the capacity of CNN models and continuous CRFs. David et al. [15] propose a two-scale CNN model trained on images and the corresponding depth maps. However, the 3D CNN models are computation-heavy and cannot achieve real-time performance on mobile devices. Furthermore, these methods suffer from limited scalability, i.e., they cannot estimate the depth of new objects that are unseen in the training dataset. 3) Using a monocular camera on mobile devices. Yang et al. [54] propose a keyframe-based real-time surface mesh generation approach to reconstruct 3D objects from single RGB images. ARCore [49], the well-known AR framework, obtains depth from motion, i.e., it uses a monocular camera combined with an inertial measurement unit (IMU) to estimate depth. However, the limitation of these works is that they cannot estimate the accurate depth of objects in motion. In comparison, MobiDepth obtains the disparity from the dual cameras, instead of moving a single camera. Therefore, MobiDepth can accurately estimate the depth of objects in motion. Furthermore, the stereo matching algorithm used in MobiDepth, i.e., SGBM, does not need to be pre-trained on any 3D datasets, and thus achieves better scalability.

10 DISCUSSION

The current MobiDepth system has several limitations. 1) MobiDepth does not take into consideration the impact of auto-focus. In auto-focus, a motor moves the camera lens backward and forward to adjust the image distance, making it hard for MobiDepth to do the FoV cropping. However, if auto-focus can be well handled, it may help improve the depth estimation as it improves the captured image quality. 2) Due to the short distance between the dual cameras on mobile devices, the effective range over which MobiDepth can obtain depth is typically limited to 0.5 to 5 meters, e.g., the distance between the dual cameras on HWM40P is about 2.1cm (the range varies slightly across mobile devices). If the target object lies outside of the effective distance range, the system may not accurately estimate the depth. However, we believe the effective distance range can already cover most AR/VR application scenarios on mobile devices. 3) MobiDepth may not work well on devices with black-and-white cameras, and it relies on the computing power of the mobile GPU. For mobile devices with low-end GPUs, the efficiency of stereo matching in MobiDepth may not be guaranteed. This is not a severe issue since most devices today are equipped with powerful cameras and GPUs. These limitations can also be mitigated by incorporating more advanced algorithmic and system optimizations, which we leave for future work.

11 CONCLUSION

In this paper, we propose MobiDepth, the first system to use the dual cameras on commodity mobile devices for real-time depth estimation. MobiDepth does not require dedicated sensors or large-scale data, and works for a wide range of scenarios, where both the device and the target objects can be stationary or moving. To do so, MobiDepth employs three key techniques: iterative FoV cropping, heterogeneous camera synchronization, and mobile GPU-friendly stereo matching. Extensive experiments have demonstrated that MobiDepth can achieve high accuracy and low overhead. The accuracy remains high for moving objects and moving devices, significantly outperforming ARCore, the state-of-the-art framework for depth estimation on Android.

12 ACKNOWLEDGMENTS

We sincerely appreciate the anonymous shepherd and reviewers for their valuable comments. This work is supported by the National Science Foundation of China (62172439, 62122095, 62072472), the National Key R&D Program of China (2019YFA0706403), the Natural Science Foundation Major Project of Hunan Science and Technology Innovation Program (S2021JJZDXM0022), the Natural Science Foundation of Hunan Province (2020JJ5774, 2020JJ2050) and U19A2067, and a grant from the Guoqiang Institute, Tsinghua University.

REFERENCES
[1] 2022. https://github.com/ethan-li-coding.
[2] Kristian Ambrosch and Wilfried Kubinger. 2010. Accurate hardware-based stereo vision. Computer Vision and Image Understanding 114, 11 (2010), 1303–1316.
[3] Ken Arnold, James Gosling, and David Holmes. 2005. The Java programming language. Addison Wesley Professional.
[4] Christian Banz, Holger Blume, and Peter Pirsch. 2011. Real-time semi-global matching disparity estimation on the GPU. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 514–521.
[5] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. SURF: Speeded up robust features. In European Conference on Computer Vision. Springer, 404–417.
[6] Michael Bleyer, Christoph Rhemann, and Carsten Rother. 2011. PatchMatch stereo - stereo matching with slanted support windows. In BMVC, Vol. 11. 1–11.
[7] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang. 2020. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In CVPR.
[8] Sangbum Choi, Seokeon Choi, and Changick Kim. 2021. MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2328–2338.
[9] Ivan Culjak, David Abram, Tomislav Pribanic, Hrvoje Dzapo, and Mario Cifrek. 2012. A brief introduction to OpenCV. In 2012 Proceedings of the 35th International Convention MIPRO. IEEE, 1725–1730.
[10] François Darmon and Pascal Monasse. 2021. The Polar Epipolar Rectification. Image Processing On Line 11 (2021), 56–75.
[11] Pei-Huang Diao and Naai-Jung Shih. 2018. MARINS: A mobile smartphone AR system for pathfinding in a dark environment. Sensors 18, 10 (2018), 3442.
[12] ARKit Developers Documentation. 2018. https://developer.apple.com/documentation/arkit.
[13] ARCore Developers Documentation. 2018. https://developers.google.com/ar.
[14] Android Developers Documentation. 2021. https://developer.android.com/training/camera2/multi-camera.
[15] David Eigen, Christian Puhrsch, and Rob Fergus. 2014. Depth map prediction from a single image using a multi-scale deep network. Advances in Neural Information Processing Systems 27 (2014).
[16] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 3354–3361.
[17] Rostam Affendi Hamzah and Haidi Ibrahim. 2016. Literature survey on stereo vision disparity map algorithms. Journal of Sensors 2016 (2016).
[18] Daniel Hernandez-Juarez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure, and Antonio M López. 2016. Embedded real-time stereo estimation via semi-global matching on the GPU. Procedia Computer Science 80 (2016), 143–153.
[19] Heiko Hirschmuller. 2007. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2007), 328–341.
[20] Dong-Hyun Hwang, Suntae Kim, Nicolas Monet, Hideki Koike, and Soonmin Bae. 2020. Lightweight 3D human pose estimation network training using teacher-student learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 479–488.
[21] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2015. Panoptic studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision. 3334–3342.
[22] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. 2018. StereoNet: Guided hierarchical refinement for edge-aware depth prediction. (2018).
[23] Hyun Myung Kim, Min Seok Kim, Gil Ju Lee, Hyuk Jae Jang, and Young Min Song. 2020. Miniaturized 3D depth sensing-based smartphone light field camera. Sensors 20, 7 (2020), 2129.
[24] Andreas Klaus, Mario Sormann, and Konrad Karner. 2006. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In 18th International Conference on Pattern Recognition (ICPR'06), Vol. 3. IEEE, 15–18.
[25] Patrick Knobelreiter, Christian Reinbacher, Alexander Shekhovtsov, and Thomas Pock. 2017. End-to-end training of hybrid CNN-CRF models for stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2339–2348.
[26] Vladimir Kolmogorov and Ramin Zabih. 2001. Computing visual correspondence with occlusions using graph cuts. In Proceedings Eighth IEEE International Conference on Computer Vision (ICCV 2001), Vol. 2. IEEE, 508–515.
[27] Rendong Liang, Ting Cao, Jicheng Wen, Manni Wang, Yang Wang, Jianhua Zou, and Yunxin Liu. 2022. Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. https://doi.org/10.1145/3495243.3517020
[28] TensorFlow Lite. 2021. https://www.tensorflow.org/lite/.
[29] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid. 2015. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (2015), 2024–2039.
[30] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
[31] Wolfgang Maass. 2000. On the computational power of winner-take-all. Neural Computation 12, 11 (2000), 2519–2535.
[32] Xing Mei, Xun Sun, Mingcai Zhou, Shaohui Jiao, Haitao Wang, and Xiaopeng Zhang. 2011. On building an accurate stereo matching system on graphics hardware. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 467–474.
[33] Lazaros Nalpantidis and Antonios Gasteratos. 2010. Stereo vision for robotic applications in the presence of non-ideal lighting conditions. Image and Vision Computing 28, 6 (2010), 940–951.
[34] PerfDog. 2022. https://perfdog.qq.com/.
[35] Todd B Pittman, YH Shih, DV Strekalov, and Alexander V Sergienko. 1995. Optical imaging by means of two-photon quantum entanglement. Physical Review A 52, 5 (1995), R3429.
[36] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision. IEEE, 2564–2571.
[37] Thomas Schöps, Torsten Sattler, Christian Häne, and Marc Pollefeys. 2017. Large-scale outdoor 3D reconstruction on a mobile device. Computer Vision and Image Understanding 157 (2017), 151–166.
[38] Akihito Seki and Marc Pollefeys. 2017. SGM-Nets: Semi-global matching with neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 231–240.
[39] Naai-Jung Shih, Pei-Huang Diao, Yi-Ting Qiu, and Tzu-Yu Chen. 2020. Situated AR simulations of a lantern festival using a smartphone and LiDAR-based 3D models. Applied Sciences 11, 1 (2020), 12.
[40] Prarthana Shrstha, Mauro Barbieri, and Hans Weda. 2007. Synchronization of multi-camera video recordings based on audio. In Proceedings of the 15th ACM International Conference on Multimedia. 545–548.
[41] Robert Spangenberg, Tobias Langner, and Raúl Rojas. 2013. Weighted semi-global matching and center-symmetric census transform for robust driver assistance. In International Conference on Computer Analysis of Images and Patterns. Springer, 34–41.
[42] John E Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66.
[43] Bjarne Stroustrup. 2013. The C++ programming language. Pearson Education.
[44] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. 2021. HITNet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14362–14372.
[45] Stefano Tavani, Andrea Billi, Amerigo Corradetti, Marco Mercuri, Alessandro Bosman, Marco Cuffaro, Thomas Seers, and Eugenio Carminati. 2022. Smartphone assisted fieldwork: Towards the digital transition of geoscience fieldwork using LiDAR-equipped iPhones. Earth-Science Reviews (2022), 103969.
[46] Yuan Tian, Yuxin Ma, Shuxue Quan, and Yi Xu. 2019. Occlusion and collision aware smartphone AR using time-of-flight camera. In International Symposium on Visual Computing. Springer, 141–153.
[47] Beau Tippetts, Dah Jye Lee, Kirt Lillywhite, and James Archibald. 2016. Review of stereo vision algorithms and their suitability for resource-limited systems. Journal of Real-Time Image Processing 11, 1 (2016), 5–25.
[48] Alessio Tonioni, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, and Luigi Di Stefano. 2019. Real-time self-adaptive deep stereo. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Julien Valentin, Adarsh Kowdle, Jonathan T Barron, Neal Wadhwa, Max Dzitsiuk, Michael Schoenberg, Vivek Verma, Ambrus Csaszar, Eric Turner, Ivan Dryanovski, et al. 2018. Depth from motion for smartphone AR. ACM Transactions on Graphics (ToG) 37, 6 (2018), 1–19.
[50] VR and AR market size 2024. 2021. https://www.statista.com/statistics/591181/global-augmented-virtual-reality-market-size/.
[51] Bill Waggener, William N Waggener, and William M Waggener. 1995. Pulse code modulation techniques. Springer Science & Business Media.
[52] Yan Wang, Zihang Lai, Gao Huang, Brian H. Wang, Laurens Van Der Maaten, Mark Campbell, and Kilian Q Weinberger. 2018. Anytime Stereo Image Depth Estimation on Mobile Devices. arXiv preprint arXiv:1810.11408 (2018).
[53] Jingao Xu, Guoxuan Chi, Zheng Yang, Danyang Li, Qian Zhang, Qiang Ma, and Xin Miao. 2021. FollowUpAR: Enabling follow-up effects in mobile AR applications. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. 1–13.
[54] Xingbin Yang, Liyang Zhou, Hanqing Jiang, Zhongliang Tang, Yuanbo Wang, Hujun Bao, and Guofeng Zhang. 2020. Mobile3DRecon: Real-time monocular 3D reconstruction on a mobile phone. IEEE Transactions on Visualization and Computer Graphics 26, 12 (2020), 3446–3456.
[55] Juheon Yi and Youngki Lee. 2020. Heimdall: Mobile GPU coordination platform for augmented reality applications. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking. 1–14.
[56] Changqian Yu, Bin Xiao, Changxin Gao, Lu Yuan, Lei Zhang, Nong Sang, and Jingdong Wang. 2021. Lite-HRNet: A Lightweight High-Resolution Network. In CVPR.
[57] Jinrui Zhang, Deyu Zhang, Xiaohui Xu, Fucheng Jia, Yunxin Liu, Xuanzhe Liu, Ju Ren, and Yaoxue Zhang. 2020. MobiPose: Real-time multi-person pose estimation on mobile devices. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems. 136–149.
[58] Zhengyou Zhang. 1999. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 1. IEEE, 666–673.