reflected pattern and the ROI. Last, the 3D image is captured directly with a 2D camera sensor, and the intensity diminishes linearly as the depth increases. The encoding and decoding surfaces of the chipset are co-optimized with deep learning algorithms to maximize the discerning capability by minimizing a training loss function for linear mapping between intensity and depth. The phase modulation masks can be implemented with spatial light modulators (SLMs) for reconfigurable and flexible design (see Materials and Methods and fig. S1) or nanofabricated phase plates for compact system volume, passive computation, and higher energy efficiency. Instead of developing a depth-sensitive point spread function and recovering 3D information with electronic reconstruction algorithms (40, 41), Aop3D is trained to learn the depth information carried by the characteristic optical modes all-optically and acts as a light-speed intelligent 3D imager.

For 3D perception of object-aware scenarios, we construct a dataset with specific objects placed at different depths to directly train the encoding and decoding surfaces (see Materials and Methods). Moreover, we develop a meta-pixel scheme for general scenarios with arbitrary-shape objects (Fig. 1B). Each meta-pixel comprising many phase modulation atoms is termed a whole perception unit. An en-

images are not smooth, especially when the object is far away, and the intensity does not change monotonically with the increasing distance (Fig. 2B). For the quantitative evaluation, we calculate the root mean square error (RMSE) of the estimated depth with respect to the ground truth. The depth RMSE is 0.069 cm for the two-layer Aop3D and 2.411 cm without the encoding surface, i.e., 0.35 and 12.06% of the depth range, respectively. We conduct experiments for the two-layer Aop3D and implement the diffractive surfaces with two SLMs for proof of principle (see Materials and Methods and fig. S1). The experimental results match the simulations as expected, and a depth RMSE of 1.233 cm (6.17% of the depth range) is achieved. We also design an Aop3D for 3D perception of a character "T" and obtain similar results (see fig. S2). The discrepancy of simulation and experiments is caused by system errors, including the nonideal plane incident light, the phase modulation imprecisions of diffractive surfaces, and the alignment errors between encoding and decoding surfaces.

Different classes of objects can be processed with the same Aop3D by simply enriching the training dataset. To demonstrate this, we train a pair of encoding and decoding surfaces with charac-
Fig. 2. High-resolution 3D perception for object-aware scenarios. (A) 3D perception for a single object. Depth images of a character “H” obtained from a two-layer
Aop3D are compared to a one-layer Aop3D without the encoding surface. The FOV is 7.36 mm by 7.36 mm and consists of 800 × 800 pixels. (B) Depth prediction results
in (A) versus the ground truth (GT). The solid lines are simulation results, and the hollow dots indicate experimental results. The subgraph illustrates the simulation and
experiment depth RMSE of the two-layer Aop3D and the simulation RMSE of the one-layer Aop3D. (C) 3D perception for different classes of objects, i.e., characters “T”, “H”,
and “U”, with the same Aop3D. (D) Depth prediction results in (C). (E) Simultaneous 3D perception for multiple complicated objects, i.e., toy buildings and a truck, with
millimeter-level spatial and depth resolution via spatial multiplexing. (F) Depth prediction results in (E). (G and H) Phase modulation masks of the one-layer and the two-
layer Aop3D in (A), respectively. (I and J) Phase modulation masks of Aop3D in (C) and (E), respectively.
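As a concrete illustration of the linear intensity-to-depth mapping and the RMSE metric used above, the following minimal sketch (array names are hypothetical, and the ~20-cm depth range is an assumption consistent with the reported RMSE percentages) shows how a captured intensity image would be converted to a depth map and scored against the ground truth:

```python
import numpy as np

def intensity_to_depth(intensity, d_min, d_max):
    """Linear intensity-to-depth mapping the surfaces are trained for:
    full intensity corresponds to d_min and zero intensity to d_max."""
    return d_max - intensity * (d_max - d_min)

def depth_rmse(estimated, ground_truth):
    """Root mean square error of the estimated depth map against the ground truth."""
    return float(np.sqrt(np.mean((estimated - ground_truth) ** 2)))

# Hypothetical example: an 800 x 800 normalized intensity image of the character "H"
# and a flat ground-truth depth map; the 10-30 cm range is an assumption.
d_min, d_max = 10.0, 30.0                             # cm
intensity = np.clip(np.random.rand(800, 800), 0, 1)   # stand-in for the camera capture
gt_depth = np.full((800, 800), 20.0)                  # stand-in ground truth (cm)

est_depth = intensity_to_depth(intensity, d_min, d_max)
rmse = depth_rmse(est_depth, gt_depth)
print(f"depth RMSE: {rmse:.3f} cm ({100 * rmse / (d_max - d_min):.2f}% of the depth range)")
```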
and depth resolution authenticates the generalization ability of Aop3D to complex scenarios.

Arbitrary-shape 3D perception with the meta-pixel scheme

Aop3D is designed with the meta-pixel scheme, instead of global training of the diffractive surfaces, for general scenarios with arbitrary-shape objects. The key point of the meta-pixel scheme is to train a pair of encoding and decoding meta-pixels, which learn the depth-intensity mapping and simultaneously confine the structured light propagation within their own region. A mean square error (MSE) loss function is used to optimize the linear relationship between the average light intensity of the output ROI and the depth, and a customized loss function is designed to maximize the light power within the region during propagation (see Materials and Methods). After the encoding and decoding meta-pixels are trained, they can be replicated and concatenated for arbitrary-shape 3D perception. Figure 3 demonstrates 3D perception of a six-step stair with arbitrary depths from 0.1 to 2 m using 6 meta-pixels. Each meta-pixel is 3.68 mm by 3.68 mm and comprises 100 × 100 phase modulation atoms (Fig. 3B). As depicted in Fig. 3A, the structured light

select 10 points from a depth range of 0.1 to 2 m for each step, and the training set already consists of a million diffraction images, not to mention higher depth sampling densities and more steps. The enormous training set makes it extremely difficult for the U-net to learn every depth combination. Therefore, the performance of the U-net fluctuates greatly, especially when the depths of the six steps are similar and the interference is severe (Fig. 3E). The overall depth RMSE of the U-net is 27.3 cm (14.37% of the depth range), much worse than that of Aop3D, which overcomes the problem by directly manipulating the light propagation.

Aop3D works at the speed of light theoretically because the 3D perception is completed during the flight of photons. In the experiments, the speed is restricted by the camera frame rate. We demonstrate a high-speed (600 Hz) 3D perception scenario by approaching the limit of the camera speed. We design an Aop3D with 3 × 3 meta-pixels for the 3D perception of an object moving at 5 cm/s between 25 and 60 cm (see Fig. 4, A to C, fig. S14, and movie S1). Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. The acquisition area of the camera is set to the output ROI region of a meta-pixel (the red box in Fig. 4A), and a frame rate of 600 Hz
Fig. 3. Arbitrary-shape 3D perception with the meta-pixel scheme. (A) The structured light patterns and output results of a single meta-pixel. The light propagation is
constrained within the region of the meta-pixel (the blue dashed box). The depth is represented by the average intensity in the ROI (the red box) of the output plane.
(B) The encoding and decoding phase modulation masks containing 6 meta-pixels. Each meta-pixel is 3.68 mm by 3.68 mm with 100 × 100 phase modulation atoms.
(C) The light power proportion constrained in the meta-pixel region and the depth estimation performance of a single meta-pixel. (D) 3D perception performance for the
six-step stair with Aop3D and the comparison to an electronic neural network (U-net). The solid lines represent the mean value, and the shaded area represents the SD. The
box plots represent the experimental results. (E) 3D visualization of the retrieved stairs with different shapes. The U-net results are disturbed by cross-talk when the depths of the six steps are similar, but Aop3D overcomes the problem by directly manipulating the light propagation.
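The two training objectives of the meta-pixel scheme described above can be written down compactly. The sketch below uses placeholder fields and masks; the actual optimization differentiates through the diffraction model described in Materials and Methods, and the confinement term is only a plausible stand-in for the customized loss:

```python
import numpy as np

def depth_mse_loss(output_intensity, roi_mask, depth, d_min, d_max):
    """MSE between the mean intensity inside the output ROI and the linearly
    encoded depth target (1 at d_min, 0 at d_max)."""
    target = (d_max - depth) / (d_max - d_min)
    return (output_intensity[roi_mask].mean() - target) ** 2

def confinement_loss(field_intensity, region_mask):
    """Penalty on the fraction of light power that escapes the meta-pixel region
    during propagation (a plausible stand-in for the customized loss)."""
    return 1.0 - field_intensity[region_mask].sum() / field_intensity.sum()

# Hypothetical geometry: a 100 x 100 meta-pixel centered in a 200 x 200 simulation
# window, with the output ROI spanning half the meta-pixel's width and height.
pad, n = 200, 100
y, x = np.mgrid[:pad, :pad]
region = (np.abs(x - pad / 2) < n / 2) & (np.abs(y - pad / 2) < n / 2)
roi = (np.abs(x - pad / 2) < n / 4) & (np.abs(y - pad / 2) < n / 4)

field = np.random.rand(pad, pad)  # stand-in for a simulated output intensity pattern
loss = depth_mse_loss(field, roi, depth=1.0, d_min=0.1, d_max=2.0) + confinement_loss(field, region)
print(f"total loss: {loss:.4f}")
```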
affected (see fig. S17). Instead of indicating depth with light intensity, the classification scheme enables robust 3D perception performance under different illumination and reflectivity conditions. The depth classification scheme based on the spatial intensity distribution may challenge the resolution limits of the 2D sensor, which can be addressed by projecting the spatial intensity distribution into a time-domain intensity variation (43) or encoding the depth into the optical power spectrum classified with a single-pixel spectroscopic detector (44).

Aop3D for all-optical obstacle avoidance

Autonomous navigation of mobile robotics is one of the most essential application scenarios of 3D perception technology. However, the processing speeds of previous 3D perception approaches are insufficient for obstacle avoidance to ensure safety, especially in high-speed scenarios (1). Here, we demonstrate light-speed 3D sensing of Aop3D for mobile robotics in the real world. Specifically, we deploy a prototype system of Aop3D on a mobile robot to support light-speed distance detection and obstacle avoidance (Fig. 5A). We design the Aop3D with 9 meta-pixels for a depth range from 0.5 to 6 m. Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. To further reduce the system size and enhance the energy efficiency, we fabricate the encoding and decoding surfaces with chip-scale phase plates for passive computing (see Fig. 5, A and B, and figs. S18 and S19). In this implementation, Aop3D consumes no energy for 3D perception other than the laser power, which makes it more energy efficient than 3D sensors requiring postprocessing. Before the obstacle avoidance task, we calibrate the mapping relationship between the output light intensity of Aop3D and the depth.
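A minimal sketch of this calibration step, consistent with the least-squares linear fit between intensity and depth described in Materials and Methods (the calibration readings below are placeholders):

```python
import numpy as np

# Hypothetical calibration data: mean output-ROI intensities recorded with a target
# placed at known depths across the 0.5-6 m working range of one meta-pixel.
known_depths = np.linspace(0.5, 6.0, 12)                            # m
readings = np.linspace(1.0, 0.05, 12) + 0.01 * np.random.randn(12)  # normalized intensity

# Least-squares linear fit, intensity -> depth.
slope, offset = np.polyfit(readings, known_depths, deg=1)

def estimate_depth(mean_roi_intensity):
    """Convert a measured mean ROI intensity into an estimated depth in meters."""
    return slope * mean_roi_intensity + offset

print(f"estimated depth at intensity 0.5: {estimate_depth(0.5):.2f} m")
```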
Fig. 5. Aop3D for all-optical obstacle avoidance of a mobile robot. (A) A compact Aop3D implemented with a chipset of phase plates is mounted on a mobile robot
to perform light-speed depth estimation. (B) Phase modulation masks (first column), photographs (second column), and scanning electron microscopy images (third
column) of the encoding and decoding surfaces containing 3 × 3 meta-pixels in (A). Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms.
(C) Characterization of Aop3D for depth estimation from 0.5 to 6 m. The solid lines represent the mean value, and the shaded area represents the SD of the simulation
results. The box plots represent the experimental results. (D) A conceptual illustration of the obstacle avoidance scenario. The mobile robot needs to avoid the glass ob-
stacles and stop safely in the parking area. (E) Depth estimation performance of Aop3D with comparisons to a Kinect and a LIDAR (Slamtec RPLIDAR A2) in the obstacle
avoidance task. All the glass is detected with Aop3D at different positions on the road, but other 3D sensors wrongly return the distance of the wall.
We obtain a simulated depth RMSE of 0.056 m and an experimental RMSE of 0.301 m, i.e., 1.02 and 5.47% of the depth range, respectively (Fig. 5C).

We construct a city road scenario to demonstrate the obstacle avoidance task (Fig. 5D). Several pieces of transparent glass are placed on the road as obstacles, a widespread yet especially challenging ranging target (45, 46). The performance of Aop3D is compared with state-of-the-art 3D sensors, including a stereo camera (Intel RealSense D415), an Azure Kinect, and a LIDAR (Slamtec RPLIDAR A2). The robot is expected to avoid obstacles when the distance surpasses a threshold and stop safely in the parking area. As shown in Fig. 5E and movie S2, the robot successfully detects all the glass with Aop3D and obtains the correct distance. It turns right when the distance is close to 2 m and finally stops in the parking area when the distance is close to 0.5 m. However, the Kinect and the LIDAR cannot detect the glass and wrongly return the distance of the wall, which is always larger than the distance of the glass, causing a risk of collision. The stereo camera is also not competent for this task (see fig. S20). The real-world demo shows that Aop3D has notable advantages over state-of-the-art approaches, especially for the common yet challenging detection of transparent building surfaces.

the proportional relationship between the area of the diffractive surfaces and the FOV can be designed flexibly, enabling 3D perception of a large FOV with a small chipset (see Supplementary Text and fig. S9). While the prototype Aop3D demonstrated here contains 81 meta-pixels, more meta-pixels can be integrated into a compact system for higher image resolution by combining advanced nanofabrication techniques such as electron-beam lithography (47, 48) and two-photon polymerization (49). A larger pixel size and more modulation atoms in each meta-pixel permit better 3D perception performance but increase the area of a meta-pixel and may influence the spatial resolution, which can be addressed with highly integrated metasurface and imaging systems.

Moreover, increasing the number of diffractive surface layers and incorporating nonlinear materials (50) can further enhance the model learning capabilities for better 3D perception performance. Spectrum is also an important dimension that can be extended. Instead of working at a single wavelength, multiple wavelengths can be multiplexed to enlarge the 3D perception throughput. The coherent light-based Aop3D may be generalized to a spatially incoherent implementation (51), facilitating Aop3D to work in sce-
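Returning to the road demo, the avoidance behavior reported there (turn when the estimated distance approaches 2 m, stop near 0.5 m in the parking area) can be summarized by a toy decision rule; the thresholds and action names below are illustrative and are not the controller actually deployed on the robot:

```python
def avoidance_action(distance_m: float) -> str:
    """Toy decision rule mirroring the reported robot behavior; not the real controller."""
    if distance_m <= 0.5:
        return "stop"          # parking area reached
    if distance_m <= 2.0:
        return "turn_right"    # glass obstacle detected ahead
    return "go_straight"

for d in (5.0, 1.8, 0.4):
    print(f"{d:.1f} m -> {avoidance_action(d)}")
```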
working wavelength of λ = 698 nm. Each phase modulation level is Δφ = 2π(n − 1)d/λ = π/4. For the phase plate fabrication, the phase modulation parameters of the encoding and decoding surfaces are trained as 3-bit discrete values from 0 to 7π/4.

Forward model

For an Aop3D with an encoding surface and a decoding surface, the forward model of the light propagation is modeled as a sequence of (i) the modulation of the incident plane wave by the encoding surface, (ii) the free-space propagation between the encoding surface and the object, (iii) the reflection of the object, (iv) the free-space propagation between the object and the decoding surface, (v) the modulation of the optical field by the decoding surface, and (vi) the free-space propagation between the decoding surface and the output plane.

Following the angular spectrum approach, the free-space propagation operator P_d can be formulated as

u(x, y, d) = P_d{u(x, y, 0)} = F^{-1}{F{u(x, y, 0)} · H(f_x, f_y, d)}   (1)

depth. For a depth d in the range [d_min, d_max], the ground truth output O^gt is specially designed as

O^gt_(i,j) = (d_max − d) / (d_max − d_min) if (i, j) ∈ ROI, and 0 otherwise   (5)

In the classification scheme, the location of the light gathering represents the depth information

O^gt_(i,j) = 1 if (i, j) is in the predefined classification region, and 0 otherwise   (6)

We focus on the central region of the output field O to realize the depth estimation, which is half (75% in the classification scheme) the width and height of the meta-pixel and is denoted as ROI2, leaving the remaining area for light propagation. The loss function is defined as
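For reference, a minimal NumPy sketch of the free-space propagation operator P_d of Eq. 1; the transfer function H(f_x, f_y, d) is assumed here to take the standard angular-spectrum form exp(i2πd(1/λ² − f_x² − f_y²)^(1/2)) with evanescent components discarded, which may differ in detail from the exact implementation used in the paper:

```python
import numpy as np

def angular_spectrum_propagate(u0, d, wavelength, dx):
    """Free-space propagation u(x, y, d) = P_d{u(x, y, 0)} via FFT (Eq. 1)."""
    ny, nx = u0.shape
    fx = np.fft.fftfreq(nx, d=dx)
    fy = np.fft.fftfreq(ny, d=dx)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 / wavelength**2 - FX**2 - FY**2         # squared longitudinal frequency
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))    # evanescent components set to zero
    H = np.where(arg > 0, np.exp(1j * kz * d), 0.0)   # assumed transfer function H(fx, fy, d)
    return np.fft.ifft2(np.fft.fft2(u0) * H)

# Hypothetical example: a plane wave modulated by one 100 x 100 encoding meta-pixel
# (4.6 mm / 100 = 46-um atoms), propagated 0.5 m at the 698-nm working wavelength.
phase = np.random.uniform(0, 2 * np.pi, (100, 100))
u_object = angular_spectrum_propagate(np.exp(1j * phase), d=0.5, wavelength=698e-9, dx=46e-6)
print(np.abs(u_object).max())
```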
light intensity at different depths is fit linearly with least square errors and transformed to the depth.

Energy efficiency analysis

We model the camera noise in simulation to analyze the minimum power for Aop3D to maintain the 3D perception performance. The overall signal-to-noise ratio (SNR) of an sCMOS camera can be formulated as

SNR = 10 log[(QE · n)² / (N_sn² + N_rn² + N_dn²)]   (11)

where QE is the quantum efficiency and n is the photon number of the signal. N_sn² = QE · n, N_rn², and N_dn² represent the shot noise, readout noise, and dark noise, respectively. According to the features of the sensor (Andor Zyla 4.2) used in this work, QE = 0.723 at the wavelength of 698 nm, the root mean square value of the readout noise is 1.4 e−, and the dark noise is 0.1 e−/s per pixel, where e− represents the electron. The minimum exposure time is T = 10 μs in our experiments, and the illumination power can be formulated as
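A small numerical sketch of the SNR model in Eq. 11 with the sensor parameters listed above; the dark-noise variance is assumed here to be the Poissonian dark count accumulated during one exposure, which is negligible at T = 10 μs:

```python
import numpy as np

QE = 0.723        # quantum efficiency of the Andor Zyla 4.2 at 698 nm
N_RN = 1.4        # readout noise (e-, rms)
DARK_RATE = 0.1   # dark noise (e-/s per pixel)
T_EXP = 10e-6     # minimum exposure time used in the experiments (s)

def snr_db(n_photons):
    """Eq. 11: signal power (QE*n)^2 over the summed shot, readout, and dark noise variances."""
    n_sn2 = QE * n_photons       # shot-noise variance
    n_dn2 = DARK_RATE * T_EXP    # assumed dark-count variance within one exposure
    return 10 * np.log10((QE * n_photons) ** 2 / (n_sn2 + N_RN**2 + n_dn2))

# e.g. scan the photon number per pixel to see where the SNR crosses a chosen target
for n in (10, 50, 100, 500):
    print(f"n = {n:4d} photons -> SNR = {snr_db(n):5.1f} dB")
```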
9. F. Liu, C. Shen, G. Lin, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 5162–5170.
10. H. Fu, M. Gong, C. Wang, K. Batmanghelich, D. Tao, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2002–2011.
11. Y. Ming, X. Meng, C. Fan, H. Yu, Deep learning for monocular depth estimation: A review. Neurocomputing 438, 14–33 (2021).
12. J. Shan, C. K. Toth, Topographic Laser Ranging and Scanning: Principles and Processing (CRC Press, 2018).
13. P. F. McManamon, Review of ladar: A historic, yet emerging, sensor technology with rich phenomenology. Opt. Eng. 51, 060901 (2012).
14. B. Behroozpour, P. A. Sandborn, M. C. Wu, B. E. Boser, Lidar system architectures and circuits. IEEE Commun. Mag. 55, 135–142 (2017).
15. C. V. Poulton, A. Yaacobi, D. B. Cole, M. J. Byrd, M. Raval, D. Vermeulen, M. R. Watts, Coherent solid-state LIDAR with silicon photonic optical phased arrays. Opt. Lett. 42, 4091–4094 (2017).
16. C. Rogers, A. Y. Piggott, D. J. Thomson, R. F. Wiser, I. E. Opris, S. A. Fortune, A. J. Compston, A. Gondarenko, F. Meng, X. Chen, A universal 3D imaging sensor on a silicon photonics platform. Nature 590, 256–261 (2021).
17. R. Chen, H. Shu, B. Shen, L. Chang, W. Xie, W. Liao, Z. Tao, J. E. Bowers, X. Wang, Breaking the temporal and frequency congestion of LiDAR by parallel chaos. Nat. Photon. 17, 306–314 (2023).
18. J. Riemensberger, A. Lukashchuk, M. Karpov, W. Weng, E. Lucas, J. Liu, T. J. Kippenberg, Massively parallel coherent laser ranging using a soliton microcomb. Nature 581, 164–170 (2020).
38. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).
39. J. Geng, Structured-light 3D surface imaging: A tutorial. Adv. Opt. Photonics 3, 128–160 (2011).
40. Y. Shechtman, S. J. Sahl, A. S. Backer, W. E. Moerner, Optimal point spread function design for 3D imaging. Phys. Rev. Lett. 113, 133902 (2014).
41. Z. Shen, F. Zhao, C. Jin, S. Wang, L. Cao, Y. Yang, Monocular metasurface camera for passive single-shot 4D imaging. Nat. Commun. 14, 1035 (2023).
42. O. Ronneberger, P. Fischer, T. Brox, in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18 (Springer, 2015), pp. 234–241.
43. Z. Zhang, F. Feng, J. Gan, W. Lin, G. Chen, M. G. Somekh, X. Yuan, Space-time projection enabled ultrafast all-optical diffractive neural network. Laser Photon. Rev., 2301367 (2024). doi:10.1002/lpor.202301367.
44. J. Li, D. Mengu, N. T. Yardimci, Y. Luo, X. Li, M. Veli, Y. Rivenson, M. Jarrahi, A. Ozcan, Spectrally encoded single-pixel machine vision using diffractive networks. Sci. Adv. 7, eabd7690 (2021).
45. X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, R. W. Lau, in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2019), pp. 8809–8818.
46. H. Mei, X. Yang, Y. Wang, Y. Liu, S. He, Q. Zhang, X. Wei, R. W. Lau, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2020), pp. 3687–3696.
47. C. Vieu, F. Carcenac, A. Pépin, Y. Chen, M. Mejias, A. Lebib, L. Manin-Ferlazzo, L. Couraud, H. Launois, Electron beam lithography: Resolution limits and applications. Appl. Surf. Sci. 164, 111–117 (2000).
48. M. K. Chen, X. Liu, Y. Wu, J. Zhang, J. Yuan, Z. Zhang, D. P. Tsai, A meta-device for intelligent
49. X. Zhou, Y. Hou, J. Lin, A review on the processing accuracy of two-photon polymerization. AIP Adv. 5, 030701 (2015).
50. Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y.-C. Chen, P. Chen, G.-B. Jo, J. Liu, S. Du, All-optical neural network with nonlinear activation functions. Optica 6, 1132–1137 (2019).
51. M. S. S. Rahman, X. Yang, J. Li, B. Bai, A. Ozcan, Universal linear intensity transformations using spatially incoherent diffractive processors. Light: Sci. Appl. 12, 195 (2023).

Acknowledgments: We thank S. Li for assistance with the experiments. Funding: This work is supported in part by the National Science and Technology Major Project under contract no. 2021ZD0109901, in part by the Natural Science Foundation of China (NSFC) under contract nos. 62125106 and 62088102, in part by the Tsinghua-Zhijiang joint research center (L.F. is the recipient), in part by the China Association for Science and Technology (CAST) under contract no. 2023QNRC001, and in part by the China Postdoctoral Science Foundation under contract no. GZB20230372. Author contributions: Q.D., L.F., and R.H. initiated and supervised the project. T.Z., T.Y., and L.F. conceived the idea and developed the methods. T.Y., Y.G., T.Z., Y.Z., and G.S. conducted the simulations and experiments. T.Y., T.Z., and L.F. analyzed the results and prepared the manuscript. All authors discussed the research. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.

Submitted 28 November 2023
Accepted 3 June 2024
Published 5 July 2024