
APPLIED SCIENCES AND ENGINEERING

Nanowatt all-optical 3D perception for mobile robotics

Tao Yan1†, Tiankuang Zhou2,3†, Yanchen Guo1,4†, Yun Zhao1,4, Guocheng Shao1,4, Jiamin Wu1,3,5, Ruqi Huang4*, Qionghai Dai1,3,5*, Lu Fang2,3,5*

1Department of Automation, Tsinghua University, Beijing 100084, China. 2Department of Electronic Engineering, Tsinghua University, Beijing 100084, China. 3Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China. 4Shenzhen International Graduate School, Tsinghua University, Shenzhen 518071, China. 5Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing 100084, China.
*Corresponding author. Email: ruqihuang@sz.tsinghua.edu.cn (R.H.); qhdai@tsinghua.edu.cn (Q.D.); fanglu@tsinghua.edu.cn (L.F.)
†These authors contributed equally to this work.

Copyright © 2024 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution NonCommercial License 4.0 (CC BY-NC).

Three-dimensional (3D) perception is vital to drive mobile robotics' progress toward intelligence. However, state-of-the-art 3D perception solutions require complicated postprocessing or point-by-point scanning, suffering computational burden, latency of tens of milliseconds, and additional power consumption. Here, we propose a parallel all-optical computational chipset 3D perception architecture (Aop3D) with nanowatt power and light speed. The 3D perception is executed during the light propagation over the passive chipset, and the captured light intensity distribution provides a direct reflection of the depth map, eliminating the need for extensive postprocessing. The prototype system of Aop3D is tested in various scenarios and deployed to a mobile robot, demonstrating unprecedented performance in distance detection and obstacle avoidance. Moreover, Aop3D works at a frame rate of 600 hertz and a power consumption of 33.3 nanowatts per meta-pixel experimentally. Our work is promising toward next-generation direct 3D perception techniques with light speed and high energy efficiency.

INTRODUCTION

Three-dimensional (3D) perception provides a superior understanding of the physical world for mobile robotics, enhancing their intelligence in dealing with complex scenarios, such as obstacle avoidance, place recognition, and autonomous navigation (1-3). Mobile robotics demand low-latency and high-efficiency 3D perception to reduce the reaction time and power consumption. However, existing 3D perception methods face notable challenges in acquiring 3D images directly with optical systems, because the depth information inherently contained in the propagation of the light field (4) is lost after being captured by traditional 2D cameras. For instance, stereo vision methods (5-8) estimate the 3D image from the disparities of multiview 2D images using feature matching, computational stereo, deep learning algorithms, etc. However, complicated postprocessing imposes a heavy computational load on the electronic hardware of mobile robotics. It limits the speed and energy efficiency of 3D perception, especially with the increasing model complexity in artificial neural network-based methods (9-11).

Instead of relying entirely on postprocessing, intervening in the optical process of 3D perception can reduce the computation burden of electronic hardware. For example, light detection and ranging (LIDAR) (12-18) schemes mainly exploit time of flight or frequency-modulated continuous wave and achieve 3D perception without complicated reconstruction algorithms. However, sophisticated scanning and precise time or frequency measurement are indispensable, leading to a bottleneck in parallel detection. In general, no current 3D perception technology has achieved the same ease of applicability as 2D imaging sensors in robotics. With the deepening exploration of the physical nature of 3D perception, we aspire to obtain depth images directly through the optical imaging device in the same way as simple 2D imaging, i.e., recovering depth information all optically without postprocessing.

Meanwhile, optical computing (19, 20) techniques have aroused flourishing research interest in recent years (21-37), benefiting from energy efficiency and speed advantages. For example, diffractive surfaces (26, 29, 31) can be tailored to engineer the light transmission from an input field of view (FOV) to an output FOV. With feasible manipulation of light propagation and utilization of its physical properties (36, 37), all-optical linear transformation (26), image classification (29), pulse shaping (32), wave sensing (33), and logic operations (34) have been realized by diffractive surfaces. Taking advantage of optical computing to facilitate 3D perception will permit unprecedented possibilities in high-speed and low-power application scenarios for robotics, including ultrafast obstacle avoidance in autonomous navigation, low-power unmanned systems, energy-saving smart factories, etc.

Here, we propose Aop3D, a parallel all-optical computational chipset architecture that enables light-speed and postprocessing-free 3D perception with chip-scale diffractive surfaces (Fig. 1B). Aop3D deeply intervenes in the optical process of 3D perception and performs depth-intensity mapping after deep learning design. Specifically, Aop3D works by encoding the depth information into spatial modes of a customized structured light and decoding it into the output light intensity with optimized optical interconnections enabled by the diffractive chipset. Instead of using reconstruction algorithms and time or frequency analysis, the 3D map is instantly recorded with a 2D sensor, and the captured light intensity straightforwardly represents the depth information, eliminating additional time for postprocessing. Besides, we develop a personalized meta-pixel model to manipulate the light propagation independently in each meta-pixel and achieve arbitrary-shape 3D perception. The encoding and decoding surfaces of the chipset are co-optimized through deep learning (38) for the best interrogation of the scenario, fabricated by simple 3D printing (35) or lithography, and assembled to form a 3D imager. Aop3D can be flexibly designed for different depth ranges. The chipset can be horizontally cascaded for a larger FOV and vertically stacked for better performance. On the basis of this scalable architecture, we demonstrate the capabilities of Aop3D for high-resolution 3D perception of object-aware scenarios, 3D perception of arbitrary-shape objects, and robust 3D perception under different illumination. We also deploy Aop3D to a mobile robot to conduct an all-optical obstacle avoidance task in a real-world scenario and achieve unprecedented performance. By accomplishing the 3D perception with the passive diffractive chipset and consuming no other energy except light power, Aop3D reduces the computation payload and provides a light-speed and power-efficient solution for depth perception. In the experiments, Aop3D works at a speed of 600 Hz (limited by the camera frame rate) and a power consumption of 33.3 nW per meta-pixel, which can be further improved to 76 kHz and 3 nW per meta-pixel theoretically, showing unprecedented potential for next-generation direct 3D perception. The all-optical computational 3D perception insight can be extended to metasurfaces and other light-matter interaction mechanisms.


Fig. 1. All-optical computational 3D perception (Aop3D) for mobile robotics. (A) Comparisons between Aop3D and conventional 3D perception methods. Aop3D enables light-speed, low-power, and postprocessing-free 3D perception, while existing techniques, including stereo vision cameras and LIDAR, require complicated postprocessing, suffering computational burden, latency of tens of milliseconds, and additional power consumption. (B) The schematic illustration of Aop3D. The depth information is encoded into spatial modes of a customized structured light and decoded into the output light intensity with optimized optical interconnections by chip-scale diffractive surfaces. The light intensity in the output ROI decreases linearly as the depth increases. A meta-pixel scheme is developed to achieve arbitrary-shape 3D perception. (C) Aop3D is capable of high-resolution and illumination-robust 3D perception, providing unprecedented possibilities for mobile robotics, including all-optical obstacle avoidance, low-power unmanned systems, energy-saving smart factories, etc.

RESULTS

Principle of Aop3D
To facilitate all-optical 3D perception, we aim to encode the 3D information into a depth-varying structured light (39) illumination pattern and decode the depth information in the reflected pattern into the acquired light intensity. Specifically, as depicted in Fig. 1B, the incident plane wave collimated from a coherent light source is phase-modulated with an encoding surface to render the illumination pattern depth dependent. The illumination beam is reflected from the objects with phase and amplitude responsivity at varying depths. To establish a linear mapping between depth and acquired intensity of the region of interest (ROI) on the output plane, a decoding surface is devised to optimize the diffractive interconnection between the reflected pattern and the ROI. Last, the 3D image is captured directly with a 2D camera sensor, and the intensity diminishes linearly as the depth increases. The encoding and decoding surfaces of the chipset are co-optimized with deep learning algorithms to maximize the discerning capability by minimizing a training loss function for linear mapping between intensity and depth. The phase modulation masks can be implemented with spatial light modulators (SLMs) for reconfigurable and flexible design (see Materials and Methods and fig. S1) or nanofabricated phase plates for compact system volume, passive computation, and higher energy efficiency. Instead of developing a depth-sensitive point spread function and recovering 3D information with electronic reconstruction algorithms (40, 41), Aop3D is trained to learn the depth information carried by the characteristic optical modes all-optically and acts as a light-speed intelligent 3D imager.

For 3D perception of object-aware scenarios, we construct a dataset with specific objects placed at different depths to directly train the encoding and decoding surfaces (see Materials and Methods). Moreover, we develop a meta-pixel scheme for general scenarios with arbitrary-shape objects (Fig. 1B). Each meta-pixel, comprising many phase modulation atoms, is treated as a whole perception unit. An encoding meta-pixel and a decoding one are optimized to confine the structured light to propagate within their region and simultaneously perform the depth mapping for a pixel of the target scenario (see Materials and Methods). The propagation region constraints prevent optical interference among different areas. Therefore, 3D perception of pixelated objects with arbitrary shapes can be accomplished by replicating the encoding and decoding meta-pixels. For instance, Fig. 1B illustrates an Aop3D with 9 × 9 meta-pixels for both encoding and decoding surfaces, which supports 3D perception for arbitrary pixelated objects with 9 × 9 pixels in a depth range of 0.5 to 6 m. Each meta-pixel is 4.6 mm by 4.6 mm and contains 100 × 100 phase modulation atoms with a pixel size of 46 μm by 46 μm. Each meta-pixel can be enlarged to support better depth perception performance or shrunk to permit higher image resolution in the same FOV. Besides, the independent working mechanism of meta-pixels permits 3D perception of more complicated scenarios. For example, even if objects are overlapped, the object in front and the unobstructed parts of the objects behind can be successfully perceived.

High-resolution 3D perception for object-aware scenarios
We first examine the 3D perception performance of Aop3D in simple object-aware scenarios, where specific objects with known shapes are placed at different depths. The encoding and decoding surfaces are trained to learn depth-intensity mapping and obtain a sharp and accurate depth image with high resolution. The phase modulation atoms in the FOV are trained jointly.

We commence with the 3D perception of a single object, such as a character "H" plated on glass with chrome (Fig. 2A). To verify the effectiveness of the encoding-decoding design, we compare the results of a two-layer Aop3D with an encoding surface to a one-layer Aop3D without it, i.e., the plane wave instead of depth-varying structured light is used to probe the object. The objects are placed at depths from 10 to 30 cm. Moreover, the Aop3D is trained to image the character shape and estimate the depth with the average light intensity. Figure 2A depicts the output depth images with a FOV of 7.36 mm by 7.36 mm and an image resolution of 800 × 800 pixels. The simulation results show that the two-layer Aop3D preserves the shapes and edges, and the intensity is linearly related to the depth. However, without the encoding surface, the depth images are not smooth, especially when the object is far away, and the intensity does not change monotonically with the increasing distance (Fig. 2B). For the quantitative evaluation, we calculate the root mean square error (RMSE) of the estimated depth with respect to the ground truth. The depth RMSE is 0.069 cm for the two-layer Aop3D and 2.411 cm without the encoding surface, i.e., 0.35 and 12.06% of the depth range, respectively. We conduct experiments for the two-layer Aop3D and implement the diffractive surfaces with two SLMs for proof of principle (see Materials and Methods and fig. S1). The experimental results match the simulations as expected, and a depth RMSE of 1.233 cm (6.17% of the depth range) is achieved. We also design an Aop3D for 3D perception of a character "T" and obtain similar results (see fig. S2). The discrepancy between simulation and experiments is caused by system errors, including the nonideal plane incident light, the phase modulation imprecisions of the diffractive surfaces, and the alignment errors between the encoding and decoding surfaces.

Different classes of objects can be processed with the same Aop3D by simply enriching the training dataset. To demonstrate this, we train a pair of encoding and decoding surfaces with characters "T", "H", and "U" situated at depths from 10 to 30 cm and show the 3D perception results in Fig. 2 (C and D) and fig. S3. Almost identical depth estimation curves are achieved for the three objects. The overall depth RMSE is 0.198 cm in the simulations and 1.326 cm in the experiments, i.e., 0.99 and 6.63% of the depth range, respectively. The performance is comparable to the single-object setup. When the depth increases and the average output intensity gets darker, the unevenness of the output becomes visually noticeable. Besides, the system errors make the phenomenon more obvious in the experimental results. The problem can be alleviated with more layers of diffractive surfaces, i.e., the light intensity of the pixels can be more clustered around the average value, and the pixel-wise depth estimation RMSE can be markedly reduced (see Supplementary Text and fig. S4).

To achieve 3D perception of multiple objects simultaneously, we augment the capabilities of Aop3D with spatial multiplexing (see Supplementary Text and fig. S5) and demonstrate the ability of Aop3D for multiple complicated objects in traffic scenarios. The FOV of Aop3D is partitioned into subregions for 3D perception of each object located at the corresponding subregion. We enlarge the FOV to 10.12 mm by 10.12 mm with an image resolution of 1100 × 1100 pixels, fabricate toy images of two buildings and a truck, and arrange them from left to right in the FOV. The diffractive surfaces are also enlarged, with the number of phase modulation parameters the same as the number of image pixels. The depth range is set from 10 to 13 cm to explore the high-accuracy depth estimation capabilities of Aop3D. During the training and testing, the truck and buildings appear randomly at the corresponding subregions with different depths. As a result, sophisticated image features such as spires, arcs, and right angles smaller than 1 mm, as well as accurate depth, are perfectly reproduced in the Aop3D results (see Fig. 2, E and F, and fig. S6). The overall simulated and experimental depth RMSE is 0.032 and 0.123 cm, i.e., 1.07 and 4.09% of the depth range, respectively. Aop3D can also achieve good performance for the traffic toy objects placed at 10 to 30 cm (see fig. S7). The results reveal that Aop3D can independently detect the depth of multiple objects with satisfying performance based on spatial multiplexing and can be easily scaled up to a larger FOV by increasing the size of the diffractive surfaces. The millimeter-level spatial and depth resolution authenticates the generalization ability of Aop3D to complex scenarios.
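As a concrete illustration of how a depth value is read out from a captured frame in the regression scheme described above, the following is a minimal Python/NumPy sketch. It is not the authors' code; the function names, the ROI convention, and the linear calibration helper are assumptions made for illustration, following the depth-intensity mapping and the least-squares calibration described in Materials and Methods.

```python
import numpy as np

def calibrate(mean_intensities, known_depths, d_min, d_max):
    """Least-squares fit (slope a, offset b) mapping measured mean ROI
    intensity to the ideal normalized intensity implied by known depths."""
    ideal = 1.0 - (np.asarray(known_depths) - d_min) / (d_max - d_min)
    a, b = np.polyfit(np.asarray(mean_intensities), ideal, 1)
    return a, b

def depth_from_frame(frame, roi, d_min, d_max, a=1.0, b=0.0):
    """Regression readout: mean intensity inside the output ROI -> depth.

    frame : 2D array of captured intensities.
    roi   : (row0, row1, col0, col1) bounds of the output ROI.
    a, b  : calibration coefficients returned by calibrate().
    """
    r0, r1, c0, c1 = roi
    i_norm = np.clip(a * frame[r0:r1, c0:c1].mean() + b, 0.0, 1.0)
    # Trained convention: intensity falls linearly from 1 (d_min) to 0 (d_max).
    return d_min + (1.0 - i_norm) * (d_max - d_min)
```

For a meta-pixel array, the same readout is applied to each meta-pixel's ROI independently, giving one depth value per pixel of the output depth map.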


Fig. 2. High-resolution 3D perception for object-aware scenarios. (A) 3D perception for a single object. Depth images of a character "H" obtained from a two-layer Aop3D are compared to a one-layer Aop3D without the encoding surface. The FOV is 7.36 mm by 7.36 mm and consists of 800 × 800 pixels. (B) Depth prediction results in (A) versus the ground truth (GT). The solid lines are simulation results, and the hollow dots indicate experimental results. The subgraph illustrates the simulation and experiment depth RMSE of the two-layer Aop3D and the simulation RMSE of the one-layer Aop3D. (C) 3D perception for different classes of objects, i.e., characters "T", "H", and "U", with the same Aop3D. (D) Depth prediction results in (C). (E) Simultaneous 3D perception for multiple complicated objects, i.e., toy buildings and a truck, with millimeter-level spatial and depth resolution via spatial multiplexing. (F) Depth prediction results in (E). (G and H) Phase modulation masks of the one-layer and the two-layer Aop3D in (A), respectively. (I and J) Phase modulation masks of Aop3D in (C) and (E), respectively.


Arbitrary-shape 3D perception with the meta-pixel scheme
Aop3D is designed with the meta-pixel scheme, instead of global training of the diffractive surfaces, for general scenarios with arbitrary-shape objects. The key point of the meta-pixel scheme is to train a pair of encoding and decoding meta-pixels, which learn the depth-intensity mapping and simultaneously confine the structured light propagation within their region. A mean square error (MSE) loss function is used to optimize the linear relationship between the average light intensity of the output ROI and the depth, and a customized loss function is designed to maximize the light power within the region during propagation (see Materials and Methods). After the encoding and the decoding meta-pixel are trained, they can be replicated and concatenated for arbitrary-shape 3D perception. Figure 3 demonstrates 3D perception of a six-step stair with arbitrary depths from 0.1 to 2 m using 6 meta-pixels. Each meta-pixel is 3.68 mm by 3.68 mm and comprises 100 × 100 phase modulation atoms (Fig. 3B). As depicted in Fig. 3A, the structured light from an encoding meta-pixel is constrained within the region of the meta-pixel (the blue dashed box) and focused to four corners with increasing depth. Subsequently, a decoding meta-pixel learns the optical modes of the reflected structured light and retrieves the depth of this step with the light intensity of the central region of the output plane (the red dashed box). The output of the decoding meta-pixel is also constrained in the region of the meta-pixel to prevent cross-talk between adjacent meta-pixels. Figure 3C illustrates the simulation results of a single meta-pixel, where more than 80% of the light power is constrained in the region during the propagation, and the predicted depth matches the ground truth perfectly. After the optimization of single encoding and decoding meta-pixels, they are replicated by 2 × 3 and spliced together to perform 3D perception of the six-step stair. During the testing, the depths of the six steps are randomly and independently selected, and the simulation results are highly consistent with those of a single meta-pixel (Fig. 3D). Benefiting from the little cross-talk, the performance variation of the 6 meta-pixels is barely noticeable. To simplify experiments, we conduct 3D perception of a mirror and sample the results of each meta-pixel at different depths (see fig. S8). The overall depth RMSE is 3.913 cm in the simulations and 8.334 cm in the experiments, i.e., 2.06 and 4.39% of the depth range, respectively. Additionally, the FOV can be enlarged with 4f imaging systems (see Supplementary Text and fig. S9). A larger depth range and a higher depth accuracy can be achieved by optimizing the pixel size and the number of phase modulation atoms in each meta-pixel (see Supplementary Text and figs. S10 and S11) and by stacking more layers of encoding and decoding surfaces (see Supplementary Text and fig. S12), confirming the success of long-range 3D perception for arbitrary-shape scenarios.

To verify the superiority of the all-optical approach, we compare the depth perception performance of Aop3D with an electronic neural network, U-net (42), a well-known convolutional neural network architecture for computer vision tasks (see fig. S13). The diffraction images of the six-step stair are used to train the U-net to learn the depth information. However, the diffraction patterns of the six steps interfere with each other, and each combination of depths produces different diffraction images, causing the size of the training set to increase exponentially with the number of steps. For example, if we evenly select 10 points from a depth range of 0.1 to 2 m for each step, the training set already consists of a million diffraction images, not to mention higher depth sampling densities and more steps. The enormous training set makes it extremely difficult for the U-net to learn every depth combination. Therefore, the performance of the U-net fluctuates greatly, especially when the depths of the six steps are similar and the interference is severe (Fig. 3E). The overall depth RMSE of the U-net is 27.3 cm (14.37% of the depth range), much worse than Aop3D, which overcomes the problem by directly manipulating the light propagation.

Aop3D works at the speed of light theoretically because the 3D perception is completed during the flight of photons. In the experiments, the speed is restricted by the camera frame rate. We demonstrate a high-speed (600 Hz) 3D perception scenario by approaching the limit of camera speed. We design an Aop3D with 3 × 3 meta-pixels for the 3D perception of an object moving at 5 cm/s between 25 and 60 cm (see Fig. 4, A to C, fig. S14, and movie S1). Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. The acquisition area of the camera is set to the region of the output ROI of a meta-pixel (the red box in Fig. 4A), and a frame rate of 600 Hz is achieved, a 20-fold improvement compared to ordinary 3D cameras (30 Hz). The estimated depth increases linearly and steadily as the object moves and matches perfectly with the ground truth. The overall depth RMSE is 0.358 cm in the simulations and 1.327 cm in the experiments, i.e., 1.02 and 3.79% of the depth range, respectively. With its good performance in high-speed 3D perception, Aop3D can be a promising tool for 3D perception of ultrafast dynamics.

Benefiting from the flexibility and scalability of Aop3D, arbitrary-shape 3D perception with high image resolution can be realized by simply replicating the meta-pixels and enlarging the diffractive surfaces (Fig. 4D). We change each meta-pixel to 1.104 mm by 1.104 mm with 120 × 120 phase modulation atoms, train it within a depth range from 20 to 50 cm (see fig. S15), and replicate it by 9 × 9. Figure 4E demonstrates depth images of simple characters, digits, graphics, and Chinese characters with 9 × 9 pixels.

Considering the effects of different object reflectivity and unstable illumination power, we develop a classification meta-pixel scheme as a complement to the regression scheme (Fig. 4, F to H). Specifically, the target depth range is divided into 10 intervals. The meta-pixels are trained to map depth intervals into 10 predefined regions on the output plane (the red dashed box in Fig. 4G). The region with the maximum light intensity is determined as the classification result and represents the quantized depth, i.e., the median of the corresponding depth interval. We train a meta-pixel of 3.68 mm by 3.68 mm with 400 × 400 phase modulation atoms. The light power is also well constrained in the meta-pixel region within a depth range of 15 to 55 cm, and the classification results are almost consistent with the quantized depth (see fig. S16). We adopt 4 meta-pixels for depth perception (Fig. 4F) and illustrate simulation and experiment results in Fig. 4G. The object depth is successfully classified according to the corresponding depth interval. The overall simulated depth RMSE is 1.207 cm (3.02% of the depth range), extremely close to the inherent quantization error of 1.186 cm (2.97% of the depth range), and the overall experimental depth RMSE is 2.049 cm (5.12% of the depth range). The light power is 196.2 μW in this experiment. We further reduce the light power to 92.6 and 23.8 μW for comparison, and the experimental depth RMSE is 2.145 and 1.949 cm, respectively, i.e., the performance is barely affected (see fig. S17). Instead of indicating depth with light intensity, the classification scheme enables robust 3D perception performance under different illumination and reflectivity conditions. The depth classification scheme of spatial intensity distribution may challenge the 2D sensor resolution limits, which can be addressed by projecting the spatial intensity distribution into time-domain intensity variation (43) or by encoding the depth into an optical power spectrum classified with a single-pixel spectroscopic detector (44).
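For the classification scheme just described, the readout reduces to picking the brightest of the 10 predefined output regions. The sketch below is illustrative only and not from the authors' implementation; the region list, the bin count, and the 15-to-55 cm default range mirror the example in the text but are otherwise assumed conventions.

```python
import numpy as np

def classify_depth(frame, regions, d_min=0.15, d_max=0.55):
    """Classification readout: the brightest predefined output region wins.

    frame   : 2D array of captured intensities for one meta-pixel.
    regions : list of (row0, row1, col0, col1) bounds, one per depth interval.
    Returns the quantized depth, i.e. the midpoint of the winning interval.
    """
    powers = [frame[r0:r1, c0:c1].sum() for (r0, r1, c0, c1) in regions]
    k = int(np.argmax(powers))
    edges = np.linspace(d_min, d_max, len(regions) + 1)
    return 0.5 * (edges[k] + edges[k + 1])
```

Because only the argmax over regions matters, a global scaling of the illumination power or of the object reflectivity leaves the decision unchanged, which is the source of the robustness discussed above.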


Fig. 3. Arbitrary-shape 3D perception with the meta-pixel scheme. (A) The structured light patterns and output results of a single meta-pixel. The light propagation is constrained within the region of the meta-pixel (the blue dashed box). The depth is represented by the average intensity in the ROI (the red box) of the output plane. (B) The encoding and decoding phase modulation masks containing 6 meta-pixels. Each meta-pixel is 3.68 mm by 3.68 mm with 100 × 100 phase modulation atoms. (C) The light power proportion constrained in the meta-pixel region and the depth estimation performance of a single meta-pixel. (D) 3D perception performance for the six-step stair with Aop3D and the comparison to an electronic neural network (U-net). The solid lines represent the mean value, and the shaded area represents the SD. The box plots represent the experimental results. (E) 3D visualization of the retrieved stairs with different shapes. The U-net results are disturbed by cross-talk when the depths of the six steps are similar, but Aop3D overcomes the problem by directly manipulating the light propagation.


Fig. 4. High-speed, high-resolution, and illumination-robust 3D perception for arbitrary-shape objects. (A) An optimized meta-pixel for high-speed 3D perception, which is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. (B) The output of Aop3D for depth estimation of a moving object at the frame rate of 600 Hz. The estimated depth is 31.5896 cm at t = 1.1333 s, represented by the average intensity in the red box, corresponding to the ground truth of 30.6665 cm. (C) Six-hundred-hertz depth estimation results of the moving object. The solid lines represent the mean value, and the shaded area represents the SD. (D) Phase modulation masks containing 9 × 9 meta-pixels for high-resolution 3D perception. Each meta-pixel is 1.104 mm by 1.104 mm with 120 × 120 phase modulation atoms. (E) Depth images of characters, digits, graphics, and Chinese characters with 9 × 9 pixels. (F) Phase modulation masks of 2 × 2 meta-pixels for classification-based 3D perception. Each meta-pixel is 3.68 mm by 3.68 mm with 400 × 400 phase modulation atoms. (G) Depth classification results. A depth range of 15 to 55 cm is divided into 10 intervals, and the light intensity of 10 predefined regions on the output plane indicates the classification results. The classification scheme enables robust 3D perception under different illumination conditions and object reflectivity. (H) 3D perception performance of 4 meta-pixels in (F). The solid lines represent simulation results, and the hollow dots represent experimental results.

Aop3D for all-optical obstacle avoidance
Autonomous navigation of mobile robotics is one of the most essential application scenarios of 3D perception technology. However, the processing speeds of previous 3D perception approaches are insufficient for obstacle avoidance to ensure safety, especially in high-speed scenarios (1). Here, we demonstrate light-speed 3D sensing of Aop3D for mobile robotics in the real world. Specifically, we deploy a prototype system of Aop3D to a mobile robot to support light-speed distance detection and obstacle avoidance (Fig. 5A). We design the Aop3D with 9 meta-pixels for a depth range from 0.5 to 6 m. Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. To further reduce the system size and enhance the energy efficiency, we fabricate the encoding and decoding surfaces with chip-scale phase plates for passive computing (see Fig. 5, A and B, and figs. S18 and S19). In this implementation, Aop3D consumes no other energy for 3D perception except the laser power, which is more energy efficient than 3D sensors requiring postprocessing. Before the obstacle avoidance task, we calibrate the mapping relationship between the output light intensity of Aop3D and the depth. We obtain a simulated depth RMSE of 0.056 m and an experimental RMSE of 0.301 m, i.e., 1.02 and 5.47% of the depth range, respectively (Fig. 5C).
Fig. 5. Aop3D for all-optical obstacle avoidance of a mobile robot. (A) A compact Aop3D implemented with a chipset of phase plates is mounted on a mobile robot to perform light-speed depth estimation. (B) Phase modulation masks (first column), photographs (second column), and scanning electron microscopy images (third column) of the encoding and decoding surfaces containing 3 × 3 meta-pixels in (A). Each meta-pixel is 4.6 mm by 4.6 mm with 100 × 100 phase modulation atoms. (C) Characterization of Aop3D for depth estimation from 0.5 to 6 m. The solid lines represent the mean value, and the shaded area represents the SD of the simulation results. The box plots represent the experimental results. (D) A conceptual illustration of the obstacle avoidance scenario. The mobile robot needs to avoid the glass obstacles and stop safely in the parking area. (E) Depth estimation performance of Aop3D with comparisons to a Kinect and a LIDAR (Slamtec RPLIDAR A2) in the obstacle avoidance task. All the glass is detected with Aop3D at different positions on the road, but other 3D sensors wrongly return the distance of the wall.

We construct a city road scenario to demonstrate the obstacle avoidance task (Fig. 5D). Several pieces of transparent glass are placed on the road as obstacles, which is a widespread yet especially challenging ranging target (45, 46). The performance of Aop3D is compared with state-of-the-art 3D sensors, including a stereo camera (Intel RealSense D415), an Azure Kinect, and a LIDAR (Slamtec RPLIDAR A2). The robot is expected to avoid obstacles when the distance crosses a threshold and stop safely in the parking area. As shown in Fig. 5E and movie S2, the robot successfully detects all the glass with Aop3D and obtains the correct distance. It turns right when the distance is close to 2 m and finally stops in the parking area when the distance is close to 0.5 m. However, the Kinect and the LIDAR cannot detect the glass and wrongly return the distance of the wall, which is always larger than the distance of the glass, causing a risk of collision. The stereo camera is also not competent for this task (see fig. S20). The real-world demo shows that Aop3D has notable advantages over state-of-the-art approaches, especially for the common yet challenging detection of transparent building surfaces.

DISCUSSION

In summary, we take a step to present an all-optical computational approach (Aop3D) to achieve light-speed and low-power 3D perception for mobile robotics. Unlike conventional 3D perception methods that rely on postprocessing or scanning, Aop3D extracts 3D information during the light propagation and outputs depth images directly, greatly simplifying the complex process of 3D perception and reducing the computation burden of hardware. We have verified the functionalities in various 3D perception scenarios and a real-world obstacle avoidance task. The real-world demo shows the impressive advantages of Aop3D over state-of-the-art approaches for mobile robotics in avoiding transparent surfaces.

With the phase plate implementation, 3D perception is completed as the light diffracts through the passive components, providing unprecedented speed and energy-saving advantages. We have demonstrated 3D perception with a frame rate of 600 Hz, which is only limited by the camera frame rate and can be enhanced to 76 kHz using a high-speed camera (e.g., Phantom TMX 7510). Furthermore, we experimentally demonstrate that a light power of 300 nW for 9 meta-pixels, i.e., 33.3 nW per meta-pixel, is sufficient to maintain the depth estimation performance of Aop3D (see Supplementary Text and fig. S21). In contrast, microwatt-per-pixel-level power is required for stereo vision processing or LIDAR scanning in ordinary 3D perception devices, i.e., the energy efficiency is improved by more than 10 times. The advantages of Aop3D are still notable considering the power consumption of photodetectors (see Supplementary Text and table S1). We also investigate the camera noise and evaluate the performance of Aop3D under different light power (see Materials and Methods, Supplementary Text, and fig. S21). The light power can be further reduced to 3 nW per meta-pixel theoretically, i.e., 10 photons per pixel for the illumination of objects.

The proposed architecture performs all-optical 3D perception in a single snapshot for the entire scenario, i.e., the meta-pixels of Aop3D can be scaled for arbitrary FOV size and resolution without degrading the imaging speed, which avoids the exponential expansion issue in large-scale data processing for 3D reconstruction. Fortunately, through collaborative optimization of Aop3D and imaging systems, the proportional relationship between the area of the diffractive surfaces and the FOV can be designed flexibly, enabling 3D perception of a large FOV with a small chipset (see Supplementary Text and fig. S9). Although the prototype Aop3D demonstrated here contains 81 meta-pixels, more meta-pixels can be integrated into a compact system for higher image resolution by combining advanced nanofabrication techniques such as electron-beam lithography (47, 48) and two-photon polymerization (49). A larger pixel size and more modulation atoms in each meta-pixel permit better 3D perception performance but increase the area of a meta-pixel and may influence the spatial resolution, which can be addressed with highly integrated metasurfaces and imaging systems.

Moreover, increasing the number of diffractive surface layers and incorporating nonlinear materials (50) can further enhance the model learning capabilities for better 3D perception performance. Spectrum is also an important dimension that can be extended: instead of working at a single wavelength, multiple wavelengths can be multiplexed to enlarge the 3D perception throughput. The coherent-light-based Aop3D may be generalized to a spatially incoherent implementation (51), facilitating Aop3D to work in scenarios of natural light illumination, which would further reduce the power consumption of 3D perception. Besides, with the combination of incoherent diffractive processors and imaging systems, Aop3D can be extended to oblique objects with nonspecular-reflection surfaces. With flexible scalability and remarkable performance, our work will inspire the development of next-generation 3D perception techniques and bring more opportunities for mobile robotics.

MATERIALS AND METHODS

Experimental system
For reconfigurable phase modulation of the encoding and decoding surfaces, Aop3D is implemented with two SLMs (see fig. S1). The light beam is generated using a solid-state laser (MRL-FN-698, CNI) at a working wavelength of 698 nm. It is coupled with a single-mode fiber (P5-630A-PCAPC-1, Thorlabs), collimated with a lens (AC254-100-A, Thorlabs), and polarized with a linear polarizer (LPNIR100, Thorlabs). The first SLM (P1920-400-800, Meadowlark) modulates the phase of the wavefront according to the optimized encoding mask and generates the structured light, which is split with a beam splitter (CCM1-BS013, Thorlabs) to illuminate the objects. The reflected light from the objects is perceived with the second SLM (HSP1920-600-1300-HSP8, Meadowlark) for depth information decoding. Then, the depth images are captured with a scientific complementary metal-oxide semiconductor (sCMOS) sensor (Andor Zyla 4.2) after diffraction propagation of 12 cm.

For a compact system volume and passive computation, the SLMs are replaced by a chipset of phase plates (see fig. S19), and the phase modulation is completed when the light passes through the chipset. The high-speed (600 Hz) 3D perception, all-optical obstacle avoidance, and low-power (300 nW) 3D perception experiments are accomplished with phase plates, while the others are performed with SLMs.

Phase plate fabrication
The phase plates are fabricated on silica substrates. Eight-level steps with a height of d = 192 nm are realized by three iterations of photolithography. The refractive index of silica is n = 1.455 at the working wavelength of λ = 698 nm. Each phase modulation level is Δφ = 2π(n − 1)d/λ = π/4. For the phase plate fabrication, the phase modulation parameters of the encoding and decoding surfaces are trained as 3-bit discrete values from 0 to 7π/4.
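As a quick numerical check of the phase quantization stated above, the short sketch below recomputes the per-level phase step; only the quoted values n = 1.455, d = 192 nm, and λ = 698 nm are taken from the text, and the variable names are illustrative.

```python
import numpy as np

n, d, lam = 1.455, 192e-9, 698e-9            # silica index, step height, wavelength
delta_phi = 2 * np.pi * (n - 1) * d / lam    # phase added per fabrication level
print(delta_phi, np.pi / 4)                  # ~0.786 rad vs. pi/4 ~ 0.785 rad
levels = np.arange(8) * delta_phi            # 3-bit levels: 0, pi/4, ..., 7*pi/4
```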
Forward model
For an Aop3D with an encoding surface and a decoding surface, the forward model of the light propagation is modeled as a sequence of (i) the modulation of the incident plane wave by the encoding surface, (ii) the free-space propagation between the encoding surface and the object, (iii) the reflection off the object, (iv) the free-space propagation between the object and the decoding surface, (v) the modulation of the optical field by the decoding surface, and (vi) the free-space propagation between the decoding surface and the output plane.

Following the angular spectrum approach, the free-space propagation operator P_d can be formulated as

u(x, y, d) = P_d\{u(x, y, 0)\} = \mathcal{F}^{-1}\{\mathcal{F}\{u(x, y, 0)\} \cdot H(f_x, f_y, d)\}   (1)

where u(x, y, d) represents the optical field at the distance d, \mathcal{F} and \mathcal{F}^{-1} are the Fourier transform and the inverse Fourier transform, and H(f_x, f_y, d) is the transfer function

H(f_x, f_y, d) = \begin{cases} \exp\!\left[jkd\sqrt{1 - (\lambda f_x)^2 - (\lambda f_y)^2}\right], & f_x^2 + f_y^2 < \frac{1}{\lambda^2} \\ 0, & f_x^2 + f_y^2 \ge \frac{1}{\lambda^2} \end{cases}   (2)

where λ is the wavelength, k = 2π/λ, and f_x and f_y are the spatial frequencies. Therefore, the forward model can be formulated as

O = P_{d_3}\{t^{de} \cdot P_{d_2}\{R \cdot P_{d_1}\{t^{en} \cdot I\}\}\}   (3)

where O denotes the output optical field, I is the optical field of the incident light, t^{en} and t^{de} are the transmittance coefficients of the encoding and decoding surfaces, and R is the reflection coefficient of the object. d_1, d_2, and d_3 represent the distance between the encoding surface and the object, between the object and the decoding surface, and between the decoding surface and the output plane, respectively. Here, d_1 = d_2 is the depth of the object, d_3 = 12 cm, and only the phase modulation of the diffractive surfaces is considered.
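The angular spectrum propagation of Eqs. 1 and 2 and the cascaded forward model of Eq. 3 can be sketched in NumPy as follows. This is an illustrative reimplementation under stated assumptions, not the authors' TensorFlow code; the array shapes, sampling parameters, and function names are assumptions made for the sketch.

```python
import numpy as np

def propagate(u0, d, wavelength, pixel_size):
    """Angular-spectrum free-space propagation of a complex field u0 over a
    distance d (Eqs. 1 and 2). Evanescent components are set to zero."""
    ny, nx = u0.shape
    fx = np.fft.fftfreq(nx, pixel_size)
    fy = np.fft.fftfreq(ny, pixel_size)
    FX, FY = np.meshgrid(fx, fy)
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    k = 2 * np.pi / wavelength
    H = np.where(arg > 0,
                 np.exp(1j * k * d * np.sqrt(np.maximum(arg, 0.0))),
                 0.0)
    return np.fft.ifft2(np.fft.fft2(u0) * H)

def forward(incident, t_en, t_de, R, depth, d3, wavelength, pixel_size):
    """Eq. 3: encode -> propagate to the object -> reflect -> propagate back
    -> decode -> propagate to the output plane, with d1 = d2 = depth."""
    field = propagate(t_en * incident, depth, wavelength, pixel_size)
    field = propagate(R * field, depth, wavelength, pixel_size)
    return propagate(t_de * field, d3, wavelength, pixel_size)
```

The captured image is then the intensity |O|^2 of the returned field; in training, the phase-only transmittances t_en and t_de are the quantities being optimized.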
Training methods
The phase modulation parameters of the encoding and decoding surfaces are optimized to ensure that the light intensity of the output optical field O is linearly related to the depth. For object-aware scenarios, the MSE between |O|^2 and the ground truth depth image O^{gt} is used as the loss function

\mathrm{Loss}_1(O, O^{gt}) = \frac{1}{N^2}\sum_{i,j}\left(O^{gt}_{i,j} - a\,|O_{i,j}|^2\right)^2   (4)

where a is a trainable parameter to adjust the magnification of the light intensity, i and j are the pixel indices of the image, and N is the pixel number of a row or column of the image.

For arbitrary-shape 3D perception with the meta-pixel scheme, the target is to optimize a meta-pixel to realize the linear relationship between the average light intensity of the output ROI and the depth. For a depth d in the range [d_min, d_max], the ground truth output O^{gt} is specially designed as

O^{gt}_{i,j} = \begin{cases} \dfrac{d_{max} - d}{d_{max} - d_{min}}, & (i, j) \in \mathrm{ROI} \\ 0, & \text{otherwise} \end{cases}   (5)

In the classification scheme, the location of the light gathering represents the depth information

O^{gt}_{i,j} = \begin{cases} 1, & (i, j) \in \text{predefined classification region} \\ 0, & \text{otherwise} \end{cases}   (6)

We focus on the central region of the output field O to realize the depth estimation, which is half (75% in the classification scheme) the width and height of the meta-pixel and is noted as ROI2, leaving the remaining area for light propagation. The loss function is defined as

\mathrm{Loss}_{depth}(O, O^{gt}) = \frac{1}{N_{ROI2}^2}\sum_{(i,j)\in ROI2}\left(O^{gt}_{i,j} - a\,|O_{i,j}|^2\right)^2   (7)

where a is a trainable parameter to adjust the magnification of the light intensity, and N_{ROI2} is the pixel number of a row or column of ROI2.

To constrain the light propagation within the region of the meta-pixel, noted as ROI3, the loss function is defined as

\mathrm{Loss}_{power}(O) = -\log\frac{\sum_{(i,j)\in ROI3}|O_{i,j}|^2}{\sum_{i,j}|O_{i,j}|^2}   (8)

A two-stage scheme is adopted, i.e., O and the structured light SL are simultaneously constrained, where

SL = P_{d_1}\{t^{en} \cdot I\}   (9)

Therefore, the total loss function is

\mathrm{Loss}_2 = \lambda_1\,\mathrm{Loss}_{depth}(O, O^{gt}) + \lambda_2\,\mathrm{Loss}_{power}(O) + \lambda_3\,\mathrm{Loss}_{power}(SL)   (10)

The coefficients λ1, λ2, and λ3 in the loss function are empirically set to 10,000, 100, and 100, respectively.

The training process is numerically implemented with Python (v3.6.13) and TensorFlow (v1.11.0) running on a desktop computer (Nvidia TITAN XP GPU, AMD Ryzen Threadripper 2990WX CPU with 32 cores, 128 GB of RAM, and the Microsoft Windows 10 operating system). The phase modulation parameters of the diffractive surfaces are optimized via stochastic gradient descent and error backpropagation algorithms with the Adam optimizer and a learning rate of 0.01. The training time is ~2 hours with an epoch number of 100.

Intensity to depth mapping
According to the training methods, the normalized light intensity I ∈ [0, 1] is linearly mapped to a depth range [d_min, d_max], corresponding to a depth of d_min + (1 − I)(d_max − d_min). In the experiments, the captured light intensity is affected by laser power, background illumination noise, shot noise, readout noise, etc. Therefore, a calibration process is used for more accurate depth estimation, i.e., the light intensity at different depths is fit linearly with least square errors and transformed to the depth.
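To make the loss design above concrete, the terms of Eqs. 7, 8, and 10 can be written compactly as follows. This NumPy sketch only restates the math for clarity and is not the authors' implementation; the actual optimization runs in TensorFlow with the Adam optimizer, and the ROI bookkeeping shown here (tuples of array bounds) is an assumed convention.

```python
import numpy as np

def loss_depth(O, O_gt, a, roi2):
    """Eq. 7 (Eq. 4 has the same form over the full image): squared error
    between the target map and the scaled output intensity inside ROI2."""
    r0, r1, c0, c1 = roi2
    diff = O_gt[r0:r1, c0:c1] - a * np.abs(O[r0:r1, c0:c1]) ** 2
    n_roi2 = r1 - r0                      # pixels per row/column of a square ROI2
    return np.sum(diff ** 2) / n_roi2 ** 2

def loss_power(O, roi3):
    """Eq. 8: negative log of the fraction of total output power that stays
    inside the meta-pixel region ROI3."""
    r0, r1, c0, c1 = roi3
    inside = np.sum(np.abs(O[r0:r1, c0:c1]) ** 2)
    total = np.sum(np.abs(O) ** 2)
    return -np.log(inside / total)

def total_loss(O, SL, O_gt, a, roi2, roi3, lams=(1e4, 1e2, 1e2)):
    """Eq. 10: weighted sum over the output field O and the structured light SL."""
    l1, l2, l3 = lams
    return (l1 * loss_depth(O, O_gt, a, roi2)
            + l2 * loss_power(O, roi3)
            + l3 * loss_power(SL, roi3))
```

In an automatic-differentiation framework, the same expressions are evaluated on the simulated fields from the forward model so that gradients flow back to the phase masks and to the trainable scale a.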
Energy efficiency analysis
We model the camera noise in simulation to analyze the minimum power for Aop3D to maintain the 3D perception performance. The overall signal-to-noise ratio (SNR) of an sCMOS camera can be formulated as

\mathrm{SNR} = 10\log\frac{(QE \cdot n)^2}{N_{sn}^2 + N_{rn}^2 + N_{dn}^2}   (11)

where QE is the quantum efficiency and n is the photon number of the signal. N_{sn}^2 = QE \cdot n, N_{rn}^2, and N_{dn}^2 represent the shot noise, readout noise, and dark noise, respectively. According to the features of the sensor (Andor Zyla 4.2) used in this work, QE = 0.723 at the wavelength of 698 nm, the root mean square value of the readout noise is 1.4 e−, and the dark noise is 0.1 e−/s per pixel, where e− represents the electron. The minimum exposure time is T = 10 μs in our experiments, and the illumination power can be formulated as

P = \frac{n_i h\nu}{T}   (12)

where h is the Planck constant, ν is the photon frequency, and n_i is the photon number emitted from the light source within the exposure time. Under different illumination power, the photon number in each pixel of the Aop3D output ROI is calculated by the forward model of the Aop3D, termed the photon number of the signal n. Poisson and Gaussian noise are added to the signal to simulate the camera noise and evaluate the theoretical performance of Aop3D under different light power.

Supplementary Materials
This PDF file includes:
Supplementary Text
Figs. S1 to S21
Table S1
Legends for movies S1 and S2

Other Supplementary Material for this manuscript includes the following:
Movies S1 and S2

REFERENCES AND NOTES
1. D. Falanga, K. Kleber, D. Scaramuzza, Dynamic obstacle avoidance for quadrotors with event cameras. Sci. Robot. 5, eaaz9712 (2020).
2. F. Yu, Y. Wu, S. Ma, M. Xu, H. Li, H. Qu, C. Song, T. Wang, R. Zhao, L. Shi, Brain-inspired multimodal hybrid neural network for robot place recognition. Sci. Robot. 8, eabm6996 (2023).
3. N. Chen, F. Kong, W. Xu, Y. Cai, H. Li, D. He, Y. Qin, F. Zhang, A self-rotating, single-actuated UAV with extended sensor field of view for autonomous navigation. Sci. Robot. 8, eade4538 (2023).
4. M. Levoy, Light fields and computational imaging. Computer 39, 46–55 (2006).
5. J. Mayhew, H. Longuet-Higgins, A computational model of binocular depth perception. Nature 297, 376–378 (1982).
6. M. Poggi, F. Tosi, K. Batsos, P. Mordohai, S. Mattoccia, On the synergies between machine learning and binocular stereo for depth estimation from images: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 44, 5314–5334 (2021).
7. M. Z. Brown, D. Burschka, G. D. Hager, Advances in computational stereo. IEEE Trans. Pattern Anal. Mach. Intell. 25, 993–1008 (2003).
8. H. Laga, L. V. Jospin, F. Boussaid, M. Bennamoun, A survey on deep learning techniques for stereo-based depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1738–1764 (2022).
9. F. Liu, C. Shen, G. Lin, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2015), pp. 5162–5170.
10. H. Fu, M. Gong, C. Wang, K. Batmanghelich, D. Tao, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2018), pp. 2002–2011.
11. Y. Ming, X. Meng, C. Fan, H. Yu, Deep learning for monocular depth estimation: A review. Neurocomputing 438, 14–33 (2021).
12. J. Shan, C. K. Toth, Topographic Laser Ranging and Scanning: Principles and Processing (CRC Press, 2018).
13. P. F. McManamon, Review of ladar: A historic, yet emerging, sensor technology with rich phenomenology. Opt. Eng. 51, 060901 (2012).
14. B. Behroozpour, P. A. Sandborn, M. C. Wu, B. E. Boser, Lidar system architectures and circuits. IEEE Commun. Mag. 55, 135–142 (2017).
15. C. V. Poulton, A. Yaacobi, D. B. Cole, M. J. Byrd, M. Raval, D. Vermeulen, M. R. Watts, Coherent solid-state LIDAR with silicon photonic optical phased arrays. Opt. Lett. 42, 4091–4094 (2017).
16. C. Rogers, A. Y. Piggott, D. J. Thomson, R. F. Wiser, I. E. Opris, S. A. Fortune, A. J. Compston, A. Gondarenko, F. Meng, X. Chen, A universal 3D imaging sensor on a silicon photonics platform. Nature 590, 256–261 (2021).
17. R. Chen, H. Shu, B. Shen, L. Chang, W. Xie, W. Liao, Z. Tao, J. E. Bowers, X. Wang, Breaking the temporal and frequency congestion of LiDAR by parallel chaos. Nat. Photon. 17, 306–314 (2023).
18. J. Riemensberger, A. Lukashchuk, M. Karpov, W. Weng, E. Lucas, J. Liu, T. J. Kippenberg, Massively parallel coherent laser ranging using a soliton microcomb. Nature 581, 164–170 (2020).
19. G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. Miller, D. Psaltis, Inference in artificial intelligence with deep optics and photonics. Nature 588, 39–47 (2020).
20. B. J. Shastri, A. N. Tait, T. F. de Lima, W. H. Pernice, H. Bhaskaran, C. D. Wright, P. R. Prucnal, Photonics for artificial intelligence and neuromorphic computing. Nat. Photon. 15, 102–114 (2021).
21. X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, D. J. Moss, 11 TOPS photonic convolutional accelerator for optical neural networks. Nature 589, 44–51 (2021).
22. J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, H. Bhaskaran, Parallel convolutional processing using an integrated photonic tensor core. Nature 589, 52–58 (2021).
23. J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, W. H. Pernice, All-optical spiking neurosynaptic networks with self-learning capabilities. Nature 569, 208–214 (2019).
24. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Soljačić, Deep learning with coherent nanophotonic circuits. Nat. Photon. 11, 441–446 (2017).
25. F. Ashtiani, A. J. Geers, F. Aflatouni, An on-chip photonic deep neural network for image classification. Nature 606, 501–506 (2022).
26. O. Kulce, D. Mengu, Y. Rivenson, A. Ozcan, All-optical synthesis of an arbitrary linear transformation using diffractive surfaces. Light Sci. Appl. 10, 196 (2021).
27. T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, Q. Dai, Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit. Nat. Photon. 15, 367–373 (2021).
28. T. Yan, R. Yang, Z. Zheng, X. Lin, H. Xiong, Q. Dai, All-optical graph representation learning using integrated diffractive photonic computing units. Sci. Adv. 8, eabn7630 (2022).
29. X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, A. Ozcan, All-optical machine learning using diffractive deep neural networks. Science 361, 1004–1008 (2018).
30. T. Yan, J. Wu, T. Zhou, H. Xie, F. Xu, J. Fan, L. Fang, X. Lin, Q. Dai, Fourier-space diffractive deep neural network. Phys. Rev. Lett. 123, 023901 (2019).
31. O. Kulce, D. Mengu, Y. Rivenson, A. Ozcan, All-optical information-processing capacity of diffractive surfaces. Light Sci. Appl. 10, 25 (2021).
32. M. Veli, D. Mengu, N. T. Yardimci, Y. Luo, J. Li, Y. Rivenson, M. Jarrahi, A. Ozcan, Terahertz pulse shaping using diffractive surfaces. Nat. Commun. 12, 37 (2021).
33. C. Liu, Q. Ma, Z. J. Luo, Q. R. Hong, Q. Xiao, H. C. Zhang, L. Miao, W. M. Yu, Q. Cheng, L. Li, T. J. Cui, A programmable diffractive deep neural network based on a digital-coding metasurface array. Nat. Electron. 5, 113–122 (2022).
34. C. Qian, X. Lin, X. Lin, J. Xu, Y. Sun, E. Li, B. Zhang, H. Chen, Performing optical logic operations by a diffractive neural network. Light Sci. Appl. 9, 59 (2020).
35. E. Goi, S. Schoenhardt, M. Gu, Direct retrieval of Zernike-based pupil functions using integrated diffractive deep neural networks. Nat. Commun. 13, 7531 (2022).
36. X. Luo, Y. Hu, X. Ou, X. Li, J. Lai, N. Liu, X. Cheng, A. Pan, H. Duan, Metasurface-enabled on-chip multiplexed diffractive neural networks in the visible. Light Sci. Appl. 11, 158 (2022).
37. Z. Huang, P. Wang, J. Liu, W. Xiong, Y. He, J. Xiao, H. Ye, Y. Li, S. Chen, D. Fan, All-optical signal processing of vortex beams with diffractive deep neural networks. Phys. Rev. Applied 15, 014037 (2021).
38. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature 521, 436–444 (2015).
39. J. Geng, Structured-light 3D surface imaging: A tutorial. Adv. Opt. Photonics 3, 128–160 (2011).
40. Y. Shechtman, S. J. Sahl, A. S. Backer, W. E. Moerner, Optimal point spread function design for 3D imaging. Phys. Rev. Lett. 113, 133902 (2014).
41. Z. Shen, F. Zhao, C. Jin, S. Wang, L. Cao, Y. Yang, Monocular metasurface camera for passive single-shot 4D imaging. Nat. Commun. 14, 1035 (2023).
42. O. Ronneberger, P. Fischer, T. Brox, in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III (Springer, 2015), pp. 234–241.
43. Z. Zhang, F. Feng, J. Gan, W. Lin, G. Chen, M. G. Somekh, X. Yuan, Space-time projection enabled ultrafast all-optical diffractive neural network. Laser Photon. Rev. 2301367 (2024). doi:10.1002/lpor.202301367.
44. J. Li, D. Mengu, N. T. Yardimci, Y. Luo, X. Li, M. Veli, Y. Rivenson, M. Jarrahi, A. Ozcan, Spectrally encoded single-pixel machine vision using diffractive networks. Sci. Adv. 7, eabd7690 (2021).
45. X. Yang, H. Mei, K. Xu, X. Wei, B. Yin, R. W. Lau, in Proceedings of the IEEE/CVF International Conference on Computer Vision (IEEE, 2019), pp. 8809–8818.
46. H. Mei, X. Yang, Y. Wang, Y. Liu, S. He, Q. Zhang, X. Wei, R. W. Lau, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (IEEE, 2020), pp. 3687–3696.
47. C. Vieu, F. Carcenac, A. Pépin, Y. Chen, M. Mejias, A. Lebib, L. Manin-Ferlazzo, L. Couraud, H. Launois, Electron beam lithography: Resolution limits and applications. Appl. Surf. Sci. 164, 111–117 (2000).
48. M. K. Chen, X. Liu, Y. Wu, J. Zhang, J. Yuan, Z. Zhang, D. P. Tsai, A meta-device for intelligent depth perception. Adv. Mater. 35, e2107465 (2023).
49. X. Zhou, Y. Hou, J. Lin, A review on the processing accuracy of two-photon polymerization. AIP Adv. 5, 030701 (2015).
50. Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y.-C. Chen, P. Chen, G.-B. Jo, J. Liu, S. Du, All-optical neural network with nonlinear activation functions. Optica 6, 1132–1137 (2019).
51. M. S. S. Rahman, X. Yang, J. Li, B. Bai, A. Ozcan, Universal linear intensity transformations using spatially incoherent diffractive processors. Light Sci. Appl. 12, 195 (2023).

Acknowledgments: We thank S. Li for assistance with the experiments. Funding: This work is supported in part by the National Science and Technology Major Project under contract no. 2021ZD0109901, in part by the Natural Science Foundation of China (NSFC) under contract nos. 62125106 and 62088102, in part by the Tsinghua-Zhijiang joint research center (L.F. is the recipient), in part by the China Association for Science and Technology (CAST) under contract no. 2023QNRC001, and in part by the China Postdoctoral Science Foundation under contract no. GZB20230372. Author contributions: Q.D., L.F., and R.H. initiated and supervised the project. T.Z., T.Y., and L.F. conceived the idea and developed the methods. T.Y., Y.G., T.Z., Y.Z., and G.S. conducted the simulations and experiments. T.Y., T.Z., and L.F. analyzed the results and prepared the manuscript. All authors discussed the research. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.

Submitted 28 November 2023
Accepted 3 June 2024
Published 5 July 2024
10.1126/sciadv.adn2031