
Language, Camera, Autonomy! Prompt-Engineered Robot Control for Rapidly Evolving Deployment
Jacob P. Macdonald (jacob.macdonald@ll.mit.edu), Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA
Rohit Mallick (rmallic@clemson.edu), Clemson University, Clemson, SC, USA
Allan B. Wollaber (allan.wollaber@ll.mit.edu), Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA
Jaime D. Peña (jdpena@ll.mit.edu), Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA
Nathan McNeese (mcneese@clemson.edu), Clemson University, Clemson, SC, USA
Ho Chit Siu (hochit.siu@ll.mit.edu), Massachusetts Institute of Technology Lincoln Laboratory, Lexington, MA, USA

ABSTRACT

The Context-observant LLM-Enabled Autonomous Robots (CLEAR) platform offers a general solution for large language model (LLM)-enabled robot autonomy. CLEAR-controlled robots use natural language to perceive and interact with their environment: contextual description derived from computer vision and optional human commands prompt intelligent LLM responses that map to robotic actions. By emphasizing prompting, system behavior is programmed without manipulating code, and unlike other LLM-based robot control methods, we do not perform any model fine-tuning. CLEAR employs off-the-shelf pre-trained machine learning models for controlling robots ranging from simulated quadcopters to terrestrial quadrupeds. We provide the open-source CLEAR platform, along with sample implementations for a Unity-based quadcopter and a Boston Dynamics Spot® robot. Each LLM used, GPT-3.5, GPT-4, and LLaMA2, exhibited behavioral differences when embodied by CLEAR, contrasting in actuation preference, ability to apply new knowledge, and receptivity to human instruction. GPT-4 demonstrates the best performance compared to GPT-3.5 and LLaMA2, showing successful task execution 97% of the time. The CLEAR platform contributes to HRI by increasing the usability of robotics for natural human interaction.

CCS CONCEPTS

• Human-centered computing → Interactive systems and tools; • Computer systems organization → Robotic autonomy; • Computing methodologies → Vision for robotics; Neural networks.

KEYWORDS

large language models; computer vision; robotics; software

ACM Reference Format:
Jacob P. Macdonald, Rohit Mallick, Allan B. Wollaber, Jaime D. Peña, Nathan McNeese, and Ho Chit Siu. 2024. Language, Camera, Autonomy! Prompt-engineered Robot Control for Rapidly Evolving Deployment. In Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI '24 Companion), March 11–14, 2024, Boulder, CO, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3610978.3640671

This work is licensed under a Creative Commons Attribution International 4.0 License.
HRI '24 Companion, March 11–14, 2024, Boulder, CO, USA
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0323-2/24/03.
https://doi.org/10.1145/3610978.3640671

1 INTRODUCTION

Recent advances in large language models (LLMs) have enabled versatile new modalities for improving usability in the field of Human-Robot Interaction (HRI). Thus far, existing LLM-driven robotics research has relied on models specifically trained for robot control [1–3] or on asking an LLM to generate robot actuation code [6, 7, 9, 14, 17]. Advances in LLM-driven agent-based models have also provided a way to overcome LLM token-length "memory" and chain-of-thought reasoning limitations [5, 11, 16].

Our robot vision system, the Context-observant LLM-Enabled Autonomous Robots (CLEAR) platform, builds on previous work as a robot interface that does not require LLM fine-tuning. In other words, this system utilizes LLMs as they are, and further development of their models is not required for CLEAR to be effective. This work allows untrained users to interact with a vision-enabled robot to perform dynamic and context-appropriate behaviors.

We present a robot-vision-LLM system that is 1) robot-agnostic, 2) LLM-agnostic, and 3) prompt-only (no LLM fine-tuning or robot-specific pre-training). In contrast to previous work, our software is specifically focused on applications to HRI, as it includes the ability for users to interact with the robot during and between tasks via a web-based voice/chat interface, allowing a different level of user interaction than other, more planning-focused work. We evaluate CLEAR on tasks carried out by a simulated quadcopter using multiple LLMs with a YOLOv8 [8] vision model. GPT-4 demonstrates the most consistent execution of commands (97%) compared to GPT-3.5 and LLaMA2. Next, changing only the initial prompt, we provide an example of using CLEAR to direct a Boston Dynamics Spot® quadruped to accomplish multistep tasks via verbal commands. This makes it the first prompt-only system to easily swap between different robot form factors.


CLEAR is provided under an MIT license at https://github.com/MITLL-CLEAR so that, as LLMs, vision models, and robotics continue to evolve, almost anyone will be able to use an LLM to autonomously control a robot. The CLEAR_setup repository gives a means for installing, managing, and working with the components. We welcome community input and contributions.

2 CHARACTERISTICS

CLEAR is composed of several distributed services following a separation-of-concerns design philosophy. The services communicate via Representational State Transfer (REST) APIs built on Node.js. This design enables deployment configuration options, such as a globally available cloud service, a local network service, or a hybrid of the two.
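As a purely illustrative example (the service names, hosts, and ports below are assumptions, not values taken from the repositories), a hybrid deployment might simply point the services at a mix of local and cloud-hosted base URLs:

    # Hypothetical hybrid deployment map: lightweight services stay on the
    # local network while the GPU-heavy worker runs on a remote host. Because
    # the services communicate over REST, only the base URLs need to change.
    SERVICE_URLS = {
        "interface_server": "http://192.168.1.20:3000",
        "coordinator": "http://192.168.1.20:4000",
        "worker_server": "https://clear-worker.example.com",
    }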
2.1 Object Detection and Tracking

CLEAR's awareness of its surroundings is achieved via a YOLOv8 object detection model, which yields a constant stream of meta-objects. These meta-objects are abstract representations of detected objects from the environment. Their attributes include object classification, local coordinates, time of initialization (age), and sub-object detections (e.g., the hands of a person). These attributes are routinely updated (age reset and coordinates corrected) based on feature similarities with newly perceived meta-objects, and meta-objects are deleted if they are not updated before exceeding a maximum age threshold. Together, these meta-objects enable the system to perceive the robot's visual environment and characterize meaning based on knowledge of the system definition. Given the accuracy of the meta-objects and their attributes, they serve as the foundational features used to facilitate robotic actuation.
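To make the meta-object life cycle concrete, the following minimal Python sketch illustrates the attributes and the age-based update/prune behavior described above. It is an illustration under our description, not the released implementation; the field names and threshold value are assumptions.

    import time
    from dataclasses import dataclass, field

    MAX_AGE_S = 5.0  # hypothetical maximum age threshold

    @dataclass
    class MetaObject:
        label: str                  # object classification, e.g., "person"
        coords: tuple               # local coordinates
        sub_objects: list = field(default_factory=list)      # e.g., hands of a person
        created: float = field(default_factory=time.time)    # time of initialization
        last_seen: float = field(default_factory=time.time)  # reset on every update

        def update(self, coords):
            """Correct coordinates and reset the age when a similar detection arrives."""
            self.coords = coords
            self.last_seen = time.time()

        def expired(self):
            """True if the object was not updated before the maximum age threshold."""
            return time.time() - self.last_seen > MAX_AGE_S

    def prune(meta_objects):
        """Drop meta-objects that have exceeded the maximum age threshold."""
        return [m for m in meta_objects if not m.expired()]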
2.2 System Design

CLEAR has five component services, each in its own repository. These services operate in tandem to contextualize sensor/user inputs, process LLM input/output, and actuate the robot (Figure 1).

Figure 1: System architecture for CLEAR.

The CLEAR_interface_server connects the robot system to the CLEAR platform at large by handling the data transfer between the robot, human users, and other CLEAR services. Its browser-based user interface allows users to converse with the LLM via voice/chat, vet the LLM's commands to the robot, and take manual control of the robot (Figure 3). The information obtained from users and the robot is then relayed to the CLEAR coordinator.

CLEAR_coordinator is the central handler between the sensor-related services of CLEAR, the LLM, and actuation commands. The coordinator manages two abstract processes for perceiving the robot's environment: a volatile list of detected objects (meta-objects) and a natural-language exposition detailing context. The meta-objects are used for actions dependent on tracking, such as moving to or grabbing an object; both of these actions operate on information such as location and distance. The exposition, dubbed the Conversation Ledger, is perceived by the LLM and is kept in the LLM handler; however, the coordinator creates the conversation ledger and, in it, details a conceptual framework for understanding the subsequent abstract representations of the first layer that are provided as prompts. This perception is derived and applied to various services through the coordinator's relationship with the two web servers: the interface server and the worker server. From the interface server, the coordinator receives images and user text that require extensive processing; to prevent the coordinator from stalling, these processes are shared with the worker services.

CLEAR_worker_server handles the data transfer between the coordinator, computer vision, and the LLM handler. The worker server makes computationally rigorous functionality remotely available to the coordinator, providing object detection, depth estimation, and LLM inference. These worker services employ an event-driven architecture, awaiting appropriate HTTP POST requests that deliver input data and emit a signal. Select events launch the respective worker processes, which in turn yield data usable by the coordinator.
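The event-driven worker pattern can be sketched as follows. This is an illustration only, written in Python with Flask for brevity (the released services expose Node.js REST APIs); the route, field names, and port are assumptions rather than CLEAR's actual interface.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_object_detection(image_b64):
        # Placeholder: decode the image and run a detector such as YOLOv8.
        return []

    # An HTTP POST delivers input data and acts as the signal that launches the
    # corresponding worker process; the result is returned to the coordinator.
    @app.route("/detect", methods=["POST"])
    def detect():
        image_b64 = request.json.get("image")          # input data from the coordinator
        detections = run_object_detection(image_b64)   # worker process
        return jsonify({"detections": detections})     # data usable by the coordinator

    if __name__ == "__main__":
        app.run(port=5001)  # hypothetical worker port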
CLEAR_computer_vision interprets visual data, furnishing the coordinator's meta-object collection through its two sub-services: object detection and monocular depth estimation. The coordinator provides camera input from the robot, and the computer vision system transforms and returns the visual data as depth matrices and encoded strings with object information. Depth is used for object avoidance, while the encoded strings are refined into meta-objects. These meta-objects are initialized in the coordinator's first perception layer, then translated into the second, Conversation Ledger layer as prompts to be handled by the CLEAR_LLM_handler.

CLEAR_LLM_handler connects the coordinator to an LLM to generate responses that produce robot behavior. The coordinator and LLM handler are related through the conversation ledger: the coordinator initializes the conversation ledger and iteratively appends prompts to it, while the LLM handler preserves the conversation ledger, shares it with the LLM, and appends the LLM's responses to it, allowing the LLM to use its conversation history to infer responses.

2.3 The Conversation Ledger

The conversation ledger is a natural-language JSON log that drives the LLM's perception and interactivity. Entries of the conversation ledger are conceptually divided into three interdependent categories: prompts, responses, and system definitions (Figure 2).

Figure 2: The initial instructions, user commands, and vision-model textual output are all passed to the LLM, via the conversation ledger, to federate the robot actions.
Prompts are statements expressing system sensory input, including an abstract representation of the coordinator's meta-objects, general system information, and human comments from the web interface (Figure 3). These prompts are exchanged with the LLM for responses.

Responses from the LLM can spur robotic actuation and/or message users on the interface server. These responses must be strictly structured because they are semantically similar to programmatic function calls: responses are parsed for arguments and related to a dictionary whose values are function pointers.
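A minimal Python sketch of this parsing-and-dispatch step follows; it is an illustration under assumptions rather than the released parser, and the response format, action names, and handler functions shown are hypothetical.

    import re

    def lookat(target): print(f"looking at {target}")
    def moveto(target): print(f"moving to {target}")
    def throw(item, target): print(f"throwing {item} at {target}")

    # Dictionary relating action names to function "pointers".
    ACTIONS = {"LOOKAT": lookat, "MOVETO": moveto, "THROW": throw}

    def dispatch(response):
        """Parse a strictly structured response, e.g. 'THROW(apple, tree)',
        extract its arguments, and call the matching robot action."""
        match = re.fullmatch(r"(\w+)\((.*)\)", response.strip())
        if not match or match.group(1) not in ACTIONS:
            raise ValueError(f"response violates the system definition: {response}")
        args = [a.strip() for a in match.group(2).split(",") if a.strip()]
        ACTIONS[match.group(1)](*args)

    dispatch("THROW(apple, tree)")  # -> throwing apple at tree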
System definitions are tailored to each CLEAR configuration and contextualize the conversation ledger (Figure 2). System definitions detail the robot-specific actions available and how to use them. For example, the Unity quadcopter (Section 3.2.1) can throw objects (e.g., apples) at other objects because its system definition defines a THROW action that takes a target object as a parameter.
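A hypothetical example of ledger entries in these three categories is given below; the schema and wording are illustrative assumptions, not an excerpt from the repository.

    # Illustrative conversation ledger entries (hypothetical schema and content).
    ledger = [
        {"category": "system definition",
         "content": "You control a quadcopter. Available actions: MOVETO(object), "
                    "LOOKAT(object), THROW(item, target), ROTATE(degrees), "
                    "RESTART, STOP, DONOTHING, MESSAGE(text)."},
        {"category": "prompt",
         "content": "Vision: tree at (2.1, 0.0, 4.3); apple at (0.4, 0.1, 1.2). "
                    "User: please throw an apple at the tree."},
        {"category": "response", "content": "THROW(apple, tree)"},
    ]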

3 SOFTWARE

We provide two reference implementations: a Unity-based quadcopter simulation and a Boston Dynamics Spot robot. Both were tested with object detection (the most computationally intensive local process) running on either an NVIDIA DGX A100 or an NVIDIA GeForce RTX 3060 GPU. As multimodal LLMs evolve, local compute would only be necessary for privacy considerations. We first describe the three LLMs used in testing, then the two reference implementations and associated results.

3.1 Choice of Large Language Model

For the simulated quadcopter, CLEAR was tested with gpt-3.5-turbo (GPT-3.5), gpt-4 (GPT-4), and LLaMA2-70B-chat-hf (LLaMA2), primarily due to their low latencies [4, 10, 15]. Local instances of "open-source" LLMs, such as the large-parameter LLaMA2 and Falcon models, were set up for integration in CLEAR and, with sufficient optimization, will be testable, but their latencies were too high for real-time feedback [12, 15]; we leave this and the integration of other LLMs to future work.

LLaMA2, GPT-3.5, and GPT-4 differ in their parameter sizes (70B, 175B, >1000B), token lengths (4K, 4K, 8K), and number of training tokens (0.3T, 0.2T, >10T); the GPT-4 details shown with inequalities are speculative, informed by [13]. All are sufficiently trained to produce "emergent abilities" such as instruction following [19].

3.2 Example Implementations

3.2.1 Simulated Quadcopter. CLEAR_virtual_drone includes a Unity prefab that can be immediately deployed in simulation. Due to the ease of sim-based replication, we focus on this implementation to demonstrate CLEAR's usability. This implementation also encourages the community to build new environments and use cases. We do not seek state-of-the-art results in planning or other benchmarks and focus primarily on providing a novel, accessible HRI capability. To demonstrate CLEAR's behavior, we performed three experiments, each with 1000 trials for each of the three LLMs. The system definition used by the LLMs in each experiment is included in our repository. Our results are derived from a prerelease CLEAR version using LLMs in September and October of 2023.

We first consider the differences in how the LLMs synthesize input for robotic control without human intervention, as a type of "background chatter" and initial environmental surveying. Each of the LLMs tested has access to the output commands (described in the initial prompt): MOVETO, LOOKAT, THROW, ROTATE, RESTART, STOP, DONOTHING, and MESSAGE. Table 1 indicates that the GPT models tend to observe and act in their surroundings, with high counts of THROW, MOVETO, and LOOKAT, whereas LLaMA2 tends to move, do nothing, and message more frequently.

Table 1: Behavioral preferences based on LLM without human input. Prompts are generated entirely by object detection.

Command      GPT-4    GPT-3.5   LLaMA2
LOOKAT       21.5%    15.4%     14.1%
MOVETO       24.6%    14.4%     26.4%
THROW        27.7%    33.2%      1.8%
ROTATE        4.2%     3.3%      8.3%
RESTART       4.5%    13.1%      0.8%
STOP          1.0%     2.7%      3.6%
DONOTHING    14.6%    15.2%     24.8%
MESSAGE       1.9%     2.7%     19.0%

In the same vein of providing observable insight into LLM capability without human intervention, we measure each LLM's resiliency against memory fatigue; LLMs face a well-documented difficulty in utilizing older information. With CLEAR, we are able to track this resiliency in an effort to showcase longevity of use before a system error occurs. In this case, a system error refers to the LLM violating the system definition and asking CLEAR to coordinate an undefined action. Table 2 depicts the tallied number of responses across all scenarios that were correctly formed (no failures), as well as the number of responses that were incorrect on the first, second, third, and nth tries. In all cases, the LLMs followed instructions without failure on the first trial more than 85% of the time, with GPT-4 executing 97% of the time without failure.

Table 2: Number of responses until the first incorrect robotic instruction, i.e., a deviation from the required response format.

LLM       Correct   First incorrect on try:
                    1       2       3       ≥4
GPT-4     97.1%     0.2%    1.9%    0.3%    0.5%
GPT-3.5   95.8%     0.8%    2.2%    0.5%    0.7%
LLaMA2    85.3%     2.0%    5.3%    3.6%    3.8%

To measure how receptive CLEAR is to robot instruction-following when controlled by different LLMs, we posed a series of increasingly challenging human-directed task scenarios: a chained request of three actions (restart, message, move to an object). For testing, it was required that the actions be sequential; if any additional actions occurred within the sequence, this was considered a failure. These results (Table 3) show that LLaMA2 is the least receptive, satisfying the human-given 3-action command no more than 8% of the time, while GPT-4 was consistently the most likely to follow the entire human command. For this 3-action task, both GPT-4 and GPT-3.5 were successful more than 30% of the time, even under this somewhat strict definition of success and without any attempt to optimize the initial prompts.

Table 3: Percent task completion per step of the 3-action task requests given in the human prompt.

LLM       1-action   2-action   3-action
GPT-4     80.8%      67.7%      39.1%
GPT-3.5   74.7%      41.7%      30.3%
LLaMA2    71.8%      32.1%       8.4%


Figure 3: Web interface for Spot (left) and the Unity drone (right). The interfaces show the robot's point of view, the robot's current action, and the chat history between the user and the LLM. Any user interaction with the robot occurs through this interface, primarily via the chat box on the bottom right.

3.2.2 Boston Dynamics Spot. We also offer an example of CLEAR on the Boston Dynamics Spot quadruped robot with an arm attachment (software is in the CLEAR_spot repository). A user/robot chat exchange is shown in Figure 3; the output commands matched most of the commands of the simulation except RESTART and THROW. We included a new command, GRAB, with a human-in-the-loop approval requirement as a safety feature, as sketched below.
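The human-in-the-loop gate around GRAB can be outlined as follows. This is an illustrative Python sketch under the description above, not the CLEAR_spot implementation; the function names and prompt text are assumptions.

    # Hypothetical approval gate: actions in this set are held until the operator
    # approves them through the web interface; all other actions run directly.
    REQUIRES_APPROVAL = {"GRAB"}

    def grab(target):
        print(f"grabbing {target}")

    def execute(action, args, ask_user):
        """Dispatch an action, first asking the operator when it is flagged as risky.

        ask_user is a callable that poses a yes/no question, e.g., via the chat UI."""
        if action in REQUIRES_APPROVAL and not ask_user(f"Approve {action}({', '.join(args)})?"):
            return "rejected by operator"
        return {"GRAB": grab}[action](*args)

    print(execute("GRAB", ["apple"], ask_user=lambda question: False))  # -> rejected by operator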
4 USAGE AND FUTURE WORK

The interfaces provided by CLEAR enable its use as an HRI research platform focused on natural interactions between humans and robots. From the perspective of robotics accessibility, interfaces like CLEAR have the potential to democratize access to complex robotic systems by making their use closer to human-human communication, as well as by allowing them to directly leverage state-of-the-art open-source machine learning models as they become available.

Such systems do, however, present some risks. We provide some basic guidelines for safe and effective use, though experimental validation of this system with users remains part of future work.

CLEAR provides basic safety checks, such as the rejection of commands not in the system definitions, but more may be warranted at the LLM or robot control levels. We describe examples where human-in-the-loop autonomy is required and implemented (i.e., Spot's GRAB and the simulated drone's THROW command), though safe robot behavior cannot be guaranteed. Additionally, "tricking" the system into performing dangerous actions is relatively simple if object detection or robot action labels were provided maliciously (e.g., a dangerous action mapped to an innocent name). The system also inherits vulnerabilities to adversarial attacks at the LLM level [18]. However, harm-reduction training in many off-the-shelf LLMs filters some typical (non-adversarial) unsafe prompts.

To minimize unsafe robot actions, we recommend 1) restricting access to the UI to trusted users; 2) limiting actuation speeds on the robot, regardless of the LLM commands; 3) ensuring that object detection and action labels are appropriately mapped; and 4) potentially implementing an observer layer to judge commands in the context of the robot's actions and reject unsafe commands, as sketched below.
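One illustration of such an observer layer follows; it is a hypothetical Python sketch, not part of the released platform, and the rules, names, and threshold are assumptions. Each proposed command is judged against the current detections before actuation.

    # Hypothetical observer layer: judge each LLM command in the context of the
    # robot's surroundings and reject commands deemed unsafe.
    MIN_APPROACH_DISTANCE_M = 1.0  # illustrative safety margin around people

    def vet(action, target, people_distances):
        """Return (approved, reason) for a proposed command.

        people_distances maps a detected person's label to its estimated
        distance from the robot in meters."""
        if action == "THROW" and target in people_distances:
            return False, "throwing objects at people is never approved"
        if action == "MOVETO" and people_distances.get(target, float("inf")) < MIN_APPROACH_DISTANCE_M:
            return False, f"target {target!r} is too close for a safe approach"
        return True, "ok"

    print(vet("THROW", "person", {"person": 3.2}))  # -> (False, 'throwing objects ...')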
5 CONCLUSION

With LLMs becoming ubiquitous, further research is needed to understand how they interact with humans in both virtual and physical environments. The development of the CLEAR platform to synthesize input from both LLMs and humans has the potential to aid researchers in better understanding how robotics can be designed for the general public. This implementation is the first of its kind that is 1) robot-agnostic, 2) LLM-agnostic, and 3) prompt-only, allowing users to use components of their choice without fine-tuning. As more advanced LLMs and vision models are developed, their advances may be rapidly incorporated into robotic systems via CLEAR. We offer baseline measurements of performance that support CLEAR's usage with a variety of LLMs and present indications of which may work better. Additionally, as CLEAR is implemented with human-centered principles of safety and natural-language-based instruction, we improve the overall usability of robotics by engaging humans without complicating interaction.

CLEAR will be used by our team as part of a human-AI interaction project at least through October 2025, with a high likelihood of continued use beyond that timeframe. Associated development and maintenance will continue through that project and will be pushed to the open-source repository.

ACKNOWLEDGEMENTS

This material is based upon work supported by the Under Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Under Secretary of Defense for Research and Engineering.


REFERENCES

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. 2022. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691 (2022).
[2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. 2023. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818 (2023).
[3] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. 2022. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817 (2022).
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[5] Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. 2023. Collaborating with language models for embodied reasoning. arXiv:2302.00763 [cs.LG]
[6] Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, and Hongsheng Li. 2023. Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model. arXiv:2305.11176 [cs.RO]
[7] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. 2023. VIMA: General Robot Manipulation with Multimodal Prompts. arXiv:2210.03094 [cs.RO]
[8] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. 2023. YOLO by Ultralytics. https://github.com/ultralytics/ultralytics
[9] Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. 2023. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 9493–9500.
[10] OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
[11] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior. arXiv:2304.03442 [cs.HC]
[12] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023).
[13] Maximilian Schreiner. 2023. GPT-4 architecture, datasets, costs and more leaked. https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/
[14] Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. 2023. ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 11523–11530.
[15] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
[16] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Ji-Rong Wen. 2023. A Survey on Large Language Model based Autonomous Agents. arXiv:2308.11432 [cs.AI]
[17] Jimmy Wu, Rika Antonova, Adam Kan, Marion Lepert, Andy Zeng, Shuran Song, Jeannette Bohg, Szymon Rusinkiewicz, and Thomas Funkhouser. 2023. TidyBot: Personalized Robot Assistance with Large Language Models. arXiv:2305.05658 [cs.RO]
[18] Lei Xu, Yangyi Chen, Ganqu Cui, Hongcheng Gao, and Zhiyuan Liu. 2022. Exploring the universal vulnerability of prompt-based learning paradigm. arXiv preprint arXiv:2204.05239 (2022).
[19] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A Survey of Large Language Models. arXiv:2303.18223 [cs.CL]
