Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm executing a given action trajectory, starting from a given initial frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on it. Results show that IRASim outperforms all baseline methods and is preferred in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research on generative real-robot action simulators, we open-source the code, benchmark, and checkpoints.
We create an interactive real-robot action simulator that can simulate robot trajectories in a way that is both accurate and almost visually indistinguishable from the real world. With such a simulator, agents can interactively control virtual robots to interact with diverse objects in various scenes. This enables robots to improve policies by learning from simulated experience without safety concerns or maintenance effort, and the improved policies can in turn produce a large amount of simulated yet realistic "real-robot" trajectories for training. Furthermore, the simulator can be leveraged as a dynamics model to imagine the outcomes of candidate actions for model-based reinforcement learning.
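As a concrete illustration of the dynamics-model use case, the sketch below imagines the outcome of each candidate action trajectory with a learned video simulator and keeps the highest-scoring one. The `simulator.rollout` and `score_outcome` interfaces are hypothetical placeholders for illustration, not the released IRASim API.

```python
# Minimal sketch of using a learned action-conditioned video simulator as a
# dynamics model for candidate-action selection. The rollout and scoring
# interfaces below are assumptions, not the released API.
import torch

def select_best_trajectory(simulator, initial_frame, candidate_trajectories, score_outcome):
    """Imagine the outcome of each candidate trajectory and pick the best one.

    simulator.rollout(frame, traj) -> predicted video tensor (T, C, H, W)   [assumed]
    score_outcome(video)           -> scalar task reward for the rollout    [assumed]
    """
    best_score, best_traj = -float("inf"), None
    with torch.no_grad():
        for traj in candidate_trajectories:
            imagined_video = simulator.rollout(initial_frame, traj)
            score = score_outcome(imagined_video)
            if score > best_score:
                best_score, best_traj = score, traj
    return best_traj, best_score
```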
IRASim is a novel method that generates extremely realistic videos of a robot executing an action trajectory, starting from a given initial frame. We refer to this task as the trajectory-to-video task. It differs from the general text-to-video task: while many videos can satisfy a text condition, the predicted video in the trajectory-to-video task must strictly and accurately follow the input trajectory. One challenge is that each action in the trajectory provides an exact description of the robot's movement in the corresponding frame, in contrast to the text-to-video task, where textual descriptions offer only a general condition without frame-by-frame details. Another challenge is that the trajectory-to-video task features rich robot-object interactions, which must adhere to physical laws. IRASim leverages an innovative frame-level conditioning method to achieve precise frame-by-frame alignment between actions and video frames, and uses the powerful Diffusion Transformer as its backbone to improve the modeling of robot-object interactions. IRASim can generate realistic, high-resolution (up to 288 × 512), long-horizon (150+ frames) videos.
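To illustrate the frame-level conditioning idea, below is a minimal sketch of a per-frame adaptive layer-norm block in PyTorch: the tokens of each video frame are modulated by a scale and shift predicted from that frame's action. The module name, dimensions, and layer choices are assumptions for illustration and do not reproduce the exact IRASim block.

```python
# Minimal sketch of frame-level adaptive layer-norm conditioning: each frame's
# spatial tokens are modulated by an embedding of that frame's action.
import torch
import torch.nn as nn

class FrameAdaLN(nn.Module):
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Maps each frame's action to a per-frame scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(), nn.Linear(action_dim, 2 * hidden_dim)
        )

    def forward(self, tokens: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # tokens:  (B, T, N, D) -- N spatial tokens per frame, D hidden dim
        # actions: (B, T, A)    -- one action per frame
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)  # (B, T, D) each
        # Broadcast the per-frame modulation over the spatial tokens of that frame.
        return self.norm(tokens) * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

# Example: 2 videos, 16 frames, 256 tokens per frame, hidden dim 768, 7-D actions.
block = FrameAdaLN(hidden_dim=768, action_dim=7)
x = torch.randn(2, 16, 256, 768)
a = torch.randn(2, 16, 7)
print(block(x, a).shape)  # torch.Size([2, 16, 256, 768])
```

In a Diffusion Transformer, a block like this would replace a single global conditioning vector, so every frame receives its own action-specific modulation rather than one shared condition for the whole video.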
Uncurated qualitative results of short trajectories are shown below. Click the "Click to View More" button to display another random subset of the 100 unpicked samples for each dataset. All samples are from the test set. Each video contains 16 frames at 4 fps. The video on the left is generated by IRASim; the video on the right is the ground truth.
Quantitative results are shown in Tab. 1. IRASim-Frame-Ada performs the best among all compared methods in terms of Latent L2 loss and PSNR. It achieves the highest SSIM on RT-1 and Bridge and is comparable with the best baseline method, LVDM, on Language-Table. On all three datasets, it outperforms IRASim-Video-Ada on all computation-based metrics, indicating that frame-level conditioning enhances consistency between each frame and its corresponding action in the trajectory. IRASim-Frame-Ada also surpasses the two U-Net-based baseline methods on Latent L2 loss on all three datasets, demonstrating the superiority of transformer-based models, especially in handling complex 3D actions and robot-object interactions.
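For reference, a sketch of how the two primary metrics could be computed is shown below: Latent L2 measures the distance between predicted and ground-truth video latents (e.g. from the VAE encoder used by the diffusion model), and PSNR is computed on the decoded frames. The exact reductions and normalization used in IRASim Benchmark may differ.

```python
# Sketch of the two primary metrics under the assumptions stated above.
import torch

def latent_l2(pred_latents: torch.Tensor, gt_latents: torch.Tensor) -> torch.Tensor:
    # pred_latents, gt_latents: (B, T, C, H, W) VAE latents of the videos.
    # Mean element-wise squared (L2) error in latent space; the benchmark's
    # exact reduction may differ.
    return (pred_latents - gt_latents).pow(2).mean()

def psnr(pred: torch.Tensor, gt: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # pred, gt: decoded videos in [0, max_val], shape (B, T, C, H, W).
    mse = (pred - gt).pow(2).mean(dim=(-3, -2, -1))        # per-frame MSE
    return (10 * torch.log10(max_val ** 2 / mse)).mean()   # average over frames and batch
```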
Human Preference Evaluation. We also perform a user study to understand human preferences between IRASim-Frame-Ada and the other methods. We juxtapose the videos predicted by IRASim-Frame-Ada and a compared method and ask humans which one they prefer, with the ground truth provided as a reference. IRASim-Frame-Ada beats all compared methods on all three datasets. This result aligns with the Latent L2 loss and PSNR results, justifying their use as the primary evaluation metrics.
Uncurated qualitative results of long trajectories are shown below. Click the "Click to View More" button to display another random subset of the 100 unpicked episodes for each dataset. Click the "Click to View Very Long Videos" button to display the six longest videos from the 100 unpicked episodes; hover over these videos to see their number of frames. All episodes are from the test set. The average numbers of frames of the 100 unpicked episodes are 47.04, 36.43, and 24.57 for RT-1, Bridge, and Language-Table, respectively. The video on the left is generated by IRASim; the video on the right is the ground truth. IRASim retains its powerful capability of generating visually realistic and accurate long-horizon videos, as in the short-trajectory setting.
Quantitative results are shown in Tab. 2. We compare IRASim with the best baseline method, LVDM. IRASim-Frame-Ada consistently outperforms the compared methods on all three datasets in Latent L2 loss and PSNR.
We follow DiT and train IRASim-Frame-Ada at different model sizes ranging from 33M to 679M parameters. Results are shown in Fig. 4. On all three datasets, IRASim scales gracefully with increasing model size and training steps, indicating strong potential for further performance gains from larger models and longer training.
To showcase the application of IRASim, we perform experiments on controlling the "real robots" in the three datasets of IRASim Benchmark. For Language-Table, which has a 2D translation action space, we use the keyboard arrow keys to input action trajectories for better accessibility. For RT-1 and Bridge, which have 3D action spaces, we use a Vive controller to record action trajectories as input. The videos below show that IRASim can be used as an interactive real-robot action simulator in various ways. In particular, IRASim robustly handles multimodality in generation: the videos for Language-Table show generations from an identical initial frame with different trajectories.
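As an illustration of the interactive setting, the sketch below maps arrow-key presses to 2D translation deltas (Language-Table style), accumulates them into a trajectory, and asks the simulator to imagine the resulting video. The `simulator.rollout` call and the `read_key` helper are hypothetical interfaces for illustration, not the released code.

```python
# Minimal sketch of an interactive loop for a 2-D translation action space:
# arrow-key presses become small planar deltas, which are collected into an
# action trajectory and passed to the simulator with the current frame.
import torch

KEY_TO_DELTA = {
    "up":    ( 0.00,  0.01),
    "down":  ( 0.00, -0.01),
    "left":  (-0.01,  0.00),
    "right": ( 0.01,  0.00),
}

def interactive_rollout(simulator, frame, read_key, horizon: int = 16):
    """Collect `horizon` arrow-key actions, then imagine the resulting video."""
    actions = []
    for _ in range(horizon):
        key = read_key()                      # e.g. "up", "down", "left", "right" [assumed helper]
        actions.append(KEY_TO_DELTA.get(key, (0.0, 0.0)))
    trajectory = torch.tensor(actions)        # (horizon, 2) planar deltas
    with torch.no_grad():
        video = simulator.rollout(frame, trajectory)   # assumed API
    return video
```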
@article{FangqiIRASim2024,
  title={IRASim: Learning Interactive Real-Robot Action Simulators},
  author={Fangqi Zhu and Hongtao Wu and Song Guo and Yuxiao Liu and Chilam Cheang and Tao Kong},
  year={2024},
  journal={arXiv:2406.12802}
}