Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm that executes a given action trajectory, starting from an initial given frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on the benchmark. Results show that IRASim outperforms all the baseline methods and is more preferable in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research for generative real-robot action simulators, we open-source code, benchmark, and checkpoints.

Video Generation as Real-Robot Simulators

We create an interactive real-robot action simulator that can simulate robot trajectories in a way that is both accurate and almost visually indistinguishable from the real world. With such a simulator, agents can interactively control virtual robots to interact with diverse objects in various scenes in the simulator. It enables robots to improve policies by learning from simulated experiences without safety concerns and maintenance efforts. And the improved policy can consequently produce a large amount of simulated but realistic "real-robot" trajectories for training. Furthermore, the simulator can be leveraged as a dynamics model for imagining outcomes of different proposed candidate actions for model-based reinforcement learning.

Figure 1: Overview of IRASim. IRASim is an interactive real-robot action simulator that allows agents to input an action trajectory to control the "real robot" in an initial frame.

Trajectory-conditioned Video Generation

IRASim is a novel method that generates extremely realistic videos of a robot that executes an action trajectory, starting from a given initial frame. We refer to this task as the trajectory-to-video task. The trajectory-to-video task differs from the general text-to-video task. While various videos can meet the text condition in the text-to-video task, the predicted video in our trajectory-to-video task must strictly and accurately follow the input trajectory. More importantly, a challenge of this task is that each action in the trajectory provides an exact description of the robot's movement in each frame. This contrasts with the text-to-video task, where textual descriptions offer a general condition without specific frame-by-frame details. Another challenge is that the trajectory-to-video task features rich robot-object interactions, which must adhere to physical laws. IRASim leverages an innovative frame-level condition method to achieve precise frame-by-frame alignment between actions and video frames. We use the powerful Diffusion Transformer as the backbone of IRASim to improve the modeling of robot-object interactions. IRASim can generate realistic videos of high-resolution (up to 288 × 512) and long-horizon (up to 150+ frames).

Figure 2: Network Architecture of IRASim. (a) shows the general diffusion transformer architecture of IRASim. The input to IRASim includes the initial frame and the entire trajectory. (b) Frame-level adaptation (Frame-Ada). (c) Video-level adaptation (Video-Ada).

Short Trajectory Prediction

Uncurated qualitative results of short trajectories are shown below. Click the Click to View More button to display another random subset from 100 unpicked samples for each dataset. All samples are from the test set. Each video contains 16 frames with 4 fps. The video on the left is generated by IRASim, while the video on the right is the ground truth.

Click to View More

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
RT-1
Prediction

Short Trajectory
RT-1
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Bridge
Prediction

Short Trajectory
Bridge
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Short Trajectory
Language-Table
Prediction

Short Trajectory
Language-Table
Ground-truth

Quantitative results are shown in Tab. 1. IRASim-Frame-Ada performs the best among all the comparing methods in terms of Latent L2 loss and PSNR. It achieves the highest SSIM on RT-1 and Bridge and is comparable with the best baseline method LVDM on Language-Table. In all three datasets, it outperforms IRASim-Video-Ada in all the computation-based metrics. This indicates that frame-level condition enhances consistency between each frame and its corresponding action in the trajectory. IRASim-Frame-Ada also surpasses the two baseline methods based on U-Nets on all three datasets on Latent L2 loss. This demonstrates the superiority of transformer-based model, especially in handling complex 3D actions and robot-object interaction.

Table 1: Quantitative Results on Video Generation of Short Trajectories. We prioritize Latent L2 loss and PSNR as primary evaluation metrics.

Human Preference Evaluation. We also perform a user study to help understand human preferences between IRASim-Frame-Ada and other methods. We juxtapose the videos predicted by IRASim-Frame-Ada and the comparing method and ask humans which one they prefer. The ground-truth is also provided as a reference. IRASim-Frame-Ada beats all the comparing methods in all three datasets. This result aligns with the Latent L2 loss and PSNR which justifies the reason for using them as the primary evaluation metrics.

Figure 3: Human Preference Evaluation. We perform a user study to evaluate the human preference between IRASim-Frame-Ada and other baseline methods.

Long Trajectory Prediction

Uncurated qualitative results of long trajectories are shown below. Click the Click to View More button to display another random subset from 100 unpicked episodes for each dataset. Click the Click to View Very Long Videos button to display the six longest videos from the 100 unpicked episodes. Hover over on these longest videos to see their number of frames. All episodes are from the test set. The average number of frames of the 100 unpicked episodes are 47.04, 36.43, and 24.57 for RT-1, Bridge, and Language-Table, respectively. The video on the left is generated by IRASim; the video on the right is the ground truth. IRASim retains the powerful capability of generating visually realistic and accurate videos of long-horizon as in the short trajectory setting.

Click to View More

Click to View Very Long Videos

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
RT-1
Prediction

Long Trajectory
RT-1
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth

Long Trajectory
Bridge
Prediction

Long Trajectory
Bridge
Ground-truth