PersistWorld: Stabilizing Multi-step Robot World Model Rollouts via Reinforcement Learning

Jai Bardhan,  Patrik Drozdík,  Josef Šivic,  Vladimír Petrík
Czech Institute of Informatics, Robotics and Cybernetics (CIIRC),
Czech Technical University in Prague

PersistWorld stabilizes long-horizon robot world model rollouts via RL post-training on autoregressive outputs.

Abstract

Action-conditioned robot world models generate future video frames given a robot action sequence, but break down during long-horizon autoregressive deployment: each predicted clip feeds back as context for the next, causing errors to compound and visual quality to rapidly degrade — a phenomenon known as the closed-loop gap. We present PersistWorld, an RL post-training framework that trains the world model directly on its own autoregressive rollouts, using contrastive denoising with multi-view perceptual rewards. Our approach establishes a new state-of-the-art on the DROID dataset: LPIPS reduced by 14% on external cameras, SSIM improved by 9.1% on the wrist camera, winning ~98% of paired comparisons, and achieving an 80% preference rate in a blind human study.

The Closed-Loop Gap

Standard diffusion world models are trained under teacher forcing: given ground-truth history frames, predict the next clip. At deployment, however, the model must condition on its own previously generated outputs — inputs it was never trained to handle. Small per-step errors accumulate, rapidly degrading coherence in scenes, objects, and robot poses. This is the closed-loop gap.

Using the DROID robot manipulation dataset, we show this failure mode is rapid and severe: within seconds of autoregressive generation, manipulated objects lose their structural identity, robot end-effectors drift from commanded trajectories, and scene configurations decohere. No amount of teacher-forced training resolves this — the model is simply never exposed to its own imperfect history during training. What is needed is a training signal computed directly from the model's own autoregressive outputs.
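
The compounding described above can be made concrete with a toy one-dimensional "model" (purely illustrative; the dynamics and the bias value below are invented for the sketch, not taken from the paper):

```python
# Toy illustration of the closed-loop gap: a 1-D "world model" with a small
# per-step bias. Under teacher forcing the error never exceeds the one-step
# bias; under autoregressive rollout the same bias compounds with horizon.

def true_step(x):
    return 0.99 * x + 1.0          # hypothetical ground-truth dynamics

def model_step(x):
    return 0.99 * x + 1.0 + 0.05   # learned model with a small bias

def teacher_forced_errors(x0, horizon):
    errors, x_true = [], x0
    for _ in range(horizon):
        pred = model_step(x_true)   # always conditioned on ground truth
        x_true = true_step(x_true)
        errors.append(abs(pred - x_true))
    return errors

def autoregressive_errors(x0, horizon):
    errors, x_true, x_pred = [], x0, x0
    for _ in range(horizon):
        x_pred = model_step(x_pred)  # conditioned on its own last output
        x_true = true_step(x_true)
        errors.append(abs(x_pred - x_true))
    return errors

tf = teacher_forced_errors(0.0, 14)
ar = autoregressive_errors(0.0, 14)
# tf stays flat at the one-step bias; ar grows monotonically with depth
```

Even with an identical per-step model, the autoregressive error after 14 steps is an order of magnitude larger than the teacher-forced error, which is exactly the regime teacher forcing never exposes the model to.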

Method

PersistWorld addresses the closed-loop gap via online RL post-training. Rather than training on ground-truth history, we run the model autoregressively on its own outputs, compare groups of candidate continuations, and update toward higher-fidelity generations using a contrastive denoising objective — no backpropagation through the denoising chain required.

We adapt a recent contrastive RL framework for diffusion models to the EDM $x_0$-prediction parameterization used by the Ctrl-World backbone, and prove that the policy-improvement guarantees carry over exactly. The training protocol branches $K{=}16$ candidate continuations from a shared rollout history, ranks them with multi-view perceptual rewards (LPIPS, SSIM, PSNR), and uses the ranking to update lightweight LoRA adapters and the action encoder.
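
For reference, the EDM preconditioning (Karras et al., 2022) underlying the $x_0$-prediction parameterization expresses the denoiser $D_\theta$ in terms of a raw network $F_\theta$ and noise level $\sigma$:

```latex
D_\theta(\mathbf{x};\sigma)
  = c_{\text{skip}}(\sigma)\,\mathbf{x}
  + c_{\text{out}}(\sigma)\,
    F_\theta\!\bigl(c_{\text{in}}(\sigma)\,\mathbf{x},\; c_{\text{noise}}(\sigma)\bigr),
\qquad
c_{\text{skip}} = \frac{\sigma_{\text{data}}^2}{\sigma^2 + \sigma_{\text{data}}^2},\quad
c_{\text{out}} = \frac{\sigma\,\sigma_{\text{data}}}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}},\quad
c_{\text{in}} = \frac{1}{\sqrt{\sigma^2 + \sigma_{\text{data}}^2}},\quad
c_{\text{noise}} = \tfrac{1}{4}\ln\sigma.
```

These are the standard EDM coefficients; whether the Ctrl-World backbone uses exactly these $c$-schedules is an assumption here, but the output of $D_\theta$ is the $x_0$ prediction that the contrastive loss weights.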

Method overview

Overview of PersistWorld. (Top) Autoregressive inference: a robot policy feeds actions to the world model, which appends generated frames to the history buffer for the next step. (Bottom) RL post-training: S1 roll out a shared variable-length prefix from a ground-truth starting state; S2 branch $K$ independent candidate continuations; S3 score candidates with multi-view perceptual rewards; S4 reward weights scale implicit positive/negative $x_0$ predictions in the contrastive loss.
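
The S1-S4 loop above can be sketched end to end; the rollout and reward functions below are toy stand-ins for the diffusion world model and the perceptual metrics (all names and values are illustrative, not the paper's implementation):

```python
import math
import random

K = 16  # candidate continuations branched per shared prefix

def rollout_prefix(start_state, length):
    """S1: roll a shared variable-length prefix from a ground-truth start (stub)."""
    return [start_state + t for t in range(length)]

def branch_candidate(prefix):
    """S2: sample one independent candidate continuation (stub)."""
    return prefix[-1] + random.random()

def perceptual_reward(candidate, reference):
    """S3: stand-in for the multi-view LPIPS/SSIM/PSNR score."""
    return -abs(candidate - reference)

def reward_weights(rewards):
    """S4: softmax over rewards; the weights scale the implicit
    positive/negative x0 predictions in the contrastive loss."""
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]

random.seed(0)
prefix = rollout_prefix(0.0, length=4)                                   # S1
candidates = [branch_candidate(prefix) for _ in range(K)]                # S2
rewards = [perceptual_reward(c, prefix[-1] + 0.5) for c in candidates]   # S3
weights = reward_weights(rewards)                                        # S4
# higher-reward candidates receive larger weight in the update
```

Because only the final $x_0$ predictions are reweighted, the update needs no backpropagation through the denoising chain, matching the description above.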

Results

Quantitative Metrics

We evaluate on 14-step autoregressive rollouts (≈11 s) on the DROID validation split. PersistWorld outperforms all baselines on every metric. The largest relative gains appear on the wrist camera — the view most critical for fine-grained object manipulation — where SSIM improves by 9.1% and LPIPS by 10.4%.

Cameras    Model                  SSIM ↑   PSNR ↑   LPIPS ↓
External   WPE *                  0.77     20.33    0.131
           IRASim *               0.77     21.36    0.117
           Ctrl-World *           0.83     23.56    0.091
           Ctrl-World             0.84     23.02    0.081
           PersistWorld (Ours)    0.86     24.42    0.070
Wrist      Ctrl-World             0.62     17.80    0.310
           PersistWorld (Ours)    0.67     19.39    0.277

* Numbers reported in the Ctrl-World paper; the unmarked Ctrl-World rows are our reproduction.
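
Of the three metrics in the table, PSNR is the simplest to reproduce exactly; a minimal sketch for images scaled to $[0, 1]$ (SSIM and LPIPS additionally require windowed statistics and a pretrained network, respectively, and are omitted here):

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two equally sized images,
    given as flat lists of pixel intensities in [0, max_val]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = [0.2, 0.5, 0.8, 0.4]   # toy 4-pixel "images", not real data
noisy = [0.25, 0.45, 0.85, 0.35]
print(round(psnr(clean, noisy), 2))  # → 26.02
```

Higher PSNR means lower mean squared error on a log scale, which is why it appears in the table with an up arrow.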

Paired-Comparison Win Rate

On a 1-to-1 paired comparison across all validation trajectories, PersistWorld outperforms the baseline on approximately 98% of samples ($p < 10^{-6}$). The distributions below show per-sample $\Delta_{\text{metric}}$ (ours − baseline): positive means we are better.

ΔLPIPS distribution
ΔPSNR distribution
ΔSSIM distribution

Per-sample metric deltas (PersistWorld − baseline) across all validation trajectories. The distributions are heavily concentrated on the positive side, confirming consistent improvements.
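
The win rate and its significance follow from the per-sample deltas via an exact sign test; a minimal sketch on toy deltas (the list below is invented for illustration, not the paper's data):

```python
import math

def win_rate_and_pvalue(deltas):
    """Per-sample metric deltas (ours - baseline, positive = we win).
    Returns the win rate and a one-sided exact sign-test p-value
    (binomial tail at p = 0.5, ties dropped)."""
    wins = sum(d > 0 for d in deltas)
    n = sum(d != 0 for d in deltas)
    # P(X >= wins) for X ~ Binomial(n, 1/2)
    p = sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins / n, p

deltas = [0.3, 0.1, 0.2, -0.05, 0.15, 0.4, 0.25, 0.1]  # toy example
rate, p = win_rate_and_pvalue(deltas)   # 7 wins out of 8
```

With the paper's ~98% win rate over hundreds of trajectories, the same tail sum drives the p-value far below $10^{-6}$.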

Qualitative Comparison

We compare long-horizon (≈11 s) wrist-camera generations for two representative scenarios. Despite both models starting from an identical ground-truth first frame, the baseline rapidly loses structural identity of manipulated objects and the robot arm. PersistWorld maintains spatial consistency throughout.

Object-centric fidelity. The baseline (red) loses object structure — a cup dissolves into an amorphous texture. Our model (green) maintains shape throughout.

Robot-centric consistency. The baseline arm loses geometric structure. PersistWorld preserves end-effector pose and joint configuration throughout the rollout.

Metric Trends Over Rollout Depth

Visual quality is plotted per clip index (each clip ≈0.8 s at 5 Hz), averaged over validation trajectories. The baseline degrades monotonically as exposure bias compounds; PersistWorld maintains substantially higher fidelity, with the gap widening at longer horizons.

External Cameras

LPIPS (external)
PSNR (external)
SSIM (external)

Wrist Camera

LPIPS (wrist)
PSNR (wrist)
SSIM (wrist)

Per-clip metric trends over 14 autoregressive steps (≈11 s). The baseline (orange) degrades monotonically; PersistWorld (blue) remains consistently better, with the advantage compounding over longer horizons.

Human Study

We conducted a blind two-alternative forced-choice (2AFC) study. Evaluators were shown the same ground-truth reference video alongside two anonymous model outputs (A and B, randomly assigned) and asked to judge which generated video was better overall, considering robot consistency and object consistency. Raters preferred PersistWorld over the baseline 80% of the time ($n = 200$ comparisons).
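
A quick way to put an uncertainty band on the reported 80% preference rate is a Wilson score interval; a sketch for $n = 200$ comparisons (the 95% level and $z = 1.96$ are conventional choices, not taken from the paper):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (95% coverage with the default z)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_interval(160, 200)  # 80% preference over n = 200
# the whole interval sits well above the 50% chance level of a 2AFC study
```

Even at the interval's lower end, the preference remains far from the coin-flip baseline, supporting the blind-study result.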

Human study win rate

Evaluator instructions shown before the study began.

The comparison interface: reference video (top) with synchronized Option A and Option B strips below.

Our 2AFC human study interface. A synchronized scrubber bar keeps all three video strips in lockstep so evaluators can compare the same moment across reference, A, and B. Model assignment was randomised to prevent position bias.

BibTeX

If you find this work useful, please cite:

@misc{bardhan2026persistworld,
  title  = {PersistWorld: Stabilizing Multi-step Robot World Model Rollouts
            via Reinforcement Learning},
  author = {Bardhan, Jai and Drozd\'{i}k, Patrik and \v{S}ivic, Josef
            and Petr\'{i}k, Vladim\'{i}r},
  year   = {2026},
  note   = {arXiv preprint}
}