World models
A world model is a learned predictor of what will happen next given the current observation and a candidate action. Replace the “what will happen” with anything: the next image frame, the next robot state, the reward thirty steps from now. Whatever the prediction target, a world model is the bridge between “imagine” and “commit.”
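As a minimal sketch of that definition, the interface can be written as a Python protocol. The names `WorldModel`, `predict`, and `ToyPointMassModel` are hypothetical and used only for illustration; the toy class is a hand-written stand-in for a learned predictor, not anything τ₀-WM ships.

```python
from typing import List, Protocol, Sequence


class WorldModel(Protocol):
    """Hypothetical interface: predict what happens next for (observation, action)."""

    def predict(self, observation: Sequence[float], action: Sequence[float]) -> List[float]:
        ...


class ToyPointMassModel:
    """Hand-written stand-in for a learned model.

    Observation is [position, velocity]; action is [acceleration].
    """

    def __init__(self, dt: float = 0.1) -> None:
        self.dt = dt

    def predict(self, observation: Sequence[float], action: Sequence[float]) -> List[float]:
        position, velocity = observation
        (acceleration,) = action
        # Explicit Euler step: position advances with the old velocity.
        next_position = position + velocity * self.dt
        next_velocity = velocity + acceleration * self.dt
        return [next_position, next_velocity]


model: WorldModel = ToyPointMassModel(dt=0.5)
imagined = model.predict([0.0, 2.0], [0.0])
```

Swapping the prediction target (a frame, a robot state, a reward) changes the return type, not the shape of the call.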
Why robotics is converging on world models now
For two decades the canonical robotics planner was a sampling-based search (RRT, PRM) over a kinematic state-space, sometimes wrapped in MPC over an analytically modelled environment. That works beautifully in structured industrial settings and breaks in three places:
- Unmodelled contact. Closed-form dynamics ignore friction transients, deformation, and stick-slip. A towel doesn’t fold the way a rigid-body simulator predicts.
- Semantic scenes. “Pick up the red mug” demands object recognition, grasping affordance, and pose estimation — three subsystems your motion planner cannot author.
- Multi-step task structure. “Unzip the school bag, then put the books in” is two policies and a re-grasp. Hard-coding the FSM doesn’t scale.
A generative model trained on enough robot, human, and synthetic video sidesteps all three. Friction shows up implicitly in the training distribution. Semantic prompts (“the red mug”) are conditioning. Multi-step tasks are just longer rollouts.
The τ₀-WM architecture
Midcore integrates τ₀-WM (“tau-zero world model”), published in May 2026 by Shanghai Innovation Institute and AGIBOT Finch under Apache 2.0. It combines two components trained on a single corpus:
Video Action Model (VAM) — 5.5 B parameters total
- Video branch (5 B): a DiT-style transformer cloned from Wan2.2-TI2V-5B. Predicts a future latent trajectory zₜ₊₁…zₜ₊ₕ conditioned on the current observation, the language instruction, and the robot state.
- Action branch (0.5 B): a second DiT-style decoder that emits the action chunk aₜ₊₁…aₜ₊ₖ for the same horizon.
- Coupling: feature-level cross-attention at matched transformer stages, so the action branch sees the same intermediate visual features that the video branch is using to imagine the future.
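To make the coupling concrete, here is a minimal single-head cross-attention sketch in plain Python. It assumes action tokens act as queries and video features act as keys and values, and it omits the learned projections, multiple heads, and normalisation a real DiT block would have. The function names are illustrative, not taken from the τ₀-WM code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Each query (action token) reads a weighted mix of values (video features)."""
    dim = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Scaled dot-product similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return outputs


# One action token attending over two identical video features:
# the attention weights are equal, so the output is the mean of the values.
action_tokens = [[1.0, 0.0]]
video_keys = [[1.0, 0.0], [1.0, 0.0]]
video_values = [[0.0, 0.0], [2.0, 2.0]]
mixed = cross_attention(action_tokens, video_keys, video_values)
```

The point of doing this at matched stages is that the action branch reads the video branch's intermediate features directly rather than only its final output.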
Action-Conditioned Video Simulator (ACVS)
- Reuses the same Wan VAE + video transformer backbone but removes the action-generating branch.
- Conditions on a candidate action chunk (treated as a clean input, not generated) plus the visual context.
- Outputs an imagined latent rollout and a per-frame reward trajectory r̂ₜ₊₁…r̂ₜ₊ₕ.
- Functions as a learned simulator: “If we ran this chunk, here’s what the world would look like and how it would score.”
VAM ships today; ACVS is gated
How a world model trains
τ₀-WM uses flow matching, the successor to denoising diffusion that’s now standard in large generative work. The model learns a velocity field that transports samples from a noise distribution to the data distribution along straight-line paths.
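A minimal sketch of one flow-matching training pair, in plain Python. It assumes the common convention that the noise level u = 0 is pure noise and u = 1 is data, with the interpolant x_u = (1 − u)·noise + u·data and target velocity data − noise. The source does not state which direction τ₀-WM uses, so treat the convention and the function names as assumptions.

```python
import random
from typing import List, Tuple

Vector = List[float]


def flow_matching_pair(x_data: Vector, rng: random.Random) -> Tuple[Vector, float, Vector]:
    """Build one training example: (noisy sample, noise level, target velocity)."""
    x_noise = [rng.gauss(0.0, 1.0) for _ in x_data]
    u = rng.random()  # noise level in [0, 1)
    # Point on the straight line between the noise sample and the data sample.
    x_u = [(1.0 - u) * n + u * d for n, d in zip(x_noise, x_data)]
    # Along a straight line the velocity is constant: data minus noise.
    velocity = [d - n for n, d in zip(x_noise, x_data)]
    return x_u, u, velocity


def flow_matching_loss(predicted: Vector, target: Vector) -> float:
    """Squared-error residual between predicted and target velocity."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


rng = random.Random(0)
x_u, u, v = flow_matching_pair([1.0, 2.0], rng)
# Following the true velocity for the remaining time (1 - u) lands on the data.
recovered = [x + (1.0 - u) * vi for x, vi in zip(x_u, v)]
```

A trained model replaces the known velocity with its own prediction, and sampling integrates that prediction from noise to data.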
The video loss and the action loss are flow-matching residuals weighted equally:
L_VAM = E[ λ_z · ‖f_θ^z(z̃, u_z, c_t, p) − v_z‖² + λ_a · ‖f_θ^a(ã, u_a, s_t, h) − v_a‖² ]
where c_t is the current context (visual + state + language), p is the prompt, h is the intermediate video feature the action branch attends to, and u is the noise level. Both weights are 1 in the released training recipe.
The clever part is modality-specific supervision masks: each clip in the training corpus contributes only the losses it can support. A human egocentric video (no action vector) contributes only the video loss; a teleop episode contributes both; a failure recovery contributes both plus a reward channel.
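The masking idea can be sketched in a few lines of Python. The clip kinds and the mask table mirror the three examples above; the dictionary layout, the function name, and the reward weight of 1.0 are assumptions for illustration (the source only states that the video and action weights are 1).

```python
from typing import Dict, Optional

# Which loss terms each kind of clip can supervise (from the examples above).
SUPERVISION_MASKS: Dict[str, Dict[str, bool]] = {
    "human_egocentric": {"video": True, "action": False, "reward": False},
    "teleop": {"video": True, "action": True, "reward": False},
    "failure_recovery": {"video": True, "action": True, "reward": True},
}

# Video and action weights are 1 per the text; the reward weight is an ASSUMPTION.
DEFAULT_WEIGHTS: Dict[str, float] = {"video": 1.0, "action": 1.0, "reward": 1.0}


def clip_loss(
    kind: str,
    residuals: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum only the loss terms this clip kind can support."""
    weights = weights or DEFAULT_WEIGHTS
    mask = SUPERVISION_MASKS[kind]
    # Masked-out channels are never read, so they may be absent from `residuals`.
    return sum(weights[name] * residuals[name] for name, enabled in mask.items() if enabled)


egocentric = clip_loss("human_egocentric", {"video": 2.0})
teleop = clip_loss("teleop", {"video": 2.0, "action": 3.0})
recovery = clip_loss("failure_recovery", {"video": 2.0, "action": 3.0, "reward": 0.5})
```

Because a masked channel contributes nothing to the loss, a clip without action labels still trains the video branch without pushing the action branch toward a made-up target.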
The training corpus
τ₀-WM’s 27,300 hours of pre-training data come from three sources, with rollout and failure trajectories mixed in:
| Source | Hours | What it teaches the model |
|---|---|---|
| Real-robot teleop | 17,800 | Action-grounded manipulation across AGIBOT-G01, ARX, dual-arm Franka platforms. |
| UMI-style handheld gripper | 6,500 | Universal Manipulation Interface — a person walks around with a gripper and captures the demonstration without needing a robot present. |
| Egocentric human video | 3,000 | Egodex, Egoverse, Xperience-10M — head-mounted-camera footage of humans doing tasks. Provides visual diversity without any robot data. |
| Rollout + failure trajectories | mixed in | Provides the reward channel ACVS needs to score futures. |
Total compute: 64 H100 GPUs for 42 hours of pre-training, then 16 H100 GPUs for 26 hours of post-training. That’s roughly 3,100 H100-hours per checkpoint: out of reach for most teams to repeat from scratch, but easily affordable for a domain fine-tune.
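The figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones in the table and the compute sentence.

```python
# Corpus hours per source, as listed in the table.
corpus_hours = {
    "real_robot_teleop": 17_800,
    "umi_handheld_gripper": 6_500,
    "egocentric_human_video": 3_000,
}
total_corpus_hours = sum(corpus_hours.values())  # 27,300

# GPU-hours = number of GPUs x wall-clock hours.
pretrain_gpu_hours = 64 * 42   # 2,688
posttrain_gpu_hours = 16 * 26  # 416
total_gpu_hours = pretrain_gpu_hours + posttrain_gpu_hours  # 3,104

# Share of the corpus that carries real-robot actions.
teleop_share = corpus_hours["real_robot_teleop"] / total_corpus_hours
```

Pre-training accounts for most of the budget: 2,688 of the 3,104 H100-hours.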
The action and state contract
τ₀-WM commits to a specific input/output shape that any compatible robot has to honour. Midcore’s Designer ships a “Dual-arm Franka FR3 (τ₀-WM ready)” template pre-configured to match.
| Channel | Direction | Shape | Frame |
|---|---|---|---|
| State | observation in | 14 ch = [left xyz + left quat (xyzw)] + [right xyz + right quat (xyzw)] | Each EE pose in its own arm-base frame. |
| Gripper state | observation in | 2 ch ∈ [0, 120] | 0 = open, 120 = closed. |
| Action | output | (T, 16) chunk = [left EE pose (7) + left gripper (1) + right EE pose (7) + right gripper (1)] × T | Same arm-base frames. Gripper output is normalised to [0, 1]. |
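A minimal sketch of unpacking one row of the action chunk according to the table above. Two things here are assumptions, not part of the stated contract: that each 7-value EE pose is ordered xyz then quaternion (xyzw), as in the state channel, and that the normalised gripper output maps linearly onto the [0, 120] state range with the same polarity. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

ACTION_WIDTH = 16
GRIPPER_STATE_MAX = 120.0  # state range is [0, 120]; 0 = open, 120 = closed


@dataclass
class ArmCommand:
    position: List[float]    # xyz in the arm-base frame
    quaternion: List[float]  # xyzw (ASSUMED to match the state channel ordering)
    gripper: float           # normalised to [0, 1]


def split_action_step(step: Sequence[float]) -> Tuple[ArmCommand, ArmCommand]:
    """Split one 16-value action row into left-arm and right-arm commands."""
    if len(step) != ACTION_WIDTH:
        raise ValueError(f"expected {ACTION_WIDTH} values, got {len(step)}")
    left = ArmCommand(list(step[0:3]), list(step[3:7]), step[7])
    right = ArmCommand(list(step[8:11]), list(step[11:15]), step[15])
    return left, right


def gripper_to_state_units(normalised: float) -> float:
    """ASSUMPTION: linear map from [0, 1] to [0, 120], clamped, same polarity."""
    clamped = min(1.0, max(0.0, normalised))
    return clamped * GRIPPER_STATE_MAX


row = [float(i) for i in range(16)]
left_cmd, right_cmd = split_action_step(row)
half_closed = gripper_to_state_units(0.5)
```

Check the template's actual gripper mapping before relying on the conversion; only the two ranges are given in the table.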
Internal vs wire representation
Why this architecture wins
Three design choices give τ₀-WM its edge over a flat VLA:
- Joint future + action learning. Training the model to predict images and actions together means the action branch attends to features that are also shaped by the visual prediction loss. That makes it harder for the action head to hallucinate: the video branch acts as a regulariser.
- Modality-mixed pre-training. Human egocentric video brings semantic and physical breadth (humans see more places and do more things than any robot fleet); UMI brings action-grounded demonstrations without requiring scarce teleop time; real robot teleop calibrates the embodiment-specific bits.
- A learned reward / simulator channel. ACVS is what lets the deployed policy do test-time computation: imagine N candidate futures, score them, pick the best, refine. Without that channel a policy is one-shot.
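The test-time loop in the last bullet (imagine N candidates, score them, pick the best) can be sketched as a best-of-N search. Everything below is a toy stand-in: `toy_propose` plays the role of the policy, `toy_simulate` plays the role of ACVS and returns a per-step reward list, and summing those rewards into one score is an assumption. The refine step is left out.

```python
import random
from typing import Callable, List, Tuple

Chunk = List[float]


def best_of_n(
    propose: Callable[[random.Random], Chunk],
    simulate: Callable[[Chunk], List[float]],
    n: int,
    rng: random.Random,
) -> Tuple[Chunk, float, List[float]]:
    """Sample n candidate chunks, score each imagined rollout, return the best."""
    candidates = [propose(rng) for _ in range(n)]
    # ASSUMPTION: a chunk's score is the sum of its per-step predicted rewards.
    scores = [sum(simulate(chunk)) for chunk in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index], scores


TARGET = 0.5  # toy goal: every action in the chunk should be close to 0.5


def toy_propose(rng: random.Random) -> Chunk:
    """Stand-in policy: a random 16-step, 1-D action chunk."""
    return [rng.uniform(0.0, 1.0) for _ in range(16)]


def toy_simulate(chunk: Chunk) -> List[float]:
    """Stand-in simulator: per-step reward is the negative squared error."""
    return [-((a - TARGET) ** 2) for a in chunk]


_, one_shot_score, _ = best_of_n(toy_propose, toy_simulate, n=1, rng=random.Random(0))
_, searched_score, _ = best_of_n(toy_propose, toy_simulate, n=32, rng=random.Random(0))
```

With the same seed the one-shot candidate is also the first of the 32, so the searched score can never be worse than the one-shot score. That is the sense in which a policy without a scoring channel is one-shot.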
What world models still can’t do
- Tactile-rich manipulation. Vision alone is too sparse for insertion, fastening, and deformable objects. The τ₀-WM authors flag this explicitly as future work.
- Calibrated uncertainty. The RCS confidence score is empirical — it works, but it’s not a calibrated probability. A model can confidently fail.
- Long-horizon planning. Action chunks are ~16 steps. Anything longer is stitched at the policy level (next chunk conditions on this chunk’s execution).
- Out-of-distribution embodiments. A 5-finger hand isn’t in the pre-training corpus. You can’t fine-tune your way to it.
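The chunk stitching mentioned under long-horizon planning amounts to a loop: emit a chunk, execute it, re-observe, emit the next chunk. Here is a toy sketch on a 1-D position task; `toy_policy`, `toy_step`, the 16-step chunk length, and the per-step limit are illustrative assumptions, not the deployed controller.

```python
from typing import Callable, List, Tuple

CHUNK_LEN = 16
MAX_STEP = 0.25  # toy per-step actuation limit
GOAL = 10.0


def toy_policy(observation: float) -> List[float]:
    """Stand-in policy: plan CHUNK_LEN equal steps toward the goal, clamped."""
    desired = (GOAL - observation) / CHUNK_LEN
    step = max(-MAX_STEP, min(MAX_STEP, desired))
    return [step] * CHUNK_LEN


def toy_step(observation: float, action: float) -> float:
    """Stand-in environment: position integrates the action."""
    return observation + action


def run_stitched(
    policy: Callable[[float], List[float]],
    step_env: Callable[[float, float], float],
    observation: float,
    num_chunks: int,
) -> Tuple[float, int]:
    """Execute num_chunks chunks back to back, re-planning from each new observation."""
    executed = 0
    for _ in range(num_chunks):
        chunk = policy(observation)  # conditioned on where the last chunk ended
        for action in chunk:
            observation = step_env(observation, action)
            executed += 1
    return observation, executed


after_one, _ = run_stitched(toy_policy, toy_step, 0.0, num_chunks=1)
after_three, steps = run_stitched(toy_policy, toy_step, 0.0, num_chunks=3)
```

In this toy, one chunk covers only part of the distance and three stitched chunks reach the goal. The world model itself never plans beyond a single chunk.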