Manipulation policies
A policy is the function that maps a robot’s current observation to an action. For VLA-class models like τ₀-WM, “an action” is a chunk of poses + gripper commands; “committing” that chunk means handing it to the robot’s low-level controller for physical execution. This page explains how that handoff stays safe.
Action chunks, not single actions
Early robot policies emitted one action per inference call (Decision Transformer, RT-1). The problem is that the policy and the controller race: a 100 Hz controller expects a new command every 10 ms, so with 150 ms inference it stalls waiting for the policy. The fix is to emit action chunks: predict T future steps in one shot, execute them open-loop, and re-query at the end.
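The chunked pattern above can be sketched as a short loop. This is a minimal illustration, not the actual τ₀-WM or Midcore runtime: `infer_chunk`, `get_observation`, and `send_to_controller` are hypothetical callables, and a real deployment would also pace the inner loop to the control rate.

```python
from typing import Callable, List, Sequence

Action = Sequence[float]


def run_chunked(infer_chunk: Callable[[dict], List[Action]],
                get_observation: Callable[[], dict],
                send_to_controller: Callable[[Action], None],
                num_chunks: int) -> int:
    """Query the policy once per chunk and execute every step open-loop.

    Returns the total number of actions sent to the controller.
    """
    sent = 0
    for _ in range(num_chunks):
        obs = get_observation()       # observe only at chunk boundaries
        chunk = infer_chunk(obs)      # one inference call yields T actions
        for action in chunk:          # open-loop: no re-observation inside a chunk
            send_to_controller(action)
            sent += 1
    return sent


# Toy stand-ins (hypothetical) so the sketch runs without a robot.
executed: List[Action] = []
total = run_chunked(
    infer_chunk=lambda obs: [[0.0] * 16 for _ in range(16)],
    get_observation=lambda: {"step": len(executed)},
    send_to_controller=executed.append,
    num_chunks=3,
)
```

With three chunks of 16 steps each, the controller receives 48 actions from only three inference calls.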
τ₀-WM’s output is a {T, 16} chunk — 16 channels per timestep (left EE pose + gripper + right EE pose + gripper), T timesteps deep. Typical T is 16 steps. At 30 Hz that buys you ~530 ms of execution per inference call — comfortably more than the 140–220 ms a single inference call takes on an RTX 5090.
| Channel range | Meaning | Units |
|---|---|---|
| action[0:3] | Left EE position (xyz) | metres, arm-base frame |
| action[3:7] | Left EE orientation (quaternion xyzw) | unit quaternion |
| action[7] | Left gripper openness | [0, 1] (0 = open, 1 = closed) |
| action[8:11] | Right EE position (xyz) | metres, arm-base frame |
| action[11:15] | Right EE orientation (quaternion xyzw) | unit quaternion |
| action[15] | Right gripper openness | [0, 1] |
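As a sketch of how a consumer might unpack the layout in the table, the snippet below splits one 16-channel timestep into per-arm commands and computes how long a chunk lasts at a given control rate. `ArmCommand`, `split_step`, and `chunk_duration_ms` are illustrative names, and the code assumes the right gripper follows the same 0 = open, 1 = closed convention as the left.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ArmCommand:
    position: Tuple[float, ...]      # xyz, metres, arm-base frame
    orientation: Tuple[float, ...]   # unit quaternion, xyzw order
    gripper: float                   # 0 = open, 1 = closed (assumed for both arms)


def split_step(step: Sequence[float]) -> Tuple[ArmCommand, ArmCommand]:
    """Split one 16-channel timestep into (left, right) arm commands."""
    if len(step) != 16:
        raise ValueError(f"expected 16 channels, got {len(step)}")
    left = ArmCommand(tuple(step[0:3]), tuple(step[3:7]), step[7])
    right = ArmCommand(tuple(step[8:11]), tuple(step[11:15]), step[15])
    return left, right


def chunk_duration_ms(num_steps: int, control_hz: float) -> float:
    """Wall-clock time covered by one chunk at the given control rate."""
    return 1000.0 * num_steps / control_hz


# One made-up timestep: identity orientations, left gripper open, right closed.
step = [0.1, 0.2, 0.3, 0.0, 0.0, 0.0, 1.0, 0.0,
        0.4, 0.5, 0.6, 0.0, 0.0, 0.0, 1.0, 1.0]
left, right = split_step(step)
duration = chunk_duration_ms(16, 30.0)   # 16 steps at 30 Hz, about 533 ms
```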
The proposal — evaluation — revision loop
A vanilla VLA emits one chunk and commits it. τ₀-WM’s contribution is a three-stage inference loop that materially improves single-attempt success rate (the published numbers go from 0.43 with no test-time computation to 0.60 with the full loop).
1. Propose
Given the current observation, prompt, and robot state, the policy samples N candidate action chunks. Each is a complete {T, 16} chunk; the diversity comes from the stochasticity of the flow-matching sampler.
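A rough sketch of the propose stage, assuming the sampler can be treated as a black box that returns a different chunk on each call. `toy_sampler` is a placeholder that only draws Gaussian noise; it stands in for the real flow-matching sampler, which this page does not specify.

```python
import random
from typing import List

Chunk = List[List[float]]   # T rows of 16 channels


def toy_sampler(rng: random.Random, horizon: int = 16, channels: int = 16) -> Chunk:
    """Placeholder for the stochastic sampler: returns one random chunk."""
    return [[rng.gauss(0.0, 0.05) for _ in range(channels)]
            for _ in range(horizon)]


def propose(num_candidates: int, seed: int = 0) -> List[Chunk]:
    """Draw N independent candidate chunks from the same context."""
    rng = random.Random(seed)
    return [toy_sampler(rng) for _ in range(num_candidates)]


candidates = propose(num_candidates=4)
```

Each candidate has the full chunk shape, and the candidates differ only because the sampler is stochastic.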
2. Evaluate — the Re-denoising Consistency Score (RCS)
RCS is a cheap distributional filter that asks: does this candidate look like something the policy would have generated from the input? Mechanically:
    for each candidate chunk a^(i):
        sample K random flow timesteps
        re-noise a^(i) along the flow process
        try to denoise it again with the policy
        RCS^(i) = − || re_denoised(a^(i)) − a^(i) ||²
    pick i* = argmax_i RCS^(i)

High score ⇒ the candidate sits on the manifold the policy learned. Low score ⇒ the candidate is something the policy can emit but not stably regenerate, which is a strong warning sign. RCS adds a few percent overhead on top of the original inference call.
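The scoring procedure can be sketched in plain Python. Several details here are assumptions rather than the published method: the re-noising uses a linear interpolation between the chunk and Gaussian noise, the K reconstruction errors are averaged, and `toy_denoise` is a stand-in for the policy that always snaps back to a single learned mode.

```python
import random
from typing import Callable, List

Vector = List[float]                          # a chunk flattened to one vector
Denoiser = Callable[[Vector, float], Vector]  # (noisy chunk, t) -> clean estimate


def rcs(chunk: Vector, denoise: Denoiser, num_timesteps: int,
        rng: random.Random) -> float:
    """Negative mean squared reconstruction error over K random timesteps."""
    total = 0.0
    for _ in range(num_timesteps):
        t = rng.uniform(0.1, 0.9)                       # random flow timestep
        noise = [rng.gauss(0.0, 1.0) for _ in chunk]
        # Assumed noising path: linear blend of the chunk and the noise.
        noisy = [(1.0 - t) * a + t * e for a, e in zip(chunk, noise)]
        recon = denoise(noisy, t)                       # re-denoise with the policy
        total += sum((r - a) ** 2 for r, a in zip(recon, chunk))
    return -total / num_timesteps


# Toy policy (hypothetical): it has learned exactly one mode.
mode = [0.5] * 8


def toy_denoise(noisy: Vector, t: float) -> Vector:
    return list(mode)


candidates = [[0.5] * 8, [0.9] * 8, [0.0] * 8]
scores = [rcs(c, toy_denoise, 4, random.Random(0)) for c in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)
```

The candidate that coincides with the learned mode is regenerated exactly and receives the highest score; the further a candidate sits from the mode, the lower its score.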
Midcore exposes the RCS value directly in the Command panel:
| RCS regime | Default threshold | Midcore commit decision |
|---|---|---|
| High (commit-ready) | RCS ≥ γ | Green "Execute" button. The chunk is committed without prompting. |
| Gated | γ > RCS ≥ floor | Amber "Force-confirm low confidence" gate. Requires explicit operator override. |
| Blocked | RCS < floor | No execute path. The proposal is still recorded on the audit ledger for review. |
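The three regimes in the table map onto a small decision function. This is a sketch only: `Decision` and `commit_decision` are illustrative names rather than Midcore APIs, and the threshold values used in the example are made up.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"              # green "Execute" button
    FORCE_CONFIRM = "force_confirm"  # amber gate, operator override required
    BLOCKED = "blocked"              # no execute path, still logged


def commit_decision(rcs_score: float, gamma: float, floor: float) -> Decision:
    """Map an RCS value onto the commit regimes from the table."""
    if floor > gamma:
        raise ValueError("floor must not exceed gamma")
    if rcs_score >= gamma:
        return Decision.EXECUTE
    if rcs_score >= floor:
        return Decision.FORCE_CONFIRM
    return Decision.BLOCKED


# Illustrative thresholds only; real deployments set their own.
GAMMA, FLOOR = 0.7, 0.4
decisions = [commit_decision(s, GAMMA, FLOOR) for s in (0.74, 0.55, 0.10)]
```

Both comparisons are inclusive on the lower bound, matching the "≥" in the table, so a score exactly equal to γ executes and a score exactly equal to the floor is gated rather than blocked.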
γ and the hard floor are policies, not magic
3. Rectify — Low-quality Action Rectification (LAR)
When RCS lands in the gated regime, the policy doesn’t give up — it asks ACVS for help. LAR is a one-shot correction:
- Run ACVS on every candidate. For each, get an imagined latent rollout and a per-frame reward trajectory r̂ₜ₊₁…r̂ₜ₊ₕ.
- Score each candidate by its peak reward, J^(i) = max_q r̂^(i)_{t+q}, taken over q = 1…h.
- Pick j* = the candidate with the highest peak reward.
- Convert candidate j*’s imagined latent into a future-conditioning input.
- Re-query the policy with the original context plus this future condition. Return the corrected chunk.
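The selection and re-query steps above can be sketched as follows. `imagine` and `requery` are hypothetical callables standing in for ACVS and the policy, and the toy latents and rewards are invented for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Chunk = List[List[float]]
Rollout = Tuple[Any, Sequence[float]]   # (imagined latent, per-frame rewards)


def pick_best(reward_trajectories: Sequence[Sequence[float]]) -> int:
    """Return j*: the index whose reward trajectory has the highest peak."""
    peaks = [max(rewards) for rewards in reward_trajectories]
    return max(range(len(peaks)), key=peaks.__getitem__)


def rectify(candidates: Sequence[Chunk],
            imagine: Callable[[Chunk], Rollout],
            requery: Callable[[Any, Any], Any],
            context: Any) -> Any:
    """One-shot correction: imagine, pick the best future, re-query the policy."""
    rollouts = [imagine(c) for c in candidates]
    j_star = pick_best([rewards for _, rewards in rollouts])
    future_latent, _ = rollouts[j_star]
    return requery(context, future_latent)


# Toy stand-ins (hypothetical): candidates are tagged by their first value.
toy_rewards = {0: [0.1, 0.3, 0.2], 1: [0.2, 0.8, 0.4], 2: [0.5, 0.5, 0.6]}
candidates = [[[float(i)] * 16] for i in range(3)]


def toy_imagine(chunk: Chunk) -> Rollout:
    idx = int(chunk[0][0])
    return f"latent-{idx}", toy_rewards[idx]


def toy_requery(context: Any, future: Any) -> dict:
    return {"context": context, "future": future}


result = rectify(candidates, toy_imagine, toy_requery, context="pick up the cup")
```

Candidate 1 has the highest peak reward (0.8), so its imagined latent becomes the future condition for the re-query.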
Net effect: when RCS reports low confidence, LAR substitutes a better-justified chunk grounded in an explicitly imagined successful future. In the τ₀-WM ablations LAR lifts single-attempt success from 0.50 (with RCS) to 0.60 (with RCS + LAR).
The OpenPI policy protocol
The wire contract between policy and robot has converged on OpenPI, an open WebSocket protocol from Physical Intelligence. It’s minimal:
    client → server: msgpack({"method": "infer", "obs": {
        obs_image_rgb,
        prompt,
        state,
        gripper_states,
        num_inference_steps,
        sample_solver,
        shift,
        ...
    }})
    server → client: msgpack({"actions": [[T, 16]],
                              "rcs_score": 0.74,
                              "lar_applied": false, ...})

OpenPI is now the de-facto contract for pi-zero, pi-half, OpenVLA, τ₀-WM and most published VLA models. Midcore’s policy gateway speaks it natively, which is why swapping providers is a configuration change rather than an engineering project.
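A client-side sketch of the message shapes in this section, kept to the Python standard library. It only builds and checks plain dictionaries; msgpack encoding and the WebSocket transport are left out. `build_infer_request` and `parse_infer_response` are hypothetical helpers, the field values are placeholders, and the code assumes `actions` arrives as a nested T × 16 list.

```python
from typing import Any, Dict, List

CHANNELS = 16


def build_infer_request(obs_image_rgb: Any, prompt: str, state: List[float],
                        gripper_states: List[float],
                        **sampler_options: Any) -> Dict[str, Any]:
    """Assemble the request dictionary that would be msgpack-encoded."""
    obs: Dict[str, Any] = {
        "obs_image_rgb": obs_image_rgb,
        "prompt": prompt,
        "state": state,
        "gripper_states": gripper_states,
    }
    obs.update(sampler_options)   # e.g. num_inference_steps, sample_solver, shift
    return {"method": "infer", "obs": obs}


def parse_infer_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Check the decoded response and summarise the fields named on this page."""
    actions = response["actions"]
    if not actions or any(len(step) != CHANNELS for step in actions):
        raise ValueError("actions must be a non-empty T x 16 array")
    return {
        "horizon": len(actions),
        "rcs_score": float(response["rcs_score"]),
        "lar_applied": bool(response["lar_applied"]),
    }


# Placeholder values throughout; none of these come from a real robot.
request = build_infer_request(obs_image_rgb=b"", prompt="stack the blocks",
                              state=[0.0] * 16, gripper_states=[0.0, 0.0],
                              num_inference_steps=10)
response = {"actions": [[0.0] * CHANNELS for _ in range(16)],
            "rcs_score": 0.74, "lar_applied": False}
summary = parse_infer_response(response)
```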
Why every proposal is recorded
A subtle but important property: Midcore appends a record for every proposal — including the ones that fail to commit — to a tamper-evident audit log. The record includes the prompt, the state, the RCS score, whether LAR was applied, and whether the chunk was eventually executed.
This matters for two reasons:
- Regulatory. Pharma manufacturing, surgical robotics, defence — all the high-stakes verticals need an offline-verifiable record of every autonomous physical action. RCS plus the proposal log gives you that out of the box.
- Debugging. A bad deployment leaves a trail of low-RCS proposals that explain themselves. Without the log you’re reduced to guessing.
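One common way to make an append-only log tamper-evident is a hash chain, where each entry commits to the hash of the previous one. The sketch below illustrates that general technique with the record fields listed above; it is an assumption for illustration, not a description of how Midcore's audit log is actually built, and `append_record` and `verify` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64   # placeholder hash that precedes the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    body = json.dumps(record, sort_keys=True)   # canonical field ordering
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Append a proposal record that commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev_hash": prev_hash, "record": record,
                "hash": _digest(prev_hash, record)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log: List[Dict[str, Any]] = []
append_record(log, {"prompt": "pick up the cup", "state": [0.0, 0.1],
                    "rcs_score": 0.31, "lar_applied": True, "executed": False})
append_record(log, {"prompt": "pick up the cup", "state": [0.0, 0.1],
                    "rcs_score": 0.74, "lar_applied": False, "executed": True})
intact = verify(log)
```

Changing any field of an earlier record, such as flipping `executed`, makes `verify` return `False`, which is what allows the log to be checked offline.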
A model’s confidence is not a guarantee
What to read next
You now understand what a chunk is, how it gets scored, and what gets logged when. Two natural next steps:
- To make the model better at your task, you need to capture examples in a format it can fine-tune on. That’s Datasets.
- To actually run the loop above on a robot in front of you, Using the app walks through every screen.