Manipulation policies

A policy is the function that maps a robot’s current observation to an action. For VLA-class models like τ₀-WM, “an action” is a chunk of poses + gripper commands; “committing” that chunk means handing it to the robot’s low-level controller for physical execution. This page explains how that handoff stays safe.

Action chunks, not single actions

Early robot policies emitted one action per inference call (Decision Transformer, RT-1). The problem is that the policy and the controller race: at 100 Hz control with 150 ms inference, each new action arrives about 15 control ticks late, so the controller stalls. The fix is to emit action chunks: predict T future steps in one shot, execute them open-loop, and re-query at the end.
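The timing argument above reduces to two small formulas. The sketch below uses hypothetical helper names (`chunk_duration_ms`, `min_chunk_length`) and is only meant to make the arithmetic explicit; it is not taken from any particular policy implementation.

```python
import math


def chunk_duration_ms(num_steps: int, control_hz: float) -> float:
    """Wall-clock time covered by playing back `num_steps` actions open-loop."""
    return 1000.0 * num_steps / control_hz


def min_chunk_length(control_hz: float, inference_ms: float) -> int:
    """Smallest chunk length whose playback lasts at least one inference call."""
    # Number of control ticks that elapse while one inference call runs.
    ticks_per_inference = inference_ms * control_hz / 1000.0
    return math.ceil(ticks_per_inference)
```

With the figures above (100 Hz control, 150 ms inference), `min_chunk_length(100, 150)` gives 15, so a chunk needs at least 15 steps for its playback to last as long as one inference call.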

τ₀-WM’s output is a {T, 16} chunk — 16 channels per timestep (left EE pose + gripper + right EE pose + gripper), T timesteps deep. Typical T is 16 steps. At 30 Hz that buys you ~530 ms of execution per inference call — comfortably more than the 140–220 ms a single inference call takes on an RTX 5090.

Channel range	Meaning	Units
action[0:3]	Left EE position (xyz)	metres, arm-base frame
action[3:7]	Left EE orientation (quaternion xyzw)	unit quaternion
action[7]	Left gripper command	[0, 1] (0 = open, 1 = closed)
action[8:11]	Right EE position (xyz)	metres, arm-base frame
action[11:15]	Right EE orientation (quaternion xyzw)	unit quaternion
action[15]	Right gripper command	[0, 1] (0 = open, 1 = closed)
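The channel layout in the table can be unpacked with plain slicing. The following is a minimal sketch; the `ArmCommand` type and `split_step` function are illustrative names, not part of any published API.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ArmCommand:
    position: Tuple[float, ...]          # xyz, metres, arm-base frame
    orientation_xyzw: Tuple[float, ...]  # unit quaternion, xyzw order
    gripper: float                       # value in [0, 1]


def split_step(step: Sequence[float]) -> Tuple[ArmCommand, ArmCommand]:
    """Split one 16-channel timestep into (left, right) arm commands."""
    if len(step) != 16:
        raise ValueError(f"expected 16 channels, got {len(step)}")
    left = ArmCommand(tuple(step[0:3]), tuple(step[3:7]), step[7])
    right = ArmCommand(tuple(step[8:11]), tuple(step[11:15]), step[15])
    return left, right
```

A full chunk is then just a sequence of T such steps, for example `[split_step(s) for s in chunk]`.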

The proposal — evaluation — revision loop

A vanilla VLA emits one chunk and commits it. τ₀-WM’s contribution is a three-stage inference loop that materially improves single-attempt success rate (the published numbers go from 0.43 with no test-time computation to 0.60 with the full loop).

1. Propose

Given the current observation, prompt, and robot state, the policy samples N candidate action chunks. Each is a complete {T, 16} chunk; the diversity comes from the stochasticity of the flow-matching sampler.
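To illustrate where the diversity comes from, here is a toy flow-matching sampler: each candidate starts from a different Gaussian noise draw and is integrated with fixed-step Euler along a velocity field. Everything here is an assumption made for illustration: the two-mode `toy_velocity` field stands in for the learned policy, the chunk is flattened to a short vector, and the Euler solver is just one possible choice.

```python
import random
from typing import Callable, List

Vector = List[float]

# Hypothetical stand-in for the policy: two "good" chunks the flow can land on.
MODES = ([1.0, 1.0], [-1.0, -1.0])


def toy_velocity(x: Vector, t: float) -> Vector:
    """Velocity pointing at the nearest mode, scaled to arrive at t = 1."""
    target = min(MODES, key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
    return [(m - xi) / (1.0 - t) for m, xi in zip(target, x)]


def sample_chunk(velocity: Callable[[Vector, float], Vector],
                 dim: int, steps: int, rng: random.Random) -> Vector:
    """Integrate from Gaussian noise (t = 0) towards data (t = 1) with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so toy_velocity never divides by zero
        v = velocity(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


def propose(velocity: Callable[[Vector, float], Vector],
            n: int, dim: int, steps: int = 10, seed: int = 0) -> List[Vector]:
    """Draw n candidates; each uses its own noise seed."""
    return [sample_chunk(velocity, dim, steps, random.Random(seed + i))
            for i in range(n)]
```

Calling `propose(toy_velocity, n=4, dim=2)` returns four candidates, each of which lands on one of the two modes depending only on its initial noise.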

2. Evaluate — the Re-denoising Consistency Score (RCS)

RCS is a cheap distributional filter that asks: does this candidate look like something the policy would have generated from the input? Mechanically:

  for each candidate chunk a^(i):
      sample K random flow timesteps
      re-noise a^(i) along the flow process
      try to denoise it again with the policy
      RCS^(i) = − || re_denoised(a^(i)) − a^(i) ||²
  pick i* = argmax_i RCS^(i)

High score ⇒ the candidate sits on the manifold the policy learned. Low score ⇒ the candidate is something the policy can emit but not stably regenerate, which is a strong warning sign. RCS adds a few percent overhead on top of the original inference call.
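The pseudocode above can be sketched in plain Python as follows. Several details are assumptions, since the text does not pin them down: the re-noising uses a linear interpolation between the candidate and Gaussian noise, the squared error is averaged over the K timesteps, chunks are flattened to one list, and `denoise` is a hypothetical callable standing in for the policy. The raw score computed here is never positive; how it maps to the 0-to-1 values shown elsewhere on this page is not specified, so the sketch leaves it unscaled.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Chunk = List[float]  # flattened (T * 16) chunk, for brevity
Denoiser = Callable[[Chunk, float], Chunk]  # hypothetical policy hook


def rcs(candidate: Sequence[float], denoise: Denoiser,
        k: int = 4, rng: Optional[random.Random] = None) -> float:
    """Negative mean squared re-denoising error over k random flow timesteps."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(k):
        t = rng.uniform(0.1, 0.9)  # assumed range; avoids the trivial endpoints
        noise = [rng.gauss(0.0, 1.0) for _ in candidate]
        # Assumed re-noising rule: linear blend of candidate and noise.
        noisy = [(1.0 - t) * a + t * n for a, n in zip(candidate, noise)]
        recon = denoise(noisy, t)
        total += sum((r - a) ** 2 for r, a in zip(recon, candidate))
    return -total / k


def pick_best(candidates: Sequence[Sequence[float]], denoise: Denoiser,
              k: int = 4) -> Tuple[int, List[float]]:
    """Return (index of the highest-RCS candidate, all scores)."""
    scores = [rcs(c, denoise, k, random.Random(i))
              for i, c in enumerate(candidates)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores
```

A candidate that the denoiser reconstructs almost exactly scores close to zero, while one it pulls far away from scores strongly negative.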

Midcore exposes the RCS value directly in the Command panel:

RCS regime	Default threshold	Midcore commit decision
High (commit-ready)	RCS ≥ γ	Green "Execute" button. The chunk is committed without prompting.
Gated	γ > RCS ≥ floor	Amber "Force-confirm low confidence" gate. Requires explicit operator override.
Blocked	RCS < floor	No execute path. The proposal is still recorded on the audit ledger for review.

γ and the hard floor are policies, not magic

γ (the gating threshold, default 0.6) and the hard floor (default 0.2) are configuration. Tune them upward for high-risk tasks; loosen them in low-stakes settings (e.g., sandbox playback). They’re per-deployment, not per-customer.
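The three regimes in the table amount to a pair of threshold comparisons. A minimal sketch, using the defaults quoted above and hypothetical names (`Decision`, `commit_decision`) that are not Midcore's actual API:

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"              # commit-ready, no prompt
    FORCE_CONFIRM = "force-confirm"  # gated, needs operator override
    BLOCKED = "blocked"              # no execute path, still logged


def commit_decision(rcs_score: float, gamma: float = 0.6,
                    floor: float = 0.2) -> Decision:
    """Map an RCS value to a commit regime using the gamma and floor thresholds."""
    if floor > gamma:
        raise ValueError("floor must not exceed gamma")
    if rcs_score >= gamma:
        return Decision.EXECUTE
    if rcs_score >= floor:
        return Decision.FORCE_CONFIRM
    return Decision.BLOCKED
```

Raising `gamma` and `floor` for a high-risk task shrinks the commit-ready band; for example, `commit_decision(0.74, gamma=0.8)` falls into the gated regime even though 0.74 passes with the defaults.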

3. Rectify — Low-quality Action Rectification (LAR)

When RCS lands in the gated regime, the policy doesn’t give up — it asks ACVS for help. LAR is a one-shot correction:

Run ACVS on every candidate. For each, get an imagined latent rollout and a per-frame reward trajectory r̂ₜ₊₁…r̂ₜ₊ₕ.
Score each candidate by its peak reward over the rollout horizon, J^(i) = max over q = 1…h of r̂^(i)_{t+q}.
Pick j* = the candidate with the highest peak reward.
Convert candidate j*’s imagined latent into a future-conditioning input.
Re-query the policy with the original context plus this future condition. Return the corrected chunk.
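The steps above can be sketched as a small selection routine. Here ACVS and the policy are represented by hypothetical callables: `imagine` returns a (latent, reward trajectory) pair for a candidate, and `requery` returns a corrected chunk given a future-conditioning latent. Both signatures are assumptions made for illustration.

```python
from typing import Any, Callable, List, Sequence, Tuple

Rollout = Tuple[Any, Sequence[float]]  # (imagined latent, per-frame rewards)


def peak_rewards(rollouts: Sequence[Rollout]) -> List[float]:
    """J^(i): the maximum predicted reward along each candidate's rollout."""
    return [max(rewards) for _, rewards in rollouts]


def lar(candidates: Sequence[Any],
        imagine: Callable[[Any], Rollout],
        requery: Callable[[Any], Any]) -> Tuple[Any, int]:
    """One-shot rectification: returns (corrected chunk, index j* of the chosen candidate)."""
    rollouts = [imagine(c) for c in candidates]
    peaks = peak_rewards(rollouts)
    j_star = max(range(len(candidates)), key=lambda j: peaks[j])
    future_latent = rollouts[j_star][0]
    # The original context is assumed to be captured inside `requery`.
    return requery(future_latent), j_star
```

Scoring by the peak rather than the mean reward favours candidates whose imagined rollout reaches a clearly successful frame at some point within the horizon.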

Net effect: when RCS reports low confidence, LAR substitutes a better-justified chunk grounded in an explicitly imagined successful future. In the τ₀-WM ablations LAR lifts single-attempt success from 0.50 (with RCS) to 0.60 (with RCS + LAR).

The OpenPI policy protocol

The wire contract between policy and robot has converged on OpenPI, an open WebSocket protocol from Physical Intelligence. It’s minimal:

  client → server: msgpack({"method": "infer", "obs": {
                              obs_image_rgb,
                              prompt,
                              state,
                              gripper_states,
                              num_inference_steps,
                              sample_solver,
                              shift,
                              ...
                          }})
  server → client: msgpack({"actions": [[T, 16]],
                            "rcs_score": 0.74,
                            "lar_applied": false, ...})

OpenPI is now the de facto contract for pi-zero, pi-half, OpenVLA, τ₀-WM and most published VLA models. Midcore’s policy gateway speaks it natively, which is why swapping providers is a configuration change rather than an engineering project.
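On the client side, it helps to validate a reply before anything reaches the controller. The sketch below works on an already-decoded message (turning the msgpack bytes into a dict is left out) and uses only the field names shown in the message sketch above. The `PolicyReply` type and `parse_reply` function are hypothetical, and `[[T, 16]]` is read here as T rows of 16 values, which is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class PolicyReply:
    actions: List[List[float]]  # T rows, 16 channels each
    rcs_score: float
    lar_applied: bool


def parse_reply(msg: Dict[str, Any], channels: int = 16) -> PolicyReply:
    """Check the shape of a decoded reply and convert it to a typed record."""
    actions = msg["actions"]
    if not actions:
        raise ValueError("reply contains no action steps")
    for i, row in enumerate(actions):
        if len(row) != channels:
            raise ValueError(f"step {i} has {len(row)} channels, expected {channels}")
    return PolicyReply(
        actions=[[float(v) for v in row] for row in actions],
        rcs_score=float(msg["rcs_score"]),
        lar_applied=bool(msg["lar_applied"]),
    )
```

Rejecting malformed replies at this boundary keeps shape errors out of the low-level controller, where they are much harder to diagnose.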

Why every proposal is recorded

A subtle but important property: Midcore appends a record for every proposal — including the ones that fail to commit — to a tamper-evident audit log. The record includes the prompt, the state, the RCS score, whether LAR was applied, and whether the chunk was eventually executed.

This matters for two reasons:

Regulatory. Pharma manufacturing, surgical robotics, defence — all the high-stakes verticals need an offline-verifiable record of every autonomous physical action. RCS plus the proposal log gives you that out of the box.
Debugging. A bad deployment leaves a trail of low-RCS proposals that explain themselves. Without the log you’re reduced to guessing.
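One common way to make a log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The text does not say how Midcore implements its ledger, so the following is a generic sketch with hypothetical function names, not a description of the actual system.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    body = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: List[Dict[str, Any]], record: Dict[str, Any]) -> Dict[str, Any]:
    """Append a proposal record, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "record": record, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash depends on everything before it, an auditor holding only the latest hash can detect changes to any earlier proposal, including the ones that never executed.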

A model’s confidence is not a guarantee

High RCS means the candidate is plausible given the training distribution. The training distribution doesn’t cover every situation your robot will encounter. RCS-gated commits are a safety multiplier, not a guarantee — always pair them with hard kinematic limits, force/torque limits, and the e-stop.
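As one illustration of a hard kinematic limit that is independent of the model's confidence, the sketch below rejects any chunk whose end-effector positions leave a workspace box or jump too far between consecutive steps. The function name, the box, and the step limit are hypothetical, and a software check like this complements rather than replaces controller-level limits and the e-stop.

```python
import math
from typing import Sequence


def chunk_within_limits(chunk: Sequence[Sequence[float]],
                        box_min: Sequence[float],
                        box_max: Sequence[float],
                        max_step_m: float) -> bool:
    """True if both EE positions stay inside the box and move at most max_step_m per step."""
    prev = None
    for step in chunk:
        # Position channels per the layout table: left xyz = 0:3, right xyz = 8:11.
        arms = (step[0:3], step[8:11])
        for pos in arms:
            if any(p < lo or p > hi for p, lo, hi in zip(pos, box_min, box_max)):
                return False
        if prev is not None:
            for pos, prev_pos in zip(arms, prev):
                if math.dist(pos, prev_pos) > max_step_m:
                    return False
        prev = arms
    return True
```

A chunk that fails this check would be rejected regardless of its RCS value.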