Recoverability Audits for Humanoid Push Recovery

Steven Cao & Jay Wu

A humanoid survives one push. How fragile is it to the next one? We frame post-disturbance fragility as a post-hoc audit of a frozen policy, which is, to our knowledge, a new angle on push recovery. We clone a disturbed state into parallel Monte Carlo rollouts to label how recoverable it is, distill those labels into a 16k-parameter classifier, and use the score to switch a frozen walking policy to a recovery policy only when needed.

The recoverability audit in action: from one disturbed state, parallel branches each take a fresh push, and the fraction that survive becomes a recoverability score.

All quantitative experiments run in NVIDIA Isaac Gym; the robot demonstration clips on this page are rendered in MuJoCo, so the visuals differ slightly from the training simulator.

TL;DR

The Hidden-Fragility Problem

Legged robots can recover from pushes; we train a humanoid walking policy with random pushes injected during training. The catch: in deployment the hits keep coming, and surviving the first doesn't mean you're stable.

Baseline walking policy taking a random push (red arrow) and recovering.

Now apply two pushes. The first differs across the two robots; the second is identical. Only one recovers, so the other was already in a fragile state that looked perfectly fine.

Identical second push (yellow arrow), opposite outcomes: fragility is difficult to spot by eye.

The question: after a disturbance, can we approximate how fragile the state is to future disturbances?

What's New

We label fragility with parallel Monte Carlo rollouts (clone a disturbed state, push each branch, score the fraction that survive) and distill it into a classifier. That recipe isn't new on its own. What is new is how we use it:

Method: The Recoverability Audit

We freeze the robot 15 steps after a push. That snapshot is the state we audit.

A push (red arrow), then we freeze a few steps later: this frozen state is what we score.

We clone the snapshot into 50 parallel branches, hit each with a fresh random push, and roll forward under the frozen policy. The fraction that survive is the state's recoverability score, which becomes the ground-truth label for our dataset.
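
The clone-push-roll-count procedure above can be sketched in a few lines. This is a minimal illustration, not our Isaac Gym implementation: `toy_step` is a hypothetical 1-D balancing model standing in for "simulator plus frozen policy", and the time step, fall threshold, controller gains, and uniform push distribution are all assumptions made only so the example runs on its own. The branch count (50), horizon (100 steps), and push speed (1.5) mirror the setup described here.

```python
import random

DT = 0.02          # toy time step (assumption)
FALL_TILT = 0.8    # toy fall threshold in radians (assumption)


def toy_step(state, kick=0.0):
    """One step of a toy 1-D balancer under a fixed, torque-limited controller.

    Hypothetical stand-in for "simulator + frozen policy"; not G1 dynamics.
    """
    tilt, rate = state
    rate += kick                                               # push as a velocity kick
    control = max(-6.0, min(6.0, -20.0 * tilt - 6.0 * rate))   # saturated PD "policy"
    rate += (9.0 * tilt + control) * DT                        # unstable pendulum + control
    tilt += rate * DT
    return (tilt, rate)


def recoverability_score(snapshot, n_branches=50, horizon=100, push_speed=1.5, seed=0):
    """Fraction of branches cloned from `snapshot` that survive one fresh random push."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_branches):
        state = snapshot                                # clone (tuples are immutable)
        kick = rng.uniform(-push_speed, push_speed)     # a different push per branch
        fell = False
        for t in range(horizon):
            state = toy_step(state, kick if t == 0 else 0.0)
            if abs(state[0]) > FALL_TILT:               # branch has fallen
                fell = True
                break
        if not fell:
            survived += 1
    return survived / n_branches


# Score one snapshot; the value is the label that would go into the dataset.
print(recoverability_score((0.05, 0.2)))
```

In the real pipeline the branches run in parallel simulator environments rather than in a Python loop, but the label is the same quantity: survivors divided by branches.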

Branches cloned from one snapshot, each given a different future push; the share that survive is the recoverability score. A 16-branch rollout is shown here for illustration.

Setup & reproducibility

Simulator: NVIDIA Isaac Gym (4096 envs)
Robot: Unitree G1
Base policy: Standard walking policy, PPO-trained with pushes ≤1.5 m/s.
Audit: Snapshot state 15 steps post-push → 50 branches, one random 1.5 m/s push → roll 100 steps.
Dataset: 50k snapshots (40k/10k), ~81% recoverable
Classifier: Multilayer perceptron with 16,513 params
Recovery policy: PPO fine-tune; upright-focused reward, no pushes

Fragility Classifier

We distill the audit labels into a 16k-parameter MLP over the robot's observations alone. Notably, it aims to reproduce the expensive 50-rollout audit in a single forward pass, cheap enough to query every control step.
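
To give a feel for how small a 16,513-parameter MLP is, the helper below counts weights and biases for a fully connected network. The two layer layouts are hypothetical: they were picked only because they happen to total 16,513, and they do not describe the classifier's actual architecture or observation size.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical layouts (assumptions), each ending in a single output score.
print(mlp_param_count([63, 128, 64, 1]))   # 16513
print(mlp_param_count([127, 128, 1]))      # 16513
```

Several different shapes reach the same total, so the parameter count alone does not pin down the architecture. It does show that the per-step cost is a handful of small matrix multiplies.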

To test whether the audit correctly predicts fragility, we push the robot, wait, audit, then push again and check whether the outcome matches the audit's verdict.

Green: the robot recovers from the 2nd push
Red: the robot falls from the 2nd push

How Can We Use This?

A cheap, per-step fragility score is only worthwhile if it changes what the robot does. We explored two main ways to put it to work:

1. Closed-Loop Policy Switching

When the audit classifier detects a fragile state, we switch from the baseline walking policy to a more stable policy (e.g. recovery policy). Using our earlier protocol, on a:

Building a Recovery Policy

To test switching, we first need a more stable policy to switch to. We fine-tuned the baseline walking policy on episodes that start from recorded fragile post-push states, adjusting the rewards to focus on staying upright instead of tracking a forward velocity.

Recovery: 34%
Walking: 22%

Success rate on 100 fragile-start episodes with the recovery policy always on (about 54% better than the walking policy, in relative terms).

Outcome breakdown across the 100 episodes

Only recovery succeeds: 18
Only baseline succeeds: 6
Both succeed: 16
Both fail: 60
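
The headline success rates follow directly from this breakdown, as the short check below shows. It uses only the four counts reported above; the variable names are ours.

```python
breakdown = {
    "only_recovery": 18,
    "only_baseline": 6,
    "both_succeed": 16,
    "both_fail": 60,
}

total = sum(breakdown.values())
recovery_successes = breakdown["only_recovery"] + breakdown["both_succeed"]
walking_successes = breakdown["only_baseline"] + breakdown["both_succeed"]
relative_gain = recovery_successes / walking_successes - 1

print(total, recovery_successes, walking_successes)   # 100 34 22
print(f"{relative_gain:.1%}")                          # 54.5%
```

The 12-point absolute gap (34% vs 22%) is the same result as the roughly 54% relative improvement. The 60 episodes where both policies fail mark the states neither policy can save.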

Closing the Loop

The audit drives the handoff in closed loop: when it flags the state as fragile, control switches from the walking policy to the recovery policy, then switches back to walking once the state is no longer fragile.
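
A minimal sketch of this handoff is below. It assumes the classifier output is expressed as a fragility score where higher means more fragile, and that a state is flagged when the score reaches a threshold; both the sign convention and the function names are assumptions for illustration. The rule is memoryless, so control returns to walking as soon as the score drops back under the threshold.

```python
def choose_policy(fragility, threshold):
    """Pick the policy that acts this control step (assumed: higher = more fragile)."""
    return "recovery" if fragility >= threshold else "walking"


def run_switching(fragility_trace, threshold):
    """Map a per-step fragility trace to the active policy at each step."""
    return [choose_policy(f, threshold) for f in fragility_trace]


# Made-up trace: fragility spikes after a push, then settles.
trace = [0.1, 0.2, 0.5, 0.4, 0.2]
print(run_switching(trace, threshold=0.34))
# ['walking', 'walking', 'recovery', 'recovery', 'walking']
```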

Demo Legend

  • Green: walking policy ON
  • Orange: first push applied
  • Purple: recovery policy ON

Does Switching Actually Save the Robot?

Same fragile state, same pushes, with and without the audit-triggered switch:

No switching (baseline walking only): falls.
Audit-triggered switch to recovery: stays up.

2. Audit-Shaped Reward for the Walking Policy

We also feed the audit back into the original walking policy as a training signal, aiming to teach the policy to steer clear of fragile states in the first place.

We fine-tune our best baseline walking policy for about 3000 iterations with an additional reward proportional to the fragility classifier's recoverability score. We call the result the audit-shaped baseline.
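
In code, the shaping term amounts to one extra addend per step. The sketch below assumes the recoverability score lies in [0, 1]; the `weight` value is a hypothetical placeholder, not the coefficient used in our training.

```python
def audit_shaped_reward(task_reward, recoverability, weight=0.1):
    """Task reward plus a bonus proportional to the classifier's recoverability score.

    `weight` is a hypothetical coefficient; `recoverability` is assumed to be in [0, 1].
    """
    recoverability = max(0.0, min(1.0, recoverability))   # guard against out-of-range scores
    return task_reward + weight * recoverability


# A more recoverable state earns a slightly higher reward for the same task progress.
print(audit_shaped_reward(1.0, 0.9) > audit_shaped_reward(1.0, 0.3))   # True
```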

A side-by-side example of the original baseline and the audit-shaped baseline on the same matched episode:

Original baseline walking policy.
Audit-shaped baseline, fine-tuned with the audit reward.

Results / Experiments

Slowing Movements when Fragile: Not Effective

We also tried slowing down the walking policy's actions whenever the audit flagged fragility. Layering the audit on top in this way showed no significant improvement, so we turned to policy switching instead.

Tuning Policy Switching

When to switch from the walking policy to the recovery policy is determined by:

Our first settings were conservative: a high threshold (τ = 0.46), checked late (15 steps after the push). With these settings, switching policies barely helped (46% → 45%). Flagging fragility more readily (τ ≈ 0.30–0.34) and checking sooner (10 steps after the push) gave a small but real gain.

Stacking the Interventions

Our two uses of the audit can also be combined: switching to recovery, and an audit-shaped baseline (the walking policy fine-tuned with the audit signal, see Audit-Shaped Reward). We ran all four configurations on the same 100 matched episodes.

Episodes recovered out of 100 (same matched starts and pushes)

Policy                   No switch    + Switch to recovery
Baseline                 46           49
Audit-shaped baseline    48           52

Each intervention helps a little on its own, and they stack to the best result (52/100). The gains stay modest because on many of the 100 episodes the robot is already past saving by the time the audit fires, often more than halfway to the ground, so no intervention can recover it.

Restricting to the 24 episodes where the robot was still upright and recoverable at audit time, the same ordering holds and the gains are larger.

Success rate on the 24 still-recoverable episodes

Policy                   No switch    + Switch to recovery
Baseline                 50%          62.5%
Audit-shaped baseline    66.7%        70.8%
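
With only 24 episodes, each percentage corresponds to a whole number of recovered episodes. The snippet below converts the reported rates back to counts and checks that they round to the same percentages; the dictionary keys are our own labels.

```python
N_EPISODES = 24

reported_rates = {
    ("baseline", "no switch"): 50.0,
    ("baseline", "switch"): 62.5,
    ("audit-shaped", "no switch"): 66.7,
    ("audit-shaped", "switch"): 70.8,
}

# Nearest whole number of recovered episodes for each configuration.
counts = {cfg: round(rate / 100 * N_EPISODES) for cfg, rate in reported_rates.items()}
print(list(counts.values()))   # [12, 15, 16, 17]

# Each count reproduces the reported percentage to one decimal place.
for cfg, k in counts.items():
    assert round(100 * k / N_EPISODES, 1) == reported_rates[cfg]
```

The full spread between the weakest and strongest configuration is five episodes (12 to 17 of 24), so these rates are best read as indicative, not precise.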

Future Direction

By the time the audit fires, many states are already past saving. The most useful next step is auditing earlier, or anticipating fragility before the state collapses, so an intervention still has room to work.

The fragility labels also point to which states are worth retraining on. This invites a loop that explores for fragile states, retrains the policy on them, and re-audits. It would also be interesting to see whether this approach transfers to real hardware.

Acknowledgements

We thank Professor Unnat Jain for his guidance and support throughout this project.

@misc{cao_wu_2026_recoverability_audits,
  title        = {Recoverability Audits for Humanoid Push Recovery},
  author       = {Cao, Steven and Wu, Jay},
  year         = {2026},
  howpublished = {\url{https://github.com/jotalis/RA-HPR}}
}