Recoverability Audits for Humanoid Push Recovery
A humanoid survives one push. How fragile is it to the next one? We frame post-disturbance fragility as a post-hoc audit of a frozen policy, to our knowledge a new angle on push recovery: cloning a disturbed state into parallel Monte Carlo rollouts to label how recoverable it is, distilling that into a 16k-parameter classifier, and using the score to switch a frozen walking policy to a recovery policy only when needed.
All quantitative experiments run in NVIDIA Isaac Gym; the robot demonstration clips on this page are rendered in MuJoCo, so the visuals differ slightly from the training simulator.
TL;DR
- Problem: surviving one disturbance doesn't mean a robot is stable; it can sit in a hidden-fragile state that looks fine but falls on the next hit.
- What's new: auditing post-disturbance fragility (not general safety) on a frozen policy, so it needs no retraining and works on any pre-trained policy.
- Method: clone a disturbed state into 50 parallel branches, apply a random future push to each, and measure the fraction that survive as a recoverability score; distill it into a 16k-parameter MLP that runs every control step.
- Result: modest stability improvements using audit-gated policy switching and an audit-shaped walking policy.
The Hidden-Fragility Problem
Legged robots can recover from pushes; we train a humanoid walking policy with random pushes injected during training. The catch: in deployment the hits keep coming, and surviving the first doesn't mean you're stable.
Now apply two pushes. The first differs across the two robots; the second is identical. Only one recovers, so the other was already in a fragile state that looked perfectly fine.
The question: after a disturbance, can we approximate how fragile the state is to future disturbances?
What's New
We label fragility with parallel Monte Carlo rollouts (clone a disturbed state, push each branch, score the fraction that survive) and distill it into a classifier. That recipe isn't new on its own. What is new is how we use it:
- We score fragility, not generic safety. Random-push training is robust on average but leaves hidden-fragile pockets; we measure how well a disturbed state survives the next hit.
- We do it post-hoc, on a frozen policy. Safety monitors are usually trained jointly with the RL policy; we leave the policy untouched, so the audit drops onto any pre-trained controller.
Method: The Recoverability Audit
We freeze the robot 15 steps after a push. That snapshot is the state we audit.
We clone the snapshot into 50 parallel branches, hit each with a fresh random push, and roll forward under the frozen policy. The fraction that survive is the state's recoverability score, which becomes the ground-truth label for our dataset.
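The branching audit described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Isaac Gym implementation: the two-dimensional velocity state, the `toy_step` and `toy_is_fallen` functions, and the push model are all hypothetical stand-ins for the simulator and the frozen policy. Only the branch count (50), the horizon (100 steps), and the push speed (1.5 m/s) come from the text.

```python
import numpy as np

def audit_recoverability(snapshot, step_fn, is_fallen_fn, rng,
                         n_branches=50, horizon=100, push_speed=1.5):
    """Return the fraction of cloned branches that survive one random push."""
    # Clone the snapshot: one independent copy of the state per branch.
    states = np.repeat(snapshot[None, :], n_branches, axis=0)
    # Hypothetical push model: a velocity kick of fixed speed in a random
    # horizontal direction, added to the first two state entries.
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_branches)
    states[:, 0] += push_speed * np.cos(angles)
    states[:, 1] += push_speed * np.sin(angles)
    alive = np.ones(n_branches, dtype=bool)
    for _ in range(horizon):
        states = step_fn(states)          # roll every branch forward one step
        alive &= ~is_fallen_fn(states)    # a branch that falls stays fallen
    return alive.mean()

# Toy stand-ins (assumptions): velocity decays each step, and a branch
# "falls" whenever its speed exceeds 2.0.
def toy_step(states):
    return 0.9 * states

def toy_is_fallen(states):
    return np.linalg.norm(states, axis=1) > 2.0

rng = np.random.default_rng(0)
calm_score = audit_recoverability(np.array([0.0, 0.0]), toy_step, toy_is_fallen, rng)
mid_score = audit_recoverability(np.array([3.0, 0.0]), toy_step, toy_is_fallen, rng)
doomed_score = audit_recoverability(np.array([5.0, 0.0]), toy_step, toy_is_fallen, rng)
print(calm_score, mid_score, doomed_score)
```

In the toy model a state at rest survives every push, a state already moving at 5 m/s survives none, and the intermediate state lands somewhere in between depending on the push direction. That in-between region is where a scalar score carries more information than a binary safe/unsafe flag.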
- Simulator: NVIDIA Isaac Gym (4096 envs)
- Robot: Unitree G1
- Base policy: standard walking policy, PPO-trained with pushes ≤1.5 m/s
- Audit: snapshot state 15 steps post-push → 50 branches, one random 1.5 m/s push each → roll 100 steps
- Dataset: 50k snapshots (40k/10k), ~81% recoverable
- Classifier: multilayer perceptron with 16,513 params
- Recovery policy: PPO fine-tune; upright-focused reward, no pushes
Fragility Classifier
We distill the audit labels into a 16k-parameter MLP over the robot's observations alone. Notably, it aims to reproduce the expensive 50-rollout audit in a single forward pass, cheap enough to query every control step.
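To give a feel for how small such a classifier is, here is a hedged NumPy sketch of an MLP forward pass with a parameter counter. The source states only the parameter count; the layer sizes 63 → 128 → 64 → 1, the ReLU activations, and the sigmoid output are assumptions chosen for illustration. That shape happens to total 16,513 parameters, but other shapes can reach the same count, so it should not be read as the actual architecture.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Random weights and zero biases for each consecutive pair of layers."""
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def count_params(params):
    """Total number of weights and biases."""
    return sum(W.size + b.size for W, b in params)

def predict_recoverability(params, obs):
    """One forward pass: observations -> score in (0, 1)."""
    h = obs
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layers (assumed)
    W, b = params[-1]
    logit = h @ W + b
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid output (assumed)

rng = np.random.default_rng(0)
# Hypothetical shape: 63-dim observation, two hidden layers, one output.
params = init_mlp([63, 128, 64, 1], rng)
n_params = count_params(params)
print(n_params)  # 16513

# Score a batch of 4 random observations in a single pass.
scores = predict_recoverability(params, rng.normal(size=(4, 63)))
print(scores.shape)  # (4, 1)
```

The count breaks down as (63·128 + 128) + (128·64 + 64) + (64·1 + 1) = 8192 + 8256 + 65. A network this size is a handful of small matrix multiplies, which is why querying it every control step is plausible.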
To test whether the audit correctly predicts fragility, we push the robot, wait, audit, then push again and check whether the outcome matches the audit's verdict.
- Green → the robot should recover given the second push
- Red → the robot should fall given the second push
How Can We Use This?
A cheap, per-step fragility score is only worthwhile if it changes what the robot does. We explored two main ways to put it to work:
1. Closed-Loop Policy Switching
When the audit classifier detects a fragile state, we switch from the baseline walking policy to a more stable policy (e.g. a recovery policy). Using the same push, wait, audit protocol as before:
- Green (stable) verdict: robot stays on the baseline policy
- Red (fragile) verdict: robot switches to the recovery policy
Building a Recovery Policy
To test switching, we first need a more stable policy to switch to. We fine-tuned the baseline walking policy on episodes that start from recorded fragile post-push states, adjusting the rewards to focus on staying upright instead of tracking a forward velocity.
| Outcome | Count |
|---|---|
| Only recovery succeeds | 18 |
| Only baseline succeeds | 6 |
| Both succeed | 16 |
| Both fail | 60 |
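The four outcome rows above imply a success total for each policy. A short worked calculation, assuming the numbers are counts over the same set of matched trials (the key names below are ours):

```python
# Outcome counts copied from the table above.
outcomes = {
    "only_recovery": 18,
    "only_baseline": 6,
    "both": 16,
    "neither": 60,
}

total = sum(outcomes.values())
# A policy succeeds when it is the only one to succeed or when both do.
recovery_successes = outcomes["only_recovery"] + outcomes["both"]
baseline_successes = outcomes["only_baseline"] + outcomes["both"]

print(total, recovery_successes, baseline_successes)  # 100 34 22
```

Read this way, the recovery policy succeeds on 34 of the 100 fragile starts against 22 for the baseline, and the 6 baseline-only successes show that the recovery policy is not a strict improvement on every state.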
Closing the Loop
The audit drives the handoff in closed loop: when it flags the state as fragile, control switches from the walking policy to the recovery policy, then switches back to walking once the state is no longer fragile.
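The handoff logic itself is small. The sketch below assumes the classifier output is turned into a fragility value (for example, one minus the recoverability score) and compared against a threshold at every step; the function names, the 0.3 threshold, and the example trace are hypothetical.

```python
def select_policy(fragility, threshold):
    """Pick the active policy for this control step."""
    return "recovery" if fragility >= threshold else "walking"

def run_switching(fragility_trace, threshold):
    """Apply the per-step choice along a sequence of fragility values."""
    return [select_policy(f, threshold) for f in fragility_trace]

# Hypothetical trace: calm walking, a push makes the state fragile,
# then the robot settles and control returns to the walking policy.
trace = [0.05, 0.10, 0.55, 0.70, 0.40, 0.20, 0.08]
active = run_switching(trace, threshold=0.3)
print(active)
# ['walking', 'walking', 'recovery', 'recovery', 'recovery', 'walking', 'walking']
```

Because the choice is re-evaluated each step, switching back needs no extra machinery: control returns to walking as soon as the value drops below the threshold.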
Demo Legend
- Green: walking policy ON
- Orange: first push applied
- Purple: recovery policy ON
Does Switching Actually Save the Robot?
Same fragile state, same pushes, with and without the audit-triggered switch:
2. Audit-Shaped Reward for the Walking Policy
Let's try feeding the audit back into the original walking policy as a training signal to teach the policy to steer clear of fragile states in the first place.
We fine-tune our best baseline walking policy for about 3000 iterations into an audit-shaped baseline, adding a reward term proportional to the fragility classifier's recoverability score.
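A reward term "proportional to the recoverability score" can be written as a one-line shaping function. This is a sketch of the idea only: the `weight` coefficient and its default value are assumptions, since the text does not give the scale actually used.

```python
def audit_shaped_reward(task_reward, recoverability_score, weight=0.5):
    """Original walking reward plus a bonus for being in a recoverable state.

    `weight` is a hypothetical coefficient; the source does not state one.
    """
    return task_reward + weight * recoverability_score

# A state the classifier scores as 75% recoverable earns a 0.375 bonus.
shaped = audit_shaped_reward(task_reward=1.0, recoverability_score=0.75)
print(shaped)  # 1.375
```

Since the score lies between 0 and 1, the bonus is bounded by `weight`, which keeps the shaping term from swamping the original walking objective.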
A side-by-side example of the original baseline and the audit-shaped baseline on the same matched episode:
Results / Experiments
Slowing Movements when Fragile: Not Effective
We also tried slowing down the walking policy's actions whenever the audit flagged fragility. Layering the audit on top this way did not show any significant improvement, so we turn to policy switching instead.
Tuning Policy Switching
When to switch from the walking policy to the recovery policy is determined by two settings:
- How cautious the audit is (the fragility threshold τ).
- How soon it checks after a push (the audit delay).
Our first settings were conservative: a high threshold (τ = 0.46), checked late (15 steps after the push). With them, switching policies made no measurable difference (46% → 45%). Flagging fragility more readily (τ ≈ 0.30–0.34) and checking sooner (10 steps after the push) gave a small but real gain.
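To make the two knobs concrete, the sketch below encodes them as a small config and shows how the same state is treated under both settings. It assumes a state is flagged when its fragility value (for example, one minus the recoverability score) is at least τ, and that the audit does not fire until the delay has elapsed. The class and function names, the 0.32 value picked from the reported range, and the example state are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SwitchConfig:
    tau: float        # fragility threshold: flag when fragility >= tau
    audit_delay: int  # control steps to wait after a push before auditing

def audit_flags_fragile(cfg, steps_since_push, fragility):
    """True if the audit has fired and judges the state fragile."""
    if steps_since_push < cfg.audit_delay:
        return False  # too early: the audit has not run yet
    return fragility >= cfg.tau

first_settings = SwitchConfig(tau=0.46, audit_delay=15)
tuned_settings = SwitchConfig(tau=0.32, audit_delay=10)

# Same hypothetical state: 12 steps after the push, fragility 0.40.
flag_first = audit_flags_fragile(first_settings, 12, 0.40)
flag_tuned = audit_flags_fragile(tuned_settings, 12, 0.40)
print(flag_first, flag_tuned)  # False True
```

Under the first settings this state is missed twice over, because the audit has not fired yet and the value is below the threshold anyway. The tuned settings catch it three steps earlier.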
Stacking the Interventions
Our two uses of the audit can also be combined: switching to recovery, and an audit-shaped baseline (the walking policy fine-tuned with the audit signal, see Audit-Shaped Reward). We ran all four configurations on the same 100 matched episodes.
| Policy | No switch | + Switch to recovery |
|---|---|---|
| Baseline | 46 | 49 |
| Audit-shaped baseline | 48 | 52 |
Each intervention helps a little on its own, and they stack to the best result (52/100). The gains stay modest because on many of the 100 episodes the robot is already past saving by the time the audit fires, often more than halfway to the ground, so no intervention can recover it.
Restricting to the 24 episodes where the robot was still upright and recoverable at audit time, the same ordering holds and the gains are larger.
| Policy | No switch | + Switch to recovery |
|---|---|---|
| Baseline | 50% | 62.5% |
| Audit-shaped baseline | 66.7% | 70.8% |
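With only 24 episodes in this subset, each percentage corresponds to a whole number of episodes. The arithmetic below converts the table back to counts; the dictionary keys are our own labels.

```python
n_episodes = 24

# Success rates copied from the table above.
rates = {
    ("baseline", "no_switch"): 0.50,
    ("baseline", "switch"): 0.625,
    ("audit_shaped", "no_switch"): 0.667,
    ("audit_shaped", "switch"): 0.708,
}

# Nearest whole number of successful episodes for each configuration.
counts = {config: round(rate * n_episodes) for config, rate in rates.items()}
for config, successes in counts.items():
    print(config, f"{successes}/{n_episodes}")

print(sorted(counts.values()))  # [12, 15, 16, 17]
```

So the spread between the weakest and strongest configuration is 5 episodes out of 24, which is worth keeping in mind when reading the percentage gaps.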
Future Direction
By the time the audit fires, many states are already past saving. The most useful next step is auditing earlier, or anticipating fragility before the state collapses, so an intervention still has room to work.
The fragility labels also point to which states are worth retraining on. That invites a loop that explores for fragile states, retrains the policy on them, and re-audits. It would also be interesting to see whether this approach transfers to real hardware.
Acknowledgements
We thank Professor Unnat Jain for his guidance and support throughout this project.
@misc{cao_wu_2026_recoverability_audits,
title = {Recoverability Audits for Humanoid Push Recovery},
author = {Cao, Steven and Wu, Jay},
year = {2026},
howpublished = {\url{https://github.com/jotalis/RA-HPR}}
}