ECHO — Training Terminal Agents on the Environment's Response

ECHO's move is simple: when a CLI agent runs a command, train not only on the command it wrote, but also on the terminal response it caused. That turns stdout, stderr, file listings, logs, and tracebacks into dense supervision for terminal-agent RL.

Short Answer

The authors modified GRPO-style terminal-agent training so the model learns from both sides of the transcript:

Token stream / extra cost	Standard GRPO	ECHO
Agent actions / commands	Policy-gradient loss	Policy-gradient loss
Terminal observations / outputs	Context only; masked out of loss	Cross-entropy prediction loss
Extra rollouts	No	No
Extra teacher model	No	No
Extra forward pass	No	No

So the trick is not a new agent loop or a new benchmark. It is a training-mask change: stop masking out terminal-output tokens, add an auxiliary environment-prediction loss, and weight it with $\lambda$.

Why This Is Clever

Terminal-agent RL is sparse. A whole rollout often receives only a final pass/fail verifier reward. If the agent fails, GRPO gets little useful signal beyond "this trajectory was bad."

But failed trajectories still contain useful consequences:

ls shows which files exist;
a compiler error points to the broken line;
a traceback reveals an API mismatch;
a test failure names the unmet behavior;
a config dump reveals hidden state.

ECHO says: those tokens are ground truth about what the agent's action did. Predicting them should teach the model the dynamics of the terminal environment.
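
To make the sparsity concrete, here is a minimal sketch of group-relative advantages under the common GRPO formulation (reward minus group mean, divided by group standard deviation). The function name and the `eps` constant are illustrative assumptions; the paper's exact normalization may differ. When every rollout in a group fails, every advantage is zero, so the policy-gradient term contributes nothing for that prompt.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of one prompt.

    Assumes the common GRPO form: (r - mean) / (std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout failed the verifier: all advantages are zero.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])
print(all_failed)  # [0.0, 0.0, 0.0, 0.0]

# One rollout passed: failures get negative advantage, the pass gets positive.
one_passed = group_advantages([0.0, 0.0, 0.0, 1.0])
print(one_passed)
```

In the all-failed case, the terminal outputs in those rollouts are the only remaining source of learning signal, which is the gap ECHO's observation loss is meant to fill.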

The Objective

The paper writes ECHO as:

$$L_{ECHO} = L_{GRPO}(\mathcal{A}) + \lambda L_{Env}(\mathcal{O}')$$

where:

$\mathcal{A}$ indexes assistant action/command token positions;
$\mathcal{O}$ indexes all observation token positions;
$\mathcal{O}'$ is the subset used for environment prediction: terminal-output tokens, excluding harness warning prefixes;
$L_{Env}$ is length-normalized cross-entropy on $\mathcal{O}'$;
$\lambda$ controls how much world-modeling pressure is mixed into policy optimization.
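
Assuming the standard token-level form of length-normalized cross-entropy, the environment term can be written out as below. The symbols $x_t$ (the token at position $t$) and $\pi_\theta$ (the model's next-token distribution) are introduced here for illustration and are not taken from the paper.

$$L_{Env}(\mathcal{O}') = -\frac{1}{|\mathcal{O}'|} \sum_{t \in \mathcal{O}'} \log \pi_\theta(x_t \mid x_{<t})$$

Dividing by $|\mathcal{O}'|$ keeps a long log dump from dominating a short error message, so $\lambda$ has a comparable meaning across rollouts of different lengths.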

Concretely, suppose the transcript contains an assistant action like pytest tests/test_api.py followed by a terminal observation with a failing assertion, a traceback, and the failing test name. Standard GRPO uses the final task reward to update the command tokens, but does not train on the traceback text itself. ECHO also asks the model to predict those observation tokens: given the prior task context and the command it just issued, the model should assign high likelihood to the terminal response that pytest actually produced. The same applies to simpler cases like ls src predicting the returned filenames or cat config.yaml predicting the file contents.

Implementation-wise, this is cheap because the model already processes the full transcript. ECHO reuses the same rollout and logits; it just gathers loss over an additional mask.
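
A minimal sketch of that idea, using plain Python lists in place of tensors: one set of per-token log-probabilities is read through two masks. The names `echo_loss`, `masked_mean`, `action_mask`, and `env_mask` are hypothetical, and the policy term is a simplified surrogate (advantage times log-probability) rather than the full clipped GRPO objective.

```python
def masked_mean(values, mask):
    """Average of the values whose mask entry is truthy (0.0 if none)."""
    selected = [v for v, m in zip(values, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

def echo_loss(token_logprobs, advantage, action_mask, env_mask, lam):
    """Combine a policy term on action tokens with cross-entropy on observation tokens.

    token_logprobs: log-prob the model assigned to each actual token in the transcript.
    advantage: scalar rollout advantage (simplified; no ratio clipping or KL term).
    """
    # Simplified policy-gradient surrogate, averaged over action positions only.
    policy_terms = [-advantage * lp for lp in token_logprobs]
    policy_loss = masked_mean(policy_terms, action_mask)

    # Length-normalized negative log-likelihood over observation positions only.
    nll = [-lp for lp in token_logprobs]
    env_loss = masked_mean(nll, env_mask)

    return policy_loss + lam * env_loss, policy_loss, env_loss

# Two command tokens followed by two terminal-output tokens.
logprobs = [-0.5, -1.0, -2.0, -3.0]
total, policy_loss, env_loss = echo_loss(
    logprobs,
    advantage=0.0,            # e.g. a rollout from a group where every attempt failed
    action_mask=[1, 1, 0, 0],
    env_mask=[0, 0, 1, 1],
    lam=0.1,
)
print(total, policy_loss, env_loss)  # total is approximately 0.25; env_loss is 2.5
```

Even with a zero advantage, the observation term is nonzero, so the rollout still produces a gradient.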

What Changed in the Training Stack

According to the paper and repo, the authors:

used SkyRL/vLLM for fast batched generation and GRPO training;
used Harbor/Terminus-style terminal task environments and verifiers;
constructed masks for assistant-action positions and terminal-observation positions;
added ECHO's observation-token cross-entropy term to the policy loss;
patched SkyRL minimally so custom tensors/masks and auxiliary losses flow through the trainer.
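
The mask-construction step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the repo's code: the `Segment` type, the role labels `"assistant"` and `"terminal"`, and the `harness_prefix_len` field (the number of leading tokens that belong to a harness warning prefix) are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    role: str                    # hypothetical labels: "user", "assistant", "terminal"
    tokens: list                 # token strings or ids for this segment
    harness_prefix_len: int = 0  # leading tokens to exclude from the env loss

def build_masks(segments):
    """Return (action_mask, env_mask), one 0/1 entry per transcript token."""
    action_mask, env_mask = [], []
    for seg in segments:
        for i in range(len(seg.tokens)):
            is_action = seg.role == "assistant"
            # Terminal output counts, except the harness warning prefix.
            is_env = seg.role == "terminal" and i >= seg.harness_prefix_len
            action_mask.append(int(is_action))
            env_mask.append(int(is_env))
    return action_mask, env_mask

segments = [
    Segment("user", ["fix", "the", "test"]),
    Segment("assistant", ["pytest", "tests/test_api.py"]),
    Segment("terminal", ["WARN", ":", "FAILED", "test_get"], harness_prefix_len=2),
]
action_mask, env_mask = build_masks(segments)
print(action_mask)  # [0, 0, 0, 1, 1, 0, 0, 0, 0]
print(env_mask)     # [0, 0, 0, 0, 0, 0, 0, 1, 1]
```

The two masks are disjoint, and prompt tokens fall under neither, so they remain context only.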

The repo positions this as a small hook/extension over SkyRL, not a separate RL system.

What They Found

The headline numbers:

Setting	GRPO	ECHO	Effect
Qwen3-8B on TerminalBench-2.0 pass@1	2.70%	5.17%	about 1.9×
Qwen3-14B on TerminalBench-2.0 pass@1	5.17%	10.79%	about 2.1×

Other reported findings:

ECHO improved all tested starting policies across their evaluation sets.
It matched GRPO's performance in up to 2.3× fewer training steps.
It reduced environment-token cross-entropy on held-out trajectories, while GRPO barely did.
It recovered much of the value of expert SFT without using expert demonstrations.
In limited settings, environment-prediction-only continuation improved performance even with verifier rewards removed.

The Real Claim

The strongest defensible version is not "the agent has a full world model." It is:

A terminal agent trained with ECHO becomes a better predictor of terminal dynamics, and that predictive ability improves task-solving.

That matters because it turns the terminal itself into a teacher. Instead of relying only on sparse final rewards or expensive expert demonstrations, every interaction supplies a supervised learning target.

Design Takeaways

For CLI agents, do not waste observation tokens. If the environment returns text, that text can supervise the model.
Failed rollouts are still useful. They may fail the task but still teach action → consequence mappings.
World-modeling can be an auxiliary loss, not a separate module. ECHO is just token prediction on environment responses.
The target representation matters. Raw terminal output works here because terminals are token-native; other environments may need summarized or structured observations.
Balance matters. Too much observation prediction can make the agent prefer predictable outputs over useful actions.

How This Connects

ECHO strengthens the view of terminals as useful agent environments. Shell capability matters not only because the terminal is a broad action surface, but because it returns dense textual feedback that can be folded directly into training.

That makes ECHO the training-time counterpart to a runtime design principle: a sandbox or terminal is not just a place to execute code; it is a feedback channel whose observations can become learning targets.