Recursive Language Models

Alex L. Zhang MIT CSAIL altzhang@mit.edu

Tim Kraska MIT CSAIL kraska@mit.edu

Omar Khattab MIT CSAIL okhattab@mit.edu

arXiv: 2512.24601v3 [cs.AI], 11 May 2026

Correspondence: Alex L. Zhang, Omar Khattab <altzhang@mit.edu, okhattab@mit.edu>

Preprint.

Abstract

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs more than an order of magnitude beyond model context window limits and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of 26% against compaction, 130% against CodeAct with sub-calls, and 13% against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by a median of 28% and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.

1 Introduction

[Figure: Comparison of GPT-5 and RLM(GPT-5) performance on S-NIAH, OOLONG, and OOLONG-Pairs tasks across increasing input context lengths.]

Figure 1: A comparison of GPT-5 and a corresponding RLM(recursion depth=1) using GPT-5 on three long-context tasks of increasing complexity: S-NIAH, OOLONG, and OOLONG-Pairs. For each task, we scale the input length from $2^{13}$ to $2^{20}$ . GPT-5 performance degrades significantly as a function of both input length and task complexity, while the RLM maintains strong performance. Inputs beyond the red region do not fit in GPT-5’s context window of 272K tokens, but the RLM handles them effectively. Additional experiments across other models and benchmarks are in §3.

Frontier reasoning models have limited context windows and, even within their limits, tend to exhibit context rot [Hong et al., 2025], a phenomenon illustrated in Figure 1 where quality degrades steeply as prompts get longer. Though we expect context lengths to steadily rise through improvements to training, architecture, and infrastructure, we are interested in whether it is possible to scale the context size of general-purpose LLMs by orders of magnitude. This is increasingly urgent as LLMs begin to be widely adopted for long-horizon tasks, in which they must routinely process tens if not hundreds of millions of tokens.

We study this question through the lens of scaling inference-time compute. We are inspired by the way that reasoning models, another inference strategy, have become the fundamental interface to LLMs, resulting not only in empirical gains but also additional theoretical expressive power [Merrill and Sabharwal, 2024] compared to vanilla Transformers. Though most inference-time methods for dealing with long context are task-specific [Wu et al., 2021, Chang et al., 2024], the most popular general approach is context condensation or compaction [Khattab et al., 2021, Smith, 2025, OpenAI, 2025b, Wu et al., 2025], where context from user requests or agent trajectories is repeatedly summarized once it exceeds a length threshold. Unfortunately, compaction is rarely expressive enough for tasks that require dense access throughout the prompt. It presumes that some details that appear early in the prompt can safely be forgotten to make room for new content.

[Figure: A diagram illustrating a Recursive Language Model (RLM) architecture where the prompt is treated as an environment variable, allowing the model to write code to decompose and recursively query parts of the input.]

Figure 2: A Recursive Language Model (RLM) treats prompts as part of the environment. It loads the input prompt as a variable inside a REPL environment $\mathcal{E}$ and writes code to peek into, decompose, and invoke itself recursively over programmatic snippets of the variable.

We introduce Recursive Language Models (RLMs), a general-purpose inference paradigm for dramatically scaling the effective input and output lengths of LLMs. The key insight is that arbitrarily long user prompts should not be fed into the neural network (e.g., Transformer) directly but should instead be treated as part of the environment that the LLM is tasked to symbolically and recursively interact with. This system serves as an abstracted “language model” without context limitations.

As Figure 2 shows, an RLM exposes the same external interface as an LLM or a reasoning model: it accepts a string prompt of arbitrary structure and produces a string response. Given a prompt $P$ , the RLM initializes a Read-Eval-Print Loop (REPL) programming environment in which $P$ is set as the value of a variable. It then offers the LLM general context about the REPL environment (e.g., the length of the string $P$ ), and permits it to write code that peeks into and decomposes $P$ , and to iteratively observe any side effects from execution. Crucially, RLMs encourage the LLM to understand, transform, and execute the input prompt by writing symbolic programs that invoke the LLM itself on as many slices of the input as necessary.

By treating the prompt itself as an external object and enabling symbolic recursion, RLMs tackle limitations of expressive power in recent work on coding agents, retrieval agents, and sub-agent delegation. In particular, prior coding agents and retrieval agents treat some designated external data source (e.g., a filesystem or a corpus of search documents) as an environment for fetching snippets. However, they can only fill up the underlying LLM’s context window with snippets before facing compaction. Similarly, prior self-delegation approaches [Anthropic, 2025, Sentient AI, 2025, Schroeder et al., 2025, Sun et al., 2025] allow LLMs to invoke themselves as sub-agents. However,

Use caseConvert PDFs to Markdown

Add the source

Original page previews

Extracted Markdown

Recursive Language Models

Abstract

1 Introduction

What accumulates in your wiki

Convert your next paper