RAT / RunAnyThing

RAT: RunAnyThing via Fully Automated Environment Configuration

RAT is a language-agnostic framework for automated repository-level environment configuration. It combines semantic initialization, dual-mode planning, specialized tools, and robust sandboxing to improve executability across heterogeneous real-world repositories.

Framework

RAT framework overview

Language-Agnostic Abstraction

Unifies language-specific dependency patterns, package managers, and test runners under one interface.

ImageRetriever Initialization

Semantically analyzes repo artifacts, infers base images, and scores candidate images from Docker Hub.

Planning Modes

Supports Standard Plan Mode and Automated Plan Mode with a structured memory.

Specialized Toolset

Includes read/edit/outline/search/issue retrieval/version switch/error recovery/CI parse/test runners.

Robust Sandbox Generation

Builds tailored Docker environments and runs pre-flight validation before agent execution.

Long-term Expertise

Serializes historical trajectory knowledge to improve future configuration precision.

RATBench

RATBench is a large-scale benchmark targeting distribution diversity, language diversity, and availability diversity, with executability-driven validation.

Benchmark # Repos Langs. Stratified Auto-Collect Exec-Verified Difficulty Levels
RATBench 2,000+ P, J, R, JS/TS Yes Yes Yes Yes
EnvBench 994 P, J, K No Yes No No
Repo2Run 420 P No Yes Yes No
ExecutionAgent 50 14 Langs No No Yes No
Beyond Pip 40 P No No Yes No

P: Python, J: Java, K: Kotlin, R: Rust, JS/TS: JavaScript/TypeScript.

Experiments

We define ESSR with scenario-specific handling. For Python, it refines the metric to account for pre-existing repository issues.

Python: ESSR = N_pass / N_verified
General form: ESSR = N_pass / N

Here, N_pass is the number of tests that pass in the configured environment; N is the total number of repository tests; and N_verified is the subset of tests verified as inherently valid (excluding pre-existing broken tests).

For Java / Rust / JS/TS, success is defined by deterministic build/installation command completion without errors.

Results

ESSR across languages

Framework LLM Python Java Rust JS/TS
pipreqs None 35.8 / / /
Zero-shot DeepSeek-V3 15.2 0.0 0.0 7.3
SWE-agent DeepSeek-V3 15.5 29.3 56.7 51.8
Installamatic DeepSeek-V3 6.7 / / /
Repo2Run DeepSeek-V3 44.8 / / /
RAT DeepSeek-V3 63.2 41.3 98.7 68.7

RAT consistently outperforms baseline methods across languages. On Python, RAT reaches 63.2% ESSR (vs. 35.8% for pipreqs), and shows a 29.6% average gain over SWE-agent across evaluated language columns.

Performance across different scenarios

Levels ESSR (%) Tokens (K) Latency (min)
S1 50.5 451.3 41.6
S2 69.5 455.2 59.4
S3 92.0 122.2 14.4

Strong S2 performance indicates RAT can leverage project files and documentation even without containerization scripts. S3 results show RAT can infer entry points and build effective smoke tests in underspecified settings.

Citation

If you find RAT useful, please site:

@misc{huang2026ratrunanythingfullyautomated,
      title={RAT: RunAnyThing via Fully Automated Environment Configuration},
      author={Renhong Huang and Dongdong Hua and Yifei Sun and Sitao Ding and Hanyang Yuan and Daixin Wang and Yang Yang},
      year={2026},
      eprint={2604.23190},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2604.23190},
}