RAT: RunAnyThing via Fully Automated Environment Configuration

RAT is a language-agnostic framework for automated repository-level environment configuration. It combines semantic initialization, dual-mode planning, specialized tools, and robust sandboxing to improve executability across heterogeneous real-world repositories.


Framework

Language-Agnostic Abstraction

Unifies language-specific dependency patterns, package managers, and test runners under one interface.

ImageRetriever Initialization

Semantically analyzes repo artifacts, infers base images, and scores candidate images from Docker Hub.
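As a rough illustration of artifact-driven image inference, the sketch below maps manifest files to a language and ranks candidate image references with a toy score. The manifest table, the helpers `infer_language`, `score_image`, and `rank_images`, the scoring rule, and the example image strings are all assumptions made for illustration. They are not RAT's actual ImageRetriever logic, and nothing here queries Docker Hub.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical manifest-to-language table (illustrative only, not RAT's).
MANIFESTS = {
    "pyproject.toml": "python",
    "requirements.txt": "python",
    "pom.xml": "java",
    "Cargo.toml": "rust",
    "package.json": "node",
}

def infer_language(repo: Path) -> Optional[str]:
    """Return the language of the first known manifest at the repo root."""
    for name, lang in MANIFESTS.items():
        if (repo / name).is_file():
            return lang
    return None

def score_image(image: str, hints: list[str]) -> int:
    """Toy score: count the hints that occur in the image reference."""
    return sum(1 for hint in hints if hint in image)

def rank_images(candidates: list[str], hints: list[str]) -> list[str]:
    """Sort candidates by descending toy score; ties keep their input order."""
    return sorted(candidates, key=lambda img: score_image(img, hints), reverse=True)

# Demo on a throwaway repository that contains only a Cargo.toml.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    (repo / "Cargo.toml").write_text('[package]\nname = "demo"\n')
    detected = infer_language(repo)

# Candidate strings are hard-coded examples, not retrieved results.
candidates = ["python:3.11-slim", "rust:1.75-slim", "node:20"]
ranked = rank_images(candidates, [h for h in (detected, "slim") if h])
print(detected, ranked)
```

A real retriever would need richer signals (README text, CI configuration, pinned versions) than this substring match; the sketch only shows the overall shape of "infer, then score candidates".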

Planning Modes

Supports Standard Plan Mode and Automated Plan Mode with a structured memory.

Specialized Toolset

Includes tools for reading, editing, outlining, and searching files, plus issue retrieval, version switching, error recovery, CI parsing, and test running.

Robust Sandbox Generation

Builds tailored Docker environments and runs pre-flight validation before agent execution.

Long-term Expertise

Serializes historical trajectory knowledge to improve future configuration precision.

RATBench

RATBench is a large-scale benchmark targeting distribution diversity, language diversity, and availability diversity, with executability-driven validation.

Benchmark	# Repos	Langs.	Stratified	Auto-Collect	Exec-Verified	Difficulty Levels
RATBench	2,000+	P, J, R, JS/TS	Yes	Yes	Yes	Yes
EnvBench	994	P, J, K	No	Yes	No	No
Repo2Run	420	P	No	Yes	Yes	No
ExecutionAgent	50	14 Langs	No	No	Yes	No
Beyond Pip	40	P	No	No	Yes	No

P: Python, J: Java, K: Kotlin, R: Rust, JS/TS: JavaScript/TypeScript.

Experiments

We define ESSR with scenario-specific handling. For Python, the metric is refined so that pre-existing repository issues do not count against the configured environment.

Python: ESSR = N_pass / N_verified

General form: ESSR = N_pass / N

Here, N_pass is the number of tests that pass in the configured environment; N is the total number of repository tests; and N_verified is the subset of tests verified as inherently valid (excluding pre-existing broken tests).
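The two forms of the metric can be written as one small function. This is a minimal sketch of the formulas above: the function name `essr`, its argument names, and the zero-denominator check are assumptions, not part of RAT's released code.

```python
from typing import Optional

def essr(n_pass: int, n_total: int, n_verified: Optional[int] = None) -> float:
    """Compute ESSR as a ratio in [0, 1].

    If n_verified is given (the Python form), divide by the number of tests
    verified as inherently valid; otherwise divide by the total test count.
    """
    denominator = n_total if n_verified is None else n_verified
    if denominator <= 0:
        raise ValueError("ESSR is undefined when there are no tests to count")
    return n_pass / denominator

# General form: 8 of 10 tests pass.
general = essr(n_pass=8, n_total=10)

# Python form: the same 8 passes, but 2 of the 10 tests were already broken,
# so only 8 tests are verified as valid.
python_form = essr(n_pass=8, n_total=10, n_verified=8)
print(general, python_form)
```

In this made-up example the general form gives 0.8 while the Python form gives 1.0, which shows how excluding pre-existing broken tests avoids penalizing a correct environment.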

S1 (Artifact-guided): unit tests + functional containerization artifacts.
S2 (Artifact-free): unit tests but no containerization scripts; excludes inherently broken tests from N_verified.
S3 (Test-deficient): no tests and no scripts; verifies via runnable entry points or synthesized smoke tests.
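The three scenarios amount to a decision on two repository properties. The sketch below encodes that rule; the `Scenario` enum, the `classify_scenario` function, and its boolean inputs are hypothetical names chosen for illustration. The combination "no tests but containerization artifacts present" is not covered by the definitions above, so the sketch returns `None` for it rather than guessing.

```python
from enum import Enum
from typing import Optional

class Scenario(Enum):
    S1 = "artifact-guided"
    S2 = "artifact-free"
    S3 = "test-deficient"

def classify_scenario(has_tests: bool, has_container_artifacts: bool) -> Optional[Scenario]:
    """Map repository properties to S1/S2/S3 as described in the list above."""
    if has_tests and has_container_artifacts:
        return Scenario.S1  # unit tests + functional containerization artifacts
    if has_tests:
        return Scenario.S2  # unit tests, no containerization scripts
    if not has_container_artifacts:
        return Scenario.S3  # no tests and no scripts
    return None             # artifacts without tests: not defined by the source

print(classify_scenario(True, False))
```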

For Java, Rust, and JS/TS, success is defined as the deterministic build or installation command completing without errors.
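One simple way to approximate this criterion is to run the build command and treat a zero exit code as success. The sketch below makes that assumption explicitly; the function name `build_succeeds`, the timeout value, and the choice to count a timeout or a missing executable as failure are illustrative and not taken from RAT.

```python
import subprocess
import sys
from typing import Optional

def build_succeeds(cmd: list[str], cwd: Optional[str] = None, timeout: float = 600.0) -> bool:
    """Return True if the command exits with status 0 within the timeout.

    Assumption: "completion without errors" is approximated by exit code 0.
    """
    try:
        result = subprocess.run(
            cmd, cwd=cwd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the executable could not be started at all.
        return False
    return result.returncode == 0

# Stand-in commands so the example runs anywhere; a real check would pass a
# build command for the repository, such as ["cargo", "build"].
ok = build_succeeds([sys.executable, "-c", "pass"])
failed = build_succeeds([sys.executable, "-c", "import sys; sys.exit(1)"])
print(ok, failed)
```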

Results

ESSR across languages

Framework	LLM	Python	Java	Rust	JS/TS
pipreqs	None	35.8	/	/	/
Zero-shot	DeepSeek-V3	15.2	0.0	0.0	7.3
SWE-agent	DeepSeek-V3	15.5	29.3	56.7	51.8
Installamatic	DeepSeek-V3	6.7	/	/	/
Repo2Run	DeepSeek-V3	44.8	/	/	/
RAT	DeepSeek-V3	63.2	41.3	98.7	68.7

RAT consistently outperforms the baselines across languages. On Python, RAT reaches 63.2% ESSR (vs. 35.8% for pipreqs and 44.8% for Repo2Run), and it averages a gain of 29.6 percentage points over SWE-agent across the four evaluated languages.
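The average gain over SWE-agent quoted above can be reproduced from the table as the mean of the per-language differences. The snippet below only restates the table's numbers; the variable names are arbitrary.

```python
# ESSR (%) per language, copied from the results table.
rat = {"Python": 63.2, "Java": 41.3, "Rust": 98.7, "JS/TS": 68.7}
swe_agent = {"Python": 15.5, "Java": 29.3, "Rust": 56.7, "JS/TS": 51.8}

# Per-language gain in percentage points (roughly 47.7, 12.0, 42.0, 16.9).
gains = {lang: rat[lang] - swe_agent[lang] for lang in rat}

# Mean gain across the four languages, roughly 29.65 percentage points.
average_gain = sum(gains.values()) / len(gains)
print(gains, average_gain)
```

The gain is an absolute difference in percentage points, not a relative improvement; the gap is largest on Python and Rust and smallest on Java.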

Performance across different scenarios

Levels	ESSR (%)	Tokens (K)	Latency (min)
S1	50.5	451.3	41.6
S2	69.5	455.2	59.4
S3	92.0	122.2	14.4

Strong S2 performance indicates RAT can leverage project files and documentation even without containerization scripts. S3 results show RAT can infer entry points and build effective smoke tests in underspecified settings.

Citation

If you find RAT useful, please cite:

@misc{huang2026ratrunanythingfullyautomated,
      title={RAT: RunAnyThing via Fully Automated Environment Configuration},
      author={Renhong Huang and Dongdong Hua and Yifei Sun and Sitao Ding and Hanyang Yuan and Daixin Wang and Yang Yang},
      year={2026},
      eprint={2604.23190},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2604.23190},
}