Build Arena

Towards Physics-Aligned Interactive Agentic Engineering Construction Automation

Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Yixian Jiang, Chenglei Yu, Tailin Wu*
*Corresponding author: wutailin@westlake.edu.cn
Westlake University

Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated.

Question: How can we comprehensively evaluate LLMs for language-driven and physics-grounded construction automation?

Here, we introduce Build Arena, the first physics-aligned interactive benchmark designed for language-driven engineering construction.

Build with language. Build with reasoning.

Watch how our agentic workflow orchestrates the construction process through multi-party collaboration. The animation below demonstrates a real construction session where the Guidance entity and Builder agent work together iteratively to construct a propulsion machine. Click Play to watch the agents communicate and see the structure evolve in real-time, or adjust the speed to explore at your own pace.

[Interactive demo: playback speed control with Agent Communication Logs and Render Output panels]

What Sets Build Arena Apart?

Build Arena is the first benchmark that comprehensively addresses spatial reasoning, 3D construction, construction-aimed planning, physical simulation, and interactive environments in a unified framework.

Benchmark | Spatial Reasoning | 3D Construction | Construction Planning | Physical Simulator | Interactive Environment
PlanBench [1] | ✗ | ✗ | ✗ | ✗ | ✗
PlanQA [2] | ✓ | ✗ | ✗ | ✗ | ✗
PHYRE [3] | ✓ | ✗ | ✗ | ✓ | ✓
VOYAGER [4] | ✓ | ✗ | ✓ | ✓ | ✓
Embodied Agent Interface [5] | ✓ | ✓ | ✗ | ✓ | ✓
Build Arena (ours) | ✓ | ✓ | ✓ | ✓ | ✓

Agentic Engineering Construction Automation Arena

[Image gallery: example constructions produced by Grok, Claude, Doubao, GPT, Kimi, Qwen, DeepSeek, and Gemini]

Benchmark Framework

Build Arena establishes a comprehensive evaluation framework with four core components:

Build Arena Benchmark Framework

1. Language-Driven and Physics-Grounded Construction

We develop an open-source 3D Spatial Geometric Computation Library that mirrors Besiege's construction logic and physical constraints. This library enables LLMs to interact with the construction space through natural language interfaces, ensuring consistency between language-based actions and physics-aligned outcomes.
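To illustrate the kind of feasibility check the library enforces, the sketch below validates a block placement against an axis-aligned overlap test before committing it. The names Block, overlaps, and try_place are illustrative stand-ins, not the released library's API.

from dataclasses import dataclass

@dataclass
class Block:
    lo: tuple  # (min_x, min_y, min_z) corner of an axis-aligned bounding box
    hi: tuple  # (max_x, max_y, max_z) corner

def overlaps(a: Block, b: Block) -> bool:
    # Two boxes intersect only if their intervals overlap on every axis.
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))

def try_place(structure: list, new_block: Block) -> bool:
    # Reject a placement that collides with any already-placed block.
    if any(overlaps(new_block, placed) for placed in structure):
        return False
    structure.append(new_block)
    return True

structure = [Block((0, 0, 0), (1, 1, 1))]
print(try_place(structure, Block((0.5, 0, 0), (1.5, 1, 1))))  # False: overlaps the existing block
print(try_place(structure, Block((1, 0, 0), (2, 1, 1))))      # True: flush against it, no overlap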

Construction is inherently iterative: structures are assembled step by step, each component must connect to existing ones, and physical feasibility (e.g., collision avoidance) is continuously verified. Actions fall into four categories: Build, Refine, Query, and Control.
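A minimal sketch of how such actions might be routed is shown below. The ActionType names mirror the four categories above; the dispatch logic and field names are assumptions made for illustration, not the benchmark's actual interface.

from enum import Enum

class ActionType(Enum):
    BUILD = "build"      # attach a new component to an existing one
    REFINE = "refine"    # adjust or replace a previously placed component
    QUERY = "query"      # inspect the current structure or spatial relations
    CONTROL = "control"  # bind controls (e.g., keys) to actuated parts

def dispatch(action: dict) -> str:
    kind = ActionType(action["type"])
    if kind is ActionType.BUILD:
        return f"attach {action['part']} to {action['anchor']}"
    if kind is ActionType.REFINE:
        return f"modify {action['part']}"
    if kind is ActionType.QUERY:
        return f"report {action['what']}"
    return f"bind {action['part']} to key {action['key']}"

print(dispatch({"type": "build", "part": "wheel", "anchor": "chassis_block_3"}))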

2. Agentic Workflow (Baseline & Customizable)

Inspired by human engineering practices, we design a baseline workflow that follows a coarse-to-fine structure with multi-party collaboration among five specialized entities.

The baseline workflow progresses through three stages: Plan Phase → Draft-Review Loop → Build-Guidance Loop, producing a simulation-compatible construction result. Our framework supports flexible workflow customization: you can define your own agent collaboration patterns, modify the multi-stage structure, or integrate alternative reasoning strategies to suit specific task requirements.
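The control flow can be pictured roughly as follows. This is a schematic sketch that assumes a generic llm(prompt) callable and simple string-based acceptance checks; it is not the actual orchestration code.

def run_workflow(task: str, llm, max_rounds: int = 3):
    # 1. Plan Phase: produce a high-level construction plan from the task description.
    plan = llm(f"Plan a structure for: {task}")

    # 2. Draft-Review Loop: refine the draft until the reviewer accepts it.
    draft = llm(f"Draft components for plan: {plan}")
    for _ in range(max_rounds):
        review = llm(f"Review this draft for feasibility: {draft}")
        if "accept" in review.lower():
            break
        draft = llm(f"Revise the draft given feedback: {review}")

    # 3. Build-Guidance Loop: execute build actions step by step under guidance.
    structure = []
    for _ in range(max_rounds):
        action = llm(f"Next build action for draft {draft}, current structure {structure}")
        structure.append(action)
        guidance = llm(f"Guidance on structure so far: {structure}")
        if "complete" in guidance.lower():
            break
    return structure

# Example with a stub LLM that always accepts and reports completion.
print(run_workflow("a bridge spanning 10 units", lambda prompt: "accept - complete"))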

3. Simulation-Based Evaluation

Construction results are evaluated in the Besiege physics simulator with task-specific protocols. For each task-LLM pair, we sample 64 times to ensure reliability. Evaluation metrics cover both performance (number of parts, success rate, task-specific indicators) and cost (input/output tokens, number of requests).
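In outline, the per-pair evaluation aggregates 64 sampled runs into the reported metrics. The simulate callable and result fields below are illustrative placeholders for the Besiege-based protocol, not its actual interface.

from statistics import mean
import random

def evaluate(task, model, simulate, n_samples: int = 64) -> dict:
    # Run the task-model pair n_samples times and aggregate performance and cost.
    results = [simulate(task, model) for _ in range(n_samples)]
    return {
        "success_rate": mean(r["success"] for r in results),
        "avg_parts": mean(r["num_parts"] for r in results),
        "avg_tokens": mean(r["input_tokens"] + r["output_tokens"] for r in results),
        "total_requests": sum(r["num_requests"] for r in results),
    }

# Stubbed example: a simulator that succeeds about 25% of the time.
random.seed(0)
stub = lambda task, model: {"success": random.random() < 0.25, "num_parts": 12,
                            "input_tokens": 3000, "output_tokens": 800, "num_requests": 5}
print(evaluate("transport_lv1", "model-x", stub))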

4. Task Suite (Base & Customizable)

Build Arena includes three representative engineering task categories, each with three difficulty levels (Easy, Medium, Hard). Tasks are designed along six difficulty dimensions: Quantification, Robustness, Magnitude, Compositionality, Precision, and Ambiguity. Click the ⚙️ Yours card below to learn about customizing your own tasks.
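A custom task might be specified along these axes. The TaskSpec schema below is an illustrative sketch, not the benchmark's actual configuration format.

from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    category: str                      # "Transport", "Support", "Lift", or a custom category
    level: int                         # 1 = Easy, 2 = Medium, 3 = Hard
    description: str                   # natural-language specification given to the LLM
    # Per-dimension difficulty knobs, e.g. target distances, load magnitudes, tolerances.
    dimensions: dict = field(default_factory=dict)

custom_task = TaskSpec(
    category="Transport",
    level=2,
    description="Move a 2x2 payload across a 15-unit gap without dropping it.",
    dimensions={"Quantification": "15-unit gap", "Precision": "payload stays on carrier"},
)
print(custom_task)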

Performance Leaderboard

Model (Full Name) | Transport Avg Success Rate | Support Avg Success Rate | Lift Avg Success Rate | Overall Performance
Grok-4 (grok4-0709) | 11.5% | 20.8% | 21.9% | 🥇 Excellent
Claude-4 (claude-sonnet-4-20250514) | 12.5% | 3.1% | 4.2% | 🥈 Good
Seed-1.6 (doubao-seed-1-6-250615) | 6.2% | 19.3% | 2.1% | 🥉 Good
GPT-4o (gpt-4o) | 6.2% | 13.5% | 3.6% | Moderate
Kimi-K2 (kimi-k2-turbo-preview) | 4.7% | 11.5% | 5.2% | Moderate
Qwen-3 (qwen-plus, Qwen3 series) | 5.7% | 5.7% | 1.0% | Moderate
DeepSeek-3.1 (deepseek-chat, DeepSeek-V3.1) | 2.6% | 8.3% | 3.6% | Moderate
Gemini-2.0 (gemini-2.0-flash) | 1.6% | 7.8% | 0.0% | Moderate

Success rates are averaged across all three difficulty levels (Lv.1, Lv.2, Lv.3) for each task category under our baseline agentic workflow. Full model snapshots and detailed experimental setup are available in the paper appendix.

We evaluate eight frontier large language models on Build Arena across three task categories (Transport, Support, Lift) and three difficulty levels (Lv.1 Easy, Lv.2 Medium, Lv.3 Hard) under our baseline agentic workflow. Performance is measured by success rate, with 64 samples per task-model pair to ensure statistical reliability.

Multi-Dimensional Performance Analysis

Performance of different LLMs against six dimensions of task difficulty: Quantification, Robustness, Magnitude, Compositionality, Precision, Ambiguity.

LLM Performance Radar Chart across Six Task Difficulty Dimensions

Key Findings

1. LLMs Can Perform Language-Driven 3D Construction

Eight frontier LLMs successfully completed construction tasks across multiple difficulty levels, demonstrating that language models can translate natural language into physically viable 3D structures.

2. Performance Varies Significantly Across Models

Grok-4 shows the strongest overall performance, particularly excelling in Precision and Robustness. Most models handle Magnitude and Ambiguity well but struggle with Compositionality and Precision.

3. LLMs Exhibit Creative Problem-Solving

When explicit constraints are relaxed, LLMs propose unconventional solutions such as propulsion-powered carriers for transport tasks and wheel-integrated bridge structures that utilize automatic braking for stabilization.

4. Real-World Engineering Knowledge Is Captured

LLMs construct structures mirroring real-world practices, such as steel trusses in bridges and differential steering in vehicles, suggesting that structural concepts learned from text carry implicit spatial information.

5. Significant Limitations Remain

Success rates drop sharply in hierarchical assembly tasks (Support Lv.3) and high-precision tasks (Lift). Most models except Grok-4 fail completely at the hardest difficulty levels, indicating challenges with compositional construction and precise spatial alignment.

6. More Tokens ≠ Better Performance

Cost analysis reveals that heavy inference does not guarantee strong performance: the best construction results often consume only a moderate number of tokens, while many failed attempts incur very high token usage. Beyond a capability threshold, additional inference cost does not translate into better results.

Acknowledgments

We are grateful to Spiderling Studios for creating Besiege, the inspiring physics sandbox that underpins our work. We also thank the developers of the open-source projects Lua Scripting Mod and Besiege Creation Import Addon for Blender for their valuable contributions to the community.

We also gratefully acknowledge the support of Westlake University Research Center for Industries of the Future. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.

References

[1] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

[2] Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, and Peter Wonka. PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations. arXiv preprint arXiv:2507.07644, 2025.

[3] Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A New Benchmark for Physical Reasoning. Advances in Neural Information Processing Systems, 32, 2019.

[4] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023.

[5] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024.

Citation

If you find BuildArena useful in your research, please consider citing our paper:

@article{xia2025buildarena,
  title={BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction},
  author={Xia, Tian and Gao, Tianrun and Deng, Wenhao and Wei, Long and Qian, Xiaowei and Jiang, Yixian and Yu, Chenglei and Wu, Tailin},
  journal={arXiv preprint arXiv:2510.16559},
  year={2025}
}