Build Arena

Towards Physics-Aligned Interactive Agentic Engineering Construction Automation

Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Yixian Jiang, Chenglei Yu, Tailin Wu*
*Corresponding author: wutailin@westlake.edu.cn
Westlake University

Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated.

Question: How can we comprehensively evaluate LLMs for language-driven and physics-grounded construction automation?

Here, we introduce Build Arena, the first physics-aligned interactive benchmark designed for language-driven engineering construction.

Build with language. Build with reasoning.

Watch how our agentic workflow orchestrates the construction process through multi-party collaboration. The animation below demonstrates a real construction session where the Guidance entity and Builder agent work together iteratively to construct a propulsion machine. Click Play to watch the agents communicate and see the structure evolve in real-time, or adjust the speed to explore at your own pace.

[Interactive demo: playback speed control with Agent Communication Logs and Render Output panels]

What Sets Build Arena Apart?

Build Arena is the first benchmark that comprehensively addresses spatial reasoning, 3D construction, construction-aimed planning, physical simulation, and interactive environments in a unified framework.

Benchmark | Spatial Reasoning | 3D Construction | Construction Planning | Physical Simulator | Interactive Environment
PlanBench [1] | ✗ | ✗ | ✗ | ✗ | ✗
PlanQA [2] | ✓ | ✗ | ✗ | ✗ | ✗
PHYRE [3] | ✓ | ✗ | ✗ | ✓ | ✓
VOYAGER [4] | ✓ | ✗ | ✓ | ✓ | ✓
Embodied Agent Interface [5] | ✓ | ✓ | ✗ | ✓ | ✓
Build Arena (ours) | ✓ | ✓ | ✓ | ✓ | ✓

Agentic Engineering Construction Automation Arena

[Image gallery: example constructions produced by Grok, Claude, Doubao, GPT, Kimi, Qwen, DeepSeek, and Gemini]

Benchmark Framework

Build Arena establishes a comprehensive evaluation framework with four core components:

Build Arena Benchmark Framework

1. Language-Driven and Physics-Grounded Construction

We develop an open-source 3D Spatial Geometric Computation Library that mirrors Besiege's construction logic and physical constraints. This library enables LLMs to interact with the construction space through natural language interfaces, ensuring consistency between language-based actions and physics-aligned outcomes.
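To illustrate the kind of feasibility check the library enforces, the sketch below validates a block placement against an axis-aligned overlap test before committing it. The names Block, overlaps, and try_place are illustrative stand-ins, not the released library's API.

from dataclasses import dataclass

@dataclass
class Block:
    lo: tuple  # (min_x, min_y, min_z) corner of an axis-aligned bounding box
    hi: tuple  # (max_x, max_y, max_z) corner

def overlaps(a: Block, b: Block) -> bool:
    # Two boxes intersect only if their intervals overlap on every axis.
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))

def try_place(structure: list, new_block: Block) -> bool:
    # Reject a placement that collides with any already-placed block.
    if any(overlaps(new_block, placed) for placed in structure):
        return False
    structure.append(new_block)
    return True

structure = [Block((0, 0, 0), (1, 1, 1))]
print(try_place(structure, Block((0.5, 0, 0), (1.5, 1, 1))))  # False: overlaps the existing block
print(try_place(structure, Block((1, 0, 0), (2, 1, 1))))      # True: flush against it, no overlap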

Construction is inherently iterative: structures are assembled step by step, each component must connect to existing ones, and physical feasibility (e.g., collision avoidance) is continuously verified. Actions fall into four categories: Build, Refine, Query, and Control.
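A minimal sketch of how such actions might be routed is shown below. The ActionType names mirror the four categories above; the dispatch logic and field names are assumptions made for illustration, not the benchmark's actual interface.

from enum import Enum

class ActionType(Enum):
    BUILD = "build"      # attach a new component to an existing one
    REFINE = "refine"    # adjust or replace a previously placed component
    QUERY = "query"      # inspect the current structure or spatial relations
    CONTROL = "control"  # bind controls (e.g., keys) to actuated parts

def dispatch(action: dict) -> str:
    kind = ActionType(action["type"])
    if kind is ActionType.BUILD:
        return f"attach {action['part']} to {action['anchor']}"
    if kind is ActionType.REFINE:
        return f"modify {action['part']}"
    if kind is ActionType.QUERY:
        return f"report {action['what']}"
    return f"bind {action['part']} to key {action['key']}"

print(dispatch({"type": "build", "part": "wheel", "anchor": "chassis_block_3"}))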

2. Agentic Workflow (Baseline & Customizable)

Inspired by human engineering practices, we design a baseline workflow that follows a coarse-to-fine structure with multi-party collaboration among five specialized entities.

The baseline workflow progresses through three stages: Plan Phase → Draft-Review Loop → Build-Guidance Loop, producing a simulation-compatible construction result. Our framework supports flexible workflow customization: you can define your own agent collaboration patterns, modify the multi-stage structure, or integrate alternative reasoning strategies to suit specific task requirements.
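The control flow can be pictured roughly as follows. This is a schematic sketch that assumes a generic llm(prompt) callable and simple string-based acceptance checks; it is not the actual orchestration code.

def run_workflow(task: str, llm, max_rounds: int = 3):
    # 1. Plan Phase: produce a high-level construction plan from the task description.
    plan = llm(f"Plan a structure for: {task}")

    # 2. Draft-Review Loop: refine the draft until the reviewer accepts it.
    draft = llm(f"Draft components for plan: {plan}")
    for _ in range(max_rounds):
        review = llm(f"Review this draft for feasibility: {draft}")
        if "accept" in review.lower():
            break
        draft = llm(f"Revise the draft given feedback: {review}")

    # 3. Build-Guidance Loop: execute build actions step by step under guidance.
    structure = []
    for _ in range(max_rounds):
        action = llm(f"Next build action for draft {draft}, current structure {structure}")
        structure.append(action)
        guidance = llm(f"Guidance on structure so far: {structure}")
        if "complete" in guidance.lower():
            break
    return structure

# Example with a stub LLM that always accepts and reports completion.
print(run_workflow("a bridge spanning 10 units", lambda prompt: "accept - complete"))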

3. Simulation-Based Evaluation

Construction results are evaluated in the Besiege physics simulator with task-specific protocols. For each task-LLM pair, we sample 64 times to ensure reliability. Evaluation metrics cover both performance (number of parts, success rate, task-specific indicators) and cost (input/output tokens, number of requests).
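In outline, the per-pair evaluation aggregates 64 sampled runs into the reported metrics. The simulate callable and result fields below are illustrative placeholders for the Besiege-based protocol, not its actual interface.

from statistics import mean
import random

def evaluate(task, model, simulate, n_samples: int = 64) -> dict:
    # Run the task-model pair n_samples times and aggregate performance and cost.
    results = [simulate(task, model) for _ in range(n_samples)]
    return {
        "success_rate": mean(r["success"] for r in results),
        "avg_parts": mean(r["num_parts"] for r in results),
        "avg_tokens": mean(r["input_tokens"] + r["output_tokens"] for r in results),
        "total_requests": sum(r["num_requests"] for r in results),
    }

# Stubbed example: a simulator that succeeds about 25% of the time.
random.seed(0)
stub = lambda task, model: {"success": random.random() < 0.25, "num_parts": 12,
                            "input_tokens": 3000, "output_tokens": 800, "num_requests": 5}
print(evaluate("transport_lv1", "model-x", stub))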

4. Task Suite (Base & Customizable)

Build Arena includes three representative engineering task categories, each with three difficulty levels (Easy, Medium, Hard). Tasks are designed along six difficulty dimensions: Quantification, Robustness, Magnitude, Compositionality, Precision, and Ambiguity. Click the ⚙️ Yours card below to learn about customizing your own tasks.
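A custom task might be specified along these axes. The TaskSpec schema below is an illustrative sketch, not the benchmark's actual configuration format.

from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    category: str                      # "Transport", "Support", "Lift", or a custom category
    level: int                         # 1 = Easy, 2 = Medium, 3 = Hard
    description: str                   # natural-language specification given to the LLM
    # Per-dimension difficulty knobs, e.g. target distances, load magnitudes, tolerances.
    dimensions: dict = field(default_factory=dict)

custom_task = TaskSpec(
    category="Transport",
    level=2,
    description="Move a 2x2 payload across a 15-unit gap without dropping it.",
    dimensions={"Quantification": "15-unit gap", "Precision": "payload stays on carrier"},
)
print(custom_task)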

Performance Leaderboard

Model (Full Name) | Transport Avg Success Rate | Support Avg Success Rate | Lift Avg Success Rate | Overall Performance
Grok-4 (grok4-0709) | 11.5% | 20.8% | 21.9% | 🥇 Excellent
Claude-4 (claude-sonnet-4-20250514) | 12.5% | 3.1% | 4.2% | 🥈 Good
Seed-1.6 (doubao-seed-1-6-250615) | 6.2% | 19.3% | 2.1% | 🥉 Good
GPT-4o (gpt-4o) | 6.2% | 13.5% | 3.6% | Moderate
Kimi-K2 (kimi-k2-turbo-preview) | 4.7% | 11.5% | 5.2% | Moderate
Qwen-3 (qwen-plus, Qwen3 series) | 5.7% | 5.7% | 1.0% | Moderate
DeepSeek-3.1 (deepseek-chat, DeepSeek-V3.1) | 2.6% | 8.3% | 3.6% | Moderate
Gemini-2.0 (gemini-2.0-flash) | 1.6% | 7.8% | 0.0% | Moderate

Success rates are averaged across all three difficulty levels (Lv.1, Lv.2, Lv.3) for each task category under our baseline agentic workflow. Full model snapshots and detailed experimental setup are available in the paper appendix.

We evaluate eight frontier large language models on Build Arena across three task categories (Transport, Support, Lift) and three difficulty levels (Lv.1 Easy, Lv.2 Medium, Lv.3 Hard) under our baseline agentic workflow. Performance is measured by success rate, with 64 samples per task-model pair to ensure statistical reliability.

Multi-Dimensional Performance Analysis

Performance of different LLMs against six dimensions of task difficulty: Quantification, Robustness, Magnitude, Compositionality, Precision, Ambiguity.

LLM Performance Radar Chart across Six Task Difficulty Dimensions

Key Findings

1. LLMs Can Perform Language-Driven 3D Construction

Eight frontier LLMs successfully completed construction tasks across multiple difficulty levels, demonstrating that language models can translate natural language into physically viable 3D structures.

2. Performance Varies Significantly Across Models

Grok-4 shows the strongest overall performance, particularly excelling in Precision and Robustness. Most models handle Magnitude and Ambiguity well but struggle with Compositionality and Precision.

3. LLMs Exhibit Creative Problem-Solving

When explicit constraints are relaxed, LLMs propose unconventional solutions such as propulsion-powered carriers for transport tasks and wheel-integrated bridge structures that utilize automatic braking for stabilization.

4. Real-World Engineering Knowledge Is Captured

LLMs construct structures mirroring real-world practices, such as steel trusses in bridges and differential steering in vehicles, suggesting that structural concepts learned from text carry implicit spatial information.

5. Significant Limitations Remain

Success rates drop sharply in hierarchical assembly tasks (Support Lv.3) and high-precision tasks (Lift). Most models except Grok-4 fail completely at the hardest difficulty levels, indicating challenges with compositional construction and precise spatial alignment.

6. More Tokens ≠ Better Performance

Cost analysis reveals that heavy inference does not guarantee strong performance: the best construction results often consume only a moderate number of tokens, while many failed attempts incur very high token usage. Beyond a capability threshold, additional inference cost does not translate into better results.

Acknowledgments

We are grateful to Spiderling Studios for creating Besiege, the inspiring physics sandbox that underpins our work. We also thank the developers of the open-source projects Lua Scripting Mod and Besiege Creation Import Addon for Blender for their valuable contributions to the community.

We also gratefully acknowledge the support of Westlake University Research Center for Industries of the Future. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.

References

[1] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.

[2] Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, and Peter Wonka. PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations. arXiv preprint arXiv:2507.07644, 2025.

[3] Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A New Benchmark for Physical Reasoning. Advances in Neural Information Processing Systems, 32, 2019.

[4] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023.

[5] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li, Ruohan Zhang, et al. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making. Advances in Neural Information Processing Systems, 37:100428–100534, 2024.

Citation

If you find BuildArena useful in your research, please consider citing our paper:

@article{xia2025buildarena,
  title={BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction},
  author={Xia, Tian and Gao, Tianrun and Deng, Wenhao and Wei, Long and Qian, Xiaowei and Jiang, Yixian and Yu, Chenglei and Wu, Tailin},
  journal={arXiv preprint arXiv:2510.16559},
  year={2025}
}