Set the OfficeQA accuracy record (72.0%) in Sentient Labs Arena Cohort 0. Solo build, one 6.3 KB system prompt, no MCP servers, no custom agent code.
Why I Built This
Databricks published the OfficeQA benchmark: quantitative data analysis over 697 Treasury Bulletin text files spanning decades of U.S. fiscal data. Sentient Labs opened Arena Cohort 0 to beat it. I entered solo to see how far a minimal, prompt-only approach could go.
How It Works
Arya is a single 6.3 KB Jinja2 system prompt. No MCP servers, no custom agent code, no runtime skill files. The prompt is the agent. It encodes a strict workflow: SEARCH → EXTRACT → WRITE preliminary answer → COMPUTE → STOP. A grep → sed → python3 pipeline inside a Docker container handles retrieval and arithmetic, backed by a self-tested toolkit of 20 statistical functions.
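The grep → sed → python3 pipeline can be sketched as a one-liner over a toy corpus. The sample data, directory name, and regex are invented for illustration; the actual Treasury Bulletin layout and Arya's exact patterns are not shown here.

```shell
# Toy corpus standing in for the Treasury Bulletin text files (invented data)
mkdir -p /tmp/bulletins
printf 'Total public debt: 31,400 billion\nTotal public debt: 33,100 billion\n' \
  > /tmp/bulletins/b1.txt

# SEARCH (grep) → EXTRACT (sed) → COMPUTE (python3)
result=$(grep -h "Total public debt" /tmp/bulletins/*.txt \
  | sed -E 's/[^0-9,]*([0-9,]+) billion/\1/' \
  | python3 -c "import sys; v=[float(l.replace(',','')) for l in sys.stdin]; print(f'{(v[-1]/v[0]-1)*100:.2f}%')")
echo "$result"  # → 5.41%
```

Each stage does one job: grep narrows 697 files to candidate lines, sed strips them to bare numbers, and python3 does the arithmetic, so the model never computes in its head.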
Three decisions made the difference:
- Write-early rule: preliminary answer within 3–5 tool calls; a wrong answer scores higher than an empty one
- Pre-written formulas: the model copies percent_change, cagr, and hp_filter verbatim instead of inventing them, eliminating mental-math errors
- Lean prompt: the 6.3 KB sweet spot emerged after 20+ submissions. Shorter prompts cost more, not less: input tokens cache at a 96% hit rate, but the extra iterations a terser prompt triggers burn full-price output tokens
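The pre-written formula style might look like the sketch below. Arya's actual toolkit is not public, so these signatures are illustrative reconstructions of the two simplest functions; hp_filter is omitted because a faithful HP filter needs a sparse linear solve.

```python
# Illustrative reconstructions of toolkit formulas (assumed signatures,
# not Arya's actual code). Copied verbatim by the model at COMPUTE time,
# functions like these remove per-task arithmetic drift.

def percent_change(old: float, new: float) -> float:
    """Percent change from old to new, in percent."""
    return (new - old) / old * 100

def cagr(begin: float, end: float, years: float) -> float:
    """Compound annual growth rate over `years` periods, in percent."""
    return ((end / begin) ** (1 / years) - 1) * 100

print(round(percent_change(31_400, 33_100), 2))  # → 5.41
print(round(cagr(100, 200, 10), 2))              # → 7.18
```

Because the model pastes the function body rather than re-deriving it, a slip like dividing by the new value instead of the old one simply cannot happen.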
Results
- 72.0% accuracy: highest ever recorded on the OfficeQA benchmark, at $0.07/task on the OpenHands-SDK harness with Minimax M2.5
- 191.0 peak score: 70.7% accuracy at $0.01/task on the Goose harness with Minimax M2.5
Sentient Labs highlighted the result on X: