Arya: Beating the OfficeQA Benchmark

Set the OfficeQA accuracy record (72.0%) in Sentient Labs Arena Cohort 0. Solo build, one 6.3 KB system prompt, no MCP servers, no custom agent code.

Python · Jinja2 · OpenHands-SDK · Goose · Minimax M2.5 · OpenRouter · Docker

Why I Built This

Databricks published the OfficeQA benchmark: quantitative data analysis over 697 Treasury Bulletin text files spanning decades of U.S. fiscal data. Sentient Labs opened Arena Cohort 0 to beat it. I entered solo to see how far a minimal, prompt-only approach could go.

How It Works

Arya is a single 6.3 KB Jinja2 system prompt. No MCP servers, no custom agent code, no runtime skill files. The prompt is the agent. It encodes a strict workflow: SEARCH → EXTRACT → WRITE preliminary answer → COMPUTE → STOP. A grep → sed → python3 pipeline inside a Docker container handles retrieval and arithmetic, backed by a self-tested toolkit of 20 statistical functions.
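To make the SEARCH → EXTRACT → COMPUTE part of that workflow concrete, here is a minimal Python sketch. It is not Arya's actual prompt or tooling: the agent runs grep, sed, and python3 as shell commands inside the container, whereas this sketch imitates the first two with standard-library calls so it runs anywhere. The sample text, table name, figures, and the helper names `search`, `extract`, and `parse_rows` are all hypothetical. The WRITE and STOP steps are left out.

```python
import re

# Hypothetical excerpt standing in for one Treasury Bulletin text file.
# The table name and numbers are invented for illustration.
SAMPLE_BULLETIN = """\
TREASURY BULLETIN (illustrative excerpt, not real data)
Table FD-1. Summary of Federal Debt
Fiscal year    Total debt (millions of dollars)
1990           100.0
1991           112.5
Table FD-2. Interest-Bearing Securities
"""


def search(text, pattern):
    """SEARCH: like `grep -n`, return the 1-based numbers of matching lines."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if re.search(pattern, line)
    ]


def extract(text, start, end):
    """EXTRACT: like `sed -n 'start,endp'`, return an inclusive line range."""
    return text.splitlines()[start - 1:end]


def parse_rows(lines):
    """Turn 'year value' rows into a {year: value} dict, skipping other lines."""
    rows = {}
    for line in lines:
        match = re.match(r"\s*(\d{4})\s+([\d.,]+)\s*$", line)
        if match:
            rows[int(match.group(1))] = float(match.group(2).replace(",", ""))
    return rows


# SEARCH for the table, EXTRACT a small window around it, then COMPUTE.
hits = search(SAMPLE_BULLETIN, r"Table FD-1")
window = extract(SAMPLE_BULLETIN, hits[0], hits[0] + 3)
debt = parse_rows(window)
change = (debt[1991] - debt[1990]) / debt[1990] * 100
print(f"{change:.1f}%")
```

Keeping each stage small means the model reads only the lines it needs and leaves the arithmetic to the interpreter instead of doing it in its head.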

Three decisions made the difference:

  • Write-early rule: preliminary answer within 3–5 tool calls; a wrong answer scores higher than an empty one
  • Pre-written formulas: the model copies percent_change, cagr, hp_filter verbatim instead of inventing them, eliminating mental-math errors
  • Lean prompt: settled on 6.3 KB as the sweet spot after 20+ submissions. Shorter prompts cost more, not less: input tokens hit the cache 96% of the time, so trimming the prompt saves little, while the extra iterations a thinner prompt triggers burn full-price output tokens
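As an illustration of the pre-written formulas idea, the sketch below shows what two such helpers and their self-tests might look like in Python. Only the names percent_change and cagr come from the write-up. The signatures, the choice to return percentages, and the error handling are assumptions, and hp_filter is omitted because it needs a linear solver.

```python
import math


def percent_change(old, new):
    """Percent change from old to new, in percent (assumed convention)."""
    if old == 0:
        raise ValueError("percent change is undefined when the old value is 0")
    return (new - old) / old * 100.0


def cagr(start_value, end_value, periods):
    """Compound annual growth rate over `periods` years, in percent."""
    if start_value <= 0 or end_value < 0 or periods <= 0:
        raise ValueError("cagr needs start_value > 0, end_value >= 0, periods > 0")
    return ((end_value / start_value) ** (1.0 / periods) - 1.0) * 100.0


def self_test():
    """Check each helper against a value that is easy to verify by hand."""
    assert percent_change(200.0, 250.0) == 25.0
    assert math.isclose(cagr(100.0, 121.0, 2), 10.0)  # 1.1 squared is 1.21
    return True


self_test()
```

With helpers like these pasted into the prompt, the model only has to fill in the arguments, for example `cagr(first_year_value, last_year_value, years)`, instead of re-deriving the formula on every task.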

Results

  • 72.0% accuracy: highest ever recorded on the OfficeQA benchmark, at $0.07/task on the OpenHands-SDK harness with Minimax M2.5
  • 191.0 peak score: 70.7% accuracy at $0.01/task on the Goose harness with Minimax M2.5

Sentient Labs highlighted the result on X:

During Arena, Vikranth (@rvikranth369) paired his agent Arya with OpenHands (@OpenHandsDev), attaining 72% accuracy on OfficeQA. That's a SOTA-level performance achieved at a substantially reduced cost ↓

Sentient Ecosystem
@SentientEco

@rvikranth369 paired his agent Arya, named after the Indian Mathematician and Astronomer, with OpenHands (@OpenHandsDev) to hit 72% accuracy on OfficeQA, achieving SOTA-level performance at a fraction of the cost. This closes in on the frontier oracle baselines set by Opus 4.5
