OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
LM-elicited priors are often inaccurate and overconfident.
OpenEstimate is a multi-domain benchmark that tests whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets from healthcare, employment, and finance.
Classification
- Role
- measurement-piece
- Domain
- cross-domain
- Source type
- paper
- Harness types
- validation-harness, grounding-context-loading
- Validation position
- immediately-after-generation
- Validation mode
- empirical, mechanical
- Prescription stance
- strongly-procedural
- Relation to argument
- validation-is-constitutive, reward-structure-matters, domain-structure-matters, breakdown-when-harness-absent
- Tags
- uncertainty, calibration, probabilistic-estimation, benchmark, bayesian-priors, openestimate, real-world-data
Extended capability commentary
- Input legibility
- Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information.
- Task structure
- The benchmark asks models to represent beliefs as distributional priors, making uncertainty part of the output contract.
- Reward richness
- Ground-truth distributions from observational data support accuracy and calibration metrics.
- Feedback latency
- Feedback is benchmark-level evaluation after elicitation, not an interactive repair loop.
- Repairability
- The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths.
- Observability
- Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty.
- Reversibility
- The source is an evaluation harness, not a workflow with rollback or undo semantics.
- Offline evaluability
- The benchmark is explicitly offline and reproducible against dataset-derived ground truth.
- Institutional ratification
- ICLR venue and open-source benchmark provide research-community ratification, but not deployment governance.
Why it matters
OpenEstimate targets a capability gap that ordinary right-answer benchmarks miss: knowing how uncertain you should be when neither the model nor the human has an obvious answer. It makes calibration and uncertainty representation first-class evaluation targets.
Annotation
OpenEstimate is a measurement entry for the part of the frontier where there is no simple "right answer" visible to the user. The task is numerical estimation under uncertainty: given partial information from real-world datasets, models must express their beliefs as Bayesian priors; those priors are then evaluated against ground-truth distributions computed from the data.
The benchmark covers domains such as health, employment, and finance using datasets including NHANES, Glassdoor, and PitchBook. It evaluates point accuracy, calibration, uncertainty-accuracy correlation, and the value of LM priors relative to statistical baselines based on samples from the true distribution.
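A calibration metric of this kind can be sketched as follows: elicit quantiles of the model's prior, then compare each nominal quantile level with the empirical coverage those values achieve on the ground-truth data. The helper and variable names below are hypothetical illustrations, not the benchmark's actual code:

```python
import numpy as np

def quantile_calibration_error(elicited, ground_truth):
    """Mean absolute gap between nominal quantile levels and the
    empirical coverage of the elicited quantile values.

    elicited: dict mapping nominal level (e.g. 0.1) -> model's quantile value
    ground_truth: 1-D array of values from the target distribution
    """
    ground_truth = np.asarray(ground_truth)
    gaps = []
    for level, value in elicited.items():
        empirical = np.mean(ground_truth <= value)  # observed coverage
        gaps.append(abs(empirical - level))
    return float(np.mean(gaps))

# Toy ground truth: 10,000 draws from N(50, 10).
rng = np.random.default_rng(0)
truth = rng.normal(50.0, 10.0, size=10_000)

# An overconfident prior bunches its quantiles tightly around the median;
# a well-spread prior matches the true dispersion.
overconfident = {0.1: 48.0, 0.5: 50.0, 0.9: 52.0}
well_spread = {0.1: 37.2, 0.5: 50.0, 0.9: 62.8}
print(quantile_calibration_error(overconfident, truth))  # large error
print(quantile_calibration_error(well_spread, truth))    # near zero
```

The overconfident prior scores a much larger calibration error, which is the failure mode the benchmark's headline result reports.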
The headline result is sobering for deployment: across six frontier models, model-elicited priors are often inaccurate and overconfident. The announcement thread adds the sharper interpretation that model priors can be equivalent to fewer than five real data points and that higher model certainty does not reliably mean higher accuracy.
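One simple way to make "equivalent to fewer than five real data points" concrete, under a conjugate-normal framing that is ours rather than necessarily the paper's, is to ask how many i.i.d. samples give a sample mean with the same standard error as the prior's spread:

```python
def effective_sample_size(prior_std, population_std):
    """Number of i.i.d. observations whose sample mean has the same
    standard error as the prior's uncertainty about the mean.
    Solves population_std / sqrt(n) = prior_std for n."""
    return (population_std / prior_std) ** 2

# A prior half as wide as the population spread carries information
# equivalent to only four observations.
print(effective_sample_size(prior_std=5.0, population_std=10.0))  # 4.0
```

On this reading, a wide LM prior is quickly dominated by even a handful of real observations, which is why the comparison against small-sample statistical baselines is informative.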
Extended Frontier Read
OpenEstimate strengthens the measurement shelf because it asks for a richer construct than correctness. The relevant capability is not "can the model answer?" but:
- can it represent uncertainty as a usable prior?
- is that prior calibrated?
- does the model know when it does not know?
- does additional reasoning effort or prompting actually improve uncertainty quality?
That makes it a direct counterweight to benchmarks that reward confident point answers. The extension here is the evaluation harness itself: a structured output contract plus ground-truth distributions plus calibration metrics.
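A structured output contract of this kind can be sketched as a parse-and-validate step: the model must return its prior as quantiles in JSON, and the harness rejects outputs that are incomplete or non-monotone. The schema and function names here are hypothetical, not OpenEstimate's actual interface:

```python
import json

# Hypothetical contract: the model returns its prior as five quantiles.
EXPECTED_LEVELS = (0.1, 0.25, 0.5, 0.75, 0.9)

def parse_prior(raw: str):
    """Parse and validate a model's JSON prior against the contract:
    all required quantile levels present, values numeric and non-decreasing."""
    data = json.loads(raw)
    values = [float(data[str(level)]) for level in EXPECTED_LEVELS]
    if any(a > b for a, b in zip(values, values[1:])):
        raise ValueError("quantiles must be non-decreasing")
    return dict(zip(EXPECTED_LEVELS, values))

raw = '{"0.1": 37.2, "0.25": 43.3, "0.5": 50.0, "0.75": 56.7, "0.9": 62.8}'
print(parse_prior(raw))
```

The point of the contract is that uncertainty becomes part of the checked output, not an optional aside in free text.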
Boundary
This is a validation harness, not a repair harness. It can show that models are overconfident, and it can compare elicitation protocols, but it does not by itself teach the model how to repair its uncertainty estimates. That makes it a useful neighbor to RLCR-style work that tries to train models to reason about what they do not know.
Source code: alanarenda/openestimate. Announcement thread: @alanamarzoev on X.
Notes
GitHub repository: https://github.com/alanarenda/openestimate. Announcement thread supplied by Daniel from X. arXiv page says v1 submitted Oct 16, 2025 and v2 revised Apr 22, 2026. This entry was prepared with Codex (OpenAI).
Related entries
- Expanding RL with Verifiable Rewards Across Diverse Domains · Ma et al. · 2025-03-30 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains · Unknown (OpenReview: 21UFlJrmS2) · 2025-08-31 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness
- Standard Signal: AI-native hedge fund announcement · Michael Royzen · 2026-02-28 · reward-structure-matters, domain-structure-matters, validation-is-constitutive, validation-harness
- LLM Knowledge Bases · Andrej Karpathy · 2026-04-01 · validation-is-constitutive, domain-structure-matters, grounding-context-loading, validation-harness
Overlap is computed on tags, relation-to-argument, and harness types, not on role or domain, because contrasts are often the most useful neighbors.