This library entry is part of The Extended Frontier thesis. Entries are curated with AI assistance and human review; most initial entries were prepared with Claude (Anthropic), while individual entries may note other assisting systems. Metadata and annotations are editorial, not peer-reviewed. Entries flagged as unverified may contain placeholder dates, authors, or classifications.

OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data

Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas
LM-elicited priors are often inaccurate and overconfident.

OpenEstimate is a multi-domain benchmark for testing whether language models can express calibrated Bayesian priors for numerical estimation tasks under uncertainty, using real-world datasets in healthcare, employment, and finance.

Classification

Role
measurement-piece
Domain
cross-domain
Source type
paper
Harness types
validation-harness, grounding-context-loading
Validation position
immediately-after-generation
Validation mode
empirical, mechanical
Prescription stance
strongly-procedural
Relation to argument
validation-is-constitutive, reward-structure-matters, domain-structure-matters, breakdown-when-harness-absent
Tags
uncertainty, calibration, probabilistic-estimation, benchmark, bayesian-priors, openestimate, real-world-data

Extended capability commentary

Input legibility
Questions are constructed from real-world datasets with natural-language variable descriptions and conditioning information.
Task structure
The benchmark asks models to represent beliefs as distributional priors, making uncertainty part of the output contract.
Reward richness
Ground-truth distributions from observational data support accuracy and calibration metrics.
Feedback latency
Feedback is benchmark-level evaluation after elicitation, not an interactive repair loop.
Repairability
The benchmark identifies overconfidence and inaccuracy, but does not itself provide rich diagnostic repair paths.
Observability
Priors, quantiles, calibration error, uncertainty-accuracy correlation, and statistical baselines expose how models represent uncertainty.
Reversibility
The source is an evaluation harness, not a workflow with rollback or undo semantics.
Offline evaluability
The benchmark is explicitly offline and reproducible against dataset-derived ground truth.
Institutional ratification
ICLR venue and open-source benchmark provide research-community ratification, but not deployment governance.

Why it matters

OpenEstimate targets a capability gap that ordinary right-answer benchmarks miss: knowing how uncertain you should be when neither the model nor the human has an obvious answer. It makes calibration and uncertainty representation first-class evaluation targets.

Annotation

OpenEstimate is a measurement entry for the part of the frontier where there is no simple "right answer" visible to the user. The task is numerical estimation under uncertainty: given partial information from real-world datasets, models must express beliefs as Bayesian priors, and those priors are then evaluated against ground-truth distributions computed from the data.
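A minimal sketch of what that contract can look like, assuming the model is asked for three quantiles and the harness fits a normal prior to them. The JSON field names and the normal fit are illustrative assumptions, not the OpenEstimate schema:

    # Hypothetical elicitation contract: the model returns quantiles of its
    # belief over the target quantity, which we convert into a usable prior.
    import json
    from scipy import stats

    raw = '{"q10": 42.0, "q50": 55.0, "q90": 70.0}'  # hypothetical model output
    q = json.loads(raw)

    # For N(mu, sigma), the 90th percentile sits 1.2816 sigma above the mean,
    # so q90 - q10 = 2 * 1.2816 * sigma.
    mu = q["q50"]
    sigma = (q["q90"] - q["q10"]) / (2 * 1.2816)
    prior = stats.norm(loc=mu, scale=sigma)

    print(prior.mean(), prior.interval(0.8))  # point estimate, 80% credible interval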

The benchmark covers domains such as health, employment, and finance using datasets including NHANES, Glassdoor, and PitchBook. It evaluates point accuracy, calibration, uncertainty-accuracy correlation, and the value of LM priors relative to statistical baselines based on samples from the true distribution.
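A rough sketch of those scores, assuming scalar ground truth per question and priors represented as scipy frozen distributions; these are plausible stand-ins, not the benchmark's metric code:

    import numpy as np
    from scipy import stats

    def score(priors, truths):
        truths = np.asarray(truths, dtype=float)
        means = np.array([p.mean() for p in priors])
        sigmas = np.array([p.std() for p in priors])

        # Point accuracy: mean absolute relative error of the prior mean.
        rel_err = np.abs(means - truths) / np.abs(truths)

        # Calibration: frequency with which the truth lands inside the 80%
        # credible interval; a calibrated model should be near 0.80.
        bounds = np.array([p.interval(0.8) for p in priors])
        coverage = np.mean((truths >= bounds[:, 0]) & (truths <= bounds[:, 1]))

        # Uncertainty-accuracy correlation: does lower stated uncertainty
        # actually go with lower error?
        rho = stats.spearmanr(sigmas, rel_err).correlation

        return {"mean_rel_err": rel_err.mean(), "coverage_80": coverage,
                "uncertainty_accuracy_rho": rho}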

The headline result is sobering for deployment: across six frontier models, model-elicited priors are often inaccurate and overconfident. The announcement thread adds the sharper interpretation that model priors can be equivalent to fewer than five real data points and that higher model certainty does not reliably mean higher accuracy.
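One way to read the "fewer than five real data points" comparison, as a hedged sketch: find the smallest sample size at which a naive baseline, the mean of k draws from the true distribution, matches the LM prior's error. The simulation setup is an assumption for illustration, not the paper's procedure:

    import numpy as np

    rng = np.random.default_rng(0)

    def baseline_error(true_dist, k, trials=2000):
        # Absolute error of the k-sample mean against the true mean, averaged
        # over resampled trials; true_dist is a scipy frozen distribution
        # standing in for the dataset-derived ground truth.
        draws = true_dist.rvs(size=(trials, k), random_state=rng)
        return np.mean(np.abs(draws.mean(axis=1) - true_dist.mean()))

    def data_point_equivalent(lm_error, true_dist, max_k=50):
        # Smallest k at which the naive baseline beats the LM prior's error;
        # a small k means the elicited prior carries little information.
        for k in range(1, max_k + 1):
            if baseline_error(true_dist, k) <= lm_error:
                return k
        return max_k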

Extended Frontier Read

OpenEstimate strengthens the measurement shelf because it asks for a richer construct than correctness. The relevant capability is not "can the model answer?" but:

  • can it represent uncertainty as a usable prior,
  • is that prior calibrated,
  • does the model know when it does not know,
  • does additional reasoning effort or prompting actually improve uncertainty quality?

That makes it a direct counterweight to benchmarks that reward confident point answers. The extension here is the evaluation harness itself: a structured output contract plus ground-truth distributions plus calibration metrics.
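In schematic form, assuming the same hypothetical quantile fields as above, that harness reduces to a contract check applied immediately after generation, followed by scoring against ground truth; the real benchmark's schema and wiring differ:

    def validate_prior(raw: dict) -> dict:
        # Contract check applied immediately after generation, before any
        # scoring; field names follow the hypothetical schema sketched above.
        required = ("q10", "q50", "q90")
        if any(k not in raw for k in required):
            raise ValueError(f"missing quantile fields, expected {required}")
        if not raw["q10"] <= raw["q50"] <= raw["q90"]:
            raise ValueError("quantiles must be monotone")
        return raw

    def run_harness(model, questions, truths, metrics):
        # model(q) is assumed to return a parsed dict; metrics maps a name
        # to a function of (priors, truths), e.g. a scorer like the sketch
        # above adapted to the dict representation.
        priors = [validate_prior(model(q)) for q in questions]
        return {name: fn(priors, truths) for name, fn in metrics.items()}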

Boundary

This is a validation harness, not a repair harness. It can show that models are overconfident, and it can compare elicitation protocols, but it does not by itself teach the model how to repair its uncertainty estimates. That makes it a useful neighbor to RLCR-style work that tries to train models to reason about what they do not know.

Source code: alanarenda/openestimate. Announcement thread: @alanamarzoev on X.

Notes

GitHub repository: https://github.com/alanarenda/openestimate. Announcement thread supplied by Daniel from X. arXiv page says v1 submitted Oct 16, 2025 and v2 revised Apr 22, 2026. This entry was prepared with Codex (OpenAI).

Related entries

Overlap is computed on tags, relation-to-argument, and harness types, not on role or domain, because contrasts are often the most useful neighbors.