MedAESQA: New Standard for Medical Question Answering

Tristan Bishop, Head of Marketing
August 4, 2025

Grounding medical AI in evidence: Inside the MedAESQA dataset

Large language models are increasingly capable of generating human-like answers to open-ended medical questions. But when health is on the line, sounding right is not enough. We need to know whether those answers are factually correct, appropriately scoped, and backed by reliable evidence.

That’s the challenge MedAESQA aims to address.

What is MedAESQA?

MedAESQA is a new dataset released by researchers at the U.S. National Library of Medicine. It contains 40 real medical questions posed by the public. For each question, the dataset includes one expert-authored answer and 30 responses generated by a wide range of automated systems submitted to the TREC 2024 Biomedical Generative Retrieval (BioGen) Track. Each of these AI-generated answers was broken down into individual statements, with every statement annotated for accuracy and every cited abstract reviewed for relevance and support.
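
To make that nesting concrete, here is a minimal Python sketch of how one question and its annotations could be modeled. The class and field names (`Question`, `GeneratedAnswer`, `Statement`, `Citation`, `pmid`, `relevant`, `supports`) are assumptions chosen for illustration and are not the schema of the released files; the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    pmid: str        # identifier of the cited abstract (hypothetical field name)
    relevant: bool   # reviewer judged the abstract relevant
    supports: bool   # reviewer judged the abstract to support the statement


@dataclass
class Statement:
    text: str
    accurate: bool   # reviewer's accuracy judgment for this statement
    citations: List[Citation] = field(default_factory=list)


@dataclass
class GeneratedAnswer:
    system_id: str   # which automated system produced the answer
    statements: List[Statement] = field(default_factory=list)


@dataclass
class Question:
    text: str
    expert_answer: str
    generated_answers: List[GeneratedAnswer] = field(default_factory=list)


def count_statements(question: Question) -> int:
    """Total number of annotated statements across all generated answers."""
    return sum(len(answer.statements) for answer in question.generated_answers)


# Invented toy record, only to show the shape of the data.
example = Question(
    text="Example consumer health question?",
    expert_answer="Example expert-authored answer.",
    generated_answers=[
        GeneratedAnswer(
            system_id="system-01",
            statements=[
                Statement(
                    text="Example statement one.",
                    accurate=True,
                    citations=[Citation(pmid="0000001", relevant=True, supports=True)],
                ),
                Statement(text="Example statement two.", accurate=False),
            ],
        )
    ],
)

print(count_statements(example))
```

In this sketch, a full question would hold one `expert_answer` and 30 `GeneratedAnswer` entries, each split into statements that carry their own citations.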

To make this possible at scale, the evaluation relied on a structured crowdsourcing effort facilitated by Centaur.ai, in which medically trained professionals contributed their expertise through Centaur’s human-in-the-loop platform. Each clinician independently reviewed individual statements generated by large language models, evaluating their factual accuracy and assessing whether the cited PubMed abstracts genuinely supported the claims.

This approach allowed the researchers to scale expert validation across a high volume of content while maintaining consistent clinical standards. By distributing the annotation workload across a vetted network of contributors, the process achieved both efficiency and rigor, two qualities that are hard to balance in medical AI evaluation.

What does MedAESQA offer?

MedAESQA offers something that has been missing from the medical AI ecosystem: a clear, structured benchmark for measuring not just linguistic fluency but evidence-grounded factual accuracy. It allows model developers to assess whether generated statements are actually supported by the citations provided. It also exposes where systems tend to overstate, omit, or misattribute claims.
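
As a rough illustration of what statement-level measurement can look like, the Python sketch below computes a few simple scores from annotated statements. The record layout (`text`, `accurate`, `citation_supports`) and the scores themselves are assumptions made for this example; they are not the dataset's file format or the metrics used in the paper, and the toy data is invented.

```python
from typing import Dict, List


def statement_accuracy(statements: List[Dict]) -> float:
    """Fraction of statements that reviewers marked as accurate."""
    if not statements:
        return 0.0
    return sum(1 for s in statements if s["accurate"]) / len(statements)


def citation_support_rate(statements: List[Dict]) -> float:
    """Fraction of cited abstracts judged to support their statement."""
    flags = [flag for s in statements for flag in s["citation_supports"]]
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


def unsupported_statements(statements: List[Dict]) -> List[str]:
    """Statements with no supporting citation (including uncited ones)."""
    return [s["text"] for s in statements if not any(s["citation_supports"])]


# Invented annotations for one generated answer.
answer = [
    {"text": "Claim A", "accurate": True, "citation_supports": [True, True]},
    {"text": "Claim B", "accurate": True, "citation_supports": [False]},
    {"text": "Claim C", "accurate": True, "citation_supports": [True]},
    {"text": "Claim D", "accurate": False, "citation_supports": []},
]

print(statement_accuracy(answer))
print(citation_support_rate(answer))
print(unsupported_statements(answer))
```

Keeping accuracy and citation support as separate scores reflects the distinction the dataset draws: a statement can be accurate yet cite an abstract that does not support it, as "Claim B" does in the toy data.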

In short, the dataset evaluates whether medical AI can be trusted, not just whether it sounds plausible.

This is a step forward for responsible AI evaluation. It moves us beyond surface-level fluency metrics and into a more rigorous framework grounded in expert judgment and transparent evidence. It also demonstrates how domain-specific expertise can be scaled when the right human-in-the-loop infrastructure is in place.

The full dataset and accompanying paper are now publicly available through Scientific Data. We’re encouraged by this direction and proud that our platform could support the annotation process. The work shows how AI evaluation improves when expert judgment is integrated systematically and how progress in this field depends not just on models but on measurement.

To read the full paper, follow the link below:

https://www.nature.com/articles/s41597-025-05233-z

For a demonstration of how we can support your AI model training and evaluation with greater accuracy, scalability, and value, schedule a demo with Centaur.ai.