
MedAESQA: Setting a new standard for evidence-grounded medical QA

Tristan Bishop, Head of Marketing
August 4, 2025

Grounding medical AI in evidence: Inside the MedAESQA dataset

Large language models are increasingly capable of generating human-like answers to open-ended medical questions. But when health is on the line, sounding right is not enough. We need to know whether those answers are factually correct, appropriately scoped, and backed by reliable evidence.

That’s the challenge MedAESQA aims to address.

What is MedAESQA?

MedAESQA is a new dataset released by researchers at the U.S. National Library of Medicine. It contains 40 real medical questions posed by the public. For each question, the dataset includes one expert-authored answer, along with 30 responses generated by a wide range of automated systems submitted to the 2024 TREC Biomedical Evidence Accumulation and Evaluation Track. Each of these AI-generated answers was broken down into individual statements, with every statement annotated for accuracy and every cited abstract reviewed for relevance and support.
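To make that structure concrete, here is a minimal sketch of how one question’s record could be represented in code. The class and field names below are our own illustrative assumptions, not the dataset’s actual schema:

```python
from dataclasses import dataclass, field

# Illustrative schema only -- the field names are assumptions,
# not the actual identifiers used in MedAESQA.

@dataclass
class CitedAbstract:
    pubmed_id: str
    relevant: bool        # is the abstract on-topic for the statement?
    supports_claim: bool  # does it actually back the statement?

@dataclass
class Statement:
    text: str
    accurate: bool        # annotator judgment of factual accuracy
    citations: list[CitedAbstract] = field(default_factory=list)

@dataclass
class SystemAnswer:
    system_id: str        # one of the ~30 automated systems per question
    statements: list[Statement] = field(default_factory=list)

@dataclass
class QuestionRecord:
    question: str         # one of the 40 consumer health questions
    expert_answer: str    # the expert-authored reference answer
    system_answers: list[SystemAnswer] = field(default_factory=list)
```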

To make this possible at scale, the evaluation relied on a structured crowdsourcing effort facilitated by Centaur.ai, in which medically trained professionals contributed their expertise through Centaur’s human-in-the-loop platform. Each clinician independently reviewed individual statements generated by large language models, evaluating their factual accuracy and assessing whether cited PubMed abstracts genuinely supported the claims.

This approach allowed the researchers to scale expert validation across a high volume of content while maintaining consistent clinical standards. By distributing the annotation workload across a vetted network of contributors, the process achieved both efficiency and rigor, two qualities that are rarely easy to balance in medical AI evaluation.

What does MedAESQA offer?

MedAESQA offers something that has been missing from the medical AI ecosystem: a clear, structured benchmark for measuring not just linguistic fluency but evidence-grounded factual accuracy. It allows model developers to assess whether generated statements are actually supported by the citations provided. It also exposes where systems tend to overstate, omit, or misattribute claims.
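As a rough illustration of how such a benchmark can be used, the toy scoring pass below computes the fraction of a system’s statements judged accurate and the fraction whose citations actually support them, reusing the illustrative dataclasses sketched earlier. These metrics are our own simplification, not MedAESQA’s official evaluation protocol:

```python
def score_system(answers: list[SystemAnswer]) -> dict[str, float]:
    """Toy scoring pass over one system's annotated answers.

    Reuses the illustrative dataclasses from the sketch above;
    the metrics here are a simplification for exposition only.
    """
    statements = [s for a in answers for s in a.statements]
    if not statements:
        return {"accuracy": 0.0, "citation_support": 0.0}
    accurate = sum(s.accurate for s in statements)
    supported = sum(
        any(c.supports_claim for c in s.citations) for s in statements
    )
    return {
        "accuracy": accurate / len(statements),
        "citation_support": supported / len(statements),
    }
```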

In short, the dataset evaluates whether medical AI can be trusted, not just whether it sounds plausible.

This is a step forward for responsible AI evaluation. It moves us beyond surface-level fluency metrics and into a more rigorous framework grounded in expert judgment and transparent evidence. It also demonstrates how domain-specific expertise can be scaled when the right human-in-the-loop infrastructure is in place.

The full dataset and accompanying paper are now publicly available through Scientific Data. We’re encouraged by this direction and proud that our platform could support the annotation process. The work shows how AI evaluation improves when expert judgment is integrated systematically and how progress in this field depends not just on models but on measurement.

