Large language models are increasingly capable of generating human-like answers to open-ended medical questions. But when health is on the line, sounding right is not enough. We need to know whether those answers are factually correct, appropriately scoped, and backed by reliable evidence.
That’s the challenge MedAESQA aims to address.
MedAESQA is a new dataset released by researchers at the U.S. National Library of Medicine. It contains 40 real medical questions posed by the public. For each question, the dataset includes one expert-authored answer, as well as thirty responses generated by a wide range of automated systems submitted to the 2024 TREC Biomedical Evidence Accumulation and Evaluation Track. Each of these AI-generated answers was broken down into individual statements, with every statement annotated for accuracy and every cited abstract reviewed for relevance and support.
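To make that structure concrete, here is a minimal sketch of how one question's records might be represented in code. The field names and labels below are illustrative assumptions, not the dataset's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical record layout; the actual MedAESQA schema may differ.
@dataclass
class CitationJudgment:
    pubmed_id: str
    relevant: bool        # was the cited abstract judged relevant to the statement?
    supports_claim: bool  # was it judged to actually support the claim?

@dataclass
class Statement:
    text: str
    accuracy_label: str   # e.g. "correct", "incorrect", "unverifiable"
    citations: list[CitationJudgment] = field(default_factory=list)

@dataclass
class QuestionRecord:
    question: str                                # one of the 40 consumer health questions
    expert_answer: str                           # the single expert-authored answer
    system_answers: dict[str, list[Statement]]   # system run id -> annotated statements
```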
To make this possible at scale, the evaluation relied on a structured crowdsourcing effort facilitated by Centaur.ai, in which medically trained professionals contributed their expertise through Centaur’s human-in-the-loop platform. Each clinician independently reviewed individual statements generated by large language models, evaluating their factual accuracy and assessing whether cited PubMed abstracts genuinely supported the claims.
This approach allowed the researchers to scale expert validation across a high volume of content while maintaining consistent clinical standards. By distributing the annotation workload across a vetted network of contributors, the process achieved both efficiency and rigor, two qualities that are rarely easy to balance in medical AI evaluation.
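As a rough illustration of how independent reviews of the same statement could be combined, the sketch below takes a simple majority vote. This is an assumption for illustration only, not a description of Centaur’s actual aggregation logic, which may, for example, weight reviewers by measured performance.

```python
from collections import Counter

def aggregate_reviews(labels: list[str]) -> tuple[str, float]:
    """Combine independent clinician labels for one statement.

    Returns the majority label and the fraction of reviewers who agreed with it.
    Illustrative only; a production platform might weight reviewers differently.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Example: three clinicians independently label the same statement.
print(aggregate_reviews(["supported", "supported", "not_supported"]))
# ('supported', 0.6666666666666666)
```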
MedAESQA offers something that has been missing from the medical AI ecosystem: a clear, structured benchmark for measuring not just linguistic fluency but evidence-grounded factual accuracy. It allows model developers to assess whether generated statements are actually supported by the citations provided. It also exposes where systems tend to overstate, omit, or misattribute claims.
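For example, a developer could compute simple evidence-grounding metrics from such annotations: the share of a system’s statements judged correct, and the share of its citations judged to actually support their claims. The sketch below assumes illustrative labels, not MedAESQA’s exact annotation scheme.

```python
def grounding_metrics(statement_labels: list[str], citation_support: list[bool]) -> dict[str, float]:
    """Two simple evidence-grounding metrics for one system's answers.

    statement_labels: per-statement accuracy labels, e.g. "correct" / "incorrect".
    citation_support: per-citation flags, True if the cited abstract was judged
    to support its statement. Both label sets are illustrative assumptions.
    """
    return {
        "statement_accuracy": sum(l == "correct" for l in statement_labels) / max(len(statement_labels), 1),
        "citation_support_rate": sum(citation_support) / max(len(citation_support), 1),
    }

# Example: 8 of 10 statements judged correct; 5 of 7 citations judged supportive.
print(grounding_metrics(["correct"] * 8 + ["incorrect"] * 2, [True] * 5 + [False] * 2))
```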
In short, the dataset evaluates whether medical AI can be trusted, not just whether it sounds plausible.
This is a step forward for responsible AI evaluation. It moves us beyond surface-level fluency metrics and into a more rigorous framework grounded in expert judgment and transparent evidence. It also demonstrates how domain-specific expertise can be scaled when the right human-in-the-loop infrastructure is in place.
The full dataset and accompanying paper are now publicly available through Scientific Data. We’re encouraged by this direction and proud that our platform could support the annotation process. The work shows how AI evaluation improves when expert judgment is integrated systematically and how progress in this field depends not just on models but on measurement.