Large language models are increasingly capable of generating human-like answers to open-ended medical questions. But when health is on the line, sounding right is not enough. We need to know whether those answers are factually correct, appropriately scoped, and backed by reliable evidence.
That’s the challenge MedAESQA aims to address.
MedAESQA is a new dataset released by researchers at the U.S. National Library of Medicine. It contains 40 real medical questions posed by the public. For each question, the dataset includes one expert-authored answer, as well as thirty responses generated by a wide range of automated systems submitted to the 2024 TREC Biomedical Evidence Accumulation and Evaluation Track. Each of these AI-generated answers was broken down into individual statements, with every statement annotated for accuracy and every cited abstract reviewed for relevance and support.
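To make that structure concrete, here is a minimal sketch of how one question's records might be represented in code. The field names and labels below are illustrative assumptions, not the dataset's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical record layout; the actual MedAESQA schema may differ.
@dataclass
class CitationJudgment:
    pubmed_id: str
    relevant: bool        # was the cited abstract judged relevant to the statement?
    supports_claim: bool  # was it judged to actually support the claim?

@dataclass
class Statement:
    text: str
    accuracy_label: str   # e.g. "correct", "incorrect", "unverifiable"
    citations: list[CitationJudgment] = field(default_factory=list)

@dataclass
class QuestionRecord:
    question: str                                # one of the 40 consumer health questions
    expert_answer: str                           # the single expert-authored answer
    system_answers: dict[str, list[Statement]]   # system run id -> annotated statements
```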
To make this possible at scale, the evaluation relied on a structured crowdsourcing effort facilitated by Centaur.ai, in which medically trained professionals contributed their expertise through Centaur’s human-in-the-loop platform. Each clinician independently reviewed individual statements generated by large language models, evaluating their factual accuracy and assessing whether cited PubMed abstracts genuinely supported the claims.
This approach allowed the researchers to scale expert validation across a high volume of content while maintaining consistent clinical standards. By distributing the annotation workload across a vetted network of contributors, the process achieved both efficiency and rigor, two qualities that are rarely easy to balance in medical AI evaluation.
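As a rough illustration of how independent reviews of the same statement could be combined, the sketch below takes a simple majority vote. This is an assumption for illustration only, not a description of Centaur’s actual aggregation logic, which may, for example, weight reviewers by measured performance.

```python
from collections import Counter

def aggregate_reviews(labels: list[str]) -> tuple[str, float]:
    """Combine independent clinician labels for one statement.

    Returns the majority label and the fraction of reviewers who agreed with it.
    Illustrative only; a production platform might weight reviewers differently.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Example: three clinicians independently label the same statement.
print(aggregate_reviews(["supported", "supported", "not_supported"]))
# ('supported', 0.6666666666666666)
```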
MedAESQA offers something that has been missing from the medical AI ecosystem: a clear, structured benchmark for measuring not just linguistic fluency but evidence-grounded factual accuracy. It allows model developers to assess whether generated statements are actually supported by the citations provided. It also exposes where systems tend to overstate, omit, or misattribute claims.
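For example, a developer could compute simple evidence-grounding metrics from such annotations: the share of a system’s statements judged correct, and the share of its citations judged to actually support their claims. The sketch below assumes illustrative labels, not MedAESQA’s exact annotation scheme.

```python
def grounding_metrics(statement_labels: list[str], citation_support: list[bool]) -> dict[str, float]:
    """Two simple evidence-grounding metrics for one system's answers.

    statement_labels: per-statement accuracy labels, e.g. "correct" / "incorrect".
    citation_support: per-citation flags, True if the cited abstract was judged
    to support its statement. Both label sets are illustrative assumptions.
    """
    return {
        "statement_accuracy": sum(l == "correct" for l in statement_labels) / max(len(statement_labels), 1),
        "citation_support_rate": sum(citation_support) / max(len(citation_support), 1),
    }

# Example: 8 of 10 statements judged correct; 5 of 7 citations judged supportive.
print(grounding_metrics(["correct"] * 8 + ["incorrect"] * 2, [True] * 5 + [False] * 2))
```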
In short, the dataset evaluates whether medical AI can be trusted, not just whether it sounds plausible.
This is a step forward for responsible AI evaluation. It moves us beyond surface-level fluency metrics and into a more rigorous framework grounded in expert judgment and transparent evidence. It also demonstrates how domain-specific expertise can be scaled when the right human-in-the-loop infrastructure is in place.
The full dataset and accompanying paper are now publicly available through Scientific Data. We’re encouraged by this direction and proud that our platform could support the annotation process. The work shows how AI evaluation improves when expert judgment is integrated systematically and how progress in this field depends not just on models but on measurement.