
Case Study: Evaluating biomedical LLMs with experts from a leading research institute

Ali Devaney, Marketing
March 12, 2025

At one of the year’s most prestigious information retrieval research conferences, researchers challenged teams to build AI systems that accurately answer medical questions, provide relevant citations, and simplify scientific information for a general audience. Teams leveraged the most modern techniques, from multi-agent AI systems to prompt engineering and fine-tuning, and Centaur helped researchers evaluate their submissions.

Supporting research into trustworthy biomedical LLMs

Each year, one of the leading information retrieval conferences brings together top researchers and AI developers to advance information retrieval techniques. In 2024, the event featured two specialized healthcare tracks focused on biomedical information generation and simplification. To support these tracks, researchers provided participants with large test datasets, standardized scoring methods, and a collaborative forum to exchange and refine ideas. Teams also received exclusive year-long access to the datasets, enabling them to publish their findings.

The first track focused on improving the trustworthiness of LLMs. Organizers tasked teams with developing AI systems that generate factual answers to medical questions based on a given topic and narrative context, with every sentence in the generated answer backed by a relevant citation. Participants were given access to 20 million MEDLINE/PubMed article identifiers (PMIDs) as potential citation sources, along with a test dataset of 65 topics, each with a question, narrative context, sample answer, and citations.

The second track aimed to make complex biomedical information more accessible to the general public. Organizers tasked teams with developing AI systems that analyze scientific abstracts to 1) identify and simplify complex terms and 2) convert each sentence into plain language while preserving the meaning of the original abstract. Participants were given 750 abstracts corresponding to 50 consumer questions, along with answers and citations, and asked to submit both the term-level simplifications and sentence-by-sentence simplifications of the abstracts.

Thirteen teams participated across the two tracks, producing 49 submissions. After building their AI systems, the teams tested them on the provided questions and submitted the results for evaluation. Without in-house clinicians available to review the submissions within the necessary timeframe, the researchers sought assistance from external vendors: they needed a model evaluation partner that could ensure high-quality evaluations on a compressed timeline.

Evaluating biomedical LLMs to support cutting-edge research

Researchers partnered with Centaur to evaluate the submissions rigorously and efficiently. Centaur was an appealing partner because its methodology, grounded in collective intelligence, provides multiple high-quality evaluations for every submission as standard practice. Collecting multiple evaluations per submission mattered because the tasks for both tracks are open-ended: each question has many valid answers, so organizers expected multiple evaluations to minimize bias and improve the quality of the assessments. Track organizers and teams also get access to detailed metrics derived from this methodology, such as interrater agreement and agreement with gold standards, which they can use to inform further improvements to their AI systems.

Enabling quality at scale

Centaur started by helping track organizers create Gold Standards for their tasks, sourcing four US-based physicians to manually label data using the Centaur desktop labeling tool. These Gold Standards are an essential part of Centaur’s methodology: high-quality reference items hidden throughout the labeler workflow and used to measure labeler performance. Centaur also broke the first track’s task down into four hyper-focused subtasks and the second track’s into eight. Narrower tasks keep labelers focused, which ultimately enables better quality measurement and higher-quality evaluations.
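To make the Gold Standards concept concrete, here is a minimal sketch, not Centaur’s actual implementation, of how hidden gold items can be used to score individual labelers; the item IDs, labels, and data layout are hypothetical.

```python
# Minimal sketch (assumed, not Centaur's production logic): score each labeler
# by accuracy on the hidden gold-standard items mixed into their workflow.
from collections import defaultdict

# Hidden gold-standard answers, keyed by item id (illustrative values).
GOLD = {
    "item_17": "Required",
    "item_42": "Supporting",
    "item_88": "no",
}

def score_labelers(responses):
    """responses: iterable of (labeler_id, item_id, label) tuples."""
    hits = defaultdict(int)
    seen = defaultdict(int)
    for labeler_id, item_id, label in responses:
        if item_id in GOLD:  # only gold items count toward the quality score
            seen[labeler_id] += 1
            hits[labeler_id] += int(label == GOLD[item_id])
    return {labeler: hits[labeler] / seen[labeler] for labeler in seen}

responses = [
    ("labeler_a", "item_17", "Required"),
    ("labeler_a", "item_42", "Neutral"),
    ("labeler_b", "item_17", "Required"),
    ("labeler_b", "item_88", "no"),
]
print(score_labelers(responses))  # {'labeler_a': 0.5, 'labeler_b': 1.0}
```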

Evaluating answer and citation quality

For the first track, Centaur evaluated four different views of the dataset: 1,800 question-and-answer pairs, 6,000 question-and-answer-sentence pairs, 6,000 answer-sentence-and-citation pairs, and 10,000 answer-sentence-and-citation-sentence pairs. Answers were typically five sentences long, and the cited abstracts ran five to seven sentences. Researchers uploaded the submission datasets to the Centaur system, which automatically created the dataset pairs and highlighted individual answer sentences before evaluation.
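To make those views concrete, the sketch below shows how a single submission might be expanded into question-level, sentence-level, and citation-level pairs. The record structure is assumed for illustration, not Centaur’s actual schema, and the fourth view (answer sentences paired with individual abstract sentences) would additionally require the text of each cited abstract.

```python
# Illustrative only: expand one submission into the evaluation views described
# above. The field names and the example content are hypothetical.
submission = {
    "question": "Does vitamin D supplementation reduce fracture risk?",
    "answer_sentences": [
        {"text": "Vitamin D alone has not been shown to reduce fracture risk.",
         "citations": ["PMID:12345678", "PMID:23456789"]},
        {"text": "Combined with calcium, it may modestly reduce hip fractures.",
         "citations": ["PMID:34567890"]},
    ],
}

# View 1: question-and-answer pair (the whole answer as one unit).
qa_pair = (submission["question"],
           " ".join(s["text"] for s in submission["answer_sentences"]))

# View 2: question-and-answer-sentence pairs (one per answer sentence).
qa_sentence_pairs = [(submission["question"], s["text"])
                     for s in submission["answer_sentences"]]

# View 3: answer-sentence-and-citation pairs (one per cited PMID).
sentence_citation_pairs = [(s["text"], pmid)
                           for s in submission["answer_sentences"]
                           for pmid in s["citations"]]

print(len(qa_sentence_pairs), len(sentence_citation_pairs))  # 2 3
```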

Centaur classified the question-and-answer pairs on whether the answer was relevant to the question (yes or no), and the question-and-answer-sentence pairs on how important each sentence was within the answer (Required, Unnecessary, Inappropriate, or Borderline). For these evaluations, Centaur delivered 9-12 qualified reads per label at a rate of 1,000-3,300 labels/day.
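With 9-12 qualified reads per label, the individual reads have to be rolled up into a single decision per item. A simple majority-vote roll-up is sketched below purely as an illustration; Centaur’s actual aggregation may, for example, weight reads by labeler quality.

```python
# Illustrative majority-vote aggregation of multiple reads for one item.
from collections import Counter

def aggregate_reads(reads):
    """reads: list of labels from independent qualified readers for one item."""
    counts = Counter(reads)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(reads)  # share of reads agreeing with the winner
    return label, confidence

# Ten hypothetical reads of one question-and-answer-sentence pair.
reads = ["Required"] * 7 + ["Borderline"] * 2 + ["Unnecessary"]
print(aggregate_reads(reads))  # ('Required', 0.7)
```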

Centaur also evaluated the answer-sentence-and-citation pairs, first highlighting sentences in the cited abstract that were relevant to a specific sentence in the answer, then classifying each highlighted sentence on the nature of that relevance (Supporting, Contradicting, or Neutral). Centaur delivered 8-10 qualified reads per label, with 75% agreement with the gold standards, at a rate of 7,000 labels/day.
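One way to picture the output of that two-step check is a small record per answer-sentence-and-citation pair, as in the assumed schema below; the decision rule at the end is only illustrative.

```python
# Assumed record for the two-step citation check: step 1 stores which abstract
# sentences a reader highlighted, step 2 stores how each relates to the answer
# sentence. Schema and content are hypothetical.
annotation = {
    "answer_sentence": "Vitamin D alone has not been shown to reduce fracture risk.",
    "citation_pmid": "PMID:12345678",
    "highlighted_abstract_sentences": [2, 5],      # indices into the abstract
    "relevance": {2: "Supporting", 5: "Neutral"},  # Supporting / Contradicting / Neutral
}

# Example rule: the citation supports the answer sentence if at least one
# highlighted abstract sentence was classified as Supporting.
supports = any(label == "Supporting" for label in annotation["relevance"].values())
print(supports)  # True
```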

Evaluating content simplification 

For the second track, Centaur annotated two datasets: 32,000 pairs of scientific terms and their simplified versions, and 63,000 pairs of original abstract sentences and their simplified counterparts.

Centaur rated both datasets on a scale of -1 to 1 across four variables: 1) Brevity, 2) Completeness, 3) Accuracy, and 4) Simplicity. For these evaluations, Centaur delivered 8-10 qualified reads per label at a rate of 7,000-9,000 labels/day.
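As a rough sketch of how per-read ratings on that -1 to 1 scale could be rolled up into per-item scores, consider the example below; the variable names follow the list above, and the data layout is assumed.

```python
# Illustrative roll-up: average each variable's rating across the independent
# reads for one original/simplified sentence pair.
from statistics import mean

VARIABLES = ["brevity", "completeness", "accuracy", "simplicity"]

# Three hypothetical reads, each rating the pair on a -1 to 1 scale.
reads = [
    {"brevity": 1, "completeness": 0, "accuracy": 1, "simplicity": 1},
    {"brevity": 1, "completeness": 1, "accuracy": 1, "simplicity": 0},
    {"brevity": 0, "completeness": 1, "accuracy": 1, "simplicity": 1},
]

item_scores = {v: mean(r[v] for r in reads) for v in VARIABLES}
print(item_scores)  # accuracy averages to 1; the other three to roughly 0.67
```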

For all tasks across both tracks, Centaur delivered interrater agreement comparable to, or slightly better than, that achieved by in-house physicians on similar model evaluation projects in the past.
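Interrater agreement can be quantified in several ways; the sketch below uses average pairwise percent agreement across the reads for each item, purely as an illustration. The metrics Centaur reports may instead be chance-corrected statistics.

```python
# Illustrative interrater agreement: for each item, the fraction of reader
# pairs that assigned the same label, averaged over all items.
from itertools import combinations

def pairwise_agreement(reads_per_item):
    """reads_per_item: list of lists; each inner list holds the labels given
    to one item by its independent readers."""
    per_item = []
    for reads in reads_per_item:
        pairs = list(combinations(reads, 2))
        if pairs:
            per_item.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(per_item) / len(per_item)

# Hypothetical reads for three items.
reads_per_item = [
    ["yes", "yes", "yes", "no"],
    ["no", "no", "no", "no"],
    ["yes", "no", "yes", "yes"],
]
print(round(pairwise_agreement(reads_per_item), 3))  # 0.667
```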

Advancing research in trustworthy biomedical LLMs

Track coordinators shared the evaluations of the submissions in time for the conference and were pleased with the quality and timeliness of Centaur’s work. These researchers and Centaur are now completing additional datasets for upcoming workshops and research studies.

In Spring 2025, the broader research community will gain access to the datasets used in the tracks to test and improve their AI systems. Creating and contributing these reference standards pro bono to the research community is an important way to reduce barriers to progress in the field. 

The teams that participated in the two tracks are publishing findings based in part on Centaur's evaluations. Centaur is thrilled to help enable this critical research, making biomedical LLMs more trustworthy and usable by all.
