Microsoft Case Study: Grounding AI in Expert-Labeled Data

Tristan Bishop, Head of Marketing
June 27, 2025

Centaur Powers Multimodal Radiology Breakthrough with Microsoft Research and the University of Alicante

The future of AI depends not just on more capable models but on better data: data that is clinically grounded, linguistically precise, and validated by domain experts. Our recent collaboration with Microsoft Research and the University of Alicante exemplifies this vision in action.

Together, these teams have released PadChest-GR, the world’s first multimodal, bilingual, sentence-level dataset for grounded radiology reporting. This pioneering dataset aligns structured clinical text with annotated chest X-ray imagery, enabling machine learning models to justify each diagnostic claim with an interpretable visual reference—a step change in transparency and reliability.

The Challenge: Moving Beyond Image Classification

Most medical imaging datasets to date have supported image-level classification—e.g., labeling a chest X-ray as “showing signs of cardiomegaly” or “no abnormalities detected.” While useful, models trained this way often lack transparency. They are prone to “hallucinations,” where generated reports fabricate findings unsupported by the image or fail to specify where a pathology is located. Grounded radiology reporting takes a different approach:

  • Findings are localized with bounding boxes on the image.
  • Textual descriptions, rather than simple classification assignments, are aligned with individual image regions.
  • Each report entry is contextualized in both language and space, reducing ambiguity and increasing interpretability.

Figure: Grounded report from PadChest-GR

This approach requires a fundamentally different type of dataset—one where each radiological observation is not only labeled but also grounded in a specific part of the image and expressed in natural language.
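
To make this concrete, the sketch below shows how a single grounded finding might be represented in Python. The field names here are hypothetical illustrations, not the published PadChest-GR schema: each finding carries a sentence in both languages, a finding label, and the image regions that support it.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox:
    """One rectangular image region, in normalized coordinates."""
    x: float       # left edge, as a fraction of image width
    y: float       # top edge, as a fraction of image height
    width: float
    height: float

@dataclass
class GroundedFinding:
    """A sentence-level observation grounded to specific image regions."""
    sentence_es: str           # original Spanish sentence from the report
    sentence_en: str           # English translation
    label: str                 # finding label, e.g. "cardiomegaly"
    boxes: List[BoundingBox]   # regions that visually support the claim

# A grounded report pairs one chest X-ray with a list of such findings,
# so every diagnostic sentence can point at the pixels that justify it.
finding = GroundedFinding(
    sentence_es="Cardiomegalia leve.",
    sentence_en="Mild cardiomegaly.",
    label="cardiomegaly",
    boxes=[BoundingBox(x=0.35, y=0.45, width=0.40, height=0.30)],
)
```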

Human-in-the-Loop at Clinical Scale

To create such a dataset, high-quality annotations are non-negotiable. That’s where Centaur.AI came in. Our HIPAA-compliant labeling platform enabled a team of trained radiologists at the University of Alicante to:

  • Draw bounding boxes around visible pathologies in thousands of chest X-rays.
  • Link each region to specific sentence-level findings in both Spanish and English.
  • Conduct rigorous, consensus-driven quality control, including adjudication of edge cases and multilingual alignment.

Unlike generic platforms, Centaur.AI was designed from the ground up for medical-grade annotation workflows. We support:

  • Multiple-annotator consensus and disagreement resolution.
  • Performance-weighted labeling, where expert annotations are weighted by historical agreement (see the sketch after this list).
  • DICOM and other complex medical imaging modalities.
  • Multimodal annotation workflows that include complex data formats.
  • Audit trails, version control, and continuous quality monitoring—ensuring trust in every label.
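
To illustrate the performance-weighted idea, here is a minimal sketch of weighted label aggregation. This is a generic illustration of the technique, not Centaur.AI’s production algorithm; the weights are assumed reliability scores derived from each annotator’s historical agreement.

```python
from collections import defaultdict

def weighted_consensus(annotations, annotator_weights):
    """Pick the label backed by the highest total annotator weight.

    annotations: list of (annotator_id, label) pairs for one case.
    annotator_weights: dict mapping annotator_id to a reliability
        score, e.g. historical agreement with adjudicated labels.
    """
    totals = defaultdict(float)
    for annotator_id, label in annotations:
        totals[label] += annotator_weights.get(annotator_id, 1.0)
    # The consensus label is the one with the most cumulative weight.
    return max(totals, key=totals.get)

# Three readers disagree; the two historically more reliable ones prevail.
votes = [("r1", "cardiomegaly"), ("r2", "cardiomegaly"), ("r3", "normal")]
weights = {"r1": 0.9, "r2": 0.8, "r3": 0.6}
print(weighted_consensus(votes, weights))  # -> "cardiomegaly"
```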

These features allowed the research team to focus on complex medical edge cases without sacrificing annotation throughput or data integrity.

The Dataset: PadChest-GR

PadChest-GR builds on the original PadChest dataset but adds critical new dimensions:

  • Bilingual reports: Each sentence-level finding is annotated in both Spanish (original) and English (translated), enabling multilingual model training.
  • Sentence-level granularity: Each sentence in the radiology report is grounded to the corresponding image region, facilitating fine-grained natural language generation tasks.
  • Multimodal alignment: Text, image, and location data are unified in a single structure, making the dataset suitable for training vision-language models that require explicit cross-modal grounding.

This enables more than classification; it supports explainable AI, localized report generation, and the training and testing of model factuality—all essential components in the safe deployment of AI in radiology.
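
For a sense of how such a dataset is consumed, here is a minimal loading-loop sketch. The file name, record layout, and keys are illustrative assumptions, not the published PadChest-GR release format.

```python
import json

def load_grounded_reports(path):
    """Yield (image_id, findings) pairs from a JSON Lines file.

    Assumes one record per line with hypothetical keys; the actual
    PadChest-GR release format may differ.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            yield record["image_id"], record["findings"]

# Each finding ties text and location together, so a vision-language
# model can be trained or audited sentence by sentence.
for image_id, findings in load_grounded_reports("padchest_gr.jsonl"):
    for finding in findings:
        print(image_id, finding["sentence_en"], finding["boxes"])
    break  # just show the first record
```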

Why This Matters

In a clinical setting, a model’s ability to “explain itself” is more than a UX feature—it’s a safety imperative. Physicians need to know not just what an AI system says, but why it says it. Grounded reporting lets clinicians verify that the AI is referencing the correct part of the image, reducing the likelihood of hallucinated or clinically implausible findings. By collaborating with Microsoft Research and the University of Alicante on PadChest-GR, Centaur.AI helped build the kind of data curation pipeline that makes this level of accountability and interpretability possible.

Figure: Data curation pipeline

As noted in Microsoft Research’s announcement, Centaur.AI was a “significant enabler” of this work. We’re proud of that recognition—but more importantly, we’re proud of what it enables for the field at large.

A Blueprint for the Future

As multimodal, multilingual, and clinically grounded AI systems become more common, the infrastructure for generating and validating high-quality data must keep pace. Centaur.AI is committed to meeting that challenge by:

  • Supporting expert annotators across specialties and geographies
  • Designing quality pipelines that combine clinical precision with scalable throughput
  • Enabling organizations—from academic research labs to Fortune 100 companies—to confidently build and validate medical AI tools

This is what it looks like to operationalize responsible innovation in healthcare AI.

We’re honored to have played a part in PadChest-GR. We are excited about what it signals: a future where AI doesn’t just interpret medical images but does so transparently, accurately, and in full partnership with clinical expertise.
