
Microsoft Case Study: Grounding AI in Expert-Labeled Data

Tristan Bishop, Head of Marketing
June 27, 2025

Centaur Powers Multimodal Radiology Breakthrough with Microsoft Research and the University of Alicante

The future of AI depends not just on more innovative models but on better data. That includes data that is clinically grounded, linguistically precise, and validated by domain experts. Our recent collaboration with Microsoft Research and the University of Alicante exemplifies this vision in action.

Together, these teams have released PadChest-GR, the world’s first multimodal, bilingual, sentence-level dataset for grounded radiology reporting. This pioneering dataset aligns structured clinical text with annotated chest X-ray imagery, enabling machine learning models to justify each diagnostic claim with an interpretable visual reference—a step change in transparency and reliability.

The Challenge: Moving Beyond Image Classification

Most medical imaging datasets to date have supported image-level classification, e.g., labeling a chest X-ray as “showing signs of cardiomegaly” or “no abnormalities detected.” While useful, models trained on these labels often lack transparency. They are prone to “hallucinations,” where generated reports fabricate findings unsupported by the image or fail to specify where a pathology is located. Grounded radiology reporting takes a different approach:

  • Findings are localized with bounding boxes on the image.
  • Textual descriptions, rather than simple classification assignments, are aligned with individual image regions.
  • Each report entry is contextualized in both language and space, reducing ambiguity and increasing interpretability.

Figure: Grounded report from PadChest-GR

This approach requires a fundamentally different type of dataset—one where each radiological observation is not only labeled but also grounded in a specific part of the image and expressed in natural language.
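
To make the idea concrete, here is a minimal, hypothetical sketch of how a single grounded finding might be represented: a sentence-level description paired with the bounding box that supports it. The field names and values are illustrative only and are not the actual PadChest-GR schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedFinding:
    """One sentence-level finding tied to a specific region of the image.

    Field names are illustrative, not the real PadChest-GR schema.
    """
    sentence_es: str       # original Spanish sentence from the report
    sentence_en: str       # English translation of the same sentence
    label: str             # structured finding label, e.g. "cardiomegaly"
    box_xyxy: List[float]  # bounding box [x_min, y_min, x_max, y_max] in pixels


@dataclass
class GroundedReport:
    image_id: str                    # identifier of the chest X-ray
    findings: List[GroundedFinding]  # every claim carries its own region


# A report whose single diagnostic claim is backed by an explicit image region.
report = GroundedReport(
    image_id="example_cxr_0001",
    findings=[
        GroundedFinding(
            sentence_es="Cardiomegalia leve.",
            sentence_en="Mild cardiomegaly.",
            label="cardiomegaly",
            box_xyxy=[412.0, 518.0, 1203.0, 1370.0],
        )
    ],
)
```

Because every sentence carries its own coordinates, a reviewer or a downstream model can trace each claim back to the pixels that support it.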

Human-in-the-Loop at Clinical Scale

For a dataset like this, high-quality annotations are non-negotiable. That’s where Centaur.AI came in. Our HIPAA-compliant labeling platform enabled a team of trained radiologists at the University of Alicante to:

  • Draw bounding boxes around visible pathologies in thousands of chest X-rays.
  • Link each region to specific sentence-level findings in both Spanish and English.
  • Conduct rigorous, consensus-driven quality control, including adjudication of edge cases and multilingual alignment.

Unlike generic platforms, Centaur.AI was designed from the ground up for medical-grade annotation workflows. We support:

  • Multiple-annotator consensus and disagreement resolution.
  • Performance-weighted labeling, where expert annotations are weighted by historical agreement (sketched in the example below).
  • DICOM and other complex medical imaging modalities.
  • Multimodal annotation workflows that combine multiple data types.
  • Audit trails, version control, and continuous quality monitoring, ensuring trust in every label.

These features allowed the research team to focus on complex medical edge cases without sacrificing annotation throughput or data integrity.
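
As a rough illustration of the performance-weighted labeling idea mentioned above, the sketch below aggregates several annotators’ opinions on one case, weighting each vote by that annotator’s historical agreement rate. This is a simplified toy example, not Centaur.AI’s production algorithm; the reader names, weights, and labels are made up.

```python
from collections import defaultdict
from typing import Dict


def weighted_consensus(votes: Dict[str, str], reliability: Dict[str, float]) -> str:
    """Return the label whose supporters carry the most historical reliability.

    votes:       annotator id -> label assigned to this case
    reliability: annotator id -> weight in [0, 1], e.g. past agreement rate
    """
    scores: Dict[str, float] = defaultdict(float)
    for annotator, label in votes.items():
        # Unknown annotators fall back to a neutral weight.
        scores[label] += reliability.get(annotator, 0.5)
    return max(scores, key=scores.get)


# Toy example: two historically reliable readers outweigh one less-consistent reader.
votes = {"reader_a": "cardiomegaly", "reader_b": "cardiomegaly", "reader_c": "normal"}
reliability = {"reader_a": 0.92, "reader_b": 0.88, "reader_c": 0.55}
print(weighted_consensus(votes, reliability))  # -> cardiomegaly
```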

The Dataset: PadChest-GR

PadChest-GR builds on the original PadChest dataset but adds critical new dimensions:

  • Bilingual reports: Each sentence-level finding is annotated in both Spanish (original) and English (translated), enabling multilingual model training.
  • Sentence-level granularity: Each sentence in the radiology report is grounded to the corresponding image region, facilitating fine-grained natural language generation tasks.
  • Multimodal alignment: Text, image, and location data are unified in a single structure, making the dataset suitable for training vision-language models that require explicit cross-modal grounding.

This enables more than classification; it supports explainable AI, localized report generation, and the training and testing of model factuality—all essential components in the safe deployment of AI in radiology.
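
One concrete way a grounded dataset supports testing model factuality is by checking whether the region a model points to for a finding actually overlaps the expert-annotated reference box. The sketch below computes intersection-over-union (IoU) between a predicted and a reference bounding box; it is a generic illustration, not an official PadChest-GR evaluation protocol, and the boxes and threshold are invented for the example.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes given as [x_min, y_min, x_max, y_max]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted region counts as correctly grounded if it overlaps the
# radiologist-annotated reference box above some threshold (0.5 is common).
predicted = [430.0, 540.0, 1180.0, 1350.0]
reference = [412.0, 518.0, 1203.0, 1370.0]
print(iou(predicted, reference) >= 0.5)  # -> True
```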

Why This Matters

In a clinical setting, a model’s ability to “explain itself” is more than a UX feature; it’s a safety imperative. Physicians need to know not just what an AI system says, but why it says it. Grounded reporting enables clinicians to verify that the AI is referencing the correct part of the image, reducing the likelihood of hallucinated or clinically implausible findings. By collaborating with Microsoft Research and the University of Alicante on PadChest-GR, Centaur.AI helped build the kind of data curation pipeline that makes this level of accountability and interpretability possible.

Figure: Data curation pipeline

As noted in Microsoft Research’s announcement, Centaur.AI was a “significant enabler” of this work. We’re proud of that recognition—but more importantly, we’re proud of what it enables for the field at large.

A Blueprint for the Future

As multimodal, multilingual, and clinically grounded AI systems become more common, the infrastructure for generating and validating high-quality data must keep pace. Centaur.AI is committed to meeting that challenge by:

  • Supporting expert annotators across specialties and geographies
  • Designing quality pipelines that combine clinical precision with scalable throughput
  • Enabling organizations—from academic research labs to Fortune 100 companies—to confidently build and validate medical AI tools

This is what it looks like to operationalize responsible innovation in healthcare AI.

We’re honored to have played a part in PadChest-GR. We are excited about what it signals: a future where AI doesn’t just interpret medical images but does so transparently, accurately, and in full partnership with clinical expertise.
