
Microsoft Case Study: Grounding AI in Expert-Labeled Data

Tristan Bishop, Head of Marketing
June 27, 2025

Centaur Powers Multimodal Radiology Breakthrough with Microsoft Research and the University of Alicante

The future of AI depends not just on more innovative models but on better data. That includes data that is clinically grounded, linguistically precise, and validated by domain experts. Our recent collaboration with Microsoft Research and the University of Alicante exemplifies this vision in action.

Together, these teams have released PadChest-GR, the world’s first multimodal, bilingual, sentence-level dataset for grounded radiology reporting. This pioneering dataset aligns structured clinical text with annotated chest X-ray imagery, enabling machine learning models to justify each diagnostic claim with an interpretable visual reference—a step change in transparency and reliability.

The Challenge: Moving Beyond Image Classification

Most medical imaging datasets to date have supported image-level classification, e.g., labeling a chest X-ray as “showing signs of cardiomegaly” or “no abnormalities detected.” While useful, models trained on these labels often lack transparency. They are prone to “hallucinations,” where generated reports fabricate findings unsupported by the image or fail to specify where a pathology is located. Grounded radiology reporting takes a different approach:

  • Findings are localized with bounding boxes on the image.
  • Textual descriptions, rather than simple classification assignments, are aligned with individual image regions.
  • Each report entry is contextualized in both language and space, reducing ambiguity and increasing interpretability.

Figure: Grounded report from PadChest-GR

This approach requires a fundamentally different type of dataset—one where each radiological observation is not only labeled but also grounded in a specific part of the image and expressed in natural language.
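
To make the idea concrete, here is a minimal, hypothetical sketch of how a single grounded finding might be represented: a sentence-level description paired with the bounding box that supports it. The field names and values are illustrative only and are not the actual PadChest-GR schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedFinding:
    """One sentence-level finding tied to a specific region of the image.

    Field names are illustrative, not the real PadChest-GR schema.
    """
    sentence_es: str       # original Spanish sentence from the report
    sentence_en: str       # English translation of the same sentence
    label: str             # structured finding label, e.g. "cardiomegaly"
    box_xyxy: List[float]  # bounding box [x_min, y_min, x_max, y_max] in pixels


@dataclass
class GroundedReport:
    image_id: str                    # identifier of the chest X-ray
    findings: List[GroundedFinding]  # every claim carries its own region


# A report whose single diagnostic claim is backed by an explicit image region.
report = GroundedReport(
    image_id="example_cxr_0001",
    findings=[
        GroundedFinding(
            sentence_es="Cardiomegalia leve.",
            sentence_en="Mild cardiomegaly.",
            label="cardiomegaly",
            box_xyxy=[412.0, 518.0, 1203.0, 1370.0],
        )
    ],
)
```

Because every sentence carries its own coordinates, a reviewer or a downstream model can trace each claim back to the pixels that support it.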

Human-in-the-Loop at Clinical Scale

For a dataset like this, high-quality annotations are non-negotiable. That’s where Centaur.AI came in. Our HIPAA-compliant labeling platform enabled a team of trained radiologists at the University of Alicante to:

  • Draw bounding boxes around visible pathologies in thousands of chest X-rays.
  • Link each region to specific sentence-level findings in both Spanish and English.
  • Conduct rigorous, consensus-driven quality control, including adjudication of edge cases and multilingual alignment.

Unlike generic platforms, Centaur.AI was designed from the ground up for medical-grade annotation workflows. We support:

  • Multiple-annotator consensus and disagreement resolution.
  • Performance-weighted labeling, where expert annotations are weighted by historical agreement (sketched in the example below).
  • DICOM and other complex medical imaging modalities.
  • Multimodal annotation workflows that combine multiple data types.
  • Audit trails, version control, and continuous quality monitoring, ensuring trust in every label.

These features allowed the research team to focus on complex medical edge cases without sacrificing annotation throughput or data integrity.
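
As a rough illustration of the performance-weighted labeling idea mentioned above, the sketch below aggregates several annotators’ opinions on one case, weighting each vote by that annotator’s historical agreement rate. This is a simplified toy example, not Centaur.AI’s production algorithm; the reader names, weights, and labels are made up.

```python
from collections import defaultdict
from typing import Dict


def weighted_consensus(votes: Dict[str, str], reliability: Dict[str, float]) -> str:
    """Return the label whose supporters carry the most historical reliability.

    votes:       annotator id -> label assigned to this case
    reliability: annotator id -> weight in [0, 1], e.g. past agreement rate
    """
    scores: Dict[str, float] = defaultdict(float)
    for annotator, label in votes.items():
        # Unknown annotators fall back to a neutral weight.
        scores[label] += reliability.get(annotator, 0.5)
    return max(scores, key=scores.get)


# Toy example: two historically reliable readers outweigh one less-consistent reader.
votes = {"reader_a": "cardiomegaly", "reader_b": "cardiomegaly", "reader_c": "normal"}
reliability = {"reader_a": 0.92, "reader_b": 0.88, "reader_c": 0.55}
print(weighted_consensus(votes, reliability))  # -> cardiomegaly
```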

The Dataset: PadChest-GR

PadChest-GR builds on the original PadChest dataset but adds critical new dimensions:

  • Bilingual reports: Each sentence-level finding is annotated in both Spanish (original) and English (translated), enabling multilingual model training.
  • Sentence-level granularity: Each sentence in the radiology report is grounded to the corresponding image region, facilitating fine-grained natural language generation tasks.
  • Multimodal alignment: Text, image, and location data are unified in a single structure, making the dataset suitable for training vision-language models that require explicit cross-modal grounding.

This enables more than classification; it supports explainable AI, localized report generation, and the training and testing of model factuality—all essential components in the safe deployment of AI in radiology.
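
One concrete way a grounded dataset supports testing model factuality is by checking whether the region a model points to for a finding actually overlaps the expert-annotated reference box. The sketch below computes intersection-over-union (IoU) between a predicted and a reference bounding box; it is a generic illustration, not an official PadChest-GR evaluation protocol, and the boxes and threshold are invented for the example.

```python
from typing import Sequence


def iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """Intersection-over-union of two boxes given as [x_min, y_min, x_max, y_max]."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted region counts as correctly grounded if it overlaps the
# radiologist-annotated reference box above some threshold (0.5 is common).
predicted = [430.0, 540.0, 1180.0, 1350.0]
reference = [412.0, 518.0, 1203.0, 1370.0]
print(iou(predicted, reference) >= 0.5)  # -> True
```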

Why This Matters

In a clinical setting, a model’s ability to “explain itself” is more than a UX feature; it’s a safety imperative. Physicians need to know not just what an AI system says, but why it says it. Grounded reporting enables clinicians to verify that the AI is referencing the correct part of the image, reducing the likelihood of hallucinated or clinically implausible findings. By collaborating with Microsoft Research and the University of Alicante on PadChest-GR, Centaur.AI helped build the kind of data curation pipeline that makes this level of accountability and interpretability possible.

Figure: Data curation pipeline

As noted in Microsoft Research’s announcement, Centaur.AI was a “significant enabler” of this work. We’re proud of that recognition—but more importantly, we’re proud of what it enables for the field at large.

A Blueprint for the Future

As multimodal, multilingual, and clinically grounded AI systems become more common, the infrastructure for generating and validating high-quality data must keep pace. Centaur.AI is committed to meeting that challenge by:

  • Supporting expert annotators across specialties and geographies
  • Designing quality pipelines that combine clinical precision with scalable throughput
  • Enabling organizations—from academic research labs to Fortune 100 companies—to confidently build and validate medical AI tools

This is what it looks like to operationalize responsible innovation in healthcare AI.

We’re honored to have played a part in PadChest-GR. We are excited about what it signals: a future where AI doesn’t just interpret medical images but does so transparently, accurately, and in full partnership with clinical expertise.
