Transforming healthcare through AI depends on access to diverse, richly annotated, open-source datasets. Medical AI applications, from image segmentation to multimodal reasoning, are only as robust as their training data. Fortunately, recent years have produced high-quality, freely available datasets that researchers can use to build, test, and deploy impactful models. We have gathered some of our favorite open-source datasets into an organized list to share with the community.
In our list, you can explore dozens of datasets by size, category, modality (including X-ray, ultrasound, whole slide images, CT scans, and ECGs), and more. Each entry also includes a brief description so you can quickly see the specific abnormalities of interest, the balance of the data, and the annotations included, such as medical image classifications or segmentations.

Access the full collection here
If you know of any datasets that should be added to this list, please let us know.