Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Copyright © 2025. All rights reserved by Centaur Labs.
Blog
The volume of medical text data, especially digitized and unstructured narrative text, has taken off with the adoption of EHRs and other digital tools used by clinicians, researchers, and patients. All of this medical text is fueling a surge of NLP model development to turn those medical text datasets into insights that impact care delivery, biomedical research, and much more. As every healthcare, medical device, and pharmaceutical company deepens their investments in AI, we’re seeing a growing number of clients developing NLP and an increased need for medical text annotations.
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language. It combines computational linguistics with machine learning and deep learning models to process and analyze large amounts of natural language data—such as text or speech. NLP is being applied in most parts of the healthcare value chain.
From accelerating subject recruitment in clinical trials to supporting decisions and diagnoses at the point of care to helping consumers find the accurate health information they need, there’s a lot of impact to be excited about, and we’ll give many detailed examples in a later post in this series.
At Centaur Labs we use ‘medical text’ as shorthand for multiple types of health-related data stored as text. Medical text encompasses 3 subcategories of text: clinical text, biomedical text, and ‘other’ health text.
Clinical text is defined as text collected during the course of ongoing patient care in the formal healthcare system or as part of a formal clinical trial program. We often think of clinical text as written by clinicians themselves, for example, when making clinical notes in the electronic health record (EHR), in a prescription write-up, or in a pathology report.
However, patients also generate an increasing volume of clinical text. They may describe the beginning of their healthcare journey, i.e., a discussion with a chatbot automating part of patient intake or a typical in-person clinical visit with a therapist. Patients may also describe the middle or end of their healthcare journey, i.e., a transcript of a follow-up call with a clinician, an insurance claim, or a discharge description.
Clinical text can also be written by others in the healthcare ecosystem who participate in patient care, such as insurance claims processors or family members.
Biomedical text is any data stored as text collected or created throughout the course of medical research. Often this text is written by research scientists in the form of scientific papers published in academic journals or as unpublished intellectual property. This unpublished text is owned and managed by the academic institution or company that employed the scientist and funded the research the paper summarizes.
There are many types of health-related text generated outside patient care delivery and scientific research. For example, users of consumer wellness applications share health information in text formats as they interact with private groups or company ‘coaches’. Consumers also share health information in public forums like Twitter and Reddit. Government agencies publish public health-related text to their populations, and regulatory bodies publish guidelines and rulings to their constituents. Companies in the healthcare ecosystem train their workforces - whether clinicians or pharmaceutical sales reps - with platforms that deliver content containing health-related text.
These sub-categories of medical text have one thing in common - they contain health concepts and terminology and subtle relationships between words and phrases that are difficult for an untrained eye to identify. Those with interest in medicine, healthcare, and biology - whether a seasoned clinician, a medical student, or a medical enthusiast - are best suited to interpret medical text. In this blog series, we’ll focus on NLP that leverages medical text - whether clinical, biomedical or ‘other’ - as this is where some of the most compelling innovation is taking place.
For additional insights, see the following posts:
Centaur Labs contributes high-quality data annotations to enhance Consensus’ scientific search algorithm, improving accuracy and boosting research capabilities.
From SMS to insurance claims, pathology reports, and scientific studies, this post explores the most common medical text datasets used for NLP in healthcare.
Gamified data labeling enhances model accuracy from 70% to 93% in a case study with Eight Sleep, demonstrating the effectiveness of multimodal annotation.