Centaur Cuts SciBite's Vocabulary Curation Timeline with Expert Crowd Labeling 

The Centaur Blogging Team
June 18, 2024

Much scientific information is stored as unstructured text: scientific literature, internal R&D reports and emails, and patient documents that could qualify a patient for a clinical trial or describe an adverse reaction to a medication. Mining this unstructured text for insights is key to accelerating scientific advancements.

SciBite, from Elsevier, has developed a semantic analytics platform that enables life sciences organizations to unlock knowledge from unstructured text. At the foundation of this platform are the high-quality vocabularies and ontologies used to detect terms and topics within biomedical text. Maintaining these comprehensive vocabularies is a never-ending task that relies on expert scientific curation.

To scale this critical “humans in the loop” capability, SciBite partnered with Centaur Labs.

About SciBite  

Founded in 2012 and based in Cambridge, UK, SciBite supports the world’s leading scientific organizations with its pioneering data-first, semantic analytics platform.

At the core of SciBite's platform is TERMite, a high-performance named entity recognition (NER) engine that, when combined with SciBite's domain-specific vocabularies, can recognize and extract relevant terms found in scientific text. 

Together with SciBite's other offerings, these capabilities enable leading global pharmaceutical companies to extract insights from unstructured text, powering drug discovery, clinical trial recruitment, pharmacovigilance, and more.

The Challenge: Validating and Scaling Vocabulary Growth 

SciBite has developed over 100 vocabularies that contain up to hundreds of thousands of entities. The team builds these vocabularies using public ontologies, managed by organizations such as the US National Library of Medicine, as a starting point from which to go into much greater depth. These high-quality vocabularies and ontologies are the critical foundation that enables the TERMite engine to accurately detect important topics within biomedical text.

Keeping these vocabularies up to date is essential but complex. New potential synonyms for existing terms can emerge, and ontology creators release regular updates. SciBite’s team of in-house expert curators needs to evaluate each of these external changes and decide whether and how to incorporate them into SciBite’s vocabularies.

"We found around 5,000 potential new synonyms for the indication and anatomy vocabularies, but curating that many terms manually would take months of dedicated work," said Mark Streer, Scientific Curator.

After a quick scan of these synonym candidates, sourced from Wikidata and web scraping, it was clear some of the synonyms were high quality, others were definitely wrong, and many warranted further investigation.

The Solution: Crowdsourced Text Evaluation and Classification 

SciBite partnered with Centaur Labs to design two separate crowdsourcing workflows to evaluate candidate synonyms and accelerate updates to two of SciBite’s most popular vocabularies: indication and anatomy.

In both cases, the Centaur Labs crowd of medical experts was asked to evaluate the relationship between the candidate synonym and the existing reference term, classifying it as “Exact,” “Broad,” “Narrow,” or “No Match.” 

For both workflows, Centaur Labs evaluated 700 to 1,200 synonyms per day, generating 7 to 10 qualified opinions per case. Measured by agreement with ground truth reference cases, accuracy was high: 90.3% for disease and 95.1% for anatomy.
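As a rough illustration of how accuracy against reference cases can be measured, the sketch below compares a crowd label with a known reference label for each case and reports the fraction that match. The function name, the case identifiers, and the sample labels are all hypothetical; they are not SciBite or Centaur Labs data, and the actual scoring method may differ.

```python
def accuracy_against_reference(crowd_labels, reference_labels):
    """Fraction of reference cases where the crowd label matches the reference.

    Both arguments map a case identifier to one of the four labels.
    Only cases present in both mappings are scored.
    """
    shared = crowd_labels.keys() & reference_labels.keys()
    if not shared:
        raise ValueError("no overlapping cases to score")
    correct = sum(crowd_labels[case] == reference_labels[case] for case in shared)
    return correct / len(shared)


# Hypothetical example data, for illustration only.
crowd = {"case-1": "Exact", "case-2": "Broad", "case-3": "No Match", "case-4": "Exact"}
reference = {"case-1": "Exact", "case-2": "Narrow", "case-3": "No Match", "case-4": "Exact"}

print(accuracy_against_reference(crowd, reference))  # 0.75
```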

The Results: Accelerated Vocabulary Expansion 

Through the crowdsourced efforts, 900 new disease terms and 600 new anatomy terms were classified as “Exact” matches with the reference terms, giving the SciBite team the confidence to add these 1,500 new terms to their vocabularies.

The crowd's high accuracy enabled SciBite to streamline internal curation using the crowd-validated terms. For the most complex synonym candidates, where inter-rater agreement was below 75%, the SciBite team reviewed the candidates and applied a final classification. Where agreement was above 75%, the classifications were accepted as is and considered for addition to the vocabularies.
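The triage rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Centaur Labs or SciBite pipeline: agreement is assumed to mean the share of votes that went to the most common label, and a case at exactly 75% is assumed to be accepted, since the text does not specify either detail. The function and constant names are hypothetical.

```python
from collections import Counter

# Cutoff taken from the text; treating exactly 75% as accepted is an assumption.
REVIEW_THRESHOLD = 0.75


def triage(votes):
    """Return (majority_label, agreement, needs_review) for one candidate synonym.

    `votes` is the list of labels given by the crowd for a single case,
    e.g. 7 to 10 entries drawn from "Exact", "Broad", "Narrow", "No Match".
    Agreement is assumed to be the share of votes for the most common label.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    return label, agreement, agreement < REVIEW_THRESHOLD


# Hypothetical votes, for illustration only.
print(triage(["Exact"] * 7 + ["Broad"]))                    # ('Exact', 0.875, False)
print(triage(["Exact"] * 4 + ["Broad"] * 3 + ["Narrow"]))   # ('Exact', 0.5, True)
```

In this sketch the first case would be accepted as labeled, while the second would be routed to a curator for a final classification.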

"The Centaur Labs crowd offers the scalability and quality we need to rapidly update vocabularies based on new data sources. This frees our curators to focus on the most complex requirements rather than tedious tasks."

The success paved the way for an ongoing partnership, with SciBite looking to use Centaur Labs to make additional vocabulary updates, generate new vocabularies its customers need, and evaluate the output of new NER models.
