How we built the best speech-to-text engine for medical encounters

July 3, 20238 minute read

Ruben Stern

Machine Learning Engineer

Sam Humeau

Machine Learning Engineer

In a world where fingers fatigue and pens run out of ink, a savior appears in the form of speech-to-text technology. It's precisely what it sounds like - a nifty tool that transforms your spoken words into written text. And it is also essential to Nabla achieving its purpose and reaching its full potential.

What is Nabla? Essentially, it is an AI assistant built specifically for clinicians. It generates clinical notes in seconds from any medical encounter and seamlessly updates patients' medical records.

To do this effectively, Nabla needs an accurate and reliable speech-to-text tool. While speech-to-text technology has made its mark in a variety of fields, our quest for a dependable solution for the healthcare sector revealed its own unique challenges and twists.

From tongue-tied to transcribed

The quagmire of speech-to-text APIs

There are roughly two types of ready-to-use speech-to-text APIs/templates out there.

The first category includes models designed for general language transcription but lacking in medical vocabulary finesse. Take Google speech-to-text, for example.

The second category consists of models specialized in medical vocabulary but lacking in general language skills. Think of dictation software such as Dragon.

When it comes to writing a clinical note based on a medical encounter, we need to be fluent in both general language and medical terminology.

Connect Dragon to a encounter, and your AI scribe will be utterly clueless about what is happening.

Rely solely on Google speech-to-text, and you’ll end up with drug names and complex conditions being mercilessly mangled. Clinicians will spend their entire day fixing typos in the final note. Not fun.

But fear not. We’ve built the speech-to-text engine that combines the best of both worlds. It effortlessly captures all the relevant information in the medical encounter, ensuring nothing gets lost in translation.

Decoding commercial speech-to-text APIs

With our engineering DNA firmly rooted in pragmatism and data obsession, we started by evaluating existing speech-to-text solutions. To do so, we gathered a collection of tests mixing generic sentences and medical ones. Here were its components:

Common voice: Common phrases, words, and expressions from everyday life (Note: we used a subset of 2,000 sentences from our test series).
Medical terms: Concise sentences containing medical terms (at least one per sentence), dictated by Nabla employees. For weeks, we unleashed a symphony of accidental tongue twisters in our office.

Methodology

To build our product, we first evaluated a range of existing speech-to-text APIs, including Google speech, Deepgram, and the open-source Whisper model, using our English and French datasets.

To measure performance, we used the Word Error Rate (WER) for the Common voice dataset.

For a more focused examination of medical content in the Medical terms dataset, we calculated the error rate specifically for medical terms (MTER).

Riveting, we know. But it is through this meticulous process that we tried to shed light on the accuracy of speech-to-text solutions.

The medical term error rate (MTER) is computed as follows:

MTER = 1 - TP/Total

where TP is the number of medical terms correctly predicted in a sentence, and Total is the number of medical terms in the ground truth sentence.

And here were our results:

French language

	Common voice (WER, %, lower is better)	Medical terms (MTER, % lower is better)
Google speech “latest_long” model	17.2	28.3
Deepgram (latest model as of 31/01/2023)	6.6	37.8
Amazon Transcribe	18.3	41.0
Azure	13.1	26.0
Whisper large v2	11.9	27.2

English language

	Common voice (WER, %, lower is better)	Medical terms (MTER, % lower is better)
Google speech “latest_long” model	19.1	38.5
Deepgram (latest model as of 31/01/2023)	NA	47.7
Amazon Transcribe	19.8	33.4
Azure	12.1	21.9
Whisper large v2	11.4	26.7

Among these speech-to-text APIs we tested, Azure stood out as the top performer overall in both French and English. Notably, it outperformed others when it came to handling medical terms. On the other hand, Whisper large V2, an open-source speech-to-text API, closely trailed Azure, showcasing slightly better results in everyday language but marginally weaker performance with medical terms.

Our benchmarking efforts revealed that commercial speech-to-text APIs are not well-suited for healthcare. To ensure precise capture of medical terms like drugs and diseases, we decided to refine our own speech-to-text model, leveraging Whisper.

Building a healthcare-tailored speech-to-text model

How Nabla mastered the lexical labyrinth

To obtain better results on medical terms, we decided to refine an existing model instead of training it from scratch, to benefit from the fact that it had already been trained on a very large dataset.

Whisper stood out as a good candidate because it has been trained on a very large dataset (680,000 hours of multilingual speech and transcription data in 96 languages) - and its performance is state-of-the-art.

We therefore collected large sets of audio/transcription pairs from different sources. We aimed to have data that represented both everyday language and very specific medical language. We also recorded ourselves speaking medical phrases.

	Number of samples	Total audio hours
French	319k	586
English	326k	740

The training experiments on our own models were carried out on 4xA100 machines and lasted 1.5 days. More details on the experiments can be found here.

Here were our results:

French language

	Common voice (WER, %, lower is better)	Medical terms (MTER, % lower is better)
Azure	13.1	26.0
Whisper large fine tuned on french common voice	7.7	25.2
Whisper large fine tuned on our French dataset	10.1	15.8
Whisper large fine tuned on our French + English dataset	9.8	13.4

English language

	Common voice (WER, %, lower is better)	Medical terms (MTER, % lower is better)
Azure	12.1	21.9
Whisper large fine tuned on our English dataset	10.4	22.6
Whisper large fine tuned on our French + English dataset	10.2	19.6

For French, we conducted a comparison between two models: one trained solely on our French data and the other trained on the combined French and English data. Overall, the bilingual model yielded slightly better results.

Importantly, these experiments demonstrated our ability to outperform all the commercial speech-to-text APIs we evaluated, particularly when it came to accurately rendering medical terms. Anecdotally, we got feedback from clinicians that we were very robust to strong accents, which probably comes from our diverse dataset.

Hear it to believe it

Here are some audio examples with transcriptions for your auditory pleasure:

English

French

How does it work?

In a nutshell

Nabla works by taking a regular medical conversation and turning it into a structured medical note that can be seamlessly exported to the patient's EMR.

Step one, we transcribe the raw audio using a speech-to-text model. Step two, we enhance the text by incorporating insights from an advanced large language models (LLMs) to whip up that medical note based on the transcript.

It’s not that simple

To ensure we serve up the most accurate note imaginable, we need an impeccable transcript of the medical conversation. When we first introduced Nabla to the world of doctors months ago, we got some valuable feedback. It turned out that our API wasn't quite acing the analysis of medical jargon just yet (think symptoms, illnesses, medications, tests, and the like). The core conclusion was: the quality of our note is heavily reliant on how we transcribe these essential medical terms.

The best of both worlds

To tackle this initial hiccup, we needed a production-ready system that could deliver live transcriptions. We evaluated two possible solutions:

Repeatedly utilizing the aforementioned trained model to generate the transcription.
Using a commercial live system (Azure) as a preliminary step. Once a segment of the transcription was deemed final, we would let our own model take a crack at it for a second pass.

Both methods get the job done and are viable choices for production. However, we ultimately chose the second option as our default. This decision enhanced the system's resilience and allowed us the flexibility to choose between our model's transcription and Azure’s transcription. It also provided the option to restrict the use of our model in cases where it detected a medical term. How proud Miley would be.

In our current product, you can observe the seamless interplay between these components. The live transcription initially appears in gray, followed by our own speech-to-text correction, which is displayed in white. This process ensures accurate transcription, particularly when it comes to medical terms.

Obviously, we're continuously benchmarking and evaluating the performance of speech-to-text models on our growing dataset. We're upgrading our models and integrations over time to keep delivering state-of-the-art medical transcription technology to our users.