Evaluating medical note generation
Generative AI for medical note generation has stood out over the past year for its transformative role in saving clinicians time and letting them focus on caregiving.
For months now, thousands of doctors have reached out to express their appreciation for Nabla and for the quality of the notes it produces. As a measure of our models’ precision, clinicians edit only 5% of the notes generated by Nabla.
To generate high-quality notes and improve our generation process, we first need to be able to measure the quality of a note. That is what this post covers.
What is a medical note?
A medical note is a summary of an interaction between a doctor and their patient. Doctors are usually most familiar with a note template called SOAP (Subjective, Objective, Assessment, Plan). By default, Nabla outputs a more detailed style of note, which our users have told us they are very comfortable with. Here is a (fictional) example:
What is a good medical note?
By working closely with doctors and gaining insight into their preferences and requirements for medical notes, we have identified three key criteria:
- A note that contains the useful information
- A note with an enjoyable style (concise, with no redundancy or repetition, etc.)
- A note that does not make up facts
The following sections aim to quantify each of these points.
Measuring note quality
For the three aspects above, our methodology is to express the evaluation as a large set of clear questions that we put to a Large Language Model (in this case GPT-4), which operates as a judge.
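As a rough illustration (not our production code), a single judge query could look like the sketch below, using the OpenAI Python client. The prompt wording and the `ask_judge` helper are ours for illustration only.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_judge(question: str, note: str, model: str = "gpt-4-0613") -> str:
    """Ask the judge model a single yes/no question about a note."""
    prompt = (
        "You are evaluating a medical note.\n\n"
        f"Note:\n{note}\n\n"
        f"Question: {question}\n"
        "Answer with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```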
Data
By default, Nabla does not record any data. However, doctors using our product can share the transcript and note of an encounter with us, provided they consent to it and have the consent of their patients. Our evaluation is based on a dataset of 86 such encounters, all from general medicine consultations.
Recall (getting the useful information)
For each encounter, a group of Nabla engineers and doctors extracted from the transcript several pieces of information that should or could be in the corresponding note. The resulting 1,088 pieces of information are then expressed as questions, such as these:
For each encounter, we then ask GPT-4 to judge whether the note generated by Nabla contains these facts, using the following prompt:
This prompt allows us to judge several aspects in a single query. As you can see, we use several tricks to constrain this zero-shot prompt to an expected output format. The recall score is then the proportion of "tests" that "pass".
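Concretely, computing the recall score amounts to counting passing tests. Here is a minimal sketch reusing the hypothetical `ask_judge` helper from above; our actual prompt batches all of an encounter's questions into one query, while the sketch makes one call per question for clarity.

```python
def recall_score(note: str, questions: list[str]) -> float:
    """Proportion of information questions whose 'test' passes for this note."""
    passed = sum(ask_judge(question, note) == "YES" for question in questions)
    return passed / len(questions)
```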
Style (showing the information in a clear and concise manner)
We use a similar methodology to evaluate the style of the note, only this time, the same questions are asked for every note. Here are some examples of questions:
We send a prompt to GPT-4 very similar to the one used for recall above, and the style score is given as the proportion of "tests" that "pass". The total number of style questions is 516.
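Since the style questions are shared across all notes, the style score can reuse the same machinery with a fixed question list. A sketch, with made-up example questions:

```python
# Illustrative style checks; not our actual question set.
STYLE_QUESTIONS = [
    "Is the note free of repeated information?",
    "Is every section written concisely?",
]

def style_score(note: str) -> float:
    """Proportion of fixed style checks that pass, reusing recall_score."""
    return recall_score(note, STYLE_QUESTIONS)
```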
Veracity (not making up facts)
Evaluating the veracity of a note is trickier. The idea is to measure the percentage of "atomic facts" from the note that appear in the transcript.
Extracting facts from the note
For each note, we first use GPT-4 to split it into atomic facts. This is a non-trivial task, and the use of an LLM is motivated by the fact that some sentences contain many atomic facts. We use the following prompt:
Here is an example output:
The total number of facts varies depending on the system that generates the note (cf. Results section).
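For illustration, the splitting step could be implemented along these lines; the prompt shown here is a simplified stand-in for the one we actually use.

```python
def split_into_facts(note: str, model: str = "gpt-4-0613") -> list[str]:
    """Split a note into atomic facts, one per line (simplified prompt)."""
    prompt = (
        "Split the following medical note into atomic facts. "
        "Write exactly one short, self-contained fact per line.\n\n" + note
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("- ").strip() for line in lines if line.strip()]
```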
Evaluating their presence in the transcript
We then use the following prompt to assert the presence of each of these atomic facts in the transcript. In this case though, instead of a binary answer, we allow three possible responses, as illustrated by this example:
The reason for this ternary scale is that the transcript is often of poor quality, which makes it ambiguous whether a fact is present or not. Distinguishing the two cases of an obvious and a non-obvious hallucination is helpful, because penalizing the model for the latter may be unfair.
Here are some examples of non-obvious cases, including the full response from GPT-4:
The veracity score is computed as the sum of the proportions of responses A and B.
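A sketch of this last step, under our reading of the ternary scale (A: clearly present, B: plausible but the transcript is too noisy to be sure, C: clearly absent); the prompt and helper names are illustrative.

```python
def judge_fact(fact: str, transcript: str, model: str = "gpt-4-0613") -> str:
    """Ask the judge whether an atomic fact is supported by the transcript.

    Returns "A" (clearly present), "B" (plausible but ambiguous given the
    transcript quality), or "C" (clearly absent); illustrative labels.
    """
    prompt = (
        f"Transcript:\n{transcript}\n\n"
        f"Fact: {fact}\n\n"
        "Answer A if the fact clearly appears in the transcript, "
        "B if the transcript is too ambiguous to tell, "
        "and C if the fact is clearly absent. Answer with a single letter."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()[:1]

def veracity_score(facts: list[str], transcript: str) -> float:
    """Proportion of atomic facts judged A or B."""
    labels = [judge_fact(fact, transcript) for fact in facts]
    return sum(label in ("A", "B") for label in labels) / len(labels)
```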
Use case: comparison of backbone models
Nabla's note generation is based on multiple queries addressed to a backbone model. Choosing which backbone model to use is an ideal use case for the evaluation metrics described in this post. Here we compare four different models: Hugging Face's Zephyr 7B Beta, and OpenAI's GPT-3.5 Turbo (0613), GPT-4 (0613), and GPT-4 Turbo (1106-preview).
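The comparison itself then reduces to running the same evaluation pipeline over each candidate backbone. A sketch reusing the helpers from the earlier snippets, where `generate_note` is a hypothetical stand-in for our multi-query generation pipeline:

```python
def compare_backbones(transcripts, recall_questions, generate_note, backbones):
    """Run the three metrics for each candidate backbone (hypothetical harness).

    Macro-averages per encounter for simplicity; the scores reported in
    this post are proportions over all tests.
    """
    results = {}
    for backbone in backbones:
        recalls, styles, veracities = [], [], []
        for transcript, questions in zip(transcripts, recall_questions):
            note = generate_note(transcript, backbone)
            recalls.append(recall_score(note, questions))
            styles.append(style_score(note))
            veracities.append(veracity_score(split_into_facts(note), transcript))
        results[backbone] = {
            "recall": sum(recalls) / len(recalls),
            "style": sum(styles) / len(styles),
            "veracity": sum(veracities) / len(veracities),
        }
    return results

# Model identifiers as of late 2023; Zephyr is served outside the OpenAI API.
backbones = [
    "HuggingFaceH4/zephyr-7b-beta",
    "gpt-3.5-turbo-0613",
    "gpt-4-0613",
    "gpt-4-1106-preview",
]
```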
Results
Discussion
- Regarding the recall scores, it must be noted that there is some subjectivity as to which facts should or should not be in the note, which makes a 100% score virtually unattainable. Lifestyle information, for instance ("Patient plays golf"), may or may not be considered mandatory. Further work will distinguish between these types of facts.
- GPT-4 Turbo scores highest on recall, which is consistent with it outputting more facts.
- GPT-4 0613 seems to be the most reliable in terms of veracity.
- For our specific task, we find that the open-source state of the art among 7B-parameter models (Zephyr) lags behind. We acknowledge, however, that we are using prompts that have been iterated on over time to work well with GPT-4, and other ways of formulating the input might work better with Zephyr. We also acknowledge that Zephyr has only 7B parameters, which makes this a very unfair comparison.
Conclusion
In this post, we've shown how we evaluate the quality of the notes Nabla generates, and compared different backbone models using these metrics. Although further work will follow, this is a critical step in the development of our medical assistant. We hope this post will be useful to other companies and engineering teams dedicated to improving healthcare.
If you liked this post, you might be interested in joining our team as an ML engineer.