Ambient AI Scribe solutions typically work by recording the patient-doctor conversation, transcribing it, and then sending the transcript to a Large Language Model (LLM) along with a prompt to generate a clinical note or clinical documentation. Clinical notes can be generated using various LLMs such as GPT, Llama 2, Mistral, Claude, Command, Med-PaLM, etc.
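A minimal sketch of that record-transcribe-prompt pipeline is shown below. The `transcribe` and `call_llm` callables and the prompt text are hypothetical placeholders for whatever speech-to-text service and LLM vendor SDK is used; this is an illustration of the flow, not any specific product's code.

```python
from typing import Callable

# Illustrative prompt body; the actual prompt used in this evaluation is not shown here.
NOTE_PROMPT = (
    "You are a clinical documentation assistant. From the transcript below, "
    "write a clinical note. Use only information stated in the transcript; "
    "do not add facts.\n\nTranscript:\n{transcript}"
)

def generate_clinical_note(
    audio_path: str,
    transcribe: Callable[[str], str],  # speech-to-text service of your choice (placeholder)
    call_llm: Callable[[str], str],    # any LLM wrapped as prompt -> text (placeholder)
) -> str:
    """Record -> transcribe -> prompt an LLM -> clinical note."""
    transcript = transcribe(audio_path)                 # transcription step
    prompt = NOTE_PROMPT.format(transcript=transcript)  # fixed prompt body + transcript
    return call_llm(prompt)                             # GPT, Llama 2, Mistral, Claude, etc.
```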
The following parameters are used to measure note quality. All parameters are measured for the same prompt.
The two graphs are created by running the same prompt multiple times across multiple LLMs. A key observation is that hallucinations vary significantly and randomly when the same prompt is run multiple times on the same LLM. The hallucinations typically vary from 1 to 7 utterances when runs are repeated without any change, which shows high variability.
Med-PaLM keeps hallucinations at the same value and does not vary across multiple runs. Command LLM has a significantly high number of hallucinations.
The two graphs are created by running the same prompt multiple times across multiple LLMs. The scores vary significantly and randomly across multiple runs and across LLMs. Interestingly, Command LLM had very high hallucination, but its coverage is also quite good. Coverage varies from 45% to 85% when runs are repeated without any change. Med-PaLM shows consistent coverage across multiple runs and is more predictable.
The confidence score is a function of the hallucination and coverage scores: it is higher for higher coverage but is reduced by hallucinations. The two graphs above are created by running the same prompt multiple times across multiple LLMs. The scores vary significantly and randomly across multiple runs and across LLMs. Command LLM shows a lower confidence score, which is expected given its high hallucinations. Med-PaLM has a consistent confidence score, showing high repeatability.
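The exact formula is not specified here; the sketch below shows one illustrative way a confidence score could reward coverage and penalize hallucinated utterances. The penalty weight is an assumption, not the value used in this evaluation.

```python
def confidence_score(coverage: float, hallucination_count: int,
                     penalty_per_hallucination: float = 0.05) -> float:
    """Illustrative combination only: confidence rises with coverage (0-1)
    and is penalized for each hallucinated utterance. The actual formula
    behind the graphs is not defined in this article."""
    score = coverage - penalty_per_hallucination * hallucination_count
    return max(0.0, min(1.0, score))

# Example: 80% coverage with 3 hallucinated utterances -> 0.65
print(confidence_score(0.80, 3))
```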
These variations are measured for a single patient-visit from a primary care outpatient clinic, and hence one cannot draw any conclusion on which LLM is better. Here are the key observations:
The LLMs were evaluated on 133 patient-visits from outpatient clinics across 15+ specialties and varying diagnoses. The same prompt body was used for all runs; the only part of the prompt that varied was the conversation/dictation transcription. A total of 3 runs were done for each patient-visit and analyzed for documentation quality. For each run, each patient-visit was attempted 2 times for each LLM, and the best output was chosen based on the confidence score across the 12 outputs (6 LLMs x 2 attempts = 12 outputs). The graph shows how many times each LLM had the best score for each run, giving a sense of which LLM performs better.
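A rough sketch of this best-of-12 selection and win-counting step is shown below. The data structures (a map from (LLM name, attempt) to the note and its confidence score) are assumptions for illustration, not the actual evaluation code.

```python
def pick_best_output(outputs: dict[tuple[str, int], dict]) -> tuple[str, int]:
    """Select the best note among 6 LLMs x 2 attempts = 12 outputs.
    `outputs` maps (llm_name, attempt) -> {"note": ..., "confidence": ...}."""
    return max(outputs, key=lambda key: outputs[key]["confidence"])

def win_counts(per_visit_outputs: list[dict]) -> dict[str, int]:
    """Tally how often each LLM produced the best-scoring note across visits."""
    counts: dict[str, int] = {}
    for outputs in per_visit_outputs:
        best_llm, _ = pick_best_output(outputs)
        counts[best_llm] = counts.get(best_llm, 0) + 1
    return counts
```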
Some key observations:
X-axis: Number of hallucination utterances
Y-axis: Frequency of notes
X-axis: Number of hallucination utterances
Y-axis: Frequency of notes x number of hallucination utterances
Key conclusion: Simbo's output curve shifts to the left. The left shift implies that Simbo reduces hallucinations. The scaled hallucination view shows a lower area under the curve, which implies lower hallucination.
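As a rough illustration of how such a scaled-hallucination curve could be computed (the exact construction behind the graph is an assumption here): weight the frequency of notes at each hallucination count by that count, and sum the weighted values to get the area under the curve.

```python
import numpy as np

def scaled_hallucination_area(hallucination_counts: list[int]) -> float:
    """Assumed construction: for each hallucination count k, take
    (number of notes with k hallucinations) * k, then sum across k.
    A smaller total means fewer hallucinated utterances overall."""
    counts = np.bincount(hallucination_counts)  # frequency of notes per count k
    scaled = counts * np.arange(len(counts))    # frequency x k
    return float(scaled.sum())

# Example: notes with 0, 0, 1, 2, 2, 3 hallucinations -> area = 0 + 1 + 4 + 3 = 8
print(scaled_hallucination_area([0, 0, 1, 2, 2, 3]))
```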
X-axis: Coverage Score
Y-axis: Frequency of notes
Key conclusion: Simbo's curve shifts to the right significantly. The shift to the right implies that Simbo gets a significantly higher coverage score on the clinical documentation than any other LLM. All LLM curves peak at lower coverage scores, implying that a significant portion of information from the input conversation is missed.
X-axis: Confidence Score
Y-axis: Frequency of notes
Key conclusion: Simbo's output curve shifts to the right significantly. The shift to the right implies that Simbo gets a significantly higher confidence score for clinical documentation than any other LLM. All LLM curves peak at lower confidence scores, implying they seldom achieve a high confidence score.