Generative artificial intelligence (AI) is transforming healthcare by supporting clinician tasks such as patient communication and clinical documentation. AI scribes, which use ambient listening to generate clinical notes from patient–physician interactions, are increasingly being adopted.
Early studies indicate that they can reduce documentation time, ease administrative burden, and mitigate physician burnout. However, concerns about the accuracy of AI-generated notes remain, because errors could affect patient care; this highlights the need for thorough validation and evaluation of these technologies in clinical settings.
A study published in JMIR Med Inform aimed to evaluate the quality and safety of clinical notes generated by an ambient listening AI scribe in real clinical settings. It focused on developing a method to assess note quality, detect error types and frequency, determine error severity and potential impact on patient outcomes, and establish a monitoring framework. Ensuring patient safety before deploying AI scribes and understanding how clinicians interact with AI documentation were central objectives.
A quality improvement pilot program was conducted involving 31 physicians across multiple specialties, including family medicine, internal medicine, otolaryngology, pediatrics, and others. Over the course of the pilot, 7545 clinical notes were generated with the AI scribe. Physicians voluntarily participated and evaluated a subset of 356 notes (4.7%) using a newly developed, concise survey instrument designed to capture key quality and safety concerns. The survey focused on detecting specific error types, such as accidental omissions, hallucinations (fabricated information), accidental inclusions, and bias, and rated their severity on a scale reflecting potential risk to patient care. A composite quality score, ranging from 0 to 4, was calculated, with higher scores indicating better quality. Additional data were collected on the time needed for evaluations, the interval between patient visits and note review, and patient characteristics. Vendor-provided metrics were used to assess the extent of physician edits by calculating the percentage of AI-generated text retained or modified in finalized notes.
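The "percentage of AI-generated text retained or modified" came from proprietary vendor analytics, whose exact algorithm the summary does not describe. As a rough illustration only, a character-level similarity comparison between the AI draft and the finalized note could approximate such a metric; the function names and the `difflib`-based approach below are assumptions, not the vendor's method.

```python
import difflib

def percent_retained(ai_draft: str, final_note: str) -> float:
    """Approximate share of the AI draft's characters that survive
    unchanged into the physician-finalized note (illustrative sketch,
    not the vendor's actual metric)."""
    matcher = difflib.SequenceMatcher(None, ai_draft, final_note, autojunk=False)
    # Sum the sizes of all matching blocks between draft and final note.
    retained = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * retained / max(len(ai_draft), 1)

def percent_modified(ai_draft: str, final_note: str) -> float:
    """Complement of percent_retained: share of the draft that was edited away."""
    return 100.0 - percent_retained(ai_draft, final_note)
```

Under this sketch, an untouched note scores 100% retained (0% modified), matching the study's finding that 15% of notes required no edits at all.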
The results showed that most AI-generated notes were of high quality: 94.7% were free from significant errors, and the median overall quality score was 4.0, the highest possible score. About 68% of notes had no identified quality concerns, and only a small proportion (3.1%) scored below 3, indicating lower quality. The most common issue was accidental omissions, occurring in 18% of notes, followed by hallucinations (11.5%) and accidental inclusions (9.3%). Bias was rare, detected in only 1.1% of cases. Most errors were classified as mild to moderate in severity, but a small subset of notes (5.3%) contained errors that could pose a risk of serious patient harm if left uncorrected, a finding that underscores the necessity of clinician oversight. The evaluation process itself was efficient, requiring a median of only 38 seconds per note, and was typically completed within a few hours of the clinical encounter.
Analysis of editing patterns revealed that physicians made relatively few changes to AI-generated notes. Among the subset of notes with available data, a median of only 9% of text was modified, and 15% of notes required no edits at all. There was considerable variability in editing behavior across physicians and specialties, with some clinicians making minimal changes and others editing a substantial portion of the note. Specialties such as otolaryngology and pediatrics exhibited higher editing rates than family and internal medicine. These findings suggest differences in documentation needs, clinician preferences, or trust in AI-generated outputs.
This study shows that AI scribes can generate high-quality clinical documentation with minimal editing, offering a promising approach to reduce clinical workload and burnout. However, the presence of occasional errors, including those with the potential for serious harm, underscores the need for ongoing clinician oversight and robust quality monitoring systems.
The study introduces a practical and scalable method for evaluating AI-generated notes, combining rapid physician assessments with automated vendor metrics. Limitations include potential selection bias, subjective error ratings, and the lack of formal validation of the assessment tool. Nonetheless, the results provide a valuable framework for safe implementation. Ongoing monitoring, standardization of evaluation methods, integration into clinician training, and collaboration with AI vendors will be important to ensure that the benefits of AI in healthcare are realized without compromising patient safety.
Reference: Taylor SL, Jost M, MacDonald S, et al. Quality of Clinical Notes Created by Ambient Listening Generative AI: Pragmatic Prospective Pilot Study. JMIR Med Inform. 2026;14:e86474. doi:10.2196/86474