Clinical and economic impact of a large language model in perioperative medicine: a randomized crossover trial

Preoperative evaluation is a critical and high-risk aspect of perioperative care. Mistakes during this phase—such as deviations from clinical guidelines or incorrect instructions—can lead to last-minute surgery cancellations, treatment delays, and increased patient complications. These disruptions not only affect patient safety but also cause significant operational inefficiencies, with operating room delays costing approximately USD 1,400 to 1,700 per hour. Despite efforts to improve these processes, major challenges remain, including poor adherence to evolving guidelines and the heavy documentation workload, which consumes over 40% of a clinician’s time.

Recent developments in artificial intelligence (AI) and large language models (LLMs) offer a promising solution to these persistent issues. LLMs have shown strong performance in clinical reasoning tasks and have the potential to interpret specialized guidelines and support complex decision-making. Earlier studies have demonstrated that LLMs, when combined with retrieval-augmented generation (RAG), can accurately assess surgical readiness and create management plans with high accuracy in controlled settings.

Despite rapid advancements in AI capabilities, there remains a significant gap between experimental validation and real-world implementation. Most existing studies have focused on benchmark testing, retrospective data analysis, or simulated scenarios. Few have explored how LLMs function in real-time clinical settings or whether they can significantly improve documentation quality and clinician productivity in actual patient care. Additionally, their real-world impact on efficiency, clinician behavior, and cost-effectiveness has not been thoroughly measured.

Moreover, integrating LLMs into clinical practice presents several challenges. Key concerns include data privacy, model reliability, and contextual accuracy. Adversarial prompts have been shown to extract sensitive information from foundation models, and even anonymized patient data can be re-identified. Furthermore, LLM outputs may struggle to align with nuanced, patient-specific clinical contexts, leading to what is known as the “last-mile problem” in medical AI deployment.

To address these challenges, we utilized the PAIR platform—a secure, government-certified infrastructure developed under Singapore’s Smart Nation initiative—to build a clinical decision support tool for perioperative medicine. Within PAIR Chat, we developed the PErioperative AI CHatbot (PEACH), a system designed to streamline and standardize preoperative assessments. PEACH integrates 35 institution-specific perioperative guidelines into a unified knowledge base, enabling it to perform longitudinal reasoning across various clinical pathways and adapt dynamically to patient-specific contexts. Before deployment, PEACH was reviewed and approved by the institution’s Medical Board and underwent a formal risk assessment through the Clinical Risk Management (CRM) committee to ensure safety, compliance, and alignment with institutional standards. The system was approved by Singapore’s Health Sciences Authority (HSA) as a Class A Clinical Decision Support System (CDSS) in December 2024 and has since been deployed in routine use at the preoperative assessment clinic.

This study addresses a critical gap in the existing literature: the extent to which LLM-powered chatbots can influence real-world clinical workflows, with a particular focus on clinical efficiency and cost-effectiveness. To evaluate the utility of such a system, we conducted a randomized crossover trial at a tertiary academic medical center. The primary objective was to assess the impact of the LLM-based tool on clinical efficiency, as measured by documentation time. Secondary outcomes included assessments of documentation quality, accuracy and safety, user acceptability, and institutional economic outcomes. To our knowledge, this investigation represents the first prospective, real-world evaluation of an LLM-enabled clinical decision support system in perioperative medicine.

A total of 272 patient encounters were recorded during the study period, with 135 encounters on standard workflow days (without PEACH) and 137 on intervention days (with PEACH). PEACH was actively used in 111 of 137 intervention-day cases, yielding a utilization rate of 81.0%. Fourteen out of 16 eligible physicians participated in the study, with one declining to participate and one other unavailable due to emergency medical leave. A total of 161 controls and 111 interventions were included in the final analysis. A CONSORT diagram depicting participant flow is shown in Fig. 1. PEACH delivered outputs within 10–15 seconds on average.

The cases where PEACH was utilized exhibited a higher proportion of high-complexity (Complexity 3) cases (25.2% vs. 13.7%, p = 0.007) and had a greater share of American Society of Anesthesiologists (ASA) Physical Status classification 3 patients (48.6% vs. 34.2%, p = 0.050). The distribution of surgery types did not significantly differ, and both the utilization rate across resident physicians of different seniority and the rate of consultation with attending anesthesiologists were similar between groups (p > 0.50) (Table 1).

In this study, “total consultation time” refers to the entire duration from the patient’s entry into the consultation room to the completion of documentation in the electronic health record, encompassing both face-to-face interaction and subsequent note writing. “Documentation time” is defined as the portion of that period spent solely on completing the clinical note before and after the patient interaction ended.

The total consultation time did not significantly differ between groups (40.04 vs. 40.66 min, p = 0.787). However, documentation time trended lower with PEACH (17.53 vs. 19.35 min), though this was not statistically significant (p = 0.192). By case complexity, PEACH was associated with significantly reduced documentation time in Complexity 2 cases by a mean time difference of 5.77 min per patient (p = 0.010) (Fig. 2). Stratifying by clinician experience, significant documentation time savings were observed among experienced resident physicians (20.0 vs. 24.6 min, p = 0.040), while no significant differences were seen among novice or new-to-institution participants (Table 2).

Human evaluators preferred PEACH-assisted documentation in 57.1% of cases compared to 35.7% for control, although this was not statistically significant (p = 0.180). Inter-rater agreement was substantial, with a Cohen’s κ of 0.71. PEACH outputs were more likely to include an issues list (71.4% vs. 43.9%, p = 0.05), while minor and major error rates were low and similar across groups (Table 3).

Across 168 PEACH interactions (in the 111 clinical cases seen), most cases involved a single interaction (71.2%), with multi-step use in 28.8%. Outputs were primarily summaries and management plans (82.7%), followed by Q&A responses (11.9%) and referral drafting (5.4%).

User feedback was positive, with high average scores for safety (4.94), explainability (4.81), and ease of understanding (4.72). Usefulness (4.23), job facilitation (4.62), and intention to use in the future (4.54) were also favorably rated. Output quality was consistently high, with 100% judged accurate by human reviewers (n = 30). Minor clinical deviations that would not result in patient harm were found in 3 outputs (10.0%), and no hallucinations were observed (Supplementary Table 1).

Usage of the PEACH chatbot varied across resident physicians in both frequency and intensity. Among the 14 participants, each resident saw a median of 8 patients on PEACH intervention days (range: 7–20), with a corresponding usage rate per resident ranging from 36.4% to 100%. Several residents consistently used PEACH for nearly all eligible encounters and engaged in multi-message sessions. Notably, two residents accounted for a disproportionate share of total chatbot interactions, contributing over 25% of all PEACH messages logged. A detailed breakdown of patient volumes, chatbot usage, and message intensity by resident is provided in Supplementary Table 2.

Health economic analysis projected that PEACH implementation could yield substantial institutional cost savings, primarily through reductions in clinician documentation time. Based on observed time reductions, PEACH is projected to save approximately 1091.4 resident clinician hours and 59 attending physician hours annually, corresponding to an estimated cost savings of SGD 197,501 (USD 146,297).

Sensitivity analyses, incorporating variations in adoption rates, time savings, token pricing, and IT maintenance costs, indicate that net annual savings could range from SGD 48,979 to 190,418 (USD 36,280 to 146,295). Importantly, even under conservative assumptions, PEACH remained cost-effective. A summary of these scenarios is presented in Table 4, with detailed calculations available in Supplementary Table 3.

To our knowledge, this randomized, crossover trial represents the first prospective real-world evaluation of a large language model-powered clinical decision support tool in perioperative medicine. Our findings demonstrate that PEACH, a secure, context-aware specialized chatbot, can be feasibly deployed in a high-volume clinical setting and may offer improvements in documentation efficiency and cost savings, particularly in certain subgroups. These results extend the growing evidence supporting LLM integration into clinical workflows and provide a practical blueprint for AI-enabled perioperative care.

Although the overall reduction in documentation time across all cases did not reach statistical significance, subgroup analyses revealed significant gains in intermediate-complexity encounters (Complexity 2) and among experienced clinicians. These findings suggest that PEACH’s benefits are most pronounced when clinical complexity is sufficient to require decision support but not so overwhelming as to necessitate extensive human deliberation. Straightforward cases (Complexity 1) likely afforded little room for improvement, while highly complex cases (Complexity 3) may have involved nuanced decision-making beyond the current capabilities of LLM summarization.

From a systems perspective, even modest per-case time savings scaled to substantial institutional cost reductions, suggesting that AI-assisted workflows can be not only clinically feasible but also economically beneficial. Most prior studies of LLMs have emphasized technical accuracy or simulation-based outcomes, and there has been limited real-world economic evidence of LLM use in operational clinical environments. Recent work by Klang et al. highlights the potential for cost-effective scaling of LLMs within health systems, emphasizing significant cost reductions through task concatenation and optimization strategies. Our findings build upon this framework, demonstrating that LLM-assisted workflows can be both clinically feasible and economically sustainable at scale.

Our current implementation uses a long-context prompting approach, which incurs higher token usage and computational cost than RAG frameworks. In our prior work using RAG-based LLMs for binary perioperative decision tasks, accuracy ranged from 43.0% to 90.0% depending on the model, revealing substantial variability and limited reliability for complex, multi-layered clinical decisions. In contrast, the long-context architecture deployed in this study allowed for more robust longitudinal reasoning across diverse guideline domains, supporting higher fidelity in perioperative assessment and planning. Despite its higher cost, our economic analysis shows that this approach remains cost-effective in practice, reinforcing the value of long-context prompting for high-stakes, domain-specific clinical applications.

Nevertheless, the limited overall impact observed points to several challenges. Uptake and usage variability suggest that successful deployment of LLMs in healthcare requires more than technical excellence; it demands attention to psychological, organizational, and cultural factors that shape trust and acceptance. Facilitating conditions, visible leadership support, structured training, and peer normalization will be crucial to scaling AI-enabled workflows. Future iterations should focus on embedding PEACH into native EHR systems, streamlining user interfaces, and providing real-time support to optimize facilitating conditions.

Several limitations warrant consideration. The open-label nature of the trial introduces potential for performance and selection bias. Outcomes were limited to documentation time and quality; future research should evaluate patient-centered safety endpoints and include qualitative evaluation of user acceptance. The subgroup analyses were not adjusted for multiple comparisons, and the findings should be interpreted with caution as hypothesis-generating rather than confirmatory. Furthermore, while Claude 3.5 Sonnet demonstrated strong contextual performance, model-specific behaviors may vary, and generalizability to other settings or LLMs remains to be determined.

Importantly, we acknowledge the possibility of behavioral changes introduced by the study setting. Participants may have modified their workflow consciously or subconsciously when using PEACH, potentially affecting efficiency or documentation behavior. Additionally, despite blinding efforts, subtle stylistic differences may still have influenced evaluator perception.

Lastly, we did not initiate formal onboarding or structured training for PEACH, based on the assumption that its interface would be sufficiently intuitive for clinical users. However, retrospective review of participant interactions revealed suboptimal utilization of PEACH’s full capabilities—such as highlighting key clinical issues and automated summarization—resulting in continued manual data entry for several tasks. This finding highlights the critical importance of formative training to optimize adoption and realize the full benefits of AI-driven medical tools. Importantly, as AI applications in clinical practice are still emerging, best practices for training users to interact effectively with these technologies are not yet well established. Future implementations should incorporate structured user education and iterative feedback, and future research should focus on developing and refining training strategies to maximize the benefits of AI integration in healthcare.

In conclusion, PEACH represents a promising example of effective LLM integration into perioperative medicine. By leveraging secure infrastructure, domain-specific guideline ingestion, and adaptive prompting strategies, PEACH demonstrated potential improvements in documentation quality and efficiency in exploratory analyses, while maintaining high safety standards and favorable user acceptance. Our results underscore the importance of real-world validation, clinician-centered design, and economic evaluation in guiding responsible adoption of artificial intelligence in medicine.

Study design

We conducted a prospective, randomized crossover trial to evaluate the real-world clinical impact of PEACH in a high-volume preoperative evaluation clinic (PEC). The trial was conducted at Singapore General Hospital, a 1900-bed tertiary academic hospital in Singapore, between January and February 2025. The institutional review board waived ethics approval as the intervention was classified as non-human-subjects research under local governance policies.

Prompt engineering

PEACH was developed within the PAIR Chat platform, which is built around a large language model optimized for extended-context inputs. PAIR is hosted on the secure Government Commercial Cloud (GCC) and is designed to support healthcare AI applications managing data classified as “Restricted” or “Sensitive,” ensuring enterprise-grade data security. PAIR Chat leverages Claude 3.5 Sonnet (Anthropic, San Francisco, CA). Structured perioperative guidelines (Supplementary Table 4) were input directly into the model using document stuffing, which enables comprehensive reasoning across full documents without the need for retrieval modules. This approach allows the model to synthesize and apply context across extensive clinical documentation.

Prompt engineering was a critical component of chatbot design and followed best practices. Prompts were constructed using role-based instructions, specifying the chatbot’s persona (e.g., senior perioperative clinician or health system analyst). Prompts were iteratively refined through internal testing, response evaluation, and user feedback. This process reduced hallucinations, improved clarity, and aligned outputs with domain-specific expectations. Final prompts emphasized task specificity, contextual relevance, and clinical or administrative nuance (see Supplementary Table 5). PEACH was designed to assist with specific tasks: 1) Question and Answer (Q&A), 2) Summarization and making perioperative plans, and 3) Writing referral letters.
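
As an illustration of the role-based prompting described above, the sketch below shows how a task-specific prompt might be assembled under a long-context (non-RAG) design. The persona wording, function name, and tags are hypothetical; the actual PEACH prompts are provided in Supplementary Table 5.

```python
# Illustrative role-based prompt assembly for the summarization-and-planning task.
# The persona text, variable names, and tags are hypothetical examples, not the
# production PEACH prompts (see Supplementary Table 5 for those).

def build_summary_prompt(guideline_text: str, case_notes: str) -> str:
    """Compose a long-context prompt: the full guideline corpus is placed in the
    prompt itself rather than retrieved on demand (no retrieval module)."""
    persona = (
        "You are a senior perioperative clinician at a tertiary academic hospital. "
        "Answer strictly according to the institutional guidelines provided below."
    )
    task = (
        "Summarize the patient's active issues and propose a perioperative "
        "management plan. Flag any item that requires review by the attending anesthesiologist."
    )
    return (
        f"{persona}\n\n<guidelines>\n{guideline_text}\n</guidelines>\n\n"
        f"{task}\n\n<case>\n{case_notes}\n</case>"
    )
```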

Participants and settings

The trial employed a prospective, two-day randomized crossover design. Resident physicians rotating through the anesthesiology department and scheduled for at least two consecutive days in the preoperative evaluation clinic (PEC) during January and February 2025 were enrolled. A washout period was not implemented due to feasibility constraints in our local context: residents are typically assigned to the PEC for only 2–4 days per rotation, with their next assignment occurring 1–2 months later. Introducing a washout period would risk confounding by capturing residents at different stages of clinical experience, thereby introducing variability in documentation efficiency.

Participants varied in their prior exposure to institutional perioperative protocols and the electronic health record (EHR) system. The PEACH system had been available for informal use starting one month before study initiation. At our institution, anesthesiology residents rotate every six months, with a new cohort beginning in January. While some residents from the prior cohort had limited exposure to PEACH, no formal training was provided before the study. Upon enrollment, participants received a brief five-minute orientation covering basic access, login, and usage. No extensive onboarding was conducted, as the chatbot was designed to be intuitive and self-directed.

Each participant was randomized to complete two consecutive clinic sessions—one utilizing PEACH (Intervention) and the other following standard workflows without AI assistance (Control). Randomization was conducted using a 1:1 block design and stratified according to participants’ prior anesthesiology and institutional experience. Participants were categorized into three strata: (1) less than 6 months of anesthesiology experience (Novice); (2) physicians with at least 6 months of anesthesiology experience at other institutions but unfamiliar with the local EHR system (New to institution); and (3) experienced resident physicians who had previously completed at least six months of anesthesiology rotations within the study hospital (Experienced). The participants were given a 10-dollar gift voucher for their participation. Given the nature of the intervention, blinding of participants was not feasible; however, the reviewers were blinded.
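
A minimal sketch of such a stratified 1:1 block randomization is given below. The block size of two, the participant identifiers, and the use of Python’s random module are assumptions for illustration; the trial reports only that a 1:1 block design stratified by experience was used.

```python
# Sketch of stratified 1:1 block randomization of the crossover sequence
# (PEACH-first vs. control-first). Block size and participant IDs are assumed.
import random

def randomize(participants_by_stratum: dict, seed: int = 2025) -> dict:
    rng = random.Random(seed)
    allocation = {}
    for stratum, ids in participants_by_stratum.items():
        for i in range(0, len(ids), 2):                      # blocks of two within each stratum
            block = ["PEACH-first", "Control-first"]
            rng.shuffle(block)
            for participant, sequence in zip(ids[i:i + 2], block):
                allocation[participant] = sequence
    return allocation

strata = {
    "Novice": ["R01", "R02", "R03", "R04"],
    "New to institution": ["R05", "R06"],
    "Experienced": ["R07", "R08", "R09", "R10", "R11", "R12", "R13", "R14"],
}
print(randomize(strata))
```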

To reduce the risk of reviewer bias, documentation samples were anonymized and selected to ensure parity across case complexity. Clinical documentation was extracted directly from the electronic health record without reformatting or alteration, preserving the original physician-authored content. While AI-generated text may exhibit subtle stylistic patterns, reviewers were not primed to expect such differences.

During the PEACH day (Intervention), participants accessed PEACH through a secure hospital interface. The extent and manner of PEACH usage were left to the discretion of each participant, who could elect to use the chatbot frequently or only selectively. All interactions with PEACH—including the number of uses and their specific purposes (e.g., summarization and planning, Q&A, or referral drafting)—were recorded and extracted from the system logs. The study team subsequently randomly selected 30 PEACH outputs for the evaluation of accuracy. In the control arm, participants conducted preoperative assessments and completed documentation using standard clinical procedures without access to AI assistance (Fig. 1).

For each patient encounter, three time metrics were recorded based on timestamps from the clinic queuing system and documentation logs within the electronic health record. “Patient time” was defined as the duration from the patient’s entry to exit from the consultation room. “Total time” refers to the entire period required for case review and documentation. “Documentation time” was calculated as the difference between total time and patient time.
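
A hypothetical worked example of these definitions is given below; the timestamps and variable names are invented for illustration and do not come from the clinic queuing system or EHR logs.

```python
# Hypothetical timestamps illustrating the time-metric definitions above.
from datetime import datetime

fmt = "%H:%M"
review_start = datetime.strptime("08:55", fmt)   # clinician begins case review
room_in      = datetime.strptime("09:00", fmt)   # patient enters consultation room
room_out     = datetime.strptime("09:23", fmt)   # patient leaves consultation room
note_done    = datetime.strptime("09:35", fmt)   # documentation completed in the EHR

patient_time = (room_out - room_in).total_seconds() / 60          # 23.0 min
total_time = (note_done - review_start).total_seconds() / 60      # 40.0 min
documentation_time = total_time - patient_time                    # 17.0 min
print(patient_time, total_time, documentation_time)
```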

Clinical documentation was extracted from the perioperative clinic records and anonymized for blinded evaluation. To account for variability in administrative burden based on case complexity, all cases were stratified into three predefined categories for analysis according to the plans outlined in the documentation. Complexity 1 included ASA I or II patients who required no further clinical interventions. Complexity 2 encompassed cases necessitating minor actions, such as brief consultations with the attending anesthesiologist. Complexity 3 referred to high-acuity cases requiring multidisciplinary coordination, including communication with multiple specialty teams, additional referrals, or changes to the surgical schedule (e.g., cancellations or rescheduling). Full inclusion and exclusion criteria for each category are detailed in Supplementary Table 6. Complexity assignments were made retrospectively by two anesthesiologists on the study team, based on the clinical documentation. The reviewers were blinded to the study arm, and discrepancies were resolved by consensus.

Outcome measures

The primary outcome was clinical efficiency, measured by differences in documentation time per patient between PEACH-assisted and standard workflows. Secondary outcomes included documentation quality, safety and accuracy, user acceptance, and economic impact.

Documentation quality was assessed using a paired-review design. Fifty-six clinical documents (28 pairs) were collected. These cases were matched by case complexity to ensure comparability. All samples were anonymized and independently reviewed by two anesthetists. Reviewers evaluated each documentation pair for the presence of a clinically relevant issues list and identified any minor and major errors in the perioperative management plan. Additionally, reviewers indicated their overall preference between the two outputs in each pair.
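
Inter-rater agreement for this paired review is reported above as Cohen’s κ; the sketch below shows how such agreement could be computed from the two reviewers’ preference ratings, using placeholder labels rather than study data.

```python
# Illustrative inter-rater agreement calculation for the paired-review design.
# The preference labels below are placeholders, not the reviewers' actual ratings.
from sklearn.metrics import cohen_kappa_score

# Preferred document per pair, per reviewer: "PEACH", "Control", or "No preference".
reviewer_1 = ["PEACH", "PEACH", "Control", "No preference", "PEACH", "Control", "PEACH"]
reviewer_2 = ["PEACH", "PEACH", "Control", "PEACH", "PEACH", "Control", "Control"]

print(f"Cohen's kappa: {cohen_kappa_score(reviewer_1, reviewer_2):.2f}")
```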

User acceptance was evaluated using a structured questionnaire based on Davis’s Technology Acceptance Model (TAM), administered upon study completion. The survey included 5-point Likert-scale items assessing perceived usefulness, ease of use, clarity of reasoning, and intention to use PEACH in future clinical practice.

Economic analysis

An economic evaluation was conducted following the Consolidated Health Economic Evaluation Reporting Standards (CHEERS) guidelines to quantify the institutional value of PEACH implementation. The analysis modeled potential cost savings based on observed time reductions in documentation and consultation, extrapolated to an institutional scale, assuming an annual patient volume of 20,000 preoperative assessments. We assumed that the distribution of physician seniority mirrored that observed in the study population.

A simple costing model was used to estimate potential labor cost savings by converting clinician time saved into full-time equivalent (FTE) hours and multiplying by an estimated wage rate of SGD 140 per hour. Concurrently, the operating cost of the LLM (Claude 3.5 Sonnet, Anthropic, San Francisco, CA) was factored in based on token usage and per-token pricing at the time of analysis (SGD 0.00047 per output, assuming 133 tokens per 100 words). The model considered the average number of PEACH outputs generated per patient and projected annual usage volume. To test the robustness of the model, a sensitivity analysis was performed using ±20% variation in key input parameters, including clinician wage rates, output volume per patient, and token pricing.
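
A simplified sketch of this costing model is given below. The wage rate, per-output LLM cost, and annual patient volume are taken from the text; the minutes saved per case and outputs per patient are illustrative placeholders rather than the study’s estimates.

```python
# Sketch of the institutional costing model under stated assumptions.
ANNUAL_PATIENTS = 20_000              # annual preoperative assessments (from text)
WAGE_SGD_PER_HOUR = 140               # estimated clinician wage rate (from text)
LLM_COST_PER_OUTPUT_SGD = 0.00047     # per-output token cost at time of analysis (from text)

def net_annual_savings(minutes_saved_per_case: float, outputs_per_patient: float) -> float:
    labour_savings = ANNUAL_PATIENTS * (minutes_saved_per_case / 60) * WAGE_SGD_PER_HOUR
    llm_cost = ANNUAL_PATIENTS * outputs_per_patient * LLM_COST_PER_OUTPUT_SGD
    return labour_savings - llm_cost

# Illustrative base case with one-way +/-20% sensitivity on time saved per case;
# 4 minutes per case and 1.5 outputs per patient are assumed placeholder values.
for factor in (0.8, 1.0, 1.2):
    savings = net_annual_savings(minutes_saved_per_case=4.0 * factor, outputs_per_patient=1.5)
    print(f"time-saving scenario x{factor:.1f}: SGD {savings:,.0f} per year")
```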

Sample size calculation

As there were no prior studies evaluating the impact of a clinical decision support tool on documentation efficiency in preoperative settings, we prospectively planned an interim analysis after the enrollment of 12 participants to inform sample size estimation. Before initiating the study, we defined a clinically meaningful reduction in documentation time of 3 min, representing a 30% improvement relative to an estimated average documentation time of 10 min per patient. Assuming a standard deviation of 10 min in documentation time differences, a paired crossover design, a two-sided α of 0.05, and 80% power, an a priori sample size calculation indicated that 88 paired patient encounters would be required to detect this effect.
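
This a priori calculation can be reproduced with the standard normal-approximation formula for a paired comparison, as sketched below; with δ = 3 min, σ = 10 min, two-sided α = 0.05, and 80% power, it yields 88 paired encounters.

```python
# Sample size for a paired (crossover) comparison via the normal approximation:
# n = ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
import math
from scipy.stats import norm

delta, sigma, alpha, power = 3.0, 10.0, 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)   # approximately 1.96
z_beta = norm.ppf(power)            # approximately 0.84

n_pairs = math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
print(n_pairs)  # 88 paired patient encounters
```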

At interim analysis, the observed standard deviation of documentation time was consistent with this assumption. Additionally, each participant completed an average of 8 paired patient encounters per session. Based on this, the study proceeded to full enrollment, ultimately including 14 participants (111 paired patient encounters).

All statistical analyses were performed using Python 3.13.2. Paired t-tests were used to compare continuous time metrics between intervention and control arms, while categorical outcomes such as documentation preference were evaluated using Chi-square tests.
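
A brief sketch of this analysis approach is shown below: a paired t-test for continuous time metrics and a chi-square test for categorical outcomes such as documentation preference. All values are placeholders, not trial data.

```python
# Illustration of the statistical tests described above; arrays are placeholder values.
import numpy as np
from scipy import stats

# Paired documentation times (minutes) for matched control and PEACH encounters.
control_minutes = np.array([22.1, 18.4, 25.0, 19.7, 21.3, 17.8])
peach_minutes = np.array([19.0, 17.2, 20.5, 18.1, 19.9, 16.4])
t_stat, p_paired = stats.ttest_rel(control_minutes, peach_minutes)

# 2x2 preference counts: rows = reviewer panel, columns = [prefers PEACH, prefers control].
preference_table = np.array([[16, 10],
                             [12, 14]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(preference_table)

print(f"paired t-test p = {p_paired:.3f}; chi-square p = {p_chi2:.3f}")
```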

During manuscript preparation, the authors used large language models (ChatGPT-4o and DeepSeek) to assist with technical paraphrasing and grammatical refinement. All AI-assisted content was reviewed for scientific accuracy and verified against original data by the study authors.
