Tundra Space

Clinical Research Directory

Browse clinical research sites, groups, and studies.

ENROLLING BY INVITATION
NCT07199231

OpenEvidence Safety and Comparative Efficacy of Four LLMs in Clinical Practice

Sponsor: Cambridge Health Alliance

View on ClinicalTrials.gov

Summary

OpenEvidence is an online tool that aggregates and synthesizes data from peer-reviewed medical studies, then produces a response to a user's question using generative AI. While it is in use by a number of clinicians (including residents) today, there is little to no published data on whether the tool's outputs are accurate and whether that information appropriately informs clinical decision making. Similarly, a number of clinicians are turning to other large language models (LLMs) to assist in decision making when providing clinical care. While a number of studies have been published on the accuracy of these LLMs' responses to medical board questions or clinical vignettes, few studies to date have examined their performance in a real-world clinical setting, and fewer still have compared that performance across models.

In this study, investigators have two goals:

1. To determine whether use of the AI tool OpenEvidence leads to clinically appropriate decisions when utilized by family medicine, internal medicine, and psychiatry residents in the course of clinical practice.
2. To determine how the output of the OpenEvidence tool compares with three other commonly used, publicly available large language models (OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini) in answering common questions that residents have in the course of clinical practice.

To accomplish goal #1, investigators have enlisted residents in the above specialties to use the OpenEvidence tool in the course of clinical practice. To mitigate any safety risks, the residents will also consult a typical reference tool, referred to as the "Gold Standard" tool; these tools include PubMed and UpToDate. All residents will undergo training in prompt engineering at the start of the study. The residents will:

1. State their clinical question.
2. Query OpenEvidence, capturing their prompt and the OpenEvidence output for data analysis.
3. State their clinical conclusion based on the OpenEvidence data.
4. Query the Gold Standard resource.
5. State their final clinical conclusion.
6. Answer a question on whether their clinical conclusion was modified by the Gold Standard reference.
7. Answer a question on whether they had any clinical safety concerns about the output from OpenEvidence.

Attending physician Subject Matter Experts (SMEs), matched by specialty and with at least 5 years of post-training clinical experience, will then evaluate the residents' responses. The 5-year threshold was chosen based on the book "Outliers" by Malcolm Gladwell, in which he asserts that 10,000 hours of focused practice is needed to achieve expertise in a field. SMEs will be asked to evaluate the residents' initial clinical questions and their conclusions based only on OpenEvidence, rating the clinical appropriateness of those conclusions on a scale of 1-10. For questions where SMEs rate the clinical appropriateness of a resident's conclusion poorly (< 5/10), they will be asked to review the OpenEvidence output and answer an additional question as to whether the output was incorrect or the resident misinterpreted the output from the tool.

To accomplish goal #2, the initial prompt entered by the residents into OpenEvidence will be copied by the research team into ChatGPT, Gemini, and Claude. The outputs from each tool (including OpenEvidence) will be surfaced to SMEs, who will be asked to rate each output on accuracy, completeness, and bias using Likert scales. SMEs will also be asked an open-ended question to identify any patient safety issues in any of the outputs.

Official title: A Comparative Performance Evaluation of Four Publicly Available Large Language Models Against Gold Standard Medical References

Key Details

Gender

All

Age Range

Any - Any

Study Type

OBSERVATIONAL

Enrollment

20

Start Date

2025-10-01

Completion Date

2026-09

Last Updated

2026-02-18

Healthy Volunteers

Yes

Interventions

OTHER

AI clinical reference tool

Residents will use the OpenEvidence clinical reference tool in the course of routine clinical care. They must also use a Gold Standard clinical reference tool (e.g., PubMed, UpToDate) to mitigate risk.

Locations (1)

Cambridge Health Alliance

Cambridge, Massachusetts, United States