Diagnostic Accuracy of GPT-4o and Claude for HEART Score Calculation in Chest Pain

RECRUITING

NCT07626060

Sponsor: Marmara University Pendik Training and Research Hospital

Summary

This prospective observational diagnostic accuracy study evaluates whether two large language models (LLMs), GPT-4o (OpenAI, gpt-4o-2024-11-20) and Claude (Anthropic, claude-sonnet-4-6), can accurately calculate HEART scores from unstructured Turkish clinical notes and predict 30-day major adverse cardiac events (MACE) in emergency department patients presenting with non-traumatic chest pain. The study will enroll 600 consecutive adult patients. For each patient, the same anonymized data (free-text anamnesis, ECG report text, troponin value, and age) will be processed independently by both LLMs via separate API calls with deterministic settings (temperature=0, JSON format). The reference standard for the agreement analysis is a three-expert consensus HEART score, derived through blinded independent scoring by three emergency medicine physicians with majority-vote adjudication. The outcome for the diagnostic accuracy analysis is actual 30-day MACE (all-cause death; acute myocardial infarction (AMI) Type 1, 2, or 4b; or unplanned revascularization), determined via the national health database and telephone follow-up. A secondary documentation-quality sub-study will quantify how often Turkish emergency anamnesis notes spontaneously capture the HEART score parameters.
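As an illustration of the diagnostic accuracy analysis described above, the following minimal Python sketch computes sensitivity, specificity, and predictive values from HEART totals and observed 30-day MACE. The function name `accuracy_metrics` and the choice of a total of 4 or more as "test positive" (the conventional non-low-risk cutoff) are assumptions made for this example; the listing does not state which cutoff or statistics the investigators will use.

```python
from typing import Dict, Optional, Sequence


def accuracy_metrics(
    totals: Sequence[int], mace: Sequence[bool], cutoff: int = 4
) -> Dict[str, Optional[float]]:
    """Build a 2x2 table from HEART totals and MACE outcomes.

    A patient counts as test positive when total >= cutoff.
    The default cutoff of 4 is an assumption, not taken from the protocol.
    """
    if len(totals) != len(mace):
        raise ValueError("totals and mace must have the same length")

    tp = fp = tn = fn = 0
    for total, event in zip(totals, mace):
        positive = total >= cutoff
        if positive and event:
            tp += 1
        elif positive and not event:
            fp += 1
        elif not positive and not event:
            tn += 1
        else:
            fn += 1

    def ratio(num: int, den: int) -> Optional[float]:
        # Return None rather than dividing by zero on an empty cell.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

The same function could be applied separately to the GPT-4o totals, the Claude totals, and the expert consensus totals so that the three are compared against the same MACE outcomes.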

Official title: Diagnostic Accuracy of Large Language Models (GPT-4o and Claude) in HEART Score Calculation and 30-Day MACE Prediction in Emergency Department Chest Pain Patients: A Prospective Observational Validation Study Against Three-Expert Consensus

Key Details

Gender

All

Age Range

18 Years - Any

Study Type

OBSERVATIONAL

Enrollment

690

Start Date

2026-06

Completion Date

2027-06

Last Updated

2026-06-23

Healthy Volunteers

No

Conditions

Emergency Medicine; Artificial Intelligence (AI); Artificial Intelligence (AI) in Diagnosis; Chest Pain; Rule Out Myocardial Infarction

Interventions

OTHER

GPT-4o HEART Score Calculator

OpenAI GPT-4o (model: gpt-4o-2024-11-20, temperature=0, max_tokens=500, response_format=JSON). Each patient's anonymized anamnesis text, ECG report text, troponin value, and age are submitted via a separate API call with no conversation history. Output: HEART score components (0-2 each), total score (0-10), risk group, and indeterminate status.
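The output described above could be checked with a small validator along the following lines. This is a sketch only: the JSON key names (`history`, `ecg`, `age`, `risk_factors`, `troponin`, `total`, `risk_group`, `indeterminate`) are hypothetical, since the listing does not publish the schema, and the risk bands used are the conventional HEART bands (0-3 low, 4-6 moderate, 7-10 high), which the protocol may or may not adopt. Treating `indeterminate` as a boolean that exempts a record from range checks is likewise an assumption.

```python
import json

# Hypothetical key names for the five HEART components.
COMPONENTS = ("history", "ecg", "age", "risk_factors", "troponin")


def risk_group(total: int) -> str:
    """Map a HEART total to the conventional bands (assumed here)."""
    if total <= 3:
        return "low"
    if total <= 6:
        return "moderate"
    return "high"


def validate_heart_output(raw: str) -> dict:
    """Parse one model reply and check it against the described ranges."""
    data = json.loads(raw)

    # Assumption: an indeterminate reply carries no usable scores.
    if data.get("indeterminate") is True:
        return data

    for name in COMPONENTS:
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer, got {value!r}")
        if not 0 <= value <= 2:
            raise ValueError(f"{name} must be in 0..2, got {value}")

    expected_total = sum(data[name] for name in COMPONENTS)
    if data["total"] != expected_total:
        raise ValueError(
            f"total {data['total']} does not equal component sum {expected_total}"
        )
    if data["risk_group"] != risk_group(expected_total):
        raise ValueError(f"risk_group {data['risk_group']!r} does not match total")
    return data
```

Because both models are described as returning the same JSON schema, one validator of this kind could be applied to both sets of replies before any scores enter the analysis.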

OTHER

Claude HEART Score Calculator

Anthropic Claude (model: claude-sonnet-4-6, temperature=0, max_tokens=500, response_format=JSON). Same system prompt and input format as GPT-4o. Processed independently with no cross-contamination between models. Output: same JSON schema as GPT-4o.

OTHER

Three-Expert Consensus HEART Score

Three emergency medicine physicians (>=3 years experience, HEART-score trained) independently score each anonymized record. Majority vote (2/3) determines component scores; a 4th adjudicator resolves ties. Experts are blinded to LLM scores, each other's scores, and MACE outcomes.
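The majority-vote rule above can be sketched in a few lines of Python. The function name `majority_score` and the use of `None` to flag a record for the fourth adjudicator are choices made for this example, not details from the protocol. Since each component takes the values 0, 1, or 2, a "tie" is read here as the case where all three experts give different scores.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_score(scores: Sequence[int]) -> Optional[int]:
    """Return the score at least two of three experts agree on.

    Returns None when all three differ, which signals that the
    component should go to the fourth adjudicator.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three expert scores")
    value, count = Counter(scores).most_common(1)[0]
    return value if count >= 2 else None
```

For instance, `majority_score([1, 1, 2])` gives `1`, while `majority_score([0, 1, 2])` gives `None` and the component would be referred for adjudication.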

Locations (1)

Marmara University Pendik Training and Research Hospital

Istanbul, Istanbul, Turkey (Türkiye)
