Abstract

Performance of a trained large language model in providing clinical trial recommendations in a head and neck cancer population.

Authors
Tony Hung, Gilad Kuperman, Eric Jeffrey Sherman, Alan Loh Ho, Winston Wong, Anuja Kriplani, Lara Dunn, James Vincent Fetten, Loren S. Michel, Shrujal S. Baxi, Chunhua Weng, David G. Pfister, Jun J. Mao
Organizations
Memorial Sloan Kettering Cancer Center, New York, NY; Columbia University, New York, NY; Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY

Research Funding
Memorial Sloan Kettering Cancer Center (MSK) Support Grant (P30-CA008748)

Background: Chatbots based on large language models (LLMs) have demonstrated the ability to answer oncology exam questions; however, LLMs used for medical decision support have not yet demonstrated suitable performance in oncology practice. We evaluated the performance of a trained LLM, GPT-4, in recommending appropriate clinical trials for a head and neck (HN) cancer population.

Methods: In 2022, we developed an artificial intelligence-powered clinical trial management mobile app, LookUpTrials, and demonstrated promising user engagement among oncologists. Using the LookUpTrials database, we applied direct preference optimization to train GPT-4 as an in-app assistant for LookUpTrials. From Nov 7 to Dec 19, 2023, we collected consecutive, new patient cases and their respective clinical trial recommendations from oncologists in the HN medical oncology service at Memorial Sloan Kettering Cancer Center. Cases were categorized by diagnosis, cancer stage, treatment setting, and physician recommendation on clinical trials. The trained GPT-4 was prompted using a semi-structured template: "Given patient with a <diagnosis>, <cancer stage>, <treatment setting>, what are possible clinical trials?" Physician recommendations were compared with the trained GPT-4's responses. We analyzed the performance of GPT-4 based on its response precision (positive predictive value), recall (sensitivity), and F1 score (harmonic mean of precision and recall).

Results: We analyzed 178 patient cases: mean age 65.6 (SD 13.9), primarily male (75%), with local/locally advanced (68%) HN (61%), thyroid (16%), skin (9%), or salivary (8%) cancers. The majority were treated in the definitive setting with combined-modality therapy (42%), and a modest proportion were treated on clinical trials (10%). Overall, the trained GPT-4 achieved moderate performance in matching physician clinical trial recommendations, with 63% precision and 100% recall (F1 score 0.77), narrowing a total list of 56 HN clinical trials to a range of 0-4 relevant trials per patient case (mean 1, SD 1.2). Comparatively, the performance of our trained GPT-4 exceeded the historic performance of untrained LLMs in providing oncology treatment recommendations by 4- to 20-fold (F1 scores 0.04-0.19).

Conclusions: This proof-of-concept study demonstrated that a trained LLM can achieve moderate performance in matching physician clinical trial recommendations in HN oncology. Our results suggest the potential of embedding trained LLMs into the oncology workflow to aid clinical trial search and accelerate clinical trial accrual. Future research is needed to optimize the precision of trained LLMs and to assess whether they may offer a scalable solution to enhance the diversity and equity of clinical trial participation.
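The abstract names direct preference optimization (DPO) as the training method but does not describe the setup further. The sketch below shows the standard DPO loss on preference pairs (a preferred vs. a dispreferred trial recommendation for the same prompt); the tensor names, beta value, and pairing scheme are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the direct preference optimization (DPO) loss, computed
# against a frozen reference model. Inputs are summed log-probabilities of
# whole responses; "chosen" = preferred recommendation, "rejected" = not.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (policy log-ratio - reference log-ratio))."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```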
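The Methods section quotes a semi-structured prompt template filled from each case's categories. A minimal sketch of that templating step follows; the dataclass and function names are hypothetical, and the example field values are illustrative, not taken from the study data.

```python
# Minimal sketch of the semi-structured prompt template quoted in Methods.
from dataclasses import dataclass

@dataclass
class PatientCase:
    diagnosis: str          # e.g., "oropharyngeal squamous cell carcinoma"
    cancer_stage: str       # e.g., "locally advanced"
    treatment_setting: str  # e.g., "definitive"

def build_prompt(case: PatientCase) -> str:
    # Mirrors the template: "Given patient with a <diagnosis>, <cancer stage>,
    # <treatment setting>, what are possible clinical trials?"
    return (f"Given patient with a {case.diagnosis}, {case.cancer_stage}, "
            f"{case.treatment_setting}, what are possible clinical trials?")
```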
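For clarity on the evaluation metrics, the sketch below computes precision, recall, and F1 per case by treating the model's returned trials and the physician's recommended trials as sets; whether matching was done exactly this way is an assumption. Plugging in the reported aggregate precision (0.63) and recall (1.00) reproduces the reported F1 of about 0.77.

```python
# Minimal sketch of per-case precision, recall, and F1 over trial sets.
def precision_recall_f1(recommended: set[str], reference: set[str]) -> tuple[float, float, float]:
    true_pos = len(recommended & reference)          # trials both lists share
    precision = true_pos / len(recommended) if recommended else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Sanity check against the abstract's aggregate numbers:
p, r = 0.63, 1.00
print(2 * p * r / (p + r))  # ~0.773, matching the reported F1 of 0.77
```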
