Evaluating the Concordance Between ChatGPT and Multidisciplinary Teams in Breast Cancer Treatment Planning: A Study from Bosnia and Herzegovina
Background/Objectives: In many low- and middle-income countries (LMICs), including Bosnia and Herzegovina, oncology services are constrained by a limited number of specialists and uneven access to evidence-based care. Artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, may provide clinical decision support to help standardize treatment and assist clinicians where oncology expertise is scarce. This study aimed to evaluate the concordance, safety, and clinical appropriateness of ChatGPT-generated treatment recommendations compared with decisions made by a multidisciplinary team (MDT) in the management of patients with newly diagnosed breast cancer. Methods: This retrospective study included 91 patients with newly diagnosed, treatment-naïve breast cancer whose cases were presented to an MDT in Bosnia and Herzegovina in 2023. Patient data were entered into ChatGPT-4.0 to generate treatment recommendations. Four board-certified oncologists, two internal and two external, evaluated ChatGPT’s suggestions against the MDT decisions using a 4-point Likert scale. Agreement was analyzed using descriptive statistics, Cronbach’s alpha, and Fleiss’ kappa. Results: The mean agreement score between ChatGPT and MDT decisions was 3.31 (SD = 0.10), with high consistency across oncologist ratings (Cronbach’s alpha = 0.86). Fleiss’ kappa indicated moderate inter-rater reliability (κ = 0.31, p < 0.001). Higher agreement was observed in patients with hormone receptor-negative tumors and in those treated with standard chemotherapy regimens. Lower agreement occurred in cases requiring individualized decisions, such as low-grade tumors or uncertain indications for surgery or endocrine therapy. Conclusions: ChatGPT showed high concordance with MDT treatment plans, especially in standardized clinical scenarios. In resource-limited settings, AI tools may support oncology decision-making and help bridge gaps in clinical expertise. However, careful validation and expert oversight remain essential for safe and effective use in practice.
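For readers who wish to reproduce the agreement statistics on their own rating data, the sketch below shows one way to compute Cronbach’s alpha and Fleiss’ kappa from a patients-by-raters matrix of 4-point Likert scores. It is an illustrative implementation only: the function names, the use of NumPy, and the randomly generated placeholder ratings are assumptions for demonstration and are not the study’s data or analysis code.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects x n_raters) matrix of scores."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of subjects' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def fleiss_kappa(ratings: np.ndarray, categories=(1, 2, 3, 4)) -> float:
    """Fleiss' kappa for an (n_subjects x n_raters) matrix of categorical ratings."""
    n_subjects, n_raters = ratings.shape
    # Count how many raters placed each subject in each category.
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)   # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 91 patients rated by 4 oncologists on a 1-4 Likert scale.
rng = np.random.default_rng(0)
ratings = rng.integers(1, 5, size=(91, 4))   # placeholder data, not the study's ratings
print(f"alpha = {cronbach_alpha(ratings):.2f}, kappa = {fleiss_kappa(ratings):.2f}")
```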