Monoarthritis is a frequent clinical presentation with a broad differential diagnosis that includes septic arthritis, crystal arthropathies, and early autoimmune conditions. This study aims to assess the diagnostic accuracy and reasoning quality of three publicly available large language models (LLMs), Claude, DeepSeek, and Gemini, in monoarthritis scenarios relevant to emergency care. Ten clinical vignettes involving monoarthritis were generated and presented in identical format to each LLM. Two specialists, a rheumatologist and an orthopedic surgeon, independently evaluated the model responses in a blinded fashion using predefined criteria: diagnostic accuracy (binary) and completeness, consistency, and misinformation (each rated on a Likert scale). Inter-rater reliability was calculated using Cohen’s kappa and intraclass correlation coefficients. Statistical analyses were performed using the Kruskal–Wallis test and Dunn’s post hoc test. All three models achieved 100% diagnostic accuracy. However, statistically significant differences were observed across models regarding the quality of their justifications (p 0.90). Despite identical diagnostic accuracy, the explanatory quality of the LLMs differed notably in monoarthritis cases. Claude demonstrated relatively superior performance, whereas Gemini underperformed in completeness. These findings highlight the need for structured evaluation frameworks before LLMs are integrated into clinical workflows, especially in time-sensitive conditions such as monoarthritis.
Key words: Monoarthritis, large language models, artificial intelligence, diagnosis, natural language processing, interobserver variability
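For readers unfamiliar with the agreement statistic named in the abstract, the following is a minimal sketch of unweighted Cohen's kappa for two raters, written in plain Python. It is not the authors' analysis code. The function name `cohen_kappa` and the ratings `rater_1` and `rater_2` are hypothetical and serve only as illustration. The sketch uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies.

```python
import math
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in labels) / n**2
    if math.isclose(p_e, 1.0):
        # Both raters used one and the same label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary labels for ten vignettes (illustration only, not study data).
rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Note the degenerate case handled above: when both raters assign the same single label to every item, chance agreement equals 1 and kappa is undefined, so a binary criterion on which every response scores identically cannot be summarized by kappa alone.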