The increasing use of large language models in healthcare makes an assessment of their clinical reliability necessary. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS). Responses to 80 complex clinical questions covering diagnosis, treatment, and surveillance were independently scored by two sarcoma experts for accuracy and adequacy. ChatGPT-4 outperformed …