A specialised medical artificial intelligence program has demonstrated the capacity to ask better questions in test consultations, rank higher on empathy, and make more accurate diagnoses than human doctors, its developers say.
Dubbed AMIE (Articulate Medical Intelligence Explorer), the Google-developed algorithm operates in the same way as other large language models like ChatGPT but is described as optimised for “diagnostic dialogue”.
This is apparently thanks to its training on a “diverse suite of real-world medical datasets”, including over 11,000 old medical exam questions, dozens of electronic health record note summaries, and transcripts of almost 100,000 recorded medical conversations.
Beyond that, the programmers fed 64 expert-crafted long-form responses to questions from HealthSearchQA, LiveQA, and Medication QA in MultiMedBench into the algorithm.
The results have been published on the arXiv preprint server and, while neither peer-reviewed nor tested in human conditions as yet, offer strong reasons for optimism about the potential of the technology, according to the team.
They put the model to the test in a text-based Objective Structured Clinical Examination (OSCE) involving 149 case scenarios from clinical providers in Canada, the UK and India, comparing its results with those of 20 primary care physicians from the three countries.
The findings were that AMIE demonstrated both greater diagnostic accuracy and superior performance on 28 of 32 measures according to specialist physicians who marked the exam, as well as 24 of 28 assessed by the patient actors.
Accuracy was superior in diagnosing respiratory, cardiovascular and other conditions, with the chatbot also managing to ask questions that elicited as much information as those asked by human doctors, the researchers said.
“To our knowledge, this is the first time that a conversational AI system has ever been designed optimally for diagnostic dialogue and taking the clinical history,” Google research scientist and study co-author Alan Karthikesalingam told reporters from Nature in an interview last month.
Nevertheless, he stressed it was still early days for the technology, which was seen as an assistive tool rather than a replacement for human doctors.
Beyond that, the test was somewhat skewed given both AMIE and the human candidates were required to type out their responses, unlike in a typical OSCE or patient encounter, Mr Karthikesalingam said.
“We want the results to be interpreted with caution and humility,” he said.
“This in no way means that a language model is better than doctors in taking clinical history.”
An important next step for the research would be to conduct more-detailed studies to evaluate potential biases and ensure that the system is fair across different populations, he noted, adding the Google team was also starting to look into the ethical requirements for testing the system in humans with real medical problems.