Study: AI models need thorough preclinical testing to root out safety concerns

An artificial intelligence algorithm used to detect hip fractures outperformed human radiologists, but further testing revealed mistakes that would prevent its safe use, according to a study published in The Lancet.

Researchers evaluated a deep learning model designed to detect proximal femoral fractures on frontal X-rays of emergency department patients. The model was trained on data from the Royal Adelaide Hospital in Australia.

They compared the model’s accuracy against that of five radiologists on a dataset also from the Royal Adelaide Hospital, and then performed an external validation study using imaging results from the Stanford University Medical Center in the U.S.

Finally, they conducted an algorithmic audit to find any unusual mistakes.

In the Royal Adelaide study, the AI model achieved an area under the receiver operating characteristic curve (AUC) of 0.994, compared with 0.969 for the radiologists. On the Stanford dataset, the model's AUC was 0.980.
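For readers unfamiliar with the metric, the AUC summarizes how well a model ranks fracture cases above non-fracture cases across all possible decision thresholds. The short Python sketch below shows how such a figure is typically computed; the labels and scores are purely illustrative and are not the study's data, and it assumes scikit-learn is available.

```python
# Minimal sketch: computing AUC for a binary fracture-detection model.
# The labels and scores below are illustrative, not from the study.
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels (1 = fracture present) and model probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.97, 0.12, 0.88, 0.35, 0.30, 0.05, 0.92, 0.40])

auc = roc_auc_score(y_true, y_score)
print(f"AUC: {auc:.3f}")  # area under the ROC curve; 1.0 = perfect ranking
```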

However, the external validation showed the system still would not be usable in the new setting without additional preparation.

“Whereas the discriminative performance of the artificial intelligence system (the AUC) appears to be maintained on external validation, the decrease in sensitivity at the prespecified operating point (from 95.5 to 75.0) would make the system clinically unusable in the new environment,” the study’s authors wrote.

“Although this shift could be mitigated by the selection of a new operating point, as shown when we found similar sensitivity and specificity in a post-hoc analysis (in which the smaller decrease in specificity reflects the minor reduction in discriminative performance), this would require a localisation process to determine the new operating point in the new environment.”
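The operating point is the probability threshold at which the model's scores are converted into fracture or no-fracture calls, and re-selecting it on local data is one plausible form of the localisation process the authors describe. The hedged sketch below illustrates that idea: sweep the ROC curve on a local validation set and pick the threshold that recovers a target sensitivity. The function name, target value and data are assumptions for illustration, not the study's published procedure.

```python
# Hedged sketch of one possible "localisation" step: re-selecting the operating
# threshold on a local validation set to recover a target sensitivity.
# Illustrative only; not the study's actual method or data.
import numpy as np
from sklearn.metrics import roc_curve

def select_operating_point(y_true, y_score, target_sensitivity=0.95):
    """Return the most specific threshold whose sensitivity meets the target."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    # tpr is sensitivity; thresholds are in descending order, so the first
    # index meeting the target is the one with the lowest false-positive rate.
    idx = np.argmax(tpr >= target_sensitivity)
    return thresholds[idx], tpr[idx], 1 - fpr[idx]

# Hypothetical local validation data from the new site.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.91, 0.20, 0.70, 0.55, 0.35, 0.10, 0.85, 0.58, 0.60, 0.05])

thr, sens, spec = select_operating_point(y_true, y_score)
print(f"threshold={thr:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```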

Though the model performed well overall, the study also noted it occasionally made non-human errors, or unexpected mistakes a human radiologist wouldn’t make. 

“Despite the model performing extremely well at the task of proximal femoral fracture detection when assessed with summary statistics, the model appears to be prone to making unexpected mistakes and can behave unpredictably on cases that humans would consider simple to interpret,” the authors wrote. 

WHY IT MATTERS

Researchers said the study highlights the importance of rigorous testing before implementing AI models.

“The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behavior even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions,” they wrote.

THE LARGER TREND

A number of companies are using AI to analyze imaging results. Last month, Aidoc received two FDA 510(k) clearances for software that flags and triages potential pneumothorax and brain aneurysms. Another company in the space, Qure.ai, recently raised $40 million in funding not long after it earned the FDA greenlight for a tool that assists providers in placing breathing tubes based on chest X-rays.

Though proponents argue AI could improve outcomes and cut down on costs, research has shown many of the datasets used to train these models come from the U.S. and China, which could limit their usefulness in other countries. Bias is also a big concern for providers and researchers, as it has the potential to worsen health inequities.
