
Machine Learning for Benchmarking Adolescent Idiopathic Scoliosis Surgery Outcomes

Aditi Gupta 1 2, Inez Y Oh 1, Seunghwan Kim 1, Michelle C Marks 3, Philip R O Payne 1, Christopher P Ames 4, Ferran Pellise 5, Joshua M Pahys 6, Nicholas D Fletcher 6, Peter O Newton 7, Michael P Kelly 7; Harms Study Group

PMID: 37249385 DOI: 10.1097/BRS.0000000000004734


Study design: Retrospective cohort.

Objective: Design a risk-stratified benchmarking tool for adolescent idiopathic scoliosis (AIS) surgeries.

Summary of background data: Machine learning (ML) is an emerging method for prediction modeling in orthopedic surgery. Benchmarking is an established method of process improvement and is an area of opportunity for ML methods. Current surgical benchmark tools often use ranks, and no “gold standards” for comparisons exist.

Methods: Data from 6076 AIS surgeries were collected from a multicenter registry and divided into three datasets, encompassing surgeries performed (1) during the entire registry, (2) during the past 10 years, and (3) during the last 5 years of the registry. We trained three ML regression models (baseline linear regression, gradient boosting [GB], and XGBoost [XGB]) on each data subset to predict each of the five outcome variables: length of stay (LOS), estimated blood loss (EBL), operative time, SRS-Pain, and SRS-Self-image. Performance was categorized as “below expected” if more than one standard deviation worse than the mean, “as expected” if within one standard deviation of the mean, and “better than expected” if more than one standard deviation better than the mean.
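The three-way performance categorization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that performance is judged on the residual between the observed outcome and the model-predicted outcome, that the one-standard-deviation band is taken around the mean of those residuals, and that lower values are better (as would be the case for LOS, EBL, and operative time). The function name `categorize_performance`, the `lower_is_better` flag, and the example values are hypothetical.

```python
from statistics import mean, stdev


def categorize_performance(observed, predicted, lower_is_better=True):
    """Label each case by comparing its residual with a one-SD band.

    Assumption (not stated in the abstract): the residual is
    observed minus predicted, and the band is centered on the
    mean residual of the cases passed in.
    """
    residuals = [o - p for o, p in zip(observed, predicted)]
    mu = mean(residuals)
    sd = stdev(residuals)  # sample standard deviation

    labels = []
    for r in residuals:
        # With no spread, every case is treated as "as expected".
        z = 0.0 if sd == 0 else (r - mu) / sd
        # Flip the sign when higher values indicate a better outcome.
        if not lower_is_better:
            z = -z
        if z > 1:
            labels.append("below expected")
        elif z < -1:
            labels.append("better than expected")
        else:
            labels.append("as expected")
    return labels


# Hypothetical length-of-stay values in days (illustrative only).
observed_los = [4.0, 5.0, 9.0, 3.0, 5.0, 2.0]
predicted_los = [4.5, 5.0, 5.0, 4.0, 5.0, 5.0]
labels = categorize_performance(observed_los, predicted_los)
print(labels)
```

In this example the third case stays far longer than predicted and is labeled “below expected,” while the last case is discharged well ahead of its prediction and is labeled “better than expected.” For an outcome where higher scores are better, passing `lower_is_better=False` reverses the direction of the comparison.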

Results: Ensemble ML methods classified performance better than traditional regression techniques for LOS, EBL, and operative time. The best performing models for predicting LOS and EBL were trained on data collected in the last 5 years, while operative time used the entire 10-year dataset. No models were able to predict SRS-Pain or -Self-image in any useful manner. Point-precise estimates for continuous variables were subject to high average errors.

Conclusions: Classification of benchmark outcomes is improved with ensemble ML techniques and may provide much-needed case adjustment for a surgeon performance program. Precise estimates of health-related quality of life scores and continuous variables were not possible, suggesting that performance classification is a better method of performance evaluation.

Copyright © 2023 Wolters Kluwer Health, Inc. All rights reserved.

Conflict of interest statement

The authors report no conflicts of interest.
