Neither TMHT nor NC performed well as a single predictor of DL. Five machine learning algorithms (BRF, XGB, LGBM, MLP, LR) were applied to predict DL using seven predictors, including TMHT and NC, all of which can be measured even when airway assessment is limited. The BRF model achieved the highest AUROC and AUPRC, the two measures used to evaluate model performance, but neither value indicated excellent performance. Sensitivity was highest in the BRF model, whereas specificity and accuracy were highest in the LGBM model.
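For readers less familiar with these measures, the following is a minimal sketch of how sensitivity, specificity, accuracy, AUROC, and AUPRC can be computed from true labels and predicted scores. It is not the code used in this study: the function names, the 0.5 decision threshold, and the ten-patient toy data are assumptions made purely for illustration, and AUPRC is approximated here by average precision without tie handling.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy from binary labels (1 = DL)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive case outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a common estimate of AUPRC (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive case
    return total / n_pos


# Hypothetical toy data: 2 DL cases among 10 patients (imbalanced, as DL is rare).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = confusion_metrics(y_true, y_pred)
auc = auroc(y_true, scores)
ap = average_precision(y_true, scores)
```

In this toy example the accuracy (0.8) and specificity (0.875) look acceptable even though only one of the two DL cases is detected (sensitivity 0.5), which illustrates why sensitivity and AUPRC deserve separate attention when the outcome is rare.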
In many studies, NC has been associated with difficult intubation in obese patients [8, 16, 17]. Thyromental height has also been reported as a predictor of difficult airway management [4, 16, 17, 18, 19, 20]. These findings suggest that NC and TMHT may be predictors of DL. Several studies showed promising results even with a single predictor [4, 16, 17, 18, 19, 20, 21, 22]. However, these previous studies differ from ours in several respects. The vast majority of studies predicting difficult airway from NC were conducted in obese patients, so data on non-obese patients are insufficient [8, 16, 17]. The primary outcome also differed (difficult intubation vs. DL) [8, 18, 20, 21, 22]. Some TMHT studies may differ from ours because their patients were of different races from the patients in our study. Some studies targeted specific patient populations, such as patients undergoing coronary bypass surgery, elderly patients, and patients intubated with double-lumen tubes [16, 18, 20]. In some TMHT studies, as in ours, the primary outcome was DL, and TMHT showed excellent performance in predicting it [4, 17]. However, those results are difficult to generalize because the studies were not large-scale and were conducted in a specific race. In clinical practice, it is difficult to predict DL with a single predictor, including TMHT. Numerous studies have reported methods of predicting difficult airway, but no reliable method exists yet [23, 24, 25, 26]. Using multiple tests to predict difficulty in airway management may be better than using any single test in isolation [27].
Machine learning is being used to analyze the importance of clinical parameters and their combinations for prognosis, e.g., prediction of disease progression, extraction of medical knowledge for outcome research, therapy planning and support, and overall patient management [28]. Therefore, applying machine learning to difficult airway prediction may also be warranted. Models that predict difficult airways using machine learning have been reported in a few studies [29, 30]. Langerson and colleagues showed that a computer-based boosting method was superior to conventional methods in predicting difficult tracheal intubation. Their results show that machine learning can be effective in predicting difficult airways. However, their predictors included body mass index, age, Mallampati class, thyromental distance, mouth opening, macroglossia, sex, receding mandible, and snoring, so their model cannot be applied to patients with limited airway assessment, as ours can [30]. Moustafa and colleagues also reported a method of predicting DL using machine learning, as in our study. They used nine predictors and reported an AUROC of 0.79, the same as in our study. However, it is difficult to compare their model's performance with ours because their model was trained on only 100 patients and no validation results on a test set were reported. In addition, because their predictors included interincisor distance, thyromental distance, sternomental distance, modified Mallampati score, upper lip bite test, and joint extension, their model cannot be applied to patients with limited airway evaluation [29].
The strength of this study is that machine learning algorithms were used to develop models that predict DL, and that the models were validated on a test set. However, this study has some limitations. First, the models developed in this study did not show excellent performance in terms of AUROC and, especially, AUPRC. Moreover, no single model combined high sensitivity, high specificity, and high accuracy. We did not calculate the sample size required for the study. Machine learning algorithms require a large amount of data, often more than classical statistics would reasonably require. Nonlinear models in particular require as much data as possible, and thousands to tens of thousands of samples may be needed [31]. Unlike a previous study using the same algorithms [32], this study was conducted prospectively, and we tried to include as much training data as possible given the expected study period and the difficulty of obtaining data. After oversampling with SMOTE, each class in the training set contained 1173 samples. However, to improve the performance of a predictive model, the model needs to be trained on more data [33]. Second, the models may be difficult to apply to pediatric patients or to other races, because the data used to train and validate them came from adults, most of whom were Korean. Asian populations have statistically different dimensions from Caucasian populations in terms of chin arch, face length, and nose protrusion.