Mortality prediction by SOFA score in ICU patients after cardiac surgery; comparison with traditional prognostic models

Background: There are many prognostic models and scoring systems in use to predict mortality in ICU patients. The only general ICU scoring system developed and validated for patients after cardiac surgery is the APACHE-IV model. It is, however, a labor-intensive scoring system that requires a lot of data and could therefore be prone to error. The SOFA score, on the other hand, is a simpler system that has been widely used in ICUs and could be a good alternative. The goal of the study was to compare the SOFA score with the APACHE-IV and other ICU prediction models.
Methods: We investigated, in a large cohort of cardiac surgery patients admitted to Dutch ICUs, how well the SOFA score from the first 24 h after admission predicts hospital and ICU mortality in comparison with other recalibrated general ICU scoring systems. Measures of discrimination, accuracy, and calibration (area under the receiver operating characteristic curve (AUC), Brier score, R², and Ĉ-statistic) were calculated using bootstrapping. The cohort consisted of 36,632 patients from the Dutch National Intensive Care Evaluation (NICE) registry who had undergone a cardiac surgery procedure for which ICU admission was necessary between January 1st, 2006 and June 31st, 2018.
Results: Discrimination of the SOFA, APACHE-IV, APACHE-II, SAPS-II and MPM24-II models for hospital mortality was good, with AUCs of 0.809, 0.851, 0.830, 0.850 and 0.801, respectively. Discrimination of the SOFA, APACHE-IV, APACHE-II, SAPS-II and MPM24-II models for ICU mortality was slightly better, with AUCs of 0.809, 0.906, 0.892, 0.919 and 0.862, respectively. Calibration of the models was generally poor.
Conclusion: Although the SOFA score had good discriminatory power for hospital and ICU mortality, the discriminatory power of the APACHE-IV and SAPS-II models was better. The SOFA score should not be preferred over traditional prognostic ICU models as a mortality prediction model.

Acute Physiology and Chronic Health Evaluation-IV model (APACHE-IV), which was published in 2006 [7].
The SOFA score was initially developed as a tool to learn from the evolution of organ failure in sepsis and to assess the effects of therapies like mechanical ventilation and vasopressors on the course of organ dysfunction. It assigns 0-4 points to each of six organ systems (respiratory, circulation, renal, neurologic, hepatic, coagulation) [4]. The importance of the SOFA score is growing, and it has been incorporated in the latest Surviving Sepsis Campaign as a tool to describe and detect sepsis [8]. Although the SOFA score was not initially developed to predict mortality, several studies have used it to predict morbidity and mortality, and it has been validated for that purpose in several ICU populations [9] [10]. It would be interesting to know whether the SOFA score can predict mortality in the cardiac surgery population as well.
The SOFA score is much simpler than general ICU prediction models such as the APACHE-IV model, which requires a lot of data and places a heavy burden on precise data acquisition. If mortality could be predicted as accurately with the SOFA score as with the APACHE-IV model, use of the SOFA score would be preferable for that purpose.
The aim of the current study is to investigate, in a large retrospective cohort derived from the Dutch National Intensive Care Evaluation (NICE) registry [11] [12], how well the SOFA score on day one predicts ICU and hospital mortality in comparison to the general ICU mortality prediction models, i.e. SAPS-II, MPM24-II, APACHE-II, and APACHE-IV. Secondly, we wanted to investigate the contribution of the different components of the SOFA to its predictive value.

Data
The NICE registry collects demographic, physiological, clinical and organizational data from all 84 Dutch ICUs [12]. To ensure that the data are of a high quality, ICU employees are trained how to score patients, the data are checked before being included into the database, and data quality audits are carried out [11,13].
We used data from cardiac surgery centers in the NICE SOFA database with an APACHE-IV admission diagnosis related to open heart surgery (see E-Supplement 2) between January 1st, 2007 and June 31st, 2018. Patients were included if they were 18 years or older and all of the following scoring systems were available: SOFA score on day one and its six individual organ scores, APACHE-IV, APACHE-II, MPM24-II, and SAPS-II. All readmissions within the same hospital admission were excluded from analyses.

Severity of illness scores
Demographic data as well as all data needed to calculate the scoring systems were collected in the hospital in which the patient was admitted and were securely uploaded to the NICE registry [12]. All scoring systems were calculated according to the standards in the international literature [1] [2] [3] [4] [7]. A brief summary of the different scoring systems is included in E-Supplement 1 and E-Supplement 4. We used only the SOFA score on day one because the general ICU prediction models included only data collected from the first 24 h of admission. To account for organ replacement devices that were not in common use at the time the SOFA score was developed, minor adaptations were made to the original SOFA score [4]. Specifically, we gave the maximum number of points for the renal category if the patient received continuous renal replacement therapy (CRRT) or other forms of renal replacement therapy. We gave the maximum number of points for the cardiovascular category if the patient had a left or right ventricular assist device, an intra-aortic balloon pump (IABP) or was on veno-arterial extracorporeal membrane oxygenation (VA-ECMO). We gave the maximum number of points for the respiratory category if the patient was on veno-venous extracorporeal membrane oxygenation (VV-ECMO) or had special forms of ventilation (nitric oxide (NO) ventilation, differential lung ventilation or partial liquid ventilation, but not prone position ventilation).
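The device-based adaptations described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the flag names (for example `crrt`, `va_ecmo`) and the dictionary layout are hypothetical, chosen only to show how a device override lifts one organ component to the maximum of 4 points while leaving the other components untouched.

```python
MAX_COMPONENT_SCORE = 4  # highest score of a single SOFA organ component

# Hypothetical flag names; the mapping mirrors the adaptations in the text.
DEVICE_TO_COMPONENT = {
    "crrt": "renal",
    "other_rrt": "renal",
    "lvad": "cardiovascular",
    "rvad": "cardiovascular",
    "iabp": "cardiovascular",
    "va_ecmo": "cardiovascular",
    "vv_ecmo": "respiratory",
    "no_ventilation": "respiratory",
    "differential_lung_ventilation": "respiratory",
    "partial_liquid_ventilation": "respiratory",
}


def adapt_sofa(components, devices):
    """Return a copy of the component scores with device overrides applied."""
    adapted = dict(components)
    for device in devices:
        organ = DEVICE_TO_COMPONENT.get(device)
        if organ is not None:  # unmapped flags (e.g. prone positioning) change nothing
            adapted[organ] = MAX_COMPONENT_SCORE
    return adapted


def total_sofa(components):
    """Total SOFA score is the sum of the six organ component scores."""
    return sum(components.values())


# Invented example patient: an IABP overrides the cardiovascular component,
# prone positioning is deliberately ignored.
example = {"respiratory": 2, "cardiovascular": 1, "renal": 0,
           "neurologic": 0, "hepatic": 0, "coagulation": 1}
adapted = adapt_sofa(example, ["iabp", "prone_position"])
```

In this invented example the total rises from 4 to 7 because only the cardiovascular component is replaced by the maximum.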

Statistical analyses
Categorical variables are presented as percentages, and continuous variables are presented as mean and SD or as median and interquartile range (IQR), depending on the data distribution. Demographics are also provided for sub-populations based on quartiles of the SOFA score. To assess differences in the distribution of continuous variables between the sub-populations based on quartiles of the SOFA score, the independent t-test was used when the data were normally distributed and the Mann-Whitney U test when the data were not normally distributed. Normality was tested using graphical methods. All statistical analyses were performed using R version 3.6.0. A p value of less than 0.05 was applied as the level of significance.

Hospital mortality
The SOFA score was initially developed to quantify organ dysfunction and not to predict mortality. In order to predict hospital mortality based on the SOFA score and its sub-scores, we used logistic regression modelling. To keep these models as simple as possible but also to give them a fair chance of achieving a good prognostic performance compared to the general ICU prediction models, gender and age were added to the models as covariates.
The general ICU prediction models, i.e. APACHE-IV, APACHE-II, MPM24-II, and SAPS-II, are logistic regression models that use different predictor variables to predict hospital mortality. These models are not stable over time [14]. To make the mortality predictions comparable to the newly defined mortality prediction models based on SOFA score, the original models were calibrated using first-level customization [14]. To this end, for each model, a logistic regression model was fitted with observed in-hospital death as the dependent variable and the logit-transformed original predictions as the independent variable.
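As an illustration of first-level customization, the sketch below fits the two-parameter model logit(P(death)) = a + b · logit(p_original) by Newton-Raphson, using only the Python standard library. It is a sketch under stated assumptions, not the study's R code: the function names are hypothetical and the data are simulated, with a "true" intercept of -1.0 and slope of 0.8 chosen arbitrarily.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_first_level(original_probs, died, iterations=25):
    """Fit died ~ a + b * logit(original_prob) by Newton-Raphson."""
    x = [logit(p) for p in original_probs]
    a, b = 0.0, 1.0  # start from the uncustomized model
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, died):
            mu = expit(a + b * xi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def customize(original_probs, a, b):
    """Apply the fitted intercept and slope to the original predictions."""
    return [expit(a + b * logit(p)) for p in original_probs]


# Simulated illustration (not study data): the original model overpredicts.
rng = random.Random(0)
original = [rng.uniform(0.01, 0.6) for _ in range(5000)]
died = [1 if rng.random() < expit(-1.0 + 0.8 * logit(p)) else 0 for p in original]
a, b = fit_first_level(original, died)
recalibrated = customize(original, a, b)
```

After fitting, the sum of the recalibrated probabilities equals the observed number of deaths, which is the sense in which the customized model is calibrated "in the large" on the data it was fitted to.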

ICU mortality
In order to predict ICU mortality based on SOFA score and its sub-scores, we again used logistic regression modelling. Gender and age were added to the models as covariates.
The general ICU prediction models were developed to predict hospital mortality. To predict ICU mortality, logistic regression modelling was used with observed ICU mortality as the dependent variable and the logit-transformed predictions based on the original model as the independent variable.

Performance assessment of the models
The area under the receiver operating characteristic curve (AUC) was used to describe the discrimination of the models [15]. An AUC of 0.5 indicates that the model has no discriminative power and an AUC of 1.0 indicates perfect discriminative power [15]. To compare the calibration of the models, the Hosmer-Lemeshow Ĉ-statistic was used [16]. The Hosmer-Lemeshow Ĉ-statistic assesses whether the observed mortality rates match the expected mortality rates in sub-populations of the total model population [16]. The Ĉ-statistic is a χ² statistic for which a p value > 0.05 is considered to indicate good calibration, i.e. the difference between predicted and actual outcomes in the subgroups is small and not statistically significant [16].
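Both measures can be written down compactly. The following Python sketch is illustrative only and makes several assumptions that are not stated in the text: risk groups are formed as ten equal-sized groups of sorted predictions, the p value uses `groups - 2` degrees of freedom (a common convention for a model evaluated on the data it was fitted to), and the χ² tail probability uses the closed form that holds for an even number of degrees of freedom. Function names are hypothetical.

```python
import math


def auc(probs, died):
    """Chance that a random non-survivor has a higher predicted risk than
    a random survivor; ties count as one half."""
    pos = [p for p, y in zip(probs, died) if y == 1]
    neg = [p for p, y in zip(probs, died) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(probs, died, groups=10):
    """Hosmer-Lemeshow C-hat over equal-sized groups of sorted predictions."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        expected = sum(probs[i] for i in idx)   # expected deaths in the group
        observed = sum(died[i] for i in idx)    # observed deaths in the group
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return stat


def chi2_sf_even_df(x, df):
    """Upper tail probability of a chi-square variable, valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Tiny invented example: two of four patients died.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

With ten groups, a Ĉ value above roughly 15.5 corresponds to p < 0.05 at 8 degrees of freedom, which under the convention above would be read as poor calibration.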
The Brier score was used to assess the overall accuracy of the models [17]. The Brier score is the mean squared difference between the observed and predicted outcome, which includes both discrimination and calibration aspects. The smaller the difference between observed and predicted mortalities, the lower the score, the better the model.
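As a reading aid, the definition can be expressed in a few lines of Python. The sketch is not part of the study's analysis and the function names are hypothetical. The second function gives the Brier score of a non-informative model that predicts the same risk, equal to the overall mortality rate, for every patient; this value depends only on the mortality rate and offers one possible yardstick for judging a Brier score when mortality is low.

```python
def brier_score(probs, died):
    """Mean squared difference between predicted risk and observed outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, died)) / len(probs)


def reference_brier(prevalence):
    """Brier score obtained when every patient is assigned the overall mortality rate."""
    return prevalence * (1.0 - prevalence)


# Invented two-patient example: one survivor predicted at 10%, one death at 90%.
example_brier = brier_score([0.1, 0.9], [0, 1])
```

For a mortality rate of 2.2%, for instance, `reference_brier(0.022)` is about 0.0215, so in low-mortality populations even an uninformative prediction yields a small Brier score.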
The performance of the models was assessed using the ordinary bootstrap method with 500 bootstrap samples [18]. In each sample, the performance measures were calculated and exported to a separate table. For each model, the median and 95% confidence interval of each performance measure were defined using the 50th percentile and the 2.5th and 97.5th percentiles of the bootstrap distribution, respectively. A difference in a performance measure between models was considered statistically significant if the medians differed and the related confidence intervals did not overlap. First-level customization does not change the influence of individual covariates included in the model but calibrates their joint influence on the observed mortality [14]. Note that, for the APACHE-IV, APACHE-II, MPM24-II, and SAPS-II models, the AUC in each bootstrap sample is therefore unchanged by customization, because the order of the predicted probabilities does not change; only their absolute magnitude differs.
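The percentile bootstrap and the invariance of the AUC under first-level customization can be sketched as follows. This is a minimal Python illustration on simulated data, not the study's R code: the function names, the linear-interpolation percentile rule, the seeds and the customization coefficients (-0.5 and 0.8) are all assumptions made for the example, and bootstrap samples containing only survivors or only deaths are skipped because the AUC is undefined for them.

```python
import math
import random


def auc(probs, died):
    """Pairwise AUC; ties between a death and a survivor count as one half."""
    pos = [p for p, y in zip(probs, died) if y == 1]
    neg = [p for p, y in zip(probs, died) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def percentile(sorted_values, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def bootstrap_metric(probs, died, metric, n_boot=500, seed=0):
    """Return (2.5th percentile, median, 97.5th percentile) of a metric."""
    rng = random.Random(seed)
    n = len(probs)
    values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sample_died = [died[i] for i in idx]
        if 0 < sum(sample_died) < n:  # skip samples without both outcomes
            values.append(metric([probs[i] for i in idx], sample_died))
    values.sort()
    return percentile(values, 2.5), percentile(values, 50), percentile(values, 97.5)


# Simulated illustration (not study data).
rng = random.Random(42)
probs = [rng.uniform(0.01, 0.5) for _ in range(200)]
died = [1 if rng.random() < p else 0 for p in probs]

# A customization with positive slope is a monotone transformation of the risk.
recalibrated = [1.0 / (1.0 + math.exp(-(-0.5 + 0.8 * math.log(p / (1.0 - p)))))
                for p in probs]

ci_original = bootstrap_metric(probs, died, auc, n_boot=200, seed=1)
ci_recalibrated = bootstrap_metric(recalibrated, died, auc, n_boot=200, seed=1)
```

Because the same seed draws the same bootstrap samples and the customization preserves the ranking of patients, `ci_original` and `ci_recalibrated` coincide, whereas a calibration-sensitive measure such as the Brier score would generally differ between the two sets of predictions.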

Ethics
Data are encrypted such that all patient-identifying information is untraceable. The need for ethical committee approval was waived by the Central Committee on Research Involving Human Subjects, because the study was purely retrospective and used de-identified patient data (reference number W17_297 # 17.349; Medical Ethics Review Committee of the Academic Medical Center, University of Amsterdam).

Results
We included 36,632 cardiac surgery patients from 12 cardiac surgery centers participating in the NICE SOFA module, of whom 70.7% were men. Figure 1 shows a flowchart of the data inclusion process. Mean age was 66.8 years; 1.3% died during their ICU admission and 2.2% died in hospital. Table 1 describes baseline characteristics, procedures and outcomes, categorized by quartiles of the SOFA score. It was not possible to distribute the patients evenly over the quartiles because the data were skewed. The incidence of ICU mortality and hospital mortality was highest in the quartile with the highest SOFA scores. In these patients, emergency surgery and complex surgery were more prevalent than in the other quartiles, while the number of CABGs was lower. All patient characteristics were unequally distributed among the sub-populations based on quartiles of the SOFA score (P < 0.001).
Performance assessment of the models
Tables 2 and 3 describe the performance of the models for predicting hospital mortality and ICU mortality, respectively. Measured by the AUC, the SOFA model on day one had a significantly lower discriminative power for hospital mortality compared to the APACHE-IV, APACHE-II and SAPS-II models. Also, the discriminative power of the SOFA model for ICU mortality was worse than that of the APACHE-IV, APACHE-II and [...] Overall, the models showed good accuracy according to the Brier score [18]. The accuracy was comparable between the models for both hospital mortality (Brier score ranging between 0.019 and 0.020) and ICU mortality (Brier score ranging between 0.011 and 0.012).
Performance measures were also calculated for the prediction models based on the six individual organ components of the SOFA model for both hospital and ICU mortality (Tables 4 and 5). For all performance measures, the overall SOFA model performed significantly better than the individual organ component models. There was no significant difference in calibration and accuracy between the models based on individual SOFA components; however, discriminative power did differ. The renal component had significantly better discrimination than all other components (renal AUC 0.771 (0.763-0.777) for ICU mortality and 0.741 (0.736-0.745) for hospital mortality). The respiratory component had significantly poorer discrimination than all other components.

Discussion
Our main finding is that the SOFA score used as a prediction model underperforms in predicting ICU and hospital mortality among cardiac surgery patients compared to the APACHE-IV, APACHE-II and SAPS-II models. Calibration of all models was poor for the outcome hospital mortality. From the recalibration curves (E-Supplement 3) it is clear that most models perform badly in high-risk patients, which influences the Hosmer-Lemeshow Ĉ-statistic [19]. Only the SAPS-II model and the MPM24-II model had good calibration for the outcome measure ICU mortality.
This study is not the first to investigate ICU prediction models in cardiac surgery patients, but it is the first to compare these different models in a cohort of more than 36,000 patients.
Doerr et al. [5] showed in a previous study of 2801 patients that the SOFA score and the SAPS-II had good discriminative power for hospital mortality, with an AUC of 0.85 (CI 95%; 0.81-0.88) for the SOFA score and 0.83 (0.79-0.86) for the SAPS-II model, which differs from our findings. Pätilä et al. [20] studied the SOFA score in 857 patients and found that the maximum SOFA score on day one predicted 30-day mortality with an AUC of 0.78 (CI 95%; 0.64-0.92), which was comparable with our finding but with a broader confidence interval that can be explained by the low number of cases. Ceriani et al. tested the SOFA score for mortality prediction in 218 cardiac surgery patients who stayed in the ICU for > 96 h [21]. The AUC for the prediction of hospital mortality of the SOFA score on day 1 was 0.71 (CI 95%; ± 0.08). We scored the SOFA slightly differently from the original article [4] because we gave patients on CRRT or ECMO the maximum score possible within the respective SOFA component. It could be that other study groups treated the SOFA score differently in these patients, leading to some discrepancy. We believe that the discrepancy cannot be large because it is unlikely that many patients started on day one with CRRT or ECMO. Giving patients on CRRT or ECMO the highest score within the respective SOFA component is, in our view, logical because these patients have the most severe deterioration of organ function.
From our data it is clear that most patients who died are found in the group with a SOFA score in the highest quartile. It is notable that in the last quartile surgery is of a more complex nature and has a more emergent character, while the percentage of CABG was lower, explaining the rise in mortality in this group of patients.
Of the SOFA components, the renal component had the highest discriminative power, followed by the circulation component. From these data we can conclude that renal insufficiency is an important determinant of mortality in cardiac surgery patients. Ceriani et al. also tested the importance of the SOFA components on day 1 and found that the cardiac component predicted mortality best, followed by the neurologic component and the liver component [21]. Their findings may have differed from ours because they only included patients who were admitted for more than 96 h, while the median length of stay in our population was 1.8 days.
It is surprising that the SAPS-II model performed similarly to the APACHE-IV model in predicting hospital mortality and was even better in predicting ICU mortality. SAPS-II does not include specific cardiac-surgical diagnostic categories and is generated from far fewer variables than APACHE-IV. In fact, the original SAPS-II model excluded cardiac surgery patients. The same observation has been made by Brinkman et al. [22] in the complete ICU population (i.e. all general, surgical and thoracic surgery patients).
Our data do not support the use of the SOFA score as a mortality prediction model in cardiac surgery patients. Nevertheless, we think that the SOFA score is still a valuable tool in other settings, such as the detection of sepsis [8] and monitoring the evolution of the patient's condition [10] [4].

Conclusion
The SOFA score has important potential advantages over the APACHE-IV model, being simpler and less labor intensive. However, we must conclude that in this large cohort of cardiac surgery patients the SOFA score used as a mortality prediction model underperformed compared to the APACHE-IV and SAPS-II models in predicting hospital and ICU mortality.