
Preadmission assessment of extended length of hospital stay with RFECV-ETC and hospital-specific data



Patients who exceed their expected length of stay in the hospital come at a cost to stakeholders in the healthcare sector as bed spaces are limited for new patients, nosocomial infections increase and the outcome for many patients is hampered due to multimorbidity after hospitalization.


This paper develops a technique for predicting Extended Length of Hospital Stay (ELOHS) at preadmission and identifying its risk factors using hospital data.


A total of 91,468 records of patients’ hospital information from a private acute teaching hospital were used to develop a machine learning algorithm relying on Recursive Feature Elimination with Cross-Validation and Extra Tree Classifier (RFECV-ETC). The study implemented Synthetic Minority Oversampling Technique (SMOTE) and tenfold cross-validation to determine the optimal features for predicting ELOHS, while relying on multivariate Logistic Regression (LR) for computing the risk factors and the Relative Risk (RR) of ELOHS at a 95% confidence level.


An estimated 11.54% of the patients have ELOHS, and the rate increases with patient age: patients < 18 years, 18–40 years, 40–65 years and ≥ 65 years have ELOHS rates of 2.57%, 4.33%, 8.1%, and 15.18%, respectively. The RFECV-ETC algorithm predicted preadmission ELOHS to an accuracy of 89.3%. Age is a predominant risk factor of ELOHS, with patients who are > 90 years, PAG (> 90) {RR: 1.85 (1.34–2.56), P:  < 0.001}, having 6.32% and 23.3%, respectively, higher likelihood of ELOHS than patients 80–90 years old, PAG (80–90) {RR: 1.74 (1.34–2.38), P:  < 0.001}, and those 70–80 years old, PAG (70–80) {RR: 1.5 (1.1–2.05), P: 0.011}. Those from admission category ADC (US1) {RR: 3.64 (3.09–4.28), P:  < 0.001} are 14.8% and 70.5%, respectively, more prone to ELOHS compared to ADC (UC1) {RR: 3.17 (2.82–3.55), P:  < 0.001} and ADC (EMG) {RR: 2.11 (1.93–2.31), P:  < 0.001}. Patients from SES (low) {RR: 1.45 (1.24–1.71), P:  < 0.001} are 13.3% and 45% more susceptible than those from SES (middle) and SES (high), respectively. Admission types (ADT) such as AS2, M2, NEWS, S2 and others {RR: 1.37–2.77 (1.25–6.19), P:  < 0.001} also have a high likelihood of contributing to ELOHS, while the distance to hospital (DTH) {RR: 0.64–0.75 (0.56–0.82), P:  < 0.001}, Charlson Score (CCI) {RR: 0.31–0.68 (0.22–0.99), P:  < 0.001–0.043} and some VMO specialties {RR: 0.08–0.69 (0.03–0.98), P:  < 0.001–0.035} have limited influence on ELOHS.


Relying on the preadmission assessment of ELOHS helps identify those patients who are susceptible to exceeding their expected length of stay on admission, thus making it possible to improve patients’ management and outcomes.


When a patient stays more than three times the average length of stay (LOS) for a given Diagnosis Related Group (DRG), the patient is said to have an Extended Length of Hospital Stay (ELOHS). It is therefore imperative that patients are managed effectively in hospital to prevent them from exceeding their expected length of stay, since doing so places additional financial burdens on hospitals, health insurers, and the government [6, 23, 24] and causes further health complications for patients due to nosocomial infections [22]. There is widespread variation in patients’ LOS in many public hospitals due to inefficiencies in understanding and managing patients effectively from admission to discharge. This has resulted in significant cost blowouts due to a lack of hospital bed spaces, with $125 M lost per annum servicing patients who overstay on admission in the State of Victoria, Australia [25]. Hospitals therefore need to seek ways of reducing ELOHS through better knowledge of patients’ clinical and psychosocial features, which may help identify high-risk patients and make it easier to provide appropriate care.

There are numerous studies on ELOHS and its associated risk factors. Burton et al. [21] predicted ELOHS for patients of percutaneous coronary intervention using multivariate Logistic Regression (LR) by taking Normal Length of Hospital Stay (NLOHS) as patients staying < 5 days and those staying ≥ 5 days as ELOHS and obtained a prediction accuracy (AUC) of 79.9–81.9%. Staziak et al. [16] obtained an accuracy of 80–81% for ELOHS prediction of torso trauma patients using clinical and image data with Support Vector Machine (SVM) and Artificial Neural Network (ANN) algorithms. Zhang et al. [20] predicted the prospects of ELOHS for adult spinal deformity patients undergoing posterior spinal fusion surgery to an accuracy of 68–83% using LR, Decision Tree classifier (DTC), Random Forest (RF), XGBoost (XGB), and Gradient Boosting Machine (GBM) by considering ELOHS as those spending > 9 days in the hospital. Zolbanin et al. [26] predicted the length of stay (LOS) for patients suffering from chronic obstructive pulmonary disease (COPD) and pneumonia with a deep neural network algorithm and obtained an accuracy of 86–91% for COPD and 74–85% for pneumonia.

Numerous studies on ELOHS have described ELOHS as a specific number of days in the hospital that corresponds to the 75th percentile of the studied cohorts [9, 16, 21] while others chose a particular number of days in the hospital as the limit for NLOHS for a combination of DRGs [30, 31, 35]. Unfortunately, the variability in the severity of health conditions with various DRGs makes it imperative to consider ELOHS as a DRG-specific definition requiring specific durations. This approach is used in this study to define ELOHS for the various DRGs considered.

Even though better management of patients can be crafted from numerous conditions that include the DRG, patients’ demographic and clinical information, and several psychosocial conditions [6,7,8,9], patients’ susceptibility to ELOHS must be known preadmission if they are to be better managed. This will allow the hospitals to develop requisite patient management plans ab initio and forestall using ineffective strategies that may lead to ELOHS. Unfortunately, many ELOHS and risk-factor prediction models did not consider hospital-specific factors and were not designed for preadmission. To this end, this study aims to utilize hospital-specific clinical and demographic features and documented psychosocial attributes of the patients to develop a machine learning technique for the ready prediction of ELOHS. The risk factors for ELOHS were determined based on the considered features to facilitate better patient management. The study relied on Recursive Feature Elimination with Cross-Validation and Extra Tree Classifier (RFECV-ETC) to predict ELOHS while using multivariate LR for estimating the risk factors and relative risk (RR) of ELOHS at a 95% confidence level.

The fact that ELOHS is linked to numerous unpleasant outcomes in hospitalization, such as decreased survival rate, increased time in the intensive care unit (ICU), increased number of hospital visits, preadmission comorbidities, infections, and complications [6, 8, 9, 33, 34], makes it imperative to understand patients’ ELOHS susceptibility preadmission. To this end, the strategy developed in this study should improve patients’ experience, because contingency plans for patients’ care that forestall prolonged hospital stay can be developed and implemented at admission. The contributions of this study can be summarized as:

  • Design and development of a predictive technique for understanding patients’ susceptibility to ELOHS preadmission, which allows for the implementation of best practices in patient care to forestall extended hospital stay.

  • Using a definitive description of ELOHS to identify patients on admission who may be at risk of extended stay rather than adopting a specific LOS as the boundary between ELOHS and normal LOS as exemplified by numerous researchers [9, 16, 20, 35].

  • The use of Recursive Feature Elimination with Cross-Validation and Extra Tree Classifier (RFECV-ETC) to help determine the optimal features that will contribute to ELOHS prediction without overfitting the model.

  • Provision of risk factors and relative risk categories of the hospital and patients’ demographic and psychosocial characteristics, which enables reflective practice on patients’ management that will be vital for reducing hospital-acquired complications and infections.

  • This study provides a better outlook for ELOHS by developing a strategy for understanding the proneness of all patients suffering from different DRGs to ELOHS on admission.


The ELOHS used in this study is defined as 3 × the average length of stay (ALOS) for a given DRG, following the Independent Hospital Pricing Authority (IHPA) standard national pricing model [39]. This specification is based on the resource use, diagnosis, and procedure coding for different DRGs and follows the “L3H3” trimming method and modifications that account for different adjustments according to episode severity. The advantage of defining the ELOHS with the “L3H3” is the ability to clearly describe a billing framework that equitably accounts for DRGs across hospitals nationally, thus creating room for clinical cost normalization across the hospitals. Consequently, patients, hospitals, and insurance companies are treated equitably when it comes to the cost of managing patients treated for different episodes.
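As an illustration of this definition, the following minimal Python sketch flags an episode as ELOHS when its LOS exceeds three times the ALOS of its DRG. The DRG codes, the ALOS values, and the function name `is_elohs` are hypothetical placeholders; they are not taken from the IHPA tables or from the study data.

```python
def is_elohs(los_days, drg, alos_by_drg, multiplier=3.0):
    """Return True when a stay exceeds `multiplier` times the ALOS of its DRG."""
    return los_days > multiplier * alos_by_drg[drg]


# Hypothetical ALOS values (days) per DRG, for illustration only.
alos_by_drg = {"DRG_A": 2.0, "DRG_B": 6.5}

# (DRG, LOS in days) pairs; thresholds are 6 days for DRG_A and 19.5 for DRG_B.
episodes = [("DRG_A", 5), ("DRG_A", 7), ("DRG_B", 19), ("DRG_B", 20)]
labels = [int(is_elohs(los, drg, alos_by_drg)) for drg, los in episodes]
print(labels)
```

Under these assumed values a 7-day stay is extended for one DRG while a 19-day stay is not for the other, which is the point of a DRG-specific threshold.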

This study develops a machine learning model for predicting patients prone to ELOHS preadmission using hospital, demographic, and psychosocial features. To ensure proper reporting that follows the prescribed benchmarks for modelling projects in medical informatics, the technique developed by Cabitza and Campagner [38] was adopted for self-assessment of the work. The breakdown of the procedure used for acquiring and pre-processing the data used for the study, the modelling strategy, and the statistical analysis approach for determining the risk factors of ELOHS are shown in the following sub-sections.

Data acquisition

De-identified patients’ records were obtained for separations between 10/2015 and 12/2020 from a private acute teaching hospital in Australia. Data were sourced from the Hospital Casemix Protocol (HCP) data extract routinely supplied to regulators from the patient administration data set. Initially, 91,468 samples comprising 73 features were extracted from the data set before the pre-processing that eliminates the features that have no relevance to the preadmission assessment of patients’ ELOHS. This process helped narrow down the features to the following hospital-specific parameters—visiting medical officer (VMO) specialty, patient age, patient gender, admission category (ADC), admission type, patient care class, Charlson Score, socioeconomic status (SES), and distance to hospital (DTH).

Data processing

Patient records with missing values were dropped from the dataset to avoid the impact that imputing missing values could have on the model accuracy. The features were also categorized into subclasses, while the postcodes were used for computing the SES and DTH. The Socio-Economic Indexes for Areas (SEIFA) from the Australian Bureau of Statistics (ABS) [28] were used for classifying the patients as low (1–4 decile), middle (5–7 decile), and high (8–10 decile) SES. The longitude and latitude of the hospital and the patient’s postcode provided the information for calculating DTH using the great circle distance model of the earth [29] for the Global Positioning System (GPS) location. To reduce the influence of outliers and extreme values in the model, VMO specialties with less than 100 samples were merged as a new VMO specialty class called VMO-others. The same process was used for admission types and patients’ religion by combining the classes with less than 100 samples to create new classes of the features. For Charlson Score, those with a score of 8 and above were merged to create a unique subclass (> 8). However, for admission category, the classes with less than 200 samples were merged to produce a new class. Since all independent features are categorical parameters, each sample is represented as “1” for a given feature subclass when the sample is affirmative for that subclass or “0” if the condition is negative. For instance, for the Patient Age subclass > 90 years, a sample row with a patient of age 45 years will have “0” while the row with patient age of 91 years will have “1”. The 10 features and their corresponding number of subclasses that are used in this study to model ELOHS are shown in Table 1.
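The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the haversine formula is one common great circle model (the text does not say which formula was used), the Earth radius constant and all function names are hypothetical, and the sample inputs are made up.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def ses_class(seifa_decile):
    """Map a SEIFA decile (1-10) to the low/middle/high SES classes."""
    if not 1 <= seifa_decile <= 10:
        raise ValueError("SEIFA decile must be between 1 and 10")
    if seifa_decile <= 4:
        return "low"
    if seifa_decile <= 7:
        return "middle"
    return "high"


def merge_rare(values, min_count=100, other="others"):
    """Pool classes with fewer than `min_count` samples into one class."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns, one per class."""
    classes = sorted(set(values))
    return classes, [[int(v == c) for c in classes] for v in values]


# Made-up inputs: one degree of longitude along the equator, a toy column.
print(round(great_circle_km(0.0, 0.0, 0.0, 1.0), 1))
print(one_hot(merge_rare(["a", "a", "a", "b"], min_count=2)))
```

The `min_count` argument would be 100 for VMO specialty, admission type, and religion, and 200 for admission category, following the thresholds given in the text.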

Table 1 Modelling features and their number of classes after pre-processing

Prediction of preadmission ELOHS

Predicting patients’ ELOHS preadmission involves determining the likelihood of a patient exceeding their expected stay by considering the hospital-specific and psychosocial features itemized in Table 1. This is done by training and testing a machine learning model using the pre-processed data. The first step is to balance the data using the Synthetic Minority Oversampling Technique (SMOTE) [27], which ensures that the target feature (ELOHS) has the same number of samples for those with extended stay (‘1’) and those whose stay was within expected limits (‘0’). SMOTE has the potential to reduce class imbalance problems, which include poor estimation of true positive and negative rates and model underfitting caused by the poor learning performance of algorithms [1, 2]. Once the imbalance problem is solved, the next step is to determine the best model for ELOHS by considering all the features shown in Table 1. All 99 sub-features of the features are considered at this stage. Some of the prominent algorithms used for solving health informatics problems relating to ELOHS and LOS are tried on the dataset by implementing tenfold cross-validation. Implementing cross-validation ensures that all the sections of the dataset are used for training and testing the model. This practice gives a better picture of the model’s performance because the accuracies of all the folds used in the training and testing exercise are considered in computing the accuracy of the model. The best performing algorithm is determined by comparing the accuracy measured as sensitivity, specificity, and F1-score. The best algorithm is later used as the base estimator for Recursive Feature Elimination with Cross-Validation (RFECV) in a stepwise fashion that considers different combinations of the features shown in Table 1. The algorithms tried at this stage include Extra Tree Classifier (ETC), XGBoost (XGB), Adaptive Booster (ADB), GBM, ANN, RF, Support Vector Machine (SVM), and DTC. Since the best algorithm for predicting ELOHS amongst the tried algorithms is ETC, this study relied on ETC as the base estimator for the RFECV. The next section shows the strategy for obtaining the optimal features for ELOHS prediction using the RFECV-ETC model.
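The balancing and model comparison steps can be sketched as follows. This is not the study’s code: it assumes scikit-learn and NumPy, uses a synthetic data set in place of the hospital records, compares only three of the eight algorithms, and replaces SMOTE with a simplified SMOTE-style interpolation (the hypothetical helper `smote_like`) so that the example has no further dependencies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier


def smote_like(X, y, minority=1, k=5, seed=0):
    """Balance a binary target by interpolating between a minority sample
    and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum()) - len(X_min)
    if n_new <= 0:
        return X, y
    # Ask for k + 1 neighbours: the nearest neighbour of a point is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_new, minority, dtype=y.dtype)])
    return X_bal, y_bal


# Synthetic stand-in data with roughly 12% positives (ELOHS = 1).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           weights=[0.88, 0.12], random_state=0)
X_bal, y_bal = smote_like(X, y)

models = {
    "ETC": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "DTC": DecisionTreeClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # tenfold CV
scores = {}
for name, model in models.items():
    result = cross_validate(model, X_bal, y_bal, cv=cv,
                            scoring=("recall", "precision", "f1"))
    scores[name] = {m: result[f"test_{m}"].mean()
                    for m in ("recall", "precision", "f1")}
    print(name, {m: round(v, 3) for m, v in scores[name].items()})
```

The sketch follows the order described in the text (balance first, then cross-validate); the scores it prints relate only to the synthetic data.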

Optimal features selection (OFS) with RFECV-ETC

The RFECV is a backward elimination method that starts with a full set of all features and then removes the most irrelevant features one by one based on the validation scores [3]. This process aims to get the optimal number of features that will result in the best model accuracy by eliminating the features that have not influenced the accuracy. The procedure for using RFECV-ETC for predicting the best features combinations and the sub-features to obtain optimal accuracy of ELOHS is shown in Algorithm 1.

Algorithm 1 Procedure for optimal features selection with RFECV-ETC
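For readers who want to experiment, a comparable procedure is available in scikit-learn. The sketch below is an assumption about how such a pipeline could be set up, not a transcription of Algorithm 1: it uses scikit-learn’s `RFECV` with `ExtraTreesClassifier` as the base estimator on synthetic data, and the sample size, number of trees, and scoring metric are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 12 candidate features, 5 of them informative.
X, y = make_classification(n_samples=400, n_features=12, n_informative=5,
                           n_redundant=0, random_state=0)

selector = RFECV(
    estimator=ExtraTreesClassifier(n_estimators=25, random_state=0),
    step=1,                       # drop one feature per elimination round
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("optimal number of features:", selector.n_features_)
print("selected feature indices:", selected)
```

Here the features are ranked by the importance scores of the fitted trees, and the cross-validated accuracy decides how many of the top-ranked features to keep.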

Risk factors estimation

The ELOHS risk factors are identified as a way of allowing the doctors and other health experts to know the high-risk patients who need specific attention because of their likelihood of extended stay on admission. Multivariate LR is used for computing the relative risk (RR) of ELOHS at a 95% confidence level with features having P-value ≤ 0.05 considered as significant risk factors. The appropriateness of the multivariate LR is assessed using accuracy computation that considered the true-positive, false-positive, true-negative, and false-negative predictions of the model for all the data, training data (70%), and testing data (30%). The computation of the risk factors and RR is based on the 10 features considered in the study.
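The text does not state how the RR and its interval were derived from the fitted LR model. A common approach, sketched below as an assumption, is to exponentiate each coefficient and its Wald interval; strictly speaking, the exponentiated coefficient of a logistic model is an odds ratio, which approximates the RR only when the outcome is uncommon. The coefficient and standard error used here are hypothetical.

```python
import math


def ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a logistic regression coefficient and its Wald interval.

    Returns (estimate, lower, upper); z = 1.96 gives an approximate 95% interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def is_significant(lower, upper):
    """A ratio differs from 1 at the chosen level when its interval excludes 1."""
    return lower > 1.0 or upper < 1.0


# Hypothetical coefficient and standard error, not taken from the study.
est, lo, hi = ratio_with_ci(beta=0.37, se=0.08)
print(f"{est:.2f} ({lo:.2f}-{hi:.2f}) significant={is_significant(lo, hi)}")
```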


Descriptive statistics of features

A total of 33,752 records are used from the initial 91,468 records after the pre-processing. ELOHS patients make up 11.54% of the records, and the rate increases with the age of the patients. For instance, patients who are ≥ 65 years are 491%, 251%, and 87%, respectively, more prone to ELOHS than those < 18 years, 18–40 years, and 40–65 years. Female patients have a slightly higher (2.12%) ELOHS rate than males (10.33%) while patients with Charlson score of 5–8 have a higher ELOHS rate than others. The longer the length of stay (LOS), the higher the rate of ELOHS. This is evident in the rate of ELOHS for health conditions that warranted > 20 days LOS, with a 71.15% ELOHS rate compared with other patients with ≤ 5 days LOS (ELOHS rate: 0.41%), 6–10 days LOS (ELOHS rate: 19.69%), and 11–20 days LOS (ELOHS rate: 47.28%). A summary of some of the features used in this study and the frequencies of the NLOHS and ELOHS are shown in Table 2.

Table 2 Summary of frequencies (%) of some of the features considered in the model

Prediction of ELOHS with RFECV-ETC

RFECV is a feature selection technique that uses a recursive process for ranking features according to their importance and uses elimination to exclude weak features, dependencies, and collinearities from a model to improve the prediction accuracy [40]. Through cross-validation, the optimal features for enhanced performance are identified after dropping the insignificant features that do not improve the model accuracy. As stated previously, eight algorithms (GBM, ETC, RF, XGB, ANN, DTC, ADB, and SVM) are considered in this study in the first instance to identify the one that is most appropriate for predicting ELOHS. The result of the tenfold cross-validation of SMOTE data measured with recall, precision, and F1-score is shown in Table 3 (see Appendix 1 for the characteristics of the various algorithms). Since ETC is the best performing algorithm, further analysis to determine the optimal features for predicting ELOHS is done with ETC as the base algorithm for RFECV.

Table 3 Comparison of prediction accuracy of ELOHS using tenfold cross-validation for ELOHS with all the features

Since ELOHS prediction accuracy may improve when only the features and sub-features that correlate well with the target feature are used, the stepwise comparison of the features using their sub-features in the RFECV-ETC algorithm is shown in Table 4 (the mean performance scores and the point of optimal features selection for the tenfold cross-validation of the trials can be viewed in Fig. 1).

Fig. 1 The mean performance scores and optimal features selection points for tenfold cross-validation of the RFECV-ETC algorithm for the numerous combinations of the features

The features considered in the optimal performance modelling are marked (√), whereas those that are not considered are marked (x). Each of the trials produced the optimal number of features for better ELOHS prediction. The modelling attempt (T#10), which considered the input features PAG, PGD, ADC, ADT, PCC, PRG, DTH, SES and CCI, produced the best accuracy of 89.3%. This prediction accuracy is comparatively higher than that of some previous prediction models for ELOHS [16, 20, 21].

Although the optimal feature selection points shown in Fig. 1 represent the optimal number of features that will guarantee the best prediction of ELOHS for the 10 features (99 sub-features) or less with a stepwise combination of the features (as shown in Table 4), there may be the need for a trade-off when the accuracy at the optimal features point increases minimally. For instance, in T#1, when 31 sub-features are used (Fig. 1, T#1), the accuracy was 85.34%, and when the optimal solution was found at 71 sub-features, the accuracy was 86%, which is an increase of 0.66 percentage points. Since the inclusion of an additional 40 parameters in the algorithm training will increase the computational cost, it may suffice to trade off the 0.66 extra percentage points of accuracy for fewer parameters, especially when the size of the data increases disproportionately, as expected when the algorithm is deployed into production.
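One way to make this trade-off explicit is to pick the smallest number of sub-features whose mean cross-validated accuracy lies within a chosen tolerance of the best accuracy. The sketch below uses only the two T#1 points quoted above; the function name and the tolerance values are hypothetical.

```python
def smallest_within_tolerance(scores_by_n, tolerance):
    """Return the smallest feature count whose accuracy is within
    `tolerance` percentage points of the best accuracy."""
    best = max(scores_by_n.values())
    return min(n for n, acc in scores_by_n.items() if best - acc <= tolerance)


# Number of sub-features -> mean accuracy (%) for T#1, as quoted in the text.
scores_by_n = {31: 85.34, 71: 86.00}

print(smallest_within_tolerance(scores_by_n, tolerance=0.0))  # 71
print(smallest_within_tolerance(scores_by_n, tolerance=0.7))  # 31
```

With no tolerance the rule returns the optimum itself; accepting a loss of up to 0.7 percentage points returns the 31 sub-feature model.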

The first 20 most important features of the best-performing trial model (T#10) are shown in Fig. 2. With a 2.00–7.36% relative importance of these features, they have 0.07–5.85 times more importance than the 21–40th important features of the model and 1.92–31.41 times better than the 41–59th most important features. Even though the less important features did not contribute much to the accuracy of the model, excluding them will reduce the prediction accuracy of ELOHS, and adding other features that are not part of the selected optimal features will also reduce the prediction accuracy.

Fig. 2 Summary of the 20 best features for the best predictive models (T#10)

Since the RFECV-ETC algorithm relies on tenfold cross-validation to predict patients’ ELOHS status, it suffices to show the performance accuracies of the target feature of the best performing model (T#10) as another way of ascertaining the model’s performance for the ELOHS and NLOHS patients. Table 5 summarizes the performance of the various folds in the cross-validation and how well ELOHS and NLOHS are predicted with RFECV-ETC. With an accuracy of 76.12–94.52% across the folds for both classes of the target feature (NLOHS and ELOHS) and almost even accuracy levels for NLOHS and ELOHS for each fold, the RFECV-ETC can be said to have properly modelled ELOHS since there is no substantial lapse in prediction accuracy at any section of the dataset.
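For reference, the per-fold measures reported in this section (recall, precision, accuracy, balanced accuracy) follow from the confusion matrix of each fold. The sketch below uses the standard definitions; the counts are hypothetical and do not come from the study.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for a binary target (ELOHS = positive class)."""
    recall = tp / (tp + fn)              # sensitivity: ELOHS correctly found
    specificity = tn / (tn + fp)         # NLOHS correctly found
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical confusion matrix counts for a single fold.
metrics = binary_metrics(tp=90, fp=10, tn=85, fn=15)
print({name: round(value, 3) for name, value in metrics.items()})
```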

Table 4 Summary of the recall, precision, accuracy, and the optimal features selection (OFS) obtained with RFECV of the various input features combinations represented as a trial number (T#), √: included, x: excluded, ACC: accuracy, BACC: balanced accuracy, AUC: area under the curve, RCL: recall, PRC: precision

Risk factors of ELOHS

Table 6 shows the risk factors and relative risk (RR) of ELOHS determined at a 95% confidence level. The reference sub-features for the multivariate LR are ADC (PL1), ADT (M), CCI (0), DTH (0–5 km), PAG (20–50 years), PGD (female), PRG (no religion), VMO (orthopaedic surgery), and SES (high). As expected, PAG is the predominant risk factor for ELOHS, with PAG (> 90) {RR: 1.85 (1.34–2.56), P:  < 0.001} having 6.32% more likelihood of ELOHS compared with PAG (80–90) {RR: 1.74 (1.34–2.38), P:  < 0.001} and being 23.3% more susceptible than PAG (70–80) {RR: 1.5 (1.1–2.05), P: 0.011}. Patients who are from ADC (US1) {RR: 3.64 (3.09–4.28), P:  < 0.001} are 14.8% and 70.5%, respectively, more prone to ELOHS compared to ADC (UC1) {RR: 3.17 (2.82–3.55), P:  < 0.001} and ADC (EMG) {RR: 2.11 (1.93–2.31), P:  < 0.001}. However, patients who fell into the ADC (others) {RR: 4.11 (2.71–6.24), P:  < 0.001} are 12.9% and 94.8%, respectively, more likely to have ELOHS compared to those admitted under the ADC (US1) and ADC (EMG) categories. Patients from SES (low) {RR: 1.45 (1.24–1.71), P:  < 0.001} are 13.3% more likely to have ELOHS compared to those from SES (middle) {RR: 1.28 (1.1–1.5), P: 0.002} and 45% more likely than SES (high), used as the reference for the SES categories. The remaining risk factors such as CCI, DTH and some VMO specialties such as breast surgery, cardiology, endocrine surgery, etc., have a limited likelihood of influencing ELOHS since their RR are < 1.
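The text does not spell out how the “% more likely” figures were obtained. The short sketch below assumes they are ratios of two RR values that share the same reference class; under that assumption it reproduces the age and SES comparisons quoted above. The function name is hypothetical.

```python
def percent_more(rr_a, rr_b):
    """Percentage by which rr_a exceeds rr_b (both against one reference)."""
    return (rr_a / rr_b - 1) * 100


print(round(percent_more(1.85, 1.74), 2))  # PAG (> 90) vs PAG (80-90)
print(round(percent_more(1.85, 1.50), 1))  # PAG (> 90) vs PAG (70-80)
print(round(percent_more(1.45, 1.28), 1))  # SES (low) vs SES (middle)
print(round(percent_more(1.45, 1.00), 1))  # SES (low) vs SES (high), the reference
```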

Table 5 Prediction accuracy of ELOHS and NLOHS with RFECV-ETC algorithm for the best model (T#10)

The accuracy of the multivariate LR used for estimating the risk factors and RR of the features is computed with ROC AUC shown in Fig. 3.

Fig. 3 Receiver operating characteristic (ROC) area under the curve (AUC) for the multivariate logistic regression used for determining the risk factors and relative risks

Following the information in Table 6, the severity of the risk factors of ELOHS is grouped as patient features, DRG specialty and hospital-based features in Table 7. For the patients’ features, PAG (> 90) is the most profound risk factor, with 6.32–2983% more likelihood of contributing to ELOHS compared to the other risk factors shown in Table 7. The risk factors associated with ADC (US1, others, UC1, EMG) have higher risk severities than the rest of the hospital-based features contributing to ELOHS. The risks of ELOHS associated with the various DRGs are comparatively lower than those associated with patients and hospital-based features and may have less likelihood of triggering ELOHS for patients treated for different health conditions.

Table 6 Summary of the risk factors of ELOHS for all patients showing the relative risks (RR) and P-values obtained from a multivariate logistic model (* are significant features at 95% level)
Table 7 Risk severity of the various risk factors of ELOHS (NB: all features are computed at 95% significance level; ** are significant at 90% significance level)


This study develops a preadmission assessment for patients admitted to a private acute teaching hospital to predict those that are prone to ELOHS and identify the risk factors of ELOHS to enable hospitals to proactively plan their care. By relying on hospital-specific features and patients’ demographic and psychosocial characteristics, it was possible to develop a machine learning algorithm for reasonably identifying patients that will exceed their expected length of stay on admission. It can be inferred from the study that the risk of ELOHS is very predominant for patients who are 60 years and over and those that have been treated for infectious disease and neurological conditions. Even though some other conditions considered in the study such as neurosurgery, upper GI surgery, urology, cardiology, etc. (see VMO specialty in Table 6) are also linked to ELOHS because they are significant at a 95% confidence level, the fact that the RR of these factors is small (< 1) means that their tendency of causing ELOHS is minimal compared to those with RR values > 1.

The influence of age on ELOHS is pronounced, as the rate of ELOHS increases with the age of the patients. This finding agrees with other researchers, who also attributed ELOHS to SES, which is a risk factor in this study [6,7,8,9]. The risk of ELOHS is pronounced with patients who are ≥ 80 years judging from their RR of > 1, but previous research showed that those who are prone to ELOHS are patients who are ≥ 65 years old for patients of elective anterior cervical discectomy and fusion [7]. It is important to note that the vulnerability of the elderly can be attributed to hospital-acquired infections and other complications in hospitals [17]. This and other factors combine to cause complications, which result in a higher likelihood of ELOHS amongst the elderly [6]. This situation has resulted in higher hospital costs, shortage of hospital space, and poor-quality penalties imposed on hospitals [18, 19]. There are significant incentives for hospitals to improve patients’ outcomes through quality care that will reduce hospital-acquired infections from endogenous and procedure-related risk factors [10, 11].

Previous studies linked Charlson score, which can be a good gauge of a patient’s comorbidity predisposition [12], to ELOHS. For this study, Charlson scores of 1–6 are risk factors of ELOHS; however, since the relative risk (RR: 0.04–0.09; P-value:  < 0.001) of the patients is < 1, there is a higher likelihood they may not exceed their expected stay on admission. Despite the impact of Charlson score on ELOHS, it is also linked to unplanned readmission due to the severity of comorbidities [13]. This also suggests that the current DRG models successfully account for patient complexity.

The prediction accuracy of the ELOHS model, 89.3%, is comparatively higher than the accuracy obtained by other researchers [9, 16, 20, 35], even though it may be difficult to justify some of their techniques for defining ELOHS. This is because some of the patients who were classified as exceeding their expected length of stay because they spent 3, 4, 5, 9, or 11 days in the hospital under those models may not have exceeded their expected length of stay when assessed by DRG per the technique described in this study. Even though most of the studies reported on specific disease conditions [9, 21, 35], the current study painted a better picture of ELOHS by taking a comprehensive look at patients in the acute care setting. This approach gives the hospital a better tool for an immediate decision on requisite patients’ management plans to forestall complications that will result in ELOHS. Again, it is important to state that some of the ELOHS contributing features investigated by many previous researchers such as surgical approach, preoperative functional status, and patients’ anaesthetic history [36, 37] may not suffice for preadmission assessment of ELOHS.

The core limitation of this study is the few features considered. There is a need to consider more demographic and psychosocial features such as ethnicity, level of education, marital status, and the comorbidities suffered by these patients, as they have the potential to influence ELOHS. There is also the need to increase the data size while looking at a narrowed classification of similar DRGs to facilitate better accuracy of the prediction model. Again, the reliance on only 10 hospital-specific and psychosocial features for the analysis may not suffice. Other important features that can contribute to ELOHS, such as the kind of procedure adopted for surgical patients, will be vital for consideration in future studies.


To ensure that patients who are prone to ELOHS are given appropriate, tailored care when admitted to the hospital, a technique for preadmission assessment with hospital-specific and psychosocial features is developed in this study using hospital records. By relying on the RFECV-ETC algorithm, which uses the backward elimination technique with ETC as the base estimator, it was possible to develop a model that predicted patients expected to have ELOHS. The study relied on SMOTE oversampling, tenfold cross-validation, and features such as VMO specialty, patient age, patient gender, admission category, admission type, patient care class, patient religion, distance to hospital, SES, and Charlson score. After 12 trials of different combinations of the features, the model with the best accuracy predicted ELOHS to 89.3% accuracy, 89.4% recall, 89.4% precision, and identified 61 optimal sub-features for ELOHS prediction.

Since the knowledge of the risk factors of ELOHS is vital for developing strategies for better care outcomes, multivariate LR was used for estimating the risk factors of ELOHS at a 95% confidence level and the relative risk of the risk factors. The risk of ELOHS increases with age, ADC (EMG, UC1, US1 and others), ADT (M2, AS2, NEW, S, S2, others), SES (low, middle), etc., while the VMO specialties have limited likelihood of increasing ELOHS despite many of them being risk factors of ELOHS. Prediction of ELOHS before admitting the patients and understanding the risk factors of ELOHS will make patient management better because of the increased likelihood of implementing adequate and person-centred treatments.

Summary points

What is known

  • Extended length of hospital stay has been treated as a LOS more than a given number of days of hospitalization such as 4, 5, 6, 9, etc.

  • Charlson scores affect LOS in hospitalization.

  • ELOHS is computed for patients already admitted to the hospital using clinical characteristics.

  • ELOHS has been studied for specific disease conditions rather than across different DRGs.

What was found

  • Extended length of hospital stay (ELOHS) was determined as 3 × the average LOS for a given DRG.

  • ELOHS can be predicted preadmission to 89.3% accuracy with RFECV-ETC.

  • Infectious diseases and neurology patients have a very high likelihood of ELOHS compared to patients admitted for other DRGs.

  • The rate of ELOHS amongst patients increases with age, with patients who are ≥ 80 years having a higher likelihood of ELOHS than younger patients.

  • Charlson score is a risk factor of ELOHS, but has a limited likelihood of causing ELOHS due to the low relative risk (< 1.0).
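The DRG-specific definition of ELOHS in the findings above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's code: it assumes the average LOS is computed from the records themselves and that a stay counts as ELOHS when its LOS is strictly greater than 3 × that average. The function name `flag_elohs`, the DRG labels and the LOS values are hypothetical.

```python
from collections import defaultdict


def flag_elohs(records, multiplier=3.0):
    """Label each (drg, los_days) record as ELOHS if LOS > multiplier x the DRG mean."""
    totals = defaultdict(lambda: [0.0, 0])  # drg -> [sum of LOS, record count]
    for drg, los in records:
        totals[drg][0] += los
        totals[drg][1] += 1
    mean_los = {drg: total / count for drg, (total, count) in totals.items()}
    return [(drg, los, los > multiplier * mean_los[drg]) for drg, los in records]


# Hypothetical records: (DRG label, LOS in days).
records = [("A", 2)] * 9 + [("A", 22)] + [("B", 5), ("B", 6), ("B", 7)]
flagged = flag_elohs(records)
print([row for row in flagged if row[2]])
```

In this toy data the mean LOS for DRG "A" is 4 days, so the threshold is 12 days and only the 22-day stay is flagged. The same 22-day stay would not be flagged in a DRG whose average LOS is 8 days, which is the point of making the threshold DRG-specific rather than a fixed number of days.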

Availability of data and materials

Can be made available on request.





  • Adaptive booster
  • Admission category
  • Emergency (via ED)
  • Unscheduled community
  • Unscheduled (usually from the specialist’s rooms)
  • Includes CA1 (admissions medical (PAL, Rehab)), NEW (newborns), STA/STD (statistical admissions/discharges)
  • Admission type
  • Medical 2
  • Medical 3
  • Surgical 2
  • Advanced surgical
  • Advanced surgical 2
  • Advanced surgical 3
  • Advanced surgical 4
  • Normally within the admission category group
  • Overnight Band 3 (usually used for endoscopy patients)
  • Obstetrics deliveries meant to be used for C section
  • Obstetrics meant to be used for vaginal deliveries
  • Number codes, HITH, day procedures, ON exception procedures
  • Average length of stay
  • Artificial Neural Network
  • Area under the curve
  • Balanced accuracy
  • Charlson Score
  • Confidence interval
  • Chronic obstructive pulmonary disease
  • Diagnosis related group
  • Decision tree classifier
  • Distance to hospital
  • Extended length of stay
  • Extra tree classifier
  • Gradient boosting machine
  • Global Positioning System
  • Intensive care unit
  • Length of stay
  • Logistic regression
  • Normal length of hospital stay
  • Optimal features selection
  • Patient age
  • Patient care class
  • Unqualified new born
  • Palliative care
  • Patient gender
  • Patient religion
  • Random Forest
  • Recursive Feature Elimination with Cross-Validation and Extra Tree Classifier
  • Recursive Feature Elimination with Cross-Validation
  • Relative risk
  • Socio-economic indexes for areas
  • Socioeconomic status
  • Synthetic minority oversampling technique
  • Support vector machine
  • Visiting medical officer




  1. Ertekin S, Huang J, Giles CL. Active learning for class imbalance problem. In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007. p. 823–824.

  2. Ali A, Shamsuddin SM, Ralescu AL. Classification with class imbalance problem. Int J Advance Soft Comput Appl. 2013;5:3.

    Google Scholar 

  3. Wang C, Xiao Z, Wu J. Functional connectivity-based classification of autism and control using SVM-RFECV on rs-fMRI data. Physica Med. 2019;65:99–105.

    Article  Google Scholar 

  4. Marée R, Wehenkel L, Geurts P. Extremely randomized trees and random subwindows for image classification, annotation, and retrieval. In: Decision Forests for Computer Vision and Medical Image Analysis. Springer, London. 2013. p. 125–41.

  5. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3–42.

    Article  Google Scholar 

  6. Koo AB, Elsamadicy AA, Lin IH, David WB, Sujijantarat N, Santarosa C, Cord BJ, Zetchi A, Hebert R, Bahrassa F, Malhotra A. Predictors of extended length of stay following treatment of unruptured adult cerebral aneurysms: a study of the national inpatient sample. J Stroke Cerebrovasc Dis. 2020;29(11): 105230.

    Article  Google Scholar 

  7. Dial BL, Esposito VR, Danilkowicz R, O’Donnell J, Sugarman B, Blizzard DJ, Erickson ME. Factors associated with extended length of stay and 90-day readmission rates following ACDF. Global Spine J. 2020;10(3):252–60.

    Article  Google Scholar 

  8. Deister J, Cothern BG, Williams C, Froehle AW, Laughlin RT. Factors predicting length of hospital stay and extended care facility admission after hindfoot arthrodesis procedures. J Foot Ankle Surg. 2017;56(4):805–12.

    Article  Google Scholar 

  9. Passias PG, Jalai CM, Worley N, Vira S, Hasan S, Horn SR, Segreto FA, Bortz CA, White AP, Gerling M, LaFage V. Predictors of hospital length of stay and 30-day readmission in cervical spondylotic myelopathy patients: an analysis of 3057 patients using the ACS-NSQIP database. World Neurosurg. 2018;110:e450–8.

    Article  Google Scholar 

  10. Spagnolo AM, Ottria G, Amicizia D, Perdelli F, Cristina ML. Operating theatre quality and prevention of surgical site infections. J Prev Med Hyg. 2013;54(3):131.

    CAS  PubMed  PubMed Central  Google Scholar 

  11. Sephton BM, Bakhshayesh P, Edwards TC, Ali A, Singh VK, Nathwani D. Predictors of extended length of stay after unicompartmental knee arthroplasty. J Clin Orthopaed Trauma. 2020;11:S239–45.

    Article  CAS  Google Scholar 

  12. De Groot V, Beckerman H, Lankhorst GJ, Bouter LM. How to measure comorbidity: a critical review of available methods. J Clin Epidemiol. 2003;56(3):221–9.

    Article  Google Scholar 

  13. Lin Y, Yang C, Chu H, Wu J, Lin K, Shi Y, Wang H, Kong G, Zhang L. Association between the Charlson Comorbidity Index and the risk of 30-day unplanned readmission in patients receiving maintenance dialysis. BMC Nephrol. 2019;20(1):1–8.

    Article  Google Scholar 

  14. Lakomkin N, Kothari P, Dodd AC, VanHouten JP, Yarlagadda M, Collinge CA, Obremskey WT, Sethi MK. Higher Charlson Comorbidity Index scores are associated with increased hospital length of stay after lower extremity orthopaedic trauma. J Orthop Trauma. 2017;31(1):21–6.

    Article  Google Scholar 

  15. Liu H, Song B, Jin J, Liu Y, Wen X, Cheng S, Nicholas S, Maitland E, Wu X, Zhu D. Length of stay, hospital costs and mortality associated with comorbidity according to the Charlson comorbidity index in immobile patients after ischemic stroke in China: a national study. Int J Health Policy Manag. 2021;23:8.

    Google Scholar 

  16. Staziaki PV, Wu D, Rayan JC, de Oliveira Santo ID, Nan F, Maybury A, Gangasani N, Benador I, Saligrama V, Scalera J, Anderson SW. Machine learning combining CT findings and clinical parameters improves prediction of length of stay and ICU admission in torso trauma. In: European Radiology. 2021; p.1–8.

  17. Avci M, Ozgenc O, Coskuner SA, Olut AI. Hospital acquired infections (HAI) in the elderly: comparison with the younger patients. Arch Gerontol Geriatr. 2012;54(1):247–50.

    Article  Google Scholar 

  18. Hendy P, Patel JH, Kordbacheh T, Laskar N, Harbord M. In-depth analysis of delays to patient discharge: a metropolitan teaching hospital experience. Clin Med. 2012;12(4):320.

    Article  CAS  Google Scholar 

  19. Bo M, Fonte G, Pivaro F, Bonetto M, Comi C, Giorgis V, Marchese L, Isaia G, Maggiani G, Furno E, Falcone Y. Prevalence of and factors associated with prolonged length of stay in older hospitalized medical patients. Geriatr Gerontol Int. 2016;16(3):314–21.

    Article  Google Scholar 

  20. Zhang AS, Veeramani A, Quinn MS, Alsoof D, Kuris EO, Daniels AH. Machine learning prediction of length of stay in adult spinal deformity patients undergoing posterior spine fusion surgery. J Clin Med. 2021;10(18):4074.

    Article  Google Scholar 

  21. Burton BN, Abudu B, Danforth DJ, Patell S. Prediction model for extended hospital stay among medicare beneficiaries after percutaneous coronary intervention. J Cardiothorac Vasc Anesth. 2019;33(11):3035–41.

    Article  Google Scholar 

  22. Benenson S, Cohen MJ, Schwartz C, Revva M, Moses AE, Levin PD. Is it financially beneficial for hospitals to prevent nosocomial infections? BMC Health Serv Res. 2020;20(1):1–9.

    Article  Google Scholar 

  23. Cai Y, Zhu M, Sun W, Cao X, Wu H. Study on the cost attributable to central venous catheter-related bloodstream infection and its influencing factors in a tertiary hospital in China. Health Qual Life Outcomes. 2018;16(1):1–6.

    Article  Google Scholar 

  24. Stone PW. Economic burden of healthcare-associated infections: an American perspective. Expert Rev Pharmacoecon Outcomes Res. 2009;9(5):417–22.

    Article  Google Scholar 

  25. Victorian Auditor-General’s Office (VAGO) Report, 2016, Hospital Performance- Length of Stay. Accessed 21 Sept 2021.

  26. Zolbanin HM, Davazdahemami B, Delen D, Zadeh AH. Data analytics for the sustainable use of resources in hospitals: predicting the length of stay for patients with chronic diseases. In: Information & Management. 2020; p.103282.

  27. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.

    Article  Google Scholar 

  28. ABS (Australian Bureau of Statistics), An introduction to socio-economic indexes for areas (SEIFA), Commonwealth of Australia, ABS, Canberra, 2011. Accessed 26 May 2021.

  29. Kifana BD, Abdurohman M. Great circle distance method for improving operational control system based on gps tracking system. Int J Computer Sci Eng. 2012;4(4):647.

    Google Scholar 

  30. Ofori-Asenso R, Liew D, Mårtensson J, Jones D. The frequency of, and factors associated with prolonged hospitalization: a multicentre study in Victoria. Aust J Clin Med. 2020;9(9):3055.

    Article  Google Scholar 

  31. Cohen ME, Bilimoria KY, Ko CY, Richards K, Hall BL. Variability in length of stay after colorectal surgery: assessment of 182 hospitals in the national surgical quality improvement program. Ann Surg. 2009;250(6):901–7.

    Article  Google Scholar 

  32. Yıldırım AA, Özdoğan C, Watson D. Parallel data reduction techniques for big datasets. In: Big Data: Concepts, Methodologies, Tools, and Applications. IGI Global; 2016. p. 734–56.

  33. Ho KJ, Madenci AL, McPhee JT, Semel ME, Bafford RA, Nguyen LL, Ozaki CK, Belkin M. Contemporary predictors of extended postoperative hospital length of stay after carotid endarterectomy. J Vasc Surg. 2014;59(5):1282–90.

    Article  Google Scholar 

  34. Storey A, MacDonald B, Rahman MA. The association between preoperative length of hospital stay and deep sternal wound infection: A scoping review. Aust Crit Care. 2021;34(6):620–33.

    Article  Google Scholar 

  35. Gabriel RA, Sharma BS, Doan CN, Jiang X, Schmidt UH, Vaida F. A predictive model for determining patients not requiring prolonged hospital length of stay after elective primary total hip arthroplasty. Anesth Analg. 2019;129(1):43–50.

    Article  Google Scholar 

  36. Den Hartog YM, Mathijssen NMC, Hannink G, Vehmeijer SBW. Which patient characteristics influence length of hospital stay after primary total hip arthroplasty in a ‘fast-track’ setting? Bone Jt J. 2015;97(1):19–23.

    Article  Google Scholar 

  37. Dall, G.F., Ohly, N.E., Ballantyne, J.A. and Brenkel, I.J., 2009. The influence of pre-operative factors on the length of in-patient stay following primary total hip replacement for osteoarthritis: a multivariate analysis of 2302 patients. The Journal of bone and joint surgery. British volume, 91 (4), pp.434–440.

  38. Cabitza F, Campagner A. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int J Med Inform. 2021.

    Article  PubMed  Google Scholar 

  39. Independent Hospital Pricing Authority -IHPA (2020), National Pricing Model Technical Specifications 2020–21. Accessed 13 Dec 2021.

  40. Chen XW, Jeong JC. December. Enhanced recursive feature elimination. In: Sixth International Conference on Machine Learning and Applications (ICMLA 2007). 2007. p. 429–35).

Download references


The authors wish to thank the Digital Health Cooperative Research Centre (DHCRC) and Cabrini Health, Melbourne, Australia, for their financial support and other contributions to this research.



Author information

Authors and Affiliations



Conceptualization of idea: David Rankin, Chinedu O. Design and development: CO, DR, NW. Writing and review of results: CO, DR. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chinedu I. Ossai.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

We affirm that a high standard has been maintained in the design and development of the technique that led to the results we presented in this paper and so take full responsibility for the content.

Competing interests

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix I


Characteristics of algorithms




KNeighborsClassifier (algorithm = 'auto', leaf_size = 30, metric = 'minkowski', metric_params = None, n_jobs = None, n_neighbors = 5, p = 2, weights = 'uniform')


Gradient Boosting Classifier (ccp_alpha = 0.0, criterion = 'friedman_mse', init = None, learning_rate = 0.1, loss = 'deviance', max_depth = 3, max_features = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 100, n_iter_no_change = None, presort = 'deprecated', random_state = None, subsample = 1.0, tol = 0.0001, validation_fraction = 0.1, verbose = 0, warm_start = False)


Ada Boost Classifier (algorithm = 'SAMME.R', base_estimator = None, learning_rate = 1.0, n_estimators = 50, random_state = None)


Extra Trees Classifier (bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'gini', max_depth = None, max_features = 'auto', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None, oob_score = False, random_state = None, verbose = 0, warm_start = False)


SVC (C = 1.0, break_ties = False, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape = 'ovr', degree = 3, gamma = 'scale', kernel = 'rbf', max_iter = -1, probability = False, random_state = None, shrinking = True, tol = 0.001, verbose = False)


XGBClassifier (base_score = 0.5, booster = 'gbtree', colsample_bylevel = 1, colsample_bynode = 1, colsample_bytree = 1, gamma = 0, gpu_id = -1, importance_type = 'gain', interaction_constraints = '', learning_rate = 0.300000012, max_delta_step = 0, max_depth = 6, min_child_weight = 1, missing = nan, monotone_constraints = '()', n_estimators = 100, n_jobs = 4, num_parallel_tree = 1, objective = 'binary:logistic', random_state = 0, reg_alpha = 0, reg_lambda = 1, scale_pos_weight = 1, subsample = 1, tree_method = 'exact', use_label_encoder = True, validate_parameters = 1, verbosity = None)


Random Forest Classifier (bootstrap = True, ccp_alpha = 0.0, class_weight = None, criterion = 'gini', max_depth = None, max_features = 'auto', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 100, n_jobs = None, oob_score = False, random_state = None, verbose = 0, warm_start = False)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit the Creative Commons website. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Ossai, C.I., Rankin, D. & Wickramasinghe, N. Preadmission assessment of extended length of hospital stay with RFECV-ETC and hospital-specific data. Eur J Med Res 27, 128 (2022).



  • Extended length of hospital stay
  • Recursive feature elimination
  • Extra tree classifier
  • Multivariate logistic regression
  • Admission risk