Development and validation of a competitive risk model in patients with rectal cancer: based on SEER database

Background Rectal cancer is one of the most common malignancies. To predict the specific mortality risk of rectal cancer patients, we constructed a predictive nomogram based on a competing risk model. Methods The information on rectal cancer patients was extracted from the SEER database. Traditional survival analysis and specific death analysis were performed separately on the data. Results The present study included 23,680 patients, with 16,580 in the training set and 7100 in the validation set. The specific mortality rate calculated by the competing risk model was lower than that of the traditional survival analysis. Age, Marriage, Race, Sex, ICD-O-3Hist/Behav, Grade, AJCC stage, T stage, N stage, Surgery, Examined LN, RX SUMM-SURG OTH, Chemotherapy, CEA, Deposits, Regional nodes positive, Brain, Bone, Liver, Lung, Tumor size, and Malignant were independent influencing factors of specific death. The overall C statistic of the model in the training set was 0.821 (Se = 0.001), and the areas under the ROC curve for cancer-specific survival (CSS) at 1, 3, and 5 years were 0.842, 0.830, and 0.812, respectively. The overall C statistic of the model in the validation set was 0.829 (Se = 0.002), and the areas under the ROC curve for CSS at 1, 3, and 5 years were 0.851, 0.836, and 0.813, respectively. Conclusions The predictive nomogram based on a competing risk model for time-specific mortality in patients with rectal cancer has very desirable accuracy. Thus, the application of the predictive nomogram in clinical practice can help physicians make clinical decisions and follow-up strategies.

The distal third of the rectum is drained by systemic veins instead of the portal venous system, which increases the likelihood of direct lung metastases [9]. Compared with colon cancer, rectal cancer (RC) invades the pelvis more easily and carries a higher risk of lymph node metastasis at each T stage; the complex anatomy makes surgery difficult, and the postoperative recurrence rate is high. Currently, treatments for RC include local or extensive surgery, preoperative radiotherapy and systemic therapy, local ablation of metastases, palliative chemotherapy, targeted therapy, and immunotherapy [10]. The selection of treatments is based on the tumor's clinicopathological characteristics, such as the degree of differentiation and the site of metastasis [11,12]. RC survival is closely related to clinical decisions based on staging information. The 5-year disease-free survival (DFS) and overall survival (OS) of patients with RC were 70.1% (95% CI: 60.7%-73.0%) and 76.2% (95% CI: 73.8%-78.5%) with open surgery, and 72.2% (95% CI: 69.4%-74.8%) and 72.7% (95% CI: 69.8%-75.3%) with laparoscopic surgery, respectively [13]. Compared with long-course chemoradiotherapy, both total neoadjuvant therapy with long-course chemoradiotherapy [OR = 1.78 (95% CI: 1.43-2.26)] and total neoadjuvant therapy with short-course radiotherapy [OR = 1.75 (95% CI: 1.23-2.50)] improved the pathological complete response rate; neoadjuvant chemotherapy increased DFS, but did not significantly prolong OS [14]. At present, TNM staging is the primary reference in clinical practice. Other potential prognostic factors, such as age, sex, marital status, and region, are also strongly associated with prognosis [15]. Therefore, it is necessary to explore a more comprehensive prediction method. Clinical prediction models offer a more personalized approach that accounts for these factors.
In clinical practice, the risk of cause-specific mortality is more informative than the traditional (all-cause) mortality risk because some deaths are unforeseeable. Specifically, the causes of death in RC patients are diverse, and a large number of deaths are not caused by RC, as RC patients are often in poor functional status or have comorbidities [16]. Deaths from other diseases, such as cardiovascular disease, cannot be foreseen from the characteristics of the tumor. In addition, because competing risks are common among the diverse causes of death in patients with RC, traditional survival analysis methods, which treat deaths from other causes as censored observations, may overestimate the risk of the event of interest [17]. Therefore, traditional survival analysis methods are unsuitable for processing data containing competing outcomes. A competing risk model is more suitable for dealing with data containing competing events and can produce more accurate results [17].
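The direction of this bias can be shown with a short derivation, given here as a sketch for the continuous-time case with two causes of death. The symbols $\lambda_1$, $\lambda_2$, $S$, $S_1$, and $F_1$ are notation introduced for illustration only and are not taken from the original analysis.

```latex
% \lambda_1(u): cause-specific hazard of RC death (assumed notation)
% \lambda_2(u): hazard of death from other causes (assumed notation)
\begin{align}
S(t)   &= \exp\!\Big(-\int_0^t \big[\lambda_1(u) + \lambda_2(u)\big]\,du\Big)
       && \text{(survival from any cause of death)} \\
S_1(t) &= \exp\!\Big(-\int_0^t \lambda_1(u)\,du\Big)
       && \text{(target of Kaplan-Meier when other deaths are censored)} \\
F_1(t) &= \int_0^t S(u)\,\lambda_1(u)\,du
       && \text{(cumulative incidence of RC death)} \\
1 - S_1(t) &= \int_0^t S_1(u)\,\lambda_1(u)\,du \;\ge\; F_1(t)
       && \text{because } S_1(u) \ge S(u).
\end{align}
```

Equality holds only when $\lambda_2 \equiv 0$, that is, when no patient dies of other causes; otherwise the complement of the Kaplan-Meier estimate exceeds the cumulative incidence of RC-specific death.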
Since most patients with RC have a long survival period, identifying the risk of specific death early, in the presence of competing events, is necessary to facilitate the development of targeted rehabilitation programs and follow-up plans, thereby reducing the risk of specific death in patients. Although various prognostic models, especially those based on radiomics, have been established in recent years, predictive models used in clinical practice should be highly interpretable and applicable. Therefore, we conducted this study based on the SEER database to construct an interpretable nomogram for the early prediction of RC-specific mortality risk.

Data sources and extractions
The SEER database, which comprises 18 cancer registries, is the National Cancer Institute's collection of demographic statistics on all cancers diagnosed in representative geographic regions and subpopulations. We collected patients' demographic information, clinicopathological data, and follow-up data from the database. Because the SEER database is publicly available and the patient data are de-identified, ethical approval and informed consent were not required for our study. Our methodology strictly follows the rules of the SEER database.

Construction and verification of competitive risk model
We developed a competing risk model by defining survivors and deaths from unknown causes as censored observations, deaths from RC as the events of interest, and deaths from other causes as competing events. Before building the model, we randomly split the data into training and validation sets at a ratio of 70% to 30%. A uniform random number between 0 and 1 was assigned to each case, and the cases were then ranked according to their random numbers. Cases with a random number below the 70th percentile were assigned to the training set, and the remaining cases were assigned to the validation set. Univariate and multivariate analyses were performed in the training set to screen the independent influencing factors of RC-specific death, and RC-specific death nomograms were constructed according to these factors. The concordance index (C-index) and the area under the receiver operating characteristic curve (AUC) were used to measure the discrimination of the prediction nomogram, and the calibration curve was used to measure its calibration.
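The event coding and the random split described in this subsection can be sketched as follows. This is a minimal illustration and not the code used in the study: the status labels, the function names `code_event` and `split_70_30`, and the seed are hypothetical, and the actual SEER field values differ.

```python
import random

# Hypothetical event codes for the competing risk analysis.
CENSORED, CANCER_DEATH, OTHER_DEATH = 0, 1, 2

def code_event(vital_status, cause_of_death):
    """Map one patient's outcome to a competing-risk event code."""
    if vital_status == "alive" or cause_of_death == "unknown":
        return CENSORED        # survivors and deaths of unknown cause
    if cause_of_death == "rectal cancer":
        return CANCER_DEATH    # event of interest
    return OTHER_DEATH         # competing event

def split_70_30(case_ids, seed=2023):
    """Assign each case a uniform(0, 1) number, rank the cases by it,
    and cut the ranking at the 70th percentile."""
    rng = random.Random(seed)
    draws = {cid: rng.random() for cid in case_ids}
    ranked = sorted(case_ids, key=draws.get)
    cut = round(0.7 * len(ranked))
    return ranked[:cut], ranked[cut:]

train_ids, valid_ids = split_70_30(list(range(1000)))
print(len(train_ids), len(valid_ids))
```

Fixing the seed makes the split reproducible, which matters when the same partition must be reused for model fitting and validation.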

Statistical analysis
Continuous variables were tested for normal distribution using the Shapiro-Wilk test. Normally distributed continuous variables were expressed as mean ± standard deviation (SD), and an independent samples t-test was used to compare the means of the two groups. For continuous variables with skewed distribution, the Mann-Whitney U test was performed for comparisons between groups. Count data were expressed as frequency (%), and the chi-square test was used for comparisons between groups. P < 0.05 was considered statistically significant. R 4.2.0 was used for modeling, while SPSS 2.3.0 was used for statistical analysis.
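As an illustration of the chi-square comparison of count data, the following is a minimal sketch for a 2 x 2 table. The function name `chi_square_2x2` and the counts are hypothetical and are not taken from Table 1; the closed-form P value used here is valid only for one degree of freedom, so larger tables would need a statistics package.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2 x 2 table.

    Returns (statistic, p_value). With 1 degree of freedom the P value has
    the closed form erfc(sqrt(statistic / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts: rows = two groups, columns = a binary characteristic.
toy_table = [[10, 20], [20, 10]]
stat, p = chi_square_2x2(toy_table)
print(f"chi-square = {stat:.3f}, P = {p:.4f}")
```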

Patient characteristics
A total of 23,680 RC patients were included in the analysis. Patients were randomly assigned to the training cohort (n = 16,580) and the validation cohort (n = 7100). In the training cohort, the mean age of patients was 63.02 ± 13.38 years; the number of deaths was 6470; the number of specific deaths was 4602; and the mean follow-up time was 52.35 ± 27.20 years. In the validation cohort, the mean age of patients was 62.90 ± 13.41 years; the number of deaths was 9174; the number of specific deaths was 6537; and the mean follow-up time was 52.53 ± 27.77 years. The two cohorts were well balanced; the remaining characteristics are shown in Table 1.

Competing risk model for univariate and multivariate analysis
The variables in the univariate analysis included Age, Marriage, Race, Sex, ICD-O-3Hist/Behav, Grade, AJCC stage, T stage, N stage, Surgery, Examined LN, RX SUMM-SURG OTH, Chemotherapy, CEA, Deposits, Regional nodes positive, Brain, Bone, Liver, Lung, Tumor size, and Malignant. Neither the univariate nor the multivariate analysis included the M stage because it overlapped with the AJCC stage. The variable Radiation (P > 0.05) was excluded from the multivariate analysis, whereas the other variables in the univariate analysis, all with P < 0.05, were included in the multivariate analysis (Table 2). The results showed that these variables were all independent influencing factors of RC-specific death. Accordingly, we constructed a competing risk model based on these variables.

Validation of predictive nomogram
Specific mortality prediction nomograms were drawn according to the competing risk model described above (Fig. 1). In the training set, the overall C statistic was 0.821 (Se 0.001), and the 1-, 3-, and 5-year AUCs for predicting specific mortality were 0.842 (95% CI 0.830-0.854), 0.830 (95% CI 0.822-0.838), and 0.812 (95% CI 0.803-0.820), respectively. The results show that the model has good discrimination (Fig. 2A). In addition, the model's calibration curves (Fig. 3A) show that the 1-, 3-, and 5-year curves are consistent with the ideal curve, indicating that the nomogram is accurate and well calibrated.
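For readers unfamiliar with the C statistic, the following is a minimal sketch of a Harrell-type concordance index. It is not the estimator used in the study: as a simplifying assumption, competing events are handled like censoring here, whereas dedicated competing-risk versions treat them differently. The function name `concordance_index` and the toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell-type C statistic for the event of interest (code 1).

    A pair (i, j) is usable when patient i had the event of interest and
    patient j was still under observation at a later time. The pair is
    concordant when patient i also has the higher predicted risk.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:          # patient i must have the event of interest
            continue
        for j in range(n):
            if times[i] < times[j]:  # patient j outlasted patient i
                usable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5   # ties in predicted risk count as half
    return concordant / usable

# Toy data: 0 = censored, 1 = RC-specific death, 2 = death from other causes.
toy_times = [2, 4, 5, 7, 9]
toy_events = [1, 0, 1, 2, 0]
toy_risks = [0.90, 0.30, 0.35, 0.40, 0.10]
c_stat = concordance_index(toy_times, toy_events, toy_risks)
print(round(c_stat, 3))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported values above 0.8 indicate that patients who died of RC earlier generally received higher predicted risks.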

Specific mortality estimation
We compared the 1- to 9-year specific mortality calculated by traditional survival analysis and the competing risk model. The results revealed that the competing risk model yielded lower specific mortality than traditional survival analysis. For example, the 1-year mortality rate was 0.08 in the traditional survival analysis and 0.07 in the competing risk model. The gap between the two estimates widened as the follow-up time increased from 1 to 9 years. Therefore, mortality from other causes cannot be ignored (Table 3).
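The source of this gap can be reproduced on toy data. The sketch below computes both quantities by hand: the complement of the Kaplan-Meier estimate with competing deaths censored, and the cumulative incidence function (Aalen-Johansen estimator). The function name `km_complement_and_cif` and the ten toy records are hypothetical and unrelated to Table 3.

```python
def km_complement_and_cif(times, events):
    """Return (1 - KM, CIF) for the event of interest at the last time point.

    events: 0 = censored, 1 = event of interest, 2 = competing event.
    1 - KM treats competing events as censoring; the CIF weights each
    event of interest by the probability of surviving all causes so far.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    km_cause = 1.0       # Kaplan-Meier survival, competing events censored
    overall_surv = 1.0   # survival from any event, just before time t
    cif = 0.0
    idx = 0
    while idx < len(data):
        t = data[idx][0]
        tied = [e for (tt, e) in data if tt == t]
        d1 = tied.count(1)   # events of interest at time t
        d2 = tied.count(2)   # competing events at time t
        cif += overall_surv * d1 / n_at_risk
        km_cause *= 1.0 - d1 / n_at_risk
        overall_surv *= 1.0 - (d1 + d2) / n_at_risk
        n_at_risk -= len(tied)
        idx += len(tied)
    return 1.0 - km_cause, cif

# Toy follow-up times (years) and event codes for ten patients.
toy_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_events = [1, 2, 1, 0, 2, 1, 0, 2, 1, 0]
one_minus_km, cif = km_complement_and_cif(toy_times, toy_events)
print(f"1 - KM = {one_minus_km:.3f}, CIF = {cif:.3f}")
```

In this toy example the Kaplan-Meier complement is clearly larger than the cumulative incidence, and the difference accumulates with every competing death, which is consistent with the widening gap over longer follow-up.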

Discussion
Based on univariate and multivariate analyses, we identified the independent influencing factors of RC-specific death, including Age, Marriage, Race, Sex, ICD-O-3Hist/Behav, Grade, AJCC stage, T stage, N stage, Surgery, Examined LN, RX SUMM-SURG OTH, Chemotherapy, CEA, Deposits, Regional nodes positive, Brain, Bone, Liver, Lung, Tumor size, and Malignant. These variables were combined to construct a competing risk model for RC and a predictive nomogram, both of which showed excellent accuracy. The overall C statistic was 0.821 (Se 0.001) in the training cohort and 0.829 (Se 0.002) in the internal validation cohort. Currently, there are various RC prediction nomograms based on clinical data [18][19][20][21]; however, they are based on Cox regression. For example, a previous study reported a concordance index (C-index) of 0.71 (95% CI: 0.64-0.79) in the training cohort and 0.69 (95% CI: 0.61-0.78) in the validation cohort [21]. In another study that constructed an RC risk prediction model based on the SEER database, the C-index was 0.756 (95% CI: 0.726-0.786) for internal validation and 0.729 (95% CI: 0.678-0.780) for external validation [22]. In contrast, the overall C statistic in the validation set of our study was 0.829 (Se = 0.002). A likely reason for the differences in C-index is that the former two models do not consider the effects of other causes of death, which can severely bias the specific mortality and lead to its overestimation. In our study, deaths from other causes accounted for approximately 10%. If these deaths were ignored, the accuracy of the model would be reduced. We compared the results of the competing risk model for RC and the traditional Cox proportional hazards model, and found differences in their mortality estimates. Other studies have reported similar results and proposed that the Cox model may also misestimate the direction of independent risk factors and the correlation between outcomes [23]. According to another competing risk model for early-stage RC-specific mortality [24], the AUC for predicting the 3-, 5-, and 10-year cancer-specific mortality was 82.2, 78.4, and 75 in the training cohort, and 83.4, 75.9, and 76.8 in the validation cohort, respectively. The different results may be due to the early-stage population, the small number of included cases, and the insufficient risk indicators in that study.
Currently, the most widely used tumor prognosis prediction tool is TNM staging. However, the use of the TNM system for RC has been questioned in recent years [25]. In some studies, a series of oxidative stress indicators significantly correlated with the survival rate of colorectal cancer patients were included to establish the CIOSS scoring system. The CIOSS score was reported to outperform TNM staging in terms of survival prediction in patients with colorectal cancer [26]. Furthermore, a recent analysis showed that the survival pattern of colorectal cancer was heterogeneous even within the TNM classification [27]. Another study attempted to use molecular and biochemical markers to aid in colon cancer staging [28]. In our research, cancer-specific factors were added. Carcinoembryonic antigen (CEA) is now the most widely adopted and readily available tumor marker [29]. Many clinical scientists have evaluated the kinetic patterns of tumor markers as predictors [30]. One study found that patients with elevated postoperative CEA had an increased risk of recurrence [31]. Another study found a direct relationship between tumor volume and overall survival in RC patients [32]. Furthermore, the ratio of serum CEA to maximum tumor diameter may be a more important indicator for assessing tumor bioactivity, with higher predictive value in RC [33]. Tumor deposits (TDs), especially solitary tumor nodules (in the mesocolic and mesorectal adipose tissue) at the lymphatic drainage zone of the primary tumor, are a hallmark of RC aggressiveness. The clinical value of TDs in treating RC is severely underestimated [34]. TD-positive tumors without lymph node metastasis (LNM) are classified as N1c in the NCCN guidelines [35], whereas neither the presence nor the number of TDs is considered for cases with LNM in the pN staging system. Recently, a clinical study found that tumor deposits had great predictive value in patients with locally advanced RC. TDs combined with lymph node metastasis could improve the accuracy of TNM staging [35][36][37]. In addition, CEA [38,39] and tumor deposits [40] were strongly associated with RC-specific death in meta-analyses and other evidence-based studies.
The number of indicators involved in this paper is limited. More indicators should be included in the future to further explore the relationship among the underlying mechanisms, clinical indicators, and prognosis. The occurrence of RC is thought to be closely related to inflammation and oxidative stress of the intestinal epithelium, which damage the integrity of the intestinal barrier. Exposure to environmental toxins in the intestine enhances intestinal inflammation and releases reactive oxygen species (ROS) [41]. The indicators mentioned in this study, including the primary site of cancer [42], the surgical method [43], and lymph node and organ metastasis [44], are all related to oxidative stress. For example, it was found that RC patients with a right-sided primary site had higher mean malondialdehyde serum levels than those with a primary site elsewhere [42]. Serum 8-isoprostanes (8-epiPGF2α) could be used to judge the oxidative state, and the degree of oxidative stress was found to be lower in laparoscopic surgery than in open RC surgery [43]. Additionally, the urinary 8-hydroxydeoxyguanosine (8-OHdG) content in colorectal cancer patients gradually increased from stage I to stage IV, and the urinary 8-OHdG content in patients with tumor metastasis was significantly higher than that of patients without tumor metastasis [42]. More oxidative stress or other specific indicators will be incorporated in the future.

Limitations
Although this predictive nomogram based on a competing risk model was constructed for the first time from a large sample, our study still has some limitations. First, a competing risk model and the key variables available in the SEER database are insufficient to fully reflect other causes of death. Second, some key predictors of specific mortality risk are missing from the SEER database. Third, our research data were extracted from the SEER database, which includes multiple races but has a large number of missing values in the case registration. When there are too many missing values, blind imputation is inappropriate and deleting the affected records is more suitable; however, this may introduce bias, which is an unavoidable flaw of research based on large databases. Fourth, external validation is lacking. Although the SEER database covers a diverse population, our cohort is dominated by White and Black patients, with only a small number of patients from other races, which may bias the results to some extent. External validation will be the focus of our follow-up research once this model is applied in clinical practice.

Outlook
As statistical theory and computer systems have advanced considerably over recent years, precision medicine has attracted widespread attention, promoting the application of artificial intelligence in clinical practice. With the popularization of image-based artificial intelligence, researchers have summarized the application status of radiomics-based machine learning models in rectal cancer. Current intractable challenges lie in the diverse sources of radiomics data, over-provisioned equipment, the specificity of manual segmentation of regions of interest, and the screening of variables [45,46]. Furthermore, some additional challenges cannot be ignored, such as data quality, goodness of fit of machine learning models, evidence of general applicability, and ethical considerations [47]. Despite these challenges, the application of radiomics in clinical practice is still epoch-making, especially for the diagnosis of tumors [47,48]. In addition, radiomics-based machine learning has gradually been used to predict tumor prognosis, such as response to chemotherapy, recurrence, metastasis, and mortality risk. In this context, some studies have applied radiomics to the clinical diagnosis and treatment of RC. A systematic review by Bedrikovetski et al. [49] shows that radiomics is efficient for the identification of lymph node metastasis in colorectal cancer. Di Re et al. [50] and Bourbonne et al. [51] also report that this technique has relatively ideal predictive value for the response to neoadjuvant chemotherapy in patients with RC or colorectal cancer. In addition, Staal et al. [52] reviewed the performance of radiomics in the prediction of survival outcomes in patients with colorectal cancer; however, they only focused on OS. Although the predictive nomogram we constructed shows a relatively ideal predictive value, its predictive performance still needs to be improved. Subsequent studies should combine radiomics and clinical features to build a more efficient predictive tool for specific death.

Conclusion
Our predictive nomogram, constructed based on a competing risk model, is highly accurate in predicting time-specific mortality in RC patients and can assist clinicians in making clinical decisions and developing follow-up strategies. Subsequent studies should incorporate other key factors or use radiomics methods to further enhance the predictive performance of the model.

Fig. 1
The competitive risk model nomogram of patients with rectal cancer at 1, 3, and 5 years

Table 1
Clinicopathological characteristics of RC patients

Table 2
Univariate and multivariate analysis

Table 3
Comparison of 1- to 9-year specific mortality (%) between traditional survival analysis and competitive risk model