Skip to main content

Construction of a 10-gene prognostic score model of predicting recurrence for laryngeal cancer

Abstract

We constructed a prognostic score (PS) model to predict the recurrence risk in patients previously diagnosed with laryngeal cancer (LC). Here the training dataset, consisting of 82 LC samples, was downloaded from The Cancer Genome Atlas (TCGA). The PS model then divided the LC samples into high- and low-risk groups, which predicted well the survival time of LC in three datasets (TCGA dataset: AUC = 0.899; GSE27020: AUC = 0.719; and GSE25727: AUC = 0.662). Therefore, the PS model based on the 10 genes and its nomogram is proposed to help predict the recurrence risk in patients with LC.

Introduction

Laryngeal cancer (LC) has been identified as the one of the most common types of head and neck cancers, which resulted in approximately 11,150 new cases in the United States in 2018 [1]. During the past decades, various treatment strategies have been devised for treating LC. However, the 5-year overall survival (OS) of patients with LC remains unsatisfactory [2]. According to the SEER database from 2006 to 2012, the 5-year OS of LC remained as low as 60.7%, which has not increased significantly in the last few decades [3]. Furthermore, the local recurrence of LC is common among patients, such as those with moderately or poorly differentiated squamous cell carcinoma, in addition to the thyroid cartilage plate invasion. Hence, comprehensive treatment and closer follow-up should be given to these patients [4]. Nevertheless, the identification of novel prognostic gene markers that can help distinguish the recurrence risk in patients with LC is vital for improving the OS of patients with LC.

In recent decades, the occurrence of next-sequencing technologies has made rapid disease and recurrence detection possible. Notably, existing evidence has indicated that many gene biomarkers have predictive values for LC. Likewise, Zhang et al. [5] indicated that five genes (EMP1, HOXB9, DPY19L2P1, MMP1, and KLHDC7B) had the potential function to predict LC recurrence. Cury et al. [6] also argued that DSG2 overexpression was associated with shorter OS. And, it is also indicated that high plasma protein levels of DSG2 indicated its detection in liquid biopsy, which is proposed to be applied as a recurring biomarker for LC. Pedro et al. [7] have also reported that ALCAM overexpression was an independent biomarker for predicting recurrence of laryngeal squamous cell carcinoma in patients. Nevertheless, although previous studies have identified numerous gene targets that account for the LC recurrence, further investigations are needed to explore the effect of these featured genes on the recurrence risk in patients with LC.

Therefore, according to the multiple bioinformatics data, we screened the genes significantly correlated with recurring LC using meta-analysis and L1-penalized optimization algorithm. Then, we constructed the risk model for predicting recurrence risk in patients with LC.

Method

Data source

The mRNA sequencing data of head and neck samples (including 604 samples) were obtained from The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/docs/publications/tcga/?) based on the Illumina HiSeq 2000 RNA Sequencing platform. The positions of the 604 samples were in the alveolar ridges (n = 18), tongue roots (n = 30), buccal mucosa (n = 22), floor of the mouth (n = 67), hard palates (n = 8), hypopharynx (n = 9), larynx (n = 138), lips (n = 3), mouth (n = 38), tongue (n = 158), oropharynx (n = 10), and tonsils (n = 45). The rest of the samples were from uncertain tumor locations. Among the 138 throat samples, we screened 82 LC samples with recurrence and prognosis information (28 and 54 samples with and without recurrence, respectively) in our study.

Additionally, we searched for validation dataset using the keyword “larynx cancer” from the National Center for Biotechnology Information Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/). The screening standards were as follows: (1) gene expression profile data, (2) the samples were from the tumor tissue specimen of patients, (3) human expression profile data, and (4) the samples with information of recurrent or non-recurrent prognosis. Two validation datasets were obtained. One was GSE27020 that composed of 109 LC tissue samples (34 and 75 samples with and without recurrence, respectively) based on the Affymetrix Human Genome U133A, the Array platform, and the other one was GSE25727 that included 56 LC tissue samples (17 and 39 samples with and without recurrence, respectively) based on the Illumina HumanRef-8 WG-DASLv3.0 platform.

Screening of differentially expressed genes

A meta-analysis on TCGA dataset, GSE27020 and GSE25727, was conducted using an ES function of MetaDE [8] (version: 1.0.5, https://cran.r-project.org/web/packages/MetaDE) in R3.4.1 to screen the differentially expressed genes (DEGs) [9]. Subsequently, we screened for DEGs [9] that showed consistent expression in these two datasets between samples with recurrence and those without recurrence by calculating the tau2, Q, and Qpval values (criterion for judgment; tau2 = 0 indicates that each research object is homogeneous and unbiased; the statistic Q obeys the Chi-square test with a degree of freedom of k-1, whereas Qpval  > 0.05 indicates that each research object is homogeneous and unbiased). Then, the false discovery rate (FDR) value was obtained using multiple test corrections. FDR < 0.05 showed that the difference was significant. Additionally, each dataset was calculated to express the fold change, after which several parameters were selected, and the threshold value was set. The set parameters were as follows: (1) To ensure that the source of each selected characteristic gene was homogeneous and unbiased (that the expression of each featured gene in each data set was consistent), Tau2 = 0 and Qpval  > 0.05 were selected as homogeneity test parameters. (2) FDR < 0.05 was selected as the significant threshold of expression difference between the genomes. (3) After screening with log2 FC, the genes having similar direction of difference (with the same symbol of log2 FC) were retained. After combining multiple screening parameters, we set the selection of threshold parameters:

I. We ensure that the source of each selected characteristic gene is homogeneous and unbiased, that is, the expression in each data set is consistent, so tau2 = 0 and Qpval > 0.05 are selected as homogeneity test parameters;

II. FDR < 0.05 was considered as the threshold of significant difference in expression between gene groups;

III. Combined with Log2FC for screening, we retained genes with the same direction of difference (consistent Log2FC symbols) in the three datasets.

The threshold was set to a false discovery rate < 0.05. Then, the Gene Ontology Biology Process (GO-BP) annotation analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were conducted for these DEGs with consistency.

Establishment and verification of a risk assessment model

On the basis of the DEGs, we conducted univariate Cox regression analysis in the survival package [10] (version 2.4, http://bioconductor.org/packages/survivalr/) to screen DEGs significantly related to the prognosis in TCGA dataset. The multivariate Cox regression analysis was then used to screen DEGs that can be used as independent prognostic factors. The log-rank P < 0.05 was also regarded as the threshold of significant correlation.

Furthermore, the Cox proportional hazard model [11] based on the L1-penalized (Lasso) in the penalized package (version 0.950; http://bioconductor.org/packages/penalized/) [12] of the R3.4.1 language was used to screen out the optimized prognostic-associated signature DEG combinations based on the aforementioned DEGs related to the prognosis [13]. Then, on the basis of the prognostic coefficient of prognosis-related DEGs, the prognostic score (PS) prediction model was established in the training dataset using the following formula:

$${\text{ps}} = \sum {\beta {\text{DEGs}}} \times {\text{ExpDEGs,}}$$

where βDEGs refer to the prognostic coefficient of the optimized DEGs in the Lasso algorithm and ExpDEGs represent the expression of the corresponding DEGs in the training dataset.

With the median PS as the dividing point, the samples in TCGA training dataset were further categorized into high- and low-risk groups. After that, the Kaplan–Meier (KM) [14] survival curve in the R3.4.1 language survival package (version 2.41–1) [10] was used to measure the association between the risk model and prognosis. Simultaneously, we screened these optimized DEGs from the validation dataset (GSE25727 and GSE27020). Then, the PS score of each sample was obtained using the PS calculation method. The validation dataset samples were also separated into high- and low-risk sample groups in the same manner as in TCGA dataset samples. Thereafter, the KM curve method of the survival package (version 2.41–1) [10] in the R3.4.1 language was used to evaluate the relationship between the high- and low-risk groups, compared with the actual survival prognosis information from the validation dataset samples.

Screening of independent prognostic clinical factors for performance evaluation

Combining the clinical factors including recurrence, age, gender, pathologic (M, N, and T), pathologic stage, grade, alcohol history, angiolymphatic invasion, and perineural invasion in TCGA (Additional file 1: Table S1), we used univariate and multivariate Cox regression analysis methods in the R3.4.1 language survival package (version 2.41–1) [10] to screen the independent prognostic clinical factors. The threshold was set to log-rank P < 0.05. Next, to further explore the correlation between independent factors and prognosis, the nomogram with 3- and 5-year survival rates was constructed using the RMS software package (version 5.1.2; https://cran.r-project.org/web/packages/rms/index.html) in R3.4.1 [9, 15].

Next, the PS and risk models were compared using the area under the receiver-operating characteristic curve (AUROC) [14] and the concordance index (C-index). Additionally, the AUROC is a quantitative indicator of the receiver-operating characteristic (ROC) curve, which was calculated using the pROC in the R3.4.1 language (version 1.14.0, https://cran.r-project.org/web/packages/pROC/index.html). In contrast, the C-index is referred to as the scores of all individual pairs correctly sorted on the basis of the Harrell C statistics [16] to predict the survival time [17]. It was calculated using the survcomp package (http://www.bioconductor.org/packages/release/bioc/html/survcomp.html) in the R3.4.1 language.

Results

Identification of DEGs

A total of 981 DEGs were detected among TCGA datasets, GSE25727 and GSE27020, which contained 347 down-regulated genes and 634 up-regulated genes (Fig. 1 and Additional file 2: Table S2). The DEGs were significantly different among the various types of samples from the three datasets. This result indicated that the DEGs expressed significant difference among the three datasets.

Fig. 1
figure 1

A two-way hierarchical clustering heat map of TCGA, GSE27020 and GSE25727, datasets based on consistent DEGs

The GO results suggested that these genes are involved in 41 GO-BP terms, such as regulation of cell migration (P = 2.26E−04) and regulation of locomotion (P = 2.48E−04). Simultaneously, these DEGs were enriched in 10 KEGG pathways, including the Jak–STAT signaling pathway (P = 9.44E−03) (Fig. 2 and Table 1).

Fig. 2
figure 2

GO-BPs (A) and KEGG pathways (B) involved in DEGs

Table 1 GO biological process and KEGG signal pathway significantly related to target genes

Constructing the prognosis prediction model

A total of 206 prognosis-related DEGs were screened using univariate Cox regression analysis with a threshold of P < 0.05 (Additional file 3: Table S3). On the basis of the aforementioned DEGs, we obtained 96 DEGs via the multivariate Cox regression analysis. Subsequently, 10 optimized DEGs (CD38, ZNF212, POR, CC2D1A, GRAMD4, FH, SLC24A3, GATA2, FOXD1, and MMP10) were selected using the L1-penalized algorithm (Table 2).

Table 2 Information of optimizing DEGs combination

Evaluation and comparison of the prognostic risk prediction model’s effectiveness

As shown in Fig. 3, the PS value based on the 10 optimized DEGs could distinctly divide 82 patients with LC into high- and low-risk groups in TCGA training dataset, which indicated that the patients in the high-risk group were related to shorter OS in TCGA dataset (P = 3.853e−12). Meanwhile, we obtained the similar results from the validation datasets, which included GSE27020 (P = 4.259e−06) and GSE25727 (P = 0.0045).

Fig. 3
figure 3

Construction of prognostic risk prediction models. The samples of the A training set, B validation dataset 1 (GSE27020), and C validation dataset 2 (GSE25727) are all based on the KM curve of the PS prediction models and prognosis, with blue and red curves representing low- and high-risk samples, respectively (D). The ROC curve of the prediction results was based on the PS prognosis model

Furthermore, the ROC curves based on the PS prediction model indicated that this PS model accurately predicted the patient survival time in both TCGA dataset (AUC = 0.899) and validation dataset (GSE27020: AUC = 0.719; GSE25727: AUC = 0.662).

Screening of independent prognostic clinical factors

As expressed in Table 3, the PS model was substantially correlated with the LC clinical condition, which was an independent prognostic parameter. Subsequently, the PS model status was included in the nomogram model to predict the 3- and 5-year OS in patients with LC. After that, the score of each index was observed on the point table at the upper apex of the nomogram. Next, the scores of each index were added to estimate the 3- and 5-year survival probability (Fig. 4). These results indicated that the nomogram on the basis of the PS model status had high prediction accuracy for the survival and prognosis of patients with LC.

Table 3 Information of clinical factors
Fig. 4
figure 4

Construction of nomogram to predict the prognostic ability for patients with LC. A A nomogram was constructed using the PS model to predict the prognosis for patients with LC. The calibration plots for 1-year (B), 3-year (C), and 5-year (D) survival time

Discussion

As shown by the previous reports, it is important to detect several crucial gene biology markers associated with the LC survival prognosis, as this could provide a vital theoretical reference for treating patients with LC. Therefore, in our study, a PS model was established on the basis of 10 independent prognostic genes (CD38, ZNF212, POR, CC2D1A, GRAMD4, FH, SLC24A3, GATA2, FOXD1, and MMP10). Moreover, the PS model was determined to be an independent recurrence factor for the survival of patients.

Furthermore, LC recurrence seriously affects the survival time of patients with LC. Recently, high-throughput sequencing technologies have also improved the understanding of recurrent gene functions by decoding the genome of patients with LC. Besides, the prediction model of LC recurrence helped in the clinical decision-making. In our study, among the 206 DEGs related to the LC recurrence, 96 independent prognosis-related DEGs were screened using the multivariate Cox regression analysis. We identified 10 metabolic genes associated with prognosis and were further revealed by LASSO-based Cox proportional hazard model analysis to construct the RS survival prediction model, including CD38, ZNF212, POR, CC2D1A, GRAMD4, FH, SLC24A3, GATA2, FOXD1, and MMP10. The KM curves showed that the patients with LC in the low-risk group had remarkably better survival than the low-risk group for TCGA dataset (P = 3.853e−12). Meanwhile, we observed the similar findings in the validation datasets including GSE27020 (P = 4.259e−06) and GSE25727 (P = 0.045). A study from Xiang et al. [18] showed a PS model was constructed to predict the recurrence in patients with LC. The PS value demonstrated good accuracy in predicting the relapse with an AUC of 0.859 was at 1 year, 0.822 at 3 years, and 0.815 at 5 years survival predictive accuracy. Besides, Zhang et al. [19] constructed a four-gene signature that could be used to predict the prognosis of patients with LC. It is hypothesized that the four genes might affect the prognosis of patients with LC via mechanisms involved in the immune response and negative regulation of the Wnt signaling pathway. Moreover, Fan et al. [20] indicated that the constructed nomogram of the LC survival risk was good for predicting accuracy, which is helpful for doctors to make a more accurate prognosis evaluation of patients with LC, and can be used to guide and optimize the treatment of patients with LC. Likewise, the 10 independent prognostic genes were used to construct the PS model might be novel biomarkers for risk recurrence of patients with LC. Furthermore, we also constructed a nomogram with C-index of 0.822 using the PS model, which indicated that the nomogram performance has a good concordance with the prediction of 1-, 3-, and 5-year OS. Therefore, the PS model based on the 10 DEGs has the potential ability in the area of prognostic prediction.

In our study, some limitations exist. First, we found that the PS model based on 10 genes had a good predictive ability to predict LC recurrence. However, we failed to determine their detailed mechanisms. Then, only the PS model was screened through the multivariate Cox regression analysis with a threshold of P < 0.05. Therefore, we could not analyze other models based on other risk factors. Additionally, our study required large samples and clinical data to confirm whether the model we constructed would accurately distinguish high- and low-risk patients with recurrent LC. Finally, corresponding experimental studies should be conducted to verify the functions of these ten key genes.

Conclusion

A 10-gene PS model and nomogram are proposed to help predict the recurrence risk in patients with LC.

Availability of data and materials

The raw data were collected and analyzed by the authors, and are not ready to share their data because the data have not been published.

Abbreviations

TCGA:

The Cancer Genome Atlas

DEGs:

Differentially expressed genes

KEGG:

Kyoto Encyclopedia of Genes and Genomes

GO-BPs:

Gene Ontology Biology Processes

DEGs:

Differentially expressed genes

ROC:

Receiver-operating characteristic

LC:

Laryngeal cancer

PS:

Prognostic score

References

  1. Obid R, Redlich M, Tomeh C. The treatment of laryngeal cancer. Oral Maxillofac Surg Clin North Am. 2019;31(1):1–11.

    Article  PubMed  Google Scholar 

  2. García Lorenzo J, Montoro Martínez V, Rigo Quera A, Codina Aroca A, López Vilas M, Quer Agustí M, León Vintró X. Modifications in the treatment of advanced laryngeal cancer throughout the last 30 years. Eur Arch Otorhinolaryngol. 2017;274(9):3449–55.

    Article  PubMed  Google Scholar 

  3. Group MLCW, Mulcahy CF, Mohamed ASR, Kanwar A, Hutcheson KA, Ghosh A, Vock D, Weber RS, Lai SY, Gunn GB, et al. Age-adjusted comorbidity and survival in locally advanced laryngeal cancer. Head Neck. 2018;40(9):2060–9.

    Article  Google Scholar 

  4. Zhou Y, Yang D, Yang Q, Lv X, Huang W, Zhou Z, Wang Y, Zhang Z, Yuan T, Ding X, et al. Single-cell RNA landscape of intratumoral heterogeneity and immunosuppressive microenvironment in advanced osteosarcoma. Nat Commun. 2020;11(1):6322.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Zhang G, Fan E, Yue G, Zhong Q, Shuai Y, Wu M, Feng G, Chen Q, Gou X. Five genes as a novel signature for predicting the prognosis of patients with laryngeal cancer. J Cell Biochem. 2019;121(8–9):3804–13.

    Google Scholar 

  6. Cury SS, Lapa RML, de Mello JBH, Marchi FA, Domingues MAC, Pinto CAL, Carvalho RF, de Carvalho GB, Kowalski LP, Rogatto SR. Increased DSG2 plasmatic levels identified by transcriptomic-based secretome analysis is a potential prognostic biomarker in laryngeal carcinoma. Oral Oncol. 2020;103:1–9.

    Article  Google Scholar 

  7. Nicolau-Neto P, de Souza-Santos PT, Severo Ramundo M, Valverde P, Martins I, Santos IC, Dias F, de Almeida ST, Ribeiro Pinto LF. Transcriptome analysis identifies alcam overexpression as a prognosis biomarker in laryngeal squamous cell carcinoma. Cancers. 2020;12(2):1–14.

    Article  Google Scholar 

  8. Chang L-C, Lin H-M, Sibille E, Tseng GC. Meta-analysis methods for combining multiple expression profiles: comparisons, statistical characterization and an application guideline. BMC Bioinformatics. 2013;14:368.

    Article  PubMed  PubMed Central  Google Scholar 

  9. Anderson WISD, Vesely KR. Thyroid follicular carcinoma with pulmonary metastases in a beaver (Castor canadensis). J Wildl Dis. 1989;25(4):599–600.

    Article  CAS  PubMed  Google Scholar 

  10. Wang P, Wang Y, Hang B, Zou X, Mao J-H. A novel gene expression-based prognostic scoring system to predict survival in gastric cancer. Oncotarget. 2016;7(34):55343–51.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Goeman JJ. L1 penalized estimation in the Cox proportional hazards model. Biom J. 2010;52(1):70–84.

    PubMed  Google Scholar 

  12. Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95.

    Article  CAS  PubMed  Google Scholar 

  13. Wu G, Zhang M. A novel risk score model based on eight genes and a nomogram for predicting overall survival of patients with osteosarcoma. BMC Cancer. 2020;20(1):456.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Handisurya A, Rumpold T, Caucig-Lütgendorf C, Flechl B, Preusser M, Ilhan-Mutlu A, Dieckmann K, Widhalm G, Grisold A, Wöhrer A, et al. Are hypothyroidism and hypogonadism clinically relevant in patients with malignant gliomas? A longitudinal trial in patients with glioma. Radiother Oncol. 2019;01:139–48.

    Article  Google Scholar 

  15. Eng KHSE, Morrell K. On representing the prognostic value of continuous gene expression biomarkers with the restricted mean survival curve. Oncotarget. 2015;6(34):36308–18.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15(4):361–87.

    Article  PubMed  Google Scholar 

  17. Mayr A, Schmid M. Boosting the concordance index for survival data–a unified framework to derive and evaluate biomarker combinations. PLoS ONE. 2014;9(1): e84483.

    Article  PubMed  PubMed Central  Google Scholar 

  18. Xiang Y, Li C, Liao Y, Wu J. An integrated mRNA-lncRNA signature for relapse prediction in laryngeal cancer. J Cell Biochem. 2019;120(9):15883–90.

    Article  CAS  PubMed  Google Scholar 

  19. Zhang G, Fan E, Zhong Q, Feng G, Shuai Y, Wu M, Chen Q, Gou X. Identification and potential mechanisms of a 4-lncRNA signature that predicts prognosis in patients with laryngeal cancer. Hum Genomics. 2019;13(1):36.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  20. Fan L, Zhao R, Chen X, Liu Y, Tian L, Liu M. Establishment of a non-squamous cell carcinoma of the larynx nomogram prognostic model and prognosis analysis. Auris Nasus Larynx. 2022;49(2):240–7.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

This study was funded by Scientific research project of heilongjiang health and family planning commission (2018-021), Scientific research project of heilongjiang health and family planning commission (2019-147) and Heilongjiang Science and Technology Department Youth Fund (Yq2019h023)

Funding

None.

Author information

Authors and Affiliations

Authors

Contributions

YNL and ZGG participated in the design of this study, and they both performed the statistical analysis. CP carried out the study and collected important background information. XLJ drafted the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xingli Jiang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

The clinical factors in TCGA.

Additional file 2.

The differentially expressed genes among TCGA, GSE25727 and GSE27020 datasets.

Additional file 3.

Screening of prognosis-related differentially expressed genes using univariate Cox regression analysis.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Liu, Y., Gao, Z., Peng, C. et al. Construction of a 10-gene prognostic score model of predicting recurrence for laryngeal cancer. Eur J Med Res 27, 249 (2022). https://doi.org/10.1186/s40001-022-00829-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40001-022-00829-2

Keywords

  • Laryngeal cancer
  • Recurrence
  • Prognosis