Cardiovascular disease diagnosis: a holistic approach using the integration of machine learning and deep learning models

Abstract

Background

The incidence and mortality rates of cardiovascular disease worldwide are a major concern in the healthcare industry. Precise prediction of cardiovascular disease is essential, and the use of machine learning and deep learning can aid in decision-making and enhance predictive abilities.

Objective

The goal of this paper is to introduce a model for precise cardiovascular disease prediction by combining machine learning and deep learning.

Method

Two public heart disease classification datasets with 70,000 and 1190 records besides a locally collected dataset with 600 records were used in our experiments. Then, a model which makes use of both machine learning and deep learning was proposed in this paper. The proposed model employed CNN and LSTM, as the representatives of deep learning models, besides KNN and XGB, as the representatives of machine learning models. As each classifier defined the output classes, majority voting was then used as an ensemble learner to predict the final output class.

Result

The proposed model obtained the highest classification performance based on all evaluation metrics on all datasets, demonstrating its suitability and reliability in forecasting the probability of cardiovascular disease.

Introduction

Nowadays massive amounts of data are generated in the healthcare industry and individuals facing these types of data have realized that there is a significant gap between their collection and interpretation [1, 2]. In today's data-driven era, the intersection of healthcare and artificial intelligence has paved the way for transformative advancements in healthcare. In this context, machine learning algorithms have emerged as powerful tools capable of analyzing vast amounts of patient data with unprecedented speed and precision. By harnessing the potential of machine learning, healthcare providers can leverage complex patterns within diverse datasets to develop predictive models for disease diagnosis. These models can identify subtle indicators and risk factors that may elude traditional diagnostic methods, empowering healthcare providers to take proactive measures and customize individualized treatment plans for patients. [3, 4].

According to the World Health Organization, cardiovascular diseases are the leading non-communicable cause of death worldwide: roughly 17.9 million individuals die from cardiovascular diseases each year, making up approximately 31% of total worldwide fatalities [5, 6]. Recent information from the American Heart Association indicates that cardiovascular diseases continued to be the main cause of death in the United States in 2023. Given the increasing rate of coronary artery diseases and their transformation into a global concern, the healthcare industry needs to shape and enhance methods of dealing with these diseases to minimize their impact on society [7, 8]. In this regard, the integration of machine learning in cardiovascular disease diagnosis holds promise not only for enhancing diagnostic accuracy, but also for optimizing resource allocation within healthcare systems. By streamlining diagnostic processes and prioritizing high-risk individuals for further evaluation, machine learning-driven approaches offer the potential to improve patient care, increase operational efficiency, and ultimately save lives [9].

The application of machine learning has provided a new approach to predicting cardiovascular diseases [10,11,12]. Consequently, various machine learning methods have been used to recognize and extract useful information from clinical datasets with minimal user input and effort. However, the emergence of deep learning has revolutionized cardiovascular disease prediction by offering distinct advantages over traditional machine learning approaches [10, 13, 14]. Deep learning algorithms, such as neural networks, excel in processing vast amounts of complex data, capturing intricate patterns, and extracting high-level features from raw inputs. In the context of cardiovascular disease prediction, the inherent ability of deep learning models to automatically learn hierarchical representations of data enables them to uncover subtle relationships and dependencies that may not be apparent to conventional machine learning algorithms [15,16,17,18,19].

The fusion of deep learning and machine learning methodologies also holds great promise for advancing cardiovascular disease diagnosis and management in recent years [20,21,22,23,24]. By combining the strengths of deep learning in extracting intricate patterns from complex data and machine learning's interpretability and explainability, healthcare professionals can leverage a hybrid approach to enhance the accuracy, efficiency, and transparency of cardiovascular disease diagnosis. Deep learning models can effectively process and analyze large amounts of data to extract comprehensive features and patterns that may be indicative of cardiovascular conditions. These deep learning-derived features can then be integrated with traditional machine learning algorithms to build predictive models that not only provide accurate diagnostic assessments, but also offer insights into the underlying decision-making process, enabling clinicians to understand and trust the predictions made by the hybrid system. Multiple empirical and theoretical studies have shown that the accuracy of combinational models is often better than individual ones [11, 19, 21, 25, 26].

While multiple studies have been carried out in this domain, a specific accurate predictive model that can thoroughly identify all essential characteristics of cardiovascular diseases is still lacking. Accordingly, a combinational model is presented in this paper which makes use of both Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network, as the representative of deep learning models, besides K-Nearest Neighbor (KNN) and Extreme Gradient Boosting (XGB) as the representative of machine learning models to predict cardiovascular disease. This amalgamation of deep learning and machine learning techniques empowers healthcare providers with a powerful toolkit for personalized and precise cardiovascular disease diagnosis, paving the way for more informed treatment strategies and improved patient care outcomes. To implement our proposed model, three distinct datasets were utilized. Two of them are public heart disease datasets sourced from Kaggle and the third one is a local dataset collected from the medical records of patients at Dr.Heshmat Hospital, Guilan, Iran. The key contribution of this paper can be outlined as follows:

  • Two public datasets besides a locally collected dataset were used in our experiments to identify the primary features and optimal model for predicting cardiovascular disease.

  • A novel combinational model was proposed for cardiovascular disease prediction that could capture complex relationships in data through both feature learning and traditional algorithmic approaches, leading to improved predictive performance.

  • Drawing from the results of experiments, a robust and precise framework for predicting cardiovascular disease was proposed, offering substantial value in optimizing resource allocation and aiding cardiologists in effectively categorizing recently diagnosed patients.

The remainder of this paper is organized as follows: Section "Related work" summarizes related studies. The employed methodology, including the dataset description, the proposed model, and its details, is presented in Section "Proposed model". Section "Experiments and results" includes the results of the experiments and discussion. The conclusion and future research directions are provided in Section "Experiments and results".

Related work

Using machine learning and deep learning for cardiovascular disease prediction is crucial, as it can enhance accuracy in identifying risk factors, enable early detection of potential issues, personalize treatment plans, and ultimately improve patient outcomes through proactive healthcare interventions. Accordingly, numerous studies have been conducted to find the appropriate technique for predicting heart disease. Since the focus of this paper is to propose a model based on the combination of machine learning and deep learning for cardiovascular disease prediction, studies conducted in this area are briefly introduced in the following.

Using machine learning for cardiovascular disease prediction involves training algorithms on large amounts of medical data to identify patterns and predict the likelihood of an individual developing heart-related issues. By leveraging machine learning techniques like logistic regression, random forests, or neural networks, healthcare professionals can assess the probability of cardiovascular events, allowing for early intervention and personalized treatment approaches. From the perspective of machine learning, Ahmad et al. [27] performed experiments using six machine learning algorithms: K-nearest neighbor, logistic regression, SVM, random forest classifier, decision tree, and extreme gradient boosting, which were trained on two datasets related to heart disease. The findings indicated that SVM recorded the highest accuracy, reaching 87.91% on the Cleveland dataset. Meanwhile, Akkaya et al. [28] examined eight distinct machine learning classification methods using the Cleveland dataset and determined that KNN had the best performance with an accuracy of 85.6%. Similarly, Tougui et al. [29] applied different data mining techniques, with the random forest method attaining the highest classification accuracy of 87.64%.

Additionally, Shafenoor et al. [30] explored the effectiveness of data mining methods in recognizing important features and classifying whether heart diseases are present or absent. Their results indicated that using a voting approach with Naïve Bayes and logistic regression achieved the highest classification accuracy of 87.41%.

In a similar research direction, Subanya & Rajalaxmi [31] utilized the SVM classification technique along with the Swarm intelligence-based Artificial Bee Colony (ABC) algorithm to find optimal features, resulting in an accuracy of 86.76%. Additionally, Mokeddem et al. [32] utilized the genetic algorithm alongside Naïve Bayes and SVM for classification, yielding accuracies of 85.50% and 83.82%, respectively. Khanna et al. [33] performed a comparative study of classification methods (logistic regression, SVM, and neural networks) to forecast the prevalence of heart disease, determining that logistic regression achieved the best performance with a classification accuracy of 84.80%. Furthermore, Kumar et al. [34] employed eight different data mining methods to predict heart disease, finding that Decision Tree C4.5 performed the best with an accuracy of 83.40%. Finally, Acharya [35] explored the effectiveness of various data mining techniques in predicting the presence of heart disease, concluding that KNN is the most effective algorithm with a classification accuracy of 82%. Arroyo et al. [36] employed the combination of an optimized neural network and genetic algorithm for cardiovascular disease prediction and proved that their proposed model outperformed the other machine learning algorithms. Lin et al. [37] employed the two-step Taguchi technique along with an artificial neural network for accurate prediction of cardiovascular disease risk and revealed that not only their proposed model could accurately predict cardiovascular risk but also conserve computational resources. 
They also proposed a model named TPTM-HANN-GA derived from the combination of the two-phase Taguchi method (TPTM), a hyperparameter artificial neural network (HANN), and a genetic algorithm (GA) that could effectively fine-tune hyperparameters for an artificial neural network in the training phase, leading to a substantial improvement in the accuracy of cardiovascular disease risk prediction [38]. Chaithra et al. [39] also used three learning models, namely DT-J48, Naive Bayes, and neural network, to design a predictive model for cardiovascular diseases. They applied these models to a dataset collected from the trans-thoracic echocardiography database. Their empirical results showed that the neural network model performs much better in predicting heart disease with an accuracy of 97.91%. Nazari et al. [40] also proposed a combinational model based on majority voting and GA for cardiovascular prediction and applied it to the Cleveland and a local dataset which obtained an accuracy of 88.43% on Cleveland dataset.

In the subsequent research area, the emergence of deep learning has significantly impacted the prediction of cardiovascular disease by enabling more complex and accurate models to be developed. Deep learning algorithms, particularly deep neural networks, are capable of automatically learning intricate patterns and relationships within large datasets without the need for explicit feature engineering. Accordingly, Singhal et al. [41] evaluated a convolutional neural network on the Cleveland dataset. They first used a convolutional neural network with 2 convolutional layers and then the number of layers was increased. Their model was expanded up to 5 layers, and the results were examined and compared. Based on their obtained result, the best accuracy (95%) was obtained with 3 convolutional layers. Dutta et al. [42] created a dataset similar to the Cleveland dataset using information gathered from the "National Health and Nutrition Examination Survey" database. They trained a neural network model with multiple convolutional layers and obtained an accuracy of 81.78%.

In recent years, the combination of machine learning and deep learning has revolutionized cardiovascular disease prediction by leveraging vast amounts of patient data to identify intricate patterns and subtle relationships that may not be apparent to the human eye. In this regard, Mehmood et al. [43] proposed a model called CardioHelp that can be used for predicting cardiovascular disease. They used a combination of deep learning algorithms, including multiple convolutional layers, and obtained an accuracy of 97% on a local dataset. Tarawneh and Embarak [44] examined various machine learning models on the Cleveland dataset to find the best one. They used Naive Bayes, Support Vector Machine, Decision Tree, Neural Network, and K-Nearest Neighbor and then used a combined approach. They concluded that the combinational model, which makes use of the benefits of all baselines, obtained the highest accuracy and can be used as an expert system in hospitals to assist doctors in quickly diagnosing cardiovascular disease. Bhavekar and Goswami [45] developed a hybrid deep-learning methodology based on RNN and LSTM for cardiac disease categorization. In their classification process, the RNN utilized three distinct activation functions, while various preprocessing techniques were applied for sorting and classifying. Relational, bigram, and density-based methods were employed for feature extraction. Subhadra and Vikas [46] tested decision tree models, logistic regression, Naive Bayes, random forest, support vector machine, deep learning, gradient boosting tree algorithm, and generalized linear models on the Cleveland dataset to predict cardiovascular disease. Their proposed MLP-NN model, with 5 neurons in the hidden layer, achieved an accuracy of 93.39%, higher than the other models.

Proposed model

Machine learning and deep learning both offer valuable tools for cardiovascular disease prediction, each with its own set of benefits. Notably, machine learning excels in interpretability, allowing for insights into factors influencing predictions. On the other hand, deep learning can automatically learn intricate patterns in raw data, potentially capturing complex relationships that may be missed by traditional machine learning methods. While each of them has its own benefits and pitfalls, using combinational models for cardiovascular disease prediction offers a significant advantage by harnessing the collective power of diverse predictive models to improve the accuracy and robustness via capturing complex relationships within the data. In this regard, the proposed model leverages the combination of CNN and LSTM from deep learning and KNN and XGB from machine learning as the base classifiers for cardiovascular disease classification. As the output classes are defined by each classifier, majority voting is then used as an ensemble learner to predict the final output class.

Each of these classifiers was chosen for its distinctive structure. CNN can automatically extract relevant features from the input, effectively capturing key aspects of the data that are important for classification. Moreover, LSTM has the inherent ability to extract relevant features and representations from sequential data without the need for explicit feature engineering. The combination of CNN and LSTM networks allows the model to capture complex relationships and patterns within the clinical data, potentially enabling it to discern intricate dependencies and interactions between different clinical variables, ultimately aiding in accurate disease classification. On the other hand, KNN is effective in capturing local patterns within the feature space. In the context of clinical data, where characteristics of patients and their health profiles can exhibit local patterns, KNN can be suitable for identifying similarities between patients based on their clinical attributes, potentially aiding in patient classification. Moreover, clinical data often contain non-linear relationships between patient characteristics and disease outcomes. XGB's ability to capture complex interactions between features and to handle non-linear relationships can be beneficial for accurate cardiovascular disease classification. Majority voting also mitigates the impact of individual model biases by aggregating predictions from multiple models, thereby reducing the risk of overfitting and improving generalization to unseen data. The structure of the proposed model is depicted in Fig. 1 and more details are provided in the following.

Fig. 1 Structure of the proposed model

The first step in the proposed model is preprocessing, which is essential for preparing the data for the critical phase of the learning model. The preprocessing step includes data cleaning, data transformation, data augmentation, data balancing, and data normalization.

  • Data cleaning: Data cleaning encompasses the tasks of recognizing and rectifying errors or discrepancies within the dataset, such as addressing missing values, eliminating duplicates, rectifying inaccuracies, and managing outliers.

  • Data transformation: Data transformation is the process of converting data into a format that is better suited for analysis, helping improve the performance of machine learning models by making the data more understandable and easier to work with.

  • Data balancing: Data balancing is a preprocessing step used when the dataset is imbalanced to ensure that the model does not favor the majority class and makes more accurate predictions for all classes.

  • Data normalization: Data normalization is the process of scaling the features of the dataset to a standard range to make sure that no particular feature dominates the learning process due to its larger scale compared to other features.
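
The cleaning and normalization steps listed above can be sketched in a few lines of Python. This is only an illustration using the standard library: the record layout, the helper names `clean` and `min_max_normalize`, and the choice of min-max scaling to the range [0, 1] are assumptions, since the text does not state which cleaning rules or scaling method were applied.

```python
def clean(rows):
    """Drop rows with missing values (None) and exact duplicates, keeping order."""
    seen = set()
    kept = []
    for row in rows:
        if any(value is None for value in row):
            continue  # missing value: discard the record
        if row in seen:
            continue  # exact duplicate: discard the repeat
        seen.add(row)
        kept.append(row)
    return kept


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns are mapped to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical records: (age in years, systolic blood pressure)
raw = [(50.0, 120.0), (60.0, None), (50.0, 120.0), (70.0, 160.0), (60.0, 140.0)]
cleaned = clean(raw)
normalized = min_max_normalize(cleaned)
print(normalized)
```

In a train/test setting, the column minima and maxima would normally be computed on the training data only and then reused for the test data, so that no information from the test set reaches the model.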

Once the data have been preprocessed, they are divided into two groups: a training set and a testing set. The training set is utilized to train the model. Our proposed model includes three base classifiers: CNN-LSTM, XGB, and KNN. The first classifier is built by combining CNN and LSTM. Notably, combining these two models helps leverage the strengths of each model and mitigate their weaknesses to some extent. One of the main advantages of a CNN is its low number of parameters and ease of training. CNN is also capable of extracting local features and, with an increase in the number of layers, extracting more valuable features from the input sequence [47]. LSTM is good at capturing temporal dependencies in time series data, which can be useful in predicting cardiovascular disease progression over time. The structure of the long short-term memory network is designed in a way that can effectively address the main limitations of CNN. In summary, CNNs can effectively process structured clinical data, extracting relevant patterns and features, while LSTMs are adept at capturing temporal dependencies within sequential clinical records. The architecture of the combined CNN-LSTM model, which is used as the first base classifier, is shown in Fig. 2.

Fig. 2 Proposed CNN-LSTM architecture

XGB and KNN are the next base classifiers. While clinical data often contain non-linear relationships between patient characteristics and disease outcomes, XGB's ability to capture complex interactions between features and its capability to handle non-linear relationships can be beneficial for accurate cardiovascular disease classification. KNN is also a simple yet effective algorithm for classification tasks while it is a non-parametric model and does not make strong assumptions about the underlying distribution of the data. This can be advantageous when dealing with clinical data, as it allows the model to adapt to complex and diverse patient characteristics without imposing strict constraints on the data's distribution. For in-depth information about the algorithms discussed, please refer to Taye et al. [48], as their comprehensive explanations exceed the scope of this paper.
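
To make the non-parametric nature of KNN concrete, the following is a minimal sketch of the decision rule: a new patient is assigned the most common label among the k closest training patients. The Euclidean distance, the value k = 3, and the toy feature vectors are assumptions chosen for illustration; they are not the settings used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority label among its k nearest training points."""
    # Sort training points by Euclidean distance to the query
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent label among the k neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already normalized feature vectors; label 1 = disease present
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
train_y = [0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.15, 0.15)))
print(knn_predict(train_x, train_y, (0.9, 0.9)))
```

No parameters are fitted here: the training data themselves act as the model, which is why KNN makes no assumption about the underlying distribution but is sensitive to feature scaling.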

Once the outputs of these classifiers are generated, majority voting is used to combine the predictions of the multiple classifiers, where the final prediction is the most common prediction among the individual classifiers. Majority voting can help to reduce the impact of individual classifiers that may be biased or perform poorly on certain types of data. In other words, majority voting is effective for combining classifiers because it leverages the wisdom of the crowd, aggregating diverse opinions from multiple classifiers to reduce individual biases and errors, leading to more robust and accurate predictions. By voting on the most frequently predicted class, the combinational model tends to make better decisions, enhancing overall performance and generalization across various datasets and learning tasks. It is worth mentioning that the weights assigned to each classifier's prediction are determined based on their individual performance on a validation set to maximize the overall predictive performance.
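
The voting step described above can be illustrated as follows. This is a sketch under stated assumptions rather than the exact procedure used in the experiments: each classifier's vote is weighted by a validation score, the weight values shown are invented for the example, and the tie-breaking rule is a choice made here.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: {classifier name: predicted label}
    weights:     {classifier name: weight, e.g. validation accuracy}
    """
    totals = defaultdict(float)
    for name, label in predictions.items():
        totals[label] += weights[name]
    # Ties are broken in favour of the larger label (assumption: 1 = disease)
    return max(totals, key=lambda label: (totals[label], label))


# Hypothetical validation-based weights for the three base classifiers
weights = {"cnn_lstm": 0.90, "xgb": 0.88, "knn": 0.85}

# Two of three classifiers predict class 1
print(weighted_majority_vote({"cnn_lstm": 1, "xgb": 0, "knn": 1}, weights))

# Two of three classifiers predict class 0
print(weighted_majority_vote({"cnn_lstm": 1, "xgb": 0, "knn": 0}, weights))
```

With equal weights this reduces to plain majority voting; with unequal weights, a single strong classifier can outvote two weak ones, which is the trade-off to keep in mind when weights are derived from validation performance.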

Experiments and results

Datasets

As mentioned earlier, our experiments involved utilizing three distinct datasets. Two of them are freely available heart disease datasets sourced from Kaggle and the next one is a local dataset collected from medical records of patients who visited Dr. Heshmat Hospital, Guilan, Iran. The comparison of these three datasets is shown in Table 1 and their details are provided in the following.

Table 1 Datasets comparison

Dataset I (https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset) contains 70,000 instances gathered from medical examinations and consists of 12 variables. The first 11 variables serve as input features, while the 12th variable is the output feature indicating the presence or absence of cardiovascular disease. It must be mentioned that this dataset contains many duplicate values and extreme outliers. Accordingly, in the preprocessing step, duplicated values and instances with extreme outliers were removed, and the number of records was reduced to 62,267. Moreover, the age attribute was converted from days to years. Based on the American Heart Association's normal range, systolic and diastolic blood pressure were changed from numerical to nominal for better analysis. Feature descriptions and their distribution after preprocessing are provided in Table 2.
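
The two transformations mentioned for Dataset I can be sketched as below. The function names are hypothetical, and the exact conversion and category set used in the experiments are not stated in the text. The sketch assumes a 365.25-day year truncated to whole years and the blood pressure categories published by the American Heart Association (normal, elevated, hypertension stage 1 and 2, hypertensive crisis).

```python
def age_in_years(age_days):
    """Convert an age given in days to completed years (assumes 365.25 days/year)."""
    return int(age_days / 365.25)


def bp_category(systolic, diastolic):
    """Map a blood pressure reading (mm Hg) to an AHA-style nominal category."""
    if systolic > 180 or diastolic > 120:
        return "hypertensive crisis"
    if systolic >= 140 or diastolic >= 90:
        return "hypertension stage 2"
    if systolic >= 130 or diastolic >= 80:
        return "hypertension stage 1"
    if systolic >= 120:
        return "elevated"  # 120-129 systolic with diastolic below 80
    return "normal"        # below 120 systolic and below 80 diastolic


print(age_in_years(18393))
print(bp_category(110, 70))
print(bp_category(135, 85))
```

Checking the most severe category first keeps the rules simple, because the AHA stages are defined with "or" conditions: a reading belongs to the highest stage that either value reaches.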

Table 2 Dataset I features descriptions and their distribution

Dataset II (https://www.kaggle.com/datasets/mexwell/heart-disease-dataset/data) includes 1190 instances obtained from the combination of five original datasets across 11 shared attributes and one attribute as the predictor presenting the presence or absence of cardiovascular disease in a patient. The compilation of this dataset includes the incorporation of five original datasets including Cleveland (303 samples), Hungarian (294 samples), Switzerland (123 samples), Long Beach VA (200 samples), and Statlog datasets (270 samples). This dataset also includes missing and duplicate values that were removed along the preprocessing step and the number of records was reduced to 918. Features descriptions and their distribution after preprocessing are provided in Table 3.

Table 3 Dataset II features descriptions and their distribution

To enhance the foundation of our analysis, we opted to gather a local dataset that mirrors the attributes of Dataset II. Accordingly, a local dataset was collected from the medical records of patients who visited Dr. Heshmat Hospital from January to June 2023. The collected dataset, called “Dataset III”, contains 600 records and 12 features, of which 11 are used as predictors and one is the output label indicating the absence or presence of cardiovascular disease. This dataset also includes missing values; the affected records were removed in the preprocessing step, reducing the number of records to 577. Feature descriptions and their distribution after preprocessing are provided in Table 4.

Table 4 Dataset III features descriptions and their distribution

Evaluation metrics

To assess the performance of the implemented models, an appropriate set of metrics is needed. In this study, accuracy, precision, recall, F1 score, and specificity were used to evaluate the proposed model, as given in Eqs. (1) to (5). In these equations, TP is the number of positive samples correctly identified as positive, while FP is the number of negative samples incorrectly classified as positive. FN is the number of positive samples wrongly classified as negative, and TN is the number of negative samples correctly recognized as negative.

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN},$$
(1)
$$\text{Precision}=\frac{TP}{TP+FP},$$
(2)
$$\text{Recall}=\frac{TP}{TP+FN},$$
(3)
$$\text{F1 score}=2\times\frac{\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}},$$
(4)
$$\text{Specificity}=\frac{TN}{TN+FP}.$$
(5)
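
As a worked example, the metrics defined above, together with specificity in its standard form TN / (TN + FP), can be computed directly from the four confusion-matrix counts. The function name and the counts below are invented for illustration and do not come from the reported experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for 100 test patients
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts, accuracy is 85/100, recall is 40/50, and specificity is 45/50. The sketch does not guard against empty denominators, which would need handling if a model predicted no positives at all.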

Implementation details and hyperparameters

To take advantage of the processing power of the graphics processor, all instructions were implemented using Google Colab based on Python 3.10 as the programming language. The hardware infrastructure for running the proposed model was a system with an Intel Core i5 processor, 8 GB RAM, and an Ubuntu distribution as the operating system.

Notably, oversampling techniques along with cross-validation were employed in our implementations to maintain data integrity and prevent data leakage. To this end, the original datasets were split into a training set and a separate holdout test set (80% for training and 20% for testing), ensuring that the test set is not used during cross-validation. Then, the tenfold cross-validation technique was utilized to split the training data into multiple folds. For each fold in the cross-validation process, the oversampling technique was only applied to the training data within that fold to ensure that oversampling is performed independently for each fold, preventing data leakage across folds. Thereafter, the model was trained on the training data within each fold and its performance was evaluated on the validation data within that fold. Finally, performance metrics for each fold were measured and then aggregated to obtain an overall assessment of the model's performance. After completing the cross-validation process, the final model was evaluated on the holdout test set that was initially set aside to provide an additional independent evaluation of our model's performance.
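
The leakage-free protocol described above can be sketched as follows. The sketch covers only the fold construction, assuming the 20% holdout set has already been set aside. Random oversampling by duplication is an assumption made here, since the text does not name the oversampling technique, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def random_oversample(indices, labels, rng):
    """Duplicate minority-class indices at random until all classes are equal in size."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (oversampled training indices, untouched validation indices) per fold."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(labels), k, seed)
    for held_out, val_idx in enumerate(folds):
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # Oversampling sees only the training part of this fold
        yield random_oversample(train_idx, labels, rng), val_idx


# Hypothetical imbalanced training labels: 30 positive, 70 negative
example_labels = [1] * 30 + [0] * 70
splits = list(cross_validation_splits(example_labels, k=10))
for train_idx, val_idx in splits[:2]:
    print(len(train_idx), len(val_idx))
```

Because duplicates are drawn only from the training indices of each fold, no copy of a validation record can appear in the data the model is trained on, which is the property that prevents leakage.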

Hyperparameters are external settings that influence an algorithm's behavior and can greatly affect the model's performance and generalization ability. Since they directly impact how well a model performs and predicts, it is important to tune these hyperparameters carefully. In our study, we made rigorous efforts to appropriately configure hyperparameters for each algorithm to optimize performance and ensure effective pattern capture. Details of the hyperparameters employed are outlined in Table 5.

Table 5 Summary of used hyperparameters  

Performance evaluation

The goal of this paper is to introduce a model based on the combination of machine learning and deep learning models to predict the risk of cardiovascular diseases. Accordingly, basic machine learning and deep learning models besides combinational models were implemented on three mentioned datasets. The average performance metrics across the 10 folds of cross-validation with the standard deviation on each dataset are provided in the following tables (Tables 6, 7, and 8).

Table 6 Comparison of evaluation metrics on Dataset I
Table 7 Comparison of evaluation metrics on Dataset II
Table 8 Comparison of evaluation metrics on Dataset III

To better assess the classification performance of our proposed model compared to other models, namely CNN-LSTM, KNN, and XGB, their ROC curves on the three mentioned datasets are, respectively, depicted in Fig. 3.

Fig. 3 ROC curves of our proposed model compared to other models on the three datasets

Based on the result of experiments on all three datasets, it can be concluded that:

  • Among traditional machine learning-based models, KNN has the highest classification accuracy on all datasets. XGB also presents the highest accuracy among ensemble-based models on all datasets.

  • Comparing baselines with combinational models, it is clear that combinational models outperform any single model alone because they leverage the strengths of multiple individual models by combining their predictions.

  • Considering the results of deep learning-based models, it can be concluded that they have higher classification accuracy than both traditional and combinational models due to their ability to automatically learn intricate patterns and features from raw data.

  • The main conclusion can be drawn from the last rows of Tables 6, 7, and 8, which showcase the advantages of our proposed model and underscore the effectiveness of our combinational learning model in comparison to the others. Overall, our model outperforms all other models on all evaluation metrics across all datasets, making it a suitable benchmark for future research.

  • Since our proposed model has the highest performance based on all evaluation metrics on all three datasets, it can be stated that it generalizes well to new, unseen data, which is an important characteristic for any predictive model. Its consistent predictive behavior across datasets is also a good indication that the model is not overfitting to a particular dataset.

  • Considering Fig. 3, it can be seen that our proposed model has the closest ROC curve to the top-left corner on all three datasets which signals that the model has strong overall performance across different thresholds, making it a dependable predictor. Therefore, it can be claimed that our proposed model has strong discriminative power, meaning it is effective at distinguishing between classes.
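
The ROC-based reading in the last point can be made concrete with a short sketch: an ROC curve is obtained by sweeping the decision threshold over the predicted scores and recording the false positive rate and true positive rate at each step, and the area under the curve (AUC) summarizes how close the curve lies to the top-left corner. The scores and labels below are invented for illustration, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical predicted probabilities and true labels (1 = disease present)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(auc(curve), 4))
```

An AUC of 1.0 corresponds to a curve passing through the top-left corner, and 0.5 to random guessing. In this toy example, 8 of the 9 positive-negative pairs are ranked correctly, giving an AUC of 8/9.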

Discussion

Precisely forecasting the risk of cardiovascular disease is essential for early intervention and better patient outcomes. This paper proposed a holistic approach that integrates machine learning and deep learning models to improve the accuracy of cardiovascular disease prediction. According to the empirical results, our combinational model achieved the highest classification performance on all evaluation metrics, indicating that this combination offers a more comprehensive approach to analyzing complex cardiovascular disease data than using just one type of model. To assess the efficiency of our proposed model, it is crucial to compare it with current state-of-the-art approaches. Because the majority of prior research used the Cleveland dataset, a component of Dataset II, for evaluation, we applied our proposed model with all specified configurations to the Cleveland dataset to ensure a fair and comprehensive comparison. A comparison of the existing studies and our proposed model on the Cleveland dataset is provided in Table 9. For further comparison (Table 10), we also compared our proposed model with studies that conducted their experiments on Dataset I. As can be seen, our proposed model obtained the highest classification accuracy compared to the state-of-the-art approaches on both datasets.

Table 9 Accuracy comparison of the existing studies and our proposed model on the Cleveland dataset
Table 10 Accuracy comparison of the existing studies and our proposed model on Dataset I

Even though our proposed model performs well in research settings, transitioning it into clinical practice requires rigorous validation and regulatory approval, and it may not generalize well to different settings or populations. Differences in patient demographics, medical procedures, and treatment guidelines can influence how well predictive models work when used in varying environments. Because disease trends and risk factors can change with lifestyle shifts, medical progress, and population demographics, a model created using historical data may struggle to adjust to these evolving trends, potentially leading to reduced prediction accuracy.

Conclusion

Cardiovascular disease is a prominent global cause of mortality, underscoring the critical need for early detection in healthcare settings. Artificial intelligence plays a crucial role in this area by pinpointing risk factors, facilitating predictive analytics, aiding decision-making, and fostering knowledge exploration. This contributes to proactive and personalized strategies for managing cardiovascular disease. Accordingly, a model that makes use of both machine learning and deep learning was proposed in this paper. The proposed model employed CNN and LSTM as the representatives of deep learning models, and KNN and XGB as the representatives of machine learning models. Each classifier predicts an output class, and majority voting is then used as an ensemble learner to predict the final output class.
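The majority-voting step summarized above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `majority_vote`, the toy predictions, and in particular the tie-breaking rule are assumptions. With four voters a 2-2 split is possible, and the text does not say how such ties are resolved, so the sketch falls back to a configurable label (here the positive class, 1).

```python
from collections import Counter


def majority_vote(predictions, tie_break=1):
    """Combine per-classifier class predictions by majority voting.

    predictions: dict mapping a classifier name to its list of predicted labels.
    tie_break: label returned when the top two classes get equal votes
               (an assumption; the tie rule is not specified in the text).
    """
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all classifiers must predict the same number of samples")
    final = []
    # zip(*...) yields, for each sample, the tuple of votes from all classifiers.
    for votes in zip(*predictions.values()):
        ranked = Counter(votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            final.append(tie_break)
        else:
            final.append(ranked[0][0])
    return final


if __name__ == "__main__":
    # Toy predictions for four samples, invented for illustration only.
    preds = {
        "CNN": [1, 0, 1, 0],
        "LSTM": [1, 0, 0, 1],
        "KNN": [0, 0, 1, 1],
        "XGB": [1, 1, 1, 0],
    }
    print(majority_vote(preds))  # [1, 0, 1, 1]; the last sample is a 2-2 tie
```

Resolving ties toward the positive class favors sensitivity over specificity; a different rule, such as deferring to the most accurate single classifier, would be equally easy to plug in.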

To demonstrate the effectiveness of the proposed model, we utilized two public datasets along with a locally collected dataset in our experiments. To ensure a valid comparison, all datasets were first processed using the same methods. The experimental results across all datasets showed that the proposed model outperformed both individual classifiers and combinations of classifiers. These findings highlight a precise model that can be utilized for predicting the risk of cardiovascular disease. Additionally, it offers an important utility for cardiologists and physicians in categorizing new patients and assessing the necessary human resources, including doctors, technicians, nurses, and vital medical equipment.

There are abundant opportunities to enhance this research and address the constraints of the current study. One strategy involves broadening the study by replicating the experiment using larger real-world datasets. Future research could investigate alternative combinations of machine learning and deep learning models for predicting cardiovascular disease. Moreover, implementing novel feature selection methods could offer a more comprehensive insight into crucial features, leading to improved prediction accuracy. Exploring the application of the proposed model in other domains holds promise and could be considered as a potential avenue for future research.

Availability of data and materials

Three datasets were used in our experiments. Two of them are freely available datasets that can be found on Kaggle. The third is a locally collected dataset that is available for academic purposes upon request.

References

  1. Bhatt CM, et al. Effective heart disease prediction using machine learning techniques. Algorithms. 2023;16(2):88.

  2. Ramesh T, et al. Predictive analysis of heart diseases with machine learning approaches. Malays J Comp Sci. 2022. https://doi.org/10.22452/mjcs.sp2022no1.10.

  3. Nagavelli U, Samanta D, Chakraborty P. Machine learning technology-based heart disease detection models. J Healthc Eng. 2022;2022:7351061.

  4. Al-Jammali K. Prediction of heart diseases using data mining algorithms. Informatica. 2023;47(5):e4467.

  5. Tsao CW, et al. Heart disease and stroke statistics—2022 update: a report from the American heart association. Circulation. 2022;145(8):e153–639.

  6. Saberi ZA, Sadr H, Yamaghani MR. An intelligent diagnosis system for predicting coronary heart disease. In: Saberi ZA, editor. 2024 10th International conference on artificial intelligence and robotics (QICAR). Qazvin: IEEE; 2024.

  7. Solanki A, Barot MP. Study of heart disease diagnosis by comparing various classification algorithms. Int J Eng Adv Technol. 2019;8(2S2):40–2.

  8. Bhavekar GS, et al. Heart disease prediction using machine learning, deep learning and optimization techniques—a semantic review. Multimed Tools Appl. 2024. https://doi.org/10.1007/s11042-024-19680-0.

  9. Vijayarani S, Dhayanand S, Phil M. Kidney disease prediction using SVM and ANN algorithms. Int J Comp Bus Res (IJCBR). 2015;6(2):1–12.

  10. Bakar WA, et al. A review: heart disease prediction in machine learning and deep learning. In: Bakar WA, editor., et al., 2023 19th IEEE International colloquium on signal processing and its applications (CSPA). Kedah: IEEE; 2023.

  11. Mohades Deilami F, Sadr H, Tarkhan M. Contextualized multidimensional personality recognition using combination of deep neural network and ensemble learning. Neural Proc Lett. 2022;54(5):3811–28.

  12. Ogunpola A, et al. Machine learning-based predictive models for detection of cardiovascular diseases. Diagnostics. 2024;14(2):144.

  13. Latha CBC, Jeeva SC. Improving the accuracy of prediction of heart disease risk based on ensemble classification techniques. Informatics in Medicine Unlocked. 2019;16: 100203.

  14. Garavand A, et al. The attributes of hospital-based coronary artery diseases registries with a focus on key registry processes: a systematic review. Health Information Management Journal. 2022;51(2):63–78.

  15. Alotaibi N, Alzahrani M. Comparative analysis of machine learning algorithms and data mining techniques for predicting the existence of heart disease. Int J Adv Comp Sci Applx. 2022;13(7):810.

  16. Ayatollahi H, Gholamhosseini L, Salehi M. Predicting coronary artery disease: a comparison between two data mining algorithms. BMC Public Health. 2019;19(1):1–9.

  17. Alizadehsani R, et al. Machine learning-based coronary artery disease diagnosis: a comprehensive review. Comput Biol Med. 2019;111: 103346.

  18. Garavand A, et al. Designing the coronary artery disease registry with data management processes approach: a comparative systematic review in selected registries. Int Cardiovasc Res J. 2020;14(1): 100833.

  19. Khodaverdian Z, et al. An energy aware resource allocation based on combination of CNN and GRU for virtual machine selection. Multimed Tools Appl. 2023. https://doi.org/10.1007/s11042-023-16488-2.

  20. Alkayyali Z, Idris S, Abu-Naser SS. A systematic literature review of deep and machine learning algorithms in cardiovascular diseases diagnosis. J Theor Appl Inf Technol. 2023;101(4):1353–65.

  21. Naser MA, et al. A review of machine learning’s role in cardiovascular disease prediction: recent advances and future challenges. Algorithms. 2024;17(2):78.

  22. Samavat T, Hojatzadeh E. Programs for prevention and control of cardiovascular diseases. Javan: Ministry of Health; 2012.

  23. Goswami AD, Bhavekar GS, Chafle PV. Electrocardiogram signal classification using VGGNet: a neural network based classification model. Int J Inf Technol. 2023;15(1):119–28.

  24. Bhavekar GS, Goswami AD. Herding exploring algorithm with light gradient boosting machine classifier for effective prediction of heart diseases. Int J Swarm Intell Res (IJSIR). 2022;13(1):1–22.

  25. Subramani S, et al. Cardiovascular diseases prediction by machine learning incorporation with deep learning. Front Med. 2023;10:1150933.

  26. Sadr H, et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms. In: 2019 5th International conference on web research (ICWR). IEEE; 2019.

  27. Ahamad GN, et al. Influence of optimal hyperparameters on the performance of machine learning algorithms for predicting heart disease. Processes. 2023;11(3):734.

  28. Akkaya B, Sener E, Gursu C. A comparative study of heart disease prediction using machine learning techniques. In: 2022 International congress on human–computer interaction, optimization and robotic applications (HORA). IEEE; 2022.

  29. Tougui I, Jilbab A, El Mhamdi J. Heart disease classification using data mining tools and machine learning techniques. Heal Technol. 2020;10:1137–44.

  30. Amin MS, Chiam YK, Varathan KD. Identification of significant features and data mining techniques in predicting heart disease. Telematics Inform. 2019;36:82–93.

  31. Subanya B, Rajalaxmi R. Feature selection using artificial bee colony for cardiovascular disease classification. In: 2014 International conference on electronics and communication systems (ICECS). IEEE; 2014.

  32. Mokeddem S, Atmani B, Mokaddem M. Supervised feature selection for diagnosis of coronary artery disease based on genetic algorithm. arXiv preprint. 2013. https://arxiv.org/abs/1305.6046.

  33. Khanna D, et al. Comparative study of classification techniques (SVM, logistic regression and neural networks) to predict the prevalence of heart disease. Int J Mach Learn Comput. 2015;5(5):414.

  34. Kumar MN, Koushik K, Deepak K. Prediction of heart diseases using data mining and machine learning algorithms and tools. Int J Sci Res Comput Sci Eng Inform Technol. 2018;3(3):887–98.

  35. Acharya A. Comparative study of machine learning algorithms for heart disease prediction. 2017.

  36. Arroyo JCT, Delima AJP. An optimized neural network using genetic algorithm for cardiovascular disease prediction. J Adv Inform Technol. 2022. https://doi.org/10.12720/jait.13.1.95-99.

  37. Lin C-M, Lin Y-S. Utilizing a two-stage taguchi method and artificial neural network for the precise forecasting of cardiovascular disease risk. Bioengineering. 2023;10(11):1286.

  38. Lin C-M, Lin Y-S. TPTM-HANN-GA: a novel hyperparameter optimization framework integrating the Taguchi method, an artificial neural network, and a genetic algorithm for the precise prediction of cardiovascular disease risk. Mathematics. 2024;12(9):1303.

  39. Chaithra N, Madhu B. Classification models on cardiovascular disease prediction using data mining techniques. Cardiovasc Dis Diagn. 2018;6(6):1–4.

  40. Nazari M, et al. Detection of cardiovascular diseases using data mining approaches: application of an ensemble-based model. Cogn Comput. 2024. https://doi.org/10.1007/s12559-024-10306-z.

  41. Singhal S, Kumar H, Passricha V. Prediction of heart disease using CNN. Am Int J Res Sci Technol Eng Math. 2018;23(1):257–61.

  42. Dutta A, et al. An efficient convolutional neural network for coronary heart disease prediction. Expert Syst Appl. 2020;159: 113408.

  43. Mehmood A, et al. Prediction of heart disease using deep convolutional neural networks. Arab J Sci Eng. 2021;46(4):3409–22.

  44. Tarawneh M, Embarak O. Hybrid approach for heart disease prediction using data mining techniques. In: Barolli L, Xhafa F, Khan ZA, Odhabi H, editors. International conference on emerging internetworking, data and web technologies. Cham: Springer; 2019.

  45. Bhavekar GS, Goswami AD. A hybrid model for heart disease prediction using recurrent neural network and long short term memory. Int J Inf Technol. 2022;14(4):1781–9.

  46. Subhadra K, Vikas B. Neural network based intelligent system for predicting heart disease. Int J Innov Technol Explor Eng. 2019;8(5):484–7.

  47. Van VD, Thai T, Nghiem MQ. Combining convolution and recursive neural networks for sentiment analysis. In: Proceedings of the 8th international symposium on information and communication technology. 2017.

  48. Taye MM. Understanding of machine learning with deep learning: architectures, workflow, applications and future directions. Computers. 2023;12(5):91.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author information

Contributions

H.S., A.S., and M.N. conceived of the presented idea. M.N. developed the theory and performed the computations. H.S. and M.A. conceived the study and were in charge of overall direction and planning. A.S. and M.A. verified the analytical methods and obtained results. All authors discussed the results and contributed to the final manuscript.

Corresponding authors

Correspondence to Hossein Sadr or Mojdeh Nazari.

Ethics declarations

Ethics approval and consent to participate

The project was found to be in accordance with the ethical principles and the national norms and standards for conducting Medical Research in Iran (Approval ID: IR.GUMS.REC.1402.516).

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Sadr, H., Salari, A., Ashoobi, M.T. et al. Cardiovascular disease diagnosis: a holistic approach using the integration of machine learning and deep learning models. Eur J Med Res 29, 455 (2024). https://doi.org/10.1186/s40001-024-02044-7

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s40001-024-02044-7

Keywords