The data set is publicly available on Kaggle and comes from an ongoing cardiovascular study of residents of the town of Framingham, Massachusetts. The classification goal is to predict whether a patient is at risk of developing coronary heart disease (CHD) within the next 10 years. The data set contains over 4,000 patient records and 15 attributes, each of which is a potential risk factor. The risk factors are demographic, behavioral, and medical.
1. The general approach that I employed
Data cleaning and preprocessing
Exploratory Data Analysis
Feature Selection
Model development and comparison
The accuracy score
The F1 score
The area under the ROC curve (AUC)
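To make the three comparison metrics concrete, here is a minimal pure-Python sketch of how each can be computed for binary labels. The function names, toy labels, and toy scores are illustrative assumptions and are not taken from the project; in practice a library implementation would normally be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the
    number of samples, which is fine for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = CHD within 10 years, 0 = no CHD.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)   # 5 of 6 correct
f1_score = f1(y_true, y_pred)    # precision 1.0, recall 2/3
auc = roc_auc(y_true, scores)    # 8 of 9 positive/negative pairs ranked correctly
```

Accuracy and F1 depend on the chosen threshold (0.5 here), while AUC is computed from the raw scores and summarizes ranking quality across all thresholds.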
Observation
a) Among the models compared, including XGBoost, the SVM gives the highest accuracy, recall, precision, and AUC score.
b) The highest recall is given by the SVM.
c) The highest AUC is given by the SVM.
Overall, the support vector machine was the best-performing model across all metrics. Its best parameters were a radial kernel, a C value of 10, and a gamma value of 1. Its high AUC and F1 score also show that the model has a high true positive rate and is therefore sensitive in identifying patients at high risk of developing CHD, i.e., of getting a heart attack within 10 years.
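As an illustration of the reported configuration, the following sketch fits an SVM with a radial (RBF) kernel, C = 10, and gamma = 1. It assumes scikit-learn, which the write-up does not name, and it uses synthetic data from `make_classification` as a stand-in for the study data. The feature scaling step is also an assumption, added because RBF kernels are sensitive to feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the study data (hypothetical): 15 attributes,
# with the positive class as the minority.
X, y = make_classification(
    n_samples=400, n_features=15, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# RBF ("radial") kernel with the reported hyperparameters C=10 and gamma=1.
# StandardScaler is an assumed preprocessing step, not stated in the write-up.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma=1))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

A larger C penalizes misclassified training points more heavily, and a larger gamma makes each training point's influence more local, so this combination yields a fairly flexible decision boundary.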
2. CHALLENGES
a) Handling the missing values.
b) Making data more accurate.
c) Selection of important features.
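For the first challenge, one common option is median imputation, sketched below in plain Python. This is an assumption for illustration only: the write-up does not state which strategy was actually used, and the helper name `impute_median` and the sample rows are hypothetical.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of rows with missing (None) values in column replaced
    by the median of the observed values in that column."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [
        {**r, column: fill} if r[column] is None else dict(r)
        for r in rows
    ]


# Made-up records with one missing glucose reading.
rows = [
    {"glucose": 77.0},
    {"glucose": None},
    {"glucose": 85.0},
    {"glucose": 99.0},
]
filled = impute_median(rows, "glucose")
```

The median is less affected by extreme readings than the mean, which is why it is often preferred for skewed medical measurements.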
3. CONCLUSION
a) The number of people who have coronary heart disease is almost equal between smokers and non-smokers.
b) The top features in predicting the ten-year risk of developing coronary heart disease are 'age', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heart rate', and 'glucose'.
c) The SVM with the radial kernel is the best-performing model in terms of accuracy and the F1 score.
d) Balancing the dataset with the SMOTE technique helped improve the models' sensitivity.
With more data (especially for the minority class), better models can be built.
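The core idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbours. The code below is a simplified illustration of that idea, not the implementation used in the project; the function name `smote_like` and the sample points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest neighbours of base by squared Euclidean distance.
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Made-up minority-class samples as (age, sysBP) pairs.
minority = [(61.0, 150.0), (63.0, 168.0), (58.0, 141.0), (66.0, 172.0)]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, oversampling adds variety without simply duplicating rows. It should be applied to the training split only, so that evaluation data stays untouched.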