Predicting Strokes

Machine Learning, Siena University, Spring 2024

Project Description

This project contains data and graphs on predicting strokes. Working as a team with a dataset of stroke patients, we tried to predict stroke risk using several machine learning methods. According to the World Health Organization, stroke is the second leading cause of death globally. If we can identify relationships between strokes and certain variables, then perhaps better prevention methods can be developed. Using Python, we built four models: Random Forest, Neural Network, Support Vector Machine, and Logistic Regression. We focused on the frequency of strokes across different ages and on the correlation of strokes with BMI, age, and average glucose level.

Project Question

Can we predict strokes with machine learning, using factors such as average glucose level, BMI, and age?

Data Analysis and System Performance


Graph 1 Percentage of Stroke Cases by Heart Disease Status:

Compares the percentage of patients who had a stroke among those with heart disease versus those without heart disease.
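
A percentage like the one in this graph can be computed by grouping patients by heart disease status and dividing stroke cases by group size. The sketch below is illustrative only: the `records` list is made-up data, and the function name `stroke_rate_by_group` is hypothetical, not taken from the project code.

```python
from collections import defaultdict

# Hypothetical records, NOT the project dataset: (heart_disease, stroke), each 0 or 1.
records = [
    (1, 1), (1, 0), (1, 0), (1, 0),
    (0, 1), (0, 0), (0, 0), (0, 0), (0, 0),
    (0, 0), (0, 0), (0, 0), (0, 0), (0, 0),
]

def stroke_rate_by_group(rows):
    """Return {heart_disease_flag: percentage of patients in that group with a stroke}."""
    totals = defaultdict(int)
    strokes = defaultdict(int)
    for heart_disease, stroke in rows:
        totals[heart_disease] += 1
        strokes[heart_disease] += stroke
    return {group: 100.0 * strokes[group] / totals[group] for group in totals}

rates = stroke_rate_by_group(records)
print(rates)
```

In this toy data, 1 of 4 patients with heart disease and 1 of 10 without had a stroke, so the two bars would sit at 25% and 10%.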

Graph 2 Feature Importance:

Ranks the features from highest to lowest importance. The top three are average glucose level, BMI, and age.
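
One simple way to rank features is by the strength of their correlation with the stroke label. The sketch below assumes a plain Pearson correlation; the project's graph may instead use a model-based importance score, such as one taken from the Random Forest. The column names and all values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, NOT the project dataset.
features = {
    "age": [25, 40, 55, 67, 80, 72],
    "bmi": [22.0, 31.5, 27.3, 36.6, 28.9, 24.1],
    "avg_glucose_level": [85.0, 92.1, 105.9, 228.7, 174.1, 95.0],
}
stroke = [0, 0, 0, 1, 1, 0]

# Rank by absolute correlation, strongest first.
scores = {name: pearson(values, stroke) for name, values in features.items()}
ranking = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.2f}")
```

Correlation only captures linear, one-feature-at-a-time relationships, so a ranking produced this way can differ from a model's own importance scores.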

Graph 3 Model Accuracy:

Shows the accuracy of our Neural Network model during training. The blue line is the training data and the orange line is the testing data. The x-axis represents epochs (complete passes through the dataset), and the y-axis represents accuracy (on a 0–1 scale, equivalently 0–100%).

Graph 4 Model Loss:

Model loss provides insights into training dynamics and convergence. Like the accuracy graph, the x-axis represents epochs and the y-axis represents the loss value. We used 192 neurons and 4 layers.
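
To make the loss value on the y-axis concrete, here is a minimal sketch of binary cross-entropy, a common loss for yes/no targets such as stroke versus no stroke. That the project used this particular loss is an assumption, and the labels and predicted probabilities below are made up.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and predictions at two points in training.
labels = [0, 0, 1, 0]
early_epoch = [0.5, 0.5, 0.5, 0.5]   # an undecided model
later_epoch = [0.1, 0.2, 0.8, 0.1]   # more confident and mostly correct

loss_early = binary_cross_entropy(labels, early_epoch)
loss_later = binary_cross_entropy(labels, later_epoch)
print(f"early: {loss_early:.3f}, later: {loss_later:.3f}")
```

A falling curve in the loss graph corresponds to predictions moving from the first situation toward the second.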

Classification Report and Findings

Metric       Support Vector Machine   Random Forest   Logistic Regression   Neural Network
Precision    0.33                     0.2             0.33                  0.22
Recall       0.06                     0.06            0.03                  0.18
F1 Score     0.11                     0.1             0.06                  0.30
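
For reference, the three metrics in the report are derived from counts of true positives, false positives, and false negatives. The sketch below shows the standard definitions; the counts are hypothetical and are not the project's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 strokes caught, 2 false alarms, 12 strokes missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=12)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall is the share of actual positive cases that the model catches, so a low recall for the stroke class means most stroke cases are missed even when precision looks reasonable.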

Our research provided some insight into stroke prevention, with the Neural Network proving the best of the four models at predicting strokes: it had the highest recall and F1 score. This may be because a Neural Network can capture complex patterns effectively and tends to handle large datasets well. We did have to tune the number of epochs to get the best results; its best score was 80, the highest of all the models.


Dataset Source: Kaggle