One of the applications of Machine Learning is predicting the likelihood of an event from past data. With our learning so far, from exploring data to testing hypotheses, let's see if we can put it together to predict the likelihood of onset of diabetes from diagnostic measures.
Data Source
We will be working on the Pima Indians Diabetes Data Set from the UCI Machine Learning Repository.
Data Set Information
This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains diagnostic measures for female patients at least 21 years of age who are of Pima Indian heritage.
Please note: This is a labelled data set.
Number of Records : 768
Number of Attributes : 8
Attribute Information
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-Hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)
Taking a quick glance at the data
Please note: The column names are shortened for convenience
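That quick glance can be sketched in Python with pandas. The shortened column names below are assumed shorthand, and two made-up rows stand in for the downloaded file so the snippet is self-contained:

```python
import pandas as pd

# Shortened column names used throughout this post (assumed shorthand).
cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DPF", "Age", "Outcome"]

# The full file would be read with something like:
#   df = pd.read_csv("pima-indians-diabetes.csv", header=None, names=cols)
# Two illustrative rows stand in for it here.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1,  85, 66, 29, 0, 26.6, 0.351, 31, 0],
]
df = pd.DataFrame(rows, columns=cols)
print(df.head())
```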
The action begins
Now that we have sourced the data, let's use our learning to start exploring it.
Summarizing the data set is a simple but powerful first step in understanding the data. As we can see here, a few measures have zero values that are biologically impossible (blood pressure, for example). This leads us to the next action: handling the missing/invalid data. It's also a good idea to visualize the data.
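A minimal sketch of that summary step, on a made-up frame (the real analysis would run this on all 768 records):

```python
import pandas as pd

# Made-up frame with one biologically impossible zero blood pressure.
df = pd.DataFrame({
    "Glucose":       [148.0, 85.0, 183.0, 89.0],
    "BloodPressure": [72.0, 66.0, 0.0, 66.0],
    "BMI":           [33.6, 26.6, 23.3, 28.1],
})

print(df.describe())           # the "min" row exposes the impossible zeros
zero_counts = (df == 0).sum()  # count zeros per measure
print(zero_counts)
```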
A Quick Look
As the visuals suggest, Glucose, Blood Pressure, Skin Thickness, and Body Mass Index are roughly normally distributed, while Pregnancies, Insulin, DPF, and Age are exponentially distributed.
Also keep in mind that these visuals are rendered without handling the missing values. The next step is to treat the data for missing values.
Data Preparation
Missing Values
There are zero-value entries for the following measures. We'll impute each with the mean value of the corresponding measure.
- Glucose
- Blood Pressure
- Skin Thickness
- Insulin
- Body Mass Index
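Mean imputation can be sketched like this: treat the zeros as missing, then fill with the column mean. The values below are illustrative, and `impute_cols` would list all five measures above:

```python
import numpy as np
import pandas as pd

# Illustrative column only; the real code would list all five measures.
df = pd.DataFrame({"Glucose": [148.0, 0.0, 183.0, 89.0]})
impute_cols = ["Glucose"]

# Treat zeros as missing, then fill with the mean of the remaining values.
df[impute_cols] = df[impute_cols].replace(0.0, np.nan)
df[impute_cols] = df[impute_cols].fillna(df[impute_cols].mean())
print(df["Glucose"].tolist())  # the zero becomes (148 + 183 + 89) / 3 = 140
```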
Pregnancies
A quick tabulation of this column shows the minimum value is zero (permissible) and the maximum is 17 (an outlier, but still theoretically possible).
Just to cross-check feasibility, let's look at that record and check the Age (which is 47). This suggests the value is at least theoretically possible.
As the values of Pregnancies seem to be biologically valid, no data handling is needed.
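That tabulation might look like this in pandas (the values below are illustrative, not the actual column):

```python
import pandas as pd

# Illustrative values only, chosen to span the observed range.
preg = pd.Series([0, 1, 1, 3, 6, 17], name="Pregnancies")
print(preg.min(), preg.max())            # 0 and 17, as in the data
print(preg.value_counts().sort_index())  # quick tabulation
```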
Visualizing Missing Values
Before Imputing
After Imputing
With cleaner data, we are geared up to build models and make predictions.
Building a model
We try building a logistic regression model on the data, as the Outcome is categorical.
This is easy, isn't it?
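The fit itself would typically be one call to R's glm(), statsmodels, or scikit-learn's LogisticRegression. As a self-contained sketch, here is the same kind of model fitted by plain gradient descent on the log-loss, using synthetic data (every number below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the cleaned data: one standardized feature
# with a true log-odds coefficient of 1.5 (all numbers made up).
n = 500
x = rng.normal(0.0, 1.0, n)
p_true = 1 / (1 + np.exp(-(0.3 + 1.5 * x)))
y = (rng.random(n) < p_true).astype(float)

# Plain gradient descent on the log-loss; a library fit would
# do this (better) in a single call.
X = np.column_stack([np.ones(n), x])
w = np.zeros(2)
for _ in range(3000):
    p = 1 / (1 + np.exp(-X @ w))   # predicted probabilities
    w -= 0.5 * X.T @ (p - y) / n   # gradient step

print(w)  # [intercept, slope]; the slope recovers a clearly positive effect
```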
The summary gives us interesting information (listed below).
Out of the eight features we have, four stand out as significant in influencing whether a person is diabetic or not (Outcome):
- Pregnancies
- Glucose
- BMI
- DPF
Let's look at the Odds Ratio to see to what extent these features influence the Outcome.
Odds Ratio
Each pregnancy is associated with an odds increase of 13.31%
Each unit increase in plasma glucose concentration is associated with an odds increase of 3.80%
A unit increase in BMI increases the odds by 9.76%
For every unit increase in DPF, the odds go up by 137.76%
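These percentages come from exponentiating the fitted log-odds coefficients: the odds ratio is exp(β), and the percent change in odds is (exp(β) − 1) × 100. As a sketch, a coefficient of roughly 0.125 would reproduce the 13.31% figure quoted above (the 0.125 is back-derived for illustration, not the model's actual coefficient):

```python
import math

def odds_pct_change(beta):
    """Percent change in odds per unit increase, from a logit coefficient."""
    return (math.exp(beta) - 1) * 100

# A coefficient of about 0.125 corresponds to the ~13.31% quoted above.
print(round(odds_pct_change(0.125), 2))  # 13.31
```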
Prediction
Here comes the interesting part. Having built a model that can predict the likelihood of onset of diabetes, let's try some predictions and see how well the model works.
Here is the confusion matrix from prediction.
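Accuracy is simply the diagonal of the confusion matrix over the total. The cell counts below are hypothetical, chosen only so that 768 records yield the 77.73% reported; the actual counts are the ones in the matrix:

```python
def accuracy_from_confusion(tn, fp, fn, tp):
    """Correct predictions (the diagonal) over all predictions."""
    return (tn + tp) / (tn + fp + fn + tp)

# Hypothetical cell counts summing to 768, for illustration only.
print(round(accuracy_from_confusion(tn=440, fp=60, fn=111, tp=157), 4))  # 0.7773
```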
Confusion Matrix Visualized
Closing Notes
This model has an accuracy of 77.73%, but note that the model was evaluated on the same data it was trained on.
Now a few questions linger in our minds:
Are there any disadvantages to building and testing the model on the same data set?
How can we apply this model to new data with confidence?
How can I ensure this is a good model for prediction?
Can I build a better model?
and so on.
Let's try to answer these questions (and more) in subsequent posts.
P.S.: The complete code that was used for this article is here.