Tuesday, May 23, 2017


Predicting the onset of Diabetes - II

We concluded our previous attempt to predict the onset of diabetes with a few questions. Let's try to answer them in this post.

Recap


The model that we built previously had an accuracy of 77.73%, with a caveat: we used the same data both to train the model and to make the predictions. And we had a few questions.

Is there a better model to predict? If so, how do we build and measure it?

Yes. Many models can be built on this data set. Let's build one more and measure it.

Are there any disadvantages in building the model and testing the model on the same data set?

Certainly. It is like seeing the questions before taking an examination: any average person can memorize the answers and improve their chances of scoring well. The same applies to the model we build; it stands a better chance of high accuracy if we predict on the very data it was trained on.

A different model - Decision Trees


Let's build a decision tree on the data set (without going through the math behind it). We'll use the model to make predictions (as we did earlier), and we'll also measure its accuracy.

Here we'll build a Decision Tree using Recursive Partitioning to classify the data. The technique classifies members of the population by splitting it into sub-populations based on several dichotomous independent variables. The process is termed recursive because each sub-population may in turn be split an indefinite number of times, until the splitting terminates when a stopping criterion is reached.

A major advantage of this method is that it is simple and intuitive.

# rpart builds the tree; rpart.plot provides prp() for plotting it
library(rpart)
library(rpart.plot)

rpart.fit <- rpart(Outcome ~ ., data = pimaimpute)
prp(rpart.fit, faclen = 0, box.palette = "auto", branch.type = 5, cex = 1)

Doing that produces this beautiful Decision Tree.



How to read this?


The model first splits the data set into two sections based on Glucose level, then on Age (for the section where Glucose is < 128) and on BMI, and so on until a stopping criterion is reached.

Now that we have built the model, let's use it to predict (on the same data set).

# predict.rpart expects the argument 'newdata' (a 'data' argument would be ignored)
rpart.pred <- predict(rpart.fit, newdata = pimaimpute, type = "class")


Let's measure the accuracy of this model.
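A minimal sketch of how that accuracy could be computed, assuming `rpart.pred` holds the class predictions from above and the true labels are in `pimaimpute$Outcome`:

```r
# Cross-tabulate predicted vs. actual classes
conf.mat <- table(Predicted = rpart.pred, Actual = pimaimpute$Outcome)
print(conf.mat)

# Accuracy = correctly classified (diagonal) / total observations
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
round(accuracy * 100, 2)
```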





Confusion Matrix Visualized



Closing Notes

This model has an accuracy of 83.98% (versus our logistic regression model's accuracy of 77.73%).

One important question that we'll address in the next post is:
Are there any disadvantages in building the model and testing the model on the same data set?


P.S.: The complete code that was used for this article is here.





Tuesday, May 9, 2017


Predicting the onset of Diabetes

One of the applications of Machine Learning is predicting the likelihood of an event occurring from past data. With our learning from exploring data to testing hypotheses, let's see if we can put it all together to predict the likelihood of the onset of diabetes from diagnostic measures.

Data Source

We would be working on Pima Indians Diabetes Data Set from UCI Machine Learning Repository.

Data Set Information

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. This data set contains diagnostic measures for female patients of at least 21 years of age who belong to Pima Indian Heritage.

Please note: This is a labelled data set.

Number of Records     :  768
Number of Attributes  :    8

Attribute Information

  1. Number of times pregnant 
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
  3. Diastolic blood pressure (mm Hg) 
  4. Triceps skin fold thickness (mm) 
  5. 2-Hour serum insulin (mu U/ml) 
  6. Body mass index (weight in kg/(height in m)^2) 
  7. Diabetes pedigree function 
  8. Age (years) 
  9. Class variable (0 or 1)

Taking a quick glance at the data


Please note: The column names are shortened for convenience

The action begins

Now that we have sourced the data, let's start exploring it.


Summarizing the data set is a simple but powerful first step in understanding the data. As we can see here, a few measures have zero values that are biologically impossible (blood pressure, for example). This leads us to the next step: handling the missing/invalid data. It is also a good idea to visualize the data.
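The summary step can be as simple as the following sketch (the file path and the data frame name `pima` are illustrative):

```r
# Load the Pima Indians Diabetes data (path is illustrative)
pima <- read.csv("pima-indians-diabetes.csv")

# Five-number summary plus mean for every column;
# zero minimums for measures like Glucose or BloodPressure flag invalid entries
summary(pima)
```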

A Quick Look

As the visuals suggest, Glucose, Blood Pressure, Skin Thickness and Body Mass Index are normally distributed, while Pregnancies, Insulin, DPF and Age are exponentially distributed.





Also keep in mind that these visuals are rendered without handling the missing values. The next step is to treat the data for missing values.

Data Preparation


Missing Values

The following measures contain zero-value entries. We'll impute them with the mean value of the corresponding measure.

  • Glucose
  • Blood Pressure
  • Skin Thickness
  • Insulin
  • Body Mass Index
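A sketch of that mean imputation, assuming the raw data frame is named `pima` (a name chosen here for illustration) and the imputed copy is `pimaimpute`, the name used later in the post. Zeros are first recoded as `NA` so they don't distort the means:

```r
pimaimpute <- pima
cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

for (col in cols) {
  # Treat biologically impossible zeros as missing
  pimaimpute[[col]][pimaimpute[[col]] == 0] <- NA
  # Replace missing values with the mean of the observed values
  pimaimpute[[col]][is.na(pimaimpute[[col]])] <-
    mean(pimaimpute[[col]], na.rm = TRUE)
}
```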

Pregnancies

A quick tabulation of this column shows the minimum value is zero (permissible) and the maximum is 17 (an outlier, but still theoretically possible).


Just to cross-check the feasibility, let's look at that record and check the Age (which is 47). This suggests the value is possible (at least theoretically).
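Both checks above can be sketched in a couple of lines (again assuming the data frame is named `pima`):

```r
# Frequency table of the Pregnancies column: min 0, max 17
table(pima$Pregnancies)

# Inspect the outlier record (17 pregnancies) to sanity-check its Age
pima[pima$Pregnancies == 17, ]
```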


As the values of Pregnancies seem to be biologically valid, no data handling is needed.

Visualizing Missing Values

Before Imputing

After Imputing
With cleaner data, we are geared up for building models and making predictions.

Building a model

We build a logistic regression model on the data, as the Outcome is categorical.
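A sketch of the fit, assuming the imputed data frame `pimaimpute` with the class label in a column named `Outcome` (the object name `glm.fit` is illustrative):

```r
# Logistic regression: binomial family with the default logit link
glm.fit <- glm(Outcome ~ ., data = pimaimpute, family = binomial)

# Coefficient estimates, standard errors and significance codes
summary(glm.fit)
```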


This is easy, isn't it?

The summary gives us interesting information (listed below).

Out of the eight features we have, four stand out as significant in influencing whether a person is diabetic (Outcome):

  • Pregnancies
  • Glucose
  • BMI
  • DPF 

Let's look at the Odds Ratio to see to what extent these features influence the Outcome.

Odds Ratio


  • Each pregnancy is associated with an odds increase of 13.31%
  • Each unit increase in plasma glucose concentration is associated with an odds increase of 3.80%
  • A unit increase in BMI increases the odds by 9.76%
  • For every unit increase in DPF, the odds go up by 137.76%
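The odds ratios above can be obtained by exponentiating the model coefficients (assuming the fitted logistic model is stored in an object such as `glm.fit`; the name is illustrative):

```r
# exp(coefficient) gives the multiplicative change in odds
# per unit increase of each predictor
odds.ratios <- exp(coef(glm.fit))

# Express each as a percentage change in the odds
round((odds.ratios - 1) * 100, 2)
```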

Prediction


Here comes the interesting part. Having built a model that can predict the likelihood of the onset of diabetes, let's make some predictions and see how well the model works.

Here is the confusion matrix from prediction.
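One way that confusion matrix could be produced, assuming the fitted model is `glm.fit` and predicted probabilities above 0.5 are classified as diabetic (both the object name and the threshold are assumptions):

```r
# Predicted probabilities on the training data
glm.prob <- predict(glm.fit, type = "response")

# Threshold at 0.5 to obtain class predictions
glm.pred <- ifelse(glm.prob > 0.5, 1, 0)

# Confusion matrix: predicted vs. actual
table(Predicted = glm.pred, Actual = pimaimpute$Outcome)
```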

Confusion Matrix Visualized


Closing Notes


This model has an accuracy of 77.73%, but note that the model was evaluated on the same data it was trained on.

Now a few questions linger:
  • Are there any disadvantages in building the model and testing the model on the same data set?
  • How can we apply this model to new data with confidence?
  • How can I ensure this is a good model for prediction?
  • Can I build a better model?
and so on.

Let's try to answer these questions (and more) in subsequent posts.


P.S.: The complete code that was used for this article is here.
