
Tuesday, May 9, 2017


Predicting the onset of Diabetes

One of the applications of Machine Learning is predicting the likelihood of an event occurring from past data. With our learning so far, from exploring data to testing hypotheses, let's see if we can put it all together to predict the likelihood of the onset of diabetes from diagnostic measures.

Data Source

We will be working with the Pima Indians Diabetes Data Set from the UCI Machine Learning Repository.

Data Set Information

This data set originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains diagnostic measures for female patients of at least 21 years of age who are of Pima Indian heritage.

Please note: This is a labelled data set.

Number of Records    : 768
Number of Attributes : 8 (plus the class variable)

Attribute Information

  1. Number of times pregnant 
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
  3. Diastolic blood pressure (mm Hg) 
  4. Triceps skin fold thickness (mm) 
  5. 2-Hour serum insulin (mu U/ml) 
  6. Body mass index (weight in kg/(height in m)^2) 
  7. Diabetes pedigree function 
  8. Age (years) 
  9. Class variable (0 or 1)

Taking a quick glance at the data


Please note: The column names are shortened for convenience

The action begins

Now that we have sourced the data, let's start exploring it.
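
Here is a minimal sketch of loading and summarizing the data; the file name and the exact shortened column names are assumptions:

# Load the data and assign shortened column names
diabetes <- read.csv("pima-indians-diabetes.csv")
names(diabetes) <- c("Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                     "Insulin", "BMI", "DPF", "Age", "Outcome")
head(diabetes)      # a quick glance at the first few records
summary(diabetes)   # min/quartiles/mean/max for every measure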


Summarizing the data set is a simple but powerful first step towards understanding the data. As we can see, a few measures have zero values that are biologically impossible (blood pressure, for example). This leads us to the next action: handling the missing/invalid data. It is also a good idea to visualize the data.

A Quick Look

As the visuals suggest, Glucose, Blood Pressure, Skin Thickness and Body Mass Index are approximately normally distributed, while Pregnancies, Insulin, DPF and Age are exponentially distributed.
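
A sketch of how such distribution plots can be rendered in base R (the grid layout is an assumption):

# Histogram for each of the eight measures, in a 3x3 grid
par(mfrow = c(3, 3))
for (col in names(diabetes)[1:8]) {
  hist(diabetes[[col]], main = col, xlab = col)
}
par(mfrow = c(1, 1))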





Also, please bear in mind that these visuals were rendered before handling the missing values. The next step is to treat the data for missing values.

Data Preparation


Missing Values

There are zero-value entries for the following measures. We'll impute them with the mean value of the corresponding measure (a sketch follows the list below).

  • Glucose
  • Blood Pressure
  • Skin Thickness
  • Insulin
  • Body Mass Index
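
Here is a minimal sketch of that imputation, assuming the shortened column names used earlier:

# Flag the biologically impossible zeros as NA, then mean-impute each column
cols_to_fix <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
for (col in cols_to_fix) {
  diabetes[[col]][diabetes[[col]] == 0] <- NA
  diabetes[[col]][is.na(diabetes[[col]])] <- mean(diabetes[[col]], na.rm = TRUE)
}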

Pregnancies

A quick tabulation on this column shows the minimum value is zero (permissible) and the maximum is 17 (an outlier, but still theoretically possible).


Just to cross-check the feasibility, let's look at that record and check the Age (which is 47). This suggests that such a value is possible (at least theoretically).


As the values of Pregnancies seem to be biologically valid, no data handling is needed.

Visualizing Missing Values
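
One common way to render such a map is the missmap() function from the Amelia package (the package choice here is an assumption; run it after flagging the invalid zeros as NA, and again after imputing, to get the before/after views):

library(Amelia)   # assumed installed; provides missmap()
missmap(diabetes, main = "Missing values map")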

Before Imputing

After Imputing
With cleaner data, we are geared up to build models and make predictions.

Building a model

We build a logistic regression model on the data, as the Outcome is categorical.
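
A minimal sketch of the fit, using all eight features (the model object name is an assumption):

# Logistic regression: binomial family with the default logit link
logit_model <- glm(Outcome ~ ., data = diabetes, family = binomial)
summary(logit_model)   # coefficients, p-values and significance codes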


This is easy, isn't it?

The summary gives us interesting information.

Out of the eight features we have, four stand out as significant in influencing whether a person is diabetic (Outcome):

  • Pregnancies
  • Glucose
  • BMI
  • DPF 

Let's look at the odds ratios to see to what extent these features influence the Outcome.

Odds Ratio


  • Each pregnancy is associated with a 13.31% increase in the odds
  • Each unit increase in plasma glucose concentration is associated with a 3.80% increase in the odds
  • A unit increase in BMI increases the odds by 9.76%
  • Each unit increase in DPF increases the odds by 137.76%
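
These percentages are simply the exponentiated model coefficients, minus one; a sketch (assuming the model object from above):

# Odds ratios: exponentiate the logistic regression coefficients
exp(coef(logit_model))
# Expressed as percentage change in odds per unit increase
round((exp(coef(logit_model)) - 1) * 100, 2)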

Prediction


Here comes the interesting part. Having built a model that can predict the likelihood of the onset of diabetes, let's make some predictions and see how well the model works.
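
A sketch of the prediction step; the 0.5 probability cut-off is an assumption:

# Predict probabilities on the same data, apply a 0.5 cut-off, and tabulate
probs <- predict(logit_model, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)
table(Predicted = preds, Actual = diabetes$Outcome)   # confusion matrix
mean(preds == diabetes$Outcome)                       # overall accuracy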

Here is the confusion matrix from prediction.

Confusion Matrix Visualized


Closing Notes


This model has an accuracy of 77.73%, but note that it was evaluated on the same data it was trained on.

A few questions now linger in our minds:
  • Are there any disadvantages to building and testing the model on the same data set?
  • How can we apply this model to new data with confidence?
  • How can I ensure this is a good model for prediction?
  • Can I build a better model?
and so on.

Let's try to answer these questions (and more) in subsequent posts.


P.S.: The complete code that was used for this article is here.


Thursday, February 26, 2015


Multiple Linear Regression using R



Having understood Simple Linear Regression, we'll move on to Multiple Linear Regression. We'll use sample data and R to understand it better.

Building a regression model with more than one explanatory variable is called Multiple Linear Regression.

We'll use the example data to build a Multiple Linear Regression model in R and answer a few questions.



Let's list the questions that we will answer through this exercise.
  • Determine a linear regression model equation to represent this data
  • Decide whether the new equation is a "good fit" to represent this data
  • Graphically represent the relation
  • Make a few predictions

Please note: The data can be downloaded from here.

The "Cigarettes" data set contains 25 observations (each corresponding to different Cigarette Brand) with five variables.

Here is some additional information about the variables.

  1. Variable 1 is the Cigarette Brand Name
  2. Variable 2 corresponds to Tar content in mg
  3. Variable 3 corresponds to Nicotine content in mg
  4. Variable 4 corresponds to weight in g
  5. Variable 5 corresponds to Carbon monoxide content in mg
Now, let us take a moment to understand which are our dependent vs. independent variables. To clear the air on the terminology: an independent variable can be thought of as the cause, whereas the dependent variable is the effect.

In this "Cigarette" data set, we would want to test the effect of Variables 2, 3 & 4 on Variable 5. Hence, Variable 5 would be our dependent variable y and the others are independent variables x1, x2 & x3.

Let us import the data in R and have a look at the first few observations.
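
A minimal sketch of the import; the file name is an assumption, and the variable names follow the lm() call used later in this post:

# Read the data: Brand, Tar, Nicotine, Weight, CO
cigarettes <- read.csv("cigarettes.csv")
head(cigarettes)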





The next logical step is to visualize the data. Please note: we are excluding variable 1, which corresponds to the Brand Name.
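
A sketch of one way to do this, with a scatter-plot matrix:

# Scatter-plot matrix of all pairs, dropping column 1 (Brand Name)
pairs(cigarettes[, -1], main = "Cigarettes data")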



Building a Multiple Linear Regression model is similar to building a Simple Linear Regression model.



Applying the lm() function to these variables builds the linear model.


mlfit <- lm(CO ~ Tar+Nicotine+Weight, data = cigarettes)



Now it's time to evaluate the model.
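
Evaluation is a one-line call on the fitted model object:

summary(mlfit)   # coefficients, p-values, R-squared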


Looking at the R-squared value, this model is indeed a "good fit": ~92% of the variation in y can be explained by its linear relationship with x1, x2 and x3.

Since there are three independent variables, we would also like to know which variable influences the variability in y the most. Luckily, that too is provided by the summary above.

The p-value gives us that information. A low p-value indicates strong evidence against the null hypothesis; we'll discuss this in more detail later. For now, any variable with a low p-value correlates significantly with the variability in y. Remember, there can be more than one independent variable with a low p-value.

From the above, the p-value for Tar is the lowest (0.000692; you may also notice the triple asterisk after it, indicating its significance).

The linear equation the model gives us (with x1 = Tar, x2 = Nicotine and x3 = Weight) is


y = 3.2022 + 0.9626 * x1 - 2.6317 * x2 - 0.1305 * x3


It's time to visualize our model.


Since we have the "good fit" model ready and visualized, we can go ahead and predict the CO content of a new fictitious brand (assuming Tar = 12.22 mg, Nicotine = 0.87 mg and Weight = 0.97 g).
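
A sketch of the prediction call:

# CO prediction for the fictitious brand
new_brand <- data.frame(Tar = 12.22, Nicotine = 0.87, Weight = 0.97)
predict(mlfit, newdata = new_brand)   # ~12.55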

 

Our model predicts the CO content of the fictitious brand to be 12.55 mg.

Complete R code can be found here.





Conclusion

Let us quickly check if we were able to achieve our objectives completely by answering the questions that we had initially.

  • Determine a linear regression model equation to represent this data
  • Yes. We applied linear regression and obtained the mathematical equation that represents this data
  • Decide whether the new equation is a "good fit" to represent this data
  • Yes. We decided the equation was a "good fit" based on the R-squared value. Additionally, we used the p-value for model evaluation
  • Graphically represent the relation
  • Yes. We did this using R
  • Make a few predictions
  • Yes. We successfully predicted the CO content for a fictitious brand

Closing Comments
Is this the "best model" that we can fit on the given data?
Interestingly, if you take a close look at what we did today, you'll realize this can be reduced to a Simple Linear Regression problem: just one of the three independent variables can be used to build the model and predict the CO content, yielding values similar to those obtained with all three.
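
For instance, a minimal sketch of one such reduced model, using Tar (the variable with the lowest p-value above):

# Simple linear regression on Tar alone
slfit <- lm(CO ~ Tar, data = cigarettes)
summary(slfit)
predict(slfit, newdata = data.frame(Tar = 12.22))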

Why don't you try and post your observations/results in the comments?









 




