February 2015 ~ Machine Learning

Multiple Linear Regression using R

Having understood Simple Linear Regression, we'll now move on to Multiple Linear Regression, using sample data and R to work through it.

Building a Regression model with more than one explanatory variable is called a Multiple Linear Regression.

We'll use the example data to build a Multiple Linear Regression model in R and answer a few questions.

Let's list the questions we want to be able to answer after this exercise.

Determine a linear regression model equation to represent this data
Decide whether the new equation is a "good fit" to represent this data
Graphically represent the relation
Doing a few predictions

Please note: The data can be downloaded from here.

The "Cigarettes" data set contains 25 observations (each corresponding to a different cigarette brand) with five variables.

Here is some additional information about the variables.

Variable 1 is the Cigarette Brand Name
Variable 2 corresponds to Tar content in mg
Variable 3 corresponds to Nicotine content in mg
Variable 4 corresponds to weight in g
Variable 5 corresponds to Carbon monoxide content in mg

Now, let us take a moment to identify our dependent and independent variables. To clear up the terminology: an independent variable can be thought of as the cause, whereas the dependent variable is the effect.

In this "Cigarette" data set, we want to test the effect of Variables 2, 3 and 4 on Variable 5. Hence, Variable 5 (Carbon monoxide) is our dependent variable y, and Tar, Nicotine and Weight are the independent variables x1, x2 and x3 respectively.

Let us import the data into R and have a look at the first few observations.
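The import step might look roughly like the sketch below. The column name Brand and the four rows are placeholders invented for illustration (they are not the real measurements); the names Tar, Nicotine, Weight and CO are taken from the lm() call used later in this post. With the downloaded file, you would pass its path to read.csv() instead of the text argument.

```r
# Placeholder rows only: these are NOT the real values from the data set.
csv_text <- "Brand,Tar,Nicotine,Weight,CO
BrandA,14.0,0.90,0.98,13.5
BrandB,16.0,1.05,1.09,16.5
BrandC,8.0,0.65,0.93,10.0
BrandD,4.0,0.40,0.95,5.5"

# With the real file: cigarettes <- read.csv("<path to the downloaded file>")
cigarettes <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(cigarettes)   # column names and types
head(cigarettes)  # first few observations
```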

The next logical step is to visualize the data. Please note: we are excluding Variable 1, which corresponds to the Brand Name.
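One common way to do this is a scatterplot matrix. The sketch below uses synthetic stand-in data (generated with a fixed seed, not the real cigarette measurements) and an assumed Brand column, so the picture will differ from the one produced with the actual data set.

```r
# Synthetic stand-in data: 25 made-up "brands", NOT the real measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Brand = paste0("Brand", seq_len(n)),
                         Tar, Nicotine, Weight, CO)

# Drop the brand name and keep only the numeric variables.
numeric_vars <- cigarettes[, c("Tar", "Nicotine", "Weight", "CO")]

# Scatterplot matrix: one panel per pair of variables.
pairs(numeric_vars, main = "Cigarette data (synthetic stand-in)")

# Pairwise correlations give a numeric view of the same relationships.
cor_matrix <- cor(numeric_vars)
round(cor_matrix, 2)
```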

Building a Multiple Linear Regression model is similar to building a Simple Linear Regression model.

Applying the lm() function to these variables builds the linear model.

mlfit <- lm(CO ~ Tar+Nicotine+Weight, data = cigarettes)

Now it's time to evaluate the model.
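The evaluation is typically done with summary(). The sketch below fits the same formula on synthetic stand-in data (not the real measurements), so its numbers will not match the ones quoted in this post.

```r
# Synthetic stand-in data, NOT the real cigarette measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Tar, Nicotine, Weight, CO)

mlfit <- lm(CO ~ Tar + Nicotine + Weight, data = cigarettes)

# Coefficients, standard errors, p-values, R-squared and the F-statistic.
fit_summary <- summary(mlfit)
print(fit_summary)

# Individual pieces can be pulled out of the summary object.
fit_summary$r.squared
fit_summary$adj.r.squared
```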

Looking at the R-squared value, this model is indeed a "good fit": it tells us that ~92% of the variation in y can be explained by its linear relationship with x1, x2 and x3.
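To see what "variation explained" means, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic stand-in data (not the real measurements) and checks it against the value reported by summary().

```r
# Synthetic stand-in data, NOT the real cigarette measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Tar, Nicotine, Weight, CO)

mlfit <- lm(CO ~ Tar + Nicotine + Weight, data = cigarettes)

# Variation left unexplained by the model.
ss_res <- sum(residuals(mlfit)^2)
# Total variation of y around its mean.
ss_tot <- sum((cigarettes$CO - mean(cigarettes$CO))^2)

r2_manual  <- 1 - ss_res / ss_tot
r2_summary <- summary(mlfit)$r.squared

# The two values should agree up to floating-point error.
all.equal(r2_manual, r2_summary)
```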

Since there are three independent variables, we would also be interested in which variable influences the variability in y the most. Luckily, the model summary provides that as well.

The p-value is the parameter that gives us that information. A low p-value indicates strong evidence against the null hypothesis (here, that the variable's coefficient is zero). We'll discuss this in more detail later. For now, a variable with a low p-value is significantly associated with the variability in y. Please remember, more than one independent variable can have a low p-value.

From the above, the p-value for Tar is the lowest (0.000692); you may also notice the triple asterisk placed after it, indicating its significance.
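The p-values can also be read programmatically from the coefficient table of the summary. The sketch below uses synthetic stand-in data (not the real measurements), so the values differ from the 0.000692 quoted above.

```r
# Synthetic stand-in data, NOT the real cigarette measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Tar, Nicotine, Weight, CO)

mlfit <- lm(CO ~ Tar + Nicotine + Weight, data = cigarettes)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(summary(mlfit))

# p-values for the three predictors (intercept row dropped).
p_values <- coef_table[c("Tar", "Nicotine", "Weight"), "Pr(>|t|)"]
p_values

# Name of the predictor with the smallest p-value.
names(which.min(p_values))
```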

The linear equation the model has given us is

y = 3.2022 + 0.9626 * x1 - 2.6317 * x2 - 0.1305 * x3

It's time to visualize our model.
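With three explanatory variables the fitted surface cannot be drawn directly, so one simple option (an assumption here, not necessarily the plot used in this post) is to plot observed against fitted values. The sketch below does that on synthetic stand-in data, not the real measurements.

```r
# Synthetic stand-in data, NOT the real cigarette measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Tar, Nicotine, Weight, CO)

mlfit <- lm(CO ~ Tar + Nicotine + Weight, data = cigarettes)
fitted_co <- fitted(mlfit)

# Points close to the dashed 45-degree line are well predicted.
plot(fitted_co, cigarettes$CO,
     xlab = "Fitted CO (mg)", ylab = "Observed CO (mg)",
     main = "Observed vs. fitted (synthetic stand-in)")
abline(a = 0, b = 1, lty = 2)
```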

Now that the "good fit" model is ready and visualized, we can go ahead and predict the CO content of a new, fictitious brand (assuming Tar = 12.22 mg, Nicotine = 0.87 mg and Weight = 0.97 g).

Our model predicts the CO content of the fictitious brand to be 12.55 mg.
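The prediction can be reproduced by plugging the fictitious brand into the equation given earlier. The sketch below does the arithmetic with the coefficients quoted in this post; the variable names are chosen here for illustration.

```r
# Coefficients as quoted in the model equation in this post.
b0    <-  3.2022   # intercept
b_tar <-  0.9626   # Tar (x1)
b_nic <- -2.6317   # Nicotine (x2)
b_wt  <- -0.1305   # Weight (x3)

# The fictitious brand.
new_brand <- data.frame(Tar = 12.22, Nicotine = 0.87, Weight = 0.97)

co_hat <- b0 +
  b_tar * new_brand$Tar +
  b_nic * new_brand$Nicotine +
  b_wt  * new_brand$Weight

round(co_hat, 2)  # 12.55

# With the fitted model object, the equivalent call would be:
# predict(mlfit, newdata = new_brand)
```

The hand calculation and predict() may differ in the last digits, because the equation uses coefficients rounded to four decimal places.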

Complete R code can be found here.

Conclusion

Let us quickly check whether we achieved our objectives by answering the questions we started with.

Determine a linear regression model equation to represent this data
Yes. We applied the linear regression model and got the mathematical equation to represent this data
Decide whether the new equation is a "good fit" to represent this data
Yes. We judged the equation to be a "good fit" based on the R-squared value. Additionally, we used the p-value for model evaluation
Graphically represent the relation
Yes. We did this using R
Doing a few predictions
Yes. We successfully predicted CO Content for a fictitious brand

Closing Comments
Is this the "best model" that we can fit on the given data?
Interestingly, if you look closely at what we did today, you'll realize this can be reduced to a "Simple Linear Regression" problem. Just one of the three independent variables can be used to build the model and predict the CO content, giving values similar to those obtained when all three were used.

Why don't you try and post your observations/results in the comments?
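As a starting point, the comparison could be set up as in the sketch below, which fits a one-variable model next to the full model and compares them. It runs on synthetic stand-in data (not the real measurements), and the choice of Tar as the single predictor is an assumption based on its low p-value above; the conclusion for the real data is left for you to check.

```r
# Synthetic stand-in data, NOT the real cigarette measurements.
set.seed(42)
n <- 25
Tar      <- runif(n, 1, 30)
Nicotine <- 0.06 * Tar + rnorm(n, sd = 0.1)
Weight   <- rnorm(n, mean = 0.97, sd = 0.08)
CO       <- 3 + 0.8 * Tar + rnorm(n, sd = 1.5)
cigarettes <- data.frame(Tar, Nicotine, Weight, CO)

fit_simple <- lm(CO ~ Tar, data = cigarettes)
fit_full   <- lm(CO ~ Tar + Nicotine + Weight, data = cigarettes)

# R-squared never decreases when predictors are added, so also look at
# adjusted R-squared, which penalizes the extra variables.
r2 <- c(simple = summary(fit_simple)$r.squared,
        full   = summary(fit_full)$r.squared)
adj_r2 <- c(simple = summary(fit_simple)$adj.r.squared,
            full   = summary(fit_full)$adj.r.squared)
r2
adj_r2

# F-test: do Nicotine and Weight add anything beyond Tar?
anova(fit_simple, fit_full)

# Predictions from both models for the fictitious brand.
new_brand <- data.frame(Tar = 12.22, Nicotine = 0.87, Weight = 0.97)
pred <- c(simple = unname(predict(fit_simple, newdata = new_brand)),
          full   = unname(predict(fit_full,   newdata = new_brand)))
pred
```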

Thursday, February 26, 2015
