Saturday, February 25, 2017

Published 22:16

Let's Explore Data - I(a)

I thought it would be a good idea to visually explain the inferences from the previous post, because a picture is worth a thousand words. Visualization is one of the most effective ways to communicate findings from data analytics.

This post is a shorter one, as we are only going to represent visually the analysis we did in the previous post.

I am using R to explore the data. Please feel free to use the tool of your choice for data exploration.

Descriptive Statistics on "mpg"

Summary Statistics - Miles Per Gallon


The box plot above shows the five-number summary (minimum, lower quartile, median, upper quartile, maximum) along with the mean.
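
A box plot like this can be drawn with base R. The following is a minimal sketch, assuming the built-in mtcars data set; the title, axis label, and the filled point used to mark the mean are my own choices, since boxplot() does not draw the mean by itself.

```r
# Five-number summary and mean of mpg from the built-in mtcars data set
mpg <- mtcars$mpg
five_num <- fivenum(mpg)   # minimum, lower hinge, median, upper hinge, maximum
mpg_mean <- mean(mpg)

# Draw the box plot, then mark the mean with a filled point
# (a single box is drawn at x = 1)
boxplot(mpg,
        main = "Summary Statistics - Miles Per Gallon",
        ylab = "Miles per (US) gallon")
points(1, mpg_mean, pch = 19)
```

Note that fivenum() reports hinges, which can differ slightly from the quartiles that summary() prints.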

Distribution of cars based on number of cylinders
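
One way to draw this chart in base R is to count the cars per cylinder value with table() and hand the counts to barplot(). This is only a sketch, assuming the built-in mtcars data set; the title and axis labels are my own. The same pattern works for the number of forward gears (the gear column).

```r
# Count the cars for each number of cylinders
cyl_counts <- table(mtcars$cyl)

# One bar per cylinder count
barplot(cyl_counts,
        main = "Cars by number of cylinders",
        xlab = "Number of cylinders",
        ylab = "Number of cars")
```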




Distribution of cars based on number of forward gears




Distribution of cars based on transmission and on engine type
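
For the 0/1 coded variables, it helps to attach readable labels before charting. A minimal sketch, assuming the built-in mtcars data set and the coding from the metadata (am: 0 = automatic, 1 = manual; vs: 0 = V, 1 = S); the side-by-side layout and titles are my own choices.

```r
# Turn the 0/1 codes into labelled categories, then count them
am_counts <- table(factor(mtcars$am, levels = c(0, 1),
                          labels = c("Automatic", "Manual")))
vs_counts <- table(factor(mtcars$vs, levels = c(0, 1),
                          labels = c("V-Engine", "Straight Engine")))

# Draw the two bar charts side by side, then restore the plot settings
old_par <- par(mfrow = c(1, 2))
barplot(am_counts, main = "Cars by transmission", ylab = "Number of cars")
barplot(vs_counts, main = "Cars by engine type", ylab = "Number of cars")
par(old_par)
```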




We can also combine different aspects/parameters and chart them together.
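
As an illustration, two variables can be cross-tabulated and drawn as a grouped bar chart. The pairing of transmission with number of cylinders below is my own assumption, picked only as an example, and the sketch again assumes the built-in mtcars data set.

```r
# Cross-tabulate transmission type against number of cylinders
am_by_cyl <- table(
  Transmission = factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual")),
  Cylinders = mtcars$cyl
)

# Grouped bars: one group per cylinder count, one bar per transmission type
barplot(am_by_cyl, beside = TRUE, legend.text = TRUE,
        main = "Transmission type by number of cylinders",
        xlab = "Number of cylinders", ylab = "Number of cars")
```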


In the next post, we'll look at multivariate data analysis.

Tuesday, February 14, 2017

Published 13:12

Let's Explore Data - I

Having become acquainted with The ABCs of Machine Learning, it's time to do some data exploration and understand the nature of the data before graduating to the next level.

After we acquire data, the immediate next step is to make sense of it, quickly. This step is of paramount importance, as doing it right saves a lot of time when we build a model or draw conclusions.

Data Set

The data that we are going to explore is sourced from here.

Let's get started.

As defined, the data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

I am using R to explore the data. Please feel free to use the tool of your choice for data exploration.

Quick Tip: As the data is available in CSV format, Microsoft Excel can also help with the data exploration.

A quick look at part of what is inside the data set:

[Screenshot: the first few rows of the data set]
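
The same kind of preview can be produced in R itself. A minimal sketch, assuming the mtcars data set that ships with R (so no CSV file needs to be read):

```r
# Load the built-in data set and preview it
data(mtcars)
first_rows <- head(mtcars)   # first six rows
print(first_rows)

# Compact overview: number of observations, variables, and their storage types
str(mtcars)
```

str() shows every column stored as a number, including the Qualitative ones, which is worth keeping in mind for the steps that follow.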

Here is the metadata for the data set.

Variable Name   Description                                Variable Type
mpg             Miles/(US) gallon                          Quantitative (Continuous)
cyl             Number of cylinders                        Qualitative (Nominal)*
disp            Displacement (cu.in.)                      Quantitative (Continuous)
hp              Gross horsepower                           Quantitative (Continuous)
drat            Rear axle ratio                            Quantitative (Continuous)
wt              Weight (1000 lbs)                          Quantitative (Continuous)
qsec            1/4 mile time                              Quantitative (Continuous)
vs              V Engine/S Engine (0 = V, 1 = S)           Qualitative (Nominal)
am              Transmission (0 = automatic, 1 = manual)   Qualitative (Nominal)
gear            Number of forward gears                    Qualitative (Nominal)
carb            Number of carburetors                      Qualitative (Nominal)


* Depending on the analysis that we do using "cyl", it can also be classified as Quantitative (Discrete).

I have also mapped each variable name and description to its variable type. Please take a quick look to review it. If there are any questions, please ask them in the comments section.

After we skim through the metadata and understand the variable properties, the next step is to get descriptive statistics on the data set. Descriptive Statistics include frequencies (counts), ranks, measures of central tendency (e.g., mean, median and mode), and measures of variability (e.g., range and standard deviation). These help in developing a good and quick understanding of the data.

Remember that we want to do numeric operations only on the Quantitative Variables, even though some of the Qualitative Variables have numeric values (e.g., vs, am).

Descriptive Statistics on "mpg"

Average Miles Per Gallon







Median Miles Per Gallon






Miles Per Gallon Range






[Screenshot] Summary Statistics: this captures everything we calculated earlier in one shot
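
For readers following along in R, the measures above can be computed as sketched below, assuming the built-in mtcars data set. The variable names are my own.

```r
mpg <- mtcars$mpg

mpg_mean    <- mean(mpg)      # average miles per gallon
mpg_median  <- median(mpg)    # median miles per gallon
mpg_range   <- range(mpg)     # minimum and maximum
mpg_summary <- summary(mpg)   # all of the above, plus the quartiles, in one call

print(mpg_mean)
print(mpg_median)
print(mpg_range)
print(mpg_summary)
```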












Inference on mpg

The mpg (for all the vehicles in the data set) ranges from 10.40 miles per gallon to 33.90 miles per gallon, with an average of 20.09 miles per gallon and a median of 19.20 miles per gallon.

Please note: summary() in R also reports the first and third quartiles. The difference between them is another interesting measure, the "IQR" (interquartile range), which we'll read about in a future post.
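
As a small preview, the quartiles and the IQR can be obtained directly in R. This is only a sketch, assuming the built-in mtcars data set and R's default quantile method.

```r
# First and third quartiles of mpg
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.75))

# Interquartile range: third quartile minus first quartile
mpg_iqr <- IQR(mtcars$mpg)

print(mpg_quartiles)
print(mpg_iqr)
```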

Extending the Descriptive Statistics to the other variables of interest (mpg, weight, gear, am, vs, cyl) in the data set yields the following.

Inference on mtcars data set

  • Total number of cars in the data set: 32
  • Distribution of cars by number of cylinders:
    Number of Cylinders       Number of Cars
    Four                      11
    Six                       7
    Eight                     14

  • Distribution of cars by number of forward gears:
    Number of Forward Gears   Number of Cars
    Three                     15
    Four                      12
    Five                      5

  • Distribution of cars by transmission:
    Transmission Type         Number of Cars
    Automatic                 19
    Manual                    13

  • Distribution of cars by engine type:
    Engine Type               Number of Cars
    V-Engine                  18
    Straight Engine           14

  • Miles Per Gallon ranges from 10.40 mpg to 33.90 mpg
  • Weight of the cars ranges from 1,513 lbs to 5,424 lbs
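
The counts and ranges in this list can be reproduced with table() and range(). A minimal sketch, assuming the built-in mtcars data set; the variable names are my own, and the weight is multiplied by 1000 because wt is recorded in units of 1000 lbs.

```r
n_cars <- nrow(mtcars)               # total number of cars

# Frequencies for the Qualitative Variables
cyl_counts  <- table(mtcars$cyl)     # number of cylinders
gear_counts <- table(mtcars$gear)    # number of forward gears
am_counts   <- table(mtcars$am)      # 0 = automatic, 1 = manual
vs_counts   <- table(mtcars$vs)      # 0 = V engine, 1 = straight engine

# Ranges for the Quantitative Variables
mpg_range    <- range(mtcars$mpg)
wt_range_lbs <- range(mtcars$wt) * 1000   # convert from 1000 lbs to lbs

print(n_cars)
print(cyl_counts)
print(gear_counts)
print(am_counts)
print(vs_counts)
print(mpg_range)
print(wt_range_lbs)
```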

Conclusion

We did univariate analysis on the variables in the "mtcars" data set to understand more about its content. Descriptive Statistics help us get a good understanding of the data, quickly. We also reinforced our learning about variable types and applied different operations to different variable types (e.g., we did not compute an average on a Qualitative Variable).

We will continue with more analysis (involving multiple variables) to make more sense of this data set in the next post.

Gist

A shorthand way in R to get descriptive statistics of a data set.

[Screenshot] Summary Statistics on "mtcars" data set
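
In R this is a single call. The sketch below assumes the built-in mtcars data set and stores the result so that it can be inspected afterwards.

```r
# Descriptive statistics for every column of the data set in one call
mtcars_summary <- summary(mtcars)
print(mtcars_summary)
```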


Use the summary() command with caution, as it treats every variable that is stored as a number as Quantitative, irrespective of whether it really is Quantitative.

Based on the type of the variable, the summary() function gives different output.
  • For Quantitative Variables: the minimum and maximum, quartiles, median, and mean, plus the number of missing values (if any)
  • For Categorical Variables: a table of frequencies, plus the number of missing values (if any)

How to handle Qualitative Variables

The summary() results gave Quantitative statistics on Qualitative data. How do we handle this?

Change the type of the variable to categorical (a factor in R) on the fly, and we can then compute frequencies on the Qualitative variable.
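
A minimal sketch of this conversion for the "am" variable, assuming the built-in mtcars data set. The labels follow the metadata (0 = automatic, 1 = manual) and the variable names are my own.

```r
# Stored as numbers, am gets quartiles and a mean, which are not what we want here
am_numeric_summary <- summary(mtcars$am)
print(am_numeric_summary)

# Converted to a factor on the fly, summary() returns frequencies instead
am_factor <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("Automatic", "Manual"))
am_counts <- summary(am_factor)
print(am_counts)
```

The original data set is left untouched, because the conversion is applied to a copy of the column.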


 




Please note: The above is an example of changing the type of one variable in the data set. We can also apply this treatment to all such variables as a preprocessing step, before computing descriptive statistics on the data set.

Note #1: Based on the feedback (in the comments section), I plan to write a post detailing these measures (such as mean, range, etc.), explaining their significance and on which variable types, and when and where, they should be used.

Note #2: I am also thinking of a post on "getting started with R", if there are takers. Please leave a comment if you are interested.

Note #3: Visualizing the data is another powerful and significant method of exploring data. Please watch out for a post on the same.
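
As a sketch of that preprocessing step, the variables marked Qualitative in the metadata table can be converted in one go on a copy of the data. The copy name "cars" and the helper vector "qualitative" are my own, hypothetical names.

```r
# Work on a copy so that the built-in mtcars stays unchanged
cars <- mtcars

# Variables classified as Qualitative in the metadata table
qualitative <- c("cyl", "vs", "am", "gear", "carb")

# Convert each of them to a factor
cars[qualitative] <- lapply(cars[qualitative], factor)

# summary() now gives frequencies for the factors and
# quartiles, median, and mean for the remaining numeric columns
cars_summary <- summary(cars)
print(cars_summary)
```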