Tuesday, January 31, 2017

Published 18:13 with 2 comments

The ABCs of Machine Learning

It is a good idea to spend some time on the basics, and this page is dedicated to that.

Why do we care about variable type (in the context of Machine Learning)?

Understanding variable types gives us the power to treat each variable appropriately. For instance, it helps us avoid mistakes like averaging postal codes or taking the ratio of two pH values. It also helps us choose the appropriate operation on a variable based on the context. For example, in a psychological study of perception, different colors would be regarded as nominal. In a physics study, color is quantified by wavelength, so color would be considered a ratio variable.
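To make the postal code example concrete, here is a minimal Python sketch. The postal codes are made up for illustration, and the variable names are my own. It shows that the language will compute an average of postal codes without complaint, even though the result means nothing, while counting the most frequent code is a sensible operation on this kind of data.

```python
from statistics import mean, mode

# Made-up postal codes, for illustration only.
postal_codes = [94105, 10001, 94105, 60601]

# Python computes this without error, but the result is not a real or
# meaningful postal code: the digits are identifiers, not quantities.
meaningless_average = mean(postal_codes)

# Finding the most frequent category is a valid operation on nominal data.
most_common_code = mode(postal_codes)

print("Average (meaningless):", meaningless_average)
print("Most common code:", most_common_code)
```

The point is that the program cannot tell the two cases apart. Knowing the variable type is what tells us which of the two numbers to trust.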

Feature

A feature (aka variable) is a measurable property or attribute of an observation. For example, when studying the performance of cars, the possible features would be:
  • Number of Cylinders
  • Miles Per Gallon (or Kilometers Per Liter)
  • Horse Power
  • Weight
  • Number of Gears
  • Type of Transmission
A feature is also referred to as an 'Explanatory Variable' or 'Independent Variable'.

Label

A label represents the outcome, or the output, whose variation is being studied. An example would be the learning problem shown below. Here the examples are labeled '1' and '0': animals are marked with '1' and everything else with '0'.
screen-shot-2017-01-25-at-12-09-44-pm
A label is also referred to as an 'Explained Variable' or 'Dependent Variable'. The dependent variable responds to the independent variable, which is why it is called 'dependent'.

Dataset

A dataset (or data set) is simply a collection of observations. Typically it is laid out in rows and columns, where each column corresponds to a feature or label and each row corresponds to an observation. The most popular dataset format is the spreadsheet, which is convenient for quick analysis.
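As a rough sketch of the rows-and-columns idea, the snippet below reads a tiny table with Python's standard csv module and separates the feature columns from the label column. The column names (legs, has_fur, is_animal) and all values are invented for illustration, loosely following the animal labeling example above. They are not taken from any real dataset.

```python
import csv
import io

# Hypothetical data: column names and values are invented for illustration.
# Each row is one observation; 'is_animal' is the label (1 = animal, 0 = other).
raw = """name,legs,has_fur,is_animal
dog,4,1,1
chair,4,0,0
sparrow,2,0,1
rock,0,0,0
"""

# Read every row into a dictionary keyed by column name.
rows = list(csv.DictReader(io.StringIO(raw)))

feature_names = ["legs", "has_fur"]   # explanatory / independent variables
label_name = "is_animal"              # explained / dependent variable

# X holds one list of feature values per observation; y holds the labels.
X = [[int(row[name]) for name in feature_names] for row in rows]
y = [int(row[label_name]) for row in rows]

print("Features:", X)
print("Labels:  ", y)
```

Naming the features X and the labels y is a common convention in machine learning code, not something this page prescribes.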

Data (Variable) Type

Data is often classified as below based on the nature of the data.
Please note: This is different from the 'datatype' defined in the database realm.

Quantitative Variable

A Quantitative Variable is expressed in numerical form, so arithmetic operations can be performed on it. Quantitative Variables can be further classified as follows:
  • Continuous Variable
A continuous variable can take any value between two numbers, so it can take infinitely many values.
Example: Time (in seconds) taken by the top five athletes to complete the 100m at the Rio Olympics: 9.81, 9.89, 9.91, 9.93, 9.94.
    • Interval Variable
      Interval Variables take numeric values and can be measured along a continuum. The intervals between the values of an interval variable are equally spaced.
      Example: Temperature measured in degrees Celsius or Fahrenheit. The difference between 20 °C and 30 °C is the same as the difference between 30 °C and 40 °C.
    • Ratio Variable
      Ratio Variables are Interval Variables with the difference that they possess a clear definition of zero (0), which indicates that there is none of that variable.
      Example: Temperature measured in Kelvin, because 0 Kelvin (also called absolute zero) is a true zero point. The zero points of Celsius and Fahrenheit are arbitrary, which is why temperatures measured on those scales are NOT Ratio Variables. Other examples include height, mass, distance, etc.
  • Discrete Variable
A discrete variable cannot take intermediate values between two adjacent values. It is typically represented by whole numbers (integers).
Example: Total medals won by the USA at the last three Summer Olympics: 121 (Rio 2016), 103 (London 2012), 110 (Beijing 2008).
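The interval versus ratio distinction described above can be checked with a little arithmetic. The sketch below uses two example temperatures, 20 °C and 40 °C, and a small helper function that I have named celsius_to_kelvin. It shows that differences are the same on both scales, while a ratio is only meaningful on the Kelvin scale.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15


t1_c, t2_c = 20.0, 40.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on an interval scale and agree across scales
# (up to floating-point rounding).
difference_c = t2_c - t1_c
difference_k = t2_k - t1_k

# The Celsius ratio suggests 40 °C is "twice as hot" as 20 °C, which is
# misleading because 0 °C is an arbitrary zero point.
naive_ratio = t2_c / t1_c

# On the Kelvin scale, zero is a true zero, so this ratio is meaningful.
true_ratio = t2_k / t1_k

print("Difference in Celsius:", difference_c)
print("Difference in Kelvin: ", round(difference_k, 2))
print("Ratio in Celsius (misleading):", naive_ratio)
print("Ratio in Kelvin (meaningful): ", round(true_ratio, 3))
```

In Kelvin, 40 °C is only about 7% higher than 20 °C, not double.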

Qualitative Variable

A Qualitative Variable takes non-numeric values. It describes data that fit into categories. It is also referred to as a Categorical Variable. Qualitative Variables can be further classified as follows:
  • Nominal Variable
A nominal variable is one that has two or more categories, but there is no intrinsic ordering to the categories.
Example: The blood type of a person has multiple categories A, B, AB or O but there is no intrinsic order.
    • Dichotomous (Binary) Variable
      Dichotomous variables are nominal variables which have only two categories or levels. Example: If we ask a person whether s/he owns a car, the response can be either 'Yes' or 'No'.
  • Ordinal Variable
An ordinal variable is similar to a nominal variable, the difference being that there is a clear ordering (or ranking) of the categories.
Example: Clothing size, with values like S, M, L, XL, where there is an order (S < M < L < XL).
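One practical consequence of the nominal versus ordinal distinction is how categories are turned into numbers for a learning algorithm. The sketch below uses the clothing size and blood type examples from above. The function names are my own, and the two encodings shown (integer ranks for ordinal data, one-hot vectors for nominal data) are common conventions, not the only options.

```python
# Ordered from smallest to largest, so the position encodes the rank.
SIZE_ORDER = ["S", "M", "L", "XL"]

# No intrinsic order: the position in this list carries no meaning.
BLOOD_TYPES = ["A", "B", "AB", "O"]


def encode_ordinal(value, ordered_categories):
    """Map an ordinal category to its rank (0 = lowest)."""
    return ordered_categories.index(value)


def encode_one_hot(value, categories):
    """Map a nominal category to a vector with a single 1 and 0s elsewhere."""
    return [1 if value == category else 0 for category in categories]


# Ranks preserve the ordering of sizes, so comparisons make sense.
print(encode_ordinal("S", SIZE_ORDER) < encode_ordinal("M", SIZE_ORDER))

# One-hot vectors avoid implying that one blood type is "greater" than another.
print(encode_one_hot("AB", BLOOD_TYPES))
```

Encoding blood types as 0, 1, 2, 3 would suggest an order that does not exist, which is why the nominal example gets a separate column per category instead.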

In a nutshell

screen-shot-2017-01-29-at-11-48-40-am


2 comments:

  1. Kindly provide some example programs to practice

  2. good start on the basics. looking forward these learnings put it into practice
