Tuesday, January 31, 2017

Published 18:13 with 2 comments

The ABCs of Machine Learning

It is a good idea to spend some time on the basics, and this page is dedicated to that.

Why do we care about variable type (in the context of Machine Learning)?

Understanding variable types lets us treat each variable appropriately. For instance, it helps us avoid mistakes such as averaging postal codes or taking the ratio of two pH values. It also helps us choose the appropriate operation for a variable based on the context. For example, in a psychological study of perception, different colors would be regarded as nominal. In a physics study, color is quantified by wavelength, so color would be considered a ratio variable.

Feature

A Feature (aka Variable) is a measurable property or attribute of an observation. For example, when studying the performance of cars, the possible features would be
  • Number of Cylinders
  • Miles Per Gallon (or Kilometers Per Liter)
  • Horse Power
  • Weight
  • Number of Gears
  • Type of Transmission
A Feature is also referred to as an 'Explanatory Variable' or 'Independent Variable'.

Label

A Label represents the outcome, or the output whose variation is being studied. An example would be the learning problem shown below. Here the examples are labeled '1' and '0': animals are marked with '1' and everything else with '0'.
[Figure: screen-shot-2017-01-25-at-12-09-44-pm]
A Label is also referred to as the 'Explained Variable' or 'Dependent Variable'. It is called 'dependent' because it responds to the independent variable.

Dataset

A Dataset (or Data Set) is simply a collection of observations. Typically the data is laid out in rows and columns, where columns correspond to features and labels and rows correspond to observations. The most popular dataset format is the spreadsheet, which is powerful for quick analysis.
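
As a rough sketch of the rows-and-columns idea, the snippet below builds a tiny made-up car dataset and splits it into features and a label. The column names, the values, and the choice of `mpg` as the label are all assumptions for illustration, not data from a real study.

```python
# Hypothetical toy dataset: each dict is one row (observation),
# each key is one column (a feature or the label). Values are made up.
dataset = [
    {"cylinders": 4, "horsepower": 93, "weight_kg": 1050, "transmission": "manual", "mpg": 32.4},
    {"cylinders": 6, "horsepower": 110, "weight_kg": 1190, "transmission": "manual", "mpg": 21.0},
    {"cylinders": 8, "horsepower": 175, "weight_kg": 1560, "transmission": "automatic", "mpg": 18.7},
]

# Assumption: we treat 'mpg' as the label and everything else as features.
feature_names = ["cylinders", "horsepower", "weight_kg", "transmission"]
label_name = "mpg"

# X holds one list of feature values per observation; y holds the labels.
X = [[row[name] for name in feature_names] for row in dataset]
y = [row[label_name] for row in dataset]

for features, label in zip(X, y):
    print(features, "->", label)
```

Which column plays the role of the label depends on the question being asked; the same table could be reused with a different column as the outcome.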

Data (Variable) Type

Data is often classified as shown below, based on its nature.
Please Note: This is different from the 'datatype' defined in the database realm.

Quantitative Variable

A Quantitative Variable is expressed in numerical form, and therefore arithmetic operations can be performed on it. Quantitative Variables can be further classified as follows
  • Continuous Variable
A continuous variable can take any value between two numbers, so it can take infinitely many values.
Example: Time taken by top five athletes to complete 100m in Rio Olympics: 9.81, 9.89, 9.91, 9.93, 9.94.
    • Interval Variable
      Interval Variables take numeric values and can be measured along a continuum. The intervals between the values of an interval variable are equally spaced.
      Example: Temperature measured in degrees Celsius or Fahrenheit. The difference between 20°C and 30°C is the same as the difference between 30°C and 40°C.
    • Ratio Variable
      Ratio Variables are Interval Variables, with the difference that they possess a clear definition of zero (0), which indicates that there is none of that variable.
      Example: Temperature measured in Kelvin, because 0 Kelvin (also called absolute zero) is a true zero rather than an arbitrary reference point. For the very same reason, temperatures measured in Celsius and Fahrenheit are NOT Ratio Variables. Other examples include height, mass, distance, etc.
  • Discrete Variable
A discrete variable does not admit intermediate values between two specific numbers. It is typically represented by whole numbers (integers).
Example: Total medals by the USA at last three summer Olympics: 121 (Rio 2016), 103 (London 2012), 110 (Beijing 2008).
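
To make the interval versus ratio distinction above concrete, here is a small sketch comparing temperature differences and ratios in Celsius and Kelvin. The two temperatures are arbitrary example values.

```python
def celsius_to_kelvin(celsius):
    """Convert a temperature from degrees Celsius to Kelvin."""
    return celsius + 273.15

c_low, c_high = 20.0, 40.0  # arbitrary example temperatures
k_low, k_high = celsius_to_kelvin(c_low), celsius_to_kelvin(c_high)

# Differences are meaningful on an interval scale: both scales agree.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratios are only meaningful on a ratio scale.
naive_ratio = c_high / c_low  # 2.0, but 40 °C is not "twice as hot" as 20 °C
true_ratio = k_high / k_low   # roughly 1.07 on the Kelvin scale

print(diff_c, round(diff_k, 2))
print(naive_ratio, round(true_ratio, 3))
```

The Celsius ratio changes if the zero point is moved (for example by converting to Fahrenheit), which is why it carries no physical meaning.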

Qualitative Variable

A Qualitative Variable takes non-numeric values. It describes data that fit into categories, and is also referred to as a Categorical Variable. Qualitative Variables can be further classified as follows
  • Nominal Variable
A nominal variable is one that has two or more categories, but there is no intrinsic ordering to the categories.
Example: The blood type of a person has multiple categories A, B, AB or O but there is no intrinsic order.
    • Dichotomous (Binary) Variable
      Dichotomous variables are nominal variables which have only two categories or levels. Example: If we ask a person whether s/he owns a car, the response can be either 'Yes' or 'No'.
  • Ordinal Variable
An ordinal variable is similar to a nominal variable, the difference being that there is a clear ordering (or ranking) of the categories.
Example: Clothing Size, with values like S, M, L, XL, where there is an order (S < M < L < XL).
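
One place this distinction shows up in practice is when categories are turned into numbers for a model. The sketch below, with hypothetical helper names, maps the ordinal clothing sizes to ranks (so the order survives) and one-hot encodes the nominal blood types (so no artificial order is introduced).

```python
# Ordinal: the order S < M < L < XL is meaningful, so ranks preserve it.
size_order = ["S", "M", "L", "XL"]
size_rank = {size: rank for rank, size in enumerate(size_order)}

# Nominal: blood types have no order, so each category gets its own 0/1 slot.
blood_types = ["A", "B", "AB", "O"]

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

print(sorted(["XL", "S", "L"], key=size_rank.get))
print(one_hot("AB", blood_types))
```

Encoding blood types as 0, 1, 2, 3 instead would wrongly suggest that O is "greater than" A.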

In a nutshell

[Figure: screen-shot-2017-01-29-at-11-48-40-am]


Wednesday, January 18, 2017

Published 13:32 with 5 comments

Introduction to Machine Learning

What is Machine Learning?

Machine Learning is a branch of Computer Science that deals with making a machine (i.e. a computer) learn without being explicitly programmed. Essentially, it is a method of teaching computers to make and improve predictions, or to find patterns, based on data.

A common example used to explain Machine Learning is ‘Digit Recognition’ - where a machine is taught what the different digits (10 of them, zero included) look like, using images that each contain a single handwritten digit. The algorithm is then made to ‘recognize’ the digits in a new set of handwritten images.

Machine Learning can be broadly classified into the following categories

Supervised Learning

In Supervised Learning, the machine is taught using example inputs and their corresponding outputs. The objective of the machine is to learn the “rule” that derives the output for a given input.

Digit Recognition is an example of Supervised Learning, as labelled example data (a handwritten image and the corresponding digit) is used to train the machine. The objective of this method is to predict the output for new, previously unseen input data.
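
A minimal sketch of the idea, assuming tiny made-up 3x3 black-and-white "images" of the digits 0 and 1 and a 1-nearest-neighbour rule (one simple supervised method among many, not necessarily what a real digit recognizer uses):

```python
# Labelled training examples: (9 pixels of a 3x3 image, digit).
# The pixel patterns are invented for illustration.
training = [
    ([1, 1, 1, 1, 0, 1, 1, 1, 1], 0),
    ([1, 1, 1, 1, 0, 1, 1, 1, 0], 0),
    ([0, 1, 0, 0, 1, 0, 0, 1, 0], 1),
    ([0, 1, 0, 0, 1, 0, 1, 1, 0], 1),
]

def hamming(a, b):
    """Count the pixel positions where two images differ."""
    return sum(x != y for x, y in zip(a, b))

def predict(pixels):
    """Return the label of the training image closest to `pixels`."""
    nearest_pixels, nearest_label = min(training, key=lambda ex: hamming(ex[0], pixels))
    return nearest_label

# New images the model has not seen before (slightly noisy 0 and 1).
print(predict([0, 1, 1, 1, 0, 1, 1, 1, 1]))
print(predict([0, 1, 0, 0, 1, 0, 0, 1, 1]))
```

The "rule" learned here is simply "answer with the label of the most similar known example"; real systems learn far richer rules from far more data.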

Unsupervised Learning

In Unsupervised Learning, the machine is *not* provided with labelled examples, and the objective is to discover structure in the data: identifying hidden patterns, detecting outliers, clustering, etc.

An everyday example for Unsupervised Learning is Google News - news articles of same/similar content are sourced from various sites and grouped together.
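
As an illustrative sketch of clustering without labels, the snippet below runs a very small one-dimensional k-means on made-up numbers. The function name, the data, and the way the starting centroids are chosen are assumptions; this is not how any particular news service groups articles.

```python
def k_means_1d(values, k=2, iterations=10):
    """Group numbers into k clusters (k >= 2) around moving centroids."""
    lo, hi = min(values), max(values)
    # Assumption: start with centroids spread evenly between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assign each value to its nearest centroid.
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

values = [1.0, 1.2, 0.8, 9.7, 10.1, 10.4]  # no labels attached
clusters, centroids = k_means_1d(values)
print(clusters)
```

Nothing tells the algorithm which group a value belongs to; the two groups emerge only from how close the values are to each other.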

Reinforcement Learning

Reinforcement Learning is where a machine operates in a dynamic environment, makes decisions, and is periodically penalised or rewarded for those decisions. The objective of this method is to maximize the performance or efficiency of the whole system.

Self-driving cars or computers playing games against human players are notable examples of Reinforcement Learning.
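
A much smaller stand-in for those examples is the "multi-armed bandit": the machine repeatedly picks one of several actions and only learns from the reward it receives. The sketch below uses an epsilon-greedy strategy; the reward probabilities, step count, and function name are assumptions chosen for illustration.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays off most often."""
    rng = random.Random(seed)
    estimates = [0.0] * len(reward_probs)  # current guess of each action's value
    counts = [0] * len(reward_probs)       # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(len(reward_probs)), key=lambda a: estimates[a])
        # The environment rewards (1.0) or does not reward (0.0) the decision.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: action 1 is rewarded far more often than action 0.
estimates, counts = run_bandit([0.2, 0.8])
print(estimates, counts)
```

The machine is never told which action is correct; it only sees rewards, and its estimates gradually steer it toward the better action.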

Footnote: As humans, we inherently perform all three ways of learning. Here are some instances of them:

  • Humans learning to identify shapes, colors, alphabets, etc. is Supervised Learning, as we are taught with prior examples

  • Doctors diagnosing medical issues, identification of bugs in software, etc. are examples of humans doing anomaly detection. Grouping items based on similarities is an example of humans doing clustering. Both are Unsupervised Learning. For the same sample items, the groups/clusters can be very different for different humans

  • The process of humans learning to speak a language is an example of Reinforcement Learning: mastery of the language improves over time, based on the reactions or responses of the other person(s) involved



Why Machine Learning?

Because of its application in the real world.

The following are some of the use cases for Machine Learning

  • Fraud Detection
  • Sentiment Analysis
  • Recommendation Engine
  • Self-Driving Cars

Why now?

As the collection and processing of data become cheaper, applying Machine Learning becomes more and more practical and effective. For some organizations, such as Amazon, Google and Netflix, Machine Learning applications are the game changer.

Steps involved in Machine Learning

  • Collection of Data
  • Preparation of Data
  • Building (or Training) a Model
  • Evaluating the Model
  • Improving the Performance or Efficiency
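
The steps above can be sketched end to end on a toy problem. Everything here is an assumption for illustration: the (hours studied, passed) data is invented, and the "model" is just a threshold halfway between the two class averages.

```python
# 1. Collection of Data: hypothetical (hours_studied, passed) observations.
raw = [(1.0, 0), (2.0, 0), (2.5, 0), (3.0, 0), (4.5, 1), (5.0, 1),
       (6.0, 1), (7.5, 1), (None, 1), (3.5, 0), (5.5, 1)]

# 2. Preparation of Data: drop incomplete rows, hold out every third row for testing.
clean = [(x, y) for x, y in raw if x is not None]
test = clean[::3]
train = [row for i, row in enumerate(clean) if i % 3 != 0]

# 3. Building (or Training) a Model: threshold = midpoint of the class means.
def train_threshold(rows):
    passed = [x for x, y in rows if y == 1]
    failed = [x for x, y in rows if y == 0]
    return (sum(passed) / len(passed) + sum(failed) / len(failed)) / 2

# 4. Evaluating the Model: fraction of held-out rows predicted correctly.
def accuracy(threshold, rows):
    correct = sum((x > threshold) == bool(y) for x, y in rows)
    return correct / len(rows)

threshold = train_threshold(train)
test_accuracy = accuracy(threshold, test)
print(round(threshold, 2), test_accuracy)

# 5. Improving the Performance: in practice one would gather more data,
#    add features, or try a different model, then repeat steps 2 to 4.
```

Evaluating on rows the model did not see during training is what makes the accuracy figure a fair estimate of how it would do on new data.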



