Think, Model, Code.....: physics

If you hold a statistical, analytical, or generally quantitative position centered around data analysis and prediction, then you have surely heard the terms Bayesian statistics or Bayes' Theorem. And while you might get the gist that it has to do with probability, you might not fully appreciate the power behind this insightful, yet relatively simple, theorem. The goal of this post is to give an introduction to Bayes' Theorem, specifically in the context of classification.

Bayes' Theorem, or Bayesian logic, is named after the English mathematician Thomas Bayes. The basic idea is this: how can we use prior or past events to help predict the most likely future events? Let's try to understand this from a simple, intuitive approach. Consider the (fictitious) problem of predicting which type of tablet (iPad, Kindle, or LeapPad) an individual purchased. The data features we have on the individuals are: age, computer, and tablet (the "class" we would like to predict). All the features are discrete or categorical, and the data is as follows:

Table 1 - Individual feature data corresponding to tablet purchase

Given this set of data, let's compute the following:
What is the probability of a person's age being between 0-12?

P(age==0-12) = 5/20 = 0.25 or 25%

What is the probability of a person purchasing a Kindle?

P(tablet==Kindle) = 8/20 = 0.4 or 40%

What is the probability of computer owned being PC?

P(comp==PC) = 14/20 = 0.70 or 70%

We could continue to ask the probability of each possible value for each of the 3 features (age, computer, and tablet). What we have been computing above are none other than the prior probabilities!
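As a minimal sketch, the three priors above can be reproduced from the raw counts. Only the counts quoted in this post are used (Table 1 itself is not reproduced here), and the `counts` dictionary and `prior` function are illustrative names, not part of any library.

```python
from fractions import Fraction

# Counts quoted in the post; the data set holds 20 people in total.
TOTAL = 20
counts = {
    ("age", "0-12"): 5,
    ("tablet", "Kindle"): 8,
    ("comp", "PC"): 14,
}

def prior(feature, value):
    """Relative frequency of feature == value among all 20 people."""
    return Fraction(counts[(feature, value)], TOTAL)

for feature, value in counts:
    p = prior(feature, value)
    print(f"P({feature}=={value}) = {p} = {float(p):.2f}")
```

Exact fractions are used so the printed decimals are only a display choice.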

Now, what if we ask how likely a person is to purchase a Kindle (the future event) given that the person owns a PC (the past or prior event)? This calculation is still simple: we already know that 14 people own a PC, of whom 6 purchased a Kindle, giving 6/14 = 42.9%. We use the following notation (note: the future event is specified first), which is read "the probability that the tablet purchased is a Kindle, given that the computer owned is a PC":

P(tablet==Kindle | comp==PC) = 6/14 = 0.429

Similarly, we can compute the probability that someone purchases a LeapPad or an iPad given that they own a PC:

P(tablet==leappad | comp==PC) = 5/14 = 0.357

P(tablet==iPad | comp==PC) = 3/14 = 0.214

The computations above are known as conditional probabilities, and are at the heart of Bayes' classifiers.
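The three conditional probabilities can be sketched in a few lines: restrict the data to the 14 PC owners and divide each tablet's count by that total. The counts are the ones quoted above; the function name `conditional` is an assumption made for illustration.

```python
from fractions import Fraction

# Tablet purchases among the 14 PC owners, as quoted in the post.
pc_owner_tablets = {"Kindle": 6, "leappad": 5, "iPad": 3}

def conditional(counts_given_condition):
    """P(tablet | condition): each joint count divided by the condition's total."""
    total = sum(counts_given_condition.values())
    return {t: Fraction(n, total) for t, n in counts_given_condition.items()}

p_tablet_given_pc = conditional(pc_owner_tablets)
for tablet, p in p_tablet_given_pc.items():
    print(f"P(tablet=={tablet} | comp==PC) = {float(p):.3f}")

# The best single guess is the tablet with the largest conditional probability.
most_likely = max(p_tablet_given_pc, key=p_tablet_given_pc.get)
print("Most likely tablet for a PC owner:", most_likely)
```

Because the three values are computed over the same 14 people, they sum to 1.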

Now think about how powerful this is: knowing only that a person owns a PC, we can guess that he or she most likely owns a Kindle (assuming every person in consideration owns a tablet). This is the essence of Bayes' Theorem. Extending the idea to more than a single feature, we can further predict the most likely tablet purchased if we also know the age of the person. Assume we are interested in the tablet most likely purchased by a person who is 13-40 years old AND owns a Mac. This is simply:

P(tablet==Kindle | age==13-40, comp==Mac) ∝

P(tablet==Kindle)*P(age==13-40 | tablet==Kindle)*P(comp==Mac | tablet==Kindle) =

(8/20) * (3/8) * (2/8) = 0.0375

P(tablet==leappad | age==13-40, comp==Mac) ∝

P(tablet==leappad)*P(age==13-40 | tablet==leappad)*P(comp==Mac | tablet==leappad) =

(5/20) * (0/5) * (0/5) = 0.0000

P(tablet==iPad | age==13-40, comp==Mac) ∝

P(tablet==iPad)*P(age==13-40 | tablet==iPad)*P(comp==Mac | tablet==iPad) =

(7/20) * (5/7) * (4/7) = 0.1429

Here each conditional probability is the share of a tablet's buyers who have the observed feature value: of the 8 Kindle buyers, 3 are aged 13-40 and 2 own a Mac; of the 7 iPad buyers, 5 are aged 13-40 and 4 own a Mac; none of the 5 LeapPad buyers fall in either group. You can clearly see that a person who is 13-40 years old and owns a Mac is almost 4 times as likely (0.1429 vs. 0.0375) to have purchased an iPad rather than a Kindle, with zero probability of owning a LeapPad (which makes sense, as the LeapPad is a targeted "learning" tablet with games and such for very young children). To compute the relative likelihoods we needed only to multiply the prior by the conditional probabilities for [age == 13-40] and [comp==Mac], treating the two features as independent given the tablet. Note: these scores are relative, which is why each first line uses ∝ ("proportional to"). To get proper probabilities that sum to 1, divide each score by the sum of all three (0.0375 + 0.0000 + 0.1429 = 0.1804), giving about 0.21 for the Kindle and 0.79 for the iPad.
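The scoring can be sketched with the standard naive Bayes rule, which multiplies each tablet's prior by P(feature value | tablet) for every observed feature and then divides by the sum of the scores so the results add up to 1. The joint counts below are inferred from the fractions quoted in this post (for example, 3 of the 8 people aged 13-40 bought a Kindle) rather than read from Table 1, and the function names are illustrative.

```python
from fractions import Fraction

TOTAL = 20
# Number of buyers of each tablet (from the priors 8/20, 5/20 and 7/20).
class_counts = {"Kindle": 8, "leappad": 5, "iPad": 7}
# Joint counts inferred from the post: how many buyers of each tablet
# are aged 13-40, and how many own a Mac.
joint_counts = {
    ("age", "13-40"): {"Kindle": 3, "leappad": 0, "iPad": 5},
    ("comp", "Mac"): {"Kindle": 2, "leappad": 0, "iPad": 4},
}

def naive_bayes_scores(observed):
    """Unnormalized score per tablet: prior * product of P(feature value | tablet)."""
    scores = {}
    for tablet, n_class in class_counts.items():
        score = Fraction(n_class, TOTAL)  # prior P(tablet)
        for feature_value in observed:
            score *= Fraction(joint_counts[feature_value][tablet], n_class)
        scores[tablet] = score
    return scores

def normalize(scores):
    """Divide by the sum of scores so the results form a probability distribution."""
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}

scores = naive_bayes_scores([("age", "13-40"), ("comp", "Mac")])
posteriors = normalize(scores)
for tablet in scores:
    print(f"{tablet}: score={float(scores[tablet]):.4f}, "
          f"probability={float(posteriors[tablet]):.3f}")
```

Only the ranking of the scores matters for picking the most likely tablet; the normalization step is needed only when actual probabilities are wanted.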

You might be wondering why I included the prior probability P(tablet== ). This is because, in addition to the conditional probabilities (namely those involving comp and age), the prior probability of each tablet purchase is also a strong indicator of which tablet is most likely to have been purchased. Consider the extreme case in which 19 of the 20 tablets purchased were iPads: the prior probability would be so high (P(tablet==iPad) = 0.95) that it would strongly outweigh the likelihood that a given person purchased a different tablet, for most conditional probability values.
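To put a rough number on that extreme case, here is a small sketch. It assumes, purely for illustration, that the one remaining purchase belongs to a single competing tablet, so its prior is 1/20; the function name is hypothetical.

```python
from fractions import Fraction

def required_likelihood_ratio(prior_favored, prior_other):
    """How much larger the other tablet's likelihood must be to overcome the prior.

    The other tablet wins only when
        prior_other * L_other > prior_favored * L_favored,
    i.e. when L_other / L_favored > prior_favored / prior_other.
    """
    return prior_favored / prior_other

ratio = required_likelihood_ratio(Fraction(19, 20), Fraction(1, 20))
print(f"The other tablet needs a likelihood more than {ratio} times larger to win.")
```

Under that assumption, the combined conditional probabilities would have to favor the other tablet by more than 19 to 1 before the prediction changes.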

Formally, Bayes' Theorem can be stated as:

P(A | B) = P(B | A)*P(A) / P(B)
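As a quick sanity check of the formula, the sketch below plugs in numbers already quoted in this post, taking A to be "comp==PC" and B to be "tablet==Kindle". The `bayes` helper is an illustrative name, not a library function.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# A = "comp==PC", B = "tablet==Kindle"
p_kindle_given_pc = Fraction(6, 14)  # P(B | A)
p_pc = Fraction(14, 20)              # P(A)
p_kindle = Fraction(8, 20)           # P(B)

p_pc_given_kindle = bayes(p_kindle_given_pc, p_pc, p_kindle)
print(f"P(comp==PC | tablet==Kindle) = {p_pc_given_kindle}")

# Direct count for comparison: 6 of the 8 Kindle buyers own a PC.
assert p_pc_given_kindle == Fraction(6, 8)
```

The theorem reverses the direction of the condition, and the result agrees with counting the PC owners among Kindle buyers directly.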

You will often hear a classifier built on this theorem referred to as Naive Bayes or the NB classifier. It is "naive" because it assumes the features are independent of each other given the class. Features (particularly in large data sets) are rarely independent; however, the classifier is surprisingly robust to violations of this assumption and can often outperform other classification methods. The simplicity and interpretability of the Bayes classifier are just a couple of the main reasons Bayesian methods are so ubiquitous in the data science community. Below you will find some links to discussions and humor related to Bayesian statistics.

Even Nate Silver Can Get Things In Data Science Wrong!

Bayesian Humor

Coming up in the next post......

1. Principles and derivation of Bayes' Theorem

2. Modeling and predicting using an actual data set


Monday, January 28, 2013

What's so Naive about Bayes' Classifier? [THINK: part 1 of 3]
