Bayes' Theorem, or Bayesian logic, is named after the English mathematician Thomas Bayes. The basic idea is this: how can we use prior or past events to help predict the most likely future events? Let's try to understand this with a simple, intuitive approach. Assume we would like to solve the (fictitious) problem of predicting which type of tablet (iPad, Kindle, or Leappad) an individual purchases. The data features we have on the individuals are: age, computer, and tablet (the "class" we would like to predict). All the features are discrete or categorical, and the data is as follows:
[Table 1 - Individual feature data corresponding to tablet purchase; image not available]
What is the probability of a person's age being between 0-12?
P(age==0-12) = 5/20 = 0.25 or 25%
What is the probability of a person purchasing a Kindle?
P(tablet==Kindle) = 8/20 = 0.4 or 40%
What is the probability of computer owned being PC?
P(comp==PC) = 14/20 = 0.70 or 70%
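The calculations above are simple relative frequencies. As a minimal Python sketch, assuming the per-value counts below (they are reconstructed from the fractions quoted in this post, since Table 1 itself is not reproduced; the names `TABLET_COUNTS`, `COMPUTER_COUNTS`, and `prior_probabilities` are illustrative only):

```python
from fractions import Fraction

# Counts reconstructed from the fractions quoted in the post (assumption:
# Table 1 is not available, so these are inferred, not copied from it).
TABLET_COUNTS = {"Kindle": 8, "leappad": 5, "iPad": 7}
COMPUTER_COUNTS = {"PC": 14, "Mac": 6}


def prior_probabilities(counts):
    """Return P(value) = count / total for each value of one feature."""
    total = sum(counts.values())
    return {value: Fraction(n, total) for value, n in counts.items()}


tablet_priors = prior_probabilities(TABLET_COUNTS)
computer_priors = prior_probabilities(COMPUTER_COUNTS)

for value, p in tablet_priors.items():
    print(f"P(tablet=={value}) = {float(p):.2f}")
for value, p in computer_priors.items():
    print(f"P(comp=={value}) = {float(p):.2f}")
```

Using `Fraction` keeps the ratios exact until they are printed, which makes it easy to check that the probabilities for one feature sum to 1.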
We could continue to ask for the probability of each possible value of each of the 3 features (age, computer, and tablet). What we have been computing above are none other than the prior probabilities!
Now what if we were to ask: what is the probability of a person purchasing a Kindle tablet (future event) given that the person uses a PC (past or prior event)? This calculation is still simple, since we already know that 14 persons own a PC, of which 6 purchased a Kindle, giving 6/14 = 42.9%. We use the following notation (note: the future event is specified first), which is read "the probability that the tablet purchased is a Kindle, given that the computer owned is a PC":
P(tablet==Kindle | comp==PC) = 6/14 = 0.429
Similarly, we can compute the probability that someone purchases a Leappad or an iPad given that they own a PC:
P(tablet==leappad | comp==PC) = 5/14 = 0.357
P(tablet==iPad | comp==PC) = 3/14 = 0.214
The computations above are known as conditional probabilities, and are at the heart of Bayes' classifiers.
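A conditional probability is just a relative frequency computed within the subset that satisfies the condition. The sketch below assumes joint (tablet, computer) counts reconstructed from the fractions quoted in this post, not read from Table 1; `JOINT_COUNTS` and `conditional_given_computer` are hypothetical names used for illustration.

```python
from fractions import Fraction

# Joint counts of (tablet, computer), inferred from the post's fractions
# (assumption: the original table is not available).
JOINT_COUNTS = {
    ("Kindle", "PC"): 6, ("leappad", "PC"): 5, ("iPad", "PC"): 3,
    ("Kindle", "Mac"): 2, ("leappad", "Mac"): 0, ("iPad", "Mac"): 4,
}


def conditional_given_computer(joint_counts, computer):
    """Return P(tablet | comp == computer) for every tablet."""
    # Keep only the rows that satisfy the condition, then normalize.
    matching = {t: n for (t, c), n in joint_counts.items() if c == computer}
    total = sum(matching.values())
    return {t: Fraction(n, total) for t, n in matching.items()}


given_pc = conditional_given_computer(JOINT_COUNTS, "PC")
most_likely = max(given_pc, key=given_pc.get)

for tablet, p in given_pc.items():
    print(f"P(tablet=={tablet} | comp==PC) = {float(p):.3f}")
print("Most likely tablet for a PC owner:", most_likely)
```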
Now think about how powerful this is: knowing only that a person owns a PC, we can guess that he or she most likely owns a Kindle (assuming every person in consideration owns a tablet). This is the essence of Bayes' Theorem. Extending the idea to more than a single feature, we can further predict the most likely tablet purchased if we also know the "age" of the person. Assume we are interested in the tablet most likely purchased by a person who is 13-40 years old AND owns a Mac. This is simply:
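P(tablet==Kindle | age==13-40, comp==Mac) ∝
P(tablet==Kindle)*P(age==13-40 | tablet==Kindle)*P(comp==Mac | tablet==Kindle) =
(8/20) * (3/8) * (2/8) = 0.0375
P(tablet==leappad | age==13-40, comp==Mac) ∝
P(tablet==leappad)*P(age==13-40 | tablet==leappad)*P(comp==Mac | tablet==leappad) =
(5/20) * (0/5) * (0/5) = 0.000
P(tablet==iPad | age==13-40, comp==Mac) ∝
P(tablet==iPad)*P(age==13-40 | tablet==iPad)*P(comp==Mac | tablet==iPad) =
(7/20) * (5/7) * (4/7) = 0.143
Here ∝ means "proportional to": the shared denominator P(age==13-40, comp==Mac) is dropped because it is the same for every tablet. Each conditional probability is taken within a tablet class (for example, 3 of the 8 Kindle buyers are aged 13-40, and 2 of the 8 own a Mac). The iPad has the highest score, so it is the most likely purchase.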
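The two-feature prediction can also be sketched in Python. This is a minimal illustration rather than a definitive implementation: it scores each tablet with the standard naive Bayes form, the prior times P(feature value | tablet) for each observed feature, and then normalizes the scores so they sum to 1. The counts are assumptions reconstructed from the fractions quoted in this post (only the feature values used here are included), and `CLASS_COUNTS`, `FEATURE_COUNTS`, and `naive_bayes_posterior` are hypothetical names.

```python
# Number of buyers per tablet (assumption: inferred from the post's fractions).
CLASS_COUNTS = {"Kindle": 8, "leappad": 5, "iPad": 7}

# FEATURE_COUNTS[feature][value][tablet] = buyers of that tablet with that value.
FEATURE_COUNTS = {
    "age": {"13-40": {"Kindle": 3, "leappad": 0, "iPad": 5}},
    "comp": {
        "Mac": {"Kindle": 2, "leappad": 0, "iPad": 4},
        "PC": {"Kindle": 6, "leappad": 5, "iPad": 3},
    },
}


def naive_bayes_posterior(observed):
    """Return normalized P(tablet | observed features) under naive Bayes."""
    total = sum(CLASS_COUNTS.values())
    scores = {}
    for tablet, n_class in CLASS_COUNTS.items():
        score = n_class / total  # prior P(tablet)
        for feature, value in observed.items():
            # likelihood P(feature == value | tablet)
            score *= FEATURE_COUNTS[feature][value][tablet] / n_class
        scores[tablet] = score
    norm = sum(scores.values())
    if norm == 0:
        raise ValueError("every tablet has zero score for these features")
    return {tablet: s / norm for tablet, s in scores.items()}


posterior = naive_bayes_posterior({"age": "13-40", "comp": "Mac"})
prediction = max(posterior, key=posterior.get)

for tablet, p in posterior.items():
    print(f"P(tablet=={tablet} | age==13-40, comp==Mac) = {p:.3f}")
print("Predicted tablet:", prediction)
```

Normalizing does not change which tablet wins; it only rescales the scores into probabilities that are easier to compare.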
You might be wondering why I included the P(tablet== ) probability. This is because, in addition to the conditional probabilities (namely those involving comp and age), the prior probability of each tablet purchase is also a strong indicator of which tablet is most likely to have been purchased. Consider the extreme possibility that 19 of 20 tablets purchased were iPads: the prior probability would then be so high (P(tablet==iPad) = 0.95) that it would strongly sway the prediction toward the iPad, and away from any other tablet, for most conditional probability values.
Formally, Bayes' Theorem can be stated as:
P(A | B) = P(B | A)*P(A) / P(B)
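As a quick sanity check, the formula can be applied with A = "tablet is a Kindle" and B = "computer is a PC". The sketch below assumes that 6 of the 8 Kindle buyers own a PC, which is inferred from the fractions quoted in this post; `bayes` is an illustrative helper, not a library function.

```python
from fractions import Fraction


def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_kindle = Fraction(8, 20)          # P(A): prior probability of a Kindle
p_pc = Fraction(14, 20)             # P(B): prior probability of a PC
p_pc_given_kindle = Fraction(6, 8)  # P(B | A): assumed, 6 of 8 Kindle buyers own a PC

p_kindle_given_pc = bayes(p_pc_given_kindle, p_kindle, p_pc)
print(f"P(tablet==Kindle | comp==PC) = {float(p_kindle_given_pc):.3f}")
```

The result equals 6/14, the same value obtained by counting Kindle buyers among PC owners directly.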
You will often hear classifiers built on the above theorem referred to as Naive Bayes or NB classifiers, since they assume that the features are independent of each other given the class. Features (particularly in large data sets) are rarely independent; however, this assumption works astonishingly well in practice, and the NB classifier can often outperform other classification methods. The simplicity and interpretability of the Bayes classifier are just a couple of the main reasons Bayesian methods are so ubiquitous in the data science community. Below you will find some links to discussions and humor related to Bayesian statistics.
Coming up in the next post...
1. Principles and derivation of Bayes' Theorem
2. Modeling and predicting using an actual data set