Thursday, January 31, 2013

The $50k question for Bayes' Theorem [MODEL: part 2 of 3]

Here's a somewhat disturbing fact: in the city of San Francisco, if you are a single making under $60k/year you qualify for low-income housing. It's probably safe to assume that the average person working in San Francisco has a higher salary than someone in Baton Rouge, LA. While were at it, I think its also likely that individuals who are older and, hence, have worked more years will have a higher income. In fact, a census study done in 1994 asked these types of questions with the goal to determine if they could predict whether a given person earned more or less than $50k/year.

The data we will use for this study comes from the UCI machine learning repository, which houses a large set of data which is great for learning and testing different types of modeling techniques. The data we will be working with contains 14 data features which we will use to predict the likelihood a given individual has an income greater than $50k or less than (or equal to) $50k. You can find this census income data set here.   A description of the data is as follows:

The details of the data at the feature-level is as follows:
  • age: continuous value (years)
  • workclass: private, federal-gov, local-gov, never-worked, etc.....
  • fnlwgt: continuous demographic index value
  • education: high-school, college, associates degree, PhD, etc
  • education-num: (number of years educated)
  • marital-status: single, married, divorces, etc
  • occupation: Tech-support, Craft-repair, Other-service, Sales, etc
  • relationship: Wife, Own-child, Husband, Not-in-family, etc
  • race: White, Asian-Pac-Islander, Black, etc
  • sex: Male, Female
  • capital-gain: continuous value (dollars) 
  • capital-loss: continuous value (dollars)
  • hours-per-week: continuous value (hours)
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, etc

We will attempt to use Bayes' Theorem (Formal Explanation of Bayes' Theorem)

P(A|B) = \frac{P(B | A)\, P(A)}{P(B)}. \,

with the aforementioned 14 features to accurately predict an individuals income as >$50k or <=$50k.

As promised in the previous post, here is a quick and simple derivation of Bayes' Theorem since we will be actually applying it in this post (the MODEL portion of the series):

 = Probability of an event A & B occurring together


The conditional probability of event A occurring given B has occurred can also be thought of as the probability of both events A & B occurring together divided by the probability of event B occurring alone. The conditional probability of event B, given A has occurred can be similarly found. Formally:

P(A|B)=\frac{P(A \cap B)}{P(B)}, \text{ if } P(B) \neq 0, \!

P(B|A) = \frac{P(A \cap B)}{P(A)}, \text{ if } P(A) \neq 0, \!

Now that we have expressed both P(A | B) and P(B | A) with the same numerator, we can simplify:

\implies P(A \cap B) = P(A|B)\, P(B) = P(B|A)\, P(A), \!

Then, dividing by P(B) yields the familiar form of Bayes' Theorem:

\implies P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}, \text{ if } P(B) \neq 0.

An important fact about using Bayes' theorem for classification (i.e. NB-classifiers) is that it requires the data to be discrete. In fact, the application of Bayes' Theorem used for this problem is often referred to as a multinomial naive bayes (MNB) classifier. THINK back to the first post of this series on Bayes' Theorem: all the probabilities (prior or conditional) were all computed assuming discrete values for the features. In the event you have a mixed set of features which are discrete and continuous, you can always discretize your continuous features (optimally discretizing features will be the topic of a future Think, Model, Code.... blog).

Okay, enough derivations. Let's return to the problem of predicting the income class of individuals. A good first step when looking at a new data set is look at the distribution of feature values. For this data set, of particular interest lies in the "age" feature. Age takes on many values (~ 80-100 values or years), while our response feature (income  >$50k or <=$50k) is binary. So let's plot a histogram and take a peek at the age distribution of individuals in this census:
Clearly there is a skew in the distribution towards younger individuals. The key here is to look at how age distributions change as a function of the two income classes (<=$50k  or >$50k).

Just as we suspected in the beginning of this post, there is a clear difference in age distribution. Younger individuals tend to make less money, as they have likely worked fewer years, whereas the peak age for the higher income earners is around the mid-40's and early-50's. Using this analysis, we can suggest a quantization of age from a nearly continuous feature to one that takes on fewer discrete values (NB-classifiers work best when you have the fewest values possible for discrete features while minimizing the information lost). Knowing that we would like to have discrete values representing the lower age groups and middle age groups, we can justifiably use ~ 6 discrete values corresponding to age "bins" which capture most the correlation with income class. If we have more than 6 feature values we start to lose correlation to the response feature (income) and likewise for fewer feature values. 6 features maximizes the correlation between age and income class. The new 6 values for age are such that anyone with an age between [17,29] would be given the age label = 1, age between [30,41] given age label = 2, age between [42-53] gets label = 3, and so on. Some of you may have noticed that basically the spacing is age bins of 11 years, this not by chance but the spacing that is "optimal." Instead of having ~90-100 different age possibilities, there are now just 6! Not only does this make computing the probabilities faster, but easier to interpret! (Unsurprisingly, this will also make computation time on in a computer program significantly faster).

Following this same type of rationale for studying the features with respect to the income class, we can determine different levels of discretization for any of the continuous features.

These continuous features were converted to discrete features with the corresponding new values:
age:                    [18-29], [30-41], [42-53], [54-65], [66-77], [77+]          --> 0,1,2,3,4,5
fnlwgt:               [13770-504081], [504082-994393], [994394,1484705] -- > 0,1,2
education-num:  [1,4], [5-7], [8-10], [11-13], [16+]                                  -- > 0,1,2,3,4
capital-gain:       [0-33333], [33334,66666], [66667,99999]                     --> 0,1,2
capital-loss:        [0-1089], [1090-2178], [2179-3267], [3268-4356]         --> 0,1,2,3
hours-per-week: [1-33.66], [33.67-66.33], [66.34-99]                              --> 0,1,2

We now have a fully discrete data set which can be used to train a NB-classifier (this amounts to finding the prior probabilities for the income classes and conditional probabilities). Training the NB-classifier is only part of the story. There's also a need to test the model against possibly overfitting the data and ensuring generalization of the classifier and accurate predictions. This process of training and testing a model is popularly known as cross-validation.

The UCI machine learning repository suggests we use 2/3 of the data for training and 1/3 of the data for testing. After removing rows in the data which have missing feature values, the final row count should go from 32561 --> 30162. Your training and test sets should be, approximately ~20000 training samples, and ~ 10000 test samples. Training essentially amounts to a scaled up version of what we did in the previous post: calculate prior and conditional probabilities for all 14 features against income class. The result of all the probabilities computed is what I refer to as a 3-dimensional "probability cube."



Here the "front" cube slice would be all the conditional probabilities (the probabilities are not the actual values) with respect to the income class (>$50k) while the "back" cube slice are the conditional probabilities with respect to (<=$50k). The table is used like this: A person in the first age "bin" has conditional probabilities of being in the income class (>$50k) of 6.5% and income class (<=$50k) of 12.5%. In other words, someone in the first age bin is twice as likely to be in the lower income class than the higher income class.

Let's look at the feature Native-Country. Someone from the 4th value of "native-country" has a 0.9% chance of being in the higher income class vs. 0.6% for the lower income class (this country happens to be Germany). Furthermore, just as we also did in the previous post, finding the conditional probability of being in either income class with respect to any number of features is simply a multiplicative computation. For example, someone with age = 1 and native-country=1, would have the resulting or posterior probabilities below:

P(>$50k) * P(Age_bin==1 | Income== >$50k) * P(Native-Country==1 | Income== >$50k) =
0.25 * 0.065 * 0.064 = 0.00104  (un-normalized)


P(<=$50k) * P(Age_bin==1 | Income== <=$50k) * P(Native-Country==1 | Income== <=$50k) =
0.75 * 0.125 * 0.105 = 0.00984 (un-normalized)

Consider what the results mean. For a person that is relatively young (Age = 1 is the lowest age "bin") and is from Native-Coutry = 1 (Puerto Rico), the odds lie nearly a 10-to-1 for them being in the lower income class vs. higher income class. 

Clearly, there are useful insights and interpretations from the Bayes' Model created. The take-home from all this: the meat of Naive-Bayes' (NB) Classifiers is the probability cube created looking at "past" events. Here, "training" data represents "past" events and the future (or individuals whose income class is taken as unknown) is represented by the "test" data. To make predictions on individuals which have unknown income class, we simply find or "look up" their corresponding conditional probabilities (as we did above) then determine which income class "wins" using the value of relative un-normalized probability. 

Coming up in part 3 [CODE portion] of this series:

1. Creating a python script to execute a NB-classifier on the above train and test data sets! 






Monday, January 28, 2013

What's so Naive about Bayes' Classifier? [THINK: part 1 of 3]

If you are a someone that holds a statistical, analytical, or generally quantitative position centered around data analysis and predictions, then you have surely heard the term Bayesian Statistics or Bayes' Theorem. And while you might get the gist that it has to do with probability, you might not fully understand the power behind this insightful, yet relatively simple theory. The goal of this post is to help give an introduction to Bayes' Theorem, specifically in the context of classification.

Bayes' Theorm or Bayesian logic is named after English Mathematician Thomas Bayes. The basic idea is this: how can we use prior or past events to help predict the most likely future events to occur. Let's try to understand this from a simple, intuitive approach. Assume we would like to determine the (fictitious) problem of classifying a certain type of tablet (iPad, Kindle, or Leappad) to an individual. The data features we have on the individuals are: age, computer, and tablet (the "class" we would like to predict). All the features are discrete or categorical, and the data is as follows:
Table 1 - Individual feature data corresponding to tablet purchase
Given this set of data, let's compute the following:
What is the probability of a person's age being between 0-12?
P(age==0-12) = 5/20 = 0.25 or 25%
What is the probability of a person purchasing a Kindle?
P(tablet==Kindle) = 8/20 = 0.4 or 40%
What is the probability of computer owned being PC?
P(comp==PC) = 14/20 = 0.70 or 70%
We could continue to ask the probability of each possible value for each of the 3 features (age, computer, and tablet). What we have been computing above are none other than the prior probabilities!

Now what if we were to ask what is the likelihood of a person purchasing a Kindle tablet (future event) given that person uses a PC (past or prior event). This calculation is still simple, since we already know that 14 persons own a PC, of which 6 purchased a Kindle or 6/14 = 42.9%. We use the following notation (note: future events are specified first), which is read "probability of tablet being purchased is a Kindle given the condition the computer owned is a PC:
P(tablet==Kindle | comp==PC) = 6/14 = 0.429

Similarly, we can compute the likelihood someone purchases a leappad and iPad given they own a PC:

P(tablet==leappad | comp==PC) = 5/14 = 0.357
P(tablet==iPad | comp==PC) = 3/14 = 0.214

The computations above are known as conditional probabilities, and are at the heart of Bayes' classifiers.

Now think about how powerful this is, knowing only that a person owns a PC, we can make the most likely guess that he or she would own a Kindle (assuming every person in consideration owns a tablet). This is the essence of Bayes' Theorem. Extending Bayes' Theorem to more than a single feature, we can further predict the most likely tablet purchased if we also know the "age" of the person. Assume we are interested in the tablet most likely purchased by a person who is 13-40 years old AND owns a MAC, this is simply:

P(tablet==Kindle | age==13-40, comp=Mac) = 
P(tablet==Kindle)*P(tablet==Kindle | age==13-40)*P(tablet==Kindle | comp=Mac) = 
(8/20) * (3/8) *  (2/6) =   0.050

P(tablet==leappad | age==13-40, comp=Mac) =
P(tablet==leappad)*P(tablet==leappad | age==13-40)*P(tablet==leappad | comp=Mac) = 
(5/20) * (0/8) *  (0/6) =   0.000

P(tablet==iPad | age==13-40, comp=Mac) = 
P(tablet==iPad)*P(tablet== iPad | age==13-40)*P(tablet== iPad | comp=Mac) =
 (7/20) * (5/8) *  (4/6) =   0.146

Here you can clearly see that given a person whom is between 13-40 years age and owns a Mac is 3 times as likely (0.05 vs. 0.146) to have purchased an iPad rather than Kindle, with zero probability of owning a leappad (which makes sense as leappad is a targeted "learning" tablet with games and such for very young children). To compute the relative likelihoods we needed only to multiply the conditional probabilities for [age == 13-40] and [comp==Mac]. Note: To get the "proper" probabilities, we would have to divide by P(age=13-40) to restrict probabilities between [0,1].

You might be wondering why I included the P(tablet== ) probability. This is because in addition to the conditional probabilities (namely conditional probabilities with comp and age), the prior probability of tablet purchases is also strong indicators of which tablet is most likely be have been purchased. Consider the extreme possibility that 19 of 20 tablets purchased were iPads, then the prior probability would be so high (P(tablet==iPad) = 0.95) that it strongly sway the likelihoods that a given person purchased a different tablet for most conditional probability values.

Formally, Bayes' Theorem can be stated as:

P(A | B) = P(B | A)*P(A) / P(B)




You will (correctly) hear the above theorem referred to as Naive-Bayes or NB-classifier, since it makes assumptions on the features being independent of each other. Features (particularly in large data sets) are rarely independent, however this assumption holds astonishingly well and the NB-classifier can many times outperform other classification methods. The simplicity and interpretability of Bayes' Classifier are just couple of the main reasons Bayesian methods are so ubiquitously used in the data science community. Below you will find some links to discussions and humor related to Bayesian statistics. 



Coming up in the next post......

1. Principles and derivation of Bayes' Theorem
2. Modeling and predicting using an actual data set







Our background, Our Passions, Our Blog.....

Welcome to the blog of two not-so-socially awkward physicists: Jason Schissel (theorist) and Brian Choi (experimentalist). We ran into each other during graduate school in the department of physics at UCLA. As it happens, we both gravitated towards algorithm development, predictive modeling, and machine learning. What we realized is that we wanted to share our experiences and ideas with the data science community to proliferate the use of machine learning and statistical methods in solving real-world problems, finding elegant solutions to seemingly complex systems, and optimizing predictions while stressing the importance of interpreting and deeply understanding the results.

The purpose of this blog is simple: introduce concepts and ideas which we (and anyone else in the data science community) find interesting. The concepts will center around our passions in statistics, probability, machine learning, and linear algebra. There will be toy data sets, actual data sets, and visualization aids to drive concepts beyond simple understanding, hopefully reaching your intuition. Finally, and practically, we will codify the concepts and ideas into scripting or programming languages such as: MATLAB, Python, and C++.

As the blog title states, this blog will follow a format of posts which will: THINK, MODEL, then CODE the chosen topics to a point where the reader can learn, practice, and execute these methods on their own data.

We hope you enjoy reading our blog, learn a few things!