Showing posts with label python. Show all posts
Showing posts with label python. Show all posts

Saturday, April 13, 2013

Naive-Bayes Classification using Python, NumPy, and Scikits

So after a busy few months, I have finally returned to wrap up this series on Naive-Bayes Classification. I have decided to use a simple classification problem borrowed (again) from the UCI machine learning repository. You can read about this data set here, and download the data used in this example here. This example assumes you have python version 2.7.X or newer, and have the packages NumPy and Scikits-learn installed. You can use the links provided to download and install them, or use easy_install to do the installation (an example of using easy_install for installing scikits-learn is given here).

Here is a quick snapshot of the data and the classification task at hand:

========================================================================
This dataset was taken from the UCI Machine Learning Repository

(http://archive.ics.uci.edu/ml/datasets.html)

1. Number of Instances: 1728
   (instances completely cover the attribute space)

2. Number of Attributes (features): 6

3. Data feature descriptions:
0 - buying:    vhigh, high, med, low.
1 - maint:      vhigh, high, med, low.
2 - doors:        2,  3, 4, 5more.
3 - persons:     2,  4, more.
4 - lug_boot:  small,  med, big.
5 - safety:        low,  med, high.

4. Class Labels (to predict thru classification):
car evaluation: unacc, acc, good, vgood

5. Missing Attribute Values: none

6. Class Distribution (number of instances per class)
There is a sample imbalance (very common in real world data sets)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %)
   acc        384     (22.222 %)
   good        69     ( 3.993 %)
   v-good      65     ( 3.762 %)
========================================================================

Here is the python script which trains the model and tests its generalizability using a test set. 




If the above code is executed either in the python shell (by copy and pasting the lines) or at the command prompt using: $ python (path_to_file)/thinkModelcode_carData.py
you should get an accuracy between 80-90% (there is a range because it all depends on WHICH rows were used for test vs. test data). The output line should be something like:

Classification accuracy of MNB =  0.901162790698



Explanation of python code:
  1. lines 1-6: importing several modules and packages necessary to run this script
  2. lines 10-23: using packages urllib and csv to read in data from URL (click links for more info)
  3. line 25: create a list of feature names (for reference)
  4. lines 27-31: converting string data into numerical form by using NumPy's unique() function
    • keys       --> contains string labels corresponding to numerical values assigned
    • numdata --> contains numerical representations of string labels in 'data' read from URL
  5. lines 33-37: determine number of rows, columns. Also convert numdata to be of int array type.
    • split numdata into xdata (first 6 columns) and ydata (last column of class labels)
  6. lines 41-46: convert each multivalued feature in xdata, to a multi-column binary feature array. This conversion is done using sklearn.preprocessing.LabelBinarizer. Here's an example of how what this conversion looks like:
    • >>> a = [2,0,1]
    • >>> lbin.fit_transform(a)
      • OUTPUT:
      • array([[ 0.,  0.,  1.],
      •           [ 1.,  0.,  0.],
      •           [ 0.,  1.,  0.]])
  7. lines 51-55: create training and test data sets from full sample of 1728 rows:
    • To create the test and training sets, we simple create an array of ALL indices:
      • allIDX = [0, 1, 2,......,1725, 1726, 1727]
    • random.shuffle(allIDX) "shuffles" ordered indices of allIDX to a randomized list:
      • allIDX = [564, 981, 17, ...., 1023, 65, 235]
    • Then we simply take the first 10% of allIDX as the test set, the remaining as training.
  8. lines 58-61: use testIDX and trainIDX with xdata to create xtest and xtrain, respectively
  9. lines 62-67: use sklearn's naive_bayes module to perform multinomial naive-bayes classification
Hopefully, the combination of having an introduction to the basics and formalism of Naive Bayes Classifiers, running thru a toy example in US census income dataset, and being able to see an application of Naive-Bayes classifiers in the above python code (I hope you play with it beyond the basic python script above!) helps solidify some of the main points and value of using Bayes' Theorem.

Please let me know if you have any questions and, as always, comments and suggestions are greatly appreciated!

Monday, January 28, 2013

What's so Naive about Bayes' Classifier? [THINK: part 1 of 3]

If you are a someone that holds a statistical, analytical, or generally quantitative position centered around data analysis and predictions, then you have surely heard the term Bayesian Statistics or Bayes' Theorem. And while you might get the gist that it has to do with probability, you might not fully understand the power behind this insightful, yet relatively simple theory. The goal of this post is to help give an introduction to Bayes' Theorem, specifically in the context of classification.

Bayes' Theorm or Bayesian logic is named after English Mathematician Thomas Bayes. The basic idea is this: how can we use prior or past events to help predict the most likely future events to occur. Let's try to understand this from a simple, intuitive approach. Assume we would like to determine the (fictitious) problem of classifying a certain type of tablet (iPad, Kindle, or Leappad) to an individual. The data features we have on the individuals are: age, computer, and tablet (the "class" we would like to predict). All the features are discrete or categorical, and the data is as follows:
Table 1 - Individual feature data corresponding to tablet purchase
Given this set of data, let's compute the following:
What is the probability of a person's age being between 0-12?
P(age==0-12) = 5/20 = 0.25 or 25%
What is the probability of a person purchasing a Kindle?
P(tablet==Kindle) = 8/20 = 0.4 or 40%
What is the probability of computer owned being PC?
P(comp==PC) = 14/20 = 0.70 or 70%
We could continue to ask the probability of each possible value for each of the 3 features (age, computer, and tablet). What we have been computing above are none other than the prior probabilities!

Now what if we were to ask what is the likelihood of a person purchasing a Kindle tablet (future event) given that person uses a PC (past or prior event). This calculation is still simple, since we already know that 14 persons own a PC, of which 6 purchased a Kindle or 6/14 = 42.9%. We use the following notation (note: future events are specified first), which is read "probability of tablet being purchased is a Kindle given the condition the computer owned is a PC:
P(tablet==Kindle | comp==PC) = 6/14 = 0.429

Similarly, we can compute the likelihood someone purchases a leappad and iPad given they own a PC:

P(tablet==leappad | comp==PC) = 5/14 = 0.357
P(tablet==iPad | comp==PC) = 3/14 = 0.214

The computations above are known as conditional probabilities, and are at the heart of Bayes' classifiers.

Now think about how powerful this is, knowing only that a person owns a PC, we can make the most likely guess that he or she would own a Kindle (assuming every person in consideration owns a tablet). This is the essence of Bayes' Theorem. Extending Bayes' Theorem to more than a single feature, we can further predict the most likely tablet purchased if we also know the "age" of the person. Assume we are interested in the tablet most likely purchased by a person who is 13-40 years old AND owns a MAC, this is simply:

P(tablet==Kindle | age==13-40, comp=Mac) = 
P(tablet==Kindle)*P(tablet==Kindle | age==13-40)*P(tablet==Kindle | comp=Mac) = 
(8/20) * (3/8) *  (2/6) =   0.050

P(tablet==leappad | age==13-40, comp=Mac) =
P(tablet==leappad)*P(tablet==leappad | age==13-40)*P(tablet==leappad | comp=Mac) = 
(5/20) * (0/8) *  (0/6) =   0.000

P(tablet==iPad | age==13-40, comp=Mac) = 
P(tablet==iPad)*P(tablet== iPad | age==13-40)*P(tablet== iPad | comp=Mac) =
 (7/20) * (5/8) *  (4/6) =   0.146

Here you can clearly see that given a person whom is between 13-40 years age and owns a Mac is 3 times as likely (0.05 vs. 0.146) to have purchased an iPad rather than Kindle, with zero probability of owning a leappad (which makes sense as leappad is a targeted "learning" tablet with games and such for very young children). To compute the relative likelihoods we needed only to multiply the conditional probabilities for [age == 13-40] and [comp==Mac]. Note: To get the "proper" probabilities, we would have to divide by P(age=13-40) to restrict probabilities between [0,1].

You might be wondering why I included the P(tablet== ) probability. This is because in addition to the conditional probabilities (namely conditional probabilities with comp and age), the prior probability of tablet purchases is also strong indicators of which tablet is most likely be have been purchased. Consider the extreme possibility that 19 of 20 tablets purchased were iPads, then the prior probability would be so high (P(tablet==iPad) = 0.95) that it strongly sway the likelihoods that a given person purchased a different tablet for most conditional probability values.

Formally, Bayes' Theorem can be stated as:

P(A | B) = P(B | A)*P(A) / P(B)




You will (correctly) hear the above theorem referred to as Naive-Bayes or NB-classifier, since it makes assumptions on the features being independent of each other. Features (particularly in large data sets) are rarely independent, however this assumption holds astonishingly well and the NB-classifier can many times outperform other classification methods. The simplicity and interpretability of Bayes' Classifier are just couple of the main reasons Bayesian methods are so ubiquitously used in the data science community. Below you will find some links to discussions and humor related to Bayesian statistics. 



Coming up in the next post......

1. Principles and derivation of Bayes' Theorem
2. Modeling and predicting using an actual data set







Our background, Our Passions, Our Blog.....

Welcome to the blog of two not-so-socially awkward physicists: Jason Schissel (theorist) and Brian Choi (experimentalist). We ran into each other during graduate school in the department of physics at UCLA. As it happens, we both gravitated towards algorithm development, predictive modeling, and machine learning. What we realized is that we wanted to share our experiences and ideas with the data science community to proliferate the use of machine learning and statistical methods in solving real-world problems, finding elegant solutions to seemingly complex systems, and optimizing predictions while stressing the importance of interpreting and deeply understanding the results.

The purpose of this blog is simple: introduce concepts and ideas which we (and anyone else in the data science community) find interesting. The concepts will center around our passions in statistics, probability, machine learning, and linear algebra. There will be toy data sets, actual data sets, and visualization aids to drive concepts beyond simple understanding, hopefully reaching your intuition. Finally, and practically, we will codify the concepts and ideas into scripting or programming languages such as: MATLAB, Python, and C++.

As the blog title states, this blog will follow a format of posts which will: THINK, MODEL, then CODE the chosen topics to a point where the reader can learn, practice, and execute these methods on their own data.

We hope you enjoy reading our blog, learn a few things!