Saturday, April 13, 2013

Naive-Bayes Classification using Python, NumPy, and Scikits

So after a busy few months, I have finally returned to wrap up this series on Naive-Bayes Classification. I have decided to use a simple classification problem borrowed (again) from the UCI machine learning repository. You can read about this data set here, and download the data used in this example here. This example assumes you have Python version 2.7.X or newer and the packages NumPy and scikit-learn installed. You can use the links provided to download and install them, or use easy_install to do the installation (an example of using easy_install for installing scikit-learn is given here).

Here is a quick snapshot of the data and the classification task at hand:

========================================================================
This dataset was taken from the UCI Machine Learning Repository

(http://archive.ics.uci.edu/ml/datasets.html)

1. Number of Instances: 1728
   (instances completely cover the attribute space)

2. Number of Attributes (features): 6

3. Data feature descriptions:
0 - buying:    vhigh, high, med, low
1 - maint:     vhigh, high, med, low
2 - doors:     2, 3, 4, 5more
3 - persons:   2, 4, more
4 - lug_boot:  small, med, big
5 - safety:    low, med, high

4. Class Labels (to predict thru classification):
car evaluation: unacc, acc, good, vgood

5. Missing Attribute Values: none

6. Class Distribution (number of instances per class)
There is a sample imbalance (very common in real world data sets)

   class      N          N[%]
   -----------------------------
   unacc     1210     (70.023 %)
   acc        384     (22.222 %)
   good        69     ( 3.993 %)
   vgood       65     ( 3.762 %)
========================================================================
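
As a quick sanity check on the class distribution above, the short sketch below recomputes the percentages from the listed counts. It also reports the share of the largest class, which is the accuracy you would get by always predicting unacc. That majority-class baseline is an extra reference point and not part of the UCI description, and the variable names are only illustrative.

```python
# Class counts copied from the data description above.
counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

total = sum(counts.values())

# Percentage of instances per class.
percent = {label: 100.0 * n / total for label, n in counts.items()}

for label, n in counts.items():
    print("%-6s %5d  (%6.3f %%)" % (label, n, percent[label]))

# Accuracy of a trivial model that always predicts the most common class.
# This is an added reference point, not something taken from the dataset notes.
majority_baseline = max(counts.values()) / float(total)
print("Majority-class baseline accuracy = %.3f" % majority_baseline)
```

Any accuracy reported on this data is easier to judge against that baseline, because of the sample imbalance.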

Here is the python script which trains the model and tests its generalizability using a test set. 




If the above code is executed either in the Python shell (by copying and pasting the lines) or at the command prompt using: $ python (path_to_file)/thinkModelcode_carData.py
you should get an accuracy between 80% and 90% (there is a range because the result depends on WHICH rows were used for training vs. test data). The output line should be something like:

Classification accuracy of MNB =  0.901162790698



Explanation of python code:
  1. lines 1-6: importing several modules and packages necessary to run this script
  2. lines 10-23: using packages urllib and csv to read in data from URL (click links for more info)
  3. line 25: create a list of feature names (for reference)
  4. lines 27-31: converting string data into numerical form by using NumPy's unique() function
    • keys       --> contains string labels corresponding to numerical values assigned
    • numdata --> contains numerical representations of string labels in 'data' read from URL
  5. lines 33-37: determine number of rows, columns. Also convert numdata to be of int array type.
    • split numdata into xdata (first 6 columns) and ydata (last column of class labels)
  6. lines 41-46: convert each multivalued feature in xdata to a multi-column binary feature array. This conversion is done using sklearn.preprocessing.LabelBinarizer. Here's an example of what this conversion looks like:
    • >>> a = [2,0,1]
    • >>> lbin.fit_transform(a)
      • OUTPUT:
      • array([[ 0.,  0.,  1.],
      •           [ 1.,  0.,  0.],
      •           [ 0.,  1.,  0.]])
  7. lines 51-55: create training and test data sets from full sample of 1728 rows:
    • To create the test and training sets, we simply create an array of ALL indices:
      • allIDX = [0, 1, 2,......,1725, 1726, 1727]
    • random.shuffle(allIDX) "shuffles" ordered indices of allIDX to a randomized list:
      • allIDX = [564, 981, 17, ...., 1023, 65, 235]
    • Then we simply take the first 10% of allIDX as the test set, the remaining as training.
  8. lines 58-61: use testIDX and trainIDX with the binarized features (xdata_ml) and the labels (ydata) to create the test set (xtest, ytest) and training set (xtrain, ytrain), respectively
  9. lines 62-67: use sklearn's naive_bayes module to perform multinomial naive-bayes classification
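
The steps above can be sketched end to end as follows. This is a minimal sketch, not the original script. To stay self-contained it does not download the UCI file. Instead it builds all 1728 feature combinations with itertools.product and attaches a HYPOTHETICAL two-class label from a made-up rule (toy_label), so the printed accuracy says nothing about the real car data. The names keys, numdata, xdata, ydata, xdata_ml, lbin, allIDX, testIDX and trainIDX follow the explanation and comments. The helper toy_label, the fixed seed, and the per-column use of np.unique are assumptions. It also assumes Python 3 with NumPy and scikit-learn installed.

```python
import itertools
import random

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer

# Feature values as listed in the data description.
levels = [
    ["vhigh", "high", "med", "low"],  # 0 - buying
    ["vhigh", "high", "med", "low"],  # 1 - maint
    ["2", "3", "4", "5more"],         # 2 - doors
    ["2", "4", "more"],               # 3 - persons
    ["small", "med", "big"],          # 4 - lug_boot
    ["low", "med", "high"],           # 5 - safety
]

# Stand-in for the downloaded file: every combination of feature values.
rows = list(itertools.product(*levels))


def toy_label(row):
    """HYPOTHETICAL labelling rule, NOT the real UCI class column."""
    return "unacc" if row[3] == "2" or row[5] == "low" else "acc"


# String array with 6 feature columns plus 1 label column.
data = np.array([list(r) + [toy_label(r)] for r in rows])

# Convert each string column to integer codes with np.unique.
keys = []                                   # string labels per column
numdata = np.empty(data.shape, dtype=int)   # integer codes
for j in range(data.shape[1]):
    col_keys, col_codes = np.unique(data[:, j], return_inverse=True)
    keys.append(col_keys)
    numdata[:, j] = col_codes

nrows, ncols = numdata.shape
xdata = numdata[:, :-1]   # first 6 columns: features
ydata = numdata[:, -1]    # last column: class labels

# Expand every multivalued feature into binary indicator columns.
lbin = LabelBinarizer()
xdata_ml = np.hstack(
    [lbin.fit_transform(xdata[:, j]) for j in range(xdata.shape[1])]
)

# Shuffle all row indices; first 10% become the test set.
random.seed(0)            # fixed seed (an assumption) for repeatable splits
allIDX = list(range(nrows))
random.shuffle(allIDX)
ntest = int(0.1 * nrows)
testIDX = allIDX[:ntest]
trainIDX = allIDX[ntest:]

xtrain, ytrain = xdata_ml[trainIDX, :], ydata[trainIDX]
xtest, ytest = xdata_ml[testIDX, :], ydata[testIDX]

# Fit multinomial naive Bayes and score it on the held-out rows.
mnb = MultinomialNB()
mnb.fit(xtrain, ytrain)
accuracy = mnb.score(xtest, ytest)
print("Classification accuracy of MNB = ", accuracy)
```

To run this on the real data, replace the block that builds data with rows read from the downloaded file (for example with the csv module). Everything from the np.unique loop onward should carry over, provided every feature column has at least three distinct values, since LabelBinarizer returns a single column for two-valued input.
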
Hopefully, the combination of an introduction to the basics and formalism of Naive Bayes classifiers, a toy example using the US census income dataset, and an application of Naive-Bayes classifiers in the above Python code (I hope you play with it beyond the basic script above!) helps solidify some of the main points and the value of using Bayes' Theorem.

Please let me know if you have any questions and, as always, comments and suggestions are greatly appreciated!

38 comments:

  1. Hello, thank you for this interesting example. Just one question: why do we need to transform the multi-value feature into multiple binary columns? Thanks

  2. Hey Cozyberry, thanks for reading and for leaving a comment. The transform from a multi-value feature to multiple binary columns has more to do with how sklearn's multinomial naive-bayes classification functions and classes are programmed and how they compute probabilities. If you wrote your own functions or used a different package, you would not need to make this transform. Please let me know if you have further questions! Thanks again!

    1. Hey Brian, thanks for your response. I have another question concerning this transformation. Why does the example code use xdata_ml for the attribute matrix: xtrain=xdata_ml[trainIDX,:] but the original ydata for the target matrix: ytrain=ydata[trainIDX,:]? I tried to test with ytrain=ydata_ml[trainIDX,:] and ytest=ydata_ml[0:trainIDX] and the accuracy came out as 0.0 ><

    2. Hi Cozy - good question and I'm glad you brought it up. The reason for not using the multi-label version of ytrain is that the NB-classifier function already accounts for multiple values in y, unlike for the x-values. The difference here is that sometimes you actually have more than one y-value you are trying to predict, similar to having many response variables in a regression. When you specify a single multi-valued column as the response (as I did above), the function knows that you are fitting to a single dependent variable and converts it to a multi-label column internally. Does this make sense? Thanks again for the comments. Hopefully I find some time soon to write my next series, it's been too long.

    3. HaHa, then it all makes sense. Thanks again for your quick response. It is really good to see an example here, since the official documentation for Naive Bayes focuses more on text prediction, and the Car data is more typical as a benchmark. I will come back again for your next series. Bow~~

  3. Please share the code and data in the github or other repository. Thanks :)

  4. Hi, i try to use it in Python 3.

    But i received that error: "data[k,:] = np.array(row)
    ValueError: could not broadcast input array from shape (2) into shape (7)"

    Can you help me?

  5. Hi, I tried your code but I didn't feed the classifier a multi-column binary matrix; I just used the feature matrix I had. The performance does not seem to be affected by this. I read your explanation to Cozy, but I didn't get it. Can you explain again why you are using such a binary matrix?

  6. Hi,
    I want to read the csv file from my system itself not using the url...
    could you please help me out with the code to read csv file from my system..
    reply as soon as early
    Thanks in advance..

