Here is a quick snapshot of the data and the classification task at hand:
========================================================================
This dataset was taken from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets.html)
1. Number of Instances: 1728
(instances completely cover the attribute space)
2. Number of Attributes (features): 6
3. Data feature descriptions:
0 - buying: vhigh, high, med, low.
1 - maint: vhigh, high, med, low.
2 - doors: 2, 3, 4, 5more.
3 - persons: 2, 4, more.
4 - lug_boot: small, med, big.
5 - safety: low, med, high.
4. Class Labels (to predict through classification):
car evaluation: unacc, acc, good, vgood
5. Missing Attribute Values: none
6. Class Distribution (number of instances per class)
There is a class imbalance (very common in real-world data sets)
class      N      N[%]
-----------------------------
unacc    1210    70.023 %
acc       384    22.222 %
good       69     3.993 %
vgood      65     3.762 %
========================================================================
Here is the Python script that trains the model and tests its generalizability using a held-out test set.
If the above code is executed either in the Python shell (by copying and pasting the lines) or at the command prompt using: $ python (path_to_file)/thinkModelcode_carData.py
you should get an accuracy between 80-90% (there is a range because it all depends on WHICH rows were used for test vs. training data). The output line should be something like:
Classification accuracy of MNB = 0.901162790698
Explanation of python code:
- lines 1-6: importing several modules and packages necessary to run this script
- lines 10-23: using the urllib and csv packages to read in the data from the URL (a sketch of this step appears after this list)
- line 25: create a list of feature names (for reference)
- lines 27-31: converting the string data into numerical form using NumPy's unique() function (sketched after this list)
- keys --> contains string labels corresponding to numerical values assigned
- numdata --> contains numerical representations of string labels in 'data' read from URL
- lines 33-37: determine number of rows, columns. Also convert numdata to be of int array type.
- split numdata into xdata (first 6 columns) and ydata (last column of class labels)
- lines 41-46: convert each multi-valued feature in xdata to a multi-column binary feature array. This conversion is done using sklearn.preprocessing.LabelBinarizer. Here's an example of what this conversion looks like:
- >>> from sklearn.preprocessing import LabelBinarizer
- >>> lbin = LabelBinarizer()
- >>> a = [2, 0, 1]
- >>> lbin.fit_transform(a)
- OUTPUT:
- array([[ 0., 0., 1.],
-        [ 1., 0., 0.],
-        [ 0., 1., 0.]])
- lines 51-55: create training and test data sets from the full sample of 1728 rows (sketched after this list):
- To create the test and training sets, we simply create an array of ALL indices:
- allIDX = [0, 1, 2,......,1725, 1726, 1727]
- random.shuffle(allIDX) "shuffles" ordered indices of allIDX to a randomized list:
- allIDX = [564, 981, 17, ...., 1023, 65, 235]
- Then we simply take the first 10% of allIDX as the test set, the remaining as training.
- lines 58-61: use testIDX and trainIDX with xdata to create xtest and xtrain, respectively
- lines 62-67: use sklearn's naive_bayes module to perform multinomial Naive Bayes classification (sketched after this list)
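To make the steps above more concrete, here are a few minimal sketches. First, reading the data from the URL with urllib and csv. This is a hedged sketch rather than the script's exact code: it assumes Python 3's urllib.request and the standard UCI location of car.data (the original script may use Python 2's urllib or a different URL).

    import csv
    import urllib.request

    # Assumed location of the raw data file (the original script's URL may differ)
    URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"

    with urllib.request.urlopen(URL) as response:
        lines = response.read().decode("utf-8").splitlines()

    # Each row: buying, maint, doors, persons, lug_boot, safety, class
    data = [row for row in csv.reader(lines) if row]
    print(len(data))  # expect 1728 rows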
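Next, the string-to-number conversion with NumPy's unique(). A minimal sketch that encodes the whole string array at once with return_inverse=True (the original script may instead encode column by column); 'data' is the list of rows from the previous sketch.

    import numpy as np

    strdata = np.array(data)  # shape (1728, 7), strings
    # keys: the distinct string labels; inverse: the integer code assigned to each cell
    keys, inverse = np.unique(strdata, return_inverse=True)
    numdata = inverse.reshape(strdata.shape).astype(int)
    # keys[numdata] recovers the original string labels

    nrows, ncols = numdata.shape
    xdata = numdata[:, :6]  # first 6 columns: the features
    ydata = numdata[:, 6]   # last column: the class label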
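Then the train/test split by shuffling indices, exactly as described above, taking the first ~10% of the shuffled indices as the test set.

    import random

    allIDX = list(range(nrows))   # [0, 1, 2, ..., 1727]
    random.shuffle(allIDX)        # randomize the order in place

    ntest = nrows // 10           # roughly 10% of the 1728 rows
    testIDX = allIDX[:ntest]
    trainIDX = allIDX[ntest:]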
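Finally, binarizing the features and fitting the multinomial Naive Bayes classifier. Again a sketch under the same assumptions (variable names carried over from the sketches above); the exact column-stacking in the original script may differ.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import LabelBinarizer

    # One-hot encode each feature column and stack the results side by side
    xdata_bin = np.hstack([LabelBinarizer().fit_transform(xdata[:, i])
                           for i in range(xdata.shape[1])])

    xtrain, ytrain = xdata_bin[trainIDX], ydata[trainIDX]
    xtest, ytest = xdata_bin[testIDX], ydata[testIDX]

    mnb = MultinomialNB().fit(xtrain, ytrain)
    print("Classification accuracy of MNB =", mnb.score(xtest, ytest))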
Hopefully, the combination of an introduction to the basics and formalism of Naive Bayes classifiers, a walk through a toy example with the US census income dataset, and an application of Naive Bayes classifiers in the above Python code (I hope you play with it beyond the basic script!) helps solidify some of the main points and the value of using Bayes' Theorem.
Please let me know if you have any questions and, as always, comments and suggestions are greatly appreciated!