K-Nearest Neighbors
Revised on February 16, 2020.
TABLE OF CONTENTS:
In order to arrive at the most accurate prediction, machine learning models are built, tuned, and compared against each other. The reader can click on the links below to access the models or sections of the exercise. Each section has a short explanation of theory and a description of applied machine learning with Python:
K-Nearest Neighbors (Current Blog)
OBJECTIVES:
This blog is part of a series showcasing applied machine learning models in a classification setting. By clicking on any of the tabs above, the reader can navigate to other methods of analysis applied to the same data. The series was designed so that one could see what a data scientist would do from soup to nuts when faced with a problem like the one presented here. Note that the overall focus of this blog is K-Nearest Neighbors. More specifically, the goals are to:
Learn about the Bayes Classifier and the K-Nearest Neighbor Classifier;
Understand the role of K in the KNN classifier;
Get introduced to KNeighborsClassifier in Scikit-Learn; and
Find out how to tune the parameters of a KNN model using GridSearchCV.
Bayes Classifier:
Several statistics textbooks show that the test error rate in machine learning is minimized by the Bayes classifier, which assigns each observation to a class based on its predictor values. For example, when we have two classes, the Bayes classifier assigns an observation to the first class if Pr(Y = 1 | X = x0) > 0.5; otherwise the observation is assigned to the other class.
K-Nearest Neighbor Classifier:
Unfortunately, the real decision boundary is rarely known in real-world problems, so computing the Bayes classifier is impossible. One of the most frequently cited classifiers that does a reasonable job instead is the K-Nearest Neighbors (KNN) classifier.
As with many other classifiers, the KNN classifier estimates the conditional distribution of Y given X and then assigns the observation to the class with the highest estimated probability.
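Concretely, in standard textbook notation (a generic formulation, not anything specific to our data): given a test observation x0 and the set N0 of its K closest training points, KNN estimates the class probabilities as

```latex
\Pr(Y = j \mid X = x_0) \approx \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
```

and assigns x0 to the class j with the largest estimate.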
It is not surprising that altering K produces dramatically different results. When K = 1, the decision boundary is minimally restricted; such KNN models are said to produce low bias but high variance. As we increase K, the flexibility of the classifier is reduced and the decision boundary gets closer and closer to linear. These models produce low variance but high bias. Neither extreme performs particularly well in terms of test accuracy, so we need to find a model with well-balanced variance and bias, and we can find that model through parameter tuning.
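As a quick, self-contained illustration of this trade-off (on synthetic data, not the charity data used later in this blog), compare train and test accuracy at a few values of K:

```python
# Illustration only: synthetic data, not the charity data used in this blog.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for k in (1, 15, 200):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"K={k:3d}  train acc={knn.score(X_train, y_train):.3f}  "
          f"test acc={knn.score(X_test, y_test):.3f}")
```

K = 1 memorizes the training set (high variance), while a very large K smooths the boundary so much that it underfits (high bias).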
Use Python to Fit a KNN Model:
So let us tune a KNN model with GridSearchCV. The first step is to load all libraries and the charity data for classification. Note that I created three separate datasets: 1.) the original data set with 21 variables, partitioned into train and test sets, 2.) a dataset that contains second-order polynomials and interaction terms, also partitioned, and 3.) a dataset that contains third-order polynomials and interaction terms, partitioned into train and test sets. Each dataset was standardized and the variables with VIF scores greater than 5 were removed. All datasets were pickled, and those pickles are called and loaded below. The pre-work described above can be seen by navigating to the Linear and Quadratic Discriminant Analysis blog.
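Here is a sketch of that setup. The pickle file names and the tuple structure inside each pickle are placeholders for illustration; the actual objects come from the pre-work in the Linear and Quadratic Discriminant Analysis blog:

```python
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder file names -- the real pickles were created in the LDA/QDA blog.
# Each pickle is assumed to hold a (X_train, X_test, y_train, y_test) tuple.
datasets = {}
for name in ("original", "poly2", "poly3"):
    with open(f"charity_{name}.pkl", "rb") as f:
        datasets[name] = pickle.load(f)
```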
The untuned KNN model can serve as a baseline model in terms of test accuracy. The basic code structure looks like this:
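A minimal sketch of that baseline, continuing from the setup above (the unpacking assumes the tuple layout noted in the comments):

```python
# Untuned baseline on the original (no interaction terms) dataset.
X_train, X_test, y_train, y_test = datasets["original"]

knn_default = KNeighborsClassifier()   # defaults: n_neighbors=5, weights='uniform'
knn_default.fit(X_train, y_train)
print("Baseline test accuracy:", knn_default.score(X_test, y_test))
```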
We use cross-validation and grid search to find the best model. Scikit-Learn affords us several tunable parameters. For a complete list of tunable parameters, click on the link for KNeighborsClassifier.
The list of tunable parameters is also embedded (and commented out) in the chunk below. Further, I set the algorithm to auto, although there are other parameter levels that one can decide on. Note that there are four options for algorithm: ball_tree, kd_tree, brute, and auto.
At some point I tried these options, but even then I removed brute from the list because the leaf size is not used with that option.
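A sketch of how the search could be set up; the grid values here are illustrative and may not match the exact values used for this blog:

```python
# All tunable KNeighborsClassifier parameters, for reference (commented out):
# n_neighbors, weights, algorithm, leaf_size, p, metric, metric_params, n_jobs

param_grid = {
    "n_neighbors": [3, 5, 10, 15, 20],
    "weights": ["uniform", "distance"],
    "leaf_size": [10, 20, 30, 40],
    "p": [1, 2],                       # 1 = Manhattan distance, 2 = Euclidean
}

grid = GridSearchCV(
    KNeighborsClassifier(algorithm="auto"),  # algorithm fixed to 'auto'
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
```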
The next step is to fit several models with the different datasets. Remember, we have three different ones, with first-, second-, and third-degree polynomials and interaction terms.
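Continuing the sketch, the same grid search can be fit once per dataset (again assuming the placeholder names and tuple layout from above):

```python
results = {}
for name, (X_tr, X_te, y_tr, y_te) in datasets.items():
    results[name] = GridSearchCV(
        KNeighborsClassifier(algorithm="auto"),
        param_grid, cv=5, scoring="accuracy", n_jobs=-1,
    ).fit(X_tr, y_tr)
```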
We can print several attributes of the fitted GridSearchCV object, one of which is the list of parameters for the best model:
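For example, using Scikit-Learn's best_params_ and best_score_ attributes (the dataset key below is the placeholder name for the original data, which, as discussed further down, produced the winning model):

```python
best = results["original"]
print(best.best_params_)                    # parameters of the best estimator
print("Best CV accuracy:", best.best_score_)
```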
The leaf size was 20
The distance metric used for the tree was Minkowski
The number of neighbors used for k-neighbors queries was 10
The power parameter for the Minkowski metric was 1 (when p = 1, this is equivalent to using manhattan_distance)
All points in each neighborhood were weighted equally
Now we can see how accurately each of the four models performed on the test data. The first model was our default model without any tuning. Indeed, tuning parameters can yield significant gains over the accuracy of our default model. In fact, the model fitted on the original training data without interaction terms performed well and had an 86% accuracy.
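A sketch of that comparison, continuing from the snippets above:

```python
# Test accuracy of the untuned baseline and of each tuned model.
print("default  :", round(knn_default.score(X_test, y_test), 3))
for name, fitted in results.items():
    X_tr, X_te, y_tr, y_te = datasets[name]
    print(f"{name:9s}:", round(fitted.score(X_te, y_te), 3))
```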
Additional statistics are also available about the accuracy of the winning model. Take a look at the recall of our winning model for example. It will be really interesting to compare these results to the output of other methods.
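For example, recall and the full per-class breakdown can be pulled from Scikit-Learn's metrics module (again continuing from the snippets above):

```python
from sklearn.metrics import classification_report, recall_score

y_pred = best.predict(X_test)              # winning model, original test data
print("Recall:", round(recall_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred))
```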