Linear and Quadratic Discriminant Analysis
Revised on Feb. 10, 2020
Table of Contents:
In order to arrive at the most accurate prediction, machine learning models are built, tuned and compared against each other. The reader can click on the links below to access the models or sections of the exercise. Each section has a short explanation of the theory and a description of applied machine learning with Python:
LDA/QDA/Naive Bayes Classifier (Current Blog)
Ensemble Learning
Objectives:
This blog is part of a series of models showcasing applied machine learning models in a classification setting. By clicking on any of the tabs below, the reader can navigate to other methods of analysis applied to the same data. This was designed so that one could see what a data scientist would do from soup to nuts when faced with a problem like the one presented here. Note that the overall focus of this blog is Linear and Quadratic Discriminant Analysis as well as the Naive Bayes Classifier.
Learn about Bayes’ Theorem and its application in making class predictions;
Get introduced to Linear Discriminant Analysis;
Gain familiarity with Quadratic Discriminant Analysis;
Understand the conceptual and mathematical differences between LDA, QDA and the Naive Bayes Classifier;
Find out how to use Sci-Kit Learn to fit LDA, QDA, NBC;
Learn how to tune parameters with GridSearchCV(); and
Refresh how to gauge the accuracy of classification models
ROC Curves
Confusion Matrix
Accuracy score, F1, Precision, Recall
Bayes’ Theorem for Classification:
According to Introduction to Statistical Learning (James et al.), we can classify an observation into one of K classes (K ≥ 2), where the response takes on K distinct, unordered values. Click for more. When classifying observations, one can use Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if getting into a car accident is related to the driver’s prior history of speeding, then the probability of a future car accident can be assessed with higher accuracy when the analyst has information about the number of speeding tickets than when the analyst has no access to the individual’s driving record. The mathematical notation of Bayes’ Theorem is
P(A|B) = P(B|A) P(A) / P(B),
where A and B are events and P(B) ≠ 0.
P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
P(A) and P(B) are the probabilities of observing A and B respectively; they are known as marginal probabilities.
Bayes’ Theorem can be rewritten as below, where πk is the prior probability that an observation belongs to the kth class (i.e. the probability that a randomly chosen observation comes from the kth class), and fk(x) ≡ Pr(X = x|Y = k) is the density function of X for an observation that comes from the kth class:
pk(x) = Pr(Y = k|X = x) = πk fk(x) / (π1 f1(x) + … + πK fK(x))
Instead of directly computing pk(x), we can simply use the prior πk and the density function fk(x). This means that we can compute the probability that an observation belongs to a certain class given the predictor value for that observation. The Bayes classifier assigns an observation to the class for which pk(x) is the largest, and it has the lowest error rate of all classifiers. If fk(x) is large, then the probability that the observation belongs to the kth class is high; if fk(x) is small, that probability is low. In practice, we must estimate fk(x) in order to compute pk(x), which is called the posterior probability that an observation belongs to the kth class.
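To make the formula concrete, here is a minimal sketch (not from the original post) that computes the posterior probabilities pk(x) for a single predictor value, assuming two classes with made-up Gaussian densities and priors:

```python
# Minimal sketch: turn class priors and class-conditional densities into
# posterior probabilities p_k(x) via Bayes' Theorem. All numbers are
# illustrative, not taken from the blog's dataset.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])        # pi_k: prior probability of each class
means = np.array([0.0, 2.0])         # class-conditional Gaussian means
sds = np.array([1.0, 1.0])           # class-conditional standard deviations

x = 1.3                              # a single observed predictor value
densities = norm.pdf(x, loc=means, scale=sds)                 # f_k(x)
posteriors = priors * densities / np.sum(priors * densities)  # p_k(x)

print(posteriors)           # the Bayes classifier picks the largest p_k(x)
print(posteriors.argmax())  # predicted class index
```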
Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA) is a method that is designed to separate two (or more) classes of observations based on a linear combination of features. The linear designation is the result of the discriminant functions being linear.
The image above shows two Gaussian density functions. (Source: Introduction to Statistical Learning - James et al.) Click for more. The dashed vertical line shows the Bayes decision boundary. The right side shows histograms of randomly chosen observations; the dashed line again is the Bayes decision boundary, and the solid vertical line is the LDA decision boundary estimated from the training data. When the Bayes decision boundary and the LDA decision boundary are close, the model is considered to perform well.
LDA estimates πk using the proportion of the training observations that belong to the kth class. In this example there is only one regressor (p = 1, where p denotes the number of regressors). When multiple regressors are used, the observations are assumed to be drawn from a multivariate Gaussian distribution.
Quadratic Discriminant Analysis:
Quadratic Discriminant Analysis (QDA) is similar to LDA in that both assume the observations are drawn from a normal distribution. The difference is that QDA assumes each class has its own covariance matrix, while LDA assumes a covariance matrix common to all classes.
The QDA classifier uses several parameters (Σk, μk, and πk) to determine the class to which an observation should be assigned. Whether we use QDA or LDA depends on the bias-variance tradeoff: LDA is less flexible and therefore has lower variance, but because observations are assumed to share a common covariance matrix, it can have higher bias.
ANALYTICS WITH PYTHON:
LDA: Sci-Kit Learn uses a classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. (Source: Sci-Kit Learn - Click for more)
QDA: Sci-Kit Learn uses a classifier with a quadratic decision boundary based on fitted conditional densities as described by Bayes’ Theorem. Each class is fitted with a Gaussian density. (Source: Sci-Kit Learn - Click for more)
The first block loads all necessary libraries, creates the regressors and the dependent variable required by sklearn. Finally, the data set is partitioned into train and test sets.
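A minimal sketch of that setup step is below, assuming the data lives in a CSV file named 'donations.csv' with a binary target column 'donated' (both names are placeholders for the post's actual file and column names):

```python
# Load libraries, build X (regressors) and y (dependent variable), and split
# the data into train and test sets. File and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('donations.csv')

X = df.drop(columns=['donated'])   # regressors
y = df['donated']                  # binary dependent variable (0/1)

# Hold out a test set for the later accuracy assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```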
Since we are charged with creating the best model possible, let us create new features. Here, we’ll create second order polynomials and interaction terms, and separately, we create third order polynomials and third degree interaction terms. Creating these terms will bring up some issues, but we will spend more time on that in a bit. For now, just notice that creating third order polynomials increased our column count from 23 to 1,539.
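One way to generate these terms is scikit-learn's PolynomialFeatures; the sketch below follows that route (the original post's exact feature-creation code may differ):

```python
# degree=2 adds squared terms and pairwise interactions; degree=3 adds cubic
# terms and three-way interactions, which is what inflates the column count.
from sklearn.preprocessing import PolynomialFeatures

poly2 = PolynomialFeatures(degree=2, include_bias=False)
poly3 = PolynomialFeatures(degree=3, include_bias=False)

X_train_poly2 = poly2.fit_transform(X_train)
X_test_poly2 = poly2.transform(X_test)

X_train_poly3 = poly3.fit_transform(X_train)
X_test_poly3 = poly3.transform(X_test)

print(X_train.shape[1], X_train_poly2.shape[1], X_train_poly3.shape[1])
```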
When fitting LDA models, standardizing or scaling is a good idea. There are several articles out there explaining why standardizing is a must. Here we have to remember to standardize all of our data sets.
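A short sketch of the scaling step, shown for the original features (the same pattern applies to the two polynomial datasets):

```python
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid leakage.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```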
Since we created interaction terms and polynomials, multicollinearity will certainly be an issue. Here, we check whether multicollinearity exists in the original data set, and then we go through the two newly created data sets containing second and third degree polynomials the same way. We make use of VIF and identify all variables with a VIF greater than 5. We will simply eliminate these variables from the analysis.
And now the painful task of eliminating variables begins. This may be a slow process if the dataset is large. Think about it: We need to create a matrix of correlations among all variables.
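One possible implementation of this screening, using statsmodels' variance_inflation_factor, is sketched below (shown for the original features only; the threshold of 5 follows the text, everything else is an assumption):

```python
# Iteratively drop the column with the highest VIF until all VIFs <= threshold.
# Recomputing VIFs after every drop is what makes this slow on wide datasets.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, threshold=5.0):
    cols = list(df.columns)
    while True:
        vifs = [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))]
        worst = max(range(len(cols)), key=lambda i: vifs[i])
        if vifs[worst] <= threshold:
            return df[cols]
        cols.pop(worst)

# Screen the scaled training features (as a DataFrame with the original column names)
X_train_vif = drop_high_vif(pd.DataFrame(X_train_scaled, columns=X.columns))
```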
In both cases, the number of variables was reduced significantly. The second order polynomial file now has only 142 features, while the third order polynomial file contains only 148 features vs. the 1,539 we started with. We are ready to fit some models.
Parameter Tuning:
Our first task is to fit a generic LDA model without any parameters tuned, for two reasons: to understand the code structure and to get a baseline accuracy. For fitting, we are using the first degree polynomial, i.e. the original dataset.
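A minimal sketch of that baseline fit:

```python
# Default LinearDiscriminantAnalysis on the original (first degree) features,
# giving us a reference accuracy to compare the tuned models against.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_default = LinearDiscriminantAnalysis()
lda_default.fit(X_train_scaled, y_train)

print('Baseline LDA accuracy:', lda_default.score(X_test_scaled, y_test))
```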
So let us try to tune the parameters of the LDA model. A list of tunable parameters can be found on the Scikit-Learn Linear Discriminant Analysis Page (Please click to navigate). There are three different solvers one can try, but one of them (svd) does not work with shrinkage. As a result, the cross validation routines using GridSearchCV were separated in the code below for the two solvers that work with shrinkage vs. the one that does not. The shrinkage parameter can either be tuned explicitly or set to auto; a nuanced difference, but it does impact the final model selected. In the final run shown here, the solver type and n_components were tuned.
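A sketch of the tuning routine for the two solvers that accept shrinkage ('lsqr' and 'eigen'); the grid values below are illustrative rather than the exact grid from the original post:

```python
# GridSearchCV over the shrinkage-capable solvers; 'svd' is handled separately.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

param_grid = {
    'solver': ['lsqr', 'eigen'],
    'shrinkage': ['auto'] + list(np.linspace(0.0, 1.0, 11)),
}

lda_grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                        scoring='accuracy', cv=5)
lda_grid.fit(X_train_scaled, y_train)

print(lda_grid.best_params_, lda_grid.best_score_)
```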
Also, please note that GridSearchCV itself has a myriad of options. For selection of champion models, accuracy was chosen as a decision metric. Documentation of GridSearchCV is available by clicking here.
Predictions were made with all three datasets (i.e. the 1st, 2nd and 3rd degree polynomial datasets), which will be used later for the assessment of model accuracy. Below is the tuning of the svd model, followed by model fitting and predicting outcomes based on the test data.
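A sketch of the svd branch, shown for the original features (the same pattern is repeated for the 2nd and 3rd degree polynomial datasets; the grid is illustrative):

```python
# Tune n_components for the 'svd' solver, refit the best estimator and
# generate test-set predictions. With a binary target, n_components <= 1.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

svd_grid = GridSearchCV(LinearDiscriminantAnalysis(solver='svd'),
                        {'n_components': [None, 1]},
                        scoring='accuracy', cv=5)
svd_grid.fit(X_train_scaled, y_train)

lda_svd_best = svd_grid.best_estimator_
y_pred_lda = lda_svd_best.predict(X_test_scaled)
```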
Quadratic Discriminant Analysis:
The next section is focused on QDA. A list of tunable parameters is available by clicking here. GridSearchCV was once again used for parameter tuning, and the final exercise looked at the tuning of three parameters.
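A sketch of that QDA tuning step; QuadraticDiscriminantAnalysis exposes only a few parameters, and the grid below over reg_param, tol and store_covariance is an illustration of tuning three of them, not the exact grid used in the post:

```python
# GridSearchCV over three QDA parameters, then predict on the test set.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

qda_grid = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {'reg_param': np.linspace(0.0, 1.0, 11),
     'tol': [1e-4, 1e-3, 1e-2],
     'store_covariance': [True, False]},
    scoring='accuracy', cv=5)
qda_grid.fit(X_train_scaled, y_train)

y_pred_qda = qda_grid.best_estimator_.predict(X_test_scaled)
print(qda_grid.best_params_)
```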
Naive Bayes Classifier:
Finally, we fitted a Naive Bayes Classifier with the exact same GridSearchCV approach as the one used for LDA and QDA. The NBC can also be tuned, and the tunable parameter list can be reached by clicking here.
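A sketch of that step, assuming GaussianNB with var_smoothing as the tuned parameter (the grid is illustrative):

```python
# GaussianNB has little to tune; var_smoothing is the usual candidate.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

nbc_grid = GridSearchCV(GaussianNB(),
                        {'var_smoothing': np.logspace(-9, -1, 9)},
                        scoring='accuracy', cv=5)
nbc_grid.fit(X_train_scaled, y_train)

y_pred_nbc = nbc_grid.best_estimator_.predict(X_test_scaled)
```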
Interpretation of Model Output:
Several model attributes are available to assess the models. These attributes are stored in various objects. For a list please click here. So what are these attributes?
When fitting an LDA model, linear boundaries between classes are created based on the means and variances of each class. The coefficients serve as delimiters for the boundaries. Below, I printed the model coefficients, class priors and class means for both LDA models.
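These attributes can be pulled from any fitted LinearDiscriminantAnalysis object; a sketch for the model fit on the original features (the polynomial models work the same way):

```python
# Fitted-model attributes used in the interpretation below.
print('Coefficients:', lda_svd_best.coef_)
print('Class priors:', lda_svd_best.priors_)
print('Group means:', lda_svd_best.means_)
```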
Class Priors:
Class priors represent the prior probabilities of the groups in the training data. They should sum to 100%, and they indeed meet that criterion: the distribution of donated vs. did not donate is 50.2% vs. 49.8%. Since the same training data was used for all three models, this ratio is the same in all three models.
Group Means:
Group means represent the average of each predictor within every individual class, and they let us judge the influence of the predictors in each class. Note that there are two sets of them because the model has a binary outcome. Let’s take a look at the first two regressors in the first LDA model to illustrate (the other regressors work the same way).
We can see that the variable Region 1 (a person lives in region 1) might have a slightly greater influence on becoming a donor (0.21) vs. not being a donor (0.17). The second variable has a stronger effect: living in Region 2 has twice as much influence on becoming a donor as not (0.47 vs. 0.23).
Coefficients:
In the first LDA model, the first regressor (reg1) has a coefficient of 1.31086181, and the second (reg2) has a coefficient of 2.40106887. This means that the boundary between the donor vs. not a donor classes will be specified by y=1.31086181*reg1+2.40106887*reg2+…other 18 coefficients and variables. It appears that living in region 2 vs. region 1 is almost twice as influential on becoming a donor.
Accuracy:
Now we are ready to assess the accuracy of the models. Let’s print the accuracy of all fitted models:
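A sketch of that comparison, assuming the prediction arrays created in the earlier steps:

```python
# Test-set accuracy for the default and tuned models.
from sklearn.metrics import accuracy_score

print('LDA (default):', accuracy_score(y_test, lda_default.predict(X_test_scaled)))
print('LDA (tuned, svd):', accuracy_score(y_test, y_pred_lda))
print('QDA (tuned):', accuracy_score(y_test, y_pred_qda))
print('Naive Bayes (tuned):', accuracy_score(y_test, y_pred_nbc))
```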
It looks like the LDA model could be tuned further, because the default score is actually the same as the score of the tuned LDA model using the svd solver and first degree polynomials.
The same can be said about the QDA accuracy scores. So we just found a clever way to check whether an employee spent any time at all tuning parameters. A tuned model should be better than the default model, right?
Having said this, the QDA model seems to outperform the LDA model. The Gaussian Naive Bayes accuracy scores are also provided, and they are not very flattering. We did not take them seriously anyway! We fitted those models mainly because, as the documentation puts it, “If in the QDA model one assumes that the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier.” (Source: Linear and Quadratic Discriminant Analysis)
We now have the best LDA, QDA and NBC models; let’s see if we can get some basic information about them.
Confusion Matrix:
The confusion matrix is a logical place to ask about overall accuracy, precision, recall, etc. The section below offers a way to obtain all the necessary statistics. There are clear differences between the models, so subject matter expertise would normally dictate which model is preferred. In the current example, the choice is easy because the QDA model is superior to all others on every metric, including accuracy, recall and precision.
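A sketch of that assessment step, shown for the tuned QDA predictions (the same calls apply to the LDA and NBC predictions):

```python
# Confusion matrix plus per-class precision, recall and F1.
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred_qda))
print(classification_report(y_test, y_pred_qda))
```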