Linear and Quadratic Discriminant Analysis
Revised on Feb. 10, 2020
Table of Contents:
In order to arrive at the most accurate prediction, machine learning models are built, tuned and compared against each other. The reader can click on the links below to access the models or sections of the exercise. Each section has a short explanation of the theory and a description of applied machine learning with Python:
LDA/QDA/Naive Bayes Classifier (Current Blog)
Ensemble Learning
Objectives:
This blog is part of a series of models showcasing applied machine learning models in a classification setting. By clicking on any of the tabs below, the reader can navigate to other methods of analysis applied to the same data. This was designed so that one could see what a data scientist would do from soup to nuts when faced with a problem like the one presented here. Note that the overall focus of this blog is Linear and Quadratic Discriminant Analysis as well as the Naive Bayes Classifier.
Learn about Bayes’ Theorem and its application in making class predictions;
Get introduced to Linear Discriminant Analysis;
Gain familiarity with Quadratic Discriminant Analysis;
Understand the conceptual and mathematical differences between LDA, QDA and the Naive Bayes Classifier;
Find out how to use Sci-Kit Learn to fit LDA, QDA, NBC;
Learn how to tune parameters with GridSearchCV(); and
Refresh how to gauge the accuracy of classification models
ROC Curves
Confusion Matrix
Accuracy score, F1, Precision, Recall
Bayes’ Theorem for Classification:
According to Introduction to Statistical Learning (James et al.), we can classify an observation into one of K classes (K ≥ 2), where the response takes on K distinct, unordered values. Click for more. When classifying observations, one can use Bayes’ Theorem, which describes the probability of an event based on prior knowledge of conditions that might be related to the event. For example, if getting into a car accident is related to the driver’s prior history of speeding, then the probability of a future car accident can be assessed with higher accuracy when the analyst has information about the number of speeding tickets than when the analyst has no access to the individual’s driving record. The mathematical notation of Bayes’ Theorem is
P(A|B) = P(B|A) P(A) / P(B),
where A and B are events and P(B) ≠ 0.
P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.
P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.
P(A) and P(B) are the probabilities of observing A and B respectively; they are known as marginal probabilities.
Bayes’ Theorem can be rewritten as below, where πk is the prior probability that an observation belongs to the kth class (i.e. the probability that a randomly chosen observation comes from the kth class), and fk(x) ≡ Pr(X = x|Y = k) is the density function of X for an observation that comes from the kth class:
pk(x) = Pr(Y = k|X = x) = πk fk(x) / (π1 f1(x) + … + πK fK(x))
Instead of directly computing pk(x), we can simply use the prior πk and the density function fk(x). This means that we can compute the probability that an observation belongs to a certain class given the predictor value for that observation. The Bayes classifier assigns an observation to the class for which pk(x) is the largest, and it has the lowest error rate of all classifiers. If fk(x) is large, then the probability that the observation belongs to the kth class is high; if fk(x) is small, that probability is low. In practice, we must estimate fk(x) in order to compute pk(x), which is called the posterior probability that an observation belongs to the kth class.
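To make the formula concrete, here is a minimal sketch (not from the original post) that computes the posterior probabilities pk(x) for a single predictor value, assuming two classes with made-up Gaussian densities and priors:

```python
# Minimal sketch: turn class priors and class-conditional densities into
# posterior probabilities p_k(x) via Bayes' Theorem. All numbers are
# illustrative, not taken from the blog's dataset.
import numpy as np
from scipy.stats import norm

priors = np.array([0.5, 0.5])        # pi_k: prior probability of each class
means = np.array([0.0, 2.0])         # class-conditional Gaussian means
sds = np.array([1.0, 1.0])           # class-conditional standard deviations

x = 1.3                              # a single observed predictor value
densities = norm.pdf(x, loc=means, scale=sds)                 # f_k(x)
posteriors = priors * densities / np.sum(priors * densities)  # p_k(x)

print(posteriors)           # the Bayes classifier picks the largest p_k(x)
print(posteriors.argmax())  # predicted class index
```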
Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA) is a method that is designed to separate two (or more) classes of observations based on a linear combination of features. The linear designation is the result of the discriminant functions being linear.
The image above shows two Gaussian density functions. (Source: Introduction to Statistical Learning - James et al.) Click for more. The dashed vertical line shows the Bayes decision boundary. The right side shows histograms of randomly chosen observations; the dashed line again is the Bayes decision boundary, and the solid vertical line is the LDA decision boundary estimated from the training data. When the Bayes decision boundary and the LDA decision boundary are close, the model is considered to perform well.
LDA estimates πk using the proportion of the training observations that belong to the kth class. In this example there is only one regressor (p = 1, where p denotes the number of regressors). When multiple regressors are used, the observations are assumed to be drawn from a multivariate Gaussian distribution.
Quadratic Discriminant Analysis:
Quadratic Discriminant Analysis (QDA) is similar to LDA in that both assume the observations are drawn from a normal distribution. The difference is that QDA assumes each class has its own covariance matrix, while LDA assumes a covariance matrix common to all classes.
The QDA classifier uses several parameters (Σk, μk, and πk) to determine the class to which an observation should be assigned. Whether we use QDA or LDA depends on the bias-variance tradeoff: LDA is less flexible and therefore has lower variance, but because observations are assumed to share a common covariance matrix, it can have higher bias.
ANALYTICS WITH PYTHON:
LDA: Sci-Kit Learn uses a classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule. The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. (Source: Sci-Kit Learn - Click for more)
QDA: Sci-Kit Learn uses a classifier with a quadratic decision boundary based on fitted conditional densities as described by Bayes’ Theorem. Each class is fitted with a Gaussian density. (Source: Sci-Kit Learn - Click for more)
The first block loads all necessary libraries, creates the regressors and the dependent variable required by sklearn. Finally, the data set is partitioned into train and test sets.
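A minimal sketch of that setup step is below, assuming the data lives in a CSV file named 'donations.csv' with a binary target column 'donated' (both names are placeholders for the post's actual file and column names):

```python
# Load libraries, build X (regressors) and y (dependent variable), and split
# the data into train and test sets. File and column names are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('donations.csv')

X = df.drop(columns=['donated'])   # regressors
y = df['donated']                  # binary dependent variable (0/1)

# Hold out a test set for the later accuracy assessment
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```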
Since we are charged with creating the best model possible, let us create new features. Here, we’ll create second order polynomials and interaction terms, and separately, we create third order polynomials and third degree interaction terms. Creating these terms will bring up some issues, but we will spend more time on that in a bit. For now, just notice that creating third order polynomials increased our column count from 23 to 1,539.
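One way to generate these terms is scikit-learn's PolynomialFeatures; the sketch below follows that route (the original post's exact feature-creation code may differ):

```python
# degree=2 adds squared terms and pairwise interactions; degree=3 adds cubic
# terms and three-way interactions, which is what inflates the column count.
from sklearn.preprocessing import PolynomialFeatures

poly2 = PolynomialFeatures(degree=2, include_bias=False)
poly3 = PolynomialFeatures(degree=3, include_bias=False)

X_train_poly2 = poly2.fit_transform(X_train)
X_test_poly2 = poly2.transform(X_test)

X_train_poly3 = poly3.fit_transform(X_train)
X_test_poly3 = poly3.transform(X_test)

print(X_train.shape[1], X_train_poly2.shape[1], X_train_poly3.shape[1])
```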
When fitting LDA models, standardizing or scaling is a good idea. There are several articles out there explaining why standardizing is a must. Here we have to remember to standardize all of our data sets.
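A short sketch of the scaling step, shown for the original features (the same pattern applies to the two polynomial datasets):

```python
# Fit the scaler on the training data only, then apply the same
# transformation to the test data to avoid leakage.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```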
Since we created interaction terms and polynomials, multicollinearity will certainly be an issue. Here, we check whether multicollinearity exists in the original data set, and then we go through the two newly created data sets containing second and third degree polynomials the same way. We make use of VIF and identify all variables with a VIF greater than 5. We will simply eliminate these variables from the analysis.
And now the painful task of eliminating variables begins. This may be a slow process if the dataset is large. Think about it: We need to create a matrix of correlations among all variables.
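One possible implementation of this screening, using statsmodels' variance_inflation_factor, is sketched below (shown for the original features only; the threshold of 5 follows the text, everything else is an assumption):

```python
# Iteratively drop the column with the highest VIF until all VIFs <= threshold.
# Recomputing VIFs after every drop is what makes this slow on wide datasets.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(df, threshold=5.0):
    cols = list(df.columns)
    while True:
        vifs = [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))]
        worst = max(range(len(cols)), key=lambda i: vifs[i])
        if vifs[worst] <= threshold:
            return df[cols]
        cols.pop(worst)

# Screen the scaled training features (as a DataFrame with the original column names)
X_train_vif = drop_high_vif(pd.DataFrame(X_train_scaled, columns=X.columns))
```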
In both cases, the number of variables was reduced significantly. The second order polynomial file now has only 142 features, while the third order polynomial file contains only 148 features vs. the 1,539 we started with. We are ready to fit some models.
Parameter Tuning:
Our first task is to fit a generic LDA model without any parameters tuned, for two reasons: to understand the code structure and to get a baseline accuracy. For fitting, we are using the first degree polynomial, i.e. the original dataset.
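A minimal sketch of that baseline fit:

```python
# Default LinearDiscriminantAnalysis on the original (first degree) features,
# giving us a reference accuracy to compare the tuned models against.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_default = LinearDiscriminantAnalysis()
lda_default.fit(X_train_scaled, y_train)

print('Baseline LDA accuracy:', lda_default.score(X_test_scaled, y_test))
```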
So let us try to tune the parameters of the LDA model. A list of tunable parameters can be found on the Scikit-Learn Linear Discriminant Analysis Page (Please click to navigate). There are three different solvers one can try, but one of them (svd) does not work with shrinkage. As a result, the cross validation routines using GridSearchCV were separated in the code below for the two solvers that work with shrinkage vs. the one that does not. The shrinkage parameter can either be tuned explicitly or set to auto; a nuanced difference, but it does impact the final model selected. In the final run shown here, the solver type and n_components were tuned.
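A sketch of the tuning routine for the two solvers that accept shrinkage ('lsqr' and 'eigen'); the grid values below are illustrative rather than the exact grid from the original post:

```python
# GridSearchCV over the shrinkage-capable solvers; 'svd' is handled separately.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

param_grid = {
    'solver': ['lsqr', 'eigen'],
    'shrinkage': ['auto'] + list(np.linspace(0.0, 1.0, 11)),
}

lda_grid = GridSearchCV(LinearDiscriminantAnalysis(), param_grid,
                        scoring='accuracy', cv=5)
lda_grid.fit(X_train_scaled, y_train)

print(lda_grid.best_params_, lda_grid.best_score_)
```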
Also, please note that GridSearchCV itself has a myriad of options. For selection of champion models, accuracy was chosen as a decision metric. Documentation of GridSearchCV is available by clicking here.
Predictions were made with all three datasets (i.e. the 1st, 2nd and 3rd degree polynomial datasets), which will be used later for the assessment of model accuracy. Below is the tuning of the svd model, followed by model fitting and predicting outcomes based on the test data.
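A sketch of the svd branch, shown for the original features (the same pattern is repeated for the 2nd and 3rd degree polynomial datasets; the grid is illustrative):

```python
# Tune n_components for the 'svd' solver, refit the best estimator and
# generate test-set predictions. With a binary target, n_components <= 1.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

svd_grid = GridSearchCV(LinearDiscriminantAnalysis(solver='svd'),
                        {'n_components': [None, 1]},
                        scoring='accuracy', cv=5)
svd_grid.fit(X_train_scaled, y_train)

lda_svd_best = svd_grid.best_estimator_
y_pred_lda = lda_svd_best.predict(X_test_scaled)
```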
Quadratic Discriminant Analysis:
The next section is focused on QDA. A list of tunable parameters is available by clicking here. GridSearchCV was once again used for parameter tuning, and the final exercise looked at the tuning of three parameters.
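A sketch of that QDA tuning step; QuadraticDiscriminantAnalysis exposes only a few parameters, and the grid below over reg_param, tol and store_covariance is an illustration of tuning three of them, not the exact grid used in the post:

```python
# GridSearchCV over three QDA parameters, then predict on the test set.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

qda_grid = GridSearchCV(
    QuadraticDiscriminantAnalysis(),
    {'reg_param': np.linspace(0.0, 1.0, 11),
     'tol': [1e-4, 1e-3, 1e-2],
     'store_covariance': [True, False]},
    scoring='accuracy', cv=5)
qda_grid.fit(X_train_scaled, y_train)

y_pred_qda = qda_grid.best_estimator_.predict(X_test_scaled)
print(qda_grid.best_params_)
```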
Naive Bayes Classifier:
Finally, we fitted a Naive Bayes Classifier with the exact same GridSearchCV approach as the one used for LDA and QDA. The NBC can also be tuned, and the tunable parameter list can be reached by clicking here.
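A sketch of that step, assuming GaussianNB with var_smoothing as the tuned parameter (the grid is illustrative):

```python
# GaussianNB has little to tune; var_smoothing is the usual candidate.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

nbc_grid = GridSearchCV(GaussianNB(),
                        {'var_smoothing': np.logspace(-9, -1, 9)},
                        scoring='accuracy', cv=5)
nbc_grid.fit(X_train_scaled, y_train)

y_pred_nbc = nbc_grid.best_estimator_.predict(X_test_scaled)
```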
Interpretation of Model Output:
Several model attributes are available to assess the models. These attributes are stored in various objects. For a list please click here. So what are these attributes?
When fitting an LDA model, linear boundaries between classes are created based on the means and variances of each class. The coefficients serve as delimiters for the boundaries. Below, I printed the model coefficients, class priors and class means for both LDA models.
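These attributes can be pulled from any fitted LinearDiscriminantAnalysis object; a sketch for the model fit on the original features (the polynomial models work the same way):

```python
# Fitted-model attributes used in the interpretation below.
print('Coefficients:', lda_svd_best.coef_)
print('Class priors:', lda_svd_best.priors_)
print('Group means:', lda_svd_best.means_)
```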
Class Priors:
Class priors represent the prior probabilities of the groups in the training data. They should sum to 100%, and they indeed meet that criterion: the distribution of donated vs. did not donate is 50.2% vs. 49.8%. Since the same training data was used for all three models, this ratio is the same in all three models.
Group Means:
Group means represent the average of each predictor within every individual class, and they let us judge the influence of the predictors in each class. Note that there are two sets of them because the model has a binary outcome. Let’s take a look at the first two regressors in the first LDA model to illustrate (the other regressors work the same way).
We can see that the variable Region 1 (a person lives in region 1) might have a slightly greater influence on becoming a donor (0.21) vs. not being a donor (0.17). The second variable has a stronger effect: living in Region 2 has twice as much influence on becoming a donor as not (0.47 vs. 0.23).
Coefficients:
In the first LDA model, the first regressor (reg1) has a coefficient of 1.31086181, and the second (reg2) has a coefficient of 2.40106887. This means that the boundary between the donor vs. not a donor classes will be specified by y=1.31086181*reg1+2.40106887*reg2+…other 18 coefficients and variables. It appears that living in region 2 vs. region 1 is almost twice as influential on becoming a donor.
Accuracy:
Now we are ready to assess the accuracy of the models. Let’s print the accuracy of all fitted models:
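A sketch of that comparison, assuming the prediction arrays created in the earlier steps:

```python
# Test-set accuracy for the default and tuned models.
from sklearn.metrics import accuracy_score

print('LDA (default):', accuracy_score(y_test, lda_default.predict(X_test_scaled)))
print('LDA (tuned, svd):', accuracy_score(y_test, y_pred_lda))
print('QDA (tuned):', accuracy_score(y_test, y_pred_qda))
print('Naive Bayes (tuned):', accuracy_score(y_test, y_pred_nbc))
```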
It looks like the LDA model could be tuned further, because the default score is actually the same as the score of the tuned LDA model using the svd solver and first degree polynomials.
The same can be said about the QDA accuracy scores. So we just found a clever way to check whether an employee spent any time at all tuning parameters. A tuned model should be better than the default model, right?
Having said this, the QDA model seems to outperform the LDA model. The Gaussian Naive Bayes accuracy scores are also provided, and they are not very flattering. We did not take them seriously anyway! We fitted those models mainly because, as the documentation puts it, “If in the QDA model one assumes that the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to the Gaussian Naive Bayes classifier.” (Source: Linear and Quadratic Discriminant Analysis)
We now have the best LDA, QDA and NBC models; let’s see if we can get some basic information about them.
Confusion Matrix:
The confusion matrix is a logical place to ask about overall accuracy, precision, recall, etc. The section below offers a way to obtain all the necessary statistics. There are clear differences between the models, so subject matter expertise would normally dictate which model is preferred. In the current example, the choice is easy because the QDA model is superior to all others on every metric, including accuracy, recall and precision.
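A sketch of that assessment step, shown for the tuned QDA predictions (the same calls apply to the LDA and NBC predictions):

```python
# Confusion matrix plus per-class precision, recall and F1.
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred_qda))
print(classification_report(y_test, y_pred_qda))
```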