Multinomial Logistic Regression
This blog focuses solely on multinomial logistic regression. Discussion about binary models can be found by clicking below:
The discussion below is focused on fitting multinomial logistic regression models with sklearn and statsmodels.
OBJECTIVES:
Get introduced to the multinomial logistic regression model;
Understand the meaning of regression coefficients in both sklearn and statsmodels;
Assess the accuracy of a multinomial logistic regression model.
Introduction:
At times, we need to classify a dependent variable that has more than two classes. For this purpose, binary logistic regression can be extended to the multinomial case. Multinomial logistic regression has many aliases: polytomous LR, multiclass LR, softmax regression, multinomial logit, and others. Despite the numerous names, the method remains relatively unpopular because it is difficult to interpret and it tends to be inferior to other models when accuracy is the ultimate goal.
Multinomial Logit:
Multinomial logit models are an appropriate option when the dependent variable is categorical but not ordinal. They are called multinomial because the dependent variable is assumed to follow a multinomial distribution.
When fitting a multinomial logistic regression model, the dependent variable has K possible outcomes, with K greater than two. We can think of the problem as fitting K-1 independent binary logit models, where one of the possible outcomes is defined as a pivot and each of the other K-1 outcomes is regressed against the pivot outcome.
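In standard notation (a reconstruction, since the symbols are not taken from the original post), with outcome K as the pivot, x the vector of predictors and β_k the coefficient vector for outcome k, the K-1 equations can be written as:

```latex
\ln\frac{P(Y = k \mid \mathbf{x})}{P(Y = K \mid \mathbf{x})}
  = \boldsymbol{\beta}_k \cdot \mathbf{x},
  \qquad k = 1, \dots, K-1
```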
Exponentiating both sides of the equations will provide probabilities:
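In the usual pivot formulation (assumed here), exponentiating gives P(Y=k) = P(Y=K) * exp(β_k · x) for each non-pivot outcome, and because all probabilities must sum to one, P(Y=K) = 1 / (1 + Σ exp(β_j · x)). The sketch below checks this arithmetic with made-up scores; `pivot_probabilities` is a hypothetical helper name, not part of any library.

```python
import math

def pivot_probabilities(scores):
    """Turn K-1 linear predictors (log-odds vs. the pivot) into K probabilities.

    The pivot outcome's probability is returned last.
    """
    exp_scores = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exp_scores)          # the "1" is exp(0) for the pivot
    probs = [e / denom for e in exp_scores]
    probs.append(1.0 / denom)              # pivot outcome
    return probs

# Hypothetical log-odds of two outcomes relative to the pivot.
probs = pivot_probabilities([0.5, -1.0])
print([round(p, 3) for p in probs])
```

Taking the log of any non-pivot probability divided by the pivot probability recovers the original score, which is a quick sanity check on the formulas.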
The multinomial logit model is estimated by maximum likelihood. The first iteration is a model with no regressors, only the intercept. The next iteration includes regressors in the model. The coefficient estimates are updated at every iteration, and iterations continue until the model is said to have converged. The log likelihood increases from one iteration to the next (equivalently, the negative log likelihood that optimizers often report decreases) until the model converges, i.e. until a further iteration would no longer improve the log likelihood. See
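To make the iteration idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions: it uses plain full-batch gradient ascent rather than the Newton-type solvers that statistical packages typically use, it starts from all-zero coefficients, and the data are made up. It records the mean log likelihood after every update so the upward trend can be inspected.

```python
import math

def softmax(scores):
    """Convert a list of class scores into probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, x):
    """Class probabilities for one observation x (W has one row per class)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def mean_log_likelihood(X, y, W):
    return sum(math.log(class_probs(W, x)[c]) for x, c in zip(X, y)) / len(X)

def ascent_step(X, y, W, lr):
    """One full-batch gradient ascent update of all coefficients."""
    grad = [[0.0] * len(X[0]) for _ in W]
    for x, c in zip(X, y):
        p = class_probs(W, x)
        for k in range(len(W)):
            residual = (1.0 if k == c else 0.0) - p[k]
            for j, v in enumerate(x):
                grad[k][j] += residual * v / len(X)
    return [[w + lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

# Toy data: column 0 is the intercept term, column 1 a single made-up feature.
features = [-1.0, -0.8, -0.4, -0.5, -0.1, 0.0, 0.2, 0.1, 0.6, 0.9, 1.0]
X = [[1.0, f] for f in features]
y = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

W = [[0.0, 0.0] for _ in range(3)]  # start from all-zero coefficients
history = [mean_log_likelihood(X, y, W)]
for _ in range(200):
    W = ascent_step(X, y, W, lr=0.5)
    history.append(mean_log_likelihood(X, y, W))

print(f"start: {history[0]:.4f}  end: {history[-1]:.4f}")
```

With all coefficients at zero every class gets probability 1/3, so the starting value is -ln(3); each later entry in `history` should be at least as large as the one before it.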
Using scikit-learn:
In order to demonstrate how to use scikit-learn for fitting multinomial logistic regression models, I used a dataset from the UCI library called the Abalone Data Set. While most projects try to predict the age of abalone from several features, I instead classified abalone by sex to show how LogisticRegression works:
Category 1: Male (n= 1,516)
Category 2: Infant (n=1,324)
Category 3: Female (n=1,301)
There are 10 variables of which the first - SEX - will be used as the dependent variable. Also, CLASS is formatted as string and needs to be converted to integer format. For that purpose, I dropped the first character (A), and kept the second character as an integer. I stored this data in a newly created variable called SIZE_CLASS.
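The CLASS conversion described above could look like the following pandas sketch. The three rows are hypothetical stand-ins for the abalone file (whose name and loading code are not shown here); only the column names CLASS and SIZE_CLASS come from the text.

```python
import pandas as pd

# Hypothetical rows standing in for the abalone data; only CLASS matters here.
df = pd.DataFrame({"SEX": ["M", "I", "F"], "CLASS": ["A3", "A1", "A6"]})

# Drop the leading "A" and keep the remaining digit as an integer.
df["SIZE_CLASS"] = df["CLASS"].str[1:].astype(int)

print(df)
```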
Description of Data:
The data is sourced from a study of abalone in Tasmania. It can be found at the UCI Machine Learning Repository. The dataset contains 4,141 observations and 10 variables.
SEX = M (male), F (female), I (infant)
LENGTH = Longest shell length in mm
DIAM = Diameter perpendicular to length in mm
HEIGHT = Height perpendicular to length and diameter in mm
WHOLE = Whole weight of abalone in grams
SHUCK = Shucked weight of meat in grams
VISCERA = Viscera weight in grams
SHELL = Shell weight after drying in grams
RINGS = Number of rings (+1.5 gives the age in years)
CLASS = Age classification from 1 to 6 (A1= youngest,..., A6=oldest)
We are now ready to partition the dataset:
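A minimal sketch of such a partition with scikit-learn's `train_test_split` follows. The arrays are synthetic stand-ins for the abalone predictors and SEX labels, and the 75/25 split, the `random_state` and the use of `stratify` are assumptions, since the proportions used in this post are not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))               # synthetic stand-in for the predictors
y = rng.choice(["F", "I", "M"], size=400)   # synthetic stand-in for SEX

# stratify=y keeps the class proportions similar in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```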
Using the training data, we can fit a multinomial logistic regression model, and then deploy the model on the test dataset to predict the class of individual abalone.
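The fit-then-predict workflow can be sketched as below. This is not the code used for the abalone results: the data are three synthetic, overlapping classes, and `max_iter=1000` is an arbitrary choice. In recent scikit-learn versions the default lbfgs solver fits a multinomial (softmax) model when the target has more than two classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Three synthetic, overlapping classes in two features (not the abalone data).
centers = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 2))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model1 = LogisticRegression(max_iter=1000)
model1.fit(X_train, y_train)

preds = model1.predict(X_test)        # predicted class labels
probs = model1.predict_proba(X_test)  # one probability per class, per row

print(preds[:5])
print(probs[:5].round(3))
```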
Intercept and Coefficients:
The intercept and coefficients are stored in model1.intercept_ and model1.coef_ respectively. Here we need to spend a bit of time, because the output of scikit-learn is different from what we may expect.
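As a rough illustration of the layout of these attributes, the sketch below fits a model on synthetic three-class data with three made-up features and prints the shapes. The values themselves are meaningless; only the shapes carry over to a real fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic three-class data with three features (not the abalone data).
centers = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + rng.normal(size=(300, 3))

model1 = LogisticRegression(max_iter=1000).fit(X, y)

print(model1.classes_)          # order of the classes in the rows below
print(model1.intercept_.shape)  # one intercept per class
print(model1.coef_.shape)       # one row of coefficients per class
```

The row order of `coef_` and `intercept_` follows `classes_`, which is worth printing before reading any coefficient as belonging to a particular class.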
The first array contains three intercepts and the second array contains three sets of regression coefficients. This is different from what we may be used to in SAS and R. In fact, the sklearn output is also different from the statsmodels version (a discussion of multinomial logistic regression with statsmodels is available below). Let’s see why…
In this solution, there is an equation for each class. Read one class at a time, these act as independent binary logistic regression models whose output is log(p(y=c) / (1 - p(y=c))), the log odds of class c; these are the multinomial logit coefficients, hence the three equations. (Strictly, this reading holds for a one-vs-rest fit. With a softmax fit the three equations are normalized jointly, and only the difference between two classes' coefficients is a log odds.) After exponentiating each regressor coefficient, we in fact get odds ratios. The interpretation of the coefficients is that for a single unit change in the predictor variable, the log of odds changes by the amount of the beta coefficient, given that all other variables are held constant. Log of odds is not really meaningful on its own, so exponentiating the coefficients gives a slightly more user-friendly output:
The interpretation of the exponentiated coefficients is that for a single unit change in the predictor variable, the odds are multiplied by a factor equal to the exponent of the beta coefficient, given that all other variables are held constant. For example, the first variable is LENGTH with a value of 0.038. This means that if length increases by one unit, the odds of being female are multiplied by 0.038, i.e. they fall to 3.8% of what they were before the increase. A generic output looks something like this (the below is based on infants!)
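The arithmetic behind this interpretation can be checked directly. In the sketch below the coefficient values are hypothetical (LENGTH is chosen so that its exponent equals the 0.038 quoted above; DIAM and HEIGHT are made up), and the percentage shown is the change in odds per one-unit increase.

```python
import math

# Hypothetical log-odds coefficients for one class; not fitted abalone values.
coefficients = {"LENGTH": math.log(0.038), "DIAM": 0.0, "HEIGHT": 0.7}

# Exponentiating a log-odds coefficient gives the odds ratio per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    change_pct = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio {ratio:.3f} ({change_pct:+.1f}% change in odds)")
```

An odds ratio below 1 means the odds shrink, a ratio of exactly 1 means no change, and a ratio above 1 means the odds grow.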
Statsmodels:
Notice that the statsmodels output is very different from that of sklearn. Here there are K-1 equations, in this case two, which show coefficients against a reference group. In the abalone example, the reference group was chosen to be female. The coefficients represent the log of the ratio between two probabilities: the probability of belonging to the group of interest vs. the probability of belonging to the reference group. The equation below therefore represents the first set of coefficients, marked as SEX=Infant. Note that there are two sets of coefficients, one marked as Infant and the second marked as Male.
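Written out for the Infant equation with Female as the reference group, the model has the general form shown here. The β symbols are generic placeholders (assumed notation, not fitted values), and only LENGTH is spelled out; the remaining predictors enter in the same way.

```latex
\ln\frac{P(\text{SEX} = \text{Infant})}{P(\text{SEX} = \text{Female})}
  = \beta_{0}^{(I)} + \beta_{1}^{(I)}\,\text{LENGTH} + \dots + \beta_{p}^{(I)}\,x_{p}
```

The Male equation has the same form, with its own set of coefficients and the same Female denominator.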
So how do we interpret this data? For example, the LENGTH coefficient is 17.1 for the Infant group. This means that increasing the LENGTH measurement by one unit increases the log of the ratio between the probability of being an infant and the probability of being female by 17.1 units. Very complicated…but that doesn’t matter if the goal is to accurately predict an outcome.
Accuracy:
Assessing the accuracy of the model is not difficult, but errors at the different levels of the outcome act as a compounding problem.
The accuracy of this model is poor, with only 57% of predictions being correct. The precision and recall for female and male abalone are very concerning as well.
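For readers who want to see how these metrics are derived, here is a small pure-Python sketch. The labels are made up and do not reproduce the 57% figure; `per_class_metrics` is a hypothetical helper. In practice, `accuracy_score` and `classification_report` from `sklearn.metrics` report the same quantities.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and {label: (precision, recall)}."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (actual, predicted) -> count
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    report = {}
    for c in labels:
        true_pos = pairs[(c, c)]
        predicted = sum(pairs[(t, c)] for t in labels)  # everything predicted as c
        actual = sum(pairs[(c, p)] for p in labels)     # everything that really is c
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[c] = (precision, recall)
    return accuracy, report

# Made-up labels for illustration only.
y_true = ["F", "F", "F", "I", "I", "I", "M", "M", "M", "M"]
y_pred = ["F", "M", "M", "I", "I", "F", "M", "M", "F", "I"]

accuracy, report = per_class_metrics(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in report.items():
    print(f"{label}: precision {precision:.2f}, recall {recall:.2f}")
```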