Manatee Data: Simple OLS Regression
Simple Ordinary Least Squares Regression with statsmodels.api
I am continuing the discussion about general linear models. In a prior post, I demonstrated how simple linear regression works. I looked at individual observations and computed several quantities, including the slope, the y intercept and the coefficient of determination. I also tested the significance of the coefficient and of the model using the t statistic and the F statistic, respectively.
In this post, I want to focus on computing the intercept and regression coefficient using statsmodels.api and statsmodels.formula.api. As you’ll see later, the difference between the two is that one requires adding the intercept manually, while the other does not.
The first task is the same as in the manual regression example. I created a data set from data I collected from the Internet. The manatee death data came from the Florida Fish and Wildlife Conservation Commission, and the boat registration data was sourced from the Florida Department of Highway Safety and Motor Vehicles (FLHSMV). An explanation of how the data was gathered and what it actually represents is available from Manatee Data: General Linear
statsmodels.api allows us to fit an Ordinary Least Squares (OLS) model. This is a linear model that estimates the intercept and regression coefficient. These parameters are estimated by the method of least squares, i.e. we minimize the sum of squared differences between the actual observations of the dependent variable and its predicted values. The predictions are based on a linear function.
Visually, OLS minimizes the sum of squared distances between the data points and the regression line, measured parallel to the y axis (the axis of the dependent variable). The smaller this sum, the better the line fits the data.
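To make the least squares idea concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not the manatee data, and the helper names fit_line and sum_squared_residuals are my own. It computes the slope and intercept from the usual closed-form formulas, then checks that nudging the slope away from the fitted value makes the sum of squared residuals larger.

```python
# Toy numbers for illustration only; these are NOT the manatee data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Closed-form least squares estimates for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


intercept, slope = fit_line(x, y)
ssr_best = sum_squared_residuals(x, y, intercept, slope)
# Any other line, e.g. one with a slightly different slope, fits worse
ssr_other = sum_squared_residuals(x, y, intercept, slope + 0.1)

print("intercept:", intercept, "slope:", slope)
print("SSR at the fitted line:", ssr_best)
print("SSR with a nudged slope:", ssr_other)
```

The fitted line is the one where no change to the slope or the intercept can lower the sum any further.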
statsmodels.api
When estimating parameters with this method, be sure to add a constant that will account for the y intercept. The statsmodels OLS estimator does not include the constant automatically. Also, we must ensure that every value in the constant column of the data frame equals 1. I have seen people try to add 0s instead, which does not work: a column of 0s contributes nothing to the prediction, so no intercept can be estimated from it.
When I ran the statsmodels OLS estimator, I managed to reproduce the exact y intercept and regression coefficient I got when I did the work manually (y intercept: 67.580618, regression coefficient: 0.000018). One must print results.params to get these parameters. The plot of the observations and the regression line looks the same as well, which is very reassuring for what I tried to accomplish earlier.
statsmodels.formula.api
Now, we can accomplish the exact same result by using statsmodels.formula.api. In this case, we do not have to add a constant, as this module includes the y intercept by default.
I am very pleased to see that this method also reproduces the same parameters (y intercept: 67.580618, regression coefficient: 0.000018).
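A minimal sketch of the formula version, again with made-up toy numbers and the hypothetical column names boats and deaths rather than the real data set. The model is written as a formula string, and the intercept is included without any extra step.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data and column names; NOT the manatee data.
df = pd.DataFrame({
    "boats": [1.0, 2.0, 3.0, 4.0, 5.0],
    "deaths": [2.1, 3.9, 6.2, 7.8, 10.1],
})

# "deaths ~ boats" reads as: model deaths as a linear function of boats.
# The formula interface adds the intercept term automatically.
results = smf.ols("deaths ~ boats", data=df).fit()

# Here the intercept is labelled "Intercept" instead of "const"
print(results.params)
```

On the same data, both interfaces give the same estimates; only the label of the intercept differs.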
While fitting a simple OLS model is straightforward, I do think it is important to understand what happens during the fitting of these models before moving on to more complicated things. I hope I managed to describe the basics of regression modeling.