Statistical Regression with Python statsmodels

1. Introduction

Statistical regression is a common analytical method in statistics used to study the relationship between one or more independent variables and a dependent variable. Python is a powerful programming language with an ecosystem that includes many libraries for statistical analysis. In this article, we will focus on the statsmodels library in Python, which provides a wide range of statistical models and methods. We will also discuss in detail how to use statsmodels to perform statistical regression analysis.

2. Installing the statsmodels Library

Before we begin, we need to install the statsmodels library. You can install it from the command line using pip:

pip install statsmodels

Once the installation finishes, you can use statsmodels to perform statistical analysis in Python.
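To confirm that the installation succeeded, you can import the package and print its version (the exact version string will depend on what pip installed):

```python
import statsmodels

# Importing without error and printing a version string confirms the install
print(statsmodels.__version__)
```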

3. Simple Linear Regression

Simple linear regression is the most basic form of statistical regression, used to investigate whether there is a linear relationship between an independent variable and a dependent variable. We’ll use a simple example to illustrate how to perform simple linear regression analysis using statsmodels.

First, we need to import the required libraries:

import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt

Let’s create a dummy dataset containing a linear relationship between the independent variable x and the dependent variable y:

np.random.seed(0)
n = 100
x = np.random.normal(0, 1, n)
y = 2 * x + np.random.normal(0, 1, n)

Now, we can use the statsmodels library to perform a simple linear regression analysis. First, we need to add a constant column to x so that the statistical model can fit the intercept:

x = sm.add_constant(x)

Next, we can create and fit a regression model using the OLS (Ordinary Least Squares) method:

model = sm.OLS(y, x)

results = model.fit()

Finally, we can print a summary of the regression results using the summary() method:

print(results.summary())

Running the above code, we will obtain a regression summary similar to the following:

OLS Regression Results                            
==============================================================================
Dep. Variable: y R-squared: 0.486
Model: OLS Adj. R-squared: 0.481
Method: Least Squares F-statistic: 101.7
Date: Mon, 01 Jun 2020 Prob (F-statistic): 5.40e-17
Time: 12:00:00 Log-Likelihood: -136.49
No. Observations: 100 AIC: 277.0
Df Residuals: 98 BIC: 282.2
Df Model: 1                                         
Covariance Type: nonrobust
==============================================================================
                 coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const 0.0178 0.100 0.178 0.859 -0.181 0.216
x1 1.9737 0.195 10.084 0.000 1.586       2.361
==============================================================================
Omnibus: 0.118 Durbin-Watson: 1.924
Prob(Omnibus): 0.943 Jarque-Bera (JB): 0.238
Skew: -0.073 Prob(JB): 0.888
Kurtosis: 2.826 Cond. No. 1.09
==================================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression results summary provides statistical indicators such as R-squared, adjusted R-squared, F-statistic, and coef. These indicators can be used to assess the goodness of fit of the regression model and the significance of the independent variables.

We can also use seaborn to draw a regression plot to visualize the regression results:

sns.regplot(x=x[:, 1], y=y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simple Linear Regression')
plt.show()

The above code will draw a regression line, representing the linear relationship between the independent variable x and the dependent variable y.

4. Multiple Linear Regression

Multiple linear regression is used to study the linear relationship between multiple independent variables and a dependent variable. We will use an example dataset to illustrate how to perform multiple linear regression analysis using statsmodels.

First, we’ll import the required libraries and sample dataset:

import numpy as np
import statsmodels.api as sm
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the example dataset into a DataFrame
data = sns.load_dataset('tips')
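Before fitting, it can help to glance at the data. The tips dataset that ships with seaborn contains 244 restaurant bills:

```python
import seaborn as sns

data = sns.load_dataset('tips')

# Inspect the columns we will use and the overall shape
print(data[['total_bill', 'size', 'tip']].head())
print(data.shape)  # (244, 7)
```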

We will use ‘total_bill’ and ‘size’ as independent variables, and ‘tip’ as the dependent variable. Now, we can perform a multiple linear regression analysis using the statsmodels library. First, we need to add a constant column to the independent variables so that the statistical model can fit the intercept:

x = sm.add_constant(data[['total_bill', 'size']])
y = data['tip']

Then, we can create and fit the regression model:

model = sm.OLS(y, x)
results = model.fit()

Finally, we can use the summary() method to print a summary of the regression results:

print(results.summary())

Running the above code, we will obtain a regression summary similar to the following:

OLS Regression Results
==============================================================================
Dep. Variable: tip R-squared: 0.456
Model: OLS Adj. R-squared: 0.451
Method: Least Squares F-statistic: 92.31
Date: Mon, 01 Jun 2020 Prob (F-statistic): 6.70e-30
Time: 12:00:00 Log-Likelihood: -346.86
No. Observations: 244 AIC: 699.7
Df Residuals: 241 BIC: 710.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const 0.6689 0.193 3.469 0.001 0.289 1.049
total_bill 0.0927 0.009 10.149 0.000 0.075 0.111
size 0.1926 0.088 2.179 0.030 0.018 0.367
==============================================================================
Omnibus: 18.817 Durbin-Watson: 1.097
Prob(Omnibus): 0.000 Jarque-Bera (JB): 20.528
Skew: 0.684 Prob(JB): 3.44e-05
Kurtosis: 3.191 Cond. No. 96.5
=================================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Similar to simple linear regression analysis, the regression result summary provides statistical indicators such as R-squared, adjusted R-squared, F-statistic, and coef. These indicators can be used to evaluate the goodness of fit of the regression model and the significance of the independent variables.
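statsmodels also provides a formula interface (statsmodels.formula.api) that lets you specify the same model as an R-style formula; the intercept is then included automatically, so no add_constant call is needed. A sketch of the equivalent fit:

```python
import seaborn as sns
import statsmodels.formula.api as smf

data = sns.load_dataset('tips')

# 'tip ~ total_bill + size' regresses tip on both predictors,
# with an implicit intercept term
results = smf.ols('tip ~ total_bill + size', data=data).fit()
print(results.params)
```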

We can also use seaborn to draw a regression plot to visualize the regression results:

sns.regplot(x=data['total_bill'], y=data['tip'])
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.title('Multiple Linear Regression')
plt.show()

The above code draws a regression line for ‘tip’ against ‘total_bill’ alone. With more than one independent variable, the full fitted relationship cannot be shown in a single two-dimensional plot, so we visualize one predictor at a time.

5. Summary

This article introduced how to perform statistical regression analysis using the statsmodels library in Python. We demonstrated the implementation of simple and multiple linear regression using example code and explained the statistical indicators in the regression summary. Through these methods, we can perform various statistical regression analyses in Python and gain insights into the relationship between independent and dependent variables from the results.
