Statsmodels API

Statsmodels is a Python package that provides classes and functions for estimating many different statistical models, running statistical tests, and exploring data. In this article, I'll explain how to use the statsmodels API to train a simple linear regression model.

PRELIMINARY

To get started, install the statsmodels package with pip. The leading "!" runs the command from a notebook cell; drop it if you are working in a terminal.

!pip install statsmodels

Next, import the statsmodels formula API and give it the alias "smf".

import statsmodels.formula.api as smf

The dataset used in this article is a dummy dataset with just two features (columns) and seven hundred observations (rows). It was generated using values from 0 to 100 for the "x" column and a corresponding "y" column produced by the Excel function NORMINV(RAND(), x, 3), which draws a random value from a normal distribution with mean x and standard deviation 3.
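
If you'd rather not open Excel, a similar dataset can be sketched in plain Python. This is only an approximation of the recipe above, not the code used to build train.csv: the function name make_dummy_rows, the fixed seed, and the choice of random integer x values are all my assumptions.

```python
import random

def make_dummy_rows(n_rows=700, noise_sd=3.0, seed=0):
    """Return n_rows (x, y) pairs with y drawn from Normal(mean=x, sd=noise_sd)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the rows are reproducible
    rows = []
    for _ in range(n_rows):
        x = rng.randint(0, 100)     # assumed: integer x values spread over 0-100
        y = rng.gauss(x, noise_sd)  # Python counterpart of NORMINV(RAND(), x, 3)
        rows.append((x, y))
    return rows

rows = make_dummy_rows()
print(len(rows), rows[0])
```

Each "y" lands within a few units of its "x", which is why a straight line fits this data so well later on.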

Download the train dataset (train.csv) using this link and read it using the pandas package.

import pandas as pd
df = pd.read_csv("train.csv")  # path to the downloaded train.csv

FITTING THE MODEL

We train a simple linear regression model where "y" is the dependent variable and "x" is the independent variable.

The next step is to call the function that carries out the linear regression. Statsmodels offers several techniques for fitting linear models. For this article, we'll use Ordinary Least Squares (OLS). OLS picks the best-fit line by minimizing the sum of the squared differences (residuals) between each observed point in your dataset and the value the line predicts for it. If you are curious, you can check out the other techniques available in statsmodels.api.

model = smf.ols(formula='y ~ x', data=df)

The code snippet above calls the ols function from the statsmodels formula API. The formula argument is a patsy formula string that uses the tilde (~) symbol to relate the dependent variable on the left to the independent variable on the right. The data argument is the DataFrame whose columns the formula refers to.

Next, call the .fit() method of the model instance and assign what it returns to a variable "modelResult". The .fit() method fits the linear regression model to the data, as shown in the following snippet.

modelResult = model.fit()
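
To make the fitting step less of a black box, here is a minimal sketch of what ordinary least squares computes when there is a single predictor. It uses the textbook closed-form solution on a tiny made-up dataset; it is not how statsmodels is implemented internally, and the helper names fit_line and sum_squared_residuals are mine.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(xs, ys, intercept, slope):
    """Total squared vertical distance between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # made-up points lying exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # prints 1.0 2.0
```

Any other intercept or slope gives a larger sum of squared residuals for these points, which is exactly the sense in which the fitted line is the "best" one.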

RESULT

The final step is to print the result of our linear regression model.

print(modelResult.summary())
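
The summary is a formatted text table, which is handy for reading but awkward if you need a single number. The fitted results object also exposes the individual values as attributes. The sketch below shows a few of them on made-up data; toy_df, toy_result, and new_points are illustrative names, not part of the article's dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only (not the article's train.csv).
rng = np.random.default_rng(1)
x = np.linspace(0, 100, 200)
toy_df = pd.DataFrame({"x": x, "y": x + rng.normal(0, 3, size=x.size)})

toy_result = smf.ols(formula="y ~ x", data=toy_df).fit()

print(toy_result.params)      # estimated Intercept and slope for x
print(toy_result.rsquared)    # R-squared as a plain float
print(toy_result.conf_int())  # confidence intervals (95% by default)

# Predict y for new x values; the column name must match the formula.
new_points = pd.DataFrame({"x": [10, 50, 90]})
predictions = toy_result.predict(new_points)
print(predictions)
```

The same attributes should be available on "modelResult" from the steps above, since it is the same kind of results object.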

If you got to this point without any errors, you should get a result like this.

Congratulations, you have successfully used the statsmodels API to train a simple linear regression model. The R-squared value of 0.991 means the model explains about 99% of the variance in "y". Remember it's a dummy dataset 😉
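
For the curious, R-squared is the share of the variance in "y" that the fitted line accounts for, computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a small sketch of that formula on made-up numbers; the helper name r_squared is mine, and the values are chosen only to show the two extremes.

```python
def r_squared(ys, predicted):
    """Return 1 - SS_res / SS_tot for observed ys and model predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                # total variation
    return 1 - ss_res / ss_tot

ys = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]  # predictions that match every point
mean_only = [2.5] * 4           # always predicting the mean of ys

print(r_squared(ys, perfect))    # prints 1.0
print(r_squared(ys, mean_only))  # prints 0.0
```

A value near 1 means the line leaves very little of the variation unexplained, which is what you'd expect for data generated as "x" plus a little noise.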

If you want a detailed explanation of the OLS regression results, send me a message using the comment section, and I'll make a Part 2 of the article. Thanks for reading!
