Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [3]:
nsample = 50
sig = 0.25  # standard deviation of the noise
x1 = np.linspace(0, 20, nsample)
# design matrix: linear, sinusoidal, and quadratic terms
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]  # true coefficients (constant first)
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)  # add Gaussian noise
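Note that y contains a fresh random draw, so the numbers printed below will differ slightly from run to run. Seeding NumPy's random number generator before drawing the noise would make the notebook reproducible (a minimal sketch; the seed value is arbitrary):

np.random.seed(12345)  # arbitrary seed; call this before the np.random.normal draw above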

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.985
Model:                            OLS   Adj. R-squared:                  0.984
Method:                 Least Squares   F-statistic:                     1012.
Date:                Tue, 03 Dec 2019   Prob (F-statistic):           5.51e-42
Time:                        21:23:04   Log-Likelihood:                 5.0459
No. Observations:                  50   AIC:                            -2.092
Df Residuals:                      46   BIC:                             5.556
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0827      0.078     65.389      0.000       4.926       5.239
x1             0.4798      0.012     40.024      0.000       0.456       0.504
x2             0.4592      0.047      9.744      0.000       0.364       0.554
x3            -0.0190      0.001    -18.079      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        1.430   Durbin-Watson:                   2.403
Prob(Omnibus):                  0.489   Jarque-Bera (JB):                1.200
Skew:                           0.187   Prob(JB):                        0.549
Kurtosis:                       2.339   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
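The quantities in the summary table are also available programmatically on the results object; for example, the coefficient estimates and their 95% confidence intervals (a brief sketch):

print(olsres.params)      # estimated coefficients, same order as the columns of X
print(olsres.conf_int())  # 95% confidence intervals by default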

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.60698662  5.05958814  5.47590429  5.83091009  6.10861197  6.30467541
  6.42713713  6.49508467  6.53552038  6.57892496  6.65424963  6.78415947
  6.98130965  7.24626621  7.56741305  7.92286044  8.28404144  8.62040534
  8.90443709  9.11617793  9.24650455  9.29862804  9.2875663   9.23767667
  9.17865403  9.14065211  9.1493292   9.22163268  9.36301452  9.5665356
  9.81400576 10.07897239 10.33106654 10.54099242 10.68534064 10.75043431
 10.7345761  10.64832709 10.51277167 10.35605337 10.20875101 10.09885464
 10.04716691 10.06388518 10.14692621 10.28226834 10.44625528 10.60948275
 10.74162948 10.81643899]
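Since X is the original design matrix, these in-sample predictions are simply the fitted values; a quick check (assuming the objects above are still in scope):

print(np.allclose(ypred, olsres.fittedvalues))  # expected to print True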

Create a new sample of explanatory variables Xnew, predict, and plot.

In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.80466026 10.67129958 10.43436429 10.13489087  9.82689776  9.56415977
  9.38704212  9.31261765  9.33048675  9.4053236 ]
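Point predictions carry no uncertainty information. If you also want standard errors and intervals for the new observations, the results object exposes get_prediction (a brief sketch; the exact column names of the summary frame may vary across statsmodels versions):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, its confidence interval, and the wider prediction interval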

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we don't want any expansion magic from using **2.

In [9]:
res.params
Out[9]:
Intercept           5.082726
x1                  0.479808
np.sin(x1)          0.459181
I((x1 - 5) ** 2)   -0.019030
dtype: float64
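These estimates should agree with the array-based olsres fit above; only the parameter names differ (a quick sanity check):

print(np.allclose(res.params.values, olsres.params))  # expected to print True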

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.804660
1    10.671300
2    10.434364
3    10.134891
4     9.826898
5     9.564160
6     9.387042
7     9.312618
8     9.330487
9     9.405324
dtype: float64
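Because patsy rebuilds the full design matrix from the formula, predicting at any new point only requires the raw x1 values (a minimal sketch; the value 22.0 is arbitrary):

print(res.predict(exog=dict(x1=[22.0])))  # a single out-of-sample point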