### Introduction

Financial forecasting, and stock price forecasting in particular, is a key component of everyday market activity. Even a small increase in accuracy can translate into significant profit, so such algorithms and models attract considerable attention in both academic and professional circles.

Inspired by this article, we decided to start a series of posts showing how deep learning and applied mathematics can be used for financial forecasting.

In this article we analyze simple linear regression.

The underlying model is assumed to be

\[ SP500_n = a_1 c_1 + a_2 c_2 + \dots + a_{500} c_{500} + a_0 \]

where

\( SP500_n \) is the value of the S&P 500 index \( n \) steps ahead in the future, \( c_1 \) to \( c_{500} \) are the current values of the individual constituents, and \( a_0 \) to \( a_{500} \) are model parameters to be identified during training. The coefficient \( a_0 \) is special: it is the so-called bias, intercept, or free term. Because of this constant term the model is, strictly speaking, not linear but *affine*.
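To make the affine/linear distinction concrete, here is a minimal sketch with a toy three-constituent model (the coefficient and constituent values are made up for illustration; the real model has 500 of each). It also shows why augmenting the input with a constant 1, which is what `sm.add_constant` does below, turns the affine model back into a purely linear one:

```python
import numpy as np

# Toy affine model with 3 constituents (hypothetical numbers).
a = np.array([0.4, 0.35, 0.25])     # a_1 .. a_3
a0 = 12.0                           # bias / intercept (free term)

c = np.array([100.0, 200.0, 50.0])  # current constituent values

# Affine prediction: dot product plus the free term.
sp500_pred = a @ c + a0

# The same model expressed as purely linear by augmenting the
# input vector with a constant 1 and absorbing a0 into the
# coefficient vector.
c_aug = np.concatenate(([1.0], c))
a_aug = np.concatenate(([a0], a))
assert np.isclose(sp500_pred, a_aug @ c_aug)
```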

The dataset contains \( n = 41266 \) minutes of data, ranging from April to August 2017, on 500 stocks as well as the total S&P 500 index price. The index and the stocks are arranged in wide format. The data can be downloaded from the link above.

First, we import the data and plot it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

data = pd.read_csv('../data/data_stocks.csv')
data.head()
data['SP500'].plot();
```

The following problem will be considered:

**The value of the S&P 500 index at some future time instant will be predicted based on the current values of its individual constituents. Therefore, for each individual data point there are 500 input quantities and 1 output quantity.**

### Prepare data

Split the data into a training and a test set.

```python
# Input data (regressor).
# Each column of the regressor represents a single input
# variable (a single constituent). Each row of the regressor
# represents a single data point (values of the constituents
# taken at a specific time instant).
X = data.drop(['DATE', 'SP500'], axis=1)
X = sm.add_constant(X)
X.head()

# Output data (targets, regressand) for the one-step-ahead
# prediction. Training targets are values of the SP500 index,
# taken in the consecutive time interval. The data set is
# already aligned in this way, so that the i-th row contains
# constituents' data for some time instant `t`, and SP500
# index data for the consecutive time instant `t+1`.
Y = data['SP500']
Y.head()
```
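The comments above note that the dataset is already aligned for one-step-ahead prediction. If it were not, the alignment could be produced with pandas `shift`; the mini-frame below is a hypothetical example, not the article's data:

```python
import pandas as pd

# Hypothetical mini-frame: one constituent column and the index,
# both sampled at times t = 0, 1, 2, 3.
df = pd.DataFrame({
    'AAPL':  [1.0, 2.0, 3.0, 4.0],
    'SP500': [10.0, 11.0, 12.0, 13.0],
})

# One-step-ahead alignment: pair constituents at time t with the
# index at time t+1, then drop the last row, which has no future
# value to predict.
aligned = df.copy()
aligned['SP500'] = aligned['SP500'].shift(-1)
aligned = aligned.dropna()
```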

Training data for the one-step ahead prediction

```python
# N_train is the number of samples reserved for training
# (e.g. N_train = int(0.8 * len(X)) for an 80/20 split).
Xtrain_1sa = X.iloc[0:N_train, :]  # Input columns
Ytrain_1sa = Y.iloc[0:N_train]
```

Test data for the one-step ahead prediction

```python
Xtest_1sa = X.iloc[N_train:, :]  # Input columns
Ytest_1sa = Y.iloc[N_train:]
```

### Train the model

```python
model_1sa = sm.OLS(Ytrain_1sa, Xtrain_1sa).fit()
model_1sa.summary()
```

A lot of information can be read from the model summary. In particular:

- Multiple columns are essentially collinear, indicating that the prices of different constituents are not independent.
- There is a good statistical indication that some constituents are essentially irrelevant, since 0 is a plausible value of the corresponding coefficient.

This information strongly indicates that a similar fit could be obtained from a significantly simpler model, obtained by reducing the number of inputs.

### Evaluate prediction performance on the TRAINING set

```python
# Get model prediction
Yhat_train_1sa_ols = model_1sa.predict(Xtrain_1sa)

# Compare actual and predicted results
result_train_1sa = pd.DataFrame()
result_train_1sa['Ytrain'] = Ytrain_1sa
result_train_1sa['Yhat_train_ols'] = Yhat_train_1sa_ols
result_train_1sa.plot();

# Evaluate training error (relative error)
Etrain_1sa_ols = (Ytrain_1sa - Yhat_train_1sa_ols) / Ytrain_1sa
Etrain_1sa_ols.plot();
plt.title(f'MSE = {np.mean(Etrain_1sa_ols**2)}');
```

```python
# Visualize statistical distribution of the prediction error
sns.distplot(Etrain_1sa_ols);
```

The statistical distribution of the error signal shows that the error almost obeys a Gaussian distribution, with a standard deviation of roughly a third of a per mille.

```python
pd.plotting.autocorrelation_plot(Etrain_1sa_ols);
```

Since the autocorrelation is practically zero, the error signal is essentially white noise. This indicates that there are no unexplained phenomena left in the data, and that it is highly unlikely that significantly better results could be obtained by introducing more complex models.
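The "practically zero" claim can be checked numerically rather than by eye. The dashed lines in pandas' autocorrelation plot mark the approximate \( \pm 2/\sqrt{n} \) confidence band for white noise; the sketch below computes the sample autocorrelations directly and checks them against that band, using a simulated white-noise signal in place of the actual residuals:

```python
import numpy as np

# Simulated white-noise residual signal (fixed seed, so the
# example is reproducible; stands in for Etrain_1sa_ols).
rng = np.random.default_rng(42)
e = rng.normal(size=1000)
n = len(e)

# Sample autocorrelation at lags 1..10.
e0 = e - e.mean()
denom = np.sum(e0 ** 2)
acf = np.array([np.sum(e0[:-k] * e0[k:]) / denom
                for k in range(1, 11)])

# For white noise, roughly 95% of the autocorrelations should
# fall inside the +/- 2/sqrt(n) band -- the dashed lines drawn
# by pandas' autocorrelation_plot.
band = 2.0 / np.sqrt(n)
inside = np.mean(np.abs(acf) < band)
```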

### Evaluate prediction performance on the TEST set

```python
Yhat_test_1sa_ols = model_1sa.predict(Xtest_1sa)

# Compare actual and predicted results
result_test_1sa = pd.DataFrame()
result_test_1sa['Ytest'] = Ytest_1sa
result_test_1sa['Yhat_test_ols'] = Yhat_test_1sa_ols
result_test_1sa.plot();
```

Note that the two curves are very similar, but no longer indistinguishable!

Forecasting accuracy is near-perfect.

```python
# Evaluate test error (relative error)
Etest_1sa_ols = (Ytest_1sa - Yhat_test_1sa_ols) / Ytest_1sa
Etest_1sa_ols.plot();
plt.title(f'MSE = {np.mean(Etest_1sa_ols**2)}');
```

The root mean square error is 4 times bigger when evaluated on the test set than on the training set. This is normal; the test error is typically bigger than the training error, since the model's parameters were fitted to the training data. Good performance on the test set is a solid indication of the good generalization power of the model.
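The train/test gap can be reproduced on a small synthetic regression (the data and split size below are invented for illustration; `lstsq` stands in for the OLS fit):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w + 0.1 * rng.normal(size=n)

# Chronological 80/20 split, as in the article.
n_train = int(0.8 * n)
Xtr, Xte = x[:n_train], x[n_train:]
ytr, yte = y[:n_train], y[n_train:]

# Ordinary least squares fit on the training portion only.
w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)

# RMSE on each portion; the test value is usually (though not
# always) the larger of the two, since w was tuned to Xtr.
rmse_train = np.sqrt(np.mean((ytr - Xtr @ w) ** 2))
rmse_test = np.sqrt(np.mean((yte - Xte @ w) ** 2))
```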

```python
# Visualize statistical distribution of the prediction error
sns.distplot(Etest_1sa_ols);
```

The histogram clearly indicates that the prediction is slightly biased towards negative values (we tend to predict slightly higher values of the S&P 500 index than the ones that actually occur). Notice also that the distribution is no longer a "perfect" bell-shaped curve.

```python
pd.plotting.autocorrelation_plot(Etest_1sa_ols);
```

The autocorrelation plot now has a more complex shape. The error is no longer white noise; it has a more complex structure. We can conclude (as is obvious) that there are other factors at play influencing the value of the S&P 500 index.

This is just a simple example of how basic linear regression can match the accuracy of TensorFlow deep learning, provided the problem is simple enough. 🙂

In the next article we will show how this algorithm behaves for n-step-ahead prediction.