
Let us now attempt to predict future values of the S&P500 from the current values of its constituents. We elaborate the results for a prediction horizon of n=10 steps ahead, but the same procedure can be applied for arbitrary n. The results will, of course, tend to become worse as n grows.

# The prediction horizon. We attempt to
# predict the SP value n steps ahead, with n=10.
n=10

# Training data for the n-step ahead prediction
Xtrain_nsa = X.iloc[0:N_train, :]  # Input columns
Ytrain_nsa = Y.iloc[n-1:N_train+n-1]
# Test data for the n-step ahead prediction
Xtest_nsa = X.iloc[N_train:, :]  # Input columns
Ytest_nsa = Y.iloc[N_train+n-1:]
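The slicing above pairs each input row with the output found n-1 rows later. The following is a minimal sketch of that pairing on toy data, not the S&P500 set. The helper name `make_nsa_pairs` and the toy arrays are hypothetical, and the sketch assumes that X and Y have the same number of rows in the same order. Under that assumption the last n-1 input rows have no matching output, so the sketch drops them to keep inputs and outputs the same length.

```python
import numpy as np

def make_nsa_pairs(X, Y, n, n_train):
    """Pair input row t with output row t + n - 1 and split into train/test.

    Hypothetical helper for illustration; assumes len(X) == len(Y).
    """
    X = np.asarray(X)
    Y = np.asarray(Y)
    n_pairs = len(Y) - (n - 1)  # input rows that still have a target
    X_train = X[:n_train]
    Y_train = Y[n - 1:n_train + n - 1]
    X_test = X[n_train:n_pairs]
    Y_test = Y[n_train + n - 1:]
    return X_train, Y_train, X_test, Y_test

# Toy data: X[t] = [t] and Y[t] = t, so the offset can be read off directly.
t = np.arange(20)
X_toy = t.reshape(-1, 1)
Y_toy = t.copy()

Xtr, Ytr, Xte, Yte = make_nsa_pairs(X_toy, Y_toy, n=10, n_train=6)
print(Ytr - Xtr[:, 0])     # every entry equals n - 1 = 9
print(len(Xte), len(Yte))  # equal lengths after dropping unmatched inputs
```

With n=1 the offset is zero and the split reduces to pairing each input row with the output in the same row.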

We will perform the same steps used for 1-step-ahead prediction.

# Train the model.
model_nsa = sm.OLS(Ytrain_nsa.values, Xtrain_nsa.values).fit()

# Predict values on the test set.
Yhat_test_nsa = model_nsa.predict(Xtest_nsa)

# Compare actual and predicted results
result_test_nsa = pd.DataFrame()
result_test_nsa['actual data'] = Ytest_nsa
result_test_nsa['model prediction'] = Yhat_test_nsa
result_test_nsa.plot();
plt.legend(prop={'size': 16});

# Evaluate test error
Etest_nsa = (Ytest_nsa - Yhat_test_nsa)/Ytest_nsa
Etest_nsa.plot();
plt.title(f'RMSE = {np.sqrt(np.mean(Etest_nsa**2))}',
fontsize=16);

Simply by plotting the data, we see that there is a trend in the error. This is structure that the model has not explained. The input data does not seem to be sufficient for accurate prediction of the S&P500 index 10 steps ahead. Notice also that the root mean square error (RMSE) is an order of magnitude higher than in the 1-step-ahead case.
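A trend in the error can also be checked numerically rather than by eye. The sketch below computes the RMSE of the relative error, as in the cell above, and fits a straight line to the error sequence. The function names and the synthetic series are assumptions made for illustration only; they are not part of the original analysis and do not use the S&P500 data.

```python
import numpy as np

def relative_rmse(actual, predicted):
    """RMSE of the relative error (actual - predicted) / actual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rel_err = (actual - predicted) / actual
    return np.sqrt(np.mean(rel_err ** 2))

def error_trend_slope(rel_err):
    """Slope of a least-squares line fitted to the error sequence."""
    rel_err = np.asarray(rel_err, dtype=float)
    steps = np.arange(len(rel_err))
    slope, _intercept = np.polyfit(steps, rel_err, deg=1)
    return slope

# Synthetic example: a prediction that under-predicts more at every step.
steps = np.arange(50)
actual = np.linspace(100.0, 120.0, 50)
predicted = actual * (1.0 - 0.001 * steps)

rel_err = (actual - predicted) / actual
rmse = relative_rmse(actual, predicted)
slope = error_trend_slope(rel_err)
print(f"relative RMSE = {rmse:.4f}, trend slope = {slope:.6f}")
```

A slope that is clearly different from zero, relative to the scatter of the error, is one simple indication that the residual is not just noise.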

# Visualize statistical distribution of
# the prediction error
sns.histplot(Etest_nsa.dropna(), kde=True, stat='density');

The error distribution shows clear signs of grouping into (at least) three distinct clusters, which is another sign of underlying structure not captured by the model.

# Visualize error autocorrelation function.
pdplt.autocorrelation_plot(Etest_nsa.dropna());

The autocorrelation plot demonstrates significant temporal dependencies between error values. This seems to indicate that a dynamical model is needed.
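To make the notion of temporal dependence concrete, the sketch below computes the sample autocorrelation at a single lag and compares it with the approximate 95% band of ±1.96/√N expected for uncorrelated noise. The function `sample_autocorr` and both synthetic series are hypothetical and serve only as an illustration; the prediction error from the cells above is not used here.

```python
import numpy as np

def sample_autocorr(x, lag):
    """Sample autocorrelation of x at a non-negative integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    if lag == 0:
        return 1.0
    return np.sum(x[:-lag] * x[lag:]) / np.sum(x ** 2)

rng = np.random.default_rng(0)
white = rng.standard_normal(2000)  # no temporal structure
drifting = np.cumsum(white)        # random walk, strong temporal structure

# Approximate 95% band for the autocorrelation of uncorrelated noise.
bound = 1.96 / np.sqrt(len(white))

for name, series in [("white noise", white), ("random walk", drifting)]:
    r1 = sample_autocorr(series, lag=1)
    print(f"{name}: lag-1 autocorrelation = {r1:.3f}, band = +/-{bound:.3f}")
```

An error sequence that behaves like the second series, with autocorrelation far outside the band over many lags, still contains predictable structure.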

### Interactive analysis

# IPython interactive and widgets
from ipywidgets import interactive
from IPython.display import display

# Function which performs all steps of the
# learning procedure for the n-steps-ahead
# prediction problem, including:
# * data selection,
# * model training, and
# * model testing,
# using reduced data set (first 10% of the
# original data is used).

# The function is used later to interactively
# refit the model for different
# values of n.

def build_and_evaluate_model(n):
    NN_train = int(0.1*N_train)   # Reduced number of training samples
    NN_test = int(0.25*NN_train)  # Reduced number of test samples
    # Training data for the n-step ahead prediction
    Xtrain = X.iloc[0:NN_train, :]  # Input columns
    Ytrain = Y.iloc[n-1:NN_train+n-1]
    # Test data for the n-step ahead prediction
    Xtest = X.iloc[NN_train:NN_train+NN_test, :]  # Input columns
    Ytest = Y.iloc[NN_train+n-1:NN_train+NN_test+n-1]
    # Train the model.
    model = sm.OLS(Ytrain.values, Xtrain.values).fit()
    # Predict values on the test set.
    Yhat_test = model.predict(Xtest)
    # Compare actual and predicted results
    result = pd.DataFrame()
    result['actual data'] = Ytest
    result['model prediction'] = Yhat_test
    result.plot();
    plt.legend(prop={'size': 16});
    plt.title(f'n = {n}', fontsize=16)
    plt.show()

y=interactive(build_and_evaluate_model,n=(1,50,1))
display(y)

Interactive analysis on the reduced data set reveals several interesting points.

The first point is that the results become worse and worse as the prediction horizon grows (as expected).
The second point reveals the necessity of data normalization, if not for numerical purposes, then for evaluation purposes. The plot obtained on the reduced set shows that even for 1-step-ahead prediction the results are not that “perfect”, in the sense that there is a visible difference between the two curves.
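One common way to normalize the data is to standardize each column to zero mean and unit variance, using statistics computed on the training set only. The sketch below illustrates this under that assumption. The helper names and the toy numbers are hypothetical, and this is not necessarily the normalization the analysis would end up using.

```python
import numpy as np

def fit_standardizer(train):
    """Column-wise mean and standard deviation of the training data."""
    train = np.asarray(train, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)  # leave constant columns unscaled
    return mean, std

def standardize(data, mean, std):
    """Apply training-set statistics to any data set."""
    return (np.asarray(data, dtype=float) - mean) / std

# Toy columns on very different scales (made-up numbers).
X_train_toy = np.array([[10.0, 2000.0],
                        [20.0, 2100.0],
                        [30.0, 2200.0]])
X_test_toy = np.array([[40.0, 2300.0]])

mean, std = fit_standardizer(X_train_toy)
Z_train = standardize(X_train_toy, mean, std)
Z_test = standardize(X_test_toy, mean, std)  # reuses the training statistics

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

Reusing the training statistics on the test set keeps information about the test period out of the preprocessing step, and it puts all columns on a comparable scale when the two curves are compared.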
