Let us now attempt to predict future values of the S&P 500 based on the current values of its constituents. We will elaborate the results for a prediction horizon of n = 10 steps ahead, but the same procedure can be applied for arbitrary n. The results will, of course, tend to become worse as n grows!

```python
# The prediction horizon. We attempt to
# predict the S&P 500 value n steps ahead, with n=10.
n = 10
```

```python
# Training data for the n-step ahead prediction
Xtrain_nsa = X.iloc[0:N_train, :]        # Input columns
Ytrain_nsa = Y.iloc[n-1:N_train+n-1]

# Test data for the n-step ahead prediction.
# The last n-1 input rows are dropped so that
# Xtest_nsa and Ytest_nsa have the same length.
Xtest_nsa = X.iloc[N_train:len(X)-n+1, :]  # Input columns
Ytest_nsa = Y.iloc[N_train+n-1:]
```

We will perform the same steps used for 1-step-ahead prediction.

```python
# Train the model.
model_nsa = sm.OLS(Ytrain_nsa.values, Xtrain_nsa.values).fit()

# Predict values on the test set.
Yhat_test_nsa = model_nsa.predict(Xtest_nsa)

# Compare actual and predicted results
result_test_nsa = pd.DataFrame()
result_test_nsa['actual data'] = Ytest_nsa
result_test_nsa['model prediction'] = Yhat_test_nsa
result_test_nsa.plot()
plt.legend(prop={'size': 16});
```

```python
# Evaluate the relative test error
Etest_nsa = (Ytest_nsa - Yhat_test_nsa) / Ytest_nsa
Etest_nsa.plot()
plt.title(f'RMSE = {np.sqrt(np.mean(Etest_nsa**2))}', fontsize=16);
```

Simply by plotting the data, we see that there is a trend in the error! There is some structure that needs to be explained. The input data does not seem to be sufficient for accurate prediction of the S&P 500 index 10 steps ahead. Notice also that the root mean square error (RMSE) is an order of magnitude higher than in the 1-step-ahead case.
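The growth of the error with the prediction horizon can be illustrated on synthetic data. The sketch below is a hypothetical stand-in for the real notebook variables (`X`, `Y`, `N_train`): a random-walk "index" driven by three noisy "constituent" series, fitted with plain least squares via `numpy.linalg.lstsq`, and evaluated with absolute (rather than relative) RMSE for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: three random-walk
# "constituents" and an "index" that is their noisy weighted sum.
T = 1000
drivers = np.cumsum(rng.normal(size=(T, 3)), axis=0)
index = drivers @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=T)

def rmse_for_horizon(n, n_train=700):
    """Fit OLS mapping current drivers to the index n steps ahead
    and return the RMSE on the held-out tail."""
    Xtr, ytr = drivers[:n_train], index[n-1:n_train+n-1]
    Xte, yte = drivers[n_train:T-n+1], index[n_train+n-1:]
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return np.sqrt(np.mean((yte - Xte @ beta)**2))

# The error typically grows with the horizon, because the drivers
# themselves wander away from their current values over n steps.
print(rmse_for_horizon(1), rmse_for_horizon(10))
```

The same alignment of slices (`n-1` offset between inputs and targets) is used here as in the notebook code above.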

```python
# Visualize the statistical distribution of
# the prediction error (histplot replaces the
# deprecated sns.distplot)
sns.histplot(Etest_nsa.dropna(), kde=True);
```

The error distribution shows clear signs of grouping into (at least) three distinct clusters: another sign of underlying structure not captured by the model.

```python
# Visualize the error autocorrelation function.
pd.plotting.autocorrelation_plot(Etest_nsa.dropna());
```

The autocorrelation demonstrates significant temporal dependencies between error values. This seems to indicate that a dynamical model is needed.
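Residual autocorrelation can also be quantified with a single number, the Durbin-Watson statistic (statsmodels exposes it as `statsmodels.stats.stattools.durbin_watson`, but it is easy to compute directly). The sketch below uses synthetic AR(1) errors as a hypothetical stand-in for `Etest_nsa`:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic residuals: strongly autocorrelated AR(1) errors
# versus uncorrelated white noise.
T = 500
e_ar = np.zeros(T)
for t in range(1, T):
    e_ar[t] = 0.9 * e_ar[t-1] + rng.normal()
e_white = rng.normal(size=T)

def durbin_watson(e):
    """Durbin-Watson statistic: close to 2 for uncorrelated errors,
    toward 0 for strong positive autocorrelation."""
    return np.sum(np.diff(e)**2) / np.sum(e**2)

print(durbin_watson(e_ar), durbin_watson(e_white))
```

A value far below 2 for the model's residuals would confirm the positive autocorrelation visible in the plot.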

### Interactive analysis

```python
# IPython interactivity and widgets
from ipywidgets import interactive
from IPython.display import display
```

```python
# Function which performs all steps of the
# learning procedure for the n-steps-ahead
# prediction problem, including:
# * data selection,
# * model training, and
# * model testing,
# using a reduced data set (the first 10% of the
# original training data is used).
# The function will later be used to perform
# interactive fitting of the n-steps-ahead
# prediction for various values of n.
def build_and_evaluate_model(n):
    NN_train = int(0.1*N_train)   # Reduced number of training samples
    NN_test = int(0.25*NN_train)  # Reduced number of test samples

    # Training data for the n-step ahead prediction
    Xtrain = X.iloc[0:NN_train, :]        # Input columns
    Ytrain = Y.iloc[n-1:NN_train+n-1]

    # Test data for the n-step ahead prediction
    Xtest = X.iloc[NN_train:NN_train+NN_test, :]  # Input columns
    Ytest = Y.iloc[NN_train+n-1:NN_train+NN_test+n-1]

    # Train the model.
    model = sm.OLS(Ytrain.values, Xtrain.values).fit()

    # Predict values on the test set.
    Yhat_test = model.predict(Xtest)

    # Compare actual and predicted results
    result = pd.DataFrame()
    result['actual data'] = Ytest
    result['model prediction'] = Yhat_test
    result.plot()
    plt.legend(prop={'size': 16})
    plt.title(f'n = {n}', fontsize=16)
    plt.show()

y = interactive(build_and_evaluate_model, n=(1, 50, 1))
display(y)
```

Interactive analysis on the reduced data set reveals several interesting points.

The results become worse and worse as the prediction horizon grows, which is expected.

The second point reveals the necessity of data normalization, if not for numerical purposes then for evaluation purposes. The plot obtained on the reduced set reveals that even for the 1-step-ahead prediction the results are not that "perfect", in the sense that there is a visible difference between the two curves.
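A minimal normalization sketch, on hypothetical raw inputs (the real notebook would apply the same idea to `X` and `Y`): z-score each column, fitting the mean and standard deviation on the training split only and reusing them on the test split to avoid look-ahead bias.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical raw inputs on very different scales.
X_raw = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 1e4])

# Fit normalization statistics on the training split only...
mu = X_raw[:80].mean(axis=0)
sigma = X_raw[:80].std(axis=0)

# ...then apply them to all rows (train and test alike).
X_norm = (X_raw - mu) / sigma

# The training split now has zero mean and unit variance per column.
print(X_norm[:80].mean(axis=0).round(6), X_norm[:80].std(axis=0).round(6))
```

The same effect can be obtained with scikit-learn's `StandardScaler` (`fit` on the training split, `transform` on both splits).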