This post illustrates the basic procedure for performing one-step-ahead prediction of the S&P500 index from the prices of its constituents, using feed-forward neural networks implemented in `TensorFlow` through the `Keras` API.

`TensorFlow` is an open source library for dataflow programming developed by the **Google Brain Team** using `C++`, `CUDA` and `Python`. In-depth information and tutorials can be found at their official site. `Keras` is an open source neural network library written in Python, which can run on top of several backend engines, including `TensorFlow` (but also `Theano`, which is no longer actively developed, and `Microsoft Cognitive Toolkit`).

S&P500 is a financial index, essentially a weighted average, computed based on the market capitalization of 500 companies with stock listed on NYSE and NASDAQ. The problem considered here is to predict the S&P500 value at some future time instant, based on the known values of its constituents at the current time instant.

Assuming that stock data is sampled at some fixed time interval, the problem can be formally stated as identifying an unknown mapping \( f \) such that:

\[ \mathrm{SP500}_n = f(c_1, c_2, \ldots, c_{500}) \]

where \( \mathrm{SP500}_n \) is the value of the S&P500 index \( n \) steps ahead into the future, and \( c_1 \) to \( c_{500} \) are the values of the individual constituents (market capitalization of the relevant companies) at the current time instant.

Depending on the type of prediction algorithm, \( f \) can take various forms. In the present article we are interested in predictors implemented by means of **multilayer perceptrons**, that is, multilayer, non-recurrent, feed-forward neural networks.

The illustration focuses on the steps involved in the design, training and use of nonlinear network models using the `Keras` API backed by `TensorFlow`. A detailed analysis of the financial aspects of the problem, which are quite complex, is not performed here.

### Loading Necessary Python Packages

First, we must load the appropriate `Python` packages. `NumPy` has become a *de facto* standard for numerical computation in `Python`, while `Pandas` has become the core data analysis toolkit.

Next, we must load visualization packages. `Matplotlib` is a `MATLAB`-inspired `Python` visualization library. `Seaborn` is an extension of `Matplotlib` focusing on statistical data visualization.

Finally, we load `TensorFlow` and `Keras`.

    import numpy as np
    import pandas as pd

    import matplotlib.pyplot as plt
    import pandas.plotting as pdplt
    import seaborn as sns
    sns.set()

    import tensorflow as tf
    from tensorflow import keras

### Loading Data

Data is loaded from a `CSV` file on disk. The data file is the same one used in this article.

It is important to stress that the data has been aligned in a special way, so that the SP500 entry in any given row is already one step ahead of the constituent values in the same row. In other words, the data is already prepared for one-step-ahead prediction.
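The file used here already comes aligned, so nothing needs to be done. Purely as an illustration of what such an alignment could look like, here is a minimal sketch on a made-up table. The column names `STOCK_A` and `STOCK_B` and all values are hypothetical, and this is not necessarily how the original file was prepared.

```python
import pandas as pd

# Hypothetical toy data: an index column and two constituents,
# sampled at four consecutive time instants.
raw = pd.DataFrame({
    'SP500':   [100.0, 101.0, 102.0, 103.0],
    'STOCK_A': [10.0, 11.0, 12.0, 13.0],
    'STOCK_B': [20.0, 21.0, 22.0, 23.0],
})

aligned = raw.copy()
# Move the index value from the NEXT time instant into the current row.
aligned['SP500'] = raw['SP500'].shift(-1)
# The last row has no "next" index value, so it is dropped.
aligned = aligned.dropna()

print(aligned)
```

After the shift, each row pairs the current constituent values with the index value one step later, which is exactly the regressor/target pairing needed for one-step-ahead prediction.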

    data = pd.read_csv('./data/data_stocks.csv')
    data.head()

    data['SP500'].plot();

### One-Step-Ahead Prediction

#### Preprocessing

First, all data will be separated into two sets: **training data** and **test data**. In this particular case, the data will be decimated so that every tenth sample is used for training and all remaining samples are used for testing. **So, only 10% of the available data will be included in the training set, all other data will be used for testing only!**

    decimation = 10

    # Every tenth sample goes to the training set
    data_train = data.iloc[::decimation, :]

    # All remaining samples go to the test set
    test_mask = np.ones(len(data), dtype=bool)
    test_mask[::decimation] = False
    data_test = data.iloc[test_mask, :]

In the following step, all data will be scaled. **Scaling** is a common preprocessing step in most regression and classification problems. It is beneficial for numerous reasons, including the fact that it makes tuning of the hyperparameters much easier.

When performing scaling it is paramount to remember that the scaling parameters are computed based on the **training data only**! Later, the same scaling is applied to the test data, as well as to any data encountered when the model is put to use.

In the present case we use a very simple **0-1** scaling: all data will be scaled so that the minimal value encountered along each dimension of input and output maps to 0, while the maximal one maps to 1.

    # Compute the per-column minimum and maximum of
    # the TRAINING data
    m = data_train.min(axis=0)
    M = data_train.max(axis=0)

    # Scale both the training and the test data
    data_train_scaled = (data_train - m) / (M - m)
    data_test_scaled = (data_test - m) / (M - m)

    # Build training and test regressors and targets
    X_train = data_train_scaled.iloc[:, 1:]
    Y_train = data_train_scaled.iloc[:, 0]

    X_test = data_test_scaled.iloc[:, 1:]
    Y_test = data_test_scaled.iloc[:, 0]
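One consequence of taking the scaling parameters from the training data only is worth seeing on a small example: the scaled training data spans exactly the interval from 0 to 1, but the scaled test data does not have to. The sketch below uses a single made-up column, and all numbers are hypothetical.

```python
import numpy as np

# Hypothetical values of a single column.
train = np.array([10.0, 20.0, 30.0])
test = np.array([5.0, 25.0, 40.0])

# Scaling parameters come from the training data only.
lo, hi = train.min(), train.max()

train_scaled = (train - lo) / (hi - lo)
test_scaled = (test - lo) / (hi - lo)

print(train_scaled)  # spans exactly [0, 1]
print(test_scaled)   # may fall below 0 or above 1
```

Values outside the 0-1 range in the test set are expected and are not a sign of an error in the preprocessing.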

#### Define network architecture

In order to build, train and test models, the `Keras` interface of `TensorFlow` will be used. The interface is straightforward and easy to use. It hides most of the low-level details and specifics of `TensorFlow`.

The network consists of four hidden layers with 1024, 512, 256 and 128 neurons, followed by a single output neuron. *Powers of 2 are chosen for no particular reason.*

All hidden layers are conventional **dense** layers (meaning that every node of the input/previous layer is connected to every node of the current layer) with the commonly used **rectified linear** activation function. A neuron with such an activation function is usually called a **rectified linear unit**, often abbreviated as **ReLU**.

    model = keras.Sequential([
        keras.layers.Dense(1024, activation=tf.nn.relu, input_shape=(500,)),
        keras.layers.Dense(512, activation=tf.nn.relu),
        keras.layers.Dense(256, activation=tf.nn.relu),
        keras.layers.Dense(128, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])

    model.summary()
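To make the notion of a dense ReLU layer concrete, here is a minimal `NumPy` sketch of what a single such layer computes: an affine map followed by an element-wise `max(0, ·)`. The helper `dense_relu` and the tiny weights are made up for illustration; this is not how `Keras` implements the layer internally.

```python
import numpy as np

def dense_relu(x, W, b):
    """One dense layer: affine map followed by element-wise ReLU."""
    return np.maximum(0.0, x @ W + b)

# One sample with two inputs, mapped to three neurons (hypothetical weights).
x = np.array([[1.0, -2.0]])
W = np.array([[1.0, -1.0, 0.5],
              [0.5,  1.0, 1.0]])
b = np.array([1.0, 0.0, 2.0])

print(dense_relu(x, W, b))  # negative pre-activations are clipped to 0
```

Every input contributes to every output through `W`, which is what "dense" refers to, and any neuron whose weighted sum is negative outputs zero.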

An interesting observation that can be made here is that there would be only 501 trainable parameters if one attempted to solve this problem using simple linear regression. The proposed network architecture has more than 1.2 million tunable parameters.
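Both figures can be checked by hand. A dense layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights plus `n_out` biases. The short sketch below applies this to the layer sizes of the model above; the helper name `dense_params` is made up.

```python
# Layer widths: 500 inputs, four hidden layers, one output.
layer_sizes = [500, 1024, 512, 256, 128, 1]

def dense_params(n_in, n_out):
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b)
             for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

# Linear regression is a single dense layer from 500 inputs to 1 output.
linear = dense_params(500, 1)

print(per_layer)  # [513024, 524800, 131328, 32896, 129]
print(total)      # 1202177
print(linear)     # 501
```

The total should agree with the number of trainable parameters reported by `model.summary()`.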

Knowing that linear regression could easily solve the problem at hand, it is clear that the network is severely overparameterized. The rest of this notebook, however, shows that this is not actually a problem for modern deep learning frameworks.
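To see why linear regression is a natural baseline, consider an idealized setting in which the index is an exact weighted sum of its constituents. The sketch below uses synthetic data with only 5 made-up "stocks" rather than 500, and recovers the weights with ordinary least squares. It is an illustration under that idealized assumption, not an experiment on the actual data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, much smaller problem than the real one.
n_samples, n_stocks = 200, 5
true_w = rng.uniform(0.1, 1.0, n_stocks)
C = rng.uniform(10.0, 100.0, (n_samples, n_stocks))

# Idealized index: weighted sum of constituents plus a constant offset.
index = C @ true_w + 3.0

# Least squares with an extra column of ones for the intercept,
# giving n_stocks + 1 parameters in total.
A = np.hstack([C, np.ones((n_samples, 1))])
coef, *_ = np.linalg.lstsq(A, index, rcond=None)

print(coef[:-1])  # estimated weights
print(coef[-1])   # estimated offset
```

With 500 constituents the same construction has 500 weights and one offset, which is where the figure of 501 parameters comes from.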

The structure of the underlying TensorFlow computational graph is shown in the image below.

#### Select optimizer (learning algorithm) and loss function

First, we select an appropriate optimization algorithm. In this particular case, we choose the **adaptive moment estimation** **(Adam)** algorithm, as described here.

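    # keras.optimizers.Adam is also available in current TensorFlow
    # releases, where tf.train.AdamOptimizer no longer exists
    optimizer = keras.optimizers.Adam()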
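For readers curious about what the optimizer does at each step, here is a minimal `NumPy` sketch of the Adam update rule as it is commonly stated: running averages of the gradient and of its square, bias correction, and a per-parameter scaled step. The function `adam_minimize`, the toy objective and the learning rate of 0.1 are assumptions made for this illustration; this is not the `TensorFlow` implementation.

```python
import numpy as np

def adam_minimize(grad, theta, steps=500, lr=0.1,
                  b1=0.9, b2=0.999, eps=1e-8):
    """Minimize a function given its gradient, using Adam updates."""
    m = np.zeros_like(theta)  # running average of gradients
    v = np.zeros_like(theta)  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        # Bias correction compensates for the zero initialization.
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy objective: f(theta) = sum((theta - 3)^2), gradient 2 * (theta - 3).
theta = adam_minimize(lambda th: 2.0 * (th - 3.0), np.array([0.0, 10.0]))
print(theta)  # both components should end up close to 3
```

Because each parameter is scaled by its own gradient history, the two components approach the minimum at a similar pace even though they start at different distances from it.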

The Keras `compile` method configures the learning process. It must be invoked before training. The configuration involves selecting three significant items, each of which can be specified either as a predefined string literal or as an appropriate object:

- loss function: in this case, mean squared error
- optimization algorithm: in this case, the previously defined Adam optimizer
- performance metrics: in this case, mean absolute error

It is important to notice that the loss function is actually used within the optimization process, while metrics are not. Metrics are used only to evaluate model performance. In a way, they are alternative measures of model quality on which we do not want to perform optimization (because they are non-smooth, computationally expensive, etc.).

    model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
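The two quantities passed to `compile` are easy to compute by hand. The sketch below evaluates the mean squared error (the loss) and the mean absolute error (the metric) on a handful of made-up targets and predictions.

```python
import numpy as np

# Hypothetical scaled targets and predictions.
y_true = np.array([0.0, 0.5, 1.0, 1.0])
y_pred = np.array([0.1, 0.5, 0.7, 1.2])

err = y_pred - y_true

mse = np.mean(err**2)       # loss: penalizes large errors quadratically
mae = np.mean(np.abs(err))  # metric: average error magnitude

print(mse)  # approximately 0.035
print(mae)  # approximately 0.15
```

Note that the absolute value is not differentiable at zero, which is one example of the non-smoothness mentioned above.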

#### Train the model

Before initiating the training procedure, let us define some **callbacks**. Callbacks are objects whose methods are invoked at specific points during training, for example at the end of each epoch. In particular, we want to have some indication of training progress (like a primitive progress bar). Other callbacks can be used as well; for example, it is possible to log training information so that it can be visualized later using `TensorBoard`.

    from IPython.display import clear_output

    # Poor man's progress bar 🙂
    class PrintDot(keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs=None):
            # Replace the previous output with the current epoch number
            clear_output()
            print(f'EPOCH: {epoch}', end='')

Finally, we can initiate the training procedure. There are several notions to be defined here, which are common in any model training procedure.

The number of epochs is the number of times the entire data set is used in training, while the batch size is the number of training points used in a single parameter update. Simply stated, the entire training set is split into batches of the given size, and the parameters are updated once for each batch! Once all batches are processed, we say that the epoch is over. Learning stops after a predefined number of epochs.

    EPOCHS = 40
    BATCH_SIZE = 256

    # Store training stats
    history = model.fit(X_train, Y_train,
                        epochs=EPOCHS, batch_size=BATCH_SIZE,
                        shuffle=True,
                        validation_data=(X_test, Y_test),
                        verbose=0,
                        callbacks=[PrintDot()])
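The relation between epochs, batch size and the number of parameter updates is simple arithmetic. The sketch below works it out for the settings used above and a hypothetical training set of 4000 samples; the actual training set size depends on the data file.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches per epoch; the last batch may be smaller."""
    return math.ceil(n_samples / batch_size)

n_train = 4000  # hypothetical training set size
epochs, batch_size = 40, 256

per_epoch = updates_per_epoch(n_train, batch_size)
total_updates = per_epoch * epochs

print(per_epoch)      # 16
print(total_updates)  # 640
```

A smaller batch size gives more, but noisier, updates per epoch; a larger one gives fewer and smoother updates.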

Let us visualize the performance of the model over time. We see that the mean absolute error is more or less monotonic on the training set, while on the test set it behaves similarly but fluctuates somewhat more.

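    # Keras reports the metric as 'mae' or 'mean_absolute_error',
    # depending on the version
    mae_key = 'mae' if 'mae' in history.history else 'mean_absolute_error'

    plt.figure()
    plt.xlabel('Epoch')
    plt.ylabel('Mean Abs Error')
    plt.plot(history.epoch, history.history[mae_key], label='Training MAE')
    plt.plot(history.epoch, history.history['val_' + mae_key], label='Validation MAE')
    plt.legend(fontsize=16);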

#### Evaluate the model

Model performance will be evaluated by making predictions using the test data set. It cannot be emphasized enough how important it is not to use this data for training. Essentially, **the network was never trained on this data**; during training it served only to monitor the validation error.

    Yhat_test = model.predict(X_test).flatten()

Finally, we compare the actual data with the obtained predictions.

    result_test = pd.DataFrame()
    result_test['actual S&P500 data'] = Y_test
    result_test['NN model prediction'] = Yhat_test

    result_test.plot();
    plt.legend(prop={'size': 16});

It seems obvious from the figure that the network is capable of performing reasonably accurate predictions – the two lines almost coincide!
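Keep in mind that both curves, as well as the reported errors, are expressed in scaled units. To interpret them in index points, the 0-1 scaling of the target column has to be inverted. The sketch below shows one way this could be done; the helper `unscale` and all numbers, including the assumed minimum and maximum of the training target, are hypothetical.

```python
import numpy as np

def unscale(y_scaled, lo, hi):
    """Invert 0-1 scaling given the training minimum and maximum."""
    return y_scaled * (hi - lo) + lo

# Hypothetical minimum and maximum of the training target.
lo, hi = 2300.0, 2500.0

# Hypothetical scaled targets and predictions.
y_scaled = np.array([0.0, 0.25, 1.0])
yhat_scaled = np.array([0.01, 0.24, 0.98])

y = unscale(y_scaled, lo, hi)
yhat = unscale(yhat_scaled, lo, hi)

mae_scaled = np.mean(np.abs(yhat_scaled - y_scaled))
mae_points = np.mean(np.abs(yhat - y))

print(y)           # targets in index points
print(mae_scaled)  # error in scaled units
print(mae_points)  # the same error in index points
```

Since the scaling is affine, the mean absolute error in index points is simply the scaled error multiplied by `hi - lo`.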