
In previous posts we proposed basic procedures necessary for performing one-step-ahead and multi-step-ahead prediction of S&P500 index based on prices of its constituents. In this post we will focus on accuracy and performance increase using dimension reduction techniques.

The illustration is focused on various methods for reducing input dimensionality, and decorrelating input vectors in general. Detailed analysis of financial aspects of the problem, which are quite complex, is not performed here.

The following problem will be considered:

Find a subset of the constituents of the S&P500 index which is sufficient for successful prediction of its value in some consecutive time instant. Alternatively, find some low-dimensional transformation of S&P500 constituents based on which a sufficient prediction can be performed.

Scaling

First, all data will be separated into two sets: training data and test data. In this particular case, the data will be decimated so that 10% of the data will be used for training, while the rest will be used for testing.

decimation = 10
# Every 10th row is used for training ...
data_train = data.iloc[::decimation, :]
# ... and all remaining rows for testing
data_test = data.drop(data_train.index)

In the following step, all data will be scaled. Scaling is a common preprocessing step in most regression and classification problems. It is beneficial for numerous reasons, including the fact that it makes tuning of the meta-parameters much easier.

By means of min–max scaling, each column of the training data set is mapped to the interval ​$$[0,1]$$​. Note that Principal Component Analysis (PCA) strictly assumes zero-mean data, so, in addition to scaling, the empirical mean of each column should be subtracted before applying PCA.

When performing scaling it is paramount to remember that the scaling parameters are computed from the training data only! The same scaling is later applied to the test data, as well as to any data encountered in production.

m = data_train.min(axis=0)
M = data_train.max(axis=0)

# Scale both the training and the test data
data_train_scaled = (data_train - m) / (M-m)
data_test_scaled = (data_test - m) / (M-m)

# Build training and test regressor and target
X_train = data_train_scaled.iloc[:, 1:]
Y_train = data_train_scaled.iloc[:, 0]
X_test = data_test_scaled.iloc[:, 1:]
Y_test = data_test_scaled.iloc[:, 0]

Principal Component Analysis

Principal components of a data set are the right singular vectors of the corresponding data matrix (equivalently, the eigenvectors of its covariance matrix).

A detailed introduction to PCA, with emphasis on Python implementation, can be found in numerous online sources.

import numpy as np

# Note: np.linalg.svd returns the right singular vectors transposed;
# full_matrices=False avoids computing the full (and very large) U
(U, S, V) = np.linalg.svd(X_train, full_matrices=False)
W = V.T

In this way, U contains the left singular vectors and V contains the right singular vectors, while the corresponding singular values are stored in S. Since the singular vectors are stored in the rows of V, it is more convenient to transpose it: the matrix W has the right singular vectors of the data matrix X as its columns.
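As a quick sanity check, the properties of the SVD factors can be verified numerically. The sketch below uses a small synthetic matrix as a stand-in for `X_train`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the scaled training data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt.T

# The columns of W are orthonormal: W^T W equals the identity matrix
print(np.allclose(W.T @ W, np.eye(5)))   # True
# The three factors together reconstruct the original matrix
print(np.allclose((U * S) @ Vt, X))      # True
```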

PCA has an interesting, easy-to-understand interpretation: it is essentially nothing more than a coordinate transformation. Instead of the original coordinate system, specified by the input data, PCA introduces a new coordinate system in which the total variation of the data is concentrated within the first few components. In other words, instead of the original constituents of the S&P500 index, PCA introduces linear combinations of those constituents as new coordinate axes.

In practice, usually, only a given percentage of the new coordinates are used for further processing. In this way, it is possible to obtain dimensionality reduction, data compression, decorrelation, denoising, etc.

Each singular vector of the data matrix represents a single new component, while the corresponding singular value is indicative of the importance of this component. Relative importance can be obtained by dividing each singular value by the sum of all singular values.

import matplotlib.pyplot as plt

s = np.sum(S)
normed_eigv = S/s
normed_cumsum = np.cumsum(S)/s
plt.plot(normed_eigv, label='singular values')
plt.plot(normed_cumsum, label='cumulative sum')
plt.legend(fontsize=16);

We see that a large share of the variation in the data is due to just the first few singular vectors.

tol = 0.9
cut_off_index = np.argmax(normed_cumsum > tol)
print(f'Cut-off index for tolerance {tol*100}% is {cut_off_index}')

Cut-off index for tolerance 90.0% is 126

It can be seen that 99% of the information contained within the data is captured by 411 singular components, that is, by 82.2% of the input dimensions. In fact, if we are willing to go as low as 90% of the information, we could retain only 126 singular vectors, which is just 25.2% of the inputs (in this case).

It is easy to graphically visualize the dependence between the tolerance (the amount of information to keep) and the cut off index (the amount of input data necessary).

tol_array = np.arange(0.5, 1.0, 0.01)
cut_off_array = [np.argmax(normed_cumsum > tol)
                 for tol in tol_array]
plt.plot(tol_array, cut_off_array);
plt.title('Cut off index as a function of tolerance', fontsize=16);

It is easy to retain only the desired information within the data set. In fact, in order to transform a data vector from the original input coordinate system to the coordinate system of singular components, it is enough to multiply it by the transpose of the matrix whose columns are the right singular vectors.

$\mathbf{t} = \mathbf{W}^{T} \mathbf{x}$

In the formula above, ​$$x$$​ is the original data vector, while ​$$t$$​ represents the same data in the new coordinates. The entire data set can be transformed in a similar way, using a single matrix multiplication formula

$\mathbf{T} = \mathbf{X} \mathbf{W}$

Now, the rows of ​$$X$$​ represent individual data points, while the rows of ​$$T$$​ are the corresponding data points in the transformed coordinates.

However, if we would like to perform dimensionality reduction (compression), then we should use only some given number of leading principal components (singular vectors). In other words, only the given number of leading columns of ​$$W$$​ should be used during transformation.

# PCA transformation matrix with only columns
# up to cut off index preserved.
Wt = W[:,:cut_off_index]
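Truncating W in this way yields a low-rank approximation of the data: projecting with the truncated matrix and mapping back with its transpose recovers the data up to a small reconstruction error. A minimal sketch, using synthetic low-rank data as a stand-in for the scaled inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Rank-3 data plus a little noise, standing in for the scaled data set
base = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
X = base + 0.01 * rng.normal(size=(200, 10))

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Wt = Vt.T[:, :3]          # keep only the 3 leading components

T = X @ Wt                # compress: 10 columns -> 3
X_hat = T @ Wt.T          # map back to the original coordinates

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f'relative reconstruction error: {rel_err:.4f}')  # small (noise only)
```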

We now apply the same transformation to both the training and the test data. As usual, the transformation is computed from the training data only! It is an error to compute the transform on the entire data set: the data used in production is not available during training, so tests must be performed as if the test data were seen for the first time. Obviously, no manipulations based on the test data are allowed during training.

T_train = np.matmul(X_train, Wt)
T_test = np.matmul(X_test, Wt)

After transformation, the data set has the same number of rows (samples), but a reduced number of columns (independent variables). In addition, all variables have been effectively decorrelated.
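The decorrelation claim can be checked directly: on zero-mean data, the covariance matrix of the transformed variables is diagonal. A small synthetic sketch (the mixing matrix below is arbitrary, chosen only to correlate the columns):

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated columns standing in for the scaled inputs
mix = np.array([[1.0, 0.8, 0.0, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.3],
                [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(500, 4)) @ mix
X = X - X.mean(axis=0)    # PCA assumes zero-mean data

U, S, Vt = np.linalg.svd(X, full_matrices=False)
T = X @ Vt.T              # transform to the new coordinates

# Off-diagonal covariances of T vanish: the columns are decorrelated
C = np.cov(T, rowvar=False)
off_diag = C - np.diag(np.diag(C))
print(np.allclose(off_diag, 0.0, atol=1e-8))  # True
```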

T_train.shape

# Training data for the one-step ahead prediction
Xtrain_1sa = T_train
Ytrain_1sa = Y_train

# Test data for the one-step-ahead prediction
Xtest_1sa = T_test
Ytest_1sa = Y_test

TBC...
