**In previous posts we proposed the basic procedures necessary for performing one-step-ahead and multi-step-ahead prediction of the S&P500 index based on the prices of its constituents. In this post we focus on improving accuracy and performance using dimension reduction techniques.**

The illustration is focused on various methods for reducing input dimensionality, and for decorrelating input vectors in general. A detailed analysis of the financial aspects of the problem, which are quite complex, is not performed here.

The following problem will be considered:

**Find a subset of the constituents of the S&P500 index which is sufficient for successful prediction of its value at some subsequent time instant. Alternatively, find some low-dimensional transformation of the S&P500 constituents based on which a sufficient prediction can be performed.**

## Separating Data into Training and Test

First, all data will be separated into two sets: **training data** and **test data**. In this particular case, the data will be decimated so that 10% of it is used for training, while the rest is used for testing.

```python
decimation = 10
data_train = data.iloc[::decimation, :]

test_mask = np.ones(len(data), dtype=bool)
test_mask[::decimation] = False
data_test = data.iloc[test_mask, :]
```

## Scaling

In the following step, all data will be scaled. **Scaling** is a common preprocessing step in most regression and classification problems. It is beneficial for numerous reasons; among other things, it makes tuning of the meta-parameters much easier.

By means of min–max scaling, each column of the training data set is mapped into the interval \( [0, 1] \). Note that, strictly speaking, **Principal Component Analysis (PCA)** assumes data whose **empirical mean** is \( 0 \) in each column, so it is common to additionally subtract the column means before applying it.

When performing scaling it is paramount to remember that the scaling parameters are computed based on the *training data* **only**! Later, the same scaling is applied to the test data, as well as to any data encountered in production.

```python
m = data_train.min(axis=0)
M = data_train.max(axis=0)

# Scale both the training and the test data
data_train_scaled = (data_train - m) / (M - m)
data_test_scaled = (data_test - m) / (M - m)

# Build training and test regressor and target
X_train = data_train_scaled.iloc[:, 1:]
Y_train = data_train_scaled.iloc[:, 0]
X_test = data_test_scaled.iloc[:, 1:]
Y_test = data_test_scaled.iloc[:, 0]
```

### Principal Component Analysis

**Principal components** of a data set are the **right singular vectors** of the corresponding (centered) data matrix or, equivalently, the eigenvectors of its covariance matrix.
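As a quick sanity check of this relationship, the sketch below (on synthetic data, not the S&P500 set) compares the right singular vectors of a centered data matrix with the eigenvectors of its covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))   # synthetic data: 200 samples, 5 features
Xc = X - X.mean(axis=0)         # center each column

# Right singular vectors of the centered data matrix (rows of Vh)
_, _, Vh = np.linalg.svd(Xc, full_matrices=False)

# Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
cov = np.cov(Xc, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
eigvec = eigvec[:, np.argsort(eigval)[::-1]]

# The two sets of directions agree up to sign
for k in range(5):
    assert np.allclose(np.abs(Vh[k]), np.abs(eigvec[:, k]), atol=1e-8)
```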

A detailed introduction to PCA, with emphasis on `Python` implementation, can be found in numerous online sources.

```python
# Singular value decomposition of the scaled training data
U, S, V = np.linalg.svd(X_train)
W = V.T
```

In this way, `U` contains the **left singular vectors**, `V` contains the **right singular vectors**, while the corresponding **singular values** are stored in `S`. Since the singular vectors are stored in the rows of `V`, it is more convenient to transpose it: the matrix `W` has the right singular vectors of the data matrix `X` as its columns.

PCA has an interesting, easy-to-understand interpretation: it is essentially nothing more than a coordinate transformation. Instead of the original coordinate system, specified by the input data, PCA introduces a new coordinate system such that the total variation of the data is concentrated within the first few components. In other words, instead of the original constituents of the S&P500 index, PCA introduces linear combinations of those constituents as the new coordinate axes.

In practice, usually, only a given percentage of the new coordinates are used for further processing. In this way, it is possible to obtain dimensionality reduction, data compression, decorrelation, denoising, etc.
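As a minimal illustration of the compression idea (on synthetic low-rank data, not the S&P500 set), the sketch below keeps only the first few components and measures the resulting reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 100 samples, 20 features, true rank 3, plus small noise
base = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
X = base + 0.01 * rng.normal(size=(100, 20))
X = X - X.mean(axis=0)

_, S, Vh = np.linalg.svd(X, full_matrices=False)
W = Vh.T

k = 3                       # keep only the first k principal components
T = X @ W[:, :k]            # compressed representation (100 x 3)
X_rec = T @ W[:, :k].T      # back-projection into the original space

rel_err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
print(f'Relative reconstruction error with {k} components: {rel_err:.4f}')
```

Since the data is essentially rank 3, three components reconstruct it almost perfectly; the residual is just the added noise.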

Each **singular vector** of the data matrix represents a single new component, while the corresponding **singular value** is indicative of the importance of this component. Relative importance can be obtained by dividing each singular value by the sum of all singular values.

```python
s = np.sum(S)
normed_eigv = S / s
normed_cumsum = np.cumsum(S) / s

plt.plot(normed_eigv, label='singular values')
plt.plot(normed_cumsum, label='cumulative sum')
plt.legend(fontsize=16);
```

We see that most of the variation in the data is due to just the first few singular components.

```python
tol = 0.9
cut_off_index = np.argmax(normed_cumsum > tol)
print(f'Cut-off index for tolerance {tol*100}% is {cut_off_index}')
```

It can be seen that 99% of the information contained within the data is captured by 411 singular components, that is, by 82.2% of the input dimensions. In fact, if we are willing to go as low as 90% of the information, we can retain only 126 singular vectors, which is just 25.2% of the data (in this case).

It is easy to visualize the dependence between the tolerance (the amount of information to keep) and the cut-off index (the number of input components necessary).

```python
tol_array = np.arange(0.5, 1.0, 0.01)
cut_off_array = [np.argmax(normed_cumsum > tol) for tol in tol_array]

plt.plot(tol_array, cut_off_array);
plt.title('Cut-off index as a function of tolerance', fontsize=16);
```

It is easy to retain only the desired information within the data set. In fact, in order to transform any data vector from the original **input coordinate system** to the **coordinate system of singular components**, it is enough to multiply it by the transpose of the matrix whose columns are the right singular vectors:

\[ \mathbf{t} = \mathbf{W}^{T} \mathbf{x} \]

In the formula above, \( \mathbf{x} \) is the original data vector, while \( \mathbf{t} \) represents the same data in the new coordinates. The entire data set can be transformed in a similar way, using a single matrix multiplication formula

\[ \mathbf{T} = \mathbf{X} \mathbf{W} \]

Now, the rows of \( \mathbf{X} \) represent individual data points, while the rows of \( \mathbf{T} \) are the corresponding data points in the transformed coordinates.
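The single-vector and whole-matrix transforms can be checked numerically on synthetic data. With the columns of `W` holding the right singular vectors, a single data point stored as a row vector is transformed by `x @ W`, which agrees with transforming the whole matrix at once:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))    # synthetic data: 50 samples, 4 features
X = X - X.mean(axis=0)

_, _, Vh = np.linalg.svd(X, full_matrices=False)
W = Vh.T

# Whole data set at once: rows of T are the transformed rows of X
T = X @ W

# Single data point: equivalent to W.T @ x for a column vector x
x = X[0]
t = x @ W
assert np.allclose(t, T[0])
```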

However, if we would like to perform dimensionality reduction (compression), then we should use only a given number of leading principal components (singular vectors). In other words, only that number of leading columns of \( W \) should be used during the transformation.

```python
# PCA transformation matrix with only the columns
# up to the cut-off index preserved
Wt = W[:, :cut_off_index]
```

We now use the same transformation for both the training and the test data. As usual, the transformation is computed based on the training data only! **It is an error to compute the transform on the entire data set!** The data encountered in production is not available during training; therefore, tests must be performed as if the test data were being seen for the first time. Obviously, no manipulations based on the test data are allowed during training.

```python
T_train = np.matmul(X_train, Wt)
T_test = np.matmul(X_test, Wt)
```

After transformation, the data set has the same number of rows (samples), but a reduced number of columns (independent variables). In addition, all variables have been effectively decorrelated.
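The decorrelation claim can be checked directly. The sketch below (on strongly correlated synthetic data, not the S&P500 set) confirms that the sample covariance matrix of the transformed data is diagonal up to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(3)
# Strongly correlated synthetic data: 300 samples, 6 features
A = rng.normal(size=(6, 6))
X = rng.normal(size=(300, 6)) @ A
X = X - X.mean(axis=0)

_, _, Vh = np.linalg.svd(X, full_matrices=False)
T = X @ Vh.T                # transform into principal components

cov = np.cov(T, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))

# Off-diagonal covariances of the transformed data are (numerically) zero
assert np.max(np.abs(off_diag)) < 1e-8
```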

```python
T_train.shape
```

```python
# Training data for the one-step-ahead prediction
Xtrain_1sa = T_train
Ytrain_1sa = Y_train

# Test data for the one-step-ahead prediction
Xtest_1sa = T_test
Ytest_1sa = Y_test
```

TBC...