A PyTorch Example to Use RNN for Financial Prediction

04 Nov 2017 | Chandler

While deep learning has driven fundamental progress in natural language processing and image processing, one pertinent question is whether the technique can also beat established models in classical statistics and machine learning, yielding a new state-of-the-art methodology for uncovering the underlying data patterns. One such area is the prediction of financial time series, a notoriously difficult problem given the fickleness of such data. In this blog, I implement one recently proposed model for this problem. For me, this is above all a learning exercise in applying deep learning to real data, and I would like to share the experience with others.

The Model

The model I have implemented is proposed by the paper A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction. The Dual-Stage Attention-Based RNN (a.k.a. DA-RNN) model belongs to the general class of Nonlinear Autoregressive Exogenous (NARX) models, which predict the current value of a time series based on historical values of this series plus the historical values of multiple exogenous time series. A linear counterpart of a NARX model is the ARMA model with exogenous factors.
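In generic notation (my own sketch, not the paper's exact formulation), a NARX model with target series y and exogenous series x predicts

y_T = f(y_1, ..., y_{T-1}, x_1, ..., x_{T-1}) + e_T

where f is a nonlinear function to be learned; restricting f to be linear recovers the ARMA-with-exogenous-factors counterpart.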

There are two important concepts in the DA-RNN model. The first is the popular Recurrent Neural Network (RNN) model, which has enjoyed big success in the NLP area. At a high level, RNN models are powerful at capturing sophisticated dynamic temporal structure in sequential data. RNN models come in many forms, one of which is the Long Short-Term Memory (LSTM) model that is widely applied in language models. The second concept is the attention mechanism. An attention mechanism performs, in effect, feature selection in a dynamic way, so that the model keeps only the most useful information at each temporal stage. Many successful deep learning models nowadays combine an attention mechanism with an RNN, with machine translation among the examples.
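To make the attention idea concrete, here is a minimal, self-contained sketch of additive attention over a set of feature vectors. All names and sizes are my own illustration (written against the current PyTorch API), not part of the DA-RNN code:

import torch
from torch import nn
import torch.nn.functional as F

# Score each of n_features vectors with a small MLP, then combine them
# with softmax weights. Dimensions are arbitrary, for illustration only.
batch_size, n_features, dim = 4, 81, 16
x = torch.randn(batch_size, n_features, dim)

score_layer = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))
scores = score_layer(x).squeeze(-1)                 # batch_size * n_features
weights = F.softmax(scores, dim = 1)                # each row sums to 1
attended = (weights.unsqueeze(-1) * x).sum(dim = 1) # batch_size * dim

The weights are recomputed for every input, which is what makes the feature selection dynamic rather than fixed.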

At a high level, the DA-RNN model consists of two LSTM networks, each with an attention mechanism. The first LSTM network encodes the information in the historical exogenous data, and its attention mechanism performs feature selection to keep the most important exogenous factors. Based on the output of the first LSTM network, the second LSTM network further combines the information from the exogenous data with the historical target time series. The attention mechanism in the second network performs feature selection in the time domain, i.e., it weights the information at different historical time points. The final prediction is therefore based on feature selection in both the exogenous-factor dimension and the time dimension. Unlike classical feature selection in statistical models, DA-RNN selects features dynamically: the weights on factors and time points change over time. An inexact analogy, perhaps, is a regression model with ARMA errors in which both the exogenous factors and the ARMA terms have time-varying coefficients.

PyTorch

My experiment is implemented in PyTorch. Why PyTorch? In my experience, it integrates with Python better than popular alternatives such as TensorFlow and Keras. It also seems to be the only package I have found so far that allows freely interleaving auto-grad calculations with other calculations. This gives great flexibility to someone like me, whose background is primarily in model building rather than engineering:

  1. PyTorch code is easy to debug: one can insert ordinary Python code to peek at intermediate values between individual auto-grad steps;
  2. PyTorch also makes it easy to experiment with ideas by adding calculations between different auto-grad steps. For example, it is straightforward to implement an algorithm that alternates between discrete calculations and auto-grad calculations; a toy illustration follows this list.
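As a toy illustration of point 2 (my own example written against the current PyTorch API, not part of the DA-RNN code), one can drop out of auto-grad into NumPy for a discrete step and then resume auto-grad afterwards:

import numpy as np
import torch

w = torch.randn(5, requires_grad = True)
x = torch.randn(5)

h = w * x                                  # auto-grad step
# Discrete, non-differentiable step: select the top-2 coordinates in NumPy.
top2 = np.argsort(h.detach().numpy())[-2:].copy()
idx = torch.from_numpy(top2)
loss = h[idx].sum()                        # back under auto-grad
loss.backward()
print(w.grad)                              # nonzero only at the selected coordinates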

A PyTorch tutorial for a machine translation model can be found at this link. My implementation is based on that tutorial.

Data

I use the NASDAQ 100 Stock Data mentioned in the DA-RNN paper. Unlike the experiment presented in the paper, which uses the contemporaneous values of the exogenous factors to predict the target variable, I exclude them. In this data set, the exogenous factors are individual stock prices, and the target time series is the NASDAQ 100 index. Using the current prices of individual stocks to predict the current value of the index is not really meaningful, hence the change.
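For reference, a minimal sketch of this setup, assuming the nasdaq100_padding.csv layout used later in this post, where NDX is the index column and all other columns are individual stocks:

import pandas as pd

dat = pd.read_csv('nasdaq100_padding.csv')
exog_cols = [c for c in dat.columns if c != 'NDX']
X = dat[exog_cols].values # exogenous factors: individual stock prices
y = dat['NDX'].values     # target series: the NASDAQ 100 index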

Codes

My main notebook is shown below. Supporting code can be found here.

First, I declare the Python module dependencies.

import torch
from torch import nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

import matplotlib
# matplotlib.use('Agg')
%matplotlib inline
import datetime as dt, itertools, pandas as pd, matplotlib.pyplot as plt, numpy as np

import utility as util

global logger

util.setup_log()
util.setup_path()
logger = util.logger

use_cuda = torch.cuda.is_available()
logger.info("Is CUDA available? %s.", use_cuda)

Second, I build the two attention-based LSTM networks, named encoder and decoder respectively. For easier understanding, I annotate my code with the equation numbers from the DA-RNN paper.

class encoder(nn.Module):
    def __init__(self, input_size, hidden_size, T, logger):
        # input_size: number of underlying factors (81)
        # T: number of time steps (10)
        # hidden_size: dimension of the hidden state
        super(encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.T = T
        self.logger = logger

        self.lstm_layer = nn.LSTM(input_size = input_size, hidden_size = hidden_size, num_layers = 1)
        self.attn_linear = nn.Linear(in_features = 2 * hidden_size + T - 1, out_features = 1)

    def forward(self, input_data):
        # input_data: batch_size * (T - 1) * input_size
        input_weighted = Variable(input_data.data.new(input_data.size(0), self.T - 1, self.input_size).zero_())
        input_encoded = Variable(input_data.data.new(input_data.size(0), self.T - 1, self.hidden_size).zero_())
        # hidden, cell: initial states with dimension hidden_size
        hidden = self.init_hidden(input_data) # 1 * batch_size * hidden_size
        cell = self.init_hidden(input_data)
        # hidden.requires_grad = False
        # cell.requires_grad = False
        for t in range(self.T - 1):
            # Eqn. 8: concatenate the hidden states with each predictor
            x = torch.cat((hidden.repeat(self.input_size, 1, 1).permute(1, 0, 2),
                           cell.repeat(self.input_size, 1, 1).permute(1, 0, 2),
                           input_data.permute(0, 2, 1)), dim = 2) # batch_size * input_size * (2 * hidden_size + T - 1)
            # Eqn. 9: get attention weights
            x = self.attn_linear(x.view(-1, self.hidden_size * 2 + self.T - 1)) # (batch_size * input_size) * 1
            attn_weights = F.softmax(x.view(-1, self.input_size)) # batch_size * input_size, attention weights summing to 1
            # Eqn. 10: LSTM
            weighted_input = torch.mul(attn_weights, input_data[:, t, :]) # batch_size * input_size
            # Fix the warning about non-contiguous memory
            # see https://discuss.pytorch.org/t/dataparallel-issue-with-flatten-parameter/8282
            self.lstm_layer.flatten_parameters()
            _, lstm_states = self.lstm_layer(weighted_input.unsqueeze(0), (hidden, cell))
            hidden = lstm_states[0]
            cell = lstm_states[1]
            # Save output
            input_weighted[:, t, :] = weighted_input
            input_encoded[:, t, :] = hidden
        return input_weighted, input_encoded

    def init_hidden(self, x):
        # No matter whether CUDA is used, the returned variable will have the same type as x.
        return Variable(x.data.new(1, x.size(0), self.hidden_size).zero_()) # dimension 0 is the batch dimension

class decoder(nn.Module):
    def __init__(self, encoder_hidden_size, decoder_hidden_size, T, logger):
        super(decoder, self).__init__()
        self.T = T
        self.encoder_hidden_size = encoder_hidden_size
        self.decoder_hidden_size = decoder_hidden_size
        self.logger = logger

        self.attn_layer = nn.Sequential(nn.Linear(2 * decoder_hidden_size + encoder_hidden_size, encoder_hidden_size),
                                        nn.Tanh(),
                                        nn.Linear(encoder_hidden_size, 1))
        self.lstm_layer = nn.LSTM(input_size = 1, hidden_size = decoder_hidden_size)
        self.fc = nn.Linear(encoder_hidden_size + 1, 1)
        self.fc_final = nn.Linear(decoder_hidden_size + encoder_hidden_size, 1)

        self.fc.weight.data.normal_()

    def forward(self, input_encoded, y_history):
        # input_encoded: batch_size * (T - 1) * encoder_hidden_size
        # y_history: batch_size * (T - 1)
        # Initialize hidden and cell, 1 * batch_size * decoder_hidden_size
        hidden = self.init_hidden(input_encoded)
        cell = self.init_hidden(input_encoded)
        # hidden.requires_grad = False
        # cell.requires_grad = False
        for t in range(self.T - 1):
            # Eqn. 12-13: compute attention weights
            # batch_size * (T - 1) * (2 * decoder_hidden_size + encoder_hidden_size)
            x = torch.cat((hidden.repeat(self.T - 1, 1, 1).permute(1, 0, 2),
                           cell.repeat(self.T - 1, 1, 1).permute(1, 0, 2),
                           input_encoded), dim = 2)
            x = F.softmax(self.attn_layer(x.view(-1, 2 * self.decoder_hidden_size + self.encoder_hidden_size)).view(-1, self.T - 1)) # batch_size * (T - 1), rows sum to 1
            # Eqn. 14: compute context vector
            context = torch.bmm(x.unsqueeze(1), input_encoded)[:, 0, :] # batch_size * encoder_hidden_size
            if t < self.T - 1:
                # Eqn. 15
                y_tilde = self.fc(torch.cat((context, y_history[:, t].unsqueeze(1)), dim = 1)) # batch_size * 1
                # Eqn. 16: LSTM
                self.lstm_layer.flatten_parameters()
                _, lstm_output = self.lstm_layer(y_tilde.unsqueeze(0), (hidden, cell))
                hidden = lstm_output[0] # 1 * batch_size * decoder_hidden_size
                cell = lstm_output[1] # 1 * batch_size * decoder_hidden_size
        # Eqn. 22: final output
        y_pred = self.fc_final(torch.cat((hidden[0], context), dim = 1))
        # self.logger.info("hidden %s context %s y_pred: %s", hidden[0][0][:10], context[0][:10], y_pred[:10])
        return y_pred

    def init_hidden(self, x):
        return Variable(x.data.new(1, x.size(0), self.decoder_hidden_size).zero_())
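As a quick sanity check of the tensor shapes (my own snippet, written against the same 2017-era PyTorch API as the rest of this post; on a GPU machine the inputs and modules would be moved with .cuda() as in the training class below):

batch_size, T, input_size, hidden = 16, 10, 81, 64
enc = encoder(input_size = input_size, hidden_size = hidden, T = T, logger = logger)
dec = decoder(encoder_hidden_size = hidden, decoder_hidden_size = hidden, T = T, logger = logger)

X = Variable(torch.randn(batch_size, T - 1, input_size))
y_history = Variable(torch.randn(batch_size, T - 1))

input_weighted, input_encoded = enc(X)
print(input_encoded.size())   # batch_size * (T - 1) * hidden
y_pred = dec(input_encoded, y_history)
print(y_pred.size())          # batch_size * 1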

The next code chunk implements a class that wraps all steps of the modeling pipeline.

# Train the model
class da_rnn:
    def __init__(self, file_data, logger, encoder_hidden_size = 64, decoder_hidden_size = 64, T = 10,
                 learning_rate = 0.01, batch_size = 128, parallel = True, debug = False):
        self.T = T
        dat = pd.read_csv(file_data, nrows = 100 if debug else None)
        self.logger = logger
        self.logger.info("Shape of data: %s.\nMissing in data: %s.", dat.shape, dat.isnull().sum().sum())

        self.X = dat.loc[:, [x for x in dat.columns.tolist() if x != 'NDX']].as_matrix()
        self.y = np.array(dat.NDX)
        self.batch_size = batch_size

        self.encoder = encoder(input_size = self.X.shape[1], hidden_size = encoder_hidden_size, T = T,
                               logger = logger).cuda()
        self.decoder = decoder(encoder_hidden_size = encoder_hidden_size,
                               decoder_hidden_size = decoder_hidden_size,
                               T = T, logger = logger).cuda()

        if parallel:
            self.encoder = nn.DataParallel(self.encoder)
            self.decoder = nn.DataParallel(self.decoder)

        self.encoder_optimizer = optim.Adam(params = itertools.ifilter(lambda p: p.requires_grad, self.encoder.parameters()),
                                            lr = learning_rate)
        self.decoder_optimizer = optim.Adam(params = itertools.ifilter(lambda p: p.requires_grad, self.decoder.parameters()),
                                            lr = learning_rate)
        # self.learning_rate = learning_rate

        self.train_size = int(self.X.shape[0] * 0.7)
        self.y = self.y - np.mean(self.y[:self.train_size]) # Question: why does Adam require the data to be normalized?
        self.logger.info("Training size: %d.", self.train_size)

    def train(self, n_epochs = 10):
        iter_per_epoch = int(np.ceil(self.train_size * 1. / self.batch_size))
        logger.info("Iterations per epoch: %3.3f ~ %d.", self.train_size * 1. / self.batch_size, iter_per_epoch)
        self.iter_losses = np.zeros(n_epochs * iter_per_epoch)
        self.epoch_losses = np.zeros(n_epochs)

        self.loss_func = nn.MSELoss()

        n_iter = 0
        learning_rate = 1.

        for i in range(n_epochs):
            perm_idx = np.random.permutation(self.train_size - self.T)
            j = 0
            while j < self.train_size:
                batch_idx = perm_idx[j:(j + self.batch_size)]
                X = np.zeros((len(batch_idx), self.T - 1, self.X.shape[1]))
                y_history = np.zeros((len(batch_idx), self.T - 1))
                y_target = self.y[batch_idx + self.T]

                for k in range(len(batch_idx)):
                    X[k, :, :] = self.X[batch_idx[k] : (batch_idx[k] + self.T - 1), :]
                    y_history[k, :] = self.y[batch_idx[k] : (batch_idx[k] + self.T - 1)]

                loss = self.train_iteration(X, y_history, y_target)
                self.iter_losses[i * iter_per_epoch + j / self.batch_size] = loss
                # if (j / self.batch_size) % 50 == 0:
                #     self.logger.info("Epoch %d, Batch %d: loss = %3.3f.", i, j / self.batch_size, loss)
                j += self.batch_size
                n_iter += 1

                if n_iter % 10000 == 0 and n_iter > 0:
                    for param_group in self.encoder_optimizer.param_groups:
                        param_group['lr'] = param_group['lr'] * 0.9
                    for param_group in self.decoder_optimizer.param_groups:
                        param_group['lr'] = param_group['lr'] * 0.9

            self.epoch_losses[i] = np.mean(self.iter_losses[range(i * iter_per_epoch, (i + 1) * iter_per_epoch)])
            if i % 10 == 0:
                self.logger.info("Epoch %d, loss: %3.3f.", i, self.epoch_losses[i])

            if i % 10 == 0:
                y_train_pred = self.predict(on_train = True)
                y_test_pred = self.predict(on_train = False)
                y_pred = np.concatenate((y_train_pred, y_test_pred))
                plt.figure()
                plt.plot(range(1, 1 + len(self.y)), self.y, label = "True")
                plt.plot(range(self.T, len(y_train_pred) + self.T), y_train_pred, label = 'Predicted - Train')
                plt.plot(range(self.T + len(y_train_pred), len(self.y) + 1), y_test_pred, label = 'Predicted - Test')
                plt.legend(loc = 'upper left')
                plt.show()

    def train_iteration(self, X, y_history, y_target):
        self.encoder_optimizer.zero_grad()
        self.decoder_optimizer.zero_grad()

        input_weighted, input_encoded = self.encoder(Variable(torch.from_numpy(X).type(torch.FloatTensor).cuda()))
        y_pred = self.decoder(input_encoded, Variable(torch.from_numpy(y_history).type(torch.FloatTensor).cuda()))

        y_true = Variable(torch.from_numpy(y_target).type(torch.FloatTensor).cuda())
        loss = self.loss_func(y_pred, y_true)
        loss.backward()

        self.encoder_optimizer.step()
        self.decoder_optimizer.step()

        return loss.data[0]

    def predict(self, on_train = False):
        if on_train:
            y_pred = np.zeros(self.train_size - self.T + 1)
        else:
            y_pred = np.zeros(self.X.shape[0] - self.train_size)

        i = 0
        while i < len(y_pred):
            batch_idx = np.array(range(len(y_pred)))[i : (i + self.batch_size)]
            X = np.zeros((len(batch_idx), self.T - 1, self.X.shape[1]))
            y_history = np.zeros((len(batch_idx), self.T - 1))
            for j in range(len(batch_idx)):
                if on_train:
                    X[j, :, :] = self.X[range(batch_idx[j], batch_idx[j] + self.T - 1), :]
                    y_history[j, :] = self.y[range(batch_idx[j], batch_idx[j] + self.T - 1)]
                else:
                    X[j, :, :] = self.X[range(batch_idx[j] + self.train_size - self.T, batch_idx[j] + self.train_size - 1), :]
                    y_history[j, :] = self.y[range(batch_idx[j] + self.train_size - self.T, batch_idx[j] + self.train_size - 1)]

            y_history = Variable(torch.from_numpy(y_history).type(torch.FloatTensor).cuda())
            _, input_encoded = self.encoder(Variable(torch.from_numpy(X).type(torch.FloatTensor).cuda()))
            y_pred[i:(i + self.batch_size)] = self.decoder(input_encoded, y_history).cpu().data.numpy()[:, 0]
            i += self.batch_size

        return y_pred

The actual execution of the algorithm can be triggered as follows:

io_dir = '~/nasdaq'
model = da_rnn(file_data = '{}/data/nasdaq100_padding.csv'.format(io_dir), logger = logger,
               parallel = False, learning_rate = .001)
model.train(n_epochs = 500)
y_pred = model.predict()

plt.figure()
plt.semilogy(range(len(model.iter_losses)), model.iter_losses)
plt.show()

plt.figure()
plt.semilogy(range(len(model.epoch_losses)), model.epoch_losses)
plt.show()

plt.figure()
plt.plot(y_pred, label = 'Predicted')
plt.plot(model.y[model.train_size:], label = "True")
plt.legend(loc = 'upper left')
plt.show()

The results are presented in the figures below. Notice that the target time series is normalized to have mean 0 within the training set; otherwise the network converges extremely slowly.

[Figure: Model Fit after 10 Epochs]

[Figure: Model Fit after 500 Epochs]

[Figure: Training Loss across 500 Epochs]

From the first two figures, it is interesting to see how the model gradually expands its range of prediction across training epochs. Eventually, the model predicts quite accurately within the whole range of the training data, but fails to predict outside this regime. Regime shifts in the stock market, apparently, remain an unpredictable beast.

Updates 10/1/2018

Thanks to Sean Aubin’s contribution, an updated version of this code is now available. It is written in Python 3 and depends on more recent versions of PyTorch. The updated code also uses functional programming, which may suit some people’s taste. Please refer to his repository and applaud his work!

(c) 2017 CHANDLER ZUO ALL RIGHTS RESERVED
