Predicting Stock Prices with Linear Regression in Python - αlphαrithms (2024)

Predicting stock prices in Python using linear regression is easy. Finding the right combination of features to make those predictions profitable is another story. In this article, we’ll train a regression model using historic pricing data and technical indicators to make predictions on future prices.

We’ll cover how to add technical indicators using the pandas_ta package, how to troubleshoot some common errors, and finally let our trained model loose with a basic trading strategy to assess its predictive power. This article focuses primarily on the implementation of the scikit-learn LinearRegression model and assumes the reader has a basic working knowledge of the Python language.

Highlights

  • We’ll load historic pricing data into a pandas DataFrame and add technical indicators to use as features in our linear regression model.
  • We’ll extract only the data we intend to use from the DataFrame.
  • We’ll cover some common mistakes in how data is handled prior to training our model and show how some simple “reshaping” can solve a nagging error message.
  • We’ll train a simple linear regression model using a 10-day exponential moving average as a predictor for the closing price.
  • We’ll analyze the accuracy of our model, plot the results, and consider the magnitude of our errors.
  • Finally, we’ll run a simulated trading strategy to see what kind of returns we could make by leveraging the predictive power of our model. Spoiler alert: it turned out pretty decent!

Introduction

Linear regression is used in business, science, and just about any other field where predictions and forecasting are relevant. It helps identify the relationships between a dependent variable and one or more independent variables. Simple linear regression uses a single feature to predict an outcome. That’s what we’ll be doing here.

Stock market forecasting is an attractive application of linear regression. Modern machine learning packages like scikit-learn make implementing these analyses possible in a few lines of code. Sounds like an easy way to make money, right? Well, don’t cash in your 401(k) just yet.

As easy as these analyses are to implement, selecting features with enough predictive power to turn a profit is more of an art than a science. We’ll take a look at how to easily add common technical indicators to our data to use as features in training our model. Let’s take this step by step, starting with getting our historic pricing data.

Note: The information in this article is for informational purposes only and does not constitute financial advice. See our financial disclosure for more information.

Step 1: Get Historic Pricing Data

To get started we need data. This will come in the form of historic pricing data for Tesla (TSLA). I’m getting this as a direct .csv download from the finance.yahoo.com website and loading it into memory as a pandas DataFrame. See this post on getting stock prices with Python for a more detailed walkthrough.

import pandas as pd

# Load local .csv file as DataFrame
df = pd.read_csv('TSLA.csv')

# Inspect the data
print(df)

# List of entries
           Date        Open        High  ...       Close   Adj Close     Volume
0    2020-01-02   84.900002   86.139999  ...   86.052002   86.052002   47660500
1    2020-01-03   88.099998   90.800003  ...   88.601997   88.601997   88892500
2    2020-01-06   88.094002   90.311996  ...   90.307999   90.307999   50665000
3    2020-01-07   92.279999   94.325996  ...   93.811996   93.811996   89410500
4    2020-01-08   94.739998   99.697998  ...   98.428001   98.428001  155721500
..          ...         ...         ...  ...         ...         ...        ...
248  2020-12-24  642.989990  666.090027  ...  661.770020  661.770020   22865600
249  2020-12-28  674.510010  681.400024  ...  663.690002  663.690002   32278600
250  2020-12-29  661.000000  669.900024  ...  665.989990  665.989990   22910800
251  2020-12-30  672.000000  696.599976  ...  694.780029  694.780029   42846000
252  2020-12-31  699.989990  718.719971  ...  705.669983  705.669983   49649900

[253 rows x 7 columns]

# Show some summary statistics
print(df.describe())

             Open        High         Low       Close   Adj Close        Volume
count  253.000000  253.000000  253.000000  253.000000  253.000000  2.530000e+02
mean   289.108428  297.288412  280.697937  289.997067  289.997067  7.530795e+07
std    167.665389  171.702889  163.350196  168.995613  168.995613  4.013706e+07
min     74.940002   80.972000   70.101997   72.244003   72.244003  1.735770e+07
25%    148.367996  154.990005  143.222000  149.792007  149.792007  4.713450e+07
50%    244.296005  245.600006  237.119995  241.731995  241.731995  7.025550e+07
75%    421.390015  430.500000  410.579987  421.200012  421.200012  9.454550e+07
max    699.989990  718.719971  691.119995  705.669983  705.669983  3.046940e+08

Note: This data is available for download via Github.

Step 2: Prepare the data

Before we start developing our regression model we are going to trim our data some. The ‘Date’ column will be converted to a DatetimeIndex and the ‘Adj Close’ will be the only numerical values we keep. Everything else is getting dropped.

# Reindex data using a DatetimeIndex
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)

# Keep only the 'Adj Close' Value
df = df[['Adj Close']]

# Re-inspect data
print(df)

             Adj Close
Date
2020-01-02   86.052002
2020-01-03   88.601997
2020-01-06   90.307999
2020-01-07   93.811996
2020-01-08   98.428001
...                ...
2020-12-24  661.770020
2020-12-28  663.690002
2020-12-29  665.989990
2020-12-30  694.780029
2020-12-31  705.669983

[253 rows x 1 columns]

# Print Info
print(df.info())

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2020-01-02 to 2020-12-31
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Adj Close  253 non-null    float64
dtypes: float64(1)
memory usage: 4.0 KB

What we see here is our ‘Date’ column having been converted to a DatetimeIndex with 253 entries and the ‘Adj Close’ column being the only retained value of type float64 (np.float64.) Let’s plot our data to get a visual picture of what we’ll be working with from here on out.

We can see a significant upward trend here reflecting a 12-month price increase from 86.052002 to 705.669983. That’s a relative increase of ~720%. Let’s see if we can’t develop a linear regression model that might help predict upward trends like this!
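As a quick check on that figure, the arithmetic can be sketched in a few lines. The two prices are the first and last adjusted closes from the printout above; the variable names are only illustrative.

```python
# Adjusted closes taken from the first and last rows of the data above
start_price = 86.052002
end_price = 705.669983

# Relative change = (end - start) / start, expressed as a percentage
relative_increase_pct = (end_price - start_price) / start_price * 100

print(f"Relative increase: {relative_increase_pct:.1f}%")
```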

Aside: Linear Regression Assumptions & Autocorrelation

Before we proceed we need to discuss a technical limitation of linear regression. Linear regression requires a series of assumptions to be made to be effective. One can certainly apply a linear model without validating these assumptions but useful insights are not likely to be had.

One of these assumptions is that the observations in the data are independent. Namely, this dictates that the residuals (the differences between predicted and observed values) aren’t related to one another.

For Time Series data this is often a problem since our observed values are longitudinal in nature—meaning they are observed values for the same thing, recorded in sequence. This produces a characteristic called autocorrelation which describes how a variable is somehow related to itself (self-related.) (Chatterjee, 2012)
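To make the idea concrete, here is a minimal sketch that compares the lag-1 autocorrelation of pure random noise with that of a random walk built from the same noise. The data is synthetic and the helper name lag1_autocorrelation is hypothetical; it is not part of this article's TSLA analysis.

```python
import random


def lag1_autocorrelation(values):
    """Pearson correlation between a series and itself shifted by one step."""
    x = values[:-1]
    y = values[1:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


random.seed(0)

# Independent draws: each value tells us nothing about the next one
noise = [random.gauss(0, 1) for _ in range(500)]

# A random walk: each value is the previous value plus a random step
walk = []
level = 100.0
for step in noise:
    level += step
    walk.append(level)

noise_ac = lag1_autocorrelation(noise)
walk_ac = lag1_autocorrelation(walk)
print("noise:", round(noise_ac, 3), "walk:", round(walk_ac, 3))
```

The noise series should show an autocorrelation near zero, while the walk, like a typical price series, should be strongly related to its own previous value.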

Autocorrelation analysis is useful in identifying trends like seasonality or weather patterns. When it comes to extrapolating values for price prediction, however, it is problematic. The takeaway here is that our date values aren’t suitable as our independent variable, so we need to come up with something else to predict the adjusted close value, which is our dependent variable. Fortunately, there are some great options here.

Step 3: Adding Technical Indicators

Technical indicators are calculated values describing movements in historic pricing data for securities like stocks, bonds, and ETFs. Investors use these metrics to predict the movements of stocks to best determine when to buy, sell, or hold.

Commonly used technical indicators include moving averages (SMA, EMA, MACD), the Relative Strength Index (RSI), Bollinger Bands (BBANDS), and several others. There is certainly no shortage of popular technical indicators out there to choose from. To add our technical indicators we’ll be using the pandas_ta library. To get started, let’s add an exponential moving average (EMA) to our data:

import pandas_ta

# Add EMA to dataframe by appending
# Note: pandas_ta integrates seamlessly into
# our existing dataframe
df.ta.ema(close='adj_close', length=10, append=True)

# Inspect Data once again
             adj_close      EMA_10
date
2020-01-02   86.052002         NaN
2020-01-03   88.601997         NaN
2020-01-06   90.307999         NaN
2020-01-07   93.811996         NaN
2020-01-08   98.428001         NaN
...                ...         ...
2020-12-24  661.770020  643.572394
2020-12-28  663.690002  647.230141
2020-12-29  665.989990  650.641022
2020-12-30  694.780029  658.666296
2020-12-31  705.669983  667.212421

[253 rows x 2 columns]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 253 entries, 2020-01-02 to 2020-12-31
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   adj_close  253 non-null    float64
 1   EMA_10     244 non-null    float64
dtypes: float64(2)

As evident from the printouts above, we now have a new column in our data titled “EMA_10.” This is our newly-calculated value representing the exponential moving average calculated over a 10-day period.

Note: The pandas_ta library will alter the column names. Here we see the “Adj Close” column renamed to “adj_close.” This is expected behavior but can cause issues if one isn’t aware of this functionality.

This is great news but also comes with a caveat: the first 9 entries in our data will have a NaN value since there weren’t enough preceding values from which the EMA could be calculated. Let’s take a closer look at that:

# Print the first 10 entries of our data
print(df.head(10))

             adj_close     EMA_10
date
2020-01-02   86.052002        NaN
2020-01-03   88.601997        NaN
2020-01-06   90.307999        NaN
2020-01-07   93.811996        NaN
2020-01-08   98.428001        NaN
2020-01-09   96.267998        NaN
2020-01-10   95.629997        NaN
2020-01-13  104.972000        NaN
2020-01-14  107.584000        NaN
2020-01-15  103.699997  96.535599

We need to deal with this issue before moving on. There are several approaches we could take to replace the NaN values in our data. These include replacing with zeros, the mean for the series, backfilling from the next available, etc. All these approaches seek to replace NaN values with some pseudo values.

Given our goal of predicting real-world pricing that’s not an attractive option. Instead, we’re going to just drop all the rows where we have NaN values and use a slightly smaller dataset by taking the following approach:

# Drop the first 10 rows (the 9 NaN rows plus the first row with a valid EMA)
df = df.iloc[10:]

# View our newly-formed dataset
print(df.head(10))

             adj_close      EMA_10
date
2020-01-16  102.697998   97.656035
2020-01-17  102.099998   98.464028
2020-01-21  109.440002  100.459660
2020-01-22  113.912003  102.905540
2020-01-23  114.440002  105.002715
2020-01-24  112.963997  106.450221
2020-01-27  111.603996  107.387271
2020-01-28  113.379997  108.476858
2020-01-29  116.197998  109.880701
2020-01-30  128.162003  113.204574
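An alternative to slicing by position is to let pandas find the incomplete rows. The sketch below uses a small made-up frame (the EMA_3 column and its values are purely illustrative) and the standard DataFrame.dropna() method, which removes only rows that actually contain a NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: a 3-period EMA leaves the first 2 rows as NaN
demo = pd.DataFrame(
    {
        "adj_close": [10.0, 11.0, 12.0, 13.0, 14.0],
        "EMA_3": [np.nan, np.nan, 11.0, 12.0, 13.0],
    }
)

# Drop only the rows that contain at least one NaN value
cleaned = demo.dropna()

print(len(demo), "rows before,", len(cleaned), "rows after")
```

The benefit of this approach is that the number of rows to drop doesn't need to be counted by hand when the indicator length changes.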

Now we’re ready to start developing our regression model to see how effective the EMA is at predicting the price of the stock. First, let’s take a quick look at a plot of our data now to get an idea of how the EMA value tracks with the adjusted closing price.

We can see here the EMA tracks nicely and that we’ve only lost a little bit of our data at the leading edge. Nothing to worry about; our linear model will still have ample data to train on!
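For readers curious where the EMA_10 numbers come from, the printed values are consistent with an EMA that is seeded with the simple average of the first 10 closes and then updated with a smoothing factor of 2 / (length + 1). The sketch below reproduces the first few values under that assumption; the function name ema_sma_seeded is hypothetical and this is not the pandas_ta source code.

```python
import math


def ema_sma_seeded(prices, length=10):
    """EMA whose first value is the simple average of the first `length` prices."""
    alpha = 2.0 / (length + 1)       # smoothing factor
    ema = [math.nan] * len(prices)   # positions before the seed stay NaN
    if len(prices) < length:
        return ema
    # Seed with the simple moving average of the first `length` prices
    ema[length - 1] = sum(prices[:length]) / length
    # Each later value blends the new price with the previous EMA
    for i in range(length, len(prices)):
        ema[i] = alpha * prices[i] + (1 - alpha) * ema[i - 1]
    return ema


# First 12 adjusted closes, copied from the printouts above
closes = [
    86.052002, 88.601997, 90.307999, 93.811996, 98.428001,
    96.267998, 95.629997, 104.972000, 107.584000, 103.699997,
    102.697998, 102.099998,
]

ema_10 = ema_sma_seeded(closes, length=10)
print([round(value, 6) for value in ema_10[9:]])
```

The first nine positions stay NaN, which is exactly why those rows had to be dropped before training.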

Step 4: Test-Train Split

Machine learning models require at minimum two sets of data to be effective: the training data and the testing data. Given that new data can be hard to come by, a common approach to generate these subsets of data is to split a single dataset into multiple sets (Xu, 2018).

A common choice is to use eighty percent of the data for training and the remaining twenty percent for testing. More formulaic approaches to choosing the ratio can be used as well (Guyon, 1997).

The 80/20 split is where we’ll be starting out. Rather than mucking about trying to split our DataFrame object manually, we’ll just use the scikit-learn train_test_split function to handle the heavy lifting:

from sklearn.model_selection import train_test_split

# Split data into testing and training sets
# Note: the first argument becomes X (here adj_close) and the second becomes y (here EMA_10)
X_train, X_test, y_train, y_test = train_test_split(df[['adj_close']], df[['EMA_10']], test_size=.2)

# Test set
print(X_test.describe())

        adj_close
count   49.000000
mean   272.418612
std    140.741107
min     86.040001
25%    155.759995
50%    205.009995
75%    408.089996
max    639.830017

# Training set
print(X_train.describe())

        adj_close
count  194.000000
mean   291.897732
std    166.033359
min     72.244003
25%    155.819996
50%    232.828995
75%    421.770004
max    705.669983

We can see that our data has been split into separate DataFrame objects with the nearest whole-number value of rows reflective of our 80/20 split (49 test samples, 194 training samples). Note the test size 0.20 (20%) was specified as an argument to the train_test_split function.

Note: The X_train, X_test, y_train, and y_test data are Pandas DataFrame objects in memory. This results from the use of double-bracketed access notation df[['adj_close']] as opposed to single-bracket notation df['adj_close']. The single-bracketed notation would return a Series object and would require reshaping before we could proceed to fit our model. See this post for more details.
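The difference between the two notations is easy to see by checking shapes. This is a small sketch on a made-up three-row frame; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names used in this article
demo = pd.DataFrame({"adj_close": [1.0, 2.0, 3.0], "EMA_10": [1.5, 2.5, 3.5]})

as_frame = demo[["adj_close"]]   # double brackets: a 2-D DataFrame
as_series = demo["adj_close"]    # single brackets: a 1-D Series

# Reshape the 1-D values into a single-column 2-D array
reshaped = as_series.to_numpy().reshape(-1, 1)

print(as_frame.shape, as_series.shape, reshaped.shape)
```

Scikit-learn estimators expect the feature input to be two-dimensional (samples by features), which is why the Series would need the reshape(-1, 1) step while the double-bracketed DataFrame does not.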

Step 5: Training the Model

We have our data and now we want to see how well it can be fit to a linear model. Scikit-learn’s LinearRegression class makes this simple enough—requiring only 2 lines of code (not including imports):

from sklearn.linear_model import LinearRegression

# Create Regression Model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Use model to make predictions
y_pred = model.predict(X_test)

That’s it—our linear model has now been trained on 194 training samples, and we’ve generated predicted values (y_pred). Now we can assess how well our model fits our data by examining our model coefficients and some statistics like the mean absolute error (MAE) and coefficient of determination (r2).
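To see what the fit is doing for a single feature, here is a minimal sketch of ordinary least squares written by hand. It uses made-up numbers, and fit_simple_linear is a hypothetical helper, not a scikit-learn function; for one feature it computes the same kind of slope and intercept that a fitted linear model exposes as its coefficient and intercept.

```python
def fit_simple_linear(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative data only, not taken from the TSLA dataset
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_simple_linear(x_demo, y_demo)
predictions = [slope * xi + intercept for xi in x_demo]

print("slope:", round(slope, 4), "intercept:", round(intercept, 4))
```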

Step 6: Validating the Fit

The linear model generates coefficients for each feature during training and returns these values as an array. In our case, we have one feature that will be reflected by a single value. We can access this using the model.coef_ attribute.

In addition, we can use the predicted values from our trained model to calculate the mean absolute error and the coefficient of determination using functions from the sklearn.metrics module. Let’s see a medley of metrics useful in evaluating our model’s utility.

from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Printout relevant metrics
print("Model Coefficients:", model.coef_)
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Coefficient of Determination:", r2_score(y_test, y_pred))

# Results
Model Coefficients: [[0.94540376]]
Mean Absolute Error: 12.554147460577513
Coefficient of Determination: 0.9875188616393644

The MAE is the arithmetic mean of the absolute errors of our model: the sum of the absolute differences between the predicted and observed values, divided by the total number of observations.

Check out this article by Shravankumar Hiregoudar for a deeper look into using the MAE, as well as other metrics, for evaluating regression models.

For now, let’s just recognize that a lower MAE value is better, and the closer our coefficient of determination is to 1.0 the better. The metrics here suggest that our model fits our data well, though the MAE is slightly high.
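Both metrics are simple enough to compute by hand, which can help build intuition for what the library functions report. The sketch below uses four made-up observations; the function names are hypothetical stand-ins, not the sklearn.metrics implementations.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only, not taken from the model in this article
observed = [100.0, 110.0, 120.0, 130.0]
predicted = [98.0, 113.0, 119.0, 134.0]

mae = mean_absolute_error_manual(observed, predicted)
r2 = r2_manual(observed, predicted)

print("MAE:", mae, "R2:", round(r2, 4))
```

Note that the MAE is expressed in the same units as the target (dollars here), so whether a given value is "high" depends on the price level of the stock.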

Let’s consider a chart of our observed values compared to the predicted values to see how this is represented visually:

This looks like a pretty good fit! Given our relatively high r2 value that’s no surprise. Just for kicks, let’s add some lines to represent the residuals for each predicted value.

This doesn’t tell us anything new but helps to conceptualize what the coefficient of determination is actually representing: an aggregate statistic for how far off our predicted values are from the actual values. So now we have this linear model, but what is it telling us?

Step 7: Interpretation

At this point, we’ve trained a model on historical pricing data using the Adjusted Closing value and the Exponential Moving Average for a 10-day trading period. Our goal was to develop a model that can use the EMA of any given day (dependent on pricing from the previous 9 days) and accurately predict that day’s closing price. Let’s run a simulation of a very simple trading strategy to assess how well we might have done using this.

Strategy: If our model predicts a higher closing value than the opening value we make a trade for a single share on that day—buying at market open and selling just before market close.
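The rule above can be sketched as a short loop. The prices below are invented for illustration and simulate_intraday_strategy is a hypothetical helper; this is not the code or data behind the results reported in this article.

```python
def simulate_intraday_strategy(days):
    """Apply the rule to (open_price, close_price, predicted_close) tuples.

    Returns the number of trades taken and the total profit for one share per trade.
    """
    trades = 0
    profit = 0.0
    for open_price, close_price, predicted_close in days:
        # Only trade when the model predicts a close above the open
        if predicted_close > open_price:
            trades += 1
            # Buy at the open, sell at the close
            profit += close_price - open_price
    return trades, profit


# Hypothetical days: (open, actual close, predicted close)
sample_days = [
    (100.0, 104.0, 103.0),  # predicted above open: trade, gains 4
    (105.0, 102.0, 104.0),  # predicted below open: no trade
    (101.0, 99.0, 102.5),   # predicted above open: trade, loses 2
    (98.0, 103.0, 97.0),    # predicted below open: no trade
]

trades, profit = simulate_intraday_strategy(sample_days)
print("trades:", trades, "profit:", profit)
```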

Below is a summary of each trading day during our test data period:

Results

In the 49 possible trade days, our strategy elected to make 4 total trades. This strategy makes two bold assumptions:

  1. We were able to purchase a share at the exact open price recorded;
  2. We were able to sell that share just before closing at the exact price recorded.

Applying this strategy—and these assumptions—our model generated $151.77. If our starting capital was $1,000 this strategy would have resulted in a ~15.18% increase of total capital.

Shortcomings

Before you open your TD Ameritrade account and start transferring your 401(k), let’s consider these results. There are quite a few problems with them after all.

  1. We’re applying this model to data very close to the training data;
  2. We aren’t accounting for relevant broker fees for buys and sells;
  3. We aren’t accounting for taxes; short-term gains like these are taxed as “ordinary income,” as the IRS would say.
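To get a feel for how much the last two items matter, the sketch below adjusts the reported gross profit for fees and taxes. The $1.00 fee per order and the 24% tax rate are assumptions chosen only for illustration, and net_profit is a hypothetical helper; actual fees and tax treatment vary by broker and by investor.

```python
def net_profit(gross_profit, n_trades, fee_per_order, tax_rate):
    """Gross profit minus round-trip fees, then minus tax on any remaining gain."""
    # Each trade involves two orders: one buy and one sell
    fees = n_trades * 2 * fee_per_order
    after_fees = gross_profit - fees
    # Tax is only applied when there is a gain left to tax
    taxes = max(after_fees, 0.0) * tax_rate
    return after_fees - taxes


gross = 151.77    # gross profit reported in the Results section
trades_made = 4   # number of trades reported in the Results section

# Assumed values for illustration only
net = net_profit(gross, trades_made, fee_per_order=1.00, tax_rate=0.24)

print(f"Net profit under these assumptions: ${net:.2f}")
```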

Review

Using linear regression to predict stock prices is a simple task in Python when one leverages the power of machine learning libraries like scikit-learn. The convenience of the pandas_ta library also cannot be overstated—allowing one to add any of the dozens of technical indicators in single lines of code.

In this article we have seen how to load in data, test-train split the data, add indicators, train a linear model, and finally apply that model to predict future stock prices—with some degree of success!

The use of the exponential moving average (EMA) was chosen somewhat arbitrarily. There are many other technical indicators that are common among algorithmic trading and traditional trading strategies:

  1. Relative Strength Index (RSI)
  2. Moving Average Convergence Divergence (MACD)
  3. Aspects of Bollinger Bands
  4. Average Daily Range or Average True Range
  5. so many more …

These indicators can be used instead of the EMA, alongside it in multiple regression models, or creatively combined with feature engineering. The only limit on how one leverages these indicators in developing linear models is imagination!
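As a rough illustration of the multiple regression idea, the sketch below fits two features at once using NumPy's least-squares solver. The ema and rsi arrays are random synthetic numbers, and the target is built from a known, noise-free formula purely so the recovered coefficients can be checked; none of this reflects real market behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two indicator columns (not real market data)
ema = rng.uniform(90, 110, size=50)
rsi = rng.uniform(30, 70, size=50)

# Synthetic target built from known coefficients so the fit can be verified
close = 1.02 * ema + 0.05 * rsi + 2.0

# Design matrix: one column per feature plus a column of ones for the intercept
X = np.column_stack([ema, rsi, np.ones_like(ema)])

# Solve for the coefficients that minimize the squared error
coeffs, *_ = np.linalg.lstsq(X, close, rcond=None)

print("EMA weight, RSI weight, intercept:", np.round(coeffs, 4))
```

With real indicators the fit would not be exact, and indicators derived from the same price series tend to be correlated with one another, which is worth keeping in mind when interpreting the individual weights.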

References

  1. Chatterjee. Regression Analysis by Example, 5th Edition. 5th ed., Wiley, 2012.
  2. Guyon, Isabelle. A Scaling Law for the Validation-Set Training-Set Size Ratio. AT&T Bell Laboratories, 1997. doi:10.1.1.33.1337
  3. Xu, Yun, and Royston Goodacre. “On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning.” Journal of Analysis and Testing, vol. 2, no. 3 (2018): 249-262. doi:10.1007/s41664-018-0068-2



Author: Foster Heidenreich CPA
