MIE 1628H - Big Data Final Project Report: Apple Stock Price Prediction
1. Introduction
The most fundamental distinguishing feature of time series analysis is that the observations are dependent, or correlated. Time series analysis therefore requires different methods; commonly used statistical methods based on random samples may not be applicable. For instance, we cannot use cross validation with a random split. Traditionally, the statistical methods used for time series are:
• Random Walks
• Simple and Damped Exponential Smoothing
• Holt's Linear Trend
• Autoregressive Integrated Moving Average (ARIMA)
• Simple Moving Average
But over time, machine learning methods have increasingly been seen as an alternative to these traditional methods and can be more accurate (Spyros Makridakis, 2018), though they are computationally more expensive than statistical methods. It is always advantageous to apply both, because the statistical methods provide a simple benchmark against which to check whether the extra effort of constructing a machine learning model, and its computational cost, yields better predictions. Our group used the simple moving average as a benchmark model. Many machine learning models have been used for time series (Nesreen K. Ahmed, 2010), such as Bayesian and generalized regression neural networks, k-nearest neighbour regression, regression trees and SVR.
Figure 1: Adjusted closing price of Apple (left) and the trend component of the decomposition (right)
2. Business Problem
The aim of our project is to construct and compare machine learning models for predicting Apple's stock price over different time horizons. By predicting the stock price precisely, we can anticipate the expected movement and take a position in the market. A further aim is to develop a trading strategy to assess the profitability of the classification model.
3. Target Variable
This is a supervised machine learning approach, and the target (dependent) variable is the adjusted closing price of Apple stock (unless mentioned otherwise, all prices referred to are adjusted prices). The features are selected so that the correlated, time-dependent nature of the series is taken into account. The models considered are moving average, linear regression, random forests, gradient boosted trees and logistic regression (classification for trading). We limited ourselves to these simpler regression models because they are straightforward to implement in the Spark MLlib library.
4. Data Set
We used a data set derived from the Quandl website, mainly because it provides adjusted prices. The following plot illustrates the difference in prices between the two data sources: Yahoo only provides the adjusted close price, whereas Quandl provides adjusted open, high, low, close and volume. The difference between adjusted and actual close prices is due to the nature of the market: it arises when a company announces a dividend or splits its stock. Apple stock has split four times, with the most recent and biggest split, on a 7:1 basis, occurring on June 9th, 2014 (apple).
Figure 2: Actual versus adjusted High, Low, Close and Volume, together with the split factor (Quandl data)
During the most recent split, the closing price decreased from $645 on June 6th, 2014 to $93 on June 9th, 2014. If we trained a machine learning model on the actual prices, it would be biased and would try to attribute this drop to some of the features unnecessarily. We can always recover the actual close price from the adjusted price (investopedia).
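As a worked example, under the simplifying assumption that only the split matters (dividend adjustments rescale prices in the same multiplicative way), a $k{:}1$ split divides all pre-split prices by $k$:

$$\text{Adjusted Close} = \frac{\text{Actual Close}}{k} \qquad\Rightarrow\qquad \frac{\$645}{7} \approx \$92,$$

which is consistent with the post-split price level quoted above.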
5. Benchmark Model: Simple Moving Average
Figure 3: Moving Average model for different window periods. RMSE on the left and SMAPE on the right
We can see that as the window period increases from 1 day to 6 days, the RMSE and SMAPE values increase for all three time horizons, so the moving average is not a good model for longer window periods. Tomorrow's price is explained more by today's price than by the average of the last six days, which also makes sense intuitively. A better version of the moving average is exponential smoothing, where the weight given to past prices decreases exponentially. Also note that the moving average performs worse for longer time horizons.
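For reference, the one-step-ahead forecasts of the two methods, in standard notation (window length $k$, smoothing parameter $\alpha$; the report's exact formulation is not shown, so take these as the textbook definitions):

$$\hat{y}_{t+1} = \frac{1}{k}\sum_{i=0}^{k-1} y_{t-i} \ \ \text{(simple moving average)}, \qquad \hat{y}_{t+1} = \alpha\,y_t + (1-\alpha)\,\hat{y}_t,\ \ 0<\alpha\le 1 \ \ \text{(exponential smoothing)}.$$

Unrolling the second recursion gives weight $\alpha(1-\alpha)^j$ on $y_{t-j}$, which is exactly the exponential decay of past prices described above.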
6. Feature Engineering
The following table presents the features used for predicting the closing price. We selected features to account for the components of a time series explained earlier: trend, seasonality and volatility.
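As an illustration of how window-based features like these can be computed in PySpark (a sketch only: the column names Date and Adj_Close and the particular windows are illustrative, not the project's exact feature set):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Order rows chronologically; every feature looks strictly backwards in
# time so that no future information leaks into the prediction.
w = Window.orderBy("Date")

df = (df
      # yesterday's adjusted close (basic autoregressive feature)
      .withColumn("Adj_Close_lag1", F.lag("Adj_Close", 1).over(w))
      # mean adjusted close over the previous 5 trading days (trend)
      .withColumn("Avg_Close_5", F.avg("Adj_Close").over(w.rowsBetween(-5, -1)))
      # maximum adjusted close over the previous 5 trading days (volatility)
      .withColumn("Max_Close_5", F.max("Adj_Close").over(w.rowsBetween(-5, -1))))
```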
7. Feature Importance
We used linear regression and random forests to identify the important features for all time horizons. Our team observed that the feature importances came out the same in both models.
8. Evaluation Metrics
Throughout the project we used two evaluation metrics: RMSE and SMAPE.
Both are appropriate because we are dealing with a regression problem and both measure how close our predictions are to the actual values; since we are interested in correctly predicting the price, these are the right metrics to choose. The main advantage of SMAPE is that it is a normalized error metric, so we can compare results across time horizons: it is normalized by the price, and share prices usually follow an increasing trend. For this reason we chose SMAPE for the final results table. Neither metric distinguishes the direction of the error (predicting above or below the actual value).
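For completeness, the standard definitions (SMAPE has several variants in the literature; we assume the symmetric form with the mean of actual and forecast in the denominator):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}, \qquad \mathrm{SMAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{\lvert \hat{y}_t - y_t \rvert}{\left(\lvert y_t\rvert + \lvert \hat{y}_t\rvert\right)/2}.$$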
9. Model Improvement
There are many ways of improving model performance; our group approached it from two perspectives: first, finding the right amount of training data, and second, hyperparameter tuning.
Figure 11: 1-day Prediction – Random Forests Results. Figure 12: 1-day Prediction – Gradient Boosting Results
For random forests there is a trade-off, and we can determine the optimal amount of training data that gives the best results. For gradient boosting there is no clear pattern, but we can say that increasing the training data does not improve performance. The reasons for this behaviour are explained in the results and conclusion section.
Model Optimization
We developed our own user-defined functions for performing fixed-window rolling-forecast cross validation and grid-search hyperparameter tuning. The function "crossval" takes any Spark data frame as input and splits it into 10 equal parts using the rank function on the date column. It then merges the 1st, 2nd and 3rd splits (via union) as the training set, takes the 4th split as the test set, and passes them as arguments to the "tuning_lr" function, which performs a grid search over a list of parameters (another argument) and stores the results. Next, it merges the 2nd, 3rd and 4th splits as training and the 5th as test, and repeats this, moving the window forward one split at a time, until the 10th split becomes the test set. Finally, it averages the results on the test data across all windows. A minimal sketch of this scheme is shown below.
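The following is a minimal PySpark sketch of this rolling-window scheme, not the project's exact code: the column name "Date", the bucketing via ntile (standing in for the rank-based split described above), and the assumption that the tuning function fits on the training set, evaluates on the test set and returns a single metric are all illustrative.

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

def crossval(df, params, tuning_fn, n_splits=10, train_len=3):
    """Fixed-window rolling-forecast cross validation (sketch)."""
    # Assign each row to one of n_splits equal, date-ordered buckets
    df = df.withColumn("bucket",
                       F.ntile(n_splits).over(Window.orderBy("Date")))

    fold_results = []
    for start in range(1, n_splits - train_len + 1):
        # train_len consecutive buckets for training, the next one for test
        train = df.filter(F.col("bucket").between(start, start + train_len - 1))
        test = df.filter(F.col("bucket") == start + train_len)
        # tuning_fn plays the role of the report's "tuning_lr" helper: it is
        # assumed to grid-search `params`, fit on train and score on test
        fold_results.append(tuning_fn(train, test, params))

    # Average the test metric across all rolling windows
    return sum(fold_results) / len(fold_results)
```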
10. Results and Conclusion
The simpler model, linear regression, performed better than random forests, the gradient boosted regressor and the moving average. As expected, machine learning models have the capability to explain the behaviour of the time series.
One reason for the poor performance of random forests, which are built from decision trees, is that random forests do not work well with time series data that has an increasing trend; this is illustrated very neatly in a blog post (medium). Once trained, a random forest tested on validation data it has never seen can never predict values larger than those in the training samples. But stock prices generally have an increasing trend, so the test data will on average have higher values than the training data; the random forest, simply put, returns an average of training values for the new data.
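A minimal PySpark illustration of this limitation (a hypothetical toy example assuming an active spark session, not part of the project code):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

# A series that is a pure upward trend: price = t for t = 1..100
train = spark.createDataFrame([(float(t), float(t)) for t in range(1, 101)],
                              ["t", "price"])
test = spark.createDataFrame([(150.0,)], ["t"])

assembler = VectorAssembler(inputCols=["t"], outputCol="features")
model = RandomForestRegressor(labelCol="price").fit(assembler.transform(train))

# A forest predicts by averaging values stored in its leaves during
# training, so the forecast stays near the training maximum (about 100)
# rather than following the trend up to 150.
model.transform(assembler.transform(test)).select("prediction").show()
```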
Another point to note is that all models perform better over shorter time horizons. This is expected, because the stock price four months ahead can be influenced by many more external factors.
We have developed the classification model but not implemented the trading strategy. I believe classification methods give better returns for trading strategies: if we predict that the stock price will fall tomorrow from $10 to $9.80 and sell the stock today, but the actual price rises to $10.10, we lose money even though the SMAPE looks good. Constructing a model that predicts whether tomorrow's price will go up or down should therefore be a better strategy.
11. Recommendations
1. Perform hyperparameter tuning over a wider range of values, and over more parameters, to improve the accuracy of random forests.
2. Try to remove the trend in the series with statistical transformations such as differencing, and then apply random forests and gradient boosted trees (see the sketch after this list).
3. Explore advanced algorithms such as neural networks, which our group did not; these could potentially perform better than linear regression.
4. Do multivariate analysis by adding external factors such as currency rates and sentiment.
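A minimal PySpark sketch of recommendation 2 (first-order differencing; the column names are illustrative):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Model the day-over-day change instead of the price level; differencing
# removes a (locally) linear trend, which is what trips up tree models.
w = Window.orderBy("Date")
df = df.withColumn("Adj_Close_diff",
                   F.col("Adj_Close") - F.lag("Adj_Close", 1).over(w))
```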
12. Mentor's Recommendations
Our mentor, Reza Farahani, was very helpful and approachable. His recommendations were mainly on the feature engineering side: in particular, features such as the average adjusted close of alternate days, the average of prices over a one-week window (Avg_Close_0_1_5_15), and the maximum price over the last 5, 10 and 84 training days were his suggestions. These features indeed proved to be important for the longer time horizons. He also suggested that we not focus much effort on classical models like ARIMA, as machine learning models would give better performance. He answered many of our questions and guided us to good time series kernels on Kaggle; we referred to some of them, but due to the limited functionality in Spark we did not apply all of them.
13. Challenges
We faced multiple limitations during the project, some of which are:
1. Lack of native libraries for performing tasks such as rolling-window cross validation.
2. Lack of native libraries for statistical models such as ARIMA. We were pointed to the spark-ts library, developed by Sandy Ryza (github), which provides statistical models that leverage Spark, but the documentation was hard to follow and, with our limited knowledge of Java and Scala, we decided not to go ahead with it.
3. The Databricks community account has limited computational capacity; because of this, cross validation for a model like random forests took hours to execute, and it often stopped midway with a memory error ("Connect Exception error: This is often caused by an OOM error").
14. Contribution
My main contribution was developing the code for the train/test splits, cross validation and hyperparameter tuning, and running it for the different scenarios. I also took part in the group discussions, contributed to the analysis and presentation, and consolidated the team's work at the end.
15. Databricks
Although the community edition of Databricks has limited computational capacity, it offers many good features. I could see the power of Spark and its parallel processing: running cross validation for random forests created more than 10,000 Spark jobs, and I could also inspect each job's number, its details and its DAG graph (which I did not fully understand).
Bibliography
(n.d.). Retrieved from https://www.quandl.com/
(n.d.). Retrieved from https://investor.apple.com/investor-relations/faq/default.aspx
(n.d.). Retrieved from https://www.investopedia.com/ask/answers/06/adjustedclosingprice.asp
(n.d.). Retrieved from https://github.com/sryza/spark-timeseries
(n.d.). Retrieved from https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.RandomForestRegressor
(n.d.). Retrieved from https://medium.com/datadriveninvestor/why-wont-time-series-data-and-random-forests-work-very-well-together-3c9f7b271631
Nesreen K. Ahmed, A. F.-S. (2010). An Empirical Comparison of Machine Learning Models for Time Series Forecasting. Econometric Reviews.
Spyros Makridakis, E. S. (2018). Statistical and Machine Learning Forecasting Methods: Concerns and Ways Forward.
Wei, W. W. (2013). Time Series Analysis. Retrieved from http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199934898.001.0001/oxfordhb-9780199934898-e-022?print=pdf