TSA Business Report
TSA Business Report
TSA Business Report
Wine
Time Series
Project
Report
June’ 21
Date: 19/06/2021
0
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Table of Contents
Contents
Executive Summary....................................................................................................................................................3
Introduction................................................................................................................................................................3
Data Description.........................................................................................................................................................3
Sample of the dataset:............................................................................................................................................3
Q1: Use methods of descriptive statistics to summarize data....................................................................................3
Q2: Perform appropriate Exploratory Data Analysis to understand the data and also perform decomposition ...4
Q3: Split the data into training and test. The test data should start in 1991...........................................................8
Q4: Build various exponential smoothing models on the training data and evaluate the model using RMSE on
the test data. Other models such as regression,naïve forecast models, simple average models etc. should
also be built on the training data and check the performance on the test data using RMSE...............................9
Q5: Check for the stationarity of the data on which the model is being built on using appropriate statistical
tests and also mention the hypothesis for the statistical test. If the data is found to be non-stationary, take
appropriate steps to make it stationary. Check the new data for stationarity and comment. .............................15
Q6: Build an automated version of the ARIMA/SARIMA model in which the parameters are selected using the
lowest Akaike Information Criteria (AIC) on the training data and evaluate this model on the test data using
RMSE....................................................................................................................................................................... 17
Q7: Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data and
evaluate this model on the test data using RMSE.................................................................................................21
Q8: Build a table with all the models built along with their corresponding parameters and the respective RMSE
values on the test data............................................................................................................................................23
Q9: Based on the model-building exercise, build the most optimum model(s) on the complete data and predict
12 months into the future with appropriate confidence intervals/bands ...............................................................24
Q10: Comment on the model thus built and report your findings and suggest the measures that the company
should be taking for future sales............................................................................................................................25
THE END!.................................................................................................................................................................. 25
1
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
List of Figures
Fig.1 – Time series plot
Fig.2 – Wine Sale per Year
Fig.3 – Wine Sale per Month for 8 years
Fig.4 – Wine sale trend over the years per month
Fig.5 – Sales units plot
Fig.6 – Average Sales and Sales Percentage Change
Fig.7 – Time series decomposition - Additive
Fig.8 – Time series decomposition - Multiplicative
Fig.9 – SES Test set prediction
Fig.10 – DES Test set prediction
Fig.11 – TES Test set prediction
Fig.12 – TES Test set prediction – 2 for Brute Force
Fig.13 – Regression Test set prediction
Fig.14 – Naive Test set prediction
Fig.15 – Simple Average Test set prediction
Fig.16 – Rolling Mean & Std Deviation on original series
Fig.17 – Rolling Mean & Std Deviation on differenced series
Fig.18 – ACF on Original time series
Fig.19 – ACF on Differentiated time series
Fig.20 – PACF on Original time series
Fig.21 – PACF on Differentiated time series
Fig.22 – SARIMA (1,1,2)(1,0,2,12)
Fig.23 – ACF on whole Time series
Fig.24 – PACF on whole Time series
Fig.25 – SARIMA (3,1,2)(1,0,1,12)
Fig.26 – Sparkling wine forecast
Executive Summary
For this particular assignment, the data of Sparkling wine sales in the 20th century is to be analysed.
The task is to analyze and forecast Wine Sales in the 20th century
Data Description
1. YearMonth: Year and month in format YYYY-MM. Data type - object
2. Sparkling : Sales unit of Sparkling wine for respective duration on month and year. Data type – int64
The data is having 187 rows for these 2 columns with no null values.
Q1: Read the data as an appropriate Time Series data and plot the data.
After reading the data and with above description of the data – converting it into Timeseries format by making
the YearMonth as the index.
The dataset plot shows a steady increase in wine sales over the year 1982 -1988 post which the sales have
reduced and the maximum sales/year remain in between the range 6500-6000 units similarly the lowest sales
points are 1000-2000 units/year over the entire 7 years duration.
The dataset plot also shows seasonality wherein similar pattern of sales is observed per year.
Q2: Perform appropriate Exploratory Data Analysis to understand the data and
also perform decomposition.
Wine sales per year when observed closely and plotted gives below observation:
There was a steady increase in sales/year for sparkling wine from 1982-1985.
Post year 1985 – the sale unit were almost constant until 1989.
From 1989 the sales started decreasing until 1992
Again the trend for sales increase was observed for 1993 and 1994.
The lowest sales of wine is observed at year 1995.
The median lines at the box plots need to be explored more – as it shows that the sales unit across the
year were not constant.
Plotting the monthly sales of wines over the years, sales of wine per month per year to corelate above
2 plots:
Most of the sales of the wine has been in units of 3000-7000 – meaning these are units per year sales
which is observed for the duration of last 8 years
Roughly about 40% of sales unit with last 8 years is around 1000-3500. Which also means company
should strive to increase this and keep sales unit above this – if aiming for a growth of sales.
Sales unit lower than 3000 an year could be taken as a loss of market – due to other competitor wines
or change of consumer preference to drink wine with some other liquid product like juice, water etc.
An average sale of 6000-7000 units during Dec is normal sales trend – lesser than this means
company might be losing the market.
An average sales percentage change during first half if the year is generally negative upto even -75%
and at later half of the year it would be mostly positive with upte 100% sales increase from 3000 units
of wine/month on average.
Also – its important to look at the -ve sales during the first half of the year (June/July months specially)
as to why the sales are less during these months. If average sale during these months could be
increased – it would lead to a significant increase of profit for the company.
The residues seem to be NOT following the trend and are independent and random. As observed in
highlighted data points, the highest errors are observed not during the hightes peaks of trends and lowest are
not during the lowest peaks of trend.
In this case also we observe that the residues and seasonality are independent of the trend of the time series.
Hence the time series could be taken as Additive time series.
Q3: Split the data into training and test. The test data should start in 1991.
The dataset is not time series; hence it is ordered in a sequential manner. Splitting the data in Train and test
with test data starting from 1991, we get 2 separate data sets with below shapes and points:
Q4: Split Build various exponential smoothing models on the training data and
evaluate the model using RMSE on the test data.
4.a ) Model 1 – Simple exponential Smoothing model
Simple exponential model captures only the smoothing level α. And not really the trend and
Seasonality. Hence only one attribute of the time series.
Using brutes force method or multiple iterations – the best value if α found is 0.07028781460389563.
Which when used in below equation:
Ft+1=αYt+(1−α)FtFt+1=αYt+(1−α)Ft
Using α, and Yt values from the train dataset, the predicted values for test dataset are calculated and
plotted along with actual test dataset values:
Fig.9 – SES Test set prediction
As can be seen with SES (Simple Exponential Smoothing) for α = 0.07 we get straight line as the
forecasted sale for test data set duration which is not very close to actual dataset.
The performance of the model, can be evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 1338.0046232563645
Double exponential model captures the smoothing level α and smoothing Trend β. And not really the
Seasonality. Hence only 2 attributes of the time series Dataset .
Using brutes force method or multiple iterations – the best value found are:
The seasonality component of the time series would notbe captured still but using Holt’s model(DES),
The forecast for sales of wines is calculated for test dataset and when plotted :
Fig.10 – DES Test set prediction
As can be seen with DES (Double Exponential Smoothing) for α = 0.66 and β = 0.0001 we get straight
line as the forecasted sale for test data set duration, which is having increasing trend as well.
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 5291.8798332269125
Using auto fit in the TES model – the best value found are:
Holt’s Winter model(TES), when used for calculation of the forecast for sales of wines for test dataset
and plotted :
Fig.11 – TES Test set prediction
As can be seen with TES (Triple Exponential Smoothing) for α = 0.11 and β = 0.04 and γ = 0.36 we get
forecasted line (purple) resembling very close to the actual test dataset.
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 403.0192361612048
Triple exponential model captures autofit or default fit will provide the best values for α,β, and γ.
However it would not really mean that that is the best TES model with least RMSE.
Using multiple iterations to calculate α,β, and γ using Brute Force method and checking respective
RMSE’s, the values with least RMSE value that are picked are:
This Holt’s Winter model(TES), when used for calculation of the forecast for sales of wines for test
dataset and plotted Vs the autofit TES Model (4.c) :
Fig.12 – TES Test set prediction – 2 for Brute Force
As can be seen with TES (Triple Exponential Smoothing) for α = 0.3 and β = 0.3 and γ = 0.3 we get
forecasted line (RED) resembling the test data set more closely than the autofit TES model.
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 361.397300
The same time series dataset could be used for frecasting using Linear Regression.
For Linear Regression – the timeseries need to converted into a linear dataset.
Then based on this linear Train dataset – based on old values the future values would be predicted.
As the name suggest – its linear in nature and thus it would not be able to capture the seasonality part
of the timeseries well.
When the forecasted values for test dataset are plotted agained actual test dataset:
Fig.13 – Regression Test set prediction
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 1389.135
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 3864.279
The performance of the model, when evaluated using RMSE (Root Mean square error) which is
essentially the error in prediction of the forecasted values of sales of Sparkling wine:
RMSE = 1275.082
Q5: Check for the stationarity of the data on which the model is being built on
using appropriate statistical tests and also mention the hypothesis for the
statistical test. If the data is found to be non-stationary, take appropriate steps to
make it stationary. Check the new data for stationarity and comment.
Using Dickey–Fuller test on entire dataset of Sparkling wine. The test works with institution that if the
timeseries is stationary and broken in period or windows – and the mean for all these windows is taken, then it
has tendency to return to a constant mean. If not – then the time series is not stationary.
The rolling mean (red) if can be seen that this is not having constant mean.
The P value is 0.6 > 0.05 so the null hypothesis is correct which is Time series is NOT-STATIONARY.
Differentiating the time series by order 1 to make it stationary, we get below time series plot now :
As can be seen at 95% significance level, p-value happens to be very less than 0.05 and this null-hypothesis is
rejected in this case – making the series stationary.
After the series differentiation the Auto Corelative function (ACF) plot:
Fig.19 – ACF on Differentiated time series
As observed in the plot the pattern repeats after every 12 Laps, and there are about 2 laps in these 12 periods
that are corrected and will have predominant effect on forecasting the futuristic values. This also means p = 2
after which the first cut off happen.
Similarly the Partial Auto Correlated function (PACF) is plotted for original series:
Where we see that first 3 months data out of 12 months lags would be where the ACF will be observed more
and will have more predominant effect in forecasting. This also means q = 3 after which the first cut off happen.
Now using the various iterations to get the values of p, q and d and respective AIC (Akaike Information
Criteria) for a Lag at 12 months. And the lowest 5 values of these are :
So using the parameters where AIC value is lowest as per above table for the first record with below details :
The Model when evaluated for RMSE (Root mean square error)
RMSE = 1275.082
Q7: Check Build ARIMA/SARIMA models based on the cut-off points of ACF
and PACF on the training data and evaluate this model on the test data using
RMSE.
Again using only SARIMA as time series has a seasonality and ARIMA would not be able to capture that
aspect of the timeseries.
The p,q,d values would remain the same as that of section Q6.
For getting the values of P and Q, PACF and ACF plots are rechecked again to see the cut off points after lag
12.
α = 0.05
p=3
q=2
d=1
P=1
Q=1
D=0
S = 12
Histogram and Normal Q-Q present the difference between Actual data and Forecasted data.
The Model when evaluated for RMSE (Root mean square error)
RMSE = 325.090069
Q8: Check Build a table with all the models built along with their corresponding
parameters and the respective RMSE values on the test data.
The table of all the models thus far and their variants that were tried is below:
Q9: Based on the model-building exercise, build the most optimum model(s) on
the complete data and predict 12 months into the future with appropriate
confidence intervals/bands.
Based on above Q8 figure picking most optimum models to forecast the 12 months futuristic sales of Sparkling
wine which is model 9 – SARIMA manual fit.
RMSE = 560.0757913891757
This model is then used to forecast next 12 months data and with confidence level 95%, below is how the
forecast for next 12 months would be:
Q10: Based on the model thus built and report your findings and suggest the
measures that the company should be taking for future sales.
THE END!