Business Report TSF - Rose DataSet
Submitted to:
Concerned faculty at Great Learning
The University of Texas at Austin
Submitted By:
1. Read the data as an appropriate Time Series data and plot the data.
2. Perform appropriate Exploratory Data Analysis to understand the data and also perform
decomposition.
3. Split the data into training and test. The test data should start in 1991.
4. Build all the exponential smoothing models on the training data and evaluate the model using
RMSE on the test data. Other additional models such as regression, naïve forecast models,
simple average models, moving average models should also be built on the training data and
check the performance on the test data using RMSE.
5. Check for the stationarity of the data on which the model is being built on using appropriate
statistical tests and also mention the hypothesis for the statistical test. If the data is found to
be non-stationary, take appropriate steps to make it stationary. Check the new data for
stationarity and comment. Note: Stationarity should be checked at alpha = 0.05.
6. Build an automated version of the ARIMA/SARIMA model in which the parameters are
selected using the lowest Akaike Information Criteria (AIC) on the training data and evaluate
this model on the test data using RMSE.
7. Build ARIMA/SARIMA models based on the cut-off points of ACF and PACF on the training data
and evaluate this model on the test data using RMSE.
8. Build a table with all the models built along with their corresponding parameters and the
respective RMSE values on the test data.
9. Based on the model-building exercise, build the most optimum model(s) on the complete data
and predict 12 months into the future with appropriate confidence intervals/bands.
10. Comment on the model thus built and report your findings and suggest the measures that the
company should be taking for future sales.
1. Read the data as an appropriate Time Series data and plot the data.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 13, 6
Next, read the dataset with the pandas read command (for example, pd.read_csv) and display the first few rows:
  YearMonth   Rose
0   1980-01  112.0
1   1980-02  118.0
2   1980-03  129.0
3   1980-04   99.0
4   1980-05  116.0
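As a minimal sketch of the read step, assuming the file has a YearMonth column and a Rose column as in the output above, the month labels can be converted into a proper datetime index. The in-memory CSV and the name df_sample are stand-ins for illustration; the real file name is not given here.

```python
import io
import pandas as pd

# Stand-in for the real CSV file: the five rows shown in the output above.
csv_text = """YearMonth,Rose
1980-01,112.0
1980-02,118.0
1980-03,129.0
1980-04,99.0
1980-05,116.0
"""

df_sample = pd.read_csv(io.StringIO(csv_text))

# Convert the 'YYYY-MM' labels to timestamps and use them as the index,
# so that pandas treats the data as a monthly time series.
df_sample["YearMonth"] = pd.to_datetime(df_sample["YearMonth"], format="%Y-%m")
df_sample = df_sample.set_index("YearMonth")

print(df_sample.head())
```

With a DatetimeIndex in place, selections such as df_sample.loc['1980'] and calls such as resample work directly on the series.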
Then apply the describe() method to check basic summary statistics of the numeric
values, such as the percentiles, mean and standard deviation.
The plot of the series shows a downward trend: in 1981 the values were well above
250, whereas in 1995 they were only marginally above 50, close to 60.
Next, apply the isnull() function to check for missing (null) values
in the dataset.
Next, use the shape attribute to check the size and dimensions of the
dataframe:
(187, 1)
Then call df.info() to get further information about the dataframe.
Next, construct a boxplot in which the X label is the Rose wine sales and the Y
label is the number of years.
Next, construct a pivot table of monthly Rose wine sales across the years.
Then construct a plot to show the monthly sales across the years.
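The pivot step can be sketched as follows. This is an illustration on a hypothetical two-year series (the values are simply 0 to 23), and the helper columns Year and Month are names chosen here, not taken from the report.

```python
import numpy as np
import pandas as pd

# Hypothetical two-year monthly series standing in for the Rose data.
idx = pd.date_range("1980-01-01", periods=24, freq="MS")
toy = pd.DataFrame({"Rose": np.arange(24, dtype=float)}, index=idx)

# Explicit year and month columns make the pivot easy to read.
toy = toy.assign(Year=toy.index.year, Month=toy.index.month)

# One row per year, one column per calendar month.
monthly_by_year = pd.pivot_table(toy, values="Rose", index="Year", columns="Month")
print(monthly_by_year)
```

Calling monthly_by_year.plot() would then draw one line per month across the years, which is one way to produce the monthly plot mentioned above.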
df_decade_sum = df.resample('10Y').sum()
df_decade_sum.plot();
# statistics
from statsmodels.distributions.empirical_distribution import ECDF
cdf = ECDF(df['Rose'])
plt.plot(cdf.x, cdf.y, label = "statsmodels");
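# Inspect 1994 before and after filling the missing values
df.loc['1994']
# The keyword is `method` (not `methods`); spline interpolation requires SciPy
df.interpolate(method='spline', order=3, inplace=True)
df.loc['1994']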
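To illustrate what the interpolation step does, here is a small sketch on hypothetical values with a two-month gap. It uses linear interpolation rather than the order-3 spline used above, purely so that the result is easy to verify by hand and no SciPy installation is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly values with two consecutive missing months.
idx = pd.date_range("1994-06-01", periods=5, freq="MS")
gappy = pd.Series([40.0, np.nan, np.nan, 46.0, 51.0], index=idx, name="Rose")

# Linear interpolation fills the gap with evenly spaced values
# between the neighbouring observations (40 -> 42 -> 44 -> 46).
filled = gappy.interpolate(method="linear")

print(filled)
print("Missing values left:", int(filled.isnull().sum()))
```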
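from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition into trend, seasonal and residual components
decomposition = seasonal_decompose(df['Rose'],model='additive')
trend = decomposition.trend
seasonality = decomposition.seasonal
residual = decomposition.resid
deaseasonalized_ts = trend + residual
# Multiplicative decomposition for comparison
decomposition = seasonal_decompose(df['Rose'],model='multiplicative')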
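To make the additive decomposition less of a black box, the sketch below implements the classical version by hand: a centred moving-average trend, seasonal effects averaged per position in the cycle, and the remainder. The function name additive_decompose and the synthetic series are assumptions for illustration; this is not the statsmodels implementation, and it assumes an even period such as 12.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period=12):
    """Classical additive decomposition for an even period (e.g. 12 months)."""
    values = series.to_numpy(dtype=float)
    n = len(values)
    half = period // 2

    # Centred 2 x period moving average: weights 0.5, 1, ..., 1, 0.5 over period + 1 points.
    weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    trend_vals = np.full(n, np.nan)
    trend_vals[half:n - half] = np.convolve(values, weights, mode="valid")
    trend = pd.Series(trend_vals, index=series.index)

    # Average the detrended values for each position in the seasonal cycle.
    detrended = series - trend
    position = np.arange(n) % period
    seasonal_means = detrended.groupby(position).mean()
    seasonal_means = seasonal_means - seasonal_means.mean()  # effects average to zero
    seasonal = pd.Series(seasonal_means.to_numpy()[position], index=series.index)

    resid = series - trend - seasonal
    return trend, seasonal, resid

# Synthetic check: a straight-line trend plus a fixed monthly pattern that sums to zero.
t = np.arange(48)
pattern = np.array([-5, -4, -3, -2, -1, 0, 0, 1, 2, 3, 4, 5], dtype=float)
y = pd.Series(100 - 0.5 * t + pattern[t % 12],
              index=pd.date_range("1980-01-01", periods=48, freq="MS"))

trend_est, seasonal_est, resid_est = additive_decompose(y, period=12)
print(seasonal_est.head(12).round(2).tolist())
```

On this noise-free series the recovered seasonal component equals the pattern and the residual is zero wherever the trend is defined; on real data the residual carries whatever the trend and seasonal parts do not explain.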
3. Split the data into training and test. The test data should start in 1991.
train=df[df.index.year< 1991]
test=df[df.index.year>=1991]
print(train.shape)
(132, 1)
print(test.shape)
(55, 1)
We then print the first few rows and the last few rows of the training data.
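The split can be checked on a stand-in series with the same span as the data used here (187 monthly points starting in 1980-01). The values below are hypothetical; only the index matters for this sketch, which assumes the monthly index has no gaps.

```python
import numpy as np
import pandas as pd

# Stand-in with the same length and start date as the report's series.
idx = pd.date_range("1980-01-01", periods=187, freq="MS")
demo = pd.DataFrame({"Rose": np.linspace(250.0, 60.0, 187)}, index=idx)

# Everything before 1991 is training data; 1991 onwards is test data.
train_demo = demo[demo.index.year < 1991]
test_demo = demo[demo.index.year >= 1991]

print(train_demo.shape, test_demo.shape)
print("Last training month:", train_demo.index.max().date())
print("First test month:", test_demo.index.min().date())
```

Splitting by date rather than at random keeps the temporal order intact, so the test period lies strictly after the training period.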
4. Build various exponential smoothing models on the training data and evaluate the model
using RMSE on the test data. Other models such as regression, naïve forecast models and
simple average models should also be built on the training data and their
performance checked on the test data using RMSE.
from sklearn.linear_model import LinearRegression

LinearRegression_train = train.copy()
LinearRegression_test = test.copy()
lr = LinearRegression()
                  Test RMSE
RegressionOnTime  51.433312
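The idea behind regression on time is to number the observations 1, 2, 3, ... and fit a straight line through the training values, then extend that line over the test period. The sketch below uses np.polyfit as a stand-in for scikit-learn's LinearRegression, and the numbers are hypothetical; the rmse helper is a name introduced here.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical training and test values.
y_train = np.array([112.0, 118.0, 129.0, 99.0, 116.0, 110.0])
y_test = np.array([105.0, 101.0, 98.0])

# Time index: 1..n for training, continuing without a break for the test period.
t_train = np.arange(1, len(y_train) + 1)
t_test = np.arange(len(y_train) + 1, len(y_train) + len(y_test) + 1)

# Least-squares straight line through the training data.
slope, intercept = np.polyfit(t_train, y_train, deg=1)
pred_test = intercept + slope * t_test

print("Test RMSE:", rmse(y_test, pred_test))
```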
NaiveModel_train = train.copy()
NaiveModel_test = test.copy()
NaiveModel_test['naive'] = np.asarray(train['Rose'])[len(np.asarray(train['Rose']))-1]
NaiveModel_test['naive'].head()
                  Test RMSE
RegressionOnTime  51.433312
NaiveModel        79.718773
SimpleAverage_train = train.copy()
SimpleAverage_test = test.copy()
Average means
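Both baselines produce a flat forecast: the naive model repeats the last training value, and the simple average model repeats the mean of the training data. A minimal sketch on hypothetical values:

```python
import numpy as np

# Hypothetical training values and test length.
train_vals = np.array([112.0, 118.0, 129.0, 99.0, 116.0])
test_len = 3

# Naive forecast: every future point equals the last observed value.
naive_forecast = np.full(test_len, train_vals[-1])

# Simple average forecast: every future point equals the training mean.
simple_avg_forecast = np.full(test_len, train_vals.mean())

print("Naive:", naive_forecast)
print("Simple average:", simple_avg_forecast)
```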
# Plotting on the whole data
For 2 point Moving Average Model forecast on the Training Data, RMSE is 11.529
For 4 point Moving Average Model forecast on the Training Data, RMSE is 14.451
For 6 point Moving Average Model forecast on the Training Data, RMSE is 14.566
For 9 point Moving Average Model forecast on the Training Data, RMSE is 14.728
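A trailing moving average of window k replaces each point with the mean of the k most recent observations. The sketch below shows the 2-point and 4-point versions on hypothetical values; the column names Trailing_2 and Trailing_4 are chosen here for illustration.

```python
import pandas as pd

# Hypothetical sales values.
sales = pd.Series([112.0, 118.0, 129.0, 99.0, 116.0, 110.0])

trailing = pd.DataFrame({"Rose": sales})
for window in (2, 4):
    # rolling(window).mean() averages the current point and the window - 1 before it;
    # the first window - 1 rows have no full window and are left as NaN.
    trailing[f"Trailing_{window}"] = sales.rolling(window).mean()

print(trailing)
```

Shorter windows track the data more closely, which is consistent with the 2-point model having the lowest RMSE in the output above.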
Exponential smoothing is generally used to make short term forecasts, but longer-term forecasts
using this technique can be quite unreliable.
SES_train = train.copy()
SES_test = test.copy()
For Alpha = 0.098, Simple Exponential Smoothing Model forecast on the Test Data, RMSE is 36.796
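The recursion behind simple exponential smoothing is short enough to write out. In the sketch below the level is initialised at the first observation, which is an assumption made here; statsmodels can estimate the initial level, so its numbers will generally differ. The function name ses_forecast is hypothetical.

```python
import numpy as np

def ses_forecast(values, alpha, steps):
    """Simple exponential smoothing: level_t = alpha * y_t + (1 - alpha) * level_(t-1)."""
    values = np.asarray(values, dtype=float)
    level = values[0]  # assumption: start the level at the first observation
    for y in values[1:]:
        level = alpha * y + (1 - alpha) * level
    # The forecast is flat: every future step equals the final level.
    return np.full(steps, level)

# Hypothetical values; a small alpha reacts slowly, a large alpha follows recent data.
history = [112.0, 118.0, 129.0, 99.0, 116.0]
print(ses_forecast(history, alpha=0.1, steps=3))
print(ses_forecast(history, alpha=0.9, steps=3))
```

Because the forecast is a single constant, the method cannot follow a trend over a long test window, which is one reason longer-term forecasts from it are unreliable.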
## First we will define an empty dataframe to store our values from the loop
'Alpha=0.3,SimpleExponentialSmoothing','Alpha=0.4,SimpleExponentialSmoothing'
DES_train = train.copy()
DES_test = test.copy()
## First we will define an empty list to store our values from the loop
## (DataFrame.append was removed in pandas 2.0, so rows are collected and converted once)
rows = []
for i in np.arange(0.3,1.1,0.1):
    for j in np.arange(0.1,1.1,0.1):
        # Recent statsmodels releases name this argument smoothing_trend instead of smoothing_slope
        model_DES_alpha_i_j = model_DES.fit(smoothing_level=i,smoothing_slope=j,optimized=False,use_brute=True)
        DES_train['predict',i,j] = model_DES_alpha_i_j.fittedvalues
        DES_test['predict',i,j] = model_DES_alpha_i_j.forecast(steps=len(test))
        # Square root of the MSE gives the RMSE on any scikit-learn version
        rmse_model6_train = np.sqrt(metrics.mean_squared_error(DES_train['Rose'],DES_train['predict',i,j]))
        rmse_model6_test = np.sqrt(metrics.mean_squared_error(DES_test['Rose'],DES_test['predict',i,j]))
        rows.append({'Alpha Values':i,'Beta Values':j,'Train RMSE':rmse_model6_train,'Test RMSE':rmse_model6_test})
resultsDf_7 = pd.DataFrame(rows)
resultsDf_7.sort_values(by=['Test RMSE']).head()
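The grid search above can be illustrated end to end without statsmodels by writing Holt's double exponential smoothing by hand. This is a sketch under stated assumptions: the initial level is the first observation, the initial trend is the first difference, the data are hypothetical, and the names holt_forecast and grid_search are introduced here.

```python
import itertools
import numpy as np
import pandas as pd

def holt_forecast(values, alpha, beta, steps):
    """Holt's linear method with a simple (assumed) initialisation."""
    values = np.asarray(values, dtype=float)
    level, trend = values[0], values[1] - values[0]
    for y in values[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # h-step-ahead forecast: last level plus h times the last trend.
    return level + trend * np.arange(1, steps + 1)

def grid_search(train_vals, test_vals, alphas, betas):
    """Evaluate every (alpha, beta) pair and rank the pairs by test RMSE."""
    test_vals = np.asarray(test_vals, dtype=float)
    rows = []
    for alpha, beta in itertools.product(alphas, betas):
        pred = holt_forecast(train_vals, alpha, beta, len(test_vals))
        test_rmse = float(np.sqrt(np.mean((test_vals - pred) ** 2)))
        rows.append({"Alpha Values": alpha, "Beta Values": beta, "Test RMSE": test_rmse})
    return pd.DataFrame(rows).sort_values(by="Test RMSE").reset_index(drop=True)

# Hypothetical training and test values with a downward drift.
train_vals = [120.0, 115.0, 112.0, 108.0, 101.0, 99.0]
test_vals = [95.0, 90.0, 88.0]
results = grid_search(train_vals, test_vals, alphas=[0.3, 0.5, 0.9], betas=[0.1, 0.5])
print(results.head())
```

Collecting the rows in a list and building the DataFrame once avoids growing a DataFrame inside the loop.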
import statsmodels.api as sm
for param in pdq:
    for param_seasonal in model_pdq:
        SARIMA_model = sm.tsa.statespace.SARIMAX(train['Rose'].values,
                                                 order=param,
                                                 seasonal_order=param_seasonal,
                                                 enforce_stationarity=False,
                                                 enforce_invertibility=False)
15.732718754522914
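The SARIMA loop iterates over two lists, pdq and model_pdq, whose construction is not shown. One common way to build such grids is with itertools.product, as sketched below. The search ranges and the seasonal period of 12 are assumptions for illustration, not the values used in this report.

```python
import itertools

# Hypothetical search ranges for the non-seasonal and seasonal orders.
p = q = range(0, 3)
d = range(1, 2)
seasonal_period = 12  # assumption: monthly data with a yearly cycle

# All (p, d, q) combinations for the non-seasonal part.
pdq = list(itertools.product(p, d, q))

# All (P, D, Q, s) combinations for the seasonal part.
model_pdq = [(P, D, Q, seasonal_period) for P, D, Q in itertools.product(p, d, q)]

print("Number of candidate models:", len(pdq) * len(model_pdq))
print("First few:", pdq[:3], model_pdq[:3])
```

Each pair of tuples is then passed as order and seasonal_order, and the fitted model with the lowest AIC is kept.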
MANUAL SARIMA
A plot of the series (df.plot) is constructed to check the trend.
RMSE: 20.672560612957582
# Getting the predictions for the same number of time stamps that are present
# in the test data
prediction = fullmodel.forecast(steps=len(test))
# In the code below, we calculate the upper and lower confidence bands at the
# 95% confidence level.
# The percentile function in numpy lets us calculate these by adding to and
# subtracting from the predictions.
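The band calculation itself is not shown, so the following is only one possible reading of the comments above: take the 2.5th and 97.5th percentiles of the in-sample residuals (actual minus fitted) and add them to each forecast. The function name percentile_bands, the column names and the numbers are hypothetical, and the resulting band has constant width, unlike a model-based prediction interval.

```python
import numpy as np
import pandas as pd

def percentile_bands(prediction, residuals, level=0.95):
    """Empirical bands: forecast plus the lower and upper residual percentiles."""
    prediction = np.asarray(prediction, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    tail = (1 - level) / 2 * 100  # 2.5 for a 95% band
    lower_q, upper_q = np.percentile(residuals, [tail, 100 - tail])
    return pd.DataFrame({
        "lower_CI": prediction + lower_q,
        "prediction": prediction,
        "upper_CI": prediction + upper_q,
    })

# Hypothetical forecasts and in-sample residuals.
demo_prediction = [50.0, 48.0]
demo_residuals = np.arange(-10, 11)  # -10, -9, ..., 10
bands = percentile_bands(demo_prediction, demo_residuals)
print(bands)
```

In the report's setting, prediction would be the output of fullmodel.forecast and the residuals would come from the fitted model on the complete data.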