Predicting Walmart Sales, Exploratory Data Analysis, and Walmart Sales Dashboard
Rashmi Jeswani
Michael McQuaid
School of Information
December 2021
The project report “Predicting Walmart Sales, EDA, and Sales Dashboard” by Rashmi
Jeswani has been examined and approved by the following Examination Committee:
Michael McQuaid
Senior Lecturer
Project Committee Chair
Stephen Zilora
Professor
Dedication
Abstract
Predicting Walmart Sales, Exploratory Data Analysis, and
Walmart Sales Dashboard
Rashmi Jeswani
Data is one of the most essential commodities for any organization in the 21st century.
Harnessing data to create effective marketing strategies and to make better decisions is
critical. For a conglomerate as large as Walmart, it is necessary to organize and analyze
the large volumes of data it generates to make sense of existing performance and identify
growth potential. The main goal of this project is to understand how different factors
affect the sales of this conglomerate and how these findings could be used to create more
efficient plans and strategies directed at increasing revenue.
This paper explores the performance of a subset of Walmart stores and forecasts future
weekly sales for these stores based on several models, including linear and lasso regression,
random forest, and gradient boosting. An exploratory data analysis has been performed on
the dataset to explore the effects of different factors, like holidays, fuel price, and
temperature, on Walmart's weekly sales. Additionally, a dashboard highlighting information
about predicted sales for each of the stores and departments has been created in Power BI
and provides an overview of the overall predicted sales.
Through the analysis, it was observed that the gradient boosting model provided the most
accurate sales predictions, and modest relationships were observed between weekly sales and
factors like store size, holidays, and unemployment. Through the implementation of
interaction effects as part of the linear models, a relationship between a combination of
variables (temperature, CPI, and unemployment) and weekly sales was also observed, with a
direct impact on the sales for Walmart stores.
Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1 Project Goals and Background . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Project Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Tools and Technologies Applied . . . . . . . . . . . . . . . . . . . . . . 8
2 Purpose Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4.1 About the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.2.1 EDA I: Exploring the Dataset with ‘inspectdf’ . . . . . . . . . . 14
4.2.2 EDA II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.3 Identifying Storewide Sales . . . . . . . . . . . . . . . . . . . . . 20
4.2.4 Identifying Department-wide Sales . . . . . . . . . . . . . . . . 21
4.2.5 Identifying Average Store Sales . . . . . . . . . . . . . . . . . . . 22
4.2.6 Identifying Specific Stores and Departments with Highest Sales 23
4.2.7 Identifying Monthly Sales for Each Year . . . . . . . . . . . . . . 25
4.2.8 Identifying Week Over Week Sales for Each Year . . . . . . . . . 26
4.2.9 Impact of Size of Store on Sales . . . . . . . . . . . . . . . . . . 27
4.2.10 Week of Year Sales by Store Type . . . . . . . . . . . . . . . . . . 28
4.2.11 Impact of Temperature on Sales . . . . . . . . . . . . . . . . . . 29
4.2.12 Impact of Unemployment on Sales . . . . . . . . . . . . . . . . 30
4.2.13 Impact of CPI on Sales . . . . . . . . . . . . . . . . . . . . . . . 31
4.2.14 Impact of Fuel Price on Sales . . . . . . . . . . . . . . . . . . . . 32
4.2.15 Holiday VS Non-Holiday Sales . . . . . . . . . . . . . . . . . . . 33
4.2.16 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 Data Cleaning and Preprocessing . . . . . . . . . . . . . . . . . . . . . . 36
4.4 Model Selection and Implementation . . . . . . . . . . . . . . . . . . . 38
5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.1 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
List of Tables
1 List of holidays from the dataset . . . . . . . . . . . . . . . . . . . . . . 13
2 Description of Columns . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Model Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
List of Figures
1 A summary of the Training dataset . . . . . . . . . . . . . . . . . . . . . 12
2 A summary of the Features dataset . . . . . . . . . . . . . . . . . . . . . 13
3 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Missing Values in the Features dataset . . . . . . . . . . . . . . . . . . . 16
5 Distribution of Numerical attributes in the Features dataset . . . . . . . 17
6 Distribution of Store Types . . . . . . . . . . . . . . . . . . . . . . . . . 18
7 Correlation between attributes of the training dataset . . . . . . . . . . 19
8 Average Sales for Each Store Type . . . . . . . . . . . . . . . . . . . . . 20
9 Department Wide Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
10 Store Wide Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
11 Departments with highest sales for each store type . . . . . . . . . . . . 24
12 Stores and departments with highest sales for each store type . . . . . . 25
13 Overall Monthly Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
14 Week Over Week Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
15 Impact of Size of Store on Sales . . . . . . . . . . . . . . . . . . . . . . . 28
16 Week of Year Sales Based on Store Type . . . . . . . . . . . . . . . . . . 29
17 Impact of Temperature on Sales . . . . . . . . . . . . . . . . . . . . . . 30
18 Impact of Unemployment on Sales . . . . . . . . . . . . . . . . . . . . . 31
19 Impact of CPI on Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
20 Impact of Fuel Price on Sales . . . . . . . . . . . . . . . . . . . . . . . . 33
21 Holiday Versus Non-Holiday Sales . . . . . . . . . . . . . . . . . . . . . 34
22 Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
23 Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
24 WMAE Rate for Linear Regression . . . . . . . . . . . . . . . . . . . . . 39
25 Multiple Linear Regression in R Summary: Coefficients and F-Statistic . 41
26 Trained Multiple Linear Regression Model Summary . . . . . . . . . . . 42
27 Multiple Linear Regression: Studying Interaction Effects . . . . . . . . 43
28 WMAE Rate for Lasso Regression . . . . . . . . . . . . . . . . . . . . . 44
29 Lasso Regression: Optimal Lambda value (7.49) . . . . . . . . . . . . . 46
30 Lasso Regression Coefficients . . . . . . . . . . . . . . . . . . . . . . . . 46
31 WMAE Rate for Gradient Boosting Machine . . . . . . . . . . . . . . . 47
32 Feature Importance for Gradient Boosting Machine . . . . . . . . . . . 48
33 WMAE Rate for Gradient Boosting Machine: Tuned Parameters . . . . 48
1 Introduction
1.1 Project Goals and Background
The 21st century has seen an explosion of data generated by the continuous use of
ever-growing technology. Retail giants like Walmart consider this data their biggest asset,
as it helps them predict future sales and customer behavior and lay out plans to generate
profits and compete with other organizations. Walmart is an American multinational retail
corporation that has almost 11,000 stores in over 27 countries, employing over 2.2 million
associates (Wikipedia, n.d.).
Catering to its customers with the promise of 'everyday low prices', Walmart's range of
products drives its yearly revenue to almost 500 billion dollars, making it extremely
crucial for the company to utilize extensive techniques to forecast future sales and the
profits that follow. The world's largest company by revenue, Walmart sells everything from
groceries, home furnishings, and body care products to electronics and clothing, and it
generates a large amount of consumer data that it utilizes to predict customer buying
patterns, future sales, and promotional plans, and to create new and innovative in-store
technologies. The employment of modern technological approaches is crucial for the
organization to survive in today's cutting-edge global market and to create products and
services that distinguish it from its competitors.
The main focus of this research is to predict Walmart's sales based on the available
historic data and to identify whether factors like temperature, unemployment, and fuel
prices affect the weekly sales of the particular stores under study. This study also aims
to understand whether sales are relatively higher during holidays like Christmas and
Thanksgiving than on normal days, so that stores can work on creating promotional offers
that increase sales and generate higher revenue.
Walmart runs several promotional markdown sales throughout the year, on days immediately
following the prominent holidays in the United States; it is therefore crucial for the
organization to determine the impact of these promotional offerings on weekly sales so it
can drive resources toward such key strategic initiatives. It is also essential for
Walmart to understand user requirements and buying patterns in order to improve customer
retention and, in turn, demand and profits. The findings from this study can help the
organization understand market conditions at various times of the year and allocate
resources according to regional demand and profitability.
Additionally, the application of big data analytics helps analyze past data efficiently to
generate insights, identify stores that might be at risk, predict and increase future sales
and profits, and evaluate whether the organization is on the right track.
The analysis for this study has been done using SQL, R, Python, and Power BI on the dataset
provided by Walmart Recruiting on Kaggle (“Walmart Recruiting - Store Sales Forecasting,”
2014). The modeling and the exploratory data analysis have been performed in R and Python,
aggregation and querying have been performed using SQL, and the final dashboard has been
created using Power BI.
2 Purpose Statement
The purpose of this study is to predict the weekly sales for Walmart based on available
historical data (collected between 2010 and 2013) from 45 stores located in different
regions around the country. Each store contains a number of departments, and the main
deliverable is to predict the weekly sales for all such departments.
The data has been collected from Kaggle and contains the weekly sales for 45 stores,
the size and type of store, department information for each of those stores, the amount
of weekly sales, and whether the week is a holiday week or not. There is additional
information in the dataset about the factors that might influence the sales of a particular
week. Factors like Consumer Price Index (CPI), temperature, fuel price, promotional
markdowns for the week, and unemployment rate have been recorded for each week
to try and understand if there is a correlation between the sales of each week and their
determinant factors.
3 Related Work
Studies have previously been performed to predict sales for retail industry corporations
based on the availability of relevant historic data. Several authors from the Fiji National
University and The University of the South Pacific analyzed the Walmart dataset to pre-
dict sales (“Walmart’s Sales Data Analysis - A Big Data Analytics Perspective,” 2017).
Tools like Hadoop Distributed File Systems (HDFS), Hadoop MapReduce framework,
and Apache Spark along with Scala, Java, and Python high-level programming envi-
ronments were used to analyze and visualize the data. Their study also aimed at trying
to understand whether the factors included in the dataset have any impact on the sales
of Walmart.
In 2015, Harsoor and Patil (Harsoor & Patil, 2015) worked on forecasting sales for Walmart
stores using big data applications (Hadoop, MapReduce, and Hive) so that resources are
managed efficiently. Their paper used the same sales data set that has been used for
analysis in this study; however, they forecasted the sales for the upcoming 39 weeks using
the Holt-Winters algorithm. The forecasted sales are visually represented in Tableau using
bubble charts.
This study also examines the interaction effects between the multiple independent variables
in the dataset, like unemployment, fuel prices, and CPI, and tries to find whether there is
a relationship between a combination of these factors and weekly sales.
A further extension of predictive techniques relevant to this study involves the
implementation of random forest algorithms to create predictive models. A study conducted
by researchers at San Diego State University (Lingjun et al., 2018) highlights
the importance of this tree-based machine learning algorithm over other regression
methods to create predictive models for the higher education sector. With their study,
the authors use a standard classification and regression tree (CART) algorithm along
with feature importance to highlight the importance of using random forest algorithms
with prediction problems in Weka and R and compare their efficacy with several other
models like lasso regression and logistic regression.
The purpose of this review was to identify similar practices utilized by other authors
or researchers when creating a predictive model influenced by several independent
variables. It is clear from the research reviewed above that a significant number of these
authors make use of a combination of tools and techniques to create efficient models
and tend to compare their findings across all models to select the best-performing model
based on their dataset. Just as Harsoor and Patil, along with Chouksey and Chauhan, make
use of Hadoop, MapReduce, and Hive to generate predictions, this study makes use of several
algorithms like linear and lasso regression, random forest, etc., and also studies
interaction effects for multiple regression to make predictions.
Performing a comparative analysis with several models is crucial to ensure that the
predictions are accurate and that they are not limited in scope. Testing out multiple
models is also necessary for this study as models tend to perform differently based on
the nature and size of the data.
4 Methodology
The paper comprises several different components that explore various aspects of the 45
Walmart stores used in this study. The methodology section is broken down into several
sub-sections that follow a 'top-down' approach to the process followed in this analysis.
This section contains detailed information about the dataset and the exact techniques used
in forecasting weekly sales; the last sub-section discusses how this study is significant
in predicting the weekly sales for Walmart stores and how successful the applied models
were in identifying the effect of different factors on such weekly sales.
4.1 About the Dataset
In addition to the training data, a dataset called 'stores.csv' contains more detailed
information about the type and size of the 45 stores used in this study.
Another big aspect of this study is to determine whether there is an increase in weekly
store sales because of changes in temperature, fuel prices, holidays, markdowns, the
unemployment rate, and fluctuations in the consumer price index. The file 'features.csv'
contains all necessary information about these factors and is used in the analysis to
study their impact on sales performance.
The holiday information listed in the study is shown in Table 1.
The final file called ‘sampleSubmission.csv’ contains two main columns: dates for
each of the weeks in the study as well as a blank column that should be utilized to record
predicted sales for that week based on the different models and techniques applied.
The results of the most accurate and efficient model have been recorded in this file
and the final Power BI dashboard has been created based on these predicted values, in
conformity with the ‘stores’ and ‘features’ dataset.
The dataset contains the following five main CSV files: train, test, features, stores,
and sampleSubmission. Each file contains crucial information relevant to this study;
the most important aspects are discussed below:
A visualization from the 'inspectdf' package, generated with the 'inspect_types()'
function for the features, train, and stores datasets, can also be observed in Figure 3.
(a) Data Type: Stores Dataset (b) Data Type: Features Dataset (c) Data Type: Training Dataset
Another important function from the package is 'inspect_na()', which highlights the
missing values in each column of the datasets used for this study. Apart from the
'features.csv' file, no other dataset used had missing values in any of its columns. The
major missing values in the features dataset come from the markdown columns that contain
information about the different promotional activities happening at different Walmart
stores. A reason behind such a massive number of missing values in these columns is the
seasonal promotional prices set by the stores during holidays (which mostly run from
November until January).
The next function from the package looks at the distribution of the numeric variables
using histograms created from identical bins. Considering that the features dataset has
the most numeric variables, it is the only one looked at in detail. According to the
package website, 'The hist column is a list whose elements are tibbles each containing
the relative frequencies of bins for each feature. These tibbles are used to generate the
histograms when show_plot = TRUE'. The histograms are represented through heat plot
comparisons using Fisher's exact test to highlight the significance of values within the
column; the higher the significance, the redder the data label (Rushworth, n.d.).
After looking at the distribution of numeric variables, the study proceeds to the
distribution of categorical variables. This is done using 'inspect_cat()', which looks at
the complete distribution of values in the categorical columns. Considering there are not
many categorical variables in the datasets used for this study (as seen above through
'inspect_types()'), the only relevant information gathered using the function was the
distribution of store types in the 'stores' dataset.
From the image above, it is clear that a majority of the Walmart stores included in
this study belong to Type ‘A’. This will be briefly discussed in the coming sections of the
study when advanced EDA will be used to answer some important questions.
The last significant function from the 'inspectdf' package is 'inspect_cor()'. This
function, in a nutshell, enables users to look at Pearson's correlation between different
variables in a dataframe. Understanding beforehand whether there is an association between
variables answers a lot of questions about which variables affect the weekly sales most
significantly.
A look at the correlation between the variables of the training dataset shows that while
there is a slight association between weekly sales and department, it is not particularly
significant (the higher the R-value, the higher the significance).
4.2.2 EDA II
The second section under this Exploratory Data Analysis looks at advanced and exten-
sive visualizations that answer some crucial questions about the Walmart dataset, as
listed in the purpose statement.
After inspecting crucial elements in each of the data frames about the types of vari-
ables, their distribution, correlation, and association, etc. using ‘inspectdf’, more de-
tailed and summarized information about the weekly sales for each department/store
and the effect of various factors on the weekly sales are studied here. This is performed
using a combination of R and Python packages like ‘ggplot2’, ‘matplotlib’, ‘seaborn’,
‘plotly’, and several others.
This section aims at looking at several aspects of the Walmart dataset, reflected in the
sub-sections that follow, and at additional crucial information that stems from these
criteria.
Based on the bar graph, it is clear that Type 'A' stores have the highest sales compared
to the other two store types. This suggests a direct relationship between the size of a
store and its corresponding sales. Type 'B' stores have the second-highest number of
stores as well as the second-highest average sales, supporting this assumption.
After looking at the average sales for each store type, the next focus is to look at the
sales for each department associated with the stores.
After looking at the average sales for the stores and departments, it is also impera-
tive to look at the breakdown of sales for both the stores as well as departments. This
breakdown will include looking at average sales for each year, week over week sales,
monthly sales, week of year sales, etc. Each of these will throw more light on customers’
buying patterns over different periods.
It will also help in evaluating whether sales go up during certain time periods as
compared to normal average sales. I will also look at the average sales for different
stores associated with the three store types and evaluate which store number has the
highest sales.
Figure 11. Departments with highest sales for each store type
The histogram generated in Figure 11 highlights the top 15 departments with the highest
revenue for each of the three store types. As observed in the figure, departments 92, 95,
38, 72, and 90 are the top revenue-generating departments across all three store types.
Although there is no information in this dataset about the nature of these departments,
it is still safe to establish that Walmart can increase revenue by including the
above-mentioned departments in its stores.
The figure below (Figure 12) consists of a breakdown of stores and their respective
types for the four largest departments.
In conclusion, for all store types listed in this study, departments 95, 92, 38, 90, and
72 generate some of the highest revenue for Walmart stores.
Figure 12. Stores and departments with highest sales for each store type
The week over week overview again helps us in understanding if there is an increase in
sales during holiday weeks each year, i.e. the weeks of Thanksgiving, Christmas, Labor
Day, etc.
There is an evident hike in sales in weeks 47 and 51, which correspond to Thanksgiving and
Christmas respectively, proving again that sales rise during the holiday season. Due to
the insufficiency of data for the year 2012, these conclusions have only been made based
on the data available from 2010 and 2011. This graph also shows a distinct pattern of
decline immediately following Christmas and New Year's.
After studying the overall sales summaries of different components of the Walmart
dataset, this report will now throw light upon the effect of different factors (such as
holidays, markdowns, CPIs, unemployment, etc.) on the weekly sales. It has always
been an integral part of this study to understand the effect that these factors have on
Walmart’s sales performance. I will also create several visualizations that shed light on
the difference in Walmart store sales on holidays versus non-holiday days, the impact
of store size and type on weekly sales, and finally create a correlation matrix to examine
the correlation between the many factors included in the study.
• For the given store types, there seems to be a visible decrease in sales when the
unemployment index is higher than 11
• Even when the unemployment index is higher than 11, there is no significant
change in the average sales for Type C stores when compared to the overall sales
• There seems to be a significant drop in sales for store types A and B when the
unemployment index increases
• Highest recorded sales for store types A and B occur around the unemployment
index of 8 to 10; this gives ambiguous ideas about the impact of unemployment
on sales for each of the stores
sales = merge_train.groupby('IsHoliday')['Weekly_Sales'].mean()
counts = merge_train.IsHoliday.value_counts()
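The snippet above computes the average weekly sales for holiday and non-holiday weeks. A
minimal sketch of how these averages might be charted with matplotlib is shown below; the
labels and colors are illustrative rather than taken from the original notebook.

# Plotting average weekly sales for holiday vs. non-holiday weeks (illustrative sketch)
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(['Non-Holiday', 'Holiday'], [sales[False], sales[True]], color=['#888888', '#114D77'])
ax.set_ylabel('Average Weekly Sales')
ax.set_title('Holiday vs Non-Holiday Sales')
plt.show()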
The correlation matrix is based on Pearson's correlation coefficient:

r = Cov(x, y) / (σx · σy)    (1)

with
• Cov(x, y) = Σ(x − x̄)(y − ȳ) / (n − 1)
• σx = √( Σ(x − x̄)² / (n − 1) )
• σy = √( Σ(y − ȳ)² / (n − 1) )
The heatmap/correlation matrix in Figure 22, created using the seaborn library in
Python (Szabo, 2020) gives the following information:
• There is a slight correlation between weekly sales and store size, type, and de-
partment
• Markdowns 1-5 also seem to have no distinct correlation with weekly sales, thus
they are not as important a factor in the study
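The heatmap itself can be reproduced with a few lines of seaborn. The sketch below assumes
the merged training dataframe (merge_train) from the appendix and keeps only its numeric
columns, so the exact column set may differ slightly from Figure 22.

# Correlation matrix heatmap (sketch; the column selection is an assumption)
import seaborn as sns
import matplotlib.pyplot as plt

corr = merge_train.select_dtypes('number').corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()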
4.3 Data Cleaning and Preprocessing
Data has been checked for inaccuracies and missing or out-of-range values using the
'inspectdf' package in R as part of the initial EDA, and columns with missing values have
been dropped. The weekly sales in the dataset were initially broken down to acquire
monthly as well as quarterly sales for the analysis; however, that information is not
utilized during the modeling process.
The boolean ‘isHoliday’ column in the dataset contains information about whether
the weekly date was a holiday week or not. As observed in the EDA above, sales have
been higher during the holiday season as compared to non-holiday season sales, hence
the ‘isHoliday’ column has been used for further analysis.
Furthermore, as part of this data preprocessing step, I have also created input and target
data frames along with training and validation datasets that help accurately measure the
performance of the applied models. In addition, feature scaling (Vashisht, 2021) has been
applied to normalize the different data attributes. This has primarily been done to bring
the independent variables in the training and testing datasets onto the same range (0, 1)
and provide more accuracy.
Also referred to as normalization, this method uses a simple min-max scaling technique,
implemented in Python using the Scikit-learn (sklearn) library (Pedregosa et al., 2011).
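A minimal sketch of this preprocessing step is shown below; it mirrors the appendix code,
but the use of train_test_split and the 25% validation proportion are assumptions rather
than the exact configuration used.

# Input/target frames, train/validation split and min-max scaling (sketch)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

input_cols = [c for c in merge_train.columns if c != 'Weekly_Sales']
inputs, targets = merge_train[input_cols].copy(), merge_train['Weekly_Sales'].copy()

# Scale every independent variable into the (0, 1) range
scaler = MinMaxScaler().fit(inputs[input_cols])
inputs[input_cols] = scaler.transform(inputs[input_cols])

# Hold out a validation set to measure model performance
train_inputs, val_inputs, train_targets, val_targets = train_test_split(
    inputs, targets, test_size=0.25, random_state=42)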
Lastly, as part of this competition, Walmart asked that participants assess the accuracy
of models using the Weighted Mean Absolute Error (WMAE) (“Walmart Recruiting - Store Sales
Forecasting,” 2014), defined as

WMAE = (1 / Σᵢ wᵢ) · Σᵢ₌₁ⁿ wᵢ |yᵢ − ŷᵢ|
where
• 𝑛 is the number of rows
• 𝑦𝑖̂ is the predicted sales
• 𝑦𝑖 is the actual sales
• 𝑤𝑖 are weights. 𝑤 = 5 if the week is a holiday week and 1 otherwise
The Weighted Mean Absolute Error is one of the most common metrics used to measure
accuracy for continuous variables (JJ, 2016).
A WMAE function has been created that provides a measure of success for the dif-
ferent models applied. It is the average of errors between prediction and actual ob-
servations, with a weighting factor. In conclusion, the smaller the WMAE, the more
efficient the model.
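A sketch of such a WMAE function is given below; it follows the competition definition
above, deriving the weights from the IsHoliday column of the input frame (the exact
implementation used in the notebooks may differ slightly).

# Weighted Mean Absolute Error: holiday weeks weigh 5, all other weeks weigh 1
import numpy as np

def WMAE(inputs, actuals, predictions):
    weights = np.where(inputs['IsHoliday'], 5, 1)
    return np.sum(weights * np.abs(actuals - predictions)) / np.sum(weights)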
Considering that the WMAE value for the validation data is extremely high, linear
regression cannot be considered a very efficient model for this analysis.
Apart from the 'scikit-learn' library in Python, further analysis and cross-validation
have been performed on the training dataset using the lm() function in R (another linear
regression model was created using the variables pre-selected in the data preprocessing
above). The lm() function takes two main arguments, namely:
• Formula
• Data
The data is typically a data.frame and the formula is an object of the class formula
(Prabhakaran, 2016). Based on the linear model, the coefficients, F-statistic, RSE score,
etc. have been calculated as shown below and in Figure 25.
An important step in analyzing the linear model is to look at the variable coefficients
and the F-statistic at the end of the model summary. This summary table highlights the
estimates of the regression beta coefficients and their respective p-values and helps in
identifying the variables that are significantly related to weekly sales.
The link between a predictor variable and the response is described by regression
coefficients, which are estimations of unknown population parameters. Coefficients
are the values that multiply the predictor values in linear regression (Frost, 2021). The
value of a coefficient determines the relationship between a predictor and response
variable: a positive relationship is indicated by a positive sign and vice versa. Another
metric calculated in the model is the F-statistic, which tells whether a group of
variables is statistically significant and helps analyze whether to support or reject the
null hypothesis. It is almost always used together with the p-value to determine the
significance of a model by studying all variables that are significant.
As per the figure above, variables with p-values lower than 2.2e-16 have a significant
relationship with the dependent variable. This means that variables like department,
temperature, size, and isHoliday have some statistical relationship with weekly sales.
For the F-statistic, the lm function runs an ANOVA test to check the significance of the
overall model; here the null hypothesis is that the model is not significant. Since
p < 2.2e-16, the model is significant.
The summary also provides the R squared values that determine how well our data
fits our linear model, in simple terms, it tells us about the goodness of fit of the model.
This value tells us how close the predicted data values are to the fitted regression line.
In general, if the differences between the actual values and the model’s predicted values
are minimal and unbiased, the model fits the data well (Editor, 2013). R square values
range from 0 to 1, hence the closer the value of R-square to 1, the better the fit. However,
as shown in Figure 25, the model has a relatively small R-square value of 0.08 which
suggests that the model might not exactly be a good fit.
The RSE (Residual Standard Error), or sigma, gives a measure of the error of prediction;
the lower the RSE, the more accurate the model. Here it has been computed relative to the
mean of weekly sales:
sigma(linearmodel)/mean(WeeklySales)
The resulting value of 1.35 is relatively high, denoting high standard error and lower
efficiency.
As the next step, I utilized a cross-validation or rotation estimation technique that
resamples the data under study and trains the data iteratively using different chunks of
data to avoid overfitting.
According to the figure above, even with ten-fold cross-validation, the R-squared value
is still too low to establish statistical significance for the model.
Interaction effects have been studied for the following five main variables:
• Temperature
• Unemployment
• CPI
• Fuel Price
• IsHoliday
A combination of these factors has been studied to understand if more than one
factor jointly contributes towards the increase or decrease of weekly sales for Walmart
stores under study.
For this, the 'tidyverse' and 'caret' packages have been used in R for data manipulation
and for implementing machine learning model functions, respectively. Just as the linear
regression model was created, a similar model has been created to study these interaction
effects; however, instead of using all independent variables from the dataset, only the
five main ones listed above are used in the linear regression model. The '*' operator
helps study interaction effects between the several independent variables and the
dependent variable in the linear regression model (kassambara, 2018).
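The interaction model described here was fitted in R with lm(); purely for illustration, an
equivalent sketch using Python's statsmodels formula interface is shown below, where the
'*' operator likewise expands to main effects plus all their interactions.

# Interaction-effects regression, illustrated with statsmodels (not the original R code)
import statsmodels.formula.api as smf

interaction_model = smf.ols(
    'Weekly_Sales ~ Temperature * CPI * Unemployment * Fuel_Price * IsHoliday',
    data=merge_train).fit()
print(interaction_model.summary())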
The model specified with the '*' operator includes the main (additive) effects of the
specified variables as well as every possible interaction between them, and it evaluates
the significance of each combination. As per the figure above, significant interactions
can be identified using the coefficient estimates as well as the p-values for the
combinations of variables. Based on the figure, the only significant relationship seems
to exist between 'Temperature', 'CPI', and 'Unemployment'. This suggests that a
combination of these three independent variables can have a significant impact on the
response variable, i.e. ideal values for the three independent variables can signify a
direct relationship with the dependent variable, meaning higher sales in ideal conditions.
There also seems to be a slight relationship between 'Temperature', 'Fuel Price', and
'CPI', but it is not as significant as the previous relationship.
One important thing to note about the model is that the R-square value for the
model is very low, denoting a very poor fit of the data points around its mean. Hence,
as a result of low statistical significance, this model might not be the best method to
establish proof of the existence of such relationships.
# Creating model
# (assumes `from sklearn.linear_model import Lasso`; the alpha value shown is illustrative)
lasso = Lasso(alpha=1.0)
lr = lasso.fit(train_inputs, train_targets)
Just like the linear regression model, the lasso regression model does not perform well on
the validation data, giving a relatively high WMAE value. Hence, it cannot be considered
an efficient model for the study.
Some further analysis has been done for the model in R using the 'glmnet' package. With
the help of the cross-validated glmnet function (i.e. cv.glmnet()), an optimal value for
lambda is obtained from the model. The lowest point on the curve represents the optimal
lambda: the log value of lambda that best minimizes the cross-validation error.
The model is then rebuilt with this optimal lambda value, with alpha as 1, and
coefficients are calculated. We can again observe positive coefficients for variables like
temperature, type, and isHoliday that depict the slight statistical significance of such
variables with weekly sales as per Figure 30.
The R-squared value for the model has been calculated using the Sum of Squares
Total (SST) and Sum of Squares Error (SSE) values as observed below. With a very low
R2 value of 0.06, the model cannot be considered a good fit for predicting weekly sales
for Walmart.
The next model applied is a gradient boosting machine (GBM); its initial WMAE rate is
shown in Figure 31. As observed from the figure, the error rate for the initial model is
high; several tuning parameters have been applied to the model below to reduce this error.
One important aspect of gradient boosting is that, once the boosted trees have been
constructed, it is straightforward to capture the importance of each attribute. An
importance score is assigned to each attribute by the gradient boosting model, which
ranks and compares attributes with each other. The importance of an attribute is weighted
by the amount that each split on it improves the performance metric, weighted by the
number of observations for which the node is responsible. The performance measure could
be the purity (Gini index) used to select split points, or another more specific error
function. The significance of the features is then averaged over all decision trees of
the model (Brownlee, 2016).
Consistent with the findings in the EDA, the ‘Dept’ and ‘Size’ attributes are the two
most important features that affect weekly sales for the retail giant.
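A sketch of how this importance ranking can be extracted and plotted from the fitted model
is shown below; it assumes the XGBRegressor instance (gbm) and training inputs from the
appendix code.

# Ranking features by importance from the fitted gradient boosting model (sketch)
import pandas as pd
import matplotlib.pyplot as plt

feature_imp = pd.DataFrame({
    'feature': train_inputs.columns,
    'importance': gbm.feature_importances_
}).sort_values('importance', ascending=False)

feature_imp.plot.barh(x='feature', y='importance', figsize=(8, 6), legend=False)
plt.xlabel('Importance')
plt.show()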
There are several tuning parameters associated with the model which can assist in reducing
the WMAE rate and attaining higher accuracy. Under the scikit-learn library in Python,
parameters like 'min_samples_split', 'max_depth', 'min_samples_leaf', 'max_features',
'n_estimators', 'random_state', etc. can be used to tune the GBM model.
For the purpose of this study, the 'random_state' (seed that generates random numbers for
parameter tuning), 'n_jobs' (reduces computation time through parallel processing),
'n_estimators' (number of sequential trees modeled), 'max_depth' (maximum depth of a tree,
to control overfitting), and 'learning_rate' (impact of each tree on the model outcome)
(Jain, 2016) parameters have been applied, resulting in the following WMAE rate:
Figure 33. WMAE Rate for Gradient Boosting Machine: Tuned Parameters
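A minimal sketch of applying such tuning parameters is shown below; the specific values
for n_estimators, max_depth, and learning_rate are illustrative placeholders rather than
the final values chosen for the study, and the WMAE helper and train/validation split are
those referenced earlier.

# Tuned gradient boosting model (parameter values are illustrative, not the final ones)
from xgboost import XGBRegressor

gbm_tuned = XGBRegressor(random_state=42, n_jobs=-1,
                         n_estimators=500, max_depth=8, learning_rate=0.1)
gbm_tuned.fit(train_inputs, train_targets)
print(WMAE(val_inputs, val_targets, gbm_tuned.predict(val_inputs)))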
A random forest regression model has been built next. As with gradient boosting, the error
rate for the initial model is high, and several tuning parameters have been applied to the
model below to reduce this error.
Just like the gradient boosting model, the feature importance plot below highlights the
most important attributes in the random forest model; the 'feature_importances_' attribute
of Python's scikit-learn 'RandomForestRegressor' is used to compute importance for each of
the attributes using the Gini index (Płoński, 2020). According to the feature importance
barplot below, the 'Dept' and 'Size' attributes impact weekly sales the most, which
corresponds to the correlation matrix findings in the EDA.
This model was also tuned based on parameters similar to those previously applied for the
tuning of the GBM model, and it includes some additional parameters like
'min_samples_split' (minimum samples required in a node for a split), 'min_samples_leaf'
(minimum samples required in a leaf), 'max_samples' (maximum samples drawn to train each
tree), and 'max_features' (maximum number of features considered for a split) (Jain, 2016).
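As with the GBM, the sketch below illustrates how the tuned random forest might be
configured; the parameter values are placeholders for illustration rather than the exact
values used.

# Tuned random forest model (parameter values are illustrative)
from sklearn.ensemble import RandomForestRegressor

rf_tuned = RandomForestRegressor(random_state=42, n_jobs=-1,
                                 n_estimators=200, max_depth=20,
                                 min_samples_split=4, min_samples_leaf=2,
                                 max_features=0.8, max_samples=0.8)
rf_tuned.fit(train_inputs, train_targets)
print(WMAE(val_inputs, val_targets, rf_tuned.predict(val_inputs)))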
With the parameters set, the new WMAE rate for the Random Forest model is reported in
Table 3.
The general idea when trying to select the most efficient model would have sug-
gested looking at the R² (proportion of variance explained), the Root Average Squared
Error (RASE), or the p-values to generate a predictive model that can inform decision-
makers about what output is expected given certain input (Yu et al., 2018).
Hence, in addition to WMAE, several other metrics have been created for the sake
of this study.
Table 3. Model Performance (WMAE on validation data)

Model                      WMAE (Validation)
Linear Regression          14882.76
Lasso Regression           14881.42
GBM (Tuned)                1338.63
Random Forest (Tuned)      1589.4
This file is then merged with the ‘stores’ file that contains information about the
type and size of the store as well as holiday information. All these columns will be
used to create several visualizations that track weekly predicted sales for various stores
and departments, sales based on store size and type, etc. The dashboard also provides
detailed information about stores and departments that generate the highest revenue
and their respective store types. The PDF file contains brief information about all the
visualizations created in the dashboard.
The dashboard can be found in the final submitted folder. If a user does not have
access to Power BI, a PDF export of the entire dashboard is included along with the
.pbix file that contains all of the created visualizations and reports in the dashboard.
Some views of the dashboard created are included below:
(a) Dashboard 1
(b) Dashboard 2
Based on the visualizations created from the predicted sales, the following observations
can be made:
• Sales still seem to be the highest during the holiday season (in the months of November
and December)
• Store size is still a major factor that affects sales; the bigger the store, the higher
the sales. Type A stores still have the highest sales, followed by Types B and C
• Stores 4, 14, and 20 were the three stores with the highest historical sales; other than
store 14, stores 4 and 20 still have the highest predicted sales
• Departments 92, 95, 38, and 72 still have the highest sales for all three store types,
as per Figure 41
5 Conclusion
5.1 Overall Results
The main purpose of this study was to predict Walmart's sales based on the available
historic data and to identify whether factors like temperature, unemployment, and fuel
prices affect the weekly sales of the particular stores under study. The study also aimed
to understand whether sales are relatively higher during holidays like Christmas and
Thanksgiving than on normal days, so that stores can work on creating promotional offers
that increase sales and generate higher revenue.
As observed through the exploratory data analysis, store size and holidays have a direct
relationship with high Walmart sales. It was also observed that, out of all the store
types, Type A stores gathered the most sales for Walmart. Additionally, departments 92,
95, 38, and 72 accumulate the most sales across all three store types; across the 45
stores, the presence of these departments in a store is associated with higher sales.
Pertaining to the specific factors provided in the study (temperature, unemployment, CPI,
and fuel price), it was observed that sales do tend to go up slightly during favorable
climate conditions as well as when fuel prices are favorable. However, it is difficult to
make a strong claim about this assumption considering the limited scope of the training
dataset provided as part of this study. Based on the observations in the exploratory data
analysis, sales also tend to be relatively higher when the unemployment level is lower.
Additionally, with the dataset provided for this study, there does not seem to be a
relationship between sales and the CPI index. Again, it is hard to make a substantial
claim about these findings without a larger training dataset with additional information
available.
Interaction effects were studied as part of the linear regression model to identify
whether a combination of different factors could influence the weekly sales for Walmart.
This was necessary because of the high number of predictor variables in the dataset. While
the interaction effects were tested on a combination of significant variables, a
statistically significant relationship was only observed between the independent variables
of temperature, CPI, and unemployment and weekly sales (the response variable). However,
this is not definite because of the limitations of the training data.
Attempts to identify relationships between independent and target variables were made
through EDA components like the correlation matrix and scatter plots, through the feature
importance plots created as part of the random forest and gradient boosting models, and
through the interaction effects. Although no significant relationships were found between
weekly sales and factors like temperature, fuel price, store size, department, etc. in the
correlation matrix (Figure 22), some significant relationships were observed between
weekly sales and store size and department in the feature importance plots created as part
of the gradient boosting and random forest models. Considering that the performance of
both these models was significantly better than the performance of the regression models,
it can be concluded that a non-linear, statistically significant relationship exists
between these independent and target variables.
Finally, the tuned Gradient Boosting model, with the lowest WMAE score, is the
main model used to create the final predictions for this study. These predictions can be
found in the ‘sampleSubmissionFinal.csv’ file and a visualized report of the outcome
can be seen in the Power BI dashboard.
5.2 Limitations
A huge constraint of this study is the lack of sales history data available for analysis.
The data comes from a limited number of Walmart stores between the years 2010 and 2013.
Because of this limited history, models cannot be trained efficiently enough to give
accurate results and predictions, and it is harder to tune them, since an over-constrained
model might reduce accuracy. An appropriate amount of training data is required to
efficiently train a model and draw useful insights.
Additionally, the models created have been developed based on certain preset as-
sumptions and business conditions; it is harder to predict the effects of certain eco-
nomic, political, or social policies on the sales recorded by the organization. Also, it
is tough to predict how the consumer buying behavior changes over the years or how
the policies laid down by the management might affect the company’s revenue; these
factors can have a direct impact on Walmart sales and it is necessary to constantly study
the market trends and compare them with existing performance to create better policies
and techniques for increased profits.
5.3 Future Work
Future work could include extending the analysis to focus resources on profitable regions,
and identifying ways to improve products and services in specific regions or for specific
customers.
Another aspect that would be worth exploring with this study is identifying trends
with sales for each of the stores and predicting future trends based on the available sales
data. Time series forecasting can be utilized (ARMA and ARIMA modeling) to predict
future sales for each of the stores and their respective departments.
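As an illustration of this proposed direction, the sketch below fits a simple non-seasonal
ARIMA model to one store's weekly sales using statsmodels; the store choice and the
(p, d, q) order are arbitrary placeholders.

# Illustrative future-work sketch: ARIMA forecast for one store's weekly sales
from statsmodels.tsa.arima.model import ARIMA

store1_sales = (merge_train[merge_train['Store'] == 1]
                .groupby('Date')['Weekly_Sales'].sum().sort_index())
arima_fit = ARIMA(store1_sales, order=(2, 1, 2)).fit()
print(arima_fit.forecast(steps=8))   # forecast the next eight weeks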
References
Bakshi, C. (2020). Random forest regression. https://levelup.gitconnected.com/random-forest-regression-209c0f354c84
Bari, A., Chaouchi, M., & Jung, T. (n.d.). How to utilize linear regressions in predictive analytics. https://www.dummies.com/programming/big-data/data-science/how-to-utilize-linear-regressions-in-predictive-analytics/
Baum, D. (2011). How higher gas prices affect consumer behavior. https://www.sciencedaily.com/releases/2011/05/110512132426.htm
Brownlee, J. (2016). Feature importance and feature selection with XGBoost in Python. https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/
Chouksey, P., & Chauhan, A. S. (2017). A review of weather data analytics using big data. International Journal of Advanced Research in Computer and Communication Engineering, 6. https://ijarcce.com/upload/2017/january-17/IJARCCE%2072.pdf
Crown, M. (2016). Weekly sales forecasts using non-seasonal ARIMA models. http://mxcrown.com/walmart-sales-forecasting/
Editor, M. B. (2013). Regression analysis: How do I interpret R-squared and assess the goodness-of-fit? https://blog.minitab.com/en/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
Ellis, L. (2019). Simple EDA in R with inspectdf. https://www.r-bloggers.com/2019/05/part-2-simple-eda-in-r-with-inspectdf/
Frost, J. (2021). Regression coefficients - Statistics by Jim. https://statisticsbyjim.com/glossary/regression-coefficient/
Glen, S. (2016). Elementary statistics for the rest of us. https://www.statisticshowto.com/correlation-matrix/
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). sklearn.preprocessing.MinMaxScaler. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
Płoński, P. (2020). Random forest feature importance computed in 3 ways with Python. https://mljar.com/blog/feature-importance-in-random-forest/
Prabhakaran, S. (2016). Linear regression: r-statistics.co. http://r-statistics.co/Linear-Regression.html
Rushworth, A. (n.d.). Numeric column summaries and visualisations. https://alastairrushworth.github.io/inspectdf/articles/pkgdown/inspect_num_examples.html
Singh, D. (2019). Linear, lasso, and ridge regression with R. https://www.pluralsight.com/guides/linear-lasso-and-ridge-regression-with-r
U.S. Bureau of Labor Statistics. (n.d.). Consumer price index. https://www.bls.gov/cpi/
Sullivan, J. (2019). Data cleaning with R and the tidyverse: Detecting missing values. https://towardsdatascience.com/data-cleaning-with-r-and-the-tidyverse-detecting-missing-values-ea23c519bc62
Szabo, B. (2020). How to create a Seaborn correlation heatmap in Python? https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e
UNSW. (2020). Descriptive, predictive and prescriptive analytics: What are the differences? https://studyonline.unsw.edu.au/blog/descriptive-predictive-prescriptive-analytics
Vashisht, R. (2021). When to perform a feature scaling? https://www.atoti.io/when-to-perform-a-feature-scaling/
Walmart recruiting - store sales forecasting. (2014). https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data
Walmart's sales data analysis - a big data analytics perspective. (2017). https://doi.org/10.1109/APWConCSE.2017.00028
Wikipedia, the free encyclopedia. (n.d.). Walmart. https://en.wikipedia.org/w/index.php?title=Walmart&oldid=1001006854
Yu, C. H., Lee, H. S., Lara, E., & Gan, S. (2018). The ensemble and model comparison approaches for big data analytics in social sciences. https://scholarworks.umass.edu/pare/vol23/iss1/17
import pip
def install(package):
    # Install a package programmatically, handling both old and new pip entry points
    if hasattr(pip, 'main'):
        pip.main(['install', package])
    else:
        pip._internal.main(['install', package])
install('pandas')
install('seaborn')
install('plotly')
install('jovian')
install('opendatasets')
install('numpy')
install('matplotlib')
# ('inline' and 'zipFile' removed: '%matplotlib inline' is a notebook magic, not a package,
#  and 'zipfile' is part of the Python standard library)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.style as style
import seaborn as sns
import opendatasets as od
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
train = pd.read_csv('train.csv')
stores = pd.read_csv('stores.csv')
features = pd.read_csv('features.csv')
test = pd.read_csv('test.csv')
# Joining the train and test datasets with the features and stores datasets
# (the merge keys below are assumed from the Kaggle dataset schema)
merge_train = train.merge(stores, how='left', on='Store').merge(
    features, how='left', on=['Store', 'Date', 'IsHoliday'])
merge_test = test.merge(stores, how='left', on='Store').merge(
    features, how='left', on=['Store', 'Date', 'IsHoliday'])

# Extracting Year, Month, Day, Quarter and Week-of-year columns from the Date column
def split_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['Quarter'] = df.Date.dt.quarter
    df['WeekOfYear'] = (df.Date.dt.isocalendar().week) * 1.0

split_date(merge_train)
split_date(merge_test)
# EDA: Initial
# Average weekly sales for each store type
averageweeklysales = merge_train.groupby('Type')['Weekly_Sales'].mean().to_dict()
# Building a small dataframe for plotting (reconstructing a step omitted in the original)
df = pd.DataFrame({'Store_Type': list(averageweeklysales.keys()),
                   'AvgSales': list(averageweeklysales.values())})
fig1 = px.bar(df,
              x="Store_Type",
              y="AvgSales",
              title="Average Sales for Each Store Type")
fig1.show()
departmentsales = merge_train.groupby('Dept')['Weekly_Sales'].mean().sort_values(ascending=True)  # argument truncated in the original; ascending order assumed
fig2 = px.bar(departmentsales,
x=departmentsales.values,
y=departmentsales.index,
title="Average Sales for Each Department",
color_discrete_sequence=["#114D77"], orientation='h',
labels={'x': 'Average Sales', 'y': 'Department'})
fig2.update_yaxes(tick0=1, dtick=10)
fig2.show()
########################################################
# Department: Alternative Vis #
# ('departament' aggregate reconstructed; its definition was omitted in the original)
departament = merge_train.groupby('Dept').agg({'Weekly_Sales': ['mean']})
plt.figure(figsize=(20, 7))
plt.bar(departament.index, departament['Weekly_Sales']['mean'])
plt.xticks(np.arange(1, 100, step=2))
plt.ylabel('Weekly Sales', fontsize=16)
plt.xlabel('Department', fontsize=16)
plt.show()
#########################################################
trace1 = go.Bar(
x=df_2010.Month,
y=df_2010.AverageSales2010,
name="AverageSales2010")
trace2 = go.Bar(
x=df_2011.Month,
y=df_2011.AverageSales2011,
name="AverageSales2011")
trace3 = go.Bar(
x=df_2012.Month,
y=df_2012.AverageSales2012,
name="AverageSales2012")
layout = go.Layout(barmode="group",
                   xaxis_title="Month",
                   yaxis_title="Monthly Avg Sales")
# Assembling the grouped bar chart from the three yearly traces
fig3 = go.Figure(data=[trace1, trace2, trace3], layout=layout)
fig3.show()
plt.plot(weeklysales2010.index, weeklysales2010.values)
plt.plot(weeklysales2011.index, weeklysales2011.values)
plt.plot(weeklysales2012.index, weeklysales2012.values)
store_sales = merge_train.groupby('Store')['Weekly_Sales'].mean().sort_values(ascending=False)  # argument truncated in the original; descending order assumed
fig4 = px.bar(store_sales,
x=store_sales.index,
y=store_sales.values,
title="Average Sales for Each Store",
labels={'x': 'Stores', 'y': 'Average Sales'})
fig4.update_xaxes(tick0=1, dtick=1)
fig4.show()
plt.xlabel('Size')
plt.ylabel('Sales')
plt.title('Impact of Size of Store on Sales')
plt.show()
sales = merge_train.groupby('IsHoliday')['Weekly_Sales'].mean()
counts = merge_train.IsHoliday.value_counts()
# Correlation Matrix
import pip
def install(package):
    # Install a package programmatically, handling both old and new pip entry points
    if hasattr(pip, 'main'):
        pip.main(['install', package])
    else:
        pip._internal.main(['install', package])
install('pandas')
install('seaborn')
install('plotly')
install('jovian')
install('opendatasets')
install('numpy')
install('matplotlib')
# ('inline' and 'zipFile' removed: '%matplotlib inline' is a notebook magic, not a package,
#  and 'zipfile' is part of the Python standard library)
install('scikit-learn')
install('xgboost')
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression
train = pd.read_csv('train.csv')
stores = pd.read_csv('stores.csv')
features = pd.read_csv('features.csv')
test = pd.read_csv('test.csv')
# Joining the train and test datasets with the features and stores datasets
# (the merge keys below are assumed from the Kaggle dataset schema)
merge_train = train.merge(stores, how='left', on='Store').merge(
    features, how='left', on=['Store', 'Date', 'IsHoliday'])
merge_test = test.merge(stores, how='left', on='Store').merge(
    features, how='left', on=['Store', 'Date', 'IsHoliday'])

# Extracting Year, Month, Day, Quarter and Week-of-year columns from the Date column
def split_date(df):
    df['Date'] = pd.to_datetime(df['Date'])
    df['Year'] = df.Date.dt.year
    df['Month'] = df.Date.dt.month
    df['Day'] = df.Date.dt.day
    df['Quarter'] = df.Date.dt.quarter
    df['WeekOfYear'] = (df.Date.dt.isocalendar().week) * 1.0

split_date(merge_train)
split_date(merge_test)
df = merge_train.isnull().mean() * 100
input_column = merge_train.columns.to_list()
input_column.remove('Weekly_Sales')
target_column = 'Weekly_Sales'
inputs = merge_train[input_column].copy()
targets = merge_train[target_column].copy()
# (assumes `from sklearn.preprocessing import MinMaxScaler` and that non-numeric
#  columns such as Date and Type have already been encoded or dropped)
minmax_scaler = MinMaxScaler().fit(merge_train[input_column])
inputs[input_column] = minmax_scaler.transform(inputs[input_column])
merge_test[input_column] = minmax_scaler.transform(merge_test[input_column])
# LINEAR REGRESSION
# Creating model
# (train_inputs/train_targets come from a train/validation split of `inputs` and `targets`; that step is not shown here)
lm = LinearRegression().fit(train_inputs, train_targets)
###########################################################################
# LASSO REGRESSION
# Creating model (assumes `from sklearn.linear_model import Lasso`; the alpha value is illustrative)
lasso = Lasso(alpha=1.0)
lr = lasso.fit(train_inputs, train_targets)
#####################################################################
# Creating model (assumes `from xgboost import XGBRegressor`; fitting before inspecting importances)
gbm = XGBRegressor(random_state=42, n_jobs=-1).fit(train_inputs, train_targets)
feature_imp = pd.DataFrame({
'feature': train_inputs.columns,
'importance': gbm.feature_importances_
})
def test_parameters_xgb(**params):
    # Fit an XGBoost model with the given parameters and report train/validation WMAE
    model = XGBRegressor(random_state=42, n_jobs=-1, **params).fit(train_inputs, train_targets)
    train_wmae = WMAE(train_inputs, train_targets, model.predict(train_inputs))
    val_wmae = WMAE(val_inputs, val_targets, model.predict(val_inputs))
    return train_wmae, val_wmae
gbm_train_preds = gbm.predict(train_inputs)
gbm_val_preds = gbm.predict(val_inputs)
##################################################################
# RANDOM FOREST
# (assumes `from sklearn.ensemble import RandomForestRegressor`; rf1 was not defined in the excerpt)
rf1 = RandomForestRegressor(random_state=42, n_jobs=-1).fit(train_inputs, train_targets)
rf1_train_preds = rf1.predict(train_inputs)
rf1_val_preds = rf1.predict(val_inputs)
importance_df = pd.DataFrame({
'feature': train_inputs.columns,
'importance': rf1.feature_importances_})
def test_parameters_rf(**params):
    # Fit a random forest with the given parameters and report train/validation WMAE
    model = RandomForestRegressor(random_state=42, n_jobs=-1, **params).fit(train_inputs, train_targets)
    trainWMAE = WMAE(train_inputs, train_targets, model.predict(train_inputs))
    valWMAE = WMAE(val_inputs, val_targets, model.predict(val_inputs))
    return trainWMAE, valWMAE
rf1_train_preds = rf1.predict(train_inputs)
rf1_val_preds = rf1.predict(val_inputs)
predicted_df = gbm.predict(merge_test)
merge_test['Weekly_Sales'] = predicted_df
# Reading the sampleSubmission.csv file and exporting predicted results into the final submission file
sampleSubmission = pd.read_csv('sampleSubmission.csv')
sampleSubmission['Weekly_Sales'] = predicted_df
sampleSubmission.to_csv('sampleSubmissionFinal.csv', index=False)
library(stringi)
library(zoo)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(CARS)
library(VIF)
library(car)
library(inspectdf)
library(data.table)
library(reshape)
library(forecast)
library(tidyr)
library(corrplot)
library(scales)
library(caret)
library(glmnet)
library(broom)
library(modelr)
library(coefplot)
getwd()
setwd("/Users/nikki09/Desktop/Capstone Research/FINAL/Data")
## Data Modeling ##
## JOINING THE TRAIN AND TEST DATASET WITH STORES & FEATURES
## Creating the Year, Month, Week of Year, Day, Quarter columns from Date Column
print(linearmodel)
# Setting alpha = 1 for lasso regression
# cv.glmnet automatically performs cross-validation
# (x_var holds the predictor matrix and y_var the weekly sales vector)
cv_model <- cv.glmnet(x_var, y_var, alpha = 1)
best_lambda <- cv_model$lambda.min
plot(cv_model)
#find coefficients
best_model <- glmnet(x_var, y_var, alpha = 1, lambda = best_lambda)
coef(best_model)
# R squared, computed from the total sum of squares (sst) and sum of squared errors (sse)
rsq <- 1 - sse / sst
rsq
# low R squared (0.06); inefficient model
theme_set(theme_sjplot())