Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics
Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics
Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics
The objective of this task is to help the Category Manager for Chips in a grocery store, better understand the types of customers who
purchase Chips and their purchasing behaviour within the region.
To accomplish this task, we are provided with two datastes, 1 Year of potato chip transactions and Customer details.
We are going to prepare these datasets for analysis by cleaning and merging them together. We then proceed to creating extra features that
we might deem necessary for the analysis.
Once we have the final dataset ready, we can proceed to analysing the data with the goal of understanding which segments are the biggest
drivers of sales and what are their purchasing preferences to create recomendations for the category manager.
Data Cleaning
Transactions
In [3]: # Sample of the transaction dataset
data_t.sample(5)
Out[3]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
2018-11-
29756 184 184033 186423 97 RRD Salt & Vinegar 165g 2 6.0
03
2018-10-
18453 48 48212 44366 73 Smiths Crinkle Cut Salt & Vinegar 170g 2 5.8
22
2018-10-
181853 209 209047 207764 9 Kettle Tortilla ChpsBtroot&Ricotta 150g 2 9.2
23
2019-04-
49586 199 199183 198377 78 Thins Chips Salt & Vinegar 175g 2 6.6
22
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
Out[6]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
2018-10-
0 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0
17
2019-05-
1 1 1307 348 66 CCs Nacho Cheese 175g 3 6.3
14
2019-05-
2 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
20
2018-08-
264832 272 272358 270154 74 Tostitos Splash Of Lime 175g 1 4.4
13
2018-11-
264833 272 272379 270187 51 Doritos Mexicana 170g 2 8.8
06
2018-09-
264835 272 272380 270189 74 Tostitos Splash Of Lime 175g 2 8.8
22
Out[7]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales
We can note that we have one order of 200 Units, a clear outlier, lets clean that.
Cool, looks like we have dealt with outliers. Lets look for outliers in the products i.e proudcts that should not be there.
In [11]: # Clean product name (Remove excess spaces) and remove digits.
import re
data_t['prod_name'] = data_t.prod_name.apply(lambda x: re.sub(' +', ' ', x))
# Remove numbers from prodcut name to check for most commun words
names = data_t.prod_name.apply(lambda x: "".join(
filter(lambda x: not x.isdigit(), x)))
Looks like the word Salsa repeats a 18054 times, might be salsa flavoured chips or actual salsa products. Lets confirm.
After some googling looks like these products are indeed salsa products and not chips, lets remove these products from the dataset.
In [15]: # Debug
prod = data_t.prod_name.count()
remove = salsa.prod_name.count()
result = prod - remove
print('After we remove the {} salsa products we should have {} products'.format(
remove, result))
After we remove the 18054 salsa products we should have 246204 products
Out[16]: 18054
# Debug
data_t.prod_name.count()
Out[17]: 246204
Looks like we have solved all data quality issues for the transacitons data set. Now on to the customers data set.
Customers
In [18]: # Get a sample of the costumers dataset
data_c.sample(5)
Out[18]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
In [19]: data_c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 72637 non-null int64
1 LIFESTAGE 72637 non-null object
2 PREMIUM_CUSTOMER 72637 non-null object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
# Remove duplicates
data_c.drop_duplicates()
Out[21]:
lylty_card_nbr lifestage premium_customer
The customer dataset looks good, now we can procced to merging the two datasets.
Feature Generation
Lets create the following columns:
Price
Brand
Pack Size
Pack Size Category
Month
Day of Week
In [27]: # Debug
data_final.sample(1)
Out[27]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales lifestage premium_customer price br
Doritos
Corn Chips
2018- Midage
16691 195 195277 195157 77 Nacho 1 4.4 Premium 4.4 Do
11-11 singles/couples
Cheese
170g
Looks like the brand names are not standardized. We have errors like : Doritos and Dorito. We can manualy map the names as they are not
many.
In [29]: # Debug
data_final.brand.unique()
We have managed to extract the brand name from the products column and create the brand column.
# Debug
data_final['packet_size'].value_counts()
# Debug
data_final['packet_size'].value_counts().sort_index()
Out[31]: 70 1507
90 3008
110 22335
125 1454
134 25052
135 3251
150 40135
160 2970
165 15269
170 19952
175 66222
180 1465
190 2994
200 4473
210 6265
220 1564
250 3158
270 6262
330 12480
380 6388
Name: packet_size, dtype: int64
Now we have the weight of each packet, seems like a usefull metric for analysis.
In [32]: # Sizes box plot to understand how the sizes are distributed
sns.boxplot(x=data_final.packet_size, data=data_final)
The box plot shows us that most of the products weigh between 150g and 175g, we will use that as our middle point to create our size
category. Lets get to coding.
In [33]: # Create a function to atribute each row to a class depending on packet wheight
def packet_size(df):
if ((df['packet_size'] >= 70) and (df['packet_size'] < 101)):
return 'Small'
elif ((df['packet_size'] >= 101) and (df['packet_size'] < 151)):
return 'Medium'
elif ((df['packet_size'] >= 151) and (df['packet_size'] < 201)):
return 'Big'
elif ((df['packet_size'] >= 201) and (df['packet_size'] < 251)):
return 'Very_Big'
elif ((df['packet_size'] >= 251) and (df['packet_size'] < 301)):
return 'Large'
else:
return 'Very_Large'
# Debug
data_final.groupby('packet_size_category')[
'packet_size'].describe().sort_values(by='count', ascending=False)
Out[34]:
count mean std min 25% 50% 75% max
packet_size_category
In [37]: # Debug
data_final.dtypes
# Debug
data_final.sample(5)
Out[38]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales lifestage premium_customer price brand
Smiths
Crinkle Cut
2018- Older
44043 155 155184 156205 107 French 2 5.2 Budget 2.6 Smiths
07-28 families
OnionDip
150g
Natural
2018- ChipCo Sea
160218 83 83207 83288 80 2 6.0 Retirees Premium 3.0 Naturals
07-18 Salt &
Vinegr 175g
Dorito Corn
2019- Chp Older
69860 102 102232 102942 4 2 13.0 Premium 6.5 Doritos
01-24 Supreme families
380g
Kettle
2019- Sensations
159728 8 8356 8316 3 1 4.6 Retirees Premium 4.6 Kettle
05-19 Camembert
& Fig 150g
Kettle 135g
2018- Young
194738 88 88157 87008 63 Swt Pot 2 8.4 Mainstream 4.2 Kettle
08-05 families
Sea Salt
Data Analysis
Now that we have cleaned our data we are ready to start analysing.
We know that the chip categroy manager is intrested in understanding her customers purchasing patterns.
To get a better understanding of the customers lets rank the segments on the following metrics:
After seeing how the segments rank we can take a deeper look at the ones we deem to be the biggest drivers of sales.
Mainstream
7000 Budget
Premium
Number of Cusotumers
8.92 %
6000
5000
6.81 %
6.8 %
6.57 %
6.47 %
6.15 %
4000
5.55 %
5.35 %
5.12 %
3000
4.62 %
3.91 %
3.77 %
3.48 %
3.36 %
3.33 %
2000
3.13 %
2.07 %
1000
1.53 %
1.17 %
0.81 %
0
Yo Re Ol Mi Ol Yo Ne
un ti re de da de u ng w
g es rs g es r fam fam
sin ing fam
gle les ing ilie ilie ilie
s/c /co les s s s
ou up /co
ple les up
s les
lifestage
Budget
8.19 %
140k
8.06 %
Mainstream
Premium
7.18 %
Total Ammount of Sales
7.08 %
120k
6.9 %
6.84 %
100k
5.86 %
5.33 %
5.06 %
80k
4.78 %
4.69 %
4.35 %
4.16 %
60k
3.17 %
3.02 %
40k
2.17 %
1.85 %
20k
1.14 %
0.89 %
0.6 %
0
Ol Yo Ol Re Yo Mi Ne
de un de tir u ng da w
rf g rs ee g es fam
am fam ing s sin
ilie ilie les gle ing ilie
s s /co s/c les s
up ou /co
les ple up
s le s
lifestage
35 Customer Segment
34.33
33.84
Mainstream
33.49
32.62
32.59
31.95
30 Budget
Average Sale per Customer
Premium
25
26.23
26.2
25.54
25.53
24.01
23.83
22.85
22.74
22.51
20
19.18
18.9
18.64
18.58
15
15.7
15.62
10
0
Yo Re Ol Mi Ol Yo Ne
un ti re de da de u ng w
g es rs g es r fam fam
sin ing fam
gle les ing ilie ilie ilie
s/c /co les s s s
ou up /co
ple les up
s les
lifestage
Customer Segment
9.2
Budget
9.04
9.01
8.68
8.67
8.59
8 Mainstream
Premium
Average Quantity
6.75
6.74
6.69
6.4
6
6.11
6.08
6.05
6
5.91
4.87
4.81
4.8
4.56
4
4.25
4.24
0
Ol Yo Ol Re Yo Mi Ne
de un de tir u ng da w
rf g rs ee g es fam
am fam ing s sin
ilie ilie les gle ing ilie
s s /co s/c les s
up ou /co
les ple up
s le s
lifestage
Customer Segment
4.05
4.06
Mainstream
4 Budget
Premium
3.99
3.95
Average Price
3.92
3.92
3.92
3.92
3.9
3.89
3.88
3.87
3.85
3.84
3.8
3.81
3.77
3.75
3.76
3.76
3.74
3.74
3.74
3.72
3.72
3.7
3.67
3.65 3.66
Yo Mi Ne Re Ol Ol Yo
un da w tir de de u ng
g ge fam ee r sin r fam
sin sin s fam
gle gle ilie gle ilie ilie
s/c s s s/c s s
ou /co ou
ple up ple
s les s
lifestage
Analysis Insights:
Biggest Segments:
Highest Revenue:
Our biggest segment is Young Singles/Couples (Mainstream), we should expect that since they are the majority they should bring in the
most revenue but interestingly enough the highest proportion of revenue (8.68%) comes from Older Families (budget) which make up 6.47%
of customers.
Young Singles/Couples (Mainstream) that make up 11.1% of all customers were responsible for 8.19% of total revenue the second highest.
This can be explained by the fact that Older and Younger families buy more packets, on average about 9 compared to about 5 for Young
Singles/Couples (Mainstream).
On the other hand, Young Singles/Couples (Mainstream) pay more per chip, on average 4.06 while Older Families (budget) spend on
average 3.74.
We can conclude that Young Singles/Couples (Mainstream) are more prone to impulse purchases and since they have less people in the
household, they purchase the more expensive brands at less quantity.
On the contrary Older Families (budget) try to buy cheaper brands but more quantity.
But before we have to test if the Average price difference is statistically significant since the differance is small,before we start assuming it in
our analysis
2 sample t-test assums that both data sets must have almost the same mean and are identically and independently distributed. We can
check the mean and skew of the datasets to see if our datasets pass those assumptions.
In [47]: # Create control and experiment group and print skew and mean
control_mean_sales = control.groupby('lylty_card_nbr')['tot_sales'].mean()
experiment_mean_sales = experiment.groupby('lylty_card_nbr')[
'tot_sales'].mean()
skew_control = round(control_mean_sales.skew(), 3)
skew_experiment = round(experiment_mean_sales.skew(), 3)
print('Control mean:{} skew: {}\nExperiment mean:{} skew: {}'.format(
control_mean_sales.mean(), skew_control, experiment_mean_sales.mean(), skew_experiment))
Looks like both datasets have the same distributions. Lets proced with the test.
Our null hypothesis : Both groups have no statisticaly signifacant differance in average price.
# 2 sample Ttest
tstat, pval = stats.ttest_ind(control_mean_sales, experiment_mean_sales)
print(pval)
if pval < 0.05:
print('We can reject the null hypothesis.')
else:
print('We have failed to reject the null hypothesis.')
2.847236809420268e-81
We can reject the null hypothesis.
We have a pvalue much lower than our alpha of 0.05 so we can conclude that the diferance of average price is not due to chance.
Now we can assume that these segments are more likly to buy more expensive brands. Lets see the relationship between quantity sold and
average price.
# Debug
qty_price.sample(3)
Out[49]:
lifestage premium_customer price prod_qty qty_member
lifestage
Young singles/couples
9
Midage singles/couples
Retirees
New families
8
Older singles/couples
Average Packets
Young families
Older families
7
4
3.7 3.8 3.9 4
price
The above scatter plot shows some interesting insights, as we thought we see families tend to purchase more quantity but tend to buy less
expensive chips but this is noted only for the Older and Young families, New families do not follow this trend, they show preference for
quality over quantity indicated by lower quantity and higher average price.
Also we can see how Young Singles/Couples Mainstream and Midge Singles/Couples Mainstream are willing to pay more per chip when
compared with their premium and budget counterparts. We can see that Sinlges/Couples buy more chips as they get older, note that Young
S/C buy on average less than Midage S/C and so on, maybe due to increased health awareness from younger demographics.
Let’s get to know better the segments that buy more than the average and pay more than average with the intent of finding ways to keep
these segments buying and retaining their business.
Quantity:
Older & Younger Families
Price:
Midage Singles/Couples (Mainstream)
Young Singles/Couples (Mainstream)
Its in the clients intrest to retain and serve well these segments, for this the store should keep stocked the brands and sizes these segments
tend to buy the most. Lets figure out which those are.
We are going to create a affinity table for brand and packet size for each of the 4 segments. How are we going to make this affinity table?
Well if you think about it we can only get an understanding about prefrences towards a specific brand or size by comparing between what
proportion a brand ocupies in the overall total quantity bought.
Example : Lets take in to consideration our Older families segment, We will compare how much the brand 'Kettle' contributes to total quantity
sold between our target segment and the population (Everyone - our target).
Lets code the following steps in a function so we can apply them to all the 4 segments with out the need to code it each time.
# Rename Columns
target_qty_brand.rename(columns={'qty_total': 'Target'}, inplace=True)
population_qty_brand.rename(
columns={'qty_total': 'Population'}, inplace=True)
# Debug
older_families.lifestage.unique()
# Debug
young_families.lifestage.unique()
Out[54]:
brand Target Population Proportion
Out[55]:
packet_size Target Population Proportion
Older families are about 30% more likley to buy NCC, Burger Rings, WW & RRD. This segment is 30% more likley to buy packets sized
between 180 - 220grams
Out[56]:
brand Target Population Proportion
Out[57]:
packet_size Target Population Proportion
While Young families are amlost 30% more likly to buy Sunbites, French Fries and CCs.
And 30% more likely to buy 90g packets. Smaller packets might be due to the children in the family bieng smaller.
That probably explains high affinity to sunbites, they are the only brand offering 90g packs.
We can conclude that young families are more likly to buy brands that offer smaller packets probably because they buy chips for children as
a snack.
Out[61]:
brand Target Population Proportion
Out[62]:
packet_size Target Population Proportion
Out[63]:
brand Target Population Proportion
Out[64]:
packet_size Target Population Proportion
It seems like we have uncovered prefrences in all the segments, lets list the top prefrences for brand and size:
Older Families:
Brand: 30% more likley to buy NCC & Burger Rings.
Size: 30% more likley to buy packets sized 180g, 200g & 220g.
Young Families:
Brand: 30% more likley to buy Sunbites & French fries.
Size: 30% more likley to buy smallest packs 90g to be exact.
Midage S/C:
Brand: 16% more likley to buy Twisties and 14% to buy Kettle.
Size: 22% more likely to buy 270g packs and 19% 330g.
Young S/C Brand: 23% more likley to buy Tyrrells and 22% Twisties.
Size: 27% more likley to buy 270g packs and 26% 380g
Lets visualise the relationship between price and size for the preferred brands to confirm of they follow our beliefs about our selected
segments purchasing patterns.
# Debug
selected_brands.brand.unique()
brand
Bruger Rings
Twisties
200
Tyrrells
150
100
price
Let’s recap what we’ve found! Sales have mainly been due to Budget - older families, Mainstream - young singles/couples, and Mainstream
retirees shoppers. We found that the high spend in chips for mainstream young singles/couples and retirees is due to there being more
of them than other buyers. Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips.
This is indicative of impulse buying behaviour. We’ve also found that Mainstream young singles and couples are 23% more likely to
purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category’s
performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near segments where young singles and
couples frequent more often to increase visibilty and impulse behaviour. Quantium can help the Category Manager with
recommendations of where these segments are and further help them with measuring the impact of the changed placement. We’ll work
on measuring the impact of trials in the next task and putting all these together in the third task.
Conclusion
What have we found out ?
Most sales come from Old Families (Budget) Young Singles/Couples (Mainstream) & Retirees (Mainstream).
Older Families, Young Families & Older Singles/Couples bring in most revenue per customer.
Young Singles/Couples (Mainstream) & Midage Singles/Couples (Mainstream) pay more per chip packet.
Older families and Young families buy more packets of chips on average.
Older and Young families are about 30% more likley to buy brands that cost under 3.00, young families are more likley to buy small
packets.
Midage and Young Singles/Couples are about 20% more likley to buy more expensive brands, above 4.00.
Our recommendations:
1. Increase small sized pack offerings to increase the number of families buying chips and to keep them returning.
2. Positions the smaller and cheaper packs closer to other items families frequntly buy to increase purchases.
3. Create stands that only house chips bellow 3.00 and other with chips above 4.00 at strategic points in the store as to make the
customers go buy other items and probably add them to their shopping basket.
In [ ]: