Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics

Quantium Data Analytics Virtual Internship
Assignment 1 : Data preparation and Customer Analytics.
The objective of this task is to help the Category Manager for Chips in a grocery store, better understand the types of customers who
purchase Chips and their purchasing behaviour within the region.
To accomplish this task, we are provided with two datastes, 1 Year of potato chip transactions and Customer details.
We are going to prepare these datasets for analysis by cleaning and merging them together. We then proceed to creating extra features that
we might deem necessary for the analysis.
Once we have the final dataset ready, we can proceed to analysing the data with the goal of understanding which segments are the biggest
drivers of sales and what are their purchasing preferences to create recomendations for the category manager.
Load Libraries and Data
In [1]: # Import libraries to be used in this section

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
In [2]: # Load the data

data_t = pd.read_excel("QVI_transaction_data.xlsx")
data_c = pd.read_csv("QVI_purchase_behaviour.csv")
Data Cleaning
Transactions
In [3]: # Sample of the transaction dataset
data_t.sample(5)
Out[3]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
2018-11-
29756 184 184033 186423 97 RRD Salt & Vinegar 165g 2 6.0
03
2018-10-
18453 48 48212 44366 73 Smiths Crinkle Cut Salt & Vinegar 170g 2 5.8
22
2018-09- Doritos Corn Chips Cheese Supreme

157927 91 91057 89459 30 2 8.8
03 170g
2018-10-
181853 209 209047 207764 9 Kettle Tortilla ChpsBtroot&Ricotta 150g 2 9.2
23
2019-04-
49586 199 199183 198377 78 Thins Chips Salt & Vinegar 175g 2 6.6
22
In [4]: # Data Info

data_t.info()
print('\n')
print('{:,} rows; {:,} columns'
.format(data_t.shape[0], data_t.shape[1]))
print('Transactions timeframe from {} to {}'.format(data_t['DATE'].min(),
data_t['DATE'].max()))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB
264,836 rows; 8 columns

Transactions timeframe from 2018-07-01 00:00:00 to 2019-06-30 00:00:00
We have 264,836 transactions from July 2018 to July 2019.
We need to change columns 1 - 5 to categorical.
In [5]: # Transform categorical features

data_t = data_t.astype({"LYLTY_CARD_NBR": 'category', "STORE_NBR": 'category', "TXN_ID": 'category',
"PROD_NBR": 'category', "PROD_NAME": 'category'})
# Debug
data_t.dtypes
Out[5]: DATE datetime64[ns]

STORE_NBR category
LYLTY_CARD_NBR category
TXN_ID category
PROD_NBR category
PROD_NAME category
PROD_QTY int64
TOT_SALES float64
dtype: object
In [6]: # Check for duplicates

data_t.duplicated().sum()
# Remove duplicates
data_t.drop_duplicates()
Out[6]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
2018-10-
0 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0
17
2019-05-
1 1 1307 348 66 CCs Nacho Cheese 175g 3 6.3
14
2019-05-
2 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.9
20
2018-08- Smiths Chip Thinly S/Cream&Onion

3 2 2373 974 69 5 15.0
17 175g
2018-08- Kettle Tortilla ChpsHny&Jlpno Chili

4 2 2426 1038 108 3 13.8
18 150g
... ... ... ... ... ... ... ... ...
2019-03- Kettle Sweet Chilli And Sour Cream

264831 272 272319 270088 89 2 10.8
09 175g
2018-08-
264832 272 272358 270154 74 Tostitos Splash Of Lime 175g 1 4.4
13
2018-11-
264833 272 272379 270187 51 Doritos Mexicana 170g 2 8.8
06
2018-12- Doritos Corn Chip Mexican Jalapeno

264834 272 272379 270188 42 2 7.8
27 150g
2018-09-
264835 272 272380 270189 74 Tostitos Splash Of Lime 175g 2 8.8
22
264835 rows × 8 columns
In [7]: # Change column name capitlization

data_t.rename(columns=lambda x: x.lower(), inplace=True)
# Debug
data_t.head(1)
Out[7]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales
0 2018-10-17 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0
In [8]: # Lets look for outliers in total sales and quantity

plt.scatter(x='prod_qty', y='tot_sales', data=data_t)
Out[8]: <matplotlib.collections.PathCollection at 0x1b7d4e9a048>
We can note that we have one order of 200 Units, a clear outlier, lets clean that.
In [9]: # Lets remove outliers using IQR method

# 1. Calculate IQR
Q1 = data_t.tot_sales.quantile(0.25)
Q3 = data_t.tot_sales.quantile(0.75)
IQR = round(Q3-Q1, 3)
Lower_Whisker = round(Q1-1.5*IQR, 3)
Upper_Whisker = round(Q3+1.5*IQR, 3)
# 2. Drop sales grater that upper whisker
data_t = data_t[data_t['tot_sales'] < Upper_Whisker]
In [10]: # Debug outliers

plt.grid(False)
sns.boxplot(data=data_t, x=data_t.tot_sales)
plt.scatter(x='prod_qty', y='tot_sales', data=data_t, )
Out[10]: <matplotlib.collections.PathCollection at 0x1b7cb04aa08>
Cool, looks like we have dealt with outliers. Lets look for outliers in the products i.e proudcts that should not be there.
In [11]: # Clean product name (Remove excess spaces) and remove digits.
import re
data_t['prod_name'] = data_t.prod_name.apply(lambda x: re.sub(' +', ' ', x))
# Remove numbers from prodcut name to check for most commun words
names = data_t.prod_name.apply(lambda x: "".join(
filter(lambda x: not x.isdigit(), x)))
# Get series with individual words to count

check = names.str.split()
In [12]: # Import Counter

from collections import Counter
# Create counter function

def counter(series):
cnt = Counter()
for words in series:
for word in words:
cnt[word] += 1
return cnt
In [13]: # Check if we have unexpected words

counter(check).most_common(20)
Out[13]: [('g', 246091),

('Chips', 49688),
('Kettle', 41141),
('&', 35490),
('Smiths', 28810),
('Salt', 27924),
('Cheese', 27809),
('Pringles', 25052),
('Doritos', 24896),
('Crinkle', 23938),
('Corn', 22014),
('Original', 21510),
('Cut', 20751),
('Chip', 18596),
('Salsa', 18054),
('Chicken', 15374),
('Sea', 14121),
('Thins', 14049),
('Sour', 13853),
('Crisps', 12583)]
Looks like the word Salsa repeats a 18054 times, might be salsa flavoured chips or actual salsa products. Lets confirm.
In [14]: # Create salsa dataframe

salsa = data_t[data_t['prod_name'].str.contains('Salsa')]
# Check the list of products

salsa_names = salsa.prod_name.unique()
list(salsa_names)
Out[14]: ['Old El Paso Salsa Dip Tomato Mild 300g',

'Red Rock Deli SR Salsa & Mzzrlla 150g',
'Smiths Crinkle Cut Tomato Salsa 150g',
'Doritos Salsa Medium 300g',
'Old El Paso Salsa Dip Chnky Tom Ht300g',
'Woolworths Mild Salsa 300g',
'Old El Paso Salsa Dip Tomato Med 300g',
'Woolworths Medium Salsa 300g',
'Doritos Salsa Mild 300g']
After some googling looks like these products are indeed salsa products and not chips, lets remove these products from the dataset.
In [15]: # Debug
prod = data_t.prod_name.count()
remove = salsa.prod_name.count()
result = prod - remove
print('After we remove the {} salsa products we should have {} products'.format(
remove, result))
After we remove the 18054 salsa products we should have 246204 products
In [16]: # Get the index numbers of rows to drop

index = data_t[data_t['prod_name'].isin(salsa_names)].index.values
len(index)
Out[16]: 18054
In [17]: # Drop the salsa rows

data_t = data_t.drop(index)
# Debug
data_t.prod_name.count()
Out[17]: 246204
Looks like we have solved all data quality issues for the transacitons data set. Now on to the customers data set.
Customers
In [18]: # Get a sample of the costumers dataset
data_c.sample(5)
Out[18]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
49548 185295 YOUNG SINGLES/COUPLES Mainstream
28798 105220 MIDAGE SINGLES/COUPLES Mainstream
7136 26005 NEW FAMILIES Premium
70020 262086 RETIREES Mainstream
43555 164219 YOUNG FAMILIES Budget
In [19]: data_c.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 72637 non-null int64
1 LIFESTAGE 72637 non-null object
2 PREMIUM_CUSTOMER 72637 non-null object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB
In [20]: # Change column name capitlization

data_c.rename(columns=lambda x: x.lower(), inplace=True)
# Change life stage capitlization

data_c.lifestage = data_c.lifestage.apply(lambda x: x.capitalize())
In [21]: # Check for duplicates

data_c.duplicated().sum()
# Remove duplicates
data_c.drop_duplicates()
Out[21]:
lylty_card_nbr lifestage premium_customer
0 1000 Young singles/couples Premium
1 1002 Young singles/couples Mainstream
2 1003 Young families Budget
3 1004 Older singles/couples Mainstream
4 1005 Midage singles/couples Mainstream
... ... ... ...
72632 2370651 Midage singles/couples Mainstream
72633 2370701 Young families Mainstream
72634 2370751 Young families Premium
72635 2370961 Older families Budget
72636 2373711 Young singles/couples Mainstream
72637 rows × 3 columns
The customer dataset looks good, now we can procced to merging the two datasets.
In [22]: # Check Shape

data_t.shape, data_c.shape
Out[22]: ((246204, 8), (72637, 3))
In [23]: # Merge the two datasets

data_final = pd.merge(data_t, data_c, on='lylty_card_nbr')
data_final.shape
Out[23]: (246204, 10)
In [24]: # Save data

data_final.to_csv('data_cleaned_merged_v1.csv', index=False,)
Feature Generation
Lets create the following columns:
Price
Brand
Pack Size
Pack Size Category
Month
Day of Week
Price of each item

In [25]: data_final = pd.read_csv('data_cleaned_merged_v1.csv')
# Lets create a price column
data_final['price'] = data_final.tot_sales / data_final.prod_qty
data_final.price.describe()
Out[25]: count 246204.000000

mean 3.831039
std 1.077786
min 1.320000
25% 3.000000
50% 3.800000
75% 4.600000
max 6.500000
Name: price, dtype: float64
Brand of each product

In [26]: # Lets take the first word of the product name as brand name
data_final['brand'] = data_final.prod_name.apply(
lambda x: str(x.split(' ')[0]))
# Debug
data_final.brand.unique()
Out[26]: array(['Natural', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',

'Infuzions', 'Grain', 'Doritos', 'Thins', 'Red', 'GrnWves',
'Tostitos', 'Pringles', 'Twisties', 'RRD', 'Infzns', 'Burger',
'NCC', 'Cheezels', 'Smith', 'Cobs', 'Dorito', 'Sunbites',
'Cheetos', 'Woolworths', 'French', 'Snbts'], dtype=object)
In [27]: # Debug
data_final.sample(1)
Out[27]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales lifestage premium_customer price br
Doritos
Corn Chips
2018- Midage
16691 195 195277 195157 77 Nacho 1 4.4 Premium 4.4 Do
11-11 singles/couples
Cheese
170g
Looks like the brand names are not standardized. We have errors like : Doritos and Dorito. We can manualy map the names as they are not
many.
In [28]: # Map the correct brand names

data_final.brand.replace({'Natural': 'Naturals', 'Old': 'Old Dutch', 'Dorito': 'Doritos', 'Grain': 'Gra
in Waves',
'Red': 'RRD', 'GrnWves': 'Grain Waves', 'Infzns': 'Infuzions', 'Burger': 'Bru
ger Rings', 'Smith': 'Smiths',
'Woolworths': 'WW', 'French': 'French Fries', 'Snbts': 'Sunbites'}, inplace=T
rue)
In [29]: # Debug
data_final.brand.unique()
Out[29]: array(['Naturals', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',

'Infuzions', 'Grain Waves', 'Doritos', 'Thins', 'RRD', 'Tostitos',
'Pringles', 'Twisties', 'Bruger Rings', 'NCC', 'Cheezels', 'Cobs',
'Sunbites', 'Cheetos', 'French Fries'], dtype=object)
We have managed to extract the brand name from the products column and create the brand column.
Size of each packet

In [30]: # Get the Final part of the product name (Where we can find the Grams)
import re
data_final['packet_size'] = data_final.prod_name.apply(
lambda x: str(x.split(' ')))
# Extract only the number from the string as an array

data_final['packet_size'] = data_final.packet_size.apply(
lambda x: re.findall('[0-9]+', x))
# Convert array object to string

data_final['packet_size'] = data_final['packet_size'].apply(
lambda x: ' '.join(map(str, x)))
# Debug
data_final['packet_size'].value_counts()
Out[30]: 175 66222

150 40135
134 25052
110 22335
170 19952
165 15269
330 12480
380 6388
210 6265
270 6262
200 4473
135 3251
250 3158
90 3008
190 2994
160 2970
220 1564
70 1507
180 1465
125 1454
Name: packet_size, dtype: int64
In [31]: # Change to number format

data_final['packet_size'] = pd.to_numeric(
data_final['packet_size'], errors='coerce')
# Debug
data_final['packet_size'].value_counts().sort_index()
Out[31]: 70 1507
90 3008
110 22335
125 1454
134 25052
135 3251
150 40135
160 2970
165 15269
170 19952
175 66222
180 1465
190 2994
200 4473
210 6265
220 1564
250 3158
270 6262
330 12480
380 6388
Name: packet_size, dtype: int64
Now we have the weight of each packet, seems like a usefull metric for analysis.
Pack Size Category
In [32]: # Sizes box plot to understand how the sizes are distributed
sns.boxplot(x=data_final.packet_size, data=data_final)
Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x1b7cac55d48>
The box plot shows us that most of the products weigh between 150g and 175g, we will use that as our middle point to create our size
category. Lets get to coding.
In [33]: # Create a function to atribute each row to a class depending on packet wheight
def packet_size(df):
if ((df['packet_size'] >= 70) and (df['packet_size'] < 101)):
return 'Small'
elif ((df['packet_size'] >= 101) and (df['packet_size'] < 151)):
return 'Medium'
return 'Big'
return 'Very_Big'
return 'Large'
else:
return 'Very_Large'
In [34]: # Create new packet_size_category using the above function

data_final['packet_size_category'] = data_final.apply(packet_size, axis=1)
# Debug
data_final.groupby('packet_size_category')[
'packet_size'].describe().sort_values(by='count', ascending=False)
Out[34]:
count mean std min 25% 50% 75% max
packet_size_category
Big 113345.0 173.827121 7.302311 160.0 170.0 175.0 175.0 200.0
Medium 92227.0 135.044000 15.846845 110.0 125.0 134.0 150.0 150.0
Very_Large 18868.0 346.928132 23.661670 330.0 330.0 330.0 380.0 380.0
Very_Big 10987.0 222.920724 17.527316 210.0 210.0 210.0 250.0 250.0
Large 6262.0 270.000000 0.000000 270.0 270.0 270.0 270.0 270.0
Small 4515.0 83.324474 9.432263 70.0 70.0 90.0 90.0 90.0
Month & Day of Week

In [35]: # Dtypes Check
data_final.dtypes
Out[35]: date object

store_nbr int64
lylty_card_nbr int64
txn_id int64
prod_nbr int64
prod_name object
prod_qty int64
tot_sales float64
lifestage object
premium_customer object
price float64
brand object
packet_size int64
packet_size_category object
dtype: object
In [36]: # Set category data type

data_final = data_final.astype(
{"lifestage": 'category', "premium_customer": 'category', "brand": 'object',
"store_nbr": 'object', 'lylty_card_nbr': 'object', "txn_id": "object",
"prod_nbr": 'object'
})
# Change date column to date Type

data_final.date = pd.to_datetime(data_final.date)
In [37]: # Debug
data_final.dtypes
Out[37]: date datetime64[ns]

store_nbr object
lylty_card_nbr object
txn_id object
prod_nbr object
prod_name object
prod_qty int64
tot_sales float64
lifestage category
premium_customer category
price float64
brand object
packet_size int64
packet_size_category object
dtype: object
In [38]: # Create Day of week and month

data_final['day_week'] = data_final.date.dt.day_name()
data_final['month'] = data_final.date.dt.month_name()
# Debug
data_final.sample(5)
Out[38]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales lifestage premium_customer price brand
Smiths
Crinkle Cut
2018- Older
44043 155 155184 156205 107 French 2 5.2 Budget 2.6 Smiths
07-28 families
OnionDip
150g
Natural
2018- ChipCo Sea
160218 83 83207 83288 80 2 6.0 Retirees Premium 3.0 Naturals
07-18 Salt &
Vinegr 175g
Dorito Corn
2019- Chp Older
69860 102 102232 102942 4 2 13.0 Premium 6.5 Doritos
01-24 Supreme families
380g
Kettle
2019- Sensations
159728 8 8356 8316 3 1 4.6 Retirees Premium 4.6 Kettle
05-19 Camembert
& Fig 150g
Kettle 135g
2018- Young
194738 88 88157 87008 63 Swt Pot 2 8.4 Mainstream 4.2 Kettle
08-05 families
Sea Salt
Data Analysis
Now that we have cleaned our data we are ready to start analysing.
We know that the chip categroy manager is intrested in understanding her customers purchasing patterns.
To get a better understanding of the customers lets rank the segments on the following metrics:
1. Number of customers in each segment.

2. Percent of total revenue.
3. Revenue per customer.
4. Packets Bought per Customer.
5. Average Chip Price by Segment.
After seeing how the segments rank we can take a deeper look at the ones we deem to be the biggest drivers of sales.
How many customers do we have in each segment ?
In [39]: # Create the dataframe containing the count of each member

segment_count = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'lylty_card_nbr'].nunique().sort_values(
by='lylty_card_nbr', ascending=False))
# Create percent of total columns
segment_count['percent_total'] = (
(segment_count.lylty_card_nbr / segment_count.lylty_card_nbr.sum())*100).round(2)
segment_count['percent_total'] = segment_count['percent_total'].astype(
str) + ' %'
In [40]: # Plot Number of customers per segment

barchart = px.bar(
data_frame=segment_count,
y='lylty_card_nbr',
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='percent_total',
title='Number of Customers in Each Segment',
labels={'lylty_card_nbr': 'Number of Cusotumers',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart
Number of Customers in Each Segment
8000 Customer Segment

11.1 %
Mainstream
7000 Budget
Premium
Number of Cusotumers
8.92 %
6000
5000
6.81 %
6.8 %
6.57 %
6.47 %
6.15 %
4000
5.55 %
5.35 %
5.12 %
3000
4.62 %
3.91 %
3.77 %
3.48 %
3.36 %
3.33 %
2000
3.13 %
2.07 %
1000
1.53 %
1.17 %
0.81 %
0
Yo Re Ol Mi Ol Yo Ne
un ti re de da de u ng w
g es rs g es r fam fam
sin ing fam
gle les ing ilie ilie ilie
s/c /co les s s s
ou up /co
ple les up
s les
lifestage
Which segment brought in most revenue ?

In [41]: # Create total sales by segment dataframe
segment_revenue = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'tot_sales'].sum().sort_values(
by='tot_sales', ascending=False))
# Create percent of total columns

segment_revenue['percent_total'] = (
(segment_revenue.tot_sales / segment_revenue.tot_sales.sum())*100).round(2)
segment_revenue['percent_total'] = segment_revenue['percent_total'].astype(
str) + ' %'
# Plot the segments

barchart = px.bar(
data_frame=segment_revenue,
y='tot_sales',
x='lifestage',
opacity=0.9,
orientation='v',
barmode='group',
text='percent_total',
title='Total Sales by Segments',
labels={'tot_sales': 'Total Ammount of Sales',
)
barchart
Total Sales by Segments
160k Customer Segment

8.68 %
Budget
8.19 %
140k
8.06 %
Mainstream
Premium
7.18 %
Total Ammount of Sales
7.08 %
120k
6.9 %
6.84 %
100k
5.86 %
5.33 %
5.06 %
80k
4.78 %
4.69 %
4.35 %
4.16 %
60k
3.17 %
3.02 %
40k
2.17 %
1.85 %
20k
1.14 %
0.89 %
0.6 %
0
Ol Yo Ol Re Yo Mi Ne
de un de tir u ng da w
rf g rs ee g es fam
am fam ing s sin
ilie ilie les gle ing ilie
s s /co s/c les s
up ou /co
les ple up
s le s
lifestage
What is the revenue per customer ?
In [42]: # Create revenue per customer dataframe

revenue_member = segment_count.drop(
['lylty_card_nbr', 'percent_total'], axis=1)
revenue_member['revenue_member'] = round(
segment_revenue.tot_sales / segment_count.lylty_card_nbr, 2)
# Plot revenue per customer

barchart = px.bar(
data_frame=revenue_member,
y='revenue_member',
x='lifestage',
opacity=0.9,
orientation='v',
barmode='group',
text='revenue_member',
title='Revenue Per Customer by Segment',
labels={'revenue_member': 'Average Sale per Customer',
)
barchart
Revenue Per Customer by Segment
35 Customer Segment
34.33
33.84
Mainstream
33.49
32.62
32.59
31.95
30 Budget
Average Sale per Customer
Premium
25
26.23
26.2
25.54
25.53
24.01
23.83
22.85
22.74
22.51
20
19.18
18.9
18.64
18.58
15
15.7
15.62
10
0
Yo Re Ol Mi Ol Yo Ne
un ti re de da de u ng w
g es rs g es r fam fam
sin ing fam
gle les ing ilie ilie ilie
s/c /co les s s s
ou up /co
ple les up
s les
lifestage
How many packets does each customer buy?
In [43]: # Create segment total quantity dataframe

segment_quantity = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'prod_qty'].sum().sort_values(
by='prod_qty', ascending=False))
# Create quantity per member column

segment_quantity['qty_member'] = round(
segment_quantity.prod_qty / segment_count.lylty_card_nbr, 2)
# Plot quantity per member

barchart = px.bar(
data_frame=segment_quantity,
y='qty_member',
x='lifestage',
opacity=0.9,
orientation='v',
barmode='group',
text='qty_member',
title='Quantity Sold by Count of Customer',
labels={'qty_member': 'Average Quantity',
)
barchart
Quantity Sold by Count of Customer
Customer Segment
9.2
Budget
9.04
9.01
8.68
8.67
8.59
8 Mainstream
Premium
Average Quantity
6.75
6.74
6.69
6.4
6
6.11
6.08
6.05
6
5.91
4.87
4.81
4.8
4.56
4
4.25
4.24
0
Ol Yo Ol Re Yo Mi Ne
de un de tir u ng da w
rf g rs ee g es fam
am fam ing s sin
ilie ilie les gle ing ilie
s s /co s/c les s
up ou /co
les ple up
s le s
lifestage
What is the average price paid per chip?
In [44]: # Create average price per customer dataset

average_price = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'price'].mean().sort_values(
by='price', ascending=False)).round(2)
# Plot average price per segment

barchart = px.bar(
data_frame=average_price,
y='price',
log_y=True,
x='lifestage',
opacity=0.9,
orientation='v',
barmode='group',
text='price',
title='Average Price of Chip Bought',
labels={'price': 'Average Price', 'premium_customer': 'Customer Segment'},
)
barchart
Average Price of Chip Bought
Customer Segment
4.05
4.06
Mainstream
4 Budget
Premium
3.99
3.95
Average Price
3.92
3.92
3.92
3.92
3.9
3.89
3.88
3.87
3.85
3.84
3.8
3.81
3.77
3.75
3.76
3.76
3.74
3.74
3.74
3.72
3.72
3.7
3.67
3.65 3.66
Yo Mi Ne Re Ol Ol Yo
un da w tir de de u ng
g ge fam ee r sin r fam
sin sin s fam
gle gle ilie gle ilie ilie
s/c s s s/c s s
ou /co ou
ple up ple
s les s
lifestage
Analysis Insights:
Biggest Segments:
1. Young Singles/Couples (Mainstrean) = 11.1%

2. Retirees (Mainstream) = 8.92%
3. Older Singles/Couples (All) = 20%
Highest Revenue:
1. Old Families (Budget) = 8.68%

2. Young Singles/Couples (Mainstream) = 8.19%
3. Retirees (Mainstream) = 8.06%
Revenue per Customer:
1. Older Families = 33.88

2. Young Families = 32.38
3. Older Singles/Couples = 25.98
Average Number of Packets bought:
1. Older Families = 9.2 ~ 9

2. Young Families = 8.7 ~ 9
3. Older Singles/Couples = 6.73 ~ 7
Average Price Per Chip:
1. Young Singles/Couples (Mainstream) = 4.06

2. Midage Singles/Couples (Mainstream) = 3.99
3. New families (Mainstream & Budget) + Retirees (Budget & Premium) = 3.90
Our biggest segment is Young Singles/Couples (Mainstream), we should expect that since they are the majority they should bring in the
most revenue but interestingly enough the highest proportion of revenue (8.68%) comes from Older Families (budget) which make up 6.47%
of customers.
Young Singles/Couples (Mainstream) that make up 11.1% of all customers were responsible for 8.19% of total revenue the second highest.
This can be explained by the fact that Older and Younger families buy more packets, on average about 9 compared to about 5 for Young
Singles/Couples (Mainstream).
On the other hand, Young Singles/Couples (Mainstream) pay more per chip, on average 4.06 while Older Families (budget) spend on
average 3.74.
We can conclude that Young Singles/Couples (Mainstream) are more prone to impulse purchases and since they have less people in the
household, they purchase the more expensive brands at less quantity.
On the contrary Older Families (budget) try to buy cheaper brands but more quantity.
We can visualise that relationship by plotting average price vs average quantity.
But before we have to test if the Average price difference is statistically significant since the differance is small,before we start assuming it in
our analysis
In [45]: # Create a DataFrame for the Test population

Test = pd.DataFrame(data_final.query(
'lifestage == ("Midage singles/couples", "Young singles/couples") & premium_customer == "Mainstrea
m" '))
# Debug
Test.lifestage.unique()
Out[45]: ['Midage singles/couples', 'Young singles/couples']

Categories (2, object): ['Midage singles/couples', 'Young singles/couples']
In [46]: # Get index of test sample

index = Test.index
# Remove test from the population

control = data_final.drop(index)
# Create experiment set

experiment = Test
2 sample t-test assums that both data sets must have almost the same mean and are identically and independently distributed. We can
check the mean and skew of the datasets to see if our datasets pass those assumptions.
In [47]: # Create control and experiment group and print skew and mean
control_mean_sales = control.groupby('lylty_card_nbr')['tot_sales'].mean()
experiment_mean_sales = experiment.groupby('lylty_card_nbr')[
'tot_sales'].mean()
skew_control = round(control_mean_sales.skew(), 3)
skew_experiment = round(experiment_mean_sales.skew(), 3)
print('Control mean:{} skew: {}\nExperiment mean:{} skew: {}'.format(
control_mean_sales.mean(), skew_control, experiment_mean_sales.mean(), skew_experiment))
Control mean:7.023056981996668 skew: -0.361

Experiment mean:7.426350719604131 skew: -0.356
Looks like both datasets have the same distributions. Lets proced with the test.
Our null hypothesis : Both groups have no statisticaly signifacant differance in average price.
In [48]: # Import the scipy library

from scipy import stats
# 2 sample Ttest
tstat, pval = stats.ttest_ind(control_mean_sales, experiment_mean_sales)
print(pval)
if pval < 0.05:
print('We can reject the null hypothesis.')
else:
print('We have failed to reject the null hypothesis.')
2.847236809420268e-81
We can reject the null hypothesis.
We have a pvalue much lower than our alpha of 0.05 so we can conclude that the diferance of average price is not due to chance.
Now we can assume that these segments are more likly to buy more expensive brands. Lets see the relationship between quantity sold and
average price.
In [49]: # Create average quantity vs average price dataframe

qty_price = pd.merge(average_price, segment_quantity)
# Debug
qty_price.sample(3)
Out[49]:
lifestage premium_customer price prod_qty qty_member
11 Midage singles/couples Premium 3.77 14327 6.05
7 Older singles/couples Budget 3.88 32697 6.75
19 Young singles/couples Premium 3.67 10543 4.25
In [50]: # Plot the scatter plot

scatter = px.scatter(
data_frame=qty_price,
x='price',
y='qty_member',
size='prod_qty',
color='lifestage',
hover_name='premium_customer',
title='Average Packets Vs Average Price',
labels={'qty_member': 'Average Packets',
)
scatter
Average Packets Vs Average Price
lifestage
Young singles/couples
9
Midage singles/couples
Retirees
New families
8
Older singles/couples
Average Packets
Young families
Older families
7
4
3.7 3.8 3.9 4
price
The above scatter plot shows some interesting insights, as we thought we see families tend to purchase more quantity but tend to buy less
expensive chips but this is noted only for the Older and Young families, New families do not follow this trend, they show preference for
quality over quantity indicated by lower quantity and higher average price.
Also we can see how Young Singles/Couples Mainstream and Midge Singles/Couples Mainstream are willing to pay more per chip when
compared with their premium and budget counterparts. We can see that Sinlges/Couples buy more chips as they get older, note that Young
S/C buy on average less than Midage S/C and so on, maybe due to increased health awareness from younger demographics.
Let’s get to know better the segments that buy more than the average and pay more than average with the intent of finding ways to keep
these segments buying and retaining their business.
Quantity:
Older & Younger Families
Price:
Midage Singles/Couples (Mainstream)
Young Singles/Couples (Mainstream)
Its in the clients intrest to retain and serve well these segments, for this the store should keep stocked the brands and sizes these segments
tend to buy the most. Lets figure out which those are.
Older Families & Young Families
We are going to create a affinity table for brand and packet size for each of the 4 segments. How are we going to make this affinity table?
Well if you think about it we can only get an understanding about prefrences towards a specific brand or size by comparing between what
proportion a brand ocupies in the overall total quantity bought.
Example : Lets take in to consideration our Older families segment, We will compare how much the brand 'Kettle' contributes to total quantity
sold between our target segment and the population (Everyone - our target).
In this example we can imagine something like this:

Brand : Kettle
Target %: 0.2
Population %: 0.16
Affinity = 0.2/0.16 = 1.25.
We can say that our target segment is 25% more likley to purchase kettle over the population.
Lets code the following steps in a function so we can apply them to all the 4 segments with out the need to code it each time.
In [51]: # Create Affinity Calculator Function

def affinity(target, population, metric):
"""
This function takes in 2 dataframes and metric to compare the populations for
affinity towards a specific metric. For the pourpuses of this project metric has to be either
brand or packet_size.
"""
# Create Quantity Totals
qty_target = target.prod_qty.sum()
qty_population = population.prod_qty.sum()
# Create Total per brand data set

target_qty_brand = pd.DataFrame(target.groupby(
metric, as_index=False)['prod_qty'].sum())
population_qty_brand = pd.DataFrame(
population.groupby(metric, as_index=False)['prod_qty'].sum())
# Create qty_total columns

target_qty_brand['qty_total'] = target_qty_brand.prod_qty / qty_target
population_qty_brand['qty_total'] = population_qty_brand.prod_qty / qty_population
# Rename Columns
target_qty_brand.rename(columns={'qty_total': 'Target'}, inplace=True)
population_qty_brand.rename(
columns={'qty_total': 'Population'}, inplace=True)
# Drop Unnecesary columns

target_qty_brand.drop('prod_qty', axis=1, inplace=True)
population_qty_brand.drop('prod_qty', axis=1, inplace=True)
# Create Affinity Dataframe

affinity = pd.merge(target_qty_brand, population_qty_brand)
# Create Proportion columns
affinity['Proportion'] = affinity.Target / affinity.Population
return affinity.nlargest(5, 'Proportion')
In [52]: # Create older families dataset

older_families = data_final.query('lifestage == "Older families"')
# Debug
older_families.lifestage.unique()
Out[52]: ['Older families']

Categories (1, object): ['Older families']
In [53]: # Create Younger Families dataset

young_families = data_final.query('lifestage == "Young families"')
# Debug
young_families.lifestage.unique()
Out[53]: ['Young families']

Categories (1, object): ['Young families']
Older Families Brand and Size Prefrences
In [54]: # Create Target and Population DataSets

target = older_families
index = older_families.index
population = data_final.drop(index)
# Get affinity table for brand

affinity(target, population, 'brand')
Out[54]:
brand Target Population Proportion
10 NCC 0.007039 0.005391 1.305685
0 Bruger Rings 0.007794 0.006014 1.296110
20 WW 0.058155 0.045325 1.283070
13 RRD 0.079902 0.062633 1.275710
2 Cheetos 0.013838 0.011339 1.220353
In [55]: # Get affinity table for packet size

affinity(target, population, 'packet_size')
Out[55]:
packet_size Target Population Proportion
11 180 0.007440 0.005515 1.349092
15 220 0.007794 0.006014 1.296110
13 200 0.022079 0.017066 1.293680
12 190 0.014605 0.011539 1.265695
7 160 0.014124 0.011481 1.230196
Older families are about 30% more likley to buy NCC, Burger Rings, WW & RRD. This segment is 30% more likley to buy packets sized
between 180 - 220grams
Young Families Brand and Size Preferences

target = young_families
index = young_families.index
# Get brand affinity table young families

Out[56]:
15 Sunbites 0.015027 0.011588 1.296693
6 French Fries 0.006925 0.005392 1.284340
1 CCs 0.022220 0.017629 1.260430
13 RRD 0.077783 0.063467 1.225560
2 Cheetos 0.013683 0.011429 1.197148
In [57]: # Get size affinity table young families

Out[57]:
1 90 0.015027 0.011588 1.296693
12 190 0.014335 0.011665 1.228890
13 200 0.020876 0.017426 1.197964
15 220 0.007347 0.006146 1.195475
11 180 0.006771 0.005694 1.189099
While Young families are amlost 30% more likly to buy Sunbites, French Fries and CCs.
And 30% more likely to buy 90g packets. Smaller packets might be due to the children in the family bieng smaller.
What brands sell 90g packets?
In [58]: # Brands that sell 90 packets.

list(data_final.query('packet_size == "90"').prod_name.unique())
Out[58]: ['Sunbites Whlegrn Crisps Frch/Onin 90g',

'Snbts Whlgrn Crisps Cheddr&Mstrd 90g']
That probably explains high affinity to sunbites, they are the only brand offering 90g packs.
We can conclude that young families are more likly to buy brands that offer smaller packets probably because they buy chips for children as
a snack.
Midage Singles/Couples & Young Singles/Couples (Mainstream)
In [59]: # Create Midage S/C Dataset

midage = data_final.query(
'lifestage == "Midage singles/couples" & premium_customer == "Mainstream"')
# Debug
midage.lifestage.unique()
Out[59]: ['Midage singles/couples']

Categories (1, object): ['Midage singles/couples']
In [60]: # Create Young S/C Dataset

young = data_final.query(
'lifestage == "Young singles/couples" & premium_customer == "Mainstream"')
# Debug
young.lifestage.unique()
Out[60]: ['Young singles/couples']

Categories (1, object): ['Young singles/couples']
Midage S/C Brand and Size Preference

target = midage
index = midage.index

Out[61]:
18 Twisties 0.044208 0.038110 1.160022
9 Kettle 0.190779 0.166476 1.145986
17 Tostitos 0.043592 0.038240 1.139960
4 Cobs 0.044635 0.039200 1.138666
3 Cheezels 0.019970 0.018458 1.081924
In [62]: # Get affinity table for size

Out[62]:
17 270 0.030927 0.025266 1.224032
18 330 0.060099 0.050349 1.193635
2 110 0.101983 0.090500 1.126883
5 135 0.014610 0.013148 1.111181
14 210 0.027891 0.025372 1.099306
Young S/C Brand and Size preference

target = young
index = young.index

Out[63]:
19 Tyrrells 0.031685 0.025683 1.233708
18 Twisties 0.046127 0.037738 1.222308
5 Doritos 0.122969 0.100876 1.219015
17 Tostitos 0.045462 0.037898 1.199601
9 Kettle 0.196984 0.165114 1.193019
In [64]: # Get affinity table for size

Out[64]:
17 270 0.031823 0.024995 1.273192
19 380 0.032128 0.025463 1.261783
18 330 0.061291 0.049911 1.227991
4 134 0.119366 0.100626 1.186231
2 110 0.106365 0.089735 1.185316
It seems like we have uncovered prefrences in all the segments, lets list the top prefrences for brand and size:
Older Families:
Brand: 30% more likley to buy NCC & Burger Rings.
Size: 30% more likley to buy packets sized 180g, 200g & 220g.
Young Families:
Brand: 30% more likley to buy Sunbites & French fries.
Size: 30% more likley to buy smallest packs 90g to be exact.
Midage S/C:
Brand: 16% more likley to buy Twisties and 14% to buy Kettle.
Size: 22% more likely to buy 270g packs and 19% 330g.
Young S/C Brand: 23% more likley to buy Tyrrells and 22% Twisties.
Size: 27% more likley to buy 270g packs and 26% 380g
Lets visualise the relationship between price and size for the preferred brands to confirm of they follow our beliefs about our selected
segments purchasing patterns.
In [65]: # Create list of brands to select

brands = ['NCC', 'Bruger Rings', 'Sunbites',
'French Fries', 'Twisties', 'Kettle', 'Tyrrells']
# Create dataframe only with selected brands

selected_brands = data_final[data_final['brand'].isin(brands)]
# Debug
selected_brands.brand.unique()
Out[65]: array(['Kettle', 'Tyrrells', 'Twisties', 'Bruger Rings', 'NCC',

'Sunbites', 'French Fries'], dtype=object)
In [66]: # Create dataset to plot

size_by_price = selected_brands.groupby('brand', as_index=False)[
['packet_size', 'price']].mean().round(2)
# Plot the scatter plot

scatter = px.scatter(
data_frame=size_by_price,
x='price',
y='packet_size',
color='brand',
size='packet_size',
title='Brands Average Size Vs Average Price',
labels={'packet_size': 'Average Pack Size'},
)
scatter
Brands Average Size Vs Average Price
brand
Bruger Rings
250 French Fries

Kettle
NCC
Sunbites
Average Pack Size
Twisties
200
Tyrrells
150
100
1.5 2 2.5 3 3.5 4 4.5 5
price
Let’s recap what we’ve found! Sales have mainly been due to Budget - older families, Mainstream - young singles/couples, and Mainstream
retirees shoppers. We found that the high spend in chips for mainstream young singles/couples and retirees is due to there being more
of them than other buyers. Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips.
This is indicative of impulse buying behaviour. We’ve also found that Mainstream young singles and couples are 23% more likely to
purchase Tyrrells chips compared to the rest of the population. The Category Manager may want to increase the category’s
performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near segments where young singles and
couples frequent more often to increase visibilty and impulse behaviour. Quantium can help the Category Manager with
recommendations of where these segments are and further help them with measuring the impact of the changed placement. We’ll work
on measuring the impact of trials in the next task and putting all these together in the third task.
Conclusion
What have we found out ?
Most sales come from Old Families (Budget) Young Singles/Couples (Mainstream) & Retirees (Mainstream).
Older Families, Young Families & Older Singles/Couples bring in most revenue per customer.
Young Singles/Couples (Mainstream) & Midage Singles/Couples (Mainstream) pay more per chip packet.
Older families and Young families buy more packets of chips on average.
Older and Young families are about 30% more likley to buy brands that cost under 3.00, young families are more likley to buy small
packets.
Midage and Young Singles/Couples are about 20% more likley to buy more expensive brands, above 4.00.
Our recommendations:
1. Increase small sized pack offerings to increase the number of families buying chips and to keep them returning.
2. Positions the smaller and cheaper packs closer to other items families frequntly buy to increase purchases.
3. Create stands that only house chips bellow 3.00 and other with chips above 4.00 at strategic points in the store as to make the
customers go buy other items and probably add them to their shopping basket.
In [ ]:

Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics

Uploaded by

Copyright:

Available Formats

Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics

Uploaded by

Copyright:

Available Formats

Quantium Data Analytics Virtual Internship

Assignment 1 : Data preparation and Customer Analytics.

Load Libraries and Data

In [1]: # Import libraries to be used in this section

In [2]: # Load the data

2018-09- Doritos Corn Chips Cheese Supreme

In [4]: # Data Info

264,836 rows; 8 columns

We have 264,836 transactions from July 2018 to July 2019.

We need to change columns 1 - 5 to categorical.

In [5]: # Transform categorical features

Out[5]: DATE datetime64[ns]

In [6]: # Check for duplicates

2018-08- Smiths Chip Thinly S/Cream&Onion

2018-08- Kettle Tortilla ChpsHny&Jlpno Chili

... ... ... ... ... ... ... ... ...

2019-03- Kettle Sweet Chilli And Sour Cream

2018-12- Doritos Corn Chip Mexican Jalapeno

264835 rows × 8 columns

In [7]: # Change column name capitlization

0 2018-10-17 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0

In [8]: # Lets look for outliers in total sales and quantity

Out[8]: <matplotlib.collections.PathCollection at 0x1b7d4e9a048>

In [9]: # Lets remove outliers using IQR method

In [10]: # Debug outliers

Out[10]: <matplotlib.collections.PathCollection at 0x1b7cb04aa08>

# Get series with individual words to count

In [12]: # Import Counter

# Create counter function

In [13]: # Check if we have unexpected words

Out[13]: [('g', 246091),

In [14]: # Create salsa dataframe

# Check the list of products

Out[14]: ['Old El Paso Salsa Dip Tomato Mild 300g',

In [16]: # Get the index numbers of rows to drop

In [17]: # Drop the salsa rows

49548 185295 YOUNG SINGLES/COUPLES Mainstream

28798 105220 MIDAGE SINGLES/COUPLES Mainstream

7136 26005 NEW FAMILIES Premium

70020 262086 RETIREES Mainstream

43555 164219 YOUNG FAMILIES Budget

In [20]: # Change column name capitlization

# Change life stage capitlization

In [21]: # Check for duplicates

0 1000 Young singles/couples Premium

1 1002 Young singles/couples Mainstream

2 1003 Young families Budget

3 1004 Older singles/couples Mainstream

4 1005 Midage singles/couples Mainstream

... ... ... ...

72632 2370651 Midage singles/couples Mainstream

72633 2370701 Young families Mainstream

72634 2370751 Young families Premium

72635 2370961 Older families Budget

72636 2373711 Young singles/couples Mainstream

72637 rows × 3 columns

In [22]: # Check Shape

Out[22]: ((246204, 8), (72637, 3))