Quantium Data Analytics Virtual Internship Assignment 1: Data Preparation and Customer Analytics


The objective of this task is to help the Category Manager for Chips in a grocery store better understand the types of customers who purchase chips and their purchasing behaviour within the region.

To accomplish this, we are provided with two datasets: one year of potato chip transactions and customer details.

We are going to prepare these datasets for analysis by cleaning and merging them, and then create the extra features we deem necessary for the analysis.

Once the final dataset is ready, we can analyse the data with the goal of understanding which segments are the biggest drivers of sales and what their purchasing preferences are, so we can create recommendations for the Category Manager.

Load Libraries and Data

In [1]: # Import libraries to be used in this section


import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

In [2]: # Load the data


data_t = pd.read_excel("QVI_transaction_data.xlsx")
data_c = pd.read_csv("QVI_purchase_behaviour.csv")

Data Cleaning

Transactions
In [3]: # Sample of the transaction dataset
data_t.sample(5)

Out[3]:
        DATE        STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR  PROD_NAME                                PROD_QTY  TOT_SALES
29756   2018-11-03  184        184033          186423  97        RRD Salt & Vinegar 165g                  2         6.0
18453   2018-10-22  48         48212           44366   73        Smiths Crinkle Cut Salt & Vinegar 170g   2         5.8
157927  2018-09-03  91         91057           89459   30        Doritos Corn Chips Cheese Supreme 170g   2         8.8
181853  2018-10-23  209        209047          207764  9         Kettle Tortilla ChpsBtroot&Ricotta 150g  2         9.2
49586   2019-04-22  199        199183          198377  78        Thins Chips Salt & Vinegar 175g          2         6.6

In [4]: # Data Info


data_t.info()
print('\n')
print('{:,} rows; {:,} columns'
.format(data_t.shape[0], data_t.shape[1]))
print('Transactions timeframe from {} to {}'.format(data_t['DATE'].min(),
data_t['DATE'].max()))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DATE 264836 non-null datetime64[ns]
1 STORE_NBR 264836 non-null int64
2 LYLTY_CARD_NBR 264836 non-null int64
3 TXN_ID 264836 non-null int64
4 PROD_NBR 264836 non-null int64
5 PROD_NAME 264836 non-null object
6 PROD_QTY 264836 non-null int64
7 TOT_SALES 264836 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(5), object(1)
memory usage: 16.2+ MB

264,836 rows; 8 columns


Transactions timeframe from 2018-07-01 00:00:00 to 2019-06-30 00:00:00

We have 264,836 transactions from 1 July 2018 to 30 June 2019.
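Since the data should cover a full year, one extra sanity check (a minimal sketch, not in the original notebook) is to confirm that no calendar days are missing from the range:

# Hypothetical check: compare the full daily range against the observed dates
full_range = pd.date_range(data_t['DATE'].min(), data_t['DATE'].max())
missing = full_range.difference(pd.DatetimeIndex(data_t['DATE'].unique()))
print(len(missing), 'missing dates')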

We need to change columns 1-5 (the store, loyalty card, transaction and product identifiers, plus the product name) to categorical.

In [5]: # Transform categorical features


data_t = data_t.astype({"LYLTY_CARD_NBR": 'category', "STORE_NBR": 'category', "TXN_ID": 'category',
"PROD_NBR": 'category', "PROD_NAME": 'category'})
# Debug
data_t.dtypes

Out[5]: DATE datetime64[ns]


STORE_NBR category
LYLTY_CARD_NBR category
TXN_ID category
PROD_NBR category
PROD_NAME category
PROD_QTY int64
TOT_SALES float64
dtype: object

In [6]: # Check for duplicates


data_t.duplicated().sum()
# Remove duplicates (assign back so the change actually persists)
data_t = data_t.drop_duplicates()
data_t

Out[6]:
        DATE        STORE_NBR  LYLTY_CARD_NBR  TXN_ID  PROD_NBR  PROD_NAME                                 PROD_QTY  TOT_SALES
0       2018-10-17  1          1000            1       5         Natural Chip Compny SeaSalt175g           2         6.0
1       2019-05-14  1          1307            348     66        CCs Nacho Cheese 175g                     3         6.3
2       2019-05-20  1          1343            383     61        Smiths Crinkle Cut Chips Chicken 170g     2         2.9
3       2018-08-17  2          2373            974     69        Smiths Chip Thinly S/Cream&Onion 175g     5         15.0
4       2018-08-18  2          2426            1038    108       Kettle Tortilla ChpsHny&Jlpno Chili 150g  3         13.8
...     ...         ...        ...             ...     ...       ...                                       ...       ...
264831  2019-03-09  272        272319          270088  89        Kettle Sweet Chilli And Sour Cream 175g   2         10.8
264832  2018-08-13  272        272358          270154  74        Tostitos Splash Of Lime 175g              1         4.4
264833  2018-11-06  272        272379          270187  51        Doritos Mexicana 170g                     2         8.8
264834  2018-12-27  272        272379          270188  42        Doritos Corn Chip Mexican Jalapeno 150g   2         7.8
264835  2018-09-22  272        272380          270189  74        Tostitos Splash Of Lime 175g              2         8.8

264835 rows × 8 columns

In [7]: # Change column name capitalization


data_t.rename(columns=lambda x: x.lower(), inplace=True)
# Debug
data_t.head(1)

Out[7]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales

0 2018-10-17 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.0

In [8]: # Let's look for outliers in total sales and quantity


plt.scatter(x='prod_qty', y='tot_sales', data=data_t)

Out[8]: <matplotlib.collections.PathCollection at 0x1b7d4e9a048>

We can note that we have one order of 200 units, a clear outlier; let's clean that up.
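Before dropping anything, it is worth eyeballing the offending rows (a sketch; we assume the 200-unit purchases are the points flagged in the scatter plot above):

# Hypothetical inspection of the extreme transactions before removal
data_t[data_t['prod_qty'] == 200]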

In [9]: # Let's remove outliers using the IQR method


# 1. Calculate IQR
Q1 = data_t.tot_sales.quantile(0.25)
Q3 = data_t.tot_sales.quantile(0.75)
IQR = round(Q3-Q1, 3)
Lower_Whisker = round(Q1-1.5*IQR, 3)
Upper_Whisker = round(Q3+1.5*IQR, 3)
# 2. Drop sales greater than the upper whisker
data_t = data_t[data_t['tot_sales'] < Upper_Whisker]

In [10]: # Debug outliers


plt.grid(False)
sns.boxplot(data=data_t, x=data_t.tot_sales)
plt.scatter(x='prod_qty', y='tot_sales', data=data_t, )

Out[10]: <matplotlib.collections.PathCollection at 0x1b7cb04aa08>

Cool, it looks like we have dealt with the outliers. Now let's look for outliers among the products, i.e. products that should not be there.

In [11]: # Clean product names (collapse repeated spaces)
import re
data_t['prod_name'] = data_t.prod_name.apply(lambda x: re.sub(' +', ' ', x))

# Remove digits from product names so we can check the most common words
names = data_t.prod_name.apply(lambda x: "".join(
    filter(lambda x: not x.isdigit(), x)))

# Get series with individual words to count


check = names.str.split()

In [12]: # Import Counter


from collections import Counter

# Create counter function


def counter(series):
    cnt = Counter()
    for words in series:
        for word in words:
            cnt[word] += 1
    return cnt

In [13]: # Check if we have unexpected words


counter(check).most_common(20)

Out[13]: [('g', 246091),


('Chips', 49688),
('Kettle', 41141),
('&', 35490),
('Smiths', 28810),
('Salt', 27924),
('Cheese', 27809),
('Pringles', 25052),
('Doritos', 24896),
('Crinkle', 23938),
('Corn', 22014),
('Original', 21510),
('Cut', 20751),
('Chip', 18596),
('Salsa', 18054),
('Chicken', 15374),
('Sea', 14121),
('Thins', 14049),
('Sour', 13853),
('Crisps', 12583)]

The word 'Salsa' appears 18,054 times; these might be salsa-flavoured chips or actual salsa products. Let's confirm.

In [14]: # Create salsa dataframe


salsa = data_t[data_t['prod_name'].str.contains('Salsa')]

# Check the list of products


salsa_names = salsa.prod_name.unique()
list(salsa_names)

Out[14]: ['Old El Paso Salsa Dip Tomato Mild 300g',


'Red Rock Deli SR Salsa & Mzzrlla 150g',
'Smiths Crinkle Cut Tomato Salsa 150g',
'Doritos Salsa Medium 300g',
'Old El Paso Salsa Dip Chnky Tom Ht300g',
'Woolworths Mild Salsa 300g',
'Old El Paso Salsa Dip Tomato Med 300g',
'Woolworths Medium Salsa 300g',
'Doritos Salsa Mild 300g']

After some googling, these are indeed salsa products and not chips, so let's remove them from the dataset.

In [15]: # Debug
prod = data_t.prod_name.count()
remove = salsa.prod_name.count()
result = prod - remove
print('After we remove the {} salsa products we should have {} products'.format(
remove, result))

After we remove the 18054 salsa products we should have 246204 products

In [16]: # Get the index numbers of rows to drop


index = data_t[data_t['prod_name'].isin(salsa_names)].index.values
len(index)

Out[16]: 18054

In [17]: # Drop the salsa rows


data_t = data_t.drop(index)

# Debug
data_t.prod_name.count()

Out[17]: 246204

It looks like we have solved all the data quality issues in the transactions dataset.
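Before moving on to the customers dataset, a compact way to re-check the cleaning steps is a few assertions (a sketch, not in the original notebook; it assumes Upper_Whisker is still in scope):

# Hypothetical re-validation of the cleaning steps above
assert data_t.duplicated().sum() == 0
assert not data_t['prod_name'].str.contains('Salsa').any()
assert (data_t['tot_sales'] < Upper_Whisker).all()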

Customers
In [18]: # Get a sample of the customers dataset
data_c.sample(5)

Out[18]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER

49548 185295 YOUNG SINGLES/COUPLES Mainstream

28798 105220 MIDAGE SINGLES/COUPLES Mainstream

7136 26005 NEW FAMILIES Premium

70020 262086 RETIREES Mainstream

43555 164219 YOUNG FAMILIES Budget

In [19]: data_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LYLTY_CARD_NBR 72637 non-null int64
1 LIFESTAGE 72637 non-null object
2 PREMIUM_CUSTOMER 72637 non-null object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB

In [20]: # Change column name capitalization


data_c.rename(columns=lambda x: x.lower(), inplace=True)

# Change life stage capitalization


data_c.lifestage = data_c.lifestage.apply(lambda x: x.capitalize())

In [21]: # Check for duplicates


data_c.duplicated().sum()

# Remove duplicates (assign back so the change actually persists)
data_c = data_c.drop_duplicates()
data_c

Out[21]:
lylty_card_nbr lifestage premium_customer

0 1000 Young singles/couples Premium

1 1002 Young singles/couples Mainstream

2 1003 Young families Budget

3 1004 Older singles/couples Mainstream

4 1005 Midage singles/couples Mainstream

... ... ... ...

72632 2370651 Midage singles/couples Mainstream

72633 2370701 Young families Mainstream

72634 2370751 Young families Premium

72635 2370961 Older families Budget

72636 2373711 Young singles/couples Mainstream

72637 rows × 3 columns

The customer dataset looks good; now we can proceed to merging the two datasets.

In [22]: # Check Shape


data_t.shape, data_c.shape

Out[22]: ((246204, 8), (72637, 3))

In [23]: # Merge the two datasets


data_final = pd.merge(data_t, data_c, on='lylty_card_nbr')
data_final.shape

Out[23]: (246204, 10)
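The row count is unchanged by the merge, which suggests every transaction matched a customer. A quick way to confirm this (a sketch using pandas' merge indicator; not in the original notebook):

# Hypothetical check that no transaction was left without a customer match
chk = pd.merge(data_t, data_c, on='lylty_card_nbr', how='left', indicator=True)
print((chk['_merge'] != 'both').sum(), 'unmatched transactions')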

In [24]: # Save data


data_final.to_csv('data_cleaned_merged_v1.csv', index=False,)

Feature Generation
Let's create the following columns:

Price
Brand
Pack Size
Pack Size Category
Month
Day of Week

Price of each item


In [25]: data_final = pd.read_csv('data_cleaned_merged_v1.csv')
# Let's create a price column (price per packet)
data_final['price'] = data_final.tot_sales / data_final.prod_qty
data_final.price.describe()

Out[25]: count 246204.000000


mean 3.831039
std 1.077786
min 1.320000
25% 3.000000
50% 3.800000
75% 4.600000
max 6.500000
Name: price, dtype: float64

Brand of each product


In [26]: # Let's take the first word of the product name as the brand name
data_final['brand'] = data_final.prod_name.apply(
lambda x: str(x.split(' ')[0]))
# Debug
data_final.brand.unique()

Out[26]: array(['Natural', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',


'Infuzions', 'Grain', 'Doritos', 'Thins', 'Red', 'GrnWves',
'Tostitos', 'Pringles', 'Twisties', 'RRD', 'Infzns', 'Burger',
'NCC', 'Cheezels', 'Smith', 'Cobs', 'Dorito', 'Sunbites',
'Cheetos', 'Woolworths', 'French', 'Snbts'], dtype=object)

In [27]: # Debug
data_final.sample(1)

Out[27]:
       date        store_nbr  lylty_card_nbr  txn_id  prod_nbr  prod_name                             prod_qty  tot_sales  lifestage               premium_customer  price  brand
16691  2018-11-11  195        195277          195157  77        Doritos Corn Chips Nacho Cheese 170g  1         4.4        Midage singles/couples  Premium           4.4    Doritos

The brand names are not standardized; we have variants like Doritos and Dorito. Since there are not many, we can map the correct names manually.

In [28]: # Map the correct brand names


data_final.brand.replace({'Natural': 'Naturals', 'Old': 'Old Dutch', 'Dorito': 'Doritos',
                          'Grain': 'Grain Waves', 'Red': 'RRD', 'GrnWves': 'Grain Waves',
                          'Infzns': 'Infuzions', 'Burger': 'Bruger Rings', 'Smith': 'Smiths',
                          'Woolworths': 'WW', 'French': 'French Fries', 'Snbts': 'Sunbites'},
                         inplace=True)

In [29]: # Debug
data_final.brand.unique()

Out[29]: array(['Naturals', 'CCs', 'WW', 'Smiths', 'Kettle', 'Tyrrells',


'Infuzions', 'Grain Waves', 'Doritos', 'Thins', 'RRD', 'Tostitos',
'Pringles', 'Twisties', 'Bruger Rings', 'NCC', 'Cheezels', 'Cobs',
'Sunbites', 'Cheetos', 'French Fries'], dtype=object)

We have managed to extract the brand name from the products column and create the brand column.

Size of each packet


In [30]: # Turn each product name into the string of its word list (the grams appear among the words)
import re
data_final['packet_size'] = data_final.prod_name.apply(
    lambda x: str(x.split(' ')))

# Extract only the number from the string as an array


data_final['packet_size'] = data_final.packet_size.apply(
lambda x: re.findall('[0-9]+', x))

# Convert array object to string


data_final['packet_size'] = data_final['packet_size'].apply(
lambda x: ' '.join(map(str, x)))

# Debug
data_final['packet_size'].value_counts()

Out[30]: 175 66222


150 40135
134 25052
110 22335
170 19952
165 15269
330 12480
380 6388
210 6265
270 6262
200 4473
135 3251
250 3158
90 3008
190 2994
160 2970
220 1564
70 1507
180 1465
125 1454
Name: packet_size, dtype: int64

In [31]: # Change to number format


data_final['packet_size'] = pd.to_numeric(
data_final['packet_size'], errors='coerce')

# Debug
data_final['packet_size'].value_counts().sort_index()

Out[31]: 70 1507
90 3008
110 22335
125 1454
134 25052
135 3251
150 40135
160 2970
165 15269
170 19952
175 66222
180 1465
190 2994
200 4473
210 6265
220 1564
250 3158
270 6262
330 12480
380 6388
Name: packet_size, dtype: int64

Now we have the weight of each packet, which seems like a useful metric for the analysis.
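For reference, the grams can also be pulled out in a single step with pandas' string methods (a sketch, assuming every product name contains its weight followed by 'g'; the notebook keeps the multi-step version above):

# Hypothetical one-step alternative to the extraction above
sizes = data_final['prod_name'].str.extract(r'(\d+)\s*[gG]', expand=False).astype(float)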

Pack Size Category

In [32]: # Sizes box plot to understand how the sizes are distributed
sns.boxplot(x=data_final.packet_size, data=data_final)

Out[32]: <matplotlib.axes._subplots.AxesSubplot at 0x1b7cac55d48>

The box plot shows that most products weigh between 150g and 175g; we will use that as the middle point for our size categories. Let's get to coding.

In [33]: # Create a function to assign each row to a class depending on packet weight
def packet_size(df):
    if (df['packet_size'] >= 70) and (df['packet_size'] < 101):
        return 'Small'
    elif (df['packet_size'] >= 101) and (df['packet_size'] < 151):
        return 'Medium'
    elif (df['packet_size'] >= 151) and (df['packet_size'] < 201):
        return 'Big'
    elif (df['packet_size'] >= 201) and (df['packet_size'] < 251):
        return 'Very_Big'
    elif (df['packet_size'] >= 251) and (df['packet_size'] < 301):
        return 'Large'
    else:
        return 'Very_Large'
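For reference, pd.cut can express the same binning more compactly (a sketch assuming the same boundaries; the notebook uses the function above instead):

# Hypothetical equivalent using pd.cut; right-closed integer bins match the ranges above
bins = [69, 100, 150, 200, 250, 300, float('inf')]
labels = ['Small', 'Medium', 'Big', 'Very_Big', 'Large', 'Very_Large']
alt_category = pd.cut(data_final['packet_size'], bins=bins, labels=labels)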

In [34]: # Create new packet_size_category using the above function


data_final['packet_size_category'] = data_final.apply(packet_size, axis=1)

# Debug
data_final.groupby('packet_size_category')[
'packet_size'].describe().sort_values(by='count', ascending=False)

Out[34]:
count mean std min 25% 50% 75% max

packet_size_category

Big 113345.0 173.827121 7.302311 160.0 170.0 175.0 175.0 200.0

Medium 92227.0 135.044000 15.846845 110.0 125.0 134.0 150.0 150.0

Very_Large 18868.0 346.928132 23.661670 330.0 330.0 330.0 380.0 380.0

Very_Big 10987.0 222.920724 17.527316 210.0 210.0 210.0 250.0 250.0

Large 6262.0 270.000000 0.000000 270.0 270.0 270.0 270.0 270.0

Small 4515.0 83.324474 9.432263 70.0 70.0 90.0 90.0 90.0

Month & Day of Week


In [35]: # Dtypes Check
data_final.dtypes

Out[35]: date object


store_nbr int64
lylty_card_nbr int64
txn_id int64
prod_nbr int64
prod_name object
prod_qty int64
tot_sales float64
lifestage object
premium_customer object
price float64
brand object
packet_size int64
packet_size_category object
dtype: object

In [36]: # Set category data type


data_final = data_final.astype(
{"lifestage": 'category', "premium_customer": 'category', "brand": 'object',
"store_nbr": 'object', 'lylty_card_nbr': 'object', "txn_id": "object",
"prod_nbr": 'object'
})

# Change date column to date Type


data_final.date = pd.to_datetime(data_final.date)

In [37]: # Debug
data_final.dtypes

Out[37]: date datetime64[ns]


store_nbr object
lylty_card_nbr object
txn_id object
prod_nbr object
prod_name object
prod_qty int64
tot_sales float64
lifestage category
premium_customer category
price float64
brand object
packet_size int64
packet_size_category object
dtype: object

In [38]: # Create Day of week and month


data_final['day_week'] = data_final.date.dt.day_name()
data_final['month'] = data_final.date.dt.month_name()

# Debug
data_final.sample(5)

Out[38]:
        date        store_nbr  lylty_card_nbr  txn_id  prod_nbr  prod_name                                prod_qty  tot_sales  lifestage       premium_customer  price  brand
44043   2018-07-28  155        155184          156205  107       Smiths Crinkle Cut French OnionDip 150g  2         5.2        Older families  Budget            2.6    Smiths
160218  2018-07-18  83         83207           83288   80        Natural ChipCo Sea Salt & Vinegr 175g    2         6.0        Retirees        Premium           3.0    Naturals
69860   2019-01-24  102        102232          102942  4         Dorito Corn Chp Supreme 380g             2         13.0       Older families  Premium           6.5    Doritos
159728  2019-05-19  8          8356            8316    3         Kettle Sensations Camembert & Fig 150g   1         4.6        Retirees        Premium           4.6    Kettle
194738  2018-08-05  88         88157           87008   63        Kettle 135g Swt Pot Sea Salt             2         8.4        Young families  Mainstream        4.2    Kettle

(display truncated; the new packet_size, packet_size_category, day_week and month columns are cut off)

Data Analysis
Now that we have cleaned our data, we are ready to start analysing.

We know that the chip Category Manager is interested in understanding her customers' purchasing patterns.

To get a better understanding of the customers, let's rank the segments on the following metrics:

1. Number of customers in each segment.


2. Percent of total revenue.
3. Revenue per customer.
4. Packets Bought per Customer.
5. Average Chip Price by Segment.

After seeing how the segments rank, we can take a deeper look at the ones we deem to be the biggest drivers of sales.
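As an aside, all five metrics could be assembled in a single groupby pass (a sketch, not in the original notebook, which builds each metric separately below):

# Hypothetical one-pass summary per (lifestage, premium_customer) segment
summary = data_final.groupby(['lifestage', 'premium_customer']).agg(
    customers=('lylty_card_nbr', 'nunique'),
    revenue=('tot_sales', 'sum'),
    packets=('prod_qty', 'sum'),
    avg_price=('price', 'mean'),
).reset_index()
summary['revenue_per_customer'] = summary['revenue'] / summary['customers']
summary['packets_per_customer'] = summary['packets'] / summary['customers']
summary['pct_revenue'] = 100 * summary['revenue'] / summary['revenue'].sum()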

How many customers do we have in each segment?

In [39]: # Create the dataframe containing the count of each member


segment_count = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'lylty_card_nbr'].nunique().sort_values(
by='lylty_card_nbr', ascending=False))
# Create percent of total columns
segment_count['percent_total'] = (
(segment_count.lylty_card_nbr / segment_count.lylty_card_nbr.sum())*100).round(2)
segment_count['percent_total'] = segment_count['percent_total'].astype(
str) + ' %'

In [40]: # Plot Number of customers per segment


barchart = px.bar(
data_frame=segment_count,
y='lylty_card_nbr',
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='percent_total',
title='Number of Customers in Each Segment',
labels={'lylty_card_nbr': 'Number of Customers',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart

[Bar chart: Number of Customers in Each Segment — number of customers by lifestage, grouped by customer segment (Mainstream / Budget / Premium). Largest bars: Young singles/couples Mainstream (11.1%) and Retirees Mainstream (8.92%).]

Which segment brought in the most revenue?


In [41]: # Create total sales by segment dataframe
segment_revenue = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'tot_sales'].sum().sort_values(
by='tot_sales', ascending=False))

# Create percent of total columns


segment_revenue['percent_total'] = (
(segment_revenue.tot_sales / segment_revenue.tot_sales.sum())*100).round(2)
segment_revenue['percent_total'] = segment_revenue['percent_total'].astype(
str) + ' %'

# Plot the segments


barchart = px.bar(
data_frame=segment_revenue,
y='tot_sales',
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='percent_total',
title='Total Sales by Segments',
labels={'tot_sales': 'Total Amount of Sales',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart

[Bar chart: Total Sales by Segments — total sales by lifestage and customer segment. Top bars: Older families Budget (8.68%), Young singles/couples Mainstream (8.19%), Retirees Mainstream (8.06%).]

What is the revenue per customer?

In [42]: # Create revenue per customer dataframe


revenue_member = segment_count.drop(
['lylty_card_nbr', 'percent_total'], axis=1)
revenue_member['revenue_member'] = round(
segment_revenue.tot_sales / segment_count.lylty_card_nbr, 2)

# Plot revenue per customer


barchart = px.bar(
data_frame=revenue_member,
y='revenue_member',
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='revenue_member',
title='Revenue Per Customer by Segment',
labels={'revenue_member': 'Average Sale per Customer',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart

[Bar chart: Revenue Per Customer by Segment — average sale per customer by lifestage and customer segment; values range from 15.62 to 34.33.]

How many packets does each customer buy?

In [43]: # Create segment total quantity dataframe


segment_quantity = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'prod_qty'].sum().sort_values(
by='prod_qty', ascending=False))

# Create quantity per member column


segment_quantity['qty_member'] = round(
segment_quantity.prod_qty / segment_count.lylty_card_nbr, 2)

# Plot quantity per member


barchart = px.bar(
data_frame=segment_quantity,
y='qty_member',
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='qty_member',
title='Quantity Sold by Count of Customer',
labels={'qty_member': 'Average Quantity',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart

[Bar chart: Quantity Sold by Count of Customer — average packets per customer by lifestage and customer segment; values range from 4.24 to 9.2.]

What is the average price paid per chip?

In [44]: # Create average price per customer dataset


average_price = pd.DataFrame(data_final.groupby(["lifestage", 'premium_customer'], as_index=False)[
'price'].mean().sort_values(
by='price', ascending=False)).round(2)

# Plot average price per segment


barchart = px.bar(
data_frame=average_price,
y='price',
log_y=True,
x='lifestage',
color='premium_customer',
opacity=0.9,
orientation='v',
barmode='group',
text='price',
title='Average Price of Chip Bought',
labels={'price': 'Average Price', 'premium_customer': 'Customer Segment'},
template='plotly_white'
)
barchart

[Bar chart: Average Price of Chip Bought — average price per packet by lifestage and customer segment (log y-axis); values range from 3.65 to 4.06.]

Analysis Insights:

Biggest Segments:

1. Young Singles/Couples (Mainstream) = 11.1%


2. Retirees (Mainstream) = 8.92%
3. Older Singles/Couples (All) = 20%

Highest Revenue:

1. Older Families (Budget) = 8.68%


2. Young Singles/Couples (Mainstream) = 8.19%
3. Retirees (Mainstream) = 8.06%

Revenue per Customer:

1. Older Families = 33.88


2. Young Families = 32.38
3. Older Singles/Couples = 25.98

Average Number of Packets bought:

1. Older Families = 9.2 ~ 9


2. Young Families = 8.7 ~ 9
3. Older Singles/Couples = 6.73 ~ 7

Average Price Per Chip:

1. Young Singles/Couples (Mainstream) = 4.06


2. Midage Singles/Couples (Mainstream) = 3.99
3. New families (Mainstream & Budget) + Retirees (Budget & Premium) = 3.90

Our biggest segment is Young Singles/Couples (Mainstream). Since they are the majority, we would expect them to bring in the most revenue, but interestingly the highest proportion of revenue (8.68%) comes from Older Families (Budget), who make up only 6.47% of customers.

Young Singles/Couples (Mainstream), who make up 11.1% of all customers, were responsible for 8.19% of total revenue, the second highest. This can be explained by the fact that Older and Young families buy more packets, on average about 9 compared to about 5 for Young Singles/Couples (Mainstream).

On the other hand, Young Singles/Couples (Mainstream) pay more per chip, on average 4.06, while Older Families (Budget) spend on average 3.74.
We can conclude that Young Singles/Couples (Mainstream) are more prone to impulse purchases: with fewer people in the household, they buy more expensive brands in smaller quantities.

On the contrary, Older Families (Budget) buy cheaper brands but in larger quantities.

We can visualise that relationship by plotting average price against average quantity.

But first, since the difference in averages is small, we should test whether it is statistically significant before building our analysis on it.

In [45]: # Create a DataFrame for the Test population


Test = pd.DataFrame(data_final.query(
    'lifestage == ("Midage singles/couples", "Young singles/couples") & premium_customer == "Mainstream"'))
# Debug
Test.lifestage.unique()

Out[45]: ['Midage singles/couples', 'Young singles/couples']


Categories (2, object): ['Midage singles/couples', 'Young singles/couples']

In [46]: # Get index of test sample


index = Test.index

# Remove test from the population


control = data_final.drop(index)

# Create experiment set


experiment = Test

The two-sample t-test assumes that the observations in each group are independent and roughly normally distributed, with similar variances. We can check the mean and skew of the two groups as a quick sanity check of these assumptions.

In [47]: # Create control and experiment group and print skew and mean
control_mean_sales = control.groupby('lylty_card_nbr')['tot_sales'].mean()
experiment_mean_sales = experiment.groupby('lylty_card_nbr')[
'tot_sales'].mean()
skew_control = round(control_mean_sales.skew(), 3)
skew_experiment = round(experiment_mean_sales.skew(), 3)
print('Control mean:{} skew: {}\nExperiment mean:{} skew: {}'.format(
control_mean_sales.mean(), skew_control, experiment_mean_sales.mean(), skew_experiment))

Control mean:7.023056981996668 skew: -0.361


Experiment mean:7.426350719604131 skew: -0.356

Both groups show similar, mildly left-skewed distributions, so let's proceed with the test.

Our null hypothesis: there is no difference in average sales per customer between the two groups.
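The standard two-sample t-test also assumes similar variances in the two groups. An optional extra check (a sketch, not run in the notebook) is Levene's test; if it rejects equal variances, Welch's t-test (ttest_ind with equal_var=False) is the safer choice:

# Hypothetical variance check before the t-test
from scipy import stats
stat, p = stats.levene(control_mean_sales, experiment_mean_sales)
print('Levene p-value:', p)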

In [48]: # Import the scipy library


from scipy import stats

# Two-sample t-test
tstat, pval = stats.ttest_ind(control_mean_sales, experiment_mean_sales)
print(pval)
if pval < 0.05:
print('We can reject the null hypothesis.')
else:
print('We have failed to reject the null hypothesis.')

2.847236809420268e-81
We can reject the null hypothesis.

We have a p-value far below our alpha of 0.05, so we can conclude that the difference in average spend per customer is not due to chance.

We can now treat these segments as genuinely higher spenders, consistent with a preference for more expensive brands. Let's look at the relationship between quantity sold and average price.

In [49]: # Create average quantity vs average price dataframe


qty_price = pd.merge(average_price, segment_quantity)

# Debug
qty_price.sample(3)

Out[49]:
lifestage premium_customer price prod_qty qty_member

11 Midage singles/couples Premium 3.77 14327 6.05

7 Older singles/couples Budget 3.88 32697 6.75

19 Young singles/couples Premium 3.67 10543 4.25

In [50]: # Plot the scatter plot


scatter = px.scatter(
data_frame=qty_price,
x='price',
y='qty_member',
size='prod_qty',
color='lifestage',
hover_name='premium_customer',
title='Average Packets Vs Average Price',
labels={'qty_member': 'Average Packets',
'premium_customer': 'Customer Segment'},
template='plotly_white'
)
scatter

[Scatter plot: Average Packets Vs Average Price — one point per segment, coloured by lifestage and sized by total quantity bought.]

The scatter plot above shows some interesting insights. As we expected, families tend to buy larger quantities of less expensive chips, but this holds only for Older and Young families; New families do not follow the trend, showing a preference for quality over quantity (lower quantity, higher average price).

We can also see that Mainstream Young Singles/Couples and Mainstream Midage Singles/Couples are willing to pay more per chip than their Premium and Budget counterparts, and that Singles/Couples buy more chips as they get older (Young S/C buy less on average than Midage S/C, and so on), perhaps due to greater health awareness among younger demographics.

Let's get to know the segments that buy more than average and pay more than average, with the aim of finding ways to keep them buying and retain their business.

Quantity:
Older & Younger Families

Price:
Midage Singles/Couples (Mainstream)
Young Singles/Couples (Mainstream)

It's in the client's interest to retain and serve these segments well; to do so, the store should keep in stock the brands and sizes these segments buy the most. Let's figure out which those are.

Older Families & Young Families

We are going to create an affinity table for brand and packet size for each of the four segments. How do we build this affinity table?

We can only gauge preference towards a specific brand or size by comparing the proportion that the brand occupies in the target segment's total quantity bought against the same proportion for the rest of the population.

Example: take our Older Families segment. We compare how much the brand 'Kettle' contributes to the total quantity sold for our target segment versus the population (everyone except the target).

In this example we can imagine something like this:


Brand: Kettle
Target %: 0.20
Population %: 0.16
Affinity = 0.20 / 0.16 = 1.25
We can say that our target segment is 25% more likely to purchase Kettle than the population.

Let's code these steps in a function so we can apply them to all four segments without rewriting the code each time.

In [51]: # Create Affinity Calculator Function


def affinity(target, population, metric):
    """
    Compare two dataframes (target segment vs population) for affinity
    towards a specific metric. For the purposes of this project, metric
    has to be either 'brand' or 'packet_size'.
    """
    # Create quantity totals
    qty_target = target.prod_qty.sum()
    qty_population = population.prod_qty.sum()

    # Create total-per-metric datasets
    target_qty_brand = pd.DataFrame(target.groupby(
        metric, as_index=False)['prod_qty'].sum())
    population_qty_brand = pd.DataFrame(
        population.groupby(metric, as_index=False)['prod_qty'].sum())

    # Create qty_total columns (each value's share of the group total)
    target_qty_brand['qty_total'] = target_qty_brand.prod_qty / qty_target
    population_qty_brand['qty_total'] = population_qty_brand.prod_qty / qty_population

    # Rename columns
    target_qty_brand.rename(columns={'qty_total': 'Target'}, inplace=True)
    population_qty_brand.rename(
        columns={'qty_total': 'Population'}, inplace=True)

    # Drop unnecessary columns
    target_qty_brand.drop('prod_qty', axis=1, inplace=True)
    population_qty_brand.drop('prod_qty', axis=1, inplace=True)

    # Create affinity dataframe
    affinity = pd.merge(target_qty_brand, population_qty_brand)
    # Create proportion column
    affinity['Proportion'] = affinity.Target / affinity.Population
    return affinity.nlargest(5, 'Proportion')

In [52]: # Create older families dataset


older_families = data_final.query('lifestage == "Older families"')

# Debug
older_families.lifestage.unique()

Out[52]: ['Older families']


Categories (1, object): ['Older families']

In [53]: # Create Younger Families dataset


young_families = data_final.query('lifestage == "Young families"')

# Debug
young_families.lifestage.unique()

Out[53]: ['Young families']


Categories (1, object): ['Young families']

Older Families Brand and Size Preferences

In [54]: # Create Target and Population DataSets


target = older_families
index = older_families.index
population = data_final.drop(index)

# Get affinity table for brand


affinity(target, population, 'brand')

Out[54]:
brand Target Population Proportion

10 NCC 0.007039 0.005391 1.305685

0 Bruger Rings 0.007794 0.006014 1.296110

20 WW 0.058155 0.045325 1.283070

13 RRD 0.079902 0.062633 1.275710

2 Cheetos 0.013838 0.011339 1.220353

In [55]: # Get affinity table for packet size


affinity(target, population, 'packet_size')

Out[55]:
packet_size Target Population Proportion

11 180 0.007440 0.005515 1.349092

15 220 0.007794 0.006014 1.296110

13 200 0.022079 0.017066 1.293680

12 190 0.014605 0.011539 1.265695

7 160 0.014124 0.011481 1.230196

Older families are about 30% more likely to buy NCC, Burger Rings, WW & RRD, and about 30% more likely to buy packets sized between 180g and 220g.

Young Families Brand and Size Preferences

In [56]: # Create Target and Population DataSets


target = young_families
index = young_families.index
population = data_final.drop(index)

# Get brand affinity table young families


affinity(target, population, 'brand')

Out[56]:
brand Target Population Proportion

15 Sunbites 0.015027 0.011588 1.296693

6 French Fries 0.006925 0.005392 1.284340

1 CCs 0.022220 0.017629 1.260430

13 RRD 0.077783 0.063467 1.225560

2 Cheetos 0.013683 0.011429 1.197148

In [57]: # Get size affinity table young families


affinity(target, population, 'packet_size')

Out[57]:
packet_size Target Population Proportion

1 90 0.015027 0.011588 1.296693

12 190 0.014335 0.011665 1.228890

13 200 0.020876 0.017426 1.197964

15 220 0.007347 0.006146 1.195475

11 180 0.006771 0.005694 1.189099

Young families, meanwhile, are almost 30% more likely to buy Sunbites, French Fries and CCs, and 30% more likely to buy 90g packets. The preference for smaller packets might be because the children in these families are younger.

What brands sell 90g packets?

In [58]: # Brands that sell 90g packets


list(data_final.query('packet_size == 90').prod_name.unique())

Out[58]: ['Sunbites Whlegrn Crisps Frch/Onin 90g',


'Snbts Whlgrn Crisps Cheddr&Mstrd 90g']

That probably explains the high affinity to Sunbites: they are the only brand offering 90g packs.

We can conclude that young families are more likely to buy brands that offer smaller packets, probably because they buy chips as a snack for children.

Midage Singles/Couples & Young Singles/Couples (Mainstream)

In [59]: # Create Midage S/C Dataset


midage = data_final.query(
'lifestage == "Midage singles/couples" & premium_customer == "Mainstream"')
# Debug
midage.lifestage.unique()

Out[59]: ['Midage singles/couples']


Categories (1, object): ['Midage singles/couples']

In [60]: # Create Young S/C Dataset


young = data_final.query(
'lifestage == "Young singles/couples" & premium_customer == "Mainstream"')
# Debug
young.lifestage.unique()

Out[60]: ['Young singles/couples']


Categories (1, object): ['Young singles/couples']

Midage S/C Brand and Size Preference

In [61]: # Create Target and Population DataSets


target = midage
index = midage.index
population = data_final.drop(index)

# Get affinity table for brand


affinity(target, population, 'brand')

Out[61]:
brand Target Population Proportion

18 Twisties 0.044208 0.038110 1.160022

9 Kettle 0.190779 0.166476 1.145986

17 Tostitos 0.043592 0.038240 1.139960

4 Cobs 0.044635 0.039200 1.138666

3 Cheezels 0.019970 0.018458 1.081924

In [62]: # Get affinity table for size


affinity(target, population, 'packet_size')

Out[62]:
packet_size Target Population Proportion

17 270 0.030927 0.025266 1.224032

18 330 0.060099 0.050349 1.193635

2 110 0.101983 0.090500 1.126883

5 135 0.014610 0.013148 1.111181

14 210 0.027891 0.025372 1.099306

Young S/C Brand and Size Preference

In [63]: # Create Target and Population DataSets


target = young
index = young.index
population = data_final.drop(index)

# Get affinity table for brand


affinity(target, population, 'brand')

Out[63]:
brand Target Population Proportion

19 Tyrrells 0.031685 0.025683 1.233708

18 Twisties 0.046127 0.037738 1.222308

5 Doritos 0.122969 0.100876 1.219015

17 Tostitos 0.045462 0.037898 1.199601

9 Kettle 0.196984 0.165114 1.193019

In [64]: # Get affinity table for size


affinity(target, population, 'packet_size')

Out[64]:
packet_size Target Population Proportion

17 270 0.031823 0.024995 1.273192

19 380 0.032128 0.025463 1.261783

18 330 0.061291 0.049911 1.227991

4 134 0.119366 0.100626 1.186231

2 110 0.106365 0.089735 1.185316

It seems we have uncovered preferences in all the segments. Let's list the top preferences for brand and size:

Older Families:
Brand: 30% more likely to buy NCC & Burger Rings.
Size: 30% more likely to buy packets sized 180g, 200g & 220g.

Young Families:
Brand: 30% more likely to buy Sunbites & French Fries.
Size: 30% more likely to buy the smallest packs, 90g to be exact.

Midage S/C:
Brand: 16% more likely to buy Twisties and 14% more likely to buy Kettle.
Size: 22% more likely to buy 270g packs and 19% more likely to buy 330g.

Young S/C:
Brand: 23% more likely to buy Tyrrells and 22% Twisties.
Size: 27% more likely to buy 270g packs and 26% 380g.

Let's visualise the relationship between price and size for the preferred brands to confirm whether they follow our beliefs about the selected segments' purchasing patterns.

In [65]: # Create list of brands to select


brands = ['NCC', 'Bruger Rings', 'Sunbites',
'French Fries', 'Twisties', 'Kettle', 'Tyrrells']

# Create dataframe only with selected brands


selected_brands = data_final[data_final['brand'].isin(brands)]

# Debug
selected_brands.brand.unique()

Out[65]: array(['Kettle', 'Tyrrells', 'Twisties', 'Bruger Rings', 'NCC',


'Sunbites', 'French Fries'], dtype=object)

In [66]: # Create dataset to plot


size_by_price = selected_brands.groupby('brand', as_index=False)[
['packet_size', 'price']].mean().round(2)

# Plot the scatter plot


scatter = px.scatter(
data_frame=size_by_price,
x='price',
y='packet_size',
color='brand',
size='packet_size',
title='Brands Average Size Vs Average Price',
labels={'packet_size': 'Average Pack Size'},
template='plotly_white'
)
scatter

[Scatter plot: Brands Average Size Vs Average Price — one point per selected brand (Bruger Rings, French Fries, Kettle, NCC, Sunbites, Twisties, Tyrrells), sized by average pack size.]

Let’s recap what we’ve found! Sales have mainly been driven by Budget older families, Mainstream young singles/couples, and Mainstream retirees. The high overall spend on chips for Mainstream young singles/couples and retirees is due to there being more of them than other buyers. Mainstream midage and young singles/couples are also more likely to pay more per packet of chips, which is indicative of impulse buying behaviour. We’ve also found that Mainstream young singles/couples are 23% more likely to purchase Tyrrells chips than the rest of the population. The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near areas that young singles and couples frequent, to increase visibility and impulse purchases. Quantium can help the Category Manager identify where these segments shop and measure the impact of the changed placement. We’ll work on measuring the impact of trials in the next task, and on putting all of this together in the third task.

Conclusion
What have we found out?

Most sales come from Older Families (Budget), Young Singles/Couples (Mainstream) & Retirees (Mainstream).
Older Families, Young Families & Older Singles/Couples bring in the most revenue per customer.
Young Singles/Couples (Mainstream) & Midage Singles/Couples (Mainstream) pay more per chip packet.
Older Families and Young Families buy more packets of chips on average.
Older and Young Families are about 30% more likely to buy brands that cost under 3.00; Young Families are also more likely to buy small packets.
Midage and Young Singles/Couples are about 20% more likely to buy more expensive brands, above 4.00.

Our recommendations:

1. Increase small-sized pack offerings to increase the number of families buying chips and to keep them returning.
2. Position the smaller and cheaper packs closer to other items families frequently buy, to increase purchases.
3. Create stands that house only chips below 3.00, and others with chips above 4.00, at strategic points in the store, so customers walk past other items and are likely to add them to their shopping basket.
