EDA Project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # data visualization library
%matplotlib inline
import seaborn as sns
In [54]:
df = pd.read_csv(r'D:\Python_Diwali_Sales_Analysis\Diwali Sales Data.csv', encoding = 'unicode_escape')
# to avoid a decoding error, pass encoding = 'unicode_escape'.
A UnicodeDecodeError in Python, particularly when working with pandas, usually occurs when trying to read a file or data that contains non-UTF-8
encoded characters without specifying the correct encoding.
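The failure described above can be reproduced without a file. The following is a minimal sketch using only the standard library; the sample text and the choice of `latin-1` are assumptions made for illustration, since the real encoding of the sales file is not stated in this notebook.

```python
# Bytes as they might sit in a CSV that was saved with a legacy single-byte encoding.
# The sample text and the latin-1 codec are illustrative assumptions.
raw = "Café,100\n".encode("latin-1")   # 'é' becomes the single byte 0xE9

try:
    raw.decode("utf-8")                # 0xE9 followed by ',' is not valid UTF-8
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

text = raw.decode("latin-1")           # decoding with the matching codec succeeds
print(utf8_ok, text.strip())
```

The `encoding` argument of `pd.read_csv` plays the same role as the codec name passed to `decode` here.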
In [55]:
df.shape
In [56]: df.head()
Out [56]:
   User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State           Zone      Occupation  Product_Category  Orders  Am
1  1000732  Kartik     P00110942   F       26-35      35   1               Andhra Pradesh  Southern  Govt        Auto              3       239
2  1001990  Bindu      P00118542   F       26-35      35   1               Uttar Pradesh   Central   Automobile  Auto              3       239
In [57]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11251 entries, 0 to 11250
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 11251 non-null int64
1 Cust_name 11251 non-null object
2 Product_ID 11251 non-null object
3 Gender 11251 non-null object
4 Age Group 11251 non-null object
5 Age 11251 non-null int64
6 Marital_Status 11251 non-null int64
7 State 11251 non-null object
8 Zone 11251 non-null object
9 Occupation 11251 non-null object
10 Product_Category 11251 non-null object
11 Orders 11251 non-null int64
12 Amount 11239 non-null float64
13 Status 0 non-null float64
14 unnamed1 0 non-null float64
dtypes: float64(3), int64(4), object(8)
memory usage: 1.3+ MB
In [58]:
#drop unrelated / blank columns
df.drop(['Status', 'unnamed1'], axis = 1 , inplace = True)
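As an alternative to naming the blank columns by hand, pandas can drop every column that contains no values at all. This is a small sketch on a toy frame with hypothetical values, not the notebook's data; the column names are borrowed from the `df.info()` output above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two entirely empty columns.
toy = pd.DataFrame({
    "Orders": [1, 3, 2],
    "Amount": [100.0, np.nan, 250.0],
    "Status": [np.nan, np.nan, np.nan],
    "unnamed1": [np.nan, np.nan, np.nan],
})

# how="all" removes only the columns in which every value is missing,
# so Amount (partly filled) is kept.
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Naming the columns explicitly, as the cell above does, is still the safer choice when a column is blank by accident rather than by design.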
In [59]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11251 entries, 0 to 11250
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User_ID 11251 non-null int64
1 Cust_name 11251 non-null object
2 Product_ID 11251 non-null object
3 Gender 11251 non-null object
4 Age Group 11251 non-null object
5 Age 11251 non-null int64
6 Marital_Status 11251 non-null int64
7 State 11251 non-null object
8 Zone 11251 non-null object
9 Occupation 11251 non-null object
10 Product_Category 11251 non-null object
11 Orders 11251 non-null int64
12 Amount 11239 non-null float64
dtypes: float64(1), int64(4), object(8)
memory usage: 1.1+ MB
In [60]: pd.isnull(df)
Out [60]:
       User_ID  Cust_name  Product_ID  Gender  Age Group  Age  Marital_Status  State  Zone  Occupation  Product_Category  Orders  Amount
0 False False False False False False False False False False False False False
1 False False False False False False False False False False False False False
2 False False False False False False False False False False False False False
3 False False False False False False False False False False False False False
4 False False False False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
11246 False False False False False False False False False False False False False
11247 False False False False False False False False False False False False False
11248 False False False False False False False False False False False False False
11249 False False False False False False False False False False False False False
11250 False False False False False False False False False False False False False
Here we look for null values, which is one of the most important parts of the data cleaning process in Python. From the element-wise output above, however, it is hard to tell whether any exist: each False only indicates that the cell holds a value and is not null.
In [61]: pd.isnull(df).sum()
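Summing the boolean mask gives one null count per column, which is much easier to read than the full table. The sketch below uses a toy frame with hypothetical values. It also shows `dropna()`, on the assumption that the rows with nulls were removed this way: the cell that does so is not shown in this export, but the frame later reports 11239 rows, which equals the non-null count of Amount.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing amounts.
toy = pd.DataFrame({
    "Cust_name": ["A", "B", "C", "D"],
    "Amount": [100.0, np.nan, 250.0, np.nan],
})

null_counts = toy.isnull().sum()   # one count per column
print(null_counts)

rows_before = toy.shape[0]
toy = toy.dropna()                 # drop every row that has at least one null
rows_after = toy.shape[0]
print(rows_before, rows_after)
```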
In [62]:
df.shape
In [64]:
df.shape
The Amount column is stored as float (decimal), so we convert it to an integer type using .astype('datatype').
In [65]:
#change the data type
df['Amount'] = df['Amount'].astype('int')
In [66]: df['Amount'].dtypes
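Two details of this conversion are worth seeing on a small example: a column that still contains NaN cannot be cast to an integer type, and the cast truncates the fractional part instead of rounding. The values below are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

amounts = pd.Series([23952.0, 1999.99, np.nan])   # hypothetical values

try:
    amounts.astype("int")          # NaN has no integer representation
    converted_with_nan = True
except ValueError:
    converted_with_nan = False

# After removing the missing value the cast works; 1999.99 is truncated, not rounded.
as_int = amounts.dropna().astype("int")
print(converted_with_nan, as_int.tolist())
```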
In [67]:
df.columns
Out [68]:
       User_ID  Cust_name    Product_ID  Gender  Age Group  Age  Shaadi  State           Zone      Occupation   Product_Category  Orders  Amo
1      1000732  Kartik       P00110942   F       26-35      35   1       Andhra Pradesh  Southern  Govt         Auto              3       2393
2      1001990  Bindu        P00118542   F       26-35      35   1       Uttar Pradesh   Central   Automobile   Auto              3       2392
...    ...      ...          ...         ...     ...        ...  ...     ...             ...       ...          ...               ...     ...
11246  1000695  Manning      P00296942   M       18-25      19   1       Maharashtra     Western   Chemical     Office            4       370
11247  1004089  Reichenbach  P00171342   M       26-35      33   0       Haryana         Northern  Healthcare   Veterinary        3       367
11248  1001209  Oshin        P00201342   F       36-45      40   0       Madhya Pradesh  Central   Textile      Office            4       213
11249  1004023  Noonan       P00059442   M       36-45      37   0       Karnataka       Southern  Agriculture  Office            3       206
11250  1002744  Brumley      P00281742   F       18-25      19   0       Maharashtra     Western   Healthcare   Office            3       188
11239 rows × 13 columns
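The Shaadi column in the output above suggests a column rename whose input cell is not shown in this export. The later cells still refer to Marital_Status, which is consistent with the rename being displayed but not stored. The sketch below assumes `DataFrame.rename` was used and runs on a toy frame with hypothetical values.

```python
import pandas as pd

toy = pd.DataFrame({"Marital_Status": [0, 1], "Amount": [100, 200]})

# rename() returns a new DataFrame; the original is left untouched
# unless the result is assigned back (or inplace=True is passed).
renamed = toy.rename(columns={"Marital_Status": "Shaadi"})

print(list(renamed.columns))   # the copy carries the new name
print(list(toy.columns))       # the original frame is unchanged
```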
In [69]:
# describe() returns summary statistics of the numeric columns in the DataFrame (e.g. count, mean, std)
df.describe()
Out [69]:
User_ID Age Marital_Status Orders Amount
Gender
In [70]:
df.columns
In [71]:
# using seaborn for data visualization: the Gender column is displayed with a countplot
# this is the usual way of calling seaborn
sns.countplot(x = 'Gender', data = df, hue='Gender')
In [73]:
df.groupby(['Gender'], as_index = False)['Amount'].sum().sort_values(by='Amount', ascending = False)
Out [73]:
Gender Amount
0 F 74335853
1 M 31913276
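The two totals in Out [73] are easier to compare as shares of the overall amount. The short calculation below uses only the figures printed above; on these numbers the female share comes out at about 70%.

```python
# Totals copied from Out [73].
totals = {"F": 74335853, "M": 31913276}

grand_total = sum(totals.values())
shares = {gender: amount / grand_total for gender, amount in totals.items()}

for gender, share in shares.items():
    print(f"{gender}: {share:.1%}")
```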
In [74]:
sales_gen = df.groupby(['Gender'], as_index = False)['Amount'].sum().sort_values(by='Amount', ascending = False)
sns.barplot(x = 'Gender', y = 'Amount', data = sales_gen, hue = 'Gender')
Age group
In [75]:
ax = sns.countplot(x = 'Age Group', data = df, hue='Gender')
From the graph above we can see that most of the buyers are females in the 26-35 age group.
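The same reading can be checked numerically by counting rows for each combination of age group and gender. This is a sketch on a toy frame with hypothetical values, not the notebook's data.

```python
import pandas as pd

# Toy frame (hypothetical values) using the notebook's column names.
toy = pd.DataFrame({
    "Age Group": ["26-35", "26-35", "18-25", "26-35", "36-45", "18-25"],
    "Gender":    ["F",     "F",     "M",     "M",     "F",     "F"],
})

# size() counts the rows in each (Age Group, Gender) combination.
counts = toy.groupby(["Age Group", "Gender"]).size()
print(counts)

top_combination = counts.idxmax()   # index label of the largest count
print(top_combination)
```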
State
In [77]:
# top 10 sales among the states
sales_states = df.groupby(['State'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False)
From the graphs above we can see that most of the orders are from Uttar Pradesh, Maharashtra and Karnataka, in that order.
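The comment in the cell above mentions the top 10 states, while the visible line sorts the states without limiting them. One way to keep only the leading rows is `head()`, sketched here on a toy frame with hypothetical values and a smaller limit (`top_n` is an illustrative name).

```python
import pandas as pd

# Toy frame (hypothetical values) using the notebook's column names.
toy = pd.DataFrame({
    "State": ["Uttar Pradesh", "Maharashtra", "Karnataka", "Uttar Pradesh", "Delhi"],
    "Orders": [4, 3, 2, 5, 1],
})

top_n = 3  # the notebook's comment refers to 10

sales_states = (
    toy.groupby("State", as_index=False)["Orders"].sum()
       .sort_values(by="Orders", ascending=False)
       .head(top_n)               # keep only the first top_n rows after sorting
)
print(sales_states)
```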
Marital status
In [79]:
df.columns
In [80]:
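sns.set(rc={'figure.figsize':(7,2)}) # set the figure size before drawing so that it applies to this plot
ax = sns.countplot(data = df, x = 'Marital_Status', hue = 'Marital_Status')
for bars in ax.containers:
    ax.bar_label(bars) # write the count on top of each bar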
In [81]: # total amount spent for each marital status, split by gender
sales_state = df.groupby(['Marital_Status', 'Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount')
sns.set(rc={'figure.figsize': (7, 5)})
sns.barplot(data=sales_state, x='Marital_Status', y='Amount', hue='Gender')
Occupation
In [82]:
df.columns
In [83]:
sns.set(rc={'figure.figsize':(22,5)})
ax = sns.countplot(data = df, x = 'Occupation', hue = 'Occupation')
In [84]:
sales_state = df.groupby(['Occupation'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x ='Occupation', y='Amount', hue = 'Occupation')
From the graph above, most of the buyers work in the IT, Healthcare and Aviation sectors.
Product category
In [85]:
sns.set(rc={'figure.figsize':(27,7)})
ax = sns.countplot(data = df, x = 'Product_Category', hue = 'Product_Category')
for bars in ax.containers:
    ax.bar_label(bars)
In [86]:
sales_state = df.groupby(['Product_Category'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x ='Product_Category', y='Amount', hue = 'Product_Category')
From the graphs above we can see that most of the products sold belong to the Food, Clothing and Electronics categories.
In [87]:
sales_state = df.groupby(['Product_ID'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False)
sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x ='Product_ID', y='Orders', hue = 'Product_ID')
Married women in the 26-35 age group from Uttar Pradesh, Maharashtra and Karnataka, working in IT, Healthcare and Aviation, are more likely to buy products from the Food, Clothing and Electronics categories.