Visualisation All
import numpy as np
import matplotlib.pyplot as plt
Another Basic Plot
# To avoid calling plt.show()
# we can use
%matplotlib
# in IPython (interactive Python shell)
# and for Jupyter Notebook use
%matplotlib inline
# In IPython:
plt.close() # to close the figure window
Ticks, …
plt.plot(np.random.randn(1000).cumsum())
# define tick locations and labels (example values), then set them:
tick_loc = [0, 250, 500, 750, 1000]
tick_lab = ['one', 'two', 'three', 'four', 'five']
plt.xticks(tick_loc, tick_lab, rotation=30, fontsize='small')
…, Labels, …
# Change the x-axis label:
plt.xlabel('Stages')
… and Legends
# Create three sets of random data
data = [[], [], []]
for i in range(3):
    data[i] = np.random.randn(100).cumsum()
# plot them with example line styles and legend labels
styles = ['k-', 'k--', 'k.']
labels = ['one', 'two', 'three']
for i in range(3):
    plt.plot(data[i], styles[i], label=labels[i])
plt.legend(loc='best')
# see also:
# https://matplotlib.org/users/annotations.html
Time to Save the Plot
filename = 'UpDown.png'
plt.savefig(filename)
Scatter Plot
xvals = np.random.randn(100).cumsum()
yvals = np.random.randn(100).cumsum()
plt.scatter(xvals, yvals)
# repeat with fresh random data
xvals = np.random.randn(100).cumsum()
yvals = np.random.randn(100).cumsum()
plt.scatter(xvals, yvals)
Scatter Plot (cont'd)
# random sizes for the dots
sizes = abs(np.random.randn(100) * 100)
plt.scatter(xvals, yvals, s=sizes)
# repeat with fresh random data
xvals = np.random.randn(100).cumsum()
yvals = np.random.randn(100).cumsum()
plt.scatter(xvals, yvals, s=sizes)
Histograms
vals = np.random.randn(100)
# Univariate Histogram
plt.hist(vals, alpha=0.5)
Figures and Subplots
# an empty figure
fig = plt.figure()
Figures and Subplots (con't)
# add two subplots and draw into them
# (subplot layout assumed; the original slide omits the ax1/ax2 creation)
ax1 = fig.add_subplot(2, 1, 1)
ax2 = fig.add_subplot(2, 1, 2)
# histogram
ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
# scatterplot
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
# in IPython
plt.close()  # to close the figure window
Grids of Subplots
# figure with 2x3 subplots
fig, axes = plt.subplots(2, 3)
Grids of Subplots (cont'd)
# Make histograms directly comparable by
# sharing same x-axis ticks and y-axis ticks.
# Have to do this WHEN creating the figure
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True)
First, Read Data from CSV file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
sales = pd.read_csv("https://raw.githubusercontent.com/GerhardTrippen/DataSets/master/sample-salesv2.csv",
                    parse_dates=['date'])
sales.head()
sales.dtypes
sales.describe()
sales['unit price'].describe()
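A minimal sketch of what `parse_dates` does, using an inline CSV in place of the course URL (the two column names below simply mirror the sales file; the values are made up):

```python
import io
import pandas as pd

csv_text = "date,ext price\n2014-01-01,100.5\n2014-02-01,200.0\n"
# parse_dates converts the named column to datetime64 on read;
# without it, 'date' would stay as plain strings (object dtype)
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
```

Having a true datetime column is what later makes `set_index('date')` and `resample()` possible.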
Customers
customers = sales[['name','ext price','date']]
customers.head()
customer_group = customers.groupby('name')
customer_group.size()
sales_totals = customer_group.sum()
sales_totals.sort_values('ext price').head()
my_plot = sales_totals.plot(kind='bar')
my_plot = sales_totals.plot(kind='barh')
# identical
my_plot = sales_totals.plot.bar()
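The groupby-then-plot pattern above can be sketched on a tiny made-up frame (customer names and prices are hypothetical):

```python
import pandas as pd

# hypothetical mini version of the sales data
sales_demo = pd.DataFrame({
    'name': ['Acme', 'Acme', 'Beta'],
    'ext price': [10.0, 30.0, 25.0],
})
# one row per customer, summing their purchases
totals = sales_demo.groupby('name').sum()
# ascending sort, as on the slide
totals_sorted = totals.sort_values('ext price')
```

`totals_sorted.plot(kind='bar')` would then draw exactly the bar chart shown, smallest customer first.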
Customers – Title and Labels
my_plot = sales_totals.sort_values('ext price', ascending=False).plot(
    kind='bar', legend=None, title="Total Sales by Customer")
my_plot.set_xlabel("Customers")
my_plot.set_ylabel("Sales ($)")
Customers with Product Category
customers = sales[['name', 'category', 'ext price', 'date']]
customers.head()
category_group = customers.groupby(['name', 'category']).sum()
category_group.head(10)
category_group = category_group.unstack()  # pivot the category level into columns
category_group.head(10)
my_plot = category_group.plot(kind='bar', stacked=True,
                              title="Total Sales by Customer")
my_plot.set_xlabel("Customers")
my_plot.set_ylabel("Sales ($)")
my_plot.legend(["Belts", "Shirts", "Shoes"], loc='best', ncol=3)
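What `unstack()` does to the two-level groupby result can be seen on a tiny made-up frame (names, categories, and prices are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    'name': ['Acme', 'Acme', 'Beta'],
    'category': ['Belt', 'Shirt', 'Belt'],
    'ext price': [10.0, 20.0, 5.0],
})
# two-level index: (name, category)
grouped = demo.groupby(['name', 'category']).sum()
# unstack pivots the inner index level (category) into columns,
# inserting NaN where a customer bought nothing in a category
wide = grouped.unstack()
```

The wide frame is what `plot(kind='bar', stacked=True)` needs: one row per customer, one column per category.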
Customers with Product Category –
Sorted!
category_group = category_group.sort_values(('ext price', 'Belt'),
                                            ascending=False)
category_group.head()
my_plot = category_group.plot(kind='bar', stacked=True,
                              title="Total Sales by Customer")
# purchase_patterns is assumed built on an omitted slide, e.g.:
purchase_patterns = sales[['ext price', 'date']]
purchase_plot = purchase_patterns['ext price'].hist(bins=20)
Purchase Patterns – Timeline
# Take date from the data and make it the index
purchase_patterns = purchase_patterns.set_index('date')
purchase_patterns.head()
# sorted by time
purchase_patterns.sort_index()
# resampled by months
purchase_plot = purchase_patterns.resample('M').sum().plot(
    title="Total Sales by Month", legend=None)
Boxplots …
# Box and Whisker Plots
sales.boxplot() # Not very useful!
# 'visitors' is a separate dataset loaded on an earlier (omitted) slide
print(visitors.shape)
print(visitors.head())
print(visitors.dtypes)
Histograms, Density Plots, Box and
Whisker Plots
# Univariate Histograms
visitors.hist()
Correlation Matrix Plot
# correlation matrix
correlations = visitors.corr()
# plot correlation matrix (generic)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
Hypothesis Testing Versus
Exploratory Data Analysis
An analyst may have an “a priori” hypothesis (one presupposed
by experience) to test
For example, has increasing fee-structure led to
decreasing market share?
Hypothesis Testing: test hypothesis market share has
decreased
Hypothesis Testing Vs
Exploratory Data Analysis (cont’d)
However, we do not always have a priori notions about data
In this case, use Exploratory Data Analysis (EDA)
Approach useful for:
– Delving into data
– Examining important interrelationships between attributes
– Identifying interesting subsets or patterns
– Discovering possible relationships between predictors
and target variable
Getting to Know the Data Set
– Graphs, plots, and tables often uncover important
relationships in data
– The 3,333 records and 21 variables in churn data set
are explored (see churn.txt)
– Simple approach looks at field values of records
Getting to Know the Data Set –
Python Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# load the churn data set (file name as given on the previous slide)
churn = pd.read_csv("churn.txt")
churn.shape
churn.dtypes
churn.info()
Getting to Know the Data Set (cont’d)
– Eight of the attributes:
» State: categorical
» Account Length: numeric
» Area Code: categorical
» Phone: categorical
» Intl Plan: Boolean
» VMail Plan: Boolean
» Vmail Messages: numeric
» Day Mins: numeric
– “churn” attribute indicates customers leaving one company
in favor of another company’s products or services
Exploratory Data Analysis
Goals:
– Investigate variables as part of the Data Understanding
Phase
» Numeric: analyze histograms, scatter plots,
summary statistics
» Categorical: examine distributions,
cross-tabulations
– Become familiar with data
– Explore relationships among variable sets
– While performing EDA, remain focused on the objective,
i.e., creating a data mining model of customers likely to
“churn”
Exploring the Target – Python Code
churn["Churn?"]
# in comparison: matplotlib
churn["Churn?"].value_counts().plot(kind='bar', title=
"Churning Customers") 37
Exploring Categorical Variables
– Cross-tabulation quantifies relationship between Churn and
International Plan
– International Plan and Churn variables are both categorical
– Figure 3.4 shows proportion of customers in the
International Plan with churn overlay
– International Plan: yes = 9.69%, no = 90.31%
– Possibly, a greater proportion of those in the International
Plan are churners?
Exploring Categorical Variables –
Python Code
# churn_crosstab is assumed created on an earlier slide, e.g.:
churn_crosstab = pd.crosstab(churn['Intl Plan'], churn['Churn?'])
# normalize each row by its row total
churn_crosstab_norm = churn_crosstab.div(churn_crosstab.sum(axis=1), axis=0)
churn_crosstab_norm
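A minimal sketch of the row-normalized cross-tabulation on a tiny made-up frame (the column names and category values are assumptions mirroring the churn file):

```python
import pandas as pd

demo = pd.DataFrame({
    'Intl Plan': ['yes', 'yes', 'no', 'no'],
    'Churn?': ['True.', 'False.', 'False.', 'False.'],
})
# counts of each (plan, churn) combination
ct = pd.crosstab(demo['Intl Plan'], demo['Churn?'])
# divide each row by its row total so every row sums to 1
ct_norm = ct.div(ct.sum(axis=1), axis=0)
```

After normalization each row is a conditional distribution: the churn rate *given* plan membership, which is what the overlay bar chart displays.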
Exploring Numeric Variables
– Numeric summary measures for several variables shown
– see analysis in Python: churn.describe()
– Includes min and max, mean, median, std, and 1st
and 3rd quartile
– For example, Account Length has min = 1 and max = 243
– Mean and median both ~101, which indicates
symmetry
# Density Plots
churn.plot(kind='density', subplots=True, layout=(4, 4), sharex=False)
# distplot is deprecated in newer seaborn; histplot is its replacement
sns.distplot(churn["CustServ Calls"])
– Again, histogram of
Customer Service Calls shown
– Normalized values enhance pattern
of churn
– Customers calling customer service
3 or fewer times are far less likely to churn
– Results: Carefully track number of customer service calls
made by customers; Offer incentives to retain those
making higher number of calls
– Data mining model will probably include Customer
Service Calls as predictor
Exploring Numerical Variables –
Python Code
import numpy as np
Exploring Numeric Variables (cont’d)
– Normalized histogram of Day Minutes shown
with Churn overlay (Top)
– Indicates high usage customers churn at
significantly greater rate
– Results: Carefully track customer Day
Minutes once the total exceeds 200
– Investigate why those with high usage tend
to leave
– Normalized histogram of Evening Minutes
shown with Churn overlay (Bottom)
– Higher usage customers churn slightly
more
– Results: Based on graphical evidence, we
cannot conclude beyond a reasonable
doubt that such an effect exists
Exploring Numeric Variables (cont’d)
– Additional EDA concludes no obvious association between
Churn and remaining numeric attributes (not shown)
– These numeric attributes probably not strong predictors in
data model
– However, they should be retained as input to model
– Important higher-level associations/interactions may exist
– In this case, let model identify which inputs are important
– Data mining performance adversely affected by many inputs
– Possibility: Use dimension-reduction technique such as
principal components analysis
Exploring Multivariate Relationships
– Multivariate graphics can uncover new interaction effects
which our univariate exploration missed
– Figure 3.20 shows a scatter plot of day minutes vs.
evening minutes, with churners indicated by the darker
circles
Selecting Interesting Subsets of
the Data for further Investigation
– Graphical EDA can uncover subsets of records that call
for further investigation, as the rectangle in Figure 3.21
illustrates
Exploring Multivariate Relationships –
Python Code
sns.scatterplot(x="Day Mins", y="Eve Mins", data=churn)
Selecting Interesting Subsets of the
Data for further Investigation (cont'd)
– Figure 3.22 shows that about 65% (115 of 177) of the
selected records are churners
– Those with high customer service calls and low day
minutes have a 65% probability of churning
– Figure 3.23 shows that only about 26% of customers with
high customer service calls and high day minutes are
churners
– Red-flag customers with high customer service calls and
low day minutes
Using EDA to uncover Anomalous Fields
– Exploratory data analysis will sometimes uncover strange or
anomalous records or fields which the earlier data cleaning
phase may have missed
– E.g., the area code field contains only three different values
for all the records, 408, 415, and 510 (which all happen to
be California area codes), as shown by Figure 3.24
Binning based on Predictive Value –
Python Code
churn['Eve Mins binned'] = pd.cut(x=churn['Eve Mins'],
    bins=[0, 160.01, 240.01, 400],
    labels=["Low", "Medium", "High"], right=False)
Using EDA to Investigate
Correlated Predictor Variables
– Just because two variables are correlated does not mean that we
should omit one of them
– Instead use the following strategy:
1) Identify any variables that are perfectly correlated (that is, r = 1.0
or r = -1.0). Do not retain both variables in the model, but rather omit
one
2) Identify groups of variables that are correlated with each other.
Then, later, during the modeling phase, apply dimension reduction
methods, such as principal components analysis, to these variables
Using EDA to Investigate
Correlated Predictor Variables (cont'd)
– There does not seem to be any relationship between day
minutes and day calls, nor between day calls and day charge
– On the other hand, there is a perfect linear relationship between
day minutes and day charge, indicating that day charge is a
simple linear function of day minutes only
– We may express this function as the estimated regression
equation:
Day charge = 0.000613 + 0.17 × Day minutes
– Since day charge is perfectly correlated with day minutes, then
we should eliminate one of the two variables
– After dealing with the perfectly correlated predictors, the
correlation of each numerical predictor with every other
numerical predictor should be checked: (see next slide)
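The two-step strategy (spot the perfect correlation, confirm the linear function) can be sketched on synthetic data built from the slide's own equation (the generated minutes and calls are made up; only the coefficients come from the slide):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
day_mins = pd.Series(rng.uniform(0, 400, size=200))
# Day Charge as the linear function quoted on the slide
day_charge = 0.000613 + 0.17 * day_mins
day_calls = pd.Series(rng.uniform(50, 150, size=200))  # unrelated predictor

df = pd.DataFrame({'Day Mins': day_mins, 'Day Charge': day_charge,
                   'Day Calls': day_calls})
# step 1: the correlation matrix exposes the r = 1.0 pair
corr = df.corr()
# step 2: a degree-1 fit recovers the slope and intercept
slope, intercept = np.polyfit(day_mins, day_charge, 1)
```

Finding r = 1.0 in `corr` is the signal to drop one of the pair; the recovered slope of 0.17 confirms the functional relationship.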
Using EDA to Investigate
Correlated Predictor Variables (cont'd)
– All relationships between the remaining numerical
predictors are very weak and statistically not significant.
Correlated Predictor Variables –
Python Code
from pandas.plotting import scatter_matrix
scatter_matrix(churn)
# correlation matrix
correlations = churn.corr()
print(correlations)
Correlated Predictor Variables –
Python Code
# plot correlation matrix (generic)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
Our EDA – Brief Summary
– The four charge fields are linear functions of the minute fields, and
should be omitted
– The area code field and/or the state field are anomalous, and should
be omitted until further clarification is obtained
– Some insights with respect to churn are as follows:
– Customers with the International Plan tend to churn more
frequently
– Customers with the Voice Mail Plan tend to churn less frequently
– Customers with four or more Customer Service Calls tend to
churn more frequently
– Customers with both high Day Minutes and high Evening
Minutes tend to churn at a higher rate than the other customers
– Customers with low Day Minutes and high Customer Service
Calls churn at a higher rate than the other customers