The Data Science Process

PHUONG NGUYEN
THE DATA SCIENCE PROCESS

How to embed machine learning into business
CONTENT
1. A SIMPLE EXAMPLE IN PYTHON
2. STANDARD DATA SCIENCE PROCESSES
3. MACHINE LEARNING CANVAS

DATA SCIENCE PROCESS
https://github.com/nnbphuong/datascience4biz/blob/master/Overview_of_the_Data_Science_Process.ipynb
THE DATA SCIENCE PROCESS
1. DETERMINE THE PURPOSE
2. OBTAIN THE DATA
import pandas as pd
# Load data
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape  # find the dimensions of the data frame
housing_df.head()  # show the first five rows
print(housing_df)  # print the data frame (pandas truncates long output)
# Rename columns: replace spaces with '_'
housing_df = housing_df.rename(columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_')
                      for s in housing_df.columns]  # all columns
# Show first four rows of the data

housing_df.loc[0:3] # loc[a:b] gives rows a to b, inclusive
housing_df.iloc[0:4] # iloc[a:b] gives rows a to b-1
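The difference between label-based and position-based slicing is easy to miss. A minimal sketch on a toy data frame follows; the values are made up for illustration and are not taken from WestRoxbury.csv.

```python
import pandas as pd

# Toy data frame with the default integer index 0..5 (illustrative values only)
toy_df = pd.DataFrame({'TOTAL_VALUE': [344.2, 412.6, 330.1, 498.6, 331.5, 337.4]})

by_label = toy_df.loc[0:3]      # label slice: the end label 3 is included
by_position = toy_df.iloc[0:4]  # position slice: the end position 4 is excluded

print(len(by_label), len(by_position))  # both selections hold four rows
print(by_label.equals(by_position))
```

With the default index, labels and positions coincide, so both calls return the same four rows. After filtering or shuffling, the index labels no longer match positions and the two calls can differ.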
# Different ways of showing the first 10
# values in column TOTAL_VALUE
housing_df['TOTAL_VALUE'].iloc[0:10]
housing_df.iloc[0:10]['TOTAL_VALUE']
housing_df.iloc[0:10].TOTAL_VALUE
# use dot notation if the column name has no spaces
# Show the fifth row of the first 10 columns

housing_df.iloc[4][0:10]
housing_df.iloc[4, 0:10]
housing_df.iloc[4:5, 0:10]
# use a slice to return a data frame
# Use pd.concat to combine non-consecutive columns into a
# new data frame. Axis argument specifies dimension along
# which concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6,0:2],
housing_df.iloc[4:6,4:6]], axis=1)
# To specify a full column, use:
housing_df.iloc[:, 0:1]
housing_df.TOTAL_VALUE
# show the first 10 rows of the first column

housing_df['TOTAL_VALUE'][0:10]
# Descriptive statistics
# show length of first column

print('Number of rows ', len(housing_df['TOTAL_VALUE']))
# show mean of column

print('Mean of TOTAL_VALUE ',
housing_df['TOTAL_VALUE'].mean())
# show summary statistics for each column

housing_df.describe()
# random sample of 5 observations
housing_df.sample(5)
# oversample houses with over 10 rooms

weights = [0.9 if rooms > 10 else 0.01
for rooms in housing_df.ROOMS]
housing_df.sample(5, weights=weights)
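To see what the weights do, here is a small sketch on made-up data (90 small houses, 10 large ones). It assumes only that pandas normalizes the weights to sum to 1, which is the documented behavior of DataFrame.sample.

```python
import pandas as pd

# Toy data: 90 houses with 6 rooms and 10 houses with 12 rooms (made-up values)
toy_df = pd.DataFrame({'ROOMS': [6] * 90 + [12] * 10})

# Same weighting scheme as above; pandas rescales the weights to sum to 1
weights = [0.9 if rooms > 10 else 0.01 for rooms in toy_df.ROOMS]

# Draw with replacement so the sample can exceed the 10 large houses available
sample = toy_df.sample(1000, weights=weights, replace=True, random_state=1)

base_share = (toy_df.ROOMS > 10).mean()    # share of large houses in the data
sample_share = (sample.ROOMS > 10).mean()  # share of large houses in the sample
print(base_share, round(sample_share, 2))
```

Large houses make up 10% of the toy data but the clear majority of the weighted sample, which is the point of oversampling a rare group.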
3. EXPLORE, CLEAN, AND PRE-PROCESS THE DATA
housing_df.columns  # print a list of variables
Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT', 'YR_BUILT', 'GROSS_AREA', 'LIVING_AREA',
       'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH', 'HALF_BATH', 'KITCHEN',
       'FIREPLACE', 'REMODEL'],
      dtype='object')
HANDLING VARIABLES
# REMODEL needs to be converted to a categorical variable
housing_df.REMODEL = housing_df.REMODEL.astype('category')
housing_df.REMODEL.cat.categories  # Show the categories
housing_df.REMODEL.dtype  # Check type of converted variable
# Output of cat.categories:
['None', 'Old', 'Recent']
# use drop_first=True to drop the first dummy variable
housing_df = pd.get_dummies(housing_df, prefix_sep='_', drop_first=True)
housing_df.columns
housing_df.loc[:, 'REMODEL_Old':'REMODEL_Recent'].head(5)
# Output of head(5):
   REMODEL_Old  REMODEL_Recent
0            0               0
1            0               1
2            0               0
3            0               0
4            0               0
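The dummy coding can be checked on a tiny made-up data frame. This sketch passes dtype=int, an addition not in the original code, because recent pandas versions return boolean dummy columns by default; the 0/1 display then matches the table above.

```python
import pandas as pd

# Made-up rows; only the REMODEL labels mirror the real data set
toy_df = pd.DataFrame({
    'ROOMS': [6, 10, 8, 9],
    'REMODEL': ['None', 'Recent', 'None', 'Old'],
})
toy_df['REMODEL'] = toy_df['REMODEL'].astype('category')

# drop_first=True drops the first category ('None'), which becomes the baseline
dummies_df = pd.get_dummies(toy_df, prefix_sep='_', drop_first=True, dtype=int)
print(list(dummies_df.columns))
print(dummies_df)
```

A row with 0 in both REMODEL_Old and REMODEL_Recent belongs to the dropped baseline category 'None'.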
DETECTING OUTLIERS
housing_df.plot.scatter(x='ROOMS', y='FLOORS', legend=False)
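A scatter plot relies on visual inspection. As a complement, the sketch below flags values more than three standard deviations from the mean. This rule of thumb is an assumption added here for illustration, not a method taken from the slides, and the ROOMS values are made up.

```python
import pandas as pd

# Toy ROOMS column: twenty typical houses plus one implausible entry (made up)
rooms = pd.Series([6, 7] * 10 + [30], name='ROOMS')

# Distance from the mean, measured in sample standard deviations
z_scores = (rooms - rooms.mean()) / rooms.std()

# Flag values more than 3 standard deviations away (a common rule of thumb)
outliers = rooms[z_scores.abs() > 3]
print(outliers)
```

A flagged value is only a candidate for review. Whether it is a data entry error or a genuine extreme case still has to be decided with domain knowledge.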
HANDLING MISSING DATA
import numpy as np

# To illustrate missing data procedures,
# we first convert a few entries for bedrooms to NAs.
# Then we impute these missing values
# using the median of the remaining values.
missingRows = housing_df.sample(10).index
housing_df.loc[missingRows, 'BEDROOMS'] = np.nan
print('Number of rows with valid BEDROOMS values after setting to NAN: ',
      housing_df['BEDROOMS'].count())
medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print('Number of rows with valid BEDROOMS values after filling NA values: ',
      housing_df['BEDROOMS'].count())
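The same imputation can be traced by hand on six made-up values, which makes it easy to confirm that count() and median() both skip missing entries.

```python
import numpy as np
import pandas as pd

# Made-up BEDROOMS column with two missing entries
toy_df = pd.DataFrame({'BEDROOMS': [3.0, 4.0, np.nan, 2.0, np.nan, 3.0]})

valid_before = toy_df['BEDROOMS'].count()      # count() ignores missing values
median_bedrooms = toy_df['BEDROOMS'].median()  # median of 2, 3, 3, 4
toy_df['BEDROOMS'] = toy_df['BEDROOMS'].fillna(median_bedrooms)
valid_after = toy_df['BEDROOMS'].count()

print(valid_before, median_bedrooms, valid_after)  # prints: 4 3.0 6
```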
NORMALIZING/RESCALING DATA
# Normalizing a data frame
norm_df = (housing_df - housing_df.mean()) / housing_df.std()
# Rescaling a data frame
norm_df = ((housing_df - housing_df.min()) /
           (housing_df.max() - housing_df.min()))
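A worked example on two made-up columns shows what each transformation produces. The numbers are chosen so the results can be checked by hand.

```python
import pandas as pd

# Two made-up columns on very different scales
toy_df = pd.DataFrame({'ROOMS': [4, 6, 8], 'LOT_SQFT': [2000, 5000, 8000]})

# z-score: each column ends up with mean 0 and sample standard deviation 1
z_df = (toy_df - toy_df.mean()) / toy_df.std()

# min-max: each column is mapped onto the range [0, 1]
minmax_df = (toy_df - toy_df.min()) / (toy_df.max() - toy_df.min())

print(z_df)       # both columns become -1, 0, 1
print(minmax_df)  # both columns become 0, 0.5, 1
```

After either transformation the two columns are directly comparable, even though LOT_SQFT was originally about a thousand times larger than ROOMS.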
4. REDUCE THE DATA DIMENSION
5. DETERMINE THE DATA SCIENCE TASK
6. PARTITION THE DATA
from sklearn.model_selection import train_test_split

# set random_state for reproducibility
# training (60%) and validation (40%)
trainData, validData = train_test_split(housing_df,
                                        test_size=0.40, random_state=1)
# produces Training: 3481 Validation: 2321
# training (50%), validation (30%), and test (20%)
trainData, temp = train_test_split(housing_df, test_size=0.5, random_state=1)
# now split temp into validation and test
validData, testData = train_test_split(temp, test_size=0.4, random_state=1)
# produces Training: 2901 Validation: 1741 Test: 1160
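The two-stage split works because the second call takes 40% of the remaining half: 0.5 x 0.4 = 20% of all rows go to test and 0.5 x 0.6 = 30% go to validation. The sketch below reproduces the idea with a shuffled index instead of train_test_split. The helper three_way_split is hypothetical, and its rounding at the boundaries may differ by a row from scikit-learn's.

```python
import numpy as np

def three_way_split(n_rows, train_frac=0.5, valid_frac=0.3, seed=1):
    """Shuffle row positions and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_train = int(train_frac * n_rows)
    n_valid = int(valid_frac * n_rows)
    train_idx = shuffled[:n_train]
    valid_idx = shuffled[n_train:n_train + n_valid]
    test_idx = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train_idx, valid_idx, test_idx

# 5802 is the row count implied by the partition sizes quoted above
train_idx, valid_idx, test_idx = three_way_split(5802)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The returned positions could then be used with iloc, for example housing_df.iloc[train_idx], to build the three data frames.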

7. CHOOSE THE TECHNIQUES
8. PERFORM THE TASK
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# create list of predictors and outcome
excludeColumns = ('TOTAL_VALUE', 'TAX')
predictors = [s for s in housing_df.columns if s not in excludeColumns]
outcome = 'TOTAL_VALUE'
# partition data
X = housing_df[predictors]
y = housing_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4,
                                                      random_state=1)
model = LinearRegression()
model.fit(train_X, train_y)
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
'TOTAL_VALUE': train_y,
'predicted': train_pred,
'residual': train_y - train_pred
})
# show sample of predictions
train_results.head()
      TOTAL_VALUE   predicted   residual
2024        392.0  387.726258   4.273742
5140        476.3  430.785540  45.514460
5259        367.4  384.042952 -16.642952
421         350.3  369.005551 -18.705551
1401        348.1  314.725722  33.374278
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
'TOTAL_VALUE': valid_y,
'predicted': valid_pred,
'residual': valid_y - valid_pred
})
valid_results.head()
      TOTAL_VALUE   predicted   residual
1822        462.0  406.946377  55.053623
1998        370.4  362.888928   7.511072
5126        407.4  390.287208  17.112792
808         316.1  382.470203 -66.370203
4034        393.2  434.334998 -41.134998
# import the utility function regressionSummary
from dmba import regressionSummary
# training set
regressionSummary(train_results.TOTAL_VALUE,
train_results.predicted)
# validation set
regressionSummary(valid_results.TOTAL_VALUE,
valid_results.predicted)
OUTPUT
Regression statistics (training)
Mean Error (ME) : -0.0000
Root Mean Squared Error (RMSE) : 43.0306
Mean Absolute Error (MAE) : 32.6042
Mean Percentage Error (MPE) : -1.1116
Mean Absolute Percentage Error (MAPE) : 8.4886
Regression statistics (validation)
Mean Error (ME) : -0.1463
Root Mean Squared Error (RMSE) : 42.7292
Mean Absolute Error (MAE) : 31.9663
Mean Percentage Error (MPE) : -1.0884
Mean Absolute Percentage Error (MAPE) : 8.3283
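To make the five statistics concrete, the sketch below computes them by hand for the five training rows printed earlier. It assumes the usual textbook definitions (residual = actual - predicted, percentages relative to the actual value); the dmba implementation may differ in detail. Because only five rows are used, the results will not match the full-set figures above.

```python
import math

# First five training rows shown earlier: actual TOTAL_VALUE and predictions
actual = [392.0, 476.3, 367.4, 350.3, 348.1]
predicted = [387.726258, 430.785540, 384.042952, 369.005551, 314.725722]

residuals = [a - p for a, p in zip(actual, predicted)]
n = len(residuals)

me = sum(residuals) / n                               # Mean Error
rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)  # Root Mean Squared Error
mae = sum(abs(r) for r in residuals) / n              # Mean Absolute Error
# Percentage errors are expressed relative to the actual value, times 100
mpe = 100 * sum(r / a for r, a in zip(residuals, actual)) / n
mape = 100 * sum(abs(r / a) for r, a in zip(residuals, actual)) / n

print(f'ME {me:.4f}  RMSE {rmse:.4f}  MAE {mae:.4f}  MPE {mpe:.4f}  MAPE {mape:.4f}')
```

ME and MPE keep the sign of the errors, so positive and negative residuals cancel; MAE and MAPE use absolute values, and RMSE penalizes large residuals more heavily.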
9. ASSESS AND INTERPRET THE RESULTS
10. DEPLOY THE BEST MODEL
new_data = pd.DataFrame({
'LOT_SQFT': [4200, 6444, 5035],
'YR_BUILT': [1960, 1940, 1925],
'GROSS_AREA': [2670, 2886, 3264],
'LIVING_AREA': [1710, 1474, 1523],
'FLOORS': [2.0, 1.5, 1.9],
'ROOMS': [10, 6, 6],
'BEDROOMS': [4, 3, 2],
'FULL_BATH': [1, 1, 1],
'HALF_BATH': [1, 1, 0],
'KITCHEN': [1, 1, 1],
'FIREPLACE': [1, 1, 0],
'REMODEL_Old': [0, 0, 0],
'REMODEL_Recent': [0, 0, 1],
})
print('Predictions: ', model.predict(new_data))
> Predictions: [384.47210285 378.06696706 386.01773842]
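New records must carry the same predictor columns, in the same order, as the training data. The sketch below shows one way to enforce that before calling predict. The helper align_to_predictors, the shortened predictor list, and the OWNER field are all hypothetical and serve only as illustration.

```python
import pandas as pd

def align_to_predictors(new_data, predictors):
    """Return new_data with exactly the training columns, in training order."""
    missing = [col for col in predictors if col not in new_data.columns]
    if missing:
        raise ValueError(f'Missing predictor columns: {missing}')
    return new_data[predictors]

# Shortened predictor list for illustration (the model above uses 13 columns)
predictors = ['LOT_SQFT', 'ROOMS', 'REMODEL_Old', 'REMODEL_Recent']

# Hypothetical incoming record with shuffled columns and one extra field
incoming = pd.DataFrame({
    'ROOMS': [6],
    'REMODEL_Recent': [1],
    'OWNER': ['n/a'],
    'LOT_SQFT': [5035],
    'REMODEL_Old': [0],
})

aligned = align_to_predictors(incoming, predictors)
print(list(aligned.columns))
```

Raising an error on a missing column is a deliberate choice here: silently filling a missing predictor with a default value would produce a prediction that looks valid but is not.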

THE STANDARD DATA SCIENCE PROCESS
THE STANDARD DATA SCIENCE PROCESS: CRISP-DM
THE STANDARD DATA SCIENCE PROCESS: SEMMA
THE STANDARD DATA SCIENCE PROCESS: KDD
THE MACHINE LEARNING CANVAS
THE MACHINE LEARNING CANVAS: GOAL
THE MACHINE LEARNING CANVAS: LEARN
THE MACHINE LEARNING CANVAS: PREDICT
THE MACHINE LEARNING CANVAS: EVALUATE
