The Data Science Process

The document presents the data science process: standard frameworks (CRISP-DM, SEMMA, and KDD), the Machine Learning Canvas, and a worked example that applies a linear regression model to predict housing prices.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining; SEMMA for Sample, Explore, Modify, Model, and Assess; and KDD for Knowledge Discovery in Databases.

The Machine Learning Canvas consists of Goal, Learn, Predict, and Evaluate steps: Goal defines the business problem and metrics, Learn covers selecting and preparing the data, Predict covers building and selecting models, and Evaluate validates and assesses the models.

PHUONG NGUYEN

THE DATA SCIENCE PROCESS


How to embed Machine Learning into business
CONTENT
1. A SIMPLE EXAMPLE IN PYTHON

2. STANDARD DATA SCIENCE PROCESSES

3. MACHINE LEARNING CANVAS


DATA SCIENCE PROCESS

https://github.com/nnbphuong/datascience4biz/blob/master/Overview_of_the_Data_Science_Process.ipynb
THE DATA SCIENCE PROCESS
1. DETERMINE THE PURPOSE


2. OBTAIN THE DATA


import pandas as pd

# Load data
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape   # find the dimensions of the data frame
housing_df.head()  # show the first five rows
print(housing_df)  # show all the data

# Rename columns: replace spaces with '_'

housing_df = housing_df.rename(
    columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_')
                      for s in housing_df.columns]  # all columns

# Show first four rows of the data


housing_df.loc[0:3] # loc[a:b] gives rows a to b, inclusive
housing_df.iloc[0:4] # iloc[a:b] gives rows a to b-1
# Different ways of showing the first 10
# values in column TOTAL_VALUE

housing_df['TOTAL_VALUE'].iloc[0:10]
housing_df.iloc[0:10]['TOTAL_VALUE']
housing_df.iloc[0:10].TOTAL_VALUE
# use dot notation if the column name has no spaces

# Show the fifth row of the first 10 columns


housing_df.iloc[4][0:10]
housing_df.iloc[4, 0:10]
housing_df.iloc[4:5, 0:10]
# use a slice to return a data frame
# Use pd.concat to combine non-consecutive columns into a
# new data frame. Axis argument specifies dimension along
# which concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6,0:2],
housing_df.iloc[4:6,4:6]], axis=1)

# To specify a full column, use:


housing_df.iloc[:, 0:1]
housing_df.TOTAL_VALUE

# show the first 10 rows of the first column


housing_df['TOTAL_VALUE'][0:10]
# Descriptive statistics

# show length of first column


print('Number of rows ', len(housing_df['TOTAL_VALUE']))

# show mean of column


print('Mean of TOTAL_VALUE ',
housing_df['TOTAL_VALUE'].mean())

# show summary statistics for each column


housing_df.describe()
# random sample of 5 observations
housing_df.sample(5)

# oversample houses with over 10 rooms


weights = [0.9 if rooms > 10 else 0.01
for rooms in housing_df.ROOMS]
housing_df.sample(5, weights=weights)
3. EXPLORE, CLEAN, AND PRE-PROCESS THE DATA





housing_df.columns # print a list of variables

Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT',


'YR_BUILT', 'GROSS_AREA','LIVING_AREA',
'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH',
'HALF_BATH','KITCHEN', 'FIREPLACE',
'REMODEL'], dtype='object')


HANDLING VARIABLES







# REMODEL needs to be converted to a categorical variable
housing_df.REMODEL = housing_df.REMODEL.astype('category')
housing_df.REMODEL.cat.categories  # show the category labels
housing_df.REMODEL.dtype  # check type of converted variable

# use drop_first=True to drop the first dummy variable


housing_df = pd.get_dummies(housing_df,
prefix_sep='_', drop_first=True)
housing_df.columns
housing_df.loc[:,'REMODEL_Old':'REMODEL_Recent'].head(5)

# Categories of REMODEL: ['None', 'Old', 'Recent']

   REMODEL_Old  REMODEL_Recent
0            0               0
1            0               1
2            0               0
3            0               0
4            0               0
DETECTING OUTLIERS


housing_df.plot.scatter(x='ROOMS', y='FLOORS', legend=False)
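
Outliers can also be flagged programmatically rather than only by inspecting the plot. A minimal sketch, assuming the cleaned housing_df from above; the 3-standard-deviation cutoff is an assumed convention, not part of the original slides:

# flag rows whose ROOMS value lies more than 3 standard deviations from the mean
rooms = housing_df['ROOMS']
z_scores = (rooms - rooms.mean()) / rooms.std()
outliers = housing_df[z_scores.abs() > 3]
print('Rows flagged as ROOMS outliers:', len(outliers))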
HANDLING MISSING DATA







# To illustrate missing data procedures,
# we first convert a few entries for BEDROOMS to NA's.
# Then we impute these missing values
# using the median of the remaining values.
import numpy as np

missingRows = housing_df.sample(10).index
housing_df.loc[missingRows, 'BEDROOMS'] = np.nan
print('Number of rows with valid BEDROOMS values after setting to NAN: ',
      housing_df['BEDROOMS'].count())

medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print('Number of rows with valid BEDROOMS values after filling NA values: ',
      housing_df['BEDROOMS'].count())
NORMALIZING/RESCALING DATA



# Normalizing a data frame
norm_df = (housing_df - housing_df.mean()) / housing_df.std()

# Rescaling a data frame
norm_df = (housing_df - housing_df.min()) / (housing_df.max() - housing_df.min())
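
The same transformations can be done with scikit-learn's preprocessing scalers. A minimal sketch, assuming the dummy-coded housing_df from above; note that StandardScaler uses the population standard deviation, so its output differs slightly from the pandas version:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# z-score normalization (returns a NumPy array; wrap it to keep column names)
norm_df = pd.DataFrame(StandardScaler().fit_transform(housing_df),
                       columns=housing_df.columns)

# min-max rescaling to the [0, 1] range
rescaled_df = pd.DataFrame(MinMaxScaler().fit_transform(housing_df),
                           columns=housing_df.columns)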
4. REDUCE THE DATA DIMENSION
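
The slides give no code for this step; a minimal sketch of one common approach, principal components analysis on the scaled predictors (the choice of PCA, of two components, and of norm_df as input are assumptions for illustration):

from sklearn.decomposition import PCA

# PCA is scale-sensitive, so use the normalized data frame from the previous step
pca = PCA(n_components=2)
scores = pca.fit_transform(norm_df.drop(columns=['TOTAL_VALUE', 'TAX']))
print('Variance explained by the first two components:',
      pca.explained_variance_ratio_)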


5. DETERMINE THE DATA SCIENCE TASK


6. PARTITION THE DATA

from sklearn.model_selection import train_test_split

# set random_state for reproducibility
# training (60%) and validation (40%)
trainData, validData = train_test_split(housing_df,
                                        test_size=0.40, random_state=1)

# produces Training: 3481 Validation: 2321

# training (50%), validation (30%), and test (20%)


trainData, temp = train_test_split(housing_df, test_size=0.5, random_state=1)
# now split temp into validation and test
validData, testData = train_test_split(temp, test_size=0.4, random_state=1)

# produces Training: 2901 Validation: 1741 Test: 1160


7. CHOOSE THE TECHNIQUES
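
Choosing among techniques usually means fitting several candidates and comparing them on the validation partition. A minimal sketch, assuming trainData and validData from step 6; the k-nearest-neighbors regressor is an assumed alternative included only for illustration:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# compare candidate models on validation RMSE
predictor_cols = [c for c in trainData.columns if c not in ('TOTAL_VALUE', 'TAX')]
candidates = {'linear regression': LinearRegression(),
              'k-NN (k=5)': KNeighborsRegressor(n_neighbors=5)}
for name, candidate in candidates.items():
    candidate.fit(trainData[predictor_cols], trainData['TOTAL_VALUE'])
    pred = candidate.predict(validData[predictor_cols])
    rmse = mean_squared_error(validData['TOTAL_VALUE'], pred) ** 0.5
    print(name, 'validation RMSE:', round(rmse, 2))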

8. PERFORM THE TASK

from sklearn.linear_model import LinearRegression

# create list of predictors and outcome
excludeColumns = ('TOTAL_VALUE', 'TAX')
predictors = [s for s in housing_df.columns if s not in excludeColumns]
outcome = 'TOTAL_VALUE'

# partition data
X = housing_df[predictors]
y = housing_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.4,
                                                      random_state=1)

model = LinearRegression()
model.fit(train_X, train_y)
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
'TOTAL_VALUE': train_y,
'predicted': train_pred,
'residual': train_y - train_pred
})

# show sample of predictions


train_results.head()

TOTAL_VALUE predicted residual


2024 392.0 387.726258 4.273742
5140 476.3 430.785540 45.514460
5259 367.4 384.042952 -16.642952
421 350.3 369.005551 -18.705551
1401 348.1 314.725722 33.374278
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
'TOTAL_VALUE': valid_y,
'predicted': valid_pred,
'residual': valid_y - valid_pred
})

valid_results.head()

TOTAL_VALUE predicted residual


1822 462.0 406.946377 55.053623
1998 370.4 362.888928 7.511072
5126 407.4 390.287208 17.112792
808 316.1 382.470203 -66.370203
4034 393.2 434.334998 -41.134998
# import the utility function regressionSummary
from dmba import regressionSummary

# training set
regressionSummary(train_results.TOTAL_VALUE,
train_results.predicted)

# validation set
regressionSummary(valid_results.TOTAL_VALUE,
valid_results.predicted)
OUTPUT
Regression statistics (training)

Mean Error (ME) : -0.0000


Root Mean Squared Error (RMSE) : 43.0306
Mean Absolute Error (MAE) : 32.6042
Mean Percentage Error (MPE) : -1.1116
Mean Absolute Percentage Error (MAPE) : 8.4886

Regression statistics (validation)

Mean Error (ME) : -0.1463


Root Mean Squared Error (RMSE) : 42.7292
Mean Absolute Error (MAE) : 31.9663
Mean Percentage Error (MPE) : -1.0884
Mean Absolute Percentage Error (MAPE) : 8.3283
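
The statistics reported by regressionSummary can also be recomputed directly from the residuals using their standard definitions; a minimal sketch, assuming the valid_results data frame built above:

import numpy as np

resid = valid_results.residual
actual = valid_results.TOTAL_VALUE
print('ME  :', resid.mean())                         # mean error
print('RMSE:', np.sqrt((resid ** 2).mean()))         # root mean squared error
print('MAE :', resid.abs().mean())                   # mean absolute error
print('MPE :', (100 * resid / actual).mean())        # mean percentage error
print('MAPE:', (100 * resid / actual).abs().mean())  # mean absolute percentage error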
9. ASSESS AND INTERPRET THE RESULTS
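
One common way to assess the results is to examine the validation residuals: residuals centered near zero with no long tail suggest the model is not systematically biased. A minimal sketch, assuming valid_results from the previous step; the plot choice and the error threshold of 25 (in the same units as TOTAL_VALUE) are illustrative assumptions:

import matplotlib.pyplot as plt

# distribution of validation residuals (actual - predicted)
valid_results.residual.hist(bins=30)
plt.xlabel('residual')
plt.ylabel('count')
plt.show()

# share of validation houses predicted within 25 of the actual value
within = (valid_results.residual.abs() < 25).mean()
print('Share of predictions within 25:', round(within, 3))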


10. DEPLOY THE BEST MODEL


new_data = pd.DataFrame({
'LOT_SQFT': [4200, 6444, 5035],
'YR_BUILT': [1960, 1940, 1925],
'GROSS_AREA': [2670, 2886, 3264],
'LIVING_AREA': [1710, 1474, 1523],
'FLOORS': [2.0, 1.5, 1.9],
'ROOMS': [10, 6, 6],
'BEDROOMS': [4, 3, 2],
'FULL_BATH': [1, 1, 1],
'HALF_BATH': [1, 1, 0],
'KITCHEN': [1, 1, 1],
'FIREPLACE': [1, 1, 0],
'REMODEL_Old': [0, 0, 0],
'REMODEL_Recent': [0, 0, 1],
})
print('Predictions: ', model.predict(new_data))

> Predictions: [384.47210285 378.06696706 386.01773842]
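
Deployment typically also means persisting the fitted model so it can be reloaded outside the training notebook; a minimal sketch using joblib (the file name is an arbitrary assumption):

import joblib

# save the fitted model to disk ...
joblib.dump(model, 'west_roxbury_lm.joblib')

# ... and reload it later, e.g. in a scoring script
loaded_model = joblib.load('west_roxbury_lm.joblib')
print('Predictions from reloaded model:', loaded_model.predict(new_data))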


THE STANDARD DATA SCIENCE PROCESS

THE STANDARD DATA SCIENCE PROCESS: CRISP-DM
THE STANDARD DATA SCIENCE PROCESS: SEMMA
THE STANDARD DATA SCIENCE PROCESS: KDD
THE MACHINE LEARNING CANVAS



THE MACHINE LEARNING CANVAS: GOAL




THE MACHINE LEARNING CANVAS: LEARN



THE MACHINE LEARNING CANVAS: PREDICT


THE MACHINE LEARNING CANVAS: EVALUATE
