Thyroid Disease Detection
• Free T3 test: This test measures the level of free T3 in the blood. Free T3 is the unbound
form of T3, which is another active form of thyroid hormone. A low free T3 level can indicate
hypothyroidism, while a high free T3 level can indicate hyperthyroidism.
Other blood tests that may be ordered include:
• Thyroid peroxidase antibodies (TPOAb) test: This test measures the level of TPOAb
in the blood. TPOAb are antibodies that attack the thyroid gland. A high level of TPOAb
can indicate an autoimmune thyroid disease, such as Hashimoto’s thyroiditis.
• Thyroglobulin antibodies (TgAb) test: This test measures the level of TgAb in the
blood. TgAb are antibodies that attack thyroglobulin, a protein produced by the thyroid
gland. A high level of TgAb can indicate an autoimmune thyroid disease, such as Hashimoto's
thyroiditis or Graves' disease.
Imaging tests that may be ordered to diagnose thyroid disease include:
• Thyroid ultrasound: This test uses sound waves to create images of the thyroid gland.
This can be used to look for nodules, cysts, or other abnormalities in the thyroid gland.
• Thyroid scan: This test uses a small amount of radioactive iodine to create images of the
thyroid gland. This can be used to look for areas of the thyroid gland that are not functioning
properly.
Physical examinations can also be helpful in diagnosing thyroid disease. The doctor will look for
signs of hypothyroidism, such as weight gain, fatigue, and cold intolerance. The doctor will also
look for signs of hyperthyroidism, such as weight loss, anxiety, and heat intolerance.
2 Importing Libraries
[ ]: import pandas as pd
     import numpy as np
     import matplotlib.pyplot as plt
     import seaborn as sns
     import pickle

     from sklearn.model_selection import train_test_split
     # ML classifiers
     from sklearn.ensemble import RandomForestClassifier
     from sklearn import metrics
     # performance parameters
     from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
3 Reading Datasets
3.1 Short Information About the Datasets:
Here are the features of the datasets allhypo.data and allhyper.data, which are related to thyroid
disease:
• Age: Age in years
• Sex: Gender (M or F)
• On Thyroxine: Whether the patient is currently taking thyroxine (1 or 0)
• Thyroid Status: Whether the patient has hypothyroid (1) or hyperthyroid (0) disease
• TSH: Thyroid-stimulating hormone level in milli-international units per liter (mIU/L)
• T4: Free thyroxine level in nanograms per deciliter (ng/dL)
• T3: Free triiodothyronine level in nanograms per deciliter (ng/dL)
• FTI: Free thyroxine index
• TBG: Thyroxine-binding globulin level in milligrams per deciliter (mg/dL)
• RT3U: Resin T3 uptake (%)
The datasets also include header lines such as:
@RELATION thyroid
[ ]: df_hypo = pd.read_csv('data/allhypo.data', header = None, index_col = False)
df_hyper = pd.read_csv('data/allhyper.data', header = None, index_col = False)
[ ]: df_hypo.head()
[ ]: 0 1 2 3 4 5 6 7 8 9 … 20 21 22 23 24 25 26 27 28 \
0 41 F f f f f f f f f … t 125 t 1.14 t 109 f ? SVHC
1 23 F f f f f f f f f … t 102 f ? f ? f ? other
2 46 M f f f f f f f f … t 109 t 0.91 t 120 f ? other
3 70 F t f f f f f f f … t 175 f ? f ? f ? other
4 70 F f f f f f f f f … t 61 t 0.87 t 70 f ? SVI
29
0 negative.|3733
1 negative.|1442
2 negative.|2965
3 negative.|806
4 negative.|2807
[5 rows x 30 columns]
[ ]: df_hypo.shape
[ ]: (2800, 30)
[ ]: df_hypo.columns.unique()
[ ]: df_hyper.head()
[ ]: 0 1 2 3 4 5 6 7 8 9 … 20 21 22 23 24 25 26 27 28 \
0 41 F f f f f f f f f … t 125 t 1.14 t 109 f ? SVHC
1 23 F f f f f f f f f … t 102 f ? f ? f ? other
2 46 M f f f f f f f f … t 109 t 0.91 t 120 f ? other
3 70 F t f f f f f f f … t 175 f ? f ? f ? other
4 70 F f f f f f f f f … t 61 t 0.87 t 70 f ? SVI
29
0 negative.|3733
1 negative.|1442
2 negative.|2965
3 negative.|806
4 negative.|2807
[5 rows x 30 columns]
[ ]: df_hyper.shape
[ ]: (2800, 30)
4 Data Processing
This code will read the two thyroid datasets into Pandas DataFrames. The column
names are defined in the columns variable. The read_csv() function is used to read
the data from the files. The na_values parameter is used to specify the values that
represent missing data. The index_col parameter is used to specify the column that
should be used as the index.
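The reading cell described above did not survive the export; a minimal sketch, taking the column names from the index outputs below and assuming '?' is the missing-value marker:

[ ]: columns = ['age', 'sex', 'on thyroxine', 'query on thyroxine',
                'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
                'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
                'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
                'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
                'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source', 'labels']

     df_hypo = pd.read_csv('data/allhypo.data', names = columns, na_values = '?', index_col = False)
     df_hyper = pd.read_csv('data/allhyper.data', names = columns, na_values = '?', index_col = False)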
Index(['age', 'sex', 'on thyroxine', 'query on thyroxine',
'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source',
'labels'],
dtype='object')
age sex on thyroxine query on thyroxine on antithyroid medication sick \
0 41.0 F f f f f
1 23.0 F f f f f
2 46.0 M f f f f
3 70.0 F t f f f
4 70.0 F f f f f
TT4 T4U measured T4U FTI measured FTI TBG measured TBG \
0 125.0 t 1.14 t 109.0 f NaN
1 102.0 f NaN f NaN f NaN
2 109.0 t 0.91 t 120.0 f NaN
3 175.0 f NaN f NaN f NaN
4 61.0 t 0.87 t 70.0 f NaN
[5 rows x 30 columns]
Index(['age', 'sex', 'on thyroxine', 'query on thyroxine',
'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source',
'labels'],
dtype='object')
age sex on thyroxine query on thyroxine on antithyroid medication sick \
0 41.0 F f f f f
1 23.0 F f f f f
2 46.0 M f f f f
3 70.0 F t f f f
4 70.0 F f f f f
TT4 T4U measured T4U FTI measured FTI TBG measured TBG \
0 125.0 t 1.14 t 109.0 f NaN
1 102.0 f NaN f NaN f NaN
2 109.0 t 0.91 t 120.0 f NaN
3 175.0 f NaN f NaN f NaN
4 61.0 t 0.87 t 70.0 f NaN
[5 rows x 30 columns]
• Splitting the labels column at ‘|’ into two columns: ‘class’ and ‘id’ (see the sketch below).
• After that, dropping the labels column.
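The splitting cell did not survive the export; a minimal sketch, consistent with the column indexes shown below:

[ ]: df_hypo[['hypo_class', 'id']] = df_hypo['labels'].str.split('|', expand = True)
     df_hypo = df_hypo.drop('labels', axis = 1)

     df_hyper[['hyper_class', 'id']] = df_hyper['labels'].str.split('|', expand = True)
     df_hyper = df_hyper.drop('labels', axis = 1)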
Index(['age', 'sex', 'on thyroxine', 'query on thyroxine',
'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source',
'hypo_class', 'id'],
dtype='object')
Index(['age', 'sex', 'on thyroxine', 'query on thyroxine',
'on antithyroid medication', 'sick', 'pregnant', 'thyroid surgery',
'I131 treatment', 'query hypothyroid', 'query hyperthyroid', 'lithium',
'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH measured', 'TSH',
'T3 measured', 'T3', 'TT4 measured', 'TT4', 'T4U measured', 'T4U',
'FTI measured', 'FTI', 'TBG measured', 'TBG', 'referral source',
'hyper_class', 'id'],
dtype='object')
• Removing the trailing ‘.’ from the ‘class’ columns (sketch below)
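A sketch of that cleanup (the original cell is missing; later outputs show class values like 'negative' without the trailing period, so the period is effectively removed):

[ ]: df_hypo['hypo_class'] = df_hypo['hypo_class'].str.replace('.', '', regex = False).str.strip()
     df_hyper['hyper_class'] = df_hyper['hyper_class'].str.replace('.', '', regex = False).str.strip()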
[ ]: df_hypo_copy = df_hypo.copy(deep=True)
df_hyper_copy = df_hyper.copy(deep=True)
[ ]: print(df_hypo['hypo_class'].unique())
print(df_hyper['hyper_class'].unique())
[ ]: df_hyper.replace(['T3 toxic', 'goitre'], 'hyperthyroid', inplace = True)
• Since all values are common in both the datasets except ‘class’.
• So, we can concatenate both the datasets (sketch below).
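The concatenation cell is missing from the export; appending just the hyper_class column keeps the shape at (2800, 32), matching the output below:

[ ]: df_concat = pd.concat([df_hypo, df_hyper['hyper_class']], axis = 1)
     df_concat.head()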
T4U FTI measured FTI TBG measured TBG referral source hypo_class \
0 1.14 t 109.0 f NaN SVHC negative
1 NaN f NaN f NaN other negative
2 0.91 t 120.0 f NaN other negative
3 NaN f NaN f NaN other negative
4 0.87 t 70.0 f NaN SVI negative
id hyper_class
0 3733 negative
1 1442 negative
2 2965 negative
3 806 negative
4 2807 negative
[5 rows x 32 columns]
[ ]: df_concat.shape
[ ]: (2800, 32)
Now, the conditions based on which we replace the final ‘labels’ value using the two class
columns, i.e. ‘hypo_class’ and ‘hyper_class’:
• neg + neg = neg
• hypo + neg = hypo
• hyper + neg = hyper
[ ]: df_concat['labels'] = np.where((df_concat['hypo_class'] != 'negative'), df_concat['hypo_class'],
                             np.where((df_concat['hyper_class'] != 'negative'),
                                      df_concat['hyper_class'], 'negative'))
[ ]: df_concat.head()
hyper_class labels
0 negative negative
1 negative negative
2 negative negative
3 negative negative
4 negative negative
[5 rows x 33 columns]
[ ]: df = df_concat.drop(['referral source', 'TBG', 'hypo_class', 'id', 'hyper_class'], axis = 1)
[ ]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2800 entries, 0 to 2799
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 2799 non-null float64
1 sex 2690 non-null object
2 on thyroxine 2800 non-null object
3 query on thyroxine 2800 non-null object
4 on antithyroid medication 2800 non-null object
5 sick 2800 non-null object
6 pregnant 2800 non-null object
7 thyroid surgery 2800 non-null object
8 I131 treatment 2800 non-null object
9 query hypothyroid 2800 non-null object
10 query hyperthyroid 2800 non-null object
11 lithium 2800 non-null object
12 goitre 2800 non-null object
13 tumor 2800 non-null object
14 hypopituitary 2800 non-null object
15 psych 2800 non-null object
16 TSH measured 2800 non-null object
17 TSH 2516 non-null float64
18 T3 measured 2800 non-null object
19 T3 2215 non-null float64
20 TT4 measured 2800 non-null object
21 TT4 2616 non-null float64
22 T4U measured 2800 non-null object
23 T4U 2503 non-null float64
24 FTI measured 2800 non-null object
25 FTI 2505 non-null float64
26 TBG measured 2800 non-null object
27 labels 2800 non-null object
dtypes: float64(6), object(22)
memory usage: 612.6+ KB
Some columns only indicate whether the corresponding value column in the same row is filled in.
For example, ‘TSH measured’ holds ‘t’ and ‘f’ values: ‘t’ means the ‘TSH’ column in that row has
a value, and ‘f’ means it is NaN. Since we are going to handle the missing values anyway, there is
no point in keeping such columns in our dataset.
Let's drop these feature columns (sketch below).
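The drop cell is missing from the export; a minimal sketch selecting the indicator columns by their '… measured' suffix:

[ ]: measured_columns = [c for c in df.columns if c.endswith('measured')]
     df = df.drop(measured_columns, axis = 1)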
The dataset now looks like this:
[ ]: df.head()
[5 rows x 22 columns]
5 Types Of Datasets
• categorical
– nominal (order does not matter)
– ordinal (order matters)
• numerical
– discrete (counted)
– continuous (measured)
• Let's print the categorical features (the cell that builds cat_features is sketched below)
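The cell that builds cat_features is missing from the export; a minimal sketch selecting the object-dtype columns:

[ ]: cat_features = [i for i in df.columns if df[i].dtype == 'O']
     cat_features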
[ ]: ['sex',
'on thyroxine',
'query on thyroxine',
'on antithyroid medication',
'sick',
'pregnant',
'thyroid surgery',
'I131 treatment',
'query hypothyroid',
'query hyperthyroid',
'lithium',
'goitre',
'tumor',
'hypopituitary',
'psych',
'labels']
[ ]: len(cat_features)
[ ]: 16
• Numerical Features
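The cell that builds num_features is likewise missing; a sketch:

[ ]: num_features = [i for i in df.columns if df[i].dtype != 'O']
     num_features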
[ ]: for i in cat_features:
print('==='*20)
print(i, 'feature: unique values: ', df[i].unique())
print('==='*20)
============================================================
sex feature: unique values: ['F' 'M' nan]
============================================================
on thyroxine feature: unique values: ['f' 't']
============================================================
query on thyroxine feature: unique values: ['f' 't']
============================================================
on antithyroid medication feature: unique values: ['f' 't']
============================================================
sick feature: unique values: ['f' 't']
============================================================
pregnant feature: unique values: ['f' 't']
============================================================
thyroid surgery feature: unique values: ['f' 't']
============================================================
I131 treatment feature: unique values: ['f' 't']
============================================================
query hypothyroid feature: unique values: ['f' 't']
============================================================
query hyperthyroid feature: unique values: ['f' 't']
============================================================
lithium feature: unique values: ['f' 't']
============================================================
goitre feature: unique values: ['f' 't']
============================================================
tumor feature: unique values: ['f' 't']
============================================================
hypopituitary feature: unique values: ['f' 't']
============================================================
psych feature: unique values: ['f' 't']
============================================================
labels feature: unique values: ['negative' 'hypothyroid' 'hyperthyroid']
============================================================
[ ]: for i in num_features:
print('==='*12)
print(i, 'feature: unique values: ', len(df[i].unique()))
print('==='*12)
====================================
age feature: unique values: 94
====================================
TSH feature: unique values: 264
====================================
T3 feature: unique values: 65
====================================
TT4 feature: unique values: 218
====================================
T4U feature: unique values: 139
====================================
FTI feature: unique values: 210
====================================
[ ]: for i in num_features:
print('==='*12)
print(i, 'feature: minimum value: ', min(df[i].unique()))
print(i, 'feature: maximum value: ', max(df[i].unique()))
print('==='*12)
====================================
age feature: minimum value: 1.0
age feature: maximum value: 455.0
====================================
TSH feature: minimum value: 0.005
TSH feature: maximum value: 478.0
====================================
T3 feature: minimum value: 0.05
T3 feature: maximum value: 10.6
====================================
TT4 feature: minimum value: 2.0
TT4 feature: maximum value: 430.0
====================================
T4U feature: minimum value: 0.31
T4U feature: maximum value: 2.12
====================================
FTI feature: minimum value: 2.0
FTI feature: maximum value: 395.0
====================================
FTI
count 2505.000000
mean 110.787984
std 32.883986
min 2.000000
25% 93.000000
50% 107.000000
75% 124.000000
max 395.000000
[ ]: # categorical data
df.describe(include = 'object')
[ ]: df.isnull().sum()
[ ]: age 1
sex 110
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH 284
T3 585
TT4 184
T4U 297
FTI 295
labels 0
dtype: int64
Observations:
• age has maximum value 455 which is not possible (an outlier)
• each categorical features has two unique value except labels which has three unique values
• NULL values:
– age: 1
– sex: 110
– TSH: 284
– T3: 585
– TT4: 184
– T4U: 297
– FTI: 295
[ ]: df[df['age'].isnull()]
[1 rows x 22 columns]
[ ]: numerical_null = [i for i in df.columns if (df[i].dtype != 'O' and df[i].isnull().sum() != 0)]
     numerical_null
Filling the NaN values in all numerical columns with the median of that column.
[ ]: for i in numerical_null:
df[i].fillna(df[i].median(),inplace = True)
[ ]: df.isnull().sum()
[ ]: age 0
sex 110
on thyroxine 0
query on thyroxine 0
on antithyroid medication 0
sick 0
pregnant 0
thyroid surgery 0
I131 treatment 0
query hypothyroid 0
query hyperthyroid 0
lithium 0
goitre 0
tumor 0
hypopituitary 0
psych 0
TSH 0
T3 0
TT4 0
T4U 0
FTI 0
labels 0
dtype: int64
Let's handle the NaN value(s) in the categorical column(s). Separating the ‘object’ and
‘int’/‘float’ columns, and storing the ‘object’ ones in ‘categorical’ (sketch below).
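A sketch of that separation (the original cell is missing):

[ ]: categorical = [i for i in df.columns if df[i].dtype == 'O']
     categorical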
[ ]: ['sex',
'on thyroxine',
'query on thyroxine',
'on antithyroid medication',
'sick',
'pregnant',
'thyroid surgery',
'I131 treatment',
'query hypothyroid',
'query hyperthyroid',
'lithium',
'goitre',
'tumor',
'hypopituitary',
'psych',
'labels']
[ ]: categorical_null = [i for i in categorical if df[i].isnull().sum() != 0]
     categorical_null
[ ]: ['sex']
Replace missing values with the most frequent value (the mode).
[ ]: for i in categorical_null:
df[i].fillna(df[i].mode()[0], inplace = True)
[ ]: df[categorical_null].isnull().any()
[ ]: sex False
dtype: bool
[ ]: df.isnull().sum().any()
[ ]: False
[ ]: df.isnull().sum().sum()
[ ]: 0
6 Data Transformation
[ ]: numerical_all = [i for i in df.columns if (df[i].dtype != 'O')]
numerical_all
[ ]: from scipy.stats import skew

     # NOTE: this cell was partially garbled in the export and has been
     # reconstructed; df_filled is assumed to be a working copy of df.
     df_filled = df.copy(deep = True)

     plt.style.use('dark_background')
     plt.rcParams["figure.figsize"] = 13, 5

     def checkPlot(dataframe, feat):
         m = []
         # ------- LOG transformation -------
         log_target = np.log1p(dataframe[feat])
         df_filled['log_' + feat] = pd.DataFrame(log_target)
         fig, ax = plt.subplots(1, 2)
         sns.distplot(dataframe[feat], label = "Original Skew:{0}".format(np.round(skew(dataframe[feat]), 4)),
                      color = "r", ax = ax[0], axlabel = "ORIGINAL")
         sns.distplot(log_target, label = "Log Skew:{0}".format(np.round(skew(log_target), 4)),
                      ax = ax[1], axlabel = "LOG TRANSFORMED")
         fig.legend()
         m.append(np.round(skew(log_target), 4))

         # ------- SQUARE ROOT transformation -------
         sqrrt_target = dataframe[feat] ** (1 / 2)
         df_filled['sqrroot_' + feat] = pd.DataFrame(sqrrt_target)
         fig, ax = plt.subplots(1, 2)
         sns.distplot(dataframe[feat], label = "Original Skew:{0}".format(np.round(skew(dataframe[feat]), 4)),
                      color = "r", ax = ax[0], axlabel = "ORIGINAL")
         sns.distplot(sqrrt_target, label = "Sqrt Skew:{0}".format(np.round(skew(sqrrt_target), 4)),
                      ax = ax[1], axlabel = "SQRT TRANSFORMED")
         fig.legend()
         m.append(np.round(skew(sqrrt_target), 4))
         print(m)
[ ]: import warnings
warnings.filterwarnings('ignore')
[ ]: for i in df_filled[numerical_all]:
         print(" Plots after transformations for col : ", i)
         checkPlot(df_filled, i)
[Distribution plots, original vs. log/sqrt transformed, for each numerical feature]
Observations:
• After applying the transformations and calculating the skewness, we found the best result for
each numerical feature:
• Age: SQRT
• TSH: LOG
• T3: SQRT
• TT4: SQRT
• T4U: LOG
• FTI: SQRT
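The cell that creates df_transformed is missing from the export; given the shape (2800, 34) below, it is presumably a copy of df_filled carrying all twelve transformed columns. A sketch:

[ ]: df_transformed = df_filled.copy(deep = True)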
[ ]: df_transformed.columns
[ ]: df_transformed.shape
[ ]: (2800, 34)
Since only the above transformations are useful in our case, we will drop the others.
Now dropping the transformed columns which are not useful in our case:
[ ]: df_transformed.drop(['age', 'log_age', \
'TSH', 'sqrroot_TSH', \
'T3', 'log_T3', \
'TT4', 'log_TT4', \
'T4U', 'sqrroot_T4U', \
'FTI', 'log_FTI'], axis = 1, inplace = True)
[ ]: df_transformed.columns
[ ]: df_transformed.shape
[ ]: (2800, 22)
[ ]: df_transformed_cat = df_transformed.select_dtypes(include =␣
↪['object','category'])
df_transformed_cat.columns
[ ]: df_transformed_cat.shape
[ ]: (2800, 16)
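The encoding cell itself is missing; a minimal sketch using pd.get_dummies with drop_first = True, which is consistent with the 15 resulting columns and the '_t' suffixes seen later:

[ ]: # one-hot encode all categorical features except the last column ('labels')
     df_onehot_encoded = pd.get_dummies(df_transformed_cat.iloc[:, :-1], drop_first = True)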
[ ]: df_onehot_encoded.columns
[ ]: Index(['sex_M', 'on thyroxine_t', 'query on thyroxine_t',
'on antithyroid medication_t', 'sick_t', 'pregnant_t',
'thyroid surgery_t', 'I131 treatment_t', 'query hypothyroid_t',
'query hyperthyroid_t', 'lithium_t', 'goitre_t', 'tumor_t',
'hypopituitary_t', 'psych_t'],
dtype='object')
[ ]: df_onehot_encoded.shape
[ ]: (2800, 15)
Now, let’s do the label encoding on the target/labels columns (dependent feature)
[ ]: df_transformed_label = pd.DataFrame(df_transformed_cat.iloc[:,-1])
df_transformed_label.head()
[ ]: labels
0 negative
1 negative
2 negative
3 negative
4 negative
[ ]: from sklearn.preprocessing import LabelEncoder

     # (reconstructed: the encoder-fitting lines were lost in the export)
     label_encoder_random_forest = LabelEncoder()
     df_transformed_label['labels'] = label_encoder_random_forest.fit_transform(df_transformed_label['labels'])

     # we will save the encoder as pickle to use when we do the prediction. We will
     # need to decode the predicted values back to original
     file = "label_encoder_random_forest.pickle"
     pickle.dump(label_encoder_random_forest, open(file, "wb"))
[ ]: df_transformed_label['labels'].unique()
[ ]: array([2, 1, 0])
[ ]: df_transformed_label.head()
[ ]: labels
0 2
1 2
2 2
3 2
4 2
[ ]: df_onehot_encoded.head()
sick_t pregnant_t thyroid surgery_t I131 treatment_t \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
hypopituitary_t psych_t
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
6.0.2 Multi-collinearity
A few suggestions from people:
• If we are using a linear model to solve the problem, we have to deal with multicollinearity.
• Linear models are models that fit a line (or hyperplane) to predict or separate.
• Example linear models: SVM, logistic regression, linear regression.
• If you are going to solve the problem with a decision tree or any other tree-based model, there
is no need to deal with multicollinearity.
• To check multicollinearity: scatter plots, correlation, and VIF (between independent features).
• For classification problems, checking independent vs. dependent features is not useful.
• Multicollinearity is entirely about the independent features.
Ways to check multi-collinearity: scatter plot, correlation, and the Variance Inflation Factor
(VIF), all between independent features.
Can you calculate VIF for categorical variables? VIF cannot be used on categorical data;
statistically speaking, it wouldn't make sense. If you want to check independence between two
categorical variables, you can run a Chi-square test instead.
[ ]: from statsmodels.stats.outliers_influence import variance_inflation_factor

     def calculate_vif(X):
         # Calculating the VIF of each column of X
         vif = pd.DataFrame()
         vif["variables"] = X.columns
         vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
         return vif
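calculate_vif is applied to df_numerical below, but the cell defining df_numerical did not survive the export. A minimal sketch, selecting the six transformed numerical columns of df_transformed:

[ ]: df_numerical = df_transformed.select_dtypes(exclude = ['object'])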
[ ]: calculate_vif(df_numerical)
[ ]: variables VIF
0 sqrroot_age 23.879352
1 log_TSH 3.341043
2 sqrroot_T3 49.171247
3 sqrroot_TT4 282.213577
4 log_T4U 112.040640
5 sqrroot_FTI 183.589494
Observations: These features have high VIF (high correlation): sqrroot_age, sqrroot_T3,
sqrroot_TT4, log_T4U, sqrroot_FTI.
Now, let's handle this carefully, and explore what is happening behind the scenes of the VIF!
Let's take a deeper look at what this means. First, understand this:
• The statsmodels API assumes the regression line passes through the origin, i.e. the intercept
is 0.
• In that case, if we try to build a linear regression model using statsmodels, our intercept will
always be 0.
• So we have to tell the statsmodels API explicitly that our intercept should not be 0, and that's
why we have to add a constant column to our dataset.
• If we do linear regression via scikit-learn, it adds the constant internally; in statsmodels, we
have to do this externally.
Now coming to VIF:
• Internally, VIF is nothing but linear regression.
• In the backend, each independent feature is regressed on the other independent features.
• E.g. if we have X, Y, Z columns in the dataset: Y and Z act as independent features and X
becomes the dependent feature.
• Similarly, X and Y act as independent features to predict Z.
• And X and Z act as independent features to predict Y.
• This means VIF uses linear regression under the hood, and this VIF lives in the statsmodels API.
• And in the statsmodels API, we need to externally add a constant column to the dataset
(sketch below).
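The cell that creates df_numerical_constant is missing from the export; presumably it is just a copy of the numerical frame. A sketch:

[ ]: df_numerical_constant = df_numerical.copy(deep = True)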
[ ]: df_numerical_constant['constant'] = 1
[ ]: df_numerical_constant.head()
constant
0 1
1 1
2 1
3 1
4 1
[ ]: calculate_vif(df_numerical_constant)
[ ]: variables VIF
0 sqrroot_age 1.079170
1 log_TSH 1.420100
2 sqrroot_T3 1.608135
3 sqrroot_TT4 17.571336
4 log_T4U 6.915051
5 sqrroot_FTI 14.999932
6 constant 521.505157
We can see the VIF is significantly reduced after adding the constant column.
Now, let's iterate by dropping the feature with the highest VIF first (sketch below).
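The drop cell itself is missing; per the observations below, the feature removed is sqrroot_TT4:

[ ]: df_numerical_constant = df_numerical_constant.drop('sqrroot_TT4', axis = 1)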
[ ]: calculate_vif(df_numerical_constant)
[ ]: variables VIF
0 sqrroot_age 1.078160
1 log_TSH 1.395940
2 sqrroot_T3 1.607507
3 log_T4U 1.410238
4 sqrroot_FTI 1.545203
5 constant 206.999885
Observations:
• Now we can see the VIF is significantly reduced after dropping sqrroot_TT4, the feature with
the highest VIF.
• And none of the features are correlated now.
This looks good.
Now, let’s create the final dataset with numerical features by dropping the constant
column
[ ]: df_numerical_final = df_numerical_constant.drop('constant', axis = 1)
[ ]: df_numerical_final.head()
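The cell that builds df_encoded_cat (used in the next cell) is missing. Given the final shape of 21 columns, 5 numerical + 15 one-hot + 1 label, a plausible sketch is:

[ ]: df_encoded_cat = pd.concat([df_onehot_encoded, df_transformed_label], axis = 1)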
[ ]: df_categorical_final = df_encoded_cat.copy(deep = True)
[ ]: df_categorical_final.head()
Concatenate the numerical and categorical features and create a final dataset
[ ]: df_final = pd.concat([df_numerical_final, df_categorical_final], axis = 1)
[ ]: df_final.head()
on thyroxine_t query on thyroxine_t on antithyroid medication_t sick_t \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 0 0 0 0
psych_t labels
0 0 2
1 0 2
2 0 2
3 0 2
4 0 2
[5 rows x 21 columns]
[ ]: df_final.columns
df_final.labels.value_counts()
[ ]: 2 2503
1 220
0 77
Name: labels, dtype: int64
[ ]: df_final.columns
[ ]: df.columns
[ ]: df_final.head()
3 8.366600 0.148420 1.378405 0.683097 10.344080 0 1
4 8.366600 0.542324 1.095445 0.625938 8.366600 0 0
[5 rows x 21 columns]
[ ]: df_final.columns
7 Model Training
[ ]: # first step is to separate the independent and dependent features
     # independent features, i.e. all columns except 'labels', are stored in 'X'
     # dependent feature, i.e. the 'labels' column, is stored in 'y'
     X = df_final.drop('labels', axis = 1)
     y = df_final['labels']

     # 70% for training and 30% for testing (this is not a hard rule)
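     # the actual split call is missing from the export; a minimal sketch
     # (the random_state value is an assumption)
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)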
Now, before proceeding further, we need to check whether our train dataset is balanced or not.
If it is not balanced, we should balance it first.
[ ]: # Sets the figure size temporarily but has to be set again the next plot
plt.figure(figsize = (6,4))
sns.countplot(y_train)
plt.show()
[ ]: y_train.value_counts()
[ ]: 2 1757
1 152
0 51
Name: labels, dtype: int64
From the above plot, we can see our dataset is highly imbalanced.
Let's balance it first.
Sampling methods:
• SMOTE
• Oversampling: ADASYN
• Undersampling
NOTE: We should do the train_test_split first and then apply sampling on the TRAIN data only
(we already did). A SMOTE sketch follows.
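The resampling cell did not survive the export. A minimal sketch using SMOTE from imbalanced-learn (the random_state is an assumption); the balanced counts below, 1,757 per class, match its output:

[ ]: from imblearn.over_sampling import SMOTE

     smote = SMOTE(random_state = 42)
     X_train_sampled, y_train_sampled = smote.fit_resample(X_train, y_train)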
[ ]: y_train_sampled.value_counts()
[ ]: 1 1757
2 1757
0 1757
Name: labels, dtype: int64
[ ]: X_train.shape
[ ]: (1960, 20)
[ ]: y_train.shape
[ ]: (1960,)
[ ]: X_train_sampled.shape
[ ]: (5271, 20)
[ ]: y_train_sampled.shape
[ ]: (5271,)
[ ]: X_test.shape
[ ]: (840, 20)
[ ]: y_test.shape
[ ]: (840,)
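The model-training cell is missing from the export; a minimal sketch, assuming default RandomForestClassifier hyperparameters (the variable names are taken from the surrounding cells):

[ ]: # train on the SMOTE-balanced training data
     random_forest_model = RandomForestClassifier(random_state = 42)
     random_forest_model.fit(X_train_sampled, y_train_sampled)

     # predictions on the held-out test data
     y_predicted_randomforest = random_forest_model.predict(X_test)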
[ ]: # calculating the accuracy on train data
     print('Accuracy Score on train data: ',
           metrics.accuracy_score(y_true = y_train_sampled,
                                  y_pred = random_forest_model.predict(X_train_sampled)))
Performance Parameters
[ ]: print("****** Random Forest Model Prediction on Test Data ******")
print("*********************************************************\n")
print("--------- Confusion Matrix ---------\n\n", confusion_matrix(y_test,␣
↪y_predicted_randomforest))
print("\n------------------------------------")
print("Acurracy Score:", accuracy_score(y_test, y_predicted_randomforest))
print("------------------------------------")
print("\n------ Classification Report -------\n\n",␣
↪classification_report(y_test, y_predicted_randomforest))
print("--------------------------------------------------------")
[[ 23 0 3]
[ 0 68 0]
[ 16 6 724]]
------------------------------------
Accuracy Score: 0.9702380952380952
------------------------------------
macro avg 0.83 0.95 0.88 840
weighted avg 0.98 0.97 0.97 840
--------------------------------------------------------
Observations: On the training dataset, we are getting good accuracy (99.92%), precision, recall,
and F-score.
On the testing data, we are getting an accuracy of 97.02%.
[ ]: import pickle
     pickle.dump(random_forest_model, open("random_forest_model.pkl", "wb"))
8 Thank You!