healthcare-project-simplilearn- Week1

9/16/2021 healthcare-project-simplilearn
In [1]: import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# import plotting libraries
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
In [2]: import seaborn as sns
sns.set(style='white', color_codes=True)
In [4]: df = pd.read_csv(r"D:\health care diabetes.csv")  # raw string so the backslash is not read as an escape
In [ ]: Data Exploration:
1. Perform descriptive analysis. Understand the variables and their corresponding va
• Glucose
• BloodPressure
• SkinThickness
• Insulin
• BMI
2. Visually explore these variables using histograms. Treat the missing values accor
3. There are integer and float data type variables in this dataset. Create a count (
In [7]: df.head()
Out[7]:      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
          0            6      148             72             35        0  33.6                     0.627   50
          1            1       85             66             29        0  26.6                     0.351   31
          2            8      183             64              0        0  23.3                     0.672   32
          3            1       89             66             23       94  28.1                     0.167   21
          4            0      137             40             35      168  43.1                     2.288   33
In [8]: df.tail()
Out[8]:      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  A
        763           10      101             76             48      180  32.9                     0.171
        764            2      122             70             27        0  36.8                     0.340
        765            5      121             72             23      112  26.2                     0.245
        766            1      126             60              0        0  30.1                     0.349
        767            1       93             70             31        0  30.4                     0.315
In [11]: df.columns
Out[11]: Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
In [12]: df.describe()
Out[12]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedig
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [14]: df.shape
Out[14]: (768, 9)
In [15]: df.isnull().sum()
Out[15]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
In [29]: df['Outcome'].value_counts()
df['Outcome'].value_counts(normalize=True)
Out[29]: 0 0.651042
1 0.348958
Name: Outcome, dtype: float64
In [38]: feature_cols=[col for col in df.columns if col != 'Outcome']

df.head()
Out[38]:      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
           0            6      148             72             35        0  33.6                     0.627   50
           1            1       85             66             29        0  26.6                     0.351   31
           2            8      183             64              0        0  23.3                     0.672   32
           3            1       89             66             23       94  28.1                     0.167   21
           4            0      137             40             35      168  43.1                     2.288   33
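In [39]: # Plot the distribution of every feature on a grid with two columns
plt.figure(figsize=(15, 15))
rows = (len(feature_cols) + 1) // 2  # round up so an odd number of features still fits
for i, feature in enumerate(feature_cols):
    plt.subplot(rows, 2, i + 1)
    # sns.distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
    sns.histplot(df[feature], kde=True, stat="density")
plt.tight_layout()
plt.show()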
In [44]: # Finding skewness in the features
from scipy.stats import skew
In [48]: # Initialise the lists in the same cell as the loop so that re-running it does not append duplicates
negative_skew = []
positive_skew = []
for feature in feature_cols:
    feature_skew = skew(df[feature])
    print("Skewness of {0} is {1}".format(feature, feature_skew))
    if feature_skew < 0:
        negative_skew.append(feature)
    else:
        positive_skew.append(feature)
Skewness of Pregnancies is 0.8999119408414357
Skewness of Glucose is 0.17341395519987735
Skewness of BloodPressure is -1.8400052311728738
Skewness of SkinThickness is 0.109158762323673
Skewness of Insulin is 2.2678104585131753
Skewness of BMI is -0.42814327880861786
Skewness of DiabetesPedigreeFunction is 1.9161592037386292
Skewness of Age is 1.127389259531697
In [49]: print("Negatively skewed Features are {}".format(negative_skew))
print("Positively skewed Features are {}".format(positive_skew))
Negatively skewed Features are ['BloodPressure', 'BMI']
Positively skewed Features are ['Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction', 'Age']
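The sign convention used in the skewness cell can be checked by hand. The sketch below is a plain-Python version of the biased Fisher-Pearson coefficient g1 = m3 / m2**1.5, which is what scipy.stats.skew computes with its default bias=True. The function name sample_skewness and the toy lists are made up for illustration and are not part of the notebook.

```python
def sample_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment over variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical toy samples, not taken from the diabetes dataset
right_tailed = [1, 1, 2, 2, 3, 10]         # long tail to the right
left_tailed = [-x for x in right_tailed]   # mirror image, long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(sample_skewness(right_tailed))  # positive value
print(sample_skewness(left_tailed))   # negative value of the same magnitude
print(sample_skewness(symmetric))     # 0.0
```

A positive result means the long tail is on the right (as for Insulin above), a negative result means it is on the left (as for BloodPressure before the zeros are treated).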
In [51]: # Percentage of missing values in features
# Zeros are treated as missing in every feature column. Note that this also marks
# Pregnancies == 0 as missing, although 0 is a valid count there.
for col in feature_cols:
    df[col] = df[col].replace(0, np.nan)
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
missing_value_df.sort_values(by=['percent_missing'], inplace=True, ascending=False)
missing_value_df.set_index(keys=['column_name'], drop=True)
Out[51]: percent_missing
column_name
Insulin 48.697917
SkinThickness 29.557292
Pregnancies 14.453125
BloodPressure 4.557292
BMI 1.432292
Glucose 0.651042
DiabetesPedigreeFunction 0.000000
Age 0.000000
Outcome 0.000000
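The table above reports 14.45% of Pregnancies as missing because a zero in any feature column is counted as a gap. A possible variation, sketched here on a made-up toy frame, limits the replacement to the five measurements named in the Data Exploration task. The names toy, zero_as_missing and cleaned are hypothetical, and the assumption that only those five columns encode "not recorded" as 0 should be checked against the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names follow the dataset, values are made up
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose": [148, 0, 183],
    "Insulin": [0, 94, 0],
    "BMI": [33.6, 0.0, 23.3],
})

# Assumption: only these measurements use 0 as a stand-in for "not recorded"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
cols = [c for c in zero_as_missing if c in toy.columns]

cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)  # Pregnancies is left untouched

# Share of missing values per column, in percent
percent_missing = cleaned.isnull().mean() * 100
print(percent_missing.round(2))
```

With this selection a zero in Pregnancies stays a genuine observation, so the later mean imputation would not overwrite it.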
In [52]: # Missing value imputation using the mean (truncated to an integer)
for col in feature_cols:
    df[col] = df[col].fillna(int(df[col].mean()))
In [53]: df.isnull().sum()
Out[53]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
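The notebook fills gaps with the column mean. For a strongly right-skewed column such as Insulin (skewness about 2.27 above), the mean is pulled toward the tail, so the median is a common alternative. The sketch below compares the two on a made-up series; it is a different technique from the one used in this notebook, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values with two gaps; not taken from the dataset
insulin = pd.Series([94.0, 168.0, np.nan, 88.0, 846.0, np.nan])

mean_filled = insulin.fillna(insulin.mean())
median_filled = insulin.fillna(insulin.median())

print(insulin.mean())    # 299.0, pulled upward by the single large value
print(insulin.median())  # 131.0, closer to the typical observation
print(median_filled.tolist())
```

Which statistic is appropriate depends on the downstream model; the sketch only shows how far the two fill values can differ on skewed data.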
In [54]: # Cast the imputed columns back to integers; BMI and DiabetesPedigreeFunction stay float
for col in feature_cols:
    if col not in ['BMI', 'DiabetesPedigreeFunction']:
        df[col] = df[col].astype(int)
In [55]: df.head()
Out[55]:      Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
           0            6      148             72             35      155  33.6                     0.627   50
           1            1       85             66             29      155  26.6                     0.351   31
           2            8      183             64             29      155  23.3                     0.672   32
           3            1       89             66             23       94  28.1                     0.167   21
           4            4      137             40             35      168  43.1                     2.288   33
In [56]: df.dtypes
Out[56]: Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
In [58]: plt.figure(figsize=(10,6))
df.dtypes.value_counts().plot(kind='bar', color='gray')
plt.title("Frequency plot describing the data types and the count of variables"
, fontsize=15,loc='center', color='Black')
plt.xlabel("Data types")
plt.ylabel("Count of types")
plt.show()
In [ ]:
