healthcare-project-simplilearn- Week1


9/16/2021 healthcare-project-simplilearn

In [1]: import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
# import plotting libraries
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [2]: import seaborn as sns
sns.set(color_codes=True)
sns.set(style='white')

In [4]: df = pd.read_csv(r"D:\health care diabetes.csv")  # raw string so the backslash is not read as an escape

In [ ]: Data Exploration:

1. Perform descriptive analysis. Understand the variables and their corresponding va

• Glucose

• BloodPressure

• SkinThickness

• Insulin

• BMI

2. Visually explore these variables using histograms. Treat the missing values accor

3. There are integer and float data type variables in this dataset. Create a count (
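As a minimal sketch of the first task, the zeros in the five listed columns can be counted before deciding how to treat them. The frame below is only the five rows printed by df.head() in this notebook, retyped by hand as a stand-in for the real file; the names sample, zero_cols and zero_counts are introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows printed by df.head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns named in the task list above.
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Number of zero entries per column.
zero_counts = (sample[zero_cols] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression would be applied to df instead of sample.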

In [7]: df.head()

Out[7]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33

In [8]: df.tail()

Out[8]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction A

763 10 101 76 48 180 32.9 0.171

764 2 122 70 27 0 36.8 0.340

765 5 121 72 23 112 26.2 0.245

766 1 126 60 0 0 30.1 0.349

767 1 93 70 31 0 30.4 0.315

localhost:8888/nbconvert/html/healthcare-project-simplilearn.ipynb?download=false 1/6

In [11]: df.columns

Out[11]: Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [12]: df.describe()

Out[12]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedig

count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000

mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578

std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160

min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000

50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000

75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000

max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000

In [13]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

In [14]: df.shape

Out[14]: (768, 9)

In [15]: df.isnull().sum()

Out[15]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

In [29]: df['Outcome'].value_counts()
df['Outcome'].value_counts(normalize=True)


Out[29]: 0 0.651042
1 0.348958
Name: Outcome, dtype: float64
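Only the last expression of a cell is displayed, so the raw counts from the first value_counts() call are not shown. With 768 rows, the proportions above correspond to 500 and 268 rows. A small sketch that shows counts and shares side by side, using a synthetic outcome series built from those counts rather than the real file (outcome and balance are illustrative names):

```python
import pandas as pd

# Synthetic stand-in for df['Outcome']: 500 zeros and 268 ones,
# derived from the proportions above and the 768-row total.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# One table with both the absolute counts and the relative shares.
balance = pd.DataFrame({
    "count": outcome.value_counts(),
    "share": outcome.value_counts(normalize=True).round(6),
})
print(balance)
```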

In [38]: feature_cols=[col for col in df.columns if col != 'Outcome']
df.head()

Out[38]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33

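In [39]: plt.figure(figsize=(15,15))
rows = len(feature_cols) // 2   # two plots per row; assumes an even number of features
for i, feature in enumerate(feature_cols):
    plt.subplot(rows, 2, i+1)
    # distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the axes-level replacement
    sns.distplot(df[feature])
plt.tight_layout()
plt.show()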
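The grid in the cell above uses half as many rows as there are features, which works for the eight features here but would leave no slot for the last plot if the count were odd. A hedged sketch of a more general layout follows, using plain matplotlib histograms on made-up data; the helper plot_histograms and the random frame demo are illustrative and not part of the notebook.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_histograms(frame, columns, ncols=2, bins=20):
    """Draw one histogram per column on a grid with enough rows for all of them."""
    nrows = math.ceil(len(columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 3 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, col in zip(flat_axes, columns):
        ax.hist(frame[col].dropna(), bins=bins)
        ax.set_title(col)
    # Hide any unused slots in the last row.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, axes


# Made-up data with three columns (an odd count) to exercise the layout.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
fig, axes = plot_histograms(demo, ["a", "b", "c"])
```

Inside the notebook the matplotlib.use("Agg") line would be dropped, since %matplotlib inline already selects a backend.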


In [44]: # Finding skewness in the features
from scipy.stats import skew
negative_skew=[]
positive_skew=[]

In [48]: for feature in feature_cols:
    feature_skew = skew(df[feature])   # compute once and reuse
    print("Skewness of {0} is {1}".format(feature, feature_skew))
    # Appends to the lists created in In [44]; re-running this cell alone adds each feature again
    if feature_skew < 0:
        negative_skew.append(feature)
    else:
        positive_skew.append(feature)

Skewness of Pregnancies is 0.8999119408414357
Skewness of Glucose is 0.17341395519987735
Skewness of BloodPressure is -1.8400052311728738
Skewness of SkinThickness is 0.109158762323673
Skewness of Insulin is 2.2678104585131753
Skewness of BMI is -0.42814327880861786
Skewness of DiabetesPedigreeFunction is 1.9161592037386292
Skewness of Age is 1.127389259531697

In [49]: print(end="\n")
print("Negatively skewed Features are {}".format(negative_skew), end='\n')
print("Positively skewed Features are {}".format(positive_skew), end='\n')
print("Negatively skewed feature")


Negatively skewed Features are ['BloodPressure', 'BMI', 'BloodPressure', 'BMI', 'BloodPressure', 'BMI']
Positively skewed Features are ['Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'DiabetesPedigreeFunction', 'Age']
Negatively skewed feature
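Each feature name appears three times in the lists above, which is consistent with the loop cell having been executed three times while the lists from In [44] kept accumulating. A minimal sketch that avoids this by rebuilding the lists on every run is given below. The helper sample_skewness is written for this illustration and computes the biased moment-based coefficient, which is what scipy.stats.skew returns by default; the two columns are made-up data.

```python
import numpy as np
import pandas as pd


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5, using central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5


# Made-up data: one right-skewed and one left-skewed column.
demo = pd.DataFrame({
    "right_tail": [1, 2, 3, 4, 10],
    "left_tail": [1, 7, 8, 9, 10],
})

# Rebuilt from scratch on every run, so re-executing the cell cannot duplicate entries.
skewness = {col: sample_skewness(demo[col]) for col in demo.columns}
negative_skew = [col for col, s in skewness.items() if s < 0]
positive_skew = [col for col, s in skewness.items() if s >= 0]
print(negative_skew, positive_skew)
```

As in the original loop, a skewness of exactly zero falls into the positive list.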

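In [51]: # Percentage of missing values in features
# Zeros are treated as missing and replaced with NaN in every feature column
for col in feature_cols:
    df[col] = df[col].replace(0, np.nan)   # assign back instead of inplace=True on a column selection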
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns, 'percent_missing':percen
missing_value_df.sort_values(by=['percent_missing'],inplace=True, ascending=False)
missing_value_df.set_index(keys=['column_name'],drop=True)

Out[51]:
                          percent_missing
column_name
Insulin                         48.697917
SkinThickness                   29.557292
Pregnancies                     14.453125
BloodPressure                    4.557292
BMI                              1.432292
Glucose                          0.651042
DiabetesPedigreeFunction         0.000000
Age                              0.000000
Outcome                          0.000000
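The table above reports 14.45% of Pregnancies as missing because the loop converts zeros in every feature column, although zero is a legitimate pregnancy count. A sketch that restricts the conversion to the five measurement columns named in the task list is shown below, again on the five df.head() rows retyped as a stand-in. The names sample and zero_as_missing are illustrative, and treating only these five columns as zero-means-missing is an assumption based on that list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: the five rows printed by df.head() earlier.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Only these columns treat 0 as "not recorded" (assumption based on the task list).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
sample[zero_as_missing] = sample[zero_as_missing].replace(0, np.nan)

# isna().mean() is the fraction of missing rows per column.
percent_missing = (sample.isna().mean() * 100).sort_values(ascending=False)
print(percent_missing)
```

With this restriction the zero in Pregnancies stays a zero and contributes nothing to the missing percentage.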

In [52]: # Missing value imputation using mean
# int() truncates the mean, so every gap is filled with a whole number
for col in feature_cols:
    df[col] = df[col].fillna(int(df[col].mean()))
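Insulin was strongly right-skewed in the skewness output earlier (about 2.27, computed before the zeros were replaced), and for such columns the median is a common alternative to the mean because it is less affected by extreme values. The following is a sketch of that alternative on made-up values, not the method this notebook uses; insulin is a hypothetical series.

```python
import numpy as np
import pandas as pd

# Made-up insulin-like values with one large outlier and two gaps.
insulin = pd.Series([94.0, 168.0, np.nan, 88.0, 846.0, np.nan], name="Insulin")

mean_value = insulin.mean()      # pulled upward by the outlier
median_value = insulin.median()  # robust to the outlier

# Assigning the result avoids relying on inplace=True on a column selection.
filled_mean = insulin.fillna(mean_value)
filled_median = insulin.fillna(median_value)
print(mean_value, median_value)
```

Here the mean of the four observed values is 299.0 while the median is 131.0, so the two strategies fill the gaps with very different numbers.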

In [53]: df.isnull().sum()

Out[53]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64

In [54]: for col in feature_cols:
    if col not in ['BMI','DiabetesPedigreeFunction']:
        df[col] = df[col].astype('int64')   # same values as apply(lambda x: int(x)), without a Python-level loop

In [55]: df.head()

Out[55]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35      155  33.6                     0.627   50
1            1       85             66             29      155  26.6                     0.351   31
2            8      183             64             29      155  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            4      137             40             35      168  43.1                     2.288   33

In [56]: df.dtypes

Out[56]: Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object

In [58]: plt.figure(figsize=(10,6))
df.dtypes.value_counts().plot(kind='bar', color='gray')
plt.title("Frequency plot describing the data types and the count of variables"
, fontsize=15,loc='center', color='Black')
plt.xlabel("Data types")
plt.ylabel("Count of types")
plt.show()
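The bar chart is driven by df.dtypes.value_counts(). Since df.info() reports seven int64 and two float64 columns, the chart has two bars. A small sketch on a made-up single-row frame with the same column names (demo is hypothetical, and its Outcome value is invented because that column is cut off in the printed tables) shows the table behind the plot:

```python
import pandas as pd

# Made-up single-row frame with the same column names as the notebook's df.
demo = pd.DataFrame({
    "Pregnancies": [6], "Glucose": [148], "BloodPressure": [72],
    "SkinThickness": [35], "Insulin": [155], "BMI": [33.6],
    "DiabetesPedigreeFunction": [0.627], "Age": [50], "Outcome": [1],
})

# Count how many columns share each dtype; this Series is what .plot(kind='bar') draws.
dtype_counts = demo.dtypes.value_counts()
print(dtype_counts)
```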

In [ ]:
