Hgs Phase II
SUBMITTED BY:
KAVIYARASU.V
NM ID: aukaviyarasuv
TEAM NAME: Health guard heros
TEAM MEMBERS:
1.KAVIYARASU.V
2.INBATAMILAN.MK
3.DINESH KUMAR.K
4.ABDUL RAHUMAN.G
5.DHARAN KARTHIK.R
PHASE 2 DOCUMENT: DATA WRANGLING AND ANALYSIS
INTRODUCTION:
OBJECTIVES:
Dataset Description:
1. Data Description:
Head: Displaying the first few rows of the dataset to get an initial overview.
Tail: Examining the last few rows of the dataset to ensure completeness.
Info: Obtaining information about the dataset structure, data types, and
memory usage.
Describe: Generating descriptive statistics for numerical features to understand
their distributions and central tendencies.
import pandas as pd
# Load the diabetes dataset (path as used by the authors)
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
separator = "-" * 104
# First and last rows for a quick look at the data
print(df.head())
print(separator)
print(df.tail())
print(separator)
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
print(separator)
# Summary statistics for the numerical columns
print(df.describe())
print(separator)
output:
2. Null Data Handling:
Null Data Identification: Identifying missing values in the dataset.
Null Data Imputation: Filling missing values with appropriate strategies.
Null Data Removal: Eliminating rows or columns with excessive missing
values.
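The code that follows handles suspect readings by removing rows only. As a complement, the imputation strategy listed above could be sketched as below. This is a minimal sketch under two assumptions that are not taken from the project: that a zero in a clinical column stands for a missing reading (the same reading of the data the removal code relies on), and that the column median is an acceptable fill value. The toy rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for diabetes.csv (values invented)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a zero in these columns marks a missing reading
zero_as_missing = ["Glucose", "BMI"]

imputed = toy.copy()
# Step 1: turn placeholder zeros into NaN so pandas recognises them as missing
imputed[zero_as_missing] = imputed[zero_as_missing].replace(0, np.nan)
# Step 2: fill each column's gaps with that column's median of the observed values
imputed[zero_as_missing] = imputed[zero_as_missing].fillna(
    imputed[zero_as_missing].median()
)

print(imputed)
```

Unlike row removal, this keeps every record, at the cost of replacing unknown readings with an estimate.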
Code:
import pandas as pd
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
# True if any cell holds an explicit NaN
print(df.isnull().values.any())
# Count the zeros in each column; zeros in the clinical columns are treated here as missing readings
print((df.Pregnancies == 0).sum(), (df.Glucose == 0).sum(), (df.BloodPressure == 0).sum(),
      (df.SkinThickness == 0).sum(), (df.Insulin == 0).sum(), (df.BMI == 0).sum(),
      (df.DiabetesPedigreeFunction == 0).sum(), (df.Age == 0).sum())
print(df.describe())
# Collect the index labels of rows with a zero in each affected column
drop_Glu = df.index[df.Glucose == 0].tolist()
drop_BP = df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness == 0].tolist()
drop_Ins = df.index[df.Insulin == 0].tolist()
drop_BMI = df.index[df.BMI == 0].tolist()
c = drop_Glu + drop_BP + drop_Skin + drop_Ins + drop_BMI
# A row can appear in several lists, so de-duplicate before dropping by label
data = df.drop(index=sorted(set(c)))
print(data)
output:
3. Data Validation:
Data Integrity Check: Verifying data consistency and integrity to eliminate
errors.
Data Consistency Verification: Ensuring data consistency across different
columns or datasets.
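Beyond listing unique values, integrity and consistency checks can be written as explicit rules. The sketch below is one way to do that on invented toy rows; the plausibility ranges are illustrative assumptions, not clinical guidance, and the last toy row is deliberately inconsistent.

```python
import pandas as pd

# Hypothetical toy rows; the last one is deliberately invalid
toy = pd.DataFrame({
    "Age": [50, 31, 32, -4],
    "BloodPressure": [72, 66, 64, 70],
    "Outcome": [1, 0, 1, 2],
})

# Assumed plausibility rules (illustrative only)
rules = {
    "Age": toy["Age"].between(0, 120),
    "BloodPressure": toy["BloodPressure"].between(1, 250),
    "Outcome": toy["Outcome"].isin([0, 1]),
}

# One boolean column per rule; a row is valid only if it passes every rule
checks = pd.DataFrame(rules)
violations = toy[~checks.all(axis=1)]

print(checks.sum())             # number of rows passing each rule
print(violations)               # rows failing at least one rule
print(toy.duplicated().sum())   # count of fully duplicated rows
```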
Code:
import pandas as pd
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']
# Print the distinct values of each column to spot impossible or inconsistent entries
for column in dat:
    unique_values = df[column].unique()
    print(unique_values)
output:
4. Data Reshaping
Reshaping Rows and Columns: Transforming the dataset into a suitable format
for analysis.
Transposing Data: Converting rows into columns and vice versa as needed.
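The two operations named above can be sketched on a small in-memory frame. The toy values are invented, and the choice of `Outcome` as the identifier column for the long format is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy rows (values invented)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Transpose: rows become columns and columns become rows
transposed = toy.T
print(transposed.shape)

# Wide to long: one row per (record, feature) pair, keeping Outcome as an identifier
long_form = toy.melt(id_vars="Outcome", var_name="feature", value_name="value")
print(long_form)
```

The long format is often convenient for plotting several features with a single call, since the feature name becomes an ordinary column.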
Code:
import pandas as pd
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
# Report the dataset shape as (rows, columns)
print("diabetes data set dimensions:{}".format(df.shape))
5. Data Merging
Combining Datasets: Merging multiple datasets or data sources to enrich the
information available for analysis.
Joining Data: Joining datasets based on common columns or keys.
Code:
No merging is performed: the project uses a single dataset (diabetes.csv), so there is nothing to join.
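For reference only, a key-based join of the kind described above could look like the sketch below. Both tables, the `patient_id` key, and the `clinic` column are hypothetical; nothing in this report indicates that diabetes.csv carries an identifier column.

```python
import pandas as pd

# Hypothetical tables sharing a hypothetical key column
measurements = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "Glucose": [148, 85, 183],
})
sites = pd.DataFrame({
    "patient_id": [1, 2, 4],
    "clinic": ["A", "B", "C"],
})

# Left join keeps every measurement row; keys with no match get NaN in the new column
merged = measurements.merge(sites, on="patient_id", how="left")
print(merged)
```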
6. Data Aggregation
Grouping Data: Grouping dataset rows based on specific criteria.
Aggregating Data: Computing summary statistics for grouped data.
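Grouping followed by several summary statistics can be sketched with named aggregation. The toy values and the output column names `n`, `mean_glucose`, and `median_bmi` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows (values invented)
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Group rows by class label, then compute several statistics per group
summary = toy.groupby("Outcome").agg(
    n=("Glucose", "size"),
    mean_glucose=("Glucose", "mean"),
    median_bmi=("BMI", "median"),
)
print(summary)
```

Each keyword names an output column and pairs an input column with the statistic to compute, so the result has one row per `Outcome` value.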
Code:
import pandas as pd
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
print("diabetes data set dimensions:{}".format(df.shape))
# Number of rows in each Outcome class
print(df.groupby('Outcome').size())
output:
BIVARIATE ANALYSIS:
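import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
# Age against blood pressure, coloured by diabetes outcome
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='BloodPressure', data=df, hue='Outcome', palette='coolwarm')
plt.title('Scatterplot of Age vs. BloodPressure with Outcome')
plt.xlabel('Age')
plt.ylabel('BloodPressure')
plt.legend(title='Outcome', loc='upper right')
plt.grid(True)
plt.show()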
MULTIVARIATE ANALYSIS:
# pairplot creates its own figure, so no plt.figure() call is needed beforehand
sns.pairplot(data=df, vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction', 'Age', 'BMI', 'Pregnancies', 'BloodPressure'], hue='Outcome')
plt.show()
output:
8. Feature Engineering
Creating User Profiles: Aggregating user interaction data to construct
comprehensive user profiles capturing preferences.
Temporal Analysis: Incorporating temporal features such as time of day or day
of week to capture temporal trends in user behavior.
Content Embeddings: Generating embeddings for content items to represent
their characteristics and relationships.
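The items above are phrased for a recommendation setting; on a tabular health dataset such as the one loaded in this report, the closest counterparts are derived columns. The sketch below shows two of them on invented toy values: an age band built with `pd.cut` and a z-score computed directly in pandas. The band edges and the column names `Age_Band` and `Glucose_z` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy rows (values invented)
toy = pd.DataFrame({
    "Age": [21, 29, 35, 47, 62],
    "Glucose": [90, 110, 100, 140, 160],
})

# Derived categorical feature: right-closed bins (20, 30], (30, 50], (50, 120]
toy["Age_Band"] = pd.cut(
    toy["Age"], bins=[20, 30, 50, 120], labels=["21-30", "31-50", "51+"]
)

# Derived numeric feature: z-score using the population standard deviation (ddof=0),
# the same convention scikit-learn's StandardScaler uses
toy["Glucose_z"] = (toy["Glucose"] - toy["Glucose"].mean()) / toy["Glucose"].std(ddof=0)

# Mean glucose per age band instead of per exact age
band_means = toy.groupby("Age_Band", observed=True)["Glucose"].mean()
print(toy)
print(band_means)
```

Averaging per band rather than per exact age pools more rows into each group, which may give a steadier trend line when few records share the same age.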
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
data = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
# Mean glucose for each age value
age_group_data = data.groupby("Age")["Glucose"].mean()
plt.figure(figsize=(10, 6))
age_group_data.plot(marker="o", linestyle="-")
plt.xlabel("Age")
plt.ylabel("Average Blood Sugar Level (mg/dL)")
plt.title("Average Blood Sugar Levels by Age")
plt.show()
#CONTENT EMBEDDING
# Bin BMI into categories; bins are right-closed, so a BMI of 0 falls outside every bin and becomes NaN
data["BMI_Category"] = pd.cut(data["BMI"], bins=[0, 25, 30, float("inf")], labels=["Normal", "Overweight", "Obese"])
# Standardize the selected numerical features to zero mean and unit variance
numerical_features = ["Glucose", "Insulin", "Age", "BMI"]
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
print(data.head())
output:
9. Assumed Scenario:
Scenario: The project aims to recommend personalized content to users based
on their historical interactions and preferences.
Objective: Enhance user engagement and satisfaction by delivering relevant and
tailored content recommendations.
Target Audience: Digital platform users seeking personalized content
recommendations across various domains.
Conclusion:
Phase 2 of the project focuses on data wrangling and analysis to prepare the
dataset for building a personalized content discovery engine. By employing
Python-based data manipulation techniques and assuming a scenario focused on
personalized content recommendations, we aim to transform raw data into
actionable insights for enhancing user experience and engagement on digital
platforms.
Sample code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
separator = "-" * 104
print(df.head())
print(separator)
print(df.tail())
print(separator)
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()
print(separator)
print(df.describe())
print(separator)
#NULL VALUES ANALYSIS
print(df.isnull().values.any())
print((df.Pregnancies == 0).sum(), (df.Glucose == 0).sum(), (df.BloodPressure == 0).sum(),
      (df.SkinThickness == 0).sum(), (df.Insulin == 0).sum(), (df.BMI == 0).sum(),
      (df.DiabetesPedigreeFunction == 0).sum(), (df.Age == 0).sum())
print(df.describe())
drop_Glu=df.index[df.Glucose == 0].tolist()
drop_BP=df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness==0].tolist()
drop_Ins = df.index[df.Insulin==0].tolist()
drop_BMI = df.index[df.BMI==0].tolist()
c=drop_Glu+drop_BP+drop_Skin+drop_Ins+drop_BMI
data=df.drop(df.index[c])
print(data)
#DATA VALIDATION
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']
# Print the distinct values of each column of the cleaned data
for column in dat:
    unique_values = data[column].unique()
    print(unique_values)
#DATA SHAPE
print("diabetes data set dimensions:{}".format(data.shape))
#DATA AGGREGATION
print("diabetes data set dimensions:{}".format(df.shape))
print(data.groupby('Outcome').size())
#UNIVARIATE ANALYSIS
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']
# One histogram per feature; the last entry (Outcome) is the label and is skipped
for column in dat[:-1]:
    plt.figure(figsize=(3, 2))
    sns.histplot(data[column], kde=True, bins=10, color='blue')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()
#BIVARIATE ANALYSIS
plt.figure(figsize=(10,6))
sns.scatterplot(x='Age', y='BloodPressure', data=df, hue='Outcome', palette='coolwarm')
plt.title('Scatterplot of Age vs. BloodPressure with Outcome')
plt.xlabel('Age')
plt.ylabel('BloodPressure')
plt.legend(title='Outcome', loc='upper right')
plt.grid(True)
plt.show()
#MULTIVARIATE ANALYSIS
# pairplot creates its own figure, so no plt.figure() call is needed beforehand
sns.pairplot(data=df, vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction', 'Age', 'BMI', 'Pregnancies', 'BloodPressure'], hue='Outcome')
plt.show()
#FEATURE ENGINEERING
data = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
age_group_data = data.groupby("Age")["Glucose"].mean()
plt.figure(figsize=(10, 6))
age_group_data.plot(marker="o", linestyle="-")
plt.xlabel("Age")
plt.ylabel("Average Blood Sugar Level (mg/dL)")
plt.title("Average Blood Sugar Levels by Age")
plt.show()
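# Bin BMI into categories; bins are right-closed, so a BMI of 0 falls outside every bin and becomes NaN
data["BMI_Category"] = pd.cut(data["BMI"], bins=[0, 25, 30, float("inf")], labels=["Normal", "Overweight", "Obese"])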
numerical_features = ["Glucose", "Insulin", "Age", "BMI"]
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
print(data.head())
output: