
UNIVERSITY COLLEGE OF ENGINEERING

TIRUCHIRAPPALLI – BIT CAMPUS

PROJECT PHASE II


College Code: 8100
Technology: AI
Title: Health Monitoring and Diagnosis

SUBMITTED BY:
KAVIYARASU.V
NM ID: aukaviyarasuv
TEAM NAME: Health Guard Heroes

TEAM MEMBERS:

1. KAVIYARASU.V
2. INBATAMILAN.MK
3. DINESH KUMAR.K
4. ABDUL RAHUMAN.G
5. DHARAN KARTHIK.R
PHASE 2 DOCUMENT: DATA WRANGLING AND ANALYSIS

INTRODUCTION:

Phase 2 of our project is dedicated to data wrangling and analysis,
critical steps in preparing the raw dataset for building a personalized content
discovery engine. This phase involves employing various data manipulation
techniques using Python to clean, transform, and explore the dataset.
Additionally, we assume a scenario where the project aims to recommend
personalized content to users based on their preferences and interactions,
enhancing user engagement and satisfaction.

OBJECTIVES:

1. Cleanse the dataset by addressing inconsistencies, errors, and
missing values to ensure data integrity.
2. Explore the dataset's characteristics through exploratory data analysis
(EDA) to understand distributions and correlations.
3. Engineer relevant features to enhance model performance for accurate
content recommendations.
4. Document the data wrangling process comprehensively, ensuring
transparency and reproducibility.

Dataset Description:

The dataset comprises user interaction data collected from a digital
platform, including information about user profiles, content items, and user
interactions such as ratings, views, and purchases. Each row in the dataset
represents a user's interaction with a specific content item, forming the
foundation for personalized content recommendations.
Data Wrangling Techniques:

1. Data Description:
Head: Displaying the first few rows of the dataset to get an initial overview.
Tail: Examining the last few rows of the dataset to ensure completeness.
Info: Obtaining information about the dataset structure, data types, and
memory usage.
Describe: Generating descriptive statistics for numerical features to understand
their distributions and central tendencies.

Code:
import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# First and last rows for an initial overview
print(df.head())
print("-" * 100)
print(df.tail())
print("-" * 100)
# Column types, non-null counts, and memory usage
print(df.info())
print("-" * 100)
# Descriptive statistics for the numerical features
print(df.describe())
print("-" * 100)
output:
2. Null Data Handling:
Null Data Identification: Identifying missing values in the dataset.
Null Data Imputation: Filling missing values with appropriate strategies (a
sketch follows the removal code below).
Null Data Removal: Eliminating rows or columns with excessive missing
values.

Code:

import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Check for explicit NaN values
print(df.isnull().values.any())

# In this dataset, zeros in these columns stand in for missing measurements
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age']
print((df[cols] == 0).sum())

print(df.describe())

# Drop every row containing a physiologically impossible zero
drop_Glu = df.index[df.Glucose == 0].tolist()
drop_BP = df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness == 0].tolist()
drop_Ins = df.index[df.Insulin == 0].tolist()
drop_BMI = df.index[df.BMI == 0].tolist()
c = drop_Glu + drop_BP + drop_Skin + drop_Ins + drop_BMI
data = df.drop(df.index[c])
print(data)

output:
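Note: the removal approach above drops many rows. A minimal sketch of the
imputation strategy instead, assuming the same columns where zeros stand in
for missing measurements:

import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Columns where a zero is a placeholder for a missing measurement
cols_with_zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness',
                             'Insulin', 'BMI']
# Convert placeholder zeros to NaN so pandas treats them as missing
df[cols_with_zero_as_missing] = df[cols_with_zero_as_missing].replace(0, np.nan)
# Fill each column's missing entries with that column's median
df[cols_with_zero_as_missing] = df[cols_with_zero_as_missing].fillna(
    df[cols_with_zero_as_missing].median())
print(df.describe())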
3. Data Validation:
Data Integrity Check: Verifying data consistency and integrity to eliminate
errors.
Data Consistency Verification: Ensuring data consistency across different
columns or datasets.

Code:
import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']

# Print the unique values per column to spot inconsistent or invalid entries
for column in dat:
    unique_values = df[column].unique()
    print(column, unique_values)
output:
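Beyond listing unique values, a plausibility range check is one way to verify
integrity. A minimal sketch; the ranges below are illustrative assumptions,
not clinical reference values:

import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Assumed plausible ranges for a few columns (illustrative only)
plausible_ranges = {
    'Glucose': (40, 300),
    'BloodPressure': (30, 200),
    'BMI': (10, 70),
    'Age': (18, 100),
}
for column, (low, high) in plausible_ranges.items():
    out_of_range = df[(df[column] < low) | (df[column] > high)]
    print(f"{column}: {len(out_of_range)} values outside [{low}, {high}]")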
4. Data Reshaping
Reshaping Rows and Columns: Transforming the dataset into a suitable format
for analysis.
Transposing Data: Converting rows into columns and vice versa as needed.
Code:
import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Shape reports (rows, columns) of the dataset
print("diabetes data set dimensions: {}".format(df.shape))


output:
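The shape check above only reports dimensions. A brief sketch of the reshaping
operations described earlier (transposing and wide-to-long conversion),
assuming the same dataset:

import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Transpose: rows become columns, useful for inspecting wide summaries
print(df.head().T)

# Wide-to-long reshape: one (feature, value) pair per row, keyed by Outcome
long_df = df.melt(id_vars=['Outcome'], var_name='feature', value_name='value')
print(long_df.head())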

5. Data Merging –
Combining Datasets: Merging multiple datasets or data sources to enrich the
information available for analysis.
Joining Data: Joining datasets based on common columns or keys.
Code:
Not applicable: this project works with a single dataset (diabetes.csv), so
there is no second source to join. For reference, a minimal sketch of a join
appears below.
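The files 'patients.csv' and 'lab_results.csv' and the 'PatientID' column in
this sketch are hypothetical; the diabetes dataset itself has no such key:

import pandas as pd

patients = pd.read_csv('patients.csv')   # hypothetical first source
labs = pd.read_csv('lab_results.csv')    # hypothetical second source

# Inner join keeps only the rows whose PatientID appears in both tables
merged = patients.merge(labs, on='PatientID', how='inner')
print(merged.head())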
6. Data Aggregation
Grouping Data: Grouping dataset rows based on specific criteria.
Aggregating Data: Computing summary statistics for grouped data.
Code:
import numpy as np
import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

print("diabetes data set dimensions: {}".format(df.shape))
# Count of records in each Outcome class
print(df.groupby('Outcome').size())
output:
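The size count above can be extended to per-group summary statistics. A short
sketch, assuming the same columns:

import pandas as pd

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Mean of every numeric feature within each Outcome group
print(df.groupby('Outcome').mean())

# Several statistics at once for selected columns
print(df.groupby('Outcome')[['Glucose', 'BMI']].agg(['mean', 'median', 'std']))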

Data Analysis Techniques


7. Exploratory Data Analysis (EDA) –
Univariate Analysis: Analyzing individual variables to understand their
distributions and characteristics.
Bivariate Analysis: Investigating relationships between pairs of variables to
identify correlations and dependencies.
Multivariate Analysis: Exploring interactions among multiple variables to
uncover complex patterns and trends.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']

for column in dat[:-1]:  # Excluding the target column 'Outcome'
    plt.figure(figsize=(3, 2))
    sns.histplot(df[column], kde=True, bins=10, color='blue')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

BIVARIATE ANALYSIS:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='BloodPressure', data=df, hue='Outcome',
                palette='coolwarm')
plt.title('Scatterplot of Age vs. BloodPressure with Outcome')
plt.xlabel('Age')
plt.ylabel('BloodPressure')
plt.legend(title='Outcome', loc='upper right')
plt.grid(True)
plt.show()

MULTIVARIATE ANALYSIS:
# pairplot creates its own figure, so no plt.figure() call is needed
sns.pairplot(data=df, vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction',
                            'Age', 'BMI', 'Pregnancies', 'BloodPressure'],
             hue='Outcome')
plt.show()

output:
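To make the correlations mentioned above explicit, a correlation heatmap is a
compact complement to the scatter and pair plots. A minimal sketch:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Pairwise Pearson correlations between all numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()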
8. Feature Engineering
Creating User Profiles: Aggregating user interaction data to construct
comprehensive user profiles capturing preferences.
Temporal Analysis: Incorporating temporal features such as time of day or day
of week to capture temporal trends in user behavior.
Content Embeddings: Generating embeddings for content items to represent
their characteristics and relationships.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Profile-style aggregation: average glucose level at each age
age_group_data = data.groupby("Age")["Glucose"].mean()
plt.figure(figsize=(10, 6))
age_group_data.plot(marker="o", linestyle="-")
plt.xlabel("Age")
plt.ylabel("Average Blood Sugar Level (mg/dL)")
plt.title("Average Blood Sugar Levels by Age")
plt.show()

# CONTENT EMBEDDING: derive a categorical BMI feature, then scale numerics
data["BMI_Category"] = pd.cut(data["BMI"], bins=[0, 25, 30, float("inf")],
                              labels=["Normal", "Overweight", "Obese"])
numerical_features = ["Glucose", "Insulin", "Age", "BMI"]
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
print(data.head())
output:
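The per-age averages above can be made coarser by grouping patients into age
bands, a profile-style aggregation analogous to the user profiles described
earlier. The band edges below are illustrative assumptions:

import pandas as pd

data = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# Assumed age bands; pd.cut assigns each patient to one band
data['AgeBand'] = pd.cut(data['Age'], bins=[20, 30, 40, 50, 60, 100],
                         labels=['21-30', '31-40', '41-50', '51-60', '60+'])
print(data.groupby('AgeBand', observed=True)[['Glucose', 'BMI', 'Insulin']].mean())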

9. Assumed Scenario:
Scenario: The project aims to recommend personalized content to users based
on their historical interactions and preferences.
Objective: Enhance user engagement and satisfaction by delivering relevant and
tailored content recommendations.
Target Audience: Digital platform users seeking personalized content
recommendations across various domains.
Conclusion:
Phase 2 of the project focuses on data wrangling and analysis to prepare the
dataset for building a personalized content discovery engine. By employing
Python-based data manipulation techniques and assuming a scenario focused on
personalized content recommendations, we aim to transform raw data into
actionable insights for enhancing user experience and engagement on digital
platforms.
Sample code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')

# DATA DESCRIPTION
print(df.head())
print("-" * 100)
print(df.tail())
print("-" * 100)
print(df.info())
print("-" * 100)
print(df.describe())
print("-" * 100)

# NULL VALUES ANALYSIS
print(df.isnull().values.any())
# Zeros in these columns stand in for missing measurements
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
        'BMI', 'DiabetesPedigreeFunction', 'Age']
print((df[cols] == 0).sum())
print(df.describe())
drop_Glu = df.index[df.Glucose == 0].tolist()
drop_BP = df.index[df.BloodPressure == 0].tolist()
drop_Skin = df.index[df.SkinThickness == 0].tolist()
drop_Ins = df.index[df.Insulin == 0].tolist()
drop_BMI = df.index[df.BMI == 0].tolist()
c = drop_Glu + drop_BP + drop_Skin + drop_Ins + drop_BMI
data = df.drop(df.index[c])
print(data)

# DATA VALIDATION
dat = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'Age', 'DiabetesPedigreeFunction', 'Outcome']
for column in dat:
    unique_values = data[column].unique()
    print(column, unique_values)

# DATA SHAPE
print("diabetes data set dimensions: {}".format(data.shape))

# DATA AGGREGATION
print("diabetes data set dimensions: {}".format(df.shape))
print(data.groupby('Outcome').size())

# UNIVARIATE ANALYSIS
for column in dat[:-1]:  # Excluding the target column 'Outcome'
    plt.figure(figsize=(3, 2))
    sns.histplot(data[column], kde=True, bins=10, color='blue')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

# BIVARIATE ANALYSIS
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='BloodPressure', data=df, hue='Outcome',
                palette='coolwarm')
plt.title('Scatterplot of Age vs. BloodPressure with Outcome')
plt.xlabel('Age')
plt.ylabel('BloodPressure')
plt.legend(title='Outcome', loc='upper right')
plt.grid(True)
plt.show()

# MULTIVARIATE ANALYSIS (pairplot creates its own figure)
sns.pairplot(data=df, vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction',
                            'Age', 'BMI', 'Pregnancies', 'BloodPressure'],
             hue='Outcome')
plt.show()

# FEATURE ENGINEERING
data = pd.read_csv('C:\\Users\\Johanan\\Downloads\\diabetes.csv')
age_group_data = data.groupby("Age")["Glucose"].mean()
plt.figure(figsize=(10, 6))
age_group_data.plot(marker="o", linestyle="-")
plt.xlabel("Age")
plt.ylabel("Average Blood Sugar Level (mg/dL)")
plt.title("Average Blood Sugar Levels by Age")
plt.show()
data["BMI_Category"] = pd.cut(data["BMI"], bins=[0, 25, 30, float("inf")],
                              labels=["Normal", "Overweight", "Obese"])
numerical_features = ["Glucose", "Insulin", "Age", "BMI"]
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])
print(data.head())

output: