
DATA ANALYSIS USING PYTHON 2023-2024

CHAPTER 1

PROFILE OF THE ORGANIZATION


1.1 INTRODUCTION
IOBrain Innovative Labz (IIL) is at the forefront of pioneering research and development
at the intersection of Robotics, Automation, and other cutting-edge technologies. Established
with a vision to unravel the mysteries of the human brain and leverage this understanding to
create transformative solutions, IIL stands as a beacon of innovation in the field.

Driven by a multidisciplinary team of experts ranging from engineers to data scientists
and psychologists, IIL is committed to pushing the boundaries of knowledge and application
in the realm of cognitive sciences.

Figure 1 : IObrain logo

Among the hundreds of robotics training institutes in India, we are one of the best choices,
providing quality training in the fields of Robotics, Automation, and Embedded Systems.
Today we have reached more than 30 schools and have trained more than 10,000 rural students
on various robotics projects, both online and offline.

With advanced courses in in-demand technologies such as Drones, Robotics and Automation,
Python, Artificial Intelligence, Android Applications, Games and Animations, Simulation
with TinkerCad, Internet of Things (IoT), Arduino, Raspberry Pi, and a lot more, we serve
everyone from school students to professionals who learn technical courses with us to
accomplish their dream projects.


1.2 ORGANIZATION STRATEGY

PURPOSE
Education today is increasingly demanding. Students require all-round development of their
mental ability and psychological personality, and must exhibit skills in order to thrive in this
demanding and fast-growing generation. Students need not only academics but also an extra
set of skills to help them mould themselves and grow in the fast-paced environment around
us. IObrain looks into every aspect of a student, from mental ability and thinking style to
personal skills, and helps them be positive, productive, and professional. The educators drive
toward effective solutions that help students meet their needs while preparing for the
21st-century workforce.

VISION
At IOBrain Innovative Labz (IIL), the vision is to revolutionize our understanding of the
human brain and its potential applications in technology and healthcare. The ultimate objective
is to unlock the mysteries of cognition and create groundbreaking solutions that enhance human
lives and drive progress in society.

MISSION STATEMENT

“Our mission at IOBrain Innovative Labz (IIL) is to conduct cutting-edge research at the
intersection of neuroscience, artificial intelligence, and engineering. Through interdisciplinary
collaboration and innovative thinking, we aim to:

1. Advance Knowledge: Push the boundaries of neuroscience research to unravel the
complexities of the human brain and its functions, from perception and cognition to
emotion and behaviour.
2. Foster Innovation: Translate our scientific discoveries into practical applications and
technologies that have real-world impact, spanning fields such as brain-computer
interfaces, neurorehabilitation, cognitive enhancement, and beyond.


3. Empower Humanity: Harness the power of brain science to improve human health,
well-being, and performance. Whether it's developing assistive technologies for
individuals with disabilities or optimizing cognitive training programs, we strive to
empower individuals to reach their full potential.
4. Collaborate and Share: Foster a culture of collaboration and knowledge-sharing, both
within our organization and with external partners in academia, industry, and
healthcare. By working together, we can accelerate progress and address the most
pressing challenges facing humanity.

Through our unwavering commitment to scientific excellence, innovation, and societal
impact, we envision a future where the insights gained from studying the brain transform lives
and shape the course of human history.”

TEAM WORK

No individual is entirely independent; all are interdependent. Group projects therefore
work as an effective tool for good communication and outcomes. Analytical and cognitive
thinking increase, and gradually next-generation technology transforms society, while
technological implementation promotes the development of a city.

OUT-OF-THE-BOX APPROACH

A few people who were part of the current education system had a great desire to teach
students concepts that were not part of the school curriculum. This idea was implemented in
a few schools and received an overwhelming response.

TRAINING

Passionate and highly qualified trainers who are subject matter experts train the faculty,
the students, or both, ensuring complete knowledge transfer. Trainers work with students to
develop project-based documentation to bolster their growth. A group of 20+ trainers undergoes
several levels of detailed training and conducts research before transferring the knowledge to
the kids or the faculty. Trainers provide enthusiastic and inspiring sessions with hands-on
practice.


1.3 COURSES OFFERED BY THE ORGANIZATION

1.3.1 PROGRAM YOUR VIRTUAL ROBOT – PVR

With the PVR Course, students will begin their journey into learning coding. They will
use VEXcode VR software and engaging robotics-based activities to learn about project flow,
loops, conditionals, algorithms, and more.

This introductory course encourages students to use VEXcode VR to learn and practice
computational thinking and coding. Each lesson and unit walks the students through a
particular Computer Science concept, leading the students to complete independent challenges
applying what they have learned. Because VEXcode VR is 100% web-based, students can use
any of the major web browsers on a computer or tablet-based device. Using the VEXcode VR
block-based coding system, students can get started instantly with their coding journey.

Why Learn Coding?

Learning coding will help students develop 21st-century job skills. Most of today’s
professional math and science fields have a computational component. Additionally, skills such
as the ability to analyse and solve unstructured problems and to work with new information are
extremely valuable in today’s knowledge economy. This course will help students become
creators, not just consumers, of technology.

Additionally, learning to code will also help students to better understand the world
around them. Computers and computing have changed the way we live and work. Much like
learning about Biology or Physics helps us to understand the world around us, learning coding
will help students better understand how computer science influences their daily lives.

Finally, learning to code will help students bring their own digital creations to life. The
process of creating personally meaningful digital creations includes both creative expression
and the exploration of ideas.


1.3.2 PYTHON PROGRAMMING

What is Python?

Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.

It is used for:

• Web development (server-side),


• Software development,
• Mathematics,
• System scripting.

Figure 2: Python Logo

What can Python do?

• Python can be used on a server to create web applications.


• Python can be used alongside software to create workflows.
• Python can connect to database systems. It can also read and modify files.
• Python can be used to handle big data and perform complex mathematics.


Why Python?

• Python works on different platforms (Windows, Mac, Linux, Raspberry Pi, etc).
• Python has a simple syntax similar to the English language.
• Python has a syntax that allows developers to write programs with fewer lines than
some other programming languages.
• Python runs on an interpreter system, meaning that code can be executed as soon as it
is written. This means that prototyping can be very quick.
• Python can be treated in a procedural way, an object-oriented way, or a functional way.

Good to know

• The most recent major version of Python is Python 3, which we shall be using in this
tutorial. However, Python 2, although not being updated with anything other than
security updates, is still quite popular.
• Python can be written in a text editor. It is also possible to write Python in an Integrated
Development Environment, such as Thonny, PyCharm, NetBeans, or Eclipse, which are
particularly useful when managing larger collections of Python files.

Python Syntax compared to other programming languages

• Python was designed for readability and has some similarities to the English language
with influence from mathematics.
• Python uses new lines to complete a command, as opposed to other programming
languages which often use semicolons or parentheses.
• Python relies on indentation, using whitespace, to define scope, such as the scope of
loops, functions, and classes. Other programming languages often use curly brackets
for this purpose.
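A minimal sketch illustrating both points, using a hypothetical function name:

# Newlines complete each command; indentation defines scope.
def classify(n):
    if n % 2 == 0:        # this indented block belongs to the if statement
        return "even"
    return "odd"

for value in [1, 2, 3]:   # the loop body is defined by indentation, not brackets
    print(value, classify(value))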


CHAPTER 2

CONCEPTS LEARNT

2.1 INTRODUCTION TO STATISTICAL ANALYTICS


Statistics is the science of collecting, analyzing, presenting, and interpreting data, as well as
making decisions based on such analyses.

Types of Statistics:

1. Descriptive Statistics
2. Inferential Statistics

2.1.1 Descriptive Statistics and Summarization of Data


Descriptive statistics involve methods for organizing, displaying, and describing data
using tables, graphs, and summary measures. These statistics provide an overview of a dataset's
characteristics, such as central tendency, variability, distribution, and shape.

Key Concepts in Descriptive Statistics

1. Measures of Central Tendency

• Mean: The average value, calculated by summing all values and dividing by the number
of observations.
• Median: The middle value when data is ordered.
• Mode: The most frequently occurring value(s).
• Percentiles: Values below which a certain percentage of data falls.
• Quartiles: Values that divide data into four equal parts.

2. Measures of Variability

• Range: The difference between the maximum and minimum values.


• Variance: The average squared deviation from the mean.


• Standard Deviation: The square root of the variance, indicating average deviation from
the mean.

3. Measures of Distribution

• Histogram: A graph showing frequency distribution of data with intervals (bins) on the
x-axis and frequencies on the y-axis.
• Frequency Distribution: A table or chart showing the number of occurrences of each
value or interval.
• Probability Density Function (PDF): A function describing the likelihood of a random
variable taking certain values.

4. Shape of the Distribution

• Skewness: A measure of asymmetry, indicating whether data is skewed left or right.


• Kurtosis: A measure of "peakedness" or "tailedness" of the distribution, compared to a
normal distribution.

Descriptive statistics are often the first step in data analysis to gain insights into a dataset's
characteristics before applying inferential techniques. They provide a summary of central
tendencies, spread, and shape, facilitating easier interpretation and comparison across datasets.

2.1.2 MEASURES OF CENTRAL TENDENCY

These measures indicate specific locations in a group of numbers.


MODE
The mode is the value that occurs most frequently in a dataset. It represents the peak of
the frequency distribution. Unlike the mean and median, which indicate central points, the
mode shows the most common value.

Scenarios regarding modes:


1. Unimodal: One mode.
2. Bimodal: Two modes.
3. Multimodal: More than two modes.
4. No Mode: No value repeats or all values occur with the same frequency.


The mode is useful for categorical and discrete data and can apply to continuous data when
grouped. When multiple modes or no clear mode exist, the mean or median may be more
informative.

MEDIAN

The median is a key statistical measure that indicates the central point of a data set. Unlike the
mean, which can be significantly affected by outliers, the median provides a more robust
central value for skewed distributions.

• Suitable for: Ordinal, interval, and ratio data


• Not suitable for: Nominal data
• Resistant to: Extremely high or low values

MEAN

The mean, commonly known as the average, is a primary statistical measure used to determine
the central tendency of a data set. It is extensively used across disciplines like mathematics,
economics, and social sciences to provide a concise summary of data.

• Applicable for: Interval and ratio data


• Not applicable for: Nominal or ordinal data
• Sensitive to: Every value in the data set, including outliers

Types of Means in Statistics:


1. Population Mean (μ)

• The population mean is the average of all values in an entire population.


• It is a parameter, meaning it is a fixed value that describes the whole population.

2. Sample Mean (x̄)

• The sample mean is the average of values in a sample, which is a subset of the
population.
• It is a statistic, meaning it is calculated from the sample and can vary between
different samples.


QUARTILES

Quartiles divide a data set into four equal parts, each containing 25% of the data points.
They help in understanding the spread and distribution of the data, offering insights into its
variability and central tendency.

1. First Quartile (Q1): The value below which 25% of the data falls.

2. Second Quartile (Q2): The median, below which 50% of the data falls.

3. Third Quartile (Q3): The value below which 75% of the data falls.

PERCENTILES

Percentiles are statistical measures that divide a data set into 100 equal parts, providing a
way to understand the distribution and relative standing of data points within the set. Each
percentile indicates the value below which a given percentage of the data falls.

A percentile is a measure indicating the value below which a given percentage of
observations in a group of observations falls. For example, the 25th percentile is the value
below which 25% of the data points lie.
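A minimal Python sketch of these measures, using pandas and NumPy on made-up sample data:

import numpy as np
import pandas as pd

data = pd.Series([35, 40, 40, 42, 45, 48])    # hypothetical sample data

print("Mean:", data.mean())                   # sum of values / number of observations
print("Median:", data.median())               # middle value of the ordered data
print("Mode:", data.mode().tolist())          # most frequent value(s)
print("Quartiles:", data.quantile([0.25, 0.50, 0.75]).tolist())  # Q1, Q2, Q3
print("90th percentile:", np.percentile(data, 90))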

2.1.3 MEASURES OF VARIABILITY


Measures of variability quantify the extent to which data points differ from each other
and from the central tendency. They provide insight into the spread or dispersion of the data,
helping to understand its reliability and predictability.
Common Measures of Variability:
1. Range

• Definition: The difference between the maximum and minimum values.

• Example: For a dataset with minimum 35 and maximum 48, Range = 48 - 35 = 13.

2. Interquartile Range (IQR)

• Definition: The range within which the central 50% of data points lie, calculated as
Q3 - Q1.


3. Mean Absolute Deviation (MAD)

• Definition: The average of the absolute deviations from the mean.


• Insight: Provides a straightforward measure of dispersion, less sensitive to outliers.

4. Variance

• Definition: The average squared deviation of each data point from the mean.
• Insight: Indicates how much the values vary from the mean.

5. Standard Deviation

• Definition: The square root of the variance, showing the average distance of each data
point from the mean.
• Insight: More interpretable as it is in the same units as the original data.

6. Coefficient of Variation (CV)

• Definition: The standard deviation expressed as a percentage of the mean.


• Insight: Useful for comparing variation between datasets with different units or scales.
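A short pandas sketch of these six measures, again on hypothetical sample data:

import pandas as pd

data = pd.Series([35, 38, 40, 42, 45, 48])           # hypothetical sample data

data_range = data.max() - data.min()                 # 1. Range
iqr = data.quantile(0.75) - data.quantile(0.25)      # 2. Interquartile range (Q3 - Q1)
mad = (data - data.mean()).abs().mean()              # 3. Mean absolute deviation
variance = data.var()                                # 4. Sample variance
std_dev = data.std()                                 # 5. Sample standard deviation
cv = std_dev / data.mean() * 100                     # 6. Coefficient of variation (%)

print(data_range, iqr, mad, variance, std_dev, cv)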

2.1.4 MEASURES OF DISTRIBUTION

• Histogram: A graphical representation of the frequency distribution of numerical data,
displaying data points grouped into intervals (bins) along the x-axis and their
frequencies along the y-axis.
• Frequency Distribution: A table or chart showing the number of occurrences
(frequency) of each distinct value or interval in a dataset.
• Probability Density Function (PDF): A function that describes the likelihood of a
random variable taking on a particular value or falling within a range of values.
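As an example, a histogram can be drawn with Matplotlib; a sketch using synthetic data:

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1000)  # synthetic data

plt.hist(values, bins=20, edgecolor="black")  # group values into 20 intervals (bins)
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram of a synthetic dataset")
plt.show()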


2.1.5 SHAPE OF DISTRIBUTION

Skewness: Measures asymmetry in a distribution.

Positive Skew (Right Skew): Longer tail on the right.

Negative Skew (Left Skew): Longer tail on the left.

Coefficient of Skewness:

Figure 3: Coefficient of Skewness

• S < 0: Negatively skewed.


• S = 0: Symmetric.
• S > 0: Positively skewed.

Kurtosis

Kurtosis: Describes the "tailedness" of a distribution.

Figure 4: Kurtosis

• Leptokurtic: Sharp peak, heavy tails (Kurtosis > 3).

• Mesokurtic: Moderate peak and tails (Kurtosis ≈ 3).

• Platykurtic: Flat peak, light tails (Kurtosis < 3).
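A sketch computing both shape measures with SciPy on synthetic, right-skewed data:

import numpy as np
from scipy.stats import skew, kurtosis

values = np.random.exponential(scale=2.0, size=1000)   # right-skewed synthetic data

print("Skewness:", skew(values))                       # > 0 indicates positive (right) skew
print("Kurtosis:", kurtosis(values, fisher=False))     # about 3 for a normal distribution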


2.2 INFERENTIAL STATISTICS

Inferential statistics is a branch of statistics that allows us to make inferences about a
population based on a sample of data drawn from that population. It involves using probability
theory to draw conclusions and make predictions about a population's characteristics.

Key Terms:

• Population: Entire group being studied.


• Sample: Subset of the population.
• Census: Survey of the entire population.
• Sample Survey: Survey of a sample.
• Representative Sample: Sample that closely matches population characteristics.
• Variable: Characteristic being measured.
• Data: Collected values.
• Experiment: Activity yielding data.
• Parameter: Numerical summary of a population.
• Statistic: Numerical summary of a sample.

2.2.1 TYPES OF VARIABLES

Qualitative (Categorical):

• Nominal: Categorizes data (e.g., colors, names).


• Ordinal: Ordered data (e.g., rankings).

Quantitative (Numerical):

• Discrete: Countable values (e.g., number of students).


• Continuous: Any value within an interval (e.g., height)


2.3 CRISP-DM

2.3.1 PHASES OF CRISP-DM

The CRISP-DM methodology provides a structured approach to planning and executing data
mining projects. This framework ensures that the process is reliable and repeatable, even by
individuals with minimal data mining expertise. It is non-proprietary, application- and
industry-neutral, and tool-neutral, focusing primarily on addressing business issues. Figure 5
outlines the six key phases of the CRISP-DM process:

Figure 5: CRISP DM Framework

2.3.2 BUSINESS UNDERSTANDING

• Objective Setting: Comprehend the business context and specific challenges. Clearly
define the goals of the data mining project in measurable terms, such as increasing
sales from repeat customers by a specific percentage.
• Success Criteria: Establish criteria to gauge the effectiveness of the project, such as
achieving targeted ROI or improving customer retention rates.
• Stakeholder Identification: Identify and engage key stakeholders to ensure that the
project aligns with strategic objectives.
• Preliminary Data Assessment: Conduct an initial assessment to identify existing data
and potential gaps. This ensures that subsequent data analysis stages are focused and
relevant.


2.3.3 DATA UNDERSTANDING

• Data Collection: Gather initial data, which might include customer contact details,
purchase histories, interests, demographics, etc.
• Data Exploration: Analyze the data to understand its structure, quality, and relevance.
This step involves summarizing the data and identifying any issues or patterns.
• Data Quality Assessment: Evaluate the completeness and accuracy of the data.
Identify missing, inconsistent, or outlier values that may affect the analysis.
• Data Relevance: Determine the usefulness of each dataset in relation to the project’s
objectives. This step ensures that the analysis will be based on pertinent data.

2.3.4 DATA PREPARATION

• Selection: Choose the data that is relevant to the analysis. This step may involve
selecting subsets of variables or observations.
• Cleaning: Address data quality issues identified in the previous phase. This may
involve handling missing values, correcting inconsistencies, and removing duplicates.
• Transformation: Convert the data into formats suitable for analysis. This may include
normalization, aggregation, or creating new variables.
• Integration: Combine data from different sources into a coherent dataset. This step
often involves using ETL (Extract, Transform, Load) software to pull data from
multiple locations and integrate it into a uniform format.
• Formatting: Arrange the data into the required format for analysis. This could include
structuring data into tables or converting it into specific file types.


2.3.5 MODELING

• Model Selection: Choose the appropriate modeling techniques based on the data and
the project objectives. This could involve classification, regression, clustering, etc.
• Training: Use training data to build models of customer behavior. These models help
in understanding and predicting outcomes based on historical data.
• Testing: Execute numerous tests to evaluate model performance. This helps in
comparing different models and selecting the best one.
• Pattern Discovery: Identify new patterns in the data that could provide insights.
Machine learning tools are often used to automate this process.
• Validation: Conduct blind tests with real data to validate model accuracy. This step
ensures that the models are reliable and can be generalized to new data.
• Human Evaluation: Apply human intuition to evaluate the identified patterns and
models. This ensures that the findings make practical sense and are actionable.

2.3.6 EVALUATION

• Review of Models: Assess the models built during the modeling phase to ensure they
are technically correct and effective.
• Comparison with Business Objectives: Evaluate the results against the business
success criteria established at the beginning of the project.
• Result Validation: Confirm that the insights obtained are valuable and actionable.
This ensures that the organization can make informed decisions based on the results.
• Documentation: Record the findings, methodologies, and any insights gained during
the evaluation phase. This documentation is crucial for transparency and future
reference.


2.3.7 DEPLOYMENT

• Implementation: Integrate the insights and models into business processes. This could
involve deploying a model to produce churn scores that are then read into a data
warehouse.
• Action: Use the insights gained from data mining to drive changes within the
organization. This could involve adjusting marketing strategies, optimizing
operations, or improving customer service.
• Monitoring: Continuously monitor the deployed models and processes to ensure they
are performing as expected. Make adjustments as necessary to maintain their
effectiveness.
• Documentation and Reporting: Document the deployment process, outcomes, and any
lessons learned. This helps in refining future data mining projects and provides a record
of the changes made.

Figure 6: CRISP-DM Key Phases

Figure 6 shows the CRISP-DM key phases. By following these structured steps, the CRISP-DM
methodology ensures a thorough and effective approach to data mining, aligning technical
analysis with business objectives to deliver actionable insights.


CHAPTER 3

TASKS PERFORMED

3.1 OVERVIEW

The task performed is Data Analysis, the technique of collecting, transforming, and
organizing data to make future predictions and informed, data-driven decisions. It also helps
to find possible solutions for a business problem. There are six steps in Data Analysis:
Ask or Specify Data Requirements, Prepare or Collect Data, Clean and Process, Analyze,
Train Models, and Conclude.

3.2 DATA REQUIREMENTS

Though traditional requirements analysis centers on functional needs, data requirements
analysis complements the functional requirements process and focuses on information needs,
providing a standard set of procedures for identifying, analyzing, and validating data
requirements and quality for data-consuming applications. Data requirements analysis is a
significant part of an enterprise data management program that is intended to help in:

• Articulating a clear understanding of the data needs of all consuming business processes,

• Identifying relevant data quality dimensions associated with those data needs,

• Assessing the quality and suitability of candidate data sources,

• Continually reviewing to identify improvement opportunities in relation to downstream
data needs.

During this process, the data quality analyst needs to focus on identifying and capturing more
than just a list of business questions that need to be answered.


3.3 DATA PREPARATION

Data preparation is the process of preparing raw data so that it is suitable for further
processing and analysis. Key steps include collecting, cleaning, and labeling raw data into a
form suitable for machine learning (ML) algorithms and then exploring and visualizing the
data. Data preparation can take up to 80% of the time spent on an ML project. Using specialized
data preparation tools is important to optimize this process.

Connection between ML and data preparation:

Data flows through organizations like never before, arriving from everything from
smartphones to smart cities as both structured data and unstructured data (images, documents,
geospatial data, and more). Unstructured data makes up 80% of data today. ML can analyze
not just structured data, but also discover patterns in unstructured data. ML is the process where
a computer learns to interpret data and make decisions and recommendations based on that
data. During the learning process—and later when used to make predictions—incorrect, biased,
or incomplete data can result in inaccurate predictions.

3.4 DATA CLEANING AND PROCESSING

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly
formatted, duplicate, or incomplete data within a dataset. Figure 7 shows the steps of cleaning
data. When combining multiple data sources, there are many opportunities for data to be
duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even
though they may look correct. There is no one absolute way to prescribe the exact steps in the
data cleaning process, because the process varies from dataset to dataset. But it is crucial to
establish a template for your data cleaning process so you know you are doing it the right way
every time. There are four steps to clean data (a code sketch follows the list):

1. Remove duplicate or irrelevant observations
2. Fix structural errors
3. Filter unwanted outliers
4. Handle missing data
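A minimal pandas sketch of these four steps, assuming a hypothetical DataFrame df with a
numeric column named "value":

import pandas as pd

df = pd.DataFrame({"value": [1.0, 1.0, 2.5, 300.0, None, 4.2]})  # hypothetical data

# 1. Remove duplicate or irrelevant observations
df = df.drop_duplicates()

# 2. Fix structural errors (e.g., inconsistent column names)
df.columns = df.columns.str.strip().str.lower()

# 3. Filter unwanted outliers using the 1.5 * IQR rule
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["value"].isna() | df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Handle missing data by imputing the column mean
df["value"] = df["value"].fillna(df["value"].mean())
print(df)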


Figure 7: Steps of Cleaning Data

3.5 PYTHON LIBRARY

In computer programming, a library refers to a bundle of code consisting of dozens or
even hundreds of modules that offer a range of functionality. Each library contains a set of
pre-written code whose use reduces the time necessary to code. Libraries are especially
useful for accessing frequently reused, pre-written code, saving users the time of writing it
from scratch every time. Python boasts over 137,000 libraries, with the Python Standard
Library comprising hundreds of modules geared toward executing basic tasks like reading
JSON data or sending emails.

Here are some key Python libraries:

1. Plotly
• Functionality: This graphic library can create a variety of interactive, high-quality
data visualizations.
• Use Cases: Scatter plots, heatmaps, histograms, box plots, bubble charts, polar charts,
and more.


2. NumPy
• Functionality: Introduces objects for multidimensional arrays and matrices, and
functions for advanced mathematical and statistical operations.
• Use Cases: Numerical computing, scientific computing, linear algebra, Fourier
transforms, and random number generation.

3. SciPy
• Functionality: A collection of algorithms for linear algebra, differential equations,
numerical integration, optimization, and statistics.
• Use Cases: Scientific and technical computing, optimization problems, signal
processing, and statistical analysis.

4. Pandas
• Functionality: Adds data structures and tools designed to work with table-like data,
such as Series and DataFrames.
• Use Cases: Data manipulation, data cleaning, data wrangling, and time series analysis.

5. Matplotlib
• Functionality: A 2D plotting library that produces publication-quality figures in a
variety of hardcopy formats and interactive environments.
• Use Cases: Creating static, animated, and interactive visualizations such as line plots,
bar charts, histograms, and scatter plots.

6. Seaborn
• Functionality: Based on Matplotlib, it provides a high-level interface for drawing
attractive statistical graphics.
• Use Cases: Visualizing distributions of data, relationships between variables, and
statistical plots like regression lines and categorical plots.
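A small sketch (with synthetic data) showing several of these libraries working together:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# NumPy generates synthetic data; pandas holds it as a table
df = pd.DataFrame({"x": np.random.normal(size=200)})
df["y"] = 2 * df["x"] + np.random.normal(scale=0.5, size=200)

print(df.describe())                  # pandas summary statistics

sns.regplot(data=df, x="x", y="y")    # Seaborn regression plot on Matplotlib axes
plt.title("Synthetic linear relationship")
plt.show()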


3.6 DATA ANALYZING


Data analysis is the process of inspecting, cleansing, transforming, and modeling data
with the goal of discovering useful information, informing conclusions, and supporting
decision-making.[1] Data analysis has multiple facets and approaches, encompassing diverse
techniques under a variety of names, and is used in different business, science, and social
science domains.
Types of data analysis:
• Descriptive analysis: Descriptive analysis tells us what happened. This type of
analysis helps describe or summarize quantitative data by presenting statistics. For
example, descriptive statistical analysis could show sales distribution across a group
of employees and the average sales figure per employee.
• Diagnostic analysis: If the descriptive analysis determines the “what,” diagnostic
analysis determines the “why.” Let’s say a descriptive analysis shows an unusual
influx of patients in a hospital. Drilling into the data might reveal that many of these
patients shared symptoms of a particular virus. This diagnostic analysis can help you
determine that an infectious agent—the “why”—led to the influx of patients.
• Predictive analysis: So far, we’ve looked at types of analysis that examine and draw
conclusions about the past. Predictive analytics uses data to form projections about
the future. Using predictive analysis, you might notice that a given product has had its
best sales during September and October each year, leading you to predict a similar
high point during the upcoming year.


3.7 MODEL TRAINING


A machine learning training model is a process in which a machine learning (ML)
algorithm is fed sufficient training data to learn from. Figure 8 shows the machine learning
types.

Figure 8: Model Training

Different Machine Learning Models:


1. Logistic Regression: Logistic Regression is used to determine whether an input belongs
to a certain group or not. Regression in data science and machine learning is a statistical
method that enables predicting outcomes based on a set of input variables. Figure 9
shows an example of a logistic regression curve.

Figure 9: Logistic Regression


2. Naive Bayes: Naive Bayes is an algorithm that assumes independence among variables
and uses probability to classify objects based on features. Naive Bayes classifiers are a
collection of classification algorithms based on Bayes' Theorem. It is not a single
algorithm but a family of algorithms, all of which share a common principle: every pair
of features being classified is independent of each other. One of the simplest and most
effective classification algorithms, the Naive Bayes classifier aids in the rapid
development of machine learning models with fast prediction capabilities. It is widely
used in text classification, where the data is high-dimensional (each word represents
one feature), for tasks such as spam filtering, sentiment detection, and rating
classification. Figure 10 shows applications using the Naive Bayes algorithm.

Figure 10: Naïve Bayes

3. Decision Trees: Decision trees are also classifiers, used to determine what category
an input falls into by traversing the leaves and nodes of a tree. Figure 11 shows that a
Decision Tree is a predictive approach in ML to determine what class an object
belongs to.

Figure 11: Decision Trees


4. Linear Regression: Linear regression is used to identify relationships between the
variable of interest and the inputs, and to predict its values based on the values of the
input variables.
5. kNN: The k Nearest Neighbors technique involves grouping the closest objects in a
dataset and finding the most frequent or average characteristics among the objects.
6. Random Forest: Random forest is a collection of many decision trees from random
subsets of the data, resulting in a combination of trees that may be more accurate in
prediction than a single decision tree.
7. K-Means: The K-Means algorithm finds similarities between objects and groups them
into K different clusters.

Figure 12: Supervised vs Unsupervised Classification


CHAPTER 4
PROJECT WORK
WATER POTABILITY
Problem Statement: Develop a classification model to assess the safety of drinking
water based on water quality parameters such as pH, hardness, solids, chloramines, sulfate,
conductivity, organic carbon, trihalomethanes, and turbidity, using Random Forest.

4.1 IMPLEMENTATION

Figure 13: Project Implementation

The first five rows of the dataset are printed in the figure above. The dataset contains null
values, which are rectified by preprocessing techniques.


4.2 DATA PREPROCESSING

4.2.1 Data cleaning:

As null values are present in the dataset, data cleaning is carried out and the missing
values are filled. Null values in each feature column are filled using the mean of the respective
feature.

Figure 14: Data Cleaning
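A sketch of this step, assuming the dataset has been loaded into a pandas DataFrame from a
hypothetical water_potability.csv file:

import pandas as pd

df = pd.read_csv("water_potability.csv")   # hypothetical file name

print(df.isnull().sum())                   # count of null values per feature

# Fill nulls in each feature column with that column's mean
df = df.fillna(df.mean())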

4.2.2 Analysis of class variable

Figure 15: Analysis of Class Variable

The figure displays the count of the class variable, namely Potability. The dataset
contains 1998 zeros, which imply contaminated water, and 1278 ones, which imply potable
water.
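A sketch of this check, assuming the target column is named "Potability" in the DataFrame
loaded above:

# Count of each class label: 0 = contaminated, 1 = potable
print(df["Potability"].value_counts())     # expected here: 1998 zeros, 1278 ones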


4.3 DATA VISUALIZATION:

Figure 16: Data Visualization

Histogram representation of all the features and the class variable.

Model planning

By splitting the dataset, one portion is used for training the model and the other for testing
it. 80% of the dataset is used as training data and the remaining 20% as testing data.

Figure 17: Model Planning
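A sketch of this split using scikit-learn, assuming the cleaned DataFrame df from earlier:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["Potability"])        # feature columns
y = df["Potability"]                       # class variable

# 80% training data, 20% testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)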


4.4 MODEL TRAINING

4.4.1 By using Random Forest

Figure 18: Model Training


Training the model by changing the parameters of random forest:

Figure 19: Training the Model by Changing Parameters

Training a random forest model involves adjusting its parameters to optimize its
performance for a specific task.
In the above example, an accuracy score of 71.5% was obtained by repeated training of
the model, changing hyperparameters like the criterion, number of estimators, random state, etc.
The criterion was changed from entropy to gini, as the accuracy score for entropy was
much lower than for gini.
Estimators imply the number of trees, and the number was increased to get the final accuracy.
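A sketch of this tuning, assuming the earlier train/test split; the parameter values shown are
illustrative, not the exact ones from the figures:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Illustrative hyperparameters: gini criterion, more trees, fixed random state
model = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))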


4.4.2 By using Decision Tree Classifier:

Figure 20: Using Decision Tree Classifier

In the above example, we import DecisionTreeClassifier from sklearn.tree. We then fit
the training data to the Decision Tree Classifier model and predict the value of potability for
the test data using the model. Finally, we find the accuracy score, confusion matrix, and
classification report for the test data.
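A sketch of these steps, again assuming the earlier train/test split:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)                 # fit the training data

y_pred = tree.predict(X_test)              # predict potability for the test set

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))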


4.5 MODEL TESTING

Figure 21: Model Testing

Evaluating the final selected model using the testing set to estimate its performance on
unseen data. This step provides an unbiased estimate of the model's generalization ability
and helps assess its practical utility in real-world water quality prediction tasks.


4.6 RESULT

1) The accuracy score using the random forest model is 69.2%, and using the decision tree
classifier model it is 59.1%.
2) An accuracy score of 71.5% was obtained by repeated training of the model, changing
hyperparameters like the criterion, number of estimators, random state, etc.
3) The criterion was changed from entropy to gini, as the accuracy score for entropy was
much lower than for gini.
4) Estimators imply the number of trees, and the number was increased to get the final accuracy.

Figure 22: Results


CHAPTER 5
CONCLUSION

Our research highlighted the Random Forest (RF) model's superior performance in
predicting water-quality indices, making it a valuable tool for urban water management. It helps
simulate scenarios and informs strategies to mitigate urbanization impacts on water bodies.
Key Learnings:
Data Collection and Preprocessing: Essential for reliable datasets on water quality
parameters (e.g., pH, dissolved oxygen, pollutants). Involves cleaning data and handling
missing values.
Feature Selection and Engineering: Critical for identifying relevant features and
enhancing predictive power with domain knowledge.
Model Selection: Depends on the problem and data characteristics. Common models
include linear regression, decision trees, random forests, SVMs, and neural networks.
Model Evaluation: Uses metrics like mean absolute error and R-squared. Cross-validation
estimates generalization to unseen data.

The experiences I encountered during the internship allowed me to develop my technical
skills in data analysis. The training program on data analysis using Python focused on
increasing our knowledge of and interest in Python, as Python is one of the most widely used
languages today. We learnt the basics of the language and its use in building various types of
programs. It was a great experience. Two main things that I've learned are the importance of
time-management skills and self-motivation. An internship is a really good program; it helps
to enhance and develop our skills, abilities, and knowledge. The overall experience was
positive, and I'm sure I will be able to use the skills I learned in my career later. Throughout
my internship, I came to understand more about what it means to be an IT technician and
programmer, and to prepare myself to become a responsible and innovative technician and
programmer in the future. During my training period, I realised that observation is a main
element in finding out the root cause of a problem. Python serves as a versatile and powerful
tool for data analysis, offering robust capabilities from data cleaning and exploratory analysis
to statistical modelling and visualization.


CHAPTER 6
REFERENCES

• About the company profile: https://iobrain.in/about-iil/

• Python libraries: https://docs.python.org/3/library/index.html

• CRISP-DM framework: https://www.datascience-pm.com/crisp-dm-2/

• Machine learning models: https://javatpoint.com/machine-learning-models
