
A

PRACTICAL TRAINING REPORT


AT

Celebal Technologies
“Submitted in Partial Fulfillment of the Requirement for the Award
of the Degree of Bachelor of Technology in Computer Science Engineering
(Artificial Intelligence) to Rajasthan Technical University, Kota”

SUBMITTED BY
Yogesh Gaur (21AI044)

Department of Computer Science (Artificial Intelligence)


ANAND INTERNATIONAL COLLEGE OF ENGINEERING, JAIPUR

2024-25
Acknowledgment

It is a great pleasure and privilege for me to present this practical training report, carried out at “Celebal Technologies” and submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science Engineering (Artificial Intelligence) to Rajasthan Technical University, Kota.

I express my sincere thanks to the H.O.D. of the Computer Science Engineering (Artificial Intelligence) Department of our college for his kind co-operation and valuable suggestions.

I am very thankful to Mr. Sharthak Acharjee, Senior Manager at Celebal Technologies, for his encouragement and inspiration at every step. This training would not have been possible without his support and able guidance; he was very supportive throughout the training period in sharing his knowledge and technical expertise.

Finally, I express my earnest and sincere thanks to the whole of “Celebal Technologies” for their generous help and co-operation in every possible manner.

Yogesh Gaur

College ID.: 21AI044


Candidate’s Declaration

I hereby declare that the work presented in this practical training report, carried out at “Celebal Technologies” and submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science (Artificial Intelligence) Engineering to Rajasthan Technical University, Kota, is a record of my training carried out under the guidance of Mr. Sharthak Acharjee, Senior Manager at Celebal Technologies.

Yogesh Gaur

College ID.: 21AI044

Mr. Sharthak Acharjee

Celebal Technologies
Abstract

Data Science is an interdisciplinary field that combines statistics, mathematics, computer science,

and domain expertise to extract insights from structured and unstructured data. It leverages

advanced techniques such as machine learning, data visualization, and big data analytics to address

real-world challenges in industries like healthcare, finance, e-commerce, and more.

With the exponential growth of data, Data Science plays a pivotal role in decision-making by

transforming raw data into actionable insights. It encompasses data collection, cleaning, analysis,

and interpretation, fostering innovation and driving business value. Emerging technologies like AI,

cloud computing, and IoT further enhance its scope, making it a cornerstone of modern problem-

solving and predictive analytics.


Table of Contents
Introduction to Organization

About Training

Module-1: Introduction to Data Science

1.1. Data Science Overview .............................................................................................. 3

Module-2: Python for Data Science

2.1. Introduction to Python .............................................................................................. 5


2.2. Understanding Operators......................................................................................... 7
2.3. Variables and Data Types .......................................................................................... 8
2.4. Conditional Statements ..............................................................................................9
2.5. Looping Constructs .................................................................................................... 10
2.6. Functions ........................................................................................................................ 11
2.7. Data Structure .............................................................................................................. 12
2.8. Lists .................................................................................................................................. 13
2.9. Dictionaries .................................................................................................................... 14
2.10. Understanding Standard Libraries in Python ................................................. 15
2.11. Reading a CSV File in Python ............................................................................... 16
2.12. Data Frames and basic operations with Data Frames .................................. 17
2.13. Indexing Data Frame............................................................................................... 18
Module-3: Understanding the Statistics for Data Science

3.1. Introduction to Statistics ......................................................................................... 19


3.2. Measures of Central Tendency ..............................................................................19
3.3. Understanding the spread of data ....................................................................... 19
3.4. Data Distribution ........................................................................................................ 20
3.5. Introduction to Probability ...................................................................................... 20
3.6. Probabilities of Discrete and Continuous Variables ....................................... 21
3.7. Central Limit Theorem and Normal Distribution............................................. 21
3.8. Introduction to Inferential Statistics .................................................................... 22
3.9. Understanding the Confidence Interval and margin of error ..................... 23

3.10. Hypothesis Testing ....................................................................................................24
3.11. T tests ............................................................................................................................ 24
3.12. Chi Squared Tests ...................................................................................................... 24
3.13. Understanding the concept of Correlation .......................................................24
Module-4: Predictive Modeling and Basics of Machine Learning

4.1. Introduction to Predictive Modeling ..................................................................... 25


4.2. Understanding the types of Predictive Models ................................................. 25
4.3. Stages of Predictive Models ..................................................................................... 25
4.4. Hypothesis Generation ................................................................................................ 25
4.5. Data Extraction
4.6. Data Exploration
4.7. Reading the data into Python
4.8. Variable Identification ................................................................................................. 27
4.9. Univariate Analysis for Continuous Variables ......................................................27
4.10. Univariate Analysis for Categorical Variables ................................................... 27
4.11. Bivariate Analysis .........................................................................................................27
4.12. Treating Missing Values ........................................................................................... 28
4.13. How to treat Outliers .................................................................................................28
4.14. Transforming the Variables .................................................................................... 28
4.15. Basics of Model Building ......................................................................................... 28
4.16. Linear Regression ....................................................................................................... 29
4.17. Logistic Regression .................................................................................................... 29
4.18. Decision Trees ............................................................................................................. 29
4.19. K-means ......................................................................................................................... 29

Module-1: Introduction to Data Science

1.1. Data Science Overview


Data science is the study of data. Just as the biological sciences study biology and the physical sciences study physical phenomena, data science studies data: data is real, data has real properties, and we need to study them if we are going to work with them.

Data science is a process, not an event. It is the process of using data to understand many different things, to understand the world. Suppose you have a model or a proposed explanation of a problem, and you try to validate that proposed explanation or model with your data.

It is the skill of unfolding the insights and trends that are hiding (or abstract) behind data. It is when you translate data into a story, using storytelling to generate insight. With these insights, you can make strategic choices for a company or an institution.

We can also define data science as a field concerned with the processes and systems used to extract data of various forms from various resources, whether the data is unstructured or structured.

Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and probability to forecast or estimate granular, specific outcomes. For example, predictive modeling could help identify customers who are likely to purchase our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) in which computers learn to act and adapt to new data without being explicitly programmed to do so. The computer is able to act independently of human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. "Guessing" doesn't cut it: a forecast, unlike a prediction, must have logic to it. It must be defendable. This logic is what differentiates it from a magic 8-ball's lucky guess. After all, even a broken watch is right twice a day.

Applications of Data Science:
Data science and big data are making an undeniable impact on businesses, changing day-to-day operations, financial analytics, and especially interactions with customers. It is clear that businesses can gain enormous value from the insights data science can provide, but sometimes it is hard to see exactly how, so let's look at some examples.

In this era of big data, almost everyone generates masses of data every day, often without being aware of it. This digital trace reveals the patterns of our online lives. If you have ever searched for or bought a product on a site like Amazon, you will notice that it starts making recommendations related to your search. This type of system, known as a recommendation engine, is a common application of data science. Companies like Amazon, Netflix, and Spotify use algorithms to make specific recommendations derived from customer preferences and historical behavior. Personal assistants like Siri on Apple devices use data science to devise answers to the infinite number of questions end users may ask. Google watches your every move in the world, your online shopping habits, and your social media; it then analyzes that data to create recommendations for restaurants, bars, shops, and other attractions based on the data collected from your device and your current location. Wearable devices like Fitbits, Apple Watches, and Android watches add information about your activity levels, sleep patterns, and heart rate to the data you generate.

Now that we know how consumers generate data, let's take a look at how data science is impacting business. In 2011, McKinsey & Company said that data science was going to become the key basis of competition, supporting new waves of productivity, growth, and innovation. In 2013, UPS announced that it was using data from customers, drivers, and vehicles in a new route guidance system aimed at saving time, money, and fuel. Initiatives like this support the statement that data science will fundamentally change the way businesses compete and operate.

How does a firm gain a competitive advantage? Let's take Netflix as an example. Netflix collects and analyzes massive amounts of data from millions of users, including which shows people are watching at what time of day, when people pause, rewind, and fast-forward, and which shows, directors, and actors they search for. Netflix can be confident that a show will be a hit before filming even begins by analyzing users' preferences for certain directors and acting talent, and by discovering which combinations people enjoy. Add this to the success of earlier versions of a show and you have a hit. For example, Netflix knew many of its users had streamed the work of David Fincher. They also knew that films featuring Robin Wright had always done well, and that the British version of House of Cards was very successful.
Module-2: Python for Data Science

2.1. Introduction to Python


Python is a high-level, general-purpose, and very popular programming
language. Python (the latest version being Python 3) is used in web
development, machine learning applications, and other cutting-edge areas
of the software industry. Python is well suited for beginners as well as
for programmers experienced in other languages such as C++ and Java.

Below are some facts about Python Programming Language:

• Python is currently the most widely used multi-purpose, high-level
programming language.
• Python allows programming in object-oriented and procedural paradigms.
• Python programs are generally smaller than programs in languages such as
Java. Programmers have to type relatively less, and the language's
indentation requirement keeps programs readable.
• Python is used by almost all tech-giant companies, such as
Google, Amazon, Facebook, Instagram, Dropbox, and Uber.
• The biggest strength of Python is its huge collection of standard and
third-party libraries, which can be used for the following:
• Machine Learning
• GUI Applications (like Kivy, Tkinter, PyQt etc. )
• Web frameworks like Django (used by YouTube, Instagram, Dropbox)
• Image processing (like OpenCV, Pillow)
• Web scraping (like Scrapy, BeautifulSoup, Selenium)
• Test frameworks
• Multimedia
• Scientific computing
• Text processing and many more.

2.2. Understanding Operators
a. Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like
addition, subtraction, multiplication and division.

OPERATOR   DESCRIPTION                                                   SYNTAX
+          Addition: adds two operands                                   x + y
-          Subtraction: subtracts the second operand from the first      x - y
*          Multiplication: multiplies two operands                       x * y
/          Division (float): divides the first operand by the second     x / y
//         Division (floor): divides the first operand by the second     x // y
%          Modulus: returns the remainder when the first operand is
           divided by the second                                         x % y
**         Power: raises the first operand to the power of the second    x ** y
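The arithmetic operators can be tried directly in the interpreter; the operand values below are chosen arbitrarily for illustration.

```python
x, y = 7, 3

print(x + y)   # 10
print(x - y)   # 4
print(x * y)   # 21
print(x / y)   # 2.3333... (float division)
print(x // y)  # 2 (floor division)
print(x % y)   # 1 (remainder)
print(x ** y)  # 343 (7 raised to the power 3)
```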

b. Relational Operators:
Relational operators compare values and return True or False according to
the condition.

OPERATOR   DESCRIPTION                                                        SYNTAX
>          Greater than: True if the left operand is greater than the right   x > y
<          Less than: True if the left operand is less than the right         x < y
==         Equal to: True if both operands are equal                          x == y
!=         Not equal to: True if the operands are not equal                   x != y
>=         Greater than or equal to: True if the left operand is greater
           than or equal to the right                                         x >= y
<=         Less than or equal to: True if the left operand is less than
           or equal to the right                                              x <= y
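A short demonstration of the relational operators, again with arbitrary operand values:

```python
x, y = 5, 10

print(x > y)    # False
print(x < y)    # True
print(x == y)   # False
print(x != y)   # True
print(x >= y)   # False
print(x <= y)   # True
```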

Logical operators:
Logical operators perform Logical AND, Logical OR, and Logical NOT operations.

OPERATOR   DESCRIPTION                                       SYNTAX
and        Logical AND: True if both operands are true       x and y
or         Logical OR: True if either operand is true        x or y
not        Logical NOT: True if the operand is false         not x
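The three logical operators in action:

```python
x, y = True, False

print(x and y)  # False: both operands must be true
print(x or y)   # True: at least one operand is true
print(not x)    # False: negation of True
```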

c. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.

OPERATOR   DESCRIPTION           SYNTAX
&          Bitwise AND           x & y
|          Bitwise OR            x | y
~          Bitwise NOT           ~x
^          Bitwise XOR           x ^ y
>>         Bitwise right shift   x >> y
<<         Bitwise left shift    x << y
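A worked example of the bitwise operators, with the binary patterns noted in comments:

```python
x, y = 10, 4   # binary: 1010 and 0100

print(x & y)   # 0   (1010 & 0100 = 0000)
print(x | y)   # 14  (1010 | 0100 = 1110)
print(~x)      # -11 (bitwise NOT of n is -(n + 1))
print(x ^ y)   # 14  (1010 ^ 0100 = 1110)
print(x >> 2)  # 2   (1010 shifted right twice = 10)
print(x << 2)  # 40  (1010 shifted left twice = 101000)
```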

d. Assignment operators:
Assignment operators are used to assign values to variables.

OPERATOR   DESCRIPTION                                                  SYNTAX
=          Assign the value of the right-hand expression to the
           left-side operand                                            x = y + z
+=         Add AND: add the right operand to the left operand and
           assign the result to the left operand                        a += b  (a = a + b)
-=         Subtract AND: subtract the right operand from the left
           operand and assign the result to the left operand            a -= b  (a = a - b)
*=         Multiply AND: multiply the left operand by the right
           operand and assign the result to the left operand            a *= b  (a = a * b)
/=         Divide AND: divide the left operand by the right operand
           and assign the result to the left operand                    a /= b  (a = a / b)
%=         Modulus AND: take the modulus of the left and right
           operands and assign the result to the left operand           a %= b  (a = a % b)
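The augmented assignment operators can be chained through a single variable to see each shorthand in effect:

```python
a, b = 10, 3

a += b   # a = a + b  -> 13
print(a)
a -= b   # a = a - b  -> 10
print(a)
a *= b   # a = a * b  -> 30
print(a)
a /= b   # a = a / b  -> 10.0 (true division yields a float)
print(a)
a %= b   # a = a % b  -> 1.0
print(a)
```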

Operator precedence (highest to lowest):

OPERATOR   DESCRIPTION                                            ASSOCIATIVITY
()         Parentheses                                            left-to-right
**         Exponent                                               right-to-left
* / %      Multiplication / division / modulus                    left-to-right
+ -        Addition / subtraction                                 left-to-right
<< >>      Bitwise left shift, bitwise right shift                left-to-right
< <=       Relational less than / less than or equal to
> >=       Relational greater than / greater than or equal to     left-to-right
== !=      Relational is equal to / is not equal to               left-to-right

2.3. Variables and Data Types


Variables:
a. Python Variable Naming Rules:
There are certain rules for what you can name a variable (called an identifier).
• Python variables can only begin with a letter (A-Z/a-z) or an underscore (_).
• The rest of the identifier may contain letters (A-Z/a-z), underscores (_), and
digits (0-9).
• Python is case-sensitive, and so are Python identifiers: Name and name are two
different identifiers.
b. Assigning and Reassigning Python Variables:
• To assign a value to a Python variable, you don't need to declare its type.
• You name it according to the rules stated above and type the value after
the equal sign (=).
• You can't use an identifier on the right-hand side of the equal sign before
it has been assigned a value.
• Nor can you use a Python keyword as a variable name.

A. Sets:
A set holds a collection of values and is defined using curly braces. It keeps
only one instance of any value present more than once. A set is unordered, so it
doesn't support indexing, but it is mutable: you can change its elements or add
more, using the add() and remove() methods.
B. Type Conversion:
Since Python is dynamically typed, you may want to convert a value into another
type. Python provides built-in functions for this purpose:
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
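Sets and the conversion functions listed above can be exercised together in a short session:

```python
# a set keeps only one instance of each value and is unordered
s = {1, 2, 2, 3, 3, 3}
print(s)          # {1, 2, 3} (duplicates removed)
s.add(4)
s.remove(1)
print(s)          # {2, 3, 4}

# type-conversion functions
print(int("42"))      # 42
print(float(7))       # 7.0
print(bool(0))        # False
print(list("abc"))    # ['a', 'b', 'c']
print(tuple([1, 2]))  # (1, 2)
print(str(3.14))      # 3.14 (as a string)
```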
2.4. Conditional Statements
a. If statements
The if statement is one of the most commonly used conditional statements in
most programming languages. It decides whether certain statements need to be
executed or not: it checks a given condition, and if the condition is true,
the block of code inside the if block is executed. The if condition evaluates
a Boolean expression and executes the block of code only when the Boolean
expression is TRUE.
Syntax:
if boolean_expression:
    block_of_code  # set of statements to execute if the condition is true

b. If-else statements
The statement itself tells that if a given condition is true, the statements
inside the if block are executed, and if the condition is false, the else
block is executed. The else block runs only when the condition is false; this
is where you perform actions when the condition is not true. The if-else
statement evaluates the Boolean expression, executes the block of code inside
the if block if the condition is TRUE, and executes the block of code in the
else block if the condition is FALSE.

Syntax:
if boolean_expression:
    block_of_code  # set of statements to execute if the condition is true
else:
    block_of_code  # set of statements to execute if the condition is false

c. elif statements
Python has one more conditional statement, elif. An elif statement checks a
further condition only if the preceding if condition is false. It is similar
to an if-else statement; the difference is that else does not check a
condition, whereas elif does. Elif statements let you evaluate multiple
conditions in sequence.
Syntax:
if condition:
    # set of statements to execute if the condition is true
elif condition:
    # set of statements to execute when the if condition is false and the
    # elif condition is true
else:
    # set of statements to execute when both the if and elif conditions are false
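A concrete if/elif/else chain, using an invented grading example:

```python
marks = 72

if marks >= 80:
    grade = "A"
elif marks >= 60:   # checked only because the first condition was false
    grade = "B"
else:
    grade = "C"

print(grade)   # B
```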

d. Nested if-else statements

Nested if-else statements mean that an if statement or if-else statement is
present inside another if or if-else block. Python provides this feature as
well, which helps us check multiple conditions in a given program: an if
statement inside another if statement, which may itself be inside another if
statement, and so on.
Nested if syntax:
if condition:
    # statements to execute if the outer condition is true
    if condition:
        # statements to execute if the inner condition is true
    # end of nested if
# end of if

Nested if-else syntax:
if condition:
    # statements to execute if the outer condition is true
    if condition:
        # statements to execute if the inner condition is true
    else:
        # statements to execute if the inner condition is false
else:
    # statements to execute if the outer condition is false

2.5. Looping Constructs


Loops:
a. while loop:
Repeats a statement or group of statements while a given condition is TRUE.
It tests the condition before executing the loop body.
Syntax:
while expression:
    statement(s)
b. for loop:
Executes a sequence of statements multiple times and abbreviates the code
that manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)
c. Nested loops:
You can use one or more loops inside any other while or for loop.
Syntax of a nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)
Syntax of a nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)
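The three looping constructs can be demonstrated with small, arbitrary examples:

```python
# while loop: the condition is tested before each iteration
n = 3
while n > 0:
    print(n)       # prints 3, then 2, then 1
    n -= 1

# for loop: iterates over a sequence
for fruit in ["apple", "banana", "cherry"]:
    print(fruit)

# nested loops: a for loop inside another for loop
pairs = []
for i in range(1, 3):
    for j in range(1, 3):
        pairs.append((i, j))
print(pairs)       # [(1, 1), (1, 2), (2, 1), (2, 2)]
```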
2.6. Functions
Built-in Functions or pre-defined functions:
These are functions that are already defined by Python, for example: id(),
type(), print(), etc.
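A few common built-in functions applied to a small list:

```python
x = [1, 2, 3]

print(type(x))   # <class 'list'>
print(len(x))    # 3
print(sum(x))    # 6
print(max(x))    # 3
print(id(x))     # the object's identity (an implementation-specific integer)
```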

2.7. Data Structure


Python has built-in support for data structures which enable you to store and
access data. These structures are called List, Dictionary, Tuple and Set.
2.8. Lists
Lists in Python are the most versatile data structure. They are used to store
heterogeneous data items, from integers to strings or even another list! They
are also mutable, which means that their elements can be changed even after
the list is created.
Creating Lists
Lists are created by enclosing elements within [square] brackets, with each
item separated by a comma. Since each element in a list has its own distinct
position, having duplicate values in a list is not a problem.
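Creating and modifying a list, with heterogeneous and duplicate items:

```python
# a list can hold heterogeneous items, including duplicates
items = [10, "data", 3.14, 10]
print(items[1])      # data (indexing by position)

# lists are mutable: elements can be changed after creation
items[0] = 99
items.append("new")
print(items)         # [99, 'data', 3.14, 10, 'new']
```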

2.9. Dictionaries
A dictionary is another Python data structure, used to store heterogeneous
objects as key-value pairs. Dictionary keys must be immutable (hashable)
values, and dictionaries were unordered before Python 3.7.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly }
brackets, with each key separated from its value by a colon, and each
key-value pair separated by a comma. Using the key of an item, we can easily
extract the associated value of the item.

Dictionaries are very useful for accessing items quickly because, unlike
lists and tuples, a dictionary does not have to iterate over all the items to
find a value. A dictionary uses the item's key to quickly find the item's
value. This concept is called hashing.
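A small dictionary example (the keys and values are invented for illustration):

```python
# keys and values are separated by a colon, pairs by commas
student = {"name": "Yogesh", "roll": 44, "branch": "AI"}

# the key gives direct (hashed) access to its value
print(student["name"])    # Yogesh

# dictionaries are mutable: pairs can be added or updated
student["year"] = 4
student["roll"] = 45
print(student["roll"])    # 45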
2.10. Understanding Standard Libraries in Python
Pandas
When it comes to data manipulation and analysis, nothing beats Pandas. It is
the most popular Python data analysis library, period. Pandas is a Python
library built especially for data manipulation and analysis tasks.
Pandas provides features like:
• Dataset joining and merging
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• DataFrame objects to manipulate data, and much more!
NumPy
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in
functions to support large multi-dimensional arrays and matrices. It also brings in
high-level mathematical functions to work with these arrays and matrices. NumPy
is an open-source library and has multiple contributors.
Matplotlib
Matplotlib is the most popular data visualization library in Python. It allows us to
generate and build plots of all kinds. This is my go-to library for exploring data
visually along with Seaborn.
2.11. Reading a CSV File in Python
A CSV (Comma Separated Values) file is a form of plain text document which uses
a particular format to organize tabular information. CSV file format is a bounded
text document that uses a comma to distinguish the values. Every row in the
document is a data log. Each log is composed of one or more fields, divided by
commas. It is the most popular file format for importing and exporting
spreadsheets and databases.

• Using csv.reader(): First, the CSV file is opened using the open() method
in 'r' mode (which specifies read mode), returning a file object. The file
is then read using the reader() method of the csv module, which returns a
reader object that iterates over the lines of the specified CSV document.
Note: The 'with' keyword is used along with the open() method, as it
simplifies exception handling and automatically closes the CSV file.

import csv

# opening the CSV file
with open('Giants.csv', mode='r') as file:
    # reading the CSV file
    csvFile = csv.reader(file)
    # displaying the contents of the CSV file
    for lines in csvFile:
        print(lines)
• Using pandas.read_csv(): It is very easy to read a CSV file using the
pandas library. Here the read_csv() method of pandas is used to read data
from the CSV file.

import pandas

# reading the CSV file
csvFile = pandas.read_csv('Giants.csv')

# displaying the contents of the CSV file
print(csvFile)

2.12. Data Frames and basic operations with Data Frames


A Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns).
In other words, data is aligned in a tabular fashion in rows and columns. A
Pandas DataFrame consists of three principal components: the data, the rows,
and the columns.
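A minimal sketch of creating a DataFrame and performing a few basic operations, assuming pandas is installed; the column names and values are invented for illustration:

```python
import pandas as pd

# building a DataFrame from a dictionary of columns
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "score": [85, 72, 91],
})

print(df.shape)             # (3, 2): rows, columns
print(df.head())            # the first rows of the table
print(df["score"].mean())   # average of the score column

# inserting a derived column
df["passed"] = df["score"] >= 75
print(df)
```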

2.13. Indexing Data Frame
The df.iloc indexer allows us to retrieve rows and columns by position. To do
that, we specify the positions of the rows we want and the positions of the
columns we want. df.iloc is very similar to df.loc, but it only uses integer
locations to make its selections.
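A short sketch of positional indexing with df.iloc, assuming pandas is installed (data invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [10, 20, 30],
    "b": [1.5, 2.5, 3.5],
})

# df.iloc selects purely by integer position
print(df.iloc[0])         # the first row, as a Series
print(df.iloc[0, 1])      # 1.5: row 0, column 1
print(df.iloc[1:3, 0:1])  # rows 1-2, column 0 only
```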

Module-3: Understanding the Statistics for Data
Science

3.1 Introduction to Statistics


Statistics simply means numerical data, and it is the field of mathematics that
deals with the collection, tabulation, and interpretation of numerical data. It
is a form of mathematical analysis that uses different quantitative models to
analyze experimental data or studies of real life. It is an area of applied
mathematics concerned with data collection, analysis, interpretation, and
presentation. Statistics deals with how data can be used to solve complex
problems. Some people consider statistics to be a distinct mathematical science
rather than a branch of mathematics. Statistics makes work easy and simple and
provides a clear and clean picture of the work you do on a regular basis.
Basic terminology of Statistics:
• Population –
A collection of individuals, objects, or events whose properties are to be
analyzed.
• Sample –
A subset of the population.
Types of Statistics :

3.2 Understanding the spread of data
A measure of variability, also known as a measure of dispersion, is used to
describe variability in a sample or population. In statistics, there are three
common measures of variability, as shown below:
(i) Range:
A measure of how spread apart the values in a data set are.
Range = Maximum value - Minimum value
(ii) Variance:
It describes how much a random variable differs from its expected value, and
it is computed as the mean of the squared deviations:
S^2 = (1/n) * sum_{i=1..n} (x_i - x̄)^2
In this formula, n represents the total number of data points, x̄ the mean of
the data points, and x_i the individual data points.
(iii) Standard deviation:
The square root of the variance; it measures the dispersion of a set of data
from its mean:
σ = sqrt( (1/n) * sum_{i=1..n} (x_i - μ)^2 )
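The three measures can be computed by hand in a few lines; the data set below is arbitrary:

```python
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
n = len(data)
mean = sum(data) / n          # 5.2

# range: maximum value minus minimum value
data_range = max(data) - min(data)

# variance: mean of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / n

# standard deviation: square root of the variance
std_dev = variance ** 0.5

print(data_range)  # 7
print(variance)    # 5.76
print(std_dev)     # 2.4
```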

3.3 Data Distribution


Terms related to Exploration of Data Distribution
-> Boxplot
-> Frequency Table
-> Histogram
-> Density Plot

3.4 Chi Squared Tests
The chi-square test is used for categorical features in a dataset. We calculate
the chi-square statistic between each feature and the target and select the
desired number of features with the best chi-square scores. It determines
whether the association between two categorical variables in the sample
reflects their real association in the population.
The chi-square score is given by:
χ² = Σ (O − E)² / E
where O is the observed frequency and E is the expected frequency in each
category.
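The chi-square statistic can be computed directly from observed and expected counts; the counts below are invented for illustration:

```python
# observed and expected counts for one categorical feature (illustrative)
observed = [50, 30, 20]
expected = [40, 40, 20]

# chi-square: sum of (observed - expected)^2 / expected over all categories
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)   # 5.0
```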

3.5 Understanding the concept of Correlation


Correlation –
1. It shows whether, and how strongly, pairs of variables are related to each
other.
2. Correlation takes values between -1 and +1, where values close to +1
represent strong positive correlation and values close to -1 represent
strong negative correlation.
3. A negative correlation means the variables are inversely related to each
other.
4. It gives the direction and strength of the relationship between variables.
Formula (Pearson correlation coefficient) –
r = Σ (x_i - x̄)(y_i - ȳ) / sqrt( Σ (x_i - x̄)² * Σ (y_i - ȳ)² )
Here,
x̄ and ȳ = means of the given sample sets
n = total number of samples
x_i and y_i = individual samples of the sets
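The correlation coefficient can be computed by hand from its definition; the two samples below are invented so that they are perfectly positively related:

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # y = 2x, a perfect positive relationship

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# numerator: sum of products of deviations from the means
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# denominator: square root of the product of the squared deviations
var_x = sum((xi - mean_x) ** 2 for xi in x)
var_y = sum((yi - mean_y) ** 2 for yi in y)

r = cov / (var_x * var_y) ** 0.5
print(r)   # 1.0
```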

Module-4: Predictive Modeling and Basics of
Machine Learning

4.1. Introduction to Predictive Modeling


Predictive analytics involves certain manipulations of data from existing data
sets with the goal of identifying new trends and patterns. These trends and
patterns are then used to predict future outcomes and trends. By performing
predictive analysis, we can predict future trends and performance. It is also
called prognostic analysis; the word prognostic means prediction. Predictive
analytics uses data, statistical algorithms, and machine learning techniques
to identify the probability of future outcomes based on historical data.

4.2. Understanding the types of Predictive Models


Supervised learning
Supervised learning, as the name indicates, involves the presence of a
supervisor acting as a teacher. Basically, supervised learning is learning in
which we teach or train the machine using data that is well labeled, meaning
the data is already tagged with the correct answer. After that, the machine is
provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (the set of training examples) and
produces a correct outcome from the labeled data.
Unsupervised learning
Unsupervised learning is the training of a machine using information that is
neither classified nor labeled, allowing the algorithm to act on that
information without guidance. Here the task of the machine is to group
unsorted information according to similarities, patterns, and differences
without any prior training on the data.

4.3. Stages of Predictive Models


Steps To Perform Predictive Analysis:
The following basic steps should be carried out in order to perform predictive analysis.
1. Define Problem Statement:
Define the project outcomes, the scope of the effort, and the objectives, and identify the data sets that are going to be used.
2. Data Collection:
Data collection involves gathering the necessary details required for the analysis. It involves obtaining historical or past data from an authorized source, over which the predictive analysis is to be performed.
3. Data Cleaning:
Data cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data, including redundant and duplicate records.
4. Data Analysis:
This involves the exploration of the data. We explore and analyze the data thoroughly in order to identify patterns or new outcomes in the data set. In this stage, we discover useful information and draw conclusions by identifying patterns or trends.
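Steps 3 and 4 above can be sketched with pandas on a small hypothetical data set (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical raw data with a duplicate row and a missing value
raw = pd.DataFrame({
    "age": [25, 25, 31, None],
    "income": [40000, 40000, 52000, 61000],
})

# Step 3 (Data Cleaning): drop duplicate rows, then rows with missing values
clean = raw.drop_duplicates().dropna()

# Step 4 (Data Analysis): a first look at patterns via summary statistics
print(clean.describe())
print(len(raw), "rows in,", len(clean), "rows after cleaning")
```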

23
4.4. Univariate Analysis for Continuous Variables
Continuous Variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using statistical metrics (such as the mean, median, standard deviation, and quartiles) and visualization methods (such as histograms and box plots).

Note: Univariate analysis is also used to highlight missing and outlier values. Methods to handle missing and outlier values are covered later.
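A minimal sketch of these central-tendency and spread metrics with NumPy, using a hypothetical variable that contains one outlier:

```python
import numpy as np

# Hypothetical continuous variable with one outlier (95)
values = np.array([12, 14, 15, 15, 16, 18, 95])

print("mean:  ", values.mean())      # central tendency, pulled up by the outlier
print("median:", np.median(values))  # robust central tendency
print("std:   ", values.std())       # spread
q1, q3 = np.percentile(values, [25, 75])
print("IQR:   ", q3 - q1)            # robust spread; 95 lies far outside it
```

The gap between mean and median is itself a quick univariate signal that an outlier is present.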

4.5. Univariate Analysis for Categorical Variables


For categorical variables, we use a frequency table to understand the distribution of each category. We can also read it as the percentage of values under each category. It can be measured using two metrics, Count and Count%, against each category. A bar chart can be used as the visualization.
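A frequency table with Count and Count% can be sketched with pandas (the category values here are hypothetical):

```python
import pandas as pd

# Hypothetical categorical variable
city = pd.Series(["Jaipur", "Delhi", "Jaipur", "Mumbai", "Jaipur", "Delhi"])

count = city.value_counts()                    # Count per category
pct = city.value_counts(normalize=True) * 100  # Count% per category

freq_table = pd.DataFrame({"Count": count, "Count%": pct.round(1)})
print(freq_table)
```

Calling `freq_table.plot.bar()` would give the bar-chart visualization mentioned above.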

4.6. Bivariate Analysis


Bivariate analysis finds the relationship between two variables. Here, we look for association and disassociation between the variables at a pre-defined significance level. We can perform bivariate analysis for any combination of categorical and continuous variables: Categorical & Categorical, Categorical & Continuous, and Continuous & Continuous. Different methods are used to tackle each combination during the analysis process.

Continuous & Continuous: While doing bivariate analysis between two continuous variables, we should look at a scatter plot. It is a handy way to find the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.

A scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. To find the strength of the relationship, we use correlation, which varies between -1 and +1.

• -1: perfect negative linear correlation


• +1: perfect positive linear correlation, and

• 0: No correlation

Correlation can be derived using following formula:

Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
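This identity can be checked numerically; the sketch below, using hypothetical values, compares the covariance-based formula with NumPy's built-in correlation:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Correlation = Covariance(X, Y) / SQRT(Var(X) * Var(Y))
cov = np.mean((x - x.mean()) * (y - y.mean()))
corr = cov / np.sqrt(x.var() * y.var())

# The identity agrees with NumPy's Pearson correlation
assert np.isclose(corr, np.corrcoef(x, y)[0, 1])
print(round(corr, 3))
```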



• Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other. If the probability of Z is small, then the difference between the two averages is more significant. The T-test is very similar to the Z-test, but it is used when the number of observations in either category is less than 30.
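A T-test of this kind can be sketched with SciPy (an assumed dependency, not named in the report), using two hypothetical groups whose true means differ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two hypothetical groups with genuinely different means;
# n = 25 < 30 per group, so a t-test (rather than a z-test) is appropriate
group_a = rng.normal(loc=50, scale=5, size=25)
group_b = rng.normal(loc=60, scale=5, size=25)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates the two averages differ significantly
```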

4.7. Basics of Model Building
Lifecycle of Model Building –
• Define success
• Explore data
• Condition data
• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain
Data exploration is used to get the gist of the data and to develop a first-pass assessment of its quality, quantity, and characteristics. Visualization techniques can also be applied, although this can be a difficult task in high-dimensional spaces with many input variables. In data conditioning, we group the functional data to which the modeling techniques are applied; rescaling is then done, and in some cases rescaling is an issue if variables are coupled. Variable selection is very important for developing a quality model. This process is implicitly model-dependent, since it is used to determine which combination of variables should be used in ongoing model development. Data balancing partitions the data into appropriate subsets for training, testing, and validation. Model building focuses on the desired algorithms; one well-known technique is symbolic regression, but other techniques can also be preferred.

Linear Regression
Linear regression is a supervised machine learning algorithm. Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Different regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables being used.
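A minimal linear regression sketch with scikit-learn, fitting hypothetical data generated from a known line (y = 3x + 2) plus noise, so the learned coefficients can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 3x + 2 with small Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 0.5, size=50)

# Fit the model; it should recover a slope near 3 and intercept near 2
model = LinearRegression().fit(X, y)
print(round(model.coef_[0], 1), round(model.intercept_, 1))
```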

4.8. Logistic Regression


Logistic regression is basically a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X.

Any change in a coefficient leads to a change in both the direction and the steepness of the logistic function: a positive slope results in an S-shaped curve, while a negative slope results in a Z-shaped curve.
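The S-shaped versus Z-shaped behaviour can be verified directly from the logistic (sigmoid) function; the slope values below are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-6, 6, 5)

s_curve = sigmoid(+1.0 * x)  # positive slope: rises left to right (S shape)
z_curve = sigmoid(-1.0 * x)  # negative slope: falls left to right (Z shape)

assert s_curve[0] < s_curve[-1]  # increasing curve
assert z_curve[0] > z_curve[-1]  # decreasing curve
```

A larger coefficient magnitude would make either curve steeper around its midpoint.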
4.9. Decision Trees
Decision Tree: A decision tree is one of the most powerful and popular tools for classification and prediction. It is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision Tree Representation:
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, and then moving down the tree branch corresponding to the value of the attribute. This process is then repeated for the subtree rooted at the new node.
Strengths and Weakness of Decision Tree approach
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
• Decision trees can be computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
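The strengths above — readable rules and cheap classification — can be seen in a short scikit-learn sketch on the standard Iris data set (an illustrative choice, not from the report):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the learned rules easy to read
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The fitted tree is an understandable set of if/else rules on the attributes
print(export_text(tree, feature_names=list(iris.feature_names)))
print("training accuracy:", tree.score(iris.data, iris.target))
```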

4.10. K-means
K-means clustering tries to group similar kinds of items into clusters. It finds the similarity between the items and groups them accordingly. The k-means clustering algorithm works in three steps:

1. Select the value of k.
2. Initialize the centroids.
3. Assign each point to its nearest centroid and recompute each centroid as the group average.

Let us understand these steps with the help of the figures, because a good picture is better than a thousand words.

We will understand each figure one by one.

• Figure 1 shows the representation of the data for two different items: the first item is shown in blue and the second in red. Here the value of k is chosen randomly as 2 (there are different methods by which the right value of k can be chosen).
• In figure 2, we join the two selected points and, to find the centroids, draw a perpendicular to that line. The points then move to their nearest centroid. Notice that some of the red points have now moved over to the blue side; these points now belong to the group of blue items.
• The same process continues in figure 3: we join the two points, draw the perpendicular, and find the centroids. The points again move to their centroids, and again some red points are converted to blue points.
• The same process happens in figure 4, and it continues until we get two completely separate clusters of these groups.

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is choosing the right value of k. If you choose k randomly, it might turn out right or wrong, and a wrong value directly affects your model's performance. There are two common methods for selecting the right value of k:

1. The Elbow Method.
2. The Silhouette Method.
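The Elbow Method can be sketched with scikit-learn: fit k-means for several values of k and watch the inertia (within-cluster sum of squares); the "elbow" where the curve flattens suggests the right k. The data below are hypothetical blobs with three true clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with 3 true clusters
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Elbow method: record inertia for each candidate k
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the sharp flattening ("elbow")
# should appear near the true number of clusters, here k = 3
print([round(i) for i in inertias])
```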
