Celebal Summer t-1
Celebal Technologies
“Submitted In Partial Fulfillment of the Requirement for the Award
of the Degree of the Bachelor of Technology in Computer Science Engineering
(Artificial Intelligence) to Rajasthan Technical University, Kota”
SUBMITTED BY
Yogesh Gaur (21AI044)
2024-25
Acknowledgment
It is a great pleasure and privilege for me to present this practical training report, carried out in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science Engineering (Artificial Intelligence). I am thankful to the department of our college for its kind co-operation and valuable suggestions.
I am very much thankful to Mr. Sharthak Acharjee, Senior Manager at Celebal Technologies, for his encouragement and inspiration at every step. This training would not have been possible without his support and able guidance; he was very supportive throughout.
Finally, I express earnest and sincere thanks to the whole of Celebal Technologies for the support extended during the training.
Yogesh Gaur
I hereby declare that the work being presented in this practical training report, submitted in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology to Rajasthan Technical University, Kota, is a record of my training carried out under guidance.
Yogesh Gaur
College ID: 21AI044
Celebal Technologies
Abstract
Data Science is an interdisciplinary field that combines statistics, mathematics, computer science, and domain expertise to extract insights from structured and unstructured data. It leverages advanced techniques such as machine learning, data visualization, and big data analytics to address complex real-world problems.
With the exponential growth of data, Data Science plays a pivotal role in decision-making by transforming raw data into actionable insights. It encompasses data collection, cleaning, analysis, and interpretation, fostering innovation and driving business value. Emerging technologies like AI, cloud computing, and IoT further enhance its scope, making it a cornerstone of modern problem-solving.
About Training
3.10. Hypothesis Testing
3.11. T-Tests
3.12. Chi-Squared Tests
3.13. Understanding the Concept of Correlation
Module-4: Predictive Modeling and Basics of Machine Learning
Module-1: Introduction to Data Science
Predictive modeling:
Predictive modeling is a form of artificial intelligence that uses data mining and
probability to forecast or estimate more granular, specific outcomes.
For example, predictive modeling could help identify customers who are likely to
purchase our new One AI software over the next 90 days.
Machine Learning:
Machine learning is a branch of artificial intelligence (AI) in which computers learn to act and adapt to new data without being explicitly programmed to do so. The computer is able to act independently of human interaction.
Forecasting:
Forecasting is the process of predicting or estimating future events based on past and present data, most commonly by analysis of trends. "Guessing" doesn't cut it. A forecast, unlike a prediction, must have logic to it. It must be defendable. This logic is what differentiates it from the magic 8 ball's lucky guess. After all, even a broken watch is right twice a day.
Applications of Data Science:
Data science and big data are making an undeniable impact on businesses, changing day-to-day operations, financial analytics, and especially interactions with customers. It's clear that businesses can gain enormous value from the insights data science can provide. But sometimes it's hard to see exactly how, so let's look at some examples.

In this era of big data, almost everyone generates masses of data every day, often without being aware of it. This digital trace reveals the patterns of our online lives. If you have ever searched for or bought a product on a site like Amazon, you'll notice that it starts making recommendations related to your search. This type of system, known as a recommendation engine, is a common application of data science. Companies like Amazon, Netflix, and Spotify use algorithms to make specific recommendations derived from customer preferences and historical behavior. Personal assistants like Siri on Apple devices use data science to devise answers to the infinite number of questions end users may ask. Google watches your every move in the world, your online shopping habits, and your social media, then analyzes that data to create recommendations for restaurants, bars, shops, and other attractions based on the data collected from your device and your current location. Wearable devices like Fitbits, Apple Watches, and Android watches add information about your activity levels, sleep patterns, and heart rate to the data you generate.

Now that we know how consumers generate data, let's take a look at how data science is impacting business. In 2011, McKinsey & Company said that data science was going to become the key basis of competition, supporting new waves of productivity, growth, and innovation. In 2013, UPS announced that it was using data from customers, drivers, and vehicles in a new route guidance system aimed at saving time, money, and fuel. Initiatives like this support the statement that data science will fundamentally change the way businesses compete and operate.

How does a firm gain a competitive advantage? Let's take Netflix as an example. Netflix collects and analyzes massive amounts of data from millions of users, including which shows people are watching at what time of day, when people pause, rewind, and fast-forward, and which shows, directors, and actors they search for. Netflix can be confident that a show will be a hit before filming even begins by analyzing users' preferences for certain directors and acting talent, and discovering which combinations people enjoy. Add this to the success of earlier versions of a show and you have a hit. For example, Netflix knew many of its users had streamed the work of David Fincher. They also knew that films featuring Robin Wright had always done well, and that the British version of House of Cards was very successful.
Module-2: Python for Data Science
2.2. Understanding Operators
a. Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like
addition, subtraction, multiplication and division.
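As a quick illustration (the values are chosen arbitrarily), the arithmetic operators behave as follows:

```python
# Arithmetic operators on two sample integers.
a, b = 13, 4
print(a + b)   # addition -> 17
print(a - b)   # subtraction -> 9
print(a * b)   # multiplication -> 52
print(a / b)   # true division -> 3.25
print(a // b)  # floor division -> 3
print(a % b)   # modulus -> 1
print(a ** b)  # exponentiation -> 28561
```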
b. Relational Operators:
Relational operators compare values, returning True or False according to the condition.
OPERATOR DESCRIPTION SYNTAX
> Greater than: True if the left operand is greater than the right x > y
< Less than: True if the left operand is less than the right x < y
== Equal to: True if both operands are equal x == y
!= Not equal to: True if the operands are not equal x != y
>= Greater than or equal to: True if the left operand is greater than or equal to the right x >= y
<= Less than or equal to: True if the left operand is less than or equal to the right x <= y
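A short sketch of these operators on two illustrative values:

```python
# Relational operators always evaluate to a Boolean.
x, y = 10, 20
print(x > y)    # False
print(x < y)    # True
print(x == y)   # False
print(x != y)   # True
print(x >= y)   # False
print(x <= y)   # True
```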
Logical operators:
Logical operators perform logical AND, logical OR, and logical NOT operations.
OPERATOR DESCRIPTION SYNTAX
and Logical AND: True if both operands are true x and y
or Logical OR: True if either operand is true x or y
not Logical NOT: True if the operand is false not x
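A small illustration with two Boolean values:

```python
# Logical operators on Booleans.
a, b = True, False
print(a and b)  # False: both operands must be true
print(a or b)   # True: one true operand is enough
print(not a)    # False: negation of True
```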
c. Bitwise operators:
Bitwise operators act on bits and perform bit-by-bit operations.
OPERATOR DESCRIPTION SYNTAX
& Bitwise AND x & y
| Bitwise OR x | y
^ Bitwise XOR x ^ y
~ Bitwise NOT ~x
>> Bitwise right shift x >> y
<< Bitwise left shift x << y
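A sketch on two arbitrary integers (10 is 0b1010 and 4 is 0b0100):

```python
# Bitwise operators act on the binary representations.
x, y = 10, 4
print(x & y)   # 0:  1010 & 0100 = 0000
print(x | y)   # 14: 1010 | 0100 = 1110
print(x ^ y)   # 14: 1010 ^ 0100 = 1110
print(~x)      # -11: bitwise NOT is -(x + 1) in two's complement
print(x << 1)  # 20: shift left by one bit
print(x >> 1)  # 5:  shift right by one bit
```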
d. Assignment operators:
Assignment operators are used to assign values to variables, optionally combined with an arithmetic or bitwise operation.
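A brief illustration, starting from an arbitrary value:

```python
# Augmented assignment applies an operation and stores the result back.
n = 5        # plain assignment
n += 3       # same as n = n + 3 -> 8
n -= 2       # -> 6
n *= 4       # -> 24
n //= 5      # -> 4
print(n)     # 4
```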
OPERATOR DESCRIPTION ASSOCIATIVITY
() Parentheses left-to-right
** Exponent right-to-left
* / % Multiplication/division/modulus left-to-right
+ - Addition/subtraction left-to-right
A. Sets:
A set holds a collection of values and is defined using curly braces. It keeps only one instance of any value present more than once. However, a set is unordered, so it doesn't support indexing. It is mutable: you can change its elements or add more, using the add() and remove() methods.
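A small sketch of this behaviour:

```python
# Duplicates collapse to a single instance; add()/remove() mutate in place.
s = {1, 2, 2, 3, 3, 3}
print(s == {1, 2, 3})  # True: only one instance of each value is kept
s.add(4)
s.remove(1)
print(sorted(s))       # [2, 3, 4] (sorted, because sets are unordered)
```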
B. Type Conversion:
Since Python is dynamically typed, you may want to convert a value from one type to another. Python provides built-in functions for this purpose:
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
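A few illustrative conversions:

```python
# Each conversion function builds a new value of the target type.
print(int("42"))       # 42
print(float(3))        # 3.0
print(bool(0))         # False (zero is falsy)
print(list("abc"))     # ['a', 'b', 'c']
print(tuple([1, 2]))   # (1, 2)
print(set([1, 1, 2]))  # {1, 2} -- duplicates dropped
print(str(2.5))        # 2.5 (as a string)
```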
2.4. Conditional Statements
a. If statements
The if statement is one of the most commonly used conditional statements in programming languages. It decides whether certain statements need to be executed or not. An if statement checks a given condition; if the condition is true, the set of code present inside the if block is executed.
The if condition evaluates a Boolean expression and executes the block of code only when the Boolean expression is TRUE.
Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if the condition is true
b. If-else statements
The statement itself tells that if a given condition is true, the statements inside the if block are executed, and if the condition is false, the else block is executed.
The else block executes only when the condition is false; this is where you perform some action when the condition is not true.
An if-else statement evaluates the Boolean expression and executes the code inside the if block if the condition is TRUE, and the code inside the else block if the condition is FALSE.
Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if condition is true
else:
    Block of code  # Set of statements to execute if condition is false
c. elif statements
In Python, we have one more conditional statement, the elif statement. An elif statement is used to check multiple conditions, and runs only if the preceding if condition is false. It is similar to an if-else statement; the only difference is that in else we do not check a condition, whereas in elif we do.
Elif statements are similar to if-else statements, but elif statements can evaluate multiple conditions.
Syntax:
if (condition):
    # Set of statements to execute if condition is true
elif (condition):
    # Set of statements to be executed when the if condition is false and the elif condition is true
else:
    # Set of statements to be executed when both if and elif conditions are false
Nested if-else Syntax:
if (condition):
    # Statements to execute if the outer condition is true
    if (condition):
        # Statements to execute if the inner condition is true
    else:
        # Statements to execute if the inner condition is false
else:
    # Statements to execute if the outer condition is false
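Putting if, elif, else, and nesting together in one small sketch (the grading thresholds are invented for illustration):

```python
# Assign a grade based on marks, with a nested check inside the elif branch.
marks = 72
if marks >= 90:
    grade = "A"
elif marks >= 60:
    if marks >= 75:        # nested if-else inside the elif block
        grade = "B"
    else:
        grade = "C"
else:
    grade = "F"
print(grade)  # C: 72 is at least 60 but below 75
```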
2.9. Dictionaries
A dictionary is another Python data structure, used to store heterogeneous objects as key-value pairs; it is mutable and is indexed by keys rather than by positions.
Generating Dictionary
Dictionaries are generated by writing keys and values within { curly } brackets, with each key separated from its value by a colon, and each key-value pair separated by a comma.
Using the key of an item, we can easily extract the associated value:
Dictionaries are very useful for accessing items quickly because, unlike lists and tuples, a dictionary does not have to iterate over all the items to find a value. A dictionary uses the item's key to quickly find the item's value. This concept is called hashing.
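A minimal sketch (the keys and values are invented for illustration):

```python
# Key-value pairs are written as key: value; lookup by key needs no iteration.
student = {"name": "Asha", "roll": 44, "branch": "AI"}
print(student["roll"])    # 44 -- direct lookup by key
student["year"] = 4       # dictionaries are mutable: add a new pair
print("year" in student)  # True
```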
2.10. Understanding Standard Libraries in Python
Pandas
When it comes to data manipulation and analysis, nothing beats Pandas. It is the most popular Python library for the purpose, period. Pandas is a Python library built especially for data manipulation and analysis tasks.
Pandas provides features like:
• Dataset joining and merging
• Data Structure column deletion and insertion
• Data filtration
• Reshaping datasets
• DataFrame objects to manipulate data, and much more!
NumPy
NumPy, like Pandas, is an incredibly popular Python library. NumPy brings in
functions to support large multi-dimensional arrays and matrices. It also brings in
high-level mathematical functions to work with these arrays and matrices. NumPy
is an open-source library and has multiple contributors.
Matplotlib
Matplotlib is the most popular data visualization library in Python. It allows us to
generate and build plots of all kinds. This is my go-to library for exploring data
visually along with Seaborn.
2.11. Reading a CSV File in Python
A CSV (Comma Separated Values) file is a form of plain-text document which uses a particular format to organize tabular information. The CSV file format is a delimited text format that uses a comma to separate values. Every row in the document is a data record, and each record is composed of one or more fields, divided by commas. It is the most popular file format for importing and exporting spreadsheets and databases.
• Using csv.reader(): First, the CSV file is opened using the open() method in ‘r’ mode (read mode), which returns a file object. It is then read using the reader() method of the csv module, which returns a reader object that iterates over the lines of the specified CSV document.
Note: The ‘with‘ keyword is used along with the open() method, as it simplifies exception handling and automatically closes the CSV file.
The csv module handles this directly; Pandas can also load a CSV straight into a DataFrame via pandas.read_csv().
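A minimal, self-contained sketch: it first writes a small CSV (the file name sample.csv and its contents are invented) so that the read step has something to open:

```python
import csv

# Write a tiny CSV so the example is self-contained.
with open("sample.csv", "w", newline="") as f:
    csv.writer(f).writerows([["name", "marks"], ["Asha", "91"], ["Ravi", "85"]])

# Read it back: 'with' closes the file automatically, and each row
# produced by csv.reader is a list of strings.
with open("sample.csv", "r") as f:
    for row in csv.reader(f):
        print(row)
```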
2.13. Indexing Data Frame
The df.iloc indexer allows us to retrieve rows and columns by position. In order to do that, we need to specify the positions of the rows and of the columns that we want. df.iloc is very similar to df.loc, but it only uses integer locations to make its selections.
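A short sketch assuming Pandas is available; the DataFrame contents are invented for illustration:

```python
import pandas as pd

# A small DataFrame to demonstrate position-based selection with df.iloc.
df = pd.DataFrame({"name": ["Asha", "Ravi", "Meena"],
                   "marks": [91, 85, 78]})
print(df.iloc[0])       # first row, returned as a Series
print(df.iloc[0:2, 1])  # first two rows of the second column
print(df.iloc[-1, 0])   # last row, first column
```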
Module-3: Understanding the Statistics for Data
Science
3.2 Understanding the spread of data
Measure of variability, also known as measure of dispersion, is used to describe the variability in a sample or population. In statistics, there are three common measures of variability, as shown below:
(i) Range:
It is a measure of how spread apart the values in a data set are.
Range = Maximum value - Minimum value
(ii) Variance:
It describes how much a random variable differs from its expected value, and is computed as the mean of the squared deviations:
S² = (1/n) ∑ᵢ₌₁ⁿ (xᵢ − x̄)²
In this formula, n represents the total number of data points, x̄ represents the mean of the data points, and xᵢ represents an individual data point.
(iii) Standard Deviation:
It is a measure of the dispersion of a set of data from its mean:
σ = √[ (1/n) ∑ᵢ₌₁ⁿ (xᵢ − μ)² ]
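These three measures can be computed from first principles on a small illustrative sample:

```python
# Range, population variance and standard deviation for invented data.
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]
n = len(data)
mean = sum(data) / n                                  # 5.2
rng = max(data) - min(data)                           # Range = max - min = 7
variance = sum((x - mean) ** 2 for x in data) / n     # 5.76
std_dev = variance ** 0.5                             # 2.4
print(rng, mean, round(variance, 2), round(std_dev, 2))
```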
3.4 Chi Squared Tests
Chi-square test is used for categorical features in a dataset. We calculate Chi-square
between each feature and the target and select the desired number of features with best
Chi-square scores. It determines if the association between two categorical variables of the
sample would reflect their real association in the population.
The Chi-square score is given by:
χ² = ∑ [ (Oᵢ − Eᵢ)² ÷ Eᵢ ]
Here,
Oᵢ = observed frequency of category i
Eᵢ = expected frequency of category i
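As an illustration, the statistic can be computed by hand for a hypothetical 2x2 contingency table, using χ² = ∑ (O − E)² / E with expected counts derived from the row and column totals:

```python
# Hand-computed chi-square statistic for an invented 2x2 contingency table.
observed = [[30, 10],   # e.g. group A: pass / fail
            [20, 40]]   # e.g. group B: pass / fail
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total  # E = (row * col) / N
        chi2 += (obs - expected) ** 2 / expected
print(round(chi2, 3))  # 16.667
```

A large χ² relative to the critical value for the table's degrees of freedom suggests the two categorical variables are associated.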
Module-4: Predictive Modeling and Basics of
Machine Learning
2. Data Collection:
It involves obtaining the historical or past data from an authorized source, over which predictive analysis is to be performed.
3. Data Cleaning:
Data cleaning is the process in which we refine our data sets. In the process of data cleaning, we remove unnecessary and erroneous data, including redundant and duplicate data.
4. Data Analysis:
It involves the exploration of data. We explore the data and analyze it thoroughly in order to identify patterns or new outcomes from the data set. In this stage, we discover useful information and conclude by identifying patterns or trends.
4.4. Univariate Analysis for Continuous Variables
Continuous variables: In the case of continuous variables, we need to understand the central tendency and spread of the variable. These are measured using various statistical metrics and visualization methods.
Note: Univariate analysis is also used to highlight missing and outlier values.
A scatter plot shows the relationship between two variables but does not indicate the strength of the relationship between them. To find the strength of the relationship, we use correlation. Correlation varies between -1 and +1:
• -1: perfect negative linear correlation
• +1: perfect positive linear correlation
• 0: No correlation
• Z-Test / T-Test: Either test assesses whether the means of two groups are statistically different from each other.
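As a sketch, the Pearson correlation coefficient can be computed from its definition on invented data:

```python
# Pearson correlation r = cov(x, y) / (std(x) * std(y)), from first principles.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # y is perfectly linear in x
mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
sx = sum((a - mx) ** 2 for a in x) ** 0.5
sy = sum((b - my) ** 2 for b in y) ** 0.5
r = cov / (sx * sy)
print(round(r, 4))  # 1.0 -> perfect positive correlation
```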
4.7. Basics of Model Building
Lifecycle of Model Building –
• Define success
• Explore data
• Condition data
• Select variables
• Balance data
• Build models
• Validate
• Deploy
• Maintain
Data exploration is used to get the gist of the data and to develop a first-pass assessment of its quality, quantity, and characteristics. Visualization techniques can also be applied, though this can be a difficult task in high-dimensional spaces with many input variables. In the conditioning of data, we group the functional data to which the modeling techniques are applied, after which rescaling is done; in some cases rescaling is an issue if variables are coupled. Variable selection is very important for developing a quality model. This process is implicitly model-dependent, since it is used to configure which combination of variables should be used in ongoing model development. Data balancing is partitioning the data into appropriate subsets for training, test, and validation. Model building focuses on the desired algorithms. The most famous technique is symbolic regression; other techniques can also be preferred.
Linear Regression
Linear regression is a machine learning algorithm based on supervised learning. Regression models a target prediction value based on independent variables. It is mostly used for finding out the relationship between variables and for forecasting. Different regression models differ based on the kind of relationship between the dependent and independent variables they consider, and on the number of independent variables being used.
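A minimal sketch: fitting y = b0 + b1*x by the closed-form least-squares formulas on a small made-up dataset:

```python
# Simple linear regression from first principles:
# b1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2), b0 = y_mean - b1*x_mean.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]                     # underlying relation: y = 1 + 2x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
     sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
print(b1, b0)                            # slope 2.0, intercept 1.0
print(b0 + b1 * 6)                       # forecast at x = 6 -> 13.0
```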
Any change in the coefficient leads to a change in both the direction and the steepness of the logistic function: positive slopes result in an S-shaped curve and negative slopes result in a Z-shaped curve.
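A small sketch of this behaviour, using the logistic function 1 / (1 + e^-(b0 + b1*x)) with invented coefficients:

```python
import math

# The sign of the slope b1 decides whether the curve rises (S shape)
# or falls (Z shape) as x increases.
def logistic(x, b0=0.0, b1=1.0):
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

print(logistic(0))                  # 0.5 at the midpoint
print(logistic(4) > 0.9)            # True: positive slope rises towards 1
print(logistic(4, b1=-1.0) < 0.1)   # True: negative slope falls towards 0
```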
4.9. Decision Trees
Decision Tree: The decision tree is one of the most powerful and popular tools for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
Decision Tree Representation :
Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute, as shown in the above figure. This process is then repeated for the subtree rooted at the new node.
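The traversal described above can be sketched with a toy, hand-built tree (the attributes, values, and class labels are all invented):

```python
# A tree encoded as nested dicts: internal nodes test an attribute,
# string leaves are class labels.
tree = {"attr": "outlook",
        "branches": {"sunny": {"attr": "humidity",
                               "branches": {"high": "no", "normal": "yes"}},
                     "rain": "yes"}}

def classify(instance, node):
    if isinstance(node, str):          # leaf node: return its class label
        return node
    value = instance[node["attr"]]     # test the attribute at this node
    return classify(instance, node["branches"][value])  # follow the branch

print(classify({"outlook": "sunny", "humidity": "normal"}, tree))  # yes
```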
Strengths and Weaknesses of the Decision Tree approach
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for
prediction or classification.
The weaknesses of decision tree methods are:
• Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
• Decision trees can be computationally expensive to train. At each node, each candidate splitting field must be sorted before its best split can be found. In some algorithms, combinations of fields are used and a search must be made for optimal combining weights. Pruning algorithms can also be expensive, since many candidate sub-trees must be formed and compared.
4.10. K-means
k-means clustering tries to group similar kinds of items in the form of clusters: it finds the similarity between items and groups them into clusters. The k-means clustering algorithm works in three steps. Let's see what these three steps are.
Let us understand the steps with the help of the figures, because a good picture is better than a thousand words.
• Figure 1 shows the representation of data for two different items; the first item is shown in blue and the second in red. Here I am choosing the value of k as 2. There are different methods by which we can choose the right k value.
• In figure 2, we join the two selected points. To find the centroids, we draw a perpendicular line to that line. The points then move to their nearest centroid. If you look closely, you will see that some of the red points have moved to the blue points; these points now belong to the group of blue items.
• The same process continues in figure 3: we join the two points, draw a perpendicular line, and find the centroids. The two points move to their centroids, and again some of the red points become blue points.
• The same process happens in figure 4. This process continues until we get two completely separated clusters of these groups.
One of the most challenging tasks in this clustering algorithm is to choose the right value of k. What should the right k value be? How do we choose it? If you choose the k value randomly, it might be correct or it might be wrong, and a wrong value will directly affect your model's performance. There are two methods by which you can select the right value of k:
1. Elbow Method.
2. Silhouette Method.
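The three steps above can be sketched in a minimal one-dimensional implementation (k = 2, invented data):

```python
import random

# A minimal 1-D k-means sketch: pick k initial centroids, assign each point
# to its nearest centroid, recompute centroids as cluster means, and repeat.
def kmeans(points, k=2, iters=10, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)          # step 1: initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: nearest-centroid assignment
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]  # step 3: recompute centroids
    return sorted(centroids)

print(kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5]))  # two well-separated groups
```

On this data the centroids settle near 1.0 and 9.0 regardless of the random initialization, because the two groups are well separated.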