Introduction To Data Science and Machine Learning


What is Data Science?
1. Data science is the science of extracting hidden patterns from large data sets.
2. Hidden patterns can appear in the form of trends, cycles, associations, rules,
groups, etc. in the data.
3. "Data sets" usually refers to large volumes of cleansed, structured data prepared
for analysis.
4. "Science" refers to the statistical tools and techniques employed to understand
the data and to establish the reliability of the identified patterns.
a. The part of statistics used to understand the data is called descriptive
statistics. Descriptive statistics give vital insights into the data in terms of
central values, spread and distribution shape.
b. The part of statistics used to establish the reliability of the potential
patterns identified is called inferential statistics.
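
As an illustration of descriptive statistics, the sketch below summarises a small sample by its central values and spread using Python's standard statistics module. The readings are made up for illustration and are not taken from any real data set.

```python
import statistics

# Hypothetical sample: blood sugar readings for seven people (made-up values).
sugar = [88, 92, 95, 101, 110, 92, 150]

mean = statistics.mean(sugar)        # central value
median = statistics.median(sugar)    # central value, less sensitive to the outlier 150
mode = statistics.mode(sugar)        # most frequent value
spread = statistics.stdev(sugar)     # sample standard deviation (spread)
value_range = max(sugar) - min(sugar)

# A mean well above the median hints at a right-skewed distribution shape.
print(f"mean={mean} median={median} mode={mode} stdev={spread:.1f} range={value_range}")
```

Comparing the mean with the median is a quick, informal check on distribution shape; establishing whether a pattern is reliable is the job of inferential statistics.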
What is Machine Learning?
1. Machine learning is an integral and critical part of data science. It refers to a
collection of algorithms used to extract hidden patterns from a dataset.
2. These algorithms use a learning process to identify the patterns in the dataset.
The patterns they learn from the data are called models.
3. Models can be expressed in the form of mathematical equations, rules,
probability ratios, etc.
4. Machine learning algorithms work on data prepared for analytics and express
the hidden patterns in the form of models.
5. For machine learning algorithms to identify reliable hidden patterns,
the input data must itself be reliable.
6. If the input data is not reliable, the models generated may be statistically unreliable.

When is machine learning useful?
1. We cannot express our knowledge about the patterns as a program, e.g.
character recognition or natural language processing
2. We do not have an algorithm to identify a pattern of interest, e.g. spam
mail detection
3. The problem is too complex and dynamic, e.g. weather forecasting
4. There is no prior experience or knowledge, e.g. a Mars rover
5. The patterns are hidden in very large volumes of data, e.g. a recommendation system

Machine Learning Applications (examples)
1. Fraud detection
2. Sentiment analysis
3. Credit risk management
4. Prediction of equipment failures
5. New pricing models / strategies
6. Network intrusion detection
7. Pattern and image recognition
8. Email spam filtering

Machine Learning Pre-requisites
1. A rich set of data representing the environment where the model is to be used
2. Knowledge and skills in
a. Mathematics and statistics (graduate level or more)
b. Programming in a language such as Python or R (considered the
two most popular languages for data science)
c. Domain knowledge
3. Data science is usually a team effort, where the team as a whole covers all the
required skills and knowledge
Real World as Mathematical Space
Machine learning happens in mathematical space / feature space:
1. A data set representing the real world is a collection of attributes that
define an entity
2. Each entity is represented as one record / line in the data set

Attributes / Dimensions
1. Each attribute becomes a dimension
2. Each record becomes a point in the space
[Figure: data points plotted against the Sugar and BP level axes; legend: heart healthy, potential heart ailments]
1. The position of a point in space is defined with respect to the origin
2. The position is decided by the values of the attributes for that point
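
To make the mapping from records to points concrete, here is a minimal sketch in which each attribute is one coordinate and a point's position is measured from the origin. The attribute names, the records and the helper functions are all hypothetical, chosen only to mirror the heart-health example.

```python
import math

# Hypothetical attributes: each one becomes a dimension of the feature space.
ATTRIBUTES = ("bp", "age", "sugar")

# Hypothetical records: each one becomes a point in that space.
records = [
    {"bp": 120, "age": 35, "sugar": 90},
    {"bp": 160, "age": 62, "sugar": 180},
]

def to_point(record):
    """Turn one record into a point: one coordinate per attribute."""
    return tuple(record[name] for name in ATTRIBUTES)

def distance_from_origin(point):
    """Euclidean distance of a point from the origin (0, 0, 0)."""
    return math.sqrt(sum(value ** 2 for value in point))

points = [to_point(record) for record in records]
for point in points:
    print(point, round(distance_from_origin(point), 1))
```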
3. A model represents the real-world process that generated the different sets
of data points
4. The model could be a simple plane, a more complex surface or a hyperplane
5. Multiple planes can do the job, each representing an alternate hypothesis
6. The learning algorithm selects the hypothesis that minimizes errors on the
data, with the test data used to check the choice
[Figure: candidate separating planes over the Sugar and BP level axes; legend: heart healthy, potential heart ailments, erroneous classification]
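
The idea of choosing among alternate hypotheses can be sketched as follows: several candidate planes are scored by how many labelled points they misclassify, and the one with the fewest errors is kept. The points, the candidate planes and the convention that a score above d means "potential heart ailment" are assumptions made for illustration; a real learning algorithm searches a far larger set of hypotheses.

```python
# Hypothetical labelled points: (bp, age, sugar) -> 1 = potential heart ailment, 0 = heart healthy.
labelled = [
    ((110, 30, 85), 0),
    ((120, 40, 95), 0),
    ((125, 45, 100), 0),
    ((150, 60, 160), 1),
    ((165, 65, 190), 1),
    ((170, 70, 200), 1),
]

# Candidate hypotheses: planes a*x + b*y + c*z = d, stored as (a, b, c, d).
candidates = [
    (1.0, 0.0, 0.0, 115.0),  # splits on BP only
    (0.0, 0.0, 1.0, 130.0),  # splits on sugar only
    (1.0, 1.0, 1.0, 450.0),  # uses all three attributes
]

def predict(plane, point):
    """Assumed convention: points with a*x + b*y + c*z above d are class 1."""
    a, b, c, d = plane
    x, y, z = point
    return 1 if a * x + b * y + c * z > d else 0

def count_errors(plane, data):
    """Number of points the plane puts on the wrong side."""
    return sum(1 for point, label in data if predict(plane, point) != label)

# Keep the hypothesis with the fewest erroneous classifications.
best = min(candidates, key=lambda plane: count_errors(plane, labelled))
print(best, count_errors(best, labelled))
```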
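7. In the figure, since the separator is a plane, the model is the equation
representing the plane:
ax + by + cz = d
8. x, y and z represent the three dimensions, i.e. BP, age and sugar, while d
represents the color, i.e. healthy or ailing heart. In practice, the class is
read off from the side of the plane a point falls on, that is, whether
ax + by + cz for that point is greater or less than the plane's constant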
9. A new data point enters the system
10. Its x, y and z values are fed into the model to get the value of d
(healthy or ailing)
11. The data point is placed above or below the plane based on d
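
Feeding a new data point into the model can be sketched as below. The coefficients A, B, C and D are hypothetical stand-ins for values a learning step would produce, and treating "above the plane" as "potential heart ailment" is likewise an assumption.

```python
# Hypothetical plane coefficients; in practice they come out of the learning step.
A, B, C, D = 0.5, 0.3, 0.4, 150.0

def classify(bp, age, sugar):
    """Place a new point above or below the plane and return the matching class."""
    score = A * bp + B * age + C * sugar
    # Assumed convention: above the plane means a potential heart ailment.
    return "potential heart ailment" if score > D else "heart healthy"

new_patient = classify(bp=165, age=65, sugar=190)
print(new_patient)
```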
12. Whether the new data point is correctly placed (above or below the plane),
i.e. correctly classified as an ailing or a healthy heart, will be known only
after direct observation
13. Only a direct test on the object of interest will tell whether the
classification is correct or not
14. If the majority of new data points are correctly classified, the model is
good; otherwise it is not
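
One way to express the "majority correctly classified" check is to compare the model's predictions with what direct observation later showed. The lists below are made up, and the 0.5 cut-off simply encodes "majority"; real projects usually demand a much higher bar and use other measures as well.

```python
def accuracy(predicted, observed):
    """Fraction of data points whose predicted class matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return correct / len(predicted)

# Hypothetical model output versus the result of direct observation.
predicted = ["healthy", "ailing", "ailing", "healthy", "healthy"]
observed = ["healthy", "ailing", "healthy", "healthy", "healthy"]

score = accuracy(predicted, observed)
model_is_good = score > 0.5  # "majority correctly classified"
print(score, model_is_good)
```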
Machine Learning Categories
Machine learning algorithms are broadly grouped into three categories:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Machine Learning:
1. A class of algorithms that work in two stages. The first stage is called training and
the second is usually called testing. Sometimes a validation stage is also involved,
followed by testing
2. At each stage the algorithm takes input data prepared for that stage, i.e. training
data for the training stage, test data for the test stage and validation data for the
validation stage
3. During training, the machine learning algorithm gets the training data in the form of
independent and dependent variables
4. In the process of learning, the algorithm learns the relationship between the
dependent and the independent variables
5. This relationship is expressed as a model, which can take the form of an equation,
probability ratios, hidden rules, etc.
6. Supervised machine learning can be further classified into regression (when
predicting numeric values) and classification (when predicting classes / labels)
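
The separation of data by stage can be sketched as a simple shuffle-and-slice. The helper name split_rows, the 25% test fraction and the sample rows are assumptions for illustration; libraries such as scikit-learn offer their own utilities for the same job.

```python
import random

def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction of them for testing."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (training data, test data)

# Hypothetical rows: (independent variable, dependent variable).
rows = [(1.0, 20.0), (1.1, 19.0), (1.2, 18.0), (1.4, 16.0),
        (1.5, 15.0), (1.7, 13.0), (1.8, 12.0), (2.0, 10.0)]

train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

The model is fitted on the training rows only; the test rows are kept aside so that its predictions can later be checked on data it has not seen.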
Examples of Supervised Machine Learning:
1. Regression - Predicting the mileage of a car given other features such as weight,
engine capacity, horsepower, transmission type, number of cylinders, etc.
a. In this example, mileage is the dependent variable and weight, engine
capacity, horsepower, transmission type and number of cylinders are the
independent variables
b. Mileage = f(weight, engine capacity, horsepower, transmission type, number
of cylinders)
2. Classification - Categorizing a mail as spam or ham
a. In this example, the email category (spam or ham) is the target variable and
the occurrences of certain words and their frequencies are the independent variables
b. P(ham) = f(words, frequencies), where P stands for probability. 1 - P(ham)
gives P(spam), assuming only two categories, ham and spam
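
As a minimal sketch of the regression example, the code below fits Mileage = f(weight) with a single independent variable using the ordinary least-squares formulas. The weights and mileages are made up (and chosen to lie exactly on a line so the result is easy to check); a realistic model would use all the listed features and noisy data.

```python
# Hypothetical training data: car weight (tonnes) and mileage (km per litre).
weights = [1.0, 1.2, 1.5, 1.8, 2.0]
mileage = [20.0, 18.0, 15.0, 12.0, 10.0]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The learned model is the pair (slope, intercept), i.e. an equation.
slope, intercept = fit_line(weights, mileage)

def predict_mileage(weight):
    """Apply the learned equation to a new value of the independent variable."""
    return intercept + slope * weight

print(round(slope, 2), round(intercept, 2), round(predict_mileage(1.6), 2))
```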
Unsupervised Machine Learning:
1. A class of algorithms that work in a single stage. Unlike supervised learning
algorithms, they do not have separate training, testing or validation stages
2. Unsupervised learning algorithms take the data as a whole, not in the form of
independent and dependent variables
3. The algorithms are not used to find any relationship between dependent and
independent variables
4. This class of algorithms usually finds patterns in the form of clusters and associations
reflecting some kind of commonality or togetherness among the data points in the
given data set
5. It is the responsibility of the data scientist to analyse the identified
clusters / associations and give meaning to them
6. Clustering and PCA (Principal Component Analysis), a mathematical technique
used to transform given data into a more useful form, belong to this category of
machine learning
Examples of Unsupervised Machine Learning:
1. Clustering - Identifying groups in the given data set, where a group represents
some kind of commonality among the data points. "Birds of a feather flock
together".
2. Clustering can be further categorized into:
a. Flat clustering, e.g. K-means clustering: the clusters identified are disjoint and
non-overlapping, e.g. segmenting customers into different groups based
on their purchase amount, frequency of purchase and types of items purchased
b. Hierarchical clustering: clustering is done at multiple levels, giving clusters
inside clusters, i.e. sub-groups inside a given group, e.g. identifying
sub-clusters within the cash-cow customers found by K-means
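
The flat-clustering idea can be sketched with a bare-bones K-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. The customer figures, the choice of two clusters and the starting centroids are assumptions for illustration; production code would normally rely on a library implementation and scale the attributes first.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def kmeans(points, centroids, iterations=10):
    """Plain K-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it in place if the cluster is empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customers: (purchase amount, purchases per month).
customers = [(100, 1), (120, 2), (110, 1), (900, 10), (950, 12), (1000, 11)]

centroids, clusters = kmeans(customers, centroids=[customers[0], customers[-1]])
print(centroids)
```

The clusters come back without labels; deciding that one group looks like occasional buyers and the other like high-value customers is left to the analyst.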
Reinforcement Learning:
1. A reinforcement learning algorithm learns through trial and error, using the feedback it
receives from the environment in which it learns.
2. During the initial stages of learning it is likely to commit many errors in learning
the patterns; however, through a process of reward and punishment, it learns to
identify the patterns correctly.
3. Self-driving cars are an example of reinforcement learning
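
Learning by reward and punishment can be sketched with a tiny epsilon-greedy loop: the learner mostly repeats the action it currently values most, occasionally tries another at random, and updates its estimate from the feedback. The two actions, their rewards and the parameter values are hypothetical and far simpler than anything a self-driving car faces.

```python
import random

# Hypothetical environment: feedback for each action (+1 reward, -1 punishment).
REWARDS = {"brake_late": -1.0, "brake_early": 1.0}

def learn(steps=100, epsilon=0.1, seed=0):
    """Estimate the value of each action by trial and error."""
    rng = random.Random(seed)
    actions = list(REWARDS)
    value = {action: 0.0 for action in actions}   # current estimate per action
    count = {action: 0 for action in actions}     # how often each was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(actions)              # explore: try something at random
        else:
            action = max(actions, key=value.get)      # exploit: use the best estimate so far
        reward = REWARDS[action]                      # feedback from the environment
        count[action] += 1
        # Incremental average of the rewards seen for this action.
        value[action] += (reward - value[action]) / count[action]
    return value, count

value, count = learn()
best_action = max(value, key=value.get)
print(best_action, count)
```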

Thank You
