Introduction To Data Science and Machine Learning




What is Data Science?

1. Data Science is the science of extracting hidden patterns from large data sets

2. Hidden patterns can appear in the form of trends, cycles, associations, rules,
groups etc. in the data

3. Data sets usually refer to a large volume of cleansed, structured data prepared
for the analysis

4. Science refers to the statistical tools and techniques employed to understand
the data and the reliability of the identified patterns

a. The part of statistics used to understand the data is called descriptive
statistics. Descriptive statistics give vital insights into the data in terms of
central values, spread and distribution shape of the data
b. The part of statistics used to establish the reliability of the potential
patterns identified is called inferential statistics
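As a rough illustration of the descriptive statistics mentioned above, the sketch below computes central values, spread and a crude shape hint using Python's standard `statistics` module. The readings are made-up values chosen only for the example, and the standard error at the end is just one simple first step toward inferential statistics, not a full test.

```python
import math
import statistics

# Hypothetical sample: systolic blood pressure readings (made-up values).
readings = [118, 122, 125, 130, 135, 128, 121, 140, 119, 132]

# Central values
mean = statistics.mean(readings)
median = statistics.median(readings)

# Spread
stdev = statistics.stdev(readings)            # sample standard deviation
value_range = max(readings) - min(readings)

# Distribution shape: a mean above the median hints at a right-skewed sample
shape_hint = "right-skewed" if mean > median else "left-skewed or symmetric"

# A first inferential quantity: how uncertain the sample mean is
std_error = stdev / math.sqrt(len(readings))

print(f"mean={mean:.1f} median={median} stdev={stdev:.2f} range={value_range}")
print(f"shape hint: {shape_hint}, standard error of the mean: {std_error:.2f}")
```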
What is Machine Learning?

1. Machine Learning is an integral and critical part of data science. It refers to a
collection of algorithms which are used to extract the hidden patterns from the
dataset

2. These algorithms use a learning process through which they identify the patterns
in the dataset. The patterns they learn from the data are called models

3. The models could be expressed in the form of mathematical equations, rules,
probability ratios etc.

4. Machine learning algorithms work on the data prepared for analytics to express
the hidden patterns in the form of models

5. For machine learning algorithms to successfully identify reliable hidden patterns,
the input data should be reliable

6. If the input data is not reliable, the models generated may be statistically unreliable
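To make the idea of "a model expressed as a rule" concrete, here is a minimal sketch of a learning process that picks a single threshold from labelled data. The data values, the function names `learn_threshold` and `model`, and the choice of a one-threshold rule are all assumptions made for illustration; real algorithms are considerably richer.

```python
# Hypothetical training data: (blood sugar level, label); values are made up.
training = [(85, "healthy"), (90, "healthy"), (100, "healthy"),
            (130, "ailing"), (145, "ailing"), (160, "ailing")]

def learn_threshold(samples):
    """Try each midpoint between sorted values; keep the one with fewest errors."""
    values = sorted(v for v, _ in samples)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        # Count samples where the rule "above threshold means ailing" is wrong
        return sum((v > threshold) != (label == "ailing") for v, label in samples)

    return min(candidates, key=errors)

# The learned pattern (the model) is just this one number plus the rule below.
threshold = learn_threshold(training)

def model(sugar):
    return "ailing" if sugar > threshold else "healthy"

print("learned rule: ailing if sugar >", threshold)
```

If the labels in `training` were noisy or wrong, the same procedure would still return a threshold, but the resulting rule would be unreliable, which is the point of items 5 and 6.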


When is machine learning useful?

1. We cannot express our knowledge about the patterns as a program, e.g.
character recognition or natural language processing

2. We do not have an algorithm to identify a pattern of interest, e.g. spam
mail detection

3. The problem is too complex and dynamic, e.g. weather forecasting

4. There is no prior experience or knowledge, e.g. a Mars rover

5. The patterns are hidden in very large volumes of data, e.g. recommendation systems


Machine Learning Applications (examples)

1. Fraud detection

2. Sentiment analysis

3. Credit risk management

4. Prediction of equipment failures

5. New pricing models / strategies

6. Network intrusion detection

7. Pattern and image recognition

8. Email spam filtering


Machine Learning Pre-requisites

1. Rich set of data representing the environment where the model is to be used

2. Knowledge and skills in

a. Mathematics and statistics (graduate level or more)
b. Programming in a language such as Python or R (considered the
two most popular languages for data science)
c. Domain knowledge

3. Usually data science is a team effort where the team as a whole covers all the
required skills and knowledge
Real World as Mathematical Space
Machine learning happens in mathematical space / feature space:

1. A data set representing the real world is a collection of attributes that
define an entity

2. Each entity is represented as one record / line in the data set


Attributes / Dimensions

1. Each attribute becomes a dimension

2. Each record becomes a point in the space

[Figure: data points plotted on Sugar and BP level axes; legend: Heart healthy, Potential heart ailments]
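The mapping from records to points can be sketched in a few lines. The attribute names follow the figure's labels, while the record values and the helper `to_point` are hypothetical, introduced only for this example.

```python
import math

# Hypothetical records; attribute names follow the figure, values are made up.
records = [
    {"bp": 120, "sugar": 90, "age": 35},
    {"bp": 150, "sugar": 160, "age": 62},
]

attributes = ["bp", "sugar", "age"]   # one dimension per attribute

def to_point(record):
    """Map a record to its coordinates in feature space."""
    return tuple(record[name] for name in attributes)

points = [to_point(r) for r in records]

# Once records are points, geometric notions such as distance apply.
distance = math.dist(points[0], points[1])

print(points)
print(f"distance between the two records: {distance:.2f}")
```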

1. Position of a point in space is defined with respect to the origin

2. The position is decided by the values of the attributes for a point


3. A model represents the real world process that generated the different sets
of data points

4. The model could be a simple plane, a more complex (curved) surface, or a
hyperplane

5. But multiple planes can do the job, each representing an alternate hypothesis

6. The learning algorithm selects the hypothesis which minimizes errors on the
data it learns from; the test data then shows how well that choice holds up

[Figure: the same plot with separating planes; axes: Sugar, BP level; legend: Heart healthy, Potential heart ailments, Erroneous classification]

7. In the figure, since the separator is a plane, the model will be the equation
representing the plane

ax + by + cz = d

8. x, y, z represent the three dimensions, i.e. BP, Age, Sugar, while the value of
ax + by + cz relative to d determines the color, i.e. healthy or ailing heart

9. A new data point enters the system

10. Its x, y and z values will be fed into the model: ax + by + cz is computed
and compared with d to get the class (healthy or ailing)

11. The data point will be placed above or below the plane based on that
comparison

12. Whether the new data point is correctly placed (above or below the plane),
i.e. correctly classified as an ailing or healthy heart, will be known only after
direct observation

13. Only a direct test on the object of interest will tell whether the
classification is correct or not

14. If the majority of new data points are correctly classified, the model is
good, else not
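The plane-based classification walked through above can be sketched as follows. The coefficients a, b, c, d here are made-up numbers, not fitted values, and the "observed" labels stand in for the direct observation the slides describe; the names `classify` and `observed` are assumptions for this example.

```python
# Hypothetical plane coefficients a, b, c and offset d (made up, not fitted).
A, B, C, D = 0.5, 0.2, 0.3, 120.0

def classify(bp, age, sugar):
    """Label a point by which side of the plane a*x + b*y + c*z = d it falls on."""
    score = A * bp + B * age + C * sugar
    return "ailing" if score > D else "healthy"

# New data points (bp, age, sugar) with labels found later by direct observation.
observed = [
    ((120, 35, 90), "healthy"),
    ((160, 62, 170), "ailing"),
    ((150, 58, 150), "ailing"),
    ((125, 70, 140), "ailing"),   # falls just below the plane, so it is misclassified
]

correct = sum(classify(*point) == label for point, label in observed)
accuracy = correct / len(observed)

print(f"{correct} of {len(observed)} new points classified correctly ({accuracy:.0%})")
```

Whether such an accuracy counts as "good" depends on the application; the slides' rule of thumb is simply that the majority of new points should be classified correctly.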
Machine Learning Categories

There are broadly three categories into which the machine learning algorithms
are grouped

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning
Supervised Machine Learning:

1. Class of algorithms which work in two stages. The first stage is called training and
the second one is usually called testing. Sometimes it also involves a validation stage,
followed by testing

2. At each stage it takes input data prepared for that stage, i.e. training data for
the training stage, validation data for the validation stage, test data for the test stage

3. During training, the machine learning algorithm gets the training data in the form of
independent and dependent variables

4. In the process of learning, the algorithm learns the relationship between the
dependent and the independent variables

5. This relationship is expressed as a model which can take the form of an equation,
probability ratios, hidden rules etc.

6. Supervised Machine Learning can be further classified into regression (when
predicting numeric values) or classification (when predicting classes / labels)
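The two-stage flow described above (train, then test) can be sketched with a one-variable regression. The weight and mileage figures are invented, the split rule is arbitrary, and ordinary least squares is just one possible learning algorithm chosen here for brevity.

```python
# Hypothetical (weight in kg, mileage in km/l) pairs; all values are made up.
data = [(900, 22.0), (1000, 20.5), (1100, 19.0), (1200, 17.8),
        (1300, 16.1), (1400, 15.0), (1500, 13.4), (1600, 12.1)]

# Stage split: every fourth record is held back for testing.
train = [row for i, row in enumerate(data) if i % 4 != 3]
test = [row for i, row in enumerate(data) if i % 4 == 3]

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Training stage: learn the relationship between weight and mileage.
slope, intercept = fit_line(train)

def predict(weight):
    return slope * weight + intercept

# Testing stage: mean absolute error on records the model has not seen.
mae = sum(abs(predict(x) - y) for x, y in test) / len(test)

print(f"mileage = {slope:.4f} * weight + {intercept:.2f}, test MAE = {mae:.2f}")
```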
Examples of Supervised Machine Learning:

1. Regression - Predicting mileage of a car given the other features such as weight,
engine capacity, horsepower, transmission type, number of cylinders etc.

a. In this example, mileage is the dependent variable and weight, engine
capacity, horsepower, transmission type, number of cylinders are independent
variables

b. Mileage = f(weight, engine capacity, horsepower, transmission type, number
of cylinders)

2. Classification - Categorizing a mail as spam or ham

a. In this example, the email category (spam or ham) is the target variable and
the occurrences of certain words and their frequencies are independent variables

b. P(ham) = f(words, frequencies) where P stands for probability. 1 - P(ham)
will give P(spam), assuming only two categories, ham and spam
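One possible form for f in the spam example is a small Naive Bayes classifier over word counts. The slides do not name a specific algorithm, so this choice is an assumption, as are the training mails and the helper names `log_score` and `p_ham`.

```python
import math
from collections import Counter

# Hypothetical labelled mails (made-up text).
training = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project meeting notes", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
mail_counts = Counter()
for text, label in training:
    mail_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])

def log_score(text, label):
    """log P(label) plus the sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(mail_counts[label] / len(training))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
    return score

def p_ham(text):
    """Normalise the two class scores so that P(ham) + P(spam) = 1."""
    ham = math.exp(log_score(text, "ham"))
    spam = math.exp(log_score(text, "spam"))
    return ham / (ham + spam)

for mail in ["win prize", "monday meeting"]:
    print(f"{mail!r}: P(ham)={p_ham(mail):.3f}, P(spam)={1 - p_ham(mail):.3f}")
```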
Unsupervised Machine Learning:
1. Class of algorithms which work in a single stage. Unlike supervised learning
algorithms, they do not have separate training, testing or validation stages

2. Unsupervised learning algorithms take the data as a whole, not in the form of
independent and dependent variables

3. The algorithms are not used to find any relationship between dependent and
independent variables

4. This class of algorithms usually finds patterns in the form of clusters and associations
reflecting some kind of commonality or togetherness among the data points in the
given data sets

5. It is the responsibility of the data scientist to analyse the identified
clusters/associations and give meaning to them

6. Clustering and PCA (Principal Component Analysis), a mathematical technique
used to transform given data into a more useful form, belong to this category of
machine learning
Examples of Unsupervised Machine Learning:

1. Clustering - Identifying groups in the given data set where a group represents
some kind of commonality among the data points. “Birds of a feather flock
together”.

2. Clustering can further be categorized into -

a. Flat clustering, e.g. K-means clustering - The clusters identified are disjoint and
non-overlapping, e.g. segmenting customers into different groups based
on their purchase amount, frequency of purchase and types of items purchased

b. Hierarchical clustering - Clustering is done at multiple levels, giving clusters
inside clusters, i.e. sub-groups inside a given group, e.g. identifying
sub-clusters within the cash-cow customer segment found by K-means
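A bare-bones version of flat clustering with K-means might look like the sketch below. The customer values, the fixed starting centroids and the iteration count are all assumptions made to keep the example small and repeatable.

```python
import math

# Hypothetical customers: (average purchase amount, purchases per month); made up.
customers = [(20, 1), (25, 2), (22, 1), (200, 8), (210, 9), (190, 7)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Assumption: the first and last customers serve as starting centroids.
centroids, clusters = kmeans(customers, [customers[0], customers[-1]])

for centre, members in zip(centroids, clusters):
    print(f"centroid {centre}: {members}")
```

Each customer ends up in exactly one cluster, which is the disjoint, non-overlapping property noted above. In practice the attributes are usually rescaled first, since K-means distances are dominated by whichever attribute has the largest numeric range.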
Reinforcement Learning:

1. A reinforcement learning algorithm learns through trial and error and the feedback it
receives from the environment in which it learns.

2. During the initial stages of learning, it is likely to commit many errors in learning
the patterns. However, through a process of reward and punishment, it learns to
identify the patterns correctly.

3. Self-driving cars are a commonly cited application of reinforcement learning
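The reward-and-punishment loop can be illustrated with a toy example. This is only a sketch: the single "red light" situation, the three actions, the reward values and the epsilon-greedy strategy are all assumptions chosen for brevity, and a real driving system is far more involved.

```python
import random

ACTIONS = ["accelerate", "brake", "swerve"]

def environment(action):
    """Toy environment: at a red light, braking is rewarded, anything else punished."""
    return 1.0 if action == "brake" else -1.0

rng = random.Random(0)            # seeded so the run is repeatable
estimates = {a: 0.0 for a in ACTIONS}
tries = {a: 0 for a in ACTIONS}
EPSILON = 0.1                     # fraction of steps spent exploring at random

for _ in range(200):
    if rng.random() < EPSILON:
        action = rng.choice(ACTIONS)                  # explore (trial and error)
    else:
        action = max(ACTIONS, key=estimates.get)      # exploit the best estimate so far
    reward = environment(action)
    tries[action] += 1
    # Running average of the rewards seen for this action
    estimates[action] += (reward - estimates[action]) / tries[action]

best_action = max(ACTIONS, key=estimates.get)
print("learned action at a red light:", best_action)
print("times each action was tried:", tries)
```

The early steps include punished actions; as feedback accumulates, the estimates steer the learner toward the rewarded one.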


Thank You
