Data Science Activity

[[ Data Science ]]
[ IT - 8003]
Enrollment No.-:0905IT161041
Department -:Information Technology
Batch -: 2016-2020
Submitted By :- Submitted To :-
Prateek Bharadwaj Mr. Shirish M .Dubey
(Assistant Professor)
Q.1 What is the difference between supervised learning and unsupervised
learning? Give concrete Examples
Supervised learning: Supervised learning is the learning of the model where with input variable
( say, x) and an output variable (say, Y) and an algorithm to map the input to the output.
That is, Y = f(X)
Why supervised learning?
The basic aim is to approximate the mapping function(mentioned above) so well that when there
is a new input data (x) then the corresponding output variable can be predicted.
It is called supervised learning because the process of an learning(from the training dataset) can
be thought of as a teacher who is supervising the entire learning process. Thus, the “learning
algorithm” iteratively makes predictions on the training data and is corrected by the “teacher”,
and the learning stops when the algorithm achieves an acceptable level of performance(or the
desired accuracy).
Example of Supervised Learning

Suppose there is a basket which is filled with some fresh fruits, the task is to arrange the same
type of fruits at one place.
Also, suppose that the fruits are apple, banana, cherry, grape.
Suppose one already knows from their previous work (or experience) that, the shape of each and
every fruit present in the basket so, it is easy for them to arrange the same type of fruits in one
place.
Here, the previous work is called as training data in Data Mining terminology. So, it learns the
things from the training data. This is because it has a response variable which says y that if some
fruit has so and so features then it is grape, and similarly for each and every fruit.
This type of information is deciphered from the data that is used to train the model.
This type of learning is called Supervised Learning.
Such problems are listed under classical Classification Tasks.

Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present
and no corresponding output variable is there.
Why Unsupervised Learning?
The main aim of Unsupervised learning is to model the distribution in the data in order to learn
more about the data.
It is called so, because there is no correct answer and there is no such teacher(unlike supervised
learning). Algorithms are left to their own devises to discover and present the interesting
structure in the data.
Example of Unsupervised Learning
Again, Suppose there is a basket and it is filled with some fresh fruits. The task is to arrange the
same type of fruits at one place.
This time there is no information about those fruits beforehand, its the first time that the fruits are
being seen or discovered
Q.2 What is logistic regression? State an example when you have used logistic
regression recently?
Logistic regression is a predictive modelling algorithm that is used when the Y variable is
binary categorical. That is, it can take only two values like 1 or 0. The goal is to
determine a mathematical equation that can be used to predict the probability of event 1.
Once the equation is established, it can be used to predict the Y when only the X�s are
known.
. Introduction to Logistic Regression

Earlier you saw what is linear regression and how to use it to predict continuous Y
variables.
In linear regression the Y variable is always a continuous variable. If suppose, the Y

variable was categorical, you cannot use linear regression model it.
So what would you do when the Y is a categorical variable with 2 classes?
Logistic regression can be used to model and solve such problems, also called as binary
classification problems.
A key point to note here is that Y can have 2 classes only and not more than that. If Y has
more than 2 classes, it would become a multi class classification and you can no longer
use the vanilla logistic regression for that.
Yet, Logistic regression is a classic predictive modelling technique and still remains a
popular choice for modelling binary categorical variables.
Another advantage of logistic regression is that it computes a prediction probability score

of an event. More on that when you actually start building the models.
2. Some real world examples of binary classification problems
You might wonder what kind of problems you can use logistic regression for.
Here are some examples of binary classification problems:
 Spam Detection : Predicting if an email is Spam or not

 Credit Card Fraud : Predicting if a given credit card transaction is fraud or not
 Health : Predicting if a given mass of tissue is benign or malignant
 Marketing : Predicting if a given user will buy an insurance product or not
 Banking : Predicting if a customer will default on a loan.
Q.3 Describe Big Data Visualization Technique in Detail ?
Visualization is the first step to make sense of data. To transcript and present data and data
correlations in a simple way, data analysts use a wide range of techniques - charts, diagrams,
maps, etc. Choosing the right technique and its setup is often the true way to make data
understandable. And vice versa, wrong tactics may fail to present the full potential of data or
even make it irrelevant.
5 factors that influence data visualization choices

1. Audience. It’s important to adjust data representation to the target audience. If it’s end
customers who browse through their progress in a fitness app, then simplicity is the key. On
the other hand, if data insights are intended for researchers or experienced decision-makers,
you can and often should go beyond simple charts.
2. Content. The type of data determines the tactics. For example, if it’s metrics that changes
over time, you most probably will use line charts to show the dynamics. To show the
relationship between two elements you will use a scatter plot. In turn, bar charts are perfect
for comparison analysis.
3. Context. You may use different approaches to the way your graphs look and therefore
read depending on the context. To emphasize a certain figure, for example serious profit
growth compared to other years, you may want to use the shades of one color and pick the
bright one for the most significant element on the chart. On the contrary, to differentiate
elements, you’ll use contrast colors.
4. Dynamics. There are various types of data, and each of it implies a different rate of
change. For example, financial results can be measured monthly or yearly, while time series
and tracking data is constantly changing. Depending on the type of change, you may consider
dynamic representation (steaming) or a static visualization.
5. Purpose. The goal of data visualization also has serious influence on the way it is
implemented. In order to make a complex analysis of a system or combine different types of
data for a more profound view, visualizations are compiled into dashboards with controls and
filters. However, dashboards are not necessary to show a single or occasional data insight.
Data visualization techniques
Depending on these 5 factors, you choose among different data visualization techniques and
configure their features. Here are the common tactics used in business today:
Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from
bar and line charts that show relationship between elements over time to pie charts that
demonstrate the components or proportions between the elements of one whole.
Plots
Plots allow to distribute two or more data sets over a 2D or even 3D space to show the
relationship between these sets and the parameters on the plot. Plots also vary: scatter and bubble
plots are the most traditional. Though when it comes to big data, analysts use box plots that
enable to visualize the relationship between large volumes of different data
Maps
Maps are widely-used in different industries. They allow to position elements on relevant objects
and areas - geographical maps, building plans, website layouts, etc. Among the most popular
map visualizations are heat maps, dot distribution maps, cartograms.
Diagrams and matrices

Diagrams are usually used to demonstrate complex data relationships and links and include
various types of data on one visualization. They can be hierarchical, multidimensional, tree-like.
Matrix is a big data visualization technique that allows to reflect the correlations between
multiple constantly updating (steaming) data sets
Q 4 :Why do we need Hadoop for Big Data Analytics? Explain the different
features of Hadoop.
Hadoop has changed the perception of big data management, especially for
unstructured data. Hadoop is a framework or software library and plays a vital role in handling
voluminous data. It helps in streamlining data across clusters for distributed processing with the
help of small programming models.
Open source
It is an open source Java-based programming framework. Open source means it is freely

available and even we can change its source code as per your requirements.
Fault Tolerance
Hadoop control faults by the process of replica creation. When client stores a file in HDFS,
Hadoop framework divide the file into blocks. Then client distributes data blocks across different
machines present in HDFS cluster. And, then create the replica of each block is on other
machines present in the cluster. HDFS, by default, creates 3 copies of a block on other machines
present in the cluster. If any machine in the cluster goes down or fails due to unfavorable
conditions. Then also, the user can easily access that data from other machines.
Distributed Processing
Hadoop stores huge amount of data in a distributed manner in HDFS. Process the data in parallel
on a cluster of nodes.
Scalability
Hadoop is an open-source platform. This makes it extremely scalable platform. So, new nodes
can be easily added without any downtime. Hadoop provides horizontal scalability so new node
added on the fly model to the system. In Apache hadoop, applications run on more than
thousands of node.
Reliability
Data is reliably stored on the cluster of machines despite machine failure due to replication of
data. So, if any of the nodes fails, then also we can store data reliably
High Availability
Due to multiple copies of data, data is highly available and accessible despite hardware failure.
So, any machine goes down data can be retrieved from the other path. Learn Hadoop High
Availability feature in detail.
Economic
Hadoop is not very expensive as it runs on the cluster of commodity hardware. As we are using
low-cost commodity hardware, we don’t need to spend a huge amount of money for scaling out
your Hadoop cluster.\
Flexibility
Hadoop is very flexible in terms of ability to deal with all kinds of data. It deals with structured,
semi-structured or unstructured.
Easy to use
No need of client to deal with distributed computing, the framework takes care of all the things.
So it is easy to use.
Data locality
It refers to the ability to move the computation close to where actual data resides on the node.
Instead of moving data to computation. This minimizes network congestion and increases the
over throughput of the system. Learn more about Data Locality.
In conclusion, we can say, Hadoop is highly fault-tolerant. It reliably stores huge amount of data
despite hardware failure. It provides High scalability and high availability. Hadoop is cost
efficient as it runs on a cluster of commodity hardware. Hadoop work on Data locality as moving
computation is cheaper than moving data. All these features of Big data Hadoop make it
powerful for the Big data processing.
Q.5 What is Clustering ? Define Application of Clustering and How the K-
means algorithm work
Clustering
Clustering is one of the most common exploratory data analysis technique used to get an
intuition about the structure of the data. It can be defined as the task of identifying subgroups in
the data such that data points in the same subgroup (cluster) are very similar while data points in
different clusters are very different. In other words, we try to find homogeneous subgroups within
the data such that data points in each cluster are as similar as possible according to a similarity
measure such as euclidean-based distance or correlation-based distance. The decision of which
similarity measure to use is application-specific.
Clustering analysis can be done on the basis of features where we try to find subgroups of
samples based on features or on the basis of samples where we try to find subgroups of features
based on samples. We’ll cover here clustering based on features. Clustering is used in market
segmentation; where we try to fined customers that are similar to each other whether in terms of
behaviors or attributes, image segmentation/compression; where we try to group similar regions
together, document clustering based on topics, etc.
Unlike supervised learning, clustering is considered an unsupervised learning method since we
don’t have the ground truth to compare the output of the clustering algorithm to the true labels to
evaluate its performance. We only want to try to investigate the structure of the data by grouping
the data points into distinct subgroups.
In this post, we will cover only Kmeans which is considered as one of the most used clustering
algorithms due to its simplicity.

Kmeans Algorithm
Kmeans algorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined
distinct non-overlapping subgroups (clusters) where each data point belongs to only one group. It
tries to make the intra-cluster data points as similar as possible while also keeping the clusters as
different (far) as possible. It assigns data points to a cluster such that the sum of the squared
distance between the data points and the cluster’s centroid (arithmetic mean of all the data points
that belong to that cluster) is at the minimum. The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same cluster.
The way kmeans algorithm works is as follows:
1. Specify number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points
for the centroids without replacement.
3. Keep iterating until there is no change to the centroids. i.e assignment of data points to
clusters isn’t changing.
 Compute the sum of the squared distance between data points and all centroids.
 Assign each data point to the closest cluster (centroid).
 Compute the centroids for the clusters by taking the average of the all data points that
belong to each cluster.

The approach kmeans follows to solve the problem is called Expectation-Maximization. The E-
step is assigning the data points to the closest cluster. The M-step is computing the centroid of
each cluster. Below is a break down of how we can solve it mathematically (feel free to skip it).
The objective function is:
where wik=1 for data point xi if it belongs to cluster k; otherwise, wik=0. Also, μk is the centroid
of xi’s cluster.
It’s a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk fixed. Then
we minimize J w.r.t. μk and treat wik fixed. Technically speaking, we differentiate J w.r.t. wik
first and update cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the
centroids after the cluster assignments from previous step (M-step). Therefore, E-step is:
In other words, assign the data point xi to the closest cluster judged by its sum of squared distance
from cluster’s centroid.

And M-step is:
Which translates to recomputing the centroid of each cluster to reflect the new assignments.
Few things to note here:
 Since clustering algorithms including kmeans use distance-based measurements to
determine the similarity between data points, it’s recommended to standardize the data to
have a mean of zero and a standard deviation of one since almost always the features in any
dataset would have different units of measurements such as age vs income.
 Given kmeans iterative nature and the random initialization of centroids at the start of the
algorithm, different initializations may lead to different clusters since kmeans algorithm
may stuck in a local optimum and may not converge to global optimum. Therefore, it’s
recommended to run the algorithm using different initializations of centroids and pick the
results of the run that that yielded the lower sum of squared distance.
 Assignment of examples isn’t changing is the same thing as no change in within-cluster
variation:

Data Science Activity

Uploaded by

Copyright:

Available Formats

Data Science Activity

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Data Science Activity

Uploaded by

Copyright:

Available Formats

[[ Data Science ]]

Example of Supervised Learning

. Introduction to Logistic Regression

In linear regression the Y variable is always a continuous variable. If suppose, the Y

So what would you do when the Y is a categorical variable with 2 classes?

Another advantage of logistic regression is that it computes a prediction probability score

2. Some real world examples of binary classification problems

Here are some examples of binary classification problems:

 Spam Detection : Predicting if an email is Spam or not

5 factors that influence data visualization choices

Diagrams and matrices

It is an open source Java-based programming framework. Open source means it is freely

measure such as euclidean-based distance or correlation-based distance. The decision of which

similarity measure to use is application-specific.

behaviors or attributes, image segmentation/compression; where we try to group similar regions

together, document clustering based on topics, etc.

Unlike supervised learning, clustering is considered an unsupervised learning method since we

the data points into distinct subgroups.

algorithms due to its simplicity.

Kmeans algorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined

The way kmeans algorithm works is as follows:

1. Specify number of clusters K.

for the centroids without replacement.

clusters isn’t changing.

 Assign each data point to the closest cluster (centroid).

belong to each cluster.

The objective function is:

from cluster’s centroid.

Few things to note here:

 Since clustering algorithms including kmeans use distance-based measurements to

dataset would have different units of measurements such as age vs income.

 Assignment of examples isn’t changing is the same thing as no change in within-cluster

You might also like