Data Science Activity
[ IT - 8003]
Submitted By: Sandesh Singhal (0905IT161047)
Submitted To: Mr. Shirish M. Dubey (Assistant Professor)
Q.1 What is the difference between supervised learning and unsupervised learning? Give concrete examples.
Supervised learning: Supervised learning is the learning of a model from data that contains both an input variable (say, X) and an output variable (say, Y), where an algorithm learns to map the input to the output.
That is, Y = f(X)
Why supervised learning?
The basic aim is to approximate the mapping function (mentioned above) so well that, when there is new input data (x), the corresponding output variable (Y) can be predicted.
It is called supervised learning because the process of learning (from the training dataset) can be thought of as a teacher supervising the entire learning process. The learning algorithm iteratively makes predictions on the training data and is corrected by the "teacher", and the learning stops when the algorithm achieves an acceptable level of performance (or the desired accuracy). Concrete examples of supervised learning include classification (e.g., labelling an email as spam or not spam) and regression (e.g., predicting a house price from its size and location).
Unsupervised Learning: Unsupervised learning is where only the input data (say, X) is present and no corresponding output variable is available. The goal is to model the underlying structure or distribution of the data. Concrete examples include clustering (e.g., grouping customers by purchasing behaviour) and association rule learning (e.g., discovering that customers who buy bread also tend to buy butter).
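The contrast can be made concrete with a short sketch. This is only an illustration, assuming scikit-learn and its bundled Iris dataset are available; the dataset and variable names are not part of the original answer.

# Illustrative sketch (assumes scikit-learn): the same features X used with and without labels.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: both X and the labels y are given, so the model learns Y = f(X)
# and can predict the label of new inputs.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict(X[:5]))        # predicted class labels

# Unsupervised learning: only X is given; the algorithm looks for structure (3 clusters)
# without ever seeing the true labels y.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])            # cluster assignments, not class labels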
Q.2 What is logistic regression? State an example of when you have used logistic regression recently.
Logistic regression is a predictive modelling algorithm that is used when the Y variable is binary categorical, that is, when it can take only two values such as 1 or 0. The goal is to determine a mathematical equation that can be used to predict the probability of event 1. Once the equation is established, it can be used to predict Y when only the Xs are known.
In linear regression the Y variable is always a continuous variable. If the Y variable were categorical, you could not use linear regression to model it. Logistic regression can be used to model and solve such problems, also called binary classification problems.
A key point to note here is that Y can have only 2 classes and not more than that. If Y has more than 2 classes, it becomes a multi-class classification problem and you can no longer use vanilla logistic regression for it.
Still, logistic regression is a classic predictive modelling technique and remains a popular choice for modelling binary categorical variables.
You might wonder what kind of problems you can use logistic regression for. Typical examples include spam detection (spam vs. not spam), customer churn prediction (churn vs. stay), and medical diagnosis (disease present vs. absent); in short, any problem where the outcome takes exactly two values.
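As a hedged illustration only, here is a minimal scikit-learn sketch of fitting a logistic regression and reading off the predicted probability of event 1. The breast-cancer dataset below is an assumption for the example, not from the original answer; any binary-labelled dataset would do.

# Minimal sketch assuming scikit-learn; the dataset stands in for any problem
# whose outcome Y takes only the values 0 and 1.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # y contains only 0s and 1s
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scaling the features first simply helps the optimizer converge.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# predict_proba gives P(event 1), the quantity the equation in the text models.
print(model.predict_proba(X_test[:3])[:, 1])
print("accuracy:", model.score(X_test, y_test))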
Q.3 What is data visualization? Explain the factors that influence the choice of visualization technique and the common techniques used.
Visualization is the first step to make sense of data. To describe and present data and data correlations in a simple way, data analysts use a wide range of techniques - charts, diagrams, maps, etc. Choosing the right technique and its setup is often what makes the data understandable. And vice versa, the wrong tactics may fail to present the full potential of the data or even make it irrelevant.
5 factors that influence data visualization choices
1. Audience. It's important to adjust the data representation to the target audience. If it's end customers who browse through their progress in a fitness app, then simplicity is the key. On the other hand, if the data insights are intended for researchers or experienced decision-makers, you can and often should go beyond simple charts into more detailed, data-rich visualizations.
2. Content. The type of data determines the tactics. For example, if it's a metric that changes over time, you will most probably use a line chart to show the dynamics. To show the relationship between two elements, you will use a scatter plot. In turn, bar charts are perfect for comparison analysis.
3. Context. You may use different approaches to the way your graphs look, and therefore read, depending on the context. To emphasize a certain figure, for example significant profit growth compared to other years, you may want to use shades of one color and pick the brightest one for the most significant element on the chart. On the contrary, to differentiate elements, you'll use contrasting colors.
4. Dynamics. There are various types of data, and each of them implies a different rate of change. For example, financial results can be measured monthly or yearly, while time series and tracking data are constantly changing. Depending on the rate of change, you may consider a dynamic representation that updates as the data changes rather than a static one.
5. Purpose. The goal of the data visualization also has a serious influence on the way it is implemented. In order to make a complex analysis of a system or combine different types of data for a more profound view, visualizations are compiled into dashboards with controls and filters. However, when the goal is to communicate a single focused insight, one well-chosen chart is usually enough and a dashboard is unnecessary.
Depending on these 5 factors, you choose among different data visualization techniques and configure
their features. Here are the common tactics used in business today:
Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from bar and line charts that show the relationship between elements over time to pie charts that demonstrate the proportions of components within a whole.
Plots
Plots allow analysts to distribute two or more data sets over a 2D or even 3D space to show the relationship between these sets and the parameters on the plot. Plots also vary: scatter and bubble plots are the most traditional. When it comes to big data, though, analysts often use box plots, which make it possible to visualize how large volumes of data are distributed (medians, quartiles, and outliers).
Maps
Maps are widely used in different industries. They allow analysts to position elements on relevant objects and areas - geographical maps, building plans, website layouts, etc. Among the most popular map types are heat maps and dot distribution maps.
Diagrams
Diagrams are usually used to demonstrate complex data relationships and links, and include various types such as tree diagrams, network diagrams, and flowcharts.
Matrix
A matrix is a big data visualization technique that makes it possible to reflect the correlations between multiple constantly updating data sets.
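A short illustration of the three basic chart types mentioned above: a line chart for change over time, a scatter plot for a relationship between two variables, and a bar chart for comparison. The data and labels are synthetic and purely for demonstration; matplotlib and NumPy are assumed to be available, and none of these figures come from the original text.

# Illustrative only: synthetic data, assumes matplotlib and NumPy are installed.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

# Line chart: a metric that changes over time
months = np.arange(1, 13)
ax1.plot(months, np.cumsum(rng.normal(5, 2, 12)))
ax1.set(title="Revenue over time", xlabel="Month", ylabel="Revenue")

# Scatter plot: relationship between two elements
spend = rng.uniform(0, 10, 50)
ax2.scatter(spend, 2 * spend + rng.normal(0, 2, 50))
ax2.set(title="Ad spend vs. sales", xlabel="Ad spend", ylabel="Sales")

# Bar chart: comparison across categories
ax3.bar(["North", "South", "West"], [23, 17, 35])
ax3.set(title="Sales by region", xlabel="Region", ylabel="Sales")

plt.tight_layout()
plt.show()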
Q.4 Why do we need Hadoop for Big Data Analytics? Explain the different features of
Hadoop.
Hadoop has changed the perception of big data management, especially for unstructured data. Hadoop is a framework, or software library, that plays a vital role in handling voluminous data. It helps in streamlining data across clusters for distributed processing with the help of simple programming models.
Open source
It is an open-source, Java-based programming framework. Open source means it is freely available, and we can even change its source code as per our requirements.
Fault Tolerance
Hadoop handles faults through the process of replica creation. When a client stores a file in HDFS, the Hadoop framework divides the file into blocks and distributes the data blocks across different machines in the HDFS cluster. It then creates a replica of each block on other machines in the cluster; by default, HDFS keeps 3 copies of every block. If any machine in the cluster goes down or fails due to unfavorable conditions, the user can still easily access that data from the other machines.
Distributed Processing
Hadoop stores huge amounts of data in a distributed manner in HDFS and processes the data in parallel on a cluster of nodes.
Scalability
Hadoop is an open-source platform, which makes it extremely scalable. New nodes can be added easily without any downtime: Hadoop provides horizontal scalability, so nodes are added to the system on the fly. Apache Hadoop applications can run on clusters with thousands of nodes.
Reliability
Data is reliably stored on the cluster of machines despite machine failures, thanks to the replication of data. So even if some of the nodes fail, the data remains safely stored.
High Availability
Due to the multiple copies of data, data is highly available and accessible despite hardware failure. If any machine goes down, the data can be retrieved from another path.
Economic
Hadoop is not very expensive, as it runs on a cluster of commodity hardware. Since we are using low-cost commodity hardware, we don't need to spend a huge amount of money to scale out a Hadoop cluster.
Flexibility
Hadoop is very flexible in terms of its ability to deal with all kinds of data: structured, semi-structured, or unstructured.
Easy to use
The client does not need to deal with distributed computing; the framework takes care of all of it, so Hadoop is easy to use.
Data locality
Data locality refers to moving the computation close to where the actual data resides on a node, instead of moving the data to the computation. This minimizes network congestion and increases the overall throughput of the system.
In conclusion, Hadoop is highly fault-tolerant: it reliably stores huge amounts of data despite hardware failure. It provides high scalability and high availability. Hadoop is cost-efficient as it runs on a cluster of commodity hardware, and it exploits data locality, since moving computation is cheaper than moving data. All these features make Hadoop powerful for big data processing.
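As a concrete illustration of the simple programming model mentioned above, here is a hedged sketch of a word-count job written for Hadoop Streaming, which lets ordinary Python scripts act as the mapper and reducer by reading standard input and writing standard output. The file names mapper.py and reducer.py are placeholders introduced for this example, not part of the original answer.

# mapper.py -- emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts per word; Hadoop delivers the mapper output
# to the reducer sorted by key, so identical words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job would then be submitted with the hadoop-streaming jar, roughly as: hadoop jar .../hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <hdfs input> -output <hdfs output> (paths are placeholders). Hadoop splits the input across the cluster, runs the mapper near each block (data locality), and replicates the output blocks in HDFS.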
Q.5 What is clustering? Define applications of clustering and explain how the K-means algorithm works.
Clustering
Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance. Clustering can be done on the basis of features, where we try to find subgroups of samples, or on the basis of samples, where we try to find subgroups of features; here we cover clustering based on features. Clustering is used in market segmentation, where we try to find customers that are similar to each other in terms of behaviors or attributes; in image segmentation/compression, where we try to group similar regions together; in document clustering based on topics; and so on.
Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have
the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its
performance. We only want to try to investigate the structure of the data by grouping the data points into
distinct subgroups.
Here we will cover only K-means, which is considered one of the most widely used clustering algorithms.
K-means Algorithm
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while also keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The way the K-means algorithm works is as follows:
1. Specify the number of clusters K.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters isn't changing:
• Compute the sum of the squared distance between data points and all centroids.
• Assign each data point to the closest cluster (centroid).
• Compute the centroids for the clusters by taking the average of all the data points that belong to each cluster.
The approach K-means follows to solve the problem is called Expectation-Maximization. The E-step is assigning the data points to the closest cluster. The M-step is computing the centroid of each cluster. Below is a breakdown of how we can express it mathematically (feel free to skip it).
The objective function is J = Σ_i Σ_k wik ||xi − μk||², summing over all data points i and all clusters k, where wik = 1 for data point xi if it belongs to cluster k and wik = 0 otherwise, and μk is the centroid of cluster k.
It’s a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk fixed. Then we minimize
J w.r.t. μk and treat wik fixed. Technically speaking, we differentiate J w.r.t. wik first and update cluster
assignments (E-step). Then we differentiate J w.r.t. μk and recompute the centroids after the cluster
In other words, the E-step assigns each data point xi to the closest cluster as judged by its squared distance from the cluster's centroid, and the M-step translates to recomputing the centroid of each cluster to reflect the new assignments.
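The two alternating steps can be sketched in a few lines of NumPy. This is only an illustrative implementation under the definitions above; the function name and the toy data are assumptions for the example, and it does not handle edge cases such as empty clusters.

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Initialize centroids by randomly selecting k distinct data points.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # E-step: assign every point to the closest centroid (squared distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # M-step: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on two well-separated 2-D blobs (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)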
A few things to note about K-means:
• Since clustering algorithms, including K-means, use distance-based measurements to determine the similarity between data points, it is recommended to standardize the data to have a mean of zero and a standard deviation of one, since the features in almost any dataset will have different units and scales.
• Given K-means' iterative nature and the random initialization of centroids at the start of the algorithm, different initializations may lead to different clusters, since the algorithm may get stuck in a local optimum and never converge to the global optimum. It is therefore recommended to run the algorithm with several different centroid initializations and pick the results of the run that yielded the lowest sum of squared distances (a short sketch after this list applies both recommendations).
• The assignment of examples no longer changing is the same thing as no change in the within-cluster variation, which is the usual convergence check.
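A hedged sketch of those two recommendations using scikit-learn; the synthetic data and parameter values below are illustrative assumptions, not from the original text.

# Illustrative only: random features on very different scales.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 4)) * np.array([1, 10, 100, 1000])

# Standardize to zero mean and unit variance so no single feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 runs K-means from 10 random centroid initializations and keeps the
# run with the lowest within-cluster sum of squared distances (inertia_).
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(km.inertia_)       # the winning run's sum of squared distances
print(km.labels_[:10])   # cluster assignments for the first 10 points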