Unit V
Introduction
Machine learning is a growing technology which enables computers
to learn automatically from past data.
Machine learning uses various algorithms for building mathematical
models and making predictions using historical data or information.
Currently, it is being used for various tasks such as image recognition,
speech recognition, email filtering, Facebook auto-tagging, recommender
system, and many more.
The main machine learning techniques are supervised, unsupervised, and
reinforcement learning. These include regression and classification models,
clustering methods, hidden Markov models, and various sequential models.
Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without being
explicitly programmed.
With the help of sample historical data, which is known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.
Machine learning brings computer science and statistics together for creating
predictive models.
Machine learning is a branch of computer science that studies the design of
algorithms that can learn.
Typical machine learning tasks are concept learning, function learning or
“predictive modeling”, clustering and finding predictive patterns. These tasks
are learned from available data observed, for example, through experience or
instruction.
The hope in machine learning is that incorporating experience into its tasks
will eventually improve learning. The ultimate goal is to improve learning in
such a way that it becomes automatic, so that humans no longer need to
interfere.
Supervised learning
Supervised learning, also known as supervised machine learning, is a
subcategory of machine learning and artificial intelligence.
It is defined by its use of labeled data sets to train algorithms to
classify data or predict outcomes accurately. As input data is fed into the model,
the model adjusts its weights until it has been fitted appropriately, which occurs
as part of the cross-validation process.
Supervised learning helps organizations solve a variety of real-world
problems at scale, such as filtering spam into a separate folder from your
inbox. It can be used to build highly accurate machine learning models.
• Neural networks:
Primarily leveraged for deep learning algorithms, neural networks
process training data by mimicking the interconnectivity of the human brain
through layers of nodes. Each node is made up of inputs, weights, a bias (or
threshold), and an output. If that output value exceeds a given threshold, it
“fires” or activates the node, passing data to the next layer in the network.
Neural networks learn this mapping function through supervised learning,
adjusting based on the loss function through the process of gradient descent.
When the cost function is at or near zero, we can be confident in the model’s
accuracy to yield the correct answer.
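To make this concrete, here is a minimal, hedged sketch of a single-hidden-layer neural network in R. The nnet package and the built-in iris data are illustrative assumptions; the text above does not prescribe a particular library.

```r
# Minimal sketch: a one-hidden-layer neural network with the nnet package.
# The package choice and the iris data are illustrative assumptions only.
library(nnet)

set.seed(42)
fit <- nnet(Species ~ ., data = iris, size = 4, maxit = 200, trace = FALSE)

# Predicted classes for a few training rows
head(predict(fit, iris, type = "class"))
```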
• Naive Bayes:
Naive Bayes is a classification approach that adopts the principle of class-
conditional independence from Bayes' theorem. This means that the
presence of one feature does not impact the presence of another in the
probability of a given outcome, and each predictor has an equal effect on that
result.
There are three types of Naïve Bayes classifiers: Multinomial Naïve
Bayes, Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
This technique is primarily used in text classification, spam identification,
and recommendation systems.
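As an illustrative sketch (the e1071 package and the iris data are assumptions, not named in the text), a Gaussian Naive Bayes classifier can be fitted in R as follows:

```r
# Gaussian Naive Bayes with the e1071 package (illustrative choice only)
library(e1071)

nb <- naiveBayes(Species ~ ., data = iris)
predict(nb, iris[1:5, ])                 # predicted classes
predict(nb, iris[1:5, ], type = "raw")   # per-class probabilities
```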
• Linear regression:
Linear regression is used to identify the relationship between a dependent
variable and one or more independent variables and is typically leveraged to
make predictions about future outcomes. When there is only one independent
variable and one dependent variable, it is known as simple linear regression. As
the number of independent variables increases, it is referred to as multiple linear
regression. Each type of linear regression seeks to plot a line of best fit,
which is calculated through the method of least squares. Unlike other
regression models, this line is straight when plotted on a graph.
Linear regression is a statistical tool that is mainly used for predicting and
forecasting values based on historical information subject to some important
assumptions:
• There must be a dependent variable and a set of independent variables.
• There must exist a linear relationship between the dependent and the
independent variables, that is:
y = b + a1x1 + a2x2 + … + apxp + e
Where
y : is the response variable.
xj : is the predictor variable j, where j = 1, 2, 3, …, p.
e : is the error term, which is normally distributed with mean 0 and constant
variance.
aj and b : are the regression coefficients to be estimated.
Regression is a technique used to identify the linear relationship between
target variables and explanatory variables. Other terms are also used to describe
these variables. One of them is the predictor variable, whose value is
gathered through experiments. The other is the response variable,
whose value is derived from the predictor variable.
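A minimal sketch in R, using base R's lm() and the built-in mtcars data purely for illustration, shows both simple and multiple linear regression:

```r
# Simple and multiple linear regression with base R (illustrative data)
simple   <- lm(mpg ~ wt, data = mtcars)               # one predictor
multiple <- lm(mpg ~ wt + hp + disp, data = mtcars)   # several predictors

summary(multiple)   # estimated coefficients (the a_j and b above) and fit statistics
predict(multiple, newdata = data.frame(wt = 3, hp = 110, disp = 200))
```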
• Logistic regression:
While linear regression is leveraged when dependent variables are
continuous, logistic regression is selected when the dependent variable is
categorical, meaning they have binary outputs, such as "true" and "false" or
"yes" and "no." While both regression models seek to understand relationships
between data inputs, logistic regression is mainly used to solve binary
classification problems, such as spam identification.
In statistics, logistic regression is known to be a probabilistic
classification model. Logistic regression is widely used in many disciplines,
including the medical and social science fields. Logistic regression can be either
binomial or multinomial. It is very popular for predicting a categorical response.
Binary logistic regression is used in cases where the outcome for a dependent
variable has two possibilities, while multinomial logistic regression is
concerned with cases where there are three or more possible types.
Using logistic regression, the input values (x) are combined linearly using
weights or coefficient values to predict an output value (y) based on the log
odds ratio. One major difference between linear regression and logistic
regression is that in linear regression the output value being modeled is a
numerical value, while in logistic regression it is a binary value (0 or 1).
The logistic regression equation can be given as follows:
P(y) = e^(b0 + b1x) / (1 + e^(b0 + b1x))
Where
P(y) is the expected probability that y = 1 for a given subject,
b0 is the bias or intercept term and
b1 is the coefficient for the single input value (x).
Each column in your input data has an associated b coefficient (a constant
real value) that must be learned from your training data.
It is quite simple to make predictions using logistic regression: one only
needs to plug numbers into the logistic regression equation to obtain the
output.
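A hedged sketch in base R: glm() with a binomial family fits a binary logistic regression, and predict(..., type = "response") returns probabilities between 0 and 1. The mtcars data and the choice of predictors are illustrative assumptions.

```r
# Binary logistic regression with base R's glm(); 'am' is a 0/1 outcome
logit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(logit)   # b0 (intercept) and b coefficients on the log-odds scale
predict(logit, newdata = data.frame(wt = 2.5, hp = 100),
        type = "response")   # predicted probability that am = 1
```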
• K-nearest neighbor:
K-nearest neighbor, also known as the KNN algorithm, is a non-
parametric algorithm that classifies data points based on their proximity and
association to other available data. This algorithm assumes that similar data
points can be found near each other. As a result, it seeks to calculate the
distance between data points, usually through Euclidean distance, and then it
assigns a category based on the most frequent category or average. Its ease of
use and low calculation time make it a preferred algorithm by data scientists,
but as the test dataset grows, the processing time lengthens, making it less
appealing for classification tasks. KNN is typically used for recommendation
engines and image recognition.
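For illustration, the sketch below uses the class package's knn() function (an assumed choice) on the iris data, standardising the features so the Euclidean distances are comparable:

```r
# k-nearest-neighbour classification with the class package (illustrative)
library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)
train_x <- scale(iris[train_idx, 1:4])
test_x  <- scale(iris[-train_idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

pred <- knn(train = train_x, test = test_x,
            cl = iris$Species[train_idx], k = 5)   # Euclidean distance by default
table(pred, iris$Species[-train_idx])              # confusion matrix
```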
• Random forest:
Random forest is another flexible supervised machine learning algorithm
used for both classification and regression purposes. The "forest" references a
collection of uncorrelated decision trees, which are then merged together to
reduce variance and create more accurate data predictions.
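A minimal sketch with the randomForest package (an illustrative assumption): the same function handles classification when the response is a factor and regression when it is numeric.

```r
# Random forest for classification and regression (illustrative data)
library(randomForest)

set.seed(7)
rf_class <- randomForest(Species ~ ., data = iris, ntree = 200)   # classification
rf_reg   <- randomForest(mpg ~ ., data = mtcars, ntree = 200)     # regression

rf_class   # prints the out-of-bag error estimate and confusion matrix
```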
Unsupervised learning
Unsupervised learning, also known as unsupervised machine learning,
uses machine learning(ML) algorithms to analyze and cluster unlabeled data
sets. These algorithms discover hidden patterns or data groupings without the
need for human intervention. Unsupervised learning's ability to discover
similarities and differences in information makes it the ideal solution for
exploratory data analysis, cross-selling strategies, customer segmentation and
image recognition.
Clustering
Clustering is a data mining technique which groups unlabeled data based
on their similarities or differences. Clustering algorithms are used to process
raw, unclassified data objects into groups represented by structures or patterns
in the information. Clustering algorithms can be categorized into a few types,
specifically exclusive, overlapping, hierarchical, and probabilistic.
Clustering or cluster analysis is a machine learning technique, which
groups the unlabelled dataset. It can be defined as "A way of grouping the data
points into different clusters, consisting of similar data points. The objects with
the possible similarities remain in a group that has less or no similarities with
another group."
It does this by finding similar patterns in the unlabelled dataset, such as
shape, size, color, or behavior, and divides the data according to the presence
or absence of those patterns.
It is an unsupervised learning method, hence no supervision is provided
to the algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is given
a cluster ID. An ML system can use this ID to simplify the processing of large
and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world
example of a shopping mall: when we visit any shopping mall, we can observe
that things with similar usage are grouped together. T-shirts are grouped
in one section and trousers in another; similarly, in the fruit and vegetable
section, apples, bananas, mangoes, etc., are grouped separately, so that we can
easily find what we need. The clustering technique works in the same way.
Another example of clustering is grouping documents according to
topic.
The working of a clustering algorithm can be pictured as different fruits
being divided into several groups that share similar properties.
The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:
Market Segmentation
Statistical data analysis
Social network analysis
Image segmentation
Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its
recommendation system to provide recommendations based on a user's past
product searches. Netflix also uses this technique to recommend movies and
web series to its users based on their watch history.
Clustering Algorithms
Clustering algorithms can be divided based on their underlying models.
Many different clustering algorithms have been published, but only a few are
commonly used. The choice of clustering algorithm depends on the kind of data
we are using: for example, some algorithms require the number of clusters to be
specified in advance, whereas others work by finding the minimum distance
between the observations of the dataset.
Here we are discussing mainly popular Clustering algorithms that are
widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It groups the dataset by dividing the samples into
clusters of roughly equal variance. The number of clusters must be
specified in this algorithm. It is fast, with relatively few computations
required and a linear complexity of O(n).
K-Means is an unsupervised machine learning algorithm which aims at
clustering data together, that is, finding clusters in data based on
similarity in the descriptions of the data and their relationships. Each
cluster is associated with a center point known as a centroid. The distance
of each observation from the centroids is calculated, and clusters are
formed by assigning each point to the closest centroid. Various distance
measures, such as Euclidean distance, squared Euclidean distance and the
Manhattan (city-block) distance, are used to determine which observation
is assigned to which centroid. The number of clusters is represented by
the variable K.
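A minimal base-R sketch of k-means on the iris measurements (an illustrative choice of data); K is fixed at 3 and each point is assigned to its nearest centroid:

```r
# k-means clustering with base R's kmeans(); K must be chosen up front
set.seed(123)
x  <- scale(iris[, 1:4])            # numeric features only, standardised
km <- kmeans(x, centers = 3, nstart = 25)

km$centers                          # the K centroids
table(km$cluster, iris$Species)     # compare clusters with the known labels
```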
Applications of Clustering
Below are some commonly known applications of clustering technique in
Machine Learning:
In Identification of Cancer Cells: The clustering algorithms are widely
used for the identification of cancerous cells. It divides the cancerous and
non-cancerous data sets into different groups.
In Search Engines: Search engines also work on the clustering technique.
The search result appears based on the closest object to the search query.
This is done by grouping similar data objects into one group that is far from
other, dissimilar objects. The accuracy of a query's results depends on the
quality of the clustering algorithm used.
Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
In Land Use: The clustering technique is used to identify areas of
similar land use in a GIS database. This can be very useful for determining
the purpose for which a particular piece of land is most suitable.
Association Rules
An association rule is a rule-based method for finding relationships
between variables in a given dataset. These methods are frequently used for
market basket analysis, allowing companies to better understand relationships
between different products. Understanding consumption habits of customers
enables businesses to develop better cross-selling strategies and
recommendation engines. Examples of this can be seen in Amazon’s
“Customers Who Bought This Item Also Bought” or Spotify’s "Discover
Weekly" playlist. While there are a few different algorithms used to generate
association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm
is most widely used.
Association Rule Mining is used when you want to find an association
between different objects in a set, find frequent patterns in a transaction
database, relational databases or any other information repository. The
applications of Association Rule Mining are found in Marketing, Basket Data
Analysis (or Market Basket Analysis) in retailing, clustering and classification.
It can tell you which items customers frequently buy together by generating a
set of rules called association rules. In simple words, it gives output as
rules of the form "if this, then that".
Clients can use those rules for numerous marketing strategies:
Changing the store layout according to trends
Customer behavior analysis
Catalogue design
Cross marketing on online stores
Identifying trending items that customers buy
Customized emails with add-on sales
1. Apriori algorithms
Apriori algorithms have been popularized through market basket
analyses, leading to different recommendation engines for music platforms and
online retailers. They are used within transactional datasets to identify frequent
item sets, or collections of items, to identify the likelihood of consuming a
product given the consumption of another product. For example, if I play Black
Sabbath’s radio on Spotify, starting with their song “Orchid”, one of the other
songs on this channel will likely be a Led Zeppelin song, such as “Over the
Hills and Far Away.” This is based on my prior listening habits as well as
those of other listeners. Apriori algorithms use a hash tree to count itemsets, navigating
through the dataset in a breadth-first manner.
Association Rule Mining is viewed as a two-step approach:
1. Frequent Itemset Generation: Find all frequent item-sets with support >=
pre-determined min_support count
2. Rule Generation: List all Association Rules from frequent item-sets.
Calculate Support and Confidence for all rules. Prune rules that fail
min_support and min_confidence thresholds.
Among these steps, frequent item-set generation is the most computationally
expensive because it requires a full database scan.
With only a handful of transactions the computation is trivial, but real-world
retail transaction data can run to gigabytes or even terabytes, so an
optimized algorithm is needed to prune item-sets that will not help in later
steps. The Apriori algorithm is used for this.
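The sketch below shows how the two steps look in practice with the arules package; the package choice and the five toy transactions are illustrative assumptions, not data from the text.

```r
# Apriori with the arules package on a made-up set of five transactions
library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("beer", "bread"),
  c("milk", "butter"),
  c("bread", "milk", "butter")
)
trans <- as(baskets, "transactions")

rules <- apriori(trans,
                 parameter = list(supp = 0.4, conf = 0.6))  # min_support, min_confidence
inspect(rules)   # rules are printed as {LHS} => {RHS} with support and confidence
```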
Dimensionality reduction
While more data generally yields more accurate results, it can also hurt
the performance of machine learning algorithms (e.g., through overfitting) and
make it difficult to visualize datasets. Dimensionality reduction is a technique
used when the number of features, or dimensions, in a given dataset is too high.
It reduces the number of data inputs to a manageable size while also preserving
the integrity of the dataset as much as possible. It is commonly used in the
data preprocessing stage, and there are several dimensionality reduction
methods that can be used, such as:
Autoencoders
Autoencoders leverage neural networks to compress data and then
recreate a new representation of the original data’s input. The hidden layer
acts as a bottleneck that compresses the input layer before it is reconstructed
in the output layer. The stage from the input layer to the hidden layer is
referred to as “encoding”, while the stage from the hidden layer to the output
layer is known as “decoding”.
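A full autoencoder needs a deep-learning framework, so as a lighter, hedged illustration of the same compression idea, the sketch below uses principal component analysis with base R's prcomp(); it is a stand-in for dimensionality reduction in general, not an autoencoder.

```r
# Dimensionality reduction sketch with PCA (base R); not an autoencoder,
# but it shows the same idea of compressing many features into fewer dimensions
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)               # proportion of variance captured by each component
reduced <- pca$x[, 1:2]    # compressed 2-dimensional representation
head(reduced)
```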
Collaborative Filtering
Collaborative filtering is a technique that can filter out items that a
user might like on the basis of reactions by similar users. It works by
searching a large group of people and finding a smaller set of users with
tastes similar to a particular user.
Collaborative filtering is a method used by recommender systems to
make automatic predictions about a user’s interests by collecting preferences
from many users (collaborating).
The underlying assumption is that if person A has a similar opinion as
person B on one issue, A is more likely to have B’s opinion on a different issue
than that of a randomly chosen person.
Collaborative filtering can be user-based (user-user) or item-based
(item-item), and in practice the choice should take the amount of computation
into account. Item-item collaborative filtering is preferable when there are
more users than items, whereas user-user collaborative filtering is preferable
when there are more items than users.
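A base-R sketch of the user-user idea on a tiny made-up rating matrix (rows are users, columns are items, NA means not rated); everything here is illustrative.

```r
# User-user collaborative filtering: find the users most similar to user1
ratings <- matrix(c(5, 3, NA, 1,
                    4, NA, NA, 1,
                    1, 1, 5, 4,
                    NA, 1, 5, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4), paste0("item", 1:4)))

# Cosine similarity between two users, using only items both have rated
cosine_sim <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)
  if (!any(ok)) return(0)
  sum(a[ok] * b[ok]) / (sqrt(sum(a[ok]^2)) * sqrt(sum(b[ok]^2)))
}

# Similarity of every user to user1; ratings from the most similar users can
# then be averaged to predict user1's missing ratings
sapply(rownames(ratings), function(u) cosine_sim(ratings["user1", ], ratings[u, ]))
```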
Social Media Analytics
In this era of social media and networking, social media analytics is the
process of extracting previously unseen and unknown insights from the abundant
data available worldwide.
Social media analytics is the ability to gather and find meaning in
data gathered from social channels to support business decisions and measure
the performance of actions based on those decisions through social media.
It is considered a science as it methodically involves the identification,
extraction, and evaluation of social media data using various tools and methods.
It is also an art for interpreting insights obtained with business goals and
objectives. It focuses on seven layers of social media: text, networks, actions,
hyperlinks, mobile, location, and search engines. Various tools for social media
analytics include DiscoverText, Lexalytics, Netlytic, Google Analytics,
NodeXL, NetMiner, and many more.
Social media analytics is broader than metrics such as likes,
follows, retweets, previews, clicks, and impressions gathered from
individual channels. It also differs from reporting offered by services that
support marketing campaigns such as LinkedIn or Google Analytics.
Social media analytics uses specifically designed software platforms
that work similarly to web search tools.
Data about keywords or topics is retrieved through search queries or
web ‘crawlers’ that span channels. Fragments of text are returned, loaded
into a database, categorized and analyzed to derive meaningful insights.
Strategic decision-making:
Social Media Analysis also aids in trend analysis and the identification of
high-value features for a brand. It gauges responses to social media and other
communications, facilitating meaningful decision-making for organizations to
improve productivity.
Sentiment analysis:
Comments and reviews about products and services are collected,
extracted, cleaned, and analyzed using various tools. Categorization of these
comments reveals the intention about the brand. Natural language processing
methodologies are employed to understand the intensity and group comments
into positive, negative, or neutral categories regarding a product or service.
Summarization charts about customer sentiment reveal future prospects for
product usage and guide corrective actions accordingly.
Performance analysis/metrics
Measuring the performance of social media marketing efforts is critical to
understanding where strategic efforts are working and where improvement is
needed.
Key performance metrics to track include the following:
interactions across platforms and over time to determine if the
posted content is properly engaging the audience;
whether the number of followers is increasing over time to verify
consistent progress across platforms; and
click-through rate for link clicks on posts to see if they're properly
driving traffic from social media channels.
First and foremost, you need to measure the overall performance of your
social media efforts. This includes social media metrics such as:
Impressions
Reach
Likes
Comments
Shares
Views
Clicks
Sales
Audience analytics
It's important to clearly understand and define the target audience, as it is
the most important element of a social media strategy. Understanding the
audience will help create a favorable customer experience with content targeted
at what customers want and what they're looking for.
In the past, audience data was difficult to measure as it was scattered
across multiple social media platforms. But with analytics tools, marketers can
analyze data across platforms to better understand audience demographics,
interests and behaviors. AI-enabled tools can even help predict customer
behavior. They can also study how an audience changes over time.
The better targeted the content is, the less advertising will cost and the
cost-per-click of ads can be optimized.
Audience analytics will include data like:
Age
Gender
Location
Device
Competitor analysis
To obtain a full understanding of performance metrics, it's necessary to
look at the metrics through a competitive lens. In other words, how do they
stack up to competitors' performance?
With social media analytics tools, social media performance can be
compared to competitors' performance with a head-to-head analysis to gauge
relative effectiveness and to determine what can be improved.
Most modern tools that include AI capabilities can benchmark competitor
performance by industry to determine a good starting point for social media
efforts.
Another key area to look into is how your competitors perform on social
media. How many followers do they have? What is their engagement rate?
How many people seem to engage with each of their posts?
You can then compare this data to your own to see how you stack up—as
well as set more realistic growth goals.
Influencer analysis
To gain a leg up on competition in a competitive space, many social
media marketers will collaborate with social influencers as part of their
marketing campaigns. To make the most of partnerships, it's necessary to
measure key metrics to ensure that the influencer marketing is achieving desired
goals.
Social media analytics can provide insights into the right metrics to
ensure that influencer campaigns are successful.
Some influencer metrics that should be tracked include the following:
total interactions per 1,000 followers to understand if they're properly
generating engagement;
audience size and most frequently used hashtags, to help determine the
maximum reach of your campaign;
the number of posts influencers create on a regular basis, to help
determine how active they are and how powerful engagement can be; and
past collaborations, which can be a great indicator of the potential for
success with an influencer.
If you're running influencer marketing campaigns, tracking the success of
these partnerships is essential to proving ROI. We recommend using the five
W’s + H of influencer marketing to inform your strategy and measure ROI at
each stage of the buyer journey.
Some of the data you'll want to keep track of includes:
Number of posts created per influencer
Total number of interactions per post
Audience size of each influencer
Hashtag usage and engagement
This can help you gauge overall engagement from your influencer
campaigns. If you have an affiliate marketing program, you can designate
promo codes for each individual influencer to use so your team can track how
many sales each partner drives as well.
Sentiment analysis
Sentiment analysis is an important metric to measure as it can indicate
whether a campaign is gaining favorability with an audience or losing it. And
for customer service oriented businesses, sentiment analysis can reveal potential
customer care issues.
To ensure that a campaign is in sync with the target audience and
maintains a strong rate of growth, interactions and engagement rate should be
tracked over time. A decline could indicate that a change of course is needed.
Gathering and analyzing customer sentiment can help avoid guesswork in
developing a marketing strategy and deciding which content will resonate best
with the audience. This type of analysis can also indicate the type of content
that's likely to have a positive impact on customer sentiment. If your social
media analytics tool detects a spike in negative sentiment, action should be
taken immediately to address and correct it before it becomes a PR nightmare.
Data Visualisation
Data visualisation is the phase where the output of data analytics is
displayed. Visualisation is an interactive way of representing the results. Plots
and charts can be used to visualize the data using the required packages
available in data visualisation software.
Prediction of Cancer
Despite the rapid advancement in technology, the early detection and
prognosis of cancer is still a challenge. Detection of cancer is concerned with
the analysis of petabytes of data. This involves high dimensional data, which are
collected from various sources such as scientific experiments, literature,
computational analysis and research. Prognosis is used to determine
survival patterns using various attributes such as the specific drug administered
to a patient, the treatment given and the patient's response. Lots of data are involved
and thus data analytics can be used to determine trends and patterns which will
eventually help doctors in taking the proper decisions. Data mining techniques
can be used to determine trends and acquire knowledge using the information
available.
R for Big Data Analytics
Introduction
Big data analytics has become an integral part of decision-making and
business intelligence across various industries. With the exponential growth of
data, organizations need robust tools and techniques to extract meaningful
insights.
R, a powerful programming language and software environment, has gained
popularity for its extensive capabilities in data analysis and statistical
computing.
Data Visualization
Data Visualization Packages − R is renowned for its extensive data
visualization capabilities. It provides a wide range of packages for creating
visually appealing and informative plots and charts.
Some popular data visualization packages in R include −
ggplot2 − ggplot2 is a highly flexible and powerful package for creating
elegant and customizable data visualizations. It follows the grammar of
graphics principles, allowing users to build complex plots layer by layer.
plotly − plotly is an interactive visualization package that enables the
creation of interactive and web-based plots. It offers a wide range of
options for creating interactive charts, maps, and dashboards.
lattice − lattice provides a comprehensive set of functions for creating
conditioned plots, such as trellis plots and multi-panel plots. It is
particularly useful for visualizing multivariate data.
Visualizing Big Data − When working with big data, visualization can be
challenging due to the sheer volume of data. R offers techniques to visualize big
data efficiently, such as sampling techniques, aggregating data, and using
interactive visualizations that can handle large datasets.
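As a hedged example of the layer-by-layer grammar described above, the sketch below builds a basic ggplot2 scatter plot from the built-in mtcars data (an illustrative choice):

```r
# A ggplot2 plot built layer by layer (illustrative data)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                         # one layer: the points
  geom_smooth(method = "lm", se = FALSE) +       # another layer: fitted lines
  labs(title = "Fuel economy vs. weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")
```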
Real-World Applications
Finance and Banking − Big data analytics in finance and banking can help in
fraud detection, risk modeling, portfolio optimization, and customer
segmentation. R's capabilities in data analysis and modeling make it a valuable
tool in this domain.
Healthcare − In the healthcare industry, big data analytics can contribute to
disease prediction, drug discovery, patient monitoring, and personalized
medicine. R's statistical and machine learning capabilities are well-suited for
analyzing healthcare data.
Marketing and Customer Analytics − R plays a significant role in marketing
and customer analytics by analyzing customer behavior, sentiment analysis,
market segmentation, and campaign optimization. It helps organizations make
data-driven marketing decisions.
Big Data Analytics with Big R refers to using R programming for analyzing
large-scale datasets, leveraging distributed computing frameworks, cloud
environments, and specialized R packages to perform data processing and
analysis. While R is traditionally known for handling small to medium-sized
datasets, tools and extensions like bigmemory, sparklyr, and integration
with Hadoop and Spark enable R to manage and analyze big data effectively.
Why Use R for Big Data Analytics?
1. Statistical Analysis: R provides a rich set of statistical and machine-
learning libraries.
2. Visualization: R offers advanced data visualization packages
like ggplot2 and plotly.
3. Extensibility: It integrates with big data platforms like Hadoop and
Spark.
4. Ease of Use: Its syntax and data manipulation packages
like dplyr simplify working with large datasets.
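For illustration, the sketch below shows a typical dplyr pipeline on the built-in mtcars data; with back ends such as sparklyr, largely the same verbs can be pushed down to a Spark cluster (the specific data and thresholds here are assumptions).

```r
# A dplyr pipeline: filter, group, summarise, sort (illustrative data)
library(dplyr)

mtcars %>%
  filter(hp > 100) %>%                 # keep only higher-powered cars
  group_by(cyl) %>%                    # summarise within cylinder groups
  summarise(avg_mpg = mean(mpg),
            n = n()) %>%
  arrange(desc(avg_mpg))
```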