
UNIT - V

Data Analytics with R Machine Learning: Introduction, Supervised Learning,
Unsupervised Learning, Collaborative Filtering, Social Media Analytics,
Mobile Analytics, Big Data Analytics with BigR.

Introduction
Machine learning is a growing technology which enables computers
to learn automatically from past data.
Machine learning uses various algorithms for building mathematical
models and making predictions using historical data or information.
Currently, it is being used for various tasks such as image recognition,
speech recognition, email filtering, Facebook auto-tagging, recommender
system, and many more.
The main machine learning techniques are supervised, unsupervised, and
reinforcement learning; typical model families include regression and
classification models, clustering methods, hidden Markov models, and various
sequential models.
Machine learning enables a machine to automatically learn from data,
improve performance from experiences, and predict things without being
explicitly programmed.
With the help of sample historical data, which is known as training data,
machine learning algorithms build a mathematical model that helps in
making predictions or decisions without being explicitly programmed.

How does Machine Learning work


A Machine Learning system learns from historical data, builds
prediction models, and, whenever it receives new data, predicts the output for it.
The accuracy of the predicted output depends upon the amount of data, as a
larger amount of data helps to build a better model which predicts the output
more accurately.
Suppose we have a complex problem where we need to perform some
predictions. Instead of writing code for it, we just need to feed the data to
generic algorithms, and with the help of these algorithms, the machine builds the
logic from the data and predicts the output. Machine learning has changed our
way of thinking about such problems. The block diagram below explains the
working of a machine learning algorithm:

Features of Machine Learning:


 Machine learning uses data to detect various patterns in a given dataset.
 It can learn from past data and improve automatically.
 It is a data-driven technology.
 Machine learning is similar to data mining, as it also deals with huge
amounts of data.

Need for Machine Learning


The need for machine learning is increasing day by day. The
reason behind the need for machine learning is that it is capable of doing
tasks that are too complex for a person to implement directly.
As humans, we have some limitations: we cannot manually access and
process huge amounts of data, so we need computer systems, and this is where
machine learning comes in to make things easy for us.
We can train machine learning algorithms by providing them with huge
amounts of data and letting them explore the data, construct the models, and
predict the required output automatically. The performance of a machine
learning algorithm depends on the amount of data, and it can be determined
by the cost function.
With the help of machine learning, we can save both time and money.
The importance of machine learning can be easily understood by its use
cases. Currently, machine learning is used in self-driving cars, cyber fraud
detection, face recognition, friend suggestions by Facebook, and so on.
Various top companies such as Netflix and Amazon have built machine
learning models that use vast amounts of data to analyze user interests and
recommend products accordingly.
Following are some key points which show the importance of Machine
Learning:
 Rapid increase in the production of data
 Solving complex problems, which are difficult for a human
 Decision making in various sectors, including finance
 Finding hidden patterns and extracting useful information from data

Machine learning brings computer science and statistics together for creating
predictive models.
Machine learning is a branch of computer science that studies the design of
algorithms that can learn.
Typical machine learning tasks are concept learning, function learning or
“predictive modeling”, clustering, and finding predictive patterns. These tasks
are learned from available data that were observed through experience or
instruction, for example.
The hope is that incorporating experience into its tasks will eventually
improve the learning. The ultimate goal is to improve the learning in such a way
that it becomes automatic, so that humans like ourselves don’t need to interfere
any more.

At a broad level, machine learning can be classified into three types:


1. Supervised learning: In supervised learning (SML), the learning
algorithm is presented with labelled example inputs, where the labels
indicate the desired output. SML itself is composed of classification,
where the output is categorical, and regression, where the output is
numerical.
2. Unsupervised learning: In unsupervised learning (UML), no labels are
provided, and the learning algorithm focuses solely on detecting structure
in unlabelled input data. Note that there are also semi-supervised learning
approaches that use labelled data to inform unsupervised learning on the
unlabelled data to identify and annotate new classes in the dataset (also
called novelty detection).
3. Reinforcement learning: In reinforcement learning, the learning algorithm
performs a task using feedback from operating in a real or synthetic
environment.

Supervised learning
Supervised learning, also known as supervised machine learning, is a
subcategory of machine learning and artificial intelligence.
It is defined by its use of labeled data sets to train algorithms to
classify data or predict outcomes accurately. As input data is fed into the model,
it adjusts its weights until the model has been fitted appropriately, which occurs
as part of the cross-validation process.
Supervised learning helps organizations solve a variety of real-world
problems at scale, such as classifying spam into a separate folder from your
inbox. It can be used to build highly accurate machine learning models.

How supervised learning works


Supervised learning uses a training set to teach models to yield the
desired output. This training dataset includes inputs and correct outputs, which
allow the model to learn over time. The algorithm measures its accuracy
through the loss function, adjusting until the error has been sufficiently
minimized.
When data mining, supervised learning can be separated into two types of
problems: classification and regression.
 Classification uses an algorithm to accurately assign test data into
specific categories. It recognizes specific entities within the dataset and
attempts to draw some conclusions on how those entities should be
labeled or defined. Common classification algorithms are linear
classifiers, support vector machines (SVM), decision trees, k-nearest
neighbor, and random forest.
 Regression is used to understand the relationship between dependent and
independent variables. It is commonly used to make projections, such as
for sales revenue for a given business. Linear regression, logistic
regression, and polynomial regression are popular regression algorithms.

Supervised learning algorithms


Various algorithms and computational techniques are used in supervised
machine learning processes.

• Neural networks:
Primarily leveraged for deep learning algorithms, neural networks
process training data by mimicking the interconnectivity of the human brain
through layers of nodes. Each node is made up of inputs, weights, a bias (or
threshold), and an output. If that output value exceeds a given threshold, it
“fires” or activates the node, passing data to the next layer in the network.
Neural networks learn this mapping function through supervised learning,
adjusting based on the loss function through the process of gradient descent.
When the cost function is at or near zero, we can be confident in the model’s
accuracy to yield the correct answer.

• Naive bayes:
Naive Bayes is a classification approach that adopts the principle of class
conditional independence from the Bayes Theorem. This means that the
presence of one feature does not impact the presence of another in the
probability of a given outcome, and each predictor has an equal effect on that
result.
There are three types of Naïve Bayes classifiers: Multinomial Naïve
Bayes, Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
This technique is primarily used in text classification, spam identification,
and recommendation systems.
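
As a quick illustration, the following is a minimal R sketch of Naive Bayes
classification using the naiveBayes() function from the e1071 package
(assumed to be installed) on R's built-in iris dataset; the dataset and the
70/30 split are illustrative choices, not part of the text above.

library(e1071)

data(iris)
set.seed(42)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit a (Gaussian) Naive Bayes classifier on the labelled training data
model <- naiveBayes(Species ~ ., data = train)

# Predict class labels for the held-out test set and tabulate the results
pred <- predict(model, test)
table(Predicted = pred, Actual = test$Species)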

• Linear regression:
Linear regression is used to identify the relationship between a dependent
variable and one or more independent variables and is typically leveraged to
make predictions about future outcomes. When there is only one independent
variable and one dependent variable, it is known as simple linear regression. As
the number of independent variables increases, it is referred to as multiple linear
regression. For each type of linear regression, it seeks to plot a line of best fit,
which is calculated through the method of least squares. However, unlike other
regression models, this line is straight when plotted on a graph.
Linear regression is a statistical tool that is mainly used for predicting and
forecasting values based on historical information subject to some important
assumptions:
• A dependent variable and a set of independent variables are required.
• There exists a linear relationship between the dependent and the
independent variables, that is:

y = a1x1 + a2x2 + ... + apxp + b + e

where
 y : is the response variable.
 xj : is the predictor variable j, where j = 1, 2, 3, ..., p.
 e : is the error term, which is normally distributed with mean 0 and constant
variance.
 aj and b : are the regression coefficients to be estimated.
Regression is a technique used to identify the linear relationship between
target variables and explanatory variables. Other terms are also used to describe
these variables: one is called the predictor variable, whose value is gathered
through experiments, and the other is called the response variable, whose value
is derived from the predictor variable.
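
As a quick illustration, the following is a minimal R sketch of fitting a linear
regression with the base-R lm() function on the built-in mtcars dataset; the
variables mpg, wt, and hp are illustrative choices, not taken from the text above.

data(mtcars)

# Fit mpg (response) as a linear function of weight and horsepower (predictors)
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)   # estimated coefficients (the aj and b above) and fit statistics
coef(fit)      # just the fitted regression coefficients

# Predict the response for new predictor values (a hypothetical car)
predict(fit, newdata = data.frame(wt = 3.0, hp = 120))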

• Logistic regression:
While linear regression is leveraged when dependent variables are
continuous, logistic regression is selected when the dependent variable is
categorical, meaning they have binary outputs, such as "true" and "false" or
"yes" and "no." While both regression models seek to understand relationships
between data inputs, logistic regression is mainly used to solve binary
classification problems, such as spam identification.
In statistics, logistic regression is known to be a probabilistic
classification model. Logistic regression is widely used in many disciplines,
including the medical and social science fields. Logistic regression can be either
binomial or multinomial. It is very popular for predicting a categorical response.
Binary logistic regression is used in cases where the outcome for a dependent
variable has two possibilities, while multinomial logistic regression is concerned
with cases where there are three or more possible types.
Using logistic regression, the input values (x) are combined linearly using
weights or coefficient values to predict an output value (y) based on the log
odds ratio. One major difference between linear regression and logistic
regression is that in linear regression, the output value being modeled is a
numerical value while in logistic, it is a binary value (0 or 1)
The logistic regression equation can be given as follows:

P(y) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Where
P(y) is the predicted probability of the outcome for a given subject,
b0 is the bias or intercept term and
b1 is the coefficient for the single input value (x).
Each column in your input data has an associated b coefficient (a constant
real value) that must be learned from your training data.
It is quite simple to make predictions using logistic regression, since you
only need to plug numbers into the logistic regression equation to obtain the
output.
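
As a quick illustration, the following is a minimal R sketch of binary logistic
regression with the base-R glm() function on the built-in mtcars dataset, using
the 0/1 column am as the outcome; the variable choices are illustrative, not
taken from the text above.

data(mtcars)

# Fit a binomial (logistic) model: P(am = 1) as a function of weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

summary(fit)   # b0 (intercept) and the b coefficients, on the log-odds scale

# Predicted probabilities follow P(y) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
p <- predict(fit, type = "response")

# Convert probabilities to class labels with a 0.5 threshold
pred_class <- ifelse(p > 0.5, 1, 0)
table(Predicted = pred_class, Actual = mtcars$am)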

• Support vector machines (SVM):


A support vector machine is a popular supervised learning model
developed by Vladimir Vapnik, used for both data classification and regression.
That said, it is typically leveraged for classification problems, constructing a
hyperplane where the distance between two classes of data points is at its
maximum. This hyperplane is known as the decision boundary, separating the
classes of data points (e.g., oranges vs. apples) on either side of the plane.
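
As a quick illustration, the following is a minimal R sketch of an SVM
classifier using the svm() function from the e1071 package (assumed to be
installed) on the built-in iris dataset; the dataset and split are illustrative
choices.

library(e1071)

data(iris)
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit an SVM classifier; the radial (RBF) kernel gives a non-linear decision boundary
model <- svm(Species ~ ., data = train, kernel = "radial")

pred <- predict(model, test)
table(Predicted = pred, Actual = test$Species)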

• K-nearest neighbor:
K-nearest neighbor, also known as the KNN algorithm, is a non-
parametric algorithm that classifies data points based on their proximity and
association to other available data. This algorithm assumes that similar data
points can be found near each other. As a result, it seeks to calculate the
distance between data points, usually through Euclidean distance, and then it
assigns a category based on the most frequent category or average. Its ease of
use and low calculation time make it a preferred algorithm by data scientists,
but as the test dataset grows, the processing time lengthens, making it less
appealing for classification tasks. KNN is typically used for recommendation
engines and image recognition.
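
As a quick illustration, the following is a minimal R sketch of k-nearest
neighbor classification using the knn() function from the class package
(assumed to be installed) on the built-in iris dataset; k = 5 and the split are
illustrative choices.

library(class)

data(iris)
set.seed(7)
idx     <- sample(nrow(iris), 100)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]
train_y <- iris$Species[idx]

# Classify each test point by majority vote among its k = 5 nearest neighbours
# (Euclidean distance on the four numeric features)
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(Predicted = pred, Actual = iris$Species[-idx])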

• Random forest:
Random forest is another flexible supervised machine learning algorithm
used for both classification and regression purposes. The "forest" references a
collection of uncorrelated decision trees, which are then merged together to
reduce variance and create more accurate data predictions.
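
As a quick illustration, the following is a minimal R sketch using the
randomForest package (assumed to be installed) on the built-in iris dataset;
ntree = 500 is an illustrative choice.

library(randomForest)

data(iris)
set.seed(123)

# Grow 500 decision trees on bootstrap samples and aggregate their votes
model <- randomForest(Species ~ ., data = iris, ntree = 500)

print(model)        # out-of-bag error estimate and confusion matrix
importance(model)   # which features contribute most to the splits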

Supervised learning examples


Supervised learning models can be used to build and advance a number of
business applications, including the following:

 Image- and object-recognition: Supervised learning algorithms can be
used to locate, isolate, and categorize objects out of videos or images,
making them useful when applied to various computer vision techniques
and imagery analysis.
 Predictive analytics: A widespread use case for supervised learning
models is in creating predictive analytics systems to provide deep insights
into various business data points. This allows enterprises to anticipate
certain results based on a given output variable, helping business leaders
justify decisions or pivot for the benefit of the organization.
 Customer sentiment analysis: Using supervised machine learning
algorithms, organizations can extract and classify important pieces of
information from large volumes of data — including context, emotion,
and intent — with very little human intervention. This can be incredibly
useful when gaining a better understanding of customer interactions and
can be used to improve brand engagement efforts.
 Spam detection: Spam detection is another example of a supervised
learning model. Using supervised classification algorithms, organizations
can train databases to recognize patterns or anomalies in new data to
organize spam and non-spam-related correspondences effectively.

Challenges of supervised learning


Although supervised learning can offer businesses advantages, such as
deep data insights and improved automation, there are some challenges when
building sustainable supervised learning models.
The following are some of these challenges:
 Supervised learning models can require certain levels of expertise to
structure accurately.
 Training supervised learning models can be very time intensive.
 Datasets can have a higher likelihood of human error, resulting in
algorithms learning incorrectly.
 Unlike unsupervised learning models, supervised learning cannot cluster
or classify data on its own.

Unsupervised learning
Unsupervised learning, also known as unsupervised machine learning,
uses machine learning (ML) algorithms to analyze and cluster unlabeled data
sets. These algorithms discover hidden patterns or data groupings without the
need for human intervention. Unsupervised learning's ability to discover
similarities and differences in information makes it the ideal solution for
exploratory data analysis, cross-selling strategies, customer segmentation, and
image recognition.

Common unsupervised learning approaches


Unsupervised learning models are utilized for three main tasks —
clustering, association, and dimensionality reduction.

Clustering
Clustering is a data mining technique which groups unlabeled data based
on their similarities or differences. Clustering algorithms are used to process
raw, unclassified data objects into groups represented by structures or patterns
in the information. Clustering algorithms can be categorized into a few types,
specifically exclusive, overlapping, hierarchical, and probabilistic.
Clustering or cluster analysis is a machine learning technique, which
groups the unlabelled dataset. It can be defined as "A way of grouping the data
points into different clusters, consisting of similar data points. The objects with
the possible similarities remain in a group that has less or no similarities with
another group."
It does it by finding some similar patterns in the unlabelled dataset such
as shape, size, color, behavior, etc., and divides them as per the presence and
absence of those similar patterns.
It is an unsupervised learning method, hence no supervision is provided
to the algorithm, and it deals with the unlabeled dataset.
After applying this clustering technique, each cluster or group is given a
cluster ID. The ML system can use this ID to simplify the processing of large
and complex datasets.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world
example of a shopping mall. When we visit any shopping mall, we can observe
that things with similar usage are grouped together: t-shirts are grouped in one
section and trousers in another, and similarly, in the fruit and vegetable section,
apples, bananas, mangoes, etc., are grouped separately, so that we can easily
find what we need. The clustering technique works in the same way.
Another example of clustering is grouping documents according to
topic.
The diagram below explains the working of the clustering algorithm: the
different fruits are divided into several groups with similar properties.
The clustering technique can be widely used in various tasks. Some most
common uses of this technique are:
 Market Segmentation
 Statistical data analysis
 Social network analysis
 Image segmentation
 Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its
recommendation system to provide recommendations based on past product
searches. Netflix also uses this technique to recommend movies and web series
to its users based on their watch history.

Exclusive and Overlapping Clustering


Exclusive clustering is a form of grouping that stipulates a data point can
exist only in one cluster. This can also be referred to as “hard” clustering. The
K-means clustering algorithm is an example of exclusive clustering.
• K-means clustering is a common example of an exclusive clustering
method where data points are assigned into K groups, where K represents the
number of clusters based on the distance from each group’s centroid. The data
points closest to a given centroid will be clustered under the same category. A
larger K value will be indicative of smaller groupings with more granularity
whereas a smaller K value will have larger groupings and less granularity. K-
means clustering is commonly used in market segmentation, document
clustering, image segmentation, and image compression.
Overlapping clustering differs from exclusive clustering in that it allows data
points to belong to multiple clusters with separate degrees of membership.
“Soft” or fuzzy k-means clustering is an example of overlapping clustering.

Types of Clustering Methods


The clustering methods are broadly divided into Hard clustering (a data
point belongs to only one group) and Soft clustering (data points can also
belong to other groups). But various other clustering approaches also exist.
Below are the main clustering methods used in Machine
learning:
1. Partitioning Clustering: It is a type of clustering
that divides the data into non-hierarchical groups. It is also known as the
centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K is used
to define the number of pre-defined groups. The cluster center is created
in such a way that the distance between the data points of one cluster is
minimum as compared to another cluster centroid.
2. Density-Based Clustering: The density-based clustering method connects
the highly-dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset and
connects the areas of high densities into clusters. The dense areas in data
space are divided from each other by sparser areas. These algorithms can
face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
3. Distribution Model-Based Clustering: In the distribution model-based
clustering method, the data is divided based on the probability of how a
dataset belongs to a particular distribution. The grouping is done by
assuming some distribution, commonly the Gaussian distribution.
The example of this type is the Expectation-Maximization Clustering
algorithm that uses Gaussian Mixture Models (GMM).
4. Hierarchical Clustering: Hierarchical clustering, also known as
hierarchical cluster analysis (HCA), is an unsupervised clustering
algorithm that can be categorized in two ways: agglomerative or divisive.
 Agglomerative clustering is considered a “bottom-up” approach. Its data
points are isolated as separate groupings initially, and then they are
merged together iteratively on the basis of similarity until one cluster has
been achieved.
Four different methods are commonly used to measure similarity:
1. Ward’s linkage: This method states that the distance between two clusters
is defined by the increase in the sum of squared errors after the clusters are
merged.
2. Average linkage: This method is defined by the mean distance between
two points in each cluster.
3. Complete (or maximum) linkage: This method is defined by the
maximum distance between two points in each cluster.
4. Single (or minimum) linkage: This method is defined by the minimum
distance between two points in each cluster.
Euclidean distance is the most common metric used to calculate these
distances; however, other metrics, such as Manhattan distance and Minkowski
distance, are also used.

 Divisive clustering can be defined as the opposite of agglomerative
clustering; instead it takes a “top-down” approach. In this case, a single
data cluster is divided based on the differences between data points.
Divisive clustering is not commonly used, but it is still worth noting in
the context of hierarchical clustering. These clustering processes are
usually visualized using a dendrogram, a tree-like diagram that
documents the merging or splitting of data points at each iteration.
5. Fuzzy Clustering: Fuzzy clustering is a type of soft method in which a
data object may belong to more than one group or cluster. Each dataset
has a set of membership coefficients, which depend on the degree of
membership to be in a cluster. The Fuzzy C-means algorithm is an example of
this type of clustering; it is sometimes also known as the fuzzy k-means
algorithm.
6. Probabilistic clustering: A probabilistic model is an unsupervised
technique that helps us solve density estimation or “soft” clustering
problems. In probabilistic clustering, data points are clustered based on
the likelihood that they belong to a particular distribution. The Gaussian
Mixture Model (GMM) is one of the most commonly used
probabilistic clustering methods.
• Gaussian Mixture Models are classified as mixture models, which
means that they are made up of an unspecified number of probability
distribution functions. GMMs are primarily leveraged to determine which
Gaussian, or normal, probability distribution a given data point belongs to. If
the mean or variance are known, then we can determine which distribution a
given data point belongs to. However, in GMMs, these variables are not known,
so we assume that a latent, or hidden, variable exists to cluster data points
appropriately. While it is not required to use the Expectation-Maximization
(EM) algorithm, it is commonly used to estimate the assignment probabilities
for a given data point to a particular data cluster.

Clustering Algorithms
Clustering algorithms can be divided based on the models explained
above. Many different clustering algorithms have been published,
but only a few are commonly used. The choice of clustering algorithm depends
on the kind of data we are using: some algorithms require the number of
clusters in the given dataset to be specified, whereas others work by finding the
minimum distance between observations in the dataset.
Here we are discussing mainly popular Clustering algorithms that are
widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular
clustering algorithms. It classifies the dataset by dividing the samples into
different clusters of equal variances. The number of clusters must be
specified in this algorithm. It is fast, with few computations required and
linear complexity of O(n).
K-Means is an unsupervised machine learning algorithm which
aims at clustering data together, that is, finding clusters in data based on
similarity in the descriptions of the data and their relationships. Each
cluster is associated with a center point known as a centroid. Based on the
center, the distance of each point from each cluster center is calculated,
and the clusters are formed by assigning points to the closest centroid.
Various distance measures, such as Euclidean distance, squared Euclidean
distance, and the Manhattan (city-block) distance, are used to determine
which observation is assigned to which centroid. The number of clusters
is represented by the variable K. (A short R sketch of k-means and
hierarchical clustering follows this list.)

2. Mean-shift algorithm: The mean-shift algorithm tries to find the dense
areas in the smooth density of data points. It is an example of a centroid-
based model that works by updating the candidates for centroids to be the
center of the points within a given region.

3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model
similar to the mean-shift, but with some remarkable advantages. In this
algorithm, the areas of high density are separated by the areas of low
density. Because of this, the clusters can be found in any arbitrary shape.

4. Expectation-Maximization Clustering using GMM: This algorithm can be
used as an alternative to the k-means algorithm or for those cases where
k-means may fail. In GMM, it is assumed that the data points are
Gaussian distributed.

5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical
algorithm performs bottom-up hierarchical clustering. In this, each
data point is treated as a single cluster at the outset, and clusters are then
successively merged. The cluster hierarchy can be represented as a tree
structure.

6. Affinity Propagation: It is different from other clustering algorithms as it
does not require the number of clusters to be specified. In this, each pair
of data points exchanges messages until convergence. It has O(N²T) time
complexity, which is the main drawback of this algorithm.
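
As noted in the k-means item above, here is a minimal R sketch of k-means and
agglomerative hierarchical clustering using base-R functions on the numeric
columns of the built-in iris dataset; K = 3 and Ward's linkage are illustrative
choices.

data(iris)
x <- scale(iris[, 1:4])      # standardise features before distance-based clustering

# K-means (exclusive / partitioning clustering) with K = 3 clusters
set.seed(99)
km <- kmeans(x, centers = 3, nstart = 25)
table(Cluster = km$cluster, Species = iris$Species)

# Agglomerative hierarchical clustering with Ward's linkage on Euclidean distances
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE)     # dendrogram of the successive merges
cutree(hc, k = 3)            # cut the tree into 3 clusters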

Applications of Clustering
Below are some commonly known applications of clustering technique in
Machine Learning:
 In Identification of Cancer Cells: The clustering algorithms are widely
used for the identification of cancerous cells. It divides the cancerous and
non-cancerous data sets into different groups.
 In Search Engines: Search engines also work on the clustering technique.
The search result appears based on the closest object to the search query.
It does it by grouping similar data objects in one group that is far from the
other dissimilar objects. The accurate result of a query depends on the
quality of the clustering algorithm used.
 Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
 In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
 In Land Use: The clustering technique is used to identify areas of
similar land use in a GIS database. This can be very useful for
determining the purpose for which a particular area of land is most
suitable.
Association Rules
An association rule is a rule-based method for finding relationships
between variables in a given dataset. These methods are frequently used for
market basket analysis, allowing companies to better understand relationships
between different products. Understanding consumption habits of customers
enables businesses to develop better cross-selling strategies and
recommendation engines. Examples of this can be seen in Amazon’s
“Customers Who Bought This Item Also Bought” or Spotify’s "Discover
Weekly" playlist. While there are a few different algorithms used to generate
association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm
is most widely used.
Association Rule Mining is used when you want to find an association
between different objects in a set, find frequent patterns in a transaction
database, relational databases or any other information repository. The
applications of Association Rule Mining are found in Marketing, Basket Data
Analysis (or Market Basket Analysis) in retailing, clustering and classification.
It can tell you what items customers frequently buy together by generating a
set of rules called Association Rules. In simple words, it gives you output as
rules in the form “if this, then that”.
Clients can use those rules for numerous marketing strategies:
 Changing the store layout according to trends
 Customer behavior analysis
 Catalogue design
 Cross marketing on online stores
 What are the trending items customers buy
 Customized emails with add-on sales
Consider the following example:

Given is a set of transaction data. You can see transactions numbered 1 to 5.
Each transaction shows items bought in that transaction. You can see that
Diaper is bought with Beer in three transactions. Similarly, Bread is bought with
Milk in three transactions, making them both frequent itemsets. Association
rules are given in the form below:
A => B [Support, Confidence]
The part before => is referred to as if (Antecedent) and the part after => is
referred to as then (Consequent).
Where A and B are sets of items in the transaction data. A and B are
disjoint sets.
Computer => Anti-virus Software [Support = 20%, Confidence = 60%]
The above rule says:
1. 20% of transactions show Anti-virus software being bought with the
purchase of a Computer.
2. 60% of customers who purchase a Computer also buy Anti-virus
software.
Basic Concepts of Association Rule Mining

1. Itemset: A collection of one or more items. A k-itemset is a set of k
items.
2. Support Count: Frequency of occurrence of an itemset.
3. Support (s): Fraction of transactions that contain the itemset 'X'.

Market Basket Analysis using R


Learn about Market Basket Analysis & the APRIORI Algorithm that
works behind it. You'll see how it is helping retailers boost business by
predicting what items customers buy together. You are a data scientist (or
becoming one!), and you get a client who runs a retail store. Your client gives
you data for all transactions that consists of items bought in the store by several
customers over a period of time and asks you to use that data to help boost their
business. Your client will use your findings to not only change/update/add items
in inventory but also use them to change the layout of the physical store or
rather an online store. To find results that will help your client, you will use
Market Basket Analysis (MBA) which uses Association Rule Mining on the
given transaction data.

1. Apriori algorithms
Apriori algorithms have been popularized through market basket
analyses, leading to different recommendation engines for music platforms and
online retailers. They are used within transactional datasets to identify frequent
item sets, or collections of items, to identify the likelihood of consuming a
product given the consumption of another product. For example, if I play Black
Sabbath’s radio on Spotify, starting with their song “Orchid”, one of the other
songs on this channel will likely be a Led Zeppelin song, such as “Over the
Hills and Far Away.” This is based on my prior listening habits as well as the
ones of others. Apriori algorithms use a hash tree to count itemsets, navigating
through the dataset in a breadth-first manner.
Association Rule Mining is viewed as a two-step approach:
1. Frequent Itemset Generation: Find all frequent item-sets with support >=
pre-determined min_support count
2. Rule Generation: List all Association Rules from frequent item-sets.
Calculate Support and Confidence for all rules. Prune rules that fail
min_support and min_confidence thresholds.
Frequent Itemset Generation is the most computationally expensive step
because it requires a full database scan.
Above you have seen an example of only 5 transactions, but real-world
retail transaction data can run to gigabytes or terabytes, so an optimized
algorithm is needed to prune out itemsets that will not help in later steps. For
this, the Apriori algorithm is used.

Dimensionality reduction
While more data generally yields more accurate results, it can also impact
the performance of machine learning algorithms (e.g. overfitting) and it can also
make it difficult to visualize datasets. Dimensionality reduction is a technique
used when the number of features, or dimensions, in a given dataset is too high.
It reduces the number of data inputs to a manageable size while also preserving
the integrity of the dataset as much as possible. It is commonly used in the data
preprocessing stage, and there are a few different dimensionality reduction
methods that can be used, such as:

1. Principal component analysis


Principal component analysis (PCA) is a type of dimensionality reduction
algorithm which is used to reduce redundancies and to compress datasets
through feature extraction. This method uses a linear transformation to create a
new data representation, yielding a set of "principal components." The first
principal component is the direction which maximizes the variance of the
dataset. While the second principal component also finds the maximum
variance in the data, it is completely uncorrelated to the first principal
component, yielding a direction that is perpendicular, or orthogonal, to the first
component. This process repeats based on the number of dimensions, where a
next principal component is the direction orthogonal to the prior components
with the most variance.
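
As a quick illustration, the following is a minimal R sketch of PCA using the
base-R prcomp() function on the numeric columns of the built-in iris dataset.

data(iris)

# Centre and scale the features, then compute the principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pca)         # proportion of variance explained by each component
head(pca$x[, 1:2])   # the data projected onto the first two principal components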

2. Singular value decomposition


Singular value decomposition (SVD) is another dimensionality reduction
approach which factorizes a matrix, A, into three, low-rank matrices. SVD is
denoted by the formula A = U S V^T, where U and V are orthogonal matrices, S
is a diagonal matrix, and the values of S are considered the singular values of
matrix A.
Similar to PCA, it is commonly used to reduce noise and compress data, such as
image files.
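
As a quick illustration, the following is a minimal R sketch of singular value
decomposition using the base-R svd() function on a small example matrix.

# Decompose A into U, S (singular values), and V such that A = U %*% diag(S) %*% t(V)
A <- matrix(c(3, 2,  2,
              2, 3, -2), nrow = 2, byrow = TRUE)

s <- svd(A)
s$d   # singular values of A (the diagonal of S)
s$u   # orthogonal matrix U
s$v   # orthogonal matrix V

# Reconstruct A from its factors to verify the decomposition
round(s$u %*% diag(s$d) %*% t(s$v), 10)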

3. Autoencoders
Autoencoders leverage neural networks to compress data and then
recreate a new representation of the original data’s input. The hidden layer
specifically acts as a bottleneck to compress the input layer prior to
reconstructing it within the output layer. The stage from the input layer to the
hidden layer is referred to as “encoding” while the stage from the hidden layer
to the output layer is known as “decoding”.

Applications of unsupervised learning


Machine learning techniques have become a common method to improve
a product user experience and to test systems for quality assurance.
Unsupervised learning provides an exploratory path to view data, allowing
businesses to identify patterns in large volumes of data more quickly when
compared to manual observation. Some of the most common real-world
applications of unsupervised learning are:
 News Sections: Google News uses unsupervised learning to categorize
articles on the same story from various online news outlets. For example,
the results of a presidential election could be categorized under their label
for “US” news.
 Computer vision: Unsupervised learning algorithms are used for visual
perception tasks, such as object recognition.
 Medical imaging: Unsupervised machine learning provides essential
features to medical imaging devices, such as image detection,
classification and segmentation, used in radiology and pathology to
diagnose patients quickly and accurately.
 Anomaly detection: Unsupervised learning models can comb through
large amounts of data and discover atypical data points within a dataset.
These anomalies can raise awareness around faulty equipment, human
error, or breaches in security.
 Customer personas: Defining customer personas makes it easier to
understand common traits and business clients' purchasing habits.
Unsupervised learning allows businesses to build better buyer persona
profiles, enabling organizations to align their product messaging more
appropriately.
 Recommendation Engines: Using past purchase behavior data,
unsupervised learning can help to discover data trends that can be used to
develop more effective cross-selling strategies. This is used to make
relevant add-on recommendations to customers during the checkout
process for online retailers.

Challenges of unsupervised learning


While unsupervised learning has many benefits, some challenges can occur
when it allows machine learning models to execute without any human
intervention.
Some of these challenges can include:
• Computational complexity due to a high volume of training data
• Longer training times
• Higher risk of inaccurate results
• Human intervention to validate output variables
• Lack of transparency into the basis on which data was clustered

Collaborative Filtering
Collaborative filtering is a technique that can filter out items that a
user might like on the basis of reactions by similar users. It works by
searching a large group of people and finding a smaller set of users with
tastes similar to a particular user.
Collaborative filtering is a method used by recommender systems to
make automatic predictions about a user’s interests by collecting preferences
from many users (collaborating).
The underlying assumption is that if person A has a similar opinion as
person B on one issue, A is more likely to have B’s opinion on a different issue
than that of a randomly chosen person.

What is a Recommendation system?


There are a lot of applications where websites collect data from their
users and use that data to predict the likes and dislikes of their users. This
allows them to recommend the content that they like.
Recommender systems are a way of suggesting similar items and ideas
suited to a user’s specific way of thinking.
There are basically two types of recommender Systems:
 Collaborative Filtering: Collaborative Filtering recommends items
based on similarity measures between users and/or items. The basic
assumption behind the algorithm is that users with similar interests have
common preferences.
 Content-Based Recommendation: It is supervised machine learning
used to induce a classifier to discriminate between interesting and
uninteresting items for the user.

Overview of Collaborative Filtering


Collaborative filtering is integral to the recommendation engines of many
online services, including e-commerce websites, streaming services, and social
media platforms.
It leverages the power of user data to provide personalized
recommendations, enhancing user experience and engagement.

Types of Collaborative Filtering


Collaborative filtering can be broadly categorized into two types:
user-based and item-based filtering.
 User-based Collaborative Filtering: This method finds similarities
between users.
For example, if user A and user B both rate several items similarly, they
are considered similar. Future recommendations for user A will include
items that user B liked, but user A has not yet rated or seen.
 Item-based Collaborative Filtering: This approach finds similarities
between items. If item X and item Y receive similar ratings from users,
they are considered similar. If a user likes item X, the system will
recommend item Y.
Benefits of Collaborative Filtering
 Personalization: Provides highly personalized recommendations based
on user behavior.
 Scalability: Can handle large datasets effectively, making it suitable for
large-scale applications.
 Implicit Feedback: Can work with implicit feedback, such as clicks or
view times, not just explicit ratings.

Challenges and Limitations


 Cold Start Problem: New users or items with no interactions pose a
challenge as the system lacks data to make accurate predictions.
 Sparsity: In large datasets, users interact with only a small fraction of
items, leading to sparse matrices that can hinder the effectiveness of the
algorithm.
 Collaborative recommender systems face two major challenges:
scalability and ensuring quality recommendations to the consumer.
 Scalability is important, because e-commerce systems must be able to
search through millions of potential neighbours in real time. If the site is
using browsing patterns as indications of product preference, it may have
thousands of data points for some of its customers.
 Ensuring quality recommendations is essential in order to gain
consumers’ trust. If consumers follow a system recommendation but then
do not end up liking the product, they are less likely to use the
recommender system again.
 As with classification systems, recommender systems can make two types
of errors: false negatives and false positives.
 Here, false negatives are products that the system fails to recommend,
although the consumer would like them.
 False positives are products that are recommended, but which the
consumer does not like. False positives are less desirable because they
can annoy or anger consumers.

Applications of Collaborative Filtering


Collaborative filtering is widely used across various industries to enhance
user experience and increase engagement.
Some common applications include:
 E-commerce: Amazon uses collaborative filtering to recommend
products based on users’ purchase history and ratings.
 Streaming Services: Netflix and Spotify utilize collaborative filtering to
suggest movies, TV shows, or music tracks that align with users’ tastes.
 Social Media: Platforms like Facebook and Twitter use collaborative
filtering to suggest friends or content that users might find interesting.
 Dimension reduction, association mining, clustering, and Bayesian
learning are some of the techniques that have been adapted for
collaborative recommender systems. While collaborative filtering
explores the ratings of items provided by similar users, some
recommender systems explore a content-based method that provides
recommendations based on the similarity of the contents contained in an
item. Moreover, some systems integrate both content-based and user-
based methods to achieve further improved recommendations.

Implementing Collaborative Filtering


Implementing collaborative filtering involves several steps, from data
collection to recommendation generation. Here is a step-by-step guide:
Step 1: Data Collection
Gather user-item interaction data. This can be explicit, like ratings and
reviews, or implicit, like clicks and view times.
Step 2: Data Preprocessing
Clean and preprocess the data. This may involve normalizing ratings,
handling missing values, and converting the data into a suitable format for
analysis.
Step 3: Similarity Calculation
Choose a similarity metric to compute the similarities between users or
items. Common metrics include:
 Cosine Similarity: Measures the cosine of the angle between two
vectors.
 Pearson Correlation: Measures the linear correlation between two sets
of data.
 Euclidean Distance: Measures the straight-line distance between two
points in Euclidean space.
Step 4: Prediction
Use the similarity scores to predict the ratings or preferences for a user-
item pair. This can be done using techniques like weighted average or k-nearest
neighbors.
Step 5: Recommendation Generation
Generate a list of recommendations for each user based on the predicted
ratings. This can be done by selecting the top-N items with the highest predicted
ratings.
Step 6: Evaluation
Evaluate the performance of the collaborative filtering algorithm using
metrics like precision, recall, and F1-score. This helps in fine-tuning the model
and improving its accuracy.
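
To tie the steps together, here is a minimal R sketch of user-based collaborative
filtering with cosine similarity on a small, hypothetical user-item rating matrix;
the rating values, the cosine_sim() helper, and the prediction formula are
illustrative assumptions, not part of the text above.

# Hypothetical user-item rating matrix (rows = users, columns = items; 0 = not rated)
ratings <- matrix(c(5, 3, 0, 1,
                    4, 0, 0, 1,
                    1, 1, 0, 5,
                    0, 1, 5, 4),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4), paste0("item", 1:4)))

# Step 3: cosine similarity between two rating vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Pairwise user-user similarity matrix
n   <- nrow(ratings)
sim <- outer(1:n, 1:n,
             Vectorize(function(i, j) cosine_sim(ratings[i, ], ratings[j, ])))
dimnames(sim) <- list(rownames(ratings), rownames(ratings))
round(sim, 2)

# Step 4: predict user1's rating of item3 as a similarity-weighted average of the
# other users' ratings for that item
others <- 2:n
sum(sim[1, others] * ratings[others, "item3"]) / sum(sim[1, others])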

Advanced Techniques in Collaborative Filtering


To overcome some of the challenges and limitations, advanced techniques
have been developed:
 Matrix Factorization: Techniques like Singular Value Decomposition
(SVD) decompose the user-item interaction matrix into lower-
dimensional matrices, capturing latent factors that influence user
preferences.
 Hybrid Approaches: Combining collaborative filtering with content-
based filtering or other techniques to improve recommendation
accuracy and address the cold start problem.
 Deep Learning: Utilizing neural networks to model complex interactions
between users and items, enhancing the quality of recommendations.

Collaborative filtering — comprehensive understanding


Again, collaborative filtering is the generic term for algorithms that use
explicit and implicit ratings and compute similarities between those ratings.

What are explicit and implicit ratings?


 Explicit: users specify(score) how much they liked a product, like
Amazon’s product rating or Netflix’s movie rating.
 Implicit: based on the user’s behavior. For example, if a user buys
something or watches a particular movie, we think the user is interested.
Explicit rating image from the author

Implicit rating image from the author


In a practical story, collaborative filtering uses the user-item rating matrix
that we can get from the above data. The figures below are examples of user-
item rating matrices corresponding with the images above.

Explicit rating matrix image from the author

Implicit rating matrix image from the author


As you can see, the user-item matrix based on explicit rating has
numerical values. On the other hand, the user-item matrix based on implicit
rating has binary values instead.
After we understand the details of the user-item rating matrix, or the input
data for collaborative filtering, we must comprehend how we use it to compute
similarities.
There are two paths to calculate similarities.
When we focus on the users (rows in the user rating matrix), we compare
the rating vector between a user and a user, and it is called user-user
collaborative filtering.
When we focus on the items (columns in the user rating matrix), we
compare the rating vector between an item and an item, and it is called item-
item collaborative filtering.
Intuitively, if the rating vector of each user is similar, it means that users’
preferences are similar.
Also, if the rating vector of each item is similar, it tells us the item is liked
by similar users.

User-User similarities image from the author


Item-Item similarities image from the author

Although you can use either approach depending on the data you have, you
should consider the amount of computation in a practical setting. You should
use item-item collaborative filtering if you have more users than items, and
user-user collaborative filtering if you have more items than users.
Social Media Analytics
In this era of social media and networking, Social Media analytics is a
process for the extraction of unseen and unknown insights from the abundant
data available worldwide.
Social media analytics is the ability to gather and find meaning in
data gathered from social channels to support business decisions and measure
the performance of actions based on those decisions through social media.
It is considered a science as it methodically involves the identification,
extraction, and evaluation of social media data using various tools and methods.
It is also an art for interpreting insights obtained with business goals and
objectives. It focuses on seven layers of social media: text, networks, actions,
hyperlinks, mobile, location, and search engines. Various tools for social media
analytics include Discovertext, Lexalytics, Netlytic, Google Analytics, Network
NodeXL, Netminer, and many more
Social media analytics is broader than metrics such as likes,
follows, retweets, previews, clicks, and impressions gathered from
individual channels. It also differs from reporting offered by services that
support marketing campaigns such as LinkedIn or Google Analytics.
Social media analytics uses specifically designed software platforms
that work similarly to web search tools.
Data about keywords or topics is retrieved through search queries or
web ‘crawlers’ that span channels. Fragments of text are returned, loaded
into a database, categorized and analyzed to derive meaningful insights.

What is the Importance of Social Media Analytics?


Improve productivity of your organization:
By using various tools to analyze social media, companies can summarize
customer reviews and formulate strategies to improve the quality of products,
thereby increasing the productivity of the organization. The profitability of the
organization can be enhanced by identifying loopholes and making
improvements in the weaker sectors.

To analyze potential competition:


Utilizing various analytics tools also helps in identifying competitors in
the market. It assists in focusing on methodologies to achieve better results.
Comparison charts provide insights about the organization's brand and its
standing relative to competitors in the market.

To enhance customer reach:


Managing the customer journey through social analytics is crucial for
retaining customers. Constantly engaging with your consumers enhances social
presence and understanding, leading to further improvements for your business.
It considers semi-structured and unstructured data, summarizing it to identify
meaningful insights. Engagement rate tracks how people are involved with your
content and campaigns.

Improve product quality:


Customers often provide product reviews on social media platforms.
Companies analyze these reviews and feedback to enhance product quality.
Non-positive comments can be used by organizations to improve negative
aspects, thus enhancing the overall customer experience. Customer feedback
and complaints provide opportunities for improvements.

Strategic decision-making:
Social Media Analysis also aids in trend analysis and the identification of
high-value features for a brand. It gauges responses to social media and other
communications, facilitating meaningful decision-making for organizations to
improve productivity.

Sentiment analysis:
Comments and reviews about products and services are collected,
extracted, cleaned, and analyzed using various tools. Categorization of these
comments reveals the intention about the brand. Natural language processing
methodologies are employed to understand the intensity and group comments
into positive, negative, or neutral categories regarding a product or service.
Summarization charts about customer sentiment reveal future prospects for
product usage and guide corrective actions accordingly.

Types of social media analytics


There are several different types of social media analytics you should
monitor in your social media dashboard that will guide your strategy and
discover valuable insights. We'll walk you through the six main types of
analytics below.

Performance analysis/metrics
Measuring the performance of social media marketing efforts is critical to
understanding where strategic efforts are working and where improvement is
needed.
Key performance metrics to track include the following:
 interactions across platforms and over time to determine if the
posted content is properly engaging the audience;
 whether the number of followers is increasing over time to verify
consistent progress across platforms; and
 click-through rate for link clicks on posts to see if they're properly
driving traffic from social media channels.
First and foremost, you need to measure the overall performance of your
social media efforts. This includes social media metrics including:
 Impressions
 Reach
 Likes
 Comments
 Shares
 Views
 Clicks
 Sales

Audience analytics
It's important to clearly understand and define the target audience, as it is
the most important element of a social media strategy. Understanding the
audience will help create a favorable customer experience with content targeted
at what customers want and what they're looking for.
In the past, audience data was difficult to measure as it was scattered
across multiple social media platforms. But with analytics tools, marketers can
analyze data across platforms to better understand audience demographics,
interests and behaviors. AI-enabled tools can even help predict customer
behavior. They can also study how an audience changes over time.
The better targeted the content is, the less advertising will cost and the
cost-per-click of ads can be optimized.
Audience analytics will include data like:
 Age
 Gender
 Location
 Device
Competitor analysis
To obtain a full understanding of performance metrics, it's necessary to
look at the metrics through a competitive lens. In other words, how do they
stack up to competitors' performance?
With social media analytics tools, social media performance can be
compared to competitors' performance with a head-to-head analysis to gauge
relative effectiveness and to determine what can be improved.
Most modern tools that include AI capabilities can benchmark competitor
performance by industry to determine a good starting point for social media
efforts.
Another key area to look into is how your competitors perform on social
media. How many followers do they have? What is their engagement rate?
How many people seem to engage with each of their posts?
You can then compare this data to your own to see how you stack up—as
well as set more realistic growth goals.

Paid social analytics


Ad spending is serious business. If targeting and content aren't right, paid
social can end up being an expensive investment in unsuccessful content. More advanced
analytics tools can often predict which content is most likely to perform well
and be a less risky investment for a marketing budget.
For best results, an all-in-one platform is the preferred choice to track
performance across all social media accounts such as Twitter analytics, paid
Facebook posts or LinkedIn ads.
When you're putting money behind specific social media posts, you want
to make sure they're driving results. This is why you absolutely need to pay
close attention to your paid social analytics.
Some of the most important ad analytics to measure include:
 Total number of active ads
 Clicks
 Click-through rate
 Cost-per-click
 Cost-per-engagement
 Cost-per-action
 Conversion rate
 Total ad spend
These metrics will indicate exactly where each dollar spent is going and how
much return is being generated for social media efforts. They can also be
compared against competitor spending to ensure that spending is at an
appropriate level and to reveal strategic opportunities where an increased share
of voice may be attainable.

Influencer analysis
To gain a leg up on competition in a competitive space, many social
media marketers will collaborate with social influencers as part of their
marketing campaigns. To make the most of partnerships, it's necessary to
measure key metrics to ensure that the Influencer marketing is achieving desired
goals.
Social media analytics can provide insights into the right metrics to
ensure that influencer campaigns are successful.
Some influencer metrics that should be tracked include the following:
 total interactions per 1,000 followers to understand if they're properly
generating engagement;
 audience size and most frequently used hashtags, to help determine the
maximum reach of your campaign;
 the number of posts influencers create on a regular basis, to help
determine how active they are and how powerful engagement can be; and
 past collaborations, which can be a great indicator of the potential for
success with an influencer.
If you're running influencer marketing campaigns, tracking the success of
these partnerships is essential to proving ROI. We recommend using the five
W’s + H of influencer marketing to inform your strategy and measure ROI at
each stage of the buyer journey.
Some of the data you'll want to keep track of includes:
 Number of posts created per influencer
 Total number of interactions per post
 Audience size of each influencer
 Hashtag usage and engagement
This can help you gauge overall engagement from your influencer
campaigns. If you have an affiliate marketing program, you can designate
promo codes for each individual influencer to use so your team can track how
many sales each partner drives as well.

Sentiment analysis
Sentiment analysis is an important metric to measure as it can indicate
whether a campaign is gaining favorability with an audience or losing it. And
for customer service oriented businesses, sentiment analysis can reveal potential
customer care issues.
To ensure that a campaign is in sync with the target audience and
maintains a strong rate of growth, interactions and engagement rate should be
tracked over time. A decline could indicate that a change of course is needed.
Gathering and analyzing customer sentiment can help avoid guesswork in
developing a marketing strategy and deciding which content will resonate best
with the audience. This type of analysis can also indicate the type of content
that's likely to have a positive impact on customer sentiment. If your social
media analytics tool detects a spike in negative sentiment, action should be
taken immediately to address and correct it before it becomes a PR nightmare.

Key capabilities of effective social media analytics
Once the data to be analyzed has been identified, topics or keywords can be selected and parameters such as
date range can be set. Sources also need to be specified — responses to
YouTube videos, Facebook conversations, Twitter arguments, Amazon
product reviews, comments from news sites.
 Natural language processing and machine learning technologies
identify entities and relationships in unstructured data — information
not pre-formatted to work with data analytics. Virtually all social media
content is unstructured. These technologies are critical to deriving
meaningful insights.
 Segmentation is a fundamental need in social media analytics. It
categorizes social media participants by geography, age, gender, marital
status, parental status and other demographics. It can help identify
influencers in those categories. Messages, initiatives and responses can
be better tuned and targeted by understanding who is interacting on key
topics.
 Behavior analysis is used to understand the concerns of social media
participants by assigning behavioral types such as user, recommender,
prospective user and detractor. Understanding these roles helps develop
targeted messages and responses to meet, change or deflect their
perceptions.
 Sentiment analysis measures the tone and intent of social media
comments. It typically involves natural language processing
technologies to help understand entities and relationships to reveal
positive, negative, neutral or ambivalent attributes.
 Share of voice analyzes prevalence and intensity in conversations
regarding brand, products, services, reputation and more. It helps
determine key issues and important topics. It also helps classify
discussions as positive, negative, neutral or ambivalent.
 Clustering analysis can uncover hidden conversations and unexpected
insights. It makes associations between keywords or phrases that
appear together frequently and derives new topics, issues and
opportunities. The people that make baking soda, for example,
discovered new uses and opportunities using clustering analysis.
 Dashboards and visualization charts, graphs, tables and other
presentation tools summarize and share social media analytics
findings — a critical capability for communicating and acting on
what has been learned. They also enable users to grasp meaning and
insights more quickly and look deeper into specific findings without
advanced technical skills.

Social media analytics tools
 While many businesses use some sort of social media management tool,
most of these baseline scheduling tools don't go far enough to provide the
in-depth metrics and data points that social media analytics tools can
deliver.
 Not only can this deeper level of insight go a long way to inform a
successful campaign, it can also be shared with stakeholders to show
high-level ROI across disparate social media channels.
 An effective analytics tool will have an intuitive, easy-to-use interface
that enables transparency in a campaign; it should also streamline the
social media marketing processes and workflows.
 Examples of social media analytics tools include Sprout Social, Google
Analytics, Hootsuite and Buffer Analyze.
Mobile Analytics
Mobile analytics involves measuring and analysing data generated by
mobile platforms and properties, such as mobile sites and mobile
applications. AT Internet's analytics solution lets you track, measure and
understand how your mobile users are interacting with your mobile sites and
mobile apps.

Why do companies use mobile analytics?
Mobile analytics gives companies unparalleled insights into the
otherwise hidden lives of app users. Analytics usually comes in the form of
software that integrates into companies’ existing websites and apps to
capture, store, and analyze the data. This data is vitally important to
marketing, sales, and product management teams who use it to make more
informed decisions.
Without a mobile analytics solution, companies are left flying blind.
They’re unable to tell what users engage with, who those users are, what brings
them to the site or app, and why they leave.

Why are mobile analytics important?
Mobile usage surpassed that of desktop in 2015 and smartphones are
fast becoming consumers’ preferred portal to the internet. Consumers spend 70
percent of their media and screen time on mobile devices,
and most of that time in mobile apps.
This is a tremendous opportunity for companies to reach their
consumers, but it’s also a highly saturated market. There are more than
6.5 million apps in the major mobile app stores, millions of web apps, and more
than a billion websites in existence.
Companies use mobile analytics platforms to gain a competitive edge in
building mobile experiences that stand out. Mobile analytics tools also give
teams a much-needed edge in advertising.
As more businesses compete for customers on mobile, teams need
to understand how their ads perform in detail, and whether app users who
interact with ads end up purchasing.

How do mobile analytics work?
Mobile analytics typically track:
 Page views
 Visits
 Visitors
 Source data
 Strings of actions
 Location
 Device information
 Login / logout
 Custom event data
Companies use this data to figure out what users want in order to deliver
a more satisfying user experience.
For example, they’re able to see:
 What draws visitors to the mobile site or app
 How long visitors typically stay
 What features visitors interact with
 Where visitors encounter problems
 What factors are correlated with key outcomes
How different teams use mobile analytics:
 Marketing: Tracks campaign ROI, segments users, automates marketing
 UX/UI: Tracks behaviors, tests features, measures user experience
 Product: Tracks usage, A/B tests features, debugs, sets alerts
 Technical teams: Track performance metrics such as app crashes

How to implement mobile analytics
Mobile analytics platforms vary widely in features and functionality.
Some free applications have technical limitations and struggle with tracking
users as they move between mobile websites and apps. A top tier mobile
analytics platform should be able to:
 Integrate easily: With a codeless mobile feature, for instance
 Offer a unified view of the customer: Track data across operating systems,
devices, and platforms
 Measure user engagement: For both standard and custom-defined events
 Segment users: Create cohorts based on location, device, demographics,
behaviors, and more
 Offer dashboards: View data and surface insights with customizable reporting
 A/B test: Test features and messaging for performance
 Send notifications: Alert administrators and engage users with behavior-
based messaging such as push notifications and in-app messages
 Out-of-the-box metrics: Insights with minimal client-side coding
 Real-time analytics: Proactively identify user issues
 Reliable infrastructure: Guaranteed uptime for consistent access to the
platform
The actual installation of mobile analytics involves adding tracking code to
the sites and SDKs to the mobile applications teams want to track. Most mobile
analytics platforms will be set up to automatically track website visits.
Platforms with codeless mobile features will be able to automatically track
certain basic features of apps such as crashes, errors, and clicks, but you’ll
want to expand that by manually tagging additional actions for tracking. With
mobile analytics in place, you’ll have deeper insights into your mobile web
and app users which you can use to create competitive, world-class products
and experiences.

The Challenges of Mobile Analytics
 Because mobile analytics is a fairly new field of analytics and continues
to change with rapidly changing consumer expectations, there are
many challenges to be faced in implementing it.
 Collecting the data necessary for successful mobile analytics is often
the greatest challenge organizations face when attempting to
understand consumer behavior on mobile devices. Many devices do
not allow cookies to track actions or do not support JavaScript, which can
also help with website data tracking.

The Benefits of Mobile Analytics
Despite the challenges of mobile analytics, it’s essential for modern
businesses to invest in, and it can lead to many opportunities for the business.
 Measure user engagement: For both standard and custom-defined events
 Segment users: Create cohorts based on location, device, demographics,
behaviors
 Out-of-the-box metrics: Insights with minimal client-side coding
 Real-time analytics: Proactively identify user issues
 Reliable infrastructure: Guaranteed uptime for consistent access to the
platform
Introduction to Big Data Analytics
Analyzing Big Data follows a different path from traditional systems. Big
Data analytics refers to “a set of procedures and statistical models to extract the
information from a large variety of data sets”. Big data analytics can provide
valuable insights that may yield substantial advantages.
This section highlights the main Big Data techniques:
 Text Analytics: A vast proportion of unstructured data comes from social
media, email and newspapers and is therefore in textual format. Text
analytics derives information from these textual sources. Modern text
analytics makes use of statistical models and text mining to extract
valuable information from vast amounts of data.
 In Memory Analytics: In-memory analytics is an approach used “for
querying data when it resides in a computer’s random access memory
(RAM), as opposed to querying data stored on physical disks”. The
adoption of in-memory analytics has led to a paradigm shift where the
technique has resulted in faster query and calculation and improved
performance. This has resulted in quicker decision making for businesses.
 Graph Analytics: Graph analytics is another technique that is widely
adopted to analyse large volumes of data. It studies the behavior of
connected components such as social networks. Additionally, the
technique extracts intelligence between data sets by inferring paths
through complex relationships. A number of graph analytics frameworks
such as GraphLab, CombBLAS, Giraph, SociaLite and Galois exist.
 Statistical methods: Statistical methods are used “to exploit relationships
and causal relationships between different objectives”. However,
traditional statistical methods are not appropriate to manage Big Data.
 Data Mining: “Data mining has formed a branch of applied artificial
intelligence” and its main purpose is to retrieve required data from large
amounts of data. Various techniques, such as classification,
clustering and pattern matching, are used in data mining. Different
algorithms such as k-means clustering, decision forest algorithms and
regression trees are available for data processing, calculation and
reasoning. The algorithms for data mining comprise three parts, namely
the model, the preference criterion and the search algorithm. The model
can be either classification or clustering, and the search algorithm is used
to find a particular model or attributes. Data mining also involves dynamic
prediction, which is suitable for healthcare purposes such as
diagnosis.
 Machine Learning: Machine learning (ML) is defined as a “field of study
that gives computers the ability to learn without being explicitly
programmed”. It is a branch of artificial intelligence (AI) that uses
various statistical, probabilistic and optimization techniques to allow
computers to “learn” from previous examples. It is used to detect
complex patterns from huge and complex data sets. ML concepts are used
to enable applications to take a decision from the available datasets. ML
is a field that has covered nearly every scientific domain, which has
eventually had a great impact on science and society.
 Social Media Analytics: Social Media analytics refer to “the analysis of
structured and unstructured data from social media channels”. Social
media analytics are classified into content-based analytics and structure
based analytics. Content-based analytics analyse data posted by users
whereas structure based analytics analyse data with respect to structural
attributes of a social network and determine links and intelligence
between relationships among the participating entities. A number of
techniques have emerged to extract data from the structure of a social
network. Some of these include community detection, social influence
analysis and link prediction.
 Predictive Analytics: Predictive analytics aim to uncover patterns and
relationships in data by using the techniques discussed above, such as
optimization methods, statistical methods, data mining, machine learning,
visualization approaches and social media analytics. Predictive analytics
seek to predict the future by analyzing current and historical data.

Big Data Analytics Lifecycle
Data science projects are different from most traditional Business
Intelligence. They are more exploratory in nature. Thus, it is of utmost
importance to consider a process that can govern the development and allow
exploration. It is useful to consider a framework that helps organize the
work and obtain a clear insight into the big data. Thus, a
framework consisting of several stages has been identified so that the expected
output can be obtained. This framework can be termed the big data analytics
lifecycle. The stages of this lifecycle are not strictly linear; they are related to
each other. The data analytics processes defined in the lifecycle should
be followed sequentially so that proper mining and analytics are achieved. The
main processes of the data analytics lifecycle are described below.
Identification of the Problem
This phase is also termed as discovery. In this process, the main focus is
understanding the requirements and objectives of the project from a business
perspective. It is essential for the team to understand the business domain and to
convert this knowledge into a data mining problem definition. An initial plan is
devised and a decision model is often used. In this part, the team also assesses
the resources available to support the project in terms of people, technology,
time, and data.

Designing the data requirement
In this phase, the datasets to be used for the data analytics are identified.
Data to be used are collected, and the attributes of the datasets are defined
based on the domain and the problem specification. This phase is important
since it allows the user to discover first insights into the data and to determine
relevant and interesting subsets to form hypotheses for hidden information.

Data Preparation (Preprocessing of the data)
The final outcome of the data preparation phase is to obtain the final
dataset that will be used for modelling. In data analytics, various formats of data
are required at different times, that is, different applications require
different data sources, algorithms and attributes. Thus, it is important to provide
the data in a format that all the algorithms and data tools can use. In this
phase, the data is cleansed, aggregated, augmented, sorted and formatted. Thus,
after preprocessing, a fixed data set format is generated.

Data Modelling (Data Analytics)
Data modelling, also known as data analytics, is performed to discover
meaningful information from the data. There are several techniques such as
regression, classification and clustering, that can be used to analyse and
discover patterns in data. Likewise, the same algorithms can be used for big
data by sending the data analytics to a MapReduce job. Data analytics
enable organisations to deal with large volumes of data. In this phase, the user
understands the relationships between the features and consequently devises data
mining methods that can be used for prediction. Machine learning techniques
are capable of discovering patterns in very large datasets and this is useful for
decision making.

Data Visualisation
Data visualisation is the phase where the output of data analytics is
displayed. Visualisation is an interactive way of representing the results. Plots
and charts can be used to visualize the data by using the required packages
available in the data visualisation software.

Big Data Analytics Problems
Various packages are available in R that, combined with the computational power
of Hadoop, support analytics and prediction. Big Data Analytics and Deep Learning
are gaining much attention in data science. Companies are in
possession of huge amounts of information regarding problems such as national
intelligence, cyber security, fraud detection, marketing, and medical
informatics. Deep Learning algorithms can analyze massive amounts of
unsupervised data and are thus becoming an important tool for data analytics.
Complex patterns are extracted, which are eventually helpful for decision
making. Analysis is done on large data sets to uncover correlations, find
business trends, combat crime and prevent diseases, amongst others.
1. Semantic Indexing
There exist different types of data namely text, audio, video and images that
are available across social networks, marketing applications, shopping systems,
security systems and fraud detection applications, amongst others. The efficient
storage and retrieval of this information is becoming a challenging task.
Thus, semantic indexing can be used instead of storing the data as big
strings. Semantic indexing stores the data in a more efficient way, thus making
it useful for discovery. Complex associations between the data and factors can
be depicted. For semantic indexing, deep learning can be used to generate high-
level abstractions of the data instead of using raw input. The packages available
in R and other data analytics software are capable of uncovering the underlying
trends and patterns of the data.

2. Exploring web pages categorization
In web analytics, it is important to determine the importance of web pages.
Based on information such as content, colours, design, number of visits, ease of
navigation and other details, the webpages can be customized accordingly.
Data with respect to the date, source, title and page path are captured.
After data collection, the data is subjected to the MapReduce algorithm. Depending on
the popularity of the website, it can be categorized as high, medium, or low.
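As a simplified, local illustration of the categorization step described above (in practice the job would run as MapReduce over the full dataset), the following R sketch uses a hypothetical page_stats data frame and the base cut() function to label pages as high, medium or low popularity; the column names and visit thresholds are assumptions for illustration only.
# Hypothetical page-level data: one row per page with its visit count
page_stats <- data.frame(
  page   = c("/home", "/products", "/blog/post-1", "/contact"),
  visits = c(12000, 4300, 950, 120)
)
# Categorize popularity as low, medium or high using visit-count thresholds
page_stats$popularity <- cut(
  page_stats$visits,
  breaks = c(-Inf, 1000, 10000, Inf),
  labels = c("low", "medium", "high")
)
page_stats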

3. Prediction of Cancer
Despite the rapid advancement in technology, the early detection and
prognosis of cancer is still a challenge. Detection of cancer is concerned with
the analysis of petabytes of data. This involves high dimensional data, which are
collected from various sources such as scientific experiments, literature,
computational analysis and research. Prognosis is used to determine the
survival pattern using various attributes such as the specific drug administered to a
patient, the treatment given and the patient's response. Large volumes of data are
involved, and thus data analytics can be used to determine trends and patterns which will
eventually help doctors in taking the proper decisions. Data mining techniques
can be used to determine trends and acquire knowledge using the information
available.
R for Big Data Analytics
Introduction
Big data analytics has become an integral part of decision-making and
business intelligence across various industries. With the exponential growth of
data, organizations need robust tools and techniques to extract meaningful
insights.
R, a powerful programming language and software environment, has gained
popularity for its extensive capabilities in data analysis and statistical
computing.

Understanding R for Big Data Analytics
R Programming Language: R is an open-source programming language
that provides a wide range of statistical and graphical techniques.
It offers a rich ecosystem of packages and libraries that support data
manipulation, visualization, and modeling.
R's flexibility and extensibility make it an excellent choice for big data
analytics.
R for Big Data: While R is traditionally known for its performance on
smaller datasets, it can also handle big data efficiently.
Several R packages have been developed specifically for big data
analytics, allowing users to process and analyze large datasets without
compromising performance.

Handling Big Data in R
R Packages for Big Data Analytics: R offers several packages that facilitate
big data analytics. Some popular packages include −
 dplyr − This package provides a grammar of data manipulation, allowing
users to perform various operations like filtering, summarizing, and
joining datasets efficiently.
 data.table − The data.table package enhances data manipulation by
implementing fast and memory-efficient data structures. It can handle
large datasets with millions or even billions of rows.
 SparkR − Built on Apache Spark, the SparkR package enables
distributed data processing with R. It leverages the power of Spark's
distributed computing capabilities to analyze big data efficiently.
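As a minimal sketch of how the first two packages listed above are used, the following R code reads and summarizes a hypothetical sales.csv file with data.table and expresses the same aggregation with dplyr; the file name and the region and amount columns are assumptions.
library(dplyr)
library(data.table)

# data.table: fast read and aggregation on a (potentially very large) CSV
dt <- fread("sales.csv")                      # hypothetical file with columns region, amount
dt_summary <- dt[, .(total = sum(amount)), by = region]

# dplyr: the same aggregation expressed with the grammar of data manipulation
df_summary <- as.data.frame(dt) %>%
  group_by(region) %>%
  summarise(total = sum(amount)) %>%
  arrange(desc(total))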

Data Manipulation and Preprocessing
Data Cleaning − Data cleaning is a crucial step in big data analytics. R
provides a variety of functions and packages for data cleaning tasks, including
missing data imputation, outlier detection, and data transformation.
Data Transformation − R offers powerful functions for transforming data,
such as reshaping data from wide to long format (melt function), creating new
variables using calculated values (mutate function), and splitting or combining
variables (separate and unite functions).
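A minimal sketch of the cleaning and transformation steps just described, using dplyr and tidyr on a small hypothetical data frame: median imputation of a missing value, splitting a combined variable with separate(), and deriving a new variable with mutate(). The column names are illustrative only.
library(dplyr)
library(tidyr)

# Hypothetical raw data with a missing value and a combined "city_country" field
raw <- data.frame(
  id           = 1:4,
  income       = c(52000, NA, 61000, 48000),
  city_country = c("Paris_France", "Pune_India", "Lima_Peru", "Oslo_Norway")
)

clean <- raw %>%
  mutate(income = ifelse(is.na(income), median(income, na.rm = TRUE), income)) %>%  # simple imputation
  separate(city_country, into = c("city", "country"), sep = "_") %>%                # split one variable into two
  mutate(income_k = income / 1000)                                                  # derive a new variable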
Feature Engineering − Feature engineering involves creating new features
from existing data to improve model performance. R provides a plethora of
packages and functions for feature engineering, including text mining, time
series analysis, and dimensionality reduction techniques.

Modeling and Analysis
Machine Learning with R − R is widely used for machine learning tasks. It
offers numerous packages for various machine learning algorithms, including
classification, regression, clustering, and ensemble methods. Popular machine
learning packages in R include caret, randomForest, glmnet, and xgboost.
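As a small, hedged example of this workflow, the sketch below fits a randomForest classifier on the built-in iris dataset and computes a simple hold-out accuracy; the split ratio and number of trees are arbitrary choices made for illustration.
library(randomForest)

# Hypothetical classification task on the built-in iris dataset
set.seed(42)
train_idx <- sample(seq_len(nrow(iris)), size = 0.7 * nrow(iris))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a random forest classifier and evaluate it on the held-out rows
rf_model <- randomForest(Species ~ ., data = train, ntree = 200)
preds    <- predict(rf_model, newdata = test)
mean(preds == test$Species)   # simple accuracy estimate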
Deep Learning with R − Deep learning has gained significant popularity in
recent years. R provides several packages for deep learning, such as keras,
tensorflow, and mxnet. These packages allow users to build and train deep
neural networks for tasks like image classification, natural language processing,
and time series analysis.
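A minimal sketch of defining a small network with the keras package is shown below; it assumes keras with a TensorFlow backend is installed, and the layer sizes, input shape and training data (x_train, y_train) are placeholders rather than values from the text.
library(keras)

# A small fully connected network for a hypothetical 10-class problem
model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(100)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = "adam",
  metrics = "accuracy"
)

# model %>% fit(x_train, y_train, epochs = 10, batch_size = 32)   # training data assumed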

Data Visualization
Data Visualization Packages − R is renowned for its extensive data
visualization capabilities. It provides a wide range of packages for creating
visually appealing and informative plots and charts.
Some popular data visualization packages in R include −
 ggplot2 − ggplot2 is a highly flexible and powerful package for creating
elegant and customizable data visualizations. It follows the grammar of
graphics principles, allowing users to build complex plots layer by layer.
 plotly − plotly is an interactive visualization package that enables the
creation of interactive and web-based plots. It offers a wide range of
options for creating interactive charts, maps, and dashboards.
 lattice − lattice provides a comprehensive set of functions for creating
conditioned plots, such as trellis plots and multi-panel plots. It is
particularly useful for visualizing multivariate data.

Visualizing Big Data − When working with big data, visualization can be
challenging due to the sheer volume of data. R offers techniques to visualize big
data efficiently, such as sampling techniques, aggregating data, and using
interactive visualizations that can handle large datasets.
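To illustrate the aggregate-then-plot approach mentioned above, the hedged sketch below builds a synthetic stand-in for a large event table, aggregates it to daily counts with dplyr, and plots only the much smaller summary with ggplot2; the data and column names are assumptions.
library(dplyr)
library(ggplot2)

# Synthetic stand-in for a large event-level table with a timestamp column
events <- data.frame(timestamp = Sys.time() + runif(1e5, 0, 30 * 24 * 3600))

# Aggregate before plotting so ggplot2 only receives the small summary
daily <- events %>%
  group_by(day = as.Date(timestamp)) %>%
  summarise(n_events = n())

ggplot(daily, aes(x = day, y = n_events)) +
  geom_line() +
  labs(title = "Daily event volume", x = "Day", y = "Events")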
Performance Optimization

Code Optimization − To enhance performance in big data analytics, optimizing
code is crucial. R provides several techniques for code optimization,
including vectorization, avoiding unnecessary loops, and efficient memory
management.
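A small illustration of vectorization: the loop and the vectorized expression below compute the same result, but the vectorized form is typically far faster in R.
x <- runif(1e6)              # one million random numbers

# Slow: element-by-element loop
squares <- numeric(length(x))
for (i in seq_along(x)) {
  squares[i] <- x[i]^2
}

# Fast: a single vectorized operation over the whole vector
squares <- x^2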

Memory Management − Big data often exceeds the available memory capacity,
requiring careful memory management. R provides techniques for
reducing memory usage, such as using efficient data structures (data.table),
garbage collection, and loading data in chunks.
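A brief sketch of these ideas: data.table::fread() for fast, memory-efficient reading, chunked reading with base read.csv(), and explicit garbage collection. The file name big_file.csv and the chunk size are assumptions.
library(data.table)

# fread() is fast and memory-efficient for large delimited files
dt <- fread("big_file.csv")                                      # hypothetical large CSV

# Base R can read the file in chunks when it does not fit comfortably in RAM
chunk <- read.csv("big_file.csv", skip = 1, nrows = 100000, header = FALSE)

# Free memory explicitly after dropping large objects
rm(chunk)
gc()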

Real-World Applications
Finance and Banking − Big data analytics in finance and banking can help in
fraud detection, risk modeling, portfolio optimization, and customer
segmentation. R's capabilities in data analysis and modeling make it a valuable
tool in this domain.
Healthcare − In the healthcare industry, big data analytics can contribute to
disease prediction, drug discovery, patient monitoring, and personalized
medicine. R's statistical and machine learning capabilities are well-suited for
analyzing healthcare data.
Marketing and Customer Analytics − R plays a significant role in marketing
and customer analytics by analyzing customer behavior, sentiment analysis,
market segmentation, and campaign optimization. It helps organizations make
data-driven marketing decisions.
Big Data Analytics with Big R
Big Data Analytics with Big R refers to using R programming for analyzing
large-scale datasets, leveraging distributed computing frameworks, cloud
environments, and specialized R packages to perform data processing and
analysis. While R is traditionally known for handling small to medium-sized
datasets, tools and extensions like bigmemory, sparklyr, and integration
with Hadoop and Spark enable R to manage and analyze big data effectively.
Why Use R for Big Data Analytics?
1. Statistical Analysis: R provides a rich set of statistical and machine-
learning libraries.
2. Visualization: R offers advanced data visualization packages
like ggplot2 and plotly.
3. Extensibility: It integrates with big data platforms like Hadoop and
Spark.
4. Ease of Use: Its syntax and data manipulation packages
like dplyr simplify working with large datasets.

Challenges in Using R for Big Data
1. Memory Constraints: R loads entire datasets into memory, making it
unsuitable for very large datasets on single machines.
2. Performance: Without optimization, processing large datasets in R can
be slow.
3. Scalability: Requires distributed frameworks to handle datasets beyond
the system's memory.

Key Tools and Packages for Big Data Analytics with R
1. Integration with Distributed Systems
 sparklyr:
 Interface between R and Apache Spark.
 Enables large-scale data processing and machine learning on
distributed data (a brief usage sketch follows this list).
 RHadoop:
 Integrates R with Hadoop.
 Includes packages like:
 rhdfs: For interacting with Hadoop Distributed File System
(HDFS).
 rmr2: For writing MapReduce jobs in R.
 plyrmr: For manipulating structured data on Hadoop.
2. Big Memory Management
 bigmemory:
 Handles datasets larger than RAM by storing them in shared
memory.
 ff:
 Manages datasets too large for memory by storing them on disk.
3. Data Manipulation and Analysis
 data.table:
 Optimized for fast manipulation of large datasets in R.
 dplyr with Databases:
 Can work with SQL databases for large data.
4. Machine Learning
 MLlib via Spark:
 Scalable machine learning using sparklyr.
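To show how several of the tools above fit together, here is a hedged sparklyr sketch: it connects to a local Spark instance, copies a sample dataset into Spark (the nycflights13 package is assumed to be installed; in practice data would be read from HDFS or cloud storage), runs dplyr verbs that Spark executes, and collects only the small aggregated result back into R.
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster master URL would be used in production)
sc <- spark_connect(master = "local")

# Copy a sample R data frame into Spark; real data would come from HDFS, S3 or Parquet
flights_tbl <- copy_to(sc, nycflights13::flights, "flights", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed inside Spark
delay_by_carrier <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()                                      # bring back only the small summary

spark_disconnect(sc)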
Steps to Perform Big Data Analytics with Big R
1. Data Loading:
 Use tools like rhdfs to load data from HDFS or connect to cloud
data sources (e.g., AWS S3).
 Load data into distributed memory frameworks like Spark
using sparklyr.
2. Data Preprocessing:
 Use dplyr or data.table for cleaning, transformation, and
summarization.
 For distributed data, leverage Spark's in-built capabilities.
3. Exploratory Data Analysis (EDA):
 Perform summary statistics and visualize data
using ggplot2 or plotly.
 Use scalable methods to handle subsets or aggregated data.
4. Model Building:
 For distributed machine learning:
 Use Spark MLlib via sparklyr for linear regression, decision
trees, clustering, etc.
 For large local datasets:
 Use packages like biglm for linear models (a chunked example follows these steps).
5. Result Visualization:
 Use visualization libraries like ggplot2, shiny, or plotly to present
findings.
6. Export and Deployment:
 Save results to HDFS or a database for further use.
 Deploy models using APIs or tools like R Shiny.
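As a brief illustration of the model-building step for large local datasets, the sketch below fits a linear model with the biglm package by processing a hypothetical transactions.csv file in chunks and folding each chunk into the same model with update(); the file, column names and chunk sizes are assumptions.
library(biglm)

# Hypothetical large CSV processed in chunks that each fit in memory
chunk1 <- read.csv("transactions.csv", nrows = 100000)
model  <- biglm(amount ~ quantity + unit_price, data = chunk1)

chunk2 <- read.csv("transactions.csv", skip = 100001, nrows = 100000,
                   header = FALSE, col.names = names(chunk1))
model  <- update(model, chunk2)                  # fold the next chunk into the model

summary(model)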