data analytics-1


1) Explain data cleaning

=> Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
When combining multiple data sources, there are many opportunities
for data to be duplicated or mislabeled. If data is incorrect, outcomes
and algorithms are unreliable, even though they may look correct. There
is no one absolute way to prescribe the exact steps in the data cleaning
process because the processes will vary from dataset to dataset. But it
is crucial to establish a template for your data cleaning process so you
know you are doing it the right way every time.
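As a rough illustration, here is a minimal pandas sketch of a few typical cleaning steps; the DataFrame, column names, and cleaning rules are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical raw dataset with duplicates, bad formatting, and missing values
df = pd.DataFrame({
    "name":  ["Alice", "alice ", "Bob", "Carol", None],
    "age":   [25, 25, -1, 41, 33],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com", "d@x.com"],
})

df["name"] = df["name"].str.strip().str.title()     # fix inconsistent formatting
df["age"] = df["age"].replace(-1, np.nan)           # flag impossible values as missing
df = df.drop_duplicates(subset=["name", "email"])   # remove duplicate records
df = df.dropna(subset=["name"])                     # drop rows missing required fields
df["age"] = df["age"].fillna(df["age"].median())    # impute remaining missing ages

print(df)
```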

2) Explain what big data analytics is


=> Big data analytics describes the process of uncovering trends, patterns,
and correlations in large amounts of raw data to help make data-informed
decisions. These processes use familiar statistical analysis techniques—
like clustering and regression—and apply them to more extensive datasets
with the help of newer tools. Big data has been a buzzword since the early
2000s when software and hardware capabilities made it possible for
organizations to handle large amounts of unstructured data.

3) Explain why SVM is more accurate than logistic regression.

=> Support vector machines (SVMs) and logistic regression are both supervised
learning algorithms that are commonly used for classification tasks.

One of the key advantages of support vector machines is that they can model
non-linear decision boundaries, while logistic regression is limited to modeling
linear decision boundaries. This means that SVMs can be more accurate than
logistic regression when the data is not linearly separable, or when there is a
complex relationship between the features and the target variable.

Another advantage of SVMs is that they can use the kernel trick to project the
data into a higher dimensional space, where it may be linearly separable. This
allows SVMs to effectively model data that is not linearly separable in the
original feature space, which can improve their accuracy.

Additionally, SVMs can be regularized to avoid overfitting, which can help to improve their generalization performance on unseen data. Logistic regression, on the other hand, can overfit to the training data if the regularization parameter is not set properly.

4) Describe the various hierarchical models of cluster analysis.


=> A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the following steps:

1. Identify the two clusters that are closest together, and
2. Merge the two most similar clusters.

These steps are continued until all the clusters are merged together.

1. Agglomerative: Initially consider every data point as an individual cluster and, at every step, merge the nearest pair of clusters (this is a bottom-up method). At first, every data point is considered an individual entity or cluster. At every iteration, the clusters merge with other clusters until one cluster is formed.

2. Divisive: Divisive hierarchical clustering is precisely the opposite of agglomerative hierarchical clustering. In divisive hierarchical clustering, we start by treating all of the data points as a single cluster and, in every iteration, we split off the data points that are least similar to the rest of their cluster. In the end, we are left with N clusters.
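For illustration, a minimal sketch of agglomerative (bottom-up) clustering using SciPy's hierarchical clustering utilities; the two-blob dataset is made up for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two hypothetical blobs of 2-D points
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# Agglomerative (bottom-up) clustering: repeatedly merge the two closest clusters
Z = linkage(X, method="ward")                     # the merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```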

5) How can the initial number of clusters for the K-means algorithm be
estimated?

=> One way to estimate the initial number of clusters for the K-means algorithm
is to use the elbow method. This method involves fitting the K-means algorithm
to the data for a range of values for the number of clusters and then visualizing
the sum of squared distances (also known as the within-cluster sum of squares)
for each value of k.

To use the elbow method, you would first fit the K-means algorithm to the data
for a range of values for k, such as from 1 to 10. Then, you would calculate the
within-cluster sum of squares for each value of k. Finally, you would plot the
within-cluster sum of squares versus k and look for the "elbow" point in the plot
to determine the optimal number of clusters.
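A minimal sketch of the elbow method with scikit-learn, assuming a synthetic dataset generated with make_blobs; the within-cluster sum of squares is exposed as KMeans's inertia_ attribute:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Hypothetical data with an unknown number of clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertia = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia.append(km.inertia_)   # within-cluster sum of squares for this k

plt.plot(ks, inertia, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```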

6) Explain How Hadoop is related to Big Data.


=> Big data and Hadoop are interrelated. In simple words, big data is a massive amount of data that cannot be stored, processed, or analyzed using traditional methods. Big data consists of vast volumes of various types of data that are generated at a high speed. To overcome the issue of storing, processing, and analyzing big data, Hadoop is used.

Hadoop is a framework that is used to store and process big data in a distributed and parallel way. In Hadoop, storing vast volumes of data becomes easy as the data is distributed across various machines, and the data is also processed in parallel, which saves time.

7) Short note on Probability Distribution?

=> A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range. This range will be bounded between the minimum and maximum possible values, but precisely where the possible value is likely to be plotted on the probability distribution depends on a number of factors. These factors include the distribution's mean (average), standard deviation, skewness, and kurtosis.
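For example, a small SciPy sketch of a normal probability distribution with a hypothetical mean of 100 and standard deviation of 15:

```python
from scipy import stats

# A normal distribution with a hypothetical mean of 100 and standard deviation of 15
dist = stats.norm(loc=100, scale=15)

print(dist.mean(), dist.std())        # 100.0 15.0
print(dist.pdf(100))                  # density at the mean (the most likely region)
print(dist.cdf(115) - dist.cdf(85))   # probability of a value within one std of the mean (~0.68)
print(stats.skew(dist.rvs(size=10_000, random_state=0)))  # sample skewness near 0
```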

8) Difference between active learning & reinforcement learning.

Active learning and reinforcement learning are both types of machine learning,
but they differ in their approach to training models. Active learning is a type of
supervised learning in which the model is able to interactively query the user (or
some other information source) to obtain the desired outputs at new data points.
This allows the model to learn from a smaller amount of labeled data and can
improve its performance compared to a model trained using only a large amount
of labeled data.

Reinforcement learning, on the other hand, is a distinct paradigm of machine learning (neither supervised nor unsupervised) in which the model learns to take actions in an environment in order to maximize a reward signal. In reinforcement learning, the model is not given explicit labeled data to learn from, but instead must learn through trial and error by taking actions and receiving feedback in the form of rewards or punishments. This allows the model to learn complex behaviors and make decisions based on its environment.

In summary, active learning is a type of supervised learning that involves interaction with a user or some other source of information to obtain labeled data, while reinforcement learning is a paradigm that involves learning through trial and error by taking actions in an environment to maximize a reward signal.

9) Write down a few problems faced in data analytics
=>

 The amount of data being collected
 Collecting meaningful and real-time data
 Visual representation of data
 Data from multiple sources
 Inaccessible data
 Poor quality data
 Pressure from the top
 Lack of support
 Budget
 Scaling data analysis

10) Explain in detail about the challenges of a conventional system

=>
Large quantities: The data in a conventional system are huge, so it is almost impossible to process this data.

Meaningful: We are dealing with up-to-date data; we do not need old data or any kind of irregular or inconsistent data. Whatever the data is, it should be meaningful, relevant, and collected in real time.

Multiple sources: The data should be collected from multiple sources and then categorized or labelled.

11) Difference between Data analytics & Data Mining

=>
Data Mining: Data mining is a process of extracting useful information, patterns, and trends from raw data.
Data Analysis: Data analysis is a method that can be used to investigate, analyze, and demonstrate data to find useful information.

Data Mining: The data mining output gives the data pattern.
Data Analysis: The data analysis output is a verified hypothesis or insights based on the data.

Data Mining: It includes the intersection of databases, machine learning, and statistics.
Data Analysis: It requires expertise in computer science, mathematics, statistics, and AI.

Data Mining: It is also called KDD (Knowledge Discovery in Databases).
Data Analysis: It is of various types - text analytics, predictive analysis, data mining, etc.

Data Mining: It is responsible for extracting useful patterns and trends in the data.
Data Analysis: It is responsible for developing models, testing, and proposing hypotheses using analytical methods.

Data Mining: The best example of a data mining application is in the e-commerce sector, where websites display options based on what other customers who purchased and viewed a specific product also bought.
Data Analysis: The best example of data analysis is the study of the census.

12) A short note on Two way ANOVA.

=> ANOVA (Analysis of Variance) is a statistical technique used to test for significant differences between the means of two or more groups. Two-way ANOVA is an extension of the technique that is used to test for differences between two or more groups when there are two different independent variables. This allows researchers to determine whether the differences between the groups are due to one or both of the independent variables.

To perform a two-way ANOVA, the researcher first divides the data into two or
more groups based on the levels of the first independent variable. Within each
of these groups, the data is further divided into subgroups based on the levels of
the second independent variable. The researcher then calculates the mean and
variance of the data in each of the subgroups and uses these values to test for
significant differences between the means of the subgroups.

Two-way ANOVA is commonly used in experiments where there are two factors that could potentially affect the outcome of the study. For example, a researcher might use two-way ANOVA to test for differences in test scores between different schools, where the first independent variable is the type of school (public or private) and the second independent variable is the location of the school (urban or rural). By using two-way ANOVA, the researcher can determine whether the differences in test scores are due to the type of school or the location of the school, or both.
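A minimal sketch of such a two-way ANOVA using statsmodels, with hypothetical test-score data for the two factors (school type and location) from the example above:

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical test-score data with two factors: school type and location
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "school":   np.repeat(["public", "private"], 40),
    "location": np.tile(np.repeat(["urban", "rural"], 20), 2),
    "score":    rng.normal(70, 10, size=80),
})

# Two-way ANOVA with an interaction term between the two independent variables
model = ols("score ~ C(school) + C(location) + C(school):C(location)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```
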
13) What is the role of the activation function in neural networks?
=>

The activation function decides whether a neuron should be activated or not by calculating the weighted sum and further adding bias to it. The purpose of the activation function is to introduce non-linearity into the output of a neuron.

We know, the neural network has neurons that work in correspondence with
weight, bias, and their respective activation function. In a neural network, we
would update the weights and biases of the neurons on the basis of the error at
the output. This process is known as back-propagation. Activation functions
make the back-propagation possible since the gradients are supplied along with
the error to update the weights and biases.
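A small NumPy sketch of a single hypothetical neuron, showing the weighted sum plus bias followed by two common activation functions (sigmoid and ReLU):

```python
import numpy as np

def sigmoid(z):
    # Squashes the weighted sum into (0, 1); non-linear and differentiable,
    # so its gradient can be used during back-propagation
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Outputs z if positive, 0 otherwise; a common hidden-layer activation
    return np.maximum(0.0, z)

# One hypothetical neuron: weighted sum of inputs plus a bias, then activation
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.3, -0.2])   # weights
b = 0.1                          # bias

z = np.dot(w, x) + b
print("weighted sum:", z)
print("sigmoid output:", sigmoid(z))
print("relu output:", relu(z))
```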

14) Write down the applications of association rules

=>

Association rule learning is a type of unsupervised learning method that tests for the dependence of one data element on another data element and maps them accordingly so that the relationship can be used more effectively. It tries to discover interesting relations among the variables of the dataset. It relies on several rules to find interesting relations between variables in the database.

There are various applications of Association Rule which are as follows −

Items purchased on a credit card, such as rental cars and hotel rooms, provide insight into the next products that customers are likely to buy.

Optional services purchased by telecommunication users (call waiting, call forwarding, DSL, speed call, etc.) help decide how to bundle these services to maximize revenue.

Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.) identify customers likely to need other services.

Unusual combinations of insurance claims can be an indication of fraud and can trigger further investigation.

Medical patient histories can give indications of likely complications based on certain combinations of treatments.

15) Describe underfitting & overfitting

=> Underfitting and overfitting are the most common problems with any machine learning algorithm.

 In underfitting, the data that we provide to the system for training is not sufficient for the model to learn from, so the bias (training error) will be high. Underfitting occurs when your model is too simple for your data; it makes systematically incorrect predictions. In this case, the training error is large and the validation/test error is large too. To fix underfitting, you should make the model more complex.

 In overfitting, the model fits the training data too closely, so the bias will be low but the validation or test error (the variance) will be high. Overfitting occurs when your model is too complex for your data; simply put, the model does not make accurate predictions on new data. In this case, the training error is very small and the validation/test error is large. To fix overfitting, you should simplify the model.

16)What is clustering?

=> Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same groups are more similar to other data
points in the same group and dissimilar to the data points in other groups. It is
basically a collection of objects on the basis of similarity and dissimilarity
between them.

Example: data points that lie close together in a scatter plot can be grouped into a single cluster; in such a plot we might identify, for instance, 3 separate clusters. It is not necessary for clusters to be spherical.

17) What is the output layer?

=> In the context of neural networks, the term "output layer" typically refers to
the last layer of a network that produces the final outputs of the model. The
output layer receives input from the other layers in the network, processes this
input using the activation function, and then produces the final outputs of the
model.

The output layer is important because it produces the predictions or classifications that the neural network is designed to make. The number of
neurons in the output layer is usually determined by the number of classes that
the model is trying to predict, and the activation function used in the output
layer is typically chosen based on the type of task that the model is being used
for.
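As an illustration, a small NumPy sketch of a softmax output layer turning hypothetical raw scores from three output neurons into class probabilities (softmax is a typical output activation for multi-class classification):

```python
import numpy as np

def softmax(z):
    # Converts raw scores from the output layer into class probabilities
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw scores ("logits") from an output layer with 3 neurons,
# i.e. a model predicting one of 3 classes
logits = np.array([2.0, 0.5, -1.0])
probs = softmax(logits)

print(probs)            # probabilities summing to 1
print(probs.argmax())   # index of the predicted class
```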

18) Different types of sampling techniques used by data analysts.


=> There are basically two types of sampling: 1) probability sampling and 2) non-probability sampling.

Probability sampling: Probability sampling techniques are one of the important types of sampling techniques. Probability sampling allows every member of the population a chance to get selected. It is mainly used in quantitative research when you want to produce results representative of the whole population.

1) Simple random sampling
2) Systematic sampling
3) Stratified sampling
4) Cluster sampling
Non-probability sampling: Non-probability sampling techniques are another important type of sampling technique. In non-probability sampling, not every individual has a chance of being included in the sample. This sampling method is easier and cheaper but also has a high risk of sampling bias. It is often used in exploratory and qualitative research with the aim of developing an initial understanding of the population.

1) Convenience sampling
2) Voluntary sampling
3) Purposive sampling
4) Snowball sampling
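A minimal pandas sketch of simple random, systematic, and stratified sampling on a hypothetical population table (the 'region' column and sample sizes are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical population of 1,000 people with a 'region' attribute
rng = np.random.default_rng(0)
pop = pd.DataFrame({
    "id": range(1000),
    "region": rng.choice(["north", "south", "east", "west"], size=1000),
})

# 1) Simple random sampling: every member has an equal chance of selection
simple = pop.sample(n=100, random_state=0)

# 2) Systematic sampling: pick every k-th member after a random start
k = len(pop) // 100
start = rng.integers(0, k)
systematic = pop.iloc[start::k]

# 3) Stratified sampling: sample the same fraction from every region
stratified = pop.groupby("region").sample(frac=0.1, random_state=0)

print(len(simple), len(systematic), len(stratified))
```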

19) What is Data Wrangling


=> Data wrangling is the process of removing errors and combining complex data
sets to make them more accessible and easier to analyze. Due to the rapid
expansion of the amount of data and data sources available today, storing and
organizing large quantities of data for analysis is becoming increasingly
necessary.
A data wrangling process, also known as a data munging process, consists of
reorganizing, transforming, and mapping data from one "raw" form into another
in order to make it more usable and valuable for a variety of downstream uses
including analytics.
Data wrangling can be defined as the process of cleaning, organizing, and
transforming raw data into the desired format for analysts to use for prompt
decision-making. Also known as data cleaning or data munging, data wrangling
enables businesses to tackle more complex data in less time, produce more
accurate results, and make better decisions. The exact methods vary from project
to project depending on your data and the goal you are trying to achieve. More
and more organizations are increasingly relying on data-wrangling tools to make
data ready for downstream analytics.

20) Explain the term normal distribution

=> A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme.

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.

21) Types of hypothesis testing

=> Hypothesis Tests can be classified into two big families:

 Parametric tests, if the samples follow a normal distribution. (The standard normal distribution is the special case with mean 0 and variance 1.)

 Non-parametric tests, if the samples do not follow a normal distribution.

Depending on the number of samples to be compared, two families of hypothesis tests can be formulated:

 One sample, if there is just one sample, which must be compared with a given value.

 Two samples, if there are two or more samples to be compared. In this case, possible tests include correlation and difference between samples. In both cases, samples can be paired or not. Paired samples are also called dependent samples, while unpaired samples are also called independent samples. In paired samples, natural or matched couplings occur.
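A small SciPy sketch of these families of tests on two hypothetical samples (one-sample and two-sample t-tests as parametric examples, and the Mann-Whitney U test as a non-parametric alternative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(5.0, 1.0, size=30)   # hypothetical sample A
b = rng.normal(5.5, 1.0, size=30)   # hypothetical sample B

# One-sample parametric test: is the mean of A equal to a given value (5.0)?
print(stats.ttest_1samp(a, popmean=5.0))

# Two-sample parametric test on independent samples
print(stats.ttest_ind(a, b))

# Two-sample parametric test on paired (dependent) samples
print(stats.ttest_rel(a, b))

# Non-parametric alternative when normality cannot be assumed
print(stats.mannwhitneyu(a, b))
```
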
22) Characteristics of Big data
=>Big data is a collection of data from many different sources and is often described by five characteristics: volume, value, variety, velocity, and veracity.
Volume: the size and amounts of big data that companies manage and analyze
Value: the most important “V” from the perspective of the business, the value of
big data usually comes from insight discovery and pattern recognition that lead
to more effective operations, stronger customer relationships and other clear and
quantifiable business benefits
Variety: the diversity and range of different data types, including unstructured
data, semi-structured data and raw data
Velocity: the speed at which companies receive, store and manage data – e.g., the
specific number of social media posts or search queries received within a day,
hour or other unit of time
Veracity: the “truth” or accuracy of data and information assets, which often
determines executive-level confidence
The additional characteristic of variability can also be considered:
Variability: the changing nature of the data companies seek to capture, manage
and analyse – e.g., in sentiment or text analytics, changes in the meaning of key
words or phrases.
23) Type 1 error & Type 2 error comparison
=>

Meaning: A Type I error refers to the non-acceptance of a hypothesis which ought to be accepted. A Type II error is the acceptance of a hypothesis which ought to be rejected.

Equivalent to: Type I error - false positive. Type II error - false negative.

What is it?: A Type I error is the incorrect rejection of a true null hypothesis. A Type II error is the incorrect acceptance of a false null hypothesis.

Represents: Type I error - a false hit. Type II error - a miss.

Probability of committing the error: For a Type I error, it equals the level of significance (α). For a Type II error, it equals β, which is 1 minus the power of the test.

Indicated by: Type I error - the Greek letter 'α'. Type II error - the Greek letter 'β'.

24) Explain EDA?


=>Exploratory Data Analysis (EDA) is an approach to analyze the data using
visual techniques. It is used to discover trends, patterns, or to check assumptions
with the help of statistical summary and graphical representations.
Exploratory Data Analysis (EDA) is an approach to analyzing datasets to
summarize their main characteristics, often with visual methods. EDA is used for
seeing what the data can tell us before the modeling task. It is not easy to look at
a column of numbers or a whole spreadsheet and determine important
characteristics of the data. It may be tedious, boring, and/or overwhelming to
derive insights by looking at plain numbers. Exploratory data analysis techniques
have been devised as an aid in this situation.
Exploratory data analysis is generally cross-classified in two ways. First, each
method is either non-graphical or graphical. And second, each method is either
univariate or multivariate (usually just bivariate).
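A minimal pandas sketch of common EDA steps; the file name "data.csv" is only a placeholder for your own dataset:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; replace the path with your own file
df = pd.read_csv("data.csv")

print(df.shape)        # number of rows and columns
print(df.info())       # column types and missing values
print(df.describe())   # summary statistics (mean, std, quartiles, ...)
print(df.head())       # first few records

# Simple graphical EDA
df.hist(figsize=(10, 6))               # univariate distributions
plt.show()
print(df.corr(numeric_only=True))      # pairwise correlations (bivariate view)
```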
5 Marks question
1. Write down a short note on association rule mining.
=> It is an unsupervised learning technique. Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of
an item based on the occurrences of other items in the transaction.
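A minimal sketch of such a market basket analysis using the apriori and association_rules functions from the mlxtend library, on a handful of made-up transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical market-basket transactions
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "butter", "milk"],
]

# One-hot encode the transactions
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets and the rules derived from them
itemsets = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```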
2. Write down a short note on deep mining

=> Deep mining is a type of data mining that involves using deep learning
algorithms to extract information and insights from large and complex datasets.
Deep learning is a subset of machine learning that uses neural networks with
multiple layers of processing to learn from data and make predictions or
classifications.

Deep mining combines the power of deep learning with the techniques of data
mining to discover hidden patterns and relationships in data. This can be useful
for a wide range of applications, such as analyzing customer behavior, detecting
fraud, or predicting the stock market.

One of the key advantages of deep mining is that it can handle large and
complex datasets that may be too difficult for traditional data mining algorithms
to handle. Deep learning algorithms are able to automatically extract useful
features from the data and learn complex relationships between the variables,
which can improve their accuracy and performance.

Another advantage of deep mining is that it can handle unstructured and unlabeled data, such as images, audio, and text. This can be useful for tasks such as image recognition, speech recognition, and natural language processing.

Overall, deep mining is a powerful tool for extracting information and insights
from large and complex datasets. By combining the techniques of data mining
with the power of deep learning, deep mining can help businesses and
organizations to make better decisions and improve their operations.

3. Application of clustering.

=> There are various applications of clustering which are as follows −


Scalability − Some clustering algorithms work well on small data sets containing fewer than 200 data objects; however, a huge database can include millions of objects. Clustering on a sample of a given huge data set can lead to biased results. Highly scalable clustering algorithms are therefore required.
Ability to deal with different types of attributes − Some algorithms are
designed to cluster interval-based (numerical) records. However, applications can
require clustering several types of data, including binary, categorical (nominal),
and ordinal data, or a combination of these data types.
Discovery of clusters with arbitrary shape − Some clustering algorithms
determine clusters depending on Euclidean or Manhattan distance measures.
Algorithms based on such distance measures tend to discover spherical clusters
with the same size and density. However, a cluster can be of any shape. It is
essential to develop algorithms that can identify clusters of arbitrary shapes.
Minimal requirements for domain knowledge to determine input parameters − Some clustering algorithms need users to input specific parameters for the cluster analysis (such as the number of desired clusters). The clustering results are quite sensitive to the input parameters. Parameters are hard to decide, specifically for data sets containing high-dimensional objects. This not only burdens users but also makes the quality of clustering difficult to control.
Ability to deal with noisy data − Some real-world databases include outliers or
missing, unknown, or erroneous records. Some clustering algorithms are sensitive
to such data and may lead to clusters of poor quality.
Insensitivity to the order of input records − Some clustering algorithms are sensitive to the order of input data; e.g., the same set of data, when presented in different orderings to such an algorithm, can generate dramatically different clusters. It is essential to develop algorithms that are insensitive to the order of input.
High dimensionality − A database or a data warehouse can include several dimensions or attributes. Some clustering algorithms are good at managing low-dimensional data, containing only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that data in high-dimensional space can be very sparse and highly skewed.
Constraint-based clustering − Real-world applications can be required to
perform clustering under several types of constraints. Consider that your job is to
select the areas for a given number of new automatic cash stations (ATMs) in a
city.
4. Write a short note on Quadratic Discriminant Analysis.
=> Quadratic discriminant analysis (QDA) is a statistical technique used for classification tasks. It is closely related to linear discriminant analysis (LDA): both assume that the data within each class is normally distributed, but whereas LDA assumes that all classes share the same covariance matrix, QDA allows the covariance matrix of each class to be different. This allows QDA to capture more
complex relationships in the data and potentially improve the classification
accuracy. However, this comes at the cost of increased computational
complexity and a higher number of model parameters that need to be estimated
from the data. In general, QDA is more flexible than LDA and can provide better
classification results when the assumptions of LDA are not met, but it may be
less stable and more prone to overfitting when the sample size is small.
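A small scikit-learn sketch comparing LDA and QDA on a synthetic dataset (the data and split are hypothetical, so the accuracies are only illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Hypothetical classification data
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

print("LDA accuracy:", lda.score(X_test, y_test))
print("QDA accuracy:", qda.score(X_test, y_test))
```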
5. Describe in detail the role of statistical models in data analysis.

=> In data analysis, statistical models are used to analyze and make predictions
or inferences from data. Statistical models are mathematical equations that can
be used to describe the relationship between different variables in a dataset.
These models can be used to identify trends, patterns, and relationships in the
data that can help analysts understand the underlying processes that generated
the data.

One of the key roles of statistical models in data analysis is to provide a framework for making predictions or inferences about a population based on a
sample of data. For example, a statistical model could be used to estimate the
average income of a population based on a sample of data from that population.
The model would take into account factors such as the sample size, the
variability of the data, and the relationship between income and other variables
in the dataset.

Another important role of statistical models in data analysis is to identify the underlying factors that may be contributing to the patterns and trends in the
data. For example, a statistical model could be used to identify the factors that
are most strongly associated with changes in a particular variable, such as the
factors that are most strongly associated with changes in a company's sales. This
can help analysts better understand the processes that are driving the data and
make more informed decisions.

Overall, the role of statistical models in data analysis is to provide a framework for making predictions and inferences from data, and for identifying the
underlying factors that may be contributing to patterns and trends in the data.
These models can be used to gain a better understanding of the processes that
are driving the data, and to make more informed decisions based on that data.

6. Write a short note on descriptive statistics.

=> In data analytics, descriptive statistics is used to summarize and organize large amounts of data, and to identify trends, patterns, and relationships within
the data. This can help analysts gain a better understanding of the underlying
processes that generated the data, and can be used to make predictions and
inferences about a population based on a sample of data.

Descriptive statistics is often used in conjunction with other statistical techniques, such as inferential statistics, to provide a more complete picture of
the data. For example, an analyst might use descriptive statistics to summarize
the key characteristics of a dataset, and then use inferential statistics to make
predictions or inferences about the population based on that dataset.

In data analytics, descriptive statistics is commonly used to explore and summarize large datasets, to identify trends and patterns in the data, and to
present the results of data analysis in a clear and concise manner. It is an
important tool for gaining a better understanding of the data, and for making
more informed decisions based on that data.

7. Difference between probability distribution & descriptive analysis.

=> In data analytics, probability distribution and descriptive analysis are two
different statistical techniques that are used to analyze and make predictions or
inferences from data.

Probability distribution is a mathematical function that describes the likelihood of different outcomes in a statistical experiment. It is used to model the behavior
of random variables, such as the outcomes of a dice roll or the results of a
survey. In data analytics, probability distributions can be used to make
predictions about the likelihood of different outcomes based on a dataset, and to
calculate the probability of a given event occurring.
Descriptive analysis, on the other hand, is a statistical technique that is used to
summarize and organize large amounts of data. It is concerned with providing a
summary of the key characteristics of a dataset, such as the mean, median, and
mode, as well as measures of dispersion, such as the range, standard deviation,
and variance. In data analytics, descriptive analysis is commonly used to
explore a dataset and to identify trends, patterns, and relationships within the
data.

8. Clarify the difference between overfitting & underfitting. How to deal with them?

=> Overfitting and underfitting are two common problems in machine learning
and data modeling. Overfitting occurs when a model is too complex and has too
many parameters, and as a result, it performs well on the training data but
poorly on new or unseen data. Underfitting, on the other hand, occurs when a
model is too simple and does not have enough parameters to accurately capture
the underlying trends and patterns in the data.

One way to deal with overfitting is to use regularization, which is a technique that adds a penalty to the model's complexity to prevent it from becoming too
complex and overfitting the data. Another way to deal with overfitting is to use
cross-validation, which is a technique that involves training the model on
multiple subsets of the data and then averaging the results to get a more accurate
estimate of the model's performance on new data.

To deal with underfitting, one approach is to increase the complexity of the model by adding more parameters. This can help the model capture more of the
underlying trends and patterns in the data, but it can also increase the risk of
overfitting. Another approach is to collect more data, which can help the model
learn more about the underlying patterns and trends in the data.
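A small scikit-learn sketch of both remedies, using a deliberately over-complex polynomial model, an L2 (ridge) penalty as regularization, and cross-validation to estimate performance on unseen data; the dataset is synthetic:

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Hypothetical noisy data with a simple underlying trend
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1, size=60)

# A very high-degree polynomial tends to overfit...
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
# ...while an L2 (ridge) penalty on the same features regularizes it
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=10.0))

# Cross-validation gives a more honest estimate of performance on unseen data
print("overfit model CV R^2:    ", cross_val_score(overfit, X, y, cv=5).mean())
print("regularized model CV R^2:", cross_val_score(regularized, X, y, cv=5).mean())
```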

9. Explain the principle of a neural network.

=> A neural network is a type of machine learning algorithm that is inspired by the structure and function of the human brain. It is composed of multiple layers
of interconnected "neurons," which process and transmit information.

The basic principle of a neural network is that it learns to identify patterns and
make predictions or decisions based on the input data. This learning is
accomplished through a process of trial and error, where the neural network
adjusts the strength of the connections between its neurons based on the input
data and the desired output. Over time, the neural network learns to make more
accurate predictions or decisions based on the input data.
One of the key principles of a neural network is that it is capable of learning and
adapting to new data. This means that the neural network can improve its
performance over time as it is exposed to more and more data, and it can learn
to identify complex patterns and relationships in the data that are not easily
recognizable by humans.

Another important principle of a neural network is that it is highly parallel, which means that it can process large amounts of data quickly and efficiently.
This makes neural networks well-suited for tasks such as image recognition and
natural language processing, where large amounts of data need to be processed
in real-time.

10. Explain why SVM is more accurate than logistic regression, with an example.

=> Support vector machines (SVMs) are a type of supervised learning algorithm that is often considered to be more accurate than logistic regression.
This is because SVMs are able to identify complex patterns and relationships in
the data that are not easily recognizable by other algorithms, such as logistic
regression.

One reason why SVMs are more accurate than logistic regression is that they
are able to handle non-linear data more effectively. Logistic regression is a
linear algorithm, which means that it can only model linear relationships
between the input data and the output. This can make it less effective at
identifying complex patterns and relationships in the data.

In contrast, SVMs are capable of modeling non-linear relationships between the input data and the output. This is because SVMs use a kernel trick to transform
the input data into a higher-dimensional space, where it becomes linearly
separable. This allows SVMs to capture more of the underlying structure of the
data, which can improve the accuracy of the model.

Another reason why SVMs are more accurate than logistic regression is that
they are able to handle high-dimensional data more effectively. Logistic
regression can become less accurate as the number of input features increases,
because it can become difficult to estimate the coefficients for all of the input
features.
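For a concrete (hypothetical) example, a scikit-learn comparison on the make_moons dataset, which is not linearly separable; an RBF-kernel SVM typically scores noticeably higher than plain logistic regression here:

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hypothetical non-linearly separable data (two interleaving half-moons)
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression().fit(X_train, y_train)        # linear boundary
svm_rbf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)   # kernelized boundary

print("logistic regression accuracy:", logreg.score(X_test, y_test))
print("RBF-kernel SVM accuracy:     ", svm_rbf.score(X_test, y_test))
```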

11. Justify why SVM is so fast.

=> Support vector machines (SVMs) are a type of supervised learning algorithm
that are known for their speed and efficiency. There are several reasons why
SVMs are fast compared to other algorithms.
One reason why SVMs are fast is that they are able to handle large amounts of
data quickly and efficiently. This is because SVMs use a kernel trick to
transform the input data into a higher-dimensional space, where it becomes
linearly separable. This allows SVMs to process large amounts of data without
incurring a significant computational cost.

Another reason why SVMs are fast is that they are able to make predictions
quickly. Once the SVM has been trained on a dataset, it can make predictions
on new data using a simple dot product, which is a fast and efficient operation.
This allows SVMs to make predictions quickly and efficiently, even on large
datasets.

In addition, SVMs are fast because they are able to make predictions using only
a subset of the training data, known as the support vectors. This means that the
SVM only needs to store and process the support vectors in order to make
predictions, which can significantly reduce the computational cost of the model.

12. Justify what are the best practices in big data analysis.

=> There are several best practices that can be followed to ensure effective and
efficient big data analysis. Some of the key best practices include:

1. Identify the business problem or opportunity that the data analysis is intended to address. This will help ensure that the analysis is focused and relevant, and that the results can be used to make informed decisions.
2. Develop a clear and concise plan for the data analysis, including the
specific steps and methods that will be used. This will help ensure that the
analysis is conducted in a structured and systematic manner, and that the
results can be replicated and validated.
3. Use appropriate data analysis tools and techniques that are well-suited for
big data, such as distributed computing and machine learning algorithms.
This will help ensure that the data can be processed efficiently and
accurately, and that the results of the analysis are reliable.
4. Use appropriate data visualization techniques to present the results of the
analysis in a clear and concise manner. This will help ensure that the
results are easily understandable and actionable, and that they can be used
to make informed decisions.
5. Continuously evaluate and monitor the results of the data analysis to
ensure that they are accurate and relevant. This will help identify any
errors or biases in the data or the analysis, and can help improve the
accuracy and reliability of the results.
13. Assess the techniques used in big data analysis.

=> There are several techniques that are commonly used in big data analysis.
These techniques can be divided into two main categories: statistical techniques
and machine learning techniques.

Statistical techniques are mathematical methods that are used to summarize, organize, and analyze large datasets. These techniques include descriptive
statistics, which are used to summarize the key characteristics of a dataset, and
inferential statistics, which are used to make predictions or inferences about a
population based on a sample of data.

Machine learning techniques are algorithms that are designed to learn from data
and make predictions or decisions based on that data. These techniques include
supervised learning, where the algorithm is trained on a labeled dataset and then
makes predictions on new data, and unsupervised learning, where the algorithm
learns to identify patterns and relationships in the data without the use of
labeled data.

14. Explain Anova.

=> ANOVA, or analysis of variance, is a statistical technique that is used to compare the means of two or more groups or samples. It is used to determine
whether there are significant differences between the means of the different
groups or samples, and to identify which groups or samples are significantly
different from each other.

ANOVA is commonly used in research studies, where it is used to compare the means of different groups of subjects on a particular dependent variable, such as
their scores on a test or their responses to a survey. It is also used in business
settings, where it can be used to compare the means of different groups of
customers or employees on a particular variable, such as their satisfaction levels
or their performance on a task.

To conduct an ANOVA, the data are first organized into groups or samples
based on the factors that are being compared. The mean of each group or
sample is then calculated, and the overall mean of all of the groups or samples is
also calculated. The differences between the group means and the overall mean
are then calculated and used to determine whether there are significant
differences between the groups.

If the ANOVA indicates that there are significant differences between the
groups, a post-hoc test can be used to identify which specific groups are
significantly different from each other. This can help identify the specific
factors or variables that are contributing to the differences between the groups.
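A small SciPy sketch of a one-way ANOVA on three hypothetical groups of test scores using f_oneway:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical test scores for three groups of subjects
group_a = rng.normal(70, 8, size=25)
group_b = rng.normal(75, 8, size=25)
group_c = rng.normal(78, 8, size=25)

# One-way ANOVA: are the group means significantly different?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F =", f_stat, "p =", p_value)

# If p is below the chosen significance level (e.g. 0.05), at least one group
# mean differs; a post-hoc test (e.g. Tukey's HSD) can identify which ones.
```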

15. Different features of Hadoop.


=>Hadoop is open source
=>Hadoop is highly scalable
=>Hadoop provides fault tolerance
=>provides high availability
=>Hadoop is very cost effective
=>Hadoop ensures faster data processing
=>Hadoop is based on data locality concept
=>Hadoop provides flexibility
=>It is easy to use
=>Hadoop ensures data reliability
16. Illustrate how Hadoop is related to big data analytics.

=> Hadoop is an open-source software framework that is used for storing and
processing large amounts of data. It is designed to handle big data analytics,
which involves analyzing large and complex datasets to identify trends,
patterns, and relationships within the data.

Hadoop consists of several components, including a distributed file system (HDFS) for storing large amounts of data, and a parallel processing framework (MapReduce) for analyzing that data. These components work together to allow
Hadoop to efficiently store and process large amounts of data, and to make it
possible to perform complex analyses on that data.

One of the key advantages of Hadoop is that it is highly scalable, which means
that it can easily handle increases in the amount of data that needs to be
processed. This makes Hadoop well-suited for big data analytics, where large
and complex datasets are common.

Another advantage of Hadoop is that it is fault-tolerant, which means that it can continue to operate even if one or more of the machines in the cluster fail. This
helps ensure that the data is always available and that the analysis can be
completed, even in the event of hardware failures or other disruptions.
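To illustrate the MapReduce idea that Hadoop runs at scale, here is a small local Python simulation of the classic word-count job; on a real cluster the mapper and reducer would typically be separate scripts run across many machines (for example via Hadoop Streaming):

```python
# A local simulation of the MapReduce word-count pattern that Hadoop's
# MapReduce framework runs in a distributed way across many machines.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map step: emit (word, 1) for every word in a line of input
    for word in line.strip().split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: sum the counts for one word
    return word, sum(counts)

lines = ["big data needs Hadoop", "Hadoop stores and processes big data"]

# Shuffle/sort phase: group intermediate pairs by key (done by Hadoop itself)
pairs = sorted(kv for line in lines for kv in mapper(line))
results = [reducer(word, (count for _, count in group))
           for word, group in groupby(pairs, key=itemgetter(0))]
print(results)
```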
