data analytics-1
=> Support vector machines (SVMs) and logistic regression are both supervised
learning algorithms that are commonly used for classification tasks.
One of the key advantages of support vector machines is that they can model
non-linear decision boundaries, while logistic regression is limited to modeling
linear decision boundaries. This means that SVMs can be more accurate than
logistic regression when the data is not linearly separable, or when there is a
complex relationship between the features and the target variable.
Another advantage of SVMs is that they can use the kernel trick to project the
data into a higher dimensional space, where it may be linearly separable. This
allows SVMs to effectively model data that is not linearly separable in the
original feature space, which can improve their accuracy.
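As an illustration of the kernel-trick idea (this sketch and its toy data are not from the original notes), the snippet below shows 1-D data that no single threshold can separate, but that becomes linearly separable after an explicit feature map phi(x) = (x, x^2); a kernel SVM achieves the same effect implicitly, without ever constructing the mapped features.

```python
import numpy as np

# Toy 1-D data: the class depends on |x|, so no threshold on x separates it.
x = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

separable_in_1d = any(
    np.array_equal(x > t, y == 1) for t in np.linspace(-3, 3, 601)
)
print(separable_in_1d)  # False: no single threshold works in the original space

# After the feature map phi(x) = (x, x^2), the rule "x^2 > 1" separates perfectly.
phi = np.column_stack([x, x ** 2])
print(np.array_equal(phi[:, 1] > 1, y == 1))  # True
```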
5) How can the initial number of clusters for the K-means algorithm be
estimated?
=> One way to estimate the initial number of clusters for the K-means algorithm
is to use the elbow method. This method involves fitting the K-means algorithm
to the data for a range of values for the number of clusters and then visualizing
the sum of squared distances (also known as the within-cluster sum of squares)
for each value of k.
To use the elbow method, you would first fit the K-means algorithm to the data
for a range of values for k, such as from 1 to 10. Then, you would calculate the
within-cluster sum of squares for each value of k. Finally, you would plot the
within-cluster sum of squares versus k and look for the "elbow" point in the plot
to determine the optimal number of clusters.
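A minimal sketch of this procedure, assuming scikit-learn and matplotlib are available and using synthetic make_blobs data (all assumptions, not part of the notes):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squares for this k

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares")
plt.show()   # the 'elbow' in the curve suggests a good value of k
```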
Active learning and reinforcement learning are both types of machine learning,
but they differ in their approach to training models. Active learning is a type of
supervised learning in which the model is able to interactively query the user (or
some other information source) to obtain the desired outputs at new data points.
This allows the model to reach good performance with a much smaller amount of
labeled data than it would need if the labels were collected passively up front.
Reinforcement learning, in contrast, trains an agent through trial and error: the
agent takes actions in an environment and learns from reward signals rather than
from explicitly labeled examples.
=>
Large quantities: the volumes of data are huge, so it is almost impossible to
process this data with a conventional system.
Meaningful: we are dealing with up-to-date data; we do not need old data or any
kind of irregular or inconsistent data, so whatever data is collected should be
meaningful, relevant, and gathered in real time.
=>
Data Mining vs. Data Analysis:
Data mining is the process of extracting useful information, patterns, and trends
from raw data, whereas data analysis is a method used to investigate, analyze, and
present data in order to find useful information.
The output of data mining is a data pattern, whereas the output of data analysis
is a verified hypothesis or insights based on the data.
Data mining lies at the intersection of databases, machine learning, and
statistics, whereas data analysis requires expertise in computer science,
mathematics, statistics, and AI.
Data mining is also called KDD (knowledge discovery in databases), whereas data
analysis comes in various forms, such as text analytics, predictive analysis, and
data mining itself.
Data mining is responsible for extracting useful patterns and trends from data,
whereas data analysis is responsible for developing models, testing, and
proposing hypotheses using analytical methods.
A typical example of data mining is in the e-commerce sector, where websites show
a customer the products that others who viewed or purchased a specific item also
bought, whereas a typical example of data analysis is the study of a census.
To perform a two-way ANOVA, the researcher first divides the data into two or
more groups based on the levels of the first independent variable. Within each
of these groups, the data is further divided into subgroups based on the levels of
the second independent variable. The researcher then calculates the mean and
variance of the data in each of the subgroups and uses these values to test for
significant differences between the means of the subgroups.
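A minimal sketch of a two-way ANOVA, assuming the statsmodels and pandas libraries; the column names factor_a, factor_b, and y, and the numbers themselves, are made up for illustration:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical data: two independent variables (factor_a, factor_b) and a response y.
df = pd.DataFrame({
    "factor_a": ["low", "low", "low", "high", "high", "high"] * 2,
    "factor_b": ["x", "y", "x", "y", "x", "y"] * 2,
    "y":        [4.1, 5.2, 3.9, 6.8, 6.1, 7.0, 4.3, 5.0, 4.0, 6.5, 6.3, 7.2],
})

# Fit a linear model with both main effects and their interaction,
# then compute the two-way ANOVA table.
model = ols("y ~ C(factor_a) * C(factor_b)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```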
We know that a neural network is made up of neurons, each working with weights,
a bias, and an activation function. In a neural network, we update the weights
and biases of the neurons on the basis of the error at the output. This process
is known as back-propagation. Activation functions make back-propagation
possible, because their gradients are supplied along with the error to update
the weights and biases.
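A minimal sketch (an illustration, not the notes' own code) of how an activation function's gradient enters back-propagation, using a single sigmoid neuron trained with squared error on one example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = np.array([0.5, -1.0]), 1.0
w, b, lr = np.array([0.1, 0.2]), 0.0, 0.5

for _ in range(100):
    z = w @ x + b                 # pre-activation (weighted sum plus bias)
    a = sigmoid(z)                # activation (forward pass)
    error = a - target            # dLoss/da for loss = 0.5 * (a - target)^2
    grad_z = error * a * (1 - a)  # chain rule: sigmoid'(z) = a * (1 - a)
    w -= lr * grad_z * x          # dLoss/dw = grad_z * x
    b -= lr * grad_z              # dLoss/db = grad_z

print(sigmoid(w @ x + b))         # output moves toward the target of 1.0
```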
=>
Association rule learning is an unsupervised learning method that tests for the
dependence of one data item on another and maps those dependencies so that they
can be exploited. It tries to discover interesting relations among the variables
of a dataset, relying on measures of how interesting a rule is, such as support
and confidence, to find relations between variables in the database.
Items purchased on a credit card, such as rental cars and hotel rooms, give
insight into the next products the customer is likely to buy.
Optional services purchased by telecommunication users (call waiting, call
forwarding, DSL, speed dial, etc.) help decide how to bundle these services to
maximize revenue.
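The support and confidence measures behind such rules can be computed by hand; the following sketch uses a few hypothetical baskets (the baskets and the rule are made up for illustration):

```python
# Hypothetical shopping baskets.
baskets = [
    {"rental car", "hotel room"},
    {"rental car", "hotel room", "flight"},
    {"hotel room", "flight"},
    {"rental car", "flight"},
]

def support(itemset):
    # Fraction of baskets that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # How often the consequent appears given that the antecedent appears.
    return support(antecedent | consequent) / support(antecedent)

# Rule: {rental car} -> {hotel room}
print(support({"rental car", "hotel room"}))       # 0.5
print(confidence({"rental car"}, {"hotel room"}))  # 0.666...
```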
=> Underfitting and overfitting are the most common problems with any machine
learning algorithm.
In overfitting, the model fits the training data too closely, so the bias is low
but the variance (the validation or test error) is high. Overfitting occurs when
your model is too complex for your data; to put it simply, the model does not
make accurate predictions on new data. In this case the training error is very
small while the validation/test error is large. To fix overfitting, you should
simplify the model.
16)What is clustering?
=> Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same groups are more similar to other data
points in the same group and dissimilar to the data points in other groups. It is
basically a collection of objects on the basis of similarity and dissimilarity
between them.
Example: data points that lie close together on a scatter plot can be grouped
into a single cluster; when the points form three well-separated groups, we can
identify three clusters (a small sketch of this follows below). It is also not
necessary for clusters to be spherical.
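As a stand-in for the illustration mentioned above, here is a minimal sketch (assuming scikit-learn and synthetic make_blobs data, neither of which is part of the notes) of grouping points into three clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data drawn around three centres.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Assign every point to one of three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster index (0, 1, or 2) for the first 10 points
```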
=> In the context of neural networks, the term "output layer" typically refers to
the last layer of a network that produces the final outputs of the model. The
output layer receives input from the other layers in the network, processes this
input using the activation function, and then produces the final outputs of the
model.
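A minimal sketch of an output layer (the weights, bias, and the choice of a softmax activation here are illustrative assumptions, not something the notes specify): it takes the previous layer's activations, applies its weights, bias, and activation function, and produces the model's final outputs.

```python
import numpy as np

hidden = np.array([0.2, -0.7, 1.3])            # activations from the previous layer
W_out = np.array([[0.5, -0.1, 0.3],
                  [-0.4, 0.8, 0.2]])           # 2 output units, 3 inputs each
b_out = np.array([0.1, -0.2])

z = W_out @ hidden + b_out                     # weighted sum plus bias
outputs = np.exp(z) / np.exp(z).sum()          # softmax: final class probabilities
print(outputs)                                 # the final outputs sum to 1.0
```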
1) Convenience sampling
2) Voluntary sampling
3) Purposive sampling
4) Snowball sampling
Type I error vs. Type II error: a Type I error is rejecting a null hypothesis
that is actually true (a false positive), while a Type II error is failing to
reject a null hypothesis that is actually false (a false negative).
=> Deep mining is a type of data mining that involves using deep learning
algorithms to extract information and insights from large and complex datasets.
Deep learning is a subset of machine learning that uses neural networks with
multiple layers of processing to learn from data and make predictions or
classifications.
Deep mining combines the power of deep learning with the techniques of data
mining to discover hidden patterns and relationships in data. This can be useful
for a wide range of applications, such as analyzing customer behavior, detecting
fraud, or predicting the stock market.
One of the key advantages of deep mining is that it can handle large and
complex datasets that may be too difficult for traditional data mining algorithms
to handle. Deep learning algorithms are able to automatically extract useful
features from the data and learn complex relationships between the variables,
which can improve their accuracy and performance.
Overall, deep mining is a powerful tool for extracting information and insights
from large and complex datasets. By combining the techniques of data mining
with the power of deep learning, deep mining can help businesses and
organizations to make better decisions and improve their operations.
3. Application of clustering.
=> In data analysis, statistical models are used to analyze and make predictions
or inferences from data. Statistical models are mathematical equations that can
be used to describe the relationship between different variables in a dataset.
These models can be used to identify trends, patterns, and relationships in the
data that can help analysts understand the underlying processes that generated
the data.
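A minimal sketch of this idea: fitting a simple statistical model (ordinary least squares) to describe the relationship between two variables and make a prediction. The data and the variable meanings are made up for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # e.g. advertising spend (hypothetical)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # e.g. observed sales (hypothetical)

slope, intercept = np.polyfit(x, y, deg=1)     # fit the model y ≈ slope * x + intercept
print(slope, intercept)                        # describes the relationship
print(slope * 6.0 + intercept)                 # prediction for a new x value
```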
=> In data analytics, probability distributions and descriptive analysis are two
different statistical tools used to analyze data and make predictions or
inferences from it. A probability distribution is a mathematical function that
describes how likely the different possible values of a variable are (for
example, the normal distribution) and is mainly used for modelling and inference,
whereas descriptive analysis summarizes the data that has actually been observed
using measures such as the mean, median, variance, and frequency counts.
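A small sketch of the contrast, assuming NumPy and SciPy are available; the sample values are made up for illustration:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 13.0, 12.4, 11.5, 12.9, 12.2])

# Descriptive analysis: summarize the observed data.
print(data.mean(), data.std(ddof=1))

# Probability distribution: model the data with a fitted normal distribution
# and ask how likely a value of at most 13.0 is under that model.
print(stats.norm(loc=data.mean(), scale=data.std(ddof=1)).cdf(13.0))
```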
8. Differentiate between overfitting & underfitting. How do you deal with them?
=> Overfitting and underfitting are two common problems in machine learning
and data modeling. Overfitting occurs when a model is too complex and has too
many parameters, and as a result, it performs well on the training data but
poorly on new or unseen data. Underfitting, on the other hand, occurs when a
model is too simple and does not have enough parameters to accurately capture
the underlying trends and patterns in the data. Overfitting is typically dealt
with by simplifying the model, adding regularization, or training on more data,
while underfitting is dealt with by using a more complex model or more
informative features.
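A minimal sketch of the contrast, assuming scikit-learn and synthetic quadratic data (both assumptions): a degree-1 model underfits, a degree-2 model fits well, and a very high-degree model typically overfits, showing a low training error but a higher test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: y is quadratic in x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X[:, 0] ** 2 + rng.normal(scale=1.0, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 2, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```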
The basic principle of a neural network is that it learns to identify patterns and
make predictions or decisions based on the input data. This learning is
accomplished through a process of trial and error, where the neural network
adjusts the strength of the connections between its neurons based on the input
data and the desired output. Over time, the neural network learns to make more
accurate predictions or decisions based on the input data.
One of the key principles of a neural network is that it is capable of learning and
adapting to new data. This means that the neural network can improve its
performance over time as it is exposed to more and more data, and it can learn
to identify complex patterns and relationships in the data that are not easily
recognizable by humans.
10. Explain why an SVM can be more accurate than logistic regression, with an example.
=> One reason why SVMs can be more accurate than logistic regression is that they
are able to handle non-linear data more effectively. Logistic regression is a
linear algorithm, which means that it can only model linear relationships
between the input data and the output. This can make it less effective at
identifying complex patterns and relationships in the data.
Another reason why SVMs are more accurate than logistic regression is that
they are able to handle high-dimensional data more effectively. Logistic
regression can become less accurate as the number of input features increases,
because it can become difficult to estimate the coefficients for all of the input
features.
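A minimal sketch of the kind of example the question asks for, assuming scikit-learn and the synthetic make_moons dataset (both assumptions, not part of the notes): an RBF-kernel SVM handles the curved class boundary that a plain logistic regression cannot model.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_reg = LogisticRegression().fit(X_tr, y_tr)
svm_rbf = SVC(kernel="rbf").fit(X_tr, y_tr)

print("logistic regression accuracy:", log_reg.score(X_te, y_te))
print("RBF-kernel SVM accuracy:     ", svm_rbf.score(X_te, y_te))
# On this data the SVM typically scores noticeably higher, because the moons
# cannot be separated by a single straight line.
```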
=> Support vector machines (SVMs) are a type of supervised learning algorithm
that are known for their speed and efficiency. There are several reasons why
SVMs are fast compared to other algorithms.
One reason why SVMs can be fast is the kernel trick. Rather than explicitly
transforming the input data into a higher-dimensional space where it becomes
linearly separable, an SVM evaluates a kernel function that measures the
similarity between pairs of points directly. The model therefore gains the
expressiveness of that higher-dimensional space without ever having to construct
it, which keeps the computation manageable even on large amounts of data.
Another reason why SVMs are fast is that they are able to make predictions
quickly. Once the SVM has been trained on a dataset, it can make predictions
on new data using a simple dot product, which is a fast and efficient operation.
This allows SVMs to make predictions quickly and efficiently, even on large
datasets.
In addition, SVMs are fast because they are able to make predictions using only
a subset of the training data, known as the support vectors. This means that the
SVM only needs to store and process the support vectors in order to make
predictions, which can significantly reduce the computational cost of the model.
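A minimal sketch of the support-vector point, assuming scikit-learn (an assumption): after training, the decision function can be reproduced from the support vectors alone, so prediction does not need the full training set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=1)
clf = SVC(kernel="linear").fit(X, y)

print(len(X), "training points, but only", len(clf.support_vectors_), "support vectors")

# Reproduce the decision function by hand from the support vectors alone
# (for the linear kernel, it is a weighted sum of dot products plus a bias).
x_new = np.array([0.0, 0.0])
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(manual, clf.decision_function([x_new]))   # the two values match
```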
12. Justify what the best practices in big data analysis are.
=> There are several best practices that can be followed to ensure effective and
efficient big data analysis. Some of the key best practices include:
=> There are several techniques that are commonly used in big data analysis.
These techniques can be divided into two main categories: statistical techniques
and machine learning techniques.
Machine learning techniques are algorithms that are designed to learn from data
and make predictions or decisions based on that data. These techniques include
supervised learning, where the algorithm is trained on a labeled dataset and then
makes predictions on new data, and unsupervised learning, where the algorithm
learns to identify patterns and relationships in the data without the use of
labeled data.
To conduct an ANOVA, the data are first organized into groups or samples
based on the factors that are being compared. The mean of each group or
sample is then calculated, and the overall mean of all of the groups or samples is
also calculated. The differences between the group means and the overall mean
are then calculated and used to determine whether there are significant
differences between the groups.
If the ANOVA indicates that there are significant differences between the
groups, a post-hoc test can be used to identify which specific groups are
significantly different from each other. This can help identify the specific
factors or variables that are contributing to the differences between the groups.
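A minimal sketch of this procedure, assuming SciPy is available; the three groups and their values are made up for illustration:

```python
from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0, 4.9]
group_b = [6.2, 6.0, 6.5, 5.9, 6.3]
group_c = [5.0, 5.2, 4.7, 5.1, 4.9]

# One-way ANOVA: compares the group means against the overall mean.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value indicates the group means differ

# If the result is significant, a post-hoc test (e.g. pairwise comparisons with
# a correction for multiple testing) identifies which specific groups differ.
```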
=> Hadoop is an open-source software framework that is used for storing and
processing large amounts of data. It is designed to handle big data analytics,
which involves analyzing large and complex datasets to identify trends,
patterns, and relationships within the data.
One of the key advantages of Hadoop is that it is highly scalable, which means
that it can easily handle increases in the amount of data that needs to be
processed. This makes Hadoop well-suited for big data analytics, where large
and complex datasets are common.