U2 PDF
Non-probability sampling
Sampling can also be based on non-probability, an approach in which a data
sample is determined and extracted based on the judgment of the analyst.
Because inclusion is determined by the analyst, it can be more difficult to
extrapolate whether the sample accurately represents the larger population
than when probability sampling is used. Non-probability data sampling
methods include the following:
Convenience sampling. Data is collected from an easily accessible and
available group.
Consecutive sampling. Data is collected from every subject that meets the
criteria until the predetermined sample size is met.
Purposive or judgmental sampling. The researcher selects the data to
sample based on predefined criteria.
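As a concrete illustration, consecutive sampling (collecting every subject that meets the criteria until a predetermined sample size is reached) can be sketched in Python. The subjects, the age criterion, and the sample size below are all hypothetical:

```python
# A minimal sketch of consecutive sampling, assuming a hypothetical
# stream of subjects and an inclusion criterion (age >= 18).
subjects = [{"id": 1, "age": 17}, {"id": 2, "age": 25}, {"id": 3, "age": 30},
            {"id": 4, "age": 16}, {"id": 5, "age": 40}, {"id": 6, "age": 22}]

sample_size = 3
sample = []
for subject in subjects:            # take every subject meeting the criteria
    if subject["age"] >= 18:
        sample.append(subject)
    if len(sample) == sample_size:  # stop once the predetermined size is met
        break

print([s["id"] for s in sample])  # [2, 3, 5]
```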
The value 6 occurs most often; hence the mode of the data set is 6.
We often test our data by plotting the distribution curve. If most of the
values are centrally located and very few values lie far from the center,
we say the data has a normal distribution. In that case, the values of
mean, median, and mode are almost equal (i.e., there is no data skewness).
Data Skewness
When the data is skewed, for example with a right-skewed data set, the mean
is dragged in the direction of the skew. In a right-skewed distribution,
mode < median < mean. The more skewed the distribution, the greater the
difference between the median and the mean; in such cases we prefer the
median when drawing conclusions. For a left-skewed distribution,
mean < median < mode.
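The ordering above can be checked numerically with Python's standard statistics module; the data set below is a small hypothetical right-skewed sample (a few large values pull the tail to the right):

```python
import statistics

# Hypothetical right-skewed data set: most values are small,
# a few large values stretch the right tail.
data = [1, 2, 2, 2, 3, 3, 4, 5, 6, 12]

mode = statistics.mode(data)      # 2
median = statistics.median(data)  # 3
mean = statistics.mean(data)      # 4.0

# For right-skewed data: mode < median < mean
print(mode, median, mean)  # 2 3 4.0
```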
1. Range: It is simply the difference between the maximum value and the
minimum value in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Subtract the mean from each value in the set, square each
difference, add the squares, and finally divide by the total number of
values in the data set to get the variance. Variance (σ²) = ∑(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the
standard deviation, i.e., S.D. = √σ².
4. Quartiles and Quartile Deviation: The quartiles are values that divide a
list of numbers into quarters. The quartile deviation is half of the distance
between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the
mean and the arithmetic mean of the absolute deviations of the
observations from a measure of central tendency is known as the mean
deviation (also called mean absolute deviation).
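All five measures can be computed with Python's standard statistics module; as a sketch, the data set below reuses the values from the range example:

```python
import statistics

data = [1, 3, 5, 6, 7]        # same values as the range example above
mu = statistics.mean(data)    # 4.4

data_range = max(data) - min(data)                    # 7 - 1 = 6
variance = statistics.pvariance(data)                 # population variance, 4.64
std_dev = statistics.pstdev(data)                     # square root of the variance
q1, _, q3 = statistics.quantiles(data, n=4)           # first and third quartiles
quartile_dev = (q3 - q1) / 2                          # half of Q3 - Q1
mean_dev = sum(abs(x - mu) for x in data) / len(data) # mean absolute deviation

print(data_range, variance, std_dev, quartile_dev, mean_dev)
```

Note that `statistics.quantiles` defaults to the "exclusive" method of computing quartiles; other conventions (e.g., "inclusive") can give slightly different cut points on small samples.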
How can we identify the similarity or dispersion between them? By asking
them the same set of questions or by conducting a survey. Similar responses
will most likely lead us to conclude that they are probably friends or will
become friends.
Correlation coefficients range from −1 to 1; the larger the absolute value
of the coefficient, the stronger the correlation. The correlation coefficient
formula finds the relation between two variables, and the Pearson correlation
coefficient in particular measures the strength of their linear relationship.
2. Let's take a positive linear relationship example: salary increases with
age. Make a data chart including both variables. Label these variables
'x' (Age / Valentina's response) and 'y' (Salary / Srilakshmi's response).
Add three additional columns: (xy), (x²), and (y²). In this example N = 4,
as we have 4 pairs.
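The column-based Pearson formula, r = (NΣxy − ΣxΣy) / √[(NΣx² − (Σx)²)(NΣy² − (Σy)²)], can be sketched in Python; the age/salary values below are hypothetical stand-ins for the chart described above:

```python
import math

# Hypothetical age/salary pairs (N = 4)
x = [25, 30, 35, 40]    # age
y = [30, 40, 50, 60]    # salary (in thousands)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))   # the (xy) column summed
sum_x2 = sum(a * a for a in x)              # the (x²) column summed
sum_y2 = sum(b * b for b in y)              # the (y²) column summed

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(r)  # 1.0 — these values lie on a line, a perfect positive relationship
```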
There are several different data reduction techniques that can be used
in data mining, including:
It’s important to note that data reduction involves a trade-off between
accuracy and the size of the data: the more the data is reduced, the less
accurate and the less generalizable the resulting model may be.
1. Dimensionality Reduction
When we encounter weakly important data, we keep only the attributes
required for our analysis. Dimensionality reduction eliminates attributes
from the data set under consideration, thereby reducing the volume of the
original data. It reduces data size by eliminating outdated or redundant
features. Here are three methods of dimensionality reduction.
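The general idea of eliminating weakly important attributes can be sketched with a simple variance filter: an attribute whose values barely vary carries little information for analysis. The data set and threshold below are hypothetical:

```python
import statistics

# Hypothetical data set: rows are records, columns are attributes.
# The second attribute is constant, so it carries no information.
rows = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 20.0],
    [3.0, 5.0, 30.0],
    [4.0, 5.0, 40.0],
]

threshold = 0.01  # hypothetical cut-off for "weakly important"
columns = list(zip(*rows))
keep = [i for i, col in enumerate(columns)
        if statistics.pvariance(col) > threshold]
reduced = [[row[i] for i in keep] for row in rows]
print(keep)  # [0, 2] — the constant column is dropped
```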
2. Data Compression
This technique reduces the size of files using different encoding
mechanisms, such as Huffman encoding and run-length encoding. We can divide
it into two types, lossless and lossy, based on the compression technique.
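Run-length encoding, one of the mechanisms mentioned above, is a lossless technique and is simple to sketch: each run of repeated symbols is replaced by the symbol and its count, and decoding restores the original exactly.

```python
from itertools import groupby

def rle_encode(s):
    """Run-length encode a string into (symbol, count) pairs."""
    return [(ch, len(list(grp))) for ch, grp in groupby(s)]

def rle_decode(pairs):
    """Invert the encoding: repeat each symbol by its count."""
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
assert rle_decode(encoded) == "AAAABBBCCD"  # lossless round trip
```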
3. Data Discretization
The data discretization technique is used to divide attributes of a
continuous nature into data with intervals. We replace the many continuous
values of the attributes with labels of small intervals, so that mining
results are presented in a concise and easily understandable way.
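A minimal sketch of this idea in Python, assuming hypothetical interval boundaries and labels for an "age" attribute:

```python
# Hypothetical discretization: replace continuous ages with interval labels.
bins = [(0, 18, "young"), (18, 40, "adult"), (40, 120, "senior")]

def discretize(value):
    """Map a continuous value to the label of the interval containing it."""
    for low, high, label in bins:
        if low <= value < high:
            return label
    return "unknown"

ages = [5, 22, 35, 47, 63]
print([discretize(a) for a in ages])
# ['young', 'adult', 'adult', 'senior', 'senior']
```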
Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of
machine learning algorithms by reducing the size of the dataset. This can
make it faster and more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the
performance of machine learning algorithms by removing irrelevant or
redundant information from the dataset. This can help to make the model
more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs
associated with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the
interpretability of the results by removing irrelevant or redundant
information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information
if important data is removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model,
as reducing the size of the dataset can also remove important information
that is needed for accurate predictions.