
Unit 2

Descriptive Data Analysis

2.1 Dataset Construction


Steps to Constructing Your Dataset
To construct your dataset (and before doing data transformation), you
should:
 Collect the raw data.
 Identify feature and label sources.
 Select a sampling strategy.
 Split the data.
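
A minimal sketch of the sampling and splitting steps in code, assuming the raw data has already been collected into a pandas DataFrame with a "label" column (the column names and values here are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

raw = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8],
    "feature_b": [0.2, 0.4, 0.1, 0.9, 0.5, 0.3, 0.7, 0.8],
    "label":     [0, 1, 0, 1, 0, 1, 0, 1],
})

# Sampling strategy: here, a simple random sample of 75% of the rows.
sample = raw.sample(frac=0.75, random_state=42)

# Split into features/labels, then into training and test sets.
X = sample.drop(columns="label")
y = sample["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))
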
The Size of a Data Set
As a rough rule of thumb, your model should train on at least an order of
magnitude more examples than trainable parameters. Simple models on
large data sets generally beat fancy models on small data sets. Google has
had great success training simple linear regression models on large data sets.
What counts as "a lot" of data? It depends on the project; in practice, data sets
range from a few hundred examples to many billions, so judge size relative to the
number of trainable parameters in the model you plan to train.


The Quality of a Data Set
It’s no use having a lot of data if it’s bad data; quality matters, too.
But what counts as "quality"? It's a fuzzy term. Consider taking an empirical
approach and picking the option that produces the best outcome. With that
mindset, a quality data set is one that lets you succeed with the business
problem you care about. In other words, the data is good if it accomplishes
its intended task.
However, while collecting data, it's helpful to have a more concrete
definition of quality. Certain aspects of quality tend to correspond to better-
performing models:
 reliability
 feature representation
 minimizing skew
Reliability
Reliability refers to the degree to which you can trust your data. A model
trained on a reliable data set is more likely to yield useful predictions than a
model trained on unreliable data. In measuring reliability, you must
determine:
How common are label errors? For example, if your data is labeled by
humans, sometimes humans make mistakes.
Are your features noisy? For example, GPS measurements fluctuate. Some
noise is okay. You’ll never purge your data set of all noise. You can collect
more examples too.
Is the data properly filtered for your problem? For example, should your
data set include search queries from bots? If you're building a spam-
detection system, then likely the answer is yes, but if you're trying to
improve search results for humans, then no.
What makes data unreliable?
Recall from the Machine Learning Crash Course that many examples in data
sets are unreliable due to one or more of the following:
Omitted values. For instance, a person forgot to enter a value for a house's
age.
Duplicate examples. For example, a server mistakenly uploaded the same
logs twice.



Bad labels. For instance, a person mislabeled a picture of an oak tree as a
maple.
Bad feature values. For example, someone typed an extra digit, or a
thermometer was left out in the sun.
Feature Representation
Recall from the Machine Learning Crash Course that representation is the
mapping of data to useful features. You'll want to consider the following
questions:
 How is data shown to the model?
 Should you normalize numeric values?
 How should you handle outliers?
Identifying Labels and Sources
Direct vs. Derived Labels
Machine learning is easier when your labels are well-defined. The best
label is a direct label of what you want to predict. For example, if you want
to predict whether a user is a Taylor Swift fan, a direct label would be "User
is a Taylor Swift fan." A simpler test of fanhood might be whether the user
has watched a Taylor Swift video on YouTube. The label "user has watched
a Taylor Swift video on YouTube" is a derived label because it does not
directly measure what you want to predict. Is this derived label a reliable
indicator that the user likes Taylor Swift? Your model will only be as good
as the connection between your derived label and your desired prediction.
Label Sources
The output of your model could be either an Event or an Attribute.
This results in the following two types of labels:
 Direct label for Events, such as “Did the user click the top search
result?”
 Direct label for Attributes, such as “Will the advertiser spend more
than $X in the next week?”
Direct Labels for Attributes
Let's say your label is, “The advertiser will spend more than $X in the next
week.” Typically, you'd use the previous days of data to predict what will
happen in the subsequent days. For example, you might use ten days of training
data to predict whether the advertiser spends more than $X over the subsequent
seven days, as sketched below:
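
A small pandas sketch of this setup, assuming a hypothetical daily spend log for one advertiser and an arbitrary threshold of X = $100:

import pandas as pd

# Hypothetical daily spend log for a single advertiser over 17 days.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=17, freq="D"),
    "spend": [10, 12, 9, 11, 30, 25, 8, 14, 13, 12, 40, 35, 20, 15, 10, 5, 7],
})

X_THRESHOLD = 100  # assumed value of $X for the label

# Features: spend over the first ten days.
feature_total = df.iloc[:10]["spend"].sum()
# Direct label for the attribute: did spend over the next seven days exceed $X?
label = df.iloc[10:17]["spend"].sum() > X_THRESHOLD
print(feature_total, label)
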



Human Labeled Data?
There are advantages and disadvantages to using human-labeled data.
Pros
 Human raters can perform a wide range of tasks.
 The data forces you to have a clear problem definition.
Cons
 The data is expensive for certain domains.
 Good data typically requires multiple iterations.
Introduction to Transforming Data
Feature engineering is the process of determining which features might be
useful in training a model, and then creating those features by transforming
raw data found in log files and other sources.
 Identify types of data transformation, including why and where to
transform.
 Transform numerical data (normalization and bucketization).
 Transform categorical data.
Reasons for Data Transformation
We transform features primarily for the following reasons:
1. Mandatory transformations for data compatibility. Examples
include:
 Converting non-numeric features into numeric. You can’t do
matrix multiplication on a string, so we must convert the string
to some numeric representation.



 Resizing inputs to a fixed size. Linear models and feed-forward
neural networks have a fixed number of input nodes, so your
input data must always have the same size. For example, image
models need to reshape the images in their dataset to a fixed
size.
2. Optional quality transformations that may help the model perform
better. Examples include:
 Tokenization or lower-casing of text features.
 Normalized numeric features (most models perform better
afterwards).
 Allowing linear models to introduce non-linearities into the
feature space.
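
A short sketch of both kinds of transformation, using a hypothetical two-column DataFrame (one categorical feature, one numeric):

import pandas as pd

df = pd.DataFrame({
    "city":   ["Chennai", "Delhi", "chennai", "Mumbai"],
    "income": [42000.0, 58000.0, 39000.0, 91000.0],
})

# Quality transformation: lower-case the text feature.
df["city"] = df["city"].str.lower()

# Mandatory transformation: convert the non-numeric feature to numbers
# via one-hot encoding (matrix multiplication needs numeric input).
encoded = pd.get_dummies(df, columns=["city"])

# Quality transformation: min-max normalize the numeric feature to [0, 1].
encoded["income"] = (encoded["income"] - encoded["income"].min()) / (
    encoded["income"].max() - encoded["income"].min())
print(encoded)
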
EXTRACT – TRANSFORM – LOAD
ETL and ELT are the most widely used methods for delivering data
from one or many sources to a centralized system for easy access and
analysis. Both consist of the extract, transform, and load stages.
ETL is the initialism for extraction, transformation, and loading. It is
the process of collecting raw data from disparate sources, transmitting it to
a staging database for conversion, and loading prepared data into a unified
destination system.



ELT is the initialism for extraction, loading, and transformation.
Basically, ELT inverts the last two stages of the ETL process, meaning that
after being extracted from databases, data is loaded straight into a central
repository where all transformations occur. The staging database is absent.

Key stages of the ETL and ELT processes


Data is usually extracted in one of three ways:
 Full extraction is applied to systems that can’t identify which records
are new or changed. In such cases, the only way to pull data out of the
system is to extract all records — old and new.
 Partial extraction with update notifications is the most convenient
way to extract data from source systems. It is possible if the systems
provide alerts when any records are changed so there’s no need to load
all data.
 Incremental extraction or partial extraction without update
notifications is the method of getting extracts on only those records
that have been modified.
With the ETL method, users have to plan ahead which data items should be
extracted for further transformation and loading. ELT, on the flip side,
makes it possible to extract and load all the data immediately. Users can
decide which data to transform and analyze later.



Transform
Stage 3 in ELT / Stage 2 in ETL
The transformation phase involves an array of activities aimed at preparing
the data by changing it to fit the parameters of another system or the desired
result.
Transformations may include:
 data sorting and filtering to get rid of irrelevant items,
 de-duplicating and cleansing,
 translating and converting,
 removing or encrypting to protect sensitive information, and
 splitting or joining tables, etc.
Load
Stage 2 in ELT/ Stage 3 in ETL
This stage applies to loading data into a target data storage system so
that users can access it. The ETL process flow implies the import of
previously extracted and already prepared data from a staging database into
a target data warehouse or database. This is performed either through
physically inserting separate records as new rows into the table of a
warehouse using SQL commands or with the help of a massive bulk load
scenario.
ELT, in turn, delivers the mass of raw data directly to the target
storage location, skipping an intermediate level. This cuts the extraction-to-
delivery cycle big time. Just like with extraction, data can be loaded either
fully or partially.
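
A minimal end-to-end ETL sketch in Python, assuming a hypothetical sales.csv file with order_id, amount and country columns, and a SQLite database standing in for the target warehouse:

import csv
import sqlite3

def extract(path):
    # Extract: read raw records from the source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: de-duplicate and clean/convert values (the "staging" step).
    cleaned, seen = [], set()
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append((row["order_id"], float(row["amount"]), row["country"].upper()))
    return cleaned

def load(rows, conn):
    # Load: insert the prepared rows into the target table.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("sales.csv")), conn)

In an ELT flow, the raw rows would be loaded first and the transformation would run afterwards as queries inside the warehouse itself.
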
2.2 Data Sampling
What is data sampling?
Data sampling is a statistical analysis technique used to select, manipulate
and analyze a representative subset of data points to identify patterns and
trends in the larger data set being examined. It enables data scientists,
predictive modelers and other data analysts to work with a small,
manageable amount of data about a statistical population to build and run
analytical models more quickly, while still producing accurate findings.



Why is data sampling important?
Data sampling is a widely used statistical approach that can be applied in
various use cases, including opinion polling, web analytics, and political polls. For
example, a researcher doesn't need to speak with every American to
discover the most common method of commuting to work in the U.S.
Instead, they can choose 1,000 participants as a representative sample in the
hopes that this number will be sufficient to produce accurate results.

Therefore, data sampling enables data scientists and researchers to
extrapolate knowledge about a broad population from a smaller sample of
data. By taking a data sample, predictions about the larger population can
be made with a certain level of confidence without having to collect and
analyze data from each member of the population.
Advantages and challenges of data sampling
Data sampling is an effective approach for data analysis that comes with
various benefits and also a few challenges.
Benefits of data sampling
 Time savings. Sampling can be particularly useful with data sets that are
too large to efficiently analyze in full -- for example, in big data analytics
applications or surveys. Identifying and analyzing a representative sample
is more efficient and less time-consuming than surveying the entirety of
the data or population.
 Cost savings. Data sampling is often more cost-effective than
collecting data from the entire population.
 Accuracy. Correct sampling techniques can produce reliable
findings. Researchers can accurately interpret information about the
total population by selecting a representative sample.
 Flexibility. Data sampling provides researchers with the flexibility to
choose from a variety of sampling methods and sample sizes to best
address their research questions and make use of their resources.
 Bias elimination. Sampling can help to eliminate bias in data
analysis, as a well-designed sample can limit the influence of outliers,
errors and other kinds of bias that may impair the analysis of the entire
population.



Data sampling process
The process of data sampling typically involves the following steps:
 Defining the population. The population is the entire set of data from
which the sample is drawn. To guarantee that the sample is
representative of the entire population, the target population must be
precisely defined, including all essential traits and criteria.
 Selecting a sampling technique. The next step is to choose the best
sampling method based on the research question and the
characteristics of the population under study. There are several
methods for drawing samples from data such as simple random
sampling, cluster sampling, stratified sampling and systematic
sampling.
 Determining the sample size. The optimum sample size required to
produce accurate and reliable results should be decided in this phase.
This decision may be influenced by certain factors, such as money,
time constraints and the requirement for greater precision. The sample
size should be large enough to be representative of the population, but
not so large that it becomes impractical to work with.
 Collecting the data. The data is collected from the sample using the
sampling approach that was chosen, such as interviews, surveys or
observations. This may entail random selection or other stated criteria,
depending on the research question. For example, in random
sampling, data points are selected at random from the population.
 Analyzing the sample data. After collecting the data sample, it's
processed and analyzed to draw conclusions about the population. The
results of the analysis are then generalized or applied to the entire
population.
Challenges of data sampling
Risk of bias. One of the main challenges with data sampling is the
possibility of introducing bias into the sample. If the sample is not
representative of the population, it can lead to incorrect or misleading
conclusions.



Determining the sample size. With data sampling, determining an
appropriate sample size can be difficult sometimes. If the sample size is too
small, the results might not be accurate since the sample will not be
representative of the population.
Sampling error. Data sampling can also pose the risk of sampling error,
which is the discrepancy between the sample and the population. The
accuracy of the results may be affected by this inaccuracy, which may
happen by chance, bias or other factors.
Sampling method. The choice of sampling method can vary depending on
the research question and population being studied. However, selecting the
appropriate sampling technique can be difficult, as different techniques are
better suited for different research questions and populations.

Types of data sampling methods


There are many different methods for drawing samples from data; the ideal
one depends on the data set and situation. The following are the two
common types of sampling methods:
Probability sampling
Sampling can be based on probability, an approach that uses random
numbers that correspond to points in the data set to ensure that there is no
correlation between points chosen for the sample. Further variations in
probability sampling include the following:

Simple random sampling. Software is used to randomly select subjects
from the whole population.
Stratified sampling. Subsets of the data sets or population are created based
on a common factor and samples are randomly collected from each
subgroup.
Cluster sampling. The larger data set is divided into subsets (clusters)
based on a defined factor, then a random sampling of clusters is analyzed.
Multistage sampling. A more complicated form of cluster sampling, this
method also involves dividing the larger population into a number of
clusters. Second-stage clusters are then broken out based on a secondary
factor, and those clusters are sampled and analyzed. This staging could
continue as multiple subsets are identified, clustered and analyzed.



Systematic sampling. A sample is created by setting an interval at which
to extract data from the larger population -- for example, selecting every
10th row in a spreadsheet of 200 items to create a sample size of 20 rows to
analyze.
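
A sketch of three of these probability methods on a toy population of 200 rows (the column names are made up for illustration):

import pandas as pd

population = pd.DataFrame({
    "row_id": range(1, 201),
    "group":  ["north", "south"] * 100,   # stratification factor
    "value":  range(200),
})

# Simple random sampling: 20 rows chosen at random.
simple = population.sample(n=20, random_state=1)

# Stratified sampling: 10% drawn at random from each "group" subset.
stratified = population.groupby("group", group_keys=False).apply(
    lambda g: g.sample(frac=0.10, random_state=1))

# Systematic sampling: every 10th row of 200 items gives a sample of 20 rows.
systematic = population.iloc[::10]
print(len(simple), len(stratified), len(systematic))
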

Non-probability sampling
Sampling can also be based on non-probability, an approach in which a data
sample is determined and extracted based on the judgment of the analyst.
As inclusion is determined by the analyst, it can be more difficult to
extrapolate whether the sample accurately represents the larger population
than when probability sampling is used. Non-probability data sampling
methods include the following:
Convenience sampling. Data is collected from an easily accessible and
available group.
Consecutive sampling. Data is collected from every subject that meets the
criteria until the predetermined sample size is met.
Purposive or judgmental sampling. The researcher selects the data to
sample based on predefined criteria.



Quota sampling. The researcher ensures equal representation within the
sample for all subgroups in the data set or population.
Once generated, a sample can be used for predictive analytics. For
example, a retail business might use data sampling to uncover patterns in
customer behavior and predictive modeling to create more effective sales
strategies.
Common data sampling errors
A sampling error is a difference between the sampled value and the
true population value. Sampling errors happen during data collection when
the sample is not typical of the population or is biased in some way.
Because a sample is merely an approximation of the population from
which it is collected, even randomized samples will have some sampling
error.
The following are some common data sampling errors:
Sampling error. Sampling bias arises when the sample is not representative
of the population. This can occur when the sampling method is incorrect or
when there is a systemic inaccuracy in the sampling process. Errors may
develop as a result of a large variance in a specific metric across a specified
date range. Alternatively, they could happen due to a generally low volume
of a given measure in relation to visits. For instance, if a site has a very low
transaction count in comparison to overall visits, sampling may result in
substantial disparities.
Selection error. Selection bias arises when the sample is chosen in a way
that favors a specific group or trait. For example, if a health study is only
conducted on people who are willing to participate, the sample may not be
representative of the overall community.
Non-response error. This bias happens when people chosen for the sample
do not participate in the survey or study. As a result, certain groups may be
underrepresented, affecting the accuracy of the results.



2.3 & 2.4 Stem and Leaf Plots
A stem and leaf plot, also known as a stem and leaf diagram, is a way to
arrange and represent data so that it is simple to see how frequently various
data values occur. It is a plot that displays ordered numerical data.
A stem and leaf plot is shown as a special table where the digits of a data
value are divided into a stem (first few digits) and a leaf (usually the last
digit). The symbol ‘|’ is used to split and illustrate the stem and leaf values.
For instance, 105 is written as 10 on the stem and 5 on the leaf.
Example:
Let’s say there are 10 Technical Content Writers at GeeksforGeeks. Each of
them submitted 100 articles to publish at the site. Out of 100 articles, the
number of articles which had some errors is given below for each of the 10
content writers –
16, 25, 47, 56, 23, 45, 19, 55, 44, 27
Stem-and-leaf plot will be –
1 | 69
2 | 357
4 | 457
5 | 56
Plotting Stem and Leaf plots in Python (Self-explanatory)
https://www.geeksforgeeks.org/stem-and-leaf-plots-in-python/
Explanation –
In the plot produced by the linked code, the leftmost column is a cumulative
frequency count: there are two observations in the range 10–19 and three in the
range 20–29, giving a total of five observations up to 29. Continuing in the same
way, the total of 10 observations appears at the top of that column. After the
vertical line there are two aggregate values: at the bottom is 16 and at the top is
56, which are simply the minimum and maximum values in the given data set.
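
Independently of the linked library, a few lines of plain Python are enough to reproduce the stem-and-leaf table for the error counts above:

def stem_and_leaf(values):
    plot = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)      # split each value into stem and leaf
        plot.setdefault(stem, []).append(leaf)
    for stem, leaves in plot.items():
        print(stem, "|", "".join(str(leaf) for leaf in leaves))

stem_and_leaf([16, 25, 47, 56, 23, 45, 19, 55, 44, 27])
# Output:
# 1 | 69
# 2 | 357
# 4 | 457
# 5 | 56
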



2.5 Time Series Data
Time series data definition
Time series data is a collection of observations (behaviour) for
a single subject (entity) at different time intervals.
A time series is a group of observations on a single entity over time — e.g.
the daily closing prices over one year for a single financial security, or a
single patient’s heart rate measured every minute over a one-hour
procedure.
Time series data is data that is recorded over consistent intervals of
time. Time series data is a collection of observations obtained through
repeated measurements over time. Plot the points on a graph, and one of
your axes would always be time.
Time series metrics refer to a piece of data that is tracked at an increment in
time. For instance, a metric could refer to how much inventory was sold in
a store from one day to the next.
Time series data is everywhere, since time is a constituent of everything that
is observable. As our world gets increasingly instrumented, sensors and
systems are constantly emitting a relentless stream of time series data. Such
data has numerous applications across various industries. Let’s put this in
context through some examples.
Examples of time series analysis:
 Electrical activity in the brain
 Rainfall measurements
 Stock prices
 Number of sunspots
 Annual retail sales
 Monthly subscribers
 Heartbeats per minute



Another familiar example of time series data is patient health monitoring,
such as in an electrocardiogram (ECG), which monitors the heart’s activity
to show whether it is working normally.



Types of time series data
Time series data can be classified into two types:
 Measurements gathered at regular time intervals (metrics)
 Measurements gathered at irregular time intervals (events)
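
A small pandas sketch of the difference, assuming a hypothetical log of purchase events: the irregular events are resampled onto a regular daily interval to produce a metric.

import pandas as pd

# Events: measurements at irregular times (individual purchases).
events = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 09:15", "2024-01-01 17:40",
                            "2024-01-02 11:05", "2024-01-04 08:30"]),
    "amount": [120.0, 80.0, 45.0, 200.0],
}).set_index("time")

# Metric: aggregate the events onto a regular daily interval.
daily_metric = events["amount"].resample("D").sum()
print(daily_metric)
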

Time Series Analysis Types

Models of time series analysis include:

 Classification: Identifies and assigns categories to the data.


 Curve fitting: Plots the data along a curve to study the relationships of
variables within the data.
 Descriptive analysis: Identifies patterns in time series data, like trends,
cycles, or seasonal variation.
 Explanative analysis: Attempts to understand the data and the
relationships within it, as well as cause and effect.
 Exploratory analysis: Highlights the main characteristics of the time
series data, usually in a visual format.
 Forecasting: Predicts future data. This type is based on historical trends.
It uses the historical data as a model for future data, predicting scenarios
that could happen along future plot points.
 Intervention analysis: Studies how an event can change the data.
 Segmentation: Splits the data into segments to show the underlying
properties of the source information.



Data variations
In time series data, variations can occur sporadically throughout the data:
 Functional analysis can pick out the patterns and relationships within
the data to identify notable events.
 Trend analysis means determining consistent movement in a certain
direction. There are two types of trends: deterministic, where we can
find the underlying cause, and stochastic, which is random and
unexplainable.
 Seasonal variation describes events that occur at specific and regular
intervals during the course of a year.
 Serial dependence occurs when data points close together in time tend to
be related.

2.6 Measure of Central Tendency


Central tendency measures provide valuable insights into a dataset’s
typical or central values. They help us understand the overall distribution
and characteristics of a set of observations by identifying the central or
representative value around which the data tend to cluster.
What is Measure of Central Tendency?
We should first understand the term Central Tendency. Data tend to
accumulate around the average value of the total data under consideration.
Measures of central tendency will help us to find the middle, or the average,
of a data set. If most of the data is centrally located and the spread is very
small, it will form a symmetric bell curve. In such conditions the values of
mean, median and mode are equal.
Mean, Median, Mode
Mean - It is the average of values.
Median - It is the centrally located value of the data set sorted in ascending
order.
Mode - It is the most frequent value in the data set. We can easily get the
mode by counting the frequency of occurrence.



Consider a data set with the values 1, 5, 5, 6, 8, 2, 6, 6. In this data set, we can
observe the following:

The mean is (1 + 5 + 5 + 6 + 8 + 2 + 6 + 6) / 8 = 4.875, and the median of the
sorted values 1, 2, 5, 5, 6, 6, 6, 8 is (5 + 6) / 2 = 5.5.
The value 6 occurs the most (three times), hence the mode of the data set is 6.
We often check our data by plotting the distribution curve: if most of the
values are centrally located and very few values lie far from the center, we say
that the data follows a roughly normal distribution. In that case the values
of mean, median, and mode are almost equal (i.e., there is no data skewness).
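
The figures above can be checked directly with Python's statistics module:

import statistics

data = [1, 5, 5, 6, 8, 2, 6, 6]
print(statistics.mean(data))    # 4.875
print(statistics.median(data))  # 5.5 (average of 5 and 6 in the sorted data)
print(statistics.mode(data))    # 6 (occurs three times)
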

Data Skewness
When the data is skewed, as with the right-skewed data set shown below, the
mean is dragged in the direction of the skew. In a right-skewed distribution,
mode < median < mean. The more skewed the distribution, the greater the
difference between the median and the mean, which is why the median is usually
preferred for drawing conclusions in such cases. For a left-skewed distribution,
mean < median < mode.



Figure: Right Skewed Data distribution.

Figure: Left Skewed Data distribution.


An Example
An OTT platform company has conducted a survey in a particular region
based on the watch time, language of streaming, and age of the viewer. For
our understanding, we have taken a sample of 10 people.



We can perform different median/mean/mode analyses on all the attributes
of the above dataset and draw different plots. Since language is non-numerical
data, let's count the number of occurrences of each value, which yields Hindi – 4,
English – 3 and other languages – 1. Using the mode, we can conclude that Hindi
is the most popular language, followed by English.

Hence, from the above observations, it is concluded that in this sample
survey the average age of viewers is 12.5 years, and they watch a show in the
Hindi language for 2.5 hours daily.
There is no single best measure of central tendency; the right choice always
depends on the type of data.
Q1. What are the 4 measures of central tendency?
A. The four measures of central tendency are mean, median, mode, and
midrange. Central tendency examples include finding the average age in a
group, determining the middle value of test scores, or identifying the most
frequently occurring color in a survey.



Q2. How do we find central tendency?
A. To find central tendency, calculate the mean by summing all values and
dividing by the total number, find the median by locating the middle value of the
sorted data, or determine the mode as the most commonly occurring value.
Q3. Which is the best measure of central tendency?
A. The best measure of central tendency depends on the type of data and the
specific context of the analysis. The mean is commonly used, but other
measures may be more appropriate in certain situations.

2.7 Measures of Dispersion


Why is dispersion analysis needed?
The measures of central tendency are not adequate to describe data.
Two or more data sets can have the same mean, but they can be entirely
different as shown in below figure. Thus to describe data, one needs to know
the extent of variability. This is given by the measures of dispersion. Range,
interquartile range, and standard deviation are the three commonly used
measures of dispersion.
Dispersion is the state of getting dispersed or spread. Statistical
dispersion means the extent to which numerical data is likely to vary about
an average value. In other words, dispersion helps to understand the
distribution of the data.

Figure: Three different datasets giving the same mean.



Range – Boundary values of data
The range is the difference between the largest and the smallest observation
in the data. The prime advantage of this measure of dispersion is that it is
easy to calculate. On the other hand, it has a lot of disadvantages. It is very
sensitive to outliers and does not use all the observations in a data set. It is
more informative to provide the minimum and the maximum values rather
than providing the range.
E.g., TNEA Cutoff Range for AZ course.

INTERQUARTILE RANGE – Analysis of middle data


Interquartile range is defined as the difference between the 25th and
75th percentile (also called the first and third quartile). Hence the
interquartile range describes the middle 50% of observations. If the
interquartile range is large, it means that the middle 50% of observations are
spaced wide apart. The important advantage of the interquartile range is that it
can be used as a measure of variability even if the extreme values are not
recorded exactly (as in the case of open-ended class intervals in a frequency
distribution). Another advantageous feature is that it is not affected by extreme
values.
E.g., Product Sales analysis in Q2, Q3 of a calendar year.

STANDARD DEVIATION – Analysis of data differences / Distance computation

Standard deviation (SD) is the most commonly used measure of dispersion. It is a
measure of the spread of data about the mean. SD is the square root of the sum
of squared deviations from the mean divided by the number of observations:
σ = √( ∑(X−μ)² / N )

In the case of datasets, we can also use distance formulas such as the Euclidean,
Manhattan, or Mahalanobis distance to compute the deviation between two
datasets.
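
As a rough sketch, the Euclidean and Manhattan distances between two small feature vectors (made-up values) can be computed with NumPy; the Mahalanobis distance would additionally need the inverse covariance matrix of the data.

import numpy as np

a = np.array([2.0, 4.0, 6.0])
b = np.array([3.0, 1.0, 7.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(a - b))           # sum of absolute differences
print(euclidean, manhattan)
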



Measure of Dispersion
An absolute measure of dispersion is expressed in the same units as the original
data set. The absolute dispersion method expresses the variation in terms
of the average of the deviations of observations, such as the standard or mean
deviation. It includes range, standard deviation, quartile deviation, etc.
The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the
minimum value given in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
2. Variance: Subtract the mean from each value in the set, square each of these
differences, add the squares, and finally divide by the total number of values in
the data set to get the variance. Variance (σ²) = ∑(X−μ)²/N
3. Standard Deviation: The square root of the variance is known as the
standard deviation, i.e. S.D. (σ) = √Variance = √(∑(X−μ)²/N).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a
list of numbers into quarters. The quartile deviation is half of the distance
between the third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the
mean and the arithmetic mean of the absolute deviations of the
observations from a measure of central tendency is known as the mean
deviation (also called mean absolute deviation).
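
A short NumPy sketch computing these absolute measures for the small sample 1, 3, 5, 6, 7 used in the range example above:

import numpy as np

x = np.array([1, 3, 5, 6, 7], dtype=float)

data_range = x.max() - x.min()                      # 6, as in the example above
variance   = np.mean((x - x.mean()) ** 2)           # population variance, sigma^2
std_dev    = np.sqrt(variance)                      # standard deviation
q1, q3     = np.percentile(x, [25, 75])
quartile_deviation = (q3 - q1) / 2                  # half of the interquartile range
mean_abs_deviation = np.mean(np.abs(x - x.mean()))  # mean deviation about the mean
print(data_range, variance, std_dev, quartile_deviation, mean_abs_deviation)
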



2.8 Correlation Analysis

Correlation analysis in research is a statistical method used to measure the
strength of the linear relationship between two variables and compute their
association. Simply put, correlation analysis calculates the level of change in one
variable due to the change in the other.

A high correlation points to a strong relationship between the two
variables, while a low correlation means that the variables are weakly related.

Example of correlation analysis

Correlation between two variables can be either a positive correlation, a
negative correlation, or no correlation. Let's look at examples of each of these
three types.

 Positive correlation: A positive correlation between two variables means
both the variables move in the same direction. An increase in one variable
leads to an increase in the other variable and vice versa. For example,
spending more time on a treadmill burns more calories.

 Negative correlation: A negative correlation between two variables
means that the variables move in opposite directions. An increase in one
variable leads to a decrease in the other variable and vice versa. For
example, increasing the speed of a vehicle decreases the time you take to
reach your destination.

 Weak/Zero correlation: No correlation exists when one variable does not
affect the other. For example, there is no correlation between the number
of years of school a person has attended and the letters in his/her name.



Let's say you are interested in the similarity between three girls, Girl A,
Girl B and Girl C, say Valentina, Sri Lakshmi and Dhanya Gaayathri.

How can we identify the similarity or dispersion between them? By asking them
the same set of questions or by conducting a survey. Similar responses from them
will most likely lead us to conclude that they are probably friends or will become
friends.

Correlation coefficients range from −1 to +1; the higher the absolute value of the
coefficient, the stronger the correlation.

All correlation strength scores and classifications (based on the absolute value of
the coefficient) are outlined below.

 Perfect: 0.80 to 1.00
 Strong: 0.50 to 0.79
 Moderate: 0.30 to 0.49
 Weak: 0.00 to 0.29

Results close to +1 indicate a positive correlation, meaning as Variable A
increases, Variable B also increases.

Outputs closer to −1 are a sign of a negative correlation; these results mean
that as Variable A increases, Variable B decreases.

A value near 0 in a correlation analysis indicates a less meaningful
relationship between Variable A and Variable B.

Pearson correlation coefficient: Definition, formula & calculation, and examples

The Pearson correlation coefficient, also called Pearson's correlation coefficient
or Pearson's r, is defined in statistics as the measurement of the strength of the
relationship between two variables and their association with each other.

What does the Pearson correlation coefficient test do?

The Pearson correlation coefficient has high statistical significance. It
looks at the relationship between two variables. It seeks to draw a line through
the data of two variables to show their relationship. This linear relationship can
be positive or negative.



Pearson correlation coefficient formula and calculation

The correlation coefficient formula finds out the relation between the variables.
It returns values between −1 and 1. For two variables x and y with N pairs of
observations, Pearson's r is computed as:

r = [ N∑xy − (∑x)(∑y) ] / √( [N∑x² − (∑x)²] · [N∑y² − (∑y)²] )



Calculation

1. Consider the following example.


X 1 2 3 4 5 6
Y 2 4 7 9 12 14

Step one: Create a correlation coefficient table.


  X    Y    XY    X²    Y²
  1    2     2     1     4
  2    4     8     4    16
  3    7    21     9    49
  4    9    36    16    81
  5   12    60    25   144
  6   14    84    36   169
Σ: 21   48   211    91   490

On applying these totals to the formula (with N = 6), we get r ≈ 0.998.
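
The worked example can be verified with NumPy, both with the formula above and with the built-in corrcoef function:

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 7, 9, 12, 14], dtype=float)

n = len(x)
r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2) * (n * np.sum(y ** 2) - np.sum(y) ** 2))
print(round(r, 3))                         # 0.998
print(round(np.corrcoef(x, y)[0, 1], 3))   # same value from NumPy directly
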

Here is a step-by-step guide to calculating Pearson’s correlation


coefficient:

2. Let's take a positive linear relationship example: salary increases with age.

Step one: Create a correlation coefficient table.

Make a data chart, including both variables. Label these variables ‘x’- Age
/ Valentina’s response and ‘y’- Salary / Srilakshmi’s response. Add three
additional columns – (xy), (x^2), and (y^2). In the below example N = 4 as we
have 4 pairs.

Step two: Use basic multiplication to complete the table.

Step three: Add up all the values in each column to obtain ∑x, ∑y, ∑xy, ∑x², and ∑y².

Step four: Use the correlation formula to plug in the values.

If the result is negative, there is a negative correlation relationship between the


two variables. If the result is positive, there is a positive correlation relationship
between the variables. Results can also define the strength of a linear relationship
i.e., strong positive relationship, strong negative relationship, medium positive
relationship, and so on.

Refer: Spearman Coefficient as well.

2.9 Data Reduction

Data reduction is a technique used in data mining to reduce the size of a
dataset while still preserving the most important information. This can be
beneficial in situations where the dataset is too large to be processed efficiently,
or where the dataset contains a large amount of irrelevant or redundant
information.

There are several different data reduction techniques that can be used
in data mining, including:

1. Data Sampling: This technique involves selecting a subset of the data to
work with, rather than using the entire dataset. This can be useful for
reducing the size of a dataset while still preserving the overall trends and
patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number
of features in the dataset, either by removing features that are not relevant
or by combining multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as
lossy or lossless compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data
into discrete data by partitioning the range of possible values into intervals
or bins.
5. Feature Selection: This technique involves selecting a subset of features
from the dataset that are most relevant to the task at hand.

It's important to note that data reduction involves a trade-off between the size of
the data and the accuracy of the results. The more the data is reduced, the greater
the risk that the model becomes less accurate and less generalizable.



Methods of data reduction:

1. Data Cube Aggregation:
This technique is used to aggregate data in a simpler form.

2. Dimension Reduction: // Compare this to the Functional Dependency
concept in databases.
Whenever we come across data in which some attributes are only weakly
important, we keep just the attributes required for our analysis. This reduces
data size as it eliminates outdated or redundant features.
Stepwise Forward Selection – Zero to Minimal Set
The selection begins with an empty set of attributes; at each step we add the
best of the remaining original attributes to the set, based on its relevance
(judged, for example, by a p-value in statistics).
Suppose there are the following attributes in the data set in which few
attributes are redundant.
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}



Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
Stepwise Backward Selection – Reducing Max to Min
// Compare this to super key → candidate key reduction in DBMS
This selection starts with the complete set of attributes in the original data and
at each step it eliminates the worst remaining attribute in the set.
Suppose there are the following attributes in the data set in which few
attributes are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}


Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}


Step-2: {X1, X2, X3, X5}
Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}
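
A generic sketch of stepwise forward selection, assuming a user-supplied score() function (for example, cross-validated model accuracy); the toy score below is made up so that only X1, X2 and X5 carry information, reproducing the example above.

def forward_selection(attributes, score, k):
    selected = []
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # At each step, add the single attribute that improves the score most.
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

useful = {"X1": 3.0, "X2": 2.0, "X5": 1.0}          # toy attribute "relevance"
score = lambda subset: sum(useful.get(a, 0.0) for a in subset)
print(forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"], score, k=3))
# ['X1', 'X2', 'X5']

Backward selection works the same way in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score the least.
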

1. Dimensionality Reduction

Dimensionality reduction eliminates weakly important attributes from the data
set under consideration, thereby reducing the volume of the original data. It
reduces data size as it eliminates outdated or redundant features. Here are three
methods of dimensionality reduction.

i. Wavelet Transform: In the wavelet transform, a data vector A is
transformed into a numerically different data vector A' such that both the A
and A' vectors are of the same length. This is useful in reducing data because
the output of the wavelet transform can be truncated: the compressed data is
obtained by retaining only a small fragment of the strongest wavelet
coefficients. The wavelet transform can be applied to data cubes, sparse data,
or skewed data.
ii. Principal Component Analysis: Suppose we have a data set to be
analyzed that has tuples with n attributes. Principal component analysis
searches for k n-dimensional orthogonal vectors (with k ≤ n) that can best be
used to represent the data set. In this way, the original data can be cast onto a
much smaller space, and dimensionality reduction is achieved. Principal
component analysis can be applied to sparse and skewed data.
iii. Attribute Subset Selection: A large data set has many attributes, some of
which are irrelevant to data mining and some of which are redundant.
Attribute subset selection reduces the data volume and dimensionality by
eliminating these redundant and irrelevant attributes, while ensuring that we
still get a good subset of the original attributes: the resulting probability
distribution of the data is as close as possible to the original distribution
obtained using all the attributes.

3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube
Aggregation is a multidimensional aggregation that uses aggregation at
various levels of a data cube to represent the original data set, thus achieving
data reduction.
For example, suppose you have the data of All Electronics sales per quarter
for the year 2018 to the year 2022. If you want to get the annual sale per year,
you just have to aggregate the sales per quarter for each year. In this way,
aggregation provides you with the required data, which is much smaller in
size, and thereby we achieve data reduction even without losing any data.

Data cube aggregation is a multidimensional aggregation that eases
multidimensional analysis. The data cube presents precomputed and
summarized data, which speeds up data mining by providing fast access.



4. Data Compression

Data compression employs modification, encoding, or conversion of the
structure of data in a way that consumes less space. Data compression
involves building a compact representation of information by removing
redundancy and representing data in binary form. Compression from which the
data can be restored successfully in its original form is called lossless
compression; in contrast, compression from which the original form cannot be
restored is lossy compression. Dimensionality and numerosity reduction
methods are also used for data compression.

This technique reduces the size of files using different encoding mechanisms,
such as Huffman encoding and run-length encoding. We can divide compression
into two types based on the technique used.

//illustrate lossy and lossless with decomposition from DBMS

i. Lossless Compression: Encoding techniques (such as run-length encoding)
allow a simple and minimal reduction in data size. Lossless data compression
uses algorithms to restore the precise original data from the compressed
data.
ii. Lossy Compression: In lossy data compression, the decompressed data
may differ from the original data but is useful enough to retrieve
information from it. For example, the JPEG image format uses lossy
compression, yet the result conveys essentially the same meaning as the
original image. Methods such as the discrete wavelet transform and PCA
(principal component analysis) are examples of this type of compression.
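
A minimal run-length encoding sketch illustrates the lossless case: decoding restores exactly the original string.

def rle_encode(text):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        out.append((text[i], j - i))   # (symbol, run length)
        i = j
    return out

def rle_decode(pairs):
    return "".join(symbol * count for symbol, count in pairs)

encoded = rle_encode("AAAABBBCCD")
print(encoded)                              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(encoded) == "AAAABBBCCD")  # True: nothing is lost
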



5. Discretization Operation

The data discretization technique is used to divide attributes of a continuous
nature into data with intervals. We replace the many distinct values of such
attributes with labels of small intervals. This means that mining results are
shown in a concise and easily understandable way.

 Top-down discretization: If you first consider one or a couple of points (so-
called breakpoints or split points) to divide the whole range of attribute values
and repeat this method up to the end, the process is known as top-down
discretization, also known as splitting.

 Bottom-up discretization: If you first consider all the constant values as
split points and then discard some of them by merging neighborhood values
into intervals, the process is called bottom-up discretization (merging).
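
As a small illustration of splitting, pandas can discretize a continuous attribute into equal-width intervals (the age values and bin labels here are made up):

import pandas as pd

ages = pd.Series([3, 12, 19, 25, 31, 44, 52, 67, 70])
bins = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])  # 3 equal-width intervals
print(bins.value_counts())
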

Data reduction in data mining can have several advantages and disadvantages.

Advantages:
1. Improved efficiency: Data reduction can help to improve the efficiency of
machine learning algorithms by reducing the size of the dataset. This can
make it faster and more practical to work with large datasets.
2. Improved performance: Data reduction can help to improve the
performance of machine learning algorithms by removing irrelevant or
redundant information from the dataset. This can help to make the model
more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs
associated with large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the
interpretability of the results by removing irrelevant or redundant
information from the dataset.
Disadvantages:
1. Loss of information: Data reduction can result in a loss of information, if
important data is removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model,
as reducing the size of the dataset can also remove important information
that is needed for accurate predictions.



3. Impact on interpretability: Data reduction can make it harder to interpret
the results, as removing irrelevant or redundant information can also
remove context that is needed to understand the results.
4. Additional computational costs: Data reduction can add additional
computational costs to the data mining process, as it requires additional
processing time to reduce the data.

2.10 Principal Component Analysis


(A Dimensionality Reduction Technique)

As the number of features or dimensions in a dataset increases, the amount
of data required to obtain a statistically significant result increases exponentially.
This can lead to issues such as overfitting, increased computation time, and
reduced accuracy of machine learning models; this is known as the curse of
dimensionality, a set of problems that arise while working with high-dimensional
data. As the number of dimensions increases, the number of possible
combinations of features increases exponentially, which makes it
computationally difficult to obtain a representative sample of the data, and tasks
such as clustering or classification become expensive to perform.
To address the curse of dimensionality, Feature engineering techniques are
used which include feature selection and feature extraction. Dimensionality
reduction is a type of feature extraction technique that aims to reduce the number
of input features while retaining as much of the original information as possible.

In this section, we will discuss one of the most popular dimensionality
reduction techniques, i.e. Principal Component Analysis (PCA).

What is Principal Component Analysis (PCA)?

PCA works on the condition that while data in a higher-dimensional
space is mapped to data in a lower-dimensional space, the variance of the data in
the lower-dimensional space should be maximum.
 Principal Component Analysis (PCA) is a statistical procedure that uses an
orthogonal transformation that converts a set of correlated variables to a set
of uncorrelated variables. PCA is the most widely used tool in exploratory
data analysis and in machine learning for predictive models. Moreover,
 Principal Component Analysis (PCA) is an unsupervised learning
algorithm used to examine the interrelations among a set of variables. It is
also known as a general factor analysis, where regression determines a line
of best fit.
 The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important patterns or
relationships between the variables without any prior knowledge of the
target variables.

 Principal Component Analysis (PCA) is used to reduce the dimensionality


of a data set by finding a new set of variables, smaller than the original set
of variables, retaining most of the sample’s information, and useful for the
regression and classification of data.
 Principal Component Analysis (PCA) is a technique for dimensionality
reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data. The principal
components are linear combinations of the original variables in the dataset
and are ordered in decreasing order of importance. The total variance
captured by all the principal components is equal to the total variance in the
original dataset.
 The first principal component captures the most variation in the data, while the
second principal component captures the maximum variance that is
orthogonal to the first principal component, and so on.
 Principal Component Analysis can be used for a variety of purposes,
including data visualization, feature selection, and data compression. In data
visualization, PCA can be used to plot high-dimensional data in two or three
dimensions, making it easier to interpret. In feature selection, PCA can be
used to identify the most important variables in a dataset. In data
compression, PCA can be used to reduce the size of a dataset without losing
important information.
 In Principal Component Analysis, it is assumed that the information is
carried in the variance of the features, that is, the higher the variation in a
feature, the more information that features carries.
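
A minimal scikit-learn sketch of PCA, projecting hypothetical 4-dimensional samples onto the two principal components that capture the most variance (the features are standardized first, since PCA is variance-based):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4, 0.5, 1.0],
              [0.5, 0.7, 2.1, 3.2],
              [2.2, 2.9, 0.9, 1.1],
              [1.9, 2.2, 1.2, 0.9],
              [3.1, 3.0, 0.3, 1.4],
              [2.3, 2.7, 1.1, 1.0]])

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                 # (6, 2): same samples, fewer dimensions
print(pca.explained_variance_ratio_)   # share of total variance per component
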
